Transient trees in C
2021-09-03
You may have heard about persistent data structures, that is immutable structures where every update returns a new version of the structure instead of modifying it in place.
I’ve been interested in persistent structures lately since I’ve been toying with the idea of building a text editor, and they have a couple of very desirable properties:
implementing undo is trivial, since you can just store a list of previous versions of the structure;
they are safe to access from multiple threads, so syntax highlighting on large files can be offloaded to another thread while the buffer remains editable.
The downside is that persistent operations require allocating and copying; a smart implementation can mitigate the performance impact by sharing as many parts as possible between different versions, but some overhead is inevitable. For this reason sometimes it may be useful to forgo immutability. In Clojure, this is achieved by using transients.
In pseudocode, usage looks like this:
let n = 1000000
let v = new persistent vector
// persisent
for i from 0 to n
v = v.push(i)
// transient
t = v.transient()
for i from 0 to n
t.push(i)
v = t.persistent()
what happens is that the first loop will create (light) copies of
v
at each iteration, while the second may modify the vector
directly1 - in hot paths this may end up
saving a lot of CPU time. Naturally, transients don’t have the
properties I mentioned above (in particular they are not thread-safe),
so you usually want to come back to a persistent version as soon as
possible.
Making transience the default
Clojure’s approach is nice in a functional context, but I thought that turning it inside out could make it fit better with an imperative language.
The idea is this: we treat the structure as mutable by default, but with the option of taking a persistent snapshot; updates on said snapshots have a copy-on-write semantics. It’s worth noting that mutable structures may have some immutable subparts, but this is handled transparently by the update procedures.
Let’s see an example of how this could be implemented in practice in C.
A transient binary search tree
This is the definition of our tree:
struct tree {
;
atomic_int rcbool is_const;
int key;
char const *value;
struct tree *left, *right;
};
As you can see it’s a BST augmented with two fields: a reference
count3 and a boolean flag.
is_const
is handled by following these rules:
if
is_const
is true, the node is immutable. Changing its fields other thanrc
is forbidden, so a mutable copy will be made if updating is required;is_const
is propagated lazily, so that making a tree persistent is just a matter of settingis_const
to true on the root. When we descend down the tree to add a leaf or edit a value, if we encounter a node coming from an immutable parent, we’ll set that node’s flag (seetree_put
later). Additionally, if a node is a shared child of multiple parents and at least one of them is immutable, then it too needs to be immutable4 (this will be relevant in the implementation oftree_clone
).
Here are some simple functions to create, increase the reference count, and free tree nodes:
struct tree *
(int k, char const *v, struct tree *l, struct tree *r)
tree_new{
struct tree *t = malloc(sizeof(*t));
if (!t) return 0;
*t = (struct tree){
.rc = 1,
.key = k,
.value = v,
.left = l,
.right = r
};
return t;
}
void
(struct tree *t)
tree_hold{
if (atomic_fetch_add(&t->rc, 1) == 0)
(1);
exit}
void
(struct tree *t)
tree_free{
if (atomic_fetch_sub(&t->rc, 1) != 1)
return;
if (t->left)
(t->left);
tree_freeif (t->right)
(t->right);
tree_free
(t);
free}
There’s nothing interesting going on here. Nodes are created with
is_const
set to false and rc
to 1.
Getting a value is no different than what we’d have in a mutable BST:
char const *
(struct tree *t, int k)
tree_get{
while (t) {
if (t->key == k)
return t->value;
= k > t->key ? t->right : t->left;
t }
return 0;
}
The meat of the problem is, of course, insertion. We’ll need an helper function that creates a mutable shallow copy of a node:
struct tree *
(struct tree *t)
tree_clone{
struct tree *res = malloc(sizeof(*res));
if (!res) return 0;
(res, t, sizeof(*t));
memcpy->rc = 1;
res->is_const = false;
res
if (res->left) tree_hold(res->left);
if (res->right) tree_hold(res->right);
if (t->is_const) {
if (t->left)
->left->is_const = true;
tif (t->right)
->right->is_const = true;
t}
return res;
}
There’s a couple of things to note here:
by copying
t
we’re creating new references to its children, so we need to increase their reference counts;as mentioned before, if the original node was immutable then
is_const
must be propagated to its children (again, ⁴).
Now we’re ready to define tree_put
. This is how it
works:
descend down the tree, selecting the appropriate child as usual in a BST;
if the current node
t
is immutable, create a mutable copy withtree_clone
and use it to replacet
int
’s parent (we’re deleting a reference tot
, so calltree_free
on it);if a node is found that matches the provided key, write the desired value on that node and return;
otherwise, when a leaf is reached add a new node with the provided key-value pair as a child.
You may have noticed that these steps require modifying some nodes. However, step 2 means that we maintain the following invariant:
At the start of each iteration,
t
’s parent is mutable.
This guarantees that step 2 and 4 are valid. Additionally, step 3 is
executed after step 2, so t
itself is mutable and changing
its value
field is allowed.
tree_put
’s return value is the provided tree if it is
mutable, its mutable copy if it isn’t.
Here’s the code:
struct tree *
(struct tree *t, int k, char const *v)
tree_put{
struct tree *root = t;
struct tree **p = &root;
while (t) {
if (t->is_const) {
= tree_clone(t);
t if (!t) return 0;
(*p);
tree_free*p = t;
}
if (t->key == k) {
->value = v;
treturn root;
}
= k > t->key ? &t->right : &t->left;
p = *p;
t }
*p = tree_new(k, v, 0, 0);
if (!*p) return 0;
return root;
}
There’s only one piece missing: a way to make a tree persistent. This
is very easy thanks to the lazy propagation of
is_const
:
void
(struct tree *t)
tree_persist{
->is_const = true;
t(t);
tree_hold}
Using the tree
Let’s see an example of how we might use our transient BST6:
struct tree *t = tree_new(0, "root", 0, 0);
= tree_put(t, -10, "foo");
t = tree_put(t, -5, "bar");
t = tree_put(t, 10, "baz");
t = tree_put(t, 5, "fizz");
t = tree_put(t, 15, "buzz");
t = tree_put(t, 7, "pippo"); t
This creates the following tree:
At the moment the tree is still completely mutable, so the value
returned by tree_put
is t
itself:
struct tree *u
= tree_put(t, 0, "tree");
u (t == v); assert
We can take a persistent snaphost:
(t);
tree_persist= tree_put(t, 6, "pluto"); u
Now we have the following:
But what’s intersting is that t
and u
share
some structure (left is t
, right is u
):
If we keep modifying u
, they will gradually diverge:
= tree_put(u, -5, "car"); u
u
’s nodes that also belong to t
are
immutable, the rest is mutable.
Remember to clean up after yourself:
(t);
tree_free(u); tree_free
Conclusions
First of all, you can find the full code on Codeberg: dalz/transient-tree.
I haven’t shown it here, but after calling tree_persist
on a tree, that tree becomes safe to pass between threads.
Modifying a BST is pretty simple, but things get more hairy with
complex data structures that have multiple, more involved update
procedures (in contrast with a single tree_put
). Tracking
immutability and reference counts in your head starts to be annoying in
that case. This could be solved with a stronger type system that
provides better static guarantees7; but this is C, so just
be smarter?
Comments? Send them to
blog@alsd.eu!
Back to the article list.