Transient trees in C
2021-09-03
You may have heard about persistent data structures, that is immutable structures where every update returns a new version of the structure instead of modifying it in place.
I’ve been interested in persistent structures lately since I’ve been toying with the idea of building a text editor, and they have a couple of very desirable properties:
implementing undo is trivial, since you can just store a list of previous versions of the structure;
they are safe to access from multiple threads, so syntax highlighting on large files can be offloaded to another thread while the buffer remains editable.
The downside is that persistent operations require allocating and copying; a smart implementation can mitigate the performance impact by sharing as many parts as possible between different versions, but some overhead is inevitable. For this reason sometimes it may be useful to forgo immutability. In Clojure, this is achieved by using transients.
In pseudocode, usage looks like this:
let n = 1000000
let v = new persistent vector
// persisent
for i from 0 to n
v = v.push(i)
// transient
t = v.transient()
for i from 0 to n
t.push(i)
v = t.persistent()
what happens is that the first loop will create (light) copies of v
at each iteration, while the second may modify the vector directly1 - in hot paths this may end up saving a lot of CPU time. Naturally, transients don’t have the properties I mentioned above (in particular they are not thread-safe), so you usually want to come back to a persistent version as soon as possible.
Making transience the default
Clojure’s approach is nice in a functional context, but I thought that turning it inside out could make it fit better with an imperative language.
The idea is this: we treat the structure as mutable by default, but with the option of taking a persistent snapshot; updates on said snapshots have a copy-on-write semantics. It’s worth noting that mutable structures may have some immutable subparts, but this is handled transparently by the update procedures.
Let’s see an example of how this could be implemented in practice in C.
A transient binary search tree
This is the definition of our tree:
struct tree {
;
atomic_int rcbool is_const;
int key;
char const *value;
struct tree *left, *right;
};
As you can see it’s a BST augmented with two fields: a reference count3 and a boolean flag. is_const
is handled by following these rules:
if
is_const
is true, the node is immutable. Changing its fields other thanrc
is forbidden, so a mutable copy will be made if updating is required;is_const
is propagated lazily, so that making a tree persistent is just a matter of settingis_const
to true on the root. When we descend down the tree to add a leaf or edit a value, if we encounter a node coming from an immutable parent, we’ll set that node’s flag (seetree_put
later). Additionally, if a node is a shared child of multiple parents and at least one of them is immutable, then it too needs to be immutable4 (this will be relevant in the implementation oftree_clone
).
Here are some simple functions to create, increase the reference count, and free tree nodes:
struct tree *
(int k, char const *v, struct tree *l, struct tree *r)
tree_new{
struct tree *t = malloc(sizeof(*t));
if (!t) return 0;
*t = (struct tree){
.rc = 1,
.key = k,
.value = v,
.left = l,
.right = r
};
return t;
}
void
(struct tree *t)
tree_hold{
if (atomic_fetch_add(&t->rc, 1) == 0)
(1);
exit}
void
(struct tree *t)
tree_free{
if (atomic_fetch_sub(&t->rc, 1) != 1)
return;
if (t->left)
(t->left);
tree_freeif (t->right)
(t->right);
tree_free
(t);
free}
There’s nothing interesting going on here. Nodes are created with is_const
set to false and rc
to 1.
Getting a value is no different than what we’d have in a mutable BST:
char const *
(struct tree *t, int k)
tree_get{
while (t) {
if (t->key == k)
return t->value;
= k > t->key ? t->right : t->left;
t }
return 0;
}
The meat of the problem is, of course, insertion. We’ll need an helper function that creates a mutable shallow copy of a node:
struct tree *
(struct tree *t)
tree_clone{
struct tree *res = malloc(sizeof(*res));
if (!res) return 0;
(res, t, sizeof(*t));
memcpy->rc = 1;
res->is_const = false;
res
if (res->left) tree_hold(res->left);
if (res->right) tree_hold(res->right);
if (t->is_const) {
if (t->left)
->left->is_const = true;
tif (t->right)
->right->is_const = true;
t}
return res;
}
There’s a couple of things to note here:
by copying
t
we’re creating new references to its children, so we need to increase their reference counts;as mentioned before, if the original node was immutable then
is_const
must be propagated to its children (again, ⁴).
Now we’re ready to define tree_put
. This is how it works:
descend down the tree, selecting the appropriate child as usual in a BST;
if the current node
t
is immutable, create a mutable copy withtree_clone
and use it to replacet
int
’s parent (we’re deleting a reference tot
, so calltree_free
on it);if a node is found that matches the provided key, write the desired value on that node and return;
otherwise, when a leaf is reached add a new node with the provided key-value pair as a child.
You may have noticed that these steps require modifying some nodes. However, step 2 means that we maintain the following invariant:
At the start of each iteration,
t
’s parent is mutable.
This guarantees that step 2 and 4 are valid. Additionally, step 3 is executed after step 2, so t
itself is mutable and changing its value
field is allowed.
tree_put
’s return value is the provided tree if it is mutable, its mutable copy if it isn’t.
Here’s the code:
struct tree *
(struct tree *t, int k, char const *v)
tree_put{
struct tree *root = t;
struct tree **p = &root;
while (t) {
if (t->is_const) {
= tree_clone(t);
t if (!t) return 0;
(*p);
tree_free*p = t;
}
if (t->key == k) {
->value = v;
treturn root;
}
= k > t->key ? &t->right : &t->left;
p = *p;
t }
*p = tree_new(k, v, 0, 0);
if (!*p) return 0;
return root;
}
There’s only one piece missing: a way to make a tree persistent. This is very easy thanks to the lazy propagation of is_const
:
void
(struct tree *t)
tree_persist{
->is_const = true;
t(t);
tree_hold}
Using the tree
Let’s see an example of how we might use our transient BST6:
struct tree *t = tree_new(0, "root", 0, 0);
= tree_put(t, -10, "foo");
t = tree_put(t, -5, "bar");
t = tree_put(t, 10, "baz");
t = tree_put(t, 5, "fizz");
t = tree_put(t, 15, "buzz");
t = tree_put(t, 7, "pippo"); t
This creates the following tree:
At the moment the tree is still completely mutable, so the value returned by tree_put
is t
itself:
struct tree *u
= tree_put(t, 0, "tree");
u (t == v); assert
We can take a persistent snaphost:
(t);
tree_persist= tree_put(t, 6, "pluto"); u
Now we have the following:
But what’s intersting is that t
and u
share some structure (left is t
, right is u
):
If we keep modifying u
, they will gradually diverge:
= tree_put(u, -5, "car"); u
u
’s nodes that also belong to t
are immutable, the rest is mutable.
Remember to clean up after yourself:
(t);
tree_free(u); tree_free
Conclusions
First of all, you can find the full code on Codeberg: dalz/transient-tree.
I haven’t shown it here, but after calling tree_persist
on a tree, that tree becomes safe to pass between threads.
Modifying a BST is pretty simple, but things get more hairy with complex data structures that have multiple, more involved update procedures (in contrast with a single tree_put
). Tracking immutability and reference counts in your head starts to be annoying in that case. This could be solved with a stronger type system that provides better static guarantees7; but this is C, so just be smarter?
Just to be clear, this does not mean that modifying
t
will changev
: as you can see we need to assignt
’s value tov
after the loop.↩︎As the
value
field is a pointer, other parts of the code could modify the value it points to without calling our procedures. This would mean that changes to the value would not respect persistence. You can remove theconst
qualifier if this is fine for your use case.↩︎Not strictly necessary; you can replace it with a garbage collector, or even ignore memory management if you know you’ll keep every snapshot for the whole execution of the program.↩︎
In case it’s not clear why, consider the following example: node C is child of mutable node A and immutable node B. If we update C coming from B we’ll see it as immutable (since we need to propagate
is_const
), but if we did the same from A we’d treat it as mutable. Obviously C can’t be both mutable and immutable at the same time.↩︎If at the time
tree_hold
is called the reference count is 0, the tree is being freed and holding it is not possible. A serious program could try to handle this gracefully, but here we’ll just exit.↩︎I’m not checking the return value of
tree_new
andtree_put
for null, but you should.↩︎We could for instance define different types for mutable and immutable nodes.↩︎
Comments? Send them to
blog@alsd.eu!
Back to the article list.