Transient trees in C
2021-09-03

You may have heard about persistent data structures, that is immutable structures where every update returns a new version of the structure instead of modifying it in place.

I’ve been interested in persistent structures lately since I’ve been toying with the idea of building a text editor, and they have a couple of very desirable properties:

implementing undo is trivial, since you can just store a list of previous versions of the structure;
they are safe to access from multiple threads, so syntax highlighting on large files can be offloaded to another thread while the buffer remains editable.

The downside is that persistent operations require allocating and copying; a smart implementation can mitigate the performance impact by sharing as many parts as possible between different versions, but some overhead is inevitable. For this reason sometimes it may be useful to forgo immutability. In Clojure, this is achieved by using transients.

In pseudocode, usage looks like this:

let n = 1000000
let v = new persistent vector


// persisent

for i from 0 to n
    v = v.push(i)


// transient

t = v.transient()
for i from 0 to n
    t.push(i)
v = t.persistent()

what happens is that the first loop will create (light) copies of v at each iteration, while the second may modify the vector directly¹ - in hot paths this may end up saving a lot of CPU time. Naturally, transients don’t have the properties I mentioned above (in particular they are not thread-safe), so you usually want to come back to a persistent version as soon as possible.

Making transience the default

Clojure’s approach is nice in a functional context, but I thought that turning it inside out could make it fit better with an imperative language.

The idea is this: we treat the structure as mutable by default, but with the option of taking a persistent snapshot; updates on said snapshots have a copy-on-write semantics. It’s worth noting that mutable structures may have some immutable subparts, but this is handled transparently by the update procedures.

Let’s see an example of how this could be implemented in practice in C.

A transient binary search tree

This is the definition of our tree:

struct tree {
    atomic_int rc;
    bool is_const;

    int key;
    char const *value;

    struct tree *left, *right;
};

As you can see it’s a BST augmented with two fields: a reference count³ and a boolean flag. is_const is handled by following these rules:

if is_const is true, the node is immutable. Changing its fields other than rc is forbidden, so a mutable copy will be made if updating is required;
is_const is propagated lazily, so that making a tree persistent is just a matter of setting is_const to true on the root. When we descend down the tree to add a leaf or edit a value, if we encounter a node coming from an immutable parent, we’ll set that node’s flag (see tree_put later). Additionally, if a node is a shared child of multiple parents and at least one of them is immutable, then it too needs to be immutable⁴ (this will be relevant in the implementation of tree_clone).

Here are some simple functions to create, increase the reference count, and free tree nodes:

struct tree *
tree_new(int k, char const *v, struct tree *l, struct tree *r)
{
    struct tree *t = malloc(sizeof(*t));
    if (!t) return 0;

    *t = (struct tree){
        .rc = 1,
        .key = k,
        .value = v,
        .left = l,
        .right = r
    };

    return t;
}

void
tree_hold(struct tree *t)
{
    if (atomic_fetch_add(&t->rc, 1) == 0)
        exit(1);
}

⁵

void
tree_free(struct tree *t)
{
    if (atomic_fetch_sub(&t->rc, 1) != 1)
        return;

    if (t->left)
        tree_free(t->left);
    if (t->right)
        tree_free(t->right);

    free(t);
}

There’s nothing interesting going on here. Nodes are created with is_const set to false and rc to 1.

Getting a value is no different than what we’d have in a mutable BST:

char const *
tree_get(struct tree *t, int k)
{
    while (t) {
        if (t->key == k)
            return t->value;

        t = k > t->key ? t->right : t->left;
    }

    return 0;
}

The meat of the problem is, of course, insertion. We’ll need an helper function that creates a mutable shallow copy of a node:

struct tree *
tree_clone(struct tree *t)
{
    struct tree *res = malloc(sizeof(*res));
    if (!res) return 0;

    memcpy(res, t, sizeof(*t));
    res->rc = 1;
    res->is_const = false;

    if (res->left) tree_hold(res->left);
    if (res->right) tree_hold(res->right);

    if (t->is_const) {
        if (t->left)
            t->left->is_const = true;
        if (t->right)
            t->right->is_const = true;
    }

    return res;
}

There’s a couple of things to note here:

by copying t we’re creating new references to its children, so we need to increase their reference counts;
as mentioned before, if the original node was immutable then is_const must be propagated to its children (again, ⁴).

Now we’re ready to define tree_put. This is how it works:

descend down the tree, selecting the appropriate child as usual in a BST;
if the current node t is immutable, create a mutable copy with tree_clone and use it to replace t in t’s parent (we’re deleting a reference to t, so call tree_free on it);
if a node is found that matches the provided key, write the desired value on that node and return;
otherwise, when a leaf is reached add a new node with the provided key-value pair as a child.

You may have noticed that these steps require modifying some nodes. However, step 2 means that we maintain the following invariant:

At the start of each iteration, t’s parent is mutable.

This guarantees that step 2 and 4 are valid. Additionally, step 3 is executed after step 2, so t itself is mutable and changing its value field is allowed.

tree_put’s return value is the provided tree if it is mutable, its mutable copy if it isn’t.

Here’s the code:

struct tree *
tree_put(struct tree *t, int k, char const *v)
{
    struct tree *root = t;
    struct tree **p = &root;

    while (t) {
        if (t->is_const) {
            t = tree_clone(t);
            if (!t) return 0;

            tree_free(*p);
            *p = t;
        }

        if (t->key == k) {
            t->value = v;
            return root;
        }

        p = k > t->key ? &t->right : &t->left;
        t = *p;
    }

    *p = tree_new(k, v, 0, 0);
    if (!*p) return 0;

    return root;
}

There’s only one piece missing: a way to make a tree persistent. This is very easy thanks to the lazy propagation of is_const:

void
tree_persist(struct tree *t)
{
    t->is_const = true;
    tree_hold(t);
}

Using the tree

Let’s see an example of how we might use our transient BST⁶:

struct tree *t = tree_new(0, "root", 0, 0);

t = tree_put(t, -10, "foo");
t = tree_put(t, -5, "bar");
t = tree_put(t, 10, "baz");
t = tree_put(t, 5, "fizz");
t = tree_put(t, 15, "buzz");
t = tree_put(t, 7, "pippo");

This creates the following tree:

At the moment the tree is still completely mutable, so the value returned by tree_put is t itself:

struct tree *u

u = tree_put(t, 0, "tree");
assert(t == v);

We can take a persistent snaphost:

tree_persist(t);
u = tree_put(t, 6, "pluto");

Now we have the following:

But what’s intersting is that t and u share some structure (left is t, right is u):

If we keep modifying u, they will gradually diverge:

u = tree_put(u, -5, "car");

u’s nodes that also belong to t are immutable, the rest is mutable.

Remember to clean up after yourself:

tree_free(t);
tree_free(u);

Conclusions

First of all, you can find the full code on Codeberg: dalz/transient-tree.

I haven’t shown it here, but after calling tree_persist on a tree, that tree becomes safe to pass between threads.

Modifying a BST is pretty simple, but things get more hairy with complex data structures that have multiple, more involved update procedures (in contrast with a single tree_put). Tracking immutability and reference counts in your head starts to be annoying in that case. This could be solved with a stronger type system that provides better static guarantees⁷; but this is C, so just be smarter?

Just to be clear, this does not mean that modifying t will change v: as you can see we need to assign t’s value to v after the loop.↩︎
As the value field is a pointer, other parts of the code could modify the value it points to without calling our procedures. This would mean that changes to the value would not respect persistence. You can remove the const qualifier if this is fine for your use case.↩︎
Not strictly necessary; you can replace it with a garbage collector, or even ignore memory management if you know you’ll keep every snapshot for the whole execution of the program.↩︎
In case it’s not clear why, consider the following example: node C is child of mutable node A and immutable node B. If we update C coming from B we’ll see it as immutable (since we need to propagate is_const), but if we did the same from A we’d treat it as mutable. Obviously C can’t be both mutable and immutable at the same time.↩︎
If at the time tree_hold is called the reference count is 0, the tree is being freed and holding it is not possible. A serious program could try to handle this gracefully, but here we’ll just exit.↩︎
I’m not checking the return value of tree_new and tree_put for null, but you should.↩︎
We could for instance define different types for mutable and immutable nodes.↩︎

Comments? Send them to blog@alsd.eu!
Back to the article list.

Transient trees in C2021-09-03

Making transience the default

A transient binary search tree

Using the tree

Conclusions

Transient trees in C
2021-09-03