dalz's blog

aboutRSS feed icon

Transient trees in C
2021-09-03

You may have heard about persistent data structures, that is immutable structures where every update returns a new version of the structure instead of modifying it in place.

I’ve been interested in persistent structures lately since I’ve been toying with the idea of building a text editor, and they have a couple of very desirable properties:

The downside is that persistent operations require allocating and copying; a smart implementation can mitigate the performance impact by sharing as many parts as possible between different versions, but some overhead is inevitable. For this reason sometimes it may be useful to forgo immutability. In Clojure, this is achieved by using transients.

In pseudocode, usage looks like this:

let n = 1000000
let v = new persistent vector


// persisent

for i from 0 to n
    v = v.push(i)


// transient

t = v.transient()
for i from 0 to n
    t.push(i)
v = t.persistent()

what happens is that the first loop will create (light) copies of v at each iteration, while the second may modify the vector directly1 - in hot paths this may end up saving a lot of CPU time. Naturally, transients don’t have the properties I mentioned above (in particular they are not thread-safe), so you usually want to come back to a persistent version as soon as possible.

Making transience the default

Clojure’s approach is nice in a functional context, but I thought that turning it inside out could make it fit better with an imperative language.

The idea is this: we treat the structure as mutable by default, but with the option of taking a persistent snapshot; updates on said snapshots have a copy-on-write semantics. It’s worth noting that mutable structures may have some immutable subparts, but this is handled transparently by the update procedures.

Let’s see an example of how this could be implemented in practice in C.

A transient binary search tree

This is the definition of our tree:

struct tree {
    atomic_int rc;
    bool is_const;

    int key;
    char const *value;

    struct tree *left, *right;
};

2

As you can see it’s a BST augmented with two fields: a reference count3 and a boolean flag. is_const is handled by following these rules:

Here are some simple functions to create, increase the reference count, and free tree nodes:

struct tree *
tree_new(int k, char const *v, struct tree *l, struct tree *r)
{
    struct tree *t = malloc(sizeof(*t));
    if (!t) return 0;

    *t = (struct tree){
        .rc = 1,
        .key = k,
        .value = v,
        .left = l,
        .right = r
    };

    return t;
}


void
tree_hold(struct tree *t)
{
    if (atomic_fetch_add(&t->rc, 1) == 0)
        exit(1);
}

5

void
tree_free(struct tree *t)
{
    if (atomic_fetch_sub(&t->rc, 1) != 1)
        return;

    if (t->left)
        tree_free(t->left);
    if (t->right)
        tree_free(t->right);

    free(t);
}

There’s nothing interesting going on here. Nodes are created with is_const set to false and rc to 1.

Getting a value is no different than what we’d have in a mutable BST:

char const *
tree_get(struct tree *t, int k)
{
    while (t) {
        if (t->key == k)
            return t->value;

        t = k > t->key ? t->right : t->left;
    }

    return 0;
}

The meat of the problem is, of course, insertion. We’ll need an helper function that creates a mutable shallow copy of a node:

struct tree *
tree_clone(struct tree *t)
{
    struct tree *res = malloc(sizeof(*res));
    if (!res) return 0;

    memcpy(res, t, sizeof(*t));
    res->rc = 1;
    res->is_const = false;

    if (res->left) tree_hold(res->left);
    if (res->right) tree_hold(res->right);

    if (t->is_const) {
        if (t->left)
            t->left->is_const = true;
        if (t->right)
            t->right->is_const = true;
    }

    return res;
}

There’s a couple of things to note here:

Now we’re ready to define tree_put. This is how it works:

  1. descend down the tree, selecting the appropriate child as usual in a BST;

  2. if the current node t is immutable, create a mutable copy with tree_clone and use it to replace t in t’s parent (we’re deleting a reference to t, so call tree_free on it);

  3. if a node is found that matches the provided key, write the desired value on that node and return;

  4. otherwise, when a leaf is reached add a new node with the provided key-value pair as a child.

You may have noticed that these steps require modifying some nodes. However, step 2 means that we maintain the following invariant:

At the start of each iteration, t’s parent is mutable.

This guarantees that step 2 and 4 are valid. Additionally, step 3 is executed after step 2, so t itself is mutable and changing its value field is allowed.

tree_put’s return value is the provided tree if it is mutable, its mutable copy if it isn’t.

Here’s the code:

struct tree *
tree_put(struct tree *t, int k, char const *v)
{
    struct tree *root = t;
    struct tree **p = &root;

    while (t) {
        if (t->is_const) {
            t = tree_clone(t);
            if (!t) return 0;

            tree_free(*p);
            *p = t;
        }

        if (t->key == k) {
            t->value = v;
            return root;
        }

        p = k > t->key ? &t->right : &t->left;
        t = *p;
    }

    *p = tree_new(k, v, 0, 0);
    if (!*p) return 0;

    return root;
}

There’s only one piece missing: a way to make a tree persistent. This is very easy thanks to the lazy propagation of is_const:

void
tree_persist(struct tree *t)
{
    t->is_const = true;
    tree_hold(t);
}

Using the tree

Let’s see an example of how we might use our transient BST6:

struct tree *t = tree_new(0, "root", 0, 0);

t = tree_put(t, -10, "foo");
t = tree_put(t, -5, "bar");
t = tree_put(t, 10, "baz");
t = tree_put(t, 5, "fizz");
t = tree_put(t, 15, "buzz");
t = tree_put(t, 7, "pippo");

This creates the following tree:

Initial tree

At the moment the tree is still completely mutable, so the value returned by tree_put is t itself:

struct tree *u

u = tree_put(t, 0, "tree");
assert(t == v);

We can take a persistent snaphost:

tree_persist(t);
u = tree_put(t, 6, "pluto");

Now we have the following:

t
u

But what’s intersting is that t and u share some structure (left is t, right is u):

Structural sharing after one edit

If we keep modifying u, they will gradually diverge:

u = tree_put(u, -5, "car");
Structural sharing after another edit

u’s nodes that also belong to t are immutable, the rest is mutable.

Remember to clean up after yourself:

tree_free(t);
tree_free(u);

Conclusions

First of all, you can find the full code on Codeberg: dalz/transient-tree.

I haven’t shown it here, but after calling tree_persist on a tree, that tree becomes safe to pass between threads.

Modifying a BST is pretty simple, but things get more hairy with complex data structures that have multiple, more involved update procedures (in contrast with a single tree_put). Tracking immutability and reference counts in your head starts to be annoying in that case. This could be solved with a stronger type system that provides better static guarantees7; but this is C, so just be smarter?



Comments? Send them to blog@alsd.eu!
Back to the article list.