Baardskeerder’s transaction strategy

Baardskeerder is a simple embedded database based around an append-only B-treeish datastructure.
It’s a dictionary that also supports range queries and transactions.
Baardskeerder is implemented in ocaml, and the main idea is that it will replace Tokyo Cabinet in our Arakoon Key-Value Store.
This post will try to will explain our approach regarding transactions.

Preliminaries

Baardskeerder appends slabs to a log. Slabs are list of entries, and entries can be
either values, leaves, indexes, or commits. Suppose we start with an empty log, and add a first update (set “f” “F”).
Baardskeerder builds a slab that looks like this:

Value "F"
Leaf ["f", Inner 0]
Commit (Inner 1)

This slab gets serialized, and added to a log, which looks like this:

Value "F"
Leaf ["f", Outer 0]
Commit (Outer 1)

Notice there are Outer and Inner references. Outer ones refer to positions in the Log, while Inner ones refer to the current slab. During serialization, each inner positions needs to be translated into an outer one, while outer positions just can be copied. (In reality, Outer positions are a bit more complex to calculate as the size of entries after serialization influences them, but I’ve abstracted away from that problem here)

When a slab is ready to be serialized and added, it always ends with a Commit node. This commit node is necessary as there might be a calamity while writing out the serialized slab and it might not end up entirely in the log. So after a successful update, the last entry in the log is a Commit pointing to the current root node.
To fully appreciate the difference between slab and log references,
let’s add another update (set “d” “D”).
The new slab looks like this:

Value "D"
Leaf ["d", Inner 0; "f", Outer 0]
Commit (Inner 1)

The new leaf in the slab refers both to the value at position 0 in the log (Value “F”) and to the value at position 0 in the slab (Value “D”).
After serialization of the slab, and writing it to the log, the log looks like this:

Value "F"
Leaf ["f", Outer 0]
Commit (Outer 1)
Value "D"
Leaf ["d", Outer 3; "f", Outer 0]
Commit (Outer 4)

Let’s add some more updates to have a more interesting log:

 set "h" "H";
 set "a" "A";
 set "z" "Z"
 

After which, the log looks like this:

Value "F"
Leaf ["f", Outer 0]
Commit (Outer 1)
Value "D"
Leaf ["d", Outer 3; "f", Outer 0]
Commit (Outer 4)
Value "H"
Leaf ["d", Outer 3; "f", Outer 0; "h", Outer 6]
Commit (Outer 7)
Value "A"
Leaf ["a", Outer 9; "d", Outer 3]
Leaf ["f", Outer 0; "h", Outer 6]
Index Outer 10, ["d", Outer 11]
Commit (Outer 12)
Value "Z"
Leaf ["f", Outer 0; "h", Outer 6; "z", Outer 14]
Index Outer 10, ["d", Outer 15]
Commit (Outer 16)

This corresponds to the following tree:
tree after 5 sets

There are several little things to notice here. First, leaves that overflow (size = 4) are split into to new leaves with an Index entry to refer to them both. Also, the entries in the log only refer to previous entries, and, most importantly, nothing ever changes inside the log.

Transactions

Suppose by some accident the process dies while writing the slab, the last slab doesn’t quite fully make it to the log. In the above example, the log might end up just after the value “Z” (or a bit further with half a leaf written).
At the next opening of the log, we’ll search for the last commit, and the log gets truncated just after position 13. As such, the presence of Commit entries that make updates atomic.

They also provide a means to implement non-concurrent transactions. In fact, if we only have a commit entry at the end of a slab, the slab literally is the transaction. At the beginning of the transaction, we create a slab, and we pass it along for all updates. A commit entry is written at the end of the slab, when we’re done. The only added complexity comes from descending the tree:
We must start at the root in the slab (or log if it’s empty) and if we have to jump to an entry, we have to look at its position and jump into the slab if it’s an inner position, and into the log if it’s an outer one. (How to descend a tree is described on more detail a previous blog post)

Our transactional API is simple and clean:

 module DBX :
  sig
    type tx
    val get : tx -> k -> v
    val set : tx -> k -> v -> unit
    val delete : tx -> k -> unit

    ...

    val with_tx : log -> (tx -> unit) -> unit
  end

The following code fragment illustrates our transaction api:

let xs = ["f","F";
	  "d","D";
	  "h","H";
	  "a","A";
	  "z","Z";
	 ]
in
DBX.with_tx log
  (fun tx ->
    List.iter (fun (k,v) ->
      DBX.set tx k v) xs
  )

and suppose we apply this transaction to an empty log, we then end up with the following:

Value "F"
Leaf ["f", Outer 0]
Value "D"
Leaf ["d", Outer 2; "f", Outer 0]
Value "H"
Leaf ["d", Outer 2; "f", Outer 0; "h", Outer 4]
Value "A"
Leaf ["a", Outer 6; "d", Outer 2]
Leaf ["f", Outer 0; "h", Outer 4]
Index Outer 7, ["d", Outer 8]
Value "Z"
Leaf ["f", Outer 0; "h", Outer 4; "z", Outer 10]
Index Outer 7, ["d", Outer 11]
Commit (Outer 12)

Performance Summary: each transaction boils down to 1 slab, and each slab is written using 1 system call (write).

Slab Compaction

Batching updates in transactions uses less space compared to individual updates, but we still have a lot of dead entries. Take for example the log above. Entries 1,3, 5,8, and 9 are not useful in any way. So it would be desirable to prune them from the slab before they even hit the log.
This turns out to be surprisingly easy.

First we have to find the dead entries, which is done with a simple iteration through the slab, starting from the root, and marking every Inner position as known. When you’re done,
the positions that you didn’t mark are garbage.

type t = { mutable es: entry array; mutable nes: int}

let mark slab =
  let r = Array.make (slab.nes) false in
  let maybe_mark = function
    | Outer _ -> ()
    | Inner x -> r.(x) <- true
  in
  let maybe_mark2 (_,p) = maybe_mark p in
  let mark _ e =
    match e with
      | NIL | Value _ -> ()
      | Commit p -> maybe_mark p
      | Leaf l -> List.iter maybe_mark2 l
      | Index (p0,kps) -> let () = maybe_mark p0 in List.iter maybe_mark2 kps
  in
  let () = iteri_rev slab mark in
  let () = r.(slab.nes -1) <- true in
  r

After we have marked the dead entries, we iterate through the slab again, from position 0 towards the end, and build a mapping that tells you how to translate an old Inner position to a new Inner position.

let mapping mark =
  let s = Array.length mark in
  let h = Hashtbl.create s in
  let rec loop i o =
    if i = s then h
    else
      let v = mark.(i) in
      let i' = i + 1 in
      let () = Hashtbl.add h i o in
      let o' = if v then o + 1 else o in
      loop i' o'
  in
  loop 0 0

The last phase is the actual rewriting of the slab, without the dead entries. This again is a simple iteration.

let rewrite s s_mark s_map =
  let lookup_pos = function
    | Outer x -> Outer x
    | Inner x -> Inner (Hashtbl.find s_map x)
  in
  let rewrite_leaf kps       = List.map (fun (k,p) -> (k,lookup_pos p)) kps in
  let rewrite_index (p0,kps) = (lookup_pos p0 , rewrite_leaf kps) in
  let rewrite_commit p       = lookup_pos p in
  let esa = s.es in
  let size = s.nes in
  let r = Array.create size NIL in
  let rec loop c i =
    if i = size
    then { es = r; nes = c}
    else
      begin
	let i' = i + 1 in
	let a = s_mark.(i) in
	if a then
	  let e = esa.(i) in
	  let e' = match e with
	    | Leaf  l  -> Leaf (rewrite_leaf l)
	    | Index i  -> Index (rewrite_index i)
	    | Commit p -> Commit (rewrite_commit p)
	    | Value _
	    | NIL -> e
	  in
	  let () = r.(c) <- e' in
	  let c' = c + 1 in
	  loop c' i'
	else
	  loop c i'
      end
  in
  loop 0 0

Gluing it all together:

let compact s =
  let s_mark = mark s in
  let s_map = mapping s_mark in
  rewrite s s_mark s_map

If we take a look at the slab for our example transaction of 5 sets before compaction

Value "F"
Leaf ["f", Inner 0]
Value "D"
Leaf ["d", Inner 2; "f", Inner 0]
Value "H"
Leaf ["d", Inner 2; "f", Inner 0; "h", Inner 4]
Value "A"
Leaf ["a", Inner 6; "d", Inner 2]
Leaf ["f", Inner 0; "h", Inner 4]
Index Inner 7, ["d", Inner 8]
Value "Z"
Leaf ["f", Inner 0; "h", Inner 4; "z", Inner 10]
Index Inner 7, ["d", Inner 11]
Commit (Inner 12)

and after compaction:

Value "F"
Value "D"
Value "H"
Value "A"
Leaf ["a", Inner 3; "d", Inner 1]
Leaf ["f", Inner 0; "h", Inner 2]
Value "Z"
Leaf ["f", Inner 0; "h", Inner 2; "z", Inner 6]
Index Inner 4, ["d", Inner 7]
Commit (Inner 8)

We can see it’s worth the effort.
Of course, the bigger the transaction, the bigger the benefit of compaction.

Question to my 2.5 readers: If you’ve seen a paper, blog, article,… describing this, please be so kind to send me a link.

Offline log compaction

Given Baardskeerder has transactions with slab compaction, we can build a simple and efficient offline compaction algorithm for log files. Start from the root, walk over the tree while building big transactions (as big as you want, or as big as you can afford), and add these to a new log file. When you’re done, you have your compacted log.

Online log compaction

Baardskeerder logs can also return garbage to the filesystem while they are in use, but that will be kept for a later post.

have fun,

Romain.

Baardskeerder is Afrikaans for barber, but also refers to an order of creatures called Solifugae.
A more marketing-prone name would have been Mjolnir Mach3 Ultra Turbo Plus Enterprise Edition with Aloe vera.

What’s in a name?

Advertisements


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s