Int32 serialization in OCaml

Today, I’m going to talk a bit about the problem of serializing an int32 in OCaml. As I’m only working on Intel machines, I’m not interested in portability, and prefer little-endian serialization. This should be natural and easy.

The interface

val set32: string -> int -> int32 -> unit
val get32: string -> int -> int32

The microbenchmark

We’re going to store an int32 into a string,retrieve it, and check if it’s the same. We’re going to do this 1_000_000_000 times, see how long it took, and calculate the speed.

let benchmark n =
  let t0 = Unix.gettimeofday() in
  let s = String.create 4 in
  let limit = Int32.of_int n in
  let rec loop i32 =
    if i32 = limit
    then ()
      let () = set32 s 0 i32 in
      let j32 = get32 s 0 in
      assert (i32 = j32);
      loop (Int32.succ i32)
  let () = loop 0l in
  let t1 = Unix.gettimeofday () in
  let d = t1 -. t0 in
  let speed = float n /. d in
  let megaspeed = speed /. 1000000.0 in
  Printf.printf "%i took %f => %fe6/s\n" n d megaspeed

Attempt 0: Naive implementation

This is rather straight forward: mask, extract the char, store, shift and repeat. Retrieving the int32 from the string is the opposite. No rocket surgery here.
This is simple, readable code.

let set32_ocaml s pos (i:int32) =
  let (>:) = Int32.shift_right_logical in
  let (&:) = Int32.logand in
  let mask = Int32.of_int 0xff in
  let to_char v = Char.chr (Int32.to_int v) in
  let too_far = pos + 4 in
  let rec loop p i =
    if p = too_far
    then ()
      let vp = i &: mask in
      let cp = to_char vp in
      let () = s.[p] <- cp in
      loop (p+1) (i >: 8)
  loop pos i

let get32_ocaml s pos =
  let (<:) = Int32.shift_left in
  let (|:) = Int32.logor in
  let to_i32 c = Int32.of_int (Char.code c) in
  let rec loop acc p =
    if p < pos
    then acc
      let cp = s.[p] in
      let vp = to_i32 cp in
      let acc' = (acc <: 8) |: vp in
      loop acc' (p-1)
  loop 0l (pos + 3)

OCaml is a nice high level language, but this bit twiddling feels rather clumsy and ugly.
Anyway, let’s benchmark it.

Strategy Speed
naive OCaml 16.0e6/s

A quick peek at how Thrift does it

let get_byte32 i b = 255 land (Int32.to_int (Int32.shift_right i (8*b)))
class trans = object(self)
  val ibyte = String.create 8
  method writeI32 i =
    let gb = get_byte32 i in
    for i=0 to 3 do
      ibyte.[3-i] <- char_of_int (gb i)
    trans#write ibyte 0 4

Ok, this uses the same strategy; but there’s a for loop there. The conversion is done in the ibyte buffer and then copied along. It’s a bit sub-awesome, but the extra copy of 4 bytes shouldn’t be too costly neither.

Attempt 1: But in C, it would be way faster

It’s a platitude I hear a lot, but in this case, it really should be faster. After all, if you want to retrieve an int32 from a string, all you need to do is to cast the char* to an int32_t* and de-reference the value.

Let’s try this:

external set32 : string -> int -> int32 -> unit = "zooph_set32"
external get32 : string -> int -> int32         = "zooph_get32"
#include <stdint.h>
#include <stdio.h>
#include <caml/alloc.h>
#include <caml/memory.h>
#include <caml/mlvalues.h>

value zooph_set32(value vs, value vpos, value vi){
  CAMLparam3(vs, vpos, vi);
  char* buf = String_val(vs);
  int pos = Int_val(vpos);
  int32_t i = Int32_val(vi);

  char* buf_off = &buf[pos];
  int32_t* casted = (int32_t*)buf_off;
  casted[0] = i;
  CAMLreturn (Val_unit);

value zooph_get32(value vs,value vpos){
    char* buf = String_val(vs);
    int pos = Int_val(vpos);
    char* buf_off = &buf[pos];
    int32_t* casted = (int32_t*)buf_off;
    int32_t i32 = casted[0];
    result = caml_copy_int32(i32);

I called my compilation unit zooph.c an onomatopoeia that pays tribute to how fast I expect this to be. There’s no loop, and the machine has the skills to do the transformation in one step. So it should roughly be about 4 times faster.
Let’s benchmark it.

Strategy Speed
naive OCaml 16.0e6
C via FFI 32.3e6

Hm… it’s faster allright, but it’s also a bit disappointing. So what went wrong?

A quick look at the assembly code reveals a lot:

	movl	8(%rdx), %eax
	sarq	%rsi
	movslq	%esi, %rsi
	movl	%eax, (%rdi,%rsi)
	movl	$1, %eax
	.size	zooph_set32, .-zooph_set32
	.p2align 4,,15
	.globl	zooph_get32
	.type	zooph_get32, @function
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movq	%rsi, %rdx
	sarq	%rdx
	subq	$160, %rsp
	.cfi_def_cfa_offset 176
	movslq	%edx, %rdx
	movq	caml_local_roots(%rip), %rbx
	leaq	8(%rsp), %rcx
	movq	%rdi, 8(%rsp)
	movl	(%rdi,%rdx), %edi
	movq	%rsi, (%rsp)
	movq	$1, 32(%rsp)
	movq	%rcx, 40(%rsp)
	leaq	(%rsp), %rcx
	movq	%rbx, 16(%rsp)
	movq	$2, 24(%rsp)
	movq	$0, 152(%rsp)
	movq	%rcx, 48(%rsp)
	leaq	16(%rsp), %rcx
	movq	$1, 96(%rsp)
	movq	$1, 88(%rsp)
	movq	%rcx, 80(%rsp)
	leaq	80(%rsp), %rcx
	movq	%rcx, caml_local_roots(%rip)
	leaq	152(%rsp), %rcx
	movq	%rcx, 104(%rsp)
	call	caml_copy_int32
	movq	%rbx, caml_local_roots(%rip)
	addq	$160, %rsp
	.cfi_def_cfa_offset 16
	popq	%rbx
	.cfi_def_cfa_offset 8

While zooph_set32 seems to be in order, its counter part is rather messy. On closer inspection, not even the set32 side is optimal. OCaml’s FFI allows smooth (at least compared to jni) interaction with native code in other languages, it also installs a firm border across which no inlining is possible (not with OCaml that is).

Let’s take a look at how the benchmark code calls this.

	movq	%rbx, 8(%rsp)
	movq	%rax, 0(%rsp)
	movq	$1, %rsi
	movq	16(%rbx), %rdi
	movq	%rax, %rdx
	movq	zooph_set32@GOTPCREL(%rip), %rax
	call	caml_c_call@PLT
	movq	caml_young_ptr@GOTPCREL(%rip), %r11
	movq    (%r11), %r15
	movq	$1, %rsi
	movq	8(%rsp), %rax
	movq	16(%rax), %rdi
	movq	zooph_get32@GOTPCREL(%rip), %rax
	call	caml_c_call@PLT
	movq	caml_young_ptr@GOTPCREL(%rip), %r11
	movq    (%r11), %r15
	movslq	8(%rax), %rax
	movq	0(%rsp), %rdi
	movslq	8(%rdi), %rbx
	cmpq	%rax, %rbx
	je	.L176

You see stuff being pushed on the stack before the call. For raw speed, you don’t want this to happen. For raw speed, you don’t even want a call.
To get there, you need to translate the benchmark to C too. I’m not going to bother, because I have another trick ready.

Attempt 2: OCaml 4.01 primitives

OCaml 4.01 got released recently, and there’s a little entry in the release notes.

PR#5771: Add primitives for reading 2, 4, 8 bytes in strings and bigarrays
(Pierre Chambart)

However, for some reason, they are not really exposed, and I had to dig to find them. Using them however is trivial.

external get32_prim : string -> int -> int32         = "%caml_string_get32"
external set32_prim : string -> int -> int32 -> unit = "%caml_string_set32"

That’s all there is to it. Basically, you say that you know that the compiler knows how to do this, and that from now on, you want to do that too.
Let’s benchmark it.

Strategy Speed
naive OCaml 16.0e6
C via FFI 32.3e6
OCaml with primitives 139e6


Closing words

I’ve put the code for this on github: Anyway, we need to (de)serialize int64 values as well. Determining the speedup there is left as an exercise for the reader (tip: it’s even better).

I think some people will feel the urge to apply this to their serialization code as well.

Have fun,



Arakoon 1.6.0 is out

H: One voter, 16,472 votes — a slight anomaly…?

E: Not really, Mr. Hanna. You see, Baldrick may look like a monkey who’s
been put in a suit and then strategically shaved, but he is a brillant
politician. The number of votes I cast is simply a reflection of how
firmly I believe in his policies.

Black Adder III, Episode 1


So, Arakoon 1.6.0 is out, a release so huge we just had to leap minor increments! This blog post will walk you through the various changes between the 1.4 series and the 1.6 series. We’ll explain why it’s faster, and give an indication by how much.

Harvesting the client queue

Arakoon is a multipaxos implementation. This means that arakoon nodes try to gain consensus on what the next transaction is that needs to be applied.
The thing they gain consensus on is called a paxos value. In all Arakoons (≤ 1.4) there was a 1:1 mapping between a client update and a paxos value. This all works fine in LAN, but when the latency between nodes becomes a lot higher, like in a WAN setup, performance becomes abysmally slow. But some of your users are using Arakoon in a multi-datacenter setup. The solution chosen to address this was to have a n:1 mapping between client updates and paxos values. What Arakoon 1.6 does when the time has come to decide on the next paxos value, is to take all (or up to a limit) of the queued updates, and stuff them in the next paxos value. For multiple clients and small updates this means an almost linear speedup in a WAN setting.

Batched Local store

Ops: We have seen a corrupt tokyo cabinet, and we can reproduce it!
Dev: Great! What’s the scenario?
Ops: It’s really easy. Just apply the following 218 million transactions to an empty store.
Dev: Err, euhm, thx ?!

When you want to fix a problem, having a 100% reproducible scenario is a blessing, but not if it takes forever. Fortunately, during the root cause analysis of such a problem, you have time to think, and thus the insight came: We can speed this up a lot! Having to maintain a transaction log is a mixed blessing, but one of the advantages is that you don’t have to apply all updates to the database immediately. So we added an in-memory caching layer that sits before Tokyo Cabinet. All updates happen there and from time to time, we push the updates through in a big batch. We can afford ourselves to do this as we already ensured durability via the transaction logs.

Logging improvements

Since the early days, Arakoon has a crash logger. When an Arakoon node dies unexpectly, it dumps its last 1000 or so log statements in a separate file. This is a really useful feature, and allowed us to determine the underlying cause of death in many occasions. Unfortunately, it was also a very expensive feature: We were always generating debug level log statements, even if the configuration stated ‘info’. Logging is expensive, even if you throw the log statements away. There is a big cost in the generation of all these strings via fancy Printf templates. So we decided to use a syntax extension that allows us to create these strings more lazily.

So now we have the best of both worlds: when all goes well, we don’t need debug statements and we never pay the cost of constructing the string, but when things turn sour,
we can still go back and create our crash log.

Numbers please

We’re not going to show full fledged benchmarks, just a simple microbenchmark to show a trend. The experiment is to let a client (or more clients) do a million sets to an Arakoon cluster and see how long it takes. All experiments were performed on a simple Intel Core i7 with a HDD.

1e6 sets/client Arakoon 1.4.3 Arakoon 1.6.0
1 client 294s 196s
4 // clients 900s 625s
8 // clients 2180s 1150s

How to read the table? 1 client does 1e6 sets on Arakoon 1.6.0 and finishes after 196s. 8 clients do 1e6 sets in parallel on Arakoon 1.6.0 and the last client finishes after 1150s. That looks like a rather nice improvement.

The next table is for a cluster of 3 nodes, all running on the same machine and thus sharing the same HDD

1e6 sets/c Arakoon 1.4.3 Arakoon 1.6.0
1 client 1090s 676s
4 // clients 4005s 2470s
8 // clients 8734s 4700s

A last table shows the effect of latency. We add 50ms latency on everything that goes over the loopback interface (via netem). Again, this is a setup with 3 nodes, all on the same host, sharing a hdd (although this will not matter too much).

1e5 sets/c Arakoon 1.4.3 Arakoon 1.6.0
8 // clients 86400 20600

Note that this benchmark only goes to 1e5 (otherwise, it would take forever) You can immediately see the dramatic effect of latency on a distributed system. Arakoon 1.6.0 does a better job at fighting latency than Arakoon 1.4.3 (the difference in performance will increase with the number of clients).

Other notable improvements

Tokyo Cabinet patched

The whole data corruption annecdote eventually lead to a 32 bit overflow issue in tokyo cabinet. With the help of some friends, we fixed it and made our changes available here in this tokyo cabinet git repo. We really don’t want to expose anyone to database corruption and decided to backport this to the 1.4 series too.

Log sections

We started to use log sections, which allows you to configure a node such a way that for example the messaging layer is in debug, while everything else stays on info level. Which allows you to put the focus on something you’re examining while not having to wade through zillions of log messages. Handy for ops and devs alike.

Client API additions

We added a 2 small but useful methods to the client API. First there’s AssertExists of key that allows you to shortcut a sequence if a key is not there. Second, there’s multi_get_option: key list -> value option list Lwt.t which allows you to fetch a series of values, even if you’re not sure all of them exist in the database.

Always sync

Some of our users have fancy SSDs for the tlogs and wanted to be able to fsync after every write of a tlog entry. We accomodated them.

Full details

You can find the full list of improvements for Arakoon 1.6 in our bug tracker


It’s stable, way faster than it used to be, and it’s free.
You can get it from our githup repo
What are you waiting for?

Interludium: It’s Doomsday!

And Now for Something Completely Different (Monty Python’s Flying Circus)


This post explains how to calculate the day of the week, with all calculations inside your head.
It’s a geekish party trick, and doesn’t really belong on a research blog,
but since today is in fact doomsday, I feel I can get away with it.
Anyway, here follows the algorithm written down in ocaml, but optimized for my brain.


Doomsday is the most important day of the year. It’s the last day of February:
February 28th on normal years, February 29th on leap years.
If you know what day of the week it is on Doomsday, you quite easily adjust for any other day of the year.
The trick is to have an anchor in each month that is the same day of the week as Doomsday.

Anchors for each month

The Anchor for February is doomsday itself.
The anchor for another even month m is m. The anchor for most odd months can be remembered by
the following mnemonic:

I work 9-5 at the 7-11

It simply helps you remember that doomsday for the 9th month is the fifth day of that month and vice-versa.
This only leaves January and March. March is easy: the zeroth of March is the last day of February, which is doomsday. January is a bit more difficult: most years it’s January 3rd, on leap years it’s the 4th. The ocaml summary is this:

  let anchor_of_month yy = function (* "I work 9-5 at the 7-11" *)
    | 1  -> 3  + leap yy
    | 2  -> 28 + leap yy
    | 3  -> 0  (* last day of Feb is zeroth day of Mar *)
    | 5  -> 9
    | 9  -> 5
    | 7  -> 11
    | 11 -> 7
    | m  -> m

An example:

  • 25/12/2013 Christmas
  • Today (28/2/2013 = Doomsday 2013) is a Thursday
  • Hence, December 12 (12/12) is also a Thursday.
  • 19 Thu, 26 Thu, 25 is a Wednesday.

This is all you need to know to be able to calculate the day of the week for any day in the current year.
The funny thing is that this is rather simple to explain to humans, but not so trivial to formalize.
Here’s an attempt:

  let day_of_week hh yy mm dd = 
    let doom = anchor_of_month yy mm
    and w = of_year hh yy
    if doom >= dd 
    then zoom_down doom w dd
    else zoom_up doom w dd

The two zoom functions are what we do when we move from the anchor in the month to the day we want.

  let rec zoom_down known w dd = 
    if known = dd then days.(w)
      let t = known -7 in
      if t > dd 
      then zoom_down t w dd
      else zoom_down (known - 1) (prev w) dd

  let rec zoom_up known w dd = 
    if known = dd then days.(w)
      let t = known + 7 in
      if t < dd 
      then zoom_up t w dd
      else zoom_up (known + 1) (next w) dd

If this were code written for a computer, I would have distilled something more compact, but my brain doesn’t handle higher order functions very well.

Doomsday for other years

The most frequent use of the doomsday algorithm as a party trick is the calculation the day of the week of someone’s birthday. The algo will yield a number, which you will interprete as a day of the week.
Days are numbered like this:

Sun 0
Mon 1

The next thing you need to know is the adjustment for the century.
For 19YY years it’s 3 (WEDnesday) while for 20YY years, it’s 2 (Y-TUE-K).
There are rules for other centuries, but I didn’t learn them, as I don’t frequent vampire parties.
The formula I tend to use to calculate the doomsday of a year is shown below.

  let of_year hh yy =
    let start = match hh with
      | 19 -> 3 (* WE'D be in this day *)
      | 20 -> 2 (* Y-Tue-K *)
    let d12 = yy / 12 in
    let r12 = (yy mod 12) in
    let r4 = r12 / 4 in
    (start + d12 + r12 + r4) mod 7

For some years it’s simpler to move the mod 7 up a bit.

Some tests

If you want to practice this, you need some special dates. Below are some exercises.

29/02/2013 => Thu
25/05/2013 => Sat
25/12/2013 => Wed
01/01/2013 => Tue
11/09/2001 => Tue
06/10/1973 => Sat
03/11/1971 => Wed
20/07/1969 => Sun
08/05/1945 => Tue
11/11/1918 => Mon

Closing words

I picked this up on various resources. The good stuff comes from there, the mistakes are my own.

Have fun,


User functions in Arakoon

Mahomet cald the Hill to come to him. And when the Hill stood still, he was neuer a whit abashed, but said;
If the Hill will not come to Mahomet, Mahomet wil go to the hill.

Francis Bacon


Arakoon tries to be a simple distributed key value store that favours consistency over availability.
From time to time, we get feature requests for additional commands like:

  • assert_exists: assert a value exists for the key without caring what the value actually is
  • increment_counter: increment (or create) a counter and return the new value.
  • queue operations : add an element to the front/back of a double ended queue or pop an element
  • set_and_update_X: insert a key value pair and update some non-trivial counter X (think averages, variances,…)

The list is semi-infinite and the common thing here is that they are too complex/specific/weird/… to do them in one step using the provided interface. Of course, you can do all of this on the client side, but it will cost extra network round-trips. In distributed systems, you really want to keep the number of round-trips low, which pushes you towards these kind of feature requests.

Once you decided (performance bottlenecks probably) that you need extra functionality there are two things you can do. First, you can try to force or entice us into adding them to the core interface or alternatively, you can get by using Arakoon’s “user functions”. For some reason people fear them but there’s no real technical reason to do so.

This blog post will cover two things. First we’ll go in to the nitty gritty of coding and deploying user functions and then we’ll look at some of the strategic/architectural challenges of user functions.

How do user functions work?

The high level view is this: you build a user function, and register it to an Arakoon cluster before you start it. Then, at runtime, you can call it, using any client, with a parameter (a string option) and get back a result (string option). On the server side, the master will log this in its transaction log, try to reach consensus with the slave(s) and once that is the case, the user function will be executed inside a transaction. The result of that call will be sent to the client. If an exception occurs, the transaction will be aborted. Since Arakoon logs transactions it can replay them in case of calamities. This has a very important impact: since Arakoon needs to be able to replay the execution of a user function, you cannot do side effects, use random values or read the system clock.

Running Example

We’re going to try to build a simple queue API.
It will offer named queues with 2 operations: push and pop. Also, it’s a first-in-first-out thingy.

Arakoon 1

Client side API

Arakoon 1 offers the following API for user functions.

def userFunction(self, name, argument):
'''Call a user-defined function on the server
@param name: Name of user function
@type name: string
@param argument: Optional function argument
@type argument: string option

@return: Function result
@rtype: string option

Let’s take a look at it. A userFunction call needs the name, which is a string, and an argument which is a string option and returns a result of type string option. So what exactly is a string option in Python? Well, it’s either a string or None. This allows a user function to not take input or to not yield a result.

Server side API

The server side API is in OCaml, and looks like this:

class type user_db =
method set : string -> string -> unit
method get : string -> string
method delete: string -> unit
method test_and_set: string -> string option -> string option -> string option
method range_entries: string option -> bool -> string option -> bool -> int
-> (string * string) list

User functions on server side match the client’s opaque signature.

user_db -> string option -> string option

Queue’s client side

Let’s create the client side in python. We’ll create a class that uses an Arakoon client and acts as a queue. The problem with push is that we need to fit both the name and the value into the one paramater we have available. We need to do our own serialization. Let’s just be lazy (smart?) and use Arakoon’s serialization. The code is shown below.

from arakoon import Arakoon
from arakoon import ArakoonProtocol as P

class ArakoonQueue:
    def __init__(self, name, client):
        self._name = name
        self._client = client

    def push(self, value):        
        input =   P._packString(self._name) 
        input +=  P._packString(value)
        self._client.userFunction("QDemo.push", input)

    def pop(self):
        value = self._client.userFunction("QDemo.pop", self._name)
        return value

That wasn’t too hard now was it?

Queue, server side

The whole idea is that the operations happen on server side, so this will be a tat more complex.
We need to model a queue using a key value store. Code-wise, that’s not too difficult.
For each queue, we’ll keep 2 counters that keep track of both ends of the queue.

A push is merely getting the qname and the value out of the input, calculating the place where we need to store it, store the value there and update the counter for the back end of the queue. A pop is similar but when the queue becomes empty, we use the opportunity to reset the counters (maybe_reset_counters). The counter representation is a bit weird but Arakoon stores things in lexicographical order and we want to take advantage of this to keep our queue fifo. Hence, we need to make the counter in such a way the counter’s order is the same as a string’s order. The code is shown below.

(* file: *)

open Registry 

let zero = ""
let begin_name qname = qname ^ "/@begin" 
let end_name qname = qname ^ "/@end"
let qprefix qname key = qname ^ "/" ^ key

let next_counter = function
  | "" -> "A"
  | s -> 
        let length = String.length s in
        let last = length - 1 in
        let c = s.[last] in
        if c = 'H' 
        then s ^ "A"
        else let () = s.[last] <- Char.chr(Char.code c + 1) in 

let log x= 
  let k s = let s' = "[plugin_qdemo]:" ^ s in
            Lwt.ignore_result (Lwt_log.debug s')
  Printf.ksprintf k x

let maybe_reset_counters user_db qname b1 = 
  let e_key = end_name qname in
  let exists = 
    try let _ = user_db # get e_key in true with Not_found -> false 
  if exists
    let ev = user_db # get e_key in
    if ev = b1 then
      let b_key = begin_name qname in
      let () = user_db # set b_key zero in
      let () = user_db # set e_key zero in
  else ()

let push user_db vo = 
  match vo with
    | None -> invalid_arg "push None"
    | Some v -> 
        let qname, p1 = Llio.string_from v 0 in
        let value, _ = Llio.string_from v p1 in
        let e_key = end_name qname in
        let b0 = 
          try user_db # get (end_name qname) 
          with Not_found -> zero 
        let b1 = next_counter b0 in
        let () = user_db # set (qprefix qname b1) value in
        let () = user_db # set e_key b1 in

let pop user_db vo =
  match vo with 
    | None   -> invalid_arg "pop None"
    | Some qname -> 
        let b_key = begin_name qname in
        let b0 = 
          try user_db # get (begin_name qname) 
          with Not_found -> zero
        let b1 = next_counter b0 in
          let k = qprefix qname b1 in
          let v = user_db # get k in 
          let () = user_db # set b_key b1 in
          let () = user_db # delete k in
          let () = maybe_reset_counters user_db qname b1 in
          Some v
          Not_found ->
            let e_key = end_name qname in
            let () = user_db # set b_key zero in
            let () = user_db # set e_key zero in

let () = Registry.register "QDemo.push" push
let () = Registry.register "QDemo.pop" pop

The last two lines register the functions to the Arakoon cluster when the module is loaded.


So how do you deploy your user function module into an Arakoon cluster?
First need to compile your module into something that can be dynamically loaded.
To compile the I persuade ocamlbuild like this:

ocamlbuild -use-ocamlfind -tag 'package(arakoon_client)' \
-cflag -thread -lflag -thread \

It’s not too difficult to write your own testcase for your functionality, so you can run it outside of Arakoon and concentrate on getting the code right.


First, you need put your compilation unit into the Arakoon home directory on all your nodes of the cluster. And second, you need to add the name to the global section of your cluster configuration. Below, I show the configuration file for my simple, single node cluster called ricky.

cluster = arakoon_0
cluster_id = ricky

plugins = plugin_qdemo

ip =
client_port = 4000
messaging_port = 4010
home = /tmp/arakoon/arakoon_0

All right, that’s it. Just a big warning about user functions here.

Once a user function is installed, it needs to remain available, with the same functionality for as long as user function calls are stored inside the transaction logs, as they need to be re-evaluated when one replays a transaction log to a store (for example when a node crashed, leaving a corrupt database behind). It’s not a bad idea to include a version in the name of a user function to cater for evolution.


Let’s use it in a simple python script.

def make_client():
    clusterId = 'ricky'
    config = Arakoon.ArakoonClientConfig(clusterId,
                                         {"arakoon_0":("", 4000)})
    client = Arakoon.ArakoonClient(config)
    return client

if __name__ == '__main__':
    client = make_client()
    q = ArakoonQueue("qdemo", client)
    q.push("bla bla bla")
    q.push("some more bla")
    print q.pop()
    print q.pop()
    print q.pop()
    print q.pop()

with expected results.

Arakoon 2

With Arakoon 2 we moved to Baardskeerder as a database backend, replacing the combination of transaction logs and Tokyo Cabinet. Since the backend is Lwt-aware, this means that the server side API has become too:

module UserDB :
type tx = Core.BS.tx
type k = string
type v = string
val set    : tx -> k -> v -> unit Lwt.t
val get    : tx -> k -> (v, k) Baardskeerder.result Lwt.t
val delete : tx -> k -> (unit, Baardskeerder.k) Baardskeerder.result Lwt.t

module Registry:
type f = UserDB.tx -> string option -> (string option) Lwt.t
val register: string -> f -> unit
val lookup: string -> f

The major changes are that

  • the api now uses Lwt
  • we have (‘a,’b) Baardskeerder.result types, which we favour over the use of exceptions for normal cases.

Rewriting the queue implementation to Arakoon 2 yields something like:

(* file: *)

open Userdb
open Lwt
open Baardskeerder

let zero = ""
let begin_name qname = qname ^ "/@begin" 
let end_name qname = qname ^ "/@end"
let qprefix qname key = qname ^ "/" ^ key

let next_counter = function
  | "" -> "A"
  | s -> 
        let length = String.length s in
        let last = length - 1 in
        let c = s.[last] in
        if c = 'H' 
        then s ^ "A"
        else let () = s.[last] <- Char.chr(Char.code c + 1) in 

let reset_counters tx qname =
  let b_key = begin_name qname in
  let e_key = end_name qname in
  UserDB.set tx b_key zero >>= fun () ->
  UserDB.set tx e_key zero

let maybe_reset_counters tx qname (b1:string) = 
  let e_key = end_name qname in
    UserDB.get tx e_key >>= function
      | OK  _ -> Lwt.return true
      | NOK _ -> Lwt.return false
  end >>= function
    | true ->
          UserDB.get tx e_key >>= function
            | OK ev ->
                if ev = b1 
                then reset_counters tx qname
                else Lwt.return ()
            | NOK _  -> Lwt.return ()
    | false  -> Lwt.return ()

let push tx vo = 
  match vo with
    | None -> (invalid_arg "push None")
    | Some v -> 
        let qname, p1 = Llio.string_from v 0 in
        let value, _ = Llio.string_from v p1 in
        Lwt_log.debug_f "push:qname=%S;value=%S" qname value >>= fun ()->
        let e_key = end_name qname in
        UserDB.get tx (end_name qname) >>= fun b0r ->
        let b0 = match b0r with
          | OK b0 -> b0
          | _     -> zero
        let b1 = next_counter b0 in
        UserDB.set tx (qprefix qname b1) value >>= fun () ->
        UserDB.set tx e_key b1 >>= fun () ->
        Lwt.return None

let pop tx = function
  | None   -> (invalid_arg "pop None")
  | Some qname -> 
        let b_key = begin_name qname in
        UserDB.get tx (begin_name qname) >>= fun b0r ->
          match b0r with
            | OK b0 -> Lwt.return b0
            | NOK _ -> Lwt.return zero
        >>= fun b0 ->
        let b1 = next_counter b0 in
        let k = qprefix qname b1 in
        UserDB.get tx k >>= fun vr ->
          match vr with
            | OK value -> 
                  UserDB.set tx b_key b1 >>= fun () ->
                  UserDB.delete tx k >>= function 
                    | OK () ->
                          maybe_reset_counters tx qname b1 >>= fun () ->
                          Lwt.return (Some value)
                    | NOK e -> (Failure e)
             | NOK _  -> 
                 reset_counters tx qname >>= fun () ->
                 Lwt.return None
let () = Userdb.Registry.register "QDemo.push" push
let () = Userdb.Registry.register "QDemo.pop" pop

Both client side and deployment remain the same.

Questions asked

Ain’t there something wrong with this Queue?

Yes! Glad you noticed. This queue concept is fundamentally broken. The problem is the pop.
Follow this scenario:

  1. the client calls the QDemo.pop function
  2. the cluster pops the value from the queue and its master sends it to the client.
  3. the client dies before it can read the popped value

Now what? We’ve lost that value. Bloody network, how dare you!

Ok, I admit this was naughty, but it’s a good example of a simple local concept that doesn’t really amount to the same thing when tried in a distributed context. When confronted with this hole, people immediately try to fix this with “Right!, so we need an extra call to …”. To which I note: “But wasn’t this extra call just the thing you were trying to avoid in the first place?”

Why don’t you allow user functions to be written in <INSERT YOUR FAVOURITE LANGUAGE HERE>?

This is a good question, and there are several answers, most of them wrong. For example, anything along the lines of “I don’t like your stinkin’ language” needs to be rejected because a language’s cuteness is irrelevant.

There are several difficulties with the idea of offering user functions to be written in another programming language. For scripting languages like Python, Lua, PHP ,… we can either implement our own interpreter and offer a subset of the language, which is a lot of work with low return on investment, or integrate an existing interpreter/runtime which will probably not play nice with Lwt, or with the OCaml runtime (garbage collector). For compiled languages we might go via the ffi but it’s still way more complex for us. So for now you’re stuck with OCaml for user functions. There are worse languages.

Wouldn’t it be better if you apply the result of the user function to the transaction log iso the arguments?

Well, we’ve been thinking about that a lot before we started with user functions. The alternative is that we record and log the effect of the user function so that we can always replay that effect later, even when the code is no longer available. It’s an intriguing alternative, but it’s not a clear improvement. It all depends on the size of the arguments versus the size of the effect.
Some user functions have a small argument set and a big effect, while for other user functions it’s the other way around.

Closing words

Technically, it’s not too difficult to hook in your own functionality into Arakoon. Just make sure the thing you want to hook in does not have major flaws.

have fun,


Caulking your distributed algorithm implementation

Me: Ok, all unit tests succeed.
All system tests succeed.
All acceptance tests succeed.
(some time later)
NN: We have a problem with your distributed database.
Me: Ok, what’s the matter?
NN: Once in a while the cluster seems to get stuck.
Me: Stuck how?
NN: It seems to be unable to elect a master.
Me: What did you do to get there?
NN: We don’t know, but we have the logs. Here they are. Happy hunting.
Me: $#%#&*!
(some time later)
Me: Aha, that’s what went wrong!


The Arakoon team has spent a big part of the last two years on variations of this theme. Some of the problems turned out to configuration problems (sometimes even of other clusters), but there definitely were situations where the issue was a caused by an implementation error. As it turns out, they all have the same cause: not all possible scenarios were covered. So, is it possible to escape this vicious circle?

We already learned some things (see this previous blog post), so we use asynchronuous message passing and removed all IO from our implementation. So what’s left?
Well we have a set of distributed objects with well defined states and a set of messages that can be exchanged between them. Can’t we just generate all possible states of the distributed system and all possible messages and check how the relationships between them? This is non-trivial as the number of possible states is infinite and so is the number of possible messages. On the other hand, most state combinations are bad and inconsistent, and most messages are irrelevant… There might be something we can do.

Growing a state space-ish diagram

What if we start from a consistent starting state for all objects (let’s call this the system state), generate all relevant messages that can be exchanged from that state and apply them in all possible orders. The system states that can be reached in this way from the starting state should all be consistent. If we find a state that’s not consistent, we stop. For consistent system states, we can iterate. What about inconsistent states? Well, clearly this means our algorithm is capable of doing bad things. We should check the scenario that produced this and fix it, and iterate again. Is this doable? Well, maybe… and what about simulating dropped messages?

Humble beginnings

Let’s start small. What about growing a Basic Paxos implementation for a system of three nodes? Modelling the messages should not be too difficult:

type 'v kind = 
  | Prepare   
  | Promise of 'v option
  | Accept  of 'v
  | Accepted  

type 'v message = {n:n; s:id; t:id; k:'v kind}

Each message is associated with a paxos round n, has a source s and a target t and has semantics described by its kind k. Finally there’s some value type (‘v) for the things the system should try to find consensus on. (You can find the code on Github)

Modelling the agents is a bit more work:

type 'v agent_state = 
  | SHalted 
  | SIdle 
  | SRequest  of ('v proposal)
  | SPromised of 'v option
  | SLead of ('v * int) (* value, outstanding acks *)

type 'v agent = { 
  id : id;
  pop : id list;
  _n : n;
  state : 'v agent_state;
  store : (n * 'v) list;

An agent has an id, knows the other agents (pop), has a store and a current paxos round n. The interesting part is the inner state representing the role it’s playing. It can be halted or idle, requesting to become a leader for a proposal or leading an update.

Now we have messages and agents, we can model the state of the system.

type 'v network = 'v message list
type 'v agents  = 'v agent list
type 'v state   = { net: 'v network; ags: 'v agents}

The system state is merely a collection of agents (ags) and a network (net) representing the messages that can be delivered to the agents.

How can the state of the system change? First of all, the delivery of a message most likely will have an impact. We might add other moves later.

type 'v move = 
  | DeliverMsg of 'v message

Generating all moves is now easy: for every message in the network, there is a move that delivers it, and since we want to stop in bad states, we don’t generate messages there:

let generate_moves state = 
  if is_bad state 
  then []
    let deliver = (fun x -> DeliverMsg x) in

How about executing a move? There is some administration to do there.
We have a move to execute, a current state, a path that was followed to arrive at that state, and a set of observed states. If the move is the delivery of a message, we find the target agent and let him handle the message. This will change the agent’s state and produce some messages (extra).

 let execute_move move state path observed = 

    let deliver m ag = 
      let agent = find_agent ag m.t in
      let agent', extra = A.handle_message agent m in
      let ag' = replace_agent agent' ag in
      ag', extra
    let state' = 
      match move with
        | DeliverMsg m -> 
          let net' = remove m in
          let ags',extra = deliver m state.ags in
          { net = net' @ extra; ags = ags'}

Executing a move will cause a new system state. We will record observing the transition from the old state to the new by this move, and create the new path.

      let transition = (state_label state, move, state_label state') in
      TSet.add transition observed in
    state' , (move,state) :: path , observed'

The whole simulator is a few functions away:

  let rec check_all level state path observed = 
    if not (is_consistent state) then raise (Bad (path,state));
    if level = limit
    then observed
      let rec loop observed = function
        | [] -> observed
        | move :: rest ->
          let state', path', observed' = execute_move move state path observed in
          let observed'' = check_all (level + 1) state' path' observed' in
          loop observed'' rest
      let moves = generate_moves state in
      loop observed moves

  let run state0 = check_all 0 state0 [] TSet.empty 

Basically we start from an initial state, go down the tree of possible moves, execute all these and accumulate observed transitions.

Generating a diagram is trivial with Graphviz. Just iterate over the observed transitions. (not shown here, see on Github for details )

The simulation

We create 3 agents, let the first one start on a value, run our simulator from the starting state, and dottify the observed transitions.

let main () = 
  let ids = id [1;2;3] in
  let a1,a1_out = start (make_agent (Id 1) ids) "x" in
  let a2,a2_out = (make_agent (Id 2) ids),[] in
  let a3,a3_out = (make_agent (Id 3) ids),[] in
  let world = [a1;a2;a3] in
  let state0 = {net = a1_out @ a2_out @ a3_out; ags = world} in
    let observed = state0 in
    M.dottify state0 observed 
  with (M.Bad (path, state)) ->
    Printf.eprintf "bad path:\n";
    M.dump_path path;
    Printf.eprintf "%s\n" (state_label state)


Let’s try this on a brain-dead simple implementation of an agent.
One that goes to the halted state as soon as it receives a message, while sending out no message at all.

module Mark0 = (struct
  let handle_message agent message = halt agent, []

end : ALGO)

What do we see here? First, there is a labelling scheme for states: R1I0I0 means the first agent has n=1 and is in a Requesting state, while the second and third agents ar in Idle state with n=0.
After the delivery of the {Prepare;1->2;n=1} message, a prepare from agent 1 to agent 2, the second agent halts. Likewise for the other prepare message. This looks ok, so let’s move on.


Let’s build an agent implementation that covers the happy path.

module Mark1 = (struct

  let prepare_when_idle source n agent= 
    let an = agent._n in
    if n > an 
      let pv = None in
      let agent' = {agent with state = SPromised pv; _n = n;} in
      let msg = {n = n;s =;t = source; k = Promise pv } in
      let out = [msg] in
      agent', out
      halt agent,[]
  let promise_for_request (source:id) (mn:n) vo (proposal:'v proposal) agent = 
    let pv,pballot = proposal in
    if mn = agent._n
      let pballot' = Ballot.register_vote pballot source vo in
      if Ballot.have_majority pballot' 
        let value = Ballot.pick_value pballot' in
        let outstanding_acks = Ballot.quorum pballot' -1 in
        let me = in
        let targets = others agent in
        let make_msg t = {n = mn; s = me; t ; k =  Accept value} in
        let broadcast = make_msg targets in
        let agent' = { agent with 
            store = (mn,value) ::;
            state = SLead (value, outstanding_acks);
        agent', broadcast
        agent, []
      halt agent, []
  let handle_accept m v agent = 
    match agent.state with
      | SPromised vo when agent._n = m.n -> 
        let agent' = {agent with state = SIdle; store = (m.n, v) ::;} in
        let out = [{n = m.n;s =;t = m.s;k = Accepted}] in
        agent', out
      | _ -> halt agent, []
  let handle_accepted m agent =
    match agent.state with
      | SLead (v,out) when agent._n = m.n -> 
        let out' = out -1 in
        let state' = if out' = 0 then SIdle else SLead(v,out') in
        {agent with state = state'},[]          
      | _ -> halt agent, []

  let handle_message agent m = 
    match m.k with
      | Prepare when agent.state = SIdle -> prepare_when_idle m.s m.n agent
      | Promise vo -> 
          match agent.state with 
            | SRequest p -> promise_for_request m.s m.n vo p agent
            | _ -> halt agent, []
      | Accept v -> handle_accept m v agent
      | Accepted -> handle_accepted m agent
      | _ -> halt agent,[]
end : ALGO)

What does this do? (click on it to see the full picture)

The good news is that there are quite a number of paths from the initial state I0I0I0 that reach our happy state I1I1I1, but there are also a lot of scenarios that end up in bad states.
Let’s look at one in detail.

R1I0I0:{Prepare;1->3;n=1} --->
R1I0P1:{Promise;3->1;n=1} --->
L1I0P1:{Accept; 1->2;n=1} --->

What happened here? A Prepare message goes from agent 1 to agent 3. That agent sends a Promise back.
This causes agent 1 to become a leader and broadcast Accept messages. One of these reaches agent 1, which is clueless as it did not receive a Prepare message first. Agent 1 therefore halts.

The diagram allows us to understand scenarios that lead to bad states, and to modify the algorithm accordingly. This process of finding holes in your algorithm,patching them and iterating is something which I call caulking in absence of a better word. In this particular case, an agent that is Idle can receive an Accept for the next n and should be able to move to the Idle state at the next n.

What about dropped messages?

Earlier, I did not answer the question about the simulation of dropped messages. The above scenario should make clear that we are actually, in luck. There is no difference between that scenario and a scenario where a Prepare from agent 1 and agent 2 was dropped. In general, there is no difference between dropping a message and delaying it until it is no longer relevant. This means there is no need for us to simulate them at all!


Let’s caulk Mark1. Looking at the diagram, not a lot of things need to be fixed. Here’s a list of messages that go awry.

  • Accept;n when agent is Idle at pred n
  • Accepted;n when agent is already Idle at n
  • Promise;n when agent is already Leading at n
  • Promise;n when agent is already Idle at n

Ok, adapting the code is easy:

module Mark2 = (struct

  let prepare_when_idle source n agent= 
    let an = agent._n in
    if n > an 
      let pv = None in
      let agent' = {agent with state = SPromised pv; _n = n;} in
      let msg = {n = n;s =;t = source; k = Promise pv } in
      let out = [msg] in
      agent', out
      halt agent,[]
  let promise_for_request (source:id) (mn:n) vo (proposal:'v proposal) agent = 
    let pv,pballot = proposal in
    if mn = agent._n
      let pballot' = Ballot.register_vote pballot source vo in
      if Ballot.have_majority pballot' 
        let value = Ballot.pick_value pballot' in
        let outstanding_acks = Ballot.quorum pballot' -1 in
        let me = in
        let targets = others agent in
        let make_msg t = {n = mn; s = me; t ; k =  Accept value} in
        let broadcast = make_msg targets in
        let agent' = { agent with 
            store = (mn,value) ::;
            state = SLead (value, outstanding_acks);
        agent', broadcast
        agent, []
      halt agent, []
  let handle_accept m v agent = 
    let _accept m =         
      let agent' = {agent with state = SIdle; store = (m.n, v) ::;} in
      let out = [{n = m.n;s =;t = m.s;k = Accepted}] in
      agent', out
    match agent.state with
      | SPromised vo when agent._n = m.n -> _accept m
      | SIdle when (next agent._n) = m.n -> _accept m
      | _ -> halt agent, []
  let handle_accepted m agent =
    match agent.state with
      | SLead (v,out) when agent._n = m.n -> 
        let out' = out -1 in
        let state' = if out' = 0 then SIdle else SLead(v,out') in
        {agent with state = state'},[]          
      | SIdle when agent._n = m.n -> agent,[]
      | _ -> halt agent, []

  let handle_message agent m = 
    match m.k with
      | Prepare when agent.state = SIdle -> prepare_when_idle m.s m.n agent
      | Promise vo -> 
          match agent.state with 
            | SRequest p -> promise_for_request m.s m.n vo p agent
            | SLead(v,out) when agent._n = m.n -> agent, []
            | SIdle when agent._n = m.n -> agent, []
            | _ -> halt agent, []
      | Accept v -> handle_accept m v agent
      | Accepted -> handle_accepted m agent
      | _ -> halt agent,[]
end : ALGO)

Look at the output diagram:

Isn’t it nice that fixing the holes in our algorithm actually makes the diagram smaller? Since we don’t end up in bad states anymore, there are way less transitions. It’s also aesthetically pleasing graphviz shows all arrows from left to right, meaning there are no transitions that actually increase the distance between the current state and the state we’re aiming for.

What about agents that are wiped clean?

This kind of calamity is not too difficult to simulate. Basically it’s a move that puts the agent back in its starting state. Let’s add the possibility that one of the agents is wiped.

let generate_moves state = 
  if is_bad state 
  then []
    let deliver = (fun x -> DeliverMsg x) in
    let id3 = Id 3 in
    let agent = find_agent state.ags id3 in
    let wipe = 
      if is_halted agent 
      then [] 
      else [Wipe id3]
    deliver @ wipe 

Let’s try that…

./paxos.byte >
bad path:
1 ---(1: Prepare) ---> 3
3 ---(1: Promise) ---> 1
1 ---(1:  Accept) ---> 2
1 ---(1:  Accept) ---> 3
Wipe 3

Auch. There actually is something wrong here. As it turns out, there is a bug in the Mark2 module.
It’s this fragment that’s wrong:

 let handle_accept m v agent = 
    let _accept m =         
      let agent' = {agent with state = SIdle; store = (m.n, v) ::;} in
      let out = [{n = m.n;s =;t = m.s;k = Accepted}] in
      agent', out
    match agent.state with
      | SPromised vo when agent._n = m.n -> _accept m
      | SIdle when (next agent._n) = m.n -> _accept m
      | _ -> halt agent, []

Handling an Accept when the agent is in the Idle state should also set the n correctly (the one of the message). Let’s fix that and try again.
Here it is:

Again, we’re confronted with lots of bad states but we know how to fix that. What are the scenarios that bring us in bad states? As it turns out, these are all caused by old Prepare messages. Adding that and generating the diagram again yields:

Indeed, wiping an agent moves to a state somewhere to the left of the current state, which matches the idea of being further away from our goal.

Are we there yet?

So far, we only addressed cases where there is only 1 agent in a requesting state. So what would happen if there are two agents requesting something at the same time?
Happy caulking!

Closing Remarks

Arriving at a correct implementation of an algorithm is difficult, but even more so in the case of distributed systems. This blog post shows a strategy you can apply in caulking your own implementations. As stated before, making your implementation pure helps you a lot.

Have fun,


The Game of Distributed Systems Programming. Which Level Are You?


When programming distributed systems becomes part of your life, you go through a learning curve. This article tries to describe my current level of understanding of the field, and hopefully points out enough mistakes for you to be able follow the most optimal path to enlightenment: learning from the mistakes of others.
For the record: I entered Level 1 in 1995, and I’m currently Level 3. Where do you see yourself?

Level 0: Clueless

Every programmer starts here. I will not comment too much here as there isn’t a lot to say. Instead, I quote some conversations I had, and offer some words of advice to developers that never battled distributed systems.

NN1:”replication in distributed systems is easy, you just let all the machines store the item at the same time

Another conversation (from the back of my memory):

NN: “For our first person shooter, we’re going to write our own networking engine”
ME: “Why?”
NN: “There are good commercial engines, but license costs are expensive and we don’t want to pay these.”
ME: “Do you have any experience in distributed systems?”
NN: “Yes, I’ve written a socket server before.”
ME: “How long do you think you will take to write it?”
NN: “I think 2 weeks. Just to be really safe we planned 4.”

Sometimes it’s better to remain silent.

Level 1: RPC

RMI is a very powerful technique for building large systems. The fact that the technique can be described, along with a working example, in just a few pages, speaks volumes of Java. RMI is tremendously exciting and it’s simple to use. You can call to any server you can bind to, and you can build networks of distributed objects. RMI opens the door to software systems that were formerly too complex to build.

Peter van der Linden, Just Java (4th edition, Sun Microsystems)

Let me start by saying I’m not dissing this book. I remember disctinctly it was fun to read (especially the anecdotes between the chapters), and I used it for the Java lessons I used to give (In a different universe, a long time ago). In general, I think well of it. His attitude towards RMI however, is typical of Level 1 distributed application design. People that reside here share the vision of unified objects. In fact, Waldo et al describe it in detail in their landmark paper “a note on distributed computing” (1994), but I will summarize here:
The advocated strategy to writing distributed applications is a three phase approach. The first phase is to write the application without worrying about where objects are located and how their communication is implemented. The second phase is to tune performance by “concretizing” object locations and communication methods. The final phase is to test with “real bullets” (partitioned networks, machines going down, …).

The idea is that whether a call is local or remote has no impact on the correctness of a program.

The same paper then disects this further and shows the problems with it. It has thus been known for almost 20 years that this concept is wrong. Anyway, if Java RMI achieved one thing, it’s this: Even if you remove transport protocol, naming and binding and serialization from the equation, it still doesn’t work. People old enough to rember the hell called CORBA will also remember it didn’t work, but they have an excuse: they were still battling all kinds of lower level problems. Java RMI took all of these away and made the remaining issues stick out. There are two of them. The first is a mere annoyance:

Network Transparency isn’t

Let’s take a look at a simple Java RMI example (taken from the same ‘Just Java’)

public interface WeatherIntf extends javva.rmi.Remote{
  public String getWeather() throws java.rmi.RemoteException;    

A client that wants to use the weather service needs to do something like this:

     Remote robj = Naming.lookup("//localhost/WeatherServer");
     WeatherIntf weatherserver = (WeatherInf) robj;
     String forecast = weatherserver.getWeather();
     System.out.println("The weather will be " + forecast);
  }catch(Exception e){

The client code needs to take RemoteExceptions into account.
If you want to see what kinds of remote failure you can encounter, take a look at the more than 20 subclasses. Ok, so your code will be a tad less pretty. We can live with that.

Partial Failure

The real problem with RMI is that the call can fail partially. It can fail before the action on the other tier is invoked, or the invocation might succeed but the return value might not make it afterwards, for whatever reason. These failure modes are in fact the very defining property of distributed systems or otherwise stated:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
(Leslie Lamport)

If the method is just the retrieval of a weather forecast, you can simply retry, but if you were trying to increment a counter, retrying can have results ranging from 0 to 2 updates. The solution is supposed to come from idempotent actions, but building those isn’t always possible. Moreover, since you decided on a semantic change of your method call, you basically admit RMI is different from a local invocation. This is an admission of RMI being a fallacy.

In any case the paradigm is a failure as both network transparency and architectural abstraction from distribution just never materialise. It also turns out that some software methodologies are more affected than others. Some variations of scrum tend to prototype. Prototypes concentrate on the happy path and the happy path is not the problem. It basically means you will never escape Level 1. (sorry, this was a low blow. I know)

People who do escape Level 1 understand they need to address the problem with the respect it deserves. They abandon the idea of network transparency, and attack the handling of partial failure strategically.

Level 2: Distributed Algorithms + Asynchronous messaging + Language support

<sarcasm>”Just What We Need: Another RPC Package” </sarcasm>
(Steve Vinoski)

Ok, you’ve learned the fallacies of distributed computing. You decided to bite the bullet, and model the message passing explicitly to get a control of failure.
You split your application into 2 layers, the bottom being responsible for networking and message transport, while the upper layer deals with the arrival of messages, and what needs to be done when they do.
The upper layer implements a distributed state machine, and if you ask the designers what it does, they will tell you something like : “It’s a multi-paxos implementation on top of TCP”.
Development-wise, the strategy boils down to this: Programmers first develop the application centrally using threads to simulate the different processes. Each thread runs a part of the distributed state machine, and basically is responsible for running a message handling loop. Once the application is locally complete and correct, the threads are taken away to become real processes on remote computers. At this stage, in the absence of network problems, the distributed application is already working correctly. In a second phase fault tolerance can be straighforwardly achieved by configuring each of the distributed entities to react correctly to failures (I liberally quoted from “A Fault Tolerant Abstraction for Transparent Distributed Programming).

Partial failure is handled by design, because of the distributed state machine. With regards to threads, there are a lot of options, but you prefer coroutines (they are called fibers, Light weight threads, microthreads, protothreads or just theads in various programming languages, causing a Babylonic confusion) as they allow for fine grained concurrency control.

Combined with the insight that “C ain’t gonna make my network any faster”, you move to programming languages that support this kind of fine grained concurrency.
Popular choices are (in arbitrary order)

(Note how they tend to be functional in nature)

As an example, let’s see what such code looks like in Erlang (taken from Erlang concurrent programming)


-export([start/0, ping/2, pong/0]).

ping(0, Pong_PID) ->
    Pong_PID ! finished,
    io:format("ping finished~n", []);

ping(N, Pong_PID) ->
    Pong_PID ! {ping, self()},
        pong ->
            io:format("Ping received pong~n", [])
    ping(N - 1, Pong_PID).

pong() ->
        finished ->
            io:format("Pong finished~n", []);
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,

start() ->
    Pong_PID = spawn(tut15, pong, []),
    spawn(tut15, ping, [3, Pong_PID]).

This definitely looks like a major improvement over plain old RPC. You can start reasoning over what would happen if a message doesn’t arrive.
Erlang gets bonus points for having Timeout messages and a builtin after Timeout construct that lets you model and react to timeouts in an elegant manner.

So, you picked your strategy, your distributed algorithm, your programming language and start the work. You’re confident you will slay this monster once and for all, as you ain’t no Level 1 wuss anymore.

Alas, somewhere down the road, some time after your first releases, you enter troubled waters. People tell you your distributed application has issues. The reports are all variations on a theme. They start with a frequency indicator like “sometimes” or “once”, and then describe a situation where the system is stuck in an undesirable state. If you’re lucky, you had adequate logging in place and start inspecting the logs. A little later, you discover an unfortunate sequence of events that produced the reported situation. Indeed, it was a new case. You never took this into consideration, and it never appeared during the extensive testing and simulation you did. So you change the code to take this case into account too.

Since you try to think ahead, you decide to build a monkey that pseudo randomly lets your distributed system do silly things. The monkey rattles its cage and quickly you discover a multitude of scenarios that all lead to undesirable situations like being stuck (never reaching consensus) or even worse: reaching an inconsistent state that should never occur.

Having a monkey was a great idea, and it certainly reduces the chance of encountering something you’ve never seen before in the field. Since you believe that a bugfix goes hand in hand with a testcase that first produced the bug, and now proves its demise, you set out to build just that test. Your problem however is reproducing the failure scenario is difficult, if not impossible. You listen to the gods as they hinted when in doubt, use brute force. So you produce a tests that runs a zillion times to compensate the small probability of the failure. This makes your bug fixing process slow and your test suites bulky. You compensate again by doing divide and conquer on your volume of testsets. Anyway, after a heavy investment of effort and time, you somehow manage to get a rather stable system and ditto process.

You’re maxed out on Level 2. Without new insights, you’ll be stuck here forever.

Level 3: Distributed Algorithms + Asynchronous messaging + Purity

It takes a while to realise that a combination of long running monkeys to discover evil scenarios and brute force to reproduce them ain’t making it. Using brute force just demonstrates ignorance. One of the key insights you need is that if you could only remove indeterminism from the equation, you would have perfect reproducibility of every scenario. A major side effect of Level 2 distributed programming is that your concurrency model tends to go viral on your codebase. You desired fine grained concurrency control… well you got it. It’s everywhere. So concurrency causes indeterminism and indeterminism causes trouble. So concurrency must go. You can’t abandon it: you need it. You just have to ban it from mingling with your distributed state machine. In other words, your distributed state machine has to become a pure function. No IO, No Concurrency, no nothing. Your state machine signature will look something like this

module type SM = sig
  type state
  type action
  type msg
  val step: msg -> state -> action * state

You pass in a message and a state, and you get an action and a resulting state. An action is basically anything that tries to change the outside world, needs time to do so, and might fail while trying. Typical actions are

  • send a message
  • schedule a timeout
  • store something in persistent storage

The important thing to realise here is that you can only get to a new state via a new message. nothing else. The benefits of such a strict regime are legio. Perfect control, perfect reproducibility and perfect tracibility. The costs are there too. You’re forced to reify all your actions, which basically is an extra level of indirection to reduce your complexity. You also have to model every change of the outside world that needs your attention into a message.

Another change from Level 2 is the change in control flow. At Level 2, a client will try to force an update and set the machinery in motion. Here, the distributed state machine assumes full control, and will only consider a client’s request when it is ready and able to do something useful with it. So these must be detached.

If you explain this to a Level 2 architect, (s)he will more or less accept this as an alternative. It, however, takes a sufficient amount of pain (let’s call it experience or XP) to realize it’s the only feasible alternative.

Level 4: Solid domination of distributed systems: happiness, piece of mind and a good night’s rest

To be honest, as I’m a mere Level 3 myself, I don’t know what’s up here. I am convinced that both functional programming and asynchronous message passing are parts of the puzzle, but it’s not enough.
Allow me to reiterate what I’m struggling against. First, I want my distributed algorithm implementation to fully cover all possible cases.
This is a big deal to me as I’ve lost lots of sleep being called in on issues in deployed systems (Most of these turn out to be PEBKAC but some were genuine, and cause frustration). It would be great to know your implementation is robust. Should I try theorem provers, should I do exhaustive testing ? I don’t know.
As an aside, for an append only btreeish library called baardskeerder, we know we covered all cases by exhaustively generating insert/delete permutations and asserting their correctness. Here, it’s not that simple, and I’m a bit hesitant to Coqify the codebase.
Second, for reasons of clarity and simplicity, I decided not to touch other, orthogonal requirements like service discovery, authentication, authorization, privacy and performance.
With regard to performance, we might be lucky as the asynchronuous message passing at least doesn’t seem to contradict performance considerations.
Security however is a real bitch as it crosscuts almost everything else you do. Some people think security is a sauce that you can pour over your application to make it secure.
Alas, I never succeeded in this, and currently think it also needs to be addressed strategically during the very first stages of design.

Closing words

Developing robust distributed systems is a difficult problem that is practically unsolved, or at least not solved to my satisfaction.
I’m sure its importance will increase significantly as latency between processors and everything else increases too. This results in an ever growing area of application for this type of application development.

As far as Level 4 goes, maybe I should ask Peter Van Roy. Over the years, I’ve read a lot of his papers, and they offered me a lot of insight in my own mistakes. The downside of insight is that you see others repeating your mistakes and most of the time, I fail to convince people they should do it differently.
Probably, this is because I cannot offer the panacea they want. They want RPC and they want it to work. It’s perverse … almost religious

“How hard can it be?” – on coding, chess and elo


It happens to all programmers. After some preliminary work, you’re at the point where you can see the solution with your mind’s eye, and all that’s left to be done is to write the code. You split it into steps that you think will bring you success, and start coding. Quickly, you realize it’s not so simple, and a few days later you get a version running but you’re not happy with it. It’s fragile, big, and performing awfully. It has the elegance of a drinking giraffe, and you feel disappointed. Many programmers say to themselves at this point: ‘How hard can it be?’.
Even worse, their managers, who can’t code to save their lives, will ask the same question.

Let me share an insight that comforts me at times like these. I will try to answer that exact question.

A Lower Bound

I will cheat a bit, as trying to establish exactly how difficult programming is is too difficult for me. So I will settle for a lower bound. Let’s find something that’s actually easier than programming. Well, chess seems to be a good candidate.

Chess is a limited game. Its entire universe consists of only 64 squares. There are two armies of maximum 16 pieces, each piece following some simple rules. Players take turns. Each player moves a piece (exceptionally 2) and tries to mate his opponent. In the starting position, one has only 20 possible moves, and a complex position has some 50 possible moves (I have seen composed positions with more than 200 moves) of which maybe 10 seem actually playable. A game can last 200 moves, but a typical game hardly reaches 50 moves. In the end, you can only have a finite amount of possible positions.

Programming on the other hand is far less limited. You can combine algorithms, data structures, programming languages and paradigms in almost an infinite number of combinations. As you write code, the only thing holding you back is that you have to produce something that’s acceptable to the syntax of the programming language you’re using. Other than that you’re free. A programming language like Java has a simple (compared to C++ at least) grammar, but still is very rich in possibilities. You can take the grammar to build small classes. You can then take these classes, and combine them like Lego, to build more complex building blocks, ad infinitum. Event better, your playground (the hardware you work against) becomes more and more powerful so you can try things that were impossible just a few years ago. Compared it to chess, it seems that in programming you can combine pieces and build new ones, enlarge the board, and so on. Come to think of it, it’s safe to say programming is as least as complex as chess.

If someone would still have doubts, I can say that we build chess playing programs that will beat any human. How far are we on programming programs? Ok, Chess is indeed a lower bound. Programming is at least as hard as Chess. It might be a lot harder, but that is too hard to prove, so I’ll settle for ‘at least as hard’.

About Chess

Let’s talk about Chess some more. It is actually quite a mature game. The first professional supposedly was Abu-Bakr Muhammad ben Yahya as-Suli (854-946). He was the strongest player of his time, and author of a book describing a systematic way of playing Shatranj. Apparently, in the Arab World, one can still occasionally hear the complement “he played like Yahya as-Suli”. But he played one of chess’s predecessors.

Modern chess has been around for more than half a millenium, and people are quite surprised to learn that some of the standard manoeuvres have been around that long.
For example, every chess player knows the Lucena position shown below

which is attributed to Luis Ramirez de Lucena, who wrote a chess book published in 1497 (you read correctly: 1497!).

There are a great many chess books around. Google books answers intitle:chess with more than 200000 results. Only a minority of these books concern beginners as most of these target club players and above. It’s also funny to note that such a limited game is rich enough for lot of new books to be published every year.

About Chess Players

One of the interesting things about chess is that every player has an elo rating. This allows you to see exactly how good a player is, and it allows you to calculate the odds of a player winning a game against another player. Fide keeps records of the ratings, and has an online database you can browse. For example, let’s take two of the best players in the world, and compare them.Top dog Magnus Carlsen currently has a rating of 2835. If he’d play the number one US player Hikaru Nakamura, who currently has a rating of 2759, he will score about 0.60, or score about 6 points in a 10 game match.

import math
def expected(r_a,r_b):
    d = r_b - r_a
    e = d / 400.0
    n = 1 + math.pow(10,e)
    return 1.0/n

print expected(2835, 2759) 

A person’s chess rating evolves in time. If one scores better than one’s rating predicts, the rating will go up, and vice versa. Improving your chess skill equals an elo gain. For example, Anish Giri’s evolution can be found here. In fact, elo is so prevalent, that people use it to refer to an opponent: it is even more important than the opponents name. If a chess player wants to show his game to a collegue, he will say something like:”I’ve beaten a 1900 yesterday. Let me show you how it went”.

Wanna Play?

As it turns out, chess is too hard a problem for most people to master. A master (see International Master) has a rating somewhere above 2400. It is a level only a few people ever achieve (about 1 or 2 percent of chess enthusiasts world wide).

Personally, I can testify it is very difficult: I started playing at the age of 17, invested countless hours, read (at least bought) more than one hundred books, and I only reached a level of 2000. There are some defects in my game I just cannot seem to eliminate, so I must limit my ambition. Talking to people about the reasons for this, they say the main reason is that I started too late. Anyway, if I put in some more effort, maybe I can reach 2100. Maybe.

So You Think You Can Code?

Ok, let’s go back to programming. We asserted before that it is at least as difficult as chess, and then learned chess is actually very difficult. So programming is really really hard. A way to improve is by doing it, analysing where you went wrong, and iterate, hoping you improve while doing this.

Just like chess, reading a book about it will not make you a better programmer. If you’re lucky, and read a good book on the topic (be it chess or programming), you will get some insights in what you are doing wrong. But these insights themselves will not magically improve you. So you need to put in hard effort. Some of the things I learned that will help you do that effort are:

  • Tackle a new type of problem (a chess engine, a rendering engine, a distributed key value store, a compiler, …)
  • Learn a new programming language
  • Learn a new programming paradigm(FP,OO,Declarative)
  • Learn a new platform(GPU,FPGA,…)
  • Learn from a better programmer

Master Coders, Anyone?

Whatever the activity, some humans will be better than others. As with chess, in programming, there will exist some people that have reached ‘mastery’. But, I’m a sceptic. I haven’t seen any. Moreover, as long as we don’t have anything in place like programmer elo I can claim they haven’t been observed. Actually, programmer elo can function as a discriminator for authors too, just like it does for chess books (chess book written by a non-master are rare, and being a master is not a guarantee you’ll write a good book either, but that’s an entirely different discussion)
I’m quite annoyed by the books that come out telling me how to code or to organize the software development process, from people who are clueless. Long gone are the days I used to attach value to a thing merely because it was written down in a book. In fact, the more somebody tells me how to do it, the more suspicious I become. This is a direct result from my interaction with chess masters. Sometimes I ask advice from a chess master on a problem I faced, and most of the time they answer in a sort of fuzzy wordings like “I think you probably should try to arrange your pieces something like this …”. The same position, when shown to a lower rated player often leads to an explanation like this: “you need to first take here and then there, and push that pawn”. How can it be that the same position is unclear to a master, while it is clear to a patzer ? I think it comes from respect for the complexity of game. If you see more, you fear more.

Closing Words

Basically, I think two things are worth remembering from this post. First, programming is really really hard. Second, it would be beneficial if there would be a programmer elo. It would at least prevent some authors of gathering enough courage to publish a book on programming. It would certainly allow better hiring decisions, and maybe could help you earn the money you deserve for your efforts. Or are you overpaid?

have fun,


Post Scriptum

Some people remarked something like programmer elo couldn’t be done because programming isn’t a head-to-head competition. As it turns out, this is not necessary. Sites like show that you can also compete against puzzles. What happens there is that a competitor gets a puzzle. If he solves it, he collects points from the puzzle and the puzzle loses rating points. If the user fails, the puzzle collects points from the user. This way not only the users get sorted, but also the puzzles, which allows the site to always present users puzzles that more or less fit their level. The same concept can be used for any kind of test/examination. The good thing is that people get a tool to calculate their level and can work to improve it. Exactly the kind of feedback loop one needs.