Arakoon 1.6.0 is out

H: One voter, 16,472 votes — a slight anomaly…?

E: Not really, Mr. Hanna. You see, Baldrick may look like a monkey who’s
been put in a suit and then strategically shaved, but he is a brilliant
politician. The number of votes I cast is simply a reflection of how
firmly I believe in his policies.

Blackadder the Third, Episode 1

Introduction

So, Arakoon 1.6.0 is out, a release so big we just had to skip a minor version! This blog post will walk you through the various changes between the 1.4 series and the 1.6 series. We’ll explain why it’s faster, and give an indication of by how much.

Harvesting the client queue

Arakoon is a multi-Paxos implementation, which means that Arakoon nodes try to reach consensus on the next transaction that needs to be applied.
The thing they reach consensus on is called a paxos value. In all Arakoons (≤ 1.4) there was a 1:1 mapping between a client update and a paxos value. This works fine on a LAN, but when the latency between nodes gets a lot higher, as in a WAN setup, performance becomes abysmally slow. Some of our users run Arakoon in a multi-datacenter setup, so the solution chosen to address this was an n:1 mapping between client updates and paxos values. When the time has come to decide on the next paxos value, Arakoon 1.6 takes all of the queued client updates (up to a limit) and stuffs them into that single paxos value. For multiple clients and small updates this means an almost linear speedup in a WAN setting.
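
To make this concrete, here is a minimal sketch of such harvesting (my sketch, not Arakoon’s actual code; the update type, the queue, and the limit are assumptions):

(* Sketch: batch up to [limit] queued client updates into one paxos
   value, so that a single consensus round covers them all. *)
type update = Set of string * string | Delete of string
type paxos_value = update list

let harvest (q : update Queue.t) (limit : int) : paxos_value =
  let rec take acc n =
    if n = limit || Queue.is_empty q
    then List.rev acc
    else take (Queue.pop q :: acc) (n + 1)
  in
  take [] 0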

Batched local store

Ops: We have seen a corrupt Tokyo Cabinet, and we can reproduce it!
Dev: Great! What’s the scenario?
Ops: It’s really easy. Just apply the following 218 million transactions to an empty store.
Dev: Err, euhm, thx ?!

When you want to fix a problem, a 100% reproducible scenario is a blessing, but not if reproducing it takes forever. Fortunately, during the root cause analysis of such a problem you have time to think, and thus the insight came: we can speed this up a lot! Having to maintain a transaction log is a mixed blessing, but one of its advantages is that you don’t have to apply all updates to the database immediately. So we added an in-memory caching layer that sits in front of Tokyo Cabinet. All updates happen there, and from time to time we push them through in one big batch. We can afford to do this as we already ensured durability via the transaction logs.
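
The shape of such a layer, as a rough sketch (module and function names are mine, not Arakoon’s, and only the write path is shown; the real store interface is richer):

(* Sketch of a write-back cache in front of a slow store: updates land
   in a hashtable and are pushed through in one batch now and then.
   Durability is already taken care of by the transaction logs. *)
module type SLOW_STORE = sig
  type t
  val set_batch : t -> (string * string) list -> unit
end

module Cached (S : SLOW_STORE) = struct
  type t = { store : S.t;
             cache : (string, string) Hashtbl.t;
             mutable dirty : int;
             threshold : int; }

  let make store threshold =
    { store; cache = Hashtbl.create 1024; dirty = 0; threshold }

  let flush t =
    let kvs = Hashtbl.fold (fun k v acc -> (k, v) :: acc) t.cache [] in
    S.set_batch t.store kvs;
    Hashtbl.reset t.cache;
    t.dirty <- 0

  let set t k v =
    Hashtbl.replace t.cache k v;
    t.dirty <- t.dirty + 1;
    if t.dirty >= t.threshold then flush t
end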

Logging improvements

Since the early days, Arakoon has had a crash logger. When an Arakoon node dies unexpectedly, it dumps its last 1000 or so log statements into a separate file. This is a really useful feature that has allowed us to determine the underlying cause of death on many occasions. Unfortunately, it was also a very expensive feature: we were always generating debug-level log statements, even if the configuration said ‘info’. Logging is expensive even if you throw the log statements away; there is a big cost in generating all these strings via fancy Printf templates. So we decided to use a syntax extension that allows us to create these strings more lazily.

So now we have the best of both worlds: when all goes well, we don’t need the debug statements and we never pay the cost of constructing the strings, but when things turn sour, we can still go back and create our crash log.
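
The mechanics, roughly (a sketch of the idea only; the real thing is a syntax extension, and the details of Arakoon’s logger differ):

(* Sketch: a log call takes a thunk instead of a string. The thunk is
   kept in a bounded ring of recent statements, but the string is only
   built when the statement passes the log level, or when a crash log
   has to be dumped. *)
type level = Debug | Info

let threshold = Info
let ring_size = 1000
let ring : (unit -> string) Queue.t = Queue.create ()

let log level (msg : unit -> string) =
  if Queue.length ring >= ring_size then ignore (Queue.pop ring);
  Queue.push msg ring;                 (* cheap: no string is built *)
  if level >= threshold then print_endline (msg ())

let dump_crash_log oc =
  Queue.iter (fun msg -> output_string oc (msg () ^ "\n")) ring

(* usage: the Printf.sprintf only runs on demand *)
let () = log Debug (fun () -> Printf.sprintf "next i = %i" 42)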

Numbers please

We’re not going to show full-fledged benchmarks, just a simple micro-benchmark to show a trend. The experiment is to let one or more clients each do a million sets to an Arakoon cluster and see how long that takes. All experiments were performed on a simple Intel Core i7 with an HDD.

1e6 sets/client     Arakoon 1.4.3    Arakoon 1.6.0
1 client            294s             196s
4 // clients        900s             625s
8 // clients        2180s            1150s

How to read the table? One client does 1e6 sets on Arakoon 1.6.0 and finishes after 196s; 8 clients each do 1e6 sets in parallel on Arakoon 1.6.0 and the last client finishes after 1150s. That looks like a rather nice improvement.

The next table is for a cluster of 3 nodes, all running on the same machine and thus sharing the same HDD.

1e6 sets/client     Arakoon 1.4.3    Arakoon 1.6.0
1 client            1090s            676s
4 // clients        4005s            2470s
8 // clients        8734s            4700s

The last table shows the effect of latency. We add 50ms of latency to everything that goes over the loopback interface (via netem). Again, this is a setup with 3 nodes, all on the same host, sharing an HDD (although that won’t matter much here).

1e5 sets/client     Arakoon 1.4.3    Arakoon 1.6.0
8 // clients        86400s           20600s

Note that this benchmark only goes to 1e5 sets per client (otherwise it would take forever). You can immediately see the dramatic effect of latency on a distributed system. Arakoon 1.6.0 does a better job at fighting latency than Arakoon 1.4.3, and the difference in performance increases with the number of clients.

Other notable improvements

Tokyo Cabinet patched

The whole data-corruption anecdote eventually led to a 32-bit overflow issue in Tokyo Cabinet. With the help of some friends, we fixed it and made our changes available in this Tokyo Cabinet git repo. We really don’t want to expose anyone to database corruption, so we decided to backport this fix to the 1.4 series too.

Log sections

We started to use log sections, which allow you to configure a node in such a way that, for example, the messaging layer logs at debug level while everything else stays at info level. This lets you focus on whatever you’re examining without having to wade through zillions of log messages. Handy for ops and devs alike.

Client API additions

We added two small but useful methods to the client API. First, there’s AssertExists of key, which allows you to short-circuit a sequence if a key is not there. Second, there’s multi_get_option: key list -> value option list Lwt.t, which allows you to fetch a series of values even if you’re not sure all of them exist in the database.
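
In client code that could look something like this (a hypothetical sketch: only the multi_get_option signature above is given; the client value and its method-call syntax are assumptions):

(* Hypothetical use of multi_get_option; [client] is assumed to be a
   connected Arakoon client exposing the signature quoted above. *)
let (>>=) = Lwt.bind

let dump_values client =
  client # multi_get_option ["a"; "b"; "maybe_missing"] >>= fun vos ->
  Lwt_list.iter_s
    (function
      | Some v -> Lwt_io.printl v
      | None   -> Lwt_io.printl "<missing>")
    vos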

Always sync

Some of our users have fancy SSDs for the tlogs and wanted to be able to fsync after every write of a tlog entry. We accommodated them.

Full details

You can find the full list of improvements for Arakoon 1.6 in our bug tracker.

Conclusion

It’s stable, way faster than it used to be, and it’s free.
You can get it from our GitHub repo.
What are you waiting for?


Rediscovering the RSync Algorithm

A: Ok, you’re synchronizing this over the web;
and what do you use for the synchronization?

B: Oh, we implemented the rsync algorithm.
A: uhu. And what do you do with really big files?
B: The same.
A: And you also synchronise folders?
B: Yes.
A: And how do you do that?
B: We iterate over the folder, using the algorithm on every file, recursing over subfolders.
A: Can you try 2 things for me? First, a very large file; and second, a large codebase, and see if it holds.

Introduction

First of all, I am an admirer of the (original) rsync algorithm. (I think) it was a first stab at file synchronization, and it was so good people still implement it today.
But if you don’t understand its strengths and weaknesses, you might be in for a disappointment.

The Problem

You have 2 files, A’ and A, that are different but alike. They reside on 2 different locations connected through a network. You want to make A’ identical to A.

The simplest solution is to just copy A, but given the similarity between the two files, there has to be a better way.

Historical Interlude

Networks used to be notoriously bad in the early 90s. Everybody who was transferring archives over large distances instinctively knew about a critical download size.
If the archive was too large, it would take too long, yielding a 100% chance that something would go wrong somewhere, resulting in an interrupted download. Even if the (up- or) download succeeded, chances were a small part of the file got corrupted and you had to start over. The first two alleviations of this problem were checksums to detect accidental corruption, and resumption (being able to restart a download at a certain offset).

RSync took care of interrupted downloads, and also provided a better solution when your file was corrupt. On top of that, it allowed low cost propagation of small changes, opening up a whole new range of applications. System administrators had a shiny new hammer.

The RSync Strategy

RSync does just a single round trip. First it creates a signature of A’ and sends it over. At the other location, it scans the local file, tries to find parts that are in the signature, and meanwhile constructs a recipe as a stream of instructions. It’s possible to derive the algorithm starting from a primitive version and improving it step by step.
Since that’s fun too, I’ll be doing that here. While we’re playing, we’ll abstract away from IO, because it clouds the algorithmic view.

Mark 0

Let’s attack this in pure caveman style. Making a signature means splitting the file into blocks of equal size (except maybe the last one). Iterating over the blocks, you calculate a digest for each and accumulate digests and block identifiers. Block identifiers are just their number: the first block has id 0, the second id 1, and so on.

let file_signature f b_size = 
  let f_len = String.length f in
  let rec loop acc s i =
    if s = f_len 
    then acc
    else 
      let b_len = min b_size (f_len - s) in
      let b = String.sub f s b_len in
      let b_sig = block_signature b in
      let acc' = (b_sig,i) :: acc in
      loop acc' (s+b_len) (i+1)
  in
  loop [] 0 0

We have lots of choices for calculating a block signature. Let’s be lazy and pick Digest.string, which is the md5 checksum of the block.

let block_signature block = Digest.string block

To recreate the file, you need to interpret the stream of instructions. But what would these instructions be?
Well, in this version, you can be told to copy over a block or write a char.

type instruction = 
  | C  of char
  | B  of int

Ok, how do you combine the signature with the new file to generate a stream of instructions?
The first thing that comes to mind is to scan over the new file, starting at position s:

  • consider the block starting at s and try to find it in the signature.
  • if you find it, add a B j instruction, and jump a block forward.
  • if you miss it, add a C c instruction, and step forward 1 position.

Let’s do that:

let updates f0_sig b_size f1 = 
  let f1_len = String.length f1 in
  let rec loop acc s = 
    if s = f1_len 
    then List.rev acc
    else
      let b_len = min b_size (f1_len - s) in
      let block = String.sub f1 s b_len in
      let u,step = 
        try 
          let d = block_signature block in
          let i = List.assoc d f0_sig in 
          (B i), b_len
        with Not_found -> (C block.[0]), 1
      in
      let acc' = u :: acc in
      loop acc' (s + step)
  in
  loop [] 0

The last step in our synchronisation scheme is to create a file using the old file together with the stream of instructions.

let apply old b_size ins =
  let old_len = String.length old in
  let buf = Buffer.create old_len in
  let add_block i = 
    let s = i * b_size in
    let b_len = min b_size (old_len - s) in
    let block = String.sub old s b_len in
    Buffer.add_string buf block
  in
  let add_char c = Buffer.add_char buf c in
  let () = List.iter (function 
    | B i -> add_block i
    | C c -> add_char c) 
    ins
  in
  Buffer.contents buf

So it took 60 lines of code to get a first stab at a synchronisation algorithm.
Let’s try it on an example:

let bs = 5
let remote = "aaaaabbbbbcccccdddddeeeeefffffggggghhhhhiiiiijjjjjkkk"
let mine = "aaaaabXbbbcccccddddde012"
let mine_s = file_signature mine bs

let delta = updates mine_s bs remote
let mine2 = apply mine bs delta;;


val bs : int = 5
val remote : string = "aaaaabbbbbcccccdddddeeeeefffffggggghhhhhiiiiijjjjjkkk"
val mine : string = "aaaaabXbbbcccccddddde012"

val mine_s : (Digest.t * int) list =
[("$\240\t\221\19200\222\199\2035\190|\222~#\n", 4);
("P\248M\175:m\253j\159 \201\248\239B\137B", 3);
("g\199b'k\206\208\158\228\22314\2137\209d\234", 2);
("\196\148\"\21926Lm\179V E=\201O\183,", 1);
("YO\128;8\nA9n\214=\2029P5B", 0)]

val delta : instruction list =
[B 0; C 'b'; C 'b'; C 'b'; C 'b'; C 'b'; B 2; B 3; C 'e'; C 'e'; C 'e';
C 'e'; C 'e'; C 'f'; C 'f'; C 'f'; C 'f'; C 'f'; C 'g'; C 'g'; C 'g';
C 'g'; C 'g'; C 'h'; C 'h'; C 'h'; C 'h'; C 'h'; C 'i'; C 'i'; C 'i';
C 'i'; C 'i'; C 'j'; C 'j'; C 'j'; C 'j'; C 'j'; C 'k'; C 'k'; C 'k']
val mine2 : string = "aaaaabbbbbcccccdddddeeeeefffffggggghhhhhiiiiijjjjjkkk"

Ok, it works, but how good is it?
Before I can answer that, first a note on block size. There are some ‘forces’ to be reckoned with:

  • the block size needs to be big compared to the block signature.
  • if the block size is big, you will find fewer matches between the signature and the new file, so you need to send more data back.
  • if the block size is small, you have lots of blocks, meaning your signature will be bigger
    and you need an efficient lookup.

The best block size will depend not only on the file size, but also on the actual changes.
How important is it really?
Well, let’s sync 2 images:

These 2 images are bitmaps of 5.5MB each (stored as .bmp).
(I actually uploaded smaller versions, as it seems useless to let your browser download more than 10MB for 2 silly image samples.)
I’ll sync’em using different block sizes and see what gives.

let main () =
  let bs = int_of_string (Sys.argv.(1)) in
  let () = Printf.printf "bs=%i\n" bs in
  (* get_data reads a whole file into a string *)
  let remote = get_data "new_image.bmp" in
  let () = Printf.printf "remote: size=%i%!\n" (String.length remote) in
  let mine = get_data "old_image.bmp" in
  let mine_s = X.file_signature mine bs in
  let () = Printf.printf "mine_s: len=%i%!\n" (Array.length mine_s) in
  let delta = X.updates mine_s bs remote in
  let (nbs,ncs) = List.fold_left (fun (nb,nc) i ->
    match i with
      | X.B _ -> (nb+1,nc)
      | X.C _ -> (nb,nc+1)
  ) (0,0) delta
  in
  let mine2 = X.apply mine bs delta in
  let () = Printf.printf "mine2: size=%i%!\n" (String.length mine2) in
  Printf.printf "bs=%i;cs=%i\n" nbs ncs

block size    # block signatures    blocks    chars      time
8192          704                   535       1384448    71s
4096          1407                  1084      1323008    92s
2048          2813                  2344      960512     92s
1024          5626                  4995      646144     115s
512           11251                 10309     482304     172s
256           22501                 20972     391424     283s
128           45001                 42508     319104     537s

The first column is the block size. The second is the number of blocks in the file, the third is the number of B instructions, and the fourth is the number of C instructions.
The last column is the measured execution time on my laptop. Processing time is the biggest issue: OCaml is fast, but not fast enough to compensate for my brutal inefficiency. Imagine what it would do to a 4GB movie.

Mark 1

The problem is quickly revealed: List.assoc is not suited for long lists.
A better solution is to use an array, sort it on block signature, and use binary search
(a hashtable would be viable too).

let block_signature block = Digest.string block

let file_signature f b_size = 
  let f_len = String.length f in
  let n_blocks = (f_len + b_size - 1) / b_size in
  Array.init n_blocks 
    (fun i -> 
      (* block i starts at i * b_size; the last block may be shorter *)
      let start = i * b_size in
      let b_len = min b_size (f_len - start) in
      let b = String.sub f start b_len in
      let b_sig = block_signature b in
      (b_sig,i)
    ) 

type instruction = 
  | C  of char
  | B  of int

let updates f0_sig b_size f1 = 
  let my_cmp (s0,_) (s1,_) = String.compare s0 s1 in
  let () = Array.sort my_cmp f0_sig in
  let len = Array.length f0_sig in
  let lookup b = 
    (* binary search in the sorted signature array *)
    let rec find min max = 
      if min > max then raise Not_found
      else
        let mid = (min + max) / 2 in
        let (ms,mi) = f0_sig.(mid) in
        if ms = b 
        then mi
        else if b > ms then find (mid+1) max
        else find min (mid -1)
    in
    find 0 (len -1)
  in
  let f1_len = String.length f1 in
  let rec loop acc s = 
    if s = f1_len 
    then List.rev acc
    else
      let b_len = min b_size (f1_len - s) in
      let block = String.sub f1 s b_len in
      let u,step = 
        try 
          let d = block_signature block in
          let i = lookup d in 
          (B i), b_len
        with Not_found -> (C block.[0]), 1
      in
      let acc' = u :: acc in
      loop acc' (s + step)
  in
  loop [] 0

let apply old b_size ins =
  let old_len = String.length old in
  let buf = Buffer.create old_len in
  let add_block i = 
    let s = i * b_size in
    let b_len = min b_size (old_len - s) in
    let block = String.sub old s b_len in
    Buffer.add_string buf block
  in
  let add_char c = Buffer.add_char buf c in
  let () = List.iter (function 
    | B i -> add_block i
    | C c -> add_char c) 
    ins
  in
  Buffer.contents buf

block size    # block signatures    blocks    chars      time(s)
8192          704                   535       1384448    41
4096          1407                  1084      1323008    20
2048          2813                  2344      960512     7.8
1024          5626                  4995      646144     1.9
512           11251                 10309     482304     1.1
256           22501                 20972     391424     0.8
128           45001                 42508     319104     0.9

Wow, this is quite unexpected (but we’re not complaining). So what happened? Well, the lookup was so dominant that it was cloaking all other behaviour.
Now, with the lookup out of the way, other things can be observed. The problem now is that a ‘miss’ not only causes a C instruction to be emitted; more importantly, it causes another digest to be calculated. The fewer blocks are found, the higher the processing time.

Mark 2

The cost of a miss was addressed by Andrew Tridgell by introducing a second, weaker hash function.
He used the Adler-32 checksum, which is a rolling checksum: adler32(buffer[a+1:b+1]) can be computed cheaply from adler32(buffer[a:b]) and the bytes buffer[a] and buffer[b]. That makes it suitable for our problem here. The OCaml standard library does not have an adler32 module, so I hacked one up.
It’s a naive implementation in that it follows the definition closely. In fact, most of the modulo operations can be avoided by doing some extra bookkeeping.
I didn’t bother.

module Adler32 = struct
  type t = {mutable a: int; mutable b: int; mutable c: int}
      
  let padler = 65521
    
  let make () = {a = 1 ;b = 0; c = 0}
    
  let from buf offset length = 
    
    let too_far = offset + length in
    let rec loop a b i= 
      if i = too_far 
      then {a; b; c = length}
      else (* modulo can be lifted with a bit of math *)
        let a' = (a + Char.code(buf.[i])) mod padler in
        let b' = (b + a') mod padler in
        loop a' b' (i+1)
    in
    loop 1 0 offset
    
  let reset t = t.a <- 1;t.b <- 0; t.c <- 0
    
  let digest t = (t.b lsl 16) lor t.a 
    
  let out t c1 = 
    let x1 = Char.code c1 in
    t.a <- (t.a - x1) mod padler;
    t.b <- (t.b - t.c  * x1 -1) mod padler;
    t.c <- t.c - 1

  let rotate t c1 cn = 
    let up x = if x >= 0 then x else x + padler in
    let x1 = Char.code c1 in
    let xn = Char.code cn in
    t.a <- up ((t.a - x1 + xn) mod padler);
    t.b <- let f = (t.c * x1) mod padler in
           let r = (t.b - f + t.a -1) mod padler in 
           up r
             
  let update t buf offset length = 
    let too_far = offset + length in 
    let rec loop i = 
      if i = too_far then () 
      else
        let x = Char.code buf.[i] in
        let () = t.a <- (t.a + x) mod padler  in
        let () = t.b <- (t.b + t.a) mod padler in
        let () = t.c <- t.c + 1 in
        loop (i +1)
    in
    loop offset
      
  let string block = 
    let t = from block 0 (String.length block) in
    digest t
end

Adler32 is much weaker than md5, and using it exposes you to collisions. So in fact it acts as a cheap test that might yield false positives. That’s the reason the rsync algorithm keeps both: if the adler32 of the buffer is found in the signature, a second check verifies that the md5s match too. The fact that you sometimes need to rotate the checksum and at other times need to reinitialize it from a part of the buffer complicates the calculation of the updates a bit.

The code is shown below.

let updates f0_sig b_size f1 = 
  let my_cmp (wh0,_,_) (wh1,_,_) = compare wh0 wh1 in
  let () = Array.sort my_cmp f0_sig in
  let len = Array.length f0_sig in
  let lookup w = 
    (* binary search on the weak hash *)
    let rec find min max = 
      if min > max then raise Not_found
      else
        let mid = (min + max) / 2 in
        let (mw, ms,mi) = f0_sig.(mid) in
        if mw = w
        then (ms,mi)
        else if w > mw then find (mid+1) max
        else find min (mid -1)
    in
    find 0 (len -1)
  in
  let f1_len = String.length f1 in
  let weak = Adler32.from f1 0 (min b_size f1_len) in
  let rec loop acc b e = 
    if b = f1_len 
    then List.rev acc
    else
      let wh = Adler32.digest weak in
      let step_1 () = 
        let bc = f1.[b] in
        let a = C bc in
        let b' = b + 1 in
        if e +1 < f1_len 
        then 
          let e' = e + 1 in
          let ec = f1.[e] in
          let () = Adler32.rotate weak bc ec in
          loop (a:: acc) b' e'
        else
          let e' = e in
          let () = Adler32.out weak bc in
          loop (a:: acc) b' e'
      in
      try
        let (s0,i0) = lookup wh in
        let sh = Digest.substring f1 b (e - b) in
        if s0 = sh 
        then
          let b' = e in
          let e' = min f1_len (e + b_size) in
          let a = B i0 in
          let () = Adler32.reset weak in
          let () = Adler32.update weak f1 b' (e' - b') in
          loop (a :: acc) b' e'
        else step_1 ()
      with Not_found -> step_1 ()
  in
  loop [] 0 (min b_size f1_len)

The code is indeed a bit messier, as we have more things to keep track of at the same time, but it’s still quite small. Let’s see how well it performs:

block size    # block signatures    blocks    chars      time(s)
8192          704                   535       1384448    0.9
4096          1407                  1084      1332008    0.9
2048          2813                  2344      960512     0.8
1024          5626                  4995      646114     0.6
512           11251                 10309     482304     0.6
256           22501                 20401     537600     0.7
128           45001                 42508     319104     0.7

This almost completely removes the cost of a miss, at least for things of this size. The choice of block size does, however, affect the amount of data you need to send over.
There are lots of other things you can do here:

  • Block Size Heuristic: something like O(sqrt(file_size))
  • SuperInstructions: make instructions for consecutive Blocks, and consecutive Chars (see the sketch after this list)
  • Emit function: don’t accumulate the updates, but emit them (using a callback function)
  • Smaller signatures: you can consider dropping some bytes from the strong hash
  • IO
  • Compress update stream
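
To give an idea of the super-instruction variant (my sketch, not part of the algorithm above): runs of C instructions collapse into a single string, and runs of consecutive B instructions into one (block, count) pair, which shrinks the instruction stream considerably.

(* Sketch: compress an instruction list into 'super instructions'.
   [SC s] carries a whole run of fresh chars;
   [SB (i, n)] means 'copy n consecutive blocks starting at block i'. *)
type super_instruction =
  | SC of string
  | SB of int * int

let compress (ins : instruction list) : super_instruction list =
  let buf = Buffer.create 16 in
  let flush acc =
    if Buffer.length buf = 0 then acc
    else begin
      let s = SC (Buffer.contents buf) in
      Buffer.clear buf;
      s :: acc
    end
  in
  let rec loop acc = function
    | [] -> List.rev (flush acc)
    | C c :: rest -> Buffer.add_char buf c; loop acc rest
    | B i :: rest ->
      begin match flush acc with
        | SB (i0, n) :: tl when i0 + n = i -> loop (SB (i0, n + 1) :: tl) rest
        | acc' -> loop (SB (i, 1) :: acc') rest
      end
  in
  loop [] ins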

The last remaining problem is that some modifications are not handled well by the algorithm (for example, 1 byte changed in each block).
Maybe you could try a better algorithm.
There are lots of them out there, and they are worth checking out. (Before you know it, you’ll be dabbling in Merkle trees or set reconciliation.)

Anyway, I have already exceeded my quota for this post, but I still want to say a little thing about the synchronisation of folders.

The Next Problem

You have 2 trees of files, A’ and A, that are different but alike. They reside on 2 different locations connected through a network. You want to make A’ identical to A.

What Not To Do

Don’t walk the folder and ‘rsync’ each file you encounter.
A small calculation will show you how bad that really is.
Suppose you have 20000 files, each 1KB, and suppose one rsync costs you about 0.1s
(reading the file, sending over the signature, building the stream of updates, applying them).
That costs you about 2000s, or more than half an hour. System administrators know better;
they would not hesitate: “tar the tree, sync the tars, and untar the synced tar”.
Even if each of those actions takes 5s (an overestimate), you’re still synced in 15s.
Or maybe you can try unison. It’s OCaml too, you know.

have fun,

Romain.


Share your mistakes: adventures in optimization

I used to think I knew the laws of code optimization. In my (not so) humble opinion they were:

  1. profile before you optimize
  2. after profiling tells you what the problem is, first try a better strategy (algorithm or data structure)
  3. tweak code as a last resort

It’s pure economic reasoning that’s behind this: if your code is not fast enough, first find the biggest culprit and eliminate it. Taking out the biggest culprit gets you the most value for your money, and a better strategy can yield orders of magnitude, while tweaking code or moving to a lower-level programming language can only give you a constant factor of improvement. So if you have the choice, use the bigger gun.

Suppose, as an example, profiling reveals your cost can be written like this:

Cost = 0.8 * BiggestCulprit + 0.2 * EverythingElse

You know what to do: kill the BiggestCulprit. By the way, Pareto tells you it’s commonly something like that (80-20). Ok, suppose that using a big hammer you replaced the BiggestCulprit with something that’s 100 times cheaper:

Cost2 = 0.8 * 0.01 * BiggestCulprit + 0.2 * EverythingElse = 0.208 * Cost

If you need to go even faster, you should optimize EverythingElse. Can you do this? That depends. Maybe you can split EverythingElse into

EverythingElse = 0.8 * TheNextHurdle + 0.2 * NotInteresting

If you can’t, it ends here.

The strategy is rational, but sometimes profiling points you in the wrong direction.

An example of a mistake I made

What follows below is an account of what happened to a piece of code over a period of two years. I hope that, reading on, you will conclude the mistakes were obvious; at the time, they weren’t. Hindsight is 20/20.

The problem

As a small step in solving a bigger problem, we needed to generate a sample of size n from a population of size p. Important detail: no value can be selected more than once.
The population size p is roughly somewhere between 4000 and 16000, while the sample size is often very small, sometimes more than 2000, but never more than 2500 (we know its distribution).
Let’s look at the problem in isolation. The code shown below is a micro-benchmark that is representative of our case; we’re interested in minimizing the total running time by improving the implementation of the Sampler module.


let micro_bench ns k p  = 

  let test_one n =
    let sample = Sampler.create n p in
    let use_choice _ = () in
    let rec loop k = 
      if k = 0 
      then ()
      else 
        begin
          if k mod 1000 = 0 then Printf.eprintf ".%!";
          let () = Sampler.fill_sample sample n p in
          let () = Sampler.iter sample use_choice in
          let () = Sampler.clear_sample sample in
          loop (k-1)
        end
    in
    let t0 = Unix.gettimeofday() in
    let () = loop k in
    let t1 = Unix.gettimeofday() in
    let d = t1 -. t0 in
    Printf.printf "%i | %f \n" n d
  in
  List.iter test_one ns;;


let main () =  
  let k = 100 * 1000 in
  let p = 10000 in
  micro_bench [1;2;3;4;5;6;7;8;9;10;20;40;80;160;320;640;1000;2000;2500] k p;;

let () = main ();;

Our solution must adhere to the following signature:

module type S = sig
  type t
  val create : int -> int -> t
  val fill_sample: t -> int -> int -> unit
  val clear_sample: t -> unit
  val iter: t -> (int -> unit) -> unit
end

The first solution, the one I coded in search of correctness and simplicity, was exactly that, simple and correct (random_range p returns a uniform random number in [0, p), think Random.int):

module S0 = (struct 
    type t = {mutable c: int; es:int array}
    
    let create n p = {c = 0; es = Array.make n 0}

    let in_sample t x = 
      let rec loop i = 
        if i < 0 then false
        else
          if t.es.(i) = x 
          then true
          else loop (i-1)
      in 
      loop (t.c -1)

    let add2_sample t x = 
      let i = t.c in
      t.es.(i) <- x;
      t.c <- i+1        

    let fill_sample sample n p = 
      let rec loop i = 
        if i = 0 
        then ()
        else
          let rec find_new () = 
            let x = random_range p in
            if in_sample sample x 
            then find_new()
            else add2_sample sample x
          in
          let () = find_new () in
          loop (i-1)
      in
      loop n

    let clear_sample t = t.c <- 0
  
    let iter t f = 
      let rec loop i =
        if i = t.c 
        then ()
        else 
          let () = f t.es.(i) in
          loop (i+1)
      in
      loop 0
end : S)

The sample is accumulated in an array, and we test each candidate to see if we have it already. If so, we try again. Clearing the sample is putting the counter to zero, and iteration is merely iterating over the used part of the array. Simple enough, and it sufficed for almost 6 months. A run of the micro-benchmark (it takes some 1700 seconds) reveals what’s going on:

1 | 0.017802 
2 | 0.017753 
3 | 0.025648 
4 | 0.033298 
5 | 0.040910 
6 | 0.050635 
7 | 0.056496 
8 | 0.065127 
9 | 0.073126 
10 | 0.081244 
20 | 0.170436 
40 | 0.402476 
80 | 1.060872 
160 | 3.131289 
320 | 10.381503 
640 | 36.543450 
1000 | 85.969717 
2000 | 408.716565 
2500 | 1127.268196 

The first column is the sample size, the second is the time needed for 100000 samples. As you can see, it’s really fast for small sample sizes, but quickly succumbs. Profiling shows it’s the in_sample function that’s to blame: it must scan the entire sample accumulated so far, and it gets even worse as the chance of picking an element that was chosen before increases.

Well, it isn’t that difficult to build a better membership test. The population size isn’t that big, so we can afford a BitSet. Adding a member in O(1), membership testing in O(1)… let’s do it, it should fly.

module S1 = (struct
  type t = bool array
  let create n p = Array.make p false
  let in_sample t x = t.(x) 

  let add2_sample t x = t.(x) <- true

  let clear_sample t = 
    let rec loop i = 
      if i < 0 then ()
      else
        let () = t.(i) <- false in
        loop (i-1) 
    in
    loop (Array.length t -1)

  let fill_sample sample n p = 
      let rec loop i = 
        if i = 0 
        then ()
        else
          let rec find_new () = 
            let x = random_range p in
            if in_sample sample x 
            then find_new()
            else add2_sample sample x
          in
          let () = find_new () in
          loop (i-1)
      in
      loop n

  let iter t f =
    let s = Array.length t in
    let rec loop i = 
      if i = s then ()
      else
        let () = if t.(i) then f i in
        loop (i+1)
    in
    loop 0

end : S)

Let’s see what this does.

1 | 3.760345 
2 | 3.716187 
3 | 3.730672 
4 | 3.795472 
5 | 3.799934 
6 | 3.961258 
7 | 3.804574 
8 | 3.775391 
9 | 3.807858 
10 | 3.914987 
20 | 3.949764 
40 | 4.159262 
80 | 4.430131 
160 | 4.953897 
320 | 6.132642 
640 | 8.438193 
1000 | 11.140795 
2000 | 19.150232 
2500 | 23.508719 

It takes some 124 seconds to run. Overall, that’s more than 10 times faster, but the small samples are a lot slower. So what happened?
A closer look (with the profiler) revealed 2 things:

  1. Clearing the bitset is O(p)
  2. Iterating the bitset also is O(p)

So we tried to remedy this by using a better representation of the bitset: an array of 64-bit words. Clearing is a lot faster there,
and iteration will be faster too, as the bitset is expected to be sparse and one can skip ahead by inspecting the number of trailing zeroes.
We optimized the clearing of the bitset, and dabbled in De Bruijn sequences for super-fast iteration.
It’s a bit off topic, and maybe interesting enough for another post. The reason I’m not digressing here is that it was the wrong road to go down.

In the end, after a long detour, we settled on an entirely different strategy: Sparse Sets.

module S2 = (struct
  type t = { mutable n: int;
             mutable dense: int array;
             mutable sparse: int array;}

  let create n p = 
    { n = 0;
      dense = Array.make p 0;
      sparse = Array.make p 0;
    }

  (* invariant: dense.(0..n-1) lists the members of the sample;
     sparse.(x) records where x sits in dense (only meaningful if x
     is a member). This makes add, test and clear all O(1). *)
  let add2_sample t x = 
    let n = t.n in
    t.dense.(n) <- x;
    t.sparse.(x) <- n;
    t.n <- (n+1)

  let in_sample t x = 
    (* x is a member iff its slot in [dense] really holds x *)
    let rsi = t.sparse.(x) in
    let ok = rsi < t.n in
    ok && (t.dense.(rsi) = x)

  let iter t f =
    let n = t.n in
    let rec loop i =
      if i = n then ()
      else
        let x = t.dense.(i) in
        let () = f x in
        loop (i+1) 
    in
    loop 0

  let clear_sample t = t.n <- 0

  let fill_sample sample n p = 
    let rec loop i = 
      if i = 0 
      then ()
      else
        let rec find_new () = 
          let x = R.random_range p in
          if in_sample sample x 
          then find_new()
          else add2_sample sample x
        in
        let () = find_new () in
        loop (i-1)
    in
    loop n
    
end: S)

Let’s see what this does:

1 | 0.019265 
2 | 0.021046 
3 | 0.030151 
4 | 0.034281 
5 | 0.040782 
6 | 0.048158 
7 | 0.055332 
8 | 0.061747 
9 | 0.068712 
10 | 0.075462 
20 | 0.144088 
40 | 0.276297 
80 | 0.539943 
160 | 1.069994 
320 | 2.143328 
640 | 4.334955 
1000 | 6.893774 
2000 | 14.607145 
2500 | 18.819379 

It runs in under a minute, and all our operations (adding, testing, clearing, iterating) have the desired complexity.
Meanwhile, if I ever need to revisit this, I have some aces up my sleeve:

  1. There is a 1984 paper, “Faster Methods for Random Sampling” (Jeffrey Scott Vitter).
  2. I can always special-case: if the sample size is below a carefully chosen threshold, use S0; otherwise use something better suited to larger samples.
    This will give me the best of both worlds at the cost of some ugliness.

My mistake

Have you figured out what I did wrong strategically? In the above example, I made the same mistake several times: I allowed profiling to set the scope of my optimization efforts. Profiling is great for discovering bottlenecks and the possible gains of removing them, but it induces a kind of narrow-mindedness that limits the degree of success. Once you have discovered a bottleneck, you need to step back and also look at the context. The bigger the chunk you optimize, the higher the possible gains. In my concrete case, I should have been looking for a better sampling method.
Instead, I searched for a better set representation. The problem is that you tend to find what you’re looking for.
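
To make “a better sampling method” concrete (this is not what we ended up shipping, just an illustration): Robert Floyd’s sampling algorithm draws n distinct values from [0, p) with exactly n random numbers and no retry loop, so its cost does not degrade as the sample fills up.

(* Robert Floyd's algorithm: a sample of n distinct ints from [0, p),
   one RNG call per element, no retries. *)
let floyd_sample n p =
  let seen = Hashtbl.create n in
  let sample = Array.make n 0 in
  for i = 0 to n - 1 do
    let j = p - n + i in
    let x = Random.int (j + 1) in                  (* uniform in [0, j] *)
    let pick = if Hashtbl.mem seen x then j else x in
    Hashtbl.replace seen pick ();
    sample.(i) <- pick
  done;
  sample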

Armed with this new insight, I propose the following laws of optimization:

  1. profile before you optimize
  2. if you find a bottleneck, try to optimize the broad context CONTAINING the bottleneck.
  3. tweak code as a last resort

Confessions of a revisionist

I must confess that I moved the example out of its original context, which was a C codebase. I recreated the different versions we had of the C code in OCaml for your pleasure.
So yes, we made the common mistake of moving to a lower-level programming language too soon, naively thinking we had a good understanding of the problem we were trying to solve.
As a result, we wasted more time than we should have. Anyway, I hope I have compensated enough by writing freely about my mistakes, so you can avoid them.

have fun,

Romain.

PS

For those interested in the code itself: I have pushed it to a git repo: https://github.com/toolslive/Sampling