Number of trailing zeroes

Last time, I hinted about using De Bruijn sequences for speeding up iteration of bit sets. It’s an old trick that (I think) was discovered by chess programmers in the 1960s (I really need a reference for this).
It’s a classic that yields elegant and quintessential code, but really needs a sidebar explanation of what’s going on.

The problem

For a binary number, write a function that determines the number of trailing zeroes, ie how many times you can divide this number by 2 without remainder.
The trivial implementation is this:

let rec ntz n =
    let rec loop acc n =
      if n land 1 = 1
      then acc
      else loop (acc+1) (n lsr 1)
    loop 0 n

It’s surprising to learn it’s possible to find the answer without looping at all.

Least significant one

If you look at the problem, you quickly realize there is only one 1 that matters. The least significant one.
It can be isolated like this:

lso = n land ((lnot n) + 1)

or (shorter)

lso = n land (-n)

Let’s look at a few examples for byte size numbers

n bin(n) bin (lnot n) bin (-n) lso
2 0000 0010 1111 1101 1111 1110 0000 0010
7 0000 0111 1111 1000 1111 1001 0000 0001
40 0010 1000 1101 0111 1101 1000 0000 1000
48 0011 0000 1100 1111 1101 0000 0001 0000

Now comes a bit of a detour into graph theory.

De Bruijn Graphs

A detailed introduction is found here, but the important thing to remember is how you construct this graph. The example below is a (2,3) graph. You start with a vertex 000. Then, starting at a vertex, you slide in (shift left and append) either a 1, or a 0, and draw an edge to the result node. Iterate and stop when you have them all.

So after the first step, you have this:

after the next step, you have:

and the closed graph is:

De Bruijn sequences

A De Bruijn sequence is just a shorthand for an Hamiltionian path (a path that visits every vertex once) through a De Bruijn graph. An example of an Hamiltonian path for the full graph above is:

000 -> 001 -> 010 -> 101 -> 011 -> 111 -> 110 -> 100 -> 000

A compact representation for this path can be recorded by writing down the start point and the choices made at every point. The example path yields following sequence:

00010111

When sliding through the sequence, one bit at a time with a window of 3 bits, each of the 8 possible 3 digit sequences appears exactly once.

n index
000 0
001 1
010 2
011 4
100 7
101 3
110 6
111 5

The trick

The table above shows you at each node how many steps you’re from the source of the graph.
So if you multiply the sequence with the least significant one or our number, we can use the table to find out how many shifts we had.

for example: 00010111 * 0010000 = 00101110000. You keep the right most 8 bits 0111000.
By looking at the leftmost 3 (= 8-5) bits 011, which you find by shifting 5 positions to the right,
you see that they are at index 4 in the table. So the number has 4 trailing zeroes.

Summary

To find the number of trailing zeroes for an 8 bit number, you need

  1. an 8-bit De Bruijn sequence: debruijn = 00010111
  2. a table of 8 precomputed entries 0,1,2,4,7,3,6,5
  3. the following formula:
    lso = n & (-n)
    ntz = table[(lso * debruijn) >> 5]
    

64 bit version

If you want to do this for 64 bits, you follow the exact same reasoning. You need a 64-bit debruijn sequence (there are 2^26 of them), and a precomputed table of size 64.

#define DEBRUIJN 0x22fdd63cc95386dULL
static uint32_t table [64] ;
void inittable (){
    uint64_t db = DEBRUIJN ;
    int i ;
    for (i=0 ; i < 64;++ i ) {
        table [db >> 58] = i ;
        db = db <<1;
    }
}

uint32_t ntz ( uint64_t n){
    n &= −n ;
    n ∗= DEBRUIJN ;
    n >>= 58;
    return table [n] ;
}

Use in bitsets

Suppose you’re representing a bitset as an array of 64 bit words. When iterating, you slide through the words, and if a bit is 1 you do your thing. If the rest of the word you’re sliding through only contains zeroes, you’re wasting effort. Knowing the number of trailing zeroes allows you to calculate the last index for the loop over the bits of a word.

A look at a random standard library

Java

There is a java.util.BitSet class that has been in the java standard library since 1.0. Since the source is open, we can inspect it. Let’s take a peek.
The iterator uses
Long.numberOfTrailingZeros(word)
and the implementation there is:

public static int numberOfTrailingZeros(long i) {
           // HD, Figure 5-14
           int x, y;
           if (i == 0) return 64;
           int n = 63;
           y = (int)i; if (y != 0) { n = n -32; x = y; } else x = (int)(i>>>32);
           y = x <<16; if (y != 0) { n = n -16; x = y; }
           y = x << 8; if (y != 0) { n = n - 8; x = y; }
           y = x << 4; if (y != 0) { n = n - 4; x = y; }
           y = x << 2; if (y != 0) { n = n - 2; x = y; }
           return n - ((x << 1) >>> 31);
}

(you can read the full source here: Long.java )

The “HD” reference in the source code points to Henry S. Warren, Jr.’s Hacker’s Delight, (Addison Wesley, 2002).

I will not comment on this.

Other languages

Left as an exercise for the reader 😉

have fun,

Romain.


Share your mistakes: adventures in optimization

I used to think I knew the laws of code optimization. In my (not so) humble opinion they were

  1. profile before you optimize
  2. after profiling tells you what the problem is, first try a better strategy (algorithm or data structure)
  3. tweak code as a last resort

It’s a pure economical reasoning that’s behind this: if your code is not fast enough, first find the biggest culprit and eliminate it. By taking out the biggest you get the most value for money,  and using something that yields orders of magnitude, gain the most. Tweaking code or moving to a more low-level programming language can only give you a factor of improvement, so if you have the choice, use the bigger gun.

Suppose, as an example, profiling reveals your cost can be written like this:

Cost = 0.8 * BiggestCulprit + 0.2 * EverythingElse

You know what to do: kill the BiggestCulprit. Btw, Pareto tells you it’s commonly something like that (80-20). Ok, using a big hammer you replaced the BiggestCulprit with something that’s 100 times cheaper.

Cost2 = 0.8 * 0.01 * BiggestCulprit + 0.2 * EverythingElse = 0.208 * Cost

If you need to go even faster, you should try to optimize EverythingElse. Can you do this? Depends. Maybe you can split EverythingElse in

EverythingElse = 0.8 * TheNextHurdle + 0.2 * NotInteresting

If you can’t. It ends here.

The strategy is rational, but sometimes profiling points you in the wrong direction.

An example of a mistake I made

What follows below is an account of what happened to a piece of code over a period of two years. I hope you will, when reading on conclude that the mistakes were obvious, but at the time, they weren’t. Hindsight is 20/20.

The problem

As a small step in solving a bigger problem, we needed to generate a sample of size n from a set of size p. Important detail: no value can be selected more than once.
The population size (p) is roughly somewhere between 4000 and 16000, while the sample size is often very small, sometimes more than 2000, but never more than 2500 (we know its distribution).
Let’s look at the problem in isolation. The code shown below is a micro benchmark that is representative for our case, and we’re interested in minimizing the total running time by improving the implementation of the Sampler module


let micro_bench ns k p  = 

  let test_one n =
    let sample = Sampler.create n p in
    let use_choice _ = () in
    let rec loop k = 
      if k = 0 
      then ()
      else 
        begin
          if k mod 1000 = 0 then Printf.eprintf ".%!";
          let () = Sampler.fill_sample sample n p in
          let () = Sampler.iter sample use_choice in
          let () = Sampler.clear_sample sample in
          loop (k-1)
        end
    in
    let t0 = Unix.gettimeofday() in
    let () = loop k in
    let t1 = Unix.gettimeofday() in
    let d = t1 -. t0 in
    Printf.printf "%i | %f \n" n d
  in
  List.iter test_one ns;;


let main () =  
  let k = 100 * 1000 in
  let p = 10000 in
  micro_bench [1;2;3;4;5;6;7;8;9;10;20;40;80;160;320;640;1000;2000;2500] k p;;

let () = main ();;

Our solution must adhere to the following signature:

module type S = sig
  type t
  val create : int -> int -> t
  val fill_sample: t -> int -> int -> unit
  val clear_sample: t -> unit
  val iter: t -> (int -> unit) -> unit
end

The first solution, the one I coded in search of correctness and simplicity, was exactly that, simple and correct:

module S0 = (struct 
    type t = {mutable c: int; es:int array}
    
    let create n p = {c = 0; es = Array.make n 0}

    let in_sample t x = 
      let rec loop i = 
        if i < 0 then false
        else
          if t.es.(i) = x 
          then true
          else loop (i-1)
      in 
      loop (t.c -1)

    let add2_sample t x = 
      let i = t.c in
      t.es.(i) <- x;
      t.c <- i+1        

    let fill_sample sample n p = 
      let rec loop i = 
        if i = 0 
        then ()
        else
          let rec find_new () = 
            let x = random_range p in
            if in_sample sample x 
            then find_new()
            else add2_sample sample x
          in
          let () = find_new () in
          loop (i-1)
      in
      loop n

    let clear_sample t = t.c <- 0
  
    let iter t f = 
      let rec loop i =
        if i = t.c 
        then ()
        else 
          let () = f t.es.(i) in
          loop (i+1)
      in
      loop 0
end : S)

The sample is accumulated in an array, and we test each candidate to see if we have it already. If so, we try again. Clearing the sample is putting the counter to zero, and iteration is merely iterating over the used part of the array. Simple enough, and it suffised for almost 6 months. A run of the microbenchmark (it takes some 1700 seconds) reveals what’s going on:

1 | 0.017802 
2 | 0.017753 
3 | 0.025648 
4 | 0.033298 
5 | 0.040910 
6 | 0.050635 
7 | 0.056496 
8 | 0.065127 
9 | 0.073126 
10 | 0.081244 
20 | 0.170436 
40 | 0.402476 
80 | 1.060872 
160 | 3.131289 
320 | 10.381503 
640 | 36.543450 
1000 | 85.969717 
2000 | 408.716565 
2500 | 1127.268196 

The first column is sample size, the second is time needed for 100000 samples. As you can see, it’s really fast for small sample sizes, but quickly succumbs. Profiling shows it’s the in_sample function that’s to blame. It must scan the entire sample array so far. It gets even worse as the chance of picking an element that was chosen before increases.

Well, it isn’t that difficult to have a better membership test. The population size isn’t that big, so we can afford a BitSet. Adding a member in O(1), membership testing in O(1)… let’s do it, it should fly.

module S1 = (struct
  type t = bool array
  let create n p = Array.make p false
  let in_sample t x = t.(x) 

  let add2_sample t x = t.(x) <- true

  let clear_sample t = 
    let rec loop i = 
      if i < 0 then ()
      else
        let () = t.(i) <- false in
        loop (i-1) 
    in
    loop (Array.length t -1)

  let fill_sample sample n p = 
      let rec loop i = 
        if i = 0 
        then ()
        else
          let rec find_new () = 
            let x = random_range p in
            if in_sample sample x 
            then find_new()
            else add2_sample sample x
          in
          let () = find_new () in
          loop (i-1)
      in
      loop n

  let iter t f =
    let s = Array.length t in
    let rec loop i = 
      if i = s then ()
      else
        let () = if t.(i) then f i in
        loop (i+1)
    in
    loop 0

end : S)

Let’s see what this does.

1 | 3.760345 
2 | 3.716187 
3 | 3.730672 
4 | 3.795472 
5 | 3.799934 
6 | 3.961258 
7 | 3.804574 
8 | 3.775391 
9 | 3.807858 
10 | 3.914987 
20 | 3.949764 
40 | 4.159262 
80 | 4.430131 
160 | 4.953897 
320 | 6.132642 
640 | 8.438193 
1000 | 11.140795 
2000 | 19.150232 
2500 | 23.508719 

It takes some 124 seconds to run it. Overall, it’s more than 10 times faster, but the small samples are a lot slower, so what happened?
A closer look (with the profiler) revealed 2 things:

  1. Clearing the bitset is O(p)
  2. Iterating the bitset also is O(p)

So we tried to remedy this by using a better representation of a bitset. An array of 64 bit words. Clearing is a lot faster there.
Iteration will be faster too as the bitset is expected to be sparse, and one can skip forward by inspecting the numberOfTrailingZeroes.
We optimized the clearing of the bitset, and dabbled into De Bruijn sequences for super fast iteration.
It’s a bit of topic, and maybe interesting enough for another post. The reason why I’m not digressing here is that it was the wrong road to go down to.

In the end, after a long detour, we settled on an entirely different strategy: Sparse Sets.

module S2 = (struct
  type t = { mutable n: int;
             mutable dense: int array;
             mutable sparse: int array;}

  let create n p = 
    { n = 0;
      dense = Array.make p 0;
      sparse = Array.make p 0;
    }

  let add2_sample t x = 
    let n = t.n in
    t.dense.(n) <- x;
    t.sparse.(x) <- n;
    t.n <- (n+1)

  let in_sample t x = 
    let rsi = t.sparse.(x) in
    let ok = rsi < t.n in
    ok && (t.dense.(rsi) = x)

  let iter t f =
    let n = t.n in
    let rec loop i =
      if i = n then ()
      else
        let x = t.dense.(i) in
        let () = f x in
        loop (i+1) 
    in
    loop 0

  let clear_sample t = t.n <- 0

  let fill_sample sample n p = 
    let rec loop i = 
      if i = 0 
      then ()
      else
        let rec find_new () = 
          let x = R.random_range p in
          if in_sample sample x 
          then find_new()
          else add2_sample sample x
        in
        let () = find_new () in
        loop (i-1)
    in
    loop n
    
end: S)

Let’s see what this does:

1 | 0.019265 
2 | 0.021046 
3 | 0.030151 
4 | 0.034281 
5 | 0.040782 
6 | 0.048158 
7 | 0.055332 
8 | 0.061747 
9 | 0.068712 
10 | 0.075462 
20 | 0.144088 
40 | 0.276297 
80 | 0.539943 
160 | 1.069994 
320 | 2.143328 
640 | 4.334955 
1000 | 6.893774 
2000 | 14.607145 
2500 | 18.819379 

It runs under a minute, and has the desired order of magnitude for our operations (adding, testing, clearing, iterating).
Meanwhile, if I ever need to revisit this, I have some aces up my sleeve:

  1. There is an 1984 paper “Faster Random Sampling methods(Jeffrey Scott Vitter)”
  2. I can always special case: if sample size below a carefully chosen threshold use S0 else, use something better suited for larger samples.
    This will give me best of both worlds at the cost of ugliness.

My mistake

Have you figured out what I did wrong strategically? In the above example, I made it several times: I allowed profiling to set the scope of my optimization efforts. Profiling is great to discover bottlenecks and the possible gains of removing them, but it will give you a sort of narrowmindedness that limits the degree of success. Once you discovered a bottleneck, you need to step back, and also look at the context. The bigger the chunk you’ll be optimizing the higher the possible gains. In my concrete case, I should have been looking for a better sampling method.
Instead, I searched for a better set representation. The problem is that you tend to find what you’re looking for.

Armed with the new insight, I propose the following laws of optimization.

  1. profile before you optimize
  2. if you find a bottleneck, try to optimize the broad context CONTAINING the bottleneck.
  3. tweak code as a last resort

Confessions of a revisionist

I must confess that I moved the example out of its original context, which was a C codebase. I recreated the different versions we had of the C code in OCaml for your pleasure.
So yes, we made the common mistake of going to a lower level programming language too soon, naively thinking we had a good understanding of the problem we were trying to solve.
As a result, we wasted more time than we should have. Anyway, in the end I hope I compensated enough by writing freely about my mistakes, so you can avoid them.

have fun,

Romain.

PS

For those interested in the code itself. I have pushed the code to a git repo : https://github.com/toolslive/Sampling