I used to think I knew the laws of code optimization. In my (not so) humble opinion they were

1. profile before you optimize
2. after profiling tells you what the problem is, first try a better strategy (algorithm or data structure)
3. tweak code as a last resort

The reasoning behind this is purely economic: if your code is not fast enough, first find the biggest culprit and eliminate it. Taking out the biggest cost gives you the most value for money, and a change that yields orders of magnitude gains you the most. Tweaking code or moving to a lower-level programming language can only give you a constant factor of improvement, so if you have the choice, use the bigger gun.

Suppose, as an example, profiling reveals your cost can be written like this:

Cost = 0.8 * BiggestCulprit + 0.2 * EverythingElse

You know what to do: kill the BiggestCulprit. Btw, Pareto tells you it’s commonly something like that (80-20). Ok, using a big hammer you replaced the BiggestCulprit with something that’s 100 times cheaper.

Cost2 = 0.8 * 0.01 * BiggestCulprit + 0.2 * EverythingElse = 0.208 * Cost

If you need to go even faster, you should try to optimize EverythingElse. Can you do this? Depends. Maybe you can split EverythingElse into

EverythingElse = 0.8 * TheNextHurdle + 0.2 * NotInteresting

If you can’t, it ends here.

The strategy is rational, but sometimes profiling points you in the wrong direction.

An example of a mistake I made

What follows below is an account of what happened to a piece of code over a period of two years. I hope that, reading on, you will conclude the mistakes were obvious, but at the time they weren’t. Hindsight is 20/20.

The problem

As a small step in solving a bigger problem, we needed to generate a sample of size n from a set of size p. Important detail: no value can be selected more than once.
The population size (p) is roughly somewhere between 4000 and 16000, while the sample size (n) is often very small, sometimes more than 2000, but never more than 2500 (we know its distribution).
Let’s look at the problem in isolation. The code shown below is a micro benchmark that is representative of our case, and we’re interested in minimizing the total running time by improving the implementation of the Sampler module.

```
let micro_bench ns k p =
  let test_one n =
    let sample = Sampler.create n p in
    let use_choice _ = () in
    let rec loop k =
      if k = 0
      then ()
      else
        begin
          if k mod 1000 = 0 then Printf.eprintf ".%!";
          let () = Sampler.fill_sample sample n p in
          let () = Sampler.iter sample use_choice in
          let () = Sampler.clear_sample sample in
          loop (k-1)
        end
    in
    let t0 = Unix.gettimeofday () in
    let () = loop k in
    let t1 = Unix.gettimeofday () in
    let d = t1 -. t0 in
    Printf.printf "%i | %f \n" n d
  in
  List.iter test_one ns;;

let main () =
  let k = 100 * 1000 in
  let p = 10000 in
  micro_bench [1;2;3;4;5;6;7;8;9;10;20;40;80;160;320;640;1000;2000;2500] k p;;

let () = main ();;

```

Our solution must adhere to the following signature:

```
module type S = sig
  type t
  val create : int -> int -> t
  val fill_sample : t -> int -> int -> unit
  val clear_sample : t -> unit
  val iter : t -> (int -> unit) -> unit
end
```

The first solution, the one I coded in search of correctness and simplicity, was exactly that, simple and correct:

```
module S0 = (struct
  type t = { mutable c : int; es : int array }

  let create n p = { c = 0; es = Array.make n 0 }

  let in_sample t x =
    let rec loop i =
      if i < 0 then false
      else
        if t.es.(i) = x
        then true
        else loop (i-1)
    in
    loop (t.c - 1)

  let add2_sample t x =
    let i = t.c in
    t.es.(i) <- x;
    t.c <- i + 1

  let fill_sample sample n p =
    let rec loop i =
      if i = 0
      then ()
      else
        let rec find_new () =
          let x = random_range p in     (* uniform random int in [0, p) *)
          if in_sample sample x
          then find_new ()
          else add2_sample sample x     (* new element: add it to the sample *)
        in
        let () = find_new () in
        loop (i-1)
    in
    loop n

  let clear_sample t = t.c <- 0

  let iter t f =
    let rec loop i =
      if i = t.c
      then ()
      else
        let () = f t.es.(i) in
        loop (i+1)
    in
    loop 0
end : S)
```

The sample is accumulated in an array, and we test each candidate to see if we have it already. If so, we try again. Clearing the sample just resets the counter to zero, and iteration is merely iterating over the used part of the array. Simple enough, and it sufficed for almost 6 months. A run of the microbenchmark (it takes some 1700 seconds) reveals what’s going on:

```
1 | 0.017802
2 | 0.017753
3 | 0.025648
4 | 0.033298
5 | 0.040910
6 | 0.050635
7 | 0.056496
8 | 0.065127
9 | 0.073126
10 | 0.081244
20 | 0.170436
40 | 0.402476
80 | 1.060872
160 | 3.131289
320 | 10.381503
640 | 36.543450
1000 | 85.969717
2000 | 408.716565
2500 | 1127.268196
```

The first column is the sample size, the second the time needed for 100000 samples. As you can see, it’s really fast for small sample sizes, but quickly succumbs. Profiling shows it’s the in_sample function that’s to blame: for every candidate it must scan the entire sample accumulated so far, which makes filling a sample roughly quadratic in its size. It gets even worse as the chance of picking an element that was chosen before increases.

Well, it isn’t that difficult to have a better membership test. The population size isn’t that big, so we can afford a BitSet. Adding a member in O(1), membership testing in O(1)… let’s do it, it should fly.

```
module S1 = (struct
  type t = bool array

  let create n p = Array.make p false

  let in_sample t x = t.(x)

  let add2_sample t x = t.(x) <- true

  let clear_sample t =
    let rec loop i =
      if i < 0 then ()
      else
        let () = t.(i) <- false in
        loop (i-1)
    in
    loop (Array.length t - 1)

  let fill_sample sample n p =
    let rec loop i =
      if i = 0
      then ()
      else
        let rec find_new () =
          let x = random_range p in
          if in_sample sample x
          then find_new ()
          else add2_sample sample x
        in
        let () = find_new () in
        loop (i-1)
    in
    loop n

  let iter t f =
    let s = Array.length t in
    let rec loop i =
      if i = s then ()
      else
        let () = if t.(i) then f i in
        loop (i+1)
    in
    loop 0
end : S)

```

Let’s see what this does.

```
1 | 3.760345
2 | 3.716187
3 | 3.730672
4 | 3.795472
5 | 3.799934
6 | 3.961258
7 | 3.804574
8 | 3.775391
9 | 3.807858
10 | 3.914987
20 | 3.949764
40 | 4.159262
80 | 4.430131
160 | 4.953897
320 | 6.132642
640 | 8.438193
1000 | 11.140795
2000 | 19.150232
2500 | 23.508719
```

It takes some 124 seconds to run it. Overall, it’s more than 10 times faster, but the small samples are a lot slower, so what happened?
A closer look (with the profiler) revealed 2 things:

1. Clearing the bitset is O(p)
2. Iterating the bitset also is O(p)

So we tried to remedy this with a better representation of a bitset: an array of 64-bit words. Clearing is a lot faster there, and iteration will be faster too, as the bitset is expected to be sparse and one can skip forward by inspecting the number of trailing zeroes.
We optimized the clearing of the bitset, and dabbled in De Bruijn sequences for super-fast iteration.
It’s a bit off topic, and maybe interesting enough for another post. The reason I’m not digressing here is that it was the wrong road to go down.
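For the curious, the iteration idea looks roughly like this. It is an illustrative sketch, not the code we had: a naive loop stands in for the trailing-zero computation that the De Bruijn trick speeds up, and OCaml’s native ints give 63 usable bits per word.

```
(* visit every set bit in a word-based bitset, skipping empty words *)
let iter_bits (words : int array) (f : int -> unit) =
  let bits_per_word = 63 in  (* native OCaml ints; the C original used 64-bit words *)
  (* index of the lowest set bit; a real version would use De Bruijn / ctz *)
  let ntz x =
    let rec go i x = if x land 1 = 1 then i else go (i + 1) (x lsr 1) in
    go 0 x
  in
  Array.iteri
    (fun wi w ->
       let rec walk w =
         if w <> 0 then begin
           f ((wi * bits_per_word) + ntz w);  (* report this member *)
           walk (w land (w - 1))              (* clear the lowest set bit *)
         end
       in
       walk w)
    words
```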

In the end, after a long detour, we settled on an entirely different strategy: Sparse Sets. The trick is to keep the sample in a dense array, while a second array maps every value back to its position in the dense array; membership is a constant-time cross-check between the two, and clearing is just resetting the counter.

```
module S2 = (struct
  type t = { mutable n : int;
             mutable dense : int array;
             mutable sparse : int array; }

  let create n p =
    { n = 0;
      dense = Array.make p 0;
      sparse = Array.make p 0;
    }

  let add2_sample t x =
    let n = t.n in
    t.dense.(n) <- x;
    t.sparse.(x) <- n;
    t.n <- (n+1)

  let in_sample t x =
    let rsi = t.sparse.(x) in
    let ok = rsi < t.n in
    ok && (t.dense.(rsi) = x)

  let iter t f =
    let n = t.n in
    let rec loop i =
      if i = n then ()
      else
        let x = t.dense.(i) in
        let () = f x in
        loop (i+1)
    in
    loop 0

  let clear_sample t = t.n <- 0

  let fill_sample sample n p =
    let rec loop i =
      if i = 0
      then ()
      else
        let rec find_new () =
          let x = R.random_range p in
          if in_sample sample x
          then find_new ()
          else add2_sample sample x
        in
        let () = find_new () in
        loop (i-1)
    in
    loop n
end : S)

```

Let’s see what this does:

```
1 | 0.019265
2 | 0.021046
3 | 0.030151
4 | 0.034281
5 | 0.040782
6 | 0.048158
7 | 0.055332
8 | 0.061747
9 | 0.068712
10 | 0.075462
20 | 0.144088
40 | 0.276297
80 | 0.539943
160 | 1.069994
320 | 2.143328
640 | 4.334955
1000 | 6.893774
2000 | 14.607145
2500 | 18.819379
```

It runs in under a minute, and has the desired order of magnitude for our operations (adding, testing, clearing, iterating).
Meanwhile, if I ever need to revisit this, I have some aces up my sleeve:

1. There is a 1984 paper, “Faster Methods for Random Sampling” (Jeffrey Scott Vitter).
2. I can always special-case: if the sample size is below a carefully chosen threshold, use S0; otherwise, use something better suited for larger samples (a sketch follows below).
This will give me the best of both worlds at the cost of some ugliness.
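For illustration, such a dispatcher could look roughly like this. It is only a sketch: the threshold value is hypothetical and would have to be picked from measurements, and it simply delegates to the S0 and S2 modules above.

```
module S_hybrid = (struct
  (* keep both implementations around and pick one based on the sample size *)
  type t = { small : S0.t; big : S2.t; mutable use_small : bool }

  let threshold = 40  (* hypothetical cut-off; measure to choose it *)

  let create n p =
    { small = S0.create n p; big = S2.create n p; use_small = n < threshold }

  let fill_sample t n p =
    t.use_small <- n < threshold;
    if t.use_small
    then S0.fill_sample t.small n p
    else S2.fill_sample t.big n p

  let clear_sample t =
    if t.use_small then S0.clear_sample t.small else S2.clear_sample t.big

  let iter t f =
    if t.use_small then S0.iter t.small f else S2.iter t.big f
end : S)
```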

My mistake

Have you figured out what I did wrong strategically? In the example above, I made the same mistake several times: I allowed profiling to set the scope of my optimization efforts. Profiling is great for discovering bottlenecks and the possible gains from removing them, but it gives you a sort of narrow-mindedness that limits the degree of success. Once you have discovered a bottleneck, you need to step back and also look at the context. The bigger the chunk you optimize, the higher the possible gains. In my concrete case, I should have been looking for a better sampling method.
Instead, I searched for a better set representation. The problem is that you tend to find what you’re looking for.

Armed with the new insight, I propose the following laws of optimization.

1. profile before you optimize
2. if you find a bottleneck, try to optimize the broader context CONTAINING the bottleneck.
3. tweak code as a last resort

Confessions of a revisionist

I must confess that I moved the example out of its original context, which was a C codebase. I recreated the different versions we had of the C code in OCaml for your pleasure.
So yes, we made the common mistake of going to a lower-level programming language too soon, naively thinking we had a good understanding of the problem we were trying to solve.
As a result, we wasted more time than we should have. Anyway, in the end I hope I compensated enough by writing freely about my mistakes, so you can avoid them.

have fun,

Romain.

PS

For those interested in the code itself: I have pushed it to a git repo: https://github.com/toolslive/Sampling

1. Fabrice Le Fessant says:

I am not sure I understood exactly what you were trying to do, but I had a similar problem once: I had a population with no duplicated elements, and I was trying to take every element of the population in turn, in a random order (if you do that for a limited number of elements, you get what I think is your sample).

It can be done in linear time this way:

Maintain a counter of the number of elements in the population.
At every turn, pick a random number below the counter, take the element at that position for your sample, move the last element of the population to that position, and decrease the counter. This way, you never need to check if you already picked an element, since it has been removed from the population (if you need to keep the population, just do a copy before).

Would it work in your case?

• rslootma says:

If I understand correctly, you’re suggesting to try the ‘Josephus permutation’ strategy.
(btw, inserting the last at the position of the chosen one is a nice optimization trick.)
Yes, it would work. My main concerns are that the testing is cheap anyway, and the copying of the population is expensive.
I’ll find some time and will see how it performs in this particular case (you made me curious).

• Fabrice Le Fessant says:

My case is a bit different from yours: close to the end, 99% of the population has been chosen, so “in_sample” would almost always return [true] and my program would loop endlessly in “find_new” until, by chance, a free element would finally be found. So, I had to find a completely different strategy. In your case, you end up with only 25% of chosen ones, so “in_sample” returns [false] most of the time.

• rslootma says:

I coded up your suggestion. On my machine it takes about 85s, and as I feared, it suffers from the resetting of the population. Anyway, the code is here:
https://github.com/toolslive/Sampling
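Roughly, the idea looks like this (a sketch, not the exact code in the repo): keep a working copy of the population next to the sample, and let clear_sample restore it, which is the O(p) part I was worried about.

```
module SFabrice = (struct
  type t = { mutable c : int;      (* sample size so far *)
             es : int array;       (* the sample *)
             pop : int array;      (* working copy of the population *)
             mutable live : int }  (* population elements still available *)

  let create n p =
    { c = 0; es = Array.make n 0;
      pop = Array.init p (fun i -> i); live = p }

  let fill_sample t n _p =
    for _j = 1 to n do
      let i = Random.int t.live in
      (* take the element, close the hole with the last live element *)
      t.es.(t.c) <- t.pop.(i);
      t.c <- t.c + 1;
      t.pop.(i) <- t.pop.(t.live - 1);
      t.live <- t.live - 1
    done

  let iter t f = for i = 0 to t.c - 1 do f t.es.(i) done

  let clear_sample t =
    t.c <- 0;
    (* restoring the population is the expensive, O(p) part *)
    for i = 0 to Array.length t.pop - 1 do t.pop.(i) <- i done;
    t.live <- Array.length t.pop
end : S)
```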

2. Fabrice Le Fessant says:

I tested on my computer, and added a “5000” case:

S2:
160 | 1.045216
320 | 2.112151
640 | 4.269535
1000 | 6.779958
2000 | 14.438439
2500 | 18.564696
5000 | 44.786517

SFabrice:
160 | 2.132911
320 | 3.040744
640 | 4.771763
1000 | 6.720089
2000 | 12.181486
2500 | 14.986117
5000 | 28.753713

As you can see, for big samples, timings for SFabrice grow linearly, while they grow faster with S2 (probably O(n log n)). So, I would recommend your solution for small samples, and mine for big samples.

• rslootma says: