Int32 serialization in OCaml
Posted: October 4, 2013 Filed under: OCaml, Programming 5 CommentsToday, I’m going to talk a bit about the problem of serializing an int32 in OCaml. As I’m only working on Intel machines, I’m not interested in portability, and prefer little-endian serialization. This should be natural and easy.
The interface
val set32: string -> int -> int32 -> unit val get32: string -> int -> int32
The microbenchmark
We’re going to store an int32 into a string,retrieve it, and check if it’s the same. We’re going to do this 1_000_000_000 times, see how long it took, and calculate the speed.
let benchmark n = let t0 = Unix.gettimeofday() in let s = String.create 4 in let limit = Int32.of_int n in let rec loop i32 = if i32 = limit then () else let () = set32 s 0 i32 in let j32 = get32 s 0 in assert (i32 = j32); loop (Int32.succ i32) in let () = loop 0l in let t1 = Unix.gettimeofday () in let d = t1 -. t0 in let speed = float n /. d in let megaspeed = speed /. 1000000.0 in Printf.printf "%i took %f => %fe6/s\n" n d megaspeed
Attempt 0: Naive implementation
This is rather straight forward: mask, extract the char, store, shift and repeat. Retrieving the int32 from the string is the opposite. No rocket surgery here.
This is simple, readable code.
let set32_ocaml s pos (i:int32) = let (>:) = Int32.shift_right_logical in let (&:) = Int32.logand in let mask = Int32.of_int 0xff in let to_char v = Char.chr (Int32.to_int v) in let too_far = pos + 4 in let rec loop p i = if p = too_far then () else let vp = i &: mask in let cp = to_char vp in let () = s.[p] <- cp in loop (p+1) (i >: 8) in loop pos i let get32_ocaml s pos = let (<:) = Int32.shift_left in let (|:) = Int32.logor in let to_i32 c = Int32.of_int (Char.code c) in let rec loop acc p = if p < pos then acc else let cp = s.[p] in let vp = to_i32 cp in let acc' = (acc <: 8) |: vp in loop acc' (p-1) in loop 0l (pos + 3)
OCaml is a nice high level language, but this bit twiddling feels rather clumsy and ugly.
Anyway, let’s benchmark it.
Strategy | Speed |
---|---|
naive OCaml | 16.0e6/s |
A quick peek at how Thrift does it
let get_byte32 i b = 255 land (Int32.to_int (Int32.shift_right i (8*b))) class trans = object(self) val ibyte = String.create 8 ... method writeI32 i = let gb = get_byte32 i in for i=0 to 3 do ibyte.[3-i] <- char_of_int (gb i) done; trans#write ibyte 0 4
Ok, this uses the same strategy; but there’s a for loop there. The conversion is done in the ibyte buffer and then copied along. It’s a bit sub-awesome, but the extra copy of 4 bytes shouldn’t be too costly neither.
Attempt 1: But in C, it would be way faster
It’s a platitude I hear a lot, but in this case, it really should be faster. After all, if you want to retrieve an int32 from a string, all you need to do is to cast the char* to an int32_t* and de-reference the value.
Let’s try this:
external set32 : string -> int -> int32 -> unit = "zooph_set32" external get32 : string -> int -> int32 = "zooph_get32"
#include <stdint.h> #include <stdio.h> #include <caml/alloc.h> #include <caml/memory.h> #include <caml/mlvalues.h> value zooph_set32(value vs, value vpos, value vi){ CAMLparam3(vs, vpos, vi); char* buf = String_val(vs); int pos = Int_val(vpos); int32_t i = Int32_val(vi); char* buf_off = &buf[pos]; int32_t* casted = (int32_t*)buf_off; casted[0] = i; CAMLreturn (Val_unit); } value zooph_get32(value vs,value vpos){ CAMLparam2(vs,vpos); CAMLlocal1(result); char* buf = String_val(vs); int pos = Int_val(vpos); char* buf_off = &buf[pos]; int32_t* casted = (int32_t*)buf_off; int32_t i32 = casted[0]; result = caml_copy_int32(i32); CAMLreturn(result); }
I called my compilation unit zooph.c an onomatopoeia that pays tribute to how fast I expect this to be. There’s no loop, and the machine has the skills to do the transformation in one step. So it should roughly be about 4 times faster.
Let’s benchmark it.
Strategy | Speed |
---|---|
naive OCaml | 16.0e6 |
C via FFI | 32.3e6 |
Hm… it’s faster allright, but it’s also a bit disappointing. So what went wrong?
A quick look at the assembly code reveals a lot:
zooph_set32: .LFB34: .cfi_startproc movl 8(%rdx), %eax sarq %rsi movslq %esi, %rsi movl %eax, (%rdi,%rsi) movl $1, %eax ret .cfi_endproc .LFE34: .size zooph_set32, .-zooph_set32 .p2align 4,,15 .globl zooph_get32 .type zooph_get32, @function zooph_get32: .LFB35: .cfi_startproc pushq %rbx .cfi_def_cfa_offset 16 .cfi_offset 3, -16 movq %rsi, %rdx sarq %rdx subq $160, %rsp .cfi_def_cfa_offset 176 movslq %edx, %rdx movq caml_local_roots(%rip), %rbx leaq 8(%rsp), %rcx movq %rdi, 8(%rsp) movl (%rdi,%rdx), %edi movq %rsi, (%rsp) movq $1, 32(%rsp) movq %rcx, 40(%rsp) leaq (%rsp), %rcx movq %rbx, 16(%rsp) movq $2, 24(%rsp) movq $0, 152(%rsp) movq %rcx, 48(%rsp) leaq 16(%rsp), %rcx movq $1, 96(%rsp) movq $1, 88(%rsp) movq %rcx, 80(%rsp) leaq 80(%rsp), %rcx movq %rcx, caml_local_roots(%rip) leaq 152(%rsp), %rcx movq %rcx, 104(%rsp) call caml_copy_int32 movq %rbx, caml_local_roots(%rip) addq $160, %rsp .cfi_def_cfa_offset 16 popq %rbx .cfi_def_cfa_offset 8 ret .cfi_endproc
While zooph_set32 seems to be in order, its counter part is rather messy. On closer inspection, not even the set32 side is optimal. OCaml’s FFI allows smooth (at least compared to jni) interaction with native code in other languages, it also installs a firm border across which no inlining is possible (not with OCaml that is).
Let’s take a look at how the benchmark code calls this.
.L177: movq %rbx, 8(%rsp) movq %rax, 0(%rsp) movq $1, %rsi movq 16(%rbx), %rdi movq %rax, %rdx movq zooph_set32@GOTPCREL(%rip), %rax call caml_c_call@PLT .L179: movq caml_young_ptr@GOTPCREL(%rip), %r11 movq (%r11), %r15 movq $1, %rsi movq 8(%rsp), %rax movq 16(%rax), %rdi movq zooph_get32@GOTPCREL(%rip), %rax call caml_c_call@PLT .L180: movq caml_young_ptr@GOTPCREL(%rip), %r11 movq (%r11), %r15 movslq 8(%rax), %rax movq 0(%rsp), %rdi movslq 8(%rdi), %rbx cmpq %rax, %rbx je .L176
You see stuff being pushed on the stack before the call. For raw speed, you don’t want this to happen. For raw speed, you don’t even want a call.
To get there, you need to translate the benchmark to C too. I’m not going to bother, because I have another trick ready.
Attempt 2: OCaml 4.01 primitives
OCaml 4.01 got released recently, and there’s a little entry in the release notes.
PR#5771: Add primitives for reading 2, 4, 8 bytes in strings and bigarrays
(Pierre Chambart)
However, for some reason, they are not really exposed, and I had to dig to find them. Using them however is trivial.
external get32_prim : string -> int -> int32 = "%caml_string_get32" external set32_prim : string -> int -> int32 -> unit = "%caml_string_set32"
That’s all there is to it. Basically, you say that you know that the compiler knows how to do this, and that from now on, you want to do that too.
Let’s benchmark it.
Strategy | Speed |
---|---|
naive OCaml | 16.0e6 |
C via FFI | 32.3e6 |
OCaml with primitives | 139e6 |
Wow.
Closing words
I’ve put the code for this on github: https://github.com/toolslive/int32_blog Anyway, we need to (de)serialize int64 values as well. Determining the speedup there is left as an exercise for the reader (tip: it’s even better).
I think some people will feel the urge to apply this to their serialization code as well.
Have fun,
Romain.
You could try to use the “noalloc” with your C via FFI solution.
Another undocumented thingy.
The only reference I could find was this:
http://camltastic.blogspot.be/2008/08/tip-calling-c-functions-directly-with.html
But from what I gather, you can only annotate the set side.
The get side does allocate. (Maybe you can prove me wrong here)
You are correct.
The ocplib-endian package in OPAM uses those new primitives, but with a nice interface. The cstruct syntax extension wraps them in an easier camlp4 extension too (see github.com/mirage/ocaml-cstruct and lib_test for a few examples of pcap)
[…] Artigo traduzido pela Redação iMasters, com autorização do autor. Publicado originalmente em http://blog.incubaid.com/2013/10/04/int32-serialization-in-ocaml/ […]