On segmentation faults, stack overflows, GDB and OCaml

Posted: December 4, 2011
A couple of months ago, we got reports of random crashes of our distributed, consistent key-value store, Arakoon (if you don’t know it, go check it out!). The process died, out of the blue, due to a segmentation fault. We found no way to reproduce the crashes; all we got was a couple of core dumps.
Arakoon is written in OCaml, and uses the Lwt library for lightweight threads (providing concurrency support).
Segmentation faults have a couple of possible causes: writing to or reading from unmapped memory, writing to a read-only mapping, etc. When we loaded the failing binary and the core dump into the GNU Debugger, GDB, it was unable to produce a sensible backtrace. Sadly, I no longer have the ‘bt’ output around.
After some more investigation and head-scratching, we noticed that the value of the stack pointer, stored in the ‘rsp’ register, was just under a 4-kilobyte boundary. Looking at the code around the instruction pointer (the ‘rip’ register) at the moment of the crash confirmed our suspicions: a value was being written on the stack, just below this 4k limit.
On Intel x86 systems (like the x86-64 machines we’re using), the stack is, basically, a region of memory similar to the heap, except that it grows towards lower addresses. The crashes we were investigating were caused by a so-called stack overflow: the process tried to write at an address below the mapped pages of the stack!
Do note that, barring some unlikely bug in the compiler or code generator, there are only two causes of segfaults in OCaml programs: stack overflows, and whatever happens when calling out to C code (in which case the issue isn’t related to OCaml at all).
To generate a useful backtrace, which we needed to pinpoint and fix the issue, we had to resort to other means. By inspecting the assembly of the crashing code, as well as the contents of process memory just above the 4k-aligned page the stack pointer was sitting under, we were able to reason about what happened at runtime, and to compute valid values for both the stack pointer and the instruction pointer, i.e. the values of the rsp and rip registers in the frame before the one in which the segmentation fault occurred.
After we figured out these two values, we replaced the faulty ones in the binary core dump file using a hex editor, taking the machine byte order into account (i.e. when replacing 0x00007fff5ed6b000, one should search for the byte sequence ‘0x00 0xb0 0xd6 0x5e 0xff 0x7f 0x00 0x00’!). Once the core dump was ‘fixed’, we could load it into GDB and generate a valid backtrace, which confirmed our suspicions (there were more than 1.3 million frames on the stack!). It was easy to extract a recurring sequence of frames from it:
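The little-endian byte sequence for a given 64-bit value is easy to get wrong by hand; it can be double-checked with a few lines of Python (a quick sanity check, not part of the original workflow):

```python
import struct

# The faulty stack pointer value found in the core dump.
addr = 0x00007FFF5ED6B000

# x86-64 is little-endian, so the least significant byte comes first:
# '<Q' packs an unsigned 64-bit integer in little-endian byte order.
raw = struct.pack("<Q", addr)

# This is the byte sequence to search for in the hex editor.
print(" ".join("0x%02x" % b for b in raw))
# → 0x00 0xb0 0xd6 0x5e 0xff 0x7f 0x00 0x00
```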
#1309851 0x00007f8b7e936d40 in ?? ()
#1309852 0x00000000004c38bd in camlLwt__fun_838 ()
#1309853 0x000000001d457348 in ?? ()
#1309854 0x00000000004c2a3d in camlLwt__fun_724 ()
#1309855 0x00007fffb67bd570 in ?? ()
#1309856 0x00000000004c2a21 in camlLwt__fun_724 ()
#1309857 0x00000000028a2bf8 in ?? ()
#1309858 0x00000000004c3cd0 in camlLwt__run_waiters_rec_207 ()
#1309859 0x000000001d457330 in ?? ()
#1309860 0x000000001d457320 in ?? ()
#1309861 0x00000000007d8dd0 in camlLwt__77 ()
#1309862 0x00000000004c3d7e in camlLwt__run_waiters_231 ()
#1309863 0x0000000000000001 in ?? ()
#1309864 0x00000000004c65fd in camlLwt_mutex__fun_112 ()
As you can see, the OCaml compiler uses a name mangling scheme to generate symbol names. Some are obvious (‘camlLwt__run_waiters_rec’ is ‘Lwt.run_waiters_rec’), some are not: all ‘*__fun_*’ symbols are due to anonymous closures. Using the assembly code of these functions, alongside the Lwt source code, we were able to reconstruct what was going on.
Finally we figured out what caused the stack overflows we were experiencing, and could create two small testcases exposing the issue at a more manageable scale.
A work-around was created, and we reported the issue to the Lwt developers, who fixed it some days later; check the thread on their mailing list here.
A couple of days ago, we got a new report of segmentation faults. We started investigating the core dump using GDB yet again, noticed that a sensible backtrace could not be generated, and set out to alter the core dump once more, just like some months before. Here’s what GDB told us:
#0  0x00000000005617c2 in camlLwt__try_bind_1339 ()
(gdb) bt
#0  0x00000000005617c2 in camlLwt__try_bind_1339 ()
#1  0x00007fff5ed6b040 in ?? ()
#2  0x00000000005617a6 in camlLwt__try_bind_1339 ()
#3  0x0000000000000000 in ?? ()
(gdb) p/x $rsp
$1 = 0x7fff5ed6b000
(gdb) p/x $rip
$2 = 0x5617c2
As you can see, the stack pointer is at a 4k boundary again. By reading the disassembly of the failing function, we can take a look at the stack data:
(gdb) disassemble $rip
Dump of assembler code for function camlLwt__try_bind_1339:
   0x0000000000561790:  sub    $0x28,%rsp
   0x0000000000561794:  mov    %rax,%rsi
   0x0000000000561797:  mov    %rbx,0x18(%rsp)
   0x000000000056179c:  mov    %rdi,0x10(%rsp)
   0x00000000005617a1:  callq  0x5617b0
   0x00000000005617a6:  callq  0x560d70
   0x00000000005617ab:  jmp    0x5617ca
   0x00000000005617ad:  nopl   (%rax)
   0x00000000005617b0:  push   %r14
   0x00000000005617b2:  mov    %rsp,%r14
   0x00000000005617b5:  mov    $0x1,%rax
   0x00000000005617bc:  mov    (%rsi),%rdi
   0x00000000005617bf:  mov    %rsi,%rbx
=> 0x00000000005617c2:  callq  *%rdi
   0x00000000005617c4:  pop    %r14
   0x00000000005617c6:  add    $0x8,%rsp
   0x00000000005617ca:  callq  0x560630
The function subtracts 0x28 from ‘rsp’ on entry, so the previous frame uses ‘rsp + 0x28’ as its stack pointer, which we can print:
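Concretely, with the register values from the GDB session above, the caller’s stack pointer is a one-line calculation (shown in Python just for the arithmetic):

```python
rsp = 0x7FFF5ED6B000          # crashing frame's stack pointer (4k-aligned)
frame_size = 0x28             # from 'sub $0x28,%rsp' at function entry

# Where the previous frame's data starts; this is the first address
# shown in the memory dump below.
caller_sp = rsp + frame_size

print(hex(caller_sp))
# → 0x7fff5ed6b028
```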
(gdb) x/64 $rsp + 0x28
0x7fff5ed6b028: 0x58002ef0 0x00007f08 0x0774cd58 0x00000000
0x7fff5ed6b038: 0x0055eda5 0x00000000 0x5ed6b0b0 0x00007fff
0x7fff5ed6b048: 0x0055ed87 0x00000000 0x0774cd58 0x00000000
0x7fff5ed6b058: 0x00560785 0x00000000 0x0774cd48 0x00000000
0x7fff5ed6b068: 0x00000001 0x00000000 0x58002f68 0x00007f08
0x7fff5ed6b078: 0x0056080e 0x00000000 0x00000001 0x00000000
0x7fff5ed6b088: 0x00563cdd 0x00000000 0x58002fb8 0x00007f08
0x7fff5ed6b098: 0x0056006d 0x00000000 0x58002f78 0x00007f08
0x7fff5ed6b0a8: 0x0055eda5 0x00000000 0x5ed6b120 0x00007fff
0x7fff5ed6b0b8: 0x0055ed87 0x00000000 0x0774cb30 0x00000000
0x7fff5ed6b0c8: 0x00560785 0x00000000 0x0774cb20 0x00000000
0x7fff5ed6b0d8: 0x00000001 0x00000000 0x58003050 0x00007f08
0x7fff5ed6b0e8: 0x0056080e 0x00000000 0x00000001 0x00000000
0x7fff5ed6b0f8: 0x00563cdd 0x00000000 0x580030a0 0x00007f08
0x7fff5ed6b108: 0x0056006d 0x00000000 0x58003060 0x00007f08
0x7fff5ed6b118: 0x0055eda5 0x00000000 0x5ed6b190 0x00007fff
If you take a close look, you can see a recurring pattern starting at 0x7fff5ed6b038, at least partially: these are return addresses of the functions that make up the recursion loop, along with their (varying) arguments. Feeding the return addresses to ‘info symbol’ retrieves the names of the functions. There was a lot of resemblance with what we saw before as well…
We dug up the testcases we used before and compiled them against Lwt 2.3.0, the version the crashing Arakoon binary was compiled against as well, and our fear was confirmed: the same issue had been introduced in Lwt again, a regression! Luckily it was fixed again later on: compiling the tests against Lwt 2.3.2 confirmed this. We had builds of Arakoon against Lwt 2.3.2 around, so production systems could be upgraded to a safe version again. I guess we should add the Lwt testcases to the Arakoon test suite.
A couple of weeks ago I read about the Python scripting support introduced in recent versions of GDB, and decided to automate the debugging of this issue, if possible. So I created a script which implements a new GDB command, ‘ocaml-detect-recursion’, which takes an optional stack address (defaulting to ‘rsp’), reads some stack memory, and attempts to find return addresses in those bytes. The buffer is scanned 8 bytes at a time (we’re on a 64-bit system, which uses 8-byte addresses); whenever a value points into a known function, it is appended to a list.
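The address-scanning step can be sketched in plain Python. The helper below is illustrative only: in the real GDB script the buffer comes from reading inferior memory and the function ranges come from GDB’s symbol table (via the gdb module), both of which are stubbed out here as plain arguments.

```python
import struct

def scan_return_addresses(buf, known_ranges):
    """Scan a raw stack snapshot for values falling inside known
    function address ranges; those are candidate return addresses.

    buf          -- bytes read from the stack, low to high addresses
    known_ranges -- list of (start, end, name) tuples for known functions
    """
    hits = []
    # Walk the buffer 8 bytes at a time: 64-bit little-endian words.
    for off in range(0, len(buf) - 7, 8):
        (value,) = struct.unpack_from("<Q", buf, off)
        for start, end, name in known_ranges:
            if start <= value < end:
                hits.append((value, name))
                break
    return hits
```

Anything on the stack that merely looks like a code address (a spilled value, an OCaml immediate) is filtered out by the range check, which is why most of the ‘?? ()’ noise from the raw backtrace disappears.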
Once this is done, a minimal recurring pattern is detected, if possible. The algorithm used here is extremely simple for now (split the address list into n-grams for increasing values of n, then check whether all n-grams are identical), so it could use some love.
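That naive n-gram check can be written down in a few lines; this is a sketch of the idea, not the code from the actual plugin:

```python
def minimal_recurring_pattern(addresses):
    """Return the shortest prefix of 'addresses' which, repeated,
    covers the whole list, or None if the list is not periodic.

    For growing n, split the list into chunks of length n and test
    whether every chunk equals the first one.
    """
    for n in range(1, len(addresses) // 2 + 1):
        if len(addresses) % n != 0:
            continue
        pattern = addresses[:n]
        chunks = [addresses[i:i + n]
                  for i in range(0, len(addresses), n)]
        if all(chunk == pattern for chunk in chunks):
            return pattern
    return None
```

As noted above, this only handles the clean case where the recursion fills the sampled stack region exactly; partial leading or trailing frames would trip it up.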
Once the recurring pattern is detected, a summary, including the original function names, is displayed:
(gdb) source ocaml_detect_recursion.py
(gdb) ocaml-detect-recursion
Recurring call pattern starting at frame 1
==========================================
camlLwt__fun_1826 @ 0x55eda5
camlLwt__fun_1826 @ 0x55ed87
camlLwt__run_waiters_rec_1181 @ 0x560785
camlLwt__run_waiters_1199 @ 0x56080e
camlLwt_mutex__fun_1084 @ 0x563cdd
camlLwt__fun_1956 @ 0x56006d
Using this plugin, it becomes really easy to debug this issue if it ever occurs again.
Scripting GDB takes some time (go check the docs!), but I do think it’s worth it. Another interesting feature is custom pretty-printers (one for OCaml values might be useful, *hint hint*).
Finally, thanks to our colleagues at Amplidata for their support debugging the crash the first time.
Interested in working on issues like this yourself? Take a look at our jobs page, we’re looking for colleagues.