Rethinking an embedded database API

One of the projects the Incubaid research team is working on is a new embedded database (where ’embedded’ refers to the way other projects like GNU DBMBerkeley DB, Tokyo Cabinet or SQLite are used). This project, codename Baardskeerder, is still in its infancy, and we’re toying with different approaches to various problems.

The database itself does not expose a full-fledged SQL interface like SQLite does: the interface provides the ability to store values for a given key, retrieve the value stored under a key, and some more complex operations like range lookups.

The API for a get operation seems rather obvious on first sight: you pass in a key, and the corresponding value is returned if the key has been set before, or the non-existence is signaled somehow (note: all pseudocode in this post is Haskell):

import qualified Data.ByteString.Char8 as BS
import Data.ByteString.Char8 (ByteString)

type Key = ByteString
type Value = ByteString

data DB = DB

get :: DB -> Key -> Maybe Value
get d k = undefined

Since the database is persisted (e.g. in a file on a file system on a disk), and will most likely be used by server-style applications, returning values to some client, whilst the server application doesn’t care about the values themselves, this style of API has a major drawback: it forces copying values from kernelspace into userspace, after which the value is written to some socket, copying it from userspace back into kernelspace, without any intermediate modifications, or even access.

In most modern kernels, including Linux, BSD, Solaris and others, there are several system calls available which allow one to send data from files on a socket, without this useless copying to and from userspace. These include sendfile(2) and splice(2). We want to be able to support these optimizations from within applications embedding the database library, whilst still providing the ‘basic’ API as well.

One approach would be to implement a custom getAndSendOnSocket call in the library, but this feels very much ad-hoc. Another way to tackle this is to return a file descriptor, an offset and a length on a get call, but this doesn’t feel right, since we don’t want to leak a file descriptor to the host application as-is.

What we got now is a callback/continuation-driven approach, where a host application can provide an action which will have access to the file descriptor, get an offset and length value, and can do whatever it feels like, whilst being wrapped inside an exception handler.

The signature of our get’ action becomes

get' :: DB -> Key -> (Result -> IO a) -> IO a

Here’s a pseudocode-implementation:

import Prelude hiding (concat, length, lookup)
import Data.Binary
import Data.ByteString.Char8
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int32)
import Control.Exception (bracket, finally)
import Foreign.Marshal.Alloc
import Foreign.Marshal.Array
import Network.Socket (Socket)
import Network.Socket.ByteString
import Network.Socket.SendFile.Handle
import System.Posix (COff, Fd, fdToHandle)
import System.Posix.IO.Extra (pread)

type Length = Int32

type Key = ByteString
type Value = ByteString

data DB = DB

data Result = NotFound
            | Value Value
            | Location Fd COff Length

-- Lookup a value in the database
lookup :: DB -> Key -> IO Result
lookup d k = undefined

-- Take a reference to the database
-- This is dummy, the actual API is more complex
ref :: DB -> IO ()
ref d = undefined

-- Return a reference to the database
unref :: DB -> IO ()
unref d = undefined

get' :: DB -> Key -> (Result -> IO a) -> IO a
get' d k h = do
    v <- lookup d k

    let (pre, post) = case v of
            Location _ _ _ -> (ref, unref)
            otherwise -> (\_ -> return (), \_ -> return ())

    pre d
    h v `finally` post d

Note a Result can be NotFound, an actual value (in our case this can happen if the value is compressed or encrypted on disk), or a pointer to the value in a given file.

Implementing the classic get becomes trivial:

get :: DB -> Key -> IO (Maybe Value)
get d k = get' d k h
    h :: Result -> IO (Maybe Value)
    h NotFound = return Nothing
    h (Value v) = return $ Just v
    h (Location f o l) =
        Just `fmap` bracket
            (mallocArray l')
            -- TODO This assumes pread(2) always returns the requested number
            -- of bytes
            (\s -> pread f s l' o >> packCStringLen (s, l'))
        l' :: Int
        l' = fromIntegral l

The implementation of a get call which sends a value on a socket, prefixed with the length of the value, becomes simple as well:

sendOnSocket :: DB -> Key -> Socket -> IO ()
sendOnSocket d k s = get' d k h
    h :: Result -> IO ()
    h NotFound = sendMany s zero
    h (Value v) = do
        sendMany s $ BL.toChunks $ encode $ (fromIntegral $ length v :: Length)
        sendAll s v
    h (Location f o l) = do
        sendMany s $ BL.toChunks $ encode l
        f' <- fdToHandle f
        sendFile' s f' (fromIntegral o) (fromIntegral l)

    zero = BL.toChunks $ encode (0 :: Int32)

Instead of sendFile’, some FFI binding to splice or something similar could be used as well.

In the current pseudo-implementation, a file descriptor can still be leaked (an application can call get’ with an action which stores the given Fd in some IORef), and given actions can alter the file pointer using lseek(2) and others. We could tackle this, if necessary, by duplicating the file descriptor using dup(2) and passing the copy to the application, then closing the duplicate after completion (so there’s no use for the application to store the descriptor: it’ll become invalid anyway).


2 Comments on “Rethinking an embedded database API”

  1. Rina Noronha says:

    Hi, Nicolas!

    I’m the web editor at iMasters, one of the largest developer communities in Brazil. I´d like to talk to you about republishing your article at our site. Can you contact me at

    Rina Noronha
    Journalist – web editor

  2. […] incorporado (onde “incorporado” se refere ao caminho de outros projetos como o GNU DBM, Berkeley DB, Tokyo Cabinet ou SQLite são usados). Esse projeto, de codenome Baardskeerder, está ainda […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s