Rethinking an embedded database APIPosted: September 30, 2011
One of the projects the Incubaid research team is working on is a new embedded database (where ’embedded’ refers to the way other projects like GNU DBM, Berkeley DB, Tokyo Cabinet or SQLite are used). This project, codename Baardskeerder, is still in its infancy, and we’re toying with different approaches to various problems.
The database itself does not expose a full-fledged SQL interface like SQLite does: the interface provides the ability to store values for a given key, retrieve the value stored under a key, and some more complex operations like range lookups.
The API for a get operation seems rather obvious on first sight: you pass in a key, and the corresponding value is returned if the key has been set before, or the non-existence is signaled somehow (note: all pseudocode in this post is Haskell):
import qualified Data.ByteString.Char8 as BS import Data.ByteString.Char8 (ByteString) type Key = ByteString type Value = ByteString data DB = DB get :: DB -> Key -> Maybe Value get d k = undefined
Since the database is persisted (e.g. in a file on a file system on a disk), and will most likely be used by server-style applications, returning values to some client, whilst the server application doesn’t care about the values themselves, this style of API has a major drawback: it forces copying values from kernelspace into userspace, after which the value is written to some socket, copying it from userspace back into kernelspace, without any intermediate modifications, or even access.
In most modern kernels, including Linux, BSD, Solaris and others, there are several system calls available which allow one to send data from files on a socket, without this useless copying to and from userspace. These include sendfile(2) and splice(2). We want to be able to support these optimizations from within applications embedding the database library, whilst still providing the ‘basic’ API as well.
One approach would be to implement a custom getAndSendOnSocket call in the library, but this feels very much ad-hoc. Another way to tackle this is to return a file descriptor, an offset and a length on a get call, but this doesn’t feel right, since we don’t want to leak a file descriptor to the host application as-is.
What we got now is a callback/continuation-driven approach, where a host application can provide an action which will have access to the file descriptor, get an offset and length value, and can do whatever it feels like, whilst being wrapped inside an exception handler.
The signature of our get’ action becomes
get' :: DB -> Key -> (Result -> IO a) -> IO a
Here’s a pseudocode-implementation:
import Prelude hiding (concat, length, lookup) import Data.Binary import Data.ByteString.Char8 import qualified Data.ByteString.Lazy as BL import Data.Int (Int32) import Control.Exception (bracket, finally) import Foreign.Marshal.Alloc import Foreign.Marshal.Array import Network.Socket (Socket) import Network.Socket.ByteString import Network.Socket.SendFile.Handle import System.Posix (COff, Fd, fdToHandle) import System.Posix.IO.Extra (pread) type Length = Int32 type Key = ByteString type Value = ByteString data DB = DB data Result = NotFound | Value Value | Location Fd COff Length -- Lookup a value in the database lookup :: DB -> Key -> IO Result lookup d k = undefined -- Take a reference to the database -- This is dummy, the actual API is more complex ref :: DB -> IO () ref d = undefined -- Return a reference to the database unref :: DB -> IO () unref d = undefined get' :: DB -> Key -> (Result -> IO a) -> IO a get' d k h = do v <- lookup d k let (pre, post) = case v of Location _ _ _ -> (ref, unref) otherwise -> (\_ -> return (), \_ -> return ()) pre d h v `finally` post d
Note a Result can be NotFound, an actual value (in our case this can happen if the value is compressed or encrypted on disk), or a pointer to the value in a given file.
Implementing the classic get becomes trivial:
get :: DB -> Key -> IO (Maybe Value) get d k = get' d k h where h :: Result -> IO (Maybe Value) h NotFound = return Nothing h (Value v) = return $ Just v h (Location f o l) = Just `fmap` bracket (mallocArray l') free -- TODO This assumes pread(2) always returns the requested number -- of bytes (\s -> pread f s l' o >> packCStringLen (s, l')) where l' :: Int l' = fromIntegral l
The implementation of a get call which sends a value on a socket, prefixed with the length of the value, becomes simple as well:
sendOnSocket :: DB -> Key -> Socket -> IO () sendOnSocket d k s = get' d k h where h :: Result -> IO () h NotFound = sendMany s zero h (Value v) = do sendMany s $ BL.toChunks $ encode $ (fromIntegral $ length v :: Length) sendAll s v h (Location f o l) = do sendMany s $ BL.toChunks $ encode l f' <- fdToHandle f sendFile' s f' (fromIntegral o) (fromIntegral l) zero = BL.toChunks $ encode (0 :: Int32)
Instead of sendFile’, some FFI binding to splice or something similar could be used as well.
In the current pseudo-implementation, a file descriptor can still be leaked (an application can call get’ with an action which stores the given Fd in some IORef), and given actions can alter the file pointer using lseek(2) and others. We could tackle this, if necessary, by duplicating the file descriptor using dup(2) and passing the copy to the application, then closing the duplicate after completion (so there’s no use for the application to store the descriptor: it’ll become invalid anyway).