Concurrency and locking in the Malete core.
| MP vs. MT |
Unlike the OpenIsis versions up to 0.9, which were built for multithreading,
the Malete core is designed to run in multiple processes in parallel.
The major reason for this shift is ongoing problems with several
multithreading environments:
- multithreading is in a big move towards compiler-supported
thread local storage (TLS). TLS routines like pthread_getspecific
or TlsGetValue are being replaced by the __thread attribute on variables.
While this is nice and efficient, it is incompatible with the older model
(and it changes the behaviour of libc in subtle ways).
- on Linux, the 2.2/2.4 LinuxThreads are now superseded by NPTL
in the 2.6 kernels (and some backported 2.4.20s, watch out!),
which uses compiler supported TLS. While this offers a couple of advantages,
it is not worth the effort to support both during the transition.
- on BSD thread support is only emerging. On MacOS X, the implementation
has yet to prove its stability. Even Solaris, with a long track record
of reliable MT support, is switching its threading model.
- on Windows the 9x and Me versions are still in widespread use,
which lack important features to leverage the benefits of MT
(OVERLAPPED_IO and SignalObjectAndWait). While they also lack
support for synchronization of multiple read/write processes,
at least read-only multiprocessing is feasible.
- while Java can be considered by far the most stable environment
for MT programming, this only works out when actually writing pure Java.
Embedding MT C-code via JNI is not well defined and pretty intricate,
especially where the underlying MT implementation is on the move.
On the other hand, there is some demand for MP support in certain
environments, e.g. MP PHP or a parallel server running from
wrappers like tcpserver.
Thus, for now, Malete focuses on MP and leaves a reworked MT version
for better times, when stable TLS support is widely available.
| shared resources |
The only resources shared are (regions of) files,
i.e. there is no shared memory (besides mmapped files),
no queues, pipes, sockets or other resources.
In typical usage, read access is much more frequent than writing.
Therefore the means of coordinating access to shared resources
- must support multiple processes
- should support read/write locks
(i.e. shared vs. exclusive locking modes) to allow for concurrent readers
- should support locking and unlocking of regions of open files
to support concurrent writers
With some limitations to be discussed further below,
this can be implemented based on file locking.
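On POSIX systems, these requirements map directly onto fcntl record locks.
Here is a minimal sketch, with illustrative helper names that are not part
of the Malete sources: F_RDLCK gives a shared lock, F_WRLCK an exclusive
one, and F_SETLKW waits until the lock can be granted.

```c
#include <fcntl.h>
#include <unistd.h>

/* Advisory lock on a single byte of an open file.
   type is F_RDLCK (shared) or F_WRLCK (exclusive);
   F_SETLKW waits until the lock can be granted. */
static int lock_byte(int fd, short type, off_t byte) {
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = byte, .l_len = 1 };
    return fcntl(fd, F_SETLKW, &fl);
}

static int unlock_byte(int fd, off_t byte) {
    struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET,
                        .l_start = byte, .l_len = 1 };
    return fcntl(fd, F_SETLKW, &fl);
}
```

Because such locks are advisory, they coordinate only processes that agree
to use them; the byte locked need not be a byte actually operated on.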
Features we do NOT need:
- locking over network file systems means looking for trouble.
While this MAY work in simple cases, it is NOT RECOMMENDED!
If you really can't avoid accessing remote database files via NFS or SMB,
the only reasonable use of locks is to "protect" read-only access.
DO NOT even consider writing your valuable data to remote storage.
Better run a database server where the disks are.
- mandatory locking means looking for trouble,
DO NOT mount with mandatory locking support,
DO NOT set the mandatory locking file mode!
- deadlock detection.
The usage of locking is deadlock free.
(Unless mandatory locking is enabled, so don't do that).
| concurrency modes |
There are three modes of concurrency to distinguish:
- read-only mode:
any number of processes holding shared locks on whole files.
Every process may read at any time, no process may write.
This is the most efficient mode for high volume query processing.
- exclusive mode:
a single process holding exclusive locks on whole files.
The process may or may not write to the files.
A simple, safe and portable mode to allow writing.
- shared mode:
multiple processes using temporary locks on regions of files.
Every process may read and write. Supported on UNIX-style systems only.
The first two modes, using locks held for the lifetime of the process,
are trivially correct. Actually, older OpenIsis versions used such locks.
Shared mode is much more complicated and deserves more detailed inspection,
which is done in the remainder of this document.
| shared mode |
First of all: try to avoid it.
For development and typical data entry use, an exclusive mode single process
server will do perfectly well. For high volume query processing,
use a separate copy of the database in read-only mode.
Malete databases are designed to be easily copied and backed up.
This is intrinsically much more efficient and reliable than combining
high read and write load. Moreover, cleanup tasks like data and tree
compaction can be safely done in an exclusive writer,
but would require unduly synchronization effort in shared mode.
Having that said, here are the gory details.
Both the records (r) and the index entries (q) use two files each:
- a "data" (d) file
(the plain masterfile .mrd and the BTree leaves chain .mqd, resp.)
- an "access" (x) file for faster lookup
(the record xref file .mrx and BTree inner nodes .mqx, resp.)
The access and data files need to be in sync,
thus write access needs to be synchronized to some extent.
Typically, but not necessarily, the application uses index entries
in turn as pointers for the records; see below.
Basically, all files are accessed using (unbuffered) native file IO.
The access files, however, are memory mapped where possible.
In summary, we use
- one "record lock" per database guarding any record write
- a "xref lock" for every record
- one "tree lock" per database guarding BTree inner node access
- a "leaf lock" for every BTree leaf block
To read a record:
- obtain the shared xref lock
- look up the xref
- release the xref lock
- read the record
(does not need a lock since records are immutable)
Writing a new or changed record is done by:
- acquire the exclusive record lock
- append to the end of the masterfile
- obtain the exclusive xref lock
- write to the xref file (using msync where mmapped)
- release the xref lock
- release the record lock
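The record write protocol can be sketched as follows; the sketch assumes
an already open masterfile (mrd) and xref file (mrx), and uses the lock
byte layout described under operating system considerations (record lock
on byte 0 of the masterfile, xref of record id n on byte n). All helper
names and parameters are illustrative, not taken from the Malete sources.

```c
#include <fcntl.h>
#include <unistd.h>

static int lock_b(int fd, short type, off_t b) {
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET,
                        .l_start = b, .l_len = 1 };
    return fcntl(fd, F_SETLKW, &fl);
}

/* Append a record to the masterfile and update its xref entry.
   id is the record id (1..); xoff is the entry's offset in the
   xref file. Returns 0 on success, -1 on error. */
static int write_record(int mrd, int mrx, long id,
                        const void *rec, size_t rlen,
                        off_t xoff, const void *xref, size_t xlen) {
    int rc = -1;
    if (lock_b(mrd, F_WRLCK, 0)) return -1;          /* record lock */
    if (lseek(mrd, 0, SEEK_END) >= 0 &&
        write(mrd, rec, rlen) == (ssize_t)rlen) {    /* append */
        if (!lock_b(mrd, F_WRLCK, id)) {             /* ex xref lock */
            if (pwrite(mrx, xref, xlen, xoff) == (ssize_t)xlen)
                rc = 0;
            lock_b(mrd, F_UNLCK, id);                /* release xref */
        }
    }
    lock_b(mrd, F_UNLCK, 0);                         /* release record */
    return rc;
}
```

A reader would take only the shared xref lock around the xref lookup;
the record read itself needs no lock, since records are immutable.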
When searching the index
- the shared tree lock is acquired, the tree is read to find the leaf number,
and the tree lock is released
- a shared lock on the leaf block is acquired and the leaf is read.
The shared lock is released immediately after reading.
- on split detection (Lehman/Yao) successive leaf blocks are read alike
The steps to write an index entry are:
- find the target leaf block, searching as above,
but using exclusive leaf locks. On split detection,
an ex leaf lock is released only after locking and reading a successor.
The final ex lock is held until after the leaf block has been written.
- if the write involves a split, lock the block after the end
of the leaves file and double check you really got a brand new block
(it must not be readable, else, repeat)
- write the block or blocks
- if the write involved a split, obtain the exclusive tree lock.
Iteratively insert pointers to any newly created blocks.
- release all locks
To support a unique key, the record lock is held while writing / looking up
the record's key in the index and until after the record has been written.
| operating system considerations |
Locking an mmapped tree file may not work on some systems.
The rationale is that accesses to mapped memory cannot be
checked against arbitrary mandatory-locked regions.
On some systems locking mapped files is completely ruled out;
others, like Linux, deny only mandatory locks.
Solaris allows locks on the whole file only (not regions).
Since all locks are advisory, however,
there is no need to put the lock on the actual bytes affected by an operation.
- The record lock is on byte 0 of the masterfile .mrd,
and the xref on record id n (1..) locks byte n.
- The tree lock is on byte 1 of the leaves file .mqd,
and the leaf block n (0..) locks byte 2*n.
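The lock byte layout above is easy to encode; the following constants and
helpers are illustrative, not taken from the Malete sources.

```c
/* Advisory lock byte layout; these are lock offsets, not data offsets. */
enum { RECORD_LOCK_BYTE = 0,   /* masterfile .mrd: guards record writes */
       TREE_LOCK_BYTE   = 1 }; /* leaves file .mqd: guards inner nodes  */

static long xref_lock_byte(long id) { return id; }      /* record id 1.. */
static long leaf_lock_byte(long n)  { return 2 * n; }   /* leaf block 0.. */
```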
Consequently, in read-only and exclusive mode, full locks on the
data files .m?d are sufficient to prevent conflicts with other writers.
While this statement with regard to read-only and exclusive mode processes
can be considered to constitute an interface,
the details of shared mode coordination may change
(i.e. which bytes are locked in which order, see considerations below).
Since M$ Windows does not offer support for serious programming,
we are limited to the exclusive and read-only modes:
The 9x/Me family is even lacking shared file region locks and locks that can
be waited upon, not to mention any reasonable means of signaling.
Memory mappings here are of limited use since they might copy the file to swap.
The NT-based versions have a couple of the necessary calls like LockFileEx;
Still, all this is pretty tedious and the semantics of memory mappings
(CreateFileMapping) in the presence of writers is problematic at best.
| optimizations to consider |
- where accessing a memory mapped xref can be considered atomic,
xref locks are not needed, meaning readers do not need any lock at all.
For 8 byte xrefs, we can easily avoid being interrupted by page faults,
so some architectures may support this.
- where the effect of writing a leaf block is atomic,
i.e. seen completely or not at all by any read,
readers do not need to use locks on leaves.
This may hold only under very specific conditions:
We must have neither SMP nor the new kernel preemption enabled,
must have ext2 or similar filesystem, the file region must fit
one cache page and the userspace region must not pagefault
(i.e. we need a piece of locked memory).
A block always fits within one cache page if the blocksize is
a power of two no larger than the cache page size.
While 4K is a usual value for the cache page size
(as well as the pagesize as well as the filesystem block size),
we use 1K (which, under Linux, is the minimum for all of these)
leaf blocks to be reasonably safe.
- for writers, there is next to no more concurrency we could achieve.
The minimal part of the changed masterfile structure that needs to be
visible is its new size, and basically we are setting exactly this
with the one write operation. With any other scheme along the lines
of "first extend, then lock only that area" and so on we would only
get a couple of additional syscalls plus integrity issues
with much more difficult crash recovery.
In particular, we do not use a separate lock on the access file.
- one might consider delaying a sync-to-disk (fsync/msync)
until after the lock is released. However, at least a
msync using MS_ASYNC and MS_INVALIDATE should be issued.
(On Linux, MS_INVALIDATE does nothing, since maps are always coherent).
- locking the whole tree might seem pretty restrictive,
however, it saves a lot of syscalls and the operations on the
mmap should be very fast. Concurrency is affected only by
writes to the tree, which are fairly rare
(on about 3% of inserts, depending on sizes and fill level).
- for non mmapped tree files, however, the classical per-block
locking scheme could be considered to reduce worst case delays.
Locks on the odd leaves file bytes are reserved for this purpose.
Yet this involves substantial overhead on every read.
| considering writer starvation |
In the presence of readers holding shared locks,
we can not expect fcntl to avoid writer starvation.
If there is always at least one process holding a
shared lock, a writer may wait indefinitely for an exclusive lock.
This can trivially be avoided by only using exclusive locks,
which typically will have a simple and fair FIFO implementation.
Clearly this sacrifices most concurrency in a situation where
it is obviously demanded.
A more sophisticated approach is to guard the acquisition of a lock A,
whether shared or exclusive, by an additional exclusive "guard" lock Ag
in the sequence: get Ag - get A - release Ag.
That way no process can get A while another process is waiting for A.
This double locking should not be used for leaf reads,
because continued overlapping of shared locks on leaves means such
a pathological congestion that the system will croak anyway.
The record and tree locks, on the other hand, are held by readers
only while they are inspecting memory mapped files, which should
involve no disk accesses if you are going for high throughput.
More precisely, we should only very rarely see a process suspended for
any reason while holding such a shared lock,
and thus overlapped shared locks should be rare.
To summarize, in almost any case of high volume querying it is advisable to
set up a separate read-only copy of the database instead of increasing
system load by doubling the number of locks.
According to these considerations,
we might use the aforementioned optimizations to reduce writer delays,
but do not care about absolutely avoiding writer starvation.
| unmounting databases |
Whenever a process creates a new database, other processes may
access it by opening it on demand, just like any old database.
There is no additional interprocess communication needed to make
a database available.
Since the shared mode is meant to be used in high volume production
environments, we assume any available database to remain available without
structural changes, and support such changes only in the exclusive server,
where they can be handled internally.
For completeness, however, here is an outline of the steps that would
be needed to support unmounting databases in a shared environment.
Changes to a database, including making it unavailable,
are handled by one process obtaining the exclusive "options" lock.
Conceptually this can be regarded as lock on the options file,
with a process using the database holding a read lock on the options
and a process changing database options requiring a write lock.
In such an environment,
- any process using a database for reading and/or writing without changing
options holds the shared options lock.
- a process that wants to change a database must obtain the exclusive
options lock to mount the database in single process mode.
- before waiting on the exclusive options lock,
the process acquires guard locks
and notifies other processes to release their locks.
To handle synchronization, three locks are actually used:
- the "options lock" is held shared or exclusive while the database is in use
- the shared or exclusive "use lock" guards any attempt to obtain
the options lock
- the exclusive "change lock" guards any attempt to obtain an exclusive
use and options lock
A reader (shared user)
- asks for the shared use lock without waiting.
- if this fails, another process is attempting
modification and the reader must close the database.
- else, no process must have the exclusive options lock.
The reader obtains a shared options lock and releases the use lock.
A writer (about to structurally change the database)
- first asks for the exclusive change lock without waiting.
- if this fails, another process is attempting modification
and the writer must close the database and bail out.
- else, no process must have the exclusive use lock.
Then the exclusive use lock is acquired with waiting.
(Since there are shared use locks held for very short times,
this has a slight risk of writer starvation and could,
for paranoia's sake, be guarded by yet another lock).
- finally other processes are notified and the writer
waits for the exclusive options lock (probably using a timeout).
- once done with changes, the writer releases locks in reverse order.
Notifying other processes:
- for the unix multi process server,
we have all processes share the same process group,
so they can be signaled by kill(0,sig).
On most systems we should use SIGWINCH and SIGURG,
because they are ignored by default (so we don't kill our tcpserver).
- while SIGURG will have any running request aborted,
processes block SIGWINCH during normal request processing
and receive it only when about to read a new request.
To make a database finally unavailable,
a process should move or remove the files while holding the exclusive locks.
Since locks are obtained on open files, other processes may have opened
the files before this and finally succeed in obtaining the options lock.
To detect this race condition,
a process must use stat and fstat to ensure, after obtaining the options lock,
that the file it opened is still the same file the path refers to.
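The dev/inode comparison that detects this race might look like the
following (the function name is illustrative):

```c
#include <sys/stat.h>

/* After obtaining the options lock, verify that the file we opened
   is still the one the path names: compare device and inode numbers.
   If they differ (or the path is gone), the database was moved or
   removed while we waited, and we must close and retry or give up. */
static int still_same_file(const char *path, int fd) {
    struct stat by_path, by_fd;
    if (stat(path, &by_path) || fstat(fd, &by_fd)) return 0;
    return by_path.st_dev == by_fd.st_dev &&
           by_path.st_ino == by_fd.st_ino;
}
```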