Malete file formats
| meta data |
The Malete options (record 0) file .m0d (with a zero; .mod used otherwise),
if present, contains a single serialized record containing metadata,
is compiled on demand to byte-order dependent .mcx files with
magic 'mcx' or 'MCX' for little-/big-endian, resp.
| record data |
The Malete record data file .mrd, a.k.a masterfile, is a plain text file.
Lines are terminated by the ASCII newline character with byte value 10.
Any superset of ASCII - including UTF-8 - may be used as character encoding.
In other words, byte values 0-127 are always interpreted according to ASCII,
values 128-255 have no special meaning.
More precisely, the byte values 10 (newline), 9 (tabulator),
45 (minus '-'), 48-57 (digits '0'-'9'), 64 (at-sign '@') and 87 ('W')
have structural significance.
Logically, the masterfile is a sequence of records,
where every record is terminated by an empty line.
A record is substructured as a sequence of non-empty lines,
each terminated by a newline.
Field lines are substructured as a number
(optional minus followed by digits) followed by a tabulator
followed by any characters but a newline as field values.
If the first line of a record does not start with a number,
it is used as header line. All other lines must be field lines.
As of now, there is only one valid header line format defined,
consisting of the letter 'W', followed by a tabulator and a
data record header. The data record header is of the form
- rid is a positive record id (a.k.a. masterfile number MFN, in ASCII digits)
- if @pos is present, pos must be the position of the previous version
of this record as byte offset from beginning of file in ASCII digits.
Malete software supports retrieval of old versions based on this.
- the optional leader is arbitrary leader data for the record.
It can hold a MARC leader or a unique key ("name") for the record.
If a record has no header line, its record id is one plus the highest
record id used up to that point in the file.
Using header lines, records can represent updated versions of
previously stored records. There is no way to delete records,
the operation next to it is to write an empty record.
During normal operation, data in the masterfile is never updated,
all writes are appends. Thus live backups and replications can be
based on 'tail -f' writing to a tape or piped through netcat.
A consistent snapshot can be made by recording the current file size
and only accessing record versions at lower positions.
Compacting a masterfile by purging old versions is straightforward,
but considered an offline operation
(basically printing current versions to a new masterfile).
Compacting a mounted database is feasible for the exclusive mode server,
but currently not supported.
The masterfile is a special case of the serialized
, see there for details.
The suggested handling of malformed records should not be relied upon.
| record access |
The Malete record access file .mrx, a.k.a cross reference (xref) file,
is a binary file dependent on the platform byte order and pagesize.
Since the xref file is accessed by memory mapping,
its size must be a multiple of the pagesize
(which is 4 KB for most Intel based systems).
Where the xref is found to be missing, have the wrong byte order or size
or be otherwise invalid, it is rebuilt from the masterfile.
The xref is organized in units of a fixed size ranging from 8 to 16 bytes.
The unit at offset rid*size is a pointer to the current version of record
with id rid in the masterfile (rid starting from 1).
This pointer consists of
When not compiled with support for large files,
only the combinations 4,3,1 and 4,4,0 are implemented.
- 4 to 8 bytes for the record's position
(including any header line)
- 3 to 6 bytes for the record's length
(including the terminating empty line)
- 0 to 2 bytes for the record's number of fields
(including 1 header field, even if it's empty)
If a record's number of fields exceeds the representable range,
the corresponding pointer bytes (if any) are set to 0 and the number
of field must be determined while reading the record.
This number is also 0 for "deleted" records (no fields, empty leader).
The first such unit (at mrx offset 0) consists of
- 3 bytes magic number, which is "MRX" for big endian, "mrx" else
- 1 byte type, computed as
(bytes for pos-4)*16+(bytes for length-3)*4+(bytes for fields)
- a 4 byte integer in native byte order holding the max used rid
- if the unit size is 10 or greater (i.e. with largefile support),
the following 2 (for 10,11) or 4 bytes are an integer in native byte order
holding high order bytes of the max record id.
| query data and access |
The malete query data .mqd and query access .mqx files hold
the leaves and forks (inner nodes) of a B-Link-Tree, resp.,
which is usually used as index associating keys generated from masterfile
records with pointers to such records.
Both files are organized in fixed size blocks containing binary integer numbers
and arbitrary key strings (encoded according to the collation configuration).
- 16 bytes of "header" data
- a "dictionary",
which is an array of 4 byte units describing entries in the block
with position, length of key and number of values.
- a "stack" holding the entries,
growing downwards from the end of the block.
For dictionary slots d..d[n], describing entries of size s0..sn,
entry i occupies si consecutive bytes starting at (block size)-(s0+...+si).
The header contains 8 unsigned binary numbers:
The first 5 numbers are actually redundant since the block number
must match the blocks file position, the three configurable bytes
must be the same among all leaves and forks, resp.,
and the level is not really needed.
However, wasting this 8 bytes serves as a minimal check
and makes handling much easier.
- num 4 bytes block number
- typ 1 byte block type (bitmask; see below)
- ksz 1 byte maximum key length
(default 0 treated like the maximum 255; CDS/ISIS uses 30)
- ptr 1 byte pointer type (bitmask; see below)
- lev 1 byte level over bottom (0 for leaves)
- nxt 4 bytes number of right sibling block (0 if none)
- ent 2 bytes number of entries (length of dictionary)
- stk 2 bytes offset of 1st byte used by the stack
In leaf blocks, the byte order of all header numbers is little endian
(so big endian machines have to swap bytes when reading/writing leaf blocks).
In fork blocks, the header numbers are in native byte order
(since they are usually not accessed by explicit reading and writing,
but through memory mapping).
The block type is:
- upper 2 bits 0xC0 give the basic block type:
0x00 for a standard leaf block,
0x40 and 0x80 for little and big endian forks,
0xC0 for leaves with unstructured values.
- next two bits are clear and reserved for future extensions;
should software see such a bit set,
it must not assume anything about the index structure.
- the bit 0x08, if set, will indicate that the index is compressed.
Each key is stored as one byte giving the length
of the common prefix to its predecessor followed by changed bytes.
This is not yet supported and current software will refuse to
access such an index.
Also do not confuse this with compressing each key individually based on
- lowest 3 bits 0x07 give the block size.
For leaves, this is freely configurable as bitcount-9,
from 0 for 512 (2^9) bits to 4 for 8KB (2^13).
For forks, this is bitcount-12, from 0 for 4KB (2^12) to 4 for 64KB(2^16),
and must match the system's pagesize (should there be a system with
a pagesize of less than 4 KB, 4KB is used).
The pointer type describes the structure of leaf values:
Pointers must use at least 4 and at most 14 (2+8+4; currently 12=2+6+4) bytes.
Pointer type 0x8B (3 byte rid + 2 byte tag + 3 byte pos)
is the same as used by CDS/ISIS.
All numbers in the pointer are stored in big endian byte order
(most significant byte first) for lexical sorting.
- upper two bits 0xC0 give the number of bytes to hold the tag
from 0 (0x00) to 2 (0x80).
- next two bits 0x30 give the number of bytes to hold the record id - 3,
from 0x00 for 3 bytes to 0x30 for 6 bytes
- the bit 0x08, if set, indicates that the tag is stored after
the record id (as in CDS/ISIS), else the tag bytes precede the record id
- lowest 3 bits 0x07 give the number of bytes to hold the position
(occurence*65536 + word position) from 0x00 for 0 bytes to 0x04 for 4 bytes.
For a value of 5 or 6, we should assume 4 byte position and one or two
additional bytes for the record id (currently not supported).
The dictionary contains 4 byte units, describing the position pos of an entry,
its number of values vln and the length of its key kln,
which is stored in the 4th byte (0 here is actually the empty key,
which is always the first in the leftmost block of every fork level).
In a fork block, the first 2 bytes store the pos in native byte order,
and the 3rd byte is vln.
In a leaf block (max size 8KB), the pos has 13bit and the vln 11bit.
The first 3 bytes in a dictionary unit are, independent of platform,
0: pos mod 256 (lower 8 bits),
1: pos div 256 (higher 5 bits) + 32 * (vln div 256) (higher 3 bits)
and 2: vln mod 256 (lower 8 bits).
Entries in the stack always start at pos with kln bytes holding the key
(as the actual key bytes, or, in a future version, compressed).
In a leaf block, after the key there are vln values sorted in increasing order
as of memcmp. Values have fixed size and structure as described by the
pointer type (typically 8 byte each).
Should the leaf block type be unstructured values, ptr is the actual
value size and no assumptions are made about the structure of values
(to be used independent of a masterfile as general purpose B-Tree).
A special value of all 0 bytes is used for stopwords;
no other value will be associated with the key.
For a tag-first pointer type (bit 0x08 clear), a pointer to tag 0
will be the first and is reserved to store unique keys:
at most one pointer to tag 0 will be associated with a key.
A leaf key currently always has at least one value;
should this key-value association be deleted, the entry is deleted completely.
(With index compression vln 0 will denote a stopword).
Should a key be associated with more values than fit within a block,
the following block starts with an entry with the same key
and next higher value.
(With index compression, we might consider using the empty key there;
to be defined).
In a fork block, currently (no index compression), only vln 0 and 1
are used. With vln 0, the key is directly followed by a 4 byte native
child block number (which is a leaf block number, if the block's level is 1,
else a fork block number). With vln not 0, they key is followed by
vln pairs of size (value size + 4), containing value bytes as in the leaves
followed by the 4 byte child block number. This is used where a key
spans leave blocks: we have a fork entry with vln 0 pointing to the key's
first block, followed by an entry with the same key plus the starting value
of the next block and so on.
(With index compression, we will use one entry with multiple values).
Fork block number 0 is always the root, and leaf block number 0
is always the leftmost leaf. All keys and values can be looped in order
by starting from leaf 0 and following the nxt pointers.
Note that the layout of leaf blocks is fully platform independent:
record pointers are organized big endian as in CDS/ISIS,
header numbers are little endian, and the dictionary is defined per byte.
| shared B-L-Tree access |
A process accessing the index in exclusive mode may actually shift
keys and values around according to the specified layout
as it seems fit to minimize unused space in blocks.
With shared access, changes are very limited:
- Deleting a key-value association is done in the leaf block only,
there are no updates to any fork or other key block.
- The same holds where an inserted key-value association fits within
the target leaf block.
- The only case where data is moved between blocks is when a new pair
does not fit in its target block: a new block is allocated to become
the new right sibling of the target block and as many entries are moved
from the block's end to its new sibling so that the new pair can be
inserted in one of the blocks. On such a block split, a new entry
is made in the block's parent fork (which might trigger a split there).
- Should a process find that the key expected in a block is greater
than the greatest key there, it must assume that a block split
occurred and follow the nxt link to inspect the block's successor.
(That's why it's called B-Link-Tree; invented by Lehmann and Yao).
Actually, as we use full fork file locking, this can only happen
in leave blocks.
| limits according to file formats |
The masterfile obviously has no limits but filesize.
To break the custom 2GB (32bit signed) barrier there are two approaches,
compiling with large file support and splitting files,
discussed further below.
In general, most 4 and 8 byte numbers are assumed signed entities,
since several system calls handle them that way.
Any 2^x here is to be understood as 2^x-1.
The xref in the small file implementation (8 byte pointers) can handle
- any legal (i.e. up to 2GB) masterfile size.
This limit can be broken by masterfile splitting.
- any record size fitting in there using 4,4,0 pointers
or records up to 16MB size using 4,3,1 pointers
(which have a slight performance advantage)
- any number of fields
- record ids up to 2^31
The full xref spec can handle
- large file sizes up to 2^63 (about 8.000.000.000.000.000.000)
- record sizes up to 2^48 (256 TB)
- record ids up to 2^63
The index can deal with
- general purpose key-value pairs of a combined length of up to 255 bytes
(might be limited to 127 with extensions like index compression)
- standard values (hit pointers) of 4 to 14 bytes (default: 8)
allow for keys of up to 241 to 251 bytes (default: 247)
- pointers to record ids up to 2^63 (currently impl. up to 2^48)
- pointers to non-negative tags up to 2^16
- pointers to position information up to 2^31
(by convention used as 1 or 2 bytes for field occurence,
last 2 bytes for word position).
- up to 2^32 leaf blocks of up to 8KB, totalling to a leaf file
size of up to 32 TB (with large file support).
Assuming an average key length of 16 and on average 10 values of 8 bytes
per key, each block can hold 81 keys, totalling about 320 billion keys.
The fork file, on an IA64 configured with pagesize 64K, can extend to 256TB.
This limit can be broken by index splitting.
| limits of the current implementation |
Records have to fit into available memory, which, even when using ridiculous
amounts of RAM and/or large swapfiles, is bound by addressable memory.
On 32 bit architectures, this is usually 1 or 2 GB (on the heap).
Also bound by addressable memory is the possibility to memory map
the access files. While they work without memory mapping,
performance degrades substantially.
So, to use very large databases, the system should be compiled
for a 64 bit machine, which are luckily becoming affordable these
days and we hope to get us such a box in the near future.
| extending file size limits |
Neither large file support nor file splitting are currently implemented.
However, large file support is pretty straightforward.
File splitting for masterfiles works by configuring a number n
and using a series of masterfiles f0, f1 ...
so that records with ids in the range i*n... (i+1)*n-1
are stored in masterfile i. Likewise a series of xref files
is used based on a number m, which should be some multiple of n.
File splitting for the B-Tree is based on configuring a sequence of keys
k0=empty, k1 ... and using a series of leaf files l0, l1 ...
so that keys less than ki and not less than k(i-1) are stored in li.
Fork files could be split on the same keys or some subsequence of those.
The advantage of file splitting over large file support is that
it is more portable, saves some bytes on file positions and could aid backup,
especially where records are mostly appended to the last masterfile.
Moreover it can extend indexes even beyond 32 TB.
The total database size need not even be addressable with 64 bits.
The disadvantage is that it is a little bit more complicated
and when used exhaustively could make the system's open files limit
become a problem. Also tracking and snapshotting masterfiles
is a little bit less trivial.
$Id: FileFormats.txt,v 1.7 2004/11/12 11:18:23 kripke Exp $