Announcing Malete, the database engine powering OpenIsis 1.0
| from 0.9 to 1.0 |
Based on the 0.9 engine and especially its Tcl binding,
we had a system complete enough to do very intensive application testing
of all concepts, handling bibliographical and terminological
as well as general industrial data.
With those experiences at hand, we spent the second half of 2003
giving our then two-year-old software a complete overhaul,
in order to create a basis to last.
Following the traditional principles of Unix design, we concluded
that the best and most stable combination of robustness/performance
with flexibility/convenience can be achieved by clearly separating
- a general purpose database system
which is very simple in order to be fast and robust and lay
a solid ground for flexibility, but itself is meant to be
accessed by other software (or geeks) rather than humans.
While this engine is based on the Z39.2 record model
(even supporting record leaders as used by MARC), it makes no special
provisions to support bibliographical data or CDS/ISIS legacy,
but rather tries to make this model appealing to general purpose
database usage. This engine is called Malete (Kurdish for "our house").
Malete includes a database core library, generic server and
access libraries for various programming languages.
- a CDS/ISIS-style application
or, actually, like Winisis, a framework for applications.
This is targeted at CDS/ISIS users and librarians in general.
It provides support for conversion from and to a variety of
known file formats including MARC, high level indexing,
references (authority files, coded data), forms and so on.
In other terms, for retrieval you will rarely need more than the
Malete engine (plus some formatting for presentation, which is
usually done in a web programming language like PHP),
while for data entry you want a convenient graphical user interface
providing all sorts of lookups and checks.
| technical changes |
For a variety of reasons (detailed elsewhere) we postponed support
for multi-threading (at least until the ongoing move towards
compiler-supported thread-local storage is stable and widely available).
Instead, write support from multiple processes is enabled based on
file locking. Fast and consistent caching, even for processes with
a very short lifetime (like CGI scripts), is achieved by replacing
the former explicit caching with memory mapping.
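
The combination of file locking for writers and memory mapping for readers
can be illustrated with a minimal sketch. This is not Malete's C code: the
function names and the one-line-per-record file layout are assumptions made
for illustration, and the example relies on Python's Unix-only fcntl module.

```python
import fcntl
import mmap
import os

def append_record(path, line):
    """Append one line under an exclusive advisory lock; return its byte offset."""
    data = line.encode("utf-8") + b"\n"
    with open(path, "ab") as f:
        fcntl.flock(f, fcntl.LOCK_EX)       # blocks while another writer holds the lock
        try:
            pos = f.seek(0, os.SEEK_END)    # end of file as seen while holding the lock
            f.write(data)
            f.flush()                       # hand the data to the OS before unlocking
            return pos
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

def read_line_at(path, pos):
    """Read the line starting at byte offset pos through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            end = m.find(b"\n", pos)
            if end < 0:                     # no trailing newline: read to end of file
                end = len(m)
            return m[pos:end].decode("utf-8")
```

A short-lived reader only pays for the pages it touches, and the operating
system's page cache is shared between processes, which is the effect the
paragraph above attributes to memory mapping.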
- platform independence
Now both record and index data file formats are identical across
platforms (i.e. the same even on big-endian machines like Suns and Macs).
Only the pointer and tree files are platform dependent,
but are rebuilt from the data as needed.
- generalized record format
A record with n fields is now a series of n+1 tag-value pairs.
The tag of the first field is the negative total length -n-1
and the value is a record "header" consisting of the record id (MFN)
and optional leader data as used by MARC. Obviously, such a series
can be part of a larger record, meaning records can be easily nested.
- simplified serialized format
The serialized (textual) representation of a record,
used both in the masterfile and the server communications protocol,
has dropped low-level support for field values containing newlines.
Where needed, the application must apply proper encoding
(but tools for that are provided).
- simple transaction support
Updates to a record can optionally be qualified with the position
from which the record was last read, so that the update fails
if the record has been modified in the meantime. Reads can be done in
"consistent snapshot" mode, reflecting the state of the database
at one given point in time.
- unified message interface
The server communications protocol has been simplified and straightened out.
The masterfile now is only a special case of this protocol and thus
can be directly sent to a server. Conceptually every record is a
message saying "write me".
- ucspi based server
The server is designed to run under tcpserver, meaning it can take
advantage of all of its features like access control, basic client
authentication, IPv6, SSL encryption and so on.
For more details see
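
The generalized record format from the list above (n fields stored as n+1
tag-value pairs, the first tag being the negative total length) can be
sketched as follows. The tuple representation, the function names, the tags
10, 100 and 200, and the use of a plain string for the header are all
assumptions for illustration, not Malete's actual in-memory layout.

```python
def make_record(header, fields):
    """Return a flat list of (tag, value) pairs: one header pair plus the fields.

    The first tag is the negative total length, so a record with n fields
    starts with the tag -(n+1).
    """
    return [(-(len(fields) + 1), header)] + list(fields)

def walk(pairs):
    """Split a flat series of pairs into plain fields and embedded records."""
    out = []
    i = 0
    while i < len(pairs):
        tag, _value = pairs[i]
        if tag < 0:
            size = -tag                     # number of pairs the record occupies
            out.append(("record", pairs[i:i + size]))
            i += size
        else:
            out.append(("field", pairs[i]))
            i += 1
    return out

# An inner record (header "42" standing in for an MFN) nested in an outer one.
inner = make_record("42", [(100, "Title"), (200, "Author")])
outer = make_record("7", [(10, "note")] + inner)
```

Because the inner record is copied into the outer field list unchanged, it
can later be sliced out again and passed on as a record in its own right.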
| applications and components |
OpenIsis 1.x will provide the following applications and components
(probably not all in 1.0, but 1.1 should be fairly complete):
- the Malete database server
for Linux and other UNIX-like systems written in ISO C.
This is aimed towards minimal functionality at maximum performance.
Intended usage is for high volume read only processing and
read/write with application controlled indexing.
On UNIX, the server will be multi process based.
On Windows, use of multiple processes is restricted to read-only mode.
- Malete and OpenIsis command line tools
for all systems providing several tools including conversion
from and to legacy CDS/ISIS file formats.
- Java, Perl and PHP libraries
to contact a server, all written completely in the respective language.
These are aimed at tight language integration,
leveraging the application language's strengths and programmer's skills.
Will run on all systems as supported by each language.
- a Tcl extension and library
where the library acts similarly to those for other languages
(but based on a C implemented record) and the extension basically
provides the server interface in process.
- an application server
for all systems (i.e. including Windows),
providing database and http service, based on Tcl with or w/o Tk.
While this will not achieve the high throughput of a purely C-based
server, the Tcl layer can add virtually arbitrary functionality.
Intended usage is for read/write with server controlled indexing
and integrated http applications based on Tcl server pages.
Servers based on other languages are waiting for volunteers.
- a Tk based GUI
for all systems. Can run standalone or act as server and/or client.
- the OpenIsis Tcl library
providing support for CDS/ISIS-style applications, e.g. indexing
similar to FSTs.
- the OpenIsis application
targeted towards users from the CDS/ISIS community, esp. librarians,
to provide interoperability with existing ISIS databases and support
for bibliographic formats in a user friendly way.
Written in Tcl/Tk as a sister of OpenMLCM.
| Malete modules |
The Malete database system is structured in the following modules:
basic C library for handling, storing and retrieving simple records.
"patchwork" framework for high level database services based on message
passing. Some designs are borrowed from the Lisp and Smalltalk languages.
helper functions and command line tool including
communication utilities and standalone server
- java, perl and php
extension and base library
the Tcl based application server
a generic Tk graphical user interface
On top of this, the OpenIsis 1.x application set contains:
compatibility functions and command line tool
the OpenIsis library and graphical user interface
| ISAM core |
This implements a variant of ISAM (indexed sequential access method)
based on the ideas of Z39.2 (IIF) and Z39.50 (Type-1 queries).
It provides a fully open and unprotected interface
for unrestricted access at maximum performance.
The core library is not fully self contained,
but will require a few functions like stream I/O to be provided
by each environment.
It makes only very limited use of metadata,
dealing with "physical" aspects like file names, locks and character sets.
basic list, sessions, output buffers and other utilities
services like file IO and time
recoding and collation
set of functions for database file access (master file and b-tree)
| patchwork |
The patchwork C library wraps the ISAM core into an extensible
framework for high level database services,
based on passing records as request and response messages to server objects.
It provides a fully abstract and generic method call interface
plus a couple of database objects.
An object dispatches messages by checking their type and other parameters
and taking appropriate action, including forwarding to parent objects.
This is known as the "pure object oriented" approach,
as these objects don't have any other interface but the message dispatcher,
especially no directly accessible data.
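
The dispatching style described above can be sketched in a few lines. This
is an illustration only, not the patchwork API: the class names, the message
types "write", "read", "ok" and "error", and the tag 24 in the usage example
are hypothetical. Messages are modelled as lists of (tag, value) pairs whose
first pair carries a tab-separated type and parameters.

```python
class Dispatcher:
    """An object whose only public interface is send()."""

    def __init__(self, parent=None):
        self._parent = parent
        self._handlers = {}

    def send(self, message):
        """Dispatch on the message type; forward to the parent if unhandled."""
        mtype = message[0][1].split("\t")[0]
        handler = self._handlers.get(mtype)
        if handler is not None:
            return handler(message)
        if self._parent is not None:
            return self._parent.send(message)
        return [(0, "error\tunknown message type " + mtype)]

class Store(Dispatcher):
    """Keeps records in memory and answers 'write' and 'read' messages."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._records = {}
        self._handlers = {"write": self._write, "read": self._read}

    def _write(self, message):
        mfn = message[0][1].split("\t")[1]
        self._records[mfn] = message[1:]
        return [(0, "ok\t" + mfn)]

    def _read(self, message):
        mfn = message[0][1].split("\t")[1]
        return [(0, "ok\t" + mfn)] + self._records.get(mfn, [])

class Counter(Dispatcher):
    """Counts messages, then forwards every one of them to its parent."""

    def __init__(self, parent):
        super().__init__(parent)
        self._seen = 0

    def send(self, message):
        self._seen += 1
        return super().send(message)
```

For example, `Counter(Store()).send([(0, "write\t1"), (24, "Hello")])` is
handled by the Store, since the Counter has no handler of its own for
"write" and forwards the message unchanged.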
higher level operators on ISIS records a la IIF (Z39.2/ISO2709)
based on meta data, including various substructures
dispatcher wrapping the ISAM core.
Based on the 0.9 server, but with some modifications to allow
for most efficient message passing.
dispatcher for ISIS/Z39.50 Type-1 style queries
dispatcher providing record relations, views and other magic
| design guidelines |
- flexible and efficient buffered pushing of output.
Pulling is not used on lower levels;
every environment will solicit input on the outermost level as adequate.
- flexible and efficient construction, manipulation and passing
of records, especially embedded subrecords in the patchwork.
- everything is a list.
Similar to Java's String and StringBuffer,
there is the immutable "Rec" and the mutable "List".
- uniform stream output.
Conceptually, all output is a list. There is only one (output) "Stream",
which may be backed by memory buffers, files or other channels like a GUI
window, so even diagnostic output can be captured.
- negative counted subrecords.
The patchwork uses negative counted embedding, since this allows
embedded records to be passed on without any modification or copying.
- low tag usage.
Besides reserving all negative tags for embedding, only a minimal amount
of tags should be defined. Instead subfielding will be used extensively.
The patchwork message header uses tag 0, containing the message type
as an indicator, followed by any number of simple options and
parameters, resembling a command line (see below).
Alphabetic keywords and mnemonics are favoured over numbers.
There has always been some out-of-band data on records, like their mfn.
This is now generalized in the concept of a record leader (see below).
- immutable lists
are just the same as a record embedded by negative counting,
i.e. an array of fields, with the tag of the first being the negative
total field count.
- record leader
The tag of the first field of embedded records contains leader-like meta info;
for database records this is (optional) mfn plus a MARC leader.
Since there should not be a difference between the representation of
embedded and first level records, every record has a leader.
- message leader
A record representing a message also has a leader.
Where the message is not embedded, it is sent as a leading 0-tagged field.
Since message leaders start with an alphabetic character,
the 0 and tab are omitted in the textual representation.
Message leaders use tabs as separators and start with a word
indicating the message type to the dispatcher.
Following subfields are parameters, with or without identifiers.
- getopt command lines
a command line of the form "command -aopt1 -bopt2 arg1 arg2" can be
easily and canonically wrapped into one field by removing the '-'
option indicator and identifying the non-option args as subfield '@'.
A commandline interface thus maps easily, and without the need
for looking up meta information, to a message leader,
from which the method identified by "command" can fetch options
using a getopt-like utility. System and db parameters are likewise
stored in the options file.
- message body
most messages use only one type of record parameters;
however, special embedded records like indexing instructions
can be recognized by their leader, where applicable.
Where a message contains parameter fields (first level, not leader subfields),
it must use positive tags for that, preferably using low numbers.
- direct embedding
Where a message has no parameter fields,
i.e. no parameters besides its leader's subfields and embedded records,
and there is only one parameter record, the message may,
as a convenient shorthand, allow the embedded record's leader
(mfn and db for database records) to be given as message options,
with the message leader immediately followed by the record data.
In other words, the message sort of embeds itself in its parameter
record's leader (and has to remove itself before passing it on).
This is the form used by masterfile metalines (with omitted 0).
- system options
can be specified on the command line or in a system options file.
There is a global options list (e.g. verbosity) and per-db options
(like file paths and readonly). The commandline format is
"-aglob1 -bglob2 dbname -xdb1 -ydb2 [... dbname ...]".
The system options file contains (the textual representation of a
record with 0-tagged) fields, one for each db, wrapped up like
"dbname xdb1 ydb2" (with tabs). Those options are NOT stored
in each db's .opt file or meta record.
- database metadata
contained in the db's .m0d file is basically a chained
message to the core engine, mostly configuring the "transmission format"
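
The mapping from a getopt-style command line to a message leader, as
described under "getopt command lines" above, can be sketched like this.
The function names are hypothetical, and giving each non-option argument its
own '@' subfield is an assumption; the text only says that such arguments
are identified as subfield '@'.

```python
def argv_to_leader(argv):
    """Wrap ["command", "-aopt1", "arg1", ...] into one tab-separated leader."""
    parts = [argv[0]]                       # the message type comes first
    for arg in argv[1:]:
        if arg.startswith("-") and len(arg) > 1:
            parts.append(arg[1:])           # drop the '-' option indicator
        else:
            parts.append("@" + arg)         # non-option argument
    return "\t".join(parts)

def get_option(leader, letter, default=None):
    """Return the value of the first subfield starting with the given letter."""
    for sub in leader.split("\t")[1:]:
        if sub[:1] == letter:
            return sub[1:]
    return default

def get_args(leader):
    """Return the non-option arguments in their original order."""
    return [sub[1:] for sub in leader.split("\t")[1:] if sub.startswith("@")]

leader = argv_to_leader(["command", "-aopt1", "-bopt2", "arg1", "arg2"])
```

No metadata lookup is needed in either direction: the option letter doubles
as the subfield identifier, so the receiving method can fetch what it needs
with a getopt-like helper such as `get_option(leader, "b")`.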
$Id: OverView.txt,v 1.5 2005/05/24 16:44:06 kripke Exp $