OverView
News: coronita webserver available.

Announcing Malete, the database engine powering OpenIsis 1.0

from 0.9 to 1.0

Based on the 0.9 engine and especially its Tcl binding, we had a system complete enough to do very intensive application testing of all concepts, handling bibliographical and terminological as well as general industrial data. With that experience at hand, we spent the second half of 2003 giving our then two-year-old software a complete overhaul, in order to create a basis to last.

In line with the traditional beliefs of Unix design, we figured out that the best and most stable combination of robustness/performance with flexibility/convenience can be achieved by clearly separating
  • a general purpose database system
    which is very simple in order to be fast and robust and lay a solid ground for flexibility, but itself is meant to be accessed by other software (or geeks) rather than humans. While this engine is based on the Z39.2 record model (even supporting record leaders as used by MARC), it makes no special provisions to support bibliographical data or CDS/ISIS legacy, but rather tries to make this model appealing to general purpose database usage. This engine is called Malete (Kurdish for "our house"). Malete includes a database core library, a generic server and access libraries for various programming languages.
  • a CDS/ISIS-style application
    or, actually, like Winisis, a framework for applications. This is targeted at CDS/ISIS users and librarians in general. It provides support for conversion from and to a variety of known file formats including MARC, high level indexing, references (authority files, coded data), forms and so on.

In other terms, for retrieval you will rarely need more than the Malete engine (plus some formatting for presentation, which is usually done in a web programming language like PHP), while for data entry you want a convenient graphical user interface providing all sorts of lookups and checks.

technical changes

  • multiprocessing
    For a variety of reasons (detailed elsewhere) we postponed support for multi-threading, at least until the ongoing move towards compiler-supported thread-local storage is stable and widely available. Instead, write access by multiple processes is enabled through file locking. Fast and consistent caching, even for processes with a very short lifetime (like CGI scripts), is achieved by replacing the former explicit caching with memory mapping.
  • platform independence
    Both record and index data file formats are now identical across platforms (i.e. the same even on big-endian machines like Suns and Macs). Only the pointer and tree files are platform dependent, but they are rebuilt from the data as needed.
  • generalized record format
    A record with n fields is now a series of n+1 tag-value pairs. The tag of the first field is the negative total length -n-1 and the value is a record "header" consisting of the record id (MFN) and optional leader data as used by MARC. Obviously, such a series can be part of a larger record, meaning records can be easily nested.
  • simplified serialized format
    The serialized (textual) representation of a record, used both in the masterfile and the server communications protocol, has dropped low-level support for field values containing newlines. Where needed, the application must apply proper encoding (but tools for that are provided).
  • simple transaction support
    Updates to a record can optionally be qualified with the position from which the record was last read, having the update fail if the record has been modified meanwhile. Reads can be done in "consistent snapshot" mode, reflecting the state of the database at one given point in time.
  • unified message interface
    The server communications protocol has been simplified and straightened out. The masterfile now is only a special case of this protocol and thus can be directly sent to a server. Conceptually every record is a message saying "write me".
  • ucspi based server
    The server is designed to run under tcpserver, meaning it can take advantage of all of its features like access control, basic client authentication, IPv6, SSL encryption and so on.
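
The "simple transaction support" item above can be illustrated with a small in-memory sketch. Everything in it is hypothetical (the Store class, its read and write methods, the StaleUpdate exception and the use of a log index as "position"); it is not the Malete API, only a minimal model of an update that fails when the record has changed since it was read, plus a read against a fixed snapshot position.

```python
class StaleUpdate(Exception):
    """Raised when a record changed since the caller last read it."""


class Store:
    """Hypothetical append-only store; a 'position' is an index into the log."""

    def __init__(self):
        self.log = []      # every version ever written, as (mfn, value)
        self.latest = {}   # mfn -> position of its newest version

    def read(self, mfn, snapshot=None):
        """Return (position, value) of the newest version of a record.

        With snapshot given, only versions at or before that log position
        are considered, i.e. the state of the store at that point in time.
        """
        if snapshot is None:
            pos = self.latest[mfn]
            return pos, self.log[pos][1]
        for pos in range(min(snapshot, len(self.log) - 1), -1, -1):
            if self.log[pos][0] == mfn:
                return pos, self.log[pos][1]
        raise KeyError(mfn)

    def write(self, mfn, value, expected_pos=None):
        """Append a new version; fail if the record moved since expected_pos."""
        if expected_pos is not None and self.latest.get(mfn) != expected_pos:
            raise StaleUpdate(mfn)
        self.log.append((mfn, value))
        self.latest[mfn] = len(self.log) - 1
        return self.latest[mfn]
```

A caller would read a record, remember the returned position, and pass it back as expected_pos when writing. If another writer got in between, the write raises StaleUpdate and the caller can re-read and retry.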

For more details see Diff09

applications and components

OpenIsis 1.x will provide the following applications and components (probably not all in 1.0, but 1.1 should be fairly complete):
  • the Malete database server
    for Linux and other UNIX-like systems written in ISO C. This is aimed towards minimal functionality at maximum performance. Intended usage is for high volume read only processing and read/write with application controlled indexing. On UNIX, the server will be multi process based. On Windows, use of multiple processes is restricted to read-only mode.
  • Malete and OpenIsis command line tools
    for all systems, including conversion from and to legacy CDS/ISIS file formats.
  • Java, Perl and PHP libraries
    to contact a server, all written completely in the respective language. These are aimed at tight language integration, leveraging the application language's strengths and programmer's skills. Will run on all systems as supported by each language.
  • a Tcl extension and library
    where the library acts similarly to those for other languages (but based on a C implemented record) and the extension basically provides the server interface in process.
  • an application server
    for all systems (i.e. including Windows), providing database and http service, based on Tcl with or w/o Tk. While this will not achieve the high throughput of a purely C-based server, the Tcl layer can add virtually arbitrary functionality. Intended usage is for read/write with server controlled indexing and integrated http applications based on Tcl server pages. Servers based on other languages are waiting for volunteers.
  • a Tk based GUI
    for all systems. Can run standalone or acting as server and/or client.
  • the OpenIsis Tcl library
    providing support for CDS/ISIS-style applications, e.g. indexing similar to FSTs.
  • the OpenIsis application
    targeted towards users from the CDS/ISIS community, esp. librarians, to provide interoperability with existing ISIS databases and support for bibliographic formats in a user friendly way. Written in Tcl/Tk as a sister of OpenMLCM.



Malete modules

The Malete database system is structured in the following modules:
  • core
    basic C library for handling, storing and retrieving simple records.
  • pw
    "patchwork" framework for high level database services based on message passing. Some designs are borrowed from the Lisp and Smalltalk languages.
  • tool
    helper functions and command line tool including communication utilities and standalone server
  • java, perl and php
    client modules
  • tcl
    extension and base library
  • app
    the Tcl based application server
  • gui
    a generic Tk graphical user interface


On top of this, the OpenIsis 1.x application set contains:
  • old
    compatibility functions and command line tool
  • isis
    the OpenIsis library and graphical user interface


ISAM core

This implements a variant of ISAM (indexed sequential access method) based on the ideas of Z39.2 (IIF) and Z39.50 (Type-1 queries). It provides a fully open and unprotected interface for unrestricted access at maximum performance. The core library is not fully self-contained, but will require a few functions like stream I/O to be provided by each environment. It makes only very limited use of metadata, dealing with "physical" aspects like file names, locks and character sets.
  • util
    basic list, sessions, output buffers and other utilities
  • system
    services like file IO and time
  • charset
    recoding and collation
  • storage
    set of functions for database file access (master file and b-tree)


patchwork

The patchwork C library wraps the ISAM core into an extendible framework for high level database services, based on passing records as request and response messages to server objects. It provides a fully abstract and generic method call interface plus a couple of database objects.
An object dispatches messages by checking their type and other parameters and taking appropriate action, including forwarding to parent objects. This is known as the "pure object oriented" approach, as these objects don't have any other interface but the message dispatcher, especially no directly accessible data.
  • struct
    higher level operators on ISIS records a la IIF (Z39.2/ISO2709) based on meta data, including various substructures
  • base
    dispatcher wrapping the ISAM core. Based on the 0.9 server, but with some modifications to allow for most efficient message passing.
  • query
    dispatcher for ISIS/Z39.50 Type-1 style queries
  • server
    dispatcher providing record relations, views and other magic
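
The dispatch model described above (objects whose only interface is a message dispatcher, with unknown messages forwarded to a parent) might be sketched as follows. The class names, the message types "ping", "read" and "write", and the reply layout are assumptions made for illustration; the real patchwork library is written in C and its message vocabulary is not shown here. Messages are modelled as lists of (tag, value) fields whose first field has tag 0 and a tab-separated header.

```python
class Dispatcher:
    """Hypothetical server object: callers only ever use send()."""

    def __init__(self, parent=None):
        self._parent = parent

    def send(self, message):
        """Dispatch on the message type, the first word of the header field."""
        _tag, header = message[0]
        msg_type = header.split("\t")[0]
        handler = getattr(self, "_on_" + msg_type, None)
        if handler is not None:
            return handler(message)
        if self._parent is not None:
            return self._parent.send(message)   # forward to the parent object
        return [(0, "error\tunknown message type " + msg_type)]


class Base(Dispatcher):
    """Stores records; its data is reachable only through messages."""

    def __init__(self):
        super().__init__()
        self._records = {}

    def _on_write(self, message):
        mfn = message[0][1].split("\t")[1]
        self._records[mfn] = message[1:]
        return [(0, "ok\t" + mfn)]

    def _on_read(self, message):
        mfn = message[0][1].split("\t")[1]
        return [(0, "ok\t" + mfn)] + self._records.get(mfn, [])


class Front(Dispatcher):
    """Handles 'ping' itself and forwards everything else."""

    def _on_ping(self, message):
        return [(0, "pong")]
```

For example, with front = Front(parent=Base()), sending [(0, "write\t7"), (24, "title")] to front is forwarded to the Base object, which stores the record and answers with an "ok" message.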


design guidelines

requirements:
  • flexible and efficient buffered pushing of output. Pulling is not used on lower levels; every environment will solicit input on the outermost level as adequate.
  • flexible and efficient construction, manipulation and passing of records, especially embedded subrecords in the patchwork.


principles:
  • everything is a list.
    Similar to Java's String and StringBuffer, there is the immutable "Rec" and the mutable "List".
  • uniform stream output.
    Conceptually, all output is a list. There is only one (output) "Stream", which may be backed by memory buffers, files or other channels like a GUI window, so even diagnostic output can be captured.
  • negative counted subrecords.
    The patchwork uses negative counted embedding, since this allows embedded records to be passed on without any modification or copying.
  • low tag usage.
    Besides reserving all negative tags for embedding, only a minimal amount of tags should be defined. Instead subfielding will be used extensively. The patchwork message header uses tag 0, containing the message type as an indicator, followed by any number of simple options and parameters, resembling a command line (see below). Alphabetic keywords and mnemonics are favoured over numbers.
  • leader
    There always has been some out-of-band data on records like their mfn. This is now generalized in the concept of a record leader (see below).
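
Negative counted embedding can be sketched in a few lines. The Record class and the flatten and unflatten helpers below are hypothetical names, and the sketch assumes that the negative count of an outer record covers all of its pairs, including those of embedded records; the actual C data structures may differ. A record with n plain fields becomes n+1 (tag, value) pairs, the first carrying the tag -(n+1) and the header as its value.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    header: str                                  # e.g. mfn plus optional leader
    fields: list = field(default_factory=list)   # (tag, value) pairs or Records


def flatten(rec):
    """Return the record as one flat list of (tag, value) pairs."""
    body = []
    for item in rec.fields:
        if isinstance(item, Record):
            body.extend(flatten(item))   # embedded record, copied in as-is
        else:
            body.append(item)
    return [(-(len(body) + 1), rec.header)] + body


def unflatten(pairs, start=0):
    """Rebuild the Record starting at pairs[start]; return (record, next index)."""
    count, header = pairs[start]
    end = start - count                  # count is negative
    rec = Record(header)
    i = start + 1
    while i < end:
        if pairs[i][0] < 0:              # a negative tag opens an embedded record
            sub, i = unflatten(pairs, i)
            rec.fields.append(sub)
        else:
            rec.fields.append(pairs[i])
            i += 1
    return rec, end
```

Because every embedded record carries its own count, a slice of the flat list starting at a negative tag is itself a complete record and can be handed on unchanged.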


implementation notes:
  • immutable lists
    are just the same as a record embedded by negative counting, i.e. an array of fields, with the tag of the first being the negative total field count.
  • record leader
    The first field of embedded records (the one whose tag is the negative count) contains leader-like meta info; for database records this is (optional) mfn plus a MARC leader. Since there should not be a difference between the representation of embedded and first level records, every record has a leader.
  • message leader
    A record representing a message also has a leader. Where the message is not embedded, it is sent as a leading 0-tagged field. Since message leaders start with an alphabetic character, the 0 and tab are omitted in the textual representation. Message leaders use tabs as separators and start with a word indicating the message type to the dispatcher. Following subfields are parameters, with or without identifiers.
  • getopt command lines
    a command line of the form "command -aopt1 -bopt2 arg1 arg2" can be easily and canonically wrapped into one field by removing the '-' option indicator and identifying the non-option args as subfield '@'. A commandline interface thus maps easily, and without the need for looking up meta information, to a message leader, from which the method identified by "command" can fetch options using a getopt-like utility. System and db parameters are likewise stored in the options file.
  • message body
    most messages use only one type of record parameters; however, special embedded records like indexing instructions can be recognized by their leader, where applicable. Where a message contains parameter fields (first level, not leader subfields), it must use positive tags for that, preferably using low numbers.
  • direct embedding
    Where a message has no parameter fields, i.e. no parameters besides its leader's subfields and embedded records, and there is only one parameter record, the message may, as a convenient shorthand, allow specifying the embedded record's leader (mfn and db for database records) as message options and have its leader immediately followed by the record data. In other words, the message sort of embeds itself in its parameter record's leader (and has to remove itself before passing it on). This is the form used by masterfile metalines (with omitted 0).
  • system options
    can be specified on the command line or in a system options file. There is a global options list (e.g. verbosity) and there are per-db options (like file paths and readonly). The command line format is "-aglob1 -bglob2 dbname -xdb1 -ydb2 [... dbname ...]". The system options file contains (the textual representation of a record with 0-tagged) fields, one for each db, wrapped up like "dbname xdb1 ydb2" (with tabs). Those options are NOT stored in each db's .opt file or meta record.
  • database metadata
    contained in the db's .m0d file is basically a chained message to the core engine, mostly configuring the "transmission format"
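
The "getopt command lines" note above can be made concrete with a small sketch. The function names to_leader, get_opt and get_args are hypothetical, and the exact subfield layout (tab-separated, one-letter option identifiers, '@' prefixed to plain arguments) is an interpretation of the description, not a confirmed wire format.

```python
def to_leader(argv):
    """Wrap ['command', '-aopt1', 'arg1'] into one tab-separated leader field."""
    subfields = [argv[0]]                    # the message type comes first
    for arg in argv[1:]:
        if arg.startswith("-") and len(arg) > 1:
            subfields.append(arg[1:])        # option: drop the '-' indicator
        else:
            subfields.append("@" + arg)      # plain argument: subfield '@'
    return "\t".join(subfields)


def get_opt(leader, letter, default=None):
    """Fetch the value of a one-letter option from a leader field."""
    for sub in leader.split("\t")[1:]:
        if sub[:1] == letter:
            return sub[1:]
    return default


def get_args(leader):
    """Return the non-option arguments, i.e. the '@' subfields, in order."""
    return [sub[1:] for sub in leader.split("\t")[1:] if sub.startswith("@")]
```

With this mapping, "command -aopt1 -bopt2 arg1 arg2" becomes a single field holding command, aopt1, bopt2, @arg1 and @arg2 separated by tabs, and the receiving method can look up options without consulting any metadata.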



$Id: OverView.txt,v 1.5 2005/05/24 16:44:06 kripke Exp $