OpenIsis/Malete field definition and record structures
| overview |
A Malete record is a sequence of one or more fields.
The first one is called the header; all others are identified by a numeric tag.
As far as the Malete database core is concerned,
a field may contain arbitrary bytes except newline characters.
Assuming anything about the structure of field data,
including any encoding of binary data,
is solely at the application's discretion.
As Malete is designed to be a multi-purpose database engine,
there is no special schema enforced.
However, there is a schema suggested and used by the OpenIsis application.
In the database's metadata record,
fields with tag (00)6 are reserved for this purpose (abuse at your own risk).
The rationale of this field definition is to provide enough flexibility
to efficiently support representations of all structures found in Z39.2
based systems (including but transcending the traditional CDS/ISIS software),
especially the various MARC formats, as well as full representations of
data commonly stored and transmitted in a couple of other formats like
MIME and XML.
The term "representation" means that Malete will not bother to
directly support XML's angle brackets nor XML's/MIME's foo="bar" options
nor the subfield delimiter characters of MARC or CDS/ISIS.
Rather, for any such data there should be a lossless transformation to an
efficient representation in some format described by this field definition.
| structure of fields |
While fields may be used to hold a single value,
it is a common technique to treat them as a sequence of subfields.
("A data element considered as a component of a field.", Z39.2).
A field may contain, in that order:
- 0 or more positional subfields of fixed length
- 1 or more positional subfields of variable length
- 0 or more identified subfields of variable length
Fixed length subfields end after as many bytes (not characters!) as given by
their length. They are typically used for data coded in some ASCII values.
Neither UTF-8 characters nor the delimiter character should be stored
in fixed length subfields (however, it's up to the application to exercise care).
Variable length subfields end at a delimiter character or end of field.
Malete by default uses a tabulator as delimiter,
and import of CDS/ISIS databases converts the caret (hat '^') to tabs;
however, applications are free to use any delimiter they want.
Positional subfields are identified by their position within the field,
i.e. by counting that many bytes and delimiters.
Of course, there is only one nth position within a field,
i.e. every positional subfield can occur at most once.
Since the first n bytes and first m delimited subfields are used as the
positional subfields, they may be omitted only if end of field is seen,
i.e. all other subfields are omitted.
Identified subfields, on the other hand, start with a single character
identifying the subfield, just like fields in a record are identified by a tag.
Applications unaware of UTF-8 may demand a single byte as identifier.
Where portability is an issue, only ASCII letters and digits should be used.
Since there is at least one positional variable subfield,
identified subfields always start after a delimiter (in accordance with Z39.2).
An identified subfield may occur zero, one or more times in a field.
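The splitting rules above can be sketched in a few lines of Python.
This is only an illustration, not Malete code: the function name and the
parameters fixed_lengths and n_positional are hypothetical stand-ins for what
a field definition would supply, and identifiers are assumed to be single bytes.

```python
def split_field(data, fixed_lengths=(), n_positional=1, delimiter=b"\t"):
    """Split raw field bytes into (fixed, positional, identified) subfields.

    Assumptions (not taken from Malete itself): fixed_lengths lists the byte
    lengths of the fixed subfields, n_positional is the number of positional
    variable subfields, and identifiers are a single byte.
    """
    fixed = []
    pos = 0
    # Fixed length subfields are cut by byte count, not by character count.
    for length in fixed_lengths:
        fixed.append(data[pos:pos + length])
        pos += length
    # Everything after the fixed part is delimited.
    parts = data[pos:].split(delimiter)
    positional = parts[:n_positional]
    # Identified subfields: first byte is the identifier, the rest the value.
    # Empty parts carry no identifier and are skipped in this sketch.
    identified = [(p[:1], p[1:]) for p in parts[n_positional:] if p]
    return fixed, positional, identified
```

For instance, split_field(b"text/plain\tciso8859-1") yields no fixed subfields,
one positional subfield b"text/plain" and one identified subfield (b"c", b"iso8859-1").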
The MAIN VALUE of a field contains the fixed length subfields together with
the first positional variable subfield. Sloppy applications may use anything
up to the first delimiter, assuming that fixed subfields do not contain it.
In the common situation of having no fixed length subfields,
the main value equals the first positional subfield.
The main value in a field is very similar to a record's header
and commonly used as a key to select a field in a record.
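As a rough sketch of how the main value might be extracted and used as a key,
assuming a record is held as a list of (tag, bytes) pairs (this representation
and the helper names are hypothetical, not part of Malete):

```python
def main_value(data, fixed_lengths=(), delimiter=b"\t"):
    """Return the fixed subfields plus the first positional variable subfield.

    The delimiter search starts after the fixed part, so a fixed subfield
    that happens to contain the delimiter does not cut the value short.
    """
    start = sum(fixed_lengths)
    end = data.find(delimiter, start)
    return data if end < 0 else data[:end]


def sloppy_main_value(data, delimiter=b"\t"):
    """The 'sloppy' variant: anything up to the first delimiter."""
    return data.split(delimiter, 1)[0]


def select(record, tag, key):
    """Pick the fields of a record with the given tag and main value.

    record is assumed to be a list of (tag, bytes) pairs.
    """
    return [value for t, value in record if t == tag and main_value(value) == key]
```

Both variants agree as long as fixed subfields do not contain the delimiter.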
The properties of subfields stated so far are consequences of their very
definition. Additional properties, e.g. the main value being empty
or an identified subfield having a fixed length and/or occurring exactly once,
may be demanded by the field definition.
It is the application's responsibility to make sure records do not violate
the field definition; the Malete server will happily store whatever it receives.
| definition of fields |
The field definition uses fields of the metadata record,
one per field and one per subfield.
These fields themselves do not use fixed length subfields.
The main value is a (non-unique) key:
- 'tag' for a field definition,
where tag is an integer. Negative numbers are reserved for counted structures.
By convention, general application data fields should
use tags 100 - 999.
- 'tag#len' for a fixed subfield,
where len is a positive integer
- 'tag#' for an additional variable positional subfield.
The first variable positional subfield's type, values and xref
are defined with the main field definition.
- 'tag^i' for a subfield identified by character i
('^' is the actual hat character, which is NOT the subfield delimiter;
the field definition uses tabulators)
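The four key forms can be told apart mechanically. The following is a minimal
sketch; the function name and the returned tuples are made up for illustration:

```python
import re

# tag, optionally followed by '#len', '#' or '^i'
KEY_RE = re.compile(r"(-?\d+)(?:(#)(\d*)|\^(.))?")


def parse_key(key):
    """Classify a definition key as (kind, tag, detail).

    kind is one of 'field', 'fixed', 'positional', 'identified'
    (names chosen for this sketch only).
    """
    m = KEY_RE.fullmatch(key)
    if m is None:
        raise ValueError("not a definition key: %r" % key)
    tag = int(m.group(1))
    if m.group(2):                      # a '#' follows the tag
        if m.group(3):                  # 'tag#len': fixed subfield
            length = int(m.group(3))
            if length <= 0:
                raise ValueError("len must be a positive integer")
            return ("fixed", tag, length)
        return ("positional", tag, None)   # 'tag#'
    if m.group(4) is not None:          # 'tag^i'
        return ("identified", tag, m.group(4))
    return ("field", tag, None)
```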
All other subfields in the field definition are identified and optional:
- n name
A name by which a field or subfield can be referred to.
Field names must be unique and subfield names must be unique in their field.
It is strongly recommended to only use C identifiers,
i.e. ASCII letters, digits and the underscore, not starting with a digit.
- d description
Some textual description suitable for the database users.
- m min/mandatory
The sub/field must occur at least as many times as given by this option's
value (empty=1, absent=0).
- r repeatable
The sub/field must occur at most as many times as given by this option's
value (empty=any, absent=1). A value preceded by '+' (including a single
'+' for any) implies the mandatory option (at least one occurrence).
- v value
Every occurrence of this repeatable option is of the form name=value,
associating the symbolic name with a legal value for the sub/field.
The first such value is used as a default where the sub/field is created
for some reason.
- t type
Type of this sub/field; see further below.
Defaults to any (non-control) characters.
Applications might support repeated alternative types.
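How the m and r options combine into occurrence bounds can be sketched as
follows. The dict-based representation of the options is an assumption of this
example, and None is used here to stand for "any":

```python
def occurrence_bounds(options):
    """Derive (minimum, maximum) occurrences from the m and r options.

    options is assumed to map an option letter to its value string;
    a missing key means the option is absent. maximum None means "any".
    """
    minimum = 0                          # m absent = 0
    if "m" in options:
        minimum = int(options["m"]) if options["m"] else 1   # m empty = 1
    maximum = 1                          # r absent = 1
    if "r" in options:
        r = options["r"]
        if r.startswith("+"):            # '+' implies at least one occurrence
            minimum = max(minimum, 1)
            r = r[1:]
        maximum = int(r) if r else None  # r empty = any
    return minimum, maximum
```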
| types of subfields |
Note that a field's type actually defines the type of its first
positional variable subfield (which is usually the main value).
If there are no subfields defined for a field,
the field's value equals its main value.
A simple type definition consists of a single letter indicating
a character type, optionally followed by some digits giving a repeat count.
Unlike the byte-based length restrictions of fixed length subfields,
the repeat count is counted in characters.
For the terms "alphabetic" and "digit", it's up to the application's
Unicode support to properly check these attributes for non-ASCII characters.
Simple environments may assume any code greater than 127 to be alphabetic.
Basic character types are:
- c character
Any character with a code value greater than or equal to 32 (i.e. no C0 controls).
- a alpha
Any alphabetic character.
- d digit
ASCII digits '0'-'9'.
- n numeric
Digits and optional leading minus sign.
- w word
Alpha, digits and underscore.
Extended character/byte types, possibly not supported by all environments, are:
- b bit/boolean
ASCII digits '0' or '1'. If a subfield of this type is absent, '0' should
be assumed, but a '1' if it's present and empty.
- r raw
Raw bytes using newline/vertical tab encoding as suggested by the
- i integer
Binary coded fixed-point decimal numbers using two decimal digits per byte
(128-99 .. 128+99) and starting with a byte 144 plus the bytes before
the decimal point (minus for negative numbers).
Such integers sort properly, avoid newlines and tabs, and the first byte
(for up to 30 decimal digits) is not valid in UTF-8 or any ISO charset.
- t time
Date and time as GTF integer. Up to 8 digits before the decimal point
for date YYYYMMDD, after the decimal point hhmmss...
For all simple type definitions, the same letter may be used uppercase.
With lowercase, the repeat count gives a maximum and defaults to any.
With an uppercase type letter, the repeat count is exact and defaults to 1.
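A check of a value against a simple type definition of a basic character type
might look like the sketch below. It relies on Python's own Unicode tables for
"alphabetic", and it assumes that for the numeric type the repeat count applies
to the digits only, not to the leading minus sign; the text above does not
settle that point.

```python
import re

DIGITS = "0123456789"

# Per-character checks for the basic types (numeric is handled separately).
CHECKS = {
    "c": lambda ch: ord(ch) >= 32,
    "a": str.isalpha,
    "d": lambda ch: ch in DIGITS,
    "w": lambda ch: ch.isalpha() or ch in DIGITS or ch == "_",
}


def matches_simple_type(typedef, text):
    """Check text against a simple type definition such as 'a6' or 'D4'."""
    m = re.fullmatch(r"([cadnwCADNW])(\d*)", typedef)
    if m is None:
        raise ValueError("unsupported type definition: %r" % typedef)
    letter, count = m.groups()
    kind = letter.lower()
    body = text
    if kind == "n":
        # Assumption: an optional leading minus is not part of the count.
        if text.startswith("-"):
            body = text[1:]
            if not body:
                return False        # a bare minus sign is not a number
        kind = "d"
    if letter.isupper():
        # Uppercase: the count is exact and defaults to 1.
        if len(body) != (int(count) if count else 1):
            return False
    elif count and len(body) > int(count):
        # Lowercase: the count is a maximum and defaults to any.
        return False
    return all(CHECKS[kind](ch) for ch in body)
```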
Complex type definitions include the following:
- = pattern
Pattern is a sequence of simple type definitions of basic character types.
E.g. 'A3a6' denotes 3 to 9 alphabetic characters.
Any special character in pattern denotes itself (typically as separator).
- ~ regexp
Depending on the regexp package used.
- " literal
Must have one of the values listed with the field's v option.
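One possible way to implement the '=' pattern type is to translate it into a
regular expression. The sketch below uses Python's re module and makes a few
assumptions that the text leaves open: any character that is not one of the
basic type letters is taken literally, the character classes only approximate
"alphabetic", and for the numeric type the count applies to the digits only.

```python
import re

# Regex character classes approximating the basic character types.
CLASSES = {
    "c": r"[^\x00-\x1f]",
    "a": r"[^\W\d_]",
    "d": r"[0-9]",
    "n": r"[0-9]",                    # an optional minus is prepended below
    "w": r"(?:[^\W\d_]|[0-9_])",
}


def compile_pattern(pattern):
    """Translate a pattern such as 'A3a6' into a compiled regular expression."""
    out = []
    for m in re.finditer(r"([cadnwCADNW])(\d*)|(.)", pattern, re.DOTALL):
        letter, count, literal = m.groups()
        if literal is not None:
            out.append(re.escape(literal))    # denotes itself
            continue
        if letter.isupper():
            quant = "{%d}" % (int(count) if count else 1)   # exact, default 1
        elif count:
            quant = "{0,%d}" % int(count)                   # maximum
        else:
            quant = "*"                                     # any
        prefix = "-?" if letter.lower() == "n" else ""
        out.append(prefix + CLASSES[letter.lower()] + quant)
    return re.compile("".join(out))
```

With this translation, compile_pattern("A3a6").fullmatch(value) accepts
3 to 9 alphabetic characters, in line with the example above.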
The field definition of the basic field definition itself is:
6 6 nfdt dfield definition r t=Nc
6 6^n nname dsub/field name
6 6^d ndesc dsub/field description
6 6^m nmin dmin number of occurrences tn
6 6^r nrep dmax number of occurrences
6 6^v nval dnamed values r
6 6^t ntype dsub/field type
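Since the definition fields use tabulators as delimiter, reading one of them
into its key and options is straightforward. A small sketch, with a made-up
function name and a dict of lists as result so that repeatable options such as
v keep all their occurrences:

```python
def parse_definition(value, delimiter="\t"):
    """Split the value of a definition field (tag 6) into key and options.

    Returns (key, options), where options maps each option letter to the
    list of its values in order of occurrence.
    """
    key, *subfields = value.split(delimiter)
    options = {}
    for sub in subfields:
        if not sub:
            continue                  # ignore empty subfields in this sketch
        options.setdefault(sub[0], []).append(sub[1:])
    return key, options
```

For the first line of the listing above, written with real tabs,
parse_definition("6\tnfdt\tdfield definition\tr\tt=Nc") gives the key "6" and
the options n, d, r (with an empty value) and t.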
| advanced field definition |
There are some advanced field definition options which are probably
not supported by all applications.
Where used, however, the following formats are recommended:
- b base
The key or name of another sub/field definition in this metadata record
from which options (and, for a field, subfield definitions) should be
used for this entity. Obviously just a convenience feature.
- x xref
Definition of some other entity referred to by the value of this sub/field.
- s structure a.k.a. subrecord
The field introduces a structure in the record; see further below.
- c child
This repeatable option specifies a tag or name of a legal child field.
Applications might support this being followed by '[:min][-max]'
to specify a min and/or max count of occurrences of this child,
or one of the letters '+' (at least once), '?' (at most once),
'!' (exactly once) or '*' (any number of times, default).
In the definition of those children, r0 may be used to indicate that they
should not occur in the record except where explicitly listed as a legal child.
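An application that chooses to support the count suffix of the child option
could parse it roughly as follows. This is a sketch only: the function name is
hypothetical, and it assumes that a missing min means 0 and a missing max
means any (represented as None).

```python
import re

# name or tag, followed by either one of + ? ! * or by [:min][-max]
CHILD_RE = re.compile(r"(.+?)(?:([+?!*])|(?::(\d+))?(?:-(\d+))?)")

LETTERS = {"+": (1, None), "?": (0, 1), "!": (1, 1), "*": (0, None)}


def parse_child(value):
    """Return (child, minimum, maximum) for a child option value."""
    m = CHILD_RE.fullmatch(value)
    if m is None:
        raise ValueError("not a child specification: %r" % value)
    child, letter, low, high = m.groups()
    if letter:
        return (child,) + LETTERS[letter]
    return child, int(low) if low else 0, int(high) if high else None
```

Note that a name ending in one of the four letters, or in a hyphen followed by
digits, is ambiguous under this scheme.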
| structures |
The structure option indicates that a field is the header of a structure,
indicating that some fields following it in the record somehow belong to it.
("A group of fields within a record that may be treated as a logical entity.
(When a record describes more than one entity, the descriptions of individual
entities may be treated as subrecords.)", Z39.2).
While in general there are a couple of ways to mark a sequence of fields
as logically being one entity, there are three methods supported by
the field definition:
- counted structures
If the s option's value is empty,
the field's tag is the negative number of fields belonging to the
structure, including the header. This is the means used by the
to efficiently and transparently embed any records in messages.
Obviously counted structures cannot be accessed by their tag.
They are defined as some negative tags.
Some known format of their main value (especially a literal)
may be used to access them by key.
- delimited structures
If the s option's value is '+', the field has one additional initial
subfield of fixed length 1. For a given occurrence of this field,
this subfield must either contain '-', indicating that there are
no children, be absent (i.e. the field is completely empty),
or contain '+', indicating that everything up to a matching
empty field of the same tag are the structure's children.
- fixed structures
If the s option's value is a number, the structure has exactly as
many children as given by this number. Note that the number of fields
may be greater if the children are structures themselves. Rarely used.
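For counted structures, grouping comes down to skipping ahead by the negated
tag. A minimal sketch, assuming a record is a flat list of (tag, value) pairs
and using made-up 'field'/'struct' markers in the result:

```python
def split_counted(fields):
    """Group counted structures in a flat list of (tag, value) pairs.

    A negative tag -n marks a header whose structure spans n fields,
    the header included. Members are returned flat; nested counted
    structures are not expanded in this sketch.
    """
    items = []
    i = 0
    while i < len(fields):
        tag, value = fields[i]
        if tag < 0:
            size = -tag
            if i + size > len(fields):
                raise ValueError("structure runs past end of record")
            items.append(("struct", value, fields[i + 1:i + size]))
            i += size
        else:
            items.append(("field", tag, value))
            i += 1
    return items
```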
Note that while the field definition in general does not specify
the ordering of fields, the children of a structure are always
a consecutive range according to the structure's definition.
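Delimited structures nest naturally, so a stack is enough to rebuild the tree.
The following sketch assumes a record given as a list of (tag, value) string
pairs and a set of tags known to be defined with s+; both the representation
and the names are illustrative only.

```python
def build_tree(fields, structure_tags):
    """Turn a flat record with delimited structures into nested nodes.

    Each node is a tuple (tag, value, children). For structure fields the
    leading '+' or '-' flag is stripped from the value.
    """
    root = []
    stack = [(None, root)]               # (tag of open structure, its children)
    for tag, value in fields:
        children = stack[-1][1]
        if tag not in structure_tags:
            children.append((tag, value, []))
            continue
        if value == "":
            # A completely empty field closes the innermost open structure.
            if stack[-1][0] != tag:
                raise ValueError("unmatched end of structure %r" % tag)
            stack.pop()
            continue
        flag, rest = value[0], value[1:]
        if flag == "+":
            node = (tag, rest, [])
            children.append(node)
            stack.append((tag, node[2]))
        elif flag == "-":
            children.append((tag, rest, []))
        else:
            raise ValueError("structure field must start with '+' or '-'")
    if len(stack) > 1:
        raise ValueError("unclosed structure %r" % stack[-1][0])
    return root
```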
Z39.2 reserves control field 002 for "subrecord purposes",
e.g. listing the offsets of such "groups of fields".
| recommendations |
- fixed subfields should contain only bytes 32 to 126, inclusive
- if delimited structures are used, they should be used consistently,
i.e. all fields (except 0) should have that type
- fixed structures should only be used for internal purposes
| examples |
The headers of email or other MIME messages like
Subject: hi there
Content-Type: text/plain; charset="iso8859-1"
using a field definition of
6 10 nsubject
6 11 ncontent-type
6 11^c ncharset
can be represented as
10 hi there
11 text/plain ciso8859-1
Value options could be used to encode common values like text/plain.
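The conversion of such header lines can be sketched as below. The two lookup
tables mirror the example field definition above and are otherwise
hypothetical; the naive split on ';' does not cope with quoted parameter
values that themselves contain a semicolon.

```python
# Lookup tables derived from the example definition (tags 10 and 11,
# subfield 'c' for charset); a real application would build them from
# the field definition.
TAGS = {"subject": 10, "content-type": 11}
PARAM_IDS = {11: {"charset": "c"}}


def header_to_field(line, delimiter="\t"):
    """Convert one 'Name: value; param="x"' header line to (tag, value)."""
    name, _, body = line.partition(":")
    tag = TAGS[name.strip().lower()]
    if tag not in PARAM_IDS:
        return tag, body.strip()          # no parameters defined: keep as is
    main, *params = [part.strip() for part in body.split(";")]
    subfields = [main]
    for param in params:
        key, _, val = param.partition("=")
        ident = PARAM_IDS[tag][key.strip().lower()]
        subfields.append(ident + val.strip().strip('"'))
    return tag, delimiter.join(subfields)
```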
Using delimited structures, a typical HTML table definition starting with
<table width="100%" cellpadding="0" cellspacing="0"
marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
<td valign="top" width="160">
this is the textbody <br/> of the td node
and a field definition of
6 100 ntable s+
6 100^w nwidth
6 101 ntr
will be compacted to
100 + w100% p0 s0 m0 h0 t0 l0 b0
102 + vtop w160
0 this is the textbody
0 of the td node
which could save half of the internet's bandwidth.
Some strict XML parsers limit a node to at most one textnode child,
which then should be stored in the node's main value.
| conformance |
Most features of Z39.2 (a.k.a. ISO2709 a.k.a. IIF) map directly to
Malete records. Subfield identifiers in Z39.2 can use more than one
character; MARC, however, always uses one.
Initial fixed subfields are dubbed "indicators" by Z39.2;
MARC uses two of length 1. They are not considered "data elements",
as other subfields are. Here, fixed subfields are considered less special.
MIME and *ML (SGML,HTML,XML...) data structures can be converted to records
in a straightforward manner after a parser has resolved entities and the like.
$Id: RecStruct.txt,v 1.6 2004/07/26 12:23:34 kripke Exp $