OpenIsis/Malete field definition and record structures

overview

A Malete record is a sequence of one or more fields. The first one is called the header, all others are identified by a numeric tag.
As far as the Malete database core is concerned, a field may contain arbitrary bytes except newline characters. Any assumption about the structure of field data, including any encoding of binary data, is solely at the application's discretion.
As Malete is designed to be a multi-purpose database engine, no particular schema is enforced. However, a schema is suggested and used by the OpenIsis application. In the database's metadata record, fields with tag (00)6 are reserved for this purpose (abuse at your own risk).

The rationale of this field definition is to provide enough flexibility to efficiently support representations of all structures found in Z39.2 based systems (including but transcending the traditional CDS/ISIS software), especially the various MARC formats, as well as full representations of data commonly stored and transmitted in a couple of other formats like MIME and XML.
The term "representation" means that Malete will not bother to directly support XML's angle brackets nor XML's/MIME's foo="bar" options nor the subfield delimiter characters of MARC or CDS/ISIS. Rather, for any such data there should be a lossless transformation to an efficient representation in some format described by this field definition.

structure of fields

While fields may be used to hold a single value, it is a common technique to treat them as a sequence of subfields ("A data element considered as a component of a field.", Z39.2).
A field may contain, in that order:
  • 0 or more positional subfields of fixed length
  • 1 or more positional subfields of variable length
  • 0 or more identified subfields of variable length

Fixed length subfields end after as many bytes (not characters!) as given by their length. They are typically used for data coded as ASCII values. Neither UTF-8 characters nor the delimiter character should be stored in fixed length subfields (however, it's up to the application to exercise care).
Variable length subfields end at a delimiter character or at the end of the field. Malete by default uses a tabulator as delimiter, and import of CDS/ISIS databases converts the caret (hat '^') to tabs. Applications are, however, free to use any delimiter they want.

Positional subfields are identified by their position within the field, i.e. by counting that many bytes and delimiters. Of course, there is only one nth position within a field, i.e. every positional subfield can occur at most once. Since the first n bytes and first m delimited subfields are used as the positional subfields, they may be omitted only if end of field is seen, i.e. all other subfields are omitted.
Identified subfields, on the other hand, start with a single character identifying the subfield, just like fields in a record are identified by a tag. Applications unaware of UTF-8 may demand a single byte as identifier. Where portability is an issue, only ASCII letters and digits should be used. Since there is at least one positional variable subfield, identified subfields always start after a delimiter (in accordance with Z39.2). An identified subfield may occur zero, one or more times in a field.
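
The layout described above can be sketched in Python as follows. This is only an illustration, not Malete code: the function name split_field, its parameters and the SplitField container are hypothetical, and it assumes single-byte subfield identifiers and the default tabulator delimiter.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class SplitField:
    fixed: List[bytes]                    # fixed length positional subfields
    positional: List[bytes]               # variable length positional subfields
    identified: List[Tuple[str, bytes]]   # (identifier, value) pairs, repeatable


def split_field(value: bytes,
                fixed_lengths: Sequence[int] = (),
                n_positional: int = 1,
                delimiter: bytes = b"\t") -> SplitField:
    """Split a field value into its three groups of subfields (hypothetical helper)."""
    if n_positional < 1:
        raise ValueError("there is at least one positional variable subfield")
    fixed = []
    pos = 0
    for length in fixed_lengths:
        # Fixed subfields are counted in bytes, not characters.
        fixed.append(value[pos:pos + length])
        pos += length
    parts = value[pos:].split(delimiter)
    positional = parts[:n_positional]
    # Assumption: identifiers are a single byte; empty parts carry no identifier.
    identified = [(p[:1].decode("latin-1"), p[1:])
                  for p in parts[n_positional:] if p]
    return SplitField(fixed, positional, identified)


field = split_field(b"text/plain\tciso8859-1")
print(field.positional, field.identified)
```

If the field ends early, the omitted trailing subfields simply come back empty or missing, which mirrors the rule that positional subfields may only be omitted at the end of the field.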

The MAIN VALUE of a field contains the fixed length subfields together with the first positional variable subfield. Sloppy applications may use anything up to the first delimiter, assuming that fixed subfields do not contain it. In the common situation of having no fixed length subfields, the main value equals the first positional subfield. The main value in a field is very similar to a record's header and commonly used as a key to select a field in a record.
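
As a minimal sketch of the difference between the strict and the "sloppy" reading of the main value (function names and the fixed_total parameter are hypothetical, the tabulator delimiter is assumed):

```python
def main_value(value: bytes, fixed_total: int = 0, delimiter: bytes = b"\t") -> bytes:
    """Fixed length subfields plus the first positional variable subfield.

    fixed_total is the summed byte length of all fixed subfields; the search
    for the delimiter starts only after them.
    """
    end = value.find(delimiter, fixed_total)
    return value if end < 0 else value[:end]


def sloppy_main_value(value: bytes, delimiter: bytes = b"\t") -> bytes:
    """Everything up to the first delimiter, ignoring fixed subfields."""
    return value.split(delimiter, 1)[0]


# Both readings agree when there are no fixed subfields ...
print(main_value(b"text/plain\tciso8859-1"))
# ... and differ when a fixed subfield happens to contain the delimiter.
print(main_value(b"\tAB\tx", fixed_total=1), sloppy_main_value(b"\tAB\tx"))
```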

The properties of subfields stated so far are consequences of their very definition. Additional properties, e.g. the main value being empty or an identified subfield having a fixed length a/o occurring exactly once, may be demanded by the field definition. It is the application's responsibility to make sure records do not violate the field definition; the Malete server will happily store whatever it receives.

definition of fields

The field definition uses fields of the metadata record, one per field and one per subfield. These fields themselves do not use fixed length subfields. The main value is a (non-unique) key:
  • 'tag' for a field definition,
    where tag is an integer. Negative numbers are reserved for counted structures. By convention, general application data fields should use tags 100 - 999.
  • 'tag#len' for a fixed subfield,
    where len is a positive integer
  • 'tag#' for an additional variable positional subfield. The first variable positional subfield's type, values and xref are defined with the main field definition.
  • 'tag^i' for a subfield identified by character i ('^' is the actual hat character, which is NOT the subfield delimiter; the field definition uses tabulators)
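
The four key forms above could be told apart roughly like this. The sketch is illustrative only; DefKey and parse_def_key are hypothetical names, not part of Malete.

```python
import re
from typing import NamedTuple, Optional


class DefKey(NamedTuple):
    tag: int
    kind: str                     # "field", "fixed", "positional" or "identified"
    length: Optional[int] = None  # only for kind == "fixed"
    ident: Optional[str] = None   # only for kind == "identified"


# tag, optionally followed by '#len', a bare '#', or '^i'
_KEY = re.compile(r"(-?\d+)(?:#(\d*)|\^(.))?")


def parse_def_key(key: str) -> DefKey:
    """Classify the main value of a field definition entry (hypothetical helper)."""
    m = _KEY.fullmatch(key)
    if m is None:
        raise ValueError("not a field definition key: %r" % key)
    tag = int(m.group(1))
    length, ident = m.group(2), m.group(3)
    if ident is not None:
        return DefKey(tag, "identified", ident=ident)
    if length is None:
        return DefKey(tag, "field")
    if length == "":
        return DefKey(tag, "positional")
    if int(length) <= 0:
        raise ValueError("len must be a positive integer")
    return DefKey(tag, "fixed", length=int(length))


for key in ("100", "100#2", "100#", "100^w", "-3"):
    print(key, parse_def_key(key))
```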

All other subfields in the field definition are identified and optional:
  • n name
    A name by which a field or subfield can be referred to. Field names must be unique and subfield names must be unique in their field. It is strongly recommended to only use C identifiers, i.e. ASCII letters, digits and the underscore, not starting with a digit.
  • d description
    Some textual description suitable for the database users.
  • m min/mandatory
    The sub/field must occur at least as many times as given by this option's value (empty=1, absent=0).
  • r repeatable
    The sub/field must occur at most as many times as given by this option's value (empty=any, absent=1). A value preceded by '+' (including a single '+' for any) implies the mandatory option (at least one occurrence).
  • v value
    Every occurrence of this repeatable option is of the form name=value, associating the symbolic name with a legal value for the sub/field. The first such value is used as a default where the sub/field is created for some reason.
  • t type
    Type of this sub/field; see further below. Defaults to any (non-control) characters. Applications might support repeated alternative types.
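
How the m and r options combine into occurrence bounds may be easier to see in code. The following is a sketch under the reading given above; occurrence_bounds is a hypothetical name, None stands for an absent option and the empty string for an option present without a value.

```python
from typing import Optional, Tuple


def occurrence_bounds(m: Optional[str], r: Optional[str]) -> Tuple[int, Optional[int]]:
    """Return (min, max) occurrences; max is None for "any number of times"."""
    # m: absent = 0, empty = 1, otherwise the given number
    if m is None:
        lo = 0
    elif m == "":
        lo = 1
    else:
        lo = int(m)
    # r: absent = 1, empty = any, otherwise the given number
    if r is None:
        hi: Optional[int] = 1
    else:
        if r.startswith("+"):
            # A leading '+' implies the mandatory option.
            lo = max(lo, 1)
            r = r[1:]
        hi = None if r == "" else int(r)
    return lo, hi


print(occurrence_bounds(None, None))   # neither option given
print(occurrence_bounds(None, "+"))    # mandatory and repeatable
```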

types of subfields

Note that a field's type actually defines the type of its first positional variable subfield (which is usually the main value). If there are no subfields defined for a field, the field's value equals its main value.

A simple type definition consists of a single letter indicating a character type, optionally followed by some digits giving a repeat count. Unlike the byte-based length restrictions of fixed length subfields, the repeat count is counted in characters.
For the terms "alphabetic" and "digit", it's up to the application's UNICODE support to properly check these attributes for non-ASCII characters. Simple environments may assume any code greater than 127 alphabetic.
Basic character types are:
  • c character
    Any character with a code value greater than or equal to 32 (i.e. no C0 controls).
  • a alpha
    Any alphabetic character.
  • d digit
    ASCII digits '0'-'9'.
  • n numeric
    Digits and optional leading minus sign.
  • w word
    Alpha, digits and underscore.

Extended character/byte types, possibly not supported by all environments, are:
  • b bit/boolean
    ASCII digits '0' or '1'. If a subfield of this type is absent, '0' should be assumed, but a '1' if it's present and empty.
  • r raw
    Raw bytes using newline/vertical tab encoding as suggested by the Protocol
  • i integer
    Binary coded fixed point decimal numbers using two decimal digits per byte (128-99 .. 128+99) and starting with a byte of 144 plus the number of bytes before the decimal point (minus for negative numbers). Such integers sort properly, avoid newlines and tabs, and the first byte (for up to 30 decimal digits) is not valid in UTF-8 or any ISO charset.
  • t time
    Date and time as GTF integer. Up to 8 digits before the decimal point for date YYYYMMDD, after the decimal point hhmmss...

For all simple type definitions, the same letter may be used uppercase. With lowercase, the repeat count gives a maximum and defaults to any. With an uppercase type letter, the repeat count is exact and defaults to 1.
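
A check of a value against a simple type definition might look like the sketch below. It is an assumption-laden illustration: check_simple is a hypothetical name, only the basic types c, a, d and w are covered (the interaction of the repeat count with the minus sign of type n is not spelled out above), and alphabetic checks are delegated to Python's Unicode support.

```python
_CLASS_TESTS = {
    "c": lambda ch: ord(ch) >= 32,                      # no C0 controls
    "a": lambda ch: ch.isalpha(),                       # alphabetic
    "d": lambda ch: ch in "0123456789",                 # ASCII digits only
    "w": lambda ch: ch.isalpha() or ch in "0123456789_",
}


def check_simple(type_def: str, text: str) -> bool:
    """Check text against a simple type such as 'd8' or 'A3' (hypothetical helper)."""
    if not type_def or type_def[0].lower() not in _CLASS_TESTS:
        raise ValueError("unsupported type definition: %r" % type_def)
    letter, count = type_def[0], type_def[1:]
    if count and not count.isdigit():
        raise ValueError("repeat count must be digits: %r" % type_def)
    if letter.isupper():
        # Uppercase: the count is exact and defaults to 1.
        if len(text) != (int(count) if count else 1):
            return False
    elif count and len(text) > int(count):
        # Lowercase: the count is a maximum and defaults to any.
        return False
    test = _CLASS_TESTS[letter.lower()]
    return all(test(ch) for ch in text)


print(check_simple("d8", "20040726"), check_simple("D4", "204"))
```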

Complex type definitions include the following:
  • = pattern
    Pattern is a sequence of simple type definitions of basic character types. E.g. 'A3a6' denotes 3 to 9 alphabetic characters. Any special character in pattern denotes itself (typically as separator).
  • ~ regexp
    Depending on the regexp package used.
  • " literal
    Must have one of the values listed with the field's v option.
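
One possible way to evaluate a '=' pattern is to translate it into a regular expression. The sketch below does that for the basic types c, a, d and w; pattern_to_regex is a hypothetical name, and treating every character that is not a type letter (with its count) as a literal is an assumption about what "special character" means.

```python
import re

# Regex character classes for the basic character types (type n is left out).
_CLASSES = {
    "c": r"[^\x00-\x1f]",        # code value >= 32
    "a": r"[^\W\d_]",            # alphabetic
    "d": r"[0-9]",               # ASCII digits
    "w": r"(?:[^\W\d]|[0-9])",   # alpha, digits and underscore
}
_TOKEN = re.compile(r"([A-Za-z])(\d*)|(.)", re.DOTALL)


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Compile a pattern such as 'A3a6' into a regular expression."""
    out = []
    for letter, count, literal in _TOKEN.findall(pattern):
        if literal:
            out.append(re.escape(literal))   # special characters denote themselves
            continue
        cls = _CLASSES.get(letter.lower())
        if cls is None:
            raise ValueError("unsupported type letter: %r" % letter)
        if letter.isupper():
            quant = "{%s}" % (count or "1")              # exact, default 1
        else:
            quant = "{0,%s}" % count if count else "*"   # maximum, default any
        out.append(cls + quant)
    return re.compile("".join(out))


rx = pattern_to_regex("A3a6")
print(bool(rx.fullmatch("abc")), bool(rx.fullmatch("ab")))
```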

The field definition of the basic field definition itself is:
6	6	nfdt	dfield definition	r	t=Nc
6	6^n	nname	dsub/field name
6	6^d	ndesc	dsub/field description
6	6^m	nmin	dmin number of occurrences	tn
6	6^r	nrep	dmax number of occurrences
6	6^v	nval	dnamed values	r
6	6^t	ntype	dsub/field type
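
To illustrate how such entries could be read, the following sketch parses a few of the lines above. It assumes a serialization of one field per line in the form tag, tabulator, value, as in the listing; parse_definitions is a hypothetical name, and because keys are non-unique the result is a list rather than a dictionary.

```python
FIELD_DEFINITION = "\n".join([
    "6\t6\tnfdt\tdfield definition\tr\tt=Nc",
    "6\t6^n\tnname\tdsub/field name",
    "6\t6^m\tnmin\tdmin number of occurrences\ttn",
])


def parse_definitions(text, def_tag="6", delimiter="\t"):
    """Return a list of (key, options) pairs from field definition lines."""
    entries = []
    for line in text.splitlines():
        parts = line.split(delimiter)
        if len(parts) < 2 or parts[0] != def_tag:
            continue  # not a field definition entry
        options = {}
        for sub in parts[2:]:
            if sub:
                # The first character identifies the option, the rest is its value.
                options.setdefault(sub[0], []).append(sub[1:])
        entries.append((parts[1], options))
    return entries


for key, options in parse_definitions(FIELD_DEFINITION):
    print(key, options)
```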


advanced field definition

There are some advanced field definition options which are probably not supported by all applications. Where used, however, the following formats are recommended:
  • b base
    The key or name of another sub/field definition in this metadata record from which options (and, for a field, subfield definitions) should be used for this entity. Obviously just a convenience feature.
  • x xref
    Definition of some other entity referred to by the value of this sub/field. Described elsewhere.
  • s structure a.k.a. subrecord
    The field introduces a structure in the record; see further below.
  • c child
    This repeatable option specifies a tag or name of a legal child field. Applications might support this being followed by '[:min][-max]' to specify a min a/o max count of occurrences of this child, or one of the letters '+' (at least once), '?' (at most once), '!' (exactly once) or '*' (any number of times, default). In the definition of those children, r0 may be used to indicate that they should not occur in the record except where explicitly listed as a legal child.

structures

The structure option indicates that a field is the header of a structure, indicating that some fields following it in the record somehow belong to it. ("A group of fields within a record that may be treated as a logical entity. (When a record describes more than one entity, the descriptions of individual entities may be treated as subrecords.)", Z39.2).

While in general there are a couple of ways to mark a sequence of fields as logically being one entity, there are three methods supported by the field definition:
  • counted structures
    If the s option's value is empty, the field's tag is the negative number of fields belonging to the structure, including the header. This is the means used by the Protocol to efficiently and transparently embed any records in messages. Obviously counted structures cannot be accessed by their tag. They are defined as some negative tags. Some known format of their main value (especially a literal) may be used to access them by key.
  • delimited structures
    If the s option's value is '+', the field has one additional initial subfield of fixed length 1. For a given occurrence of this field, this subfield must either contain '-', indicating that there are no children, be absent (i.e. the field is completely empty), or contain '+', indicating that everything up to a matching empty field of the same tag are the structure's children.
  • fixed structures
    If the s option's value is a number, the structure has exactly as many children as given by this number. Note that the number of fields may be greater if the children are structures themselves. Rarely used.

Note that while the field definition in general does not specify the ordering of fields, the children of a structure are always a consecutive range according to the structure's definition.
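
Delimited structures can be turned into a tree with a simple stack. The sketch below is a hypothetical illustration, not Malete code: build_tree and the plain_tags parameter are invented names, fields are given as (tag, value) pairs, and an empty field whose tag equals the innermost open structure is always read as its end marker.

```python
def build_tree(fields, plain_tags=(0,)):
    """Nest (tag, value) pairs according to their '+' / '-' / empty markers."""
    root = {"tag": None, "value": "", "children": []}
    stack = [root]
    for tag, value in fields:
        structured = tag not in plain_tags
        if structured and value == "" and len(stack) > 1 and stack[-1]["tag"] == tag:
            stack.pop()          # matching empty field closes the structure
            continue
        node = {"tag": tag, "value": value, "children": []}
        stack[-1]["children"].append(node)
        if structured and value.startswith("+"):
            stack.append(node)   # '+' opens a structure with children
    if len(stack) > 1:
        raise ValueError("structure with tag %r is not closed" % stack[-1]["tag"])
    return root["children"]


tree = build_tree([
    (101, "+"),
    (102, "+\tvtop\tw160"),
    (0, "this is the textbody"),
    (103, "-"),
    (0, "of the td node"),
    (102, ""),
    (101, ""),
])
print([child["tag"] for child in tree[0]["children"][0]["children"]])
```

Fields listed in plain_tags (here tag 0) are never interpreted as structure headers, so their text may start with any character.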

Z39.2 reserves control field 002 for "subrecord purposes", e.g. listing the offsets of such "groups of fields".

recommendations

  • fixed subfields should contain only bytes 32 to 126, inclusive
  • if delimited structures are used, they should be used consistently, i.e. all fields (except 0) should have that type
  • fixed structures should only be used for internal purposes

examples

The headers of email or other MIME messages like
Subject: hi there
Content-Type: text/plain; charset="iso8859-1"
using a field definition of
6	10	nsubject
6	11	ncontent-type
6	11^c	ncharset
map to
10	hi there
11	text/plain	ciso8859-1
Value options could be used to encode common values like text/plain.
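
The mapping shown above could be produced by a small converter along these lines. This is a naive sketch: header_to_field and the two lookup tables are hypothetical stand-ins for the field definition, and the parameter parsing simply splits on ';' without handling quoted semicolons or other MIME subtleties.

```python
TAGS = {"subject": 10, "content-type": 11}   # header name -> tag
SUBFIELDS = {(11, "charset"): "c"}           # (tag, parameter) -> identifier


def header_to_field(line, tags=TAGS, subfields=SUBFIELDS, delimiter="\t"):
    """Convert one header line into 'tag<TAB>main value<TAB>subfields...'."""
    name, _, body = line.partition(":")
    tag = tags[name.strip().lower()]
    body = body.strip()
    if not any(t == tag for t, _param in subfields):
        # No subfields defined for this tag: the whole body is the main value.
        return f"{tag}{delimiter}{body}"
    main, *params = [part.strip() for part in body.split(";")]
    parts = [main]
    for param in params:
        if not param:
            continue
        key, _, val = param.partition("=")
        ident = subfields[(tag, key.strip().lower())]
        parts.append(ident + val.strip().strip('"'))
    return f"{tag}{delimiter}" + delimiter.join(parts)


print(header_to_field("Subject: hi there"))
print(header_to_field('Content-Type: text/plain; charset="iso8859-1"'))
```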

Using delimited structures, a typical HTML table definition starting with
<table width="100%" cellpadding="0" cellspacing="0"
  marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
<tr>
<td valign="top" width="160">
this is the textbody <br/> of the td node
</td>
</tr>
...
using
6	100	ntable	s+
6	100^w	nwidth
...
6	101	ntr
...
will be compacted to
100	+	w100%	p0	s0	m0	h0	t0	l0	b0
101	+
102	+	vtop	w160
0	this is the textbody
103	-
0	of the td node
102
101
...
which could save half of the internet's bandwidth.
Some strict XML parsers limit a node to at most one textnode child, which then should be stored in the node's main value.

conformance

Most features of Z39.2 (a.k.a. ISO2709 a.k.a. IIF) map directly to Malete records. Subfield identifiers in Z39.2 can use more than one character; MARC, however, always uses one. Initial fixed subfields are dubbed "indicators" by Z39.2; MARC uses two of length 1. They are not considered "data elements", as other subfields are. Here, fixed subfields are considered less special.

MIME and *ML (SGML,HTML,XML...) data structures can be converted to records in a straightforward manner after a parser has resolved entities and the like.


$Id: RecStruct.txt,v 1.6 2004/07/26 12:23:34 kripke Exp $