|
CharSet
|
character set support in Malete
See end for usage note.
| overview |
Malete supports any "character set" (in the MIME sense of "charset" =
CCS + CES) which is compatible with ASCII, so that
- every character defined by ASCII is encoded by the same byte value
- every byte with a value in the range 0-127 inclusive encodes the
character as specified by ASCII
This
- includes ISO-8859-*, Unicode in the UTF-8 encoding, and various far east
EUC and similar encodings using pairs of bytes greater than 127,
and works well with most (but not all) IBM/M$ codepages,
unless you really abused the control characters 0-31 for graphics
- excludes UTF-16 and other formats using bytes 0-127 as part of
multibyte sequences and, of course, the anti-ASCII EBCDIC
- should work, with some restrictions on searching, for encodings which at
least preserve linefeed (10), horizontal TAB (9) and the digits (48-57),
like the Unicode standard compression scheme SCSU, VISCII and even old
ISO-646-* or Cyrillic KOI
In order to store and retrieve (by record id) data,
Malete does not need to know anything about the character set.
However, the content-type may contain a charset name,
using a preferred MIME name from the IANA registered charsets.
The basic server does not support character set conversion, since many
client environments like Java or Tcl are well prepared to handle this.
A Tcl based server may support charset conversion.
| indexing |
For indexing, we use quite a lot of information about characters, similar to
but extending the traditional ISISUC.TAB and ISISAC.TAB:
- a sort order (collation sequence),
possibly using multilevel comparison according to the UCA
- a mapping of certain code sequences to others,
for example to use uppercased versions in the index
- some notion of which characters are "alphabetic"
(parts of words) for indexing in word split mode
This data is configured by a sequence of strings,
which may be obtained from the collation (4) fields of a database's
metadata record (in _database_.m0d) or as the lines of a
text file (_collation_.mcd, not implemented).
These strings start with a one or two character mnemonic,
which is followed by tabulator separated byte sequences,
each representing a single character or a sequence of characters
in the database's encoding.
Unlike CDS/ISIS, Malete always deals with multibyte entities
and does not use explicit codes as decimal numbers.
Consequently, the collation configuration can be converted between
charsets just like the database itself (e.g. using recode or iconv).
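For example, assuming a Latin-1 based definition and illustrative file
names, "iconv -f ISO-8859-1 -t UTF-8 cds.m0d >cds.utf8.m0d" converts
the metadata record to UTF-8 in one go.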
Malete does not care whether a multibyte sequence holds the two ASCII
characters 'C' and 'H' in order to assign 'CH' a separate rank between
'C' and 'D' in Spanish collation (a Contraction),
an ANSEL (Z39.47) style composition,
or UTF-8 using two or more bytes to encode a character with a diacritical
mark (in precomposed or decomposed form).
Configuration entries supported in the initial version are:
- collation C _name_ [_options_]
assigns a name to this collation or refers to an external collation.
Only the first 31 bytes in _name_ are considered.
Should be a C identifier (plain ASCII) for best interoperability.
Proposed (but probably not implemented) options are 'c' for compression
and 'f' for French (reverse) secondaries (see below).
- word W _entities_
specifies that the listed entities are considered parts of words
and assigns sort ranks in ascending order to them
- nonword N _entities_
like W, but the entities separate "words" in word split mode.
Multiple W and N entries can be used to assign successive sort ranks.
- alias A _entities_
the entities are assigned as aliases to the corresponding entities
of the last seen W or N, e.g. a sequence of lowercase characters
to their uppercase equivalents
- map M _entities_
the second and following entities are mapped to the first one,
which will be iteratively checked for other rules (but not maps).
This can be used to map entities to empty (completely discarding them)
or to multiple entities as an Expansion.
Example:
4 C spanish_de_PHONEBOOK
4 N . , ; - _
4 N 0 1 2 3 4 5 6 7 8 9
4 W a b c ch d e f g h ...
4 A A B C CH D E F G H ...
4 A á é
4 A Á É
4 M ae Ä ä
4 M oe Ö ö
Here, 'coche' sorts exactly like 'COCHE' after 'cocina',
since the ch sorts after the c (and the i is not even considered).
'König' in German phonebooks sorts exactly like 'Koenig',
and in a terms listing, both will be displayed as 'koenig'.
(Unlike "de_PHONEBOOK", the "de" collation has the o-umlaut
as secondary to o.)
Note that, as a possibly confusing although correct side effect,
a prefix search for 'coc' will NOT match 'coche',
since it does not begin with the codes for 'c'-'o'-'c',
but with those for 'c'-'o'-'ch'.
| implementation |
The collation is basically implemented by means of a recoding.
Every W and N entity (byte sequence) corresponds to one code number,
increasing from 2 in the order of their definition.
Every unrecognized byte value, and especially the TAB (9), which can not
be redefined, maps to code number 1 and runs of those are squeezed
(i.e. only one 1 for a sequence of unrecognized bytes).
Aliases use the same code number as their corresponding W or N entity.
The recoded key is then a sequence of code numbers corresponding to
the recognized entities. Depending on the highest code number,
one or two bytes (big endian) are used for every number.
This transformation is applied to every index entry before storing it,
as well as to every term before lookup. From the table of entities,
the original term (in W and N entities, not aliases or maps) can be
reconstructed for display of indexed terms.
Note that the term decoded from a collation key does not necessarily map
back to the same key: where the byte sequences of its entities overlap
with others, they may become parts of other contractions.
The implementation limits both the length (in bytes) of sequences
and the number of codes of a map target to 15.
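As an illustration of this recoding, here is a minimal sketch in C;
the Entity table, the greedy longest-match strategy and all names are
illustrative assumptions, not the actual Malete source:

  #include <string.h>

  typedef struct {
      const char *bytes; /* entity byte sequence, e.g. "ch" */
      int len;           /* length in bytes, at most 15 */
      int code;          /* assigned code number, starting at 2 */
  } Entity;

  /* Recode term[0..tlen) into code numbers in out[]; returns their count.
     Entities are tried longest first, so that e.g. "ch" wins over "c". */
  static int recode(const Entity *tab, int ntab,
                    const char *term, int tlen, int *out)
  {
      int i, n = 0, pos = 0;
      while (pos < tlen) {
          const Entity *best = 0;
          for (i = 0; i < ntab; i++)
              if (tab[i].len <= tlen - pos
                  && (!best || tab[i].len > best->len)
                  && !memcmp(term + pos, tab[i].bytes, tab[i].len))
                  best = &tab[i];
          if (best) {
              out[n++] = best->code; /* aliases share their W/N code */
              pos += best->len;
          } else if (n == 0 || out[n - 1] != 1) {
              out[n++] = 1; /* unrecognized byte: code 1 */
              pos++;
          } else {
              pos++; /* still in a squeezed run of unrecognized bytes */
          }
      }
      return n;
  }

Depending on the highest code number, these numbers would then be
written with one or two bytes (big endian) each, as described above.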
| compression |
With the compression option implemented and enabled, the number of bits used
per code is the minimal number of bits needed to represent the highest
code number, and the bitstring is padded with 0 bits to the full byte.
In a Spanish environment one would need 29 alphabetic codes
(including CH, LL and Ñ), 10 digits and some punctuation, so six bits
(codes 2-63) are sufficient and we can reduce key size by up to 25%.
This is probably especially interesting for databases integrating a lot of
Phoenician and/or Brahmi scripts, using more than 256 but less than 512
codes. Here one would need only 9 instead of 16 bits, saving more than 40%.
In a CJK environment, you will need at least 15 bits anyway.
Do not confuse this compression of single keys with the option of
compressing the index based on common prefixes between adjacent keys.
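A sketch of the bit packing in C (again illustrative, not the actual
implementation; every code is assumed to fit into the given number of
bits):

  /* Pack n codes of the given bit width into out, big endian,
     padding the last byte with 0 bits; returns the byte count. */
  static int pack(const int *codes, int n, int bits, unsigned char *out)
  {
      int i, fill = 0, nbytes = 0;
      unsigned acc = 0; /* bit accumulator, high bits emitted first */
      for (i = 0; i < n; i++) {
          acc = (acc << bits) | (unsigned)codes[i];
          fill += bits;
          while (fill >= 8) {
              fill -= 8;
              out[nbytes++] = (unsigned char)(acc >> fill);
          }
      }
      if (fill) /* pad the final partial byte with 0 bits */
          out[nbytes++] = (unsigned char)(acc << (8 - fill));
      return nbytes;
  }

Four six-bit codes, for example, pack into three bytes instead of four.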
| multilevel comparison |
Some future version will also support S and T entries for
secondary (optionally French) and tertiary levels, possibly one day even
quaternary and identical levels Q and I, should there be demand.
Example (cf. the es chart, which lacks the ch contraction):
4 C spanish
4 W a b c ch d e f
4 T A B C CH D E F
4 S á é
4 T Á É
4 S ä
4 T Ä
Here, 'coche' sorts before 'Coche',
since on the third level the 'c' sorts before 'C'.
(Unlike plain ASCII sorting, most collations sort lowercase before uppercase).
Still 'coche' sorts after 'Cocina', since the primary difference
between 'ch' and 'c' takes precedence over the tertiary difference,
although the latter occurs earlier in the word.
Just for the fun of it, the a-umlaut is not expanded here,
but listed as another secondary variant of 'a' with its own tertiary.
For multilevel comparison, a 0 code plus additional bits are appended to
the recoded key. First, for every character some bits are appended to
encode its secondary variant, depending on how many variants are defined
for the character, then likewise for tertiary variants.
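As a worked example under the table above (with an illustrative bit
layout): for both 'cache' and 'Cache' the primary part holds the codes
for c-a-ch-e. After the 0 separator, 'a' contributes two secondary bits
(selecting among a, á and ä) and 'e' one (e or é), all zero here; 'c'
and 'ch' have no secondary variants. On the tertiary level, each of the
four characters contributes one case bit, so the two keys differ only
in the tertiary bit of the initial c/C, and 'cache' sorts first.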
In Latin scripts, typically every alphabetical character has one
tertiary variant (its lowercase equivalent, using one bit)
and some or all vowels can have one or more diacritical marks.
By appending additional bits, not only do terms sort properly,
but we also have the option of an exact match sensitive to all levels,
or a match insensitive to the third or to the second and third levels,
very similar to a prefix match (since the first level IS a prefix).
An actual prefix match should usually be done using only the first level bits,
checking for second and third level prefix is a little bit more complicated.
For French secondary sorting, the second level bits are appended in reverse
order. It must not be mixed with left-to-right secondary sorting.
Using the additional bits, a terms listing can reconstruct the input with
regard to all variants, i.e. with proper case and diacritics.
However, aliases and mappings can not be reversed:
where the a-umlaut should sort *exactly*
like an a followed by an e, it uses exactly the same bytes, and we can not
tell from the index that there once was an a-umlaut in the input.
| east asian word indexing |
Segmentation of Chinese text is, in general, not a trivial task.
Of course, the use of spaces to explicitly separate words is an option.
The usual word split will then just work as for any other script.
Since Malete supports fully application controlled indexing, it is also
possible to use any existing segmentation algorithm on the application level.
Where this is unwanted or not possible, a somewhat brute force,
yet feasible approach is to put every single character or
every digraph or trigraph in the index
(i.e. every character together with the one or two following characters).
Please contact us, should there be demand for such a character or
m-graph indexing method as an alternative to "word" split.
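For illustration, such a digraph split for UTF-8 text could look like
the following C sketch (purely illustrative, not an existing Malete
API; well-formed UTF-8 input is assumed):

  #include <stdio.h>

  /* byte length of the UTF-8 character starting at s,
     assuming s points at a valid lead byte */
  static int u8len(const unsigned char *s)
  {
      if (*s < 0x80) return 1; /* ASCII */
      if (*s < 0xE0) return 2;
      if (*s < 0xF0) return 3;
      return 4;
  }

  /* print every character together with the following one */
  static void digraphs(const char *text)
  {
      const unsigned char *s = (const unsigned char *)text;
      while (*s) {
          int a = u8len(s);
          if (!s[a])
              break; /* last character, no pair left */
          printf("%.*s\n", a + u8len(s + a), (const char *)s);
          s += a; /* slide the window by one character */
      }
  }

A phrase search could then, for example, be answered by intersecting
the postings of the phrase's digraphs.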
| using collation |
When a database is opened, Malete looks for the file _database_.m0d
(where _database_ is the name of your database, e.g. cds.m0d).
If this exists, it is scanned for fields with tag 4.
If there is a "4 C" naming the collation _name_,
and there either is a compiled collation file _name_.mcx
or another database is already using a collation of that name,
the existing collation info is used.
Otherwise a new collation (with the given name or anonymous)
is created from the description in the metadata and,
if it is named, saved as file _name_.mcx (in the current directory).
The distribution contains two sample collation definitions, a Latin-1
based one as test/cds.m0d and a UTF-8 (Unicode) based one as
test/unicode.m0d.
Use a UTF-8 capable editor like "vim '+set encoding=utf-8'" in an
"xterm -u8".
Creating a collation definition from existing ISISAC.TAB and ISISUC.TAB
is a straightforward exercise left to the reader.
- *WARNING*:
Adding or changing the collation in _database_.m0d will render your
encoded index data garbage!
Make sure to save the plaintext index data using
"malete qdump _database_ >_database_.mqt" before
and reencode the index with "malete qload _database_ <_database_.mqt"
after editing the m0d file.
- *WARNING*:
While named (and thus shared) collations are much more efficient,
multiple databases using a different specification for the same
collation name will not properly coexist.
When changing a named collation, be sure to remove the .mcx file
and reload the indexes for all affected databases.
When in doubt, remove or change the collation name.
| links |
list of encodings supported by Java
notes on charsets and unicode with ISIS
tables of some western latin sets
statistics on unicode character properties
approaches to collation
iana registered charsets
unicode characters and scripts
ICU Collation Introduction and Concepts
Collation charts (without contractions) by locale or script
(according to the Unicode default collation)
$Id: CharSet.txt,v 1.12 2004/11/12 11:18:23 kripke Exp $
|
|