GNI - Query

Coronita

Announcement
htld

Selene

Announcement

Malete

DownLoad
Status
OverView
Usage
Structures
Protocol
Query
MultiProcess

formats

FileFormats
character sets
... and converting
CDS/ISIS
IIF/ISO2709

misc

changes from earlier versions
tag numbers

drafts (partly obsolete)

MetaData
object model
Tcl

Query

Malete query expressions:
Like CDS/ISIS, Malete supports both index based searching and record based filtering. Unlike in CDS/ISIS, the basic filtering syntax is not based on the formatting language, but a simple extension of the search syntax.
However, a more general record manipulation language, the "patchwork", is planned, which will also act as a filter based on complex conditions.

In a query expression, everything up to a '?' is a search expression, the remaining part defines a filter.

In general, an expression defines a result set, consisting of pointers to occurrences of keys in records. For searching, these pointers are fetched from the "inverted file", the .mqd B-Tree. On filtering, the expression creates pointers as needed from the current record and matches the record, if the result set is not empty.

An expression consists of

simple expressions creating result sets
operators modifying result sets
parentheses to control evaluation order

pointers

A pointer (to an occurrence of a key in a record) is a 4-tuple (a point in a 4 dimensional space) (r,t,o,p) of

1 the record's id
2 the tag of the field where the key was found, so Smith in an author field can be distinguished from Smith in title
3 the "occurrence" (repeat count) of field, so one can demand Mark and Twain in the same author field, instead of Mark and Twain in two author fields
4 the position (nth word) within the field

simple expressions

A set of results can be created by

[rel][whitespace]term
a relation to a term, producing all pointers to keys that have the given relation (as listed below) to term. Relation defaults to equality =. term$ is a legacy form of %term. Term is a series of "word characters", which are ASCII letters and digits, the underscore and all non-ASCII characters (code greater than 127), or any characters in double quotes ". Between the quotes, two quotes evaluate to a single quote.
#n
a reference to the nth query. Inserts the search or filter expression, as applicable, of the nth query. (Might not be fully implemented). In a search expression, an existing search result set might be used. As a special case, if the whole query expression is only #n, the nth query is used as is, including search, filter and cursor.

Relations are:

=	equality (not needed)
%	prefix match
>	greater than
>=	greater than or equal
<	less than (not yet implemented, behaves like '>'. For range queries
	use '-' rather than '<' '*' '>' )
<=	less than or equal (not yet implemented, behaves like '>=')
:	contains (filtering only)
~	regular expression match (optional, filtering only)

The regexp match will not be available in all environments (since it requires a regexp implementation, probably the one from Tcl). Yet, it's just too tempting ...

The less/greater than (or equal) operators are not used alone, but in "key identity" expressions to define ranges.

key identity

The semantics of pointers is to denote a word position in an occurrence of a tag in a record. We assume that a pointer is only found at the one key being (some transformation of) the word at that position, i.e. pointers are unique.

Therefore, matching arbitrary result sets on pointer identity is of little use. However, when applied to terms, we can use a pointer identity condition as a key identity condition, effectively limiting the range of keys searched:

- key identity
is a pseudo operator, syntactically operating on result sets, but actually implemented as joining and modifying term relations.

Both operands of - must be terms, it refuses to operate on other expressions. If the left hand term has the = relation, it is modified to >=. If the right hand has the = relation, it is modified to <. The expression 'A - B' will search all keys greater or equal "A" and less than "B". To include "B", use 'A - <=B'. The prefix '%ABC' is actually just a shorthand for 'ABC - ABD'. If multiple upper or lower bounds, including those from a prefix, are found in the operands, the lowest lower and highest upper ones apply.

Currently, only the default case greater or equal to A and less than B is implemented.

tag filter

The tag filter selects pointers with certain tags by restricting the second dimension to one or more tag values:

expr/tag or expr/(tag[,tag...])
where tags are literal numbers. In other words, it intersects the left hand result set with the set {(r,t,o,p)|t element of taglist}

As a special case, a tag filter with an empty left hand expression can be used at the very start of (the outermost level of) an expression:

in filtering to select the fields for final output. This is a proper part of filtering, since a record with no matching fields is skipped.
in searching to force a given ordering. This is not a proper part of searching, but rather defines a cursor for record retrieval, which then uses the search result set as filter. We propably should define an extended syntax for ordering, like we have for filtering, to allow key range conditions here. Since this does a full index scan while retrieving records, results are not made unique, and ordering will be most useful only with a single tag (or some alternative tags) of a non-repeatable field. Currently not implemented.

The tag filter, like the pointer identity, is another pseudo operator, actually modifying its subexpressions rather than their result sets. See below for details.

binary operators

Additive operator:

+ is the logical OR
calculating the union of result sets

All other operators reduce their left hand result set by selecting elements based on certain relations they have to those of the right hand set. In other words they are intersections of the left hand with a certain set defined (but not actually created) based on the right hand set.

Multiplicative operators select pointers from their left hand set based upon their equality to pointers in the right hand set in the first n dimensions:

* ("AND") is 1-dimensional equality
selects pointers with the same record id, i.e. those pointer from the left hand set for which there exists a pointer with the same record id in the right hand set
; or (G) is 2-dimensional equality
selecting on same (record and) tag
, or (F) is 3-dimensional equality
selecting on same (record and tag and) occurrence

Mathematically, these operators create an intersection of the left hand with a projection of the right hand to the first n dimensions cross product with the full space of the remaining dimensions. E.g. A;B := A intersect {(r,t,x,y)|exist o,p with (r,t,o,p) in B}, which we can denote as the (G) set of B.
The full 4-dimensional equality operator, pointer identity, is a pseudo operator as described above.

The difference operator

^ ("NOT" or "WITHOUT")
selects based on 1-dimensional inequality, i.e. those pointer from the left hand set for which there exists NO pointer with the same record id in the right hand set. A^B := A intersect {(r,x,y,z)|not exist t,o,p with (r,t,o,p) in B} (the NOT set of B).

Note that the literal words OR, AND and NOT are not operators, but terms.

The distance operators are based on a proximity relation in the 4th dimension, requiring 3-dimensional equality (same record, tag and occurrence):

.[...] or (n) max distance
A sequence of n dots or a number in parentheses selects pointers from the left hand, for which there is a pointer in the right hand where the word position p differs by at most n. (0) is pointer identity. A(n)B := A intersect {(r,t,o,x)|exists p with abs(p-x) <=n and (r,t,o,p) in B} (the dist set of B).
$$[$$] exact distance:
likewise with exact position difference n. (A single $ could be confused with prefix match, but is the same as a .).

Due to the problems of word counting during filtering, the distance operators may silently behave like 3-dimensional equality.

operator precedence

Adjacent terms with no operator in between are combined by *. The operator precedence can by overriden by enclosing any expression in parentheses ().

The operators in decreasing precedence:

pointer/key identity -
positional distance . and $$
field level match , and ; ((F) and (G))
filter /
record level match * and ^
additive +

On the same precedence level, the distance operators .. and $$ associate right ('A.B.C' is 'A.(B.C)'), all others associate left ('A B C' is '(A*B)*C').
The tag filter / has highest precedence to its right hand, since the tag list is not subject to any operator evaluation.

detailed semantics of searching

First of all, tag filters are *defined* to be applied fully distributive. For operators of higher precedence, they are logically distributive: '(A (G) B)/101' yields the same as '(A/101) (G) (B/101)', hence their lower precedence. For '(A ^ B)/100', it looks like we should first select all results for A, strip those where B is (anywhere) in the same record, and then strip all in fields other than 100. This, however, would be better written as 'A/100 ^ B': there is no point in delaying a tag filter. The only reasonable use of applying a tag filter to a lower precedence operation (which requires use of parentheses anyway) is to distribute it. So, '(A ^ B)/100' is *defined* as 'A/100 ^ B/100'.
Second, we define tag filters to not nest, but override, so that only the innermost applies. E.g. in '(A/101 B C)/102', only B and C have to be found in field 102, while A is checked in 101. Again, this is because there would be no point in intersecting nested tag lists: it's shorter to write what should be in effect in the first place.

To summarize,

tag filters are always distributed down to the term level, where they are applied most efficiently, and
there is always at most one tag filter in effect.

In the further reasoning, tag filters are not considered.

The final result set of a search expression is 1-dimensional (record id only), in other words Malete does not maintain Extended Result Sets , but discards the other dimensions after evaluating a search. If higher dimensional operators (including tag filter) are applied to #n references, the original search has to be recomputed. (Which might not be supported).

The reason for this is:

it reduces space requirements between queries
delivering records from search results is easier, especially support for free cursor movement
commutative properties allow for a couple of optimizations

Intermediate results during query evaluation may have to contain higher dimensional data, depending on the operations which will be applied to them:

1 record id only for OR, AND, NOT
2 plus tag for ; (G)
3 plus occurrence for , (F)
4 plus position in field for .. and $$

We say that an expression is evaluated in a n-dimensional context, if n is the highest dimension of an operator to be applied to its result set. (The OR here is considered neutral in that it keeps data in all dimensions, but does not require it). This is an important property, because it allows to ignore what happens to other dimensions.

Not all operators are associative or commutative, in general, the order of evaluation and operands matter. For example, should 'A.B.C' select a C adjacent to A or to B or to any of these? By the right associative property, 'A.B.C' is 'A.(B.C)', which by definition first selects those pointers to (occurrences of) B, which are adjacent to C, then pointers to A adjacent to one of those Bs, which is probably what you expected. It is important for 'B.C' to create pointers to B, not to C, so we can not use 'A.(C.B)' nor '(A.B).C', both of which would select As adjacent to C.

properties of operators

Yet, there are some useful properties of operators.

The multiplicative operators and the difference are based on equality relations, which are reflexive (p=p), symmetric (if p=q then q=p) and transitive (if p=q and q=r, then p=r).
The proximity relation is symmetric, but not transitive: If A is adjacent to B, and B adjacent to C, A is usually not adjacent to C (actually it can only be adjacent to another C like in a field "C A B C"). Exact distance is not even reflexive.

Therefore:

intersection operators are (left and right) distributive over union: '(A+B) op C' = '(A op C)+(B op C)' and 'A op (B+C)' = '(A op B)+(A op C)' for all but the NOT, which reads 'A^(B+C)' = 'A^B^C'.
multiplicative operators and NOT are (self) associative: '(A op B) op C' = 'A op (B op C)', where op is one of * , ; or ^. This also holds where the operators are not the same, but the first is not NOT and the second has no higher dimension.
multiplicative operators are relative commutative in their dimension: 'A op B' = 'B op A' if evaluated in no higher dimensional context (since the results are based on a n-dimensional equality, they are equal in the first n dimensions).
for all intersection operators o,p we also have a property similar to associativity: '(A o B) p C' = '(A p C) o B', since both expressions intersect A with both the o-set of B and the p-set of C. In a commutative context, this also equals 'B o (A p C)', which can be used to get an existing left hand result set B inside a right hand subexpression.
distance is commutative in lower dimensional contexts: e.g. '(A.B),C' = '(B.A),C', since it requires 3-dimensional equality, and, with the above, also '(A.B).C' = 'B.(A.C)'.

Note that the context dimension depends not only on the direct parent operator, but the highest dimension of any ancestor in the parse tree: While '(A.B),C' = '(B.A),C' as final expressions have the same (1-dimensional) results, '((A.B),C)..D' != '((B.A),C)..D', since D's distance will be tested against A or B, resp.

processing search expressions

Some notes on how a search expression is evaluated might be helpful in order to write efficient queries.

Basic naive query processing proceeds by

recursively calculating the result sets of subexpressions
combining these result sets according to the operator

Techniques used in combining result sets are

merging
This is used by the OR to create the union of sets. Duplicates with regard to the required dimensions are eliminated.
marking
First calculate the left hand result set. Iterate over the right hand set, look up and mark elements having the relation in the left hand set. Finally, eliminate all unmarked (or marked, for NOT) elements. This can be used to implement all relation based operators. However, it is most useful where the right hand expression can loop its elements without actually creating the set (direct marking).
filtering
First calculate the right hand result set. Then, while generating elements for the left hand, look up the relation in the right hand set and store left hand elements only as appropriate. Note that a filter applies only while creating sets, not during marking or merging.

All techniques require the result sets to be ordered according to the dimensions, i.e. by record id then tag then occ and pos, to run efficiently.

The actual implementation tries to avoid the naive approach. Instead of calculating subexpression results independently, the results of previous calculations and the mode in which they finally will be combined are used where possible.

tag filters are always applied at the lowest (term) level, thus do not require an additional filtering step
the dimensional context is used to purge unnecessary duplicates and, optionally, can be used to reorder expressions according to their properties.

Creating result sets:

for a ref, in an one-dimensional search expression, we have the results set. Else, it can be created by processing the query independently or inlining its text. Anyway, in most cases, especially where we create under a filter, we need to create a copy.
for an exact match term, the result set can be directly fetched from the index. Other term relations are an OR over all keys they match.
the result set of an OR is created by creating both parts recursively and merging them. Any filter is applied to all subexpressions.
for all other expressions, if not filtering and the right hand does not support direct marking, create it and use as left hand filter. Else create the left hand (using any filter), evaluate the right hand using direct marking, if possible, else create it and mark.

Direct marking in subexpressions:

all simple expressions can efficiently loop their elements and thus support direct marking.
OR and NOT support direct marking, if the left hand supports it and the right hand is a simple expression

Notes

OR supports marking also if the right hand is a plain OR. However, this case (A+(B+C)) should be expressed as ((A+B)+C).
marking can be extended to AND and OR, NOT with complex subexpressions using multiple mark values. (G) and (F) can support marking like AND when evaluated in their dimension; the only non-trivial case of this is when embedded in an OR. This extended marking is an optional feature.
in terms of buffers, filtering is most efficient in an intersection, because the filter set can be released early. It is most efficient, where the right hand support direct marking. In associative situations, it may be possible to rewrite '(A op B) op C' as 'A op (B op C)', so C's result can be released early during the evaluation of B and likewise B's result in A.

Reordering expressions is by far the most important technique, and the best thing about it is that you can do it. To be most efficient, the parse tree must be as unbalanced as possible: In the best case, it is fully left associative like '((((A op B) op C) op D) op E) op F' (with A-F simple expressions supporting direct marking). That way we can always directly operate at the outermost level on the result set of the previous expression.

On the default optimization level 0, no automatic expressions rewriting is done. Higher optimization levels will be specified, e.g. we would expect level 1 to recognize a common case 'A.B.C' in a lower dimensional context, and rewrite the proper expression 'A.(B.C)' to '(B.C).A', so we can reuse B's details once using marking.
Full optimization should also use all commutattive properties to choose subexpressions for marking and filtering.

processing filter expressions

For filter expressions, the result sets are created from the current record and thus are assumed to be rather small, so that buffer space is not an issue.
Instead, short cut evaluation can be assumed to be much more important, since for an interesting filter expression we expect it to not match in most cases. All multiplicative operators should evaluate a term before a complex subexpression and stop processing if it evaluates to the empty set.

Probably the most important optimization here would be to detect terms with exact, prefix or contains relation which are required by multiplicative operators and to do a quick check for their occurence as soon as possible.
Especially where such a filter is applied to all records of a db, it should be applied while streaming the record file, hopefully approaching the performance of grep.

limits

Several limits apply to query and especially search evaluation. Some are currently hardcoded, but may become configuration options in future releases.

the number of subexpressions (terminals and operators) in any expression is limited to 500.
the nesting depth during parsing is limited to 50. (This is usually lower then the depth of the final parse tree).
the number of open queries is limited by session option q (default 20).
the size (number of records) of a search result set is limited by session option s (default 10.000).

$Id: Query.txt,v 1.19 2004/07/16 12:42:39 istr Exp $
For all those mathematical terms, ask Dr. Math