Selene - Base Library

Selene Base is a collection of supporting facilities for the Selene database.

It includes
-	Tio, a library for buffered I/O (currently not supporting Windows)
-	customized versions of Lua standard libraries based on Tio instead of stdio
-	Unicode (UTF-8) support and some crypto (SHA-1, Blowfish)

The current status should be considered alpha,
as there are not yet test cases for all functions.

There are probably a couple of integer overflow issues.
Tio should support any size or offset up to 2GB-1,
but to be on the safe side stay below 1GB.
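
The 2GB-1 limit is the largest value of a signed 32-bit integer.
A minimal sketch of an overflow-safe range check, assuming sizes and
offsets are held in signed 32-bit integers (the text above only states
the limit, not the types). The helper range_fits is hypothetical and
not part of Tio.

```c
#include <stdint.h>

/* Hypothetical helper, not part of Tio: returns 1 if the byte range
 * [off, off+len) stays within 2GB-1 (INT32_MAX), 0 otherwise.
 * The comparison is arranged so that off+len is never computed
 * and therefore cannot overflow. */
int range_fits(int32_t off, int32_t len)
{
	if (off < 0 || len < 0)
		return 0;
	return len <= INT32_MAX - off;
}
```

The point is to compare against the remaining room instead of adding
first and checking afterwards, since the addition itself may overflow.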


*	Tio

Tio supports most of stdio plus a couple of extensions.
Tio is based on low-level system I/O routines like read(2).

The Windows version uses native HANDLEs instead of the CRT's POSIX wrappers.
It is not yet functional, but the existing code fragments are from the
Malete project, where they are known to work on Windows.

Tio does not use stdio, with one exception: float printing.
The tio_dbl function can be customized to print doubles as integers,
which might be suitable for use in integer-based Lua,
or to use the libc's sprintf.
For dietlibc it could use the __dtostr routine,
but that is quite broken anyway.
In short, you can set up float printing as fast and sloppy as you like.
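
The two extremes mentioned above can be sketched as follows.
Both functions share a made-up signature chosen for illustration;
the real tio_dbl interface is not shown in this document and may differ.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical formatter signature: write d into buf (at most n bytes
 * including the terminating 0) and return the length that was needed. */
typedef int (*dbl_fmt)(char *buf, size_t n, double d);

/* Fast and sloppy: truncate toward zero and print as an integer,
 * without touching stdio. Only meaningful for finite values whose
 * magnitude fits into an unsigned long. */
int dbl_as_int(char *buf, size_t n, double d)
{
	char tmp[24];               /* room for 64-bit digits and a sign */
	int neg = d < 0;
	unsigned long v = (unsigned long)(neg ? -d : d);
	int i = 0, len;
	size_t o = 0;

	if (v == 0)
		neg = 0;                /* avoid printing "-0" */
	do {
		tmp[i++] = (char)('0' + v % 10);
		v /= 10;
	} while (v);
	if (neg)
		tmp[i++] = '-';
	len = i;
	while (i > 0 && o + 1 < n)  /* copy reversed, leave room for the 0 */
		buf[o++] = tmp[--i];
	if (n)
		buf[o] = 0;
	return len;
}

/* Delegate to libc, using the "%.14g" format of stock Lua 5.1. */
int dbl_libc(char *buf, size_t n, double d)
{
	return snprintf(buf, n, "%.14g", d);
}
```

A caller would pick one of them once, e.g. dbl_fmt fmt = dbl_as_int;
and use fmt wherever a double has to be printed.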

A few stdio functions like setvbuf do not yet have equivalent wrappers,
but the corresponding functionality is provided by other means.
The only complex functionality missing is the scanf family,
but there are functions tio_fgetd and tio_fgetl to read doubles or longs
based on strtod and strtol.
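
As an illustration of the strtol-based approach, here is a minimal
sketch that parses a long from a chunk of buffered text and reports how
many bytes were consumed, so the caller can advance its read position by
exactly that amount. The name parse_long is hypothetical; the actual
interface of the Tio functions is not documented here.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal long from the 0-terminated text s.
 * On success store the value in *out and return the number of bytes
 * consumed (including skipped leading whitespace).
 * Return 0 if no number was found or the value is out of range. */
size_t parse_long(const char *s, long *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 10);
	if (end == s || errno == ERANGE)
		return 0;
	*out = v;
	return (size_t)(end - s);
}
```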

Tio started out as "tiny I/O".
With about 2500 lines it is not that tiny anymore, but you may tailor it to fit.
While Tio was written to be used with Lua in Selene,
it does not depend on Lua in any way.


**	the Tio structure

Tio's equivalent to a FILE is a struct Tio.

The two most important advantages over a FILE are:
-	it does not require a system file handle,
	but can support in-memory files like string buffers.
-	it provides public access to its buffers,
	which not only helps to avoid copying in some places,
	but more importantly enables decoders to consume exactly the
	proper amount of input with more than one byte of lookahead.
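
Why buffer access matters for decoders can be sketched with a UTF-8
reader that inspects up to four bytes of lookahead but consumes only a
complete sequence. The Buf structure and the function below are
illustrative stand-ins, not the actual struct Tio layout.

```c
#include <stddef.h>

/* Illustrative stand-in for a readable buffer; not the real struct Tio. */
typedef struct {
	const unsigned char *data;
	size_t len;   /* bytes available */
	size_t pos;   /* read position */
} Buf;

/* Decode one UTF-8 sequence at the read position.
 * Returns the code point and advances pos past exactly that sequence.
 * Returns -1 without consuming anything if the sequence is incomplete
 * or malformed. (Overlong forms and surrogates are not rejected here.) */
long buf_get_utf8(Buf *b)
{
	size_t avail = b->len - b->pos, need, i;
	const unsigned char *p = b->data + b->pos;
	long c;

	if (avail == 0)
		return -1;
	if (p[0] < 0x80) {
		need = 1; c = p[0];
	} else if (p[0] < 0xC0) {
		return -1;              /* stray continuation byte */
	} else if (p[0] < 0xE0) {
		need = 2; c = p[0] & 0x1F;
	} else if (p[0] < 0xF0) {
		need = 3; c = p[0] & 0x0F;
	} else if (p[0] < 0xF8) {
		need = 4; c = p[0] & 0x07;
	} else {
		return -1;
	}
	if (avail < need)
		return -1;              /* incomplete: wait for more input */
	for (i = 1; i < need; i++) {
		if ((p[i] & 0xC0) != 0x80)
			return -1;
		c = (c << 6) | (p[i] & 0x3F);
	}
	b->pos += need;
	return c;
}
```

With a FILE, the same decoder would be limited to the single byte of
pushback that ungetc guarantees.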

Unlike stdio, a Tio has well-defined behaviour in a couple of
special situations:
-	mixing reads and writes without an intervening seek does the right thing (TM)
-	support for non-seekable bidirectional streams like sockets
-	support for non-blocking streams, even including fprintf

A Tio can be based on almost anything,
giving greater flexibility than even BSD's funopen.
Currently implemented are:
-	real files and streams
-	sockets (TCP, UDP and LOCAL, but accept is still missing)
-	bidirectional popen (using socketpair)
-	memory-mapped files (currently only real files, no MAP_ANON or mremap)
-	in-memory files based on malloc/realloc.

The memory files use a slowly growing (+50%) contiguous buffer.
A rope structure could be used as an alternative.
However, a single buffer is much simpler, and a good realloc
that resorts to MAP_ANON/mremap beyond some threshold would work
quite efficiently (on Linux, that is).
Most typical uses get by with the initial buffer (4K) anyway.
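
The growth policy described above (4K initial buffer, +50% steps) can be
sketched like this. MemBuf and membuf_reserve are illustrative names,
not the real Tio memory file implementation.

```c
#include <stdlib.h>

/* Illustrative growable buffer; not the real Tio memory file. */
typedef struct {
	char *data;
	size_t used;
	size_t cap;
} MemBuf;

/* Make room for at least `extra` more bytes.
 * Starts at 4K and grows in 50% steps until the request fits.
 * Returns 0 on success, -1 on allocation failure or size overflow. */
int membuf_reserve(MemBuf *m, size_t extra)
{
	size_t need, cap = m->cap ? m->cap : 4096;
	char *p;

	if (extra > (size_t)-1 - m->used)
		return -1;
	need = m->used + extra;
	while (cap < need) {
		size_t grown = cap + cap / 2;
		if (grown < cap)        /* overflow: cannot grow further */
			return -1;
		cap = grown;
	}
	if (cap == m->cap)
		return 0;               /* current buffer is large enough */
	p = realloc(m->data, cap);
	if (!p)
		return -1;
	m->data = p;
	m->cap = cap;
	return 0;
}
```

Growing by 50% instead of doubling wastes less memory on large buffers,
at the cost of a few more realloc calls.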

The operation modes of Tios provide many extra options,
set by the mode string to [fpms]open or by toggling flags:
-	synchronized I/O
-	non-blocking I/O
-	explicit CR-LF text translation even on Unix (nice for some sockets)
-	exclusive creation
-	safe temporary files, anonymous or with controlled name
-	file locking
-	file permissions


*	Unicode support

The Unicode string library, formerly released as a standalone module,
is now based on Tio.  Changes currently only affect format:

All conversion specifiers are parsed using the full printf set
including all flags, width, precision and size specifiers.
Flags, width and precision are honoured as in printf.

Double printing recognizes %F, %a and %A and supports them
if your libc does (or whatever you are using in tio_dbl).
The %p conversion prints a userdata's address as %x.

The %c and %s conversions now support Unicode mode.
This is enabled either by using format from the unicode.utf8 or
unicode.grapheme modules or explicitly by providing one of the
flags '+' for UTF-8 by codepoint or '#' for UTF-8 by graphemes.
-	%c will print a multibyte character for codes greater than 127.
-	%s will calculate the number of bytes to use for precision
	and the number of blanks to add for the field width
	by counting UTF-8 codepoints or graphemes, resp.
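
What the two conversions have to do in codepoint mode can be sketched in
plain C. The function names are illustrative and not taken from the
library; grapheme counting needs Unicode property tables and is not
covered by this sketch.

```c
#include <stddef.h>

/* Encode code point c (up to 0x10FFFF) as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, 0 if c is out of range. */
size_t utf8_encode(unsigned long c, unsigned char *out)
{
	if (c < 0x80) {
		out[0] = (unsigned char)c;
		return 1;
	}
	if (c < 0x800) {
		out[0] = (unsigned char)(0xC0 | (c >> 6));
		out[1] = (unsigned char)(0x80 | (c & 0x3F));
		return 2;
	}
	if (c < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (c >> 12));
		out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (c & 0x3F));
		return 3;
	}
	if (c < 0x110000) {
		out[0] = (unsigned char)(0xF0 | (c >> 18));
		out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (c & 0x3F));
		return 4;
	}
	return 0;
}

/* Count code points in the first n bytes of s by counting
 * every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_count(const unsigned char *s, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if ((s[i] & 0xC0) != 0x80)
			count++;
	return count;
}

/* Number of blanks needed to pad s to a field of `width` code points. */
size_t utf8_padding(const unsigned char *s, size_t n, size_t width)
{
	size_t count = utf8_count(s, n);

	return count < width ? width - count : 0;
}
```

Padding by byte count instead would leave a string with multibyte
characters short of the requested field width.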

In addition, the %n conversion is supported, pushing the
current output position on the stack.

The basic format function now prints to a Tio.
String format prints to a memory file and returns its buffer
followed by any positions pushed by %n conversions.


*	I/O enhancements

The I/O module is based on Tio and uses the string format function.

Besides supporting a bunch of additional mode flags in (f)open,
it also provides bopen, mopen and sopen to open in-memory
files (buffers), memory-mapped files and sockets, respectively.

You can fetch this text using:

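-- open a socket to the web server
f = io.sopen("malete.org:http", "w+T")
-- send a minimal HTTP/1.0 request; the empty line ends the headers
f:write([[GET /Selene-Base HTTP/1.0
Host: malete.org

]])
-- push the buffered request out to the socket
f:flush()
-- read and print the whole response
print(f:read("*a"))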


All files support the format function.
For multibyte characters use the '+' or '#' flag with %c and %s.

In addition there is io.format as shorthand for io.stdout:format
and io.log(level, fmt, ...), which does io.stderr:format(fmt, ...)
if level is <= tio_loglevel. level is a character in "1234567890wivdt",
corresponding to tio_loglevel 1-15 as set from the DEBUG environment
variable (in decimal, default 9).
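
The mapping from level character to numeric level can be sketched like
this, assuming the position of the character in the string gives the
level (so '1'..'9' map to 1-9, '0' to 10 and 'w', 'i', 'v', 'd', 't' to
11-15). That reading is inferred from the description above; the names
below are illustrative.

```c
#include <string.h>

/* Level characters in ascending order; index + 1 is the numeric level. */
static const char log_levels[] = "1234567890wivdt";

/* Return the numeric level (1-15) for a level character, 0 if unknown. */
int log_level_of(int ch)
{
	/* strchr would also match the terminating 0, so exclude it */
	const char *p = ch ? strchr(log_levels, ch) : NULL;

	return p ? (int)(p - log_levels) + 1 : 0;
}

/* A message is emitted if its level does not exceed the configured one. */
int log_enabled(int ch, int loglevel)
{
	int lv = log_level_of(ch);

	return lv != 0 && lv <= loglevel;
}
```

Under this reading, the default of 9 would show levels '1' through '9'
and suppress '0' and the letter levels.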

The lock/unlock functions from Lua Filesystem are also available
as file methods.


*	future plans

-	add pack/unpack (binary numbers) and lots of encode/decode to format.
	unpacks and decodes push their results (like %n) so they can
	be used as input for other conversions and/or used after the call.
-	add a select loop with support for locally generated file events
	(so that local buffers can act as pipe between coroutines).
-	use COCO to yield from format and possibly other file functions
	when a non-blocking file or local buffer has an EAGAIN error.



*	string encodings:
-	use precision like s based on input characters
-	accept a file as parameter
s	plain (only format that fills to width, possibly left aligned)
q	quote (Lua quoted string)
w	widen (latin1 to utf8)
k	hex
b	base64
r	urlencoded
m	markup (HTML/XML, encodes lt, gt, quote as #34 and apos as #39)
y	quoted printable
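
As a sketch of what the planned m (markup) encoding might produce, the
following escapes the characters named above. Escaping '&' as &amp; is
an added assumption, since the output would otherwise be ambiguous; the
function name and signature are illustrative.

```c
#include <stddef.h>
#include <string.h>

/* Escape src (n bytes) for HTML/XML into dst (capacity cap).
 * Returns the number of bytes written, or (size_t)-1 if dst is too small. */
size_t markup_encode(char *dst, size_t cap, const char *src, size_t n)
{
	size_t i, o = 0;

	for (i = 0; i < n; i++) {
		const char *rep = NULL;
		size_t len = 1;

		switch (src[i]) {
		case '<':  rep = "&lt;";  break;
		case '>':  rep = "&gt;";  break;
		case '"':  rep = "&#34;"; break;
		case '\'': rep = "&#39;"; break;
		case '&':  rep = "&amp;"; break;   /* assumption, see above */
		}
		if (rep)
			len = strlen(rep);
		if (len > cap - o)
			return (size_t)-1;
		if (rep)
			memcpy(dst + o, rep, len);
		else
			dst[o] = src[i];
		o += len;
	}
	return o;
}
```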

*	plain s can 'pack' strings controlled by size:
z	append terminating 0
hh,h,l,ll	prepend 1,2,4 or 8 bytes unsigned length (only with string param)
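
A sketch of how the planned length-prefixed packing could work.
Byte order is not yet decided in these notes, so big-endian is assumed
here; the function name and signature are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Write s (n bytes) to dst preceded by its length as an unsigned
 * big-endian integer of `size` bytes (1, 2, 4 or 8 for hh, h, l, ll).
 * Returns the bytes written, or 0 if the length does not fit the prefix
 * or dst (capacity cap) is too small. */
size_t pack_lstring(unsigned char *dst, size_t cap,
                    const char *s, size_t n, unsigned size)
{
	unsigned i;

	if (size != 1 && size != 2 && size != 4 && size != 8)
		return 0;
	if (size < sizeof n && (n >> (8 * size)) != 0)
		return 0;               /* length too large for the prefix */
	if (cap < size || cap - size < n)
		return 0;
	for (i = 0; i < size; i++) {
		unsigned shift = 8 * (size - 1 - i);

		/* bytes beyond the width of size_t are always 0 */
		dst[i] = shift < 8 * sizeof n ? (unsigned char)(n >> shift) : 0;
	}
	memcpy(dst + size, s, n);
	return size + n;
}
```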


*	packing binary numbers with size:
-	if width is given, align with 0 bytes to a multiple of width
-	flags ' ',0 for BE,LE ?
d,u	with hh,h,l,ll pack 1,2,4 or 8 bytes signed/unsigned
f	with h,l,L packs 4,8 or 12 bytes float
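
The alignment and integer packing rules above might be implemented along
these lines. This is a sketch of planned behaviour: the helper names are
made up, and byte order is passed as a plain parameter because the flag
assignment is still marked as open above.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 0 bytes needed to advance pos to a multiple of width. */
size_t pad_to(size_t pos, size_t width)
{
	return width ? (width - pos % width) % width : 0;
}

/* Store the low `size` bytes (1, 2, 4 or 8) of v at dst,
 * big-endian if big is non-zero, little-endian otherwise.
 * Signed values can be passed after conversion to uint64_t,
 * which yields their two's complement bit pattern. */
void pack_uint(unsigned char *dst, uint64_t v, unsigned size, int big)
{
	unsigned i;

	for (i = 0; i < size; i++) {
		unsigned shift = 8 * (big ? size - 1 - i : i);

		dst[i] = (unsigned char)(v >> shift);
	}
}
```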


*	unpacking
The first unpacking conversion encountered will use the next parameter
as unpacking source. This can be a string or a file.
Following unpacking conversions will use more data from the same source.
The '<' conversion can be used to switch to a new data source
and set an offset.

*	unpacking binary numbers with size - pushes lua_Number:
-	if width is given, skip to a multiple of width
D,U,F	with size unpack binary (aligned to width)

*	unpacking printed numbers - pushes lua_Number:
-	skipping leading whitespace
-	using at most width characters and consuming at least prec
D,U	(w/o size) read printed signed/unsigned decimal
O,P	read octal, hex
C	w/o width unpacks 1 character as number
I	unpacks int as of strtol
J	unpacks lua_Number (strtol or strtod)
N	pushes source stream position


*	unpacking strings S
-	using at most prec bytes or up to whitespace
-	consuming at least width or as given by size
C	w/ width unpacks width characters as string
S	with z up to (and discarding) 0 byte
S	with hh,h,l,ll as given by prefixed length
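
The idea of an unpacking source that successive conversions consume from
can be sketched for the S conversion with z size. Source and
unpack_zstring are illustrative names for planned behaviour, not
existing functions.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative unpacking source: a byte range with a read position. */
typedef struct {
	const char *data;
	size_t len;
	size_t pos;
} Source;

/* Unpack a 0-terminated string from the source.
 * On success set *s and *n to the string and its length, advance past
 * the string and the 0 byte (which is discarded), and return 1.
 * Return 0 and consume nothing if no 0 byte is found. */
int unpack_zstring(Source *src, const char **s, size_t *n)
{
	const char *start = src->data + src->pos;
	const char *end = memchr(start, 0, src->len - src->pos);

	if (!end)
		return 0;
	*s = start;
	*n = (size_t)(end - start);
	src->pos += *n + 1;
	return 1;
}
```

Each call picks up where the previous one stopped, which mirrors how
following conversions use more data from the same source.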

*	unpacking (decoding) strings Q, W, K ...
-	using at most prec bytes or up to bad code
-	consuming at least width or as given by size
z	discarding the terminating byte (0 is considered a bad code in all encodings)
(hh,h,l,ll	as given by prefixed length ?)


*	stack manipulation
$	set arg pointer to width, apply offset prec
<	w/ width sets unpack source to arg width.
	w/ '+' use next arg. in both cases apply offset prec
?	lua_insert ?


---
	$Id: Selene-Base,v 1.3 2006/08/25 15:17:29 paul Exp $