12. Clib

Clib is the C interface and "system" module of the Q programming language. As of Q 7.8, this component actually consists of two different modules, clib.q which provides basic C data structures and routines commonly used in most Q programs, and system.q which contains most of the POSIX system interface. Only clib.q is part of the prelude; for most system functions, you will have to explicitly import system.q in your programs. In the sections below, we will always indicate whether the system.q module is needed for the described operations.

In difference to the other standard library modules, clib and system are external modules, i.e., most functions are actually implemented in C (cf. C Language Interface). Together, clib and system provide additional string operations, extended file functions, C-style formatted I/O, low-level and binary I/O, an interface to various system functions, POSIX thread functions, expression references, time functions, internationalization support, filename globbing and regular expression matching, additional integer functions from the GMP library, and, last but not least, efficient C replacements for some common standard library list and string processing functions.

Even if you do not use the extra functionality provided by these modules, you will benefit from the replacement operations (which are in clib and thus included in the prelude), which considerably speed up basic list and string processing, sometimes by several orders of magnitude.

NOTE: Not all of the following operations are implemented on all systems. The UNIX-specific operations are marked with the symbol `(U)' in the clib.q and system.q scripts. Only a portable subset of the UNIX system interface is provided, which encompasses the most essential operations found on many recent UNIX (and other POSIX) systems, as described by the ANSI C and POSIX standards as well as the Single UNIX Specification (SUS). These operations are also available on Linux and OSX systems.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.1 Manifest Constants

Clib defines an abundance of symbolic values for use with various system functions. Most of these can be found in system.q, but a few constants related to the memory sizes of various basic C data types and the fseek and setvbuf operations can also be found in clib.q.

System constants actually vary from system to system; only the most common values are provided as global variables here. The variables are declared const (read-only) and are initialized at startup time. Flag values can be combined using bitwise logical operations as usual. A complete list of the variables can be found at the beginning of the clib.q and system.q scripts. Flag values which are unavailable on the host system will be set to zero, other undefined values to -1. Thus undefined values will generally have no effect or cause the corresponding operations to fail.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.2 Additional String Functions

These functions provide an interface to some familiar character routines from the C library. They can all be found in clib.q and are thus in the standard prelude.

Character predicates: These work exactly like the corresponding C library routines, except that they work with arbitrary Unicode, not just ASCII, characters, provided that the interpreter has been built with Unicode support.

public extern islower C, isupper C, isalpha C, isdigit C, isxdigit C,
  isalnum C, ispunct C, isspace C, isgraph C, isprint C, iscntrl C,
  isascii C;

String conversion: Convert a string to lower- or uppercase (like the corresponding C functions, but work on arbitrary Unicode strings, not just on single ASCII characters).

public extern tolower S, toupper S;

Examples

Count the number of alphanumeric characters in a text:

==> #filter isalnum (chars "The little brown fox.\n")
17

Convert a string to uppercase:

==> toupper "The little brown fox.\n"
"THE LITTLE BROWN FOX.\n"

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.3 Byte Strings

The following type represents unstructured binary data implemented as C byte vectors. This data structure is used by the low-level I/O functions and other system functions which operate on binary data. The ByteStr type itself and its operations are implemented in clib.q and thus included in the prelude.

public extern type ByteStr;

public isbytestr B;                     // check for byte strings

Byte strings are like ordinary character strings, but they do not have a printable representation, and they may include zero bytes. (Recall that a zero byte in a character string terminates the string.) They can be used to encode arbitrary binary data such as C vectors and structures. The bytestr function can be used to construct byte strings from integers, floating point numbers, string values or lists of unsigned byte values:

public extern bytestr X;                // create a byte string

The X argument denotes the data to be encoded and can be either a list of byte values (unsigned integers in the range from 0 to 255), or an atomic data object, i.e., an integer, floating point number or string constant. In the latter case, the argument can also have the form (X,SIZE) indicating the desired byte size of the object; otherwise a reasonable default size is chosen. If the specified size differs from the actual size of X, the result is zero-padded or truncated accordingly. Integer values are encoded in the host byte order, with the least significant GMP limb first; negative integers are represented in 2's complement. Floating point values are encoded using double precision by default or if the byte count is sufficient (i.e., at least 8 on most systems), and using single precision otherwise. Strings are by default encoded in the system encoding, but you can also specify the desired target encoding as (X,CODESET) (or (X,CODESET,SIZE) if you also need to specify a byte size), where CODESET is a string denoting the target encoding.

Like ordinary character strings, byte strings can be concatenated, size-measured, indexed, sliced and compared lexicographically. Moreover, a byte string can be converted back to a (multiprecision) integer, floating point number, string value, or a list of byte values. (When converting back to a string you can specify the source encoding as in bstr (B,CODESET), otherwise the system encoding is assumed.) For these purposes the following operations are provided.

public extern bcat Bs;                  // concatenate list of byte strings
public extern bsize B;                  // byte size of B
public extern byte I B;                 // Ith byte of B
public extern bsub B I J;               // slice of B (bytes I..J)
public extern bcmp M1 M2;               // compare M1 and M2

public extern bint B;                   // convert to unsigned integer
public extern bfloat B;                 // convert to floating point number
public extern bstr B;                   // convert to string
public bytes B;                         // convert to list
public ::list B;                        // dito

You can use the bytes function to convert a byte string to a list of byte values; the list function is overloaded to provide the same functionality. These functions are defined as follows:

bytes B:ByteStr                         = map (B!) [0..#B-1];
list B:ByteStr                          = bytes B;

For convenience, the common string operators and the sub function are overloaded to work on byte strings as well. Thus #B returns the size of B (the number of bytes it contains) and B!I the Ith byte of B. B1++B2 concatenates B1 and B2, sub B I J returns the slice from byte I to J, and the relational operators `=', `<', `>' etc. can be used to compare byte strings lexicographically. These operations are all implemented in terms of the functions listed above.

Byte Strings as Mutable C Vectors

As of Q 7.11, clib supports a number of additional operations which allow you to treat byte strings as mutable C vectors of signed/unsigned 8/16/32 bit integers or single/double precision floating point numbers. The following functions provide read/write access to elements and slices of such C vectors:

public extern get_int8 B I, get_int16 B I, get_int32 B I;
public extern get_uint8 B I, get_uint16 B I, get_uint32 B I;
public extern get_float B I, get_double B I;

public extern put_int8 B I X, put_int16 B I X, put_int32 B I X;
public extern put_uint8 B I X, put_uint16 B I X, put_uint32 B I X;
public extern put_float B I X, put_double B I X;

Note that the given index argument I is interpreted relative to the corresponding element type. Thus, e.g., get_int32 B I returns the Ith 32 bit integer rather than the integer at byte offset I. Also note that integer arguments must fit into machine integers, otherwise these operations will fail. Integers passed for floating point arguments will be coerced to floating point values automatically.

For the get_xxx functions, the index parameter may also be a pair (I,J) to return a slice of the given byte string instead of a single element (this works like sub/bsub, but interprets indices relative to the element type). The put_xxx functions also accept a byte string instead of an element as input, and will then overwrite the corresponding slice of the target byte string B with the given source byte string X. Similar to sub/bsub, these variations of get_xxx/put_xxx are "safe" in that they automatically adjust the given indices to fit within the bounds of the target byte string.

Moreover, the following convenience functions are provided to convert between byte strings and lists of integer/floating point elements.

public extern int8_list B, int16_list B, int32_list B;
public extern uint8_list B, uint16_list B, uint32_list B;
public extern float_list B, double_list B;

public extern int8_vect Xs, int16_vect Xs, int32_vect Xs;
public extern uint8_vect Xs, uint16_vect Xs, uint32_vect Xs;
public extern float_vect Xs, double_vect Xs;

Examples

Encode an integer as a byte string, take a look at its individual bytes, and convert the byte string back to an integer:

==> hex

==> def B = bytestr 0x01020304; bytes B; bint B
[0x4,0x3,0x2,0x1]
0x1020304

(Note that this result was obtained on a little-endian system, hence the least significant byte 0x04 comes first in the byte list.)

Negative integers are correctly encoded in 2's complement:

==> def B = bytestr (-2); bytes B; bint B
[0xfe,0xff,0xff,0xff]
0xfffffffe

To work with these binary representations you must be aware of the way GMP represents multiprecision integers. In particular, note that the default size of an integer is always a multiple (at least one) of GMP's limb size which is usually 4 or 8 bytes depending on the host system's default long integer type. The actual limb size can be determined as follows:

==> #bytes (bytestr 0)

In order to get integers of arbitrary sizes, an explicit SIZE argument may be used. For instance, here is how we encode small (1 or 2 byte) integers:

==> bytes (bytestr (0x01,1)); bytes (bytestr (0x0102,2))
[0x1]
[0x2,0x1]

The host system's byte sizes of various atomic C types can be determined with symbolic values declared at the beginning of clib.q, such as SIZEOF_CHAR, SIZEOF_SHORT, SIZEOF_LONG, SIZEOF_FLOAT and SIZEOF_DOUBLE.

Another fact worth mentioning is that even on big-endian systems, integers are always encoded with the "least significant limb" first. So, for instance, given that the limb size is 4, as in the above examples, the 2-limb integer 0x0102030405060708 consists of bytes 0x8 0x7 0x6 0x5 0x4 0x3 0x2 0x1 on a little-endian system, in that order, whereas the byte order on a big-endian system is 0x5 0x6 0x7 0x8 0x1 0x2 0x3 0x4.

Here is how we can quickly check the byte order of the host system:

==> hd (bytes (bytestr 1))

This expression returns 1 on a little-endian system and zero otherwise.

As long as an integer does not exceed the machine's word size (which usually matches the limb size), we can simply convert between big-endian and little-endian representation by reversing the byte list:

==> bytestr (reverse (bytes B))

Floating point values can be encoded either in double or single precision, depending on the SIZE argument. The default size is double precision (usually 8 bytes).

==> bfloat (bytestr (1/3)); bfloat (bytestr (1/3,SIZEOF_FLOAT))
0.333333333333333
0.333333343267441

The default size of the encoding of a character string is the byte size of the string in the target encoding (the system encoding by default). If an explicit size is given, the string is zero-padded or truncated if necessary. The following example will work with any system encoding based on 7 bit ASCII (like Latin1, UTF-8 or ASCII itself):

==> dec

==> def S1 = bytestr "ABC", S2 = bytestr ("ABC",2), S3 = bytestr ("ABC",5)

==> bytes S1; bytes S2; bytes S3
[65,66,67]
[65,66]
[65,66,67,0,0]

==> bstr S1; bstr S2; bstr S3
"ABC"
"AB"
"ABC"

By combining elements like the ones above, and including appropriate "tagging" information, more complex data structures can be represented as binary data as well. For this purpose, the byte strings of the tags and the data elements can be concatenated with bcat or the `++' operator. This is useful, in particular, for compact storage of objects in files. Moreover, some system functions involve binary data which might represent C structures and/or vectors. Such data can be assembled from the constituent parts by simply concatenating them. For instance, consider the following C struct:

struct { char foo[108]; short bar; int baz; };

A value of this type, say {"Hello, world.", 4711, 123456}, can then be encoded as follows:

==> bytestr ("Hello, world.",108) ++ bytestr (4711,SIZEOF_SHORT) ++ \
bytestr (123456,SIZEOF_INT)

Similarly, a list of integers can be converted to a corresponding C vector as follows:

==> bcat (map bytestr [1..100])

When encoding such C structures you must also consider alignment issues. For instance, most C compilers will align non-byte data at even addresses.

In order to facilitate the handling of C vectors of integers and floating point values, as of Q 7.11 clib offers a number of specialized operations which provide direct read/write access to elements and slices of numeric vectors, and allow you to convert between C vectors and Q lists of integer or floating point values. These operations are all implemented directly in C and will usually be much more efficient for manipulating numeric C vectors than the basic byte-oriented functions. Moreover, they allow you to modify the elements of a C vector in a direct fashion, turning byte strings into a mutable data structure.

Different operations are provided to handle vectors of signed or unsigned 8/16/32 bit (machine) integers, as well as single (32 bit) and double precision (64 bit) floating point numbers. For instance:

==> def B = uint32_vect [100..110]

==> uint32_list B
[100,101,102,103,104,105,106,107,108,109,110]

==> get_uint32 B 1
101

==> put_uint32 B 1 0xffffffff
()

==> uint32_list B
[100,4294967295,102,103,104,105,106,107,108,109,110]

Note that, because these C vectors are just normal byte strings, you can freely convert between different representations of the numeric data. E.g.:

==> take 12 $ int8_list B
[100,0,0,0,-1,-1,-1,-1,102,0,0,0]

Entire slices of byte strings can be retrieved and overwritten as well. Note that, as with sub, the indices are adjusted automatically to stay within the bounds of the target vector.

==> put_uint32 B (-2) (uint32_vect [90..94])
()

==> uint32_list B
[92,93,94,103,104,105,106,107,108,109,110]

==> uint32_list $ get_uint32 B (-2,3)
[92,93,94,103]

==> uint32_list $ get_uint32 B (8,100)
[108,109,110]

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.4 Extended File Functions

Clib provides the following enhanced and additional file functions:

public extern ::fopen NAME MODE, fdopen FD MODE, freopen NAME MODE F;
public extern fileno F;
public extern setvbuf F MODE;
public extern fconv F CODESET;
public extern tmpnam, tmpfile;
public extern ftell F, fseek F POS WHENCE;
public rewind F;
public extern gets, fgets F;
public extern fget F;
public extern ungetc C, fungetc F C;

These are all defined in clib.q and thus included in the prelude.

The fopen version of clib handles the `+' flag in mode strings, thus enabling you to open files for both reading and writing. The mode "r+" opens an existing file for both reading and writing; the initial file contents are unchanged, and both the input and output file pointers are positioned at the beginning of the file. The "w+" mode creates a new file, or truncates it to zero size if it already exists, and positions the file pointers at the beginning of the file. The "a+" mode appends to an existing file (or creates a new one); the initial file pointer is set at the beginning of the file for reading, and at the end of the file for writing. All these modes also work in combination with the b (binary file) flag.

The freopen function is like fopen, but reopens an existing file object on another file. Just as in C programming, the main purpose of this operation is to enable the user to redirect the standard I/O streams associated with the interpreter process (available in the interpreter by means of the INPUT, OUTPUT and ERROR variables).

The fdopen function opens a new file object on a given file descriptor, given that the mode is compatible. Conversely, the fileno function returns the file descriptor of a file object. (See also the functions for direct file descriptor manipulation in Low-Level I/O.)

The setvbuf function sets the buffering mode for a file (IONBF = no buffering, IOLBF = line buffering, IOFBF = full buffering). This operation should be invoked right after the file has been opened, before any I/O operations are performed.

The fconv function sets the encoding of a file. By default, Q's built-in I/O operations as well as clib's string I/O functions assume the system encoding, and convert between this encoding and the internal UTF-8 string representation as needed. If a text file uses an encoding different from the system encoding, you can use the fconv function to set the desired encoding. CODESET must be a string denoting a valid encoding name for the iconv function (see also Internationalization, below). This affects all subsequent text read/write operations on the file. (This operation only works for Unicode-capable systems which have iconv installed. Also note that this function is only available for Q 7.0 and later.)

The tmpnam and tmpfile functions work just like the corresponding C routines: tmpnam returns a unique name for a temporary file, and tmpfile constructs a temporary file opened in "w+b" mode, which will be deleted automatically when it is closed. See the tmpnam(3) and tmpfile(3) manual pages for details.

The ftell/fseek functions are used for file positioning. The ftell function returns the current file position, while fseek function positions the file at the given position. The rewind function provides a convenient shorthand for repositioning the file at the beginning. These operations work just like the corresponding C functions. The WHENCE argument of fseek determines how the POS argument is to be interpreted; it can be either SEEK_SET (POS is relative to the beginning of the file, i.e., an absolute position), SEEK_CUR (POS is relative to the current position) or SEEK_END (POS is relative to the end of the file). In the latter two cases POS can also be negative.

Portability Notes:

According to the ISO C specification, you should use fflush or fseek before switching between reading and writing on a file opened with the `+' flag.
On non-UNIX systems, ftell and fseek might only work reliably if the file is opened in binary mode (b flag).

The gets/fgets functions work like the C fgets function, i.e., they read a line from standard input or the given file including the trailing newline, if any. The fget function reads an entire file at once and returns it as a string. The ungetc/fungetc functions push back a single character on standard input or the given input file, like the C ungetc function. The C library only guarantees that pushing back a single ASCII character will work, so the result of pushing back multiple or multibyte characters is implementation-dependent.

Moreover, the following additional aliases are provided for C aficionados:

public ::readc as getc, ::freadc F as fgetc;
public ::writes S as puts, ::fwrites F S as fputs;
public ::writec C as putc, ::fwritec F C as fputc;

Examples

Open a new file for both reading and writing:

==> def F = fopen "test" "w+"

Write a string to the file:

==> fwrites F "The little brown fox.\n"
()

Current position is behind written string (at end-of-file):

==> ftell F
22

Rewind (go to the beginning of the file):

==> rewind F
()

Read back the string we've written before:

==> fgets F
"The little brown fox.\n"

Check that we're again at end-of-file:

==> feof F
true

Output another string:

==> fwrites F "The second line.\n"
()

Position behind the first string:

==> fseek F 22 SEEK_SET
()

Reread the second string:

==> fgets F
"The second line.\n"

And here's how to read an entire text file at once:

==> def T = fget (fopen "clib.q" "r")

To quickly compute a 32 bit checksum of the file:

==> sum (bytes (bytestr T)) mod 0x100000000
3937166

Finally, let's split the text into lines and add line numbers using sprintf (see C-Style Formatted I/O):

==> def L = split "\n" T

==> def L = map (sprintf "%3d: %s\n") (zip [1..#L] L)

==> do writes (take 5 L)
  1:
  2: /* clib.q: Q's system module */
  3:
  4: /* This file is part of the Q programming system.
  5: 
()

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.5 C-Style Formatted I/O

These functions provide an interface to the C printf and scanf routines. They are all defined in clib.q and thus included in the prelude.

public extern printf FORMAT ARGS, fprintf F FORMAT ARGS,
  sprintf FORMAT ARGS;
public extern scanf FORMAT, fscanf F FORMAT, sscanf S FORMAT;

Arguments to the printf routines and the results of the scanf routines are generally encoded as tuples or single non-tuple values (if only one item is read/written).

All the usual conversions and flags of C printf/scanf are supported, except %p (pointer conversion). The basic h and l length modifiers are also understood, but not the fancy ISO C99 extensions like ll or hh, or l modifiers on characters and strings. Two further unsupported features of the printf functions are the %n (number of written characters) conversion and explicit argument indexing (m$); thus all arguments have to be in the same order as specified in the printf format string. The %n conversion is implemented for the scanf functions, though.

As these functions are simply wrappers for the corresponding C functions, integer conversions are generally limited to values which fit into machine integers. To handle integers of arbitrary sizes, you might treat them as strings (%s) in the format string and do the actual conversion manually with val or str.

Seasoned C programmers will appreciate that the wrapper functions provided here are safe in that they check their arguments and prevent buffer overflows, so they should never crash your program with a segfault. To these ends, if a %s or %[...] conversion without maximum field width is used with scanf, the field width will effectively be limited to some (large) value chosen by the implementation.

Examples

See the printf(3) and scanf(3) manual pages for a description of the format string syntax. Some basic examples follow (<CR> indicates that you hit the carriage return key to terminate a line):

==> printf "%d\n" 99
99
()

==> printf "%d\n" (99)
99
()

==> printf "%s %s %d\n" ("foo","bar",99)
foo bar 99
()

==> scanf "%d"
99<CR>
99

==> scanf "%s %s %d"
foo bar 99<CR>
("foo","bar",99)

As indicated, multiple values are denoted as tuples, and the printf function accepts both a single value or a one-tuple for a single conversion. The scanf function always returns a single, non-tuple value if only a single conversion is specified. Zero items are represented using the empty tuple. Note that you always have to supply the ARGS argument of printf, thus you specify an empty tuple if there are no output conversions:

==> printf "foo\n" ()
foo
()

The scanf function also returns an empty tuple if no input items are converted. For instance (as usual, using the * flag with a scanf conversion suppresses the corresponding input item):

==> scanf "%*s"
foo<CR>
()

Note that while scanf for most conversions skips an arbitrary amount of leading whitespace, the trailing whitespace character at which a conversion stops is not discarded by scanf. You can notice this if you invoke, e.g., readc afterwards:

==> scanf "%s %d"; readc
foo 99<CR>
("foo",99)
"\n"

If you really have to skip the trailing whitespace character, you can do this with a suppressed character conversion, e.g.:

==> scanf "%s %d%*c"; writes "input: "||reads
foo 99<CR>
("foo",99)
input: <reads function waiting for input here>

The fprintf/fscanf functions work analogously, but are used when writing or reading an arbitrary file instead of standard output or input. For instance:

==> var msg = "You're not supposed to do that!"

==> fprintf ERROR "Error: %s\n" msg
Error: You're not supposed to do that!
()

The sprintf function returns the formatted text as a string instead of writing it to a file:

==> sprintf "%s %s %d\n" ("foo","bar",99)
"foo bar 99\n"

Likewise, sscanf takes its input from a string:

==> sscanf "foo bar 99\n" "%s %s %d"
("foo","bar",99)

The %n conversion is especially useful with sscanf, since it allows you to determine the number of characters which were actually consumed:

==> sscanf "foo bar 99 *** extra text here ***\n" "%s %s %d%n"
("foo","bar",99,10)

You might then use the character count, e.g., to check whether the input format matched the entire string, or whether there remains some text to be processed.

Some remarks about the role of the length modifiers h and l are in order. Just as with the C scanf routines, you need the l modifier to read a double precision value; a simple %f will only read single precision number:

==> scanf "%f"
1e100<CR>
inf

==> scanf "%lf"
1e100<CR>
1e+100

The printf functions, however, always print double precision numbers, so the l modifier is not needed:

==> sprintf "%g" 1e100
"1e+100"

For the integer conversions, the h and l modifiers denote short (usually 2 byte) and long (usually 4 byte) integer values. If the modifier is omitted, the default integer type is used (this usually is the same as long, but your mileage may vary).

As already indicated, the printf and scanf routines are limited to machine integer sizes. Thus a scanf integer conversion will always return a short or long integer value, depending on the length modifier used. If a printf integer conversion is applied to a "big" integer value, only the least significant bytes of the value are printed, as if the printed number (represented in 2's complement if negative) had been cast to the corresponding integer type in C. Thus the printed result will be consistent with C printf output under all circumstances. For instance:

==> def N = 0xffff70008000 // big number

==> printf "%hu %lu\n" (N,N)
32768 1879080960
()

==> printf "%hd %ld\n" (N,N)
-32768 1879080960
()

To correctly print a big integer value, you can convert it manually with Q's built-in str function, then print the value using a %s conversion:

==> printf "%s\n" (str 1234567812345678)
1234567812345678
()

Similarly, you can read a big integer value by converting it as a string, and then apply the val builtin.

==> val (scanf "%s")
1234567812345678<CR>
1234567812345678

Here you might use the %[...] conversion to ensure that the number is in proper format (the initial blank is needed here to skip any leading whitespace):

==> val (scanf " %[0-9-]")
-1234567812345678
-1234567812345678

On output, the integer and floating point conversions can all be used with either integer or floating point arguments; integers will be converted to floating point values and vice versa if necessary:

==> printf "An integer: %d\n" 99.9
An integer: 99
()

==> printf "A floating point value: %e\n" 99
A floating point value: 9.900000e+01
()

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.6 File and Directory Functions

These functions provide the same functionality as their C counterparts. They are all to be found in system.q and thus you have to explicitly import the system module to use them.

public extern rename OLD NEW;           // rename a file
public extern unlink NAME;              // delete a file
public extern truncate NAME LEN;        // truncate a file (U)
public extern getcwd, chdir NAME;       // get/set the working directory
public extern mkdir NAME MODE;          // create a new directory
public extern rmdir NAME;               // remove a directory
public extern readdir NAME;             // list the files in a directory
public extern link OLD NEW;             // create a hard link (U)
public extern symlink OLD NEW;          // create a symbolic link (U)
public extern readlink NAME;            // read a symbolic link
public extern mkfifo NAME MODE;         // create a named pipe (U)
public extern access NAME MODE;         // test access mode
public extern chmod NAME MODE;          // set the file mode
public extern chown NAME MODE UID GID;  // set file ownership (U)
public extern lchown NAME MODE UID GID; // set link ownership (U)
public extern utime NAME TIMES;         // set the file times
public extern umask N;                  // set/get file creation mask
public extern stat NAME, lstat NAME;    // file and link information

The stat/lstat functions return a tuple consisting of the commonly available fields of the C stat struct, see stat(2). For your convenience, the following mnemonic functions are provided for accessing the different components:

public st_dev STAT, st_ino STAT, st_mode STAT, st_nlink STAT,
  st_uid STAT, st_gid STAT, st_rdev STAT, st_size STAT, st_atime STAT,
  st_mtime STAT, st_ctime STAT;

Examples

These all need the system module, so you have to import it in the interpreter to make the following examples work. E.g.:

==> import system

With that out of the way, let's play around with some of these functions:

==> mkdir "tmp" 0777||chdir "tmp"||mkfifo "foo" 0666||\
rename "foo" "bar"||unlink "bar"||chdir ".."||rmdir "tmp"
()

(Create a tmp subdirectory, change to it, create a new FIFO special file, rename that file, delete it, change back to the original directory, and remove the tmp directory. All with a single expression which realizes identity.)

Now for something more useful. We can retrieve the current umask while setting it to zero, and then reset it to the original value as follows:

==> def U = umask 0; oct; umask U || U; dec
022

List the files in the current directory:

==> readdir "."
[".","..","Makefile","givertcap","clib.c","clib.q","Makefile.am",
"Makefile.in","README-Clib","examples","Makefile.mingw"]

Get the size of a file:

==> st_size (stat "README-Clib")
355

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.7 Process Control

With the notable exception of the exit function which is included in clib (and thus in the prelude), all the following functions need an explicit import of the system module.

The system function returns the status code of the command if execution was successful, and fails otherwise:

public extern system CMD;               // exec command using the shell

Clib also provides the usual UNIX process creation and management routines. Most of these really require a UNIX system; no attempt is made to emulate operations like fork on systems where they are not implemented. Thus the only process operations which currently work under Windows are system, exec, spawn, _spawn, exit and getpid.

public extern fork;                     // fork a child process (U)
public extern exec PROG ARGS;           // execute program
public extern spawn PROG ARGS;          // execute program in child process
public extern _spawn MODE PROG ARGS;    // execute child with options
public extern nice INC;                 // change nice value (U)
public extern exit N;                   // exit process with given exit code
public extern pause;                    // pause until a signal occurs (U)
public extern raise SIG;                // raise signal in current process
public extern kill SIG PID;             // send signal to given process (U)
public extern getpid;                   // current process id
public extern getppid;                  // parent's process id (U)
public extern wait;                     // wait for any child process (U)
public extern waitpid PID OPTIONS;      // wait for given child process (U)

All these operations are simply wrappers for the corresponding C library routines. Note, however, that the kill function takes the signal to send as its first argument, which makes it easier to use partial applications of the function, e.g., to iterate a kill operation over a list of process numbers (as in `do (kill SIGTERM) PIDs').

The exec function performs a path search like the C execlp/execvp function; the parameters for the program are given as a string list ARGS, and as usual the first argument should repeat the program file name. This function never returns unless it fails. The spawn and _spawn operations are provided to accommodate Windows' lack of fork and wait; these functions work on both UNIX and Windows. The spawn function works like exec, but runs the program in a new child process. It returns the new process id (actually the process handle under Windows). The _spawn function is like spawn, but accepts an additional MODE parameter which determines how the child is to be executed, either P_WAIT (wait for the child, return its exit status), P_NOWAIT (do not wait for the child, same as spawn), P_OVERLAY (replace the current image with the new process, same as exec) and P_DETACH (run the new process in the background). (Note that the P_DETACH option is ignored on UNIX systems; the correct way to code a "daemon" on UNIX is shown in the examples section below.)

On UNIX, the following routines are provided to interpret the status code returned by the system, _spawn, wait and waitpid functions:

public extern isactive STATUS;
// process is active
public extern isexited STATUS, exitstatus STATUS;
// process has exited normally, get its exit code
public extern issignaled STATUS, termsig STATUS;
// process was terminated by signal, get the signal number
public extern isstopped STATUS, stopsig STATUS;
// process was stopped by signal, get the signal number

For more information about the process functions, we refer the reader to the corresponding UNIX manual pages.

Operations to access the process environment are also implemented. The getenv function fails if the given variable is not set in the environment; this lets you distinguish this error condition from a defined variable with empty value. The setenv function overwrites an existing definition of the given variable:

public extern getenv NAME, setenv NAME VAL;
// get/set environment variables

On UNIX, the following operations provide access to process user and group information, as well as process groups and sessions. Not all operations may be implemented on all UNIX flavours. Please see the UNIX manual for a description of these functions.

/* User/group-related functions (U). */

public extern setuid UID, setgid GID;   // set user/group id of process
public extern seteuid UID, setegid GID; // set effective user/group id
public extern setreuid RUID EUID, setregid RGID EGID;
                                        // set real and effective ids
public extern getuid, geteuid;          // get real/effective user id
public extern getgid, getegid;          // get real/effective group id
public extern getlogin;                 // get real login name

// get/set supplementary group ids of current process
public extern getgroups, setgroups GIDS;

/* Session-related routines (U). */

public extern getpgid PID, setpgid PID PGID;    // get and set process group
public extern getpgrp, setpgrp;                 // dito, for calling process
public extern getsid PID;                       // get session id of process
public extern setsid;                           // create a new session

Examples

Invoke the system function to execute a shell command:

==> import system

==> system "ls -l"

You can also run the program directly with the spawn function:

==> spawn "ls" ["ls","-l"]

Get and set an environment variable:

==> getenv "HOME"
"/home/ag"

==> getenv "MYVAR" // variable is undefined
getenv "MYVAR"

==> setenv "MYVAR" "foo bar"
()

==> getenv "MYVAR"
"foo bar"

Here are some examples demonstrating the use of named pipes and the process functions on UNIX systems. The mkfifo function allows the creation of so-called "FIFO special files" a.k.a. named pipes, which provide a simple inter-process communication facility.

For instance, create a named pipe as follows:

==> mkfifo "pipe" 0666
()

You can then open the writeable end of the pipe:

==> def OUT = fopen "pipe" "w"

Note that this call blocks until the input side of the pipe has been opened. For this purpose, start another instance of the interpreter (e.g., in another xterm), and from there open the pipe for reading:

==> def IN = fopen "pipe" "r"

Both fopen calls should now have finished, and you can write something to the output end of the pipe in the first instance of the interpreter:

==> fwrites OUT "Hello, there!\n"

Go to the other interpreter instance, and read back the string from there:

==> freads IN
"Hello, there!"

As usual, each end of the pipe is closed as soon as the corresponding file object is no longer accessible. When you close the writeable end of the pipe using, e.g., undef OUT in the first instance of the interpreter, the input side of the pipe will reach end-of-file, and thus feof IN will become true. After closing the pipe also on the input side, you can remove the FIFO special file with the unlink function.

You can also use named pipes to set up a communication channel to child processes created with fork. For instance:

import system;
def NAME = tmpnam;
def PIPE = mkfifo NAME 0666;
def MSG  = "Hello there!\n";

test     = printf "Parent writes: %s" MSG ||
           fwrites (fopen NAME "w") MSG ||
           writes "Parent waits for child ...\n" ||
           printf "Parent: child has exited with code %d\n" wait
               if fork > 0;
         = printf "Child reads: %s\n" (freads (fopen NAME "r")) ||
           writes "Child exiting ...\n" || exit 0
               otherwise;


==> test
Parent writes: Hello there!
Parent waits for child ...
Child reads: Hello there!
Child exiting ...
Parent: child has exited with code 0
()

==> unlink NAME
()

Another method for accomplishing this with anonymous pipes is discussed in Low-Level I/O.

On UNIX, it also possible to implement "daemons", i.e., processes which place themselves in the background and continue to run even when you log out. The following little script shows how to do this.

import system;

/* Becoming a daemon is easy: Just fork, have the parent exit, and call setsid
   in the child to start a new session. The new process becomes a child of the
   init process and has no controlling terminal. Thus it keeps running even if
   you log out, until it gets killed or the system shuts down. */

daemon          = setsid || main if fork = 0;
                = exit 0 otherwise;

/* The main code of the daemon then closes file descriptors inherited by the
   parent and starts executing. In this example we just open a logfile and
   start logging messages in regular intervals. We also handle the condition
   that we are terminated by a signal. */

main            = do close [0,1,2] || log F "daemon started" ||
                  do (trap 1) [SIGINT, SIGTERM, SIGHUP, SIGQUIT] ||
                  catch (sig F) (loop F) where F:File = fopen "log" "w";
                = perror "daemon" || exit 1 otherwise;

sig F (syserr SIG)
                = log F (sprintf "daemon stopped by signal %d" (-SIG)) ||
                  exit 0;

loop F          = sleep 5 || log F "daemon still alive" || loop F;

log F MSG       = fprintf F "%s at %s" (MSG, ctime time) || fflush F;

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.8 Low-Level I/O

These functions provide operations for direct manipulation of files on the file descriptor level. They are all in the system module. For a closer description of the following operations we refer the reader to the corresponding UNIX manual pages.

public extern open NAME FLAGS MODE;     // create a new descriptor
public extern close FD;                 // close a descriptor
public extern dup FD, dup2 OLDFD NEWFD; // duplicate a descriptor
public extern pipe;                     // create an unnamed pipe
public extern fstat FD;                 // stat descriptor
public extern fchdir FD;                // change directory (U)
public extern fchmod FD MODE;           // change file mode (U)
public extern fchown FD UID GID;        // set file ownership (U)
public extern ftruncate FD LEN;         // truncate a file (U)
public extern fsync FD, fdatasync FD;   // sync the given file (U)

The following operations can be used on both file descriptors and file objects. They read and write binary data represented as byte strings (see Byte Strings), providing an interface to the system's read/write(2) and fread/fwrite(3) functions. The bread function returns a byte string of the given size read from the given file. Note that the returned byte string may actually be shorter than SIZE bytes because, e.g., end of file has been reached or not enough input was currently available on a pipe. The bwrite function returns the number of bytes actually written which is usually the size of the byte string unless an error occurs. The functions fail if an error occurred before anything was read or written. It is the application's responsibility to check these error conditions and handle them in an appropriate manner.

public extern bread FD SIZE;            // read a byte string
public extern bwrite FD DATA;           // write a byte string

For instance, the following function uses bread and bwrite to copy an input to an output file, using chunks of 8192 bytes at a time:

fcopy F G               = () if bwrite G (bread F 8192) < 8192;
                        = fcopy F G otherwise;

The file pointer of a descriptor can be positioned with lseek. In difference to fseek, this function returns the new offset. To determine the current position you can hence use an expression like `lseek FD 0 SEEK_CUR'.

public extern lseek FD POS WHENCE;

Some terminal-related routines are also provided:

public extern isatty FD;                // is descriptor a terminal?

The following are UNIX-specific:

public extern ttyname FD;               // terminal associated with descriptor
public extern ctermid;                  // name of controlling terminal
public extern openpty, forkpty;         // pseudo terminal operations

The openpty function returns a pair (MASTER, SLAVE) of file descriptors opened for both reading and writing on a "pseudo terminal". MASTER is to be used in the controlling process, while SLAVE can be used for the standard I/O streams in a child process. The forkpty function combines openpty with fork and makes the slave device the controlling terminal of the child process; it returns a pair (PID, MASTER), where PID is zero in the child process and the process id of the child in the parent, and MASTER is the master end of the pseudo terminal to be used by the parent. These functions are commonly used to implement applications which drive other programs through a terminal emulation interface.

On UNIX systems, clib also provides access to the following fcntl operation (see Section 2 of the UNIX manual):

public extern fcntl FD CMD ARG;

The ARG parameter of fcntl depends on the type of command CMD which is executed. The available command codes and other relevant values are defined as global variables, as listed below. Flags are bitwise disjunctions of the symbolic values listed below. (The following are the values present on most systems. Specific implementations may provide additional flags.)

public var const
  // fcntl command codes
  F_DUPFD, F_GETFD, F_SETFD, F_GETFL, F_SETFL,
  F_GETLK, F_SETLK, F_SETLKW,

  // lock types
  F_RDLCK, F_WRLCK, F_UNLCK,

  // file access modes and access mode bitmask
  O_RDONLY, O_WRONLY, O_RDWR, O_ACCMODE,

  // file descriptor flags
  FD_CLOEXEC,

  // status flags
  O_CREAT, O_EXCL, O_TRUNC, O_APPEND, O_NONBLOCK, O_NDELAY, O_NOCTTY,
  O_BINARY;

The following types of commands are implemented:

fcntl FD F_DUPFD ARG                      // duplicate a file descriptor

fcntl FD F_GETFD ()                       // get file descriptor flags
fcntl FD F_SETFD FLAGS                    // set file descriptor flags

fcntl FD F_SETFD ()                       // get status flags/access mode
fcntl FD F_SETFD FLAGS                    // set status flags

fcntl FD F_GETLK (TYPE,POS,LEN[,WHENCE])  // query file lock information
fcntl FD F_SETLK (TYPE,POS,LEN[,WHENCE])  // set an advisory file lock
fcntl FD F_SETLKW (TYPE,POS,LEN[,WHENCE]) // blocking variant of F_SETLK

The first five commands serve to duplicate descriptors and to retrieve and change the file descriptor and status flags. The remaining commands are used for advisory file locking. A file lock is specified as a triple (TYPE, POS, LEN) or quadruple (TYPE, POS, LEN, WHENCE), where TYPE is the type of lock (F_RDLCK, F_WRLCK or F_UNLCK for read locks, write locks and unlocking, respectively), POS the position in the file, LEN the number of bytes to be locked (0 means up to the end of the file) and WHENCE specifies how the POS argument is to be interpreted. (This parameter has the same meaning as for the fseek and lseek functions, see above. If WHENCE is omitted, it defaults to SEEK_SET, i.e., absolute positions.) The value returned by the F_GETLK command is the lock description with TYPE set to F_UNLCK if the given lock would be accepted, and the description of a current lock blocking the lock request otherwise. (In the latter case the return value is actually a quadruple, with the id of a process currently owning a conflicting lock in the last component.)

Note that the standard I/O operations use buffered I/O by default which might interfere with record locking. Therefore in applications requiring individual record locking you should work with the low-level operations (open, bwrite, etc.) instead.

The following select function waits for a set of files to change I/O status. Note that this operation is available on Windows as part of the socket interface, but it only applies to sockets there.

public extern select FILES;

The input is a tuple (IN, OUT, ERR, TIMEOUT) consisting of three lists of file descriptors and/or file objects to be watched, and an optional integer or floating point value specifying a timeout in seconds. The function returns as soon as either a member of IN or OUT becomes ready for performing an I/O operation (without blocking), or an error condition is signaled for a member of ERR. The returned value is a triple (IN, OUT, ERR) with all the members of the original lists which are now ready for I/O. If the timeout is exceeded before any of the files has become ready, a triple of three empty lists is returned. If no timeout is specified then the function may block indefinitely.

Examples

These examples are mostly UNIX-specific, thus Windows users might wish to skip ahead.

Fcntl

The following definitions show how the fcntl function can be used to change a file's "non-blocking" flag. This is useful, e.g., if we want to read from standard input or a pipe but do not want to be blocked until input becomes available. Instead, having set the non-blocking flag, input operations will fail immediately if there is no input to be read right now.

/* set and clear the O_NONBLOCK flag of a file */
set_nonblock FD:Int     = fcntl FD F_SETFL (FLAGS or O_NONBLOCK)
                            where FLAGS = fcntl FD F_GETFL ();
clr_nonblock FD:Int     = fcntl FD F_SETFL (FLAGS and not O_NONBLOCK)
                            where FLAGS = fcntl FD F_GETFL ();

And here is how we can perform advisory locking on an entire file.

/* place an advisory read or write lock on an entire file (fail if error) */

rdlock FD:Int           = () where () = fcntl FD F_SETLK (F_RDLCK, 0, 0);
wrlock FD:Int           = () where () = fcntl FD F_SETLK (F_WRLCK, 0, 0);

/* remove the lock from the file */

unlock FD:Int           = () where () = fcntl FD F_SETLK (F_UNLCK, 0, 0);

/* predicates to check whether a read or write lock could be placed */

rdlockp FD:Int          = (LOCK!0 = F_UNLCK)
                            where LOCK:Tuple =
                              fcntl FD F_GETLK (F_RDLCK, 0, 0);
wrlockp FD:Int          = (LOCK!0 = F_UNLCK)
                            where LOCK:Tuple =
                              fcntl FD F_GETLK (F_WRLCK, 0, 0);

Note that to apply these functions to standard file objects you can use the fileno function (see Extended File Functions) as follows:

==> rdlock (fileno F)

Select

The select function accepts both files and file descriptors as input. Here is a way to test whether input is currently available from a file/descriptor:

avail F                 = not null (select ([F],[],[],0)!0);

This is useful, in particular, if the file is actually a pipe. For instance:

==> def F = popen "sleep 5; echo done" "r"

==> avail F   // no input to be read yet, wait ...
false

==> avail F   // ... input available now
true

==> fget F
"done\n"

However, most of the time select is used for multiplexing I/O operations. For instance, the following loop processes input from a set of files, one line at a time:

loop FILES              = loop (proc FILES F)
                            where ([F|_],_,_) = select (FILES,[],[])
                            if not null FILES;
                        = () otherwise;

proc FILES F            = // done with this file, get rid of it
                          filter (neq F) FILES
                            if feof F;
                        = // process a line
                          writes (fgets F) || FILES;

Anonymous Pipes

The pipe and dup2 operations provide a quick way to reassign input and output of a child process and connect it to corresponding file objects in the parent. For instance, here's how we can implement a popen2 function which works like the built-in popen routine, but allows to redirect both the input and output side of a child process:

import system;

/* Create two unnamed pipes, one for the parent to read and the child to
   write, the other one for the child to read and the parent to write. */

popen2 CMD      = spawn2 CMD (P_IN, P_OUT) (C_IN, C_OUT)
                    where (P_IN, C_OUT) = pipe, (C_IN, P_OUT) = pipe;

/* Fork the child and redirect its standard input and output streams to the
   child's ends of the pipe. This is accomplished with dup2 SRC DEST which
   closes the file descriptor DEST and then makes DEST a copy of SRC. In the
   parent we use fdopen to open two new file objects for the parent's ends of
   the pipes. */

spawn2 CMD (P_IN, P_OUT) (C_IN, C_OUT)
                = close P_IN || close P_OUT ||
                  dup2 C_IN (fileno INPUT) || dup2 C_OUT (fileno OUTPUT) ||
                  exec "/bin/sh" ["/bin/sh", "-c", CMD] 
                    if fork = 0;
                = close C_IN || close C_OUT ||
                  (fdopen P_IN "r", fdopen P_OUT "w")
                    otherwise;

The popen2 function employs fork and exec to spawn a child process which executes the given command using the shell, after redirecting the child's input and output to two pipes. In the parent process, the popen2 function returns a pair of files opened on the other ends of the child's descriptors.

The following piece of Q code shows how to apply the popen2 function defined above in order to pipe a string list into the UNIX sort program and construct the sorted list from the output:

mysort STRL     = do (fprintf OUT "%s\n") STRL || fclose OUT || digest IN
                    where (IN, OUT) = popen2 "sort";

digest IN       = [] if feof IN;
                = [freads IN|digest IN] otherwise;

Example:

==> mysort ["five","strings","to","be","sorted"]
["be","five","sorted","strings","to"]

==> wait // get exit code of child process (sort program)
(15804,0)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.9 Terminal Operations

The following (UNIX-specific) operations from the system module provide an interface to the POSIX termios interface. Terminal attributes are stored in a "termios" structure, represented as a 7-tuple (IFLAG, OFLAG, CFLAG, LFLAG, ISPEED, OSPEED, CC). The control character set CC is represented as a list of character numbers, indexed by the symbolic constants VEOF etc. See termios(3) for further details.

public extern tcgetattr FD;             // get terminal attributes
public extern tcsetattr FD WHEN ATTR;   // set terminal attributes

public extern tcsendbreak FD DURATION;  // send break
public extern tcdrain FD;               // wait until all output finished
public extern tcflush FD QUEUE;         // flush input or output queue
public extern tcflow FD ACTION;         // control input/output flow

public extern tcgetpgrp FD;             // get terminal process group
public extern tcsetpgrp FD PGID;        // set terminal process group

/* Access components of the termios structure. */

public c_iflag ATTR, c_oflag ATTR, c_cflag ATTR, c_lflag ATTR, 
  c_ispeed ATTR, c_ospeed ATTR, c_cc ATTR;

Example

This example shows how to use the termios functions to read a password from the terminal without echoing. This is an almost literal translation of the C program described in Richard Stevens: Advanced Programming in the UNIX Environment, Addison-Wesley, 1993, cf. p. 350. The main difference is that we merely ignore the SIGINT and SIGTSTP signals instead of blocking them (the latter is not supported by Q's trap builtin).

import system;

getpass PROMPT  = fwritec F "\n" || unprep F SAVE || PW
                    where F:File = fopen ctermid "r+", SAVE = prep F,
                      PW = fwrites F PROMPT || fflush F || freads F;

/* prep F: ignore SIGINT and SIGTSTP and prepare the terminal */

prep F          = tcsetattr (fileno F) TCSAFLUSH NATTR || (ATTR,TRAPS)
                    where TRAPS = map (trap SIG_IGN) [SIGINT,SIGTSTP],
                      ATTR = tcgetattr (fileno F),
                      (IF,OF,CF,LF,IS,OS,CC) = ATTR,
                      LF = LF and not (ECHO or ECHOE or ECHOK or ECHONL),
                      NATTR = (IF,OF,CF,LF,IS,OS,CC);

/* unprep F SAVE: revert to previous settings */

unprep F (ATTR,TRAPS)
                = tcsetattr (fileno F) TCSAFLUSH ATTR ||
                  zipwith trap TRAPS [SIGINT,SIGTSTP];

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.10 Readline Interface

The following functions from system.q provide a basic interface to the GNU readline library. While readline doesn't really belong to the POSIX interface, it is rather useful and the interpreter uses it anyway, so it makes sense to also provide it as a part of the system interface.

public extern readline PROMPT, rl_line_buffer;
public extern add_history LINE, stifle_history MAX;
public extern read_history FNAME, write_history FNAME;

These functions work like their C counterparts. The readline function prompts with the given string and reads an input line, providing editing and history facilities. Basic bash-like filename completion is also provided. The rl_line_buffer function returns the text of the current input line, which is useful, e.g., if a custom completion function (see below) needs to inspect surrounding context.

The add_history routine adds a line to the history. The maximum size of the history can be set with stifle_history, which also returns the size which was set previously. A negative MAX value makes the size of the history unbounded (which is also the default). Last but not least, the history can be read from and written to a file with the read_history and write_history functions.

For instance, the following commands read an input string, add it to the history, and finally save the history to a file:

==> import system

==> readline "input> "
input> foo bar
"foo bar"

==> add_history _
()

==> write_history "myhistory"
()

As another example, here is the definition of a little convenience function which reads a line using readline and enters it into the history if it is nonempty:

get_line PROMPT = if not null LINE then add_history LINE || LINE
                    where LINE:String = readline PROMPT;

Readline's standard filename completion facility can be augmented with a custom completion function. This is achieved by simply setting the following global variable to the desired function:

public var RL_COMPLETION_FUNCTION;

The completion function is invoked with two arguments, the string to be completed and the position index of that string in the input line (rl_line_buffer, see above), and is expected to return a string list with the possible completions. For instance, here is a simple example of a function which checks a list of "command" words for possible completions:

def RL_COMPLETION_FUNCTION = complete;
complete S _    = filter (is_prefix S) ["bar","foo","gnats","gnu"];
is_prefix X Y   = (X=sub Y 0 (#X-1));

The position index argument is useful if the completion depends on the position inside the input line. For instance, the following function only attempts completion at the beginning of the input line:

complete S 0    = filter (is_prefix S) ["bar","foo","gnats","gnu"];

Readline's default behaviour is to try a custom completion first, if it is available, and to fall back to the standard filename completion otherwise. The latter can be suppressed by ending the list of completions with a () entry:

complete S 0    = filter (is_prefix S) ["bar","foo","gnats","gnu"] ++ [()];
complete _ _    = [()] otherwise;

The text units for which readline attempts completion are the "words" of the input line. There is a `RL_WORD_BREAK_CHARS' variable which allows you to change readline's idea of what a word is, by setting the variable to a string containing the word delimiter characters. The default definition of this variable is as follows:

public var RL_WORD_BREAK_CHARS = " \t\n\"\\'`@$><=;|&{(";

Redefining RL_WORD_BREAK_CHARS affects both readline's cursor movement commands and the behaviour of the completion routine. For instance, to turn the comma into a word delimiter, you can change RL_WORD_BREAK_CHARS as follows:

def RL_WORD_BREAK_CHARS = RL_WORD_BREAK_CHARS++",";

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.11 System Information

The functions described here are all in the system module. Error codes from various system operations can be retrieved with the following functions:

public extern errno, seterrno N;        // get/set last error code
public extern perror S, strerror N;     // print error message

The perror function is commonly used to report error conditions in system operations on the standard error file. For instance:

==> fopen "/etc/passw" "r"
fopen "/etc/passw" "r"

==> perror "fopen"
fopen: No such file or directory
()

If more elaborate formatting is required then you can use strerror on the errno value to obtain the error message as a string:

==> fprintf ERROR "fopen returned message `%s'\n" (strerror errno)
fopen returned message `No such file or directory'
()

Note that errno is only set when an error occurs in a system call. You can use seterrno to reset the errno value before a system operation to check whether there actually was an error while executing the system call:

==> seterrno 0
()

==> fopen "/etc/passwd" "r"
<<File>>

==> perror "fopen"
fopen: Success
()

The remaining operations are used to obtain various information about the system and its information databases. For instance, the uname operation returns a 5-tuple containing information identifying the operating system (this operation is generally only available on UNIX-like systems):

public extern uname;

/* Access components of uname result. */

public un_sysname UNAME, un_nodename UNAME, un_release UNAME,
  un_version UNAME, un_machine UNAME;

The hostname of the system can be retrieved with the gethostname function.

public extern gethostname;

The password and group database can be accessed with the following functions. Password entries are encoded as 7-tuples (NAME, PASSWD, UID, GID, GECOS, DIR, SHELL), group entries as 4-tuples (NAME, PASSWD, GID, MEMBERS). This information is only available on UNIX-like systems.

public extern getpwuid UID, getpwnam NAME;      // look up a password entry
public extern getpwent;                         // list of all pw entries
public extern getgrgid GID, getgrnam NAME;      // look up group entry
public extern getgrent;                         // list of all group entries

/* Access components of password and group structures. */

public pw_name PW, pw_passwd PW, pw_uid PW, pw_gid PW, pw_gecos PW,
  pw_dir PW, pw_shell PW;

public gr_name GR, gr_passwd GR, gr_gid GR, gr_members GR;

Moreover, the crypt function can be used to perform UNIX password encryption (see crypt(3) for details).

public extern crypt KEY SALT; // (U)

The following functions can be used to query host information as well as the network protocols and services available on your system. This information is closely related to the socket interface described in Sockets. For a closer description of these operations we refer the reader to the corresponding manual pages. Note that the gethostent, getprotoent and getservent operations are not available on Windows.

The host database: Host entries are of the form (NAME, ALIASES, ADDR_TYPE, ADDR_LIST), where NAME denotes the official hostname, ALIASES its alternative names, ADDR_TYPE the address family and ADDR_LIST the list of addresses.

public extern gethostbyname HOST, gethostbyaddr ADDR;
public extern gethostent; // (U)

public h_name HENT, h_aliases HENT, h_addr_type HENT, h_addr_list HENT;

Note that both hostnames and IP addresses are specified as strings. Hostnames are symbolic names such as "localhost", and can also have a domain name specified, as in "www.gnu.org". IPv4 addresses use the well-known "numbers-and-dots" notation, like the loopback address "127.0.0.1". IPv6 addresses are usually written as eight 16-bit hexadecimal numbers that are separated by colons; two colons are used to abbreviate strings of consecutive zeros. For example, the IPv6 loopback address "0:0:0:0:0:0:0:1" can be abbreviated as "::1".

The protocol database: Protocol entries are of the form (NAME, ALIASES, PROTO) denoting official name, aliases and number of the protocol.

public extern getprotobyname NAME;
public extern getprotobynumber PROTO;
public extern getprotoent; // (U)

public p_name PENT, p_aliases PENT, p_proto PENT;

The service database: Service entries are of the form (NAME, ALIASES, PORT, PROTO) denoting official name, aliases, port number and protocol number of the port. The NAME argument of getservbyname can also be a pair (NAME, PROTO) to restrict the search to services for the given protocol (given by its name). Likewise, the PORT argument can also be given as (PORT, PROTO).

public extern getservbyname NAME;
public extern getservbyport PORT;
public extern getservent; // (U)

public s_name PENT, s_aliases PENT, s_port PENT, s_proto PENT;

For instance, here is some information retrieved from a typical Linux system:

==> import system

==> uname
("Linux","obelix","2.4.19-4GB","#2 Tue Mar 4 16:03:51 CET 2003","i686")

==> gethostname
"obelix"

==> gethostbyname gethostname
("obelix.local",["obelix"],2,["127.0.0.2"])

==> gethostbyaddr "::1"
("localhost",["ipv6-localhost","ipv6-loopback"],10,["::1"])

==> getprotobyname "tcp"
("tcp",["TCP"],6)

==> getservbyname "ftp"
("ftp",[],5376,"tcp")

==> getpwuid getuid
("ag","x",500,100,"Albert Gräf","/home/ag","/bin/bash")

==> getgrgid getgid
("users","x",100,[])

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.12 Sockets

The following functions from the system module are available on systems providing a BSD-compatible socket layer, which, besides BSD, includes BEOS, Linux, OSX, Windows and most recent System V flavours. Sockets provide bidirectional communication channels on the local machine as well as across the network. Sockets are represented by file descriptors which can be written to and read from with the send and recv functions. On most systems, socket descriptors are just ordinary file descriptors which can also be used with low-level I/O functions and fdopen as usual. However, on some systems (in particular, BEOS and Windows), socket descriptors are "special" and all socket I/O must be performed with the special socket operations.

At creation time, a socket is described by the following attributes (see socket(2) for more details):

Address family (sometimes also referred to as namespace, protocol family or domain). The adress families supported by this implementation are AF_LOCAL (a.k.a. AF_UNIX a.k.a. AF_FILE), AF_INET and AF_INET6. Not all protocol families may be available on all systems (e.g., the Windows socket library only supports AF_INET).
Communication style (also called the socket type). One distinguishes between "connection-based" sockets which are used to establish client/server connections, and "connectionless" sockets which allow data to be received from and sent to arbitrary addresses. The socket types supported by this implementation are SOCK_STREAM, SOCK_DGRAM, SOCK_SEQPACKET, SOCK_RAW and SOCK_RDM. Among these, SOCK_STREAM, SOCK_SEQPACKET and SOCK_RDM are connection-based. Please note that not all socket types are supported for all protocol families, and some socket types may be entirely missing on non-UNIX systems.
Protocol: For each address a.k.a. protocol family and socket type there may be a number of different protocols available. For the AF_LOCAL namespace, which refers to the local filesystem, the protocol is always 0, the default protocol. The available protocols for the internet namespaces can be retrieved from the protocol database (see System Information).

Before another process can connect to a socket it must also be bound to an address. The address format depends on the address family of the socket. For the local namespace, the address is simply a filename on the local filesystem. For the IPv4 namespace, it is a pair (HOST, PORT) where HOST denotes the host name or IP address, specified as a string, and PORT is a port number. For the IPv6 namespace, it is a quadruple (HOST, PORT, FLOWINFO, SCOPEID). Host names and known port numbers can be retrieved from the host and service databases (see System Information).

The following operations are provided to create a socket, or a pair of connected sockets, for the given address family, socket type and protocol. They return the file descriptor of the socket (or a pair of file descriptors).

public extern socket FAMILY TYPE PROTO;
public extern socketpair FAMILY TYPE PROTO; // (U)

The shutdown function terminates data transmission on a socket. You can stop reading, writing or both, depending on whether HOW is SHUT_RD, SHUT_WR or SHUT_RDWR. Note that this operation does not close the socket's file descriptor; for this purpose closesocket is used (see below).

public extern shutdown SOCKET HOW;

The closesocket function closes a socket. On most systems this is just identical to close (see Low-Level I/O), but, as already noted, on some systems socket descriptors are special and you must use this function instead.

public extern closesocket SOCKET;

The bind function binds a socket to an address. This is also done automatically when the socket is first used. However, if the socket has to be found by another process you'll have to explicitly specify an address for it. The bind function does just that.

public extern bind SOCKET ADDR;

The following operations are used to start listening for and accept connection requests on a socket. These operations are used on the server side of a connection-based socket. The argument N of listen denotes the maximum number of pending connection requests for the server. After the call to listen, the server can accept connections from a client with the accept function, which returns a pair (SOCKET, ADDR), where SOCKET is a new socket connected to the client, and ADDR is the client's address.

public extern listen SOCKET N;
public extern accept SOCKET;

The connect function is used to initiate a connection on a socket. This function can be used on both connection-based and connectionless sockets. In the former case, connect can only be invoked once. In the latter case, it can be invoked multiple times, and sets the remote socket for subsequent send and receive operations.

public extern connect SOCKET ADDR;

The following routines retrieve information about a socket. The local address of a socket and the address of the remote socket it is connected to can be retrieved with getsockname and getpeername. Socket options, specified using a protocol level LEVEL and an option index OPT, can be queried and changed with getsockopt and setsockopt. The option values are encoded as byte strings (cf. Byte Strings). For a description of the available options see getsockopt(2).

public extern getsockname SOCKET;
public extern getpeername SOCKET;
public extern getsockopt SOCKET LEVEL OPT;
public extern setsockopt SOCKET LEVEL OPT VAL;

Finally, the following specialized I/O functions are used to transmit data over a socket. All data is encoded as byte strings. The receive operations return the received data (which may be shorter than the requested size, if not enough data was currently available), the send operations the number of bytes actually written. For recvfrom/sendto the data is encoded as a pair (ADDR, DATA) which includes the source/destination address; these operations are typically used for connectionless sockets. The FLAGS argument is used to specify special transmission options (see the MSG_* constants at the beginning of clib.q).

public extern recv SOCKET FLAGS SIZE, send SOCKET FLAGS DATA;
public extern recvfrom SOCKET FLAGS SIZE, sendto SOCKET FLAGS DATA;

Example

The following script demonstrates how we can implement a connectionless server in the IPv4 namespace which repeatedly accepts a request from a client and sends back an answer. In this example the requests are strings denoting Q expressions; the server evaluates each expression and sends back the result as a string. The client reads input from the user, transmits it to the server and prints the received answer. Note that the transmitted strings are represented as byte strings, as required by the recvfrom and sendto operations. The bytestr and bstr functions are used to convert between character and byte strings, see Byte Strings.

import system;

def BUFSZ = 500000; // buffer size

/* the server: receive messages, evaluate them as Q expressions, and
   send back the results */

def SERVER = ("localhost",5001); // the server address

server          = server_loop FD
                    where FD:Int = socket AF_INET SOCK_DGRAM 0,
                      () = bind FD SERVER;
                = perror "server" otherwise;

server_loop FD  = sendto FD 0 (ADDR,eval MSG) || server_loop FD
                    where (ADDR,MSG) = recvfrom FD 0 BUFSZ;
                = server_loop FD otherwise;

/* evaluate an expression encoded as a byte string, catch syntax
   errors and exceptions, convert result back to a byte string */

eval MSG        = catch exception (bytestr (str VAL))
                    where 'VAL = valq (bstr MSG);
                = bytestr ">>> SYNTAX ERROR" otherwise;
exception _     = bytestr ">>> ABORTED";

/* the client: read input from user, send it to the server, print
   returned result */

def CLIENT = ("localhost",5002); // the client address

client          = client_loop FD
                    where FD:Int = socket AF_INET SOCK_DGRAM 0,
                      () = bind FD CLIENT;
                = perror "client" otherwise;

client_loop FD  = sendto FD 0 (SERVER,bytestr MSG) ||
                  printf "%s\n" (bstr (recv FD 0 BUFSZ)) || client_loop FD
                    if not null MSG
                    where MSG:String = writes "\nclient> " || flush || reads;
                = () otherwise;

For instance, we can invoke the server in a secondary thread and then execute the client as follows:

==> def S = thread server

==> client

client> prd [1..50]
30414093201713378043612608166064768844377641568960512000000000000

client> 1+)
>>> SYNTAX ERROR

client> quit
>>> ABORTED

client>

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.13 POSIX Threads

On systems where the POSIX threads library or some compatible replacement is available (this includes Windows and most modern UNIXes), clib provides functions for handling multiple threads of control. Threads, a.k.a. "light-weighted processes", allow you to realize "multithreaded scripts" consisting of different tasks which together perform some computation in a distributed manner. All tasks are executed concurrently. Thus you can, e.g., perform some lengthy calculation in a background task while you go on evaluating other expressions in the interpreter's command loop. You can also have tasks communicate via mutexes, conditions and semaphores.

The operations described in this section (which are all contained in clib and thus included in the prelude) are in close correspondence with POSIX 1003.1b. However, some operations are named differently, and semaphores provide the extra functionality of sending data from one thread to another. Mutexes are also supported, mostly for the purpose of handling critical sections involving operations with side-effects (I/O etc.). Mutexes are not required to make conditions work since these have their own internal mutex handling. For more information on POSIX threads, please refer to the corresponding section in the UNIX manual.

Please note that these functions will only work as advertised if the interpreter has been built with POSIX thread support. Moreover, in the current implementation the interpreter effectively serializes multithreaded scripts on the reduction level and thus user-level threads cannot really take advantage of multi-processor machines.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.13.1 Thread Creation and Management

Clib threads are represented using handles (objects of type Thread). Note that a thread is canceled automatically as soon as its handle is garbage collected, thus you should keep the handle around as long as the thread is needed. For convenience, thread handles are numbered arbitrarily, starting at 0 which denotes the main thread, and are ordered by the thread numbers. This is handy, e.g., if you want to use thread handles as indices in a dictionary.

public extern type Thread;              // thread handle type

public isthread THREAD;                 // check for thread objects

public extern thread_no THREAD;         // thread number
public extern this_thread;              // handle of the current thread

The basic thread operations are listed below. The thread function starts evaluating its special argument in a new thread, and returns its handle. You can wait for a thread to terminate and obtain the evaluated result with the result function. (If there is no result, because the thread has been canceled, or was aborted with halt, quit or a runtime error, result fails.) Note that halt or quit in a thread which is not the main thread only terminates the current thread; however, the exit function, cf. Process Control, always exits from the interpreter. You can also terminate the current thread immediately and return a given value as its result with the return function; in the main thread, this function is equivalent to halt and the return value is ignored. Moreover, all threads except the main thread can also be canceled from any other thread using the cancel function. Finally, the yield function allows the interpreter to switch threads at any given point (normally the interpreter will only switch contexts in certain builtins and when a new rule is activated).

public extern special thread X;         // start new thread
public extern return X;                 // terminate thread with result X
public extern cancel THREAD;            // cancel THREAD
public extern result THREAD;            // wait for THREAD, return result
public extern yield;                    // allow context switch

Clib threads always use deferred cancellation, hence thread cancellation requests are usually not honored immediately, but are deferred until the thread reaches a cancellation point where it is safe to do so. Cancellation points occur at certain C library calls listed in the POSIX threads documentation, when a new equation is activated in the Q interpreter, and when yield is called.

You can also check whether a thread is still active or has been canceled. If neither condition holds, then the thread has already been terminated and you can obtain its result with the result operation.

public extern active THREAD;            // check if THREAD is active
public extern canceled THREAD;          // check if THREAD was canceled

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.13.2 Realtime Scheduling

Threads are always created with the default "non-realtime" scheduling policy. On some systems it is also possible to increase a thread's priority and have it scheduled in "realtime". This is useful for tasks with strict timing and responsiveness requirements, such as a function performing recording or playback in a multimedia application.

The scheduling policy and the priority of a thread can be changed with the following function which takes two extra arguments, the policy POL (0 = default, 1 = realtime round-robin, 2 = realtime fifo scheduling) and the priority PRIO (where 0 denotes the default priority). Note that on most systems realtime scheduling will only be granted to privileged processes.

public extern setsched THREAD POL PRIO; // set scheduling parameters

The current scheduling policy and priority of a thread can be retrieved with the getsched function:

public extern getsched THREAD;          // get scheduling parameters

The actual range of priority values is system-specific. On a typical UNIX system a non-realtime thread can only have priority 0, while realtime threads always have positive priorities. Higher priority threads always take precedence over lower priority ones. All else being equal, threads using round-robin scheduling are each given their "fair" timeslice, while fifo threads are handled on a "first come first served" basis. The latter can only be interrupted by higher priority processes (and signals) and thus should be used with utmost care. In any case you must make sure that a high-priority thread does not run unsuspended for extended periods of time, otherwise it might lock up your system. Thus a realtime thread should typically spend most of its time "in limbo" where it waits for input to arrive or a condition to be signaled.

Assuming that the priority value ranges are contiguous and contain either 0 or 1, you can determine the ranges for your system with the following little script:

test POL PRIO   = setsched this_thread 0 0 || true
                    where () = setsched this_thread POL PRIO;
                = false otherwise;

priotest POL    = (hd L,last L) if not null L
                    where L = reverse (while (test POL) pred 0) ++
                      while (test POL) succ 1;

Here is a sample result obtained on Linux:

==> map priotest [0,1,2]
[(0,0),(1,99),(1,99)]

Realtime Scheduling under Windows

Under Windows, things are a bit different, since Windows does not really have POSIX-compatible thread scheduling. Instead, processes have "priority classes" while threads have "priorities" which are interpreted in the context of the priority class of the process they belong to. Therefore under Windows the POL argument of the setsched operation is actually interpreted as a priority class for the entire process while the PRIO argument specifies the priority of the individual thread. Clib keeps track of all setsched calls and always lets the process have the highest priority class specified for a thread which is still active.

To these ends, the policy values 0, 1 and 2 are mapped to the priority classes "normal", "high" and "realtime". (The latter should be used sparingly, if at all, in a Q script, because it makes Windows very unresponsive.) Moreover, the setsched function also accepts a policy value of -1 to denote "idle time" processes. (You'll rarely have any use for this, unless you want to write a screensaver in Q.) For each policy, the possible priority values have a range from -3 to 3, where -3 denotes idle time threads, 0 is the normal priority, and 3 is the highest priority to be used for "time-critical" threads. The remaining priority levels provide some additional amount of control over which thread gets the bone first.

While some "special effects" can be achieved with Windows' more exotic priority values, for typical usage a policy of 0 with zero priority should be employed for ordinary threads, a policy of 1 with some (small) positive priority value for a thread with moderate realtime requirements, and a policy of 2 if time is very critical. If you follow these conventions then your script will be able to run under Windows and most UNIX systems unchanged.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.13.3 Mutexes

Clib mutexes come in three flavours: fast (no error checking), error checking (fail if the current thread already holds a lock on the mutex) and recursive (the same thread may lock the mutex multiple times, and the same number of unlock operations is required to unlock the mutex again). The supported operations are lock (wait for the mutex to be unlocked, then lock it and return ()), unlock (unlock the mutex, return ()) and try (lock the mutex if it is available, fail otherwise). As of Q 7.11, on systems which support this functionality, the try operation can also be used with an argument of the form (MUTEX, TIME) to denote an absolute timeout value. This works analogously to the await operation on conditions, see Conditions, for details.

CAVEAT: These operations are dangerous, as you can easily create deadlocks which might lock up the interpreter as well. Note that a deadlock will also prevent the involved threads from being canceled since, in accordance with the POSIX standard, the wait for a mutex lock is not a cancellation point. Thus these operations should be used with care.

Since Q has no mutable variables and the interpreter's builtins are all thread-safe, mutexes are actually used much less in Q than in other, procedural languages. They are most useful for protecting critical sections in which a sequence of operations with side-effects (such as I/O) is to be carried out in an atomic fashion. This can be done by locking a mutex at the beginning and unlocking it at the end of the sequence. Such usage is safe, provided that no operation in the sequence may block for an extended or even indefinite period of time.

public extern type Mutex;               // mutex type

public ismutex MUTEX;                   // check for mutex objects

public extern mutex;                    // standard (fast) mutex object
public extern errorchecking_mutex;      // error checking mutex object
public extern recursive_mutex;          // recursive mutex object

public extern lock MUTEX;               // lock MUTEX
public extern unlock MUTEX;             // unlock MUTEX
public extern try MUTEX;                // try MUTEX, (MUTEX,TIME) for timeout

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.13.4 Conditions

Clib conditions support the following operations: signal (wake up one thread waiting for the condition), broadcast (wake up all threads waiting for the condition) and await (suspend the current thread until the condition is sent). Each of these operations returns (). The await operation can also be invoked with a (COND, TIME) tuple to denote a timed wait. If the wait times out or is interrupted by a signal then the operation fails. Note that in accordance with the POSIX standard the TIME value denotes an absolute time (integer or floating point value in seconds since the "epoch", see also the description of the built-in time function in Miscellaneous Functions).

public extern type Condition;           // condition type

public iscondition COND;                // check for condition objects

public extern condition;                // new condition object

public extern signal COND;              // signal COND
public extern broadcast COND;           // broadcast COND
public extern await COND;               // wait for COND, or (COND,TIME)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.13.5 Semaphores

Clib semaphores are in fact semaphore queues which can be used as a a communication channel to pass values between different threads using a FIFO discipline. In extension to the POSIX 1003.1b semaphore operations, clib also provides support for bounded semaphores which are created with a positive limit on the number of values the semaphore queue may hold at any time. When a new value is posted to a bounded semaphore, the current thread is suspended until the queue has room to receive a new value, if necessary.

The supported operations are post (enqueue a value), get (dequeue a value and return it, as soon as one is available), try (non-blocking version of get which fails if no value is currently available), get_size or # which both return the current queue size, and get_bound which returns the queue size limit of a bounded semaphore (or zero if the semaphore is unbounded). As of Q 7.11, on systems which support this functionality, the try operation can also be used with an argument of the form (SEM, TIME) to denote an absolute timeout value. This works analogously to the await operation on conditions, see Conditions, for details.

Note that even with unbounded semaphores the maximum semaphore size is actually limited by the operating system, see your local POSIX threads documentation for details. If this implementation-specific limit is exceeded, the attempt to post a new value raises a syserr 9 exception.

public extern type Semaphore;           // semaphore type

public issemaphore SEM;                 // check for semaphore objects

public extern semaphore;                // semaphore object
public extern bounded_semaphore MAX;    // bounded semaphore object

public extern post SEM X;               // enqueue a value
public extern get SEM;                  // dequeue a value
public extern try SEM;                  // try SEM, (SEM,TIME) for timeout

public extern get_size SEM;             // get the current queue size
public extern get_bound SEM;            // get the max queue size (0 if none)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.13.6 Threads and Signals

Some remarks about the interaction between clib threads and the interpreter's signal handling are in order. It is a well-known fact that threads and signals generally do not mix very well. Therefore, in the current implementation, all signal handling is actually performed in the interpreter's main thread. This is where you should set up your signal handlers using the trap and catch builtins, as explained in Exception Handling. If necessary, the main thread of your script can inform other threads about received signals using the thread synchronization functions described above.

However, the interpreter still allows you to set up traps in secondary threads and keeps track of the configured traps separately for each thread. The traps in secondary threads will not have any effect until such a thread forks (see Process Control). In this case the forking thread will be the one and only thread in the child process and becomes the main thread. At this point, it takes on the signal handling, and any traps which have been set up before become active. (Note that in order to make this work reliably, you should configure the traps before you call fork and protect the call to fork with an enclosing catch. Otherwise a signal might arrive in the time between the fork and the next function call, causing the new process to exit before it had a chance of setting up its own signal handling.)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.13.7 Thread Examples

As already mentioned, threads allow you to perform a lengthy calculation as a background task. You can start such a task simply as follows:

==> def TASK = thread (sum [1..1000000])

==> // do some other work ...

==> result TASK // get the result
500000500000

==> stats all
thread #0: 0 secs, 1 reduction, 0 cells
thread #1: 4.3 secs, 2000004 reductions, 2000007 cells

As indicated, if the stats command is invoked with the parameter all then it also shows the statistics of background threads once they are finished. To release all resources associated with a thread (and also remove it from the stats all list) you must undefine all variables referencing the thread handle:

==> undef TASK

Conditions are useful when a thread has to wait for some condition before proceeding with a computation. E.g.:

==> def COND = condition, TASK = thread (await COND || writes "Got it!\n")

==> signal COND
()
Got it!

Semaphores are used when expression values have to be passed from one thread to another. For instance, let us rewrite the backtracking algorithm for the N queens problem (see Exception Handling) such that it runs as a background task which returns results via a semaphore instead of printing them directly on the terminal:

def RES = semaphore; // semaphore used to transmit results

queens N        = thread (search N 1 1 [] || post RES ());

search N I J P  = post RES P if I>N;
                = search N (I+1) 1 (P++[(I,J)]) || fail if safe (I,J) P;
                = search N I (J+1) P if J<N;
                = () otherwise;

Note that we added a post RES () to signal that all solutions have been found. The safe function is as in Exception Handling. In order to limit the size of the RES semaphore queue, we could also use a bounded semaphore instead, e.g.:

def RES = bounded_semaphore 10;

We can use a second thread to process the semaphore queue and print solutions as they become available:

print           = thread (loop print1 RES);

print1 ()       = return (); // no more results
print1 P        = write P || writes "\n" otherwise;

/* iterate a function over a semaphore */

loop F SEM      = F (get SEM) || loop F SEM;

To use this program, simply start the two background tasks as follows:

==> def QUEENS = queens 8, PRINT = print

Both threads will begin to execute immediately, and you will see the results scroll by. Once the threads are finished, you can check the resources used by each thread with the stats all command:

==> stats all
thread #0: 0 secs, 2 reductions, 2 cells
thread #1: 1.71 secs, 700451 reductions, 943 cells
thread #2: 0 secs, 461 reductions, 3 cells

Also note that here we used again a def statement to assign the thread handles to corresponding variables. It is important that you keep these variables as long as you want the threads to survive. The interpreter automatically cancels a thread as soon as the corresponding thread handle is garbage collected. Hence you can stop the threads at any time by simply undefining the variables using, e.g., the clear command.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.14 Expression References

These are in clib and thus included in the prelude. Expression references work like pointers to expression values. The reference is initialized with the ref function, giving the initial value as an argument. The referenced value can then be changed with the put function and retrieved with get. Also, as of Q 7.8 an "assignment" operator `:=' is provided as syntactic sugar for put.

public extern type Ref;                 // reference type

public isref REF;                       // check for reference objects

public extern ref X;                    // initialize a reference object

public extern put REF X;                // store a new value
public extern get REF;                  // retrieve the current value

public (:=) X Y @ (=);                  // assignment operator

NOTE: References can point to arbitrary values, in particular they can also point to other references. However, cyclic chains of references should be avoided since in the current implementation the interpreter cannot garbage-collect such structures.

References can be used to implement mutable data structures. This is particularly useful if a function has to maintain some internal state to perform its calculations. For instance, here is a function which memoizes past function calls in a hash table so that each requested value only has to be computed once:

def HASH = ref emptyhdict;

public hashed F X;
hashed F X      = get HASH!(F,X) if member (get HASH) (F,X);
                = put HASH (update (get HASH) (F,X) Y) || Y
                    where Y = F X otherwise;

Using the hashed function, a hashed version of the recursive Fibonacci function can be implemented as follows. Since no value of the function is computed more than once, the definition works in linear time (up to logarithmic factors for the table updates and lookups). In fact, even an expression like map fib [0..N] will only require a linear number of table manipulations and other operations.

fib             = hashed hfib;

hfib 0          = 0;
hfib 1          = 1;
hfib N          = fib (N-1)+fib (N-2) if N>1;

References also provide a general means for communicating values in a multithreaded script (see POSIX Threads). In this case they are typically used together with a condition which is signaled when the value changes:

def VAL_CHANGED = condition, VAL = ref ();
change X        = put VAL X || broadcast VAL_CHANGED;

A loop like the following might then be executed by any number of other threads which have to keep track of the value in question, and perform some appropriate action when the value changes:

watch           = await VAL_CHANGED || action (get VAL) || watch;

Sentinels

As of Q 7.1, there is a a second, lazy kind of expression references called sentinels. Sentinels defer the evaluation of the referenced expression until the sentinel is garbage-collected. They are created with the sentinel function which takes the expression to be evaluated later as a special argument. There are no other access operations.

public extern type Sentinel;            // sentinel type

public issentinel S;                    // check for sentinel objects

public extern special sentinel X;       // create a sentinel object

Sentinels are a means to trigger automatic cleanup of an ordinary Q data structure in the same fashion as the cleanup performed by some built-in and external data types. Example:

==> def X = sentinel (puts "cleaning up...\n")

==> undef X
cleaning up...

Mutable Sequences

An important application of references are mutable sequences which can be implemented as tuples, lists or streams of references. As of version 7.8, Q provides an auxiliary module reftypes.q which implements some useful convenience functions for working with these kinds of objects. Please note that this module is still experimental, and has to be imported explicitly right now as it is not yet included in the prelude.

The following operations are provided to construct a reference tuple, list or stream from a tuple, list or stream of its values, respectively.

public reftuple Xs, reflist Xs, refstream Xs;

The following functions work like the standard library functions mktuple, mklist and mkstream, but construct a sequence of references all initialized to the given value instead.

public mkreftuple X N, mkreflist X N, mkrefstream X;

All of these are also provided as "2D" variations to deal with matrix-like sequences:

public reftuple2 Xs, reflist2 Xs, refstream2 Xs;
public mkreftuple2 X NM, mkreflist2 X NM, mkrefstream2 X;

Note that mkreftuple2 and mkreflist2 take a pair of dimensions (N,M) as the second parameter to denote the desired number of rows N and columns M.

The module also provides the following pseudo type-checking functions. Please note that, for efficiency, these only look at the first component to see whether it is a reference (or reference sequence).

public isreftuple Xs, isreflist Xs, isrefstream Xs;
public isreftuple2 Xs, isreflist2 Xs, isrefstream2 Xs;

Moreover, the module overloads the get and put functions so that they work with an entire sequence at once. You can apply get to a tuple, list or stream of references to obtain the tuple, list or stream of its values, respectively, and you can invoke put Xs Ys to set the references in Xs to the corresponding values in the sequence Ys. This also works for nested sequences such as a list of lists of references.

A dereferencing index operator Xs!!I is provided as an abbreviation for get (Xs!I):

public (!!) Xs I @ (!);

The fill function fills references with a given value, and putmap implements a destructive variation of the standard library map function:

public fill Y X, putmap F X;

These both work with either individual references or nested sequences of references. Note that putmap F X applied to a reference X updates X with F (get X), whereas fill just fills references with a constant value. (Thus fill X does exactly the same as putmap (cst X), but is implemented directly for efficiency.) CAVEAT: Please note that fill takes the fill value as the first parameter. This allows for easier currying. For instance, you can quickly define yourself a function to zero out a sequence as follows:

zero = fill 0;

The following examples show most of the operations in action, taking a reference list as an example.

==> import reftypes             // needs to be imported explicitly

==> def Xs = reflist [1..10]    // create a new reference list

==> get (Xs!5)                  // get a member
6

==> Xs!!5                       // syntactic sugar for the above
6

==> put (Xs!5) 77               // change a member
()

==> Xs!5 := 77                  // syntactic sugar for the above
()

==> Xs!!5
77

==> get Xs                      // get all members
[1,2,3,4,5,77,7,8,9,10]

==> putmap (*2) Xs              // destructive map
()

==> get Xs
[2,4,6,8,10,154,14,16,18,20]

==> fill 0 Xs                   // fill with given value
()

==> get Xs
[0,0,0,0,0,0,0,0,0,0]

==> Xs := [1..10]               // update all values at once
()

==> get Xs
[1,2,3,4,5,6,7,8,9,10]

Tuples and streams work in an analogous fashion. Note that reference streams are always fully memoized so that the reference values are properly remembered when the stream is traversed for inspection and updates. Infinite streams work, too, since only the finite part of the stream which is accessed in the program ever gets evaluated. You must take care, however, that you don't apply fill, put or putmap (which are strict operations) to the entire stream, but only to a finite subsection of the stream. For instance:

==> def Xs = refstream {1..}

==> Xs!!5
6

==> Xs!5 := 77
()

==> Xs!!5
77

==> def Ys = take 10 Xs // apply putmap only to this finite subsequence

==> strict $ get Ys
{1,2,3,4,5,77,7,8,9,10}

==> putmap (*2) Ys
()

==> strict $ get Ys
{2,4,6,8,10,154,14,16,18,20}

==> Xs!!5
154

Note that at this point only the first 10 members of the stream have actually been evaluated, the rest of the stream is still "thunked":

==> Xs
{<<Ref>>,<<Ref>>,<<Ref>>,<<Ref>>,<<Ref>>,<<Ref>>,<<Ref>>,<<Ref>>,<<Ref>>,
<<Ref>>,ref 11|lazy (map ref (iterate (+1) ((+1) 11)))}

The reftypes.q module also contains some magic to make the overloaded get, put, fill and putmap functions work with any "indexed" reference container type providing a vals, list or list2 function which gives access to the container's reference components as a list. Examples of suitable container types are the Array, Dict and HDict types discussed in Standard Types. For instance (taking the Array type, which provides a kind of vector/matrix data structure with logarithmic lookup and modification times, as an example):

==> import array reftypes

==> def A = array $ reflist [1..10]

==> get A
[1,2,3,4,5,6,7,8,9,10]

==> A!5 := 99; get A
()
[1,2,3,4,5,99,7,8,9,10]

==> putmap (*2) A; get A
()
[2,4,6,8,10,198,14,16,18,20]

==> fill 0 A; get A
()
[0,0,0,0,0,0,0,0,0,0]

==> A := [0.1,0.2..1.0]; get A
()
[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]

2D containers a.k.a. matrices can be used in the same fashion:

==> def A = array2 $ reflist2 [1,2,3; 4,5,6]

==> get A
[[1,2,3],[4,5,6]]

==> A!(1,1) := 55
()

==> A!!(1,1)
55

==> get A
[[1,2,3],[4,55,6]]

==> fill 0 A; get A
()
[[0,0,0],[0,0,0]]

==> A := [0.1,0.2,0.3; 0.4,0.5,0.6]; get A
()
[[0.1,0.2,0.3],[0.4,0.5,0.6]]

==> putmap (*10) A; get A
()
[[1.0,2.0,3.0],[4.0,5.0,6.0]]

==> A!1 := [3,2,1]; get A
()
[[1.0,2.0,3.0],[3,2,1]]

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.15 Time Functions

The functions described in this section (which need to be imported from the system module) are used to return information about the active timezone and the current time, and convert time values to various formats. Also available are functions to measure cpu time and high-resolution timers. These functions are in close correspondence with the date and time functions of the C library.

The calendar date and time functions use two different representations for time values:

Simple time (denoted T in the sequel) is represented as the number of seconds elapsed since the "epoch", 00:00:00 on January 1, 1970, UTC.
Broken-down time (denoted TM) is encoded as a 9-tuple (YEAR, MONTH, DAY, HOUR, MIN, SEC, WDAY, YDAY, ISDST). See the description of the tm struct in the ctime(3) manual page for more information.

Three functions (tzname, timezone, daylight) are provided to return information about the current timezone and daylight savings settings. The gmtime and localtime functions convert simple time to broken-down time using UTC or the local timezone, respectively. The functions mktime and asctime are used to convert a broken-down time value into a simple time value or the standardized string representation used by the date(1) program, respectively. The ctime function combines localtime and asctime. For more flexible formatting of time values, the strftime function can be used, which takes as its first argument a format string as described on the strftime(3) manual page.

public extern tzname, timezone, daylight;
public extern ctime T, gmtime T, localtime T;
public extern mktime TM, asctime TM, strftime FORMAT TM;

The components of broken-down time. can be accessed with the following functions:

public tm_year TM, tm_month TM, tm_day TM, tm_hour TM, tm_min TM,
  tm_sec TM, tm_wday TM, tm_yday TM, tm_isdst TM;

On the current 32 bit systems time values must be representable as 4 byte integer values, which means that the available range usually goes from "Fri Dec 13 20:45:52 1901" to "Tue Jan 19 03:14:07 2038". (This will probably be fixed before "time ends" on current UNIX systems in January 2038, though.)

Some examples:

==> tzname; timezone; daylight
("CET","CEST")
-3600
1

==> ctime time
"Mon Mar 17 04:08:29 2003\n"

==> localtime time
(103,2,17,4,8,41,1,75,0)

==> strftime "Hey it's %c" _
"Hey it's Mon Mar 17 04:08:41 2003"

==> ctime (st_mtime (stat "README-Clib"))
"Wed May  1 16:59:00 2002\n"

Measuring CPU Time

Two additional functions are provided for measuring cpu time. Note that clock and times measure cpu time in different units, given by the constants CLOCKS_PER_SEC and CLK_TICK, respectively. The times function returns a 5-tuple (TOTAL, UTIME, STIME, CHILD_UTIME, CHILD_STIME). See times(2) for details.

public extern clock, times;

For instance, you can calculate the cpu time in seconds it takes to evaluate an expression by computing the difference between the clock time at the end and at the beginning, divided by CLOCKS_PER_SEC:

==> -(clock - (prd [1..10000] || clock)) / CLOCKS_PER_SEC
0.14

Note that the initial value and the resolution of the clock timer is system-dependent. Usually the timer is initialized at process creation time, but that is not guaranteed. Moreover, the value of clock typically is a machine integer which wraps around after a finite time interval; on 32 bit systems with CLOCKS_PER_SEC = 1000000, as recommended by the POSIX standard, this happens approximately every 72 minutes.

High-Resolution Timers

On systems where the POSIX 1003.1-2001 timer extension is available (at the time of this writing, this comprises most recent Linux and Unix systems, but not Windows), the system module provides a number of operations to deal with high-resolution timers. These will be useful for realtime applications and other programs with strict timing requirements.

public extern nanotime ID, nanores ID;
public extern nanosleep ID T, nanosleep_until ID T;

Note that these operations differ from the builtin time and sleep functions (see Miscellaneous Functions) in that they use integer (unsigned 64 bit) time values specified in nanoseconds. The nanotime function returns the current time in nanosecs, the nanores function the resolution of the given timer (i.e., the smallest measurable amount of time in nanosecs). The nanosleep function sleeps for the specified amount of nanoseconds, while the nanosleep_until function sleeps until the given (absolute) time arrives. In case of any internal error condition, these functions will set errno accordingly.

Just like the sleep function, the nanosleep and nanosleep_until functions may wake up early if a signal is delivered to the process. In this case errno will be set to EINTR, and nanosleep will return the remaining time until the planned wakeup, while nanosleep_until will simply fail. If the sleep terminates normally, nanosleep will return zero and nanosleep_until (), respectively.

For each of the functions you have to specify an integer-valued timer id. In the current implementation, the following timers are supported:

CLOCK_REALTIME: The system clock, measured in nanosecs since the epoch. This will be the same time as returned by the builtin time function (up to rounding issues with the latter).
CLOCK_MONOTONIC: Monotonic (non-decreasing) time since some unspecified starting point. This clock cannot be reset by the user and is thus to be preferred in realtime applications.
CLOCK_PROCESS_CPUTIME, CLOCK_THREAD_CPUTIME: Highres per-process and per-thread CPU timers. (Normally these are for timing purposes only and might be unreliable on SMP systems, see clock_gettime(3) for details.)

Only CLOCK_REALTIME is guaranteed to exist (on systems which implement the highres timers at all, that is). If a given clock is unavailable on the host system, the corresponding id will be (). (If CLOCK_REALTIME is () then the highres timers are not supported at all.)

Moreover, if CPU timers are available, then the following process_cpu_clockid and thread_cpu_clockid functions can be used to return the clock id for the given process id (0 denotes the current process) and the given thread, respectively:

public extern process_cpu_clockid PID, thread_cpu_clockid THREAD;

CAVEAT: While the highres timers nominally support nanoseconds resolution, the actual resolutions depend on your system setup and will typically be much coarser (for the system and monotonic clocks, they are usually in the milliseconds range on current PCs). Also, be aware that there are system latencies which might cause calls to nanosleep and nanosleep_until to wake up late. You can use realtime scheduling priorities to mitigate these effects to some extent, see Realtime Scheduling, but some sources of latency will still remain.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.16 Internationalization

As of Q 7.0, the following functions give access to the locale and internationalization functions provided by the C library on most modern operating systems, which let you deal with different character sets and language-specific conventions. If you have iconv and gettext on your system, interfaces to these functions will be provided as well. As of Q 7.8, all of these need to be imported from the system module.

public extern setlocale CATEGORY LOCALE;
public extern localeconv;
public extern nl_langinfo ITEM;

The setlocale function sets the current locale of the program. This function is usually invoked near the beginning of an application to initialize locale-dependent processing. (The Q interpreter also initializes the default locale at startup time, so you only have to invoke this function yourself if you want to set a locale which is different from the user's default locale, or if you have to change the locale during execution of the program.) The CATEGORY parameter determines which part of the locale is to be modified, and can be any of the LC_XXX constants defined in the manifest constants section at the beginning of this module. The LOCALE string must be a valid locale value for the given category, or "" to specify that the locale is to be set from the corresponding user defaults. LOCALE can also be () if you only want to query the current locale. In any case the current locale value for the given category is then returned as a string. For a description of the different locale categories please refer to the setlocale(3) manpage.

The localeconv function returns various numeric formatting information about the current locale. The result is returned as a tuple. Please refer to clib.q and the localeconv(3) manpage for details about this operation. You'll rarely have to use the information provided by this function directly, as the strfmon function (see below) allows you to format numbers and monetary values in a locale-dependent way.

On SUSv2-compatible Unix systems, additional information about the current locale is available using the nl_langinfo function. The ITEM parameter is one of the symbolic constants defined in the manifest constants section at the beginning of clib.q (see nl_langinfo(3) for a description of these). The result of nl_langinfo is always a string.

The following functions help with the formatting of various locale-dependent items:

public extern strfmon FORMAT ARGS;
public extern strcoll S1 S2, strxfrm S;
public extern wcswidth S;
public wcwidth C;

The strfmon function provides locale-dependent formatting of numbers and monetary values. See strfmon(3) for a description of the syntax of the FORMAT string. ARGS is either a singleton value or a tuple of floating point values, as required for the given format string. Integers in the ARGS parameter are automatically converted to floating point values.

The strcoll function compares two strings in a locale-dependent way; it returns <0, 0, >0 iff S1 is, respectively, less than, equal to, or greater than S2. The strxfrm function converts the given string into a form which can be used for locale-based comparisons; by comparing the transformed strings you always get the same results as by checking the result of strcoll.

The wcswidth function returns the actual number of columns required to print a string, which may be different from the length of the string if the string contains non-ASCII Unicode characters. If any of the characters in the given string is non-printable, -1 is returned. The wcwidth function does the same for single characters.

Iconv Interface

public extern type IConv;
public extern iconv_open TO FROM, iconv_close IC, iconv IC B;

The IConv type represents conversion state objects for the iconv function. Objects of this type are created with iconv_open which takes two strings describing the desired target and source encodings as arguments, and destroyed with the iconv_close function (this is also done automatically when an IConv object is garbage-collected).

The iconv function does the actual conversion. It takes as arguments the current conversion state IC (an IConv object) and the byte string B to be converted, and returns the converted byte string. The second argument can also be () in which case iconv resets the conversion to the initial state and returns the "shift" sequence necessary to return to the initial state, if any (this will always be the empty byte string in the case of a stateless encoding).

If anything goes wrong during the conversion, instead of a single byte string iconv returns a pair of byte strings (C,D) where C is the successfully converted initial part of the sequence and D is the remaining data, starting from the position in the source buffer B where the error occurred. It is then the responsibility of the application to check the errno value in order to determine what exactly went wrong, see iconv(3) for details.

Please note that the supported combinations of source and target encodings depend on the system and particular iconv implementation. Using GNU iconv, you can obtain the list of all supported encodings with the `iconv --list' command.

Gettext Interface

public extern textdomain DOMAIN, bindtextdomain DOMAIN DIR;
public extern gettext MSGID, dgettext DOMAIN MSGID,
  dcgettext DOMAIN MSGID CATEGORY;
public extern ngettext MSGID1 MSGID2 N, dngettext DOMAIN MSGID1 MSGID2 N,
  dcngettext DOMAIN MSGID1 MSGID2 N CATEGORY;

These are simple wrappers for the corresponding C functions from the GNU gettext library.

The textdomain function selects the default domain to be used with the gettext/ngettext functions. The bindtextdomain function determines the base directory where the message files for the given domain are to be found. Both functions return the current setting, i.e., the current domain for textdomain, and the current directory for the given domain for bindtextdomain. To only query the current setting without changing it, use () as the DOMAIN argument to textdomain and the DIR argument to bindtextdomain.

The gettext function retrieves a message from a message catalog (determined with the textdomain and bindtextdomain functions) and returns the translated string. The message to be translated is specified as a string MSGID. If no translation can be found then gettext returns MSGID unchanged.

The other variations of gettext allow you to explicitly specify a domain (usually the application name, which defaults to the value set with the textdomain function) and a locale category (LC_MESSAGES by default). The ngettext functions are similar, but provide a way to translate plural forms. See the gettext(3) and ngettext(3) manual pages for details.

Note that the GNU gettext family of functions uses message catalogs in a binary format, which can be constructed from a readable description using the msgfmt(1) utility. Currently no special support is provided for the automatic extraction of messages from Q scripts using tools like xgettext(1). This will likely be fixed in future releases, but for the time being you will have to prepare the msgfmt input files manually, or use a custom script for that purpose.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.17 Filename Globbing

Matching filenames against patterns with shell wildcards (*, ? etc.), also known as filename "globbing", is implemented by the following functions from clib.q:

public extern fnmatch PATTERN S; // check whether S matches PATTERN
public extern glob PATTERN; // list of all filenames matching PATTERN

Examples

==> map (fnmatch "*.q") ["clib.q","clib.c"]
[true,false]

==> map (fnmatch "*.[cq]") ["clib.q","clib.c"]
[true,true]

==> glob "*.[qc]"
["clib.c","clib.q","factor.q","globexamp.q","regexamp.q"]

==> glob "/h*"
["/home"]

Only a single pattern is accepted by both fnmatch and glob. However, it is a simple matter to provide your own rules which extend the clib operations to lists of patterns:

fnmatch [] S:String      = false;
fnmatch [P:String|Ps] S:String
                         = fnmatch P S or else fnmatch Ps S;

glob Ps:List             = rmdups (sort (<) (cat (map glob Ps)));

rmdups []                = [];
rmdups [X,X|Xs]          = rmdups [X|Xs];
rmdups [X|Xs]            = [X|rmdups Xs] otherwise;

Note that in our extension of the definition of glob we employ the sort function (see C Replacements for Common Standard Library Functions) to sort the resulting list, and then remove adjacent duplicates (which could occur because a file may match more than one pattern).

Now let's see these definitions in action:

==> map (fnmatch ["*.q","*.c"]) ["clib.q","clib.c"]
[true,true]

==> glob ["*.q","*.c","clib.*"]
["clib.c","clib.q","clib.so","factor.q","globexamp.q","regexamp.q"]

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18 Regular Expression Matching

The clib module implements regular expression matching using the POSIX regcomp and regexec functions. Regular expressions are generally specified using the extended (egrep-like) syntax. Clib divides the matching functions into two interfaces, the high-level interface (regex function) and the low-level interface (regmatch and friends). Both interfaces are used with a third group of match state functions (reg and friends) which provide access to information about the current match.

For most purposes, the high-level regex function should be all that is needed. However, the low-level functions are provided so that you can create your own specialized regular expression engines, should you ever need to do so. In this case you might wish to take a look at the definition of the regex function in clib.q and modify it to suit your needs.

Note that in accordance with POSIX matching semantics, all matching functions search for an occurrence of the pattern in the target string, rather than simply checking whether the whole string matches the given pattern. If the latter functionality is required, you can use ^ and $ to tie the match to the beginning and the end of the string.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18.1 High-Level Interface

The following function is implemented as a special form, with the EXPR argument being passed unevaluated.

public special regex ~OPTS ~REGEX ~S EXPR;

The regex function evaluates, for each match of the given regular expression (also called the pattern) in the given string, the special EXPR argument, and returns the collection of all results as a list, in the order in which the matches were found. If no match is found, the result list will be empty. The EXPR argument typically uses the match state functions (see Match State Information) to retrieve the current match information. You can also invoke the regdone function (see Low-Level Interface) to escape from the search at any time.

Matching generally proceeds from left to right, and if several different substrings match at a given position, the longest match is preferred. The matching process is controlled by means of the OPTS string argument, which may contain zero or more of the following option characters (the corresponding flags of the POSIX regcomp/regexec functions are given in parentheses, where applicable):

g: Match globally, i.e., find all non-overlapping occurrences. Otherwise only the first match is reported (if any).
G: Like g, but report overlapping matches as well.
i: Do case-insensitive matches (REG_ICASE).
n: Do multi-line matches. This makes ^ and $ match line beginnings and ends (besides beginning and end of the string), and makes . and lists [...] not match the newline character, unless it is explicitly included in a list (REG_NEWLINE).
^: Do not match ^ at the beginning of the string (REG_NOTBOL).
$: Do not match $ at the end of the string (REG_NOTEOL).

Abnormal error conditions such as bad regular expression syntax are handled by returning an expression of the form regerr MSG where MSG is a string describing the error. You may give an appropriate definition of regerr, or check for literal regerr MSG values, to handle such error conditions in any way you like.

The regex function is clib's primary interface to perform regular expression matching, and should cover most usual applications. You can find a number of examples showing how to use this function below (see Basic Examples). The regex function itself is implemented in Q using the low-level functions discussed in the following section. You might wish to take a look at these if you need to do something special which cannot be done with the regex function (or cannot be done in an efficient manner).

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18.2 Low-Level Interface

The following functions are implemented in C using the POSIX regex functions.

public extern regmatch OPTS REGEX S, regnext, regdone;

The regmatch function searches for the first (i.e., leftmost) occurrence of the regular expression pattern in the given string, and the parameterless regnext function searches for the next occurrence. Both functions normally return a truth value indicating success or failure (but see the remarks concerning error handling below). In case of success, you can retrieve the current match using the match state functions (see Match State Information).

IMPORTANT: These functions have side-effects; they change the internal match state which is accessed using the match state functions (see Match State Information). This hidden state is provided for your convenience, to spare you the trouble of having to pass explicit parameters between the matching operations and the match state functions. All match state information is maintained on an internal stack, so that you can start a new search with the regmatch function while another one is still in progress. (This also implies that you can recursively invoke regex inside the EXPR argument of a regex call.)

The OPTS argument has the same meaning as for the regex function, and error handling also works in the same fashion. That is, in case of an abnormal error condition the regmatch/regnext functions return a regerr MSG expression instead of a truth value. See High-Level Interface.

Note that the regnext function will always return false, unless one of the g/G (global search) options is specified in the preceding regmatch call.

To terminate a (global) search which is still in progress (i.e., has not failed yet), you can call the regdone function, which always returns (). Subsequent invocations of the match state functions will then behave just like after a failing regmatch/regnext call. (After a failed match, regdone is not needed; it will be invoked automatically.)

Some care is needed to ensure proper operation of the low-level routines with multiple nested searches. As a general rule, each search successfully started with regmatch must be terminated either with a failing regnext call, or by an invocation of the regdone function. In both cases, after termination of the nested search, a subsequent regnext will continue where regmatch/regnext left off when the nested search was started.

It is important to note that these rules also apply if the search is non-global. The main rationale behind this behaviour is that, even if one is only interested in the first match, one must still be able to obtain information about any trailing unmatched text using the regstart/regskip functions (see Match State Information), and this information is only available after a match has failed (or was aborted using regdone). Thus the interface functions should always behave consistently, no matter whether the g/G option was specified or not. (In fact, the only difference g/G makes is that, if it is omitted, then regnext will pretend that no next match is available even if there is one.)

If this sounds confusing to you, just stick with the high-level regex function which takes care of all those messy details. A nested matching example using regex is discussed in Nested Searches.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18.3 Match State Information

These functions are typically invoked after regmatch, regnext, or in the EXPR argument of regex, to report information about the current match.

public extern regstart, regskip, reg N, regpos N, regend N, regs;

The reg function returns the text of a match, regpos its beginning in the target string (first character position), and regend its end (first position behind the match). These functions take an integer argument denoting the group (a.k.a. parenthesized subexpression) of the regular expression being matched. An argument of 0 always refers to the whole match, N > 0 to the text matched by the Nth group (counting opening parentheses from the left). The match reported for a given group will always extend to the right as far as possible ("longest match"), subject to the constraint that enclosing groups will also prefer the longest match, and they always take priority. Thus reg 0 will extend to the right as far as possible, reg 1 will be the longest possible match for group 1 inside this match, etc.

If a group belongs to a repetitive pattern (patterns involving the *, + operators, etc.), then it may be matched more than once. In this case, the text last matched by the group is reported by the reg/regpos/regend functions.

If the group belongs to an optional part of the pattern (*, ?, different alternatives with |), then the group might not be matched at all. For such unmatched groups, regpos and regend both return -1 and reg returns the empty string. These values are also returned for all groups (including 0) after a match has failed or has been aborted using the regdone function.

The regs function returns a list with the indices of all matched groups (except 0).

The parts of the target string which were not matched can also be accessed. This is done by means of the regstart and regskip functions, which return, respectively, the position where the last search for a match started, and the text from this position up to the following match (or to the end of the string if the match failed). These values can also be accessed after a failing or aborted match (e.g., after regex has finished), which is useful to determine a trailing unmatched portion in the input string.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18.4 Basic Examples

Please refer to your UNIX manual for a description of the regular expression syntax. (To get a start, take a look at the egrep(1) manual page; most of what it has to say about (extended) regular expressions should be applicable here too. There are also some good books on the subject.)

So let's consider some elementary examples first. Here's how we can find all occurrences of "identifiers" (defined here as sequences of letters and digits, starting with a letter) in a string:

==> regex "g" "[A-Za-z][A-Za-z0-9]*" "1var foo 99 BAR $%&" (reg 0)
["var","foo","BAR"]

The first argument to regex is always the option string; the "g" in our example indicates a global search, i.e., a search for all (non-overlapping) matches of the given pattern. The second argument is the regular expression pattern to be matched, and the third argument is the string to be searched for matches against the pattern. The fourth argument is the expression to be evaluated for each match; in our case, reg 0 is used to simply retrieve the match. Actually, we could provide any appropriate expression here, e.g., we might use sprintf to add some fancy formatting:

==> regex "g" "[A-Za-z][A-Za-z0-9]*" "1var foo 99 BAR $%&" \
(sprintf "MATCH: %s" (reg 0))
["MATCH: var","MATCH: foo","MATCH: BAR"]

What happens if the pattern contains an error, such as an unmatched bracket? Let's see:

==> regex "g" "[A-Za-z][A-Za-z0-9*" "1var foo 99 BAR $%&" (reg 0)
regerr "Unmatched [ or [^"

Reasonably enough, regex returns an error message.

The regex function always enumerates matches from left to right, and only reports at most one match for each position in the input string. If more than one substring matches at a given position, the longest match is preferred. Thus in the above example, e.g., foo is matched, and not the individual characters f and o which by themselves also match the given pattern.

Note that with the g option only non-overlapping matches are reported; if we want all occurrences (even overlapping ones) then we specify the G option instead:

==> regex "G" "[A-Za-z][A-Za-z0-9]*" "1var foo 99 BAR $%&" (reg 0)
["var","ar","r","foo","oo","o","BAR","AR","R"]

And if we are only interested in the first match, we simply omit the g/G option character:

==> regex "" "[A-Za-z][A-Za-z0-9]*" "1var foo 99 BAR $%&" (reg 0)
["var"]

So far, so good. After these basics, let's reconsider our original search expression:

==> regex "g" "[A-Za-z][A-Za-z0-9]*" "1var foo 99 BAR $%&" (reg 0)
["var","foo","BAR"]

We can simplify the pattern somewhat if we use the i option to specify a case-independent search:

==> regex "gi" "[a-z][a-z0-9]*" "1var foo 99 BAR $%&" (reg 0)
["var","foo","BAR"]

Another way to write this pattern uses the predefined character classes [:alpha:] and [:alnum:]; these constructs have the advantage that they are portable, i.e., they work in different character sets and locales.

==> regex "g" "[[:alpha:]][[:alnum:]]*" "1var foo 99 BAR $%&" \
(reg 0)
["var","foo","BAR"]

So now we know how we can find identifiers in a string. But what if we have to check whether a given string is an identifier? For this purpose, we must match the string as a whole against the pattern. We can do this by tying our pattern to the beginning and end of the string using the ^ and $ "anchors". (The g option is not required here, since we are only looking for a single match.)

==> regex "" "^[[:alpha:]][[:alnum:]]*$" "foo" (reg 0)
["foo"]

==> regex "" "^[[:alpha:]][[:alnum:]]*$" "1var" (reg 0)
[]

If we only want a truth value, we can simply test the size of the result list; in this case, any expression argument will do:

==> 1=#regex "" "^[[:alpha:]][[:alnum:]]*$" "foo" ()
true

==> 1=#regex "" "^[[:alpha:]][[:alnum:]]*$" "1var" ()
false

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18.5 Empty and Overlapping Matches

Empty matches are permitted as well, subject to the constraint that at most one match is reported for each position (which also prevents looping). And of course an empty match will only be reported if nothing else matches. For instance:

==> regex "g" "" "foo" (regpos 0)
[0,1,2,3]

==> regex "g" "o*" "foo" (regpos 0)
[0,1,3]

(The usefulness of such constructs might be questionable, but see Performing Replacements, for a reasonable example.)

As already mentioned, only non-overlapping matches are reported with the g option. However, occasionally it might be useful to also determine overlapping matches, and for this purpose the G option is provided. For instance, suppose that we want to produce a table showing which letters follow another given letter in a text. The first step is to produce a list of all pairs of adjacent letters, and since we really need all pairs here, we employ the G option:

==> regex "G" "[[:alpha:]]{2}" "silly" (reg 0!0,reg 0!1)
[("s","i"),("i","l"),("l","l"),("l","y")]

Now it is an easy matter to collect the desired information using, e.g., a dictionary and a little bit of "list voodoo:"

==> var add = \D (X,Y).update D X (insert (D!X) Y)

==> def D = foldl add (mkdict emptyset (map fst _)) _

==> zip (keys D) (map list (vals D))
[("i",["l"]),("l",["l","y"]),("s",["i"])]

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18.6 Splitting

In previous examples we saw how we can tokenize a string by matching its constituents (see Basic Examples). Sometimes one also has to go the other way round, i.e., split a string into arbitrary tokens separated by "non-tokens". E.g., we might want to split a string into tokens delimited with whitespace, which can be described by an expression like "[ \t\n]+", or, in a locale-independent fashion, using "[[:space:]]+". To do this with the regex function, we must find all text that is between matches, and any trailing text after the last matched delimiter. The regskip function lets us do this as follows:

==> regex "g" "[[:space:]]+" "The little\t brown\n fox." regskip \
++ [regskip]
["The","little","brown","fox."]

Here the regskip expression argument to regex is used to access the intervening tokens, and the final ++ [regskip] appends the last token.

We can use this method to define our own little regular expression tokenizing function as follows:

regsplit OPTS REGEX S = regex OPTS REGEX S regskip ++ [regskip];

With this definition, the above example works as follows:

==> regsplit "g" "[[:space:]]+" "The little\t brown\n fox."
["The","little","brown","fox."]

We could also omit the g option, in which case only the first delimiter match is used as a splitting point:

==> regsplit "" "[[:space:]]+" "The little\t brown\n fox."
["The","little\t brown\n fox."]

Splitting with an empty or optional pattern also works as expected:

==> regsplit "g" "" "some text"
["","s","o","m","e"," ","t","e","x","t",""]

==> regsplit "g" " ?" "some text"
["","s","o","m","e","","t","e","x","t",""]

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18.7 Performing Replacements

A similar idea is also used to replace matches with new text. You can define a substitution function, which replaces each match with the result of an expression, as follows:

special regsub ~OPTS ~REGEX ~S EXPR;
regsub OPTS REGEX S EXPR = strcat (regex OPTS REGEX S (regskip++EXPR))
                           ++ regskip;

For instance:

==> regsub "g" "[[:alpha:]][[:alnum:]]*" "1var foo 99 BAR $%&" \
(sprintf "-*-%s-*-" (reg 0))
"1-*-var-*- -*-foo-*- 99 -*-BAR-*- $%&"

As usual, if we only want to replace the first match, we can omit the g option:

==> regsub "" "[[:alpha:]][[:alnum:]]*" "1var foo 99 BAR $%&" \
(sprintf "-*-%s-*-" (reg 0))
"1-*-var-*- foo 99 BAR $%&"

And here's an example showing empty and optional patterns at work:

==> regsub "g" "" "some text" " "
" s o m e   t e x t "

==> regsub "g" " ?" "some text" ":"
":s:o:m:e::t:e:x:t:"

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18.8 Submatches

It is also possible to access the parts of matches, called "groups", which are defined by parenthesized subexpressions of the regular expression. Groups are numbered starting at 1, counting opening parentheses from the left. The text matched by a group can be accessed using reg N where N is the group number.

For instance, suppose we want to parse environment lines, such as those returned by the shell's set command. We assume that each line with a definition looks like VARIABLE=VALUE, which can be described by the pattern "^([^=]+)=(.*)$", in which group 1 denotes the variable and group 2 the value. Thus we can obtain the parts of a definition using a regex search like the following:

==> regex "" "^([^=]+)=(.*)$" "VARIABLE=VALUE" (reg 1,reg 2)
[("VARIABLE","VALUE")]

We can turn this into an env function which returns the current environment as a list of (variable,value) pairs as follows. Here, we use the fget function to read in the whole environment as a single string from a pipe opened on the set command. The environment string is then parsed using the n option of regex, which makes the ^ and $ anchors match the beginning and end of each line, respectively.

env         = regex "gn" "^([^=]+)=(.*)$" envget (reg 1,reg 2);

envget      = fget (popen "set" "r");

Given these definitions, we can now get hold of the whole environment and, e.g., turn it into a dictionary which can be accessed in a convenient fashion using the functions from the dict.q standard library module:

==> var envdict = dict env

==> envdict!"SHELL"
"/bin/bash"

When a pattern consists of multiple alternatives, such as "(foo)|(bar)", then some of the groups might participate in a match, while others don't. You can check whether a group matched by testing the corresponding regpos value. If the value is nonnegative, then the group matched; otherwise (the value is -1) it didn't. You can also use the regs function to determine the list of all groups that matched. For instance:

==> regex "" "(foo)|(bar)" "foo" (regpos 1,regpos 2)
[(0,-1)]

==> regex "" "(foo)|(bar)" "bar" regs
[[2]]

So here's a simple way to implement a lexical analyzer which distinguishes between different kinds of tokens. For example, let us tokenize identifiers and integers, skip whitespace, and return all remaining characters as literals.

def SYN = "([[:alpha:]][[:alnum:]]*)|(-?[[:digit:]]+)|([^[:space:]])";
def TOK = (none,ident,num,id);

toks S  = regex "g" SYN S ((TOK!(hd regs)) (reg (hd regs)));

Note that the literal characters are mapped to the standard library id function, the identity operation. Hence these characters are represented by themselves, whereas other tokens are denoted by appropriate constructor terms, ident S and num S in our example, which contain the actual text of the token as arguments. Let's see this in action:

==> toks "foo -99; bar 99;"
[ident "foo",num "-99",";",ident "bar",num "99",";"]

The token sequence is now ready to be processed by a higher-level routine such as a parser, but this is another story which will be told another time …

Yet another way to employ submatches are the back references. You can specify that a group is to be repeated by using the notation \N, where N is a single digit between 1 and 9, in the regular expression argument. For instance, regex "g" "([ab]+)\\1" will match all substrings which have two consecutive identical sequences of a's and b's.

Back references add considerably to the matching complexity (some matches may take exponential time) and hence should be avoided. In fact, it can be shown that regular expression matching with back references is NP-complete.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.18.9 Nested Searches

Regular expression searches can also be nested, by invoking the regex function recursively in the expression argument of another regex call. This is useful, e.g., if regex is used to tokenize a string, and we want to further analyze the individual tokens. We could do this by mapping a regex call to the finished token list, but it may be more efficient to perform the second search right away on each individual token. Here's an abstract example, which finds all substrings delimited by c's, and counts the number of consecutive a's and b's in them:

==> regex "g" "[^c]+" "abbacbaaca" \
(regex "g" "a+|b+" (reg 0) (reg 0!0,#reg 0))
[[("a",1),("b",2),("a",1)],[("b",1),("a",2)],[("a",1)]]

There's one caveat with nested searches: The current regex implementation has the somewhat annoying but hardly avoidable limitation that it is not possible to "return" to the match state of an enclosing regex call after a recursive call to regex. Therefore all match state information must be retrieved before the recursive call, and must then either be processed right away or saved in local variables for later use.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.19 Additional Integer Functions

The Q interpreter implements basic integer arithmetic using the GMP (GNU multiprecision) library. With the clib module you also get some of the more advanced GMP routines, namely integer powers and roots, and various number-theoretic functions. (These are all included in the prelude.)

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.19.1 Powers and Roots

public extern pow M N, root M N, intsqrt M;

These functions compute (exact) powers and (integer parts of) roots of integers. The results are always integers; roots are truncated to the largest integer below the exact root value. The pow function computes the Nth power of M (N >= 0), root the integer part of the Nth root of M (M >= 0, N > 0), and intsqrt the integer part of the square root of M >= 0.

public extern powmod K M N, invmod K M;

The powmod function returns the (smallest nonnegative) Nth power of M modulo K, where K must be nonzero. This function is much more efficient than pow M N mod K if N is large. The invmod function computes the (smallest positive) multiplicative inverse of M modulo K (if it exists, i.e. if M is nonzero and relatively prime to K). In other words, invmod K M returns a number N in the range 1..K-1 such that M*N = N*M = 1 modulo K.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.19.2 Prime Test

public extern isprime N;

The isprime function implements the Miller-Rabin algorithm for probabilistic prime testing. If isprime N (where N > 1 is an integer) returns true or false, then N is surely prime or composite, respectively. If the function fails, then the number is composite with a probability of only (1/4)^ISPRIME_REP, where the ISPRIME_REP variable denotes the number of repetitions of the probabilistic prime test.

To avoid the overhead of checking ISPRIME_REP each time the isprime function is invoked, isprime determines the value of the variable once, at the time of its first invocation. Thus, if you set this value, make sure to set it before you call isprime for the first time. Reasonable ISPRIME_REP values vary from 5 to 10; if the variable is not set when isprime is first called, a default value of 5 is used, which means that the probability of isprime failing on a composite will be 1/1024.

When using this function, your script should provide an appropriate default definition which handles the case that the isprime operation of this module fails. There are basically three different ways to cope with a failing isprime call. First, you can make your program abort, e.g., using an "error rule" like the following:

isprime N:Int = error "Prime test failed!";

Given that the test will probably not fail very often, this might be a viable option - you can always run your script again and hope that it finishes successfully, since the probabilistic test will choose different random parameters each time it is invoked.

Second, you can pretend that the prime test succeeded by providing the definition:

isprime N:Int = true;

With this definition there will be a small probability that you mistake a composite for a prime, but for some algorithms this might be tolerable.

Third, you can play safe by providing your own primality test which catches those few cases when the probabilistic test fails. For instance, employing the usual trial division method:

isprime N:Int = true if N = 2;
              = false if N and 1 = 0; // even number
              = try_div 3 (intsqrt N) N otherwise;

try_div L M N = true if L > M;
              = false if N mod L = 0;
              = try_div (L+2) M N otherwise;

Of course such an exhaustive search will take a long time for big numbers, but this might be acceptable if the default rule is not invoked very often.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.19.3 Other Number-Theoretic Functions

public extern gcd M N, lcm M N;

These are the usual greatest-common-divisor and least-common-multiple functions. Both gcd (which is only defined if at least one of the arguments is nonzero) and lcm (which returns zero if either argument is zero) always return a nonnegative integer (note that older GMP versions returned the sign of the product of N and M in the lcm value, but this is no longer the case).

public extern remove_factor M N;

The remove_factor function is used to help in factoring. It returns a pair (K,R) consisting of the multiplicity K of N in M (i.e., the maximum L such that pow N L divides M) and the remainder R which has all N factors removed, i.e., R = M div pow N K. This function requires that M is nonzero and N positive.

public extern jacobi M N;

This function computes the Jacobi symbol M over N, which can be used to find quadratic residues of an odd prime. M can be an arbitrary integer, but N must be positive; the value of the function is always 1, 0 or -1, where the value 0 indicates that gcd M N > 1. In the case of an odd prime N, we have that jacobi M N = 1 iff M is a nonzero quadratic residue modulo N.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.19.4 Examples

Determine all invertible numbers modulo 9 and their corresponding inverses:

==> filter (\(X,Y).isnum Y) (zip [1..8] (map (invmod 9) [1..8]))
[(1,1),(2,5),(4,7),(5,2),(7,4),(8,8)]

Check the result:

==> map (\(X,Y).X*Y mod 9) _
[1,1,1,1,1,1]

Find all (nonzero) quadratic residues modulo 7:

==> filter (\X.jacobi X 7 = 1) [1..6]
[1,2,4]

Decompose 858330 into prime factors (brute force):

==> def N = 858330, P = filter isprime [2..N]

==> filter (\(X,Y).Y>0) (zip P (map (fst.remove_factor N) P))
[(2,1),(3,3),(5,1),(11,1),(17,2)]

Well, that's probably not the right way to do it. So here's a reasonable factorization algorithm using remove_factor. This method is a lot faster - but still not fast enough for code breaking.

factor 0          = [];
factor N:Int      = factor (-N) if N<0;
                  = [(2,K)|factor_from 3 M]
                        where (K,M) = remove_factor N 2
                        if N and 1 = 0; // even number
                  = factor_from 3 N
                        otherwise;

factor_from P N   = [] if P > N;
                  = [(P,K)|factor_from (P+2) M]
                        where (K,M) = remove_factor N P
                        if N mod P = 0; // P divides N (must be prime)
                  = factor_from (P+2) N
                        otherwise; // not a factor, try the next one

For instance, try the following:

==> factor 807699854836875
[(3,1),(5,4),(7,2),(11,5),(13,2),(17,1),(19,1)]

Check the result:

==> prd (map (\(X,Y).pow X Y) _)
807699854836875

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

12.20 C Replacements for Common Standard Library Functions

Last but not least, the clib module also provides "built-in" replacements for various standard library functions:

public extern stdlib::append Xs Y, stdlib::cat Xs, stdlib::mklist X N,
  stdlib::nums N M, stdlib::numsby K N M, stdlib::reverse Xs,
  tuple::tuplecat Xs;

public extern string::chars S, string::join DELIM Xs,
  string::split DELIM Xs, string::strcat Xs;

These functions are much more efficient, both in running time and memory requirements, than the "standard" definitions in stdlib.q and string.q, and improve the performance of many common list and string processing tasks. In some cases, the provided operations are even several orders of magnitude faster. In particular, the standard definitions of the strcat and tuplecat operations are very slow in comparison, because they are implemented using repeated concatenation and hence take quadratic running time. In contrast, all functions provided here are guaranteed to run in linear time.

A faster sorting routine is provided as well:

public extern sort P Xs;

This function works like the msort and qsort operations provided by the sort.q module, but is implemented using the quicksort routine from the C library. Note that, in difference to msort/qsort, the sort function does not perform stable sorting, so you still have to use msort or qsort if this feature is required.

[ << ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

This document was generated by Albert Gräf on February, 23 2008 using texi2html 1.76.