
Q Yacc + Lex
= ==== = ===

This package contains versions of Yacc and Lex written in the Q programming
language. Q Yacc and Lex try to be as compatible with their C counterparts as
reasonable, but generate parsers and lexical analyzers which can be executed
with the Q interpreter.

INSTALLATION
------------

To use these programs, you need to have the Q interpreter installed, available
from http://q-lang.sf.net. (If you get a binary release of Q 7.4 or later, Q
Yacc and Lex should already be included.)

To compile from source, first run `make' to compile the yacclib and lexlib C
modules and regenerate the qyacc.q and qlex.q programs. (Q Yacc and Lex are
implemented using themselves, so this bootstrapping step is required to build
the actual Q scripts and support libraries. Only an initial qyacc.q script is
required for this process, which is included in the source tarball.)

If all is well, you can run `sudo make install' to install the programs and
the accompanying support files under /usr. The installation prefix should be
the same as that of the Q interpreter; if necessary, you can change the prefix
by modifying the `prefix' variable in the Makefile, or by running `sudo make
install prefix=...' to set the proper prefix on the fly.

You can also do a quick bootstrap check with `make check'; this will recompile
qyacc.y and compare the resulting output file with the original qyacc.q; the
two files should not differ. It is possible to do this without installing the
programs.

USAGE
-----

Below is some information which should help you to get started with Q Yacc and
Lex. More elaborate documentation still needs to be written, but if you know C
Yacc and Lex then you should be able to find your way. The obligatory desktop
calculator example and a few other examples, including a fairly complete Q
parser, are available in the source tarball for your perusal.

Q YACC
- ----

The qyacc program is invoked as follows:

qyacc [ -dhvV ] [ -o output-file ] input-file

The -h (or --help) option prints the version number and a short help
message. The -V (--version) option prints only the version number. The -o
(--output-file) option can be used to explicitly set the name of the generated
parser; the default is the name of the input file with new extension `.q'. The
-v (--verbose) option causes a description of the generated parser to be
written to a `.output' file, and -d (--defines) causes the symbolic token
constants to be written to a separate `_defs.q' file. The `.output' file is
often useful to debug a parser and also gives a detailed report on parsing
conflicts which are caused by ambiguous (non-LALR) syntactic constructs.

The `.output' and `_defs.q' files are always named after the output file.
Thus, e.g., if the input file is named myparser.y, then by default the parser
itself, the parser description and the token definitions will be written to
myparser.q, myparser.output and myparser_defs.q, respectively.

The general format of Q Yacc input (`.y') files is the same as for C Yacc
grammars:

definitions
%%
rules
%%
auxiliary code

DEFINITIONS AND AUXILIARY CODE

Besides the usual kinds of Yacc symbol declarations (%token, %start, etc.),
both the definitions and the rules section may contain arbitrary Q code
enclosed in %{ ... %} which is copied verbatim to the beginning of the output
file. These code sections typically contain imports and other Q declarations
which are needed for the actions in the rules section.

The auxiliary code section, if present, also contains arbitrary Q code which
is simply tacked on to the end of the output file. Usually this section
contains the definitions of semantic routines as well as the `yylex' and
`yyerror' functions (see below).

RULES AND ACTIONS

Grammar rules follow the usual BNF-like format, i.e.:

name : symbol ...
     | symbol ... { action }
       ...
     ;

Like C Yacc, Q Yacc supports the use of the special `error' token to implement
error recovery rules. Literals in rules may be written as Unicode strings
enclosed in either single or double quotes; special characters in literals can
be specified using the same escape sequences as in Q strings.

Actions in rules take the form { X }, where X is an arbitrary Q expression
which may also contain "semantic values" of the form $$ and $I, where I is an
integer. These refer to the values associated with grammar symbols; $$ refers
to the value of the left-hand side symbol of the rule, while $I refers to the
value of the Ith symbol or action on the right-hand side if I>=1, or the value
of a symbol in an enclosing rule if I<=0. In difference to C Yacc, the
semantic values are always untyped and may hold a value of any Q type; () is
used as the default value for grammar symbols which do not have an associated
value. (As a diagnostic aid, the parser will warn you at runtime if an action
tries to access a non-existing or undefined semantic value, since this usually
hints at a bug in the action.)

Actions of the form `$$ = ...' or `$I = ...' are treated specially. They are
interpreted as assignments for the semantic values. Assignments and other
actions can also be executed in a sequence `X1 || ... || Xn' or in the
branches of a conditional action `if X then Y' or `if X then Y else Z'. Note
that assignments are only recognized if they occur at the toplevel of these
constructs, but not in nested subexpressions (where `=' denotes Q's usual
equality operator).

Q INTERFACE

As with C Yacc, the generated parser is implemented by a parameterless
function `yyparse' which returns 0 for a successful parse and 1 in case of an
(unrecoverable) syntax error. The core of the parser routine is now
implemented in C, in order to achieve better speed for practical applications;
the supporting C routines are implemented by the accompanying yacclib.q module
which is always imported in the generated Q scripts.

The programmer must supply a `yylex' function which performs lexical analysis
and returns the analyzed tokens to the parser; this function can either be
programmed by hand or be prepared with Q Lex (see the following section). The
`yylex' function is usually included in the auxiliary code section so that it
can access the symbolic token constants defined at the beginning of the output
file. `yylex' is a parameterless function which is assumed to return a new
token at each invokation. The result of `yylex' may either be an integer,
denoting one of the symbolic token constants, or a string which denotes a
character literal in the grammar, like, e.g. ';'. In difference to C Yacc, Q
Yacc allows arbitrary Unicode strings to be returned in this fashion. Thus,
e.g., if ':=' is used as a character literal in the grammar, then the `yylex'
function may return this token using either the corresponding token number or
the literal string ":=" (the latter is slightly less efficient, however, since
an extra table lookup is required). If a semantic value is associated with a
token, `yylex' should return both the token and its value as a pair of the
form (TOKEN,VALUE).

If the `yylex' function is provided in a separate module (e.g., a module
prepared with Q Lex), that module can be imported using a Q `import'
declaration in a code section, e.g., `%{ import mylexer; %}', near the
beginning of the grammar. In this case you have to write the token number
definitions to a separate `_defs.q' file using Q Yacc's -d option, and import
that module in the lexer module, so that the lexer can access the token
numbers.

It is also possible to directly include the code of `yylex' from a separate
file. As Q does not have a textual include directive, Q Yacc provides a
special `%yylex' directive for that purpose:

%yylex "mylexer.q"

Only one such directive may be present, which must be placed into the
definitions section. The directive immediately generates a `public' forward
declaration for the yylex function and later causes the specified file to be
included at the beginning of the auxiliary code section.

Another interface function used by the parser is `yyerror' which prints error
messages. A suitable default version of `yyerror', which just prints the given
message string on standard error, is provided in the yacclib.q module. You can
simply override the default definition of `yyerror' by providing your own,
e.g., in the auxiliary code section of your grammar. (If you need to provide
different versions of `yyerror' in different parsers to be used in the same
program, you can do that, too, if you both declare and define a new `yyerror'
function locally in each parser source. The declaration should then go into
the definitions section of the grammar so that it overrides the `yyerror'
declaration imported from yacclib.q.)

A number of other operations for use in actions are provided by the standard
parser skeleton in yyparse.in. In particular, yyACCEPT, yyABORT and yyERROR
initiate the parser's accept, abort and error actions, just like C Yacc's
YYACCEPT, YYABORT and YYERROR macros. The current lookahead token and its
semantic value can be accessed with the yychar and yylval functions, and
cleared with the yyclearin function. The yyerrok function is used during error
recovery to tell the parser that it has successfully recovered from a syntax
error. You can also use `yysetdebug true' to have the parser generate
debugging output. Please see yyparse.in for a closer description of the
supported operations.

Q LEX
- ---

The qlex program is invoked as follows:

qlex [ -hvV ] [ -o output-file ] input-file

The options work like those of the qyacc program. As with Q Yacc, the
`.output' file generated with -v is always named after the output file. Thus,
e.g., if the input file is named mylexer.l, then by default the lexical
analyzer and its description will be written to mylexer.q and mylexer.output,
respectively.

The general format of Q Lex input (`.l') files is the same as for C Lex
grammars:

definitions
%%
rules
%%
auxiliary code

DEFINITIONS AND AUXILIARY CODE

The definitions section may contain the usual regular and start state
definitions (%s) in the same format as accepted by C Lex; POSIX-style
exclusive start conditions (%x) are also supported. Extra code (indented lines
or code enclosed in %{ ... %}) in the definitions section or at the beginning
of the rules section is copied verbatim to the beginning of the output file.

The auxiliary code section, if present, also contains arbitrary Q code which
is simply tacked on to the end of the output file.

RULES AND ACTIONS

The rules section consists of a table of regular expressions and corresponding
actions, in the same format as accepted by C Lex. All the usual basic elements
of regular expressions (individual characters, character classes, double-quoted
strings) and operators for specifying optional elements (`?'), repetitions
(`*', `+', `{N}', `{N,M}', `{N,}') and alternatives (`|') are supported, as
well as trailing context (`/') and start conditions (`^', `<x,y,...>'); the
Flex notation `<*>' to indicate a rule which is valid in all start states, and
the name `INITIAL' denoting the predefined default start state are also
recognized. The `{NAME}' construct is expanded using the regular definitions
in the first section of the grammar; like C Lex, Q Lex performs simple textual
substitution for this construct, and the definitions must not be recursive.

Actions must be legal Q expressions, separated from the regular expression
using an arbitrary amount of whitespace (blanks or tabs). They may continue
across line ends if the continuation lines are indented with at least one
space or tab character. Instead of an action, you can also use the `|'
character to indicate that the action for this rule is to be the same as for
the next one. An action may also be empty (or a Q comment) if no action is to
be performed when the corresponding input is matched.

For instance, here are four simple rules to match identifiers and numbers and
ignore everything else:

[A-Za-z_][A-Za-z_0-9]+	return yytext
[0-9]+			return (val yytext)
.			|
\n			/* ignore */

As indicated, actions may return values to the caller by means of the `return'
function. A few other functions are defined in the scanner skeleton yylex.in
for use in actions, these are discussed in some detail below.

Q INTERFACE

Just like parsers generated with Q Yacc, lexical analyzers created with Q Lex
use C support routines to achieve a reasonable processing speed. These
routines are provided by the lexlib.q module which is always imported in the
generated output code. (The routines provided by lexlib.q are for internal use
in the scanner skeleton yylex.in only; normally you shouldn't invoke these
functions directly from application code.)

The lexical analyzer is implemented by a parameterless function `yylex' which
keeps scanning its input until a token is recognized, in which case it
executes the corresponding action. As usual, the analyzer prefers the longest
match and, if two rules match the same text, prefers the first of these. The
matched text is available inside actions by means of the `yytext' function,
and `yyleng' holds the length of the text, i.e., #yytext. In difference to C
Lex, these values cannot be modified directly.

An action may process the recognized text in some way, possibly producing
output, and it may also invoke the `return' function which returns an
arbitrary value to the caller. The functions `yymore', `yyless', `reject' and
`echo' are also available, and work in the same fashion as the corresponding C
Lex routines `yymore', `yyless', `REJECT' and `ECHO'.

If an input character is not matched by any rule then `yylex' simply copies it
to its output unchanged. When the analyzer reaches end of file on input, it
normally returns 0 the caller. The application may provide an alternate
definition for the function `yywrap' to modify this behaviour. In particular,
a custom `yywrap' can arrange for more input and then return false to indicate
that lexical analysis is to be resumed.

Character I/O is done through two file objects `yyin' and `yyout' (initially
assigned to standard input and output), and the functions `yygets', `yygetc',
`yyputs' and `yyputc' are provided to access the analyzer's input and
output. (Because of the buffering performed internally by the analyzer, the
application shouldn't normally read from `yyin' directly, but always use
`yygets' or `yygetc' to read strings and single characters from the input.)

The primary interface functions for actually doing I/O are named `yyinput' and
`yyoutput' which work like `gets' and `puts' but read and write directly from
`yyin' and to `yyout', respectively. The application can provide its own
definitions for `yyinput' and `yyoutput' to override the default behaviour,
e.g., if input is to be read from, or output is to be written to, strings
instead of files.

While the lexical analyzers generated by Q Lex are not reentrant per se, it is
possible to have `yylex' switch between different files and buffers by means
of the `yyactbuf', `yynewbuf' and `yysetbuf' functions. These routines work
more or less like the corresponding facilities of Flex scanners. The
`yyactbuf' function returns a `YYBuffer' object which encapsulates the state
of the buffers and files currently used by the `yylex' routine. The `yynewbuf'
function takes an input and an output "file" as arguments and returns a new
`YYBuffer' object which represents the initial state of a lexical analyzer
operating on the given files. (The arguments of `yynewbuf' can actually be any
values if you have replaced `yyinput' and `yyoutput' with your own
versions. The default I/O interface expects file values, however, and will
silently consider the files as nonexistent if other values are provided.) 
Finally, the `yysetbuf' function switches to the `YYBuffer' state given as its
argument. Using these facilities, it is an easy matter to have the analyzer
switch between different inputs, e.g., in order to process include files. (The
desktop calculator example included in the distribution shows how to do this.)

BUGS AND COMPATIBILITY NOTES
---- --- ------------- -----

This is BETA software, so expect some bugs. The LALR algorithm in the Q Yacc
program is currently rather slow for large grammars; this will hopefully be
fixed in a future version. However, the cores of the yyparse and yylex
routines are now implemented in C, so that the generated parsers and scanners
should be fast enough for practical applications. (They are still quite a bit
slower than pure C parsers and scanners, though, because of the extra overhead
for calling back into Q for the actions.)

The accepted input languages and behaviour of the generated parsers and
lexical analyzers should generally be compatible with C Yacc/Lex, except for
the differences listed below. (Q Lex is even bug-compatible with C Lex/Flex in
that patterns with "dangerous" trailing context like ab*/b are matched
incorrectly; e.g., if the input is "abbc" then "abb" will be matched, not
"ab".)

Incompatibilities and major differences to C Yacc/Lex:

- Q Yacc's semantic values ($$ and $I) are always dynamically typed. C Yacc
features which deal with value types (%union, %type, etc.) are thus not
supported by Q Yacc.

- C Lex character tables (%T) are not supported by Q Lex; neither are the
directives for table sizes (%p, %n, etc.). The latter are not needed anyway
because there are no internal tables with fixed sizes.

- Q Yacc and Lex fully support Unicode. Escapes for special characters in
string and character literals follow the Q conventions, rather than those
of C.

- Actions and auxiliary code is written in Q instead of C. Thanks to Q's
module system, multiple parsers and lexical analyzers can coexist in the same
program. Support routines to be provided by the programmer, such as yyerror
and yywrap, can be defined locally in the corresponding parser or scanner
module.

- The programming interface, while trying to mimic that of C Yacc and Lex,
works a bit differently. In particular, macros and variables in the output
code of C Yacc and Lex are implemented as Q functions and in some cases their
names had to be mangled to make them valid Q function identifiers. Scanners
generated by Q Lex employ a buffered character I/O system, more like the one
used by Flex, but the I/O routines are named differently.

- The parse tables constructed by Q Yacc are slightly larger than those of the
original C Yacc. This is because a reduce action will be used as the default
action of a state only if it is the only action in the state, in order to fix
a bug in the original Yacc implementation which would start error recovery too
late with certain kinds of error productions; see Schreiner/Friedman,
"Introduction to Compiler Construction with Unix", 1985, for details.


Enjoy! :)
Albert Graef
August 2006
