
3. Lexical Matters

The vocabulary of the Q programming language consists of notations for identifiers, operators, integers, floating point numbers, character strings, comments, a few reserved words which may not be used as identifiers, and some special punctuation symbols which are used as delimiters.



3.1 Whitespace and Comments

Whitespace (blanks, tabs, newlines, form feeds) serves as a delimiter between adjacent symbols, but is otherwise ignored. Comments are treated like whitespace:

 
/* This is a comment ... */

Note that these comments cannot be nested. C++-style line-oriented comments are also supported:

 
// C++-style comment ...

Furthermore, lines beginning with the #! symbol denote a special type of comment which may be processed by the operating system's command shell and the Q programming tools. On UNIX systems, this (odd) feature allows you to execute Q scripts directly from the shell (by specifying the q program as a command language processor) and to include compiler and interpreter command line options in a script (see Running Scripts from the Shell).



3.2 Identifiers and Reserved Words

Identifiers are denoted by the usual sequences of letters (including `_') and digits, beginning with a letter. Upper- and lowercase letters are distinct. In the Q language, identifiers are used to denote type, function and variable symbols. Type identifiers may start with either an uppercase or a lowercase letter (by convention, an initial uppercase letter is used). For function and variable symbols, however, the case of the first letter matters. As in Prolog, a capitalized identifier (such as X, Xmax and XMAX) indicates a variable symbol, whereas identifiers starting with a lowercase letter denote function symbols (unless they are declared as "free" variables, see below). Unlike in Prolog, the underscore `_' counts as a lowercase letter, hence _MAX is a function symbol, not a variable. (The idea behind this is that it allows you to get a function symbol which appears to start with an uppercase letter by stropping it with an initial `_'.) As an exception to the general rule, the identifier `_' on its own does denote a variable symbol, the so-called anonymous variable. The same rules also apply to symbols created interactively in the interpreter.

Variables actually come in two flavours: bound and free variables, i.e., variables which also occur on the left-hand side of an equation, and variables which only occur on the right-hand side and/or in the condition part of an equation. Identifiers may also be declared as free variables; see Declarations. In this case, they may also start with a lowercase letter.

Type, function and free variable identifiers may also be qualified with a module identifier prefix (cf. Scripts and Modules), to specifically denote a symbol of the given module. Such a qualified identifier takes the form module-id::identifier; no whitespace or comments are allowed between the module name, `::' symbol and the function or variable identifier.

Formally, the syntax of identifiers is described by the following grammatical rules:

 
identifier              : unqualified-identifier
                        | qualified-identifier
qualified-identifier    : module-identifier '::'
                          unqualified-identifier
unqualified-identifier  : variable-identifier
                        | function-identifier
                        | type-identifier
module-identifier       : letter {letter|digitsym}
type-identifier         : letter {letter|digitsym}
variable-identifier     : uppercase-letter {letter|digitsym}
                        | '_'
function-identifier     : lowercase-letter {letter|digitsym}
letter                  : uppercase-letter|lowercase-letter
uppercase-letter        : 'A'|...|'Z'|uppercase unicode letter
lowercase-letter        : 'a'|...|'z'|'_'|lowercase unicode letter
digitsym                : '0'|...|'9'|unicode digit

(Please refer to Q Language Grammar, for a description of the BNF grammar notation used throughout this document.)

The following symbols are reserved words of the Q language and may not be used as identifiers:

 
as        const     def       else      extern    from      if
import    include   otherwise private   public    special   then
type      undef     var       virtual   where

In addition, the following identifiers are predeclared as operator symbols (see Operator Symbols) and cannot be used as normal identifiers either:

 
and       div       mod       not       or
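
The case rules above can be modelled outside of Q. The following Python sketch is not part of the Q tools; the function name classify_identifier is made up for illustration, and it handles ASCII identifiers only. It also cannot tell type identifiers or declared free variables apart from other symbols, since that depends on declarations rather than on spelling.

```python
import re

# Reserved words and predeclared operator identifiers, as listed in this section.
RESERVED = {
    "as", "const", "def", "else", "extern", "from", "if", "import",
    "include", "otherwise", "private", "public", "special", "then",
    "type", "undef", "var", "virtual", "where",
}
OPERATOR_IDS = {"and", "div", "mod", "not", "or"}

# ASCII-only approximation: a letter (or '_') followed by letters, digits, '_'.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")


def classify_identifier(name):
    """Return 'reserved', 'operator', 'variable', 'function', or None."""
    if not _IDENT.match(name):
        return None                 # not an identifier at all
    if name in RESERVED:
        return "reserved"
    if name in OPERATOR_IDS:
        return "operator"
    if name == "_":
        return "variable"           # the anonymous variable
    if name[0].isupper():
        return "variable"           # capitalized: variable symbol
    return "function"               # lowercase or leading '_': function symbol
```

For instance, classify_identifier("_MAX") yields 'function' while classify_identifier("XMAX") yields 'variable', which mirrors the stropping convention described above.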


3.3 Operator Symbols

Operator symbols may either take the form of function identifiers or they may be sequences of punctuation characters (excluding `_' which serves as a letter in the Q language). Like function and free variable symbols, they may be qualified with a module identifier prefix:

 
op                      : unary-op|binary-op
unary-op                : opsym
binary-op               : opsym|'and' 'then'|'or' 'else'
opsym                   : unqualified-opsym
                        | qualified-opsym
qualified-opsym         : module-identifier '::'
                          unqualified-opsym
unqualified-opsym       : function-identifier
                        | punctsym {punctsym}
punctsym                : unicode punctuation character

In either case, operator symbols must be declared explicitly before they can be used, cf. Declarations. The declaration determines the precedence and "fixity" of the operator (i.e., whether it acts as a unary prefix or a binary infix operator; see User-Defined Operators, for more information on this).

As already mentioned, the following identifiers are predefined as built-in operators:

 
and       div       mod       not       or

As indicated by the syntax rules, the logical connectives `and' and `or' can also be combined with the keywords `then' and `else' to form the "short-circuit" connectives `and then' and `or else'.

The punctuation symbols predefined as built-in operators are the following:

 
` ' ~ & . || < > = <= >= <> == ++ + - * / ^ ! # $

Most of these can actually be redeclared by programmers for their own purposes. However, there are a few symbols, called "soft delimiters", which play a special role in the syntax of the Q language (some of these are also used as operator symbols):

 
~ . .. : | = == - \ @

The soft delimiters may occur inside user-defined operators, but if they are used as separate lexemes then they are treated like reserved words and thus they may not be declared as operator symbols (except for the purpose of making aliases, cf. Declarations). The same applies to the reserved keywords (see Identifiers and Reserved Words).

Moreover, the following special symbols serve as "hard delimiters" in the Q language which always separate lexemes and thus may not occur inside operator symbols at all:

 
" , ; :: ( ) [ ] { }

The same applies to whitespace and other non-printable characters, as well as the comment delimiters (`//' and `/*' as well as initial `#!' on a script line).

Symbols consisting of punctuation are generally parsed using the "longest possible lexeme" a.k.a. "maximal munch" rule. Here, the "longest possible lexeme" refers to the longest prefix of the input such that the sequence of punctuation characters either forms a valid (i.e., declared) operator symbol, or one of the reserved and special delimiter symbols listed above. Thus, e.g., `..#' will usually be parsed as `.. #', i.e., a reserved `..' symbol followed by a `#' operator. This holds unless the entire sequence `..#' has already been declared as an operator in the scope where it is used.
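
The "maximal munch" rule can be illustrated with a small Python sketch. It is only a model of the rule as stated here, not the interpreter's actual lexer: the function name munch and the DELIMITERS set are introduced for this example, the double quote is left out because it starts a string, and the set of declared operators is passed in explicitly.

```python
# Soft and hard delimiter symbols from this section (double quote omitted).
DELIMITERS = {
    "~", ".", "..", ":", "|", "=", "==", "-", "\\", "@",
    ",", ";", "::", "(", ")", "[", "]", "{", "}",
}


def munch(text, declared_ops):
    """Split a run of punctuation into lexemes, always taking the longest
    prefix that is a declared operator or a delimiter symbol."""
    known = DELIMITERS | set(declared_ops)
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for end in range(len(text), pos, -1):
            if text[pos:end] in known:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no known symbol at position {pos}: {text[pos:]!r}")
    return tokens
```

With only `#' declared, munch("..#", {"#"}) splits the input into `..' and `#'; once `..#' is added to the declared operators, the whole sequence is returned as a single lexeme.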

The only exception to the "maximal munch" rule occurs inside declarations where a new operator symbol is being introduced. In this case the symbol extends up to the next hard delimiter symbol (usually `)' or `;') or whitespace/non-printable character in the input. For instance, the following Q code snippet declares a new binary operator symbol `+~%' (please refer to Declarations, for an explanation of the declaration syntax):

 
public (+~%) X Y;

Another special case arises with conglomerates of three or more consecutive `:' symbols. In this case the initial `::' is always treated as the designation of a qualified (operator) symbol. (In the unlikely case that you ever need to specify a "guarded variable" of the form X : ::Type with a qualified type in the built-in namespace, the space between the `:' token and the following `::' qualification designator is mandatory.)



3.4 Numbers

Signed decimal numeric constants are denoted by sequences of decimal digits (only the standard digits 0..9 in the ASCII character set may be used here) and may contain a decimal point and/or a scaling factor. Integers can also be denoted in octal or hexadecimal, using the same syntax as in C:

 
number                  : ['-'] unsigned-number
unsigned-number         : '0' octdigitseq
                        | '0x' hexdigitseq
                        | '0X' hexdigitseq
                        | digitseq ['.' [digitseq]] [scalefact]
                        | [digitseq] '.' digitseq [scalefact]
digitseq                : digit {digit}
octdigitseq             : octdigit {octdigit}
hexdigitseq             : hexdigit {hexdigit}
scalefact               : 'E' ['-'] digitseq
                        | 'e' ['-'] digitseq
digit                   : '0'|...|'9'
octdigit                : '0'|...|'7'
hexdigit                : '0'|...|'9'|'a'|...|'f'|'A'|...|'F'

Simple digit sequences without decimal point and scaling factor are treated as integers; if the sequence starts with `0' or `0x'/`0X' then it denotes an integer in octal or hexadecimal base, respectively. Other numbers denote (decimal) floating point values. If a decimal point is present, it must be preceded or followed by at least one digit. Both the scaling factor and the number itself may be prefixed with a minus sign. (Syntactically, a minus sign in front of an expression is the unary minus operator, cf. Expressions. If it occurs directly in front of a number, however, it is interpreted as a part of the number, which then denotes a negative value. See the remarks concerning unary minus in Expressions.) Some examples:

 
0  -187326  0.0  -.05  3.1415e3 -1E-10 0177 0xaf -0XFFFF
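
As a cross-check of the grammar, the following Python sketch classifies a numeric lexeme and computes its value. It is an illustration under the rules given above, not the interpreter's own scanner; parse_number is a hypothetical name, and the regular expressions are a direct transcription of the BNF (note that the scaling factor admits only an optional minus sign).

```python
import re

_HEX = re.compile(r"-?0[xX][0-9a-fA-F]+\Z")
_OCT = re.compile(r"-?0[0-7]+\Z")
_DEC = re.compile(r"-?[0-9]+\Z")
_FLOAT = re.compile(
    r"-?(?:[0-9]+(?:\.[0-9]*)?|[0-9]*\.[0-9]+)(?:[eE]-?[0-9]+)?\Z"
)


def parse_number(lexeme):
    """Return an int for integer lexemes and a float for all other numbers."""
    if _HEX.match(lexeme):
        return int(lexeme, 16)      # 0x / 0X prefix: hexadecimal
    if _OCT.match(lexeme):
        return int(lexeme, 8)       # leading 0 plus octal digits: octal
    if _DEC.match(lexeme):
        return int(lexeme, 10)      # plain digit sequence: decimal integer
    if _FLOAT.match(lexeme):
        return float(lexeme)        # decimal point and/or scaling factor
    raise ValueError(f"not a number: {lexeme!r}")
```

Applied to the examples above, 0177 comes out as the integer 127, 0xaf as 175 and -0XFFFF as -65535, while 0.0, -.05, 3.1415e3 and -1E-10 are floating point values.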


3.5 Strings

String constants are written as character sequences enclosed in double quotes:

 
string                  : '"' {char} '"'
char                    : any character but newline and "

To include newlines, double quotes and other (non-printable) characters in a string, the following escape sequences may be used:

 
\n                      newline
\r                      carriage return
\t                      tab
\b                      backspace
\f                      form feed
\"                      double quote
\\                      backslash

Furthermore, a character may also be denoted in the form \N, where N is the character number in decimal, hexadecimal or octal (using the same syntax as for unsigned integer values). Note that the character number may consist of an arbitrary number of digits; the resulting value will be taken modulo the size of the local character set, which is 256 in the case of ASCII-only systems and 0x110000 if the interpreter has been built with Unicode support (see Unicode Support). Optionally, the character number may also be enclosed in parentheses; this makes it possible to specify a string in which a character escape is immediately followed by another character which happens to be a valid digit, as in "\(123)4".

As of version 7.11, the interpreter also supports symbolic character escapes of the form `\&name;', where name is any of the XML single character entity names specified in the "XML Entity definitions for Characters", see http://www.w3.org/TR/xml-entity-names/. Note that, at the time of this writing, this is still a W3C working draft, so the supported entity names may be subject to change until the final specification comes out; the currently supported entities are described in the draft from 14 December 2007, see http://www.w3.org/TR/2007/WD-xml-entity-names-20071214/. Also note that multi-character entities are not supported in this implementation.

A string may also be continued across lines by putting the \ character immediately before the end of the line, which causes the following newline character to be ignored.

As of Q 7.0, it is a syntax error if a character escape is not recognized as either a numeric escape or one of the special escapes listed above.

Some examples:

 
""                      empty string
"A"                     single character string
"\27"                   ASCII escape character (ASCII 27)
"\033"                  same in octal
"\0x1b"                 same in hexadecimal
"\(0x1b)c"              ASCII escape followed by a literal `c' character
"Gr\&auml;f"            German umlaut, using XML entity escape
"a string"              multiple character string
"a \"quoted\" string"   include double quotes
"a line\n"              include newline
"a very \               continue across line end
long line\n"
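
The escape rules can be summarized in a short Python sketch that decodes the text between the quotes of a string literal. This is a model for illustration only: the name decode is made up, the Unicode character set size 0x110000 is assumed, XML entity escapes of the form `\&name;' are not handled, and the way a numeric escape ends (at the first character that no longer fits the chosen base) is an assumption derived from the integer syntax.

```python
import re

SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
          '"': '"', "\\": "\\"}
# Character numbers use the unsigned integer syntax: hex, octal or decimal.
_NUM = re.compile(r"0[xX](?P<hex>[0-9a-fA-F]+)|0(?P<oct>[0-7]+)|(?P<dec>[0-9]+)")
CHARSET_SIZE = 0x110000  # assumed: interpreter built with Unicode support


def _value(m):
    if m.group("hex"):
        return int(m.group("hex"), 16)
    if m.group("oct"):
        return int(m.group("oct"), 8)
    return int(m.group("dec"), 10)


def decode(body):
    """Decode the escapes in the body of a string literal."""
    out, i = [], 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        i += 1                              # skip the backslash
        if i >= len(body):
            raise SyntaxError("dangling backslash")
        nxt = body[i]
        if nxt == "\n":                     # line continuation: drop the newline
            i += 1
        elif nxt in SIMPLE:
            out.append(SIMPLE[nxt])
            i += 1
        elif nxt == "(":                    # parenthesized character number
            m = _NUM.match(body, i + 1)
            if not m or body[m.end():m.end() + 1] != ")":
                raise SyntaxError("malformed numeric escape")
            out.append(chr(_value(m) % CHARSET_SIZE))
            i = m.end() + 1
        else:                               # bare character number
            m = _NUM.match(body, i)
            if not m:
                raise SyntaxError(f"unknown escape \\{nxt}")
            out.append(chr(_value(m) % CHARSET_SIZE))
            i = m.end()
    return "".join(out)
```

In this model the three spellings \27, \033 and \0x1b all decode to the same character, and the parenthesized form \(0x1b)c keeps the trailing `c' out of the character number.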


3.6 Unicode Support

Since version 7.0, the Q interpreter has full Unicode support. This means that identifiers may contain arbitrary letter and digit symbols from the entire Unicode character set. (All non-uppercase alphabetical letters are considered as lowercase letters which may begin a function identifier.) Likewise, operator symbols may consist of arbitrary punctuation characters in the Unicode character set. (We refrain from giving any concrete examples here, to keep this document 7-bit clean.)

Moreover, all string values are internally represented using the UTF-8 encoding, which allows you to represent all characters in the Unicode character set, while retaining backward compatibility with 7-bit ASCII. String literals in the source script will be translated from the system encoding to UTF-8 automatically and transparently, and the \N notation may specify any valid Unicode code point. For this to work, the interpreter must have been built with Unicode support enabled (which is the default).
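
The two points made in this section, the treatment of non-uppercase letters and the ASCII compatibility of UTF-8, can be sketched in Python as follows. The helper name begins_function_identifier is hypothetical, and it is an assumption of this sketch that "uppercase" corresponds to the Unicode general category Lu; the interpreter may draw the line differently. Non-ASCII characters are written as escapes to keep the example 7-bit clean.

```python
import unicodedata


def begins_function_identifier(ch):
    """True if ch is '_' or a letter that is not an uppercase letter.

    Assumption: 'uppercase' means Unicode general category Lu.
    """
    if ch == "_":
        return True
    category = unicodedata.category(ch)
    return category.startswith("L") and category != "Lu"


def utf8_bytes(text):
    """Return the UTF-8 encoding used for the internal string representation."""
    return text.encode("utf-8")
```

Pure ASCII text is unchanged by the encoding, so utf8_bytes("abc") is just the three bytes of "abc", whereas a character outside ASCII such as "\u00e4" occupies more than one byte.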



This document was generated by Albert Gräf on February 23, 2008 using texi2html 1.76.