7.1 Lexical conventions
Blanks
The following characters are considered as blanks: space, horizontal tabulation, carriage return, line feed and form feed. Blanks are ignored, but they separate adjacent identifiers, literals and keywords that would otherwise be confused as one single identifier, literal or keyword.
Comments
Comments are introduced by the two characters (*, with no intervening blanks, and terminated by the characters *), with no intervening blanks. Comments are treated as blank characters. Comments do not occur inside string or character literals. Nested comments are handled correctly.
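The following sketch illustrates the rules above: comments acting as blanks, a nested comment, and comment delimiters that lose their meaning inside a string literal. The names x, y and s are illustrative only.

```ocaml
(* An ordinary comment. *)
let x = 1 (* a comment between tokens acts as a blank *) + 2

(* Outer comment (* nested comment *) still inside the outer comment. *)
let y = 10

(* Inside a string literal, the delimiters are ordinary characters. *)
let s = "(* not a comment *)"
```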
Identifiers
Identifiers are sequences of letters, digits, _ (the underscore character), and ' (the single quote), starting with a letter or an underscore. Letters contain at least the 52 lowercase and uppercase letters from the ASCII set. The current implementation also recognizes as letters some characters from the ISO 8859-1 set (characters 192–214 and 216–222 as uppercase letters; characters 223–246 and 248–255 as lowercase letters). This feature is deprecated and should be avoided for future compatibility.
All characters in an identifier are meaningful. The current implementation accepts identifiers up to 16000000 characters in length.
In many places, OCaml makes a distinction between capitalized identifiers and identifiers that begin with a lowercase letter. The underscore character is considered a lowercase letter for this purpose.
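As a small illustration of these rules, the sketch below uses a prime and an underscore in value names, and capitalized identifiers where the language requires them (constructors and module names). All names are made up for the example.

```ocaml
let x = 1
let x' = x + 1            (* a single quote may appear after the first character *)
let _tmp_2 = x' * 2       (* a leading underscore counts as a lowercase letter *)

(* Constructors and module names must be capitalized. *)
type colour = Red | Green
module Counter = struct let start = 0 end

let describe = function Red -> "red" | Green -> "green"
```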
Integer literals
An integer literal is a sequence of one or more digits, optionally preceded by a minus sign. By default, integer literals are in decimal (radix 10). The following prefixes select a different radix:
Prefix | Radix |
0x, 0X | hexadecimal (radix 16) |
0o, 0O | octal (radix 8) |
0b, 0B | binary (radix 2) |
(The initial 0 is the digit zero; the O for octal is the letter O.) An integer literal can be followed by one of the letters l, L or n to indicate that this integer has type int32, int64 or nativeint respectively, instead of the default type int for integer literals. The interpretation of integer literals that fall outside the range of representable integer values is undefined.
For convenience and readability, underscore characters (_) are accepted (and ignored) within integer literals.
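The radix prefixes, type suffixes and underscores described above can be sketched as follows. The value names are illustrative; the first four bindings all denote the same integer.

```ocaml
let dec = 255
let hex = 0xFF
let oct = 0o377
let bin = 0b1111_1111        (* underscores are ignored *)
let million = 1_000_000

let i32 : int32 = 42l        (* suffix l selects int32 *)
let i64 : int64 = 42L        (* suffix L selects int64 *)
let nat : nativeint = 42n    (* suffix n selects nativeint *)
```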
Floating-point literals
Floating-point decimal literals consist of an integer part, a fractional part and an exponent part. The integer part is a sequence of one or more digits, optionally preceded by a minus sign. The fractional part is a decimal point followed by zero, one or more digits. The exponent part is the character e or E followed by an optional + or - sign, followed by one or more digits. It is interpreted as a power of 10. The fractional part or the exponent part can be omitted but not both, to avoid ambiguity with integer literals. The interpretation of floating-point literals that fall outside the range of representable floating-point values is undefined.
Floating-point hexadecimal literals are denoted with the 0x or 0X prefix. The syntax is similar to that of floating-point decimal literals, with the following differences. The integer part and the fractional part use hexadecimal digits. The exponent part starts with the character p or P. It is written in decimal and interpreted as a power of 2.
For convenience and readability, underscore characters (_) are accepted (and ignored) within floating-point literals.
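A brief sketch of the decimal and hexadecimal forms described above. The value names are illustrative; a, c, d and e are chosen so that they all denote 1.5.

```ocaml
let a = 1.5
let b = 1.            (* fractional part with zero digits *)
let c = 15e-1         (* exponent part only, no fractional part *)
let d = 0x1.8p0       (* hexadecimal: (1 + 8/16) times 2 to the power 0 *)
let e = 0x3p-1        (* hexadecimal: 3 times 2 to the power -1 *)
let f = 1_000.000_1   (* underscores are ignored *)
```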
Character literals
Character literals are delimited by ' (single quote) characters. The two single quotes enclose either one character different from ' and \, or one of the escape sequences below:
Sequence | Character denoted |
\\ | backslash (\) |
\" | double quote (") |
\' | single quote (') |
\n | linefeed (LF) |
\r | carriage return (CR) |
\t | horizontal tabulation (TAB) |
\b | backspace (BS) |
\space | space (SPC) |
\ddd | the character with ASCII code ddd in decimal |
\xhh | the character with ASCII code hh in hexadecimal |
\oooo | the character with ASCII code ooo in octal |
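The escape sequences in the table can be checked against their character codes with Char.code from the standard library. This is a minimal sketch; the value names are illustrative.

```ocaml
let newline   = '\n'
let backslash = '\\'
let quote     = '\''
let space     = '\ '
let a_dec     = '\065'    (* decimal code 65 *)
let a_hex     = '\x41'    (* hexadecimal code 41 *)
let a_oct     = '\o101'   (* octal code 101 *)
```

The last three literals all denote the same character as 'A'.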
String literals
String literals are delimited by " (double quote) characters. The two double quotes enclose a sequence of either characters different from " and \, or escape sequences from the table given above for character literals, or a Unicode character escape sequence.
A Unicode character escape sequence is substituted by the UTF-8 encoding of the specified Unicode scalar value. The Unicode scalar value, an integer in the ranges 0x0000…0xD7FF or 0xE000…0x10FFFF, is defined using 1 to 6 hexadecimal digits; leading zeros are allowed.
To allow splitting long string literals across lines, the sequence \newline spaces-or-tabs (a backslash at the end of a line followed by any number of spaces and horizontal tabulations at the beginning of the next line) is ignored inside string literals.
Quoted string literals provide an alternative lexical syntax for string literals. They are useful to represent strings of arbitrary content without escaping. Quoted strings are delimited by a matching pair of {quoted-string-id| and |quoted-string-id} with the same quoted-string-id on both sides. Quoted strings do not interpret any character in a special way but require that the sequence |quoted-string-id} does not occur in the string itself. The identifier quoted-string-id is a (possibly empty) sequence of lowercase letters and underscores that can be freely chosen to avoid such issues (e.g. {|hello|}, {ext|hello {|world|}|ext}, …).
The current implementation places practically no restrictions on the length of string literals.
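The sketch below illustrates a Unicode escape, a line continuation and two quoted strings. It assumes the \u{…} spelling of the Unicode escape, which comes from the OCaml grammar and is not spelled out in the prose above. The value names are illustrative.

```ocaml
(* U+20AC is replaced by its three-byte UTF-8 encoding. *)
let euro = "\u{20AC}"

(* The backslash, the newline and the leading blanks of the next line are ignored. *)
let joined = "first part, \
              second part"

(* Quoted strings keep backslashes and double quotes as they are. *)
let raw = {|a\nb "c"|}

(* A non-empty identifier lets the string contain the default closing delimiter. *)
let nested = {ext|hello {|world|}|ext}
```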
Naming labels
To avoid ambiguities, naming labels in expressions cannot just be defined syntactically as the sequence of the three tokens ~, ident and :, and have to be defined at the lexical level.
Naming labels come in two flavours: label for normal arguments and optlabel for optional ones. They are simply distinguished by their first character, either ~ or ?.
Despite label and optlabel being lexical entities in expressions, their expansions ~label-name: and ?label-name: will be used in grammars, for the sake of readability. Note also that inside type expressions, this expansion can be taken literally, i.e. there are really 3 tokens, with optional blanks between them.
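A minimal sketch of both flavours in expressions, plus a label written with blanks inside a type expression. The functions range and succ_first are hypothetical helpers defined only for this illustration.

```ocaml
(* ?step is an optional argument; ~first and ~last are normal labelled ones. *)
let range ?(step = 1) ~first ~last () =
  let rec go i acc =
    if i > last then List.rev acc else go (i + step) (i :: acc)
  in
  go first []

let a = range ~first:1 ~last:5 ()
let b = range ~step:2 ~first:1 ~last:5 ()

(* In a type expression the label is made of separate tokens,
   so blanks around the colon are accepted. *)
let succ_first : first : int -> int = fun ~first -> first + 1
```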
Prefix and infix symbols
See also the following language extensions: extension operators and extended indexing operators.
Sequences of “operator characters”, such as <=> or !!, are read as a single token from the infix-symbol or prefix-symbol class. These symbols are parsed as prefix and infix operators inside expressions, but otherwise behave like normal identifiers.
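As an illustration, the two symbols mentioned above can be bound like ordinary identifiers by wrapping them in parentheses. The definitions chosen here (three-way comparison and dereference) are arbitrary assumptions for the example.

```ocaml
let ( <=> ) a b = compare a b   (* infix symbol *)
let ( !! ) r = !r               (* prefix symbol *)

let three_way = 1 <=> 2
let cell = ref 41
let v = !!cell + 1

(* In parentheses, an operator is passed around like any other identifier. *)
let values = List.map ( !! ) [ref 1; ref 2]
```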
Keywords
The identifiers below are reserved as keywords, and cannot be employed otherwise:
- and as assert asr begin class
- constraint do done downto else end
- exception external false for fun function
- functor if in include inherit initializer
- land lazy let lor lsl lsr
- lxor match method mod module mutable
- new nonrec object of open or
- private rec sig struct then to
- true try type val virtual when
- while with
The following character sequences are also keywords:
- != # & && ' ( ) * + , -
- -. -> . .. .~ : :: := :> ; ;;
- < <- = > >] >} ? [ [< [> [|
- ] _ ` { {< | |] || } ~
Note that the following identifiers are keywords of the Camlp4 extensions and should be avoided for compatibility reasons.
- parser value $ $$ $: <: << >> ??
Ambiguities
Lexical ambiguities are resolved according to the “longest match” rule: when a character sequence can be decomposed into two tokens in several different ways, the decomposition retained is the one with the longest first token.
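The longest-match rule can be observed with two adjacent minus signs. In the sketch below, -- is a hypothetical operator defined only to make the difference visible: without blanks the two characters form one infix symbol, while a blank between them yields two separate minus tokens.

```ocaml
(* Hypothetical operator, defined only for this demonstration. *)
let ( -- ) a b = a * b

let one_token = 5--1       (* read as 5 -- 1, i.e. 5 * 1 *)
let two_tokens = 5 - -1    (* read as 5 minus (minus 1) *)
```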
Line number directives
Preprocessors that generate OCaml source code can insert line number directives in their output so that error messages produced by the compiler contain line numbers and file names referring to the source file before preprocessing, instead of after preprocessing. A line number directive is composed of a # (sharp sign), followed by a positive integer (the source line number), optionally followed by a character string (the source file name). Line number directives are treated as blanks during lexical analysis.
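A minimal sketch of what preprocessor output might look like. The file name original.src and the line number 100 are made up for the example. The standard __LINE__ and __FILE__ values are used here to observe the location the compiler records for the line that follows the directive. The directive must start at the beginning of a line.

```ocaml
(* Hypothetical preprocessor output: the line after the directive is
   treated as line 100 of original.src. *)
# 100 "original.src"
let reported_line = __LINE__
let reported_file = __FILE__
```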