re2go(1) re2go(1)
NAME
re2go - generate fast lexical analyzers for Go
SYNOPSIS
re2go [ OPTIONS ] [ WARNINGS ] INPUT
Input can be either a file or - for stdin.
INTRODUCTION
re2go works as a preprocessor. It reads the input file (which is
usually a program in Go, but can be anything) and looks for blocks of
code enclosed in special-form start/end markers. The text outside of
these blocks is copied verbatim into the output file. The contents of
the blocks are processed by re2go. It translates them to code in Go and
outputs the generated code in place of the block.
Here is an example of a small program that checks whether a given
NUL-terminated string starts with a decimal number:
    //go:generate re2go $INPUT -o $OUTPUT -i --api simple
    package main

    func lex(yyinput string) {
        var yycursor int
        /*!re2c
            re2c:YYCTYPE = byte;
            re2c:yyfill:enable = 0;

            [1-9][0-9]* { return }
            *           { panic("error!") }
        */
    }

    func main() {
        lex("1234\x00")
    }
In the output re2go replaced the block in the middle with the generated
code:
    // Code generated by re2go, DO NOT EDIT.
    //go:generate re2go $INPUT -o $OUTPUT -i --api simple
    package main

    func lex(yyinput string) {
        var yycursor int
        {
            var yych byte
            yych = yyinput[yycursor]
            switch (yych) {
            case '1','2','3','4','5','6','7','8','9':
                goto yy2
            default:
                goto yy1
            }
        yy1:
            yycursor += 1
            { panic("error!") }
        yy2:
            yycursor += 1
            yych = yyinput[yycursor]
            switch (yych) {
            case '0','1','2','3','4','5','6','7','8','9':
                goto yy2
            default:
                goto yy3
            }
        yy3:
            { return }
        }
    }

    func main() {
        lex("1234\x00")
    }
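The control flow of the generated code can be tried without running
re2go. The following is a hand-adapted sketch of the state machine shown
above, not actual re2go output: the semantic actions return a bool
instead of returning nothing or panicking, and startsWithDecimal is a
hypothetical helper that appends the NUL sentinel. The sentinel is
needed because re2c:yyfill:enable = 0 removes bounds checks from the
lexer. Note that the lexer only looks at the beginning of the input.

```go
package main

import "fmt"

// lex is a hand-adapted copy of the generated state machine. It reports
// which rule matched instead of executing the original semantic actions.
// yyinput must end with a NUL sentinel, otherwise indexing may go out of range.
func lex(yyinput string) bool {
	var yycursor int
	var yych byte
	yych = yyinput[yycursor]
	switch yych {
	case '1', '2', '3', '4', '5', '6', '7', '8', '9':
		goto yy2
	default:
		goto yy1
	}
yy1:
	return false // default rule *
yy2:
	yycursor += 1
	yych = yyinput[yycursor]
	switch yych {
	case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
		goto yy2
	default:
		goto yy3
	}
yy3:
	return true // rule [1-9][0-9]*
}

// startsWithDecimal (hypothetical helper) appends the sentinel for the caller.
func startsWithDecimal(s string) bool {
	return lex(s + "\x00")
}

func main() {
	for _, s := range []string{"1234", "12ab", "0", ""} {
		fmt.Printf("%q -> %v\n", s, startsWithDecimal(s))
	}
}
```

The input "12ab" is accepted because the rule matches the prefix "12";
a lexer that must reject trailing garbage would need an additional rule
or a check on the cursor position after the match.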
BASICS
A re2go program consists of a sequence of blocks intermixed with code
in the target language. A block may contain definitions,
configurations, rules, actions and directives in any order:
name = regular-expression ;
A definition binds name to regular-expression. Names may contain
alphanumeric characters and underscore. The regular expressions
section gives an overview of re2go syntax for regular
expressions. Once defined, the name can be used in other regular
expressions and in rules. Recursion in named definitions is not
allowed, and each name should be defined before it is used. A
block inherits named definitions from the global scope.
Redefining a name that exists in the current scope is an error.
configuration = value ;
A configuration allows one to change re2go behavior and
customize the generated code. For a full list of configurations
supported by re2go see the configurations section. Depending on
a particular configuration, the value can be a keyword, a
nonnegative integer number or a one-line string which should be
enclosed in double or single quotes unless it consists of
alphanumeric characters. A block inherits configurations from
the global scope and may redefine them or add new ones.
Configurations defined inside of a block affect the whole block,
even if they appear at the end of it.
regular-expression code
A rule binds regular-expression to its semantic action (a block
of code in curly braces, or a block of code that starts with :=
and ends on a newline followed by any non-whitespace character).
If the regular-expression matches, the associated code is
executed. If multiple rules match, the longest match takes
precedence. If multiple rules match the same string, the
earliest one takes precedence. There are two special rules: the
default rule * and the end of input rule $. The default rule should
always be defined; it has the lowest priority regardless of its
place in the block, and it matches any code unit (not
necessarily a valid character, see the encoding support
section). The end of input rule should be defined if the
corresponding method for handling the end of input is used.
With start conditions rules have more complex syntax.
!action code
An action binds a user-defined block of code to a particular
place in the generated finite state machine (in the same way as
semantic actions bind code to the final states). See the actions
section for a full list of predefined actions.
!directive ;
A directive is one of the special predefined statements. Each
directive has a unique purpose. See the directives section for
details.
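The rule selection policy described above (longest match first, then the
earliest rule, with the default rule * as the last resort) can be
modelled at run time. The sketch below is only an illustration built on
Go's standard regexp package; re2go itself resolves these priorities
when it compiles the rules into a single automaton. The names rule,
newRule and choose are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a rule name with a regular expression anchored at the
// current input position.
type rule struct {
	name string
	re   *regexp.Regexp
}

// newRule compiles pattern so that it matches only at the start of the
// input and prefers the longest possible match.
func newRule(name, pattern string) rule {
	re := regexp.MustCompile(`^(?:` + pattern + `)`)
	re.Longest()
	return rule{name: name, re: re}
}

// choose returns the name of the winning rule and the length of its match.
// Rules are listed in the order in which they appear in the block.
func choose(rules []rule, input string) (string, int) {
	winner, length := "*", 0
	for _, r := range rules {
		loc := r.re.FindStringIndex(input)
		// The strict comparison keeps the earliest rule on equal lengths.
		if loc != nil && loc[1] > length {
			winner, length = r.name, loc[1]
		}
	}
	if winner == "*" && len(input) > 0 {
		length = 1 // the default rule consumes a single code unit
	}
	return winner, length
}

func main() {
	rules := []rule{newRule("keyword", "if"), newRule("ident", "[a-z]+")}
	for _, in := range []string{"if(", "iffy", "42"} {
		name, n := choose(rules, in)
		fmt.Printf("%q -> rule %s, %d code unit(s)\n", in, name, n)
	}
}
```

For "if(" both rules match two code units and the earlier rule wins, for
"iffy" the longer match wins, and for "42" only the default rule
applies. Rules that match the empty string are not modelled here.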
Blocks
Block start and end markers are either /*!re2c and */, or %{ and %}
(both styles are supported). Starting from version 2.2 blocks may have
optional names that allow them to be referenced in other blocks. There
are different kinds of blocks:
/*!re2c[:<name>] ... */ or %{[:<name>] ... %}
A global block contains definitions, configurations, rules and
directives. re2go compiles regular expressions associated with
each rule into a deterministic finite automaton, encodes it in
the form of conditional jumps in the target language and
replaces the block with the generated code. Names and
configurations defined in a global block are added to the global
scope and become visible to subsequent blocks. At the start of
the program the global scope is initialized with command-line
options.
/*!local:re2c[:<name>] ... */ or %{local[:<name>] ... %}
A local block is like a global block, but the names and
configurations in it have local scope (they do not affect other
blocks).
/*!rules:re2c[:<name>] ... */ or %{rules[:<name>] ... %}
A rules block is like a local block, but it does not generate
any code by itself, nor does it add any definitions to the
global scope -- it is meant to be reused in other blocks. This
is a way of sharing code (more details in the reusable blocks
section). Prior to re2go version 2.2 rules blocks required -r
--reusable option.
/*!use:re2c[:<name>] ... */ or %{use[:<name>] ... %}
A use block references a previously defined rules block. If
the name is specified, re2go looks for a rules block with this
name. Otherwise the most recent rules block is used (either a
named or an unnamed one). A use block can add definitions,
configurations and rules of its own, which are added to those of
the referenced rules block. Prior to re2go version 2.2 use
blocks required -r --reusable option.
/*!max:re2c[:<name1>[:<name2>...]] ... */ or
%{max[:<name1>[:<name2>...]] ... %}
A block that generates YYMAXFILL definition. An optional list of
block names specifies which blocks should be included when
computing YYMAXFILL value (if the list is empty, all blocks are
included). By default the generated code is a macro-definition
for C (#define YYMAXFILL <n>), or a global variable for Go (var
YYMAXFILL int = <n>). It can be customized with an optional
configuration format that specifies a template string where
@@{max} (or @@ for short) is replaced with the numeric value of
YYMAXFILL.
/*!maxnmatch:re2c[:<name1>[:<name2>...]] ... */ or
%{maxnmatch[:<name1>[:<name2>...]] ... %}
A block that generates YYMAXNMATCH definition (it requires -P
--posix-captures option). An optional list of block names
specifies which blocks should be included when computing
YYMAXNMATCH value (if the list is empty, all blocks are
included). By default the generated code is a macro-definition
for C (#define YYMAXNMATCH <n>), or a global variable for Go
(var YYMAXNMATCH int = <n>). It can be customized with an
optional configuration format that specifies a template string
where @@{max} (or @@ for short) is replaced with the numeric
value of YYMAXNMATCH.
/*!stags:re2c[:<name1>[:<name2>...]] ... */,
/*!mtags:re2c[:<name1>[:<name2>...]] ... */ or
%{stags[:<name1>[:<name2>...]] ... %}, %{mtags[:<name1>[:<name2>...]] ... %}
Blocks that specify a template piece of code that is expanded
for each s-tag/m-tag variable generated by re2go. An optional
list of block names specifies which blocks should be included
when computing the set of tag variables (if the list is empty,
all blocks are included). There are two optional
configurations: format and separator. Configuration format
specifies a template string where @@{tag} (or @@ for short) is
replaced with the name of each tag variable. Configuration
separator specifies a piece of code used to join the generated
format pieces for different tag variables.
/*!svars:re2c[:<name1>[:<name2>...]] ... */,
/*!mvars:re2c[:<name1>[:<name2>...]] ... */ or
%{svars[:<name1>[:<name2>...]] ... %}, %{mvars[:<name1>[:<name2>...]] ... %}
Blocks that specify a template piece of code that is expanded
for each s-tag/m-tag that is either explicitly mentioned by the
rules (with --tags option) or implicitly generated by re2go
(with --captvars or --posix-captvars options). An optional list
of block names specifies which blocks should be included when
computing the set of tags (if the list is empty, all blocks are
included). There are two optional configurations: format and
separator. Configuration format specifies a template string
where @@{tag} (or @@ for short) is replaced with the name of
each tag. Configuration separator specifies a piece of code
used to join the generated format pieces for different tags.
/*!getstate:re2c[:<name1>[:<name2>...]] ... */ or
%{getstate[:<name1>[:<name2>...]] ... %}
A block that generates conditional dispatch on the lexer state
(it requires --storable-state option). An optional list of block
names specifies which blocks should be included in the state
dispatch. The default transition goes to the start label of the
first block on the list. If the list is empty, all blocks are
included, and the default transition goes to the first block in
the file that has a start label. This block type is
incompatible with the --loop-switch option, as it requires
cross-block transitions that are unsupported without goto or
function calls.
/*!conditions:re2c[:<name1>[:<name2>...]] ... */, /*!types:re2c... */
or %{conditions[:<name1>[:<name2>...]] ... %}, %{types... %}
A block that generates condition enumeration (it requires
--conditions option). An optional list of block names specifies
which blocks should be included when computing the set of
conditions (if the list is empty, all blocks are included). By
default the generated code is an enumeration YYCONDTYPE. It can
be customized with optional configurations format and separator.
Configuration format specifies a template string where @@{cond}
(or @@ for short) is replaced with the name of each condition,
and @@{num} is replaced with a numeric index of that condition.
Configuration separator specifies a piece of code used to join
the generated format pieces for different conditions.
/*!include:re2c <file> */ or %{include <file> %}
This block allows one to include <file>, which must be a
double-quoted file path. The contents of the file are literally
substituted in place of the block, in the same way as #include
works in C/C++. This block can be used together with the
--depfile option to generate build system dependencies on the
included files.
/*!header:re2c:on*/ or %{header:on %}
This block marks the start of header file. Everything after it
and up to the following header:off block is processed by re2go
and written to the header file specified with -t --type-header
option.
/*!header:re2c:off*/ or %{header:off %}
This block marks the end of the header file started with a
header:on block.
/*!ignore:re2c ... */ or %{ignore ... %}
A block whose contents are ignored and removed from the output
file.
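The format and separator configurations of the stags, mtags, svars and
mvars blocks combine in a simple way: the format template is expanded
once per tag, and the pieces are joined with the separator. The sketch
below models that behavior in plain Go; expandTags is a hypothetical
function written for illustration and is not part of re2go, and the tag
names are examples of the default yyt<N> form.

```go
package main

import (
	"fmt"
	"strings"
)

// expandTags substitutes every tag name into format and joins the
// resulting pieces with separator.
func expandTags(format, separator string, tags []string) string {
	pieces := make([]string, 0, len(tags))
	for _, tag := range tags {
		piece := strings.ReplaceAll(format, "@@{tag}", tag)
		piece = strings.ReplaceAll(piece, "@@", tag) // shorthand form
		pieces = append(pieces, piece)
	}
	return strings.Join(pieces, separator)
}

func main() {
	tags := []string{"yyt1", "yyt2"}
	fmt.Println(expandTags("var @@ int", "\n", tags))
	fmt.Println(expandTags("@@{tag}", ", ", tags))
}
```

With the format "var @@ int" and a newline separator this yields one Go
variable declaration per tag, which is a typical use of such a block.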
Configurations
Here is a full list of configurations supported by re2go:
re2c:api, re2c:input
Same as the --api option.
re2c:api:sigil
Specify the marker ("sigil") that is used for argument
placeholders in the API primitives. The default is @@. A
placeholder starts with sigil followed by the argument name in
curly braces. For example, if sigil is set to $, then
placeholders will have the form ${name}. Single-argument APIs
may use shorthand notation without the name in braces. This
option can be overridden by options for individual API
primitives, e.g. re2c:YYFILL@len for YYFILL.
re2c:api:style
Specify API style. Possible values are functions (the default
for C) and free-form (the default for Go and Rust). In
functions style API primitives are generated with an argument
list in parentheses following the name of the primitive. The
arguments are provided only for autogenerated parameters (such
as the number of characters passed to YYFILL), but not for the
general lexer context, so the primitives behave more like macros
in C/C++ or closures in Go and Rust. In free-form style API
primitives do not have a fixed form: they should be defined as
strings containing free-form pieces of code with interpolated
variables of the form @@{var} or @@ (they correspond to
arguments in function-like style). This configuration may be
overridden for individual API primitives, see for example
re2c:YYFILL:naked configuration for YYFILL.
re2c:bit-vectors, re2c:flags:bit-vectors, re2c:flags:b
Same as the --bit-vectors option, but can be configured on
per-block basis.
re2c:captures, re2c:leftmost-captures
Same as the --leftmost-captures option, but can be configured on
per-block basis.
re2c:captvars, re2c:leftmost-captvars
Same as the --leftmost-captvars option, but can be configured on
per-block basis.
re2c:case-insensitive, re2c:flags:case-insensitive
Same as the --case-insensitive option, but can be configured on
per-block basis.
re2c:case-inverted, re2c:flags:case-inverted
Same as the --case-inverted option, but can be configured on
per-block basis.
re2c:case-ranges, re2c:flags:case-ranges
Same as the --case-ranges option, but can be configured on
per-block basis.
re2c:computed-gotos, re2c:flags:computed-gotos, re2c:flags:g
Same as the --computed-gotos option, but can be configured on
per-block basis.
re2c:computed-gotos:relative, re2c:cgoto:relative
Same as the --computed-gotos-relative option, but can be
configured on per-block basis.
re2c:computed-gotos:threshold, re2c:cgoto:threshold
If computed goto is used, this configuration specifies the
complexity threshold that triggers the generation of jump tables
instead of nested if statements and bitmaps. The default value
is 9.
re2c:cond:abort
If set to a positive integer value, the default case in the
generated condition dispatch aborts program execution.
re2c:cond:goto
Specifies a piece of code used for the autogenerated shortcut
rules :=> in conditions. The default is goto @@;. The @@
placeholder is substituted with condition name (see
configurations re2c:api:sigil and re2c:cond:goto@cond).
re2c:cond:goto@cond
Specifies the sigil used for argument substitution in
re2c:cond:goto definition. The default value is @@. Overrides
the more generic re2c:api:sigil configuration.
re2c:cond:divider
Defines the divider for condition blocks. The default value is
/* *********************************** */. Placeholders are
substituted with condition name (see re2c:api:sigil and
re2c:cond:divider@cond).
re2c:cond:divider@cond
Specifies the sigil used for argument substitution in
re2c:cond:divider definition. The default is @@. Overrides the
more generic re2c:api:sigil configuration.
re2c:cond:prefix, re2c:condprefix
Specifies the prefix used for condition labels. The default is
yyc_.
re2c:cond:enumprefix, re2c:condenumprefix
Specifies the prefix used for condition identifiers. The
default is yyc.
re2c:debug-output, re2c:flags:debug-output, re2c:flags:d
Same as the --debug-output option, but can be configured on
per-block basis.
re2c:empty-class, re2c:flags:empty-class
Same as the --empty-class option, but can be configured on
per-block basis.
re2c:encoding:ebcdic, re2c:flags:ecb, re2c:flags:e
Same as the --ebcdic option, but can be configured on per-block
basis.
re2c:encoding:ucs2, re2c:flags:wide-chars, re2c:flags:w
Same as the --ucs2 option, but can be configured on per-block
basis.
re2c:encoding:utf8, re2c:flags:utf-8, re2c:flags:8
Same as the --utf8 option, but can be configured on per-block
basis.
re2c:encoding:utf16, re2c:flags:utf-16, re2c:flags:x
Same as the --utf16 option, but can be configured on per-block
basis.
re2c:encoding:utf32, re2c:flags:unicode, re2c:flags:u
Same as the --utf32 option, but can be configured on per-block
basis.
re2c:encoding-policy, re2c:flags:encoding-policy
Same as the --encoding-policy option, but can be configured on
per-block basis.
re2c:eof
Specifies the sentinel symbol used with the end-of-input rule $.
The default value is -1 ($ rule is not used). Other possible
values include all valid code units. Only decimal numbers are
recognized.
re2c:header, re2c:flags:type-header, re2c:flags:t
Specifies the name of the generated header file relative to the
directory of the output file. Same as the --header option except
that the file path is relative.
re2c:indent:string
Specifies the string used for indentation. The default is a
single tab character "\t". Indent string should contain
whitespace characters only. To disable indentation entirely,
set this configuration to an empty string.
re2c:indent:top
Specifies the minimum amount of indentation to use. The default
value is zero. The value should be a non-negative integer
number.
re2c:invert-captures
Same as the --invert-captures option, but can be configured on
per-block basis.
re2c:label:prefix, re2c:labelprefix
Specifies the prefix used for DFA state labels. The default is
yy.
re2c:label:start, re2c:startlabel
Controls the generation of a block start label. The default
value is zero, which means that the start label is generated
only if it is used. An integer value greater than zero forces
the generation of start label even if it is unused by the lexer.
A string value also forces start label generation and sets the
label name to the specified string. This configuration applies
only to the current block (it is reset to default for the next
block).
re2c:label:yyFillLabel
Specifies the prefix of YYFILL labels used with re2c:eof and in
storable state mode.
re2c:label:yyloop
Specifies the name of the label marking the start of the lexer
loop with --loop-switch option. The default is yyloop.
re2c:label:yyNext
Specifies the name of the optional label that follows YYGETSTATE
switch in storable state mode (enabled with
re2c:state:nextlabel). The default is yyNext.
re2c:lookahead, re2c:flags:lookahead
Deprecated (see the deprecated --no-lookahead option).
re2c:monadic
If set to non-zero, the generated lexer will use monadic
notation (this configuration is specific to Haskell).
re2c:nested-ifs, re2c:flags:nested-ifs, re2c:flags:s
Same as the --nested-ifs option, but can be configured on
per-block basis.
re2c:posix-captures, re2c:flags:posix-captures, re2c:flags:P
Same as the --posix-captures option, but can be configured on
per-block basis.
re2c:posix-captvars
Same as the --posix-captvars option, but can be configured on
per-block basis.
re2c:tags, re2c:flags:tags, re2c:flags:T
Same as the --tags option, but can be configured on per-block
basis.
re2c:tags:expression
Specifies the expression used for tag variables. By default
re2go generates expressions of the form yyt<N>. This might be
inconvenient, for example if tag variables are defined as fields
in a struct. All occurrences of @@{tag} or @@ are replaced with
the actual tag name. For example, re2c:tags:expression = "s.@@";
results in expressions of the form s.yyt<N> in the generated
code. See also re2c:api:sigil configuration.
re2c:tags:negative
Specifies the constant expression that is used for negative tag
value (typically this would be -1 if tags are integer offsets in
the input string, or null pointer if they are pointers).
re2c:tags:prefix
Specifies the prefix for tag variable names. The default is yyt.
re2c:sentinel
Specifies the sentinel symbol used for the end-of-input checks
(when bounds checks are disabled with re2c:yyfill:enable = 0;
and re2c:eof is not set). This configuration does not affect
code generation: its purpose is to verify that the sentinel is
not allowed in the middle of a rule, and ensure that the lexer
won't read past the end of buffer. The default value is -1 (in
that case re2go assumes that the sentinel is zero, which is the
most common case). Only decimal numbers are recognized.
re2c:state:abort
If set to a positive integer value, the default case in the
generated state dispatch aborts program execution, and an
explicit -1 case contains transition to the start of the block.
re2c:state:nextlabel
Controls if the YYGETSTATE switch is followed by a yyNext label
(the default value is zero, which corresponds to no label).
Alternatively one can use re2c:label:start to generate a
specific start label, or an explicit getstate block to generate
the YYGETSTATE switch separately from the lexer block.
re2c:unsafe, re2c:flags:unsafe
Same as the --no-unsafe option, but can be configured on
per-block basis. If set to zero, it suppresses the generation
of unsafe wrappers around YYPEEK. The default is non-zero
(wrappers are generated). This configuration is specific to
Rust.
re2c:YYBACKUP, re2c:define:YYBACKUP
Defines generic API primitive YYBACKUP.
re2c:YYBACKUPCTX, re2c:define:YYBACKUPCTX
Defines generic API primitive YYBACKUPCTX.
re2c:YYCONDTYPE, re2c:define:YYCONDTYPE
Defines API primitive YYCONDTYPE.
re2c:YYCTYPE, re2c:define:YYCTYPE
Defines API primitive YYCTYPE.
re2c:YYCTXMARKER, re2c:define:YYCTXMARKER
Defines API primitive YYCTXMARKER.
re2c:YYCURSOR, re2c:define:YYCURSOR
Defines API primitive YYCURSOR.
re2c:YYDEBUG, re2c:define:YYDEBUG
Defines API primitive YYDEBUG.
re2c:YYFILL, re2c:define:YYFILL
Defines API primitive YYFILL.
re2c:YYFILL@len, re2c:define:YYFILL@len
Specifies the sigil used for argument substitution in YYFILL
definition. Defaults to @@. Overrides the more generic
re2c:api:sigil configuration.
re2c:YYFILL:naked, re2c:define:YYFILL:naked
Overrides the more generic re2c:api:style configuration for
YYFILL. Zero value corresponds to free-form API style.
re2c:YYFN
Defines API primitive YYFN.
re2c:YYINPUT
Defines API primitive YYINPUT.
re2c:YYGETCOND, re2c:define:YYGETCONDITION
Defines API primitive YYGETCOND.
re2c:YYGETCOND:naked, re2c:define:YYGETCONDITION:naked
Overrides the more generic re2c:api:style configuration for
YYGETCOND. Zero value corresponds to free-form API style.
re2c:YYGETSTATE, re2c:define:YYGETSTATE
Defines API primitive YYGETSTATE.
re2c:YYGETSTATE:naked, re2c:define:YYGETSTATE:naked
Overrides the more generic re2c:api:style configuration for
YYGETSTATE. Zero value corresponds to free-form API style.
re2c:YYGETACCEPT, re2c:define:YYGETACCEPT
Defines API primitive YYGETACCEPT.
re2c:YYLESSTHAN, re2c:define:YYLESSTHAN
Defines generic API primitive YYLESSTHAN.
re2c:YYLIMIT, re2c:define:YYLIMIT
Defines API primitive YYLIMIT.
re2c:YYMARKER, re2c:define:YYMARKER
Defines API primitive YYMARKER.
re2c:YYMTAGN, re2c:define:YYMTAGN
Defines generic API primitive YYMTAGN.
re2c:YYMTAGP, re2c:define:YYMTAGP
Defines generic API primitive YYMTAGP.
re2c:YYPEEK, re2c:define:YYPEEK
Defines generic API primitive YYPEEK.
re2c:YYRESTORE, re2c:define:YYRESTORE
Defines generic API primitive YYRESTORE.
re2c:YYRESTORECTX, re2c:define:YYRESTORECTX
Defines generic API primitive YYRESTORECTX.
re2c:YYRESTORETAG, re2c:define:YYRESTORETAG
Defines generic API primitive YYRESTORETAG.
re2c:YYSETCOND, re2c:define:YYSETCONDITION
Defines API primitive YYSETCOND.
re2c:YYSETCOND@cond, re2c:define:YYSETCONDITION@cond
Specifies the sigil used for argument substitution in YYSETCOND
definition. The default value is @@. Overrides the more generic
re2c:api:sigil configuration.
re2c:YYSETCOND:naked, re2c:define:YYSETCONDITION:naked
Overrides the more generic re2c:api:style configuration for
YYSETCOND. Zero value corresponds to free-form API style.
re2c:YYSETSTATE, re2c:define:YYSETSTATE
Defines API primitive YYSETSTATE.
re2c:YYSETSTATE@state, re2c:define:YYSETSTATE@state
Specifies the sigil used for argument substitution in YYSETSTATE
definition. The default value is @@. Overrides the more generic
re2c:api:sigil configuration.
re2c:YYSETSTATE:naked, re2c:define:YYSETSTATE:naked
Overrides the more generic re2c:api:style configuration for
YYSETSTATE. Zero value corresponds to free-form API style.
re2c:YYSETACCEPT, re2c:define:YYSETACCEPT
Defines API primitive YYSETACCEPT.
re2c:YYSKIP, re2c:define:YYSKIP
Defines generic API primitive YYSKIP.
re2c:YYSHIFT, re2c:define:YYSHIFT
Defines generic API primitive YYSHIFT.
re2c:YYCOPYMTAG, re2c:define:YYCOPYMTAG
Defines generic API primitive YYCOPYMTAG.
re2c:YYCOPYSTAG, re2c:define:YYCOPYSTAG
Defines generic API primitive YYCOPYSTAG.
re2c:YYSHIFTMTAG, re2c:define:YYSHIFTMTAG
Defines generic API primitive YYSHIFTMTAG.
re2c:YYSHIFTSTAG, re2c:define:YYSHIFTSTAG
Defines generic API primitive YYSHIFTSTAG.
re2c:YYSTAGN, re2c:define:YYSTAGN
Defines generic API primitive YYSTAGN.
re2c:YYSTAGP, re2c:define:YYSTAGP
Defines generic API primitive YYSTAGP.
re2c:yyaccept, re2c:variable:yyaccept
Defines API primitive yyaccept.
re2c:yybm, re2c:variable:yybm
Defines API primitive yybm.
re2c:yybm:hex, re2c:variable:yybm:hex
If set to nonzero, bitmaps for the --bit-vectors option are
generated in hexadecimal format. The default is zero (bitmaps
are in decimal format).
re2c:yych, re2c:variable:yych
Defines API primitive yych.
re2c:yych:emit, re2c:variable:yych:emit
If set to zero, yych definition is not generated. The default
is non-zero.
re2c:yych:conversion, re2c:variable:yych:conversion
If set to non-zero, re2go automatically generates a conversion
to YYCTYPE every time yych is read. The default is zero (no
conversion).
re2c:yych:literals, re2c:variable:yych:literals
Specifies the form of literals that yych is matched against.
Possible values are: char (character literals in single quotes,
non-printable ones use escape sequences that start with
backslash), hex (hexadecimal integers) and char_or_hex (a
mixture of both, character literals for printable characters and
hexadecimal integers for others).
re2c:yyctable, re2c:variable:yyctable
Defines API primitive yyctable.
re2c:yynmatch, re2c:variable:yynmatch
Defines API primitive yynmatch.
re2c:yypmatch, re2c:variable:yypmatch
Defines API primitive yypmatch.
re2c:yytarget, re2c:variable:yytarget
Defines API primitive yytarget.
re2c:yystable, re2c:variable:yystable
Deprecated.
re2c:yystate, re2c:variable:yystate
Defines API primitive yystate.
re2c:yyfill, re2c:variable:yyfill
Defines API primitive yyfill.
re2c:yyfill:check
If set to zero, suppresses the generation of pre-YYFILL check
for the number of input characters (the YYLESSTHAN definition in
generic API and the YYLIMIT-based comparison in C pointer API).
The default is non-zero (generate the check).
re2c:yyfill:enable
If set to zero, suppresses the generation of YYFILL (together
with the check). This should be used when the whole input fits
into one piece of memory (there is no need for buffering) and
the end-of-input checks do not rely on the YYFILL checks (e.g.
if a sentinel character is used). Use warnings (-W option) and
re2c:sentinel configuration to verify that the generated lexer
cannot read past the end of input. The default is non-zero
(YYFILL is enabled).
re2c:yyfill:parameter
If set to zero, suppresses the generation of parameter passed to
YYFILL. The parameter is the minimum number of characters that
must be supplied. Defaults to non-zero (the parameter is
generated). This configuration can be overridden with
re2c:YYFILL:naked or re2c:api:style.
re2c:yyfn:sep
Specifies separator used in YYFN elements (defaults to
semicolon).
re2c:yyfn:throw
Specifies exceptions thrown by YYFN function (defaults to empty,
which means no exceptions).
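The placeholder rules described under re2c:api:sigil (a sigil followed
by the argument name in curly braces, with a bare sigil allowed for
single-argument primitives) can be illustrated with a small model. The
function substitute below is hypothetical and only mimics the textual
substitution; it is not taken from re2go.

```go
package main

import (
	"fmt"
	"strings"
)

// substitute expands placeholders in an API primitive definition.
// sigil is the marker (as set by re2c:api:sigil), and args maps
// argument names to the values that replace them.
func substitute(definition, sigil string, args map[string]string) string {
	out := definition
	for name, value := range args {
		out = strings.ReplaceAll(out, sigil+"{"+name+"}", value)
	}
	// Single-argument primitives may use the bare sigil as a shorthand.
	if len(args) == 1 {
		for _, value := range args {
			out = strings.ReplaceAll(out, sigil, value)
		}
	}
	return out
}

func main() {
	fmt.Println(substitute("limit - cursor < @@{len}", "@@", map[string]string{"len": "3"}))
	fmt.Println(substitute("goto $;", "$", map[string]string{"cond": "yyc_init"}))
	fmt.Println(substitute("${tag} += ${shift}", "$", map[string]string{"tag": "yyt1", "shift": "2"}))
}
```

The model assumes that substituted values do not themselves contain the
sigil.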
Regular expressions
re2go uses the following syntax for regular expressions:
"foo" Case-sensitive string literal.
'foo' Case-insensitive string literal.
[a-xyz], [^a-xyz]
Character class (possibly negated).
. Any character except newline.
R \ S Difference of character classes R and S.
R* Zero or more occurrences of R.
R+ One or more occurrences of R.
R? Optional R.
R{n} Repetition of R exactly n times.
R{n,} Repetition of R at least n times.
R{n,m} Repetition of R from n to m times.
(R) Just R; parentheses are used to override precedence. If submatch
extraction is enabled, (R) is a capturing or a non-capturing
group depending on --invert-captures option.
(!R) If submatch extraction is enabled, (!R) is a non-capturing or a
capturing group depending on --invert-captures option.
R S Concatenation: R followed by S.
R | S Alternative: R or S.
R / S Lookahead: R followed by S, but S is not consumed.
name Regular expression defined as name (or literal string "name" in
Flex compatibility mode).
{name} Regular expression defined as name in Flex compatibility mode.
@stag An s-tag: saves the last input position at which @stag matches
in a variable named stag.
#mtag An m-tag: saves all input positions at which #mtag matches in a
variable named mtag.
Character classes and string literals may contain the following escape
sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and
hexadecimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
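The difference between s-tags and m-tags matters when a tag sits inside
a repetition: an s-tag keeps only the last position, while an m-tag
keeps all of them. The sketch below imitates this by hand for input made
of digit groups separated by dots, as if both tags were placed before
each group. The function scanFields is hypothetical, and -1 is used here
as the value of an unset s-tag.

```go
package main

import "fmt"

func isDigit(c byte) bool { return '0' <= c && c <= '9' }

// scanFields walks over dot-separated groups of digits. Before each group
// it records the current position twice: s is overwritten (s-tag behavior)
// and m collects every position (m-tag behavior).
func scanFields(input string) (s int, m []int) {
	s = -1
	i := 0
	for i < len(input) && isDigit(input[i]) {
		s = i
		m = append(m, i)
		for i < len(input) && isDigit(input[i]) {
			i++
		}
		if i < len(input) && input[i] == '.' {
			i++
		}
	}
	return s, m
}

func main() {
	s, m := scanFields("10.20.3")
	fmt.Println("s-tag:", s, "m-tag:", m)
}
```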
Actions
Here is a list of predefined actions supported by re2go:
!entry code
Entry action binds a user-defined block of code to the start
state of the current finite state machine. If start conditions
are used, the entry action can be set individually for each
condition. This action may be used to perform initialization,
e.g. to save start location of a lexeme.
!pre_rule code
Pre-rule action prepends a user-defined block of code to
semantic actions of all rules in the current block (or
condition, if start conditions are used). This action may be
used to factor out the common part of all semantic actions (e.g.
saving the end location of a lexeme).
!post_rule code
Post-rule action appends a user-defined block of code to
semantic actions of all rules in the current block (or
condition, if start conditions are used). This action may be
used to emit trap statements that guard against unintended
control flow.
Directives
Here is a full list of directives supported by re2go:
!use:name ;
An in-block use directive that merges a previously defined rules
block with the specified name into the current block. Named
definitions, configurations and rules of the referenced block
are added to the current ones. Conflicts between overlapping
rules and configurations are resolved in the usual way: the
first rule takes priority, and the latest configuration
overrides the preceding ones. One exception is the special rules
*, $ and <!> for which a block-local definition always takes
priority. A use directive can be placed anywhere inside of a
block, and multiple use directives are allowed.
!include file ;
This directive is the same as include block: it inserts file
contents verbatim in place of the directive.
Program interface
The generated code interfaces with the outer program with the help of
primitives, collectively referred to as the API. Which primitives
should be defined for a particular program depends on multiple factors,
including the complexity of regular expressions, input representation,
buffering and the use of various features. All the necessary primitives
should be defined by the user in the form of macros, functions,
variables or any other suitable form that makes the generated code
syntactically and semantically correct. re2go does not (and cannot)
check the definitions, so if anything is missing or defined
incorrectly, the generated program may have compile-time or run-time
errors. This manual provides examples of API definitions in the most
common cases.
re2go has three API flavors that define the core set of primitives used
by a program:
Simple API
(added in version 4.0) This is a basic API that can be enabled
with --api simple option or re2c:api = simple configuration. It
consists of the following primitives: YYINPUT (which should be
defined as a sequence of code units, e.g. a string) and
YYCURSOR, YYMARKER, YYCTXMARKER, YYLIMIT (which should be
defined as indices in YYINPUT).
Record API
(added in version 4.0) Record API is useful in cases when lexer
state must be stored in a struct. It is enabled with --api
record option or re2c:api = record configuration. This API
consists of a variable yyrecord (the name can be overridden with
re2c:yyrecord) that should be defined as a struct with fields
yyinput, yycursor, yymarker, yyctxmarker, yylimit (only the
fields used by the generated code need to be defined, and their
names can be configured).
Generic API
This is the most flexible API and the default API for the Go
backend. It contains primitives for generic operations: YYPEEK,
YYSKIP, YYBACKUP, YYBACKUPCTX, YYSTAGP, YYSTAGN, YYMTAGP,
YYMTAGN, YYRESTORE, YYRESTORECTX, YYRESTORETAG, YYSHIFT,
YYSHIFTSTAG, YYSHIFTMTAG, YYLESSTHAN. Generic API supports two
styles that determine the form in which the primitives should be
defined:
Free-form
Free-form style is the default one. It can also be
enabled with configuration re2c:api:style = free-form.
In this style interface primitives should be defined as
free-form pieces of code with interpolated variables of
the form @@{var} or optionally just @@ if there is a
single variable. The set of variables is specific to each
primitive. Here's how free-form generic API can be
defined in terms of integer variables cursor, limit,
marker, ctxmarker and a string (or a byte slice) data:
/*!re2c
re2c:YYPEEK = "data[cursor]";
re2c:YYSKIP = "cursor++";
re2c:YYBACKUP = "marker = cursor";
re2c:YYRESTORE = "cursor = marker";
re2c:YYBACKUPCTX = "ctxmarker = cursor";
re2c:YYRESTORECTX = "cursor = ctxmarker";
re2c:YYRESTORETAG = "cursor = @@{tag}";
re2c:YYLESSTHAN = "limit - cursor < @@{len}";
re2c:YYSTAGP = "@@{tag} = cursor";
re2c:YYSTAGN = "@@{tag} = -1";
re2c:YYSHIFT = "cursor += @@{shift}";
re2c:YYSHIFTSTAG = "@@{tag} += @@{shift}";
*/
Function-like
Function-like style is enabled with configuration
re2c:api:style = functions. In this style primitives
should be defined as closures accepting the necessary
arguments. Here's how function-like generic API can be
defined in terms of integer variables cursor, limit,
marker, ctxmarker and a string (or a byte slice) data:
YYPEEK := func() byte { return data[cursor] }
YYSKIP := func() { cursor++ }
YYBACKUP := func() { marker = cursor }
YYRESTORE := func() { cursor = marker }
YYBACKUPCTX := func() { ctxmarker = cursor }
YYRESTORECTX := func() { cursor = ctxmarker }
YYLESSTHAN := func(n int) bool { return limit-cursor < n }
YYSHIFT := func(n int) { cursor += n }
Here is a full list of API primitives that may be used by the generated
code in order to interface with the outer program.
YYCTYPE
The type of the input characters (code units). For ASCII,
EBCDIC and UTF-8 encodings it should be 1-byte unsigned integer.
For UTF-16 or UCS-2 it should be 2-byte unsigned integer. For
UTF-32 it should be 4-byte unsigned integer.
YYCURSOR
An l-value that stores the current input position (a pointer or
an integer offset in YYINPUT). Initially YYCURSOR should point
to the first input character, and later it is advanced by the
generated code. When a rule matches, YYCURSOR position is the
one after the last matched character.
YYLIMIT
An r-value that stores the end of input position (a pointer or
an integer offset in YYINPUT). Initially YYLIMIT should point to
the position after the last available input character. It is not
changed by the generated code. The lexer compares YYCURSOR to
YYLIMIT in order to determine if there are enough input
characters left.
YYMARKER
An l-value that stores the position of the latest matched rule
(a pointer or an integer offset in YYINPUT). It is used to
restore the YYCURSOR position if the longer match fails and the
lexer needs to roll back. Initialization is not needed.
YYCTXMARKER
An l-value that stores the position of the trailing context (a
pointer or an integer offset in YYINPUT). No initialization is
needed. YYCTXMARKER is needed only if the lookahead operator /
is used.
YYFILL A generic API primitive with one variable len. YYFILL should
provide at least len more input characters or fail. If re2c:eof
is used, then len is always 1 and YYFILL should always return
to the calling function; zero return value indicates success.
If re2c:eof is not used, then YYFILL return value is ignored and
it should not return on failure. The maximum value of len is
YYMAXFILL.
YYFN A primitive that defines function prototype in
--recursive-functions code model. Its value should be an array
of one or more strings, where each string contains two or three
components separated by the string specified in re2c:fn:sep
configuration (typically a semicolon). The first array element
defines function name and return type (empty for a void
function). Subsequent elements define function arguments:
first, the expression for the argument used in function body
(usually just a name); second, argument type; third, an optional
formal parameter (it defaults to the first component - usually
both the argument and the parameter are the same identifier).
YYINPUT
An r-value that stores the current input character sequence
(string, buffer, etc.).
YYMAXFILL
An integral constant equal to the maximum value of the argument
to YYFILL. It can be generated with a max block.
YYLESSTHAN
A generic API primitive with one variable len. It should be
defined as an r-value of boolean type that equals true if and
only if there are fewer than len input characters left.
YYPEEK A generic API primitive with no variables. It should be defined
as an r-value of type YYCTYPE that is equal to the character at
the current input position.
YYSKIP A generic API primitive that should advance the current input
position by one code unit.
YYBACKUP
A generic API primitive that should save the current input
position (to be restored with YYRESTORE later).
YYRESTORE
A generic API primitive that should restore the current input
position to the value saved by YYBACKUP.
YYBACKUPCTX
A generic API primitive that should save the current input
position as the position of the trailing context (to be restored
with YYRESTORECTX later).
YYRESTORECTX
A generic API primitive that should restore the trailing context
position saved with YYBACKUPCTX.
YYRESTORETAG
A generic API primitive with one variable tag that should
restore the trailing context position to the value of tag.
YYSTAGP
A generic API primitive with one variable tag, where tag can be
a pointer or an offset in YYINPUT (see submatch extraction
section for details). YYSTAGP should set tag to the current
input position.
YYSTAGN
A generic API primitive with one variable tag, where tag can be
a pointer or an offset in YYINPUT (see submatch extraction
section for details). YYSTAGN should set tag to a value that
represents non-existent input position.
YYMTAGP
A generic API primitive with one variable tag. YYMTAGP should
append the current position to the submatch history of tag (see
the submatch extraction section for details).
YYMTAGN
A generic API primitive with one variable tag. YYMTAGN should
append a value that represents non-existent input position to
the submatch history of tag (see the submatch extraction section
for details).
YYSHIFT
A generic API primitive with one variable shift that should
shift the current input position by shift characters (the shift
value may be negative).
YYCOPYSTAG
A generic API primitive with two variables, lhs and rhs that
should copy right-hand-side s-tag variable rhs to the
left-hand-side s-tag variable lhs. For most languages this
primitive has a default definition that assigns rhs to lhs.
YYCOPYMTAG
A generic API primitive with two variables, lhs and rhs that
should copy right-hand-side m-tag variable rhs to the
left-hand-side m-tag variable lhs. For most languages this
primitive has a default definition that assigns rhs to lhs.
YYSHIFTSTAG
A generic API primitive with two variables, tag and shift that
should shift tag by shift code units (the shift value may be
negative).
YYSHIFTMTAG
A generic API primitive with two variables, tag and shift that
should shift the latest value in the history of tag by shift
code units (the shift value may be negative).
YYMAXNMATCH
An integral constant equal to the maximal number of POSIX
capturing groups in a rule. It is generated with a maxnmatch
block.
YYCONDTYPE
The type of the condition enum. It can be generated either with
conditions block or --header option.
YYGETACCEPT
A primitive with one variable var that stores numeric selector
of the accepted rule. For most languages this primitive has a
default definition that reads from var.
YYSETACCEPT
A primitive with two variables: var (an l-value that stores
numeric selector of the accepted rule), and val (the value of
selector). For most languages this primitive has a default
definition that assigns val to var.
YYGETCOND
An r-value of type YYCONDTYPE that is equal to the current
condition identifier.
YYSETCOND
A primitive with one variable cond that should set the current
condition identifier to cond.
YYGETSTATE
An r-value of integer type that is equal to the current lexer
state. It should be initialized to -1.
YYSETSTATE
A primitive with one variable state that should set the current
lexer state to state.
YYDEBUG
This primitive is generated only with -d, --debug-output option.
Its purpose is to add logging to the generated code (typical
YYDEBUG definition is a print statement). YYDEBUG statements are
generated in every state and have two variables: state (either a
DFA state index or -1) and symbol (the current input symbol).
yyaccept
An l-value of unsigned integral type that stores the number of
the latest matched rule. User definition is necessary only with
--storable-state option.
yybm A table containing compressed bitmaps for up to 8 transitions
(used with the --bitmaps option). The table contains 256
elements and is indexed by 1-byte code units. Each 8-bit element
combines boolean values for up to 8 transitions. The k-th bit of
the n-th element is true iff the n-th code unit is in the range of
the k-th transition. The idea of this bitmap is to replace many if
branches or switch cases with one check of a single bit in the
table.
yych An l-value of type YYCTYPE that stores the current input
character. User definition is necessary only with -f
--storable-state option.
yyctable
Jump table generated for the initial condition dispatch (enabled
with the combination of --conditions and --computed-gotos
options).
yyfill An l-value that stores the result of YYFILL call (this may be
necessary for pure functional languages, where YYFILL is a
monadic function with complex return value).
yynmatch
An l-value of unsigned integral type that stores the number of
POSIX capturing groups in the matched rule. Used only with -P
--posix-captures option.
yypmatch
An array of l-values that are used to hold the tag values
corresponding to the capturing parentheses in the matching rule.
Array length must be at least yynmatch * 2 (usually YYMAXNMATCH
* 2 is a good choice). Used only with -P --posix-captures
option.
yystable
Deprecated.
yystate
An l-value used with the --loop-switch option to store the
current DFA state.
yytarget
Jump table that contains jump targets (label addresses) for all
transitions from a state. This table is local to each state.
Generation of yytarget tables is enabled with --computed-gotos
option.
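The bitmap layout described under yybm can be sketched in plain Go. This
is only an illustration of the idea (bit k of element n answers "is code
unit n in transition k?"); the names byteRange, buildBitmap and
inTransition are hypothetical, and the table is not the one re2go emits:

```go
package main

import "fmt"

// byteRange is an inclusive range of 1-byte code units.
type byteRange struct{ lo, hi byte }

// buildBitmap packs up to 8 transitions into one 256-element table:
// bit k of table[n] is set iff code unit n belongs to transition k.
func buildBitmap(transitions [][]byteRange) [256]uint8 {
	if len(transitions) > 8 {
		panic("at most 8 transitions fit in an 8-bit element")
	}
	var table [256]uint8
	for k, ranges := range transitions {
		for _, r := range ranges {
			for n := int(r.lo); n <= int(r.hi); n++ {
				table[n] |= 1 << uint(k)
			}
		}
	}
	return table
}

// inTransition replaces a chain of comparisons with a single bit test.
func inTransition(table *[256]uint8, k uint, c byte) bool {
	return table[c]&(1<<k) != 0
}

func main() {
	yybm := buildBitmap([][]byteRange{
		{{'0', '9'}},                         // transition 0: digits
		{{'a', 'z'}, {'A', 'Z'}, {'_', '_'}}, // transition 1: identifier start
	})
	fmt.Println(inTransition(&yybm, 0, '7')) // true
	fmt.Println(inTransition(&yybm, 1, '7')) // false
	fmt.Println(inTransition(&yybm, 1, 'q')) // true
}
```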
Options
Some of the options have corresponding configurations, others are
global and cannot be changed after re2c starts reading the input file.
Debug options generally require building re2c in debug configuration.
Internal options are useful for experimenting with the algorithms used
in re2c.
-? --help -h
Show help message.
--api <simple | record | generic>
Specify the API used by the generated code to interface with
user-defined code. Option simple should be used in simple cases
when there's no need for buffer refilling and storing lexer
state. Option record should be used when lexer state needs to be
stored in a record (struct, class, etc.). Option generic should
be used in complex cases when the other two APIs are not
flexible enough.
--bit-vectors -b
Optimize conditional jumps using bit masks. This option implies
--nested-ifs.
--captures, --leftmost-captures
Enable submatch extraction with leftmost greedy capturing
groups. The result is collected into an array yypmatch of
capacity 2 * YYMAXNMATCH, and yynmatch is set to the number of
groups for the matching rule.
--captvars, --leftmost-captvars
Enable submatch extraction with leftmost greedy capturing
groups. The result is collected into variables yytl<k>, yytr<k>
for k-th capturing group.
--case-insensitive
Treat single-quoted and double-quoted strings as
case-insensitive.
--case-inverted
Invert the meaning of single-quoted and double-quoted strings:
treat single-quoted strings as case-sensitive and double-quoted
strings as case-insensitive.
--case-ranges
Collapse consecutive cases in a switch statement into a range
of the form low ... high. This syntax is a C/C++ language
extension that is supported by compilers like GCC, Clang and
Tcc. The main advantage over using single cases is smaller
generated code and faster generation time, although for some
compilers like Tcc it also results in smaller binary size.
--computed-gotos -g
Optimize conditional jumps using non-standard "computed goto"
extension (which must be supported by the compiler). re2go
generates jump tables only in complex cases with a lot of
conditional branches. Complexity threshold can be configured
with cgoto:threshold configuration. Relative offsets can be
enabled with cgoto:relative configuration. This option implies
--bit-vectors.
--computed-gotos-relative
Similar to --computed-gotos but generate relative offsets for
jump tables instead (which must be supported by the compiler).
This option implies --computed-gotos.
--conditions --start-conditions -c
Enable support of Flex-like "conditions": multiple interrelated
lexers within one block. This is an alternative to manually
specifying different re2go blocks connected with goto or
function calls.
--depfile FILE
Write dependency information to FILE in the form of a Makefile
rule <output-file> : <input-file> [include-file ...]. This
allows one to track build dependencies in the presence of
include blocks/directives, so that updating include files
triggers regeneration of the output file. This option depends
on the --output option.
--ebcdic --ecb -e
Generate a lexer that reads input in EBCDIC encoding. re2go
assumes that the character range is 0 -- 0xFF and character size
is 1 byte.
--empty-class <match-empty | match-none | error>
Define the way re2go treats empty character classes. With
match-empty (the default) empty class matches empty input (which
is illogical, but backwards-compatible). With match-none empty
class always fails to match. With error empty class raises a
compilation error.
--encoding-policy <fail | substitute | ignore>
Define the way re2go treats Unicode surrogates. With fail re2go
aborts with an error when a surrogate is encountered. With
substitute re2go silently replaces surrogates with the error
code point 0xFFFD. With ignore (the default) re2go treats
surrogates as normal code points. The Unicode standard says that
standalone surrogates are invalid, but real-world libraries and
programs behave in different ways.
--flex-syntax -F
Partial support for Flex syntax: in this mode named definitions
don't need the equal sign and the terminating semicolon, and
when used they must be surrounded with curly braces. Names
without curly braces are treated as double-quoted strings.
--goto-label
Use "goto/label" code model: encode DFA in the form of labeled code
blocks connected with goto transitions across blocks. This is
only supported for languages that have a goto statement.
--header --type-header -t HEADER
Generate a HEADER file. The contents of the file can be
specified using special blocks header:on and header:off. If
conditions are used, the generated header will have a condition
enum automatically appended to it (unless there is an explicit
conditions block).
-I PATH
Add PATH to the list of locations which are used when searching
for include files. This option is useful in combination with
include block or directive. re2go looks for FILE in the
directory of the parent file and in the include locations
specified with -I option.
--input <default | custom>
Deprecated alias for --api. Option default corresponds to simple
(it is indeed the default for most backends, but not for all).
Option custom corresponds to generic.
--input-encoding <ascii | utf8>
Specify the way re2go parses regular expressions. With ascii
(the default) re2go handles input as ASCII-encoded: any sequence
of code units is a sequence of standalone 1-byte characters.
With utf8 re2go handles input as UTF8-encoded and recognizes
multibyte characters.
--invert-captures
Invert the meaning of capturing and non-capturing groups. By
default (...) is capturing and (! ...) is non-capturing. With
this option (! ...) is capturing and (...) is non-capturing.
--lang <none | c | d | go | haskell | java | js | ocaml | python | rust
| swift | v | zig>
Specify the target language. Supported languages are C, D, Go,
Haskell, Java, JS, OCaml, Python, Rust, Swift, V, Zig (more
languages can be added via user-defined syntax files, see the
--syntax option). Option none disables default syntax configs,
so that the target language is undefined.
--location-format <gnu | msvc>
Specify location format in messages. With gnu locations are
printed as 'filename:line:column: ...'. With msvc locations are
printed as 'filename(line,column) ...'. The default is gnu.
--loop-switch
Use "loop/switch" code model: encode DFA in the form of a loop over
a switch statement, where individual states are switch cases.
State is stored in a variable yystate. Transitions between
states update yystate to the case label of the destination state
and continue execution to the head of the loop.
--nested-ifs -s
Use nested if statements instead of switch statements in
conditional jumps. This usually results in more efficient code
with non-optimizing compilers.
--no-debug-info -i
Do not output line directives. This may be useful when the
generated code is stored in a version control system (to avoid
huge autogenerated diffs on small changes).
--no-generation-date
Suppress date output in the generated file.
--no-version
Suppress version output in the generated file.
--no-unsafe
Do not generate unsafe wrapper over YYPEEK (this option is
specific to Rust). For performance reasons YYPEEK should avoid
bounds-checking, as the lexer already performs end-of-input
checks in a more efficient way. The user may choose to provide
a safe YYPEEK definition, or a definition that is unsafe only in
release builds, in which case the --no-unsafe option helps to
avoid warnings about redundant unsafe blocks.
--output -o OUTPUT
Specify the OUTPUT file.
--posix-captures, -P
Enable submatch extraction with POSIX-style capturing groups.
The result is collected into an array yypmatch of capacity 2 *
YYMAXNMATCH, and yynmatch is set to the number of groups for the
matching rule.
--posix-captvars
Enable submatch extraction with POSIX-style capturing groups.
The result is collected into variables yytl<k>, yytr<k> for k-th
capturing group.
--recursive-functions
Use code model based on co-recursive functions, where each DFA
state is a separate function that may call other state-functions
or itself.
--reusable -r
Deprecated since version 2.2 (reusable blocks are allowed by
default now).
--skeleton -S
Ignore user-defined interface code and generate a self-contained
"skeleton" program. Additionally, generate input files with
strings derived from the regular grammar and compressed match
results that are used to verify "skeleton" behavior on all
inputs. This option is useful for finding bugs in optimizations
and code generation. This option is supported only for C.
--storable-state -f
Generate a lexer which can store its inner state. This is
useful in push-model lexers which are stopped by an outer
program when there is not enough input, and then resumed when
more input becomes available. In this mode users should
additionally define YYGETSTATE and YYSETSTATE primitives, and
variables yych, yyaccept and state should be part of the stored
lexer state.
--syntax FILE
Load configurations from the specified FILE and apply them on
top of the default syntax file. Note that FILE can define only a
few configurations (if it's used to amend the default syntax
file), or it can define a whole new language backend (in the
latter case it is recommended to use --lang none option).
--tags -T
Enable submatch extraction with tags.
--ucs2 --wide-chars -w
Generate a lexer that reads UCS2-encoded input. re2go assumes
that the character range is 0 -- 0xFFFF and character size is 2
bytes. This option implies --nested-ifs.
--utf8 --utf-8 -8
Generate a lexer that reads input in UTF-8 encoding. re2go
assumes that the character range is 0 -- 0x10FFFF and character
size is 1 byte.
--utf16 --utf-16 -x
Generate a lexer that reads UTF16-encoded input. re2go assumes
that the character range is 0 -- 0x10FFFF and character size is
2 bytes. This option implies --nested-ifs.
--utf32 --unicode -u
Generate a lexer that reads UTF32-encoded input. re2go assumes
that the character range is 0 -- 0x10FFFF and character size is
4 bytes. This option implies --nested-ifs.
--verbose
Output a short message in case of success.
--vernum -V
Show version information in MMmmpp format (major, minor, patch).
--version -v
Show version information.
--single-pass -1
Deprecated. Does nothing (single pass is the default now).
--debug-output -d
Emit YYDEBUG invocations in the generated code. This is useful
to trace lexer execution.
--dump-adfa
Debug option: output DFA after tunneling (in .dot format).
--dump-cfg
Debug option: output control flow graph of tag variables (in
.dot format).
--dump-closure-stats
Debug option: output statistics on the number of states in
closure.
--dump-dfa-det
Debug option: output DFA immediately after determinization (in
.dot format).
--dump-dfa-min
Debug option: output DFA after minimization (in .dot format).
--dump-dfa-tagopt
Debug option: output DFA after tag optimizations (in .dot
format).
--dump-dfa-tree
Debug option: output DFA under construction with states
represented as tag history trees (in .dot format).
--dump-dfa-raw
Debug option: output DFA under construction with expanded
state-sets (in .dot format).
--dump-interf
Debug option: output interference table produced by liveness
analysis of tag variables.
--dump-nfa
Debug option: output NFA (in .dot format).
--emit-dot -D
Instead of normal output generate lexer graph in .dot format.
The output can be converted to an image with the help of
Graphviz (e.g. something like dot -Tpng -odfa.png dfa.dot).
--dfa-minimization <moore | table>
Internal option: DFA minimization algorithm used by re2go. The
moore option is the Moore algorithm (it is the default). The
table option is the "table filling" algorithm. Both algorithms
should produce the same DFA up to states relabeling; table
filling is simpler and much slower and serves as a reference
implementation.
--eager-skip
Internal option: make the generated lexer advance the input
position eagerly -- immediately after reading the input symbol.
This changes the default behavior when the input position is
advanced lazily -- after transition to the next state.
--no-lookahead
Internal option, deprecated. It used to enable TDFA(0)
algorithm. Unlike TDFA(1), TDFA(0) algorithm does not use
one-symbol lookahead. It applies register operations to the
incoming transitions rather than the outgoing ones. Benchmarks
showed that TDFA(0) algorithm is less efficient than TDFA(1).
--no-optimize-tags
Internal option: suppress optimization of tag variables (useful
for debugging).
--posix-closure <gor1 | gtop>
Internal option: specify shortest-path algorithm used for the
construction of epsilon-closure with POSIX disambiguation
semantics: gor1 (the default) stands for Goldberg-Radzik
algorithm, and gtop stands for "global topological order"
algorithm.
--posix-prectable <complex | naive>
Internal option: specify the algorithm used to compute POSIX
precedence table. The complex algorithm computes precedence
table in one traversal of tag history tree and has quadratic
complexity in the number of TNFA states; it is the default. The
naive algorithm has worst-case cubic complexity in the number of
TNFA states, but it is much simpler than complex and may be
slightly faster in non-pathological cases.
--stadfa
Internal option, deprecated. It used to enable staDFA
algorithm, which differs from TDFA in that register operations
are placed in states rather than on transitions. Benchmarks
showed that staDFA algorithm is less efficient than TDFA.
--fixed-tags <none | toplevel | all>
Internal option: specify whether the fixed-tag optimization
should be applied to all tags (all), none of them (none), or
only those in toplevel concatenation (toplevel). The default is
all. "Fixed" tags are those that are located within a fixed
distance to some other tag (called "base"). In such cases only
the base tag needs to be tracked, and the value of the fixed tag
can be computed as the value of the base tag plus a static
offset. For tags that are under alternative or repetition it is
also necessary to check if the base tag has a no-match value (in
that case fixed tag should also be set to no-match, disregarding
the offset). For tags in top-level concatenation the check is
not needed, because they always match.
Warnings
Warnings can be individually enabled, disabled and turned into
errors.
-W Turn on all warnings.
-Werror
Turn warnings into errors. Note that this option alone doesn't
turn on any warnings; it only affects those warnings that have
been turned on so far or will be turned on later.
-W<warning>
Turn on warning.
-Wno-<warning>
Turn off warning.
-Werror-<warning>
Turn on warning and treat it as an error (this implies
-W<warning>).
-Wno-error-<warning>
Don't treat this particular warning as an error. This doesn't
turn off the warning itself.
-Wcondition-order
Warn if the generated program makes implicit assumptions about
condition numbering. One should use either --header option or
conditions block to generate a mapping of condition names to
numbers and then use the autogenerated condition names.
-Wempty-character-class
Warn if a regular expression contains an empty character class.
Trying to match an empty character class makes no sense: it
should always fail. However, for backwards compatibility
reasons re2go permits empty character classes and treats them as
empty strings. Use the --empty-class option to change the
default behavior.
-Wmatch-empty-string
Warn if a rule is nullable (matches an empty string). If the
lexer runs in a loop and the empty match is unintentional, the
lexer may unexpectedly hang in an infinite loop.
-Wswapped-range
Warn if the lower bound of a range is greater than its upper
bound. The default behavior is to silently swap the range
bounds.
-Wundefined-control-flow
Warn if some input strings cause undefined control flow in the
lexer (the faulty patterns are reported). This is a dangerous
and common mistake. It can be easily fixed by adding the default
rule * which has the lowest priority, matches any code unit, and
always consumes a single code unit.
-Wunreachable-rules
Warn about rules that are shadowed by other rules and will never
match.
-Wdeprecated-eof-rule
Warn about standalone end of input rules $ that will be broken
by the future changes and require fixing. At the moment these
rules take precedence when conflicting with other rules, but
after the introduction of generalized end of input symbol $
precedence order will change and these rules will become
shadowed by other rules.
-Wuseless-escape
Warn if a symbol is escaped when it shouldn't be. By default,
re2go silently ignores such escapes, but this may as well
indicate a typo or an error in the escape sequence.
-Wnondeterministic-tags
Warn if a tag has n-th degree of nondeterminism, where n is
greater than 1.
-Wsentinel-in-midrule
Warn if the sentinel symbol occurs in the middle of a rule, as
this may cause reads past the end of buffer, crashes or memory
corruption in the generated lexer. This warning is only
applicable if the sentinel method of checking for the end of
input is used. It is set to an error if re2c:sentinel
configuration is used.
-Wundefined-syntax-config
Warn if the syntax file specified with --syntax option is
missing definitions of some configurations. This helps to
maintain user-defined syntax files: if a new release adds
configurations, old syntax file will raise a warning, and the
user will be notified. If some configurations are unused and do
not need a definition, they should be explicitly set to
<undefined>.
Syntax files
Support for different languages in re2c is based on the idea of syntax
files. A syntax file is a configuration file that defines syntax of
the target language -- not the whole language, but a small part of it
that is used by the generated code. Syntax files make re2c very
flexible, but they should not be used as a replacement for re2c:
configurations: their purpose is to define syntax of the target
language, not to customize one particular lexer. All supported
languages have default syntax files that are part of the distribution
(see include/syntax subdirectory); they are also embedded in the re2go
binary. Users may provide a custom syntax file that overrides a few
configurations for one of supported languages, or they may choose to
redefine all configurations (in that case --lang none option should be
used). Syntax files contain configurations of four different kinds:
feature lists, language configurations, inplace configurations and code
templates.
Feature lists
A few list configurations define various features supported by a
given backend, so that re2go may give a clear error if the user
tries to enable an unsupported feature:
supported_apis
A list of supported APIs with possible elements simple,
record, generic.
supported_api_styles
A list of supported API styles with possible elements
functions, free-form.
supported_code_models
A list of supported code models with possible elements
goto-label, loop-switch, recursive-functions.
supported_targets
A list of supported codegen targets with possible elements
code, dot, skeleton.
supported_features
A list of supported features with possible elements
nested-ifs, bitmaps, computed-gotos, case-ranges, monadic,
unsafe, tags, captures, captvars.
Language configurations
A few boolean configurations describe features of the target
language that affect re2go parser and code generator:
semicolons
Non-zero if the language uses semicolons after statements.
backtick_quoted_strings
Non-zero if the language has backtick-quoted strings.
single_quoted_strings
Non-zero if the language has single-quoted strings.
indentation_sensitive
Non-zero if the language is indentation sensitive.
wrap_blocks_in_braces
Non-zero if compound statements must be wrapped in curly
braces.
Inplace configurations
Syntax files define initial values of all re2c: configurations, as
they may differ for different languages. See configurations section
for a full list of all inplace configurations and their meaning.
Code templates
Code templates define syntax of the target language. They are
written in a simple domain-specific language with the following
formal grammar:
code-template ::
name '=' code-exprs ';'
| CODE_TEMPLATE ';'
| '<undefined>' ';'
code-exprs ::
<EMPTY>
| code-exprs code-expr
code-expr ::
STRING
| VARIABLE
| optional
| list
optional ::
'(' CONDITIONAL '?' code-exprs ')'
| '(' CONDITIONAL '?' code-exprs ':' code-exprs ')'
list ::
'[' VARIABLE ':' code-exprs ']'
| '[' VARIABLE '{' NUMBER '}' ':' code-exprs ']'
| '[' VARIABLE '{' NUMBER ':' NUMBER '}' ':' code-exprs ']'
A code template is a sequence of string literals, variables,
optional elements and lists, or a reference to another code
template, or a special value <undefined>. Variables are placeholders
that are substituted during code generation phase. List variables
are special: when expanding list templates, re2go repeats the
expressions on the right-hand side of the colon a few times, each time
replacing occurrences of the list variable with a value specific to
this repetition. Lists have optional bounds (negative values are
counted from the end, e.g. -1 means the last element). Conditional
names start with a dot. Both conditionals and variables may be
either local (specific to the given code template) or global
(allowed in all code templates). When re2go reads a syntax file, it
checks that each code template uses only the variables and
conditionals that are allowed in it.
For example, the following code template defines if-then-else
construct for a C-like language:
code:if_then_else =
[branch{0}: topindent "if " cond " {" nl
indent [stmt: stmt] dedent]
[branch{1:-1}: topindent "} else" (.cond ? " if " cond) " {" nl
indent [stmt: stmt] dedent]
topindent "}" nl;
Here branch is a list variable: branch{0} expands to the first
branch (which is special, as there is no else part), branch{1:-1}
expands to all remaining branches (if any). stmt is also a list
variable: [stmt: stmt] is a nested list that expands to a list of
statements in the body of the current branch. topindent, indent,
dedent and nl are global variables, and .cond is a local conditional
(their meaning is described below). This code template could produce
the following code:
if x {
// do something
} else if y {
// do something else
} else {
// don't do anything
}
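The expansion above can be imitated by hand. The following is only a
sketch in Go, not re2go's implementation: the types branch and emitter
and the helper slice are hypothetical names introduced for illustration,
and a four-space indentation string is assumed. It shows how the bounded
lists branch{0} and branch{1:-1}, the nested list [stmt: stmt], the
conditional .cond and the global variables topindent, indent, dedent
and nl could fit together:

```go
package main

import (
	"fmt"
	"strings"
)

// branch is one arm of the chain; an empty cond stands for the final
// "else" (the .cond conditional is false for it).
type branch struct {
	cond  string
	stmts []string
}

// emitter tracks the indentation level (hypothetical helper).
type emitter struct {
	out   strings.Builder
	level int
}

// line writes topindent, the text and nl.
func (e *emitter) line(s string) {
	e.out.WriteString(strings.Repeat("    ", e.level))
	e.out.WriteString(s)
	e.out.WriteString("\n")
}

// body expands [stmt: stmt] between indent and dedent.
func (e *emitter) body(stmts []string) {
	e.level++ // indent
	for _, s := range stmts {
		e.line(s)
	}
	e.level-- // dedent
}

// slice resolves inclusive list bounds {lo:hi}; negative values are
// counted from the end, so -1 is the last element.
func slice(list []branch, lo, hi int) []branch {
	n := len(list)
	if lo < 0 {
		lo += n
	}
	if hi < 0 {
		hi += n
	}
	if lo < 0 || hi >= n || lo > hi {
		return nil
	}
	return list[lo : hi+1]
}

// ifThenElse mimics the expansion of the code:if_then_else template.
func ifThenElse(branches []branch) string {
	e := &emitter{}
	for _, b := range slice(branches, 0, 0) { // [branch{0}: ...]
		e.line("if " + b.cond + " {")
		e.body(b.stmts)
	}
	for _, b := range slice(branches, 1, -1) { // [branch{1:-1}: ...]
		head := "} else"
		if b.cond != "" { // (.cond ? " if " cond)
			head += " if " + b.cond
		}
		e.line(head + " {")
		e.body(b.stmts)
	}
	e.line("}")
	return e.out.String()
}

func main() {
	fmt.Print(ifThenElse([]branch{
		{"x", []string{"// do something"}},
		{"y", []string{"// do something else"}},
		{"", []string{"// don't do anything"}},
	}))
}
```

With a single branch, slice(branches, 1, -1) is empty, so only the
"if" part and the closing brace are produced.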
Here's a list of all code templates supported by re2go with their
local variables and conditionals. Note that a particular definition
may, but does not have to, use local variables and conditionals. Any
unused code templates should be set to <undefined>.
code:var_local
Declaration or definition of a local variable. Supported
variables: type (the type of the variable), name (its name)
and init (initial value, if any). Conditionals: .init (true
if there is an initializer).
code:var_global
Same as code:var_local, except that it's used at top level.
code:const_local
Definition of a local constant. Supported variables: type
(the type of the constant), name (its name) and init (initial
value).
code:const_global
Same as code:const_local, except that it's used at top level.
code:array_local
Definition of a local array (table). Supported variables:
type (the type of array elements), name (array name), size
(its size), row (a list variable that does not itself produce
any code, but expands list expression as many times as there
are rows in the table) and elem (a list variable that expands
to all table elements in the current row -- it's meant to be
nested in the row list). Supported conditional: .const (true
if the array is immutable).
code:array_global
Same as code:array_local, except that it's used at top level.
code:array_elem
Reference to an element of an array (table). Supported
variables: array (the name of the array) and index (index of
the element).
code:enum
Definition of an enumeration (it may be defined using a
special language construct for enumerations, or simply as a
few standalone constants). Supported variables are type
(user-defined enumeration type or type of the constants),
elem (list variable that expands to the name of each member)
and init (initializer for each member). Conditionals: .init
(true if there is an initializer).
code:enum_elem
Enumeration element (a member of a user-defined enumeration
type or a name of a constant, depending on how code:enum is
defined). Supported variables are name (the name of the
element) and type (its type).
code:assign
Assignment statement. Supported variables are lhs (left hand
side) and rhs (right hand side).
code:type_int
Signed integer type.
code:type_uint
Unsigned integer type.
code:type_yybm
Type of elements in the yybm table.
code:type_yytarget
Type of elements in the yytarget table.
code:type_yyctable
Type of elements in the yyctable table.
code:cmp_eq
Operator "equals".
code:cmp_ne
Operator "not equals".
code:cmp_lt
Operator "less than".
code:cmp_gt
Operator "greater than".
code:cmp_le
Operator "less or equal".
code:cmp_ge
Operator "greater or equal".
code:if_then_else
If-then-else statement with one or more branches. Supported
variables: branch (a list variable that does not itself
produce any code, but expands list expression as many times
as there are branches), cond (condition of the current
branch) and stmt (a list variable that expands to all
statements in the current branch). Conditionals: .cond (true
if the current branch has a condition), .many (true if
there's more than one branch).
code:if_then_else_oneline
A specialization of code:if_then_else for the case when all
branches have one-line statements. If this is <undefined>,
code:if_then_else is used instead.
code:switch
A switch statement with one or more cases. Supported
variables: expr (the switched-on expression) and case (a list
variable that expands to all cases-groups with their code
blocks).
code:switch_cases
A group of switch cases that maps to a single code block.
Supported variables are case (a list variable that expands to
all cases in this group) and stmt (a list variable that
expands to all statements in the code block).
code:switch_cases_oneline
A specialization of code:switch_cases for the case when the
code block consists of a single one-line statement. If this
is <undefined>, code:switch_cases is used instead.
code:switch_case_range
A single switch case that covers a range of values (possibly
consisting of a single value). Supported variable: val (a
list variable that expands to all values in the range).
Supported conditionals: .many (true if there's more than one
value in the range) and .char_literals (true if this is a
switch on character literals -- some languages provide
special syntax for this case).
code:switch_case_default
Default switch case.
code:loop
A loop that runs forever (unless interrupted from the loop
body). Supported variables: label (loop label), stmt (a list
variable that expands to all statements in the loop body).
code:continue
Continue statement. Supported variables: label (label from
which to continue execution).
code:goto
Goto statement. Supported variables: label (label of the jump
target).
code:cgoto
Computed goto statement. Supported variables: array (the
table containing computed goto information), index (index of
the element in the table) and base (base label, only used if
.cgoto.relative is true).
code:cgoto:data
Initializer expression for a single element in computed goto
table. Supported variables: label (the label that is used to
initialize the current element), type (underlying type of the
elements in the table) and base (base label - only used if
.cgoto.relative is true).
code:fndecl
Function declaration. Supported variables: name (function
name), type (return type), throw (exceptions thrown by this
function, maps to re2c:yyfn:throw configuration), arg (a list
variable that does not itself produce code, but expands list
expression as many times as there are function arguments),
argname (name of the current argument), argtype (type of the
current argument). Conditional: .type (true if this is a
non-void function).
code:fndef
Like code:fndecl, but used for function definitions, so it
has one additional list variable stmt that expands to all
statements in the function body.
code:fncall
Function call statement. Supported variables: name (function
name), retval (l-value where the return value is stored, if
any) and arg (a list variable that expands to all function
arguments). Conditionals: .args (true if the function has
arguments) and .retval (true if return value needs to be
saved).
code:tailcall
Tail call statement. Supported variables: name (function
name), and arg (a list variable that expands to all function
arguments). Conditionals: .args (true if the function has
arguments) and .retval (true if this is a non-void function).
code:recursive_functions
Program body with --recursive-functions code model. Supported
variables: fn (a list variable that does not itself produce
any code, but expands list expression as many times as there
are functions), fndecl (declaration of the current function)
and fndef (definition of the current function).
code:fingerprint
The fingerprint at the top of the generated output file.
Supported variables: ver (re2go version that was used to
generate this) and date (generation date).
code:line_info
The format of line directives (if this is set to <undefined>,
no directives are generated). Supported variables: line (line
number) and file (filename).
code:abort
A statement that aborts program execution.
code:yydebug
YYDEBUG statement, possibly specialized for different APIs.
Supported variables: YYDEBUG, yyrecord, yych (map to the
corresponding re2c: configurations), state (DFA state
number).
code:yypeek
YYPEEK statement, possibly specialized for different APIs.
Supported variables: YYPEEK, YYCTYPE, YYINPUT, YYCURSOR,
yyrecord, yych (map to the corresponding re2c:
configurations). Conditionals: .cast (true if
re2c:yych:conversion is set to non-zero).
code:yyskip
YYSKIP statement, possibly specialized for different APIs.
Supported variables: YYSKIP, YYCURSOR, yyrecord (map to the
corresponding re2c: configurations).
code:yybackup
YYBACKUP statement, possibly specialized for different APIs.
Supported variables: YYBACKUP, YYCURSOR, YYMARKER, yyrecord
(map to the corresponding re2c: configurations).
code:yybackupctx
YYBACKUPCTX statement, possibly specialized for different
APIs. Supported variables: YYBACKUPCTX, YYCURSOR,
YYCTXMARKER, yyrecord (map to the corresponding re2c:
configurations).
code:yyskip_yypeek
Combined code:yyskip and code:yypeek statement (defaults to
code:yyskip followed by code:yypeek).
code:yypeek_yyskip
Combined code:yypeek and code:yyskip statement (defaults to
code:yypeek followed by code:yyskip).
code:yyskip_yybackup
Combined code:yyskip and code:yybackup statement (defaults to
code:yyskip followed by code:yybackup).
code:yybackup_yyskip
Combined code:yybackup and code:yyskip statement (defaults to
code:yybackup followed by code:yyskip).
code:yybackup_yypeek
Combined code:yybackup and code:yypeek statement (defaults to
code:yybackup followed by code:yypeek).
code:yyskip_yybackup_yypeek
Combined code:yyskip, code:yybackup and code:yypeek statement
(defaults to code:yyskip followed by code:yybackup
followed by code:yypeek).
code:yybackup_yypeek_yyskip
Combined code:yybackup, code:yypeek and code:yyskip statement
(defaults to code:yybackup followed by code:yypeek
followed by code:yyskip).
code:yyrestore
YYRESTORE statement, possibly specialized for different APIs.
Supported variables: YYRESTORE, YYCURSOR, YYMARKER, yyrecord
(map to the corresponding re2c: configurations).
code:yyrestorectx
YYRESTORECTX statement, possibly specialized for different
APIs. Supported variables: YYRESTORECTX, YYCURSOR,
YYCTXMARKER, yyrecord (map to the corresponding re2c:
configurations).
code:yyrestoretag
YYRESTORETAG statement, possibly specialized for different
APIs. Supported variables: YYRESTORETAG, YYCURSOR, yyrecord
(map to the corresponding re2c: configurations), tag (the
name of tag variable used to restore position).
code:yyshift
YYSHIFT statement, possibly specialized for different APIs.
Supported variables: YYSHIFT, YYCURSOR, yyrecord (map to the
corresponding re2c: configurations), offset (the number of
code units to shift the current position).
code:yyshiftstag
YYSHIFTSTAG statement, possibly specialized for different
APIs. Supported variables: YYSHIFTSTAG, yyrecord, negative
(map to the corresponding re2c: configurations), tag (tag
variable which needs to be shifted), offset (the number of
code units to shift). Conditionals: .nested (true if this is
a nested tag -- in this case its value may be equal to
re2c:tags:negative, which should not be shifted).
code:yyshiftmtag
YYSHIFTMTAG statement, possibly specialized for different
APIs. Supported variables: YYSHIFTMTAG (maps to the
corresponding re2c: configuration), tag (tag variable which
needs to be shifted), offset (the number of code units to
shift).
code:yystagp
YYSTAGP statement, possibly specialized for different APIs.
Supported variables: YYSTAGP, YYCURSOR, yyrecord (map to the
corresponding re2c: configurations), tag (tag variable that
should be updated).
code:yymtagp
YYMTAGP statement, possibly specialized for different APIs.
Supported variables: YYMTAGP (maps to the corresponding re2c:
configuration), tag (tag variable that should be updated).
code:yystagn
YYSTAGN statement, possibly specialized for different APIs.
Supported variables: YYSTAGN, negative, yyrecord (map to the
corresponding re2c: configurations), tag (tag variable that
should be updated).
code:yymtagn
YYMTAGN statement, possibly specialized for different APIs.
Supported variables: YYMTAGN (maps to the corresponding re2c:
configuration), tag (tag variable that should be updated).
code:yycopystag
YYCOPYSTAG statement, possibly specialized for different
APIs. Supported variables: YYCOPYSTAG, yyrecord (map to the
corresponding re2c: configurations), lhs, rhs (left and right
hand side tag variables of the copy operation).
code:yycopymtag
YYCOPYMTAG statement, possibly specialized for different
APIs. Supported variables: YYCOPYMTAG, yyrecord (map to the
corresponding re2c: configurations), lhs, rhs (left and right
hand side tag variables of the copy operation).
code:yygetaccept
YYGETACCEPT statement, possibly specialized for different
APIs. Supported variables: YYGETACCEPT, yyrecord (map to the
corresponding re2c: configurations), var (maps to
re2c:yyaccept configuration).
code:yysetaccept
YYSETACCEPT statement, possibly specialized for different
APIs. Supported variables: YYSETACCEPT, yyrecord (map to the
corresponding re2c: configurations), var (maps to
re2c:yyaccept configuration) and val (numeric value of the
accepted rule).
code:yygetcond
YYGETCOND statement, possibly specialized for different APIs.
Supported variables: YYGETCOND, yyrecord (map to the
corresponding re2c: configurations), var (maps to re2c:yycond
configuration).
code:yysetcond
YYSETCOND statement, possibly specialized for different APIs.
Supported variables: YYSETCOND, yyrecord (map to the
corresponding re2c: configurations), var (maps to re2c:yycond
configuration) and val (numeric condition identifier).
code:yygetstate
YYGETSTATE statement, possibly specialized for different
APIs. Supported variables: YYGETSTATE, yyrecord (map to the
corresponding re2c: configurations), var (maps to
re2c:yystate configuration).
code:yysetstate
YYSETSTATE statement, possibly specialized for different
APIs. Supported variables: YYSETSTATE, yyrecord (map to the
corresponding re2c: configurations), var (maps to
re2c:yystate configuration) and val (state number).
code:yylessthan
YYLESSTHAN statement, possibly specialized for different
APIs. Supported variables: YYLESSTHAN, YYCURSOR, YYLIMIT,
yyrecord (map to the corresponding re2c: configurations),
need (the number of code units to check against).
Conditional: .many (true if the need is more than one).
code:yybm_filter
Condition that is used to filter out yych values that are not
covered by the yybm table (used with --bitmaps option).
Supported variable: yych (maps to re2c:yych configuration).
code:yybm_match
The format of yybm table check (generated with --bitmaps
option). Supported variables: yybm, yych (map to the
corresponding re2c: configurations), offset (offset in the
yybm table that needs to be added to yych) and mask (bit mask
that should be applied to the table entry to retrieve the
boolean value that needs to be checked).
code:yytarget_filter
Condition that is used to filter out yych values that are not
covered by the yytarget table (used with --computed-gotos
option). Supported variable: yych (maps to re2c:yych
configuration).
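To make the description of code:yybm_match in the list above more
concrete, here is a minimal sketch in Go of a bitmap table check. It is
an illustration under assumptions, not the code that re2go generates:
buildBitmap, match and isDigit are hypothetical helpers, and the table
is filled for a single character class with a single mask bit.

```go
package main

import "fmt"

// buildBitmap builds a 256-entry table in which the bits of mask are set
// in entry c when c belongs to the given character class.
func buildBitmap(class func(byte) bool, mask byte) [256]byte {
	var yybm [256]byte
	for c := 0; c < 256; c++ {
		if class(byte(c)) {
			yybm[c] |= mask
		}
	}
	return yybm
}

// match indexes the table at offset+yych and tests the entry against
// the mask, which is the kind of check the template describes.
func match(yybm []byte, offset int, yych byte, mask byte) bool {
	return yybm[offset+int(yych)]&mask != 0
}

func isDigit(c byte) bool { return '0' <= c && c <= '9' }

func main() {
	yybm := buildBitmap(isDigit, 128)
	fmt.Println(match(yybm[:], 0, '7', 128), match(yybm[:], 0, 'x', 128))
}
```

Because each table entry is a byte, several character classes can share
one table by using different mask bits.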
Here's a list of all global variables that are allowed in syntax
files:
nl A newline.
indent A variable that does not produce any code, but has a
side-effect of increasing indentation level.
dedent A variable that does not produce any code, but has a
side-effect of decreasing indentation level.
topindent
Indentation string for the current statement. Indentation
level is tracked and automatically updated by the code
generator.
Here's a list of all global conditionals that are allowed in syntax
files:
.api.simple
True if simple API is used (--api simple or re2c:api =
simple).
.api.generic
True if generic API is used (--api generic or re2c:api =
generic).
.api.record
True if record API is used (--api record or re2c:api =
record).
.api_style.functions
True if function-like API style is used (re2c:api-style =
functions).
.api_style.freeform
True if free-form API style is used (re2c:api-style =
free-form).
.case_ranges
True if case ranges feature is enabled (--case-ranges or
re2c:case-ranges = 1).
.cgoto.relative
True if the relative form of computed goto is used
(--computed-gotos-relative or re2c:cgoto:relative = 1).
.code_model.goto_label
True if code model based on goto/label is used
(--goto-label).
.code_model.loop_switch
True if code model based on loop/switch is used
(--loop-switch).
.code_model.recursive_functions
True if code model based on recursive functions is used
(--recursive-functions).
.date True if the generated fingerprint should contain generation
date.
.loop_label
True if re2go generated loops must have a label
(re2c:label:yyloop is set to a nonempty string).
.monadic
True if the generated code should be monadic (re2c:monadic =
1). This is only relevant for pure functional languages.
.start_conditions
True if start conditions are enabled (--start-conditions).
.storable_state
True if storable state is enabled (--storable-state).
.unsafe
True if re2go should use "unsafe" blocks in order to generate
faster code (--unsafe, re2c:unsafe = 1). This is only
relevant for languages that have "unsafe" feature.
.version
True if the generated fingerprint should contain re2go
version.
.yyfn.throw
True if re2c:yyfn:throw configuration is defined to a
nonempty string.
HANDLING THE END OF INPUT
One of the main problems for the lexer is to know when to stop. There
are a few terminating conditions:
o the lexer may match some rule (including default rule *) and come to
a final state
o the lexer may fail to match any rule and come to a default state
o the lexer may reach the end of input
The first two conditions terminate the lexer in a "natural" way: it
comes to a state with no outgoing transitions, and the matching
automatically stops. The third condition, end of input, is different:
it may happen in any state, and the lexer should be able to handle it.
Checking for the end of input interrupts the normal lexer workflow and
adds conditional branches to the generated program, therefore it is
necessary to minimize the number of such checks. re2go supports a few
different methods for handling the end of input. Which one to use
depends on the complexity of regular expressions, the need for
buffering, performance considerations and other factors. Here is a list
of methods:
o Sentinel. This method eliminates the need for the end of input checks
altogether. It is simple and efficient, but limited to the case when
there is a natural "sentinel" character that can never occur in valid
input. This character may still occur in invalid input, but it should
not be allowed by the regular expressions, except perhaps as the last
character of a rule. The sentinel is appended at the end of input and
serves as a stop signal: when the lexer reads this character, it is
either a syntax error or the end of input. In both cases the lexer
should stop. This method is used if YYFILL is disabled with
re2c:yyfill:enable = 0; and re2c:eof has the default value -1.
o Sentinel with bounds checks. This method is generic: it allows one to
handle any input without restrictions on the regular expressions. The
idea is to reduce the number of end of input checks by performing
them only on certain characters. Similar to the "sentinel" method,
one of the characters is chosen as a "sentinel" and appended at the
end of input. However, there is no restriction on where the sentinel
may occur (in fact, any character can be chosen for a sentinel).
When the lexer reads this character, it additionally performs a
bounds check. If the current position is within bounds, the lexer
resumes matching and handles the sentinel as a regular character.
Otherwise it invokes YYFILL (unless it is disabled). If more input is
supplied, the lexer will rematch the last character and continue as
if the sentinel wasn't there. Otherwise it must be the real end of
input, and the lexer stops. This method is used when re2c:eof has a
non-negative value (it should be set to the numeric value of the
sentinel). YYFILL is optional.
o Bounds checks with padding. This method is generic, and it may be
faster than the "sentinel with bounds checks" method, but it is also
more complex. The idea is to partition DFA states into strongly
connected components (SCCs) and generate a single check per SCC for
enough characters to cover the longest non-looping path in this SCC.
This reduces the number of checks, but there is a problem with short
lexemes at the end of input, as the check requires enough characters
to cover the longest lexeme. This can be fixed by padding the input
with a few fake characters that do not form a valid lexeme suffix (so
that the lexer cannot match them). The length of padding should be
YYMAXFILL, generated with a max block. If there is not enough input,
the lexer invokes YYFILL which should supply at least the required
number of characters or not return. This method is used if YYFILL is
enabled and re2c:eof is -1 (this is the default configuration).
o Custom checks. Generic API allows one to override basic operations
like reading a character, which makes it possible to include the
end-of-input checks as part of them. This approach is error-prone
and should be used with caution. To use a custom method, enable
generic API with --api generic or re2c:api = generic; and disable
default bounds checks with re2c:yyfill:enable = 0; or
re2c:yyfill:check = 0;.
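The two generic methods differ mainly in the form of the bounds check
and in when it is evaluated. As a rough sketch in Go (the function
names eofRuleCheck and paddingCheck are hypothetical, and positions are
assumed to be integer offsets into the input), the conditions can be
written as:

```go
package main

import "fmt"

// eofRuleCheck is the condition used by the "sentinel with bounds
// checks" method. It is evaluated only after the sentinel was read.
func eofRuleCheck(cursor, limit int) bool {
	return limit <= cursor
}

// paddingCheck is the condition used by the "bounds checks with
// padding" method: true when fewer than n characters remain.
func paddingCheck(cursor, limit, n int) bool {
	return limit-cursor < n
}

func main() {
	fmt.Println(eofRuleCheck(5, 5), paddingCheck(3, 5, 3))
}
```

In both cases a true condition means that the lexer has run out of
input and either invokes YYFILL or stops.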
The following subsections contain an example of each method.
Sentinel
This example uses a sentinel character to handle the end of input. The
program counts space-separated words in a null-terminated string. The
sentinel is null: it is the last character of each input string, and it
is not allowed in the middle of a lexeme by any of the rules (in
particular, it is not included in character ranges where it is easy to
overlook). If a null occurs in the middle of a string, it is a syntax
error and the lexer will match default rule *, but it won't read past
the end of input or crash (use -Wsentinel-in-midrule warning and
re2c:sentinel configuration to verify this). Configuration
re2c:yyfill:enable = 0; suppresses the generation of bounds checks and
YYFILL invocations.
//go:generate re2go $INPUT -o $OUTPUT --api simple
package main
// Expect a null-terminated string.
func lex(yyinput string) int {
yycursor := 0
count := 0
for { /*!re2c
re2c:yyfill:enable = 0;
re2c:YYCTYPE = byte;
* { return -1 }
[\x00] { return count }
[a-z]+ { count += 1; continue }
[ ]+ { continue }
*/
}
}
func main() {
assert_eq := func(x, y int) { if x != y { panic("error") } }
assert_eq(lex("\000"), 0)
assert_eq(lex("one two three\000"), 3)
assert_eq(lex("f0ur\000"), -1)
}
Sentinel with bounds checks
This example uses sentinel with bounds checks to handle the end of
input (this method was added in version 1.2). The program counts
space-separated single-quoted strings. The sentinel character is null,
which is specified with re2c:eof = 0; configuration. As in the sentinel
method, null is the last character of each input string, but it is
allowed in the middle of a rule (for example, 'aaa\0aa'\0 is valid
input, but 'aaa\0 is a syntax error). Bounds checks are generated in
each state that matches an input character, but they are scoped to the
branch that handles null. Bounds checks are of the form YYLIMIT <=
YYCURSOR or YYLESSTHAN(1) with generic API. If the check condition is
true, the lexer has reached the end of input and should stop (YYFILL is
disabled with re2c:yyfill:enable = 0; as the input fits into one
buffer, see the YYFILL with sentinel section for an example that uses
YYFILL). Reaching the end of input opens three possibilities: if the
lexer is in the initial state it will match the end-of-input rule $,
otherwise it may fall back to a previously matched rule (including
default rule *) or go to a default state, causing
-Wundefined-control-flow.
//go:generate re2go $INPUT -o $OUTPUT --api simple
package main
// Expects a null-terminated string.
func lex(yyinput string) int {
var yycursor, yymarker int
yylimit := len(yyinput) - 1 // yylimit points at the terminating null
count := 0
for { /*!re2c
re2c:YYCTYPE = byte;
re2c:yyfill:enable = 0;
re2c:eof = 0;
str = ['] ([^'\\] | [\\][^])* ['];
* { return -1 }
$ { return count }
str { count += 1; continue }
[ ]+ { continue }
*/
}
}
func main() {
assert_eq := func(x, y int) { if x != y { panic("error") } }
assert_eq(lex("\000"), 0)
assert_eq(lex("'qu\000tes' 'are' 'fine: \\'' \000"), 3)
assert_eq(lex("'unterminated\\'\000"), -1)
}
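The idea that the bounds check is scoped to the branch that handles the
sentinel can also be shown without re2go. The following hand-written Go
sketch is only an illustration (countNulls is a hypothetical function,
and it assumes a non-empty, null-terminated buffer): it counts the nulls
that are part of the data and performs a bounds check only when a null
is read.

```go
package main

import "fmt"

// countNulls walks a null-terminated buffer and counts the nulls that
// belong to the data. The input must be non-empty and end with a null.
func countNulls(input []byte) int {
	limit := len(input) - 1 // offset of the terminating sentinel
	count := 0
	for cursor := 0; ; cursor++ {
		if input[cursor] != 0 {
			continue // ordinary character: no bounds check
		}
		if limit <= cursor {
			return count // sentinel at limit: real end of input
		}
		count++ // null inside the input: a regular character
	}
}

func main() {
	fmt.Println(countNulls([]byte("qu\x00tes\x00\x00")))
}
```

All characters other than null are consumed without any check, which is
why this method is cheaper than checking bounds on every character.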
Bounds checks with padding
This example uses bounds checks with padding to handle the end of input
(this method is enabled by default). The program counts space-separated
single-quoted strings. There is a padding of YYMAXFILL null characters
appended at the end of input, where YYMAXFILL value is autogenerated
with a max block. It is not necessary to use null for padding --- any
characters can be used as long as they do not form a valid lexeme
suffix (in this example padding should not contain single quotes, as
they may be mistaken for a suffix of a single-quoted string). There is
a "stop" rule that matches the first padding character (null) and
terminates the lexer (note that it checks if null is at the beginning
of padding, otherwise it is a syntax error). Bounds checks are
generated only in some states that are determined by the strongly
connected components of the underlying automaton. Checks have the form
(YYLIMIT - YYCURSOR) < n or YYLESSTHAN(n) with generic API, where n is
the minimum number of characters that are needed for the lexer to
proceed (it also means that the next bounds check will occur in at most
n characters). If the check condition is true, the lexer has reached
the end of input and will invoke YYFILL(n) that should either supply at
least n input characters or not return. In this example YYFILL always
fails and terminates the lexer with an error (which is fine because the
input fits into one buffer). See the YYFILL with padding section for an
example that refills the input buffer with YYFILL.
//go:generate re2go $INPUT -o $OUTPUT --api simple
package main
import "strings"
/*!max:re2c*/
// Expects YYMAXFILL-padded string.
func lex(str string) int {
// Pad string with YYMAXFILL zeroes at the end.
yyinput := str + strings.Repeat("\000", int(YYMAXFILL))
yycursor := 0
yylimit := len(yyinput)
count := 0
for { /*!re2c
re2c:YYCTYPE = byte;
re2c:YYFILL = "return -1";
str = ['] ([^'\\] | [\\][^])* ['];
[\x00] {
// Check that it is the sentinel, not some unexpected null.
if yycursor - 1 == len(str) { return count } else { return -1 }
}
str { count += 1; continue }
[ ]+ { continue }
* { return -1 }
*/
}
}
func main() {
assert_eq := func(x, y int) { if x != y { panic("error") } }
assert_eq(lex(""), 0)
assert_eq(lex("'qu\000tes' 'are' 'fine: \\'' "), 3)
assert_eq(lex("'unterminated\\'"), -1)
assert_eq(lex("'unexpected \000 null\\'"), -1)
}
Custom checks
This example uses a custom end-of-input handling method based on
generic API. The program counts space-separated words in a string.
It is the same as the sentinel example, except that the input is not
null-terminated. To cover up for the absence of a sentinel character at
the end of input, YYPEEK is redefined to perform a bounds check before
it reads the next input character. This is inefficient because checks
are done very often. If the check condition fails, YYPEEK returns the
real character, otherwise it returns a fake sentinel character.
//go:generate re2go $INPUT -o $OUTPUT
package main
// Returns "fake" terminating null if cursor has reached limit.
func peek(str string, cur int) byte {
if cur >= len(str) {
return 0 // fake null
} else {
return str[cur]
}
}
// Expects a string without terminating null.
func lex(str string) int {
var cur int
count := 0
for { /*!re2c
re2c:yyfill:enable = 0;
re2c:YYCTYPE = byte;
re2c:YYPEEK = "peek(str, cur)";
re2c:YYSKIP = "cur += 1";
* { return -1 }
[\x00] { return count }
[a-z]+ { count += 1; continue }
[ ]+ { continue }
*/
}
}
func main() {
assert_eq := func(x, y int) { if x != y { panic("error") } }
assert_eq(lex(""), 0)
assert_eq(lex("one two three"), 3)
assert_eq(lex("f0ur"), -1)
}
BUFFER REFILLING
The need for buffering arises when the input cannot be mapped in memory
all at once: either it is too large, or it comes in a streaming fashion
(like reading from a socket). The usual technique in such cases is to
allocate a fixed-sized memory buffer and process input in chunks that
fit into the buffer. When the current chunk is processed, it is moved
out and new data is moved in. In practice it is somewhat more complex,
because lexer state consists not of a single input position, but a set
of interrelated positions:
o cursor: the next input character to be read (YYCURSOR in C pointer
API or YYSKIP/YYPEEK in generic API)
o limit: the position after the last available input character (YYLIMIT
in C pointer API, implicitly handled by YYLESSTHAN in generic API)
o marker: the position of the most recent match, if any (YYMARKER in
default API or YYBACKUP/YYRESTORE in generic API)
o token: the start of the current lexeme (implicit in re2go API, as it
is not needed for the normal lexer operation and can be defined and
updated by the user)
o context marker: the position of the trailing context (YYCTXMARKER in
C pointer API or YYBACKUPCTX/YYRESTORECTX in generic API)
o tag variables: submatch positions (defined with stags and mtags
blocks and generic API primitives YYSTAGP/YYSTAGN/YYMTAGP/YYMTAGN)
Not all these are used in every case, but if used, they must be updated
by YYFILL. All active positions are contained in the segment between
token and cursor, therefore everything between buffer start and token
can be discarded, the segment from token and up to limit should be
moved to the beginning of buffer, and the free space at the end of
buffer should be filled with new data. In order to avoid frequent
YYFILL calls it is best to fill in as many input characters as possible
(even though fewer characters might suffice to resume the lexer). The
details of YYFILL implementation are slightly different depending on
which EOF handling method is used: the case of EOF rule is somewhat
simpler than the case of bounds-checking with padding. Also note that
if -f --storable-state option is used, YYFILL has slightly different
semantics (described in the section about storable state).
YYFILL with sentinel
If EOF rule is used, YYFILL is a function-like primitive that accepts
no arguments and returns a value which is checked against zero. YYFILL
invocation is triggered by condition YYLIMIT <= YYCURSOR in C pointer
API and YYLESSTHAN() in generic API. A non-zero return value means that
YYFILL has failed. A successful YYFILL call must supply at least one
character and adjust input positions accordingly. Limit must always be
set to one after the last input position in buffer, and the character
at the limit position must be the sentinel symbol specified by re2c:eof
configuration. The pictures below show the relative locations of input
positions in buffer before and after YYFILL call (sentinel symbol is
marked with #, and the second picture shows the case when there is not
enough input to fill the whole buffer).
<-- shift -->
>-A------------B---------C-------------D#-----------E->
buffer token marker limit,
cursor
>-A------------B---------C-------------D------------E#->
buffer, marker cursor limit
token
<-- shift -->
>-A------------B---------C-------------D#--E (EOF)
buffer token marker limit,
cursor
>-A------------B---------C-------------D---E#........
buffer, marker cursor limit
token
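The shift shown in the pictures can be traced with concrete offsets.
The sketch below is a simplified in-memory model written in Go, not
part of re2go: the buffer type, newBuffer and the use of io.ReadFull
are assumptions made for illustration. It discards everything before
token, moves the rest to the start of the buffer, reads new data into
the freed space and puts the sentinel at limit.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// buffer is a hypothetical stand-in for the lexer state.
type buffer struct {
	r                            io.Reader
	data                         []byte // one extra byte for the sentinel
	token, marker, cursor, limit int
	eof                          bool
}

// newBuffer starts with all positions at the end of an empty buffer, so
// that the first fill discards nothing and reads a whole chunk.
func newBuffer(src string, size int) *buffer {
	return &buffer{r: strings.NewReader(src), data: make([]byte, size+1),
		token: size, marker: size, cursor: size, limit: size}
}

// fill reports whether at least one new character was supplied.
func (b *buffer) fill() bool {
	if b.eof || b.token < 1 {
		return false // no more input, or the lexeme fills the buffer
	}
	shift := b.token
	copy(b.data, b.data[b.token:b.limit])
	b.token, b.marker, b.cursor, b.limit =
		0, b.marker-shift, b.cursor-shift, b.limit-shift
	size := len(b.data) - 1
	n, err := io.ReadFull(b.r, b.data[b.limit:size])
	b.limit += n
	b.data[b.limit] = 0 // sentinel right after the last input byte
	b.eof = err != nil  // short read: the source is exhausted
	return n > 0
}

func main() {
	b := newBuffer("abcdefghij", 4)
	fmt.Println(b.fill(), string(b.data[:b.limit]))
	// Pretend the lexer consumed "abcd" and the current lexeme starts at "d".
	b.token, b.marker, b.cursor = 3, 3, 4
	fmt.Println(b.fill(), string(b.data[:b.limit]), b.token, b.cursor, b.limit)
}
```

After the second call the buffer holds "defg": the byte at the old token
offset has moved to offset 0, and cursor and limit have been decreased
by the same shift before new data was appended.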
Here is an example of a program that reads the input file (named input
in the code below) in chunks of 4096 bytes and uses the EOF rule.
//go:generate re2go $INPUT -o $OUTPUT
package main
import (
"os"
"strings"
)
const BUFSIZE uint = 4096
type Input struct {
file *os.File
yyinput []byte
yycursor uint
yymarker uint
yylimit uint
token uint
eof bool
}
func fill(in *Input) int {
if in.eof { return -1 } // unexpected EOF
// Error: lexeme too long. In real life can reallocate a larger buffer.
if in.token < 1 { return -2 }
// Shift buffer contents (discard everything up to the current token).
copy(in.yyinput[0:], in.yyinput[in.token:in.yylimit])
in.yycursor -= in.token
in.yymarker -= in.token
in.yylimit -= in.token
in.token = 0
// Fill free space at the end of buffer with new data from file.
n, _ := in.file.Read(in.yyinput[in.yylimit:BUFSIZE])
in.yylimit += uint(n)
in.yyinput[in.yylimit] = 0
// If read less than expected, this is the end of input.
in.eof = in.yylimit < BUFSIZE
return 0
}
func lex(yyrecord *Input) int {
count := 0
for {
yyrecord.token = yyrecord.yycursor
/*!re2c
re2c:api = record;
re2c:eof = 0;
re2c:YYCTYPE = byte;
re2c:YYFILL = "fill(yyrecord) == 0";
str = ['] ([^'\\] | [\\][^])* ['];
* { return -1 }
$ { return count }
str { count += 1; continue }
[ ]+ { continue }
*/
}
}
func main() () {
fname := "input"
content := "'qu\000tes' 'are' 'fine: \\'' ";
// Prepare input file: a few times the size of the buffer, containing
// strings with zeroes and escaped quotes.
f, _ := os.Create(fname)
f.WriteString(strings.Repeat(content, int(BUFSIZE)))
f.Seek(0, 0)
count := 3 * int(BUFSIZE) // number of quoted strings written to file
// Prepare lexer state: all offsets are at the end of buffer.
in := &Input{
file: f,
// Sentinel at `yylimit` offset is set to zero, which triggers YYFILL.
yyinput: make([]byte, BUFSIZE+1),
yycursor: BUFSIZE,
yymarker: BUFSIZE,
yylimit: BUFSIZE,
token: BUFSIZE,
eof: false,
}
// Run the lexer.
if lex(in) != count { panic("error"); }
// Cleanup: remove input file.
f.Close();
os.Remove(fname);
}
YYFILL with padding
In the default case (when EOF rule is not used) YYFILL is a
function-like primitive that accepts a single argument and does not
return any value. YYFILL invocation is triggered by condition (YYLIMIT
- YYCURSOR) < n in C pointer API and YYLESSTHAN(n) in generic API. The
argument passed to YYFILL is the minimal number of characters that must
be supplied. If it fails to do so, YYFILL must not return to the lexer
(for that reason it is best implemented as a macro that returns from
the calling function on failure). In case of a successful YYFILL
invocation the limit position must be set either to one after the last
input position in buffer, or to the end of YYMAXFILL padding (in case
YYFILL has successfully read at least n characters, but not enough to
fill the entire buffer). The pictures below show the relative locations
of input positions in buffer before and after YYFILL invocation
(YYMAXFILL padding on the second picture is marked with # symbols).
  <-- shift -->                 <-- need -->
>-A------------B---------C-----D-------E---F--------G->
buffer       token    marker cursor  limit
>-A------------B---------C-----D-------E---F--------G->
             buffer,  marker cursor               limit
             token
  <-- shift -->                 <-- need -->
>-A------------B---------C-----D-------E-F (EOF)
buffer       token    marker cursor  limit
>-A------------B---------C-----D-------E-F###############
             buffer,  marker cursor                    limit
             token                        <- YYMAXFILL ->
Here is an example of a program that reads a file named input in
chunks of 4096 bytes and uses bounds-checking with padding.
//go:generate re2go $INPUT -o $OUTPUT
package main
import (
"os"
"strings"
)
/*!max:re2c*/
const BUFSIZE uint = 4096
type Input struct {
file *os.File
yyinput []byte
yycursor uint
yylimit uint
token uint
eof bool
}
func fill(in *Input, need uint) int {
if in.eof { return -1 } // unexpected EOF
// Error: lexeme too long. In real life can reallocate a larger buffer.
if in.token < need { return -2 }
// Shift buffer contents (discard everything up to the current token).
copy(in.yyinput[0:], in.yyinput[in.token:in.yylimit])
in.yycursor -= in.token
in.yylimit -= in.token
in.token = 0
// Fill free space at the end of buffer with new data from file.
n, _ := in.file.Read(in.yyinput[in.yylimit:BUFSIZE])
in.yylimit += uint(n)
// If read less than expected, this is end of input => add zero padding
// so that the lexer can access characters at the end of buffer.
if in.yylimit < BUFSIZE {
in.eof = true
for i := uint(0); i < YYMAXFILL; i += 1 { in.yyinput[in.yylimit+i] = 0 }
in.yylimit += YYMAXFILL
}
return 0
}
func lex(yyrecord *Input) int {
count := 0
for {
yyrecord.token = yyrecord.yycursor
/*!re2c
re2c:api = record;
re2c:YYCTYPE = byte;
re2c:YYFILL = "if r := fill(yyrecord, @@); r != 0 { return r }";
str = ['] ([^'\\] | [\\][^])* ['];
[\x00] {
// Check that it is the sentinel, not some unexpected null.
if yyrecord.token == yyrecord.yylimit - YYMAXFILL { return count } else { return -1 }
}
str { count += 1; continue }
[ ]+ { continue }
* { return -1 }
*/
}
}
func main() () {
fname := "input"
content := "'qu\000tes' 'are' 'fine: \\'' ";
// Prepare input file: a few times the size of the buffer, containing
// strings with zeroes and escaped quotes.
f, _ := os.Create(fname)
f.WriteString(strings.Repeat(content, int(BUFSIZE)))
f.Seek(0, 0)
count := 3 * int(BUFSIZE) // number of quoted strings written to file
// Prepare lexer state: all offsets are at the end of buffer.
// This immediately triggers YYFILL, as the YYLESSTHAN condition is true.
in := &Input{
file: f,
yyinput: make([]byte, BUFSIZE+YYMAXFILL),
yycursor: BUFSIZE,
yylimit: BUFSIZE,
token: BUFSIZE,
eof: false,
}
// Run the lexer.
if lex(in) != count { panic("error"); }
// Cleanup: remove input file.
f.Close();
os.Remove(fname);
}
FEATURES
Multiple blocks
Sometimes it is necessary to have multiple interrelated lexers (for
example, if there is a high-level state machine that transitions
between lexer modes). This can be implemented using multiple connected
re2go blocks. Another option is to use start conditions.
The implementation of connections between blocks depends on the target
language. In languages that have goto statement (such as C/C++ and Go)
one can have all blocks in one function, each of them prefixed with a
label. Transition from one block to another is a simple goto. In
languages that do not have goto (such as Rust) it is necessary to use a
loop with a switch on a state variable, similar to the yystate
loop/switch generated by re2go, or else wrap each block in a function
and use function calls.
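For comparison, the loop-with-switch alternative can be written by hand in Go. The sketch below is not re2go output and covers only binary and decimal numbers; the mode constants and the parse function are hypothetical, overflow is ignored, and the input is assumed to be NUL-terminated.

```go
package main

import "fmt"

// Hypothetical lexer modes; each plays the role of a separate block.
const (
	modeInit = iota
	modeBin
	modeDec
)

// parse recognizes "0b" followed by binary digits, or a decimal number.
// The input must end with a NUL byte.
func parse(s string) (uint64, bool) {
	mode, pos, result := modeInit, 0, uint64(0)

	// digit consumes one digit in the given base, if c is one.
	digit := func(c byte, base uint64) bool {
		if c < '0' || uint64(c-'0') >= base {
			return false
		}
		result = result*base + uint64(c-'0')
		pos++
		return true
	}

	for {
		c := s[pos]
		switch mode {
		case modeInit: // decide the base and switch to another mode
			switch {
			case c == '0' && s[pos+1] == 'b' && (s[pos+2] == '0' || s[pos+2] == '1'):
				pos += 2 // consume the "0b" prefix
				mode = modeBin
			case c >= '1' && c <= '9':
				mode = modeDec // consume nothing
			default:
				return 0, false
			}
		case modeBin:
			if c == 0 {
				return result, true
			}
			if !digit(c, 2) {
				return 0, false
			}
		case modeDec:
			if c == 0 {
				return result, true
			}
			if !digit(c, 10) {
				return 0, false
			}
		}
	}
}

func main() {
	fmt.Println(parse("0b1101\x00")) // 13 true
	fmt.Println(parse("1234\x00"))   // 1234 true
	fmt.Println(parse("12a\x00"))    // 0 false
}
```

A transition between modes is an assignment to the mode variable followed by another loop iteration, where the goto version would jump to a label.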
The example below uses multiple blocks to parse binary, octal, decimal
and hexadecimal numbers. Each base has its own block. The initial block
determines base and dispatches to other blocks. Common configurations
are defined in a separate block at the beginning of the program; they
are inherited by the other blocks.
//go:generate re2go $INPUT -o $OUTPUT -i --api simple
package main
import "errors"
const u32Limit uint64 = 1<<32
var (
eSyntax = errors.New("syntax error")
eOverflow = errors.New("overflow error")
)
func parse_u32(yyinput string) (uint32, error) {
var yycursor, yymarker int
result := uint64(0)
add := func(base uint64, offset byte) {
result = result * base + uint64(yyinput[yycursor-1] - offset)
if result >= u32Limit {
result = u32Limit
}
}
/*!re2c
re2c:yyfill:enable = 0;
re2c:YYCTYPE = byte;
end = "\x00";
'0b' / [01] { goto bin }
"0" { goto oct }
"" / [1-9] { goto dec }
'0x' / [0-9a-fA-F] { goto hex }
* { goto err }
*/
bin:
/*!re2c
end { goto end }
[01] { add(2, '0'); goto bin }
* { goto err }
*/
oct:
/*!re2c
end { goto end }
[0-7] { add(8, '0'); goto oct }
* { goto err }
*/
dec:
/*!re2c
end { goto end }
[0-9] { add(10, '0'); goto dec }
* { goto err }
*/
hex:
/*!re2c
end { goto end }
[0-9] { add(16, '0'); goto hex }
[a-f] { add(16, 'a'-10); goto hex }
[A-F] { add(16, 'A'-10); goto hex }
* { goto err }
*/
end:
if result < u32Limit {
return uint32(result), nil
} else {
return 0, eOverflow
}
err:
return 0, eSyntax
}
func main() {
test := func(num uint32, str string, err error) {
if n, e := parse_u32(str); !(n == num && e == err) {
panic("error")
}
}
test(1234567890, "1234567890\000", nil)
test(13, "0b1101\000", nil)
test(0x7fe, "0x007Fe\000", nil)
test(0644, "0644\000", nil)
test(0, "9999999999\000", eOverflow)
test(0, "123??\000", eSyntax)
}
Start conditions
Start conditions are enabled with --start-conditions option. They
provide a way to encode multiple interrelated automata within the same
re2go block.
Each condition corresponds to a single automaton and has a unique name
specified by the user and a unique internal number defined by re2go.
The numbers are used to switch between conditions: the generated code
uses YYGETCOND and YYSETCOND primitives to get the current condition or
set it to the given number. Use conditions block, --header option or
re2c:header configuration to generate numeric condition identifiers.
Configuration re2c:cond:enumprefix specifies the generated identifier
prefix.
In condition mode every rule must be prefixed with a list of
comma-separated condition names in angle brackets, or a wildcard <*> to
denote all conditions. The rule syntax is extended as follows:
< condition-list > regular-expression code
A rule that is merged to every condition on the
condition-list. It matches regular-expression and executes
the associated code.
< condition-list > regular-expression => condition code
A rule that is merged to every condition on the
condition-list. It matches regular-expression, sets the
current condition to condition and executes the associated
code.
< condition-list > regular-expression :=> condition
A rule that is merged to every condition on the
condition-list. It matches regular-expression and
immediately transitions to condition (there is no semantic
action).
< condition-list > !action code
A rule that binds code to the place defined by action in
every condition on the condition-list (see the actions
section for various types of actions).
<! condition-list > code
A rule that prepends code to semantic actions of all rules
for every condition on the condition-list. This syntax is
deprecated and the !pre_rule action should be used instead
(it does exactly the same).
< > code
A rule that creates a special entry condition with number
zero and name "0" that executes code before jumping to other
conditions. This syntax is deprecated, and the !entry action
should be used instead (it provides a more fine-grained
control, as the code can be specified on a per-condition
basis, and one can jump directly to condition start without
going through condition dispatch).
< > => condition code
Same as the previous rule, except that it sets the next
condition.
< > :=> condition
Same as the previous rule, except that it has no associated
code and immediately jumps to condition.
The code re2go generates for conditions depends on whether re2go uses
goto/label approach or loop/switch approach to encode the automata.
In languages that have goto statement (such as C/C++ and Go) conditions
are naturally implemented as blocks of code prefixed with labels of the
form yyc_<cond>, where cond is a condition name (label prefix can be
changed with re2c:cond:prefix). Transitions between conditions are
implemented using goto and condition labels. Before all conditions
re2go generates an initial switch on YYGETCOND that jumps to the start
state of the current condition. The shortcut rules :=> bypass the
initial switch and jump directly to the specified condition
(re2c:cond:goto can be used to change the default behavior). The rules
with semantic actions do not automatically jump to the next condition;
this should be done by the user-defined action code.
In languages that do not have goto (such as Rust) re2go reuses the
yystate variable to store condition numbers. Each condition gets a
numeric identifier equal to the number of its start state, and a switch
between conditions is no different than a switch between DFA states of
a single condition. There is no need for a separate initial condition
switch. (Since the same approach is used to implement storable states,
YYGETCOND/YYSETCOND are redundant if both storable states and
conditions are used).
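The goto/label layout described above can be imitated by hand. The following Go sketch is an illustration only: the two conditions, their numbers and the countQuoted function are hypothetical, and the labels merely follow the yyc_<cond> naming pattern. It shows an initial switch on the condition number and direct jumps between condition labels.

```go
package main

import "fmt"

// Hypothetical condition numbers.
const (
	condText = iota
	condQuote
)

// countQuoted counts characters between single quotes in a NUL-terminated
// string, starting in condition cond. It returns the count and the
// condition in which the end of input was reached.
func countQuoted(s string, cond int) (int, int) {
	pos, inside := 0, 0
	var c byte

	// Initial dispatch on the current condition.
	switch cond {
	case condQuote:
		goto yyc_quote
	default:
		goto yyc_text
	}

yyc_text:
	c = s[pos]
	pos++
	switch c {
	case 0:
		return inside, condText
	case '\'':
		goto yyc_quote // direct transition, no dispatch involved
	default:
		goto yyc_text
	}

yyc_quote:
	c = s[pos]
	pos++
	switch c {
	case 0:
		return inside, condQuote
	case '\'':
		goto yyc_text
	default:
		inside++
		goto yyc_quote
	}
}

func main() {
	fmt.Println(countQuoted("a'bc'd\x00", condText)) // 2 0
	fmt.Println(countQuoted("'ab\x00", condText))    // 2 1
}
```

The second call ends inside the quoted part, so the returned condition number is condQuote; a caller could pass it back in to resume in the same condition.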
The program below uses start conditions to parse binary, octal, decimal
and hexadecimal numbers. There is a single block where each base has
its own condition, and the initial condition is connected to all of
them. User-defined variable yycond stores the current condition number;
it is initialized to the number of the initial condition generated with
conditions block.
//go:generate re2go -c $INPUT -o $OUTPUT -i --api simple
package main
import "errors"
var (
eSyntax = errors.New("syntax error")
eOverflow = errors.New("overflow error")
)
/*!conditions:re2c*/
const u32Limit uint64 = 1<<32
func parse_u32(yyinput string) (uint32, error) {
var yycursor, yymarker int
result := uint64(0)
yycond := yycinit
add := func(base uint64, offset byte) {
result = result * base + uint64(yyinput[yycursor-1] - offset)
if result >= u32Limit {
result = u32Limit
}
}
/*!re2c
re2c:yyfill:enable = 0;
re2c:YYCTYPE = byte;
re2c:YYGETCOND = "yycond";
re2c:YYSETCOND = "yycond = @@";
<*> * { return 0, eSyntax }
<init> '0b' / [01] :=> bin
<init> "0" :=> oct
<init> "" / [1-9] :=> dec
<init> '0x' / [0-9a-fA-F] :=> hex
<bin, oct, dec, hex> "\x00" {
if result < u32Limit {
return uint32(result), nil
} else {
return 0, eOverflow
}
}
<bin> [01] { add(2, '0'); goto yyc_bin }
<oct> [0-7] { add(8, '0'); goto yyc_oct }
<dec> [0-9] { add(10, '0'); goto yyc_dec }
<hex> [0-9] { add(16, '0'); goto yyc_hex }
<hex> [a-f] { add(16, 'a'-10); goto yyc_hex }
<hex> [A-F] { add(16, 'A'-10); goto yyc_hex }
*/
}
func main() {
test := func(num uint32, str string, err error) {
if n, e := parse_u32(str); !(n == num && e == err) {
panic("error")
}
}
test(1234567890, "1234567890\000", nil)
test(13, "0b1101\000", nil)
test(0x7fe, "0x007Fe\000", nil)
test(0644, "0644\000", nil)
test(0, "9999999999\000", eOverflow)
test(0, "123??\000", eSyntax)
}
Storable state
With --storable-state option re2go generates a lexer that can store its
current state, return to the caller, and later resume operations
exactly where it left off. The default mode of operation in re2go is a
"pull" model, in which the lexer "pulls" more input whenever it needs
it. This may be unacceptable in cases when the input becomes available
piece by piece (for example, if the lexer is invoked by the parser, or
if the lexer program communicates via a socket protocol with some other
program that must wait for a reply from the lexer before it transmits
the next message). Storable state feature is intended exactly for such
cases: it allows one to generate lexers that work in a "push" model.
When the lexer needs more input, it stores its state and returns to the
caller. Later, when more input becomes available, the caller resumes
the lexer exactly where it stopped. There are a few changes necessary
compared to the "pull" model:
o Define YYGETSTATE() and YYSETSTATE(state) primitives.
o Define yych, yyaccept (if used) and state variables as a part of
persistent lexer state. The state variable should be initialized to
-1.
o YYFILL should return to the outer program instead of trying to supply
more input. Return code should indicate that lexer needs more input.
o The outer program should recognize situations when lexer needs more
input and respond appropriately.
o Optionally use getstate block to generate YYGETSTATE switch detached
from the main lexer. This only works for languages that have goto
(not in --loop-switch mode).
o Use re2c:eof and the sentinel with bounds checks method to handle the
end of input. Padding-based method may not work because it is unclear
when to append padding: the current end of input may not be the
ultimate end of input, and appending padding too early may cut off a
partially read greedy lexeme. Furthermore, due to high-level program
logic getting more input may depend on processing the lexeme at the
end of buffer (which already is blocked due to the end-of-input
condition).
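The control flow of the "push" model can be sketched without any generated code. The Go program below is a hand-written approximation: the lexer type, its fields and the status constants are hypothetical, and the state field is a two-valued flag rather than the state numbers re2go would assign. What it does share with a storable-state lexer is that all progress lives in a persistent structure, so a call can stop for lack of input and a later call continues from the same point.

```go
package main

import "fmt"

const (
	lexEnd = iota
	lexNeedInput
	lexBroken
)

// lexer is a hypothetical persistent state that survives between calls.
type lexer struct {
	buf    []byte // input received so far
	cursor int    // next unread position
	state  int    // -1: at packet start, 1: inside a packet
	recv   int    // complete packets seen
	closed bool   // set by the caller when no more input will arrive
}

// run consumes the available input. When it runs out, run returns
// lexNeedInput and keeps its state, so the next call resumes where this
// one stopped. Packets have the form [a-z]+ followed by a semicolon.
func (l *lexer) run() int {
	for {
		if l.cursor == len(l.buf) {
			if !l.closed {
				return lexNeedInput
			}
			if l.state == -1 {
				return lexEnd
			}
			return lexBroken // input ended inside a packet
		}
		c := l.buf[l.cursor]
		l.cursor++
		switch {
		case c >= 'a' && c <= 'z':
			l.state = 1
		case c == ';' && l.state == 1:
			l.recv++
			l.state = -1
		default:
			return lexBroken
		}
	}
}

func main() {
	l := &lexer{state: -1}
	// Packets arrive split across chunk boundaries.
	for _, chunk := range []string{"ze", "ro;on", "e;"} {
		l.buf = append(l.buf, chunk...)
		fmt.Println(l.run() == lexNeedInput, l.recv)
	}
	l.closed = true
	fmt.Println(l.run() == lexEnd, l.recv)
}
```

The caller, not the lexer, decides when to supply more data. This is the inversion of control that distinguishes the "push" model from the "pull" model.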
Here is an example of a "push" model lexer that simulates reading
packets from a socket. The lexer loops until it encounters the end of
input and returns to the calling function. The calling function
provides more input by "sending" the next packet and resumes lexing.
This process stops when all the packets have been sent, or when there
is an error.
//go:generate re2go -f $INPUT -o $OUTPUT
package main
import (
"fmt"
"os"
)
// Use a small buffer to cover the case when a lexeme doesn't fit.
// In real world use a larger buffer.
const BUFSIZE int = 10
type State struct {
file *os.File
yyinput []byte
yycursor int
yymarker int
yylimit int
token int
yystate int
}
const (
lexEnd = iota
lexReady
lexWaitingForInput
lexPacketBroken
lexPacketTooBig
)
func fill(st *State) int {
shift := st.token
used := st.yylimit - st.token
free := BUFSIZE - used
// Error: no space. In real life can reallocate a larger buffer.
if free < 1 { return lexPacketTooBig }
// Shift buffer contents (discard already processed data).
copy(st.yyinput[0:], st.yyinput[shift:shift+used])
st.yycursor -= shift
st.yymarker -= shift
st.yylimit -= shift
st.token -= shift
// Fill free space at the end of buffer with new data.
n, _ := st.file.Read(st.yyinput[st.yylimit:BUFSIZE])
st.yylimit += n
st.yyinput[st.yylimit] = 0 // append sentinel symbol
return lexReady
}
func lex(yyrecord *State, recv *int) int {
var yych byte
/*!getstate:re2c*/
loop:
yyrecord.token = yyrecord.yycursor
/*!re2c
re2c:api = record;
re2c:eof = 0;
re2c:YYFILL = "return lexWaitingForInput";
packet = [a-z]+[;];
* { return lexPacketBroken }
$ { return lexEnd }
packet { *recv = *recv + 1; goto loop }
*/
}
func test(expect int, packets []string) {
// Create a pipe (open the same file for reading and writing).
fname := "pipe"
fw, _ := os.Create(fname)
fr, _ := os.Open(fname)
// Initialize lexer state: `state` value is -1, all offsets are at the end
// of buffer.
st := &State{
file: fr,
// Sentinel at `yylimit` offset is set to zero, which triggers YYFILL.
yyinput: make([]byte, BUFSIZE+1),
yycursor: BUFSIZE,
yymarker: BUFSIZE,
yylimit: BUFSIZE,
token: BUFSIZE,
yystate: -1,
}
// Main loop. The buffer contains incomplete data which appears packet by
// packet. When the lexer needs more input it saves its internal state and
// returns to the caller which should provide more input and resume lexing.
var status int
send := 0
recv := 0
for {
status = lex(st, &recv)
if status == lexEnd {
break
} else if status == lexWaitingForInput {
if send < len(packets) {
fw.WriteString(packets[send])
send += 1
}
status = fill(st)
if status != lexReady {
break
}
} else if status == lexPacketBroken {
break
}
}
// Check results.
if status != expect || (status == lexEnd && recv != send) {
panic(fmt.Sprintf("got %d, want %d", status, expect))
}
// Cleanup: remove input file.
fr.Close()
fw.Close()
os.Remove(fname)
}
func main() {
test(lexEnd, []string{})
test(lexEnd, []string{"zero;", "one;", "two;", "three;", "four;"})
test(lexPacketBroken, []string{"??;"})
test(lexPacketTooBig, []string{"looooooooooooong;"})
}
Reusable blocks
Reusable blocks of the form /*!rules:re2c[:<name>] ... */ or
%{rules[:<name>] ... %} can be reused any number of times and combined
with other re2go blocks. The <name> is optional. A rules block can be
used in a use block or directive. The code for a rules block is
generated at every point of use.
Use blocks are defined with /*!use:re2c[:<name>] ... */ or
%{use[:<name>] ... %}. The <name> is optional: if it's not specified,
the associated rules block is the most recent one (whether named or
unnamed). A use block can add named definitions, configurations and
rules of its own. An important use case for use blocks is a lexer that
supports multiple input encodings: the same rules block is reused
multiple times with encoding-specific configurations (see the example
below).
In-block use directive !use:<name>; can be used from inside of a re2go
block. It merges the referenced block <name> into the current one. If
some of the merged rules and configurations overlap with the previously
defined ones, conflicts are resolved in the usual way: the earliest
rule takes priority, and the latest configuration overrides preceding ones.
An exception is made for the special rules *, $ and (in condition mode) <!>,
for which a block-local definition overrides any inherited ones. Use
directive allows one to combine different re2go blocks together in one
block (see the example below).
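The priority rules can be pictured with a toy model in Go. It does not reflect how re2go merges blocks internally: the rule and block types and the merge and match functions are hypothetical, rules are plain string literals, and the used blocks are assumed to be listed before any local rules. Which inherited default rule would win in the absence of a local one is not modelled faithfully (the sketch arbitrarily keeps the last one).

```go
package main

import "fmt"

// rule is a toy lexer rule: a literal and the result of its action.
type rule struct {
	lit    string
	result string
}

// block is a toy rules block: ordinary rules plus an optional default
// (*) rule; an empty def means that there is no default rule.
type block struct {
	rules []rule
	def   string
}

// merge models a block that pulls in other blocks with use directives.
// Inherited rules are added in the order of use, local rules follow, and
// a local default rule replaces any inherited one.
func merge(local block, used ...block) block {
	var out block
	for _, b := range used {
		out.rules = append(out.rules, b.rules...)
		if b.def != "" {
			out.def = b.def // arbitrary choice, see the note above
		}
	}
	out.rules = append(out.rules, local.rules...)
	if local.def != "" {
		out.def = local.def
	}
	return out
}

// match returns the result of the earliest matching rule, or the default.
func match(b block, input string) string {
	for _, r := range b.rules {
		if r.lit == input {
			return r.result
		}
	}
	return b.def
}

func main() {
	colors := block{rules: []rule{{"red", "Color"}, {"salmon", "Color"}}, def: "eh!"}
	fish := block{rules: []rule{{"haddock", "Fish"}, {"salmon", "Fish"}}, def: "oh!"}
	lexer := merge(block{def: "Dunno"}, fish, colors)
	fmt.Println(match(lexer, "salmon"), match(lexer, "red"), match(lexer, "what?"))
}
```

With fish used before colors, "salmon" resolves to the fish rule, and the local default hides both inherited defaults.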
Named blocks and in-block use directive were added in re2go version
2.2. Since that version reusable blocks are allowed by default (no
special option is needed). Before version 2.2 reuse mode was enabled
with -r --reusable option. Before version 1.2 reusable blocks could not
be mixed with normal blocks.
Example of a !use directive
//go:generate re2go $INPUT -o $OUTPUT --api simple
package main
// This example shows how to combine reusable re2c blocks: two blocks
// ('colors' and 'fish') are merged into one. The 'salmon' rule occurs
// in both blocks; the 'fish' block takes priority because it is used
// earlier. Default rule * occurs in all three blocks; the local (not
// inherited) definition takes priority.
const (
Color = iota
Fish
Dunno
)
/*!rules:re2c:colors
* { panic("eh!") }
"red" | "salmon" | "magenta" { return Color }
*/
/*!rules:re2c:fish
* { panic("oh!") }
"haddock" | "salmon" | "eel" { return Fish }
*/
func lex(yyinput string) int {
var yycursor, yymarker int
/*!re2c
re2c:yyfill:enable = 0;
re2c:YYCTYPE = byte;
!use:fish;
!use:colors;
* { return Dunno } // overrides inherited '*' rules
*/
}
func main() {
assert_eq := func(x, y int) { if x != y { panic("error") } }
assert_eq(lex("salmon"), Fish);
assert_eq(lex("what?"), Dunno);
}
Example of a /*!use:re2c ... */ block
//go:generate re2go $INPUT -o $OUTPUT --input-encoding utf8 --api simple
package main
// This example supports multiple input encodings: UTF-8 and UTF-32.
// Both lexers are generated from the same rules block, and the use
// blocks add only encoding-specific configurations.
/*!rules:re2c
re2c:yyfill:enable = 0;
"∀x ∃y" { return 0; }
* { return 1; }
*/
func lexUTF8(yyinput []uint8) int {
var yycursor, yymarker int
/*!use:re2c
re2c:encoding:utf8 = 1;
re2c:YYCTYPE = uint8;
*/
}
func lexUTF32(yyinput []uint32) int {
var yycursor, yymarker int
/*!use:re2c
re2c:encoding:utf32 = 1;
re2c:YYCTYPE = uint32;
*/
}
func main() {
assert_eq := func(x, y int) { if x != y { panic("error") } }
assert_eq(lexUTF8([]uint8{0xe2, 0x88, 0x80, 0x78, 0x20, 0xe2, 0x88, 0x83, 0x79}), 0)
assert_eq(lexUTF32([]uint32{0x2200, 0x78, 0x20, 0x2203, 0x79}), 0)
}
Submatch extraction
re2go has two options for submatch extraction.
Tags The first option is to use standalone tags of the form @stag or
#mtag, where stag and mtag are arbitrary user-defined names.
Tags are enabled with -T --tags option or re2c:tags = 1
configuration. Semantically tags are position markers: they can
be inserted anywhere in a regular expression, and they bind to
the corresponding position (or multiple positions) in the input
string. S-tags bind to the last matching position, and m-tags
bind to a list of positions (they may be used in repetition
subexpressions, where a single position in a regular expression
corresponds to multiple positions in the input string). All tags
should be defined by the user, either manually or with the help
of svars and mvars blocks. If there is more than one way tags
can be matched against the input, ambiguity is resolved using
leftmost greedy disambiguation strategy.
Captures
The second option is to use capturing groups. They are enabled
with --captures option or re2c:captures = 1 configuration. There
are two flavours for different disambiguation policies:
--leftmost-captures (the default) is for leftmost greedy policy,
and --posix-captures is for POSIX longest-match policy. In this
mode all parenthesized subexpressions are considered capturing
groups, and a bang can be used to mark non-capturing groups: (!
... ). With --invert-captures option or re2c:invert-captures = 1
configuration the meaning of bang is inverted. The number of
groups for the matching rule is stored in a variable yynmatch
(the whole regular expression is group number zero), and
submatch results are stored in yypmatch array. Both yynmatch and
yypmatch should be defined by the user, and yypmatch size must
be at least [yynmatch * 2]. Use maxnmatch block to define
YYMAXNMATCH, a constant equal to the maximum value of
yynmatch among all rules.
Captvars
Another way to use capturing groups is the --captvars option or
re2c:captvars = 1 configuration. The only difference with
--captures is in the way the generated code stores submatch
results: instead of yynmatch and yypmatch re2go generates
variables yytl<k> and yytr<k> for k-th capturing group (the user
should declare these using an svars block). Captures with
variables support two disambiguation policies:
--leftmost-captvars or re2c:leftmost-captvars = 1 for leftmost
greedy policy (the default one) and --posix-captvars or
re2c:posix-captvars = 1 for POSIX longest-match policy.
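To make the yynmatch/yypmatch storage used by --captures more concrete, here is a small Go sketch that reads submatch results back from such an array. It rests on two assumptions that should be checked against the generated code: that the bounds of group k are stored at indices 2*k and 2*k+1, and that a group which did not take part in the match has -1 in both slots. The groups function is hypothetical and the offsets in main are filled in by hand, not produced by a lexer.

```go
package main

import "fmt"

// groups slices the input according to a yypmatch-style array of offsets.
// Assumed layout: group k occupies yypmatch[2*k] (start) and
// yypmatch[2*k+1] (end); -1 marks a group that did not match.
func groups(input string, yynmatch int, yypmatch []int) []string {
	out := make([]string, yynmatch)
	for k := 0; k < yynmatch; k++ {
		l, r := yypmatch[2*k], yypmatch[2*k+1]
		if l != -1 && r != -1 {
			out[k] = input[l:r]
		}
	}
	return out
}

func main() {
	// Offsets that a lexer for (num) "." (num) ("." num)? might report
	// on the input "1.22". The values are written by hand.
	input := "1.22"
	yypmatch := []int{0, 4, 0, 1, 2, 4, -1, -1}
	fmt.Printf("%q\n", groups(input, 4, yypmatch))
}
```

The array holds two offsets per group, which is why its size must be at least twice yynmatch.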
Under the hood all these options translate into tags and Tagged
Deterministic Finite Automata with Lookahead. The core idea of TDFA is
to minimize the overhead on submatch extraction. In the extreme, if
there are no tags or captures in a regular expression, TDFA is just an
ordinary DFA. If the number of tags is moderate, the overhead is barely
noticeable. The generated TDFA uses a number of tag variables which do
not map directly to tags: a single variable may be used for different
tags, and a tag may require multiple variables to hold all its possible
values. Eventually ambiguity is resolved, and only one final variable
per tag survives. Tag variables should be defined using stags or mtags
blocks. If lexer state is stored, tag variables should be part of it.
They also need to be updated by YYFILL.
S-tags support the following operations:
o save input position to an s-tag: t = YYCURSOR with C pointer API or a
user-defined operation YYSTAGP(t) with generic API
o save default value to an s-tag: t = NULL with C pointer API or a
user-defined operation YYSTAGN(t) with generic API
o copy one s-tag to another: t1 = t2
M-tags support the following operations:
o append input position to an m-tag: a user-defined operation
YYMTAGP(t) with both default and generic API
o append default value to an m-tag: a user-defined operation YYMTAGN(t)
with both default and generic API
o copy one m-tag to another: t1 = t2
S-tags can be implemented as scalar values (pointers or offsets).
M-tags need a more complex representation, as they need to store a
sequence of tag values. The most naive and inefficient representation
of an m-tag is a list (array, vector) of tag values; a more efficient
representation is to store all m-tags in a prefix-tree represented as
array of nodes (v, p), where v is tag value and p is a pointer to
parent node.
Here is a simple example of using s-tags to parse semantic versions
consisting of three numeric components: major, minor, patch (the latter
is optional). See below for a more complex example that uses YYFILL.
//go:generate re2go $INPUT -o $OUTPUT --api simple
package main
import "reflect"
type SemVer struct { major, minor, patch int }
func s2n(s string) int { // convert pre-parsed string to a number
n := 0
for _, c := range s { n = n*10 + int(c-'0') }
return n
}
func parse(yyinput string) *SemVer {
var yycursor, yymarker int
// Final tag variables available in semantic action.
/*!svars:re2c format = 'var @@ int;'; */
// Intermediate tag variables used by the lexer (must be autogenerated).
/*!stags:re2c format = 'var @@ int;'; */
/*!re2c
re2c:yyfill:enable = 0;
re2c:YYCTYPE = byte;
re2c:tags = 1;
num = [0-9]+;
@t1 num @t2 "." @t3 num @t4 ("." @t5 num)? [\x00] {
major := s2n(yyinput[t1:t2])
minor := s2n(yyinput[t3:t4])
patch := 0
if t5 != -1 { patch = s2n(yyinput[t5:yycursor-1]) }
return &SemVer{major, minor, patch}
}
* { return nil }
*/
}
func main() {
assert_eq := func(x, y *SemVer) {
if !reflect.DeepEqual(x, y) { panic("error") }
}
assert_eq(parse("23.34\000"), &SemVer{23, 34, 0})
assert_eq(parse("1.2.9999\000"), &SemVer{1, 2, 9999})
assert_eq(parse("1.a\000"), nil)
}
Here is a more complex example of using s-tags with YYFILL to parse a
file with newline-separated semantic versions. Tag variables are part
of the lexer state, and they are adjusted in YYFILL like other input
positions. Note that this is necessary for s-tags because their values
are invalidated after shifting buffer contents. It may not be necessary
in a custom implementation where tag variables store offsets relative
to the start of the input string rather than the buffer, which may be
the case with m-tags.
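The adjustment itself is small. As a minimal sketch in Go, assuming that tags are stored as integer offsets into the buffer and that -1 is the value of a tag that has not matched (shiftTags and tagNone are hypothetical names):

```go
package main

import "fmt"

const tagNone = -1 // value of a tag that did not match

// shiftTags adjusts tag offsets after the first `shift` bytes of the
// buffer have been discarded. Unset tags keep their -1 value.
func shiftTags(shift int, tags ...*int) {
	for _, t := range tags {
		if *t != tagNone {
			*t -= shift
		}
	}
}

func main() {
	t1, t2, t3 := 10, 14, tagNone
	shiftTags(8, &t1, &t2, &t3)
	fmt.Println(t1, t2, t3) // 2 6 -1
}
```

Without the check for -1 an unset tag would turn into some other negative number and could no longer be told apart from a set one.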
//go:generate re2go $INPUT -o $OUTPUT --tags
package main
import (
"os"
"reflect"
"strings"
)
const BUFSIZE int = 4095
type Input struct {
file *os.File
yyinput []byte
yycursor int
yymarker int
yylimit int
token int
// Intermediate tag variables must be part of the lexer state passed to YYFILL.
// They don't correspond to tags and should be autogenerated by re2c.
/*!stags:re2c format = "\t@@ int\n"; */
eof bool
}
type SemVer struct { major, minor, patch int }
func s2n(s []byte) int { // convert pre-parsed string to a number
n := 0
for _, c := range s { n = n*10 + int(c-'0') }
return n
}
func fill(in *Input) int {
if in.eof { return -1 } // unexpected EOF
// Error: lexeme too long. In real life can reallocate a larger buffer.
if in.token < 1 { return -2 }
// Shift buffer contents (discard everything up to the current token).
copy(in.yyinput[0:], in.yyinput[in.token:in.yylimit])
in.yycursor -= in.token
in.yymarker -= in.token
in.yylimit -= in.token
// Tag variables need to be shifted like other input positions. The check
// for -1 is only needed if some tags are nested inside of alternative or
// repetition, so that they can have -1 value.
/*!stags:re2c format = "\tif in.@@ != -1 { in.@@ -= in.token }\n"; */
in.token = 0
// Fill free space at the end of buffer with new data from file.
n, _ := in.file.Read(in.yyinput[in.yylimit:BUFSIZE])
in.yylimit += n
in.yyinput[in.yylimit] = 0
// If read less than expected, this is the end of input.
in.eof = in.yylimit < BUFSIZE
return 0
}
func parse(in *Input) []SemVer {
// Final tag variables available in semantic action.
/*!svars:re2c format = "var @@ int;"; */
vers := make([]SemVer, 0)
for {
in.token = in.yycursor
/*!re2c
re2c:api = record;
re2c:eof = 0;
re2c:yyrecord = in;
re2c:YYCTYPE = byte;
re2c:YYFILL = "fill(in) == 0";
num = [0-9]+;
num @t1 "." @t2 num @t3 ("." @t4 num)? [\n] {
major := s2n(in.yyinput[in.token:t1])
minor := s2n(in.yyinput[t2:t3])
patch := 0
if t4 != -1 { patch = s2n(in.yyinput[t4:in.yycursor-1]) }
vers = append(vers, SemVer{major, minor, patch})
continue
}
$ { return vers }
* { return nil }
*/
}
}
func main() () {
fname := "input"
content := "1.22.333\n";
expect := make([]SemVer, 0, BUFSIZE)
for i := 0; i < BUFSIZE; i += 1 { expect = append(expect, SemVer{1, 22, 333}) }
// Prepare input file (make sure it exceeds buffer size).
f, _ := os.Create(fname)
f.WriteString(strings.Repeat(content, BUFSIZE))
f.Seek(0, 0)
// Initialize lexer state: all offsets are at the end of buffer.
in := &Input{
file: f,
// Sentinel at `yylimit` offset is set to zero, which triggers YYFILL.
yyinput: make([]byte, BUFSIZE+1),
yycursor: BUFSIZE,
yymarker: BUFSIZE,
yylimit: BUFSIZE,
token: BUFSIZE,
eof: false,
}
// Run the lexer and check results.
if !reflect.DeepEqual(parse(in), expect) { panic("error"); }
// Cleanup: remove input file.
f.Close();
os.Remove(fname);
}
Here is an example of using capturing groups to parse semantic
versions.
//go:generate re2go $INPUT -o $OUTPUT --api simple
package main
import "reflect"
type SemVer struct { major, minor, patch int }
func s2n(s string) int { // convert pre-parsed string to a number
n := 0
for _, c := range s { n = n*10 + int(c-'0') }
return n
}
func parse(yyinput string) *SemVer {
var yycursor, yymarker int
// Final tag variables used in semantic action.
/*!svars:re2c format = 'var @@ int;'; */
// Intermediate tag variables used by the lexer (must be autogenerated).
/*!stags:re2c format = 'var @@ int;'; */
/*!re2c
re2c:yyfill:enable = 0;
re2c:YYCTYPE = byte;
re2c:captvars = 1;
num = [0-9]+;
(num) "." (num) ("." num)? [\x00] {
_ = yytl0; _ = yytr0; // some variables are unused
major := s2n(yyinput[yytl1:yytr1])
minor := s2n(yyinput[yytl2:yytr2])
patch := 0
if yytl3 != -1 { patch = s2n(yyinput[yytl3+1:yytr3]) }
return &SemVer{major, minor, patch}
}
* { return nil }
*/
}
func main() {
assert_eq := func(x, y *SemVer) {
if !reflect.DeepEqual(x, y) { panic("error") }
}
assert_eq(parse("23.34\000"), &SemVer{23, 34, 0})
assert_eq(parse("1.2.9999\000"), &SemVer{1, 2, 9999})
assert_eq(parse("1.a\000"), nil)
}
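For comparison, the same grammar can be sketched with the standard Go regexp package. This is only an illustration of how capture offsets behave, not code that re2go generates; the names semver and parseVersion are hypothetical. FindStringSubmatchIndex returns pairs of offsets that play the role of the yytl/yytr variables, and an unmatched optional group is reported as -1, just as in the action above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// semver mirrors the rule `(num) "." (num) ("." num)?` from the example above
// (without the terminating NUL, which a Go string does not need).
var semver = regexp.MustCompile(`^([0-9]+)\.([0-9]+)(\.[0-9]+)?$`)

// parseVersion returns the three components, or ok=false on mismatch.
func parseVersion(s string) (major, minor, patch int, ok bool) {
	// loc[2*i] and loc[2*i+1] bound group i, much like yytl<i> and yytr<i>.
	// A group that did not take part in the match yields -1.
	loc := semver.FindStringSubmatchIndex(s)
	if loc == nil {
		return 0, 0, 0, false
	}
	major, _ = strconv.Atoi(s[loc[2]:loc[3]])
	minor, _ = strconv.Atoi(s[loc[4]:loc[5]])
	if loc[6] != -1 {
		// Skip the leading dot, as the re2go action does with yytl3+1.
		patch, _ = strconv.Atoi(s[loc[6]+1 : loc[7]])
	}
	return major, minor, patch, true
}

func main() {
	fmt.Println(parseVersion("23.34"))
	fmt.Println(parseVersion("1.2.9999"))
	fmt.Println(parseVersion("1.a"))
}
```

Unlike this sketch, the re2go version compiles the expression ahead of time into a DFA with tag operations, so no regular expression engine runs at lexing time.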
Here is an example of using m-tags to parse a version with a variable
number of components. Tag histories are stored in a trie.
//go:generate re2go $INPUT -o $OUTPUT --api simple
package main
import "reflect"
const (
mtagRoot int = -1
tagNone int = -1
)
// An m-tag tree is a way to store histories with an O(1) copy operation.
// Histories naturally form a tree, as they have common start and fork at some
// point. The tree is stored as an array of pairs (tag value, link to parent).
// An m-tag is represented with a single link in the tree (array index).
type mtagElem struct {
elem int
pred int
}
type mtagTrie = []mtagElem
type Ver = []int // unbounded number of version components
func s2n(s string) int { // convert pre-parsed string to a number
n := 0
for _, c := range s { n = n*10 + int(c-'0') }
return n
}
// Append a single value to an m-tag history.
func add_mtag(trie *mtagTrie, mtag int, value int) int {
*trie = append(*trie, mtagElem{value, mtag})
return len(*trie) - 1
}
// Recursively unwind tag histories and collect version components.
func unwind(trie mtagTrie, x int, y int, str string) Ver {
// Reached the root of the m-tag tree, stop recursion.
if x == mtagRoot && y == mtagRoot {
return []int{}
}
// Unwind history further.
ver := unwind(trie, trie[x].pred, trie[y].pred, str)
// Get tag values. Tag histories must have equal length.
if x == mtagRoot || y == mtagRoot {
panic("tag histories have different length")
}
ex := trie[x].elem
ey := trie[y].elem
if ex != tagNone && ey != tagNone {
// Both tags are valid string indices, extract component.
ver = append(ver, s2n(str[ex:ey]))
} else if !(ex == tagNone && ey == tagNone) {
panic("both tags should be tagNone")
}
return ver
}
func parse(yyinput string) []int {
var yycursor, yymarker int
trie := make([]mtagElem, 0)
// Final tag variables available in semantic action.
/*!svars:re2c format = 'var @@ int;'; */
/*!mvars:re2c format = "var @@ int;"; */
// Intermediate tag variables used by the lexer (must be autogenerated).
/*!stags:re2c format = 'var @@ int'; separator = "\n\t"; */
/*!mtags:re2c format = "\t@@ := mtagRoot\n"; */
/*!re2c
re2c:tags = 1;
re2c:yyfill:enable = 0;
re2c:YYCTYPE = byte;
re2c:YYMTAGP = "@@ = add_mtag(&trie, @@, yycursor)";
re2c:YYMTAGN = "@@ = add_mtag(&trie, @@, tagNone)";
num = [0-9]+;
@t1 num @t2 ("." #t3 num #t4)* [\x00] {
ver := make([]int, 0)
ver = append(ver, s2n(yyinput[t1:t2]))
ver = append(ver, unwind(trie, t3, t4, yyinput)...)
return ver
}
* { return nil }
*/
}
func main() {
assert_eq := func(x, y []int) {
if !reflect.DeepEqual(x, y) { panic("error") }
}
assert_eq(parse("1\000"), []int{1})
assert_eq(parse("1.2.3.4.5.6.7\000"), []int{1, 2, 3, 4, 5, 6, 7})
assert_eq(parse("1.\000"), nil)
}
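The history trie used above can also be looked at in isolation. The following is a minimal sketch under assumed names (root, node, push and collect are hypothetical and not part of re2go). It shows why forking a history is cheap: a history is a single index, and two histories with a common prefix share the nodes of that prefix.

```go
package main

import "fmt"

const root = -1 // index that denotes the empty history

// node is one history element: a value plus a link to the previous element.
type node struct {
	value int
	pred  int
}

// push appends value to history h and returns the index of the new history.
// The old index h stays valid, so copying a history means copying an int.
func push(trie *[]node, h int, value int) int {
	*trie = append(*trie, node{value, h})
	return len(*trie) - 1
}

// collect walks from h back to the root and returns the values oldest first.
func collect(trie []node, h int) []int {
	var vals []int
	for ; h != root; h = trie[h].pred {
		vals = append(vals, trie[h].value)
	}
	// The walk produced the newest value first, so reverse in place.
	for i, j := 0, len(vals)-1; i < j; i, j = i+1, j-1 {
		vals[i], vals[j] = vals[j], vals[i]
	}
	return vals
}

func main() {
	trie := []node{}
	common := push(&trie, push(&trie, root, 10), 20) // shared prefix 10, 20
	a := push(&trie, common, 30)                     // fork: 10, 20, 30
	b := push(&trie, common, 40)                     // fork: 10, 20, 40
	fmt.Println(collect(trie, a), collect(trie, b), len(trie))
}
```

Both forks together occupy four nodes rather than six, because the prefix is stored once.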
Encoding support
It is necessary to understand the difference between code points and
code units. A code point is a numeric identifier of a symbol. A code
unit is the smallest unit of storage in the encoded text. A single code
point may be represented with one or more code units. In a fixed-length
encoding all code points are represented with the same number of code
units. In a variable-length encoding different code points may be
represented with different numbers of code units. Note that the "any" rule [^]
matches any code point, but not necessarily any code unit (the only way
to match any code unit regardless of the encoding is the default rule
*). The generated lexer works with a stream of code units: yych stores
a code unit, and YYCTYPE is the code unit type. Regular expressions, on
the other hand, are specified in terms of code points. When re2go
compiles regular expressions to automata it translates code points to
code units. This is generally not a simple mapping: in variable-length
encodings a single code point range may get translated to a complex
code unit graph. The following encodings are supported:
o ASCII (enabled by default). It is a fixed-length encoding with code
space [0-255] and 1-byte code points and code units.
o EBCDIC (enabled with --ebcdic or re2c:encoding:ebcdic). It is a
fixed-length encoding with code space [0-255] and 1-byte code points
and code units.
o UCS2 (enabled with --ucs2 or re2c:encoding:ucs2). It is a
fixed-length encoding with code space [0-0xFFFF] and 2-byte code
points and code units.
o UTF8 (enabled with --utf8 or re2c:encoding:utf8). It is a
variable-length Unicode encoding. Code unit size is 1 byte. Code
points are represented with 1 to 4 code units.
o UTF16 (enabled with --utf16 or re2c:encoding:utf16). It is a
variable-length Unicode encoding. Code unit size is 2 bytes. Code
points are represented with 1 to 2 code units.
o UTF32 (enabled with --utf32 or re2c:encoding:utf32). It is a
fixed-length Unicode encoding with code space [0-0x10FFFF] and 4-byte
code points and code units.
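The difference between code points and code units can be observed with the Go standard library alone. The sketch below is independent of re2go; the helper name units is hypothetical. For a given code point it counts how many code units the three Unicode encodings from the list above need.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// units reports how many code units one code point occupies in each encoding.
func units(r rune) (u8, u16, u32 int) {
	u8 = utf8.RuneLen(r)               // number of 1-byte code units
	u16 = len(utf16.Encode([]rune{r})) // number of 2-byte code units
	u32 = 1                            // fixed length: always one 4-byte unit
	return
}

func main() {
	// 'A', Cyrillic U+042B, the euro sign U+20AC and U+1F600 (outside the BMP).
	for _, r := range []rune{'A', 0x042B, 0x20AC, 0x1F600} {
		u8, u16, u32 := units(r)
		fmt.Printf("U+%04X: UTF8=%d UTF16=%d UTF32=%d\n", r, u8, u16, u32)
	}
}
```

A lexer generated with --utf8 therefore has to match up to four yych values in sequence to recognize a single code point, which is why one code point range may turn into a larger code unit graph.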
Include file include/unicode_categories.re provides re2go definitions
for the standard Unicode categories.
Option --input-encoding specifies source file encoding, which can be
used to enable Unicode literals in regular expressions. For example
--input-encoding utf8 tells re2go that the source file is in UTF8 (it
differs from --utf8 which sets input text encoding). Option
--encoding-policy specifies the way re2go handles Unicode surrogates
(code points in range [0xD800-0xDFFF]).
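To see why surrogates need a dedicated policy, it may help to check how an ordinary Unicode library treats them. The sketch below uses only the Go standard library and says nothing about how re2go itself implements --encoding-policy; the helper name isSurrogate is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// isSurrogate reports whether r lies in the range [0xD800-0xDFFF].
func isSurrogate(r rune) bool {
	return utf16.IsSurrogate(r)
}

func main() {
	// Two surrogate bounds and their immediate neighbours.
	for _, r := range []rune{0xD7FF, 0xD800, 0xDFFF, 0xE000} {
		fmt.Printf("U+%04X surrogate=%v valid=%v\n",
			r, isSurrogate(r), utf8.ValidRune(r))
	}
	// Go's own string conversion replaces a surrogate with U+FFFD.
	fmt.Printf("%+q\n", string(rune(0xD800)))
}
```

Surrogates are reserved for UTF16 pairs and are not valid code points on their own, so a regular expression range that spans them has to be rejected, adjusted or accepted as-is, depending on the chosen policy.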
Below is an example of a lexer for UTF8 encoded Unicode identifiers.
//go:generate re2go $INPUT -o $OUTPUT -8si --api simple
package main
/*!include:re2c "unicode_categories.re" */
func lex(yyinput string) int {
var yycursor, yymarker int
/*!re2c
re2c:yyfill:enable = 0;
re2c:YYCTYPE = byte;
// Simplified "Unicode Identifier and Pattern Syntax"
// (see https://unicode.org/reports/tr31)
id_start = L | Nl | [$_];
id_continue = id_start | Mn | Mc | Nd | Pc | [\u200D\u05F3];
identifier = id_start id_continue*;
identifier { return 0 }
* { return 1 }
*/
}
func main() {
if lex("_Ыдентификатор\000") != 0 {
panic("error")
}
}
Include files
re2go allows one to include other files using a block of the form
/*!include:re2c FILE */ or %{include FILE %}, or an in-block directive
!include FILE ;, where FILE is a path to the file to be included.
re2go looks for include files in the directory of the including file
and in include locations, which can be specified with the -I option.
Include blocks/directives in re2go work in the same way as C/C++
#include: FILE contents are copy-pasted verbatim in place of the
block/directive. Include files may have further includes of their own.
Use the --depfile option to track build dependencies of the output file on
include files. re2go provides some predefined include files that can
be found in the include/ subdirectory of the project. These files
contain definitions that may be useful to other projects (such as
Unicode categories) and form something like a standard library for
re2go. Below is an example of using include files.
Include file 1 (definitions.go)
const (
ResultOk = iota
ResultFail
)
/*!re2c
number = [1-9][0-9]*;
*/
Include file 2 (extra_rules.re.inc)
// floating-point numbers
frac = [0-9]* "." [0-9]+ | [0-9]+ ".";
exp = 'e' [+-]? [0-9]+;
float = frac exp? | [0-9]+ exp;
float { return ResultOk }
Input file
//go:generate re2go $INPUT -o $OUTPUT -i --api simple
package main
/*!include:re2c "definitions.go" */
func lex(yyinput string) int {
var yycursor, yymarker int
/*!re2c
re2c:YYCTYPE = byte;
re2c:yyfill:enable = 0;
* { return ResultFail }
number { return ResultOk }
!include "extra_rules.re.inc";
*/
}
func main() {
assert_eq := func(x, y int) { if x != y { panic("error") } }
assert_eq(lex("123\000"), ResultOk)
assert_eq(lex("123.4567\000"), ResultOk)
}
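The verbatim, recursive nature of includes can be modelled in a few lines. The following is a simplified sketch and not re2go's implementation: it handles only the in-block form !include "FILE"; and reads files from a hypothetical in-memory map instead of searching the -I locations. The names directive and expand, as well as the depth limit of 16, are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// directive matches an in-block include such as: !include "name";
var directive = regexp.MustCompile(`!include\s+"([^"]+)"\s*;`)

// expand replaces every include directive in text with the contents of the
// named file, recursively. files stands in for the file system.
func expand(text string, files map[string]string, depth int) (string, error) {
	if depth > 16 { // arbitrary guard against include cycles
		return "", fmt.Errorf("include depth exceeded")
	}
	var firstErr error
	out := directive.ReplaceAllStringFunc(text, func(m string) string {
		name := directive.FindStringSubmatch(m)[1]
		body, ok := files[name]
		if !ok {
			if firstErr == nil {
				firstErr = fmt.Errorf("no such file: %s", name)
			}
			return m
		}
		// Included files may have further includes of their own.
		inner, err := expand(body, files, depth+1)
		if err != nil && firstErr == nil {
			firstErr = err
		}
		return inner
	})
	return out, firstErr
}

func main() {
	files := map[string]string{
		"rules.inc": "float { return ResultOk }\n!include \"defs.inc\";",
		"defs.inc":  "digit = [0-9];",
	}
	out, err := expand("number { return ResultOk }\n!include \"rules.inc\";", files, 0)
	fmt.Println(out, err)
}
```

Because inclusion is purely textual, the order of rules in the combined block is the order in which the text appears after expansion.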
Header files
re2go can generate a header file from the input .re file using the
--header option or the re2c:header configuration, together with a pair
of blocks of the form /*!header:re2c:on*/ and /*!header:re2c:off*/, or
%{header:on%} and %{header:off%}. The first block marks the beginning of
the header file, and the second block marks its end. Everything between
these blocks is processed by re2go, and the generated code is written to
the file specified with the --header option or the re2c:header
configuration (or to stdout if neither is used). An autogenerated header
file may be needed when re2go is used to generate definitions that must
be visible from other translation units.
Here is an example of generating a header file that contains the
definition of the lexer state with tag variables (the number of tag
variables depends on the regular grammar and is unknown to the programmer).
Input file
//go:generate re2go $INPUT -o $OUTPUT -i --header lexer/state.go
package main
import "./lexer" // the package is generated by re2c
/*!header:re2c:on*/
package lexer
type State struct {
Data string
Cur /*!stags:re2c format=", @@"; */ int
}
/*!header:re2c:off*/
func lex(yyrecord *lexer.State) int {
var t int
/*!re2c
re2c:header = "lexer/state.go";
re2c:api = record;
re2c:YYCTYPE = byte;
re2c:YYINPUT = "yyrecord.Data";
re2c:YYCURSOR = "yyrecord.Cur";
re2c:yyfill:enable = 0;
re2c:tags = 1;
re2c:tags:prefix = "Tag";
[a]* @t [b]* { return t }
*/
}
func main() {
st := &lexer.State{Data:"ab\x00",}
if lex(st) != 1 {
panic("error")
}
}
Header file
// Code generated by re2c, DO NOT EDIT.
package lexer
type State struct {
Data string
Cur, Mar, Tag1 int
}
Skeleton programs
With the -S, --skeleton option, re2go ignores all non-re2go code and
generates a self-contained C program that can be further compiled and
executed. The program consists of lexer code and input data. For each
constructed DFA (block or condition) re2go generates a standalone lexer
and two files: an .input file with strings derived from the DFA and a
.keys file with expected match results. The program runs each lexer on
the corresponding .input file and compares results with the
expectations. Skeleton programs are very useful for a number of
reasons:
o They can check correctness of various re2go optimizations (the data
is generated early in the process, before any DFA transformations
have taken place).
o Generating a set of input data with good coverage may be useful for
both testing and benchmarking.
o Generating self-contained executable programs allows one to get
minimized test cases (the original code may be large or have a lot of
dependencies).
The difficulty with generating input data is that for all but the most
trivial cases the number of possible input strings is too large (even
if the string length is limited). re2go solves this difficulty by
generating sufficiently many strings to cover almost all DFA
transitions. It uses the following algorithm. First, it constructs a
skeleton of the DFA. For encodings with 1-byte code unit size (such as
ASCII, UTF-8 and EBCDIC) the skeleton is an exact copy of the original
DFA. For encodings with multibyte code units the skeleton is a copy of the DFA
with certain transitions omitted: namely, re2go takes at most 256 code
units for each disjoint continuous range that corresponds to a DFA
transition. The chosen values are evenly distributed and include range
bounds. Instead of trying to cover all possible paths in the skeleton
(which is infeasible) re2go generates sufficiently many paths to cover
all skeleton transitions, and thus trigger the corresponding
conditional jumps in the lexer. The algorithm implementation is
limited to ~1Gb of transitions and consumes a constant amount of memory
(re2go writes data to file as soon as it is generated).
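The range sampling step described above (at most 256 evenly distributed code units per range, bounds included) can be sketched as follows. This is one plausible way to do it, written for illustration only; the function name sample and the exact rounding are assumptions, and re2go's real implementation may differ.

```go
package main

import "fmt"

// sample picks at most limit values from the inclusive range [lo, hi].
// The values are evenly spaced and always include both bounds.
// It assumes lo <= hi and limit >= 2.
func sample(lo, hi uint32, limit int) []uint32 {
	size := uint64(hi-lo) + 1
	if size <= uint64(limit) {
		// Small range: keep every value.
		out := make([]uint32, 0, size)
		for v := uint64(lo); v <= uint64(hi); v++ {
			out = append(out, uint32(v))
		}
		return out
	}
	out := make([]uint32, 0, limit)
	span := uint64(hi - lo)
	for i := 0; i < limit; i++ {
		// i == 0 yields lo and i == limit-1 yields hi exactly.
		out = append(out, lo+uint32(span*uint64(i)/uint64(limit-1)))
	}
	return out
}

func main() {
	full := sample(0, 0xFFFF, 256) // a whole UCS2 range is thinned out
	small := sample(0x41, 0x45, 256)
	fmt.Println(len(full), full[0], full[len(full)-1], small)
}
```

Keeping the bounds matters because off-by-one errors in the generated comparisons are most likely to show up at the edges of a range.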
Visualization and debug
With the -D, --emit-dot option, re2go does not generate code. Instead,
it dumps the generated DFA in DOT format. One can convert this dump to
an image of the DFA using Graphviz or another library. Note that this
option shows the final DFA after it has gone through a number of
optimizations and transformations. Earlier stages can be dumped with
various debug options, such as --dump-nfa, --dump-dfa-raw etc. (see the
full list of options).
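For readers unfamiliar with DOT, the sketch below prints a hand-written two-state DFA for [1-9][0-9]* in that format. It only illustrates the general shape of a DOT graph; the edge type, the toDot function and the chosen attributes are hypothetical, and the dump produced by re2go itself is laid out differently.

```go
package main

import (
	"fmt"
	"strings"
)

// edge is one labelled DFA transition.
type edge struct {
	from, to int
	label    string
}

// toDot renders the transitions as a Graphviz digraph.
// States listed in final are drawn with a double circle.
func toDot(edges []edge, final []int) string {
	var b strings.Builder
	b.WriteString("digraph dfa {\n")
	for _, s := range final {
		fmt.Fprintf(&b, "  %d [shape=doublecircle];\n", s)
	}
	for _, e := range edges {
		fmt.Fprintf(&b, "  %d -> %d [label=%q];\n", e.from, e.to, e.label)
	}
	b.WriteString("}\n")
	return b.String()
}

func main() {
	edges := []edge{
		{0, 1, "[1-9]"}, // first digit must be non-zero
		{1, 1, "[0-9]"}, // any number of further digits
	}
	fmt.Print(toDot(edges, []int{1}))
}
```

Text in this format can be turned into an image with the Graphviz dot tool, for example dot -Tpng dfa.dot -o dfa.png.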
SEE ALSO
You can find more information about re2c at the official website:
http://re2c.org. Similar programs are flex(1), lex(1) and
quex (http://quex.sourceforge.net).
AUTHORS
re2go was originally written by Peter Bumbulis (peter@csg.uwaterloo.ca)
in 1993. Marcus Boerger and Dan Nuffer spent several years turning the
original idea into a production-ready code generator. Since then it has
been maintained and developed by multiple volunteers, most notably,
Brian Young (bayoung@acm.org), Marcus Boerger, Dan Nuffer
(nuffer@users.sourceforge.net), Ulya Trofimovich (skvadrik@gmail.com),
Serghei Iakovlev, Sergei Trofimovich, Petr Skocik, ligfx, raekye and
PolarGoose.
re2go(1)
re2c 4.3 - Generated Fri Aug 15 07:25:37 CDT 2025
