7 How the Input Is Matched

When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text (for trailing context rules, this includes the length of the trailing part, even though it will then be returned to the input). If it finds two or more matches of the same length, the rule listed first in the flex input file is chosen.

Once the match is determined, the text corresponding to the match (called the token) is made available in the global character pointer yytext, and its length in the global integer yyleng. The action corresponding to the matched pattern is then executed (see section Actions), and then the remaining input is scanned for another match.

If no match is found, then the default rule is executed: the next character in the input is considered matched and copied to the standard output. Thus, the simplest valid flex input is:

%%

which generates a scanner that simply copies its input (one character at a time) to its output.

Note that yytext can be defined in two different ways: either as a character pointer or as a character array. You can control which definition flex uses by including one of the special directives %pointer or %array in the first (definitions) section of your flex input. The default is %pointer, unless you use the ‘-l’ lex compatibility option, in which case yytext will be an array. The advantage of using %pointer is substantially faster scanning and no buffer overflow when matching very large tokens (unless you run out of dynamic memory). The disadvantage is that you are restricted in how your actions can modify yytext (see section Actions), and calls to the unput() function destroys the present contents of yytext, which can be a considerable porting headache when moving between different lex versions.

The advantage of %array is that you can then modify yytext to your heart’s content, and calls to unput() do not destroy yytext (see section Actions). Furthermore, existing lex programs sometimes access yytext externally using declarations of the form:

    extern char yytext[];

This definition is erroneous when used with %pointer, but correct for %array.

The %array declaration defines yytext to be an array of YYLMAX characters, which defaults to a fairly large value. You can change the size by simply #define’ing YYLMAX to a different value in the first section of your flex input. As mentioned above, with %pointer yytext grows dynamically to accommodate large tokens. While this means your %pointer scanner can accommodate very large tokens (such as matching entire blocks of comments), bear in mind that each time the scanner must resize yytext it also must rescan the entire token from the beginning, so matching such tokens can prove slow. yytext presently does not dynamically grow if a call to unput() results in too much text being pushed back; instead, a run-time error results.

Also note that you cannot use %array with C++ scanner classes (see section Generating C++ Scanners).

This document was generated on November 4, 2011 using texi2html 5.0.