13 Posix Regular Expressions

This whole section was written by Dorai Sitaram. It consists of the documentation of the pregexp package, which may be found at http://www.ccs.neu.edu/~dorai/pregexp/pregexp.html.

The regexp notation supported is modeled on Perl’s, and includes such powerful directives as numeric and nongreedy quantifiers, capturing and non-capturing clustering, POSIX character classes, selective case- and space-insensitivity, backreferences, alternation, backtrack pruning, positive and negative lookahead and lookbehind, in addition to the more basic directives familiar to all regexp users.

A regexp is a string that describes a pattern. A regexp matcher tries to match this pattern against (a portion of) another string, which we will call the text string. The text string is treated as raw text and not as a pattern.

Most of the characters in a regexp pattern are meant to match occurrences of themselves in the text string. Thus, the pattern "abc" matches a string that contains the characters a, b, c in succession.

In the regexp pattern, some characters act as metacharacters, and some character sequences act as metasequences. That is, they specify something other than their literal selves. For example, in the pattern "a.c", the characters a and c do stand for themselves but the metacharacter . can match any character (other than newline). Therefore, the pattern "a.c" matches an a, followed by any character, followed by a c.
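As a small illustration of literal characters and the . metacharacter, the following sketch uses pregexp-match, one of the procedures described in section 13.1. It assumes that pregexp-match returns a list of matched strings on success and #f on failure.

```scheme
;; Literal characters match themselves anywhere in the text string.
(pregexp-match "abc" "xxabcxx")   ; => ("abc")

;; The metacharacter . stands for any single character except newline.
(pregexp-match "a.c" "abc")       ; => ("abc")
(pregexp-match "a.c" "a-c")       ; => ("a-c")

;; . must consume exactly one character, so "ac" does not match.
(pregexp-match "a.c" "ac")        ; => #f
```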

If we need to match the character . itself, we escape it, i.e., precede it with a backslash (\). The character sequence \. is thus a metasequence, since it doesn’t match itself but rather just a literal period. So, to match a followed by a literal . followed by c, we use the regexp pattern "a\\.c".(4) The backslash is doubled because the pattern is written as a Scheme string literal, in which \\ denotes a single backslash. Another example of a metasequence is \t, which is a readable way to represent the tab character.
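The effect of escaping can be sketched as follows, again assuming the pregexp-match procedure of section 13.1. The call to string-length only serves to show how many characters the matcher actually receives.

```scheme
;; In a Scheme string literal, "\\" denotes one backslash, so the
;; matcher receives the four-character pattern a\.c
(string-length "a\\.c")           ; => 4

;; The escaped period matches only a literal period.
(pregexp-match "a\\.c" "a.c")     ; => ("a.c")
(pregexp-match "a\\.c" "abc")     ; => #f

;; Without the escape, . is a metacharacter and also matches b.
(pregexp-match "a.c" "abc")       ; => ("abc")
```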

We will call the string representation of a regexp the U-regexp, where U can be taken to mean Unix-style or universal, because this notation for regexps is universally familiar. Our implementation uses an intermediate tree-like representation called the S-regexp, where S can stand for Scheme, symbolic, or s-expression. S-regexps are more verbose and less readable than U-regexps, but they are much easier for Scheme’s recursive procedures to navigate.
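The relation between the two representations can be sketched as follows. This assumes that the pregexp procedure of section 13.1 converts a U-regexp into its compiled form, and that pregexp-match accepts either form. The variable name rx is purely illustrative.

```scheme
;; Convert the U-regexp (a string) into its compiled form once ...
(define rx (pregexp "c.r"))

;; ... then reuse the compiled form for several matches.
(pregexp-match rx "car")       ; => ("car")
(pregexp-match rx "cdr")       ; => ("cdr")

;; Passing the U-regexp string directly gives the same result.
(pregexp-match "c.r" "car")    ; => ("car")
```

Converting once is mainly worthwhile when the same pattern is matched against many text strings, since the string form would otherwise be translated on each call.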

This document was generated on October 23, 2011 using texi2html 5.0.

13.1 Regular Expressions Procedures
13.2 Regular Expressions Pattern Language
13.3 An Extended Example