pcre(3)                                                                pcre(3)




NAME

       PCRE - Perl-compatible regular expressions


INTRODUCTION


       The  PCRE  library is a set of functions that implement regular expres-
       sion pattern matching using the same syntax and semantics as Perl, with
       just a few differences. Some features that appeared in Python and PCRE
       before they appeared in Perl are also available using the Python
       syntax. There is also some support for one or two .NET and Oniguruma
       syntax items, and there is an option for requesting some minor changes
       that give better JavaScript compatibility.

       The  current implementation of PCRE corresponds approximately with Perl
       5.10/5.11, including support for UTF-8 encoded strings and Unicode gen-
       eral  category properties. However, UTF-8 and Unicode support has to be
       explicitly enabled; it is not the default. The  Unicode  tables  corre-
       spond to Unicode release 5.2.0.

       In  addition to the Perl-compatible matching function, PCRE contains an
       alternative function that matches the same compiled patterns in a  dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages.  For a discussion of the two matching algorithms,  see  the
       pcrematching page.

       PCRE  is  written  in C and released as a C library. A number of people
       have written wrappers and interfaces of various kinds.  In  particular,
       Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
       included as part of the PCRE distribution. The pcrecpp page has details
       of  this  interface.  Other  people's contributions can be found in the
       Contrib directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details of exactly which Perl regular expression features are  and  are
       not supported by PCRE are given in separate documents. See the pcrepat-
       tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
       page.

       Some  features  of  PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it  possible  for  a
       client  to  discover  which  features are available. The features them-
       selves are described in the pcrebuild page. Documentation about  build-
       ing  PCRE  for various operating systems can be found in the README and
       NON-UNIX-USE files in the source distribution.

       The library contains a number of undocumented  internal  functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre_", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external  symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


USER DOCUMENTATION


       The user documentation for PCRE comprises a number  of  different  sec-
       tions.  In the "man" format, each of these is a separate "man page". In
       the HTML format, each is a separate page, linked from the  index  page.
       In  the  plain  text format, all the sections, except the pcredemo sec-
       tion, are concatenated, for ease of searching. The sections are as fol-
       lows:

         pcre              this document
         pcre-config       show PCRE installation configuration information
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper
         pcredemo          a demonstration C program that uses PCRE
         pcregrep          description of the pcregrep command
         pcrematching      discussion of the two matching algorithms
         pcrepartial       details of the partial matching facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API
         pcreprecompile    details of saving and re-using precompiled patterns
         pcresample        discussion of the pcredemo program
         pcrestack         discussion of stack usage
         pcresyntax        quick syntax reference
         pcretest          description of the pcretest testing command

       In addition, in the "man" and HTML formats, there is a short  page  for
       each C library function, listing its arguments and results.


LIMITATIONS


       There  are some size limitations in PCRE but it is hoped that they will
       never in practice be relevant.

       The maximum length of a compiled pattern is 65539 (sic) bytes  if  PCRE
       is compiled with the default internal linkage size of 2. If you want to
       process regular expressions that are truly enormous,  you  can  compile
       PCRE  with  an  internal linkage size of 3 or 4 (see the README file in
       the source distribution and the pcrebuild documentation  for  details).
       In  these  cases the limit is substantially larger.  However, the speed
       of execution is slower.

       All values in repeating quantifiers must be less than 65536.

       There is no limit to the number of parenthesized subpatterns, but there
       can be no more than 65535 capturing subpatterns.

       The maximum length of a name for a named subpattern is 32  characters,
       and the maximum number of named subpatterns is 10000.

       The maximum length of a subject string is the largest  positive  number
       that  an integer variable can hold. However, when using the traditional
       matching function, PCRE uses recursion to handle subpatterns and indef-
       inite  repetition.  This means that the available stack space may limit
       the size of a subject string that can be processed by certain patterns.
       For a discussion of stack issues, see the pcrestack documentation.
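
       The limits above can be checked before a pattern or subject is handed
       to the library. The following is a minimal sketch in C; the macro and
       function names are hypothetical and are not part of the PCRE API, and
       treating an empty subpattern name as invalid is an assumption of the
       sketch rather than something stated above.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Limits quoted in the LIMITATIONS text; the macro names are hypothetical. */
#define MAX_QUANTIFIER_VALUE 65535  /* repeat values must be less than 65536 */
#define MAX_SUBPATTERN_NAME  32     /* characters in a named subpattern's name */

/* Returns 1 if a subject of this many bytes can be described by an int. */
int subject_length_ok(size_t length)
{
    return length <= (size_t)INT_MAX;
}

/* Returns 1 if the name has between 1 and 32 characters (assumption:
   an empty name is rejected). */
int subpattern_name_length_ok(const char *name)
{
    size_t n = strlen(name);
    return n >= 1 && n <= MAX_SUBPATTERN_NAME;
}

/* Returns 1 if the value may appear in a repeating quantifier. */
int quantifier_value_ok(long value)
{
    return value >= 0 && value <= MAX_QUANTIFIER_VALUE;
}
```

       A caller holding a size_t length would typically test it with
       subject_length_ok() before casting it to the int that the matching
       functions expect.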


UTF-8 AND UNICODE PROPERTY SUPPORT


       From  release  3.3,  PCRE  has  had  some support for character strings
       encoded in the UTF-8 format. For release 4.0 this was greatly  extended
       to  cover  most common requirements, and in release 5.0 additional sup-
       port for Unicode general category properties was added.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8
       support  in  the  code,  and, in addition, you must call pcre_compile()
       with the PCRE_UTF8 option flag, or the  pattern  must  start  with  the
       sequence  (*UTF8).  When  either of these is the case, both the pattern
       and any subject strings that are matched  against  it  are  treated  as
       UTF-8 strings instead of strings of 1-byte characters.

       If  you compile PCRE with UTF-8 support, but do not use it at run time,
       the library will be a bit bigger, but the additional run time  overhead
       is limited to testing the PCRE_UTF8 flag occasionally, so should not be
       very big.

       If PCRE is built with Unicode character property support (which implies
       UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
       ported.  The available properties that can be tested are limited to the
       general  category  properties such as Lu for an upper case letter or Nd
       for a decimal number, the Unicode script names such as Arabic  or  Han,
       and  the  derived  properties  Any  and L&. A full list is given in the
       pcrepattern documentation. Only the short names for properties are sup-
       ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
       ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
       optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
       does not support this.

   Validity of UTF-8 strings

       When you set the PCRE_UTF8 flag, the strings  passed  as  patterns  and
       subjects are (by default) checked for validity on entry to the relevant
       functions. From release 7.3 of PCRE, the check is according to the rules
       of  RFC  3629, which are themselves derived from the Unicode specifica-
       tion. Earlier releases of PCRE followed the rules of  RFC  2279,  which
       allows  the  full range of 31-bit values (0 to 0x7FFFFFFF). The current
       check allows only values in the range U+0 to U+10FFFF, excluding U+D800
       to U+DFFF.

       The excluded code points are the Unicode surrogates, U+D800 to U+DBFF
       (high) and U+DC00 to U+DFFF (low). Of the low range, the Unicode
       Standard says this: "The Low Surrogate Area does not
       contain  any  character  assignments,  consequently  no  character code
       charts or namelists are provided for this area. Surrogates are reserved
       for  use  with  UTF-16 and then must be used in pairs." The code points
       that are encoded by UTF-16 pairs  are  available  as  independent  code
       points  in  the  UTF-8  encoding.  (In other words, the whole surrogate
       thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
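
       The two ranges described above can be written as simple predicates on
       a decoded code point. This is an illustrative sketch only; the
       function names are hypothetical and the code is not taken from PCRE.

```c
#include <stdint.h>

/* Current rule (RFC 3629): U+0000 to U+10FFFF, excluding the
   surrogate range U+D800 to U+DFFF. */
int code_point_valid_rfc3629(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;
    return 1;
}

/* Older rule (RFC 2279): any 31-bit value, 0 to 0x7FFFFFFF. */
int code_point_valid_rfc2279(uint32_t cp)
{
    return cp <= 0x7FFFFFFF;
}
```

       For example, the value 0x110000 passes the older test but fails the
       current one, as does any value in the surrogate range.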

       If an  invalid  UTF-8  string  is  passed  to  PCRE,  an  error  return
       (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
       that your strings are valid, and therefore want to skip these checks in
       order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
       compile time or at run time, PCRE assumes that the pattern  or  subject
       it  is  given  (respectively)  contains only valid UTF-8 codes. In this
       case, it does not diagnose an invalid UTF-8 string.

       If you pass an invalid UTF-8 string  when  PCRE_NO_UTF8_CHECK  is  set,
       what  happens  depends on why the string is invalid. If the string con-
       forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
       string  of  characters  in  the  range 0 to 0x7FFFFFFF. In other words,
       apart from the initial validity test, PCRE (when in UTF-8 mode) handles
       strings  according  to  the more liberal rules of RFC 2279. However, if
       the string does not even conform to RFC 2279, the result is  undefined.
       Your program may crash.

       If  you  want  to  process  strings  of  values  in the full range 0 to
       0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you  can
       set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
       this situation, you will have to apply your own validity check.
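
       A validity check of your own for the older scheme might look like the
       sketch below. It accepts sequences of one to six bytes that encode
       values from 0 to 0x7FFFFFFF. The function name is hypothetical, and
       rejecting overlong (non-shortest) encodings is an assumption of this
       sketch, not a statement about what PCRE itself requires.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if buf[0..len) is well formed under the older RFC 2279
   scheme (1 to 6 bytes per character, values 0 to 0x7FFFFFFF), else 0.
   Surrogates are accepted, because the older range does not exclude them. */
int utf8_like_valid_rfc2279(const unsigned char *buf, size_t len)
{
    /* Smallest value that needs a sequence of 1..6 bytes (index = length). */
    static const uint32_t min_value[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t i = 0;

    while (i < len) {
        unsigned char lead = buf[i++];
        int extra;       /* number of continuation bytes expected */
        int k;
        uint32_t value;

        if (lead < 0x80)      { continue; }            /* single byte */
        else if (lead < 0xC0) { return 0; }            /* stray continuation */
        else if (lead < 0xE0) { extra = 1; value = lead & 0x1F; }
        else if (lead < 0xF0) { extra = 2; value = lead & 0x0F; }
        else if (lead < 0xF8) { extra = 3; value = lead & 0x07; }
        else if (lead < 0xFC) { extra = 4; value = lead & 0x03; }
        else if (lead < 0xFE) { extra = 5; value = lead & 0x01; }
        else                  { return 0; }            /* 0xFE, 0xFF invalid */

        if ((size_t)extra > len - i)
            return 0;                                  /* truncated sequence */

        for (k = 0; k < extra; k++) {
            unsigned char c = buf[i++];
            if ((c & 0xC0) != 0x80)
                return 0;                              /* not 10xxxxxx */
            value = (value << 6) | (uint32_t)(c & 0x3F);
        }

        if (value < min_value[extra + 1])
            return 0;                                  /* overlong encoding */
    }
    return 1;
}
```

       A string that passes this test can be processed with
       PCRE_NO_UTF8_CHECK set without running into the undefined behaviour
       described above for strings that do not conform to RFC 2279.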

   General comments about UTF-8 mode

       1. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
       two-byte UTF-8 character if the value is greater than 127.

       2.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
       characters for values greater than \177.

       3. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
       vidual bytes, for example: \x{100}{3}.

       4.  The dot metacharacter matches one UTF-8 character instead of a sin-
       gle byte.

       5. The escape sequence \C can be used to match a single byte  in  UTF-8
       mode,  but  its  use can lead to some strange effects. This facility is
       not available in the alternative matching function, pcre_dfa_exec().

       6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
       test characters of any code value, but, by default, the characters that
       PCRE recognizes as digits, spaces, or word characters remain  the  same
       set  as  before,  all with values less than 256. This remains true even
       when PCRE is built to include Unicode property support, because  to  do
       otherwise  would  slow  down  PCRE in many common cases. Note that this
       also applies to \b, because it is defined in terms of \w and \W. If you
       really  want  to  test  for a wider sense of, say, "digit", you can use
       explicit Unicode property tests such as \p{Nd}.  Alternatively, if  you
       set  the  PCRE_UCP  option,  the way that the character escapes work is
       changed so that Unicode properties are used to determine which  charac-
       ters  match. There are more details in the section on generic character
       types in the pcrepattern documentation.

       7. Similarly, characters that match the POSIX named  character  classes
       are all low-valued characters, unless the PCRE_UCP option is set.

       8.  However,  the Perl 5.10 horizontal and vertical whitespace matching
       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
       acters, whether or not PCRE_UCP is set.

       9.  Case-insensitive  matching  applies only to characters whose values
       are less than 128, unless PCRE is built with Unicode property  support.
       Even  when  Unicode  property support is available, PCRE still uses its
       own character tables when checking the case of  low-valued  characters,
       so  as not to degrade performance.  The Unicode property information is
       used only for characters with higher values. Even when Unicode property
       support is available, PCRE supports case-insensitive matching only when
       there is a one-to-one mapping between a letter's  cases.  There  are  a
       small  number  of  many-to-one  mappings in Unicode; these are not sup-
       ported by PCRE.
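
       Items 1 to 4 above rest on the difference between characters and
       bytes in UTF-8. The sketch below shows why a value such as \xb3
       occupies two bytes in a UTF-8 subject, and how characters can be
       counted independently of bytes. The function names are hypothetical
       and the code is not part of PCRE; the encoder covers only the range 0
       to 0x10FFFF and does not reject surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes cp (0 to 0x10FFFF) as UTF-8 into out and returns the number
   of bytes written, or 0 if cp is out of range. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Counts the characters in a valid UTF-8 buffer by counting every byte
   that is not a continuation byte (10xxxxxx). */
size_t utf8_char_count(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

       With this encoder, the value 0xB3 becomes the two bytes 0xC2 0xB3 and
       octal 777 becomes 0xC7 0xBF. A subject holding three copies of
       U+0100, which the pattern \x{100}{3} would match, is six bytes long
       but counts as three characters.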


AUTHOR


       Philip Hazel
       University Computing Service
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have been a spam  magnet,
       so  I've  taken  it away. If you want to email me, use my two initials,
       followed by the two digits 10, at the domain cam.ac.uk.


REVISION


       Last updated: 12 May 2010
       Copyright (c) 1997-2010 University of Cambridge.



                                                                       pcre(3)

pcre 8.10 - Generated Fri Jun 25 19:20:59 CDT 2010