manpagez: man pages & more
man peg(n)
Home | html | info | man
grammar::peg(n)          Grammar operations and usage          grammar::peg(n)



______________________________________________________________________________


NAME

       grammar::peg - Create and manipulate parsing expression grammars


SYNOPSIS

       package require Tcl  8.4

       package require snit

       package require grammar::peg  ?0.1?

       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?

       pegName destroy

       pegName clear

       pegName = srcPEG

       pegName --> dstPEG

       pegName serialize

       pegName deserialize serialization

       pegName is valid

       pegName start ?pe?

       pegName nonterminals

       pegName nonterminal add nt pe

       pegName nonterminal delete nt1 ?nt2 ...?

       pegName nonterminal exists nt

       pegName nonterminal rename nt ntnew

       pegName nonterminal mode nt ?mode?

       pegName nonterminal rule nt

       pegName unknown nonterminals

_________________________________________________________________


DESCRIPTION

       This package provides a container class for parsing expression grammars
       (Short: PEG).  It allows the incremental definition of the grammar, its
       manipulation  and querying of the definition.  The package neither pro-
       vides complex operations on the grammar, nor has it the ability to exe-
       cute  a  grammar  definition  for  a  stream  of symbols.  Two packages
       related to this one are grammar::mengine and grammar::peg::interpreter.
       The first of them defines a general virtual machine for the matching of
       a character stream, and the second implements an interpreter for  pars-
       ing expression grammars on top of that virtual machine.

   TERMS & CONCEPTS
       PEGs  are similar to context-free grammars, but not equivalent; in some
       cases PEGs are strictly more powerful than context-free grammars (there
       exist PEGs for some non-context-free languages).  The formal mathemati-
       cal definition of parsing expressions and parsing  expression  grammars
       can be found in section PARSING EXPRESSION GRAMMARS.

       In  short,  we have terminal symbols, which are the most basic building
       blocks for sentences, and nonterminal symbols with  associated  parsing
       expressions,  defining  the grammatical structure of the sentences. The
       two sets of symbols are distinctive, and do not overlap. When  speaking
       about  symbols  the  word  "symbol" is often left out. The union of the
       sets of terminal and nonterminal symbols is called the set of  symbols.

       Here the set of terminal symbols is not explicitly managed, but implic-
       itly defined as the set of all characters. Note that this means that we
       inherit from Tcl the ability to handle all of Unicode.

       A pair of nonterminal and parsing expression is also called a grammati-
       cal rule, or rule for short. In the context of a rule  the  nonterminal
       is  often  called  the left-hand-side (LHS), and the parsing expression
       the right-hand-side (RHS).

       The start expression of a grammar is a parsing  expression  from  which
       all  the  sentences  contained in the language specified by the grammar
       are derived.  To make the understanding of  this  term  easier  let  us
       assume  for  a  moment that the RHS of each rule, and the start expres-
       sion, is either a sequence of symbols, or a series of alternate parsing
       expressions.   In  the  latter  case  the  rule can be seen as a set of
       rules, each providing one alternative for the nonterminal.   A  parsing
       expression  A' is now a derivation of a parsing expression A if we pick
       one of the nonterminals N in the expression, and one of the alternative
       rules  R  for  N, and then replace the nonterminal in A with the RHS of
       the chosen rule. Here we can see why the terminal  symbols  are  called
       such.  They  cannot be expanded any further, thus terminate the process
       of deriving new expressions.  An example


           Rules
             (1)  A <- a B c
             (2a) B <- d B
             (2b) B <- e

           Some derivations, using starting expression A.

             A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c


       A derived expression containing only terminal symbols  is  a  sentence.
       The set of all sentences which can be derived from the start expression
       is the language of the grammar.

       Some definitions for nonterminals and expressions:

       [1]    A nonterminal A is called reachable if it is possible to  derive
              a parsing expression from the start expression which contains A.

       [2]    A nonterminal A is called useful if it is possible to  derive  a
              sentence from it.

       [3]    A  nonterminal A is called recursive if it is possible to derive
              a parsing expression from it which contains A, again.

       [4]    The FIRST set of a nonterminal A contains all the symbols  which
              can  occur  of  as  the  leftmost symbol in a parsing expression
              derived from A. If the FIRST set contains  A  itself  then  that
              nonterminal is called left-recursive.

       [5]    The  LAST  set of a nonterminal A contains all the symbols which
              can occur of as the rightmost symbol  in  a  parsing  expression
              derived from A. If the LAST set contains A itself then that non-
              terminal is called right-recursive.

       [6]    The FOLLOW set of a nonterminal A contains all the symbols which
              can occur after A in a parsing expression derived from the start
              expression.

       [7]    A nonterminal (or parsing expression) is called nullable if  the
              empty sentence can be derived from it.


       And based on the above definitions for grammars:

       [1]    A  grammar G is recursive if and only if it contains a nontermi-
              nal A which is recursive. The terms left-  and  right-recursive,
              and useful are analogously defined.

       [2]    A  grammar  is  minimal if it contains only reachable and useful
              nonterminals.

       [3]    A grammar is wellformed if it is not left-recursive. Such  gram-
              mars  are also complete, which means that they always succeed or
              fail on all input sentences. For an incomplete  grammar  on  the
              other  hand  input sentences exist for which an attempt to match
              them against the grammar will not terminate.

       [4]    As we wish to allow ourselves to build a  grammar  incrementally
              in  a container object we will encounter stages where the RHS of
              one or more rules reference symbols which are not yet  known  to
              the  container.  Such  a grammar we call invalid.  We cannot use
              the term incomplete as this term is already taken, see the  last
              item.



   CONTAINER CLASS API
       The package exports the API described here.

       ::grammar::peg pegName ?=|:=|<--|as|deserialize src?
              The command creates a new container object for a parsing expres-
              sion grammar and returns the fully qualified name of the  object
              command as its result. The API the returned command is following
              is described in the section CONTAINER OBJECT API. It may be used
              to  invoke  various  operations on the container and the grammar
              within.

              The new container, i.e. grammar will be empty if no src is spec-
              ified. Otherwise it will contain a copy of the grammar contained
              in the src.  The src has to be a container object reference  for
              all  operators  except  deserialize.   The  deserialize operator
              requires src to be the serialization  of  a  parsing  expression
              grammar instead.

              An  empty  grammar  has  no  nonterminal  symbols, and the start
              expression is the empty expression, i.e. epsilon. It  is  valid,
              but not useful.


   CONTAINER OBJECT API
       All  grammar  container  objects  provide the following methods for the
       manipulation of their contents:

       pegName destroy
              Destroys the grammar, including its storage space and associated
              command.

       pegName clear
              Clears  out  the definition of the grammar contained in pegName,
              but does not destroy the object.

       pegName = srcPEG
              Assigns the contents of the grammar contained in srcPEG to  peg-
              Name,  overwriting any existing definition.  This is the assign-
              ment operator for grammars. It copies the grammar  contained  in
              the  grammar  object  srcPEG over the grammar definition in peg-
              Name. The old contents of pegName are deleted by this operation.

              This operation is in effect equivalent to


                  pegName deserialize [srcPEG serialize]


       pegName --> dstPEG
              This  is the reverse assignment operator for grammars. It copies
              the automation contained in the object pegName over the  grammar
              definition in the object dstPEG.  The old contents of dstPEG are
              deleted by this operation.

              This operation is in effect equivalent to


                  dstPEG deserialize [pegName serialize]


       pegName serialize
              This method serializes the grammar stored in pegName.  In  other
              words it returns a tcl value completely describing that grammar.
              This allows, for example, the transfer of  grammars  over  arbi-
              trary channels, persistence, etc.  This method is also the basis
              for both the copy constructor and the assignment operator.

              The result of this method has to be semantically identical  over
              all  implementations of the grammar::peg interface. This is what
              will enable us to copy grammars  between  different  implementa-
              tions of the same interface.

              The  result is a list of four elements with the following struc-
              ture:

              [1]    The constant string grammar::peg.

              [2]    A dictionary. Its keys are the names of all known nonter-
                     minal  symbols, and their associated values are the pars-
                     ing expressions describing their sentennial structure.

              [3]    A dictionary. Its keys are the names of all known nonter-
                     minal  symbols,  and  their  associated values hints to a
                     matcher regarding the semantic  values  produced  by  the
                     symbol.

              [4]    The  last item is a parsing expression, the start expres-
                     sion of the grammar.

       Assuming the following PEG for simple mathematical expressions


           Digit      <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9'
           Sign       <- '+' / '-'
           Number     <- Sign? Digit+
           Expression <- '(' Expression ')' / (Factor (MulOp Factor)*)
           MulOp      <- '*' / '/'
           Factor     <- Term (AddOp Term)*
           AddOp      <- '+'/'-'
           Term       <- Number


       a possible serialization is


           grammar::peg \\
           {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \\
            Factor     {x Term {* {x AddOp Term}}} \\
            Term       Number \\
            MulOp      {/ * /} \\
            AddOp      {/ + -} \\
            Number     {x {? Sign} {+ Digit}} \\
            Sign       {/ + -} \\
            Digit      {/ 0 1 2 3 4 5 6 7 8 9} \\
           } \\
           {Expression value     Factor     value \\
            Term       value     MulOp      value \\
            AddOp      value     Number     value \\
            Sign       value     Digit      value \\
           }
           Expression


       A possible one, because the order of the nonterminals in the dictionary
       is not relevant.

       pegName deserialize serialization
              This  is  the  complement  to serialize. It replaces the grammar
              definition in pegName with the grammar described by the  serial-
              ization  value.  The old contents of pegName are deleted by this
              operation.

       pegName is valid
              A predicate. It tests whether the PEG in pegName is valid.   See
              section  TERMS  &  CONCEPTS  for  the definition of this grammar
              property.  The result is a boolean value. It will be set to true
              if the PEG has the tested property, and false otherwise.

       pegName start ?pe?
              This  method  defines  the  start  expression of the grammar. It
              replaces the previously defined start expression with the  pars-
              ing  expression  pe.  The method fails and throws an error if pe
              does not contain a valid parsing expression as specified in  the
              section  PARSING  EXPRESSIONS.  In  that case the existing start
              expression is not changed.  The method returns the empty  string
              as its result.

              If  the  method is called without an argument it will return the
              currently defined start expression.

       pegName nonterminals
              Returns the set of all nonterminal symbols known to the grammar.

       pegName nonterminal add nt pe
              This  method  adds the nonterminal nt and its associated parsing
              expression pe to the set of nonterminal symbols and rules of the
              PEG  contained  in  the  object  pegName.   The method fails and
              throws an error if either the string nt is already  known  as  a
              symbol of the grammar, or if pe does not contain a valid parsing
              expression as specified in the section PARSING  EXPRESSIONS.  In
              that  case  the  current set of nonterminal symbols and rules is
              not changed.  The method returns the empty string as its result.

       pegName nonterminal delete nt1 ?nt2 ...?
              This  method  removes the named symbols nt1, nt2 from the set of
              nonterminal symbols of the PEG contained in the object  pegName.
              The  method  fails  and throws an error if any of the strings is
              not known as a nonterminal symbol. In that case the current  set
              of  nonterminal  symbols is not changed.  The method returns the
              empty string as its result.

              The stored grammar becomes invalid if the  deleted  nonterminals
              are referenced by the RHS of still-known rules.

       pegName nonterminal exists nt
              A predicate. It tests whether the nonterminal symbol nt is known
              to the PEG in pegName.  The result is a boolean value.  It  will
              be set to true if the symbol nt is known, and false otherwise.

       pegName nonterminal rename nt ntnew
              This  method  renames  the  nonterminal symbol nt to ntnew.  The
              method fails and throws an error if either nt is not known as  a
              nonterminal,  or if ntnew is a known symbol.  The method returns
              the empty string as its result.

       pegName nonterminal mode nt ?mode?
              This mode returns or sets the semantic mode associated with  the
              nonterminal  symbol nt. If no mode is specified the current mode
              of the nonterminal is returned. Otherwise the  current  mode  is
              set  to mode.  The method fails and throws an error if nt is not
              known as a nonterminal.  The grammar interpreter implemented  by
              the  package  grammar::peg::interpreter recognizes the following
              modes:

              value  The semantic value of the  nonterminal  is  the  abstract
                     syntax  tree created from the AST's of the RHS and a node
                     for the nonterminal itself.

              match  The semantic value of the nonterminal is an the  abstract
                     syntax  tree  consisting  of single a node for the string
                     matched by the RHS. The ASTs generated  by  the  RHS  are
                     discarded.

              leaf   The  semantic value of the nonterminal is an the abstract
                     syntax tree consisting of single a node for the nontermi-
                     nal  itself. The ASTs generated by the RHS are discarded.

              discard
                     The nonterminal has no semantic value. The ASTs generated
                     by the RHS are discarded (as well).

       pegName nonterminal rule nt
              This  method  returns the parsing expression associated with the
              nonterminal nt.  The method fails and throws an error if  nt  is
              not known as a nonterminal.

       pegName unknown nonterminals
              This method returns a list containing the names of all nontermi-
              nal symbols which are referenced on the  RHS  of  a  grammatical
              rule,  but  have  no  rule  definining their structure. In other
              words, a list of the nonterminal symbols which make the  grammar
              invalid. The grammar is valid if this list is empty.



   PARSING EXPRESSIONS
       Various methods of PEG container objects expect a parsing expression as
       their argument, or will return such. This section specifies the  format
       such parsing expressions are in.


       [1]    The  string  epsilon is an atomic parsing expression. It matches
              the empty string.

       [2]    The string alnum is an atomic parsing expression. It matches any
              alphanumeric character.

       [3]    The string alpha is an atomic parsing expression. It matches any
              alphabetical character.

       [4]    The string dot is an atomic parsing expression. It  matches  any
              character.

       [5]    The  expression  [list  t x] is an atomic parsing expression. It
              matches the terminal string x.

       [6]    The expression [list n A] is an atomic  parsing  expression.  It
              matches the nonterminal A.

       [7]    For  parsing expressions e1, e2, ... the result of [list / e1 e2
              ... ] is a parsing expression as  well.   This  is  the  ordered
              choice, aka prioritized choice.

       [8]    For  parsing expressions e1, e2, ... the result of [list x e1 e2
              ... ] is a parsing expression as well.  This is the sequence.

       [9]    For a parsing expression e the result of [list * e] is a parsing
              expression as well.  This is the kleene closure, describing zero
              or more repetitions.

       [10]   For a parsing expression e the result of [list + e] is a parsing
              expression  as  well.   This  is  the  positive  kleene closure,
              describing one or more repetitions.

       [11]   For a parsing expression e the result of [list & e] is a parsing
              expression as well.  This is the and lookahead predicate.

       [12]   For a parsing expression e the result of [list ! e] is a parsing
              expression as well.  This is the not lookahead predicate.

       [13]   For a parsing expression e the result of [list ? e] is a parsing
              expression as well.  This is the optional input.


       Examples of parsing expressions where already shown, in the description
       of the method serialize.


PARSING EXPRESSION GRAMMARS

       For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where

       o      VN is a set of nonterminal symbols,

       o      VT is a set of terminal symbols,

       o      R is a finite set of rules, where each rule is a pair  (A,e),  A
              in VN, and e a parsing expression.

       o      eS is a parsing expression, the start expression.


       Further constraints are

       o      The intersection of VN and VT is empty.

       o      For  all  A  in  VT exists exactly one pair (A,e) in R. In other
              words, R is a  function  from  nonterminal  symbols  to  parsing
              expressions.


       Parsing expression are inductively defined via

       o      The empty string (epsilon) is a parsing expression.

       o      A terminal symbol a is a parsing expression.

       o      A nonterminal symbol A is a parsing expression.

       o      e1e2  is  a parsing expression for parsing expressions e1 and 2.
              This is called sequence.

       o      e1/e2 is a parsing expression for parsing expressions e1 and  2.
              This is called ordered choice.

       o      e*  is  a  parsing  expression for parsing expression e. This is
              called zero-or-more repetitions, also known as kleene closure.

       o      e+ is a parsing expression for parsing  expression  e.  This  is
              called  one-or-more  repetitions,  also known as positive kleene
              closure.

       o      !e is a parsing expression for parsing expression  e1.  This  is
              called a not lookahead predicate.

       o      &e  is  a  parsing expression for parsing expression e1. This is
              called an and lookahead predicate.



       PEGs are used to define a grammatical structure for streams of  symbols
       over  VT.  They  are  a modern phrasing of older formalisms invented by
       Alexander Birham. These formalisms  were  called  TS  (TMG  recognition
       scheme),  and  gTS  (generalized  TS).  Later they were renamed to TPDL
       (Top-Down Parsing Languages) and gTPDL (generalized TPDL).

       They can be easily implemented by recursive descent parsers with  back-
       tracking. This makes them relatives of LL(k) Context-Free Grammars.


REFERENCES

       [1]    The   Packrat  Parsing  and  Parsing  Expression  Grammars  Page
              [http://www.pdos.lcs.mit.edu/~baford/packrat/], by  Bryan  Ford,
              Massachusetts  Institute  of  Technology. This is the main entry
              page to PEGs, and their realization through Packrat Parsers.

       [2]    Parsing      Techniques      -      A      Practical       Guide
              [http://www.cs.vu.nl/~dick/PTAPG.html],  an online book offering
              a clear, accessible, and thorough discussion of  many  different
              parsing  techniques  with  their interrelations and applicabili-
              ties, including error recovery techniques.

       [3]    Compilers and Compiler  Generators  [http://scifac.ru.ac.za/com-
              pilers/], an online book using CoCo/R, a generator for recursive
              descent parsers.



BUGS, IDEAS, FEEDBACK

       This document, and the package it describes, will  undoubtedly  contain
       bugs  and  other  problems.   Please  report such in the category gram-
       mar_peg    of    the     Tcllib     SF     Trackers     [http://source-
       forge.net/tracker/?group_id=12883].   Please  also report any ideas for
       enhancements you may have for either package and/or documentation.


KEYWORDS

       LL(k), TDPL,  context-free  languages,  expression,  grammar,  parsing,
       parsing  expression,  parsing  expression grammar, push down automaton,
       recursive descent, state, top-down parsing languages, transducer


CATEGORY

       Grammars and finite automata


COPYRIGHT

       Copyright (c) 2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>




grammar_peg                           0.1                      grammar::peg(n)

Mac OS X 10.8 - Generated Mon Sep 10 18:02:40 CDT 2012
© manpagez.com 2000-2025
Individual documents may contain additional copyright information.