manpagez: man pages & more
man htmlparse(n)
Home | html | info | man
htmlparse(n)                      HTML Parser                     htmlparse(n)



______________________________________________________________________________


NAME

       htmlparse - Procedures to parse HTML strings


SYNOPSIS

       package require Tcl  8.2

       package require struct::stack  1.3

       package require cmdline  1.1

       package require htmlparse  ?1.2?

       ::htmlparse::parse  ?-cmd  cmd?  ?-vroot  tag? ?-split n? ?-incvar var?
       ?-queue q? html

       ::htmlparse::debugCallback ?clientdata?  tag  slash  param  textBehind-
       TheTag

       ::htmlparse::mapEscapes html

       ::htmlparse::2tree html tree

       ::htmlparse::removeVisualFluff tree

       ::htmlparse::removeFormDefs tree

_________________________________________________________________


DESCRIPTION

       The htmlparse package provides commands that allow libraries and appli-
       cations to parse HTML in  a  string  into  a  representation  of  their
       choice.

       The following commands are available:

       ::htmlparse::parse  ?-cmd  cmd?  ?-vroot  tag? ?-split n? ?-incvar var?
       ?-queue q? html
              This  command  is  the  basic  parser for HTML. It takes an HTML
              string, parses it and invokes a command  prefix  for  every  tag
              encountered.  It  is  not necessary for the HTML to be valid for
              this parser to function. It is the responsibility of the command
              invoked  for  every tag to check this. Another responsibility of
              the invoked command is the handling of tag attributes and  char-
              acter entities (escaped characters). The parser provides the un-
              interpreted tag attributes to the invoked command to aid in  the
              former,  and  the  package  at  large provides a helper command,
              ::htmlparse::mapEscapes, to aid in the handling of  the  latter.
              The  parser  does  ignore  leading  DOCTYPE declarations and all
              valid HTML comments it encounters.

              All information beyond the HTML string itself is  specified  via
              options, these are explained below.

              To help understand the options, some more background information
              about the parser.

              It is capable of detecting incomplete tags in  the  HTML  string
              given  to  it.  Under  normal  circumstances this will cause the
              parser to throw an error, but if the option -incvar is  used  to
              specify  a global (or namespace) variable, the parser will store
              the incomplete part of the input  into  this  variable  instead.
              This  will aid greatly in the handling of incrementally arriving
              HTML, as the parser will handle whatever it can  and  defer  the
              handling of the incomplete part until more data has arrived.

              Another  feature  of  the  parser  are its two possible modes of
              operation. The normal mode is activated if the option -queue  is
              not  present  on  the command line invoking the parser. If it is
              present, the parser will go into the incremental mode instead.

              The main difference is that a parser in normal mode will immedi-
              ately  invoke  the command prefix for each tag it encounters. In
              incremental mode however the parser will generate  a  number  of
              scripts  which  invoke  the command prefix for groups of tags in
              the HTML string and then store these scripts  in  the  specified
              queue. It is then the responsibility of the caller of the parser
              to ensure the execution of the scripts in the queue.

              Note: The queue object given to the parser has  to  provide  the
              same  interface  as  the queue defined in tcllib -> struct. This
              means, for example, that all queues created via that tcllib mod-
              ule  can be immediately used here. Still, the queue doesn't have
              to come from tcllib -> struct as long as the same  interface  is
              provided.

              In  both  modes  the  parser  will return an empty string to the
              caller.

              The -split option may be given to a parser in  incremental  mode
              to  specify  the  size of the groups it creates. In other words,
              -split 5 means that each of the generated  scripts  will  invoke
              the  command prefix for 5 consecutive tags in the HTML string. A
              parser in normal mode will ignore this option and its value.

              The option -vroot specifies a virtual root tag. A parser in nor-
              mal  mode  will  invoke  the  command  prefix for it immediately
              before and after it processes the tags in the HTML,  thus  simu-
              lating  that  the  HTML string is enclosed in a <vroot> </vroot>
              combination. In incremental mode however the parser is unable to
              provide  the  closing  virtual  root  as it never knows when the
              input is complete. In this case the first  script  generated  by
              each  invocation of the parser will contain an invocation of the
              command prefix for the virtual root as its first  command.   The
              following options are available:

              -cmd cmd
                     The  command  prefix  to invoke for every tag in the HTML
                     string. Defaults to ::htmlparse::debugCallback.

              -vroot tag
                     The virtual root tag to add around  the  HTML  in  normal
                     mode.  In  incremental  mode  it is the first tag in each
                     chunk processed by the parser, but there will be no clos-
                     ing tags. Defaults to hmstart.

              -split n
                     The  size  of  the groups produced by an incremental mode
                     parser. Ignored when in normal mode. Defaults to 10. Val-
                     ues <= 0 are not allowed.

              -incvar var
                     The  name  of  the variable where to store any incomplete
                     HTML into. This makes  most  sense  for  the  incremental
                     mode.  The  parser  will throw an error if it sees incom-
                     plete HTML and has no place to store it  to.  This  makes
                     sense  for  the  normal  mode.  Only  incomplete tags are
                     detected, not missing tags.  Optional,  defaults  to  'no
                     variable'.


              Interface to the command prefix
                     In  normal mode the parser will invoke the command prefix
                     with four arguments appended. See ::htmlparse::debugCall-
                     back for a description.

                     In  incremental mode, however, the generated scripts will
                     invoke the command prefix with five  arguments  appended.
                     The  last four of these are the same which were mentioned
                     above. The first is a placeholder string  (@win@)  for  a
                     clientdata  value  to be supplied later during the actual
                     execution of the generated scripts. This could  be  a  tk
                     window  path,  for  example. This allows the user of this
                     package to preprocess  HTML  strings  without  committing
                     them  to a specific window, object, whatever during pars-
                     ing. This connection can be made later. This  also  means
                     that  it  is  possible  to  cache  preprocessed  HTML. Of
                     course, nothing prevents the  user  of  the  parser  from
                     replacing the placeholder with an empty string.

       ::htmlparse::debugCallback  ?clientdata?  tag  slash  param textBehind-
       TheTag
              This  command  is  the  standard  callback used by the parser in
              ::htmlparse::parse if none was specified by the user. It  simply
              dumps  its  arguments  to stdout.  This callback can be used for
              both normal and incremental mode of the calling parser. In other
              words,  it  accepts  four or five arguments. The last four argu-
              ments are described below. The optional fifth argument  contains
              the  clientdata  value  passed  to  the  callback by a parser in
              incremental mode. All callbacks have to follow the signature  of
              this  command  in the last four arguments, and callbacks used in
              incremental parsing have to follow this signature  in  the  last
              five arguments.

              The  first argument, clientdata, is optional and present only if
              this command is invoked by a parser in incremental mode. It con-
              tains whatever the user of this package wishes.

              The  second argument, tag, contains the name of the tag which is
              currently processed by the parser.

              The third argument, slash, is either empty or contains  a  slash
              character. It allows the callback to distinguish between opening
              (slash is empty) and closing tags (slash contains a slash  char-
              acter).

              The  fourth argument, param, contains the un-interpreted list of
              parameters to the tag.

              The fifth and last argument, textBehindTheTag, contains the text
              found by the parser behind the tag named in tag.

       ::htmlparse::mapEscapes html
              This  command  takes  a  HTML  string,  substitutes  all  escape
              sequences with their actual  characters  and  then  returns  the
              resulting  string.   HTML  strings  which  do not contain escape
              sequences are returned unchanged.

       ::htmlparse::2tree html tree
              This command is a wrapper around ::htmlparse::parse which  takes
              an  HTML string (in html) and converts it into a tree containing
              the logical structure of the parsed document. The  name  of  the
              tree  is given to the command as its second argument (tree). The
              command does not generate the tree by itself  but  expects  that
              the  caller provided it with an existing and empty tree. It also
              expects that the specified tree object follows the  same  inter-
              face  as the tree object in tcllib -> struct. It doesn't have to
              be from tcllib -> struct, but it must provide  the  same  inter-
              face.

              The  internal callback does some basic checking of HTML validity
              and tries to recover from the most  basic  errors.  The  command
              returns  the  contents  of its second argument. Side effects are
              the creation and manipulation of a tree object.

              Each node in the generated tree represent one tag in the  input.
              The name of the tag is stored in the attribute type of the node.
              Any html attributes coming with the tag are stored unmodified in
              the  attribute data of the tag. In other words, the command does
              not parse html attributes into their names and values.

              If a tag contains text its  node  will  have  children  of  type
              PCDATA  containing  this  text.  The  text will be stored in the
              attribute data of these children.

       ::htmlparse::removeVisualFluff tree
              This command walks a tree as generated by ::htmlparse::2tree and
              removes all the nodes which represent visual tags and not struc-
              tural ones. The purpose of the command is to make the tree  eas-
              ier  to  navigate without getting bogged down in visual informa-
              tion not relevant to the search. Its only argument is  the  name
              of the tree to cut down.

       ::htmlparse::removeFormDefs tree
              Like  ::htmlparse::removeVisualFluff this command is here to cut
              down on the size of the tree as generated by ::htmlparse::2tree.
              It  removes  all nodes representing forms and form elements. Its
              only argument is the name of the tree to cut down.



BUGS, IDEAS, FEEDBACK

       This document, and the package it describes, will  undoubtedly  contain
       bugs  and other problems.  Please report such in the category htmlparse
       of       the       Tcllib       SF       Trackers       [http://source-
       forge.net/tracker/?group_id=12883].   Please  also report any ideas for
       enhancements you may have for either package and/or documentation.


SEE ALSO

       struct::tree(n)


KEYWORDS

       html, parsing, queue, tree


CATEGORY

       Text processing



htmlparse                             1.2                         htmlparse(n)

Mac OS X 10.8 - Generated Fri Sep 7 16:12:40 CDT 2012
© manpagez.com 2000-2025
Individual documents may contain additional copyright information.