manpagez: man pages & more
man unicharambigs(5)
Home | html | info | man
unicharambigs(5)                                              unicharambigs(5)


NAME

       unicharambigs - Tesseract unicharset ambiguities


DESCRIPTION

       The unicharambigs file (a component of traineddata, see
       combine_tessdata(1) ) is used by Tesseract to represent possible
       ambiguities between characters, or groups of characters.

       The file contains a number of lines, laid out as follow:

           [num] <TAB> [char(s)] <TAB> [num] <TAB> [char(s)] <TAB> [num]


       Field one     the number of characters
                     contained in field two
       Field two     the character sequence to
                     be replaced
       Field three   the number of characters
                     contained in field four
       Field four    the character sequence
                     used to replace field two
       Field five    contains either 1 or 0. 1
                     denotes a mandatory
                     replacement, 0 denotes an
                     optional replacement.


       Characters appearing in fields two and four should appear in
       unicharset. The numbers in fields one and three refer to the number of
       unichars (not bytes).


EXAMPLE

           v1
           2       ' '     1       "     1
           1       m       2       r n   0
           3       i i i   1       m     0

       The first line is a version identifier. In this example, all instances
       of the 2 character sequence '' will always be replaced by the 1
       character sequence "; a 1 character sequence m may be replaced by the 2
       character sequence rn, and the 3 character sequence may be replaced by
       the 1 character sequence m.

       Version 3.03 and on supports a new, simpler format for the
       unicharambigs file:

           v2
           '' " 1
           m rn 0
           iii m 0

       In this format, the "error" and "correction" are simple UTF-8 strings
       separated by a space, and, after another space, the same type specifier
       as v1 (0 for optional and 1 for mandatory substitution). Note the
       downside of this simpler format is that Tesseract has to encode the
       UTF-8 strings into the components of the unicharset. In complex
       scripts, this encoding may be ambiguous. In this case, the encoding is
       chosen such as to use the least UTF-8 characters for each component, ie
       the shortest unicharset components will make up the encoding.


HISTORY

       The unicharambigs file first appeared in Tesseract 3.00; prior to that,
       a similar format, called DangAmbigs (dangerous ambiguities) was used:
       the format was almost identical, except only mandatory replacements
       could be specified, and field 5 was absent.


BUGS

       This is a documentation "bug": it's not currently clear what should be
       done in the case of ligatures (such as fi) which may also appear as
       regular letters in the unicharset.


SEE ALSO

       tesseract(1), unicharset(5)
       https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05.html#the-unicharambigs-file


AUTHOR

       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-2018).

                                  08/31/2024                  unicharambigs(5)

tesseract 5.4.1 - Generated Thu Oct 3 16:47:56 CDT 2024
© manpagez.com 2000-2025
Individual documents may contain additional copyright information.