info aspell

7.3.1 Syntax of the transformation array

In the translation array there are two strings on each line; the first one is the search string (or switch name) and the second one is the replacement string (or switch parameter). The line

version   version

is also required to appear somewhere in the translation array. The version string can be anything, but it should be changed whenever a new version of the translation array is released. This is important because it keeps Aspell from using a compiled dictionary with the wrong set of rules. For example, when coming up with suggestions for ‘hallo’, Aspell will use the new rules to compute the soundslike, say H*L*; but if ‘hello’ is stored in the dictionary using the old rules as HL instead of H*L*, Aspell will never be able to come up with ‘hello’. To solve this problem, Aspell checks whether the version strings match and aborts with an error if they don’t. This is only a problem with the main word list, as the personal word lists are now stored as simple word lists with a single header line (i.e. no soundslike data).
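The two-strings-per-line layout and the version check can be pictured with a short Python sketch. The function names `parse_table` and `check_version`, the version value `1.1`, and the way the dictionary's version is passed in are assumptions made for illustration; they are not Aspell's internal API.

```python
# Switch names are the ones mentioned in this section; the parser is hypothetical.
SWITCHES = {"version", "followup", "collapse_result", "remove_accents"}


def parse_table(text):
    """Split a table into switches and ordered rules (two strings per line)."""
    switches, rules = {}, []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # this sketch simply ignores blank or malformed lines
        first, second = fields
        if first in SWITCHES:
            switches[first] = second
        else:
            rules.append((first, second))
    if "version" not in switches:
        raise ValueError("table has no 'version' line")
    return switches, rules


def check_version(table_version, dictionary_version):
    """Abort if the compiled dictionary was built with a different rule set."""
    if table_version != dictionary_version:
        raise RuntimeError(
            f"version mismatch: table {table_version!r}, "
            f"dictionary {dictionary_version!r}"
        )


TABLE = """\
version   1.1
GH   _
G    K
"""

switches, rules = parse_table(TABLE)
check_version(switches["version"], "1.1")  # matching versions: no error
```

Keeping the rules in a list preserves their order in the file, which matters because rule order expresses priority.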

Each non-switch line represents one replacement (transformation) rule. Rules whose search strings begin with the same letter must be grouped together; the order within a group need not be alphabetical, but it determines priority: the higher a rule stands in the group, the higher its priority, because the first rule that matches is applied. In the following example:

GH   _
G    K

‘GH -> _’ has higher priority than ‘G -> K’.

‘_’ represents the empty string “”. If ‘GH -> _’ came after ‘G -> K’, the ‘GH’ rule would never match, because the algorithm stops searching for more rules after the first match. The above rules transform any ‘GH’ to an empty string (that is, delete it) and transform any other ‘G’ to ‘K’.
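The effect of rule order can be tried out with a minimal Python sketch of first-match-wins replacement. It handles only plain search strings (no control characters), and it copies letters that have no rule unchanged; both simplifications are assumptions of this sketch rather than a description of Aspell's implementation.

```python
def transform(word, rules):
    """Apply the first matching rule at each position, scanning left to right."""
    out = []
    i = 0
    while i < len(word):
        for search, replacement in rules:
            if word.startswith(search, i):
                out.append(replacement)
                i += len(search)
                break  # first match wins; later rules are not considered
        else:
            out.append(word[i])  # assumption: letters without a rule are kept
            i += 1
    return "".join(out)


# '_' in the table stands for the empty string, stored here as "".
GOOD_ORDER = [("GH", ""), ("G", "K")]
BAD_ORDER = [("G", "K"), ("GH", "")]

ghost_good = transform("GHOST", GOOD_ORDER)  # 'GH' is deleted
ghost_bad = transform("GHOST", BAD_ORDER)    # 'G -> K' shadows 'GH -> _'
game = transform("GAME", GOOD_ORDER)
```

With the reversed order the ‘GH’ rule is never reached, so the ‘H’ survives in the result.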

At the end of the search string (the first string of a line) there may optionally be a set of characters in parentheses. Exactly one of these characters must match. This is comparable to the ‘[ ]’ brackets in regular expressions. The rule ‘DG(EIY) -> J’, for example, would match any ‘DGE’, ‘DGI’ and ‘DGY’ and replace them with ‘J’. This way you can reduce several rules to one.
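As a rough illustration, matching a search string with a trailing character set could look like the following Python sketch. The helper name `match_at` is hypothetical, and the sketch ignores all other control characters.

```python
def match_at(word, pos, search):
    """Return how many characters of `word` the search string matches at `pos`.

    Returns 0 when there is no match. A trailing '(...)' set must be
    satisfied by exactly one following character.
    """
    if search.endswith(")") and "(" in search:
        literal, _, alternatives = search[:-1].partition("(")
    else:
        literal, alternatives = search, ""
    if not word.startswith(literal, pos):
        return 0
    length = len(literal)
    if alternatives:
        following = word[pos + length:pos + length + 1]
        # An empty `following` means the word ended, so nothing can match.
        if following == "" or following not in alternatives:
            return 0
        length += 1
    return length


edge = match_at("EDGE", 1, "DG(EIY)")  # 'DGE' matches, three characters
dga = match_at("DGA", 0, "DG(EIY)")    # 'A' is not in the set
dg = match_at("DG", 0, "DG(EIY)")      # nothing follows 'DG'
```

The single rule ‘DG(EIY)’ therefore behaves like the three separate rules ‘DGE’, ‘DGI’ and ‘DGY’.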

After the search string, one or more dashes ‘-’ may be placed. Such a search string must match in full, but only the beginning of the matched string is replaced. Furthermore, no follow-up rule is searched for these rules (follow-up rules are explained later). The rule ‘TCH-- -> _’ will match any word containing ‘TCH’ (like ‘match’) but will only replace the first character ‘T’ with an empty string. The number of dashes determines how many characters at the end of the match are left unreplaced. After the replacement, the search for transformation rules continues with the unreplaced ‘CH’.
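The dash notation, as written in the ‘TCH--’ example, can be sketched in Python as follows. The helper names `split_dashes` and `apply_once` are hypothetical and cover only this one control character.

```python
def split_dashes(search):
    """Split 'TCH--' into the literal 'TCH' and the number of dashes (2)."""
    literal = search.rstrip("-")
    return literal, len(search) - len(literal)


def apply_once(word, pos, search, replacement):
    """Try one rule at `pos`.

    Returns (replacement, next_pos) on a match and None otherwise. The
    whole literal must match, but the last `keep` characters are not
    consumed, so the search continues with them.
    """
    literal, keep = split_dashes(search)
    if not word.startswith(literal, pos):
        return None
    consumed = len(literal) - keep
    return replacement, pos + consumed


result = apply_once("MATCH", 2, "TCH--", "")
remaining = "MATCH"[result[1]:]  # what the next rule search will see
```

Here only the ‘T’ is consumed, and the next rule search starts at the untouched ‘CH’.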

If a ‘<’ is appended to the search string, the search for replacement rules continues with the replacement string and not with the next character of the word. The rule ‘PH< -> F’, for example, would replace ‘PH’ with ‘F’ and then start again to search for a replacement rule for ‘F…’. If there were also rules like ‘FO -> O’ and ‘F -> _’, then words like ‘PHOXYZ’ would be transformed to ‘OXYZ’, and any occurrence of ‘PH’ that is not followed by an ‘O’ would be deleted, as in ‘PHIXYZ -> IXYZ’. The second replacement, however, is not applied if the priority of its rule is lower than the priority of the first rule.
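A simplified Python sketch of the ‘<’ behaviour is shown below. It puts the replacement back in front of the remaining input so that rules are searched again from there. It deliberately ignores priorities, and it copies letters without a rule unchanged; both are assumptions of this sketch.

```python
# (search text, replacement, search again from the replacement?)
RULES = [
    ("PH", "F", True),   # PH< -> F
    ("FO", "O", False),  # FO  -> O
    ("F", "", False),    # F   -> _
]


def transform(word):
    out = []
    rest = word
    while rest:
        for search, replacement, again in RULES:
            if rest.startswith(search):
                if again:
                    # Continue the rule search with the replacement string.
                    rest = replacement + rest[len(search):]
                else:
                    out.append(replacement)
                    rest = rest[len(search):]
                break
        else:
            out.append(rest[0])  # assumption: letters without a rule are kept
            rest = rest[1:]
    return "".join(out)


phoxyz = transform("PHOXYZ")
phixyz = transform("PHIXYZ")
```

Note that a ‘<’ rule whose replacement matches its own search string would loop forever in this sketch; a real implementation needs a guard against that.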

Priorities are added to a rule by putting a number between 0 and 9 at the end of the search string, for example ‘ING6 -> N’. The higher the number, the higher the priority.

Priorities are especially important for the previously mentioned follow-up rules. Follow-up rules are searched beginning from the last character of the first search string. This is a bit complicated, but I hope this example will make it clearer:

CHS      X
CH       G

HAU--1   H

SCH      SH

In this example ‘CHS’ in the word ‘FUCHS’ would be transformed to ‘X’. If we take the word ‘DURCHSCHNITT’, things look a bit different. Here ‘CH’ belongs together and ‘SCH’ belongs together, and both are spoken separately. The algorithm, however, first finds the string ‘CHS’, which must not be transformed as it was in the previous word ‘FUCHS’. At this point the algorithm can find a follow-up rule. It takes the last character of the first matching rule (‘CHS’), which is ‘S’, and looks for the next match beginning from this character. It finds ‘SCH -> SH’, which has the same priority (no priority means the standard priority, which is 5). If the priority is the same or higher, the follow-up rule is applied.

Now take the word ‘SCHAUKEL’. In this word ‘SCH’ belongs together and must not be taken apart. After the algorithm has found ‘SCH -> SH’, it searches for a follow-up rule for ‘H’ + ‘AUKEL’. It finds ‘HAU--1 -> H’ but does not apply it, because its priority is lower than that of the first rule.

This is a very powerful feature, but it can also easily lead to mistakes. If you really don’t need this feature, you can turn it off by putting the line:

followup      0

at the beginning of the phonetic table file. As mentioned, no follow-up rules are searched for rules containing a ‘-’, but giving such rules a priority is not pointless, because they can themselves be follow-up rules, and in that case the priority matters. Follow-up rules of follow-up rules are not searched, because this is in fact not needed very often.
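The follow-up mechanism can be approximated with the Python sketch below, using the four rules from the example. It rests on one reading of the text: when a follow-up rule of the same or higher priority matches at the last character of the first match, the first rule is rejected and the next rule at the current position is tried. That reading, the copying of letters without a rule, and the `followup` parameter standing in for the switch are all assumptions of this sketch, so its outputs are not claimed to be Aspell's actual soundslike values.

```python
# (search text, characters kept back by dashes, priority, replacement)
RULES = [
    ("CHS", 0, 5, "X"),
    ("CH", 0, 5, "G"),
    ("HAU", 2, 1, "H"),  # HAU--1 -> H
    ("SCH", 0, 5, "SH"),
]


def has_followup(word, last, priority):
    """True if a rule matching at index `last` has priority >= `priority`."""
    return any(
        word.startswith(text, last) and prio >= priority
        for text, _keep, prio, _repl in RULES
    )


def transform(word, followup=True):
    out, i = [], 0
    while i < len(word):
        for text, keep, prio, repl in RULES:
            if not word.startswith(text, i):
                continue
            # Rules with dashes never look for a follow-up rule.
            if (followup and keep == 0 and len(text) > 1
                    and has_followup(word, i + len(text) - 1, prio)):
                continue  # rejected: try the next rule at this position
            out.append(repl)
            i += len(text) - keep
            break
        else:
            out.append(word[i])  # assumption: letters without a rule are kept
            i += 1
    return "".join(out)


fuchs = transform("FUCHS")
durchschnitt = transform("DURCHSCHNITT")
schaukel = transform("SCHAUKEL")
no_followup = transform("DURCHSCHNITT", followup=False)
```

In ‘DURCHSCHNITT’ the sketch rejects ‘CHS’ because ‘SCH’ matches at the ‘S’ with equal priority, so ‘CH’ and ‘SCH’ are replaced separately. With `followup=False` the ‘CHS’ rule is applied instead.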

The control character ‘^’ says that the search string only matches at the beginning of words, so that the rule ‘RH^ -> R’ will only apply to words like ‘RHESUS’ but not ‘PERHAPS’. You can append another ‘^’ to the search string. In that case the algorithm treats the rest of the word completely separately from the first matched string at the beginning. This is useful for prefixes whose pronunciation does not depend on the rest of the word and vice versa, like ‘OVER^^’ in English, for example.

‘$’ works the same way as ‘^’, except that it only applies to words that end with the search string. ‘GN$ -> N’ only matches words like ‘SIGN’ but not ‘SIGNUM’. If you use ‘^’ and ‘$’ together, both of them must fit: ‘ENOUGH^$ -> NF’ will only match the word ‘ENOUGH’ and nothing else.
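The two anchors amount to simple position checks, as in this Python sketch. The function `anchored_match` and its parameters are hypothetical, and the special treatment of a doubled ‘^^’ is not modelled.

```python
def anchored_match(word, pos, text, at_start=False, at_end=False):
    """Check whether `text` matches `word` at `pos` under '^' and '$'."""
    if not word.startswith(text, pos):
        return False
    if at_start and pos != 0:
        return False  # '^': only at the beginning of the word
    if at_end and pos + len(text) != len(word):
        return False  # '$': only at the end of the word
    return True


rhesus = anchored_match("RHESUS", 0, "RH", at_start=True)
perhaps = anchored_match("PERHAPS", 2, "RH", at_start=True)
sign = anchored_match("SIGN", 2, "GN", at_end=True)
signum = anchored_match("SIGNUM", 2, "GN", at_end=True)
enough = anchored_match("ENOUGH", 0, "ENOUGH", at_start=True, at_end=True)
```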

Of course you can combine all of the mentioned control characters but they must occur in this order: ‘< - priority ^ $’. All characters must be written in CAPITAL letters.
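The required order of control characters lends itself to a regular expression. The following Python sketch parses a search string into its parts. The placement of the parenthesized character set directly after the literal text is an assumption, the default priority of 5 is taken from the follow-up discussion, and the names `Search` and `parse_search` are hypothetical.

```python
import re
from dataclasses import dataclass

SEARCH = re.compile(
    r"(?P<text>[^()<\-0-9^$]+)"    # literal part of the search string
    r"(?:\((?P<chars>[^()]+)\))?"  # optional character set, e.g. (EIY)
    r"(?P<again><)?"               # '<': search again from the replacement
    r"(?P<dashes>-*)"              # dashes: characters left unreplaced
    r"(?P<priority>[0-9])?"        # priority 0..9
    r"(?P<start>\^{0,2})"          # '^' or '^^'
    r"(?P<end>\$)?"                # '$'
)


@dataclass
class Search:
    text: str
    chars: str
    again: bool
    keep: int
    priority: int
    start: int
    end: bool


def parse_search(s):
    m = SEARCH.fullmatch(s)
    if m is None:
        raise ValueError(f"malformed search string: {s!r}")
    return Search(
        text=m["text"],
        chars=m["chars"] or "",
        again=m["again"] is not None,
        keep=len(m["dashes"]),
        priority=int(m["priority"]) if m["priority"] else 5,
        start=len(m["start"]),
        end=m["end"] is not None,
    )


hau = parse_search("HAU--1")
enough = parse_search("ENOUGH^$")
```

Because the pattern fixes the order of the parts, a search string such as ‘GN$^’, with the anchors swapped, is rejected.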

If absolutely no rule can be found (which might happen if you use strange characters for which you don’t have any replacement rule), the next character will simply be skipped and the search for replacement rules will continue with the rest of the word.

If you want double letters to be reduced to one you must set up a rule like ‘LL- -> L’. If double letters in the resulting phonetic word should be allowed, you must place the line:

collapse_result     0

at the beginning of your transformation table file; otherwise set the value to ‘1’. The English rules for example strip all vowels from words and so the word "GOGO" would be transformed to "K" and not to "KK" (as desired) if collapse_result is set to 1. That’s why the English rules have collapse_result set to 0.
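The ‘GOGO’ case can be reproduced with a toy Python sketch. The two toy rules (drop vowels, turn ‘G’ into ‘K’) only imitate the behaviour described above and are not the actual English table. Collapsing is done here as a pass over the finished result, which is an assumption about where it happens.

```python
from itertools import groupby

VOWELS = set("AEIOU")


def toy_soundslike(word, collapse_result):
    """Toy rules: delete vowels and map G to K, then optionally collapse."""
    raw = "".join("K" if ch == "G" else ch for ch in word if ch not in VOWELS)
    if collapse_result:
        # Reduce each run of identical letters to a single letter.
        raw = "".join(ch for ch, _run in groupby(raw))
    return raw


collapsed = toy_soundslike("GOGO", collapse_result=True)
kept = toy_soundslike("GOGO", collapse_result=False)
```

The two ‘K’s come from different syllables of the word, which is why collapsing them loses information in this case.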

By default, all accents are removed from a word before it is matched to the soundslike rules. If you do not want this then add the line

remove_accents      0

at the beginning of your file. The exact definition of an accent is language dependent and is controlled via the character set file. If you set remove_accents to ‘0’, then you should also set "store-as" to "lower" in the language data file (not the phonetic transformation file); otherwise Aspell will have problems when both the accented and the de-accented version of a word appear in the dictionary: it will consider one of them as incorrectly spelled.
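To give an idea of what accent removal does to a word before rule matching, here is a small Python sketch. It uses Unicode decomposition as a stand-in for the character set file, which is an assumption: Aspell's own definition of an accent is language dependent and may differ. The function names are hypothetical.

```python
import unicodedata


def strip_accents(word):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prepare(word, remove_accents=True):
    """Return the form of the word that would be matched against the rules."""
    return strip_accents(word) if remove_accents else word


default_form = prepare("CAFÉ")
accented_form = prepare("CAFÉ", remove_accents=False)
```

With accent removal on, ‘CAFÉ’ and ‘CAFE’ reach the rules as the same string. With it off, the table needs rules for the accented letters themselves.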
