manpagez: man pages & more
info aspell
Home | html | info | man
[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

B.1.1 Notes on Latin Languages

Any word that can be written using one of the Latin ISO-8859 character sets (ISO-8859-1,2,3,4,9,10,13,14,15,16) can be written, in decomposed form, using the ASCII characters, the 23 additional letters:

 
U+00C6 LATIN CAPITAL LETTER AE
U+00D0 LATIN CAPITAL LETTER ETH
U+00D8 LATIN CAPITAL LETTER O WITH STROKE
U+00DE LATIN CAPITAL LETTER THORN
U+00DE LATIN SMALL LETTER THORN
U+00DF LATIN SMALL LETTER SHARP S
U+00E6 LATIN SMALL LETTER AE
U+00F0 LATIN SMALL LETTER ETH
U+00F8 LATIN SMALL LETTER O WITH STROKE
U+0110 LATIN CAPITAL LETTER D WITH STROKE
U+0111 LATIN SMALL LETTER D WITH STROKE
U+0126 LATIN CAPITAL LETTER H WITH STROKE
U+0127 LATIN SMALL LETTER H WITH STROKE
U+0131 LATIN SMALL LETTER DOTLESS I
U+0138 LATIN SMALL LETTER KRA
U+0141 LATIN CAPITAL LETTER L WITH STROKE
U+0142 LATIN SMALL LETTER L WITH STROKE
U+014A LATIN CAPITAL LETTER ENG
U+014B LATIN SMALL LETTER ENG
U+0152 LATIN CAPITAL LIGATURE OE
U+0153 LATIN SMALL LIGATURE OE
U+0166 LATIN CAPITAL LETTER T WITH STROKE
U+0167 LATIN SMALL LETTER T WITH STROKE

and the 14 modifiers:

 
U+0300 COMBINING GRAVE ACCENT
U+0301 COMBINING ACUTE ACCENT
U+0302 COMBINING CIRCUMFLEX ACCENT
U+0303 COMBINING TILDE
U+0304 COMBINING MACRON
U+0306 COMBINING BREVE
U+0307 COMBINING DOT ABOVE
U+0308 COMBINING DIAERESIS
U+030A COMBINING RING ABOVE
U+030B COMBINING DOUBLE ACUTE ACCENT
U+030C COMBINING CARON
U+0326 COMBINING COMMA BELOW
U+0327 COMBINING CEDILLA
U+0328 COMBINING OGONEK

Which is a total of 37 additional Unicode code points.

All ISO-8859 character leaves the characters 0x00 - 0x1F, and 0x80 - 0x9F unmapped as they are generally used as control characters. Of those, 0x01 - 0x0F, 0x11 - 0x1F and 0x80 - 0x9F may be mapped to anything in Aspell. This is a total of 62 characters which can be remapped in any ISO-8859 character set. Thus, by remapping 37 of the 62 characters to the previously specified Unicode code-points, any modified ISO-8859 character set can be used for any Latin languages covered by ISO-8859. Of course decomposing every single accented character wastes a lot of space, so only characters that cannot be represented in the precomposed form should be broken up. By using this trick it is possible to store foreign words in the correctly accented form in the dictionary even if the precomposed character is not in the current character set.

Any letter in the Unicode range U+0000 - U+0249, U+1E00 - U+1EFF (Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and Latin Extended Additional) can be represented using around 175 basic letters, and 25 modifiers which is less than 210 and can thus fit in an Aspell 8-bit character set. Since this Unicode range covers any possible Latin language this special character set can be used to represent any word written using the Latin script if so desired.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]
© manpagez.com 2000-2024
Individual documents may contain additional copyright information.