info libidn

10.6 Troubleshooting

Getting character data encoded right, and making sure Libidn use the same encoding, can be difficult. The reason for this is that most systems encode character data in more than one character encoding, i.e., using UTF-8 together with ISO-8859-1 or ISO-2022-JP. This problem is likely to continue to exist until only one character encoding come out as the evolutionary winner, or (more likely, at least to some extents) forever.

The first step to troubleshooting character encoding problems with Libidn is to use the ‘--debug’ parameter to find out which character set encoding ‘idn’ believe your locale uses.

jas@latte:~$ idn --debug --quiet ""
system locale uses charset `UTF-8'.

jas@latte:~$

If it prints ANSI_X3.4-1968 (i.e., US-ASCII), this indicate you have not configured your locale properly. To configure the locale, you can, for example, use ‘LANG=sv_SE.UTF-8; export LANG’ at a /bin/sh prompt, to set up your locale for a Swedish environment using UTF-8 as the encoding.

Sometimes ‘idn’ appear to be unable to translate from your system locale into UTF-8 (which is used internally), and you get an error like the following:

jas@latte:~$ idn --quiet foo
idn: could not convert from ISO-8859-1 to UTF-8.
jas@latte:~$

The simplest explanation is that you haven’t installed the ‘iconv’ conversion tools. You can find it as a standalone library in GNU Libiconv (http://www.gnu.org/software/libiconv/). On many GNU/Linux systems, this library is part of the system, but you may have to install additional packages (e.g., ‘glibc-locale’ for Debian) to be able to use it.

Another explanation is that the error is correct and you are feeding ‘idn’ invalid data. This can happen inadvertently if you are not careful with the character set encoding you use. For example, if your shell run in a ISO-8859-1 environment, and you invoke ‘idn’ with the ‘CHARSET’ environment variable as follows, you will feed it ISO-8859-1 characters but force it to believe they are UTF-8. Naturally this will lead to an error, unless the byte sequences happen to be valid UTF-8. Note that even if you don’t get an error, the output may be incorrect in this situation, because ISO-8859-1 and UTF-8 does not in general encode the same characters as the same byte sequences.

jas@latte:~$ idn --quiet --debug ""
system locale uses charset `ISO-8859-1'.
 
jas@latte:~$ CHARSET=UTF-8 idn --quiet --debug räksmörgås
system locale uses charset `UTF-8'.
input[0] = U+0072
input[1] = U+4af3
input[2] = U+006d
input[3] = U+1b29e5
input[4] = U+0073
output[0] = U+0078
output[1] = U+006e
output[2] = U+002d
output[3] = U+002d
output[4] = U+0072
output[5] = U+006d
output[6] = U+0073
output[7] = U+002d
output[8] = U+0068
output[9] = U+0069
output[10] = U+0036
output[11] = U+0064
output[12] = U+0035
output[13] = U+0039
output[14] = U+0037
output[15] = U+0035
output[16] = U+0035
output[17] = U+0032
output[18] = U+0061
xn--rms-hi6d597552a
jas@latte:~$

The sense moral here is to forget about ‘CHARSET’ (configure your locales properly instead) unless you know what you are doing, and if you want to use it, do it carefully, after verifying with ‘--debug’ that you get the desired results.

This document was generated on February 1, 2012 using texi2html 5.0.