
6.17.8 Character Encoding of Source Files

Scheme source code files are usually encoded in ASCII, but the built-in reader can interpret other character encodings. The procedure primitive-load, and by extension the functions that call it, such as load, first scans the top 500 characters of the file for a coding declaration.

A coding declaration has the form coding: XXXXXX, where XXXXXX is the name of the character encoding in which the source code file has been encoded. The coding declaration must appear in a Scheme comment. It can be either a semicolon-initiated comment or a block #! comment.
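As an illustration, a source file carrying a declaration in a semicolon-initiated comment might look like the sketch below. The -*- markers follow the Emacs file-variable convention and are an assumption of this example; the variable name greeting is purely illustrative. The same coding: utf-8 text could instead be placed inside a block #! comment.

```scheme
;;; -*- coding: utf-8 -*-
;; The comment above carries the coding declaration. It sits at the
;; very top of the file, well inside the region that is scanned.

;; A string literal containing non-ASCII characters, stored as UTF-8.
(define greeting "Grüße")

(display greeting)
(newline)
```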

The name of the character encoding in the coding declaration is typically lower case and contains only letters, numbers, and hyphens, as recognized by set-port-encoding! (see section set-port-encoding!). Common examples of character encoding names are utf-8 and iso-8859-1, as defined by IANA. Thus, the coding declaration is mostly compatible with Emacs.

However, there are some differences between the encoding names recognized by Emacs and those defined by IANA, the latter being essentially a subset of the former. For instance, latin-1 is a valid encoding name for Emacs, but not according to the IANA standard, which Guile follows. Instead, you should use iso-8859-1, which is both understood by Emacs and registered by IANA. (IANA writes it in uppercase, Emacs wants it in lowercase, and Guile is case insensitive.)

For source code, only a subset of all possible character encodings can be interpreted by the built-in source code reader. Only those character encodings in which ASCII text appears unmodified can be used. This includes UTF-8 and ISO-8859-1 through ISO-8859-15. The multi-byte character encodings UTF-16 and UTF-32 may not be used because they are not compatible with ASCII.

There might be a scenario in which one would want to read non-ASCII code from a port, such as with the function read, instead of with load. If the port’s character encoding is the same as the encoding of the code to be read by the port, no other special handling is necessary. The port will automatically do the character encoding conversion. The functions setlocale and set-port-encoding! are used to set port encodings (see section Ports).
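A minimal sketch of this case, assuming the encoding of the file is already known to the caller. The helper name read-first-datum is hypothetical and not part of Guile; only open-input-file, set-port-encoding!, read, and close-port are existing procedures.

```scheme
;; Hypothetical helper: read the first datum from FILENAME, assuming
;; the caller already knows the file's character encoding.
(define (read-first-datum filename encoding)
  (let ((port (open-input-file filename)))
    ;; Set the encoding before any characters are read, so that the
    ;; port converts the file's bytes correctly.
    (set-port-encoding! port encoding)
    (let ((datum (read port)))
      (close-port port)
      datum)))
```

It could be called as (read-first-datum "example.scm" "UTF-8"), where example.scm stands for any UTF-8 encoded file.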

If a port is used to read code of unknown character encoding, this can be accomplished in three steps. First, the character encoding of the port should be set to ISO-8859-1 using set-port-encoding!. Then, the procedure file-encoding, described below, is used to scan for a coding declaration when reading from the port. As a side effect, it rewinds the port after its scan is complete. After that, the port’s character encoding should be set to the encoding returned by file-encoding, if any, again by using set-port-encoding!. Then the code can be read as normal.
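The three steps above might be sketched as follows. The helper name read-forms-with-declared-encoding is hypothetical, and the sketch assumes a regular file, so that the port can be rewound after the scan.

```scheme
;; Hypothetical helper: return the list of all forms in FILENAME,
;; honoring a coding declaration if the file contains one.
(define (read-forms-with-declared-encoding filename)
  (let ((port (open-input-file filename)))
    ;; Step 1: start from ISO-8859-1 so the scan sees each byte as-is.
    (set-port-encoding! port "ISO-8859-1")
    ;; Step 2: scan for a coding declaration; the port is rewound.
    (let ((encoding (file-encoding port)))
      ;; Step 3: switch to the declared encoding, if one was found.
      (when encoding
        (set-port-encoding! port encoding))
      ;; The code can now be read as normal.
      (let loop ((forms '()))
        (let ((form (read port)))
          (if (eof-object? form)
              (begin
                (close-port port)
                (reverse forms))
              (loop (cons form forms))))))))
```

When no declaration is found, this sketch simply leaves the port in ISO-8859-1; a caller that prefers another fallback would set it explicitly.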

Scheme Procedure: file-encoding port
C Function: scm_file_encoding (port)

Scan the port for an Emacs-like character coding declaration near the top of its contents. The port must have random-accessible contents (see how Emacs recognizes file encoding in The GNU Emacs Reference Manual). The coding declaration is of the form coding: XXXXX and must appear in a Scheme comment. Return a string containing the character encoding of the file if a declaration was found, or #f otherwise. The port is rewound.

This document was generated on February 3, 2012 using texi2html 5.0.