
File: gawk.info,  Node: Translate Program,  Next: Labels Program,  Prev: Alarm Program,  Up: Miscellaneous Programs

11.3.3 Transliterating Characters
---------------------------------

The system 'tr' utility transliterates characters.  For example, it is
often used to map uppercase letters into lowercase for further
processing:

     GENERATE DATA | tr 'A-Z' 'a-z' | PROCESS DATA ...

   'tr' requires two lists of characters.(1)  When processing the input,
the first character in the first list is replaced with the first
character in the second list, the second character in the first list is
replaced with the second character in the second list, and so on.  If
there are more characters in the "from" list than in the "to" list, the
last character of the "to" list is used for the remaining characters in
the "from" list.
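
   As a small illustration of the positional mapping just described,
the following command uses two lists of equal length, so that 'a'
becomes 'x', 'b' becomes 'y', and 'n' becomes 'z'.  The input word is
made up for this example:

```shell
# Each character in the first list maps to the character at the
# same position in the second list: a->x, b->y, n->z.
printf 'banana\n' | tr 'abn' 'xyz'
# prints "yxzxzx"
```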

   Once upon a time, a user proposed adding a transliteration function
to 'gawk'.  The following program was written to prove that character
transliteration could be done with a user-level function.  This program
is not as complete as the system 'tr' utility, but it does most of the
job.

   The 'translate' program was written long before 'gawk' acquired the
ability to split each character in a string into separate array
elements.  Thus, it makes repeated use of the 'substr()' built-in
function (*note String Functions::).  The program defines two
functions.  The first, 'stranslate()', takes three arguments:

'from'
     A list of characters from which to translate

'to'
     A list of characters to which to translate

'target'
     The string on which to do the translation

   Associative arrays make the translation part fairly easy.  't_ar'
holds the "to" characters, indexed by the "from" characters.  Then a
simple loop goes through 'target', one character at a time.  If a
character appears as an index in 't_ar', it is replaced with the
corresponding "to" character; otherwise it is copied unchanged.
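
   The table itself can be pictured with a short sketch.  It is not
part of the program; it only prints the mapping that results when the
"to" list is shorter than the "from" list.  The shell function name
'show_map' and the sample lists are made up for this illustration, and
the two table-building loops are condensed into one:

```shell
# Sketch only: print the mapping built from a "from" list and a
# shorter "to" list.  The name show_map is made up for this example.
show_map() {
    awk -v from="$1" -v to="$2" 'BEGIN {
        lf = length(from)
        lt = length(to)
        for (i = 1; i <= lf; i++) {
            # past the end of "to", reuse its last character
            j = (i <= lt) ? i : lt
            t_ar[substr(from, i, 1)] = substr(to, j, 1)
        }
        # walk "from" again so the output order is predictable
        for (i = 1; i <= lf; i++) {
            c = substr(from, i, 1)
            print c " -> " t_ar[c]
        }
    }'
}

show_map 'abcd' 'xy'
# prints:
#   a -> x
#   b -> y
#   c -> y
#   d -> y
```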

   The 'translate()' function calls 'stranslate()', using '$0' as the
target.  The main program sets two global variables, 'FROM' and 'TO',
from the command line, and then changes 'ARGV' so that 'awk' reads from
the standard input.
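
   The command-line handling can be tried in isolation.  The following
sketch copies two operands out of 'ARGV' and then rewrites 'ARGC' and
'ARGV' so that 'awk' does not try to open them as files.  The variable
names 'first' and 'second' and the sample operands are made up here:

```shell
# Sketch only: take two operands from the command line, then make
# awk read standard input instead of treating them as file names.
printf 'input line\n' |
awk 'BEGIN {
    first = ARGV[1]
    second = ARGV[2]
    ARGC = 2          # only ARGV[0] and ARGV[1] remain
    ARGV[1] = "-"     # "-" stands for the standard input
}
{ print first, second, $0 }' abc xyz
# prints "abc xyz input line"
```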

   Finally, the processing rule simply calls 'translate()' for each
record:

     # translate.awk --- do tr-like stuff
     # Bugs: does not handle things like tr A-Z a-z; it has
     # to be spelled out. However, if `to' is shorter than `from',
     # the last character in `to' is used for the rest of `from'.

     function stranslate(from, to, target,     lf, lt, ltarget, t_ar, i, c,
                                                                    result)
     {
         lf = length(from)
         lt = length(to)
         ltarget = length(target)
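         # map each "from" character to its "to" counterpart
         for (i = 1; i <= lt; i++)
             t_ar[substr(from, i, 1)] = substr(to, i, 1)
         # if "to" is shorter, pad with its last character
         if (lt < lf)
             for (; i <= lf; i++)
                 t_ar[substr(from, i, 1)] = substr(to, lt, 1)
         # rebuild the target, translating the mapped characters
         for (i = 1; i <= ltarget; i++) {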
             c = substr(target, i, 1)
             if (c in t_ar)
                 c = t_ar[c]
             result = result c
         }
         return result
     }

     function translate(from, to)
     {
         return $0 = stranslate(from, to, $0)
     }

     # main program
     BEGIN {
         if (ARGC < 3) {
             print "usage: translate from to" > "/dev/stderr"
             exit 1
         }
         FROM = ARGV[1]
         TO = ARGV[2]
         ARGC = 2
         ARGV[1] = "-"
     }

     {
         translate(FROM, TO)
         print
     }

   It is possible to do character transliteration in a user-level
function, but it is not necessarily efficient, and we (the 'gawk'
developers) started to consider adding a built-in function.  However,
shortly after writing this program, we learned that Brian Kernighan had
added the 'toupper()' and 'tolower()' functions to his 'awk' (*note
String Functions::).  These functions handle the vast majority of the
cases where character transliteration is necessary, and so we chose to
simply add those functions to 'gawk' as well and then leave well enough
alone.
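
   For the common case of changing letter case, the two built-in
functions can be used directly.  A minimal example, with made-up
input:

```shell
# tolower() and toupper() return a copy of their argument with the
# letters converted; other characters are left alone.
printf 'Hello, World\n' | awk '{ print tolower($0); print toupper($0) }'
# prints "hello, world" and then "HELLO, WORLD"
```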

   An obvious improvement to this program would be to set up the 't_ar'
array only once, in a 'BEGIN' rule.  However, this assumes that the
"from" and "to" lists will never change throughout the lifetime of the
program.
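
   One way this first improvement might look is sketched here, under
the assumption that the lists really are fixed for the whole run.  The
global array 'T_AR' and the shell function 'translate_once' are names
invented for this sketch, and the lists are passed with '-v' for
brevity, which means that backslash escapes in them are interpreted:

```shell
# Sketch only: build the table once in BEGIN, then reuse it for
# every record.  T_AR and translate_once are names made up here.
translate_once() {
    awk -v FROM="$1" -v TO="$2" '
    BEGIN {
        lf = length(FROM)
        lt = length(TO)
        for (i = 1; i <= lf; i++)
            T_AR[substr(FROM, i, 1)] = substr(TO, (i <= lt) ? i : lt, 1)
    }
    {
        result = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            result = result ((c in T_AR) ? T_AR[c] : c)
        }
        print result
    }'
}

printf 'hello\nworld\n' | translate_once 'lo' 'LO'
# prints "heLLO" and then "wOrLd"
```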

   Another obvious improvement is to enable the use of ranges, such as
'a-z', as allowed by the 'tr' utility.  Look at the code for 'cut.awk'
(*note Cut Program::) for inspiration.
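
   As a starting point, range expansion could be sketched along the
following lines.  This is not taken from 'cut.awk'; the names
'expand_ranges', 'expand', 'ORD', and 'CHR' are invented for the
sketch, and it assumes single-byte characters with codes below 128.
A '-' is treated as a range operator only when it has a character on
each side:

```shell
# Sketch only: expand ranges such as a-e into explicit lists.
# The names expand_ranges and expand are made up for this example.
expand_ranges() {
    awk -v list="$1" '
    function expand(s,    out, i, n, lo, hi, k) {
        n = length(s)
        for (i = 1; i <= n; i++) {
            lo = substr(s, i, 1)
            # a range needs a "-" with a character on each side
            if (i + 2 <= n && substr(s, i + 1, 1) == "-") {
                hi = substr(s, i + 2, 1)
                for (k = ORD[lo]; k <= ORD[hi]; k++)
                    out = out CHR[k]
                i += 2
            } else
                out = out lo
        }
        return out
    }
    BEGIN {
        # ordinal tables; this sketch assumes single-byte characters
        for (k = 1; k < 128; k++) {
            CHR[k] = sprintf("%c", k)
            ORD[CHR[k]] = k
        }
        print expand(list)
    }'
}

expand_ranges 'a-e0-3x'
# prints "abcde0123x"
```

   The expanded lists could then be handed to 'stranslate()' unchanged.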

   ---------- Footnotes ----------

   (1) On some older systems, including Solaris, the system version of
'tr' may require that the lists be written as range expressions enclosed
in square brackets ('[a-z]') and quoted, to prevent the shell from
attempting a file name expansion.  This is not a feature.
