
gawk: Ordinal Functions

 
 10.2.5 Translating Between Characters and Numbers
 -------------------------------------------------
 
 One commercial implementation of 'awk' supplies a built-in function,
 'ord()', which takes a character and returns the numeric value for that
 character in the machine's character set.  If the string passed to
 'ord()' has more than one character, only the first one is used.
 
    The inverse of this function is 'chr()' (from the function of the
 same name in Pascal), which takes a number and returns the corresponding
 character.  Both functions can be written quite easily in 'awk'; there
 is no real reason to build them into the 'awk' interpreter:
 
      # ord.awk --- do ord and chr
 
      # Global identifiers:
      #    _ord_:        numerical values indexed by characters
      #    _ord_init:    function to initialize _ord_
 
      BEGIN    { _ord_init() }
 
      function _ord_init(    low, high, i, t)
      {
          low = sprintf("%c", 7) # BEL is ascii 7
          if (low == "\a") {    # regular ascii
              low = 0
              high = 127
          } else if (sprintf("%c", 128 + 7) == "\a") {
              # ascii, mark parity
              low = 128
              high = 255
          } else {        # ebcdic(!)
              low = 0
              high = 255
          }
 
          for (i = low; i <= high; i++) {
              t = sprintf("%c", i)
              _ord_[t] = i
          }
      }
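
    The three-way probe at the top of '_ord_init()' can be tried on its
 own.  The following is a minimal sketch, not part of 'ord.awk': it is
 wrapped in a shell command so it can be run directly, the shell variable
 'charset' and the printed labels are illustrative names, and it assumes
 an 'awk' that understands the '\a' escape.

```shell
# Sketch: report which branch of the _ord_init() probe this awk takes.
# "charset" and the labels printed below are illustrative, not from ord.awk.
charset=$(awk 'BEGIN {
    if (sprintf("%c", 7) == "\a")              # BEL is 7: regular ASCII
        print "ascii"
    else if (sprintf("%c", 128 + 7) == "\a")   # BEL is 135: mark parity
        print "ascii with mark parity"
    else                                       # anything else, e.g. EBCDIC
        print "other"
}')
echo "$charset"
```

    On an ASCII-based system the first branch is expected to be taken,
 which is the case where '_ord_init()' fills '_ord_' for 0 through 127.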
 
    Some explanation of the numbers used by '_ord_init()' is worthwhile.
 The most prominent character set in use today is ASCII.(1) Although an
 8-bit byte can hold 256 distinct values (from 0 to 255), ASCII only
 defines characters that use the values from 0 to 127.(2)  In the now
 distant past, at least one minicomputer manufacturer used ASCII, but
 with mark parity, meaning that the leftmost bit in the byte is always 1.
 This means that on those systems, characters have numeric values from
 128 to 255.  Finally, large mainframe systems use the EBCDIC character
 set, which uses all 256 values.  There are other character sets in use
 on some older systems, but they are not really worth worrying about.
 With the '_ord_' table built, 'ord()' and 'chr()' themselves are short:
 
      function ord(str,    c)
      {
          # only first character is of interest
          c = substr(str, 1, 1)
          return _ord_[c]
      }
 
      function chr(c)
      {
          # force c to be numeric by adding 0
          return sprintf("%c", c + 0)
      }
 
      #### test code ####
      # BEGIN {
      #    for (;;) {
      #        printf("enter a character: ")
      #        if (getline var <= 0)
      #            break
      #        printf("ord(%s) = %d\n", var, ord(var))
      #    }
      # }
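
    A short, noninteractive check of the two functions might look like
 the following.  This is a sketch under stated assumptions: it repeats
 compact copies of 'ord()' and 'chr()' so that it is self-contained, it
 assumes plain ASCII and therefore fills '_ord_' only for 0 through 127,
 and the shell variable 'roundtrip' is an illustrative name.

```shell
# Sketch: exercise ord() and chr() without reading from the terminal.
# Assumes an ASCII system; "roundtrip" is an illustrative variable name.
roundtrip=$(awk '
function ord(str) { return _ord_[substr(str, 1, 1)] }  # first character only
function chr(c)   { return sprintf("%c", c + 0) }      # c + 0 forces a number
BEGIN {
    for (i = 0; i <= 127; i++)          # ASCII range only (assumption)
        _ord_[sprintf("%c", i)] = i
    # ord("awk") looks only at the leading "a"
    print ord("A"), ord("awk"), chr(66), chr(ord("z"))
}')
echo "$roundtrip"    # prints: 65 97 B z
```

    The 'c + 0' in 'chr()' matters when the argument arrives as a string:
 '%c' applied to a string prints its first character rather than
 converting the value to a character.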
 
    An obvious improvement to these functions is to move the code for the
 '_ord_init()' function into the body of the 'BEGIN' rule.  It was
 written as a separate function initially for ease of development.  There
 is a "test program" in a second 'BEGIN' rule, to test the functions.  It
 is commented out for production use.
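
    One possible shape for that change is sketched below.  It is not the
 manual's code: the names '_ord_lo', '_ord_hi', and '_ord_i' are
 hypothetical, chosen because variables set in a 'BEGIN' rule are global
 rather than local as the extra parameters of '_ord_init()' were.  The
 final 'print' and the shell variable 'inlined' exist only to make the
 result visible.

```shell
# Sketch: the body of _ord_init() moved into the BEGIN rule.
# _ord_lo, _ord_hi and _ord_i are hypothetical names; they are global here.
inlined=$(awk '
BEGIN {
    _ord_lo = 0; _ord_hi = 255                 # EBCDIC fallback: all 256 values
    if (sprintf("%c", 7) == "\a")              # regular ASCII
        _ord_hi = 127
    else if (sprintf("%c", 128 + 7) == "\a")   # ASCII with mark parity
        _ord_lo = 128
    for (_ord_i = _ord_lo; _ord_i <= _ord_hi; _ord_i++)
        _ord_[sprintf("%c", _ord_i)] = _ord_i
    print _ord_lo, _ord_hi, _ord_["a"]         # for demonstration only
}')
echo "$inlined"
```

    The cost of inlining is the extra global names; prefixing them with
 '_ord_' follows the same convention the library uses for its array.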
 
    ---------- Footnotes ----------
 
    (1) This is changing; many systems use Unicode, a very large
 character set that includes ASCII as a subset.  On systems with full
 Unicode support, a character can occupy up to 32 bits, making a simple
 table-building loop such as the one used here prohibitively expensive.
 
    (2) ASCII has been extended in many countries to use the values from
 128 to 255 for country-specific characters.  If your system uses these
 extensions, you can simplify '_ord_init()' to loop from 0 to 255.
 