File: gawk.info,  Node: Splitting By Content,  Next: Testing field creation,  Prev: Constant Size,  Up: Reading Files

4.7 Defining Fields by Content
==============================

* Menu:

* More CSV::                    More on CSV files.
* FS versus FPAT::              A subtle difference.

   This minor node discusses an advanced feature of 'gawk'.  If you are
a novice 'awk' user, you might want to skip it on the first reading.

   Normally, when using 'FS', 'gawk' defines the fields as the parts of
the record that occur in between each field separator.  In other words,
'FS' defines what a field _is not_, instead of what a field _is_.
However, there are times when you really want to define the fields by
what they are, and not by what they are not.

   The most notorious such case is so-called "comma-separated values"
(CSV) data.  Many spreadsheet programs, for example, can export their
data into text files, where each record is terminated with a newline,
and fields are separated by commas.  If commas only separated the data,
there wouldn't be an issue.  The problem comes when one of the fields
contains an _embedded_ comma.  In such cases, most programs embed the
field in double quotes.(1)  So, we might have data like this:

     Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA

   The 'FPAT' variable offers a solution for cases like this.  The
value of 'FPAT' should be a string that provides a regular expression.
This regular expression describes the contents of each field.

   In the case of CSV data as presented here, each field is either
"anything that is not a comma," or "a double quote, anything that is
not a double quote, and a closing double quote."  (There are more
complicated definitions of CSV data, treated shortly.)  If written as a
regular expression constant (*note Regexp::), we would have
'/([^,]+)|("[^"]+")/'.
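   Before turning this regexp into an 'awk' string, it may help to see
its effect outside of 'awk'.  The following is a rough Python sketch of
our own (it is an emulation, not part of 'gawk').  One caveat: Python's
're' module picks the first matching alternative rather than the
POSIX leftmost-longest match that 'gawk' uses, so the quoted
alternative must be listed first for the emulation to behave the same
way:

```python
import re

# A field is either a double-quoted string or a run of non-comma
# characters.  The quoted alternative comes first because Python's
# 're' prefers the first alternative that matches, unlike gawk's
# POSIX longest-match rule.
FPAT = r'("[^"]+")|([^,]+)'

record = 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA'

# Scan left to right, taking each successive match as one field.
fields = [m.group(0) for m in re.finditer(FPAT, record)]
print(len(fields))   # prints: 7
```

   Note that the quoted field keeps its surrounding double quotes,
just as 'gawk' keeps them when splitting with 'FPAT'.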
   Writing this as a string requires us to escape the double quotes,
leading to:

     FPAT = "([^,]+)|(\"[^\"]+\")"

   Putting this to use, here is a simple program to parse the data:

     BEGIN {
         FPAT = "([^,]+)|(\"[^\"]+\")"
     }

     {
         print "NF = ", NF
         for (i = 1; i <= NF; i++) {
             printf("$%d = <%s>\n", i, $i)
         }
     }

   When run, we get the following:

     $ gawk -f simple-csv.awk addresses.csv
     NF = 7
     $1 = <Robbins>
     $2 = <Arnold>
     $3 = <"1234 A Pretty Street, NE">
     $4 = <MyTown>
     $5 = <MyState>
     $6 = <12345-6789>
     $7 = <USA>

   Note the embedded comma in the value of '$3'.

   A straightforward improvement when processing CSV data of this sort
would be to remove the quotes when they occur, with something like
this:

     if (substr($i, 1, 1) == "\"") {
         len = length($i)
         $i = substr($i, 2, len - 2)   # Get text within the two quotes
     }

     NOTE: Some programs export CSV data that contains embedded
     newlines between the double quotes.  'gawk' provides no way to
     deal with this.  Even though a formal specification for CSV data
     exists, there isn't much more to be done; the 'FPAT' mechanism
     provides an elegant solution for the majority of cases, and the
     'gawk' developers are satisfied with that.

   As written, the regexp used for 'FPAT' requires that each field
contain at least one character.  A straightforward modification
(changing the first '+' to '*') allows fields to be empty:

     FPAT = "([^,]*)|(\"[^\"]+\")"

   As with 'FS', the 'IGNORECASE' variable (*note User-modified::)
affects field splitting with 'FPAT'.  Assigning a value to 'FPAT'
overrides field splitting with 'FS' and with 'FIELDWIDTHS'.

   Finally, the 'patsplit()' function makes the same functionality
available for splitting regular strings (*note String Functions::).

   ---------- Footnotes ----------

   (1) The CSV format lacked a formal standard definition for many
years.  RFC 4180 (http://www.ietf.org/rfc/rfc4180.txt) standardizes
the most common practices.
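   The quote-removal step can likewise be sketched in Python (again an
emulation of ours, not 'gawk' itself; the function name 'strip_quotes'
is our own invention):

```python
import re

# Quoted alternative first, since Python's 're' picks the first
# matching alternative rather than the POSIX longest match.
FPAT = r'("[^"]+")|([^,]+)'

def strip_quotes(field):
    # Same logic as the awk snippet above: if the field begins and
    # ends with a double quote, keep only the text between the quotes.
    if field.startswith('"') and field.endswith('"'):
        return field[1:-1]
    return field

record = 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA'
fields = [strip_quotes(m.group(0)) for m in re.finditer(FPAT, record)]
print(fields[2])   # prints: 1234 A Pretty Street, NE
```

   After this pass, the third field carries its embedded comma but no
longer carries the surrounding double quotes.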