info gawk

13.3.10 Finding Anagrams From A Dictionary

An interesting programming challenge is to search for anagrams in a word list (such as ‘/usr/share/dict/words’ on many GNU/Linux systems). One word is an anagram of another if both words contain the same letters (for example, “babbling” and “blabbing”).

An elegant algorithm is presented in Column 2, Problem C of Jon Bentley’s Programming Pearls, second edition. The idea is to give words that are anagrams a common signature, sort all the words together by their signature, and then print them. Dr. Bentley observes that taking the letters in each word and sorting them produces that common signature.

The following program uses arrays of arrays to bring together words with the same signature and array sorting to print the words in sorted order.

# anagram.awk --- An implementation of the anagram finding algorithm
#                 from Jon Bentley's "Programming Pearls", 2nd edition.
#                 Addison Wesley, 2000, ISBN 0-201-65788-0.
#                 Column 2, Problem C, section 2.8, pp 18-20.

/'s$/   { next }        # Skip possessives

The program starts with a header, and then a rule to skip possessives in the dictionary file. The next rule builds up the data structure. The first dimension of the array is indexed by the signature; the second dimension is the word itself:

{
    key = word2key($1)  # Build signature
    data[key][$1] = $1  # Store word with signature
}

The word2key() function creates the signature. It splits the word apart into individual letters, sorts the letters, and then joins them back together:

# word2key --- split word apart into letters, sort, joining back together

function word2key(word,     a, i, n, result)
{
    n = split(word, a, "")
    asort(a)

    for (i = 1; i <= n; i++)
        result = result a[i]

    return result
}

Finally, the END rule traverses the array and prints out the anagram lists. It sends the output to the system sort command, since otherwise the anagrams would appear in arbitrary order:

END {
    sort = "sort"
    for (key in data) {
        # Sort words with same key
        nwords = asorti(data[key], words)
        if (nwords == 1)
            continue

        # And print. Minor glitch: trailing space at end of each line
        for (j = 1; j <= nwords; j++)
            printf("%s ", words[j]) | sort
        print "" | sort
    }
    close(sort)
}

Here is some partial output when the program is run:

$ gawk -f anagram.awk /usr/share/dict/words | grep '^b'
…
babbled blabbed 
babbler blabber brabble 
babblers blabbers brabbles 
babbling blabbing 
babbly blabby 
babel bable 
babels beslab 
babery yabber 
…

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]