Text Processing Pipelines
Written by Adrian J. Chung
Sure, the command line is evil, but mastering it will unlock the powers
of a Unix box that remain unrealized under modern graphical user
interfaces. This article details the construction of text processing
pipelines, using ordinary GNU utilities, to accomplish a few fairly
challenging tasks.
Common word usage
Suppose that, for whatever reason, one is interested in the word usage
of a piece of text, perhaps from an article such as this one,
downloaded from the Net. One might want to know what word is most
frequently used while ignoring all the non-words, like variable names
in source code or other random bits of junk. Perhaps a ranking of word
frequency is required. Should one resort to writing a special word
counting program in Perl? Here's how to do it using a few GNU text
utilities and the assistance of that great resource /usr/dict/words.
First we begin by breaking up the sentences in the text file so that
there is no more than one word per line. The "tr" tool is useful
here. This tool translates its input one character at a time. For
example, here is the essential USENET rot13 filter built with "tr":
% tr a-zA-Z n-za-mN-ZA-M < rot13-encrypted.txt
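A quick way to convince oneself that the translation undoes itself is
to push a single made-up word through it twice:
% echo Hello | tr a-zA-Z n-za-mN-ZA-M
Uryyb
% echo Uryyb | tr a-zA-Z n-za-mN-ZA-M
Hello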
The two arguments specify the character translation table to use:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM
Characters in the first line are changed to the corresponding
character in the second line. Unmatched characters are
unaffected. "tr" can be made to reverse this behavior, changing only
the unmatched characters:
% tr -c a-zA-Z '\n' < article.txt
Anything that is not a letter is changed to a newline character. This
accomplishes the first step of isolating words, one per line.
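Run on a small made-up input, this stage looks like the following; the
blank lines appear wherever several non-letters sat next to one
another, and they are harmless because they will never match a
dictionary word later on:
% echo "one, two & three" | tr -c a-zA-Z '\n'
one

two


three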
Next we need to group the similar words together. Sorting the lines
of the file will suffice. The "sort" command can also operate as a
filter, and with the "-f" option the sort becomes case-insensitive. Our
pipeline so far:
% tr -c a-zA-Z '\n' < article.txt | sort -f
You will notice that a body of text, such as the article you are
reading, will contain many non-words (e.g. "za", "tr", "txt"). We can
get rid of these by referencing against a list of valid words,
/usr/dict/words for example. The "join" command implements what is
known as a natural join in database terminology. It reads two text
streams which must be pre-sorted by the key field which it uses to
match up lines from either stream. Since /usr/dict/words is already
sorted this is ideal. What is useful about natural joins is that rows
of data that have no matching counterpart in the other text stream are
not output. Any words not found in our dictionary are removed thus:
% tr -c a-zA-Z '\n' < article.txt | sort -f | join -i /usr/dict/words -
The "-" tells "join" to take the second text stream from standard
input (i.e. the output of the preceding stage of the pipeline). The
"-i" makes the join case insensitive.
The next step is to count the grouped words and we can use the "uniq"
utility to do the job. "uniq" removes repeated lines of text that are
consecutively located. With a "-c" option "uniq" will also output a
count of similar lines for each unique line found. Again, "-i" for
case insensitivity:
% tr -c a-zA-Z '\n' < article.txt | sort -f | join -i /usr/dict/words - |
uniq -i -c
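On a small hand-made input the "-c" option behaves like this (the
counts are printed right-justified in a fixed-width column, which
becomes relevant in a moment):
% printf 'cat\ncat\ndog\n' | uniq -c
      2 cat
      1 dog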
Finally, this output needs to be sorted by frequency using "sort":
% tr -c a-zA-Z '\n' < article.txt | sort -f | join -i /usr/dict/words - |
uniq -i -c | sort -r
Normally when sorting by numerical value rather than ASCII string, the
"-n" option should be given. We can get away without it because "uniq"
right-justifies its numerical counts. The "-r" reverses the order of
the sort so that the most frequently used words appear first.
There are still a few shortcomings such as the handling of
contractions (e.g. "we'll", "it's", "can't") but one could expect
similar difficulties with a specially coded word counting program.
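One partial remedy, sketched here without any guarantee that a given
word list actually contains contractions, is to keep the apostrophe in
the letter set so that such words at least survive the first stage:
% tr -c "a-zA-Z'" '\n' < article.txt | sort -f | join -i /usr/dict/words - |
uniq -i -c | sort -rn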
Unordered Natural Joins
On more than one occasion I have needed to fuse two files together
so that the combined output carries the same information as the two
files held separately. For example, suppose we have the following two files:
alpha.txt
tetex-xdvi 1.0.6-11
ElectricFence 2.1-3
newt-devel 0.50.8-2
rgrep 0.98.7-5
dosfstools 2.2-4
bdflush 1.5-11
bin86 0.4-7
gnuplot 3.7.1-3
dialog 0.6-16
kernel-utils 2.2.14-5.0
termcap 10.2.7-9
beta.txt
ElectricFence 36903
bdflush 8861
bin86 74968
dialog 85955
dosfstools 106819
gnuplot 1345702
kernel-utils 292693
newt-devel 144815
rgrep 15202
termcap 625272
tetex-xdvi 1222425
A natural "join" suggests itself, however, suppose one needs to keep
the lines of text ordered as they are found in the alpha.txt
file. "join" will fail unless the key field is sorted. We need to add
an extra indexing field to the first file that will help to restore
the original order of the file. The "nl" command proves to be useful:
% nl alpha.txt
1 tetex-xdvi 1.0.6-11
2 ElectricFence 2.1-3
3 newt-devel 0.50.8-2
4 rgrep 0.98.7-5
5 dosfstools 2.2-4
6 bdflush 1.5-11
7 bin86 0.4-7
8 gnuplot 3.7.1-3
9 dialog 0.6-16
10 kernel-utils 2.2.14-5.0
11 termcap 10.2.7-9
Now we can sort by the key field to perform the join, then re-order by
our index field to restore the original order:
% nl alpha.txt |sort +1 |join -j2 2 beta.txt -
The "-j2 2" argument tells "join" to use the 2nd field as the key
field for the second input stream (beta.txt is the first input
stream). The output is a bit messy, but one observes that the index to
sort by is the 3rd field. "sort +2" will skip over the first two fields
when comparing rows:
% nl alpha.txt |sort +1 |join -j2 2 beta.txt - |sort +2 -n
Now get rid of the ordering field using "cut":
% nl alpha.txt |sort +1 |join -j2 2 beta.txt -| sort +2 -n | cut -f 1,2,4 -d " "
The "-f" argument gives a list of fields to include in the
output. "cut" normally uses [TAB] as the field separator but this is
changed using "-d". We still need to pretty up the formatting. The
"pr" tool finds a use here:
% nl alpha.txt |sort +1 |join -j2 2 beta.txt -| sort +2 -n | cut -f 1,2,4 -d " " |pr -e\ 16 -T
tetex-xdvi 1222425 1.0.6-11
ElectricFence 36903 2.1-3
newt-devel 144815 0.50.8-2
rgrep 15202 0.98.7-5
dosfstools 106819 2.2-4
bdflush 8861 1.5-11
bin86 74968 0.4-7
gnuplot 1345702 3.7.1-3
dialog 85955 0.6-16
kernel-utils 292693 2.2.14-5.0
termcap 625272 10.2.7-9
"pr" normally formats output for line printers. Few people use these
archaic pieces of hardware anymore but "pr" still has its uses. The
"-e" expands TAB characters, replacing them with spaces. Our slightly
embellished "-e" argument tells "pr" to use a single space as the TAB
character and to space the tab positions 16 character widths apart.
"-T" suppresses the headers, footers, and form feeds.
Conclusion
The GNU text processing utilities provide a rich set of functions that
can be combined, using the command line, in many different ways to
accomplish tasks that otherwise would require special programs to be
written. Investing a little time to learn to use the command line
interface can save one a great deal of trouble in the long run.