Text Processing Pipelines
Written by Adrian J. Chung
Sure, the command line is evil, but mastering it will unlock the powers
of a Unix box that remain unrealized under modern graphical user
interfaces. This article details the construction of text processing
pipelines, using ordinary GNU utilities, to accomplish a few fairly
challenging tasks.
Common word usage
Suppose that, for whatever reason, one is interested in the word usage
of a piece of text, perhaps from an article such as this one,
downloaded from the Net. One might want to know what word is most
frequently used while ignoring all the non-words, like variable names
in source code or other random bits of junk. Perhaps a ranking of word
frequency is required. Should one resort to writing a special word
counting program in Perl? Here's how to do it using a few GNU text
utilities and the assistance of that great resource /usr/dict/words.
First we begin by breaking up the sentences in the text file so that
there is no more than one word per line. The "tr" tool is useful
here. This tool translates its input one character at a time. For
example, here is the essential USENET rot13 filter built with "tr":
% tr a-zA-Z n-za-mN-ZA-M < rot13-encrypted.txt
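A quick way to convince oneself that the translation undoes itself is
to push a single made-up word through it twice:
% echo Hello | tr a-zA-Z n-za-mN-ZA-M
Uryyb
% echo Uryyb | tr a-zA-Z n-za-mN-ZA-M
Hello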
The two arguments specify the character translation table to use:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM
Characters in the first line are changed to the corresponding
character in the second line. Unmatched characters are
unaffected. "tr" can be made to reverse this behavior, changing only
the unmatched characters:
% tr -c a-zA-Z '\n' < article.txt
Anything that is not a letter is changed to a newline character. This
accomplishes the first step of isolating words, one per line.
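Run on a small made-up input, this stage looks like the following; the
blank lines appear wherever several non-letters sat next to one
another, and they are harmless because they will never match a
dictionary word later on:
% echo "one, two & three" | tr -c a-zA-Z '\n'
one

two


three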
Next we need to group the similar words together. Sorting the lines
of the file will suffice. The "sort" command can also operate as a
filter, and with the "-f" option the sort becomes case-insensitive. Our
pipeline so far:
% tr -c a-zA-Z '\n' < article.txt | sort -f
You will notice that a body of text, such as the article you are
reading, will contain many non-words (e.g. "za", "tr", "txt"). We can
get rid of these by referencing against a list of valid words,
/usr/dict/words for example. The "join" command implements what is
known as a natural join in database terminology. It reads two text
streams which must be pre-sorted by the key field which it uses to
match up lines from either stream. Since /usr/dict/words is already
sorted this is ideal. What is useful about natural joins is that rows
of data that have no matching counterpart in the other text stream are
not output. Any words not found in our dictionary are removed thus:
% tr -c a-zA-Z '\n' < article.txt | sort -f | join -i /usr/dict/words -
The "-" tells "join" to take the second text stream from standard
input (i.e. the output of the preceding stage of the pipeline). The
"-i" makes the join case insensitive.
The next step is to count the grouped words and we can use the "uniq"
utility to do the job. "uniq" removes repeated lines of text that are
consecutively located. With a "-c" option "uniq" will also output a
count of similar lines for each unique line found. Again, "-i" for
case insensitivity:
% tr -c a-zA-Z '\n' < article.txt | sort -f | join -i /usr/dict/words - |
uniq -i -c
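On a small hand-made input the "-c" option behaves like this (the
counts are printed right-justified in a fixed-width column, which
becomes relevant in a moment):
% printf 'cat\ncat\ndog\n' | uniq -c
      2 cat
      1 dog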
Finally, this output needs to be sorted by frequency using "sort":
% tr -c a-zA-Z '\n' < article.txt | sort -f | join -i /usr/dict/words - |
uniq -i -c | sort -r
Normally when sorting by numerical value rather than ASCII string, the
"-n" option should be given. We can get away without it because "uniq"
right-justifies its numerical counts. The "-r" reverses the order of
the sort so that the most frequently used words appear first.
There are still a few shortcomings such as the handling of
contractions (e.g. "we'll", "it's", "can't") but one could expect
similar difficulties with a specially coded word counting program.
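One partial remedy, sketched here without any guarantee that a given
word list actually contains contractions, is to keep the apostrophe in
the letter set so that such words at least survive the first stage:
% tr -c "a-zA-Z'" '\n' < article.txt | sort -f | join -i /usr/dict/words - |
uniq -i -c | sort -rn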
Unordered Natural Joins
On more than one occasion I have needed to fuse two files together
so that the combined output carries the same information as the two
files held separately. For example, suppose we have the following two files:
alpha.txt
tetex-xdvi 1.0.6-11
ElectricFence 2.1-3
newt-devel 0.50.8-2
rgrep 0.98.7-5
dosfstools 2.2-4
bdflush 1.5-11
bin86 0.4-7
gnuplot 3.7.1-3
dialog 0.6-16
kernel-utils 2.2.14-5.0
termcap 10.2.7-9
beta.txt
ElectricFence 36903
bdflush 8861
bin86 74968
dialog 85955
dosfstools 106819
gnuplot 1345702
kernel-utils 292693
newt-devel 144815
rgrep 15202
termcap 625272
tetex-xdvi 1222425
A natural "join" suggests itself, however, suppose one needs to keep
the lines of text ordered as they are found in the alpha.txt
file. "join" will fail unless the key field is sorted. We need to add
an extra indexing field to the first file that will help to restore
the original order of the file. The "nl" command proves to be useful:
% nl alpha.txt
1 tetex-xdvi 1.0.6-11
2 ElectricFence 2.1-3
3 newt-devel 0.50.8-2
4 rgrep 0.98.7-5
5 dosfstools 2.2-4
6 bdflush 1.5-11
7 bin86 0.4-7
8 gnuplot 3.7.1-3
9 dialog 0.6-16
10 kernel-utils 2.2.14-5.0
11 termcap 10.2.7-9
Now we can sort by the key field to perform the join, then re-order by
our index field to restore the original order:
% nl alpha.txt |sort +1 |join -j2 2 beta.txt -
The "-j2 2" argument tells "join" to use the 2nd field as the key
field for the second input stream (beta.txt is the first input
stream). The output is a bit messy, but one observes that the index to
sort by is the 3rd field. "sort +2" will skip over the first two fields
when comparing rows:
% nl alpha.txt |sort +1 |join -j2 2 beta.txt - |sort +2 -n
Now get rid of the ordering field using "cut":
% nl alpha.txt |sort +1 |join -j2 2 beta.txt -| sort +2 -n | cut -f 1,2,4 -d " "
The "-f" argument gives a list of fields to include in the
output. "cut" normally uses [TAB] as the field separator but this is
changed using "-d". We still need to pretty up the formatting. The
"pr" tool finds a use here:
% nl alpha.txt |sort +1 |join -j2 2 beta.txt -| sort +2 -n | cut -f 1,2,4 -d " " |pr -e\ 16 -T
tetex-xdvi 1222425 1.0.6-11
ElectricFence 36903 2.1-3
newt-devel 144815 0.50.8-2
rgrep 15202 0.98.7-5
dosfstools 106819 2.2-4
bdflush 8861 1.5-11
bin86 74968 0.4-7
gnuplot 1345702 3.7.1-3
dialog 85955 0.6-16
kernel-utils 292693 2.2.14-5.0
termcap 625272 10.2.7-9
"pr" normally formats output for line printers. Few people use these
archaic pieces of hardware anymore but "pr" still has its uses. The
"-e" expands TAB characters, replacing them with spaces. Our slightly
embellished "-e" argument tells "pr" to use a single space as the TAB
character and to space the tab positions 16 character widths apart.
"-T" suppresses the headers, footers, and form feeds.
Conclusion
The GNU text processing utilities provide a rich set of functions that
can be combined, using the command line, in many different ways to
accomplish tasks that otherwise would require special programs to be
written. Investing a little time to learn to use the command line
interface can save one a great deal of trouble in the long run.