So what is a filter? At the most basic level, a filter is a program that accepts input, transforms it, and outputs the transformed data. The idea of the filter is closely associated with several ideas that are part of the Unix operating system: standard input and output, input/output redirection, and pipes.
Standard input and output refer to default places from which a program will take input and to which it will write output respectively. The standard input (STDIN) for a program running interactively at the command line is the keyboard; the standard output (STDOUT), the terminal screen.
With input/output redirection, a program can take input or send output someplace other than standard input or output — to a file, for instance. Redirection of STDIN is accomplished using the < symbol, redirection of STDOUT by >. For example,
$ ls > list
redirects the output of the ls command, which would normally go to the screen, into a file called list. Similarly,
$ cat < list
redirects the input for cat, which in the absence of a file name would be expected from the keyboard, to come from the file list, so we print the contents of that file on-screen.
Pipes are a means of connecting programs together through I/O redirection. The symbol for pipe is |. For example,
$ ls | less
is a common way of comfortably viewing the output from a directory listing where there are more files than will fit in a screenful.
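All three mechanisms can be tried safely in a scratch directory. What follows is only a minimal sketch: the directory comes from mktemp, and the file names (one, two, three, list) are made up for illustration.

```shell
#!/bin/sh
# Use a throwaway directory so nothing real is touched.
dir=$(mktemp -d)
touch "$dir/one" "$dir/two" "$dir/three"

# Redirect STDOUT: the listing goes into the file "list", not to the screen.
# The shell creates "list" before ls runs, so "list" shows up in the listing.
ls "$dir" > "$dir/list"

# Redirect STDIN: wc reads the file as if it had been typed at the keyboard.
count_from_file=$(wc -l < "$dir/list")

# Pipe: the output of ls becomes the input of wc, with no file in between.
count_from_pipe=$(ls "$dir" | wc -l)

echo "from file:" $count_from_file "from pipe:" $count_from_pipe
rm -rf "$dir"
```

Both counts come out the same, since the pipe and the intermediate file carry the same listing.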
In what follows, I'll be looking at how simple programs provided as standard with your Linux system can be enhanced by being used as filters for other similar programs. I'll also show how simple programs of your own can be built to meet your own custom filtering needs.
One program I don't look at in this article is perl. This is for two reasons: firstly, perl is a programming language in its own right, and filters are, of course, language-independent; and secondly, I don't much like perl. Personal preference, of course, but then, who's writing this article anyway?!
Here's an example:
$ grep 'Linus Torvalds' *
This searches all the files in the current directory for the name of the great man himself.
Various command line switches can modify grep's behaviour. For example, if we aren't sure about case, we can write
$ grep -y 'linus torvalds'
The -y switch tells grep to match without considering case. If you use upper-case letters in the pattern, however, they will still match only upper-case. (This is broken in GNU grep, which simply ignores case when given the -y switch — that's what the -i switch is for).
Given even this much grep, it's easy to construct a practical application. Store name and address details in a file, and voila! a searchable address book.
$ grep -y [search arg] /usr/lib/company/phone-book
Put that in a text file, make it executable with
$ chmod +x filename
and there you go.
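Here is a rough sketch of what such a script might look like. The PHONE_BOOK variable is my own assumption, added so that the lookup can be tried on a scratch file; the default path is the one used above. It uses -i, the portable switch for ignoring case.

```shell
#!/bin/sh
# phone: look up a name in a plain-text phone book, ignoring case.
# PHONE_BOOK is a hypothetical override; the default is the path from the text.
phone() {
    grep -i -e "$1" "${PHONE_BOOK:-/usr/lib/company/phone-book}"
}

# Demonstration on two made-up entries in a temporary file.
PHONE_BOOK=$(mktemp)
printf '%s\n' 'Linus Torvalds 555-0100' 'Alan Cox 555-0101' > "$PHONE_BOOK"
phone 'linus'
rm -f "$PHONE_BOOK"
```

The demonstration prints the Torvalds entry, even though the search argument is all lower-case.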
.      any one character
*      zero or more of the preceding character
^      beginning of line
$      end of line
[a-z]  a set of characters; [a-z] is the whole lower-case alphabet
$ egrep 'Linus Torvalds|L\. Torvalds|L\. T\.|Mr\. Torvalds'
will now find most ways of naming the inventor of Linux. Note the backslashes to escape the full stop — since that is a special character in regular expressions, when we want to use it as itself, as here, we must tell egrep not to interpret it as a magic character.
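To see what the escaping buys us, here is a small sketch on made-up input. It uses grep -E, which is the POSIX spelling of egrep; the sample lines are invented for the purpose.

```shell
#!/bin/sh
# Sample input: four lines, three of which name the same person.
names() {
    printf '%s\n' \
        'Linus Torvalds wrote it' \
        'L. Torvalds signed off' \
        'LX Torvalds is somebody else' \
        'Thanks, Mr. Torvalds'
}

# With the full stop escaped, "L\." matches only a literal "L.".
names | grep -E 'Linus Torvalds|L\. Torvalds|Mr\. Torvalds'

# Unescaped, "." matches any character, so "LX Torvalds" slips through too.
names | grep -E 'L. Torvalds'
```

The first command prints three lines and leaves out the impostor; the second prints two, one of which is the impostor.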
$ tr A-Z a-z
changes upper-case letters to lower-case. A-Z is shorthand for all the letters from A to Z.
A more complicated filter applies the old cipher, rot13. Each letter of the alphabet is changed to the letter 13 characters ahead in the alphabetic sequence. Letters 14-26 are wrapped around.
tr '[a-m][n-z][A-M][N-Z]' '[n-z][a-m][N-Z][A-M]'
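A handy property of rot13 is that applying it twice gives back the original text. The sketch below wraps the cipher in a shell function; note that it spells the two sets slightly differently from the command above (an equivalent form, I believe, not the same one), and it pins the locale so the ranges mean plain ASCII.

```shell
#!/bin/sh
# rot13 as a shell function: each half of the alphabet maps onto the other.
rot13() {
    LC_ALL=C tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

echo 'Hello, World' | rot13            # encode: prints "Uryyb, Jbeyq"
echo 'Hello, World' | rot13 | rot13    # encode twice: back to the original
```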
As is common in the Unix world, where tools are often duplicated in different ways, sed can do most things that grep does. Here is a simple grep in sed:
sed -n '/Linus Torvalds/p'
All this does is read standard input, and print only those lines containing the string Linus Torvalds.
The default with sed, as with many filters, is to pass standard input to standard output, unchanged. To make it do anything useful, you give it instructions. In our first example, we searched for the string by enclosing it in "//", and told sed to print any line with that string in it with the 'p' command. The -n switch to sed made sure that it did not print any lines which did not match the pattern. Remember, the default behaviour is to print everything.
If this were all sed could do, we would be better off sticking with grep. Sed's forte is as a "stream editor", used for changing text files according to rules you supply.
Let's take a simple example.
$ sed 's/Torvuls/Torvalds/g'
This uses the sed "substitute" - 's' - command, and applies it globally - 'g'. It looks for every occurrence of "Torvuls", and changes it to "Torvalds". Without the g command at the end, it would change only the first occurrence of "Torvuls" on each line.
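The difference the g makes is easiest to see on a line that contains the misspelling twice. A minimal sketch, with a made-up input line:

```shell
#!/bin/sh
line='Torvuls met Torvuls'

# Without g: only the first match on the line is replaced.
echo "$line" | sed 's/Torvuls/Torvalds/'     # Torvalds met Torvuls

# With g: every match on the line is replaced.
echo "$line" | sed 's/Torvuls/Torvalds/g'    # Torvalds met Torvalds
```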
sed '/^From /,/^$/d'
This searches the standard input for the word From at the beginning of a line, followed by a space, and deletes all the lines from the line containing that pattern up to and including the first blank line, which is represented by "^$" i.e. a beginning of line - "^" - followed immediately by an end of line - "$". In plain English, it strips out the header from a Usenet posting that you have saved in a file.
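As a sketch of that command at work, here it is applied to a tiny invented posting. The function names posting and strip_header are mine, for illustration only.

```shell
#!/bin/sh
# A made-up saved posting: header lines, a blank line, then the body.
posting() {
    printf '%s\n' \
        'From paul Mon Jan  4 10:00:00 1999' \
        'Subject: filters' \
        '' \
        'Body line one.' \
        'Body line two.'
}

# Delete from the "From " line through the first blank line.
strip_header() {
    sed '/^From /,/^$/d'
}

posting | strip_header
```

Only the two body lines survive; the header and the blank line that ends it are gone.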
To double-space a text file takes just one command:
#!/usr/bin/sed -f
G
According to our manual page, all that does is to append the contents of the hold space to the current text buffer. Eh? Well, in plain English, for each line, we output the contents of a buffer that sed uses to store text. Since we haven't put anything in there, it's empty. But, in sed, appending this buffer adds a newline, regardless of whether there is anything in the buffer. So, the effect is to add an extra newline to each line, thus double-spacing the output.
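The same command works straight from the command line, without a script file. A quick sketch (double_space is just a name I've picked for the wrapper):

```shell
#!/bin/sh
# G appends the (empty) hold space, and with it a newline, to every line.
double_space() {
    sed G
}

# Two lines in, four lines out: each line is followed by a blank one.
printf 'one\ntwo\n' | double_space
```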
Sed is very handy for converting from one file format to another. The most common case where this is needed is porting text files to and from MS-DOS and Mac formats.
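One way such conversions might be written is sketched below. The function names are hypothetical, and the approach rests on two assumptions: that MS-DOS lines end in carriage return plus newline, and that old Mac files use a bare carriage return. The carriage return is built with printf because not every sed understands \r in a pattern. For the Mac case, tr is the simpler tool, since sed would see the whole file as a single line.

```shell
#!/bin/sh
# A literal carriage return, held in a variable for use inside sed commands.
cr=$(printf '\r')

# MS-DOS to Unix: strip the carriage return that ends each line.
dos2unix_sed() {
    sed "s/$cr\$//"
}

# Unix to MS-DOS: add a carriage return at the end of each line.
unix2dos_sed() {
    sed "s/\$/$cr/"
}

# Old Mac to Unix: turn each bare carriage return into a newline.
mac2unix_tr() {
    tr '\r' '\n'
}

printf 'first\r\nsecond\r\n' | dos2unix_sed
```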
Now, something more complex. I publish my config file for vi, .exrc, on the web. I want it to look nice in people's browsers, not to be just a big hunk of text. So, I run it through sed to turn it into a simple HTML document.
#!/bin/sed -f
#filter-exrc: turn .exrc into html
1i\
<html>\
<head>\
<title>Paul Dunne's .exrc file<\/title>\
<\/head>\
<body>\
<pre>\
<code>
$a\
<\/code>\
<\/pre>\
<\/body>\
<\/html>
This matches the first line (1) and inserts (i) the given text up to a newline; escaping the newlines in the text, as shown, lets us insert several lines at once. Then we wait until the end of the file ($) before appending (a) some additional lines of HTML. Remember, meantime, although we gave no instructions save for first line and last line, sed has been sending all the lines of the .exrc file to standard output. So, our outcome is the original .exrc file bracketed with extra lines that make it an HTML document.
To start with, let's look again at yet another way to do a grep. This is fast becoming traditional; perhaps we should call it YAG (Yet Another Grep)?
$ awk '/Linus Torvalds/'
Like grep and sed, awk can search for text patterns. As with sed, each pattern can have an action associated with it. If no action is supplied, as in the above example, then the default action is to print each line where the pattern is matched. Alternatively, if no pattern is supplied, then the action is applied to every line.
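The three combinations can be sketched side by side on a couple of made-up input lines:

```shell
#!/bin/sh
people() {
    printf '%s\n' 'Linus Torvalds Helsinki' 'Alan Cox Swansea'
}

# Pattern only: the default action prints each matching line.
people | awk '/Torvalds/'

# Action only: the action runs once for every input line.
people | awk '{ print "a line" }'

# Pattern and action: the action runs only for matching lines.
people | awk '/Cox/ { print "found him" }'
```

The first command prints the Torvalds line, the second prints "a line" twice, and the third prints "found him" once.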
Awk's strength is in its ability to treat data as tabular - that is, as arranged in rows and columns. With each input line, awk automatically splits it into fields. The default field separator is "whitespace" i.e. blanks and tabs, but can be changed to any character you want. Now, many Unix utilities produce this sort of tabular output. In our next section, we'll see how this tabular format can be sent as input to awk using a shell construction which we haven't seen yet.
If we look again at the humble filter which started this article, wc, we see that its default output is in four columns. An alternative way of specifying the '-c' switch, to count only characters, would be
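A short sketch of field splitting, first with the default separator and then with a colon chosen through the -F switch. The input lines are invented; the second imitates the layout of /etc/passwd.

```shell
#!/bin/sh
# Default separator: any run of blanks and tabs. Print the second field.
printf 'alpha   beta\tgamma\n' | awk '{ print $2 }'          # beta

# -F changes the separator; here it is a colon.
printf 'paul:x:1000:1000\n' | awk -F: '{ print $1, $3 }'     # paul 1000
```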
$ wc | awk ' { print $3 } '
Taking the whole output of wc, e.g.
258 1558 8921 lj.filters
and filtering it to get just what we require, in this case, the third column, the character count:
8921
If we want to print the whole input line, that's simply $0.
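Both ideas fit in a couple of lines. In this sketch the input is made up, and NF is awk's built-in count of the fields on the current line:

```shell
#!/bin/sh
# Character count only: the third column of wc's output.
# "one two\n" is 8 characters and "three\n" is 6, so this prints 14.
printf 'one two\nthree\n' | wc | awk '{ print $3 }'

# $0 is the whole line; here each line is printed after its field count.
printf 'one two\nthree\n' | awk '{ print NF, $0 }'
```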
Another handy filtering pipe is this one. We know we can see all the hidden files (names beginning with .) using ls -a, but how do we see JUST the hidden files? A simple filtering of ls -a output makes it easy. The pattern is quoted so that the shell passes it to grep untouched instead of trying to expand it as a file name.
$ ls -a | grep '^[.].*'
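Here is a sketch of the same pipe tried on a scratch directory, so that the result is predictable. The function name hidden and the file names are made up, and the second grep is my own addition, for when the . and .. entries are unwanted.

```shell
#!/bin/sh
# List only the names that begin with a dot in the given directory.
hidden() {
    ls -a "$1" | grep '^[.]'
}

dir=$(mktemp -d)
touch "$dir/.exrc" "$dir/notes"

# Prints ".", ".." and ".exrc"; "notes" is filtered out.
hidden "$dir"

# Drop the entries that consist of dots only, leaving just ".exrc".
hidden "$dir" | grep -v '^[.][.]*$'

rm -rf "$dir"
```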
ls output often needs filtering.
See what programs I've been working on recently:
$ ls -tr ~/bin | tail -80 | 3
Where 3 is
#!/bin/sh
pr -`basename $0` -t -l1 $*
A digression, this last shell script, since it is neither a filter nor a pipe.
Earlier, we used sed to convert text file formats. Many other file format convertors can be produced in the same way. As an example, consider this primitive convertor for changing MS Word files into something readable:
$ tr -dc 'A-Za-z0-9 .,:;!&"\t\r\n-' | tr '\r' '\n'
Note that this is a very rough-cut filter, that simply strips away all the non-text from a Word file. It is sufficient for my needs; but could be vastly improved.
(Note that this doesn't work with bash, at least with version 1.14.7. You'll need a real /bin/sh — /bin/ash as supplied with Slackware will do the trick).
Of course, pipes greatly increase the power of programmable filters such as sed and awk. Here's a little script to calculate the last Friday of any given month.
#!/bin/sh
/usr/bin/cal $1 $2 | awk '{ lasta = a; a = $6; if (a == "") a=lasta } END { print a }'
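To see why this works, it helps to run the awk half on its own. The sketch below feeds it canned text in the layout cal prints for January 1999, so it does not depend on cal being installed; last_friday is a name I've made up for the wrapper.

```shell
#!/bin/sh
# The awk half of the script, reading calendar text from standard input.
# Each line's sixth field is the Friday column; when a line has no sixth
# field, the previous value is kept, so the last Friday survives to END.
last_friday() {
    awk '{ lasta = a; a = $6; if (a == "") a = lasta } END { print a }'
}

last_friday <<'EOF'
    January 1999
Su Mo Tu We Th Fr Sa
                1  2
 3  4  5  6  7  8  9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
EOF
```

The final row holds only the 31st, a Sunday, so its sixth field is empty and the 29 from the row above is what gets printed.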
It is often useful to store data in simple ASCII tables; and awk is a great tool for manipulating such data. As a simple example, consider this weights and measures convertor. We have a simple text file of conversions:
From To Rate
---- -- ----
kg lb 2.20
lb kg 0.4536
st lb 14
lb st 0.07
kg st 0.15
st kg 6.35
in cm 2.54
cm in 0.394
Our script reads a weight, the unit that weight is in, and the unit we wish to convert to, and gives us the result.
$ weightconv 100 kg lb
220
$
#!/bin/sh
#weightconv: weights & measures converter
table=/usr/local/lib/weights_and_measures
case $# in
0|1) echo "weightconv: usage weightconv amount from [to]" 1>&2; exit 1;;
esac
amount=$1
from=$2
to=$3
rate=`grep "^$from $to" $table|awk '{print $3}'`
case $rate in
"") echo "weightconv: no rate found for $from to $to" 1>&2; exit 2;;
esac
echo $amount $rate | awk '{print $1*$2}'
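Since awk can match fields as well as do arithmetic, the lookup and the multiplication could also be done in one awk program. The following is a variation of my own, not the script above: the function name convert is hypothetical, and it takes the table path as a fourth argument so that it can be tried on a scratch copy of the table.

```shell
#!/bin/sh
# convert AMOUNT FROM TO TABLE: find the matching row and multiply.
# Exits with status 2 when the table has no rate for the pair.
convert() {
    awk -v amount="$1" -v from="$2" -v to="$3" '
        $1 == from && $2 == to { print amount * $3; found = 1; exit }
        END { if (!found) exit 2 }
    ' "$4"
}

# Demonstration on a scratch table holding two of the rows shown earlier.
table=$(mktemp)
printf '%s\n' 'kg lb 2.20' 'in cm 2.54' > "$table"
convert 100 kg lb "$table"    # prints 220
convert 10 in cm "$table"     # prints 25.4
rm -f "$table"
```

Comparing whole fields also avoids a small trap in the grep pattern, which would let a unit match any longer unit that merely begins with the same letters.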
cat $* |
tr -sc A-Za-z '\012' |
sort |
uniq -c |
sort -n |
tail
Firstly, we concatenate all the input together into one, using cat $*— since this command will of course be run from a shell script.
Next, we put each word on a separate line using tr: the -s squeezes, the -c says take the complement of the pattern given i.e. anything that's not A-Za-z; so, together, they strip out all characters that don't make up words, and replace each run of same with a newline; this has the effect of putting each word on a separate line.
Then we sort the words, so that duplicates end up on adjacent lines, and feed the result into uniq, which strips out duplicates and, with the -c argument, prints a count of the number of times a duplicate word was found.
We then sort numerically (-n), which gives us a list of words ordered by frequency.
Lastly, we print only the last ten lines of the output. We now have a simple word frequency counter. For any text input, it will output a list of the ten most frequently-used words.
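Wrapped in a shell function, the pipeline can be tried on a sentence typed at the prompt. This is a sketch: wordfreq is a name I've made up, and "$@" stands in for $* so that file names containing spaces survive.

```shell
#!/bin/sh
# wordfreq: print the ten most frequent words in the named files (or stdin).
wordfreq() {
    cat "$@" |
    tr -sc 'A-Za-z' '\012' |
    sort |
    uniq -c |
    sort -n |
    tail
}

# "the" occurs three times and "and" twice, so they come last in the output.
printf 'the cat and the dog and the bird\n' | wordfreq
```

Note that "The" and "the" are counted as different words; putting tr A-Z a-z in front of the first tr would fold them together.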
I hope this article has given you some idea of this power. Working with your Linux box should be both easier and more productive using filters and pipes. Happy Filtering!
Paul Dunne 1999
Copyright © 1995-2007
Paul Dunne,