Filters

Introduction — What is a Filter?

This article is about filters, a very powerful facility available to every Linux user, but one that migrants from other operating systems may find new and unusual.

So what is a filter? At the most basic level, a filter is a program that accepts input, transforms it, and outputs the transformed data. The idea of the filter is closely associated with several ideas that are central to the Unix operating system: standard input and output, input/output redirection, and pipes.

Standard input and output refer to default places from which a program will take input and to which it will write output respectively. The standard input (STDIN) for a program running interactively at the command line is the keyboard; the standard output (STDOUT), the terminal screen.

With input/output redirection, a program can take input or send output someplace other than standard input or output — to a file, for instance. Redirection of STDIN is accomplished using the < symbol, redirection of STDOUT by >. For example,

$ ls > list

redirects the output of the ls command, which would normally go to the screen, into a file called list. Similarly,

$ cat < list

redirects the input for cat, which in the absence of a file name would be expected from the keyboard, to come from the file list, so the contents of that file are printed on-screen.

Pipes are a means of connecting programs together through I/O redirection. The symbol for pipe is |. For example,

$ ls | less

is a common way of comfortably viewing a directory listing where there are more files than will fit on one screen.
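To see all three mechanisms side by side, here is a minimal sketch. It assumes a scratch directory made with mktemp, and the file names alpha, beta and gamma are invented for the demonstration.

```shell
# Work in a throwaway directory so nothing real is touched.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch alpha beta gamma

ls > list                 # STDOUT of ls goes into the file "list"
from_file=$(cat < list)   # STDIN of cat comes from the file "list"
count=$(ls | wc -l)       # pipe: ls writes, wc reads

echo "$from_file"
echo "entries: $count"
```

Note that list shows up in its own contents: the shell creates the file before ls starts running.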

In what follows, I'll be looking at how simple programs provided as standard with your Linux system can be enhanced by being used as filters for other similar programs. I'll also show how simple programs of your own can be built to meet your own custom filtering needs.

One program I don't look at in this article is perl. This is for two reasons: firstly, perl is a programming language in its own right, and filters are, of course, language-independent; and secondly, I don't much like perl. Personal preference, of course, but then, who's writing this article anyway?!

Some simple little filters

grep

The program grep seems a good place to begin. Its name comes from the ed editor command g/re/p: globally search for a regular expression and print. The principle of grep is very simple: search the input for a pattern, and output every line that matches it.

Here's an example:

$ grep 'Linus Torvalds' *

This searches all the files in the current directory for the name of the great man himself.

Various command line switches can modify grep's behaviour. For example, if we aren't sure about case, we can write

$ grep -y 'linus torvalds'

The -y switch tells grep to match without considering case. If you use upper-case letters in the pattern, however, they will still match only upper-case. (This is broken in GNU grep, which simply ignores case when given the -y switch; that's what the -i switch is for.)

Given even this much grep, it's easy to construct a practical application. Store name and address details in a file, and voilà! A searchable address book.

$ grep -y [search arg] /usr/lib/company/phone-book

Put that in a text file, make it executable with

$ chmod +x filename

and there you go.
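As a rough sketch of how such an address book might behave, the following uses a temporary file in place of /usr/lib/company/phone-book and a shell function in place of the executable script. The function name phone, the entries and the numbers are all made up, and I use -i rather than -y since that is the switch GNU grep documents for ignoring case.

```shell
# Hypothetical stand-in for the phone-book file.
book=$(mktemp)
cat > "$book" <<'EOF'
Linus Torvalds  555-0100
Ken Thompson    555-0101
Dennis Ritchie  555-0102
EOF

# phone: print every entry matching the argument, ignoring case.
phone() {
    grep -i -- "$1" "$book"
}

phone thompson
phone RITCHIE
```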

A Note on Regular Expressions

A regular expression, "regexp" for short, is a formidable phrase denoting something which is conceptually very simple. A regular expression is an expression that denotes a pattern, coded in a small language designed for just that. The basic special symbols are as follows:

.      any one character
*      zero or more of the preceding character
^      beginning of line
$      end of line
[a-z]  a set of characters; [a-z] is the whole lower-case alphabet
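A few of these symbols in action, as a small sketch. The word list and the helper function match are my own inventions for the demonstration; match just prints the matching words on one line.

```shell
# Sample input: one word per line.
words=$(printf '%s\n' cat cart coat ct dog)

# match: show which of the sample words a pattern selects.
match() {
    printf '%s\n' "$words" | grep "$1" | paste -sd ' ' -
}

match 'c.t'          # cat              (. is exactly one character)
match 'ca*t'         # cat ct           (zero or more a's)
match 'c.*t'         # cat cart coat ct (anything between c and t)
match '^co'          # coat             (co at the beginning of the line)
match 'g$'           # dog              (g at the end of the line)
match '^c[a-z]t$'    # cat              (one lower-case letter between c and t)
```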

Extended Grep

Sometimes, basic grep won't do. For instance, suppose we want to find all the occurrences of a text string which could possibly be a reference to Linus. Clearly, searching for 'Linus Torvalds' is not enough: that won't find just Linus, or Torvalds on its own. We need some way of saying, "This OR this OR this". Here is where egrep, "extended grep", comes in. This handy program extends standard grep to provide just such a syntax.

$ egrep 'Linus Torvalds|L\. Torvalds|L\. T\.|Mr\. Torvalds'

will now find most ways of naming the inventor of Linux. Note the backslashes to escape the full stop — since that is a special character in regular expressions, when we want to use it as itself, as here, we must tell egrep not to interpret it as a magic character.
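To see why the backslashes matter, here is a small sketch run against a made-up sample file. I write grep -E, which is the spelling of egrep that current systems document; the sample lines are invented.

```shell
names=$(mktemp)
cat > "$names" <<'EOF'
Linus Torvalds wrote the kernel.
Thanks to L. Torvalds for the patch.
Mr. Torvalds replied on the list.
Mrs Torvaldsen is someone else.
Lx Torvalds has no full stop.
EOF

# Escaped full stops: only the three genuine references match.
hits=$(grep -E 'Linus Torvalds|L\. Torvalds|L\. T\.|Mr\. Torvalds' "$names")
echo "$hits"

# Unescaped full stop: "Lx Torvalds" now matches as well.
loose=$(grep -c 'L. Torvalds' "$names")
echo "lines matching the unescaped pattern: $loose"
```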

tr

tr is perhaps the epitome of filters. Short for translate, tr basically does what its full name suggests: it changes a given character or set of characters to another character or set of characters. This is done by mapping input characters to output characters. An example will make this clear:

$ tr A-Z a-z

Changes uppercase letters to lower-case. A-Z is shorthand for all the letters from A to Z.

A more complicated filter applies the old cipher, rot13. Each letter of the alphabet is changed to the letter 13 characters ahead in the alphabetic sequence. Letters 14-26 are wrapped around.

tr '[a-m][n-z][A-M][N-Z]' '[n-z][a-m][N-Z][A-M]'
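A quick way to convince yourself that a rot13 filter works is to run text through it twice and get the original back. This sketch wraps tr in a function I have called rot13, and uses the bracket-free range form of the same mapping.

```shell
# rot13: rotate every letter 13 places, leaving everything else alone.
rot13() {
    tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

secret=$(echo 'Hello, Linux' | rot13)
echo "$secret"             # Uryyb, Yvahk
echo "$secret" | rot13     # Hello, Linux
```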

Sorting with sort

Sorting is a very basic computer operation. It is very commonly used on text, to get lists in alphabetical order, or to sort a numbered list. Linux has a powerful filter for sorting, called, logically enough, sort(1).
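The difference between alphabetical and numeric sorting trips up most people once, so here is a short sketch with made-up input. The -n and -r switches are standard.

```shell
printf '%s\n' pear apple fig | sort    # apple, fig, pear
printf '%s\n' 10 9 100 | sort          # 10, 100, 9 (compared as text)
printf '%s\n' 10 9 100 | sort -n       # 9, 10, 100 (compared as numbers)
printf '%s\n' 10 9 100 | sort -nr      # 100, 10, 9 (numeric, reversed)
```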

head & tail

Two very simple filters with a surprising variety of uses. As their names suggest, head(1) shows the head of a file, while tail(1) shows the end. By default, both show the first or last ten lines respectively; but tail in particular has a number of other useful options.
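A few typical uses, sketched with seq generating the numbers 1 to 20 as stand-in input (seq is not part of every Unix, but GNU/Linux systems have it).

```shell
seq 1 20 | head                      # lines 1 to 10, the default
seq 1 20 | tail -n 3                 # lines 18 to 20
seq 1 20 | head -n 15 | tail -n 5    # a slice from the middle: lines 11 to 15
seq 1 20 | tail -n +18               # everything from line 18 onwards
```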

Programmable Filters

Sometimes, we need to do something a bit more complex than simple command lines of the tr type allow. For that, we need something I'll call a programmable filter, that is, a filter with a scripting language that allows us to specify complex operations.

sed

Sed, the stream editor, is a filter typically used to operate on lines of text as an alternative to using an interactive editor. There are times when firing up vi or whatever and making the change, whether manually or using vi/ex commands, is not appropriate. What, for instance, if you have the same changes to make to fifty files? What if you need to change a string, but are not sure exactly which files it occurs in?

As is common in the Unix world, where tools are often duplicated in different ways, sed can do most things that grep does. Here is a simple grep in sed:

sed -n '/Linus Torvalds/p'

All this does is read standard input, and print only those lines containing the string Linus Torvalds.

The default with sed, as with any filter, is to pass standard input to standard output, unchanged. To make it do anything useful, you give it instructions. In our first example, we searched for the string by enclosing it in "//", and told sed to print any line with that string in it with the 'p' command. The -n switch to sed made sure that it did not print any lines which did not match the pattern. Remember, the default behaviour is to print everything.

If this was all sed could do, we would be better off sticking with grep. Sed's forte is as a "stream editor", used for changing text files according to rules you supply.

Let's take a simple example.

$ sed 's/Torvuls/Torvalds/g'

This uses the sed "substitute" - 's' - command, and applies it globally - 'g'. It looks for every occurrence of "Torvuls", and changes it to "Torvalds". Without the g flag at the end, it would change only the first occurrence of "Torvuls" on each line.
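The effect of the g flag is easiest to see on a line with several misspellings. A small sketch, with an input line invented for the purpose:

```shell
line='Torvuls, Torvuls and Torvuls'

echo "$line" | sed 's/Torvuls/Torvalds/'     # Torvalds, Torvuls and Torvuls
echo "$line" | sed 's/Torvuls/Torvalds/g'    # Torvalds, Torvalds and Torvalds
```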

sed '/^From /,/^$/d'

This searches the standard input for the word From at the beginning of a line, followed by a space, and deletes all the lines from the line containing that pattern up to and including the first blank line, which is represented by "^$" i.e. a beginning of line - "^" - followed immediately by an end of line - "$". In plain English, it strips out the header from a Usenet posting that you have saved in a file.
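Here is that command tried out on a made-up saved posting; the header fields and body text are invented. One caveat worth knowing: the range starts again at any later line that begins with "From ", so a body line of that shape would also be deleted.

```shell
post=$(mktemp)
cat > "$post" <<'EOF'
From paul Mon Jan  4 10:00:00 1999
Subject: filters
Newsgroups: comp.os.linux.misc

Filters are great.
See you on the list.
EOF

# Delete from the "From " line up to and including the first blank line.
body=$(sed '/^From /,/^$/d' "$post")
echo "$body"
```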

To double-space a text file takes just one command:

#!/usr/bin/sed -f
G

According to our manual page, all that does is to append the contents of the hold space to the current text buffer. Eh? Well, in plain English, for each line, we output the contents of a buffer that sed uses to store text. Since we haven't put anything in there, it's empty. But, in sed, appending this buffer adds a newline, regardless of whether there is anything in the buffer. So, the effect is to add an extra newline to each line, thus double-spacing the output.
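The same command works straight from the command line, which makes it easy to check. A tiny sketch with two lines of made-up input:

```shell
# Two lines in, four lines out: each line is followed by an empty one.
doubled=$(printf 'one\ntwo\n' | sed G | wc -l)
echo "lines after double-spacing: $doubled"

printf 'one\ntwo\n' | sed G
```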

Sed is very handy for converting from one file format to another. The most common case where this is needed is porting text files to and from MS-DOS and Mac formats.

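dostounix.sed
#!/usr/bin/sed -f
s/^M//
/^$/d
s/^Z//

mactounix.sed
#!/usr/bin/sed -f
s/^M/\
/g

unixtodos.sed
#!/usr/bin/sed -f
s/$/^M/
$a\
^Z

unixtomac.sed
#!/usr/bin/sed -f
s/\
/^M/

(In these listings ^M stands for a literal carriage return, typed as Ctrl-V Ctrl-M in vi, and ^Z for a literal Ctrl-Z.)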
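Control characters are awkward to show in print, so here is an alternative sketch of the same four conversions that needs none typed literally. It swaps in tr and awk, which understand the \r and \n escapes; the function names are my own, and it does not deal with any DOS end-of-file marker.

```shell
# DOS -> Unix: drop the carriage return from each CR LF pair.
dos_to_unix() { tr -d '\r'; }

# Old Mac -> Unix: lines end in a lone CR, so turn each one into LF.
mac_to_unix() { tr '\r' '\n'; }

# Unix -> old Mac: the reverse mapping.
unix_to_mac() { tr '\n' '\r'; }

# Unix -> DOS: print each line followed by CR LF.
unix_to_dos() { awk '{ printf "%s\r\n", $0 }'; }

printf 'a\r\nb\r\n' | dos_to_unix
printf 'a\nb\n' | unix_to_dos | od -c
```

od -c is used in the last line only to make the otherwise invisible carriage returns visible.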

Now, something more complex. I publish my config file for vi, .exrc, on the web. I want it to look nice in people's browsers, not to be just a big hunk of text. So, I run it through sed to turn it into a simple HTML document.

#!/bin/sed -f
#filter-exrc: turn .exrc into html
1i\
<html>\
<head>\
<title>Paul Dunne's .exrc file<\/title>\
<\/head>\
<body>\
<pre>\
<code>
$a\
<\/code>\
<\/pre>\
<\/body>\
<\/html>

This matches the first line (1) and inserts (i) the given text up to a newline (this allows us to specify multiple lines of text to insert by escaping newlines in the text, as shown). Then, we wait until the end of the file ($) before appending (a) some additional lines of HTML. Remember, meantime, although we gave no instructions save for first line and last line, sed has been sending all the lines of the .exrc file to standard output. So, our outcome is the original .exrc file bracketed with extra lines that make it an HTML document.

awk

Another very useful filter is the AWK programming language. The name AWK comes from the initials of Aho, Weinberger, and Kernighan, the three guys who wrote it. Despite this weird name, it is an everyday tool.

To start with, let's look again at yet another way to do a grep. Fast becoming traditional, this; perhaps we should call it YAG (Yet Another Grep)?

$ awk '/Linus Torvalds/'

Like grep and sed, awk can search for text patterns. As with sed, each pattern can have an action associated with it. If no action is supplied, as in the above example, then the default action is to print each line where the pattern is matched. Alternatively, if no pattern is supplied, then the action is applied to every line.

centre lines
#!/usr/bin/awk -f
#centre: centre lines in file(s) or stdin
#usage: centre [filenames]
BEGIN {
    linelength = 80
}
{
    # reset the padding for each line, otherwise it accumulates
    spaces = ""
    for (i = 1; i < (linelength - length($0)) / 2; i++)
        spaces = spaces " "
    print spaces $0
}

Now, of course, this isn't the only filter for centering text. We could write it in sed thus:

sed -n '
# remove leading and trailing blanks
s/^[ ]*\(.*[^ ]\).*$/\1/
# append 80 spaces
s/$/                                                                                /
# chop character 80 onwards
s/^\(.\{80\}\).*/\1/
# prefix string with half the trailing spaces
s/^\(.*[^ ]\)\( *\)\(\2\)/\2\1/
p
'

This raises the important and sometimes overlooked point of choosing the right tool for the filtering job. Obviously, the awk filter above is rather easier to understand than the sed version.

Awk's strength is in its ability to treat data as tabular, that is, as arranged in rows and columns. With each input line, awk automatically splits it into fields. The default field separator is "whitespace", i.e. blanks and tabs, but it can be changed to any character you want. Now, many Unix utilities produce this sort of tabular output. In our next section, we'll see how this tabular format can be sent as input to awk using a shell construction which we haven't seen yet.
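A short sketch of field splitting, first with the default whitespace separator and then with a colon chosen through the -F switch. The sample rows are invented; the second set merely imitates the layout of a password file.

```shell
# Default separator: runs of blanks and tabs.
printf 'alice 30 london\nbob   25 paris\n' | awk '{ print $1, $3 }'
# prints: alice london
#         bob paris

# Colon-separated data, as in /etc/passwd.
printf 'root:x:0:0\npaul:x:1000:1000\n' | awk -F: '{ print $1, $3 }'
# prints: root 0
#         paul 1000

# NF holds the number of fields on the current line.
echo 'one two three' | awk '{ print NF }'    # 3
```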

Pipes: when one filter isn't enough

The basic principle of the pipe (|) is that it acts as a junction, connecting the standard output of one program to the standard input of another. A moment's thought should make the usefulness of this, when combined with filters, quite obvious. We can build quite complex programs, on the command line or in a shell script, simply by stringing filters together.

If we look at the humble filter wc, we see that its default output for a named file is in four columns. An alternative to specifying the '-c' switch, to count only characters, would be

$ wc lj.filters | awk '{ print $3 }'

Taking the whole output of wc, e.g.

258 1558 8921 lj.filters

and filtering it to get just what we require, in this case, the third column, the character count:

8921

If we want to print the whole input line, that's simply $0.

Another handy filtering pipe is this one. We know we can see all the hidden files (names beginning with .) using ls -a, but how do we see JUST the hidden files? A simple filtering of ls -a output makes it easy.

$ ls -a | grep '^[.].*'

ls output often needs filtering.

See what programs I've been working on recently:

$ ls -tr ~/bin | tail -80 | 3

Where 3 is

#!/bin/sh
pr -`basename $0` -t -l1 $*

A digression, this last shell script, since it is neither a filter nor a pipe.

Earlier, we used sed to convert text file formats. Many other file format convertors can be produced in the same way. As an example, consider this primitive convertor for changing MS Word files into something readable:

$ tr -dc 'A-Za-z0-9 .,:;!&"\t\r\n-' | tr '\r' '\n'

Note that this is a very rough-cut filter, that simply strips away all the non-text from a Word file. It is sufficient for my needs; but could be vastly improved.

(Note that this doesn't work with bash, at least with version 1.14.7. You'll need a real /bin/sh — /bin/ash as supplied with Slackware will do the trick).

Of course, pipes greatly increase the power of programmable filters such as sed and awk. Here's a little script to calculate the last Friday of any given month.

#!/bin/sh
/usr/bin/cal $1 $2 |
awk '{ lasta = a; a = $6; if (a == "") a = lasta } END { print a }'

It is often useful to store data in simple ASCII tables; and awk is a great tool for manipulating such data. As a simple example, consider this weights and measures convertor. We have a simple text file of conversions:

From To Rate
---- -- ----
kg lb 2.20
lb kg 0.4536
st lb 14
lb st 0.07
kg st 0.15
st kg 6.35
in cm 2.54
cm in 0.394

Our script reads a weight, the unit that weight is in, and the unit we wish to convert to, and gives us the result.

$ weightconv 100 kg lb
220
$

#!/bin/sh
#weightconv: weights & measures converter
table=/usr/local/lib/weights_and_measures
case $# in
0|1) echo "weightconv: usage weightconv amount from [to]" 1>&2; exit 1;;
esac
amount=$1 from=$2 to=$3
rate=`grep "^$from $to" $table | awk '{print $3}'`
case $rate in
"") echo "weightconv: no rate found for $from to $to" 1>&2; exit 2;;
esac
echo $amount $rate | awk '{print $1*$2}'

Power Filters

The classic example of what one might call filtered pipelines is from the book The Unix Programming Environment, which is worth reproducing here (not least in an effort to encourage the reader to go out and get a copy of this most excellent work!):

    cat $* |
    tr -sc A-Za-z '\012' |
    sort |
    uniq -c |
    sort -n |
    tail

So what does that do? Well, let's take it line by line.

Firstly, we concatenate all the input together into one, using cat $*, since this command will of course be run from a shell script.

Next, we put each word on a separate line using tr: the -s squeezes, the -c says take the complement of the pattern given, i.e. anything that's not A-Za-z; so, together, they strip out all characters that don't make up words, and replace each run of same with a newline; this has the effect of putting each word on a separate line.

Then we sort the words, so that duplicates sit next to each other, and feed the result into uniq, which strips out duplicates and, with the -c argument, prints a count of the number of times each word was found.

We then sort numerically (-n), which gives us a list of words ordered by frequency.

Lastly, we print only the last ten lines of the output. We now have a simple word frequency counter. For any text input, it will output a list of the ten most frequently-used words.
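To watch the pipeline at work on something small, here is a sketch that wraps it in a function and feeds it one sentence. The function name wordfreq, its optional argument for the number of lines to keep, and the sample text are my own additions, and '\n' stands in for '\012'.

```shell
# wordfreq: print the N most frequent words on standard input (default 10).
wordfreq() {
    tr -sc 'A-Za-z' '\n' |
    sort |
    uniq -c |
    sort -n |
    tail -n "${1:-10}"
}

text='the cat and the dog and the bird'
echo "$text" | wordfreq 2
```

With this input the last two lines are the counts for "and" (2) and "the" (3), the most frequent word coming last.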

Conclusion

The combination of filters and pipes is very powerful, because it allows you a) to break down tasks and b) to pick the best tool for tackling each task. Many jobs that would have to be handled in a programming language (Perl, for example) in another computing environment can be done under Linux by stringing together a few simple filters on the command line. Even when a programming language must be used for a particularly complicated filter, you are still saving a lot of development effort through doing as much as possible using existing tools.

I hope this article has given you some idea of this power. Working with your Linux box should be both easier and more productive using filters and pipes. Happy Filtering!

Paul Dunne 1999





Copyright © 1995-2007 Paul Dunne,
