Useful *nix tools: sed

UPDATE: I wrote a new article on sed on my new blog here.

If you work on *nix systems (UNIX, Linux, Mac OS X, BSD, etc.) and you want to extract information from text files, such as log files of various kinds or the lengthy output of other command line tools, you'll want to learn about sed. Sed stands for stream editor: it is an incredibly powerful non-interactive text editor/processor/filter that takes input via a *nix pipe, transforms it based on command line parameters, and then outputs the result.

Learning how to use sed can save you the effort of writing more complicated programs or scripts, in whatever programming language you're versed in, just to extract information from text files and output. While it may initially take you a little time to get your head around it, you'll most likely have a moment of enlightenment where you wish you had known about sed before… I certainly had!

What makes sed so great, then? It lets you take text input, process it line by line, and use the power of regular expressions to match text and transform it using the matched results. In this article I'll give you a simple example of how you can use sed; if you want to know more about sed after reading it, there's an excellent introduction and tutorial page here which I found invaluable. Note that I'm still quite new at using sed myself, so this may not be the best example of how you should use sed… but it works.

Let’s say we’re dealing with an order entry program that processes files sent in from a remote system. Each of these files can contain multiple orders. For each file that the program processes it outputs a log file with the details of what it has done with the contents of the order file. The output would look something like this:

Start processing file ORDER-001.TXT at 2011-05-28T21:18:02
File has 5 order(s)
Created order 1
Created order 2
Created order 3
Created order 4
Created order 5
Finished processing at 2011-05-28T21:22:10

Now your customer comes along and asks you how long it takes you to process the order files. Using the command line with a combination of grep and sed, you can extract this information from those log files and convert it into tabular format, allowing you to easily import the data into a spreadsheet for the customer (which they often really like). Here's how you could go about that:

cat order-001.log | grep -E '^(Start|File has|Finished).*$' | sed -e 'N' -e 'N' \
  -e 's/Start.*file \([^ ]*\) at \([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)T\([0-9]\{1,2\}:[0-9]\{2\}:[0-9]\{2\}\).*File has \([0-9]*\).*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)T\([0-9]\{1,2\}:[0-9]\{2\}:[0-9]\{2\}\)/\1 \2 \3 \4 \5 \6/'

That may look a little frightening, but it really isn’t that complicated. Let me break it down for you:

The command starts with cat, which simply streams the contents of the file to stdout; this is fed to grep through a *nix pipe. The grep command takes a regular expression that filters out all of the lines I'm not interested in; the only lines left will be those starting with “Start”, “File has” and “Finished” (depending on the version of grep installed on your system, the syntax may differ from what I used here). The resulting output is then fed to sed, which I'll explain in more detail.
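To make that grep stage concrete, here it is run in isolation as a self-contained sketch; the here-document simply stands in for the order-001.log file:

```shell
# Keep only the "Start", "File has" and "Finished" lines;
# the here-document stands in for the order-001.log file.
grep -E '^(Start|File has|Finished).*$' <<'EOF'
Start processing file ORDER-001.TXT at 2011-05-28T21:18:02
File has 5 order(s)
Created order 1
Created order 2
Created order 3
Created order 4
Created order 5
Finished processing at 2011-05-28T21:22:10
EOF
```

The five “Created order” lines are discarded, leaving exactly the three lines sed will combine.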

The -e parameter tells sed to execute a command; in my example I am passing the following commands to sed:

  • N
  • N
  • s/Start.*file \([^ ]*\) at \([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)T\([0-9]\{1,2\}:[0-9]\{2\}:[0-9]\{2\}\).*File has \([0-9]*\).*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)T\([0-9]\{1,2\}:[0-9]\{2\}:[0-9]\{2\}\)/\1 \2 \3 \4 \5 \6/

The N command tells sed to keep the line it just read and append the next input line to it (it's important to know that the newline character is preserved here). By giving the N command twice I am combining my three lines into one pattern space for processing. This example thus depends on the output always having those 3 lines in the proper sequence; you can rarely depend on such a thing in the real world, so you'd have to do smarter processing to reliably get the data you want 😉
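To see N on its own, here's a minimal sketch (assuming GNU sed, where \n in the s command's regex matches the embedded newline): two N commands pull three input lines into one pattern space, and a substitute makes the joined result visible.

```shell
# Two N commands append the next two input lines to the pattern
# space, keeping the embedded newlines; the s command then
# replaces those newlines so all three lines print as one.
printf 'one\ntwo\nthree\n' | sed -e 'N' -e 'N' -e 's/\n/ | /g'
# prints: one | two | three
```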

The third command (s) is the substitute command; it receives its parameters separated by / characters (other delimiters can be used as well, see the earlier mentioned tutorial link for more details on that). The first argument is the regular expression to match, the second argument is what to output. The regular expression may look a bit complex with all those \ characters in there, but in sed's basic regex syntax characters such as ( ) and { } must be escaped with a backslash to act as grouping and repetition operators; without the backslash they are treated as literals. The regex simply captures the 6 pieces of data we're interested in as groups, and the output prints each of these groups after each other, separated by spaces.
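A stripped-down substitute example may help: it uses the same escaped \( \) groups to capture two fields and swap them in the output.

```shell
# Capture a word and a number as groups \1 and \2, then emit
# them in reverse order; \( and \) are escaped because sed's
# basic regex syntax treats bare parentheses as literals.
echo 'order 42' | sed 's/\([a-z]*\) \([0-9]*\)/\2 \1/'
# prints: 42 order
```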

This then results in the following output:

ORDER-001.TXT 2011-05-28 21:18:02 5 2011-05-28 21:22:10

Instead of processing just one file, you can obviously process multiple files with cat (by using cat *.log or something of the sort), generating a large report with ease! Sed really is an awesome tool 🙂
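As a final sketch building on that idea, here's a multi-file variant that emits comma-separated values, which most spreadsheet programs import directly. The *.log glob and the CSV column names are my own assumptions, and it still relies on every log having the three lines in sequence:

```shell
# Emit a CSV header, then one row per processed order file.
# Assumes each .log file contains the Start / File has /
# Finished lines in that order.
echo 'file,start_date,start_time,orders,end_date,end_time'
cat *.log \
  | grep -E '^(Start|File has|Finished).*$' \
  | sed -e 'N' -e 'N' \
        -e 's/Start.*file \([^ ]*\) at \([0-9-]*\)T\([0-9:]*\).*File has \([0-9]*\).*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\)T\([0-9:]*\)/\1,\2,\3,\4,\5,\6/'
```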
