All Articles

Efficient log processing

I’ve recently learned a couple of neat tricks from my new co-worker Nicolas Kruchten for processing large numbers of text files more efficiently. Our use case is going through tens of gigabytes of logs to extract specific lines and perform some operation on them. Here are a couple of things we’ve done to speed things up.

Keep everything gzipped

Often, the bottleneck will be IO. This is especially true on modern servers that have a lot of cores and RAM. By keeping the files we want to process gzipped, we can use zcat to read the compressed files directly and pipe the output to whichever script we need. This reduces the amount of data that needs to be read from disk. For example:

zcat log.gz | grep pattern
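
The same idea applies when a Python script needs to open the compressed files itself rather than receive decompressed lines on a pipe: the standard gzip module reads .gz files directly. A minimal sketch (the file name and pattern here are just illustrative):

import gzip

# Read the gzipped log directly, without spawning zcat
with gzip.open("log.gz", "rt") as f:
    for line in f:
        if "pattern" in line:
            print(line, end="")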

If you’re piping into a Python script, you can easily loop over the lines coming from standard input with the fileinput module, like this:

import fileinput

# With no file names on the command line, fileinput reads from standard input
for line in fileinput.input():
    process(line)  # process() stands for whatever per-line work you need to do

Use parallel to take advantage of all available cores

GNU parallel is the coolest utility I’ve discovered recently. It lets you run a command over a list of files in parallel. For example, suppose we have 4 log files (exim_reject_1.gz, exim_reject_2.gz, etc.) and we need to extract the lines that contain gmail.com. We could grep each of those files sequentially, but if our machine has 4 cores, why not run all the greps at once? With parallel, it can be done like this:

parallel -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz

Breaking down the previous command: we tell parallel to run, 4 jobs at a time, the command zcat {} | grep gmail.com, where {} is substituted with each of the files matching the glob exim_reject*.gz. The resulting commands are then run in parallel.

What’s great about it is that you can also collect the results from all the parallel executions and pipe them into another command. We could, for example, keep the resulting lines in a new compressed file like this:

parallel -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz | gzip > results.txt.gz
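
If you would rather stay inside Python than shell out to GNU parallel, the multiprocessing module gives a similar effect. Below is a minimal sketch under the same assumptions (exim_reject*.gz files, a plain substring match); the names are only illustrative:

import glob
import gzip
from multiprocessing import Pool

PATTERN = "gmail.com"

def matching_lines(path):
    # Read one gzipped log and return the lines that contain the pattern.
    with gzip.open(path, "rt") as f:
        return [line for line in f if PATTERN in line]

if __name__ == "__main__":
    files = glob.glob("exim_reject*.gz")
    with Pool(processes=4) as pool:  # roughly the equivalent of parallel -j4
        for lines in pool.map(matching_lines, files):
            for line in lines:
                print(line, end="")

One caveat: each worker returns all of its matches at once, so this sketch suits filters that keep only a small fraction of the lines.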

Use a ramdisk

If you’ll be doing a lot of reading and writing of the same files and have lots of RAM, you should consider using a ramdisk. Doing so can save you a lot of IO time. On Linux, it is very easy to do. The following command creates an 8GB ramdisk:

sudo mount -t tmpfs -o size=8G,nr_inodes=1k,mode=777 tmpfs /media/ramdisk
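
Once mounted, you simply point your intermediate files at that path. A small sketch of the idea, assuming the /media/ramdisk mount point above (the helper name is made up):

import gzip
import os
import tempfile

RAMDISK = "/media/ramdisk"  # the tmpfs mount point created above

def write_intermediate(lines):
    # Write already-filtered lines to a compressed temp file on the ramdisk,
    # so later passes re-read them from memory instead of the disk.
    fd, path = tempfile.mkstemp(suffix=".gz", dir=RAMDISK)
    os.close(fd)
    with gzip.open(path, "wt") as f:
        for line in lines:
            f.write(line)
    return path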

In the end…

By using all the tricks above, we were able to considerably improve the overall runtime of our scripts. Well worth the time it took to refactor our initial naive pipeline.

Published Feb 9, 2011

I am a computer scientist specializing in building machine learning powered products. I’m currently a machine learning developer at Local Logic.

3 thoughts on "Efficient log processing"

Made of String » Quicker ways of processing log files

Commented 2012-01-29 16:29:13

[...] Maillet has some ideas on efficient log processing, and suggests gzipping your log files to avoid having to wait for reading from disk. (See [...]

Marc

Commented 2012-02-23 10:02:16

There are also other "z" utilities similar to the zcat command you mentioned. Those commands make it very easy to work with compressed log files: zcat, zless, zmore, zgrep, zegrep, zdiff, zcmp.

Source: http://www.thegeekstuff.com/2009/05/zcat-zless-zgrep-zdiff-zcmp-zmore-gzip-file-operations-on-the-compressed-files/

Cédrik

Commented 2012-05-31 06:42:51

When you have huge files to compress, I also recommend replacing gzip with pigz (http://zlib.net/pigz/), which will use all available cores to speed up the compression.