I’ve recently learned a couple of neat tricks for processing large numbers of text files more efficiently from my new co-worker @nicolaskruchten. Our use case is going through tens of gigabytes of logs to extract specific lines and run some operation on them. Here are a few things we’ve done to speed things up.
Keep everything gzipped
Often, the bottleneck will be I/O. This is especially true on modern servers that have a lot of cores and RAM. By keeping the files we want to process gzipped, we can use zcat to read the compressed files directly and pipe the output to whichever script we need. This reduces the amount of data that has to be read from disk. For example:
zcat log.gz | grep pattern
If you’re piping into a Python script, you can easily loop over the lines coming in on standard input using the fileinput module, like this:
import fileinput

for line in fileinput.input():
    process(line)
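Save that as, say, extract.py (the name is just for illustration) and it drops straight into the pipeline, since fileinput reads from standard input when no file arguments are given:

zcat log.gz | python extract.py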
Use parallel to take advantage of all available cores
GNU parallel is the coolest utility I’ve discovered recently. It lets you run a command over a list of files in parallel. For example, suppose we have a list of 4 log files (exim_reject_1.gz, exim_reject_2.gz, etc.) and we need to extract the lines that contain gmail.com. We could run a grep on each of those files sequentially, but if our machine has 4 cores, why not run all the greps at once? It can be done like this using parallel:
parallel -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz
Breaking down the previous command: we tell parallel to run the command zcat {} | grep gmail.com with up to 4 jobs at a time (-j4), where {} is substituted with each of the files matching the glob exim_reject*.gz. The commands produced by those substitutions are then run in parallel.
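For comparison, the sequential version is an ordinary shell loop that only ever keeps one core busy:

for f in exim_reject*.gz; do zcat "$f" | grep gmail.com; done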
What’s great about it is that you can also collect the results from all the parallel executions and pipe them into another command; by default, parallel groups the output of each job, so lines from different jobs won’t be interleaved. We could, for example, keep the matching lines in a new compressed file like this:
parallel -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz | gzip > results.txt.gz
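The same pattern works for any per-file command. As a variation (not part of our original pipeline), grep -c gives a per-file count of matches, and parallel’s --tag option prefixes each output line with the file it came from:

parallel --tag -j4 "zcat {} | grep -c gmail.com" ::: exim_reject*.gz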
Use a ramdisk
If you’ll be doing a lot of reading and writing on the same files and have lots of RAM, you should consider using a ramdisk. Doing so can save you a lot of I/O time. On Linux, it is very easy to do. The following command creates an 8GB ramdisk (note that nr_inodes=1k caps it at roughly a thousand files):
sudo mount -t tmpfs -o size=8G,nr_inodes=1k,mode=777 tmpfs /media/ramdisk
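A typical workflow is then to copy the working set into the ramdisk, run the pipeline against those copies, and unmount when done. Keep in mind that tmpfs contents disappear on unmount or reboot, so write any results you want to keep to real disk:

cp exim_reject*.gz /media/ramdisk/
parallel -j4 "zcat {} | grep gmail.com" ::: /media/ramdisk/exim_reject*.gz | gzip > results.txt.gz
sudo umount /media/ramdisk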
In the end…
By using all the tricks above, we were able to considerably improve the overall runtime of our scripts. Well worth the time it took to refactor our initial naive pipeline.