Category Archives: Linux

Efficient log processing

I’ve recently learned a couple of neat tricks from my new co-worker @nicolaskruchten for processing large numbers of text files more efficiently. Our use case is going through tens of gigabytes of logs to extract specific lines and run some operation on them. Here are a few of the things we’ve done to speed that up.

Keep everything gzipped

Often, the bottleneck will be I/O. This is especially true on modern servers that have a lot of cores and RAM. By keeping the files we want to process gzipped, we can use zcat to read the compressed files directly and pipe the output to whichever script we need. This reduces the amount of data that has to be read from the disk. For example:

zcat log.gz | grep pattern

If you’re piping into a Python script, you can easily loop over the lines coming from standard input with the fileinput module, like this:

import fileinput

# with no file name arguments, fileinput.input() reads from standard input
for line in fileinput.input():
    process(line)
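
Putting the two together, the compressed log never has to be fully decompressed on disk; process_lines.py here is just a placeholder for whatever file the snippet above is saved as:

zcat log.gz | python process_lines.py | gzip > processed.gz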

Use parallel to take advantage of all available cores

GNU parallel is the coolest utility I’ve discovered recently. It lets you run a script that needs to act on a list of files in parallel. For example, suppose we have a list of 4 log files (exim_reject_1.gz, exim_reject_2.gz, etc.) and we need to extract the lines that contain gmail.com. We could run a grep on each of those files sequentially, but if our machine has 4 cores, why not run all the greps at once? It can be done like this using parallel:

parallel -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz

Breaking down the previous command: we tell parallel to run, using 4 cores, the command zcat {} | grep gmail.com, where {} is substituted with each of the files matching the glob exim_reject*.gz. Each command resulting from that substitution is run in parallel.
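
If you want to see exactly which commands parallel is going to generate before running anything, the --dry-run flag prints them instead of executing them:

parallel --dry-run -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz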

What’s great about it is that you can also collect all the output from the parallel executions and pipe it into another command. We could, for example, keep the resulting lines in a new compressed file like this:

parallel -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz | gzip > results.txt.gz
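
Since the combined output is just a regular stream, anything that reads from standard input can go at the end of the pipeline. For instance, a quick total of the matching lines across all files (same hypothetical file names as above):

parallel -j4 "zcat {} | grep gmail.com" ::: exim_reject*.gz | wc -l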

Use a ramdisk

If you’ll be doing a lot of reading and writing to the disk on the same files and have lots of RAM, you should consider using a ramdisk. Doing so can save you a lot of I/O time. On Linux, it is very easy to do. The following command creates an 8 GB ramdisk:

sudo mount -t tmpfs -o size=8G,nr_inodes=1k,mode=777 tmpfs /media/ramdisk
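
A couple of practical notes, as a rough sketch: the mount point has to exist before running the mount command above, and unmounting the ramdisk is what gives the memory back when you’re done (the file names reuse the hypothetical logs from earlier):

sudo mkdir -p /media/ramdisk
# mount as shown above, then work from the ramdisk instead of the disk
cp exim_reject*.gz /media/ramdisk/
sudo umount /media/ramdisk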

In the end…

By using all the tricks above, we were able to considerably improve the overall runtime of our scripts. Well worth the time it took to refactor our initial naive pipeline.

pfSense: a software alternative to your old router/firewall

My old D-Link router, like pretty much every other router I’ve ever owned, wasn’t very reliable, so I went looking for open-source alternative firmwares like Tomato to flash it with. Given the clear lack of effort put into the official firmware, I figured it couldn’t hurt to try. Unfortunately, my router wasn’t supported by any third-party firmware.

During my search, however, I stumbled upon pfSense, a FreeBSD-based router/firewall distro. It’s small (<100 MB), runs on a 100 MHz PC and includes all the features you would get on a very expensive commercial router (firewall, NAT, VPN server, usage graphs, dynamic DNS support, per-IP bandwidth usage, QoS, etc.).

Throughput on WAN interface

I already had a dedicated file server, so I installed pfSense as a VM on it using VMware (I could also have done it with VirtualBox, a free alternative to VMware). All you need is two NICs. I now only use my old router as a wireless access point, since pfSense naturally has a DHCP server. I could even let go of my D-Link router completely if I added a wireless NIC to my server.

If you have an old PC lying around, or one that could host a pfSense VM, all you might need is an extra NIC to get an enterprise-grade router that will cooperate a lot more than any cheap $50 D-Link/Linksys/Netgear/etc. router.

Disabling recursive queries on DNS servers

By default, cPanel doesn’t disable recursive queries on your DNS server. This can, I believe, open the door to abuse: an open resolver can be used by third parties, for example in DNS amplification attacks.

To be on the safe side, just edit the /etc/named.conf file and add the following lines, where ip1, ip2, etc. are replaced with the actual IPs of your server:

// added: http://forums.cpanel.net/showpost.php?p=217540&postcount=27
acl "trusted" {
	ip1;
	ip2;
	127.0.0.1;
};
 
options {
	// following from http://forums.cpanel.net/showpost.php?p=217540&postcount=27
	version "not currently available";
	allow-recursion { trusted; };
	allow-notify { trusted; };
	allow-transfer { trusted; };
};
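
After editing the file, BIND has to reload its configuration for the change to take effect. A quick way to verify it, assuming rndc is set up (it normally is on a cPanel box) and that you test from an IP that isn’t in the trusted ACL, is to ask the server for a recursive lookup of a zone it isn’t authoritative for; it should now be refused (your.server.ip below is a placeholder):

rndc reload
dig @your.server.ip www.google.com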