My 9-5 gig revolves primarily around Tibco LogLogic (I’ll write it as Log Logic going forward, as I write in English, not C++), which is a centralized logging product. The appliances collect logs from a variety of dissimilar systems and present you with a unified, web-based interface to search them. When something goes wrong, having all of the logs in one place is invaluable for figuring it out.
That value comes at a price. I don’t know exactly what these appliances cost, but generally speaking, $100,000 is a good starting point for an estimate. So what if I told you that you could store 45% more data on these expensive appliances, and increase their performance very modestly (2-5 percent) in the process? Read on.
Log Logic compresses incoming data with garden-variety GNU Gzip, then encrypts it. I haven’t investigated which encryption algorithm it uses, as I haven’t had any need. Logs compress extremely well, and most large companies produce a lot of them, so compression is necessary to hold any meaningful quantity of them.
What many people don’t know is that Gzip is tunable. Its compression levels run from -1 (fastest, largest output) to -9 (slowest, smallest output), so you can trade compression ratio for speed, or vice versa. And what I found when I extracted an archive with Tibco’s llunzip tool and then recompressed it at every available level is that Log Logic opts for all the speed it can get, while settling for the worst compression ratio.
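If you want to see the speed-versus-size trade-off for yourself, a minimal sketch like the following works on any plain-text log file. It uses stock gzip rather than Tibco's tools, and the function name gzip_levels is my own invention, so treat it as an illustration rather than anything Log Logic ships.

```shell
#!/bin/sh
# gzip_levels FILE
# Print "level size-in-bytes" for each gzip level from 1 to 9.
gzip_levels() {
    for level in 1 2 3 4 5 6 7 8 9; do
        # -c writes to stdout, so the original file is left untouched.
        # The arithmetic expansion strips any padding wc adds.
        size=$(( $(gzip "-$level" -c "$1" | wc -c) ))
        echo "$level $size"
    done
}
```

Run it as gzip_levels /path/to/some.log and compare the first and last lines. Prefix the gzip call with time if you also want to see what each level costs in CPU.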
Presumably this is a holdover from the days when a 2 GHz Pentium 4 was as good as you could get. But with today’s CPUs, you can use maximum compression and the task uses less than 10% of the available CPU power. Disk I/O is a much bigger bottleneck now. And, in fact, when you recompress the data, you actually gain a bit of performance on the back end precisely because the system can pull smaller files off the disks faster.
I asked Tibco if this is tunable, but they said it is not.
And recompressing is worth doing: when I ran llzip on the raw data using -9, file sizes dropped 45-50 percent. That’s a big difference. Consider that these appliances have about 5-6 TB of usable space on them to store data. Being able to store 45% more data on 5-6 TB is significant.
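A quick way to sanity-check the capacity figure: a file that shrinks by some percentage frees room for more than that percentage of additional data, so 45 percent more is, if anything, on the conservative side. Here is a small sketch of the arithmetic. The function name extra_capacity_pct is mine, and it assumes the size drop holds evenly across all of your logs, which yours may not.

```shell
#!/bin/sh
# extra_capacity_pct REDUCTION
# Given a percentage drop in file size, print the percentage of additional
# data that fits in the same space: 100 * r / (100 - r), rounded to nearest.
extra_capacity_pct() {
    r=$1
    remaining=$((100 - r))
    echo $(( (100 * r + remaining / 2) / remaining ))
}

extra_capacity_pct 45   # prints 82
extra_capacity_pct 50   # prints 100
```

In other words, files that are 45-50 percent smaller let the same disks hold very roughly 80-100 percent more log data.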
The performance gain is very modest, along the lines of 2-5 percent. What’s more important is that there’s no penalty for gaining that capacity.
I won’t give away all of my scripts, since I developed them on company time, but I’ll get you started. Log Logic stores its logs in a directory structure with the format /loglogic/data/vol1/yyyy/mm/dd/hhhh. Inside each of these directories is a large number (it can be hundreds) of .txt.gz files. Use the following command to extract an hour’s worth:
Recompressing them isn’t just a matter of adding -9, however. You also need to pass --suffix=.gz to keep the format the way the web frontend expects it:
llzip -9 --suffix=.gz /loglogic/data/vol1/2014/01/01/0000/rawdata*.txt
Recompressing a large amount of existing data is best done with a script, or a series of scripts. An ideal setup would recompress each day’s logs after regular business hours and work through the backlog of existing data during off hours. None of this is especially difficult for a good Unix administrator to build.
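As a starting point, here is a rough sketch of the "loop over a day" part. It only prints the llzip command for each hour directory instead of running it, so you can review the plan first. The function name plan_day and the LL_ROOT variable are hypothetical names of my own. It also assumes each hour's files have already been extracted to .txt, and that the path contains no spaces.

```shell
#!/bin/sh
# plan_day DAY   (DAY is yyyy/mm/dd)
# Print one llzip command per hour directory found under that day.
# LL_ROOT is a hypothetical override; it defaults to the layout described above.
plan_day() {
    root=${LL_ROOT:-/loglogic/data/vol1}
    for hour_dir in "$root/$1"/*/; do
        # Skip the unexpanded pattern when the day has no hour directories.
        [ -d "$hour_dir" ] || continue
        # The wildcard is printed literally; it expands when the line is run.
        echo "llzip -9 --suffix=.gz ${hour_dir}rawdata*.txt"
    done
}
```

Running plan_day 2014/01/01 lists the commands for that day. Once you are happy with the output, pipe it to sh, and schedule the whole thing from cron for after hours.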