What program to use to compress data?

My 1TB hard drive is slowly filling up. Last week I did a BLAST run on all the microbial genomes in NCBI’s genome database. The output is around 500GB. As I have to keep this data stored for some time, I have to use some compression program. There are basically three choices: gzip, bzip2, and 7z (LZMA). As it turns out, there is no reason to use gzip anymore: its compression ratio is poor next to the other two. The only thing going for gzip is speed. It is fast. Second in speed is bzip2; then comes 7z.
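
A rough way to check the speed and size trade-off on your own data is to time the three algorithms on a sample. The following is only a sketch in Python, using its standard-library codecs rather than the gzip, bzip2 and 7z command-line tools, so absolute numbers will differ from the real programs. The function name compare_codecs and the sample line are made up for illustration, and each codec runs at its library default level.

```python
import bz2
import lzma
import time
import zlib


def compare_codecs(data: bytes) -> dict:
    """Compress `data` with three stdlib codecs; return size and seconds for each."""
    codecs = {
        # zlib implements DEFLATE, the algorithm behind gzip
        "gzip (zlib)": (zlib.compress, zlib.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
        "LZMA": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Round-trip check: the codec must give back exactly what went in.
        assert decompress(packed) == data
        results[name] = {"bytes": len(packed), "seconds": elapsed}
    return results


if __name__ == "__main__":
    # Hypothetical stand-in for tabular BLAST output; swap in a slice of a real file.
    sample = b"query_0001\tsubject_0042\t98.50\t200\t3\t0\t1\t200\t1e-50\t370\n" * 20000
    for name, stats in compare_codecs(sample).items():
        print(f"{name:12s} {stats['bytes']:>10d} bytes  {stats['seconds']:.3f} s")
```

A highly repetitive sample like the one above flatters every compressor, so a few hundred megabytes of the actual output would give a more honest picture.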

One very important consideration for me is being able to read the compressed file directly, with on-the-fly decompression, particularly through the Perl IO::Uncompress modules. It’s very important for me to read the file line by line: I need a file handle. This set of modules has great support for gzip and bzip2 but none for 7z.
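
To make the requirement concrete, here is a minimal sketch of "a file handle that decompresses on the fly", written in Python with its standard gzip and bz2 modules rather than the Perl modules mentioned above. The helper names open_compressed and count_hits are hypothetical, and so is the assumption that the first tab-separated column holds the query ID.

```python
import bz2
import gzip
from pathlib import Path

# Map a file suffix to the stdlib opener that returns a file-like handle.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_compressed(path):
    """Return a text-mode handle that decompresses as it is read."""
    opener = OPENERS[Path(path).suffix]
    return opener(path, "rt")


def count_hits(path, query_id):
    """Stream the file line by line; count lines whose first column is query_id."""
    hits = 0
    with open_compressed(path) as handle:
        for line in handle:
            if line.split("\t", 1)[0] == query_id:
                hits += 1
    return hits
```

The point of the handle is that the loop never holds more than one line in memory, which matters when the uncompressed file is hundreds of gigabytes.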

As I decided to drop gzip from the equation, I gave bzip2 a try. The naive bzip2 is really too slow for my purpose. Then I came across pbzip2. It is a parallel version of bzip2’s block compression algorithm. The speed increase is almost linear with the number of processors. And the best thing is that it is a drop-in replacement for bzip2. I was pretty happy with it, until I tried to read its output using IO::Uncompress::Bunzip2. With its default settings, the module simply can’t read the files created by pbzip2. The only way I found to read such a file is to open a raw pipe handle like this:

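# Read the pbzip2 output through an external bunzip2 process
open(my $fh, '-|', 'bunzip2', '-c', $filename) or die "Cannot start bunzip2: $!";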
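
A likely explanation for the failure is that pbzip2 compresses each block as an independent bzip2 stream and writes the streams back to back, while a reader that expects a single stream stops at the first end-of-stream marker. The sketch below reproduces that effect in Python with the standard bz2 module. It is an illustration under that assumption, not a test against real pbzip2 output, and the function names are made up. On the Perl side, the IO::Uncompress::Bunzip2 documentation describes a MultiStream option that may be worth trying before falling back to a pipe.

```python
import bz2


def make_multistream(chunks):
    """Compress each chunk separately and concatenate the streams.

    Assumption: this mimics the layout of a pbzip2 output file.
    """
    return b"".join(bz2.compress(chunk) for chunk in chunks)


def read_first_stream_only(blob: bytes) -> bytes:
    """Decode with a single-stream decoder, which stops at the first stream's end."""
    decoder = bz2.BZ2Decompressor()
    out = decoder.decompress(blob)
    # decoder.unused_data now holds the remaining, still-compressed streams.
    return out


def read_all_streams(blob: bytes) -> bytes:
    """bz2.decompress accepts multi-stream input and decodes every stream."""
    return bz2.decompress(blob)


if __name__ == "__main__":
    blob = make_multistream([b"line 1\n", b"line 2\n", b"line 3\n"])
    print(read_first_stream_only(blob))
    print(read_all_streams(blob))
```

If the single-stream reader silently returns only the first block, a truncated result is easy to miss, so comparing line counts against bunzip2 -c output is a cheap sanity check.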

As I can’t use these Perl modules on the compressed files and have to use a pipe handle anyway, I decided to give 7z a try. I used it to generate bzip2 files like this:

7z a -tbzip2 -mx=9 archive.bz2 filename

This runs slower than single-threaded bzip2. WTF! It is supposed to run multithreaded and should therefore be faster than naive bzip2. Then I ran

7z a -tbzip2 archive.bz2 filename

This is around two times slower than pbzip2.

I am not sure which program to use. I think I will give LZMA a try. Will let you know.


