HDF5 for Python
Copyright (c) 2012 Andrew Collette

LZF Compression Filter for HDF5

The LZF filter is a stand-alone compression filter for HDF5, which can be used in place of the built-in DEFLATE (or SZIP) compressors to provide faster compression. The target performance point for LZF is very high-speed compression with an "acceptable" compression ratio.

In the benchmark trials with floating-point data below, a filter pipeline with LZF typically provides 3x-5x faster compression than DEFLATE and 2x faster decompression, while retaining 50%-90% of the DEFLATE compression ratio.

Unlike SZIP, this filter works for all datatypes on which DEFLATE works, including compound, opaque, array and user-defined types. There are also no settings to adjust.

The LZF filter is written in C and may be included in C++ applications. No external libraries are required. HDF5 versions 1.6 and 1.8 are both supported. The license is 3-clause BSD.

This filter is maintained as part of the HDF5 for Python (h5py) project. The goal of h5py is to provide access to the majority of the HDF5 C API and feature set from Python. A stand-alone version of the LZF filter is packaged inside the UNIX tarball for h5py, available here.

Based on LibLZF by Marc Lehmann.

Performance

Compression performance depends on many factors, including the storage datatype and the range of values used. LZF can be used on arbitrary HDF5 types, including strings, compound types and arrays in addition to scalars. However, performance on multi-byte floating-point and integer data sets is of particular importance, as they are so commonly used.

Therefore, this simple benchmark compares the performance of several HDF5 compression techniques on single-precision floating-point data sets of varying complexity. For LZF, DEFLATE and the PyTables LZO filter, the HDF5 SHUFFLE filter is also applied. The measured quantity for all filters is the performance of the entire pipeline.
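As a rough illustration of what "the entire pipeline" means, the sketch below applies a byte shuffle followed by a compressor and then reverses both steps. It is not the HDF5 implementation: Python's zlib module stands in for the DEFLATE filter, the function names (shuffle, unshuffle, compress_pipeline, decompress_pipeline) are hypothetical, and the input length is assumed to be a multiple of the element size.

```python
import math
import struct
import zlib

ITEMSIZE = 4  # bytes per single-precision float


def shuffle(raw, itemsize=ITEMSIZE):
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))


def unshuffle(shuffled, itemsize=ITEMSIZE):
    """Inverse of shuffle(); assumes len(shuffled) is a multiple of itemsize."""
    count = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for k in range(itemsize):
        # Byte plane k goes back to position k of every element.
        out[k::itemsize] = shuffled[k * count:(k + 1) * count]
    return bytes(out)


def compress_pipeline(raw):
    """Shuffle, then compress (zlib used here as a stand-in for DEFLATE)."""
    return zlib.compress(shuffle(raw))


def decompress_pipeline(blob):
    """Decompress, then undo the shuffle."""
    return unshuffle(zlib.decompress(blob))


# Data in the style of the "sine wave" test: data[i] = sin(i/32) as float32.
values = [math.sin(i / 32) for i in range(8192)]
raw = struct.pack("<%df" % len(values), *values)

packed = compress_pipeline(raw)
restored = decompress_pipeline(packed)

print("raw bytes:            ", len(raw))
print("zlib only:            ", len(zlib.compress(raw)))
print("shuffle + zlib:       ", len(packed))
print("round trip successful:", restored == raw)
```

Timing such a pipeline end to end, rather than the compressor alone, is what the tables report.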

The following compression techniques were tested, listed by the labels used in the tables: no compression (NULL), LZF, LZO, GZIP (DEFLATE) and SZIP.

The benchmark program (and a Python program to generate the test files) may be downloaded from the h5py SVN server from "/svn/bench".

Compression ratio is measured as the percent reduction in file size; 0.0% is uncompressed while 100% would be perfect compression.
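The definition above can be written as a small helper. This is a minimal sketch; the function name percent_reduction is hypothetical and is not part of the benchmark program.

```python
def percent_reduction(uncompressed_bytes, compressed_bytes):
    """Percent reduction in size: 0.0 means no savings, 100.0 means nothing left."""
    if uncompressed_bytes <= 0:
        raise ValueError("uncompressed size must be positive")
    saved = uncompressed_bytes - compressed_bytes
    return 100.0 * saved / uncompressed_bytes


print(percent_reduction(1000, 1000))  # unchanged size
print(percent_reduction(1000, 250))   # three quarters of the bytes saved
```

Note that under this definition a larger number is better, which is the opposite of ratios expressed as compressed size divided by original size.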

Also keep in mind that even with a 200-round ensemble, these times are not precise to more than a few milliseconds. Additionally, only one platform (32-bit Intel Linux) was tested.
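For readers who want to repeat this kind of measurement, averaging over a fixed number of rounds might look like the sketch below. The helper name mean_time_ms, the sample payload and the use of zlib are all illustrative assumptions; the actual benchmark program may measure time differently.

```python
import time
import zlib


def mean_time_ms(func, arg, rounds=200):
    """Average wall-clock time of func(arg) over `rounds` calls, in milliseconds."""
    start = time.perf_counter()
    for _ in range(rounds):
        func(arg)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / rounds


# Illustrative payload only: 256 KiB of repeating bytes.
payload = bytes(range(256)) * 1024
blob = zlib.compress(payload)

compress_ms = mean_time_ms(zlib.compress, payload)
decompress_ms = mean_time_ms(zlib.decompress, blob)

print("compress:   %.3f ms" % compress_ms)
print("decompress: %.3f ms" % decompress_ms)
```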

Trivial data: data[i] = i

Compression type   Compress time (ms)   Decompress time (ms)   Compressed by
NULL               10.7                 6.5                    0.00%
LZF                18.6                 17.8                   96.66%
LZO                20.2                 17.9                   98.55%
GZIP               58.1                 40.5                   98.53%
SZIP               63.1                 61.3                   72.68%

Sine wave: data[i] = sin(i/32)

Compression type   Compress time (ms)   Decompress time (ms)   Compressed by
NULL               10.1                 6.5                    0.00%
LZF                54.5                 22.2                   38.42%
LZO                86.9                 22.9                   44.24%
GZIP               215.1                58.6                   45.54%
SZIP               101.8                94.5                   27.05%

Noisy sine wave: data[i] = sin(i/32) + random(-0.25 to 0.25)

Compression type   Compress time (ms)   Decompress time (ms)   Compressed by
NULL               10.8                 6.5                    0.00%
LZF                65.5                 24.4                   15.54%
LZO                125.4                26.7                   17.25%
GZIP               298.6                64.8                   20.05%
SZIP               115.2                102.5                  16.29%

Random float data: data[i] = random(0 to 1.0)

(Note this is NOT the same thing as random bits.)
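The distinction matters because floats drawn from 0 to 1.0 all share a sign bit and use only a narrow range of exponents, so their byte representation is not uniformly random. The sketch below, which uses zlib as a stand-in compressor and a fixed seed chosen only for repeatability, compares single-precision random floats against the same number of random bytes.

```python
import random
import struct
import zlib

rng = random.Random(0)  # fixed seed (an arbitrary choice) so runs are repeatable
n = 16384

# Random floats in [0, 1), stored as little-endian float32.
floats = [rng.random() for _ in range(n)]
float_bytes = struct.pack("<%df" % n, *floats)

# The same number of bytes, each drawn uniformly from 0..255.
random_bytes = bytes(rng.getrandbits(8) for _ in range(len(float_bytes)))

float_packed = len(zlib.compress(float_bytes))
bits_packed = len(zlib.compress(random_bytes))

print("input size:          ", len(float_bytes))
print("random floats packed:", float_packed)
print("random bytes packed: ", bits_packed)
```

The random floats shrink somewhat, while uniformly random bytes should not shrink at all.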

Compression type   Compress time (ms)   Decompress time (ms)   Compressed by
NULL               9.0                  7.8                    0.00%
LZF                67.8                 24.9                   8.95%
LZO                124.0                30.6                   12.78%
GZIP               305.4                67.2                   17.05%
SZIP               120.6                107.7                  15.56%

Different chunk sizes

Filter performance can depend on the size of the chunk used. Here LZF, LZO and GZIP are compared for a variety of different chunk sizes as applied to the "sine wave" dataset above.
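The effect of chunk size can be explored without HDF5 by compressing fixed-size pieces of a buffer independently, which is roughly how a chunked dataset is handled. The sketch below is an approximation under several assumptions: zlib stands in for the GZIP filter, the shuffle is a simple byte-plane regrouping, chunk sizes are interpreted as KiB, and the helper name percent_reduction_by_chunk is hypothetical.

```python
import math
import struct
import zlib


def percent_reduction_by_chunk(raw, chunk_bytes, itemsize=4):
    """Shuffle and compress each chunk on its own; return the overall percent reduction."""
    compressed = 0
    for start in range(0, len(raw), chunk_bytes):
        chunk = raw[start:start + chunk_bytes]
        # Byte shuffle within the chunk: all first bytes, then all second bytes, ...
        shuffled = b"".join(chunk[k::itemsize] for k in range(itemsize))
        compressed += len(zlib.compress(shuffled))
    return 100.0 * (1.0 - compressed / len(raw))


# 384 KiB of float32 "sine wave" data: data[i] = sin(i/32).
n = 98304
raw = struct.pack("<%df" % n, *(math.sin(i / 32) for i in range(n)))

results = {kib: percent_reduction_by_chunk(raw, kib * 1024)
           for kib in (32, 96, 192, 384)}

for kib, reduction in results.items():
    print("%4dk chunks: %.2f%% smaller" % (kib, reduction))
```

The numbers produced this way will not match the tables, since the compressor, platform and file overhead all differ; only the trend across chunk sizes is of interest.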

Compression ratio

Chunk size   LZF      LZO      GZIP
32k          35.74%   36.57%   38.25%
96k          37.93%   41.98%   44.18%
192k         38.42%   44.24%   45.54%
384k         38.61%   45.38%   46.35%

Compress/decompress time (msec)

Chunk size   LZF           LZO           GZIP
32k          63.8 / 20.1   96.7 / 18.4   172.0 / 43.0
96k          57.0 / 20.4   88.4 / 17.4   202.2 / 50.6
192k         55.7 / 22.6   90.2 / 21.6   214.1 / 58.6
384k         57.5 / 27.2   93.8 / 27.2   221.5 / 65.3