LZF Compression Filter for HDF5

The LZF filter is a stand-alone compression filter for HDF5, which can be used in place of the built-in DEFLATE (or SZIP) compressors to provide faster compression. The target performance point for LZF is very high-speed compression with an "acceptable" compression ratio.

In benchmark trials with floating-point data (below), a filter pipeline with LZF typically provides 3x-5x faster compression than DEFLATE, 2x faster decompression, and retains 50%-90% of the DEFLATE compression ratio.
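The headline figures above can be checked against the benchmark tables later in this document. The following is only a small arithmetic sketch: the numbers are copied from those tables, and the names `RESULTS` and `relative_to_gzip` are made up for illustration. GZIP stands in for DEFLATE here, as it does in the tables.

```python
# (compress ms, decompress ms, percent size reduction), copied from the
# four benchmark tables in this document.
RESULTS = {
    "trivial":    {"LZF": (18.6, 17.8, 96.66), "GZIP": (58.1, 40.5, 98.53)},
    "sine":       {"LZF": (54.5, 22.2, 38.42), "GZIP": (215.1, 58.6, 45.54)},
    "noisy sine": {"LZF": (65.5, 24.4, 15.54), "GZIP": (298.6, 64.8, 20.05)},
    "random":     {"LZF": (67.8, 24.9, 8.95),  "GZIP": (305.4, 67.2, 17.05)},
}

def relative_to_gzip(name):
    """Return (compress speedup, decompress speedup, fraction of ratio retained)."""
    lzf_c, lzf_d, lzf_r = RESULTS[name]["LZF"]
    gz_c, gz_d, gz_r = RESULTS[name]["GZIP"]
    return gz_c / lzf_c, gz_d / lzf_d, lzf_r / gz_r

for name in RESULTS:
    c, d, r = relative_to_gzip(name)
    print(f"{name:11s} compress {c:.1f}x  decompress {d:.1f}x  ratio retained {r:.0%}")
```

On the trivial dataset LZF retains about 98% of the GZIP figure, slightly above the quoted range; the other three datasets fall inside it.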

Unlike SZIP, this filter works for all datatypes on which DEFLATE works, including compound, opaque, array and user-defined types. There are also no settings to adjust.

The LZF filter is written in C and may be included in C++ applications. No external libraries are required. HDF5 versions 1.6 and 1.8 are both supported. The license is 3-clause BSD.

This filter is maintained as part of the HDF5 for Python (h5py) project. The goal of h5py is to provide access to the majority of the HDF5 C API and feature set from Python. A stand-alone version of the LZF filter is packaged inside the UNIX tarball for h5py.

Main web site and documentation: http://www.h5py.org

Based on LibLZF by Marc Lehmann.

Performance

Compression performance depends on many factors, including the storage datatype and the range of values used. LZF can be used on arbitrary HDF5 types, including strings, compound types and arrays in addition to scalars. However, performance on multi-byte floating-point and integer datasets is of particular importance, because they are so commonly used.

Therefore, this simple benchmark compares the performance of several HDF5 compression techniques on single-precision floating-point datasets of varying complexity. For LZF, DEFLATE and the PyTables LZO filter, the HDF5 SHUFFLE filter is also applied. The measured quantity for all filters is the performance of the entire pipeline.
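To illustrate what the SHUFFLE step does before the compressor runs, here is a minimal pure-Python sketch of byte shuffling. It is not the HDF5 implementation: `shuffle` and `unshuffle` are hypothetical helper names, and zlib at level 1 is used only as a convenient stand-in for a DEFLATE-style compressor.

```python
import math
import struct
import zlib

def shuffle(buf: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every element together, then byte 1, and so on."""
    return b"".join(buf[k::itemsize] for k in range(itemsize))

def unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Inverse of shuffle(): interleave the byte planes back into elements."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for k in range(itemsize):
        out[k::itemsize] = buf[k * n:(k + 1) * n]
    return bytes(out)

# Sine-wave test data stored as 4-byte little-endian floats.
values = [math.sin(i / 32) for i in range(48000)]
raw = struct.pack(f"<{len(values)}f", *values)

plain = zlib.compress(raw, 1)                 # compressor alone
shuffled = zlib.compress(shuffle(raw, 4), 1)  # shuffle, then compress
print(len(raw), len(plain), len(shuffled))
```

Printing the three sizes lets the effect of shuffling be compared directly. The byte-order and element-count choices here are arbitrary.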

HDF5 1.8.2
4-byte floating-point data
4MB (1,024,000 element) dataset, 190kB chunk size
Times are averaged over 200 rounds of compression
Used H5FD_CORE driver; HDF5 file exists only in memory to avoid confusion with disk access times.
Compiled on 32-bit Linux, with gcc -O3
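
The averaging over many rounds described above can be sketched as follows. This is only an illustration of the timing approach, not the benchmark program itself: `mean_time_ms` is a hypothetical helper, the payload is an arbitrary buffer of zeros, and zlib at level 1 is timed on its own rather than as part of an HDF5 filter pipeline.

```python
import time
import zlib

def mean_time_ms(func, payload, rounds=200):
    """Average wall-clock time of func(payload), in milliseconds."""
    start = time.perf_counter()
    for _ in range(rounds):
        func(payload)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / rounds

# Hypothetical payload: 4,096,000 zero bytes, the same size as the
# benchmark dataset (1,024,000 elements of 4 bytes each).
payload = bytes(4 * 1_024_000)
compressed = zlib.compress(payload, 1)

print(mean_time_ms(lambda b: zlib.compress(b, 1), payload, rounds=20))
print(mean_time_ms(zlib.decompress, compressed, rounds=20))
```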

The following compression techniques were tested:

No compression (listed as NULL in the tables below)
SHUFFLE + LZF
SHUFFLE + PyTables LZO
SHUFFLE + GZIP (level 1)
SZIP (NN, 16)

The benchmark program (and a Python program to generate the test files) may be downloaded from the h5py SVN server from "/svn/bench".

Compression ratio is measured as the percent reduction in file size; 0.0% is uncompressed while 100% would be perfect compression.
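
As a small worked example of this definition, the sketch below computes the percent reduction from two sizes and converts it to the more conventional "N:1" form. The function names are made up for illustration.

```python
def percent_reduction(original_size: int, compressed_size: int) -> float:
    """Percent reduction in size: 0.0 means no change in size."""
    return 100.0 * (1.0 - compressed_size / original_size)

def as_ratio(percent: float) -> float:
    """Convert a percent reduction (below 100) to an original:compressed ratio."""
    return 100.0 / (100.0 - percent)

print(percent_reduction(1000, 250))  # 75.0
print(as_ratio(75.0))                # 4.0
print(round(as_ratio(38.42), 2))     # LZF on the sine wave data, as an N:1 ratio
```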

Also keep in mind that even with a 200-round ensemble, these times are not precise to more than a few milliseconds. Additionally, only one platform (32-bit Intel Linux) was tested.

Trivial data `data[i] = i`

Compression type	Compress time (ms)	Decompress time (ms)	Compressed by
NULL	10.7	6.5	0.00%
LZF	18.6	17.8	96.66%
LZO	20.2	17.9	98.55%
GZIP	58.1	40.5	98.53%
SZIP	63.1	61.3	72.68%

Sine wave `data[i] = sin(i/32)`

Compression type	Compress time (ms)	Decompress time (ms)	Compressed by
NULL	10.1	6.5	0.00%
LZF	54.5	22.2	38.42%
LZO	86.9	22.9	44.24%
GZIP	215.1	58.6	45.54%
SZIP	101.8	94.5	27.05%

Noisy sine wave `data[i] = sin(i/32) + random(-0.25 to 0.25)`

Compression type	Compress time (ms)	Decompress time (ms)	Compressed by
NULL	10.8	6.5	0.00%
LZF	65.5	24.4	15.54%
LZO	125.4	26.7	17.25%
GZIP	298.6	64.8	20.05%
SZIP	115.2	102.5	16.29%

Random float data `data[i] = random(0 to 1.0)`

(Note this is NOT the same thing as random bits.)

Compression type	Compress time (ms)	Decompress time (ms)	Compressed by
NULL	9.0	7.8	0.00%
LZF	67.8	24.9	8.95%
LZO	124.0	30.6	12.78%
GZIP	305.4	67.2	17.05%
SZIP	120.6	107.7	15.56%
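
The note above, that random float data is not the same thing as random bits, can be illustrated with a short sketch. Floats drawn from 0 to 1.0 share most of their sign and exponent bits, so some redundancy remains, while uniformly random bytes have none. This uses zlib at level 1 as a stand-in compressor, without SHUFFLE, so the sizes it prints are not comparable to the table.

```python
import random
import struct
import zlib

rng = random.Random(0)  # fixed seed so runs are repeatable
n = 50_000

# Random values in [0, 1), stored as 4-byte little-endian floats.
floats = struct.pack(f"<{n}f", *(rng.random() for _ in range(n)))

# The same number of bytes, but every bit is random.
bits = bytes(rng.getrandbits(8) for _ in range(4 * n))

print(len(floats), len(zlib.compress(floats, 1)))
print(len(bits), len(zlib.compress(bits, 1)))
```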

Different chunk sizes

Filter performance can depend on the size of the chunk used. Here LZF, LZO and GZIP are compared for a variety of different chunk sizes as applied to the "sine wave" dataset above.

Compression ratio

Chunk size	LZF	LZO	GZIP
32k	35.74%	36.57%	38.25%
96k	37.93%	41.98%	44.18%
192k	38.42%	44.24%	45.54%
384k	38.61%	45.38%	46.35%

Compress/decompress time (ms)

Chunk size	LZF compress	LZF decompress	LZO compress	LZO decompress	GZIP compress	GZIP decompress
32k	63.8	20.1	96.7	18.4	172.0	43.0
96k	57.0	20.4	88.4	17.4	202.2	50.6
192k	55.7	22.6	90.2	21.6	214.1	58.6
384k	57.5	27.2	93.8	27.2	221.5	65.3
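
HDF5 applies its filters to each chunk independently, which is one reason chunk size affects the result. The sketch below imitates that by compressing fixed-size slices of a sine-wave buffer separately and totalling the output. It is a rough illustration only: `chunked_size` is a hypothetical helper, zlib level 1 stands in for the GZIP filter, no SHUFFLE step is applied, and chunk sizes are assumed to be in units of 1024 bytes.

```python
import math
import struct
import zlib

def chunked_size(raw: bytes, chunk_bytes: int, level: int = 1) -> int:
    """Total compressed size when each chunk is compressed on its own."""
    return sum(
        len(zlib.compress(raw[i:i + chunk_bytes], level))
        for i in range(0, len(raw), chunk_bytes)
    )

n = 100_000
raw = struct.pack(f"<{n}f", *(math.sin(i / 32) for i in range(n)))

for kib in (32, 96, 192, 384):
    size = chunked_size(raw, kib * 1024)
    print(f"{kib:4d}k  {100.0 * (1 - size / len(raw)):.2f}%")
```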