File: lzip.info, Node: Quality assurance, Next: Algorithm, Prev: Stream format, Up: Top

7 Design, development, and testing of lzip
******************************************

If only we could learn the right lessons from the successes of the past, we
would not need to learn from our failures.
-- C.A.R. Hoare

Lzip has been designed, written, and tested with great care to replace gzip
and bzip2 as the general-purpose compressed format for Unix-like systems. This
chapter describes the lessons learned from these previous formats, and their
application to the design of lzip. The lzip format specification has been
reviewed carefully and is believed to be free from design errors.

7.1 Format design
=================

When gzip was designed in 1992, computers and operating systems were less
capable than they are today. The designers of gzip tried to work around some
of those limitations, like 8.3 file names, with additional fields in the
file format.

Today those limitations have mostly disappeared, and the format of gzip has
proved to be unnecessarily complicated. It includes fields that were never
used, others that have lost their usefulness, and finally others that have
become too limited.

Bzip2 was designed 5 years later, and its format is simpler than that of
gzip. Both gzip and bzip2 lack the fields required to implement a reliable
and efficient '--list' operation.

Probably the worst defect of the gzip format from the point of view of data
safety is the variable size of its header. If the byte at offset 3 (flags)
of a gzip member gets corrupted, it may become difficult to recover the
data, even if the compressed blocks are intact, because it can't be known
with certainty where the compressed blocks begin.
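The dependence of the data offset on the flags byte can be sketched in
Python. This is only an illustration: the flag values and the field layout
are taken from RFC 1952, not from this manual, and the function name
'gzip_data_offset' is hypothetical.

```python
import gzip
import io

# Flag bits of the FLG byte (offset 3), as defined in RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10

def gzip_data_offset(buf: bytes) -> int:
    """Return the offset where the compressed blocks start in a gzip member."""
    if len(buf) < 10 or buf[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    flags = buf[3]
    pos = 10                                   # fixed part of the header
    if flags & FEXTRA:                         # 2-byte length, then the field
        pos += 2 + int.from_bytes(buf[pos:pos + 2], "little")
    for bit in (FNAME, FCOMMENT):              # zero-terminated strings
        if flags & bit:
            pos = buf.index(b"\x00", pos) + 1
    if flags & FHCRC:                          # optional 16-bit header CRC
        pos += 2
    return pos

# Build a small member that stores a file name, then flip one flag bit.
out = io.BytesIO()
with gzip.GzipFile(filename="hello.txt", mode="wb", fileobj=out, mtime=0) as f:
    f.write(b"data")
member = out.getvalue()
damaged = bytearray(member)
damaged[3] ^= FNAME                            # single bit flip in the flags
print(gzip_data_offset(member), gzip_data_offset(bytes(damaged)))
```

After the bit flip the computed offset points into the stored file name
instead of the compressed blocks, even though the blocks themselves are
untouched.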

By contrast, the header of a lzip member has a fixed length of 6 bytes. The
LZMA stream in a lzip member always starts at offset 6, making it trivial to
recover the data even if the whole header becomes corrupt.
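A minimal sketch of what the fixed header length buys, assuming the header
layout described in the stream format chapter (the 4-byte magic string
"LZIP", a version byte equal to 1, and the coded dictionary size byte). The
name 'split_member' is hypothetical.

```python
LZIP_MAGIC = b"LZIP"
HEADER_SIZE = 6   # magic (4 bytes) + version (1 byte) + coded dictionary size (1 byte)

def split_member(member: bytes):
    """Return (header_ok, rest), where rest begins at the LZMA stream.

    The split point never depends on the header contents, so the LZMA
    stream can be located even when every header byte is corrupt.
    """
    if len(member) < HEADER_SIZE:
        raise ValueError("member shorter than the 6-byte header")
    header = member[:HEADER_SIZE]
    header_ok = header[:4] == LZIP_MAGIC and header[4] == 1
    return header_ok, member[HEADER_SIZE:]
```

Unlike the gzip case, no field inside the header is consulted to find the
start of the compressed data.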

Bzip2 also provides a header of fixed length and marks the beginning and end
of each compressed block with six magic bytes, making it possible to find
the compressed blocks even in case of file damage. But bzip2 does not store
the size of each compressed block, as lzip does.

Lziprecover is able to provide unique data recovery capabilities because the
lzip format is extraordinarily safe. The simple and safe design of the file
format complements the embedded error detection provided by the LZMA data
stream. Any distance out of bounds acts as a forbidden symbol, allowing the
decoder to detect the approximate position of errors, and leaving little
work for the check sequence (CRC and data sizes) in the detection of errors.
Lzip is usually able to detect all possible bit flips in the compressed data
without resorting to the check sequence. It would be difficult to write an
automatic recovery tool like lziprecover for the gzip format and, as far as
I know, none has ever been written.

Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the
decompressed data because it provides optimal accuracy in the detection of
errors up to a compressed size of about 16 GiB, a size larger than that of
most files. In the case of lzip, the additional detection capability of the
decoder reduces the probability of undetected errors by a further factor of
about 50 million, resulting in combined integrity checking that is optimally
accurate for any member size produced by lzip. Moreover, a CRC is better
than a hash of the same size for detection of errors in lzip files because
the decoder catches almost all the large errors, while the CRC guarantees
the detection of the small errors (which the hash does not).
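The guarantee for small errors can be illustrated, though not proved, by
exhaustively flipping every single bit of a short message and checking that
the CRC32 always changes. The sketch assumes that 'zlib.crc32' from the
Python standard library computes the same CRC32 used by gzip; the function
name is hypothetical.

```python
import zlib

def undetected_single_bit_flips(data: bytes) -> int:
    """Count the single-bit flips of data that leave its CRC32 unchanged."""
    good = zlib.crc32(data)
    buf = bytearray(data)
    missed = 0
    for i in range(len(buf)):
        for bit in range(8):
            buf[i] ^= 1 << bit            # introduce the error
            if zlib.crc32(buf) == good:
                missed += 1
            buf[i] ^= 1 << bit            # restore the original byte
    return missed

print(undetected_single_bit_flips(bytes(range(256))))
```

A hash of the same size offers no such guarantee; it can only make a missed
single-bit error improbable.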

The lzip format is designed for long-term archiving. Therefore it excludes
any unneeded features that may interfere with the future extraction of the
decompressed data.

For example, by not protecting the dictionary size field (DS) beyond
checking that it contains a valid value, lzip minimizes the number of false
positives caused by a corrupt DS. DS is the dictionary size used by the
compressor, but it is possible to decompress a file using a DS different
from the one used by the compressor as long as it is large enough to contain
the largest distance encoded. Protecting DS with a CRC would have caused
data loss in most cases where the data can be decompressed successfully.
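The validity check and the "large enough" condition can be sketched as
follows. The coding of the DS byte used here (bits 4-0 hold the base 2
logarithm of a base size, bits 7-5 hold the number of sixteenths of the base
size to subtract, valid sizes range from 4 KiB to 512 MiB) is an assumption
taken from the lzip format description, and all function names are
hypothetical.

```python
MIN_DICTIONARY_SIZE = 1 << 12   # 4 KiB
MAX_DICTIONARY_SIZE = 1 << 29   # 512 MiB

def decode_ds(ds: int) -> int:
    """Decode the one-byte coded dictionary size into a size in bytes."""
    size = 1 << (ds & 0x1F)                    # base size, a power of two
    if size > MIN_DICTIONARY_SIZE:
        size -= (size // 16) * (ds >> 5)       # subtract 0 to 7 sixteenths
    return size

def is_valid_ds(ds: int) -> bool:
    """The only protection DS gets: its decoded value must be in range."""
    return 0 <= ds <= 0xFF and \
        MIN_DICTIONARY_SIZE <= decode_ds(ds) <= MAX_DICTIONARY_SIZE

def large_enough(ds: int, largest_distance: int) -> bool:
    """True if a dictionary of the coded size can hold every distance used."""
    return is_valid_ds(ds) and decode_ds(ds) >= largest_distance

print(decode_ds(0xD3))   # 2^19 minus 6 sixteenths of 2^19
```

Under this scheme a corrupt DS byte that still decodes to a valid size at
least as large as the largest distance in the stream does not prevent
decompression, which is the case the paragraph above describes.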

7.1.1 Gzip format (mis)features not present in lzip
---------------------------------------------------

'Multiple algorithms'
Gzip provides a CM (Compression Method) field that has never been used
because it is too limiting. New compression methods may require
additional fields, making it impossible to implement new methods and,
at the same time, keep the same format.

'Optional fields in header'
Unless special precautions are taken, optional fields are generally a
bad idea because they produce a header of variable size. The gzip
header has 2 fields that, in addition to being optional, are
zero-terminated. This means that if any byte inside the field gets
zeroed, or if the terminating zero gets altered, gzip won't be able to
find either the header CRC or the compressed blocks.

'Optional CRC for the header'
Using an optional CRC for the header circumvents the Hamming distance
(HD) of the CRC and may prevent the extraction of good data. For
example, if the CRC is used and the bit enabling it is reset by a bit
flip, then the header seems to be intact (in spite of being corrupt)
while the compressed blocks seem to be unrecoverable (in spite of
being intact).

'Metadata'
The gzip format stores some metadata, like the modification time of the
original file or the operating system on which compression took place.
This complicates reproducible compression (obtaining identical
compressed output from identical input).

7.1.2 Lzip format improvements over gzip and bzip2
--------------------------------------------------

'64-bit size field'
Probably the most frequently reported shortcoming of the gzip format
is that it only stores the least significant 32 bits of the
uncompressed size. The size of any file larger than or equal to 4 GiB
gets truncated.

Bzip2 does not store the uncompressed size of the file.

The lzip format provides a 64-bit field for the uncompressed size.
Additionally, lzip produces multimember output automatically when the
size is too large for a single member, allowing for an unlimited
uncompressed size.

'Distributed index'
The lzip format provides a distributed index that, among other things,
allows a reliable and efficient '--list' operation, helps plzip to
decompress faster than pigz, and helps lziprecover do its job. Neither
the gzip format nor the bzip2 format provides an index.

A distributed index is safer and more scalable than a monolithic
index. The monolithic index introduces a single point of failure in
the compressed file and may limit the number of members or the total
uncompressed size.
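As an illustration of how a distributed index supports a '--list' style
operation, the sketch below walks a multimember file backwards using only
the size fields stored at the end of each member. It assumes the 20-byte
trailer layout from the lzip format description (CRC32, data size, member
size, all little-endian) and is not the code used by lzip; 'list_members' is
a hypothetical name.

```python
import struct

HEADER_SIZE = 6
TRAILER_SIZE = 20   # CRC32 (4 bytes), data size (8), member size (8)

def list_members(buf: bytes):
    """Return [(offset, member_size, data_size), ...] in file order.

    Only the trailers and the magic strings are read; no member is
    decompressed.
    """
    members = []
    end = len(buf)
    while end > 0:
        if end < HEADER_SIZE + TRAILER_SIZE:
            raise ValueError("truncated member")
        _crc, data_size, member_size = struct.unpack_from(
            "<IQQ", buf, end - TRAILER_SIZE)
        start = end - member_size
        if (member_size < HEADER_SIZE + TRAILER_SIZE or start < 0
                or buf[start:start + 4] != b"LZIP"):
            raise ValueError("bad member size or header")
        members.append((start, member_size, data_size))
        end = start                      # continue with the previous member
    members.reverse()
    return members
```

Summing the third element of each tuple gives the total uncompressed size.
Because each member carries its own sizes, damage to one trailer affects the
listing of that member only, not of a single central table.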

7.2 Quality of implementation
=============================

Our civilization depends critically on software; it had better be quality
software.
-- Bjarne Stroustrup

'Accurate and robust error detection'
The lzip format provides 3-factor integrity checking, and the
decompressors report mismatches in each factor separately. This method
detects most false positives for corruption. If just one byte in one
factor fails but the other two factors match the data, it probably
means that the data are intact and the corruption just affects the
mismatching factor (CRC, data size, or member size) in the member
trailer.

'Multiple implementations'
Just like the lzip format provides 3-factor protection against
undetected data corruption, the development methodology of the lzip
family of compressors provides 3-factor protection against undetected
programming errors.

Three related but independent compressor implementations, lzip, clzip,
and minilzip/lzlib, are developed concurrently. Every stable release
of any of them is tested to check that it produces identical output to
the other two. This corroborates that all three implement the same
algorithm, and makes it unlikely that any of them may contain serious
undiscovered errors. In fact, no errors have been discovered in lzip
since 2009.

Additionally, the three implementations have been extensively tested
with lziprecover, valgrind, and 'american fuzzy lop' without finding a
single vulnerability or false negative. *Note Unzcrash:
(lziprecover)Unzcrash.

'Dictionary size'
Lzip automatically adapts the dictionary size to the size of each file.
In addition to reducing the amount of memory required for
decompression, this feature also minimizes the probability of being
affected by RAM errors during compression.

'Exit status'
Returning a warning status of 2 is a design flaw of compress that
leaked into the design of gzip. Both bzip2 and lzip are free from this
flaw.
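Returning to the 3-factor integrity checking described under 'Accurate and
robust error detection' above, a minimal sketch of reporting each factor
separately could look like this. It assumes the 20-byte little-endian
trailer (CRC32, data size, member size) and that 'zlib.crc32' matches the
CRC32 used by lzip; 'check_trailer' is a hypothetical name, not a function
of lzip or lzlib.

```python
import struct
import zlib

def check_trailer(data: bytes, member_size: int, trailer: bytes):
    """Return the names of the trailer factors that do not match.

    data is the decompressed data of one member, member_size the number
    of compressed bytes actually read, trailer the 20-byte member trailer.
    """
    crc, data_size, stored_member_size = struct.unpack("<IQQ", trailer)
    mismatches = []
    if crc != zlib.crc32(data):
        mismatches.append("CRC")
    if data_size != len(data):
        mismatches.append("data size")
    if stored_member_size != member_size:
        mismatches.append("member size")
    return mismatches
```

When the returned list holds a single name, the other two factors still
vouch for the data, which suggests that the corruption lies in the trailer
field itself rather than in the data.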