File: lzip.info,  Node: Quality assurance,  Next: Algorithm,  Prev: Stream format,  Up: Top

7 Design, development, and testing of lzip
******************************************

If only we could learn the right lessons from the successes of the past, we
would not need to learn from our failures.
-- C.A.R. Hoare

Lzip has been designed, written, and tested with great care to replace gzip
and bzip2 as the general-purpose compressed format for Unix-like systems. This
chapter describes the lessons learned from these previous formats, and their
application to the design of lzip. The lzip format specification has been
reviewed carefully and is believed to be free from design errors.

7.1 Format design
=================

When gzip was designed in 1992, computers and operating systems were less
capable than they are today. The designers of gzip tried to work around some
of those limitations, like 8.3 file names, with additional fields in the
file format.

Today those limitations have mostly disappeared, and the format of gzip has
proved to be unnecessarily complicated. It includes fields that were never
used, others that have lost their usefulness, and finally others that have
become too limited.

Bzip2 was designed 5 years later, and its format is simpler than that of
gzip. Both gzip and bzip2 lack the fields required to implement a reliable
and efficient '--list' operation.

Probably the worst defect of the gzip format from the point of view of data
safety is the variable size of its header. If the byte at offset 3 (flags)
of a gzip member gets corrupted, it may become difficult to recover the
data, even if the compressed blocks are intact, because it can't be known
with certainty where the compressed blocks begin.
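
The dependence on the flags byte can be sketched in Python. This is only an
illustration, not code from gzip: the function name 'gzip_data_offset' is
hypothetical, and the field layout assumed is the one given in RFC 1952.

```python
# Flag bits in the FLG byte (offset 3) of a gzip member, per RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10

def gzip_data_offset(member: bytes) -> int:
    """Return the offset of the first compressed block in a gzip member."""
    if member[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip member")
    flags = member[3]
    pos = 10                                  # fixed part of the header
    if flags & FEXTRA:                        # 2-byte length + extra field
        xlen = int.from_bytes(member[pos:pos + 2], "little")
        pos += 2 + xlen
    if flags & FNAME:                         # zero-terminated file name
        pos = member.index(b"\x00", pos) + 1
    if flags & FCOMMENT:                      # zero-terminated comment
        pos = member.index(b"\x00", pos) + 1
    if flags & FHCRC:                         # 2-byte header CRC
        pos += 2
    return pos

# A header with a file name: 10 fixed bytes + b"hello.txt\0" (10 bytes).
header = b"\x1f\x8b\x08" + bytes([FNAME]) + bytes(6) + b"hello.txt\x00"
print(gzip_data_offset(header))               # 20
# Flip the FNAME bit: the file name is now taken for compressed data.
damaged = header[:3] + bytes([header[3] ^ FNAME]) + header[4:]
print(gzip_data_offset(damaged))              # 10
```

A single flipped bit in the flags moves the computed start of the compressed
blocks by the length of a whole field.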

By contrast, the header of a lzip member has a fixed length of 6 bytes. The
LZMA stream in a lzip member always starts at offset 6, making it trivial to
recover the data even if the whole header becomes corrupt.
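
A minimal Python sketch of why a fixed-length header helps. The layout
assumed here (4 magic bytes, 1 version byte, 1 coded dictionary size byte,
and a 20-byte trailer) is the one in the lzip format specification; the
function names are hypothetical.

```python
HEADER_SIZE = 6     # magic "LZIP" (4) + version (1) + coded dict size (1)
TRAILER_SIZE = 20   # CRC32 (4) + data size (8) + member size (8)

def header_is_valid(member: bytes) -> bool:
    """Check the magic bytes and the version number (1) of the header."""
    return (len(member) >= HEADER_SIZE
            and member[:4] == b"LZIP" and member[4] == 1)

def split_member(member: bytes):
    """Return (header, LZMA stream, trailer) using fixed offsets only."""
    if len(member) < HEADER_SIZE + TRAILER_SIZE:
        raise ValueError("member too short")
    return (member[:HEADER_SIZE],
            member[HEADER_SIZE:-TRAILER_SIZE],
            member[-TRAILER_SIZE:])

# The stream is found at the same place even if the header is wiped out.
member = b"LZIP\x01\x0c" + b"stream" + bytes(TRAILER_SIZE)
wiped = bytes(HEADER_SIZE) + member[HEADER_SIZE:]
print(header_is_valid(wiped))                              # False
print(split_member(wiped)[1] == split_member(member)[1])   # True
```

Nothing in the header is needed to locate the stream, so header damage
cannot hide it.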

Bzip2 also provides a header of fixed length and marks the beginning and end
of each compressed block with six magic bytes, making it possible to find
the compressed blocks even in case of file damage. But, unlike lzip, bzip2
does not store the size of each compressed block.

Lziprecover is able to provide unique data recovery capabilities because the
lzip format is extraordinarily safe. The simple and safe design of the file
format complements the embedded error detection provided by the LZMA data
stream. Any distance out of bounds acts as a forbidden symbol, allowing the
decoder to detect the approximate position of errors, and leaving little
work for the check sequence (CRC and data sizes) in the detection of errors.
Lzip is usually able to detect all possible bit flips in the compressed data
without resorting to the check sequence. It would be difficult to write an
automatic recovery tool like lziprecover for the gzip format and, as far as
I know, none has ever been written.
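
The idea of an out-of-bounds distance acting as a forbidden symbol can be
shown with a toy LZ77-style decoder in Python. This is not the LZMA decoder
of lzip: the token format is invented for the example, and a real decoder
also bounds distances by the dictionary size.

```python
def lz_decode(tokens):
    """Decode toy tokens: an int is a literal byte, a (distance, length)
    tuple copies 'length' bytes starting 'distance' bytes back."""
    out = bytearray()
    for pos, tok in enumerate(tokens):
        if isinstance(tok, int):
            out.append(tok)
            continue
        distance, length = tok
        # A distance reaching before the start of the data cannot be
        # produced by any valid encoder, so it reveals corruption here.
        if distance < 1 or distance > len(out):
            raise ValueError(f"token {pos}: distance {distance} out of bounds")
        for _ in range(length):
            out.append(out[-distance])
    return bytes(out)

print(lz_decode([97, 98, (2, 4)]))      # b'ababab'
try:
    lz_decode([97, 98, (3, 4)])         # distance 3 with only 2 bytes decoded
except ValueError as err:
    print(err)
```

The error is raised at the damaged token, which is how a decoder can report
the approximate position of the corruption.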

Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the
decompressed data because it provides optimal accuracy in the detection of
errors up to a compressed size of about 16 GiB, a size larger than that of
most files. In the case of lzip, the additional detection capability of the
decoder reduces the probability of undetected errors by a further factor of
about 50 million, resulting in a combined integrity check that is optimally
accurate for any member size produced by lzip. Moreover, a CRC is better
than a hash of the same size for detecting errors in lzip files because the
decoder catches almost all the large errors, while the CRC guarantees the
detection of the small errors (which the hash does not).
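
The guarantee for small errors can be checked experimentally. The Python
sketch below uses 'zlib.crc32' (the CRC32 of gzip) as a stand-in for the
CRC32 computed by lzip, which is an assumption of this example, and counts
how many single-bit flips leave the CRC unchanged.

```python
import zlib

def undetected_single_bit_flips(data: bytes) -> int:
    """Flip every bit of 'data' in turn and count CRC32 collisions."""
    good = zlib.crc32(data)
    buf = bytearray(data)
    missed = 0
    for i in range(len(buf) * 8):
        buf[i // 8] ^= 1 << (i % 8)      # flip bit i
        if zlib.crc32(buf) == good:
            missed += 1
        buf[i // 8] ^= 1 << (i % 8)      # restore bit i
    return missed

print(undetected_single_bit_flips(b"The quick brown fox" * 10))   # 0
```

A CRC detects every single-bit error by construction, whereas a generic hash
of the same size offers only a probabilistic bound.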

The lzip format is designed for long-term archiving. Therefore it excludes
any unneeded features that may interfere with the future extraction of the
decompressed data.

For example, by not protecting the dictionary size field (DS) beyond
checking that it contains a valid value, lzip minimizes the number of false
positives caused by a corrupt DS. DS is the dictionary size used by the
compressor, but it is possible to decompress a file using a DS different
from the one used by the compressor as long as it is large enough to contain
the largest distance encoded. Protecting DS with a CRC would have caused
data loss in most cases where the data can be decompressed successfully.
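
A Python sketch of the DS check. It assumes the coding described in the lzip
format specification (bits 4-0 hold the base 2 logarithm of a base size,
bits 7-5 hold the number of sixteenths of the base size to subtract, valid
sizes range from 4 KiB to 512 MiB); the function names are hypothetical.

```python
MIN_DICT_SIZE = 1 << 12      # 4 KiB
MAX_DICT_SIZE = 1 << 29      # 512 MiB

def decode_dict_size(ds: int) -> int:
    """Decode the DS byte, rejecting only values outside the valid range."""
    size = 1 << (ds & 0x1F)
    if size > MIN_DICT_SIZE:                 # no fraction at the minimum
        size -= (size // 16) * (ds >> 5)
    if not MIN_DICT_SIZE <= size <= MAX_DICT_SIZE:
        raise ValueError(f"invalid dictionary size in DS byte {ds:#04x}")
    return size

def can_decode_with(ds: int, largest_distance: int) -> bool:
    """True if a (possibly altered) DS still covers every distance used."""
    return decode_dict_size(ds) >= largest_distance

print(decode_dict_size(0xD3))                # 327680 (320 KiB)
# DS changed from 0xD3 to 0x13 (512 KiB): decoding still succeeds.
print(can_decode_with(0x13, 300 * 1024))     # True
```

A corrupt DS that still decodes to a large enough dictionary does no harm,
which is why rejecting the member on a DS checksum mismatch would discard
recoverable data.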

7.1.1 Gzip format (mis)features not present in lzip
---------------------------------------------------

'Multiple algorithms'
     Gzip provides a CM (Compression Method) field that has never been used
     because it is too limiting. New compression methods may require
     additional fields, making it impossible to implement new methods and,
     at the same time, keep the same format.

'Optional fields in header'
     Unless special precautions are taken, optional fields are generally a
     bad idea because they produce a header of variable size. The gzip
     header has 2 fields that, in addition to being optional, are
     zero-terminated. This means that if any byte inside the field gets
     zeroed, or if the terminating zero gets altered, gzip will be able to
     find neither the header CRC nor the compressed blocks.

'Optional CRC for the header'
     Using an optional CRC for the header circumvents the Hamming distance
     (HD) of the CRC and may prevent the extraction of good data. For
     example, if the CRC is used and the bit enabling it is reset by a bit
     flip, then the header seems to be intact (in spite of being corrupt)
     while the compressed blocks seem to be unrecoverable (in spite of
     being intact).

'Metadata'
     The gzip format stores some metadata, like the modification time of the
     original file or the operating system on which compression took place.
     This complicates reproducible compression (obtaining identical
     compressed output from identical input).


7.1.2 Lzip format improvements over gzip and bzip2
--------------------------------------------------

'64-bit size field'
     Probably the most frequently reported shortcoming of the gzip format
     is that it only stores the least significant 32 bits of the
     uncompressed size. The size of any file larger than or equal to 4 GiB
     gets truncated.

     Bzip2 does not store the uncompressed size of the file.

     The lzip format provides a 64-bit field for the uncompressed size.
     Additionally, lzip produces multimember output automatically when the
     size is too large for a single member, allowing for an unlimited
     uncompressed size.
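
     The truncation is plain modular arithmetic, as this Python sketch
     shows. The function names are hypothetical; the 32-bit field is the
     ISIZE field of RFC 1952, and the 64-bit field is assumed to be stored
     little-endian as in the lzip trailer.

```python
import struct

def gzip_isize(uncompressed_size: int) -> int:
    """Value that fits in the 32-bit size field of a gzip trailer."""
    return uncompressed_size & 0xFFFFFFFF

def lzip_data_size_field(uncompressed_size: int) -> bytes:
    """The same size stored in a 64-bit little-endian field."""
    return struct.pack("<Q", uncompressed_size)

size = 4 * 2**30 + 5                      # 4 GiB + 5 bytes
print(gzip_isize(size))                   # 5
print(struct.unpack("<Q", lzip_data_size_field(size))[0] == size)   # True
```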

'Distributed index'
     The lzip format provides a distributed index that, among other things,
     allows a reliable and efficient '--list' operation, helps plzip to
     decompress faster than pigz, and helps lziprecover do its job. Neither
     the gzip format nor the bzip2 format provides an index.

     A distributed index is safer and more scalable than a monolithic
     index. The monolithic index introduces a single point of failure in
     the compressed file and may limit the number of members or the total
     uncompressed size.
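
     A simplified Python sketch of how a listing can walk such an index
     backwards from the end of the file, reading only the trailers. It
     assumes the 20-byte trailer layout of the lzip format specification
     (CRC32, data size, member size, all little-endian). The helper
     'make_fake_member' is hypothetical and its payload is a placeholder,
     not a real LZMA stream; trailing data and damaged files are not
     handled.

```python
import struct
import zlib

HEADER_SIZE, TRAILER_SIZE = 6, 20

def make_fake_member(data: bytes) -> bytes:
    """Build a member-shaped blob. The payload is NOT a real LZMA stream."""
    payload = data[::-1]                       # placeholder for the stream
    member_size = HEADER_SIZE + len(payload) + TRAILER_SIZE
    trailer = struct.pack("<IQQ", zlib.crc32(data), len(data), member_size)
    return b"LZIP\x01\x0c" + payload + trailer

def list_members(file_bytes: bytes):
    """Return (offset, member size, data size) per member, first to last."""
    members = []
    end = len(file_bytes)
    while end > 0:
        if end < HEADER_SIZE + TRAILER_SIZE:
            raise ValueError("truncated member")
        _crc, data_size, member_size = struct.unpack(
            "<IQQ", file_bytes[end - TRAILER_SIZE:end])
        start = end - member_size
        if (member_size < HEADER_SIZE + TRAILER_SIZE or start < 0
                or file_bytes[start:start + 4] != b"LZIP"):
            raise ValueError("bad member size in trailer")
        members.append((start, member_size, data_size))
        end = start                            # jump to the previous member
    return members[::-1]

archive = make_fake_member(b"hello") + make_fake_member(b"world!!")
print(list_members(archive))        # [(0, 31, 5), (31, 33, 7)]
```

     No payload byte is read, so the cost grows with the number of members
     rather than with the size of the file.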


7.2 Quality of implementation
=============================

Our civilization depends critically on software; it had better be quality
software.
-- Bjarne Stroustrup

'Accurate and robust error detection'
     The lzip format provides 3-factor integrity checking, and the
     decompressors report mismatches in each factor separately. This method
     detects most false positives for corruption. If just one byte in one
     factor fails but the other two factors match the data, it probably
     means that the data are intact and the corruption just affects the
     mismatching factor (CRC, data size, or member size) in the member
     trailer.
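
     Reporting each factor separately might look like the following Python
     sketch. The function 'check_member' is hypothetical, the decompressed
     data is passed in because no LZMA decoder is included, and
     'zlib.crc32' is assumed to match the CRC32 stored in the trailer.

```python
import struct
import zlib

def check_member(member: bytes, decoded: bytes):
    """Return the names of the trailer factors that do not match."""
    crc, data_size, member_size = struct.unpack("<IQQ", member[-20:])
    mismatches = []
    if crc != zlib.crc32(decoded):
        mismatches.append("CRC")
    if data_size != len(decoded):
        mismatches.append("data size")
    if member_size != len(member):
        mismatches.append("member size")
    return mismatches

# 6-byte header + 2-byte placeholder payload + 20-byte trailer = 28 bytes.
good = b"LZIP\x01\x0c" + b"xx" + struct.pack("<IQQ", zlib.crc32(b"data"), 4, 28)
bad = good[:-20] + bytes([good[-20] ^ 1]) + good[-19:]   # damage the CRC
print(check_member(good, b"data"))      # []
print(check_member(bad, b"data"))       # ['CRC']
```

     A lone mismatch with the other two factors in agreement points to
     damage in the trailer rather than in the data.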

'Multiple implementations'
     Just as the lzip format provides 3-factor protection against
     undetected data corruption, the development methodology of the lzip
     family of compressors provides 3-factor protection against undetected
     programming errors.

     Three related but independent compressor implementations, lzip, clzip,
     and minilzip/lzlib, are developed concurrently. Every stable release
     of any of them is tested to check that it produces identical output to
     the other two. This corroborates that all three implement the same
     algorithm, and makes it unlikely that any of them may contain serious
     undiscovered errors. In fact, no errors have been discovered in lzip
     since 2009.

     Additionally, the three implementations have been extensively tested
     with lziprecover, valgrind, and 'american fuzzy lop' without finding a
     single vulnerability or false negative. *Note Unzcrash:
     (lziprecover)Unzcrash.

'Dictionary size'
     Lzip automatically adapts the dictionary size to the size of each file.
     In addition to reducing the amount of memory required for
     decompression, this feature also minimizes the probability of being
     affected by RAM errors during compression.

'Exit status'
     Returning a warning status of 2 is a design flaw of compress that
     leaked into the design of gzip. Both bzip2 and lzip are free from this
     flaw.

