info lzip

File: lzip.info, Node: Algorithm, Next: Trailing data, Prev: Quality assurance, Up: Top

8 Algorithm
***********

In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a
concrete algorithm; it is more like "any algorithm using the LZMA coding
scheme". LZMA compression consists in describing the uncompressed data as a
succession of coding sequences from the set shown in Section 'What is
coded' (*note what is coded::), and then encoding them using a range
encoder. For example, the option '-0' of lzip uses the scheme in almost the
simplest way possible; issuing the longest match it can find, or a literal
byte if it can't find a match. Inversely, a more elaborate way of finding
coding sequences of minimum size than the one currently used by lzip could
be developed, and the resulting sequence could also be coded using the LZMA
coding scheme.

Lzip currently implements two variants of the LZMA algorithm: fast (used by
option '-0') and normal (used by all other compression levels).

The high compression of LZMA comes from combining two basic, well-proven
compression ideas: sliding dictionaries (LZ77) and Markov models (the thing
used by every compression algorithm that uses a range encoder or similar
order-0 entropy coder as its last stage) with segregation of contexts
according to what the bits are used for.

Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder,
which reduces redundancy by translating repeated chunks of data to their
corresponding distance-length pairs. The second stage is a range encoder
that uses different sets of probability models for each type of symbol:
distances, lengths, literal bytes, and bits.

Here is how it works, step by step:

1) The member header is written to the output stream.

2) The first byte is coded literally, because there are no previous bytes
to which the match finder can refer to.

3) The main encoder advances to the next byte in the input data and calls
the match finder.

4) The match finder fills an array with the minimum distances before the
current byte where a match of a given length can be found.

5) Go back to step 3 until a sequence (formed of matches, repeated matches,
and literal bytes) of minimum price has been formed. Where the price
represents the number of output bits produced.

6) The range encoder encodes the sequence produced by the main encoder and
sends the bytes produced to the output stream.

7) Go back to step 3 until the input data are finished or until the member
or volume size limits are reached.

8) The range encoder is flushed.

9) The member trailer is written to the output stream.

10) If there are more data to compress, go back to step 1.

During compression, lzip reads data in large blocks (one dictionary size at
a time). Therefore it may block for up to tens of seconds any process
feeding data to it through a pipe. This is normal. The blocking intervals
get longer with higher compression levels because dictionary size increases
(and compression speed decreases) with compression level.

The ideas embodied in lzip are due to (at least) the following people:
Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrei Markov (for the
definition of Markov chains), G.N.N. Martin (for the definition of range
encoding), Igor Pavlov (for putting all the above together in LZMA), and
Julian Seward (for bzip2's CLI).