info lzip


File: lzip.info,  Node: Stream format,  Next: Quality assurance,  Prev: File format,  Up: Top

6 Format of the LZMA stream in lzip files
*****************************************

The LZMA algorithm has three parameters, called 'special LZMA properties',
to adjust it for some kinds of binary data. These parameters are:
'literal_context_bits' (with a default value of 3),
'literal_pos_state_bits' (with a default value of 0), and 'pos_state_bits'
(with a default value of 2). As a general purpose compressed format, lzip
only uses the default values for these parameters. In particular
'literal_pos_state_bits' has been optimized away and does not even appear
in the code.

The first byte of the LZMA stream is set to zero to help tools like grep
recognize lzip files as binary files.

The LZMA stream is terminated by an 'End Of Stream' (EOS) marker (a match
with distance = 2^32 - 1 and length = 2), which in conjunction with the
'member size' field in the member trailer allows the checking of stream
integrity. The LZMA stream in lzip files always has these two features
(default parameter values 3, 0, 2, and EOS marker) and is referred to in
this document as LZMA-302eos. This simplified and marker-terminated form of
the LZMA stream format has been chosen to achieve complete interoperability
and robust safety.

The second stage of LZMA is a range encoder that uses different sets of
probability models for each type of symbol: distances, lengths, literal
bytes, and bits. Range encoding conceptually encodes all the symbols of the
message into one number. Unlike Huffman coding, which assigns to each symbol
a bit-pattern and concatenates all the bit-patterns together, range encoding
can compress one symbol to less than one bit. Therefore the compressed data
produced by a range encoder can't be split in pieces that could be described
individually.

It seems that the only way of describing the LZMA-302eos stream is to
describe the algorithm that decodes it. And given the many details about
the range decoder that need to be described accurately, the source code of
a real decompressor seems the only appropriate reference to use.

What follows is a description of the decoding algorithm for LZMA-302eos
streams using as reference the source code of lzd, an educational
decompressor for lzip files, included in appendix A. *Note Reference source
code::. Lzd is written in C++11 and can be downloaded from the lzip download
directory.

6.1 The range decoder
=====================

A LZMA stream looks like a large pseudo-random binary number in big-endian
order. The range decoder reads this number one byte at a time starting from
the most significant byte (see 'normalize' in the source). As it reads the
number, the range decoder decodes one by one all the bits forming the LZMA
coding sequences. The compression ratio (number of bits decoded per input
byte) depends on how well these bits agree with their respective contexts.
The function 'decode_bit' decodes each bit using the probability associated
to one of the contexts or subcontexts described below, while the function
'decode' decodes the bits with a fixed 0.5 probability.

The range decoder state consists of two unsigned 32-bit variables: 'range'
(representing the most significant part of the size of the input range not
yet decoded) and 'code' (representing the current point within 'range').
'range' is initialized to 2^32 - 1 and 'code' is initialized to 0,
representing the whole range of the first 4 bytes of the input binary
number.

6.2 What is coded
=================

The LZMA stream includes literals, matches, and repeated matches (matches
reusing a recently used distance). A recently used distance is either 0 or a
copy of a previous value of 'dis'. There are 7 different coding sequences:

Bit sequence                Name        Description
-----------------------------------------------------------------------------
0 + byte                    literal     literal byte
1 + 0 + len + dis           match       LZ distance-length pair
1 + 1 + 0 + 0               shortrep    1 byte match at latest used distance
1 + 1 + 0 + 1 + len         rep0        len bytes match at latest used distance
1 + 1 + 1 + 0 + len         rep1        len bytes match at second latest used
                                        distance
1 + 1 + 1 + 1 + 0 + len     rep2        len bytes match at third latest used
                                        distance
1 + 1 + 1 + 1 + 1 + len     rep3        len bytes match at fourth latest used
                                        distance

'repN' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. 'rep' is any one of
'repN' or 'shortrep'. The output buffer is the dictionary.

A match (or rep) is defined to have the same effect as copying the bytes one
by one. Therefore a match can have a length larger than its distance,
allowing it to copy a succession of repeated bytes beyond the end of the
match. Before start decoding the LZMA stream, the four latest used distances
dis0 to dis3 are initialized to 0, and the state of the LZMA decoder is set
as if the previous byte to the first output byte were a 0 decoded as a
literal byte. The first coding sequence in a LZMA stream can be anything
except a match. If the first coding sequence is a literal byte, it is
decoded with a 'literal_state' of 0. If the first coding sequence is a
shortrep, it puts one byte with value 0 in the output buffer. If the first
coding sequence is a repN, it puts 'length' bytes with value 0 in the
output buffer.

As a literal is always valid, and the behavior of a first rep is defined, a
'dis' pointing to a byte position before the start of the output buffer is
the only possible decoding error caused by malformed LZMA input. Therefore,
a LZMA stream that does not contain any match with a distance out of bounds
is a valid LZMA stream.

In the following tables, multibit sequences are coded in normal order, from
most significant bit (MSB) to least significant bit (LSB), except where
noted otherwise.

Lengths range from 2 to 273, but the value coded (the 'len' in the table
above) is length - 2. Lengths are coded in 3 ranges as follows:

Bit sequence                           Description
----------------------------------------------------------------------------
0 + 3 bits                             lengths from 2 to 9
1 + 0 + 3 bits                         lengths from 10 to 17
1 + 1 + 8 bits                         lengths from 18 to 273

For example, a length of 8 is coded as the bit sequence '0110', a length of
16 is coded as the bit sequence '10_110', and a length of 32 is coded as
the bit sequence '11_0000_1110'.


A distance (the 'dis' in the table of coding sequences) is a negative
offset from the position of the latest byte decoded in the output buffer.
Distances range from 0 to the size of the dictionary - 1. The coding of
distances is more complicated than the coding of lengths, so I'll begin by
explaining a simpler version of the encoding.

Imagine you need to encode a number from 0 to 2^32 - 1, and you want to do
it in a way that produces shorter codes for the smaller numbers. You may
first encode the position of the most significant bit that is set to 1,
which you may find by making a bit scan from the left (from the MSB). A
position of 0 means that the number is 0 (no bit is set), 1 means the LSB is
the first bit set (the number is 1), and 32 means the MSB is set (i.e., the
number is >= 2^31). Then, if the position is >= 2, you encode the remaining
position - 1 bits. Let's call these bits "direct bits" because they are
coded directly by value instead of indirectly by position.

The inconvenient of this simple method is that it needs 6 bits to encode the
position, but it just uses 33 of the 64 possible values, wasting almost half
of the codes.

The intelligent trick of LZMA is that it encodes in what it calls a "slot"
the position of the most significant bit set (msb_pos), along with the value
of the next bit, using the same 6 bits that would take to encode the
position alone. This seems to need 66 slots (twice the number of positions),
but for positions 0 and 1 there is no next bit, so the number of slots
needed is 64 (0 to 63). For distances 0 and 1, the slot is equal to the
distance. For distances >= 2, the slot is computed as
(msb_pos - 1) * 2 + next_bit.

The 6 bits representing this "slot number" are then context-coded. If the
distance is >= 4, the remaining bits are encoded as follows. 'direct_bits'
is the amount of remaining bits (from 1 to 30) needed to form a complete
distance, and is calculated as (slot >> 1) - 1. If a distance needs 6 or
more direct_bits, the last 4 bits are encoded separately. The last piece
(all the direct_bits for distances 4 to 127 (slots 4 to 13), or the last 4
bits for distances >= 128 (slot >= 14)) is context-coded in reverse order
(from LSB to MSB) because between distances the LSB tends to correlate
better than more significant bits. For distances >= 128, the
'direct_bits - 4' part is encoded with fixed 0.5 probability.

Distances are coded in 3 ranges as follows:

Bit sequence                           Description
----------------------------------------------------------------------------
slot                                   distances from 0 to 3
slot + direct_bits                     distances from 4 to 127
slot + (direct_bits - 4) + 4 bits      distances from 128 to 2^32 - 1

For example, a distance of 3 is coded as the bit sequence '000011', a
distance of 127 is coded as '001101_11111', a distance of 128 is coded as
'001110_000000', a distance of 2^31 is coded as
'111110_000000_000000_000000_000000_00000', and a distance of 2^32 - 1 is
coded as '111111_111111_111111_111111_111111_111111'.


6.3 The coding contexts
=======================

Range coding encodes or decodes each bit asymmetrically depending on the
probability of the previous bits in the same context being 0. The
probability associated with each context is stored as an integer (see
'Bit_model' in the source). In order to make the prediction more accurate,
and increase the compression ratio, one of an array of contexts (chosen in
function of past data) may be used when coding a given bit. I.e., some
contexts have subcontexts. The indices used to choose one subcontext in
such an array are:

'state'
     A state machine ('State' in the source) with 12 states (0 to 11) coding
     the latest 2 to 4 types of sequences processed. The initial state is 0.

'pos_state'
     Value of the 2 least significant bits of the current position in the
     decoded data.

'literal_state'
     Value of the 3 most significant bits of the latest output byte decoded.

'len_state'
     Coded value of the current match length (length - 2), with a maximum
     of 3. The resulting value is in the range 0 to 3.


The types of previous sequences corresponding to each state are shown in the
following table. '!literal' is any sequence except a literal byte. The last
type in each line is the most recent. States 2 and 5 can be reached in two
ways each.

State   Types of previous sequences
---------------------------------------------
0       literal, literal, literal
1       match, literal, literal
2       repN, literal, literal
2       !literal, shortrep, literal, literal
3       literal, shortrep, literal, literal
4       match, literal
5       repN, literal
5       !literal, shortrep, literal
6       literal, shortrep, literal
7       literal, match
8       literal, repN
9       literal, shortrep
10      !literal, match
11      !literal, rep


The contexts for decoding the type of coding sequence are:

Name            Indices                     Used when
----------------------------------------------------------------------------
bm_match        state, pos_state            sequence start
bm_rep          state                       after sequence 1
bm_rep0         state                       after sequence 11
bm_rep1         state                       after sequence 111
bm_rep2         state                       after sequence 1111
bm_len          state, pos_state            after sequence 110


The contexts for decoding distances are:

Name            Indices                 Used when
----------------------------------------------------------------------------
bm_dis_slot     len_state, bit tree     distance start
bm_dis          reverse bit tree        after slots 4 to 13
bm_align        reverse bit tree        for distances >= 128, after fixed
                                        probability bits


There are two kinds of coding sequences that contain a length: matches
(match) and long repeated matches (repN). Therefore there are two separate
sets of contexts for lengths ('Len_model' in the source). Matches use
'match_len_model' to decode lengths, while long repeated matches use
'rep_len_model'. The contexts in each Len_model are (see 'decode_len' in
the source):

Name            Indices                        Used when
---------------------------------------------------------------------------
choice1         none                           length start
choice2         none                           after sequence 1
bm_low          pos_state, bit tree            after sequence 0
bm_mid          pos_state, bit tree            after sequence 10
bm_high         bit tree                       after sequence 11


The context array 'bm_literal' is special. If the previous output byte was
decoded as a literal, 'bm_literal' acts as a normal bit tree context; the
one selected by 'literal_state'. Else two other bit tree contexts are used
depending on the value of each bit in 'match_byte' (the byte at the latest
used distance), until a bit is decoded that is different from its
corresponding bit in 'match_byte'. After the first difference is found, the
rest of the byte is decoded using the normal bit tree context. (See
'decode_matched' in the source).

6.4 Decoding and checking the LZMA stream
=========================================

After decoding the member header and obtaining the dictionary size, the
range decoder is initialized and then the LZMA decoder enters a loop (see
'decode_member' in the source) where it invokes the range decoder with the
appropriate contexts to decode, bit by bit, the different coding sequences
(matches, repeated matches, and literal bytes), until the 'End Of Stream'
marker is decoded.

Once the 'End Of Stream' marker has been decoded, the decompressor reads and
decodes the member trailer, and checks that the three integrity factors
stored there (CRC, data size, and member size) match those computed from the
data.