Some data compression systems use a value that gives the number of uncompressed bytes to follow, but this is really analogous to a prefix bit (or bits), because a run of just one uncompressed byte is a very common case.
For now, let's assume that 3/4 of the symbols generated are uncompressed bytes. In this case it seems sensible to allocate the shorter codes to the uncompressed bytes. We could use two bits to indicate whether the data is compressed or uncompressed: three of the four combinations denote an uncompressed byte, and one denotes compressed data. But how do we do that, when we already weren't happy about adding bits to the uncompressed bytes?
Assuming an even distribution of the uncompressed byte values, we have another option: use two bits (let them be the two most significant bits) of the uncompressed bytes themselves to indicate compressed data. We keep an escape code that is compared to these bits; if it matches, compressed data follows, and if it doesn't, the rest of the uncompressed byte follows. The escape code is not static: whenever those two bits of an uncompressed byte would match the escape code, an escape sequence is generated instead. This escape sequence contains the offending data and a new escape code. With 2-bit escapes the length of this escape sequence is 2 (escape bits) + 3 (selects escape mode) + 2 (new escape bits) + 6 (rest of the byte) = 13 bits, i.e. it expands the uncompressed byte by 5 bits.
Thus, on average 25% of the uncompressed bytes expand by 5 bits, and the average length of an uncompressed byte becomes 25% * 13 + 75% * 8 = 9.25 bits. Interestingly, this is longer than using one bit to tag uncompressed bytes. However, there is one thing we haven't considered yet: the escape sequence has the possibility to change the escape code. Using this feature to its optimum (escape optimization), the average 25% 'hit rate' becomes the maximum hit rate. Also, because the distribution of the uncompressed byte values is seldom flat (and there is locality we can benefit from), the actual hit rate is always much smaller than that. Empirical studies on the test files (see the compression tests) show that for 2-bit escape codes the realized hit rate is only 1.8-6.4%, while the theoretical maximum is the mentioned 25%.
Previously we assumed a distribution of 75% uncompressed bytes and 25% compressed data (note that this refers to the number of times each class is generated, not to the number of bytes processed). This was the reason for selecting 2 escape bits. For other distributions (differently compressible files, not necessarily better or worse) some other number of escape bits may be more suitable. The following table summarizes the hit rates on the test files for different numbers of escape bits.
1-bit | 2-bit | 3-bit | 4-bit | File |
---|---|---|---|---|
50.0% | 25.0% | 12.5% | 6.25% | Maximum |
25.3% | 2.5% | 0.3002% | 0.0903% | ivanova.bin |
26.5% | 2.4% | 0.7800% | 0.0631% | sheridan.bin |
20.7% | 1.8% | 0.1954% | 0.0409% | delenn.bin |
26.5% | 6.4% | 2.4909% | 0.7118% | bs.bin |
9.06 | 8.32 | 8.15 | 8.050 | bits/byte for bs.bin |
As can be seen from the table, the realized hit rates are dramatically smaller than the theoretical maximum values. One might think we should always select 4-bit (or longer) escapes, because they present the minimum overhead for uncompressed bytes. Unfortunately this isn't the case, because longer escape codes also increase the code length of the compressed data.
Also, the 1-bit escape case validates the original suggestion of using a prefix bit: a simple prefix bit would produce slightly better results (9 bits vs. 9.06 bits, i.e. 106 bytes for bs.bin). On the other hand, the 1-bit escape code is not selected for bs.bin, because the 2-bit escape gives better overall results.
Currently the number of escape bits used is selected automatically, but it can also be overridden by the user with the -e option.
Note: for 7-bit ASCII text files, where the top bit is always 0, the hit rate is 0% even for 1-bit escapes. Thus, uncompressed bytes do not expand at all.
If the top bits of an uncompressed byte do not match the escape code, the byte is output as-is. If the bits match, the escape sequence is output, with the new escape code.
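As a concrete illustration, here is a minimal sketch of the encoder side of this escape mechanism for 2-bit escapes. The helper put_bits() and the toy bit output are assumptions made for this sketch; in pucrunch the new escape code is chosen by the escape optimizer, not simply passed in.

```c
#include <stdio.h>

#define ESC_BITS 2                    /* number of escape bits (2 assumed here) */

static unsigned escape = 0;           /* current escape code (top ESC_BITS bits) */

/* Toy bit output: print bits as characters; a real packer would buffer them. */
static void put_bits(int n, unsigned value) {
    while (n--)
        putchar(((value >> n) & 1) ? '1' : '0');
}

/* Emit one uncompressed byte, escaping it if its top bits collide with the
   current escape code. new_escape is the escape code to switch to on a collision. */
static void put_uncompressed(unsigned char byte, unsigned new_escape) {
    if ((unsigned)(byte >> (8 - ESC_BITS)) != escape) {
        put_bits(8, byte);                            /* no collision: byte as-is   */
    } else {
        put_bits(ESC_BITS, escape);                   /* the matching escape bits   */
        put_bits(3, 0x2);                             /* '010' selects escape mode  */
        put_bits(ESC_BITS, new_escape);               /* the new escape code        */
        put_bits(8 - ESC_BITS, byte & ((1u << (8 - ESC_BITS)) - 1)); /* rest of byte */
        escape = new_escape;
    }
}
```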
The most frequent kind of compressed data is an LZ77 match. The length of the match is output in Elias Gamma code: '0' means a length of 2, '100' a length of 3, '101' a length of 4, and so on. If the length is 2, the next bit defines whether this is LZ77 or RLE/escape: if that bit is 0, an 8-bit LZ77 offset value follows (and the length is 2); if it is 1, the next bit decides between escape (0) and RLE (1).
So, the code for an escaped byte is e..e 010 n..n b..b, where e..e is the current escape code, n..n is the new escape code, and b..b is the rest of the escaped byte.
The end-of-file condition is encoded into the LZ77 position, and the long RLE is encoded into the short RLE length. Read further and you will get a better idea of why this kind of encoding was selected.
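The following sketch shows how a decoder might dispatch on these tags, continuing the escape/ESC_BITS state from the earlier sketch. The bit-reader helpers get_bits() and get_gamma() and the unit handlers put_byte(), copy_match(), read_rle() and read_lz77() are hypothetical names used only for this illustration.

```c
/* Decode one unit of the stream described above (2-bit escapes assumed). */
void decompress_unit(void) {
    unsigned top = get_bits(ESC_BITS);
    if (top != escape) {
        /* Literal: the escape bits plus the rest of the byte. */
        put_byte((top << (8 - ESC_BITS)) | get_bits(8 - ESC_BITS));
        return;
    }
    /* The escape code matched, so compressed data follows. */
    int len = get_gamma();                    /* '0' -> 2, '100' -> 3, '101' -> 4, ... */
    if (len == 2) {
        if (get_bits(1) == 0) {
            copy_match(get_bits(8) + 1, 2);   /* 2-byte LZ77 match, offset 1..256 */
        } else if (get_bits(1) == 0) {
            /* Escape sequence: new escape code, then the rest of the literal. */
            unsigned new_escape = get_bits(ESC_BITS);
            put_byte((escape << (8 - ESC_BITS)) | get_bits(8 - ESC_BITS));
            escape = new_escape;
        } else {
            read_rle();                       /* short or long RLE */
        }
    } else {
        read_lz77(len);                       /* longer match; EOF hides in the offset */
    }
}
```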
When I studied the distribution of the length values (LZ77 and short RLE lengths), I noticed that the smaller the value, the more occurrences it has. The following table shows the length values used for bs.bin.
Length | LZLEN | S-RLE
---|---|---
2 | 1975 | 477
3-4 | 1480 | 330
5-8 | 492 | 166
9-16 | 125 | 57
17-32 | 31 | 33
33-64 | 8 | 15
It almost immediately occurred to me that the optimal way to encode the length values (decremented by one) is:
Linear | Gamma code | Values | Difference
---|---|---|---
0000000 | not possible | |
0000001 | 0 | 1 | -6 bits
000001x | 10 x | 2-3 | -4 bits
00001xx | 110 xx | 4-7 | -2 bits
0001xxx | 1110 xxx | 8-15 | +0 bits
001xxxx | 11110 xxxx | 16-31 | +2 bits
01xxxxx | 111110 xxxxx | 32-63 | +4 bits
1xxxxxx | 111111 xxxxxx | 64-127 | +5 bits

The last column shows the difference compared to a 7-bit linear (non-)encoding. Using this encoding for the length distribution presented above considerably reduces the number of bits used compared to a linear representation. Later I found out that this encoding is in fact the Elias Gamma Code, only with the assignment of 0- and 1-bits in the prefix reversed, and with the length limited. Currently the maximum value is selectable between 64 and 256.
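A sketch of an encoder for this limited, reversed-prefix gamma code (reusing the toy put_bits() from the earlier sketch; the maximum value of 127 matches the table above and is an assumption of the sketch):

```c
#define MAXGAMMA 7   /* 7-bit linear range: values 1..127 */

/* Encode value (1..127) as k one-bits, a terminating zero (omitted for the
   largest prefix), and then the k low-order bits of the value, where
   k = floor(log2(value)). E.g. 1 -> "0", 3 -> "101", 64 -> "111111000000". */
static void put_gamma(unsigned value) {
    int k = 0;
    while ((value >> (k + 1)) != 0)
        k++;                                   /* k = floor(log2(value)) */
    put_bits(k, (1u << k) - 1);                /* k one-bits             */
    if (k < MAXGAMMA - 1)
        put_bits(1, 0);                        /* terminating zero       */
    put_bits(k, value & ((1u << k) - 1));      /* k low bits of value    */
}
```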
However, the distribution of the LZ77 offset (LZPOS) values is not at all similar. Admittedly it isn't exactly flat, but it isn't anywhere near as steep as the length value distribution either. I decided to encode the lower 8 bits of the value (automatically selected, or user-selectable between 8 and 12 bits in the current version) as-is (i.e. linearly) and the upper bits with my version of the Elias Gamma Code. Because the upper part needs to represent the value 0 and the gamma code can't, the upper part of the LZ77 offset is incremented by one before encoding (unlike the length values, which are decremented by one, because 2 is the smallest length to be encoded). Also, one of the most-significant-part codes is reserved for the end-of-file (EOF) symbol. This restricts the offset values somewhat, but the loss in compression is negligible.
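A sketch of the offset encoding under these rules (reusing put_bits() and put_gamma() from the earlier sketches; the fixed 8-bit low part and the omission of the reserved EOF code are simplifying assumptions):

```c
#define LZ_LOW_BITS 8    /* low bits stored linearly (8..12 in the program) */

/* Encode an LZ77 offset (1-based): the upper part is gamma-coded after being
   incremented by one so that zero becomes representable; the low bits go
   out linearly. The reserved EOF code is ignored in this sketch. */
static void put_lzpos(unsigned offset) {
    unsigned pos = offset - 1;                               /* 0-based          */
    put_gamma((pos >> LZ_LOW_BITS) + 1);                     /* upper part, +1   */
    put_bits(LZ_LOW_BITS, pos & ((1u << LZ_LOW_BITS) - 1));  /* low bits as-is   */
}
```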
It is also the case that 2-byte LZ77 matches only gain 4 bits (with 2-bit escapes) for offsets from 1 to 256, and only 2 bits for offsets from 257 to 768: in the first case 9 bits are used to represent the offset, in the latter case 11 bits. The first case (offsets 1..256) is much more frequent than the second, because it saves more bits, and also because the symbol source statistics usually guarantee 2-byte matches in the recent history (a much better chance than for 3-byte matches, for example).
So, if we restrict the offset of a 2-byte LZ77 match to 8 bits (1..256), we don't lose much compression at all, and we can shorten the code by one bit. Or, as I have done, we can use that bit to select whether this code really is LZ77 or something else. Compared to the older coding, the codes for the escape sequence, RLE and end of file are still the same length, but the code for LZ77 has been shortened by one bit. Because LZ77 is used more frequently than any of the other codes, this saving more than compensates for the loss of 2-byte LZ77 matches with offsets 257..768.
The long RLE selection is encoded into the short RLE code. Short RLE only uses half of its coding space, i.e. if the maximum value for the gamma code is 127, short RLE uses only the values 1..63. Larger values switch the decoder into long RLE mode, and more bits are read to complete the run length value. Additional compression for RLE is gained by using a table (and the Elias Gamma Code) for the 31 top-ranking RLE bytes. The RLE bytes are ranked by the number of times they are used, and the top 31 are put into a table, which is then indexed by an Elias Gamma Code. RLE bytes that are not in the table use the gamma code values 31..63 and some extra bits to encode the rest of the byte.
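A rough sketch of the RLE byte coding (rle_rank[] is assumed to already hold the 31 most frequently used run bytes; the exact code assignment for bytes outside the table is only illustrative here, and put_gamma()/put_bits() are the earlier sketches):

```c
static unsigned char rle_rank[31];      /* the 31 top-ranking RLE byte values */

/* Emit the byte value of a run: a short gamma code for ranked bytes,
   a longer gamma code plus extra bits for the rest. */
static void put_rle_byte(unsigned char value) {
    int i;
    for (i = 0; i < 31; i++) {
        if (rle_rank[i] == value) {
            put_gamma(i + 1);           /* ranked byte: gamma code 1..31        */
            return;
        }
    }
    put_gamma(32 + (value >> 3));       /* unranked byte: gamma code 32..63 ... */
    put_bits(3, value & 7);             /* ... plus extra bits for the rest     */
}
```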
Then why does this optimization give better compression results? It works because you are not forced to take every compression opportunity (the compression community seems to call this "lazy coding" or "non-greedy" selection), but can select the best ones. You may want to emit a byte uncompressed even if there is a 2-byte LZ77 match, because at the next position in the file there may be a longer match. It is actually more complicated than that, but just take my word for it that there is a difference. :-)
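To illustrate the idea, here is a much simplified, one-step "lazy" selection loop. Pucrunch itself optimizes over the whole file rather than looking just one position ahead; match_length(), emit_literal() and emit_match() are hypothetical helpers.

```c
/* One-step lazy match selection: postpone a match if the next position
   offers a longer one. A simplification of what the real optimizer does. */
void lazy_select(const unsigned char *indata, int inlen) {
    int p = 0;
    while (p < inlen) {
        int len = match_length(p);          /* longest LZ77/RLE match at p  */
        if (len >= 2 && p + 1 < inlen && match_length(p + 1) > len) {
            emit_literal(indata[p]);        /* a better match starts at p+1 */
            p += 1;
        } else if (len >= 2) {
            emit_match(p, len);
            p += len;
        } else {
            emit_literal(indata[p]);
            p += 1;
        }
    }
}
```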
Note that most LZ77 compression programs need at least a 3-byte match to break even, i.e. not expanding the data.
To be able to find the minimal path, the algorithm needs, for each location in the file, the length of the RLE (the number of identical bytes following) and the maximum LZ77 length/offset (the longest identical string appearing earlier in the file). This is the most time- and memory-consuming part of the compression. I have used several methods to make the search faster. The RLE lengths, for example, can be determined with a simple scan:
```c
unsigned char *a = indata + P, val = *a++;
int top = inlen - P;
int rlelen = 1, i;

/* Loop for the whole RLE run */
while (rlelen < top && *a++ == val)
    rlelen++;

/* Each position inside the run gets the number of run bytes remaining. */
for (i = 0; i < rlelen - 1; i++)
    rle[P + i] = rlelen - i;
```
This is a very slow way to do the search, though. Theoretically it could take on the order of n³ comparisons to process a file of length n (a mathematically inclined person could probably give a better estimate). However, using the RLE lengths to our advantage lets us rule out the worst case, which happens when all bytes have the same value: we only search for LZ77 matches if the current file position has a shorter RLE sequence than the maximum LZ77 copy length.
The first thing I did to improve the speed was to remember the position where each byte value was last seen. A simple 256-entry table handles that. Using this table, the search can start directly from the first potential match, and we no longer need to look for it byte by byte.
That didn't give much of an improvement, but then I increased the table to 256*256 entries, making it possible to locate the latest occurrence of any byte pair instead. Because the shortest string that offers any compression (for the used LZ77 encoding) is two bytes long, this byte-pair history is very suitable indeed. Also, the first (shortest possible, i.e. 2-byte) match is found directly from the byte-pair history. This gave a moderate 30% decrease in compression time for one of my test files (from 28 minutes to 17 minutes on a 25 MHz 68030).
The second idea was to quickly discard the strings that had no chance of being longer matches than the one already found. A one-byte hash value is calculated from each three-byte group. The values are calculated once and put into a table, so we only need two memory fetches to know whether two 3-byte groups differ (if the hash values are different, at least one of the bytes differs; if the hash values are the same, the original bytes have to be compared). The hash values at the strategic positions of the strings are then compared. This strategic position is chosen so that the hashed three-byte group covers the last two bytes of a match of the current best length and the first byte beyond it. If the hash values differ, there is no chance that the match is longer than the current one; it may not even be as long, because one of the two earlier bytes may differ. If the hash values are the same, the brute-force byte-by-byte compare has to be done. However, the hash check already discards a huge number of candidates and more than generously pays back its own memory references. Using the hash values, the compression time shortened by 50%, from 17 minutes to 8 minutes.
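A sketch of this hash screen, with illustrative names (the actual mixing function in pucrunch may differ; any function of the three bytes works as long as equal groups get equal hashes):

```c
/* One-byte hash for every 3-byte group in the input. */
static void init_hash(unsigned char *hashValue,
                      const unsigned char *indata, int inlen) {
    int i;
    for (i = 0; i + 2 < inlen; i++)
        hashValue[i] = (unsigned char)(indata[i] * 3 +
                                       indata[i + 1] * 5 +
                                       indata[i + 2] * 7);
}

/* With a best match of length bestLen at position p, a candidate at cand can
   only be longer if the groups at the strategic position have equal hashes. */
static int may_be_longer(const unsigned char *hashValue,
                         int p, int cand, int bestLen) {
    return hashValue[cand + bestLen - 2] == hashValue[p + bestLen - 2];
}
```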
I also reworked some code to help my compiler generate better code but, as expected, only got a 20% shorter compression time: 6½ minutes.
Okay, the byte-pair table tells us where the latest occurrence of any byte pair is located. Still, for the occurrence before that one we would have to do a brute-force search. The next improvement was to use the byte-pair table to generate a linked list of the byte pairs with the same value. In fact, this linked list can be trivially represented as a table, using the same indexing as the file positions: to locate the previous occurrence of the 2-byte string starting at location P, look at backSkip[P].
```c
/* Update the two-byte history & backSkip */
if (P + 1 < inlen) {
    int index = (indata[P] << 8) | indata[P + 1];

    backSkip[P] = lastPair[index];
    lastPair[index] = P + 1;
}
```

Actually, the values in the table are one bigger than the real table indices. This is because the values are of type unsigned short, and I wanted zero to mean "not occurred".
This table makes the search of the next (previous) location to consider much faster. The compression time was reduced from 6 minutes to 1 minute 10 seconds. Quite an improvement from the original 28 minutes!
There is also another trick that takes advantage of the already determined RLE lengths. If the RLE lengths at the two positions to compare don't match, we can directly skip to the previous potential match. Note that the run bytes (the data bytes) themselves need not be compared, because the first byte (in fact the first two bytes) is always equal at both positions (the backSkip table guarantees that). The RLE length value can also be used to skip the equal start of the strings when comparing them.
Note that a compression method similar to RLE can be realized using just LZ77: you emit the byte to copy and then output an LZ77 code with offset 1 and the original run length minus 1. You can thus consider RLE a special case, which offers a tighter encoding of the same information. Also, as LZ77 limits the copy size to 64/128/256 bytes, an RLE version providing lengths of up to 32 kilobytes is a big improvement, even if its code is somewhat longer.
Another improvement to the search code (3.6.1997) made it dramatically faster than before on highly redundant files (such as pic from the Calgary Corpus, which had been its Achilles' heel until then). Basically, the new search method just skips over the RLE part (if any) at the search position when using the backSkip table, and then checks whether the located position has an equal number (and value) of RLE bytes before it.
```
backSkip[] lastPair[]
 __  ________
/  \/        \
...AB.....A..ABCD      rle[p] = the number of A's, B is something else
   ^      ^  ^
   |      |  |
   i      p  p+rle[p]-1
```

The algorithm searches for the byte pair at p + rle[p]-1, i.e. the last RLE byte ('A') followed by the non-matching byte ('B'). When it finds such a location (a simple lastPair[]/backSkip[] lookup), it checks whether the RLE at the compare position (i - (rle[p]-1)) is long enough, i.e. whether there are at least as many A's before the B there as at the current position. If there are, the normal hash compare is performed, and if that succeeds, the brute-force byte compare is done.
There are dramatically fewer matches for "AB" than for "AA", so we get a huge speedup with this approach, and we are still guaranteed to find the most recent longest match there is.
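Put together, the run-aware candidate search might look roughly like the sketch below. It only shows the skeleton; rle[], lastPair[] and backSkip[] follow the earlier snippets (position + 1 stored, 0 meaning "not seen"), the tables are assumed to contain only positions before p, boundary checks near the end of the file are omitted, and hash_check()/byte_compare() stand for the checks described above.

```c
/* Find candidate positions for the string starting at p, skipping the RLE
   part first and discarding candidates whose preceding run is too short. */
void search_matches(const unsigned char *indata, int p) {
    int runlen = rle[p];                        /* identical bytes at p        */
    int pos    = p + runlen - 1;                /* last run byte ('A')         */
    int index  = (indata[pos] << 8) | indata[pos + 1];   /* the pair "AB"     */
    int i      = lastPair[index] - 1;           /* latest earlier "AB", or -1  */

    while (i >= 0) {
        /* The candidate must have at least as many 'A's in front of it. */
        if (i - (runlen - 1) >= 0 && rle[i - (runlen - 1)] >= runlen) {
            int cand = i - (runlen - 1);        /* start of the candidate      */
            /* ... hash_check(p, cand) and, if it passes, byte_compare() ...  */
        }
        i = backSkip[i] - 1;                    /* previous "AB" occurrence    */
    }
}
```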
Also, the escape optimization is much faster with a large number of escape bits. This is needed too, because the number of escape bits can now vary from 0 (uncompressed bytes are always escaped) to 8.
More test files are welcome!
bs.bin (41537 bytes)

Packer | Size | Left | Comment
---|---|---|---
ByteBonker 1.5 | 27326 | 65.8% | Mode 4
Cruelcrunch 2.2 | 27136 | 65.3% | Mode 1
The AB Cruncher | 27020 | 65.1% |
Pu-Crunch | 26438 | 63.7% | -m5

delenn.bin (47105 bytes)

Packer | Size | Left | Comment
---|---|---|---
The AB Cruncher | N/A | N/A | Crashes
ByteBonker 1.5 | 21029 | 44.6% | Mode 3
Cruelcrunch 2.2 | 20672 | 43.9% | Mode 1
Pu-Crunch | 19791 | 42.0% | -p2

sheridan.bin (47105 bytes)

Packer | Size | Left | Comment
---|---|---|---
ByteBonker 1.5 | 13661 | 29.0% | Mode 3
Cruelcrunch 2.2 | 13595 | 28.9% | Mode H
The AB Cruncher | 13534 | 28.7% |
Pu-Crunch | 12554 | 26.7% | -p2 -m7

ivanova.bin (47105 bytes)

Packer | Size | Left | Comment
---|---|---|---
ByteBonker 1.5 | 11016 | 23.4% | Mode 1
Cruelcrunch 2.2 | 10883 | 23.1% | Mode H
The AB Cruncher | 10743 | 22.8% |
Pu-Crunch | 9868 | 21.0% | -p2 -m7
LhA | 9543 | 20.3% | Decompressor not included
gzip -9 | 9474 | 20.1% | Decompressor not included
The decompression code is included in the compressed files, although it is not valid for files over 63k (compressed or uncompressed size). About 34 bytes are decompression parameters, the rest (approx. 300 bytes) is 6510 machine language.
The results surprised me, because the compression algorithm IS developed for a very special case. It only has a fixed code for LZ77/RLE lengths, not even a static one (fixed != static != adaptive)! Also, it does not use arithmetic code to compress the bytes that are not part of LZ77 (and RLE) matches. Because most of the big files are ASCII text, this somewhat handicaps my compressor. Also, decompression is relatively fast, and uses no extra memory.
I'm getting relatively near LhA, and actually beat it on 7 of the files (with the 300-byte decompressor included!); in the other cases I come relatively near or beat it if the decompressor is removed.
Test system: FreeBSD, epsilon3.vlsi.fi, 200 MHz PentiumPro®.
Decompression on the 200 MHz PentiumPro®: 1.4 seconds.
Estimated decompression on a C64 (1 MHz 6510): 6:47.

File | Options | In | Out | b/B | Ratio | Gained | Time | LhA Out | Zip Out | GZip -9 Out
---|---|---|---|---|---|---|---|---|---|---
bib | -p4 -m7 | 111261 | 35719 | 2.57 | 32.1% | 67.9% | 0:05 | 40740 | 35041 | 34900
book1 | -p4 -m7 | 768771 | 325015 | 3.38 | 42.3% | 57.7% | 0:54 | 339074 | 313352 | 312281
book2 | -p4 -m7 | 610856 | 211937 | 2.78 | 34.7% | 65.3% | 0:30 | 228442 | 206663 | 206158
geo | -p2 -m7 | 102400 | 72945 | 5.70 | 71.2% | 28.8% | 0:06 | 68574 | 68471 | 68414
news | -p4 -m7 | 377109 | 145942 | 3.10 | 38.8% | 61.3% | 0:11 | 155084 | 144817 | 144400
obj1 | | 21504 | 10763 | 4.00 | 50.1% | 49.9% | 0:01 | 10310 | 10300 | 10320
obj2 | -p1 -m7 | 246814 | 83455 | 2.71 | 33.8% | 66.2% | 0:08 | 84981 | 81608 | 81087
paper1 | -p3 -m7 | 53161 | 19721 | 2.97 | 37.1% | 62.9% | 0:01 | 19676 | 18552 | 18543
paper2 | -p3 -m7 | 82199 | 31443 | 3.06 | 38.3% | 61.7% | 0:03 | 32096 | 29728 | 29667
paper3 | -p3 -m7 | 46526 | 19424 | 3.34 | 41.7% | 58.3% | 0:01 | 18949 | 18072 | 18074
paper4 | -p1 -m5 | 13286 | 6143 | 3.70 | 46.2% | 53.8% | 0:00 | 5558 | 5511 | 5534
paper5 | -p1 | 11954 | 5527 | 3.70 | 46.2% | 53.8% | 0:00 | 4990 | 4970 | 4995
paper6 | -p2 -m7 | 38105 | 14286 | 3.00 | 37.5% | 62.5% | 0:01 | 13814 | 13207 | 13213
pic | -p1 -m7 | 513216 | 58028 | 0.90 | 11.3% | 88.7% | 0:17 | 52221 | 56420 | 52381
progc | -p2 -m7 | 39611 | 14308 | 2.89 | 36.1% | 63.9% | 0:01 | 13941 | 13251 | 13261
progl | -p1 -m7 | 71646 | 17159 | 1.92 | 23.9% | 76.1% | 0:02 | 16914 | 16249 | 16164
progp | -p1 -m7 | 49379 | 11892 | 1.93 | 24.1% | 75.9% | 0:01 | 11507 | 11222 | 11186
trans | -p2 -m7 | 93695 | 19607 | 1.67 | 20.9% | 79.1% | 0:03 | 22578 | 18961 | 18862
total | | 3251493 | 1103314 | 2.71 | 33.9% | 66.1% | 2:36 | | |
Also, it is now possible to decompress files compressed with the program (it must be the same version), using the -u option.
ivanova -5 bytes, sheridan -14, delenn -26, bs -29
When generating the output, the program rescans the LZ77 matches. This is because the optimization can shorten the matches, and a shorter match may be found much nearer than the original longer match. Because longer offsets usually need more bits than shorter ones, we save some bits for each match of this kind. Actually, the rescan should be done in OptimizeLength() to get the most out of it, but that is too much work right now (and would make the optimization even slower).
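The idea of the rescan, in very rough form (a sketch only, assuming helpers such as find_nearest_match() and the earlier put_gamma()/put_lzpos(); the real output generator of course handles all the unit types, not just LZ77 matches):

```c
/* When writing out a chosen LZ77 match, look again for the nearest earlier
   position that still provides the chosen (possibly shortened) length, so
   that the offset, and thus its gamma-coded upper part, is as small as
   possible. mode[], chosen_length[] and MODE_LZ77 are illustrative names. */
if (mode[p] == MODE_LZ77) {
    int len    = chosen_length[p];
    int offset = find_nearest_match(p, len);   /* smallest offset giving len */
    put_gamma(len - 1);                        /* length, decremented by one */
    put_lzpos(offset);
    p += len;
}
```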