Diffstat (limited to 'contrib/compressor/algorithm.doc')
-rw-r--r-- | contrib/compressor/algorithm.doc | 58 |
1 file changed, 0 insertions, 58 deletions
diff --git a/contrib/compressor/algorithm.doc b/contrib/compressor/algorithm.doc
deleted file mode 100644
index 74a7646c..00000000
--- a/contrib/compressor/algorithm.doc
+++ /dev/null
@@ -1,58 +0,0 @@
-The compressor achieves an average compression rate of 60% of the
-original size, which is on par with "gzip". It seems that you cannot do
-much better for compressing compiled binaries. This means that the
-break-even point for using compressed images is reached once the
-uncompressed size approaches 1.5kB. We can stuff more than 12kB into
-an 8kB EPROM and more than 25kB into a 16kB EPROM. As there is only
-32kB of RAM for both the uncompressed image and its BSS area, this
-means that 32kB EPROMs will hardly ever be required.
-
-The compression algorithm uses a 4kB ring buffer for buffering the
-uncompressed data. Before compression starts, the ring buffer is
-filled with spaces (ASCII character 0x20). The algorithm tries to
-find repeated input sequences of a maximum length of 60 bytes. All
-256 different input bytes plus the 58 (60 minus a threshold of 2)
-possible repeat lengths form a set of 314 symbols. These symbols are
-adaptively Huffman encoded. The algorithm starts out with a Huffman
-tree that assigns equal code lengths to each of the 314 symbols
-(slightly favoring the repeat symbols over symbols for regular input
-characters), but the tree is updated whenever the frequency of any of
-the symbols changes. Frequency counts are kept in 16-bit words until
-the total number of compressed codes reaches 2^15. Then, all frequency
-counts are halved (rounding up). For unrepeated characters (symbols
-0..255) the Huffman code is written to the output stream. For repeated
-characters the Huffman code, which denotes the length of the repeated
-character sequence, is written out, and then the index in the ring
-buffer is computed. From this index, the algorithm computes the offset
-relative to the current index into the ring buffer. For typical input
-data, one would expect short to medium range offsets to be more
-frequent than extremely short or medium to long range offsets. The
-12-bit (for a 4kB buffer) offset value is therefore statically Huffman
-encoded using a precomputed Huffman tree that favors those offset
-values that are deemed to be more frequent. The Huffman-encoded offset
-is written to the output data stream, directly following the code that
-determines the length of repeated characters.
-
-This algorithm, as implemented in the C example code, looks very good
-and its operating parameters are already well optimized. This also
-explains why it achieves compression ratios comparable with
-"gzip". Depending on the input data, it sometimes excels considerably
-beyond what "gzip -9" does, but this does not appear to be typical.
-There are some flaws in the algorithm, such as the limited buffer
-sizes, the adaptive Huffman tree, which adapts very slowly when the
-input characters experience a sudden change in distribution, and the
-static Huffman tree for encoding offsets into the buffer. The slow
-adaptation of the Huffman tree is partially counteracted by
-artificially keeping 16-bit precision for the frequency counts, but
-this does not come into play until 32kB of compressed data has been
-output, so it does not have any impact on our use for "etherboot",
-because the boot PROM does not support uncompressed data of more than
-32kB (cf. doc/spec.doc).
-
-Nonetheless, these problems do not seem to affect compression of
-compiled programs very much. Mixing object code with English text
-would not work too well, though, and the algorithm should be reset in
-between. Actually, we might gain a little improvement if text and
-data segments were compressed individually, but I have not
-experimented with this option yet.
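To make the numbers in the deleted algorithm.doc concrete, here is a minimal C
sketch of the bookkeeping it describes: the 4kB ring buffer pre-filled with
spaces, the 314-symbol alphabet (256 literals plus 58 repeat lengths above the
threshold of 2), the halving of the frequency counts once 2^15 codes have been
emitted, and the 12-bit backward offset into the ring buffer. This is not the
contrib/compressor source; the names (init_ring, symbol_of, rescale_freq,
match_offset) and the surrounding scaffolding are assumptions for illustration,
and the adaptive Huffman tree update itself is omitted.

    #include <stdio.h>

    /* Parameters as described in algorithm.doc (LZSS + adaptive Huffman).
     * The identifiers below are illustrative, not the original source. */
    #define N         4096                   /* ring buffer size (4kB)            */
    #define F         60                     /* longest match emitted             */
    #define THRESHOLD 2                      /* matches of <= 2 bytes are literals */
    #define N_CHAR    (256 - THRESHOLD + F)  /* 256 literals + 58 lengths = 314   */
    #define MAX_FREQ  0x8000                 /* rescale when code count hits 2^15 */

    static unsigned char  ring[N];
    static unsigned short freq[N_CHAR];

    /* Fill the ring buffer with spaces before compression starts. */
    static void init_ring(void)
    {
        int i;
        for (i = 0; i < N; i++)
            ring[i] = 0x20;
    }

    /* Map a literal byte or a match length to one of the 314 symbols:
     * literals are 0..255, lengths THRESHOLD+1..F become 256..313. */
    static int symbol_of(int literal, int match_len)
    {
        if (match_len <= THRESHOLD)
            return literal;
        return 256 + (match_len - (THRESHOLD + 1));
    }

    /* Halve all frequency counts, rounding up, once the total number of
     * coded symbols reaches MAX_FREQ, so the counts stay within 16 bits
     * and older statistics decay. */
    static void rescale_freq(void)
    {
        int i;
        for (i = 0; i < N_CHAR; i++)
            freq[i] = (freq[i] + 1) / 2;
    }

    /* Offset of a match, measured backwards from the current write position
     * in the ring buffer; 12 bits suffice to address the 4kB buffer. */
    static int match_offset(int current, int match_pos)
    {
        return (current - match_pos) & (N - 1);
    }

    int main(void)
    {
        init_ring();

        /* Pretend the literal 'A' has been coded many times, then rescale. */
        freq[symbol_of('A', 0)] = 1000;
        rescale_freq();

        printf("symbols: %d (256 literals + %d repeat lengths)\n",
               N_CHAR, F - THRESHOLD);
        printf("'A' -> symbol %d, 60-byte match -> symbol %d\n",
               symbol_of('A', 0), symbol_of(0, F));
        printf("freq['A'] after rescale: %u\n", (unsigned)freq['A']);
        printf("offset of a match 100 bytes back: %d\n",
               match_offset(500, 400));
        return 0;
    }

Per the document, the rescale step serves double duty: it keeps the counts
within 16-bit words, and the resulting decay of old statistics is what
partially compensates for the otherwise slow adaptation of the Huffman tree.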