The latest 4.13.9 source release of the Linux kernel is 780 MiB, but thanks to xz compression, the download is a much more manageable 96 MiB (an 88% reduction).
Before xz took over as the default compression format on kernel.org in 2013, following the "latest" link would have gotten you a bzip2 compressed file. The tar.bz2 would have been 115 MiB (-85%), but there was no defending the extra 20 MiB after xz caught up in popularity. bzip2 is all but displaced today.
bzip2 became the default in 2003, though it had long been an option over the less efficient gzip. However, since every OS, browser, language core library, phone and IoT lightswitch has built-in support for gzip, a 148 MiB (-81%) tar.gz remains an option even today.
gzip itself started taking over in 1994, before kernel.org, and before the World Wide Web went mainstream. It must have been a particularly easy sell for the fledgeling Linux kernel: it was made, used and endorsed by the mighty GNU project, it was Free Software, free of patent restrictions, and it provided powerful .zip style DEFLATE compression in a Unix friendly package.
Another nice benefit was that gzip could decompress other contemporary formats, thereby replacing contested and proprietary software.
Among the tools it could replace was compress, the de-facto Unix standard at the time. Created based on LZW in 1985, it was hampered by the same patent woes that plagued GIF files. The then-ubiquitous .Z suffix graced the first public Linux releases, but is now recognized only by the most long-bearded enthusiasts. The current release would have been 302 MiB (-61%) with compress.
Another even more obscure tool it could replace was compress’s own predecessor, pack. This rather loosely defined collection of only partially compatible formats is why compress had to use a capital Z in its extension. pack came first, and offered straight Huffman coding with a .z extension.
With pack, our Linux release would have been 548 MiB (-30%). Compared to xz’s 96 MiB, it’s obvious why no one has used it for decades.
Well, guess what: gzip never ended its support! Quoth the man page,
gunzip can currently decompress files created by gzip, zip, compress, compress -H or pack.
While multiple implementations existed, these were common peculiarities:
- They could not be used in pipes.
- They could not represent empty files.
- They could not compress a file with only one byte value, e.g. "aaaaaa…"
- They could fail on "large" files. "can’t occur unless [file size] >= [16MB]", a comment said dismissively, from the time when a 10MB hard drive was a luxury few could afford.
These issues stemmed directly from the Huffman coding used. Huffman coding, developed in 1952, is basically an improvement on Morse code, where common characters like "e" get a short code like "011", while uncommon "z" gets a longer one like "111010".
- Since you have to count the characters to figure out which are common, you cannot compress in a single pass in a pipe. Now that memory is cheap, you could mostly get around that by keeping the data in RAM.
- Empty files and single-valued files hit an edge case: if you only have a single value, the shortest code for it is the empty string. Decompressors that didn’t account for it would get stuck reading 0 bits forever. You can get around it by adding unused dummy symbols to ensure a minimum bit length of 1.
- A file over 16MB could cause a single character to be so rare that its bit code was 25+ bits long. A decompressor storing the bits to be decoded in a 32-bit value (a trick even gzip uses) would be unable to append a new 8-bit byte to the buffer without displacing part of the current bit code. You can get around that by using "package merge" length-restricted prefix codes over naive Huffman codes.
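To make the first two points concrete, here is a minimal Python sketch of naive Huffman code length assignment. It is an illustration only, not the code from pack or from my Haskell implementation, and the function name is made up for this example. It needs the full input up front to count bytes, and it assigns a zero-length code when only one byte value occurs.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(data: bytes) -> dict:
    """Return {byte value: code length in bits} for a naive Huffman code."""
    freq = Counter(data)      # counting pass: the whole input must be available
    tiebreak = count()        # keeps heap entries comparable when counts tie
    heap = [(n, next(tiebreak), {sym: 0}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two rarest subtrees; every symbol in them gets one bit deeper.
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**left, **right}.items()}
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2] if heap else {}


print(huffman_code_lengths(b"banana"))
print(huffman_code_lengths(b"aaaaaa"))   # the only symbol gets a 0-bit code
```

The second call is the edge case described above: with a single distinct byte there is nothing to merge, so its code is the empty string.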
I wrote a Haskell implementation with all these fixes in place: koalaman/pack is available on GitHub.
During development, I found that pack support in gzip had been buggy since 2012 (version 1.6), but no one had noticed in the five years since. I tracked down the problem and I’m happy to say that version 1.9 will again restore full pack support!
Anyways, what could possibly be the point of using pack today?
There is actually one modern use case: code golfing.
This post came about because I was trying to implement the shortest possible program that would output a piece of simple ASCII art. A common trick is variations of a self-extracting shell script:
sed 1d $0|gunzip;exit
<compressed binary data here>
You can use any available compressor, including xz and bzip2, but these were meant for bigger files and have game ruining overheads. Here’s the result of compressing the ASCII art in question:
- raw: 269 bytes
- xz: 216 bytes
- bzip2: 183 bytes
- gzip: 163 bytes
- compress: 165 bytes
- and finally, pack: 148 bytes!
I was able to save 15 bytes by leveraging gzip’s forgotten legacy support. This is huge in a sport where winning entries are bytes apart.
Let’s have a look at this simple file format. Here’s an example pack file header for the word "banana":
1f 1e -- Two byte magic header
00 00 00 06 -- Original uncompressed length (6 bytes)
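As a small Python sketch, the header is just the magic bytes followed by the original length as a 4-byte big-endian integer. The byte order is read off the example above; `pack_header` is a name made up for this illustration.

```python
PACK_MAGIC = b"\x1f\x1e"


def pack_header(original_length: int) -> bytes:
    """Magic bytes plus the original (uncompressed) length, big-endian."""
    return PACK_MAGIC + original_length.to_bytes(4, "big")


print(pack_header(len(b"banana")).hex(" "))
```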
Next comes the Huffman tree. Building it is simple to do by hand, but too much for this post. It just needs to be complete, left-aligned, with eof on the right at the deepest level. Here’s the optimal tree for this string:
        /\
       /  a
      /\
     /  \
    /\   n
   b  eof
We start by encoding its depth (3), and the number of leaves on each level. The last level is encoded minus 2, because the lowest level will have between 2 and 257 leaves, while a byte can only store 0-255.
03 -- depth
01 -- level 1 only contains 'a'
01 -- level 2 only contains 'n'
00 -- level 3 contains 'b' and 'eof', -2 as mentioned
Next we encode the ASCII values of the leaves in the order from top to bottom, left to right. We can leave off the EOF (which is why it needs to be in the lower right):
61 6e 62 -- "a", "n", "b"
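The depth byte, the per-level counts and the leaf values can be assembled with a few lines of Python. This is a sketch under the rules described above; `tree_description` and its list-of-levels input are assumptions made for the example, not part of any real tool.

```python
def tree_description(levels: list) -> bytes:
    """levels[i] holds the leaf bytes on level i + 1, left to right, EOF left out."""
    counts = [len(leaves) for leaves in levels]
    counts[-1] += 1   # the implicit EOF leaf sits on the deepest level
    counts[-1] -= 2   # and that level's count is stored minus 2
    return bytes([len(levels)]) + bytes(counts) + b"".join(levels)


print(tree_description([b"a", b"n", b"b"]).hex(" "))
```

For the banana tree this yields the seven bytes listed above: depth, three counts, three leaves.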
This is enough for the decompressor to rebuild the tree. Now we go on to encode the actual data.
Starting from the root, the Huffman codes are determined by adding a 0 for every left branch and 1 for every right branch you have to take to get to your value:
a -> right = 1
n -> left+right = 01
b -> left+left+left = 000
eof -> left+left+right = 001
banana<eof> would therefore be 000 1 01 1 01 1 001, or when grouped as bytes:
16 -- 0001 0110
C8 -- 1100 1 (000 as padding)
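The same grouping can be checked with a short Python sketch. The code table is copied from the list above; the `EOF` marker and `encode_bits` are names chosen for this illustration.

```python
EOF = "eof"
CODES = {"a": "1", "n": "01", "b": "000", EOF: "001"}


def encode_bits(text: str, codes: dict) -> bytes:
    """Concatenate the bit codes, append EOF, and zero-pad to whole bytes."""
    bits = "".join(codes[ch] for ch in text) + codes[EOF]
    bits += "0" * (-len(bits) % 8)   # pad the final byte with zeroes
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(encode_bits("banana", CODES).hex(" "))
```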
And that’s all we need:
$ printf '\x1f\x1e\x00\x00\x00\x06'\
'\x03\x01\x01\x00\x61\x6e\x62\x16\xc8' | gzip -d
banana
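If your gzip is one of the affected versions, the bytes can be cross-checked with a small decoder instead. The following Python sketch is not gzip's code. It assumes, as in the tree above, that on every level the internal nodes take the lowest codes and the leaves follow, with EOF as the last leaf on the deepest level. Error handling is minimal and all names are made up for the example.

```python
def unpack(blob: bytes) -> bytes:
    """Decode a pack file held in memory (sketch, minimal error handling)."""
    if blob[:2] != b"\x1f\x1e":
        raise ValueError("not a pack file")
    original_length = int.from_bytes(blob[2:6], "big")
    depth = blob[6]
    counts = list(blob[7:7 + depth])
    counts[-1] += 2                        # deepest level is stored minus 2
    pos = 7 + depth
    levels = []
    for level, n in enumerate(counts):
        stored = n - 1 if level == depth - 1 else n   # EOF is never stored
        levels.append(blob[pos:pos + stored])
        pos += stored

    # Assumption: on each level the internal nodes get the lowest codes.
    # first_leaf[i] is the code of the first leaf on level i + 1.
    first_leaf = [0] * depth
    nodes = 0
    for level in range(depth - 1, -1, -1):
        nodes //= 2                        # parents of the nodes one level down
        first_leaf[level] = nodes
        nodes += counts[level]

    out = bytearray()
    code = length = 0
    for byte in blob[pos:]:
        for shift in range(7, -1, -1):     # most significant bit first
            code = (code << 1) | ((byte >> shift) & 1)
            length += 1
            index = code - first_leaf[length - 1]
            if index < 0:
                continue                   # still on an internal node
            if length == depth and index == counts[-1] - 1:
                if len(out) != original_length:
                    raise ValueError("length error")
                return bytes(out)          # reached the implicit EOF leaf
            out.append(levels[length - 1][index])
            code = length = 0
    raise ValueError("missing EOF code")


print(unpack(bytes.fromhex("1f1e0000000603010100616e6216c8")))
```

Note that the decoder compares code and length together, so a 3-bit 001 is never mistaken for a 1-bit 1.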
Unfortunately, this triggers the gzip bug mentioned earlier, which comes from failing to account for leading zeroes in bit codes. eof and a have values 001 and 1, so an oversimplified equality check confuses one for the other, causing gzip to terminate early:
b
gzip: stdin: invalid compressed data--length error
However, if you’re stuck with an affected version, there’s another party trick you can do: the Huffman tree has to be canonical, but it does not have to be optimal!
What would happen if we skipped the count and instead claimed that each ASCII character is equally likely? Why, we’d get a tree of depth 8 where all the leaf nodes are on the deepest level.
It then follows that each 8 bit character will be encoded as 8 bits in the output file, with the bit patterns we choose by ordering the leaves.
Let’s add a header with a dummy length to a file:
$ printf '\x1F\x1E____' > myfile.z
Now let’s append the aforementioned tree structure, 8 levels with all nodes in the last one:
$ printf '\x08\0\0\0\0\0\0\0\xFE' >> myfile.z
And let’s populate the leaf nodes with 255 bytes in an order of our choice:
$ printf "$(printf '\\%o' {0..254})" |
    tr 'A-Za-z' 'N-ZA-Mn-za-m' >> myfile.z
Now we can run the following command, enter some text, and hit Ctrl-D to "decompress" it:
$ cat myfile.z - | gzip -d 2> /dev/null
Jr unir whfg pbaivaprq TMvc gb hafpenzoyr EBG13!
<Ctrl+D>
We have just convinced GZip to unscramble ROT13!
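Why this works can be modelled in a few lines of Python. With all 256 leaves on level 8, every 8-bit code is simply an index into the leaf list, so "decompression" degenerates into a table lookup. This is a sketch of that idea rather than of gzip's actual decoder; `LEAVES` mirrors the 255 bytes written by the tr command above, and `fake_decompress` is a made-up name.

```python
import string

UPPER = string.ascii_uppercase.encode()
LOWER = string.ascii_lowercase.encode()
ROT13 = bytes.maketrans(
    UPPER + LOWER,
    UPPER[13:] + UPPER[:13] + LOWER[13:] + LOWER[:13],
)

# The 255 stored leaves, in file order: byte values 0..254 run through ROT13.
LEAVES = bytes(range(255)).translate(ROT13)


def fake_decompress(data: bytes) -> bytes:
    """Treat each input byte as an 8-bit code selecting a leaf."""
    out = bytearray()
    for code in data:
        if code == 255:      # the 256th leaf is the implicit EOF
            break
        out.append(LEAVES[code])
    return bytes(out)


print(fake_decompress(b"Jr unir whfg pbaivaprq TMvc gb hafpenzoyr EBG13!"))
```

Any other byte permutation in the leaf list would give a different substitution cipher for free.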
Can you think of any other fun ways to use or abuse gzip’s legacy support? Post a comment.
There is small error in line: 61 6e 52 — “a”, “n” ,”b”
hex code for letter ‘b’ is 62 and not 52
The result you posted for xz is not entirely correct. By (ab)using various xz options you could get down to 146 bytes:
xz -F raw --lzma2=pb=0,lc=0 -k -C none -c
You can further reduce the payload by stripping the last two bytes, for example with dd:
dd if=payload.xz of=payload.optimized bs=1 count=144
This however presents a problem. By trying to naively decompress it with xz -dFraw you’d get more errors added to output:
xz: Using a preset in raw mode is discouraged.
xz: The exact options of the presets may vary between software versions.
xz: (stdin): Unexpected end of input
You can silence them by adding -qq to the decompressing commandline, but that still left you with a missing newline in the output. Instead you can do something a bit more evil, which produced leftover garbage in CWD, but didn’t matter for the contest in question – redirect errors to a file with payload as its name:
sed 1d $0|xz -dFraw 2>'
There’s a slight drawback to that, as you need to append ' to the end of the binary, but the resulting script is still only 169 bytes long and granted me the second place in said contest :)
Love it :3
I took part in the same (I think?) contest, got 170 bytes.
I tried using all these tricks except for --lzma= – I wrongly interpreted the manpage, thinking that decompressing -Fraw without explicitly specifying filters will only work if the data was compressed with default options.
Interestingly, the 2>’binary’ trick doesn’t always work – it didn’t work with my payload, because it contained a slash `/` character – it was interpreted as a directory, and the script failed with `no such file or directory`.
Instead, I did this:
printf 'payload with NUL escaped as \0'|xz -dqqFraw
The NUL slash-escapes increased the file size a bit, but it turns out some NUL bytes in the payload didn’t actually matter, so I was able to change them to any other byte and avoid the slash.
Would you care to comment on this article on .xz : http://www.nongnu.org/lzip/xz_inadequate.html
“The latest 4.13.9 source release of the Linux kernel is 780MiB, but thanks to xz compression, the download is a much more manageable 96 MiB (an 88% reduction)”
Compressing it using 7zip (using ppmd compression), we can squeeze it a little further –
90.3 MiB (94,733,398 bytes)
I know that ppmd is supported by 7z via the option -mm=PPMd. It doesn’t look like Gzip supports it though.
compressing it with mcm does even better: 63.8 MiB (66,898,708)
Thank you for fixing that bug! I recently came across some old man pages (from 1993 or earlier) compressed with pack, which wouldn’t decompress without that fix. :)
I’ve tested your haskell program, it works well!
And what about “compact“ (.C files) ?
https://www.unix.com/man-page/ultrix/1/compact/
Next challenge…
apparently there is compact (and uncompact) source code available: http://fileformats.archiveteam.org/wiki/Compact_(Unix)
it doesn’t compile out of the box on a modern system, but probably wouldn’t be too difficult to modernize.
“A file over 16MB could cause a single character to be so rare that its bit code was 25+ bits long.”
As a matter of fact, a carefully constructed file of less than 200KB could cause a single character to be rare enough to make pack fail. If memory serves, as soon as I’d encountered pack way back on Xenix 286, I wrote a little program to verify my hunch, and I was right, pack failed miserably.