DjVu-Digital vs. "Super Hero" PDF
A faceoff by James Rile (PlanetDjVu, 11.20.2002)
One year ago, Dov Isaacs of Adobe produced a "Super Hero" PDF, an example of the use of superior compression methods in PDF files. See the review at: http://www.planetpdf.com/mainpage.asp?webpageid=1753&nl
Here is an excerpt:
"Do you think "super hero" is too strong a word to describe this file? OK, then you create a PDF like this:
84 slides from PowerPoint
Color graphics on every page
30 font faces subset embedded
17 line art drawings
30 screen shots
28 bitmaps (in addition to the screen shots)
Four languages
Looks great on-screen
Prints like a champ
...and is 1.14 MB in size."
We at PlanetDjVu decided to challenge this "super hero" example file by converting it to DjVu-Digital format. Our first result was pretty good, but then Leon Bottou, the "original format author" of DjVu, joined in the challenge and provided the final DjVu file that is linked to below.
The resulting DjVu file wins the size comparison: at 0.88 MB versus 1.14 MB, it is about 23% smaller than the best that PDF can offer! Open both files using the links below and you will see that the DjVu version opens and displays more quickly than even this "Super Hero" PDF!
PDF Version - 1.14 MB
DjVu Version - 0.88 MB
Now who is the "Super Hero"? Why, DjVu is!
Here is what Leon had to say about this winning DjVu rendition of superhero.pdf:
"This file was compressed by first converting the pdf into ps with xpdf and then using the following command:
% djvudigital --cseparg=-p100 --words --threshold=99 \
    --fg-image-colors=1024 --fg-colors=128 --psrotate=90 \
    --bg-slices=72+11+10+6 superhero.ps

To compress this file, I had to recognize that the content of the pdf file is designed to showcase some of the strengths of pdf. It liberally uses gradients and line-art features that are shared among pages.
To deal with such a file, I chose to move as many things as possible into the foreground (even images) and thus give them a chance to be shared between pages. This is the meaning of options --threshold and --fg-image-colors. To help this strategy, the --cseparg option attempts to maximize sharing between pages. The --fg-colors option reduces the number of distinct colors in the foreground in order to limit the size of the foreground data. The --bg-slices option reduces the quality of the background because, after all, the background is no longer rich in details.
Mr. Isaacs explains on page 14 that "A PDF file can never be better than the content from which it is created". All his presentation explains is that one should avoid intermediate steps that could hide the structure of the original content.
DjVu was designed to remove this constraint. We could print the superhero file, scan the pages and still produce a DjVu file with a decent size (not as good as this one, but decent).
In other words, the DjVu compressors are designed to recover the structure of the document from whatever data is available (pixels for djvudocument, postscript for djvudigital). The format itself only implements a simple document structure (foreground/background) but gives many opportunities to conceal potential structure discovery errors.
In short:
----------------------------------
        Gold-in     Garbage-in
----------------------------------
PDF     Gold-out    Garbage-out
DjVu    Gold-out    Acceptable-out
-Leon"
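For readers who want to try Leon's recipe, the two steps he describes can be sketched as a small dry-run script. This is only an illustration under stated assumptions: we assume "xpdf" means the pdftops tool that ships with xpdf, the input file name superhero.pdf is our guess at the local name of the PDF, and the djvudigital flags are copied verbatim from Leon's command above, so some of them may not exist in the djvudigital release you have installed. By default the script only prints the commands.

```sh
#!/bin/sh
# Dry-run sketch of the two-step conversion described in Leon's note.
# Assumption: "xpdf" refers to the pdftops tool distributed with xpdf.
# Assumption: the PDF is saved locally as superhero.pdf.
# The djvudigital flags are copied verbatim from the quoted command and
# may not all be accepted by other djvudigital releases.

INPUT_PDF="superhero.pdf"
INPUT_PS="superhero.ps"

# Step 1: convert the PDF to PostScript.
PS_CMD="pdftops $INPUT_PDF $INPUT_PS"

# Step 2: compress the PostScript into DjVu with Leon's settings.
DJVU_CMD="djvudigital --cseparg=-p100 --words --threshold=99 \
--fg-image-colors=1024 --fg-colors=128 --psrotate=90 \
--bg-slices=72+11+10+6 $INPUT_PS"

# Always show what would be run.
echo "$PS_CMD"
echo "$DJVU_CMD"

# Execute only when RUN=1 is set and both tools are installed.
# The unquoted expansions rely on word splitting, so this only works
# for file names without spaces.
if [ "${RUN:-0}" = "1" ]; then
    if command -v pdftops >/dev/null 2>&1 &&
       command -v djvudigital >/dev/null 2>&1; then
        $PS_CMD && $DJVU_CMD
    else
        echo "pdftops or djvudigital not found; nothing was run" >&2
    fi
fi
```

Run it as is to inspect the commands, or with RUN=1 in the environment to execute them. Keeping the flags in one variable makes it easy to change a single setting, such as --fg-colors, and compare the resulting file sizes.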