Fixing Archive.org’s PDFs

Here’s the webpage for a very early edition of Huckleberry Finn. If you open the PDF on a modern PC or tablet, it will look fine, though it is a little slow to load. If you open it on your Kindle, Nook Color, or some other older ebook reader that displays PDFs, you’re in for a shock.

Each page in these PDFs is actually 3 images. When put together by a modern PDF reader, they make one nice scanned PDF page. If you’re not using a modern reader, you see all 3 layers separately. This makes the book unreadable. Even if you are using a modern reader, these PDFs lag noticeably compared to other documents because the reader has to load 3 images per page.

This guide will show you how to eliminate the first two images and reverse the third image to be white on black. Will this 100% fix the book? No. However, if you value text over presentation, it does make the book readable on any device, including the good old E-ink Kindle.

Step 1. Install the applications (OpenSUSE)

sudo zypper in pdfmod imagemagick pandoc grename

Step 2. Convert the PDF to images. Create a directory for the files to go to first:

mkdir huck
pdfimages huckleberry.pdf huck/

Step 3. The files that are created are all named -xxx.ppm or -xxx.pbm. Bash doesn’t like this, because commands treat a file name that begins with a hyphen as an option. I use grename to rename every file so that none of them begins with a hyphen.
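
If you don’t have grename handy, a plain Bash loop can do the same job. This is only a minimal sketch, not the method used above: the function name prefix_hyphen_files and the "page" prefix are made up for illustration. It turns -000.pbm into page-000.pbm, and so on. (Passing huck/page instead of huck/ as the prefix in Step 2 should avoid the problem altogether, since pdfimages puts that prefix in front of each -NNN file name.)

```shell
# Hypothetical helper: put "page" in front of every file in a directory
# whose name starts with a hyphen, e.g. "-000.pbm" -> "page-000.pbm".
prefix_hyphen_files() {
    local dir=${1:-.} f base
    for f in "$dir"/-*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        base=${f##*/}                # file name without the directory part
        mv -- "$f" "$dir/page$base"  # "--" stops mv from reading names as options
    done
}
```

Run it as prefix_hyphen_files huck before you cd into the directory in Step 4.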

Step 4. cd to the directory and delete the extra image files:

cd huck
rm *.ppm

Step 5. Reverse the images of the .pbm files. This will create a new copy of the files with inverted colors.

for i in *.pbm; do convert "$i" -monochrome -colors 2 -depth 1 -negate "in-$i"; done

Step 6. Move the completed files to a new directory and delete the originals:

mkdir finished
mv in* finished/
rm *.pbm

Step 7. cd to the finished directory and create a new pdf. This will take time and may freeze your computer. Be patient.

cd finished
convert `ls -v` huck_bw.pdf
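
The backtick version works as long as none of the file names contains a space. If you would rather not depend on the output of ls, here is a small sketch of another way to get the pages in numeric order. It assumes GNU sort (for the -V option) and Bash 4 (for mapfile); the function name list_pages_in_order is made up for illustration.

```shell
# Hypothetical helper: print the .pbm files in a directory in natural
# (version) order, so that page-2 comes before page-10.
# Assumes GNU sort and file names without newlines.
list_pages_in_order() {
    local dir=${1:-.}
    ( cd "$dir" && printf '%s\n' *.pbm | sort -V )
}

# Usage (left commented out so that pasting this block changes nothing):
#   mapfile -t pages < <(list_pages_in_order .)
#   convert "${pages[@]}" huck_bw.pdf
```

Natural ordering starts to matter once the image numbers run past 999, because plain alphabetical order would put 1000 ahead of 101.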

Step 8. Shrink your newly created PDF because it is far too large right now.

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen \
-dNOPAUSE -dQUIET -dBATCH -sOutputFile=huck_bw_final.pdf huck_bw.pdf
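
To see how much Ghostscript actually saved, you can compare the two files. This is just a quick sketch; the function name report_shrink is made up for illustration, and it only uses wc and shell arithmetic.

```shell
# Hypothetical helper: print the size of the old and new PDF in bytes,
# plus the new size as a whole-number percentage of the old one.
report_shrink() {
    local before after
    [ -f "$1" ] && [ -f "$2" ] || { echo "usage: report_shrink OLD.pdf NEW.pdf" >&2; return 1; }
    before=$(( $(wc -c < "$1") ))   # $(( )) strips any padding wc adds
    after=$(( $(wc -c < "$2") ))
    [ "$before" -gt 0 ] || { echo "$1 is empty" >&2; return 1; }
    echo "$before -> $after bytes ($(( after * 100 / before ))% of original)"
}
```

Run it as report_shrink huck_bw.pdf huck_bw_final.pdf once Step 8 has finished.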

Your new PDF is complete. It is not as pretty as the original, but it is handier.

I then use pdfmod to edit the metadata so the ebook is easier to work with in calibre.

I’d be very interested to hear if anyone has found a better way to do this with open source software, one that retains the color of the original without the multiple layers.
