Digitising books

20 Apr 2013

Having acquired a new document scanner, I chomped through most of the paper in my life, scanning receipts, letters from the bank and so on. It took a few hours.

To really put my scanner through its paces, I wanted to digitise a few books. Here are a few thoughts and pointers for future reference.

It’s particularly useful to digitise reference books. This is a matter of opinion, but I think they are better suited to the illuminated-screen, random-access type of reading/research that things like iPads are so good at.
You need to get the pages out. I’ve read of all sorts of ways of doing this. Circular saws look good fun! If you don’t have one, and live near a Mail Boxes Etc., go in there and ask them to guillotine the spine off. Other stationery shops may do this too, but in Cambridge the friendly staff at Mail Boxes Etc. don’t even charge for it. I don’t know how long that’ll last…
Chop off as little as possible when removing the spine. It helps avoid cut-off lines of text and makes rebinding easier, if you choose to do that.
After you’ve got the pages out, flick through them to make sure they really are separated. Sometimes you will find a few pages still ‘glued’ together.
Scan one or two pages to find the right settings for resolution, compression and scan workflow. Zoom in, check that OCR will succeed, and do a back-of-envelope calculation so you can strike your preferred balance between image resolution, compression level and file size. 150 dpi seems fine to me, but I choose large file size over compression artefacts.
Start with a small book - large ones aren’t necessarily harder, but if you make a mistake, you’ve wasted less time.
Does your book rely on double page spreads? If so, see if your software will join them up for you. I haven’t found this yet in ScanSnap. I made the mistake of scanning a whole book and then having to use the supplied “Page merger” tool on dozens of double page spreads. Tedious!
You can’t put every page into the scanner at once, so do ‘em in batches. Look for the “continue scanning” option if using the ScanSnap software. At the scanning (“grab”) stage, I chose not to do OCR. It takes time, so save it until the end, in case you make a mistake on the way.
If your scanner automatically discards blank pages, consider disabling this feature (see note about page numbers below).
Stitching together PDFs digitally is possible but fiddly. Macs have a Python script to help: "/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py".
You’ll have to scan the front cover separately (covers are usually too stiff to go through auto feeders). You can use Preview (on a Mac) to put it at the front of your document.
Remember to set metadata such as title and author(s) in the PDF file.
Consider the page labels. Usually the visible page number will not match the electronic page number, so finding content based on page number by hitting ⌥⌘G won’t work. If you’ve scanned every page and included blanks, the relationship will be simple (n’ = n + 4 or similar). I had success using jpdftweak to achieve this.
Now perform OCR, check the results, and upload to Dropbox, iBooks etc.
Consider having the pages rebound. You’ve already pulled all the pages out and digitised them, so keeping the book is an optional bonus. Helpfully, the only cost-effective rebinding mechanism is likely to be one that leaves you with a “stays flat” book - ideal for referring to while you have both hands full (e.g. tying a knot or fiddling with a bike).
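
The two bits of arithmetic in the tips above (file size versus resolution, and the page-number offset) can be roughed out in a few lines of Python. This is only a sketch: the function names are hypothetical, and the 6 x 9 inch page, 300-page book and 10:1 compression ratio are made-up assumptions rather than measurements. Scan a couple of real pages to find your own numbers.

```python
def estimate_book_size_mb(pages, width_in, height_in, dpi,
                          bytes_per_pixel=3, compression_ratio=10):
    """Rough size of a scanned book in megabytes (1 MB = 1,000,000 bytes).

    bytes_per_pixel=3 assumes 24-bit colour; compression_ratio is a
    placeholder guess, not a property of any particular scanner.
    """
    pixels_per_page = (width_in * dpi) * (height_in * dpi)
    raw_bytes = pixels_per_page * bytes_per_pixel * pages
    return raw_bytes / compression_ratio / 1_000_000


def pdf_page_for(printed_page, offset=4):
    """Map a printed page number to its position in the PDF (n' = n + offset).

    The offset of 4 is just the example figure from the text; count the
    unnumbered pages at the front of your own book to find the real one.
    """
    return printed_page + offset


# Hypothetical 300-page book with 6 x 9 inch pages.
for dpi in (150, 300):
    size = estimate_book_size_mb(pages=300, width_in=6, height_in=9, dpi=dpi)
    print(f"{dpi} dpi: roughly {size:.0f} MB")

# Printed page 10 sits at PDF page 14 when four unnumbered pages precede page 1.
print(pdf_page_for(10))
```

Doubling the resolution quadruples the pixel count, so under these assumptions the estimate goes from about 109 MB at 150 dpi to about 437 MB at 300 dpi.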

Books are still useful in full sunlight, rain, or in your garage while you’re spraying WD40 around!

Tags: books, scanning, digital, paper, data
