Digitising books

20 Apr 2013

Having acquired a new document scanner, I chomped through most of the paper in my life, scanning receipts, letters from the bank and so on. It took a few hours.

To really put my scanner through its paces, I wanted to digitise a few books. Here are a few thoughts and pointers for future reference.

  • It’s particularly useful to digitise reference books. This is a matter of opinion, but I think they are better suited to the backlit-screen, random-access style of reading and research that devices like iPads are so good at.
  • You need to get the pages out. I’ve read of all sorts of ways of doing this. Circular saws look good fun! If you don’t have one, and live near a Mail Boxes Etc., go in there and ask them to guillotine the spine off. Other stationery shops may do this too, but in Cambridge the friendly staff there don’t even charge for it. I don’t know how long that’ll last…
  • Chop off as little as possible when removing the spine. This helps avoid cutting into the text near the inner margin and makes rebinding easier, if you choose to do that.
  • After you’ve got the pages out, flick through them to make sure they really are separated. Sometimes you will find a few pages still ‘glued’ together.
  • Scan one or two pages first to find the right settings for resolution, compression and scan workflow. Zoom in, check that OCR will succeed, and do a back-of-envelope calculation to strike your preferred balance between image resolution, compression level and file size. 150 dpi seems fine to me, but I would rather have a large file than compression artefacts.
  • Start with a small book. Large ones aren’t necessarily harder, but if you make a mistake on a small one, you’ve wasted less time.
  • Does your book rely on double-page spreads? If so, see if your software will join them up for you. I haven’t found this yet in ScanSnap. I made the mistake of scanning a whole book and then having to use the supplied “Page merger” tool on tens of double-page spreads. Tedious!
  • You can’t fit every page into the scanner’s feeder at once, so do ‘em in batches. Look for the “continue scanning” option if using the ScanSnap software. During the initial scan (the “grab”), I chose not to do OCR. It takes time, so save it until the end, in case you make a mistake along the way.
  • If your scanner automatically discards blank pages, consider disabling this feature (see note about page numbers below).
  • Stitching together PDFs digitally is possible but fiddly. Macs have a Python script to help: "/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py"
  • You’ll have to scan the front cover separately (covers are usually too stiff to go through auto feeders). You can use Preview (on a Mac) to put it at the front of your document.
  • Remember to set metadata such as the title and author(s) in the PDF file.
  • Consider the page labels. Usually the visible page number will not match the electronic page number, so finding content based on page number by hitting ⌥⌘G won’t work. If you’ve scanned every page and included blanks, the relationship will be simple (n’ = n + 4 or similar). I had success using jpdftweak to set page labels that match the printed numbers.
  • Now perform OCR, check the results, and upload to Dropbox, iBooks etc.
  • Consider having the pages rebound. You’ve already pulled all the pages out and digitised them, so keeping the book is an optional bonus. Helpfully, the only cost-effective rebinding mechanism is likely to be one that leaves you with a “stays flat” book - ideal for referring to while you have both hands full (e.g. tying a knot or fiddling with a bike).
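
The back-of-envelope calculation for resolution versus file size mentioned in the list above can be sketched in a few lines of Python. This is only a rough illustration: the function name, the 6 x 9 inch page, the 300-page book, 24-bit colour and the 10:1 compression ratio are all assumptions made for the example, not measurements from any scanner or settings taken from the ScanSnap software.

```python
def estimate_pdf_size_mb(pages, width_in, height_in, dpi,
                         bytes_per_pixel=3, compression_ratio=10.0):
    """Rough size of a scanned book in megabytes (10**6 bytes).

    bytes_per_pixel=3 assumes 24-bit colour; compression_ratio=10.0 is an
    assumed, illustrative figure. Real ratios vary a lot with content and
    with the quality setting chosen.
    """
    # Pixels on one side of one sheet at the chosen resolution.
    pixels_per_page = round(width_in * dpi) * round(height_in * dpi)
    raw_bytes = pixels_per_page * bytes_per_pixel * pages
    return raw_bytes / compression_ratio / 1_000_000


if __name__ == "__main__":
    # Hypothetical 300-page book with 6 x 9 inch pages.
    for dpi in (150, 300, 600):
        size = estimate_pdf_size_mb(pages=300, width_in=6, height_in=9, dpi=dpi)
        print(f"{dpi} dpi: roughly {size:.0f} MB")
```

Pixel count grows with the square of the resolution, so doubling the dpi roughly quadruples the file size. That makes it worth settling on the lowest resolution that still reads well and OCRs cleanly before scanning the whole book.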

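The page-number relationship from the list above (n’ = n + 4 or similar) can be worked out from a single known page. A minimal sketch, assuming every sheet was scanned with blanks included so the offset is constant; the function names are hypothetical and have nothing to do with jpdftweak or Preview.

```python
def offset_from_sample(printed_page, pdf_page):
    """Work out the constant offset from one page where both numbers are known."""
    return pdf_page - printed_page


def pdf_page_for(printed_page, offset):
    """Electronic page to jump to for a given printed page number."""
    return printed_page + offset


def printed_page_for(pdf_page, offset):
    """Printed number shown on a given electronic page, or None for front matter."""
    printed = pdf_page - offset
    return printed if printed >= 1 else None


if __name__ == "__main__":
    # Assumed example: printed page 1 turned out to be the 5th page of the PDF.
    offset = offset_from_sample(printed_page=1, pdf_page=5)
    print("offset:", offset)
    print("printed page 37 is PDF page", pdf_page_for(37, offset))
    print("PDF page 3 is printed page", printed_page_for(3, offset))
```

If the scanner dropped blank pages, the offset is no longer constant and this simple mapping breaks down, which is the reason for keeping blanks in the scan.
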
Books are still useful in full sunlight, rain, or in your garage while you’re spraying WD40 around!

Tags: books, scanning, digital, paper, data