
Diary Scanning Project

2023-05-05. Omar Mustardo.

Sometime last year I learned about the existence of 50+ years of diaries from my great-grandmother. That seemed worth preserving, so I reached out to my mother’s first cousin, who had them. The cousin and other family members reasonably had concerns about distributing this potentially private information, so the plan is just to keep a digital copy and not put it on archive.org. I had been hoping to use archive.org since it provides a great book viewer and is also reasonably certain to exist for a long time. I eventually got two boxes with all 11 diaries in the mail, and the work began.

How to scan

There is a lot of information about book scanning, but there isn’t a standard way to do it.

I first tried out a cardboard box (sliced at an angle, to make a cradle for the book). It worked surprisingly well and I liked that it was easy to modify with scissors and tape. My advice for others would be to use a box significantly larger than the book so that postprocessing can’t confuse the edge of the box with the edge of the book. I also reached out to a mailing list at work, which resulted in a lot of advice, and eventually in someone lending me a Czur Aura scanner. Here are some of the other tips in case they’re useful for others:

This search also led me down a rabbit hole around the story of Google Books, which is pretty interesting: https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/ The Google-designed linear book scanner is also pretty neat if you haven’t seen it: https://linearbookscanner.org/ https://github.com/google/linear-book-scanner

Scanning - Czur Aura

The Czur Aura is essentially a camera and lights on a stand, a flat black mat as a background, and a foot pedal to trigger it. The foot pedal was extremely useful and in retrospect I wouldn’t want to scan without one. The scanner comes with its own software, which was easy to use and gave me no issues.

I ended up purchasing a small piece of clear acrylic at Home Depot to hold the pages flat, which avoided the need to de-warp in postprocessing and was definitely worthwhile. The acrylic was scratched up by the end, which probably degraded image quality a bit. A sheet of glass with softened edges would probably have been better. This flattening glass is called a “platen”, which is also the name for the glass in a standard flatbed scanner.

It took 90-120 minutes to scan each diary. The variance was largely based on how easy the pages were to flatten. Some had pages that tended to stick together.

Postprocessing - ScanTailor

A minor pre-work step was to reorder and rename everything. Since I scanned the left and right sides separately, the two sets of images needed to be interleaved. There was also sometimes a bad scan which I deleted and re-did, so I needed to compress the number range (e.g. 1,2,4 -> 1,2,3). A little Python handled this (see the scripts/ folder).
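The interleave-and-renumber step could look roughly like the sketch below. This is not the script from the scripts/ folder. It assumes the left and right pages were saved to two separate directories, that every file name contains a scan number, and that the left page of each spread comes first. The function names and the four-digit output naming are made up for illustration.

```python
import re
import shutil
from pathlib import Path


def numeric_key(path: Path) -> int:
    """Sort key: the last run of digits in the file name (img_12.jpg -> 12)."""
    digits = re.findall(r"\d+", path.stem)
    if not digits:
        raise ValueError(f"no scan number in file name: {path.name}")
    return int(digits[-1])


def interleave(left: list[Path], right: list[Path]) -> list[Path]:
    """Return pages in reading order: left[0], right[0], left[1], right[1], ..."""
    left = sorted(left, key=numeric_key)
    right = sorted(right, key=numeric_key)
    if len(left) != len(right):
        raise ValueError(f"page count mismatch: {len(left)} left, {len(right)} right")
    return [page for pair in zip(left, right) for page in pair]


def write_renumbered(ordered: list[Path], out_dir: Path) -> list[Path]:
    """Copy pages to out_dir as 0001.jpg, 0002.jpg, ... so gaps in the
    original numbering (e.g. 1,2,4) disappear (1,2,3)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, src in enumerate(ordered, start=1):
        dst = out_dir / f"{index:04d}{src.suffix.lower()}"
        shutil.copy2(src, dst)  # copy rather than move, so the raw scans stay untouched
        written.append(dst)
    return written
```

Sorting by the parsed number rather than by the raw file name keeps scan 10 after scan 9, and copying into a fresh directory avoids overwriting a file that has not been renamed yet.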

At this point I had 11 directories, each containing around 370 jpg images.

ScanTailor seems to be the default tool for postprocessing scanned books. Unfortunately there are many forks of it, and it’s unclear which to use. I ended up with https://github.com/ScanTailor-Advanced/scantailor-advanced/releases/tag/v1.0.18 as it was the most recently updated. The executable is stored in this archive just in case it’s useful.

I put the raw scans into ScanTailor, one journal at a time. I went through each step (Orientation, Split Pages, Deskew, Select Content, Margins, Output) with all default settings, except using a fill background and not doing black-and-white output. I did some manual fixing of content boxes in Select Content because I found that the content box detection made severe mistakes in a few cases, and minor errors in many. The final journal was particularly bad. It had dark lines which threw off the content box detection and almost every page required significant manual fixes.

The output of this is a bunch of ~10 MB tif images. I used ImageMagick to convert them to jpg.
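As a rough sketch, the conversion can be scripted by calling ImageMagick once per file. The quality setting of 90, the directory arguments, and the function names below are assumptions for illustration, not the settings actually used for the diaries. The `magick` executable is the ImageMagick 7 entry point; on ImageMagick 6 installs the equivalent command is `convert`.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(src: Path, dst: Path, quality: int = 90,
                  executable: str = "magick") -> list[str]:
    """Build the command line: magick input.tif -quality 90 output.jpg"""
    return [executable, str(src), "-quality", str(quality), str(dst)]


def convert_directory(tif_dir: Path, jpg_dir: Path, quality: int = 90,
                      executable: str = "magick",
                      dry_run: bool = False) -> list[list[str]]:
    """Convert every .tif in tif_dir to a .jpg of the same name in jpg_dir.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = [
        build_command(src, jpg_dir / f"{src.stem}.jpg", quality, executable)
        for src in sorted(tif_dir.glob("*.tif"))
    ]
    if dry_run:
        return commands
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    jpg_dir.mkdir(parents=True, exist_ok=True)
    for command in commands:
        subprocess.run(command, check=True)  # stop at the first failed conversion
    return commands
```

Running it with `dry_run=True` first prints nothing and touches nothing, which makes it easy to check the file list before converting a few thousand images.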

Future work

Images: Cardboard Box; Czur Aura Scan Setup; Diaries Arrived; Scan Example; Scan Postprocessed Example