[Home](/) # Diary Scanning Project 2023-05-05. Omar Mustardo. Sometime last year I learned about the existence of 50+ years of diaries from my great grandmother. That seemed worth preserving, so I reached out to my mother's first cousin who had them. The cousin and other family members reasonably had concerns about distribution of this potentially private information, so the plan is just to keep a digital copy and not put it on archive.org. I was hoping to use archive.org since it provides a great book viewer, and is also reasonably certain to exist for a long time. I eventually got two boxes with all 11 diaries in the mail, and the work began. ## How to scan There is a lot of information about book scanning, but there isn't a standard way to do it. - [https://diybookscanner.org/forum/](https://diybookscanner.org/forum/) has loads of potential options. - [https://diybookscanner.org/archivist/](https://diybookscanner.org/archivist/) Fancy, well beyond what I'm looking for, but may have useful tidbits like choosing the right type of glass. - cardboard box with camera hole [https://hackaday.com/2011/07/18/diy-book-scanner-processes-600-pageshour/](https://hackaday.com/2011/07/18/diy-book-scanner-processes-600-pageshour/) [https://web.archive.org/web/20120618141301/http://www.314pies.com/projects/diy-book-scanner/item/18/](https://web.archive.org/web/20120618141301/http://www.314pies.com/projects/diy-book-scanner/item/18/) - laser cut wood (lots of pieces) + misc metal bits [https://arstechnica.com/gadgets/2013/02/diy-book-scanning-is-easier-than-you-think/](https://arstechnica.com/gadgets/2013/02/diy-book-scanning-is-easier-than-you-think/) - plastic pipe, MDF, and plexiglass. Very nice result [https://www.youtube.com/channel/UCwMsuo69GH2OkeFk8Y9jHmQ](https://www.youtube.com/channel/UCwMsuo69GH2OkeFk8Y9jHmQ) - General advice and links: [https://www.reddit.com/r/DataHoarder/comments/o1v4qw/book_scanning/](https://www.reddit.com/r/DataHoarder/comments/o1v4qw/book_scanning/) - [https://www.instructables.com/Bargain-Price-Book-Scanner-From-A-Cardboard-Box/](https://www.instructables.com/Bargain-Price-Book-Scanner-From-A-Cardboard-Box/) Looks good. I'll need to get a tripod and sheet of glass, but all of the rest is very doable. I tried out the cardboard box (sliced at an angle, to make a cradle for the book). It worked surprisingly well and I liked that it was easy to modify with scissors and tape. Advice for others would be to use a box significantly larger than the book so that postprocessing can't confuse the edge of the box with the edge of the book. I reached out to a mailing list at work, which resulted in a lot of advice, and eventually someone who lent me a Czur Aura scanner. Here are some of the other tips in case it's useful for others: - Search for "flatbed edge scanner" - Atom tabletop scanner using "Phase One" cameras - Noisebridge (makerspace in SF) had a scanner available, but it was disassembled for storage back in 2017. - An app, combined with a simple jig to hold the phone and some extra lights would probably be sufficient. - Jig to hold a phone, along with Adobe Scan or Google Drive Scan This search also led me down a rabbit hole around the story of Google Books, which is pretty interesting: [https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/](https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/) and the Google-designed linear book scanner, is pretty neat if you haven't seen it: [https://linearbookscanner.org/](https://linearbookscanner.org/) [https://github.com/google/linear-book-scanner](https://github.com/google/linear-book-scanner) ## Scanning - Czur Aura The Czur Aura is essentially a camera and lights on a stand, a flat black mat as a background, and a foot pedal to trigger it. The foot pedal was extremely useful and in retrospect I wouldn't want to scan without one. It has its own software which was easy to use and had no issues. I ended up purchasing a small piece of clear acrylic at Home Depot to hold the pages flat, which avoided the need to de-warp in postprocessing and was definitely worthwhile. The acrylic ended up getting scratched up by the end and probably degraded image quality a bit. A sheet of glass with softened edges would probably have been better. This flattening glass is called a "platen", and is also the name of the glass in a standard flatbed scanner. - I used a manual capture area since I can't lay the book flat (in which case I could have it auto-capture the whole thing, and auto-decide where the spine is. - I didn't use any image editing or cropping - Output is a bunch of ~500kb jpg images. It took 90-120 minutes to scan each diary. The variance was largely based on how easy the pages were to flatten. Some had pages that tended to stick together. ## Postprocessing - ScanTailor A minor pre-work step was to reorder and rename everything. Since I scanned the left and right sides separately, so they need to be interleaved. There was also sometimes a bad scan which I deleted and re-did, I needed to compress the number range (e.g. 1,2,4 -> 1,2,3). A little python handled this (see the [scripts/](scripts/) folder). At this point I had 11 directories, each containing around 370 jpg images. ScanTailor seems to be the default tool for postprocessing scanned books. Unfortunately there are many branches of it, and it's unclear which to use. I ended up with [https://github.com/ScanTailor-Advanced/scantailor-advanced/releases/tag/v1.0.18](https://github.com/ScanTailor-Advanced/scantailor-advanced/releases/tag/v1.0.18) as it was most recently updated. The executable is stored in this archive just in case it's useful. I put the raw scans into ScanTailor, one journal at a time. I went through each step (Orientation, Split Pages, Deskew, Select Content, Margins, Output) with all default settings, except using a fill background and not doing black-and-white output. I did some manual fixing of content boxes in Select Content because I found that the content box detection made severe mistakes in a few cases, and minor errors in many. The final journal was particularly bad. It had dark lines which threw off the content box detection and almost every page required significant manual fixes. Output of this is a bunch of ~10mb tif images. I used imagemagick to convert them to jpg. ## Future work - OCR. It would be great to have searchable text files. I looked into options for automatic transcription but ended up not finding anything that worked for handwritten text. This may just be an unsolved problem that requires waiting for tools to get better. - Fix color balance. Some of the scans ended up with a greenish tint. I suspect it's from light reflecting off of other things in my room, especially since it seemed to happen to only one side of the diaries. This should be possible to clean up in postprocessing, but I don't know of a good way to do it. ![Cardboard Box](cardboard_box.jpg) ![Czur Aura Scan Setup](czur_aura_scan_setup.jpg) ![Diaries Arrived](diaries_arrived.jpg) ![Scan Example](scan_example.jpg) ![Scan Postprocessed Example](scan_postprocessed_example.jpg)