One of the most useful digital humanities skills is not glamorous at all: knowing how to turn a folder of high-quality page images into a searchable PDF. If you are scanning old documents, archival packets, newsletters, church records, syllabi, meeting minutes, or family papers, the fastest workflow is often to separate scanning from digitizing.
I do not want to sit at a scanner and make decisions about every final PDF while the machine is running. I want the scanner to do what scanners do best: capture clean images at 300 DPI or higher. Then I can group, rename, combine, and OCR the files later from the command line.
That small shift matters. Scanning becomes a capture process. Digitizing becomes an organization and text-recognition process. The result is less fiddling at the scanner, better preservation images, and a repeatable workflow for making documents searchable.
The Basic Workflow
On a Mac, use Homebrew to install two command-line tools:
brew install img2pdf ocrmypdf
Put the PNG files for one document into a folder. Name them in page order:
001.png
002.png
003.png
004.png
The leading zeroes are important because the shell expands *.png in sorted order and passes the files to the command in that order. With names like 1.png, 2.png, and 10.png, the file 10.png sorts before 2.png, so your pages will not land where you expect.
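If a scanner has already produced names like 1.png and 10.png, a small rename pass can fix them before combining. This is a minimal sketch rather than part of the workflow above: pad_page_names is a made-up helper name, and it assumes the files are named with digits only and that three digits are enough for the document.

```shell
# Hypothetical helper: zero-pad purely numeric PNG names (1.png -> 001.png)
# so that the shell's sorted glob order matches page order.
pad_page_names() {
  local dir="${1:-.}" f base num padded
  for f in "$dir"/*.png; do
    [ -e "$f" ] || continue              # no PNGs: the glob stayed literal
    base="$(basename "$f" .png)"
    case "$base" in
      ''|*[!0-9]*) continue ;;           # skip names that are not plain digits
    esac
    # Strip leading zeroes first so printf does not read 008 as octal.
    num="$(printf '%s' "$base" | sed 's/^0*//')"
    padded="$(printf '%03d' "${num:-0}")"
    if [ "$base" != "$padded" ]; then
      mv -n -- "$f" "$dir/$padded.png"   # -n: never overwrite an existing page
    fi
  done
}
```

Calling `pad_page_names .` inside a document folder renames the files in place. Names that contain anything other than digits are left alone.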
Then combine the PNG files into a PDF:
img2pdf *.png -o combine.pdf
That gives you a PDF made from the page images. Next, run OCRmyPDF to add searchable text:
ocrmypdf combine.pdf combine-ocr.pdf
The output, combine-ocr.pdf, is still visually based on the original scans, but now the PDF has a text layer. You can search it, select text, index it, quote from it, or add it to a research library.
Why Scan Images First?
Many scanner apps want you to decide everything during the scan: file name, PDF settings, OCR settings, page grouping, destination folder, and sometimes compression. That is fine for a one-off document. It is not ideal when you are working through a box of old papers.
For digital humanities work, speed and repeatability matter. I would rather scan at 300 DPI or higher, save each page as a PNG, and keep moving. Later, when I am away from the scanner, I can sort the pages into folders, rename them, combine them into PDFs, and OCR them in batches.
This also preserves better intermediate files. A folder of high-resolution page images is flexible. You can make a PDF from it. You can rerun OCR with different settings. You can crop, deskew, split, reorder, or replace pages. You can keep the images as preservation files and treat the OCR PDF as a derivative.
That distinction is useful: the scan is the evidence; the OCR PDF is the access copy.
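Because the page images and the image-only PDF are kept, rerunning OCR with different settings is cheap. As a rough sketch, a rerun might look like the wrapper below. The name rerun_ocr and the default language are assumptions for illustration; --deskew, --rotate-pages, and -l are OCRmyPDF options, and the language code must match a Tesseract language pack installed on your machine.

```shell
# Hypothetical wrapper: rerun OCR on an image-only PDF with a few
# commonly used OCRmyPDF options.
#   $1 = input PDF, $2 = output PDF, $3 = Tesseract language code (default: eng)
rerun_ocr() {
  local input="$1" output="$2" lang="${3:-eng}"
  # --deskew straightens slightly tilted pages before recognition;
  # --rotate-pages tries to fix pages scanned sideways or upside down.
  ocrmypdf --deskew --rotate-pages -l "$lang" "$input" "$output"
}
```

For example, `rerun_ocr combine.pdf combine-ocr-deu.pdf deu` would try German recognition on the same scans without touching the source images.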
A Small Script
I keep a script called png_to_pdf_to_ocr.sh for this workflow. From inside a folder of PNGs, it combines the images and then runs OCR:
./png_to_pdf_to_ocr.sh combined-ocr.pdf
The logic is simple:
img2pdf *.png -o combine.pdf
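ocrmypdf combine.pdf combine-ocr.pdf --skip-text
The --skip-text flag tells OCRmyPDF to skip any page that already has a text layer instead of stopping with an error. The script just makes the process harder to mistype. It also gives me one command I can reuse when processing multiple documents.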
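The full script is not reproduced here. A minimal sketch of what such a script could contain follows, written as a function so it can be tried out in a shell session. It is an illustration under assumptions, not the actual file: the function name png_to_ocr_pdf, the optional output-name argument, and the empty-folder check are all hypothetical.

```shell
# Hypothetical sketch of a png_to_pdf_to_ocr.sh-style helper.
#   $1 = name of the searchable PDF to write (default: combine-ocr.pdf)
png_to_ocr_pdf() {
  local out="${1:-combine-ocr.pdf}"
  local intermediate="combine.pdf"

  # Collect the page images in sorted order; an unmatched glob
  # stays as the literal string "*.png", which does not exist.
  set -- *.png
  if [ ! -e "$1" ]; then
    echo "No PNG files found in $(pwd)" >&2
    return 1
  fi

  # Step 1: wrap the images in a PDF. Step 2: add a text layer.
  # The && means OCR only runs if the image PDF was built successfully.
  img2pdf "$@" -o "$intermediate" &&
    ocrmypdf --skip-text "$intermediate" "$out"
}
```

Saved as a script file, the last line would be `png_to_ocr_pdf "$@"`, so that the output name given on the command line is passed through.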
Batch Thinking
The real payoff comes when you start thinking in folders:
archive-box-01/
  newsletter-1978-03/
    001.png
    002.png
    003.png
  minutes-1981-11-14/
    001.png
    002.png
    003.png
    004.png
Each folder becomes one document. Each folder can become one searchable PDF. That structure is readable by humans, easy to back up, and simple to revisit later.
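With that layout, one loop can process a whole box. The sketch below is one possible way to do it, not a prescribed method: ocr_each_folder is a hypothetical name, and writing each PDF into its own folder, named after the folder, is an assumption you may want to change.

```shell
# Hypothetical helper: build one searchable PDF per subfolder of a box.
#   $1 = box directory (default: current directory)
ocr_each_folder() {
  local box="${1:-.}" dir name
  for dir in "$box"/*/; do
    [ -d "$dir" ] || continue            # no subfolders: the glob stayed literal
    name="$(basename "$dir")"            # e.g. newsletter-1978-03

    # Collect this folder's pages in sorted order; skip folders without PNGs.
    set -- "$dir"*.png
    [ -e "$1" ] || continue

    # Writes <folder>/<folder>.pdf, then <folder>/<folder>-ocr.pdf.
    img2pdf "$@" -o "$dir$name.pdf" &&
      ocrmypdf --skip-text "$dir$name.pdf" "$dir$name-ocr.pdf" ||
      echo "Failed: $name" >&2           # report and move on to the next folder
  done
}
```

Running `ocr_each_folder archive-box-01` would leave the PNGs untouched and add, for example, newsletter-1978-03/newsletter-1978-03-ocr.pdf alongside them.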
This is the kind of ordinary technical practice that supports better humanities work. Digitization is not only about expensive platforms or specialized repositories. It is also about small, durable habits: good scans, clear filenames, reusable commands, and files that remain understandable after the software changes.
When I am working with old documents, I want the workflow to stay boring. Scan clean images. Group pages. Combine with img2pdf. Run ocrmypdf. Keep the source images. Save the searchable PDF. Repeat.
Tags : digital-humanities technology notes