Adam DJ Brett

PNG to PDF to OCR: A Small Command Line Skill for Digital Humanities Work

One of the most useful digital humanities skills is not glamorous at all: knowing how to turn a folder of high-quality page images into a searchable PDF. If you are scanning old documents, archival packets, newsletters, church records, syllabi, meeting minutes, or family papers, the fastest workflow is often to separate scanning from digitizing.

I do not want to sit at a scanner and make decisions about every final PDF while the machine is running. I want the scanner to do what scanners do best: capture clean images at 300 DPI or higher. Then I can group, rename, combine, and OCR the files later from the command line.

That small shift matters. Scanning becomes a capture process. Digitizing becomes an organization and text-recognition process. The result is less fiddling at the scanner, better preservation images, and a repeatable workflow for making documents searchable.

The Basic Workflow

On a Mac, use Homebrew to install two command line tools:

brew install img2pdf ocrmypdf

Put the PNG files for one document into a folder. Name them in page order:

001.png
002.png
003.png
004.png

The leading zeroes are important because the shell expands *.png in sorted order and passes the files to the command in that order. The sort compares names character by character, not as numbers, so with names like 1.png, 2.png, and 10.png, page 10 lands before page 2.
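If the scanner has already produced unpadded names, a short loop can pad them after the fact. The sketch below is one way to do it, assuming bash, base names that are purely numeric (1.png, 10.png), and a three-digit width. The function name pad_png_names is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename 1.png, 2.png, 10.png to 001.png, 002.png, 010.png
# in the directory given as the first argument (default: current directory).
pad_png_names() {
  local dir="${1:-.}" f base padded
  for f in "$dir"/*.png; do
    [ -e "$f" ] || continue               # no PNGs: the glob stayed literal
    base="$(basename "$f" .png)"
    case "$base" in
      ''|*[!0-9]*) continue ;;            # leave non-numeric names alone
    esac
    # 10# forces base 10 so a name like 08 is not read as octal
    padded="$(printf '%03d' "$((10#$base))")"
    [ "$base" = "$padded" ] && continue   # already padded
    mv -n -- "$f" "$dir/$padded.png"      # -n: never overwrite an existing file
  done
}
```

Calling pad_png_names with no argument works on the current folder. Files such as cover.png are skipped, so it is worth checking the listing afterwards before combining.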

Then combine the PNG files into a PDF:

img2pdf *.png -o combine.pdf

That gives you a PDF made from the page images. Next, run OCRmyPDF to add searchable text:

ocrmypdf combine.pdf combine-ocr.pdf

The output, combine-ocr.pdf, is still visually based on the original scans, but now the PDF has a text layer. You can search it, select text, index it, quote from it, or add it to a research library.
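A quick way to confirm the text layer exists is to try extracting text from the result. The sketch below assumes pdftotext is available (it ships with Poppler, which is not one of the two tools installed above), and has_text_layer is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeed if the PDF yields any letters or digits.
has_text_layer() {
  # With "-" as the output file, pdftotext writes the extracted text to stdout.
  # An image-only PDF produces only whitespace or page breaks, so grep
  # finds no alphanumeric character and the function returns non-zero.
  pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}
```

For example, `has_text_layer combine-ocr.pdf && echo searchable` prints "searchable" only when some text was found. It says nothing about how accurate the OCR is; that still takes a human look at a few pages.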

Why Scan Images First?

Many scanner apps want you to decide everything during the scan: file name, PDF settings, OCR settings, page grouping, destination folder, and sometimes compression. That is fine for a one-off document. It is not ideal when you are working through a box of old papers.

For digital humanities work, speed and repeatability matter. I would rather scan at 300 DPI or higher, save each page as a PNG, and keep moving. Later, when I am away from the scanner, I can sort the pages into folders, rename them, combine them into PDFs, and OCR them in batches.

This also preserves better intermediate files. A folder of high-resolution page images is flexible. You can make a PDF from it. You can rerun OCR with different settings. You can crop, deskew, split, reorder, or replace pages. You can keep the images as preservation files and treat the OCR PDF as a derivative.

That distinction is useful: the scan is the evidence; the OCR PDF is the access copy.
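Because the images are kept, rerunning OCR with different settings is just another command against the combined PDF. The sketch below uses three OCRmyPDF options: --language, --deskew, and --rotate-pages. The function name, the argument order, and the example file names are assumptions for illustration. Language codes are Tesseract's, and languages other than English need the matching Tesseract language data installed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: rerun OCR on an image-only PDF with chosen settings.
# Arguments: language code(s), input PDF, output PDF.
reocr_with_settings() {
  local lang="$1" input="$2" output="$3"
  # --language      Tesseract language(s), joined with "+" (e.g. eng+deu)
  # --deskew        straighten pages that were scanned slightly crooked
  # --rotate-pages  detect and fix pages scanned sideways or upside down
  ocrmypdf --language "$lang" --deskew --rotate-pages "$input" "$output"
}
```

A call such as `reocr_with_settings eng+deu combine.pdf combine-ocr-v2.pdf` writes a second access copy and leaves both the images and the first OCR PDF untouched.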

A Small Script

I keep a script called png_to_pdf_to_ocr.sh for this workflow. From inside a folder of PNGs, it combines the images and then runs OCR:

./png_to_pdf_to_ocr.sh combined-ocr.pdf

The logic is simple. With fixed file names in place of the output argument, it comes down to:

img2pdf *.png -o combine.pdf
ocrmypdf combine.pdf combine-ocr.pdf --skip-text

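The --skip-text option tells OCRmyPDF to leave alone any page that already has text instead of stopping with an error, which makes the command safe to rerun. The script just makes the process harder to mistype. It also gives me one command I can reuse when processing multiple documents.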
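For readers who want a starting point, here is a minimal sketch of what such a script could look like. It is not the actual png_to_pdf_to_ocr.sh. The function name, the default output name, and the empty-folder check are assumptions; the two commands are the ones shown above.

```shell
#!/usr/bin/env bash
# Sketch only: combine the PNGs in the current folder, then OCR the result.
# The optional first argument is the output name (default: combine-ocr.pdf).
png_to_pdf_to_ocr() {
  local output="${1:-combine-ocr.pdf}"
  local combined="combine.pdf"
  local -a pages
  pages=( *.png )                        # sorted by name, hence the zero padding
  if [ ! -e "${pages[0]}" ]; then
    echo "No PNG files in $(pwd)" >&2    # the glob matched nothing
    return 1
  fi
  img2pdf "${pages[@]}" -o "$combined" || return 1
  ocrmypdf --skip-text "$combined" "$output"
}
```

Saved in a file with a final line of `png_to_pdf_to_ocr "$@"`, this can be run as `./png_to_pdf_to_ocr.sh combined-ocr.pdf` from inside a folder of page images.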

Batch Thinking

The real payoff comes when you start thinking in folders:

archive-box-01/
  newsletter-1978-03/
    001.png
    002.png
    003.png
  minutes-1981-11-14/
    001.png
    002.png
    003.png
    004.png

Each folder becomes one document. Each folder can become one searchable PDF. That structure is readable by humans, easy to back up, and simple to revisit later.
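With that layout, one loop can process a whole box. The sketch below walks each subfolder of a root such as archive-box-01, combines its PNGs, and writes a searchable PDF named after the folder. The function name and the output naming (folder-name.pdf inside the folder, folder-name-ocr.pdf next to the folders) are assumptions, not part of the workflow described above.

```shell
#!/usr/bin/env bash
# Hypothetical batch helper: one searchable PDF per subfolder of the root.
ocr_each_folder() {
  local root="${1:-.}" dir name
  local -a pages
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue             # no subfolders: the glob stayed literal
    dir="${dir%/}"                        # drop the trailing slash
    name="$(basename "$dir")"
    pages=( "$dir"/*.png )
    [ -e "${pages[0]}" ] || continue      # skip folders without PNG pages
    # Image-only PDF stays with its source images.
    img2pdf "${pages[@]}" -o "$dir/$name.pdf" || continue
    # Searchable access copy goes next to the document folders.
    ocrmypdf --skip-text "$dir/$name.pdf" "$root/$name-ocr.pdf"
  done
}
```

Running `ocr_each_folder archive-box-01` on the tree above would produce newsletter-1978-03-ocr.pdf and minutes-1981-11-14-ocr.pdf in archive-box-01, leaving every page image where it was.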

This is the kind of ordinary technical practice that supports better humanities work. Digitization is not only about expensive platforms or specialized repositories. It is also about small, durable habits: good scans, clear filenames, reusable commands, and files that remain understandable after the software changes.

When I am working with old documents, I want the workflow to stay boring. Scan clean images. Group pages. Combine with img2pdf. Run ocrmypdf. Keep the source images. Save the searchable PDF. Repeat.

Tags : digital-humanities technology notes
