OCR: How to Scan Paper Documents Into Editable Text

Turn a document photo into copyable text in seconds. A practical guide for getting accurate results.

Taking a Good Photo for OCR

The quality of the OCR output is almost entirely determined by the quality of the input image. A good photo of a clear document will produce near-perfect recognition; a blurry, shadowed, or skewed photo will produce garbled output regardless of how sophisticated the OCR engine is.

Lighting is the most critical variable. Direct flash creates bright hotspots and sharp shadows that the OCR engine reads as marks on the paper. Indirect natural light — near a window but not in direct sunlight — produces even, shadow-free illumination that makes the text maximally legible. If you're using artificial light, a diffused source (like a ceiling lamp, not a desk spotlight) works best.

Hold the camera perpendicular to the document. Even a 15-degree angle introduces enough perspective distortion to reduce recognition accuracy, particularly on characters with similar shapes (O vs Q, I vs l vs 1). Most modern smartphone cameras have a document scanning mode that automatically corrects perspective — use it if available.

Resolution matters but has a ceiling. For a full page, a minimum width of about 1,500 pixels is a good target, and 2,000–3,000 pixels is ideal. Beyond 4,000 pixels, you're mostly adding file size without improving OCR accuracy. Separately, place the document on a plain, high-contrast background so the OCR engine does not "see" background texture as part of the document.
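As a rough check on those pixel targets, the sketch below converts an image's pixel width into effective dots per inch. The function name and the 8.5-inch default page width (US Letter) are assumptions made for illustration, and the calculation assumes the page fills the full width of the photo.

```python
def effective_dpi(pixel_width: int, page_width_inches: float = 8.5) -> float:
    """Approximate resolution, assuming the page spans the whole image width."""
    if pixel_width <= 0 or page_width_inches <= 0:
        raise ValueError("pixel width and page width must be positive")
    return pixel_width / page_width_inches


# Compare a few image widths against the targets discussed above.
for width in (1500, 2500, 4000):
    print(f"{width} px across a Letter page is about {effective_dpi(width):.0f} DPI")
```

If the page occupies only part of the frame, measure the page's width in pixels rather than the whole image's width before applying the same arithmetic.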

What OCR Handles Well vs Poorly

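Modern OCR engines achieve greater than 99% character accuracy on clean printed text: commercially typeset documents, laser-printed reports, and documents printed from Word or PDF. Treat 99% as a floor rather than a typical figure: a 500-word document has roughly 3,000 characters, so exactly 99% would still mean about 30 misread characters. On clean input, accuracy is often high enough that only a handful of errors remain, easily caught on a quick proofread.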
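To see what a character accuracy figure means in practice, the sketch below estimates how many misread characters to expect in a document. The function name and the figure of six characters per word (including the trailing space) are assumptions for illustration; real documents vary.

```python
def expected_errors(word_count: int, char_accuracy: float,
                    chars_per_word: float = 6.0) -> float:
    """Expected number of misread characters for a given character accuracy."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    total_chars = word_count * chars_per_word
    return total_chars * (1.0 - char_accuracy)


# Small differences in accuracy translate into large differences in error count.
for accuracy in (0.99, 0.999, 0.9997):
    errors = expected_errors(500, accuracy)
    print(f"{accuracy:.2%} accuracy: about {errors:.1f} errors in 500 words")
```

The point of the arithmetic is that the last fraction of a percent matters: the gap between 99% and 99.9% is the gap between a page that needs careful correction and one that needs a quick skim.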

Typewritten text (from a mechanical typewriter) is also handled very well. The consistent font and letterform shapes are similar enough to digital fonts that recognition rates remain high. Block handwritten capitals — the kind used to fill in official forms — are acceptable, typically achieving 90–95% accuracy depending on letter formation and ink contrast.

Cursive handwriting is where OCR becomes unreliable. The connected letterforms, and the wide variation in shapes from one writer to the next, mean that general-purpose OCR systems produce high error rates. Some specialized handwriting recognition tools exist, but they are typically trained on specific script types and work poorly on general cursive.

Tables and mathematical formulas are structurally fragile. OCR can recognize the individual characters inside a table, but reassembling them into correct rows and columns in the output text is difficult. Math formulas with fractions, subscripts, and special symbols often lose their structure entirely. Manual correction after OCR is typically required.

Improving Accuracy on Difficult Documents

If your image produces poor results, several preprocessing steps can significantly improve accuracy. Straightening (deskewing) is often the most impactful: even a rotation of a couple of degrees can disrupt line detection and character recognition. Many scanning apps and image editors include an automatic deskew function; alternatively, you can rotate the image manually until horizontal text appears level.
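One common way to estimate skew automatically is the projection-profile method: try a range of candidate angles and keep the one at which dark pixels pile up most sharply into horizontal rows. The sketch below is a minimal pure-Python illustration, not the method used by any particular editor or OCR engine. It assumes the dark pixels have already been extracted as (x, y) coordinates, and all function names and the search range are hypothetical.

```python
import math
from collections import Counter


def rotate_points(points, angle_degrees):
    """Rotate (x, y) points about the origin by the given angle."""
    theta = math.radians(angle_degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in points]


def profile_score(points):
    """Sum of squared row counts: larger when ink concentrates in few rows."""
    rows = Counter(round(y) for _, y in points)
    return sum(count * count for count in rows.values())


def estimate_skew(points, max_angle=5.0, step=0.5):
    """Return the candidate angle whose removal best aligns ink into rows.

    The sign follows rotate_points: a result of 2.0 means the points look
    as if rotate_points(..., 2.0) had been applied to level text.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        score = profile_score(rotate_points(points, -angle))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic "text": three level lines of ink, then skewed by 2 degrees.
level_lines = [(x, y) for y in (0, 20, 40) for x in range(200)]
skewed = rotate_points(level_lines, 2.0)
print("estimated skew:", estimate_skew(skewed), "degrees")
```

Once the angle is known, rotating the image by its negative levels the text. The half-degree step is a trade-off between precision and speed; a finer step costs proportionally more passes over the pixels.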

Increasing contrast helps with faded documents, photocopies, and printed text on colored paper. The goal is maximum difference between the text (dark) and the background (light). Converting to pure black-and-white (binarization), rather than leaving the image in grayscale, is often one of the most effective preprocessing steps for difficult documents.
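Binarization needs a cutoff between "ink" and "paper". A widely used way to choose one automatically is Otsu's method, which picks the threshold that best separates the dark and light pixel populations. The sketch below is a minimal pure-Python version that assumes the image is supplied as a flat list of integer grayscale values from 0 to 255; the function names are hypothetical, and this is one illustrative choice rather than what any given OCR tool does internally.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map pixels at or below the threshold to 0 (ink), the rest to 255 (paper)."""
    return [0 if value <= threshold else 255 for value in pixels]


# A faded page: grayish "ink" on a dull, not-quite-white background.
faded = [90, 100, 110] * 10 + [180, 190, 200] * 30
cutoff = otsu_threshold(faded)
clean = binarize(faded, cutoff)
print("threshold:", cutoff, "ink pixels:", clean.count(0))
```

A single global threshold like this works best under even lighting. On photos with shadows or a brightness gradient, a threshold computed per region usually copes better.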

Cropping out distracting margins removes furniture, hands, and desk surfaces that the OCR engine might try to process as part of the document. For aged or yellowed paper, a brightness adjustment to lighten the background without washing out the text can significantly improve character contrast.
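Cropping can also be approximated automatically. The sketch below finds the bounding box of the bright page region and trims everything outside it. It assumes the paper is clearly brighter than the surface it sits on, that the image is a list of rows of 0-255 grayscale values, and that a brightness of 160 is a sensible cutoff for "paper". The function names and that cutoff are all illustrative assumptions.

```python
def page_bounds(image, paper_threshold=160):
    """Bounding box (top, left, bottom, right) of pixels at least as bright
    as paper_threshold; bottom and right are exclusive. None if no page found."""
    rows = [r for r, row in enumerate(image)
            if any(value >= paper_threshold for value in row)]
    if not rows:
        return None
    width = len(image[0])
    cols = [c for c in range(width)
            if any(row[c] >= paper_threshold for row in image)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(image, bounds):
    """Return the sub-image inside the (top, left, bottom, right) bounds."""
    top, left, bottom, right = bounds
    return [row[left:right] for row in image[top:bottom]]


# A dark desk (50) with a bright page (230) and one ink pixel (20) on it.
photo = [[50] * 8 for _ in range(6)]
for r in range(1, 5):
    for c in range(2, 7):
        photo[r][c] = 230
photo[2][4] = 20

bounds = page_bounds(photo)
page = crop(photo, bounds)
print("bounds:", bounds, "cropped size:", len(page), "x", len(page[0]))
```

Because the box is drawn around the bright region, ink inside the page is kept even though it is darker than the cutoff. A bright object lying next to the page would widen the box, so a quick visual check of the crop is still worthwhile.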

After OCR: Verify and Correct

No OCR output should be used without at least a quick proofread. Certain character confusions are systematic and predictable. The digit 0 is commonly confused with the letter O, especially in older OCR engines. The digit 1, lowercase L, and uppercase I look nearly identical in many fonts and are frequently swapped. The character combination "rn" is often recognized as "m" in smaller font sizes.
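Because these confusions are predictable, some of them can be flagged mechanically. The sketch below looks for number-like tokens that contain the letters O, o, l, or I alongside digits and proposes an all-digit reading. The function name, the pattern, and the sample text are assumptions for illustration, and the output is a list of suggestions to check against the original, not corrections to apply blindly.

```python
import re

# Letters commonly misread in place of digits, mapped to the likely digit.
CONFUSABLE_TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

# A whole token made only of digits and confusable letters, with at least one digit.
NUMBER_LIKE = re.compile(r"\b(?=\w*\d)[\dOolI]+\b")


def suggest_digit_fixes(text):
    """Return (original, suggestion) pairs for suspicious number-like tokens."""
    suggestions = []
    for match in NUMBER_LIKE.finditer(text):
        token = match.group()
        fixed = token.translate(CONFUSABLE_TO_DIGIT)
        if fixed != token:
            suggestions.append((token, fixed))
    return suggestions


sample = "Invoice 1O0 total 2l5.00 for Oslo"
for original, suggestion in suggest_digit_fixes(sample):
    print(f"check {original!r}: possibly {suggestion!r}")
```

Ordinary words such as "Oslo" are left alone because they contain no digit. The "rn" versus "m" confusion has no equally simple rule, since both readings often produce plausible letter sequences, so it still needs a human eye or a dictionary check.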

Numbers in tables and prices deserve special attention — a misread digit in a financial document or legal contract has real consequences. Proper nouns, technical terms, and domain-specific vocabulary are more error-prone than common dictionary words, because most OCR engines use language models that favor common words.

A practical routine: after OCR, run a spell-check on the output, then manually scan the first and last line of each paragraph, where recognition can be slightly less reliable than in mid-paragraph text. For critical documents, a full proofread against the original image is worth the time.
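The routine above can be sketched as a small helper that gathers what to look at first: words missing from a known vocabulary, plus the first and last line of each paragraph. The function name, the tiny vocabulary, and the sample text are hypothetical; a real spell-checker would use a full dictionary.

```python
import re


def review_checklist(ocr_text, vocabulary):
    """Return (unknown_words, edge_lines) to guide a manual proofread.

    unknown_words: sorted words not found in the vocabulary (case-insensitive).
    edge_lines: one (first_line, last_line) pair per paragraph.
    """
    known = {word.lower() for word in vocabulary}
    words = re.findall(r"[A-Za-z']+", ocr_text)
    unknown_words = sorted({w for w in words if w.lower() not in known})

    edge_lines = []
    # Paragraphs are assumed to be separated by at least one blank line.
    for paragraph in re.split(r"\n\s*\n", ocr_text.strip()):
        lines = [line for line in paragraph.splitlines() if line.strip()]
        if lines:
            edge_lines.append((lines[0], lines[-1]))
    return unknown_words, edge_lines


vocabulary = {"the", "government", "report", "was", "filed",
              "today", "total", "due"}
ocr_text = "The govemment report\nwas filed today.\n\nTotal due: 100."

unknown, edges = review_checklist(ocr_text, vocabulary)
print("words to check:", unknown)
for first, last in edges:
    print("scan:", first, "|", last)
```

A dictionary check only catches errors that produce non-words. A misreading that yields another valid word, or a wrong digit, passes silently, which is why numbers and critical documents still call for comparison against the original image.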