What Is OCR and How Does It Work?

How a machine turns an image of text into editable text, and why image quality is the deciding factor in accuracy.

The simple definition

OCR stands for Optical Character Recognition. It's the technology that allows a computer to "read" an image containing text and extract editable, searchable text from it.

In practice: you have a scanned invoice, a photo of a document, or an image exported from a non-selectable PDF. OCR transforms those pixels into text you can copy, edit, index, or store in a database.

What happens inside an OCR engine

The process runs through four distinct steps.

1. Image preprocessing

Before analyzing characters, the engine cleans up the source image: conversion to grayscale, contrast enhancement, deskewing (correcting tilt), and noise reduction. The quality of this step largely determines final accuracy.
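As a rough illustration of the preprocessing step, here is a minimal sketch in plain Python. It assumes the image is a list of rows of (r, g, b) tuples; the function names and the fixed threshold of 128 are illustrative choices, not what any particular engine does internally. Real engines typically use adaptive thresholds.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


def stretch_contrast(gray):
    """Linearly rescale so the darkest pixel becomes 0 and the brightest 255."""
    flat = [v for row in gray for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # uniform image: nothing to stretch
        return [row[:] for row in gray]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in gray]


def binarize(gray, threshold=128):
    """Mark a pixel as ink (1) if it is darker than the threshold, else paper (0)."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


# Faded ink (150) on a light gray page (220): a hypothetical low-contrast scan.
faded = [
    [(220, 220, 220), (150, 150, 150), (220, 220, 220)],
    [(220, 220, 220), (150, 150, 150), (220, 220, 220)],
]
gray = to_grayscale(faded)
without_stretch = binarize(gray)                 # the ink is lost
with_stretch = binarize(stretch_contrast(gray))  # the ink column is recovered
```

With this input, thresholding the raw grayscale image finds no ink at all, while stretching the contrast first recovers the middle column. That is the sense in which preprocessing decides what the later steps get to see.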

2. Segmentation

The engine identifies text zones in the image, then breaks those zones into lines, lines into words, and words into individual characters. This step is complex: poorly aligned documents, multiple columns, and tables pose real challenges to segmentation.
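One classic way to split a page into lines and a line into glyph candidates is the projection profile: count the ink pixels in each row (or column) and cut wherever the count drops to zero. The sketch below assumes a binarized image given as rows of 0/1 values; the function names are illustrative. It also shows why skewed pages and multi-column layouts are hard: the empty gaps this method relies on disappear.

```python
def split_runs(profile):
    """Return (start, end) index pairs, end exclusive, for each run of non-zero counts."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count and start is None:
            start = i
        elif not count and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_lines(binary):
    """Find text lines as ranges of rows that contain at least one ink pixel."""
    return split_runs([sum(row) for row in binary])


def segment_columns(binary, top, bottom):
    """Within rows top..bottom-1, find glyph candidates as ranges of inked columns."""
    width = len(binary[0])
    profile = [sum(binary[y][x] for y in range(top, bottom)) for x in range(width)]
    return split_runs(profile)


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
lines = segment_lines(page)                    # two lines: rows 1-2 and row 4
first_line = segment_columns(page, *lines[0])  # two glyph candidates in line one
```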

3. Glyph recognition

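In the classic approach, the engine compares the shape of each isolated glyph against a database of character models. Modern engines replace this matching with neural networks trained on large collections of text images. Tesseract, since version 4, uses an LSTM network that recognizes a whole line of text at once rather than one character at a time.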
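To make the idea of comparing a glyph against stored character models concrete, here is a toy template matcher. It is a sketch of the classic technique only, not of a neural network and not of Tesseract's recognizer; the 3x5 bitmaps and the TEMPLATES table are made up for the example.

```python
# Hypothetical 3x5 character models: "1" is ink, "0" is paper.
TEMPLATES = {
    "I": ("111", "010", "010", "010", "111"),
    "L": ("100", "100", "100", "100", "111"),
    "T": ("111", "010", "010", "010", "010"),
}


def hamming(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph, templates=TEMPLATES):
    """Return (label, distance) for the template closest to the glyph."""
    scored = ((label, hamming(glyph, model)) for label, model in templates.items())
    return min(scored, key=lambda pair: pair[1])


# A "T" with one stray ink pixel in the fourth row.
noisy_t = ("111", "010", "010", "110", "010")
label, distance = recognize(noisy_t)  # closest model is "T", one pixel off
```

The single noisy pixel still leaves "T" as the nearest model. Shapes that differ by only a few pixels, such as "l" and "1", are exactly where this kind of matching breaks down.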

4. Contextual correction

The raw recognition result often contains errors: an "l" recognized as "1", an "O" confused with "0". Built-in language models analyze recognized character sequences to correct linguistically improbable errors.
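A very small stand-in for contextual correction is a dictionary check over visually confusable characters. The sketch below is an assumption-laden simplification: real engines use statistical language models, whereas this one only tries the substitutions listed in a hand-written CONFUSABLE table against a hypothetical vocabulary.

```python
from itertools import product

# Characters that OCR commonly confuses; each maps to the alternatives to try.
CONFUSABLE = {
    "1": "1lI", "l": "l1I", "I": "Il1",
    "0": "0O", "O": "O0",
}


def candidates(token):
    """Yield every spelling obtained by swapping confusable characters."""
    options = [CONFUSABLE.get(ch, ch) for ch in token]
    for chars in product(*options):
        yield "".join(chars)


def correct(token, vocabulary):
    """Return the first candidate found in the vocabulary, else the token unchanged."""
    for candidate in candidates(token):  # the unchanged token is tried first
        if candidate in vocabulary:
            return candidate
    return token


vocabulary = {"Total", "Invoice", "due"}
fixed = [correct(word, vocabulary) for word in ["1nvoice", "Tota1", "due", "100"]]
```

Here "1nvoice" and "Tota1" are repaired because a known word is one substitution away, while "100" is left alone because no candidate matches the vocabulary. The number of candidates grows quickly with the number of confusable characters in a token, so this only suits short tokens.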

Tesseract: the reference open-source engine

Tesseract is one of the most widely used open-source OCR engines in the world. It was originally developed at HP starting in the 1980s, released as open source in 2005, and sponsored by Google from 2006. It now supports over 100 languages.

Zipero uses Tesseract.js, the WebAssembly port of Tesseract for the browser. This means recognition happens entirely on your device: no extracted text passes through our servers.

What determines accuracy

Resolution: 300 DPI is the recommended minimum for high accuracy. Below 150 DPI, each character is made of too few pixels to be recognized reliably. A hand-held photo of a document is often insufficient; a proper 300 DPI scan produces far better results.
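The effect of resolution is easy to quantify. A typographic point is 1/72 of an inch, so the pixel height available to a font follows directly from the DPI. The helper names below are illustrative, and the result is the full font height (the em), so individual letters are somewhat smaller than the figure returned.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def font_height_px(font_size_pt, dpi):
    """Pixel height of the font's em box when a page is sampled at the given DPI."""
    return font_size_pt / POINTS_PER_INCH * dpi


def effective_dpi(pixel_width, physical_width_in):
    """DPI implied by an image's pixel width and the real width of the page."""
    return pixel_width / physical_width_in


at_300 = font_height_px(10, 300)  # roughly 41.7 px for 10 pt text
at_150 = font_height_px(10, 150)  # roughly 20.8 px for the same text
a4_scan = effective_dpi(2480, 8.27)  # an A4 page 2480 px wide is about 300 DPI
```

Halving the DPI halves the height of every character and cuts the number of pixels describing it to a quarter, which is why small print degrades first.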

Contrast: black text on a white background gives the best results. A yellowing document or colored text on a colored background significantly degrades accuracy.

Font: on clean, high-resolution scans, standard serif and sans-serif fonts can be recognized at over 99% accuracy. Handwriting and decorative fonts remain challenging for all OCR engines.

Tilt: a document scanned at an angle can sharply reduce accuracy, because skewed lines are harder to segment. Tesseract tolerates slight skew, but larger angles are best corrected before recognition.
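Skew has to be measured before it can be corrected. One classic technique, sketched here under simplifying assumptions and not presented as Tesseract's own method, is to try a range of candidate angles and keep the one for which the ink pixels pile up into the sharpest horizontal rows. Ink pixels are given as (x, y) coordinates with y growing downward, and a positive angle means the lines slope down to the right. The synthetic_page helper exists only to generate test data.

```python
import math


def projection_score(points, angle_deg):
    """Rotate ink points by -angle and score how sharply they cluster into rows."""
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    counts = {}
    for x, y in points:
        row = round(y * cos_t - x * sin_t)
        counts[row] = counts.get(row, 0) + 1
    # The sum of squares is largest when points concentrate in few rows.
    return sum(c * c for c in counts.values())


def estimate_skew(points, max_angle=10.0, step=0.5):
    """Return the candidate angle, in degrees, with the sharpest row profile."""
    steps = int(round(2 * max_angle / step))
    angles = [-max_angle + i * step for i in range(steps + 1)]
    return max(angles, key=lambda a: projection_score(points, a))


def synthetic_page(skew_deg, baselines=(20, 60, 100), width=200):
    """Three one-pixel-thick text lines tilted by skew_deg (test data only)."""
    slope = math.tan(math.radians(skew_deg))
    return [(x, round(y0 + x * slope)) for y0 in baselines for x in range(width)]


angle = estimate_skew(synthetic_page(3.0))  # expected to land near 3 degrees
```

Once the angle is known, the image is rotated by the opposite amount, which is normally left to an image library. The search is limited to moderate angles here; pages rotated by 90 or 180 degrees need a separate orientation check.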

Common use cases

Extracting text from a scanned invoice to import into accounting software
Making a non-selectable PDF searchable and copyable
Digitizing paper archives to index them in a database
Extracting structured data (tables, amounts, dates) from printed documents
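For the structured-data case, the OCR output is usually post-processed with pattern matching. The sketch below uses Python's re module to pull dates and amounts out of recognized text. The two patterns are assumptions made for the example (ISO-style dates and amounts with two decimals and optional thousands separators), and the sample text is invented; real invoices need patterns adapted to their locale and layout.

```python
import re

# Assumed formats: dates like 2024-03-15, amounts like 1,296.00 or 96.00.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT = re.compile(r"(?<![\d.,])\d{1,3}(?:,\d{3})*\.\d{2}(?!\d)")


def extract_fields(text):
    """Return the dates and amounts found in a block of OCR output."""
    return {"dates": DATE.findall(text), "amounts": AMOUNT.findall(text)}


ocr_text = "Invoice date: 2024-03-15\nSubtotal 1,200.00\nTax 96.00\nTotal due 1,296.00"
fields = extract_fields(ocr_text)
```

Amounts come back as strings, so they still need to be parsed into numbers before being imported into accounting software.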
