The simple definition
OCR stands for Optical Character Recognition. It's the technology that allows a computer to "read" an image containing text and extract editable, searchable text from it.
In practice: you have a scanned invoice, a photo of a document, or an image exported from a non-selectable PDF. OCR transforms those pixels into text you can copy, edit, index, or store in a database.
What happens inside an OCR engine
An OCR engine works through four distinct steps.
1. Image preprocessing
Before analyzing characters, the engine improves the source image: conversion to grayscale, contrast enhancement, deskewing (correcting tilt), noise reduction. The quality of this step largely determines final accuracy.
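To make these operations concrete, here is a minimal Python sketch of three of them: grayscale conversion, contrast stretching, and a fixed-threshold binarization. It is an illustration only, not what Tesseract does internally. The image is assumed to be a plain nested list of (r, g, b) tuples, and the function names and the threshold of 128 are hypothetical choices made for the example.

```python
# Illustrative only: an image is a nested list of (r, g, b) tuples.

def to_grayscale(rgb_image):
    """Convert RGB pixels to 0-255 luminance using the ITU-R BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]

def stretch_contrast(gray):
    """Linearly rescale so the darkest pixel becomes 0 and the brightest 255."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in gray]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in gray]

def binarize(gray, threshold=128):
    """Map each pixel to 1 (ink) or 0 (background) with a fixed threshold."""
    return [[1 if p < threshold else 0 for p in row] for row in gray]

# A tiny low-contrast "scan": dark gray ink on a light gray page.
page = [
    [(200, 200, 200), (90, 90, 90), (200, 200, 200)],
    [(200, 200, 200), (90, 90, 90), (200, 200, 200)],
]
binary = binarize(stretch_contrast(to_grayscale(page)))
print(binary)  # [[0, 1, 0], [0, 1, 0]]
```

Real engines typically use adaptive thresholds rather than a single fixed value, because lighting and paper color vary across a page.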
2. Segmentation
The engine identifies text zones in the image, then breaks those zones into lines, lines into words, and words into individual characters. This step is complex: poorly aligned documents, multiple columns, and tables pose real challenges to segmentation.
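One classical way to break a zone into lines and a line into characters is the projection profile: count the ink pixels in each row (or column) and cut wherever the count drops to zero. The Python sketch below assumes a binarized image stored as a nested list where 1 means ink. The function names are hypothetical, and this is a simplified illustration rather than Tesseract's actual layout analysis.

```python
def split_runs(profile):
    """Return (start, end) pairs for each run of non-zero values; end is exclusive."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # the run ends at a blank row/column
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_lines(binary):
    """Find text lines as bands of consecutive rows that contain ink."""
    return split_runs([sum(row) for row in binary])

def segment_columns(binary, top, bottom):
    """Within one line band, find glyphs as runs of columns that contain ink."""
    band = binary[top:bottom]
    return split_runs([sum(col) for col in zip(*band)])

page = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
lines = segment_lines(page)
print(lines)                             # [(1, 3), (4, 5)]
print(segment_columns(page, *lines[0]))  # [(0, 2), (4, 5)]
```

This only works when lines are horizontal and separated by blank rows, which is why tilted pages, multiple columns, and tables need more elaborate layout analysis.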
3. Glyph recognition
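For each isolated character, a classical engine compares the glyph's shape against its database of character models. Modern approaches use neural networks (convolutional and recurrent) trained on millions of text images, and often recognize whole words or lines at once rather than isolated characters. Tesseract has used an LSTM-based line recognizer since version 4.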
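The idea of comparing a glyph's shape against stored character models can be sketched with nearest-template matching. The Python example below uses three hypothetical 3x3 templates and picks the one that differs from the input in the fewest pixels (Hamming distance). It is a toy version of the classical approach, not a neural network and not Tesseract's classifier.

```python
# Hypothetical 3x3 character models: "1" is ink, "0" is background.
TEMPLATES = {
    "I": ["010",
          "010",
          "010"],
    "L": ["100",
          "100",
          "111"],
    "T": ["111",
          "010",
          "010"],
}

def hamming(a, b):
    """Count the pixel positions where two equally sized glyphs differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )

def classify(glyph, templates=TEMPLATES):
    """Return (label, distance) for the template closest to the glyph."""
    scored = ((label, hamming(glyph, t)) for label, t in templates.items())
    return min(scored, key=lambda pair: pair[1])

noisy_l = ["100",
           "100",
           "110"]  # an "L" with one missing pixel
print(classify(noisy_l))  # ('L', 1)
```

The distance doubles as a rough confidence score: a large distance to every template suggests the glyph was badly segmented or is not in the model set.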
4. Contextual correction
The raw recognition result often contains errors: an "l" recognized as "1", an "O" confused with "0". Built-in language models analyze recognized character sequences to correct linguistically improbable errors.
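A very small Python sketch of this kind of correction: for each token, try swapping visually confusable characters and keep a variant that appears in a word list. The confusion table and the four-word vocabulary are hypothetical stand-ins for a real language model, and the approach is a simplification chosen for illustration.

```python
from itertools import product

# Hypothetical table: a digit and the letter it is often mistaken for.
CONFUSIONS = {"1": "1l", "0": "0o", "5": "5s"}

# Stand-in for a real language model: a tiny lowercase word list.
VOCABULARY = {"total", "invoice", "hello", "order"}

def candidates(token):
    """Yield every spelling obtained by swapping confusable characters."""
    options = [CONFUSIONS.get(ch, ch) for ch in token]
    for combo in product(*options):
        yield "".join(combo)

def correct(token, vocabulary=VOCABULARY):
    """Return a vocabulary word reachable through confusions, else the token unchanged."""
    if token in vocabulary or token.isdigit():
        return token  # already a known word, or a genuine number
    for cand in candidates(token):
        if cand in vocabulary:
            return cand
    return token

print(correct("tota1"))  # total
print(correct("he110"))  # hello
print(correct("2024"))   # 2024
```

Purely numeric tokens are left alone so that amounts and years are not "corrected" into words. The number of candidates grows exponentially with the number of confusable characters, so a real system would score alternatives with a language model rather than enumerate them.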
Tesseract: the reference open-source engine
Tesseract is one of the most widely used open-source OCR engines. It was originally developed at Hewlett-Packard between 1985 and 1994 and released as open source in 2005; Google began sponsoring its development in 2006. It now supports over 100 languages.
Zipero uses Tesseract.js, the WebAssembly port of Tesseract for the browser. This means recognition happens entirely on your device — no extracted text passes through our servers.
What determines accuracy
Resolution: 300 DPI is the recommended minimum for high accuracy. Below 150 DPI, characters are too blurry to recognize correctly. A hand-held photo of a document is often insufficient — a proper 300 DPI scan produces far better results.
Contrast: black text on a white background gives the best results. A yellowing document or colored text on a colored background significantly degrades accuracy.
Font: on clean, high-resolution scans, standard serif and sans-serif fonts can be recognized at over 99% character accuracy. Handwriting and decorative fonts remain challenging for all OCR engines.
Tilt: a document scanned at an angle can halve accuracy. Tesseract includes automatic deskewing for moderate angles.
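The resolution thresholds above can be checked with simple arithmetic: effective DPI is the image width in pixels divided by the physical width of the page in inches. The Python sketch below assumes a US Letter page (8.5 inches wide) that fills the whole image; the function names and verdict labels are hypothetical.

```python
LETTER_WIDTH_INCHES = 8.5  # assumption: US Letter page filling the image width

def effective_dpi(pixel_width, physical_width_inches=LETTER_WIDTH_INCHES):
    """Pixels per inch actually available for the page."""
    return pixel_width / physical_width_inches

def resolution_verdict(dpi):
    """Apply the thresholds from the text: 300 DPI recommended, below 150 too blurry."""
    if dpi >= 300:
        return "good"
    if dpi >= 150:
        return "marginal"
    return "too low"

for width_px in (2550, 1700, 1000):
    dpi = effective_dpi(width_px)
    print(width_px, round(dpi), resolution_verdict(dpi))
# 2550 300 good
# 1700 200 marginal
# 1000 118 too low
```

For a photo, the page rarely fills the frame, so the effective DPI is lower than the image dimensions alone suggest.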
Common use cases
- Extracting text from a scanned invoice to import into accounting software
- Making a non-selectable PDF searchable and copyable
- Digitizing paper archives to index them in a database
- Extracting structured data (tables, amounts, dates) from printed documents
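As an illustration of the last use case, here is a minimal Python sketch that pulls dates and amounts out of text that has already been recognized. It assumes ISO-style dates (YYYY-MM-DD) and amounts with a dot as the decimal separator and optional comma thousands separators. The sample text, patterns, and function names are hypothetical, and real invoices will need patterns adapted to their locale and layout.

```python
import re
from datetime import date
from decimal import Decimal

# Assumed formats: 2024-03-15 for dates, 1,200.00 or 240.00 for amounts.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
AMOUNT_RE = re.compile(r"\b(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}\b")

def parse_date(year, month, day):
    """Return a date, or None when OCR produced an impossible one (e.g. month 13)."""
    try:
        return date(int(year), int(month), int(day))
    except ValueError:
        return None

def extract_fields(text):
    """Collect every valid date and every amount found in the OCR output."""
    parsed = (parse_date(*match) for match in DATE_RE.findall(text))
    dates = [d for d in parsed if d is not None]
    amounts = [Decimal(a.replace(",", "")) for a in AMOUNT_RE.findall(text)]
    return {"dates": dates, "amounts": amounts}

ocr_text = """Invoice date: 2024-03-15
Subtotal 1,200.00
Tax 240.00
Total 1,440.00"""

fields = extract_fields(ocr_text)
print(fields["dates"])    # [datetime.date(2024, 3, 15)]
print(fields["amounts"])  # [Decimal('1200.00'), Decimal('240.00'), Decimal('1440.00')]
```

Amounts are parsed as Decimal rather than float so that values imported into accounting software keep their exact cents.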