jbreiden edited this page Feb 8, 2018 · 126 revisions

Please do not change the title of any wiki page without permission from the Tesseract developers.


Introduction

Tesseract is an open source optical character recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract printed text from images. It supports a wide variety of languages.

Tesseract doesn't have a built-in GUI, but there are several available from the 3rdParty page.

Installation

There are two parts to install: the engine itself and the training data for a language.

Linux

Tesseract is available directly from many Linux distributions. The package is generally called 'tesseract' or 'tesseract-ocr' - search your distribution's repositories to find it. Packages for specific languages, such as English, French, Chinese, etc. are also available directly from the Linux distribution.

If you are experimenting with OCR engine modes, you will need to manually install language training data beyond what is available in your Linux distribution. This is for expert users only. Various types of training data can be found on GitHub. Unpack and copy the .traineddata file into the 'tessdata' directory, probably /usr/share/tesseract-ocr/tessdata, /usr/share/tessdata, or /usr/share/tesseract-ocr/4.00/tessdata. Training data for obsolete Tesseract versions (<= 3.02) resides in another location.
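As a rough sketch of the manual step above, the following shell function prints the first directory that exists among a list of candidates, which can then be used as the copy target. The function name `find_tessdata_dir` is a hypothetical helper, not part of Tesseract, and the candidate paths are only the likely locations named in this section.

```shell
# Hypothetical helper: print the first existing directory among the arguments.
# Returns non-zero if none of the candidates exists.
find_tessdata_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}
```

It might be used as `tessdata=$(find_tessdata_dir /usr/share/tesseract-ocr/tessdata /usr/share/tessdata /usr/share/tesseract-ocr/4.00/tessdata)`, followed by copying the unpacked .traineddata file into `"$tessdata"` (this usually needs root privileges). Your distribution may use a different location entirely.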

If Tesseract is not available for your distribution, or you want to use a newer version than it offers, you can compile your own.

macOS

You can install Tesseract using either MacPorts or Homebrew.

MacPorts

To install Tesseract run this command:

sudo port install tesseract

To install any language data, run:

sudo port install tesseract-<langcode>

A list of available langcodes can be found on the MacPorts tesseract page.

Homebrew

To install Tesseract run this command:

brew install tesseract

Windows

An unofficial Windows installer for Tesseract 3.05-dev and Tesseract 4.00-dev is available from Tesseract at UB Mannheim. This includes the training tools.

An installer for the old version 3.02 is available for Windows from our download page. This includes the English training data. If you want to use another language, download the appropriate training data, unpack it using 7-Zip, and copy the .traineddata file into the 'tessdata' directory, probably C:\Program Files\Tesseract-OCR\tessdata.

To run Tesseract from any location, you may have to add the directory where the Tesseract binaries are located, probably C:\Program Files\Tesseract-OCR, to the PATH environment variable.
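If you work in a POSIX-style shell on Windows (for example the MSYS2 or Cygwin shells), a quick way to check whether the PATH change took effect is sketched below. The helper name `have_command` is hypothetical; `command -v` is the standard shell builtin for resolving a command name.

```shell
# Hypothetical helper: succeed if the shell can resolve the given command
# name (for an external program, this means it was found on PATH).
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command tesseract; then
    echo "tesseract found at: $(command -v tesseract)"
else
    echo "tesseract not found; add its install directory to PATH"
fi
```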

MSYS2

Install Tesseract:

 pacman -S mingw-w64-{i686,x86_64}-tesseract-ocr

and the data files:

 pacman -S mingw-w64-tesseract-ocr-osd mingw-w64-{i686,x86_64}-tesseract-ocr-eng

Cygwin

Released versions >= 3.02 of tesseract-ocr are part of 64-bit Cygwin.

Instructions for installing Cygwin are here: https://cygwin.com/cygwin-ug-net/setup-net.html

Tesseract-specific packages to be installed:

tesseract-ocr                           3.04.01-1
tesseract-ocr-eng                       3.04-1
tesseract-training-core                 3.04-1
tesseract-training-eng                  3.04-1
tesseract-training-util                 3.04.01-1

Other Platforms

Tesseract may work on more exotic platforms too. You can either try compiling it yourself, or take a look at the list of other projects using Tesseract.

Running Tesseract

Tesseract is a command-line program, so first open a terminal or command prompt. The command is used like this:

  tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

So basic usage to do OCR on an image called 'myscan.png' and save the result to 'out.txt' would be:

  tesseract myscan.png out

Or to do the same with German:

  tesseract myscan.png out -l deu

It can even be used with the traineddata for multiple languages at once, e.g. English and German:

  tesseract myscan.png out -l eng+deu
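When the language list comes from a script rather than being typed by hand, the `+`-separated value for `-l` can be assembled from individual codes. This is a minimal sketch; `join_langs` is a hypothetical helper name, not a Tesseract feature.

```shell
# Hypothetical helper: join language codes with '+' (eng deu -> eng+deu).
# "$*" joins the arguments using the first character of IFS; the subshell
# keeps the IFS change from leaking into the caller.
join_langs() {
    (IFS=+; printf '%s\n' "$*")
}
```

For example, `tesseract myscan.png out -l "$(join_langs eng deu)"` is equivalent to the command above.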

Tesseract also includes an hOCR mode, which produces a special HTML file with the coordinates of each word. This can be used to create a searchable PDF with a tool such as Hocr2PDF. To use it, pass the 'hocr' config option, like this:

  tesseract myscan.png out hocr

You can also create a searchable PDF directly from Tesseract (versions >= 3.03):

  tesseract myscan.png out pdf
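The examples above handle one image at a time. To process several images, one approach is to derive each output base from the image name, as in the sketch below. The helper names `output_base` and `ocr_commands` are hypothetical, and the sketch only prints the commands it would run, so it is safe to try without Tesseract installed.

```shell
# Hypothetical helper: derive an output base from an image path by dropping
# the directory and the last extension (scans/page1.png -> page1).
output_base() {
    name=${1##*/}                 # strip any leading directory
    printf '%s\n' "${name%.*}"    # strip the last extension, if any
}

# Print one tesseract command per image given as an argument.
# Nothing is executed here; pipe the output to 'sh' to actually run it
# (only with file names that contain no spaces or shell metacharacters).
ocr_commands() {
    for image in "$@"; do
        printf 'tesseract %s %s\n' "$image" "$(output_base "$image")"
    done
}
```

Calling `ocr_commands scans/*.png` would then list one command per PNG file, each writing its result next to where the command is run.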

More information about the various options is available in the Tesseract manpage.

Other Languages

Tesseract has been trained for many languages; check for your language in the Tessdata repository.

It can also be trained to support other languages and scripts; for more details see TrainingTesseract.

Development

Tesseract can also be used in your own project, under the terms of the Apache License 2.0. It has a fully featured API and can be compiled for a variety of targets, including Android and the iPhone. See the 3rdParty page for a sample of what has been done with it. Note that as yet very few 3rdParty Tesseract OCR projects are being developed for Mac, although several online OCR services that can be used on Mac may use Tesseract as their OCR engine.

Also, it's free software, so if you want to pitch in and help, please do! If you find a bug and fix it yourself, the best thing to do is to attach the patch to your bug report in the Issues List.

Support

First read the Wiki, particularly the FAQ, to see if your problem is addressed there. If not, search the Tesseract user forum or the Tesseract developer forum, and if you still can't find what you need, please ask us there.

As of 02/02/2020


These wiki pages are no longer maintained.

All pages were moved to tesseract-ocr/tessdoc.

The latest documentation is available at https://tesseract-ocr.github.io/.

