OCR with Python, OpenCV and PyTesseract

Optical Character Recognition (OCR) is the conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a photo from a scene (billboards in a landscape photo) or from a text superimposed on an image (subtitles on a television broadcast).

OCR generally consists of several sub-processes that together aim to make recognition as accurate as possible:

  • Pre-processing
  • Text detection
  • Text recognition
  • Post-processing

The sub-processes can of course vary depending on the use case, but these are generally the steps needed to perform optical character recognition.

Tesseract OCR :

Tesseract is an open source text recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API to extract printed text from images. It supports a wide variety of languages. Tesseract doesn’t have a built-in GUI, but there are several available from the 3rdParty page. Tesseract is compatible with many programming languages and frameworks through wrappers that can be found here. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single text line.

OCR Process Flow from a blog post

Tesseract 4.00 includes a new neural network subsystem configured as a text line recognizer. It has its origins in OCRopus’ Python-based LSTM implementation but has been redesigned for Tesseract in C++. The neural network system in Tesseract pre-dates TensorFlow but is compatible with it, as there is a network description language called Variable Graph Specification Language (VGSL), that is also available for TensorFlow.

To recognize an image containing a single character, we typically use a Convolutional Neural Network (CNN). Text of arbitrary length is a sequence of characters, and such problems are solved using Recurrent Neural Networks (RNNs); LSTM is a popular form of RNN. Read this post to learn more about LSTM.

How it works

Tesseract's LSTM engine developed from the OCRopus model in Python, which was a fork of an LSTM implementation in C++ called CLSTM. CLSTM is an implementation of the LSTM recurrent neural network model in C++.

Tesseract 3 OCR process from paper

Tesseract 4 was an effort to clean up the code and add a new LSTM model. The input image is processed in boxes (rectangles), line by line, which are fed into the LSTM model to produce the output. The image below visualizes how it works.

How Tesseract uses LSTM model presentation

Installing Tesseract

Installing Tesseract on Windows is easy with the precompiled binaries found here. Do not forget to edit the “path” environment variable and add the Tesseract path. On Linux or Mac, it can be installed with a few commands.
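Whatever the platform, it helps to confirm that the tesseract executable is actually reachable before moving on. Here is a minimal sketch using only the Python standard library; the helper names find_tesseract and tesseract_version are made up for this example.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version` (empty string if nothing is printed)."""
    result = subprocess.run([executable, "--version"], capture_output=True, text=True)
    # Depending on the release, the version banner may go to stdout or stderr.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else ""


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH; check the installation.")
else:
    print(path, "->", tesseract_version(path))
```

If the executable lives somewhere that is not on the path, pytesseract also lets you point to it explicitly by assigning the full path to pytesseract.pytesseract.tesseract_cmd.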

By default, Tesseract expects a page of text when it segments an image. If you’re just seeking to OCR a small region, try a different segmentation mode, using the --psm argument. By default, Tesseract fully automates the page segmentation but does not perform orientation and script detection. There are 14 modes available:

0. Orientation and script detection (OSD) only.
1. Automatic page segmentation with OSD.
2. Automatic page segmentation, but no OSD, or OCR.
3. Fully automatic page segmentation, but no OSD. (Default)
4. Assume a single column of text of variable sizes.
5. Assume a single uniform block of vertically aligned text.
6. Assume a single uniform block of text.
7. Treat the image as a single text line.
8. Treat the image as a single word.
9. Treat the image as a single word in a circle.
10. Treat the image as a single character.
11. Sparse text. Find as much text as possible in no particular order.
12. Sparse text with OSD.
13. Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

There is one more important argument, the OCR engine mode (oem). Tesseract 4 has two OCR engines: the legacy Tesseract engine and the LSTM engine. There are four modes of operation, chosen using the --oem option.

0. Legacy engine only.
1. Neural nets LSTM engine only.
2. Legacy + LSTM engines.
3. Default, based on what is available.
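To make the two arguments concrete, here is a small sketch of how a config string combining --oem and --psm could be assembled and passed to pytesseract. The helpers build_config and ocr_image are hypothetical names for this example; image_to_string and its config parameter come from pytesseract.

```python
def build_config(psm=3, oem=3):
    """Build a Tesseract config string from a page segmentation mode and an engine mode."""
    if psm not in range(14):
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if oem not in range(4):
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--oem {oem} --psm {psm}"


def ocr_image(image, psm=3, oem=3):
    """Run Tesseract on an image (PIL image or NumPy array) with the chosen modes."""
    # Imported here so that build_config can be used without Tesseract installed.
    import pytesseract
    return pytesseract.image_to_string(image, config=build_config(psm, oem))


# Mode 7 treats the image as a single text line.
print(build_config(psm=7))
```

For a cropped region containing one line of text, ocr_image(region, psm=7) would then be the call to try first.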

OCR with Pytesseract and OpenCV :

Pytesseract is a wrapper for the Tesseract-OCR Engine. It is also useful as a stand-alone invocation script for tesseract, as it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others. For more information about the Python approach, read here.

Preprocessing for Tesseract :

We need to make sure the image is appropriately pre-processed to ensure a certain level of accuracy. This includes rescaling, binarization, noise removal, deskewing, etc.

To preprocess an image for OCR, use any of the following Python functions or follow the OpenCV documentation.

Our input is this image :

Here’s what we get :

Getting boxes around text :

We can determine the bounding box information with Pytesseract using the following code.

The script below will give you bounding box information for each character detected by tesseract during OCR.
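A sketch of what such a script could look like is given here. It relies on pytesseract's image_to_boxes, which returns one line per character in Tesseract's box format (symbol, left, bottom, right, top, page) with the origin at the bottom-left corner of the image. The helpers parse_char_boxes and char_boxes are names made up for this example; the first one flips the y axis so the result can be drawn with OpenCV, whose origin is at the top-left.

```python
def parse_char_boxes(boxes_text, image_height):
    """Turn image_to_boxes output into (char, left, top, right, bottom) tuples
    expressed with a top-left origin."""
    parsed = []
    for line in boxes_text.splitlines():
        # Split from the right so the symbol itself is never broken up.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip lines that do not follow the box format
        char = parts[0]
        left, bottom, right, top = (int(value) for value in parts[1:5])
        parsed.append((char, left, image_height - top, right, image_height - bottom))
    return parsed


def char_boxes(image):
    """Run Tesseract on a NumPy image and return its character boxes."""
    # Imported here so that parse_char_boxes can be used without Tesseract installed.
    import pytesseract
    return parse_char_boxes(pytesseract.image_to_boxes(image), image.shape[0])


# Hand-written sample in box format, for an image that is 100 pixels high.
sample = "T 10 20 30 50 0\nh 32 20 45 48 0"
print(parse_char_boxes(sample, image_height=100))
```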

If you want boxes around words instead of characters, the function image_to_data will come in handy. You can use the image_to_data function with the output type specified through pytesseract's Output class.

We will use the sample receipt image below as input to test out Tesseract.

Here’s the code :

The output is a dictionary with the following keys: level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf and text.

Using this dictionary, we can get each word detected, their bounding box information, the text in them and the confidence scores for each.
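As a sketch of how that dictionary can be consumed, the snippet below keeps only the words whose confidence reaches a threshold. The names read_words and confident_words and the threshold of 60 are assumptions made for this example, and the sample dictionary is hand-written rather than real Tesseract output.

```python
def read_words(image):
    """Run Tesseract and return the image_to_data result as a dictionary."""
    # Imported here so that confident_words can be used without Tesseract installed.
    import pytesseract
    return pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)


def confident_words(data, min_conf=60.0):
    """Return (text, left, top, width, height, conf) for each non-empty word
    whose confidence is at least min_conf."""
    words = []
    for i, text in enumerate(data["text"]):
        # conf may come back as int, float or str depending on the pytesseract version.
        conf = float(data["conf"][i])
        if text.strip() and conf >= min_conf:
            words.append((text, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i], conf))
    return words


# Hand-written sample with the same layout as the dictionary described above
# (only the keys used here). Rows that are not words have conf -1 and empty text.
sample = {
    "text":   ["", "Invoice", "t0tal", "42"],
    "conf":   ["-1", "96", "35", "91"],
    "left":   [0, 12, 12, 140],
    "top":    [0, 10, 40, 40],
    "width":  [200, 80, 52, 24],
    "height": [80, 18, 18, 18],
}
print(confident_words(sample))
```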

You can plot the boxes by using the code below :

The output:

As we can see, Tesseract is not able to detect all text boxes confidently: poor quality scans and small fonts may produce poor quality OCR text detection. Also, no preprocessing has been done to improve the quality of the image.

Text template matching ( detect only digits ):

Take the example of trying to find where a digits-only string is in an image. Here our template will be a regular expression pattern that we will match against our OCR results to find the appropriate bounding boxes. We will use Python's regular expression module (re) and the image_to_data function for this.
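The idea can be sketched as follows, assuming the dictionary layout returned by image_to_data. The function name match_boxes, the confidence threshold and the sample data are made up for this example.

```python
import re

# Pattern for strings made of digits only.
DIGITS_ONLY = re.compile(r"\d+")


def match_boxes(data, pattern=DIGITS_ONLY, min_conf=60.0):
    """Return (text, left, top, width, height) for every confident word
    that matches the pattern in full."""
    matches = []
    for i, text in enumerate(data["text"]):
        if float(data["conf"][i]) >= min_conf and pattern.fullmatch(text):
            matches.append((text, data["left"][i], data["top"][i],
                            data["width"][i], data["height"][i]))
    return matches


# Hand-written sample standing in for image_to_data output.
sample = {
    "text":   ["", "Total", "42", "2021", "7a"],
    "conf":   [-1, 95, 91, 88, 90],
    "left":   [0, 10, 70, 10, 70],
    "top":    [0, 10, 10, 40, 40],
    "width":  [200, 50, 24, 44, 20],
    "height": [80, 18, 18, 18, 18],
}
print(match_boxes(sample))
```

Swapping DIGITS_ONLY for another compiled pattern (a date format, for instance) reuses the same function.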

Page segmentation modes :

There are several ways a page of text can be analysed. The tesseract api provides several page segmentation modes if you want to run OCR on only a small region or in different orientations, etc.

Here’s the list of page segmentation modes supported by Tesseract :

0. Orientation and script detection (OSD) only.
1. Automatic page segmentation with OSD.
2. Automatic page segmentation, but no OSD, or OCR.
3. Fully automatic page segmentation, but no OSD. (Default)
4. Assume a single column of text of variable sizes.
5. Assume a single uniform block of vertically aligned text.
6. Assume a single uniform block of text.
7. Treat the image as a single text line.
8. Treat the image as a single word.
9. Treat the image as a single word in a circle.
10. Treat the image as a single character.
11. Sparse text. Find as much text as possible in no particular order.
12. Sparse text with OSD.
13. Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

To change your page segmentation mode, change the --psm argument in your custom config string to any of the above mentioned mode codes.

Detect only digits using configuration :

You can recognise only digits by changing the config to the following :
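A config along these lines could be used. This is a sketch that assumes the digits config file shipped with standard Tesseract installations is available; how strictly the restriction is honoured can depend on the Tesseract version and engine mode. The names DIGITS_CONFIG and ocr_digits are chosen for this example.

```python
# Options first, then the name of a Tesseract config file ("digits").
DIGITS_CONFIG = "--oem 3 --psm 6 digits"


def ocr_digits(image):
    """Run Tesseract with the digits config on a PIL image or NumPy array."""
    # Imported here so the config string can be inspected without Tesseract installed.
    import pytesseract
    return pytesseract.image_to_string(image, config=DIGITS_CONFIG)


print(DIGITS_CONFIG)
```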

which gives the following output.

As you can see, the output is not the same as the one obtained with the regex approach.

Whitelisting/Blacklisting characters :

Whitelisting letters :

Say you only want to detect certain characters from the given image and ignore the rest. You can specify your whitelist of characters (here, we have used all the lowercase characters from a to z only) by using the following config.

And it gives us this output :

Blacklisting letters :

If you are sure some characters or expressions definitely will not turn up in your text (the OCR will return wrong text in place of blacklisted characters otherwise), you can blacklist those characters by using the following config.
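Both the whitelist and the blacklist are set through Tesseract's -c option, using the tessedit_char_whitelist and tessedit_char_blacklist variables. The helper below is a hypothetical convenience for assembling such config strings; the choice of --psm 6 is an assumption for this example.

```python
import string


def char_filter_config(whitelist=None, blacklist=None, psm=6):
    """Build a config string that restricts or excludes characters.

    The characters are inserted as-is, so this sketch is only suitable for
    plain sets such as letters and digits (no spaces or quotes).
    """
    parts = []
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    if blacklist:
        parts.append(f"-c tessedit_char_blacklist={blacklist}")
    parts.append(f"--psm {psm}")
    return " ".join(parts)


# Only lowercase letters from a to z.
print(char_filter_config(whitelist=string.ascii_lowercase))
# Everything except digits.
print(char_filter_config(blacklist=string.digits))
```

Either string can then be passed as the config argument of pytesseract.image_to_string.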

Output :

Multiple languages text :

To specify the language you need your OCR output in, use the -l LANG argument in the config, where LANG is the three-letter code of the language you want to use.

You can work with multiple languages by changing the LANG parameter as such :

NB : The language specified first to the -l parameter is the primary language.
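With pytesseract, the same thing can be expressed through the lang parameter of image_to_string, joining the codes with a plus sign. In this sketch, lang_argument and ocr_multilang are names made up for the example, and the corresponding language data files are assumed to be installed.

```python
def lang_argument(languages):
    """Join Tesseract language codes with '+', keeping the primary language first."""
    if not languages:
        raise ValueError("at least one language code is required")
    return "+".join(languages)


def ocr_multilang(image, languages):
    """Run Tesseract with several languages, e.g. ["eng", "fra"]."""
    # Imported here so that lang_argument can be used without Tesseract installed.
    import pytesseract
    return pytesseract.image_to_string(image, lang=lang_argument(languages))


# English is the primary language here, French the secondary one.
print(lang_argument(["eng", "fra"]))
```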

And you will get the following output :

Unfortunately, Tesseract does not have a feature to automatically detect the language of the text in an image. An alternative solution is provided by another Python module called langdetect, which can be installed via pip. For more information, check this link.

This module, again, does not detect the language of text from an image; it needs a string as input. The best way to do this is to first use Tesseract to get OCR text in whatever languages you think might be in there, then use langdetect to find which languages are included in the OCR text, and finally run OCR again with the languages found.

Say we have a text we thought was in English and Portuguese.
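The two-pass approach could be sketched as follows. langdetect reports two-letter codes (for example en, pt) while Tesseract expects its own three-letter codes (eng, por), so a small mapping is needed. The mapping below only lists a few languages, and the names ISO_TO_TESSERACT, tesseract_langs and ocr_two_pass, as well as the 0.2 probability threshold, are assumptions made for this example.

```python
# A few illustrative entries; extend the mapping for the languages you expect.
ISO_TO_TESSERACT = {"en": "eng", "pt": "por", "fr": "fra", "es": "spa", "de": "deu"}


def tesseract_langs(detected, min_prob=0.2):
    """Turn (code, probability) pairs from langdetect into a Tesseract lang string."""
    codes = [ISO_TO_TESSERACT[code] for code, prob in detected
             if prob >= min_prob and code in ISO_TO_TESSERACT]
    return "+".join(codes)


def ocr_two_pass(image, first_guess="eng+por"):
    """OCR with a first guess, detect the languages of the result, then OCR again."""
    # Imported here so that tesseract_langs can be used without these packages installed.
    import pytesseract
    from langdetect import detect_langs

    text = pytesseract.image_to_string(image, lang=first_guess)
    if not text.strip():
        return text  # nothing to detect a language from
    detected = [(d.lang, d.prob) for d in detect_langs(text)]
    langs = tesseract_langs(detected) or first_guess
    return pytesseract.image_to_string(image, lang=langs)


# Hand-written probabilities standing in for a detect_langs result.
print(tesseract_langs([("en", 0.71), ("pt", 0.28), ("it", 0.01)]))
```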

NB: Tesseract performs badly when, in an image with multiple languages, the languages specified in the config are wrong or aren’t mentioned at all. This can mislead the langdetect module quite a bit as well.

Tesseract limitations :

Tesseract OCR is quite powerful but does have some limitations.

  • The OCR is not as accurate as some available commercial solutions.
  • Doesn’t do well with images affected by artifacts including partial occlusion, distorted perspective, and complex background.
  • It is not capable of recognizing handwriting.
  • It may find gibberish and report this as OCR output.
  • If a document contains languages outside of those given in the -l LANG arguments, results may be poor.
  • It is not always good at analyzing the natural reading order of documents. For example, it may fail to recognize that a document contains two columns, and may try to join text across columns.
  • Poor quality scans may produce poor quality OCR.
  • It does not expose information about what font family text belongs to.

Conclusion :

Tesseract is perfect for scanning clean documents and comes with pretty high accuracy and font variability since its training was comprehensive.

The latest release of Tesseract 4.0 supports deep learning based OCR that is significantly more accurate. The OCR engine itself is built on a Long Short-Term Memory (LSTM) network, which is a particular type of Recurrent Neural Network (RNN).
