Optical Character Recognition using Python and Tesseract

Jun Choi
3 min read · Feb 19, 2021


Photo by Tanner Mardis on Unsplash

When I worked as a freight forwarder, I had to deal with a lot of scanned documents. Because they were usually stored as images, I couldn’t simply copy and paste their contents, so work involving scanned documents took much longer than work with documents in ordinary text format. If you have a similar problem and don’t want to retype everything by hand, perhaps this post will help you.

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast).

Many people think OCR is a solved problem, but it is still very challenging, because images containing text can be distorted or noisy and can use many different languages and fonts.

One of the most popular OCR engines is Tesseract. Tesseract began as a Ph.D. research project at HP Labs, Bristol. It gained popularity and was developed by HP between 1984 and 1994. In 2005, HP released Tesseract as open-source software. Since 2006 it has been developed by Google.

In this demo, I will show you how to install Tesseract on Windows and perform simple OCR using pytesseract (a Python wrapper for Tesseract) and Pillow (a free, open-source Python imaging library).

Note: I am assuming you are using Windows 10 and have Python 3 installed on your computer.

Download Tesseract

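You can download the Tesseract installer for Windows here.

After you download the installer, run the executable file. During installation you can also select additional language data. The Korean and Japanese examples later in this post only work if the Korean and Japanese language data are installed.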

Add Tesseract to Path Variable

  1. Search for “Environment Variables” in the Windows search bar
  2. Click the “Environment Variables” button
  3. Under user variables, select “Path”, click “Edit”, then click “New”
  4. Enter the path to the folder where you installed Tesseract
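
If you are not sure whether the Path change took effect, a small check like the one below can help. It is only a sketch: `find_tesseract` and `configure_pytesseract` are hypothetical helper names, and the fallback folder is an assumption about where the Windows installer usually puts Tesseract, so adjust it to your own install location. The one pytesseract detail it relies on is the `pytesseract.pytesseract.tesseract_cmd` attribute, which lets you point the wrapper at the executable directly instead of relying on Path.

```python
import os
import shutil

# Assumption: a common install location on Windows. Change this if you
# chose a different folder during installation.
DEFAULT_TESSERACT_EXE = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=DEFAULT_TESSERACT_EXE):
    """Return the path to the Tesseract executable, or None if not found."""
    # shutil.which searches the folders listed in the Path variable.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Not on Path: fall back to an explicit file location if it exists.
    if os.path.isfile(fallback):
        return fallback
    return None


def configure_pytesseract(exe_path):
    """Tell pytesseract which executable to run (useful when Path is not set)."""
    # Imported here so find_tesseract() can be used before pytesseract is installed.
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = exe_path
```

If `find_tesseract()` returns `None`, open a new command prompt (Path changes are only picked up by newly started programs) and check the folder you entered in step 4.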

Run the following commands on the command line to install pytesseract and Pillow, the Python libraries needed for this demo.

pip install pytesseract
pip install pillow
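
Before running any OCR, it may be worth confirming that pytesseract can reach Tesseract and that the language data you need is installed. The sketch below assumes a reasonably recent pytesseract that provides `get_tesseract_version()` and `get_languages()`. The helper names `missing_languages` and `check_setup` are made up for this example, and the default language list simply mirrors the languages used in this post.

```python
def missing_languages(required, installed):
    """Return the required language codes that are not installed, in order."""
    installed_set = set(installed)
    return [code for code in required if code not in installed_set]


def check_setup(required=("eng", "kor", "jpn")):
    """Print the Tesseract version and report any missing language data."""
    # Imported here so missing_languages() can be used on its own.
    import pytesseract

    print("Tesseract version:", pytesseract.get_tesseract_version())
    installed = pytesseract.get_languages(config="")
    missing = missing_languages(required, installed)
    if missing:
        print("Missing language data:", ", ".join(missing))
    else:
        print("All required languages are installed.")
    return missing
```

Calling `check_setup()` raises pytesseract's "Tesseract not found" error if the executable cannot be located, which usually means the Path step above needs another look.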

Now you are ready to OCR!

Testing image 1:


# Importing libraries
from PIL import Image
import pytesseract

print(pytesseract.image_to_string(Image.open('tess_test.png'), lang='eng'))

Output:

A Python Approach to Character
Recognition

Testing image 2:


print(pytesseract.image_to_string(Image.open('tess_test2.jpg'), lang='kor'))

Output:

안녕하세요

Testing image 3:


print(pytesseract.image_to_string(Image.open('tess_test3.jpg'), lang='jpn'))

Output:

こんにちは
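
The examples so far each use a single language. Tesseract also accepts several language codes joined with `+` (for example `kor+eng`), which can be useful for scans that mix scripts. Here is a minimal sketch of that idea. `build_lang` and `ocr_mixed` are hypothetical helpers, not part of pytesseract, and every language you pass must already be installed.

```python
def build_lang(*codes):
    """Join Tesseract language codes into the 'a+b' form used by the lang argument."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_mixed(image_path, *codes):
    """Run OCR on one image file using one or more languages."""
    # Imported here so build_lang() can be used on its own.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path), lang=build_lang(*codes))
```

For instance, `ocr_mixed('tess_test2.jpg', 'kor', 'eng')` would ask Tesseract to consider both Korean and English when reading the image.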

Testing image 4:


print(pytesseract.image_to_string(Image.open('tess_test4.jpg'), lang='eng'))

Output:

WHitg Um Alive Tl MAKE
TINY CHANGES To EARTH
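
As the last result shows, Tesseract struggles with photos that are far from a clean scan. Cleaning up the image before OCR is a common way to try to improve such results. The sketch below uses Pillow only, since it is already installed for this demo. The function name `preprocess_for_ocr` and the default `scale` and `threshold` values are assumptions chosen for illustration, not tuned settings, and there is no guarantee they will fix this particular image.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(image, scale=2, threshold=128):
    """Return a grayscale, enlarged, black-and-white copy of the image."""
    gray = ImageOps.grayscale(image)      # drop colour information
    gray = ImageOps.autocontrast(gray)    # stretch contrast to the full 0-255 range
    width, height = gray.size
    # Enlarging small text often makes characters easier to recognise.
    enlarged = gray.resize((width * scale, height * scale), Image.LANCZOS)
    # Pixels brighter than the threshold become white (255), the rest black (0).
    return enlarged.point(lambda value: 255 if value > threshold else 0)
```

To try it, pass the cleaned image instead of the original: `pytesseract.image_to_string(preprocess_for_ocr(Image.open('tess_test4.jpg')), lang='eng')`. Experimenting with different `threshold` values is usually necessary, because the best cut-off depends on the lighting in the photo.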

Additional Resources

There is a more detailed guide on how to use pytesseract on this website. It also shows how to preprocess images with the cv2 library (OpenCV) to improve OCR accuracy.

References

1: https://en.wikipedia.org/wiki/Optical_character_recognition
2: https://nanonets.com/blog/ocr-with-tesseract/

Originally published at https://junschoi.github.io.
