
An Introduction to Optical Character Recognition for Beginners

Your first step towards reading text from unstructured data

Renu Khandelwal

In this article, you will learn

  • What is Optical Character Recognition (OCR)?
  • Usage of OCR
  • Simple code to read text from PDF files and images

You have scanned copies of several documents, such as certificates for courses that candidates have taken. A course certificate could be a PDF, a JPEG, or a PNG file. How can you extract vital information like the name of the candidate, the name of the course completed, and the date the course was taken?

Optical Character Recognition (OCR)

OCR is a technology that converts handwritten, typed, or scanned text, or text inside images, into machine-readable text.

You can apply OCR to any image file containing text, to a PDF document, or to any scanned, printed, or handwritten document that is legible enough for text extraction.

Usage of OCR

Some common uses of OCR are:

  • Creating automated workflows by digitizing PDF documents across different business units
  • Eliminating manual data entry by digitizing printed documents such as passports, invoices, and bank statements
  • Creating secure access to sensitive information by digitizing ID cards, credit cards, etc.
  • Digitizing printed books, as in Project Gutenberg

Reading a PDF file

Here you will read the contents of a PDF file. You need to install the PyPDF2 library, a pure-Python library for handling different PDF functionalities such as:

  • Extracting document information like title, author, etc.
  • Splitting documents page by page
  • Encrypting and decrypting PDF files
!pip install pypdf2

You can download a sample W4 form as a PDF

Importing the library

import PyPDF2

Extract the number of pages and PDF file information

Open the PDF file to be read in binary mode using mode 'rb'. Pass pdfFileObj to PdfFileReader() to read the file stream. numPages returns the total number of pages in the PDF file. Use getDocumentInfo() to extract the PDF file's information, such as author, creator, producer, subject, and title.

filename = r'PDFfilesW4.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
info = pdfReader.getDocumentInfo()
print("No. of Pages: ", num_pages)
print("Title: ", info.title)
print("Author: ", info.author)
print("Subject: ", info.subject)
print("Creator: ", info.creator)
print("Producer: ", info.producer)

Retrieve the text from all the pages in the PDF file

Iterate through all the pages in the PDF file using getPage(), which retrieves a page by its number. You can then extract the text from each page using extractText(). In the end, close the file using close().

count = 0
text = ""
# The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
    print("Page Number", count)
print("Content", text)
pdfFileObj.close()

A word of caution: Text extracted using extractText() is not always in the right order, and the spacing also can be slightly different.
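Since extractText() can return text with irregular spacing, one simple mitigation is to normalize the whitespace after extraction. This is a minimal sketch of a generic clean-up step (the helper name normalize_whitespace is my own, not part of PyPDF2); it tidies the spacing but cannot restore the original reading order:

```python
def normalize_whitespace(raw_text):
    """Collapse runs of spaces, tabs, and newlines into single spaces.

    A generic clean-up step, not a PyPDF2 feature; it will not fix
    out-of-order text, only the spacing.
    """
    return " ".join(raw_text.split())

messy = "W-4   Employee's\n\nWithholding \t Certificate"
print(normalize_whitespace(messy))  # W-4 Employee's Withholding Certificate
```

You could apply this to the accumulated `text` variable above before any downstream parsing.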

Reading Text from an Image

You will use pytesseract, a Python wrapper for Google's Tesseract OCR engine, to read the text embedded in images.

You will need to understand some of the configuration options that can be passed to pytesseract:

  • Page segmentation modes(psm)
  • OCR engine modes(oem)
  • Language(l)

Page Segmentation Mode (psm)

psm defines how Tesseract splits, or segments, an image into lines of text or words.

Options for page segmentation modes (psm):

0: Orientation and script detection (OSD) only.
1: Automatic page segmentation with OSD.
2: Automatic page segmentation, but no OSD, or OCR.
3: Fully automatic page segmentation, but no OSD. (Default)
4: Assume a single column of text of variable sizes.
5: Assume a single uniform block of vertically aligned text.
6: Assume a single uniform block of text.
7: Treat the image as a single text line.
8: Treat the image as a single word.
9: Treat the image as a single word in a circle.
10: Treat the image as a single character.
11: Sparse text. Find as much text as possible in no particular order.
12: Sparse text with OSD.
13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

OCR Engine Mode (oem)

Tesseract has different engine modes that trade off speed and performance:

0: Legacy engine only.
1: Neural nets LSTM engine only.
2: Legacy + LSTM engines.
3: Default, based on what is available.

Language (l)

pytesseract supports multiple languages. You can specify the languages you intend to work with when installing Tesseract, and it will download the corresponding language packs. The default language is eng (English).
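The three options above are combined into a single config string that is passed to pytesseract. As a sketch (the helper name build_config is my own, for illustration; pytesseract itself just takes the raw string via its config parameter), valid flags look like this:

```python
def build_config(oem=3, psm=3, lang="eng"):
    """Build a Tesseract configuration string from the three options.

    Defaults mirror Tesseract's own: oem 3 (best available engine)
    and psm 3 (fully automatic page segmentation).
    """
    return f"--oem {oem} --psm {psm} -l {lang}"

# A single word on a tag: LSTM engine (oem 1), single-word segmentation (psm 8)
print(build_config(oem=1, psm=8))  # --oem 1 --psm 8 -l eng
```

The resulting string is what you pass as `config=` to `pytesseract.image_to_string()`, as shown in the code further below.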

Image used for reading text

Importing required libraries

import pytesseract
import cv2

Read the image file using OpenCV, then apply the configuration options for pytesseract to read the text from the image. You can try different options for psm and oem and check out the differences in the output.

# Read the file using OpenCV and show the image
image_filename = r'Apparel_tag.jpg'
img = cv2.imread(image_filename)
cv2.imshow("Apparel Tag", img)
cv2.waitKey(0)
# Set the configuration for reading text from the image using pytesseract
custom_config = r'--oem 1 --psm 8 -l eng'
text = pytesseract.image_to_string(img, config=custom_config)
print(text)
Extracted text from the image

Best Practices for OCR using pytesseract

  • Try different combinations of configuration options for pytesseract to get the best results for your use case
  • The text should not be skewed; leave some white space around the text, and ensure good illumination of the image to remove dark borders
  • A resolution of 300-600 DPI works well; treat 300 DPI as a minimum
  • A font size of 12 pt. or more gives better results
  • Apply pre-processing techniques such as binarizing and de-noising the image, rotating it to deskew the text, and increasing its sharpness
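As an illustration of the binarization step mentioned above, here is a minimal sketch using NumPy with a fixed cutoff (real pipelines would typically use cv2.threshold with Otsu's method or adaptive thresholding, which handle uneven lighting better):

```python
import numpy as np

def binarize(gray_image, threshold=128):
    """Map a grayscale image to pure black (0) and white (255).

    `threshold` is an illustrative fixed cutoff chosen for this sketch;
    it is not tuned for any particular document.
    """
    return np.where(gray_image >= threshold, 255, 0).astype(np.uint8)

# A tiny 2x3 "image": dark pixels become 0, light pixels become 255
sample = np.array([[10, 200, 130],
                   [250, 40, 90]], dtype=np.uint8)
print(binarize(sample))
# [[  0 255 255]
#  [255   0   0]]
```

A binarized image gives Tesseract a clean black-on-white input, which usually improves recognition on noisy scans.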

Conclusion:

OCR results depend on the input data quality. A clean segmentation of the text and no noise in the background gives better results. In the real world, this is not always possible, so we need to apply multiple pre-processing techniques for OCR to give better results.

References:

https://pypi.org/project/PyPDF2/

https://stackoverflow.com/questions/9480013/image-processing-to-improve-tesseract-ocr-accuracy