Your first step towards reading text from unstructured data
In this article, you will learn
- What is Optical Character Recognition (OCR)?
- Usage of OCR
- Simple code to read text from PDF files and images
Suppose you have scanned copies of several documents, such as certificates for courses that candidates have completed. A certificate could be a PDF, a JPEG, or a PNG file. How can you extract vital information like the name of the candidate, the name of the course completed, and the date the course was taken?
Optical Character Recognition (OCR)
OCR is a technology that converts handwritten, typed, scanned, or in-image text into machine-readable text.
You can apply OCR to any image file containing text, or to any PDF, scanned, printed, or handwritten document that is legible, to extract its text.
Usage of OCR
Some of the common uses of OCR are
- Creating automated workflows by digitizing PDF documents across different business units
- Eliminating manual data entry by digitizing printed documents such as passports, invoices, and bank statements
- Creating secure access to sensitive information by digitizing ID cards, credit cards, etc.
- Digitizing printed books, as in Project Gutenberg
Reading a PDF file
Here you will read the contents of a PDF file. You need to install the PyPDF2 library, which is built in Python and handles different PDF functionalities such as
- Extracting document information like title, author, etc
- Splitting documents page by page
- Encrypting and decrypting PDF files
!pip install "PyPDF2<3.0"
Note: the code below uses the legacy PyPDF2 API (PdfFileReader, numPages, getPage, extractText), which was removed in PyPDF2 3.0, so pin an earlier version as shown.
You can download a sample W4 form as a PDF
Importing the library
Extract the number of pages and PDF file information
Open the PDF file to be read in binary mode using mode 'rb'. Pass pdfFileObj to PdfFileReader() to read the file stream. numPages returns the total number of pages in the PDF file. Use getDocumentInfo() to extract the PDF file's information, such as author, creator, producer, subject, and title.
pdfFileObj = open(filename, 'rb')  # filename is the path to your downloaded PDF
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
info = pdfReader.getDocumentInfo()
print("No. of Pages: ", num_pages)
print("Title: ", info.title)
Retrieve the text from all the pages in the PDF file
Iterate through all the pages in the PDF file, using getPage() to retrieve each page by number. Extract the text from each page using extractText(). At the end, close the file using close().
count = 0
text = ""
# The while loop reads each page and appends its text
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    text += pageObj.extractText()
    count += 1
pdfFileObj.close()
A word of caution: text extracted using extractText() is not always in the right order, and the spacing can also differ slightly.
Reading Text from an Image
You will use pytesseract, a Python wrapper for Google's Tesseract OCR engine, to read the text embedded in images.
You will need to understand some of the configuration options that can be applied using pytesseract:
- Page segmentation modes (psm)
- OCR engine modes (oem)
Page Segmentation Modes (psm)
psm defines how Tesseract splits or segments an image into lines of text or words.
Options for page segmentation modes (psm):
0: Orientation and script detection (OSD) only.
1: Automatic page segmentation with OSD.
2: Automatic page segmentation, but no OSD, or OCR.
3: Fully automatic page segmentation, but no OSD. (Default)
4: Assume a single column of text of variable sizes.
5: Assume a single uniform block of vertically aligned text.
6: Assume a single uniform block of text.
7: Treat the image as a single text line.
8: Treat the image as a single word.
9: Treat the image as a single word in a circle.
10: Treat the image as a single character.
11: Sparse text. Find as much text as possible in no particular order.
12: Sparse text with OSD.
13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
OCR Engine Modes (oem)
Tesseract has different engine modes that trade off speed and accuracy:
0: Legacy engine only.
1: Neural nets LSTM engine only.
2: Legacy + LSTM engines.
3: Default, based on what is available.
pytesseract supports multiple languages. You can specify the languages you intend to work with when installing Tesseract, and it will download the corresponding language packs. eng (English) is the default language.
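Since the oem, psm, and language options all end up in one option string, it can help to assemble that string programmatically. build_tesseract_config below is a hypothetical helper for illustration, not part of pytesseract:

```python
# Hypothetical helper (not part of pytesseract): build the option string
# that pytesseract passes through to the Tesseract command line.
def build_tesseract_config(oem=3, psm=3, lang="eng"):
    """Return a Tesseract option string such as '--oem 3 --psm 3 -l eng'."""
    return f"--oem {oem} --psm {psm} -l {lang}"

# For example: LSTM engine only (oem 1), single word (psm 8), English
print(build_tesseract_config(oem=1, psm=8, lang="eng"))
# → --oem 1 --psm 8 -l eng
```

You would pass the result to pytesseract via its config parameter, e.g. pytesseract.image_to_string(img, config=build_tesseract_config(oem=1, psm=8)).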
Importing required libraries
Read the image file using OpenCV, then apply configuration options for pytesseract to read the text from the image. You can try different options for psm and oem and check the differences in output.
image_Filename = r'Apparel_tag.jpg'
# Read the file using OpenCV and show the image
img = cv2.imread(image_Filename)
cv2.imshow("Apparel Tag", img)
cv2.waitKey(0)
# Set the configuration for reading text from the image using pytesseract
custom_config = r'--oem 1 --psm 8 -l eng'
text = pytesseract.image_to_string(img, config=custom_config)
print(text)
Best Practices for OCR using pytesseract
- Try different combinations of configuration options for pytesseract to get the best results for your use case
- The text should not be skewed; leave some white space around the text and ensure good, even illumination of the image to avoid dark borders
- A resolution of at least 300 DPI works well
- A font size of 12 pt or more gives better results
- Apply different pre-processing techniques such as binarizing the image, de-noising it, rotating it to deskew the text, and increasing its sharpness
OCR results depend on the input data quality: a clean segmentation of the text and no noise in the background give better results. In the real world this is not always possible, so we need to apply multiple pre-processing techniques for OCR to give better results.
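As an illustration of the binarization step mentioned above, here is a minimal NumPy sketch of global thresholding; in OpenCV, cv2.threshold with THRESH_BINARY does the same job:

```python
import numpy as np

def binarize(gray, threshold=127):
    """Simple global thresholding: pixels above `threshold` become white
    (255) and everything else black (0), mimicking cv2.threshold with
    the THRESH_BINARY flag."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# A tiny 1x3 "grayscale image": dark, mid, and bright pixels
print(binarize(np.array([[10, 127, 200]], dtype=np.uint8)).tolist())
# → [[0, 0, 255]]
```

Feeding the binarized image to pytesseract instead of the raw photo often removes background noise that confuses the OCR engine.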