A practical guide to import unstructured text/images data
Being a part of the data science/analytics team, you’ll probably encounter many file types to import and analyze in Python. In the ideal world, all our data reside in the cloud-based databases (e.g., SQL, NoSQL) that are easy to query and extract. In the real world, however, we rarely get neat tabular data. Also, if we need additional data (being structured or unstructured) to augment the analysis, we will inevitably be working with raw data files that come in different formats.
Recently, my team started a project that, as the first step, involves integrating raw data files in formats .csv, .xlsx, .pdf, .docx, and .doc. My first reaction: the mighty pandas! which certainly handles the .csv and .xlsx, but regarding the .pdf and .docx, we will have to explore possibilities beyond the pandas.
In this blog, I will be sharing my tips and tricks to help you easily import PDF and Word documents (into Python) in case it comes up in your own work, especially in your NLP Natural Language Processing projects. All sample data files are publicly accessible, and file copies along with the corresponding download links are available in my Github repo.
As one of the most commonly used documentation tools, the MS Word oftentimes is people’s top choice for writing and sharing text. For word documents with the .docx extension, Python module docx is a handy tool, and the following shows how to import .docx paragraphs with just 2 lines of code,
Now let’s print out the output information,
As we can see, the return is a list of strings/sentences, and thus we can leverage the string processing techniques and regular expression to get the text data ready for further analysis (e.g., NLP).
Despite the ease of use, the python-docx module cannot take in the aging .doc extension, and believe it or not, .doc file is still the go-to word processor for lots of stakeholders (despite the .docx being around for over a decade). If under this circumstance, converting file types is not an option, we can turn to the win32com.client package with a couple of tricks.
The basic technique is first to launch a Word application as an active document and then to read the content/paragraphs in Python. The function docReader( ) defined below showcases how (and the fully-baked code snippet is linked here),
After running this function, we should see the same output as in section 1. Two tips: (1) we set word.Visible = False to hide the physical file so that all the processing work is done in the background; (2) the argument doc_file_name requires a full file path, not just the file name. Otherwise, the function Documents.Open( ) wouldn’t recognize the file, even with setting the working directory to the current folder.
Now, let’s move onto the PDF files,
When it comes to processing PDF files in Python, the well-known module PyPDF2 will probably be the initial attempt of most analysts, including myself. Hence, I coded it up using PyPDF2 (full code available in my Github repo), which gave the text output, as shown below,
Hmmm, apparently this doesn’t seem right because all the white spaces are missing! Without the proper spaces, there’s no way for us to parse the strings correctly.
Indeed, this example reveals one caveat of the function extractText( ) in PyPDF2: it doesn’t perform well for PDFs containing complicated text or non-printable space characters. Therefore, let’s switch to the pdfminer and explore how to get this pdf text imported,
Now, the output looks way better, and can be easily cleaned up with text mining techniques,
To make things even complicated for data scientists (of course), PDFs can (and often) be created from scanned images in lieu of a text document; hence, they cannot be rendered as plain text by pdf readers, regardless of how neatly they are organized.
In this case, the best technique I found is first explicitly to extract out the images, and then to read and parse those images in Python. We will implement this idea with the modules pdf2image and pytesseract. If the latter sounds unfamiliar, the pytesseract is an OCR optical character recognition tool for Python, which can recognize and read the text embedded in images. Now, here’s the basic function,
In the output, you should see the text of the scanned image shown below:
PREFACE In 1939 the Yorkshire Parish Register Society, of which the Parish Register Section of the Yorkshire Archaeological Society is the successor (the publications having been issued in numerical sequence without any break) published as its Volume No. 108 the entries in the Register of Wensley Parish Church from 1538 to 1700 inclusive. These entries comprised the first 110 pages (and a few lines of p. 111) of the oldest register at Wensley.
Mission accomplished! As a bonus, you now know how to extract data from images as well, i.e., the image_to_string( ) in the pytesseract module!
Note: for the pytesseract module to run successfully, you may need to perform additional configuration steps, including installing the poppler and tesseract packages. Again, please feel free to grab a more robust implementation and a detailed list of configs in my Github here.