I’m part of a project that needs to import tabular data into a structured database from PDF files built from digital or analog inputs. [Digital input = a PDF generated by a computer application; analog input = a PDF generated by scanning a paper document.]
These are the preliminary research notes I made for myself a while ago that I am now publishing for reference by other project members. These are neither conclusive nor comprehensive, but they are directionally relevant.
In particular: the amount of code it takes to parse structured data out of analog-input PDFs is a significant hurdle, not to be underestimated (this blog post was the single most awe-inspiring find I made). The strongest possible recommendation based on this research is: GET AS MUCH OF THE DATA FROM DIGITAL SOURCES AS YOU CAN.
Packages/libraries/guidance
- The basics: https://automatetheboringstuff.com/chapter13/: PyPDF2
- A more involved tutorial examining many packages: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167: Pdfrw, slate, PDFQuery, PDFMiner, PyPDF2
- Further searching: https://pypi.python.org/pypi?%3Aaction=search&term=pdf&submit=search: lots of packages
- One StackOverflow comparison: http://stackoverflow.com/questions/6413441/python-pdf-library: PyPDF2, PDFMiner, ReportLab
- One high-level package: https://github.com/pmaupin/pdfrw
- Others suggested by Ed Borasky: ijmbarr/parsing-pdfs, reesepathak/pdf-mining
- Others I found: poppler
Evaluation of Packages
- Pdfrw: https://github.com/pmaupin/pdfrw
- Python 2 & 3 (3.3, 3.4)
- Last updated 2016-10
- Heavily oriented to a printing workflow: manipulating paging, sizing, embedded images
- Multiple references to reportlab (complementary functionality)
- “There are a lot of incorrectly formatted PDFs floating around; support for these is added in some cases. The decision is often based on what acroread and okular do with the PDFs; if they can display them properly, then eventually pdfrw should, too, if it is not too difficult or costly.”
- Great writeup of the trials of wrangling PDF document internal structures
- Pdfminer http://www.unixuser.org/~euske/python/pdfminer/index.html
- Python 2 (pdfminer3k, pdfminer.six apparently support Python 3)
- Last updated 2016-09
- Can export PDF to other formats (e.g. HTML)
- Slate https://pypi.python.org/pypi/slate
- Supports Python 2 & 3
- Last updated 2015-11
- Wrapper around PDFMiner for ease of use
- Focused on extracting text from PDFs
- ReportLab http://www.reportlab.com/opensource/
- Python 2.7 or 3.3+
- Source currently tracked here: https://bitbucket.org/rptlab/reportlab/
- Commercially-backed open source
- Oriented primarily to creating PDFs
- PdfQuery https://pypi.python.org/pypi/pdfquery
- Python 2 or 3
- Last updated 2016-03
- Also a wrapper around PDFMiner, meant for ease of use
- Uses JQuery- or XPath-style syntax (i.e. requires no explicit knowledge of the PDF’s internal layout complexities)
- XPDF http://www.foolabs.com/xpdf/about.html
- A couple of years old
- Includes PDFInfo which does a great job of exporting metadata
- ijmbarr/parsing-pdfs https://github.com/ijmbarr/parsing-pdfs
- Specifically tackling tabular data
- FANTASTIC writeup of the low-level grind in extracting tabular data from PDFs that weren’t designed for ease of reuse: http://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/
- Reesepathak/pdf-mining https://github.com/reesepathak/pdf-mining
- (did not examine)
- Poppler https://poppler.freedesktop.org
- (did not examine)
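To make PdfQuery’s “JQuery or XPath syntax” point concrete, here is a minimal hedged sketch of addressing content by selector rather than by walking PDFMiner’s layout objects. The filename, label text, and bounding-box numbers are placeholders of mine, and the snippet only runs if pdfquery is installed and the file exists:

```python
import os

try:
    import pdfquery  # wraps PDFMiner; pip install pdfquery
except ImportError:
    pdfquery = None

def in_bbox(x0, y0, x1, y1):
    """Build pdfquery's :in_bbox selector for a page region (PDF points)."""
    return ':in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1)

if pdfquery is not None and os.path.exists("2016-budget-highlights.pdf"):
    pdf = pdfquery.PDFQuery("2016-budget-highlights.pdf")
    pdf.load()  # parse the whole document into an element tree
    # Select text lines by content, JQuery-style...
    totals = pdf.pq('LTTextLineHorizontal:contains("Total")')
    # ...or by position on the page, without touching layout internals.
    header = pdf.pq("LTTextLineHorizontal" + in_bbox(0, 700, 612, 792))
```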
Possible issues
- Encryption of the file
- Compression of the file
- Vector images, charts, graphs, other image formats
- Form XObjects
- Text contained in figures
- Does text always appear in the same place on the page, or different every page/document?
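The first two issues can be screened for cheaply before committing to a parser. This is a rough byte-level sketch only: the marker strings are standard PDF dictionary keys, but they can false-positive inside stream data, so a real check should use a library flag such as PyPDF2’s isEncrypted:

```python
def looks_encrypted(data):
    """Encrypted PDFs reference an /Encrypt dictionary from the trailer."""
    return b"/Encrypt" in data

def looks_compressed(data):
    """Most modern PDFs compress streams with the /FlateDecode filter."""
    return b"/FlateDecode" in data

# Usage sketch:
# with open("some.pdf", "rb") as f:
#     data = f.read()
# print(looks_encrypted(data), looks_compressed(data))
```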
PDF examples I tried parsing, to evaluate the packages
- IRS 1040A
- 2015-16-prelim-doc-web.pdf (Bellingham city budget)
- Tabular data begins on page 30 (labelled Page 28)
- PyPDF2 Parsing result: None of the tabular data is exported
- SCARY: some financial tables are split across two pages
- 2016-budget-highlights.pdf (Seattle city budget summary)
- Tabular data begins on pages 15-16 (labelled 15-16)
- PyPDF2 Parsing result: this data parses out
- FY2017 Proposed Budget-Lowell-MA (Lowell)
- Financial tabular data starts at pages 95-104, then 129-130 and 138-139
- More interesting are the small breakouts on subsequent pages e.g. 149, 151, 152, 162; 193, 195, 197
- PyPDF2 Parsing result: all data I sampled appears to parse out
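The spot checks above amounted to dumping page text and eyeballing it. A sketch of that loop, assuming PyPDF2 (the 1.x API current when these notes were made) is installed; the range helper matches the “95-104, 129-130”-style page specs used above, and the filename is a placeholder:

```python
def page_list(spec):
    """Expand a spec like "95-104, 129-130, 138" into 1-based page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            pages.extend(range(int(lo), int(hi) + 1))
        else:
            pages.append(int(part))
    return pages

try:
    from PyPDF2 import PdfFileReader  # PyPDF2 1.x API
except ImportError:  # keep the sketch importable without the package
    PdfFileReader = None

def spot_check(path, spec, preview=300):
    """Print the first `preview` chars extracted from each listed page."""
    with open(path, "rb") as f:
        reader = PdfFileReader(f)
        for n in page_list(spec):
            text = reader.getPage(n - 1).extractText()
            print("--- page %d: %d chars ---" % (n, len(text)))
            print(text[:preview])

# e.g. spot_check("2016-budget-highlights.pdf", "15-16")
```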
Experiment ideas
- Build an example PDF for myself with XLS tables, and then see what comes out when the contents are parsed using one of these libraries
- Build a script that spits out useful metadata about the document: which app/library generated it (e.g. Producer, Creator), size, # of pages
- Build another script to verify there’s a non-trivial amount of ASCII/Unicode text in the document (i.e. to confirm it doesn’t have to be OCR’d)
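A stdlib-only sketch of the last two ideas, hedged heavily: it greps the raw bytes for Info-dictionary keys instead of parsing the cross-reference table, so compressed or encrypted files will defeat it (a library call like PyPDF2’s getDocumentInfo()/getNumPages() is the robust route), and the OCR check is just a printable-character threshold I picked arbitrarily:

```python
import re

def rough_pdf_metadata(data):
    """Very rough metadata scrape from raw PDF bytes (heuristic only)."""
    def literal(key):
        # Match e.g. /Producer (Acrobat Distiller) as a literal string.
        m = re.search(rb"/" + key + rb"\s*\(([^)]*)\)", data)
        return m.group(1).decode("latin-1") if m else None

    return {
        "producer": literal(b"Producer"),
        "creator": literal(b"Creator"),
        "size_bytes": len(data),
        # Count page objects; excludes the /Pages tree node, but misses
        # pages hidden inside compressed object streams.
        "pages": len(re.findall(rb"/Type\s*/Page[^s]", data)),
    }

def likely_needs_ocr(extracted_text, min_chars=50):
    """Flag documents whose extracted text is too thin to be digital-born."""
    printable = [c for c in extracted_text if c.isprintable() and not c.isspace()]
    return len(printable) < min_chars
```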
Experiments tried
- Create an Anaconda Notebook to write some PDF scripts
- Created “PDF Experiments” environment on Win10 Anaconda install
- Fired up Anaconda Prompt and ran “pip install pypdf2” (see http://stackoverflow.com/questions/18640305/how-do-i-keep-track-of-pip-installed-packages-in-an-anaconda-conda-environment#18640601)
- Created “PDF Experiments” notebook and ran a script that included “import PyPDF2” (note the module name is case-sensitive, unlike the pip package name) – the import and subsequent function calls succeeded
- Extract the contents of Google Spreadsheets exported as PDF
- Result: no readable text exported using pypdf2 [5 spreadsheets attempted]
Comments

Just wondering if you could share some of the code you used for the Seattle City Budget project. I am trying to do something similar, but I keep running into a problem with Indirect Objects. Thanks!
Unfortunately we abandoned the work in favour of just hand-formatting table scrapes we performed from a PDF reader. Sorry.