I’m part of a project that needs to import tabular data into a structured database from PDF files built from digital or analog inputs. [Digital input = a PDF generated by a computer application; analog input = a PDF generated by scanning a paper document.]
These are the preliminary research notes I made for myself a while ago that I am now publishing for reference by other project members. These are neither conclusive nor comprehensive, but they are directionally relevant.
In particular: the amount of code it takes to parse structured data out of analog-input PDFs is a significant hurdle, not to be underestimated (this blog post was the single most awe-inspiring find I made). The strongest possible recommendation based on this research is: GET AS MUCH OF THE DATA FROM DIGITAL SOURCES AS YOU CAN.
Packages/libraries/guidance
- The basics: https://automatetheboringstuff.com/chapter13/: PyPDF2
- A more involved tutorial examining many packages: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167: Pdfrw, slate, PDFQuery, PDFMiner, PyPDF2
- Further searching: https://pypi.python.org/pypi?%3Aaction=search&term=pdf&submit=search: lots of packages
- One StackOverflow comparison: http://stackoverflow.com/questions/6413441/python-pdf-library: PyPDF2, PDFMiner, ReportLab
- One high-level package: https://github.com/pmaupin/pdfrw
- Others suggested by Ed Borasky: ijmbarr/parsing-pdfs, reesepathak/pdf-mining
- Others I found: poppler
Evaluation of Packages
- Pdfrw: https://github.com/pmaupin/pdfrw
- Python 2 & 3 (3.3, 3.4)
- Last updated 2016-10
- Heavily oriented to a printing workflow: manipulating paging, sizing, embedded images
- Multiple references to reportlab (complementary functionality)
- “There are a lot of incorrectly formatted PDFs floating around; support for these is added in some cases. The decision is often based on what acroread and okular do with the PDFs; if they can display them properly, then eventually pdfrw should, too, if it is not too difficult or costly.”
- Great writeup of the trials of wrangling PDF document internal structures
- Pdfminer http://www.unixuser.org/~euske/python/pdfminer/index.html
- Python 2 (pdfminer3k, pdfminer.six apparently support Python 3)
- Last updated 2016-09
- Can export PDF to other formats (e.g. HTML)
- Slate https://pypi.python.org/pypi/slate
- Supports Python 2 & 3
- Last updated 2015-11
- Wrapper around PDFMiner for ease of use
- Focused on extracting text from PDFs
- ReportLab http://www.reportlab.com/opensource/
- Python 2.7 or 3.3+
- Source currently tracked here: https://bitbucket.org/rptlab/reportlab/
- Commercially-backed open source
- Oriented primarily to creating PDFs
- PdfQuery https://pypi.python.org/pypi/pdfquery
- Python 2 or 3
- Last updated 2016-03
- Also a wrapper around PDFMiner, meant for ease of use
- Uses JQuery- or XPath-style syntax (i.e. requires no explicit knowledge of the PDF’s internal layout complexities)
- XPDF http://www.foolabs.com/xpdf/about.html
- A couple of years old
- Includes PDFInfo which does a great job of exporting metadata
- ijmbarr/parsing-pdfs https://github.com/ijmbarr/parsing-pdfs
- Specifically tackling tabular data
- FANTASTIC writeup of the low-level grind in extracting tabular data from PDFs that weren’t designed for ease of reuse: http://www.degeneratestate.org/posts/2016/Jun/15/extracting-tabular-data-from-pdfs/
- Reesepathak/pdf-mining https://github.com/reesepathak/pdf-mining
- (did not examine)
- Poppler https://poppler.freedesktop.org
- (did not examine)
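To make PdfQuery’s “JQuery or XPath syntax” point concrete, here is a minimal hedged sketch of addressing content by selector rather than by walking PDFMiner’s layout objects. The filename, label text, and bounding-box numbers are placeholders of mine, and the snippet only runs if pdfquery is installed and the file exists:

```python
import os

try:
    import pdfquery  # wraps PDFMiner; pip install pdfquery
except ImportError:
    pdfquery = None

def in_bbox(x0, y0, x1, y1):
    """Build pdfquery's :in_bbox selector for a page region (PDF points)."""
    return ':in_bbox("%s, %s, %s, %s")' % (x0, y0, x1, y1)

if pdfquery is not None and os.path.exists("2016-budget-highlights.pdf"):
    pdf = pdfquery.PDFQuery("2016-budget-highlights.pdf")
    pdf.load()  # parse the whole document into an element tree
    # Select text lines by content, JQuery-style...
    totals = pdf.pq('LTTextLineHorizontal:contains("Total")')
    # ...or by position on the page, without touching layout internals.
    header = pdf.pq("LTTextLineHorizontal" + in_bbox(0, 700, 612, 792))
```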
Possible issues
- Encryption of the file
- Compression of the file
- Vector images, charts, graphs, other image formats
- Form XObjects
- Text contained in figures
- Does text always appear in the same place on the page, or different every page/document?
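The first two issues can be screened for cheaply before committing to a parser. This is a rough byte-level sketch only: the marker strings are standard PDF dictionary keys, but they can false-positive inside stream data, so a real check should use a library flag such as PyPDF2’s isEncrypted:

```python
def looks_encrypted(data):
    """Encrypted PDFs reference an /Encrypt dictionary from the trailer."""
    return b"/Encrypt" in data

def looks_compressed(data):
    """Most modern PDFs compress streams with the /FlateDecode filter."""
    return b"/FlateDecode" in data

# Usage sketch:
# with open("some.pdf", "rb") as f:
#     data = f.read()
# print(looks_encrypted(data), looks_compressed(data))
```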
PDF examples I tried parsing, to evaluate the packages
- IRS 1040A
- 2015-16-prelim-doc-web.pdf (Bellingham city budget)
- Tabular data begins on page 30 (labelled Page 28)
- PyPDF2 Parsing result: None of the tabular data is exported
- SCARY: some financial tables are split across two pages
- 2016-budget-highlights.pdf (Seattle city budget summary)
- Tabular data begins on pages 15-16 (labelled 15-16)
- PyPDF2 Parsing result: this data parses out
- FY2017 Proposed Budget-Lowell-MA (Lowell)
- Financial tabular data starts at pages 95-104, then 129-130 and 138-139
- More interesting are the small breakouts on subsequent pages e.g. 149, 151, 152, 162; 193, 195, 197
- PyPDF2 Parsing result: all data I sampled appears to parse out
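The spot checks above amounted to dumping page text and eyeballing it. A sketch of that loop, assuming PyPDF2 (the 1.x API current when these notes were made) is installed; the range helper matches the “95-104, 129-130”-style page specs used above, and the filename is a placeholder:

```python
def page_list(spec):
    """Expand a spec like "95-104, 129-130, 138" into 1-based page numbers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            pages.extend(range(int(lo), int(hi) + 1))
        else:
            pages.append(int(part))
    return pages

try:
    from PyPDF2 import PdfFileReader  # PyPDF2 1.x API
except ImportError:  # keep the sketch importable without the package
    PdfFileReader = None

def spot_check(path, spec, preview=300):
    """Print the first `preview` chars extracted from each listed page."""
    with open(path, "rb") as f:
        reader = PdfFileReader(f)
        for n in page_list(spec):
            text = reader.getPage(n - 1).extractText()
            print("--- page %d: %d chars ---" % (n, len(text)))
            print(text[:preview])

# e.g. spot_check("2016-budget-highlights.pdf", "15-16")
```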
Experiment ideas
- Build an example PDF for myself with XLS tables, and then see what comes out when the contents are parsed using one of these libraries
- Build a script that spits out useful metadata about the document: which app/library generated it (e.g. Producer, Creator), size, # of pages
- Build another script to verify there’s a non-trivial amount of ASCII/Unicode text in the document (i.e. to confirm it doesn’t have to be OCR’d)
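A stdlib-only sketch of the last two ideas, hedged heavily: it greps the raw bytes for Info-dictionary keys instead of parsing the cross-reference table, so compressed or encrypted files will defeat it (a library call like PyPDF2’s getDocumentInfo()/getNumPages() is the robust route), and the OCR check is just a printable-character threshold I picked arbitrarily:

```python
import re

def rough_pdf_metadata(data):
    """Very rough metadata scrape from raw PDF bytes (heuristic only)."""
    def literal(key):
        # Match e.g. /Producer (Acrobat Distiller) as a literal string.
        m = re.search(rb"/" + key + rb"\s*\(([^)]*)\)", data)
        return m.group(1).decode("latin-1") if m else None

    return {
        "producer": literal(b"Producer"),
        "creator": literal(b"Creator"),
        "size_bytes": len(data),
        # Count page objects; excludes the /Pages tree node, but misses
        # pages hidden inside compressed object streams.
        "pages": len(re.findall(rb"/Type\s*/Page[^s]", data)),
    }

def likely_needs_ocr(extracted_text, min_chars=50):
    """Flag documents whose extracted text is too thin to be digital-born."""
    printable = [c for c in extracted_text if c.isprintable() and not c.isspace()]
    return len(printable) < min_chars
```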
Experiments tried
- Create an Anaconda Notebook to write some PDF scripts
- Created “PDF Experiments” environment on Win10 Anaconda install
- Fired up Anaconda Prompt and ran “pip install pypdf2” (see http://stackoverflow.com/questions/18640305/how-do-i-keep-track-of-pip-installed-packages-in-an-anaconda-conda-environment#18640601)
- Created “PDF Experiments” notebook and ran a script that included “import PyPDF2” (note the module name is case-sensitive, unlike the pip package name) – the import and subsequent function calls succeeded
- Extract the contents of Google Spreadsheets exported as PDF
- Result: no readable text exported using pypdf2 [5 spreadsheets attempted]
Comments

Just wondering if you could share some of the code you used for the Seattle City Budget project. I am trying to do something similar, but I keep running into a problem with Indirect Objects. Thanks!
Unfortunately we abandoned the work in favour of just hand-formatting table scrapes we performed from a PDF reader. Sorry.