Now that we have a smaller file to experiment with, let's try some programmatic solutions to extract the text and see if we fare any better. PDFMiner is a Python package that includes two command-line tools for working with PDF files. We are particularly interested in one of these tools, a command-line program called pdf2txt that is designed to extract text from a PDF document. It may be able to help us get those tables of numbers out of the file correctly.
Launch the Canopy Python environment. From the Canopy Terminal Window, run the following command:
pip install pdfminer
This will install the entire PDFMiner package and all its associated command-line tools.
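To confirm that the installation worked before moving on, a quick check like the following can help. This is only a minimal sketch, assuming a Python 3 interpreter; the helper name pdfminer_status is hypothetical, and the assumption that the extraction tool is installed on the path as pdf2txt.py (or possibly pdf2txt) may not hold for every version or platform.

```python
import importlib.util
import shutil


def pdfminer_status():
    """Report whether the pdfminer package and its text tool can be found.

    pdfminer_status is a hypothetical helper written for this example;
    it is not part of PDFMiner itself.
    """
    # True if Python can locate an importable package named "pdfminer".
    package_found = importlib.util.find_spec("pdfminer") is not None

    # Assumption: the tool is installed as "pdf2txt.py"; also try the
    # bare name in case the installer dropped the extension.
    script_path = shutil.which("pdf2txt.py") or shutil.which("pdf2txt")

    return {"package_found": package_found, "script_path": script_path}


if __name__ == "__main__":
    status = pdfminer_status()
    print("pdfminer importable:", status["package_found"])
    print("pdf2txt found at:", status["script_path"])
```

If the package is reported as importable but no script path is found, the tool may simply not be on the terminal's search path, in which case it can usually be located inside the Python environment's scripts directory.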
Tip
The documentation for PDFMiner and the two tools that come with it, pdf2txt and dumppdf, is located at http://www.unixuser.org/~euske/python/pdfminer/.