We saw how to tease meaningful information out of a PDF document. We assembled a core set of tools to extract a document's outline, summarize its pages, and pull the text from each page. We also discussed how to analyze a table or other complex layout and reassemble its content into meaningful information.
We used a clever Python design pattern called wrap-sort-unwrap: we decorated text blocks with coordinate information, and then sorted them into the useful top-to-bottom and left-to-right order. Once the text was properly organized, we could unwrap the meaningful data and produce useful output.
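As a reminder of the shape of wrap-sort-unwrap, here is a minimal sketch. The `TextBlock` class and the `reading_order()` function are hypothetical names invented for this illustration, not the code from the chapter. We also assume PDF-style coordinates, where y grows upward from the bottom of the page, so "top-to-bottom" means sorting by descending y.

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    """Hypothetical stand-in for a block of text with its page position."""
    x0: float  # left edge
    y1: float  # top edge (assumed PDF-style: larger y is higher on the page)
    text: str


def reading_order(blocks):
    # Wrap: pair each block's text with a sort key built from its coordinates.
    # Negating y puts the highest block first; x breaks ties left-to-right.
    wrapped = [((-b.y1, b.x0), b.text) for b in blocks]
    # Sort: order by the coordinate key only, never by the text itself.
    wrapped.sort(key=lambda pair: pair[0])
    # Unwrap: discard the keys and keep the text in its new order.
    return [text for _, text in wrapped]


blocks = [
    TextBlock(300.0, 700.0, "right column, top row"),
    TextBlock(50.0, 650.0, "left column, second row"),
    TextBlock(50.0, 700.0, "left column, top row"),
]
for line in reading_order(blocks):
    print(line)
```

Keeping the key separate from the payload is the point of the pattern: the sort sees only coordinates, and the text rides along until we unwrap it.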
We also discussed two other important Python design patterns: the context manager and the filter. We used object-oriented design techniques to create a hierarchy of context managers that simplify the scripts we write to extract data from files. The filter concept has three separate implementations: as a generator expression, as a generator function, and using the...