Python for Secret Agents - Volume II - Second Edition

By: Steven F. Lott

Overview of this book

Python is an easy-to-learn, extensible programming language that allows any manner of secret agent to work with a variety of data. Agents from beginners to seasoned veterans will benefit from Python's simplicity and sophistication. The standard library provides numerous packages that move beyond simple beginner missions. The Python ecosystem of related packages and libraries supports deep information processing. This book will guide you through the process of upgrading your Python-based toolset for intelligence gathering, analysis, and communication. You'll explore the ways Python is used to analyze web logs to discover the trails of activity that can be found in web and database servers. You'll also see how to use Python to discover details of a social network by looking at the data available from social networking websites. Finally, you'll see how to extract history from PDF files, which opens up new sources of data, and you'll learn about the ways you can gather data using an Arduino-based sensor device.

Understanding tables and complex layouts


To work successfully with PDF documents, we need to process some parts of the page geometry. For some kinds of running text, we don't need to worry about where the text appears on the page. But for tabular layouts, we're forced to understand the gridded nature of the display. We're also forced to grapple with the amazing subtlety of how the human eye can take a jumble of letters on a page and resolve them into meaningful rows and columns.

It doesn't matter now, but as we move forward it will become necessary to understand two pieces of PDF trivia. First, coordinates are in points; in the default PDF coordinate system, a point is 1/72 of an inch. Second, the origin, (0,0), is the lower-left corner of the page. As we read down the page, the y coordinate decreases toward zero.
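
The two pieces of trivia can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the book: the function names and the US Letter page size (612 by 792 points) are assumptions chosen for illustration, and a point is treated as 1/72 of an inch.

```python
# Illustrative helpers only; the names here are assumptions, not a library API.
POINTS_PER_INCH = 72  # assumption: one point is 1/72 of an inch


def points_to_inches(points):
    """Convert a length in points to inches."""
    return points / POINTS_PER_INCH


def distance_from_top(y, page_height):
    """Convert a bottom-up y coordinate into a distance from the top edge.

    The origin is the lower-left corner, so y shrinks as we read down the
    page; subtracting from the page height gives the top-down distance.
    """
    return page_height - y


# Assumed page size: US Letter, 612 x 792 points.
page_width, page_height = 612, 792

print(points_to_inches(page_width))          # 8.5
print(points_to_inches(page_height))         # 11.0
print(distance_from_top(720, page_height))   # 72, one inch below the top edge
```

A text line with y = 720 on this assumed page therefore sits one inch below the top edge, and a line with a smaller y value sits lower on the page.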

A PDF page will be a sequence of various types of layout objects. We're only interested in the various subclasses of LTText.
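
To show what "only the subclasses of LTText" means in practice, here is a small sketch that uses stand-in classes instead of a real PDF library. Only the name LTText comes from the text above; the other class names, the text attribute, and the sample page contents are hypothetical and exist purely so the example runs on its own.

```python
# Stand-in classes for illustration only; a real PDF library supplies its own.
class LTItem:
    """Hypothetical base class for any layout object on a page."""


class LTText(LTItem):
    """Stand-in for the text-bearing family of layout objects."""

    def __init__(self, text):
        self.text = text


class LTTextLine(LTText):
    """Hypothetical subclass of LTText holding one line of text."""


class LTImage(LTItem):
    """Hypothetical non-text layout object; we want to skip these."""


# A made-up page: a sequence of mixed layout objects.
page = [
    LTImage(),
    LTTextLine("Agent  Codename"),
    LTImage(),
    LTTextLine("007    Bond"),
]

# isinstance() accepts LTText and every subclass, and rejects the rest.
text_only = [item for item in page if isinstance(item, LTText)]

print([item.text for item in text_only])
# ['Agent  Codename', '007    Bond']
```

The isinstance() check is the key design choice: it selects a whole family of classes at once, so new text subclasses are picked up without changing the selection logic.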

The first thing we'll need is a kind of filter that will step through an iterable...