Python for Secret Agents - Volume II - Second Edition

By: Steven F. Lott

Overview of this book

Python is an easy-to-learn, extensible programming language that allows any manner of secret agent to work with a variety of data. Agents from beginners to seasoned veterans will benefit from Python's simplicity and sophistication. The standard library provides numerous packages that move beyond simple beginner missions. The Python ecosystem of related packages and libraries supports deep information processing. This book will guide you through the process of upgrading your Python-based toolset for intelligence gathering, analysis, and communication. You'll explore the ways Python is used to analyze web logs to discover the trails of activities that can be found in web and database servers. We'll also look at how we can use Python to discover details of the social network by looking at the data available from social networking websites. Finally, you'll see how to extract history from PDF files, which opens up new sources of data, and you'll learn about the ways you can gather data using an Arduino-based sensor device.

Background briefing: review of the Python language


Before moving on to our first mission, we'll review some essentials of the Python language, and the ways in which we'll use it to gather and disseminate data. We'll start by reviewing the interactive use of Python to do some data manipulation. Then we'll look at statements and script files.

When we start Python from the Terminal tool or the command line, we'll see an interaction that starts as shown in the following:

MacBookPro-SLott:Code slott$ python3.4
Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

The >>> prompt shows that Python's read-eval-print loop (REPL) is waiting for us to enter a statement. If we use Python's development environment, IDLE, we'll also see this >>> prompt.

One of the simplest kinds of statements is a single expression. We can, for example, enter an arithmetic expression. The REPL will print the result automatically. Here's an example of simple math:

>>> 355/113
3.1415929203539825

We entered an expression statement and Python printed the resulting object. This gives us a way to explore the language. We can enter things and see the results, allowing us to experiment with new concepts.

Python offers us a number of different types of objects to work with. The first example showed integer objects, 355 and 113, as well as a floating-point result object, 3.1415929203539825.

In addition to integers and floats, we also have complex numbers. With the standard library, we can introduce decimal and fraction values using the decimal or fractions modules. Python can coerce values between the various types. If we have mixed values on either side of an operator, one of the values will be pushed up the numeric tower so that both operands have the same type. This means that integers can be promoted up to float and float can be promoted up to complex if necessary.
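The promotion rules can be seen directly by mixing types in an expression. The following is a minimal sketch; the variable names are only illustrative and don't come from the book:

```python
from decimal import Decimal
from fractions import Fraction

# Mixed operands are promoted up the numeric tower: int -> float -> complex.
int_plus_float = 2 + 0.5          # the int is promoted to float
float_plus_complex = 2.5 + 1j     # the float is promoted to complex

# Fraction and Decimal come from the standard library, not the core language.
third = Fraction(1, 3)
exact_sum = third + third + third  # stays a Fraction, with no rounding
price = Decimal("0.10") + Decimal("0.20")

print(type(int_plus_float).__name__, int_plus_float)
print(type(float_plus_complex).__name__, float_plus_complex)
print(exact_sum, price)
```

Note that the Decimal values are built from strings; building them from float literals would carry the float's approximation into the decimal value.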

Python gives us a variety of operators. The common arithmetic operators are +, -, *, /, //, %, and **. These implement addition, subtraction, multiplication, true division, floor division, modulus, and raising to a power. The true division, /, will coerce integers to floating-point so that the fractional part of the answer isn't lost. The floor division, //, provides rounded-down answers, even with floating-point operands.
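Here is a short sketch that contrasts the division operators. The values 17 and 5 are arbitrary choices for illustration:

```python
n, d = 17, 5

true_quotient = n / d        # true division always gives a float
floor_quotient = n // d      # floor division rounds down
remainder = n % d            # modulus is the leftover after floor division
power = n ** 2               # raising to a power

float_floor = 17.0 // 5      # still rounded down, but the result is a float
negative_floor = -17 // 5    # rounds toward negative infinity, not toward zero

print(true_quotient, floor_quotient, remainder, power)
print(float_floor, negative_floor)
```

The floor division and modulus results always fit together: floor_quotient * d + remainder gives back n.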

We also have some bit-fiddling operators: ~, &, |, ^, <<, and >>. These implement unary bitwise inversion, and, or, exclusive or, shift left, and shift right. These work with individual bits in a number. They're not logical operators at all.
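To see why these aren't logical operators, we can compare them with and on the same operands. This is a small sketch; the names flags and mask are made up for the example:

```python
flags = 0b1100   # 12
mask = 0b1010    # 10

both = flags & mask        # bits set in both operands
either = flags | mask      # bits set in at least one operand
differ = flags ^ mask      # bits set in exactly one operand
inverted = ~flags          # for integers, ~x is the same as -x - 1
shifted_left = flags << 2
shifted_right = flags >> 2

# The logical operator looks at whole values, not individual bits.
logical = flags and mask   # flags is non-zero, so the result is mask

print(bin(both), bin(either), bin(differ))
print(inverted, shifted_left, shifted_right, logical)
```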

What about more advanced math? We'll need to import libraries if we need more sophisticated features. For example, if we need to compute a square root, we'll need to import the math module, as follows:

>>> import math
>>> p = math.sqrt(7 + math.sqrt(6 + math.sqrt(5)))

Importing the math module creates a new object, math. This object is a kind of namespace that contains useful functions and constants. We'll use this import technique frequently to add features that we need to create useful software.

Using variables to save results

We can put a label on an object using the assignment statement. We often describe this as assigning an object to a variable; however, it's more like assigning a symbolic label to an object. The variable name (or label) must follow a specific set of syntax rules. It has to begin with a letter or a _ character and can include any combination of letters, digits, and _ characters. We'll often use simple words such as x, n, samples, and data. We can use longer_names where this adds clarity.

Using variables allows us to build up results in steps by assigning names to intermediate results. Here's an example:

>>> n = 355
>>> d = 113
>>> r = n/d
>>> result = "ratio: {0:.6f}".format(r)
>>> result
'ratio: 3.141593'

We assigned the n name to the 355 integer; then we assigned the d name to the 113 integer. Then we assigned the ratio to another variable, r.

We used the format() method for strings to create a new string that we assigned to the variable named result. The format() method starts with a format string and replaces each {} with a formatted version of an argument value. Inside the {}, we requested item 0 from the collection of arguments. Since Python's indexes always start from zero, this will be the first argument value. We used a format specification of .6f to show a floating-point value (f) with six digits to the right of the decimal point (.6). This formatted number was interpolated into the overall string and the resulting string was given the name result.

The last expression in the sequence of statements, result, is very simple. The result of this trivial expression is the value of the variable. It's a string that the REPL prints for us. We can use a similar technique to print the values of intermediate results such as the r variable. We'll often make heavy use of intermediate variables in order to expose the details of a calculation.
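The format() method accepts more than one argument and also allows named fields. The sketch below shows a few variations on the same ratio; the label and value field names are illustrative choices:

```python
ratio = 355 / 113

# Positional field 0 with six digits after the decimal point.
by_position = "ratio: {0:.6f}".format(ratio)

# Two positional arguments with different precisions.
two_fields = "{0:.2f} vs {1:.4f}".format(ratio, ratio)

# Named fields; 8.3f pads the number to a total width of eight characters.
by_name = "{label}: {value:8.3f}".format(label="pi", value=ratio)

print(by_position)
print(two_fields)
print(by_name)
```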

Using the sequence collections: strings

Python strings are a sequence of Unicode characters. We have a variety of quoting rules for strings. Here are two examples:

>>> 'String with " inside'
'String with " inside'
>>> "String's methods"
"String's methods"

We can use either quotes or apostrophes to delimit a string. In the event that a string contains both quotes and apostrophes, we can use \' or \" to embed the punctuation; this is called an escape sequence. The initial \ escapes from the normal meaning of the next character. The following is an example showing the complicated quotes and escapes:

>>> "I said, \"Don't touch.\""
'I said, "Don\'t touch."'

We used one set of quotes to enter the string. We used the escaped quotes in the string. Python responded with its preferred syntax; the canonical form for a string will generally use apostrophes to delimit the string overall.

Another kind of string that we'll encounter frequently is a byte string. Unlike a normal string that uses all the available Unicode characters, a byte string is limited to single-byte values. These can be shown using hexadecimal numeric codes or, for the byte values that correspond to printable ASCII characters, as an ASCII character instead of a numeric value.

Here are two examples of byte strings:

>>> b'\x08\x09\x0a\x0c\x0d\x0e\x0f'
b'\x08\t\n\x0c\r\x0e\x0f'
>>> b'\x41\x53\x43\x49\x49'
b'ASCII'

In the first example, we provided hexadecimal values using the \xnn syntax for each byte. The prefix of \x means that the following value will be in base 16. We write base 16 values using the digits 0-9 along with the letters a-f. We provided seven values from \x08 to \x0f. Python replies using a canonical notation; our input follows more relaxed rules than those of Python's output. The canonical syntax is different for three important byte values: the tab character, \x09, is shown as \t; the newline character, \x0a, is shown as \n; and the carriage return character, \x0d, is shown as \r. These shorter escapes can be used for input as well.

In the second example, we also provided some hexadecimal values that overlap with some of the ASCII characters. Python's canonical form shows the ASCII characters instead of the hexadecimal values. This demonstrates that, for some byte values, ASCII characters are a handy shorthand.

In some applications, we'll have trouble telling a Unicode string, 'hello', from a byte string, b'hello'. We can add a u prefix, as in u'hello', in order to clearly state that this is a string of Unicode characters and not a string of bytes.
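The two kinds of strings are connected by encoding and decoding. The following sketch assumes UTF-8, which is a common choice but not the only possible encoding:

```python
text = "caf\u00e9"             # four Unicode characters, ending in an accented e
data = text.encode("utf-8")    # five bytes: the accented letter needs two bytes

round_trip = data.decode("utf-8")

# A Unicode string and a byte string are never equal, even if they look alike.
same = ("hello" == b"hello")

print(text, len(text))
print(data, len(data))
print(round_trip == text, same)
```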

As a string is a sequence of individual Unicode characters, we can extract characters from a string using their index positions. Here are a number of examples:

>>> word = 'retackling'
>>> word[0]
'r'
>>> word[-1]
'g'
>>> word[2:6]
'tack'
>>> word[-3:]
'ing'

We've created a string, which is a sequence object. Sequence objects have items that can be addressed by their position or index. In position 0, we see the first item in the sequence, the 'r' character.

Sequences can also be indexed from the right to left using negative numbers. Position -1 is the last (rightmost) item in a sequence. Index position -2 is next-to-rightmost.

We can also extract a slice from a sequence. This is a new sequence that is copied from the original sequence. When we take items in positions 2 to 6, we get four characters with index values 2, 3, 4, and 5. Note that a slice includes the first position and never includes the last specified position; it's an up-to-but-not-including rule. Mathematicians call it a half-open interval and write it as [2, 6) or sometimes [2, 6[. We can use the following set comprehension rule to understand how the interval works:

$[2, 6) = \{\, i \mid 2 \le i < 6 \,\}$
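The half-open rule can also be written as a Python set comprehension and checked against a slice. This is a sketch; the names start, stop, and positions are chosen for the example:

```python
word = 'retackling'
start, stop = 2, 6

# The index positions covered by the half-open interval [2, 6).
positions = {i for i in range(len(word)) if start <= i < stop}

piece = word[start:stop]
rebuilt = ''.join(word[i] for i in sorted(positions))

# Splitting at a single position never loses or duplicates a character.
head, tail = word[:stop], word[stop:]

print(sorted(positions), piece, rebuilt, head + tail)
```

One benefit of the half-open rule is that the length of a slice is simply stop - start.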

All of the sequence collections allow us to count occurrences of an item and locate the index of an item. The following are some examples that show the method syntax and the two universal methods that apply to sequences:

>>> word.count('a')
1
>>> word.index('t')
2
>>> word.index('z') 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: substring not found

We've counted the number of items that match a particular value. We've also asked for the position of a given letter. This returns a numeric value for the index of the item equal to 't'.

String sequences have dozens of other methods to create new strings in various ways. We can do a large number of sophisticated manipulations.

Note that a string is an immutable object. We can't replace a character in a string. We can only build new strings from the old strings.
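A quick sketch shows what immutability means in practice. The replacement text 'pack' is just an arbitrary example value:

```python
word = 'retackling'

# Item assignment is rejected because a string can't be changed in place.
try:
    word[0] = 'R'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# String methods build new strings and leave the original untouched.
capitalized = word.capitalize()
replaced = word.replace('tack', 'pack')

print(assignment_failed, word, capitalized, replaced)
```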

Using other common sequences: tuples and lists

We can create two other common kinds of sequences: the list and the tuple. A tuple is a fixed-length sequence of items. We often use tuples for simple structures such as pairs (latitude, longitude) or triples (r, g, b). We write a literal tuple by enclosing the items in ()s. It looks as shown in the following:

>>> ultramarine_blue = (63, 38, 191)

We've created a three-tuple or triple with some RGB values that comprise a color.

Python's assignment statement can tease a tuple into its individual items. Here's an example:

>>> red, green, blue = ultramarine_blue
>>> red
63
>>> blue
191

This multiple-variable assignment works well with tuples as a tuple has a fixed size. We can also address individual items of a tuple with expressions such as ultramarine_blue[0]. Slicing a tuple is perfectly legal, but it's semantically a little murky: why would we use ultramarine_blue[:2] to create a pair from the red and green channels?

A list is a variable-length sequence of items. This is a mutable object and we can insert, append, remove, and replace items in the list. This is one of the profound differences between the tuple and list sequences. A tuple is immutable; once we've built it, we can't change it. A list is mutable.
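The difference shows up as soon as we try to change an item. The sketch below converts a tuple to a list to edit it; the fourth value is a hypothetical alpha channel added only for illustration:

```python
ultramarine_blue = (63, 38, 191)

# A tuple rejects item assignment.
try:
    ultramarine_blue[0] = 64
    tuple_changed = True
except TypeError:
    tuple_changed = False

# A list built from the same items can be changed freely.
channels = list(ultramarine_blue)
channels[0] = 64          # replace an item
channels.append(255)      # hypothetical alpha value, appended at the end
channels.remove(38)       # remove an item by value

adjusted = tuple(channels)  # freeze the result into a new tuple

print(tuple_changed, ultramarine_blue, channels, adjusted)
```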

The following is an example of a list that we can tweak in order to correct the errors in the data:

>>> samples = [8.04, 6.95, 0, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82]
>>> samples[2] = 7.58
>>> samples.append(5.68)
>>> samples
[8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68] 
>>> sum(samples)
82.51000000000002
>>> round(sum(samples)/len(samples), 2)
7.5

We've created a list object, samples, and initialized it with 10 values. We've set the value at index 2, replacing the zero item with 7.58. We've appended an item at the end of the list.

We've also shown two handy functions that apply to all sequences. However, they're particularly useful for lists. The sum() function adds up the values, reducing the list to a single value. The len() function counts the items, also reducing the list to a single value.

Note the awkward value shown for the sum; this is an important feature of floating-point numbers. In order to be really fast, they're finite. As they have a limited number of bits, they're only an approximation. Therefore, sometimes, we'll see some consequences of working with approximations.

Tip

Floating-point numbers aren't mathematical abstractions.

They're finite approximations. Sometimes, you'll see tiny error values.
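When tiny errors matter, the standard library offers some help. This sketch uses math.fsum(), which tracks the lost low-order bits while adding, and math.isclose(), which assumes Python 3.5 or later and so is newer than the Python 3.4 shown earlier:

```python
import math

samples = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

naive_total = sum(samples)          # accumulates a small rounding error
careful_total = math.fsum(samples)  # compensates for the error while adding

# Comparing floats with == is fragile; compare within a tolerance instead.
tenths_equal = (0.1 + 0.2 == 0.3)
tenths_close = math.isclose(0.1 + 0.2, 0.3)  # requires Python 3.5+

print(naive_total, careful_total)
print(tenths_equal, tenths_close)
```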

One other interesting operator for sequences is the in comparison:

>>> 7.24 in samples
True

This checks whether a given item is found somewhere in the sequence. If we want the index of a given item, we can use the index method:

>>> samples.index(7.24)
6

Using the dictionary mapping

The general idea of mapping is the association between keys and values. We might have a key of 'ultramarine blue' associated with a value of the tuple, (63, 38, 191). We might have a key of 'sunset orange' associated with a tuple of (254, 76, 64). We can represent this mapping of string-to-tuple with a Python dictionary object, as follows:

>>> colors = {'ultramarine blue': (63, 38, 191), 'sunset orange': (254, 76, 64) }

We've replaced the words "associated with" with : and wrapped the whole in {}s in order to create a proper dictionary. This is a mapping from color strings to RGB tuples.

A dictionary is mutable; we can add new key-value pairs and remove key-value mappings from it. Of course, we can interrogate a dictionary to see what keys are present and what value is associated with a key.

>>> colors['olive green'] = (181, 179, 92)
>>> colors.pop('sunset orange')
(254, 76, 64)
>>> colors['ultramarine blue']
(63, 38, 191)
>>> 'asparagus' in colors
False

The same assignment syntax will replace the value of an existing key in a dictionary. We can pop a key from the dictionary; this will both update the dictionary to remove the key-value pair and return the value associated with the key. When we use syntax such as colors['ultramarine blue'], we'll retrieve the value associated with a given key.

The in operator checks to see whether the given item is one of the keys of the mapping. In our example, we didn't provide a mapping for the name 'asparagus'.

We can retrieve the keys, the values, and the key value pairs from a mapping with methods of the class:

>>> sorted(colors.items())
[('olive green', (181, 179, 92)), ('ultramarine blue', (63, 38, 191))]

The keys() method returns the keys in the mapping. The values() method returns only the values. The items() method returns two-tuples; each tuple is a key, value pair. In Python 3, each of these methods returns a view object rather than a list. We've applied the sorted() function in this example, as a dictionary in the Python 3.4 release shown here doesn't guarantee any particular order for the keys; insertion order is only guaranteed from Python 3.7 onward. In many cases, we don't particularly care about the order. In the cases where we need to enforce the order, this is a common technique.
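The three methods can be compared side by side. This sketch rebuilds the colors mapping so that it's self-contained, and also shows the get() method, which the text above doesn't cover, as one way to handle a missing key:

```python
colors = {
    'ultramarine blue': (63, 38, 191),
    'olive green': (181, 179, 92),
}

names = sorted(colors.keys())         # just the keys, in sorted order
rgb_values = sorted(colors.values())  # just the values, in sorted order

# Each item is a (key, value) pair that can be unpacked in a for statement.
for name, (red, green, blue) in sorted(colors.items()):
    print("{0}: r={1} g={2} b={3}".format(name, red, green, blue))

# get() returns None (or a supplied default) instead of raising KeyError.
missing = colors.get('asparagus')
fallback = colors.get('asparagus', (0, 0, 0))

print(names, rgb_values, missing, fallback)
```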

Comparing data and using the logic operators

Python implements a number of comparisons. We have the usual ==, !=, <=, >=, <, and > operators. These provide the essential comparison capabilities. The result of a comparison is a boolean object, either True or False.

The boolean objects have their own special logic operators: and, or, and not. These operators can short-circuit the expression evaluation. In the case of and, if the left-hand side expression is False, the final result must be False; therefore, the right-hand side expression is not evaluated. In the case of or, the rules are reversed. If the left-hand side expression is True, the final result is already known to be True, so the right-hand side expression is skipped.

For example, take two variables, sum and count, as follows:

>>> sum
82.51
>>> count
11
>>> mean = count != 0 and sum/count

Let's look closely at the final expression. The left-hand side expression of the and operator is count != 0, which is True. Therefore, the right-hand side expression must be evaluated. Interestingly, the right-hand side object is the final result. A numeric value of about 7.5 is the value of the mean variable.

The following is another example to show how the and operator behaves:

>>> sum
0.0
>>> count
0
>>> mean = count != 0 and sum/count

What happens here? The left-hand side expression of the and operator is count != 0, which is False. The right-hand side is not evaluated, so no ZeroDivisionError exception is raised. The final result is False.
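Both behaviors can be wrapped up and compared. In the sketch below, guarded_mean() and mean_or_none() are hypothetical helper names; the second one uses a conditional expression, an alternative that isn't part of the original example, to return None rather than False when there is no data:

```python
def guarded_mean(total, count):
    """Short-circuit form: returns False when count is zero."""
    return count != 0 and total / count


def mean_or_none(total, count):
    """Conditional-expression form: returns None when count is zero."""
    return total / count if count != 0 else None


with_data = guarded_mean(82.51, 11)
without_data = guarded_mean(0.0, 0)   # the division is never attempted
explicit_none = mean_or_none(0.0, 0)

print(round(with_data, 2), without_data, explicit_none)
```

Returning None can be easier to work with later on, since False compares equal to 0 and could be mistaken for a genuine mean of zero.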

Using some simple statements

All of the preceding examples focused on one-line expression statements. We entered an expression at the REPL, Python evaluated the expression, and the REPL helpfully printed the resulting value. While the expression statement is handy for experiments at the REPL prompt, there's one expression statement that agents use a lot, as shown in the following:

>>> print("Hello \N{EARTH}")
Hello ♁

The print() function prints the results on the console. We provided a string with a Unicode character that's not directly available on most keyboards: the EARTH character, ♁, U+2641, which looks different in different fonts.

We'll need the print() function as soon as we stop using interactive Python. Our scripts won't show any results unless we print them.

The other side of printing is the input() function. This will present a prompt and then read a string of input that is typed by a user at the console. We'll leave it to the interested agent to explore the details of how this works.

We'll need more kinds of imperative statements to get any real work done. We've shown two forms of the assignment statement; both will put a label on an object. The following are two examples that put a label on an object:

>>> n, d = 355, 113
>>> pi = n/d

The first assignment statement evaluated the 355, 113 expression and created a tuple object from two integer objects. In some contexts, the surrounding ()s for a tuple are optional; this is one of those contexts. Then, we used multiple assignment to decompose the tuple into its two items and put a label on each object.

The second assignment statement follows the same pattern. The n/d expression is evaluated. It uses true division to create a floating-point result from integer operands. The resulting object has the name pi applied to it by the assignment statement.

Using compound statements for conditions: if

For conditional processing, we use the if statement. Python allows an unlimited number of else-if (elif) clauses, allowing us to build rather complex logic very easily.

For example, here's a statement that determines whether a value, n, is divisible by three, or five, or both:

>>> if n % 3 == 0 and n % 5 == 0:
...     print("fizz-buzz")
... elif n % 3 == 0:
...     print("fizz")
... elif n % 5 == 0:
...     print("buzz")
... else:
...     print(n)

We've written three Boolean expressions. The if statement will evaluate these in top-to-bottom order. If the value of the n variable is divisible by both three and five, the first condition is True and the indented suite of statements is executed. In this example, the indented suite of statements is a single expression statement that uses the print() function.

If the first expression is False, then the elif clauses are examined in order. If none of the elif clauses are true, the indented suite of statements in the else clause is executed.

Remember that the and operator has a short-circuit capability. The first expression may involve as little as evaluating n % 3 == 0. If this subexpression is False, the entire and expression must be False; this means that the suite of the if clause is not executed. Otherwise, the entire expression must be evaluated.

Notice that Python changes the prompt from >>> at the start of a compound statement to ... to show that more of the statement can be entered. This is a helpful hint. We indent each suite of statements in a clause. We enter a blank line in order to show we're at the very end of the compound statement.

Tip

This longer statement shows us an important syntax rule:

Compound statements rely on indentation. Indent consistently. Use four spaces.

The individual if and elif clauses are separated based on their indentation level. The keywords such as if, elif, and else are not indented. The suite of statements in each clause is indented consistently.
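The same logic can be placed in a script-style form where the indentation levels are easy to see. This sketch wraps the statement in a function (function definitions are covered later in this chapter) and returns strings instead of printing; the fizz_buzz name is an illustrative choice:

```python
def fizz_buzz(n):
    """Classify n by its divisibility by three and five."""
    if n % 3 == 0 and n % 5 == 0:
        # The most specific condition has to come first.
        return "fizz-buzz"
    elif n % 3 == 0:
        return "fizz"
    elif n % 5 == 0:
        return "buzz"
    else:
        return str(n)


results = [fizz_buzz(n) for n in (3, 5, 15, 7)]
print(results)
```

If the first two conditions were swapped, 15 would be reported as "fizz", because the clauses are examined in order and the first true condition wins.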

Using compound statements for repetition: for and while

When we want to process all the items in a list or the lines in a file, we're going to use the for statement. The for statement allows us to specify a target variable, a source collection of values, and a suite of statements. The idea is that each item from the source collection is assigned to the target variable and the suite of statements is executed.

The following is a complete example that computes the variance of some measurements:

>>> samples = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
>>> sum, sum2 = 0, 0
>>> for x in samples:
...     sum += x
...     sum2 += x**2
>>> n = len(samples)
>>> var = (sum2-(sum**2/n))/(n-1)

We've started with a list of values, assigned to the samples variable, plus two other variables, sum and sum2, to which we've assigned initial values of 0.

The for statement will iterate through the items in the samples list. Each item will be assigned to the target variable, x, and then the indented body of the for statement is executed. We've written two assignment statements that will compute the new values for sum and sum2. These use the augmented assignment statement; using += saves us from writing sum = sum + x.

After the for statement, we are assured that the body has been executed for all values in the source object, samples. We can save the count of the samples in a handy local variable, n. This makes the calculation of the variance slightly more clear. In this example, the variance is about 4.13.

The result is a number that shows how spread out the raw data is. The square root of the variance is the standard deviation. For roughly normally distributed data, we expect about two-thirds of our data points to lie within one standard deviation of the average. We often use variance when comparing two data sets. When we get additional data, perhaps from a different agent, we can compare the averages and variances to see whether the data is similar. If the variances aren't the same, this may reflect that there are different sources and possibly indicate that we shouldn't trust either of the agents that are supplying us this raw data. If the variances are identical, we have another question: are we being fed false information?
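The hand-written calculation can be cross-checked against the standard library's statistics module, which was added in Python 3.4. This is a sketch; the names total and total_sq are used here instead of sum and sum2 so that the built-in sum() function stays available:

```python
import math
import statistics

samples = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(samples)
total = sum(samples)
total_sq = sum(x ** 2 for x in samples)

# The same formula as the for statement above, written with sum().
var = (total_sq - total ** 2 / n) / (n - 1)
library_var = statistics.variance(samples)  # sample variance, divides by n-1

average = total / n
std_dev = math.sqrt(var)

# Count the points that lie within one standard deviation of the average.
nearby = [x for x in samples if abs(x - average) <= std_dev]

print(round(var, 2), round(library_var, 2), round(std_dev, 2))
print(len(nearby), "of", n, "samples are within one standard deviation")
```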

The most common use of the for statement is to visit each item in a collection. A slightly less common use is to iterate a finite number of times. We use a range() object to emit a simple sequence of integer values, as follows:

>>> list(range(5))
[0, 1, 2, 3, 4]

This means that we can use a statement such as for i in range(n): in order to iterate n times.
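A range() object also accepts a starting value and a step. The following sketch shows a few variations; the variable names are illustrative:

```python
# Iterate a fixed number of times; the loop variable counts from zero.
iterations = 0
for i in range(5):
    iterations += 1

one_to_five = list(range(1, 6))       # start at 1, stop before 6
evens = list(range(0, 10, 2))         # step by 2
countdown = list(range(5, 0, -1))     # a negative step counts down

print(iterations, one_to_five, evens, countdown)
```

Like a slice, a range includes its starting value and stops just before its ending value.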

Defining functions

It's often important to decompose large, complex data acquisition and analysis problems into smaller, more solvable problems. Python gives us a variety of ways to organize our software. We have a tall hierarchy that includes packages, modules, classes, and functions. We'll start with function definitions as a way to decompose and reuse functionality. The later missions will require class definitions.

A function—mathematically—is a mapping from objects in a domain to objects in a range. Many mathematical examples map numbers to different numbers. For example, the arctangent function, available as math.atan(), maps a tangent value to the angle that has this tangent value. In many cases, we'll need to use math.atan2(), as our tangent value is a ratio of the lengths of two sides of a triangle; this function maps a pair of values to a single result.

In Python terms, a function has a name and a collection of parameters, and it may return a distinct value. If we don't explicitly return a resulting value, the function returns the special None object.

Here's a handy function to average the values in a sequence:

>>> def mean(data):
...     if len(data) == 0:
...         return None
...     return sum(data)/len(data)

This function expects a single parameter, a sequence of values to average. When we evaluate the function, the argument value will be assigned to the data parameter. If the sequence is empty, we'll return the special None object in order to indicate that there's no average when there's no data.

If the sequence isn't empty, we'll divide the sum by the count to compute the average. Since we're using true division, this will return a floating-point value even if the sequence is all integers.

The following is how it looks when we use our newly minted function combined with built-in functions:

>>> samples = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
>>> round(mean(samples), 2)
7.5

We've computed the mean of the values in the samples variable using our mean() function. We've applied the round() function to the resulting value to show the mean rounded to two decimal places.
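Because mean() can return None, code that uses it has to check for that case before formatting or rounding the result. The sketch below repeats the function in script form and adds a hypothetical describe_mean() helper to show one way of handling it:

```python
def mean(data):
    """Return the average of a sequence, or None if it's empty."""
    if len(data) == 0:
        return None
    return sum(data) / len(data)


def describe_mean(data):
    """Hypothetical helper: format the mean, allowing for no data."""
    result = mean(data)
    if result is None:
        return "no data"
    return "{0:.2f}".format(result)


samples = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

print(describe_mean(samples))
print(describe_mean([]))
print(mean([1, 2, 3, 4]))   # integers in, float out
```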

Creating script files

We shouldn't try to do all of our data gathering and analysis by entering Python code interactively at the >>> prompt. It's possible to work this way; however, the copying and pasting is tedious and error-prone. It's much better to create a Python script that will gather, analyze, and display useful intelligence assets that we've gathered (or purchased).

A Python script is a file of Python statements. While it's not required, it's helpful to be sure that the file's name is a valid Python symbol that is created with letters, numbers, and _'s. It's also helpful if the file's name ends with .py.

Here's a simple script file that shows some of the features that we've been looking at:

import random, math
samples = int(input("How many samples: "))
inside = 0
for i in range(samples):
    if math.hypot(random.random(), random.random()) <= 1.0:
        inside += 1
print(inside, samples, inside/samples, math.pi/4)

This script file can be given a name such as example1.py. The script will use the input() function to prompt the user for a number of random samples. Since the result of input() is a string, we'll need to convert the string to an integer in order to be able to use it. We've initialized a variable, inside, to zero.

The for statement will execute the indented body the number of times given by the value of samples. The range() object will generate samples distinct integer values. In the for statement, we've used an if statement to filter some randomly generated values. The values we're examining are the result of math.hypot(random.random(), random.random()). What is this value? It's the hypotenuse of a right-angled triangle with sides that are selected randomly. We'll leave it to each field agent to rewrite this script in order to assign and print some intermediate variables to show precisely how this calculation works.

We're looking at a triangle with one vertex at (0,0) and another at (x,y). The third vertex could be at either (0,y) or (x,0); the results don't depend on how we visualize the triangle. Since the triangle sides are selected randomly, the end point of the hypotenuse can be anywhere from (0,0) to (1,1); the length of this varies between 0 and $\sqrt{2}$.

Statistically, we expect that most of the points should lie in a circle with a radius of one. How many should lie in this quarter circle? Interestingly, the random distribution will have $\pi/4$ of the samples in the circle; $1 - \pi/4$ will be outside the circle.
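One possible rewrite of the script with named intermediate values is sketched below. It's wrapped in a hypothetical quarter_circle_fraction() function and uses a seeded random.Random generator in place of input() so that the run is repeatable; the seed value of 42 is an arbitrary assumption:

```python
import math
import random


def quarter_circle_fraction(samples, seed=42):
    """Estimate the fraction of random points inside the quarter circle."""
    rng = random.Random(seed)   # seeded so repeated runs give the same answer
    inside = 0
    for i in range(samples):
        x = rng.random()              # one side of the triangle
        y = rng.random()              # the other side
        distance = math.hypot(x, y)   # hypotenuse: distance from (0,0) to (x,y)
        if distance <= 1.0:
            inside += 1
    return inside / samples


fraction = quarter_circle_fraction(100000)
print(fraction, math.pi / 4, 4 * fraction)
```

With more samples, the fraction should settle closer to math.pi/4, so four times the fraction gives a rough estimate of pi.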

When working in counterintelligence, the data that we're providing needs to be plausible. If we're going to mislead, our fake data needs to fit the basic statistical rules. A careful study of history will show how Operation Mincemeat was used to deceive Axis powers during World War II. What's central to this story is the plausibility of every nuance of the data that is supplied.