Pandas Cookbook

By: Theodore Petrou

Overview of this book

This book will provide you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas 0.20. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands like one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through practical situations that you are highly likely to encounter. Many advanced recipes combine several different features across the pandas 0.20 library to generate results.

Working with operators on a Series

Python provides many operators for manipulating objects. Operators are not objects themselves, but rather syntactic constructs and keywords that trigger an operation on an object. For instance, when the plus operator is placed between two integers, Python adds them together. See more examples of operators in the following code:

>>> 5 + 9   # plus operator example adds 5 and 9
14

>>> 4 ** 2 # exponentiation operator raises 4 to the second power
16

>>> a = 10 # assignment operator assigns 10 to a

>>> 5 <= 9 # less than or equal to operator returns a boolean
True

Operators work on many types of objects, not just numerical data. These examples show different objects being operated on:

>>> 'abcde' + 'fg' 
'abcdefg'

>>> not (5 <= 9)
False

>>> 7 in [1, 2, 6]
False

>>> set([1,2,3]) & set([2,3,4])
{2, 3}

Visit Tutorials Point (http://bit.ly/2u5g5Io) to see a table of all the basic Python operators. Not all operators are implemented for every object. Each of these examples raises an error because the operator is not supported by the object:

>>> [1, 2, 3] - 3
TypeError: unsupported operand type(s) for -: 'list' and 'int'

>>> a = set([1,2,3])
>>> a[0]
TypeError: 'set' object does not support indexing

Series and DataFrame objects work with most of the Python operators.

Getting ready

In this recipe, a variety of operators will be applied to different Series objects to produce a new Series with completely different values.

How to do it...

  1. Select the imdb_score column as a Series:
>>> import pandas as pd
>>> movie = pd.read_csv('data/movie.csv')
>>> imdb_score = movie['imdb_score']
>>> imdb_score
0       7.9
1       7.1
2       6.8
       ...
4913    6.3
4914    6.3
4915    6.6
Name: imdb_score, Length: 4916, dtype: float64

  2. Use the plus operator to add one to each Series element:
>>> imdb_score + 1
0       8.9
1       8.1
2       7.8
       ...
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64

  3. The other basic arithmetic operators, minus (-), multiplication (*), division (/), and exponentiation (**), work similarly with scalar values. In this step, we will multiply the Series by 2.5:
>>> imdb_score * 2.5
0       19.75
1       17.75
2       17.00
        ...
4913    15.75
4914    15.75
4915    16.50
Name: imdb_score, Length: 4916, dtype: float64

  4. Python uses two consecutive forward slashes (//) for floor division and the percent sign (%) for the modulus operator, which returns the remainder after a division. Series use these the same way:
>>> imdb_score // 7
0       1.0
1       1.0
2       0.0
       ...
4913    0.0
4914    0.0
4915    0.0
Name: imdb_score, Length: 4916, dtype: float64

  5. There are six comparison operators: greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), equal to (==), and not equal to (!=). Each comparison operator converts each value in the Series to True or False based on the outcome of the condition:
>>> imdb_score > 7
0        True
1        True
2       False
        ...
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool

>>> director = movie['director_name']
>>> director == 'James Cameron'
0        True
1       False
2       False
        ...
4913    False
4914    False
4915    False
Name: director_name, Length: 4916, dtype: bool
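
The steps above describe the modulus operator without showing it in action. The following is a minimal sketch, assuming a small hand-built Series in place of the full movie dataset; the name `scores` is made up for illustration, and its values are the first three scores from the output of the first step:

```python
import pandas as pd

# Hypothetical stand-in for imdb_score: the first three scores shown in the recipe
scores = pd.Series([7.9, 7.1, 6.8], name='imdb_score')

# Floor division and modulus are both applied element by element
quotient = scores // 7
remainder = scores % 7

# quotient * 7 + remainder recovers the original values (up to float rounding)
rebuilt = quotient * 7 + remainder

print(quotient.tolist())            # [1.0, 1.0, 0.0]
print(remainder.round(1).tolist())  # remainders, rounded to hide float noise
```

Floor division and modulus are complementary: the first returns how many whole times 7 fits into each score, and the second returns what is left over.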

How it works...

All the operators used in this recipe apply the same operation to each element in the Series. In native Python, this would require a for loop to iterate through each of the items in the sequence before applying the operation. Pandas relies heavily on the NumPy library, which allows for vectorized computations, that is, the ability to operate on entire sequences of data without explicitly writing for loops. Each operation returns a Series with the same index, but with values that have been modified by the operator.
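
To make the contrast concrete, here is a short sketch that compares an explicit for loop with the vectorized expression. It assumes a three-element sample Series (the name `scores` and its values are illustrative, not the full dataset):

```python
import pandas as pd

# Hypothetical sample values standing in for imdb_score
scores = pd.Series([7.9, 7.1, 6.8])

# Native Python: an explicit for loop visits each item in turn
looped = []
for score in scores:
    looped.append(score + 1)

# pandas: a single vectorized expression, with no explicit loop
vectorized = scores + 1

# Both approaches produce the same values, and the result keeps the original index
print(looped == vectorized.tolist())
print(vectorized.index.equals(scores.index))
```

The loop returns a plain list, whereas the vectorized expression returns a new Series that carries the index of the original along with it.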

There's more...

All of the operators used in this recipe have method equivalents that produce the exact same result. For instance, in step 2, imdb_score + 1 may be reproduced with the add method. Check the following code to see the method version of each step in the recipe:

>>> imdb_score.add(1)              # imdb_score + 1
>>> imdb_score.mul(2.5)            # imdb_score * 2.5
>>> imdb_score.floordiv(7)         # imdb_score // 7
>>> imdb_score.gt(7)               # imdb_score > 7
>>> director.eq('James Cameron')   # director == 'James Cameron'
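
One way to confirm that each method matches its operator is the Series `equals` method, which compares two Series for identical values. The sketch below assumes small made-up sample Series (`scores` and `directors` are illustrative names and values, not the movie dataset):

```python
import pandas as pd

# Hypothetical sample data standing in for imdb_score and director
scores = pd.Series([7.9, 7.1, 6.8], name='imdb_score')
directors = pd.Series(['James Cameron', 'Director B', 'Director C'],
                      name='director_name')

# Each entry compares a method call with its operator counterpart
checks = {
    'add': scores.add(1).equals(scores + 1),
    'mul': scores.mul(2.5).equals(scores * 2.5),
    'floordiv': scores.floordiv(7).equals(scores // 7),
    'gt': scores.gt(7).equals(scores > 7),
    'eq': directors.eq('James Cameron').equals(directors == 'James Cameron'),
}

print(checks)
```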

Why does pandas offer a method equivalent to these operators? By its nature, an operator only operates in exactly one manner. Methods, on the other hand, can have parameters that allow you to alter their default functionality. The following table lists the method that corresponds to each operator:

Operator Group    Operator                  Series method name
Arithmetic        +, -, *, /, //, %, **     add, sub, mul, div, floordiv, mod, pow
Comparison        <, >, <=, >=, ==, !=      lt, gt, le, ge, eq, ne
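
As one illustration of a parameter that an operator cannot express, the arithmetic methods accept a `fill_value` argument that substitutes a value for missing data before the operation runs. This is a minimal sketch, assuming a made-up Series with one missing score (the name `scores` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical scores with one missing value
scores = pd.Series([7.9, float('nan'), 6.8])

# Operator: the missing value stays missing
with_operator = scores + 1

# Method: fill_value=0 treats the missing score as 0 before adding 1
with_method = scores.add(1, fill_value=0)

print(with_operator.isna().tolist())  # [False, True, False]
print(with_method.isna().tolist())    # [False, False, False]
```

Whether filling with 0 is appropriate depends on the analysis; the point is only that the method exposes a choice the plus operator does not.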

You may be curious as to how a pandas Series object, or any object for that matter, knows what to do when it encounters an operator. For example, how does the expression imdb_score * 2.5 know to multiply each element in the Series by 2.5? Python has a built-in, standardized way for objects to communicate with operators using special methods.

Special methods are what Python calls on an object whenever that object is used with an operator. Special methods are defined in the Python data model, a very important part of the official documentation, and follow the same naming scheme for every object throughout the language. Special methods always begin and end with two underscores. For instance, the special method __mul__ is called whenever the multiplication operator is used. Python interprets the imdb_score * 2.5 expression as imdb_score.__mul__(2.5).

For a Series, using the operator and calling the special method directly produce the same result. The operator is just syntactic sugar for the special method.
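
To see the mechanism in isolation, here is a small pure-Python sketch of a class that defines `__mul__`. The `Scores` class is a hypothetical toy written for this illustration; it is not part of pandas and is not how pandas implements Series multiplication internally:

```python
class Scores:
    """Toy container (hypothetical, not part of pandas) that supports `*`."""

    def __init__(self, values):
        self.values = list(values)

    def __mul__(self, factor):
        # Python calls this method when it evaluates `scores * factor`
        return Scores(value * factor for value in self.values)


scores = Scores([2, 4, 6])

via_operator = scores * 2.5        # operator syntax
via_method = scores.__mul__(2.5)   # the special method called directly

print(via_operator.values)  # [5.0, 10.0, 15.0]
print(via_method.values)    # [5.0, 10.0, 15.0]
```

Because the class defines `__mul__`, the multiplication operator works on its instances. Removing that method would make `scores * 2.5` raise a TypeError, much like the unsupported list and set operations shown earlier.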

See also