Pandas 1.x Cookbook - Second Edition

By : Matt Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.

Series operations

There exist a vast number of operators in Python for manipulating objects. For instance, when the plus operator is placed between two integers, Python will add them together:

>>> 5 + 9  # plus operator example. Adds 5 and 9
14

Series and DataFrames support many of the Python operators. Typically, a new Series or DataFrame is returned when using an operator.

In this recipe, a variety of operators will be applied to different Series objects to produce a new Series with completely different values.

How to do it…

  1. Select the imdb_score column as a Series:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> imdb_score = movies["imdb_score"]
    >>> imdb_score
    0       7.9
    1       7.1
    2       6.8
    3       8.5
    4       7.1
           ... 
    4911    7.7
    4912    7.5
    4913    6.3
    4914    6.3
    4915    6.6
    Name: imdb_score, Length: 4916, dtype: float64
    
  2. Use the plus operator to add one to each Series element:
    >>> imdb_score + 1
    0       8.9
    1       8.1
    2       7.8
    3       9.5
    4       8.1
           ... 
    4911    8.7
    4912    8.5
    4913    7.3
    4914    7.3
    4915    7.6
    Name: imdb_score, Length: 4916, dtype: float64
    
  3. The other basic arithmetic operators, minus (-), multiplication (*), division (/), and exponentiation (**) work similarly with scalar values. In this step, we will multiply the series by 2.5:
    >>> imdb_score * 2.5
    0       19.75
    1       17.75
    2       17.00
    3       21.25
    4       17.75
            ...  
    4911    19.25
    4912    18.75
    4913    15.75
    4914    15.75
    4915    16.50
    Name: imdb_score, Length: 4916, dtype: float64
    
  4. Python uses a double slash (//) for floor division. The floor division operator divides and then rounds the result down to the nearest whole number (toward negative infinity, which differs from truncation for negative values). The percent sign (%) is the modulus operator, which returns the remainder after a division. Series instances also support these operations:
    >>> imdb_score // 7
    0       1.0
    1       1.0
    2       0.0
    3       1.0
    4       1.0
           ... 
    4911    1.0
    4912    1.0
    4913    0.0
    4914    0.0
    4915    0.0
    Name: imdb_score, Length: 4916, dtype: float64
    
  5. There are six comparison operators: greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), equal to (==), and not equal to (!=). Each comparison operator evaluates the condition for every value in the Series and returns True or False for it. The result is a Boolean Series, which we will see is very useful for filtering in later recipes:
    >>> imdb_score > 7
    0        True
    1        True
    2       False
    3        True
    4        True
            ...  
    4911     True
    4912     True
    4913    False
    4914    False
    4915    False
    Name: imdb_score, Length: 4916, dtype: bool
    >>> director = movies["director_name"]
    >>> director == "James Cameron"
    0        True
    1       False
    2       False
    3       False
    4       False
            ...  
    4911    False
    4912    False
    4913    False
    4914    False
    4915    False
    Name: director_name, Length: 4916, dtype: bool
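
Step 4 describes the modulus operator but only shows floor division. The following is a minimal sketch of both on a small, made-up Series (the `scores` values are hypothetical stand-ins for `imdb_score`, chosen so the results are exact and so that one negative value shows the rounding direction):

```python
import pandas as pd

# Hypothetical values; the recipe itself uses movies["imdb_score"].
scores = pd.Series([7.5, 6.5, -1.5])

# Floor division rounds down, so -1.5 // 7 gives -1.0 rather than 0.0.
floored = scores // 7

# Modulus returns the remainder; its sign follows the divisor.
remainder = scores % 7

print(floored.tolist())    # [1.0, 0.0, -1.0]
print(remainder.tolist())  # [0.5, 6.5, 5.5]
```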
    

How it works…

All the operators used in this recipe apply the same operation to each element in the Series. In native Python, this would require a for loop to iterate through each of the items in the sequence before applying the operation. pandas relies heavily on the NumPy library, which allows for vectorized computations, or the ability to operate on entire sequences of data without the explicit writing of for loops. Each operation returns a new Series with the same index, but with the new values.
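
To make the contrast concrete, here is a small sketch that performs the same addition with an explicit Python loop and with a vectorized operator. The `scores` Series and its labels are hypothetical, not taken from the movie dataset:

```python
import pandas as pd

# Hypothetical data with a custom index so the index behavior is visible.
scores = pd.Series([7.5, 6.5, 8.0], index=["a", "b", "c"])

# Plain Python: visit each element explicitly.
looped = [value + 1 for value in scores]

# pandas/NumPy: one expression applies the operation to every element.
vectorized = scores + 1

print(looped)                                 # [8.5, 7.5, 9.0]
print(vectorized.tolist())                    # [8.5, 7.5, 9.0]
print(vectorized.index.equals(scores.index))  # True: same index
print(scores.tolist())                        # [7.5, 6.5, 8.0]: original unchanged
```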

There's more…

All of the operators used in this recipe have method equivalents that produce the exact same result. For instance, in step 2, imdb_score + 1 can be reproduced with the .add method.

Using the method rather than the operator can be useful when we chain methods together.

Here are a few examples:

>>> imdb_score.add(1)  # imdb_score + 1
0       8.9
1       8.1
2       7.8
3       9.5
4       8.1
       ... 
4911    8.7
4912    8.5
4913    7.3
4914    7.3
4915    7.6
Name: imdb_score, Length: 4916, dtype: float64
>>> imdb_score.gt(7)  # imdb_score > 7
0        True
1        True
2       False
3        True
4        True
        ...  
4911     True
4912     True
4913    False
4914    False
4915    False
Name: imdb_score, Length: 4916, dtype: bool
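
As a rough illustration of the chaining point, the sketch below writes the same three-step computation once with operators and once with methods. The data and the threshold are made up for the example:

```python
import pandas as pd

# Hypothetical values standing in for imdb_score.
scores = pd.Series([7.5, 6.5, 8.0])

# Operator form: nested parentheses control the evaluation order.
with_operators = ((scores + 1) * 2) > 16

# Method form: reads left to right, one step per call.
with_methods = scores.add(1).mul(2).gt(16)

print(with_methods.tolist())                # [True, False, True]
print(with_operators.equals(with_methods))  # True
```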

Why does pandas offer a method equivalent to these operators? By its nature, an operator only operates in exactly one manner. Methods, on the other hand, can have parameters that allow you to alter their default functionality.

Other recipes will dive into this further, but here is a small example. The .sub method performs subtraction on a Series. When you do subtraction with the - operator, missing values are left as missing: the result is NaN wherever the input was missing. However, the .sub method allows you to specify a fill_value parameter to use in place of missing values:

>>> money = pd.Series([100, 20, None])
>>> money - 15
0    85.0
1     5.0
2     NaN
dtype: float64
>>> money.sub(15, fill_value=0)
0    85.0
1     5.0
2   -15.0
dtype: float64

Following is a table of operators and the corresponding methods:

Operator group    Operator                 Series method name
Arithmetic        +, -, *, /, //, %, **    .add, .sub, .mul, .div, .floordiv, .mod, .pow
Comparison        <, >, <=, >=, ==, !=     .lt, .gt, .le, .ge, .eq, .ne

You may be curious as to how a pandas Series object, or any object for that matter, knows what to do when it encounters an operator. For example, how does the expression imdb_score * 2.5 know to multiply each element in the Series by 2.5? Python has a built-in, standardized way for objects to communicate with operators using special methods.

Special methods are what Python calls on an object whenever that object is used with an operator. Special methods always begin and end with two underscores. Because of this, they are also called dunder methods (dunder being a lazy geeky programmer way of saying "double underscores"). For instance, the special method .__mul__ is called whenever the multiplication operator is used. Python interprets the imdb_score * 2.5 expression as imdb_score.__mul__(2.5).
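
The mechanism is not specific to pandas. As a minimal sketch, the hypothetical `Scores` class below (invented for this illustration, not part of any library) defines `__mul__`, and Python then routes the `*` operator to it:

```python
class Scores:
    """Hypothetical toy container, used only to illustrate operator dispatch."""

    def __init__(self, values):
        self.values = list(values)

    def __mul__(self, factor):
        # Python calls this method when it evaluates `scores * factor`.
        return Scores(value * factor for value in self.values)


toy = Scores([7.5, 6.5, 8.0])
via_operator = toy * 2         # operator syntax
via_dunder = toy.__mul__(2)    # the equivalent explicit call

print(via_operator.values)  # [15.0, 13.0, 16.0]
print(via_dunder.values)    # [15.0, 13.0, 16.0]
```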

There is no difference between using the special method and using an operator, as they do the exact same thing. The operator is just syntactic sugar for the special method. However, calling the .mul method is different from calling the .__mul__ method: like .sub, the .mul method accepts extra parameters such as fill_value, which the operator and its special method do not.
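
One way to see the distinction is to compare all three forms on a Series with a missing value. This sketch reuses the `money` Series from the earlier example and assumes the `fill_value` parameter is the difference of interest:

```python
import pandas as pd

money = pd.Series([100, 20, None])

via_operator = money * 2
via_dunder = money.__mul__(2)             # what the operator calls
via_method = money.mul(2, fill_value=0)   # extra parameter changes the result

print(via_operator.equals(via_dunder))  # True: the missing value stays NaN in both
print(via_method.tolist())              # [200.0, 40.0, 0.0]
```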