Series operations
Python has many operators for manipulating objects. For instance, when the plus operator is placed between two integers, Python adds them together:
>>> 5 + 9 # plus operator example. Adds 5 and 9
14
Series and DataFrames support many of the Python operators. Typically, a new Series or DataFrame is returned when using an operator.
In this recipe, a variety of operators will be applied to different Series objects to produce a new Series with completely different values.
How to do it…
- Select the imdb_score column as a Series:
>>> import pandas as pd
>>> movies = pd.read_csv("data/movie.csv")
>>> imdb_score = movies["imdb_score"]
>>> imdb_score
0 7.9
1 7.1
2 6.8
3 8.5
4 7.1
...
4911 7.7
4912 7.5
4913 6.3
4914 6.3
4915 6.6
Name: imdb_score, Length: 4916, dtype: float64
- Use the plus operator to add one to each Series element:
>>> imdb_score + 1
0 8.9
1 8.1
2 7.8
3 9.5
4 8.1
...
4911 8.7
4912 8.5
4913 7.3
4914 7.3
4915 7.6
Name: imdb_score, Length: 4916, dtype: float64
- The other basic arithmetic operators, minus (-), multiplication (*), division (/), and exponentiation (**), work similarly with scalar values. In this step, we will multiply the Series by 2.5:
>>> imdb_score * 2.5
0 19.75
1 17.75
2 17.00
3 21.25
4 17.75
...
4911 19.25
4912 18.75
4913 15.75
4914 15.75
4915 16.50
Name: imdb_score, Length: 4916, dtype: float64
- Python uses a double slash (//) for floor division. The floor division operator rounds the result of the division down to the nearest whole number (toward negative infinity, which matches truncation only for non-negative results). The percent sign (%) is the modulus operator, which returns the remainder after a division. Series instances also support these operations:
>>> imdb_score // 7
0 1.0
1 1.0
2 0.0
3 1.0
4 1.0
...
4911 1.0
4912 1.0
4913 0.0
4914 0.0
4915 0.0
Name: imdb_score, Length: 4916, dtype: float64
- There are six comparison operators: greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), equal to (==), and not equal to (!=). Each comparison operator turns each value in the Series into True or False based on the outcome of the condition. The result is a Boolean Series, which we will see is very useful for filtering in later recipes:
>>> imdb_score > 7
0 True
1 True
2 False
3 True
4 True
...
4911 True
4912 True
4913 False
4914 False
4915 False
Name: imdb_score, Length: 4916, dtype: bool
>>> director = movies["director_name"]
>>> director == "James Cameron"
0 True
1 False
2 False
3 False
4 False
...
4911 False
4912 False
4913 False
4914 False
4915 False
Name: director_name, Length: 4916, dtype: bool
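The steps above describe the modulus operator but only show floor division in action. The following is a minimal sketch of both operators on a small Series whose values are made up for illustration (they are not read from movie.csv). The negative value is included to show that floor division rounds down rather than toward zero.

```python
import pandas as pd

# Hypothetical scores, chosen for illustration only
scores = pd.Series([7.9, 6.8, -7.5])

floor_div = scores // 7   # rounds down toward negative infinity
remainder = scores % 7    # remainder takes the sign of the divisor

print(floor_div.tolist())           # [1.0, 0.0, -2.0]
print(remainder.round(1).tolist())  # [0.9, 6.8, 6.5]
```

Note that -7.5 // 7 gives -2.0, not -1.0, and the remainder is adjusted so that floor_div * 7 + remainder reproduces the original values up to floating-point rounding.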
How it works…
All the operators used in this recipe apply the same operation to each element in the Series. In native Python, this would require a for loop to iterate through each of the items in the sequence before applying the operation. pandas relies heavily on the NumPy library, which allows for vectorized computations, or the ability to operate on entire sequences of data without the explicit writing of for loops. Each operation returns a new Series with the same index, but with the new values.
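To make the contrast concrete, here is a small sketch that compares an explicit for loop with the vectorized operator. The three values are hypothetical stand-ins for the imdb_score column.

```python
import pandas as pd

# Hypothetical values standing in for imdb_score
scores = pd.Series([7.9, 7.1, 6.8])

# Native Python: iterate over each item and apply the operation
looped = []
for score in scores:
    looped.append(score + 1)

# pandas: a single expression applied to every element
vectorized = scores + 1

# The two approaches agree up to floating-point rounding
same_values = all(abs(a - b) < 1e-12 for a, b in zip(looped, vectorized))

# The vectorized result is a new Series that keeps the original index
same_index = vectorized.index.equals(scores.index)

print(same_values, same_index)  # True True
```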
There's more…
All of the operators used in this recipe have method equivalents that produce the exact same result. For instance, in step 2, imdb_score + 1 can be reproduced with the .add method.
Using the method rather than the operator can be useful when we chain methods together.
Here are a few examples:
>>> imdb_score.add(1) # imdb_score + 1
0 8.9
1 8.1
2 7.8
3 9.5
4 8.1
...
4911 8.7
4912 8.5
4913 7.3
4914 7.3
4915 7.6
Name: imdb_score, Length: 4916, dtype: float64
>>> imdb_score.gt(7) # imdb_score > 7
0 True
1 True
2 False
3 True
4 True
...
4911 True
4912 True
4913 False
4914 False
4915 False
Name: imdb_score, Length: 4916, dtype: bool
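As a sketch of the method chaining mentioned above, the following compares the operator form with the method form of the same calculation: the share of scores that exceed 8 after adding one. The four values are hypothetical.

```python
import pandas as pd

# Hypothetical values standing in for imdb_score
scores = pd.Series([7.9, 7.1, 6.8, 8.5])

# Operator form: nested parentheses are needed before calling .mean()
share_op = ((scores + 1) > 8).mean()

# Method form: each step reads left to right
share_method = scores.add(1).gt(8).mean()

print(share_op, share_method)  # 0.75 0.75
```

Both forms give the same answer; the method form simply avoids the extra parentheses as the chain grows.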
Why does pandas offer a method equivalent to these operators? By its nature, an operator only operates in exactly one manner. Methods, on the other hand, can have parameters that allow you to alter their default functionality.
Other recipes will dive into this further, but here is a small example. The .sub method performs subtraction on a Series. When you do subtraction with the - operator, missing values stay missing in the result. However, the .sub method allows you to specify a fill_value parameter to use in place of missing values:
>>> money = pd.Series([100, 20, None])
>>> money - 15
0 85.0
1 5.0
2 NaN
dtype: float64
>>> money.sub(15, fill_value=0)
0 85.0
1 5.0
2 -15.0
dtype: float64
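The fill_value parameter also applies when the other operand is a Series. The sketch below uses two made-up Series (named spent here purely for illustration) to show how each missing value is replaced before subtracting. When both operands are missing at the same position, pandas leaves the result missing.

```python
import pandas as pd

# Hypothetical data; None becomes NaN in a float64 Series
money = pd.Series([100, 20, None, None])
spent = pd.Series([15, None, 5, None])

with_operator = money - spent                 # NaN wherever either side is missing
with_method = money.sub(spent, fill_value=0)  # a missing value on one side is treated as 0

print(with_operator.isna().tolist())  # [False, True, True, True]
print(with_method.tolist()[:3])       # [85.0, 20.0, -5.0]
print(pd.isna(with_method.iloc[3]))   # True: both sides were missing
```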
Following is a table of operators and the corresponding methods:
Operator group | Operator | Series method name |
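Arithmetic | +, -, *, /, //, %, ** | .add, .sub, .mul, .div, .floordiv, .mod, .pow |
Comparison | <, >, <=, >=, ==, != | .lt, .gt, .le, .ge, .eq, .ne |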
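The correspondence between operators and Series methods can be spot-checked with the standard library's operator module. This is only a sketch on a small hypothetical Series with a scalar operand; it pairs each method name with the matching function from operator and confirms that both produce equal results.

```python
import operator

import pandas as pd

# Hypothetical values standing in for imdb_score
scores = pd.Series([7.9, 7.1, 6.8])

# Series method name -> function form of the matching operator
pairs = {
    "add": operator.add, "sub": operator.sub, "mul": operator.mul,
    "div": operator.truediv, "floordiv": operator.floordiv,
    "mod": operator.mod, "pow": operator.pow,
    "lt": operator.lt, "gt": operator.gt, "le": operator.le,
    "ge": operator.ge, "eq": operator.eq, "ne": operator.ne,
}

# Compare scores.<method>(7) with the operator applied to (scores, 7)
results = {
    name: getattr(scores, name)(7).equals(op(scores, 7))
    for name, op in pairs.items()
}

print(all(results.values()))  # True
```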
You may be curious as to how a pandas Series object, or any object for that matter, knows what to do when it encounters an operator. For example, how does the expression imdb_score * 2.5 know to multiply each element in the Series by 2.5? Python has a built-in, standardized way for objects to communicate with operators using special methods.
Special methods are what objects call internally whenever they encounter an operator. Special methods always begin and end with two underscores. Because of this, they are also called dunder methods (dunder being a lazy geeky programmer way of saying "double underscores"). For instance, the special method .__mul__ is called whenever the multiplication operator is used. Python interprets the imdb_score * 2.5 expression as imdb_score.__mul__(2.5).
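To see this mechanism outside pandas, here is a toy class that defines .__mul__ itself. The Scores class is hypothetical and far simpler than a real Series; it only illustrates that the * operator and a direct call to the special method go through the same code.

```python
class Scores:
    """Toy container for illustration; not part of pandas."""

    def __init__(self, values):
        self.values = list(values)

    def __mul__(self, other):
        # Python calls this method for the expression `scores * other`
        return Scores(value * other for value in self.values)


scores = Scores([1, 2, 3])

via_operator = scores * 2        # uses the * operator
via_dunder = scores.__mul__(2)   # calls the special method directly

print(via_operator.values)  # [2, 4, 6]
print(via_dunder.values)    # [2, 4, 6]
```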
There is no difference between using the special method and using an operator as they are doing the exact same thing. The operator is just syntactic sugar for the special method. However, calling the .mul method is different than calling the .__mul__ method.
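One difference that is easy to demonstrate is that .mul accepts extra parameters such as fill_value, whereas .__mul__ takes only the other operand. The sketch below reuses the small money Series from earlier; it shows this one difference and may not cover everything the distinction implies.

```python
import pandas as pd

money = pd.Series([100, 20, None])

via_operator = money * 2                 # operator form
via_dunder = money.__mul__(2)            # special method: same result as the operator
via_method = money.mul(2, fill_value=0)  # regular method: the missing value is replaced by 0 first

print(via_dunder.equals(via_operator))  # True
print(via_dunder.isna().tolist())       # [False, False, True]
print(via_method.tolist())              # [200.0, 40.0, 0.0]
```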