Book Image

Pandas 1.x Cookbook - Second Edition

By : Matt Harrison, Theodore Petrou
Book Image

Pandas 1.x Cookbook - Second Edition

By: Matt Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.
Table of Contents (17 chapters)
15
Other Books You May Enjoy
16
Index

Selecting multiple DataFrame columns

We can select a single column by passing the column name to the index operator of a DataFrame. This was covered in the Selecting a column recipe in Chapter 1, Pandas Foundations. It is often necessary to focus on a subset of the current working dataset, which is accomplished by selecting multiple columns.

In this recipe, all the actor and director columns will be selected from the movie dataset.

How to do it...

  1. Read in the movie dataset, and pass in a list of the desired columns to the indexing operator:
    >>> import pandas as pd
    >>> import numpy as np
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movie_actor_director = movies[
    ...     [
    ...         "actor_1_name",
    ...         "actor_2_name",
    ...         "actor_3_name",
    ...         "director_name",
    ...     ]
    ... ]
    >>> movie_actor_director.head()
      actor_1_name actor_2_name actor_3_name director_name
    0  CCH Pounder  Joel Dav...    Wes Studi  James Ca...
    1  Johnny Depp  Orlando ...  Jack Dav...  Gore Ver...
    2  Christop...  Rory Kin...  Stephani...   Sam Mendes
    3    Tom Hardy  Christia...  Joseph G...  Christop...
    4  Doug Walker   Rob Walker          NaN  Doug Walker
    
  2. There are instances when one column of a DataFrame needs to be selected. Using the index operation can return either a Series or a DataFrame. If we pass in a list with a single item, we will get back a DataFrame. If we pass in just a string with the column name, we will get a Series back:
    >>> type(movies[["director_name"]])
    <class 'pandas.core.frame.DataFrame'>
    >>> type(movies["director_name"])
    <class 'pandas.core.series.Series'>
    
  3. We can also use .loc to pull out a column by name. Because this index operation requires that we pass in a row selector first, we will use a colon (:) to indicate a slice that selects all of the rows. This can also return either a DataFrame or a Series:
    >>> type(movies.loc[:, ["director_name"]])
    <class 'pandas.core.frame.DataFrame'>
    >>> type(movies.loc[:, "director_name"])
    <class 'pandas.core.series.Series'>
    

How it works...

The DataFrame index operator is very flexible and capable of accepting a number of different objects. If a string is passed, it will return a single-dimensional Series. If a list is passed to the indexing operator, it returns a DataFrame of all the columns in the list in the specified order.

Step 2 shows how to select a single column as a DataFrame and as a Series. Usually, a single column is selected with a string, resulting in a Series. When a DataFrame is desired, put the column name in a single-element list.

Step 3 shows how to use the loc attribute to pull out a Series or a DataFrame.

There's more...

Passing a long list inside the indexing operator might cause readability issues. To help with this, you may save all your column names to a list variable first. The following code achieves the same result as step 1:

>>> cols = [
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
>>> movie_actor_director = movies[cols]

One of the most common exceptions raised when working with pandas is KeyError. This error is mainly due to mistyping of a column or index name. This same error is raised whenever a multiple column selection is attempted without the use of a list:

>>> movies[
...     "actor_1_name",
...     "actor_2_name",
...     "actor_3_name",
...     "director_name",
... ]
Traceback (most recent call last):
  ...
KeyError: ('actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name')