Pandas 1.x Cookbook - Second Edition

By: Matt Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This new updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.

Selecting a column

Selecting a single column from a DataFrame returns a Series that has the same index as the DataFrame. A Series is a single dimension of data, composed of just an index and the data. You can also create a Series by itself without a DataFrame, but it is more common to pull one off of a DataFrame.
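As a minimal sketch of building a Series on its own, the snippet below constructs one directly from a list. The values are hypothetical samples, not read from data/movie.csv, and the variable name directors is chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample values, not taken from data/movie.csv.
directors = pd.Series(
    ["James Cameron", "Gore Verbinski", "Sam Mendes"],
    name="director_name",
)

# With no index supplied, pandas creates a default RangeIndex.
print(directors.index)
print(directors.name)
```

The Series carries only an index, the data, and an optional name, which is the same structure you get back when you select a column from a DataFrame.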

This recipe examines two different syntaxes to select a single column of data, a Series. One syntax uses the index operator and the other uses attribute access (or dot notation).

How to do it…

  1. Pass a column name as a string to the indexing operator to select a Series of data:
    >>> movies = pd.read_csv("data/movie.csv")
    >>> movies["director_name"]
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    
  2. Alternatively, you may use attribute access to accomplish the same task:
    >>> movies.director_name
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    
  3. We can also index off of the .loc and .iloc attributes to pull out a Series. The former selects by column name, while the latter selects by position. The pandas documentation refers to these as label-based and integer position-based indexing.

    The usage of .loc specifies a selector for both rows and columns separated by a comma. The row selector is a slice with no start or end name (:) which means select all of the rows. The column selector will just pull out the column named director_name.

    The .iloc index operation also specifies both row and column selectors. The row selector is the slice with no start or end index (:) that selects all of the rows. The column selector, 1, pulls off the second column (remember that Python is zero-based):

    >>> movies.loc[:, "director_name"]
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    >>> movies.iloc[:, 1]
    0           James Cameron
    1          Gore Verbinski
    2              Sam Mendes
    3       Christopher Nolan
    4             Doug Walker
                  ...        
    4911          Scott Smith
    4912                  NaN
    4913     Benjamin Roberds
    4914          Daniel Hsia
    4915             Jon Gunn
    Name: director_name, Length: 4916, dtype: object
    
  4. Jupyter shows the series in a monospace font, and shows the index, type, length, and name of the series. It will also truncate data according to the pandas configuration settings. See the image for a description of these.
    Series anatomy

    You can also view the index, type, length, and name of the series with the appropriate attributes:

    >>> movies["director_name"].index
    RangeIndex(start=0, stop=4916, step=1)
    >>> movies["director_name"].dtype
    dtype('O')
    >>> movies["director_name"].size
    4916
    >>> movies["director_name"].name
    'director_name'
    
  5. Verify that the output is a Series:
    >>> type(movies["director_name"])
    <class 'pandas.core.series.Series'>
    
  6. Note that although the dtype is reported as object, the Series does not hold only strings. Because there are missing values, it contains both strings and floats (pandas represents each missing value as the float NaN). We can use the .apply method with the type function to get back a Series that has the type of every member. Rather than looking at the whole result, we chain the .unique method onto it to see just the distinct types found in the director_name column:
    >>> movies["director_name"].apply(type).unique()
    array([<class 'str'>, <class 'float'>], dtype=object)
    
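The mixed-type behavior from the last step can be reproduced without the movie file. This is a small sketch on hypothetical values; dtype=object is passed explicitly here (an assumption of this example) so the result does not depend on how a given pandas version infers string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical values; np.nan stands in for a missing director.
names = pd.Series(
    ["Scott Smith", np.nan, "Jon Gunn"],
    name="director_name",
    dtype=object,
)

print(names.dtype)                 # the container dtype
print(names.apply(type).unique())  # the Python types actually stored
print(names.isna().sum())          # how many entries are missing
```

Counting missing entries with .isna().sum() is one way to confirm that the float values come from NaN rather than from real numeric data.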

How it works…

A pandas DataFrame typically has multiple columns (though it may also have only one column). Each of these columns can be pulled out and treated as a Series.
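To check that the four selection styles from this recipe return the same data, the sketch below compares them on a tiny hypothetical DataFrame (movies_small is an illustrative stand-in, not the real dataset).

```python
import pandas as pd

# Hypothetical two-column frame standing in for the movie dataset.
movies_small = pd.DataFrame(
    {
        "color": ["Color", "Color", "Color"],
        "director_name": ["James Cameron", "Gore Verbinski", "Sam Mendes"],
    }
)

by_index = movies_small["director_name"]       # index operator
by_attr = movies_small.director_name           # attribute access
by_loc = movies_small.loc[:, "director_name"]  # label-based
by_iloc = movies_small.iloc[:, 1]              # position-based, zero-indexed

# Series.equals compares values, index, and dtype.
print(by_index.equals(by_attr))
print(by_index.equals(by_loc))
print(by_index.equals(by_iloc))
```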

There are many mechanisms to pull out a column from a DataFrame. Typically the easiest is to access it as an attribute. Attribute access is done with the dot operator (. notation). There are good things about this:

  • Least amount of typing
  • Jupyter will provide completion on the name
  • Jupyter will provide completion on the Series attributes

There are some downsides as well:

  • Only works with columns that have names that are valid Python attributes and do not conflict with existing DataFrame attributes
  • Cannot create a new column, can only update existing ones
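Both downsides can be seen in a short sketch. The DataFrame and its column names below are hypothetical, chosen so that one column name collides with an existing DataFrame method. The warning that pandas emits on the attribute assignment is silenced only to keep the output short.

```python
import warnings

import pandas as pd

# Hypothetical frame; "count" collides with the DataFrame.count method.
df = pd.DataFrame({"count": [1, 2], "director name": ["A", "B"]})

# Attribute access finds the method, not the column.
print(callable(df.count))
# The index operator still returns the column as a Series.
print(type(df["count"]))

# Assigning to a new attribute sets a plain Python attribute,
# it does not add a column to the DataFrame.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.rating = [5, 4]

print("rating" in df.columns)
```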

What is a valid Python attribute? A name made up of letters, digits, and underscores that does not start with a digit and is not a reserved keyword. Typically these are in lowercase to follow standard Python naming conventions. This means that column names with spaces or special characters will not work with attribute access.
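One way to test a column name against these rules is with the standard library's str.isidentifier and keyword.iskeyword. The helper name usable_as_attribute is hypothetical, and this sketch does not check for conflicts with existing DataFrame attributes.

```python
import keyword


def usable_as_attribute(column_name: str) -> bool:
    """Return True for a valid identifier that is not a reserved keyword."""
    return column_name.isidentifier() and not keyword.iskeyword(column_name)


# Hypothetical column names illustrating each case.
for name in ["director_name", "director name", "2nd_actor", "class"]:
    print(name, usable_as_attribute(name))
```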

Selecting column names using the index operator ([]) will work with any column name. You can also create and update columns with this operator. Jupyter will provide completion on the column name when you use the index operator, but sadly, will not complete on subsequent Series attributes.

I often find myself using attribute access because getting completion on the Series attribute is very handy. But, I also make sure that the column names are valid Python attribute names that don't conflict with existing DataFrame attributes. I also try not to update using either attribute or index assignment, but rather using the .assign method. You will see many examples of using .assign in this book.
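As a rough sketch of the .assign approach, the snippet below adds a derived column to a small hypothetical DataFrame. The column name name_length and the lambda parameter df_ are illustrative choices only.

```python
import pandas as pd

# Hypothetical frame standing in for the movie dataset.
movies_small = pd.DataFrame({"director_name": ["James Cameron", "Sam Mendes"]})

# .assign returns a new DataFrame; movies_small itself is left unchanged.
with_length = movies_small.assign(
    name_length=lambda df_: df_["director_name"].str.len()
)

print(list(with_length.columns))
print(list(movies_small.columns))
```

Because .assign returns a new object instead of mutating the original, it can be chained with other method calls.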

There's more…

To get completion in Jupyter, press the Tab key following a dot, or after starting a string in an index access. Jupyter will pop up a list of completions, and you can use the up and down arrow keys to highlight one, and hit Enter to complete it.