Pandas 1.x Cookbook - Second Edition

By: Matt Harrison, Theodore Petrou

Overview of this book

The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands as one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through situations that you are highly likely to encounter. This updated and revised edition provides you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas. Some recipes focus on achieving a deeper understanding of basic principles, or on comparing and contrasting two similar operations. Other recipes dive deep into a particular dataset, uncovering new and unexpected insights along the way. Many advanced recipes combine several different features across the pandas library to generate results.

DataFrame attributes

Each of the three DataFrame components (the index, columns, and data) may be accessed from a DataFrame. You might want to perform operations on the individual components and not on the DataFrame as a whole. In general, though we can pull out the data into a NumPy array, unless all the columns are numeric, we usually leave it in a DataFrame. DataFrames are ideal for managing heterogeneous columns of data; a NumPy array has a single dtype and handles mixed columns poorly.
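To make the point about heterogeneous columns concrete, here is a minimal sketch. The two-column frame `mixed` is a hypothetical stand-in and is not part of the movie dataset used in the recipe.

```python
import pandas as pd

# Hypothetical frame: one text column and one numeric column.
mixed = pd.DataFrame({"title": ["Avatar", "Spectre"], "score": [7.9, 6.8]})

# Inside the DataFrame, each column keeps its own dtype.
print(mixed.dtypes)

# A NumPy array has one dtype for all elements, so mixed columns
# are converted to the catch-all object dtype.
mixed_array = mixed.to_numpy()
print(mixed_array.dtype)

# Selecting only the numeric column yields a float64 array.
numeric_array = mixed[["score"]].to_numpy()
print(numeric_array.dtype)
```

An object array stores generic Python objects, so it gives up most of the speed that makes NumPy attractive, which is one reason to keep mixed data in a DataFrame.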

This recipe pulls out the index, columns, and the data of the DataFrame into their own variables, and then shows that the columns and index both derive from the same class, Index.

How to do it…

  1. Use the DataFrame attributes index and columns and the to_numpy method to assign the index, columns, and data to their own variables:
    >>> import pandas as pd
    >>> movies = pd.read_csv("data/movie.csv")
    >>> columns = movies.columns
    >>> index = movies.index
    >>> data = movies.to_numpy()
    
  2. Display each component's values:
    >>> columns
    Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
           'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
           'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
           'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
           'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
           'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
           'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
           'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
          dtype='object')
    >>> index
    RangeIndex(start=0, stop=4916, step=1)
    >>> data
    array([['Color', 'James Cameron', 723.0, ..., 7.9, 1.78, 33000],
           ['Color', 'Gore Verbinski', 302.0, ..., 7.1, 2.35, 0],
           ['Color', 'Sam Mendes', 602.0, ..., 6.8, 2.35, 85000],
           ...,
           ['Color', 'Benjamin Roberds', 13.0, ..., 6.3, nan, 16],
           ['Color', 'Daniel Hsia', 14.0, ..., 6.3, 2.35, 660],
           ['Color', 'Jon Gunn', 43.0, ..., 6.6, 1.85, 456]], dtype=object)
    
  3. Output the Python type of each DataFrame component (the class name is the word following the last dot of the output):
    >>> type(index)
    <class 'pandas.core.indexes.range.RangeIndex'>
    >>> type(columns)
    <class 'pandas.core.indexes.base.Index'>
    >>> type(data)
    <class 'numpy.ndarray'>
    
  4. The index and the columns are closely related. Both are Index objects: the index here is a RangeIndex, which is a subclass of Index, and the columns are a plain Index. This allows you to perform similar operations on both the index and the columns:
    >>> issubclass(pd.RangeIndex, pd.Index)
    True
    >>> issubclass(columns.__class__, pd.Index)
    True
    

How it works…

The index and the columns are the same kind of object, a set of labels for one axis of the DataFrame: the index labels the rows and the columns label the columns. They are occasionally referred to as the row index and column index.
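One way to see that the two play the same role along different axes is to transpose a DataFrame. The sketch below uses a small hypothetical frame rather than the movie dataset.

```python
import pandas as pd

# Hypothetical frame with explicit row labels.
df = pd.DataFrame(
    {"gross": [1.0, 2.0], "budget": [3.0, 4.0]},
    index=["m1", "m2"],
)

# DataFrame.axes returns the row index first, then the column index.
row_index, column_index = df.axes

# Transposing swaps the two: the old columns become the new index.
transposed = df.T
print(transposed.index.equals(df.columns))   # True
print(transposed.columns.equals(df.index))   # True
```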

There are many types of index objects in pandas. If you do not specify the index, pandas will use a RangeIndex. A RangeIndex is a subclass of an Index that is analogous to Python's range object. Its entire sequence of values is not loaded into memory until it is necessary to do so, thereby saving memory. It is completely defined by its start, stop, and step values.
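The following sketch illustrates these properties of a RangeIndex. The frame `df` and the one-million-element range are hypothetical examples chosen for illustration.

```python
import pandas as pd

# Hypothetical frame created without specifying an index.
df = pd.DataFrame({"x": [10, 20, 30]})

idx = df.index
print(type(idx).__name__)             # RangeIndex
print(idx.start, idx.stop, idx.step)  # 0 3 1

# A RangeIndex stores only start, stop, and step, so a large one stays
# small. Converting it to an array-backed Index stores every value.
big = pd.RangeIndex(1_000_000)
materialized = pd.Index(big.to_numpy())
print(big.memory_usage(), materialized.memory_usage())
```

The exact byte counts depend on the platform and pandas version, but the RangeIndex should report far less memory than its materialized counterpart.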

There's more...

When possible, Index objects are implemented using hash tables that allow for very fast selection and data alignment. They are similar to Python sets in that they support operations such as intersection and union, but are dissimilar because they are ordered and can have duplicate entries.
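A short sketch of these set-like operations follows. The label lists are hypothetical, borrowing a few column names from the movie dataset.

```python
import pandas as pd

left = pd.Index(["color", "gross", "budget"])
right = pd.Index(["budget", "gross", "imdb_score"])

# Set-like operations return new Index objects.
common = left.intersection(right)
combined = left.union(right)
print(common)
print(combined)

# Unlike a set, an Index is ordered, so positional access works,
# and labels can be looked up to find their position.
print(left[0])                # color
print(left.get_loc("gross"))  # 1

# Unlike a set, an Index may also hold duplicate labels.
dup = pd.Index(["gross", "gross", "budget"])
print(dup.is_unique)          # False
```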

Notice how the .to_numpy DataFrame method returned a NumPy n-dimensional array, or ndarray. Most of pandas relies heavily on the ndarray. Beneath the index, columns, and data are NumPy ndarrays. They could be considered the base object for pandas that many other objects are built upon. To see this, we can look at the values of the index and columns:

>>> index.to_numpy()
array([   0,    1,    2, ..., 4913, 4914, 4915], dtype=int64)
>>> columns.to_numpy()
array(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes',
       'actor_2_name', 'actor_1_facebook_likes', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users',
       'cast_total_facebook_likes', 'actor_3_name',
       'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
       'num_user_for_reviews', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
       'aspect_ratio', 'movie_facebook_likes'], dtype=object)
       

Having said all of that, we usually do not access the underlying NumPy objects. We tend to leave the objects as pandas objects and use pandas operations. However, we regularly apply NumPy functions to pandas objects.
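As a minimal sketch of applying NumPy functions to pandas objects, the example below uses a hypothetical three-element Series rather than the movie data.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with explicit labels.
scores = pd.Series([7.9, 7.1, 6.8], index=["a", "b", "c"], name="imdb_score")

# An element-wise NumPy function returns a Series and keeps the index.
logged = np.log(scores)
print(type(logged).__name__)              # Series
print(logged.index.equals(scores.index))  # True

# NumPy reductions also accept pandas objects and return a scalar.
mean_score = np.mean(scores)
print(mean_score)
```

Because the result of the element-wise call is still a pandas object, the labels stay attached and there is no need to drop down to the underlying array first.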