Pandas Cookbook

By: Theodore Petrou

Overview of this book

This book will provide you with unique, idiomatic, and fun recipes for both fundamental and advanced data manipulation tasks with pandas 0.20. Some recipes focus on achieving a deeper understanding of basic principles, or comparing and contrasting two similar operations. Other recipes will dive deep into a particular dataset, uncovering new and unexpected insights along the way. The pandas library is massive, and it's common for frequent users to be unaware of many of its more impressive features. The official pandas documentation, while thorough, does not contain many useful examples of how to piece together multiple commands like one would do during an actual analysis. This book guides you, as if you were looking over the shoulder of an expert, through practical situations that you are highly likely to encounter. Many advanced recipes combine several different features across the pandas 0.20 library to generate results.
Table of Contents (12 chapters)

Understanding data types

In very broad terms, data may be classified as either continuous or categorical. Continuous data is always numeric and represents some kind of measurement, such as height, wage, or salary. Continuous data can take on an infinite number of possible values. Categorical data, on the other hand, represents a discrete, finite set of values, such as car color, type of poker hand, or brand of cereal.

Pandas does not broadly classify data as either continuous or categorical. Instead, it has precise technical definitions for many distinct data types. The following table contains all pandas data types, with their string equivalents, and some notes on each type:

| Common data type name | NumPy/pandas object | Pandas string name | Notes |
| --- | --- | --- | --- |
| Boolean | np.bool | bool | Stored as a single byte. |
| Integer | np.int | int | Defaulted to 64 bits. Unsigned ints are also available - np.uint. |
| Float | np.float | float | Defaulted to 64 bits. |
| Complex | np.complex | complex | Rarely seen in data analysis. |
| Object | np.object | O, object | Typically strings but is a catch-all for columns with multiple different types or other Python objects (tuples, lists, dicts, and so on). |
| Datetime | np.datetime64, pd.Timestamp | datetime64 | Specific moment in time with nanosecond precision. |
| Timedelta | np.timedelta64, pd.Timedelta | timedelta64 | An amount of time, from days to nanoseconds. |
| Categorical | pd.Categorical | category | Specific only to pandas. Useful for object columns with relatively few unique values. |
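As a rough illustration of the table, the following sketch builds a small DataFrame with one column per data type and prints the resulting dtypes. The column names and values are hypothetical and are not taken from the book's datasets. The dtypes are requested through their string names, which avoids aliases such as np.int and np.float that recent NumPy releases no longer provide.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column per data type listed in the table.
df = pd.DataFrame({
    "flag": np.array([True, False], dtype="bool"),
    "count": np.array([1, 2], dtype="int64"),
    "ratio": np.array([0.5, 1.5], dtype="float64"),
    "z": np.array([1 + 2j, 3 - 1j], dtype="complex128"),
    "mixed": pd.Series([(1, 2), {"a": 1}], dtype="object"),
    "when": pd.to_datetime(["2017-01-01", "2017-06-30"]),
    "elapsed": pd.to_timedelta(["1 days", "2 hours"]),
    "color": pd.Series(["red", "blue"], dtype="category"),
})

# Each column reports exactly one dtype.
print(df.dtypes)
```

The exact time resolution shown for the datetime and timedelta columns may differ between pandas versions.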

Getting ready

In this recipe, we display the data type of each column in a DataFrame. It is crucial to know the type of data held in each column as it fundamentally changes the kind of operations that are possible with it.

How to do it...

  1. Use the dtypes attribute to display each column along with its data type:
>>> import pandas as pd
>>> movie = pd.read_csv('data/movie.csv')
>>> movie.dtypes
color                       object
director_name               object
num_critic_for_reviews     float64
duration                   float64
director_facebook_likes    float64
                            ...
title_year                 float64
actor_2_facebook_likes     float64
imdb_score                 float64
aspect_ratio               float64
movie_facebook_likes         int64
Length: 28, dtype: object
  2. Use the get_dtype_counts method to return the counts of each data type:
>>> movie.get_dtype_counts()
float64    13
int64       3
object     12
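The get_dtype_counts method exists in the pandas 0.20 release this book targets, but it was deprecated in pandas 0.25 and removed in pandas 1.0. A minimal sketch of an equivalent count on newer versions is to call value_counts on the dtypes Series. The small DataFrame below is a hypothetical stand-in for a few columns of movie.csv, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the movie dataset.
movie_sample = pd.DataFrame({
    "director_name": ["A", "B", "C"],
    "duration": [178.0, 169.0, 148.0],
    "imdb_score": [7.9, 7.1, 6.8],
    "movie_facebook_likes": [33000, 0, 85000],
})

# dtypes is itself a Series, so its values can be counted directly.
# Casting to str first makes the resulting index labels plain strings.
dtype_counts = movie_sample.dtypes.astype(str).value_counts()
print(dtype_counts)
```

The label reported for the string column depends on the pandas version, so only the numeric counts should be relied on here.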

How it works...

Each DataFrame column must be exactly one type. For instance, every value in the column aspect_ratio is a 64-bit float, and every value in movie_facebook_likes is a 64-bit integer. Pandas defaults its core numeric types, integers and floats, to 64 bits regardless of the size actually needed to hold the data. Even if a column consists entirely of the integer value 0, the data type will still be int64. get_dtype_counts is a convenience method for directly returning the count of all the data types in the DataFrame.
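The 64-bit default can be checked with a short sketch. The Series below is a made-up example; astype is used here only to show how much smaller an explicitly chosen integer type would be.

```python
import pandas as pd

# A column of nothing but zeros still defaults to 64-bit integers.
zeros = pd.Series([0, 0, 0, 0])
print(zeros.dtype)

# Explicitly converting to an 8-bit integer uses one byte per value
# instead of eight.
small = zeros.astype("int8")
print(zeros.nbytes, small.nbytes)
```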

Homogeneous data is another term for a column whose values all share the same type. A DataFrame as a whole may contain heterogeneous data, with different data types in different columns.

The object data type is the one data type that is unlike the others. A column that is of object data type may contain values that are of any valid Python object. Typically, when a column is of the object data type, it signals that the entire column consists of strings. This isn't necessarily the case, as it is possible for these columns to contain a mixture of integers, booleans, strings, or other, even more complex Python objects such as lists or dictionaries. The object data type is a catch-all for columns that pandas doesn't recognize as any other specific type.
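The catch-all behavior can be seen in a small sketch. The values below are invented for illustration: when no single specific type fits, pandas falls back to object and stores each original Python object unchanged.

```python
import pandas as pd

# A mixture of an integer, a boolean, a string, a list, and a dictionary.
mixed = pd.Series([1, True, "three", [4, 5], {"six": 6}])
print(mixed.dtype)

# Each element keeps its own Python type inside the object column.
value_types = [type(v).__name__ for v in mixed]
print(value_types)
```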

There's more...

Almost all of pandas' data types are built directly from NumPy. This tight integration makes it easier for users to integrate pandas and NumPy operations. As pandas grew larger and more popular, the object data type proved to be too generic for all columns with string values. Pandas created its own categorical data type to handle columns of strings (or numbers) with a fixed number of possible values.
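As a hedged sketch of why the categorical type helps, the example below converts a made-up column of repeated color names to the category type and compares memory usage. The data is hypothetical, and the exact byte counts depend on the pandas version and platform, so they are printed rather than stated.

```python
import pandas as pd

# Hypothetical column with many rows but only three distinct values.
colors = pd.Series(["red", "blue", "green", "red"] * 1000)

# The categorical version stores each distinct value once, plus a
# small integer code per row.
as_category = colors.astype("category")
print(as_category.dtype)
print(list(as_category.cat.categories))

# deep=True includes the memory taken by the string values themselves.
print(colors.memory_usage(deep=True), as_category.memory_usage(deep=True))
```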

See also