-
Book Overview & Buying
-
Table Of Contents
-
Feedback & Rating
Pandas 1.x Cookbook - Second Edition
By :
In the Calling Series methods recipe in Chapter 1, Pandas Foundations, a variety of methods operated on a single column or Series of data. Many of these were aggregation or reducing methods that returned a single scalar value. When these same methods are called from a DataFrame, they perform that operation for each column at once and reduce the results for each column in the DataFrame. They return a Series with the column names in the index and the summary for each column as the value.
In this recipe, we explore a variety of the most common DataFrame attributes and methods with the movie dataset.
.shape, .size, and .ndim, along with running the len function:
>>> movies = pd.read_csv("data/movie.csv")
>>> movies.shape
(4916, 28)
>>> movies.size
137648
>>> movies.ndim
2
>>> len(movies)
4916
.count method shows the number of non-missing values for each column. It is an aggregation method as it summarizes every column in a single value. The output is a Series that has the original column names as its index:
>>> movies.count()
color 4897
director_name 4814
num_critic_for_reviews 4867
duration 4901
director_facebook_likes 4814
...
title_year 4810
actor_2_facebook_likes 4903
imdb_score 4916
aspect_ratio 4590
movie_facebook_likes 4916
Length: 28, dtype: int64
.min, .max, .mean, .median, and .std, return Series that have the column names of the numeric columns in the index and their aggregations as the values:
>>> movies.min()
num_critic_for_reviews 1.00
duration 7.00
director_facebook_likes 0.00
actor_3_facebook_likes 0.00
actor_1_facebook_likes 0.00
...
title_year 1916.00
actor_2_facebook_likes 0.00
imdb_score 1.60
aspect_ratio 1.18
movie_facebook_likes 0.00
Length: 16, dtype: float64
.describe method is very powerful and calculates all the descriptive statistics and quartiles at once. The end result is a DataFrame with the descriptive statistics names as its index. I like to transpose the results using .T as I can usually fit more information on the screen that way:
>>> movies.describe().T
count mean ... 75% max
num_criti... 4867.0 137.988905 ... 191.00 813.0
duration 4901.0 107.090798 ... 118.00 511.0
director_... 4814.0 691.014541 ... 189.75 23000.0
actor_3_f... 4893.0 631.276313 ... 633.00 23000.0
actor_1_f... 4909.0 6494.488491 ... 11000.00 640000.0
... ... ... ... ... ...
title_year 4810.0 2002.447609 ... 2011.00 2016.0
actor_2_f... 4903.0 1621.923516 ... 912.00 137000.0
imdb_score 4916.0 6.437429 ... 7.20 9.5
aspect_ratio 4590.0 2.222349 ... 2.35 16.0
movie_fac... 4916.0 7348.294142 ... 2000.00 349000.0
.describe method using the percentiles parameter:
>>> movies.describe(percentiles=[0.01, 0.3, 0.99]).T
count mean ... 99% max
num_criti... 4867.0 137.988905 ... 546.68 813.0
duration 4901.0 107.090798 ... 189.00 511.0
director_... 4814.0 691.014541 ... 16000.00 23000.0
actor_3_f... 4893.0 631.276313 ... 11000.00 23000.0
actor_1_f... 4909.0 6494.488491 ... 44920.00 640000.0
... ... ... ... ... ...
title_year 4810.0 2002.447609 ... 2016.00 2016.0
actor_2_f... 4903.0 1621.923516 ... 17000.00 137000.0
imdb_score 4916.0 6.437429 ... 8.50 9.5
aspect_ratio 4590.0 2.222349 ... 4.00 16.0
movie_fac... 4916.0 7348.294142 ... 93850.00 349000.0
Step 1 gives basic information on the size of the dataset. The .shape attribute returns a tuple with the number of rows and columns. The .size attribute returns the total number of elements in the DataFrame, which is just the product of the number of rows and columns. The .ndim attribute returns the number of dimensions, which is two for all DataFrames. When a DataFrame is passed to the built-in len function, it returns the number of rows.
The methods in step 2 and step 3 aggregate each column down to a single number. Each column name is now the index label in a Series with its aggregated result as the corresponding value.
If you look closely, you will notice that the output from step 3 is missing all the object columns from step 2. This method ignores string columns by default.
Note that numeric columns have missing values but have a result returned by .describe. By default, pandas handles missing values in numeric columns by skipping them. It is possible to change this behavior by setting the skipna parameter to False. This will cause pandas to return NaN for all these aggregation methods if there exists at least a single missing value.
The .describe method displays the summary statistics of the numeric columns. You can expand its summary to include more quantiles by passing a list of numbers between 0 and 1 to the percentiles parameter. See the Developing a data analysis routine recipe for more on the .describe method.
To see how the .skipna parameter affects the outcome, we can set its value to False and rerun step 3 from the preceding recipe. Only numeric columns without missing values will calculate a result:
>>> movies.min(skipna=False)
num_critic_for_reviews NaN
duration NaN
director_facebook_likes NaN
actor_3_facebook_likes NaN
actor_1_facebook_likes NaN
...
title_year NaN
actor_2_facebook_likes NaN
imdb_score 1.6
aspect_ratio NaN
movie_facebook_likes 0.0
Length: 16, dtype: float64