Learning Pandas

By: Michael Heydt

Primary pandas objects


A pandas programmer will spend most of their time using two primary objects provided by the pandas framework: Series and DataFrame. The DataFrame object is the overall workhorse of pandas and the most frequently used, as it provides the means to manipulate tabular and heterogeneous data.

The pandas Series object

The base data structure of pandas is the Series object, which is designed to operate similarly to a NumPy array but also adds index capabilities. A simple way to create a Series object is to initialize it with a Python array or Python list.

In [2]:
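   # imports assumed by the examples in this section
   import pandas as pd
   from pandas import Series, DataFrame
   # create a four item Series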
   s = Series([1, 2, 3, 4])
   s

Out [2]:
   0    1
   1    2
   2    3
   3    4
   dtype: int64

This has created a pandas Series from the list. Notice that printing the series resulted in what appears to be two columns of data. The first column in the output is not a column of the Series object, but the index labels. The second column is the values of the Series object. Each row represents the index label and the value for that label. This Series was created without specifying an index, so pandas automatically creates integer index labels starting at zero and increasing by one.

Elements of a Series object can be accessed through the index using []. This informs the Series which value to return given one or more index values (referred to in pandas as labels). The following code retrieves the items in the series with labels 1 and 3.

In [3]:
   # return a Series with the rows with labels 1 and 3
   s[[1, 3]]

Out [3]:
   1    2
   3    4
   dtype: int64

Note

It is important to note that the lookup here is not by zero-based positions 1 and 3 like an array, but by the values in the index.
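To make the distinction concrete, the following sketch uses an index whose integer labels deliberately differ from the positions. The index values and the names by_label and by_position are illustrative assumptions, not part of the original example. It uses the .loc and .iloc properties, which request label-based and position-based lookup explicitly.

```python
import pandas as pd

# integer labels chosen (for illustration only) to run opposite to the positions
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

# label-based lookup: the rows whose index labels are 1 and 3
by_label = s.loc[[1, 3]]

# position-based lookup: the second and fourth rows
by_position = s.iloc[[1, 3]]

print(by_label.tolist())     # values stored under labels 1 and 3
print(by_position.tolist())  # values stored at positions 1 and 3
```

With the default zero-based index the two lookups coincide, which is why the difference is easy to overlook.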

A Series object can be created with a user-defined index by specifying the labels for the index using the index parameter.

In [4]:
   # create a series using an explicit index
   s = Series([1, 2, 3, 4], 
              index = ['a', 'b', 'c', 'd'])
   s

Out [4]:
   a    1
   b    2
   c    3
   d    4
   dtype: int64

Note

Notice that the index labels in the output now have the index values that were specified in the Series constructor.

Data in the Series object can now be accessed by alphanumeric index labels by passing a list of the desired labels, as the following demonstrates:

In [5]:
   # look up items in the series having index labels 'a' and 'd'
   s[['a', 'd']]

Out [5]:
   a    1
   d    4
   dtype: int64

Note

This demonstrates the previous point that the lookup is by label value and not by zero-based position.

It is still possible to refer to the elements of the Series object by their numerical position.

In [6]:
   # passing a list of integers to a Series that has
   # non-integer index labels will look up based upon
   # 0-based index like an array
   s[[1, 2]]

Out [6]:
   b    2
   c    3
   dtype: int64

Note

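Because the index labels of this Series are not integers, the Series treats a list of integers as zero-based positions. Recent pandas releases deprecate this implicit fallback, so the .iloc property is the more robust way to look up values by position.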
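Position-based lookup can also be requested explicitly through the .iloc property, which does not depend on how the Series interprets integers. The following is a minimal sketch that rebuilds the same four-item Series; the name by_position is an assumption made for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# explicit position-based lookup: rows at zero-based positions 1 and 2
by_position = s.iloc[[1, 2]]

print(by_position)
```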

The s.index property allows direct access to the index of the Series object.

In [7]:
   # get only the index of the Series
   s.index

Out [7]:
   Index([u'a', u'b', u'c', u'd'], dtype='object')

The index is itself actually a pandas object. This shows us the values of the index and that the data type of each label in the index is object.
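Because the index is an object in its own right, it can be queried directly. The short sketch below shows a few such queries on the same Series; the variable names are illustrative assumptions.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
idx = s.index

# membership test against the labels
has_b = 'b' in idx

# zero-based position of the label 'c' within the index
position_of_c = idx.get_loc('c')

# the labels as a plain Python list
labels = idx.tolist()

print(has_b, position_of_c, labels)
```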

A common usage of a Series in pandas is to represent a time series that associates date/time index labels with a value. A date range can be created using the pandas method pd.date_range().

In [8]:
   # create an index whose labels are the dates
   # between the two specified dates (inclusive)
   dates = pd.date_range('2014-07-01', '2014-07-06')
   dates

Out [8]:
   <class 'pandas.tseries.index.DatetimeIndex'>
   [2014-07-01, ..., 2014-07-06]
   Length: 6, Freq: D, Timezone: None

Note

This has created a special index in pandas referred to as a DatetimeIndex, which is a pandas index that is optimized to index data with dates and times.
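The same range of dates can also be described by a start date and a number of periods rather than by two end points. The following sketch, with illustrative variable names, builds the index both ways and checks that the results match.

```python
import pandas as pd

# start and end dates, both inclusive
by_end = pd.date_range('2014-07-01', '2014-07-06')

# start date plus a count of daily periods
by_periods = pd.date_range('2014-07-01', periods=6, freq='D')

# True when both indexes hold the same dates in the same order
same = by_end.equals(by_periods)

print(len(by_end), same)
```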

At this point, the index is not particularly useful without having a value for each index label. We can use this index to create a new Series object with values for each of the dates.

In [9]:
   # create a Series with values (representing temperatures)
   # for each date in the index
   temps1 = Series([80, 82, 85, 90, 83, 87], 
                   index = dates)
   temps1

Out [9]:
   2014-07-01    80
   2014-07-02    82
   2014-07-03    85
   2014-07-04    90
   2014-07-05    83
   2014-07-06    87
   Freq: D, dtype: int64

A pandas Series provides statistical methods much like those of a NumPy array. The following returns the mean of the values in the Series.

In [10]:
   # calculate the mean of the values in the Series
   temps1.mean()

Out [10]:
   84.5
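As a cross-check, the same figure can be obtained by handing the underlying values to NumPy. This is only a sketch; the names series_mean and numpy_mean are assumptions introduced here.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps1 = pd.Series([80, 82, 85, 90, 83, 87], index=dates)

# mean computed by the Series method
series_mean = temps1.mean()

# mean computed by NumPy on the underlying array of values
numpy_mean = np.mean(temps1.to_numpy())

print(series_mean, numpy_mean)
```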

An arithmetic operation can be applied to two Series objects. The following code calculates the difference in temperature between two Series.

In [11]:
   # create a second series of values using the same index
   temps2 = Series([70, 75, 69, 83, 79, 77], 
                   index = dates)
   # the following aligns the two by their index values
   # and calculates the difference at those matching labels
   temp_diffs = temps1 - temps2
   temp_diffs

Out [11]:
   2014-07-01    10
   2014-07-02     7
   2014-07-03    16
   2014-07-04     7
   2014-07-05     4
   2014-07-06    10
   Freq: D, dtype: int64

Note

An arithmetic operation (+, -, /, *, …) between two Series objects returns another Series object, with the operation applied to the values at matching index labels.
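Alignment by label becomes visible when the two indexes do not match exactly. In the following sketch, which uses made-up values and labels purely for illustration, labels present in only one Series produce missing values (NaN) in the result.

```python
import pandas as pd

# illustrative data: the labels overlap only at 'b' and 'c'
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# values are added where labels match; unmatched labels yield NaN
total = s1 + s2

print(total)
```

The result carries the union of the two indexes, so no label is silently dropped.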

Time series data such as that shown here can also be accessed via the index or by an offset into the Series object.

In [12]:
   # look up a value by date using the index
   temp_diffs['2014-07-03']

Out [12]:
   16

In [13]:
   # it is also possible by integer position, as if the
   # series were an array
   temp_diffs[2]

Out [13]:
   16
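Both lookups can be written with explicit accessors, which removes any ambiguity about whether a key is a label or a position. The sketch below rebuilds the temperature data from the listings above; the use of pd.Timestamp for the label and the names by_label and by_position are choices made here for illustration.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps1 = pd.Series([80, 82, 85, 90, 83, 87], index=dates)
temps2 = pd.Series([70, 75, 69, 83, 79, 77], index=dates)
temp_diffs = temps1 - temps2

# label-based lookup using the date as the key
by_label = temp_diffs.loc[pd.Timestamp('2014-07-03')]

# position-based lookup of the third item
by_position = temp_diffs.iloc[2]

print(by_label, by_position)
```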

The pandas DataFrame object

A pandas Series represents a single array of values, with an index label for each value. If you want to have more than one Series of data that is aligned by a common index, then a pandas DataFrame is used.

Note

In a way a DataFrame is analogous to a database table in that it contains one or more columns of data of heterogeneous type (but a single type for all items in each respective column).
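The table analogy can be seen by inspecting the data type of each column. The following sketch uses invented column names and values, which are assumptions for illustration only.

```python
import pandas as pd

# illustrative data: one text column, one integer column, one float column
df = pd.DataFrame({
    'city': ['Missoula', 'Philadelphia'],
    'temp': [80, 70],
    'humidity': [0.35, 0.60],
})

# one data type per column
print(df.dtypes)

is_temp_int = pd.api.types.is_integer_dtype(df['temp'])
is_humidity_float = pd.api.types.is_float_dtype(df['humidity'])
```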

The following code creates a DataFrame object with two columns representing the temperatures from the Series objects used earlier.

In [14]:
   # create a DataFrame from the two series objects temps1 and temps2
   # and give them column names
   temps_df = DataFrame(
               {'Missoula': temps1, 
                'Philadelphia': temps2})
   temps_df

Out [14]:
               Missoula  Philadelphia
   2014-07-01        80            70
   2014-07-02        82            75
   2014-07-03        85            69
   2014-07-04        90            83
   2014-07-05        83            79
   2014-07-06        87            77

Note

This has created a DataFrame object with two columns, named Missoula and Philadelphia, and using the values from the respective Series objects for each. These are new Series objects contained within the DataFrame, with the values copied from the original Series objects.
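The copy behavior can be checked with a small experiment. This is a sketch that assumes a reasonably recent pandas version; it changes one value in the DataFrame and confirms that the original Series is unaffected.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps1 = pd.Series([80, 82, 85, 90, 83, 87], index=dates)
temps2 = pd.Series([70, 75, 69, 83, 79, 77], index=dates)
temps_df = pd.DataFrame({'Missoula': temps1, 'Philadelphia': temps2})

# overwrite the first Missoula value inside the DataFrame only
temps_df.loc[dates[0], 'Missoula'] = 0

print(temps_df['Missoula'].iloc[0])  # the value held by the DataFrame
print(temps1.iloc[0])                # the value held by the original Series
```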

Columns in a DataFrame object can be accessed using an array indexer [] with the name of the column or a list of column names. The following code retrieves the Missoula column of the DataFrame object:

In [15]:
   # get the column with the name Missoula
   temps_df['Missoula']

Out [15]:
   2014-07-01    80
   2014-07-02    82
   2014-07-03    85
   2014-07-04    90
   2014-07-05    83
   2014-07-06    87
   Freq: D, Name: Missoula, dtype: int64

The following code retrieves the Philadelphia column:

In [16]:
   # likewise we can get just the Philadelphia column
   temps_df['Philadelphia']

Out [16]:
   2014-07-01    70
   2014-07-02    75
   2014-07-03    69
   2014-07-04    83
   2014-07-05    79
   2014-07-06    77
   Freq: D, Name: Philadelphia, dtype: int64

The following code returns both the columns, but reversed.

In [17]:
   # return both columns in a different order
   temps_df[['Philadelphia', 'Missoula']]

Out [17]:
               Philadelphia  Missoula
   2014-07-01            70        80
   2014-07-02            75        82
   2014-07-03            69        85
   2014-07-04            83        90
   2014-07-05            79        83
   2014-07-06            77        87

Note

Notice that there is a subtle difference in a DataFrame object as compared to a Series object. Passing a list to the [] operator of DataFrame retrieves the specified columns, whereas Series uses it as index labels to retrieve rows.

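Very conveniently, if the name of a column is a valid Python identifier (for example, it contains no spaces) and does not clash with an existing DataFrame attribute or method, you can use property-style names to access the columns in a DataFrame.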
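Property-style access has a limitation worth knowing: a column whose name matches an existing DataFrame method is shadowed by that method. The following sketch uses a hypothetical column named count to show this; the data is made up for illustration.

```python
import pandas as pd

# 'count' is a hypothetical column name that collides with DataFrame.count()
df = pd.DataFrame({'Missoula': [80, 82], 'count': [1, 2]})

# property syntax and [] return the same column when there is no collision
same = df.Missoula.equals(df['Missoula'])

# here the property refers to the method, not the column
attribute_is_method = callable(df.count)

# the [] indexer always reaches the column
count_column = df['count']

print(same, attribute_is_method, count_column.tolist())
```

The [] indexer is therefore the safer choice whenever column names are not known in advance.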

In [18]:
   # retrieve the Missoula column through property syntax
   temps_df.Missoula

Out [18]:
   2014-07-01    80
   2014-07-02    82
   2014-07-03    85
   2014-07-04    90
   2014-07-05    83
   2014-07-06    87
   Freq: D, Name: Missoula, dtype: int64

Arithmetic operations between columns within a DataFrame are identical in operation to those on multiple Series as each column in a DataFrame is a Series. To demonstrate, the following code calculates the difference between temperatures using property notation.

In [19]:
   # calculate the temperature difference between the two cities
   temps_df.Missoula - temps_df.Philadelphia

Out [19]:
   2014-07-01    10
   2014-07-02     7
   2014-07-03    16
   2014-07-04     7
   2014-07-05     4
   2014-07-06    10
   Freq: D, dtype: int64

A new column can be added to a DataFrame simply by assigning another Series to a column using the array indexer [] notation. The following code adds a new column in the DataFrame, which contains the difference in temperature on the respective dates.

In [20]:
   # add a column to temps_df that contains the difference in temps
   temps_df['Difference'] = temp_diffs
   temps_df

Out [20]:
               Missoula  Philadelphia  Difference
   2014-07-01        80            70          10
   2014-07-02        82            75           7
   2014-07-03        85            69          16
   2014-07-04        90            83           7
   2014-07-05        83            79           4
   2014-07-06        87            77          10

The names of the columns in a DataFrame are accessible via the DataFrame object's .columns property, which itself is a pandas Index object.

In [21]:
   # get the columns, which is also an Index object
   temps_df.columns

Out [21]:
   Index([u'Missoula', u'Philadelphia', u'Difference'], dtype='object')

The DataFrame (and Series) objects can be sliced to retrieve specific rows. A simple example here shows how to select the second through fourth rows of temperature difference values.

In [22]:
   # slice the temp differences column for the rows at
   # locations 1 through 3 (as though it is an array;
   # the end position 4 is excluded)
   temps_df.Difference[1:4]

Out [22]:
   2014-07-02     7
   2014-07-03    16
   2014-07-04     7
   Freq: D, Name: Difference, dtype: int64
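The same rows can be sliced by position or by label. One difference is worth noting: a position slice excludes its end point, whereas a label slice through .loc includes it. The sketch below rebuilds the difference values from the earlier listings; the names diffs, by_position and by_label are assumptions for illustration.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps1 = pd.Series([80, 82, 85, 90, 83, 87], index=dates)
temps2 = pd.Series([70, 75, 69, 83, 79, 77], index=dates)
diffs = temps1 - temps2

# positions 1, 2 and 3 (position 4 is excluded)
by_position = diffs.iloc[1:4]

# labels from July 2 through July 4 (both end points included)
by_label = diffs.loc['2014-07-02':'2014-07-04']

print(by_position.equals(by_label))
```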

Entire rows from a DataFrame can be retrieved using its .loc and .iloc properties. The following code uses the .iloc property to return a Series object representing the second row of temps_df, selected by the zero-based position of the row:

In [23]:
   # get the row at array position 1
   temps_df.iloc[1]

Out [23]:
   Missoula        82
   Philadelphia    75
   Difference       7
   Name: 2014-07-02 00:00:00, dtype: int64

This has converted the row into a Series, with the column names of the DataFrame pivoted into the index labels of the resulting Series.

In [24]:
   # the names of the columns have become the index
   # they have been 'pivoted'
   temps_df.iloc[1].index

Out [24]:
   Index([u'Missoula', u'Philadelphia', u'Difference'], dtype='object')

Rows can be explicitly accessed via index label using the .loc property. The following code retrieves a row by the index label:

In [25]:
   # retrieve row by index label using .loc
   temps_df.loc['2014-07-03']

Out [25]:
   Missoula        85
   Philadelphia    69
   Difference      16
   Name: 2014-07-03 00:00:00, dtype: int64

Specific rows in a DataFrame object can be selected using a list of integer positions. The following code selects the values from the Difference column in rows at locations 1, 3, and 5.

In [26]:
   # get the values in the Difference column in rows 1, 3, and 5
   # using 0-based location
   temps_df.iloc[[1, 3, 5]].Difference

Out [26]:
   2014-07-02     7
   2014-07-04     7
   2014-07-06    10
   Name: Difference, dtype: int64
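The order of the two steps can be swapped: selecting the column first and then the rows gives the same values. The following sketch rebuilds the DataFrame from the earlier listings and compares both forms; the result names are assumptions for illustration.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps1 = pd.Series([80, 82, 85, 90, 83, 87], index=dates)
temps2 = pd.Series([70, 75, 69, 83, 79, 77], index=dates)
temps_df = pd.DataFrame({'Missoula': temps1, 'Philadelphia': temps2})
temps_df['Difference'] = temps1 - temps2

# select the rows by position, then take the column
rows_then_column = temps_df.iloc[[1, 3, 5]]['Difference']

# take the column, then select the rows by position
column_then_rows = temps_df['Difference'].iloc[[1, 3, 5]]

print(rows_then_column.equals(column_then_rows))
```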

Rows of a DataFrame can be selected based upon a logical expression applied to the data in each row. The following code returns the evaluation of the value in the Missoula temperature column being greater than 82 degrees:

In [27]:
   # which values in the Missoula column are > 82?
   temps_df.Missoula > 82

Out [27]:
   2014-07-01    False
   2014-07-02    False
   2014-07-03     True
   2014-07-04     True
   2014-07-05     True
   2014-07-06     True
   Freq: D, Name: Missoula, dtype: bool

When using the result of an expression as the parameter to the [] operator of a DataFrame, the rows where the expression evaluated to True will be returned.

In [28]:
   # return the rows where the temps for Missoula > 82
   temps_df[temps_df.Missoula > 82]

Out [28]:
               Missoula  Philadelphia  Difference
   2014-07-03        85            69          16
   2014-07-04        90            83           7
   2014-07-05        83            79           4
   2014-07-06        87            77          10

In pandas terminology, this technique is referred to as Boolean selection, and it forms the basis of selecting data based upon its values.
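Several conditions can be combined in one selection. The sketch below, which rebuilds the two temperature columns from the earlier listings, uses the & operator with each condition wrapped in parentheses; the threshold of 80 degrees for Philadelphia and the name selected are assumptions chosen for illustration.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps_df = pd.DataFrame({
    'Missoula': [80, 82, 85, 90, 83, 87],
    'Philadelphia': [70, 75, 69, 83, 79, 77],
}, index=dates)

# rows where Missoula is above 82 and Philadelphia is below 80;
# the parentheses are required because & binds tighter than > and <
selected = temps_df[(temps_df.Missoula > 82) & (temps_df.Philadelphia < 80)]

print(selected)
```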