Mastering Python for Data Science

By: Samir Madhavan

Empowering data analysis with pandas


The pandas library was developed by Wes McKinney when he was working at AQR Capital Management. He wanted a tool that was flexible enough to perform quantitative analysis on financial data. Later, Chang She joined him and helped develop the package further.

The pandas library is an open source Python library, specially designed for data analysis. It is built on top of NumPy, a fairly low-level tool that handles matrices really well, and makes it easy to work with data.

The pandas library brings the richness of R to the world of Python for handling data. It has efficient data structures for processing data, performing fast joins, and reading data from various sources, to name a few capabilities.

The data structure of pandas

The pandas library essentially has three data structures:

  1. Series

  2. DataFrame

  3. Panel

Series

A Series is a one-dimensional array that can hold any type of data, such as integers, floats, strings, and even Python objects. A Series can be created by calling the following:

>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(np.random.randn(5))

0    0.733810
1   -1.274658
2   -1.602298
3    0.460944
4   -0.632756
dtype: float64

The np.random.randn function is part of the NumPy package and generates random numbers. The pd.Series function creates a pandas Series whose first column is the index and whose second column contains the random values. At the bottom of the output is the datatype of the Series.

The index of the series can be customized by calling the following:

>>> pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

a   -0.929494
b   -0.571423
c   -1.197866
d    0.081107
e   -0.035091
dtype: float64

A series can be derived from a Python dict too:

>>> d = {'A': 10, 'B': 20, 'C': 30}
>>> pd.Series(d)

A    10
B    20
C    30
dtype: int64
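
Continuing the dict example, elements of a Series can be accessed by index label or by integer position, and arithmetic is vectorized. A quick sketch:

```python
import pandas as pd

s = pd.Series({'A': 10, 'B': 20, 'C': 30})

# Access by index label, or by integer position via iloc.
print(s['B'])     # 20
print(s.iloc[0])  # 10

# Arithmetic is applied element-wise, as with NumPy arrays.
print((s * 2).tolist())  # [20, 40, 60]
```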

DataFrame

DataFrame is a 2D data structure with columns that can be of different datatypes. It can be seen as a table. A DataFrame can be formed from the following data structures:

  • A 1D NumPy array

  • Lists

  • Dicts

  • Series

  • A 2D NumPy array

A DataFrame can be created from a dict of series by calling the following commands:

>>> d = {'c1': pd.Series(['A', 'B', 'C']),
        'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df

   c1  c2
0    A   1
1    B   2
2    C   3
3  NaN   4

A DataFrame can also be created from a dict of lists:

>>> d = {'c1': ['A', 'B', 'C', 'D'],
    'c2': [1, 2.0, 3.0, 4.0]}
>>> df = pd.DataFrame(d)
>>> print(df)
 c1  c2
0  A   1
1  B   2
2  C   3
3  D   4
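
A brief sketch of how the columns of the resulting DataFrame behave (each column is itself a Series and keeps its own dtype):

```python
import pandas as pd

df = pd.DataFrame({'c1': ['A', 'B', 'C', 'D'],
                   'c2': [1, 2.0, 3.0, 4.0]})

# Selecting a column returns a Series; columns may differ in dtype.
print(df['c1'].tolist())  # ['A', 'B', 'C', 'D']
print(df.dtypes['c2'])    # float64

# iloc selects rows by integer position.
print(df.iloc[0]['c2'])   # 1.0
```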

Panel

A Panel is a data structure that handles 3D data. The following command is an example of panel data:

>>> d = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
    'Item2': pd.DataFrame(np.random.randn(4, 2))}
>>> pd.Panel(d)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

The preceding output shows that there are two DataFrames, represented by the two items. There are four rows, represented by the major axis, and three columns, represented by the minor axis.
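
On pandas versions where Panel is unavailable (it was removed in pandas 0.25), the same three-dimensional data is usually represented as a DataFrame with a hierarchical (MultiIndex) row index. A minimal sketch, reusing the dict of DataFrames from above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the result is reproducible
d = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
     'Item2': pd.DataFrame(np.random.randn(4, 2))}

# pd.concat with a dict stacks the DataFrames under a two-level row
# index; the outer level plays the role of the Panel's items axis.
stacked = pd.concat(d)

# 2 items x 4 rows = 8 rows; the columns are the union (0, 1, 2), so
# Item2's missing third column is filled with NaN.
print(stacked.shape)  # (8, 3)
```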

Inserting and exporting data

Data is stored in various forms, such as CSV, TSV, databases, and so on. The pandas library makes it convenient to read data from these formats and to export to them. We'll use a dataset that contains weight statistics of school students from the U.S.

We'll be using a file with the following structure:

Column                     Description
LOCATION CODE              Unique location code
COUNTY                     The county the school belongs to
AREA NAME                  The district the school belongs to
REGION                     The region the school belongs to
SCHOOL YEARS               The school year the data is addressing
NO. OVERWEIGHT             The number of overweight students
PCT OVERWEIGHT             The percentage of overweight students
NO. OBESE                  The number of obese students
PCT OBESE                  The percentage of obese students
NO. OVERWEIGHT OR OBESE    The number of students who are overweight or obese
PCT OVERWEIGHT OR OBESE    The percentage of students who are overweight or obese
GRADE LEVEL                Whether they belong to elementary or high school
AREA TYPE                  The type of area
STREET ADDRESS             The address of the school
CITY                       The city the school belongs to
STATE                      The state the school belongs to
ZIP CODE                   The zip code of the school
Location 1                 The address with longitude and latitude

CSV

To read data from a .csv file, the following read_csv function can be used:

>>> d = pd.read_csv('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
>>> d[0:5]['AREA NAME']

0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3                        COHOES CITY SCHOOL DISTRICT
4                        COHOES CITY SCHOOL DISTRICT

The read_csv function takes the path of the .csv file as input. The command after this prints the first five rows of the AREA NAME column.

To write data to a .csv file, the to_csv method can be used:

>>> d = {'c1': pd.Series(['A', 'B', 'C']),
    'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df.to_csv('sample_data.csv')

The DataFrame is written to a .csv file by using the to_csv method. The path and filename of the file to be created should be passed to it.
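
Both to_csv and read_csv accept further options; for instance, index=False drops the row index from the output and sep changes the delimiter. A small sketch (the sample_data.tsv filename is just an illustration, written to a temporary directory here):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'c1': ['A', 'B', 'C'],
                   'c2': [1.0, 2.0, 3.0]})

# index=False omits the row index; sep='\t' writes tab-separated values.
path = os.path.join(tempfile.gettempdir(), 'sample_data.tsv')
df.to_csv(path, sep='\t', index=False)

# Reading it back requires the same separator.
d = pd.read_csv(path, sep='\t')
print(d.shape)  # (3, 2)
```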

XLS

In addition to the pandas package, the xlrd package needs to be installed for pandas to read the data from an Excel file:

>>> d=pd.read_excel('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.xls')

The preceding function is similar to the CSV reading command. To write to an Excel file, the xlwt package needs to be installed:

>>> df.to_excel('sample_data.xls')

JSON

To read the data from a JSON file, Python's standard json package can be used. The following commands help in reading the file:

>>> import json
>>> json_data = open('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.json')
>>> data = json.load(json_data)
>>> json_data.close()

In the preceding commands, the open() function opens the file. The json.load() function parses the file's contents into a Python object, and json_data.close() closes the file.

The pandas library also provides a function to read the JSON file, which can be accessed using pd.read_json().
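
As a sketch of the pandas route, to_json and read_json can round-trip a DataFrame through a JSON string (read_json expects a path or file-like object, hence the StringIO wrapper):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({'c1': ['A', 'B', 'C'],
                   'c2': [1, 2, 3]})

# to_json serializes the DataFrame; read_json parses it back.
restored = pd.read_json(StringIO(df.to_json()))

print(restored['c1'].tolist())  # ['A', 'B', 'C']
```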

Database

To read data from a database, the following function can be used:

>>> pd.read_sql_table(table_name, con)

Given a table name and an SQLAlchemy engine, the preceding command returns the contents of the table as a DataFrame. This function does not support DBAPI connections. The following are the descriptions of the parameters used:

  • table_name: This refers to the name of the SQL table in a database

  • con: This refers to the SQLAlchemy engine

The following command reads an SQL query into a DataFrame:

>>> pd.read_sql_query(sql, con)

The following are the descriptions of the parameters used:

  • sql: This refers to the SQL query that is to be executed

  • con: This refers to the SQLAlchemy engine
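
For a self-contained illustration, read_sql_query does accept a raw DBAPI connection when the database is SQLite, so an in-memory SQLite database can stand in for a real server (the students table here is invented for the example):

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for a real database server.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE students (name TEXT, grade INTEGER)')
con.executemany('INSERT INTO students VALUES (?, ?)',
                [('Alice', 90), ('Bob', 85)])

# read_sql_query executes the SQL and returns the result as a DataFrame.
df = pd.read_sql_query('SELECT * FROM students ORDER BY name', con)
print(df['name'].tolist())  # ['Alice', 'Bob']
con.close()
```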