Mastering Python for Data Science

By: Samir Madhavan

Empowering data analysis with pandas


The pandas library was developed by Wes McKinney when he was working at AQR Capital Management. He wanted a tool that was flexible enough to perform quantitative analysis on financial data. Later, Chang She joined him and helped develop the package further.

The pandas library is an open source Python library, specially designed for data analysis. It is built on top of NumPy, a fairly low-level tool that handles matrices really well, and makes it easy to work with data.

The pandas library brings the richness of R to the world of Python for handling data. It has efficient data structures for processing data, performing fast joins, and reading data from various sources, to name a few capabilities.

The data structure of pandas

The pandas library essentially has three data structures:

  1. Series

  2. DataFrame

  3. Panel

Series

A Series is a one-dimensional array that can hold any type of data, such as integers, floats, strings, and even Python objects. A Series can be created by calling the following:

>>> import numpy as np
>>> import pandas as pd
>>> pd.Series(np.random.randn(5))

0    0.733810
1   -1.274658
2   -1.602298
3    0.460944
4   -0.632756
dtype: float64

The np.random.randn function is part of the NumPy package and generates random numbers. The pd.Series function creates a pandas Series whose first column is the index and whose second column contains the random values. At the bottom of the output is the datatype of the Series.

The index of the series can be customized by calling the following:

>>> pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

a   -0.929494
b   -0.571423
c   -1.197866
d    0.081107
e   -0.035091
dtype: float64

A series can be derived from a Python dict too:

>>> d = {'A': 10, 'B': 20, 'C': 30}
>>> pd.Series(d)

A    10
B    20
C    30
dtype: int64
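
Continuing the dict example, elements of a Series can be accessed by index label or by integer position, and arithmetic is vectorized. A quick sketch:

```python
import pandas as pd

s = pd.Series({'A': 10, 'B': 20, 'C': 30})

# Access by index label, or by integer position via iloc.
print(s['B'])     # 20
print(s.iloc[0])  # 10

# Arithmetic is applied element-wise, as with NumPy arrays.
print((s * 2).tolist())  # [20, 40, 60]
```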

DataFrame

DataFrame is a 2D data structure with columns that can be of different datatypes. It can be seen as a table. A DataFrame can be formed from the following data structures:

  • A 1D NumPy array

  • Lists

  • Dicts

  • Series

  • A 2D NumPy array

A DataFrame can be created from a dict of series by calling the following commands:

>>> d = {'c1': pd.Series(['A', 'B', 'C']),
        'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df

   c1  c2
0    A   1
1    B   2
2    C   3
3  NaN   4

A DataFrame can also be created from a dict of lists:

>>> d = {'c1': ['A', 'B', 'C', 'D'],
    'c2': [1, 2.0, 3.0, 4.0]}
>>> df = pd.DataFrame(d)
>>> print(df)
 c1  c2
0  A   1
1  B   2
2  C   3
3  D   4
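
A brief sketch of how the columns of the resulting DataFrame behave (each column is itself a Series and keeps its own dtype):

```python
import pandas as pd

df = pd.DataFrame({'c1': ['A', 'B', 'C', 'D'],
                   'c2': [1, 2.0, 3.0, 4.0]})

# Selecting a column returns a Series; columns may differ in dtype.
print(df['c1'].tolist())  # ['A', 'B', 'C', 'D']
print(df.dtypes['c2'])    # float64

# iloc selects rows by integer position.
print(df.iloc[0]['c2'])   # 1.0
```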

Panel

A Panel is a data structure that handles 3D data. The following command is an example of panel data:

>>> d = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
    'Item2': pd.DataFrame(np.random.randn(4, 2))}
>>> pd.Panel(d)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

The preceding output shows that there are two DataFrames, represented by the two items. There are four rows, represented by the major axis, and three columns, represented by the minor axis.
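
On pandas versions where Panel is unavailable (it was removed in pandas 0.25), the same three-dimensional data is usually represented as a DataFrame with a hierarchical (MultiIndex) row index. A minimal sketch, reusing the dict of DataFrames from above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the result is reproducible
d = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
     'Item2': pd.DataFrame(np.random.randn(4, 2))}

# pd.concat with a dict stacks the DataFrames under a two-level row
# index; the outer level plays the role of the Panel's items axis.
stacked = pd.concat(d)

# 2 items x 4 rows = 8 rows; the columns are the union (0, 1, 2), so
# Item2's missing third column is filled with NaN.
print(stacked.shape)  # (8, 3)
```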

Inserting and exporting data

Data is stored in various forms, such as CSV, TSV, databases, and so on. The pandas library makes it convenient to read data from these formats and to export to them. We'll use a dataset that contains weight statistics of school students from the U.S.

We'll be using a file with the following structure:

Column                     Description
LOCATION CODE              Unique location code
COUNTY                     The county the school belongs to
AREA NAME                  The district the school belongs to
REGION                     The region the school belongs to
SCHOOL YEARS               The school year the data is addressing
NO. OVERWEIGHT             The number of overweight students
PCT OVERWEIGHT             The percentage of overweight students
NO. OBESE                  The number of obese students
PCT OBESE                  The percentage of obese students
NO. OVERWEIGHT OR OBESE    The number of students who are overweight or obese
PCT OVERWEIGHT OR OBESE    The percentage of students who are overweight or obese
GRADE LEVEL                Whether they belong to elementary or high school
AREA TYPE                  The type of area
STREET ADDRESS             The address of the school
CITY                       The city the school belongs to
STATE                      The state the school belongs to
ZIP CODE                   The zip code of the school
Location 1                 The address with longitude and latitude

CSV

To read data from a .csv file, the following read_csv function can be used:

>>> d = pd.read_csv('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.csv')
>>> d[0:5]['AREA NAME']

0    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
1    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
2    RAVENA COEYMANS SELKIRK CENTRAL SCHOOL DISTRICT
3                        COHOES CITY SCHOOL DISTRICT
4                        COHOES CITY SCHOOL DISTRICT

The read_csv function takes the path of the .csv file as input. The command after this prints the first five rows of the AREA NAME column.

To write data to a .csv file, the to_csv method can be used:

>>> d = {'c1': pd.Series(['A', 'B', 'C']),
    'c2': pd.Series([1, 2., 3., 4.])}
>>> df = pd.DataFrame(d)
>>> df.to_csv('sample_data.csv')

The DataFrame is written to a .csv file by using the to_csv method. The path and filename of the file to be created should be passed to it.
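
Both to_csv and read_csv accept further options; for instance, index=False drops the row index from the output and sep changes the delimiter. A small sketch (the sample_data.tsv filename is just an illustration, written to a temporary directory here):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'c1': ['A', 'B', 'C'],
                   'c2': [1.0, 2.0, 3.0]})

# index=False omits the row index; sep='\t' writes tab-separated values.
path = os.path.join(tempfile.gettempdir(), 'sample_data.tsv')
df.to_csv(path, sep='\t', index=False)

# Reading it back requires the same separator.
d = pd.read_csv(path, sep='\t')
print(d.shape)  # (3, 2)
```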

XLS

In addition to the pandas package, the xlrd package needs to be installed for pandas to read the data from an Excel file:

>>> d=pd.read_excel('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.xls')

The preceding function is similar to the CSV reading command. To write to an Excel file, the xlwt package needs to be installed:

>>> df.to_excel('sample_data.xls')

JSON

To read the data from a JSON file, Python's standard json package can be used. The following commands help in reading the file:

>>> import json
>>> json_data = open('Data/Student_Weight_Status_Category_Reporting_Results__Beginning_2010.json')
>>> data = json.load(json_data)
>>> json_data.close()

In the preceding commands, the open() function opens the file. The json.load() function parses the file's contents into a Python object, and json_data.close() closes the file.

The pandas library also provides a function to read the JSON file, which can be accessed using pd.read_json().
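
As a sketch of the pandas route, to_json and read_json can round-trip a DataFrame through a JSON string (read_json expects a path or file-like object, hence the StringIO wrapper):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({'c1': ['A', 'B', 'C'],
                   'c2': [1, 2, 3]})

# to_json serializes the DataFrame; read_json parses it back.
restored = pd.read_json(StringIO(df.to_json()))

print(restored['c1'].tolist())  # ['A', 'B', 'C']
```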

Database

To read data from a database, the following function can be used:

>>> pd.read_sql_table(table_name, con)

Given a table name and an SQLAlchemy engine, the preceding command returns the contents of the table as a DataFrame. This function does not support DBAPI connections. The following are the descriptions of the parameters used:

  • table_name: This refers to the name of the SQL table in a database

  • con: This refers to the SQLAlchemy engine

The following command reads an SQL query into a DataFrame:

>>> pd.read_sql_query(sql, con)

The following are the descriptions of the parameters used:

  • sql: This refers to the SQL query that is to be executed

  • con: This refers to the SQLAlchemy engine
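
For a self-contained illustration, read_sql_query does accept a raw DBAPI connection when the database is SQLite, so an in-memory SQLite database can stand in for a real server (the students table here is invented for the example):

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for a real database server.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE students (name TEXT, grade INTEGER)')
con.executemany('INSERT INTO students VALUES (?, ?)',
                [('Alice', 90), ('Bob', 85)])

# read_sql_query executes the SQL and returns the result as a DataFrame.
df = pd.read_sql_query('SELECT * FROM students ORDER BY name', con)
print(df['name'].tolist())  # ['Alice', 'Bob']
con.close()
```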