Loading data from files and the Web


The data used in analyses is typically provided by other systems via files that are created and updated at various intervals, dynamically through access over the Web, or from various types of databases. The pandas library provides powerful facilities for easily retrieving data from a variety of such sources and converting it into pandas objects. Here, we will briefly demonstrate this ease of use by loading data from files and from a financial web service.

Loading CSV data from files

The pandas library provides built-in support for loading data in .csv format, a common means of storing structured data in text files. Provided with the code from this book is a file data/test1.csv in the CSV format, which represents some time series information. The specific content isn't important right now, as we just want to demonstrate the ease of loading data into a DataFrame.

The following statement in IPython uses the operating system to display the contents of this file (the exact command to use depends on your operating system).

In [29]:
   # display the contents of test1.csv
   # which command to use depends on your OS
   !cat data/test1.csv # on non-windows systems
   #!type data\test1.csv # on windows systems

   date,0,1,2
   2000-01-01 00:00:00,1.10376250134,-1.90997889703,-0.808955536115
   2000-01-02 00:00:00,1.18891664768,0.581119740849,0.86159734949
   2000-01-03 00:00:00,-0.964200042412,0.779764393246,1.82906224532
   2000-01-04 00:00:00,0.782130444001,-1.72066965573,-1.10824167327
   2000-01-05 00:00:00,-1.86701699823,-0.528368292754,-2.48830894087
   2000-01-06 00:00:00,2.56928022646,-0.471901478927,-0.835033249865
   2000-01-07 00:00:00,-0.39932258251,-0.676426550985,-0.0112559158931
   2000-01-08 00:00:00,1.64299299394,1.01341997845,1.43566709724
   2000-01-09 00:00:00,1.14730764657,2.13799951538,0.554171306191
   2000-01-10 00:00:00,0.933765825769,1.38715526486,-0.560142729978

This information can be easily imported into a DataFrame using the pd.read_csv() function.

In [30]:
   # read the contents of the file into a DataFrame
   df = pd.read_csv('data/test1.csv')
   df

Out [30]:
                     date         0         1         2
   0  2000-01-01 00:00:00  1.103763 -1.909979 -0.808956
   1  2000-01-02 00:00:00  1.188917  0.581120  0.861597
   2  2000-01-03 00:00:00 -0.964200  0.779764  1.829062
   3  2000-01-04 00:00:00  0.782130 -1.720670 -1.108242
   4  2000-01-05 00:00:00 -1.867017 -0.528368 -2.488309
   5  2000-01-06 00:00:00  2.569280 -0.471901 -0.835033
   6  2000-01-07 00:00:00 -0.399323 -0.676427 -0.011256
   7  2000-01-08 00:00:00  1.642993  1.013420  1.435667
   8  2000-01-09 00:00:00  1.147308  2.138000  0.554171
   9  2000-01-10 00:00:00  0.933766  1.387155 -0.560143

pandas has no idea that the first column is a date and has treated the contents of the date field as a string. This can be verified using the following Python statements:

In [31]:
   # the contents of the date column
   df.date

Out [31]:
   0    2000-01-01 00:00:00
   1    2000-01-02 00:00:00
   2    2000-01-03 00:00:00
   3    2000-01-04 00:00:00
   4    2000-01-05 00:00:00
   5    2000-01-06 00:00:00
   6    2000-01-07 00:00:00
   7    2000-01-08 00:00:00
   8    2000-01-09 00:00:00
   9    2000-01-10 00:00:00
   Name: date, dtype: object

In [32]:
   # we can get the first value in the date column
   df.date[0]

Out [32]:
   '2000-01-01 00:00:00'

In [33]:
   # it is a string
   type(df.date[0])

Out [33]:
   str

To guide pandas on how to convert this data directly into Python/pandas date objects, we can use the parse_dates parameter of the pd.read_csv() function. The following code tells pandas to convert the contents of the 'date' column into actual Timestamp objects.

In [34]:
   # read the data and tell pandas the date column should be 
   # a date in the resulting DataFrame
   df = pd.read_csv('data/test1.csv', parse_dates=['date'])
   df

Out [34]:
           date         0         1         2
   0 2000-01-01  1.103763 -1.909979 -0.808956
   1 2000-01-02  1.188917  0.581120  0.861597
   2 2000-01-03 -0.964200  0.779764  1.829062
   3 2000-01-04  0.782130 -1.720670 -1.108242
   4 2000-01-05 -1.867017 -0.528368 -2.488309
   5 2000-01-06  2.569280 -0.471901 -0.835033
   6 2000-01-07 -0.399323 -0.676427 -0.011256
   7 2000-01-08  1.642993  1.013420  1.435667
   8 2000-01-09  1.147308  2.138000  0.554171
   9 2000-01-10  0.933766  1.387155 -0.560143

On checking whether it worked, we see it is indeed a Timestamp object now.

In [35]:
   # verify the type now is date
   # in pandas, this is actually a Timestamp
   type(df.date[0])

Out [35]:
   pandas.tslib.Timestamp

Unfortunately, this has not used the date field as the index for the DataFrame; instead, it uses the default zero-based integer index labels.

In [36]:
   # unfortunately the index is numeric, which makes
   # accessing data by date more complicated
   df.index

Out [36]:
   Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

This can be rectified using the index_col parameter of the pd.read_csv() function to specify which column in the file should be used as the index.

In [37]:
   # read in again, now specify the date column as being the
   # index of the resulting DataFrame
   df = pd.read_csv('data/test1.csv', 
                    parse_dates=['date'], 
                    index_col='date')
   df

Out [37]:
                      0         1         2
   date                                    
   2000-01-01  1.103763 -1.909979 -0.808956
   2000-01-02  1.188917  0.581120  0.861597
   2000-01-03 -0.964200  0.779764  1.829062
   2000-01-04  0.782130 -1.720670 -1.108242
   2000-01-05 -1.867017 -0.528368 -2.488309
   2000-01-06  2.569280 -0.471901 -0.835033
   2000-01-07 -0.399323 -0.676427 -0.011256
   2000-01-08  1.642993  1.013420  1.435667
   2000-01-09  1.147308  2.138000  0.554171
   2000-01-10  0.933766  1.387155 -0.560143

In [38]:
   df.index

Out [38]:
   <class 'pandas.tseries.index.DatetimeIndex'>
   [2000-01-01, ..., 2000-01-10]
   Length: 10, Freq: None, Timezone: None
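
With a DatetimeIndex in place, rows of the DataFrame can now be selected and sliced directly by date label. The following is a brief sketch (not one of the book's numbered listings), assuming the df loaded above:

   # select the row for a specific date by its label
   df.loc['2000-01-03']
   # slice a range of rows between two dates (inclusive)
   df['2000-01-04':'2000-01-07']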

Loading data from the Web

Data from the Web can also be easily read via pandas. To demonstrate this, we will perform a simple load of actual stock data. The example here uses the pandas.io.data.DataReader class, which is able to read data from various web sources, one of which is stock data from Yahoo! Finance.

The following reads the previous three months of data for GOOG (relative to the current date) and prints the five most recent days of stock data:

In [39]:
   # imports for reading data from Yahoo!
   from pandas.io.data import DataReader
   from datetime import date
   from dateutil.relativedelta import relativedelta

   # read the last three months of data for GOOG
   goog = DataReader("GOOG",  "yahoo", 
                     date.today() + 
                     relativedelta(months=-3))

   # the result is a DataFrame
   # and this gives us the 5 most recent prices
   goog.tail()

Out [39]:
                 Open    High     Low   Close   Volume  Adj Close
   Date                                                          
   2015-02-02  531.73  533.00  518.55  528.48  2826300     528.48
   2015-02-03  528.00  533.40  523.26  529.24  2029200     529.24
   2015-02-04  529.24  532.67  521.27  522.76  1656800     522.76
   2015-02-05  523.79  528.50  522.09  527.58  1840300     527.58
   2015-02-06  527.64  537.20  526.41  531.00  1744600     531.00

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. The code examples in the book are also publicly available on Wakari.io at https://wakari.io/sharing/bundle/LearningPandas/LearningPandas_Index.

This actually performs quite a bit of work on your behalf. It makes the web requests, retrieves the CSV data, and converts it into a DataFrame, with the proper conversion of types for the various series of data.
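
Note that in more recent versions of pandas, this functionality was moved out of pandas.io.data and into the separate pandas-datareader package. The following is a minimal sketch of the equivalent import and call, assuming that package is installed and that the yahoo data source is still accessible:

   # requires the external package: pip install pandas-datareader
   from pandas_datareader import data as pdr
   from datetime import date
   from dateutil.relativedelta import relativedelta

   # the same request as above, made through pandas-datareader
   goog = pdr.DataReader("GOOG", "yahoo",
                         date.today() + relativedelta(months=-3))
   goog.tail()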