Practical Data Analysis Cookbook

By: Tomasz Drabas
Overview of this book

Data analysis is the process of systematically applying statistical and logical techniques to describe and illustrate, condense and recap, and evaluate data. Its importance has been most visible in the information and communication technologies sector, but it is an asset in almost all sectors of the economy. This book provides a rich set of independent recipes that dive into the world of data analytics and modeling using a variety of approaches, tools, and algorithms. You will learn the basics of data handling and modeling, and will build your skills gradually toward more advanced topics such as simulations, raw text processing, social interactions analysis, and more.

First, you will learn some easy-to-follow practical techniques to read, write, clean, reformat, explore, and understand your data, arguably the most time-consuming (and the most important) tasks for any data scientist. In the second section, independent recipes delve into intermediate topics such as classification, clustering, and prediction. With the help of these easy-to-follow recipes, you will also learn techniques that can easily be expanded to solve other real-life problems such as building recommendation engines or predictive models. In the third section, you will explore more advanced topics, ranging from graph theory through natural language processing and discrete choice modeling to simulations. You will also learn to identify the origins of fraud with the help of a graph, scrape websites, and classify movies based on their reviews. By the end of this book, you will be able to efficiently use the vast array of tools that the Python environment has to offer.

Storing and retrieving from a relational database


The relational database model was invented at IBM in 1970. Since then, it has reigned over the field of data storage and analytics for decades. The model is still widely used, although it has been losing ground to increasingly popular solutions such as Hadoop. Nevertheless, the demise of relational databases is nowhere in sight, as they remain the tool of choice for many applications.

Getting ready

In order to execute the following recipe, you need the pandas and SQLAlchemy modules installed. SQLAlchemy is a module that abstracts database interactions between Python scripts and a number of relational databases; effectively, you do not have to remember the specifics of each database's syntax, as SQLAlchemy handles that for you.

If you are using the Anaconda distribution, issue the following command:

conda install sqlalchemy

Refer to your Python distribution's documentation to check how to install new modules. Alternatively, check the previous recipe for instructions on how to install modules manually.

In addition, you might need to install the psycopg2 module. This can be accomplished with the following command:

conda install psycopg2

If, however, you are not using Anaconda, psycopg2 can be found at http://initd.org/psycopg/. Follow the installation instructions found on the website.

To execute the script in this recipe, you will also need a PostgreSQL database running locally. Go to http://www.postgresql.org/download/ and follow the installation instructions for your operating system. We assume that your PostgreSQL database is installed and running before you proceed. We also assume that your database can be accessed at localhost:5432. On a Unix-like system, you can check the port used by the PostgreSQL database by issuing the following command in the terminal:

cat /etc/services | grep postgre
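
On a typical Linux system, this produces output similar to the following (the exact spacing and comment text vary by system):

postgresql      5432/tcp        postgres        # PostgreSQL Database
postgresql      5432/udp        postgres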

No other prerequisites are required.

How to do it…

pandas works hand in hand with SQLAlchemy to make storing data in, and retrieving it from, many databases very easy. Reading the data and storing it in the database can be achieved with the following script (the store_postgresql.py file):

import pandas as pd
import sqlalchemy as sa

# name of the CSV file to read from
r_filenameCSV = '../../Data/Chapter01/realEstate_trans.csv'

# database credentials
usr  = 'drabast'
pswd = 'pAck7!B0ok'

# create the connection to the database
engine = sa.create_engine(
    'postgresql://{0}:{1}@localhost:5432/{0}' \
    .format(usr, pswd)
)

# read the data
csv_read = pd.read_csv(r_filenameCSV)

# transform sale_date to a datetime object
csv_read['sale_date'] = pd.to_datetime(csv_read['sale_date'])

# store the data in the database
csv_read.to_sql('real_estate', engine, if_exists='replace')

How it works…

First, we import all the necessary modules: pandas and SQLAlchemy. We also store the name of the CSV file we will be reading the data from in a variable, and we specify the credentials used to connect to our PostgreSQL database. Then, using SQLAlchemy's create_engine(...) method, we create an object that allows us to access the PostgreSQL database. The syntax of the connection string can be broken down as follows:

sa.create_engine('postgresql://<user_name>:<password>@<server>:<port>/<database>')

Here, <user_name> is the username allowed to log in to <database> using <password>. The user needs (at a minimum) CREATE, DROP, INSERT, and SELECT privileges for the specified database. The <server> tag can be expressed as an IP address of the server running the PostgreSQL database or (as in our case) the name of the server (localhost). The <port> specifies the server port the database listens on.
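
If you need to set up such a user yourself, a minimal sketch in psql might look like the following; the user name, password, and database name simply mirror the example script above and are not prescribed by the recipe:

-- run in psql as a superuser; names and password are examples only
CREATE USER drabast WITH PASSWORD 'pAck7!B0ok';
-- making the user the owner grants all the required privileges
CREATE DATABASE drabast OWNER drabast;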

Next, we read in the data from the CSV file and convert the sale_date column to a datetime object so that the data is stored in a date format in the database. The read_csv(...) method normally tries to infer the proper type for the data read from a file, but this can get really tricky with dates, so here we convert the sale_date column explicitly with to_datetime(...).
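
If the automatic inference gets the dates wrong, you can pass an explicit format string to to_datetime(...); the format below is only an assumption about how the dates might look in the file, so adjust it to your data:

# assuming dates such as 2014-05-21; adjust the format to your file
csv_read['sale_date'] = pd.to_datetime(
    csv_read['sale_date'], format='%Y-%m-%d'
)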

The last line of the script stores the information in the database. The to_sql(...) method specifies the name of the table ('real_estate') and connector (engine) to be used. The last parameter passed instructs the method to replace the table if it already exists in the database.
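
Other values of if_exists are also accepted; for instance, the following sketch appends rows to an existing table and skips storing the DataFrame's index:

# 'append' adds rows to an existing table instead of replacing it;
# index=False skips writing the DataFrame's index as a column
csv_read.to_sql('real_estate', engine,
    if_exists='append', index=False)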

You can check whether the data has loaded properly using the following command:

query = 'SELECT * FROM real_estate LIMIT 10'
top10 = pd.read_sql_query(query, engine)
print(top10)

We first specify a valid SQL query and then use the read_sql_query(...) method to execute it.
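
If you want to pull back a whole table without writing any SQL, pandas also offers the read_sql_table(...) method:

# read the entire real_estate table back into a DataFrame
real_estate = pd.read_sql_table('real_estate', engine)
print(real_estate.head())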

There's more…

The most widely deployed database in the world is SQLite. SQLite databases can be found in phones, TV sets, and cars, among other places. SQLite is very lightweight and requires no server to run. It can either store the data on disk or keep it temporarily in your computer's memory; the latter can be used when speed is required, but the data disappears as soon as your script finishes.

With SQLAlchemy, it is also extremely easy to talk to the SQLite database. The only change required in the preceding example is how we construct the engine (the store_SQLite.py file):

# name of the SQLite database
rw_filenameSQLite = '../../Data/Chapter01/realEstate_trans.db'

# create the connection to the database
engine = sa.create_engine(
    'sqlite:///{0}'.format(rw_filenameSQLite)
)

As you can see, since SQLite databases are serverless, the only required parameter is where to store the database file itself.

Tip

If, instead of storing the database in a file, you would like to keep your data in the computer's memory, use sqlite:// as the connection string.
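
A minimal sketch of an in-memory engine, reusing the sa import and the csv_read DataFrame from the earlier script:

# an in-memory SQLite database; the data vanishes when the script ends
engine_mem = sa.create_engine('sqlite://')
csv_read.to_sql('real_estate', engine_mem, if_exists='replace')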

Note

Note the three slashes in the path: the first two follow the sqlite: scheme, as in any connection string, and the third separates the (empty) server portion from the database path. At the most general level, the connection string follows this pattern:

<database_type>://<server_information>/<database>

As SQLite databases do not require a server, <server_information> is empty, hence the three slashes in a row.
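
For comparison, here is how the same pattern looks for a few common databases; the credentials, hosts, and database names are placeholders only:

postgresql://user:password@localhost:5432/mydb
mysql+pymysql://user:password@localhost:3306/mydb
sqlite:///relative/path/to/mydb.db
sqlite:////absolute/path/to/mydb.db
sqlite://                  # in-memory database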

See also

I highly recommend checking out the documentation for SQLAlchemy as it is a very powerful middleman between your code and various databases; the documentation can be found at http://docs.sqlalchemy.org/en/rel_1_0/index.html.