Book Image

Python: End-to-end Data Analysis

By : Ivan Idris, Luiz Felipe Martins, Martin Czygan, Phuong Vo.T.H, Magnus Vilhelm Persson
Book Image

Python: End-to-end Data Analysis

By: Ivan Idris, Luiz Felipe Martins, Martin Czygan, Phuong Vo.T.H, Magnus Vilhelm Persson

Overview of this book

Data analysis is the process of applying logical and analytical reasoning to study each component of data present in the system. Python is a multi-domain, high-level, programming language that offers a range of tools and libraries suitable for all purposes, it has slowly evolved as one of the primary languages for data science. Have you ever imagined becoming an expert at effectively approaching data analysis problems, solving them, and extracting all of the available information from your data? If yes, look no further, this is the course you need! In this course, we will get you started with Python data analysis by introducing the basics of data analysis and supported Python libraries such as matplotlib, NumPy, and pandas. Create visualizations by choosing color maps, different shapes, sizes, and palettes then delve into statistical data analysis using distribution algorithms and correlations. You’ll then find your way around different data and numerical problems, get to grips with Spark and HDFS, and set up migration scripts for web mining. You’ll be able to quickly and accurately perform hands-on sorting, reduction, and subsequent analysis, and fully appreciate how data analysis methods can support business decision-making. Finally, you will delve into advanced techniques such as performing regression, quantifying cause and effect using Bayesian methods, and discovering how to use Python’s tools for supervised machine learning. The course provides you with highly practical content explaining data analysis with Python, from the following Packt books: 1. Getting Started with Python Data Analysis. 2. Python Data Analysis Cookbook. 3. Mastering Python Data Analysis. By the end of this course, you will have all the knowledge you need to analyze your data with varying complexity levels, and turn it into actionable insights.
Table of Contents (6 chapters)

Data analysis starts with data. It is therefore beneficial to work with data storage systems that are simple to set up, operate and where the data access does not become a problem in itself. In short, we would like to have database systems that are easy to embed into our data analysis processes and workflows. In this book, we focus mostly on the Python side of the database interaction, and we will learn how to get data into and out of Pandas data structures.

There are numerous ways to store data. In this chapter, we are going to learn to interact with three main categories: text formats, binary formats and databases. We will focus on two storage solutions, MongoDB and Redis. MongoDB is a document-oriented database, which is easy to start with, since we can store JSON documents and do not need to define a schema upfront. Redis is a popular in-memory data structure store on top of which many applications can be built. It is possible to use Redis as a fast key-value store, but Redis supports lists, sets, hashes, bit arrays and even advanced data structures such as HyperLogLog out of the box as well.

Text is a great medium and it's a simple way to exchange information. The following statement is taken from a quote attributed to Doug McIlroy: Write programs to handle text streams, because that is the universal interface.

In this section we will start reading and writing data from and to text files.

Normally, the raw data logs of a system are stored in multiple text files, which can accumulate a large amount of information over time. Thankfully, it is simple to interact with these kinds of files in Python.

Pandas supports a number of functions for reading data from a text file into a DataFrame object. The most simple one is the read_csv() function. Let's start with a small example file:

In the above example file, each column is separated by comma and the first row is a header row, containing column names. To read the data file into the DataFrame object, we type the following command:

We see that the read_csv function uses a comma as the default delimiter between columns in the text file and the first row is automatically used as a header for the columns. If we want to change this setting, we can use the sep parameter to change the separated symbol and set header=None in case the example file does not have a caption row.

See the below example:

We can also set a specific row as the caption row by using the header that's equal to the index of the selected row. Similarly, when we want to use any column in the data file as the column index of DataFrame, we set index_col to the name or index of the column. We again use the second data file example_data/ex_06-02.txt to illustrate this:

Apart from those parameters, we still have a lot of useful ones that can help us load data files into Pandas objects more effectively. The following table shows some common parameters:

Parameter

Value

Description

dtype

Type name or dictionary of type of columns

Sets the data type for data or columns. By default it will try to infer the most appropriate data type.

skiprows

List-like or integer

The number of lines to skip (starting from 0).

na_values

List-like or dict, default None

Values to recognize as NA/NaN. If a dict is passed, this can be set on a per-column basis.

true_values

List

A list of values to be converted to Boolean True as well.

false_values

List

A list of values to be converted to Boolean False as well.

keep_default_na

Bool, default True

If the na_values parameter is present and keep_default_na is False, the default NaN values are ignored, otherwise they are appended to

thousands

Str, default None

The thousands separator

nrows

Int, default None

Limits the number of rows to read from the file.

error_bad_lines

Boolean, default True

If set to True, a DataFrame is returned, even if an error occurred during parsing.

Besides the read_csv() function, we also have some other parsing functions in Pandas:

Function

Description

read_table

Read the general delimited file into DataFrame

read_fwf

Read a table of fixed-width formatted lines into DataFrame

read_clipboard

Read text from the clipboard and pass to read_table. It is useful for converting tables from web pages

In some situations, we cannot automatically parse data files from the disk using these functions. In that case, we can also open files and iterate through the reader, supported by the CSV module in the standard library:

We can read and write binary serialization of Python objects with the pickle module, which can be found in the standard library. Object serialization can be useful, if you work with objects that take a long time to create, like some machine learning models. By pickling such objects, subsequent access to this model can be made faster. It also allows you to distribute Python objects in a standardized way.

Pandas includes support for pickling out of the box. The relevant methods are the read_pickle() and to_pickle() functions to read and write data from and to files easily. Those methods will write data to disk in the pickle format, which is a convenient short-term storage format:

HDF5 is not a database, but a data model and file format. It is suited for write-one, read-many datasets. An HDF5 file includes two kinds of objects: data sets, which are array-like collections of data, and groups, which are folder-like containers what hold data sets and other groups. There are some interfaces for interacting with HDF5 format in Python, such as h5py which uses familiar NumPy and Python constructs, such as dictionaries and NumPy array syntax. With h5py, we have high-level interface to the HDF5 API which helps us to get started. However, in this book, we will introduce another library for this kind of format called PyTables, which works well with Pandas objects:

We created an empty HDF5 file, named hdf5_store.h5. Now, we can write data to the file just like adding key-value pairs to a dict:

Objects stored in the HDF5 file can be retrieved by specifying the object keys:

Once we have finished interacting with the HDF5 file, we close it to release the file handle:

There are other supported functions that are useful for working with the HDF5 format. You should explore ,in more detail, two libraries – pytables and h5py – if you need to work with huge quantities of data.

HDF5

HDF5 is

not a database, but a data model and file format. It is suited for write-one, read-many datasets. An HDF5 file includes two kinds of objects: data sets, which are array-like collections of data, and groups, which are folder-like containers what hold data sets and other groups. There are some interfaces for interacting with HDF5 format in Python, such as h5py which uses familiar NumPy and Python constructs, such as dictionaries and NumPy array syntax. With h5py, we have high-level interface to the HDF5 API which helps us to get started. However, in this book, we will introduce another library for this kind of format called PyTables, which works well with Pandas objects:

We created an empty HDF5 file, named hdf5_store.h5. Now, we can write data to the file just like adding key-value pairs to a dict:

Objects stored in the HDF5 file can be retrieved by specifying the object keys:

Once we have finished interacting with the HDF5 file, we close it to release the file handle:

There are other supported functions that are useful for working with the HDF5 format. You should explore ,in more detail, two libraries – pytables and h5py – if you need to work with huge quantities of data.

Many applications require more robust storage systems then text files, which is why many applications use databases to store data. There are many kinds of databases, but there are two broad categories: relational databases, which support a standard declarative language called SQL, and so called NoSQL databases, which are often able to work without a predefined schema and where a data instance is more properly described as a document, rather as a row.

MongoDB is a kind of NoSQL database that stores data as documents, which are grouped together in collections. Documents are expressed as JSON objects. It is fast and scalable in storing, and also flexible in querying, data. To use MongoDB in Python, we need to import the pymongo package and open a connection to the database by passing a hostname and port. We suppose that we have a MongoDB instance, running on the default host (localhost) and port (27017):

If we do not put any parameters into the pymongo.MongoClient() function, it will automatically use the default host and port.

In the next step, we will interact with databases inside the MongoDB instance. We can list all databases that are available in the instance:

The above snippet says that our MongoDB instance only has one database, named 'local'. If the databases and collections we point to do not exist, MongoDB will create them as necessary:

Each database contains groups of documents, called collections. We can understand them as tables in a relational database. To list all existing collections in a database, we use collection_names() function:

Our db database does not have any collections yet. Let's create a collection, named person, and insert data from a DataFrame object to it:

The df_ex2 is transposed and converted to a JSON string before loading into a dictionary. The insert() function receives our created dictionary from df_ex2 and saves it to the collection.

If we want to list all data inside the collection, we can execute the following commands:

If we want to query data from the created collection with some conditions, we can use the find() function and pass in a dictionary describing the documents we want to retrieve. The returned result is a cursor type, which supports the iterator protocol:

Sometimes, we want to delete data in MongdoDB. All we need to do is to pass a query to the remove() method on the collection:

We learned step by step how to insert, query and delete data in a collection. Now, we will show how to update existing data in a collection in MongoDB:

The following table shows methods that provide shortcuts to manipulate documents in MongoDB:

Update Method

Description

inc()

Increment a numeric field

set()

Set certain fields to new values

unset()

Remove a field from the document

push()

Append a value onto an array in the document

pushAll()

Append several values onto an array in the document

addToSet()

Add a value to an array, only if it does not exist

pop()

Remove the last value of an array

pull()

Remove all occurrences of a value from an array

pullAll()

Remove all occurrences of any set of values from an array

rename()

Rename a field

bit()

Update a value by bitwise operation

Redis is an advanced kind of key-value store where the values can be of different types: string, list, set, sorted set or hash. Redis stores data in memory like memcached but it can be persisted on disk, unlike memcached, which has no such option. Redis supports fast reads and writes, in the order of 100,000 set or get operations per second.

To interact with Redis, we need to install the Redis-py module to Python, which is available on pypi and can be installed with pip:

Now, we can connect to Redis via the host and port of the DB server. We assume that we have already installed a Redis server, which is running with the default host (localhost) and port (6379) parameters:

As a first step to storing data in Redis, we need to define which kind of data structure is suitable for our requirements. In this section, we will introduce four commonly used data structures in Redis: simple value, list, set and ordered set. Though data is stored into Redis in many different data structures, each value must be associated with a key.

The ordered set data structure takes an extra attribute when we add data to a set called score. An ordered set will use the score to determine the order of the elements in the set:

By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end score sorted in ascending order by default. If we want to change the way method of sorting, we can set the desc parameter to True. The withscore parameter is used in case we want to get the scores along with the return values. The return type is a list of (value, score) pairs as you can see in the above example.

See the below table for more functions available on ordered sets:

Function

Description

zcard(name)

Return the number of elements in the sorted set with key name

zincrby(name, value, amount=1)

Increment the score of value in the sorted set with key name by amount

zrangebyscore(name, min, max, withscores=False, start=None, num=None)

Return a range of values from the sorted set with key name with a score between min and max.

If withscores is true, return the scores along with the values.

If start and num are given, return a slice of the range

zrank(name, value)

Return a 0-based value indicating the rank of value in the sorted set with key name

zrem(name, values)

Remove member value(s) from the sorted set with key name

The simple value

This is the

The ordered set data structure takes an extra attribute when we add data to a set called score. An ordered set will use the score to determine the order of the elements in the set:

By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end score sorted in ascending order by default. If we want to change the way method of sorting, we can set the desc parameter to True. The withscore parameter is used in case we want to get the scores along with the return values. The return type is a list of (value, score) pairs as you can see in the above example.

See the below table for more functions available on ordered sets:

Function

Description

zcard(name)

Return the number of elements in the sorted set with key name

zincrby(name, value, amount=1)

Increment the score of value in the sorted set with key name by amount

zrangebyscore(name, min, max, withscores=False, start=None, num=None)

Return a range of values from the sorted set with key name with a score between min and max.

If withscores is true, return the scores along with the values.

If start and num are given, return a slice of the range

zrank(name, value)

Return a 0-based value indicating the rank of value in the sorted set with key name

zrem(name, values)

Remove member value(s) from the sorted set with key name

List

We have a few

The ordered set data structure takes an extra attribute when we add data to a set called score. An ordered set will use the score to determine the order of the elements in the set:

By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end score sorted in ascending order by default. If we want to change the way method of sorting, we can set the desc parameter to True. The withscore parameter is used in case we want to get the scores along with the return values. The return type is a list of (value, score) pairs as you can see in the above example.

See the below table for more functions available on ordered sets:

Function

Description

zcard(name)

Return the number of elements in the sorted set with key name

zincrby(name, value, amount=1)

Increment the score of value in the sorted set with key name by amount

zrangebyscore(name, min, max, withscores=False, start=None, num=None)

Return a range of values from the sorted set with key name with a score between min and max.

If withscores is true, return the scores along with the values.

If start and num are given, return a slice of the range

zrank(name, value)

Return a 0-based value indicating the rank of value in the sorted set with key name

zrem(name, values)

Remove member value(s) from the sorted set with key name

Set

This

The ordered set data structure takes an extra attribute when we add data to a set called score. An ordered set will use the score to determine the order of the elements in the set:

By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end score sorted in ascending order by default. If we want to change the way method of sorting, we can set the desc parameter to True. The withscore parameter is used in case we want to get the scores along with the return values. The return type is a list of (value, score) pairs as you can see in the above example.

See the below table for more functions available on ordered sets:

Function

Description

zcard(name)

Return the number of elements in the sorted set with key name

zincrby(name, value, amount=1)

Increment the score of value in the sorted set with key name by amount

zrangebyscore(name, min, max, withscores=False, start=None, num=None)

Return a range of values from the sorted set with key name with a score between min and max.

If withscores is true, return the scores along with the values.

If start and num are given, return a slice of the range

zrank(name, value)

Return a 0-based value indicating the rank of value in the sorted set with key name

zrem(name, values)

Remove member value(s) from the sorted set with key name

Ordered set

The ordered set

data structure takes an extra attribute when we add data to a set called score. An ordered set will use the score to determine the order of the elements in the set:

By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end score sorted in ascending order by default. If we want to change the way method of sorting, we can set the desc parameter to True. The withscore parameter is used in case we want to get the scores along with the return values. The return type is a list of (value, score) pairs as you can see in the above example.

See the below table for more functions available on ordered sets:

Function

Description

zcard(name)

Return the number of elements in the sorted set with key name

zincrby(name, value, amount=1)

Increment the score of value in the sorted set with key name by amount

zrangebyscore(name, min, max, withscores=False, start=None, num=None)

Return a range of values from the sorted set with key name with a score between min and max.

If withscores is true, return the scores along with the values.

If start and num are given, return a slice of the range

zrank(name, value)

Return a 0-based value indicating the rank of value in the sorted set with key name

zrem(name, values)

Remove member value(s) from the sorted set with key name

We finished covering the basics of interacting with data in different commonly used storage mechanisms from the simple ones, such as text files, over more structured ones, such as HDF5, to more sophisticated data storage systems, such as MongoDB and Redis. The most suitable type of storage will depend on your use case. The choice of the data storage layer technology plays an important role in the overall design of data processing systems. Sometimes, we need to combine various database systems to store our data, such as complexity of the data, performance of the system or computation requirements.

Practice exercises