Python: Data Analytics and Visualization

By: Martin Czygan, Phuong Vo.T.H, Ashish Kumar, Kirthi Raman

Chapter 6. Interacting with Databases

Data analysis starts with data. It is therefore beneficial to work with data storage systems that are simple to set up and operate, and where data access does not become a problem in itself. In short, we would like database systems that are easy to embed into our data analysis processes and workflows. In this book, we focus mostly on the Python side of database interaction, and we will learn how to get data into and out of Pandas data structures.

There are numerous ways to store data. In this chapter, we are going to learn to interact with three main categories: text formats, binary formats, and databases. Among the databases, we will focus on two storage solutions, MongoDB and Redis. MongoDB is a document-oriented database, which is easy to start with, since we can store JSON documents and do not need to define a schema upfront. Redis is a popular in-memory data structure store on top of which many applications can be built. It can be used as a fast key-value store, but it also supports lists, sets, hashes, bit arrays, and even advanced data structures such as HyperLogLog out of the box.

Interacting with data in text format

Text is a great medium and a simple way to exchange information. The following statement is taken from a quote attributed to Doug McIlroy: Write programs to handle text streams, because that is the universal interface.

In this section we will start reading and writing data from and to text files.

Reading data from text format

Normally, the raw data logs of a system are stored in multiple text files, which can accumulate a large amount of information over time. Thankfully, it is simple to interact with these kinds of files in Python.

Pandas supports a number of functions for reading data from a text file into a DataFrame object. The simplest one is the read_csv() function. Let's start with a small example file:

$ cat example_data/ex_06-01.txt
Name,age,major_id,sex,hometown
Nam,7,1,male,hcm
Mai,11,1,female,hcm
Lan,25,3,female,hn
Hung,42,3,male,tn
Nghia,26,3,male,dn
Vinh,39,3,male,vl
Hong,28,4,female,dn

Tip

cat is a Unix shell command that prints the contents of a file to the screen.

In the example file above, each column is separated by a comma and the first row is a header row containing the column names. To read the data file into a DataFrame object, we type the following command:

>>> df_ex1 = pd.read_csv('example_data/ex_06-01.txt')
>>> df_ex1
    Name  age  major_id     sex hometown
0    Nam    7         1    male      hcm
1    Mai   11         1  female      hcm
2    Lan   25         3  female       hn
3   Hung   42         3    male       tn
4  Nghia   26         3    male       dn
5   Vinh   39         3    male       vl
6   Hong   28         4  female       dn

We see that the read_csv() function uses a comma as the default delimiter between columns and that the first row is automatically used as a header for the columns. If we want to change this behavior, we can use the sep parameter to change the separator, and set header=None in case the example file does not have a header row.

See the following example:

$ cat example_data/ex_06-02.txt
Nam     7       1       male    hcm
Mai     11      1       female  hcm
Lan     25      3       female  hn
Hung    42      3       male    tn
Nghia   26      3       male    dn
Vinh    39      3       male    vl
Hong    28      4       female  dn

>>> df_ex2 = pd.read_csv('example_data/ex_06-02.txt',
                         sep = '\t', header=None)
>>> df_ex2
       0   1  2       3    4
0    Nam   7  1    male  hcm
1    Mai  11  1  female  hcm
2    Lan  25  3  female   hn
3   Hung  42  3    male   tn
4  Nghia  26  3    male   dn
5   Vinh  39  3    male   vl
6   Hong  28  4  female   dn

We can also set a specific row as the header row by passing its index to the header parameter. Similarly, when we want to use any column in the data file as the column index of the DataFrame, we set index_col to the name or index of that column. We again use the second data file, example_data/ex_06-02.txt, to illustrate this:

>>> df_ex3 = pd.read_csv('example_data/ex_06-02.txt',
                         sep = '\t', header=None,
                         index_col=0)
>>> df_ex3
        1  2       3    4
0
Nam     7  1    male  hcm
Mai    11  1  female  hcm
Lan    25  3  female   hn
Hung   42  3    male   tn
Nghia  26  3    male   dn
Vinh   39  3    male   vl
Hong   28  4  female   dn

Apart from those parameters, there are many other useful ones that can help us load data files into Pandas objects more effectively. The following table shows some common parameters:

Parameter

Value

Description

dtype

Type name or dictionary of type of columns

Sets the data type for data or columns. By default it will try to infer the most appropriate data type.

skiprows

List-like or integer

Line numbers to skip (0-indexed), or the number of lines to skip at the start of the file.

na_values

List-like or dict, default None

Values to recognize as NA/NaN. If a dict is passed, this can be set on a per-column basis.

true_values

List

A list of values to be considered as Boolean True.

false_values

List

A list of values to be considered as Boolean False.

keep_default_na

Bool, default True

If the na_values parameter is present and keep_default_na is False, the default NaN values are ignored; otherwise, the values passed in na_values are appended to the default list.

thousands

Str, default None

The thousands separator

nrows

Int, default None

Limits the number of rows to read from the file.

error_bad_lines

Boolean, default True

If set to False, lines with too many fields are dropped instead of raising an exception, and a DataFrame is still returned.
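
As a hedged sketch, some of these parameters can be combined in a single read_csv() call on the first example file. The 'unknown' entry passed to na_values is a hypothetical missing-value marker that does not appear in the file; it is only there to illustrate the parameter:

>>> pd.read_csv('example_data/ex_06-01.txt',
                dtype={'major_id': str},
                na_values=['unknown'],
                skiprows=[1],
                nrows=3)

Here, skiprows=[1] drops the first data row and nrows=3 stops reading after three rows.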

Besides the read_csv() function, we also have some other parsing functions in Pandas:

Function

Description

read_table

Read a general delimited file into a DataFrame

read_fwf

Read a table of fixed-width formatted lines into a DataFrame

read_clipboard

Read text from the clipboard and pass it to read_table. This is useful for converting tables copied from web pages
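
As a quick, hedged illustration, read_table can load the tab-separated file from the earlier example without specifying a separator, because a tab is its default delimiter:

>>> df_ex2b = pd.read_table('example_data/ex_06-02.txt', header=None)
>>> df_ex2b.shape
(7, 5)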

In some situations, we cannot automatically parse data files from disk using these functions. In that case, we can also open the file and iterate through the rows with a reader from the csv module in the standard library:

$ cat example_data/ex_06-03.txt
Nam     7       1       male    hcm
Mai     11      1       female  hcm
Lan     25      3       female  hn
Hung    42      3       male    tn      single
Nghia   26      3       male    dn      single
Vinh    39      3       male    vl
Hong    28      4       female  dn

>>> import csv
>>> f = open('example_data/ex_06-03.txt')
>>> r = csv.reader(f, delimiter='\t')
>>> for line in r:
>>>    print(line)
['Nam', '7', '1', 'male', 'hcm']
['Mai', '11', '1', 'female', 'hcm']
['Lan', '25', '3', 'female', 'hn']
['Hung', '42', '3', 'male', 'tn', 'single']
['Nghia', '26', '3', 'male', 'dn', 'single']
['Vinh', '39', '3', 'male', 'vl']
['Hong', '28', '4', 'female', 'dn']
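
Because the rows in this file have different lengths, one possible follow-up, sketched here under the assumption that missing trailing fields should simply become None, is to pad the shorter rows before handing them to Pandas:

>>> f = open('example_data/ex_06-03.txt')
>>> rows = list(csv.reader(f, delimiter='\t'))
>>> f.close()
>>> width = max(len(row) for row in rows)
>>> padded = [row + [None] * (width - len(row)) for row in rows]
>>> df_ex3b = pd.DataFrame(padded)
>>> df_ex3b.shape
(7, 6)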

Writing data to text format

We saw how to load data from a text file into a Pandas data structure. Now, we will learn how to export data from a program's data objects to a text file. Corresponding to the read_csv() function, Pandas also provides a to_csv() function. Let's look at the example below:

>>> df_ex3.to_csv('example_data/ex_06-02.out', sep = ';')
 

The result will look like this:

$ cat example_data/ex_06-02.out
0;1;2;3;4
Nam;7;1;male;hcm
Mai;11;1;female;hcm
Lan;25;3;female;hn
Hung;42;3;male;tn
Nghia;26;3;male;dn
Vinh;39;3;male;vl
Hong;28;4;female;dn
 

If we want to skip the header line or the index column when writing data out to a file on disk, we can set the header and index parameters to False:

>>> import sys
>>> df_ex3.to_csv(sys.stdout, sep='\t',
                  header=False, index=False)
7       1       male    hcm
11      1       female  hcm
25      3       female  hn
42      3       male    tn
26      3       male    dn
39      3       male    vl
28      4       female  dn

We can also write a subset of the columns of the DataFrame to the file by specifying them in the columns parameter:

>>> df_ex3.to_csv(sys.stdout, columns=[3,1,4],
                  header=False, sep='\t')
Nam     male    7       hcm
Mai     female  11      hcm
Lan     female  25      hn
Hung    male    42      tn
Nghia   male    26      dn
Vinh    male    39      vl
Hong    female  28      dn

With Series objects, we can use the same function to write data into text files, with mostly the same parameters as above.
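
For example, the hometown column of df_ex3 can be written out on its own; this is only a small sketch that reuses the objects created above:

>>> df_ex3[4].to_csv(sys.stdout, sep='\t', header=False)
Nam     hcm
Mai     hcm
Lan     hn
Hung    tn
Nghia   dn
Vinh    vl
Hong    dn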

Interacting with data in binary format

We can read and write binary serializations of Python objects with the pickle module, which can be found in the standard library. Object serialization can be useful if you work with objects that take a long time to create, such as some machine learning models. By pickling such objects, subsequent access to them can be made faster. Pickling also allows you to distribute Python objects in a standardized way.
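
As a minimal sketch of the pickle module itself, assuming an example_data directory to write into and using a small dictionary as a stand-in for an expensive object:

>>> import pickle
>>> model = {'weights': [0.2, 0.8], 'bias': 0.1}
>>> with open('example_data/model.pkl', 'wb') as f:
...     pickle.dump(model, f)
...
>>> with open('example_data/model.pkl', 'rb') as f:
...     restored = pickle.load(f)
...
>>> restored == model
True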

Pandas includes support for pickling out of the box: the read_pickle() and to_pickle() functions read and write data from and to files easily. These functions write data to disk in the pickle format, which is a convenient short-term storage format:

>>> df_ex3.to_pickle('example_data/ex_06-03.out')
>>> pd.read_pickle('example_data/ex_06-03.out')
        1  2       3    4
0
Nam     7  1    male  hcm
Mai    11  1  female  hcm
Lan    25  3  female   hn
Hung   42  3    male   tn
Nghia  26  3    male   dn
Vinh   39  3    male   vl
Hong   28  4  female   dn

HDF5

HDF5 is not a database, but a data model and file format. It is suited for write-once, read-many datasets. An HDF5 file includes two kinds of objects: data sets, which are array-like collections of data, and groups, which are folder-like containers that hold data sets and other groups. There are several interfaces for interacting with the HDF5 format in Python, such as h5py, which uses familiar NumPy and Python constructs, such as dictionaries and NumPy array syntax. With h5py, we have a high-level interface to the HDF5 API that helps us get started. However, in this book, we will introduce another library for this format, called PyTables, which works well with Pandas objects:

>>> store = pd.HDFStore('hdf5_store.h5')
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: hdf5_store.h5
Empty

We created an empty HDF5 file, named hdf5_store.h5. Now, we can write data to the file just like adding key-value pairs to a dict:

>>> store['ex3'] = df_ex3
>>> store['name'] = df_ex2[0]
>>> store['hometown'] = df_ex3[4]
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: hdf5_store.h5
/ex3                  frame        (shape->[7,4])
/hometown             series       (shape->[1])
/name                 series       (shape->[1])

Objects stored in the HDF5 file can be retrieved by specifying the object keys:

>>> store['name']
0      Nam
1      Mai
2      Lan
3     Hung
4    Nghia
5     Vinh
6     Hong
Name: 0, dtype: object

Once we have finished interacting with the HDF5 file, we close it to release the file handle:

>>> store.close()
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: hdf5_store.h5
File is CLOSED

There are other supported functions that are useful for working with the HDF5 format. You should explore the two libraries, PyTables and h5py, in more detail if you need to work with huge quantities of data.
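
For instance, Pandas also offers the to_hdf() and read_hdf() convenience functions, which manage the HDFStore for us. A short sketch, reusing df_ex3 and assuming the store above has already been closed:

>>> df_ex3.to_hdf('hdf5_store.h5', 'ex3_copy')
>>> pd.read_hdf('hdf5_store.h5', 'ex3_copy').shape
(7, 4)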

Interacting with data in MongoDB

Many applications require more robust storage systems than text files, which is why many applications use databases to store data. There are many kinds of databases, but there are two broad categories: relational databases, which support a standard declarative language called SQL, and so-called NoSQL databases, which often work without a predefined schema and where a data instance is more properly described as a document rather than as a row.

MongoDB is a NoSQL database that stores data as documents, which are grouped together in collections. Documents are expressed as JSON objects. MongoDB is fast and scalable in storing data, and also flexible in querying it. To use MongoDB in Python, we need to import the pymongo package and open a connection to the database by passing a hostname and port. We assume that we have a MongoDB instance running on the default host (localhost) and port (27017):

>>> import pymongo
>>> conn = pymongo.MongoClient(host='localhost', port=27017)

If we do not pass any parameters to the pymongo.MongoClient() function, it will automatically use the default host and port.

In the next step, we will interact with databases inside the MongoDB instance. We can list all databases that are available in the instance:

>>> conn.database_names()
['local']
>>> lc = conn.local
>>> lc
Database(MongoClient('localhost', 27017), 'local')

The above snippet shows that our MongoDB instance has only one database, named 'local'. If the databases and collections we point to do not exist, MongoDB will create them as necessary:

>>> db = conn.db
>>> db
Database(MongoClient('localhost', 27017), 'db')

Each database contains groups of documents, called collections. We can think of them as tables in a relational database. To list all existing collections in a database, we use the collection_names() function:

>>> lc.collection_names()
['startup_log', 'system.indexes']
>>> db.collection_names()
[]

Our db database does not have any collections yet. Let's create a collection named person and insert data from a DataFrame object into it:

>>> collection = db.person
>>> collection
Collection(Database(MongoClient('localhost', 27017), 'db'), 'person')
>>> # insert df_ex2 DataFrame into created collection
>>> import json
>>> records = json.loads(df_ex2.T.to_json()).values()
>>> records
dict_values([{'2': 3, '3': 'male', '1': 39, '4': 'vl', '0': 'Vinh'}, {'2': 3, '3': 'male', '1': 26, '4': 'dn', '0': 'Nghia'}, {'2': 4, '3': 'female', '1': 28, '4': 'dn', '0': 'Hong'}, {'2': 3, '3': 'female', '1': 25, '4': 'hn', '0': 'Lan'}, {'2': 3, '3': 'male', '1': 42, '4': 'tn', '0': 'Hung'}, {'2': 1, '3':'male', '1': 7, '4': 'hcm', '0': 'Nam'}, {'2': 1, '3': 'female', '1': 11, '4': 'hcm', '0': 'Mai'}])
>>> collection.insert(records)
[ObjectId('557da218f21c761d7c176a40'),
 ObjectId('557da218f21c761d7c176a41'),
 ObjectId('557da218f21c761d7c176a42'),
 ObjectId('557da218f21c761d7c176a43'),
 ObjectId('557da218f21c761d7c176a44'),
 ObjectId('557da218f21c761d7c176a45'),
 ObjectId('557da218f21c761d7c176a46')]

The df_ex2 DataFrame is transposed and converted to a JSON string, which is then loaded into a dictionary. The insert() function receives the dictionary values created from df_ex2 and saves them to the collection.
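
A hedged alternative for building the records is the DataFrame.to_dict() method, which avoids the round trip through a JSON string. We only build the list here and do not insert it again, so the collection contents shown below stay as they are; the rename(columns=str) step is needed because MongoDB requires document keys to be strings:

>>> records2 = df_ex2.rename(columns=str).to_dict('records')
>>> len(records2)
7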

If we want to list all data inside the collection, we can execute the following commands:

>>> for cur in collection.find():
>>>     print(cur)
{'4': 'vl', '2': 3, '3': 'male', '1': 39, '_id': ObjectId('557da218f21c761d7c176
a40'), '0': 'Vinh'}
{'4': 'dn', '2': 3, '3': 'male', '1': 26, '_id': ObjectId('557da218f21c761d7c176
a41'), '0': 'Nghia'}
{'4': 'dn', '2': 4, '3': 'female', '1': 28, '_id': ObjectId('557da218f21c761d7c1
76a42'), '0': 'Hong'}
{'4': 'hn', '2': 3, '3': 'female', '1': 25, '_id': ObjectId('557da218f21c761d7c1
76a43'), '0': 'Lan'}
{'4': 'tn', '2': 3, '3': 'male', '1': 42, '_id': ObjectId('557da218f21c761d7c176
a44'), '0': 'Hung'}
{'4': 'hcm', '2': 1, '3': 'male', '1': 7, '_id': ObjectId('557da218f21c761d7c176
a45'), '0': 'Nam'}
{'4': 'hcm', '2': 1, '3': 'female', '1': 11, '_id': ObjectId('557da218f21c761d7c
176a46'), '0': 'Mai'}

If we want to query data from the created collection with some conditions, we can use the find() function and pass in a dictionary describing the documents we want to retrieve. The returned result is a cursor type, which supports the iterator protocol:

>>> cur = collection.find({'3' : 'male'})
>>> type(cur)
pymongo.cursor.Cursor
>>> result = pd.DataFrame(list(cur))
>>> result
       0   1  2     3    4                       _id
0   Vinh  39  3  male   vl  557da218f21c761d7c176a40
1  Nghia  26  3  male   dn  557da218f21c761d7c176a41
2   Hung  42  3  male   tn  557da218f21c761d7c176a44
3    Nam   7  1  male  hcm  557da218f21c761d7c176a45

Sometimes, we want to delete data in MongoDB. All we need to do is pass a query to the remove() method of the collection:

>>> # before removing data
>>> pd.DataFrame(list(collection.find()))
       0   1  2       3    4                       _id
0   Vinh  39  3    male   vl  557da218f21c761d7c176a40
1  Nghia  26  3    male   dn  557da218f21c761d7c176a41
2   Hong  28  4  female   dn  557da218f21c761d7c176a42
3    Lan  25  3  female   hn  557da218f21c761d7c176a43
4   Hung  42  3    male   tn  557da218f21c761d7c176a44
5    Nam   7  1    male  hcm  557da218f21c761d7c176a45
6    Mai  11  1  female  hcm  557da218f21c761d7c176a46

>>> # after removing records which have '2' column as 1 and '3' column as 'male'
>>> collection.remove({'2': 1, '3': 'male'})
{'n': 1, 'ok': 1}
>>> cur_all = collection.find()
>>> pd.DataFrame(list(cur_all))
       0   1  2       3    4                       _id
0   Vinh  39  3    male   vl  557da218f21c761d7c176a40
1  Nghia  26  3    male   dn  557da218f21c761d7c176a41
2   Hong  28  4  female   dn  557da218f21c761d7c176a42
3    Lan  25  3  female   hn  557da218f21c761d7c176a43
4   Hung  42  3    male   tn  557da218f21c761d7c176a44
5    Mai  11  1  female  hcm  557da218f21c761d7c176a46

We have learned, step by step, how to insert, query, and delete data in a collection. Now, we will show how to update existing data in a collection in MongoDB:

>>> doc = collection.find_one({'1' : 42})
>>> doc['4'] = 'hcm'
>>> collection.save(doc)
ObjectId('557da218f21c761d7c176a44')
>>> pd.DataFrame(list(collection.find()))
       0   1  2       3    4                       _id
0   Vinh  39  3    male   vl  557da218f21c761d7c176a40
1  Nghia  26  3    male   dn  557da218f21c761d7c176a41
2   Hong  28  4  female   dn  557da218f21c761d7c176a42
3    Lan  25  3  female   hn  557da218f21c761d7c176a43
4   Hung  42  3    male  hcm  557da218f21c761d7c176a44
5    Mai  11  1  female  hcm  557da218f21c761d7c176a46

The following table shows methods that provide shortcuts to manipulate documents in MongoDB:

Update Method

Description

inc()

Increment a numeric field

set()

Set certain fields to new values

unset()

Remove a field from the document

push()

Append a value onto an array in the document

pushAll()

Append several values onto an array in the document

addToSet()

Add a value to an array, only if it does not exist

pop()

Remove the last value of an array

pull()

Remove all occurrences of a value from an array

pullAll()

Remove all occurrences of any set of values from an array

rename()

Rename a field

bit()

Update a value by bitwise operation
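
These shortcuts correspond to MongoDB's update operators, which are written with a leading $ when used from pymongo. The following hedged sketch, using the legacy update() method that matches the pymongo API used in this chapter (newer versions prefer update_one()), increments Hung's age and sets his hometown back to tn:

>>> # $inc and $set are the operators behind inc() and set()
>>> collection.update({'0': 'Hung'},
                      {'$inc': {'1': 1},
                       '$set': {'4': 'tn'}})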

Interacting with data in Redis

Redis is an advanced kind of key-value store where the values can be of different types: string, list, set, sorted set, or hash. Like memcached, Redis stores data in memory, but unlike memcached, it can also persist data on disk. Redis supports fast reads and writes, in the order of 100,000 set or get operations per second.

To interact with Redis from Python, we need to install the redis-py module, which is available on PyPI and can be installed with pip:

$ pip install redis

Now, we can connect to Redis via the host and port of the DB server. We assume that we have already installed a Redis server, which is running with the default host (localhost) and port (6379) parameters:

>>> import redis
>>> r = redis.StrictRedis(host='localhost', port=6379)
>>> r
StrictRedis<ConnectionPool<Connection<host=localhost,port=6379,db=0>>>

As a first step to storing data in Redis, we need to define which kind of data structure is suitable for our requirements. In this section, we will introduce four commonly used data structures in Redis: simple value, list, set, and ordered set. Though data can be stored in Redis using many different data structures, each value must be associated with a key.

The simple value

This is the most basic kind of value in Redis. For every key in Redis, we also have a value that can have a data type, such as string, integer or double. Let's start with an example for setting and getting data to and from Redis:

>>> r.set('gender:An', 'male')
True
>>> r.get('gender:An')
b'male'

In this example, we want to store the gender information of a person named An in Redis. Our key is gender:An and our value is male. Both of them are strings.

The set() function receives two parameters: the key and the value. If we want to update the value stored under this key, we just call the function again with a new second parameter; Redis updates it automatically.

The get() function retrieves the value of the key that is passed as the parameter. In this case, we want to get the gender information stored under the key gender:An.

In the second example, we show you another kind of value type, an integer:

>>> r.set('visited_time:An', 12)
True
>>> r.get('visited_time:An')
b'12'
>>> r.incr('visited_time:An', 1)
13
>>> r.get('visited_time:An')
b'13'

We saw a new function, incr(), which is used to increment the value of a key by a given amount. If the key does not exist, Redis will create it with the given increment as its value.

List

We have a few methods for interacting with list values in Redis. The following example uses rpush() and lrange() functions to put and get list data to and from the DB:

>>> r.rpush('name_list', 'Tom')
1L
>>> r.rpush('name_list', 'John')
2L
>>> r.rpush('name_list', 'Mary')
3L
>>> r.rpush('name_list', 'Jan')
4L
>>> r.lrange('name_list', 0, -1)
[b'Tom', b'John', b'Mary', b'Jan']
>>> r.llen('name_list')
4
>>> r.lindex('name_list', 1)
b'John'

Besides the rpush() and lrange() functions used in the example, we also want to introduce two other functions. First, the llen() function returns the length of the list stored in Redis under a given key. The lindex() function is another way to retrieve a single item of the list; we need to pass two parameters to the function: a key and the index of the item in the list. The following table lists some other useful functions for processing the list data structure with Redis:

Function

Description

rpushx(name, value)

Push value onto the tail of the list name if name exists

rpop(name)

Remove and return the last item of the list name

lset(name, index, value)

Set item at the index position of the list name to input value

lpushx(name,value)

Push value on the head of the list name if name exists

lpop(name)

Remove and return the first item of the list name
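
Continuing with the name_list key from above, a short hedged sketch of some of these functions looks like this:

>>> r.lpop('name_list')
b'Tom'
>>> r.rpop('name_list')
b'Jan'
>>> r.lset('name_list', 0, 'Johnny')
True
>>> r.lrange('name_list', 0, -1)
[b'Johnny', b'Mary']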

Set

This data structure is similar to the list type. However, in contrast to a list, we cannot store duplicate values in a set:

>>> r.sadd('country', 'USA')
1
>>> r.sadd('country', 'Italy')
1
>>> r.sadd('country', 'Singapore')
1
>>> r.sadd('country', 'Singapore')
0
>>> r.smembers('country')
{b'Italy', b'Singapore', b'USA'}
>>> r.srem('country', 'Singapore')
1
>>> r.smembers('country')
{b'Italy', b'USA'}

As with the list data structure, we also have a number of functions to get, set, update, or delete items in a set. The supported functions for the set data structure are listed in the following table:

Function

Description

sadd(name, values)

Add value(s) to the set with key name

scard(name)

Return the number of elements in the set with key name

smembers(name)

Return all members of the set with key name

srem(name, values)

Remove value(s) from the set with key name
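
Continuing with the country key from above, sadd() can also take several values at once, and scard() reports the size of the set; a small hedged sketch:

>>> r.sadd('country', 'France', 'Germany')
2
>>> r.scard('country')
4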

Ordered set

The ordered set data structure takes an extra attribute, called score, when we add data to the set. An ordered set uses the score to determine the order of the elements in the set:

>>> r.zadd('person:A', 10, 'sub:Math')
1
>>> r.zadd('person:A', 7, 'sub:Bio')
1
>>> r.zadd('person:A', 8, 'sub:Chem')
1
>>> r.zrange('person:A', 0, -1)
[b'sub:Bio', b'sub:Chem', b'sub:Math']
>>> r.zrange('person:A', 0, -1, withscores=True)
[(b'sub:Bio', 7.0), (b'sub:Chem', 8.0), (b'sub:Math', 10.0)]

By using the zrange(name, start, end) function, we can get a range of values from the sorted set between the start and end index positions, sorted in ascending order of score by default. If we want to reverse the sort order, we can set the desc parameter to True. The withscores parameter is used when we want to get the scores along with the returned values; in that case, the return type is a list of (value, score) pairs, as you can see in the above example.

See the following table for more functions available on ordered sets:

Function

Description

zcard(name)

Return the number of elements in the sorted set with key name

zincrby(name, value, amount=1)

Increment the score of value in the sorted set with key name by amount

zrangebyscore(name, min, max, withscores=False, start=None, num=None)

Return a range of values from the sorted set with key name with a score between min and max.

If withscores is true, return the scores along with the values.

If start and num are given, return a slice of the range

zrank(name, value)

Return a 0-based value indicating the rank of value in the sorted set with key name

zrem(name, values)

Remove member value(s) from the sorted set with key name
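
Continuing with the person:A sorted set, here is a hedged sketch of a few of these functions. Note that the argument order shown for zincrby() matches the older redis-py API used in this chapter; newer versions expect the amount before the value:

>>> r.zincrby('person:A', 'sub:Bio', 2)
9.0
>>> r.zrank('person:A', 'sub:Math')
2
>>> r.zrangebyscore('person:A', 8, 10, withscores=True)
[(b'sub:Chem', 8.0), (b'sub:Bio', 9.0), (b'sub:Math', 10.0)]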

Summary

We finished covering the basics of interacting with data in different commonly used storage mechanisms: from simple ones, such as text files, over more structured ones, such as HDF5, to more sophisticated data storage systems, such as MongoDB and Redis. The most suitable type of storage will depend on your use case. The choice of the data storage layer technology plays an important role in the overall design of data processing systems. Sometimes, we need to combine various database systems to store our data, depending on factors such as the complexity of the data, the performance of the system, or the computation requirements.

Practice exercises

  • Take a data set of your choice and design storage options for it. Consider text files, HDF5, a document database, and a data structure store as possible persistent options. Also evaluate how difficult it would be (by some metric, for example, how many lines of code) to update or delete a specific item. Which storage type is the easiest to set up? Which storage type supports the most flexible queries?
  • In Chapter 3, Data Analysis with Pandas, we saw that it is possible to create hierarchical indices with Pandas. As an example, assume that you have data on each city with more than 1 million inhabitants and that we have a two-level index, so we can address individual cities, but also whole countries. How would you represent this hierarchical relationship with the various storage options presented in this chapter: text files, HDF5, MongoDB, and Redis? What do you believe would be the most convenient to work with in the long run?