Text is a great medium and a simple way to exchange information. The following statement is attributed to Doug McIlroy: "Write programs to handle text streams, because that is the universal interface."
In this section we will start reading and writing data from and to text files.
Normally, the raw data logs of a system are stored in multiple text files, which can accumulate a large amount of information over time. Thankfully, it is simple to interact with these kinds of files in Python.
Pandas supports a number of functions for reading data from a text file into a DataFrame object. The simplest one is the `read_csv()` function. Let's start with a small example file:
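As a minimal sketch of `read_csv()`, the snippet below parses a small comma-separated data set. The content is invented for illustration and is held in an in-memory buffer (`io.StringIO`) instead of a file on disk, so the names `raw` and `df_basic` are assumptions rather than part of the original example.

```python
import io

import pandas as pd

# Hypothetical file content; io.StringIO stands in for a path on disk.
raw = io.StringIO(
    "Name,Vote,Gender\n"
    "Frank,3,M\n"
    "Ann,9,F\n"
)

# By default the first line is used as the header row.
df_basic = pd.read_csv(raw)
print(df_basic)
```

Passing a file path instead of the buffer works the same way.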
We can also set a specific row as the header row by using the `header` parameter, set to the index of the selected row. Similarly, when we want to use a column of the data file as the index (row labels) of the DataFrame, we set `index_col` to the name or index of that column. We again use the second data file, `example_data/ex_06-02.txt`, to illustrate this:
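The following sketch shows how `header` and `index_col` interact. It does not use the book's data file; the content is a hypothetical stand-in in which the column names sit on the second line.

```python
import io

import pandas as pd

# Hypothetical content: line 0 is a title, line 1 holds the column names.
raw = io.StringIO(
    "survey results\n"
    "Name,Vote,Gender\n"
    "Frank,3,M\n"
    "Ann,9,F\n"
)

# header=1 takes the second line as the header; the line before it is ignored.
# index_col="Name" turns that column into the row index.
df_indexed = pd.read_csv(raw, header=1, index_col="Name")
print(df_indexed.loc["Ann", "Vote"])
```

With the index in place, rows can be addressed by label, as in `df_indexed.loc["Ann"]`.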
Apart from those parameters, we still have a lot of useful ones that can help us load data files into Pandas objects more effectively. The following table shows some common parameters:
Parameter | Value | Description |
---|---|---|
`dtype` | Type name or dictionary of type of columns | Sets the data type for data or columns. By default it will try to infer the most appropriate data type. |
`skiprows` | List-like or integer | The number of lines to skip (starting from 0). |
`na_values` | List-like or dict, default None | Values to recognize as NA/NaN. |
`true_values` | List | A list of values to be converted to Boolean True as well. |
`false_values` | List | A list of values to be converted to Boolean False as well. |
 | | If the |
`thousands` | | The thousands separator. |
`nrows` | | Limits the number of rows to read from the file. |
 | | If set to True, a DataFrame is returned, even if an error occurred during parsing. |
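Several of these parameters can be combined in one call. The sketch below assumes a semicolon-separated export with a leading comment line, comma-grouped numbers, and the word "missing" as a sentinel for absent values; all of this data is hypothetical.

```python
import io

import pandas as pd

# Hypothetical export: one junk line, then a header and three data rows.
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "city;population;area\n"
    "Alpha;1,200,000;missing\n"
    "Beta;950,000;310.5\n"
    "Gamma;2,000,000;700.0\n"
)

df_opts = pd.read_csv(
    raw,
    sep=";",
    skiprows=1,              # skip the first line of the file
    dtype={"city": str},     # force the type of one column
    thousands=",",           # parse "1,200,000" as 1200000
    na_values=["missing"],   # treat this sentinel as NaN
    nrows=2,                 # read only the first two data rows
)
print(df_opts)
```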
Besides the `read_csv()` function, we also have some other parsing functions in Pandas:
Function | Description |
---|---|
`read_table` | Read the general delimited file into DataFrame |
`read_fwf` | Read a table of fixed-width formatted lines into DataFrame |
`read_clipboard` | Read text from the clipboard and parse it like a delimited text file |
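As an illustration of one of these functions, here is a small sketch of `read_fwf()`. The data is hypothetical, and the column widths are given explicitly with `widths` rather than inferred.

```python
import io

import pandas as pd

# Hypothetical fixed-width data: the columns are 4, 7 and 5 characters wide.
raw = io.StringIO(
    "id  name   score\n"
    "1   Ann    9.5\n"
    "2   Bob    7.0\n"
)

# Each field is cut out by position and stripped of surrounding whitespace.
df_fwf = pd.read_fwf(raw, widths=[4, 7, 5])
print(df_fwf)
```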
In some situations, we cannot automatically parse data files from the disk using these functions. In that case, we can also open the files ourselves and iterate through a reader provided by the `csv` module in the standard library:
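A minimal sketch of this manual approach, using only the standard library. The content is hypothetical and is read from an in-memory buffer; with a real file, the buffer would be replaced by an open file object.

```python
import csv
import io

# Hypothetical content; with a real file use open(path, newline="") instead.
raw = io.StringIO("Name,Vote,Gender\nFrank,3,M\nAnn,9,F\n")

reader = csv.reader(raw)
header = next(reader)           # the first row holds the column names
rows = [row for row in reader]  # every remaining row is a list of strings

for row in rows:
    print(dict(zip(header, row)))
```

All values arrive as strings, so any type conversion has to be done explicitly while iterating.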
We saw how to load data from a text file into a Pandas data structure. Now, we will learn how to export data from a data object in a program to a text file. Corresponding to the `read_csv()` function, Pandas also provides the `to_csv()` function. Let's see an example below:
The result will look like this:
If we want to skip the header line or index column when writing data out to a file on disk, we can set the `header` and `index` parameters to `False`:
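The effect of these two parameters can be sketched as follows. The DataFrame is a hypothetical example, and `to_csv()` is called without a path so that it returns the CSV text as a string instead of writing a file.

```python
import pandas as pd

# Hypothetical data used only for illustration.
df_out = pd.DataFrame({"Name": ["Frank", "Ann"], "Vote": [3, 9]})

# Without a path argument, to_csv() returns the CSV text.
with_labels = df_out.to_csv()
without_labels = df_out.to_csv(header=False, index=False)

print(with_labels)
print(without_labels)
```

The first string starts with a header line and carries the index in its first column; the second contains only the data values.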
We can read and write binary serializations of Python objects with the pickle module, which can be found in the standard library. Object serialization can be useful if you work with objects that take a long time to create, like some machine learning models. By pickling such objects, subsequent access to them can be made faster. It also allows you to distribute Python objects in a standardized way.
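A minimal round trip with the standard library looks like the sketch below. The dictionary stands in for an object that is expensive to build and is purely hypothetical.

```python
import pickle

# Hypothetical stand-in for an object that takes a long time to create.
model_like = {"weights": [0.1, 0.2, 0.7], "labels": ("a", "b", "c")}

blob = pickle.dumps(model_like)  # serialize to a bytes object
restored = pickle.loads(blob)    # rebuild an equal, independent object

print(restored == model_like)
```

Unpickling can execute arbitrary code, so only load pickles that come from a trusted source.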
Pandas includes support for pickling out of the box. The relevant functions are `read_pickle()` and `to_pickle()`, which read and write data from and to files easily. They store data on disk in the pickle format, which is a convenient short-term storage format:
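The following sketch writes a DataFrame to a pickle file and reads it back. The data and the file name `frame.pkl` are hypothetical, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data used only for illustration.
df_p = pd.DataFrame({"Name": ["Frank", "Ann"], "Vote": [3, 9]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "frame.pkl")
    df_p.to_pickle(path)              # write the DataFrame in pickle format
    df_loaded = pd.read_pickle(path)  # read it back into a new DataFrame

print(df_loaded.equals(df_p))
```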
HDF5 is not a database, but a data model and file format. It is suited for write-once, read-many datasets. An HDF5 file includes two kinds of objects: data sets, which are array-like collections of data, and groups, which are folder-like containers that hold data sets and other groups. There are some interfaces for interacting with the HDF5 format in Python, such as `h5py`, which uses familiar NumPy and Python constructs, such as dictionaries and NumPy array syntax. With `h5py`, we have a high-level interface to the HDF5 API, which helps us to get started. However, in this book, we will introduce another library for this kind of format called PyTables, which works well with Pandas objects:
Objects stored in the HDF5 file can be retrieved by specifying the object keys:
Once we have finished interacting with the HDF5 file, we close it to release the file handle:
Many applications require more robust storage systems than text files, which is why many applications use databases to store data. There are many kinds of databases, but there are two broad categories: relational databases, which support a standard declarative language called SQL, and so-called NoSQL databases, which are often able to work without a predefined schema and where a data instance is more properly described as a document rather than as a row.
The above snippet says that our MongoDB instance only has one database, named 'local'. If the databases and collections we point to do not exist, MongoDB will create them as necessary:
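`df_ex2` is transposed and converted to a JSON string before being loaded into a dictionary. The `insert()` function receives the dictionary created from `df_ex2` and saves it to the collection. (In PyMongo 3 and later, `insert()` is deprecated in favor of `insert_one()` and `insert_many()`.)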
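The conversion step described here can be sketched without a running MongoDB instance. The DataFrame below is a hypothetical stand-in for `df_ex2`, and the final call to the collection is left out because it needs a live server.

```python
import json

import pandas as pd

# Hypothetical stand-in for df_ex2.
df_docs = pd.DataFrame({"Name": ["Frank", "Ann"], "Vote": [3, 9]})

# Transposing makes every original row a column, so the JSON output
# is keyed by row label and each value is one record.
records = json.loads(df_docs.T.to_json())

# A list of plain dictionaries, one per row, ready to be stored as documents.
documents = list(records.values())
print(documents)
```

Each dictionary in `documents` has the shape of a single document that could then be passed to the collection.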
If we want to list all data inside the collection, we can execute the following commands:
Sometimes, we want to delete data in MongoDB. All we need to do is pass a query to the `remove()` method on the collection:
The following table shows methods that provide shortcuts to manipulate documents in MongoDB:
Update Method | Description |
---|---|
`$inc` | Increment a numeric field |
`$set` | Set certain fields to new values |
`$unset` | Remove a field from the document |
`$push` | Append a value onto an array in the document |
`$pushAll` | Append several values onto an array in the document |
`$addToSet` | Add a value to an array, only if it does not exist |
`$pop` | Remove the last value of an array |
`$pull` | Remove all occurrences of a value from an array |
`$pullAll` | Remove all occurrences of any set of values from an array |
`$rename` | Rename a field |
`$bit` | Update a value by bitwise operation |
Redis is an advanced kind of key-value store where the values can be of different types: string, list, set, sorted set or hash. Redis stores data in memory like memcached but it can be persisted on disk, unlike memcached, which has no such option. Redis supports fast reads and writes, in the order of 100,000 set or get operations per second.
As a first step to storing data in Redis, we need to define which kind of data structure is suitable for our requirements. In this section, we will introduce four commonly used data structures in Redis: simple value, list, set and ordered set. Though data is stored into Redis in many different data structures, each value must be associated with a key.
This is the most basic kind of value in Redis. For every key in Redis, we also have a value that can have a data type, such as string, integer or double. Let's start with an example for setting and getting data to and from Redis:
In the second example, we show you another kind of value type, an integer:
We have a few methods for interacting with list values in Redis. The following example uses the `rpush()` and `lrange()` functions to put and get list data to and from the DB:

Function | Description |
---|---|
`rpushx()` | Push value onto the tail of the list name if name exists |
`rpop()` | Remove and return the last item of the list name |
`lset()` | Set item at the index position of the list name to input value |
`lpushx()` | Push value on the head of the list name if name exists |
`lpop()` | Remove and return the first item of the list name |
This data structure is also similar to the list type. However, in contrast to a list, we cannot store duplicate values in our set:
As with the list data structure, we also have a number of functions to get, set, update or delete items in the set. The functions supported for the set data structure are listed in the following table:

Function | Description |
---|---|
`sadd()` | Add value(s) to the set with key name |
`scard()` | Return the number of elements in the set with key name |
`smembers()` | Return all members of the set with key name |
`srem()` | Remove value(s) from the set with key name |
The ordered set data structure takes an extra attribute, called the score, when we add data to a set. An ordered set will use the score to determine the order of the elements in the set:
By using the `zrange(name, start, end)` function, we can get a range of values from the sorted set between the `start` and `end` positions, sorted by score in ascending order by default. If we want to reverse the sort order, we can set the `desc` parameter to `True`. The `withscores` parameter is used when we want to get the scores along with the returned values; the return type is then a list of (value, score) pairs, as you can see in the above example.
See the below table for more functions available on ordered sets:
Function | Description |
---|---|
`zcard()` | Return the number of elements in the sorted set with key name |
`zincrby()` | Increment the score of value in the sorted set with key name by amount |
`zrangebyscore()` | Return a range of values from the sorted set with key name with a score between min and max. If `start` and `num` are specified, a slice of that range is returned. |
`zrank()` | Return a 0-based value indicating the rank of value in the sorted set with key name |
`zrem()` | Remove member value(s) from the sorted set with key name |
- Take a data set of your choice and design storage options for it. Consider text files, HDF5, a document database, and a data structure store as possible persistent options. Also evaluate how difficult (by some metric, for example, how many lines of code) it would be to update or delete a specific item. Which storage type is the easiest to set up? Which storage type supports the most flexible queries?
- In Chapter 3, Data Analysis with Pandas, we saw that it is possible to create hierarchical indices with Pandas. As an example, assume that you have data on each city with more than 1 million inhabitants and that we have a two-level index, so we can address individual cities, but also whole countries. How would you represent this hierarchical relationship with the various storage options presented in this chapter: text files, HDF5, MongoDB, and Redis? What do you believe would be most convenient to work with in the long run?