Book Image

Bioinformatics with Python Cookbook - Third Edition

By : Tiago Antao
Book Image

Bioinformatics with Python Cookbook - Third Edition

By: Tiago Antao

Overview of this book

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data, and this book will show you how to manage these tasks using Python. This updated third edition of the Bioinformatics with Python Cookbook begins with a quick overview of the various tools and libraries in the Python ecosystem that will help you convert, analyze, and visualize biological datasets. Next, you'll cover key techniques for next-generation sequencing, single-cell analysis, genomics, metagenomics, population genetics, phylogenetics, and proteomics with the help of real-world examples. You'll learn how to work with important pipeline systems, such as Galaxy servers and Snakemake, and understand the various modules in Python for functional and asynchronous programming. This book will also help you explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks, including Dask and Spark. In addition to this, you’ll explore the application of machine learning algorithms in bioinformatics. By the end of this bioinformatics Python book, you'll be equipped with the knowledge you need to implement the latest programming techniques and frameworks, empowering you to deal with bioinformatics data on every scale.
Table of Contents (15 chapters)

Accelerating pandas processing with 
Apache Arrow

When dealing with large amounts of data, such as in whole genome sequencing, pandas is both slow and memory-consuming. Apache Arrow provides faster and more memory-efficient implementations of several pandas operations and can interoperate with it.

Apache Arrow is a project co-founded by Wes McKinney, the founder of pandas, and it has several objectives, including working with tabular data in a language-agnostic way, which allows for language interoperability while providing a memory- and computation-efficient implementation. Here, we will only be concerned with the second part: getting more efficiency for large-data processing. We will do this in an integrated way with pandas.

Here, we will once again use VAERS data and show how Apache Arrow can be used to accelerate pandas data loading and reduce memory consumption.

Getting ready

Again, we will be using data from the first recipe. Be sure you download and prepare it, as explained in the Getting ready section of the Using pandas to process vaccine-adverse events recipe. The code is available in Chapter02/Arrow.py.

How to do it...

Follow these steps:

  1. Let’s start by loading the data using both pandas and Arrow:
    import gzip
    import pandas as pd
    from pyarrow import csv
    import pyarrow.compute as pc 
    vdata_pd = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
    columns = list(vdata_pd.columns)
    vdata_pd.info(memory_usage="deep") 
    vdata_arrow = csv.read_csv("2021VAERSDATA.csv.gz")
    tot_bytes = sum([
        vdata_arrow[name].nbytes
        for name in vdata_arrow.column_names])
    print(f"Total {tot_bytes // (1024 ** 2)} MB")

pandas requires 1.3 GB, whereas Arrow requires 614 MB: less than half the memory. For large files like this, this may mean the difference between being able to process data in memory or needing to find another solution, such as Dask. While some functions in Arrow have similar names to pandas (for example, read_csv), that is not the most common occurrence. For example, note the way we compute the total size of the DataFrame: by getting the size of each column and performing a sum, which is a different approach from pandas.

  1. Let’s do a side-by-side comparison of the inferred types:
    for name in vdata_arrow.column_names:
        arr_bytes = vdata_arrow[name].nbytes
        arr_type = vdata_arrow[name].type
        pd_bytes = vdata_pd[name].memory_usage(index=False, deep=True)
        pd_type = vdata_pd[name].dtype
        print(
            name,
            arr_type, arr_bytes // (1024 ** 2),
            pd_type, pd_bytes // (1024 ** 2),)

Here is an abridged version of the output:

VAERS_ID int64 4 int64 4
RECVDATE string 8 object 41
STATE string 3 object 34
CAGE_YR int64 5 float64 4
SEX string 3 object 36
RPT_DATE string 2 object 20
DIED string 2 object 20
L_THREAT string 2 object 20
ER_VISIT string 2 object 19
HOSPITAL string 2 object 20
HOSPDAYS int64 5 float64 4

As you can see, Arrow is generally more specific with type inference and is one of the main reasons why memory usage is substantially lower.

  1. Now, let’s do a time performance comparison:
    %timeit pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
    %timeit csv.read_csv("2021VAERSDATA.csv.gz")

On my computer, the results are as follows:

7.36 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.28 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Arrow’s implementation is three times faster. The results on your computer will vary as this is dependent on the hardware.

  1. Let’s repeat the memory occupation comparison while not loading the SYMPTOM_TEXT column. This is a fairer comparison as most numerical datasets do not tend to have a very large text column:
    vdata_pd = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1", usecols=lambda x: x != "SYMPTOM_TEXT")
    vdata_pd.info(memory_usage="deep")
    columns.remove("SYMPTOM_TEXT")
    vdata_arrow = csv.read_csv(
        "2021VAERSDATA.csv.gz",
         convert_options=csv.ConvertOptions(include_columns=columns))
    vdata_arrow.nbytes

pandas requires 847 MB, whereas Arrow requires 205 MB: four times less.

  1. Our objective is to use Arrow to load data into pandas. For that, we need to convert the data structure:
    vdata = vdata_arrow.to_pandas()
    vdata.info(memory_usage="deep")

There are two very important points to be made here: the pandas representation created by Arrow uses only 1 GB, whereas the pandas representation, from its native read_csv, is 1.3 GB. This means that even if you use pandas to process data, Arrow can create a more compact representation to start with.

The preceding code has one problem regarding memory consumption: when the converter is running, it will require memory to hold both the pandas and the Arrow representations, hence defeating the purpose of using less memory. Arrow can self-destruct its representation while creating the pandas version, hence resolving the problem. The line for this is vdata = vdata_arrow.to_pandas(self_destruct=True).

There’s more...

If you have a very large DataFrame that cannot be processed by pandas, even after it’s been loaded by Arrow, then maybe Arrow can do all the processing as it has a computing engine as well. That being said, Arrow’s engine is, at the time of writing, substantially less complete in terms of functionality than pandas. Remember that Arrow has many other features, such as language interoperability, but we will not be making use of those in this book.