Bioinformatics with Python Cookbook - Third Edition

By: Tiago Antao

Overview of this book

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data, and this book will show you how to manage these tasks using Python. This updated third edition of the Bioinformatics with Python Cookbook begins with a quick overview of the various tools and libraries in the Python ecosystem that will help you convert, analyze, and visualize biological datasets. Next, you'll cover key techniques for next-generation sequencing, single-cell analysis, genomics, metagenomics, population genetics, phylogenetics, and proteomics with the help of real-world examples. You'll learn how to work with important pipeline systems, such as Galaxy servers and Snakemake, and understand the various modules in Python for functional and asynchronous programming. This book will also help you explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks, including Dask and Spark. In addition to this, you’ll explore the application of machine learning algorithms in bioinformatics. By the end of this bioinformatics Python book, you'll be equipped with the knowledge you need to implement the latest programming techniques and frameworks, empowering you to deal with bioinformatics data on every scale.

Accelerating pandas processing with Apache Arrow

When dealing with large amounts of data, such as in whole genome sequencing, pandas is both slow and memory-consuming. Apache Arrow provides faster and more memory-efficient implementations of several pandas operations and can interoperate with it.

Apache Arrow is a project co-founded by Wes McKinney, the creator of pandas. It has several objectives, including representing tabular data in a language-agnostic way, which enables interoperability between languages, and providing a memory- and computation-efficient implementation. Here, we will only be concerned with the second objective: processing large data more efficiently. We will do this in an integrated way with pandas.

Here, we will once again use VAERS data and show how Apache Arrow can be used to accelerate pandas data loading and reduce memory consumption.

Getting ready

Again, we will be using data from the first recipe. Be sure to download and prepare it, as explained in the Getting ready section of the Using pandas to process vaccine-adverse events recipe. The code is available in Chapter02/Arrow.py.

How to do it...

Follow these steps:

Let’s start by loading the data using both pandas and Arrow:

import gzip
import pandas as pd
from pyarrow import csv
import pyarrow.compute as pc

# Load with pandas and report memory usage, including string contents
vdata_pd = pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
columns = list(vdata_pd.columns)
vdata_pd.info(memory_usage="deep")

# Load the same file with Arrow and sum the size of each column
vdata_arrow = csv.read_csv("2021VAERSDATA.csv.gz")
tot_bytes = sum(
    vdata_arrow[name].nbytes
    for name in vdata_arrow.column_names)
print(f"Total {tot_bytes // (1024 ** 2)} MB")

pandas requires 1.3 GB, whereas Arrow requires 614 MB: less than half the memory. For large files like this, this may mean the difference between being able to process data in memory or needing to find another solution, such as Dask. While some functions in Arrow have similar names to their pandas counterparts (for example, read_csv), that is the exception rather than the rule. For example, note the way we compute the total size of the Arrow table: by getting the size of each column and performing a sum, which is a different approach from pandas.

Let’s do a side-by-side comparison of the inferred types:

# Compare the inferred type and size (in MB) of each column
for name in vdata_arrow.column_names:
    arr_bytes = vdata_arrow[name].nbytes
    arr_type = vdata_arrow[name].type
    pd_bytes = vdata_pd[name].memory_usage(index=False, deep=True)
    pd_type = vdata_pd[name].dtype
    print(
        name,
        arr_type, arr_bytes // (1024 ** 2),
        pd_type, pd_bytes // (1024 ** 2))

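Here is an abridged version of the output. Each line shows the column name, the Arrow type and size in MB, and then the pandas type and size in MB: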

VAERS_ID int64 4 int64 4
RECVDATE string 8 object 41
STATE string 3 object 34
CAGE_YR int64 5 float64 4
SEX string 3 object 36
RPT_DATE string 2 object 20
DIED string 2 object 20
L_THREAT string 2 object 20
ER_VISIT string 2 object 19
HOSPITAL string 2 object 20
HOSPDAYS int64 5 float64 4

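As you can see, Arrow is generally more specific with type inference, and this is one of the main reasons why its memory usage is substantially lower. Text columns are the clearest case: pandas stores them with the generic object type, whereas Arrow uses a dedicated string type.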

Now, let’s compare loading times. Note that %timeit is an IPython magic, so this step has to be run in IPython or Jupyter:

%timeit pd.read_csv("2021VAERSDATA.csv.gz", encoding="iso-8859-1")
%timeit csv.read_csv("2021VAERSDATA.csv.gz")

On my computer, the results are as follows:

7.36 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.28 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Arrow’s implementation is roughly three times faster (2.28 s versus 7.36 s). The results on your computer will vary as this is dependent on the hardware.

Let’s repeat the memory usage comparison, this time without loading the SYMPTOM_TEXT column. This is a fairer comparison, as most numerical datasets do not have a very large text column:

# Load everything except the large free-text column
vdata_pd = pd.read_csv(
    "2021VAERSDATA.csv.gz", encoding="iso-8859-1",
    usecols=lambda x: x != "SYMPTOM_TEXT")
vdata_pd.info(memory_usage="deep")
columns.remove("SYMPTOM_TEXT")
vdata_arrow = csv.read_csv(
    "2021VAERSDATA.csv.gz",
    convert_options=csv.ConvertOptions(include_columns=columns))
print(f"Total {vdata_arrow.nbytes // (1024 ** 2)} MB")

pandas requires 847 MB, whereas Arrow requires 205 MB: roughly a quarter of the memory.

Our objective is to use Arrow to load data into pandas. For that, we need to convert the data structure:
```
vdata = vdata_arrow.to_pandas()
vdata.info(memory_usage="deep")
```

There are two important points to be made here. First, the pandas representation created by Arrow uses only 1 GB, whereas the pandas representation from its native read_csv uses 1.3 GB. This means that even if you use pandas to process data, Arrow can create a more compact representation to start with.

Second, the preceding code has one problem regarding memory consumption: while the conversion is running, memory must hold both the pandas and the Arrow representations, which defeats the purpose of using less memory. Arrow can self-destruct its representation while creating the pandas version, which resolves the problem. The line for this is vdata = vdata_arrow.to_pandas(self_destruct=True).

There’s more...

If you have a very large DataFrame that cannot be processed by pandas, even after it has been loaded by Arrow, Arrow may be able to do all the processing itself, as it has its own compute engine. That being said, Arrow’s engine is, at the time of writing, substantially less complete in terms of functionality than pandas. Remember that Arrow has many other features, such as language interoperability, but we will not be making use of those in this book.
