Bioinformatics with Python Cookbook - Third Edition

By: Tiago Antao

Overview of this book

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data, and this book will show you how to manage these tasks using Python. This updated third edition of the Bioinformatics with Python Cookbook begins with a quick overview of the various tools and libraries in the Python ecosystem that will help you convert, analyze, and visualize biological datasets. Next, you'll cover key techniques for next-generation sequencing, single-cell analysis, genomics, metagenomics, population genetics, phylogenetics, and proteomics with the help of real-world examples. You'll learn how to work with important pipeline systems, such as Galaxy servers and Snakemake, and understand the various modules in Python for functional and asynchronous programming. This book will also help you explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks, including Dask and Spark. In addition to this, you’ll explore the application of machine learning algorithms in bioinformatics. By the end of this bioinformatics Python book, you'll be equipped with the knowledge you need to implement the latest programming techniques and frameworks, empowering you to deal with bioinformatics data on every scale.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the color images

Conventions used

Share Your Thoughts

Chapter 1: Python and the Surrounding Software Ecology

Installing the required basic software with Anaconda

Installing the required software with Docker

Interfacing with R via rpy2

Performing R magic with Jupyter

Chapter 2: Getting to Know NumPy, pandas, Arrow, and Matplotlib

Using pandas to process vaccine-adverse events

Dealing with the pitfalls of joining pandas DataFrames

Reducing the memory usage of pandas DataFrames

Accelerating pandas processing with Apache Arrow

Understanding NumPy as the engine behind Python data science and bioinformatics

Introducing Matplotlib for chart generation

Chapter 3: Next-Generation Sequencing

Accessing GenBank and moving around NCBI databases

Performing basic sequence analysis

Working with modern sequence formats

Working with alignment data

Extracting data from VCF files

Studying genome accessibility and filtering SNP data

Processing NGS data with HTSeq

Chapter 4: Advanced NGS Data Processing

Preparing a dataset for analysis

Using Mendelian error information for quality control

Exploring the data with standard statistics

Finding genomic features from sequencing annotations

Doing metagenomics with QIIME 2 Python API

Chapter 5: Working with Genomes

Technical requirements

Working with high-quality reference genomes

Dealing with low-quality genome references

Traversing genome annotations

Extracting genes from a reference using annotations

Finding orthologues with the Ensembl REST API

Retrieving gene ontology information from Ensembl

Chapter 6: Population Genetics

Managing datasets with PLINK

Using sgkit for population genetics analysis with xarray

Exploring a dataset with sgkit

Analyzing population structure

Performing a PCA

Investigating population structure with admixture

Chapter 7: Phylogenetics

Preparing a dataset for phylogenetic analysis

Aligning genetic and genomic data

Comparing sequences

Reconstructing phylogenetic trees

Playing recursively with trees

Visualizing phylogenetic data

Chapter 8: Using the Protein Data Bank

Finding a protein in multiple databases

Introducing Bio.PDB

Extracting more information from a PDB file

Computing molecular distances on a PDB file

Performing geometric operations

Animating with PyMOL

Parsing mmCIF files using Biopython

Chapter 9: Bioinformatics Pipelines

Introducing Galaxy servers

Accessing Galaxy using the API

Deploying a variant analysis pipeline with Snakemake

Deploying a variant analysis pipeline with Nextflow

Chapter 10: Machine Learning for Bioinformatics

Introducing scikit-learn with a PCA example

Using clustering over PCA to classify samples

Exploring breast cancer traits using Decision Trees

Predicting breast cancer outcomes using Random Forests

Chapter 11: Parallel Processing with Dask and Zarr

Reading genomics data with Zarr

Parallel processing of data using Python multiprocessing

Using Dask to process genomic data based on NumPy arrays

Scheduling tasks with dask.distributed

Chapter 12: Functional Programming for Bioinformatics

Understanding pure functions

Understanding immutability

Avoiding mutability as a robust development pattern

Using lazy programming for pipelining

The limits of recursion with Python

A showcase of Python’s functools module

Index

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Performing geometric operations

We will now perform computations with geometry information, including computing the center of mass of individual chains and of whole models.
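
As a quick refresher, the center of mass is the mass-weighted average of the atom coordinates. The following is a minimal sketch in plain Python rather than the recipe's own code; the function name center_of_mass and the two toy atoms are hypothetical and are not taken from 1TUP:

```python
def center_of_mass(atoms):
    """Return the mass-weighted mean position of (mass, (x, y, z)) pairs."""
    total_mass = sum(mass for mass, _ in atoms)
    if total_mass == 0:
        raise ValueError("total mass must be non-zero")
    # Weight each coordinate by its atom's mass, one axis at a time
    return tuple(
        sum(mass * coord[axis] for mass, coord in atoms) / total_mass
        for axis in range(3)
    )


# Hypothetical toy data: (mass, (x, y, z)) for two atoms, not from 1TUP
toy_atoms = [
    (12.0, (0.0, 0.0, 0.0)),
    (4.0, (4.0, 0.0, 8.0)),
]
com = center_of_mass(toy_atoms)
print(com)
```

If every atom is given the same mass, the same function returns the plain geometric center instead.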

Getting ready

You can find this content in the Chapter08/Mass.py Notebook file.

How to do it...

Let’s take a look at the following steps:

First, let’s retrieve the data:

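from Bio import PDB

# Download 1TUP in the legacy PDB format into the current directory
repository = PDB.PDBList()
repository.retrieve_pdb_file('1TUP', pdir='.', file_format='pdb')
# The downloaded file is saved as pdb1tup.ent; parse it into a structure
parser = PDB.PDBParser()
p53_1tup = parser.get_structure('P 53', 'pdb1tup.ent')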

Then, let’s recall the types of residues that we have by collecting the hetero flag, which is the first field of each residue ID:

my_residues = set()
for residue in p53_1tup.get_residues():
    my_residues.add(residue.id[0])
print(my_residues)

So, besides the blank flag (' ') that marks standard residues, we have H_ ZN (zinc) and W (water), which are HETATM types; the vast majority of residues are standard ones stored as ATOM records.
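
To make the meaning of the hetero flag concrete, here is a small sketch that sorts residue IDs into categories. It assumes the Biopython convention that a residue ID is a tuple of (hetero flag, sequence number, insertion code). The helper residue_kind and the sample IDs are hypothetical illustrations, not values read from 1TUP:

```python
from collections import Counter


def residue_kind(residue_id):
    """Classify a residue by the hetero flag in the first field of its ID."""
    het_flag = residue_id[0]
    if het_flag == " ":
        return "standard"   # regular residue from an ATOM record
    if het_flag == "W":
        return "water"      # water molecule
    if het_flag.startswith("H_"):
        return "hetero"     # other hetero residue, such as 'H_ ZN'
    return "unknown"


# Hypothetical residue IDs shaped like (hetero flag, sequence number, insertion code)
sample_ids = [(" ", 94, " "), (" ", 95, " "), ("H_ ZN", 951, " "), ("W", 401, " ")]
counts = Counter(residue_kind(rid) for rid in sample_ids)
print(counts)
```

With the structure loaded earlier, the same helper could be applied to residue.id for each residue returned by p53_1tup.get_residues().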

Let’s compute the masses for all chains...