Bioinformatics with Python Cookbook

Bioinformatics with Python Cookbook - Second Edition

By: Tiago Antao

Overview of this book

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data. This book covers next-generation sequencing, genomics, metagenomics, population genetics, phylogenetics, and proteomics. You'll learn modern programming techniques to analyze large amounts of biological data. With the help of real-world examples, you'll convert, analyze, and visualize datasets using various Python tools and libraries. This book will help you get a better understanding of working with a Galaxy server, which is the most widely used bioinformatics web-based pipeline system. This updated edition also includes advanced next-generation sequencing filtering techniques. You'll also explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks such as Dask and Spark. By the end of this book, you'll be able to use and implement modern programming techniques and frameworks to deal with the ever-increasing deluge of bioinformatics data.

Installing the required software with Anaconda

Before we get started, we need to install some prerequisite software. The following sections will take you through the software and the steps needed to install it. An alternative way to start is to use the Docker recipe, after which everything will be taken care of for you via a Docker container.

If you are already using a different Python version, you are strongly encouraged to consider Anaconda, as it has become the de facto standard for data science. Also, it is the distribution that will allow you to install software from Bioconda (https://bioconda.github.io/).

Getting ready

Python can be run on top of different environments. For instance, you can use Python inside the Java Virtual Machine (JVM) (via Jython) or with .NET (with IronPython). However, here, we are concerned not only with Python, but also with the complete software ecology around it; therefore, we will use the standard (CPython) implementation, since the JVM and .NET versions exist mostly to interact with the native libraries of these platforms. A potentially viable alternative would be to use the PyPy implementation of Python (not to be confused with the Python Package Index, PyPI).

Save for noted exceptions, we will be using Python 3 only. If you are just starting with Python and bioinformatics, any operating system will work, but here we are mostly concerned with intermediate to advanced usage. So, while you can probably use Windows and macOS, most heavy-duty analysis will be done on Linux (probably on a Linux cluster). Next-generation sequencing (NGS) data analysis and complex machine learning are mostly performed on Linux clusters.

If you are on Windows, you should consider upgrading to Linux for your bioinformatics work because most modern bioinformatics software will not run on Windows. macOS will be fine for almost all analyses, unless you plan to use a computer cluster, which will probably be Linux-based.

If you are on Windows or macOS and do not have easy access to Linux, don't worry. Modern virtualization software (such as VirtualBox and Docker) will come to your rescue by letting you run a virtual Linux on top of your operating system. If you are working with Windows and decide that you want to go native and not use Anaconda, be careful with your choice of libraries; you are probably safer if you install the 32-bit version of everything (including Python itself).
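
If you are unsure which interpreter and platform you are actually running, the points above can be checked from Python itself. The following is only a minimal sketch using the standard library; the function name describe_interpreter is our own choice for illustration and is not part of the book's code:

```python
import platform
import struct
import sys


def describe_interpreter():
    """Collect the interpreter facts discussed above into a dictionary."""
    return {
        # "CPython", "PyPy", "Jython" or "IronPython"
        "implementation": platform.python_implementation(),
        # Major version number; the recipes assume 3
        "major_version": sys.version_info.major,
        # "Linux", "Darwin" (macOS) or "Windows"
        "operating_system": platform.system(),
        # Pointer size in bits: 32 or 64
        "bits": struct.calcsize("P") * 8,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

On the setup recommended in this recipe, you would expect the implementation to be CPython and the major version to be 3.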

Note

If you are on Windows, many tools will be unavailable to you.

Note

Bioinformatics and data science are moving at breakneck speed; this is not just hype, it's a reality. When installing software libraries, choosing a version might be tricky. Depending on the code that you have, it might not work with some old versions, or maybe not even work with a newer version. Hopefully, any code that you use will indicate the correct dependencies—though this is not guaranteed.
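
When some code states the versions it depends on, it can help to compare them against what is actually installed. The following is a hedged sketch, assuming Python 3.8 or later (where importlib.metadata is part of the standard library); the helper names installed_versions and mismatches are hypothetical, and the pin shown is simply the Biopython version used later in this recipe:

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def mismatches(expected):
    """Return {name: (wanted, found)} for every pin that is not satisfied."""
    found = installed_versions(expected)
    return {
        name: (wanted, found[name])
        for name, wanted in expected.items()
        if found[name] != wanted
    }


if __name__ == "__main__":
    # biopython 1.70 is the version pinned in this recipe
    print(mismatches({"biopython": "1.70"}))
```

An empty dictionary means that every pin is satisfied. This sketch compares version strings exactly, so it will also report a newer version that might work perfectly well.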

The software developed for this book is available at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition. To access it, you will need to install Git (getting used to Git may be a good idea anyway, because lots of scientific computing software is being developed with it). Alternatively, you can download the ZIP file that GitHub makes available.

Before you install the Python stack properly, you will need to install all the external non-Python software that you will be interoperating with. The list will vary from chapter to chapter, and all chapter-specific packages will be explained in their respective chapters. Some less common Python libraries may also be referred to in their specific chapters. Fortunately, since the first edition of this book, most bioinformatics software can be easily installed with conda using the Bioconda project.

If you are not interested in a specific chapter, you can skip the related packages and libraries. Of course, you will probably have many other bioinformatics applications around—such as Burrows-Wheeler Aligner (bwa) or Genome Analysis Toolkit (GATK) for NGS—but we will not discuss these because we do not interact with them directly (although we might interact with their outputs).

You will need to install some development compilers and libraries, all of which are free. On Ubuntu, consider installing the build-essential package (apt-get it), and on macOS, consider Xcode (https://developer.apple.com/xcode/).

In the following table, you will find a list of the most important Python software:

Name	Application	URL	Purpose
Project Jupyter	All chapters	https://jupyter.org/	Interactive computing
pandas	All chapters	https://pandas.pydata.org/	Data processing
NumPy	All chapters	http://www.numpy.org/	Array/matrix processing
SciPy	All chapters	https://www.scipy.org/	Scientific computing
Biopython	All chapters	https://biopython.org/	Bioinformatics library
PyVCF	NGS	https://pyvcf.readthedocs.io	VCF processing
Pysam	NGS	https://github.com/pysam-developers/pysam	SAM/BAM processing
HTSeq	NGS/Genomes	https://htseq.readthedocs.io	NGS processing
simuPOP	Population genetics	http://simupop.sourceforge.net/	Population genetics simulation
DendroPy	Phylogenetics	https://dendropy.org/	Phylogenetics
scikit-learn	Machine learning/population genetics	http://scikit-learn.org	Machine learning library
PyMOL	Proteomics	https://pymol.org	Molecular visualization
rpy2	Introduction	https://rpy2.readthedocs.io	R interface
seaborn	All chapters	http://seaborn.pydata.org/	Statistical chart library
Cython	Big data	http://cython.org/	High performance
Numba	Big data	https://numba.pydata.org/	High performance
Dask	Big data	http://dask.pydata.org	Parallel processing

We have taken a somewhat conservative approach in most of the recipes with regard to the processing of tabular data. While we use pandas every now and then, most of the time, we use standard Python. As time advances and pandas becomes more pervasive, it will probably make sense to just process all tabular data with it (if it fits in memory).
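
To make the contrast concrete, here is a small sketch of the standard Python style of tabular processing. The data, the column names, and the function mean_depth_by_chrom are invented for illustration and do not come from the book's datasets; the pandas call in the final comment is only a rough equivalent:

```python
import csv
import io
from collections import defaultdict

# Hypothetical tab-separated data: per-sample read depth on two chromosomes
TABLE = "sample\tchrom\tdepth\ns1\t2L\t30\ns2\t2L\t50\ns1\t3R\t20\n"


def mean_depth_by_chrom(handle):
    """Average the depth column per chromosome using only the standard library."""
    depths = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        depths[row["chrom"]].append(float(row["depth"]))
    return {chrom: sum(values) / len(values) for chrom, values in depths.items()}


if __name__ == "__main__":
    print(mean_depth_by_chrom(io.StringIO(TABLE)))
    # With pandas, the same summary is roughly:
    #   pd.read_csv(io.StringIO(TABLE), sep="\t").groupby("chrom")["depth"].mean()
```

The standard library version reads one row at a time, so it does not need the whole table in memory, whereas the pandas version is shorter but loads everything at once.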

How to do it...

Take a look at the following steps to get started:

Start by downloading the Anaconda distribution from https://www.anaconda.com/download. Choose Python version 3, although this choice is not critical, because Anaconda will let you use Python 2 if you need it. You can accept all the installation defaults, but you may want to make sure that the conda binaries are in your PATH (do not forget to open a new terminal window so that the PATH is updated). If you have another Python distribution, be careful with your PYTHONPATH and existing Python libraries. It's probably better to unset your PYTHONPATH. As far as possible, uninstall all other Python versions and installed Python libraries.

Let's go ahead with the libraries. We will now create a new conda environment called bioinformatics with biopython=1.70, as shown in the following command:

conda create -n bioinformatics biopython=1.70

Let's activate the environment, as follows:

source activate bioinformatics
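
At this point, it may be worth confirming that the environment is really active and that the earlier advice about PATH and PYTHONPATH has been followed. The following is a minimal sketch; the function name environment_report is our own, and we are assuming that conda exports the CONDA_DEFAULT_ENV variable on activation, which is its usual behavior:

```python
import os
import shutil
import sys


def environment_report(environ=None):
    """Summarize the settings mentioned in the steps above."""
    environ = os.environ if environ is None else environ
    return {
        # Name of the active conda environment, or None if none is active
        "active_env": environ.get("CONDA_DEFAULT_ENV"),
        # Full path to the conda executable, or None if it cannot be found
        "conda_on_path": shutil.which("conda", path=environ.get("PATH")),
        # Ideally None, since a stray PYTHONPATH can shadow conda's libraries
        "pythonpath": environ.get("PYTHONPATH"),
        # The Python executable that is running this script
        "interpreter": sys.executable,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If everything went well, active_env should be bioinformatics, pythonpath should be None, and the interpreter path should point inside your Anaconda installation.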

Let's add the bioconda and conda-forge channels to our source list:

conda config --add channels bioconda
conda config --add channels conda-forge

Also, install the core packages:

conda install scipy matplotlib jupyter-notebook pip pandas cython numba scikit-learn seaborn pysam pyvcf simuPOP dendropy rpy2

Some of them will probably be installed with the core distribution anyway.
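
Once the installation finishes, a quick way to see whether anything is missing is to ask Python which packages it can find. This is only a sketch: the mapping from package names to import names below is our assumption based on how these libraries are normally imported (for example, Biopython is imported as Bio and PyVCF as vcf), and missing_packages is a hypothetical helper:

```python
from importlib.util import find_spec

# Package name -> top-level import name (they often differ)
IMPORT_NAMES = {
    "biopython": "Bio",
    "scipy": "scipy",
    "matplotlib": "matplotlib",
    "pandas": "pandas",
    "cython": "Cython",
    "numba": "numba",
    "scikit-learn": "sklearn",
    "seaborn": "seaborn",
    "pysam": "pysam",
    "pyvcf": "vcf",
    "simuPOP": "simuPOP",
    "dendropy": "dendropy",
    "rpy2": "rpy2",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the packages whose import name cannot be found."""
    return sorted(
        dist for dist, module in import_names.items() if find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

Note that find_spec only locates a package without importing it, so this check is fast but will not detect a library that is present and broken.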

We can even install R from conda:

conda install r-essentials r-gridextra

r-essentials installs a lot of R packages, including ggplot2, which we will use later. We also install r-gridextra, since we will be using it in the Notebook.

There's more...

Compared to the first edition of this book, this recipe is now highly simplified. There are two main reasons for this: the Bioconda project, and the fact that we only need to support Anaconda, as it has become a standard. If you feel strongly against using Anaconda, you will be able to install many of the Python libraries via pip. You will probably need quite a few compilers and build tools: not only C compilers, but also C++ and Fortran.
