Bioinformatics with Python Cookbook - Second Edition

By: Tiago Antao

Overview of this book

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data. This book covers next-generation sequencing, genomics, metagenomics, population genetics, phylogenetics, and proteomics. You'll learn modern programming techniques to analyze large amounts of biological data. With the help of real-world examples, you'll convert, analyze, and visualize datasets using various Python tools and libraries. This book will help you get a better understanding of working with a Galaxy server, which is the most widely used bioinformatics web-based pipeline system. This updated edition also includes advanced next-generation sequencing filtering techniques. You'll also explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks such as Dask and Spark. By the end of this book, you'll be able to use and implement modern programming techniques and frameworks to deal with the ever-increasing deluge of bioinformatics data.

Title Page

About Packt

Contributors

Preface

Python and the Surrounding Software Ecology

Introduction

Installing the required software with Anaconda

Installing the required software with Docker

Interfacing with R via rpy2

Performing R magic with Jupyter Notebook

Next-Generation Sequencing

Introduction

Accessing GenBank and moving around NCBI databases

Performing basic sequence analysis

Working with modern sequence formats

Working with alignment data

Analyzing data in VCF

Studying genome accessibility and filtering SNP data

Processing NGS data with HTSeq

Working with Genomes

Introduction

Working with high-quality reference genomes

Dealing with low-quality genome references

Traversing genome annotations

Extracting genes from a reference using annotations

Finding orthologues with the Ensembl REST API

Retrieving gene ontology information from Ensembl

Population Genetics

Introduction

Managing datasets with PLINK

Introducing the Genepop format

Exploring a dataset with Bio.PopGen

Computing F-statistics

Performing Principal Components Analysis

Investigating population structure with admixture

Population Genetics Simulation

Introduction

Introducing forward-time simulations

Simulating selection

Simulating population structure using island and stepping-stone models

Modeling complex demographic scenarios

Phylogenetics

Introduction

Preparing a dataset for phylogenetic analysis

Aligning genetic and genomic data

Comparing sequences

Reconstructing phylogenetic trees

Playing recursively with trees

Visualizing phylogenetic data

Using the Protein Data Bank

Introduction

Finding a protein in multiple databases

Introducing Bio.PDB

Extracting more information from a PDB file

Computing molecular distances on a PDB file

Performing geometric operations

Animating with PyMOL

Parsing mmCIF files using Biopython

Bioinformatics Pipelines

Introduction

Introducing Galaxy servers

Accessing Galaxy using the API

Developing a Galaxy tool

Using generic pipelines with bioinformatics data

Deploying a variant analysis pipeline with Airflow

Python for Big Genomics Datasets

Introduction

Using high-performance data formats – HDF5

Doing parallel computing with Dask

Using high-performance data formats – Parquet

Computing sequencing statistics using Spark

Optimizing code with Cython and Numba

Preface

Whether you are reading this book as a computational biologist or a Python programmer, you will probably relate to the phrase "explosive growth, exciting times." The recent growth in the use of Python is strongly connected with its status as big data's main programming language. The deluge of data in biology, mostly from genomics and proteomics, makes bioinformatics one of the forefront applications of data science. There is a massive need for bioinformaticians to analyze all this data; of course, one of the main tools is Python. We will not only talk about the programming language but also the whole community and software ecology behind it.

When you choose Python to analyze your data, you expect to get an extensive set of libraries, ranging from statistical analysis to plotting, parallel programming, machine learning, and bioinformatics. However, you actually get even more than this; the community has a tradition of providing good documentation, reliable libraries, and frameworks. It is also friendly and supportive of all its participants.

In this book, we will present practical solutions to modern bioinformatics problems using Python. Our approach will be hands-on; we will address important topics, such as next-generation sequencing, genomics, population genetics, phylogenetics, and proteomics.

At this stage, you probably know the language reasonably well and are aware of the basic analysis methods in your field of research. You will dive directly into relevant complex computational biology problems and learn how to tackle them with Python. This is not your first Python book or your first biology lesson; this is where you will find reliable and pragmatic solutions to realistic and complex problems.

The first edition of this book took several high-risk decisions a few years ago, considering Docker, Jupyter Notebook, and even Python 3 were not obvious choices. These choices worked perfectly well. The second edition once again uses these technologies, which are now standard in the field. Probably due to bioinformatics being a more mature field, there are no high-risk options now. There is new content on pipelines, parallel processing systems, and file formats, but none of them are unsafe bets.

Who this book is for

This book is for data scientists, bioinformatics analysts, researchers, and Python developers who want to address intermediate-to-advanced biological and bioinformatics problems using a recipe-based approach. A working knowledge of the Python programming language is expected.

What this book covers

Chapter 1, Python and the Surrounding Software Ecology, tells you how to set up a modern bioinformatics environment with Python. This chapter discusses how to deploy software using Docker, interface with R, and interact with the IPython Notebook.

Chapter 2, Next-Generation Sequencing, provides concrete solutions to deal with next-generation sequencing data. This chapter teaches you how to deal with large FASTQ, BAM, and VCF files. It also discusses data filtering.

Chapter 3, Working with Genomes, not only deals with high-quality references—such as the human genome—but also discusses how to analyze other low-quality references typical in nonmodel species. It introduces GFF processing, teaches you to analyze genomic feature information, and discusses how to use gene ontologies.

Chapter 4, Population Genetics, describes how to perform population genetics analysis of empirical datasets. For example, with Python, we can perform Principal Components Analysis, compute F_ST, or produce structure/admixture plots.

Chapter 5, Population Genetics Simulation, covers simuPOP, an extremely powerful Python-based forward-time population genetics simulator. This chapter shows you how to simulate different selection and demographic regimes. It also briefly discusses coalescent simulation.

Chapter 6, Phylogenetics, uses complete sequences of recently sequenced Ebola viruses to perform real phylogenetic analysis, which includes tree reconstruction and sequence comparisons. This chapter discusses recursive algorithms to process tree-like structures.

Chapter 7, Using the Protein Data Bank, focuses on processing PDB files, for example, performing the geometric analysis of proteins. This chapter takes a look at protein visualization.

Chapter 8, Bioinformatics Pipelines, introduces two types of pipelines. The first is Galaxy, a widely used Python-based system with a web interface that mostly targets non-programming users, although bioinformaticians might still have to interact with it programmatically. The second is Airflow, a pipeline system that targets programmers.

Chapter 9, Python for Big Genomics Datasets, discusses high-performance programming techniques necessary to handle big datasets. It briefly discusses parallel processing with Dask and Spark. Code optimization frameworks (such as Numba or Cython) are introduced. Finally, efficient file formats such as HDF5 or Parquet are presented.

Chapter 10, Other Topics in Bioinformatics, talks about how to analyze data made available by the Global Biodiversity Information Facility (GBIF) and how to use Cytoscape, a powerful platform to visualize complex networks. This chapter also looks at how to work with geo-referenced data and map-based services.

Chapter 11, Advanced NGS Processing, covers advanced programming techniques to filter NGS data. These include the use of Mendelian datasets that are then analyzed by standard statistics and machine learning techniques.
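
As a small taste of the recursive tree processing mentioned for Chapter 6, here is a minimal sketch in plain Python. It is not taken from the book and does not use Biopython; the nested-tuple tree representation, the taxon labels, and the function names (count_leaves, tree_depth, leaf_names) are all assumptions made purely for illustration.

```python
# Hypothetical toy tree (an assumption for illustration only):
# a leaf is a string (a taxon label), an internal node is a tuple of subtrees.
toy_tree = (("A", "B"), ("C", ("D", "E")))


def count_leaves(node):
    """Recursively count the leaves (taxa) below a node."""
    if isinstance(node, str):  # base case: a single leaf
        return 1
    # recursive case: add up the leaves of every child subtree
    return sum(count_leaves(child) for child in node)


def tree_depth(node):
    """Return the number of edges on the longest path from node to a leaf."""
    if isinstance(node, str):
        return 0
    return 1 + max(tree_depth(child) for child in node)


def leaf_names(node):
    """Collect leaf labels in left-to-right order."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaf_names(child))
    return names


print(count_leaves(toy_tree))
print(tree_depth(toy_tree))
print(leaf_names(toy_tree))
```

Each function has the same shape: a base case for a leaf and a recursive case that combines the results from the children. Real phylogenetic tree objects are richer than this, but the pattern is the same.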

To get the most out of this book

Modern bioinformatics analysis is normally performed on a Linux server. Most of our recipes will also work on macOS. They should also work on Windows in theory, but this is not recommended. If you do not have a Linux server, you can use free virtualization software, such as VirtualBox, to run Linux on a Windows/macOS computer. An alternative that we explore in the book is to use Docker containers, which can be run on Windows and macOS.

As modern bioinformatics is a big data discipline, you will need a reasonable amount of memory; at least 8 GB on a native Linux machine, probably 16 GB on a macOS/Windows system, but more would be better. A broadband internet connection will also be necessary to download the real and hands-on datasets used in the book.

Python is a requirement. With few exceptions, the code will need Python 3. Many free Python libraries will also be required and these will be presented in the book. Biopython, NumPy, SciPy, and Matplotlib are used in almost all chapters. Although Jupyter Notebook is not strictly required, it's highly encouraged. Different chapters will also require various bioinformatics tools. All the tools used in the book are freely available and thorough instructions are provided in the relevant chapters of this book.
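
Before starting, it can help to confirm that the interpreter and the core libraries named above are in place. The following is a minimal sketch using only the standard library; the function names check_environment and python_is_supported are hypothetical helpers, not part of the book's code, and the import names (Bio for Biopython, numpy, scipy, matplotlib) are the usual ones for these packages.

```python
import importlib.util
import sys

# Display name -> import name for the libraries used in almost all chapters.
# "Bio" is the import name of Biopython.
CORE_LIBRARIES = {
    "Biopython": "Bio",
    "NumPy": "numpy",
    "SciPy": "scipy",
    "Matplotlib": "matplotlib",
}


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter is Python 3 or later."""
    return version_info[0] >= 3


def check_environment(libraries=CORE_LIBRARIES):
    """Map each library name to True if its module can be found, else False."""
    return {
        name: importlib.util.find_spec(module) is not None
        for name, module in libraries.items()
    }


print("Python 3 or later:", python_is_supported())
for name, available in check_environment().items():
    print(f"{name}: {'found' if available else 'MISSING'}")
```

The check only tests whether each module can be located; it does not import the libraries or verify their versions, so treat it as a quick sanity check rather than a guarantee that every recipe will run.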

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

1. Log in or register at www.packt.com.
2. Select the SUPPORT tab.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789344691_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We will read the data from our file using R's read.delim function."

A block of code is set as follows:

import os
from IPython.display import Image
import rpy2.robjects as robjects
import pandas as pd
from rpy2.robjects import pandas2ri

Any command-line input or output is written as follows:

conda install r-essentials r-gridextra

Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "On the top menu choose User, inside choose Preferences."

Note

Warnings or important notes appear like this.

Tip

Tips and tricks appear like this.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also).

To give clear instructions on how to complete a recipe, use these sections as follows:

Getting ready

This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section provides additional information that deepens your understanding of the recipe.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.
