Bioinformatics with Python Cookbook - Second Edition

By: Tiago Antao

Overview of this book

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data. This book covers next-generation sequencing, genomics, metagenomics, population genetics, phylogenetics, and proteomics. You'll learn modern programming techniques to analyze large amounts of biological data. With the help of real-world examples, you'll convert, analyze, and visualize datasets using various Python tools and libraries. This book will help you get a better understanding of working with a Galaxy server, which is the most widely used bioinformatics web-based pipeline system. This updated edition also includes advanced next-generation sequencing filtering techniques. You'll also explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks such as Dask and Spark. By the end of this book, you'll be able to use and implement modern programming techniques and frameworks to deal with the ever-increasing deluge of bioinformatics data.

Exploring the data with standard statistics


Now that we have a compass from the decision tree, let's explore the data to gain insights that might help us filter it better. You can find this content in Chapter11/Exploration.ipynb.

How to do it…

  1. We start, as usual, with the necessary imports:
import gzip
import pickle
import random

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

%matplotlib inline
  2. Then we load the data. We will use pandas to navigate it:
fit = np.load(gzip.open('balanced_fit.npy.gz', 'rb'))
ordered_features = np.load(open('ordered_features', 'rb'))
num_features = len(ordered_features)
# list() ensures + concatenates the labels even if ordered_features is a NumPy array
fit_df = pd.DataFrame(fit, columns=list(ordered_features) + ['pos', 'error'])
num_samples = 80
del fit
  3. Let's ask pandas to show a histogram of all annotations:
fig, ax = plt.subplots(figsize=(16, 9))
fit_df.hist(column=ordered_features, ax=ax)

The following histogram is generated:

Histogram of all annotations for a dataset...
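
The loading and histogram steps above can be tried without the recipe's data files. The following is only a minimal sketch on synthetic data: the annotation names (DP, MQ, QD), the random values, and the toy_* variable names are illustrative assumptions, not the book's dataset. It builds a DataFrame with the same column layout (annotations followed by 'pos' and 'error'), draws one histogram panel per annotation, and prints a few summary statistics.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the recipe's inputs (NOT the book's data).
rng = np.random.default_rng(42)
toy_features = np.array(["DP", "MQ", "QD"])  # assumed annotation names
toy_fit = rng.normal(size=(200, len(toy_features) + 2))  # annotations + 'pos' + 'error'

# Convert to a list before appending the two extra labels: with a NumPy
# array, + is an element-wise operation rather than list concatenation.
columns = list(toy_features) + ["pos", "error"]
toy_df = pd.DataFrame(toy_fit, columns=columns)

# One histogram panel per annotation column; pandas returns an array of Axes.
axes = toy_df.hist(column=list(toy_features), figsize=(16, 9))

# Standard descriptive statistics for the same columns.
summary = toy_df[list(toy_features)].describe()
print(summary.loc[["mean", "std", "min", "max"]])
```

In a notebook, the Agg backend line can be dropped in favor of %matplotlib inline, as in step 1. Letting hist create its own figure through figsize, instead of passing a single ax, is a choice made for this sketch: pandas needs one subplot per column and otherwise clears the figure that ax belongs to.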