Book Image

Bioinformatics with Python Cookbook - Third Edition

By : Tiago Antao
Book Image

Bioinformatics with Python Cookbook - Third Edition

By: Tiago Antao

Overview of this book

Bioinformatics is an active research field that uses a range of simple-to-advanced computations to extract valuable information from biological data, and this book will show you how to manage these tasks using Python. This updated third edition of the Bioinformatics with Python Cookbook begins with a quick overview of the various tools and libraries in the Python ecosystem that will help you convert, analyze, and visualize biological datasets. Next, you'll cover key techniques for next-generation sequencing, single-cell analysis, genomics, metagenomics, population genetics, phylogenetics, and proteomics with the help of real-world examples. You'll learn how to work with important pipeline systems, such as Galaxy servers and Snakemake, and understand the various modules in Python for functional and asynchronous programming. This book will also help you explore topics such as SNP discovery using statistical approaches under high-performance computing frameworks, including Dask and Spark. In addition to this, you’ll explore the application of machine learning algorithms in bioinformatics. By the end of this bioinformatics Python book, you'll be equipped with the knowledge you need to implement the latest programming techniques and frameworks, empowering you to deal with bioinformatics data on every scale.
Table of Contents (15 chapters)

Understanding NumPy as the engine behind Python data science and bioinformatics

Most of your analysis will make use of NumPy, even if you don’t use it explicitly. NumPy is an array manipulation library that is behind libraries such as pandas, Matplotlib, Biopython, and scikit-learn, among many others. While much of your bioinformatics work may not require explicit direct use of NumPy, you should be aware of its existence as it underpins almost everything you do, even if only indirectly via the other libraries.

In this recipe, we will use VAERS data to demonstrate how NumPy is behind many of the core libraries that we use. This is a very light introduction to the library so that you are aware that it exists and that it is behind almost everything. Our example will extract the number of cases from the five US states with more adverse effects, splitting them into age bins: 0 to 19 years, 20 to 39, up to 100 to 119.

Getting ready

Once again, we will be using the data from the first recipe, so make sure it’s available. The code for it can be found in Chapter02/NumPy.py.

How to do it…

Follow these steps:

  1. Let’s start by loading the data with pandas and reducing the data so that it’s related to the top five US states only:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    vdata = pd.read_csv(
        "2021VAERSDATA.csv.gz", encoding="iso-8859-1")
    vdata["STATE"] = vdata["STATE"].str.upper()
    top_states = pd.DataFrame({
        "size": vdata.groupby("STATE").size().sort_values(ascending=False).head(5)}).reset_index()
    top_states["rank"] = top_states.index
    top_states = top_states.set_index("STATE")
    top_vdata = vdata[vdata["STATE"].isin(top_states.index)]
    top_vdata["state_code"] = top_vdata["STATE"].apply(
        lambda state: top_states["rank"].at[state]
    ).astype(np.uint8)
    top_vdata = top_vdata[top_vdata["AGE_YRS"].notna()]
    top_vdata.loc[:,"AGE_YRS"] = top_vdata["AGE_YRS"].astype(int)
    top_states

The top states are as follows. This rank will be used later to construct a NumPy matrix:

Figure 2.2 – US states with largest numbers of adverse effects

Figure 2.2 – US states with largest numbers of adverse effects

  1. Now, let’s extract the two NumPy arrays that contain age and state data:
    age_state = top_vdata[["state_code", "AGE_YRS"]]
    age_state["state_code"]
    state_code_arr = age_state["state_code"].values
    type(state_code_arr), state_code_arr.shape, state_code_arr.dtype
    age_arr = age_state["AGE_YRS"].values
    type(age_arr), age_arr.shape, age_arr.dtype

Note that the data that underlies pandas is NumPy data (the values call for both Series returns NumPy types). Also, you may recall that pandas has properties such as .shape or .dtype: these were inspired by NumPy and behave the same.

  1. Now, let’s create a NumPy matrix from scratch (a 2D array), where each row is a state and each column represents an age group:
    age_state_mat = np.zeros((5,6), dtype=np.uint64)
    for row in age_state.itertuples():
        age_state_mat[row.state_code, row.AGE_YRS//20] += 1
    age_state_mat

The array has five rows – one for each state – and six columns – one for each age group. All the cells in the array must have the same type.

We initialize the array with zeros. There are many ways to initialize arrays, but if you have a very large array, initializing it may take a lot of time. Sometimes, depending on your task, it might be OK that the array is empty at the beginning (meaning it was initialized with random trash). In that case, using np.empty will be much faster. We use pandas iteration here: this is not the best way to do things from a pandas perspective, but we want to make the NumPy part very explicit.

  1. We can extract a single row – in our case, the data for a state – very easily. The same applies to a column. Let’s take California data and then the 0-19 age group:
    cal = age_state_mat[0,:]
    kids = age_state_mat[:,0]

Note the syntax to extract a row or a column. It should be familiar to you, given that pandas copied the syntax from NumPy and we encountered it in previous recipes.

  1. Now, let’s compute a new matrix where we have the fraction of cases per age group:
    def compute_frac(arr_1d):
        return arr_1d / arr_1d.sum()
    frac_age_stat_mat = np.apply_along_axis(compute_frac, 1, age_state_mat)

The last line applies the compute_frac function to all rows. compute_frac takes a single row and returns a new row where all the elements are divided by the total sum.

  1. Now, let’s create a new matrix that acts as a percentage instead of a fraction – simply because it reads better:
    perc_age_stat_mat = frac_age_stat_mat * 100
    perc_age_stat_mat = perc_age_stat_mat.astype(np.uint8)
    perc_age_stat_mat

The first line simply multiplies all the elements of the 2D array by 100. Matplotlib is smart enough to traverse different array structures. That line will work if it’s presented with an array with any dimensions and would do exactly what is expected.

Here is the result:

Figure 2.3 – A matrix representing the distribution of vaccine-adverse effects 
in the five US states with the most cases

Figure 2.3 – A matrix representing the distribution of vaccine-adverse effects in the five US states with the most cases

  1. Finally, let’s create a graphical representation of the matrix using Matplotlib:
    fig = plt.figure()
    ax = fig.add_subplot()
    ax.matshow(perc_age_stat_mat, cmap=plt.get_cmap("Greys"))
    ax.set_yticks(range(5))
    ax.set_yticklabels(top_states.index)
    ax.set_xticks(range(6))
    ax.set_xticklabels(["0-19", "20-39", "40-59", "60-79", "80-99", "100-119"])
    fig.savefig("matrix.png")

Do not dwell too much on the Matplotlib code – we are going to discuss it in the next recipe. The fundamental point here is that you can pass NumPy data structures to Matplotlib. Matplotlib, like pandas, is based on NumPy.

See also

The following is some extra information that may be useful:

  • NumPy has many more features than the ones we’ve discussed here. There are plenty of books and tutorials on them. The official documentation is a good place to start: https://numpy.org/doc/stable/.
  • There are many important issues to discover with NumPy, but probably one of the most important is broadcasting: NumPy’s ability to take arrays of different structures and get the operations right. For details, go to https://numpy.org/doc/stable/user/theory.broadcasting.html.