Bioinformatics with Python Cookbook

By: Tiago R Antao, Tiago Antao
Installing the required software with Anaconda


Before we get started, we need to install some prerequisite software. The following sections will take you through the software and the steps needed to install it. An alternative way to start is to use the Docker recipe, after which everything will be taken care of for you via a Docker container.

If you are already using a different Python version, you are encouraged to continue using your preferred version, although you will have to adapt the following instructions to suit your environment.

Getting ready

Python can be run on top of different environments. For instance, you can use Python inside the JVM (via Jython) or with .NET (with IronPython). However, here, we are concerned not only with Python, but also with the complete software ecosystem around it; therefore, we will use the standard (CPython) implementation because the JVM and .NET versions exist mostly to interact with the native libraries of these platforms. A potentially viable alternative is the PyPy implementation of Python (not to be confused with PyPI, the Python Package Index).

An important decision is whether to choose Python 2 or 3. Here, we will support both versions whenever possible, but there are a few issues that you should be aware of. First, if you work with phylogenetics, you will probably have to go with Python 2 because most existing Python phylogenetics libraries do not support version 3. Second, in the short term, Python 2 is generally better supported, but (save for the aforementioned phylogenetics topic) Python 3 is well covered for computational biology. Finally, if you believe that you are in this for the long run, Python 3 is the place to be. Whatever your choice, we will support both options unless clearly stated otherwise. If you go for Python 2, use 2.7 (or newer if it has been released). With Python 3, use at least 3.4.
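As a quick sanity check of the interpreter you end up with, the following is a minimal sketch that compares the running Python against the minimums named above (2.7 and 3.4). The helper name meets_minimum and the MINIMUM_BY_MAJOR table are illustrative assumptions, not part of the book's code.

```python
import platform
import sys

# Minimum versions taken from the text: 2.7 on the Python 2 line, 3.4 on Python 3.
MINIMUM_BY_MAJOR = {2: (2, 7), 3: (3, 4)}


def meets_minimum(version_info):
    """Return True if (major, minor, ...) satisfies the recipe's minimum."""
    major, minor = version_info[0], version_info[1]
    minimum = MINIMUM_BY_MAJOR.get(major)
    if minimum is None:
        # A major version the recipe does not discuss; treat it as unsupported.
        return False
    return (major, minor) >= minimum


if __name__ == "__main__":
    # CPython is the implementation this recipe assumes.
    print("Implementation:", platform.python_implementation())
    print("Version:", platform.python_version())
    print("Meets recipe minimum:", meets_minimum(sys.version_info))
```

Run it with the interpreter you intend to use, so that the report describes that interpreter and not another one on your PATH.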

If you are just starting with Python and bioinformatics, any operating system will work, but here, we are mostly concerned with intermediate to advanced usage. So, while you can probably use Windows and Mac OS X, most heavy-duty analysis will be done on Linux (probably on a Linux cluster). Next-generation sequencing data analysis and complex machine learning are mostly performed on Linux clusters.

If you are on Windows, you should consider switching to Linux for your bioinformatics work because much modern bioinformatics software will not run on Windows. Mac OS X will be fine for almost all analyses, unless you plan to use a computer cluster, which will probably be Linux-based.

If you are on Windows or Mac OS X and do not have easy access to Linux, do not worry. Modern virtualization software (such as VirtualBox and Docker) will come to your rescue by allowing you to install a virtual Linux on your operating system. If you are working with Windows and decide that you want to go native and not use Anaconda, be careful with your choice of libraries; you are probably safer if you install the 32-bit version of everything (including Python itself).
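If you are unsure whether an existing Python installation is a 32-bit or 64-bit build, a minimal sketch such as the following reports it. The function name pointer_bits is an illustrative choice; the check relies only on the standard struct module.

```python
import struct
import sys


def pointer_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # "P" is the struct format code for a C void pointer.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("Python executable:", sys.executable)
    print("Interpreter build: %d-bit" % pointer_bits())
```

Compiled libraries need to match this value, so it is worth checking before you mix installers from different sources.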

Remember, if you are on Windows, many tools will be unavailable to you.

Tip

Bioinformatics and data science are moving at breakneck speed; this is not just hype, it's a reality. If you install the default packages of your software framework, be sure not to install old versions. For example, if you are a Debian/Ubuntu Linux user, it's possible that the default matplotlib package of your distribution is too old. In this case, it's advisable to use a recent conda or pip package instead.
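One way to spot an outdated package is to compare its reported version against a minimum that you consider acceptable. The following is a rough sketch; the helper names are hypothetical, and the "1.4" threshold in the usage comment is only an example value, not a requirement stated by this book.

```python
import re


def version_tuple(version):
    """Turn '1.4.3' or '1.5.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            # Stop at the first piece that does not start with a digit.
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_recent_enough(installed, minimum):
    """Compare two dotted version strings, padding the shorter with zeros."""
    have, need = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


# Example usage (requires matplotlib; "1.4" is an arbitrary example threshold):
#   import matplotlib
#   print(is_recent_enough(matplotlib.__version__, "1.4"))
```

This simple comparison ignores pre-release suffixes such as rc1, which is usually good enough for a quick check.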

The software developed for this book is available at https://github.com/tiagoantao/bioinf-python. To access it, you will need to install Git. Alternatively, you can download the ZIP file that GitHub makes available (however, getting used to Git may be a good idea because a lot of scientific computing software is being developed with it).

Before you install the Python stack properly, you will need to install all the external non-Python software that you will be interoperating with. The list will vary from chapter to chapter and all chapter-specific packages will be explained in their respective chapters. Some less common Python libraries may also be referred to in their specific chapters.

If you are not interested in a specific chapter (that is perfectly fine), you can skip the related packages and libraries.

Of course, you will probably have many other bioinformatics applications around, such as bwa or GATK for next-generation sequencing, but we will not discuss these because we do not interact with them directly (although we might interact with their outputs).

You will need to install some development compilers and libraries (all free). On Ubuntu, consider installing the build-essential (apt-get it) package, and on Mac, consider Xcode (https://developer.apple.com/xcode/).
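Before going further, you may want to confirm that the external tools are actually on your PATH. The following is a minimal sketch that assumes Python 3.3 or newer (shutil.which does not exist on Python 2); the function name locate_tools and the tool list are illustrative and should be adapted to your platform.

```python
import shutil


def locate_tools(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Hypothetical tool list; adjust it to the chapters you plan to follow.
    for name, path in sorted(locate_tools(["git", "gcc", "make"]).items()):
        print("%-6s %s" % (name, path or "NOT FOUND"))
```

A missing compiler typically only shows up later as a failed pip build, so checking up front can save some time.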

In the following table, you will find the list of the most important Python software. We strongly recommend the installation of the IPython Notebook (now known as Project Jupyter). While not strictly mandatory, it's becoming a fundamental cornerstone for scientific computing with Python:

Name          Usage                URL                                                 Purpose
IPython       General              http://ipython.org/                                 General
NumPy         General              http://www.numpy.org/                               Numerical Python
SciPy         General              http://scipy.org/                                   Scientific computing
matplotlib    General              http://matplotlib.org/                              Visualization
Biopython     General              http://biopython.org/wiki/Main_Page                 Bioinformatics
PyVCF         NGS                  http://pyvcf.readthedocs.org/en/latest/             VCF processing
PySAM         NGS                  http://pysam.readthedocs.org/en/latest/             SAM/BAM processing
simuPOP       Population Genetics  http://simupop.sourceforge.net/                     Genetics Simulation
DendroPy      Phylogenetics        http://pythonhosted.org/DendroPy/                   Phylogenetics
scikit-learn  General              http://scikit-learn.org/stable/                     Machine learning
PyMOL         Proteomics           http://pymol.org/                                   Molecular visualization
rpy2          R integration        http://rpy.sourceforge.net/                         R interface
pygraphviz    General              http://pygraphviz.github.io/                        Graph library
Reportlab     General              http://reportlab.com/                               Visualization
seaborn       General              http://web.stanford.edu/~mwaskom/software/seaborn/  Visualization/Stats
Cython        Big Data             http://cython.org/                                  High performance
Numba         Big Data             http://numba.pydata.org/                            High performance

Note that the list of available software for Python in general and bioinformatics in particular is constantly increasing. For example, we recommend that you keep an eye on projects such as Blaze (data analysis) or Bokeh (visualization).

How to do it…

Here are the steps to perform the installation:

  1. Start by downloading the Anaconda distribution from http://continuum.io/downloads. You can choose either the Python 2 or the Python 3 version. The choice is not critical at this stage because Anaconda will let you use the other version if you need it. You can accept all the installation defaults, but you may want to make sure that the conda binaries are in your PATH (do not forget to open a new window so that the PATH is updated).

    • If you have another Python distribution, but still decide to try Anaconda, be careful with your PYTHONPATH and existing Python libraries. It's probably better to unset your PYTHONPATH. As much as possible, uninstall all other Python versions and installed Python libraries.

  2. Let's go ahead with libraries. We will now create a new conda environment called bioinformatics with Biopython 1.65, as shown in the following command:

    conda create -n bioinformatics biopython=1.65 python=2.7
    
    • If you want Python 3 (which is more future-proof, but remember the reduced phylogenetics functionality), run the following command:

      conda create -n bioinformatics biopython=1.65 python=3.4
      
  3. Let's activate the environment, as follows:

    source activate bioinformatics
    
  4. Also, install the core packages, as follows:

    conda install scipy matplotlib ipython-notebook binstar pip
    conda install pandas cython numba scikit-learn seaborn
    
  5. We still need pygraphviz, which is not available on conda. Therefore, we need to use pip:

    pip install pygraphviz
    
  6. Now, install the Python bioinformatics packages, apart from Biopython (you only need to install those that you plan to use):

    • These are available on conda:

      conda install -c https://conda.binstar.org/bcbio pysam
      conda install -c https://conda.binstar.org/simupop simuPOP
      
    • These are available via PyPI:

      pip install pyvcf
      pip install dendropy
      
  7. If you need to interoperate with R, of course, you will need to install it; either download it from the R website at http://www.r-project.org/ or use the R provided by your operating system distribution.

    • On a recent Debian/Ubuntu Linux distribution, you can just run the following command as root:

      apt-get install r-bioc-biobase r-cran-ggplot2
      
    • This will install Bioconductor, the main R suite for bioinformatics, and ggplot2, a popular plotting library in R. Of course, this will indirectly take care of installing R.

  8. Alternatively, if you are not on Debian/Ubuntu Linux, do not have root, or prefer to install in your home directory, download and install R manually, and then run the following commands in R:

    source("http://bioconductor.org/biocLite.R")
    biocLite()
    
    • This will install Bioconductor (for detailed instructions, refer to http://www.bioconductor.org/install/). To install ggplot2 and gridExtra, just run the following commands in R:

      install.packages("ggplot2")
      install.packages("gridExtra")
      
  9. Finally, you will need to install rpy2, the R-to-Python bridge. Back at the command line, under the conda bioinformatics environment, run the following command:

    pip install rpy2
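
Once the steps above are complete, you can verify which libraries are importable from the active environment. The following is a minimal sketch; the function name check_imports and the module list are assumptions for illustration, so trim the list to the packages you chose to install. Note that some import names differ from the package names.

```python
import importlib

# Import names differ from package names in a few cases
# (Biopython -> Bio, scikit-learn -> sklearn, PyVCF -> vcf).
RECIPE_MODULES = [
    "Bio", "numpy", "scipy", "matplotlib", "pandas", "Cython", "numba",
    "sklearn", "seaborn", "pygraphviz", "pysam", "simuPOP", "vcf",
    "dendropy", "rpy2",
]


def check_imports(module_names):
    """Return {module_name: True/False} depending on whether import succeeds."""
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in sorted(check_imports(RECIPE_MODULES).items()):
        print("%-12s %s" % (name, "ok" if ok else "MISSING"))
```

A module reported as MISSING is not necessarily a problem; it only matters if you plan to follow the chapter that uses it.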
    

There's more…

There is no requirement to use Anaconda; you can easily install all this software on another Python distribution. Make sure that you have pip installed and use it to install all the packages that we installed with conda. You may need to install more compilers (for example, Fortran) and libraries because installation via pip relies on compilation more than conda does. However, as you also need pip for some packages under conda, you will need some compilers and C development libraries with conda anyway. If you are on Python 3, you will probably have to use pip3 and run Python as python3 (as python/pip will call Python 2 by default on most systems).
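One way to sidestep the pip versus pip3 ambiguity is to invoke pip as a module of the exact interpreter you are running. The following is a small sketch with illustrative function names; it assumes that pip is installed for that interpreter.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(pip_install_command(package))


if __name__ == "__main__":
    # Only print the command here; call pip_install() to actually run it.
    print(" ".join(pip_install_command("dendropy")))
```

Because the command starts with sys.executable, the package ends up in the same installation that runs the script, whichever of python or python3 you used to start it.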

In order to isolate your environment, you may want to consider using virtualenv (http://docs.python-guide.org/en/latest/dev/virtualenvs/). This allows you to create a bioinformatics environment similar to the one on conda.
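To see which environment a given interpreter is actually running in, and whether a stray PYTHONPATH could leak libraries into it, something like the following sketch can help. The function names are hypothetical; CONDA_DEFAULT_ENV is the variable that conda sets when an environment is activated.

```python
import os
import sys


def in_virtualenv():
    """True when running inside a virtualenv or venv environment."""
    # Classic virtualenv sets sys.real_prefix; venv makes base_prefix differ.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def describe_environment(environ=None):
    """Collect the settings that decide where Python looks for libraries."""
    environ = os.environ if environ is None else environ
    return {
        "prefix": sys.prefix,
        "virtualenv": in_virtualenv(),
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
        "pythonpath": environ.get("PYTHONPATH"),
    }


if __name__ == "__main__":
    for key, value in sorted(describe_environment().items()):
        print("%-11s %s" % (key, value))
```

If pythonpath is not None while you are inside a conda or virtualenv environment, consider unsetting it to avoid mixing libraries from different installations.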

See also

  • The Anaconda (http://docs.continuum.io/anaconda/) Python distribution is commonly used, especially because of its intelligent package manager: conda. Although conda was developed by the Python community, it's actually language agnostic.

  • Software installation and package maintenance were never Python's strongest point (hence, the popularity of conda to address this issue). If you want to know the currently recommended installation policies for the standard Python distribution (and avoid old and deprecated alternatives), refer to https://packaging.python.org/.

  • You have probably heard of the IPython Notebook; if not, visit their page at http://ipython.org/notebook.html.