Before we get started, we need to install some prerequisite software. The following sections will take you through the software and the steps needed to install it. An alternative way to start is to use the Docker recipe, after which everything will be taken care of for you via a Docker container.
If you are already using a different Python version, you are strongly encouraged to consider Anaconda, as it has become the de facto standard for data science. Also, it is the distribution that will allow you to install software from Bioconda (https://bioconda.github.io/).
Python can be run on top of different environments. For instance, you can use Python inside the Java Virtual Machine (JVM) (via Jython) or with .NET (with IronPython). However, here, we are concerned not only with Python, but also with the complete software ecosystem around it; therefore, we will use the standard (CPython) implementation, since the JVM and .NET versions exist mostly to interact with the native libraries of these platforms. A potentially viable alternative would be to use the PyPy implementation of Python (not to be confused with the Python Package Index (PyPI)).
Save for noted exceptions, we will be using Python 3 only. If you are just starting with Python and bioinformatics, any operating system will work, but here, we are mostly concerned with intermediate to advanced usage. So, while you can probably use Windows and macOS, most heavy-duty analysis will be done on Linux (probably on a Linux cluster). Next-generation sequencing (NGS) data analysis and complex machine learning are mostly performed on Linux clusters.
If you are on Windows, you should consider upgrading to Linux for your bioinformatics work because most modern bioinformatics software will not run on Windows. macOS will be fine for almost all analyses, unless you plan to use a computer cluster, which will probably be Linux-based.
If you are on Windows or macOS and do not have easy access to Linux, don't worry. Modern virtualization software (such as VirtualBox and Docker) will come to your rescue by allowing you to run a virtual Linux system on top of your operating system. If you are working with Windows and decide that you want to go native and not use Anaconda, be careful with your choice of libraries; you are probably safer if you install the 32-bit version for everything (including Python itself).
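As a quick sanity check on the points above (implementation, Python version, operating system, and 32-bit versus 64-bit builds), the following is a minimal sketch that uses only the standard library. The function name `describe_runtime` is hypothetical and chosen for illustration only.

```python
import platform
import struct
import sys


def describe_runtime():
    """Collect basic facts about the running interpreter and host system."""
    return {
        # "CPython" for the standard implementation; "PyPy", "Jython", etc. otherwise
        "implementation": platform.python_implementation(),
        "major_version": sys.version_info.major,
        # "Linux", "Darwin" (macOS), or "Windows"
        "operating_system": platform.system(),
        # Size of a C pointer in bits: 32 or 64 depending on the Python build
        "pointer_bits": struct.calcsize("P") * 8,
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

Running it inside and outside a virtual machine or container is an easy way to confirm which interpreter you are actually using.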
Note
Bioinformatics and data science are moving at breakneck speed; this is not just hype, it's a reality. When installing software libraries, choosing a version might be tricky. Depending on the code that you have, it might not work with some old versions, or maybe not even work with a newer version. Hopefully, any code that you use will indicate the correct dependencies—though this is not guaranteed.
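Because version mismatches are a common source of trouble, it can help to record which versions you actually have installed. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); the helper name `installed_versions` and the list of distribution names are illustrative assumptions, not something prescribed by this recipe.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Example distribution names; adjust to the libraries your code depends on
    for name, version in installed_versions(["biopython", "pandas", "numpy"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Saving this output alongside your analysis makes it easier to reproduce results later.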
The software developed for this book is available at https://github.com/PacktPublishing/Bioinformatics-with-Python-Cookbook-Second-Edition. To access it, you will need to install Git; getting used to Git is a good idea because lots of scientific computing software is developed with it. Alternatively, you can download the ZIP file that GitHub makes available.
Before you install the Python stack properly, you will need to install all the external non-Python software that you will be interoperating with. The list will vary from chapter to chapter, and all chapter-specific packages will be explained in their respective chapters. Some less common Python libraries may also be referred to in their specific chapters. Fortunately, since the first edition of this book, most bioinformatics software can be easily installed with conda using the Bioconda project.
If you are not interested in a specific chapter, you can skip the related packages and libraries. Of course, you will probably have many other bioinformatics applications around—such as Burrows-Wheeler Aligner (bwa) or Genome Analysis Toolkit (GATK) for NGS—but we will not discuss these because we do not interact with them directly (although we might interact with their outputs).
You will need to install some development compilers and libraries, all of which are free. On Ubuntu, consider installing the build-essential package (install it with apt-get), and on macOS, consider Xcode (https://developer.apple.com/xcode/).
In the following table, you will find a list of the most important Python software:
Name | Application | Purpose
Project Jupyter | All chapters | Interactive computing
pandas | All chapters | Data processing
NumPy | All chapters | Array/matrix processing
SciPy | All chapters | Scientific computing
Biopython | All chapters | Bioinformatics library
PyVCF | NGS | VCF processing
Pysam | NGS | SAM/BAM processing
HTSeq | NGS/Genomes | NGS processing
simuPOP | Population genetics | Population genetics simulation
DendroPy | Phylogenetics | Phylogenetics
scikit-learn | Machine learning/population genetics | Machine learning library
PyMOL | Proteomics | Molecular visualization
rpy2 | Introduction | R interface
seaborn | All chapters | Statistical chart library
Cython | Big data | High performance
Numba | Big data | High performance
Dask | Big data | Parallel processing
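Once the libraries in the table are installed, one way to confirm that they are visible to your interpreter is to look them up without importing them. This is only a sketch: the helper `find_missing` is hypothetical, and the import names in `IMPORT_NAMES` are my assumptions about how each project is imported (for example, Biopython as `Bio` and scikit-learn as `sklearn`).

```python
import importlib.util

# Assumed mapping from the project name in the table to its top-level import name
IMPORT_NAMES = {
    "Project Jupyter": "jupyter_core",
    "pandas": "pandas",
    "NumPy": "numpy",
    "SciPy": "scipy",
    "Biopython": "Bio",
    "PyVCF": "vcf",
    "Pysam": "pysam",
    "HTSeq": "HTSeq",
    "simuPOP": "simuPOP",
    "DendroPy": "dendropy",
    "scikit-learn": "sklearn",
    "PyMOL": "pymol",
    "rpy2": "rpy2",
    "seaborn": "seaborn",
    "Cython": "Cython",
    "Numba": "numba",
    "Dask": "dask",
}


def find_missing(import_names):
    """Return the sorted project names whose module cannot be located."""
    return sorted(
        project
        for project, module in import_names.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = find_missing(IMPORT_NAMES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Using `find_spec` only locates each module rather than importing it, so the check stays fast even for heavy libraries.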
We have taken a somewhat conservative approach in most of the recipes with regard to the processing of tabular data. While we use pandas every now and then, most of the time, we use standard Python. As time advances and pandas becomes more pervasive, it will probably make sense to just process all tabular data with it (if it fits in memory).
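To make the "standard Python" approach to tabular data concrete, here is a minimal sketch that uses only the `csv` module. The table contents, the column names, and the function `mean_depth_per_chrom` are all invented for illustration and do not come from any recipe in this book.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical toy table, invented purely for illustration
TOY_TABLE = """chrom,position,depth
chr1,100,12
chr1,200,18
chr2,150,30
"""


def mean_depth_per_chrom(handle):
    """Group rows by the 'chrom' column and average the 'depth' column."""
    depths = defaultdict(list)
    for row in csv.DictReader(handle):
        # csv yields strings, so numeric columns must be converted explicitly
        depths[row["chrom"]].append(int(row["depth"]))
    return {chrom: mean(values) for chrom, values in depths.items()}


if __name__ == "__main__":
    print(mean_depth_per_chrom(io.StringIO(TOY_TABLE)))
    # With pandas, the same aggregation would be roughly:
    #   pd.read_csv(handle).groupby("chrom")["depth"].mean()
```

The standard-library version reads the file row by row, whereas the pandas one-liner loads the whole table into memory first, which is the trade-off mentioned above.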
Take a look at the following steps to get started:
- Start by downloading the Anaconda distribution from https://www.anaconda.com/download. Choose Python version 3; this choice is not critical, because Anaconda will let you use Python 2 if you need it. You can accept all the installation defaults, but you may want to make sure that the conda binaries are in your path (do not forget to open a new window so that the path is updated). If you have another Python distribution, be careful with your PYTHONPATH and existing Python libraries. It's probably better to unset your PYTHONPATH. As much as possible, uninstall all other Python versions and installed Python libraries.
- Let's go ahead with the libraries. We will now create a new conda environment called bioinformatics with biopython=1.70, as shown in the following command:
conda create -n bioinformatics biopython=1.70
- Let's activate the environment, as follows:
source activate bioinformatics
- Let's add the bioconda and conda-forge channels to our source list:
conda config --add channels bioconda
conda config --add channels conda-forge
Also, install the core packages:
conda install scipy matplotlib jupyter-notebook pip pandas cython numba scikit-learn seaborn pysam pyvcf simuPOP dendropy rpy2
Some of them will probably be installed with the core distribution anyway.
- We can even install R from conda:
conda install r-essentials r-gridextra
r-essentials installs a lot of R packages, including ggplot2, which we will use later. We also install r-gridextra, since we will be using it in the Notebook.
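After working through the steps above, you may want to confirm that the setup looks as intended. The following is a minimal sketch, not an official check: the function `check_setup` is hypothetical, and it assumes that conda exposes the name of the active environment in the `CONDA_DEFAULT_ENV` environment variable.

```python
import os
import shutil


def check_setup(environ, conda_path, expected_env="bioinformatics"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if conda_path is None:
        problems.append("conda was not found on PATH")
    if environ.get("PYTHONPATH"):
        problems.append("PYTHONPATH is set; consider unsetting it")
    # Assumption: conda sets CONDA_DEFAULT_ENV when an environment is activated
    if environ.get("CONDA_DEFAULT_ENV") != expected_env:
        problems.append(f"active conda environment is not {expected_env!r}")
    return problems


if __name__ == "__main__":
    for problem in check_setup(os.environ, shutil.which("conda")):
        print("WARNING:", problem)
```

Passing the environment and the conda path in as arguments keeps the function easy to test with made-up values.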
Compared to the first edition of this book, this recipe is now highly simplified. There are two main reasons for this: the Bioconda project, and the fact that we only need to support Anaconda as it has become a standard. If you feel strongly against using Anaconda, you will be able to install many of the Python libraries via pip. You will probably need quite a few compilers and build tools: not only C compilers, but also C++ and Fortran.
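If you do go the pip route, a common precaution is to invoke pip through the interpreter you intend to use, so that packages land in the right installation. The sketch below only builds the command and prints it; the helper `pip_install_command` and the example package list are illustrative assumptions.

```python
import sys


def pip_install_command(packages):
    """Build a pip invocation bound to the currently running interpreter."""
    # "python -m pip" guarantees pip installs into this interpreter's site-packages
    return [sys.executable, "-m", "pip", "install", *packages]


if __name__ == "__main__":
    # Example package names; adjust to the libraries you need
    command = pip_install_command(["numpy", "scipy", "pandas", "biopython"])
    print(" ".join(command))
    # To actually run it: subprocess.run(command, check=True)
```

Libraries with compiled extensions may still need the compilers mentioned above when no prebuilt wheel is available for your platform.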