Interfacing with R via rpy2
If there is some functionality that you need and cannot find it in a Python library, your first port of call is to check whether it's implemented in R. For statistical methods, R is still the most complete framework; moreover, some bioinformatics functionalities are also only available in R, most probably offered as a package belonging to the Bioconductor project.
The rpy2 provides provides a declarative interface from Python to R. As you will see, you will be able to write very elegant Python code to perform the interfacing process.
In order to show the interface (and try out one of the most common R data structures, the data frame, and one of the most popular R libraries: ggplot2), we will download its metadata from the Human 1000 genomes project (http://www.1000genomes.org/). As this is not a book on R, we do want to provide any interesting and functional examples.
Getting ready
You will need to get the metadata file from the 1000 genomes sequence index. Please check https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/Datasets.ipynb and download the sequence.index
file. If you are using notebooks, open the 00_Intro/Interfacing_R notebook.ipynb
and just execute the wget
command on top.
This file has information about all FASTQ files in the project (we will use data from the Human 1000 genomes project in the chapters to come). This includes the FASTQ file, the sample ID, and the population of origin and important statistical information per lane, such as the number of reads and number of DNA bases read.
How to do it…
Take a look at the following steps:
We start by importing rpy2 and reading the file, using the
read_delim
R function:import rpy2.robjects as robjects read_delim = robjects.r('read.delim') seq_data = read_delim('sequence.index', header=True, stringsAsFactors=False) #In R: # seq.data <- read.delim('sequence.index', header=TRUE, # stringsAsFactors=FALSE)
The first thing that we do after importing is accessing the
read.delim
R function that allows you to read files.Note that the R language specification allows you to put dots in the names of objects. Therefore, we have to convert a function name to
read_delim
.
Then, we call the function proper; note the following highly declarative features. First, most atomic objects—such as strings—can be passed without conversion. Second, argument names are converted seamlessly (barring the dot issue). Finally, objects are available in the Python namespace (but objects are actually not available in the R namespace; more about this later). For reference, I have included the corresponding R code. I hope it's clear that it's an easy conversion.
The
seq_data
object is a data frame. If you know basic R or the Python pandas library, you are probably aware of this type of data structure; if not, then this is essentially a table: a sequence of rows where each column has the same type. Let's perform a basic inspection of this data frame as follows:print('This dataframe has %d columns and %d rows' % (seq_data.ncol, seq_data.nrow)) print(seq_data.colnames) #In R: # print(colnames(seq.data)) # print(nrow(seq.data)) # print(ncol(seq.data))
Again, note the code similarity. You can even mix styles using the following code:
my_cols = robjects.r.ncol(seq_data) print(my_cols)
You can call R functions directly; in this case, we will call
ncol
if they do not have dots in their name; however, be careful. This will display an output, not26
(the number of columns), but[26]
which is a vector composed of the element26
. This is because by default, most operations in R return vectors. If you want the number of columns, you have to performmy_cols[0]
. Also, talking about pitfalls, note that R array indexing starts with1
, whereas Python starts with0
.
Now, we need to perform some data cleanup. For example, some columns should be interpreted as numbers, but they are read as strings:
as_integer = robjects.r('as.integer') match = robjects.r.match my_col = match('BASE_COUNT', seq_data.colnames)[0] print(seq_data[my_col - 1][:3]) seq_data[my_col - 1] = as_integer(seq_data[my_col - 1]) print(seq_data[my_col - 1][:3])
The match function is somewhat similar to the index method in Python lists. As expected, it returns a vector so that we can extract the
0
element. It's also1-indexed
, so we subtract one when working on Python. Theas_integer
function will convert a column to integers. The first print will show strings (values surrounded by"
), whereas the second print will show numbers.
We will need to massage this table a bit more; details can be found on a notebook, but here we will finalize with getting the data frame to R (remember that while it's an R object, it's actually visible on the Python namespace only):
robjects.r.assign('seq.data', seq_data)
This will create a variable in the R namespace called
seq.data
with the content of the data frame from the Python namespace. Note that after this operation, both objects will be independent (if you change one, it will not be reflected on the other).
Note
While you can perform plotting on Python, R has default built-in plotting functionalities (which we will ignore here). It also has a library called ggplot2 that implements the Grammar of Graphics (a declarative language to specify statistical charts).
We will finalize our R integration example with a plot using ggplot2. This is particularly interesting, not only because you may encounter R code using ggplot2, but also because the drawing paradigm behind the Grammar of Graphics is really revolutionary and may be an alternative that you may want to consider instead of the more standard plotting libraries, such as matplotlib ggplot2 is so pervasive that rpy2 provides a Python interface to it:
import rpy2.robjects.lib.ggplot2 as ggplot2
With regards to our concrete example based on the Human 1000 genomes project, we will first plot a histogram with the distribution of center names, where all sequencing lanes were generated. The first thing that we need to do is to output the chart to a PNG file. We call the R
png()
function as follows:robjects.r.png('out.png')
We will now use ggplot to create a chart, as shown in the following command:
from rpy2.robjects.functions import SignatureTranslatedFunction ggplot2.theme = SignatureTranslatedFunction(ggplot2.theme, init_prm_translate={'axis_text_x': 'axis.text.x'}) bar = ggplot2.ggplot(seq_data) + ggplot2.geom_bar() + ggplot2.aes_string(x='CENTER_NAME') + ggplot2.theme(axis_text_x=ggplot2.element_text(angle=90, hjust=1)) bar.plot() dev_off = robjects.r('dev.off') dev_off()
The second line is a bit uninteresting, but is an important boilerplate code. One of the R functions that we will call has a parameter with a dot in its name. As Python function calls cannot have this, we map the
axis.text.x
R parameter name to theaxis_text_x
Python name in the function theme. We monkey patch it (that is, we replaceggplot2.theme
with a patched version of itself).
We then draw the chart itself. Note the declarative nature of ggplot2 as we add features to the chart. First, we specify the
seq_data data frame
, then we will use a histogram bar plot calledgeom_bar
, followed by annotating theX
variable (CENTER_NAME
).Finally, we rotate the text of the x axis by changing the theme.
We finalize by closing the R printing device. If you are in an IPython console, you will want to visualize the PNG image as follows:
from IPython.display import Image Image(filename='out.png')
This chart produced is as follows:
As a final example, we will now do a scatter plot of read and base counts for all the sequenced lanes for Yoruban (YRI) and Utah residents with ancestry from Northern and Western Europe (CEU) of the Human 1000 genomes project (the summary of the data of this project, which we will use thoroughly, can be seen in the Working with modern sequence formats recipe in Chapter 2, Next-generation Sequencing). We are also interested in the difference among the different types of sequencing (exome, high, and low coverage). We first generate a data frame with just YRI and CEU lanes and limit the maximum base and read counts:
robjects.r('yri_ceu <- seq.data[seq.data$POPULATION %in% c("YRI", "CEU") & seq.data$BASE_COUNT < 2E9 & seq.data$READ_COUNT < 3E7, ]') robjects.r('yri_ceu$POPULATION <- as.factor(yri_ceu$POPULATION)') robjects.r('yri_ceu$ANALYSIS_GROUP <- as.factor(yri_ceu$ANALYSIS_GROUP)')
The last two lines convert the
POPULATION
andANALYSIS_GROUPS
to factors, a concept similar to categorical data.
We are now ready to plot:
yri_ceu = robjects.r('yri_ceu') scatter = ggplot2.ggplot(yri_ceu) + ggplot2.geom_point() + \ ggplot2.aes_string(x='BASE_COUNT', y='READ_COUNT', shape='factor(POPULATION)', col='factor(ANALYSIS_GROUP)') scatter.plot()
Hopefully, this example (refer to the following screenshot) makes the power of the Grammar of Graphics approach clear. We will start by declaring the data frame and the type of chart in use (the scatter plot implemented by
geom_point
). Note how easy it is to express that the shape of each point depends on thePOPULATION
variable and the color on theANALYSIS_GROUP
:
Finally, when you think about Python and R, you probably think about pandas: the R-inspired Python library designed with data analysis and modeling in mind. One of the fundamental data structures in pandas is (surprise) the data frame. It's quite easy to convert backward and forward between R and pandas, as follows:
import pandas.rpy.common as pd_common pd_yri_ceu = pd_common.load_data('yri_ceu') del pd_yri_ceu['PAIRED_FASTQ'] no_paired = pd_common.convert_to_r_dataframe(pd_yri_ceu) robjects.r.assign('no.paired', no_paired) robjects.r("print(colnames(no.paired))")
We start by importing the necessary conversion module. We then convert the R data frame (note that we are converting the
yri_ceu
in the R namespace, not the one on the Python namespace). We delete the column that indicates the name of the paired FASTQ file on the pandas data frame and copy it back to the R namespace. If you print the column names of the new R data frame, you will see thatPAIRED_FASTQ
is missing.As this book enters production, the
pandas.rpy
module is being deprecated (although it's still available).
In the interests of maintaining the momentum of the book, we will not delve into pandas programming (there are plenty of books on this), but I recommend that you take a look at it, not only in the context of interfacing with R, but also as a very good library for data management of complex datasets.
There's more…
It's worth repeating that the advancements on the Python software ecology are occurring at a breakneck pace. This means that if a certain functionality is not available today, it might be released sometime in the near future. So, if you are developing a new project, be sure to check for the very latest developments on the Python front before using a functionality from an R package.
There are plenty of R packages for bioinformatics in the Bioconductor project (http://www.bioconductor.org/). This should probably be your first port of call in the R world for bioinformatics functionalities. However, note that there are many R bioinformatics packages that are not on Bioconductor, so be sure to search the wider R packages on CRAN (refer to the Comprehensive R Archive Network at http://cran.r-project.org/).
There are plenty of plotting libraries for Python. matplotlib is the most common library, but you also have a plethora of other choices. In the context of R, it's worth noting that there is a ggplot2-like implementation for Python based on the Grammar of Graphics description language for charts and this is called—surprise-surprise—ggplot! (http://ggplot.yhathq.com/).
See also
There are plenty of tutorials and books on R; check the R web page (http://www.r-project.org/) for documentation.
For Bioconductor, check the documentation at http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual.
If you work with NGS, you might also want to check High Throughput Sequence Analysis with Bioconductor at http://manuals.bioinformatics.ucr.edu/home/ht-seq.
The rpy library documentation is your Python gateway to R at http://rpy.sourceforge.net/.
The Grammar of Graphics is described in a book aptly named The Grammar of Graphics, Leland Wilkinson, Springer.
In terms of data structures, similar functionality to R can be found on the pandas library. You can find some tutorials at http://pandas.pydata.org/pandas-docs/dev/tutorials.html. The book, Python for Data Analysis, Wes McKinney, O'Reilly Media, is also an alternative to consider.