Book Image

R Bioinformatics Cookbook

By : Dan MacLean
Book Image

R Bioinformatics Cookbook

By: Dan MacLean

Overview of this book

Handling biological data effectively requires an in-depth knowledge of machine learning techniques and computational skills, along with an understanding of how to use tools such as edgeR and DESeq. With the R Bioinformatics Cookbook, you’ll explore all this and more, tackling common and not-so-common challenges in the bioinformatics domain using real-world examples. This book will use a recipe-based approach to show you how to perform practical research and analysis in computational biology with R. You will learn how to effectively analyze your data with the latest tools in Bioconductor, ggplot, and tidyverse. The book will guide you through the essential tools in Bioconductor to help you understand and carry out protocols in RNAseq, phylogenetics, genomics, and sequence analysis. As you progress, you will get up to speed with how machine learning techniques can be used in the bioinformatics domain. You will gradually develop key computational skills such as creating reusable workflows in R Markdown and packages for code reuse. By the end of this book, you’ll have gained a solid understanding of the most important and widely used techniques in bioinformatic analysis and the tools you need to work with real biological data.
Table of Contents (13 chapters)

Differential peak analysis

When you've discovered unannotated transcripts you may want to see whether they are differentially expressed between experiments. We've already looked at how we might do that with edgeR and DESeq, but one problem is going from an object such as a RangedSummarizedExperiment, comprised of the data and a GRanges object that describes the peak regions, to the internal DESeq object. In this recipe, we'll look at how we can summarise the data in those objects and get them into the correct format.

Getting ready

For this recipe, you'll need the RangedSummarizedExperiment version of the Arabidopsis thaliana RNAseq in datasets/ch1/arabidopsis_rse.RDS in this book's repository. We'll use the DESeq and SummarizedExperiment Bioconductor packages we used earlier too.

How to do it...

  1. Load data and set up a function that creates region tags:
library(SummarizedExperiment) 
arab_rse <- readRDS(file.path(getwd(), "datasets", "ch1", "arabidopsis_rse.RDS") )

make_tag <- function(grange_obj){
paste0(
grange_obj@seqnames,
":",
grange_obj@ranges@start,
"-",
(grange_obj@ranges@start + grange_obj@ranges@width)
)
}
  1. Extract data and annotate rows:
counts <- assay(arab_rse)

if ( ! is.null(names(rowRanges(arab_rse))) ){
  rownames(counts) <- names(rowRanges(arab_rse))
} else {
  rownames(counts) <- make_tag(rowRanges(arab_rse))
}

How it works...

Step 1 starts by loading in our pre-prepared RangedSummarized experiment; note that the names slot of the GRanges object in there is not populated. We next create a custom function, make_tag(), which works by pasting together seqnames, starts and the computed end (start + width) from a passed GRanges object. Note the @ sign syntax: this is used because GRange is an S4 object and the slots are accessed with @ rather than the more familiar $.

In step 2, the code pulls out the actual data from RangedSummarizedExperiment using the assay() function. The matrix returned has no row names, which is unuseful, so we use the if clause to check the names slot—we use that as row names if it's available; if it, isn't we make a row name tag using the position information in the GRanges object in the make_tag() function we have created. This will give the following outputa count matrix that has the location tag as the row name that can be used in DESeq and edgeR as described in Recipes 1 and 2 in this chapter:

head(counts)
## mock1 mock2 mock3 hrcc1 hrcc2 hrcc3 ## Chr1:3631-5900 35 77 40 46 64 60 ## Chr1:5928-8738 43 45 32 43 39 49 ## Chr1:11649-13715 16 24 26 27 35 20 ## Chr1:23146-31228 72 43 64 66 25 90 ## Chr1:31170-33154 49 78 90 67 45 60 ## Chr1:33379-37872 0 15 2 0 21 8