R Bioinformatics Cookbook - Second Edition

By : Dan MacLean

Overview of this book

The updated second edition of R Bioinformatics Cookbook takes a recipe-based approach to show you how to conduct practical research and analysis in computational biology with R. You’ll learn how to create a useful and modular R working environment, along with loading, cleaning, and analyzing data using the most up-to-date Bioconductor, ggplot2, and tidyverse tools. This book will walk you through the Bioconductor tools necessary for you to understand and carry out protocols in RNA-seq and ChIP-seq, phylogenetics, genomics, gene search, gene annotation, statistical analysis, and sequence analysis. As you advance, you'll find out how to use Quarto to create data-rich reports, presentations, and websites, as well as get a clear understanding of how machine learning techniques can be applied in the bioinformatics domain. The concluding chapters will help you develop proficiency in key skills, such as gene annotation analysis and functional programming in purrr and base R. Finally, you'll discover how to use the latest AI tools, including ChatGPT, to generate, edit, and understand R code and draft workflows for complex analyses. By the end of this book, you'll have gained a solid understanding of the skills and techniques needed to become a bioinformatics specialist and efficiently work with large and complex bioinformatics datasets.

Combining tables using join functions

Joining rectangular tables in data science is a powerful way to combine data from multiple sources, allowing for more complex and detailed analysis. The process of joining tables involves matching rows from one table with corresponding rows in another table, based on shared columns or keys. The ability to join tables allows data scientists to gather information from different sources and can also be used to clean and prepare data for analysis by eliminating duplicates or filling in missing values. Note that although the joining process is powerful and useful, it isn’t magic and is actually a common source of errors. The user must take care that the operation was successful in the way that they intended and that combining data doesn’t create unexpected combinations, especially empty cells and repeated rows.

The dplyr package provides functions for manipulating and cleaning data, including a family of join functions that combine tables based on one or more common columns. The family covers several types of joins, including inner, left, right, and full outer joins, through inner_join(), left_join(), right_join(), and full_join(). In this recipe, we’ll look at how these joins work.
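As a quick orientation before the recipe, here is a minimal sketch of the four join types on two made-up tables. The tables left_tbl and right_tbl and their columns are hypothetical and are not part of the rbioinfcookbook data.

```r
library(dplyr)

# Hypothetical toy tables sharing a `key` column (not the recipe's data)
left_tbl  <- tibble(key = c("a", "b", "c"), x = 1:3)
right_tbl <- tibble(key = c("b", "c", "d"), y = c(20, 30, 40))

# inner: only keys present in both tables (b, c)
inner_res <- inner_join(left_tbl, right_tbl, by = "key")

# left: every row of left_tbl; y is NA where right_tbl has no match (a)
left_res <- left_join(left_tbl, right_tbl, by = "key")

# right: every row of right_tbl; x is NA where left_tbl has no match (d)
right_res <- right_join(left_tbl, right_tbl, by = "key")

# full: every key from either table (a, b, c, d)
full_res <- full_join(left_tbl, right_tbl, by = "key")

# Compare the number of rows each join returns
sapply(
  list(inner = inner_res, left = left_res, right = right_res, full = full_res),
  nrow
)
```

The row counts differ only because of the unmatched keys a and d, which is usually the first thing worth checking when a join returns more or fewer rows than expected.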

Getting ready

We’ll need the dplyr package and the rbioinfcookbook package, which will give us a short gene expression dataset of just 10 Magnaporthe oryzae genes, and related annotation data of approximately 60,000 rows for the entire genome.

How to do it…

The process will begin with loading a data frame from the data package. The mo_gene_exp, mo_terms, mo_go_acc, and mo_go_evidence objects are all available as data objects when you load the rbioinfcookbook library, so we don’t have to load them from a file. You will have seen this behavior in numerous R tutorials before. For our work, this mimics the situation where you have already loaded the data from a file on disk or received a data frame from an upstream function.

The following will help us to join tables together:

1. Load the data and add terms to genes:

library(rbioinfcookbook)
library(dplyr)
x <- left_join(mo_gene_exp, mo_terms, by = c('gene_id' = 'Gene stable ID'))

2. Add accession numbers:

y <- right_join(mo_go_acc, x, by = c('Gene stable ID' = 'gene_id'))

3. Add evidence code:

z <- inner_join(y, mo_go_evidence, by = c('GO term accession' = 'GO term evidence code'))

4. Compare the direction of joins:

a <- right_join(x, mo_go_acc, by = c('gene_id' = 'Gene stable ID'))

5. Stack two data frames:

mol_func <- filter(mo_go_evidence, `GO domain` == 'molecular_function')
cell_comp <- filter(mo_go_evidence, `GO domain` == 'cellular_component')
bind_rows(mol_func, cell_comp)

6. Put two data frames side by side:

small_mol_func <- head(mol_func, 15)
small_cell_comp <- head(cell_comp, 15)
bind_cols(small_mol_func, small_cell_comp)

And with that, we have joined data frames into one in most ways possible.
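Because the introduction warned that joins are a common source of empty cells and repeated rows, it can help to check a result before moving on. The following is only a sketch of one way to do that, on hypothetical toy tables (expr and annot are invented for illustration and are not the recipe's objects).

```r
library(dplyr)

# Hypothetical toy tables, not the recipe's data
expr <- tibble(gene_id = c("g1", "g2", "g3"), tpm = c(10.5, 0, 3.2))
annot <- tibble(
  gene_id = c("g1", "g1", "g2"),
  term    = c("kinase", "membrane", "transporter")
)

joined <- left_join(expr, annot, by = "gene_id")

# Rows of expr with no partner in annot; these become NA cells after the join
unmatched <- anti_join(expr, annot, by = "gene_id")

# Keys that occur more than once in the result (repeated rows)
repeated <- filter(count(joined, gene_id), n > 1)

# Number of empty cells introduced in the joined column
n_missing <- sum(is.na(joined$term))

unmatched
repeated
n_missing
```

Here anti_join() and count() are standard dplyr functions; which checks are worth running will depend on what the joined table is meant to contain.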

How it works…

The code joins different data frames in various ways. The mo_gene_exp, mo_terms, mo_go_acc, and mo_go_evidence objects are data frames, and they become available when the rbioinfcookbook library is loaded. The first operation adds terms to genes using the left_join() function, which joins the mo_gene_exp and mo_terms data frames on the gene_id column of mo_gene_exp and the Gene stable ID column of mo_terms. Note that the result has more rows as well as more columns, because a gene with several matching rows in mo_terms appears once for each match.
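The growth in rows can be reproduced on a small scale. This sketch uses invented stand-ins for the recipe's tables: gene_exp and terms are hypothetical, as is the `GO term name` column; only the gene_id and Gene stable ID key names are taken from the recipe.

```r
library(dplyr)

# Hypothetical stand-ins for the expression and term tables
gene_exp <- tibble(gene_id = c("g1", "g2"), expression = c(5.1, 8.7))
terms <- tibble(
  `Gene stable ID` = c("g1", "g1", "g1", "g2"),
  `GO term name`   = c("binding", "nucleus", "transport", "binding")
)

# Keys have different names in each table, so map them in `by`
with_terms <- left_join(gene_exp, terms, by = c("gene_id" = "Gene stable ID"))

dim(gene_exp)    # 2 rows, 2 columns
dim(with_terms)  # 4 rows, 3 columns: g1 is repeated once per matching term
```

The key column keeps the name from the left table (gene_id), and each gene is repeated once for every term row it matches.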

In step 2, we add accession numbers using the right_join() function to join the mo_go_acc data frame and the result of the first join (x) on the Gene stable ID column of mo_go_acc and the gene_id column of x. Ordering the data frames this way minimizes the number of rows; see step 4 for how the converse goes. Note that the right_join() function returns the full set of rows from the right data frame.
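To see why the order of the arguments matters for right_join(), consider this sketch with a small table and a larger one. Both tables (few_genes, all_acc) and the accession values are made up for illustration.

```r
library(dplyr)

# Hypothetical small expression table and larger accession table
few_genes <- tibble(gene_id = c("g1", "g2"), expression = c(5.1, 8.7))
all_acc <- tibble(
  `Gene stable ID` = c("g1", "g2", "g3", "g4", "g5"),
  accession        = c("acc1", "acc2", "acc3", "acc4", "acc5")
)

# Small table on the right: only its two genes are kept
keeps_few <- right_join(all_acc, few_genes, by = c("Gene stable ID" = "gene_id"))

# Large table on the right: all five rows are kept, with NA expression
# for the genes that are absent from few_genes
keeps_all <- right_join(few_genes, all_acc, by = c("gene_id" = "Gene stable ID"))

nrow(keeps_few)
nrow(keeps_all)
sum(is.na(keeps_all$expression))
```

Swapping the arguments changes which table's rows are guaranteed to survive, so the larger annotation table on the right brings along every row it has, matched or not.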

Step 3’s inner_join() function demonstrates that only rows with a match in both data frames are returned. The remaining steps create subsets of the mo_go_evidence data frame based on the GO domain to highlight how bind_rows() simply stacks data frames on top of one another (columns are lined up by name, but no keys are compared between rows) and bind_cols() does a blind left-right paste/concatenation of data frames by row position. These last two functions are quick and easy but do not do anything clever, so be sure that the data can be properly combined this way.
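The difference between the two bind functions is easiest to see on tiny inputs. In this sketch, a and b are hypothetical tables; in current dplyr, bind_rows() lines columns up by name, while neither function compares keys between rows.

```r
library(dplyr)

# Hypothetical tables with the same columns in a different order
a <- tibble(id = c("g1", "g2"), score = c(1, 2))
b <- tibble(score = c(3, 4), id = c("g3", "g4"))

# Stacking: columns are matched by name, rows are simply appended
stacked <- bind_rows(a, b)

# Side by side: rows are pasted by position, so g1 sits next to g3
# whether or not they belong together; duplicate column names are
# repaired so that they stay unique
side <- bind_cols(a, b)

stacked
names(side)
```

bind_cols() requires the inputs to have the same number of rows, and nothing in its output records whether the rows that ended up next to each other were related.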
