Book Image

R Bioinformatics Cookbook - Second Edition

By : Dan MacLean
Book Image

R Bioinformatics Cookbook - Second Edition

By: Dan MacLean

Overview of this book

The updated second edition of R Bioinformatics Cookbook takes a recipe-based approach to show you how to conduct practical research and analysis in computational biology with R. You’ll learn how to create a useful and modular R working environment, along with loading, cleaning, and analyzing data using the most up-to-date Bioconductor, ggplot2, and tidyverse tools. This book will walk you through the Bioconductor tools necessary for you to understand and carry out protocols in RNA-seq and ChIP-seq, phylogenetics, genomics, gene search, gene annotation, statistical analysis, and sequence analysis. As you advance, you'll find out how to use Quarto to create data-rich reports, presentations, and websites, as well as get a clear understanding of how machine learning techniques can be applied in the bioinformatics domain. The concluding chapters will help you develop proficiency in key skills, such as gene annotation analysis and functional programming in purrr and base R. Finally, you'll discover how to use the latest AI tools, including ChatGPT, to generate, edit, and understand R code and draft workflows for complex analyses. By the end of this book, you'll have gained a solid understanding of the skills and techniques needed to become a bioinformatics specialist and efficiently work with large and complex bioinformatics datasets.
Table of Contents (16 chapters)

Combining tables using join functions

Joining rectangular tables in data science is a powerful way to combine data from multiple sources, allowing for more complex and detailed analysis. The process of joining tables involves matching rows from one table with corresponding rows in another table, based on shared columns or keys. The ability to join tables allows data scientists to gather information from different sources and can also be used to clean and prepare data for analysis by eliminating duplicates or filling in missing values. Note that although the joining process is powerful and useful, it isn’t magic and is actually a common source of errors. The user must take care that the operation was successful in the way that they intended and that combining data doesn’t create unexpected combinations, especially empty cells and repeated rows.

The dplyr package provides functions for manipulating and cleaning data, including a function called join() that can be used to join tables based on one or more common columns. The join() function supports several types of joins, including inner, left, right, and full outer joins. In this recipe, we’ll look at how each of these joins works.

Getting ready

We’ll need the dplyr package and the rbioinfcookbook package, which will give us a short gene expression dataset of just 10 Magnaporthe oryzae genes, and related annotation data of approximately 60,000 rows for the entire genome.

How to do it…

The process will begin with loading a data frame from the data package. The mo_gene_exp, mo_go_acc, and mo_go_evidence objects are all available as data objects when you load the rbioinfcookbook library, so we don’t have to try to load them from the file. You will have seen this behavior in numerous R tutorials before. For our work, this mimics the situation where you will already have gone through the process of loading in the data from a file on disk or received a data frame from an upstream function.

The following will help us to join tables together:

  1. Load the data and add terms to genes:
    library(rbioinfcookbook)library(dplyr)x <- left_join(mo_gene_exp, mo_terms, by = c('gene_id' = 'Gene stable ID'))
  2. Add accession numbers:
    y <- right_join(mo_go_acc, x, by = c( 'Gene stable ID' = 'gene_id' ) )
  3. Add evidence code:
    z <- inner_join(y, mo_go_evidence, by = c('GO term accession' = 'GO term evidence code'))
  4. Compare the direction of joins:
    a <- right_join(x, mo_go_acc, by = c( 'gene_id' = 'Gene stable ID') )
  5. Stack two data frames:
    mol_func <- filter(mo_go_evidence, `GO domain` == 'molecular_function')cell_comp <- filter(mo_go_evidence, `GO domain` == 'cellular_component')bind_rows(mol_func, cell_comp)
  6. Put two data frames side by side:
    small_mol_func <- head(mol_func, 15)small_cell_comp <- head(cell_comp, 15)bind_cols(small_mol_func, small_cell_comp)

And with that, we have joined data frames into one in most ways possible.

How it works…

The code joins different data frames in various ways. The mo_gene_exp, mo_terms, mo_go_acc, and mo_go_evidence objects are data frames, and they are loaded using the rbioinfcookbook library. Then, the first operation is to add terms to genes using the left_join() function. The left_join() function joins the mo_gene_exp and mo_terms data frames on the gene_id column of the mo_gene_exp data frame and the Gene stable ID column of the mo_terms data frame. Note the increase in rows as well as columns because of the multiple matching rows.

By step 2, we’re adding accession numbers using the right_join() function to join the mo_go_acc data frame and the result of the first join (x) on the Gene stable ID column of the mo_go_acc data frame and the gene_id column of the x data frame. Ordering the data frames this way minimizes the number of rows; see step 5 for how the converse goes. Note that the right_join() function returns the full set of rows from the right data frame.

Step 3’s inner_join() function demonstrates that only the rows shared are returned. The remaining steps create subsets of the mo_go_evidence data frame based on the component to highlight how bind_rows() does a name-unaware stacking and bind_cols() does a blind left-right paste/concatenation of data frames. These last two functions are quick and easy but do not do anything clever, so be sure that the data can be properly joined this way.