Book Image

R Bioinformatics Cookbook - Second Edition

By : Dan MacLean
Book Image

R Bioinformatics Cookbook - Second Edition

By: Dan MacLean

Overview of this book

The updated second edition of R Bioinformatics Cookbook takes a recipe-based approach to show you how to conduct practical research and analysis in computational biology with R. You’ll learn how to create a useful and modular R working environment, along with loading, cleaning, and analyzing data using the most up-to-date Bioconductor, ggplot2, and tidyverse tools. This book will walk you through the Bioconductor tools necessary for you to understand and carry out protocols in RNA-seq and ChIP-seq, phylogenetics, genomics, gene search, gene annotation, statistical analysis, and sequence analysis. As you advance, you'll find out how to use Quarto to create data-rich reports, presentations, and websites, as well as get a clear understanding of how machine learning techniques can be applied in the bioinformatics domain. The concluding chapters will help you develop proficiency in key skills, such as gene annotation analysis and functional programming in purrr and base R. Finally, you'll discover how to use the latest AI tools, including ChatGPT, to generate, edit, and understand R code and draft workflows for complex analyses. By the end of this book, you'll have gained a solid understanding of the skills and techniques needed to become a bioinformatics specialist and efficiently work with large and complex bioinformatics datasets.
Table of Contents (16 chapters)

Loading, Tidying, and Cleaning Data in the tidyverse

Cleaning data is a crucial step in the data science process. It involves identifying and correcting errors, inconsistencies, and missing values in the data, as well as formatting and structuring the data in a way that makes it easy to work with. This allows the data to be used effectively for analysis, modeling, and visualization. The R tidyverse is a collection of packages designed for data science and includes tools for data manipulation, visualization, and modeling. The dplyr and tidyr packages are two of the most widely used packages within the tidyverse for data cleaning. dplyr provides a set of functions for efficiently manipulating large datasets, such as filtering, grouping, and summarizing data. tidyr is specifically designed for tidying (or restructuring) data, making it easier to work with. It provides functions for reshaping data, such as gathering and spreading columns, and allows for the creation of a consistent structure in the data. This makes it easier to perform data analysis and visualization. Together, these packages provide powerful tools for cleaning and manipulating data in R, making it a popular choice among data scientists. In this chapter, we will look at tools and techniques for preparing data in the tidyverse set of packages. You will learn how to deal with different formats and quickly interconvert them, merge different datasets, and summarize them. You will also learn how to bring data from outside sources not in handy files into your work.

In this chapter, we will cover the following recipes:

  • Loading data from files with readr
  • Tidying a wide format table into a tidy table with tidyr
  • Tidying a long format table into a tidy table with tidyr
  • Combining tables using join functions
  • Reformatting and extracting existing data into new columns using stringr
  • Computing new data columns from existing ones and applying arbitrary functions using mutate()
  • Using dplyr to summarize data in large tables
  • Using datapasta to create R objects from cut-and-paste data