Book Image

R Bioinformatics Cookbook - Second Edition

By : Dan MacLean
Book Image

R Bioinformatics Cookbook - Second Edition

By: Dan MacLean

Overview of this book

The updated second edition of R Bioinformatics Cookbook takes a recipe-based approach to show you how to conduct practical research and analysis in computational biology with R. You’ll learn how to create a useful and modular R working environment, along with loading, cleaning, and analyzing data using the most up-to-date Bioconductor, ggplot2, and tidyverse tools. This book will walk you through the Bioconductor tools necessary for you to understand and carry out protocols in RNA-seq and ChIP-seq, phylogenetics, genomics, gene search, gene annotation, statistical analysis, and sequence analysis. As you advance, you'll find out how to use Quarto to create data-rich reports, presentations, and websites, as well as get a clear understanding of how machine learning techniques can be applied in the bioinformatics domain. The concluding chapters will help you develop proficiency in key skills, such as gene annotation analysis and functional programming in purrr and base R. Finally, you'll discover how to use the latest AI tools, including ChatGPT, to generate, edit, and understand R code and draft workflows for complex analyses. By the end of this book, you'll have gained a solid understanding of the skills and techniques needed to become a bioinformatics specialist and efficiently work with large and complex bioinformatics datasets.
Table of Contents (16 chapters)

Tidying a wide format table into a tidy table with tidyr

The tidyr package in R is a package that provides tools for tidying and reshaping data. It is designed to make it easy to work with data in a consistent and structured format, which is known as a tidy format. Tidy data is a standard way of organizing data that makes it easy to perform data analysis and visualization.

The main principles of tidy data are as follows:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

Data in a tidy format is easier to work with because the structure of the data is consistent, facilitating operations such as filtering, grouping, and reshaping the data. Tidy data is also more compatible with various data visualization and analysis tools, such as ggplot2, dplyr, and other tidyverse packages.

Our aim in this recipe will be to take a wide format data frame where a lot of information is hiding in column names and squeeze and reformat them into a data column of their own and rationalize them in the process.

Getting ready

We’ll need the rbioinfcookbook and tidyr packages. We’ll use the finished output from recipe 1, which is saved in the package.

How to do it…

We have to use just one function, but the options are many.

Specify the transformation to the table:

library(rbioinfcookbook)library(dplyr)
library(tidyr)
long_df <- census_df |> 
  rename("0_to_4" = "under_4", "90_to_120" = "over_90") |> 
  pivot_longer(
    cols = contains("_to_"),
    names_to = c("age_from", "age_to"),
    names_pattern = "(.*)_to_(.*)",
    names_transform = list("age_from" = as.integer,
                         age.to = as.integer),
    values_to = "count"
  )

And that’s it. This short recipe is very dense, though.

How it works…

The tidyr package has functions that work by allowing the user to specify a particular transformation that will be applied to the data frame to generate a new one. In this single step, we specify a table row-count increasing operation that will find all the columns that contain age information. Next, we split the title of that column into data for two new columns—one for the lower boundary of the age category and one for the upper boundary of the age category. Then, we change the type of those new columns to integer and, lastly, put the actual counts in a new column.

The first function in this pipeline is code from dplyr, which helps us rename column headings. Our age data column names are largely consistent, except for the lower bound and the upper one, so we rename those columns to match the pattern of the others, simplifying the transform specification.

The pivot_longer() function specifies the transform in the arguments, with the cols argument we choose to operate on any columns containing the text to. The names_pattern argument takes a regular expression (regex) that captures the bits of text before and after the to string in the column names and uses them as values for the columns defined in the names_to argument. The actual counts from the cell are put into a new column called counts. The transformation is then applied in one step and reduces the data frames column count to eight, increasing the row count to 6,935, and in the process making the data tidy and easier to use in downstream packages.

See also

The recipe uses a regex to describe a pattern in text. If you haven’t seen these before and need a primer, try typing ?"regular expression" to view the R help on the topic.