R Bioinformatics Cookbook - Second Edition

By: Dan MacLean
Overview of this book

The updated second edition of R Bioinformatics Cookbook takes a recipe-based approach to show you how to conduct practical research and analysis in computational biology with R. You’ll learn how to create a useful and modular R working environment, along with loading, cleaning, and analyzing data using the most up-to-date Bioconductor, ggplot2, and tidyverse tools. This book will walk you through the Bioconductor tools necessary for you to understand and carry out protocols in RNA-seq and ChIP-seq, phylogenetics, genomics, gene search, gene annotation, statistical analysis, and sequence analysis. As you advance, you'll find out how to use Quarto to create data-rich reports, presentations, and websites, as well as get a clear understanding of how machine learning techniques can be applied in the bioinformatics domain. The concluding chapters will help you develop proficiency in key skills, such as gene annotation analysis and functional programming in purrr and base R. Finally, you'll discover how to use the latest AI tools, including ChatGPT, to generate, edit, and understand R code and draft workflows for complex analyses. By the end of this book, you'll have gained a solid understanding of the skills and techniques needed to become a bioinformatics specialist and efficiently work with large and complex bioinformatics datasets.
Table of Contents (16 chapters)

Classifying using random forest and interpreting it with iml

Random forest is a versatile machine learning algorithm that can be used for both regression and classification tasks. It is an ensemble method that combines multiple decision trees to make predictions: each tree splits the data on feature values to create subsets with similar target variable values, and the forest aggregates the trees' outputs into a more robust and accurate model. To build a diverse set of trees, the algorithm draws a random subset of the training data for each tree (bootstrapping) and considers only a random subset of features at each node. Because each tree sees a slightly different variation of the data, the risk of overfitting is reduced.
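As a minimal sketch of the idea, the following fits a random forest classifier in R. It assumes the `randomForest` package is installed and uses the built-in `iris` dataset purely for illustration; the `ntree` and `mtry` values shown are illustrative choices, not recommendations.

```r
# A minimal sketch, assuming the randomForest package and the built-in iris data
library(randomForest)

set.seed(42)

# Fit a classifier from 500 bootstrapped trees; at each node only a
# random subset of mtry features is considered as split candidates
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

# Print the out-of-bag (OOB) error estimate and confusion matrix --
# the OOB estimate comes "for free" from the bootstrapping step
print(rf)
```

Because each tree is trained on a bootstrap sample, the observations left out of that sample (the out-of-bag cases) provide a built-in estimate of generalization error without a separate validation set.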

Random forest assesses feature (variable) importance by evaluating how much each feature contributes to reducing error in the...
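To illustrate the feature-importance idea with the iml package named in this recipe's title, the sketch below wraps a fitted forest in an iml `Predictor` and computes permutation importance. It assumes the `randomForest` and `iml` packages are installed and again uses `iris` as a stand-in dataset; the `loss = "ce"` choice (classification error) is one of several losses iml supports.

```r
# A hedged sketch: model-agnostic feature importance via the iml package
library(randomForest)
library(iml)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# Wrap the model so iml can query predictions in a model-agnostic way
predictor <- Predictor$new(rf, data = iris[, -5], y = iris$Species)

# Permutation importance: shuffle each feature and measure the increase
# in classification error ("ce") relative to the unshuffled baseline
imp <- FeatureImp$new(predictor, loss = "ce")

plot(imp)  # features with larger error increase matter more to the model
```

Permutation importance is model-agnostic: it asks how much predictive performance degrades when a feature's values are randomly shuffled, so it complements the impurity-based importance that random forest computes internally.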