Hands-On Data Science with R

By: Vitor Bianchi Lanzetta, Doug Ortiz, Nataraj Dasgupta, Ricardo Anjoleto Farias

Overview of this book

R is one of the most widely used programming languages for statistics and data analysis, and it is well suited to handling the messy, unstructured datasets found in the real world. This book covers the data science ecosystem for aspiring data scientists, taking you from the basics to a level where you are confident enough to get hands-on with real-world data science problems. The book starts with an introduction to data science and introduces popular R libraries for routine data science tasks. It then covers the important processes in data science, such as gathering data, cleaning it, and uncovering patterns in it. You will explore machine learning algorithms, predictive analytical models, and finally deep learning algorithms. You will learn to use the most powerful visualization packages available in R so that you can easily derive insights from your data. Towards the end, you will also learn how to integrate R with Spark and Hadoop to perform large-scale data analytics without much complexity.
Table of Contents (16 chapters)

What this book covers

Chapter 1, Getting Started with Data Science and R, provides an introduction to the field of data science, its applicability in different industry domains, an overview of the machine learning process, and how to install RStudio in order to get started with R development. It also introduces the reader to programming in R, starting off at an intermediate level to facilitate an analysis of the Human Development Index (HDI), published by the United Nations Development Programme. The HDI signifies a country's level of development, including general public health, education, and various other societal factors.

Chapter 2, Descriptive and Inferential Statistics, introduces fundamental statistical analysis using R, including techniques to perform random sampling, hypothesis testing, and non-parametric tests. This chapter contains extensive examples of R commands for performing common analyses, such as t-tests and z-tests, and makes use of some well-known statistical packages, such as Hmisc.

Chapter 3, Data Wrangling with R, provides an introduction to packages available in R to slice and manipulate data. It introduces packages from the tidyverse, such as dplyr, as well as the more general apply family of functions in R. The chapter is example-heavy, with several examples that guide the reader on how to apply the functions in the respective packages.

Chapter 4, KDD, Data Mining, and Text Mining, includes extensive discussions on the art of extracting information from unstructured data sources, such as websites and Twitter. KDD (knowledge discovery in databases) is a popular term in the data science community, and this chapter covers the topic with step-by-step examples that give a holistic overview of the subject matter. Sections on web scraping, data transformation, and data visualization are included. Examples of how to leverage packages such as rvest and httr to perform such operations are also discussed at length.

Chapter 5, Data Analysis with R, covers a general introduction to data types and data categories in R as they apply to machine learning, manipulating strings and dates, and charting with R. This chapter is essentially a consolidation of topics that are found elsewhere in the book, but in a more concise format. It can therefore be read as a standalone section that does not depend on any other chapter and can be used to gain familiarity with the topics discussed.

Chapter 6, Machine Learning with R, provides a detailed overview of using R for predictive analytics, more generally known as machine learning. It starts out with linear regression and gradually progresses to more in-depth topics in ML, such as decision trees, random forests, and support vector machines (SVMs). Extensively worked-out, hands-on examples, along with visualizations, complement the theoretical discussions in this chapter. The chapter concludes with a discussion on neural networks, one of the most popular fields in machine learning today.

Chapter 7, Forecasting and ML App with R, includes an advanced R Shiny application, complete with custom CSS style sheets, Google fonts, modified data table formats, and the like, for forecasting the revenue and sales of pharmaceutical medications in the UK using the NHS dataset. Such datasets are also known as real-world datasets in the sense that they contain actual data pertaining to physicians' prescribing activities. The application is fully reactive; that is, changing the controls on the frontend will immediately run the respective forecasting algorithm and update the forecast tables. We also use Prophet, a forecasting package from Facebook, whose model can be fitted using Markov chain Monte Carlo (MCMC) sampling.

Chapter 8, Neural Networks and Deep Learning, presents a comprehensive discussion, along with hands-on examples, of using R for machine learning with two of the most popular approaches: neural networks and their more advanced variation, deep learning. Indeed, some of the most successful machine learning projects in the world today, such as self-driving cars and automated assistants such as Siri, are powered by deep learning. This chapter gives readers an opportunity to delve into these areas and learn how they, too, can apply some of the same algorithms driving sensational successes in the field of machine learning today.

Chapter 9, Markovian in R, is aimed at more advanced users who are interested in learning more about Markov processes, which involve finding latent (or hidden) data from information in datasets. This is essentially a part of a field known as Bayesian analysis, which allows machine learning practitioners to model states that are not directly visible. Markov models are used in fields such as natural language processing and object recognition.

Chapter 10, Visualizing Data, provides a comprehensive introduction to various plotting libraries in R. In particular, libraries such as ggplot2 and rCharts, along with mapping libraries, are discussed at length. R is well known for its presentation-grade libraries that are capable of creating stunning, professional-grade visualizations. The chapter walks the reader through many of the plotting libraries that have made R a mainstay of the data visualization field.

Chapter 11, Going to Production with R, provides an introduction to the Shiny R package, a tool for developing interactive applications. This chapter delves into how Shiny works, how reactivity works, the basics of its template, how to build a basic application, and how to build one using a real dataset. If you want to present your data to people who are unfamiliar with the R language, Shiny is a good place to start.

Chapter 12, Large Scale Data Analytics with Hadoop, covers Apache Spark, an engine for large-scale data processing, similar but not identical to Apache Hadoop. Since its focus is on processing, you can use it entirely from your RStudio console. This chapter teaches you how to install Spark and take your first steps with it using sparklyr, an R package that provides a Spark backend for the dplyr package. In this way, you can use dplyr functions to manipulate your big dataset in the Spark cluster.

Chapter 13, R on Cloud, takes an in-depth look at using AzureML on the Microsoft Azure cloud platform. Cloud computing has allowed companies across the world to transition from a traditional data center-oriented architecture to a cloud-based decentralized environment. Unsurprisingly, machine learning has become a major part of the success of the cloud due to the ease of deploying multi-node clusters for large-scale machine learning. AzureML is an easy-to-use, web-based platform from Microsoft that allows even new data scientists to get a jump start on machine learning via a graphical interface.

Appendix A, The Road Ahead, introduces the reader to various resources on the web, such as blogs and forums, that can be used to learn more about R. The world of R is rapidly evolving, and in this appendix, we share some insights on the specific resources that will help seasoned data scientists stay abreast of all the developments in R today.