R for Data Science

By: Dan Toomey

R is a software package that provides a language and an environment for data manipulation and statistical computation. The results can also be displayed graphically.

R has the following features:

  • A lean syntax to perform operations on your data

  • A set of tools to load and store data in a variety of formats, both local and over the Internet

  • Consistent syntax for operating on datasets in memory

  • Built-in and open source collections of tools for data analysis

  • Methods to generate on-the-fly graphics and store graphical representations to disk
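
The features listed above can be illustrated with a short sketch in base R. This is only an illustration under stated assumptions: the built-in mtcars dataset, the temporary file paths, and the variable names are choices made here and do not come from the book.

```r
# A minimal sketch of the listed features, using only base R.
# The dataset (mtcars) and the temporary file paths are illustrative assumptions.

# Lean syntax: average fuel economy per cylinder count in one call
data(mtcars)
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(avg_mpg)

# Load and store data: write a CSV file, then read it back into memory
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = TRUE)
cars <- read.csv(csv_path, row.names = 1)

# Built-in analysis tools: correlation between weight and fuel economy
wt_mpg_cor <- cor(cars$wt, cars$mpg)

# Graphics stored to disk: render a scatter plot into a PDF file
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
plot(cars$wt, cars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
dev.off()
```

The same plot appears on screen if the pdf() and dev.off() calls are dropped in an interactive session.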

What this book covers

Chapter 1, Data Mining Patterns, covers data mining in R. In this instance, we will look for patterns in a dataset. This chapter explores examples of cluster analysis using several tools. It also covers anomaly detection and the use of association rules.

Chapter 2, Data Mining Sequences, explores methods in R that allow you to discover sequences in your data. There are several R packages available that help you to determine sequences and portray them graphically for further analysis.

Chapter 3, Text Mining, describes several methods of mining text in R. We will look at tools that allow you to manipulate and analyze the text or words in a source. We will also look into XML processing capabilities.

Chapter 4, Data Analysis – Regression Analysis, explores different ways of using regression analysis on your data. This chapter has methods to run simple and multivariate regression, along with subsequent displays.

Chapter 5, Data Analysis – Correlation, explores several correlation packages. The chapter analyzes data using basic correlation and covariance as well as Pearson, polychoric, tetrachoric, heterogeneous, and partial correlation.

Chapter 6, Data Analysis – Clustering, explores a variety of references for cluster analysis. The chapter covers k-means, PAM, and a number of other clustering techniques. All of these techniques are available to an R programmer.

Chapter 7, Data Visualization – R Graphics, discusses a variety of methods of visualizing your data. We will look at the gamut of data from typical class displays to interaction with third-party tools and the use of geographic maps.

Chapter 8, Data Visualization – Plotting, discusses different methods of plotting your data in R. The chapter has examples of simple plots with standardized displays as well as customized displays that can be applied to plotting data.

Chapter 9, Data Visualization – 3D, acts as a guide to creating 3D displays of your data directly from R. We will also look at using 3D displays for larger datasets.

Chapter 10, Machine Learning in Action, discusses how to use R for machine learning. The chapter covers separating datasets into training and test data, developing a model from your training data, and testing your model against test data.

Chapter 11, Predicting Events with Machine Learning, uses time series datasets. The chapter covers converting your data into an R time series and then separating out the seasonal, trend, and irregular components. The goal is to model or predict future events.

Chapter 12, Supervised and Unsupervised Learning, explains the use of supervised and unsupervised learning to build your model. It covers several methods in supervised and unsupervised learning.
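
The workflow described for Chapter 10 (separating a dataset into training and test data, developing a model, and testing it) might look roughly like the following sketch. It is not taken from the book: the mtcars dataset, the 70/30 split, the seed, and the linear model are all assumptions chosen for illustration.

```r
# A minimal sketch of a train/test workflow, using base R only.
# The dataset, the 70/30 split, and the seed are illustrative assumptions.
set.seed(42)

# Separate the dataset into training and test data
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- mtcars[train_idx, ]
test <- mtcars[-train_idx, ]

# Develop a model from the training data
model <- lm(mpg ~ wt + hp, data = train)

# Test the model against the held-out test data
predicted <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - predicted)^2))
cat("Test RMSE:", round(rmse, 2), "\n")
```

The root mean squared error on the test rows gives a rough idea of how the model performs on data it has not seen during fitting.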

What you need for this book

For this book, you need R installed on your machine (or the machine you will be running scripts against). R is available for a number of platforms. This book is not constrained to particular versions of R at this time.

You need an interactive tool to develop R programs in order to use this book to its potential. The predominant tool is RStudio, a fully interactive, self-contained program available on several platforms, which allows you to enter R scripts, display data, and display graphical results. The R command-line tool is also available with every installation of R.

Who this book is for

This book is written for data analysts who have a firm grasp of advanced data analysis techniques. Some basic knowledge of the R language and of data science topics is also required. This book assumes that you have access to an R environment and are comfortable with the statistics involved.


Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the kmeans directive."

A block of code is set as follows:

kmeans(x, centers,
       iter.max = 10,
       nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
       trace = FALSE)

Any command-line input or output is written as follows:

seqdist(seqdata, method, refseq=NULL, norm=FALSE,
  indel=1, sm=NA, with.missing=FALSE, full.matrix=TRUE)

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "You can see the key concepts: inflation, economic, conditions, employment, and the FOMC."


Warnings or important notes appear in a box like this.


Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail us and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can register to have the files e-mailed directly to you.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots and diagrams used in this book. The color images will help you better understand the changes in the output. This file is available for download.


Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, enter the name of the book in the search field. The required information will appear under the Errata section.


Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.


Questions

If you have a problem with any aspect of this book, you can contact us, and we will do our best to address the problem.