Mastering Data Analysis with R

By: Gergely Daróczi

Preface

R has become the lingua franca of statistical analysis, and it's already actively and heavily used in many industries besides the academic sector, where it originated more than 20 years ago. Nowadays, more and more businesses are adopting R in production, and it has become one of the most commonly used tools by data analysts and scientists, providing easy access to thousands of user-contributed packages.

Mastering Data Analysis with R will help you get familiar with this open source ecosystem and provide some statistical background as well, although with only a minor focus on mathematical questions. We will primarily focus on how to get things done practically with R.

As data scientists spend most of their time fetching, cleaning, and restructuring data, most of the first hands-on examples given here concentrate on loading data from files, databases, and online sources. Then, the book changes its focus to restructuring and cleansing data—still not performing actual data analysis yet. The later chapters describe special data types, and then classical statistical models are also covered, with some machine learning algorithms.

What this book covers

Chapter 1, Hello, Data!, starts with the first and very important step of every data-related task: loading data from text files and databases. This chapter covers some problems of loading larger amounts of data into R using improved CSV parsers, pre-filtering data, and comparing support for various database backends.

Chapter 2, Getting Data from the Web, extends your knowledge on importing data with packages designed to communicate with Web services and APIs, shows how to scrape and extract data from home pages, and gives a general overview of dealing with XML and JSON data formats.

Chapter 3, Filtering and Summarizing Data, continues with the basics of data processing by introducing multiple methods and ways of filtering and aggregating data, with a performance and syntax comparison of the deservedly popular data.table and dplyr packages.

Chapter 4, Restructuring Data, covers more complex data transformations, such as applying functions on subsets of a dataset, merging data, and transforming to and from long and wide table formats, to perfectly fit your source data with your desired data workflow.

Chapter 5, Building Models (authored by Renata Nemeth and Gergely Toth), is the first chapter that deals with real statistical models, and it introduces the concepts of regression and models in general. This short chapter explains how to test the assumptions of a model and interpret the results via building a linear multivariate regression model on a real-life dataset.

Chapter 6, Beyond the Linear Trend Line (authored by Renata Nemeth and Gergely Toth), builds on the previous chapter, but covers the problems of non-linear associations of predictor variables and provides further examples on generalized linear models, such as logistic and Poisson regression.

Chapter 7, Unstructured Data, introduces new data types. These might not include any information in a structured way. Here, you learn how to use statistical methods to process such unstructured data through some hands-on examples on text mining algorithms, and visualize the results.

Chapter 8, Polishing Data, covers another common issue with raw data sources. Most of the time, data scientists handle dirty-data problems, such as trying to cleanse data from errors, outliers, and other anomalies. On the other hand, it's also very important to impute or minimize the effects of missing values.

Chapter 9, From Big to Smaller Data, assumes that your data is already loaded, clean, and transformed into the right format. Now you can start analyzing the usually high number of variables, to which end we cover some statistical methods on dimension reduction and other data transformations on continuous variables, such as principal component analysis, factor analysis, and multidimensional scaling.

Chapter 10, Classification and Clustering, discusses several ways of grouping observations in a sample using supervised and unsupervised statistical and machine learning methods, such as hierarchical and k-means clustering, latent class models, discriminant analysis, logistic regression and the k-nearest neighbors algorithm, and classification and regression trees.

Chapter 11, A Social Network Analysis of the R Ecosystem, concentrates on a special data structure and introduces the basic concept and visualization techniques of network analysis, with a special focus on the igraph package.

Chapter 12, Analyzing a Time Series, shows you how to handle time-date objects and analyze related values by smoothing, seasonal decomposition, and ARIMA, including some forecasting and outlier detection as well.

Chapter 13, Data around Us, covers another important dimension of data, with a primary focus on visualizing spatial data with thematic, interactive, contour, and Voronoi maps.

Chapter 14, Analyzing the R Community, provides a more complete case study that combines many different methods from the previous chapters to highlight what you have learned in this book and what kind of questions and problems you might face in future projects.

Appendix, References, lists the R packages used in this book and suggests some further reading for each of the preceding chapters.

What you need for this book

All the code examples provided in this book should be run in the R console, which needs to be installed on your computer. You can download the software for free and find the installation instructions for all major operating systems at http://r-project.org.

Although we will not cover advanced topics, such as how to use R in Integrated Development Environments (IDE), there are awesome plugins and extensions for Emacs, Eclipse, vi, and Notepad++, besides other editors. Also, we highly recommend that you try RStudio, which is a free and open source IDE dedicated to R, at https://www.rstudio.com/products/RStudio.

Besides a working R installation, we will also use some user-contributed R packages. These can easily be installed from the Comprehensive R Archive Network (CRAN) in most cases. The sources of the required packages and the versions used to produce the output in this book are listed in Appendix, References.

To install a package from CRAN, you will need an Internet connection. To download the binary files or sources, use the install.packages command in the R console, like this:

> install.packages('pander')

Some packages mentioned in this book are not (yet) available on CRAN, but may be installed from Bitbucket or GitHub. These packages can be installed via the install_bitbucket and install_github functions from the devtools package. Windows users should first install Rtools from https://cran.r-project.org/bin/windows/Rtools.
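
As a minimal sketch of the GitHub route described above, the following wraps the install_github call in a small helper. The helper name install_from_github and the repository name in the commented example are assumptions made for illustration, not something prescribed by the book:

```r
# Hypothetical helper (the name is an assumption): install a package
# from a GitHub repository given in the "user/repository" form.
install_from_github <- function(repo) {
  # devtools itself is distributed via CRAN, so fetch it first if missing
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github(repo)
}

# Example call (needs an Internet connection); the repository name
# below is an assumption used purely as an illustration:
# install_from_github("Rapporter/pander")
```

Bitbucket-hosted packages could be handled the same way by calling devtools::install_bitbucket instead.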

After installation, the package should be loaded to the current R session before you can start using it. All the required packages are listed in the appendix, but the code examples also include the related R command for each package at the first occurrence in each chapter:

> library(pander)
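
The install-then-load routine can also be combined into one step. The following is only a sketch: load_or_install is a hypothetical helper name, and the example uses the tools package (shipped with R) so that it runs without an Internet connection:

```r
# Hypothetical helper (not from the book): attach a package to the
# current session, installing it from CRAN first if it is missing.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Needs an Internet connection and a configured CRAN mirror
    install.packages(pkg)
  }
  # character.only = TRUE lets us pass the package name as a string
  library(pkg, character.only = TRUE)
}

# 'tools' ships with R, so no download is triggered here
load_or_install("tools")
```

For packages that are not yet installed, the same call would trigger a download from CRAN before attaching the package.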

We highly recommend downloading the code example files of this book (refer to the Downloading the example code section) so that you can easily copy and paste the commands in the R console without the R prompt shown in the printed version of the examples and output in the book.

If you have no experience with R, you should start with some free introductory articles and manuals from the R home page, and a short list of suggested materials is also available in the appendix of this book.

Who this book is for

If you are a data scientist, R developer, engineer, or analyst who wants to explore and optimize your use of R's advanced features and tools, then this is the book for you. Basic knowledge of R is required, along with an understanding of database logic, although the book can get you up and running quickly by providing references to introductory materials.

Conventions

You will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Function names, arguments, variables, and other code references in the text are shown as follows: "The header argument of the read.big.matrix function defaults to FALSE."

Any command-line input or output that is shown in the R console is written as follows:

> set.seed(42)
> data.frame(
+   A = runif(2),
+   B = sample(letters, 2))
          A B
1 0.9148060 h
2 0.9370754 u

The > character represents the prompt, which means that the R console is waiting for commands to be evaluated. Multiline expressions start with the same symbol on the first line, but all other lines have a + sign at the beginning to show that the last R expression is not complete yet (for example, a closing parenthesis or a quote is missing). The output is returned without any extra leading character, with the same monospaced font style.

New terms and important words are shown in bold.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail us and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/1234OT_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us, and we will do our best to address the problem.