Mastering Data Analysis with R

By: Gergely Daróczi

Data imputation

Sometimes omitting missing values is not reasonable, or not possible at all, for example when there are only a few observations or when the data does not seem to be missing at random. Data imputation is a viable alternative in such situations: it replaces NA with actual values chosen by one of various algorithms, such as filling empty cells with:

A known scalar
The previous value appearing in the column (hot-deck)
A random element from the same column
The most frequent value in the column
Different values from the same column with given probability
Predicted values based on regression or machine learning models
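
The options listed above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions, not a recommended workflow: the toy vector, the variable names, the helper predictor a, and the choice of relative frequencies as sampling probabilities are all invented for the example, and "the previous value" is implemented here as a simple last-observation-carried-forward fill.

```r
# Toy column with missing values (invented data, for illustration only)
x <- c(3, NA, 5, 5, NA, 8, NA)
miss <- is.na(x)
observed <- x[!miss]
n_miss <- sum(miss)

# 1. A known scalar
x_scalar <- replace(x, miss, 0)

# 2. The previous value appearing in the column
#    (last observation carried forward; a leading NA would stay NA)
idx <- cummax(ifelse(miss, 0L, seq_along(x)))  # position of latest non-NA
x_prev <- x[replace(idx, idx == 0, NA)]
# 3 3 5 5 5 8 8

# 3. A random element from the same column
set.seed(42)
x_random <- replace(
  x, miss,
  observed[sample.int(length(observed), n_miss, replace = TRUE)]
)

# 4. The most frequent value in the column
freq <- table(observed)
x_mode <- replace(x, miss, as.numeric(names(freq)[which.max(freq)]))
# 3 5 5 5 5 8 5

# 5. Different values from the same column with given probability
#    (assumption: probabilities are the observed relative frequencies)
vals <- as.numeric(names(freq))
p <- as.numeric(freq) / sum(freq)
x_weighted <- replace(
  x, miss,
  vals[sample.int(length(vals), n_miss, replace = TRUE, prob = p)]
)

# 6. Predicted values from a regression model
#    (assumption: a hypothetical, fully observed predictor column `a`)
df <- data.frame(a = seq_along(x), b = x)
fit <- lm(b ~ a, data = df)  # rows with NA in b are dropped by default
df$b_pred <- ifelse(is.na(df$b), predict(fit, newdata = df), df$b)
```

Each approach returns a new vector and leaves x untouched, which makes it easy to compare the imputed versions side by side before deciding which one suits the data.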

The hot-deck method is often used when joining multiple datasets. In that situation, the roll argument of data.table can be very useful and efficient; otherwise, be sure to check out the hotdeck function in the VIM package, a package that also offers some really useful ways of visualizing missing data. But when dealing with an already given column of a dataset, we have some other...