Mastering Data Analysis with R

By: Gergely Daróczi

Chapter 8. Polishing Data

When working with data, you will usually find that it is not perfectly clean: it contains missing values, outliers, and similar anomalies. Handling and cleaning imperfect, or so-called dirty, data is part of every data scientist's daily life; moreover, it can take up to 80 percent of the time we actually spend dealing with the data!

Dataset errors are often due to inadequate data acquisition methods, but instead of repeating and tweaking the data collection process, it is usually better (in terms of saving money, time, and other resources), or simply unavoidable, to polish the data with a few simple functions and algorithms. In this chapter, we will cover:

Different use cases of the na.rm argument of various functions
The na.action function and related functions to get rid of missing data
Several packages that offer a user-friendly way of data imputation
The outliers package with several statistical tests for extreme values
How to implement Lund's outlier test on our own as...