Data Wrangling with R

By: Gustavo R Santos

Overview of this book

In this information era, where large volumes of data are generated every day, companies want to get a better grip on their data to operate more efficiently than before. This is where skillful data analysts and data scientists come into play, wrangling and exploring data to generate valuable business insights. To do that, you’ll need plenty of tools that enable you to extract the most useful knowledge from data. Data Wrangling with R will help you gain a deep understanding of ways to wrangle and prepare datasets for exploration, analysis, and modeling. This book enables you to get your data ready for more optimized analyses, develop your first data model, and perform effective data visualization.

The book begins by teaching you how to load and explore datasets. Then, you’ll get to grips with the modern concepts and tools of data wrangling. As data wrangling and visualization are intrinsically connected, you’ll go over best practices for plotting data and extracting insights from it. The chapters are designed to help you learn all about modeling as you go through the construction of a data science project from end to end, becoming familiar with RStudio along the way, including an application built with Shiny dashboards. By the end of this book, you’ll have learned how to create your first data model and build an application with Shiny in R.
Table of Contents (21 chapters)

Part 1: Load and Explore Data
Part 2: Data Wrangling
Part 3: Data Visualization
Part 4: Modeling

The key steps of data wrangling

There are some basic steps that help data scientists and analysts work through the data-wrangling part of the process. Naturally, when you first see a dataset, it is important to understand it, then organize, clean, enrich, and validate it before using it as input for a model.

Figure 1.4 – Steps of data wrangling

  1. Understand: The first step to take once we get our hands on new data is to understand it. Take some time to read the data dictionary, which is a document with descriptions of the variables, if one is available, or talk to the owner(s) of the data to really understand what each data point represents and how it does or does not connect to your main purpose and to the business questions you are trying to answer. This will make the following steps clearer.
  2. Format: Step two is to format or organize the data. Raw data may come unstructured or formatted in a way that is not usable. Therefore, it is important to be familiar with the tidy format. Tidy data is a concept developed by Hadley Wickham in his 2014 paper of the same name (Tidy Data, The Journal of Statistical Software, vol. 59, 2014), where he presents a standard way to organize and structure datasets, making the cleaning and exploration steps easier. Another benefit is that it facilitates transferring a dataset between different tools that use the same format. The tidy data concept is now widely accepted, which helps you focus on the analysis instead of munging the dataset every time you need to move it down the pipeline.

Tidy data standardizes the way the structure of a dataset is linked to its semantics; in other words, how the layout is linked to the meaning of the values. More specifically, structure refers to the rows and columns, which can be labeled. Most of the time, the columns are labeled but the rows are not. Every value, in turn, belongs to a variable and an observation; this is the data semantics. In a tidy dataset, each variable is a column that holds all the values for one attribute, and each row is associated with one observation. Take the dataset extract in Figure 1.5 as an example. Looking at the horsepower column, we see values such as 110, 110, 93, and 110 for four different cars. At the observation level, each row is one observation, holding one value for each attribute or variable, so a car could be associated with HP=110, 6 cylinders, 21 miles per gallon, and so on.

Figure 1.5 – Tidy data. Each row is one observation; each column is a variable
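
The values mentioned above match R’s built-in mtcars dataset, so the extract in Figure 1.5 appears to be taken from it (an assumption based on those values). A minimal way to reproduce a similar view in base R is:

# Inspect a few variables of the built-in mtcars dataset,
# assumed here to be the source of the extract in Figure 1.5
head(mtcars[, c("mpg", "cyl", "hp")], n = 4)

# Confirm the tidy layout: one column per variable, one row per car (observation)
str(mtcars[, c("mpg", "cyl", "hp")])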

According to Wickham (https://tinyurl.com/2dh75y56), here are the three rules of tidy data:

  • Every column is a variable
  • Every row is an observation
  • Every cell is a single value
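
As a minimal sketch of what these rules mean in practice, the made-up wide table below hides a variable (the year) inside its column names; tidyr’s pivot_longer() reshapes it so that every column is a variable, every row is an observation, and every cell is a single value. The data and column names here are hypothetical, used only for illustration:

library(tidyr)

# Hypothetical untidy data: the year is trapped in the column names
sales_wide <- data.frame(
  product    = c("A", "B"),
  sales_2021 = c(100, 150),
  sales_2022 = c(120, 170)
)

# Reshape so that every column is a variable (product, year, sales),
# every row is one observation, and every cell holds a single value
sales_tidy <- pivot_longer(
  sales_wide,
  cols         = starts_with("sales_"),
  names_to     = "year",
  names_prefix = "sales_",
  values_to    = "sales"
)

sales_tidy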
  3. Clean: This step determines the overall quality of the data. There are many forms of data cleaning, such as splitting and parsing variables, handling missing values, dealing with outliers, and removing erroneous entries (see the sketch after this list).
  4. Enrich: As you work through the data-wrangling steps and become more familiar with the data, questions will arise and, sometimes, more data will be needed. That can be solved either by joining another dataset to the original one to bring in new variables or by creating new variables from those you already have.
  5. Validate: To validate is to make sure that the cleaning, formatting, and transformations are all in place and the data is ready for modeling or other analyses.
  6. Analysis/Model: Once everything is complete, your dataset is ready for use in the next phases of the project, such as the creation of a dashboard or modeling.
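
As a minimal, illustrative sketch of the Clean and Enrich steps (with a quick Validate check), the dplyr/tidyr code below works on made-up data frames; all column names and values are hypothetical:

library(dplyr)
library(tidyr)

# Hypothetical raw data with a missing value and an erroneous entry
orders <- data.frame(
  order_id    = 1:4,
  customer_id = c(10, 11, 11, 12),
  amount      = c(250, NA, 99999, 180)   # one NA and one implausible outlier
)

customers <- data.frame(
  customer_id = c(10, 11, 12),
  region      = c("North", "South", "West")
)

cleaned <- orders %>%
  filter(is.na(amount) | amount < 10000) %>%                            # Clean: drop the erroneous entry
  mutate(amount = replace_na(amount, median(amount, na.rm = TRUE))) %>% # Clean: impute the missing value
  left_join(customers, by = "customer_id") %>%                          # Enrich: join a new variable from another dataset
  mutate(amount_log = log(amount))                                      # Enrich: create a new variable from an existing one

# Validate: quick checks that the data is ready for the next phase
stopifnot(!anyNA(cleaned$amount), all(cleaned$amount < 10000))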

As with every process, we must follow steps to achieve the best performance, standardize our efforts, and allow them to be reproduced and scaled if needed. Next, we will look at three frameworks for Data Science projects that help make the process easy to follow and reproduce.

Frameworks in Data Science

Data Science is no different from other sciences, and it also follows some common steps. Ergo, frameworks can be designed to guide people through the process, as well as to help implement a standardized process in a company.

It is important that a Data Scientist has a holistic understanding of the flow of the data, from the moment of acquisition to the end point, since the resulting business knowledge is what will support decisions.

In this section, we will take a closer look at three well-known frameworks that can be used for Data Science projects: KDD, SEMMA, and CRISP-DM. Let’s get to know more about them.

KDD

KDD stands for Knowledge Discovery in Databases. It is a framework to extract knowledge from data in the context of large databases.

Figure 1.6 – KDD process

The process is iterative and follows these steps:

  1. Data: Acquiring the data from a database
  2. Selection: Creating a representative target set that is a subset of the data with selected variables or samples of interest
  3. Preprocessing: Data cleaning and preprocessing to remove outliers and handle missing and noisy data
  4. Transformation: Transforming and using dimensionality reduction to format the data
  5. Data Mining: Using algorithms to analyze and search for patterns of interest (for example, classification and clustering)
  6. Interpretation/Evaluation: Interpreting and evaluating the mined patterns

After the evaluation, if the results are not satisfactory, the process can be repeated with enhancements such as more data, a different subset, or a tweaked algorithm.
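
As an illustrative, non-authoritative sketch, the base R code below mirrors these KDD steps on the built-in iris dataset; the dataset, the scaling choice, and the three clusters are assumptions made only for this example:

# Selection: a target set with the variables of interest
target <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Preprocessing: handle missing values (iris has none, but as a precaution)
target <- na.omit(target)

# Transformation: scale the variables so they are comparable
target_scaled <- scale(target)

# Data Mining: search for patterns with a clustering algorithm
set.seed(42)
clusters <- kmeans(target_scaled, centers = 3)

# Interpretation/Evaluation: compare the mined clusters with the known species
table(clusters$cluster, iris$Species)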

SEMMA

SEMMA stands for Sample, Explore, Modify, Model, and Assess. These are the steps of the process.

Figure 1.7 – SEMMA process

SEMMA is a cyclic process that flows more naturally with Data Science, and it does not contain stages in the way KDD does. The steps are as follows:

  1. Sample: Take a sample that is large enough to be statistically representative but small enough to be quick to work with
  2. Explore: During this step, the goal is to understand the data and generate visualizations and descriptive statistics, looking for patterns and anomalies
  3. Modify: Here is where data wrangling plays a more intensive role, where the transformations occur to make the data ready for modeling
  4. Model: This step is where algorithms are used to generate estimates, predictions, or insights from the data
  5. Assess: Evaluate the results
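
A minimal sketch of one SEMMA cycle in R, using the built-in mtcars dataset and a simple linear model (both are choices made for this example, not prescriptions of the framework):

set.seed(123)

# Sample: a subset that is representative yet quick to work with
idx   <- sample(nrow(mtcars), size = 24)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Explore: descriptive statistics, looking for patterns and anomalies
summary(train[, c("mpg", "wt", "hp")])

# Modify: create a new variable to make the data ready for modeling
train$hp_per_ton <- train$hp / train$wt
test$hp_per_ton  <- test$hp / test$wt

# Model: generate predictions from the data
fit <- lm(mpg ~ wt + hp_per_ton, data = train)

# Assess: evaluate the results on the held-out rows (RMSE)
pred <- predict(fit, newdata = test)
sqrt(mean((test$mpg - pred)^2))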

CRISP-DM

This framework’s name is an acronym for Cross-Industry Standard Process for Data Mining. It provides the data scientist with the typical phases of a project as well as an overview of the data mining life cycle.

Figure 1.8 – CRISP-DM life cycle

The CRISP-DM life cycle has six phases, with arrows indicating the dependencies between them, but the key point is that there is no strict order to follow. The project can move back and forth during the process, making it a flexible framework. Let’s go through the phases:

  • Business understanding: Like the other two frameworks presented, it all starts with understanding the problem, the business. Understanding the business rules and specificities is often even more important than getting to the solution fast. That is because a solution may not be ideal for that kind of business. The business rules must always drive the solution.
  • Data understanding: This involves collecting and exploring the data. Make sure the data collected is representative of the whole and get familiar with it to be able to find errors, faulty data, and missing values and to assess quality. All these tasks are part of data understanding.
  • Data preparation: Once you are familiar with the data collected, it is time to wrangle it and prepare it for modeling.
  • Modeling: This involves applying Data Science algorithms or performing the desired analysis on the processed data.
  • Evaluation: This step is used to assess whether the solution is aligned with the business requirement and whether it is performing well.
  • Deployment: In this step, the model reaches its purpose (for example, an application that predicts a group or a value, a dashboard, and so on).
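
To make the Deployment phase more concrete, here is a minimal, hypothetical Shiny app that serves predictions from a toy model fitted on the built-in mtcars dataset; the model, inputs, and labels are placeholders, and the book’s own Shiny application is covered in later chapters:

library(shiny)

# A placeholder model fitted on the built-in mtcars dataset
model <- lm(mpg ~ wt + hp, data = mtcars)

ui <- fluidPage(
  titlePanel("Fuel efficiency estimator (illustrative)"),
  numericInput("wt", "Weight (1,000 lbs):", value = 3, min = 1, max = 6, step = 0.1),
  numericInput("hp", "Horsepower:", value = 110, min = 50, max = 350),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    pred <- predict(model, newdata = data.frame(wt = input$wt, hp = input$hp))
    paste("Estimated miles per gallon:", round(pred, 1))
  })
}

shinyApp(ui = ui, server = server)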

These three frameworks have a lot in common if you look closely. They start with understanding the data, go through data wrangling with cleaning and transformation, then move on to the modeling phase, and end with the evaluation of the model, usually iterating to assess flaws and improve the results.