Data Wrangling with R

By: Gustavo R Santos

Overview of this book

In this information era, where large volumes of data are being generated every day, companies want to get a better grip on it to perform more efficiently than before. This is where skillful data analysts and data scientists come into play, wrangling and exploring data to generate valuable business insights. In order to do that, you’ll need plenty of tools that enable you to extract the most useful knowledge from data. Data Wrangling with R will help you to gain a deep understanding of ways to wrangle and prepare datasets for exploration, analysis, and modeling. This data book enables you to get your data ready for more optimized analyses, develop your first data model, and perform effective data visualization. The book begins by teaching you how to load and explore datasets. Then, you’ll get to grips with the modern concepts and tools of data wrangling. As data wrangling and visualization are intrinsically connected, you’ll go over best practices to plot data and extract insights from it. The chapters are designed in a way to help you learn all about modeling, as you will go through the construction of a data science project from end to end, and become familiar with the built-in RStudio, including an application built with Shiny dashboards. By the end of this book, you’ll have learned how to create your first data model and build an application with Shiny in R.

Why data wrangling?

Now that you know what data wrangling means, I am sure you share my view that this is a tremendously important subject – otherwise, I don’t think you would be reading this book.

In statistics and data science, there is a frequently repeated phrase: garbage in, garbage out. This popular saying captures why wrangling data matters: our analysis, or even our model, will only be as good as the data that we present to it. You could also use the weakest-link-in-the-chain analogy to describe that importance: if your data is weak, the rest of the analysis can easily be broken by questions and arguments.

Let me give you a naïve example, but one that is still very precise, to illustrate my point. If we receive a dataset like the one in Figure 1.2, everything looks right at first glance. There are city names and temperatures, and it is a common format used to present data. However, for data science, this data may not be ready for use just yet.

Figure 1.2 – Temperatures for cities

Notice that all the month columns refer to the same variable, which is Temperature. With a dataset presented as in Figure 1.2, we would have trouble plotting even simple graphics in R, as well as using the dataset for modeling.

In this case, a simple transformation of the table from wide to long format would be enough to complete the data-wrangling task.

Figure 1.3 – Dataset ready for use

At first glance, Figure 1.2 might appear to be the better-looking option. And, in fact, it is for human eyes. The presentation of the dataset in Figure 1.2 makes it much easier for us to compare values and draw conclusions. However, we must not forget that we are dealing with computers, and machines don’t process data the same way humans do. To a computer, Figure 1.2 has seven variables: City, Jan, Feb, Mar, Apr, May, and Jun, while Figure 1.3 has only three: City, Month, and Temperature.
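The wide-to-long transformation described above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions: the city names and temperature values are made up (they are not the data from Figure 1.2), and the object names `wide`, `long`, and `months` are hypothetical.

```r
# Hypothetical temperatures; these values are invented for illustration
# and are not the data shown in Figure 1.2.
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
wide <- data.frame(
  City = c("City A", "City B"),
  Jan = c(1, 20), Feb = c(3, 21), Mar = c(8, 22),
  Apr = c(13, 24), May = c(18, 26), Jun = c(23, 28)
)

# Wide to long: stack the six month columns into a single Temperature
# column, repeating City and recording which month each value came from.
# unlist() concatenates the columns one after another (all Jan values,
# then all Feb values, ...), so Month uses `each` and City uses `times`.
long <- data.frame(
  City = rep(wide$City, times = length(months)),
  Month = rep(months, each = nrow(wide)),
  Temperature = unlist(wide[months], use.names = FALSE)
)

ncol(wide)  # seven variables: City plus one column per month
ncol(long)  # three variables: City, Month, Temperature
print(long)
```

If the tidyr package is installed, its `pivot_longer()` function performs the same reshaping in a single call. Base R is used here only so that the sketch runs without extra packages.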

Now comes the fun part; let’s compare how a computer would receive both sets of data. A command to plot the temperature timeline by city for Figure 1.2 would be as follows: Computer, take a city and the temperatures during the months of Jan, Feb, Mar, Apr, May, and Jun in that city. Then consider each of the month names as a point on the x axis and the associated temperature as a point on the y axis. Plot a line for the temperature throughout the months for each of the cities.

Figure 1.3 is much clearer to the computer. It does not need to separate anything. The dataset is ready, so look at how the command would be given: Computer, for each city, plot the month on the x axis and the temperature on the y axis.
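To make the second command concrete, here is a minimal sketch using base R graphics. It assumes a long-format data frame with the columns City, Month, and Temperature. The values are invented for illustration, and the names `long`, `months`, and `cities` are hypothetical.

```r
# Hypothetical long-format data (made-up values, not the book's figures).
months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
long <- data.frame(
  City = rep(c("City A", "City B"), each = length(months)),
  Month = factor(rep(months, times = 2), levels = months),
  Temperature = c(1, 3, 8, 13, 18, 23, 20, 21, 22, 24, 26, 28)
)

# Empty canvas: months on the x axis, temperature on the y axis.
plot(1, type = "n",
     xlim = c(1, length(months)), ylim = range(long$Temperature),
     xaxt = "n", xlab = "Month", ylab = "Temperature")
axis(1, at = seq_along(months), labels = months)

# "For each city, plot the month on the x axis and the temperature
# on the y axis."
cities <- unique(long$City)
for (i in seq_along(cities)) {
  one_city <- long[long$City == cities[i], ]
  lines(as.integer(one_city$Month), one_city$Temperature, col = i)
}
legend("topleft", legend = cities, col = seq_along(cities), lty = 1)
```

Because Month and Temperature are ordinary columns, the loop body never has to know how many months exist or what they are called. With the wide layout, the month columns would first have to be gathered by name.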

Much simpler, don’t you agree? That is the importance of data wrangling for data science.

Benefits

Performing good data wrangling will improve the overall quality of the entire analysis process. Here are the benefits:

Structured data: Your data will be organized and easily understandable by other data scientists.
Faster results: If the data is already in a usable state, creating plots or using it as input to an algorithm will certainly be faster.
Better data flow: To be able to use the data for modeling or for a dashboard, it needs to be properly formatted and cleaned. Good data wrangling enables the data to flow to the next steps of the process, making data pipelines and automation possible.
Aggregation: As we saw in the example in the previous section, the data must be in a suitable format for the computer to understand. Having well-wrangled datasets will help you aggregate them quickly for insight extraction.
Data quality: Data wrangling is about transforming the data into a ready-to-use state. During this process, you will clean, aggregate, filter, and sort the data, visualize it, assess its quality, deal with outliers, and identify faulty or incomplete data.
Data enriching: During wrangling, you might be able to enrich the data by creating new variables out of the original ones or joining other datasets to make your data more complete.
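
Three of the benefits above (data quality, aggregation, and data enriching) can be illustrated with a short base R sketch. It is not a full workflow: the data is invented, the missing value is planted on purpose, the temperatures are assumed to be in degrees Celsius, and the names `long`, `n_missing`, `city_means`, and `TemperatureF` are hypothetical.

```r
# Hypothetical long-format data; the NA is planted to illustrate a
# quality check. Temperatures are assumed to be in degrees Celsius.
long <- data.frame(
  City = rep(c("City A", "City B"), each = 3),
  Month = rep(c("Jan", "Feb", "Mar"), times = 2),
  Temperature = c(1, 3, NA, 20, 21, 22)
)

# Data quality: count the missing temperature readings.
n_missing <- sum(is.na(long$Temperature))

# Aggregation: mean temperature per city. The formula interface of
# aggregate() drops rows with missing values by default (na.omit).
city_means <- aggregate(Temperature ~ City, data = long, FUN = mean)

# Data enriching: derive a new variable from an existing one
# (Celsius to Fahrenheit, under the assumption stated above).
long$TemperatureF <- long$Temperature * 9 / 5 + 32

print(n_missing)
print(city_means)
print(long)
```

Each step is a single expression because every observation sits in its own row, with one column per variable.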

Every project, whether related to data science or not, can benefit from data wrangling. As we just listed, it brings many benefits to the analysis and ultimately improves the quality of the deliverables. But to get the best from it, there are steps to follow.