Book Image

Mastering Machine Learning with R - Third Edition

By : Cory Lesmeister
Book Image

Mastering Machine Learning with R - Third Edition

By: Cory Lesmeister

Overview of this book

Given the growing popularity of the R-zerocost statistical programming environment, there has never been a better time to start applying ML to your data. This book will teach you advanced techniques in ML ,using? the latest code in R 3.5. You will delve into various complex features of supervised learning, unsupervised learning, and reinforcement learning algorithms to design efficient and powerful ML models. This newly updated edition is packed with fresh examples covering a range of tasks from different domains. Mastering Machine Learning with R starts by showing you how to quickly manipulate data and prepare it for analysis. You will explore simple and complex models and understand how to compare them. You’ll also learn to use the latest library support, such as TensorFlow and Keras-R, for performing advanced computations. Additionally, you’ll explore complex topics, such as natural language processing (NLP), time series analysis, and clustering, which will further refine your skills in developing applications. Each chapter will help you implement advanced ML algorithms using real-world examples. You’ll even be introduced to reinforcement learning, along with its various use cases and models. In the concluding chapters, you’ll get a glimpse into how some of these blackbox models can be diagnosed and understood. By the end of this book, you’ll be equipped with the skills to deploy ML techniques in your own projects or at work.
Table of Contents (16 chapters)

Overview

If you haven't been exposed to large, messy datasets, then be patient, for it's only a matter of time. If you've encountered such data, has it been in a domain where you have little subject matter expertise? If not, then once again I proffer that it's only a matter of time. Some of the common problems that make up this term messy data include the following:

  • Missing or invalid values
  • Novel levels in a categorical feature that show up in algorithm production
  • High cardinality in categorical features such as zip codes
  • High dimensionality
  • Duplicate observations

So this begs the question what are we to do? Well, first we need to look at what are the critical tasks that need to be performed during this phase of the process. The following tasks serve as the foundation for building a learning algorithm. They're from the paper by SPSS, CRISP-DM 1.0, a step-by-step data-mining guide available at https://the-modeling-agency.com/crisp-dm.pdf:

  • Data understanding:
    1. Collect
    2. Describe
    3. Explore
    4. Verify
  • Data preparation:
    1. Select
    2. Clean
    3. Construct
    4. Integrate
    5. Format

Certainly this is an excellent enumeration of the process, but what do we really need to do? I propose that, in practical terms we can all relate to, the following must be done once the data is joined and loaded into your machine, cloud, or whatever you use:

  • Understand the data structure
  • Dedupe observations
  • Eliminate zero variance features and low variance features as desired
  • Handle missing values
  • Create dummy features (one-hot encoding)
  • Examine and deal with highly correlated features and those with perfect linear relationships
  • Scale as necessary
  • Create other features as desired

Many feel that this is a daunting task. I don't and, in fact, I quite enjoy it. If done correctly and with a judicious application of judgment, it should reduce the amount of time spent at this first stage of a project and facilitate training your learning algorithm. None of the previous steps are challenging, but it can take quite a bit of time to write the code to perform each task.

Well, that's the benefit of this chapter. The example to follow will walk you through the tasks and the R code that accomplishes it. The code is flexible enough that you should be able to apply it to your projects. Additionally, it will help you gain an understanding of the data at a point you can intelligently discuss it with Subject Matter Experts (SMEs) if, in fact, they're available.

In the practical exercise that follows, we'll work with a small dataset. However, it suffers from all of the problems described earlier. Don't let the small size fool you, as we'll take what we learn here and use it for the more massive datasets to come in subsequent chapters.

As background, the data we'll use I put together painstakingly by hand. It's the Order of Battle for the opposing armies at the Battle of Gettysburg, fought during the American Civil War, July 1st-3rd, 1863, and the casualties reported by the end of the day on July 3rd. I purposely chose this data because I'm reasonably sure you know very little about it. Don't worry, I'm the SME on the battle here and will walk you through it every step of the way. The one thing that we won't cover in this chapter is dealing with large volumes of textual features, which we'll discuss later in this book. Enough said already; let's get started!

The source used in the creation of the dataset is The Gettysburg Campaign in Numbers and Losses: Synopses, Orders of Battle, Strengths, Casualties, and Maps, June 9-July 14, 1863, by J. David Petruzzi and Steven A. Stanley.