Book Image

Machine Learning with R

By : Brett Lantz
Book Image

Machine Learning with R

By: Brett Lantz

Overview of this book

Machine learning, at its core, is concerned with transforming data into actionable knowledge. This fact makes machine learning well-suited to the present-day era of "big data" and "data science". Given the growing prominence of R—a cross-platform, zero-cost statistical programming environment—there has never been a better time to start applying machine learning. Whether you are new to data science or a veteran, machine learning with R offers a powerful set of methods for quickly and easily gaining insight from your data. "Machine Learning with R" is a practical tutorial that uses hands-on examples to step through real-world application of machine learning. Without shying away from the technical details, we will explore Machine Learning with R using clear and practical examples. Well-suited to machine learning beginners or those with experience. Explore R to find the answer to all of your questions. How can we use machine learning to transform data into action? Using practical examples, we will explore how to prepare data for analysis, choose a machine learning method, and measure the success of the process. We will learn how to apply machine learning methods to a variety of common tasks including classification, prediction, forecasting, market basket analysis, and clustering. By applying the most effective machine learning methods to real-world problems, you will gain hands-on experience that will transform the way you think about data. "Machine Learning with R" will provide you with the analytical tools you need to quickly gain insight from complex data.
Table of Contents (19 chapters)
Machine Learning with R
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
9
Finding Groups of Data – Clustering with k-means
Index

Choosing a machine learning algorithm


The process of choosing a machine learning algorithm involves matching the characteristics of the data to be learned to the biases of the available approaches. Since the choice of a machine learning algorithm is largely dependent upon the type of data you are analyzing and the proposed task at hand, it is often helpful to be thinking about this process while you are gathering, exploring, and cleaning your data.

Tip

It may be tempting to learn a couple of machine learning techniques and apply them to everything, but resist this temptation. No machine learning approach is best for every circumstance. This fact is described by the No Free Lunch theorem, introduced by David Wolpert in 1996. For more information, visit: http://www.no-free-lunch.org.

Thinking about the input data

All machine learning algorithms require input training data. The exact format may differ, but in its most basic form, input data takes the form of examples and features.

An example is literally a single exemplary instance of the underlying concept to be learned; it is one set of data describing the atomic unit of interest for the analysis. If you were building a learning algorithm to identify spam e-mail, the examples would be data from many individual electronic messages. To detect cancerous tumors, the examples might comprise biopsies from a number of patients.

The phrase unit of observation is used to describe the units that the examples are measured in. Commonly, the unit of observation is in the form of transactions, persons, time points, geographic regions, or measurements. Other possibilities include combinations of these such as person years, which would denote cases where the same person is tracked over multiple time points.

A feature is a characteristic or attribute of an example, which might be useful for learning the desired concept. In the previous examples, attributes in the spam detection dataset might consist of the words used in the e-mail messages. For the cancer dataset, the attributes might be genomic data from the biopsied cells, or measured characteristics of the patient such as weight, height, or blood pressure.

The following spreadsheet shows a dataset in matrix format, which means that each example has the same number of features. In matrix data, each row in the spreadsheet is an example and each column is a feature. Here, the rows indicate examples of automobiles while the columns record various features of the cars such as the price, mileage, color, and transmission. Matrix format data is by far the most common form used in machine learning, though as you will see in later chapters, other forms are used occasionally in specialized cases.

Features come in various forms as well. If a feature represents a characteristic measured in numbers, it is unsurprisingly called numeric. Alternatively, if it measures an attribute that is represented by a set of categories, the feature is called categorical or nominal. A special case of categorical variables is called ordinal, which designates a nominal variable with categories falling in an ordered list. Some examples of ordinal variables include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from 1 to 5. It is important to consider what the features represent because the type and number of features in your dataset will assist with determining an appropriate machine learning algorithm for your task.

Thinking about types of machine learning algorithms

Machine learning algorithms can be divided into two main groups: supervised learners that are used to construct predictive models, and unsupervised learners that are used to build descriptive models. Which type you will need to use depends on the learning task you hope to accomplish.

A predictive model is used for tasks that involve, as the name implies, the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship among the target feature (the feature being predicted) and the other features. Despite the common use of the word "prediction" to imply forecasting predictive models need not necessarily foresee future events. For instance, a predictive model could be used to predict past events such as the date of a baby's conception using the mother's hormone levels; or, predictive models could be used in real time to control traffic lights during rush hours.

Because predictive models are given clear instruction on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather the fact that the target values provide a supervisory role, which indicates to the learner the task it needs to learn. Specifically, given a set of data, the learning algorithm attempts to optimize a function (the model) to find the combination of feature values that result in the target output.

The often used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether:

  • A football team will win or lose

  • A person will live past the age of 100

  • An applicant will default on a loan

  • An earthquake will strike next year

The target feature to be predicted is a categorical feature known as the class and is divided into categories called levels. A class can have two or more levels, and the levels need not necessarily be ordinal. Because classification is so widely used in machine learning, there are many types of classification algorithms.

Supervised learners can also be used to predict numeric data such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression models are not the only type of numeric models, they are by far the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between the inputs and the target, including both the magnitude and uncertainty of the relationship.

Tip

Since it is easy to convert numbers to categories (for example, ages 13 to 19 are teenagers) and categories to numbers (for example, assign 1 to all males, 0 to all females), the boundary between classification models and numeric prediction models is not necessarily firm.

A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest; in a descriptive model, no single feature is more important than any other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. Although it can be more difficult to think of applications for descriptive models—after all, what good is a learner that isn't learning anything in particular—they are used quite regularly for data mining.

For example, the descriptive modeling task called pattern discovery is used to identify frequent associations within data. Pattern discovery is often used for market basket analysis on transactional purchase data. Here, the goal is to identify items that are frequently purchased together, such that the learned information can be used to refine the marketing tactics. For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunscreen, the retailer might reposition the items more closely in the store, or run a promotion to "up-sell" customers on associated items.

Tip

Originally used only in retail contexts, pattern discovery is now starting to be used in quite innovative ways. For instance, it can be used to detect patterns of fraudulent behavior, screen for genetic defects, or prevent criminal activity.

The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis that identifies groups of individuals with similar purchasing, donating, or demographic information so that advertising campaigns can be tailored to particular audiences. Although the machine is capable of identifying the groups, human intervention is required to interpret them. For example, given five different clusters of shoppers at a grocery store, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group. However, this is almost certainly easier than trying to create a unique appeal for each customer.

Matching your data to an appropriate algorithm

The following table lists the general types of machine learning algorithms covered in this book, each of which may be implemented in several ways. Although this covers only some of the entire set of all machine learning algorithms, learning these methods will provide a sufficient foundation for making sense of other methods as you encounter them.

Model

Task

Chapter

Supervised Learning Algorithms

Nearest Neighbor

Classification

Chapter 3

naive Bayes

Classification

Chapter 4

Decision Trees

Classification

Chapter 5

Classification Rule Learners

Classification

Chapter 5

Linear Regression

Numeric prediction

Chapter 6

Regression Trees

Numeric prediction

Chapter 6

Model Trees

Numeric prediction

Chapter 6

Neural Networks

Dual use

Chapter 7

Support Vector Machines

Dual use

Chapter 7

Unsupervised Learning Algorithms

Association Rules

Pattern detection

Chapter 8

k-means Clustering

Clustering

Chapter 9

To match a learning task to a machine learning approach, you will need to begin with one of the four types of tasks: classification, numeric prediction, pattern detection, or clustering. Certain tasks make the choice of algorithm simpler. For instance, if you are undertaking pattern detection, you will likely employ association rules. Similarly, a clustering problem will likely utilize the k-means algorithm while numeric prediction will utilize regression analysis or regression trees.

For classification, more thought is needed to match a learning problem to an appropriate classifier. In these cases, it is helpful to consider the various distinctions among the algorithms. For instance, within classification problems, decision trees result in models that are readily understood, while the models of neural networks are notoriously difficult to interpret. If you were designing a credit-scoring model, this could be an important distinction because law often requires that the applicant must be notified about the reasons he or she was rejected for the loan. Even if the neural network was better at predicting loan defaults if the predictions cannot be explained, then it is useless.

In each chapter, the key strengths and weaknesses of each approach will be listed. Although you will sometimes find that these characteristics exclude certain models from consideration in most cases, the choice of model is arbitrary. In this case, feel free to use whichever algorithm you are most comfortable with. Other times, when predictive accuracy is primary, you may need to test several and choose the one that fits best. In later chapters, we will even look at methods of combining models that utilize the best properties of each.