R Machine Learning Projects

By: Dr. Sunil Kumar Chinnamgari

Overview of this book

R is one of the most popular languages for statistical computing and for exploring the mathematical side of machine learning. With this book, you will leverage the R ecosystem to build efficient machine learning applications that carry out intelligent tasks within your organization. This book will help you test your knowledge and skills, guiding you through machine learning projects that range from easy to complex. You will first learn how to build powerful machine learning models with ensembles to predict employee attrition. Next, you'll implement a joke recommendation engine and learn how to perform sentiment analysis on Amazon reviews. You'll also explore different clustering techniques to segment customers using wholesale data. In addition to this, the book will get you acquainted with credit card fraud detection using autoencoders, and reinforcement learning to make predictions and win on a casino slot machine. By the end of the book, you will be equipped to confidently perform complex tasks to build research and commercial projects for automated operations.

Types of ML methods

ML can be applied to several types of tasks aimed at solving real-world problems. An ML method is a group of algorithms suited to a particular kind of problem, and it addresses the constraints that the problem brings with it. For example, one such constraint could be the availability of labeled data that can be provided as input to the learning algorithm.

Essentially, the popular ML methods are supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and transfer learning. The rest of this section details each of these methods.

Supervised learning

A supervised learning algorithm is applied when the result to be achieved is very clear, but the relationships in the data that affect that result are not. We would like the ML algorithm applied to the data to learn these relationships between the data elements so as to produce the desired output.

The concept can be better explained with an example—at a bank, prior to extending a loan, they would like to predict if a loan applicant would pay the loan back. In this case, the problem is very clear. If a loan is extended to a prospective customer X, there are two possibilities: that X would successfully repay the loan or X would not repay the loan. The bank would like to use ML to identify the category into which customer X falls; that is, a successful repayer of the loan or a repayment defaulter.

While the problem definition that is to be solved is clear, please note that the features of a customer that will contribute to successful loan repayment or non-repayment are not clear and this is something we would like the ML algorithm to learn by observing the patterns in the data.

The major challenge here is that we need to provide input data that represents both customers who repaid their loans successfully and customers who failed to repay. The bank can look at its historical data to get records of customers in both categories and then label each record as paid or unpaid, as appropriate.

The records, thus labeled, become the input to a supervised learning algorithm so that it can learn the patterns of both categories of customers. The process of learning from the labeled data is called training, and the output of the learning process is called a model. The labeled data used for training the model is called training data. Ideally, the bank would set aside some part of the labeled data, separate from the training data, to test the model created; this data is called test data.

Once the model has been built, it is evaluated on the test data to check that it yields a satisfactory level of performance; otherwise, model-building iterations are carried out until the desired performance is obtained. The model that achieves the desired performance on test data can then be used by the bank to infer whether a new loan applicant is likely to default and, in turn, to make a better decision about extending a loan to that applicant.
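The train, test, and predict cycle described above can be sketched in a few lines. The following is a minimal illustration in Python (chosen only for brevity), not a production workflow: the loan records, the two features, and the helper names are all invented, and a simple 1-nearest-neighbor rule stands in for whichever supervised algorithm the bank would actually choose.

```python
import random

# Hypothetical labeled loan records: (annual_income_in_thousands, debt_ratio) -> label.
# All values are invented for illustration only.
records = [
    ((85, 0.10), "paid"), ((92, 0.15), "paid"), ((78, 0.20), "paid"),
    ((88, 0.12), "paid"), ((95, 0.18), "paid"), ((81, 0.14), "paid"),
    ((30, 0.70), "unpaid"), ((25, 0.80), "unpaid"), ((35, 0.65), "unpaid"),
    ((28, 0.75), "unpaid"), ((32, 0.85), "unpaid"), ((27, 0.68), "unpaid"),
]

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle the labeled data and hold part of it back as test data."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict(training_data, features):
    """1-nearest-neighbor: return the label of the closest training record."""
    def distance(a, b):
        # Divide income by 100 so both features have a comparable range.
        return ((a[0] - b[0]) / 100) ** 2 + (a[1] - b[1]) ** 2
    nearest = min(training_data, key=lambda rec: distance(rec[0], features))
    return nearest[1]

def accuracy(training_data, test_data):
    """Fraction of held-out records whose label is predicted correctly."""
    correct = sum(predict(training_data, x) == y for x, y in test_data)
    return correct / len(test_data)

train, test = train_test_split(records)
print("training records:", len(train), "test records:", len(test))
print("test accuracy:", accuracy(train, test))
print("new applicant:", predict(train, (90, 0.12)))
```

With a nearest-neighbor rule, the "model" is simply the stored training data; most other algorithms instead distill the training data into a set of learned parameters.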

In a nutshell, supervised ML algorithms are employed when the objective is very clear and labeled data is available as input for the algorithm to learn the patterns from. The following diagram summarizes the supervised learning process:

Supervised learning can be further divided into two categories, namely classification and regression. The prediction of a bank loan defaulter explained in this section is an example of classification and it aims to predict a label of a nominal type such as yes or no. On the other hand, it is also possible to predict numeric values (continuous values) and this type of prediction is called regression. An example of regression is predicting the monthly rental of a home in a prime location of a city based on features such as the demand for houses in the area, the number of bedrooms, the dimensions of the house, and accessibility to public transportation.
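To make the regression case concrete, here is a small Python sketch that fits a straight line by ordinary least squares. It assumes a single feature (number of bedrooms) and invented rent figures; a realistic rental model would use several of the features listed above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: bedrooms -> monthly rent (invented numbers).
bedrooms = [1, 2, 3, 4]
monthly_rent = [1050, 1480, 2010, 2460]

slope, intercept = fit_line(bedrooms, monthly_rent)

def predict_rent(n_bedrooms):
    """Predict a continuous value (rent) rather than a class label."""
    return intercept + slope * n_bedrooms

print("predicted rent for 5 bedrooms:", predict_rent(5))
```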

Several supervised learning algorithms exist, and a few popularly known algorithms in this area include classification and regression trees (CART), logistic regression, linear regression, Naive Bayes, neural networks, k-nearest neighbors (KNN), and support vector machine (SVM).

Unsupervised learning

Labeled data is often unavailable, and labeling data manually is not cheap. This is the situation where unsupervised learning comes into play.

For example, a small boutique firm wants to roll out a promotion to its customers who are registered on its Facebook page. While the business objective is clear (a promotion needs to be rolled out to customers), it is unclear which customer falls into which group. Unlike the supervised learning example, where prior knowledge existed in terms of bad debtors and good debtors, in this case there are no such clues.

When the customer information is given as input to an unsupervised learning algorithm, the algorithm tries to identify patterns in the data and thereby groups customers with similar attributes.

Birds of a feather flock together: this is the principle followed in customer grouping with unsupervised learning.

The reasoning behind the formation of these organic groups from the grouping exercise may not be very intuitive. It may take some research to identify the factors that contributed to the gathering of a set of customers in a group. Most of the time, this research is manual and the data points in each group need verifying. This research may form the basis to determine the groups to which the particular promotion at hand needs to be rolled out. This application of unsupervised learning is called clustering. The following diagram shows the application of unsupervised ML to cluster the data points:

There are a number of clustering algorithms. The most popular ones include k-means clustering, k-modes clustering, hierarchical clustering, and fuzzy clustering.
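As an illustration of how such grouping works, the following Python sketch implements a bare-bones k-means loop. The customer features (monthly spend, visits per month) and their values are invented, and the initial centroids are naively taken to be the first k points, which is an assumption made for simplicity rather than a recommended initialization.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid_of(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(points[:k])  # naive initialization (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = centroid_of(members)
    return labels, centroids

# Hypothetical customers: (monthly_spend, visits_per_month), no labels given.
customers = [(20, 2), (200, 15), (22, 3), (210, 14), (25, 2), (190, 16)]

labels, centroids = k_means(customers, k=2)
print("cluster of each customer:", labels)
print("cluster centers:", centroids)
```

Note that the algorithm only returns group numbers; deciding what each group means (for example, occasional versus frequent shoppers) is the manual research step described earlier.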

Other forms of unsupervised learning exist. For example, in the retail industry, an unsupervised learning method called association rule mining is applied to customer purchases to identify the goods that are purchased together. In this case, unlike supervised learning, there is no need for labels at all. The task only requires the ML algorithm to identify the latent associations between the products that are billed together by customers. The information from association rule mining helps retailers place the products that are bought together in proximity. The idea is that customers can be intuitively encouraged to buy the extra products.

Apriori, equivalence class transformation (Eclat), and frequent pattern growth (FP-growth) are popular among the several algorithms that exist to perform association rule mining.
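The two quantities at the heart of association rule mining, support and confidence, can be computed by brute-force counting, as in the Python sketch below. This is not an implementation of any of the algorithms named above (they exist precisely to avoid such exhaustive counting on large data), and the shopping baskets are invented.

```python
from itertools import combinations

# Hypothetical billing data: each set is one customer's basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return count(antecedent | consequent) / count(antecedent)

def frequent_pairs(min_support):
    """All item pairs whose support reaches min_support (exhaustive search)."""
    items = sorted(set().union(*baskets))
    return [(a, b) for a, b in combinations(items, 2)
            if support({a, b}) >= min_support]

print("frequent pairs:", frequent_pairs(0.5))
print("support(bread, butter):", support({"bread", "butter"}))
print("confidence(bread -> butter):", confidence({"bread"}, {"butter"}))
```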

Yet another form of unsupervised learning is anomaly detection or outlier detection. The goal of the exercise is to identify data points that do not belong with the rest of the elements that are given as input to the unsupervised learning algorithm. Similar to association rule mining, due to the nature of the problem at hand, the algorithm does not need labels to achieve the goal.

Fraud detection is an important application of anomaly detection in the credit card industry. Credit card transactions are monitored in real time and any suspicious transaction patterns are flagged immediately to avoid losses to the credit card user as well as the credit card provider. The unusual pattern could be a huge transaction in a foreign currency rather than the currency in which the particular customer generally transacts. It could also be transactions in physical stores located on two different continents on the same day. The general idea is to flag a pattern that deviates from the norm.

K-means clustering and one-class SVM are two well-known unsupervised ML algorithms that are used to observe abnormalities in the population.
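The idea of flagging a deviation from the norm can be shown with a much simpler rule than either of those algorithms. The Python sketch below uses a robust z-score based on the median absolute deviation; this is a swapped-in statistical technique chosen for brevity, the transaction amounts are invented, and the 3.5 cutoff is a commonly used rule of thumb rather than a value taken from the text.

```python
from statistics import median

def flag_anomalies(amounts, threshold=3.5):
    """Return the amounts whose robust z-score exceeds the threshold.

    The score measures how far a value lies from the median, in units of
    the median absolute deviation (MAD), so a single extreme value does
    not distort the notion of 'normal'.
    """
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        return []  # all values (nearly) identical: nothing stands out
    return [a for a in amounts
            if 0.6745 * abs(a - center) / mad > threshold]

# Hypothetical card transactions for one customer (invented amounts).
transactions = [42, 38, 45, 40, 41, 39, 44, 43, 40, 950]

print("flagged transactions:", flag_anomalies(transactions))
```

No labels are involved: the rule learns what is typical from the data itself and reports whatever falls far outside it.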

Overall, unsupervised learning is a very important method, given that labeled data for training is a scarce resource.

Semi-supervised learning

Semi-supervised learning is a hybrid of supervised and unsupervised methods. ML requires large amounts of data for training. Most of the time, model performance improves as the amount of data used for training increases.

In niche domains such as medical imaging, a large amount of image data (MRIs, X-rays, CT scans) is available. However, qualified radiologists with the time to label these images are scarce. In this situation, we might end up getting only a few images labeled by radiologists.

Semi-supervised learning takes advantage of the few labeled images by building an initial model that is used to label the large amount of unlabeled data that exists in the domain. Once the large amount of labeled data is available, a supervised ML algorithm may be used to train and create a final model that is used for prediction tasks on the unseen data. The following diagram illustrates the steps involved in semi-supervised learning:

Speech analysis, protein synthesis, and web content classification are some areas where large amounts of unlabeled data and smaller amounts of labeled data are available. Semi-supervised learning has been applied in these areas with successful results.

Generative adversarial networks (GANs), semi-supervised support vector machines (S3VMs), graph-based methods, and Markov chain methods are well-known methods among others in the semi-supervised ML area.
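The label-then-retrain procedure described earlier in this section (often called self-training or pseudo-labeling) can be sketched as follows. This Python example is deliberately tiny and rests on several assumptions: each image is reduced to a single invented numeric score, and a nearest-centroid rule plays the role of both the initial and the final model.

```python
def fit_centroids(labeled):
    """Train a nearest-centroid model: the mean feature value per label."""
    sums, counts = {}, {}
    for value, label in labeled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, value):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

# Hypothetical data: a few expert-labeled scores and many unlabeled ones.
labeled = [(1.0, "normal"), (9.0, "abnormal")]
unlabeled = [1.5, 2.0, 8.0, 8.5, 2.5, 7.5]

# Step 1: build an initial model from the few labeled examples.
initial_model = fit_centroids(labeled)

# Step 2: use it to label the unlabeled data (pseudo-labels).
pseudo_labeled = [(value, predict(initial_model, value)) for value in unlabeled]

# Step 3: train the final model on the labeled and pseudo-labeled data together.
final_model = fit_centroids(labeled + pseudo_labeled)

print("final centroids:", final_model)
print("prediction for unseen score 4.0:", predict(final_model, 4.0))
```

In practice, only pseudo-labels that the initial model is confident about are usually kept, because wrong pseudo-labels propagate into the final model.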

Reinforcement learning

Reinforcement learning (RL) is an ML method that is neither supervised nor unsupervised. In this method, a reward definition is provided as input to the learning algorithm at the start. As the algorithm is not provided with labeled data for training, it cannot be categorized as supervised learning. On the other hand, it is not categorized as unsupervised learning either, as the algorithm is fed a reward definition that guides it through the steps to solve the problem at hand.

Reinforcement learning aims to continuously improve the strategies used to solve a problem by relying on the feedback received. The goal is to maximize the rewards while taking steps to solve the problem. The rewards obtained are computed by the algorithm itself according to the reward and penalty definitions. The idea is to arrive at the optimal steps that maximize the rewards for the problem at hand.

The following diagram is an illustration depicting a robot automatically determining the ideal behavior through a reinforcement learning method within the specific context of fire:

A machine outplaying humans in Atari video games is regarded as one of the foremost success stories of reinforcement learning. To achieve this feat, the algorithm played a large number of games, observing the screen and the score, and learned the steps to take to maximize the reward. The reward in this case is the game score. After learning, the algorithm chose, at each step of the game, the action it expected to eventually maximize the score.

Though it might appear that reinforcement learning can be applied to game scenarios only, there are numerous use cases for this method in industry as well. The following are three such use cases:

  • Dynamic pricing of goods and services based on changing supply and demand, with the aim of maximizing profit, can be achieved through a variant of reinforcement learning called Q-learning.
  • Effective use of space in warehouses is a key challenge faced by inventory management professionals. Market demand fluctuations, the large availability of inventory stocks, and delays in refilling the inventory are the key constraints that affect space utilization. Reinforcement learning algorithms are used to optimize the time to procure inventory as well as to reduce the time to retrieve goods from warehouses, which directly improves space utilization.
  • Prolonged treatments and differential drug administration are required in medical science to treat diseases such as cancer. The treatments are highly personalized, based on the characteristics of the patient, and the treatment strategy often varies at different stages. This kind of treatment plan is typically referred to as a dynamic treatment regime (DTR). Reinforcement learning helps with processing clinical trial data to come up with an appropriate personalized DTR, based on the patient characteristics that are fed in as inputs to the reinforcement learning algorithm.

There are four very popular reinforcement learning algorithms, namely Q-learning, state-action-reward-state-action (SARSA), deep Q network (DQN), and deep deterministic policy gradient (DDPG).
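To give a feel for the first of these, the following Python sketch runs tabular Q-learning on a toy problem. Everything about the environment is an assumption made for illustration: an agent in a five-cell corridor starts in cell 0, may step left or right, and receives a reward of 1 only on reaching cell 4. The hyperparameter values are likewise illustrative.

```python
import random

N_STATES = 5                 # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)           # step left or step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration

def step(state, action):
    """Environment: move within the corridor; reward 1.0 on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def choose_action(q, state, rng):
    """Epsilon-greedy: mostly exploit the best known action, sometimes explore."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state].values())
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # safety cap on steps per episode
            action = choose_action(q, state, rng)
            next_state, reward = step(state, action)
            # Q-learning update: move the estimate toward the reward plus
            # the discounted value of the best action in the next state.
            target = reward + GAMMA * max(q[next_state].values())
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q

q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print("learned action per cell (+1 means move right):", policy)
```

The agent is never told that moving right is correct; it only receives the reward definition and discovers the strategy from the feedback it collects.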

Transfer learning

The reusability of code is one of the fundamental concepts of object-oriented programming (OOP) and it is pretty popular in the software-engineering world. Similarly, transfer learning involves reusing a model built to achieve a specific task to solve another related task.

It is understandable that, to achieve better performance, ML models need to be trained on large amounts of labeled data. Having less data available means less training, and the result is a model with suboptimal performance.

Transfer learning attempts to solve the problems arising from limited data by reusing the knowledge obtained by a different, related model. Having fewer data points available to train a model should not impede building a better model; this is the core concept behind transfer learning. The following diagram is an illustration showing the purpose of transfer learning in an image recognition task that classifies dog and cat images:

In this task, the first few layers of a neural network model detect edges, color blobs, and so on. Only in the later layers (maybe the last few) does the model attempt to identify the facial features of dogs or cats in order to classify an image as one of the targets (a dog or a cat).

It may be observed that the tasks of identifying edges and color blobs are not specific to images of cats and dogs. The ability to detect edges or color blobs can be learned even if a model is trained on images that contain neither dogs nor cats. If this knowledge is combined with knowledge derived from distinguishing cat faces from dog faces, even if such images are few in number, we will have a better model than the suboptimal one obtained by training on the few images alone.

In the case of a dogs-cats classifier, a model is first trained on a large set of images that is not confined to images of cats and dogs. The last few layers of this model are then retrained on the dogs' and cats' faces. The model thus obtained is tested and, once its performance measurements are satisfactory, put to use.
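The essence of this procedure (keep the early layers fixed, retrain only the final layer on the small dataset) can be sketched in Python without any deep learning library. The example is a toy under stated assumptions: the "pretrained" layer is a hand-written fixed linear map with a ReLU, its weights and the two-number "images" are invented, and a simple perceptron plays the role of the retrained last layer.

```python
# Stand-in for early layers learned on a large, generic dataset.
# These weights are invented and are never updated below (frozen).
FROZEN_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]

def extract_features(x):
    """Frozen early layers: fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in FROZEN_WEIGHTS]

def head_output(head, features):
    """Final layer: a linear threshold unit returning class 1 or 0."""
    weights, bias = head
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train_head(samples, epochs=20, learning_rate=0.1):
    """Retrain only the final layer (perceptron rule) on the small dataset."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in samples:
            features = extract_features(x)  # frozen layers are reused as-is
            error = label - head_output((weights, bias), features)
            weights = [w + learning_rate * error * f
                       for w, f in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias

def classify(head, x):
    return head_output(head, extract_features(x))

# Hypothetical small target dataset: 1 = dog, 0 = cat (invented inputs).
small_dataset = [((3, 1), 1), ((1, 3), 0), ((4, 1), 1),
                 ((1, 4), 0), ((3.5, 0.5), 1), ((0.5, 3.5), 0)]

head = train_head(small_dataset)
print("prediction for (5, 1):", classify(head, (5, 1)))
print("prediction for (1, 5):", classify(head, (1, 5)))
```

Only the two weights and the bias of the final layer are learned from the six examples; everything upstream is inherited, which is why so little target data is needed.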

The concept of transfer learning is used not just for image-related tasks. It is also used in natural language processing (NLP), for example to perform sentiment analysis on text data.

Assume a company has launched a new product based on a concept that never existed before (say, a flying car). The task is to analyze the tweets related to the new product and classify each of them as having positive, negative, or neutral sentiment. Prior labeled tweets are unavailable in the flying car domain. In such cases, we can take a model built on the labeled data of generic product reviews across several products and domains. We can reuse the model by supplementing it with flying-car-specific terminology to obtain a new model. This new model is then tested and deployed to analyze sentiment in the tweets about the newly launched flying cars.

It is possible to achieve transfer learning in the following two ways:

  • By reusing one's own model
  • By reusing a pretrained model

Pretrained models are models built by various organizations or individuals as part of their research work or as part of a competition. These models are generally very complex and are trained on large amounts of data. They are also optimized to perform their tasks with high precision. These models may take days or weeks to train on modern hardware. Organizations or individuals often release these models under permissive licenses for reuse. Such pretrained models can be downloaded and reused through the transfer-learning paradigm. This makes effective use of the vast existing knowledge that the pretrained models possess, which would otherwise be hard to attain for an individual with limited hardware resources and limited data for training.

There are several pretrained models made available by various parties. The following are some of the popular ones:

  • Inception-V3 model: This model has been trained on ImageNet as part of a large visual recognition challenge. The competition required the participants to classify a given image into one of 1,000 classes. Some of the classes include the names of animals and object names.
  • MobileNet: This pretrained model has been built by Google and it is meant to perform object detection using the ImageNet database. The architecture is designed for mobile devices.
  • VGG Face: This is a pretrained model built for face recognition.
  • VGG16: This is a pretrained model trained on the MS COCO dataset. This one accomplishes image captioning; that is, given an input image, it generates a caption describing the image's contents.
  • Google's Word2Vec model and Stanford's GloVe model: These pretrained models take text as input and produce word vectors as output. Distributed word vectors offer one form of representing documents for NLP or ML applications.
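To illustrate what working with word vectors looks like, the Python sketch below compares words by the cosine similarity of their vectors. The three-dimensional vectors are invented for the example; vectors from real pretrained models such as the ones listed above typically have hundreds of dimensions and would be loaded from the published files rather than typed in.

```python
import math

# Hypothetical word vectors (real pretrained vectors are far longer).
vectors = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

def most_similar(word):
    """The other word in the vocabulary whose vector is closest to the given word's."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print("closest to 'king':", most_similar("king"))
print("king vs apple:", round(cosine_similarity(vectors["king"], vectors["apple"]), 3))
```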

Now that we have a basic understanding of various possible ML methods, in the next section, we focus on quickly reviewing the key terminology used in ML.