Book Image

R Machine Learning Projects

By : Dr. Sunil Kumar Chinnamgari
Book Image

R Machine Learning Projects

By: Dr. Sunil Kumar Chinnamgari

Overview of this book

R is one of the most popular languages when it comes to performing computational statistics (statistical computing) easily and exploring the mathematical side of machine learning. With this book, you will leverage the R ecosystem to build efficient machine learning applications that carry out intelligent tasks within your organization. This book will help you test your knowledge and skills, guiding you on how to build easily through to complex machine learning projects. You will first learn how to build powerful machine learning models with ensembles to predict employee attrition. Next, you’ll implement a joke recommendation engine and learn how to perform sentiment analysis on Amazon reviews. You’ll also explore different clustering techniques to segment customers using wholesale data. In addition to this, the book will get you acquainted with credit card fraud detection using autoencoders, and reinforcement learning to make predictions and win on a casino slot machine. By the end of the book, you will be equipped to confidently perform complex tasks to build research and commercial projects for automated operations.
Table of Contents (12 chapters)
The Road Ahead

What this book covers

Chapter 1, Exploring the Machine Learning Landscape, will briefly review the various ML concepts that a practitioner must know. In this chapter, we will cover topics such as supervised learning, reinforcement learning, unsupervised learning, and real-world ML uses cases.

Chapter 2, Predicting Employee Attrition Using Ensemble Models, covers the creation of powerful ML models through ensemble learning. The project covered in this chapter is from the human resources domain. Retention of talented employees is a key challenge faced by corporations. If we were able to predict the attrition of an employee well in advance, it is possible that the human resources or management team could do something to save the potential attrition from becoming real. It just so happens that it is possible to predict employee attrition through the application of ML. This chapter makes use of an IBM-curated public dataset that provides a pseudo employee attrition population and characteristics. We start the chapter with an introduction to the problem at hand and then attempt to explore the dataset with exploratory data analysis (EDA). The next step is the preprocessing phase, which includes the creation of new features using prior domain experience. Once the dataset is fully prepared, models will be created using multiple ensemble techniques, such as bagging, boosting , stacking, and randomization. Lastly, we will deploy the finally selected model for production. We will also learn about the concepts underlying the various ensemble techniques used to create the models.

Chapter 3, Implementing a Joke Recommendation Engine, introduces recommendation engines, which are designed to predict the ratings that a user would give to content such as movies and music. Based on what a user has previously liked or seen and using other profiling attributes, a recommendation engine suggests new content that the user might like. Such engines have gained a lot of significance in recent years. We explore the exciting area of recommendation systems by working on a joke recommendation engine project. In this chapter, we start by understanding the concepts and types of collaborative filtering algorithms. We will then build a recommendation engine to provide personalized joke recommendations using collaborative filtering approaches such as user-based collaborative filters and item-based collaborative filters. The dataset used for this project is a open dataset called the Jester jokes dataset. Apart from this, we will be exploring various libraries available in R that can be used to build recommendation systems, and we will be comparing the performances obtained from these approaches. Additionally, we leverage the market basket analysis technique, a pretty popular technique in the marketing domain, to discern relationships between various jokes.

Chapter 4, Sentiment Analysis of Amazon Reviews with NLP, covers sentiment analysis, which entails finding the sentiment of a sentence and labeling it as positive, negative, or neutral. This chapter introduces sentiment analysis and covers the various techniques that can be used to analyze text. We will understand text-mining concepts and the various ways that text is labeled based on the tone.

We will apply sentiment analysis to Amazon product review data. This dataset contains millions of Amazon customer reviews and star ratings. It is a classification task where we will be categorizing each review as positive, negative, or neutral depending on the tone. Apart from using various popular R text-mining libraries to preprocess the reviews to be classified, we will also be leveraging a wide range of text representations, such as bag of words, word2vec, fastText, and Glove. Each of the text representations is then used as input for ML algorithms to perform classification. In the course of implementing each of these techniques, we will also learn about the concepts behind these techniques and also explore other instances where we could successfully apply them.

Chapter 5, Customer Segmentation Using Wholesale Data, covers the segmentation, grouping, or clustering of customers, which can be achieved through unsupervised learning. We explore the various aspects of customer grouping in this chapter. Customer segmentation is an important tool used by product sellers to understand their customers and gather information. Customers can be segmented based on different criteria, such as age and spending patterns. In this chapter, we learn the various techniques of customer segmentation. For the project, we use a dataset containing wholesale transactions. This dataset is available in the UCI Machine Learning Repository. We will be applying advanced clustering techniques, such as k-means, DIANA, and AGNES. At times, we will not know the number of groups that exist in the dataset at hand. We will explore the ML techniques for dealing with such ambiguity and have ML find out the number of groups possible based on the underlying characteristics of the input data. Evaluating the output of the clustering algorithms is an area that is often challenging to practitioners. We also explore this area so as to have a well-rounded understanding of applying clustering algorithms to real-world problems.

Chapter 6, Image Recognition Using Deep Neural Networks, covers convolutional neural networks (CNNs), which are a type of deep neural network and are popular in computer vision applications. In this chapter, we learn about the fundamental concepts underlying CNNs. We explore why CNNs work so well with computer vision problems such as object detection. We discuss the aspects of transfer learning and how it works in tandem with CNNs to solve computer vision problems. As elsewhere in the book, we'll be going by the philosophy of learning by doing. We will learn about all of these concepts by applying a CNN in the building of a multi-class classification model on a popular open dataset called MNIST. The objective of the project is to classify given images of handwritten digits. The project explores the methodology for creating features from raw images. We will learn about the various preprocessing techniques that can be applied to the image data in order use the data with deep learning models.

Chapter 7, Credit Card Fraud Detection Using Autoencoders, covers autoencoders, which are yet another type of unsupervised deep learning network. We start the chapter by understanding autoencoders and how they are different from the other deep learning networks, such as recurrent neural networks (RNNs)and CNNs. We will learn about autoencoders by implementing a project that identifies credit card fraud. Credit card companies are constantly seeking ways to detect credit card fraud. Fraud detection is a key aspect for banks to protect their revenues. It can be achieved through the application of ML in the finance domain for the specific fraud detection problem. A fraud is usually an anomalous event that requires immediate action. In this chapter, we will use an autoencoder to detect fraud. Autoencoders are neural networks that contain a bottleneck layer whose dimensionality is smaller than the input data. In this chapter, we will become familiar with dimensionality reduction and how it can be used to identify credit card fraud detection. For the project, we will be using the H2O deep learning framework in tandem with R. As far as the dataset is concerned, we use an open dataset that contains credit card transactions of European card holders from September 2013. There are a total of 284,807 transactions, out of which 492 are fraudulent.

Chapter 8, Automatic Prose Generation with Recurrent Neural Networks, introduces some deep neural networks (DNNs) that have recently received a lot of attention. This is due to their success in obtaining great results in various areas of ML, from face recognition and object detection to music generation and neural art. This chapter introduces the concepts necessary for understanding deep learning. We discuss the nuts and bolts of neural networks, such as neurons, hidden layers, various activation functions, techniques for dealing with problems faced in neural networks, and using optimization algorithms to get weights in neural networks. We will also implement a neural network from scratch to demonstrate these concepts. The content of this chapter will help us get foundational knowledge on neural networks. Then, we will learn how to apply an RNN by doing a project. It has always been thought that creative tasks such as authoring stories, writing poems, and painting pictures can only be achieved by humans. This is no longer true, thanks to deep learning! Technology can now accomplish creative tasks. We will create an application based on long short-term memory (LSTM) network, a variant of RNNs that generates text automatically. To accomplish this task, we make use of the MXNet framework, which extends its support for the R language to perform deep learning. In the course of implementing this project, we will also learn more about the concepts surrounding RNNs and LSTMs.

Chapter 9, Winning the Casino Slot Machines with Reinforcement Learning, begins with an explanation of RL. We discuss the various concepts of RL, including strategies for solving what is called as the multi-arm bandit problem. We implement a project that uses UCB and Thompson sampling techniques in order to solve the multi-arm bandit problem.

Appendix, The Road Ahead, briefly discuss the advancements in the ML world and the need to stay on top of them.