Machine Learning in Java - Second Edition

By: AshishSingh Bhatia, Bostjan Kaluza

Overview of this book

As the amount of data in the world continues to grow at an almost incomprehensible rate, being able to understand and process data is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, spam detection, document search, and trading strategies, to speech recognition. This makes machine learning well-suited to the present-day era of big data and Data Science. The main challenge is how to transform data into actionable knowledge. Machine Learning in Java will provide you with the techniques and tools you need. You will start by learning how to apply machine learning methods to a variety of common tasks including classification, prediction, forecasting, market basket analysis, and clustering. The code in this book works on JDK 8 and above and has been tested on JDK 11. Moving on, you will discover how to detect anomalies and fraud, and ways to perform activity recognition, image recognition, and text analysis. By the end of the book, you will have explored related web resources and technologies that will help you take your learning to the next level. By applying the most effective machine learning methods to real-world problems, you will gain hands-on experience that will transform the way you think about data.

Generalization and evaluation

Once the model is built, how do we know it will perform well on new data? Is this model any good? To answer these questions, we'll first look into model generalization, and then see how to estimate the model's performance on new data.

Underfitting and overfitting

Predictor training can lead to models that are too complex or too simple. A model with low complexity (the leftmost models in the following diagram) can be as simple as predicting the most frequent class or the mean value, while a model with high complexity (the rightmost models) can memorize the individual training instances. Models that are too rigid, shown on the left-hand side, cannot capture complex patterns, while models that are too flexible, shown on the right-hand side, fit the noise in the training data. The main challenge is to select the appropriate learning algorithm and its parameters so that the learned model will perform well on new data (for example, the middle column).

The following diagram shows how the error on the training set decreases with model complexity. Simple, rigid models underfit the data and have large errors. As model complexity increases, the model describes the underlying structure of the training data better and, consequently, the error decreases. If the model is too complex, it overfits the training data and its prediction error on new data increases again.

Depending on the task complexity and data availability, we want to tune our classifiers toward more or less complex structures. Most learning algorithms allow such tuning, as follows:

Regression: The order of the polynomial
Naive Bayes: The number of attributes
Decision trees: The number of nodes in the tree and the pruning confidence
K-nearest neighbors: The number of neighbors and the distance-based neighbor weights
SVM: The kernel type and the cost parameter
Neural network: The number of neurons and hidden layers

With tuning, we want to minimize the generalization error, that is, the error the classifier makes on future data. Unfortunately, we can never compute the true generalization error; we can only estimate it. One warning sign is easy to spot, though: if the model performs well on the training data but much worse on the test data, then it most likely overfits.
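The trade-off can be illustrated with a toy one-dimensional k-nearest-neighbor regressor, where the number of neighbors k is the complexity knob. This is only a sketch under stated assumptions: the class name `KnnComplexityDemo`, its methods, and the data points are hypothetical and not taken from the book. With k = 1 the model memorizes the training set (zero training error), while with k equal to the training set size it always predicts the overall mean.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy 1-D k-nearest-neighbor regressor used to illustrate model complexity. */
final class KnnComplexityDemo {

    /** Predicts the mean target of the k training points closest to x. */
    static double predict(double[] trainX, double[] trainY, int k, double x) {
        Integer[] order = new Integer[trainX.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort training indices by their distance to the query point.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> Math.abs(trainX[i] - x)));
        double sum = 0.0;
        for (int i = 0; i < k; i++) {
            sum += trainY[order[i]];
        }
        return sum / k;
    }

    /** Mean squared error of the k-NN model on the given evaluation points. */
    static double meanSquaredError(double[] trainX, double[] trainY, int k,
                                   double[] evalX, double[] evalY) {
        double total = 0.0;
        for (int i = 0; i < evalX.length; i++) {
            double diff = predict(trainX, trainY, k, evalX[i]) - evalY[i];
            total += diff * diff;
        }
        return total / evalX.length;
    }

    public static void main(String[] args) {
        // Hypothetical data: roughly y = x plus some noise.
        double[] trainX = {1, 2, 3, 4, 5, 6};
        double[] trainY = {1.2, 1.7, 3.4, 3.8, 5.3, 5.9};
        double[] testX = {1.5, 3.5, 5.5};
        double[] testY = {1.5, 3.5, 5.5};
        // Small k = flexible model, large k = rigid model.
        for (int k : new int[] {1, 2, 6}) {
            System.out.printf("k=%d train MSE=%.3f test MSE=%.3f%n", k,
                    meanSquaredError(trainX, trainY, k, trainX, trainY),
                    meanSquaredError(trainX, trainY, k, testX, testY));
        }
    }
}
```

Comparing the training and test columns for each k shows the pattern described above: a large gap between a low training error and a higher test error hints at overfitting, while high errors on both hint at underfitting.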

Train and test sets

To estimate the generalization error, we split our data into two parts: training data and testing data. A general rule of thumb is to use a training-to-testing ratio of 70:30. We first train the predictor on the training data, then predict the values for the test data, and finally compute the error, that is, the difference between the predicted and the true values. This gives us an estimate of the true generalization error.

The estimate rests on two assumptions: first, that the test set is an unbiased sample from our dataset; and second, that the actual new data will resemble the distribution of our training and testing examples. Violations of the first assumption can be mitigated by cross-validation and stratification. Also, if data is scarce, we can't afford to leave out a considerable amount of it for a separate test set, as learning algorithms do not perform well if they don't receive enough data. In such cases, cross-validation is used instead.
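A minimal sketch of such a split in plain Java follows. The class `TrainTestSplit`, its method names, and the seed value are assumptions made for illustration; a real project would more likely rely on the splitting utilities of its machine learning library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of a shuffled hold-out split; class and method names are illustrative. */
final class TrainTestSplit {

    /** Shuffles indices 0..n-1 and returns [trainIndices, testIndices]. */
    static List<List<Integer>> split(int n, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // Shuffling keeps the test set from depending on the original row order.
        Collections.shuffle(indices, new Random(seed));
        int trainSize = (int) Math.round(n * trainFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, trainSize)));
        parts.add(new ArrayList<>(indices.subList(trainSize, n)));
        return parts;
    }

    /** Mean absolute difference between predicted and true values. */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / actual.length;
    }

    public static void main(String[] args) {
        // 70:30 split of ten instances with a fixed (arbitrary) seed.
        List<List<Integer>> parts = split(10, 0.7, 42L);
        System.out.println("train indices: " + parts.get(0));
        System.out.println("test indices:  " + parts.get(1));
    }
}
```

The predictor is fitted only on the instances listed in the first part; `meanAbsoluteError` is then applied to its predictions for the second part to obtain the error estimate.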

Cross-validation

Cross-validation splits the dataset into k sets of approximately the same size, for example, into five sets in the following diagram. First, we use sets 2 to 5 for learning and set 1 for testing. We then repeat the procedure until it has run five times, leaving out a different set for testing each time, and average the error over the five repetitions.

This way, we use all of the data for both learning and testing, while never using the same data to train and test a single model.
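The procedure can be sketched as follows. The class `KFoldCrossValidation` and its methods are hypothetical, and the "model" is deliberately trivial (it predicts the mean of the training targets) so that the fold logic stays visible. The round-robin assignment assumes that the instance order carries no information; otherwise, the data should be shuffled first.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of k-fold cross-validation with a trivial mean predictor. */
final class KFoldCrossValidation {

    /** Assigns indices 0..n-1 to k folds of nearly equal size (round-robin). */
    static List<List<Integer>> folds(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("k must be between 2 and n");
        }
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(i);
        }
        return folds;
    }

    /** Cross-validated MSE of a predictor that outputs the training mean. */
    static double crossValidatedMse(double[] y, int k) {
        double errorSum = 0.0;
        for (List<Integer> testFold : folds(y.length, k)) {
            boolean[] inTest = new boolean[y.length];
            for (int i : testFold) {
                inTest[i] = true;
            }
            // "Train" on every instance outside the current test fold.
            double trainSum = 0.0;
            int trainCount = 0;
            for (int i = 0; i < y.length; i++) {
                if (!inTest[i]) {
                    trainSum += y[i];
                    trainCount++;
                }
            }
            double mean = trainSum / trainCount;
            // Evaluate on the held-out fold only.
            double foldError = 0.0;
            for (int i : testFold) {
                foldError += (y[i] - mean) * (y[i] - mean);
            }
            errorSum += foldError / testFold.size();
        }
        // Average the per-fold errors over the k repetitions.
        return errorSum / k;
    }

    public static void main(String[] args) {
        double[] y = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println("5-fold MSE estimate: " + crossValidatedMse(y, 5));
    }
}
```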

Leave-one-out validation

An extreme case of cross-validation is leave-one-out validation. In this case, the number of folds is equal to the number of instances; we learn on all but one instance, and then test the model on the omitted instance. We repeat this for all instances, so that each instance is used exactly once for validation. This approach is recommended when we have a limited set of learning examples, for example, fewer than 50.
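As a small sketch, leave-one-out validation for the same kind of trivial mean predictor can be written as below. The class `LeaveOneOut` and its method are illustrative names, not part of any library.

```java
/** Sketch of leave-one-out validation with a trivial mean predictor. */
final class LeaveOneOut {

    /** Leave-one-out MSE of a predictor that outputs the mean of the training targets. */
    static double leaveOneOutMse(double[] y) {
        int n = y.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two instances");
        }
        double total = 0.0;
        for (double value : y) {
            total += value;
        }
        double errorSum = 0.0;
        for (int i = 0; i < n; i++) {
            // Train on all instances except i: the mean of the remaining n - 1 targets.
            double meanWithoutI = (total - y[i]) / (n - 1);
            // Test on the single omitted instance.
            double diff = y[i] - meanWithoutI;
            errorSum += diff * diff;
        }
        return errorSum / n;
    }

    public static void main(String[] args) {
        System.out.println(leaveOneOutMse(new double[] {1, 2, 3}));
    }
}
```

For the targets 1, 2, and 3, the omitted instances are predicted as 2.5, 2, and 1.5, giving squared errors of 2.25, 0, and 2.25 and an average of 1.5. Note that a real learner has to be retrained once per instance, which is why this approach is mainly practical for small datasets.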

Stratification

Stratification is a procedure to select a subset of instances in such a way that each fold contains roughly the same proportion of class values. When the class value is continuous, the folds are selected so that the mean response value is approximately equal in all of the folds. Stratification can be applied along with cross-validation or with separate training and test sets.
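For a nominal class, one simple way to build stratified folds is to group the instances by class and deal them out to the folds in turn. The sketch below assumes string class labels; `StratifiedFolds` is a hypothetical helper, and shuffling within each class is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of stratified fold assignment for nominal class labels. */
final class StratifiedFolds {

    /** Returns k folds of instance indices with similar class proportions. */
    static List<List<Integer>> folds(String[] labels, int k) {
        // Group instance indices by class label, keeping first-seen order.
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], key -> new ArrayList<>()).add(i);
        }
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal instances out class by class; the counter continues across
        // classes so that the overall fold sizes also stay balanced.
        int next = 0;
        for (List<Integer> members : byClass.values()) {
            for (int index : members) {
                folds.get(next % k).add(index);
                next++;
            }
        }
        return folds;
    }

    public static void main(String[] args) {
        String[] labels = {"A", "A", "A", "A", "A", "A", "B", "B", "B", "B"};
        System.out.println(folds(labels, 2));
    }
}
```

With six instances of class A and four of class B split into two folds, each fold receives three A instances and two B instances, preserving the 60:40 class ratio of the full dataset.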
