Machine Learning in Java - Second Edition

By: AshishSingh Bhatia, Bostjan Kaluza

Overview of this book

As the amount of data in the world continues to grow at an almost incomprehensible rate, being able to understand and process data is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, spam detection, document search, and trading strategies, to speech recognition. This makes machine learning well-suited to the present-day era of big data and data science. The main challenge is how to transform data into actionable knowledge. Machine Learning in Java will provide you with the techniques and tools you need. You will start by learning how to apply machine learning methods to a variety of common tasks including classification, prediction, forecasting, market basket analysis, and clustering. The code in this book works with JDK 8 and above, and has been tested on JDK 11. Moving on, you will discover how to detect anomalies and fraud, and ways to perform activity recognition, image recognition, and text analysis. By the end of the book, you will have explored related web resources and technologies that will help you take your learning to the next level. By applying the most effective machine learning methods to real-world problems, you will gain hands-on experience that will transform the way you think about data.
Table of Contents (13 chapters)

Supervised learning

Supervised learning is the key concept behind such amazing things as voice recognition, email spam filtering, face recognition in photos, and credit card fraud detection. More formally, given a set, D, of learning examples described with features, X, the goal of supervised learning is to find a function that predicts a target variable, Y. The function, f, that describes the relation between features X and class Y is called a model:

$$f(X) \rightarrow Y$$

The general structure of supervised learning algorithms is defined by the following decisions (Hand et al., 2001):

  1. Define the task
  2. Decide on the machine learning algorithm, which introduces a specific inductive bias; that is, the a priori assumptions that it makes regarding the target concept
  3. Decide on the score or cost function, for instance, information gain, root mean square error, and so on
  4. Decide on the optimization/search method to optimize the score function
  5. Find a function that describes the relation between X and Y

Many decisions are already made for us by the type of the task and dataset that we have. In the following sections, we will take a closer look at the classification and regression methods and the corresponding score functions.

Classification

Classification can be applied when we deal with a discrete class, where the goal is to predict one of the mutually exclusive values in the target variable. An example would be credit scoring, where the final prediction is whether the person is creditworthy or not. The most popular algorithms include decision trees, Naive Bayes classifiers, SVMs, neural networks, and ensemble methods.

Decision tree learning

Decision tree learning builds a classification tree, where each inner node corresponds to one of the attributes; each edge corresponds to a possible value (or interval) of the attribute from which the node originates; and each leaf corresponds to a class label. A decision tree can be used to visually and explicitly represent the prediction model, which makes it a very transparent (white box) classifier. Notable algorithms are ID3 and C4.5, although many alternative implementations and improvements exist (for example, J48 in Weka).
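To make the node, edge, and leaf structure concrete, the following is a minimal sketch of a hand-built tree, not a learning algorithm such as ID3 or C4.5. The class name DecisionNode, its methods, and the toy weather attributes are all hypothetical and chosen only for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal decision tree: an inner node tests one attribute,
// each edge is one attribute value, and each leaf holds a class label.
class DecisionNode {
    final String attribute;   // attribute tested here; null for a leaf
    final String label;       // class label; null for an inner node
    final Map<String, DecisionNode> children = new HashMap<>();

    private DecisionNode(String attribute, String label) {
        this.attribute = attribute;
        this.label = label;
    }

    static DecisionNode leaf(String label) { return new DecisionNode(null, label); }
    static DecisionNode split(String attribute) { return new DecisionNode(attribute, null); }

    // Adds an edge for one attribute value and returns this node for chaining.
    DecisionNode on(String value, DecisionNode child) {
        children.put(value, child);
        return this;
    }

    // Follows the edges that match the instance's attribute values until a leaf is reached.
    String classify(Map<String, String> instance) {
        DecisionNode node = this;
        while (node.label == null) {
            DecisionNode next = node.children.get(instance.get(node.attribute));
            if (next == null) {
                throw new IllegalArgumentException("No edge for attribute " + node.attribute);
            }
            node = next;
        }
        return node.label;
    }

    // A toy tree written by hand (assumed data, not learned from examples).
    static DecisionNode example() {
        return split("outlook")
            .on("sunny", split("humidity").on("high", leaf("no")).on("normal", leaf("yes")))
            .on("overcast", leaf("yes"))
            .on("rainy", split("windy").on("true", leaf("no")).on("false", leaf("yes")));
    }
}
```

Because every prediction is just a path from the root to a leaf, the path itself can be read as the explanation of the decision, which is what makes the model a white box.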

Probabilistic classifiers

Given a set of attribute values, a probabilistic classifier is able to predict a distribution over a set of classes, rather than an exact class. This can be used as a degree of certainty; that is, how sure the classifier is about its prediction. The most basic classifier is Naive Bayes, which is the optimal classifier when the attributes are conditionally independent given the class. Unfortunately, this is extremely rare in practice, although Naive Bayes often performs reasonably well even when the assumption is violated.
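As a rough sketch of what predicting a distribution means, the following code computes a Naive Bayes posterior from class priors and per-attribute likelihoods that are assumed to be already estimated. The class name NaiveBayesSketch and all numbers in the usage note are hypothetical:

```java
// Minimal Naive Bayes posterior, assuming the priors P(class) and the
// likelihoods P(attribute value | class) have already been estimated.
class NaiveBayesSketch {
    // likelihood[c][j] = P(observed value of attribute j | class c)
    static double[] posterior(double[] prior, double[][] likelihood) {
        double[] p = new double[prior.length];
        double sum = 0.0;
        for (int c = 0; c < prior.length; c++) {
            p[c] = prior[c];
            // Conditional independence: multiply the per-attribute likelihoods.
            for (double l : likelihood[c]) {
                p[c] *= l;
            }
            sum += p[c];
        }
        // Normalize so that the class probabilities sum to 1.
        for (int c = 0; c < p.length; c++) {
            p[c] /= sum;
        }
        return p;
    }
}
```

For example, with priors {0.5, 0.5} and likelihoods {{0.8, 0.5}, {0.2, 0.5}}, the unnormalized scores are 0.2 and 0.05, so the returned distribution is {0.8, 0.2}; the classifier favors the first class and also reports how sure it is.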

There is an enormous subfield denoted as probabilistic graphical models, comprising hundreds of algorithms (for example, Bayesian networks, dynamic Bayesian networks, hidden Markov models, and conditional random fields) that can handle not only specific relationships between attributes, but also temporal dependencies. Kiran R Karkera wrote an excellent introductory book on this topic, Building Probabilistic Graphical Models with Python, Packt Publishing (2014), while Koller and Friedman published a comprehensive theory bible, Probabilistic Graphical Models, MIT Press (2009).

Kernel methods

Many linear models can be turned into non-linear models by applying the kernel trick; that is, by replacing the dot products between feature vectors (predictors) with a kernel function. In other words, the kernel implicitly transforms our dataset into a higher-dimensional space without ever computing the transformed features explicitly. The kernel trick leverages the fact that it is often easier to separate the instances in more dimensions. Algorithms capable of operating with kernels include the kernel perceptron, SVMs, Gaussian processes, PCA, canonical correlation analysis, ridge regression, spectral clustering, linear adaptive filters, and many others.
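The implicit transformation can be checked on a small case. The sketch below uses the degree-2 polynomial kernel k(x, z) = (x · z)^2 on two-dimensional inputs, which is one common kernel chosen here as an assumption; the class and method names are hypothetical. It shows that the kernel value equals the dot product in an explicit three-dimensional feature space:

```java
// Degree-2 polynomial kernel on 2D inputs versus its explicit feature map.
class KernelTrickSketch {
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    // Explicit map into three dimensions: (x1^2, sqrt(2)*x1*x2, x2^2).
    static double[] phi(double[] x) {
        return new double[] { x[0] * x[0], Math.sqrt(2) * x[0] * x[1], x[1] * x[1] };
    }

    // Kernel computed in the original two dimensions: (x . z)^2.
    static double kernel(double[] x, double[] z) {
        double d = dot(x, z);
        return d * d;
    }
}
```

For x = (1, 2) and z = (3, 4), both kernel(x, z) and dot(phi(x), phi(z)) evaluate to 121 (up to floating-point rounding), so an algorithm that only needs dot products can work in the higher-dimensional space while computing in the original one.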

Artificial neural networks

Artificial neural networks are inspired by the structure of biological neural networks and are used for machine learning as well as pattern recognition. They are commonly applied to both regression and classification problems, and comprise a wide variety of algorithms and variations for all manner of problem types. Some popular classification methods are the perceptron, restricted Boltzmann machines (RBMs), and deep belief networks.

Ensemble learning

Ensemble methods combine a set of diverse, weaker models to obtain better predictive performance. The individual models are trained separately and their predictions are then combined in some way to make the overall prediction. Ensembles, hence, contain multiple ways of modeling the data, which hopefully leads to better results. This is a very powerful class of techniques, and as such, it is very popular. This class includes bagging, boosting (for example, AdaBoost), and random forests. The main differences among them are the type of weak learners that are to be combined and the ways in which to combine them.
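One simple way to combine predictions is majority voting. The following sketch assumes each already-trained model is available as a function from an instance to a label; the class name MajorityVote is hypothetical, and real ensembles such as AdaBoost additionally weight the votes:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Combines the predictions of several models by unweighted majority vote.
class MajorityVote {
    static <X> String predict(List<Function<X, String>> models, X instance) {
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (Function<X, String> model : models) {
            String label = model.apply(instance);
            int count = votes.merge(label, 1, Integer::sum);
            // Keep the label with the most votes so far; on a tie the
            // label that reached the count first is kept.
            if (best == null || count > votes.get(best)) {
                best = label;
            }
        }
        return best; // null if the list of models is empty
    }
}
```

With three threshold models that vote yes when the input exceeds 1, 2, and 3 respectively, the input 2.5 receives two yes votes and one no vote, so the ensemble predicts yes.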

Evaluating classification

Is our classifier doing well? Is it better than the other one? In classification, we count how many times we classify something right and wrong. Suppose there are two possible classification labels, yes and no; then there are four possible outcomes, as shown in the following table:

Really positive? | Predicted as positive: Yes | Predicted as positive: No
Yes | TP (true positive) | FN (false negative)
No | FP (false positive) | TN (true negative)

The four outcomes are as follows:

  • True positive (hit): This indicates a yes instance correctly predicted as yes
  • True negative (correct rejection): This indicates a no instance correctly predicted as no
  • False positive (false alarm): This indicates a no instance predicted as yes
  • False negative (miss): This indicates a yes instance predicted as no
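Counting the four outcomes is straightforward. The following sketch assumes the true and predicted labels are given as boolean arrays, where true stands for yes; the class name ConfusionCounts is hypothetical:

```java
// Counts the four outcomes for binary labels (true = yes, false = no).
class ConfusionCounts {
    int tp, fn, fp, tn;

    static ConfusionCounts of(boolean[] actual, boolean[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        ConfusionCounts c = new ConfusionCounts();
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] && predicted[i]) {
                c.tp++;            // yes predicted as yes (hit)
            } else if (actual[i]) {
                c.fn++;            // yes predicted as no (miss)
            } else if (predicted[i]) {
                c.fp++;            // no predicted as yes (false alarm)
            } else {
                c.tn++;            // no predicted as no (correct rejection)
            }
        }
        return c;
    }
}
```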

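The two basic performance measures of a classifier are, firstly, classification error, the fraction of instances that are classified incorrectly:

$$\text{error} = \frac{FP + FN}{TP + TN + FP + FN}$$

And, secondly, classification accuracy, the fraction of instances that are classified correctly:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = 1 - \text{error}$$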

The main problem with these two measures is that they are misleading when the classes are unbalanced. Classifying whether a credit card transaction is an abuse or not is an example of a problem with unbalanced classes: 99.99% of the transactions are normal and just a tiny percentage are abuses. The classifier that says that every transaction is a normal one is 99.99% accurate, but we are mainly interested in those few classifications that occur very rarely.
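The effect is easy to reproduce. The sketch below takes accuracy as the fraction of correctly classified instances and error as the fraction of misclassified ones; the class name AccuracySketch and the transaction counts in the usage note are hypothetical:

```java
// Accuracy and error computed from the four outcome counts.
class AccuracySketch {
    static double accuracy(long tp, long tn, long fp, long fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }

    static double error(long tp, long tn, long fp, long fn) {
        return (double) (fp + fn) / (tp + tn + fp + fn);
    }
}
```

Consider 10,000 transactions, of which 9,999 are normal and 1 is an abuse, and a classifier that labels everything as normal. It produces TP = 0, TN = 9,999, FP = 0, and FN = 1, so accuracy(0, 9999, 0, 1) is 0.9999 even though the single abuse, the only case we care about, is missed.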

Precision and recall

The solution is to use measures that don't involve true negatives. Two such measures are as follows:

  • Precision: This is the proportion of positive examples correctly predicted as positive (TP) out of all examples predicted as positive (TP + FP):

    $$\text{precision} = \frac{TP}{TP + FP}$$

  • Recall: This is the proportion of positive examples correctly predicted as positive (TP) out of all positive examples (TP + FN):

    $$\text{recall} = \frac{TP}{TP + FN}$$

It is common to combine the two and report the F-measure, which considers both precision and recall and calculates the score as their harmonic mean, where the score reaches its best value at 1 and worst at 0, as follows:

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
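These measures can be sketched directly from the outcome counts. The F-measure below is the balanced (F1) form, the harmonic mean of precision and recall, which is assumed here as the intended variant; returning 0 when a denominator is zero is also just a convention chosen for this sketch, and the class name is hypothetical:

```java
// Precision, recall, and the balanced F-measure from outcome counts.
class PrecisionRecallSketch {
    static double precision(long tp, long fp) {
        // Convention for this sketch: 0 when nothing was predicted as positive.
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    static double recall(long tp, long fn) {
        // Convention for this sketch: 0 when there are no positive examples.
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

For TP = 8, FP = 2, and FN = 8, precision is 0.8 and recall is 0.5, which gives an F-measure of about 0.615. The always-normal classifier from the credit card example has TP = 0, so its recall and F-measure are both 0 despite its high accuracy.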

ROC curves

Most classification algorithms return a classification confidence denoted as f(X), which is, in turn, used to calculate the prediction. Following the credit card abuse example, a rule might look similar to the following:

$$\text{prediction}(X) = \begin{cases} \text{abuse} & \text{if } f(X) > \text{threshold} \\ \text{not abuse} & \text{otherwise} \end{cases}$$

The threshold determines the trade-off between the false positive rate and the true positive rate. The outcomes of all the possible threshold values can be plotted as a receiver operating characteristic (ROC) curve, as shown in the following diagram:

A random predictor is plotted with a red dashed line and a perfect predictor is plotted with a green dashed line. To decide whether classifier A is better than classifier C, we compare their areas under the curve (AUC); the larger the area, the better the classifier.
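The following sketch shows how a single threshold yields one point on the ROC curve and how the area under the curve can be computed. It assumes the rule "predict positive if the confidence is greater than the threshold" and at least one positive and one negative example. The AUC is computed through the equivalent pairwise formulation, that is, the fraction of positive and negative pairs in which the positive example receives the higher confidence, with ties counted as one half. The class name RocSketch and the scores in the usage note are hypothetical:

```java
// One ROC point per threshold, and the area under the ROC curve.
class RocSketch {
    // Returns {falsePositiveRate, truePositiveRate} for the given threshold.
    static double[] rates(double[] scores, boolean[] labels, double threshold) {
        int tp = 0, fp = 0, positives = 0, negatives = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean predictedPositive = scores[i] > threshold;
            if (labels[i]) {
                positives++;
                if (predictedPositive) tp++;
            } else {
                negatives++;
                if (predictedPositive) fp++;
            }
        }
        return new double[] { (double) fp / negatives, (double) tp / positives };
    }

    // AUC as the fraction of (positive, negative) pairs ranked correctly.
    static double auc(double[] scores, boolean[] labels) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!labels[i]) continue;
            for (int j = 0; j < scores.length; j++) {
                if (labels[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return wins / pairs;
    }
}
```

For the scores {0.9, 0.8, 0.7, 0.6, 0.55, 0.4} with labels {yes, yes, no, yes, no, no}, a threshold of 0.65 gives a false positive rate of 1/3 and a true positive rate of 2/3, and the AUC is 8/9. A random predictor would score about 0.5 and a perfect one exactly 1.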

Most of the toolboxes provide all of the previous measures out of the box.

Regression

Regression deals with a continuous target variable, unlike classification, which works with a discrete target variable. For example, in order to forecast the outside temperature of the following few days, we would use regression, while classification would be used to predict whether it will rain or not. Generally speaking, regression is a process that estimates the relationship between the features and the target variable; that is, how varying a feature changes the target variable.

Linear regression

The most basic regression model assumes a linear dependency between the features and the target variable. The model is often fitted using the least squares approach; that is, the best model minimizes the sum of the squared errors. In many cases, linear regression is not able to model complex relations; for example, the following diagram shows four different sets of points having the same linear regression line. The upper-left model captures the general trend and can be considered a proper model, whereas the bottom-left model fits the points much better (except for one outlier, which should be carefully checked), and the upper-right and lower-right linear models completely miss the underlying structure of the data and cannot be considered proper models:
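For a single feature, the least squares line has a closed-form solution. The following is a minimal sketch under that single-feature assumption; the class name SimpleLinearRegression is hypothetical:

```java
// Ordinary least squares for one feature: y is approximated by intercept + slope * x.
class SimpleLinearRegression {
    // Returns {intercept, slope}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        // The slope that minimizes the sum of squared errors.
        double slope = sxy / sxx;
        return new double[] { meanY - slope * meanX, slope };
    }
}
```

For x = {1, 2, 3, 4} and y = {3, 5, 7, 9}, the fit returns an intercept of 1 and a slope of 2. Note that the fitted line alone says nothing about whether a line is an appropriate model, which is exactly the point of the four datasets in the diagram.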

Logistic regression

Linear regression works when the dependent variable is continuous. If, however, the dependent variable is binary in nature, that is, 0 or 1, success or failure, yes or no, true or false, survived or died, and so on, then logistic regression is used instead. One such example is a clinical trial of drugs where the subject under study either responds to the drugs or does not respond. It is also used in fraud detection, where a transaction is either a fraud or not a fraud. Normally, a logistic function is used to model the relationship between the dependent and independent variables. The dependent variable is treated as following a Bernoulli distribution, and the logistic function, when plotted, looks like an S-shaped curve.
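The S-shaped curve can be sketched as follows, assuming the standard logistic function 1 / (1 + e^(-z)) applied to a linear combination of the features. The class name LogisticSketch is hypothetical, and the coefficients are assumed to come from a fitting procedure that is not shown:

```java
// The logistic function and a prediction from already-fitted coefficients.
class LogisticSketch {
    // Maps any real number to the interval (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability of the positive outcome for one instance.
    static double probability(double intercept, double[] weights, double[] features) {
        double z = intercept;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return sigmoid(z);
    }
}
```

The function returns exactly 0.5 at z = 0, approaches 1 for large positive z and 0 for large negative z, so the output can be read as the probability of the positive outcome and thresholded to obtain a binary prediction.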

Evaluating regression

In regression, we predict numbers, Y, from input, X, and the predictions are usually wrong or not exact. The main question that we have to ask is: by how much? In other words, we want to measure the distance between the predicted and true values.

Mean squared error

Mean squared error (MSE) is an average of the squared difference between the predicted and true values, as follows:

$$MSE(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left( f(X_i) - Y_i \right)^2$$

The measure is very sensitive to outliers; for example, out of 100 predictions, 99 exact predictions and 1 prediction off by 10 are scored the same as all 100 predictions being wrong by 1. Moreover, the measure is sensitive to the mean. Therefore, a relative squared error that compares the MSE of our predictor to the MSE of the mean predictor (which always predicts the mean value) is often used instead.
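Both measures can be sketched in a few lines. The relative squared error below is taken as the ratio of our predictor's summed squared error to that of the mean predictor, which is an assumption about the exact form intended; the class name SquaredErrorSketch is hypothetical:

```java
// Mean squared error and relative squared error.
class SquaredErrorSketch {
    static double mse(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double d = predicted[i] - actual[i];
            sum += d * d;
        }
        return sum / actual.length;
    }

    // Squared error of our predictor relative to always predicting the mean.
    static double relativeSquaredError(double[] predicted, double[] actual) {
        double mean = 0.0;
        for (double a : actual) {
            mean += a;
        }
        mean /= actual.length;

        double ours = 0.0, baseline = 0.0;
        for (int i = 0; i < actual.length; i++) {
            ours += (predicted[i] - actual[i]) * (predicted[i] - actual[i]);
            baseline += (mean - actual[i]) * (mean - actual[i]);
        }
        return ours / baseline;
    }
}
```

With 100 true values, a prediction vector that is exact except for one value off by 10 and a prediction vector that is off by 1 everywhere both give an MSE of 1.0. A relative squared error below 1 means the predictor does better than always predicting the mean.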

Mean absolute error

Mean absolute error (MAE) is an average of the absolute difference between the predicted and the true values, as follows:

$$MAE(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \left| f(X_i) - Y_i \right|$$

The MAE is less sensitive to outliers, but it is also sensitive to the mean and scale.
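A minimal sketch of the mean absolute error follows; the class name AbsoluteErrorSketch is hypothetical:

```java
// Mean absolute error between predictions and true values.
class AbsoluteErrorSketch {
    static double mae(double[] predicted, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / actual.length;
    }
}
```

Taking 100 predictions again, 99 exact predictions and 1 prediction off by 10 give a mean absolute error of 0.1, whereas all predictions being off by 1 give 1.0. The single outlier no longer dominates the score the way it does with squared errors.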

Correlation coefficient

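The correlation coefficient (CC) measures how the predictions deviate from their mean together with how the true values deviate from theirs. It ranges from -1 to 1: a negative number means a negative (inverse) correlation, a positive number means a positive correlation, and zero means no linear correlation; the closer the magnitude is to 1, the stronger the correlation. The correlation between true values X and predictions Y is defined as follows:

$$CC_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$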

The CC measure is completely insensitive to the mean and scale and less sensitive to the outliers. It is able to capture the relative ordering, which makes it useful for ranking tasks, such as document relevance and gene expression.
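The insensitivity to the mean and scale can be verified with a short sketch. It assumes the Pearson correlation coefficient is the measure meant here and that neither input is constant; the class name CorrelationSketch is hypothetical:

```java
// Pearson correlation coefficient between true values x and predictions y.
class CorrelationSketch {
    static double correlation(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // Undefined (NaN) if either input is constant.
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

Predictions of the form 2x + 5 are shifted and rescaled versions of the true values, yet their correlation with x is 1 (up to floating-point rounding); predictions of the form -x give -1. The measure rewards getting the ordering and relative spacing right, not the absolute values.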