Machine Learning in Java - Second Edition

By: AshishSingh Bhatia, Bostjan Kaluza

Overview of this book

As the amount of data in the world continues to grow at an almost incomprehensible rate, being able to understand and process data is becoming a key differentiator for competitive organizations. Machine learning applications are everywhere, from self-driving cars, spam detection, document search, and trading strategies to speech recognition. This makes machine learning well suited to the present-day era of big data and data science. The main challenge is how to transform data into actionable knowledge. Machine Learning in Java will provide you with the techniques and tools you need. You will start by learning how to apply machine learning methods to a variety of common tasks, including classification, prediction, forecasting, market basket analysis, and clustering. The code in this book works with JDK 8 and above and has been tested on JDK 11. Moving on, you will discover how to detect anomalies and fraud, and how to perform activity recognition, image recognition, and text analysis. By the end of the book, you will have explored related web resources and technologies that will help you take your learning to the next level. By applying the most effective machine learning methods to real-world problems, you will gain hands-on experience that will transform the way you think about data.
Table of Contents (13 chapters)

Machine learning and data science

Nowadays, everyone talks about machine learning and data science. So, what exactly is machine learning, anyway? How does it relate to data science? These two terms are commonly confused, as they often employ the same methods and overlap significantly. Therefore, let's first clarify what they are. Josh Wills tweeted this:

"A data scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician."
– Josh Wills

More specifically, data science encompasses the entire process of obtaining knowledge by integrating methods from statistics, computer science, and other fields to gain insight from data. In practice, data science encompasses an iterative process of data harvesting, cleaning, analysis, visualization, and deployment.

Machine learning, on the other hand, is mainly concerned with generic algorithms and techniques that are used in the analysis and modelling phases of the data science process.

Solving problems with machine learning

Among the different machine learning approaches, there are three main ways of learning, as shown in the following list:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

Given a set of example inputs, X, and their outcomes, Y, supervised learning aims to learn a general mapping function, f, which transforms inputs into outputs, written as f: X → Y.

An example of supervised learning is credit card fraud detection, where the learning algorithm is presented with credit card transactions (matrix X) marked as normal or suspicious (vector Y). The learning algorithm produces a decision model that marks unseen transactions as normal or suspicious (this is the f function).
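To make the idea of learning f from X and Y concrete, here is a minimal sketch that uses a 1-nearest-neighbor rule, chosen only because it is short; it is not the method the book uses for fraud detection. The class name SupervisedSketch, the two features, and the toy transactions are all hypothetical.

```java
import java.util.function.Function;

final class SupervisedSketch {

    /**
     * "Learns" f: X -> Y by remembering the labeled examples.
     * The returned function labels a new input with the outcome of
     * the closest training example (1-nearest neighbor).
     */
    static Function<double[], String> train(double[][] x, String[] y) {
        if (x.length == 0 || x.length != y.length) {
            throw new IllegalArgumentException("X and Y must be non-empty and of equal length");
        }
        return input -> {
            int nearest = 0;
            double nearestDistance = Double.POSITIVE_INFINITY;
            for (int i = 0; i < x.length; i++) {
                // Squared Euclidean distance between the input and example i
                double distance = 0.0;
                for (int j = 0; j < input.length; j++) {
                    double diff = x[i][j] - input[j];
                    distance += diff * diff;
                }
                if (distance < nearestDistance) {
                    nearestDistance = distance;
                    nearest = i;
                }
            }
            return y[nearest];
        };
    }

    public static void main(String[] args) {
        // Hypothetical features: {amount in hundreds, transactions in the last hour}
        double[][] x = { {0.2, 1}, {0.5, 2}, {0.3, 1}, {9.0, 8}, {7.5, 6} };
        String[] y = { "normal", "normal", "normal", "suspicious", "suspicious" };

        Function<double[], String> f = train(x, y);

        // Apply the learned function to transactions it has not seen
        System.out.println(f.apply(new double[] {0.4, 1}));
        System.out.println(f.apply(new double[] {8.0, 7}));
    }
}
```

Returning a Function mirrors the notation above: training consumes the matrix X and the vector Y, and what comes back is the mapping f that can be applied to unseen inputs.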

In contrast, unsupervised learning algorithms do not assume given outcome labels, as they focus on learning the structure of the data, such as grouping similar inputs into clusters. Unsupervised learning can, therefore, discover hidden patterns in the data. An example of unsupervised learning is an item-based recommendation system, where the learning algorithm discovers similar items bought together; for example, people who bought book A also bought book B.
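The "people who bought book A also bought book B" idea can be sketched by counting how often items share a basket. This is a simplified illustration with no outcome labels involved, not a full recommendation engine; the class name CoPurchaseSketch and the baskets are made up, and each basket is assumed to contain no duplicate items.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CoPurchaseSketch {

    /** Counts, for each item, how often every other item appears in the same basket. */
    static Map<String, Map<String, Integer>> coOccurrences(List<List<String>> baskets) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (List<String> basket : baskets) {
            for (String a : basket) {
                for (String b : basket) {
                    if (!a.equals(b)) {
                        counts.computeIfAbsent(a, k -> new HashMap<>())
                              .merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return counts;
    }

    /** Returns the item most often bought together with the given item, or null if there is none. */
    static String mostSimilar(Map<String, Map<String, Integer>> counts, String item) {
        String best = null;
        int bestCount = 0;
        Map<String, Integer> neighbors =
                counts.getOrDefault(item, Collections.<String, Integer>emptyMap());
        for (Map.Entry<String, Integer> entry : neighbors.entrySet()) {
            // Ties are broken arbitrarily by map iteration order
            if (entry.getValue() > bestCount) {
                bestCount = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical purchase histories
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("book A", "book B", "book C"),
                Arrays.asList("book A", "book B"),
                Arrays.asList("book A", "book B", "book D"),
                Arrays.asList("book C", "book D"));

        Map<String, Map<String, Integer>> counts = coOccurrences(baskets);
        System.out.println("Bought with book A: " + mostSimilar(counts, "book A"));
    }
}
```

Raw co-occurrence counts favor popular items; practical systems usually normalize them, for example with a similarity measure, but the counting step is the same.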

Reinforcement learning addresses the learning process from a completely different angle. It assumes that an agent, which can be a robot, bot, or computer program, interacts with a dynamic environment to achieve a specific goal. The environment is described with a set of states, and the agent can take different actions to move from one state to another. Some states are marked as goal states, and if the agent reaches one of them, it receives a large reward. In other states, the reward is smaller, non-existent, or even negative. The goal of reinforcement learning is to find an optimal policy, that is, a mapping function that specifies the action to take in each of the states, without a teacher explicitly telling the agent whether an action leads to the goal state or not. An example of reinforcement learning would be a program for driving a vehicle, where the states correspond to the driving conditions, for example, current speed, road segment information, surrounding traffic, speed limits, and obstacles on the road; and the actions could be driving maneuvers, such as turn left or right, stop, accelerate, and continue. The learning algorithm produces a policy that specifies the action that is to be taken in specific configurations of driving conditions.
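As a rough illustration of states, actions, rewards, and a learned policy, the following sketch applies tabular Q-learning, one common reinforcement learning algorithm, to a made-up environment: five states on a line, with the rightmost state as the goal. The class name QLearningSketch, the environment, and the learning rate and discount values are all assumptions chosen for brevity.

```java
import java.util.Random;

final class QLearningSketch {
    static final int STATES = 5;          // states 0..4 on a line; state 4 is the goal
    static final int LEFT = 0;
    static final int RIGHT = 1;

    /** Environment dynamics: move one step left or right, staying on the line. */
    static int step(int state, int action) {
        int next = (action == RIGHT) ? state + 1 : state - 1;
        return Math.max(0, Math.min(STATES - 1, next));
    }

    /** Reward signal: 1 for reaching the goal state, 0 everywhere else. */
    static double reward(int nextState) {
        return nextState == STATES - 1 ? 1.0 : 0.0;
    }

    /** Learns action values Q[state][action] from randomly explored episodes. */
    static double[][] learn(int episodes, long seed) {
        double alpha = 0.5;   // learning rate (assumed value)
        double gamma = 0.9;   // discount factor (assumed value)
        double[][] q = new double[STATES][2];
        Random random = new Random(seed);

        for (int episode = 0; episode < episodes; episode++) {
            int state = 0;
            while (state != STATES - 1) {
                int action = random.nextInt(2);   // explore uniformly at random
                int next = step(state, action);
                // Q-learning update: move Q towards reward + discounted best future value
                double target = reward(next) + gamma * Math.max(q[next][LEFT], q[next][RIGHT]);
                q[state][action] += alpha * (target - q[state][action]);
                state = next;
            }
        }
        return q;
    }

    /** The policy: for each non-goal state, pick the action with the higher value. */
    static int[] greedyPolicy(double[][] q) {
        int[] policy = new int[STATES - 1];
        for (int state = 0; state < policy.length; state++) {
            policy[state] = q[state][RIGHT] > q[state][LEFT] ? RIGHT : LEFT;
        }
        return policy;
    }

    public static void main(String[] args) {
        int[] policy = greedyPolicy(learn(500, 42L));
        for (int state = 0; state < policy.length; state++) {
            System.out.println("state " + state + " -> " + (policy[state] == RIGHT ? "right" : "left"));
        }
    }
}
```

Note that nothing tells the agent which action is correct; it only observes rewards, and the policy is read off the learned action values afterwards.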

In this book, we will focus on supervised and unsupervised learning only, as they share many concepts. If reinforcement learning has sparked your interest, a good book to start with is Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, MIT Press (2018).

Applied machine learning workflow

This book's emphasis is on applied machine learning. We want to provide you with the practical skills needed to get learning algorithms to work in different settings. Instead of math and theory in machine learning, we will spend more time on the practical, hands-on skills (and dirty tricks) to get this stuff to work well on an application. We will focus on supervised and unsupervised machine learning and learn the essential steps in data science to build the applied machine learning workflow.

A typical workflow in applied machine learning applications consists of answering a series of questions that can be summarized in the following steps:

  1. Data and problem definition: The first step is to ask interesting questions, such as: What is the problem you are trying to solve? Why is it important? Which format of result answers your question? Is this a simple yes/no answer? Do you need to pick one of the available options?
  2. Data collection: Once you have a problem to tackle, you will need the data. Ask yourself what kind of data will help you answer the question. Can you get the data from the available sources? Will you have to combine multiple sources? Do you have to generate the data? Are there any sampling biases? How much data will be required?
  3. Data preprocessing: The first data preprocessing task is data cleaning. Examples include filling in missing values, smoothing noisy data, removing outliers, and resolving inconsistencies. This is usually followed by the integration of multiple data sources and by data transformation: scaling values to a specific range (normalization), grouping them into value bins (discretization), and reducing the number of dimensions.
  4. Data analysis and modelling: Data analysis and modelling includes unsupervised and supervised machine learning, statistical inference, and prediction. A wide variety of machine learning algorithms are available, including k-nearest neighbors, Naive Bayes classifier, decision trees, Support Vector Machines (SVMs), logistic regression, k-means, and so on. The method to be deployed depends on the problem definition, as discussed in the first step, and the type of collected data. The final product of this step is a model inferred from the data.
  5. Evaluation: The last step is devoted to model assessment. The main issue that models built with machine learning face is how well they model the underlying data. If a model is too specific, that is, it overfits the data used for training, it is quite possible that it will not perform well on new data. Conversely, a model can be too generic, meaning that it underfits the training data. For example, when asked how the weather is in California, it always answers sunny, which is indeed correct most of the time. However, such a model is not really useful for making valid predictions. The goal of this step is to correctly evaluate the model and make sure it will work on new data as well. Evaluation methods include separate test and train sets, cross-validation, and leave-one-out cross-validation.
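Three of the preprocessing tasks from step 3 can be sketched in a few lines: filling in missing values, normalization, and discretization. This is only one simple choice for each task (mean imputation, min-max scaling, and equal-width bins); the class name PreprocessingSketch and the sample values are hypothetical, and missing values are assumed to be encoded as NaN.

```java
import java.util.Arrays;

final class PreprocessingSketch {

    /** Data cleaning: replaces missing values (NaN) with the mean of the observed values. */
    static double[] fillMissingWithMean(double[] values) {
        double sum = 0.0;
        int observed = 0;
        for (double v : values) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        if (observed == 0) {
            throw new IllegalArgumentException("at least one observed value is required");
        }
        double mean = sum / observed;
        double[] filled = values.clone();
        for (int i = 0; i < filled.length; i++) {
            if (Double.isNaN(filled[i])) {
                filled[i] = mean;
            }
        }
        return filled;
    }

    /** Normalization: rescales values linearly to the range [0, 1]. */
    static double[] normalize(double[] values) {
        double min = Arrays.stream(values).min().orElseThrow(IllegalArgumentException::new);
        double max = Arrays.stream(values).max().orElseThrow(IllegalArgumentException::new);
        double[] scaled = new double[values.length];
        if (max == min) {
            return scaled;   // all values identical: map everything to 0
        }
        for (int i = 0; i < values.length; i++) {
            scaled[i] = (values[i] - min) / (max - min);
        }
        return scaled;
    }

    /** Discretization: assigns each value in [0, 1] to one of the equal-width bins 0..bins-1. */
    static int[] discretize(double[] normalized, int bins) {
        int[] binIndex = new int[normalized.length];
        for (int i = 0; i < normalized.length; i++) {
            // The value 1.0 would fall just outside the last bin, so clamp it
            binIndex[i] = Math.min(bins - 1, (int) (normalized[i] * bins));
        }
        return binIndex;
    }

    public static void main(String[] args) {
        double[] raw = { 10.0, Double.NaN, 30.0, 20.0 };   // one missing measurement
        double[] filled = fillMissingWithMean(raw);        // {10, 20, 30, 20}
        double[] scaled = normalize(filled);               // {0.0, 0.5, 1.0, 0.5}
        int[] binned = discretize(scaled, 4);
        System.out.println(Arrays.toString(binned));       // prints [0, 2, 3, 2]
    }
}
```

Each function returns a new array rather than modifying its input, so the raw data stays available for trying other cleaning or scaling choices.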

We will take a closer look at each of the steps in the following sections. We will try to understand the type of questions we must answer during the applied machine learning workflow, and look at the accompanying concepts of data analysis and evaluation.
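As a preview of the evaluation step, here is a minimal sketch of k-fold cross-validation applied to the "always sunny" style of model from step 5: a baseline that predicts the most frequent label in its training data. The class name CrossValidationSketch, the fold assignment by index modulo k, and the weather labels are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class CrossValidationSketch {

    /**
     * Runs k-fold cross-validation for a majority-label baseline and
     * returns the accuracy averaged over the folds. Example i belongs
     * to fold i % k; each fold is held out once for testing.
     */
    static double crossValidate(String[] labels, int k) {
        if (k < 2 || k > labels.length) {
            throw new IllegalArgumentException("k must be between 2 and the number of examples");
        }
        double accuracySum = 0.0;
        for (int fold = 0; fold < k; fold++) {
            // "Train": count labels in all examples outside the held-out fold
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i < labels.length; i++) {
                if (i % k != fold) {
                    counts.merge(labels[i], 1, Integer::sum);
                }
            }
            String prediction = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                // Ties are broken arbitrarily by map iteration order
                if (entry.getValue() > bestCount) {
                    bestCount = entry.getValue();
                    prediction = entry.getKey();
                }
            }
            // "Test": measure accuracy on the held-out fold only
            int correct = 0;
            int total = 0;
            for (int i = 0; i < labels.length; i++) {
                if (i % k == fold) {
                    total++;
                    if (labels[i].equals(prediction)) {
                        correct++;
                    }
                }
            }
            accuracySum += (double) correct / total;
        }
        return accuracySum / k;
    }

    public static void main(String[] args) {
        // Hypothetical weather observations: mostly sunny
        String[] labels = { "sunny", "sunny", "sunny", "rainy", "sunny",
                            "sunny", "sunny", "rainy", "sunny", "sunny" };
        System.out.println(crossValidate(labels, 5));               // prints 0.8
        // Leave-one-out cross-validation is the special case k = number of examples
        System.out.println(crossValidate(labels, labels.length));   // prints 0.8
    }
}
```

The baseline scores 80% here while never predicting rain, which is why a high accuracy on its own does not show that a model is useful.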