Mastering Java for Data Science

By: Alexey Grigorev

Overview of this book

Java is the most popular programming language, according to the TIOBE index, and it is a typical choice for running production systems in many companies, both in the startup world and among large enterprises. Not surprisingly, it is also a common choice for creating data science applications: it is fast and has a great set of data processing tools, both built-in and external. What is more, choosing Java for data science allows you to easily integrate solutions with existing software and bring data science into production with less effort. This book will teach you how to create data science applications with Java. First, we will review the most important considerations when starting a data science application, and then brush up on the basics of Java and machine learning before diving into more advanced topics. We start by going over the existing libraries for data processing and the libraries that implement machine learning algorithms. After that, we cover topics such as classification and regression, dimensionality reduction and clustering, information retrieval and natural language processing, and deep learning and big data. Finally, we finish the book by talking about ways to deploy the model and evaluate it in production settings.

Data science process models


Applying data science is much more than just selecting a suitable machine learning algorithm and using it on the data. It is always good to keep in mind that machine learning is only a small part of the project; there are other parts, such as understanding the problem, collecting the data, testing the solution, and deploying it to production.

When working on any project, not just data science ones, it is beneficial to break it down into smaller, manageable pieces and complete them one by one. For data science, there are best practices that describe how to do this well, and they are called process models. There are multiple models, including CRISP-DM and OSEMN.

In this chapter, we will focus on CRISP-DM. The alternative, OSEMN (Obtain, Scrub, Explore, Model, and iNterpret), is more suitable for data analysis tasks and addresses many of the other important steps only to a lesser extent.

CRISP-DM

Cross Industry Standard Process for Data Mining (CRISP-DM) is a process methodology for developing data mining applications. It was created before the term data science became popular, and it is reliable and time-tested by several generations of analysts. These practices are still useful nowadays and describe the high-level steps of any analytical project quite well.

Image source: https://en.wikipedia.org/wiki/File:CRISP-DM_Process_Diagram.png

The CRISP-DM methodology breaks down a project into the following steps:

  • Business understanding
  • Data understanding
  • Data preparation
  • Modeling
  • Evaluation
  • Deployment

The methodology itself defines much more than just these steps, but typically knowing what the steps are and what happens at each step is enough for a successful data science project. Let's look at each of these steps separately.

The first step is Business Understanding. This step aims at learning what kinds of problems the business has and what it wants to achieve by solving these problems. To be successful, a data science application must be useful for the business. The result of this step is a formulation of the problem we want to solve and the desired outcome of the project.

The second step is Data Understanding. In this step, we try to find out what data can be used to solve the problem. We also need to find out whether we already have the data; if not, we need to think about how we can get it. Depending on what data we find (or do not find), we may want to alter the original goal.

When the data is collected, we need to explore it. The process of reviewing the data is often called Exploratory Data Analysis, and it is an integral part of any data science project. It helps us understand the processes that created the data and can already suggest approaches for tackling the problem. The result of this step is knowledge about which data sources are needed to solve the problem. We will talk more about this step in Chapter 3, Exploratory Data Analysis.

The third step of CRISP-DM is Data Preparation. For a dataset to be useful, it needs to be cleaned and transformed into a tabular form. The tabular form means that each row corresponds to exactly one observation. If our data is not in this shape, most machine learning algorithms cannot use it. Thus, we need to prepare the data so that it can eventually be converted to a matrix form and fed to a model.
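To make the idea of the tabular form concrete, here is a minimal sketch of converting a list of observations into the matrix form that most machine learning libraries expect. The Apartment record and its fields are hypothetical, chosen only to illustrate the shape of the data:

import java.util.Arrays;
import java.util.List;

public class TabularDataExample {

    // Hypothetical observation: one apartment per row of the future matrix
    static class Apartment {
        final double area;
        final int rooms;
        final double price;

        Apartment(double area, int rooms, double price) {
            this.area = area;
            this.rooms = rooms;
            this.price = price;
        }
    }

    // Each row is one observation, each column is one feature
    static double[][] toFeatureMatrix(List<Apartment> data) {
        double[][] X = new double[data.size()][2];
        for (int i = 0; i < data.size(); i++) {
            X[i][0] = data.get(i).area;
            X[i][1] = data.get(i).rooms;
        }
        return X;
    }

    public static void main(String[] args) {
        List<Apartment> data = Arrays.asList(
                new Apartment(41.0, 2, 79_000),
                new Apartment(65.5, 3, 122_000));
        System.out.println(Arrays.deepToString(toFeatureMatrix(data)));
    }
}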

Also, there could be different datasets that contain the needed information, and they may not be homogeneous. This means that we need to convert these datasets to some common format that the model can read.

This step also includes Feature Engineering: the process of creating the features that are most informative for the problem and describe the data in the best way.
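For instance, a raw text field can be turned into a handful of numeric features. The features below (token count, total characters, average token length) are just illustrative choices:

public class FeatureEngineeringExample {

    // Derive simple numeric features from a raw text field
    static double[] textFeatures(String text) {
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        double totalChars = 0;
        for (String token : tokens) {
            totalChars += token.length();
        }
        double avgTokenLength = totalChars / Math.max(1, tokens.length);
        return new double[] { tokens.length, totalChars, avgTokenLength };
    }

    public static void main(String[] args) {
        double[] features = textFeatures("Mastering Java for Data Science");
        System.out.println(java.util.Arrays.toString(features));
    }
}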

Many data scientists say that they spend most of their time on this step when building data science applications. We will talk about this step in Chapter 2, Data Processing Toolbox, and throughout the book.

The fourth step is Modeling. In this step, the data is already in the right shape, and we feed it to different machine learning algorithms. This step also includes parameter tuning, feature selection, and selecting the best model.

Evaluation of the quality of the models from the machine learning point of view happens during this step. The most important thing to check is the model's ability to generalize, which is typically done via cross-validation. In this step, we may also want to go back to the previous step and do extra cleaning and feature engineering. The outcome is a model that is potentially useful for solving the problem defined in the first step.
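To illustrate the idea behind cross-validation, here is a minimal sketch that splits row indexes into k folds by hand; in a real project, you would normally rely on a library implementation:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class CrossValidationExample {

    // Split row indexes 0..n-1 into k shuffled folds; each fold serves
    // once as the held-out validation set while the rest is for training
    static List<int[]> kFoldIndexes(int n, int k, long seed) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indexes.add(i);
        }
        Collections.shuffle(indexes, new Random(seed));

        List<int[]> folds = new ArrayList<>();
        int foldSize = n / k;
        for (int fold = 0; fold < k; fold++) {
            int from = fold * foldSize;
            int to = (fold == k - 1) ? n : from + foldSize;
            int[] result = new int[to - from];
            for (int i = from; i < to; i++) {
                result[i - from] = indexes.get(i);
            }
            folds.add(result);
        }
        return folds;
    }

    public static void main(String[] args) {
        for (int[] fold : kFoldIndexes(10, 3, 42)) {
            System.out.println(java.util.Arrays.toString(fold));
        }
    }
}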

The fifth step is Evaluation. It covers evaluating the model from the business perspective, not from the machine learning perspective. This means that we need to perform a critical review of the results so far and plan the next steps: does the model achieve what we want? Additionally, some of the findings may lead to reconsidering the initial question. After this step, we can proceed to the deployment step or re-iterate the process.

The final, sixth step is Model Deployment. During this step, the produced model is put into production, so the result is the model integrated into the live system. We will cover this step in Chapter 10, Deploying Data Science Models.

Often, evaluation is hard because it is not always possible to say whether the model achieves the desired result. In these cases, the evaluation and deployment steps can be combined into one: the model is deployed and applied to only a part of the users, and then the data for evaluating it is collected. We will also briefly cover ways of doing this, such as A/B testing and multi-armed bandits, in the last chapter of the book.
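One simple way to apply a model to only a part of the users is deterministic bucketing by user identifier. The sketch below is an illustrative assumption, not a recipe from the book; the 10% fraction and the hashing scheme are arbitrary:

public class AbTestExample {

    // Route a fixed fraction of users to the new model;
    // String.hashCode is stable, so a user always stays in the same group
    static boolean inTreatmentGroup(String userId, double fraction) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < fraction * 100;
    }

    public static void main(String[] args) {
        // Roughly 10% of users would see the new model's results
        System.out.println(inTreatmentGroup("user-42", 0.10));
    }
}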

A running example

There will be many practical use cases throughout the book, sometimes a couple in each chapter. But we will also have a running example: building a search engine. This problem is interesting for a number of reasons:

  • It is fun
  • Businesses in almost any domain can benefit from a search engine
  • Many businesses already have text data; often it is not used effectively, and its use can be improved
  • Processing text requires a lot of effort, and it is useful to learn how to do it effectively

We will try to keep it simple, yet with this example we will touch on all the technical parts of the data science process throughout the book:

  • Data Understanding: Which data can be useful for the problem? How can we obtain this data?
  • Data Preparation: Once the data is obtained, how can we process it? If it is HTML, how do we extract text from it? How do we extract individual sentences and words from the text? (See the sketch after this list.)
  • Modeling: Ranking documents by their relevance to a query is a data science problem, and we will discuss how it can be approached.
  • Evaluation: The search engine can be tested to see whether it is useful for solving the business problem.
  • Deployment: Finally, the engine can be deployed as a REST service or integrated directly into the live system.
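As a small preview of the Data Preparation bullet above, here is a minimal sketch that extracts plain text from an HTML page and splits it into words. It assumes the jsoup HTML parser is on the classpath, and the tokenization is deliberately naive:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlTextExample {

    public static void main(String[] args) {
        String html = "<html><body><h1>Search</h1>"
                + "<p>Search engines rank documents by relevance.</p></body></html>";

        // Parse the HTML and keep only the visible text
        Document document = Jsoup.parse(html);
        String text = document.body().text();

        // Naive tokenization: lowercase, then split on anything that is not a letter
        for (String word : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                System.out.println(word);
            }
        }
    }
}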

We will obtain and prepare the data in Chapter 2, Data Processing Toolbox, understand the data in Chapter 3, Exploratory Data Analysis, build simple models and evaluate them in Chapter 4, Supervised Machine Learning - Classification and Regression, look at how to process text in Chapter 6, Working with Text - Natural Language Processing and Information Retrieval, see how to apply it to millions of webpages in Chapter 9, Scaling Data Science, and, finally, learn how we can deploy it in Chapter 10, Deploying Data Science Models.