Mastering Java for Data Science

By: Alexey Grigorev

Overview of this book

At the time of writing, Java is the most popular programming language according to the TIOBE index, and it is a typical choice for running production systems in many companies, both in the startup world and among large enterprises. Not surprisingly, it is also a common choice for creating data science applications: it is fast and has a great set of data processing tools, both built-in and external. What is more, choosing Java for data science allows you to easily integrate solutions with existing software and bring data science into production with less effort. This book will teach you how to create data science applications with Java. First, we will review the most important things to consider when starting a data science application, and then brush up on the basics of Java and machine learning before diving into more advanced topics. We start by going over the existing libraries for data processing and libraries with machine learning algorithms. After that, we cover topics such as classification and regression, dimensionality reduction and clustering, information retrieval and natural language processing, and deep learning and big data. Finally, we finish the book by talking about the ways to deploy a model and evaluate it in production settings.

Data science in Java


In this book, we will use Java for data science projects. At first glance, Java might not seem a good choice for data science: compared to Python or R, it has fewer data science and machine learning libraries, it is more verbose, and it lacks interactivity. On the other hand, it has a lot of upsides:

  • Java is a statically typed language, which makes it easier to maintain the code base and harder to make silly mistakes, because the compiler can detect some of them.
  • The standard library for data processing is very rich, and there are even richer external libraries.
  • Java code is typically faster than the code in scripting languages that are usually used for data science (such as R or Python).
  • Maven, the de-facto standard for dependency management in the Java world, makes it very easy to add new libraries to the project and avoid version conflicts.
  • Most big data frameworks for scalable data processing, such as Apache Hadoop, Apache Spark, or Apache Flink, are written in Java or other JVM languages.
  • Production systems are very often written in Java, and building models in other languages adds unnecessary complexity. Creating the models in Java makes it easier to integrate them into the product.

Next, we will look at the data science libraries available in Java.

Data science libraries

While there are not as many data science libraries in Java as in R, there are quite a few. Additionally, it is often possible to use machine learning and data mining libraries written in other JVM languages, such as Scala, Groovy, or Clojure. Because these languages share the same runtime environment, it is easy to import a library written in, for example, Scala and use it directly in Java code.

We can divide the libraries into the following categories:

  • Data processing libraries
  • Math and stats libraries
  • Machine learning and data mining libraries
  • Text processing libraries

Now we will look at each of them in detail.

Data processing libraries

The standard Java library is very rich and offers a lot of tools for data processing, such as collections, I/O tools, data streams, and means of parallel task execution. 
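As a small illustration of these built-in tools, the following sketch combines collections, the Stream API, and parallel execution to count word frequencies. The class name StreamSketch and the sample sentences are made up for this example; only standard library calls are used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class StreamSketch {
    // Counts how often each lowercased word occurs in the given lines.
    // parallelStream() lets the runtime spread the work across CPU cores.
    static Map<String, Long> wordCounts(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // TreeMap keeps the keys sorted, which makes the output stable
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Java is fast", "java is verbose");
        System.out.println(wordCounts(lines)); // {fast=1, is=2, java=2, verbose=1}
    }
}
```

For an input this small the parallel stream brings no benefit; it is used here only to show that switching between sequential and parallel execution is a one-word change.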

There are also very powerful extensions to the standard library, such as Google Guava and Apache Commons IO.

We will cover both the standard API for data processing and its extensions in Chapter 2, Data Processing Toolbox. In this book, we will use Maven for including external libraries such as Google Guava or Apache Commons IO. Maven is a dependency management tool that lets us specify external dependencies with a few lines of XML. For example, to add Google Guava, it is enough to declare the following dependency in pom.xml:

<dependency> 
 <groupId>com.google.guava</groupId> 
 <artifactId>guava</artifactId> 
 <version>19.0</version> 
</dependency>

When we do this, Maven goes to the Maven Central repository and downloads the specified version of the dependency. The best way to find dependency snippets for pom.xml (such as the previous one) is to use the search at https://mvnrepository.com or your favorite search engine.

Java gives an easy way to access databases through Java Database Connectivity (JDBC), a unified database access API. JDBC makes it possible to connect to virtually any relational database that supports SQL, such as MySQL, MS SQL, Oracle, PostgreSQL, and many others. This allows moving data manipulation from Java to the database side.
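To make this more concrete, here is a minimal sketch of pushing an aggregation to the database through JDBC. The table products, its columns category and price, and the class name JdbcSketch are hypothetical and exist only for illustration; the JDBC URL must point to a real database whose driver is on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

class JdbcSketch {
    // Hypothetical schema: a "products" table with "category" and "price" columns.
    static final String QUERY =
            "SELECT category, AVG(price) AS avg_price "
          + "FROM products WHERE price > ? GROUP BY category";

    // The filtering and averaging happen inside the database;
    // Java only reads the already aggregated rows.
    static Map<String, Double> averagePriceByCategory(String jdbcUrl, double minPrice)
            throws SQLException {
        Map<String, Double> result = new LinkedHashMap<>();
        // try-with-resources closes the connection, statement, and result set
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(QUERY)) {
            stmt.setDouble(1, minPrice); // binds the "?" placeholder safely
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    result.put(rs.getString("category"), rs.getDouble("avg_price"));
                }
            }
        }
        return result;
    }
}
```

The same method works with any database that has a JDBC driver: only the URL passed in changes. If no driver on the classpath accepts the URL, DriverManager.getConnection throws an SQLException.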

When it is not possible to use a database for handling tabular data, we can use DataFrame libraries to do it directly in Java. The DataFrame is a data structure that originally comes from R; it makes it easy to manipulate tabular data in the program without resorting to an external database.

For example, with DataFrames it is possible to filter rows based on some condition, apply the same operation to each element of a column, group by some condition or join with another DataFrame. Additionally, some data frame libraries make it easy to convert tabular data to a matrix form so that the data can be used by machine learning algorithms. 
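The operations above can be approximated in plain Java, which gives a feeling for what a DataFrame library automates. The following sketch does not use any DataFrame library: the Sale row type, its fields, and the class name TabularSketch are invented for this example, and the Stream API stands in for the data frame operations.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TabularSketch {
    // Hypothetical row type: one record of a small sales table.
    static final class Sale {
        final String city;
        final int units;
        final double amount;

        Sale(String city, int units, double amount) {
            this.city = city;
            this.units = units;
            this.amount = amount;
        }
    }

    // Filter rows by a condition, then group by one column and sum another.
    static Map<String, Double> totalByCity(List<Sale> sales, double minAmount) {
        return sales.stream()
                .filter((Sale s) -> s.amount >= minAmount)
                .collect(Collectors.groupingBy((Sale s) -> s.city, TreeMap::new,
                        Collectors.summingDouble((Sale s) -> s.amount)));
    }

    // Convert the numeric columns to a matrix (one row per record),
    // the form that machine learning algorithms usually expect.
    static double[][] toMatrix(List<Sale> sales) {
        return sales.stream()
                .map((Sale s) -> new double[] {s.units, s.amount})
                .toArray(double[][]::new);
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
                new Sale("Berlin", 2, 10.0), new Sale("Paris", 1, 5.0),
                new Sale("Berlin", 3, 20.0), new Sale("Paris", 4, 1.0));
        System.out.println(totalByCity(sales, 2.0));
        System.out.println(Arrays.deepToString(toMatrix(sales)));
    }
}
```

A dedicated library replaces the hand-written row class with named columns and typically adds joins, CSV loading, and missing-value handling on top.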

There are a few data frame libraries available in Java. Some of them are as follows:

We will also cover databases and data frames in Chapter 2, Data Processing Toolbox, and we will use DataFrames throughout the book.

There are also more complex data processing libraries, such as Spring Batch (http://projects.spring.io/spring-batch/). They allow creating complex data pipelines (called ETLs, from Extract-Transform-Load) and managing their execution.

Additionally, there are libraries for distributed data processing, such as Apache Hadoop, Apache Spark, and Apache Flink.

We will talk about distributed data processing in Chapter 9, Scaling Data Science.

Math and stats libraries

The math support in the standard Java library is quite limited: it only includes methods such as log for computing the logarithm, exp for computing the exponential, and other basic methods.
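The following sketch shows what this limitation means in practice: Math.exp and Math.log are available, but even basic statistics such as the mean or the standard deviation have to be written by hand. The class name MathSketch and the sample values are chosen only for illustration.

```java
class MathSketch {
    // Logistic sigmoid, built directly on Math.exp.
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Arithmetic mean; the standard library has no ready-made method for arrays.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Population standard deviation (divides by n, not by n - 1).
    static double std(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / xs.length);
    }

    public static void main(String[] args) {
        double[] xs = {2, 4, 4, 4, 5, 5, 7, 9};
        System.out.println(mean(xs));      // 5.0
        System.out.println(std(xs));       // 2.0
        System.out.println(sigmoid(0.0));  // 0.5
        System.out.println(Math.log(Math.exp(1.0))); // approximately 1.0
    }
}
```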

There are external libraries with richer support of mathematics. For example:

Also, many machine learning libraries come with some extra math functionality, often linear algebra, stats, and optimization.

Machine learning and data mining libraries

There are quite a few machine learning and data mining libraries available for Java and other JVM languages. Some of them are as follows:

  • Weka (http://www.cs.waikato.ac.nz/ml/weka/) is probably the most famous data mining library in Java; it contains a lot of algorithms and has many extensions.
  • JavaML (http://java-ml.sourceforge.net/) is quite an old and reliable ML library, but unfortunately it is not updated anymore.
  • Smile (http://haifengl.github.io/smile/) is a promising ML library that is under active development at the moment, with a lot of new methods being added to it.
  • JSAT (https://github.com/EdwardRaff/JSAT) contains quite an impressive list of machine learning algorithms.
  • H2O (http://www.h2o.ai/) is a framework for distributed ML written in Java, but it is available for multiple languages, including Scala, R, and Python.
  • Apache Mahout (http://mahout.apache.org/) is used for in-core (single-machine) and distributed machine learning. The Mahout Samsara framework allows writing code in a framework-independent way and then executing it on Spark, Flink, or H2O.

There are several libraries that specialize solely in neural networks:

We will cover some of these libraries throughout the book.

Text processing

It is possible to do simple text processing using only the standard Java library, with classes such as StringTokenizer, the java.text package, or regular expressions.
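As a brief sketch of what the standard library alone can do, the following example tokenizes text with a regular expression and splits it into sentences with BreakIterator from the java.text package. The class name TextSketch and the choice of "a token is a run of letters" are assumptions made for this example, not a recommendation for real NLP work.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextSketch {
    // A token is assumed to be any run of Unicode letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Lowercases the text and extracts the letter-only tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text.toLowerCase(Locale.ROOT));
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    // Splits text into sentences using the locale-aware rules of java.text.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Java is fast. Java is verbose!";
        System.out.println(tokenize(text));  // [java, is, fast, java, is, verbose]
        System.out.println(sentences(text));
    }
}
```

This covers tokenization and sentence splitting only; tasks such as part-of-speech tagging or named entity recognition need a dedicated library.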

In addition to that, there is a big variety of text processing frameworks available for Java as follows:

Most NLP libraries have very similar functionality and coverage of algorithms, which is why selecting which one to use is usually a matter of habit or taste. They all typically have tokenization, parsing, part-of-speech tagging, named entity recognition, and other algorithms for text processing. Some of them (such as StanfordNLP) support multiple languages, and some support only English.

We will cover some of these libraries in Chapter 6, Working with Text - Natural Language Processing and Information Retrieval.