In this book, we will use Java for data science projects. At first glance, Java might not seem a good choice for data science: unlike Python or R, it has fewer data science and machine learning libraries, it is more verbose, and it lacks interactivity. On the other hand, it has a lot of upsides, as follows:
- Java is a statically typed language, which makes the code base easier to maintain and silly mistakes harder to make; the compiler can detect some of them.
- The standard library for data processing is very rich, and there are even richer external libraries.
- Java code is typically faster than the code in scripting languages that are usually used for data science (such as R or Python).
- Maven, the de facto standard for dependency management in the Java world, makes it very easy to add new libraries to a project and avoid version conflicts.
- Most big data frameworks for scalable data processing, such as Apache Hadoop, Apache Spark, and Apache Flink, are written in Java or other JVM languages.
- Production systems are very often written in Java, and building models in other languages adds unnecessary complexity. Creating the models in Java makes it easier to integrate them into the product.
Next, we will look at the data science libraries available in Java.
While there are not as many data science libraries for Java as there are for R, there are quite a few. Additionally, it is often possible to use machine learning and data mining libraries written in other JVM languages, such as Scala, Groovy, or Clojure. Because these languages share the runtime environment, it is very easy to import libraries written in Scala and use them directly from Java code.
We can divide the libraries into the following categories:
- Data processing libraries
- Math and stats libraries
- Machine learning and data mining libraries
- Text processing libraries
Now we will look at each of them in detail.
The standard Java library is very rich and offers a lot of tools for data processing, such as collections, I/O tools, data streams, and means of parallel task execution.
There are very powerful extensions to the standard library, such as:
- Google Guava (https://github.com/google/guava) and Apache Commons Collections (https://commons.apache.org/collections/) for richer collections
- Apache Commons IO (https://commons.apache.org/io/) for simplified I/O
- AOL Cyclops-React (https://github.com/aol/cyclops-react) for richer functional-style parallel streaming
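As a small illustration of what the standard library alone already offers, the following sketch uses Java 8 Streams to parse, filter, and collect records in parallel. The data and field layout are made up for the example:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StreamsExample {
    // Parse "name,age" records, keep ages above the threshold, and return the names.
    // parallelStream() distributes the work across cores; collect() preserves order.
    static List<String> namesAbove(List<String> lines, int threshold) {
        return lines.parallelStream()
                .map(line -> line.split(","))
                .filter(parts -> Integer.parseInt(parts[1]) > threshold)
                .map(parts -> parts[0])
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("alice,30", "bob,25", "carol,35");
        System.out.println(namesAbove(lines, 26)); // prints [alice, carol]
    }
}
```

The same pipeline runs sequentially if `stream()` is used instead of `parallelStream()`; for small inputs the sequential version is usually faster.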
We will cover both the standard API for data processing and its extensions in Chapter 2, Data Processing Toolbox. In this book, we will use Maven to include external libraries such as Google Guava or Apache Commons IO. Maven is a dependency management tool that allows us to specify external dependencies with a few lines of XML. For example, to add Google Guava, it is enough to declare the following dependency in pom.xml:
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>19.0</version>
</dependency>
When we do this, Maven will go to the Maven Central repository and download the dependency of the specified version. The best way to find dependency snippets for pom.xml (such as the previous one) is to use the search at https://mvnrepository.com or your favorite search engine.
Java provides an easy way to access databases through Java Database Connectivity (JDBC), a unified database access API. JDBC makes it possible to connect to virtually any relational database that supports SQL, such as MySQL, MS SQL Server, Oracle, and PostgreSQL. This also allows moving data manipulation from the Java side to the database side.
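The following sketch shows the typical shape of a JDBC query. The host, database name, credentials, and the `people` table with its columns are all placeholders invented for the example; running the query requires an actual database server and the corresponding JDBC driver on the classpath:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcExample {
    // Build a JDBC connection URL for a MySQL database; host and database are placeholders
    static String jdbcUrl(String host, String database) {
        return "jdbc:mysql://" + host + ":3306/" + database;
    }

    // Query a hypothetical people(name, age) table through plain JDBC.
    // PreparedStatement escapes the parameter, so no SQL injection is possible.
    static void printAdults(Connection conn, int minAge) throws Exception {
        try (PreparedStatement stmt = conn.prepareStatement(
                "SELECT name, age FROM people WHERE age > ?")) {
            stmt.setInt(1, minAge);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " " + rs.getInt("age"));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        String url = jdbcUrl("localhost", "test");
        System.out.println(url);
        // With a running MySQL server and its driver on the classpath:
        // try (Connection conn = java.sql.DriverManager.getConnection(url, "user", "password")) {
        //     printAdults(conn, 26);
        // }
    }
}
```

The try-with-resources blocks make sure the statement and result set are closed even if the query fails, which is the idiomatic way to manage JDBC resources.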
When it is not possible to use a database for handling tabular data, we can use DataFrame libraries to do it directly in Java. A DataFrame is a data structure that originally comes from R; it makes it easy to manipulate tabular data inside the program without resorting to an external database.
For example, with DataFrames it is possible to filter rows based on some condition, apply the same operation to each element of a column, group rows by some key, or join with another DataFrame. Additionally, some DataFrame libraries make it easy to convert tabular data to matrix form so that the data can be used by machine learning algorithms.
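To make these operations concrete without committing to any particular library, the following sketch expresses the same filter and group-by-with-aggregation steps using plain Java collections and streams; a DataFrame library wraps equivalent functionality behind a tabular API. The `Row` type and the data are invented for the example:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TabularOps {
    // A row of a small tabular data set
    static class Row {
        final String city;
        final int population;
        Row(String city, int population) { this.city = city; this.population = population; }
    }

    // Filter rows on a condition, then group and aggregate a column,
    // the way a DataFrame's group-by would
    static Map<Boolean, Integer> populationByLargeCity(List<Row> rows, int threshold) {
        return rows.stream()
                .filter(r -> r.population > 0)                       // drop invalid rows
                .collect(Collectors.groupingBy(
                        r -> r.population > threshold,               // group key: large city or not
                        Collectors.summingInt(r -> r.population)));  // sum the population column
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row("a", 100), new Row("b", 2000), new Row("c", 300));
        System.out.println(populationByLargeCity(rows, 500));
    }
}
```

A DataFrame library expresses the same logic by column name rather than through typed fields, which is more convenient when the schema is only known at runtime.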
There are a few DataFrame libraries available for Java.
We will also cover databases and DataFrames in Chapter 2, Data Processing Toolbox, and we will use DataFrames throughout the book.
There are also more complex data processing libraries, such as Spring Batch (http://projects.spring.io/spring-batch/), which allow creating complex data pipelines (called ETL, from Extract-Transform-Load) and managing their execution.
Additionally, there are libraries for distributed data processing, such as Apache Hadoop, Apache Spark, and Apache Flink.
We will talk about distributed data processing in Chapter 9, Scaling Data Science.
The math support in the standard Java library is quite limited: it only includes basic methods such as Math.log for computing the logarithm and Math.exp for computing the exponential function.
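As an illustration of this limitation, even a mean and a standard deviation have to be assembled by hand from Math.sqrt and streams; this is exactly the kind of boilerplate that dedicated math libraries remove:

```java
import java.util.Arrays;

public class BasicStats {
    // Mean of an array, using only the standard library
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    // Population standard deviation, built by hand on top of Math.sqrt
    static double std(double[] xs) {
        double m = mean(xs);
        double sumSq = Arrays.stream(xs).map(x -> (x - m) * (x - m)).sum();
        return Math.sqrt(sumSq / xs.length);
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 2.0, 3.0, 4.0};
        System.out.println(mean(xs)); // prints 2.5
        System.out.println(std(xs));  // sqrt(1.25), about 1.118
    }
}
```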
There are external libraries with much richer mathematical support. For example:
- Apache Commons Math (http://commons.apache.org/math/) for statistics, optimization, and linear algebra
- Apache Mahout (http://mahout.apache.org/) for linear algebra; it also includes modules for distributed linear algebra and machine learning
- JBlas (http://jblas.org/), an optimized and very fast linear algebra package that uses the BLAS library
Also, many machine learning libraries come with some extra math functionality, often linear algebra, stats, and optimization.
There are quite a few machine learning and data mining libraries available for Java and other JVM languages. Some of them are as follows:
- Weka (http://www.cs.waikato.ac.nz/ml/weka/) is probably the most famous data mining library in Java; it contains a lot of algorithms and has many extensions.
- JavaML (http://java-ml.sourceforge.net/) is quite an old and reliable ML library, but unfortunately it is no longer updated.
- Smile (http://haifengl.github.io/smile/) is a promising ML library that is under active development at the moment, and a lot of new methods are being added to it.
- JSAT (https://github.com/EdwardRaff/JSAT) contains quite an impressive list of machine learning algorithms.
- H2O (http://www.h2o.ai/) is a framework for distributed ML written in Java, but it is available for multiple languages, including Scala, R, and Python.
- Apache Mahout (http://mahout.apache.org/) is used for in-core (one machine) and distributed machine learning. The Mahout Samsara framework allows writing the code in a framework-independent way and then executing it on Spark, Flink, or H2O.
There are also several libraries that specialize solely in neural networks.
We will cover some of these libraries throughout the book.
It is possible to do simple text processing using only the standard Java library, with classes from the java.text package or with regular expressions.
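For instance, a basic tokenizer needs nothing beyond a regular expression; the sketch below splits a sentence on non-letter characters and lowercases the tokens. Anything more advanced, such as part-of-speech tagging, is where the dedicated frameworks come in:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SimpleTokenizer {
    // Split on runs of non-letter characters and normalize each token to lowercase.
    // \p{L} matches any Unicode letter, so accented characters are kept.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.split("[^\\p{L}]+"))
                .filter(token -> !token.isEmpty())
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, Data Science!")); // prints [hello, data, science]
    }
}
```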
In addition to that, there is a large variety of text processing frameworks available for Java, as follows:
- Apache Lucene (https://lucene.apache.org/) is a library that is used for information retrieval
- Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/)
- Apache OpenNLP (https://opennlp.apache.org/)
- LingPipe (http://alias-i.com/lingpipe/)
- GATE (https://gate.ac.uk/)
- MALLET (http://mallet.cs.umass.edu/)
- Smile (http://haifengl.github.io/smile/) also has some algorithms for NLP
Most NLP libraries have very similar functionality and coverage of algorithms, which is why selecting one is usually a matter of habit or taste. They all typically offer tokenization, parsing, part-of-speech tagging, named entity recognition, and other text processing algorithms. Some of them (such as Stanford CoreNLP) support multiple languages, while others support only English.
We will cover some of these libraries in Chapter 6, Working with Text - Natural Language Processing and Information Retrieval.