Data science has become an important tool for organizations: they have collected large amounts of data, and to put it to good use, they need data science--the discipline of extracting knowledge from data. Every day, more and more companies realize that they can benefit from data science and use the data they produce more effectively and more profitably.
This is especially true for IT companies: they already have the systems and the infrastructure for generating and processing the data. These systems are often written in Java--the language of choice for many companies, large and small, across the world. This is no surprise: Java offers a solid, mature ecosystem of time-proven, reliable libraries, so many people trust Java and use it for building their applications.
Thus, Java is also a natural choice for many data processing applications. Since the existing systems are already written in Java, it makes sense to use the same technology stack for data science and to integrate machine learning models directly into the application's production code base.
This book covers exactly that. We will first see how to use Java's toolbox for processing small and large datasets, then look into initial exploratory data analysis. Next, we will review the Java libraries that implement common machine learning models for classification, regression, clustering, and dimensionality reduction problems. Then we will move on to more advanced techniques and discuss Information Retrieval and Natural Language Processing, XGBoost, deep learning, and large-scale tools for processing big datasets, such as Apache Hadoop and Apache Spark. Finally, we will look at how to evaluate and deploy the resulting models so that other services can use them.
We hope you will enjoy the book. Happy reading!
Chapter 1, Data Science Using Java, provides an overview of the existing data science tools available in Java and introduces CRISP-DM, a methodology for approaching data science projects. In this chapter, we also introduce our running example: building a search engine.
Chapter 2, Data Processing Toolbox, reviews the standard Java library: the Collections API for storing data in memory, the IO API for reading and writing data, and the Streams API for organizing data processing pipelines in a convenient way. We will look at extensions to the standard library such as Apache Commons Lang, Apache Commons IO, Google Guava, and AOL Cyclops React. Then, we will cover the most common ways of storing data--text and CSV files, HTML, JSON, and SQL databases--and discuss how to get data from these sources. We will finish this chapter by talking about how to collect and prepare the data for our running example, the search engine.
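To give a taste of the Streams API pipelines covered there, here is a minimal word-count sketch using only the standard library; the input lines are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Tokenize each line, drop empty tokens, and count occurrences per word
    static Map<String, Long> wordCounts(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(word -> word, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("java data science", "data processing in java");
        System.out.println(wordCounts(lines));
    }
}
```

The same chaining style (source, intermediate operations, terminal collector) applies to the more realistic pipelines built in the chapter.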
Chapter 3, Exploratory Data Analysis, performs the initial analysis of data with Java: we look at how to calculate common statistics such as the minimal and maximal values, the average, and the standard deviation. We also talk a bit about interactive analysis and see which tools allow us to visually inspect the data before building models. For illustration in this chapter, we use the data collected for the search engine.
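These summary statistics can be computed with plain Java before reaching for any library; the sketch below uses made-up numbers for illustration:

```java
import java.util.Arrays;

public class SimpleStats {
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    // Sample standard deviation (divides by n - 1), computed in two passes
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double ss = Arrays.stream(xs).map(x -> (x - m) * (x - m)).sum();
        return Math.sqrt(ss / (xs.length - 1));
    }

    public static void main(String[] args) {
        double[] xs = {1, 2, 3, 4, 5};
        System.out.printf("min=%.1f max=%.1f mean=%.1f sd=%.3f%n",
                Arrays.stream(xs).min().getAsDouble(),
                Arrays.stream(xs).max().getAsDouble(),
                mean(xs), stdDev(xs));
    }
}
```

In the chapter itself, libraries such as Apache Commons Math provide these statistics (and many more) out of the box.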
Chapter 4, Supervised Learning - Classification and Regression, starts with machine learning and then looks at the models for performing supervised learning in Java. Among others, we look at how to use the following libraries: Smile, JSAT, LIBSVM, LIBLINEAR, and Encog, and we see how we can use them to solve classification and regression problems. We use two examples here. First, to illustrate classification, we use the search engine data to predict whether a URL will appear on the first page of results. Second, to illustrate regression, we predict how much time it takes to multiply two matrices on certain hardware, given its characteristics.
Chapter 5, Unsupervised Learning – Clustering and Dimensionality Reduction, explores the methods for dimensionality reduction available in Java, and we learn how to apply PCA and Random Projection, illustrated with the hardware performance dataset from the previous chapter. We also look at different ways to cluster data, including Agglomerative Clustering, K-Means, and DBSCAN, using a dataset of customer complaints as an example.
Chapter 6, Working with Text – Natural Language Processing and Information Retrieval, looks at how to use text in data science applications, and we learn how to extract more useful features for our search engine. We also look at Apache Lucene, a library for full-text indexing and searching, and Stanford CoreNLP, a library for performing Natural Language Processing. Next, we look at how to represent words as vectors, and we learn how to build such embeddings from co-occurrence matrices and how to use existing ones such as GloVe. We also look at how to apply machine learning to texts, illustrated with a sentiment analysis problem where we use LIBLINEAR to classify whether a review is positive or negative.
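Once words are represented as vectors, their similarity is typically measured by the cosine of the angle between them. Here is a minimal sketch in plain Java; the toy three-dimensional embeddings are made up for illustration (real embeddings such as GloVe have hundreds of dimensions):

```java
public class CosineSimilarity {
    // Cosine similarity: dot(a, b) / (||a|| * ||b||)
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] cat = {0.9, 0.1, 0.3};  // made-up toy embeddings
        double[] dog = {0.8, 0.2, 0.3};
        System.out.println("similarity(cat, dog) = " + cosine(cat, dog));
    }
}
```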
Chapter 7, Extreme Gradient Boosting, covers how to use XGBoost in Java and applies it to two problems from earlier chapters: classifying whether a URL appears on the first page of results and predicting the time needed to multiply two matrices. Additionally, we look at how to solve the learning-to-rank problem with XGBoost, again using our search engine example as illustration.
Chapter 8, Deep Learning with DeepLearning4j, covers Deep Neural Networks and DeepLearning4j, a library for building and training these networks in Java. In particular, we talk about Convolutional Neural Networks and see how to use them for image recognition: predicting whether a picture shows a dog or a cat. Additionally, we discuss data augmentation, a way to generate more training data, and mention how to speed up training using GPUs. We finish the chapter by describing how to rent a GPU server on Amazon AWS.
Chapter 9, Scaling Data Science, talks about the big data tools available in Java: Apache Hadoop and Apache Spark. We illustrate them by processing Common Crawl, a copy of the Internet, and calculating the TF-IDF of each document there. Additionally, we look at the graph processing tools available in Apache Spark and build a recommendation system for scientists that suggests a coauthor for their next paper.
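The TF-IDF weighting mentioned above is simple at its core: a term's weight in a document grows with its frequency there and shrinks with the number of documents that contain it. A toy sketch in plain Java (the tiny corpus is made up for illustration; the real computation in the chapter runs distributed over Common Crawl):

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {
    // tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the corpus size
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("java", "data", "science"),
                Arrays.asList("java", "streams"),
                Arrays.asList("hadoop", "spark"));
        // "data" appears in only one of three documents, so it gets a high weight
        System.out.println(tfIdf("data", corpus.get(0), corpus));
    }
}
```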
Chapter 10, Deploying Data Science Models, looks at how to expose models to the rest of the world in such a way that other services can use them. Here we cover Spring Boot and talk about how to use the search engine model we developed to rank articles from Common Crawl. We finish by discussing ways to evaluate the performance of models in the online setting and talk about A/B tests and Multi-Armed Bandits.
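To give a flavor of the Multi-Armed Bandit idea: an epsilon-greedy strategy serves a random model variant with probability epsilon and otherwise serves the variant with the best average reward so far. A minimal sketch; the class, parameters, and rewards here are illustrative, not the book's implementation:

```java
import java.util.Random;

public class EpsilonGreedy {
    final double[] sums;   // cumulative reward per arm (model variant)
    final int[] counts;    // number of pulls per arm
    final double epsilon;
    final Random rnd;

    EpsilonGreedy(int arms, double epsilon, long seed) {
        this.sums = new double[arms];
        this.counts = new int[arms];
        this.epsilon = epsilon;
        this.rnd = new Random(seed);
    }

    // Explore with probability epsilon, otherwise exploit the best average reward
    int selectArm() {
        if (rnd.nextDouble() < epsilon) {
            return rnd.nextInt(counts.length);
        }
        int best = 0;
        for (int arm = 1; arm < counts.length; arm++) {
            if (average(arm) > average(best)) {
                best = arm;
            }
        }
        return best;
    }

    double average(int arm) {
        return counts[arm] == 0 ? 0.0 : sums[arm] / counts[arm];
    }

    void update(int arm, double reward) {
        counts[arm]++;
        sums[arm] += reward;
    }

    public static void main(String[] args) {
        EpsilonGreedy bandit = new EpsilonGreedy(2, 0.1, 42L);
        bandit.update(0, 0.2);  // arm 0: low observed reward
        bandit.update(1, 0.9);  // arm 1: high observed reward
        System.out.println("next arm: " + bandit.selectArm());
    }
}
```

Unlike a fixed A/B split, the bandit shifts traffic toward the better-performing variant as evidence accumulates.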
You need a reasonably recent system with at least 2 GB of RAM running Windows 7, Ubuntu 14.04, or Mac OS X. You will also need Java 1.8.0 or above and Maven 3.0.0 or above installed.
This book is intended for software engineers who are comfortable with developing Java applications and are familiar with the basic concepts of data science. It will also be useful for data scientists who do not yet know Java but want or need to learn it.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Here, we create SummaryStatistics objects and add all body content lengths."
A block of code is set as follows:
SummaryStatistics statistics = new SummaryStatistics();
data.stream()
    .mapToDouble(RankedPage::getBodyContentLength)
    .forEach(statistics::addValue);
System.out.println(statistics.getSummary());
Any command-line input or output is written as follows:
mvn dependency:copy-dependencies -DoutputDirectory=lib
mvn compile
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "If, instead, our model outputs some score such that the higher the values of the score the more likely the item is to be positive, then the binary classifier is called a ranking classifier."
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important to us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
- Log in or register to our website using your e-mail address and password.
- Hover the mouse pointer on the SUPPORT tab at the top.
- Click on Code Downloads & Errata.
- Enter the name of the book in the Search box.
- Select the book for which you're looking to download the code files.
- Choose from the drop-down menu where you purchased this book from.
- Click on Code Download.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
- WinRAR / 7-Zip for Windows
- Zipeg / iZip / UnRarX for Mac
- 7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Java-for-Data-Science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringJavaforDataScience_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, maybe a mistake in the text or the code, we would be grateful if you could report it to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at [email protected] with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address it.