Big Data Analytics with Java


Overview of this book

This book covers case studies such as sentiment analysis on a tweet dataset, recommendations on the MovieLens dataset, customer segmentation on an e-commerce dataset, and graph analysis on a real flights dataset. It is an end-to-end guide to implementing analytics on big data with Java. Java is the de facto language for major big data environments, including Hadoop, and this book will teach you how to perform analytics on big data with production-friendly Java. The book is divided into two parts. The first part is an introduction that helps readers get acquainted with big data environments, whereas the second part is an in-depth discussion of the concepts involved in analytics on big data. It will take you from data analysis and data visualization to the core concepts and advantages of machine learning, real-life usage of regression and of classification using Naïve Bayes, a detailed discussion of clustering, and a review of simple neural networks on big data using Deeplearning4j or plain Java Spark code. This book is a must-have for Java developers who want to start learning big data analytics and apply it in the real world.
Table of Contents (21 chapters)
Big Data Analytics with Java
About the Author
About the Reviewers
Customer Feedback
Big Data Analytics with Java
Ensembling on Big Data
Real-Time Analytics on Big Data


In this chapter, we learned about ensembling, a very popular approach in machine learning. We saw how, in a random forest, a group of decision trees can be built, trained, and run on a dataset in parallel, and how their results are then combined: by voting to pick the most frequently predicted class in the case of classification, or by averaging the predictions in the case of regression. We also learned how a group of weak decision tree learners can be trained sequentially, one after the other, with each step boosting the results of the previous model in the workflow by minimizing an error function using techniques such as gradient descent. We saw how powerful these approaches are and what advantages they have over simpler approaches. Finally, we ran both ensembling approaches on a real-world dataset provided by Lending Club and analyzed the accuracy of our results.
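The two ideas summarized above can be illustrated with a minimal plain-Java sketch. It is not the chapter's Spark code: the class `EnsembleSketch`, its method names, the tiny one-feature dataset, and the choice of squared error as the loss are all assumptions made for illustration. The first two methods show how independently trained trees are combined (voting and averaging); the rest shows sequential boosting, where each one-split tree (a "stump") is fitted to the residuals left by the models before it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; names and data are not from the book.
class EnsembleSketch {

    /** Classification: every tree casts one vote; the most frequent label wins. */
    static int majorityVote(int[] treeVotes) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = treeVotes[0];
        for (int label : treeVotes) {
            int count = counts.merge(label, 1, Integer::sum);
            if (count > counts.get(best)) {
                best = label;
            }
        }
        return best;
    }

    /** Regression: the ensemble prediction is the mean of the tree predictions. */
    static double average(double[] treePredictions) {
        double sum = 0.0;
        for (double p : treePredictions) sum += p;
        return sum / treePredictions.length;
    }

    /** Weak learner: a one-split tree on a single feature. */
    static final class Stump {
        final double threshold, left, right;
        Stump(double threshold, double left, double right) {
            this.threshold = threshold; this.left = left; this.right = right;
        }
        double predict(double x) { return x <= threshold ? left : right; }
    }

    /** Fits the stump with the lowest squared error against the given targets. */
    static Stump fitStump(double[] x, double[] targets) {
        Stump best = null;
        double bestError = Double.POSITIVE_INFINITY;
        for (double t : x) {
            double sumL = 0, sumR = 0;
            int nL = 0, nR = 0;
            for (int i = 0; i < x.length; i++) {
                if (x[i] <= t) { sumL += targets[i]; nL++; } else { sumR += targets[i]; nR++; }
            }
            // nL >= 1 because the point that supplied t falls on the left side.
            Stump s = new Stump(t, sumL / nL, nR == 0 ? 0.0 : sumR / nR);
            double error = 0;
            for (int i = 0; i < x.length; i++) {
                double d = targets[i] - s.predict(x[i]);
                error += d * d;
            }
            if (error < bestError) { bestError = error; best = s; }
        }
        return best;
    }

    /** Boosting for squared error: each round fits a stump to the current residuals. */
    static double[] boost(double[] x, double[] y, int rounds, double learningRate) {
        double[] prediction = new double[x.length]; // the initial model predicts 0
        for (int r = 0; r < rounds; r++) {
            double[] residuals = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                residuals[i] = y[i] - prediction[i]; // negative gradient of 0.5 * (y - f)^2
            }
            Stump s = fitStump(x, residuals);
            for (int i = 0; i < x.length; i++) {
                prediction[i] += learningRate * s.predict(x[i]);
            }
        }
        return prediction;
    }

    static double meanSquaredError(double[] y, double[] prediction) {
        double sum = 0.0;
        for (int i = 0; i < y.length; i++) {
            double d = y[i] - prediction[i];
            sum += d * d;
        }
        return sum / y.length;
    }

    public static void main(String[] args) {
        System.out.println("Vote: " + majorityVote(new int[] {1, 0, 1, 1, 0}));
        System.out.println("Average: " + average(new double[] {2.0, 4.0, 6.0}));
        double[] x = {1, 2, 3, 4, 5, 6};           // made-up feature values
        double[] y = {1.0, 1.2, 0.9, 3.1, 2.9, 3.0}; // made-up targets
        for (int rounds : new int[] {1, 5, 50}) {
            double mse = meanSquaredError(y, boost(x, y, rounds, 0.5));
            System.out.println("rounds=" + rounds + " training MSE=" + mse);
        }
    }
}
```

Running `main` prints the training error for 1, 5, and 50 boosting rounds, and the error should shrink as rounds are added, which is the sense in which each step improves on the previous model. On real data, distributed implementations such as Spark ML's `RandomForestClassifier` and `GBTClassifier` would normally be used instead of hand-written code like this.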

In the next chapter, we will cover the concept of clustering using the k-means algorithm. We will...