Apache Spark 2.x for Java Developers

By: Sourav Gulati, Sumit Kumar

Overview of this book

Apache Spark is the buzzword in the big data industry right now, especially with the increasing need for real-time streaming and data processing. While Spark is built on Scala, the Spark Java API exposes all the Spark features available in the Scala version for Java developers. This book will show you how you can implement various functionalities of the Apache Spark framework in Java, without stepping out of your comfort zone. The book starts with an introduction to the Apache Spark 2.x ecosystem, followed by explaining how to install and configure Spark, and refreshes the Java concepts that will be useful to you when consuming Apache Spark's APIs. You will explore RDD and its associated common Action and Transformation Java APIs, set up a production-like clustered environment, and work with Spark SQL. Moving on, you will perform near-real-time processing with Spark Streaming, machine learning analytics with Spark MLlib, and graph processing with GraphX, all using various Java packages. By the end of the book, you will have a solid foundation in implementing components in the Spark framework in Java to build fast, real-time applications.

Dimensions of big data


Big data is best described by its dimensions, known as the Vs of big data. A problem qualifies as a big data problem if it falls within one or more of these dimensions.

The big data world started with three dimensions or 3Vs of big data, which are as follows:

  • Volume
  • Variety
  • Velocity

Let us now take a look at each one in detail:

  • Volume: The amount of data being generated in the world is increasing at an exponential rate. Take social networking websites such as Facebook or Twitter as an example. They serve hundreds of millions, or even billions, of users all around the world, so to analyze the data those users generate, they had to look for solutions outside the traditional RDBMS world. It is not only such giants, either: other organizations, such as banks and telecom companies, also deal with huge numbers of customers. Performing analytics on such an enormous amount of data is a big data problem. According to this dimension, if you are dealing with a volume of data that traditional database systems cannot handle, you need to move into big data territory.
  • Variety: There was a time when only structured data was meant to be processed. To stay ahead of your competitors, however, you need to analyze every sort of data that can add value. For example, to find out which products on a portal are more popular than others, you analyze user clicks. The data coming from these various sources can be structured or unstructured: it can be XML, JSON, CSV, or even plain text. If the data you need to deal with comes in such different varieties, you are facing a big data problem.
  • Velocity: Data is not only increasing in size; the rate at which it arrives is also increasing rapidly. Take the example of Twitter: a huge number of users are tweeting at any given moment, and Twitter has to handle this high-velocity data in almost real time. YouTube is another example: a large number of videos are uploaded to or streamed from it every minute. Online portals of news channels are likewise updated every second or minute to cope with news arriving from all over the world. This dimension of big data is therefore about handling data that arrives at high velocity, and about persisting or analyzing it in near real time so as to generate real value.
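
As a rough illustration of the Variety dimension, an ingestion layer often has to decide what kind of payload it has received before it can parse it. The following Java sketch is a deliberately crude heuristic, not a parser. The class name PayloadFormat, the Kind enum, and the sniffing rules are all assumptions made for this example.

```java
// Hypothetical helper for illustration only; the names and rules are assumptions.
final class PayloadFormat {

    enum Kind { XML, JSON, CSV, PLAIN_TEXT }

    // Guesses the format from the first characters and the first line of the payload.
    // Crude by design: a plain sentence that contains a comma is reported as CSV.
    static Kind guess(String payload) {
        String s = payload.trim();
        if (s.startsWith("<")) {
            return Kind.XML;
        }
        if (s.startsWith("{") || s.startsWith("[")) {
            return Kind.JSON;
        }
        // Look only at the first line; \R matches any line break sequence.
        String firstLine = s.split("\\R", 2)[0];
        return firstLine.contains(",") ? Kind.CSV : Kind.PLAIN_TEXT;
    }

    public static void main(String[] args) {
        System.out.println(guess("<order id=\"1\"/>"));         // XML
        System.out.println(guess("{\"user\": 42}"));            // JSON
        System.out.println(guess("id,name\n1,alice"));          // CSV
        System.out.println(guess("user clicked on product 7")); // PLAIN_TEXT
    }
}
```

A real pipeline would normally rely on declared content types or schema metadata rather than guessing, but the sketch shows why mixed formats add work before any analysis can start.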

Then, with time, our 3D world changed to a 7D world, with the following newer dimensions:

  • Veracity: The truthfulness and completeness of the data are equally important. Take the example of a machine learning algorithm that makes automated decisions based on the data it analyzes. If the data is not accurate, the results of such a system can be disastrous. Consider predictive analytics based on the online shopping data of end users, where you want to use the analytics to send offers to users. If the data fed to such a system is inaccurate or incomplete, the analytics will not be meaningful or beneficial. Processing high-volume or high-velocity data is only meaningful if the data is accurate and complete, so as per this dimension, data should be validated before it is processed or analyzed.
  • Variability: This dimension of big data mainly deals with natural language processing and sentiment analytics. In language, the same word can carry different meanings depending on the sentiment of the user, so to find the sentiment, you must be able to comprehend the intended meaning. Let's say your favorite football team is not playing well and you post a sarcastic tweet saying "What a great performance today by our team!!" Read literally, the sentence suggests that you love the way your team is performing, but in reality it means the opposite. To analyze the sentiment correctly, the system has to be fed with a lot of other information, such as the statistics of the match. As another example, the sentence "This is too good to be true" is negative even though its individual words are positive or neutral. Semantic analytics or natural language processing can only be accurate if the system understands the sentiment behind the data.
  • Value: Performing big data analytics involves a lot of cost: the cost of getting the data, the cost of arranging the hardware on which the data is stored and analyzed, and the cost of the employees and time that go into the analytics. All these costs are justified only if the analytics provide value to the organization. Think of a healthcare company performing analytics on e-commerce data. The company may be able to perform the analytics by getting data from the internet, but the results have no value for it. Likewise, performing analytics on data that is not accurate or complete is of no value. On the contrary, it can be harmful, as the resulting analytics are misleading. Value is therefore an important dimension of big data: analytics are worth their cost only when they yield results the organization can use.
  • Visualization: Visualization is another important aspect of analytics. Results are of little use until they are presented in a proper manner. Let's say the engineers of your company have performed really accurate analytics, but the output is stored in some JSON files or in databases. The business analysts of your company, not being hardcore technical people, cannot fully understand the outcome because it is not visualized properly. So the analytics, even though they are correct, cannot be of much value to your organization. On the other hand, if you create proper graphs, charts, or other effective visualizations of the outcome, it becomes much easier to understand and can be really valuable. Visualization is therefore a really important aspect of big data analytics, because insights can only be acted upon if they are visible.
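
The difficulty described under Variability can be made concrete with a small Java sketch. It scores a sentence by counting words from two tiny word lists, which is roughly the simplest possible approach to sentiment and not how a production system would work. The class name NaiveSentiment and both word lists are invented for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; the word lists below are made up for this sketch.
final class NaiveSentiment {

    private static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("good", "great", "true", "love"));
    private static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("bad", "poor", "terrible", "hate"));

    // Returns (number of positive words) - (number of negative words).
    // Context, sarcasm, and idioms are ignored entirely.
    static int score(String sentence) {
        int score = 0;
        // Lowercase the text and split on anything that is not a letter a-z.
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE.contains(token)) {
                score++;
            } else if (NEGATIVE.contains(token)) {
                score--;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        // Both sentences are negative in intent, yet both get a positive score.
        System.out.println(score("What a great performance today by our team!!")); // 1
        System.out.println(score("This is too good to be true"));                  // 2
    }
}
```

Both example sentences from the text come out with a positive score, which is exactly the misreading that extra context, such as match statistics, is meant to prevent.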