Traditionally, computation has been processor driven. As data grew, the industry focused on increasing processor speed and memory to get better computational performance, and this gave birth to distributed systems. Today, applications create hundreds or even thousands of gigabytes of data every day. This data comes from disparate sources such as application software, sensors, social media, mobile devices, logs, and so on. Data at this scale is difficult to handle with standard data processing software, mainly because its size keeps growing rapidly over time. Traditional distributed systems were not sufficient to manage it, and there was a need for modern systems that could handle heavy data loads with scalability and high availability. Data of this kind is called Big Data.
Big data is usually associated with high-volume, rapidly growing data with unpredictable content. For example, a video gaming company needs to predict performance over a data structure of more than 500 GB and analyze over 4 TB of operational logs every day; many gaming companies use Big Data technologies to do so. The IT advisory firm Gartner defines big data using 3Vs: high volume of data, high velocity of data processing, and high variety of information. IBM added a fourth V, veracity, to its definition to make sure that the data is accurate and helps you make business decisions.
While the potential benefits of big data are real and significant, many challenges remain. Organizations that deal with such high volumes of data face the following problems:
Data acquisition: A lot of raw data is generated by various data sources. The challenge is to filter and compress this data, and to extract information from it once it is cleaned.
Information storage and organization: Once the information is extracted from raw data, a data model is created and stored on a storage device. Traditional relational systems stop being effective for storing datasets at such a large scale. A new breed of non-relational databases, called NoSQL databases, has emerged and is mainly used to work with big data.
Information search and analytics: Storing data is only one part of building a warehouse. Data is useful only when it is processed. Big data is often noisy, dynamic, and heterogeneous. This information is searched, mined, and analyzed for behavioral modeling.
Data security and privacy: When bringing in linked data from multiple sources, organizations need to pay the most attention to data security and privacy.
Big data poses many challenges to the technologies in use today. It requires processing large quantities of data within a finite timeframe, which brings in technologies such as massively parallel processing (MPP) and distributed file systems.
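The idea behind parallel processing can be illustrated with a minimal sketch in Python. It is not how an MPP system or Hadoop is implemented; it only shows the general pattern of splitting data into independent partitions, processing each one separately, and merging the partial results. The function names (split_into_chunks, count_levels, parallel_count) and the sample log lines are hypothetical, and a local thread pool stands in for the separate machines a real cluster would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_count):
    """Divide the records into roughly equal, independent partitions."""
    size = max(1, -(-len(records) // chunk_count))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_levels(chunk):
    """Process one partition on its own: count log lines per severity level."""
    return Counter(line.split(" ", 1)[0] for line in chunk)


def parallel_count(records, workers=4):
    """Process the partitions concurrently, then merge the partial results."""
    chunks = split_into_chunks(records, workers)
    # Assumption: threads stand in for the nodes of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_levels, chunks)
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical sample data: each line starts with a severity level.
logs = ["ERROR disk full", "INFO started", "INFO stopped", "WARN slow", "ERROR timeout"]
result = parallel_count(logs, workers=2)
```

Because each partition is processed without reference to the others, the same pattern can in principle be spread across many machines, which is what lets such systems finish large jobs within a fixed timeframe.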
Big data is attracting more and more attention from organizations, and many of them have already started exploring it. Recently, Gartner (http://www.gartner.com/newsroom/id/2304615) published an executive program survey report, which reveals that Big Data and analytics are among the top 10 business priorities for CIOs. Similarly, analytics and BI rank at the top of CIOs' technology priorities. In this chapter, we will try to understand Apache Hadoop. We will cover the following:
Understanding Apache Hadoop and its ecosystem
Storing large data in HDFS
Creating MapReduce to analyze the Hadoop data
Installing and running Hadoop
Managing and viewing a Hadoop cluster
Administration tools