Book Image

Hadoop Essentials

By : Shiva Achari
Book Image

Hadoop Essentials

By: Shiva Achari

Overview of this book

This book jumps into the world of Hadoop and its tools, to help you learn how to use them effectively to optimize and improve the way you handle Big Data. Starting with the fundamentals Hadoop YARN, MapReduce, HDFS, and other vital elements in the Hadoop ecosystem, you will soon learn many exciting topics such as MapReduce patterns, data management, and real-time data analysis using Hadoop. You will also explore a number of the leading data processing tools including Hive and Pig, and learn how to use Sqoop and Flume, two of the most powerful technologies used for data ingestion. With further guidance on data streaming and real-time analytics with Storm and Spark, Hadoop Essentials is a reliable and relevant resource for anyone who understands the difficulties - and opportunities - presented by Big Data today. With this guide, you'll develop your confidence with Hadoop, and be able to use the knowledge and skills you learn to successfully harness its unparalleled capabilities.
Table of Contents (15 chapters)
Hadoop Essentials
About the Author
About the Reviewers
Pillars of Hadoop – HDFS, MapReduce, and YARN

Understanding big data

Actually, big data is a terminology which refers to challenges that we are facing due to exponential growth of data in terms of V problems. The challenges can be subdivided into the following phases:

  • Capture

  • Storage

  • Search

  • Sharing

  • Analytics

  • Visualization

Big data systems refer to technologies that can process and analyze data, which we discussed as volume, velocity, and variety data problems. The technologies that can solve big data problems should use the following architectural strategy:

  • Distributed computing system

  • Massively parallel processing (MPP)

  • NoSQL (Not only SQL)

  • Analytical database

The structure is as follows:

Big data systems use distributed computing and parallel processing to handle big data problems. Apart from distributed computing and MPP, there are other architectures that can solve big data problems that are toward database environment based system, which are NoSQL and Advanced SQL.


A NoSQL database is a widely adapted technology due to the schema less design, and its ability to scale up vertically and horizontally is fairly simple and in much less effort. SQL and RDBMS have ruled for more than three decades, and it performs well within the limits of the processing environment, and beyond that the RDBMS system performance degrades, cost increases, and manageability decreases, we can say that NoSQL provides an edge over RDBMS in these scenarios.


One important thing to mention is that NoSQLs do not support all ACID properties and are highly scalable, provide availability, and are also fault tolerant. NoSQL usually provides either consistency or availability (availability of nodes for processing), depending upon the architecture and design.

Types of NoSQL databases

As the NoSQL databases are nonrelational they have different sets of possible architecture and design. Broadly, there are four general types of NoSQL databases, based on how the data is stored:

  1. Key-value store: These databases are designed for storing data in a key-value store. The key can be custom, can be synthetic, or can be autogenerated, and the value can be complex objects such as XML, JSON, or BLOB. Key of data is indexed for faster access to the data and improving the retrieval of value. Some popular key-value type databases are DynamoDB, Azure Table Storage (ATS), Riak, and BerkeleyDB.

  2. Column store: These databases are designed for storing data as a group of column families. Read/write operation is done using columns, rather than rows. One of the advantages is the scope of compression, which can efficiently save space and avoid memory scan of the column. Due to the column design, not all files are required to be scanned, and each column file can be compressed, especially if a column has many nulls and repeating values. A column stores databases that are highly scalable and have very high performance architecture. Some popular column store type databases are HBase, BigTable, Cassandra, Vertica, and Hypertable.

  3. Document database: These databases are designed for storing, retrieving, and managing document-oriented information. A document database expands on the idea of key-value stores where values or documents are stored using some structure and are encoded in formats such as XML, YAML, or JSON, or in binary forms such as BSON, PDF, Microsoft Office documents (MS Word, Excel), and so on. The advantage in storing in an encoded format like XML or JSON is that we can search with the key within the document of a data, and it is quite useful in ad hoc querying and semi-structured data. Some popular document-type databases are MongoDB and CouchDB.

  4. Graph database: These databases are designed for data whose relations are well represented as trees or a graph, and has elements, usually with nodes and edges, which are interconnected. Relational databases are not so popular in performing graph-based queries as they require a lot of complex joins, and thus managing the interconnection becomes messy. Graph theoretic algorithms are useful for prediction, user tracking, clickstream analysis, calculating the shortest path, and so on, which will be processed by graph databases much more efficiently as the algorithms themselves are complex. Some popular graph-type databases are Neo4J and Polyglot.

Analytical database

An analytical database is a type of database built to store, manage, and consume big data. Analytical databases are vendor-managed DBMS, which are optimized for processing advanced analytics that involves highly complex queries on terabytes of data and complex statistical processing, data mining, and NLP (natural language processing). Examples of analytical databases are Vertica (acquired by HP), Aster Data (acquired by Teradata), Greenplum (acquired by EMC), and so on.