Microsoft SQL Server 2012 with Hadoop

By: Debarchan Sarkar

Overview of this book

With the explosion of data, the open source Apache Hadoop framework is gaining traction, thanks to the huge ecosystem that has arisen around its core functionalities: the Hadoop Distributed File System (HDFS) and MapReduce. Today, having SQL Server talk to Hadoop has become increasingly important because the two are complementary: while petabytes of unstructured data can be stored in Hadoop and take hours to query, terabytes of structured data can be stored in SQL Server 2012 and queried in seconds. This leads to the need to transfer and integrate data between Hadoop and SQL Server.

Microsoft SQL Server 2012 with Hadoop is aimed at SQL Server developers. It will quickly show you how to get Hadoop working alongside SQL Server 2012. Once this is done, the book focuses on how to manage big data with Hadoop and use Hadoop Hive to query the data. It also covers topics such as using SQL Server's in-memory functions and using BI tools with big data.

Microsoft SQL Server 2012 with Hadoop focuses on data integration techniques between the relational (SQL Server 2012) and non-relational (Hadoop) worlds. It walks you through different tools for the bi-directional movement of data with practical examples. You will learn to use open source connectors such as Sqoop to import and export data between SQL Server 2012 and Hadoop, and to work with leading in-memory BI tools to create ETL solutions using the Hive ODBC driver for your data movement projects. Finally, this book gives you a glimpse of present-day self-service BI tools, such as Excel and PowerView, that consume Hadoop data and provide powerful insights on the data.

The Apache Hadoop framework


Hadoop is an open source software framework that supports data-intensive distributed applications and is available through the Apache open source community. It consists of a distributed file system, HDFS (the Hadoop Distributed File System), and a model for distributed processing and analysis called MapReduce. It is written in Java and runs primarily on the Linux/Unix platform.

It is now used extensively to process streams of data that exceed even the largest enterprise datasets in size. Whether it is sensor, clickstream, social media, location-based, or other data that is generated and collected in large volumes, Hadoop is often on the scene to process and analyze it. The real strength of Hadoop is its ability to move the processing or computing logic to the nodes where the data resides; traditional systems, by contrast, rely on a scaled-up central server, move the data to that central processing unit, and process it there. That centralized model does not cope with the volume, velocity, and variety of data that present-day industry is looking to mine for business intelligence. Hence Hadoop, with its fault-tolerant, reliable file system and highly optimized distributed computing model, is one of the leaders in the Big Data world.

The core of Hadoop is its storage system and its distributed computing model:

HDFS

The Hadoop Distributed File System is a program-level abstraction on top of the host operating system's file system. It is responsible for storing data on the cluster: data is split into blocks and distributed across multiple nodes in the cluster.
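This block-and-replica layout is transparent to client code, which reads and writes files through the FileSystem API much as it would a local file system. The following is a minimal, illustrative sketch (not code from this book) of writing and reading a small file through the HDFS Java API; the NameNode address and the /demo path are hypothetical placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/hello.txt");

        // Write a small file; HDFS transparently splits larger files into blocks
        // and replicates them across DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("Hello, HDFS!\n");
        }

        // Read the file back through the same API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}

The client never needs to know which DataNodes hold the blocks; the NameNode resolves block locations behind this API.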

MapReduce

MapReduce is a programming model for processing large datasets using distributed computing on clusters of computers. It consists of two phases: dividing the work across a large number of separate processing units (called Map), and then combining the results produced by these individual processes into a unified result set (called Reduce). Between the Map and Reduce phases, an intermediate shuffle-and-sort step groups the mappers' output by key before it is handed to the reducers.
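As an illustration (this is the classic word count example, not code from this book), the sketch below shows the two phases using Hadoop's Java MapReduce API: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word after shuffle and sort have grouped them by key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

The framework takes care of splitting the input, scheduling Map and Reduce tasks across the cluster, and moving the intermediate data; the developer supplies only these two functions.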

A Hadoop cluster, once successfully configured on a system, has the following basic components:

NameNode

This is also called the Head Node or Master Node of the cluster. It primarily holds the metadata for HDFS while data, which is distributed across the nodes, is being processed; it keeps track of which nodes hold each HDFS data block.

Note

The NameNode is the single point of failure in a Hadoop cluster.

Secondary NameNode

This is an optional node that you can have in your cluster to help recover the NameNode if it goes down. If a secondary NameNode is configured, it keeps periodic snapshots (checkpoints) of the NameNode metadata that can serve as a backup when needed. However, it is not a hot standby, and there is no automated way of failing over to it; if the primary NameNode goes down, manual intervention is needed. This essentially means that there will be obvious downtime in your cluster if the NameNode goes down.

DataNode

These are the systems across the cluster that store the actual HDFS data blocks. The blocks are replicated across multiple nodes to provide fault tolerance and high availability.

JobTracker

This is a service running on the NameNode that manages MapReduce jobs and distributes their individual tasks to the TaskTrackers.

TaskTracker

This is a service running on the DataNodes that instantiates and monitors the individual Map and Reduce tasks submitted to it.

The following figure shows you the core components of the Apache Hadoop framework:

Additionally, there are a number of supporting projects for Hadoop, each with its own purpose: for example, feeding input data into the Hadoop system, or providing a data warehousing layer for ad hoc queries on top of Hadoop, and many more. The following are a few worth mentioning:

Hive

Hive is a supporting project for the main Apache Hadoop project and is an abstraction on top of MapReduce that allows users to query the data without developing MapReduce applications. It provides the user with a SQL-like query language called Hive Query Language (HQL) to fetch data from the Hive store. This makes it easier for people with SQL skills to adapt to the Hadoop environment quickly.
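For illustration only, the following sketch shows how a Java client might run an HQL query through the Hive JDBC driver; the HiveServer2 URL, credentials, and the web_clicks table and its columns are hypothetical placeholders rather than anything from this book.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver; the endpoint below is a placeholder.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HQL looks and feels like SQL; Hive compiles it down to MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT product_id, COUNT(*) AS cnt " +
                "FROM web_clicks GROUP BY product_id");

            while (rs.next()) {
                System.out.println(rs.getString("product_id") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}

Because the interface is plain JDBC, existing SQL tooling and skills carry over; only the query engine underneath is different.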

Pig

Pig is an alternative abstraction on top of MapReduce that uses a dataflow scripting language called PigLatin. It is favored by programmers who already have scripting skills. You can run PigLatin statements interactively in a command-line Pig shell named Grunt, or combine a sequence of PigLatin statements into a script that is executed as a unit. The Pig interpreter turns these PigLatin statements into MapReduce jobs, which are executed on the HDFS data.
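As an illustrative sketch (not code from this book), the example below embeds a few PigLatin statements in a Java program through Pig's PigServer API; the input path, output path, and alias names are hypothetical, and the same statements could be typed one by one in the Grunt shell.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE would run against a cluster; LOCAL runs in-process.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery call adds one PigLatin statement to the logical plan.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Only now does Pig compile the plan into MapReduce jobs and execute it.
        pig.store("counts", "wordcount-output");
    }
}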

Flume

Flume is another open source project built on top of Hadoop, which provides a mechanism for ingesting data into HDFS as it is generated.

Sqoop

Sqoop provides a way to import and export data between relational database tables (for example, in SQL Server) and HDFS.

Oozie

Oozie allows the creation of workflows of MapReduce jobs. This will feel familiar to developers who have worked on solutions based on Windows Workflow Foundation and Windows Communication Foundation.

HBase

HBase is the Hadoop database, a NoSQL database. It is another abstraction on top of Hadoop that provides a near real-time query mechanism for HDFS data.
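For illustration, the sketch below writes and reads a single cell through the HBase Java client API; the table name, column family, qualifier, and row key are hypothetical, and the Connection/Table classes reflect the current client API rather than code from this book.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for the ZooKeeper quorum and so on.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "user42", column family "profile", qualifier "name".
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key; point lookups like this are near real time.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}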

Mahout

Mahout is a machine-learning library that contains algorithms for clustering and classification. One major focus of machine-learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.

The following figure gives you a 1,000-foot view of Apache Hadoop and the various supporting projects that form this amazing ecosystem:

We will be exploring some of these components in the subsequent chapters of this book, but for a complete reference, please visit the Apache website http://hadoop.apache.org/.

Setting up this ecosystem along with the required supporting projects can be decidedly non-trivial; in fact, the main drawback of this implementation is the effort needed to set up and administer a Hadoop cluster. This is the main reason that many vendors now offer their own distributions of Hadoop, bundled and delivered as a data processing platform. Using these distributions, enterprises can set up Hadoop clusters in minutes through simplified, user-friendly cluster deployment wizards, and can use various dashboards for monitoring and instrumentation. Some of the present-day distributions are CDH4 from Cloudera, the Hortonworks Data Platform, and Microsoft HDInsight, all of which are quickly gaining popularity. These distributions are outside the scope of this book and won't be covered; please visit the respective websites for detailed information about them.