Microsoft SQL Server 2012 with Hadoop

By: Debarchan Sarkar

Overview of this book

With the explosion of data, the open source Apache Hadoop project is gaining traction, thanks to the huge ecosystem that has arisen around the core functionality of its distributed file system (HDFS) and MapReduce. Having SQL Server talk to Hadoop has become increasingly important because the two are complementary. While petabytes of unstructured data can be stored in Hadoop and take hours to query, terabytes of structured data can be stored in SQL Server 2012 and queried in seconds. This leads to the need to transfer and integrate data between Hadoop and SQL Server.

Microsoft SQL Server 2012 with Hadoop is aimed at SQL Server developers. It will quickly show you how to get Hadoop activated on SQL Server 2012 (it ships with this version). Once this is done, the book will focus on how to manage big data with Hadoop and use Hadoop Hive to query the data. It will also cover topics such as using SQL Server's in-memory functions and using BI tools with big data.

Microsoft SQL Server 2012 with Hadoop focuses on data integration techniques between the relational (SQL Server 2012) and non-relational (Hadoop) worlds. It will walk you through different tools for the bi-directional movement of data, with practical examples. You will learn to use open source connectors such as Sqoop to import and export data between SQL Server 2012 and Hadoop, and to work with leading in-memory BI tools to create ETL solutions using the Hive ODBC driver for your data movement projects. Finally, this book will give you a glimpse of present-day self-service BI tools such as Excel and Power View that consume Hadoop data and provide powerful insights into it.
Table of Contents (12 chapters)

Big Data – what's the big deal?


There's a lot of talk about Big Data. Estimates are that the total amount of digital information in the world is increasing tenfold every five years, with 85 percent of this data coming from new data types, for example, sensors, RFIDs, web logs, and so on. This presents a huge opportunity for businesses that tap into this new data to identify new opportunities and areas for innovation.

However, having a platform that supports the data trend is only a part of today's challenge; you also need to make the data easier for people to access so that they can gain insight and make better decisions. Think about the user experience: with everything we are able to do on the Web, our experiences through social media sites, and the new ways in which we are discovering, sharing, and collaborating, user expectations of business and productivity applications are changing as well.

One of the first questions we should set out to answer is a simple definitional one: how is Big Data different from traditional large data warehouses? International Data Corporation (IDC) has the most broadly accepted way of classifying Big Data, known as the three Vs:

  • Volume: Data volume is exploding. In the last few decades, computing and storage capacity have grown exponentially, driving down hardware and storage costs to near zero and making them a commodity. Current data processing needs are evolving and demand analysis of petabytes and zettabytes of data on industry-standard hardware within minutes, if not seconds.

  • Variety: The variety of data is increasing. It's all getting stored, and nearly 85 percent of new data is unstructured. The data can be in the form of tweets or JSON documents with variable attributes and elements, of which users may want to process only selected ones.

  • Velocity: The velocity of data is speeding up the pace of business. Data capture has become nearly instantaneous, thanks to new customer interaction points and technologies. Real-time analytics is more important than ever. The rate at which data is emitted continues to be far higher than the rate at which it is consumed; coping with the speed of data continues to be a challenge. Think about software that lets you message or type as fast as you can think.
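
To make the Variety point concrete, here is a minimal Python sketch of picking selected attributes out of JSON records whose attributes vary from record to record. The sample records and the `project` helper are hypothetical and exist only for illustration; they do not come from any particular data source or library.

```python
import json

# Hypothetical sample records: each JSON document carries a different set of attributes.
RAW_RECORDS = [
    '{"user": "alice", "text": "Loving the new phone", "lang": "en", "geo": {"lat": 47.6, "lon": -122.3}}',
    '{"user": "bob", "text": "Service was slow today", "retweets": 4}',
    '{"user": "carol", "lang": "fr"}',
]


def project(record_json, wanted):
    """Keep only the wanted attributes; attributes missing from a record become None."""
    record = json.loads(record_json)
    return {name: record.get(name) for name in wanted}


# Every output row has the same three columns, whatever the input record contained.
rows = [project(raw, ("user", "text", "lang")) for raw in RAW_RECORDS]
for row in rows:
    print(row)
```

Each output row has the same fixed set of keys, which is the shape a relational table expects, while attributes such as `geo` or `retweets` that were not asked for are simply ignored.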

Today, every organization finds it difficult to manage and track the right dataset within itself; the challenge is even greater when it needs to look for data that is external to the system. A typical analyst spends too much time searching for the right data from thousands of sources, which adversely impacts productivity. We will move from a world of search to one of discovery, where information is brought to you based on who you are and what you are working on. There has never been such an abundance of externally available and useful information as there is today. The challenge is: how do you discover what is available, and how do you connect to it?

To answer today's types of questions, you need new ways to discover and explore data. By this we mean data that may reside in a number of different domains, such as:

  • Personal data: This is data created by you or by your peers that is relevant for the task at hand.

  • Organizational data: This is data that is maintained and managed across the organization.

  • Community data: This is external data such as curated third-party datasets that are shared in the public domain. Examples include Data.gov, Twitter, Facebook, and so on.

  • World data: This is all the other data that is available on the global stage, for example, data from sensors or logfiles, and for which technologies such as Hadoop for Big Data have emerged.

You can derive much deeper business insights and trends by combining the data you need across personal, organizational, community, and world data. You can connect to and combine data from hundreds of trusted data providers. This includes demographic data, environmental data, financial data, retail and sports data, and social data such as Twitter and Facebook, as well as data cleansing services. You can combine this data with your personal data through self-service tools such as PowerPivot, use reference data to cleanse your corporate data with SQL Server 2012, or use it in your custom applications.

Existing RDBMS solutions such as SQL Server are good at managing challenging volumes of data, but they fall short when the data is unstructured or semi-structured with variable attributes, such as the data discussed previously. The current world seems almost obsessed with social media sentiments, tweets, devices, and so on; without the right tools, your company is adrift in a sea of data. You need the ability to unleash the wave of new value made possible by Big Data. You should be able to easily monitor and manage every bit of data, regardless of type or structure. That's why organizations are moving towards building an end-to-end data platform for nearly all data, with easy-to-use tools to analyze it. Regardless of data type, location (on-premises or in the cloud), or size, you have the power of familiar tools coupled with high-performance technologies to serve your business needs, from data storage and processing all the way to visualization. The benefits of Big Data are not limited to Business Intelligence (BI) experts or data scientists. With the right tools, nearly everyone in your organization can analyze data and make more informed decisions.

In a traditional business environment, the data to power your reporting mechanism will usually come from tables in a database. However, it's increasingly necessary to supplement this with data obtained from outside your organization. This may be commercially available datasets, such as those available from Windows Data Market and elsewhere, or it may be data from less structured sources such as feeds, e-mails, logfiles, and more. You will, in most cases, need to cleanse, validate, and transform this data before loading it into an existing database. Extract, Transform, and Load (ETL) operations can use Big Data solutions to perform pattern matching, data categorization, deduplication, and summary operations on unstructured or semi-structured data to generate data in the familiar rows and columns format that can be imported into a database table. The following figure will give you a conceptual view of Big Data:

Big Data requires some level of machine learning or complex statistical processing to produce insights. If you have to use non-standard techniques to process and host it, it's probably Big Data.
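
As a rough illustration of the ETL-style operations mentioned earlier (pattern matching, validation, deduplication, and summarization of semi-structured data into rows and columns), the following Python sketch processes a handful of log lines. The log format, the sample lines, and the function names are assumptions made for this example; a real Big Data solution would run equivalent logic at scale, for example as a MapReduce job or a Hive query.

```python
import re
from collections import Counter

# Hypothetical log lines; the format is an assumption made for this sketch.
LOG_LINES = [
    "2013-01-05 10:00:01 ERROR disk full on node7",
    "2013-01-05 10:00:01 ERROR disk full on node7",  # exact duplicate
    "2013-01-05 10:02:13 INFO job 42 finished",
    "malformed line without a timestamp",
    "2013-01-06 08:15:00 WARN slow response from node3",
]

LINE_PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (ERROR|WARN|INFO) (.+)$"
)


def extract_rows(lines):
    """Pattern-match each line, dropping malformed lines and exact duplicates."""
    seen = set()
    rows = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match is None:
            continue  # validation: skip lines that do not fit the pattern
        row = match.groups()  # (date, time, level, message)
        if row in seen:
            continue  # deduplication
        seen.add(row)
        rows.append(row)
    return rows


def summarize(rows):
    """Count rows per (date, level), giving tuples that could be loaded into a table."""
    counts = Counter((date, level) for date, _time, level, _message in rows)
    return sorted((date, level, n) for (date, level), n in counts.items())


rows = extract_rows(LOG_LINES)
summary = summarize(rows)
for record in summary:
    print(record)
```

The `summary` list holds fixed-width tuples of date, level, and count, which is the familiar rows-and-columns shape that can be imported into a database table.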

The data store in a Big Data implementation is usually referred to as a NoSQL store, although this is not technically accurate because some implementations do support a SQL-like query language. NoSQL storage is typically much cheaper than relational storage, and usually supports a write-once capability that allows data only to be appended. To update data in these stores, you must drop and recreate the relevant file. This limitation maximizes performance; Big Data storage implementations are usually measured by throughput rather than capacity, because throughput is usually the most significant factor for both storage and query efficiency. The append-only approach also maintains the history of changes to the data.
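
The append-only behavior described above can be sketched in Python using an ordinary local file as a stand-in for a file in a Big Data store. This is only an analogy under that assumption: it does not use the HDFS API, and the helper names `append_records`, `rewrite_with_update`, and `read_all` are hypothetical.

```python
import os
import tempfile


def append_records(path, records):
    """Append-only write: existing content is never modified in place."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(record + "\n")


def rewrite_with_update(path, transform):
    """'Update' by writing a complete new file and swapping it in for the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out, open(path, encoding="utf-8") as src:
        for line in src:
            out.write(transform(line.rstrip("\n")) + "\n")
    os.replace(tmp_path, path)  # the old file is dropped; the new one takes its place


def read_all(path):
    """Return every record in the file as a list of strings."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


workdir = tempfile.mkdtemp()
store = os.path.join(workdir, "events.txt")

append_records(store, ["1,open", "2,open"])
append_records(store, ["3,open"])  # new data is only ever added at the end

# Changing record 2 means recreating the whole file, not editing it in place.
rewrite_with_update(store, lambda rec: "2,closed" if rec.startswith("2,") else rec)
result = read_all(store)
print(result)
```

Appends stay cheap because they never touch existing data, while any change to an existing record costs a full rewrite of the file, which is the trade-off the paragraph above describes.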

Note

However, it is extremely important to note that, in addition to supporting all types of data, moving data between a non-relational store such as Hadoop and a relational data warehouse such as SQL Server is one of the key Big Data customer usage patterns. Throughout this book, we will explore how we can integrate Hadoop and SQL Server and derive powerful visualizations on any data using the SQL Server BI suite.