Hadoop Blueprints

By: Anurag Shrivastava, Tanmay Deshpande

Overview of this book

If you have a basic understanding of Hadoop and want to put your knowledge to use to build fantastic Big Data solutions for business, then this book is for you. Build six real-life, end-to-end solutions using the tools in the Hadoop ecosystem, and take your knowledge of Hadoop to the next level. Start off by understanding various business problems which can be solved using Hadoop. You will also get acquainted with the common architectural patterns which are used to build Hadoop-based solutions. Build a 360-degree view of the customer by working with different types of data, and build an efficient fraud detection system for a financial institution. You will also develop a system in Hadoop to improve the effectiveness of marketing campaigns. Build a churn detection system for a telecom company, develop an Internet of Things (IoT) system to monitor the environment in a factory, and build a data lake – all making use of the concepts and techniques mentioned in this book. The book covers other technologies and frameworks like Apache Spark, Hive, Sqoop, and more, and how they can be used in conjunction with Hadoop. You will be able to try out the solutions explained in the book and use the knowledge gained to extend them further in your own problem space.

Big data use cases

In the previous sections of this chapter, we discussed the design and architecture of Hadoop. Hadoop and its powerful ecosystem of tools provide a strong data platform on which to build data-driven applications. In this section, you will get an overview of the use cases covered in this book. These use cases have been derived from real business problems in various industry sectors, but they have been simplified to fit a chapter in this book. You can read from Chapter 2, A 360-Degree View of the Customer, onwards in any order because each chapter is complete in itself.

Creating a 360-degree view of a customer

A 360-degree view of a customer combines information about a customer's attitude, behavior, and preferences with static data such as date of birth, and presents it as a single integrated view. Call center agents and field sales agents use this information to better understand the customer's needs and to offer better services or sell the right product.

Large financial institutions operating in the retail market have millions of customers. These customers buy one or more financial products from such institutions.

Large enterprises use master data management (MDM) systems to store customer data. The customer data includes key details about the customers such as their name, address, and date of birth. Customer service processes use this information to identify a customer during the processing of service requests. Marketing processes use this information to segment customers for direct mailings. The data stored in the MDM systems remains static for several months, if not for several years, because customer addresses do not change often, and the other data, such as the name or date of birth, remains unchanged throughout the lifetime of the customer. The information stored in the MDM systems is very reliable because it undergoes multiple levels of checks before it is stored in the system. In many cases, the information in the MDM system is taken directly from the customer data form filled in by the customers.

Despite the high quality of data in the MDM systems or other enterprise data stores, the data in these systems does not create a complete view of the customers. The views created from the static data lack information about what is currently happening that might define the nature of the product or service required by the customer. For example, if a customer is deeply unsatisfied with a product, then trying to sell them another accessory will be counterproductive. A 360-degree view should attempt to capture not only information about the products in the customer's possession, but also information about their recent experience with those products.

In Chapter 2, A 360-Degree View of the Customer, we will build a 360-degree view of a customer by combining information available in the enterprise data store with information available via social media. We will use Hadoop as the central data warehouse to create the 360-degree view.
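
As a rough illustration of the idea, and not the implementation built in Chapter 2, the sketch below attaches recent social media mentions to static customer records. The record layout, the customer IDs, and the function name build_customer_view are all assumptions made for this example, and it assumes the mentions have already been matched to a customer ID.

```python
from collections import defaultdict

# Hypothetical static records, as an MDM system might hold them.
master_data = {
    "C001": {"name": "Alice Example", "date_of_birth": "1980-04-12"},
    "C002": {"name": "Bob Example", "date_of_birth": "1975-09-30"},
}

# Hypothetical social media mentions, already matched to a customer ID.
social_mentions = [
    {"customer_id": "C001", "product": "credit card", "sentiment": "negative"},
    {"customer_id": "C001", "product": "savings account", "sentiment": "positive"},
    {"customer_id": "C002", "product": "mortgage", "sentiment": "positive"},
]


def build_customer_view(master, mentions):
    """Return one combined record per customer: static data plus recent mentions."""
    by_customer = defaultdict(list)
    for mention in mentions:
        by_customer[mention["customer_id"]].append(
            {"product": mention["product"], "sentiment": mention["sentiment"]}
        )
    # Customers without any mention get an empty list rather than a missing key.
    return {
        customer_id: {**record, "recent_mentions": by_customer.get(customer_id, [])}
        for customer_id, record in master.items()
    }


views = build_customer_view(master_data, social_mentions)
```

An agent looking at views["C001"] would see the static details together with the negative credit card mention, which is the kind of recent experience the static data alone cannot show.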

Fraud detection systems for banks

According to a report published in Forbes in 2011 (Shaughnessy, 2011), merchants in the United States lose $190 billion per year owing to credit card fraud. Most of this credit card fraud originates from online channels. Banks lose $11 billion to fraud. With the proliferation of digital channels, online fraud is on the rise, and therefore timely fraud detection makes a strong business case for the banks. A solid fraud detection system helps banks in two ways:

  • By reducing financial loss, lowering the risk exposure, and reducing the capital tied to indemnify customers against the fraud

  • By strengthening the safe image of the bank and thereby growing its market share

Spending behavior of bank customers usually follows a pattern that repeats based upon events such as a credit of their salary into the bank account or the payment of utility bills. Any significant deviations from the spending pattern could point to a potential fraudulent activity. Such potential fraudulent activity should trigger an alert so that the bank can take timely action to limit or prevent the financial loss.
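
One simple way to express "significant deviation" is to compare a new transaction amount with the mean and standard deviation of the customer's past amounts. The sketch below is only an illustration under that assumption; the function name is_suspicious, the threshold of three standard deviations, and the sample amounts are hypothetical and are not taken from the system described in this book.

```python
from statistics import mean, stdev


def is_suspicious(history, amount, threshold=3.0):
    """Flag an amount that lies more than `threshold` standard deviations
    away from the mean of the customer's past transaction amounts."""
    if len(history) < 2:
        return False  # not enough history to define a pattern
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu  # all past amounts identical
    return abs(amount - mu) / sigma > threshold


# Hypothetical past card payments for one customer.
past_amounts = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]

normal_check = is_suspicious(past_amounts, 49.0)
large_check = is_suspicious(past_amounts, 900.0)
```

With these sample values, a payment of 49.0 stays within the usual range while a payment of 900.0 is flagged. A production system would look at far more than the amount, for example the location and timing of the transaction.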

In Chapter 3, Building a Fraud Detection System, we will cover a transaction screening system. The goal of this system is to screen every transaction for potential fraud and generate real-time fraud notifications, so that we can block the transaction and alert the victim. A big-data-based fraud detection system uses transaction data and enriches it with other static and location data to predict possible fraud. In this use case, we will focus on real-time fraud detection as opposed to detecting fraud in a historical data set.

A real-time fraud detection system is a very effective way to fight transaction fraud because it can prevent the movement of funds as soon as fraud is detected. This prevention mechanism can be built into the transaction approval process. Batch-based fraud detection also has value because some types of fraud cannot be detected in real time owing to their intensive computing requirements. However, by the time fraud is detected by a batch process, the money might be irrecoverably lost and the criminal might have fled, which is why a real-time fraud detection system is more useful.

Marketing campaign planning

You will be familiar with promotional folders that are delivered to your mailbox by post or with newspapers and magazines. These promotional folders are sent as part of a campaign run by the marketing department of a company. A campaign is typically part of a project with a well-defined objective. Often these objectives are related to the successful sale of a product, or to a customer visiting a store in response to the campaign. The rewards of the employees in the marketing department are linked to the success of the campaign. Promotional campaigns involve a lot of waste because they also reach customers who are unlikely to respond to a promotional folder. Those customers still receive one because there is no way of knowing in advance which customers are the right ones to target.

For example, if you send offers for meat products to a person who is a vegetarian, it is very unlikely that they will result in a sale of your product.

As a result, the promotional folders are sent to everyone.

In Chapter 4, Marketing Campaign Planning, we will build a system to decide which customers are more likely to respond to a promotional folder by using an example of a fictitious company. We will build a predictive model from the historical campaign response data. We will use the predictive model to create a new target list of customers who are more likely to respond to our promotional campaign. The aim of this exercise is to increase the success of marketing campaigns. We will use a tool called BigML to build a predictive model and Hadoop to process the customer data.
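
To make the idea of learning from historical response data concrete, here is a deliberately simple stand-in, not the BigML model used in Chapter 4. It estimates a response rate per customer segment from past campaigns and keeps only new customers whose segment responded often enough. The segment names, the customer IDs, the 0.5 cutoff, and the function names are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical campaign history: (customer segment, responded to the folder?)
history = [
    ("young_family", True), ("young_family", True), ("young_family", False),
    ("student", False), ("student", False), ("student", True),
    ("retired", False), ("retired", False),
]


def response_rates(records):
    """Estimate the fraction of recipients in each segment who responded."""
    sent = Counter(segment for segment, _ in records)
    responded = Counter(segment for segment, did_respond in records if did_respond)
    return {segment: responded[segment] / sent[segment] for segment in sent}


def target_list(customers, rates, min_rate=0.5):
    """Keep customers whose segment responded at least `min_rate` of the time.
    Segments never seen in the history are treated as rate 0.0."""
    return [cid for cid, segment in customers if rates.get(segment, 0.0) >= min_rate]


rates = response_rates(history)
new_customers = [
    ("C101", "young_family"), ("C102", "student"),
    ("C103", "retired"), ("C104", "unknown"),
]
targets = target_list(new_customers, rates)
```

A real predictive model would use many customer attributes rather than a single segment label, but the flow is the same: learn from past responses, then score a new list before the folders are sent.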

Churn detection in telecom

Customer churn, or customer attrition, refers to the loss of clients to competitors. This problem is acute among technology service providers such as Internet service providers, mobile operators, and vendors of software as a service. As a result of customer churn, companies lose a source of revenue. In very competitive markets where vendors outdo each other by slashing prices, the cost of acquiring new customers is much higher than the cost of retaining existing ones. In these saturated markets, with little or no room for growth, customer retention is the only strategy to maintain market share and revenue. A customer churn detection system therefore makes a very compelling business case in the telecom sector.

In the telecom business, a customer might defect to another provider at the end of a contract period. If a telecom company knows in advance which customers are likely to move to a new provider, it can make them a suitable offer that increases the likelihood that they will stay after the end of the existing contract period.

To predict customer churn, we should examine what kind of signal we can derive from the data. Just by examining the static data about the customer, we will not be able to conclude much about an upcoming churn event. Therefore, a churn-detection system should look into customer behavior such as the calling patterns, social interactions and contacts with the call center. All this information, when analyzed properly, can be put to use for building a churn-detection model.

The churn-detection problem is also well suited to large-scale batch analytics, which Hadoop excels at. We can start shortlisting the customers who are likely to churn a few months before the contract end date. Once we have this list, we can target these customers with both inbound and outbound marketing campaigns to increase the chances that they will stay with the company after the end of the existing contract period.
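
The shortlisting step can be pictured as a batch job that runs over all customers. The sketch below uses hand-set rules rather than a trained churn model, purely to show the shape of such a job. The field names, the thresholds, and the 90-day window are hypothetical, and each rule stands in for one of the behavioral signals mentioned above: call center contacts, calling patterns, and social interactions.

```python
from datetime import date, timedelta


def shortlist(customers, today, window_days=90, min_score=2):
    """Return IDs of customers whose contract ends within the window
    and who show at least `min_score` warning signals."""
    cutoff = today + timedelta(days=window_days)
    selected = []
    for customer in customers:
        ends_soon = today <= customer["contract_end"] <= cutoff
        score = 0
        if customer["call_center_contacts"] >= 3:      # frequent complaints or queries
            score += 1
        if customer["monthly_minutes_change"] <= -0.3:  # calling volume dropped by 30% or more
            score += 1
        if customer["contacts_who_left"] >= 2:          # people they call often have churned
            score += 1
        if ends_soon and score >= min_score:
            selected.append(customer["customer_id"])
    return selected


# Hypothetical customer records.
customers = [
    {"customer_id": "A", "contract_end": date(2016, 3, 15),
     "call_center_contacts": 4, "monthly_minutes_change": -0.4, "contacts_who_left": 0},
    {"customer_id": "B", "contract_end": date(2016, 12, 1),
     "call_center_contacts": 5, "monthly_minutes_change": -0.5, "contacts_who_left": 3},
    {"customer_id": "C", "contract_end": date(2016, 2, 1),
     "call_center_contacts": 0, "monthly_minutes_change": 0.1, "contacts_who_left": 0},
]

at_risk = shortlist(customers, today=date(2016, 1, 1))
```

Here only customer A is shortlisted: B shows strong signals but the contract is not ending soon, and C's contract ends soon but the customer shows no warning signals.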

In Chapter 5, Churn Detection, we will build a system to predict customer churn. We will use customer master data, and other master data, to build a customer-churn model. We will use this model to predict which customers are likely to churn. We will use batch processing to generate the list that will be used by inbound sales staff and outbound campaign managers to target customers with tailor-made offers.

Analyzing sensor data

Nowadays sensors are everywhere. GPS sensors are fitted in taxis to track their movement and location. Smartphones carry GPS, temperature, and speed sensors. Even large buildings and factory complexes have thousands of sensors that measure lighting, temperature, and humidity. The sensor data is collected, processed, and analyzed in three distinct steps using a big data system. The first step involves the detection of events that generate data from the sensor. The sensor transports this data using a wired or wireless protocol to a centralized data-storage system. In the second step, the sensor data is stored in the centralized data-storage system after data cleansing, if necessary. The third step involves the analysis and consumption of this data by an application. A system capable of processing sensor data is also called an Internet of Things (IoT) system. Sometimes, however, we might need to analyze the sensor data in near real time. For example, the temperature or the humidity in a large factory complex, if not monitored or controlled in real time, might lead to perished goods or loss of human productivity. In such cases, we need a near real-time data analytics solution. Although the HDFS is suited to batch analytics, other tools in the Hadoop ecosystem can support near real-time analytics very well.

Sensor data usually takes the form of time series, because data sent by a sensor is a measurement at a specific moment in time. This measurement might contain information about temperature, voltage, humidity or some other physical parameter of interest. In the use case covered in Chapter 6, Analyze Sensor Data Using Hadoop, we will use sensor data to build a batch and real-time data analytics system for a factory.
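
Because each reading is a timestamped measurement, near real-time monitoring can be sketched as a rolling average over the most recent readings that raises an alert when it crosses a limit. This is a minimal in-memory illustration, not the system built in Chapter 6; the class name RollingMonitor, the window size, and the 30.0 limit are assumptions chosen for the example.

```python
from collections import deque


class RollingMonitor:
    """Keep the last `window_size` readings and report whether
    their average exceeds `upper_limit`."""

    def __init__(self, window_size, upper_limit):
        self.readings = deque(maxlen=window_size)  # old readings drop off automatically
        self.upper_limit = upper_limit

    def add(self, timestamp, value):
        """Record one (timestamp, value) reading; return True if an alert is due."""
        self.readings.append((timestamp, value))
        average = sum(v for _, v in self.readings) / len(self.readings)
        return average > self.upper_limit


# Hypothetical temperature readings: (seconds since start, degrees Celsius).
stream = [(0, 25.0), (1, 27.0), (2, 29.0), (3, 33.0), (4, 35.0)]

monitor = RollingMonitor(window_size=3, upper_limit=30.0)
alerts = [monitor.add(timestamp, value) for timestamp, value in stream]
```

Averaging over a short window avoids alerting on a single noisy reading, at the cost of reacting a little later when the temperature really does rise.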

Building a data lake

The term data lake has gained popularity in recent years. The main promise of a data lake is to provide access to large volumes of data in a raw form for analytics of an entire enterprise and to introduce agility into the data-warehousing processes.

Data lakes are challenging the traditional enterprise data-warehousing paradigm. Traditional data warehousing is based upon the Extract-Transform-Load (ETL) paradigm. ETL-based data-warehousing processes have a long cycle time because they require a well-defined data model into which the data is loaded. The middle step is called transformation because the data extracted from the operational systems is transformed to fit this model before it is loaded into the enterprise data warehouse. Only when the data has been loaded into the data warehouse does it become available for further analysis.

Hadoop supports the Extract-Load-Transform (ELT) paradigm. The data files can be loaded into the HDFS in their raw format. Loading does not require any knowledge of the data model. Once the data has been loaded into the HDFS, we can use a variety of tools for ad hoc exploration and analytics. The transformation of data to facilitate structured exploration and analytics can continue, but users already have access to the raw data for exploration without a long wait.
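
The difference in ordering can be sketched in a few lines. In the example below, a plain Python list stands in for raw files on the HDFS: the load step stores every line untouched, and a schema is applied only when the data is read. The function names, the field names, and the sample lines are hypothetical and serve only to illustrate the load-first, transform-later idea.

```python
import csv

raw_store = []  # stand-in for raw files kept on the HDFS


def load_raw(text):
    """Load step: keep every line exactly as received, with no data model."""
    raw_store.extend(text.splitlines())


def read_with_schema(lines, fields, types):
    """Transform step, applied at read time: parse each line against a schema
    and skip lines that do not fit it. The raw lines are left unchanged."""
    rows = []
    for record in csv.reader(lines):
        if len(record) != len(fields):
            continue  # wrong number of columns for this schema
        try:
            rows.append({f: convert(v) for f, convert, v in zip(fields, types, record)})
        except ValueError:
            continue  # a value could not be converted to the expected type
    return rows


load_raw(
    "s1,2016-01-01T00:00,21.5\n"
    "s2,2016-01-01T00:00,not_a_number\n"
    "broken line\n"
)

rows = read_with_schema(
    raw_store,
    fields=["sensor", "timestamp", "temperature"],
    types=[str, str, float],
)
```

All three lines remain available in raw form, so a different schema or a cleansing step can be applied later without reloading anything from the source systems.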

The data lake use case opens up new frontiers for businesses because data lakes give a large group of users access to data. The lower cost of data storage in the HDFS makes this an attractive proposition because data with no immediate use does not have to be discarded to save expensive storage space in the enterprise data warehouse. The data lake stores all the data for an enterprise. It offers an opportunity to break down the data silos in enterprises, which have made analysis of cross-departmental data a very slow process owing to interdepartmental politics and the boundaries created by different IT systems.

The data lake use case also raises a new set of questions about data governance and data security. Once all the enterprise data is stored in the data lake, fine-grained control over access to datasets becomes crucial to ensure that data does not get into the wrong hands. A system containing all the enterprise data also becomes a valuable target for hackers.

In Chapter 7, Build a Data Lake, we will build a basic data lake and see how we can keep it secure by using various tools available in the Hadoop ecosystem.