Learning Hunk

By : Dmitry Anoshin, Sergey Sheypak

Overview of this book

Hunk is the big data analytics platform that lets you rapidly explore, analyze, and visualize data in Hadoop and NoSQL data stores. It provides a single, fluid user experience, designed to show you insights from your big data without the need for specialized skills, fixed schemas, or months of development. Hunk goes beyond typical data analysis methods and gives you the power to rapidly detect patterns and find anomalies across petabytes of raw data. This book focuses on exploring, analyzing, and visualizing big data in Hadoop and NoSQL data stores with this full-featured big data analytics platform. You will begin by learning the Hunk architecture and the Hunk virtual index before moving on to how to analyze and visualize data using the Splunk Search Processing Language (SPL). Next you will meet Hunk Apps, which can easily integrate with NoSQL data stores such as MongoDB or Sqrrl. You will also discover Hunk knowledge objects, build a semantic layer on top of Hadoop, and explore data using the friendly user interface of Hunk Pivot. You will connect MongoDB and explore data in the data store. Finally, you will go through report acceleration techniques and analyze data in the AWS Cloud.

Hunk architecture

Let's explore how Hunk looks. From the end user's perspective, Hunk looks like Splunk. You use the same interface, and you can write searches, visualize big data, and create reports, dashboards, and alerts. In other words, Hunk can do everything Splunk can do. In the following screenshot, you can see a schematic of Hunk's architecture:

Hunk has the same interface and command line as well. The only difference is that Splunk works with data stored in native indexes, whereas Hunk's SPL operates on external data; that is why these data sources are called virtual indexes.

Connecting to Hadoop

Hunk is designed to connect to Hadoop via the Hadoop interface. The following screenshot demonstrates that Hunk can connect to a Hadoop cluster via the Hadoop client libraries and Java:

Moreover, Hunk can work with multiple Hadoop clusters:

In addition, you can use Splunk and Hunk together. If you have both Splunk and Hadoop in your environment, you can connect Hunk to Splunk Enterprise. As a result, it is possible to correlate searches across Hadoop and Splunk Enterprise data via the same search head.

Advanced Hunk deployment

Sometimes, organizations have really big data. They have thousands of instances of Hadoop. It is a real challenge to get business insight from such an extremely large volume of data. Hunk can handle this task. It isn't as easy as it sounds, but it is possible because you can scale Hunk deployments. Let's look at the following example:

There are hundreds or thousands of users who put their business questions to big data. Business users send their queries, which go through the Load Balancer (LB). The LB sends them to Hunk, and Hunk distributes the work to Hadoop.

Native versus virtual indexes

Before we start to compare native and virtual indexes, let's use our previous Splunk experience and see how SPL actually works.

For example, we have a query:

index=main | stats count by status | rename count AS qty

As you may remember, the steps of a Splunk search are separated by pipes. You can read the expression from left to right to follow its execution sequence.

Tip

The design of SPL was inspired by Unix shell pipes.

In our example, we:

Get all data from index=main.
Count all rows for every status.
Rename the count field as qty.
Retrieve the final result.

Tip

It is interesting to know that SPL uses a MapReduce-style algorithm. In other words, it has a map phase, which performs the retrieve operations, and a reduce phase, which performs the count operations.

The rule is that the first search command is always responsible for retrieving events.
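
The following Python sketch mirrors the example search step by step. It is only an illustration of the pipeline idea, not how Splunk or Hunk is implemented; the sample events and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical sample events standing in for the contents of index=main.
events = [
    {"status": "200"},
    {"status": "404"},
    {"status": "200"},
    {"status": "500"},
    {"status": "200"},
]

def retrieve(source):
    """First command: retrieve events (map side emits one status per event)."""
    for event in source:
        yield event["status"]

def stats_count_by_status(statuses):
    """stats count by status: reduce the emitted statuses to per-status counts."""
    counts = Counter(statuses)
    return [{"status": s, "count": c} for s, c in sorted(counts.items())]

def rename_count_as_qty(rows):
    """rename count AS qty: rename the count field in every row."""
    return [{"status": r["status"], "qty": r["count"]} for r in rows]

# Read the pipeline inside out, the same way the search reads left to right.
result = rename_count_as_qty(stats_count_by_status(retrieve(events)))
print(result)
```

Each function consumes the output of the previous one, which is the same left-to-right flow that the pipe character expresses in SPL.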

Native indexes

Before Hunk was created, there were only the native indexes of Splunk Enterprise. Data was ingested by Splunk and accessed through the Splunk interface.

A native index is basically a data store or collection of data. We can put web logs, syslogs, or other machine data in Splunk. We have access controls and the ability to give permissions to users to access data on specific indexes. In addition, Splunk gives us the opportunity to optimize popular and heavy searches. As a result, business users will get their dashboards very quickly.

Virtual index

Virtual indexes lack some features of native indexes. A virtual index is a data container with access controls, and Hunk can only read from it. Data gets into Hadoop by some other means, and Hunk treats that data as the container. The inventors of Hunk decided not to build indexes on top of Hadoop data and not to optimize Hunk for needle-in-the-haystack searches. However, if the data layout in Hadoop is properly designed (for example, there is a hierarchical structure or the data is organized by timestamp, year, month, or date), Hunk can skip the directories outside the search's time range, which can really improve search performance.
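
To see why a time-based layout helps, consider this minimal Python sketch of path pruning. It assumes a hypothetical directory layout of /data/weblogs/year/month/day and a hypothetical prune function; it is not Hunk's implementation or configuration syntax, only an illustration of extracting a date from a path with a regular expression and skipping paths outside the search time range.

```python
import re
from datetime import date

# Hypothetical HDFS paths laid out as /data/weblogs/<year>/<month>/<day>/...
paths = [
    "/data/weblogs/2015/01/30/part-00000.log",
    "/data/weblogs/2015/01/31/part-00000.log",
    "/data/weblogs/2015/02/01/part-00000.log",
    "/data/weblogs/2015/02/02/part-00000.log",
]

# Regex that extracts the year, month, and day encoded in the path.
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

def prune(candidates, earliest, latest):
    """Keep only the paths whose embedded date falls inside the time range."""
    kept = []
    for path in candidates:
        match = DATE_IN_PATH.search(path)
        if match is None:
            # No date in the path: it cannot be pruned, so it must be read.
            kept.append(path)
            continue
        year, month, day = map(int, match.groups())
        if earliest <= date(year, month, day) <= latest:
            kept.append(path)
    return kept

selected = prune(paths, date(2015, 1, 31), date(2015, 2, 1))
print(selected)
```

Only two of the four files need to be read for this time range; the other two are never opened.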

Let's compare both indexes in one table:

Native Indexes	Virtual Indexes
Serve as data containers	Serve as data containers
Access control	Access control
Reads/writes	Read only
Data retention policies	N/A
Optimized for keyword search	N/A
Optimized for time range search	Available via regex/pruning

Tip

You can learn more about virtual indexes on the Splunk website: http://docs.splunk.com/Documentation/Hunk/latest/Hunk/Virtualindexes.

External result provider

The core technologies of Hunk are the virtual index and the External Result Provider (ERP). We have already encountered virtual indexes. An ERP is sometimes also called a resource provider.

The ERP is basically a helper process. It goes out and deals with the details of the external system, whether that is Hadoop or another data store. In other words, it takes the searches that users perform in Hunk and translates them into MapReduce jobs. That's how it pushes computation down to the data.

There are a few other implementations of ERP that Splunk's partners developed in order to integrate Hunk with MongoDB, Apache Accumulo, and Cassandra. These are just different implementations of the same interface, which helps Hunk to interact with external systems and use any type of data via virtual indexes.

The following diagram demonstrates how ERP looks:

For each Hadoop cluster (or external system) the search process spawns an ERP process that is responsible for executing the (remote part of the) search on that system.
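
The idea of one interface with several implementations, and one provider per external system, can be sketched in Python as follows. All class, method, and cluster names here are hypothetical and chosen for illustration; they are not part of any Splunk or Hunk API, and real ERPs run as separate processes rather than in-process objects.

```python
from abc import ABC, abstractmethod

class ResultProvider(ABC):
    """Hypothetical common interface that every provider implements."""

    @abstractmethod
    def run_search(self, search: str) -> list:
        """Execute the remote part of a search and return its results."""

class HadoopProvider(ResultProvider):
    def __init__(self, cluster: str):
        self.cluster = cluster

    def run_search(self, search: str) -> list:
        # A real provider would translate the search into a MapReduce job.
        return [f"{self.cluster}: MapReduce job for '{search}'"]

class MongoProvider(ResultProvider):
    def __init__(self, database: str):
        self.database = database

    def run_search(self, search: str) -> list:
        # A real provider would query the data store instead.
        return [f"{self.database}: query for '{search}'"]

def search_head(search: str, providers: list) -> list:
    """Use one provider per external system, then merge the results."""
    merged = []
    for provider in providers:
        merged.extend(provider.run_search(search))
    return merged

results = search_head(
    "stats count by status",
    [HadoopProvider("cluster-a"), HadoopProvider("cluster-b"), MongoProvider("shops")],
)
print(results)
```

Because the search head only depends on the shared interface, adding a new kind of data store means adding a new implementation, not changing the search head.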

Tip

You can learn more about ERP on the Splunk web site: http://docs.splunk.com/Documentation/Hunk/latest/Hunk/Externalresultsproviders.

Computation models

Previously, we considered some challenges in big data analytics and saw how Hunk addresses them. Now we can go deeper in order to understand some of the core advantages of Hunk. Let's start with an easy question: how do we provide interactivity?

There are at least two computational models.

Data streaming

In this approach, data moves from HDFS to the search head. In other words, data is processed in a streaming fashion. As a result, users can start to work with the data, slice and dice it, or visualize it as soon as the first bytes appear. But there is a problem with this process: there is a huge volume of data to move and process.

The primary benefit of this model is a very low response time. The drawback is low throughput.

Data reporting

The second model moves computation to the data. The way to do this is to create and start a MapReduce job to do the processing, monitor the MapReduce job, and, finally, collect the results. Then, merge the results and visualize the data. There is another problem here: feedback arrives late, because the MapReduce job might take a long time. As a result, this approach has high latency and high throughput.

Mixed mode

Both modes have their pros and cons. The most important benefits are low latency, because it gives interactivity, and high throughput, because it gives us the opportunity to process larger datasets. Hunk takes the best from both computational modes.

Let's visualize both modes in order to better understand how they work:

In addition, we consolidate all the modes in the following table, in order to make things clearer:

Streaming	Reporting	Mixed Mode
Pull data from HDFS to search head for processing	Push compute down to data and consume results	Start both streaming and reporting models. Show streaming results until reporting starts to complete
Low latency	High latency	Low latency
Low throughput	High throughput	High throughput
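
The mixed mode in the table can be illustrated with a small Python simulation. This is a sketch under simplifying assumptions, not Hunk's scheduler: the streaming side is a generator of running counts, and the reporting job is assumed to finish after a fixed number of preview steps (the hypothetical reporting_ready_after parameter).

```python
def streaming_previews(chunks):
    """Streaming: yield a running event count after each chunk is pulled."""
    total = 0
    for chunk in chunks:
        total += len(chunk)
        yield total

def mixed_mode(chunks, reporting_ready_after):
    """Show streaming previews until the simulated reporting job completes.

    reporting_ready_after is a hypothetical stand-in for the time a
    MapReduce job needs: the number of preview steps shown before the
    final, complete result replaces them.
    """
    timeline = []
    for step, preview in enumerate(streaming_previews(chunks), start=1):
        if step > reporting_ready_after:
            break  # reporting results are ready; stop streaming
        timeline.append(("preview", preview))
    # The reporting side processed the whole dataset where it is stored.
    timeline.append(("final", sum(len(chunk) for chunk in chunks)))
    return timeline

chunks = [["e1", "e2"], ["e3"], ["e4", "e5", "e6"], ["e7"]]
timeline = mixed_mode(chunks, reporting_ready_after=2)
print(timeline)
```

The user sees partial counts almost immediately (low latency), and the complete count arrives from the reporting side without all data being pulled to the search head (high throughput).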

Hunk security

With version 6.1, Hunk became more secure. By default, Hunk has superusers with full access. However, organizations very often want to apply a security model to their corporate data in order to keep it safe. Hunk can use pass-through authentication. It makes it possible to control which Hadoop users MapReduce jobs are submitted as and which HDFS files they can access. In addition, it is possible to specify the queue that MapReduce jobs should use.

Pass-through authentication gives us the capability to make the Hunk superuser a proxy for any number of configured Hunk users. As a result, Hunk users can act as Hadoop users to own the associated jobs, tasks, and files in Hadoop, and access to files in HDFS can be limited accordingly. Let's look at the following diagram:

Let's explore some common use cases that can help us understand how it works.

One Hunk user to one Hadoop user

For example, say we want our Hunk user to act as a Hadoop user associated with a specific queue or data set. Then we just map the Hunk user to a specific user in Hadoop. For example, in Hunk the user name is Hemal, but in Hadoop it is HemalDesai and the queue is Books.

Many Hunk users to one Hadoop user

For example, say we have many Hunk users and want them all to act as a single Hadoop user. In Hunk, we can have several users such as Dmitry, Hemal, and Sergey, but in Hadoop they will all execute as the Executive user and will be assigned to the Books queue.

Hunk user(s) to the same Hadoop user with different queues

For example, say you have many Hunk users mapped to the same Hadoop user; it is still possible to assign them different queues.
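
The three use cases above boil down to a lookup from a Hunk user to a Hadoop user and a queue. The following Python sketch models them as plain dictionaries. It is only an illustration of the mapping logic, not Hunk's configuration format; the Reports queue and the resolve function are hypothetical, while the user and queue names in the first two mappings come from the examples above.

```python
# One Hunk user to one Hadoop user.
ONE_TO_ONE = {"Hemal": ("HemalDesai", "Books")}

# Many Hunk users to one Hadoop user, all on the same queue.
MANY_TO_ONE = {name: ("Executive", "Books") for name in ("Dmitry", "Hemal", "Sergey")}

# Same Hadoop user, different queues ("Reports" is a hypothetical queue name).
DIFFERENT_QUEUES = {
    "Dmitry": ("Executive", "Books"),
    "Sergey": ("Executive", "Reports"),
}

def resolve(mapping, hunk_user):
    """Return the (Hadoop user, queue) pair that a Hunk user's jobs run as."""
    if hunk_user not in mapping:
        raise PermissionError(f"No Hadoop mapping configured for {hunk_user!r}")
    return mapping[hunk_user]

print(resolve(ONE_TO_ONE, "Hemal"))
print(resolve(MANY_TO_ONE, "Sergey"))
print(resolve(DIFFERENT_QUEUES, "Sergey"))
```

Raising an error for an unmapped user is a design choice in this sketch: it is safer to reject a search than to silently fall back to a more privileged account.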

Security will be discussed more closely in Chapter 3, Meet Hunk Features.
