Big data use case


We are going to use several data sources for our applications. The first involves RDBMS interaction, since it is one of the most popular use cases. We will show you how to integrate classical RDBMS data, which has a strict schema, with a small dictionary stored in HDFS. In real life the RDBMS data has a big volume (we only have a small subset of it here), while the dictionary keeps the dimensions for the RDBMS values. We will enrich the big data with these dimensions before displaying it on a map.

Importing data from RDBMS to Hadoop using Sqoop

There are many ways to import data into Hadoop; we could easily publish a book called "1,001 methods to put your data into Hadoop." We are not going to focus on these specialized approaches and will use a very simple case. Why so simple? Because you will meet many problems in a production environment, and we can't cover them all in one book.

An example from real life: you will definitely need to import data from your existing DWH into Hadoop, and you will have to use Sqoop in conjunction with the special Teradata/Oracle connectors to do it quickly and without DWH performance penalties. You will spend some time tuning your DB storage schema and connection properties to achieve a reasonable result. That is why we decided to keep all this tricky stuff out of the book; our goal is to use Hunk on top of Hadoop.

Here is a short explanation of the import process. We've split the diagram into three parts:

  • MySQL, a database that stores data

  • Oozie, responsible for triggering the import process

  • Hadoop, responsible for getting data from MySQL DB and storing it in an HDFS directory

Initially, the data is stored in MySQL DB. We want to copy the data to HDFS for later processing using Hunk. We will work with this telco dataset again in Chapter 6, Discovering Hunk Integration Apps, which is related to custom application development. We write an Oozie coordinator that is started by the Oozie server each night. Oozie is a kind of cron for Hadoop, with some additional features that help us work with data. Oozie can do many useful things, but right now we are using its basic functionality: running the workflow once a day. The coordinator code is here: https://github.com/seregasheypak/learning-hunk/blob/master/import-milano-cdr/src/main/resources/oozie/import-milano-cdr-coord.xml.

Next is the workflow. The coordinator is responsible for scheduling the workflow; the workflow is responsible for executing the business logic. The workflow code is here: https://github.com/seregasheypak/learning-hunk/blob/master/import-milano-cdr/src/main/resources/oozie/workflows/import-milano-cdr-wflow.xml.

The workflow has one Sqoop action.

Next is the Sqoop action. This action declares how the job should read data from the RDBMS and store it in HDFS.

The third part is the MapReduce job that reads data from the RDBMS and writes it to HDFS. Internally, the Sqoop action runs a MapReduce job that is responsible for getting the table rows out of MySQL. The whole process sounds pretty complex, but you don't have to worry: we've already created the code that implements it.
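To give you a feel for what the Sqoop action does under the hood, here is a rough command-line equivalent. This is only a sketch: the JDBC URL, database name, credentials, and mapper settings are our assumptions for illustration, while the real parameters live in the workflow XML linked above.

# Illustrative standalone equivalent of the workflow's Sqoop action
# (the JDBC URL, database name, credentials, and mapper count are assumptions)
sqoop import \
  --connect jdbc:mysql://localhost:3306/telco \
  --username devops --password devops \
  --table 2013_12_milan_cdr \
  --target-dir /masterdata/stream/milano_cdr/2013/12/01 \
  --as-avrodatafile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --split-by time_interval \
  --num-mappers 4

Sqoop turns such a declaration into a MapReduce job whose mappers each read a slice of the table, which is exactly what the workflow action does for us on a schedule.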

Telecommunications – SMS, Call, and Internet dataset from dandelion.eu

We are going to use several open datasets from https://dandelion.eu. A one-week dataset was uploaded to MySQL; it contains information about telecommunication activity in the city of Milano. Later, you will use an Oozie coordinator with the Sqoop action to create a daily partitioned dataset.

The source dataset is: https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/resource/ and the grid map is: https://dandelion.eu/datagems/SpazioDati/milano-grid/resource/.

Milano grid map

Milano is divided into equal squares. Each square has a unique ID and the longitude and latitude coordinates of its four corners.

This mapping between the logical square mesh and the spatial area will be helpful during geospatial analysis. We will demonstrate how Hunk can handle geospatial visualizations out of the box.

CDR aggregated data import process

We've prepared an Oozie coordinator to import data from MySQL to HDFS. Generally, it looks like a production-ready process. Real-life processes are organized in pretty much the same way. The following describes the idea behind the import process:

We have the potentially huge table 2013_12_milan_cdr with time-series data. We are not going to import the whole table in one go; we will partition data using a time field named time_interval.
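If you want to peek at the source table before importing it, you can query MySQL directly from the VM console. The database name and credentials below are assumptions; substitute the ones configured in your VM.

# Inspect the source table in MySQL (database name and credentials are assumptions)
mysql -u devops -p telco -e "DESCRIBE 2013_12_milan_cdr; SELECT COUNT(*) FROM 2013_12_milan_cdr;"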

The idea is to split the data into equal time periods and import it to Hadoop. It's just a projection of RDBMS partitioning/sharding techniques onto Hadoop. You'll see seven folders, named from /masterdata/stream/milano_cdr/2013/12/01 to /masterdata/stream/milano_cdr/2013/12/07.

You can get the workflow code here: https://github.com/seregasheypak/learning-hunk/blob/master/import-milano-cdr/src/main/resources/oozie/workflows/import-milano-cdr-wflow.xml.

The general idea is to:

  • Run the workflow each day

  • Import the data for the whole day

We've applied a dummy MySQL date function; in real life, you would use the OraOop connector, Teradata connector, or some other tricks to play with the partition properties.
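As a sketch of how such a daily split can be expressed, here is a free-form Sqoop query that filters on time_interval. The day boundaries and the assumption that time_interval can be compared with datetime literals are ours, as are the JDBC URL and credentials; the workflow linked above parametrizes this differently.

# Illustrative daily import using a free-form query
# ($CONDITIONS is required by Sqoop; boundaries, JDBC URL, and credentials are assumptions)
DAY_START='2013-12-01 00:00:00'
DAY_END='2013-12-02 00:00:00'
sqoop import \
  --connect jdbc:mysql://localhost:3306/telco \
  --username devops --password devops \
  --query "SELECT * FROM 2013_12_milan_cdr WHERE time_interval >= '${DAY_START}' AND time_interval < '${DAY_END}' AND \$CONDITIONS" \
  --split-by time_interval \
  --target-dir /masterdata/stream/milano_cdr/2013/12/01 \
  --as-avrodatafile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec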

Periodical data import from MySQL using Sqoop and Oozie

To run the import process, you have to open the console application inside a VM window, go to the directory with the coordinator configuration, and submit it to Oozie:

cd /home/devops/oozie-configs
sudo -u hdfs oozie job -oozie http://localhost/oozie -config import-milano-cdr-coord.properties -run

The console output will be:

job: 0000000-150302102722384-oozie-oozi-C

Where 0000000-150302102722384-oozie-oozi-C is the unique ID of the running coordinator. We can visit Hue and watch the progress of the import process at http://localhost:8888/oozie/list_oozie_coordinators/ or http://vm-cluster-node3.localdomain:8888/oozie/list_oozie_coordinators/:

Here is the running coordinator. It took two minutes to import one day. There are seven days in total. We used a powerful PC (32 GB memory and an 8-core AMD CPU) to accomplish this task.
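If you prefer the console to Hue, you can also poll the coordinator with the Oozie client, using the job ID printed at submission time:

# Check the status of the coordinator and its materialized daily actions
sudo -u hdfs oozie job -oozie http://localhost/oozie -info 0000000-150302102722384-oozie-oozi-C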

Tip

Downloading the example code

You can download the example code from http://www.bigdatapath.com/wp-content/uploads/2015/05/learning-hunk-05-with-mongo.zip.

The following screenshot shows how the successful result should look:

We can see that the coordinator produced seven actions, one for each day from December 1st to December 7th.

You can use the console application to execute this command:

hadoop fs -du -h /masterdata/stream/milano_cdr/2013/12

The output should be as follows (the first column is the file size, the second the space consumed on disk including replication):

74.1 M   222.4 M  /masterdata/stream/milano_cdr/2013/12/01
97.0 M   291.1 M  /masterdata/stream/milano_cdr/2013/12/02
100.4 M  301.3 M  /masterdata/stream/milano_cdr/2013/12/03
100.4 M  301.1 M  /masterdata/stream/milano_cdr/2013/12/04
100.6 M  301.8 M  /masterdata/stream/milano_cdr/2013/12/05
100.6 M  301.7 M  /masterdata/stream/milano_cdr/2013/12/06
89.2 M   267.6 M  /masterdata/stream/milano_cdr/2013/12/07

The target format is Avro with snappy compression. We'll see later how Hunk works with popular storage formats and compression codecs. Avro is a reasonable choice: it has wide support across Hadoop tools and has a schema.
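A quick way to confirm the format is to copy one of the imported files out of HDFS and inspect it with avro-tools. The part file name and the avro-tools jar version below are assumptions; adjust them to whatever is present on your VM.

# Inspect the schema of one imported file (file name and jar version are assumptions)
hadoop fs -get /masterdata/stream/milano_cdr/2013/12/01/part-m-00000.avro .
java -jar avro-tools-1.7.7.jar getschema part-m-00000.avro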

It's possible to skip the import process; you can move the data to the target destination with a couple of commands. Open the console application and execute:

sudo -u hdfs hadoop fs -mkdir -p /masterdata/stream/milano_cdr/2013
sudo -u hdfs hadoop fs -mv /backup/milano_cdr/2013/12 /masterdata/stream/milano_cdr/2013/

Problems to solve

We have a week-long dataset with 10-minute time intervals covering various subscriber activities. We are going to explore how these activities change across several dimensions: part of the day, city area, and type of activity. We will build a subscriber dynamics map using the imported data.