Big data use case


We are going to use several data sources for our applications. The first involves RDBMS interaction, since it is one of the most popular use cases. We will show you how to integrate classical RDBMS data, which has a strict schema, with a small dictionary stored in HDFS. In real life the RDBMS data has a big volume (we only have a small subset of it here), while the dictionary keeps the dimensions for the RDBMS values. We will enrich the big data with these dimensions before displaying it on a map.

Importing data from RDBMS to Hadoop using Sqoop

There are many ways to import data into Hadoop; we could easily publish a book called "1,001 methods to put your data into Hadoop." We are not going to focus on these specialized approaches and will use a very simple case. Why so simple? Because you will meet many problems in a production environment, and we can't cover them all in one book.

An example from real life: you will definitely need to import data from your existing DWH into Hadoop, and you will have to use Sqoop in conjunction with the special Teradata/Oracle connectors to do it quickly and without DWH performance penalties. You will spend some time tuning your DB storage schema and connection properties to achieve a reasonable result. That is why we decided to keep all this tricky stuff out of the book; our goal is to use Hunk on top of Hadoop.

Here is a short explanation of the import process. We've split the diagram into three parts:

  • MySQL, a database that stores data

  • Oozie, responsible for triggering the import process

  • Hadoop, responsible for getting data from MySQL DB and storing it in an HDFS directory

Initially, the data is stored in MySQL DB. We want to copy the data to HDFS for later processing using Hunk. We will work with this telco dataset again in Chapter 6, Discovering Hunk Integration Apps, which is related to custom application development. We write an Oozie coordinator that is started by the Oozie server each night. Oozie is a kind of cron for Hadoop, with some additional features that help us work with data. Oozie can do many useful things, but right now we are using its basic functionality: running the workflow once a day. The coordinator code is here: https://github.com/seregasheypak/learning-hunk/blob/master/import-milano-cdr/src/main/resources/oozie/import-milano-cdr-coord.xml.

Next is the workflow. The coordinator is responsible for scheduling the workflow; the workflow is responsible for executing the business logic. The workflow code is here: https://github.com/seregasheypak/learning-hunk/blob/master/import-milano-cdr/src/main/resources/oozie/workflows/import-milano-cdr-wflow.xml.

The workflow has one Sqoop action.

Next is the Sqoop action. This action declares how the job should read data from the RDBMS and store it in HDFS.

The third part is the MapReduce job that reads data from the RDBMS and writes it to HDFS. Internally, the Sqoop action runs a MapReduce job that is responsible for getting the table rows out of MySQL. The whole process sounds pretty complex, but you don't have to worry: we've already created the code that implements it.
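To give you a feel for what the Sqoop action does under the hood, here is a rough command-line equivalent. This is only a sketch: the JDBC URL, database name, credentials, and mapper settings are our assumptions for illustration, while the real parameters live in the workflow XML linked above.

# Illustrative standalone equivalent of the workflow's Sqoop action
# (the JDBC URL, database name, credentials, and mapper count are assumptions)
sqoop import \
  --connect jdbc:mysql://localhost:3306/telco \
  --username devops --password devops \
  --table 2013_12_milan_cdr \
  --target-dir /masterdata/stream/milano_cdr/2013/12/01 \
  --as-avrodatafile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
  --split-by time_interval \
  --num-mappers 4

Sqoop turns such a declaration into a MapReduce job whose mappers each read a slice of the table, which is exactly what the workflow action does for us on a schedule.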

Telecommunications – SMS, Call, and Internet dataset from dandelion.eu

We are going to use several open datasets from https://dandelion.eu. A one-week dataset was uploaded to MySQL; it contains information about telecommunication activity in the city of Milano. Later, you will use an Oozie coordinator with the Sqoop action to create a daily partitioned dataset.

The source dataset is: https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/resource/ and the grid map is: https://dandelion.eu/datagems/SpazioDati/milano-grid/resource/.

Milano grid map

Milano is divided into equal squares. Each square has a unique ID and the longitude and latitude coordinates of its four corners.

This mapping between the logical square mesh and the spatial area will be helpful during geospatial analysis. We will demonstrate how Hunk can handle geospatial visualizations out of the box.

CDR aggregated data import process

We've prepared an Oozie coordinator to import data from MySQL to HDFS. Generally, it looks like a production-ready process. Real-life processes are organized in pretty much the same way. The following describes the idea behind the import process:

We have the potentially huge table 2013_12_milan_cdr with time-series data. We are not going to import the whole table in one go; we will partition data using a time field named time_interval.
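If you want to peek at the source table before importing it, you can query MySQL directly from the VM console. The database name and credentials below are assumptions; substitute the ones configured in your VM.

# Inspect the source table in MySQL (database name and credentials are assumptions)
mysql -u devops -p telco -e "DESCRIBE 2013_12_milan_cdr; SELECT COUNT(*) FROM 2013_12_milan_cdr;"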

The idea is to split the data into equal time periods and import it to Hadoop. It's just a projection of RDBMS partitioning/sharding techniques onto Hadoop. You'll see seven folders, named from /masterdata/stream/milano_cdr/2013/12/01 to /masterdata/stream/milano_cdr/2013/12/07.

You can get the workflow code here: https://github.com/seregasheypak/learning-hunk/blob/master/import-milano-cdr/src/main/resources/oozie/workflows/import-milano-cdr-wflow.xml.

The general idea is to:

  • Run the workflow each day

  • Import the data for the whole day

We've applied a dummy MySQL date function; in real life, you would use the OraOop connector, Teradata connector, or some other tricks to play with the partition properties.
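As a sketch of how such a daily split can be expressed, here is a free-form Sqoop query that filters on time_interval. The day boundaries and the assumption that time_interval can be compared with datetime literals are ours, as are the JDBC URL and credentials; the workflow linked above parametrizes this differently.

# Illustrative daily import using a free-form query
# ($CONDITIONS is required by Sqoop; boundaries, JDBC URL, and credentials are assumptions)
DAY_START='2013-12-01 00:00:00'
DAY_END='2013-12-02 00:00:00'
sqoop import \
  --connect jdbc:mysql://localhost:3306/telco \
  --username devops --password devops \
  --query "SELECT * FROM 2013_12_milan_cdr WHERE time_interval >= '${DAY_START}' AND time_interval < '${DAY_END}' AND \$CONDITIONS" \
  --split-by time_interval \
  --target-dir /masterdata/stream/milano_cdr/2013/12/01 \
  --as-avrodatafile \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec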

Periodical data import from MySQL using Sqoop and Oozie

To run the import process, you have to open the console application inside a VM window, go to the directory with the coordinator configuration, and submit it to Oozie:

cd /home/devops/oozie-configs
sudo -u hdfs oozie job -oozie http://localhost/oozie -config import-milano-cdr-coord.properties -run

The console output will be:

job: 0000000-150302102722384-oozie-oozi-C

Where 0000000-150302102722384-oozie-oozi-C is the unique ID of the running coordinator. We can visit Hue and watch the progress of the import process at http://localhost:8888/oozie/list_oozie_coordinators/ or http://vm-cluster-node3.localdomain:8888/oozie/list_oozie_coordinators/:

Here is the running coordinator. It took two minutes to import one day. There are seven days in total. We used a powerful PC (32 GB memory and an 8-core AMD CPU) to accomplish this task.
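If you prefer the console to Hue, you can also poll the coordinator with the Oozie client, using the job ID printed at submission time:

# Check the status of the coordinator and its materialized daily actions
sudo -u hdfs oozie job -oozie http://localhost/oozie -info 0000000-150302102722384-oozie-oozi-C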

Tip

Downloading the example code

You can download the example code from http://www.bigdatapath.com/wp-content/uploads/2015/05/learning-hunk-05-with-mongo.zip.

The following screenshot shows how the successful result should look:

We can see that the coordinator produced seven actions, one for each day from December 1st to December 7th.

You can use the console application to execute this command:

hadoop fs -du -h /masterdata/stream/milano_cdr/2013/12

The output should be as follows (the first column is the file size, the second the space consumed on disk including replication):

74.1 M   222.4 M  /masterdata/stream/milano_cdr/2013/12/01
97.0 M   291.1 M  /masterdata/stream/milano_cdr/2013/12/02
100.4 M  301.3 M  /masterdata/stream/milano_cdr/2013/12/03
100.4 M  301.1 M  /masterdata/stream/milano_cdr/2013/12/04
100.6 M  301.8 M  /masterdata/stream/milano_cdr/2013/12/05
100.6 M  301.7 M  /masterdata/stream/milano_cdr/2013/12/06
89.2 M   267.6 M  /masterdata/stream/milano_cdr/2013/12/07

The target format is Avro with snappy compression. We'll see later how Hunk works with popular storage formats and compression codecs. Avro is a reasonable choice: it has wide support across Hadoop tools and has a schema.
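A quick way to confirm the format is to copy one of the imported files out of HDFS and inspect it with avro-tools. The part file name and the avro-tools jar version below are assumptions; adjust them to whatever is present on your VM.

# Inspect the schema of one imported file (file name and jar version are assumptions)
hadoop fs -get /masterdata/stream/milano_cdr/2013/12/01/part-m-00000.avro .
java -jar avro-tools-1.7.7.jar getschema part-m-00000.avro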

It's possible to skip the import process; you can move the data to the target destination with a couple of commands. Open the console application and execute:

sudo -u hdfs hadoop fs -mkdir -p /masterdata/stream/milano_cdr/2013
sudo -u hdfs hadoop fs -mv /backup/milano_cdr/2013/12 /masterdata/stream/milano_cdr/2013/

Problems to solve

We have a week-long dataset with 10-minute time intervals covering various subscriber activities. We are going to explore how these activities change across several dimensions: part of the day, city area, and type of activity. We will build a subscriber dynamics map using the imported data.