The next step on the journey to Big Data is to understand the levels and layers of abstraction, and the components around them. The following figure depicts some common components of Big Data analytical stacks and their integration with each other. The caveat here is that, in most cases, HDFS/Hadoop forms the core of Big-Data-centric applications, but that is not a universal rule of thumb.
In generic terms, the components of Big Data are as follows:
Hadoop is used very widely because:
Now that we have skimmed through the Big Data technology stack and the components, the next step is to go through the generic architecture for analytical applications.
We will continue the discussion with reference to the following figure:
If you look at the diagram, there are four steps in the workflow of an analytical application, which in turn drive its design and architecture:
Now, let's dive deeper into each segment to understand how it works.
This is the first and most important step for any application. This is where the application architects and designers identify and decide upon the data sources that will provide the input data to the application for analytics. The data could come from a client dataset, a third party, or some kind of static/dimensional data (such as geo coordinates, postal codes, and so on). While designing the solution, the input data can be segmented into business-process-related data, business-solution-related data, or data for technical process building. Once the datasets are identified, let's move to the next step.
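As an illustration, such a segmented catalogue of data sources might look like the following minimal sketch; the source names, URIs, and segment labels are hypothetical placeholders, not a prescribed schema:

```python
# Hypothetical catalogue of input data sources, segmented by role.
# All names and URIs below are illustrative assumptions.
DATA_SOURCES = {
    "business_process": [
        {"name": "card_transactions", "uri": "hdfs:///data/txn/", "format": "csv"},
    ],
    "business_solution": [
        {"name": "customer_profiles", "uri": "jdbc:postgresql://crm/customers"},
    ],
    "technical_static": [
        {"name": "postal_codes", "uri": "hdfs:///ref/postal_codes.csv"},
        {"name": "geo_coordinates", "uri": "hdfs:///ref/geo.csv"},
    ],
}

def sources_for(segment: str):
    """Return the configured sources for a given segment."""
    return DATA_SOURCES.get(segment, [])

if __name__ == "__main__":
    for src in sources_for("technical_static"):
        print(src["name"], "->", src["uri"])
```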
By now, we understand the business use case and the dataset(s) associated with it. The next steps are data ingestion and processing. It's not quite that simple: we may want to make use of an ingestion process, and, more often than not, architects end up creating an ETL (short for Extract, Transform, Load) pipeline. During the ETL step, filtering is executed so that we apply processing only to meaningful and relevant data. This filtering step is very important: this is where we attempt to reduce the volume so that we only have to analyze meaningful/valued data, and thus handle the velocity and veracity aspects. Once the data is filtered, the next step could be integration, where the filtered data from various sources reaches the landing data mart. The next step is transformation, where the data is converted to an entity-driven form, for instance, a Hive table, JSON, a POJO, and so on, which marks the completion of the ETL step. This makes the data ingested into the system available for actual processing.
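To make that flow concrete, here is a minimal sketch of the extract, filter, and transform stages; the record fields and the relevance rule are assumptions for illustration, not the book's own pipeline:

```python
import json

def extract(raw_lines):
    """Extract: parse raw CSV-like lines into records (volume still high)."""
    for line in raw_lines:
        txn_id, amount, currency = line.strip().split(",")
        yield {"txn_id": txn_id, "amount": float(amount), "currency": currency}

def filter_relevant(records):
    """Filter: keep only meaningful records to reduce volume early."""
    for rec in records:
        if rec["amount"] > 0:  # hypothetical relevance rule
            yield rec

def transform(records):
    """Transform: convert filtered records to an entity-driven form (JSON)."""
    for rec in records:
        yield json.dumps(rec)

if __name__ == "__main__":
    raw = ["t1,120.50,USD", "t2,-3.00,USD", "t3,99.99,EUR"]
    for doc in transform(filter_relevant(extract(raw))):
        print(doc)  # ingested, ready for the landing data mart
```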
Depending upon the use case and the duration for which a given dataset is to be analyzed, it is loaded into the analytical data mart. For instance, my landing data mart may hold a year's worth of credit card transactions, but I need only one day's worth of data for analytics. Then I would have a year's worth of data in the landing mart, but only one day's worth in the analytics mart. This segregation is extremely important because it helps in figuring out where I need real-time compute capability and which data my deep learning application operates upon.
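A minimal sketch of that segregation, assuming date-keyed transaction records (the field names and values are made up):

```python
from datetime import date

# Hypothetical landing mart: a year of transactions keyed by day.
landing_mart = [
    {"txn_id": "t1", "day": date(2015, 6, 1), "amount": 120.50},
    {"txn_id": "t2", "day": date(2015, 6, 1), "amount": 99.99},
    {"txn_id": "t3", "day": date(2015, 12, 24), "amount": 10.00},
]

def load_analytics_mart(landing, target_day):
    """Copy only the slice needed for analytics into the analytics mart."""
    return [rec for rec in landing if rec["day"] == target_day]

analytics_mart = load_analytics_mart(landing_mart, date(2015, 6, 1))
print(len(landing_mart), "records in landing,", len(analytics_mart), "in analytics")
```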
Now, we will implement various aspects of the solution and integrate them with the appropriate data mart. We can have the following:
Once the data is analyzed, the next and most important step in the life cycle of any application is the presentation/visualization of the results. Depending upon the target business audience, the data visualization can be achieved using a custom-built UI presentation layer, business insight reports, dashboards, charts, graphs, and so on.
The requirement could vary from auto-refreshing UI widgets to canned reports and ad hoc queries.
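As a minimal illustration of a presentation-layer artifact, assuming matplotlib is available and using made-up daily totals:

```python
import matplotlib.pyplot as plt

# Hypothetical output of the analytics mart: total spend per day.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
totals = [1200.5, 980.0, 1430.25, 1110.0, 1675.75]

# A simple bar chart standing in for a dashboard widget.
plt.bar(days, totals)
plt.title("Daily transaction volume")
plt.xlabel("Day")
plt.ylabel("Total amount")
plt.savefig("daily_volume.png")  # a canned-report style artifact
```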