Log management for banking


Today, many banks around the world are computerizing and automating their business processes to save costs and improve efficiency. This move requires a bank to build various applications that can support complex banking use cases. These applications need to interact with each other over standardized communication protocols. A typical enterprise banking landscape consists of software for core banking, CMS, credit card management, B2B portals, treasury management, HRMS, ERP, CRM, business warehouses, accounting, BI tools, analytics, custom applications, and various other enterprise applications, all working together to ensure smooth business processes. Each of these applications works with sensitive data; hence, a good banking system landscape provides a high-performance, highly available, and scalable architecture, along with backup and recovery features, bringing a completely diversified set of software together in a secured environment.

Most banks today offer web-based interactions; they not only automate their own business processes, but also access third-party software from other banks and vendors. A dedicated team of administrators works 24/7 to monitor and handle issues, failures, and escalations. A simple application that transfers money from your savings account to a loan account may touch at least twenty different applications. These systems generate terabytes of data every day, including transactional data, change logs, and so on.

The problem

The problem arises when any business workflow/transaction fails. With such a complex system, it becomes a big task for system administrators/managers to:

  • Find out the issue or the application that has caused the failure

  • Try to understand the issue and find out the root cause

  • Correlate the issue with other applications

  • Keep monitoring the workflow

When multiple applications are involved, log management across these applications becomes difficult. Some of the applications provide their own administration and monitoring capabilities. However, it makes sense to have a consolidated place where everything can be seen at a glance.

How can it be tackled?

Log management is one of the standard problems where Big Data Search can effectively play a role. Apache Hadoop, along with Apache Solr, can provide a completely distributed environment to manage the logs of multiple applications and to search them effectively. Take a look at this representation of a sample log management application user interface:

This sample UI gives us a consolidated log management screen, which may also be transformed into a dashboard that shows the status and the log details. The following reasons explain why Apache Solr and Hadoop-based Big Data Search are the right solution for this problem:

  • The logs generated by any banking application are huge in volume and continuous. Most log-based systems use rotational log management, which cleans up old logs. Given that Apache Hadoop can run on commodity hardware, the overall cost of storing these logs is low, and they can remain in Hadoop storage for a longer time.

  • Apache Solr is capable of indexing documents of almost any schema, so common fields, such as log descriptions, severity levels, and others, can be consolidated easily.

  • Apache Solr is fast, and its efficient searching capabilities provide interesting features, such as highlighting matched text or showing snippets of matched results. It also provides faceted search to drill down and filter results, thereby providing a better browsing experience.

  • Apache Solr provides near real-time search capabilities that make logs searchable immediately, so that administrators can see the latest high-severity, alarming logs as they arrive (see the configuration sketch after this list).

  • A solution built on Apache Hadoop and Solr provides a low-cost infrastructure that is also capable of high-speed batch processing of data.
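
Near real-time visibility is typically tuned through Solr's commit settings in solrconfig.xml. The following fragment is a minimal sketch; the one-second soft-commit and one-minute hard-commit intervals are assumed illustrative values, not figures from this chapter:

    <!-- solrconfig.xml: illustrative near real-time commit settings -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Soft commit: makes newly added log documents searchable quickly
           without flushing index segments to stable storage -->
      <autoSoftCommit>
        <maxTime>1000</maxTime>
      </autoSoftCommit>
      <!-- Hard commit: durably writes index changes; openSearcher=false keeps
           it cheap because visibility already comes from the soft commits -->
      <autoCommit>
        <maxTime>60000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
    </updateHandler>

Shorter soft-commit intervals make logs visible sooner at the price of more frequent index reopens, so the interval should be balanced against the indexing load.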

High-level design

The overall design, as shown in the following diagram, can have a schema that contains attributes common to all the log files, such as the date and time of the log, severity, application name, user name, type of log, and so on. Other attributes can be added as dynamic text fields:
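
In Solr's schema.xml, such a design might look like the following minimal sketch; the field names here are assumptions chosen to match the attributes above, not a schema prescribed by this chapter:

    <!-- schema.xml: a consolidated log schema (field names are assumed) -->
    <fields>
      <field name="id"        type="string"       indexed="true" stored="true" required="true"/>
      <field name="log_time"  type="tdate"        indexed="true" stored="true"/>
      <field name="severity"  type="string"       indexed="true" stored="true"/>
      <field name="app_name"  type="string"       indexed="true" stored="true"/>
      <field name="user_name" type="string"       indexed="true" stored="true"/>
      <field name="log_type"  type="string"       indexed="true" stored="true"/>
      <field name="message"   type="text_general" indexed="true" stored="true"/>
      <!-- Application-specific attributes land here as dynamic text fields -->
      <dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>

The string-typed fields, such as severity and app_name, double as facets, which supports the drill-down filtering described earlier.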

Since each system has a different log schema, these logs have to be parsed periodically and then uploaded to the distributed search. The Log Upload Utility, or agent, can be a custom script, or it can be based on Apache Kafka, Flume, or even RabbitMQ. Kafka is based on publish-subscribe messaging and provides high scalability; you can read more about how it can be used for log streaming at http://blog.mmlac.com/log-transport-with-apache-kafka/. We need to write scripts/programs that understand the log schema and extract the field data from the logs. The Log Upload Utility can feed the outcome to the distributed search nodes, which are simply Solr instances running on a distributed system, such as Hadoop. To achieve near real-time search, the Solr configuration requires a change accordingly.
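
As an illustration, a minimal upload agent written against the SolrJ client API might look like the following sketch (the builder shown is from SolrJ 6 and later). The pipe-delimited log format, the field names, and the Solr URL are all assumptions; a real agent would handle each application's own format and run continuously:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import java.util.UUID;

    public class LogUploadAgent {
        public static void main(String[] args) throws Exception {
            // One of the distributed Solr nodes (URL and core name are assumed)
            SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/logs").build();

            // Assumed format: time|severity|application|user|type|message
            String line = "2015-03-01T10:15:00Z|ERROR|corebanking|jsmith|TXN|Transfer failed";
            String[] parts = line.split("\\|", 6);

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", UUID.randomUUID().toString());
            doc.addField("log_time", parts[0]);
            doc.addField("severity", parts[1]);
            doc.addField("app_name", parts[2]);
            doc.addField("user_name", parts[3]);
            doc.addField("log_type", parts[4]);
            doc.addField("message", parts[5]);

            solr.add(doc);  // the soft-commit settings make it searchable shortly after
            solr.close();
        }
    }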

Indexing can be done either instantly, that is, right at the time of upload, or periodically as a batch operation. The second approach is more suitable if you have a consistent flow of log streams, or if you have schedule-based log uploading. Once the logs are uploaded to a certain folder, for example /stage, a batched index operation using Hadoop's MapReduce can generate HDFS-based Solr indexes, based on the many alternatives that we saw in Chapter 4, Big Data Search Using Hadoop and Its Ecosystem, and Chapter 5, Scaling Search Performance. The generated index can be read using Solr through a Solr Hadoop connector, which does not use MapReduce capabilities while searching.
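
For example, a scheduled batch run over the /stage folder could use the MapReduceIndexerTool from Solr's map-reduce contrib. The jar path, morphline file, ZooKeeper address, and collection name below are assumptions for the sketch; exact names vary by distribution:

    # A sketch of a batch indexing run (paths and names are assumed).
    # The morphline file describes how to parse fields out of the raw logs;
    # --go-live merges the freshly built shards into the live collection.
    hadoop jar solr-map-reduce-*.jar org.apache.solr.hadoop.MapReduceIndexerTool \
        --morphline-file parse-logs.conf \
        --zk-host zk1:2181/solr \
        --collection logs \
        --output-dir hdfs://namenode/outdir \
        --go-live \
        hdfs://namenode/stage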

Apache Blur is another alternative for indexing and searching on Hadoop using Lucene or Solr. Commercial implementations, such as Hortonworks and LucidWorks, provide a Solr-based integrated search on Hadoop (refer to http://hortonworks.com/hadoop-tutorial/searching-data-solr/).