Data Analytics Using Splunk 9.x

By: Dr. Nadine Shillingford

Overview of this book

Splunk 9 improves on the existing Splunk tool to include important features such as federated search, observability, performance improvements, and dashboarding. This book helps you to make the best use of these new features to prepare a Splunk installation that can be used for data analysis. Starting with an introduction to the different Splunk components, such as indexers, search heads, and forwarders, this book takes you through step-by-step installation and configuration instructions for basic Splunk components using Amazon Web Services (AWS) instances. You’ll import the BOTS v1 dataset into a search head and begin exploring data using the Splunk Search Processing Language (SPL), covering various types of Splunk commands, lookups, and macros. After that, you’ll create tables, charts, and dashboards using Splunk’s new Dashboard Studio, and then advance to work with clustering, container management, data models, federated search, bucket merging, and more. By the end of the book, you’ll not only have learned about the latest features of Splunk 9 but also have a solid understanding of the performance tuning techniques in the latest version.
Table of Contents (18 chapters)
Part 1: Getting Started with Splunk
Part 2: Visualizing Data with Splunk
Part 3: Advanced Topics in Splunk

Splunking big data

Splunk is a big data tool. In this book, we will introduce the idea of using Splunk to solve problems that involve large amounts of data. When I worked on the IT security team, the problem was obvious – we needed to use security data to identify malicious activity. Defining the problem you are trying to solve will determine what kind of data you collect and how you analyze that data. Not every problem requires a big data solution. Sometimes, a traditional database solution might work just as well and at lower cost. So, how do you know if you’re dealing with a big data problem? There are three V’s that help define big data:

  • High Volume: A big data problem usually involves large volumes of data. Often, the amount of data is greater than what can fit into traditional database solutions.
  • High Velocity: Traditional database solutions are usually not able to handle the speed at which modern data enters a system. Imagine trying to store and manage data from user clicks on a website such as amazon.com in a traditional database. Traditional databases are not designed to support that many operations.
  • High Variety: A problem requiring analysis of big data involves a variety of data sources of varying formats. An IT security SIEM may have data being logged from multiple data sources, including firewall devices, email traces, DNS logs, and access logs. Each of these logs has a different format and correlating all the logs requires a heavy-duty system.
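
To make the variety point concrete, here is a minimal Python sketch of what correlating two differently formatted logs involves. The two sample lines, their formats, and all names in the code are hypothetical and invented purely for illustration; they are not taken from any real device or from Splunk itself.

```python
import re

# Hypothetical sample events; both formats are invented for illustration.
FIREWALL_LINE = "2024-01-15T10:02:11Z DENY src=10.0.0.5 dst=203.0.113.9 dport=443"
DNS_LINE = "15-Jan-2024 10:02:12 client 10.0.0.5 query: example.com IN A"

# One pattern per source, each mapping its own layout onto shared field names.
FIREWALL_RE = re.compile(
    r"^(?P<time>\S+) (?P<action>\w+) src=(?P<src>\S+) dst=(?P<dst>\S+)"
)
DNS_RE = re.compile(
    r"^(?P<time>\S+ \S+) client (?P<src>\S+) query: (?P<query>\S+)"
)


def parse(line):
    """Return a dict of extracted fields, or None if no pattern matches."""
    for source, pattern in (("firewall", FIREWALL_RE), ("dns", DNS_RE)):
        match = pattern.match(line)
        if match:
            return {"source": source, **match.groupdict()}
    return None


events = [parse(FIREWALL_LINE), parse(DNS_LINE)]

# Once both events expose a common "src" field, they can be correlated.
by_src = {}
for event in events:
    by_src.setdefault(event["src"], []).append(event["source"])
print(by_src)
```

Even with only two sources, each format needs its own extraction rule before the events can be joined on a shared field. With dozens of sources and a high event rate, maintaining this by hand quickly becomes impractical.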

Here are some cases that can be solved using big data:

  • A retail company wants to determine how product placement in stores affects sales. For example, research may show that placing packs of Cheetos near the Point of Sale (POS) devices increases sales to customers with small children. Target assigns a guest ID number to every customer and correlates this ID number with the customer’s credit card number and transactions.
  • A rental company wants to measure which times of year are busiest to ensure that there is sufficient inventory of vehicles at different locations. They may also find that a certain type of vehicle is more suitable for a particular area of town.
  • A public school district wants to explore data pulled from multiple district schools to determine the effect of remote classes on certain demographics.
  • An online shop wants to use customer traffic to determine the peak time for posting ads or giving discounts.
  • An IT security team may use datasets containing firewall logs, DNS logs, and user access to hunt down a malicious actor on the network.

Now, let’s look at how big data is generated.

How is big data generated?

Infographics published by FinancesOnline (https://financesonline.com) indicated that humans created, captured, copied, and consumed about 74 zettabytes of data in 2021. That number is estimated to grow to 149 zettabytes in 2024.
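
To put those two estimates in perspective, the implied average yearly growth can be worked out as a compound annual growth rate. The following is a back-of-the-envelope Python sketch; the function name is illustrative, and the calculation assumes smooth growth between the 2021 and 2024 figures, which the source does not claim.

```python
def compound_annual_growth_rate(start, end, years):
    """Average yearly growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# 74 zettabytes in 2021, estimated 149 zettabytes in 2024 (figures from the text).
rate = compound_annual_growth_rate(74, 149, 2024 - 2021)
print(f"{rate:.1%}")
```

Under that assumption, the two figures imply growth of roughly 26% per year, which amounts to the volume of data doubling in about three years.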

The volume of data seen in the last few years can be attributed to increases in three types of data:

  • Machine data: Data generated by machines, such as operating system and application logs
  • Social data: Data generated by social media systems
  • Transactional data: Data generated by e-commerce systems

We are surrounded by digital devices, and as the capacity and capabilities of these devices increase, the amount of data generated also increases. Modern devices such as phones, laptops, watches, smart speakers, cars, sensors, POS devices, and household appliances all generate large volumes of machine data in a wide variety of formats. Many times, this data stays untouched because the data owners do not have the ability, time, or money to analyze it.

The prevalence of smartphones is possibly another contributor to the exponential increase in data. IBM’s Simon Personal Communicator, introduced in 1992 and widely regarded as the first smartphone, had very limited capability. It cost a whopping $899 with a service contract. Out of the box, a user could use the Simon to make calls and send and receive emails, faxes, and pages. It also contained a notebook, address book, calendar, world clock, and scheduler features. IBM sold approximately 50,000 units (https://time.com/3137005/first-smartphone-ibm-simon/).

Figure 1.1 shows the first smartphone to have the functions of a phone and a Personal Digital Assistant (PDA):

Figure 1.1 – The IBM Simon Personal Communicator released in 1992

The IBM Simon Personal Communicator is archaic compared to the average cellphone today. Apple sold 230 million iPhones in 2020 (https://www.businessofapps.com/data/apple-statistics/). iPhone users generate data when they browse the web, listen to music and podcasts, stream television and movies, conduct business transactions, and post to and browse social media feeds. This is in addition to the features that were found in the IBM Simon, such as sending and receiving emails. Each of these applications generates volumes of data. A single application such as Facebook running on an iPhone involves a variety of data – posts, photos, videos, transactions from Facebook Marketplace, and so much more. Figure 1.2 shows data from Our World in Data (https://ourworldindata.org/internet) that illustrates the rapid increase in users of social media:

Figure 1.2 – Number of people using social media platforms, 2005 to 2019

In the next section, we’ll explore how we can use Splunk to process all this data.

Understanding Splunk

Now that we understand what big data is, its applications, and how it is generated, let’s talk about Splunk Enterprise and how Splunk can be used to manage big data. For simplicity, we will refer to Splunk Enterprise as Splunk.

Splunk was founded in 2003 by Michael Baum, Rob Das, and Erik Swan. It was designed to search, monitor, and analyze machine-generated data, and it can handle high-volume, high-variety data generated at high velocity. This makes it well suited to dealing with big data. Splunk works on various platforms, including Windows (32- and 64-bit), Linux (64-bit), and macOS. It can be installed on physical devices, virtual machines such as VirtualBox and VMware, and virtual cloud instances such as those offered by Amazon Web Services (AWS) and Microsoft Azure. Customers can also sign up for the Splunk Cloud Platform, which supplies the user with a hosted Splunk deployment. Using AWS instances and Splunk Cloud frees the user from having to deploy and maintain physical servers.

Splunk offers a free 60-day trial that allows the user to index 500 MB of data daily. Once the 60 days are over, the user can switch to a perpetual free license or purchase a Splunk license. The 60-day trial is a great way to get your feet wet. Traditionally, the paid version of Splunk was billed at a volume rate – that is, the more data you index, the more you pay. However, new pricing models such as workload and ingest pricing have been introduced in recent years.

In addition to the core Splunk tool, there are various free and paid applications, such as Splunk Enterprise Security, Splunk SOAR, and various observability solutions such as Splunk User Behavior Analytics (UBA) and Splunk Observability Cloud.

Splunk was designed to index a variety of data. This is accomplished via pre-defined configurations that allow Splunk to recognize the format of different data sources. In addition, splunkbase.com is a constantly growing repository of 1,000+ apps and Technical Add-Ons (TAs) developed by Splunk, Splunk partners, and the Splunk community. One of the most important features of these TAs is their configurations for automatically extracting fields from raw data. Unlike traditional databases, Splunk can index large volumes of data. A dedicated Splunk Enterprise indexer can index over 20 MB of data per second, or about 1.7 TB per day. The amount of data that Splunk is capable of indexing can be increased with additional indexers. There are many use cases for which Splunk is a great solution.
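
The per-second indexing figure quoted above can be turned into a daily volume with simple arithmetic. The following Python sketch assumes decimal units (1 TB = 1,000,000 MB) and a sustained rate of exactly 20 MB per second, which the text gives only as a lower bound. The function names and the 5 TB/day workload are hypothetical, and real sizing depends on hardware, search load, and the type of data.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds
MB_PER_TB = 1_000_000           # decimal units (an assumption)


def daily_volume_tb(mb_per_second):
    """Convert a sustained indexing rate in MB/s to TB per day."""
    return mb_per_second * SECONDS_PER_DAY / MB_PER_TB


def indexers_needed(target_tb_per_day, mb_per_second_per_indexer=20):
    """Naive count of indexers for a daily volume; ignores search load."""
    per_indexer_tb = daily_volume_tb(mb_per_second_per_indexer)
    return math.ceil(target_tb_per_day / per_indexer_tb)


print(f"{daily_volume_tb(20):.3f} TB/day per indexer")
print(indexers_needed(5), "indexers for a hypothetical 5 TB/day workload")
```

At 20 MB per second, a single indexer handles about 1.7 TB per day, so the hypothetical 5 TB/day workload would need at least three indexers by this rough estimate.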

Table 1.1 highlights how Splunk improved processes at The University of Arizona, Honda, and Lenovo:

Use Case | Company | Details
Security | The University of Arizona | The University of Arizona used Splunk Remote Work Insights (RWI) to help with the challenges of remote learning during the pandemic (https://www.splunk.com/en_us/customers/success-stories/university-of-arizona.html)
IT Operations | Honda | Honda used predictive analytics to increase efficiency and solve problems before they became machine failures or interruptions in their production line (https://tinyurl.com/5n7f7naz)
DevOps | Lenovo | Lenovo reduced the amount of time spent in troubleshooting by 50% and maintained 100% uptime despite a 300% increase in web traffic (https://tinyurl.com/yactu398)

Table 1.1 – Examples of success stories from Splunk customers

We will look at some of the major components of Splunk in the next section.