Splunk: Enterprise Operational Intelligence Delivered

By: Derek Mock, Betsy Page Sigman, Paul R. Johnson, Erickson Delgado, Josh Diakun, Ashish Kumar Tulsiram Yadav

Overview of this book

Splunk is an extremely powerful tool for searching, exploring, and visualizing data of all types. It is becoming increasingly popular as more and more businesses, both large and small, discover its ease of use and its usefulness. Analysts, managers, students, and others can quickly learn how to use the data from their systems, networks, web traffic, and social media to make attractive and informative reports. This course teaches everything from installing and configuring Splunk onward. The first module is for anyone who wants to manage data with Splunk. You'll start with the very basics of Splunk, installing it, before moving on to searching machine data. You will gather data from different sources, isolate it by indexes, classify it into source types, and tag it with the essential fields. With more than 70 recipes on hand in the second module that demonstrate all of Splunk's features, not only will you find quick solutions to common problems, but you'll also learn a wide range of strategies and uncover new ideas that will make you rethink what operational intelligence means to you and your organization. Dive deep into Splunk to find the most efficient solution to your data problems in the third module. Create the robust Splunk solutions you need to make informed decisions in big data machine analytics. From visualizations to enterprise integration, this well-organized, high-level guide has everything you need for Splunk mastery. This learning path combines some of the best that Packt has to offer into one complete, curated package. It includes content from the following Packt products:

  • Splunk Essentials - Second Edition
  • Splunk Operational Intelligence Cookbook - Second Edition
  • Advanced Splunk

Chapter 5.  Data Optimization, Reports, Alerts, and Accelerating Searches

Finding the data you need in Splunk is relatively easy, as you have seen in the previous chapters. Doing the same thing repeatedly, however, requires techniques that make data retrieval faster. In Chapter 2, Bringing in Data, you were shown how to use data fields and make field extractions. In Chapter 4, Data Models and Pivot, you learned how to create data models. You will continue that journey in this chapter by learning how to classify your data using event types, enrich your data using lookups and workflow actions, and normalize your data using tags.

Once you have all these essentials in place, you will be able to easily create reports, alerts, and dashboards. This is where Splunk really shines and your hard work so far will pay off.

In this chapter, we will cover a wide range of topics that showcase ways to manage, analyze, and get results from data. These topics will help you learn to work more efficiently with data and gather better insights from it:

  • Data classification with event types
  • Data normalization with tags
  • Data enrichment with lookups
  • Creating reports
  • Creating alerts
  • The Custom Cron schedule
  • Best practices in scheduling jobs
  • Optimizing searches

Data classification with event types

When you begin working with Splunk every day, you will quickly notice that many things are repeatable. In fact, while going through this book, you may have seen that search queries can easily get longer and more complex. One way to make things easier and shorten search queries is to create event types. Event types are not the same as events; an event is just a single instance of data. An event type is a grouping or classification of events that meet the same criteria.

If you took a break between chapters, you will probably want to open up Splunk again. Then you will execute a search command:

  1. Open up Splunk.
  2. Click on your Destinations app.
  3. Type in this query:
      SPL> index=main http_uri=/booking/confirmation http_status_code=200

This search will return successful booking confirmations. Now say you want to run the same search the next day. Without any data classification, you would have to type the same search string as before. Instead of that tedious repetition, you can simplify your work by saving the search as an event type. Follow these steps now:

  1. In the Save As dropdown, select Event Type:
  2. Label this new event type good_bookings.
  3. Select a color that is best suited for the type of event; in this case, we will select green.
  4. Select 5 as the priority. Priority here determines which style wins if there is more than one event type. One is the highest and 10 is the lowest.
  5. Use the following screenshot as a guide, then click on Save:

Now let's create an event type for bad bookings:

  1. Change the search query from http_status_code=200 to http_status_code=500. The new query is as shown here:
          SPL> index=main http_uri=/booking/confirmation http_status_code=500 
    
    
  2. Save this as an event type. This time, name it bad_bookings, choose red as the color, and leave Priority at 5.

We have created the two event types we needed. Now let's see them in action:

  1. Type the following query in the search input:
          SPL> eventtype=*bookings
    
  2. Notice that the search results have now been color-coded based on the event type that you created. You can also just search for either eventtype=good_bookings or eventtype=bad_bookings to narrow down your search results.
  3. Examine the following screenshot, which shows the results. The colors we have chosen make it easy to spot the types of booking. Imagine the time this saves a manager, who can instantly look for bad bookings. It's just one more way Splunk can make operations so much easier:

Certain restrictions apply when creating event types. You cannot create an event type from a search that contains a pipe or a subsearch; only the base search can be saved as an event type. For example, index=main http_uri=/booking/confirmation http_status_code=500 can be saved as an event type, but the same search followed by | stats count cannot.

Since the event type is now part of the search, you can then further manipulate data using piped commands, just like this:

SPL> eventtype=*bookings | stats count by eventtype
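The same idea works with any reporting command. For instance, here is a quick sketch that trends good and bad bookings over time, assuming hourly buckets are granular enough for your needs:

SPL> eventtype=*bookings | timechart span=1h count by eventtype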

Create a few more event types now, using the following table as a guide:

| Event type | Search command | Color |
| --- | --- | --- |
| good_payment | index=main http_uri=/booking/payment http_status_code=200 | green |
| bad_payment | index=main http_uri=/booking/payment http_status_code=500 | red |
| destination_details | index=main http_uri=/destination/*/details | blue |
| bad_logins | index=main http_uri=/auth http_status_code=500 | purple |
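
Once these are saved, you can start using them straight away. For example, here is a quick sketch that uses the new destination_details event type to list the most frequently requested destination pages:

SPL> eventtype=destination_details | top http_uri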

Data normalization with tags

Tags in Splunk are useful for grouping events with related field values. Unlike event types, which are based on saved search strings, tags are created against specific field and value combinations. You can assign multiple tags to the same field value, with each tag capturing a different way of classifying it.

The simplest use-case scenario for tags is classifying IP addresses. In our Eventgen logs, three IP addresses are automatically generated. We will create tags against these IP addresses that will allow us to classify them based on different conditions:

| IP address | Tags |
| --- | --- |
| 10.2.1.33 | main, patched, east |
| 10.2.1.34 | main, patched, west |
| 10.2.1.35 | backup, east |

In our server farm of three servers, we are going to group them by purpose, patch status, and geolocation. We will achieve this using tags, as shown in the following steps:

  1. Begin by using the following search command:
          SPL> index=main server_ip=10.2.1.33
    
  2. Expand the first event by clicking on the information field as seen in this screenshot:
  3. While expanded, look for the server_ip field. Click on the Actions dropdown and select Edit Tags:
  4. In the Create Tags window, fill in the Tag(s) text area using the following screenshot as a guide. For 10.2.1.33, you will use the following tags: main, patched, east.
  5. Click on Save when you're done:
  6. Do the same for the remaining two IP addresses and create tags based on the previous table.
  7. Now let us make use of this newly-normalized data. Run the search command:
          SPL> index=main tag=patched OR tag=east
    

This will give you all the events that come from servers that are either patched or hypothetically located on the east side of a building. You can then combine these tags with other search commands or an event type to narrow down the search results.

Consider a scenario where you need to find all booking payments with errors originating from the servers in the east side of a hypothetical building.

Without event types or tags, you would create a search command that looked something like this:

SPL> index=main server_ip=10.2.1.33 OR server_ip=10.2.1.35  
     AND (http_uri=/booking/payment http_status_code=500)

Compare that to this much more elegant and shorter search command, which you can try now:

SPL> index=main eventtype=bad_payment tag=east
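
If you then want to see which of the east-side servers is generating those failed payments, pipe the result into stats; a quick sketch:

SPL> index=main eventtype=bad_payment tag=east | stats count by server_ip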

Here's an additional exercise for you. Create tags for the following fields using this table as a guide and use them in a search query:

| Field | Tags |
| --- | --- |
| http_uri = /destination/LAX/details | major_destination |
| http_uri = /destination/NY/details | major_destination |
| http_uri = /destination/MIA/details | home |
| http_status_code = 301 | redirect |
| http_status_code = 404 | not_found |

Now you can use these tags to search for major destinations whose details pages returned a not_found status. Tags like these can make your searches much easier and more useful. Here is an example of a search command that combines what you have learned in this chapter so far:

  1. Go ahead and run this now:
          SPL> eventtype=destination_details tag=major_destination
               tag=not_found
    
  2. Look through your results and see that the data is now limited to the major destinations, LAX and NY, where the details request returned a 404.

Data enrichment with lookups

Occasionally you will come across pieces of data that you wish were rendered in a more readable manner. A common example is HTTP status codes. Computer engineers are often familiar with status codes as three-digit numbers. Business analysts, however, would not necessarily know the meaning of these codes. In Splunk, you solve this predicament by using lookup tables, which can pair numbers or acronyms with more understandable text classifiers.

A lookup table is a mapping of keys and values that Splunk can query so it can translate fields into more meaningful information at search time. This is best understood through an example. You can go through the following steps:

  1. From the Destinations app, click on Settings and then Lookups:
  2. In the Lookups page, click on the Add new option next to Lookup table files, as shown in the following screenshot:
  3. In the Add new page, make sure that the Destinations app is selected.
  4. Then, using the following screenshot as your guide, in Upload a lookup file, browse and choose the following: C:\splunk-essentials\labs\chapter05\http_status.csv.
  5. Finally, type in http_status.csv in the Destination filename field.
  6. Click on Save to complete:

The new lookup table file path will now appear in the main Lookup table files page. Change the permissions so that all apps can use it; the sharing column will then show Global. Each entry in the lookup table file maps an HTTP status code to a human-readable description and a status type.
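
If you open the file itself, it is an ordinary CSV whose header row defines the lookup fields. Here is a minimal sketch of what http_status.csv might contain; the exact rows and wording in the lab file may differ, but the column names (status, status_description, and status_type) are the ones referenced by the lookup command later in this section:

    status,status_description,status_type
    200,OK,Successful
    301,Moved Permanently,Redirection
    302,Found,Redirection
    404,Not Found,Client Error
    500,Internal Server Error,Server Error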

Now that we have configured the lookup table file, it is time to define the lookup:

  1. In the Lookups page under Settings, click on the Add new option next to Lookup definitions:
  2. Once again, make sure that this is being saved in the context of the Destinations app.
  3. In the name field, type in http_status.
  4. Leave the Type as File-based. In the Lookup file dropdown, look for the http_status.csv file and select it.
  5. Leave the following checkboxes blank:
  6. Save the definition.
  7. The new lookup definition will now appear in the table. Change the permission sharing to Global as well.

Let us now try to make use of this new lookup table:

  1. In the Destinations app search bar, type in:
          SPL> eventtype=destination_details | top http_status_code
    
  2. The result will show the http_status_code column with the raw status codes. Now extend your search by using the lookup command. The following multi-line command might not work if you simply copy and paste it; you may need to retype the line breaks for it to run:
          SPL> eventtype=destination_details  
                           | top http_status_code 
                           | rename http_status_code AS status 
                           | lookup http_status status OUTPUT 
                             status_description, status_type
    
  3. Look at the following output. The steps you took give you a meaningful output showing the description and type of the status codes, all because of the lookup table we set up first:

This is good for a first step, but for it to be a practical tool, the lookup needs to happen automatically with all queries. To do this, take the following steps:

  1. Go back to Settings and then the Lookups page.
  2. Click on Add new to add a new Automatic Lookup:
  3. Complete the form with the following information. Click on Save when you're done. Go to Permissions and change the sharing permission to Global by clicking on All Apps:
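
Behind the scenes, an automatic lookup is stored as a LOOKUP- attribute in props.conf. As a rough sketch only, the generated configuration for this example might look something like the following; the stanza name access_custom is just a placeholder for whatever sourcetype your Eventgen data actually uses:

    [access_custom]
    LOOKUP-http_status = http_status status AS http_status_code OUTPUT status_description status_type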

Now let's see how these changes can help us out:

  1. Go back to the Destinations app search bar and type in the following query:
          SPL> eventtype=destination_details status_type=Redirection
    

    Tip

    Note that now you can filter your search using the lookup information without invoking the lookup command.

  2. Notice that the search output will match all events where http_status_code equals 301 or 302.
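
Because the lookup fields are now added automatically at search time, they behave like any other field in your events. For example, here is a quick sketch that summarizes destination details traffic by status type:

SPL> eventtype=destination_details | stats count by status_type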

Creating reports

So far in this chapter, you have learned how to do three very important things: classify data using event types, normalize data using tags, and enrich data using lookup tables. All these, in addition to Chapter 4, Data Models and Pivot, constitute the essential foundation you need to use Splunk in an efficient manner. Now it is time to put them all to good use.

Splunk reports are reusable searches that can be shared with others or saved as a dashboard panel. Reports can also be scheduled to run periodically and perform an action, for example sending the results out as an e-mail. Reports can be configured to display search results in a statistical table as well as in visualization charts. You can create a report from the search command line or from a Pivot. Here we will create a report using the search command line:

  1. In the Destinations app's search page, type in this command:
          SPL> eventtype=bad_logins | top client_ip
    

    The search is trying to find all client IP addresses that attempted to log in but got a 500 internal server error.

  2. To save this as a report for future use, click on Save As | Report, then give it the title Bad Logins:
  3. Next, click Save.
  4. Then click on View to go back to the search results.
  5. Notice that the report is now properly labeled with our title. You can see the report in the following screenshot:
  6. If you expand the Edit dropdown, you now have additional options to consider while working on this report.

You can modify the permissions so others can use your report. You have done this step a couple of times earlier in the book. This process will be identical to editing permissions for other objects in Splunk.

You can create a schedule to run this report on a timely basis and perform an action on it. The typical action would either be sending the result as an e-mail or running a script. Unfortunately, you would need a mail server to send an e-mail, so you will not be able to do this from your Splunk workstation the way it is currently configured. However, we will show you how it is done:

  1. Click Edit | Edit Schedule.
  2. In the pop-up window, click on Schedule Report.
  3. Change the Schedule option to run Every Day. The time range applies to the search time scope. The default is to run the report against a 15-minute time range.

    Schedule windows are important for production environments. The schedule window you specify should be less than the time range. When there are multiple concurrent searches going on in the Splunk system, it will check whether you have a schedule window and will delay your report up to the defined time or until no other concurrent searches are running. This is one way of optimizing your Splunk system. If you need accurate results that are based on your time range, however, then do not use the schedule window option.

  4. Refer to the following screenshot, then click on Next when you're ready to move on:
  5. In the next window, check the Send Email box to show advanced e-mail options. Once again, since your workstation does not have a mail server, the scheduled report will not work. But it is worth viewing what the advanced e-mail options look like:
  6. Uncheck the Send Email option again and click on Save. The report will still run, but it will not perform any action. We can, however, embed the report into an external website and it will always show the results based on the scheduled run. We will reserve further discussion about this advanced option for Chapter 7, Splunk SDK for JavaScript and D3.js.

There is another option that you will commonly use for reports: adding them to dashboards. You can do this with the Add to Dashboard button. We will use this option in Chapter 6, Panes of Glass.

Create a few more reports from SPL using the following guidelines. We will use some of these reports in future chapters so try your best to do all of them. You can always come back to this chapter if you need to:

  • Report name: Bad payments
    Search: eventtype="bad_payment" | top client_ip
    Schedule: Run every hour
    Time range: Last 24 hrs
    Time window: 30 mins
  • Report name: Bookings last 24 hrs
    Search: eventtype=good_bookings | timechart span=1h count
    Schedule: Run every 24 hours
    Time range: Last 24 hrs
    Time window: 15 mins

You also have the ability to create reports using Pivot:

  1. Click on Pivot.
  2. Create a Pivot table on the Destination Details child object with Last 24 hours as your Filters and Airport Code as your Split Rows.
  3. Refer to the following screenshot then save it as a report entitled Destinations by Airport Code. Schedule the report to run every hour, within a 24-hour time range, and with a 30-minute time window:

Creating alerts

Alerts are crucial in IT operations. They provide real-time awareness of the state of the systems. Alerts also enable you to act fast when an issue has been detected prior to waiting for a user to report it. Sure enough, you can have a couple of data center operators monitor your dashboards, but nothing jolts their vigil more than an informative alert.

Now, alerts are only good if they are controlled and if they provide enough actionable information. Splunk allows you to do just that. In this section, we will walk you through how to create an actionable alert and how to throttle the alerting to avoid flooding your mailbox.

The exercises in this section will show you how to create an alert, but in order to generate the actual e-mail alert, you will need a mail server. This book will not cover mail servers but the process of creating the alert will be shown in full detail.

We want to know when there are instances of a failed booking scenario. This event type was constructed with the 500 HTTP status code. 5xx status codes are the most devastating errors in a web application so we want to be aware of them. We will now create an alert that will be triggered when a bad booking event is detected. Follow these steps:

  1. To create the alert, start by typing this:
          SPL> eventtype=bad_bookings
    
  2. Click on Save As | Alert. In the Save As Alert panel, fill out the form using the following screenshot as a guide:

Let us explain some of the different options in this selection:

  • Permissions: You should be fairly familiar with permissions by now. These apply to alerts as well.
  • Alert type: There are two types of alert, mirroring the two ways to run a search: scheduled or real time. Splunk has predefined schedules that you can easily use, namely:
    • Run every hour
    • Run every day
    • Run every week
    • Run every month
    • Although the schedules above are convenient, you will likely soon find yourself wanting more granularity for your searches. This is where the fifth option comes in: Run on Cron schedule. We will discuss this in detail later in the chapter.
  • Trigger Conditions: These are the conditions or rules that define when the alert will be generated. The predefined conditions that Splunk offers out-of-the-box are:
    • Number of Results: Most commonly used, this tells the alert to run whenever your search returns a certain number of events.
    • Number of Hosts: This is used when you need to know how many hosts are returning events based on your search.
    • Number of Sources: This is used when you need to know how many data sources are returning events based on your search.
    • Custom: This is used when you want to base your condition on the value of a particular field that is returned in your search result. We will discuss this in detail further into this chapter.
  • Trigger Actions: These are the actions that will be invoked when your trigger conditions are met. There are several possible default trigger actions currently included in Splunk Enterprise:
    • Add to Triggered Alerts: This will add an entry to the Activity | Triggered Alerts page. This is what we will use in this book, since it is the only readily available option.
    • Run a script: You can run a script (such as a Python script) located in the $SPLUNK_HOME/bin/scripts directory whenever this alert is generated. This is useful for self-repairing issues.
    • Send e-mail: Commonly used but requires a mail server to be configured.
    • Webhook: A recently introduced type of trigger that allows Splunk to make an HTTP POST to an external application (such as Twitter or Slack).

Click on Save to save your first alert. We will come back later to optimize it. Meanwhile, you should now have been taken to the alert summary page, where you can continue to make changes. Note that since we selected the Add to Triggered Alerts action, you should now see the history of when this alert was triggered on your machine. Since the Eventgen data is randomized and we scheduled the alert to run every hour, you may have to wait until the next hour for results.
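
For reference, a saved alert like this one ends up as a stanza in savedsearches.conf inside the app. The following is only a rough sketch of what that stanza might look like, assuming the hourly preset and the Add to Triggered Alerts action; the stanza name and the exact dispatch times depend on what you chose in the dialog:

    [Bad Bookings]
    search = eventtype=bad_bookings
    enableSched = 1
    cron_schedule = 0 * * * *
    dispatch.earliest_time = -1h@h
    dispatch.latest_time = now
    counttype = number of events
    relation = greater than
    quantity = 0
    alert.track = 1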


Search and report acceleration

In Chapter 4, Data Models and Pivot, you learned how to accelerate a data model to speed up retrieval of data. The same principle applies to saved searches or reports:

  1. Click on the Reports link in the navigation menu of the Destinations app.
  2. Click on the Edit | Edit Acceleration option in the Bookings Last 24 Hrs report.
  3. Enable 1 Day acceleration as seen in the following screenshot:
  4. To check the progress of your report's acceleration, click on Settings | Report Acceleration Summaries:

Scheduling best practices

No matter how advanced and well-scaled your Splunk infrastructure is, if all scheduled searches and reports run at the same time, the system will start experiencing issues. Typically, you will receive a Splunk message saying that you have reached the limit of concurrent or historical searches. Suffice it to say that only a certain number of searches can run per CPU core on each Splunk instance. The very first issue a beginner Splunk admin faces is how to limit the number of concurrent searches running at the same time. One way to fix this is to throw more servers into the Splunk cluster, but that is not an efficient way.

The trick to establishing a robust system is to properly stagger and budget scheduled searches and reports. This means ensuring that they are not running at the same time. There are two ways to achieve this:

  • Time windows: The first way to ensure that searches are not running concurrently is to always set a time window, as you have done in the exercises in this chapter. This is not ideal, however, if each run needs to start at an exact time.
  • Custom Cron schedule: This is what most advanced users use to create their schedules. Cron is a system daemon, or a computer program that runs as a background process, derived from traditional UNIX systems; it is used to execute tasks at specified times.

Let us see an example of how to use a custom Cron schedule. Begin with this search query, which finds all errors in a payment:

  1. Type in the following:
          SPL> eventtype=bad_payment
    
  2. Save it as an alert by clicking on Save As | Alert.
  3. Name it Payment Errors.
  4. Change the permissions to Shared in App.
  5. In the Alert type, change the schedule to Run on Cron Schedule.
  6. In the Earliest field, enter -15m@m (the last 15 minutes, snapped to the start of the minute; this gives a 15-minute time range that always begins exactly on a minute boundary).
  7. In the Latest field, type in now. In the Cron Expression field, type in */5 * * * *.
  8. Finally, change the Trigger Actions to Add to Triggered Alerts. Use the following screenshot as a guide:
  9. Click Save when done.

The five fields of a Cron expression correspond, in order, to minute, hour, day of month, month, and day of week.

Learning Cron expressions is easiest when you look at examples. The more examples, the simpler it is to understand this method of scheduling. Here are some typical examples:

| Cron expression | Schedule |
| --- | --- |
| */5 * * * * | Every 5 minutes |
| */15 * * * * | Every 15 minutes |
| 0 */6 * * * | Every 6 hours, on the hour |
| 30 */2 * * * | Every 2 hours, at the 30th minute (for instance, 0:30, 2:30, 4:30) |
| 45 14 1,10 * * | At 2:45 pm on the 1st and 10th day of every month |
| 0 * * * 1-5 | Every hour, on the hour, Monday to Friday |
| 2,17,32,47 * * * * | At the 2nd, 17th, 32nd, and 47th minute of every hour |

Now that you know something about Cron expressions, you can fine-tune all your searches to run in precise and different schedules.
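
One practical habit is to offset the minute field so that your scheduled searches do not all fire at the same instant. For example, instead of giving two hourly reports the same 0 * * * * expression, you might stagger them like this (the specific minutes are arbitrary):

    7 * * * *
    22 * * * *

Each report still runs once per hour, but the scheduler never has to dispatch them at the same moment.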

Summary indexing

In a matter of days, Splunk will accumulate data and start to move events into the cold bucket. If you recall, the cold bucket is where older data is stored, typically on slower disk. You will still be able to access this data, but you are bound by the speed of that disk. Compound that with the millions of events that are typical of an enterprise Splunk implementation, and you can understand why historical searches can slow down dramatically.

There are two ways to circumvent this problem, one of which you have already performed: search acceleration and summary indexing.

With summary indexing, you run a scheduled search and output the results into a separate index, commonly called summary. The result contains only the computed statistics of the search. This is a very small subset of data that is much faster to retrieve than going through the entirety of the events in the cold bucket.
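
The steps that follow use the summary indexing options built into the scheduler, but the same idea can also be expressed directly in SPL with the collect command, which writes a search result into an index of your choice. A minimal sketch, for reference only; the source value here is arbitrary, and results written this way carry a source rather than the search_name field used later in this section:

SPL> eventtype=bad_payment | stats count AS summaryCount | collect index=summary source="payment_errors_manual"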

Say, for example, you wish to keep track of all counts of an error in payment and you wish to keep the data in the summary index. Follow these steps:

  1. From your Destinations app, go to Settings | Searches, reports, and alerts.
  2. Click on the New button to create a new scheduled search.
  3. Use the following information as a guide:
    • Destinations app: Destinations
    • Search name: Summary of Payment Errors
    • Search: eventtype=bad_payment | stats count
    • Start time: -2m@m
    • Finish time: now

Now perform the following steps:

  1. Click on Schedule this search.
  2. Change Schedule type to Cron.
  3. Set Cron schedule to */2 * * * *.
  4. Set Condition to always. This option, found in the Alert section, controls when the result is recorded; the alternative is to record it only if the number of events is greater than 0.
  5. Set Expiration to Custom time of 1 hour.

Use the following screenshot as a guide:


Now perform the following steps:

  1. Click on the Enable checkbox in the Summary indexing section.
  2. In the Add fields section, add a new field with the name summaryCount and the value count.

Use the following information as a guide:

  1. Save when you are ready to continue.
  2. Now go back to the Destinations app's Search page. Type in the following search command and wait about 5-10 minutes:
      SPL> index=summary search_name="Summary of Payment Errors"

Notice that this has stripped the original events of all information except the count of events at the time the scheduled search ran. We will use this information in later chapters to create optimized dashboards.
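
Once the summary index has accumulated a few runs, you can report on it just as you would on any other index, only much faster. Here is a quick sketch that charts the stored counts over time; the count field comes from the stats command in the scheduled search, and the summaryCount field you added would work equally well:

SPL> index=summary search_name="Summary of Payment Errors" | timechart span=15m sum(count) AS payment_errors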

Summary

In this chapter, you have learned how to optimize data in three ways: classifying your data using event types, normalizing your data using tags, and enriching your data using lookup tables. You have also learned how to create advanced reports and alerts. You have accelerated your searches just as you did with data models. You have been introduced to the powerful Cron expression, which gives you fine-grained control over when your scheduled searches run, and you have seen how to stagger searches using time windows. Finally, you have created a summary index that allows you to search historical data faster. In the next chapter, Chapter 6, Panes of Glass, you will go on to learn more about visualizations.