Machine Learning for Cybersecurity Cookbook

By: Emmanuel Tsukerman

Overview of this book

Organizations today face a major threat in terms of cybersecurity, from malicious URLs to credential reuse, and having robust security systems can make all the difference. With this book, you'll learn how to use Python libraries such as TensorFlow and scikit-learn to implement the latest artificial intelligence (AI) techniques and handle challenges faced by cybersecurity researchers. You'll begin by exploring various machine learning (ML) techniques and tips for setting up a secure lab environment. Next, you'll implement key ML algorithms such as clustering, gradient boosting, random forest, and XGBoost. The book will guide you through constructing classifiers and features for malware, which you'll train and test on real samples. As you progress, you'll build self-learning, reliant systems to handle cybersecurity tasks such as identifying malicious URLs, spam email detection, intrusion detection, network protection, and tracking user and process behavior. Later, you'll apply generative adversarial networks (GANs) and autoencoders to advanced security tasks. Finally, you'll delve into secure and private AI to protect the privacy rights of consumers using your ML models. By the end of this book, you'll have the skills you need to tackle real-world problems faced in the cybersecurity domain using a recipe-based approach.

Analyzing time series using statsmodels

A time series is a sequence of values observed at successive points in time. For example, the price of a stock sampled every minute forms a time series. In cybersecurity, time series analysis can help anticipate an attack, such as an insider exfiltrating data or a group of hackers coordinating in preparation for their next hit.

Let's look at several techniques for making predictions using time series.

Getting ready

Preparation for this recipe consists of installing the matplotlib, statsmodels, and scipy packages using pip. The command for this is as follows:

pip install matplotlib statsmodels scipy

How to do it...

In the following steps, we demonstrate several methods for making predictions using time series data:

  1. Begin by generating a time series:
from random import random

time_series = [2 * x + random() for x in range(1, 100)]
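  2. Plot your data:
%matplotlib inline
# The line above is a Jupyter/IPython magic; omit it when running a plain script
import matplotlib.pyplot as plt

plt.plot(time_series)
plt.show()

The output is a plot of the series, which lies very close to a straight line (the screenshot is not reproduced here).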

  3. There is a large variety of techniques we can use to predict the next value of a time series:
    • Autoregression (AR):
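# Note: AR is deprecated and unavailable in recent statsmodels releases,
# which provide AutoReg in the same module instead
from statsmodels.tsa.ar_model import AR

model = AR(time_series)
model_fit = model.fit()
# Predict the single value one step past the end of the series
y = model_fit.predict(len(time_series), len(time_series))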
    • Moving average (MA):
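# Note: ARMA is deprecated and unavailable in recent statsmodels releases,
# which provide statsmodels.tsa.arima.model.ARIMA instead
from statsmodels.tsa.arima_model import ARMA

# order=(0, 1) means no autoregressive terms and one moving-average term
model = ARMA(time_series, order=(0, 1))
model_fit = model.fit()
y = model_fit.predict(len(time_series), len(time_series))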
    • Simple exponential smoothing (SES):
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

model = SimpleExpSmoothing(time_series)
model_fit = model.fit()
# Predict the single value one step past the end of the series
y = model_fit.predict(len(time_series), len(time_series))

Printing y after each block shows that method's prediction for the next value (the output listing is not reproduced here).

How it works...

In the first step, we generate a simple toy time series. The series consists of values on a line sprinkled with some added noise. Next, we plot our time series in step 2. You can see that it is very close to a straight line and that a sensible prediction for the value of the time series at time $t$ is $2t$. To create a forecast, we consider three different schemes (step 3) for predicting the future values of the time series.

In an autoregressive model, the basic idea is that the value of the time series at time $t$ is a linear function of the values of the time series at the previous times. More precisely, there are some constants, $c_0, c_1, \dots, c_p$, and a number, $p$, such that:

$$x_t = c_0 + c_1 x_{t-1} + \dots + c_p x_{t-p} + \epsilon_t$$

Here, $\epsilon_t$ is a noise term. As a hypothetical example, $p$ may be 3, meaning that the value of the time series can be estimated from its last 3 values.
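To make the autoregressive idea concrete, here is a minimal sketch of the simplest case, an order-1 model of the form x_t = c0 + c1 * x_(t-1), fitted with closed-form least squares in plain Python. It is an illustration only and is not how statsmodels fits its models; the fixed seed and the names c0, c1, and ar1_forecast are assumptions introduced for this example.

```python
from random import random, seed

seed(0)  # fixed seed (an assumption) so the sketch is reproducible
time_series = [2 * x + random() for x in range(1, 100)]

# Pair each value with its predecessor: (x_{t-1}, x_t)
prev = time_series[:-1]
curr = time_series[1:]
n = len(prev)

# Closed-form ordinary least squares for curr = c0 + c1 * prev
mean_prev = sum(prev) / n
mean_curr = sum(curr) / n
cov = sum((a - mean_prev) * (b - mean_curr) for a, b in zip(prev, curr))
var = sum((a - mean_prev) ** 2 for a in prev)
c1 = cov / var
c0 = mean_curr - c1 * mean_prev

# One-step-ahead forecast from the last observed value
ar1_forecast = c0 + c1 * time_series[-1]
print(c0, c1, ar1_forecast)
```

On this toy series, the fitted slope comes out close to 1 and the intercept close to 2, so the forecast is roughly the last value plus 2, which matches the underlying line.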

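In the moving-average model, the time series is modeled as fluctuating about a mean. More precisely, let $\epsilon_t$ be a sequence of i.i.d. normal variables and let $\mu$ be a constant. Then, the time series is modeled by the following formula (shown here for order 1, matching order=(0, 1) in step 3, with coefficient $\theta$):

$$x_t = \mu + \epsilon_t + \theta \epsilon_{t-1}$$

Because the model assumes that the series fluctuates about a constant mean, it cannot follow a trend. For that reason, it performs poorly in predicting the noisy linear time series we have generated.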
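The following sketch illustrates why a mean-anchored model struggles with a trend. It computes a one-step forecast from an order-1 moving-average model, x_t = mu + eps_t + theta * eps_(t-1), under loud simplifications: theta is a hypothetical fixed value rather than a fitted one, mu is taken to be the sample mean, and the noise term before the first observation is assumed to be zero. statsmodels estimates these quantities differently, so the numbers will not match its output.

```python
from random import random, seed

seed(0)  # fixed seed (an assumption) so the sketch is reproducible
time_series = [2 * x + random() for x in range(1, 100)]

theta = 0.5  # hypothetical coefficient, not fitted
mu = sum(time_series) / len(time_series)  # sample mean stands in for the fitted constant

# Recover the noise terms recursively from x_t = mu + eps_t + theta * eps_{t-1},
# assuming the noise term before the first observation is zero
eps = 0.0
for x in time_series:
    eps = x - mu - theta * eps

# One-step-ahead forecast: the future noise term has expectation zero,
# so only the most recent recovered noise term contributes
ma1_forecast = mu + theta * eps
print(mu, time_series[-1], ma1_forecast)
```

The forecast stays pulled toward the sample mean of the whole series, so it lands well below the most recent values of the upward-trending data.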

Finally, in simple exponential smoothing, we propose a smoothing parameter, $\alpha$, with $0 < \alpha < 1$. Then, our model's estimate, $\hat{x}_t$, is computed from the following equations:

$$\hat{x}_0 = x_0$$
$$\hat{x}_t = \alpha x_t + (1 - \alpha) \hat{x}_{t-1}$$

In other words, we keep track of an estimate, $\hat{x}_{t-1}$, and adjust it slightly using the current time series value, $x_t$. How strongly the adjustment is made is regulated by the $\alpha$ parameter.
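The smoothing update can be sketched in a few lines of plain Python. This is an illustration of the recursion only: the value of alpha is a hypothetical choice rather than one optimized from the data, and the estimate is initialized with the first observation, which is one common convention and an assumption here.

```python
from random import random, seed

seed(0)  # fixed seed (an assumption) so the sketch is reproducible
time_series = [2 * x + random() for x in range(1, 100)]

alpha = 0.5  # hypothetical smoothing parameter, not optimized

# Initialize the running estimate with the first observation
estimate = time_series[0]
for x in time_series[1:]:
    # Move the estimate part of the way toward the newest value
    estimate = alpha * x + (1 - alpha) * estimate

# The forecast for the next step is the final smoothed estimate
ses_forecast = estimate
print(time_series[-1], ses_forecast)
```

Because each update moves only part of the way toward the newest value, the estimate trails a steadily rising series slightly; a larger alpha shortens that lag at the cost of less smoothing.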