Machine Learning for Cybersecurity Cookbook

By: Emmanuel Tsukerman

Overview of this book

Organizations today face major cybersecurity threats, from malicious URLs to credential reuse, and having robust security systems can make all the difference. With this book, you'll learn how to use Python libraries such as TensorFlow and scikit-learn to implement the latest artificial intelligence (AI) techniques and handle challenges faced by cybersecurity researchers. You'll begin by exploring various machine learning (ML) techniques and tips for setting up a secure lab environment. Next, you'll implement key ML algorithms such as clustering, gradient boosting, random forest, and XGBoost. The book will guide you through constructing classifiers and features for malware, which you'll train and test on real samples. As you progress, you'll build self-learning, reliable systems to handle cybersecurity tasks such as identifying malicious URLs, spam email detection, intrusion detection, network protection, and tracking user and process behavior. Later, you'll apply generative adversarial networks (GANs) and autoencoders to advanced security tasks. Finally, you'll delve into secure and private AI to protect the privacy rights of consumers using your ML models. By the end of this book, you'll have the skills you need to tackle real-world problems faced in the cybersecurity domain using a recipe-based approach.
Table of Contents (11 chapters)

Performing clustering using scikit-learn

Clustering is a collection of unsupervised machine learning algorithms in which parts of the data are grouped based on similarity. For example, clusters might consist of data that is close together in n-dimensional Euclidean space. Clustering is useful in cybersecurity for distinguishing between normal and anomalous network activity, and for helping to classify malware into families.
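
To make "close together in n-dimensional Euclidean space" concrete, here is a minimal sketch using NumPy on made-up points. The array values and the names points and centers are illustrative assumptions, not part of this recipe's dataset. Each point is assigned to the nearer of two reference centers:

```python
import numpy as np

# Hypothetical 3-dimensional points: two near the origin, two near (10, 10, 10).
points = np.array([
    [0.0, 0.5, 0.2],
    [0.3, 0.1, 0.4],
    [10.0, 9.5, 10.2],
    [9.8, 10.1, 9.9],
])

# Two hypothetical reference centers.
centers = np.array([
    [0.0, 0.0, 0.0],
    [10.0, 10.0, 10.0],
])

# distances[i, j] is the Euclidean distance from point i to center j.
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Each point joins the group of its nearest center.
groups = distances.argmin(axis=1)
print(groups)  # [0 0 1 1]
```

A clustering algorithm such as K-means goes one step further: it also has to discover good centers, rather than being handed them.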

Getting ready

Preparation for this recipe consists of installing the scikit-learn, pandas, and plotly packages using pip. The command for this is as follows:

pip install scikit-learn plotly pandas

In addition, a dataset named file_pe_headers.csv is provided in the repository for this recipe.

How to do it...

In the following steps, we will demonstrate how scikit-learn's K-means clustering algorithm performs on a toy PE malware classification problem:

  1. Start by importing and plotting the dataset:
import pandas as pd
import plotly.express as px

df = pd.read_csv("file_pe_headers.csv", sep=",")
fig = px.scatter_3d(
    df,
    x="SuspiciousImportFunctions",
    y="SectionsLength",
    z="SuspiciousNameSection",
    color="Malware",
)
fig.show()

The following screenshot shows the output:

  2. Extract the features and target labels:
y = df["Malware"]
X = df.drop(["Name", "Malware"], axis=1).to_numpy()
  3. Next, import scikit-learn's clustering module and fit a K-means model with two clusters to the data:
from sklearn.cluster import KMeans

estimator = KMeans(n_clusters=len(set(y)))
estimator.fit(X)
  4. Predict the cluster of each sample using the fitted model:
y_pred = estimator.predict(X)
df["pred"] = y_pred
df["pred"] = df["pred"].astype("category")
  5. To see how the algorithm did, plot the algorithm's clusters:
fig = px.scatter_3d(
    df,
    x="SuspiciousImportFunctions",
    y="SectionsLength",
    z="SuspiciousNameSection",
    color="pred",
)
fig.show()

The following screenshot shows the output:

The results are not perfect, but we can see that the clustering algorithm captured much of the structure in the dataset.
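
One way to put a number on "not perfect" is the adjusted Rand index, which compares cluster assignments with known labels and ignores how the clusters happen to be numbered. The sketch below uses synthetic data from make_blobs as a stand-in, since the recipe's CSV is not reproduced here; the names X_demo, y_demo, and demo_estimator are assumptions for illustration. With the recipe's own variables, the equivalent call would be adjusted_rand_score(y, y_pred).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the PE header features: two well-separated groups.
X_demo, y_demo = make_blobs(
    n_samples=200,
    centers=[[-5, -5, -5], [5, 5, 5]],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-means and get a cluster index for every sample in one call.
demo_estimator = KMeans(n_clusters=2, n_init=10, random_state=0)
demo_pred = demo_estimator.fit_predict(X_demo)

# 1.0 means the clusters match the labels exactly (up to renumbering);
# values near 0 mean the agreement is no better than chance.
ari = adjusted_rand_score(y_demo, demo_pred)
print(f"Adjusted Rand index: {ari:.3f}")
```

Because K-means numbers its clusters arbitrarily, cluster 0 may correspond to either class. A metric that is invariant to renumbering avoids having to match cluster indices to class labels by hand.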

How it works...

We start by importing our dataset of PE header information from a collection of samples (step 1). This dataset consists of two classes of PE files: malware and benign. We then use plotly to create an interactive 3D scatter plot of three of the features, colored by class (step 1). We proceed to prepare our dataset for machine learning. Specifically, in step 2, we set X as the features and y as the classes of the dataset. Because there are two classes, we aim to cluster the data into two groups that will match the sample classification. We utilize the K-means algorithm (step 3), about which you can find more information at: https://en.wikipedia.org/wiki/K-means_clustering. Note that K-means is unsupervised: the labels in y are used only to choose the number of clusters, not to fit the model. With the model fitted, we use it to predict which cluster each sample belongs to (step 4). There is no separate testing set in this recipe, so these predictions are for the same samples the model was fitted on. Observing our results in step 5, we see that the clusters correspond closely, though not perfectly, to the two classes plotted in step 1, which suggests that the clustering has captured much of the underlying structure.
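
For intuition about what happens inside step 3, here is a minimal NumPy sketch of Lloyd's algorithm, the standard iterative procedure behind K-means. It is an illustration only, not scikit-learn's implementation, which adds a smarter initialization and other refinements. The function name lloyd_kmeans, its parameters, and the toy data are all hypothetical.

```python
import numpy as np


def lloyd_kmeans(X, n_clusters, n_iter=100, seed=0):
    """Return (centers, labels) from a bare-bones K-means run."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking distinct data points at random.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged: centers stopped moving
        centers = new_centers
    return centers, labels


# Toy data (an assumption for illustration): two tight groups in 2D.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
centers, labels = lloyd_kmeans(X_toy, n_clusters=2)
print(labels)
```

The two alternating steps, assigning points to the nearest center and then moving each center to the mean of its points, repeat until the centers stop changing. Which group ends up as cluster 0 depends on the random initialization.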