Machine Learning for Cybersecurity Cookbook

By: Emmanuel Tsukerman

Overview of this book

Organizations today face a major threat in terms of cybersecurity, from malicious URLs to credential reuse, and having robust security systems can make all the difference. With this book, you'll learn how to use Python libraries such as TensorFlow and scikit-learn to implement the latest artificial intelligence (AI) techniques and handle challenges faced by cybersecurity researchers. You'll begin by exploring various machine learning (ML) techniques and tips for setting up a secure lab environment. Next, you'll implement key ML algorithms such as clustering, gradient boosting, random forest, and XGBoost. The book will guide you through constructing classifiers and features for malware, which you'll train and test on real samples. As you progress, you'll build self-learning, reliant systems to handle cybersecurity tasks such as identifying malicious URLs, spam email detection, intrusion detection, network protection, and tracking user and process behavior. Later, you'll apply generative adversarial networks (GANs) and autoencoders to advanced security tasks. Finally, you'll delve into secure and private AI to protect the privacy rights of consumers using your ML models. By the end of this book, you'll have the skills you need to tackle real-world problems faced in the cybersecurity domain using a recipe-based approach.

Train-test-splitting your data

In machine learning, our goal is to create a program that is able to perform tasks it has never been explicitly taught to perform. The way we do that is to use data we have collected to train, or fit, a mathematical or statistical model. The data used to fit the model is referred to as training data. The resulting trained model is then used to make predictions on future, previously unseen data. In this way, the program is able to manage new situations without human intervention.

One of the major challenges for a machine learning practitioner is the danger of overfitting – creating a model that performs well on the training data but is not able to generalize to new, previously-unseen data. In order to combat the problem of overfitting, machine learning practitioners set aside a portion of the data, called test data, and use it only to assess the performance of the trained model, as opposed to including it as part of the training dataset. This careful setting aside of testing sets is key to training classifiers in cybersecurity, where overfitting is an omnipresent danger. One small oversight, such as using only benign data from one locale, can lead to a poor classifier.
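To make the danger concrete, the following is a minimal sketch of overfitting on synthetic data. The dataset (generated with make_classification and deliberately noisy labels) and the choice of an unconstrained decision tree are assumptions made purely for illustration; they are not part of this recipe's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (an assumption for illustration only):
# flip_y=0.3 assigns a random class to roughly 30% of the samples.
X, y = make_classification(
    n_samples=500, n_features=20, flip_y=0.3, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31
)

# An unconstrained tree keeps splitting until it fits the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"training accuracy: {train_acc:.2f}")
print(f"testing accuracy:  {test_acc:.2f}")
```

The gap between the two printed accuracies is what the held-out testing set is there to reveal: the training accuracy alone would suggest a far better model than we actually have.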

There are various other ways to validate model performance, such as cross-validation. For simplicity, we will focus mainly on train-test splitting.
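For reference, a rough sketch of k-fold cross-validation with scikit-learn's cross_val_score is given here. The synthetic data, the logistic regression model, and the choice of five folds are all assumptions for illustration, not choices made by this recipe.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical; not the recipe's dataset).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# cv=5 splits the data into five folds; each fold serves once as the
# held-out set while the model is fit on the remaining four.
scores = cross_val_score(model, X, y, cv=5)

print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Each sample is used for evaluation exactly once, which tends to give a steadier estimate than a single split, at the cost of fitting the model several times.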

Getting ready

Preparation for this recipe consists of installing the scikit-learn and pandas packages using pip. The command for this is as follows:

pip install scikit-learn pandas

In addition, we have included the north_korea_missile_test_database.csv dataset for use in this recipe.

How to do it...

The following steps demonstrate how to take a dataset, consisting of features X and labels y, and split it into training and testing subsets:

Start by importing the train_test_split function and the pandas library, then read your features into X and your labels into y:

from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("north_korea_missile_test_database.csv")
y = df["Missile Name"]
X = df.drop("Missile Name", axis=1)

Next, randomly split the dataset and its labels into a training set consisting of 80% of the original dataset and a testing set consisting of the remaining 20%:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31
)

We apply the train_test_split function once more, this time to the training set, to obtain a validation set, X_val and y_val:

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=31
)

We end up with a training set that's 60% of the size of the original data, a validation set of 20%, and a testing set of 20%.
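The 60/20/20 proportions can be checked with the len function. The sketch below does so on a hypothetical 100-row DataFrame rather than on the recipe's CSV file; the column names feature_a, feature_b, and label are placeholders invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the recipe's CSV: 100 rows, two features, one label.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "feature_a": rng.normal(size=100),
        "feature_b": rng.normal(size=100),
        "label": rng.integers(0, 3, size=100),
    }
)
y = df["label"]
X = df.drop("label", axis=1)

# Same two-stage split as in the recipe.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=31
)

# Confirm the sizes of the three subsets.
print(len(X_train), len(X_val), len(X_test))  # prints: 60 20 20
```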

The following screenshot shows the output:

How it works...

In step 1, we read in our dataset, which consists of historical and continuing missile experiments in North Korea. We aim to predict the type of missile from the remaining features, such as facility and time of launch.

In step 2, we apply scikit-learn's train_test_split function to subdivide X and y into a training set, X_train and y_train, and a testing set, X_test and y_test. The test_size=0.2 parameter means that the testing set consists of 20% of the original data, while the remainder is placed in the training set. The random_state parameter seeds the random shuffle, which allows us to reproduce the same split on every run.

Concerning step 3, it is important to note that, in applications, we often want to compare several different models. The danger of using the testing set to select the best model is that we may end up overfitting the testing set. This is similar to the statistical sin of data fishing. In order to combat this danger, we create an additional dataset, called the validation set. We train our models on the training set, use the validation set to compare them, and finally use the testing set to obtain an accurate indicator of the performance of the model we have chosen. So, in step 3, we set test_size=0.25 on the remaining 80% of the data. Since 0.25 × 0.8 = 0.2, the end result consists of a training set of 60% of the original dataset, a validation set of 20%, and a testing set of 20%.

Finally, in step 4, we double-check our assumptions by employing the len function to compute the length of the arrays.
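The train, validate, and test workflow described above might look as follows in practice. This is only a sketch under stated assumptions: the data is synthetic, and the two candidate models (a logistic regression and a shallow decision tree) are arbitrary examples rather than models used in this recipe.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the recipe's dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 60% training, 20% validation, 20% testing, as in the recipe.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=31
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=31
)

# Candidate models to compare (arbitrary choices for illustration).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Fit each candidate on the training set and score it on the validation set.
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = model.score(X_val, y_val)

# Select on validation accuracy, then touch the testing set exactly once.
best_name = max(val_scores, key=val_scores.get)
test_score = candidates[best_name].score(X_test, y_test)

print("validation accuracy:", val_scores)
print("selected model:", best_name)
print("testing accuracy of selected model:", test_score)
```

Because the testing set plays no part in choosing between the candidates, the final score is not biased by the selection step.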
