Machine Learning for Cybersecurity Cookbook

By: Emmanuel Tsukerman

Overview of this book

Organizations today face a major threat in terms of cybersecurity, from malicious URLs to credential reuse, and having robust security systems can make all the difference. With this book, you'll learn how to use Python libraries such as TensorFlow and scikit-learn to implement the latest artificial intelligence (AI) techniques and handle challenges faced by cybersecurity researchers. You'll begin by exploring various machine learning (ML) techniques and tips for setting up a secure lab environment. Next, you'll implement key ML algorithms such as clustering, gradient boosting, random forest, and XGBoost. The book will guide you through constructing classifiers and features for malware, which you'll train and test on real samples. As you progress, you'll build self-learning, reliant systems to handle cybersecurity tasks such as identifying malicious URLs, spam email detection, intrusion detection, network protection, and tracking user and process behavior. Later, you'll apply generative adversarial networks (GANs) and autoencoders to advanced security tasks. Finally, you'll delve into secure and private AI to protect the privacy rights of consumers using your ML models. By the end of this book, you'll have the skills you need to tackle real-world problems faced in the cybersecurity domain using a recipe-based approach.

Generating text using Markov chains

Markov chains are simple stochastic models in which a system can exist in a number of states. To know the probability distribution of where the system will be next, it suffices to know where it currently is. This is in contrast with a system in which the probability distribution of the subsequent state may depend on the past history of the system. This simplifying assumption makes Markov chains easy to apply in many domains, often with surprisingly good results.
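To make the defining property concrete, here is a minimal sketch of a two-state Markov chain. The weather states and their transition probabilities are invented for illustration and do not come from the book; the point is only that next_state looks at the current state and nothing earlier.

```python
import random

# Hypothetical two-state model; the probabilities are made up for illustration.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current, rng):
    """Sample the next state using only the current state (the Markov property)."""
    options = TRANSITIONS[current]
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]

def simulate(start, steps, seed=0):
    """Return a list of steps + 1 states, beginning with start."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same path
    path = [start]
    for _ in range(steps):
        path.append(next_state(path[-1], rng))
    return path

print(simulate("sunny", 10))
```

A text model works the same way, except that the states are words (or short runs of words) and the transition probabilities are estimated from a corpus rather than written by hand.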

In this recipe, we will utilize Markov chains to generate fake reviews, which is useful for pen-testing a review system's spam detector. In a later recipe, you will upgrade the technology from Markov chains to RNNs.

Getting ready

Preparation for this recipe consists of installing the markovify and pandas packages using pip. The command for this is as follows:

pip install markovify pandas

In addition, the directory in the repository for this chapter includes a CSV dataset, airport_reviews.csv, which should be placed alongside the code for the chapter.

How to do it...

Let's see how to generate text using Markov chains by performing the following steps:

Start by importing the markovify and pandas libraries and loading the dataset whose style we would like to imitate:

import markovify
import pandas as pd

df = pd.read_csv("airport_reviews.csv")

As an illustration, I have chosen a collection of airport reviews as my text:

"The airport is certainly tiny! ..."

Next, join the individual reviews into one large text string and build a Markov chain model using the airport review text:

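N = 100
# Take the first N reviews, dropping missing entries and forcing string type
review_subset = df["content"][0:N].dropna().astype(str)
# Join with a space so the end of one review does not fuse with the start of the next
text = " ".join(review_subset)
markov_chain_model = markovify.Text(text)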

Behind the scenes, the library computes the word transition probabilities from the text.

Generate five sentences using the Markov chain model:

for i in range(5):
    print(markov_chain_model.make_sentence())

Since we are using airport reviews, we will have the following as the output after executing the previous code:

On the positive side it's a clean airport transfer from A to C gates and outgoing gates is truly enormous - but why when we arrived at about 7.30 am for our connecting flight to Venice on TAROM.
The only really bother: you may have to wait in a polite manner.
Why not have bus after a short wait to check-in there were a lots of shops and less seating.
Very inefficient and hostile airport. This is one of the time easy to access at low price from city center by train.
The distance between the incoming gates and ending with dirty and always blocked by never ending roadworks.

The results are surprisingly realistic, although the generated reviews would have to be filtered down to the best ones.

Generate three sentences, each no longer than 140 characters:

for i in range(3):
    print(markov_chain_model.make_short_sentence(140))

With our running example, we will see the following output:

However airport staff member told us that we were put on a connecting code share flight.
Confusing in the check-in agent was friendly.
I am definitely not keen on coming to the lack of staff . Lack of staff . Lack of staff at boarding pass at check-in.
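The remark in step 2 that the library computes word transition probabilities behind the scenes can be illustrated with a small standard-library sketch. This is not markovify's actual implementation (which, among other things, also tracks where sentences begin and end); the function name transition_probabilities and the toy corpus are assumptions made for this example.

```python
from collections import Counter, defaultdict

def transition_probabilities(sentences, state_size=2):
    """Map each state (a tuple of state_size words) to {next_word: probability}."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        # Slide a window over the sentence: state -> the word that follows it
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            counts[state][words[i + state_size]] += 1
    # Normalize the raw counts of each state into probabilities
    return {
        state: {word: n / sum(followers.values()) for word, n in followers.items()}
        for state, followers in counts.items()
    }

# Toy corpus invented for this example (not taken from airport_reviews.csv).
corpus = [
    "the airport is small",
    "the airport is clean",
    "the airport is small and clean",
]
probs = transition_probabilities(corpus)
print(probs[("airport", "is")])
```

In this toy corpus, the state ("airport", "is") is followed by "small" twice and by "clean" once, so the model would pick "small" about two-thirds of the time. Generating a sentence amounts to repeatedly sampling the next word from such a table.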

How it works...

We begin the recipe by importing markovify, a library for Markov chain computations, and reading in the text that will inform our Markov model (step 1). In step 2, we create a Markov chain model using the text. The following is a relevant snippet from the initialization code of the library's Text class:

class Text(object):

    reject_pat = re.compile(r"(^')|('$)|\s'|'\s|[\"(\(\)\[\])]")

    def __init__(self, input_text, state_size=2, chain=None, parsed_sentences=None, retain_original=True, well_formed=True, reject_reg=''):
        """
        input_text: A string.
        state_size: An integer, indicating the number of words in the model's state.
        chain: A trained markovify.Chain instance for this text, if pre-processed.
        parsed_sentences: A list of lists, where each outer list is a "run"
              of the process (e.g. a single sentence), and each inner list
              contains the steps (e.g. words) in the run. If you want to simulate
              an infinite process, you can come very close by passing just one, very
              long run.
        retain_original: Indicates whether to keep the original corpus.
        well_formed: Indicates whether sentences should be well-formed, preventing
              unmatched quotes, parenthesis by default, or a custom regular expression
              can be provided.
        reject_reg: If well_formed is True, this can be provided to override the
              standard rejection pattern.
        """

The most important parameter to understand is state_size = 2, which means that each state of the Markov chain is a pair of consecutive words, and the model estimates the probability of the next word given that pair. Increasing this parameter produces more realistic sentences, at the cost of making them less original, because longer states leave fewer possible continuations. Next, we apply the Markov chain we have trained to generate a few example sentences (steps 3 and 4). The output shows that the model has captured the tone and style of the text. Finally, in step 5, we create a few tweet-length sentences (at most 140 characters) in the style of the airport reviews.
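The trade-off controlled by state_size can be seen in a small standard-library sketch. It counts, for each state, how many distinct words can follow it: the fewer the choices, the more closely generated text must track the source. The helper average_choices and the toy corpus are hypothetical and are not part of markovify; the sketch also assumes every corpus has at least one sentence longer than state_size words.

```python
from collections import defaultdict

def average_choices(sentences, state_size):
    """Average number of distinct next words available per state."""
    followers = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            followers[state].add(words[i + state_size])
    return sum(len(f) for f in followers.values()) / len(followers)

# Toy corpus invented for this example.
corpus = [
    "the staff is friendly",
    "the airport is small",
    "the airport is clean and the staff is helpful",
]
for k in (1, 2):
    print(k, average_choices(corpus, k))
```

On this corpus the average number of continuations drops when moving from single-word states to two-word states. On a real corpus the same effect means that a large state_size tends to reproduce long stretches of the original reviews, while a small one mixes them more freely but less grammatically.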
