Book Image

Machine Learning for Cybersecurity Cookbook

By : Emmanuel Tsukerman
Book Image

Machine Learning for Cybersecurity Cookbook

By: Emmanuel Tsukerman

Overview of this book

Organizations today face a major threat in terms of cybersecurity, from malicious URLs to credential reuse, and having robust security systems can make all the difference. With this book, you'll learn how to use Python libraries such as TensorFlow and scikit-learn to implement the latest artificial intelligence (AI) techniques and handle challenges faced by cybersecurity researchers. You'll begin by exploring various machine learning (ML) techniques and tips for setting up a secure lab environment. Next, you'll implement key ML algorithms such as clustering, gradient boosting, random forest, and XGBoost. The book will guide you through constructing classifiers and features for malware, which you'll train and test on real samples. As you progress, you'll build self-learning, reliant systems to handle cybersecurity tasks such as identifying malicious URLs, spam email detection, intrusion detection, network protection, and tracking user and process behavior. Later, you'll apply generative adversarial networks (GANs) and autoencoders to advanced security tasks. Finally, you'll delve into secure and private AI to protect the privacy rights of consumers using your ML models. By the end of this book, you'll have the skills you need to tackle real-world problems faced in the cybersecurity domain using a recipe-based approach.
Table of Contents (11 chapters)

Hyperparameter tuning with scikit-optimize

In machine learning, a hyperparameter is a parameter whose value is set before the training process begins. For example, the choice of learning rate of a gradient boosting model and the size of the hidden layer of a multilayer perceptron, are both examples of hyperparameters. By contrast, the values of other parameters are derived via training. Hyperparameter selection is important because it can have a huge effect on the model's performance.

The most basic approach to hyperparameter tuning is called a grid search. In this method, you specify a range of potential values for each hyperparameter, and then try them all out, until you find the best combination. This brute-force approach is comprehensive but computationally intensive. More sophisticated methods exist. In this recipe, you will learn how to use Bayesian optimization over hyperparameters using scikit-optimize. In contrast to a basic grid search, in Bayesian optimization, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from specified distributions. More details can be found at https://scikit-optimize.github.io/notebooks/bayesian-optimization.html.

Getting ready

The preparation for this recipe consists of installing a specific version of scikit-learn, installing xgboost, and installing scikit-optimize in pip. The command for this is as follows:

pip install scikit-learn==0.20.3 xgboost scikit-optimize pandas

How to do it...

In the following steps, you will load the standard wine dataset and use Bayesian optimization to tune the hyperparameters of an XGBoost model:

  1. Load the wine dataset from scikit-learn:
from sklearn import datasets

wine_dataset = datasets.load_wine()
X = wine_dataset.data
y = wine_dataset.target
  1. Import XGBoost and stratified K-fold:
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
  1.  Import BayesSearchCV from scikit-optimize and specify the number of parameter settings to test:
from skopt import BayesSearchCV

n_iterations = 50
  1. Specify your estimator. In this case, we select XGBoost and set it to be able to perform multi-class classification:
estimator = xgb.XGBClassifier(
n_jobs=-1,
objective="multi:softmax",
eval_metric="merror",
verbosity=0,
num_class=len(set(y)),
)

  1. Specify a parameter search space:
search_space = {
"learning_rate": (0.01, 1.0, "log-uniform"),
"min_child_weight": (0, 10),
"max_depth": (1, 50),
"max_delta_step": (0, 10),
"subsample": (0.01, 1.0, "uniform"),
"colsample_bytree": (0.01, 1.0, "log-uniform"),
"colsample_bylevel": (0.01, 1.0, "log-uniform"),
"reg_lambda": (1e-9, 1000, "log-uniform"),
"reg_alpha": (1e-9, 1.0, "log-uniform"),
"gamma": (1e-9, 0.5, "log-uniform"),
"min_child_weight": (0, 5),
"n_estimators": (5, 5000),
"scale_pos_weight": (1e-6, 500, "log-uniform"),
}
  1. Specify the type of cross-validation to perform:
cv = StratifiedKFold(n_splits=3, shuffle=True)
  1. Define BayesSearchCV using the settings you have defined:
bayes_cv_tuner = BayesSearchCV(
estimator=estimator,
search_spaces=search_space,
scoring="accuracy",
cv=cv,
n_jobs=-1,
n_iter=n_iterations,
verbose=0,
refit=True,
)

 

  1. Define a callback function to print out the progress of the parameter search:
import pandas as pd
import numpy as np

def print_status(optimal_result):
"""Shows the best parameters found and accuracy attained of the search so far."""
models_tested = pd.DataFrame(bayes_cv_tuner.cv_results_)
best_parameters_so_far = pd.Series(bayes_cv_tuner.best_params_)
print(
"Model #{}\nBest accuracy so far: {}\nBest parameters so far: {}\n".format(
len(models_tested),
np.round(bayes_cv_tuner.best_score_, 3),
bayes_cv_tuner.best_params_,
)
)

clf_type = bayes_cv_tuner.estimator.__class__.__name__
models_tested.to_csv(clf_type + "_cv_results_summary.csv")

 

  1. Perform the parameter search:
result = bayes_cv_tuner.fit(X, y, callback=print_status)

 

As you can see, the following shows the output:

Model #1
Best accuracy so far: 0.972
Best parameters so far: {'colsample_bylevel': 0.019767840658391753, 'colsample_bytree': 0.5812505808116454, 'gamma': 1.7784704701058755e-05, 'learning_rate': 0.9050859661329937, 'max_delta_step': 3, 'max_depth': 42, 'min_child_weight': 1, 'n_estimators': 2334, 'reg_alpha': 0.02886003776717955, 'reg_lambda': 0.0008507166793122457, 'scale_pos_weight': 4.801764874750116e-05, 'subsample': 0.7188797743009225}

Model #2
Best accuracy so far: 0.972
Best parameters so far: {'colsample_bylevel': 0.019767840658391753, 'colsample_bytree': 0.5812505808116454, 'gamma': 1.7784704701058755e-05, 'learning_rate': 0.9050859661329937, 'max_delta_step': 3, 'max_depth': 42, 'min_child_weight': 1, 'n_estimators': 2334, 'reg_alpha': 0.02886003776717955, 'reg_lambda': 0.0008507166793122457, 'scale_pos_weight': 4.801764874750116e-05, 'subsample': 0.7188797743009225}

<snip>

Model #50
Best accuracy so far: 0.989
Best parameters so far: {'colsample_bylevel': 0.013417868502558758, 'colsample_bytree': 0.463490250419848, 'gamma': 2.2823050161337873e-06, 'learning_rate': 0.34006478878384533, 'max_delta_step': 9, 'max_depth': 41, 'min_child_weight': 0, 'n_estimators': 1951, 'reg_alpha': 1.8321791726476395e-08, 'reg_lambda': 13.098734837402576, 'scale_pos_weight': 0.6188077759379964, 'subsample': 0.7970035272497132}

How it works...

In steps 1 and 2, we import a standard dataset, the wine dataset, as well as the libraries needed for classification. A more interesting step follows, in which we specify how long we would like the hyperparameter search to be, in terms of a number of combinations of parameters to try. The longer the search, the better the results, at the risk of overfitting and extending the computational time. In step 4, we select XGBoost as the model, and then specify the number of classes, the type of problem, and the evaluation metric. This part will depend on the type of problem. For instance, for a regression problem, we might set eval_metric = 'rmse' and drop num_class together.

Other models than XGBoost can be selected with the hyperparameter optimizer as well. In the next step, (step 5), we specify a probability distribution over each parameter that we will be exploring. This is one of the advantages of using BayesSearchCV over a simple grid search, as it allows you to explore the parameter space more intelligently. Next, we specify our cross-validation scheme (step 6). Since we are performing a classification problem, it makes sense to specify a stratified fold. However, for a regression problem, StratifiedKFold should be replaced with KFold.

Also note that a larger splitting number is preferred for the purpose of measuring results, though it will come at a computational price. In step 7, you can see additional settings that can be changed. In particular, n_jobs allows you to parallelize the task. The verbosity and the method used for scoring can be altered as well. To monitor the search process and the performance of our hyperparameter tuning, we define a callback function to print out the progress in step 8. The results of the grid search are also saved in a CSV file. Finally, we run the hyperparameter search (step 9). The output allows us to observe the parameters and the performance of each iteration of the hyperparameter search.

In this book, we will refrain from tuning the hyperparameters of classifiers. The reason is in part brevity, and in part because hyperparameter tuning here would be premature optimization, as there is no specified requirement or goal for the performance of the algorithm from the end user. Having seen how to perform it here, you can easily adapt this recipe to the application at hand.

Another prominent library for hyperparameter tuning to keep in mind is hyperopt.