## Evaluation of machine learning algorithms and feature engineering procedures

It is important to note that the literature often draws a sharp distinction between the terms *features* and *attributes*. The term **attribute** is generally given to any column in tabular data, while the term **feature** is generally reserved for attributes that contribute to the success of machine learning algorithms. That is to say, some attributes can be unhelpful or even hurtful to our machine learning systems. For example, when predicting how long a used car will last before requiring servicing, the color of the car will probably not be very indicative of this value.

In this book, we will generally refer to all columns as features until they are proven to be unhelpful or hurtful. When this happens, we will usually cast those attributes aside in the code. It is extremely important, then, to consider the basis for this decision. How does one evaluate a machine learning system and then use this evaluation to perform feature engineering?

### Example of feature engineering procedures – can anyone really predict the weather?

Consider a machine learning pipeline that was built to predict the weather. For the sake of simplicity in our introductory chapter, assume that our algorithm takes in atmospheric data directly from sensors and is set up to predict one of two values, *sun* or *rain*. This is, then, clearly a classification pipeline that can only output one of two answers. We will run this algorithm at the beginning of every day. If the algorithm outputs *sun* and the day is mostly sunny, the algorithm was correct; likewise, if the algorithm predicts *rain* and the day is mostly rainy, the algorithm was correct. In any other instance, the algorithm would be considered incorrect. If we run the algorithm every day for a month, we will obtain about 30 pairs of predicted and actual, observed weather, from which we can calculate the algorithm's accuracy. Perhaps the algorithm predicted correctly on 20 out of the 30 days, leading us to label it with an accuracy of two out of three, or about 67%. Using this standardized value of accuracy, we could tweak our algorithm and see whether the accuracy goes up or down.
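The accuracy calculation described above is simple enough to sketch directly in Python. The forecast and observation values here are made up purely for illustration, arranged so that exactly 20 of the 30 days match:

```python
# A hypothetical month of daily forecasts versus observed weather.
# These values are invented for illustration only.
predicted = ['sun'] * 12 + ['rain'] * 8 + ['sun'] * 6 + ['rain'] * 4
observed  = ['sun'] * 12 + ['rain'] * 8 + ['rain'] * 6 + ['sun'] * 4

# Count the days on which the forecast matched the observation
correct = sum(p == o for p, o in zip(predicted, observed))
accuracy = correct / len(observed)
print(correct, accuracy)  # 20 correct days out of 30, or about 0.67
```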

Of course, this is an oversimplification, but the idea is that any machine learning pipeline is essentially useless if we cannot evaluate its performance using a set of standard metrics, and therefore feature engineering, as applied to the improvement of machine learning, is impossible without said evaluation procedure. Throughout this book, we will revisit this idea of evaluation; however, let's talk briefly about how, in general, we will approach it.

A topic in feature engineering will usually involve transforming our dataset (as per the definition of feature engineering). In order to definitively say whether or not a particular feature engineering procedure has helped our machine learning algorithm, we will follow the steps detailed in the following section.

### Steps to evaluate a feature engineering procedure

Here are the steps to evaluate a feature engineering procedure:

- Obtain a baseline performance of the machine learning model before applying any feature engineering procedures
- Apply feature engineering and combinations of feature engineering procedures
- For each application of feature engineering, obtain a performance measure and compare it to our baseline performance
- If the delta (change in) performance exceeds a threshold (usually defined by the human), we deem that procedure helpful and apply it to our machine learning pipeline
- This change in performance will usually be measured as a percentage (if the baseline went from 40% accuracy to 76% accuracy, that is a 90% improvement)
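The steps above can be sketched in code. Here is a minimal sketch using scikit-learn's built-in breast cancer dataset and standardization as a stand-in feature engineering procedure; both choices are ours, purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Step 1: baseline performance before any feature engineering
baseline = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()

# Step 2: apply a candidate feature engineering procedure
# (in practice, you would fit the scaler inside each fold, e.g. with a Pipeline)
X_engineered = StandardScaler().fit_transform(X)

# Step 3: re-measure performance and compare it to the baseline
engineered = cross_val_score(model, X_engineered, y, cv=5, scoring='accuracy').mean()

# Step 4: express the change as a percentage of the baseline
delta = (engineered - baseline) / baseline
print(baseline, engineered, delta)
```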

What counts as performance varies between machine learning algorithms. Most good primers on machine learning will tell you that there are dozens of accepted metrics in data science practice.

In our case, because the focus of this book is not machine learning itself but rather the understanding and transformation of features, we will use baseline machine learning algorithms and associated baseline metrics in order to evaluate feature engineering procedures.

### Evaluating supervised learning algorithms

When performing predictive modeling, otherwise known as **supervised learning**, performance is directly tied to the model’s ability to exploit structure in the data and use that structure to make appropriate predictions. In general, we can further break down supervised learning into two more specific types, **classification** (predicting qualitative responses) and **regression** (predicting quantitative responses).

When we are evaluating classification problems, we will directly calculate the accuracy of a logistic regression model using a five-fold cross-validation:

```python
# Example code for evaluating a classification problem
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = some_data_in_tabular_format
y = response_variable
lr = LogisticRegression()
scores = cross_val_score(lr, X, y, cv=5, scoring='accuracy')
scores
>> [.765, .67, .8, .62, .99]
```
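For readers who want to run this pattern end to end, here is a self-contained version using scikit-learn's built-in iris dataset (the dataset choice is ours, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
lr = LogisticRegression(max_iter=1000)

# Five accuracy values, one per cross-validation fold
scores = cross_val_score(lr, X, y, cv=5, scoring='accuracy')
print(scores)
print(scores.mean())  # the average accuracy across the five folds
```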

Similarly, when evaluating a regression problem, we will use the **mean squared error** (**MSE**) of a linear regression using a five-fold cross-validation:

```python
# Example code for evaluating a regression problem
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = some_data_in_tabular_format
y = response_variable
lr = LinearRegression()
# scikit-learn scorers follow a "higher is better" convention,
# so MSE is reported as its negative; we flip the sign back
scores = -cross_val_score(lr, X, y, cv=5, scoring='neg_mean_squared_error')
scores
>> [31.543, 29.5433, 32.543, 32.43, 27.5432]
```
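A runnable version of the same pattern, using synthetic regression data generated for illustration only:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with Gaussian noise, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
lr = LinearRegression()

# The scorer returns negative MSE ("higher is better"), so negate it
mse_per_fold = -cross_val_score(lr, X, y, cv=5, scoring='neg_mean_squared_error')
print(mse_per_fold)
print(mse_per_fold.mean())
```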

We will use these two linear models instead of newer, more advanced models for their speed and their low variance. This way, we can be surer that any increase in performance is directly related to the feature engineering procedure and not to the model’s ability to pick up on obscure and hidden patterns.

### Evaluating unsupervised learning algorithms

This is a bit trickier. Because unsupervised learning is not concerned with predictions, we cannot directly evaluate performance based on how well the model can predict a value. That being said, if we are performing a cluster analysis, such as in the previous marketing segmentation example, then we will usually utilize the **silhouette coefficient** (a measure of separation and cohesion of clusters between -1 and 1) and some human-driven analysis to decide if a feature engineering procedure has improved model performance or if we are merely wasting our time.

Here is an example of using Python and scikit-learn to import and calculate the silhouette coefficient for some fake data:

```python
from sklearn.metrics import silhouette_score

attributes = tabular_data
cluster_labels = outputted_labels_from_clustering
silhouette_score(attributes, cluster_labels)
```

We will spend much more time on unsupervised learning later on in this book as it becomes more relevant. Most of our examples will revolve around predictive analytics/supervised learning.

### Note

It is important to remember that the reason we are standardizing algorithms and metrics is so that we may showcase the power of feature engineering and so that you may repeat our procedures with success. Practically, it is conceivable that you are optimizing for something other than accuracy (such as a true positive rate, for example) and wish to use decision trees instead of logistic regression. This is not only fine but encouraged. You should always remember though to follow the steps to evaluating a feature engineering procedure and compare baseline and post-engineering performance.

It is possible that you are not reading this book for the purposes of improving machine learning performance. Feature engineering is useful in other domains, such as hypothesis testing and general statistics. In a few examples in this book, we will be taking a look at feature engineering and data transformations as applied to the statistical significance of various statistical tests. We will be exploring metrics such as *R²* and p-values in order to make judgments about how our procedures are helping.

In general, we will quantify the benefits of feature engineering in the context of three categories:

- **Supervised learning**: Otherwise known as **predictive analytics**:
    - Regression analysis—predicting a *quantitative* variable:
        - Will utilize MSE as our primary metric of measurement
    - Classification analysis—predicting a *qualitative* variable:
        - Will utilize accuracy as our primary metric of measurement
- **Unsupervised learning**: Clustering—the assigning of meta-attributes as denoted by the behavior of data:
    - Will utilize the silhouette coefficient as our primary metric of measurement
- **Statistical testing**: Using correlation coefficients, t-tests, chi-squared tests, and others to evaluate and quantify the usefulness of our raw and transformed data

In the following few sections, we will look at what will be covered throughout this book.