Predicting classification
You learned that XGBoost may have an edge in regression, but what about classification? XGBoost has a classification model, but will it perform as accurately as well tested classification models such as logistic regression? Let's find out.
What is classification?
Unlike with regression, when predicting target columns with a limited number of outputs, a machine learning algorithm is categorized as a classification algorithm. The possible outputs may include the following:
Yes, No
Spam, Not Spam
0, 1
Red, Blue, Green, Yellow, Orange
Dataset 2 – The census
We will move a little more swiftly through the second dataset, the Census Income Data Set (https://archive.ics.uci.edu/ml/datasets/Census+Income), to predict personal income.
Data wrangling
Before implementing machine learning, the dataset must be preprocessed. When testing new algorithms, it's essential to have all numerical columns with no null values.
Data loading
Since this dataset is hosted directly on the UCI Machine Learning website, it can be downloaded directly from the internet using pd.read_csv
:
df_census = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data') df_census.head()
Here is the expected output:
The output reveals that the column headings represent the entries of the first row. When this happens, the data may be reloaded with the header=None
parameter:
df_census = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None) df_census.head()
Here is the expected output without the header:
As you can see, the column names are still missing. They are listed on the Census Income Data Set website (https://archive.ics.uci.edu/ml/datasets/Census+Income) under Attribute Information.
Column names may be changed as follows:
df_census.columns=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'] df_census.head()
Here is the expected output with column names:
As you can see, the column names have been restored.
Null values
A great way to check null values is to look at the DataFrame .info()
method:
df_census.info()
The output is as follows:
<class 'pandas.core.frame.DataFrame'> RangeIndex: 32561 entries, 0 to 32560 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 32561 non-null int64 1 workclass 32561 non-null object 2 fnlwgt 32561 non-null int64 3 education 32561 non-null object 4 education-num 32561 non-null int64 5 marital-status 32561 non-null object 6 occupation 32561 non-null object 7 relationship 32561 non-null object 8 race 32561 non-null object 9 sex 32561 non-null object 10 capital-gain 32561 non-null int64 11 capital-loss 32561 non-null int64 12 hours-per-week 32561 non-null int64 13 native-country 32561 non-null object 14 income 32561 non-null object dtypes: int64(6), object(9) memory usage: 3.7+ MB
Since all columns have the same number of non-null rows, we can infer that there are no null values.
Non-numerical columns
All columns of the dtype
object must be transformed into numerical columns. A pandas get_dummies
method takes the non-numerical unique values of every column and converts them into their own column, with 1
indicating presence and 0
indicating absence. For instance, if the column values of a DataFrame called "Book Types" were "hardback," "paperback," or "ebook," pd.get_dummies
would create three new columns called "hardback," "paperback," and "ebook" replacing the "Book Types" column.
Here is a "Book Types" DataFrame:
Here is the same DataFrame after pd.get_dummies
:
pd.get_dummies
will create many new columns, so it's worth checking to see whether any columns may be eliminated. A quick review of the df_census
data reveals an 'education'
column and an education_num
column. The education_num
column is a numerical conversion of 'education'
. Since the information is the same, the 'education'
column may be deleted:
df_census = df_census.drop(['education'], axis=1)
Now use pd.get_dummies
to transform the non-numerical columns into numerical columns:
df_census = pd.get_dummies(df_census) df_census.head()
Here is the expected output:
As you can see, new columns are created using a column_value
syntax referencing the original column. For example, native-country
is an original column, and Taiwan is one of many values. The new native-country_Taiwan
column has a value of 1
if the person is from Taiwan and 0
otherwise.
Tip
Using pd.get_dummies
may increase memory usage, as can be verified using the .info()
method on the DataFrame in question and checking the last line. Sparse matrices may be used to save memory where only values of 1
are stored and values of 0
are not stored. For more information on sparse matrices, see Chapter 10, XGBoost Model Deployment, or visit SciPy's official documentation at https://docs.scipy.org/doc/scipy/reference/.
Target and predictor columns
Since all columns are numerical with no null values, it's time to split the data into target and predictor columns.
The target column is whether or not someone makes 50K. After pd.get_dummies
, two columns, df_census['income_<=50K']
and df_census['income_>50K']
, are used to determine whether someone makes 50K. Since either column will work, we delete df_census['income_ <=50K']
:
df_census = df_census.drop('income_ <=50K', axis=1)
Now split the data into X
(predictor columns) and y
(target column). Note that -1
is used for indexing since the last column is the target column:
X = df_census.iloc[:,:-1]y = df_census.iloc[:,-1]
It's time to build machine learning classifiers!
Logistic regression
Logistic regression is the most fundamental classification algorithm. Mathematically, logistic regression works in a manner similar to linear regression. For each column, logistic regression finds an appropriate weight, or coefficient, that maximizes model accuracy. The primary difference is that instead of summing each term, as in linear regression, logistic regression uses the sigmoid function.
Here is the sigmoid function and the corresponding graph:
The sigmoid is commonly used for classification. All values greater than 0.5 are matched to 1, and all values less than 0.5 are matched to 0.
Implementing logistic regression with scikit-learn is nearly the same as implementing linear regression. The main differences are that the predictor column should fit into categories, and the error should be in terms of accuracy. As a bonus, the error is in terms of accuracy by default, so explicit scoring parameters are not required.
You may import logistic regression as follows:
from sklearn.linear_model import LogisticRegression
The cross-validation function
Let's use cross-validation on logistic regression to predict whether someone makes over 50K.
Instead of copying and pasting, let's build a cross-validation classification function that takes a machine learning algorithm as input and has the accuracy score as output using cross_val_score
:
def cross_val(classifier, num_splits=10): model = classifier scores = cross_val_score(model, X, y, cv=num_splits) print('Accuracy:', np.round(scores, 2)) print('Accuracy mean: %0.2f' % (scores.mean()))
Now call the function with logistic regression:
cross_val(LogisticRegression())
The output is as follows:
Accuracy: [0.8 0.8 0.79 0.8 0.79 0.81 0.79 0.79 0.8 0.8 ] Accuracy mean: 0.80
80% accuracy isn't bad out of the box.
Let's see whether XGBoost can do better.
Tip
Any time you find yourself copying and pasting code, look for a better way! One aim of computer science is to avoid repetition. Writing your own data analysis and machine learning functions will make your life easier and your work more efficient in the long run.
The XGBoost classifier
XGBoost has a regressor and a classifier. To use the classifier, import the following algorithm:
from xgboost import XGBClassifier
Now run the classifier in the cross_val
function with one important addition. Since there are 94 columns, and XGBoost is an ensemble method, meaning that it combines many models for each run, each of which includes 10 splits, we are going to limit n_estimators
, the number of models, to 5
. Normally, XGBoost is very fast. In fact, it has a reputation for being the fastest boosting ensemble method out there, a reputation that we will check in this book! For our initial purposes, however, 5
estimators, though not as robust as the default of 100
, is sufficient. Details on choosing n_estimators
will be a focal point of Chapter 4, From Gradient Boosting to XGBoost:
cross_val(XGBClassifier(n_estimators=5))
The output is as follows:
Accuracy: [0.85 0.86 0.87 0.85 0.86 0.86 0.86 0.87 0.86 0.86] Accuracy mean: 0.86
As you can see, XGBoost scores higher than logistic regression out of the box.