You learned that XGBoost may have an edge in regression, but what about classification? XGBoost has a classification model, but will it perform as accurately as well-tested classification models such as logistic regression? Let's find out.
What is classification?
Unlike with regression, when predicting target columns with a limited number of outputs, a machine learning algorithm is categorized as a classification algorithm. The possible outputs may include the following:
Spam, Not Spam
Red, Blue, Green, Yellow, Orange
Dataset 2 – The census
We will move a little more swiftly through the second dataset, the Census Income Data Set (https://archive.ics.uci.edu/ml/datasets/Census+Income), to predict personal income.
df_census = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')
df_census.head()
Here is the expected output:
The output reveals that the column headings represent the entries of the first row. When this happens, the data may be reloaded with the header=None parameter:
df_census = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header=None)
df_census.head()
Here is the expected output without the header:
As you can see, the column names are still missing. They are listed on the Census Income Data Set website (https://archive.ics.uci.edu/ml/datasets/Census+Income) under Attribute Information.
df_census.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                     'marital-status', 'occupation', 'relationship', 'race', 'sex',
                     'capital-gain', 'capital-loss', 'hours-per-week',
                     'native-country', 'income']
df_census.head()
Here is the expected output with column names:
As you can see, the column names have been restored.
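The reload and the renaming can also be combined into a single call. The following is a minimal sketch, assuming the names parameter of pd.read_csv is acceptable here; it reads two illustrative rows from an in-memory string (a stand-in for the real file, so no download is needed), and the names column_names and df_sample are hypothetical.

```python
import io
import pandas as pd

# Two illustrative rows in the same comma-plus-space layout as adult.data
# (an in-memory stand-in for the real file; nothing is downloaded here).
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'occupation', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'hours-per-week',
                'native-country', 'income']

# header=None keeps the first row as data; names supplies the column labels.
df_sample = pd.read_csv(io.StringIO(sample), header=None, names=column_names)
print(df_sample.shape)
print(repr(df_sample.loc[0, 'income']))  # text values keep their leading space
```

Because the fields are separated by a comma followed by a space, the text values keep a leading space unless it is stripped, which is worth remembering when referring to values or generated column names later on.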
Next, call the .info() method on df_census to check the column types and look for null values. The output is as follows:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64
 11  capital-loss    32561 non-null  int64
 12  hours-per-week  32561 non-null  int64
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
Since all columns have the same number of non-null rows, we can infer that there are no null values.
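Null values can also be counted directly rather than inferred from the non-null counts. Here is a minimal sketch on a small made-up DataFrame (df_toy and its values are hypothetical, chosen only to show the calls):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
df_toy = pd.DataFrame({
    'age': [39, 50, np.nan],
    'workclass': ['Private', None, 'State-gov'],
})

# isnull() marks missing entries; sum() counts them per column.
null_counts = df_toy.isnull().sum()
total_nulls = int(null_counts.sum())

print(null_counts)
print('Total nulls:', total_nulls)
```

A total of 0 on a real DataFrame would confirm the inference made from the .info() output.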
All columns of the dtype object must be transformed into numerical columns. A pandas get_dummies method takes the non-numerical unique values of every column and converts them into their own column, with 1 indicating presence and 0 indicating absence. For instance, if the column values of a DataFrame called "Book Types" were "hardback," "paperback," or "ebook," pd.get_dummies would create three new columns called "hardback," "paperback," and "ebook" replacing the "Book Types" column.
Here is a "Book Types" DataFrame:
Here is the same DataFrame after pd.get_dummies:
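The "Book Types" example can be reproduced with a few lines. This is a minimal sketch in which df_books is a hypothetical three-row DataFrame; dtype=int is passed so that the new columns hold 1 and 0 rather than booleans, which some pandas versions return by default.

```python
import pandas as pd

# Hypothetical DataFrame with a single non-numerical column.
df_books = pd.DataFrame({'Book Types': ['hardback', 'paperback', 'ebook']})

# One new column per unique value; dtype=int gives 1/0 instead of True/False.
dummies = pd.get_dummies(df_books, dtype=int)

print(df_books)
print(dummies)
```

When pd.get_dummies is applied to a whole DataFrame, each new column name is prefixed with the original column name, so the result here has the columns Book Types_ebook, Book Types_hardback, and Book Types_paperback.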
pd.get_dummies will create many new columns, so it's worth checking to see whether any columns may be eliminated. A quick review of the df_census data reveals an 'education' column and an 'education-num' column. The 'education-num' column is a numerical conversion of 'education'. Since the information is the same, the 'education' column may be deleted:
df_census = df_census.drop(['education'], axis=1)
Now use pd.get_dummies to transform the non-numerical columns into numerical columns:
df_census = pd.get_dummies(df_census)
df_census.head()
As you can see, new columns are created using a column_value syntax referencing the original column. For example, native-country is an original column, and Taiwan is one of many values. The new native-country_Taiwan column has a value of 1 if the person is from Taiwan and 0 otherwise.
pd.get_dummies may increase memory usage, as can be verified using the .info() method on the DataFrame in question and checking the last line. Sparse matrices may be used to save memory where only values of 1 are stored and values of 0 are not stored. For more information on sparse matrices, see Chapter 10, XGBoost Model Deployment, or visit SciPy's official documentation at https://docs.scipy.org/doc/scipy/reference/.
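To get a feel for the memory difference, pandas itself can return sparse dummy columns through the sparse parameter of pd.get_dummies. The sketch below is one possible illustration, not the approach used in Chapter 10; the toy data (5,000 rows drawn from 40 made-up country labels) and all variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 5,000 rows drawn from 40 made-up labels.
rng = np.random.default_rng(0)
countries = [f'country_{i}' for i in range(40)]
df_toy = pd.DataFrame({'native-country': rng.choice(countries, size=5000)})

# Dense encoding stores every 0 and 1; sparse encoding stores only the 1s.
dense = pd.get_dummies(df_toy, dtype=int)
sparse = pd.get_dummies(df_toy, dtype=int, sparse=True)

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()

print('Dense bytes: ', dense_bytes)
print('Sparse bytes:', sparse_bytes)
```

Both DataFrames hold the same values; only the storage differs. The savings grow with the number of unique values, since each row has a single 1 no matter how many dummy columns exist.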
Target and predictor columns
The target column is whether or not someone makes more than 50K. After pd.get_dummies, two columns, df_census['income_ <=50K'] and df_census['income_ >50K'], are used to determine whether someone makes more than 50K. Since either column will work, we delete df_census['income_ <=50K']:
df_census = df_census.drop('income_ <=50K', axis=1)
Now split the data into X (predictor columns) and y (target column). Note that -1 is used for indexing since the last column is the target column:
X = df_census.iloc[:,:-1]
y = df_census.iloc[:,-1]
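Positional indexing relies on the target being the last column. As an alternative, the split can be written by column name. Here is a minimal sketch on a hypothetical three-row DataFrame (df_toy, X_toy, and y_toy are made-up names) that mirrors the steps above and checks that both approaches agree:

```python
import pandas as pd

# Hypothetical toy data; income values keep the leading space seen in the raw file.
df_toy = pd.DataFrame({
    'age': [39, 50, 38],
    'hours-per-week': [40, 13, 40],
    'income': [' <=50K', ' >50K', ' <=50K'],
})

# Encode, then drop the redundant income column as in the text.
df_toy = pd.get_dummies(df_toy, dtype=int)
df_toy = df_toy.drop('income_ <=50K', axis=1)

# Select the target by name instead of by position.
target = 'income_ >50K'
X_toy = df_toy.drop(target, axis=1)
y_toy = df_toy[target]

# Same result as positional indexing when the target is the last column.
print(X_toy.equals(df_toy.iloc[:, :-1]))
print(y_toy.tolist())
```

Selecting by name keeps working if columns are later reordered, at the cost of having to spell the column name exactly, including the space.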
It's time to build machine learning classifiers!
Logistic regression is the most fundamental classification algorithm. Mathematically, logistic regression works in a manner similar to linear regression. For each column, logistic regression finds an appropriate weight, or coefficient, that best fits the training data. The primary difference is that instead of returning the weighted sum of the terms directly, as in linear regression, logistic regression passes that sum through the sigmoid function.
Here is the sigmoid function and the corresponding graph:
The sigmoid is commonly used for classification. All values greater than 0.5 are matched to 1, and all values less than 0.5 are matched to 0.
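The sigmoid and the 0.5 threshold can be sketched in a few lines of NumPy. The input values below are arbitrary and stand in for the weighted sums a fitted model would produce:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example inputs (stand-ins for weighted sums of the columns).
z = np.array([-4.0, -1.0, 1.0, 4.0])
probs = sigmoid(z)

# Outputs above 0.5 are matched to 1; outputs below 0.5 are matched to 0.
labels = (probs > 0.5).astype(int)

print(np.round(probs, 3))
print(labels)
```

Since sigmoid(0) is exactly 0.5, thresholding the output at 0.5 is the same as checking whether the weighted sum is positive or negative.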
Implementing logistic regression with scikit-learn is nearly the same as implementing linear regression. The main differences are that the target column should fit into categories, and the score should be in terms of accuracy. As a bonus, scikit-learn classifiers are scored by accuracy by default, so explicit scoring parameters are not required.
You may import logistic regression as follows:
from sklearn.linear_model import LogisticRegression
The cross-validation function
Instead of copying and pasting, let's build a cross-validation classification function that takes a machine learning algorithm as input and has the accuracy score as output using cross_val_score:
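import numpy as np
from sklearn.model_selection import cross_val_score

def cross_val(classifier, num_splits=10):
    # Score the classifier on X and y (defined above) with k-fold cross-validation
    model = classifier
    scores = cross_val_score(model, X, y, cv=num_splits)
    print('Accuracy:', np.round(scores, 2))
    print('Accuracy mean: %0.2f' % (scores.mean()))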
Now call the function with logistic regression:
cross_val(LogisticRegression())
The output is as follows:
Accuracy: [0.8  0.8  0.79 0.8  0.79 0.81 0.79 0.79 0.8  0.8 ]
Accuracy mean: 0.80
80% accuracy isn't bad out of the box.
Any time you find yourself copying and pasting code, look for a better way! One aim of computer science is to avoid repetition. Writing your own data analysis and machine learning functions will make your life easier and your work more efficient in the long run.
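One way to make such a function easier to reuse is to pass the data in as arguments instead of relying on X and y being defined elsewhere. The following is a minimal sketch under that assumption; cross_val_on is a hypothetical name, and the data comes from scikit-learn's make_classification rather than the census dataset, so the scores will differ from those shown above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cross_val_on(classifier, X, y, num_splits=10):
    """Print and return k-fold accuracy scores for any scikit-learn classifier."""
    scores = cross_val_score(classifier, X, y, cv=num_splits)
    print('Accuracy:', np.round(scores, 2))
    print('Accuracy mean: %0.2f' % scores.mean())
    return scores

# Synthetic stand-in data (not the census dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2)

# max_iter is raised only to make convergence more likely on this demo data.
scores = cross_val_on(LogisticRegression(max_iter=1000), X_demo, y_demo)
```

Returning the scores as well as printing them lets the caller compare models programmatically. Any estimator that follows the scikit-learn interface can be passed in the same way.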
The XGBoost classifier
from xgboost import XGBClassifier
Now run the classifier in the cross_val function with one important addition. Since there are 94 columns, and XGBoost is an ensemble method, meaning that it combines many models for each run, with each run repeated across 10 splits, we are going to limit n_estimators, the number of models, to 5. Normally, XGBoost is very fast. In fact, it has a reputation for being the fastest boosting ensemble method out there, a reputation that we will check in this book! For our initial purposes, however, 5 estimators, though not as robust as the default of 100, is sufficient. Details on choosing n_estimators will be a focal point of Chapter 4, From Gradient Boosting to XGBoost:
cross_val(XGBClassifier(n_estimators=5))
The output is as follows:
Accuracy: [0.85 0.86 0.87 0.85 0.86 0.86 0.86 0.87 0.86 0.86]
Accuracy mean: 0.86