Using the preprocessed Census Income Dataset, separate the features from the target, creating the variables X and Y:
X = data.drop("target", axis=1)
Y = data["target"]
As explained previously, there are several ways to achieve the separation of X and Y, and the main thing to consider is that X should contain the features for all instances, while Y should contain the class label of all instances.
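As a rough illustration of the "several ways" mentioned above, the sketch below separates features and target in three equivalent ways. The tiny DataFrame and its column names (`age`, `hours_per_week`) are hypothetical stand-ins for the preprocessed dataset; only the `target` column name comes from the step above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the preprocessed dataset
data = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours_per_week": [40, 13, 40, 40],
    "target": [0, 0, 0, 1],
})

# Option 1: drop the target column (the approach used in this activity)
X = data.drop("target", axis=1)
Y = data["target"]

# Option 2: select every column whose name is not "target"
feature_cols = [col for col in data.columns if col != "target"]
X_by_name = data[feature_cols]

# Option 3: positional slicing, valid only if the target is the last column
X_by_pos = data.iloc[:, :-1]
Y_by_pos = data.iloc[:, -1]

print(X.shape, Y.shape)
```

Whichever option is used, X ends up two-dimensional (instances by features) and Y one-dimensional (one label per instance).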
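Divide the dataset into training, validation, and testing sets, reserving 10% of the full dataset for each of the validation and testing sets:
from sklearn.model_selection import train_test_split
# First split: hold out 10% of all instances as the testing set
X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size=0.1, random_state=101)
# Second split: 0.1111 of the remaining 90% is roughly 10% of the full dataset
X_train, X_dev, Y_train, Y_dev = train_test_split(X_new, Y_new, test_size=0.1111, random_state=101)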
The shape of the sets created should be as follows:
X_train = (26048, 9)
X_dev = (3256, 9)
X_test = (3257, 9)
Y_train = (26048, )
Y_dev = (3256, )
Y_test = (3257, )
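The set sizes can be checked by hand. The sketch below assumes that train_test_split rounds the held-out count up and assigns the remainder to the other set when only test_size is given; that matches the shapes listed here, but treat the rounding rule as an assumption rather than a guarantee. The total of 32,561 instances is simply the sum of the three set sizes.

```python
import math

n_samples = 32561  # total number of instances implied by the three set sizes

# First split: 10% of the full dataset goes to the testing set
test_size = 0.1
n_test = math.ceil(test_size * n_samples)
n_new = n_samples - n_test

# Second split: 0.1111 of the remainder (about 0.1 / 0.9) goes to validation
dev_size = 0.1111
n_dev = math.ceil(dev_size * n_new)
n_train = n_new - n_dev

print(n_train, n_dev, n_test)  # prints: 26048 3256 3257
```

The second ratio is 0.1111 rather than 0.1 because it is applied to the 90% that remains after the first split.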
From the neural_network module, import the Multilayer Perceptron Classifier class. Initialize it and train the model over the training data.
Leave the hyperparameters at their default values. Again, use a random_state equal to 101:
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(random_state=101)
model = model.fit(X_train, Y_train)
Address any warning that may appear after training the model with the default values for the hyperparameters.
No warning was raised during the training process of the network, which means that the model was able to achieve convergence using the default values for the hyperparameters. Nevertheless, keep in mind that this does not mean that the best model was achieved, and changes in the hyperparameter values may result in better performance of the model.
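For reference, the warning most likely to appear at this step is scikit-learn's ConvergenceWarning, raised when the optimizer reaches max_iter before converging. The sketch below triggers it deliberately on synthetic data, which is an assumption made only for illustration and is not the Census Income Dataset, and then shows one common response: raising max_iter. The names `X_demo`, `Y_demo`, and `model_demo` are hypothetical.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data with nine features, used only to provoke the warning
X_demo, Y_demo = make_classification(n_samples=500, n_features=9, random_state=101)

# Train for a single iteration so the optimizer cannot converge
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    MLPClassifier(max_iter=1, random_state=101).fit(X_demo, Y_demo)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("ConvergenceWarning raised:", hit_limit)

# One possible remedy: allow more iterations than the failing run
if hit_limit:
    model_demo = MLPClassifier(max_iter=500, random_state=101).fit(X_demo, Y_demo)
```

Other options include scaling the input features or adjusting the learning rate; which one helps depends on the data.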
Calculate the accuracy of the model for all three sets (training, validation, and testing):
from sklearn.metrics import accuracy_score
X_sets = [X_train, X_dev, X_test]
Y_sets = [Y_train, Y_dev, Y_test]
accuracy = []
for X_set, Y_set in zip(X_sets, Y_sets):
    pred = model.predict(X_set)
    score = accuracy_score(Y_set, pred)
    accuracy.append(score)
The accuracy score for the three sets should be as follows:
Train sets = 0.8342
Dev sets = 0.8111
Test sets = 0.8252
Open the Jupyter Notebook that you used to train the models.
Compare the four models based on their accuracy score only.
By taking the accuracy scores of the models from the previous chapter, it is possible to perform a final comparison to choose the model that best solves the data problem. To do so, the following table displays the accuracy scores for all four models:
To identify the model with the best performance, begin by comparing the accuracy scores over the training sets. From this, it is possible to conclude that the decision tree model fits the training data best. Nonetheless, its performance over the validation and testing sets is lower than that achieved by the Multilayer Perceptron, which indicates high variance (overfitting) in the decision tree model.
Hence, a good approach would be to address the high variance of the decision tree model by simplifying the model and adding a pruning argument, for instance (the pruning argument "trims" the leaves of the tree to simplify it and ignore some of the details of the tree in order to generalize the model to the data). Ideally, the model should be able to reach a similar level of accuracy for all three sets, which would make it the best model for the data problem.
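As a minimal sketch of this idea, the example below compares an unpruned decision tree with one pruned through scikit-learn's ccp_alpha parameter (cost-complexity pruning). The text does not name a specific pruning argument, so ccp_alpha is only one possible choice, and the value 0.01 is arbitrary. The data is synthetic and used purely for illustration; the `_demo` and `_tr`/`_te` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (flip_y randomly reassigns 10% of the labels)
X_demo, Y_demo = make_classification(
    n_samples=1000, n_features=9, flip_y=0.1, random_state=101
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.1, random_state=101
)

# Unpruned tree: grows until it fits the training data as closely as it can
full_tree = DecisionTreeClassifier(random_state=101).fit(X_tr, Y_tr)

# Pruned tree: ccp_alpha > 0 removes branches that add little impurity reduction
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=101).fit(X_tr, Y_tr)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    train_acc = tree.score(X_tr, Y_tr)
    test_acc = tree.score(X_te, Y_te)
    print(name, tree.get_n_leaves(), round(train_acc, 4), round(test_acc, 4))
```

A smaller gap between training and testing accuracy for the pruned tree would suggest lower variance. In practice, the pruning strength would be tuned on the validation set rather than fixed in advance.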
However, if the model is not able to overcome this variance, and assuming that all the models have been fine-tuned to achieve the maximum performance possible, the Multilayer Perceptron should be the model that's selected, considering that it performs best over the testing sets. This is mainly because the performance of the model over the testing set is the one that defines its overall performance over unseen data, which means that the one with higher testing set performance will be more useful in the long term.