Before working on step 1, make sure that the data has been preprocessed, as follows:
import pandas as pd

data = pd.read_csv("datasets/census_income_dataset.csv")
data = data.drop(["fnlwgt", "education", "relationship", "sex", "race"], axis=1)
After reading the dataset, the five variables considered irrelevant for the study are removed.
Next, the remaining qualitative variables are converted into their numerical form via the following code:
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
features_to_convert = ["workclass", "marital-status", "occupation", "native-country", "target"]
for i in features_to_convert:
    data[i] = enc.fit_transform(data[i].astype('str'))
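The loop above reuses a single encoder, so after it finishes only the mapping for the last column is still available. As a minimal sketch of an alternative, assuming you want to decode values later, one encoder per column can be kept in a dictionary. The toy DataFrame and the `encoders` dictionary are hypothetical and only stand in for the census data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the census data
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "target": ["<=50K", ">50K", "<=50K"],
})

encoders = {}  # hypothetical helper: column name -> fitted encoder
for col in ["workclass", "target"]:
    col_enc = LabelEncoder()
    toy[col] = col_enc.fit_transform(toy[col].astype("str"))
    encoders[col] = col_enc

# LabelEncoder sorts the labels, so each label's position is its numeric code
target_codes = {str(label): code
                for code, label in enumerate(encoders["target"].classes_)}

# Recover the original strings from the encoded column
decoded = encoders["workclass"].inverse_transform(toy["workclass"])
print(target_codes)
print(list(decoded))
```

Inspecting `classes_` in this way is one option for confirming which numeric code corresponds to which income class.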
Once this is complete, you can begin with the steps of the activity:
Using the preprocessed Census Income Dataset, separate the features from the target by creating the variables X and Y:
X = data.drop("target", axis=1)
Y = data["target"]
Note that there are several ways to achieve the separation of X and Y. Use the one that you feel most comfortable with. However, take into account that X should contain the features for all instances, while Y should contain the class label of all instances.
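To illustrate the point about equivalent approaches, the following sketch shows three ways of separating features from the target on a small, hypothetical DataFrame (the column names and values are made up for the example). The positional variant assumes that the target is the last column:

```python
import pandas as pd

# Hypothetical toy frame; "target" is deliberately the last column
toy = pd.DataFrame({
    "age": [39, 50],
    "hours-per-week": [40, 13],
    "target": [0, 1],
})

# Option A: drop the label column by name
X_a = toy.drop("target", axis=1)
Y_a = toy["target"]

# Option B: positional slicing (assumes the target is the last column)
X_b = toy.iloc[:, :-1]
Y_b = toy.iloc[:, -1]

# Option C: pop the label from a copy, leaving the features behind
X_c = toy.copy()
Y_c = X_c.pop("target")

print(X_a.shape, Y_a.shape)
```

All three produce a two-dimensional feature table and a one-dimensional label Series.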
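Divide the dataset into training, validation, and testing sets, so that the validation and testing sets each hold roughly 10% of all instances:
from sklearn.model_selection import train_test_split

# Hold out 10% of all instances for testing
X_new, X_test, Y_new, Y_test = train_test_split(X, Y, test_size=0.1, random_state=101)
# Hold out 11.11% of the remainder (about 10% of the total) for validation
X_train, X_dev, Y_train, Y_dev = train_test_split(X_new, Y_new, test_size=0.1111, random_state=101)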
The final shape of all sets must match the values shown in the following code:
X_train = (26048, 9)
Y_train = (26048, )
X_dev = (3256, 9)
Y_dev = (3256, )
X_test = (3257, 9)
Y_test = (3257, )
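The expected set sizes can be checked with a little arithmetic. The sketch below works backward from the shapes listed above to the fraction each split has to hold out. It assumes that scikit-learn rounds the held-out count up to the next whole instance, which is the behavior I would expect from `train_test_split` with a fractional `test_size`:

```python
from math import ceil

# Sizes taken from the expected shapes
n_train, n_dev, n_test = 26048, 3256, 3257
n_total = n_train + n_dev + n_test        # 32561 instances overall

# First split: fraction of the full dataset held out for testing
test_fraction = n_test / n_total          # about 0.10

# Second split: fraction of the remaining rows held out for validation
n_remaining = n_total - n_test            # 29304 instances
dev_fraction = n_dev / n_remaining        # about 0.1111

# Assumption: the held-out count is rounded up
expected_test = ceil(0.1 * n_total)
expected_dev = ceil(0.1111 * n_remaining)

print(round(test_fraction, 4), round(dev_fraction, 4))
print(expected_test, expected_dev)
```

In other words, taking 10% of the total for validation after the test set has already been removed means taking about 11.11% of what is left.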
Import the Gaussian Naïve Bayes class, and then use the fit method to train the model over the training sets (X_train and Y_train):
from sklearn.naive_bayes import GaussianNB

model_NB = GaussianNB()
model_NB.fit(X_train, Y_train)
Finally, perform a prediction using the model that you trained previously for a new instance with the following values for each feature: 39, 6, 13, 4, 0, 2174, 0, 40, 38.
Using the following code, the prediction for the individual should be equal to zero, which means that the individual most likely has an income below or equal to 50K:
# The new instance must supply one value per feature in X (nine in total)
pred_1 = model_NB.predict([[39, 6, 13, 4, 0, 2174, 0, 40, 38]])
print(pred_1)
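The prediction call above passes a plain nested list. When a model has been fitted on a DataFrame, recent scikit-learn versions may warn that the input lacks feature names. A minimal sketch of one way to avoid this, and to check the number of values before predicting, is shown below. The toy data and column names are hypothetical:

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy training data with two features
X_toy = pd.DataFrame({
    "age": [25, 30, 45, 50],
    "hours-per-week": [20, 25, 50, 60],
})
Y_toy = pd.Series([0, 0, 1, 1], name="target")

model = GaussianNB().fit(X_toy, Y_toy)

values = [28, 22]
# The instance needs exactly as many values as the model saw features
if len(values) != model.n_features_in_:
    raise ValueError("new instance has the wrong number of features")

# Wrap the instance in a one-row DataFrame that reuses the training columns
new_instance = pd.DataFrame([values], columns=X_toy.columns)
pred = model.predict(new_instance)
proba = model.predict_proba(new_instance)  # class probabilities for the row
print(pred, proba)
```

The same pattern should carry over to the census data by reusing `X_train.columns`.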
The shape of the previously created subsets must be as follows:
X_train = (26048, 9)
Y_train = (26048, )
X_dev = (3256, 9)
Y_dev = (3256, )
X_test = (3257, 9)
Y_test = (3257, )
Using the preprocessed Census Income Dataset that was previously split into the different subsets, import the DecisionTreeClassifier class, and then use the fit method to train the model over the training sets (X_train and Y_train):
from sklearn.tree import DecisionTreeClassifier

model_tree = DecisionTreeClassifier()
model_tree.fit(X_train, Y_train)
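A decision tree fitted without a fixed seed can differ slightly between runs, because scikit-learn permutes the features at each split. As a sketch, assuming reproducibility matters for your results, passing `random_state` makes the fit repeatable. The synthetic data from `make_classification` is only a stand-in for the census training set:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with nine features
X_toy, Y_toy = make_classification(n_samples=200, n_features=9, random_state=101)

# Two trees fitted with the same seed on the same data
tree_a = DecisionTreeClassifier(random_state=101).fit(X_toy, Y_toy)
tree_b = DecisionTreeClassifier(random_state=101).fit(X_toy, Y_toy)

# Compare the learned structure: node count and the feature used at each node
same_structure = (
    tree_a.tree_.node_count == tree_b.tree_.node_count
    and bool((tree_a.tree_.feature == tree_b.tree_.feature).all())
)
print(same_structure)
```

The seed value itself is arbitrary; 101 is used here only because the split code uses the same number.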
Finally, perform a prediction using the model that you trained before for a new instance with the following values for each feature: 39, 6, 13, 4, 0, 2174, 0, 40, 38.
Using the following code, the prediction for the individual should be equal to zero, which means that the individual most likely has an income below or equal to 50K:
# The new instance must supply one value per feature in X (nine in total)
pred_2 = model_tree.predict([[39, 6, 13, 4, 0, 2174, 0, 40, 38]])
print(pred_2)
The shape of the previously created subsets must be as follows:
X_train = (26048, 9)
Y_train = (26048, )
X_dev = (3256, 9)
Y_dev = (3256, )
X_test = (3257, 9)
Y_test = (3257, )
Using the preprocessed Census Income Dataset that was previously split into the different subsets, import the SVC class, and then use the fit method to train the model over the training sets (X_train and Y_train):
from sklearn.svm import SVC

model_svm = SVC()
model_svm.fit(X_train, Y_train)
Finally, perform a prediction using the model that you trained before for a new instance with the following values for each feature: 39, 6, 13, 4, 0, 2174, 0, 40, 38.
Using the following code, the prediction for the individual should be equal to zero, which means that the individual most likely has an income below or equal to 50K:
# The new instance must supply one value per feature in X (nine in total)
pred_3 = model_svm.predict([[39, 6, 13, 4, 0, 2174, 0, 40, 38]])
print(pred_3)
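Since the three models are trained and queried in the same way, the steps can also be written as a single loop. The following is a rough sketch on synthetic data rather than the census dataset. The `models` and `predictions` dictionaries are hypothetical names introduced for the example:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with nine features
X_toy, Y_toy = make_classification(n_samples=200, n_features=9, random_state=101)

models = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=101),
    "svm": SVC(),
}

# A single instance, kept two-dimensional as predict() expects
new_instance = X_toy[:1]

predictions = {}
for name, model in models.items():
    model.fit(X_toy, Y_toy)
    predictions[name] = int(model.predict(new_instance)[0])

print(predictions)
```

Collecting the predictions side by side makes it easier to see whether the algorithms agree on a given instance.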