We will build a classifier to estimate the income bracket of a person based on 14 attributes. There are two possible output classes: higher than 50K, or lower than or equal to 50K. There is a slight twist in this dataset: each datapoint is a mixture of numbers and strings. The numerical values carry meaning as they are, so we cannot simply run a label encoder over every attribute. We need to design a system that can deal with numerical and non-numerical data at the same time. We will use the census income dataset available at https://archive.ics.uci.edu/ml/datasets/Census+Income.
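One way to handle mixed data is to leave numeric columns untouched and label-encode only the string columns, keeping one encoder per column so the mapping can be reversed later. The following is a minimal sketch of that idea, not the code from income.py. The three sample rows and the names `rows`, `encoders`, and `X_encoded` are assumptions made for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical rows in the style of the census data:
# age, workclass, hours per week (illustrative values only)
rows = np.array([
    ['39', 'State-gov', '40'],
    ['50', 'Self-emp-not-inc', '13'],
    ['38', 'Private', '40'],
])

encoders = {}                      # column index -> fitted LabelEncoder
X_encoded = np.empty(rows.shape)   # numeric array that a classifier can consume

for col in range(rows.shape[1]):
    column = rows[:, col]
    if all(value.isdigit() for value in column):
        # Numeric column: keep the values as they are
        X_encoded[:, col] = column.astype(float)
    else:
        # String column: map each distinct string to an integer label
        encoder = preprocessing.LabelEncoder()
        X_encoded[:, col] = encoder.fit_transform(column)
        encoders[col] = encoder

print(X_encoded)
```

Keeping the fitted encoders around matters because a new datapoint has to be transformed with the same string-to-integer mapping that was used for the training data.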
We will use the income.py file already provided to you as a reference. We will use a Naive Bayes classifier to achieve this. Let's import a couple of packages:

    from sklearn import preprocessing
    from sklearn.naive_bayes import GaussianNB
    input_file = 'path/to/adult.data.txt'

    # Reading the data
    X = []
    y = []
    count_lessthan50k = 0
    count_morethan50k = 0
    num_images_threshold...