We have learned how to create a decision tree, but decision tree models often don't hold up well when there are many variables and a large dataset. This is where ensemble models, such as random forest, come to the rescue.
A random forest creates many decision trees, each on a random sample of the dataset, and then combines their results: by majority vote or averaged class probabilities for classification, and by averaging for regression. If you watch a singing competition, such as American Idol, or a sporting competition, such as the Olympics, there are multiple judges. The reason for having multiple judges is to reduce the influence of any one judge's bias and give fairer results, and this is what a random forest tries to achieve.
A single decision tree, by contrast, can change drastically if the data changes slightly, and it can easily overfit the data.
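To make the judging analogy concrete, here is a minimal sketch of how several trees' predictions could be combined by majority vote. The `majority_vote` helper and the `tree_predictions` values are hypothetical and made up purely for illustration; they are not from the dataset used in this chapter, and scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes like this.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions from five trees for three samples
# (1 = earns more than 50k, 0 = does not). Illustrative values only.
tree_predictions = [
    [1, 0, 1],  # tree 1
    [1, 0, 0],  # tree 2
    [0, 0, 1],  # tree 3
    [1, 1, 1],  # tree 4
    [1, 0, 0],  # tree 5
]

# zip(*...) regroups the predictions so that each tuple holds
# all five trees' votes for a single sample.
forest_predictions = [
    majority_vote(sample_votes) for sample_votes in zip(*tree_predictions)
]
print(forest_predictions)  # [1, 0, 1]
```

No single tree agrees with the others on every sample, yet the combined vote is stable. Using an odd number of trees for a two-class problem avoids ties in this simple scheme.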
Let's create a random forest model and see how its precision and recall compare with those of the decision tree that we just created:
>>> import sklearn.ensemble as sk
>>> clf = sk.RandomForestClassifier(n_estimators=100)
>>> clf = clf.fit(x_train, y_train.greater_than_50k...