It is always worth looking at the actual contributions of the individual features. For logistic regression, we can directly inspect the learned coefficients (clf.coef_) to get an impression of the features' impact. The larger a feature's coefficient in absolute terms, the bigger the role that feature plays in determining whether a post is good or not. The sign gives the direction: negative coefficients tell us that higher values of the corresponding feature are a signal for the post to be classified as bad.
We see that LinkCount, AvgWordLen, NumAllCaps, and NumExclams have the biggest impact on the overall classification decision, while NumImages (a feature that we sneaked in just for demonstration purposes a second ago) and AvgSentLen play a rather minor role. While the feature importances overall make sense intuitively, it is surprising that NumImages is basically ignored. Normally, answers containing images are rated highly. In reality, however, answers very rarely contain images, so the feature occurs too seldom in the training data for the classifier to learn anything useful from it.