Bayesian spam filtering
Suppose we have a filter that flags emails it identifies as spam. Consider the events F = {email is flagged as spam} and T = {email is spam}. Anyone who has used a spam filter knows that filtering is imperfect, so these two events do not coincide: sometimes a legitimate message is flagged, and sometimes spam goes undetected.
Suppose the developers of the spam filter tested it extensively on a large sample of emails and found the following:
- The probability that spam emails will be caught by the filter (true positives) is 0.95, or P(F|T) = 0.95.
- The probability that legitimate emails are not caught by the filter (true negatives) is 0.98, or P(F^c|T^c) = 0.98, where F^c and T^c denote the complements of F and T.
- The probability that an email from the selected sample is spam is 0.1, or P(T) = 0.1.
Suppose an email is caught by the filter. What is the probability that it is actually spam? In other words, what is P(T|F)? By Bayes' theorem, it would...
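As a rough check on this question, the calculation can be sketched in Python. The sketch assumes the standard form of Bayes' theorem with the denominator expanded by the law of total probability, P(T|F) = P(F|T)P(T) / [P(F|T)P(T) + P(F|T^c)P(T^c)], and takes P(F|T^c) = 1 - P(F^c|T^c) = 0.02. The function name and parameter names are illustrative and do not come from the text.

```python
from fractions import Fraction


def posterior_spam_given_flag(p_flag_given_spam, p_clean_given_legit, p_spam):
    """Return P(T|F) using Bayes' theorem and the law of total probability.

    Hypothetical helper for illustration; the argument names are assumptions.
    """
    # P(F|T^c): a legitimate email is flagged (false-positive rate)
    p_flag_given_legit = 1 - p_clean_given_legit
    # P(T^c): an email is legitimate
    p_legit = 1 - p_spam
    # P(F) = P(F|T) P(T) + P(F|T^c) P(T^c)
    p_flag = p_flag_given_spam * p_spam + p_flag_given_legit * p_legit
    # P(T|F) = P(F|T) P(T) / P(F)
    return p_flag_given_spam * p_spam / p_flag


# Values from the list above, written as exact fractions to avoid rounding.
p = posterior_spam_given_flag(
    Fraction(95, 100),  # P(F|T)
    Fraction(98, 100),  # P(F^c|T^c)
    Fraction(1, 10),    # P(T)
)
print(p)         # exact value as a fraction
print(float(p))  # decimal approximation
```

With the figures given, the exact result is 95/113, or roughly 0.84. Under these assumptions, about 16% of flagged emails would be legitimate even though the filter is quite accurate, because legitimate emails make up 90% of the sample.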