Book Image

Natural Language Processing with Java and LingPipe Cookbook

Book Image

Natural Language Processing with Java and LingPipe Cookbook

Overview of this book

Table of Contents (14 chapters)
Natural Language Processing with Java and LingPipe Cookbook
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Viewing error categories – false positives


We can achieve the best possible classifier performance by examining the errors and making changes to the system. There is a very bad habit among developers and machine-learning folks to not look at errors, particularly as systems mature. Just to be clear, at the end of a project, the developers responsible for tuning the classifier should be very familiar with the domain being classified, if not expert in it, because they have looked at so much data while tuning the system. If the developer cannot do a reasonable job of emulating the classifiers that you are tuning, then you are not looking at enough data.

This recipe performs the most basic form of looking at what the system got wrong in the form of false positives, which are examples from training data that the classifier assigned to a category, but the correct category was something else.

How to do it...

Perform the following steps in order to view error categories using false positives:

  1. This recipe extends the previous How to train and evaluate with cross validation recipe by accessing more of what the evaluation class provides. Get a command prompt and type:

    java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.ReportFalsePositivesOverXValidation
    
  2. This will result in:

    Training data is: data/disney_e_n.csv
    reference\response
              \e,n,
             e 10,1,
             n 6,4,
    False Positives for e
    Malisímos los nuevos dibujitos de disney, nickelodeon, cartoon, etc, no me gustannn : n
    @meeelp mas que venha um filhinho mais fofo que o próprio pai, com covinha e amando a Disney kkkkkkkkkkkkkkkkk : n
    @HedyHAMIDI au quartier pas a Disney moi : n
    @greenath_ t'as de la chance d'aller a Disney putain j'y ai jamais été moi. : n
    Prefiro gastar uma baba de dinheiro pra ir pra cancun doq pra Disney por exemplo : n
    ES INSUPERABLE DISNEY !! QUIERO VOLVER:( : n
    False Positives for n
    request now "let's get tricky" by @bellathorne and @ROSHON on @radiodisney!!! just call 1-877-870-5678 or at http://t.co/cbne5yRKhQ!! <3 : e
    
  3. The output starts with a confusion matrix. Then, we will see the actual six instances of false positives for p from the lower left-hand side cell of the confusion matrix labeled with the category that the classifier guessed. Then, we will see false positives for n, which is a single example. The true category is appended with :, which is helpful for classifiers that have more than two categories.

How it works…

This recipe is based on the previous one, but it has its own source in com/lingpipe/cookbook/chapter1/ReportFalsePositivesOverXValidation.java. There are two differences. First, storeInputs is set to true for the evaluator:

boolean storeInputs = true;
BaseClassifierEvaluator<CharSequence> evaluator = new BaseClassifierEvaluator<CharSequence>(null, categories, storeInputs);

Second, a Util method is added to print false positives:

for (String category : categories) {
  Util.printFalsePositives(category, evaluator, corpus);
}

The preceding code works by identifying a category of focus—e or English tweets—and extracting all the false positives from the classifier evaluator. For this category, false positives are tweets that are non-English in truth, but the classifier thought they were English. The referenced Util method is as follows:

public static <E> void printFalsePositives(String category, BaseClassifierEvaluator<E> evaluator, Corpus<ObjectHandler<Classified<E>>> corpus) throws IOException {
  final Map<E,Classification> truthMap = new HashMap<E,Classification>();
  corpus.visitCorpus(new ObjectHandler<Classified<E>>() {
    @Override
    public void handle(Classified<E> data) {
      truthMap.put(data.getObject(),data.getClassification());
    }
  });

The preceding code takes the corpus that contains all the truth data and populates Map<E,Classification> to allow for lookup of the truth annotation, given the input. If the same input exists in two categories, then this method will not be robust but will record the last example seen:

List<Classified<E>> falsePositives = evaluator.falsePositives(category);
System.out.println("False Positives for " + category);
for (Classified<E> classified : falsePositives) {
  E data = classified.getObject();
  Classification truthClassification = truthMap.get(data);
  System.out.println(data + " : " + truthClassification.bestCategory());
  }
}

The code gets the false positives from the evaluator and then iterates over all them with a lookup into truthMap built in the preceding code and prints out the relevant information. There are also methods to get false negatives, true positives, and true negatives in evaluator.

The ability to identify mistakes is crucial to improving performance. The advice seems obvious, but it is very common for developers to not look at mistakes. They will look at system output and make a rough estimate of whether the system is good enough; this does not result in top-performing classifiers.

The next recipe works through more evaluation metrics and their definition.