Natural Language Processing with Java and LingPipe Cookbook


How to train and evaluate with cross validation


The earlier recipes have shown how to evaluate classifiers with truth data and how to train with truth data, but how about doing both? This great idea is called cross validation, and it works as follows:

  1. Split the data into n distinct sets or folds—the standard n is 10.

  2. For i from 1 to n:

    • Train on the n - 1 folds defined by the exclusion of fold i

    • Evaluate on fold i

  3. Report the evaluation results across all folds i.
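The steps above can be sketched without any framework. The class below is a hypothetical illustration (LingPipe's `XValidatingObjectCorpus` does this bookkeeping for you); it assigns items to folds round-robin and extracts the training and test partitions for a given fold:

```java
import java.util.ArrayList;
import java.util.List;

public class XValSketch {

    // Assign item i to fold i % numFolds (simple round-robin assignment).
    static int foldOf(int itemIndex, int numFolds) {
        return itemIndex % numFolds;
    }

    // The test partition: items belonging to fold `fold`.
    static <E> List<E> testFold(List<E> data, int numFolds, int fold) {
        List<E> test = new ArrayList<E>();
        for (int i = 0; i < data.size(); ++i) {
            if (foldOf(i, numFolds) == fold) {
                test.add(data.get(i));
            }
        }
        return test;
    }

    // The training partition: everything NOT in fold `fold`.
    static <E> List<E> trainFolds(List<E> data, int numFolds, int fold) {
        List<E> train = new ArrayList<E>();
        for (int i = 0; i < data.size(); ++i) {
            if (foldOf(i, numFolds) != fold) {
                train.add(data.get(i));
            }
        }
        return train;
    }
}
```

Running the outer loop over `fold` from 0 to `numFolds - 1` guarantees every item lands in the test partition exactly once.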

This is how most machine-learning systems are tuned for performance. The workflow is as follows:

  1. See what the cross validation performance is.

  2. Look at the error as determined by an evaluation metric.

  3. Look at the actual errors—yes, the data—for insights into how the system can be improved.

  4. Make some changes.

  5. Evaluate it again.

Cross validation is an excellent way to compare different approaches to a problem, try different classifiers, motivate normalization approaches, explore feature enhancements, and so on. Generally, a system configuration that shows increased performance on cross validation will also show increased performance on new data. What cross validation does not do, particularly with active learning strategies discussed later, is reliably predict performance on new data. Always apply the classifier to new data before releasing production systems as a final sanity check. You have been warned.

Cross validation also imposes a negative bias compared to a classifier trained on all possible training data, because each fold produces a slightly weaker classifier: with 10 folds, each one trains on only 90 percent of the data.

Rinse, lather, and repeat is the mantra of building state-of-the-art NLP systems.

Getting ready

Note how different this approach is from other classic computer-engineering approaches that focus on developing against a functional specification driven by unit tests. This process is more about refining and adjusting the code to work better as determined by the evaluation metrics.

How to do it...

To run the code, perform the following steps:

  1. Get to a command prompt and type:

    java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.RunXValidate
    
  2. The result will be:

    Training data is: data/disney_e_n.csv
    Training on fold 0
    Testing on fold 0
    Training on fold 1
    Testing on fold 1
    Training on fold 2
    Testing on fold 2
    Training on fold 3
    Testing on fold 3
    reference\response
        \e,n,
        e 10,1,
        n 6,4,

    The preceding output will make more sense in the following section.
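From the confusion matrix above (reference rows, response columns: e classified as e 10 times and as n once; n classified as e 6 times and as n 4 times), the overall accuracy works out to (10 + 4) / 21 ≈ 0.667. A minimal sketch of those computations, using a hypothetical `ConfusionStats` class rather than LingPipe's evaluator API:

```java
public class ConfusionStats {

    // Matrix convention: rows are reference (truth), columns are response
    // (classifier output), matching the printed output above.
    static double accuracy(int[][] m) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < m.length; ++i) {
            for (int j = 0; j < m[i].length; ++j) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];  // diagonal cells are correct calls
                }
            }
        }
        return correct / (double) total;
    }

    // Recall for category i: correct calls over all reference items for i.
    static double recall(int[][] m, int i) {
        int rowTotal = 0;
        for (int j = 0; j < m[i].length; ++j) {
            rowTotal += m[i][j];
        }
        return m[i][i] / (double) rowTotal;
    }
}
```

With the matrix from this run, recall on `e` is 10/11 ≈ 0.909 but recall on `n` is only 4/10, which is exactly the kind of imbalance worth noticing before step 3 of the workflow.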

How it works…

This recipe introduces an XValidatingObjectCorpus object that manages cross validation. It is used heavily in training classifiers. Everything else should be familiar from the previous recipes. The main() method starts with:

String inputPath = args.length > 0 ? args[0] : "data/disney_e_n.csv";
System.out.println("Training data is: " + inputPath);
List<String[]> truthData = Util.readAnnotatedCsvRemoveHeader(new File(inputPath));

The preceding code gets us the data from the default or a user-entered file. The next two lines introduce XValidatingObjectCorpus—the star of this recipe:

int numFolds = 4;
XValidatingObjectCorpus<Classified<CharSequence>> corpus = Util.loadXValCorpus(truthData, numFolds);

The numFolds variable controls how the data just loaded will be partitioned—it will be in four partitions in this case. Now, we will look at the Util.loadXValCorpus(truthData, numFolds) subroutine:

public static XValidatingObjectCorpus<Classified<CharSequence>> loadXValCorpus(List<String[]> rows, int numFolds) throws IOException {
  XValidatingObjectCorpus<Classified<CharSequence>> corpus = new XValidatingObjectCorpus<Classified<CharSequence>>(numFolds);
  for (String[] row : rows) {
    Classification classification = new Classification(row[ANNOTATION_OFFSET]);
    Classified<CharSequence> classified = new Classified<CharSequence>(row[TEXT_OFFSET],classification);
    corpus.handle(classified);
  }
  return corpus;
}

The constructed XValidatingObjectCorpus<E> will contain all the truth data in the form of objects of type E. In this case, we are filling the corpus with the same object used to train and evaluate in the previous recipes in this chapter—Classified<CharSequence>. This will be handy, because we will be using the objects to both train and test our classifier. The numFolds parameter specifies how many partitions of the data to make. It can be changed later.

The following for loop should be familiar: it iterates over all the annotated data, creates a Classified<CharSequence> object for each row, and passes it to the corpus.handle() method, which adds it to the corpus. Finally, we return the corpus. It is worth taking a look at the Javadoc for XValidatingObjectCorpus<E> if you have any questions.

Returning to the body of main(), we will permute the corpus to mix the data, get the categories, and set up BaseClassifierEvaluator<CharSequence> with a null value where we supplied a classifier in a previous recipe:

corpus.permuteCorpus(new Random(123413));
String[] categories = Util.getCategories(truthData);
boolean storeInputs = false;
BaseClassifierEvaluator<CharSequence> evaluator = new BaseClassifierEvaluator<CharSequence>(null, categories, storeInputs);

Now, we are ready to do the cross validation:

int maxCharNGram = 3;
for (int i = 0; i < numFolds; ++i) {
  corpus.setFold(i);
  DynamicLMClassifier<NGramBoundaryLM> classifier = DynamicLMClassifier.createNGramBoundary(categories, maxCharNGram);
  System.out.println("Training on fold " + i);
  corpus.visitTrain(classifier);
  evaluator.setClassifier(classifier);
  System.out.println("Testing on fold " + i);
  corpus.visitTest(evaluator);
}

On each iteration of the for loop, we will set which fold is being used, which, in turn, will select the training and testing partitions. Then, we will construct DynamicLMClassifier and train it by supplying the classifier to corpus.visitTrain(classifier). Next, we will set the evaluator's classifier to the one we just trained. The evaluator is passed to the corpus.visitTest(evaluator) method, where the contained classifier is applied to the test data that it was not trained on. With four folds, 25 percent of the data will be test data at any given iteration, and 75 percent will be training data. Each datum will appear in the test partition exactly once and in the training partitions three times. The training and test partitions will never contain the same data unless there are duplicates in the data.
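Under the hood, visitTrain() and visitTest() simply route each corpus item to a handler according to whether it falls in the current fold. The sketch below mimics that dispatch with plain lists and a hypothetical Handler interface—it is an illustration of the pattern, not LingPipe's actual implementation:

```java
import java.util.List;

public class CorpusDispatch {

    // Stand-in for LingPipe's ObjectHandler<E> interface.
    interface Handler<E> {
        void handle(E item);
    }

    // A trivial handler that just counts the items it receives.
    static class Counter<E> implements Handler<E> {
        int n = 0;
        public void handle(E item) { n++; }
    }

    // Mimic visitTrain: hand every item NOT in the current fold to the handler.
    static <E> void visitTrain(List<E> data, int numFolds, int fold, Handler<E> h) {
        for (int i = 0; i < data.size(); ++i) {
            if (i % numFolds != fold) {
                h.handle(data.get(i));
            }
        }
    }

    // Mimic visitTest: hand every item IN the current fold to the handler.
    static <E> void visitTest(List<E> data, int numFolds, int fold, Handler<E> h) {
        for (int i = 0; i < data.size(); ++i) {
            if (i % numFolds == fold) {
                h.handle(data.get(i));
            }
        }
    }
}
```

A training handler would call train() in handle(); an evaluating handler would classify the item and score it against the truth—which is exactly the division of labor between the classifier and the evaluator in the loop above.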

Once the loop has finished all iterations, we will print a confusion matrix discussed in the Evaluation of classifiers – the confusion matrix recipe:

System.out.println(
  Util.confusionMatrixToString(evaluator.confusionMatrix()));

There's more…

This recipe introduces quite a few moving parts, namely, cross validation and a corpus object that supports it. The ObjectHandler<E> interface is also used a lot; this can be confusing to developers not familiar with the pattern. It is used to train and test the classifier. It can also be used to print the contents of the corpus. Change the contents of the for loop to also call visitTrain and visitTest with Util.corpusPrinter():

System.out.println("Training on fold " + i);
corpus.visitTrain(Util.corpusPrinter());
corpus.visitTrain(classifier);
evaluator.setClassifier(classifier);
System.out.println("Testing on fold " + i);
corpus.visitTest(Util.corpusPrinter());

Now, you will get an output that looks like:

Training on fold 0
Malísimos los nuevos dibujitos de disney, nickelodeon, cartoon, etc, no me gustannn:n
@meeelp mas que venha um filhinho mais fofo que o próprio pai, com covinha e amando a Disney kkkkkkkkkkkkkkkkk:n
@HedyHAMIDI au quartier pas a Disney moi:n
I fully love the Disney Channel I do not care ?:e

The text is followed by : and the category. Printing the training/test folds is a good sanity check for whether the corpus is properly populated. It is also a nice glimpse into how the ObjectHandler<E> interface works—here, the source is from com/lingpipe/cookbook/Util.java:

public static ObjectHandler<Classified<CharSequence>> corpusPrinter () {
  return new ObjectHandler<Classified<CharSequence>>() {
    @Override
    public void handle(Classified<CharSequence> e) {
      System.out.println(e.toString());
    }
  };
}

There is not much to the returned class. There is a single handle() method that just prints the toString() method of Classified<CharSequence>. In the context of this recipe, the classifier's handle() implementation instead invokes train() on the text and classification, and the evaluator's takes the text, runs it past the classifier, and compares the result to the truth.

Another good experiment to run is to report performance on each fold instead of all folds. For small datasets, you will see very large variations in performance. Another worthwhile experiment is to permute the corpus 10 times and see the variations in performance that come from different partitioning of the data.
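When running the per-fold experiment, summarizing the spread of per-fold scores makes the variation concrete. A small hedged sketch (the FoldVariance class and the example accuracies are hypothetical, not output from this recipe):

```java
public class FoldVariance {

    // Mean of the per-fold scores.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Population standard deviation of the per-fold scores; a large value
    // relative to the mean signals an unstable estimate, as is common on
    // small datasets.
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) {
            ss += (x - m) * (x - m);
        }
        return Math.sqrt(ss / xs.length);
    }
}
```

For example, per-fold accuracies of 0.60, 0.70, 0.80, and 0.90 have a mean of 0.75 but a standard deviation of roughly 0.11—far too much spread to trust any single fold's number.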

Another issue is how data is selected for evaluation. For text-processing applications, it is important not to leak information between test data and training data. Cross validation over 10 days of data will be much more realistic if each day is a fold rather than a 10-percent slice of all 10 days. The reason is that a day's data will likely be correlated, and this correlation will leak information about that day between training and testing if a day's data is allowed to appear in both partitions. When evaluating final performance, always select data from after the training data epoch if possible, to better emulate production environments where the future is not known.
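Grouping by day instead of slicing arbitrarily is a one-pass bucketing job. A hypothetical sketch (the DayFolds class and the {text, day} row layout are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DayFolds {

    // Each row is {text, day}. One fold per distinct day, so correlated
    // same-day items never straddle the train/test boundary.
    static Map<String, List<String>> foldsByDay(List<String[]> rows) {
        Map<String, List<String>> folds =
            new LinkedHashMap<String, List<String>>();
        for (String[] row : rows) {
            String text = row[0];
            String day = row[1];
            List<String> fold = folds.get(day);
            if (fold == null) {
                fold = new ArrayList<String>();
                folds.put(day, fold);  // first item seen for this day
            }
            fold.add(text);
        }
        return folds;
    }
}
```

Cross validating then means holding out one day's bucket at a time and training on the rest, instead of calling setFold() on an item-sliced corpus.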