Getting confidence estimates from a classifier


Classifiers tend to be a lot more useful if they report how confident they are in the classification, usually as a score or a probability. We often threshold classifiers to meet the performance requirements of an installation. For example, if it were vital that the classifier never make a mistake, we could require that the classification be very confident before committing to a decision.
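
The following is a minimal sketch, not taken from the book's code, of what such a threshold might look like using the ConditionalClassifier interface described next; the classifier, the input text, and the cutoff are illustrative assumptions:

import com.aliasi.classify.ConditionalClassification;
import com.aliasi.classify.ConditionalClassifier;

public class ConfidenceThreshold {
  // Return the best category only if its conditional probability clears
  // minProb; otherwise return null to signal that we abstain.
  public static String classifyIfConfident(ConditionalClassifier<CharSequence> classifier, CharSequence text, double minProb) {
    ConditionalClassification classification = classifier.classify(text);
    return classification.conditionalProbability(0) >= minProb
        ? classification.bestCategory()
        : null;
  }
}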

LingPipe classifiers exist on a hierarchy based on the kinds of estimates they provide. The backbone is a series of interfaces—don't freak out; it is actually pretty simple. You don't need to understand it now, but we do need to write it down somewhere for future reference:

  • BaseClassifier<E>: This is just your basic classifier of objects of type E. It has a classify() method that returns a classification, which in turn has a bestCategory() method and a toString() method that is of some informative use.

  • RankedClassifier<E> extends BaseClassifier<E>: The classify() method returns RankedClassification, which extends Classification and adds a category(int rank) method that returns the 1st- through nth-best categories. There is also a size() method that indicates how many classifications there are.

  • ScoredClassifier<E> extends RankedClassifier<E>: The returned ScoredClassification adds a score(int rank) method.

  • ConditionalClassifier<E> extends RankedClassifier<E>: The ConditionalClassification it produces has the property that the scores of all categories sum to 1, as accessed via the conditionalProbability(int rank) and conditionalProbability(String category) methods. There's more; you can read the Javadoc for this. This classification will be the workhorse of the book when things get fancy and we want to know the confidence that the tweet is English versus Japanese versus Spanish. These estimates have to sum to 1. A short sketch after this list shows these methods in use.

  • JointClassifier<E> extends ConditionalClassifier<E>: This provides JointClassification of the input and category in the space of all the possible inputs, and all such estimates sum to 1. This is a sparse space, so values are log based to avoid underflow errors. We don't see a lot of use of this estimate directly in production.
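
Here is a minimal sketch, not from the book's code, that exercises the methods each level of the hierarchy contributes; it assumes a ConditionalClassifier<CharSequence> is already in hand:

import com.aliasi.classify.ConditionalClassification;
import com.aliasi.classify.ConditionalClassifier;

public class HierarchyTour {
  public static void printRanks(ConditionalClassifier<CharSequence> classifier, CharSequence text) {
    ConditionalClassification c = classifier.classify(text);
    System.out.println("best category: " + c.bestCategory()); // Classification
    for (int rank = 0; rank < c.size(); ++rank) {             // RankedClassification
      System.out.printf("rank=%d category=%s score=%f P(cat|input)=%f%n",
          rank,
          c.category(rank),                 // RankedClassification
          c.score(rank),                    // ScoredClassification
          c.conditionalProbability(rank));  // ConditionalClassification
    }
  }
}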

Clearly, a great deal of thought has been put into the classification stack presented here. That is because, in the end, huge numbers of industrial NLP problems are handled by a classification system.

It turns out that our simplest classifier—in some arbitrary sense of simple—produces the richest estimates, which are joint classifications. Let's dive in.

Getting ready

In the previous recipe, we blithely deserialized to BaseClassifier<String>, which hid all the details of what was going on. The reality is a bit more complex than the hazy abstract class suggests. Note that the file on disk that was loaded is named 3LangId.LMClassifier. By convention, we name serialized models with the type of object they will deserialize to, which, in this case, is LMClassifier, and it extends BaseClassifier. The most specific typing for the classifier is:

LMClassifier<CompiledNGramBoundaryLM, MultivariateDistribution> classifier = (LMClassifier<CompiledNGramBoundaryLM, MultivariateDistribution>) AbstractExternalizable.readObject(new File(args[0]));

The cast to LMClassifier<CompiledNGramBoundaryLM, MultivariateDistribution> specifies the type of distribution to be MultivariateDistribution. The Javadoc for com.aliasi.stats.MultivariateDistribution is quite explicit and helpful in describing what this is.

Note

MultivariateDistribution implements a discrete distribution over a finite set of outcomes, numbered consecutively from zero.

The Javadoc goes into a lot of detail about MultivariateDistribution, but it basically means that we can have an n-way assignment of probabilities that sum to 1.

The next class in the cast is for CompiledNGramBoundaryLM, which is the "memory" of the LMClassifier. In fact, each language gets its own. This means that English will have a separate language model from Spanish and so on. There are eight different kinds of language models that could have been used as this part of the classifier—consult the Javadoc for the LanguageModel interface. Each language model (LM) has the following properties:

  • The LM will provide a probability that it generated the text provided. It is robust against data that it has not seen before, in the sense that it won't crash or give a zero probability. Arabic just comes across as a sequence of unknown characters for our example.

  • The sum of all the possible character sequence probabilities of any length is 1 for boundary LMs. Process LMs sum the probability to 1 over all sequences of the same length. Look at the Javadoc for how this bit of math is done.

  • Each language model has no knowledge of data outside of its category.

  • The classifier keeps track of the marginal probability of the category and factors this into the results for the category. The marginal probability says that, for example, we tend to see two-thirds English, one-sixth Spanish, and one-sixth Japanese in Disney tweets. This information is combined with the LM estimates, as sketched after this list.

  • The LM is a compiled version of LanguageModel.Dynamic, which we will cover in later recipes that discuss training.
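
The following is a minimal sketch, assuming hypothetical stand-ins for the classifier's internals (a log2 category marginal and that category's language model), of how the two estimates might combine into the joint log2 score reported later in this recipe:

import com.aliasi.lm.LanguageModel;

public class JointScoreSketch {
  // log2 P(category, text) = log2 P(category) + log2 P(text | category).
  // log2Marginal stands in for the category's marginal probability;
  // lm stands in for that category's compiled language model.
  public static double log2Joint(double log2Marginal, LanguageModel lm, CharSequence text) {
    return log2Marginal + lm.log2Estimate(text);
  }
}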

The LMClassifier that is constructed wraps these components into a classifier.

Luckily, the interface saves the day with a more aesthetic deserialization:

JointClassifier<String> classifier = (JointClassifier<String>) AbstractExternalizable.readObject(new File(classifierPath));

The interface hides the guts of the implementation nicely, and this is what we will go with in the example program.

How to do it…

This recipe is the first time we start peeling back the layers of what classifiers can do, but first, let's play with it a bit:

  1. Get your magic shell genie to conjure a command prompt with a Java interpreter and type:

    java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.RunClassifierJoint
    
  2. We will enter the same data as we did earlier:

    Type a string to be classified. Empty string to quit.
    The rain in Spain falls mainly on the plain.
    Rank Categ Score   P(Category|Input) log2 P(Category,Input)
    0=english -3.60092 0.9999999999         -165.64233893156052
    1=spanish -4.50479 3.04549412621E-13    -207.2207276413206
    2=japanese -14.369 7.6855682344E-150    -660.989401136873
    

As described, JointClassification carries through all the classification metrics in the hierarchy rooted at Classification. Each level of classification shown in the following list adds to the levels preceding it:

  • Classification provides the first best category as the rank 0 category.

  • RankedClassification adds an ordering of all the possible categories with a lower rank corresponding to greater likelihood of the category. The rank column reflects this ordering.

  • ScoredClassification adds a numeric score to the ranked output. Note that, depending on the type of classifier, scores might or might not be comparable across different strings being classified. This is the column labeled Score. To understand the basis of this score, consult the relevant Javadoc.

  • ConditionalClassification further refines the score by making it a category probability conditioned on the input. The probabilities of all categories will sum up to 1. This is the column labeled P(Category|Input), which is the traditional way to write probability of the category given the input.

  • JointClassification adds the log2 (log base 2) probability of the input and the category; this is the joint probability. The probabilities of all categories and inputs will sum up to 1, which is a very large space indeed, with very low probabilities assigned to any pair of category and string. This is why log2 values are used to prevent numerical underflow. This is the column labeled log2 P(Category,Input), which reads as the log2 probability of the category and input. The sketch after this list shows how these joint values relate to the P(Category|Input) column.
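
Here is a minimal sketch, not part of the recipe's code, showing how conditional probabilities can be recovered from log2 joint scores by renormalizing over the categories; the hard-coded values are the ones printed in the earlier run:

public class JointToConditional {
  // Converts log2 joint scores into conditional probabilities that sum to 1.
  // Subtracting the maximum before exponentiating avoids numerical underflow.
  public static double[] conditionals(double[] log2Joints) {
    double max = Double.NEGATIVE_INFINITY;
    for (double log2Joint : log2Joints) {
      max = Math.max(max, log2Joint);
    }
    double[] probs = new double[log2Joints.length];
    double sum = 0.0;
    for (int i = 0; i < probs.length; ++i) {
      probs[i] = Math.pow(2.0, log2Joints[i] - max);
      sum += probs[i];
    }
    for (int i = 0; i < probs.length; ++i) {
      probs[i] /= sum;
    }
    return probs;
  }

  public static void main(String[] args) {
    // log2 P(Category,Input) for english, spanish, and japanese from the run above.
    double[] log2Joints = {-165.64233893156052, -207.2207276413206, -660.989401136873};
    for (double p : conditionals(log2Joints)) {
      System.out.println(p); // roughly 1.0, 3.0E-13, and 7.7E-150
    }
  }
}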

Look at the Javadoc for the com.aliasi.classify package for more information on the metrics and classifiers that implement them.

How it works…

The code is in src/com/lingpipe/cookbook/chapter1/RunClassifierJoint.java, and it deserializes to a JointClassifier<CharSequence>:

public static void main(String[] args) throws IOException, ClassNotFoundException {
  String classifierPath = args.length > 0 ? args[0] : "models/3LangId.LMClassifier";
  @SuppressWarnings("unchecked")
  JointClassifier<CharSequence> classifier = (JointClassifier<CharSequence>) AbstractExternalizable.readObject(new File(classifierPath));
  Util.consoleInputPrintClassification(classifier);
}

It makes a call to Util.consoleInputPrintClassification(classifier), which differs minimally from Util.consoleInputBestCategory(classifier) in that it uses the toString() method of the classification to print. The code is as follows:

public static void consoleInputPrintClassification(BaseClassifier<CharSequence> classifier) throws IOException {
  BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
  while (true) {
    System.out.println("\nType a string to be classified." + Empty string to quit.");
    String data = reader.readLine();
    if (data.equals("")) {
      return;
    }
    Classification classification = classifier.classify(data);
    System.out.println(classification);
  }
}

We got a richer output than we expected because, although the declared type is Classification, the toString() method is applied to the runtime type, JointClassification.
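
A minimal sketch of the point, assuming the same classifier and input as above, might be:

// The declared type below is Classification, but println() calls toString() on
// the runtime object, which for this classifier is a JointClassification, so
// the full rank/score/probability table is printed.
Classification classification = classifier.classify("The rain in Spain falls mainly on the plain.");
System.out.println(classification);
System.out.println(classification instanceof JointClassification); // prints true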

See also