Natural Language Processing with Java and LingPipe Cookbook

Deserializing and running a classifier


This recipe does two things: it introduces a very simple and effective language ID classifier, and it demonstrates how to deserialize a LingPipe class. If you have come here from a later chapter to understand deserialization, I encourage you to run the example program anyway. It takes about 5 minutes, and you might learn something useful.

Our language ID classifier is based on character language models. Each language model gives the probability of the text, given that it was generated in that language. The model that assigns the text the highest probability is the best fit, and its language is returned as the answer. This classifier has already been built, but later in the chapter, you will learn to make your own.
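To make the idea concrete, here is a toy sketch of "one character model per language, pick the most probable". It is not LingPipe code: the class CharLmSketch, its methods, the unigram model with add-one smoothing, and the tiny training sentences are all assumptions chosen for illustration. The real classifier uses character n-gram models trained on far more data.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy character-unigram language ID. Illustrative only, not LingPipe code. */
class CharLmSketch {

    /** Add-one smoothed character unigram model for a single language. */
    static final class CharModel {
        private final Map<Character, Integer> counts = new HashMap<>();
        private int total = 0;

        void train(String text) {
            for (char c : text.toCharArray()) {
                counts.merge(c, 1, Integer::sum);
                total++;
            }
        }

        /** Returns log2 P(text | this language). */
        double log2Prob(String text) {
            // One extra slot in the denominator stands for "any unseen character".
            double denominator = total + counts.size() + 1;
            double sum = 0.0;
            for (char c : text.toCharArray()) {
                double p = (counts.getOrDefault(c, 0) + 1) / denominator;
                sum += Math.log(p) / Math.log(2);
            }
            return sum;
        }
    }

    /** Returns the language whose model gives the text the highest probability. */
    static String bestCategory(Map<String, CharModel> models, String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, CharModel> entry : models.entrySet()) {
            double score = entry.getValue().log2Prob(text);
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Builds two tiny models from one invented training sentence each. */
    static Map<String, CharModel> trainToyModels() {
        Map<String, CharModel> models = new LinkedHashMap<>();
        CharModel english = new CharModel();
        english.train("the rain in spain falls mainly on the plain");
        CharModel spanish = new CharModel();
        spanish.train("la lluvia en españa cae principalmente en el llano");
        models.put("english", english);
        models.put("spanish", spanish);
        return models;
    }

    public static void main(String[] args) {
        Map<String, CharModel> models = trainToyModels();
        for (String text : new String[] {"the plain", "el llano"}) {
            System.out.println(text + " -> " + bestCategory(models, text));
        }
    }
}
```

Working in log probabilities avoids numeric underflow on longer texts, since the product of many small per-character probabilities quickly becomes too small for a double.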

How to do it...

Perform the following steps to deserialize and run a classifier:

  1. Go to the cookbook directory for the book and run the following command on OS X, Unix, or Linux:

    java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.RunClassifierFromDisk
    

    For Windows invocation (quote the classpath and use ; instead of :):

    java -cp "lingpipe-cookbook.1.0.jar;lib\lingpipe-4.1.0.jar" com.lingpipe.cookbook.chapter1.RunClassifierFromDisk
    

    We will use the Unix style command line in this book.

  2. The program reports the model being loaded (here, the default) and prompts for a sentence to classify:

    Loading: models/3LangId.LMClassifier
    Type a string to be classified. Empty string to quit.
    The rain in Spain falls mainly on the plain.
    english
    Type a string to be classified. Empty string to quit.
    la lluvia en España cae principalmente en el llano.
    spanish
    Type a string to be classified. Empty string to quit.
    スペインの雨は主に平野に落ちる。
    japanese
    
  3. The classifier is trained on English, Spanish, and Japanese, and we have entered an example of each. To get some Japanese, go to http://ja.wikipedia.org/wiki/. These are the only languages it knows about, but it will make a guess for any text. So, let's try some Arabic:

    Type a string to be classified. Empty string to quit.
    المطر في اسبانيا يقع أساسا على سهل.
    japanese
    
  4. The classifier guesses Japanese because Japanese has many more distinct characters than English or Spanish. This in turn leads the Japanese model to expect more unknown characters than the other two models do, and all the Arabic characters are unknown.

  5. If you are working with a Windows terminal, you might encounter difficulty entering UTF-8 characters.
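The Arabic result in step 4 can be illustrated with a little arithmetic. The sketch below uses a simplified Witten-Bell-style estimate, in which a model that has seen T distinct characters in N training characters reserves T / (N + T) of its probability for a character it has never seen. Whether this matches LingPipe's internal computation exactly is an assumption, and the class UnseenMassSketch and the counts in main are invented for illustration rather than taken from the 3LangId model.

```java
/** Illustrative arithmetic only; the counts are invented, not read from a real model. */
class UnseenMassSketch {

    /** Witten-Bell-style probability reserved for a never-seen character. */
    static double unseenMass(int distinctChars, long trainingChars) {
        return (double) distinctChars / (trainingChars + distinctChars);
    }

    /** log2 probability of a text consisting entirely of never-seen characters. */
    static double log2ProbAllUnseen(int textLength, int distinctChars, long trainingChars) {
        double perChar = Math.log(unseenMass(distinctChars, trainingChars)) / Math.log(2);
        return textLength * perChar;
    }

    public static void main(String[] args) {
        long trainingChars = 100_000;   // hypothetical amount of training text per language
        int smallAlphabet = 60;         // hypothetical distinct characters, alphabetic language
        int largeAlphabet = 2_500;      // hypothetical distinct characters, Japanese-like language
        int textLength = 35;            // length of a text in an unknown script

        System.out.println("small alphabet: "
            + log2ProbAllUnseen(textLength, smallAlphabet, trainingChars));
        System.out.println("large alphabet: "
            + log2ProbAllUnseen(textLength, largeAlphabet, trainingChars));
    }
}
```

Under these assumptions, the large-alphabet model gives the unknown text the higher (less negative) log probability, so it wins the comparison even though it has never seen any of the characters.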

How it works...

The source code is in cookbook/src/com/lingpipe/cookbook/chapter1/RunClassifierFromDisk.java. The program deserializes a pre-built language identification model and makes it available for classification. The model has been trained on English, Japanese, and Spanish, with training data taken from Wikipedia pages for each language. You can see the data in data/3LangId.csv. The focus of this recipe is to show you how to deserialize the classifier and run it; training is handled in the Training your own language model classifier recipe in this chapter. The code for the RunClassifierFromDisk.java class starts with the package declaration, followed by the imports, the start of the RunClassifierFromDisk class, and the start of main():

package com.lingpipe.cookbook.chapter1;
import java.io.File;
import java.io.IOException;

import com.aliasi.classify.BaseClassifier;
import com.aliasi.util.AbstractExternalizable;
import com.lingpipe.cookbook.Util;
public class RunClassifierFromDisk {
  public static void main(String[] args) throws
  IOException, ClassNotFoundException {

The preceding code is standard Java, and we present it without explanation. Next comes a feature found in most recipes: a default value is supplied for a file if the command line does not contain one. This allows you to use your own data if you have it; otherwise, the recipe runs from files in the distribution. In this case, a default classifier is supplied if there is no argument on the command line:

String classifierPath = args.length > 0 ? args[0] : "models/3LangId.LMClassifier";
System.out.println("Loading: " + classifierPath);

Next, we will see how to deserialize a classifier or another LingPipe object from disk:

File serializedClassifier = new File(classifierPath);
// The type parameter must match the one that Util.consoleInputBestCategory() expects.
@SuppressWarnings("unchecked")
BaseClassifier<CharSequence> classifier
  = (BaseClassifier<CharSequence>)
  AbstractExternalizable.readObject(serializedClassifier);

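The preceding code snippet is the first LingPipe-specific code, where the classifier is read from disk using the static AbstractExternalizable.readObject method. The method returns an Object, which is why the cast and the @SuppressWarnings("unchecked") annotation are needed.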
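For readers new to Java serialization, the following sketch shows the same read-and-cast pattern using only java.io. It assumes that the LingPipe helper is essentially a convenience wrapper around an ObjectInputStream; the class ReadObjectSketch and its methods are hypothetical and are not part of LingPipe.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of reading a serialized object from a file. Not LingPipe code. */
class ReadObjectSketch {

    /** Reads one serialized object from the file and closes the stream. */
    static Object readObject(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }

    /** Writes one serializable object to the file and closes the stream. */
    static void writeObject(File file, Object object) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(object);
        }
    }

    /** Serializes the labels to a temporary file and reads them back. */
    static List<String> roundTrip(List<String> labels) {
        try {
            File tmp = File.createTempFile("labels", ".ser");
            tmp.deleteOnExit();
            writeObject(tmp, new ArrayList<>(labels));
            // readObject returns Object, so the caller states the expected type with a cast.
            @SuppressWarnings("unchecked")
            List<String> restored = (List<String>) readObject(tmp);
            return restored;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("english", "spanish", "japanese")));
    }
}
```

The cast is unchecked because generic type arguments are erased at runtime: Java can verify that the object is a List, but not that it is a list of strings. The same holds for casting a deserialized object to a parameterized classifier type.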

The AbstractExternalizable class is employed throughout LingPipe to compile classes, for two reasons. First, it allows the compiled objects to have final variables set, which supports LingPipe's extensive use of immutables. Second, it avoids the messiness of exposing the I/O methods required for externalization and deserialization, most notably, the no-argument constructor. The class is used as the superclass of a private internal class that does the actual compilation. This private internal class implements the required no-arg constructor and stores the object required for readResolve().
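The pattern described above can be sketched in plain Java. The example below is an illustration written against java.io only, not LingPipe's implementation: CompiledModelSketch, its fields, and the nested Externalizer class are hypothetical names. The outer class keeps its fields final and exposes no I/O methods, while a private nested class supplies the public no-arg constructor and hands back the real object from readResolve().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Immutable object serialized through a private proxy. Illustrative, not LingPipe code. */
class CompiledModelSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String name;       // final fields: the object is immutable
    private final double[] weights;

    CompiledModelSketch(String name, double[] weights) {
        this.name = name;
        this.weights = weights.clone();
    }

    String name() { return name; }
    double weight(int index) { return weights[index]; }

    /** Serialization writes the proxy instead of this object. */
    private Object writeReplace() {
        return new Externalizer(this);
    }

    /** Private proxy that owns all of the I/O code. */
    private static class Externalizer implements Externalizable {
        private static final long serialVersionUID = 1L;
        private CompiledModelSketch model;

        /** Public no-arg constructor required by Externalizable. */
        public Externalizer() { }

        Externalizer(CompiledModelSketch model) { this.model = model; }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(model.name);
            out.writeInt(model.weights.length);
            for (double w : model.weights) {
                out.writeDouble(w);
            }
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            String name = in.readUTF();
            double[] weights = new double[in.readInt()];
            for (int i = 0; i < weights.length; i++) {
                weights[i] = in.readDouble();
            }
            model = new CompiledModelSketch(name, weights);
        }

        /** Deserialization returns the rebuilt object, never the proxy. */
        private Object readResolve() { return model; }
    }

    /** Serializes to memory and back, wrapping checked exceptions for brevity. */
    static CompiledModelSketch roundTrip(CompiledModelSketch original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CompiledModelSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Round trip failed", e);
        }
    }

    public static void main(String[] args) {
        CompiledModelSketch copy =
            roundTrip(new CompiledModelSketch("demo", new double[] {0.25, 0.75}));
        System.out.println(copy.name() + " " + copy.weight(0) + " " + copy.weight(1));
    }
}
```

Because the proxy controls the byte layout, fields and method signatures on the outer class can change later without invalidating previously written files, as long as the proxy's read and write code stay in agreement.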

Note

The reason we use Externalizable instead of Serializable is to avoid breaking backward compatibility when changing any method signatures or member variables. Externalizable extends Serializable and allows control of how the object is read or written. For more information on this, refer to the excellent chapter on serialization in Josh Bloch's book, Effective Java, 2nd Edition.

BaseClassifier<E> is LingPipe's foundational classifier interface, with E being the type of object being classified. Look at the Javadoc to see the range of classifiers that implement the interface; there are 10 of them. Deserializing to BaseClassifier<E> hides a good bit of complexity, which we will explore later in the How to serialize a LingPipe object – classifier example recipe in this chapter.

The last line calls a utility method, which we will use frequently in this book:

Util.consoleInputBestCategory(classifier);

This method handles interactions with the command line. The code is in src/com/lingpipe/cookbook/Util.java:

public static void consoleInputBestCategory(
    BaseClassifier<CharSequence> classifier) throws IOException {
  BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
  while (true) {
    System.out.println("\nType a string to be classified. Empty string to quit.");
    String data = reader.readLine();
    // readLine() returns null at end of input, so treat that like an empty line.
    if (data == null || data.isEmpty()) {
      return;
    }
    // Classify the line and print the top-scoring category.
    Classification classification = classifier.classify(data);
    System.out.println("Best Category: " + classification.bestCategory());
  }
}

Once the string is read in from the console, classifier.classify(data) is called, which returns a Classification. Its bestCategory() method, in turn, provides the String label that is printed out. That's it! You have run a classifier.