How to serialize a LingPipe object – classifier example


In a deployment situation, trained classifiers and other Java objects with complex configuration or training are best accessed by deserializing them from disk. The first recipe did exactly this by reading an LMClassifier from disk with AbstractExternalizable. This recipe shows how to write the language ID classifier out to disk for later use.

Serializing a DynamicLMClassifier and reading it back in results in a different class: an instance of LMClassifier that performs the same as the one just trained, except that it can no longer accept training instances, because the counts have been converted to log probabilities and the backoff smoothing arcs are stored in suffix trees. The resulting classifier is much faster.

In general, most of the LingPipe classifiers, language models, and hidden Markov models (HMMs) implement both the Serializable and Compilable interfaces.
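
For reference, the compiled model written by this recipe can later be read back in much as the first recipe does it. The following is a minimal sketch, assuming the default model path used in this recipe; the classes involved live in the com.aliasi.util and com.aliasi.classify packages:

String modelPath = "models/my_disney_e_n.LMClassifier";
// readObject() deserializes the compiled classifier from disk
LMClassifier<LanguageModel, MultivariateDistribution> classifier =
    (LMClassifier<LanguageModel, MultivariateDistribution>)
        AbstractExternalizable.readObject(new File(modelPath));
Classification classification = classifier.classify("The rain in Spain");
System.out.println("Best Category: " + classification.bestCategory());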

Getting ready

We will work with the same data as we did in the Viewing error categories – false positives recipe.

How to do it...

Perform the following steps to serialize a LingPipe object:

  1. Go to the command prompt and type:

    java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TrainAndWriteClassifierToDisk
    
  2. The program will respond with the default file values for input/output:

    Training on data/disney_e_n.csv
    Wrote model to models/my_disney_e_n.LMClassifier
    
  3. Test that the model works by invoking the Deserializing and running a classifier recipe, specifying the classifier file to be read in:

    java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.LoadClassifierRunOnCommandLine models/my_disney_e_n.LMClassifier
    
  4. The usual interaction follows:

    Type a string to be classified. Empty string to quit.
    The rain in Spain
    Best Category: e 
    

How it works…

The contents of main() from src/com/lingpipe/cookbook/chapter1/TrainAndWriteClassifierToDisk.java start with the material covered in the previous recipes of this chapter: reading the .csv file, setting up a classifier, and training it. Please refer back to those recipes if any code is unclear.

The new bit for this recipe happens when we invoke the AbstractExternalizable.compileTo() method on DynamicLMClassifier, which compiles the model and writes it to a file. This method is used like the writeExternal method from Java's Externalizable interface:

AbstractExternalizable.compileTo(classifier, outFile);
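
For context, the surrounding main() boils down to roughly the following sketch. The category labels and n-gram size are illustrative assumptions, and the .csv-reading loop from the earlier recipes is elided:

// Sketch only: training rows would come from data/disney_e_n.csv as in step 2
String[] categories = {"e", "n"};   // assumed category labels
int maxCharNGram = 3;               // assumed character n-gram size
DynamicLMClassifier<NGramProcessLM> classifier =
    DynamicLMClassifier.createNGramProcess(categories, maxCharNGram);
// one handle() call per (text, category) row of the .csv file
classifier.handle(new Classified<CharSequence>("some tweet text",
    new Classification("e")));
// compile the trained model and write it to disk
File outFile = new File("models/my_disney_e_n.LMClassifier");
AbstractExternalizable.compileTo(classifier, outFile);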

This is all you need to know to write a classifier to disk.

There's more…

There is an alternative way to serialize that accommodates data sources not based on the File class; the classifier can also be written as follows:

// Compile to an ObjectOutputStream wrapping any output destination
FileOutputStream fos = new FileOutputStream(outFile);
ObjectOutputStream oos = new ObjectOutputStream(fos);
classifier.compileTo(oos);
oos.close();
fos.close();
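
Because what gets written this way is an ordinary serialized Java object, it can be read back from a stream with standard Java deserialization rather than going through a File. A sketch, assuming the same outFile used above:

// Read the compiled classifier back with plain Java deserialization
ObjectInputStream objIn = new ObjectInputStream(new FileInputStream(outFile));
@SuppressWarnings("unchecked")
LMClassifier<LanguageModel, MultivariateDistribution> loaded =
    (LMClassifier<LanguageModel, MultivariateDistribution>) objIn.readObject();
objIn.close();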

Additionally, DynamicLMClassifier can be compiled in memory, without involving the disk, with the static AbstractExternalizable.compile() method. It is used in the following fashion:

@SuppressWarnings("unchecked")
LMClassifier<LanguageModel, MultivariateDistribution> compiledLM =
    (LMClassifier<LanguageModel, MultivariateDistribution>)
        AbstractExternalizable.compile(classifier);

The compiled version is a lot faster but does not allow further training instances.
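
The in-memory compiled classifier can then be used right away; a brief illustrative usage:

// Classify a string with the freshly compiled classifier
JointClassification classification = compiledLM.classify("The rain in Spain");
System.out.println("Best Category: " + classification.bestCategory());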