Training your own language model classifier


The world of NLP really opens up when classifiers are customized. This recipe shows how to customize a classifier by collecting examples for it to learn from; this collection is called training data, and is also known as gold standard data, truth, or ground truth. We already have some from the previous recipe, and we will put it to use here.

Getting ready

We will create a customized language ID classifier for English and other languages. Creating training data involves getting access to text data and then annotating it for the categories of the classifier; in this case, the annotation is the language of the text. Training data can come from a range of sources. Some possibilities include:

  • Gold standard data, such as the data created in the preceding evaluation recipe.

  • Data that is somehow already annotated for the categories you care about. For example, Wikipedia has language-specific versions, which makes for easy pickings when training up a language ID classifier. This is how we created the 3LangId.LMClassifier model.

  • Be creative—where is the data that helps guide a classifier in the right direction?

Language ID doesn't require much data to work well; as few as 20 tweets per language can start to reliably distinguish strongly different languages. The amount of training data you ultimately need will be driven by evaluation; more data generally improves performance.

The example assumes that around 10 tweets of English and 10 non-English tweets have been annotated by people and put in data/disney_e_n.csv.
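
The exact column layout is determined by the book's Util class (its TEXT_OFFSET and ANNOTATION_OFFSET constants); the following is a purely illustrative sketch of what such an annotated file could look like, with a header row and hypothetical column names:

text,language
"I luv Disney",e
"El jardín de senderos que se bifurcan",n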

How to do it...

In order to train your own language model classifier, perform the following steps:

  1. Fire up a terminal and type the following:

    java -cp lingpipe-cookbook.1.0.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TrainAndRunLMClassifier
    
  2. Then, type some English in the command prompt, perhaps a Kurt Vonnegut quotation, to see the resulting JointClassification. See the Getting confidence estimates from a classifier recipe for an explanation of the following output:

    Type a string to be classified. Empty string to quit.
    So it goes.
    Rank Categ Score  P(Category|Input)  log2 P(Category,Input)
    0=e -4.24592987919 0.9999933712053  -55.19708842949149
    1=n -5.56922173547 6.62884502334E-6 -72.39988256112824
    
  3. Type in some non-English text, such as the Spanish title of Borges' The Garden of Forking Paths:

    Type a string to be classified. Empty string to quit.
    El Jardín de senderos que se bifurcan 
    Rank Categ Score  P(Category|Input)  log2 P(Category,Input)
    0=n -5.6612148689 0.999989087229795 -226.44859475801326
    1=e -6.0733050528 1.091277041753E-5 -242.93220211249715
    

How it works...

The program is in src/com/lingpipe/cookbook/chapter1/TrainAndRunLMClassifier.java; the contents of the main() method start with:

String dataPath = args.length > 0 ? args[0] : "data/disney_e_n.csv";
List<String[]> annotatedData = Util.readAnnotatedCsvRemoveHeader(new File(dataPath));
String[] categories = Util.getCategories(annotatedData);

The preceding code gets the contents of the .csv file and then extracts the list of categories that were annotated; these categories will be all the non-empty strings in the annotation column.
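
Util.readAnnotatedCsvRemoveHeader and Util.getCategories are helpers supplied with the book's source. As a rough sketch of what such helpers might do, here is a hypothetical version built on the opencsv library that is already on the classpath; the annotation column index below is an assumption, not the book's actual constant:

import au.com.bytecode.opencsv.CSVReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

static final int ANNOTATION_OFFSET = 1; // hypothetical column index

// Read every row of the .csv file and drop the header line.
static List<String[]> readAnnotatedCsvRemoveHeader(File file) throws IOException {
  CSVReader reader = new CSVReader(new FileReader(file));
  try {
    List<String[]> rows = reader.readAll();
    rows.remove(0); // discard the header row
    return rows;
  } finally {
    reader.close();
  }
}

// Collect the unique, non-empty strings in the annotation column.
static String[] getCategories(List<String[]> rows) {
  Set<String> categories = new TreeSet<String>();
  for (String[] row : rows) {
    String annotation = row[ANNOTATION_OFFSET];
    if (annotation != null && annotation.length() > 0) {
      categories.add(annotation);
    }
  }
  return categories.toArray(new String[0]);
}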

The following DynamicLMClassifier is created using a static method that requires the array of categories and an int, which is the order of the language models. With an order of 3, the language model will be trained on all 1- to 3-character sequences of the text training data. So "I luv Disney" will produce training instances of "I", "I ", "I l", " l", " lu", "u", "uv", "luv", and so on; a small demonstration follows the code below. The createNGramBoundary method appends a special token to the beginning and end of each text sequence; this token can help if the beginnings or ends are informative for classification. Most text data is sensitive to beginnings/ends, so we will choose this model:

int maxCharNGram = 3;
DynamicLMClassifier<NGramBoundaryLM> classifier = DynamicLMClassifier.createNGramBoundary(categories,maxCharNGram);
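
To see concretely which character sequences an order-3 model observes, this small self-contained snippet enumerates every 1- to 3-character substring of the sample text. It only illustrates the n-gram windows; LingPipe's boundary model additionally pads each sequence with a special boundary token, which is omitted here:

public class CharNGramDemo {
  public static void main(String[] args) {
    String text = "I luv Disney";
    int maxCharNGram = 3;
    // Slide a window of each length from 1 to maxCharNGram over the text.
    for (int length = 1; length <= maxCharNGram; ++length) {
      for (int start = 0; start + length <= text.length(); ++start) {
        System.out.println("\"" + text.substring(start, start + length) + "\"");
      }
    }
  }
}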

The following code iterates over the rows of training data and creates Classified<CharSequence> in the same way as shown in the Evaluation of classifiers – the confusion matrix recipe for evaluation. However, instead of passing the Classified object to an evaluation handler, it is used to train the classifier.

for (String[] row : annotatedData) {
  String truth = row[Util.ANNOTATION_OFFSET];
  String text = row[Util.TEXT_OFFSET];
  Classification classification = new Classification(truth);
  Classified<CharSequence> classified = new Classified<CharSequence>(text, classification);
  classifier.handle(classified);
}

No further steps are necessary, and the classifier is ready for use by the console:

Util.consoleInputPrintClassification(classifier);
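
Util.consoleInputPrintClassification is another of the book's helpers. Assuming only the classify() method that LMClassifier provides, a minimal sketch of such a read-classify-print loop might look like the following; the method name and its exact output formatting are hypothetical:

import com.aliasi.classify.JointClassification;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical stand-in for Util.consoleInputPrintClassification.
static void consoleClassify(DynamicLMClassifier<NGramBoundaryLM> classifier) throws IOException {
  BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
  while (true) {
    System.out.println("Type a string to be classified. Empty string to quit.");
    String line = in.readLine();
    if (line == null || line.length() == 0) {
      break; // an empty line ends the session
    }
    JointClassification jc = classifier.classify(line);
    System.out.println(jc); // prints ranks, categories, and scores
  }
}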

There's more...

Training and using the classifier can be interspersed for classifiers based on DynamicLM. This is generally not the case with other classifiers such as LogisticRegression, because they use all the data to compile a model that can carry out classifications.
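
For example, with the classifier trained earlier still in scope, the following sketch interleaves classification with further training; this works because DynamicLMClassifier updates its counts incrementally (the example texts are arbitrary):

// Classify, train on one more example, and classify again.
System.out.println(classifier.classify("So it goes."));
classifier.handle(new Classified<CharSequence>("Vonnegut is an author.", new Classification("e")));
System.out.println(classifier.classify("So it goes.")); // estimates may shift as counts grow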

There is another method for training the classifier that gives you more control over how the training goes. The following is the code snippet for this:

Classification classification = new Classification(truth);
Classified<CharSequence> classified = new Classified<CharSequence>(text,classification);
classifier.handle(classified);

Alternatively, we can have the same effect with:

int count = 1;
classifier.train(truth,text,count);

The train() method allows an extra degree of control for training, because it lets the count be set explicitly; a brief illustration follows. As we explore LingPipe classifiers, we will often see an alternate way of training that allows for some additional control beyond what the handle() method provides.
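
For instance, raising the count weights an example more heavily than a single handle() call would; the value 5 here is arbitrary and purely illustrative:

// Train as if this example counted five times.
classifier.train("e", "So it goes.", 5);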

Classifiers based on character language models work very well for tasks where distinctive character sequences matter. Language identification is an ideal candidate, but such classifiers can also be used for tasks such as sentiment analysis, topic assignment, and question answering.

See also

The Javadoc for LingPipe's classifiers is quite extensive on the underlying math that drives the technology.