How to classify sentiment – simple version


Sentiment has become the classic business-oriented classification task—what executive can resist the ability to know, on a constant basis, what positive and negative things are being said about their business? Sentiment classifiers offer this capability by taking text data and classifying it into positive and negative categories. This recipe addresses the process of creating a simple sentiment classifier, but more generally, it addresses how to create classifiers for novel categories. The classifier built here is also a 3-way classifier, unlike the 2-way classifiers we have been working with.

Our first sentiment system was built for BuzzMetrics in 2004 using language model classifiers. We tend to use logistic regression classifiers now, because they tend to perform better. Chapter 3, Advanced Classifiers, covers logistic regression classifiers.

How to do it…

The previous recipes focused on language ID—how do we shift the classifier over to the very different task of sentiment? This will be much simpler than one might think—all that needs to change is the training data, believe it or not. The steps are as follows:

  1. Use the Twitter search recipe to download tweets about a topic that has positive/negative tweets about it. A search on disney is our example, but feel free to branch out. This recipe will work with the supplied CSV file, data/disneySentiment_annot.csv.

  2. Load the created data/disneySentiment_annot.csv file into your spreadsheet of choice. There are already some annotations done.

  3. As in the Evaluation of classifiers – the confusion matrix recipe, annotate the true class column for one of the three categories:

    • The p annotation stands for positive. The example is "Oh well, I love Disney movies. #hateonit".

    • The n annotation stands for negative. The example is "Disney really messed me up yo, this is not the way things are suppose to be".

    • The o annotation stands for other. The example is "Update on Downtown Disney. http://t.co/SE39z73vnw".

    • Leave blank any tweets that are not in English, are irrelevant, are both positive and negative, or that you are unsure about.

  4. Keep annotating until the smallest category has at least 10 examples.

  5. Save the annotations.

  6. Run the previous recipe for cross validation, providing the annotated file's name (a rough sketch of what the cross-validation driver does appears after these steps):

    java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar:lib/opencsv-2.4.jar com.lingpipe.cookbook.chapter1.RunXValidate data/disneyDedupedSentiment.csv
    
  7. The system will then run a four-fold cross validation and print a confusion matrix. Look at the How to train and evaluate with cross validation recipe if you need further explanation:

    Training on fold 0
    Testing on fold 0
    Training on fold 1
    Testing on fold 1
    Training on fold 2
    Testing on fold 2
    Training on fold 3
    Testing on fold 3
    reference\response
        \p,n,o,
        p 14,0,10,
        n 6,0,4,
        o 7,1,37,
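
The RunXValidate class in the book's source does the real work. What follows is only a minimal reconstruction of the same flow, not the actual recipe code: it assumes hypothetical column positions for the annotation and the tweet text (check the supplied CSV for the real layout) and tallies the confusion matrix by hand rather than through LingPipe's evaluation classes:

    import au.com.bytecode.opencsv.CSVReader;
    import com.aliasi.classify.Classification;
    import com.aliasi.classify.Classified;
    import com.aliasi.classify.DynamicLMClassifier;
    import com.aliasi.classify.XValidatingObjectCorpus;
    import com.aliasi.corpus.ObjectHandler;
    import com.aliasi.lm.NGramProcessLM;

    import java.io.FileReader;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;

    public class XValidateSketch {

        // Hypothetical column layout -- adjust to the annotated CSV.
        static final int ANNOTATION_COLUMN = 0;
        static final int TEXT_COLUMN = 3;

        public static void main(String[] args) throws Exception {
            String[] categories = { "p", "n", "o" };
            final List<String> catList = Arrays.asList(categories);
            int numFolds = 4;
            XValidatingObjectCorpus<Classified<CharSequence>> corpus
                = new XValidatingObjectCorpus<Classified<CharSequence>>(numFolds);

            // Load the annotated tweets, skipping rows the annotator left blank.
            CSVReader csv = new CSVReader(new FileReader(args[0]));
            List<String[]> rows = csv.readAll();
            csv.close();
            for (String[] row : rows) {
                String label = row[ANNOTATION_COLUMN].trim();
                if (!catList.contains(label))
                    continue;
                corpus.handle(new Classified<CharSequence>(
                    row[TEXT_COLUMN], new Classification(label)));
            }
            corpus.permuteCorpus(new Random(123)); // shuffle before folding

            // Reference-by-response counts accumulated across all folds.
            final int[][] confusion = new int[categories.length][categories.length];

            for (int fold = 0; fold < numFolds; ++fold) {
                corpus.setFold(fold);
                // A fresh character n-gram language model classifier per fold.
                final DynamicLMClassifier<NGramProcessLM> classifier
                    = DynamicLMClassifier.createNGramProcess(categories, 3);
                System.out.println("Training on fold " + fold);
                corpus.visitTrain(classifier);
                System.out.println("Testing on fold " + fold);
                corpus.visitTest(new ObjectHandler<Classified<CharSequence>>() {
                    public void handle(Classified<CharSequence> testCase) {
                        String reference = testCase.getClassification().bestCategory();
                        String response
                            = classifier.classify(testCase.getObject()).bestCategory();
                        confusion[catList.indexOf(reference)][catList.indexOf(response)]++;
                    }
                });
            }

            System.out.println("reference\\response " + catList);
            for (int i = 0; i < categories.length; ++i)
                System.out.println(categories[i] + " " + Arrays.toString(confusion[i]));
        }
    }

The book's code produces the confusion matrix output shown previously through LingPipe's evaluation classes; the hand-rolled tally here is only to make the mechanics visible.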
    

That's it! Classifiers are entirely dependent on training data for what they classify. More sophisticated techniques will bring richer features into the mix than character ngrams, but ultimately, the labels imposed by the training data are the knowledge being imparted to the classifier. Depending on your view, the underlying technology is magical or astoundingly simple-minded.

How it works…

Most developers are surprised that the only difference between language ID and sentiment is the labeling applied to the training data. The language model classifier applies a separate language model to each category and also factors the marginal distribution of the categories into its estimates.
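
A tiny sketch makes this concrete. The three training tweets below are for illustration only; a real classifier needs the annotated corpus from the previous steps:

    import com.aliasi.classify.Classification;
    import com.aliasi.classify.Classified;
    import com.aliasi.classify.DynamicLMClassifier;
    import com.aliasi.classify.JointClassification;
    import com.aliasi.lm.NGramProcessLM;

    public class SentimentLmSketch {
        public static void main(String[] args) {
            // Identical machinery to the language ID recipes; only the labels differ.
            String[] categories = { "p", "n", "o" };
            DynamicLMClassifier<NGramProcessLM> classifier
                = DynamicLMClassifier.createNGramProcess(categories, 3);

            classifier.handle(new Classified<CharSequence>(
                "I love Disney", new Classification("p")));
            classifier.handle(new Classified<CharSequence>(
                "Disney really messed me up yo", new Classification("n")));
            classifier.handle(new Classified<CharSequence>(
                "Update on Downtown Disney", new Classification("o")));

            // One character language model per category plus the category
            // marginals yields a ranked score for every category.
            JointClassification jc = classifier.classify("I love the new ride");
            for (int rank = 0; rank < jc.size(); ++rank)
                System.out.println(jc.category(rank)
                    + " log2 joint=" + jc.jointLog2Probability(rank)
                    + " conditional=" + jc.conditionalProbability(rank));
        }
    }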

There's more…

Classifiers are pretty dumb but very useful if they are not expected to work outside their capabilities. Language ID works great as a classification problem because the observed events are tightly tied to the classification being done—the words and characters of a language. Sentiment is more difficult because the observed events are exactly the same as in language ID but are less strongly associated with the end classification. For example, the phrase "I love" is a good predictor of the sentence being English but not as clear a predictor of whether the sentiment is positive, negative, or other. If the tweet is "I love Disney", then we have a positive statement. If the tweet is "I love Disney, not", then it is negative. The complexities of sentiment and other more difficult phenomena tend to be addressed in the following ways:

  • Create more training data. Even relatively dumb techniques such as language model classifiers can perform very well given enough data. Humanity is just not that creative in ways to gripe about, or praise, something. The Train a little, learn a little – active learning recipe of Chapter 3, Advanced Classifiers, presents a clever way to do this.

  • Use fancier classifiers that in turn use fancier features (observations) about the data to get the job done. Look at the logistic regression recipes for more information. For the negation case, a feature that looked for a negative phrase in the tweet might help, as sketched after this list. This could get arbitrarily sophisticated.
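
For a flavor of what such a feature might look like, here is a hypothetical feature extractor written against LingPipe's FeatureExtractor interface. The phrase list and feature names are invented for illustration; the logistic regression recipes in Chapter 3, Advanced Classifiers, cover how features actually get consumed:

    import com.aliasi.util.FeatureExtractor;

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical extractor: bag-of-words features plus a crude negation flag.
    public class NegationAwareFeatureExtractor
            implements FeatureExtractor<CharSequence> {

        private static final String[] NEGATION_PHRASES = { "not", "never", "no way" };

        public Map<String, ? extends Number> features(CharSequence in) {
            Map<String, Integer> feats = new HashMap<String, Integer>();
            String text = in.toString().toLowerCase();
            // One feature per token.
            for (String token : text.split("\\s+"))
                feats.put("tok_" + token, 1);
            // Flag tweets containing an explicit negation phrase, so
            // "I love Disney, not" looks different from "I love Disney".
            for (String phrase : NEGATION_PHRASES)
                if (text.contains(phrase))
                    feats.put("HAS_NEGATION", 1);
            return feats;
        }
    }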

Note that a more appropriate way to take on the sentiment problem can be to create a binary classifier for positive and not positive and a binary classifier for negative and not negative. The classifiers will have separate training data and will allow for a tweet to be both positive and negative.

Common problems as a classification problem

Classifiers form the foundations of many industrial NLP problems. This recipe goes through the process of encoding some common problems into a classification-based solution. We will pull from real-world examples that we have built whenever possible. You can think of them as mini recipes.

Topic detection

Problem: Take footnotes from financial documents (10Qs and 10Ks) and determine whether an eXtensible Business Reporting Language (XBRL) category, such as "forward looking financial statements", applies. It turns out that footnotes are where all the action happens. For example, is the footnote referring to retired debt? Performance needed to be greater than 90 percent precision with acceptable recall.

Solution: This problem closely mirrors how we approached language ID and sentiment. The actual solution involves a sentence recognizer that detects the footnotes—see Chapter 5, Finding Spans in Text – Chunking—and then creates training data for each of the XBRL categories. We used the confusion matrix output to help refine the XBRL categories that the system was struggling to distinguish. Merging categories was a possibility, and we did merge them. This system is based on language model classifiers. If done now, we would use logistic regression.
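
One practical trick from that kind of project is to scan the confusion matrix for its largest off-diagonal cells to see which categories are bleeding into each other. Assuming the ConfusionMatrix produced by LingPipe's evaluator, a small helper along these lines would do it:

    import com.aliasi.classify.ConfusionMatrix;

    // Print every off-diagonal confusion cell: reference category on the left,
    // the category it was mistaken for on the right. Large counts are merge
    // (or re-annotation) candidates.
    public class ConfusionHotspots {
        public static void report(ConfusionMatrix cm) {
            String[] cats = cm.categories();
            int[][] counts = cm.matrix();
            for (int ref = 0; ref < cats.length; ++ref)
                for (int resp = 0; resp < cats.length; ++resp)
                    if (ref != resp && counts[ref][resp] > 0)
                        System.out.println(cats[ref] + " -> " + cats[resp]
                            + ": " + counts[ref][resp]);
        }
    }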

Question answering

Problem: Identify FAQs in a large dataset of text-based customer support data, develop answers for them, and deliver those answers automatically with 90 percent precision.

Solution: Perform clustering analysis over logs to find FAQs—see Chapter 6, String Comparison and Clustering. This will result in a very large set of FAQs that are really Infrequently Asked Questions (IAQs); this means that the prevalence of an IAQ can be as low as 1/20,000. Positive data is fairly easy to find for a classifier, but negative data is too expensive to create with any kind of balanced distribution—for every positive case, one can expect 19,999 negative cases. The solution is to assume that any random sample of a large size will contain very few positives and to just use this as negative data. A refinement is to run a trained classifier over the negatives to find high-scoring cases and annotate them to pull out the positives that might be found.
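
A sketch of that refinement follows; the category names and threshold are illustrative, and the classifier choice and data handling are heavily simplified:

    import com.aliasi.classify.Classification;
    import com.aliasi.classify.Classified;
    import com.aliasi.classify.ConditionalClassification;
    import com.aliasi.classify.DynamicLMClassifier;
    import com.aliasi.lm.NGramProcessLM;

    import java.util.List;

    // Illustrative negative-sampling sketch: a random sample stands in for
    // negatives, then the highest-scoring "negatives" are surfaced for review.
    public class IaqSampling {

        static final String FAQ = "faq";
        static final String OTHER = "other";

        public static void surfaceForReview(List<String> faqExamples,
                                            List<String> randomSample,
                                            double threshold) {
            DynamicLMClassifier<NGramProcessLM> classifier
                = DynamicLMClassifier.createNGramProcess(
                    new String[] { FAQ, OTHER }, 4);

            // Annotated positives.
            for (String text : faqExamples)
                classifier.handle(new Classified<CharSequence>(
                    text, new Classification(FAQ)));
            // Assume a large random sample contains almost no positives.
            for (String text : randomSample)
                classifier.handle(new Classified<CharSequence>(
                    text, new Classification(OTHER)));

            // Re-score the assumed negatives; a high FAQ probability flags a
            // likely mislabeled positive worth sending to an annotator.
            for (String text : randomSample) {
                ConditionalClassification c = classifier.classify(text);
                for (int rank = 0; rank < c.size(); ++rank)
                    if (FAQ.equals(c.category(rank))
                            && c.conditionalProbability(rank) > threshold)
                        System.out.println("Review: " + text);
            }
        }
    }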

Degree of sentiment

Problem: Classify sentiment on a scale of 1 to 10 according to its degree, from most negative to most positive.

Solution: Even though our classifiers provide a score that can be mapped onto a 1-to-10 scale, this is not what the underlying computation is doing. To correctly map to a degree scale, one will have to annotate the distinction in training data—this tweet is a 1, this tweet is a 3, and so on. We would then train a 10-way classifier, and the first-best category should, in theory, be the degree. We write in theory because, despite regular customer requests for this, we have never found a customer who was willing to support the required annotation.
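
If that annotation existed, the mechanics would again be nothing more than a larger label set; a sketch, assuming degree labels "1" through "10":

    import com.aliasi.classify.DynamicLMClassifier;
    import com.aliasi.lm.NGramProcessLM;

    // Sketch only: presupposes tweets annotated with a degree label "1".."10".
    public class DegreeOfSentiment {

        public static DynamicLMClassifier<NGramProcessLM> create() {
            String[] degrees = new String[10];
            for (int i = 0; i < 10; ++i)
                degrees[i] = Integer.toString(i + 1);
            return DynamicLMClassifier.createNGramProcess(degrees, 3);
        }

        // After training, the first-best category is read back as a number.
        public static int degree(DynamicLMClassifier<NGramProcessLM> classifier,
                                 CharSequence tweet) {
            return Integer.parseInt(classifier.classify(tweet).bestCategory());
        }
    }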

Non-exclusive category classification

Problem: The desired classifications are not mutually exclusive. A tweet can say both positive and negative things, for example, "Loved Mickey, hated Pluto". Our classifiers assume that categories are mutually exclusive.

Solution: We regularly use multiple binary classifiers in place of one n-way or multinomial classifier. The classifiers will be trained for positive/non-positive and negative/non-negative. A tweet can then be annotated as both p and n.
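
A sketch of the two-classifier setup; the category labels are made up for illustration:

    import com.aliasi.classify.Classification;
    import com.aliasi.classify.Classified;
    import com.aliasi.classify.DynamicLMClassifier;
    import com.aliasi.lm.NGramProcessLM;

    // Two independent binary classifiers, each with its own training data.
    public class TwoBinarySentiment {

        final DynamicLMClassifier<NGramProcessLM> positive
            = DynamicLMClassifier.createNGramProcess(new String[] { "p", "notP" }, 3);
        final DynamicLMClassifier<NGramProcessLM> negative
            = DynamicLMClassifier.createNGramProcess(new String[] { "n", "notN" }, 3);

        void trainPositive(String text, String label) { // "p" or "notP"
            positive.handle(new Classified<CharSequence>(text, new Classification(label)));
        }

        void trainNegative(String text, String label) { // "n" or "notN"
            negative.handle(new Classified<CharSequence>(text, new Classification(label)));
        }

        // "Loved Mickey, hated Pluto" can come back as both positive and negative.
        String classify(String tweet) {
            boolean isPos = "p".equals(positive.classify(tweet).bestCategory());
            boolean isNeg = "n".equals(negative.classify(tweet).bestCategory());
            return (isPos ? "p" : "") + (isNeg ? "n" : "");
        }
    }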

Person/company/location detection

Problem: Detect mentions of people in text data.

Solution: Believe it or not, this breaks down into a word classification problem. See Chapter 6, String Comparison and Clustering.

It is generally fruitful to look at any novel problem as a classification problem, even if classifiers don't get used as the underlying technology. It can help clarify what the underlying technology actually needs to do.