Natural Language Processing with Java and LingPipe Cookbook
Logistic regression allows for arbitrary features to be used. Features are any observations that can be made about data being classified. Some examples are as follows:
Words/tokens from the text.
We have found that character n-grams work very well in place of words or stemmed words. For small data sets with fewer than 10,000 words of training data, we use 2- to 4-grams. Larger training sets can merit longer n-grams, but we have never had good results above character 8-grams.
Output from another component, for example, a part-of-speech tagger.
Metadata known about the text, for example, the location of a tweet or time of the day it was created.
Recognition of dates and numbers abstracted from the actual value.
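To make the character n-gram idea from the list above concrete, here is a minimal sketch that counts every character n-gram in a string for a range of lengths. It uses only the JDK; the class name CharNGramFeatures and its extract method are hypothetical and are not taken from LingPipe or from the book's source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of LingPipe or the book's code.
class CharNGramFeatures {

    // Counts every character n-gram of length minN..maxN (inclusive) in the text.
    static Map<String, Integer> extract(String text, int minN, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minN; n <= maxN; n++) {
            // Slide a window of width n across the text.
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // 2- to 4-grams, the range suggested above for small training sets.
        System.out.println(extract("banana", 2, 4));
    }
}
```

Each distinct n-gram becomes one feature, with its count as the feature value. Widening the range increases the number of features quickly, which is one reason longer n-grams tend to need more training data.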
The source for this recipe is in src/com/lingpipe/cookbook/chapter3/ContainsNumberFeatureExtractor.java.
Feature extractors are straightforward to build. The following is a feature extractor that returns a CONTAINS_NUMBER feature with weight 1...
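As a rough illustration of what such an extractor can look like, here is a sketch that is not the code from the recipe's source file. It assumes that "contains a number" means the text has at least one digit, and it declares a local FeatureExtractor interface that mirrors the shape of LingPipe's com.aliasi.util.FeatureExtractor so that the example compiles without the LingPipe jar. The class name ContainsNumberSketch is hypothetical.

```java
import java.util.Collections;
import java.util.Map;

// Local stand-in mirroring LingPipe's com.aliasi.util.FeatureExtractor<E>;
// declared here only so the sketch compiles without the LingPipe jar.
interface FeatureExtractor<E> {
    Map<String, ? extends Number> features(E in);
}

// Hypothetical class name; the book's own source may differ.
class ContainsNumberSketch implements FeatureExtractor<CharSequence> {

    static final String FEATURE = "CONTAINS_NUMBER";

    @Override
    public Map<String, ? extends Number> features(CharSequence text) {
        // Assumption: a single digit anywhere in the text counts as a number.
        for (int i = 0; i < text.length(); i++) {
            if (Character.isDigit(text.charAt(i))) {
                // One feature with weight 1 when a digit is present.
                return Collections.singletonMap(FEATURE, 1);
            }
        }
        // No digit found, so no features are emitted.
        return Collections.emptyMap();
    }

    public static void main(String[] args) {
        FeatureExtractor<CharSequence> extractor = new ContainsNumberSketch();
        System.out.println(extractor.features("I have 3 cats"));
        System.out.println(extractor.features("I have cats"));
    }
}
```

Returning an empty map for text without digits keeps the feature vector sparse: an absent feature is treated as having value zero, so there is no need for an explicit negative feature.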