Natural Language Processing with Java and LingPipe Cookbook

Modifying CRFs

The power and appeal of CRFs come from rich feature extraction, so work with an evaluation harness that provides feedback on your explorations. This recipe details how to create more complex features.

How to do it...

We will not train and run a CRF; instead, we will print out the features. To see these features at work in a CRF, substitute this feature extractor for the one in the previous recipe. Perform the following steps:

Go to a command line and type:

java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar: com.lingpipe.cookbook.chapter4.ModifiedCrfFeatureExtractor

For each token in the training data, the feature extractor class prints the truth tagging that is being used for learning:
```
-------------------
Tagging:  John/PN
```
This reflects the training tagging for the token John as determined by src/com/lingpipe/cookbook/chapter4/TinyPosCorpus.java.
The node features come next. They consist of the top three POS tags from our Brown corpus HMM tagger and the TOK_John feature:
```
Node Feats:{nps=2.0251355582754984E-4,...
```
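To make the shape of that feature map concrete, here is a minimal plain-Java sketch of how node features like these could be assembled: keep the top three tags by confidence and add a `TOK_` indicator for the token itself. It does not use the LingPipe API. The class name `NodeFeatureSketch`, the method `nodeFeatures`, the confidence values, and the `1.0` value of the `TOK_` feature are all assumptions made for illustration. In the recipe, the tag confidences come from the Brown corpus HMM tagger.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of LingPipe or the cookbook source.
class NodeFeatureSketch {

    // Builds node features for one token: the topN tags ranked by confidence,
    // followed by a TOK_<token> indicator feature (value 1.0 is an assumption).
    static Map<String, Double> nodeFeatures(String token,
                                            Map<String, Double> tagConfidences,
                                            int topN) {
        List<Map.Entry<String, Double>> ranked =
                new ArrayList<>(tagConfidences.entrySet());
        // Sort by confidence, highest first.
        ranked.sort(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()));

        Map<String, Double> feats = new LinkedHashMap<>();
        int limit = Math.min(topN, ranked.size());
        for (Map.Entry<String, Double> e : ranked.subList(0, limit)) {
            feats.put(e.getKey(), e.getValue());
        }
        feats.put("TOK_" + token, 1.0);
        return feats;
    }

    public static void main(String[] args) {
        // Made-up confidences standing in for an HMM tagger's output.
        Map<String, Double> conf = new LinkedHashMap<>();
        conf.put("nps", 0.0002);
        conf.put("np", 0.9);
        conf.put("nn", 0.05);
        conf.put("vb", 0.0001);
        System.out.println("Node Feats:" + nodeFeatures("John", conf, 3));
    }
}
```

Using tagger confidences as feature values, not just tag names, lets the CRF weigh a strong POS guess more heavily than a weak one. The token indicator lets it learn word-specific behavior.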