Python 3 Text Processing with NLTK 3 Cookbook

By: Jacob Perkins

Training a chunker with NLTK-Trainer

At the end of the previous chapter, Chapter 4, Part-of-speech Tagging, we introduced NLTK-Trainer and the train_tagger.py script. In this recipe, we will cover the script for training chunkers: train_chunker.py.

Note

You can find NLTK-Trainer at https://github.com/japerk/nltk-trainer and the online documentation at http://nltk-trainer.readthedocs.org/.

How to do it...

As with train_tagger.py, the only required argument to train_chunker.py is the name of a corpus. In this case, we need a corpus that provides a chunked_sents() method, such as treebank_chunk. Here's an example of running train_chunker.py on treebank_chunk:

$ python train_chunker.py treebank_chunk
loading treebank_chunk
4009 chunks, training on 4009
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy:   97.0%
    Precision:      90.8%
    Recall:         93.9%
    F-Measure:      92.3%
dumping TagChunker to /Users/jacob/nltk_data/chunkers/treebank_chunk_ub.pickle
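
The scores in this output can be cross-checked by hand. The following is a minimal sketch, assuming the reported F-Measure is the balanced harmonic mean of precision and recall; the f_measure helper is a hypothetical name used for illustration and is not part of NLTK-Trainer.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall.

    Hypothetical helper for illustration; it assumes equal weighting
    of precision and recall.
    """
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


# Values copied from the ChunkParse score printed by train_chunker.py above.
precision = 0.908
recall = 0.939

score = f_measure(precision, recall)
print(f"F-Measure: {score:.1%}")  # F-Measure: 92.3%
```

The result agrees with the 92.3% shown in the script output, which suggests the three figures are consistent with one another under that assumption.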

Just...