Python 3 Text Processing with NLTK 3 Cookbook - Second Edition

By: Jacob Perkins

Overview of this book

This book is intended for Python programmers interested in learning how to do natural language processing. Maybe you’ve learned the limits of regular expressions the hard way, or you’ve realized that human language cannot be deterministically parsed like a computer language. Perhaps you have more text than you know what to do with, and need automated ways to analyze and structure that text. This Cookbook will show you how to train and use statistical language models to process text in ways that are practically impossible with standard programming tools. Basic knowledge of Python and of basic text processing concepts is expected. Some experience with regular expressions will also be helpful.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

1. Tokenizing Text and WordNet Basics

Introduction

Tokenizing text into sentences

Tokenizing sentences into words

Tokenizing sentences using regular expressions

Training a sentence tokenizer

Filtering stopwords in a tokenized sentence

Looking up Synsets for a word in WordNet

Looking up lemmas and synonyms in WordNet

Calculating WordNet Synset similarity

Discovering word collocations

2. Replacing and Correcting Words

Introduction

Stemming words

Lemmatizing words with WordNet

Replacing words matching regular expressions

Removing repeating characters

Spelling correction with Enchant

Replacing synonyms

Replacing negations with antonyms

3. Creating Custom Corpora

Introduction

Setting up a custom corpus

Creating a wordlist corpus

Creating a part-of-speech tagged word corpus

Creating a chunked phrase corpus

Creating a categorized text corpus

Creating a categorized chunk corpus reader

Lazy corpus loading

Creating a custom corpus view

Creating a MongoDB-backed corpus reader

Corpus editing with file locking

4. Part-of-speech Tagging

Introduction

Default tagging

Training a unigram part-of-speech tagger

Combining taggers with backoff tagging

Training and combining ngram taggers

Creating a model of likely word tags

Tagging with regular expressions

Affix tagging

Training a Brill tagger

Training the TnT tagger

Using WordNet for tagging

Tagging proper names

Classifier-based tagging

Training a tagger with NLTK-Trainer

5. Extracting Chunks

Introduction

Chunking and chinking with regular expressions

Merging and splitting chunks with regular expressions

Expanding and removing chunks with regular expressions

Partial parsing with regular expressions

Training a tagger-based chunker

Classification-based chunking

Extracting named entities

Extracting proper noun chunks

Extracting location chunks

Training a named entity chunker

Training a chunker with NLTK-Trainer

6. Transforming Chunks and Trees

Introduction

Filtering insignificant words from a sentence

Correcting verb forms

Swapping verb phrases

Swapping noun cardinals

Swapping infinitive phrases

Singularizing plural nouns

Chaining chunk transformations

Converting a chunk tree to text

Flattening a deep tree

Creating a shallow tree

Converting tree labels

7. Text Classification

Introduction

Bag of words feature extraction

Training a Naive Bayes classifier

Training a decision tree classifier

Training a maximum entropy classifier

Training scikit-learn classifiers

Measuring precision and recall of a classifier

Calculating high information words

Combining classifiers with voting

Classifying with multiple binary classifiers

Training a classifier with NLTK-Trainer

8. Distributed Processing and Handling Large Datasets

Introduction

Distributed tagging with execnet

Distributed chunking with execnet

Parallel list processing with execnet

Storing a frequency distribution in Redis

Storing a conditional frequency distribution in Redis

Storing an ordered dictionary in Redis

Distributed word scoring with Redis and execnet

9. Parsing Specific Data Types

Introduction

Parsing dates and times with dateutil

Timezone lookup and conversion

Extracting URLs from HTML with lxml

Cleaning and stripping HTML

Converting HTML entities with BeautifulSoup

Detecting and converting character encodings

A. Penn Treebank Part-of-speech Tags

Index

Appendix A. Penn Treebank Part-of-speech Tags

The following is a table of all the part-of-speech tags that occur in the treebank corpus distributed with NLTK. The tags and counts shown here were acquired using the following code:

>>> from nltk.probability import FreqDist
>>> from nltk.corpus import treebank
>>> fd = FreqDist()
>>> for word, tag in treebank.tagged_words():
...     fd[tag] += 1
...
>>> fd.items()

The FreqDist fd contains the counts shown here for every tag in the treebank corpus. You can inspect an individual tag's count with fd[tag], for example, fd['DT']. Punctuation tags are also shown, along with special tags such as -NONE-, which signifies that the part-of-speech tag is unknown. Descriptions of most of the tags can be found at the following link:

http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
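The counting pattern described above can be tried without downloading the treebank corpus. The following is a minimal sketch, not the book's code: it assumes a small hand-made list of (word, tag) pairs (the name tagged_words and its contents are hypothetical) and uses collections.Counter in place of FreqDist, which in NLTK 3 is a Counter subclass and so supports the same fd[tag] += 1 and most_common() calls.

```python
from collections import Counter

# Hypothetical hand-made sample standing in for treebank.tagged_words();
# the real corpus yields (word, tag) pairs in the same shape.
tagged_words = [
    ("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", "."),
]

# Counter is used so the sketch runs without NLTK data installed.
# With NLTK 3 available, FreqDist() can be substituted here.
fd = Counter()
for word, tag in tagged_words:
    fd[tag] += 1

# Look up a single tag's count, as with fd['DT'] in the text.
dt_count = fd["DT"]

# Build tab-separated rows, ordered from most to least frequent tag.
rows = ["{}\t{}".format(tag, count) for tag, count in fd.most_common()]
print("\n".join(rows))
```

Sorting with most_common() rather than calling items() makes it easier to compare the output against a frequency table, since items() returns the tags in no particular frequency order.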

Part-of-speech tag	Frequency of occurrence
`#`	`16`
`$`	`724`
`'&apos...`
