Python 3 Text Processing with NLTK 3 Cookbook

By: Jacob Perkins

Extracting named entities


Named entity recognition is a specific kind of chunk extraction that uses entity tags instead of, or in addition to, chunk tags. Common entity tags include PERSON, ORGANIZATION, and LOCATION. Part-of-speech tagged sentences are parsed into chunk trees as with normal chunking, but the labels of the trees can be entity tags instead of chunk phrase tags.
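To make the tree structure concrete, here is a minimal sketch that builds an entity-labeled chunk tree by hand and reads the entities back out. The sentence, its tags, and the helper name labeled_chunks are made up for illustration; only Tree and its subtrees(), label(), and leaves() methods come from NLTK.

```python
from nltk.tree import Tree

# A hand-built chunk tree for a made-up sentence. The subtree labels are
# entity tags (PERSON, ORGANIZATION, LOCATION) rather than phrase tags like NP.
tree = Tree("S", [
    Tree("PERSON", [("Alice", "NNP"), ("Smith", "NNP")]),
    ("joined", "VBD"),
    Tree("ORGANIZATION", [("Acme", "NNP")]),
    ("in", "IN"),
    Tree("LOCATION", [("Paris", "NNP")]),
    (".", "."),
])


def labeled_chunks(tree, labels):
    """Collect (label, text) pairs for every subtree whose label is in `labels`."""
    return [
        (sub.label(), " ".join(word for word, tag in sub.leaves()))
        for sub in tree.subtrees(filter=lambda t: t.label() in labels)
    ]


entities = labeled_chunks(tree, {"PERSON", "ORGANIZATION", "LOCATION"})
print(entities)
```

Because the entity chunks are ordinary subtrees, the same filtering approach works on any chunk tree, whatever produced it.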

How to do it...

NLTK comes with a pre-trained named entity chunker. This chunker has been trained on data from the ACE program, a National Institute of Standards and Technology (NIST) sponsored program for Automatic Content Extraction, which you can read more about at http://www.itl.nist.gov/iad/894.01/tests/ace/. Unfortunately, this data is not included in the NLTK corpora, but the trained chunker is. This chunker can be used through the ne_chunk() function in the nltk.chunk module, which chunks a single part-of-speech tagged sentence into a Tree. The following is an example using ne_chunk() on the first tagged sentence...
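As a rough, standalone sketch of the call itself, the snippet below runs ne_chunk() on a hand-tagged sentence instead of a corpus sentence, so it needs no tagger or corpus reader. The sentence and the wrapper name chunk_entities are assumptions for illustration. The pre-trained model is a separate download, and when it is missing NLTK raises a LookupError whose message names the resource to fetch with nltk.download().

```python
from nltk.chunk import ne_chunk
from nltk.tree import Tree

# A hand-tagged sentence using Penn Treebank tags (made up for illustration).
tagged = [
    ("Alice", "NNP"), ("Smith", "NNP"), ("flew", "VBD"),
    ("to", "TO"), ("Paris", "NNP"), (".", "."),
]


def chunk_entities(tagged_sentence):
    """Return the entity chunk tree, or None if the model data is not installed."""
    try:
        return ne_chunk(tagged_sentence)
    except LookupError as err:
        # The error message lists the exact resource name to download.
        print(err)
        return None


tree = chunk_entities(tagged)
if tree is not None:
    # Entity chunks appear as labeled subtrees under the root S node.
    print(tree)
```

Which words the model groups, and under which labels, depends on the trained chunker, so no particular output is assumed here.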