Python 3 Text Processing with NLTK 3 Cookbook

By: Jacob Perkins

Extracting location chunks


To identify LOCATION chunks, we can make a different kind of ChunkParserI subclass that uses the gazetteers corpus to identify location words. The gazetteers corpus is a WordListCorpusReader instance that contains the following kinds of location words:

  • Country names

  • U.S. states and abbreviations

  • Major U.S. cities

  • Canadian provinces

  • Mexican states
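
Because the corpus is a plain word list, checking whether something is a location comes down to set membership. The following is a minimal sketch of that idea, not code from chunkers.py. It assumes a small hand-written sample in place of gazetteers.words(), and the names SAMPLE_LOCATIONS, is_location, and max_location_length are hypothetical. Some location names span several words, so a chunker that scans token by token may also want to know how many words the longest name contains:

```python
# Hypothetical stand-in for gazetteers.words(); the real corpus is far larger.
SAMPLE_LOCATIONS = {"Canada", "Texas", "San Francisco", "New York", "Ontario"}


def is_location(words, locations=SAMPLE_LOCATIONS):
    """Return True if the words, joined by single spaces, form a known location."""
    return " ".join(words) in locations


def max_location_length(locations=SAMPLE_LOCATIONS):
    """Return the number of words in the longest location name."""
    return max(len(loc.split()) for loc in locations)
```

With this sample, is_location(["San", "Francisco"]) is true while is_location(["San"]) is not, which is why a single-word lookup alone would miss multi-word names.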

How to do it...

The LocationChunker class, found in chunkers.py, iterates over a tagged sentence looking for words that are found in the gazetteers corpus. When it finds one or more location words, it creates a LOCATION chunk using IOB tags. The helper method iob_locations() is where the IOB LOCATION tags are produced, and the parse() method converts these IOB tags into a Tree:

from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree
from nltk.corpus import gazetteers

class LocationChunker(ChunkParserI):
  def __init__(self):
    self.locations = set(gazetteers.words())
    self.lookahead = 0

    for loc in self.locations...
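
The IOB tagging step described above can be illustrated independently of the class. The sketch below is an assumption-laden illustration, not the book's iob_locations() method: the function name iob_locations_sketch, its parameters, and the sample data are all hypothetical, and it uses only the standard library. It tries the longest candidate span first, marks the first word of a match as B-LOCATION, any following words as I-LOCATION, and everything else as O:

```python
def iob_locations_sketch(tagged_sent, locations, lookahead):
    """Yield (word, tag, iob) triples for a list of (word, tag) pairs.

    locations: set of known location names (possibly multi-word).
    lookahead: maximum number of words a location name may span.
    """
    i = 0
    n = len(tagged_sent)
    while i < n:
        match_len = 0
        # Try the longest span first so "San Francisco" wins over "San".
        for size in range(min(lookahead, n - i), 0, -1):
            candidate = " ".join(word for word, _ in tagged_sent[i:i + size])
            if candidate in locations:
                match_len = size
                break
        if match_len:
            for k in range(match_len):
                word, tag = tagged_sent[i + k]
                yield word, tag, "B-LOCATION" if k == 0 else "I-LOCATION"
            i += match_len
        else:
            word, tag = tagged_sent[i]
            yield word, tag, "O"
            i += 1


# Hypothetical sample data for illustration only.
sample_locations = {"San Francisco", "CA"}
sample_sent = [("San", "NNP"), ("Francisco", "NNP"), ("CA", "NNP"),
               ("is", "BE"), ("cold", "JJ")]
sample_iob = list(iob_locations_sketch(sample_sent, sample_locations, lookahead=2))
```

Each item in sample_iob is a (word, tag, IOB tag) triple, which is the general shape of input that conlltags2tree converts into a Tree.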