Python Text Processing with NLTK 2.0 Cookbook

By: Jacob Perkins

Overview of this book

Natural Language Processing is used everywhere: in search engines, spell checkers, mobile phones, computer games, and even your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. You want to employ nothing less than the best techniques in Natural Language Processing, and this book is your answer.

Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which will walk you through all the Natural Language Processing techniques in a step-by-step manner. It will demystify the advanced features of text analysis and text mining using the comprehensive NLTK suite.

This book cuts the preamble short so that you dive right into the science of text processing with a practical, hands-on approach.

Get started by learning how to tokenize text. Get an overview of WordNet and how to use it. Learn the basics as well as the advanced features of stemming and lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to create custom corpus readers for JSON files as well as for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to easily handle huge amounts of data without any loss in efficiency or speed.

This book will teach you all that and beyond, in a hands-on, learn-by-doing manner. Make yourself an expert in using the NLTK for Natural Language Processing with this handy companion.

Appendix A. Penn Treebank Part-of-Speech Tags

The following table lists all the part-of-speech tags that occur in the treebank corpus distributed with NLTK. The tags and counts shown here were obtained using the following code:

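>>> from nltk.probability import FreqDist
>>> from nltk.corpus import treebank
>>> # Start with an empty frequency distribution.
>>> fd = FreqDist()
>>> # Count each tag once per (word, tag) pair; inc() is the NLTK 2.0 API.
>>> for word, tag in treebank.tagged_words():
...     fd.inc(tag)
>>> # List the (tag, count) pairs.
>>> fd.items()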

The FreqDist fd contains the count for every tag in the treebank corpus, as listed in the table that follows. You can inspect an individual tag count with fd[tag], as in fd['DT']. Punctuation tags are also shown, along with special tags such as -NONE-, which the Penn Treebank uses for empty elements such as traces, that is, tokens that have no corresponding surface word. Descriptions of most of the tags can be found at http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
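The fd.inc() call belongs to the NLTK 2.0 API. In NLTK 3, FreqDist is a subclass of collections.Counter and inc() is no longer available, so the same tally would likely be written as FreqDist(tag for word, tag in treebank.tagged_words()). The following is a minimal sketch of that counting pattern using only the standard library. The sample_tagged_words list and the count_tags helper are hypothetical stand-ins for illustration, not part of NLTK or the book's code.

```python
from collections import Counter

# Hypothetical stand-in for treebank.tagged_words(): a few (word, tag) pairs.
sample_tagged_words = [
    ("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
    ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", "."),
]


def count_tags(tagged_words):
    """Count how often each tag occurs in an iterable of (word, tag) pairs."""
    return Counter(tag for _word, tag in tagged_words)


tag_counts = count_tags(sample_tagged_words)

# Look up a single tag, in the same way as fd['DT'].
print(tag_counts["DT"])

# Tags that never occur report a count of zero rather than raising KeyError.
print(tag_counts["UH"])

# List the most frequent tags first.
print(tag_counts.most_common(2))
```

With NLTK 3 and the treebank corpus installed, passing treebank.tagged_words() to count_tags in place of the sample list should give counts comparable to those in the table.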

Part-of-speech tag    Frequency of occurrence
#                     16
$                     724
''                    694
,                     4,886
-LRB-                 120
-NONE-                6,592
-RRB-                 126
.                     384
:                     563
``                    712
CC                    2,265
CD                    3,546
DT                    8,165
EX                    88
FW                    4
IN                    9,857
JJ                    5,834
JJR                   381
JJS                   182
LS                    13
MD                    927
NN                    13,166
NNP                   9,410
NNPS                  244
NNS                   6,047
PDT                   27
POS                   824
PRP                   1,716
PRP$                  766
RB                    2,822
RBR                   136
RBS                   35
RP                    216
SYM                   1
TO                    2,179
UH                    3
VB                    2,554
VBD                   3,043
VBG                   1,460
VBN                   2,134
VBP                   1,321
VBZ                   2,125
WDT                   445
WP                    241
WP$                   14