Book Image

Python 3 Text Processing with NLTK 3 Cookbook

By : Jacob Perkins
Book Image

Python 3 Text Processing with NLTK 3 Cookbook

By: Jacob Perkins

Overview of this book

Table of Contents (17 chapters)
Python 3 Text Processing with NLTK 3 Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Penn Treebank Part-of-speech Tags
Index

Setting up a custom corpus


A corpus is a collection of text documents, and corpora is the plural of corpus. This comes from the Latin word for body; in this case, a body of text. So a custom corpus is really just a bunch of text files in a directory, often alongside many other directories of text files.

Getting ready

You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C:\nltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, and Mac OS X.

How to do it...

NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpora must be within one of these paths so it can be found by NLTK. In order to avoid conflict with the official data package, we'll create a custom nltk_data directory in our home directory. The following is some Python code to create this directory and verify that it is in the list of known paths specified by nltk.data.path:

>>> import os,...