Book Image

Learning Data Mining with Python

Book Image

Learning Data Mining with Python

Overview of this book

Table of Contents (20 chapters)
Learning Data Mining with Python
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Character n-grams


We saw how function words can be used as features to predict the author of a document. Another feature type is character n-grams. An n-gram is a sequence of n objects, where n is a value (for text, generally between 2 and 6). Word n-grams have been used in many studies, usually relating to the topic of the documents. However, character n-grams have proven to be of high quality for authorship attribution.

Character n-grams are found in text documents by representing the document as a sequence of characters. These n-grams are then extracted from this sequence and a model is trained. There are a number of different models for this, but a standard one is very similar to the bag-of-words model we have used earlier.

For each distinct n-gram in the training corpus, we create a feature for it. An example of an n-gram is <e t>, which is the letter e, a space, and then the letter t (the angle brackets are used to denote the start and end of the n-gram and aren't part of it)....