Book Image

Mastering Clojure Data Analysis

By : Eric Richard Rochester
Book Image

Mastering Clojure Data Analysis

By: Eric Richard Rochester

Overview of this book

Table of Contents (17 chapters)
Mastering Clojure Data Analysis
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Preparing the data


For this experiment, I've randomly selected 500 hotel reviews and classified them manually. A better option might be to use Amazon's Mechanical Turk (https://www.mturk.com/mturk/) to get more reviews classified than any one person might be able to do easily. Really, a few hundred is about the minimum that we'd like to use as both the training and test sets need to come from this. I made sure that the sample contained an equal number of positive and negative reviews. (You can find the sample in the data directory of the code download.)

The data files are tab-separated values (TSV). After being manually classified, each line had four fields: the classification as a + or - sign, the date of the review, the title of the review, and the review itself. Some of the reviews are quite long.

After tagging the files, we'll take those files and create feature vectors from the vocabulary of the title and create a review for each one. For this chapter, we'll see what works best: unigrams...