We can also analyze performance at the word (lexical) level.
Consider the following NLTK code, in which movie reviews are labeled as either positive or negative. It constructs a feature extractor that checks whether each word from a fixed vocabulary is present in a given document:
>>> import random
>>> import nltk
>>> from nltk.corpus import movie_reviews
>>> # Pair each review's word list with its category label
>>> docs = [(list(movie_reviews.words(fileid)), category)
...         for category in movie_reviews.categories()
...         for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(docs)
>>> all_wrds = nltk.FreqDist(w.lower() for w in movie_reviews.words())
>>> # Use the first 2000 distinct words as the feature vocabulary
>>> word_features = list(all_wrds)[:2000]
>>> def doc_features(doc):
...     doc_words = set(doc)  # set membership tests are faster than list lookups
...     features = {}
...     for word in word_features:
...         features['contains({})'.format(word)] = (word in doc_words)
...     return features
...
>>> print(doc_features(movie_reviews.words('pos/cv957_8737.txt')))
{'contains(waste)': False...
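The same word-presence idea can be tried without downloading the corpus. The following is a minimal sketch under stated assumptions, not the listing's own method: `toy_docs`, `N_FEATURES`, `vocabulary`, and `presence_features` are hypothetical names, and the standard library's `collections.Counter` stands in for `nltk.FreqDist`. It also selects the vocabulary with `most_common()`, because depending on the NLTK version, iterating over a `FreqDist` may not yield words in frequency order.

```python
from collections import Counter

# Hypothetical toy corpus standing in for movie_reviews: (tokens, label) pairs.
toy_docs = [
    ("a waste of time and a waste of money".split(), "neg"),
    ("a great film with a great cast".split(), "pos"),
    ("not a great use of time".split(), "neg"),
]

# Count every lowercased token across the whole corpus.
word_counts = Counter(w.lower() for tokens, _ in toy_docs for w in tokens)

# Keep the N most frequent words as the feature vocabulary.
# most_common() sorts by count, so the selection does not rely on insertion order.
N_FEATURES = 5
vocabulary = [word for word, _ in word_counts.most_common(N_FEATURES)]


def presence_features(tokens, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs in tokens."""
    token_set = set(tokens)  # build the set once per document
    return {"contains({})".format(word): (word in token_set) for word in vocabulary}


# Build (features, label) pairs, the shape NLTK classifiers expect for training data.
feature_sets = [(presence_features(tokens, vocabulary), label)
                for tokens, label in toy_docs]

for features, label in feature_sets:
    print(label, features)
```

Each document becomes a dictionary with one boolean entry per vocabulary word, so every document has the same feature keys regardless of its length. With the real corpus, `N_FEATURES` would be set to 2000 to match the listing above.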