Learning Data Mining with Python

We will now create a pipeline that takes a tweet and determines whether it is relevant or not, based only on the content of that tweet.

To perform the word extraction, we will use NLTK, a library that contains a large number of tools for analyzing natural language. We will use NLTK in later chapters as well.


To get NLTK on your computer, use pip to install the package: pip3 install nltk

If that doesn't work, see the NLTK installation instructions at

We are going to create a pipeline to extract the word features and classify the tweets using Naive Bayes. Our pipeline has the following steps:

  1. Transform the original text documents into a dictionary of counts, using NLTK's word_tokenize function to split each document into words.

  2. Transform those dictionaries into a feature matrix using the DictVectorizer transformer in scikit-learn. This is necessary because the Naive Bayes classifier expects a numeric matrix as input and cannot read the dictionaries produced in the first step directly.

  3. Train the Naive Bayes classifier...
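The steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the exact code used later in the chapter: the WordCounts class name, the toy tweets, and their labels are invented for the example, and MultinomialNB is chosen only because it suits count features (the text does not say which Naive Bayes variant is used). The sketch also falls back to a plain whitespace split if NLTK or its tokenizer data is unavailable, which is a convenience for the example only.

```python
from collections import Counter

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

try:
    from nltk.tokenize import word_tokenize
    word_tokenize("probe")  # raises LookupError if the tokenizer data is missing
except (ImportError, LookupError):
    # Assumption for this sketch only: crude whitespace tokenizer as a fallback
    word_tokenize = str.split


class WordCounts(TransformerMixin, BaseEstimator):
    """Hypothetical transformer for step 1: documents -> dictionaries of word counts."""

    def fit(self, X, y=None):
        # Nothing to learn; counting words needs no training
        return self

    def transform(self, X):
        # One {word: count} dictionary per document
        return [dict(Counter(word_tokenize(doc.lower()))) for doc in X]


pipeline = Pipeline([
    ("word_counts", WordCounts()),      # step 1: text -> dictionaries of counts
    ("vectorizer", DictVectorizer()),   # step 2: dictionaries -> feature matrix
    ("naive_bayes", MultinomialNB()),   # step 3: classifier (assumed variant)
])

# Toy data invented for illustration: 1 = relevant, 0 = not relevant
tweets = [
    "learning python programming with scikit-learn",
    "a python snake escaped from the zoo",
    "new python library for data mining",
    "my pet python ate a mouse",
]
labels = [1, 0, 1, 0]

pipeline.fit(tweets, labels)
predictions = pipeline.predict(["python library for data mining"])
```

Wrapping the three steps in a single Pipeline object means the same transformations are applied at training time and at prediction time, so raw tweet text can be passed straight to fit and predict.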