In the previous chapters, we focused on the analysis of structured data, mostly in tabular format. Plain text is another predominant form of data available today. Text analysis includes the analysis of word frequency distributions, pattern recognition, tagging, link and association analysis, sentiment analysis, and visualization. One of the main Python libraries for text analysis is the Natural Language Toolkit (NLTK), which comes with a collection of sample texts called corpora. The scikit-learn library also contains tools for text analysis, which we will cover briefly in this chapter, along with a small example of network analysis. The following topics will be discussed in this chapter:
Installing NLTK
About NLTK
Filtering out stopwords, names, and numbers
The bag-of-words model
Analyzing word frequencies
Naive Bayes classification
Sentiment analysis
Creating word clouds
Social network analysis
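To give a first impression of what filtering out stopwords and analyzing word frequencies involve, here is a minimal sketch that uses only the Python standard library rather than NLTK, so it runs without any installation. The sample sentence and the two-word stopword list are made up for illustration; the approach with real corpora may differ in detail.

```python
import re
from collections import Counter

# Hypothetical sample text and stopword list, chosen only for illustration.
text = "The cat sat on the mat. The dog sat on the log."
stopwords = {"the", "on"}

# Lowercase the text and split it into alphabetic tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Drop the stopwords, then count how often each remaining word occurs.
filtered = [t for t in tokens if t not in stopwords]
freq = Counter(filtered)

# Show the single most frequent word together with its count.
print(freq.most_common(1))
```

The same two steps, removing uninformative words and then counting what remains, underlie the word frequency analysis discussed in this chapter.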