Python Natural Language Processing Cookbook

By: Zhenya Antić

Overview of this book

Python is the most widely used language for natural language processing (NLP) thanks to its extensive tools and libraries for analyzing text and extracting computer-usable data. This book will take you through a range of techniques for text processing, from basics such as parsing the parts of speech to complex topics such as topic modeling, text classification, and visualization. Starting with an overview of NLP, the book presents recipes for dividing text into sentences, stemming and lemmatization, removing stopwords, and parts of speech tagging to help you to prepare your data. You’ll then learn ways of extracting and representing grammatical information, such as dependency parsing and anaphora resolution, discover different ways of representing the semantics using bag-of-words, TF-IDF, word embeddings, and BERT, and develop skills for text classification using keywords, SVMs, LSTMs, and other techniques. As you advance, you’ll also see how to extract information from text, implement unsupervised and supervised techniques for topic modeling, and perform topic modeling of short texts, such as tweets. Additionally, the book shows you how to develop chatbots using NLTK and Rasa and visualize text data. By the end of this NLP book, you’ll have developed the skills to use a powerful set of tools for text processing.

Removing stopwords

When we work with words, especially when we are interested in their semantics, we sometimes need to exclude very frequent words that add no substantial meaning to a sentence, words such as but, can, we, and so on. This recipe shows how to do that.

Getting ready…

For this recipe, we will need a list of stopwords. We provide one in the book's GitHub repository. You might find that your project requires customizing this list by adding or removing words as necessary.

You can also use the stopwords list provided with the nltk package.
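
If you prefer NLTK's built-in list, the following is a minimal sketch of one way to load it, assuming the stopwords corpus has been downloaded (the variable names here are illustrative):

    import nltk
    from nltk.corpus import stopwords

    # Download the stopwords corpus once; this is a no-op if it is already present
    nltk.download("stopwords")

    # Load NLTK's English stopword list
    nltk_stopwords = stopwords.words("english")
    print(nltk_stopwords[:10])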

We will be using the Sherlock Holmes text referred to earlier. For this recipe, we will need just the beginning of the book, which can be found in the sherlock_holmes_1.txt file.

How to do it…

In this recipe, we will read in the text file and the file with stopwords, tokenize the text, and remove the stopwords from the resulting word list (a consolidated sketch follows the steps below):

  1. Import the csv and nltk modules:
    import csv
    import nltk
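
As a minimal end-to-end sketch of the flow described above, assuming a stopwords CSV with one word per row (the stopwords filename here is hypothetical, not necessarily the file from the book's repository):

    import csv
    import nltk

    # Tokenizer models used by word_tokenize; newer NLTK releases may also need "punkt_tab"
    nltk.download("punkt")

    # Read the beginning of the Sherlock Holmes text
    with open("sherlock_holmes_1.txt", encoding="utf-8") as f:
        text = f.read()

    # Read the stopwords; we assume a CSV file with one stopword per row
    with open("stopwords.csv", encoding="utf-8") as f:  # hypothetical filename
        stop_set = {row[0] for row in csv.reader(f) if row}

    # Tokenize the text and drop the stopwords (case-insensitively)
    words = nltk.tokenize.word_tokenize(text)
    filtered = [w for w in words if w.lower() not in stop_set]
    print(filtered[:20])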