The previous chapter gave you a head start with Python and NLTK. We learned how to start some meaningful EDA with any corpus of text, but we did the pre-processing in a very crude and simple manner. In this chapter, we will go over pre-processing steps such as tokenization, stemming, lemmatization, and stop word removal in more detail. We will explore the tools NLTK provides for text wrangling, and we will talk about the pre-processing steps used in modern NLP applications, the different ways to achieve some of these tasks, and the general dos and don'ts. The idea is to give you enough information about these tools so that you can decide which kind of pre-processing tool you need for your application. By the end of this chapter, you should know:
All about data wrangling and how to perform it using NLTK
The importance of text cleansing and the common pre-processing tasks that can be achieved using NLTK
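As a taste of what is coming, here is a minimal sketch of two of the steps this chapter covers, tokenization and stemming, plus stop word removal. To keep the snippet runnable without downloading any NLTK corpora, it uses a regex-based tokenizer rather than `word_tokenize` and a small hand-picked stop word set instead of NLTK's full `stopwords` list; the sample sentence is made up for illustration. Lemmatization with `WordNetLemmatizer` works similarly but requires the WordNet corpus to be downloaded first.

```python
# A rough preview of the pre-processing pipeline, assuming NLTK is installed.
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer

text = "The cats are running quickly through the gardens"

# Tokenization: split the raw string into word tokens.
tokenizer = RegexpTokenizer(r"\w+")
tokens = tokenizer.tokenize(text.lower())

# Stop word removal: drop high-frequency function words.
# This is a tiny illustrative subset, not NLTK's full stop word list.
stop_words = {"the", "are", "is", "a", "an", "through"}
filtered = [t for t in tokens if t not in stop_words]

# Stemming: crude rule-based suffix stripping with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]
print(stems)
```

Notice that stemming is a blunt instrument: it maps `running` to `run`, but it can also produce non-words, which is exactly the trade-off against lemmatization that we will discuss later in the chapter.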