We have talked enough about NLTK as a library and about some of the most-used functions it gives us. Many of the NLP problems that NLTK solves are highly parallelizable, which is why we will try to run NLTK on Hadoop.
The best way of running NLTK on Hadoop is to install it on every node of the cluster, which is not that difficult to achieve. An alternative is to ship the resource files as a streaming argument, but we will prefer the first option.
There are a variety of ways to make NLTK run on Hadoop. Let's walk through one example: using NLTK to tokenize text in parallel through a Hive UDF.
For this use case, we have to follow these steps:
We have chosen a small dataset with only two columns. We have to create the same schema in Hive:
ID      Content
UA0001  "I tried calling you. The service was not up to the mark"
UA0002  "Can you please update my phone no"
UA0003  "Really...