I start my discussion with the Gartner's new hype cycle and you can clearly see NLP on top of the cycle. Currently, NLP is one of the rarest skill sets that is required in the industry. After the advent of big data, the major challenge is that we need more people who are good with not just structured, but also with semi or unstructured data. We are generating petabytes of Weblogs, tweets, Facebook feeds, chats, e-mails, and reviews. Companies are collecting all these different kind of data for better customer targeting and meaningful insights. To process all these unstructured data source we need people who understand NLP.
We are in the age of information; we can't even imagine our life without Google. We use Siri for the most of basic stuff. We use spam filters for filtering spam emails. We need spell checker on our Word document. There are many examples of real world NLP applications around us.
Let me also give you some examples of the amazing NLP applications that you can use, but are not aware that they are built on NLP:
Spell correction (MS Word/ any other editor)
Search engines (Google, Bing, Yahoo, wolframalpha)
Speech engines (Siri, Google Voice)
Spam classifiers (All e-mail services)
News feeds (Google, Yahoo!, and so on)
Machine translation (Google Translate, and so on)
IBM Watson
Building these applications requires a very specific skill set with a great understanding of language and tools to process the language efficiently. So it's not just hype that makes NLP one of the most niche areas, but it's the kind of application that can be created using NLP that makes it one of the most unique skills to have.
To achieve some of the above applications and other basic NLP preprocessing, there are many open source tools available. Some of them are developed by organizations to build their own NLP applications, while some of them are open-sourced. Here is a small list of available NLP tools:
GATE
Mallet
Open NLP
UIMA
Stanford toolkit
Genism
Natural Language Tool Kit (NLTK)
Most of the tools are written in Java and have similar functionalities. Some of them are robust and have a different variety of NLP tools available. However, when it comes to the ease of use and explanation of the concepts, NLTK scores really high. NLTK is also a very good learning kit because the learning curve of Python (on which NLTK is written) is very fast. NLTK has incorporated most of the NLP tasks, it's very elegant and easy to work with. For all these reasons, NLTK has become one of the most popular libraries in the NLP community:
I am assuming all you guys know Python. If not, I urge you to learn Python. There are many basic tutorials on Python available online. There are lots of books also available that give you a quick overview of the language. We will also look into some of the features of Python, while going through the different topics. But for now, even if you only know the basics of Python, such as lists, strings, regular expressions, and basic I/O, you should be good to go.
I would recommend using Anaconda or Canopy Python distributions. The reason being that these distributions come with bundled libraries, such as scipy
, numpy
, scikit
, and so on, which are used for data analysis and other applications related to NLP and related fields. Even NLTK is part of this distribution.
Let's test everything.
Open the terminal on your respective operating systems. Then run:
$ python
This should open the Python interpreter:
Python 2.6.6 (r266:84292, Oct 15 2013, 07:32:41) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>>
I hope you got a similar looking output here. There is a chance that you will have received a different looking output, but ideally you will get the latest version of Python (I recommend that to be 2.7), the compiler GCC, and the operating system details. I know the latest version of Python will be in 3.0+ range, but as with any other open source systems, we should tries to hold back to a more stable version as opposed to jumping on to the latest version. If you have moved to Python 3.0+, please have a look at the link below to gain an understanding about what new features have been added:
https://docs.python.org/3/whatsnew/3.4.html.
UNIX based systems will have Python as a default program. Windows users can set the path to get Python working. Let's check whether we have installed NLTK correctly:
>>>import nltk >>>print "Python and NLTK installed successfully" Python and NLTK installed successfully
Hey, we are good to go!