Book Image

NLTK Essentials

By : Nitin Hardeniya
Book Image

NLTK Essentials

By: Nitin Hardeniya

Overview of this book

<p>Natural Language Processing (NLP) is the field of artificial intelligence and computational linguistics that deals with the interactions between computers and human languages. With the instances of human-computer interaction increasing, it’s becoming imperative for computers to comprehend all major natural languages. Natural Language Toolkit (NLTK) is one such powerful and robust tool.</p> <p>You start with an introduction to get the gist of how to build systems around NLP. We then move on to explore data science-related tasks, following which you will learn how to create a customized tokenizer and parser from scratch. Throughout, we delve into the essential concepts of NLP while gaining practical insights into various open source tools and libraries available in Python for NLP. You will then learn how to analyze social media sites to discover trending topics and perform sentiment analysis. Finally, you will see tools which will help you deal with large scale text.</p> <p>By the end of this book, you will be confident about NLP and data science concepts and know how to apply them in your day-to-day work.</p>
Table of Contents (17 chapters)
NLTK Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Let's start playing with Python!


We'll not be diving too deep into Python; however, we'll give you a quick tour of Python essentials. Still, I think for the benefit of the audience, we should have a quick five minute tour. We'll talk about the basics of data structures, some frequently used functions, and the general construct of Python in the next few sections.

Note

I highly recommend the two hour Google Python class. https://developers.google.com/edu/python should be good enough to start. Please go through the Python website https://www.python.org/ for more tutorials and other resources.

Lists

Lists are one of the most commonly used data structures in Python. They are pretty much comparable to arrays in other programming languages. Let's start with some of the most important functions that a Python list provide.

Try the following in the Python console:

>>> lst=[1,2,3,4]
>>> # mostly like arrays in typical languages
>>>print lst
[1, 2, 3, 4]

Python lists can be accessed using much more flexible indexing. Here are some examples:

>>>print 'First element' +lst[0]

You will get an error message like this:

TypeError: cannot concatenate 'str' and 'int' objects

The reason being that Python is an interpreted language, and checks for the type of the variables at the time it evaluates the expression. We need not initialize and declare the type of variable at the time of declaration. Our list has integer object and cannot be concatenated as a print function. It will only accept a string object. For this reason, we need to convert list elements to string. The process is also known as type casting.

>>>print 'First element :' +str(lst[0])
>>>print 'last element :' +str(lst[-1])
>>>print 'first three elements :' +str(lst[0:2])
>>>print 'last three elements :'+str(lst[-3:])
First element :1
last element :4
first three elements :[1, 2,3]
last three elements :[2, 3, 4]

Helping yourself

The best way to learn more about different data types and functions is to use help functions like help() and dir(lst).

The dir(python object) command is used to list all the given attributes of the given Python object. Like if you pass a list object, it will list all the cool things you can do with lists:

>>>dir(lst)
>>>' , '.join(dir(lst))
'__add__ , __class__ , __contains__ , __delattr__ , __delitem__ , __delslice__ , __doc__ , __eq__ , __format__ , __ge__ , __getattribute__ , __getitem__ , __getslice__ , __gt__ , __hash__ , __iadd__ , __imul__ , __init__ , __iter__ , __le__ , __len__ , __lt__ , __mul__ , __ne__ , __new__ , __reduce__ , __reduce_ex__ , __repr__ , __reversed__ , __rmul__ , __setattr__ , __setitem__ , __setslice__ , __sizeof__ , __str__ , __subclasshook__ , append , count , extend , index , insert , pop , remove , reverse , sort'

With the help(python object) command, we can get detailed documentation for the given Python object, and also give a few examples of how to use the Python object:

>>>help(lst.index)
Help on built-in function index:
index(...)
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
This function raises a ValueError if the value is not present.

So help and dir can be used on any Python data type, and are a very nice way to learn about the function and other details of that object. It also provides you with some basic examples to work with, which I found useful in most cases.

Strings in Python are very similar to other languages, but the manipulation of strings is one of the main features of Python. It's immensely easy to work with strings in Python. Even something very simple, like splitting a string, takes effort in Java / C, while you will see how easy it is in Python.

Using the help function that we used previously, you can get help for any Python object and any function. Let's have some more examples with the other most commonly used data type strings:

  • Split: This is a method to split the string based on some delimiters. If no argument is provided it assumes whitespace as delimiter.

    >>> mystring="Monty Python !  And the holy Grail ! \n"
    >>> print mystring.split()
    ['Monty', 'Python', '!', 'and', 'the', 'holy', 'Grail', '!']
    
  • Strip: This is a method that can remove trailing whitespace, like '\n', '\n\r' from the string:

    >>> print mystring.strip()
    >>>Monty Python !  and the holy Grail !
    

    If you notice the '\n' character is stripped off. There are also methods like rstrip() and lstrip() to strip trailing whitespaces to the right and left of the string.

  • Upper/Lower: We can change the case of the string using these methods:

    >>> print (mystring.upper()
    >>>MONTY PYTHON !AND THE HOLY GRAIL !
    
  • Replace: This will help you substitute a substring from the string:

    >>> print mystring.replace('!','''''')
    >>> Monty Python   and the holy Grail
    

There are tons of string functions. I have just talked about some of the most frequently used.

Note

Please look the following link for more functions and examples:

https://docs.python.org/2/library/string.html.

Regular expressions

One other important skill for an NLP enthusiast is working with regular expression. Regular expression is effectively pattern matching on strings. We heavily use pattern extrication to get meaningful information from large amounts of messy text data. The following are all the regular expressions you need. I haven't used any regular expressions beyond these in my entire life:

  • (a period): This expression matches any single character except newline \n.

  • \w: This expression will match a character or a digit equivalent to [a-z A-Z 0-9]

  • \W (upper case W) matches any non-word character.

  • \s: This expression (lowercase s) matches a single whitespace character - space, newline, return, tab, form [\n\r\t\f].

  • \S: This expression matches any non-whitespace character.

  • \t: This expression performs a tab operation.

  • \n: This expression is used for a newline character.

  • \r: This expression is used for a return character.

  • \d: Decimal digit [0-9].

  • ^: This expression is used at the start of the string.

  • $: This expression is used at the end of the string.

  • \: This expression is used to nullify the specialness of the special character. For example, you want to match the $ symbol, then add \ in front of it.

Let's search for something in the running example, where mystring is the same string object, and we will try to look for some patterns in that. A substring search is one of the common use-cases of the re module. Let's implement this:

>>># We have to import re module to use regular expression
>>>import re
>>>if re.search('Python',mystring):
>>>    print "We found python "
>>>else:
>>>    print "NO "

Once this is executed, we get the message as follows:

We found python

We can do more pattern finding using regular expressions. One of the common functions that is used in finding all the patterns in a string is findall. It will look for the given patterns in the string, and will give you a list of all the matched objects:

>>>import re
>>>print re.findall('!',mystring)
['!', '!']

As we can see there were two instances of the "!" in the mystring and findall return both object as a list.

Dictionaries

The other most commonly used data structure is dictionaries, also known as associative arrays/memories in other programming languages. Dictionaries are data structures that are indexed by keys, which can be any immutable type; such as strings and numbers can always be keys.

Dictionaries are handy data structure that used widely across programming languages to implement many algorithms. Python dictionaries are one of the most elegant implementations of hash tables in any programming language. It's so easy to work around dictionary, and the great thing is that with few nuggets of code you can build a very complex data structure, while the same task can take so much time and coding effort in other languages. This gives the programmer more time to focus on algorithms rather than the data structure itself.

I am using one of the very common use cases of dictionaries to get the frequency distribution of words in a given text. With just few lines of the following code, you can get the frequency of words. Just try the same task in any other language and you will understand how amazing Python is:

>>># declare a dictionary
>>>word_freq={}
>>>for tok in string.split():
>>>    if tok in word_freq:
>>>        word_freq [tok]+=1
>>>    else:
>>>        word_freq [tok]=1
>>>print word_freq
{'!': 2, 'and': 1, 'holy': 1, 'Python': 1, 'Grail': 1, 'the': 1, 'Monty': 1}

Writing functions

As any other programming langauge Python also has its way of writing functions. Function in Python start with keyword def followed by the function name and parentheses (). Similar to any other programming language any arguments and the type of the argument should be placed within these parentheses. The actual code starts with (:) colon symbol. The initial lines of the code are typically doc string (comments), then we have code body and function ends with a return statement. For example in the given example the function wordfreq start with def keyword, there is no argument to this function and the function ends with a return statement.

>>>import sys
>>>def wordfreq (mystring):
>>>	'''
>>>	Function to generated the frequency distribution of the given text
>>>	'''
>>>	print mystring
>>>	word_freq={}

>>>	for tok in mystring.split():
>>>		if tok in word_freq:
>>>			word_freq [tok]+=1
>>>		else:
>>>			word_freq [tok]=1
>>>	print word_freq
>>>def main():
>>>	str="This is my fist python program"
>>>	wordfreq(str)
>>>if __name__ == '__main__':
>>>	main()

This was the same code that we wrote in the previous section the idea of writing in a form of function is to make the code re-usable and readable. The interpreter style of writing Python is also very common but for writing big programes it will be a good practice to use function/classes and one of the programming paradigm. We also wanted the user to write and run first Python program. You need to follow these steps to achive this.

  1. Open an empty python file mywordfreq.py in your prefered text editor.

  2. Write/Copy the code above in the code snippet to the file.

  3. Open the command prompt in your Operating system.

  4. Run following command prompt:

    $ python mywordfreq,py "This is my fist python program !!" 
    
  5. Output should be:

    {'This': 1, 'is': 1, 'python': 1, 'fist': 1, 'program': 1, 'my': 1}
    

Now you have a very basic understanding about some common data-structures that python provides. You can write a full Python program and able to run that. I think this is good enough I think with this much of an introduction to Python you can manage for the initial chapters.

Note

Please have a look at some Python tutorials on the following website to learn more commands on Python:

https://wiki.python.org/moin/BeginnersGuide