Book Image

Python Natural Language Processing Cookbook

By : Zhenya Antić
Book Image

Python Natural Language Processing Cookbook

By: Zhenya Antić

Overview of this book

Python is the most widely used language for natural language processing (NLP) thanks to its extensive tools and libraries for analyzing text and extracting computer-usable data. This book will take you through a range of techniques for text processing, from basics such as parsing the parts of speech to complex topics such as topic modeling, text classification, and visualization. Starting with an overview of NLP, the book presents recipes for dividing text into sentences, stemming and lemmatization, removing stopwords, and parts of speech tagging to help you to prepare your data. You’ll then learn ways of extracting and representing grammatical information, such as dependency parsing and anaphora resolution, discover different ways of representing the semantics using bag-of-words, TF-IDF, word embeddings, and BERT, and develop skills for text classification using keywords, SVMs, LSTMs, and other techniques. As you advance, you’ll also see how to extract information from text, implement unsupervised and supervised techniques for topic modeling, and perform topic modeling of short texts, such as tweets. Additionally, the book shows you how to develop chatbots using NLTK and Rasa and visualize text data. By the end of this NLP book, you’ll have developed the skills to use a powerful set of tools for text processing.
Table of Contents (10 chapters)

Extracting subjects and objects of the sentence

Sometimes, we might need to find the subject and direct objects of the sentence, and that can easily be accomplished with the spacy package.

Getting ready

We will be using the dependency tags from spacy to find subjects and objects.

How to do it…

We will use the subtree attribute of tokens to find the complete noun chunk that is the subject or direct object of the verb (see the Getting the dependency parse recipe for more information). Let's get started:

  1. Import spacy:
    import spacy
  2. Load the spacy engine:
    nlp = spacy.load('en_core_web_sm')
  3. We will get the list of sentences we will be processing:
    sentences=["The big black cat stared at the small dog.",
               "Jane watched her brother in the evenings."]
  4. We will use two functions to find the subject and the direct object of the sentence. These functions will loop through the tokens and return the subtree that contains the token with subj or dobj in the dependency tag, respectively. Here is the subject function:
    def get_subject_phrase(doc):
        for token in doc:
            if ("subj" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  5. Here is the direct object function. If the sentence does not have a direct object, it will return None:
    def get_object_phrase(doc):
        for token in doc:
            if ("dobj" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  6. We can now loop through the sentences and print out their subjects and objects:
    for sentence in sentences:
        doc = nlp(sentence)
        subject_phrase = get_subject_phrase(doc)
        object_phrase = get_object_phrase(doc)
        print(subject_phrase)
        print(object_phrase)

    The result will be as follows. Since the first sentence does not have a direct object, None is printed out:

    The big black cat
    None
    Jane
    her brother

How it works…

The code uses the spacy engine to parse the sentence. Then, the subject function loops through the tokens, and if the dependency tag contains subj, it returns that token's subtree, which is a Span object. There are different subject tags, including nsubj for regular subjects and nsubjpass for subjects of passive sentences, so we want to look for both.

The object function works exactly the same as the subject function, except it looks for the token that has dobj (direct object) in its dependency tag. Since not all sentences have direct objects, it returns None in those cases.

In step 1, we import spaCy, and in step 2, we load the spacy engine. In step 3, we initialize a list with the sentences we will be processing.

In step 4, we create the get_subject_phrase function, which gets the subject of the sentence. It looks for the token that has a dependency tag that contains subj and then returns the subtree that contains that token. There are several subject dependency tags, including nsubj and nsubjpass (for a subject of a passive sentence), so we look for the most general pattern.

In step 5, we create the get_object_phrase function, which gets the direct object of the sentence. It works similarly to the get_subject_phrase, but looks for the dobj dependency tag instead of a tag that contains "subj".

In step 6, we loop through the list of sentences we created in step 3, and use the preceding functions to find the subjects and direct objects in the sentences. For the sentence The big black cat stared at the small dog, the subject is the big black cat, and there is no direct object (the small dog is the object of the preposition at). For the sentence Jane watched her brother in the evenings, the subject is Jane and the direct object is her brother.

There's more…

We can look for other objects; for example, the dative objects of verbs such as give and objects of prepositional phrases. The functions will look very similar, with the main difference being the dependency tags; that is, dative for the dative object function and pobj for the prepositional object function. The prepositional object function will return a list since there can be more than one prepositional phrase in a sentence. Let's take a look:

  1. The dative object function checks the tokens for the dative tag. It returns None if there are no dative objects:
    def get_dative_phrase(doc):
        for token in doc:
            if ("dative" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                return doc[start:end]
  2. Here is the prepositional object function. It returns a list of objects of prepositions, but will be empty if there are none:
    def get_prepositional_phrase_objs(doc):
        prep_spans = []
        for token in doc:
            if ("pobj" in token.dep_):
                subtree = list(token.subtree)
                start = subtree[0].i
                end = subtree[-1].i + 1
                prep_spans.append(doc[start:end])
        return prep_spans
  3. The prepositional phrase objects in the sentence Jane watched her brother in the evenings are as follows:
    [the evenings]
  4. And here is the dative object in the sentence Laura gave Sam a very interesting book:
    Sam

    It is left as an exercise for you to find the actual prepositional phrases with prepositions intact instead of just the noun phrases that are dependent on these prepositions.