Python Natural Language Processing Cookbook

By: Zhenya Antić

Overview of this book

Python is the most widely used language for natural language processing (NLP) thanks to its extensive tools and libraries for analyzing text and extracting computer-usable data. This book will take you through a range of techniques for text processing, from basics such as parsing the parts of speech to complex topics such as topic modeling, text classification, and visualization. Starting with an overview of NLP, the book presents recipes for dividing text into sentences, stemming and lemmatization, removing stopwords, and parts of speech tagging to help you to prepare your data. You’ll then learn ways of extracting and representing grammatical information, such as dependency parsing and anaphora resolution, discover different ways of representing the semantics using bag-of-words, TF-IDF, word embeddings, and BERT, and develop skills for text classification using keywords, SVMs, LSTMs, and other techniques. As you advance, you’ll also see how to extract information from text, implement unsupervised and supervised techniques for topic modeling, and perform topic modeling of short texts, such as tweets. Additionally, the book shows you how to develop chatbots using NLTK and Rasa and visualize text data. By the end of this NLP book, you’ll have developed the skills to use a powerful set of tools for text processing.

Extracting entities and relations

It is possible to extract subject-relation-object triplets from documents. Such triplets are frequently used in knowledge graphs, and they can be analyzed for further relations and inform other NLP tasks, such as search.

Getting ready

For this recipe, we will need another Python package based on spaCy, called textacy. The main advantage of this package is that it allows regular expression-like searching for tokens based on their part of speech tags. See the installation instructions in the Technical requirements section at the beginning of this chapter for more information.

How to do it…

We will find all the verb phrases in the text, as well as all the noun phrases (see the previous section). Then, for a particular verb phrase, we will find the noun phrase to its left (the subject) and the noun phrase to its right (the object). We will use two simple sentences, "All living things are made of cells." and "Cells have organelles." Follow these steps:

  1. Import spaCy and textacy:
    import spacy
    import textacy
    from Chapter02.split_into_clauses import find_root_of_sentence
  2. Load the spaCy engine:
    nlp = spacy.load('en_core_web_sm')
  3. We will get a list of sentences that we will be processing:
    sentences = ["All living things are made of cells.", 
                 "Cells have organelles."]
  4. In order to find verb phrases, we will need to compile regular expression-like patterns for the part of speech combinations of the words that make up the verb phrase. If we print out the parts of speech for the verb phrases in the two preceding sentences, "are made of" and "have", we will see that the sequences are AUX, VERB, ADP for the first and AUX for the second.
    verb_patterns = [[{"POS":"AUX"}, {"POS":"VERB"}, 
                      {"POS":"ADP"}], 
                     [{"POS":"AUX"}]]
  5. The contains_root function checks if a verb phrase contains the root of the sentence:
    def contains_root(verb_phrase, root):
        vp_start = verb_phrase.start
        vp_end = verb_phrase.end
        # Span.end is exclusive: it is the index of the first
        # token after the span
        return vp_start <= root.i < vp_end
  6. The get_verb_phrases function gets the verb phrases from a spaCy Doc object:
    def get_verb_phrases(doc):
        root = find_root_of_sentence(doc)
        verb_phrases = textacy.extract.matches(doc, 
                                               verb_patterns)
        new_vps = []
        for verb_phrase in verb_phrases:
            if (contains_root(verb_phrase, root)):
                new_vps.append(verb_phrase)
        return new_vps
  7. The longer_verb_phrase function finds the longest verb phrase:
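    def longer_verb_phrase(verb_phrases):
        longest_length = 0
        longest_verb_phrase = None
        for verb_phrase in verb_phrases:
            if len(verb_phrase) > longest_length:
                # Remember the new maximum so that a later, shorter
                # phrase cannot replace the longest one
                longest_length = len(verb_phrase)
                longest_verb_phrase = verb_phrase
        return longest_verb_phrase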
  8. The find_noun_phrase function will look for noun phrases either on the left- or right-hand side of the main verb phrase:
    def find_noun_phrase(verb_phrase, noun_phrases, side):
        for noun_phrase in noun_phrases:
            if (side == "left" and
                noun_phrase.start < verb_phrase.start):
                return noun_phrase
            elif (side == "right" and
                  noun_phrase.start >= verb_phrase.end):
                return noun_phrase
        # No noun phrase was found on the requested side
        return None
  9. In this function, we will use the preceding functions to find triplets of subject-relation-object in the sentences:
    def find_triplet(sentence):
        doc = nlp(sentence)
        verb_phrases = get_verb_phrases(doc)
        # noun_chunks is a generator; build a list so that it can
        # be searched twice (once for each side)
        noun_phrases = list(doc.noun_chunks)
        verb_phrase = None
        if (len(verb_phrases) > 1):
            verb_phrase = longer_verb_phrase(verb_phrases)
        else:
            verb_phrase = verb_phrases[0]
        left_noun_phrase = find_noun_phrase(verb_phrase, 
                                            noun_phrases, 
                                            "left")
        right_noun_phrase = find_noun_phrase(verb_phrase, 
                                             noun_phrases, 
                                             "right")
        return (left_noun_phrase, verb_phrase, 
                right_noun_phrase)
  10. We can now loop through our sentence list to find its relation triplets:
    for sentence in sentences:
        (left_np, vp, right_np) = find_triplet(sentence)
        print(left_np, "\t", vp, "\t", right_np)
  11. The result will be as follows:
    All living things        are made of     cells
    Cells    have    organelles

How it works…

The code finds triplets of subject-relation-object by looking for the root verb phrase and finding its surrounding nouns. The verb phrases are found using the textacy package, which provides a very useful tool for finding patterns of words of certain parts of speech. In effect, we can use it to write small grammars describing the necessary phrases.

Important note

The textacy package, while very useful, is not bug-free, so use it with caution.

Once the verb phrases have been found, we can search through the sentence's noun chunks to find those that surround the verb phrase containing the root.

A step-by-step explanation follows.

In step 1, we import the necessary packages and the find_root_of_sentence function from the previous recipe. In step 2, we initialize the spaCy engine, and in step 3, we initialize a list with the sentences we will be using.

In step 4, we compile part of speech patterns that we will use for finding relations. For these two sentences, the patterns are AUX, VERB, ADP (for "are made of") and AUX (for "have").
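To make the idea of part of speech patterns more concrete, here is a minimal pure-Python sketch of the matching itself. It is not how textacy implements it. The (token, tag) pairs are assigned by hand for illustration rather than produced by spaCy, and match_pos_patterns is a hypothetical helper name.

```python
# Hand-assigned (token, POS) pairs for illustration only;
# in the recipe these tags come from spaCy.
tagged = [("All", "DET"), ("living", "ADJ"), ("things", "NOUN"),
          ("are", "AUX"), ("made", "VERB"), ("of", "ADP"),
          ("cells", "NOUN"), (".", "PUNCT")]

# The same two part of speech sequences as in step 4, as plain lists.
pos_patterns = [["AUX", "VERB", "ADP"], ["AUX"]]

def match_pos_patterns(tagged_tokens, patterns):
    """Return (start, end, text) for every place a pattern matches.

    The end index is exclusive, mirroring how spaCy spans are indexed.
    """
    tags = [pos for _, pos in tagged_tokens]
    found = []
    for pattern in patterns:
        width = len(pattern)
        for start in range(len(tags) - width + 1):
            if tags[start:start + width] == pattern:
                words = [tok for tok, _ in
                         tagged_tokens[start:start + width]]
                found.append((start, start + width, " ".join(words)))
    return found

matches = match_pos_patterns(tagged, pos_patterns)
for match in matches:
    print(match)
```

Both "are made of" and the shorter "are" match, which is why the recipe later needs a way to pick the longest candidate.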

In step 5, we create the contains_root function, which will make sure that a verb phrase contains the root of the sentence. It does that by checking the index of the root and making sure that it falls within the verb phrase span boundaries.
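A spaCy span's end index is exclusive (it points at the first token after the span), so a containment check is usually written as a half-open comparison. The sketch below illustrates this with FakeSpan, a hypothetical stand-in class used here so that the example runs without spaCy.

```python
from dataclasses import dataclass

@dataclass
class FakeSpan:
    """Hypothetical stand-in for a spaCy Span.

    start is inclusive and end is exclusive, as in spaCy.
    """
    start: int
    end: int

def span_contains_index(span, index):
    # Half-open check: index == span.end is the token after the span.
    return span.start <= index < span.end

# "are made of" covers tokens 3, 4, and 5 of the first sentence.
vp = FakeSpan(start=3, end=6)
print(span_contains_index(vp, 4))  # "made" is inside the span
print(span_contains_index(vp, 6))  # "cells" is just past the end
```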

In step 6, we create the get_verb_phrases function, which extracts all the verb phrases from the Doc object that is passed in. It uses the part of speech patterns we created in step 4.

In step 7, we create the longer_verb_phrase function, which will find the longest verb phrase from a list. We do this because some of the matched verb phrases might be shorter than necessary. For example, in the sentence "All living things are made of cells", both "are" and "are made of" will be found.
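Selecting the longest candidate can also be expressed with Python's built-in max function. The following is a sketch on plain token lists rather than spaCy spans, and longest_phrase is a hypothetical name that is not part of the recipe.

```python
# Candidate verb phrases as token lists (stand-ins for spaCy spans).
candidates = [["are"], ["are", "made", "of"]]

def longest_phrase(phrases):
    # key=len compares phrases by their number of tokens;
    # default=None covers the case where nothing matched.
    return max(phrases, key=len, default=None)

best = longest_phrase(candidates)
print(best)
print(longest_phrase([]))
```

Because len also works on spaCy spans, the same call should carry over to the spans returned by the matcher.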

In step 8, we create the find_noun_phrase function, which finds noun phrases on either side of the verb. We specify the side as a parameter.
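The positional logic can be tried out on plain token offsets. In this sketch, the verb phrase is a (start, end) pair and each noun chunk is a (start, end, text) tuple with an exclusive end. The offsets are written by hand for the first sentence, and first_chunk_on_side is a hypothetical helper, not the recipe's function.

```python
def first_chunk_on_side(verb_span, chunks, side):
    """Return the first chunk, in document order, on the given side.

    Returns None when there is no chunk on that side.
    """
    vp_start, vp_end = verb_span
    for chunk in chunks:
        start = chunk[0]
        if side == "left" and start < vp_start:
            return chunk
        if side == "right" and start >= vp_end:
            return chunk
    return None

# "All living things are made of cells."
verb = (3, 6)                                   # "are made of"
chunks = [(0, 3, "All living things"), (6, 7, "cells")]

left = first_chunk_on_side(verb, chunks, "left")
right = first_chunk_on_side(verb, chunks, "right")
missing = first_chunk_on_side(verb, chunks[:1], "right")
print(left, right, missing)
```

The last call shows what happens when there is no noun chunk after the verb: the result is None, which the caller has to be prepared for.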

In step 9, we create the find_triplet function, which will find triplets of subject-relation-object in a sentence. In this function, first, we process the sentence with spaCy. Then, we use the functions defined in the previous steps to find the longest verb phrase and the nouns to the left- and right-hand sides of it.

In step 10, we apply the find_triplet function to the two sentences we defined at the beginning. The resulting triplets are correct.

In this recipe, we made a few assumptions that will not always be correct. The first assumption is that there will only be one main verb phrase. The second assumption is that there will be a noun chunk on either side of the verb phrase. Once we start working with sentences that are complex or compound, or contain relative clauses, these assumptions no longer hold. I leave it as an exercise for you to work with more complex cases.
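One simple way to cope with sentences that break these assumptions is to treat a triplet as valid only when all three parts were found, and to skip the rest. The sketch below works on plain strings. The third candidate is a made-up example of a sentence where no relation or object was matched, and complete_triplet is a hypothetical helper.

```python
def complete_triplet(subject, relation, obj):
    """Return the triplet only if no part is missing, else None."""
    parts = (subject, relation, obj)
    if all(part is not None for part in parts):
        return parts
    return None

candidates = [
    ("All living things", "are made of", "cells"),
    ("Cells", "have", "organelles"),
    ("It", None, None),  # made-up case: nothing matched after the subject
]

triplets = []
for candidate in candidates:
    triplet = complete_triplet(*candidate)
    if triplet is not None:
        triplets.append(triplet)

print(triplets)
```

This does not solve the harder cases (compound sentences or relative clauses), but it keeps incomplete results out of downstream processing.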

There's more…

Once you've parsed out the entities and relations, you might want to input them into a knowledge graph for further use. There are a variety of tools you can use to work with knowledge graphs, such as Neo4j.
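Before moving to a dedicated graph database, the triplets can be explored with a small in-memory structure. The following sketch uses a plain dictionary as a stand-in for a knowledge graph. It does not use any graph database API, and the lowercasing of node names is an assumption made here so that "Cells" and "cells" end up as the same node.

```python
from collections import defaultdict

# The two triplets produced by the recipe, as plain strings.
triplets = [
    ("All living things", "are made of", "cells"),
    ("Cells", "have", "organelles"),
]

def build_graph(triplets):
    """Map each subject node to a list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triplets:
        # Assumption: node names are case-insensitive.
        graph[subject.lower()].append((relation, obj.lower()))
    return dict(graph)

graph = build_graph(triplets)
for node, edges in graph.items():
    for relation, target in edges:
        print(node, "--", relation, "->", target)
```

With the nodes normalized this way, "cells" is both the object of the first triplet and the subject of the second, so the two facts become connected.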