Book Image

Python Natural Language Processing Cookbook

By : Zhenya Antić
Book Image

Python Natural Language Processing Cookbook

By: Zhenya Antić

Overview of this book

Python is the most widely used language for natural language processing (NLP) thanks to its extensive tools and libraries for analyzing text and extracting computer-usable data. This book will take you through a range of techniques for text processing, from basics such as parsing the parts of speech to complex topics such as topic modeling, text classification, and visualization. Starting with an overview of NLP, the book presents recipes for dividing text into sentences, stemming and lemmatization, removing stopwords, and parts of speech tagging to help you to prepare your data. You’ll then learn ways of extracting and representing grammatical information, such as dependency parsing and anaphora resolution, discover different ways of representing the semantics using bag-of-words, TF-IDF, word embeddings, and BERT, and develop skills for text classification using keywords, SVMs, LSTMs, and other techniques. As you advance, you’ll also see how to extract information from text, implement unsupervised and supervised techniques for topic modeling, and perform topic modeling of short texts, such as tweets. Additionally, the book shows you how to develop chatbots using NLTK and Rasa and visualize text data. By the end of this NLP book, you’ll have developed the skills to use a powerful set of tools for text processing.
Table of Contents (10 chapters)

Splitting sentences into clauses

When we work with text, we frequently deal with compound (sentences with two parts that are equally important) and complex sentences (sentences with one part depending on another). It is sometimes useful to split these composite sentences into its component clauses for easier processing down the line. This recipe uses the dependency parse from the previous recipe.

Getting ready

You will only need the spacy package in this recipe.

How to do it…

We will work with two sentences, He eats cheese, but he won't eat ice cream and If it rains later, we won't be able to go to the park. Other sentences may turn out to be more complicated to deal with, and I leave it as an exercise for you to split such sentences. Follow these steps:

  1. Import the spacy package:
    import spacy
  2. Load the spacy engine:
    nlp = spacy.load('en_core_web_sm')
  3. Set the sentence to He eats cheese, but he won't eat ice cream:
    sentence = "He eats cheese, but he won't eat ice cream."
  4. Process the sentence with the spacy engine:
    doc = nlp(sentence)
  5. It is instructive to look at the structure of the input sentence by printing out the part of speech, dependency tag, ancestors, and children of each token. This can be accomplished using the following code:
    for token in doc:
        ancestors = [t.text for t in token.ancestors]
        children = [t.text for t in token.children]
        print(token.text, "\t", token.i, "\t", 
              token.pos_, "\t", token.dep_, "\t", 
              ancestors, "\t", children)
  6. We will use the following function to find the root token of the sentence, which is usually the main verb. In instances where there is a dependent clause, it is the verb of the independent clause:
    def find_root_of_sentence(doc):
        root_token = None
        for token in doc:
            if (token.dep_ == "ROOT"):
                root_token = token
        return root_token
  7. We will now find the root token of the sentence:
    root_token = find_root_of_sentence(doc)
  8. We can now use the following function to find the other verbs in the sentence:
    def find_other_verbs(doc, root_token):
        other_verbs = []
        for token in doc:
            ancestors = list(token.ancestors)
            if (token.pos_ == "VERB" and len(ancestors) == 1\
                and ancestors[0] == root_token):
                other_verbs.append(token)
        return other_verbs
  9. Use the preceding function to find the remaining verbs in the sentence:
    other_verbs = find_other_verbs(doc, root_token)

    We will use the following function to find the token spans for each verb:

    def get_clause_token_span_for_verb(verb, doc, all_verbs):
        first_token_index = len(doc)
        last_token_index = 0
        this_verb_children = list(verb.children)
        for child in this_verb_children:
            if (child not in all_verbs):
                if (child.i < first_token_index):
                    first_token_index = child.i
                if (child.i > last_token_index):
                    last_token_index = child.i
        return(first_token_index, last_token_index)
  10. We will put together all the verbs in one array and process each using the preceding function. This will return a tuple of start and end indices for each verb's clause:
    token_spans = []   
    all_verbs = [root_token] + other_verbs
    for other_verb in all_verbs:
        (first_token_index, last_token_index) = \
         get_clause_token_span_for_verb(other_verb, 
                                        doc, all_verbs)
        token_spans.append((first_token_index, 
                            last_token_index))
  11. Using the start and end indices, we can now put together token spans for each clause. We sort the sentence_clauses list at the end so that the clauses are in the order they appear in the sentence:
    sentence_clauses = []
    for token_span in token_spans:
        start = token_span[0]
        end = token_span[1]
        if (start < end):
            clause = doc[start:end]
            sentence_clauses.append(clause)
    sentence_clauses = sorted(sentence_clauses, 
                              key=lambda tup: tup[0])
  12. Now, we can print the final result of the processing for our initial sentence; that is, He eats cheese, but he won't eat ice cream:
    clauses_text = [clause.text for clause in sentence_clauses]
    print(clauses_text)

    The result is as follows:

    ['He eats cheese,', 'he won't eat ice cream']

    Important note

    The code in this section will work for some cases, but not others; I encourage you to test it out on different cases and amend the code.

How it works…

The way the code works is based on the way complex and compound sentences are structured. Each clause contains a verb, and one of the verbs is the main verb of the sentence (root). The code looks for the root verb, always marked with the ROOT dependency tag in spaCy processing, and then looks for the other verbs in the sentence.

The code then uses the information about each verb's children to find the left and right boundaries of the clause. Using this information, the code then constructs the text of the clauses. A step-by-step explanation follows.

In step 1, we import the spaCy package and in step 2, we load the spacy engine. In step 3, we set the sentence variable and in step 4, we process it using the spacy engine. In step 5, we print out the dependency parse information. It will help us determine how to split the sentence into clauses.

In step 6, we define the find_root_of_sentence function, which returns the token that has a dependency tag of ROOT. In step 7, we find the root of the sentence we are using as an example.

In step 8, we define the find_other_verbs function, which will find other verbs in the sentence. In this function, we look for tokens that have the VERB part of speech tag and has the root token as its only ancestor. In step 9, we apply this function.

In step 10, we define the get_clause_token_span_for_verb function, which will find the beginning and ending index for the verb. The function goes through all the verb's children; the leftmost child's index is the beginning index, while the rightmost child's index is the ending index for this verb's clause.

In step 11, we use the preceding function to find the clause indices for each verb. The token_spans variable contains the list of tuples, where the first tuple element is the beginning clause index and the second tuple element is the ending clause index.

In step 12, we create token Span objects for each clause in the sentence using the list of beginning and ending index pairs we created in step 11. We get the Span object by slicing the Doc object and then appending the resulting Span objects to a list. As a final step, we sort the list to make sure that the clauses in the list are in the same order as in the sentence.

In step 13, we print the clauses in our sentence. You will notice that the word but is missing, since its parent is the root verb eats, although it appears in the other clause. The exercise of including but is left to you.