Book Image

The Handbook of NLP with Gensim

By : Chris Kuo
Book Image

The Handbook of NLP with Gensim

By: Chris Kuo

Overview of this book

Navigating the terrain of NLP research and applying it practically can be a formidable task made easy with The Handbook of NLP with Gensim. This book demystifies NLP and equips you with hands-on strategies spanning healthcare, e-commerce, finance, and more to enable you to leverage Gensim in real-world scenarios. You’ll begin by exploring motives and techniques for extracting text information like bag-of-words, TF-IDF, and word embeddings. This book will then guide you on topic modeling using methods such as Latent Semantic Analysis (LSA) for dimensionality reduction and discovering latent semantic relationships in text data, Latent Dirichlet Allocation (LDA) for probabilistic topic modeling, and Ensemble LDA to enhance topic modeling stability and accuracy. Next, you’ll learn text summarization techniques with Word2Vec and Doc2Vec to build the modeling pipeline and optimize models using hyperparameters. As you get acquainted with practical applications in various industries, this book will inspire you to design innovative projects. Alongside topic modeling, you’ll also explore named entity handling and NER tools, modeling procedures, and tools for effective topic modeling applications. By the end of this book, you’ll have mastered the techniques essential to create applications with Gensim and integrate NLP into your business processes.
Table of Contents (24 chapters)
1
Part 1: NLP Basics
5
Part 2: Latent Semantic Analysis/Latent Semantic Indexing
9
Part 3: Word2Vec and Doc2Vec
12
Part 4: Topic Modeling with Latent Dirichlet Allocation
18
Part 5: Comparison and Applications

NLU + NLG = NLP

NLP is an umbrella term that covers natural language understanding (NLU) and NLG. We’ll go through both in the next sections.

NLU

Many languages, such as English, German, and Chinese, have been developing for hundreds of years and continue to evolve. Humans can use languages artfully in various social contexts. Now, we are asking a computer to understand human language. What’s very rudimentary to us may not be so apparent to a computer. Linguists have contributed much to the development of computers’ understanding in terms of syntax, semantics, phonology, morphology, and pragmatics.

NLU focuses on understanding the meaning of human language. It extracts text or speech input and then analyzes the syntax, semantics, phonology, morphology, and pragmatics in the language. Let’s briefly go over each one:

  • Syntax: This is about the study of how words are arranged to form phrases and clauses, as well as the use of punctuation, order of words, and sentences.
  • Semantics: This is about the possible meanings of a sentence based on the interactions between words in the sentence. It is concerned with the interpretation of language, rather than its form or structure. For example, the word “table” as a noun can refer to “a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs” or a data frame in a computer language.

Let’s elaborate more on semantics with two jokes. The first example is as follows:

Patient: “Doctor, doctor! I’ve broken my arm in three places!”

Doctor: “Well, stop going to those places, then.”

The patient uses the word places to mean the spots on the arm; the doctor uses places to mean physical locations.

The second example is as follows:

My coworker: “Do you ever think about working from home?”

Me: “I don’t even think about work at work!”

The first work in my reply means the tasks in my work. The second work in at work means at one’s place of employment.

NLU can understand the two meanings of a word in such jokes through a technique called word embedding. We will learn more about this in Chapter 2, Text Representation.

  • Phonology: This is about the study of the sound system of a language, including the sounds of speech (phonemes), how they are combined to form words (morphology), and how they are organized into larger units such as syllables and stress patterns. For example, the sounds represented by the letters “p” and “b” in English are distinct phonemes. A phoneme is the smallest unit of sound in a language that can change the meaning of a word. Consider the words “pat” and “bat.” The only difference between these two words is the initial sound, but their meanings are different.
  • Morphology: This is the study of the structure of words, including the way in which they are formed from smaller units of meaning called morphemes. It originally comes from “morph,” the shape or form, and “ology,” the study of something. Morphology is important because it helps us understand how words are formed and how they relate to each other. It also helps us understand how words change over time and how they are related to other words in a language. For example, the word “unkindness” consists of three separate morphemes: the prefix “un-,” the root “kind,” and the suffix “-ness.”
  • Pragmatics: This is the study of how language is used in a social context. Pragmatics is important because it helps us understand how language works in real-world situations, and how language can be used to convey meaning and achieve specific purposes. For example, if you offer to buy your friend a McDonald’s burger, a large fries, and a large drink, your friend may reply "no" because he is concerned about becoming fat. Your friend may simply mean the burger meal is high in calories, but the conversation can also imply he may be fat in a social context.

Now, let’s understand NLG.

NLG

While NLU is concerned with reading for a computer to comprehend, NLG is about writing for a computer to write. The term generation in NLG refers to an NLP model generating meaningful words or even articles. Today, when you compose an email or type a sentence in an app, it presents possible words to complete your sentence or performs automatic correction. These are applications of NLG. As a result, the term generative AI is coined for generative models including language, voice, image, and video generation. ChatGPT and GPT-4 from OpenAI are probably the most famous examples of generative AI. I will briefly introduce ChatGPT, GPT-4, and other open source products, such as gpt.h2o.ai, and present the HuggingFace.co open source community.

ChatGPT and GPT-4

You can enter a prompt for ChatGPT to generate a poem, a prompt, a story, and so on. With its use of the reinforcement learning from human feedback (RLHF) technique, ChatGPT is designed to respond to questions in a way that sounds natural and human-like, making it easier for people to communicate with the model. It is also able to remember the conversations prior to the current conversation. If you ask ChatGPT, “Should I wear a blue shirt or a white shirt for tomorrow’s outdoor company meeting when it is likely to be hot and sunny?”, it will formulate a response by inferring a sequence of words that are likely to come next. It answers, “When it is hot and sunny, you may want to wear a white shirt.” If you reply, “How about on a gloomy day?”, ChatGPT understands that you mean to ask, in continuation to your prior question, about what to wear for tomorrow’s outdoor company meeting. It does not take the second question as an independent question and answer it randomly. This ability to remember and contextualize inputs is what gives ChatGPT the ability to carry on some semblance of a human conversation rather than give naïve, one-off answers. Hence, having the memory for long context is the key for next-generation language models such as GPT-4.

Generative Pre-trained Transformer 4 (GPT-4) was released by OpenAI on March 14, 2023. It is a transformer-based model that can predict the next word or token. It can answer questions, summarize text, translate text to other languages, and generate code, blog posts, stories, conversations, and other content types. The ability to remember and contextualize inputs, known as the context window is key for a language model. The context window of GPT-4 has been increased to roughly 8,000 words. Given the English conversation speed is about 120 words per minute and 7,200 words per hour, a context window of 8,000 words is more than an hour’s conversation.

What is the token or transformer mentioned in the previous paragraph? A token is the NLP term referring to a word separated from a sentence, or tokenized. A transformer is the milestone language model invented in 2017 by Vaswani et al [2]. In Chapter 2, Text Representation, I will explain how tokenization works in text preprocessing. In Chapter 13, The Ensemble LDA for Model Stability, I will give high-level explanations of a transformer model and others including the BERTopic model that builds on it.

Gpt.h2o.ai

GPT-H2O.ai is an LLM by H2O.ai. It is aimed at enterprise-grade LLMs such as ChatGPT, but is open source. I let gpt.h2o.ai define itself by asking “What is gpt.h2o.ai?” It answered: “gpt.h2o.ai is a large language model that was trained on a combination of web crawl data and Reddit comments. The model can generate human-like text and has been used for a variety of tasks, including generating code, answering questions, and summarizing text.”

Let’s look at its storytelling capability. I entered: “Write a fable on a lion that rescued a boy when the boy was under the attack of aliens.” It returned the following interesting story:

“Once upon a time, there was a young boy named Jack who loved to explore the woods near his home. One day, while he was out on one of his adventures, he came across a strange sight. There, in the middle of the forest, was a spaceship. It was unlike anything Jack had ever seen before. As he approached the ship, he heard a strange noise coming from inside. Jack peered through the window and saw a group of aliens inside. They were holding a young boy captive and seemed to be doing something to him. Jack knew he had to act quickly. Just then, a lion appeared out of nowhere. The lion roared loudly, scaring the aliens away. The boy was freed and the spaceship took off into the sky. The boy thanked the lion for saving him and asked how he knew to come to his rescue. The lion explained that it had been watching over him for some time and knew that he was in danger. The boy and the lion became friends. Jack never forgot the day that he saw the lion save the boy from the aliens.”

HuggingFace.co

New research on LLMs has burgeoned over the past few years. New LLMs, datasets, and solutions become hard to keep up with and create a new knowledge barrier for anyone who wants to enter the world of generative AI. Hugging Face provides a platform called HuggingFace.co that allows everyone to access open source LLMs, academic papers, and various datasets that trained the LLMs on its platform. You can also share your LLMs and datasets.

You may ask why LLMs can answer questions, write book reports, draft notes, or summarize documents. An important factor is the data on which they were trained or with which they were fine-tuned. These large-scale datasets include “News/Wikipedia,” “Web crawling,” “Questions and Answers (Q&A),” “Books,” and “Reading comprehension.” In “Large Language Model Datasets” [3], I have given a more detailed review of the datasets that trained prominent LLMs such as GPT-2, GPT-3, GPT-4, and so on.

I trust you have gained a better understanding of NLU and NLG. Next, I will introduce the NLP techniques covered by Gensim.