The Handbook of NLP with Gensim

By: Chris Kuo

Overview of this book

Navigating NLP research and applying it in practice can be a formidable task, and The Handbook of NLP with Gensim is designed to make it easier. This book demystifies NLP and equips you with hands-on strategies spanning healthcare, e-commerce, finance, and more, so that you can apply Gensim in real-world scenarios. You’ll begin by exploring the motivation and techniques for extracting information from text, such as bag-of-words, TF-IDF, and word embeddings. The book then guides you through topic modeling with Latent Semantic Analysis (LSA) for dimensionality reduction and discovering latent semantic relationships in text data, Latent Dirichlet Allocation (LDA) for probabilistic topic modeling, and Ensemble LDA for more stable and accurate topics. Next, you’ll learn text summarization techniques with Word2Vec and Doc2Vec, build the modeling pipeline, and optimize models by tuning hyperparameters. As you get acquainted with practical applications in various industries, this book will inspire you to design innovative projects. Alongside topic modeling, you’ll also explore named entity handling and NER tools, modeling procedures, and tools for effective topic modeling applications. By the end of this book, you’ll have mastered the techniques essential to create applications with Gensim and integrate NLP into your business processes.
Table of Contents (24 chapters)
Part 1: NLP Basics
Part 2: Latent Semantic Analysis/Latent Semantic Indexing
Part 3: Word2Vec and Doc2Vec
Part 4: Topic Modeling with Latent Dirichlet Allocation
Part 5: Comparison and Applications

Introduction to NLP

“Why do we need NLP?” You may have asked this question as you watched natural language processing (NLP) advance in recent years. Let’s see how NLP helped a well-established investment firm named “Harmony Investments.” For decades, Harmony Investments had been renowned for its astute financial strategies and portfolio management, ranging from stocks and bonds to real estate and alternative investments. However, the sheer volume and variety of data sources, including news articles, earnings reports, social media posts, and financial statements, made it nearly impossible to analyze all the information manually. The firm’s analysts were spending an excessive amount of time collecting and reviewing data. Recognizing the need for a more efficient and data-driven approach, the firm partnered with a leading AI solutions provider to bring NLP-driven solutions into its business operations. First, it used NLP algorithms to review news articles, press releases, and social media platforms in real time, which enabled the firm to react swiftly to new information. Second, it used NLP tools that automatically summarized lengthy earnings reports, which reduced the time analysts spent on manual document review. Third, it used NLP-powered sentiment analysis to gauge public sentiment surrounding specific stocks or market segments. With these tools in place, analysts had more time for strategic research and for developing innovative investment strategies. As a result, Harmony Investments not only retained its reputation as a leading investment firm but also attracted new clients and expanded its portfolio.

Joe is a data scientist who is new to NLP. He and his data analyst colleague, Jacob, want to learn NLP techniques that can deliver benefits like those just described. They have certainly heard of ChatGPT and all the news about large language models (LLMs). They want to learn NLP systematically, from concepts to practice, and are looking for a textbook that builds a bridge to LLMs without starting with LLMs. If you are like Joe or Jacob, then this book is for you.

A fundamental step in NLP is text representation, which converts a collection of text documents into numerical values that a computer can process. In the simplest representations, each document is a vector in a high-dimensional space, where each dimension corresponds to a unique word in the entire corpus. This book starts with these count-based representations: bag-of-words (BoW), bag-of-N-grams, and term frequency-inverse document frequency (TF-IDF). The next advance in text representation is word embedding. Word embeddings are dense vector representations of words, learned from the contexts in which words appear in a large dataset. Embedding techniques such as Word2Vec create continuous vectors in which words with similar meanings have similar vectors, so they capture both semantic and syntactic relationships between words.
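To make the count-based representations concrete, here is a minimal sketch in plain Python. It is not Gensim's implementation: the toy corpus, the whitespace tokenizer, and the helper names `bow_vector` and `tfidf_vector` are all assumptions made for illustration, and the TF-IDF weight uses the textbook form (term count times the natural log of N divided by document frequency). Library implementations may use a different logarithm base or add normalization.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents chosen for illustration).
documents = [
    "the market rallied after the earnings report",
    "the earnings report disappointed investors",
    "investors like the new product",
]

# Naive whitespace tokenization; real pipelines also handle
# punctuation, casing, stop words, and so on.
tokenized = [doc.split() for doc in documents]

# One vector dimension per unique word in the whole corpus.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def bow_vector(tokens):
    """Bag-of-words: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Document frequency: in how many documents does each word appear?
n_docs = len(tokenized)
doc_freq = {
    word: sum(1 for tokens in tokenized if word in tokens)
    for word in vocabulary
}


def tfidf_vector(tokens):
    """Textbook TF-IDF: term count * ln(N / document frequency)."""
    counts = Counter(tokens)
    return [
        counts[word] * math.log(n_docs / doc_freq[word])
        for word in vocabulary
    ]


bow = [bow_vector(tokens) for tokens in tokenized]
tfidf = [tfidf_vector(tokens) for tokens in tokenized]

for doc, vector in zip(documents, tfidf):
    # Show only the words that carry non-zero weight in each document.
    weighted = {w: round(x, 3) for w, x in zip(vocabulary, vector) if x > 0}
    print(doc, "->", weighted)
```

In this sketch, a word such as "the" that occurs in every document gets a TF-IDF weight of zero, while a word that occurs in only one document gets the largest weight. That is the intuition behind TF-IDF: it down-weights words that do little to distinguish one document from another.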

Topic modeling is a significant NLP subject. It groups documents into topics for document retrieval, categorization, tagging, or annotation. This book examines in depth a milestone topic modeling technique, Latent Dirichlet Allocation (LDA). Another milestone topic modeling technique is BERTopic. To explain where it comes from, let me briefly describe the development history of Bidirectional Encoder Representations from Transformers (BERT). The seminal paper “Attention is all you need” by Vaswani et al. [2] enabled many transformer-based word embeddings and LLMs, and BERT is one of those word embeddings. Can we do topic modeling that groups documents based on BERT word embeddings? That question is the origin of BERTopic. I have included BERTopic in this book together with LDA so that you can see the differences. This will provide a bridge to transformer-based NLP techniques.

This book is a practical handbook with code snippets, and it covers many techniques in the Gensim library. Gensim is an open-source Python library for topic modeling, document clustering, and other unsupervised learning tasks on collections of textual documents. It provides a high-level interface for building and training a variety of models. The name Gensim stands for “generate similar”: the library finds similarities between documents in order to summarize texts or to classify documents into topics.
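The idea of finding similarities between documents can be sketched with cosine similarity over word counts. The snippet below is a plain-Python illustration under stated assumptions, not Gensim's API: the function name `cosine_similarity`, the query, and the two labeled toy documents are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both documents contribute to the dot product.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical query and documents, tokenized by whitespace.
query = "quarterly earnings beat expectations".split()
docs = {
    "finance": "earnings report shows quarterly earnings growth".split(),
    "sports": "the team won the final match".split(),
}

# Score every document against the query and pick the closest one.
scores = {name: cosine_similarity(query, tokens) for name, tokens in docs.items()}
best = max(scores, key=scores.get)
print(scores, "-> most similar:", best)
```

Documents that share no words with the query score zero here, which is one limitation of count-based vectors. Dense representations such as word embeddings aim to relate documents that use different words for similar meanings.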

In this chapter, we will cover the following topics:

  • Introduction to natural language processing
  • NLU + NLG = NLP
  • Gensim and its NLP modeling techniques
  • Topic modeling with BERTopic
  • Common NLP Python modules included in this book

After completing this chapter, you will be familiar with the development history of NLP. You will be able to explain the key NLP techniques that Gensim covers. You will also understand other popular NLP Python libraries that are often used together with Gensim.