What is text analysis?


If there's one medium we are exposed to every single day, it's text. Whether it's the morning paper or the messages we send and receive, it's likely that much of our information arrives in the form of text.

Let's put things into a little more perspective – consider the amount of text data handled by companies such as Google (1+ trillion queries per year), Twitter (1.6 billion queries per day), and WhatsApp (30+ billion messages per day). That's an incredible resource, and the sheer ubiquity of text is reason enough to take it seriously. Textual data also has huge business value, and companies can use this data to profile customers and understand customer trends. This can be used either to offer a more personalized experience for users or as information for targeted marketing. Facebook, for example, uses textual data heavily, and one of the algorithms we will learn about later in this book was developed by Facebook's AI research team.

Fig 1.1 Rate of data growth from 2006 – 2018 with predicted rates of data in 2019 and 2020. Source: Patrick Cheeseman, https://www.eetimes.com/author.asp?section_id=36&doc_id=1330462

Text analysis can be understood as the technique of gleaning useful information from text. This can be done through various techniques, and we use Natural Language Processing (NLP), Computational Linguistics (CL), and numerical tools to get this information. These numerical tools are machine learning algorithms or information retrieval algorithms. We'll briefly and informally explain these terms, as they will come up throughout the book.

Natural Language Processing (NLP) refers to the use of a computer to process natural language. Removing all occurrences of the word thereby from a body of text is one such example, albeit a basic one.
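As a minimal sketch of that exact task in Python, using only the standard library's re module (the sample sentence here is just a placeholder):

```python
import re

text = "The experiment failed, thereby proving the hypothesis wrong."

# Remove every standalone occurrence of "thereby", ignoring case;
# \b word boundaries avoid touching words that merely contain it.
cleaned = re.sub(r"\bthereby\b\s*", "", text, flags=re.IGNORECASE)

print(cleaned)  # "The experiment failed, proving the hypothesis wrong."
```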

Computational Linguistics (CL), as the name suggests, is the study of linguistics from a computational perspective. This means using computers and algorithms to perform linguistic tasks, such as tagging the words in your text with their parts of speech (such as noun or verb), instead of performing the task manually.
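To see what this looks like in practice, here is a brief sketch using spaCy, one of the libraries we will use throughout this book (it assumes the small English model has been downloaded separately):

```python
import spacy

# Assumes the model was installed with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(token.text, token.pos_)  # e.g. "fox NOUN", "jumps VERB"
```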

Machine Learning (ML) is the field of study where we use statistical algorithms to teach machines to perform a particular task. This learning occurs with data, and our task is often to predict a new value based on previously observed data.
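As a tiny illustration of learning from observed data (the numbers here are made up for the example), we can fit a model with scikit-learn and then predict an unseen value:

```python
from sklearn.linear_model import LinearRegression

# Toy, made-up data: hours studied vs. exam score.
X = [[1], [2], [3], [4], [5]]
y = [52, 57, 61, 68, 73]

model = LinearRegression()
model.fit(X, y)  # learn from previously observed data

print(model.predict([[6]]))  # predict a new, unseen value
```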

Information Retrieval (IR) is the task of looking up or retrieving information based on a user's query. The algorithms that aid in performing this task are called information retrieval algorithms, and we will be encountering them throughout the book.
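A minimal sketch of such an algorithm, using TF-IDF vectors and cosine similarity from scikit-learn on a made-up document collection, might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny, made-up document collection.
docs = [
    "Python is a popular programming language.",
    "Topic modeling finds themes in large bodies of text.",
    "Machine translation converts text between languages.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Represent the query in the same vector space, then rank the
# documents by cosine similarity to the query.
query_vector = vectorizer.transform(["machine translation between languages"])
scores = cosine_similarity(query_vector, doc_vectors)[0]

best = scores.argmax()
print(docs[best])  # "Machine translation converts text between languages."
```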

Text analysis itself has been around for a long time: one of the first definitions of Business Intelligence (BI), in an October 1958 IBM Journal article by H. P. Luhn, A Business Intelligence System [1], describes a system that will do the following:

"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points."

It's interesting to see documents discussed here rather than numbers; that the first ideas of business intelligence revolved around understanding text and documents is again a testament to text analysis throughout the ages. Even outside the realm of business, using computers to better understand text and language has been around since the earliest ideas of artificial intelligence. John Hutchins' 1999 review, Retrospect and prospect in computer-based translation [2], describes efforts by the United States military to perform machine translation as early as the 1950s, in order to translate Russian scientific journals into English.

Efforts to build an intelligent machine also started with text: the ELIZA program, developed in 1966 at MIT by Joseph Weizenbaum, is one example. Even though the program had no real understanding of language, it could attempt to hold a conversation through basic pattern matching. These are just some of the earliest attempts to analyze text; computers (and human beings!) have come a long way since, and we now have incredible tools at our disposal.
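To give a flavor of that approach, here is our own toy sketch of ELIZA-style pattern matching (these rules are illustrative, not Weizenbaum's originals):

```python
import re

# A few illustrative ELIZA-style rules: a pattern and a response template.
rules = [
    (re.compile(r"I need (.*)", re.IGNORECASE), "Why do you need {}?"),
    (re.compile(r"I am (.*)", re.IGNORECASE), "How long have you been {}?"),
]

def reply(utterance):
    for pattern, template in rules:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1))
    return "Please tell me more."  # fallback when no rule matches

print(reply("I am feeling anxious"))  # "How long have you been feeling anxious?"
```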

Machine translation itself has come a long way: we can now use our smartphones to translate between languages quite effectively, and with cutting-edge techniques such as Google's Neural Machine Translation, the gap between academia and industry is shrinking, allowing us to actually experience the magic of natural language processing firsthand.

Fig 1.2 An example of a neural machine translation model, translating French to English

Advances in this field have helped change the way we approach speech as well: closed captioning in videos and personal assistants such as Apple's Siri or Amazon's Alexa benefit greatly from superior text processing. Understanding structure in conversations and extracting information were key problems in early NLP, and the fruits of that research are now very apparent in the 21st century.

Search engines such as Google or Bing also stand on the shoulders of the research done in NLP and CL, and affect our lives in an unprecedented way. Information retrieval (IR) builds on statistical approaches in text processing and allows us to classify, cluster, and retrieve documents. Methods such as topic modeling can help us identify key topics in large, unstructured bodies of text. Identifying these topics goes beyond searching for keywords; we use statistical models to further understand the underlying nature of bodies of text. Without the power of computers, we could not perform this kind of large-scale statistical analysis on text. We will explore topic modeling in detail later in the book.

Fig 1.3 Techniques such as topic modeling use probabilistic modeling methods to identify key topics in text. We will study this in detail later in the book
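As a small taste of what's to come, here is a minimal topic modeling sketch with Gensim on a toy, pre-tokenized corpus; we will treat this subject properly in later chapters:

```python
from gensim import corpora
from gensim.models import LdaModel

# A toy corpus of already-tokenized documents.
texts = [
    ["cat", "dog", "pet", "dog"],
    ["python", "code", "software", "code"],
    ["pet", "cat", "dog"],
    ["software", "python", "bug"],
]

# Map each unique token to an id, then convert each document
# to a bag-of-words representation.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a two-topic LDA model and print the words defining each topic.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```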

Going one step beyond merely experiencing the wonders of modern computing on our mobile phones, recent developments in both Python and NLP mean that we can now develop such systems on our own!

Not only have the techniques used in NLP and text analysis evolved; they have also become very accessible to us, with open source packages becoming state-of-the-art and performing as well as commercial tools. One example of a commercial tool is Microsoft's Text Analytics API (https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/).

MATLAB is another example of a popular commercial tool used for scientific computing. While such commercial tools historically performed better than free, open source software, an increase in the number of people contributing to open source libraries, as well as funding from industry, has helped the open source community immensely. Now the tables appear to have turned, and many software giants use open source packages in their internal systems, such as Google using TensorFlow and Apple using scikit-learn. TensorFlow and scikit-learn are two open source Python machine learning packages.

It can be argued that the sheer number of packages offered by the Python ecosystem means it leads the pack when it comes to text analysis, and that is where we will focus our efforts. A very strong and active open source community adds to the appeal.

Throughout the course of this book, we will discuss modern natural language processing and computational linguistics techniques, along with the best open source tools available for applying them.