Garbage in, garbage out


Garbage in, garbage out (or GIGO) is an adage of computer science that becomes even more important when dealing with machine learning, and perhaps more so still when dealing with textual data. It means that if we feed in poorly formatted data, we are likely to get poor results.

Fig 1.5 XKCD hits the nail on the head once again (https://xkcd.com/1838/)

While more data usually leads to better predictions, this isn't always the case with text analysis, where more data can produce nonsensical results or results we don't want. An intuitive example: articles, such as the words a and the, tend to appear a lot in text but add little information, serving mostly a grammatical or structural purpose.

Words such as these, which don't provide useful information, are called stop words, and they are often removed from the text before text analysis techniques are applied. Similarly, we sometimes remove words that appear with very high frequency in the body of text, as well as words that appear only once or twice; it is highly likely that such words will not be useful to our analysis. That being said, this depends heavily on the kind of task being performed. If, for example, we wanted to replicate human writing styles, stop words would be important, because humans use many such words when writing. An example of how stop words can carry useful information is the article Pastiche detection based on stopword rankings. Exposing impersonators of a Romanian writer [20], a study that identified a particular author by the frequency of their stop words.
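As a simple illustration, here is a minimal sketch of stop word removal using spaCy, assuming the small English model (en_core_web_sm) is installed; the sentence and the printed output are only illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumped over the lazy dog.")

# token.is_stop flags words such as "the" and "over" that add little content
content_words = [token.text for token in doc
                 if not token.is_stop and not token.is_punct]
print(content_words)  # e.g. ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']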

Let's consider another example where we might be dealing with useless data. If we are searching for influential words or topics in the text, would it make sense to have both the words reading and read in the results? Here, shortening the word reading to read would not lead to any loss of information. On the other hand, it would make sense to keep the words information and inform separate in the same body of text, because they could mean different things depending on the context. We therefore need techniques to shorten words appropriately. Lemmatization and stemming are two methods we use to tackle this problem, and they remain two of the core concepts in natural language processing. We will be exploring these two techniques in more detail in Chapter 3, spaCy's Language models.
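To make the difference concrete, here is a minimal sketch comparing stemming and lemmatization, assuming NLTK and spaCy (with the en_core_web_sm model) are installed; the exact outputs depend on the library and model versions.

import spacy
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_sm")

# Stemming applies suffix-stripping rules, so "reading" and "read" collapse,
# and "information" and "inform" may collapse to the same stem as well.
print([stemmer.stem(word) for word in ["reading", "read", "information", "inform"]])

# Lemmatization uses the vocabulary and part of speech, so "information"
# keeps its own lemma while "reading" still maps to "read".
doc = nlp("I was reading about information retrieval.")
print([(token.text, token.lemma_) for token in doc])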

Even after basic text processing, our data is still a collection of words. Since machines do not inherently understand the concepts tied to words, we can instead use numbers to represent individual words. The next important step in text analysis is converting words into numbers, whether through bag-of-words (BOW) or term frequency-inverse document frequency (TF-IDF), which are different ways of counting the words in each document or sentence. There are also more advanced techniques for representing words, such as Word2Vec and GloVe.
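For a flavour of what this looks like in practice, here is a minimal sketch of bag-of-words and TF-IDF representations using Gensim; the two tokenized documents are toy examples made up for illustration.

from gensim import corpora, models

documents = [["machine", "learning", "needs", "clean", "text"],
             ["clean", "text", "makes", "analysis", "easier"]]

dictionary = corpora.Dictionary(documents)           # maps each word to an integer id
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
print(bow_corpus)                                    # each document as (word_id, count) pairs

tfidf = models.TfidfModel(bow_corpus)                # reweights counts so rarer words score higher
print([tfidf[doc] for doc in bow_corpus])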

We will go into these techniques in more detail in the chapter on pre-processing techniques. It is especially important to understand the motivation behind them, and to remember that a computer's output is only as good as the input you feed it.