
Natural Language Processing with Java - Second Edition

By : Richard M. Reese

Overview of this book

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with the help of Java libraries, while constantly gaining insights from the outcomes. You'll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you'll explore important Java tools and libraries for NLP, such as CoreNLP, OpenNLP, Neuroph, and Mallet. You'll then start performing NLP on different inputs and tasks, such as tokenization, model training, part-of-speech tagging, and parse trees. You'll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more. By the end of this book, you'll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.

Word embedding


Computers need to be taught to deal with context. Take, for example, the sentence "I like eating apples." The computer needs to understand that here, apple refers to a fruit and not a company. We want words with the same meaning to have the same representation, or at least a similar representation, so that machines can recognize that the words mean the same thing. The main objective of word embedding is to capture as much contextual, hierarchical, and morphological information about a word as possible.
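To make "similar representation" concrete, a common measure of closeness between two embedding vectors is cosine similarity. The following sketch uses tiny, hand-made three-dimensional vectors purely for illustration; real embeddings are hundreds of dimensions and come from a trained model:

```java
public class WordSimilarity {
    // Cosine similarity: dot product of the vectors divided by the
    // product of their magnitudes. Ranges from -1 to 1; closer to 1
    // means the words have more similar representations.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy, hand-crafted vectors (not from any real model):
        double[] apple = {0.9, 0.1, 0.2};  // fruit sense
        double[] pear  = {0.85, 0.15, 0.1};
        double[] stock = {0.1, 0.9, 0.8};  // finance-flavored vector

        System.out.printf("apple ~ pear:  %.2f%n", cosine(apple, pear));
        System.out.printf("apple ~ stock: %.2f%n", cosine(apple, stock));
    }
}
```

With a good embedding, apple (the fruit) lands closer to pear than to a finance-related word, which is exactly the property the text describes.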

Word embedding can be categorized in two ways:

  • Frequency-based embedding
  • Prediction-based embedding

As the names suggest, frequency-based embedding uses a counting mechanism, whereas prediction-based embedding uses a probability mechanism.

Frequency-based embedding can be done in different ways, using a count vector, a TF-IDF vector, or a co-occurrence vector/matrix. A count vector tries to learn from all the documents. It will learn an item of vocabulary and count the number...
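The count-vector idea described above can be sketched in plain Java: build a vocabulary over all documents, then represent each document as a vector of term counts indexed by that vocabulary. The class and method names here are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is deliberately simplistic:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CountVector {
    // Assign each distinct token an index, in order of first appearance
    // across all documents.
    public static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                vocab.putIfAbsent(token, vocab.size());
            }
        }
        return vocab;
    }

    // Count how often each vocabulary item occurs in a single document.
    public static int[] toCountVector(String document, Map<String, Integer> vocab) {
        int[] counts = new int[vocab.size()];
        for (String token : document.toLowerCase().split("\\s+")) {
            Integer index = vocab.get(token);
            if (index != null) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "I like eating apple",
                "apple released a new phone");
        Map<String, Integer> vocab = buildVocabulary(docs);
        System.out.println(vocab.keySet());
        System.out.println(Arrays.toString(toCountVector(docs.get(0), vocab)));
        System.out.println(Arrays.toString(toCountVector(docs.get(1), vocab)));
    }
}
```

A TF-IDF vector extends this by down-weighting counts of words that appear in many documents, so that very common words contribute less to the representation.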