Natural Language Processing with Java

Natural Language Processing with Java - Second Edition

By: Richard M. Reese

Overview of this book

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with the help of Java libraries, while constantly gaining insights from the outcomes. You’ll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you’ll explore important tools and libraries in Java for NLP, such as CoreNLP, OpenNLP, Neuroph, and MALLET. You’ll then start performing NLP tasks on different inputs, such as tokenization, model training, part-of-speech tagging, and building parse trees. You’ll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more. By the end of this book, you’ll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.

Title Page

Dedication

Packt Upsell

Contributors

Preface

Introduction to NLP

Deep learning for Java

Overview of text-processing tasks

Understanding NLP models

Preparing data

Summary

Finding Parts of Text

Understanding the parts of text

What is tokenization?

Simple Java tokenizers

NLP tokenizer APIs

Understanding normalization

Summary

Finding Sentences

The SBD process

What makes SBD difficult?

Understanding the SBD rules of LingPipe's HeuristicSentenceModel class

Simple Java SBDs

Using NLP APIs

Training a sentence-detector model

Summary

Finding People and Things

Why is NER difficult?

Techniques for name recognition

Using regular expressions for NER

Using NLP APIs

Building a new dataset with the NER annotation tool

Training a model

Summary

Detecting Part of Speech

The tagging process

Using the NLP APIs

Summary

Representing Text with Features

N-grams

Word embedding

GloVe

Word2vec

Dimensionality reduction

Principal component analysis

Distributed stochastic neighbor embedding

Summary

Information Retrieval

Boolean retrieval

Dictionaries and tolerant retrieval

Vector space model

Scoring and term weighting

Inverse document frequency

TF-IDF weighting

Evaluation of information retrieval systems

Summary

Classifying Texts and Documents

How classification is used

Understanding sentiment analysis

Text-classifying techniques

Using APIs to classify text

Summary

Topic Modeling

What is topic modeling?

The basics of LDA

Topic modeling with MALLET

Summary

Using Parsers to Extract Relationships

Relationship types

Understanding parse trees

Using extracted relationships

Extracting relationships

Using NLP APIs

Extracting relationships for a question-answer system

Summary

Combined Pipeline

Preparing data

Using boilerpipe to extract text from HTML

Using POI to extract text from Word documents

Using PDFBox to extract text from PDF documents

Using Apache Tika for content analysis and extraction

Pipelines

Using the Stanford pipeline

Using multiple cores with the Stanford pipeline

Creating a pipeline to search text

Summary

Creating a Chatbot

Chatbot architecture

Artificial Linguistic Internet Computer Entity

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

Scoring and term weighting

Term weighting evaluates the importance of a term with respect to a document. A simple way to think of this is that, stop words aside, a term that appears more often in a document is more important to it. A score from 0 to 1 can be assigned to each document; the score measures how well the term or query matches the document. A score of 0 means that the term does not occur in the document, and as the frequency of the term in the document increases, the score moves from 0 toward 1. For example, if the scores for a given term X in three documents d1, d2, and d3 are 0.2, 0.3, and 0.5, respectively, then the match in d3 is the most important for the overall score, followed by d2, with d1 the least important. The same applies to zones as well. Assigning such a score or weight to a term requires either learning from a training set or continuously running the system and updating the term scores.
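The idea of a 0-to-1 score that grows with term frequency can be illustrated with a minimal sketch. This is not the book's implementation: the class name `TermScorer`, the tiny stop-word list, and the choice of relative frequency (occurrences of the term divided by the number of non-stop-word tokens) as the scoring function are all assumptions made for illustration. A learned weighting, as the text describes, would replace this fixed formula.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: scores a term against a document on a 0-to-1 scale.
public class TermScorer {

    // Tiny illustrative stop-word list (an assumption, not taken from the book).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "in", "and", "to"));

    // Lowercase the text, split on runs of non-letters, and drop stop words.
    public static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^a-z]+"))
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    // Relative frequency of the term among the document's non-stop-word tokens.
    // Returns 0.0 when the term is absent or the document has no usable tokens.
    public static double score(String term, String document) {
        List<String> tokens = tokenize(document);
        if (tokens.isEmpty()) {
            return 0.0;
        }
        String needle = term.toLowerCase(Locale.ROOT);
        long hits = tokens.stream().filter(needle::equals).count();
        return (double) hits / tokens.size();
    }

    public static void main(String[] args) {
        String[] docs = {
            "Dogs bark",
            "The cat sat in the garden",
            "A cat and a cat"
        };
        // Print the score of the term "cat" for each document.
        for (String doc : docs) {
            System.out.println(score("cat", doc) + "  <-  " + doc);
        }
    }
}
```

Under this scheme a document that never mentions the term scores 0, and the score approaches 1 as the term makes up a larger share of the document's content words. Relative frequency is only one option; it favors short documents, which is one reason practical systems combine term frequency with other weights.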

The real-time query will be in the form of free text...
