Natural Language Processing with Java

Natural Language Processing with Java - Second Edition

By: Richard M. Reese

Overview of this book

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with the help of Java libraries, while constantly gaining insights from the outcomes. You’ll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you’ll explore important tools and libraries in Java for NLP, such as CoreNLP, OpenNLP, Neuroph, and Mallet. You’ll then start performing NLP tasks on different inputs, such as tokenization, model training, part-of-speech tagging, and building parse trees. You’ll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more. By the end of this book, you’ll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.

Using Apache Tika for content analysis and extraction

Apache Tika can detect and extract metadata and text from more than a thousand different file types, such as .doc, .docx, .ppt, .pdf, and .xls. This breadth makes it useful for search engines, indexing, content analysis, translation, and similar tasks. It can be downloaded from https://tika.apache.org/download.html. This section explores how Tika can be used to extract text from various formats. We will use only Testdocument.docx and TestDocument.pdf.

Using Tika is very straightforward, as shown in the following code:

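import java.io.File;
import org.apache.tika.Tika;

// detect can throw IOException; parseToString can throw IOException or TikaException
File file = new File("TestDocument.pdf");
Tika tika = new Tika();
String filetype = tika.detect(file);           // MIME type of the file

System.out.println(filetype);
System.out.println(tika.parseToString(file));  // extracted plain text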

Simply create an instance of Tika, call detect to get the file's MIME type, and call parseToString to get its text. This produces the following output:

application/pdf
Jump to navigation Jump to search  

Welcome to Wikipedia...
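
One of the signals Tika uses for type detection is the "magic" byte sequence at the start of a file. The following is a minimal sketch of that idea using only the Java standard library. It is not Tika's implementation: the MagicSniffer class and its sniff method are hypothetical names introduced here for illustration, and it recognizes only two signatures. A .docx file is a ZIP container, so this sketch can report it only as application/zip, whereas Tika inspects the container further to identify the specific type.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Tika): guesses a MIME type from leading bytes. */
final class MagicSniffer {
    // PDF files begin with the ASCII text "%PDF-"
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP containers (including .docx) begin with "PK" followed by bytes 3 and 4
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /** Returns a MIME type for the given leading bytes, or a generic fallback. */
    static String sniff(byte[] header) {
        if (startsWith(header, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // True when data begins with every byte of prefix, in order
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHeader = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(pdfHeader)); // prints application/pdf
    }
}
```

In practice, calling tika.detect remains the better choice, because it combines several signals rather than relying on a short hand-written signature list.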
