Python Natural Language Processing

Overview of this book

This book starts by laying the foundation for natural language processing (NLP) and explaining why Python is one of the best options for building an NLP-based expert system, with advantages such as community support and the availability of frameworks. It then gives you a better understanding of the freely available forms of corpora and the different types of datasets. After this, you will know how to choose a dataset for NLP applications and how to find the right NLP techniques to process the sentences in a dataset and understand their structure. You will also learn how to tokenize different parts of sentences and ways to analyze them. Over the course of the book, you will explore both the semantic and the syntactic analysis of text. You will understand how to resolve various ambiguities in processing human language and will come across various scenarios while performing text analysis. You will learn the basics of getting the environment ready for NLP, move on to the initial setup, and then quickly come to understand sentences and parts of language. You will learn how to use machine learning and deep learning to extract information from text data. By the end of the book, you will have a clear understanding of NLP and will have worked through multiple examples that implement NLP in the real world.

Preface

What this book covers

What you need for this book

Who this book is for

Reader feedback

Customer support

Introduction

Understanding natural language processing

Understanding basic applications

Advantages of togetherness - NLP and Python

Environment setup for NLTK

Tips for readers

Practical Understanding of a Corpus and Dataset

What is a corpus?

Why do we need a corpus?

Understanding corpus analysis

Understanding types of data attributes

Exploring different file formats for corpora

Resources for accessing free corpora

Preparing a dataset for NLP applications

Understanding the Structure of Sentences

Understanding components of NLP

Natural language understanding

Defining context-free grammar

Morphological analysis

Syntactic analysis

Semantic analysis

Handling ambiguity

Discourse integration

Pragmatic analysis

Preprocessing

Handling corpus-raw text

Handling corpus-raw sentences

Basic preprocessing

Practical and customized preprocessing

Feature Engineering and NLP Algorithms

Understanding feature engineering

Basic feature of NLP

Basic statistical features for NLP

Advantages of features engineering

Challenges of features engineering

Advanced Feature Engineering and NLP Algorithms

Recall word embedding

Understanding the basics of word2vec

Converting the word2vec model from black box to white box

Understanding the components of the word2vec model

Understanding the logic of the word2vec model

Understanding algorithmic techniques and the mathematics behind the word2vec model

Algorithms used by neural networks

Some of the facts related to word2vec

Applications of word2vec

Implementation of simple examples

Advantages of word2vec

Challenges of word2vec

How is word2vec used in real-life applications?

When should you use word2vec?

Developing something interesting

Extension of the word2vec concept

Importance of vectorization in deep learning

Rule-Based System for NLP

Understanding of the rule-based system

Purpose of having the rule-based system

Architecture of the RB system

Understanding the RB system development life cycle

Developing NLP applications using the RB system

Comparing the rule-based approach with other approaches

Advantages of the rule-based system

Disadvantages of the rule-based system

Challenges for the rule-based system

Understanding word-sense disambiguation basics

Discussing recent trends for the rule-based system

Machine Learning for NLP Problems

Understanding the basics of machine learning

Development steps for NLP applications

Understanding ML algorithms and other concepts

Hybrid approaches for NLP applications

Deep Learning for NLU and NLG Problems

An overview of artificial intelligence

Comparing NLU and NLG

A brief overview of deep learning

Basics of neural networks

Implementation of ANN

Deep learning and deep neural networks

Deep learning techniques and NLU

Deep learning techniques and NLG

Gradient descent-based optimization

Artificial intelligence versus human intelligence

Advanced Tools

Apache Hadoop as a storage framework

Apache Spark as a processing framework

Apache Flink as a real-time processing framework

Visualization libraries in Python

How to Improve Your NLP Skills

Beginning a new career journey with NLP

Choose your area

Agile way of working to achieve success

Useful blogs for NLP and data science

Grab public datasets

Mathematics needed for data science

Installation Guide

Installing Python, pip, and NLTK

Installing the PyCharm IDE

Installing dependencies

Framework installation guides

Drop your queries

Exploring different file formats for corpora

Corpora come in many different formats. In practice, we can use the file formats listed here. All of them are generally used to store features, which we will later feed into our machine learning algorithms. Practical details on working with these file formats are covered from Chapter 4, Preprocessing, onward. The file formats are as follows:

.txt: Raw datasets are usually given to us in this format. The Gutenberg corpus is one example. Some real-life applications need parallel corpora. Suppose you want to build grammar correction software similar to Grammarly; you will then need a parallel corpus.
.csv: We are generally given this file format when participating in hackathons or in Kaggle competitions. We use this file format to save our features, which we will...
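
As a rough illustration of the two formats described so far, the following minimal sketch reads a raw .txt corpus and saves a few features to a .csv file using only the Python standard library. The file names, the toy sentences, and the feature columns (n_tokens, n_chars) are assumptions made up for this example; they do not come from the book, and a real project would use a proper tokenizer and more meaningful features.

```python
import csv
import tempfile
from pathlib import Path


def read_txt_corpus(path):
    """Return the non-empty lines of a raw .txt corpus, one sentence per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def save_features_csv(path, sentences):
    """Write one row of simple, illustrative features per sentence."""
    fieldnames = ["sentence", "n_tokens", "n_chars"]  # hypothetical feature names
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for sentence in sentences:
            writer.writerow({
                "sentence": sentence,
                "n_tokens": len(sentence.split()),  # naive whitespace tokenization
                "n_chars": len(sentence),
            })


def load_features_csv(path):
    """Read the feature rows back as dictionaries (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Build a tiny toy corpus in a temporary directory (assumed data, not a real corpus).
workdir = Path(tempfile.mkdtemp())
txt_path = workdir / "toy_corpus.txt"
csv_path = workdir / "toy_features.csv"
txt_path.write_text(
    "Corpora come in many formats.\n\nText files store raw data.\n",
    encoding="utf-8",
)

sentences = read_txt_corpus(txt_path)
save_features_csv(csv_path, sentences)
rows = load_features_csv(csv_path)

for row in rows:
    print(row["sentence"], row["n_tokens"], row["n_chars"])
```

Note that the csv module returns every value as a string, so numeric features need to be converted back (for example, with int) before they are passed to a machine learning algorithm.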