Text extraction is the primary phase of any NLP task you want to undertake. Given a blog post, for example, we want to extract its content: the title, the author, the publication date, the body text, media such as images and videos, and links to other posts, if any. Text extraction includes the following:
- Structuring the document to identify its fields, blocks of content, and so on
- Determining the language of the document
- Finding the sentences, paragraphs, phrases, and quotes
- Breaking the text into tokens so that it can be processed further
- Normalization and tagging
- Lemmatization and stemming to reduce variations and bring words close to their root forms
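As a rough illustration of a few of the steps listed above (sentence finding, tokenization, normalization, and stemming), here is a minimal sketch that uses only the Python standard library. All function names are hypothetical and chosen for this example. The sentence splitter and the suffix stripper are deliberately naive stand-ins rather than the algorithms a real NLP library would use, and language detection, tagging, and lemmatization are not covered.

```python
import re
import unicodedata


def split_sentences(text):
    """Naive sentence finder: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Break a sentence into word tokens, keeping inner apostrophes (e.g. "don't")."""
    return re.findall(r"\w+(?:'\w+)*", sentence)


def normalize(token):
    """Normalize a token: strip accents and convert to lowercase."""
    decomposed = unicodedata.normalize("NFKD", token)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.lower()


def crude_stem(token):
    """Hypothetical suffix stripper (an assumption for illustration, not the
    Porter algorithm): remove one common English suffix if at least three
    characters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def extract(text):
    """Run the steps in order and return one list of stems per sentence."""
    return [
        [crude_stem(normalize(tok)) for tok in tokenize(sentence)]
        for sentence in split_sentences(text)
    ]


if __name__ == "__main__":
    sample = "The parser extracted titles. It skipped Caf\u00e9 links!"
    for stems in extract(sample):
        print(stems)
```

The stems produced this way are not always dictionary words, which is typical of stemming; a lemmatizer would instead map each token to a valid root form, usually with the help of a vocabulary.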
Text extraction also helps in topic modeling, which we have covered in Chapter 9, Topic Modeling. Here, we will quickly cover how text extraction can be performed for HTML, Word, and PDF documents. Although several APIs support these tasks, we will use the following:
- Boilerpipe...