A brief introduction to NLP
Before diving straight into what Flair is capable of and how to leverage its features, we will be going through a brief introduction to NLP to provide some context for readers who are not familiar with all the NLP techniques and tasks solved by Flair. NLP is a branch of artificial intelligence, linguistics, and software engineering that helps machines understand human language. When we humans read a sentence, our brains immediately make sense of many seemingly trivial problems such as the following:
- Is the sentence written in a language I understand?
- How can the sentence be split into words?
- What is the relationship between the words?
- What are the meanings of the individual words?
- Is this a question or an answer?
- Which part-of-speech categories are the words assigned to?
- What is the abstract meaning of the sentence?
The human brain is excellent at solving these problems conjointly and often seamlessly, leaving us unaware that we made sense of all of these things simply by reading a sentence.
Even now, machines are still not as good as humans at solving all these problems at once. Therefore, to teach machines to understand human language, we have to split understanding of natural language into a set of smaller, machine-intelligible tasks that allow us to get answers to these questions one by one.
In this section, you will find a list of some important NLP tasks with emphasis on the tasks supported by Flair.
Tokenization
Tokenization is the process of breaking down a sentence or a document into meaningful units called tokens. A token can be a paragraph, a sentence, a collocation, or just a word.
For example, a word tokenizer would split the sentence Learning to use Flair into a list of tokens as ["Learning", "to", "use", "Flair"]
.
Tokenization has to adhere to language-specific rules and is rarely a trivial task to solve. For example, with unspaced languages where word boundaries aren't defined with spaces, it's very difficult to determine where one word ends and the next one starts. Well-defined token boundaries are a prerequisite for most NLP tasks that aim to process words, collocations, or sentences including the following tasks explained in this chapter.
Text vectorization
Text vectorization is a process of transforming words, sentences, or documents in their written form into a numerical representation understandable to machines.
One of the simplest forms of text vectorization is one-hot encoding. It maps words to binary vectors of length equal to the number of words in the dictionary. All elements of the vector are 0
apart from the element that represents the word, which is set to 1
– hence the name one-hot.
For example, take the following dictionary:
- Cat
- Dog
- Goat
The word cat would be the first word in our dictionary and its one-hot encoding would be [1, 0, 0]
. The word dog would be the second word in our dictionary and its one-hot encoding would be [0, 1, 0]
. And the word goat would be the third and last word in our dictionary and its one-hot encoding would be [0, 0, 1]
.
This approach, however, suffers from the problem of high dimensionality as the length of this vector grows linearly with the number of words in the dictionary. It also doesn't capture any semantic meaning of the word. To counter this problem, most modern state-of-the-art approaches use representations called word or document embeddings. Each embedding is usually a fixed-length vector consisting of real numbers. While the numbers will at first seem unintelligible to a human, in some cases, some vector dimensions may represent some abstract property of the word – for example, a dimension of a word-embedding vector could represent the general (positive or negative) sentiment of the word. Given two or more embeddings, we will be able to compute the similarity or distance between them using a distance measure called cosine similarity. With many modern NLP solutions, including Flair, embeddings are used as the underlying input representation for higher-level NLP tasks such as named entity recognition.
One of the main problems with early word embedding approaches was that words with multiple meanings (polysemic words) were limited to a single and constant embedding representation. One of the solutions to this problem in Flair is the use of contextual string embeddings where words are contextualized by their surrounding text, meaning that they will have a different representation given a different surrounding text.
Named entity recognition
Named entity recognition (NER) is an NLP task or technique that identifies named entities in a text and tags them with their corresponding categories. Named entity categories include, but aren't limited to, places, person names, brands, time expressions, and monetary values.
The following figure illustrates NER using colored backgrounds and tags associated with the words:
In the previous example, we can see that three entities were identified and tagged. The first and third tags are particularly interesting because they both represent the same word, Berkeley, yet the first one clearly refers to an organization whereas the second one refers to a geographic location. The human brain is excellent at distinguishing between different entity types based on context and is able to do so almost seamlessly, whereas machines have struggled with it for decades. Recent advancements in contextual string embeddings, an essential part of Flair, made a huge leap forward in solving that.
Word-sense disambiguation
Word-Sense Disambiguation (WSD) is an NLP technique concerned with identifying the intended sense of a given word with multiple meanings.
For example, take the given sentence:
George tried to return to Berlin to return his hat.
WSD would aim to identify the sense of the first use of the word return, referring to the act of giving something back, and the sense of the second return, referring to the act of going back to the same place.
Part-of-speech tagging
Part-of-Speech (POS) tagging is a technique closely related to both WSD and NER that aims to tag the words as corresponding to a particular part of speech such as nouns, verbs, adjectives adverbs, and so on.
Actual POS taggers provide a lot more information with the tags than simply associating the words with noun/verb/adjective categories. For example, the Penn Treebank Project corpus, one of the most widely used NER corpora, distinguishes between 36 different types of POS tags.
Chunking
Another NLP technique closely related to POS tagging is chunking. Unlike parts of speech (POS), where we identify individual POS, in chunking we identify complete short phrases such as noun phrases. In Figure 1.2, the phrase A lovely day can be considered a chunk as it is a noun phrase, and in its relationship to other words works the same way as a noun.
Stemming and lemmatization
Stemming and lemmatization are two closely related text normalization techniques used in NLP to reduce the words to their common base forms. For example, the word play is the base word of the words playing, played and plays.
The simpler of the two techniques, stemming, simply accomplishes this by cutting off the ends or beginnings of words. This simple solution often works, but is not foolproof. For example, the word ladies can never be transformed into the word lady by stemming only. We therefore need a technique that understands the POS category of a word and takes into account its context. This technique is called lemmatization. The process of lemmatization can be demonstrated using the following example.
Take the following sentence:
this meeting was exhausting
Lemmatization reduces the previous sentence to the following:
this meeting be exhaust
It reduces the word was
to be
and the word exhausting
to exhaust
. Also note that the word meeting
is used as a noun and it is therefore mapped to the same word meeting
, whereas if the word meeting
was used as a verb, it would be reduced to meet
.
A popular and easy-to-use library for performing lemmatization with Python is spaCy. Its models are trained on large corpora and are able to distinguish between different POS, yielding impressive results.
Text classification
Text classification is an NLP technique used to assign a text or a document to one or more classes or document types. Practical uses for text classification include spam filtering, language identification, sentiment analysis, and programming language identification from syntax.
Having covered the basic NLP concepts and terminology, we can now move on to understanding what Flair is and how it manages to solve NLP tasks with state-of-the-art results.