Machine Learning for Emotion Analysis in Python

By Allan Ramsay and Tariq Ahmad

Overview of this book

Artificial intelligence and machine learning are the technologies of the future, and this is the perfect time to tap into their potential and add value to your business. Machine Learning for Emotion Analysis in Python helps you employ these cutting-edge technologies in your customer feedback system and in turn grow your business exponentially. With this book, you’ll take your foundational data science skills and grow them in the exciting realm of emotion analysis. By following a practical approach, you’ll turn customer feedback into meaningful insights assisting you in making smart and data-driven business decisions. The book will help you understand how to preprocess data, build a serviceable dataset, and ensure top-notch data quality. Once you’re set up for success, you’ll explore complex ML techniques, uncovering the concepts of deep neural networks, support vector machines, conditional probabilities, and more. Finally, you’ll acquire practical knowledge using in-depth use cases showing how the experimental results can be transformed into real-life examples and how emotion mining can help track short- and long-term changes in public opinion. By the end of this book, you’ll be well-equipped to use emotion mining and analysis to drive business decisions.
Table of Contents (18 chapters)
Part 1: Essentials
Part 2: Building and Using a Dataset
Part 3: Approaches
Part 4: Case Study

Introduction to NLP

Sentiment mining is about finding the sentiments that are expressed by natural language texts – often quite short texts such as tweets and online reviews, but also larger items such as newspaper articles. There are many other ways of getting computers to do useful things with natural language texts and spoken language: you can write programs that can have conversations (with people or with each other), you can write programs to extract facts and events from articles and stories, you can write programs to translate from one language to another, and so on. These applications all share some basic notions and techniques, but they each lay more emphasis on some topics and less on others. In Chapter 4, Preprocessing – Stemming, Tagging, and Parsing, we will look at the things that matter most for sentiment mining, but we will give a brief overview of the main principles of NLP here. As noted, not all of the stages outlined here are needed for every application, but it is nonetheless useful to have a picture of how everything fits together when considering specific subtasks later.

We will start with a couple of basic observations:

  • Natural language is linear. The fundamental form of language is speech, which is necessarily linear. You make one sound, and then you make another, and then you make another. There may be some variation in the way you make each sound – louder or softer, with a higher pitch or a lower one, quicker or slower – and this may be used to overlay extra information on the basic message, but fundamentally, spoken language is made up of a sequence of identifiable units, namely sounds; and since written language is just a way of representing spoken language, it too must be made up of a sequence of identifiable units.
  • Natural language is hierarchical. Smaller units are grouped into larger units, which are grouped into larger units, which are grouped into larger units, and so on. Consider the sentence smaller units are grouped into larger units. In the written form of English, for instance, the smallest units are characters; these are grouped into morphemes (meaning-bearing word-parts), as small er unit s are group ed into large er unit s, which are grouped into words (small-er unit-s are group-ed into large-er unit-s), which are grouped into base-level phrases ([small-er unit-s] [are group-ed] [into] [large-er unit-s]), which are grouped into higher-level phrases ([[small-er unit-s] [[are group-ed] [[into] [large-er unit-s]]]]). The sketch after this list shows the final grouping written out as an explicit tree.
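
The following is a minimal sketch of that hierarchical grouping as an explicit tree, using NLTK's Tree class (NLTK is assumed to be installed; the bracketing and the informal labels S, NP, VP, and PP are our own illustrative choices, not an analysis taken from any particular grammar):

```python
# A small sketch: the hierarchical grouping of "smaller units are grouped
# into larger units", written out as an explicit NLTK tree. The segmentation
# and the informal category labels are illustrative assumptions.
from nltk import Tree

sentence_tree = Tree.fromstring("""
(S
  (NP (Adj small-er) (N unit-s))
  (VP (V are group-ed)
      (PP (P into) (NP (Adj large-er) (N unit-s)))))
""")
sentence_tree.pretty_print()   # draws the hierarchy as ASCII art
```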

These two properties hold for all natural languages. All natural languages were spoken before they were written (some widely spoken languages have no universally accepted written form!), and hence are fundamentally linear. But they all express complex hierarchical relations, and hence to understand them, you have to be able to find the ways that smaller units are grouped into larger ones.

What the bottom-level units are like, and how they are grouped, differs from language to language. The sounds of a language are made by moving your articulators (tongue, teeth, lips, vocal cords, and various other things) around while trying to expel air from your lungs. The sound that you get by closing and then opening your lips with your vocal cords tensed (/b/, as in the English word bat) is different from the sound you get by doing the same things with your lips while your vocal cords are relaxed (/p/, as in pat). Different languages use different combinations – Arabic doesn’t use /p/ and English doesn’t use the sound you get by closing the exit from the chamber containing the vocal cords (a glottal stop): the combinations that are used in a particular language are called its phonemes. Speakers of a language that doesn’t use a particular combination find it hard to distinguish words that use it from ones that use a very similar combination, and very hard to produce that combination when they learn a language that does.

To make matters worse, the relationship between the bottom-level units in spoken language and written language can vary from language to language. The phonemes of a language can be represented in the written form of that language in a wide variety of ways. The written form may make use of graphemes, which are combinations of ways of making a shape out of strokes and marks (so, the different ways of writing A, whatever the typeface or handwriting, all involve producing two near-vertical, more-or-less straight lines joined at the top with a cross-piece about half-way up), just as phonemes are combinations of ways of making a sound; a single phoneme may be represented by one grapheme (the short vowel /a/ from pat is represented in English by the character a) or by a combination of graphemes (the sound /sh/ from should is represented by the pair of graphemes s and h); a sound may have no representation in the written form (Arabic text omits short vowels and some other distinctions between phonemes); or there may simply be no connection between the written form and the way it is pronounced (written Chinese, Japanese kanji symbols). Given that we are going to be largely looking at text, we can at least partly ignore the wide variety of ways that written and spoken language are related, but we will still have to be aware that different languages combine the basic elements of the written forms in completely different ways to make up words.

The bottom-level units of a language, then, are either identifiable sounds or identifiable marks. These are combined into groups that carry meaning – morphemes. A morpheme can carry quite a lot of meaning; for example, cat (made out of the graphemes c, a, and t) denotes a small mammal with pointy ears and an inscrutable outlook on life, whereas s just says that you’ve got more than one item of the kind you are thinking about, so cats denotes a group of several small mammals with pointy ears and an opaque view of the world. Morphemes of the first kind are sometimes called lexemes, with a single lexeme combining with one or more other morphemes to express a concept (so, the French lexeme noir (black) might combine with e (feminine) and s (plural) to make noires – several black female things). Morphemes that add information to a lexeme, such as about how many things were involved or when an event happened, are called inflectional morphemes, whereas ones that radically change its meaning (for example, an incomplete solution to a problem is not complete) are called derivational morphemes, since they derive a new concept from the original. Again, most languages make use of inflectional and derivational morphemes to enrich the basic set of lexemes, but exactly how this works varies from language to language. We will revisit this at some length in Chapter 5, Sentiment Lexicons and Vector Space Models, since finding the core lexemes can be significant when we are trying to assign emotions to texts.
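
To make this concrete, here is a minimal sketch of stripping inflectional morphemes so that the underlying lexeme can be recovered. It assumes NLTK is installed (the WordNet data may need a one-off download); the exact outputs depend on the NLTK version and data you have:

```python
# A minimal sketch of recovering lexemes by stripping inflectional morphemes,
# using NLTK (assumed installed; run nltk.download('wordnet') once if the
# lemmatizer's data is missing).
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

print(lemmatizer.lemmatize("cats", pos="n"))   # 'cat'   (plural -s removed)
print(stemmer.stem("grouped"))                 # 'group' (past-tense -ed removed)
# Derivational morphemes are a different matter: a stemmer will not undo the
# change of meaning introduced by a prefix such as in- in "incomplete".
print(stemmer.stem("incomplete"))
```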

A lexeme plus a suitable set of morphemes is often referred to as a word. Words are typically grouped into larger tree-like structures, with the way that they are grouped carrying a substantial part of the message conveyed by the text. In the sentence John believes that Mary expects Peter to marry Susan, for instance, Peter to marry Susan is a group that describes a particular kind of event, Mary expects [Peter to marry Susan] is a group that describes Mary’s attitude to this event, and John believes [that Mary expects [Peter to marry Susan]] is a group that describes John’s view of Mary’s expectation.

Yet again, different languages carry out this kind of grouping in different ways, and there are numerous ways of approaching the task of analyzing the grouping in particular cases. This is not the place for a review of all the grammatical theories that have ever been proposed to analyze the ways that words get grouped together or of all the algorithms that have ever been proposed for applying those theories to specific cases (parsers), but there are a few general observations that are worth making.

Phrase structure grammar versus dependency grammar

In some languages, groups are mainly formed by merging adjacent groups. The previous sentence, for instance, can be analyzed if we group it as follows:

In some languages groups are mainly formed by merging adjacent groups

In [some languages]np groups are mainly formed by merging [adjacent groups]np

[In [some languages]]pp groups are mainly formed by [merging [adjacent groups]]vp

[In [some languages]]pp groups are mainly formed [by [merging [adjacent groups]]]pp

[In [some languages]]pp groups are mainly [formed [by [merging [adjacent groups]]]]vp

[In [some languages]]pp groups are [mainly [formed [by [merging [adjacent groups]]]]]vp

[In [some languages]]pp groups [are [mainly [formed [by [merging [adjacent groups]]]]]]vp

[In [some languages]]pp [groups [are [mainly [formed [by [merging [adjacent groups]]]]]]]s

[[In [some languages]] [groups [are [mainly [formed [by [merging [adjacent groups]]]]]]]]s
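
This kind of phrase structure analysis can be reproduced mechanically. The following sketch uses NLTK's chart parser with a tiny context-free grammar that we have written just to cover this one sentence; the rules and category names are our own simplification, not the grammar behind the bracketing above:

```python
# A minimal sketch of phrase structure parsing with NLTK's chart parser.
# The grammar below is a toy CFG written only for this one sentence; the
# rules and category names are illustrative assumptions.
import nltk

grammar = nltk.CFG.fromstring("""
S  -> PP S | NP VP
PP -> P NP | P VP
NP -> Det N | Adj N | N
VP -> Aux VP | Adv VP | V PP | V NP
Det -> 'some'
Adj -> 'adjacent'
N   -> 'languages' | 'groups'
P   -> 'in' | 'by'
Aux -> 'are'
Adv -> 'mainly'
V   -> 'formed' | 'merging'
""")

parser = nltk.ChartParser(grammar)
tokens = "in some languages groups are mainly formed by merging adjacent groups".split()
for tree in parser.parse(tokens):
    tree.pretty_print()
```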

This tends to work well for languages where word order is largely fixed – no languages have completely fixed word order (for example, the preceding sentence could be rewritten as Groups are mainly formed by merging adjacent groups in some languages with very little change in meaning), but some languages allow more freedom than others. For languages such as English, analyzing the relationships between words in terms of adjacent phrases, that is, by using a phrase structure grammar, works quite well.

For languages where words and phrases are allowed to move around fairly freely, it can be more convenient to record pairwise relationships between words. The following tree describes the same sentence using a dependency grammar – that is, by assigning a parent word to every word (apart from the full stop, which we are taking to be the root of the tree):

Figure 1.3 – Analysis of “In some languages, groups are mainly formed by merging adjacent groups” using a rule-based dependency parser


There are many variations of phrase structure grammar and many variations of dependency grammar. Roughly speaking, dependency grammar provides an easier handle on languages where words can move around very freely, while phrase structure grammar makes it easier to deal with invisible items such as the subject of merging in the preceding example. The difference between the two is, in any case, less clear than it might seem from the preceding figure: a dependency tree can easily be transformed into a phrase structure tree by treating each subtree as a phrase, and a phrase structure tree can be transformed into a dependency tree if you can specify which item in a phrase is its head – for example, in the preceding phrase structure tree, the head of a group labeled as nn is its noun and the head of a group labeled as np is the head of nn.
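
If you want to see a dependency analysis of this kind for yourself, the following sketch uses spaCy (assuming the en_core_web_sm model has been installed, for example with python -m spacy download en_core_web_sm). Because it is a data-driven parser trained on its own annotation scheme, its labels will not match Figure 1.3 exactly:

```python
# A minimal sketch of a dependency analysis: every word is assigned a parent
# (its head) and a grammatical relation. spaCy and the en_core_web_sm model
# are assumed to be installed; the labels will differ from those in Figure 1.3.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("In some languages, groups are mainly formed by merging adjacent groups.")
for token in doc:
    print(f"{token.text:10s} <-{token.dep_:>8s}- {token.head.text}")
```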

Rule-based parsers versus data-driven parsers

As well as having a theory of how to describe the structure of a piece of text, you need a program that applies that theory to specific texts – a parser. There are two ways to approach the development of a parser:

  • Rule-based: You can try to devise a set of rules that describe the way that a particular language works (a grammar), and then implement a program that tries to apply these rules to the texts you want analyzed. Devising such rules is difficult and time-consuming, and programs that try to apply them tend to be slow and fail if the target text does not obey the rules.
  • Data-driven: You can somehow produce a set of analyses of a large number of texts (a treebank), and then implement a program that extracts patterns from these analyses. Producing a treebank is difficult and time-consuming – you need hundreds of thousands of examples, and the trees all have to be consistently annotated. If this is to be done by people, they have to be given consistent guidelines that cover every example they will see (which is, in effect, a grammar); and if it is not done by people, then you must already have an automated way of doing it, that is, a parser! A small example of what a treebank looks like is shown just after this list.
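
To give a feel for what a treebank contains, the following sketch loads the small sample of the Penn Treebank that ships with NLTK; every sentence in it has a manually assigned phrase structure tree:

```python
# A minimal sketch of inspecting a treebank: NLTK ships a small sample of the
# Penn Treebank in which every sentence carries a hand-assigned phrase
# structure tree. (Run nltk.download('treebank') once if the data is missing.)
from nltk.corpus import treebank

print(len(treebank.parsed_sents()))   # number of annotated sentences in the sample
tree = treebank.parsed_sents()[0]     # the first annotated sentence
tree.pretty_print()
```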

Both approaches have advantages and disadvantages. When deciding whether to use a dependency grammar or a phrase structure grammar, and whether to follow a rule-based approach or a data-driven one, there are several criteria to consider. Since no existing system optimizes all of these, you should think about which ones matter most for your application and then decide which way to go:

  • Speed: The first criterion to consider is the speed at which the parser runs. Some parsers can become very slow when faced with long sentences. The worst-case complexity of the standard chart-parsing algorithm for rule-based approaches is O(N³), where N is the length of the sentence, which means that for long sentences, the algorithm can take a very long time. Some other algorithms have much better complexity than this (the MALT (Nivre et al., 2006) and MST (McDonald et al., 2005) parsers, for instance, are linear in the length of the sentence), while others have much worse. If two parsers are equally good according to all the other criteria, then the faster one will be preferable, but there will be situations where one (or more) of the other criteria is more important.
  • Robustness: Some parsers, particularly rule-based ones, can fail to produce any analysis at all for some sentences. This will happen if the input is ungrammatical, but it will also happen if the rules are not a complete description of the language. A parser that fails to produce an analysis for a perfectly grammatical input sentence is less useful than one that can analyze every grammatically correct sentence of the target language. It is less clear that parsers that will do something with every input sentence are necessarily more useful than ones that will reject some sentences as being ungrammatical. In some applications, detecting ungrammaticality is a crucial part of the task (for example, in language learning programs), but in any case, assigning an analysis to an ungrammatical sentence cannot be either right or wrong, and hence any program that makes use of such an analysis cannot be sure that it is doing the right thing.
  • Accuracy: A parser that assigns the right analysis to every input text will generally be more useful than one that does not. This does, of course, beg the question of how to decide what the right analysis is. For data-driven parsers, it is impossible to say what the right analysis of a sentence that does not appear in the treebank is. For rule-based parsers, any analysis that is returned will be right in the sense that it obeys the rules. So, if an analysis looks odd, you have to work out how the rules led to it and revise them accordingly.

There is a trade-off between accuracy and robustness. A parser that fails to return any analysis at all in complex cases will produce fewer wrong analyses than one that tries to find some way of interpreting every input text: the one that simply rejects some sentences will have lower recall but may have higher precision, and that can be a good thing. It may be better to have a system that says Sorry, I didn’t quite understand what you just said than one that goes ahead with whatever it is supposed to be doing based on an incorrect interpretation.
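
As a rough illustration of that trade-off, the sketch below computes precision and recall for two hypothetical parsers; the counts are invented purely for illustration, and the point is only how the two scores move in opposite directions:

```python
# A toy illustration of the robustness/accuracy trade-off. The counts are
# invented for illustration only. "returned" is the number of sentences the
# parser produced an analysis for; "correct" is how many of those were right.
def precision_recall(correct, returned, total):
    precision = correct / returned if returned else 0.0
    recall = correct / total
    return precision, recall

# A cautious parser: analyzes 80 of 100 sentences, getting 76 of them right.
print(precision_recall(correct=76, returned=80, total=100))    # (0.95, 0.76)
# A permissive parser: analyzes all 100 sentences, getting 82 of them right.
print(precision_recall(correct=82, returned=100, total=100))   # (0.82, 0.82)
```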

  • Sensitivity and consistency: Sometimes, sentences that look superficially similar have different underlying structures. Consider the following examples:
    1. a) I want to see the queen b) I went to see the queen

1(a) is the answer to What do you want? and 1(b) is the answer to Why did you go? If the structures that are assigned to these two sentences do not reflect the different roles for to see the queen, then it will be impossible to make this distinction:

Figure 1.4 – Trees for 1(a) and 1(b) from the Stanford dependency parser (Dozat et al., 2017)


  2. a) One of my best friends is watching old movies b) One of my favorite pastimes is watching old movies
Figure 1.5 – Trees for 2(a) and 2(b) from the Stanford dependency parser


The Stanford dependency parser (SDP) trees both say that the subject (One of my best friends, One of my favorite pastimes) is carrying out the action of watching old movies – it is sitting in its most comfortable armchair with the curtains drawn and the TV on. The first of these makes sense, but the second doesn’t: pastimes don’t watch old movies. What we need is an equational analysis that says that One of my favorite pastimes and watching old movies are the same thing, as in Figure 1.6:

Figure 1.6 – Equational analysis of “One of my favorite pastimes is watching old movies”


Spotting that 2(b) requires an analysis like this, where my favorite pastime is the predication in an equational use of be rather than the agent of a watching-old-movies event, requires more detail about the words in question than is usually embodied in a treebank.

It can also happen that sentences that look superficially different have very similar underlying structures:

  3. a) Few great tenors are poor b) Most great tenors are rich

This time, the SDP assigns quite different structures to the two sentences:

Figure 1.7 – Trees for 3(a) and 3(b) from the SDP


The analysis of 3(b) assigns most as a modifier of great, whereas the analysis of 3(a) assigns few as a modifier of tenors. Most can indeed be used for modifying adjectives, as in He is the most annoying person I know, but in 3(b), it is acting as something more like a determiner, just as few is in 3(a).

  4. a) There are great tenors who are rich b) Are there great tenors who are rich?

It is clear that 4(a) and 4(b) should have almost identical analyses – 4(b) is just 4(a) turned into a question. Again, this can cause problems for treebank-based parsers:

Figure 1.8 – Trees for 4(a) and 4(b) from MALTParser


The analysis in Figure 1.8 for 4(a) makes are the head of the tree, with there, great tenors who are rich, and the full stop as daughters, whereas 4(b) is given tenors as its head and are, there, great, who are rich, and ? as daughters. It would be difficult, given these analyses, to see that 4(a) is the answer to 4(b)!

Treebank-based parsers frequently fail to cope with issues of the kind raised by the examples given here. The problem is that the treebanks on which they are trained tend not to include detailed information about the words that appear in them – that went is an intransitive verb and want requires a sentential complement, that friends are human and can therefore watch old movies while pastimes are events, and can therefore be equated with the activity of watching something, or that most can be used in a wide variety of ways.

It is not possible to say that all treebank-based parsers suffer from these problems, but several very widely used ones (the SDP, the version of MALT distributed with the NLTK, the EasyCCG parser (Lewis & Steedman, 2014), spaCy (Kitaev & Klein, 2018)) do. Some of these issues are fairly widespread (the failure to distinguish 1(a) and 1(b)), and some arise because of specific properties of either the treebank or the parsing algorithm. Most of the pre-trained models for parsers such as MALT and spaCy are trained on the well-known Wall Street Journal corpus, and since this treebank does not distinguish between sentences such as 1(a) and 1(b), it is impossible for parsers trained on it to do so. All the parsers listed previously assign different structures to 3(a) and 3(b), which may be a characteristic of the treebank or of the training algorithms. It is worth evaluating the output of any such parser to check that it does give distinct analyses for obvious cases such as 1(a) and 1(b) and does give parallel analyses for obvious cases such as 4(a) and 4(b).

So, when choosing a parser, you have to weigh up a range of factors. Do you care if it sometimes makes mistakes? Do you want it to assign different trees to texts whose underlying representations are different (this isn’t quite the same as accuracy because it could happen that what the parser produces isn’t wrong, it just doesn’t contain all the information you need, as in 1(a) and 1(b))? Do you want it to always produce a tree, even for texts that don’t conform to any of the rules of normal language (should it produce a parse for #anxious don’t know why ................. #worry 😶 slowly going #mad hahahahahahahahaha)? Does it matter if it takes 10 or 20 seconds to parse some sentences? Whatever you do, do not trust what anyone says about a parser: try it for yourself, on the data that you are intending to use it on, and check that its output matches your needs.
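
A concrete way to do that checking is to run a candidate parser over a handful of diagnostic sentences like the ones above and inspect the analyses by eye. The sketch below does this with spaCy (the en_core_web_sm model is assumed purely as an example; substitute whichever parser you are actually evaluating):

```python
# A minimal sketch of the "try it for yourself" advice: feed a parser pairs of
# sentences that should (or should not) receive parallel analyses and inspect
# the output. spaCy with en_core_web_sm is assumed here only as an example;
# the same check applies to any parser you are considering.
import spacy

nlp = spacy.load("en_core_web_sm")

diagnostic_pairs = [
    # should receive different structures (1a vs 1b)
    ("I want to see the queen", "I went to see the queen"),
    # should receive parallel structures (4a vs 4b)
    ("There are great tenors who are rich",
     "Are there great tenors who are rich?"),
]

def arcs(sentence):
    """Return (word, relation, head) triples for one sentence."""
    return [(t.text, t.dep_, t.head.text) for t in nlp(sentence)]

for first, second in diagnostic_pairs:
    print(arcs(first))
    print(arcs(second))
    print()
```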

Semantics (the study of meaning)

As we’ve seen, finding words, assigning them to categories, and finding the relationships between them is quite hard work. There would be no point in doing this work unless you had some application in mind that could make use of it. The key here is that the choice of words and the relationships between them are what allow language to carry messages, to have meaning. That is why language is important: it carries messages. Almost all application programs that do anything with natural language are concerned with the message carried by the input text, so almost all such programs have to identify the words that are present and the way they are arranged.

The study of how language encodes messages is known as semantics. As just noted, the message is encoded by the words that are present (lexical semantics) and the way they are arranged (compositional semantics). They are both crucial: you can’t understand the difference between John loves Mary and John hates Mary if you don’t know what loves and hates mean, and you can’t understand the difference between John loves Mary and Mary loves John if you don’t know how being the subject or object of a verb encodes the relationship between the things denoted by John and Mary and the event denoted by loves.

The key test for a theory of semantics is the ability to carry out inference between sets of natural language texts. If you can’t do the inferences in 1–6 (where P1, …, Pn |- Q means that Q can be inferred from the premises P1, …, Pn), then you cannot be said to understand English:

  1. John hates Mary |- John dislikes Mary
  2. (a) John and Mary are divorced |- John and Mary are not married
     (b) John and Mary are divorced |- John and Mary used to be married
  3. I saw a man with a big nose |- I saw a man
  4. Every woman distrusts John, Mary is a woman |- Mary distrusts John
  5. I saw more than three pigeons |- I saw at least four birds
  6. I doubt that she saw anyone |- I do not believe she saw a fat man

These are very simple inferences. If someone said that the conclusions didn’t follow from the premises, you would have to say that they just don’t understand English properly. They involve a range of different kinds of knowledge – simple entailment relationships between words (hates entails dislikes (1)); more complex relationships between words (getting divorced means canceling an existing marriage (2), so if John and Mary are divorced, then they are not now married but at one time they were); the fact that a man with a big nose is something that is a man and has a big nose plus the fact that A and B entails A (3); an understanding of how quantifiers work ((4) and (5)); combinations of all of these (6) – but they are all inferences that anyone who understands English would agree with.

Some of this information can be fairly straightforwardly extracted from corpora. There is a great deal of work, for instance, on calculating the similarity between pairs of words, though extending that to cover entailments between words has proved more difficult. Some of it is much more difficult to find using data-driven methods – the relationships between more than and at least, for instance, cannot easily be found in corpora, and the complex concepts that lie behind the word divorce would also be difficult to extract unsupervised from a corpus.
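
For instance, distributional word vectors give a quick, corpus-derived measure of similarity, as in the sketch below (which assumes spaCy's en_core_web_md model, a model that ships with word vectors). Turning symmetric similarity scores into directed entailments of the kind needed for inference 1 is the harder problem noted above:

```python
# A minimal sketch of corpus-derived word similarity using spaCy's medium
# English model (en_core_web_md, assumed installed, which includes word
# vectors). Similarity of this kind is symmetric, so it does not by itself
# tell us that "hates" entails "dislikes" rather than the other way round.
import spacy

nlp = spacy.load("en_core_web_md")
hates, dislikes, pigeons = nlp("hates dislikes pigeons")

print(hates.similarity(dislikes))   # typically fairly high
print(hates.similarity(pigeons))    # typically much lower
```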

Furthermore, some of these inferences can be carried out by using tree-matching algorithms of various kinds, from simple algorithms that just compute whether one tree is a subtree of another to more complex approaches that pay attention to polarity (doubt flicks a switch that turns the direction of the matching algorithm round – I know she loves him |- I know she likes him, I doubt she likes him |- I doubt she loves him) and to the relationships between quantifiers (the |- some, more than N |- at least N+1) (Alabbas & Ramsay, 2013; MacCartney & Manning, 2014). Others require more complex strategies, in particular examples with multiple premises such as (4), but all but the very simplest approaches (for example, just treating a sentence as a bag of words) require accurate, or at least consistent, trees.

Exactly how much of this machinery you need depends on your ultimate application. Fortunately for us, sentiment mining can be done reasonably effectively with fairly shallow approaches, but it should not be forgotten that there is a great deal more to understanding a text than simply knowing lexical relationships such as similarity or subsumption between words.

Before wrapping up this chapter, we will spend some time learning about machine learning, looking at various machine learning models, and then working our way through a sample project using Python.