Mastering Social Media Mining with R

Overview of this book

With the growing number of users on the web, the volume of content generated has increased substantially, creating a need to gain insights from the untapped gold mine that is social media data. For computational statistics, R has an advantage over other languages in providing readily available data extraction and transformation packages, making it easier to carry out your ETL tasks. In addition, its data visualization packages help users get a better understanding of the underlying data distributions, while its range of "standard" statistical packages simplifies analysis of the data. This book will teach you how powerful business cases are solved by applying machine learning techniques to social media data. You will learn about important and recent developments in the field of social media, along with a few advanced topics such as Open Authorization (OAuth). Through practical examples, you will access data from R using the APIs of various social media sites such as Twitter, Facebook, Instagram, GitHub, Foursquare, LinkedIn, Blogger, and other networks. We will provide you with detailed explanations of the implementation of various use cases using R programming. With this handy guide, you will be ready to embark on your journey as an independent social media analyst.

Data modeling – the application of mining algorithms


Let's look at some of the standard mining algorithms.

Opinion mining (sentiment analysis)

In simple terms, opinion mining, or sentiment analysis, is the method by which we try to assess the opinion or sentiment expressed in a given phrase. The phrase could be any sentence. Although our examples are in English, sentiment analysis is not limited to any one language. The sentence could also come from any source: a 140-character tweet, a Facebook post or chat, an SMS, and so on. Consider the following examples:

  • Visiting to the wonderful places in Europe. Feeling real happy—Positive.

  • I love little sunshine in winters, make me feel live—Positive.

  • I am stuck in a same place, feeling sad—Negative.

  • The cab driver was a nice person. Think many of them are actually good people—Positive.

Sentiment analysis can play a crucial role in understanding customer sentiment, which can directly affect the growth of a business. With social media platforms such as Twitter, the saying "words are mightier than swords" has taken on a whole new meaning. In the next chapter, we'll see how customer sentiment can affect business growth. There is also nothing like word-of-mouth marketing, and here again social media platforms can help bring in more business through the words of real customers. The field has advanced to the point where people have predicted the outcomes of major elections based on voter sentiment. Similarly, stock market forecasts are now being generated from the analysis of customer tweets.

Steps for sentiment analysis

To a computer, a belief, opinion, or sentiment can be described as a quintuple, that is, a tuple of five elements:

  • o_j: This is the object of the opinion (for example, a product). It is identified via named entity extraction.

  • f_jk: This is a feature of o_j. It is assessed using information mining techniques.

  • SO_ijkl: This is the sentiment value of the opinion of the opinion holder h_i on feature f_jk of object o_j at time t_l.

  • h_i: This is the opinion holder.

  • t_l: This is the time at which the opinion was expressed. It is obtained during data extraction.

Perform the following steps to get the sentiment value SO_ijkl:

  1. Part-of-speech (POS) tagging: the terms in the text (or sentence) are marked by a POS tagger, which assigns a label to each term so that the system can process it.

  2. We look at the sentiment orientation (SO) of the patterns we mined. For example, we may have extracted "Remarkable" + "Handset", which is [JJ] + [NN] (an adjective followed by a noun). The opposite might be "Awful", for instance. In this phase, the system attempts to position the terms on an emotive scale.

  3. The average sentiment orientation of all the terms we gathered is computed. This allows the system to say something like:

    • "Usually, individuals like the fresh Handset." They recommend it.

    • "Usually, individuals hate the fresh Handset." They don't recommend it.
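The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is assumed to be already POS-tagged (step 1), and the `LEXICON` scores, the sample review, and the function names are hypothetical and do not come from any particular package.

```python
# Hypothetical sentiment lexicon: adjective -> orientation on a [-1, 1] scale.
LEXICON = {"remarkable": 1.0, "great": 0.8, "awful": -1.0, "poor": -0.6}


def adjective_noun_patterns(tagged):
    """Step 2: return (adjective, noun) pairs where a JJ tag is directly followed by NN."""
    return [
        (w1.lower(), w2.lower())
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if t1 == "JJ" and t2 == "NN"
    ]


def average_orientation(tagged, target):
    """Step 3: average the lexicon scores of the adjectives attached to the target noun."""
    scores = [
        LEXICON[adj]
        for adj, noun in adjective_noun_patterns(tagged)
        if noun == target and adj in LEXICON
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Step 1 is assumed done: each term already carries a POS label.
tagged = [
    ("Remarkable", "JJ"), ("handset", "NN"), (",", ","),
    ("great", "JJ"), ("handset", "NN"), ("but", "CC"),
    ("poor", "JJ"), ("battery", "NN"),
]

handset_score = average_orientation(tagged, "handset")
battery_score = average_orientation(tagged, "battery")
verdict = "recommend" if handset_score > 0 else "do not recommend"
```

A positive average for "handset" leads to the "They recommend it" style of statement, and a negative one to the opposite.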

Classifying sentiments is not easy; nonetheless, various classification algorithms have been employed to aid opinion mining. These range from simple probabilistic classifiers such as Naïve Bayes (which assumes that the features are conditionally independent of one another given the class) to more advanced classifiers such as maximum entropy (which does not make that independence assumption).
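To make the Naïve Bayes idea concrete, here is a minimal sketch of a word-count Naïve Bayes sentiment classifier with add-one smoothing. The training phrases are made up for illustration, and the function names are hypothetical; a real analysis would use a much larger labeled corpus and an established package.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (text, label). Collect per-class document and word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log of the class frequency in the training data.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Each word contributes independently (the "naive" assumption).
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training phrases, for illustration only.
training_docs = [
    ("feeling real happy", "positive"),
    ("love the sunshine", "positive"),
    ("nice person good people", "positive"),
    ("stuck feeling sad", "negative"),
    ("awful service bad day", "negative"),
]
model = train_naive_bayes(training_docs)
label_a = classify(model, "happy and good")
label_b = classify(model, "sad and stuck")
```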

Classifiers such as Support Vector Machines (SVMs) and Neural Networks (NNs), which separate classes in a high-dimensional feature space, have also been used to classify sentiments. Of the two, SVMs often work very well, in part because of the kernel trick.

Other methods are being explored as well, for example, anomaly/spam detection, also called social spammer detection. Fake profiles created with malicious intent are known as spam or anomalous profiles. Users who create such profiles often pretend to be someone they are not and try to perform some inappropriate activity, which can eventually cause problems for the person being imitated as well as for others. There has been an increase in the number of cases of online bullying, trolling, and so on, which are direct results of social spamming. We'll show you the various classification algorithms to detect these fake profiles in Chapter 3, Find Friends on Facebook.

The algorithms we'll use to identify spam and/or spammers from a set of example data fall under the general class known as supervised machine learning algorithms. The example dataset used by these algorithms is called the training set. For notational consistency, let's say that the ith record in the training set is a pair consisting of an input vector x_i and an output label y_i. The vector x_i consists of a set of features representative of the ith sample point. The task of such an algorithm is to infer a function f (from a given set of possible functions F) that maps each x_i to its respective y_i with a high level of accuracy. This function f is sometimes called a learned or trained model. The process of inferring f from the training data is called learning. Once the model is trained, we apply it to new records to predict their labels. The ability of such a model to correctly label new examples that were not part of the training set (the test set) is known as generalization.
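The train, predict, and generalize cycle described above can be sketched with a very simple learner. The following is an illustration only: it uses a nearest-centroid rule as the function f, and the two profile features (fraction of posts containing links, fraction of messages sent to strangers) and all data values are hypothetical.

```python
def train(training_set):
    """Learn f: store the mean feature vector (centroid) of each label."""
    sums, counts = {}, {}
    for x, y in training_set:
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, value in enumerate(x):
            acc[k] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model, x):
    """Apply f: return the label whose centroid is closest to x."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(predict(model, x) == y for x, y in examples)
    return hits / len(examples)


# Hypothetical (x_i, y_i) pairs: x_i = (link fraction, stranger-message fraction).
training_set = [
    ((0.9, 0.8), "spam"), ((0.8, 0.9), "spam"), ((0.7, 0.7), "spam"),
    ((0.1, 0.2), "genuine"), ((0.2, 0.1), "genuine"), ((0.3, 0.3), "genuine"),
]
# Records the model has never seen; accuracy here measures generalization.
test_set = [
    ((0.85, 0.6), "spam"), ((0.6, 0.9), "spam"),
    ((0.15, 0.35), "genuine"), ((0.4, 0.1), "genuine"),
]

model = train(training_set)
test_accuracy = accuracy(model, test_set)
```

Evaluating on `test_set` rather than `training_set` is the point of the exercise: a model that only memorizes its training records would score well on them and poorly on new ones.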

There are many algorithms in the class of supervised machine learning algorithms, such as the Naïve Bayes classifier, the decision tree classifier, and so on. One such algorithm is the SVM. In a two-class (binary) classification problem, an SVM finds the maximal-margin hyperplane, that is, the hyperplane that separates the two classes with the largest possible margin. If there are more than two classes, multiple SVMs are learned using the one-versus-rest or one-versus-one method; discussing these two methods is beyond the scope of this book.
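As a rough illustration of the binary case, the sketch below trains a linear SVM by stochastic subgradient descent on the regularized hinge loss. This is one simple way to approximate a maximal-margin hyperplane, not the solver used by any particular library; the data points, the hyperparameter values, and the function names are assumptions made for the example.

```python
def train_linear_svm(data, epochs=100, lr=0.1, lam=0.01):
    """data: list of (x, y) with y in {+1, -1}. Returns weights w and bias b."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term always shrinks w, which favors a wide margin.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                # The point is inside the margin or misclassified:
                # move the hyperplane away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable training points.
data = [
    ((2, 2), 1), ((3, 3), 1), ((2, 3), 1),
    ((-2, -2), -1), ((-3, -3), -1), ((-2, -3), -1),
]
w, b = train_linear_svm(data)
train_predictions = [svm_predict(w, b, x) for x, _ in data]
```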

The following figure illustrates a binary classification by SVM. The red and black dots are training data points (x_i's), and the two colors represent the two values of the label y_i. SVM also comes with a neat transformation that maps the current feature space to a new feature space using various kernels. Discussing the details is beyond the scope of this book.
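Without going into the theory, a small numeric check shows what a kernel does. For two-dimensional inputs, the degree-2 polynomial kernel k(x, z) = (x · z)² gives the same value as an ordinary dot product taken after mapping each point into a three-dimensional feature space. The function names below are chosen for this sketch only.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, z) ** 2


def feature_map(x):
    """Explicit map from 2-D to the 3-D space implied by poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, z)                      # computed without the map
mapped_value = dot(feature_map(x), feature_map(z))    # computed in the new space
```

The two values agree up to floating-point rounding, which is why an SVM can work in the new feature space while only ever evaluating the kernel on the original points.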

Community detection via clustering

In graph terms, a community is a set of nodes among which communications or interactions are more frequent than with nodes outside the set. From a marketing point of view, community detection is crucial and has proven to be very rewarding in terms of return on investment (ROI). For example, travel enthusiasts can be identified on various social media websites based on the places they have visited, their posts, comments, tweets, and so on. If such segmentation can be done, then products related to travel (such as a handheld compass, travel pillow, global alarm clock, binoculars, slim digital camera, noise-cancelling headphones, and so on) stand a higher chance of being purchased when marketed to them. Hence, with a focused marketing effort, the business can get a higher ROI.

While spam detection is a supervised machine learning task, community detection, or clustering, falls under the class of unsupervised learning algorithms. Social media offers two types of communities. Some are explicitly created groups of people with a common location, hobby, or occupation. Others are implicit: many people share such commonalities but are not connected to any explicit group. Identifying these people is a clustering task. It is performed using their interactions (for example, mentioning a common thing in their comments/posts/tweets) as feature sets (x_i's), without any label information (unlike in supervised machine learning algorithms). These features are passed to various unsupervised machine learning algorithms to find the commonalities and hence the communities. Many algorithms also provide an extent, degree, or affinity score indicating how strongly a particular person belongs to a specific community.
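As a small illustration of clustering without labels, the sketch below runs a basic k-means loop on per-user feature vectors. The features (number of travel mentions, number of sports mentions), the data values, and the choice of starting centroids are all hypothetical, and k-means is used here only as a familiar example of an unsupervised algorithm.

```python
def closest(point, centroids):
    """Index of the centroid nearest to the point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and re-averaging centroids."""
    for _ in range(iterations):
        labels = [closest(p, centroids) for p in points]
        updated = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        centroids = updated
    return [closest(p, centroids) for p in points], centroids


# Hypothetical x_i vectors: (travel mentions, sports mentions) per user.
users = ["u1", "u2", "u3", "u4", "u5", "u6"]
points = [(9, 1), (8, 2), (7, 1), (1, 8), (2, 9), (1, 7)]

# Start from two of the points themselves; no labels are involved anywhere.
labels, centroids = kmeans(points, [points[0], points[3]])
communities = {user: label for user, label in zip(users, labels)}
```

The distance from a user to each final centroid could serve as a crude affinity score of the kind mentioned above, with smaller distances meaning stronger membership.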

Many algorithms and techniques have been proposed in academia, and we'll discuss them in detail in the following chapters. Basically, these methods are based on calculating the influence along the links between various nodes (people, locations, and other such entities). Similar people are likely to be linked, and a link between two users indicates that they will influence each other and become more similar. Two users are placed in the same group or community if their similarity is high.
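The idea that highly similar users end up in the same community can be sketched with a simple threshold rule. This is an illustrative heuristic, not one of the specific algorithms discussed in the following chapters: users are linked when the Jaccard similarity of their interest sets reaches a threshold, and each connected group of linked users is treated as a community. The user names, interests, and threshold value are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_communities(profiles, threshold):
    """Link users whose similarity reaches the threshold, then group linked users."""
    links = {user: set() for user in profiles}
    for u, v in combinations(profiles, 2):
        if jaccard(profiles[u], profiles[v]) >= threshold:
            links[u].add(v)
            links[v].add(u)

    # Each connected component of the link graph is one community.
    seen, groups = set(), []
    for user in profiles:
        if user in seen:
            continue
        stack, group = [user], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(links[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical users and the topics they post about.
profiles = {
    "ana": {"travel", "photography", "hiking"},
    "ben": {"travel", "hiking", "camping"},
    "cara": {"travel", "photography", "camping"},
    "dev": {"cricket", "football"},
    "eli": {"cricket", "football", "tennis"},
}
groups = find_communities(profiles, threshold=0.5)
```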