Mastering Social Media Mining with R

Overview of this book

As the number of users on the web has grown, the volume of content they generate has increased substantially, creating a need to gain insights from the untapped gold mine that is social media data. For computational statistics, R has an advantage over other languages in providing readily available data extraction and transformation packages, making it easier to carry out your ETL tasks. In addition, its data visualization packages help users better understand the underlying data distributions, while its range of "standard" statistical packages simplifies analysis of the data. This book will teach you how powerful business cases are solved by applying machine learning techniques to social media data. You will learn about important and recent developments in the field of social media, along with a few advanced topics such as Open Authorization (OAuth). Through practical examples, you will access data from R using the APIs of various social media sites such as Twitter, Facebook, Instagram, GitHub, Foursquare, LinkedIn, Blogger, and other networks. We will provide detailed explanations of how to implement various use cases in R. With this handy guide, you will be ready to embark on your journey as an independent social media analyst.

The generic process of social media mining


Any data mining activity follows a set of generic steps to extract useful insights from the data. Since social media is the central theme of this book, let's discuss these steps using example data from Twitter:

  • Getting authentication from the social website

  • Data visualization

  • Cleaning and preprocessing

  • Data modeling using standard techniques such as opinion mining, clustering, anomaly/spam detection, correlations and segmentations, and recommendations

  • Result visualization
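To make the cleaning and preprocessing step concrete, here's a minimal sketch in Python using only the standard library. The function name, the sample tweet, and the user handle are made up for illustration, and the exact cleaning rules are an assumption; a real project would tune them to its data.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, the '#' sign, and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop the sign
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop remaining punctuation
    return text.split()                        # split on whitespace into tokens

# Hypothetical tweet, used only to exercise the function.
tokens = clean_tweet("Loving the new #rstats release! Thanks @some_user https://example.com/x")
print(tokens)
```

The resulting list of tokens is the kind of input that the modeling and visualization steps work on.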

Getting authentication from the social website – OAuth 2.0

Most social media websites provide API access to their data. To do the mining, we (as a third party) need some mechanism to get access to the users' data available on these websites. The problem is that users will not share their credentials with anyone, for obvious security reasons. This is where OAuth comes into the picture. According to its home page (http://oauth.net/), OAuth can be defined as follows:

An open protocol to allow secure authorization in a simple and standard method from web, mobile and desktop applications.

To understand it better, let's take the example of Instagram, where a user can allow a printing service to access their private photographs stored on Instagram's servers without sharing their credentials with the printing service. Instead, the user authenticates directly with Instagram, which issues the printing service delegation-specific permissions. The user here is the primary owner of the resource, and the printing service is the third-party client. Social media websites such as Instagram, Twitter, and Facebook allow various applications to access user data for purposes such as advertisements or recommendations. Similarly, almost all cab service applications access the user's location.

Here's a diagram illustrating the concept:

OAuth 2.0 provides several methods by which different levels of authorization over various resources can be reliably granted to the requesting client application. One of the most frequently used and most important use cases is authorizing one web server or application to access data held on another web server.

The following image shows the authentication process:

Let's look at the various steps involved:

  1. The user opens the client web app and clicks a button such as Login via Twitter (or Login via LinkedIn or Login via Facebook).

  2. This takes the user to the authenticating app, which verifies the user's identity. The user is then asked to allow the client app access to their resources, that is, the profile data. The user needs to accept in order to go to the next step.

  3. The authenticating app then redirects the user to a redirect link that the client app has provided. Usually, the redirect link is supplied when the client app is registered with the authenticating app: the owner of the client app registers the redirect link, and at the same time the authenticating app issues client credentials to the client app.

  4. Following the redirect link, the user's browser contacts the client app. The redirect request carries an authorization code in its parameters. The client app then connects to the authenticating app and exchanges this code for an access token.
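The steps above can be sketched in a few lines of Python using only the standard library. This is a sketch under stated assumptions, not any particular network's API: the endpoint URLs, client ID, client secret, and redirect link are hypothetical placeholders, and the state parameter (a random value used to detect forged redirects) is an addition that the steps above do not mention. The parameter names follow the OAuth 2.0 authorization code grant.

```python
import secrets
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical values: real ones are issued when the client app is registered.
AUTHORIZE_URL = "https://auth.example.com/oauth/authorize"
TOKEN_URL = "https://auth.example.com/oauth/token"
CLIENT_ID = "my-client-id"
CLIENT_SECRET = "my-client-secret"
REDIRECT_URI = "https://client.example.com/callback"

def build_authorization_url(state, scope="profile"):
    """Steps 1-2: the URL the user is sent to in order to log in and grant access."""
    params = {
        "response_type": "code",
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": scope,
        "state": state,
    }
    return AUTHORIZE_URL + "?" + urlencode(params)

def extract_code(redirect_url, expected_state):
    """Steps 3-4: read the authorization code from the redirect request."""
    query = parse_qs(urlparse(redirect_url).query)
    if query.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch: possible forged redirect")
    return query["code"][0]

def build_token_request(code):
    """Step 4: the form body the client app would POST to TOKEN_URL."""
    return {
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": REDIRECT_URI,
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
    }

state = secrets.token_urlsafe(16)
login_url = build_authorization_url(state)
# Simulated redirect, standing in for the one the authenticating app would issue.
redirect = REDIRECT_URI + "?" + urlencode({"code": "abc123", "state": state})
token_request = build_token_request(extract_code(redirect, state))
print(login_url)
print(token_request["grant_type"], token_request["code"])
```

No network call is made here; the sketch only shows which values travel in each step. In practice, the token request is sent over HTTPS and the response contains the access token.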

Depending on the network, the access token can be constrained not only in the information it gives access to but also in its own lifetime. Once the client app obtains an access token, it can send this token to the respective social media service, such as Facebook, LinkedIn, or Twitter, to access the resources on those servers that belong to the users who gave permission.
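Here's a minimal Python sketch of how a client app might check these two constraints before using a token. The token values are made up, and the helper names are hypothetical; the field names (access_token, token_type, expires_in, scope) are the ones defined for token responses in the OAuth 2.0 specification, though individual networks may return additional or different fields.

```python
import time

# Hypothetical token response, for illustration only.
token = {
    "access_token": "tok-xyz",
    "token_type": "Bearer",
    "expires_in": 3600,        # lifetime in seconds
    "scope": "profile email",  # space-separated permissions
}

def is_usable(token, issued_at, needed_scope, now=None):
    """Return True if the token has not expired and covers the needed scope."""
    now = time.time() if now is None else now
    not_expired = now < issued_at + token["expires_in"]
    has_scope = needed_scope in token["scope"].split()
    return not_expired and has_scope

def auth_header(token):
    """Header the client app sends along with each API request."""
    return {"Authorization": "Bearer " + token["access_token"]}

issued_at = time.time()
print(is_usable(token, issued_at, "profile"))  # within lifetime, scope granted
print(is_usable(token, issued_at, "photos"))   # scope was never granted
print(auth_header(token))
```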

Differences between OAuth and OAuth 2.0

Here are some of the major differences:

  • OAuth 2.0 offers more flows, which improves support for non-browser-based apps

  • OAuth 2.0 does not require the client app to implement cryptography

  • OAuth 2.0 offers much less complicated signatures

  • OAuth 2.0 generates short-lived access tokens, which makes it more secure

  • OAuth 2.0 has a clearer segregation of roles concerning the server responsible for handling user authorization and the server handling OAuth requests
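The point about cryptography and signatures is easier to see side by side. The following Python sketch contrasts an OAuth 1.0 style HMAC-SHA1 request signature with the bearer token header used in OAuth 2.0. The URL, keys, and parameter values are made up, and the signing function is a simplified illustration of the OAuth 1.0 scheme (a single parameter dictionary, no duplicate parameter names), not a complete implementation.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def pct(value):
    """Percent-encode a value, leaving only unreserved characters as they are."""
    return quote(str(value), safe="-._~")

def oauth1_signature(method, url, params, consumer_secret, token_secret=""):
    """OAuth 1.0 style: the client signs every request with HMAC-SHA1."""
    encoded = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in encoded)
    base_string = "&".join([method.upper(), pct(url), pct(normalized)])
    key = pct(consumer_secret) + "&" + pct(token_secret)
    digest = hmac.new(key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

def oauth2_header(access_token):
    """OAuth 2.0 style: the token itself is sent over HTTPS, with no signing."""
    return {"Authorization": "Bearer " + access_token}

# Hypothetical request parameters and secrets.
params = {
    "oauth_consumer_key": "my-key",
    "oauth_nonce": "n0nce",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1400000000",
    "oauth_token": "user-token",
    "q": "social media",
}
signature = oauth1_signature("GET", "https://api.example.com/search", params,
                             "consumer-secret", "token-secret")
print(signature)
print(oauth2_header("tok-xyz"))
```

With OAuth 2.0, the protection that the signature provided is delegated to the transport layer, which is why bearer tokens should only be sent over HTTPS.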

Data visualization R packages

A number of R packages are available for visualizing text data. Depending on the available data and the objective, these libraries offer options ranging from simple clusters of words to visualizations in line with semantic analysis or topic modeling of the corpus, and they provide a means to better understand text data. In this book, we'll use the following libraries:

The simple word cloud

One of the simplest and most frequently used visualizations is the simple word cloud. The basic intent of a word cloud is to visualize the weights of the words present. The "wordcloud" R library helps the user understand the weight of each word or term with respect to the tf-idf matrix. The size and color of each word in the plot reflect its weight. Here's an example of one such simple word cloud based on a corpus created from tweets:
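To see where such weights come from, here's a minimal Python sketch that computes tf-idf weights for a toy corpus and maps them to font sizes. The three tokenized tweets, the function names, and the size range are made up for illustration, and summing each term's tf-idf score over all documents is one assumed way to get a single weight per term; this is not the internals of the wordcloud package.

```python
import math
from collections import Counter

# Toy corpus of three already-tokenized tweets (hypothetical data).
docs = [
    ["rstats", "is", "great", "for", "data"],
    ["data", "mining", "with", "rstats"],
    ["social", "media", "data", "mining"],
]

def tfidf_weights(docs):
    """Sum each term's tf-idf score over all documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)                    # term frequency in this doc
            idf = math.log(n_docs / doc_freq[term])  # rarer terms score higher
            weights[term] += tf * idf
    return weights

def font_sizes(weights, smallest=10, largest=40):
    """Scale weights linearly so the heaviest term gets the largest font."""
    top = max(weights.values())
    if top == 0:
        return {term: smallest for term in weights}
    return {term: round(smallest + (largest - smallest) * w / top)
            for term, w in weights.items()}

sizes = font_sizes(tfidf_weights(docs))
for term, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(term, size)
```

Note that "data" occurs in every tweet, so its idf is zero and it gets the smallest size, while a term that occurs in a single short tweet is drawn largest.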

Sentiment analysis Wordcloud

There are R packages that can generate a word cloud similar to the preceding figure, along with the sentiment each word represents. Such plots are one step ahead of the basic word cloud because they let the user understand what kinds of sentiment are present and why particular documents (collections of tweets) are of a particular nature (joy, sadness, disgust, love, and so on). Timothy Jurka developed one such package, which we are going to use. The two main functions of this package are as follows:

  • classify_emotion: As the name suggests, this procedure helps the user understand the type of sentiment that is present. It also clusters the words in the query based on the sentiment and the level of emotion that each word carries. Voting-based classification is one of the algorithms used in this procedure. The Naive Bayes algorithm is also used for more refined results. The training dataset used with these algorithms is from Carlo Strapparava and Alessandro Valitutti. Here's a sample output:

  • classify_polarity: This procedure indicates the overall polarity of the emotions (positive or negative). It is, in a way, an extension of the preceding procedure. The training data used here comes from Janyce Wiebe's subjectivity lexicon.
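The idea behind voting-based classification can be sketched in a few lines of Python. The two tiny lexicons and the function name below are made up for illustration; the package itself relies on the much larger lexicons mentioned above, and this sketch is not its actual implementation.

```python
from collections import Counter

# Tiny hypothetical lexicons, for illustration only.
EMOTION_LEXICON = {
    "happy": "joy", "love": "joy", "great": "joy",
    "sad": "sadness", "cry": "sadness",
    "gross": "disgust",
}
POLARITY_LEXICON = {
    "happy": "positive", "love": "positive", "great": "positive",
    "sad": "negative", "cry": "negative", "gross": "negative",
}

def classify_by_vote(tokens, lexicon, default="unknown"):
    """Each known word votes for its label; the label with the most votes wins."""
    votes = Counter(lexicon[token] for token in tokens if token in lexicon)
    if not votes:
        return default  # no word of the text appears in the lexicon
    return votes.most_common(1)[0][0]

tweet = ["i", "love", "this", "great", "day"]
print(classify_by_vote(tweet, EMOTION_LEXICON))
print(classify_by_vote(tweet, POLARITY_LEXICON))
```

When two labels receive the same number of votes, this sketch returns the one that was seen first in the text; a real classifier would need a more deliberate tie-breaking rule.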

The most commonly used visualization tool for Facebook data is Gephi. The key difference between Facebook and Twitter is the richness of a user's profile and the social connections one shares on Facebook. Gephi helps users visualize both of these distinctions in a very pleasant way. It enables a user to understand the impact one Facebook profile has, or could have, over the network. Gephi is a highly customizable and user-friendly tool. We'll discuss this in Chapter 3, Find Friends on Facebook. As a working example, here's the graph representation of a social network of two friends.
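A social network like this is, underneath, just a list of edges between people. Here's a minimal Python sketch that builds such a list for two friends and their connections, counts each person's connections, and writes the edges as CSV with Source and Target columns, a layout that Gephi's spreadsheet import accepts. The names and friendships are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical friendships around two friends, alice and bob.
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("bob", "carol"),
    ("bob", "erin"),
]

def degrees(edges):
    """Number of connections per person, a simple measure of influence."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)

def mutual_friends(edges, a, b):
    """People connected to both a and b."""
    neighbours = defaultdict(set)
    for x, y in edges:
        neighbours[x].add(y)
        neighbours[y].add(x)
    return neighbours[a] & neighbours[b]

def to_edge_csv(edges):
    """Edge list as CSV text with Source and Target columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()

print(degrees(edges))
print(mutual_friends(edges, "alice", "bob"))
print(to_edge_csv(edges))
```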

Many more R packages are available to visualize most social media data. For more information, refer to the following links: