Learning Social Media Analytics with R

By: Dipanjan Sarkar, Karthik Ganapathy, Raghav Bali, Tushar Sharma

Overview of this book

The Internet has truly become humongous, especially with the rise of various forms of social media in the last decade, which give users a platform to express themselves and also communicate and collaborate with each other. This book will help the reader to understand the current social media landscape and to learn how analytics can be leveraged to derive insights from it. This data can be analyzed to gain valuable insights into the behavior and engagement of users, organizations, businesses, and brands. It will help readers frame business problems and solve them using social data. The book will also cover several practical real-world use cases on social media using R and its advanced packages to utilize data science methodologies such as sentiment analysis, topic modeling, text summarization, recommendation systems, social network analysis, classification, and clustering. This will enable readers to learn different hands-on approaches to obtain data from diverse social media sources such as Twitter and Facebook. It will also show readers how to establish detailed workflows to process, visualize, and analyze data to transform social data into actionable insights.

Social media analytics

We now have a detailed overview of social media, its significance, pitfalls, and various facets. We will now discuss social media analytics and the benefits it offers to data analysts, data scientists, and businesses in general that are looking to gather useful insights from social media. Social media analytics, also known as social media mining or social media intelligence, can be defined as the process of gathering data (usually unstructured) from social media platforms and analyzing that data using diverse analytical techniques to extract vital insights, which can be used to make data-driven business decisions. Social media analytics involves many opportunities and challenges, which we will discuss in further detail in later sections. An important thing to remember is that the processes involved in social media analytics are usually domain-agnostic, so you can apply them to data belonging to any organization or business in any domain.

The most important step in any social media analytics-based workflow or process is to determine the business goals or objectives and the insights that we want to gather from our analyses. These goals are usually expressed as key performance indicators (KPIs). For instance, the total number of followers and the number of likes and shares can be KPIs that measure brand engagement with customers on social media. Sometimes the data is not structured and the end objectives are not very concrete. In such cases, techniques like natural language processing and text analytics can be leveraged to extract insights from noisy, unstructured text data, such as understanding the sentiment or mood of customers toward a particular service or product, or identifying the key trends and themes in customer tweets or posts at any point in time.
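To make the idea of KPIs concrete, here is a minimal Python sketch that aggregates likes and shares over a handful of posts. The record fields, the sample numbers, and the "engagement rate" formula are assumptions chosen for illustration; they do not come from any particular platform's API.

```python
# Hypothetical post records; the field names are illustrative only.
posts = [
    {"likes": 120, "shares": 30},
    {"likes": 45, "shares": 5},
    {"likes": 300, "shares": 80},
]
followers = 2452  # assumed follower count for the brand account


def engagement_kpis(posts, followers):
    """Aggregate simple engagement KPIs from a list of post records."""
    total_likes = sum(p["likes"] for p in posts)
    total_shares = sum(p["shares"] for p in posts)
    interactions = total_likes + total_shares
    # Interactions per post per follower: one of many possible definitions.
    if posts and followers:
        rate = interactions / (len(posts) * followers)
    else:
        rate = 0.0
    return {
        "followers": followers,
        "total_likes": total_likes,
        "total_shares": total_shares,
        "engagement_rate": rate,
    }


kpis = engagement_kpis(posts, followers)
print(kpis)
```

Which ratio counts as "engagement" is a business decision, so the formula above should be treated as a placeholder for whatever definition the objective calls for.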

A typical social media analytics workflow

We will be analyzing data from diverse social media applications and platforms throughout the course of this book. However, it is important to have a good grasp of the essential concepts behind any typical analytics process or workflow. While we will expand on data analytics and mining processes later, let us look at a typical social media analytics workflow in the following figure:

From the preceding diagram, we can broadly classify the main steps involved in the analytics workflow as follows:

  • Data access

  • Data processing and normalization

  • Data analysis

  • Insights

We will now briefly expand upon each of these four processes since we will be using them extensively in future chapters.

Data access

Social media data can usually be accessed with standard data retrieval methods in one of two ways.

The first technique is to use official APIs provided by the social media platform or organization itself.

The second technique is to use unofficial mechanisms, like web crawling and scraping. An important point to remember is that crawling and scraping social media websites and using that data for commercial purposes, like selling the data to other organizations, is usually against their terms of service. We will therefore not be using such methods in our book. Besides this, we will be following the necessary politeness policies while accessing social media data using their APIs, so that we do not overload them with too many requests. The data we'll obtain is the raw data which can be further processed and normalized as needed.
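One simple politeness policy is to space out consecutive API requests. The following Python sketch shows the idea under stated assumptions: `polite_fetch_all`, `fake_fetch`, the example URLs, and the interval are hypothetical names and values, and a real client would follow the rate limits documented by the platform.

```python
import time


def polite_fetch_all(urls, fetch, min_interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, waiting so that consecutive calls
    start at least min_interval seconds apart."""
    results = []
    last_call = None
    for url in urls:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(fetch(url))
    return results


# Stand-in for a real HTTP call, so the sketch runs without network access.
def fake_fetch(url):
    return {"url": url, "status": "ok"}


pages = polite_fetch_all(
    ["https://api.example.com/a", "https://api.example.com/b"],
    fake_fetch,
    min_interval=0.1,
)
print(pages)
```

Passing `sleep` and `clock` as parameters keeps the waiting logic easy to test without real delays.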

Data processing and normalization

The raw data obtained from data retrieval using social media APIs may not be structured and clean. In fact, most of the data obtained from social media is noisy and unstructured, and often contains unnecessary tokens such as HyperText Markup Language (HTML) tags and other metadata. Usually, data streams from social media APIs return JavaScript Object Notation (JSON) response objects, which consist of key-value pairs, like the example shown in the following snippet:

"user": {
    "profile_sidebar_fill_color": "DDFFCC",
    "profile_sidebar_border_color": "BDDCAD",
    "profile_background_tile": true,
    "name": "J'onn J'onzz",
    "profile_image_url": "",
    "created_at": "Tue Apr 07 19:05:07 +0000 2009",
    "location": "Ox City, UK",
    "follow_request_sent": null,
    "profile_link_color": "0084B4",
    "is_translator": false,
    "id_str": "2921138",
    "followers_count": 2452,
    "statuses_count": 7311,
    "friends_count": 427
}

The preceding JSON fragment shows a typical response from the Twitter API with the details of a user profile. Some APIs might return data in other formats, such as Extensible Markup Language (XML) or Comma Separated Values (CSV), and each format needs to be handled properly.
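As a minimal sketch of handling such a response, the following Python code parses a subset of the user fields shown earlier with the standard `json` module. The outer braces are added so that the text is a complete JSON document, and the `verified` key is a hypothetical optional field that does not appear in the snippet.

```python
import json
from datetime import datetime

# A subset of the user fields from the snippet, wrapped in outer braces
# so that it forms a complete JSON document.
raw = """
{
  "user": {
    "name": "J'onn J'onzz",
    "created_at": "Tue Apr 07 19:05:07 +0000 2009",
    "location": "Ox City, UK",
    "follow_request_sent": null,
    "is_translator": false,
    "id_str": "2921138",
    "followers_count": 2452,
    "statuses_count": 7311,
    "friends_count": 427
  }
}
"""

record = json.loads(raw)
user = record["user"]

# JSON null and false become Python None and False.
profile = {
    "id": user["id_str"],
    "name": user["name"],
    "followers": user["followers_count"],
    "friends": user["friends_count"],
    # .get() returns None instead of raising when an optional key is absent;
    # "verified" is a hypothetical field used only to show this.
    "verified": user.get("verified"),
}

# Convert the timestamp string into a timezone-aware datetime.
created = datetime.strptime(user["created_at"], "%a %b %d %H:%M:%S %z %Y")

print(profile)
print(created.isoformat())
```

Converting the response into a flat dictionary of typed values early on makes later steps, such as loading the records into a table, more straightforward.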

Often social media data contains unstructured textual data which needs additional text pre-processing and normalization before it can be fed into any standard data mining or machine learning algorithm. Text normalization is usually done using several techniques to clean and standardize the text. Some of them are:

  • Text tokenization

  • Removing special characters and symbols

  • Spelling corrections

  • Contraction expansions

  • Stemming

  • Lemmatization

More advanced processing can insert additional metadata to describe the text better, such as adding parts of speech (POS) tags, phrase tags, named entity tags, and so on.
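A few of the normalization steps listed above can be sketched in Python using only the standard library. This is a simplified illustration: the contraction table is a tiny made-up sample, tokenization is plain whitespace splitting, and spelling correction, stemming, and lemmatization are left out because they normally rely on dedicated libraries.

```python
import re

# Tiny illustrative contraction map; real projects use far larger tables.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}


def normalize(text):
    """Lowercase, strip HTML tags, expand contractions, drop symbols, tokenize."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML tags
    for short, full in CONTRACTIONS.items():   # expand known contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    return text.split()                        # whitespace tokenization


tokens = normalize("<b>I'm</b> sure it's GREAT!!! Can't wait :) #launch")
print(tokens)
```

The order of the steps matters here: contractions are expanded before special characters are removed, because the apostrophe would otherwise be stripped first.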

Data analysis

This is the core of the whole workflow, where we apply various techniques to analyze the data: this could be the raw native data itself, or the processed and curated data. Usually the techniques used in analysis can be broadly classified into three areas:

  • Data mining or analytics

  • Machine learning

  • Natural language processing and text analytics

Data mining and machine learning have several overlapping concepts, including the fact that both use statistical techniques and try to find patterns in the underlying data. Data mining is more about finding key patterns or insights in data, whereas machine learning is more about using mathematics, statistics, and even some of these data mining algorithms to build models that predict or forecast outcomes. While both of these techniques need structured, numeric data to work with, more complex analyses of unstructured textual data are usually handled in the separate realm of text analytics, which leverages natural language processing and enables us to use several tools, techniques, and algorithms to analyze free-flowing unstructured text. We will be using techniques from these three areas to analyze data from various social media platforms throughout this book. We will cover important concepts from data analytics and text analytics briefly towards the end of this chapter.


Insights

The end results of our workflow are the actual insights, which act as facts or concrete data points for achieving the objective of the analysis. These can be anything from a business intelligence report to visualizations such as bar graphs, histograms, or even word or phrase clouds. Insights should be crisp, clear, and actionable so that businesses can easily use them to make valuable decisions in time.


Opportunities

Based on the advantages of social media, we can derive plentiful opportunities that lie within the scope of social media analytics. You can save much of the cost involved in targeted advertising and promotions by analyzing your social media traffic patterns. You can see how users engage with your brand or business on social media and, for instance, find the best time to share something interesting, such as a new service, a product, or even an anecdote about your company. Based on traffic from different geographies, you can analyze and understand the preferences of users from different parts of the world. Users love it if you publish promotions in their local language, and businesses are already leveraging such capabilities from social media platforms such as Facebook to target users in specific countries with localized content.

The social media analytics landscape is still young and emerging and has a lot of untapped potential.

Let us understand the potential of social media analytics better by taking a real-world example.

Suppose you are running a profitable business with active engagement on various social media channels. How can you use the data generated from social media to know how you are doing and how your competitors are doing? Live data streams from Twitter could be continuously analyzed to get the real-time mood, sentiment, emotion, and reactions of people to your products and services. You could even analyze the same for your competitors to see when they are launching their products and how users are reacting to them. With Facebook, you can do the same and even push localized promotions and advertisements to see if they help in generating better revenue. News portals would give you live feeds of trending news articles and insights into the current state of the economy and current events, and help you decide whether these are favorable times for a thriving business or whether you should be preparing for some hard times. Sentiment analysis, concept mining, topic models, clustering, and inference are just a few examples of using analytics on social media. The opportunities are huge; you just need to have a clear objective in mind so that you can use analytics effectively to achieve it.
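To give a flavor of what sentiment analysis on such posts involves, here is a deliberately simple, lexicon-based Python sketch. The lexicon, its weights, and the sample posts are invented for illustration; practical sentiment analysis uses much larger lexicons or trained models and handles negation, sarcasm, and context.

```python
# Toy sentiment lexicon; the words and weights are made up for illustration.
LEXICON = {"love": 2, "great": 2, "good": 1, "slow": -1, "bad": -2, "terrible": -3}


def sentiment_score(text):
    """Sum lexicon weights over the lowercase, whitespace-separated words."""
    words = text.lower().split()
    return sum(LEXICON.get(word.strip(".,!?"), 0) for word in words)


def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


sample_posts = [
    "Love the new release, great job!",
    "Checkout is slow and support was terrible.",
    "Package arrived today.",
]

for post in sample_posts:
    print(label(sentiment_score(post)), "|", post)
```

Aggregating such labels over time, per product or per competitor, is one way in which the raw stream of posts could be turned into a trend that a business can act on.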


Challenges

Before we delve into the challenges associated with social media analytics, let us look at the following interesting facts:

  • There are over 300 million active Twitter users

  • Facebook has over 1.8 billion active users

  • Facebook generates 600-700+ terabytes of data daily (and it could be more now)

  • Twitter generates 8-10+ terabytes of data daily

  • Facebook generates over 4 to 5 million posts per minute

  • Instagram generates over 2 million likes per minute

These statistics give you a rough idea of the massive scale of data being generated and consumed on these social media platforms. This leads to some challenges:

  • Big data: Due to the massive amount of data produced by social media platforms, it is sometimes difficult to analyze the complete dataset using traditional analytical methods since the complete data would never fit in memory. Other approaches and tools, such as Hadoop and Spark, need to be leveraged.

  • Accessibility issues: Social media platforms generate a lot of data, but getting direct access to it is not always easy. There are rate limits on their official APIs, and it's rare to be able to access and store complete datasets. Besides this, each platform has its own terms and conditions, which should be adhered to when accessing its data.

  • Unstructured and noisy data: Most of the data from social media APIs is unstructured, noisy, and full of junk. Data cleaning and processing becomes really cumbersome, and analysts and data scientists often end up spending 70% of their time and effort trying to clean and curate the data for analysis.
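When a dataset is too large to load at once, one common workaround on a single machine is to stream the records and keep only a small running summary. The Python sketch below illustrates this with newline-delimited JSON and also skips malformed records; the `hashtags` field and the sample lines are hypothetical, and this is not a substitute for distributed tools such as Hadoop or Spark.

```python
import io
import json
from collections import Counter


def count_hashtags(lines):
    """Consume an iterable of JSON lines one at a time, keeping only a tally."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed (noisy) records
        # "hashtags" is a hypothetical field name used for this sketch.
        for tag in record.get("hashtags", []):
            counts[tag.lower()] += 1
    return counts


# StringIO stands in for a large file that would be read line by line.
stream = io.StringIO(
    '{"text": "a", "hashtags": ["R", "DataScience"]}\n'
    'not valid json\n'
    '{"text": "b", "hashtags": ["r"]}\n'
    '{"text": "c"}\n'
)

top = count_hashtags(stream).most_common(2)
print(top)
```

Because only the counter is held in memory, the memory footprint depends on the number of distinct hashtags rather than on the number of records processed.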

These are perhaps the most prevalent of the many challenges that you might face in your social media analytics journey. Let's now get acquainted with the R programming language, which will be useful to us when we perform our analyses.