Mastering Social Media Mining with Python

By : Marco Bonzanini

Overview of this book

Your social media is filled with a wealth of hidden data – unlock it with the power of Python. Transform your understanding of your clients and customers when you use Python to solve the problems of understanding consumer behavior and turning raw data into actionable customer insights. This book will help you acquire and analyze data from leading social media sites. It will show you how to employ scientific Python tools to mine popular social websites such as Facebook, Twitter, Quora, and more. Explore the Python libraries used for social media mining, and get the tips, tricks, and insider insight you need to make the most of them. Discover how to develop data mining tools that use a social media API, and how to create your own data analysis projects using Python for clear insight from your social data.

Social media - challenges and opportunities


In traditional media, users are typically just consumers. Information flows in one direction: from the publisher to the users. Social media breaks this model, allowing every user to be a consumer and publisher at the same time. Many academic publications have been written on this topic with the purpose of defining what the term social media really means (for example, Users of the world, unite! The challenges and opportunities of Social Media, Andreas M. Kaplan and Michael Haenlein, 2010). The aspects that are most commonly shared across different social media platforms are as follows:

  • Internet-based applications
  • User-generated content
  • Networking

Social media are Internet-based applications. It is clear that the advances in Internet and mobile technologies have promoted the expansion of social media. Through your mobile, you can, in fact, immediately connect to a social media platform, publish your content, or catch up with the latest news.

Social media platforms are driven by user-generated content. As opposed to the traditional media model, every user is a potential publisher. More importantly, any user can interact with every other user by sharing content, commenting, or expressing positive appraisal via the like button (sometimes referred to as upvote, or thumbs up).

Social media is about networking. As described, social media is about users interacting with other users. Being connected is the central concept for most social media platforms, and the content you consume via your news feed or timeline is driven by your connections.

With these main features being central across several platforms, social media is used for a variety of purposes:

  • Staying in touch with friends and family (for example, via Facebook)
  • Microblogging and catching up with the latest news (for example, via Twitter)
  • Staying in touch with your professional network (for example, via LinkedIn)
  • Sharing multimedia content (for example, via Instagram, YouTube, Vimeo, and Flickr)
  • Finding answers to your questions (for example, via Stack Overflow, Stack Exchange, and Quora)
  • Finding and organizing items of interest (for example, via Pinterest)

This book aims to answer one central question: how can we extract useful knowledge from the data coming from social media? Taking one step back, we need to define what knowledge is and what makes it useful.

Traditional definitions of knowledge come from information science. The concept of knowledge is usually pictured as part of a pyramid, sometimes referred to as knowledge hierarchy, which has data as its foundation, information as the middle layer, and knowledge at the top. This knowledge hierarchy is represented in the following diagram:

Figure 1.1: From raw data to semantic knowledge

Climbing the pyramid means refining knowledge from raw data. The journey from raw data to distilled knowledge goes through the integration of context and meaning. As we climb up the pyramid, the technology we build gains a deeper understanding of the original data, and more importantly, of the users who generate such data. In other words, it becomes more useful.

In this context, useful knowledge means actionable knowledge, that is, knowledge that enables a decision maker to implement a business strategy. As a reader of this book, you'll understand the key principles to extract value from social data. Understanding how users interact through social media platforms is one of the key aspects in this journey.

The following sections lay down some of the challenges and opportunities of mining data from social media platforms.

Opportunities

The key opportunity of developing data mining systems is to extract useful insights from data. The aim of the process is to answer interesting (and sometimes difficult) questions using data mining techniques to enrich our knowledge about a particular domain. For example, an online retail store can apply data mining to understand how its customers shop. Through this analysis, the store is able to recommend products to customers depending on their shopping habits (for example, users who buy item A also buy item B). This generally leads to a better customer experience and higher satisfaction, which in turn can produce better sales.
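The "users who buy item A also buy item B" idea can be sketched with simple co-occurrence counts. This is a minimal illustration rather than a production recommender: the baskets are invented toy data, and `also_bought` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# Toy shopping baskets, invented purely for illustration.
baskets = [
    {"laptop", "mouse", "laptop bag"},
    {"laptop", "mouse"},
    {"mouse", "mouse pad"},
    {"laptop", "mouse", "keyboard"},
]

# Count how often each ordered pair of items appears in the same basket.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1
        co_counts[(b, a)] += 1


def also_bought(item, top_n=2):
    """Return up to top_n items most often bought together with item."""
    scores = Counter({b: n for (a, b), n in co_counts.items() if a == item})
    return [other for other, _ in scores.most_common(top_n)]


print(also_bought("laptop", top_n=1))
```

In this toy data, the mouse shares three baskets with the laptop, so it is the top suggestion. Real systems typically normalize these counts so that items that are popular everywhere do not dominate every recommendation.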

Many organizations in different domains can apply data mining techniques to improve their business. Some examples include the following:

  • Banking:
    • Identifying loyal customers to offer them exclusive promotions
    • Recognizing patterns of fraudulent transaction to reduce costs
  • Medicine:
    • Understanding patient behavior to forecast surgery visits
    • Supporting doctors in identifying successful treatments depending on the patient's history
  • Retail:
    • Understanding shopping patterns to improve customer experience
    • Improving the effectiveness of marketing campaigns with better targeting
    • Analyzing real-time traffic data to find the quickest route for food delivery

So how does this translate to the realm of social media? The core of the matter is how users share their data through social media platforms. Organizations are no longer limited to analyzing the data they collect directly; they now have access to much more data.

This data is collected through well-engineered, language-agnostic APIs. A common practice among social media platforms is, in fact, to offer a Web API to developers who want to integrate their applications with particular social media functionalities.

Note

Application Programming Interface

An Application Programming Interface (API) is a set of procedure definitions and protocols that describe the behavior of a software component, such as a library or remote service, in terms of its allowed operations, inputs, and outputs. When using a third-party API, developers don't need to worry about the internals of the component, but only about how they can use it.

With the term Web API, we refer to a web service that exposes a number of URIs to the public, possibly behind an authentication layer, to access the data. A common architectural approach for designing this kind of API is called Representational State Transfer (REST). An API that implements the REST architecture is called a RESTful API. We still prefer the generic term Web API, as many of the existing APIs do not strictly follow the REST principles. For the purpose of this book, a deep understanding of the REST architecture is not required.
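To make the idea of "URIs behind an authentication layer" concrete, the following sketch builds (but never sends) a request using only the standard library. The base URI, the `posts` resource, the query parameters, and the token are all hypothetical placeholders, and the `Bearer` header is just one common convention, not the scheme of any specific platform.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URI and token: placeholders, not a real service.
BASE_URI = "https://api.example.com/v1"
ACCESS_TOKEN = "YOUR-ACCESS-TOKEN"


def build_request(resource, **params):
    """Build (but do not send) a GET request for a resource URI."""
    # Sort parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    url = f"{BASE_URI}/{resource}"
    if query:
        url = f"{url}?{query}"
    # Assumption: the service expects the token in an Authorization header.
    return Request(url, headers={"Authorization": f"Bearer {ACCESS_TOKEN}"})


request = build_request("posts", user="alice", limit=10)
print(request.get_method(), request.full_url)
```

In practice, the platform-specific Python libraries take care of building and sending such requests for us.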

Challenges

Some of the challenges of social media mining are inherited from the broader field of data mining.

When dealing with social data, we're often dealing with big data. To understand the meaning of big data and the challenges it entails, we can go back to the traditional definition (3D Data Management: Controlling Data Volume, Velocity and Variety, Doug Laney, 2001), which is also known as the three Vs of big data: volume, variety, and velocity. Over the years, this definition has been expanded by adding more Vs, most notably value, as providing value to an organization is one of the main purposes of exploiting big data. Regarding the original three Vs, volume means dealing with data that spans more than one machine. This, of course, requires a different infrastructure from small data processing (for example, in-memory). Moreover, volume is also associated with velocity, in the sense that data is growing so fast that the concept of big becomes a moving target. Finally, variety concerns how data is present in different formats and structures, often incompatible with one another and with different semantics. Data from social media can tick all three Vs.

The rise of big data has pushed the development of new approaches to database technologies towards a family of systems called NoSQL. The term is an umbrella for multiple database paradigms that share the common trait of moving away from traditional relational data, promoting dynamic schema design. While this book is not about database technologies, from this field, we can still appreciate the need for dealing with a mixture of well-structured, unstructured, and semi-structured data. The phrase structured data refers to information that is well organized and typically presented in a tabular form. For this reason, the connection with relational databases is immediate. The following table shows an example of structured data that represents books sold by a bookshop:

Title         | Genre             | Price
1984          | Political fiction | 12
War and Peace | War novel         | 10

This kind of data is structured as each represented item has a precise organization, specifically, three attributes called title, genre, and price.

The opposite of structured data is unstructured data, which is information without a predefined data model, or simply not organized according to a predefined data model. Unstructured data is typically in the form of textual data, for example, e-mails, documents, social media posts, and so on. Techniques presented throughout this book can be used to extract patterns in unstructured data to provide some structure.
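As a small taste of giving structure to unstructured text, the sketch below pulls hashtags and mentions out of a post with regular expressions. The post and the handles are invented, and the patterns are deliberately simplistic; real platforms have more detailed rules about what counts as a valid hashtag or username.

```python
import re

# An invented social media post.
post = "Loving the new #Python release! Thanks @alice and @bob_dev #opensource"

# \w+ matches letters, digits, and underscores after the marker character.
hashtags = re.findall(r"#(\w+)", post)
mentions = re.findall(r"@(\w+)", post)

# The free text now comes with a few structured attributes.
record = {"text": post, "hashtags": hashtags, "mentions": mentions}
print(record["hashtags"], record["mentions"])
```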

Between structured and unstructured data, we can find semi-structured data. In this case, the structure is either flexible or not fully predefined. It is sometimes also referred to as a self-describing structure. A typical example of data format that is semi-structured is JSON. As the name suggests, JSON borrows its notation from the programming language JavaScript. This data format has become extremely popular due to its wide use as a way to exchange data between client and server in a web application. The following snippet shows an example of the JSON representation that extends the previous book data:

[
  {
    "title": "1984",
    "price": 12,
    "author": "George Orwell",
    "genre": ["Political fiction", "Social science fiction"]
  },
  {
    "title": "War and Peace",
    "price": 10,
    "genre": ["Historical", "Romance", "War novel"]
  }
]

What we can observe from this example is that the first book has the author attribute, whereas this attribute is not present in the second book. Moreover, the genre attribute is presented here as a list, with a variable number of values. Both of these aspects are usually avoided in a well-structured (relational) data format, but are perfectly fine in JSON and, more generally, when dealing with semi-structured data.
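In Python, this flexibility means our code should not assume that every attribute is present. The following sketch parses the book data with the standard `json` module and falls back to a default when the author is missing; the fallback value `"unknown"` is an arbitrary choice made for this example.

```python
import json

raw = """
[
  {"title": "1984", "price": 12, "author": "George Orwell",
   "genre": ["Political fiction", "Social science fiction"]},
  {"title": "War and Peace", "price": 10,
   "genre": ["Historical", "Romance", "War novel"]}
]
"""

books = json.loads(raw)
for book in books:
    # dict.get() returns a default instead of raising KeyError.
    author = book.get("author", "unknown")
    genres = ", ".join(book["genre"])
    print(f"{book['title']} by {author} ({len(book['genre'])} genres: {genres})")
```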

The discussion on structured and unstructured data translates into handling different data formats and approaching data integrity in different ways. The phrase data integrity is used to capture the combination of challenges coming from the presence of dirty, inconsistent, or incomplete data.

The case of inconsistent and incomplete data is very common when analyzing user-generated content, and it calls for attention, especially with data from social media. It is very rare to observe users who share their data methodically, almost in a formal fashion. On the contrary, social media often consists of informal environments, with some contradictions. For example, if a user wants to complain about a product on the company's Facebook page, the user first needs to like the page itself, which is quite the opposite of being upset with a company due to the poor quality of their product. Understanding how users interact on social media platforms is crucial to design a good analysis.

Developing data mining applications also requires us to consider issues related to data access, particularly when company policies translate into the lack of data to analyze. In other words, data is not always openly available. As discussed in the Opportunities section, this is a little less of an issue in social media mining than in other corporate environments, as most social media platforms offer well-engineered, language-agnostic APIs that allow us to access the data we need. The availability of such data is, of course, still dependent on how users share their data and how they grant us access. For example, Facebook users can decide the level of detail that can be shown in their public profile and the details that can be shown only to their friends. Profile information, such as birthday, current location, and work history (as well as many more), can all be individually flagged as private or public. Similarly, when we try to access such data through the Facebook API, the users who sign up to our application have the opportunity to grant us access only to a limited subset of the data we are asking for.

One last general challenge of data mining lies in understanding the data mining process itself and being able to explain it. In other words, coming up with the right question before we start analyzing the data is not always straightforward. More often than not, research and development (R&D) processes are driven by exploratory analysis, in the sense that in order to understand how to tackle the problem, we first need to start experimenting with it. A related concept in statistics is described by the phrase correlation does not imply causation. Many statistical tests can be used to establish correlation between two variables, that is, two events occurring together, but this is not sufficient to establish a cause-effect relationship in either direction. Funny examples of bizarre correlations can be found all over the Web. A popular case was published in the New England Journal of Medicine, one of the most reputable medical journals, showing an interesting correlation between the amount of chocolate consumed per capita per country and the number of Nobel Prizes awarded (Chocolate Consumption, Cognitive Function, and Nobel Laureates, Franz H. Messerli, 2012).

When performing an exploratory analysis, it is important to keep in mind that correlation (two events occurring together) is a bidirectional relationship, while causation (event A has caused event B) is a unidirectional one. Does chocolate make you smarter, or do smart people like chocolate more than the average person? Do the two events occur together just by random chance? Is there a third, yet unseen, variable that plays some role in the correlation? Simply observing a correlation is not sufficient to describe causality, but it is often an interesting starting point to ask important questions about the data we are observing.
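The bidirectional nature of correlation is easy to verify numerically. The sketch below implements the Pearson correlation coefficient by hand (one common correlation measure, chosen here as an assumption) and applies it to two made-up series. The numbers are invented and have nothing to do with the study cited above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Two invented series that happen to grow together.
series_a = [1.0, 2.0, 3.0, 4.0, 5.0]
series_b = [2.0, 4.1, 5.9, 8.2, 9.8]

# Swapping the arguments gives the same value: correlation has no direction.
print(pearson(series_a, series_b))
print(pearson(series_b, series_a))
```

A coefficient close to 1 tells us the two series move together, but nothing in the formula distinguishes "A drives B" from "B drives A", or from a hidden third variable driving both.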

The following section generalizes the way our application interacts with a social media API and performs the desired analysis.

Social media mining techniques

This section briefly discusses the overall process for building a social media mining application, before digging into the details in the next chapters.

The process can be summarized in the following steps:

  1. Authentication
  2. Data collection
  3. Data cleaning and pre-processing
  4. Modeling and analysis
  5. Result presentation

Figure 1.2 shows an overview of the process:

Figure 1.2: The overall process of social media mining
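The five steps can be sketched as a chain of small functions. Everything here is a placeholder chosen for illustration: the function names are hypothetical, the token is fake, and the collection step returns hard-coded posts instead of calling a real API.

```python
def authenticate():
    """Step 1: obtain credentials (a placeholder, not a real token)."""
    return {"token": "PLACEHOLDER-TOKEN"}


def collect(credentials):
    """Step 2: fetch raw posts. Hard-coded here instead of calling an API."""
    if "token" not in credentials:
        raise ValueError("missing token")
    return ["  Python is GREAT ", "", "python rocks", "I prefer tea"]


def clean(raw_posts):
    """Step 3: normalize whitespace and case, and drop empty posts."""
    return [post.strip().lower() for post in raw_posts if post.strip()]


def analyze(posts, keyword):
    """Step 4: a trivial 'model', the share of posts mentioning a keyword."""
    hits = sum(keyword in post for post in posts)
    return hits / len(posts)


def present(share, keyword):
    """Step 5: turn the result into a human-readable summary."""
    return f"{share:.0%} of posts mention '{keyword}'"


credentials = authenticate()
posts = clean(collect(credentials))
summary = present(analyze(posts, "python"), "python")
print(summary)
```

Each function only depends on the output of the previous one, which mirrors how the collection step is tied to authentication and how cleaning is shaped by the analysis we intend to run.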

The authentication step is typically performed using the industry standard called Open Authorization (OAuth). The process is three-legged, meaning that it involves three actors: a user, a consumer (our application), and a resource provider (the social media platform). The steps in the process are as follows:

  1. The user agrees with the consumer to grant access to the social media platform.
  2. As the user doesn't give their social media password directly to the consumer, the consumer has an initial exchange with the resource provider to generate a token and a secret. These are used to sign each request and prevent forgery.
  3. The user is then redirected with the token to the resource provider, which will ask to confirm authorizing the consumer to access the user's data.
  4. Depending on the nature of the social media platform, it will also ask to confirm whether the consumer can perform any action on the user's behalf, for example, post an update, share a link, and so on.
  5. The resource provider issues a valid token for the consumer.
  6. The token can then go back to the user confirming the access.

Figure 1.3 shows the OAuth process with references to each of the steps described earlier. The aspect to remember is that the exchange of credentials (username/password) only happens between the user and the resource provider, in steps 3 and 4. All other exchanges are driven by tokens:

Figure 1.3: The OAuth process
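To give an idea of what "signing each request" in step 2 can look like, the following sketch computes an HMAC-SHA1 signature in the style of OAuth 1.0a. The text does not say which signing method a given platform uses, so this is an assumption; the URL, parameters, and secrets are invented, and `sign_request` is a hypothetical helper. In a real project, the platform's Python library performs this step for us.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """Percent-encode a value, leaving only unreserved characters as-is."""
    return quote(str(value), safe="")


def sign_request(method, url, params, consumer_secret, token_secret):
    """Return a base64-encoded HMAC-SHA1 signature for a request."""
    # Sort parameters so both sides build exactly the same string to sign.
    param_string = "&".join(
        f"{percent_encode(k)}={percent_encode(v)}"
        for k, v in sorted(params.items())
    )
    base_string = "&".join(
        [method.upper(), percent_encode(url), percent_encode(param_string)]
    )
    # The signing key combines the consumer secret and the token secret.
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    digest = hmac.new(key.encode(), base_string.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


# Invented values for illustration only.
params = {"status": "Hello world", "oauth_token": "demo-token"}
signature = sign_request(
    "POST", "https://api.example.com/update", params,
    consumer_secret="demo-consumer-secret", token_secret="demo-token-secret",
)
print(signature)
```

Because the secrets never travel with the request, anyone who alters the parameters in transit cannot produce a matching signature, which is how signing helps prevent forgery.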

From the user's perspective, this apparently complex process happens when the user is visiting our web app and hits the Login with Facebook (or Twitter, Google+, and so on) button. Then the user has to confirm that they are granting privileges to our app, and everything for them happens behind the scenes.

From a developer's perspective, the nice part is that the Python ecosystem already has well-established libraries for most social media platforms, which come with an implementation of the authentication process. As a developer, once you have registered your application with the target service, the platform will provide the necessary authorization tokens for your app. Figure 1.4 shows a screenshot of a custom Twitter app called Intro to Text Mining. On the Keys and Access Tokens configuration page, the developer can find the API key and secret, as well as the access token and access token secret. We'll discuss the details of the authorization for each social media platform in the relevant chapters:

Figure 1.4: Configuration page for a Twitter app called Intro to Text Mining. The page contains all the authorization tokens for the developers to use in their app.

The data collection, cleaning, and pre-processing steps are also dependent on the social media platform we are dealing with. In particular, the data collection step is tied to the initial authorization as we can only download data that we have been granted access to. Cleaning and pre-processing, on the other hand, are functional to the type of data modeling and analysis that we decide to employ to produce insights on the data.

Going back to Figure 1.2, modeling and analysis are performed by the component labeled ANALYTICS ENGINE. Typical data processing tasks that we'll encounter throughout this book are text mining and graph mining.

Text mining (also referred to as text analytics) is the process of deriving structured information from unstructured textual data. Text mining is applicable to most social media platforms, as the users are allowed to publish content in the form of posts or comments.

Some examples of text mining applications include the following:

  • Document classification: This is the task of assigning a document to one or more categories
  • Document clustering: This is the task of grouping documents into subsets (called clusters) that are coherent and distinct from one another (for example, by topic or sub-topic)
  • Document summarization: This is the task of creating a shortened version of the document in order to reduce the information overload to the user, while still retaining the most important aspects described in the original source
  • Entity extraction: This is the task of locating and classifying entity references from a text into some desired categories such as persons, locations, or organizations
  • Sentiment analysis: This is the task of identifying and categorizing sentiments and opinions expressed in a text in order to understand the attitude towards a particular product, topic, service, and so on

Not all these applications are tailored for social media, but the growing amount of textual data available through these platforms makes social media a natural playground for text mining.
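Many of these applications start from the same basic building block: breaking text into tokens and counting them. The sketch below computes simple term frequencies over a few invented posts. The tokenizer and the tiny stop word list are simplifying assumptions, not a recommendation for real projects.

```python
import re
from collections import Counter

# Invented posts for illustration.
posts = [
    "I love Python and the Python community",
    "Python is great for text mining",
    "Text mining is fun",
]

# A deliberately tiny stop word list (an assumption for this example).
STOPWORDS = {"the", "a", "is", "and", "to", "of", "i"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOPWORDS]


# Term frequencies across the whole collection of posts.
counts = Counter(token for post in posts for token in tokenize(post))
print(counts.most_common(3))
```

Counts like these turn unstructured text into structured features, which tasks such as classification or clustering can then build upon.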

Graph mining is also focused on the structure of the data. Graphs are a simple-to-understand, yet powerful, data structure that is generic enough to be applied to many different data representations. In graphs, there are two main components to consider: nodes, which represent entities or objects, and edges, which represent relationships or connections between nodes. In the context of social media, the obvious use of a graph is to represent the social relationships of our users. More generally, in the social sciences, the graph structure used to represent social relationships is also referred to as a social network.

In terms of using such a data structure within social media, we can naturally represent users as nodes, and their relationships (such as friends of or followers) as edges. In this way, information such as friends of friends who like Python becomes easily accessible just by traversing the graph (that is, walking from one node to the other by following the edges). Graph theory and graph mining offer more options to discover deeper insights that are not as clearly visible as in the previous example.
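The friends of friends who like Python example can be sketched with a plain dictionary that maps each user (node) to the set of their friends (edges). The users, their friendships, and their interests are invented, and `friends_of_friends` is a hypothetical helper name.

```python
# An undirected friendship graph stored as adjacency sets (invented data).
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "erin"},
    "dave": {"bob"},
    "erin": {"carol"},
}

# Invented interest data: the users who like Python.
likes_python = {"bob", "dave"}


def friends_of_friends(graph, user):
    """Return users exactly two hops away from user in the graph."""
    direct = graph[user]
    two_hops = set()
    for friend in direct:
        # Follow the edges leaving each direct friend.
        two_hops |= graph[friend]
    # Exclude the user themselves and their direct friends.
    return two_hops - direct - {user}


candidates = friends_of_friends(friends, "alice")
print(sorted(candidates & likes_python))
```

For alice, the traversal reaches dave and erin in two hops, and intersecting with the set of Python fans leaves only dave.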

After a high-level discussion on social media mining, the following section will introduce some of the useful Python tools that are commonly used in data mining projects.