Book Image

Python Social Media Analytics

By : Baihaqi Siregar, Siddhartha Chatterjee, Michal Krystyanczuk
Book Image

Python Social Media Analytics

By: Baihaqi Siregar, Siddhartha Chatterjee, Michal Krystyanczuk

Overview of this book

Social Media platforms such as Facebook, Twitter, Forums, Pinterest, and YouTube have become part of everyday life in a big way. However, these complex and noisy data streams pose a potent challenge to everyone when it comes to harnessing them properly and benefiting from them. This book will introduce you to the concept of social media analytics, and how you can leverage its capabilities to empower your business. Right from acquiring data from various social networking sources such as Twitter, Facebook, YouTube, Pinterest, and social forums, you will see how to clean data and make it ready for analytical operations using various Python APIs. This book explains how to structure the clean data obtained and store in MongoDB using PyMongo. You will also perform web scraping and visualize data using Scrappy and Beautifulsoup. Finally, you will be introduced to different techniques to perform analytics at scale for your social data on the cloud, using Python and Spark. By the end of this book, you will be able to utilize the power of Python to gain valuable insights from social media data and use them to enhance your business processes.
Table of Contents (17 chapters)
Title Page
Credits
About the Authors
Acknowledgments
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Analyzing the data


In this section, we briefly introduce the main techniques which lie behind the social media analysis process and bring intelligence to the data. We also present how to deal with reasonably big amount of data using our development environment. However, it is worth noting that the problem of scaling and dealing with massive data will be analyzed in Chapter 9, Social Data Analytics at Scale - Spark and Amazon Web Services.

Brief introduction to machine learning

The recent growth in the volume of data created by mobile devices and social networks has dramatically impacted the need for high performance computation and new methods of analysis. Historically, large quantities of data (big data) were analyzed by statistical approaches which were based on sampling and inductive reasoning to derive knowledge from data. A more recent development of artificial intelligence, and more specifically, machine learning, enabled not only the ability to deal with large volume of data, but it brought a tremendous value to businesses and consumers by extracting valuable insights and hidden patterns.

Machine learning is not new. In 1959, Arthur Samuel defined machine learning as:

Field of study that gives computers ability to learn without being specifically programmed for it.

Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that allow to This approach is similar to a person who increases his knowledge on a subject by reading more and more books on the subject. There are three main approaches in machine learning: supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning assumes that we know what the outputs are of each data point. For example, we learn that a car that costs $80,000, which has an electric engine and acceleration of 0-100 km/h in 3 seconds, is called Tesla; another car, which costs $40,000, has a diesel engine, and acceleration of 0-100 km/h in 9.2 seconds, is called Toyota; and so on. Then, when we look for the name of a car which costs $35,000, has acceleration of 0-100 km/h in 9.8 seconds, and has a diesel engine, it is most probably Toyota and not Tesla.

Unsupervised learning is used when we do not know the outputs. In the case of cars, we only have technical specifications: acceleration, price, engine type. Then we cluster the data points into different groups (clusters) of similar cars. In our case, we will have the clusters with similar price and engine types. Then, we understand similarities and differences between the cars.

The third type of machine learning is reinforcement learning, which is used more in artificial intelligence applications. It consists of devising an algorithm that learns how to behave based on a system of rewards. This kind of learning is similar to the natural human learning process. It can be used in teaching an algorithm how to play chess. In the first step, we define the environment-the chess board and all possible moves. Then the algorithm starts by making random moves and earns positive or negative rewards. When a reward is positive, it means that the move was successful, and when it is negative, it means that it has to avoid such moves in the future. After thousands of games, it finishes by knowing all the best sequences of moves.

In real-life applications, many hybrid approaches are widely used, based on available data and the complexity of problems.

Techniques for social media analysis

Machine learning is a basic tool to add intelligence and extract valuable insights from social media data. There exist other widespread concepts that are used for social media analysis: Text Analytics, Natural Language Processing, and Graph Mining.

The first notion allows to retrieve non trivial information from textual data, such as brands or people names, relationships between words, extraction of phone numbers, URLs, hashtags, and so on. Natural Language Processing is more extensive and aims at finding the meaning of the text by analyzing text structure, semantics, and concepts among others.

Social networks can also be represented by graph structures. The last mining technique enables the structural analysis of such networks. These methods help in discovering relationships, paths, connections and clusters of people, brands, topics, and so on, in social networks.

The applications of all the techniques will be presented in following chapters.

Setting up data structure libraries

In our analysis, we will use some libraries that enable flexible data structures, such as pandas and sframe. The advantage of sframe over pandas is that it helps to deal with very big datasets which do not fit RAM memory. We will also use a pymongo library to pull collected data from MongoDB, as shown in the following code:

pip3 install pandas, sframe, pymongo 

All necessary machine learning libraries will be presented in corresponding chapters.