
Building complex data pipelines


As soon as the data tools we're building grow into something bigger than a simple script, it's useful to split the data pre-processing tasks into small units, in order to map out all the steps and dependencies of the data pipeline.

By the term data pipeline, we mean a sequence of data processing operations that cleans, augments, and manipulates the original data, transforming it into something digestible by the analytics engine. Any non-trivial data analytics project will require a data pipeline composed of a number of such steps.

In the prototyping phase, it is common to split these steps into different scripts, which are then run individually, for example:

$ python download_some_data.py
$ python clean_some_data.py
$ python augment_some_data.py

Each script in this example produces the output for the following script, so there are dependencies between the different steps. We could refactor the data processing scripts into one large script that does everything, and then run it in one go:

$ python do_everything.py

The content of such a script might look similar to the following code:

if __name__ == '__main__':
    # each function encapsulates the main logic of one of the initial scripts
    download_some_data()
    clean_some_data()
    augment_some_data()

Each of the preceding functions contains the main logic of the corresponding individual script. The problem with this approach is that errors can occur at any stage of the pipeline, so we should also include a fair amount of boilerplate code, with try and except clauses, to retain control over the exceptions that might occur. Moreover, parameterizing this kind of code can feel a little clumsy.
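
To make the problem concrete, a version of do_everything.py with basic error handling and a single command-line parameter might look like the following sketch. The function bodies are placeholders, and the exception types and the day parameter are purely illustrative assumptions:

import sys

def download_some_data(day):
    pass  # placeholder for the original download logic

def clean_some_data(day):
    pass  # placeholder for the original cleaning logic

def augment_some_data(day):
    pass  # placeholder for the original augmentation logic

def do_everything(day):
    try:
        download_some_data(day)
    except IOError as e:
        # boilerplate needed at every step to keep control over failures
        print("Download failed: {}".format(e))
        return
    try:
        clean_some_data(day)
        augment_some_data(day)
    except (IOError, ValueError) as e:
        print("Processing failed: {}".format(e))
        return

if __name__ == '__main__':
    # clumsy parameterization, for example: $ python do_everything.py 2016-01-01
    do_everything(sys.argv[1])

Even in this tiny example, the error-handling boilerplate already outweighs the actual pipeline logic, and adding more steps or parameters only makes it worse.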

In general, when moving from a prototype to something more stable, it's worth considering the use of a data orchestrator, also called a workflow manager. A good example of this kind of tool in Python is Luigi, an open source project developed at Spotify. The advantages of using a data orchestrator such as Luigi include the following:

  • Task templates: Each data task is defined as a class with a few methods that define how the task runs, its dependencies, and its output (see the sketch after this list)
  • Dependency graph: Visual tools help the data engineer visualize and understand the dependencies between tasks
  • Recovery from intermediate failure: If the data pipeline fails halfway through the tasks, it's possible to restart it from the last consistent state
  • Integration with the command line, as well as with system job schedulers such as cron
  • Customizable error reporting
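
To give a flavour of the task templates point, the following is a minimal sketch of two Luigi tasks mirroring the download and cleaning steps from the earlier scripts. The class names, file paths, and the day parameter are hypothetical, but requires(), output(), and run() are the standard methods a Luigi task implements:

import luigi

class DownloadSomeData(luigi.Task):
    # hypothetical parameter identifying one run of the pipeline
    day = luigi.DateParameter()

    def output(self):
        # the target file that marks this task as complete
        return luigi.LocalTarget('raw_data_{}.json'.format(self.day))

    def run(self):
        with self.output().open('w') as fout:
            fout.write('...')  # placeholder for the actual download logic

class CleanSomeData(luigi.Task):
    day = luigi.DateParameter()

    def requires(self):
        # dependency: Luigi will run DownloadSomeData first if needed
        return DownloadSomeData(day=self.day)

    def output(self):
        return luigi.LocalTarget('clean_data_{}.json'.format(self.day))

    def run(self):
        with self.input().open('r') as fin, self.output().open('w') as fout:
            fout.write(fin.read())  # placeholder for the actual cleaning logic

if __name__ == '__main__':
    luigi.run()

If the cleaning step fails, re-running the pipeline will not repeat the download, because Luigi considers a task complete once its output target already exists.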

We won't dig into all the features of Luigi, as a detailed discussion would go beyond the scope of this book, but readers are encouraged to take a look at this tool and use it to produce more elegant, reproducible, maintainable, and extensible data pipelines.
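
As a starting point, and assuming the sketch above is saved as pipeline.py, a single task (together with its dependencies) can be run from the command line using Luigi's local scheduler:

$ python pipeline.py CleanSomeData --day 2016-01-01 --local-scheduler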