Mastering Social Media Mining with R

Overview of this book

As the number of web users has grown, the volume of content they generate has increased substantially, creating a need to gain insights from the largely untapped gold mine that is social media data. For computational statistics, R has an advantage over other languages in providing readily available data extraction and transformation packages, which make ETL tasks easier to carry out. Its data visualization packages help users better understand the underlying data distributions, while its range of "standard" statistical packages simplifies analysis of the data. This book will teach you how business cases are solved by applying machine learning techniques to social media data. You will learn about important and recent developments in the field of social media, along with a few advanced topics such as Open Authorization (OAuth). Through practical examples, you will access data from R using the APIs of various social media sites such as Twitter, Facebook, Instagram, GitHub, Foursquare, LinkedIn, Blogger, and other networks. We will provide detailed explanations of how to implement various use cases in R. With this handy guide, you will be ready to embark on your journey as an independent social media analyst.

Challenges for social media mining


Social media mining is still in its infancy, and its practitioners are learning and developing new approaches. It draws on many fields, such as statistics, machine learning, information retrieval, pattern recognition, and bioinformatics, and these parent fields are not without challenges of their own. The sheer amount of data generated daily is staggering, but current techniques, built on the fundamental concepts, theories, and algorithms of these fields, allow for novel data mining solutions and scalable computational models.

In social media theory, people are considered the basic building blocks of the world that social media creates. Measuring the interactions between these building blocks and other entities, such as sites, networks, and content, leads to discoveries about human nature. The knowledge gained via these measurements constitutes the soul of the social worlds. Finding insights in this data, where social relationships play a critical role, can be termed the mining of social media data. This problem faces not only the basic data mining challenges but also those that emerge from the social-relationship aspect. Some of the important challenges are listed here:

  • Big Data: Should we use the tastes of a friend of a friend of the person of interest, who studied at one particular college and whose hometown was one particular city, to recommend something to the person of interest? In some applications this might be overkill, and in others it could yield a very small but differentiating performance increase. The data available in social media can be very deep. However, using all of it can lead to a problem called overfitting, which is well known in machine learning: the model fits quirks of the training data that do not generalize. Combining multiple sources of data can hurt overall performance in a similar fashion.

  • Sufficiency: Should we restrict ourselves to the person of interest's alma mater and hometown when recommending something, and ignore the tastes of his/her friends? Common sense says this is not correct and that we may be missing out on something. This problem is commonly known as underfitting. It can also arise because most social media networks restrict the amount of information that can be accessed in a given time frame, so sometimes the data is not sufficient to find patterns or generate recommendations.

  • Noise removal error: Preprocessing steps are almost always required in any application of data mining. These steps not only make the actual application run faster on the cleaned data but can also improve overall accuracy. Because of the clutter present in most social data, a large amount of noise is to be expected, but removing it effectively is tricky: you can always end up discarding useful information along with the noise. What counts as noise is subjective, so signal can be mistaken for noise; hence, this step can end up introducing more error in pattern recognition.

  • Evaluation dilemma: Because of the sheer size of social media data, it is usually not feasible to obtain a properly annotated dataset to train a supervised machine-learning algorithm. Without proper ground truth data, there is no way to judge the accuracy of an off-the-shelf classification algorithm. Since accuracy cannot be measured without ground truth, often only a clustering (unsupervised machine learning) algorithm can be applied. The problem is that such algorithms rely heavily on domain expertise.
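
The tension between using too much information (overfitting) and too little (underfitting) can be sketched in a few lines of base R. This is a minimal illustration on synthetic data, not on data from any social network: the noisy sine curve, the sample size, the polynomial degrees, and the helper name rmse are all assumptions chosen for the example. Each model is fitted on one half of the data and scored on both halves.

```r
# Synthetic data only: a noisy sine curve stands in for a real signal.
set.seed(42)
n <- 100
x <- runif(n, 0, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Hold out half of the rows to estimate error on unseen data.
train_idx <- sample(n, n / 2)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Root mean squared error (hypothetical helper for this sketch).
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit polynomials of increasing flexibility and record both errors.
degrees <- c(1, 3, 15)
results <- do.call(rbind, lapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  data.frame(
    degree     = d,
    train_rmse = rmse(train$y, predict(fit, train)),
    test_rmse  = rmse(test$y, predict(fit, test))
  )
}))

print(results)
```

The straight line (degree 1) is too simple for the curve, so it should score poorly on both halves, which is the underfitting case. Training error can only fall as the degree rises, because each model contains the previous one, but the held-out error of the most flexible fit typically stops improving or gets worse, which is the signature of overfitting. The exact numbers depend on the random seed.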