RStudio for R Statistical Computing Cookbook

By Andrea Cirillo

Overview of this book

The need to handle complex datasets, perform sophisticated statistical analyses, and provide real-time visualizations to businesses challenges statisticians and analysts across the globe. RStudio is a useful and powerful tool for statistical analysis that harnesses the power of R for computational statistics, visualization, and data science in an integrated development environment. This book is a collection of recipes that will help you learn and understand RStudio's features so that you can effectively perform statistical analysis and reporting, code editing, and R development. The first few chapters will teach you how to set up your own data analysis project in RStudio, acquire data from different data sources, and manipulate and clean data for analysis and visualization purposes. You'll get hands-on with various data visualization methods using ggplot2, and you will create interactive and multidimensional visualizations with D3.js. Additional recipes will help you optimize your code; implement various statistical models to manage large datasets; perform text analysis and predictive analysis; and master time series analysis, machine learning, and forecasting. In the final few chapters, you'll learn how to create reports from your analytical application with the full range of static and dynamic reporting tools available in RStudio, so that you can effectively communicate results and even transform them into interactive web applications.

Getting data from Twitter with the twitteR package


Twitter is an unbeatable source of data for nearly every kind of data-driven problem.

If my words are not enough to convince you (and they shouldn't be), you can always perform a quick Google search, for instance, for text analytics with Twitter, and browse the more than 30 million results to be sure.

This should not surprise you: Twitter's huge and widespread user base, together with the relative structure and richness of metadata of the content on the platform, makes this social network a go-to place for data analysis projects, especially those involving sentiment analysis and customer segmentation.

R offers a really well-developed package named twitteR, written by Jeff Gentry, which provides a function for nearly every feature that Twitter makes available through its API. The following recipe covers the typical use of the package: getting tweets related to a topic.

Getting ready

First of all, we have to install and load the twitteR package by running the following code:

install.packages("twitteR")
library(twitteR)

How to do it…

  1. As seen with the general procedure, in order to access the Twitter API, you will need to create a new application. This link (assuming you are already logged in to Twitter) will do the job: https://apps.twitter.com/app/new.

    Feel free to give your app whatever name, description, and website you want. The callback URL can also be left blank.

    After creating the app, you will have access to an API key and an API secret, namely Consumer Key and Consumer Secret, in the Keys and Access Tokens tab in your app settings.

    Below the section containing these tokens, you will find a section called Your Access Token. These tokens are required to let the app perform actions on your account's behalf. For instance, you might want to send a direct message to every new follower and could therefore write an app that does this automatically.

    Keep a note of these tokens as well, since you will need them to set up your connection within R.

  2. Then, we will get access to the API from R. In order to authenticate your app and use it to retrieve data from Twitter, you will just need to run a line of code, specifically, the setup_twitter_oauth() function, by passing the following arguments:

    • consumer_key

    • consumer_secret

    • access_token

    • access_secret

      You can get these tokens from your app settings:

      setup_twitter_oauth(consumer_key    = "consumer_key",
                          consumer_secret = "consumer_secret",
                          access_token    = "access_token",
                          access_secret   = "access_secret")
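
      As a small optional refinement (not part of the original recipe), you can keep the credentials in variables, so that the call reads cleanly and the secrets are defined in one place:

      # Replace the placeholders with the values from your app settings
      api_key      <- "consumer_key"
      api_secret   <- "consumer_secret"
      token        <- "access_token"
      token_secret <- "access_secret"

      setup_twitter_oauth(api_key, api_secret, token, token_secret)
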
  3. Now, we will query Twitter and store the resulting data. We are finally ready for the core part: getting data from Twitter. Since we are looking for tweets pertaining to a specific topic, we are going to use the searchTwitter() function. This function allows you to specify a good number of parameters besides the search string. You can define the following:

    • n: This is the number of tweets to be downloaded.

    • lang: This is the language specified with the ISO 639-1 code. You can find a partial list of this code at https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes.

    • since – until: These are time parameters that define a range of time, where dates are expressed as YYYY-MM-DD, for instance, 2012-05-12.

    • geocode: This restricts results to tweets from users located within a given radius of a point, expressed as latitude, longitude, and radius, either in miles or kilometers, for example, 38.481157,-130.500342,1mi.

    • sinceID – maxID: These bound the results to a range of tweet (status) IDs.

    • resultType: This is used to filter results based on popularity. Possible values are 'mixed', 'recent', and 'popular'.

    • retryOnRateLimit: This is the number that defines how many times the query will be retried if the API rate limit is reached.

    Supposing that we are interested in tweets regarding data science with R, we run the following function:

    tweet_list <- searchTwitter('data science with R', n = 450)  
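
    As a further illustration (a minimal sketch; the dates and values here are made up, while the parameter names come from the list above), several of these parameters can be combined in a single call:

    # Hypothetical example: up to 100 recent English-language tweets
    # posted within a one-week window
    tweet_list_recent <- searchTwitter('data science with R',
                                       n          = 100,
                                       lang       = 'en',
                                       since      = '2015-10-01',
                                       until      = '2015-10-08',
                                       resultType = 'recent')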

    Tip

    Performing an exact-phrase search with twitteR

    Searching Twitter for a specific sequence of characters is possible by surrounding the query with double quotes, for instance, "data science with R". Consequently, if you are looking to retrieve tweets corresponding to a specific sequence of characters, you will have to run a line of code similar to the following:

     tweet_list <- searchTwitter('"data science with R"', n = 450)

    tweet_list will be a list of the first 450 tweets resulting from the given query.

    Be aware that since n is the maximum number of tweets retrievable, you may get fewer tweets if the given query yields fewer than n results.

    Each element of the list will show the following attributes:

    • text

    • favorited

    • favoriteCount

    • replyToSN

    • created

    • truncated

    • replyToSID

    • id

    • replyToUID

    • statusSource

    • screenName

    • retweetCount

    • isRetweet

    • retweeted

    • longitude

    • latitude
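
      For instance (a minimal sketch, assuming the query above returned at least one result), you can inspect these attributes on a single element of the list:

      # Each element of tweet_list is a status object whose
      # fields can be read with the $ operator
      first_tweet <- tweet_list[[1]]
      first_tweet$text          # the tweet's text
      first_tweet$screenName    # the author's screen name
      first_tweet$retweetCount  # how many times it was retweeted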

      In order to let you work on this data more easily, a specific function is provided to transform this list into a more convenient data.frame, namely, the twListToDF() function.

      After this, we can run the following line of code:

      tweet_df   <-  twListToDF(tweet_list)

      This will result in a tweet_df object that has the following structure:

      > str(tweet_df)
      'data.frame':  20 obs. of  16 variables:
       $ text         : chr  "95% off  Applied Data Science with R - 
       $ favorited    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
       $ favoriteCount: num  0 2 0 2 0 0 0 0 0 1 ...
       $ replyToSN    : logi  NA NA NA NA NA NA ...
       $ created      : POSIXct, format: "2015-10-16 09:03:32" "2015-10-15 17:40:33" "2015-10-15 11:33:37" "2015-10-15 05:17:59" ...
       $ truncated    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
       $ replyToSID   : logi  NA NA NA NA NA NA ...
       $ id           : chr  "654945762384740352" "654713487097135104" "654621142179819520" "654526612688375808" ...
       $ replyToUID   : logi  NA NA NA NA NA NA ...
       $ statusSource : chr  "<a href=\"http://learnviral.com/\" rel=\"nofollow\">Learn Viral</a>" "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>" "<a href=\"http://not.yet/\" rel=\"nofollow\">final one kk</a>" "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>" ...
       $ screenName   : chr  "Learn_Viral" "WinVectorLLC" "retweetjava" "verystrongjoe" ...
       $ retweetCount : num  0 0 1 1 0 0 0 2 2 2 ...
       $ isRetweet    : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
       $ retweeted    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
       $ longitude    : logi  NA NA NA NA NA NA ...
       $ latitude     : logi  NA NA NA NA NA NA ...
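
    As a quick example of working with this data.frame (a hedged sketch; the column names come from the structure above), you can peek at the most retweeted tweets in the sample:

    # Show the five most retweeted tweets in the sample
    head(tweet_df[order(-tweet_df$retweetCount),
                  c("screenName", "retweetCount")], 5)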
      

    Referring you to the data visualization chapters for more advanced techniques, we will now quickly visualize the retweet distribution of our tweets, leveraging the base R hist() function:

    hist(tweet_df$retweetCount)

    This code will result in a histogram with the number of retweets on the x axis and the frequency of those counts on the y axis.
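
    If you want a slightly more self-explanatory plot (an optional embellishment, not part of the original recipe), hist() also accepts a title and axis labels directly:

    hist(tweet_df$retweetCount,
         main = "Retweet distribution",
         xlab = "Number of retweets")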

There's more...

As stated in the official Twitter documentation, particularly at https://dev.twitter.com/rest/public/rate-limits, there is a limit to the number of tweets you can retrieve within a certain period of time, and this limit is set to 450 every 15 minutes.

However, what if you are engaged in a really substantial job and you want to base your work on a significant number of tweets? Should you set the n argument of searchTwitter() to 450 and wait out fifteen everlasting minutes between queries? Not quite: the twitteR package provides a convenient way to overcome this limit through the register_db_backend(), register_sqlite_backend(), and register_mysql_backend() functions. These functions allow you to create a connection with the named type of database, passing the database name, host, username, and password as arguments, as you can see in the following example:

    register_mysql_backend("db_name", "host","user","password")
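
If you don't have a MySQL server at hand, a local SQLite file works as a lighter-weight backend (a minimal sketch; the file name here is made up):

    # Requires the RSQLite package to be installed
    register_sqlite_backend("tweets_backup.sqlite")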

You can now leverage the search_twitter_and_store() function, which stores the search results in the connected database. The main feature of this function is the retryOnRateLimit argument, which lets you specify how many times the query should be retried once the API rate limit is reached. Setting this argument to a conveniently high level will likely let you ride out the 15-minute interval:

tweets_db <- search_twitter_and_store("data science R", retryOnRateLimit = 20)

Retrieving stored data will now just require you to run the following code:

    from_db <- load_tweets_db()
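
By default, this returns the stored tweets as a list of status objects; if you prefer the data.frame form used earlier, the function also accepts an as.data.frame argument (a hedged sketch based on the package's documented signature):

    from_db_df <- load_tweets_db(as.data.frame = TRUE)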