R Data Analysis Projects

Overview of this book

R offers a large variety of packages and libraries for fast and accurate data analysis and visualization. As a result, it’s one of the languages most widely used by data scientists, analysts, and anyone else who wants to perform data analysis. This book demonstrates how to put your existing knowledge of data analysis in R to use in building efficient, end-to-end data analysis pipelines. You’ll start by building a content-based recommendation system, followed by a project on sentiment analysis with tweets. You’ll implement time-series modeling for anomaly detection, and understand cluster analysis of streaming data. You’ll work through projects on performing efficient market data research, building recommendation systems, and analyzing networks accurately, all provided with easy-to-follow code. With the help of these real-world projects, you’ll get a better understanding of the challenges faced when building data analysis pipelines, and see how you can overcome them without compromising on the efficiency or accuracy of your systems. The book covers widely used R packages such as dplyr, ggplot2, and RShiny, and includes tips on using them effectively. By the end of this book, you’ll have a better understanding of data analysis with R and be able to put your knowledge to practical use.

Wrapping up


The final step in any data analysis project is documentation: either generating a report of the findings or documenting the scripts and data used. In our case, we are going to wrap up with a small application built with RShiny, a framework for developing interactive web applications using R. We will reuse the code that we have written to give our retail customers a simple, yet powerful, user interface.

To keep things simple, the application has three screens. The first screen, shown in the following screenshot, allows the user to vary the support and confidence thresholds and view the rules generated. It also shows additional interest measures: lift, conviction, and leverage. The user can sort the rules by any of these interest measures.
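To make the interest measures on the first screen concrete, here is a minimal base R sketch that computes them by hand for a single rule. The baskets are made-up data and the helper names item.support and rule.measures are hypothetical; the app itself obtains these values from arules.

```r
# Toy baskets (made-up data, for illustration only)
baskets <- list(
  c("bread", "butter"),
  c("bread", "butter", "milk"),
  c("bread"),
  c("milk"),
  c("butter", "milk")
)

# Fraction of baskets that contain every item in `items`
item.support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Interest measures for the rule lhs => rhs
rule.measures <- function(lhs, rhs, baskets) {
  supp.lhs  <- item.support(lhs, baskets)
  supp.rhs  <- item.support(rhs, baskets)
  supp.rule <- item.support(c(lhs, rhs), baskets)
  conf      <- supp.rule / supp.lhs
  c(support    = supp.rule,
    confidence = conf,
    lift       = conf / supp.rhs,
    leverage   = supp.rule - supp.lhs * supp.rhs,
    conviction = (1 - supp.rhs) / (1 - conf))
}

m <- rule.measures("bread", "butter", baskets)
print(round(m, 3))
```

For {bread} => {butter} in this toy data, the lift is above 1 and the leverage is positive, so the two items appear together slightly more often than they would if they were independent.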

Another screen is a scatter plot representation of the rules:

Finally, a graph representation shows the product groupings, which makes it easy to select products for the cross-selling campaign:

The complete source code is available in <../App.R>:

########################################################################
#
# R Data Analysis Projects
#
# Chapter 1
#
# Building Recommender System
# A step-by-step approach to build Association Rule Mining
#
# Script:
#
# RShiny app
#
# Gopi Subramanian
#########################################################################
library(shiny)
library(plotly)
library(arules)
library(igraph)
library(arulesViz)
get.txn <- function(data.path, columns){
  # Get transaction object for a given data file
  #
  # Args:
  #   data.path: data file name location
  #   columns: transaction id and item id columns.
  #
  # Returns:
  #   transaction object
  transactions.obj <- read.transactions(file = data.path, format = "single",
                                        sep = ",",
                                        cols = columns,
                                        rm.duplicates = FALSE,
                                        quote = "", skip = 0,
                                        encoding = "unknown")
  return(transactions.obj)
}
get.rules <- function(support, confidence, transactions){
  # Get Apriori rules for given support and confidence values
  #
  # Args:
  #   support: support parameter
  #   confidence: confidence parameter
  #   transactions: transaction object to mine
  #
  # Returns:
  #   rules object
  parameters <- list(
    support = support,
    confidence = confidence,
    minlen = 2, # Minimal number of items per item set
    maxlen = 10, # Maximal number of items per item set
    target = "rules"
  )

  rules <- apriori(transactions, parameter = parameters)
  return(rules)
}
find.rules <- function(transactions, support, confidence, topN = 10){
  # Generate and prune the rules for given support and confidence values
  #
  # Args:
  #   transactions: Transaction object, list of transactions
  #   support: Minimum support threshold
  #   confidence: Minimum confidence threshold
  #   topN: Number of rules to keep, ranked by leverage
  # Returns:
  #   A data frame with the best set of rules and their interest measures

  # Get rules for given combination of support and confidence
  all.rules <- get.rules(support, confidence, transactions)

  rules.df <- data.frame(rules = labels(all.rules),
                         all.rules@quality)

  other.im <- interestMeasure(all.rules, transactions = transactions)

  rules.df <- cbind(rules.df, other.im[, c('conviction', 'leverage')])

  # Keep the best rules based on the interest measure
  best.rules.df <- head(rules.df[order(-rules.df$leverage), ], topN)

  return(best.rules.df)
}
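plot.graph <- function(cross.sell.rules){
  # Build a graph of the associated items
  #
  # Args:
  #   cross.sell.rules: Set of final rules recommended
  # Returns:
  #   igraph graph object, to be drawn with plot()

  # Split each rule label at "=>" and trim the surrounding spaces, so that
  # an item set gets the same vertex name on either side of a rule
  edges <- trimws(unlist(strsplit(cross.sell.rules$rules, split = '=>', fixed = TRUE)))
  g <- make_graph(edges = edges)
  return(g)
}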
columns <- c("order_id", "product_id") ## columns of interest in data file
data.path <- '../../data/data.csv' ## Path to data file
transactions.obj <- get.txn(data.path, columns) ## create txn object
server <- function(input, output) {
  cross.sell.rules <- reactive({
    support <- input$Support
    confidence <- input$Confidence
    cross.sell.rules <- find.rules(transactions.obj, support, confidence)
    cross.sell.rules$rules <- as.character(cross.sell.rules$rules)
    return(cross.sell.rules)
  })

  gen.rules <- reactive({
    support <- input$Support
    confidence <- input$Confidence
    gen.rules <- get.rules(support, confidence, transactions.obj)
    return(gen.rules)
  })

  output$rulesTable <- DT::renderDataTable({
    cross.sell.rules()
  })

  output$graphPlot <- renderPlot({
    g <- plot.graph(cross.sell.rules())
    plot(g)
  })

  output$explorePlot <- renderPlot({
    plot(x = gen.rules(), method = NULL,
         measure = "support",
         shading = "lift", interactive = FALSE)
  })
}
ui <- fluidPage(
  headerPanel(title = "X-Sell Recommendations"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("Support", "Support threshold:", min = 0.01, max = 1.0, value = 0.01),
      sliderInput("Confidence", "Confidence threshold:", min = 0.05, max = 1.0, value = 0.05)
    ),
    mainPanel(
      tabsetPanel(
        id = 'xsell',
        tabPanel('Rules', DT::dataTableOutput('rulesTable')),
        tabPanel('Explore', plotOutput('explorePlot')),
        tabPanel('Item Groups', plotOutput('graphPlot'))
      )
    )
  )
)
shinyApp(ui = ui, server = server)

We have described the get.txn, get.rules, and find.rules functions in the previous section, so we will not go through them again here. The preceding code is a single-file RShiny app: both the server and the UI components reside in the same file.

The UI component is as follows:

ui <- fluidPage(
  headerPanel(title = "X-Sell Recommendations"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("Support", "Support threshold:", min = 0.01, max = 1.0, value = 0.01),
      sliderInput("Confidence", "Confidence threshold:", min = 0.05, max = 1.0, value = 0.05)
    ),
    mainPanel(
      tabsetPanel(
        id = 'xsell',
        tabPanel('Rules', DT::dataTableOutput('rulesTable')),
        tabPanel('Explore', plotOutput('explorePlot')),
        tabPanel('Item Groups', plotOutput('graphPlot'))
      )
    )
  )
)

We define the screen layout in this section. This section can also be kept in a separate file called UI.R. The page is made up of two sections: a panel on the left, defined by sidebarPanel, and a main section, defined by mainPanel. Inside the sidebar, we have defined two slider controls, one for the support threshold and one for the confidence threshold. The main panel contains a tabbed window, defined by tabsetPanel, in which each tab is defined by tabPanel.

The main panel has three tabs, each with one output slot: a table for the final set of rules with their interest measures, a scatter plot of the rules, and a graph plot of the rules.
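The ids given to the inputs and output slots are what tie the UI to the server. As a rough sketch, the layout can be rendered to HTML and checked for those ids. The name ui.sketch is hypothetical, and tableOutput stands in for DT::dataTableOutput so that only the shiny package is needed; both are assumptions made for this illustration.

```r
library(shiny)

# Same layout as the chapter's UI. tableOutput replaces DT::dataTableOutput
# here (an assumption for this sketch) so that only shiny is required.
ui.sketch <- fluidPage(
  headerPanel(title = "X-Sell Recommendations"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("Support", "Support threshold:", min = 0.01, max = 1.0, value = 0.01),
      sliderInput("Confidence", "Confidence threshold:", min = 0.05, max = 1.0, value = 0.05)
    ),
    mainPanel(
      tabsetPanel(
        id = "xsell",
        tabPanel("Rules", tableOutput("rulesTable")),
        tabPanel("Explore", plotOutput("explorePlot")),
        tabPanel("Item Groups", plotOutput("graphPlot"))
      )
    )
  )
)

# The UI object is just HTML; look for the ids that the server code
# refers to through input$... and output$...
html <- as.character(ui.sketch)
ids <- c("Support", "Confidence", "rulesTable", "explorePlot", "graphPlot")
found <- vapply(ids,
                function(id) grepl(paste0('id="', id, '"'), html, fixed = TRUE),
                logical(1))
print(found)
```

A misspelled id on either side does not raise an error in Shiny; the control or plot simply stays empty, so a check like this can be a useful sanity test.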

The server component is as follows:

server <- function(input, output) {
  cross.sell.rules <- reactive({
    support <- input$Support
    confidence <- input$Confidence
    cross.sell.rules <- find.rules(transactions.obj, support, confidence)
    cross.sell.rules$rules <- as.character(cross.sell.rules$rules)
    return(cross.sell.rules)
  })

The cross.sell.rules data frame is defined as a reactive component. When the value of the support or confidence threshold changes in the UI, the cross.sell.rules data frame is recomputed. This data frame is served to the first tab, where we have defined a slot for it called rulesTable.
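A reactive component caches its value and only recomputes after one of the inputs it reads has changed. The following sketch illustrates this with shiny's testServer; fake.find.rules is a hypothetical, cheap stand-in for find.rules that merely counts how often it is called, and counter and server.sketch are likewise names invented for this example.

```r
library(shiny)

# Environment used to record how often the reactive body runs
counter <- new.env()
counter$runs <- 0

# Hypothetical stand-in for find.rules(): cheap, but records each call
fake.find.rules <- function(support, confidence) {
  counter$runs <- counter$runs + 1
  data.frame(support = support, confidence = confidence)
}

server.sketch <- function(input, output, session) {
  cross.sell.rules <- reactive({
    fake.find.rules(input$Support, input$Confidence)
  })
}

testServer(server.sketch, {
  session$setInputs(Support = 0.01, Confidence = 0.05)
  first <- cross.sell.rules()   # computed on first use
  again <- cross.sell.rules()   # inputs unchanged, served from the cache
  counter$runs.before.change <- counter$runs

  session$setInputs(Support = 0.02)
  second <- cross.sell.rules()  # Support changed, so the body runs again
  counter$last.support <- second$support
})

print(counter$runs.before.change)
print(counter$runs)
```

The two calls before the slider change trigger a single computation; only the change to Support causes a second one.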

  gen.rules <- reactive({
    support <- input$Support
    confidence <- input$Confidence
    gen.rules <- get.rules(support, confidence, transactions.obj)
    return(gen.rules)
  })

This reactive component recomputes and returns the rules object whenever the user changes the support or confidence threshold in the UI.
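Because cross.sell.rules (through find.rules) and gen.rules each call get.rules, the rules may be mined once per reactive after a threshold change. One possible alternative, sketched below under stated assumptions, is to let a downstream reactive reuse the upstream one. This is not the book's code: fake.get.rules is a hypothetical stand-in that returns a fixed data frame instead of calling apriori, and tally, chained.server, and top.rules are names invented for the illustration.

```r
library(shiny)

# Environment used to count how often the mining step runs
tally <- new.env()
tally$mined <- 0

# Hypothetical stand-in for get.rules(); the real function calls apriori()
fake.get.rules <- function(support, confidence) {
  tally$mined <- tally$mined + 1
  data.frame(rules = c("{a} => {b}", "{b} => {c}", "{a} => {c}"),
             leverage = c(0.02, 0.05, 0.01),
             stringsAsFactors = FALSE)
}

chained.server <- function(input, output, session) {
  # Upstream: mine once per change of the thresholds
  gen.rules <- reactive({
    fake.get.rules(input$Support, input$Confidence)
  })

  # Downstream: rank the already-mined rules instead of mining again
  top.rules <- reactive({
    rules <- gen.rules()
    head(rules[order(-rules$leverage), ], 2)
  })
}

testServer(chained.server, {
  session$setInputs(Support = 0.01, Confidence = 0.05)
  all.rules <- gen.rules()
  best <- top.rules()
  tally$best <- best$rules
})

print(tally$mined)
print(tally$best)
```

Both reactives are read, yet the mining stand-in runs only once, because top.rules takes its input from the cached value of gen.rules.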

  output$rulesTable <- DT::renderDataTable({
    cross.sell.rules()
  })

The preceding code renders the data frame back to the UI, in the rulesTable slot. The two plots are rendered as follows:

  output$graphPlot <- renderPlot({
    g <- plot.graph(cross.sell.rules())
    plot(g)
  })

  output$explorePlot <- renderPlot({
    plot(x = gen.rules(), method = NULL,
         measure = "support",
         shading = "lift", interactive = FALSE)
  })
}

The preceding two pieces of code render the graph plot and the scatter plot back to the UI.
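The graph plot relies on plot.graph turning rule labels into an edge list by splitting each label at "=>". The sketch below shows that step in base R on made-up labels, including trimming the spaces around the arrow so that the same item set on either side of a rule maps to one vertex name.

```r
# Rule labels in the form used by the rules column (made-up examples)
rules <- c("{bread} => {butter}", "{butter} => {milk}")

# Split each label at "=>" and flatten: lhs1, rhs1, lhs2, rhs2, ...
edges <- unlist(strsplit(rules, split = "=>", fixed = TRUE))

# Without trimming, "{butter} " (left side) and " {butter}" (right side)
# would be treated as two different vertices
edges <- trimws(edges)
print(edges)
```

Read pairwise, the resulting character vector is an edge list with named vertices, which is the form that igraph accepts when building a graph from names.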