Association rules need baskets that contain more than one item, so we will filter out the invoices that have only one item. Those invoices might be useful for a separate analysis of customers who purchased a single item, but they do not help with finding associations between multiple items, which is the goal of this exercise.
- Let's use sqldf to find all of the single-item transactions, and then we will create a separate dataframe consisting of the number of items per customer invoice:
library(sqldf)
- First, construct a query: how many distinct invoices are there? The result shows 25900 separate invoices:
sqldf("select count(distinct InvoiceNo) from OnlineRetail")
> Loading required package: tcltk
> count(distinct InvoiceNo)
> 1 25900
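As a cross-check that does not depend on sqldf, the same kind of count can be computed in base R. The following is only a minimal sketch on a toy data frame: toy_retail and its values are invented for illustration and are not the OnlineRetail data, and the StockCode column is assumed here as the item identifier.

```r
# Hypothetical toy data standing in for OnlineRetail (values are invented).
toy_retail <- data.frame(
  InvoiceNo = c("A1", "A1", "A2", "A3", "A3", "A3"),
  StockCode = c("s1", "s2", "s1", "s2", "s3", "s4"),  # assumed item column
  stringsAsFactors = FALSE
)

# Base R equivalent of: select count(distinct InvoiceNo) from toy_retail
n_invoices <- length(unique(toy_retail$InvoiceNo))

# Number of line items (rows) per invoice
items_per_invoice <- table(toy_retail$InvoiceNo)

# Invoices with exactly one line item, i.e. candidates to filter out
single_item <- names(items_per_invoice)[as.vector(items_per_invoice) == 1]

print(n_invoices)
print(single_item)
```

If the base R count and the sqldf count disagree on the real data, that usually points to a difference in how missing or duplicated invoice numbers are handled, which is worth checking before filtering.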
- How many invoices contain only a single item? First, extract the single-item invoices:
single...