Introduction to R for Business Intelligence

By: Jay Gendron

Overview of this book

Explore the world of Business Intelligence through the eyes of an analyst working in a successful and growing company. Learn R through use cases supporting different functions within that company. This book provides data-driven and analytically focused approaches to help you answer questions in operations, marketing, and finance. In Part 1, you will learn about extracting data from different sources, cleaning that data, and exploring its structure. In Part 2, you will explore predictive models and cluster analysis for Business Intelligence and analyze financial times series. Finally, in Part 3, you will learn to communicate results with sharp visualizations and interactive, web-based dashboards. After completing the use cases, you will be able to work with business data in the R programming environment and realize how data science helps make informed decisions and develops business strategy. Along the way, you will find helpful tips about R and Business Intelligence.

Extracting data from sources


Now we get to work. The aim of this book is to design, develop, and deliver a business intelligence product: a data product. In this section, you will explore two extraction methods to import data from different types of sources:

  • Importing CSV and other file formats

  • Importing data from relational databases

Depending on your background, you may be more or less familiar with both types of extraction methods. Here, you will learn or refresh your knowledge of both in order to have a more complete working understanding of ETL.

Importing CSV and other file formats

You can load the Bike Sharing data file into the R environment by using the read.csv() function. It is an easy and commonly used way to load CSV files:

bike <- read.csv("Ch1_bike_sharing_data.csv")

Calling this function will read the file as long as it is located in your R working directory. The working directory is the folder where R looks for and saves files by default, much like a home directory. There are two commands that you can use to check and change your working directory:

  • getwd(): This will return the current working directory as a path in the R console

  • setwd(<path>): This is used in the console to change the working directory to the <path> you pass to the function
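A quick sketch of these two commands in practice; the project path shown is a hypothetical example, so adjust it for your own machine:

```r
# Print the current working directory to the R console
getwd()

# Point R at a hypothetical project folder (path is an assumption)
setwd("~/projects/bike_sharing")
```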

If you try to read a file that is not in your working directory, you will see an error message. Some analysts manage data in a separate data directory, one level under their working directory. In this case, you can add a path in front of the filename. The following example shows how the data file is located one layer down in a data directory. Adding the ./data/ string to the front of the filename will allow R to access the data:

bike <- read.csv("./data/Ch1_bike_sharing_data.csv") 
str(bike) 

The str() function is not required to import the data, but it does provide a confirmation that the data was read into the R environment. It also provides you with a quick look at the structure of the dataset, its dimensions, and the names of the variables:

'data.frame':  17379 obs. of  12 variables:
 $ datetime  : Factor w/ 17379 levels "1/1/2011 0:00",..: 1 2 13  ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
 $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
 $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
 $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
 $ windspeed : num  0 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ count     : int  16 40 32 13 1 1 2 3 8 14 ...
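Beyond str(), a few other base R functions give complementary quick looks at a freshly imported data frame. This sketch assumes the bike data frame loaded above:

```r
head(bike, 3)       # first three rows of the data
dim(bike)           # dimensions: 17379 rows, 12 columns
names(bike)         # the twelve variable names
summary(bike$temp)  # five-number summary plus mean for one variable
```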

The read.csv() function uses default parameters that are consistent with CSV files, such as using a comma (,) as a separator. The R environment has a more flexible read option for instances when you may have tab-delimited data or a TXT file. This option is the read.table() function:

bike <- read.table("./data/Ch1_bike_sharing_data.csv", sep = ",", 
                   header = TRUE)

This function performs identically to the read.csv() function. In fact, read.csv() is built on read.table() and uses sep = "," and header = TRUE as default parameters.
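If the same data arrived as a tab-delimited TXT file instead, only the sep argument would change. The .txt filename here is an assumption for illustration:

```r
# Hypothetical tab-delimited export of the same dataset
bike <- read.table("./data/Ch1_bike_sharing_data.txt",
                   sep = "\t", header = TRUE)
```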

Tip

Downloading the example code

Detailed steps to download the code bundle are mentioned in the Preface of this book. Please have a look.

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Introduction-To-R-For-Business-Intelligence. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Importing data from relational databases

You can also use R to access the data stored in many relational databases through the Open Database Connectivity (ODBC) application programming interface. In R, you do this using the RODBC package. This package provides you with an advantage for legacy data. By establishing an R connection to relational databases at Bike Sharing, LLC, you can reuse existing SQL-type queries that may already exist in the BI group.

Imagine that the Bike Sharing dataset was not a CSV file, but it was instead stored in a relational database. You can access the data by loading the RODBC package and establishing a connection to the database using the following parameters: data source name, user ID, and password. In this notional example, the user Paul would like to access the data source ourDB using his password R4BI:

library(RODBC) 
connection <- odbcConnect(dsn = "ourDB", uid = "Paul", pwd = "R4BI") 

Note

As you follow along with this code, you will only be able to run the odbcConnect() function if you have configured a database source. Advanced users may try this using an Excel file. A tutorial for Windows-based environments is provided by Cyberfella LTD (2013) at http://www.cyberfella.co.uk/2012/06/11/windows7-odbc/.

Having established a connection to your database, you can use R to write a SQL query, pass this query to the sqlQuery() function, and read the data into a bike data frame. You learned that there is a table in the database called marketing:

query <- "SELECT * FROM marketing" 
bike <- sqlQuery(connection, query) 
close(connection) 

The close() function closes the connection between R and the database. Doing this will free any memory consumed by the open connection. It is a good practice to follow.
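If you do not know the table names in advance, the RODBC package can also list and preview them before you write a query. This sketch assumes an open connection like the one above:

```r
# List the tables visible through the open ODBC connection
tables <- sqlTables(connection)
head(tables$TABLE_NAME)

# Preview a table without writing SQL by hand
head(sqlFetch(connection, "marketing"))
```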

Tip

R tip: One thing you can do when establishing connections is to add a log to record your ETL processes. In the event something goes wrong, the log will capture the issue and help you find the problem quickly. The log4r package is specifically written for this application. For more information, visit the repository at https://github.com/johnmyleswhite/log4r.
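A minimal logging sketch using log4r's classic API; the log filename and messages are illustrative, not part of the Bike Sharing example:

```r
library(log4r)

# Create a logger that writes INFO-and-above messages to a file
logger <- create.logger(logfile = "etl.log", level = "INFO")

info(logger, "ETL started: querying marketing table")
# ... run the extraction here ...
error(logger, "Use error() to record failures for later review")
```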

There are also non-relational databases in the workplace. Non-relational databases store documents, key-value data, and unstructured data. Unlike the relational database approach, they organize data into different structures grouped into nodes, clusters, or other schemas. The R community has created various packages to connect with non-relational databases and distributed file systems. You can connect to MongoDB using the rmongodb package and to Spark distributed computing using the SparkR package.
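As a rough sketch of what such a connection looks like with rmongodb, the host, database, and collection names below are all assumptions for illustration:

```r
library(rmongodb)

# Connect to a hypothetical local MongoDB instance
mongo <- mongo.create(host = "localhost")

if (mongo.is.connected(mongo)) {
  # Fetch all documents from an assumed 'bikes.marketing' collection
  docs <- mongo.find.all(mongo, ns = "bikes.marketing")
  mongo.destroy(mongo)  # release the connection when done
}
```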

Tip

BI tip: By looking at the database approach used within an organization and conceptualizing the data, a business analyst can better understand business processes to improve the quality of analysis and meet organizational needs.