Before we begin describing the ETL process, consider its importance in business intelligence. CIO Magazine provides a popular and useful definition of BI (Mulcahy, 2007):
"Business intelligence, or BI, is an umbrella term that refers to a variety of software applications used to analyze an organization's raw data. BI as a discipline is made up of several related activities, including data mining, online analytical processing, querying and reporting."
Mulcahy captures the essence of this book, which presents solutions in R to walk you through the steps from data analytic techniques to communicating your results. The purpose of BI applications has changed over the last decade as big data challenges affect the business world in ways first experienced in the sciences decades ago.
You can find the term big data in many business settings. It appears in advertisements for boot camps, draws attendees to conferences, and perplexes business leaders. Arguably, the term is ill-defined. A 1998 presentation given by John Mashey, then the Chief Scientist of Silicon Graphics, is often cited as the document that introduced the term (Press, 2013). The impact of big data on business is undeniable, despite its elusive meaning. There is a general agreement on the following three characteristics of big data, called the 3Vs:
Volume: The size of datasets has grown from megabytes to petabytes
Velocity: The speed of data arrival has changed to near real time
Variety: The sources of data have grown from structured databases to unstructured ones, such as social media, websites, audio, and video
Together these three characteristics pose a growing challenge to the business community. Data is stored in facilities across a vast network of local servers or relational databases. Virtual software access it with cloud-based applications. BI applications have typically included static dashboards based on fixed measures using structured data. Big data changes the business by affording a competitive advantage to those who can extract value from the large and rapidly changing sources of diverse data.
Today, people ask business analysts, what is going to happen? To answer this type of question, a business needs tools and processes to tap into the growing stream of data. Often this data will not fit into the existing databases without transformation. The continual need to acquire data requires a structured ETL approach to wrangle the unstructured nature of modern data. As you read this chapter, think about how companies may benefit from using the techniques presented, even when they are less complex than big data.
Use case: Bike Sharing, LLC
You will begin your exploration of BI and analytics through the lens of a fictional business called Bike Sharing, LLC. The company operates and maintains a fleet of publically rental bikes in the Washington D.C. metropolitan area. Their customers are typically from the urban area, including people from business, government, and universities. Customers enjoy the convenience of finding bikes easily within a network of bike-sharing stations throughout the city. Renters may rent a bicycle at one location and leave it at another station.Bike Sharing, LLC started operations in 2011, and has enjoyed continued growth. They quickly established a BI group to keep track of the data collected about transactions, customers, and factors related to rentals, such as weather, holidays, and times of day. In 2014, they began to understand how they might use open source datasets to guide decisions regarding sales, operations, and advertising. In 2015, they expanded their BI talent pool with business analysts experienced with R and statistical methods that could use Bike Sharing data in new ways.
You joined Bike Sharing just a few months ago. You have a basic understanding of R from the many courses and tutorials that you used to expand your skills. You are working with a good group that has a diverse skillset, including programming, databases, and business knowledge. The first data you have been given is bike rental data covering the two-year period from Jan 1, 2011 to Dec 31, 2012 (Kaggle, 2014). You can download this same
Ch1_bike_sharing_data.csv file from the book's website at
Data sources often include a data dictionary to help new users understand the contents and coding of the data. Data Dictionary for Bike Sharing Data (Kaggle, 2014):
datetime: Hourly date + timestamp
holiday: Whether the day is considered a holiday
workingday: Whether the day is neither a weekend nor holiday
1: Clear, Few clouds, Partly cloudy, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: Temperature in Celsius
atemp: Feels like temperature in Celsius
humidity: Relative humidity
windspeed: Wind speed
casual: Number of non-registered user rentals initiated
registered: Number of registered user rentals initiated
count: Number of total rentals
One of your goals is to strengthen your ETL skills. In this use case, you will learn common extraction, transformation, and loading skills to store a dataset in a file for analysis. Welcome to the Bike Sharing team.