R has built-in functionality for splitting a data frame into training and testing sets, building a model from the training set, predicting results by applying the model to the testing set, and then visualizing how well the model performs.
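As a rough illustration of that workflow, here is a minimal sketch in base R. It uses a small synthetic data frame rather than the airline data, and the names flights, train, test, and model, along with the 75/25 split, are assumptions made only for this example.

```r
# Synthetic stand-in data; the column names and values are illustrative only.
set.seed(42)
n <- 1000
flights <- data.frame(DepDelay = rnorm(n, mean = 10, sd = 30))
flights$ArrDelay <- flights$DepDelay + rnorm(n, mean = -2, sd = 10)

# Split: a random 75% of the rows for training, the remaining 25% for testing.
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- flights[train_idx, ]
test <- flights[-train_idx, ]

# Build a linear model on the training set only.
model <- lm(ArrDelay ~ DepDelay, data = train)

# Predict arrival delays for the rows the model has not seen.
predicted <- predict(model, newdata = test)

# Visualize: points close to the red 45-degree line are well predicted.
plot(test$ArrDelay, predicted,
     xlab = "Actual ArrDelay", ylab = "Predicted ArrDelay")
abline(0, 1, col = "red")
```

Fixing the seed makes the random split repeatable, which helps when comparing models across runs.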
For this example, I am using 2008 airline data from http://stat-computing.org/dataexpo/2009/the-data.html, which compares actual arrival and departure times with the scheduled ones. The dataset is distributed as a .bz2 file that unpacks into a CSV file. I like this dataset because the initial row count is over 7 million and it still works nicely in Jupyter.
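If you would rather not unpack the archive by hand, R can read a bzip2-compressed CSV through a bzfile() connection. The sketch below writes a tiny compressed file of its own so that it is self-contained; the file contents and the name sample_df are made up for illustration.

```r
# Write a tiny bzip2-compressed CSV to a temporary file (stand-in data only).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("ArrDelay,DepDelay", "-14,8", "2,19", "14,8"), con)
close(con)

# Read the compressed file directly, without unpacking it first.
sample_df <- read.csv(bzfile(tmp))
str(sample_df)

# Remove the temporary file.
unlink(tmp)
```

The same read.csv(bzfile(...)) call should work on the downloaded archive, given the path where you saved it. With over 7 million rows, expect the read to take a while either way.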
We first read in the airplane data and display a summary. There are additional columns in the dataset that we are not using:
df <- read.csv("Documents/2008-airplane.csv")
summary(df)
...
CRSElapsedTime    AirTime        ArrDelay          DepDelay
Min.   :-141.0    Min.   :  0    Min.   :-519.00   Min.   :-534.00
1st Qu.:  80.0    1st Qu.: 55    1st Qu.: -10.00   ...
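Since only some of the columns are needed, it can help to narrow the data frame early. The following is a minimal sketch on a small stand-in data frame, assuming the four columns shown in the summary are the ones of interest; df_demo, keep, and df_small are hypothetical names, and dropping rows with missing values is an assumption rather than a step taken from the text.

```r
# Stand-in data with one extra column and a few missing values.
df_demo <- data.frame(
  Year = 2008,
  CRSElapsedTime = c(80, 150, NA, 95),
  AirTime = c(55, 128, 60, NA),
  ArrDelay = c(-10, 2, 14, 34),
  DepDelay = c(-4, 19, 8, 25)
)

# Keep only the columns of interest.
keep <- c("CRSElapsedTime", "AirTime", "ArrDelay", "DepDelay")
df_small <- df_demo[, keep]

# Drop rows that have a missing value in any of the kept columns.
df_small <- df_small[complete.cases(df_small), ]
summary(df_small)
```

On a data frame with millions of rows, dropping unused columns also reduces memory use in later steps.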