To trust a statistical model, we need confidence that it accurately abstracts the phenomenon we are dealing with. To gain that trust, we need to test how well the model performs, and to assess its accuracy we cannot use the same dataset that we used for training.
In this recipe, you will learn how to quickly split your dataset into two subsets: one used solely to train the model, and the other used to test it.
To execute this recipe, you will need pandas, SQLAlchemy, and NumPy. No other prerequisites are required.
We read our data from the PostgreSQL database and store it in the data DataFrame. Conventionally, we would set aside somewhere between 20% and 40% of the original dataset for testing purposes. In this example, we hold out one third of the data (the data_split.py file):
# specify what proportion of data to hold out for testing
test_size = 0.33
...
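Since the rest of data_split.py is not shown here, the following is a minimal sketch of one common way to perform such a split with NumPy and pandas: shuffle the row indices with a random permutation and cut them at the test_size boundary. The sample DataFrame stands in for the data loaded from PostgreSQL, and the fixed random seed is only there to make the sketch reproducible; neither is part of the original recipe.

```python
import numpy as np
import pandas as pd

# Stand-in for the `data` DataFrame that the recipe reads from
# PostgreSQL via SQLAlchemy; 100 rows of toy features.
data = pd.DataFrame({
    'x1': np.arange(100),
    'x2': np.arange(100) * 2.0,
})

# specify what proportion of data to hold out for testing
test_size = 0.33

np.random.seed(42)  # reproducibility only; not in the original recipe

# shuffle the row index and cut it at the test boundary
shuffled = np.random.permutation(data.index)
n_test = int(len(data) * test_size)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

train = data.loc[train_idx]
test = data.loc[test_idx]

print(len(train), len(test))  # 67 33
```

Because the indices come from a single permutation, every row lands in exactly one of the two subsets, which is the property the recipe relies on when evaluating the model on held-out data.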