From the previous section, we learned an interesting lesson: for big data, use SGD-based learners, because they are faster and they scale well.
Now, in this section, let's consider this regression dataset:
Massive number of observations: 2M
Large number of features: 100
Noisy dataset
The X_train matrix is composed of 200 million elements (2M rows × 100 features) and may not fit entirely in memory on a machine with 4 GB of RAM; the test set is composed of 10,000 observations.
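The `generate_dataset` helper used below comes from the previous section; its exact implementation is not shown here, but a minimal sketch, assuming it wraps scikit-learn's `make_regression` to produce a noisy linear problem split into train and test partitions, could look like this:

```python
from sklearn.datasets import make_regression

def generate_dataset(n_train, n_test, n_features, noise):
    # Hypothetical re-implementation of the helper from the previous
    # section: build one noisy regression problem, then split it into
    # train and test partitions. The random_state is illustrative.
    X, y = make_regression(n_samples=n_train + n_test,
                           n_features=n_features,
                           noise=noise,
                           random_state=101)
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]
```

With the sizes used in this section (`generate_dataset(2000000, 10000, 100, 10.0)`), such a helper would return a 2M × 100 training matrix and a 10,000 × 100 test matrix.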
Let's first create the datasets, and print the memory footprint of the biggest one:
In: # Let's generate a 2M observation dataset
    X_train, X_test, y_train, y_test = \
        generate_dataset(2000000, 10000, 100, 10.0)
    print("Size of X_train is [GB]:",
          X_train.size * X_train[0,0].itemsize / 1E9)

Out: Size of X_train is [GB]: 1.6
The X_train matrix is itself 1.6 GB of data (2,000,000 rows × 100 columns × 8 bytes per float64 element); we can consider it a starting point for big data. Let's now try to fit it using the best model we obtained in the previous section, SGDRegressor(). To access...
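The fit itself can be sketched at a reduced scale before running it on the full 1.6 GB matrix. The snippet below is an illustrative stand-in, not the book's code: it uses a 10,000-row training set generated with `make_regression` (the same kind of noisy linear problem), and all sizes and the `random_state` are assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error

# Scaled-down stand-in for the 2M-row dataset discussed in the text.
X, y = make_regression(n_samples=11000, n_features=100,
                       noise=10.0, random_state=101)
X_train, X_test = X[:10000], X[10000:]
y_train, y_test = y[:10000], y[10000:]

# Fit the SGD-based linear regressor and check test-set error.
regr = SGDRegressor(random_state=101)
regr.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, regr.predict(X_test)))
```

On the full 2M-row dataset, the same `fit()` call works identically as long as X_train fits in memory; when it does not, SGDRegressor's `partial_fit()` method allows training on one chunk of rows at a time.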