Regression Analysis with Python
From the previous section, we've learned an interesting lesson: for big data, always use SGD-based learners, because they are faster and they scale.
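As a reminder of why SGD-based learners scale, here is a minimal sketch of mini-batch training with scikit-learn's `SGDRegressor.partial_fit()`. The synthetic data, the batch sizes, and the names `true_coef`, `X_batch`, and `y_batch` are assumptions made for illustration; they are not taken from the book's code. Only one small batch lives in memory at any time, which is what makes the approach usable when the full matrix is too large.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
n_features = 5
true_coef = rng.normal(size=n_features)  # hypothetical ground-truth weights

model = SGDRegressor(random_state=0)

# Stream 50 mini-batches of 200 rows each; each batch is discarded after use
for _ in range(50):
    X_batch = rng.normal(size=(200, n_features))
    y_batch = X_batch @ true_coef + rng.normal(scale=0.1, size=200)
    model.partial_fit(X_batch, y_batch)  # one incremental pass over this batch

# Evaluate on fresh, noise-free data drawn from the same process
X_new = rng.normal(size=(1000, n_features))
y_new = X_new @ true_coef
r2 = model.score(X_new, y_new)
print("R^2 on fresh data:", r2)
```

In a real out-of-core setting, the batches would typically be read from disk or a stream rather than generated on the fly.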
Now, in this section, let's consider this regression dataset:
Massive number of observations: 2M
Large number of features: 100
Noisy dataset
The X_train matrix is composed of 200 million elements, and may not completely fit in memory (on a machine with 4 GB RAM); the testing set is composed of 10,000 observations.
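Before allocating anything, the footprint can be estimated with a quick back-of-the-envelope calculation. The sketch below assumes the matrix stores 8-byte floats (NumPy's default `float64`); the variable names are illustrative.

```python
# Estimate the memory needed by X_train without creating it
n_obs, n_features = 2_000_000, 100
bytes_per_value = 8  # assumption: float64, NumPy's default floating-point dtype

n_elements = n_obs * n_features
size_gb = n_elements * bytes_per_value / 1e9

print("Elements:", n_elements)
print("Estimated size [GB]:", size_gb)
```

Under that assumption the estimate is 200 million elements and about 1.6 GB, a large share of the RAM on a 4 GB machine once the operating system and the Python process are accounted for.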
Let's first create the datasets, and print the memory footprint of the biggest one:
In:
# Let's generate a dataset with 2M training observations
X_train, X_test, y_train, y_test = generate_dataset(2000000, 10000, 100, 10.0)
print("Size of X_train is [GB]:", X_train.size * X_train[0,0].itemsize/1E9)
Out:
Size of X_train is [GB]: 1.6
The X_train matrix alone holds 1.6 GB of data; we can consider it a starting point for big data. Let's now try to fit a regressor on it using the best model we got from the previous section, SGDRegressor(). To access...