Sometimes the dataset we have is too large to use in full when building a model. For practical reasons (so that estimating our models does not take forever), it is useful to draw a stratified sample from the full dataset.
In this recipe, we will read from our MongoDB database and use Python to create a sample.
To execute this recipe, you will need PyMongo, pandas, and NumPy. No other prerequisites are required.
There are two approaches that one can take: either specify the fraction of the original dataset (say, 20%) or specify the number of records one would like to retrieve from the dataset. The following code shows you how to fetch a fraction of the dataset (the data_sampling.py file):
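import pandas as pd
import pymongo

# fraction of each stratum to keep in the sample
strata_frac = 0.2

# connect to the local MongoDB instance and select the collection
client = pymongo.MongoClient()
db = client['packt']
real_estate = db['real_estate']

# retrieve the data
sales = pd.DataFrame.from_dict(
    list(
        real_estate.find(
            {'beds': {'$in': [2, 3, 4]}},
            {'_id': 0, 'zip': 1, 'city...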
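The two approaches described above (sampling a fraction versus sampling a fixed number of records) can be sketched on a small in-memory DataFrame. This is a minimal illustration, not the contents of data_sampling.py: the synthetic `sales` data, the helper names `sample_by_fraction` and `sample_by_count`, the `price` column, and the seed value are all assumptions made for the example, and the number of bedrooms is assumed to be the stratification variable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the records fetched from MongoDB:
# 100 two-bed, 300 three-bed, and 100 four-bed properties.
rng = np.random.default_rng(42)
sales = pd.DataFrame({
    'beds': np.repeat([2, 3, 4], [100, 300, 100]),
    'price': rng.integers(100_000, 500_000, size=500),
})


def sample_by_fraction(df, strata_col, frac, seed=666):
    """Draw the same fraction of rows from every stratum."""
    parts = [
        group.sample(frac=frac, random_state=seed)
        for _, group in df.groupby(strata_col)
    ]
    return pd.concat(parts)


def sample_by_count(df, strata_col, n_total, seed=666):
    """Draw roughly n_total rows, split across strata in proportion
    to each stratum's share of the full dataset."""
    shares = df[strata_col].value_counts(normalize=True)
    parts = [
        group.sample(n=int(round(shares.loc[key] * n_total)),
                     random_state=seed)
        for key, group in df.groupby(strata_col)
    ]
    return pd.concat(parts)


frac_sample = sample_by_fraction(sales, 'beds', 0.2)
count_sample = sample_by_count(sales, 'beds', 100)

# Both samples keep the original proportions of the strata.
print(frac_sample['beds'].value_counts().sort_index())
print(count_sample['beds'].value_counts().sort_index())
```

Sampling within each group, rather than from the whole DataFrame at once, is what keeps the share of each stratum in the sample equal to its share in the full dataset. Because the per-stratum counts in the second helper are rounded, the total can differ from `n_total` by a record or two when the shares do not divide evenly.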