When exploring a large dataset, it is generally useful to work with a smaller sample of the data first. For example, from a dataset of 100 million records, we could take a sample of 1,000 records and start exploring some important properties of the data. Exploring the entire dataset would be ideal; however, the time required to do so would be many times greater.
When working with samples, it is important to select the sample carefully so that bias is not introduced unnecessarily. Randomness plays a very important role in this: selecting records at random gives each record the same chance of being included.
Let's look at how we can make use of the Scala collection API to select sample data from a dataset:
- Create a list of 1000 numbers using Scala's Range...
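
As a rough illustration of the idea, the sketch below shows one possible way to draw a random sample from a list built with a Range. It is not necessarily the approach the steps in this section follow. The object name `SamplingSketch`, the helper `randomSample`, the sample size of 10, and the fixed seed are all assumptions made for this example; only `scala.util.Random` and its `shuffle` method come from the Scala standard library.

```scala
import scala.util.Random

object SamplingSketch {
  // Stand-in dataset: the numbers 1 to 1000, built from a Range
  val data: List[Int] = (1 to 1000).toList

  // Hypothetical helper (not part of the Scala API): shuffles the list and
  // keeps the first n elements, giving a random sample without replacement
  def randomSample[A](xs: List[A], n: Int, rng: Random): List[A] =
    rng.shuffle(xs).take(n)

  def main(args: Array[String]): Unit = {
    // Fixed seed, assumed here only so that repeated runs give the same sample
    val rng = new Random(42L)
    val sample = randomSample(data, 10, rng)
    println(sample)
  }
}
```

Shuffling builds a full shuffled copy of the collection, which is fine for 1,000 elements but may be too costly for a dataset that does not fit comfortably in memory.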