# Sampling

When the data volume is very large, we may need to work on a subset of the data to speed up analysis. This is sampling: a technique for selecting and analyzing a subset of data in order to discover patterns and trends in the whole dataset. In HQL, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.

# Random sampling

Random sampling uses the `rand()` function and the `LIMIT` keyword to draw a sample of the data, as shown in the following example. The `DISTRIBUTE BY` and `SORT BY` keywords are used here to make sure the data is also randomly and efficiently distributed among mappers and reducers. An `ORDER BY rand()` clause achieves the same result, but it performs poorly because `ORDER BY` imposes a global ordering that sends all rows through a single reducer:

...
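As a minimal sketch of the pattern described above, the two queries below draw a two-row random sample. The table `employee` and its `name` column are hypothetical names assumed purely for illustration, and the sample size of 2 is arbitrary.

```sql
-- Hypothetical table `employee` with a `name` column (assumed for illustration).
-- DISTRIBUTE BY rand() spreads rows randomly across reducers,
-- SORT BY rand() shuffles the rows within each reducer,
-- and LIMIT keeps only the requested number of rows.
SELECT name
FROM employee
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 2;

-- Same sample size, but ORDER BY requires a global sort through a single
-- reducer, which is likely to be slow on large tables.
SELECT name
FROM employee
ORDER BY rand()
LIMIT 2;
```

Because `rand()` is not seeded here, each run is expected to return a different set of rows.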