When we query a dataset on a column with many unique values, partitioning is not a good fit. Partitioning on a high-cardinality column such as ID would create a large number of small files in HDFS and a matching number of partition entries in the metastore, increasing the load on both the NameNode and the metastore service.
To optimize queries on such a dataset, we instead group the data into a fixed number of buckets: each row is assigned to a bucket based on the hash of the bucketing column, so the data is spread evenly across that fixed number of files.
Using the same sales dataset, if we need to optimize queries on a high-cardinality column such as ID, we create buckets on that column as follows:
CREATE TABLE sales_buck (id INT, fname STRING, state STRING, zip STRING, ip STRING, pid STRING)
CLUSTERED BY (id) INTO 50 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Here, we have defined 50 buckets for this table, which means that the complete dataset is divided and stored across 50 buckets based on the hash of the ID column.
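The bucket-assignment rule can be sketched outside Hive as well. The following is a minimal illustration, not Hive's internal code: it assumes the documented behavior that a row's bucket is the column's hash modulo the bucket count, and that for an integer column the hash is the value itself; the ids used are hypothetical.

```python
NUM_BUCKETS = 50  # matches the CLUSTERED BY (id) INTO 50 BUCKETS definition above

def bucket_for(id_value: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Hive assigns a row to bucket (hash(column) % num_buckets);
    # for an int column the hash is simply the value.
    return id_value % num_buckets

# Every row with the same id lands in the same bucket, so a query
# filtering on id = 101 only needs to read one of the 50 bucket files.
print(bucket_for(101))   # bucket index for a hypothetical id 101
print(bucket_for(7))     # small ids map directly to their own bucket
```

Because the mapping is deterministic, Hive can prune to a single bucket file when filtering on the bucketing column, and can use bucket information for efficient sampling and map-side joins.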
...