Besides partitioning, bucketing is another technique for clustering datasets into more manageable parts in order to optimize query performance. Unlike a partition, which maps to a directory, a bucket corresponds to segments of files in HDFS. For example, the employee_partitioned table from the previous section uses year and month as the top-level partitions. If there were a further request to use employee_id as a third level of partitioning, it would create many deep and small partitions and directories. Instead, we can bucket the employee_partitioned table using employee_id as the bucket column. The values of this column are hashed into a user-defined number of buckets, so records with the same employee_id are always stored in the same bucket (segment of files). By using buckets, Hive can easily and efficiently perform sampling (see Chapter 6, Data Aggregation and Sampling) and map-side joins (see Chapter 4, Data Selection and Scope). An example of creating a bucket table is as follows:
--Prepare another dataset...
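As a minimal sketch of the idea described above, the following HiveQL creates a bucketed table and populates it; the table name, columns, source table (employee_hr), and bucket count here are illustrative assumptions, not the book's exact example:

```sql
-- Hypothetical bucketed table: each row is assigned to one of 4
-- buckets by hash(employee_id) mod 4, so equal employee_id values
-- always land in the same bucket file.
CREATE TABLE employee_id_buckets (
  name STRING,
  employee_id INT
)
CLUSTERED BY (employee_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- On older Hive versions, enable bucket enforcement so the insert
-- produces one file per bucket (newer versions enforce this by default):
SET hive.enforce.bucketing = true;

-- Populate the buckets from an assumed source table, employee_hr:
INSERT OVERWRITE TABLE employee_id_buckets
SELECT name, employee_id FROM employee_hr;
```

Note that the number of buckets is fixed at table-creation time; choosing it to match the expected data volume keeps each bucket file at a manageable size.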