In this recipe, we are going to learn how to write a MapReduce program that partitions data using a custom partitioner.
To perform this recipe, you should have a running Hadoop cluster as well as an IDE such as Eclipse installed.
During the shuffle and sort phase, Hadoop uses a hash partitioner by default unless another one is specified. We can also write our own custom partitioner with custom partitioning logic, so that the data is partitioned into separate output files.
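To make the default behavior concrete, here is a minimal standalone sketch that mirrors the formula used by Hadoop's built-in HashPartitioner (which lives in org.apache.hadoop.mapreduce.lib.partition); the class name below is invented for illustration, and the real class operates on Hadoop key/value types rather than plain Objects:

```java
// Standalone sketch mirroring Hadoop's default HashPartitioner logic.
public class HashPartitionSketch {
    // Mask off the sign bit so negative hashCode() values still map
    // to a valid partition index in [0, numReduceTasks).
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition,
        // but which partition that is depends only on its hash code.
        System.out.println(getPartition("Tanmay", 3));
    }
}
```

Because partition assignment depends only on the key's hash code, records with related keys (for example, all users who joined in the same year) can end up scattered across reducers, which is exactly what a custom partitioner lets us control.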
Let's consider one example where we have user data with us along with the year of joining. Now, assume that we have to partition the users based on the year of joining that's specified in the record. The sample input data looks like this:
User_id|user_name|yoj
1|Tanmay|2010
2|Sneha|2015
3|Sakalya|2020
4|Manisha|2011
5|Avinash|2012
6|Vinit|2022
To get this data partitioned based on YOJ, we will have to write a custom partitioner:
public class YearOfJoiningPartitioner...
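The class body is elided above; as a sketch of the core idea, the snippet below simulates the partitioning logic in standalone Java so it runs without a Hadoop installation. In the actual recipe, YearOfJoiningPartitioner would extend org.apache.hadoop.mapreduce.Partitioner&lt;IntWritable, Text&gt; and override getPartition(key, value, numPartitions); the record parsing and the choice of three partitions here are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Standalone simulation of the YearOfJoiningPartitioner logic.
// In the real job this method would be getPartition(IntWritable key,
// Text value, int numPartitions) on a Partitioner subclass, with the
// mapper emitting the year of joining as the key.
public class YearOfJoiningPartitionerSketch {
    // Route each record to a partition based on its year of joining,
    // so every reducer (and hence every output file) holds one group of years.
    public static int getPartition(int yearOfJoining, int numPartitions) {
        return yearOfJoining % numPartitions;
    }

    public static void main(String[] args) {
        String[] records = {
            "1|Tanmay|2010", "2|Sneha|2015", "3|Sakalya|2020",
            "4|Manisha|2011", "5|Avinash|2012", "6|Vinit|2022"
        };
        int numPartitions = 3; // e.g. job.setNumReduceTasks(3) in the driver
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String record : records) {
            // Field 2 of User_id|user_name|yoj is the year of joining.
            int yoj = Integer.parseInt(record.split("\\|")[2]);
            buckets.computeIfAbsent(getPartition(yoj, numPartitions),
                                    k -> new ArrayList<>()).add(record);
        }
        // Each bucket corresponds to one reducer, hence one output file.
        buckets.forEach((p, recs) ->
            System.out.println("partition " + p + ": " + recs));
    }
}
```

Note that a plain modulo over the year spreads years across reducers but does not guarantee one year per file; with three reduce tasks, 2011 and 2020 land in the same partition. For a strict one-file-per-year layout, the partitioner would need a mapping from each distinct year to its own partition index, with the number of reduce tasks set accordingly.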