Change Data Capture or CDC is one the most painful areas in Data Warehousing. CDC captures the changes that occur in a table. A change could be in the form of new records getting added, updated, or getting deleted. In this recipe, we are going to take a look at how to perform CDC in Hive.
To perform this recipe, you should have a running Hadoop cluster as well as the latest version of Hive installed on it. Here, I am using Hive 1.2.1.
First of all, we need a data sample. Consider a simple employee table that has columns, such as the employee ID, name, and salary. Let's say we import this table from a source table in week 1, and after a week, we want to know about the changes that have taken place in the same table. Let's say we have a table, employee1, which was imported in week 1, and we have another table, which was imported in week 2. Week 2 being the latest week, we want to know the changes that have taken place. Here...