Bad data lurks in all data ingested by Hadoop, but its impact is magnified by the phenomenal volume and variety of Big Data. Missing records, malformed values, and wrong file formats multiply the time we waste. What drives us to frustration is the data we have but cannot use, the data we once had and then lost, and the data that is no longer the same as it was yesterday. In a Big Data analytics project, it is common to be handed an enormous dataset with little information about where it came from, how it was collected, or what the fields mean. In many cases, the data has passed through many hands and multiple transformations since it was gathered, and nobody really knows what it all means anymore.
Data profiling measures the quality of the data and its fitness for processing in subsequent steps. Put simply, it indicates what is wrong with the data. Data profiling...
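As a minimal sketch of what profiling looks like in practice, the snippet below scans a few hypothetical CSV records (the field names and the rule that `age` must be numeric are assumptions for illustration) and tallies missing and malformed values per column:

```python
import csv
import io

# Hypothetical sample: a few raw records as they might arrive in an ingest job.
raw = """id,age,country
1,34,US
2,,DE
3,abc,FR
4,29,
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# One counter pair per column: how many values are missing, how many malformed.
profile = {col: {"missing": 0, "malformed": 0} for col in rows[0]}

for row in rows:
    for col, value in row.items():
        if value is None or value.strip() == "":
            profile[col]["missing"] += 1
        elif col == "age" and not value.isdigit():
            # Assumed rule: 'age' should be numeric; anything else is malformed.
            profile[col]["malformed"] += 1

for col, stats in profile.items():
    print(col, stats)
```

A real profiling job would apply the same idea at scale (for example, as a MapReduce or Hive job) and cover many more checks, but the core pattern is the same: define an expectation per field and count how often the data violates it.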