The need for curating raw data
Data in the bronze layer is raw by nature because it is collected from many distinct and diverse data sources. Given that diversity, it is natural for data to arrive in unstandardized, invalid, inconsistent, non-uniform, duplicate, or insecure forms. In some cases, raw data may also contain personally identifiable information (PII) in clear text, which should be properly masked before analytical consumption.
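To make two of these problems concrete, the following is a minimal sketch of how duplicate records and clear-text PII might be handled before data leaves the bronze layer. It is only an illustration under stated assumptions: the field names (customer_id, email), the helper names (mask_pii, curate), and the salt value are hypothetical, and a salted hash is just one possible way to obscure a value, not necessarily the masking technique a given platform would use.

```python
import hashlib


def mask_pii(value: str, salt: str = "demo-salt") -> str:
    """Replace a clear-text value with a truncated, salted SHA-256 digest.

    Assumption: a one-way hash is acceptable as a stand-in for masking here.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def curate(records, pii_fields=("email",)):
    """Drop exact duplicate records, then mask the listed PII fields."""
    seen = set()
    curated = []
    for record in records:
        # Build a hashable fingerprint of the whole record to detect duplicates.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)

        cleaned = dict(record)  # copy so the raw record stays untouched
        for field in pii_fields:
            if cleaned.get(field) is not None:
                cleaned[field] = mask_pii(str(cleaned[field]))
        curated.append(cleaned)
    return curated


# Hypothetical raw records, including one exact duplicate.
raw_records = [
    {"customer_id": 1, "email": "ana@example.com"},
    {"customer_id": 1, "email": "ana@example.com"},
    {"customer_id": 2, "email": "raj@example.com"},
]

curated_records = curate(raw_records)
for row in curated_records:
    print(row)
```

The raw records are copied rather than modified, which mirrors the idea that the bronze layer keeps data as delivered while curated output is written elsewhere.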
Important note
In big data, one of the hotly debated topics is veracity – that is, can the organization put trust in the data that is being collected? And if yes, then how much?
Let's try to understand some characteristics of unclean data so that we can properly justify the reasons for curating data.
Unstandardized data
These days, data is typically collected using online transaction processing (OLTP) applications. The problem is that OLTP applications, such as web applications and mobile applications created in different countries, follow...