Data processing is one of the core capabilities of a Data Lake implementation. Our Data Lake is no exception: it performs data processing in both the batch and the speed layer. In this section we cover some important topics that need to be considered when a Data Lake deals with data processing. With Hadoop 1.x, MapReduce was the main processing framework available in Hadoop. With Hadoop 2.x, and with the arrival of additional data ingestion methodologies, more options in the real-time/streaming area have also emerged; these two aspects, along with some important considerations, are detailed here.
Validating data before it enters the persistence layer of the Data Lake is a very important step. Validation in the context of a Data Lake covers the following two aspects:
- Origin of data: Making sure the right data from the right source is ingested into the Data Lake. The source from which the data originates should be known, and the incoming data itself should also...
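The origin check described above can be sketched as a small validation step that runs before a record is persisted. This is a minimal illustration only; the source whitelist, the field names, and the record layout are hypothetical assumptions, not part of any specific Data Lake product.

```python
# Minimal sketch of origin validation at ingestion time.
# KNOWN_SOURCES and the record fields below are illustrative
# assumptions; a real Data Lake would drive these from metadata.

KNOWN_SOURCES = {"crm", "billing", "clickstream"}   # hypothetical whitelist
REQUIRED_FIELDS = {"source", "event_id", "payload"}  # hypothetical schema

def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record passes."""
    errors = []
    # Check that the record carries the fields we expect.
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append("missing fields: %s" % sorted(missing))
    # Check that the record claims a known, whitelisted origin.
    source = record.get("source")
    if source not in KNOWN_SOURCES:
        errors.append("unknown source: %r" % source)
    return errors

good = {"source": "crm", "event_id": "42", "payload": {"name": "Ada"}}
bad = {"source": "unknown-feed", "payload": {}}
print(validate_record(good))  # []
print(validate_record(bad))
```

Records that fail validation would typically be routed to a quarantine area rather than silently dropped, so that upstream source issues remain visible.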