Storage is one of the most critical parts of a Data Lake, and Apache Hadoop (HDFS) is the core of data storage for ours. The following figure sums up this aspect clearly, showing the batch and stream data storage components in our Data Lake, along with the other technologies in the Hadoop ecosystem that deal with various aspects of storage:
Figure 02: Apache Hadoop (HDFS) as data storage
We will now look at some important concepts in the data storage area. We will concentrate explicitly on batch and speed data, how each gets stored in the Data Lake, and some specific details relating to these data types.
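To make the batch/speed distinction concrete, here is a minimal Python sketch of one possible HDFS directory convention for a raw zone. The path layout, zone names, and partitioning scheme shown here are illustrative assumptions, not a prescribed standard; the idea is simply that batch data is typically partitioned by ingestion date, while streaming data is often partitioned down to the hour so that small files stay grouped by arrival window.

```python
from datetime import datetime

def raw_zone_path(layer: str, source: str, ts: datetime) -> str:
    """Build a hypothetical HDFS path for incoming data in the raw zone.

    NOTE: this layout (/datalake/raw/...) is an illustrative assumption,
    not a convention mandated by HDFS or by this book.
    """
    if layer == "batch":
        # Batch loads: partition by ingestion date only.
        return f"/datalake/raw/batch/{source}/{ts:%Y/%m/%d}"
    if layer == "stream":
        # Streaming ingests: partition down to the hour to group
        # many small files by arrival window.
        return f"/datalake/raw/stream/{source}/{ts:%Y/%m/%d/%H}"
    raise ValueError(f"unknown layer: {layer}")

ts = datetime(2024, 3, 15, 9, 30)
print(raw_zone_path("batch", "orders", ts))   # /datalake/raw/batch/orders/2024/03/15
print(raw_zone_path("stream", "clicks", ts))  # /datalake/raw/stream/clicks/2024/03/15/09
```

A predictable convention like this matters mostly for downstream tools: partition-aware engines such as Hive or Spark can prune directories by date, and retention jobs can delete whole partitions instead of scanning files.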
Even though data in the storage layer need not follow a set pattern, an organization adopting a Data Lake benefits from clear directions and principles on how data should be put into it. These are just some of our recommendations that could be considered or kept...