Collecting other Hadoop application data and non-Hadoop data
Relevant data is not always stored in or accessed through Hive, HBase, or even HDFS. Hadoop clusters are typically part of a larger data analysis ecosystem, so data flows into and out of Hadoop from other systems. Data transfers and transformations may occur inside Hadoop and at its ingress and egress points. These changes to the data may be relevant to the investigation, in which case the investigator may need to collect data from these systems as well.
Many other Hadoop applications are available for data analysis and storage. The Apache Software Foundation currently lists many projects and incubator projects that are deployed in production environments. In the course of an investigation, the investigator may encounter established applications such as Cassandra, Chukwa, and Spark, as well as newer ones (for example, Drill and Tajo). When a new or uncommon application is identified, the investigator can apply the same collection process to each application, which...