Big Data Forensics: Learning Hadoop Investigations

Chapter 4. Collecting Hadoop Distributed File System Data

The Hadoop Distributed File System (HDFS) is the primary source of evidence in a Hadoop forensic investigation. Whether Hadoop data is used in Hive, HBase, or a custom Java application, the data is stored in HDFS, so forensic evidence can be collected from HDFS regardless of the application layer. Investigators can take one of two collection approaches: collect the HDFS data from the host operating system, or collect it directly through Hadoop.
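The second approach, collecting directly through Hadoop, can be sketched with the standard HDFS shell commands. The paths and case names below are hypothetical examples, and this assumes a running cluster with client access; hashing the copied file immediately after collection supports chain of custody:

```shell
# Hypothetical target path and case directory -- substitute the paths
# relevant to your investigation.
mkdir -p ./collected

# Copy a file out of HDFS to the local collection directory.
hdfs dfs -copyToLocal /evidence/case01/data.csv ./collected/data.csv

# Record a cryptographic hash of the collected copy so its integrity
# can be verified later in the investigation.
sha256sum ./collected/data.csv > ./collected/data.csv.sha256
```

The same pattern works with `hdfs dfs -get`, which is equivalent to `-copyToLocal` for this purpose.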

The advantage of collecting from HDFS is that investigators can collect much more data than they can from a data analysis or application layer. Some potentially relevant data can only be collected through HDFS: metadata, configuration files, user files that were never imported into an application, custom scripts, and other information. In some forensic investigations, this otherwise ancillary data can be crucial for determining how the system operated and how it was used.
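A minimal sketch of collecting this ancillary data alongside file content, assuming a typical Apache Hadoop layout (the `/etc/hadoop` configuration path and output locations are assumptions and vary by distribution):

```shell
mkdir -p ./collected

# Download the NameNode's most recent fsimage, which holds the
# filesystem metadata (namespace, permissions, modification times).
hdfs dfsadmin -fetchImage ./collected/

# Render the binary fsimage into XML with the Offline Image Viewer
# so the metadata can be reviewed and searched.
hdfs oiv -p XML -i ./collected/fsimage_* -o ./collected/fsimage.xml

# Preserve the cluster configuration files in a single archive.
tar -czf ./collected/hadoop-conf.tar.gz -C /etc/hadoop .
```

Capturing the fsimage and configuration at collection time matters because both change as the cluster runs; a later re-collection may not reflect the state under investigation.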

Collecting evidence from HDFS can be more time- and...