Big Data Forensics: Learning Hadoop Investigations

An overview of computer forensics

Computer forensics is a field that involves the identification, collection, analysis, and presentation of digital evidence. The goals of a forensic investigation include:

  • Properly locating all relevant data

  • Collecting the data in a sound manner

  • Producing analysis that accurately describes the events

  • Clearly presenting the findings

Forensics is a technical field. As such, much of the process requires a deep technical understanding and the use of technical tools and techniques. Depending on the nature of an investigation, forensics may also involve legal considerations, such as spoliation and how to present evidence in court.


Unless otherwise stated, all references to forensics, investigations, and evidence in this book are in the context of Big Data forensics.

Computer forensics centers on evidence. Evidence is a proof of fact. Evidence may be presented in court to prove or disprove a claim or issue by logically establishing a fact. Many types of legal evidence exist, such as material objects, documents, and sworn testimony. Forensic evidence falls firmly in that legal set of categories and can be presented in court. In the broader sense, forensic evidence is the informational content of and about the data.

Forensic evidence comes in many forms, such as e-mails, databases, entire filesystems, and smartphone data. Evidence can be the information contained in the files, records, and other logical data containers. Evidence is not only the contents of the logical data containers, but also the associated metadata. Metadata is any information about the data that is stored by a filesystem, content management system, or other container. Metadata is useful for establishing information about the life of the data (for example, author and last modified date).

This metadata can be combined with the data to form a story about the who, what, why, when, where, and how of the data. Evidence can also take the form of deleted files, file fragments, and the contents of in-memory data.
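As a rough illustration of filesystem metadata, the following Python sketch reads a few attributes with the standard library's `os.stat`. The function name `file_metadata` and the selection of fields are assumptions made for this example and do not come from any forensic tool. Reading metadata on a live system is shown only to make the concept concrete; it is not a substitute for a forensically sound collection.

```python
import os
from datetime import datetime, timezone


def file_metadata(path):
    """Return a few filesystem metadata fields for path (illustrative only)."""
    info = os.stat(path)

    def to_utc(timestamp):
        # Convert seconds since the epoch to an ISO 8601 string in UTC.
        return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()

    return {
        "path": os.path.abspath(path),
        "size_bytes": info.st_size,
        "modified_utc": to_utc(info.st_mtime),
        "accessed_utc": to_utc(info.st_atime),
        # On Unix this is the last metadata change; on Windows it has
        # historically been reported as the creation time.
        "changed_utc": to_utc(info.st_ctime),
    }
```

Fields such as the author are usually stored inside the file or by a content management system rather than by the filesystem, so they would need a format-specific reader.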

For evidence to be court admissible or accepted by others, the data must be properly identified, collected, preserved, documented, handled, and analyzed. While the evidence itself is paramount, the process by which the data is identified, collected, and handled is also critical to demonstrate that the data was not altered in any way. The process should adhere to the best practices accepted by the court and backed by technical standards. The analysis and presentation must also adhere to best practices for both admissibility and audience comprehension. Finally, documentation of the entire process must be maintained and available for presentation to clearly demonstrate all the steps performed—from identification to collection to analysis.

The forensic process

The forensic process involves four phases: identification, collection, analysis, and presentation. The phases are performed sequentially, but the process as a whole is iterative. Earlier phases may need to be revisited for the following reasons:

  • Additional data sources are required

  • Additional analyses need to be performed

  • Further documentation of the identification process is needed

  • Other situations, as required

The following figure shows the high-level forensic process discussed in this book:

Figure 1: The forensic process


This book follows the forensic process of the Electronic Discovery Reference Model (EDRM), which is the industry standard and is a court-accepted best practice. The EDRM is developed and maintained by forensic and electronic discovery (e-discovery) professionals. For more information, visit EDRM's website at


Investigators should attempt to apply the full set of forensic steps and goals in every investigation. No two investigations are the same, however, so practical realities may dictate which steps are performed and which goals can be met.

The four phases of the forensic process and the goals for each are covered in the following sections.

Identification

Identifying and fully collecting the data of interest in the early stages of an investigation is critical to any successful project. If data is not properly identified and, subsequently, is not collected, an embarrassing and difficult process of corrective efforts will be required—at a minimum—not to mention wasted time. At worst, improperly identifying and collecting data will result in working with an incorrect or incomplete set of data. In the latter case, court sanctions, a lost investigation, and ruined reputations can be expected.

The high-level approach taken in this book starts with:

  • Examining the organization's system architecture

  • Determining the kinds of data in each system

  • Previewing the data

  • Assessing which systems are to be collected

The identification phase should also include a process to triage the data sources by priority and to ensure that the data sources are not subsequently used or modified. This approach results in documentation to back up the claim that all potentially important sources of data were examined. It also provides assurance that no major systems were overlooked. The main considerations for each source are as follows:

  • Data quality

  • Data completeness

  • Supporting documentation

  • Validating the collected data

  • Previous systems where the data resided

  • How the data enters and leaves the system

  • The available formats for extraction

  • How well the data meets the data requirements

The following figure illustrates this high-level identification process:

Figure 2: Data identification process

The primary goals for the identification stage of an investigation are as follows:

  • Proper identification and documentation of potentially relevant sources of evidence

  • Complete documentation of identified sources of information

  • Timely assessment of potential sources of evidence from key stakeholders

Collection

The data collection phase involves the acquisition and preservation of evidence and validation information as well as properly documenting the process. For evidence to be court admissible and usable, it needs to be collected in a defensible manner that adheres to best practices. Collecting data alone, however, is not always sufficient in an investigation. The data should be accompanied by validation information (for example, log or query files) and documentation of the collection and preservation steps performed. Together, the collected data, validation information, and documentation allow for proper analysis that can be validated and defended.

The following figure highlights the collection phase process:

Figure 3: Data collection process

Data collection is a critical phase in a digital investigation. The data analysis phase can be rerun and corrected, if needed. However, improperly collecting data may result in serious issues later during analysis, if the error is detected at all. If the error goes undetected, the improper collection will result in poor data for the analysis. For example, if the collection was only a partial collection, the analysis results may understate the actual values. If the improper collection is detected during the analysis process, recollecting data may be impossible. This is the case when the data has been subsequently purged or is no longer available because the owner of the data will not permit access to the data again. In short, data collection is critical for later phases of the investigation, and there may not be opportunities to perform it again.

Data can be collected using several different methods. These methods are as follows:

  • Physical collection: A physical acquisition of every bit, which may be done across specific containers, volumes, or devices. The collection is an exact replica of every bit of data and metadata. Slack space and deleted files can be recovered using this method.

  • Logical collection: An acquisition of active data. The collection is a replica of the informational content and metadata, but is not a bit-by-bit collection.

  • Targeted collection: A collection of specific containers, volumes, or devices.

Each of the methods is covered in this book. Validation information serves as a means for proving what was collected, who performed the collection, and how all relevant data was captured. Validation is also crucial to the collection phase and later stages of an investigation. Collecting the relevant data is the primary goal of any investigation, but the validation information is critical for ensuring that the relevant data was collected properly and not modified later. Obviously, without the data, the entire process is moot.

A closely related goal is to collect the validation information along with the data. The primary forms of validation information are MD5/SHA-1 hash values, system and process logs, and control totals. Both MD5 and SHA-1 are hash algorithms that generate a fixed-length value from the contents of a file; the value serves as a fingerprint and can be used to authenticate evidence. If a file is modified, the MD5 or SHA-1 of the modified file will not match the original, and two different files producing the same value by accident is virtually impossible. For this reason, forensic investigators rely on MD5 or SHA-1 to prove that the evidence was successfully collected and that the data analyzed matches the original source data. (Deliberately constructed collisions have been demonstrated for both algorithms, so a stronger hash such as SHA-256 is often recorded as well.) Control totals are another form of validation information: values computed from a structured data source, such as the number of rows or the sum of a numeric field. All collected data should be validated in some manner during the collection phase before moving into the analysis.
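The two forms of validation information described above can be sketched in Python with the standard library. This is a minimal illustration, not a replacement for a forensic tool: the function names `hash_stream` and `control_totals`, the chunk size, and the assumption that the structured data is CSV with a header row are all choices made for this example.

```python
import csv
import hashlib
import io
from decimal import Decimal


def hash_stream(stream, chunk_size=1024 * 1024):
    """Compute MD5 and SHA-1 of a binary stream in a single pass."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    # Read in chunks so that large evidence files do not need to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
        sha1.update(chunk)
    return {"md5": md5.hexdigest(), "sha1": sha1.hexdigest()}


def control_totals(text_stream, amount_field):
    """Compute a row count and the sum of one numeric column of CSV data."""
    rows = 0
    total = Decimal("0")
    for row in csv.DictReader(text_stream):
        rows += 1
        # Decimal avoids floating-point rounding in monetary sums.
        total += Decimal(row[amount_field])
    return {"row_count": rows, "sum": total}
```

For a file on disk, `hash_stream(open(path, "rb"))` would be run once against the source and once against the collected copy, and the two results compared. The control totals would likewise be computed at the source (for example, by a query) and again on the extracted data.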


Collect validation information during or immediately after evidence collection to ensure accurate and reliable validation.

The goals of the collection phase are as follows:

  • Forensically sound collection of relevant sources of evidence utilizing technical best practices and adhering to legal standards

  • Full, proper documentation of the collection process

  • Collection of verification information (for example, MD5 or control totals)

  • Validation of collected evidence

  • Maintenance of chain of custody

Analysis

The analysis phase is the process by which collected and validated evidence is examined to gather and assemble the facts of an investigation. Many tools and techniques exist for converting the volumes of evidence into facts. In some investigations, the requirements clearly and directly point to the types of evidence and facts that are needed. These investigations may involve only a small amount of data, or the issues may be straightforward. For example, they may require only a specific e-mail, or only a short timeframe may be in question. Other investigations, however, are large and complex. The requirements do not clearly identify a direct path of inquiry. The tools and techniques in the analysis phase are designed for both types of investigations and guide the inquiry.

The process for analyzing forensic evidence is dependent on the requirements of the investigation. Every case is different, so the analysis phase is both a science and an art. Most investigations are bounded by some known facts, such as a specific timeframe or the individuals involved. The analysis for such bounded investigations can begin by focusing on data from those time periods or involving those individuals. From there, the analysis can expand to include other evidence for corroboration or a new focus. Analysis can be an iterative process of investigating a subset of information. Analysis can also focus on one theory but then expand to either include new evidence or to form a new theory altogether. Regardless, the analysis should be completed within the practical confines of the investigation.

Two of the primary ways in which forensic analysis is judged are completeness and bias. Completeness, in forensics, is a relative term based on whether the relevant data has been reasonably considered and analyzed. Excluding relevant evidence or forms of analysis harms the credibility of the analysis. The key point is the reasonableness of including or excluding evidence and analysis. Bias is closely related to completeness. Bias is prejudice towards or against a particular thing. In the case of forensic analysis, bias is an inclination to favor a particular line of thinking without giving equal weight to other theories. Bias should be eliminated or minimized as much as possible when performing analysis to guarantee completeness and objective analysis. Both completeness and bias are covered in subsequent chapters.

Another key concept is data reduction. Forensic investigations can involve terabytes of data and millions of files and other data points. The practical realities of an investigation may not allow for a complete analysis of all data. Techniques exist for reducing the volume of data to a more manageable amount. This is performed using known facts and data interrelatedness to triage data by priority or eliminate data from the set of data to be analyzed.
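One way to picture data reduction is as a series of filters driven by known facts. The sketch below assumes, purely for illustration, that each item has already been inventoried with a custodian, a last-modified time, and a SHA-1 value; the `Item` type, the `reduce_items` function, and the idea of a list of known irrelevant hashes are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Item:
    path: str
    custodian: str
    modified: datetime
    sha1: str


def reduce_items(items, start, end, custodians, known_hashes=frozenset()):
    """Keep items inside the timeframe, owned by a custodian of interest,
    not on the known-irrelevant hash list, and not duplicates of a kept item."""
    kept = []
    seen_hashes = set()
    for item in items:
        if not (start <= item.modified <= end):
            continue  # outside the timeframe under investigation
        if item.custodian not in custodians:
            continue  # belongs to someone not involved
        if item.sha1 in known_hashes:
            continue  # known irrelevant content, such as system files
        if item.sha1 in seen_hashes:
            continue  # exact duplicate of an item already kept
        seen_hashes.add(item.sha1)
        kept.append(item)
    return kept
```

The excluded items are not destroyed; they are simply set aside, and the criteria used should be documented so that the reduction can be explained and repeated.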

Cross-validation is the use of multiple analyses or pieces of evidence to corroborate analysis. This is a key concept in forensics. While not always possible, cross-validation adds veracity to findings by further proving the likelihood that a finding is true. Cross-validation should be performed by independently testing two data sets or forms of analysis and confirming that the results are consistent.
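A simple form of cross-validation is to compute the same summary figures from two independent sources and confirm that they agree. The following sketch assumes the figures are held in plain dictionaries; the function name `compare_totals` and the example keys are hypothetical.

```python
def compare_totals(totals_a, totals_b):
    """Return {name: (value_a, value_b)} for every figure that differs
    between the two sources or is missing from one of them."""
    differences = {}
    for name in sorted(set(totals_a) | set(totals_b)):
        value_a = totals_a.get(name)
        value_b = totals_b.get(name)
        if value_a != value_b:
            differences[name] = (value_a, value_b)
    return differences
```

An empty result means the two sources are consistent for the figures tested. A non-empty result does not say which source is wrong, only that the discrepancy needs to be investigated and documented.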

The types of analysis performed depend on a number of factors. Forensic investigators have an arsenal of tools and techniques for analyzing evidence, and those tools and techniques are chosen based on the requirements of the investigation and the types of evidence. One example is timeline analysis, a technique used when chronology is important and chronological information exists and can be established. Chronology does not matter in every investigation, so timeline analysis is not always useful.
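At its simplest, timeline analysis merges timestamped events from several sources into one chronological sequence. The sketch below is an assumption-laden illustration: the `Event` type and `build_timeline` function are hypothetical, and the example requires every timestamp to carry a time zone so that events recorded in different zones are ordered correctly.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Event(NamedTuple):
    when: datetime
    source: str
    description: str


def build_timeline(*sources):
    """Merge events from several sources and sort them by time in UTC."""
    timeline = []
    for events in sources:
        for event in events:
            if event.when.tzinfo is None:
                # A timestamp without a zone cannot be placed reliably.
                raise ValueError(f"timestamp without time zone: {event}")
            utc_time = event.when.astimezone(timezone.utc)
            timeline.append(event._replace(when=utc_time))
    return sorted(timeline, key=lambda event: event.when)
```

Normalizing to UTC before sorting is a design choice for this example; in practice the original timestamp and its zone should also be preserved in the documentation.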

In other cases, pattern analysis or anomaly detection may be required. While some investigations only require a single tool or technique, most investigations require a combination of tools and techniques. Later chapters include information about the various tools and techniques and how to select the proper ones. The following questions can help an investigator determine which tools and techniques to choose:

  • What are the requirements of the investigation?

  • What practical limitations exist?

  • What information is available?

  • What is already known about the evidence?

Documentation of findings and the analysis process must be carefully maintained throughout the process. Forensic evidence is complex. Analyzing forensic evidence can be even more complex. Without proper documentation, the findings are unclear and not defensible. An investigator can go down a path of analyzing data and related information—sometimes, linking hundreds of findings—and without documentation, detailing the full analysis is impossible. To avoid this, an investigator needs to carefully detail the evidence involved, the analysis performed, the analysis findings, and the interrelationships between multiple analyses.

The primary goals of the analysis phase are as follows:

  • Unbiased and objective analysis

  • Reduction of data complexity

  • Cross-validation of findings

  • Application of accepted standards

Presentation

The final phase in the forensic process is the presentation of findings. The findings can be presented in a number of different ways, such as a written expert report, graphical presentations, or testimony. Regardless of the format, the key to a successful presentation is to clearly demonstrate the findings and the process by which the findings were derived. The process and findings should be presented in a way that the audience can easily understand. Not every piece of information about the process phases or findings needs to be presented. Instead, the focus should be on the critical findings at a level of detail that is sufficiently thorough. Documentation, such as chain of custody forms, may not need to be included but should still be available should the need arise.

The goals of the presentation phase are as follows:

  • Clear, compelling evidence

  • Analysis that separates the signal from the noise

  • Proper citation of source evidence

  • Availability of chain of custody and validation documentation

  • Post-investigation data management

Other investigation considerations

This book details the majority of the EDRM forensic process. However, investigators should be aware of several additional considerations not covered in detail in this book. Forensics is a large field with many technical, legal, and procedural considerations. Covering every topic would span multiple volumes. As such, this book does not attempt to cover all concepts. The following sections highlight several key concepts that a forensic investigator should consider: equipment, evidence management, investigator training, and the post-investigation process.

Equipment

Forensic investigations require specialized equipment for the collection and processing of evidence. Source data can reside on a host of different types of systems and devices. An investigator may need to collect data from several different types of systems, including cell phones, mainframe computers, laptops with various operating systems, and database servers. These devices have different hardware and software connectors, different means of access, different configurations, and so on. In addition, an investigator must be careful not to alter or destroy evidence in the collection process. A best practice is to employ write-blocking software or hardware to ensure that evidence is preserved in its original state. In some instances, specialized forensic equipment should be used to perform the collections, such as forensic devices that connect to smartphones for acquisitions. Big Data investigations rarely involve this specialized equipment to collect the data, but encrypted drives and other forensic devices may be used. Forensic investigators should be knowledgeable about the required equipment and come prepared to collect data with a forensic kit that contains the required equipment.

Evidence management

The management of forensic evidence is also critical to maintaining proper control and security of the evidence. Forensic evidence, once collected, requires careful handling, storage, and documentation. A standard practice in forensics is to create and maintain a chain of custody for all evidence. Chain of custody documentation is a chronological description that details the collection, handling, transfer, analysis, and destruction of evidence. The chain of custody is established when a forensic investigator first acquires the data. The documentation details the collection process and then serves as a log of every individual who takes possession of the evidence, when each person had possession, and what was done to the evidence. Chain of custody documentation should always reflect the full history and current status of the evidence. Chain of custody is further discussed in later chapters.
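A chain of custody can be thought of as an append-only, chronological log attached to one piece of evidence. The following sketch shows one possible data structure; the class names, field names, and the example action labels are assumptions for illustration and do not reflect any standard form or tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class CustodyEvent:
    when: datetime
    custodian: str
    action: str  # for example "collected", "transferred", "analyzed", "destroyed"
    notes: str = ""


@dataclass
class ChainOfCustody:
    evidence_id: str
    events: list = field(default_factory=list)

    def record(self, custodian, action, notes="", when=None):
        """Append an event; entries must be in chronological order."""
        when = when or datetime.now(timezone.utc)
        if self.events and when < self.events[-1].when:
            raise ValueError("custody events must be recorded in chronological order")
        event = CustodyEvent(when, custodian, action, notes)
        self.events.append(event)
        return event

    def current_custodian(self):
        """Return who holds the evidence now, or None if nothing is recorded."""
        return self.events[-1].custodian if self.events else None
```

Making each event immutable and refusing out-of-order entries mirrors the requirement that the documentation always reflect the full history and current status of the evidence.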

Only authorized individuals should have access to the evidence. Evidence integrity is critical for establishing and maintaining the veracity of findings. Allowing unauthorized—or undocumented—access to evidence can cast doubt on whether the evidence was altered. Even if the MD5 hash values are later found to match, allowing unauthorized access to the evidence can be enough to call the investigative process into question.

Security is important for preventing unauthorized access to both original evidence and analysis. The security of evidence should cover the premises, the evidence locker, any device that can access the analysis server, and network connections. Forensic investigators should be concerned with two types of security, physical and digital, both of which play important roles in the overall security of evidence:

  • Physical security is the collection of devices, structural design, processes, and other means for ensuring that unauthorized individuals cannot access, modify, destroy, or deny access to the data. Examples of physical security include locks, electronic fobs, and reinforced walls in the forensic lab.

  • Digital security is the set of measures to protect the evidence on devices and on a network. Evidence can contain malware that could infect the analysis machine. A networked forensic machine that collects evidence remotely can potentially be penetrated. Examples of digital security include antivirus software, firewalls, and ensuring that forensic analysis machines are not connected to a network.

Investigator training and certification

Forensic investigators are often required to take forensic training and maintain current certifications in order to conduct investigations and testify to the results. While this is not always required, investigators can further demonstrate that they have the proper technical expertise by way of such training and certification. Forensic investigators are forensic experts, so that expertise should be documented and provable should anyone question their credentials.

The post-investigation process

After an investigation concludes, the evidence and analysis findings need to be properly archived or destroyed. Criminal and civil investigations require that evidence be maintained for a mandated period of time. The investigator should be aware of such retention rules and ensure that evidence is properly and securely archived and maintained for that period of time. Documentation and analysis should also be retained so that the results of the investigation are not lost and to prevent issues arising from questions about the evidence (for example, chain of custody).