
What is Big Data?


Big Data describes the tools and techniques used to manage and process data that traditional means cannot easily handle. Many factors have led to the need for Big Data solutions. These include the recent proliferation of data storage, faster and easier data transfer, increased awareness of the value of data, and social media. Big Data solutions were needed to address the rapid, complex, and voluminous data sets that have been created in the past decade. Big Data can be structured data (for example, databases), unstructured data (such as e-mails), or a combination of both.

The four Vs of Big Data

A widely accepted set of characteristics of Big Data is the four Vs of data. In 2001, Doug Laney of META Group produced a report on the changing requirements for managing increasingly voluminous data. In this report, he defined the three Vs of data: volume, velocity, and variety. These factors address the following:

  • The large data sets

  • The increased speed at which data arrives, must be stored, and must be analyzed

  • The multitude of forms the data takes, such as financial records, e-mails, and social media data

This definition has been expanded to include a fourth V for veracity—the trustworthiness of the data quality and the data's source.

Tip

One way to identify whether a data set is Big Data is to consider the four Vs.

Volume is the most obvious characteristic of Big Data. The amount of data produced has grown exponentially over the past three decades, and that growth has been fueled by better and faster communications networks and cheaper storage. In the early 1980s, a gigabyte of storage cost over $200,000. A gigabyte of storage today costs approximately $0.06. This massive drop in storage costs and the highly networked nature of devices provide a means to create and store massive volumes of data. The computing industry now talks about the realities of exabytes (approximately one billion gigabytes) and zettabytes (approximately one trillion gigabytes) of data, and possibly even yottabytes (over a thousand trillion gigabytes). Data volumes have obviously grown, and Big Data solutions are designed to handle these voluminous data sets through distributed storage and computing that scale out with the growing data volumes. The distributed solutions provide a means for storing and analyzing massive data volumes that could not feasibly be stored or processed by a single device.
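
To keep the unit conversions above straight, here is a quick illustrative calculation using decimal (SI) units, where one gigabyte is one billion bytes:

```python
# Illustrative arithmetic only: how the byte units mentioned above relate
# to gigabytes, using decimal (SI) units where 1 GB = 10**9 bytes.
GB = 10**9
units = {"exabyte": 10**18, "zettabyte": 10**21, "yottabyte": 10**24}

for name, size_in_bytes in units.items():
    print(f"1 {name} = {size_in_bytes // GB:,} gigabytes")
# 1 exabyte   = 1,000,000,000 gigabytes          (about one billion)
# 1 zettabyte = 1,000,000,000,000 gigabytes      (about one trillion)
# 1 yottabyte = 1,000,000,000,000,000 gigabytes  (about a thousand trillion)
```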

Velocity is another characteristic of Big Data. The value of the information contained in data has placed an increased emphasis on quickly extracting information from data. The speed at which social media data, financial transactions, and other forms of data are being created can outpace traditional analysis tools. Analyzing real-time social media data requires specialized tools and techniques for quickly retrieving, storing, transforming, and analyzing the information. Tools and techniques designed to manage high-speed data also fall into the category of Big Data solutions.
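
As a minimal sketch of this idea (not any particular streaming tool), the following Python snippet processes records incrementally as they arrive rather than batch-loading the full data set first; the record layout and hashtag field are hypothetical:

```python
from collections import Counter

def ingest_stream(stream):
    """Aggregate incrementally as records arrive (e.g., from a queue or socket)."""
    hashtag_counts = Counter()
    for record in stream:                          # each record is handled on arrival
        hashtag_counts.update(record.get("hashtags", []))
    return hashtag_counts

# Hypothetical sample of social media records arriving over time
sample_stream = iter([
    {"user": "a", "hashtags": ["#bigdata"]},
    {"user": "b", "hashtags": ["#bigdata", "#hadoop"]},
])
print(ingest_stream(sample_stream).most_common(2))
```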

Variety is the third V of Big Data. A multitude of different forms of data are being produced. The new emphasis is on extracting information from a host of different data sources, which means that traditional analysis is not always sufficient. Video files and their metadata, social media posts, e-mails, financial records, and telephonic recordings may all contain valuable information, and these data sets often need to be analyzed in conjunction with one another. These different forms of data are not easily analyzed using traditional means.

Traditional data analysis focuses on transactional data, or so-called structured data, for analysis in a relational or hierarchical database. Structured data has a fixed composition and adheres to rules about what types of values it can contain. Structured data is often thought of in terms of records or rows, each with a set of one or more columns or fields. The rows and columns are bound by defined properties, such as data types and field width limitations. The most common forms of structured data are:

  • Database records

  • Comma-Separated Value (CSV) files

  • Spreadsheets

Traditional analysis is performed on structured data using databases, programs, or spreadsheets to load the data into a fixed format and run a set of commands or queries on the data. SQL has been the standard database language for data analysis over the past two decades—although many other languages and analysis packages exist.
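
As a small, self-contained illustration of this kind of schema-bound analysis (using Python's built-in sqlite3 module; the table and values are made up):

```python
import sqlite3

# An in-memory table with a fixed schema: each column has a defined type.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, account TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "A-100", 250.00), (2, "A-100", 75.50), (3, "B-200", 1200.00)],
)

# A typical traditional analysis: aggregate the structured rows with SQL.
for account, total in conn.execute(
    "SELECT account, SUM(amount) FROM transactions GROUP BY account"
):
    print(account, total)
```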

Unstructured and semi-structured data do not have the same fixed data structure rules and do not lend themselves well to traditional analysis. Unstructured data is data that is stored in a format that is not expressly bound by the same data format and content rules as structured data. Several examples of unstructured data are:

  • E-mails

  • Video files

  • Presentation documents

Note

According to VMware's 2013 Predictions for Big Data, over 80% of data produced will be unstructured, and the growth rate of unstructured data is 50-60% per year.

Semi-structured data has rules for its format and structure, but those rules are too loose for easy analysis using the traditional means applied to structured data. XML is the most common form of semi-structured data. XML has a self-describing structure, but the structure of one XML file is not necessarily shared by other XML files.
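
The following short Python example (with made-up records) illustrates the self-describing but loosely constrained nature of XML: both records are valid, yet they do not share one fixed set of fields:

```python
import xml.etree.ElementTree as ET

doc = """
<customers>
  <customer><name>Ann</name></customer>
  <customer><name>Bob</name><email>bob@example.com</email></customer>
</customers>
"""

for customer in ET.fromstring(doc):
    # The tags describe the data, but each record may carry different fields.
    print({child.tag: child.text for child in customer})
```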

The variety of Big Data comes from the incorporation of a multitude of different types of data. Variety can mean incorporating structured, semi-structured, and unstructured data, but it can also mean simply incorporating various forms of structured data. Big Data solutions are designed to analyze whatever type of data is required. Regardless of which types of data are incorporated, the challenge for Big Data solutions is being able to collect, store, and analyze various forms of data in a single solution.

Veracity is the fourth V of Big Data. Veracity, in terms of data, indicates whether the informational content of data can be trusted. With so many new forms of data and the challenge of quickly analyzing a massive data set, how does one trust that the data is properly formatted, has correct and complete information, and is worth analyzing? Data quality is important for any analysis. If the data is lacking in some way, all the analyses will be lacking. Big Data solutions address this by devising techniques for quickly assessing the data quality and appropriately incorporating or excluding the data based on the data quality assessment results.
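
A minimal sketch of such a record-level quality gate is shown below; the field names and rules are hypothetical, and real solutions apply far more sophisticated checks:

```python
def passes_quality_checks(record):
    """Keep only records that are complete and have a well-formed amount."""
    required_fields = ("id", "timestamp", "amount")
    if any(record.get(field) in (None, "") for field in required_fields):
        return False                      # incomplete record
    try:
        float(record["amount"])           # malformed numeric value
    except (TypeError, ValueError):
        return False
    return True

records = [
    {"id": 1, "timestamp": "2015-01-01", "amount": "10.50"},
    {"id": 2, "timestamp": "", "amount": "7.00"},            # missing timestamp
    {"id": 3, "timestamp": "2015-01-02", "amount": "n/a"},   # bad amount
]
clean = [r for r in records if passes_quality_checks(r)]
print(len(clean), "of", len(records), "records kept")
```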

Big Data architecture and concepts

The architectures for Big Data solutions vary greatly, but several core concepts are shared by most solutions. Data is collected and ingested into Big Data solutions from a multitude of sources. Big Data solutions are designed to handle various types and formats of data, and these various types of data can be ingested and stored together. The data ingestion system brings the data in for transformation before the data is sent to the storage system. Distributed storage is important for storing massive data sets: no single device can store all the data, nor can it be expected to run without device or disk failures. Similarly, distributed computation is critical for performing analysis across large data sets within timeliness requirements. Typically, Big Data solutions employ a master/worker model, such as MapReduce, in which one computational system acts as the master and distributes individual analysis tasks to worker systems. The master coordinates and manages the computational tasks and ensures that the worker systems complete them.
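
The following toy Python example sketches that master/worker pattern on a single machine: the process pool stands in for worker nodes running the map phase, and the coordinating code stands in for the master that groups and reduces the results. It only illustrates the pattern; it is not how Hadoop itself is implemented:

```python
from collections import defaultdict
from multiprocessing import Pool

def map_chunk(lines):
    """Map phase: each worker turns its chunk of lines into (word, 1) pairs."""
    return [(word, 1) for line in lines for word in line.split()]

def mapreduce_wordcount(chunks):
    with Pool() as workers:                      # stand-in for distributed worker nodes
        mapped = workers.map(map_chunk, chunks)  # master hands one chunk to each worker
    totals = defaultdict(int)
    for pairs in mapped:                         # shuffle/reduce: group by key and sum
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

if __name__ == "__main__":
    chunks = [["big data tools", "data at scale"], ["hadoop stores big data"]]
    print(mapreduce_wordcount(chunks))
```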

The following figure illustrates a high-level Big Data architecture:

Figure 4: Big Data overview

Big Data solutions utilize different types of databases to conduct the analysis. Because Big Data can include structured, semi-structured, and/or unstructured data, the solutions need to be capable of performing the analysis across various types of files. Big Data solutions can utilize both relational and nonrelational database systems. NoSQL (Not only SQL) databases are one of the primary types of nonrelational databases used in Big Data solutions. NoSQL databases use different data structures and query languages to store and retrieve information, including key-value, graph, and document structures. These structures can provide a better and faster method for retrieving information from unstructured, semi-structured, and structured data.
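
The snippet below uses plain Python structures as stand-ins (not a real NoSQL client) to contrast two of these data models, the key-value and document styles:

```python
# Key-value model: opaque values looked up directly by key.
kv_store = {"user:1001:last_login": "2015-06-01T10:22:00Z"}

# Document model: self-contained, schema-flexible records queried by field.
doc_store = [
    {"_id": 1001, "name": "Ann", "logins": ["2015-06-01T10:22:00Z"]},
    {"_id": 1002, "name": "Bob", "email": "bob@example.com"},  # different fields, same store
]

print(kv_store["user:1001:last_login"])
print([doc["name"] for doc in doc_store if "email" in doc])
```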

Two additional important and related concepts for many Big Data solutions are text analytics and machine learning. Text analytics is the analysis of unstructured sets of textual data. This area has grown in importance with the surge in social media content and e-mail. Customer sentiment analysis, predictive analysis of buyer behavior, security monitoring, and economic indicator analysis are all performed by running algorithms across text data. Text analytics is largely made possible by machine learning. Machine learning is the use of algorithms and tools to learn from data. Machine learning algorithms make decisions or predictions from data inputs without the need for explicitly programmed instructions.
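
As a toy illustration of learning from data rather than writing explicit rules, the following sketch trains a small sentiment classifier; it assumes the scikit-learn library is available, and the training texts and labels are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented labeled examples; a real system would train on far more data.
texts = ["great product, very happy", "terrible service, very unhappy",
         "happy with the support", "unhappy and disappointed"]
labels = ["positive", "negative", "positive", "negative"]

# The model learns word weights from the examples instead of hand-written rules.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["the support was great"]))
```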

Video files and other nontraditional analysis input files can be analyzed in a couple of ways:

  • Using specialized data extraction tools during data ingestion

  • Using specialized techniques during analysis

In some cases, only the unstructured data's metadata is important. In others, content from the data needs to be captured. For example, feature extraction and object recognition information can be captured and stored for later analysis. The needs of the Big Data system owner dictate the types of information captured and which tools are used to ingest, transform, and analyze the information.
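
As a minimal illustration of the metadata-only case, the sketch below captures basic filesystem metadata during ingestion and leaves deeper content extraction (such as object recognition on video frames) to specialized tools; the field names are hypothetical:

```python
import os
from datetime import datetime, timezone

def extract_metadata(path):
    """Capture basic filesystem metadata for a nontraditional file such as a video."""
    info = os.stat(path)
    return {
        "path": path,
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }

print(extract_metadata(__file__))   # demonstrated on this script itself
```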