Hands-On Big Data Modeling

By: James Lee, Tao Wei, Suresh Kumar Mukhiya

Overview of this book

Modeling and managing data is a central focus of all big data projects. In fact, a database is considered to be effective only if you have a logical and sophisticated data model. This book will help you develop practical skills in modeling your own big data projects and improve the performance of analytical queries for your specific business requirements. To start with, you’ll get a quick introduction to big data and understand the different data modeling and data management platforms for big data. Then you’ll work with structured and semi-structured data with the help of real-life examples. Once you’ve got to grips with the basics, you’ll use the SQL Developer Data Modeler to create your own data models containing different file types such as CSV, XML, and JSON. You’ll also learn to create graph data models and explore data modeling with streaming data using real-world datasets. By the end of this book, you’ll be able to easily design and develop efficient data models for data of varying sizes.

The concept of big data

Digital systems are progressively intertwined with real-world activities. As a consequence, vast amounts of data are recorded and reported by information systems. During the last 50 years, information systems and their capabilities to capture, curate, store, share, transfer, analyze, and visualize data have grown exponentially. Alongside these technological advances, people and organizations depend more and more on computerized devices and information sources on the internet. The IDC Digital Universe Study from May 2010 illustrates this spectacular growth of data. The study estimated that the amount of digital information stored (on personal computers, digital cameras, servers, and sensors) already exceeded 1 zettabyte, and predicted that the digital universe would grow to 35 zettabytes by 2020. The IDC study characterizes 35 zettabytes as a stack of DVDs reaching halfway to Mars. This is what we refer to as the data explosion.

Most of the data stored in the digital universe is unstructured, and organizations face challenges in capturing, curating, and analyzing it. One of the most challenging tasks for today's organizations is to extract information and value from the data stored in their information systems. This data, which is highly complex and too voluminous to be handled by a traditional DBMS, is called big data.

Big data is a term for collections of datasets so massive and complex that they become troublesome to process using on-hand database-management tools or conventional processing applications. In the current market, big data tends to refer to the use of user-behavior analytics, predictive analytics, or other advanced data-analysis methods that extract value from this new data ecosystem.

Whether it's day-to-day data, business data, or basic data, if it represents a massive volume of information, either structured or unstructured, it is relevant to the organization. However, it's not only the size of the data that matters; it's how the organization uses it to extract the deeper insights that drive better business and strategic decisions. This voluminous data can be used to determine the quality of research, enhance process flow in an organization, prevent a particular disease, link legal citations, or combat crime. Big data is everywhere, and with the right tools it can be made far more effective for business analytics.

Interesting insights regarding big data

Some interesting facts related to big data, and its management and analysis, are listed here, while others are presented in the Further reading section; that section also identifies the source of these figures.

  • Almost 91% of the world's marketing leaders use customer data to make business decisions.
  • Interestingly, 90% of the world's total data has been generated within the last two years.
  • 87% of people agree that capturing and sharing the right data is important for effectively measuring Return on Investment (ROI) in their own company.
  • 86% of people are willing to pay more for a great customer experience with a brand.
  • 75% of companies claim they will expand investments in big data within the next year.
  • About 70% of big data is created by individuals, but enterprises are responsible for storing and managing 80% of it.
  • 70% of businesses accept that their marketing efforts are under higher scrutiny.

Characteristics of big data

We explored the popularity of big data in the preceding section. But it is equally important to know what kinds of data can be categorized or labeled as big data. In this section, we are going to explore the various characteristics of big data. Most of the books available on the market claim there are six of them, discussed as follows:

  • Volume: Big data implies massive amounts of data. The size of the data plays a very relevant role in determining its value, and it is also a key factor in judging whether a chunk of data counts as big in the first place. Hence, volume is one of the defining attributes of big data.
Every minute, 204,000,000 emails are sent, 200,000 photos are uploaded, and 1,800,000 likes are generated on Facebook; on YouTube, 1,300,000 videos are viewed and 72 hours of video are uploaded.

The point of aggregating such massive volumes of data is that businesses and organizations collect and leverage it to reinforce their products and services, whether the concern is safety, dependability, healthcare, or governance. In brief, the idea is to turn this abundant, voluminous data into some form of business advantage.

  • Velocity: This refers to the increasing speed at which big data is created, and the increasing speed at which it is stored and analyzed. Processing data in real time, at the rate at which it is generated, is a remarkable goal of big data analytics. The term velocity generally applies to how fast data is produced and processed to satisfy demand, and it is this speed that unlocks the real potential in the data. The flow of data is massive and continuous, and the data can be stored and processed in different ways, including batch processing, near-real-time processing, real-time processing, and streaming (a short sketch contrasting batch and stream processing follows this list):

    • Real-time processing refers to the ability to capture, store, and process the data in real time and trigger immediate action, potentially saving lives.
    • Batch processing refers to feeding large amounts of data into large machines and processing it in jobs that run for hours or days at a time. It is still very common today.
  • Variety: This refers to the many sources and types of data, whether structured, semi-structured, or unstructured. We will discuss these types of big data further in Chapter 5, Structures of Data Models. When we think of data variety, we think of the additional complexity that results from the many kinds of data we need to store, process, and combine. Data is more heterogeneous these days: BLOB image data, enterprise data, network data, video data, text data, geographic maps, computer-generated or simulated data, and social media data. We can categorize variety along several dimensions, some of which are explained as follows (a small sketch of structural variety also follows this list):

    • Structural variety: This refers to the representation of the data; for example, a satellite image of wildfires from NASA is completely different from tweets sent out by people who are seeing the fire spread.
    • Media variety: Data is delivered in various media, such as text, audio, or video; this is referred to as media variety.
    • Semantic variety: Semantic variety comes from different assumptions about how the data should be interpreted. For example, age can be recorded qualitatively (infant, juvenile, or adult) or quantitatively (as a number).
  • Veracity: This refers to the quality of the data, and is often discussed together with validity and volatility. Big data can be noisy and uncertain, full of biases and abnormalities, and it can be imprecise. Data is of no value if it is not accurate: the results of big data analysis are only as good as the data being analyzed. This creates challenges in keeping track of data quality, including what has been captured, where the data came from, and how it was analyzed prior to its use.

  • Valence: This refers to connectedness. The more connected the data is, the higher its valence, and a high-valence dataset is denser. This makes many regular analytical queries very inefficient.

  • Value: In general, this refers to the valuable insights gained from the ability to investigate and identify new patterns and trends across high-volume, cross-platform systems. The whole point of processing all this big data in the first place is to bring value to the query at hand; the final output of all these tasks is that value.
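
To make the contrast between batch and stream processing concrete, here is a minimal Python sketch (not taken from the book) that computes the same average in both styles. The simulated sensor readings, the window size, and the function names are illustrative assumptions only; real pipelines would typically rely on dedicated engines rather than hand-rolled loops.

```python
# Minimal sketch: the same aggregate computed batch-style and stream-style.
# The event data, window size, and function names are made up for illustration.
from collections import deque
from statistics import mean

events = [{"sensor": "s1", "temp": 20 + i % 5} for i in range(100)]  # fake readings

# Batch: collect everything first, then compute one aggregate over the full set.
def batch_average(all_events):
    return mean(e["temp"] for e in all_events)

# Streaming: handle each event as it arrives, keeping only a small sliding window.
def stream_averages(event_iter, window_size=10):
    window = deque(maxlen=window_size)
    for event in event_iter:
        window.append(event["temp"])
        yield mean(window)  # an up-to-date answer after every single event

print("batch average:", batch_average(events))
print("latest streaming average:", list(stream_averages(iter(events)))[-1])
```

The batch version only produces an answer once all the data has arrived, whereas the streaming version yields an updated answer per event, which is what makes near-real-time action possible.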
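
As a purely illustrative view of structural variety, the following Python sketch represents the same sensor reading as structured CSV, semi-structured JSON, and unstructured free text. The sample values and the regular expression are assumptions made for this demonstration, not examples from the book.

```python
# Sketch of structural variety: one reading as structured CSV,
# semi-structured JSON, and unstructured free text (all values invented).
import csv
import io
import json
import re

csv_record = "sensor_id,temperature\ns1,23.5\n"  # structured: fixed columns
json_record = '{"sensor": {"id": "s1", "temperature": 23.5, "tags": ["outdoor"]}}'  # semi-structured: nested, flexible
text_record = "Sensor s1 reported a temperature of 23.5 degrees this morning."  # unstructured: free text

# Structured data maps directly onto rows and columns.
row = next(csv.DictReader(io.StringIO(csv_record)))
print(row["sensor_id"], float(row["temperature"]))

# Semi-structured data carries its own (flexible) structure with it.
doc = json.loads(json_record)
print(doc["sensor"]["id"], doc["sensor"]["temperature"])

# Unstructured data needs extraction logic (here a crude regex) before it can be queried.
match = re.search(r"Sensor (\w+) reported a temperature of ([\d.]+)", text_record)
print(match.group(1), float(match.group(2)))
```

Storing, processing, and combining these three forms consistently is exactly the extra complexity that the variety characteristic describes.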

Here's a summed-up representation of the preceding content: