Hands-On Big Data Modeling

By : James Lee, Tao Wei, Suresh Kumar Mukhiya

Overview of this book

Modeling and managing data is a central focus of all big data projects. In fact, a database is considered effective only if it is built on a logical and sophisticated data model. This book will help you develop practical skills in modeling your own big data projects and improve the performance of analytical queries for your specific business requirements. To start with, you'll get a quick introduction to big data and understand the different data modeling and data management platforms for big data. Then you'll work with structured and semi-structured data with the help of real-life examples. Once you've got to grips with the basics, you'll use the SQL Developer Data Modeler to create your own data models containing different file types such as CSV, XML, and JSON. You'll also learn to create graph data models and explore data modeling with streaming data using real-world datasets. By the end of this book, you'll be able to design and develop efficient data models for data of varying sizes.

Sources and types of big data

We learned that big data is omnipresent and that it can benefit enterprises in one or many ways. Despite the high prevalence of big data from existing hardware and software, enterprises still struggle to process, store, analyze, and manage it using traditional data-mining tools and techniques. In this section, we are going to explore the sources of this complex and dynamic data and how we can consume it.

We can separate the sources of the data into three major categories. The following diagram shows the three major sources of big data:

Let's look into the three major sources one by one:

  • Logs generated by a machine: A lot of big data is generated by real-time sensors in industrial machinery and vehicles, environmental sensors, personal health trackers, and other devices that create logs of user behavior and sensor readings. Most of this machine-created data can be grouped into the following subcategories:
    • Click-log stream data: This is the data that is captured every time a user clicks a link on a website. A detailed analysis of this data can reveal information about customer behavior, how deeply users interact with the website, and customers' buying patterns.
    • Gaming events log data: A user performs a set of tasks when playing any online game. Each and every move the online user makes in a game can be stored. This data can be analyzed, and the results can be helpful in understanding how end users are propelled through a gaming portfolio.
    • Sensor log data: Sensor log data comes from sources such as radio-frequency ID tags, smart meters, smartwatches, medical devices such as heart-rate monitors, and Global Positioning System (GPS) receivers. This log data can be recorded and then used to analyze the actual status of the subject being monitored.
    • Weblog event data: Servers, cloud infrastructures, applications, networks, and so on are used extensively. As they operate, these systems record all kinds of data about their events and operations. This data, when stored, can amount to massive volumes and can be useful for meeting service-level agreements or predicting security breaches.
    • Point-of-sale event-log data: Almost every product these days has a unique barcode. A cashier in a retail shop or department store scans the barcode of a product at the point of sale, and all the data associated with that product is generated and can be captured. This data can be analyzed to understand a retailer's selling patterns.
  • Person: People generate a lot of big data through social media, status updates, tweets, photos, and media uploads. Most of these logs are generated through a user's interactions with a network, such as the internet, and capture how the user communicates with it. These interaction logs can reveal deep content-interaction models that are useful in understanding user behavior. This analysis can be used to train a model that presents personalized recommendations of web items, such as the next news article to read or products the user is likely to buy. Related research areas, such as sentiment analysis and topic analysis, are very active in today's industry. Most of this data is unstructured, as it has no well-defined format or structure; it typically arrives as plain text, portable document format (PDF), comma-separated value (CSV), or JSON files.
  • Organization: We get a massive amount of data from organizations in the form of transaction information in databases and structured data stored in data warehouses. This data is highly structured. Organizations store it in some type of RDBMS, such as Oracle, SQL Server, or MS Access, where it resides in a fixed format inside the fields of a table. This organization-generated data is consumed and processed with ICT systems to support business intelligence and market analysis.
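To make the machine-generated and person-generated formats above concrete, the short Python sketch below parses a hypothetical Apache-style weblog line (machine data) and a JSON status update (person data). The log layout, field names, and sample values are illustrative assumptions, not examples from the book:

```python
import json
import re

# Hypothetical Apache-style access-log line (machine-generated weblog event data).
log_line = '203.0.113.7 - - [12/Mar/2023:10:15:32 +0000] "GET /products/42 HTTP/1.1" 200 512'

# Regex for the common log format: host, timestamp, request, status code, response size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]+)" (?P<status>\d{3}) (?P<size>\d+)'
)

match = LOG_PATTERN.match(log_line)
event = match.groupdict()  # semi-structured text becomes a structured record

# Hypothetical person-generated record: a status update serialized as JSON.
status_update = json.loads('{"user": "alice", "text": "Loving this book!", "likes": 3}')

print(event["host"], event["status"])          # fields recovered from the raw log line
print(status_update["user"], status_update["likes"])
```

Once both kinds of records are in dictionary form, the same downstream analysis code can consume them regardless of their original format.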

Challenges of big data

There are certain key aspects that make big data very challenging to work with. In this section, we'll discuss some of them:

  • Heterogeneity: There is a great deal of diversity in the information consumed by human beings, and humans tolerate this diversity well. In fact, the nuance and richness of natural language provide valuable depth. However, machine-analysis algorithms expect consistent data and cannot understand nuance. As a consequence, data must be carefully structured as a first step to (or prior to) analysis. Computer systems work most efficiently when they can store many items that are all identical in size and structure, so the economical representation, access, and analysis of semi-structured data require further work.
  • Personal privacy: A lot of personal information is captured, stored, analyzed, and processed by internet service providers (ISPs), mobile network operators, supermarkets, local transportation, educational institutions, and medical and financial service organizations, including hospitals, banks, insurance companies, and credit card agencies. A great deal of information is also being stored on social networks such as Facebook, YouTube, and Google. This shows that privacy is an issue whose importance, particularly to the customer, is growing as the value of big data becomes more apparent. This personal data is used by mining algorithms to personalize news content, manage ads, and gain other e-commerce advantages, which can amount to a violation of personal privacy.
  • Scale: As the name suggests, big data is massive. As data grows in size, underlying issues accompany it in terms of storage, retrieval, processing, transformation, and analysis. As mentioned in the introduction, data volume is scaling much faster than compute resources, while CPU speeds have largely plateaued.
  • Timeliness: This is concerned with speed: the larger the size of the data to be processed, the longer it takes to analyze it. There are many scenarios wherein the results of the analysis are required in real time or immediately. This creates an extra challenge when building a system that can process big data in a timely manner.
  • Securing big data: Security is also a big concern for both enterprises and individuals. Big data stores can be attractive targets for hackers and advanced persistent threats. Security is therefore an essential attribute of a big data architecture, which must define how information is stored and accessed securely.
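Two of the challenges above, scale and privacy, can be sketched together in a few lines of Python. This is my own minimal illustration, not code from the book: events are aggregated one at a time so memory stays bounded regardless of stream size, and raw user IDs are replaced with salted hashes before they are stored. The salt value and field names are assumptions for the example:

```python
import hashlib
from collections import Counter

SALT = b"example-salt"  # assumption for illustration; in practice, manage the salt as a secret


def pseudonymize(user_id: str) -> str:
    """Replace a raw user ID with a truncated, salted SHA-256 digest (privacy)."""
    return hashlib.sha256(SALT + user_id.encode()).hexdigest()[:12]


def clicks_per_user(events):
    """Count clicks per user one event at a time.

    Memory use is proportional to the number of distinct users,
    not the number of events, so the stream can be arbitrarily large (scale).
    """
    counts = Counter()
    for event in events:  # events can be a generator reading a huge log file lazily
        counts[pseudonymize(event["user"])] += 1
    return counts


# Usage: a small in-memory sample stands in for a large click stream.
sample = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
counts = clicks_per_user(iter(sample))
```

The same streaming pattern extends naturally to windowed aggregation when timeliness demands results before the full dataset has arrived.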