Book Image

MySQL 8 for Big Data

By : Shabbir Challawala, Chintan Mehta, Kandarp Patel, Jaydip Lakhatariya
Book Image

MySQL 8 for Big Data

By: Shabbir Challawala, Chintan Mehta, Kandarp Patel, Jaydip Lakhatariya

Overview of this book

With organizations handling large amounts of data on a regular basis, MySQL has become a popular solution to handle this structured Big Data. In this book, you will see how DBAs can use MySQL 8 to handle billions of records, and load and retrieve data with performance comparable or superior to commercial DB solutions with higher costs. Many organizations today depend on MySQL for their websites and a Big Data solution for their data archiving, storage, and analysis needs. However, integrating them can be challenging. This book will show you how to implement a successful Big Data strategy with Apache Hadoop and MySQL 8. It will cover real-time use case scenario to explain integration and achieve Big Data solutions using technologies such as Apache Hadoop, Apache Sqoop, and MySQL Applier. Also, the book includes case studies on Apache Sqoop and real-time event processing. By the end of this book, you will know how to efficiently use MySQL 8 to manage data for your Big Data applications.
Table of Contents (17 chapters)
Title Page
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface

The life cycle of Big Data


Many organizations are considering Big Data as not only just a buzzword, but a smart system to improve business and get relevant marked information and insights. Big Data is a term that refers to managing huge amounts of complex unprocessed data from diverse sources like databases, social media, images, sensor-driven equipment, log files, human sentiments, and so on. This data can be in a structured, semi-structured, or unstructured form. Thus, to process this data, Big Data tools are used to analyze, which is a difficult and time-intensive process using traditional processing procedures.

The life cycle of Big Data can be segmented into Volume, Variety, Velocity, and Veracity--commonly known as the FOUR V's OF BIG DATA. Let's look at them quickly and then move on to the four phases of the Big Data life cycle, that is, collecting data, storing data, analyzing data, and governing data.

The following illustrates a few real-world scenarios, which gives us a much better understanding of the four Vs defining Big Data:

Volume

Volume refers to the vast amount of data generated and stored every second. The size of data in enterprises is not in terabytes--it does an accrual of zettabytes or brontobytes. New Big Data tools are now generally using distributed systems that might be sometimes diversified across the world.

The amount of data generated across the globe by year 2008 is expected to be generated in just a minute by year 2020.

Variety

Variety refers to several types and natures of data such as click streams, text, sensors, images, voice, video, log files, social media conversations, and more. This helps people who scrutinize it to effectively use it for insights.

70% of the data in the world is unstructured such as text, images, voice, and so on. However, earlier structured data was popular for being analyzed, as it fits in files, databases, or such traditional data storing procedures.

Velocity

Velocity refers to the speed of the data generated, ingested, and processed to meet the demands and challenges that exist in the pathway towards evolution and expansion.

New age communication channels such as social media, emails, and mobiles have added velocity to the data in Big Data. To scrutinize around 1TB of trading event information every day for fraud detection is a time sensitive process, where sometimes every minute matters to prevent fraud. Just think of social media conversations going viral in a matter of seconds; analysis helps us get trends on such platforms.

Veracity

Veracity refers to the inconsistency of data that can be found; it can affect the way data is being managed and handled effectively. Managing such data and making it valuable is where Big Data can help.

Quality and accuracy has been a major challenge when we talk about Big Data, as that's what it's all about. The amount of Twitter feeds is an appropriate use case where hashtags, typos, informal text, and abbreviations abound; however, we daily come across scenarios where Big Data does its work in the backend and lets us work with this type of data.

Phases of the Big Data life cycle

The effective use of Big Data with exponential growth in data types and data volumes has the potential to transform economies useful business and marketing information and customer surplus. Big Data has become a key success mantra for current competitive markets for existing companies, and a game changer for new companies in the competition. This all can be proven true if VALUE FROM DATA is leveraged. Let's look at the following figure:

As this figure explains, the Big Data life cycle can be divided into four stages. Let's study them in detail.

Collect

This section is key in a Big Data life cycle; it defines which type of data is captured at the source. Some examples are gathering logs from the server, fetching user profiles, crawling reviews of organizations for sentiment analysis, and order information. Examples that we have mentioned might involve dealing with local language, text, unstructured data, and images, which will be taken care of as we move forward in the Big Data life cycle.

With an increased level of automating data collection streams, organizations that have been classically spending a lot of effort on gathering structured data to analyze and estimate key success data points for business are changing. Mature organizations now use data that was generally ignored because of either its size or format, which, in Big Data terminology, is often referred to as unstructured data. These organizations always try to use the maximum amount of information whether it is structured or unstructured, as for them, data is value.

You can use data to be transferred and consolidated into Big Data platform like HDFS (Hadoop Distributed File System). Once data is processed with the help of tools like Apache Spark, you can load it back to the MySQL database, which can help you populate relevant data to show which MySQL consists.

With the amount of data volume and velocity increasing, Oracle now has a NoSQL interface for the InnoDB storage engine and MySQL cluster. A MySQL cluster additionally bypasses the SQL layer entirely. Without SQL parsing and optimization, Key-value data can be directly inserted nine times faster into MySQL tables.

Store

In this section, we will discuss storing data that has been collected from various sources. Let's consider an example of crawling reviews of organizations for sentiment analysis, wherein each gathers data from different sites with each of them having data uniquely displayed.

Traditionally, data was processed using the ETL (Extract, Transform, and Load) procedure, which used to gather data from various sources, modify it according to the requirements, and upload it to the store for further processing or display. Tools that were every so often used for such scenarios were spreadsheets, relational databases, business intelligence tools, and so on, and sometimes manual effort was also a part of it.

The most common storage used in Big Data platform is HDFS. HDFS also provides HQL (Hive Query Language), which helps us do many analytical tasks that are traditionally done in business intelligence tools. A few other storage options that can be considered are Apache Spark, Redis, and MongoDB. Each storage option has their own way of working in the backend; however, most storage providers exposes SQL APIs which can be used to do further data analysis.

There might be a case where we need to gather real-time data and showcase in real time, which practically doesn't need the data to be stored for future purposes and can run real-time analytics to produce results based on the requests.

Analyze

In this section, we will discuss how these various data types are being analyzed with a common question starting with what if...? The way organizations have evolved with data also has impacted new metadata standards, organizing it for initial detection and reprocessing for structural approaches to be matured on the value of data being created.

Most mature organizations reliably provide accessibility, superiority, and value across business units with a constant automated process of structuring metadata and outcomes to be processed for analysis. A mature data-driven organization's analyzing engine generally works on multiple sources of data and data types, which also includes real-time data.

During the analysis phase, raw data is processed, for which MySQL has Map/Reduce jobs in Hadoop, to analyze and give the output. With MySQL data lying in HDFS, it can be accessed by the rest of the ecosystem of Big Data platform-related tools for further analysis.

Governance

Value for data cannot be expected for a business without an established governance policy in practice. In the absence of a mature data governance policy, businesses can experience misinterpreted information, which could ultimately cause unpredictable damages to the business. With the help of Big Data governance, an organization can achieve consistent, precise, and actionable awareness of data.

Data governance is all about managing data to meet compliance, privacy, regulatory, legal, and anything that is specifically obligatory as per business requirements. For data governance, continuous monitoring, studying, revising, and optimizing the quality of the process should also respect data security needs. So far, data governance has been taken with ease where Big Data is concerned; however, with data growing rapidly and being used in various places, this has drawn attention to data governance. It is gradually becoming a must-considerable factor for any Big Data project.

As we have now got a good understanding of the life cycle of Big Data, let's take a closer look at MySQL basics, benefits, and some of the excellent features introduced.