Cloud Scale Analytics with Azure Data Services

By Patrik Borosch

Overview of this book

Azure Data Lake, the modern data warehouse architecture, and related data services on Azure enable organizations to build their own customized analytical platform to fit any analytical requirements in terms of volume, speed, and quality. This book is your guide to learning all the features and capabilities of Azure data services for storing, processing, and analyzing data (structured, semi-structured, and unstructured) of any size. You will explore key techniques for ingesting and storing data and for performing batch, streaming, and interactive analytics. The book also shows you how to overcome various challenges and complexities relating to productivity and scaling. Next, you will develop and run massive data workloads to perform different actions. Using a cloud-based big data and modern data warehouse analytics setup, you will also be able to build secure, scalable data estates for enterprises. Finally, you will not only learn how to develop a data warehouse but also understand how to build enterprise-grade security and auditing into big data programs. By the end of this Azure book, you will have learned how to develop a powerful and efficient analytical platform to meet enterprise needs.
Table of Contents (20 chapters)

Section 1: Data Warehousing and Considerations Regarding Cloud Computing
Section 2: The Storage Layer
Section 3: Cloud-Scale Data Integration and Data Transformation
Section 4: Data Presentation, Dashboarding, and Distribution

Exploring the benefits of AI and ML

Companies are starting to build AI projects and rely heavily on statistical functions and the math behind them to predict customer churn, recognize images, detect fraud, mine knowledge, and much more. ML projects often begin as part of a big data or Data Warehouse project, but it can also be the other way around; that is, the start of an AI or machine learning project often leads to the development of an analytical system.

As my data scientist colleagues tell me, if you want to reliably predict an event based on incoming data, a machine learning model needs to be trained on quite a large amount of data. The more data you can bring to train a model, the better and more accurate the model will be in the end.
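To make this tangible, here is a minimal sketch of my own (not tied to any Azure service) that uses scikit-learn's learning_curve on a toy dataset to show how validation accuracy typically grows as the training set grows:

# A minimal sketch: train the same model on growing fractions of a toy
# dataset and watch the cross-validated accuracy improve with more data.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# learning_curve retrains the estimator at each training-set size and
# returns the cross-validation scores for each size.
train_sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for size, scores in zip(train_sizes, val_scores):
    print(f"{size:5d} training samples -> mean CV accuracy {scores.mean():.3f}")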

A wonderful image recognition experiment, for example, can be found at www.customvision.ai. You can start by examining one of the example projects there. I like the "Chocolate or Dalmatian" example.

This is a nice experiment that did not need much input to enable the image recognizer to distinguish between Stracciatella chocolate ice cream and Dalmatian dogs. When you try to train the system on different images and circumstances, you might find that you need far more training images than six per group.
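If you'd rather drive the service from code than from the portal, the Custom Vision training SDK for Python supports roughly the following flow; the endpoint, key, project name, and image paths below are placeholders for illustration:

# Hedged sketch of training a two-tag classifier with the Azure Custom
# Vision Python SDK; endpoint, training key, and image paths are placeholders.
from azure.cognitiveservices.vision.customvision.training import (
    CustomVisionTrainingClient,
)
from azure.cognitiveservices.vision.customvision.training.models import (
    ImageFileCreateBatch,
    ImageFileCreateEntry,
)
from msrest.authentication import ApiKeyCredentials

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"  # placeholder
credentials = ApiKeyCredentials(in_headers={"Training-key": "<training-key>"})
trainer = CustomVisionTrainingClient(endpoint, credentials)

project = trainer.create_project("chocolate-or-dalmatian")
tags = {name: trainer.create_tag(project.id, name)
        for name in ("Chocolate", "Dalmatian")}

# Upload six labeled images per tag (file paths are illustrative).
entries = []
for name, tag in tags.items():
    for i in range(1, 7):
        with open(f"images/{name.lower()}_{i}.jpg", "rb") as f:
            entries.append(ImageFileCreateEntry(
                name=f"{name}_{i}", contents=f.read(), tag_ids=[tag.id]))
trainer.create_images_from_files(project.id, ImageFileCreateBatch(images=entries))

# Kick off training; poll the returned iteration until it reports "Completed".
iteration = trainer.train_project(project.id)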

Understanding ML challenges

I have experimented with the service and uploaded images of people in an emergency versus images of people relaxing, doing yoga, or similar. I used around 50–60 images for each group and still didn't reach a really satisfying accuracy (74%).

With this experiment, I even created a model with a bias that at first I didn't understand myself. Too many "emergency" cases were being interpreted incorrectly as "All good" cases. By discussing this with my data scientist colleagues and examining the training set, we found out why.

There were too many pictures in the "All good" training set that showed people on grass or in nature, with lots of green around them. This, in turn, led the system to interpret green as a signifier of "All good," no matter how big the emergency obviously was. Someone with their leg at a strange angle and a broken arm, in a meadow? The model would interpret it as "All good."
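The check that eventually surfaced this kind of skew is a per-class look at the predictions. Here is a small sketch, with toy labels standing in for my actual test images, of how a confusion matrix makes such a bias visible:

# Toy sketch: a confusion matrix over a held-out test set reveals which
# class absorbs the errors. The arrays below mimic the skew I observed:
# true "emergency" cases frequently predicted as "all_good".
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["emergency"] * 10 + ["all_good"] * 10
y_pred = (["emergency"] * 4 + ["all_good"] * 6    # 6 of 10 emergencies missed
          + ["all_good"] * 9 + ["emergency"] * 1)  # "all_good" mostly correct

print(confusion_matrix(y_true, y_pred, labels=["emergency", "all_good"]))
print(classification_report(y_true, y_pred))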

In this sandbox environment, I did no harm at all. But imagine a case where a system is used to help detect emergency situations and, hopefully, raise an alarm those vital seconds earlier. This is only a very basic example of how the right tool in the wrong hands might cause a lot of damage.

There are so many different cases where machine learning can help to increase the accuracy of processes, increase their speed, or save money through the right predictions – such as predicting when and why a machine will fail before it actually fails, mining information from massive datasets, and more.

Sorting ML into the Modern Data Warehouse

How does this relate to the Modern Data Warehouse? As we stated previously, the Modern Data Warehouse not only offers scalable, fast, and secure storage components. It also offers at least one (and, in the context of this book, six) compute component(s) that can interact with the storage services and can be used to create and run machine learning models at scale. The "run" can be implemented in batch mode, in near real time, or even in real time, depending on the streaming engine used.

The Modern Data Warehouse can then store the results of ML calculations in a suitable presentation layer to provide this data to the downstream consumers, who will process the data further, visualize it, draw their insights from it, and take action. The system can close the loop by using the enterprise service bus and the available integration services to feed insights back as parameters to the surrounding systems.
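As a hedged sketch of the batch variant of that "run", the following assumes a Spark-based compute component, a model published to an MLflow model registry, and placeholder lake paths; it scores curated data and lands the results in a presentation layer for downstream consumers:

# Sketch of batch scoring in the lake; the model name, registry stage,
# and abfss:// paths are placeholders, not a prescribed layout.
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read curated features from the storage layer.
features = spark.read.parquet(
    "abfss://curated@<account>.dfs.core.windows.net/churn/features/")

# Wrap the registered model as a Spark UDF so scoring scales with the cluster.
predict = mlflow.pyfunc.spark_udf(
    spark, model_uri="models:/churn-model/Production")

scored = features.withColumn("churn_score", predict(*features.columns))

# Land the results in the presentation layer for downstream consumers.
scored.write.mode("overwrite").parquet(
    "abfss://presentation@<account>.dfs.core.windows.net/churn/scores/")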

Understanding responsible ML/AI

A responsible data scientist will need tools that support them in conducting this work properly. The buzzword of the moment in this area is machine learning operations (MLOps). One of the most important steps in creating responsible AI is having complete traceability/auditability of the source data and of the versions of the datasets used to train and retrain a certain model at a certain timestamp. With this information, the results of the model can also be audited and interpreted. This information is vital when it comes to audits and traceability regarding legal questions, for instance. The collaborative aspects of an MLOps-driven environment are another important factor.
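As an illustration of the dataset and model lineage described here, a sketch along the following lines is possible with the azureml-core SDK; the workspace config, file path, and names are placeholders of my choosing:

# Sketch of MLOps-style lineage: register the exact dataset version used
# for training and attach it to the registered model for later audits.
from azureml.core import Dataset, Model, Workspace

ws = Workspace.from_config()  # reads config.json with workspace details

# Register (or re-register) the training data; each call with
# create_new_version=True produces a new, auditable dataset version.
ds = Dataset.Tabular.from_delimited_files(
    path=(ws.get_default_datastore(), "training/churn_2021-06.csv"))
ds = ds.register(ws, name="churn-training", create_new_version=True)

# Register the trained model and record which dataset version produced it.
model = Model.register(
    workspace=ws,
    model_path="outputs/model.pkl",          # placeholder artifact path
    model_name="churn-model",
    datasets=[("training data", ds)],        # lineage for audits
    tags={"dataset_version": str(ds.version)},
)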

Note

We can find a definition of Responsible AI, following the principles of Fairness, Reliability and Safety, Privacy and Security, Inclusiveness, and Transparency and Accountability, at https://www.microsoft.com/en-us/ai/responsible-ai.

An MLOps environment embedded in the Modern Data Warehouse will be another piece of the bigger puzzle and will help integrate those principles into the analytical estate of any company. With the tight interconnectivity between data services, storage, compute components, streaming services, IoT technology, ML and AI, and visualization tools, the world of analytics today offers a wide range of possibilities at a far lower cost than ever before. The ease of use of these services, and the productivity they enable, is constantly growing.