Azure Data Scientist Associate Certification Guide

By: Andreas Botsikas, Michael Hlobil

Overview of this book

The Azure Data Scientist Associate Certification Guide helps you acquire practical knowledge for machine learning experimentation on Azure. It covers everything you need to pass the DP-100 exam and become a certified Azure Data Scientist Associate. Starting with an introduction to data science, you'll learn the terminology that will be used throughout the book and then move on to the Azure Machine Learning (Azure ML) workspace. You'll discover the studio interface and manage various components, such as data stores and compute clusters. Next, the book focuses on no-code and low-code experimentation, and shows you how to use the Automated ML wizard to locate and deploy optimal models for your dataset. You'll also learn how to run end-to-end data science experiments using the designer provided in Azure ML Studio. You'll then explore the Azure ML Software Development Kit (SDK) for Python and advance to creating experiments and publishing models using code. The book also guides you in optimizing your model's hyperparameters using Hyperdrive before demonstrating how to use responsible AI tools to interpret and debug your models. Once you have a trained model, you'll learn to operationalize it for batch or real-time inferences and monitor it in production. By the end of this Azure certification study guide, you'll have gained the knowledge and the practical skills required to pass the DP-100 exam.
Table of Contents (17 chapters)

Section 1: Starting your cloud-based data science journey
Section 2: No code data science experimentation
Section 3: Advanced data science tooling and capabilities

Using Spark in data science

At the beginning of the twenty-first century, the big data problem became a reality. Data stored in data centers was growing in both volume and velocity. In 2021, we refer to datasets as big data when they reach at least a couple of terabytes in size, and it is not uncommon to see petabytes of data in large organizations. These datasets grow at a rapid rate, anywhere from a couple of gigabytes per day to a couple of gigabytes per minute, for example, when you store user interactions with an online store's website to perform clickstream analysis.

In 2009, a research project started at the University of California, Berkeley, aiming to provide the parallel computing tools needed to handle big data. In 2014, the first version of Apache Spark was released from this research project. Members of that research team founded Databricks, one of the most significant contributors to the open source Apache Spark project.

Apache Spark provides an easy-to-use, scalable solution for processing data in parallel across a distributed cluster. The main idea behind the Spark architecture is that a driver node is responsible for executing your code. The code is split into smaller jobs that operate on smaller portions of the data, and those jobs are scheduled to run on the worker nodes, as seen in Figure 1.6:

Figure 1.6 – Parallel processing of big data in a Spark cluster

For example, suppose you wanted to calculate how many products your company sold during the last year. In that case, Spark could spin up 12 jobs that produce the monthly aggregates, and the results would then be processed by another job that sums up the totals for all 12 months. If you were instead tempted to load the entire dataset into memory and compute those aggregates directly, consider how much memory you would need on that single computer. Assume that the sales data for a single month is stored in a CSV file that is 1 GB; this file will require approximately 10 GB of memory to load. Compressed Parquet files require even more memory relative to their size on disk: a similar 1 GB Parquet file may end up needing 40 GB of memory to load as a pandas.DataFrame object. As you can see, loading all 12 files into memory simultaneously is not feasible on a typical machine. You need to parallelize the processing, something Spark can do for you automatically.
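
The following sketch shows how such an aggregation could look in PySpark; the file pattern (sales_2021_*.csv) and the month and quantity columns are hypothetical and not part of the book's dataset:

# A minimal PySpark sketch of the monthly aggregation described above.
# The file pattern and column names (month, quantity) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("yearly-sales").getOrCreate()

# Spark reads the 12 monthly files as a single distributed DataFrame;
# each worker node processes only the partitions assigned to it.
sales = spark.read.csv("sales_2021_*.csv", header=True, inferSchema=True)

# Workers compute partial aggregates per month, and Spark combines them
# into the final total, mirroring the 12-jobs-plus-one description above.
monthly = sales.groupBy("month").agg(F.sum("quantity").alias("units_sold"))
total_units = monthly.agg(F.sum("units_sold")).collect()[0][0]
print(total_units)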

Important note

Parquet files are stored in a columnar format, which allows you to load only the columns you need. In the 1 GB Parquet example, if you load only half of the columns from the dataset, you will probably need only 20 GB of memory. This is one of the reasons why the Parquet format is widely used in analytical workloads.
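
As a quick illustration of this columnar benefit, the following sketch loads only two columns of a Parquet file with pandas; the file and column names are made up for the example:

import pandas as pd

# Only the requested columns are read and deserialized, so memory usage
# drops roughly in proportion to the columns you skip.
df = pd.read_parquet("sales_2021_01.parquet", columns=["product_id", "quantity"])
df.info(memory_usage="deep")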

Spark is written in the Scala programming language and offers APIs for Scala, Python, Java, R, and even C#. Still, the data science community tends to work either in Scala, to achieve maximum computational performance and take advantage of the Java library ecosystem, or in Python, which is widely adopted by the modern data science community. When you write Python code that utilizes the Spark engine, you use the PySpark tool to perform operations on top of Resilient Distributed Datasets (RDDs) or the Spark DataFrame objects that were introduced later in the Spark framework. To benefit from the distributed nature of Spark, you need to be handling big datasets. This means that Spark may be overkill if you are dealing with only hundreds of thousands of records, or even a couple of million records.
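
To make those two programming surfaces more concrete, here is a minimal, illustrative PySpark snippet that computes the same sum once through the low-level RDD API and once through the DataFrame API; the sample records are invented for the example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
records = [("jacket", 2), ("boots", 1), ("jacket", 3)]

# RDD API: you chain transformations and actions explicitly.
rdd_total = (spark.sparkContext.parallelize(records)
             .map(lambda row: row[1])
             .reduce(lambda a, b: a + b))

# DataFrame API: you declare the result you want and Spark plans the work.
df = spark.createDataFrame(records, ["product", "quantity"])
df_total = df.groupBy().sum("quantity").collect()[0][0]

print(rdd_total, df_total)  # both print 6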

Spark offers two machine learning libraries: the older MLlib and the newer Spark ML. Spark ML works on the Spark DataFrame structure, a distributed collection of data that offers functionality similar to the DataFrame objects used in Python's pandas or in R. Moreover, the Koalas project provides an implementation that allows data scientists with existing knowledge of pandas.DataFrame manipulations to reuse their coding skills on top of Spark.
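
As a rough sketch of that pandas-like experience, the snippet below uses the pandas API on Spark (the successor to the Koalas package, available as pyspark.pandas from Spark 3.2 onward); the file and column names are hypothetical:

import pyspark.pandas as ps

# Familiar pandas-style syntax, but the computation is distributed
# across the Spark cluster instead of running on a single machine.
sales = ps.read_csv("sales_2021_01.csv")
per_product = sales.groupby("product_id")["quantity"].sum()
print(per_product.head())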

AzureML allows you to execute PySpark jobs, either on its native compute clusters or by attaching to Azure Databricks or Synapse Spark pools. Although you will not write any PySpark code in this book, in Chapter 12, Operationalizing Models with Code, you will learn how to achieve similar parallelization benefits without the need for Spark or a driver node.

Whether you are coding in regular Python, PySpark, R, or Scala, you are producing code artifacts that are probably part of a larger system. In the next section, you will explore the DevOps mindset, which emphasizes communication and collaboration between software engineers, data scientists, and system administrators to achieve faster releases of valuable product features.