Adopting the DevOps mindset

DevOps is a team mindset that tries to minimize the silos between developers and system operators in order to shorten the development life cycle of a product. Developers constantly change a product to introduce new features and modify existing behaviors. On the other hand, system operators need to keep the production systems stable, up, and running. In the past, these two groups were isolated: developers would throw a new piece of software over to the operations team, who would then try to deploy it in production. As you can imagine, things didn't always work that well, causing friction between the two groups. A fundamental DevOps practice is that a team needs to be autonomous and should contain all the required disciplines, both developers and operators.

When it comes to data science, some people refer to the practice as MLOps, but the fundamental ideas remain the same. A team should be self-sufficient, capable of developing all the required components of the overall solution, from the data engineering parts that bring in the data and the training of the models, all the way to operationalizing the model in production. These teams usually work in an agile manner, embracing an iterative approach and seeking constant improvement based on feedback, as seen in Figure 1.7:

Figure 1.7 – The feedback flow in an agile MLOps team

The MLOps team operates on its backlog and performs the iterative steps you saw in the Working on a data science project section. Once the model is ready, the system administrators, who are part of the team, are aware of what needs to be done to take the model into production. The model is monitored closely, and if a defect or performance degradation is observed, a backlog item is created for the MLOps team to address in their next sprint.

To shorten the development and deployment life cycle of new features in production, automation needs to be embraced. The goal of a DevOps team is to minimize the number of manual interventions in the deployment process and automate as many repeatable tasks as possible.

Figure 1.8 shows the most frequently used components while developing real-time models using the MLOps mindset:

Figure 1.8 – Components usually seen in MLOps-driven data science projects

Let's analyze those components:

  • ARM templates allow you to automate the deployment of Azure resources. This enables the team to spin development, testing, or even production environments up and down in no time. These artifacts are stored within Azure DevOps in a Git version-control repository, and the deployment of multiple environments is automated using Azure DevOps pipelines (a hedged Python sketch of such a deployment follows this list). You are going to read about ARM templates in Chapter 2, Deploying Azure Machine Learning Workspace Resources.
  • Using Azure Data Factory, the data science team orchestrates the pulling and cleansing of the data from the source systems. The data is copied into a data lake, which is accessible from the AzureML workspace. Azure Data Factory uses ARM templates to define its orchestration pipelines; these templates are stored in the Git repository to track changes and enable deployment to multiple environments.
  • Within the AzureML workspace, data scientists work on their code. Initially, they start working in Jupyter notebooks. Notebooks are a great way to prototype ideas, as you will see in Chapter 7, The AzureML Python SDK. As the project progresses, the code is exported from the notebooks and organized into scripts. All those code artifacts are version-controlled in Git, using the terminal and commands such as the ones seen in Figure 1.9:
Figure 1.9 – Versioning a notebook and a script file using Git within AzureML

  • When a trained model performs better than the model currently in production, it is registered within AzureML and an event is emitted (see the registration sketch after this list). This event is captured by the AzureML DevOps plugin, which triggers the automatic deployment of the model to the test environment. The model is tested within that environment, and if all tests pass and no errors have been logged in Application Insights, which monitors the deployment, the artifacts can be automatically promoted to the next environment, all the way to production.
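To make the ARM template bullet more concrete, the following is a minimal sketch of deploying a template from Python, assuming the azure-mgmt-resource package (version 15 or later, which exposes the begin_create_or_update poller). The subscription ID, resource group name, and template file name are placeholders for illustration; in practice, the same deployment would typically be driven by an Azure DevOps pipeline rather than an ad hoc script.

import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Authenticate and target the subscription that hosts the environment.
credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, subscription_id="<subscription-id>")

# Load the ARM template that is version-controlled in the Git repository
# (hypothetical file name for this example).
with open("templates/azureml-workspace.json") as template_file:
    template = json.load(template_file)

# Start an incremental deployment into an existing, hypothetical resource group
# and wait for the long-running operation to complete.
poller = client.deployments.begin_create_or_update(
    resource_group_name="mlops-dev-rg",
    deployment_name="azureml-workspace-deployment",
    parameters={"properties": {"mode": "Incremental", "template": template}},
)
deployment = poller.result()
print(deployment.properties.provisioning_state)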
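The registration step from the last bullet can be as simple as the following sketch, which uses the azureml.core SDK covered later in the book. The model name, path, and tag values are hypothetical; the registration is what produces the new model version and the event that the release pipeline reacts to.

from azureml.core import Model, Workspace

# Connect to the AzureML workspace using the local config.json file.
ws = Workspace.from_config()

# Register the newly trained model so that a new version is tracked
# and the model-registered event is emitted.
model = Model.register(
    workspace=ws,
    model_name="churn-classifier",         # hypothetical model name
    model_path="outputs/model.pkl",        # hypothetical path to the serialized model
    tags={"validation_accuracy": "0.91"},  # hypothetical metric kept for traceability
    description="Candidate model produced by the automated training pipeline",
)
print(f"Registered {model.name}, version {model.version}")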

The ability to ensure both code and model quality plays a crucial role in this automation process. In Python, you can use various tools, such as Flake8, Bandit, and Black, to ensure code quality, check for common security issues, and consistently format your code base. You can also use the pytest framework to write your functional testing, where you will be testing the model results against a golden dataset. With pytest, you can even perform integration testing to verify that the end-to-end system is working as expected.
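As an illustration of the golden dataset idea, here is a minimal pytest sketch. The file names, the joblib-serialized model, and the 0.85 accuracy threshold are assumptions made for the example; the point is that the test fails the pipeline whenever a candidate model drops below the agreed quality bar.

# test_model_quality.py
import joblib
import pandas as pd
import pytest
from sklearn.metrics import accuracy_score


@pytest.fixture(scope="module")
def golden_dataset():
    # Load the curated dataset with known, trusted labels (hypothetical path).
    df = pd.read_csv("tests/golden_dataset.csv")
    return df.drop(columns=["label"]), df["label"]


def test_model_meets_accuracy_threshold(golden_dataset):
    features, labels = golden_dataset
    # Load the candidate model produced by the training step (hypothetical path).
    model = joblib.load("outputs/model.pkl")
    predictions = model.predict(features)
    # Fail the build if the candidate model falls below the quality bar.
    assert accuracy_score(labels, predictions) >= 0.85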

Adopting DevOps is a never-ending journey. The team becomes better every time the process is repeated. The trick is to build trust in the end-to-end development and deployment process so that everyone is confident making changes and deploying them to production. When the process fails, understand why it failed and learn from the mistake. Create the mechanisms that will prevent future failures and move on.