The Definitive Guide to Google Vertex AI

By: Jasmeet Bhatia, Kartik Chaudhary
Overview of this book

While AI has become an integral part of every organization today, the development of large-scale ML solutions and the management of complex ML workflows in production continue to pose challenges for many. Google’s unified data and AI platform, Vertex AI, directly addresses these challenges with its array of MLOps tools designed for overall workflow management. This book is a comprehensive guide that lets you explore Google Vertex AI’s easy-to-advanced level features for end-to-end ML solution development. Throughout this book, you’ll discover how Vertex AI empowers you by providing essential tools for critical tasks, including data management, model building, large-scale experimentation, metadata logging, model deployment, and monitoring. You’ll learn how to harness the full potential of Vertex AI for developing and deploying no-code, low-code, or fully customized ML solutions. This book takes a hands-on approach to developing and deploying real-world ML solutions on Google Cloud, leveraging key technologies such as Vision, NLP, generative AI, and recommendation systems. Additionally, this book covers pre-built and turnkey solution offerings as well as guidance on seamlessly integrating them into your ML workflows. By the end of this book, you’ll have the confidence to develop and deploy large-scale production-grade ML solutions using the MLOps tooling and best practices from Google.
Table of Contents (24 chapters)

Part 1: The Importance of MLOps in a Real-World ML Deployment
Part 2: Machine Learning Tools for Custom Models on Google Cloud
Part 3: Prebuilt/Turnkey ML Solutions Available in GCP
Part 4: Building Real-World ML Solutions with Google Cloud

Common challenges in developing real-world ML solutions

A real-world ML project is almost always filled with unexpected challenges that surface at different stages. The main reason for this is that real-world data, and the ML algorithms themselves, are not perfect. Though these challenges hamper the performance of the overall ML setup, they don’t prevent us from creating a valuable ML application. In a new ML project, it is difficult to know the challenges up front; they are often discovered during different stages of the project. Some of these challenges are not obvious and require skilled or experienced ML practitioners (or data scientists) to identify them and apply countermeasures to reduce their effect.

In this section, we will understand some of the common challenges encountered during the development of a typical ML solution. The following list shows some common challenges we will discuss in more detail:

  • Data collection and security
  • Non-representative training data
  • Poor quality of data
  • Underfitting the training dataset
  • Overfitting the training dataset
  • Infrastructure requirements

Now, let’s learn about each of these common challenges in detail.

Data collection and security

One of the most common challenges that organizations face is data availability. ML algorithms require a large amount of good-quality data in order to provide quality results, so the availability of raw data is critical for a business that wants to implement ML. Even when raw data is available, gathering it is not the only concern; we often need to transform or process the data into a format that our ML algorithm can work with.

Data security is another important challenge that ML developers face very frequently. When we get data from a company, it is essential to differentiate between sensitive and non-sensitive information in order to implement ML correctly and efficiently. The sensitive parts of the data need to be stored on fully secured servers (storage systems) and should always be kept encrypted. For security purposes, access to sensitive data should be restricted, with only the less-sensitive data made available to the trusted team members working on the project. If the data contains Personally Identifiable Information (PII), it can still be used, provided it is anonymized properly.
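
As a minimal illustration of that last point, the following sketch hashes a hypothetical email column with a secret salt so that records remain joinable for analysis while the raw PII value is no longer exposed. The column names, data, and salt handling are illustrative assumptions, not a prescription from the book:

```python
import hashlib

import pandas as pd

# Hypothetical records containing a PII column (email) - illustrative data only.
df = pd.DataFrame({
    "email": ["alice@example.com", "bob@example.com"],
    "purchase_amount": [120.5, 89.0],
})

# In practice, the salt would come from a secret manager, not source code.
SALT = "replace-with-a-secret-salt"

def anonymize(value: str) -> str:
    """Return a salted SHA-256 hash so rows stay joinable without exposing raw PII."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

df["email"] = df["email"].map(anonymize)
print(df.head())
```

Note that salted hashing is only one building block (strictly speaking, it is pseudonymization); depending on regulatory requirements, stronger measures such as dropping or generalizing fields may be needed.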

Non-representative training data

A good ML model is one that performs equally well on unseen data and training data. This is only possible when the training data is representative of most of the business scenarios the model will encounter. Sometimes, when the dataset is small, it may not be a true representative of the underlying distribution, and the resulting model may provide inaccurate predictions on unseen data despite producing high-quality results on the training dataset. This kind of non-representative data is either the result of sampling bias or the unavailability of data. Thus, an ML model trained on such a non-representative dataset may have little value when it is deployed in production.

If it is impossible to get a truly representative training dataset for a business problem, then it’s better to limit the scope of the problem to only the scenarios for which we have a sufficient number of training samples. That way, the unseen data presented to the model will mostly contain scenarios it has already learned from, and the model should provide quality predictions. Sometimes, the data related to a business problem keeps changing with time, and it may not be possible to develop a single static model that works well; in such cases, continuous retraining of the model on the latest data becomes essential.
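
One simple guard against sampling bias, when labeled data is available, is to stratify the train/test split on the label so that class proportions in each split mirror the full dataset. The scikit-learn sketch below is a minimal illustration using synthetic, hypothetical data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with an imbalanced binary label (~10% positives).
data = pd.DataFrame({
    "feature_1": range(1000),
    "label": [1 if i % 10 == 0 else 0 for i in range(1000)],
})

# stratify=... keeps the ~10% positive rate in both splits, reducing one
# common source of non-representative training data.
train_df, test_df = train_test_split(
    data, test_size=0.2, stratify=data["label"], random_state=42
)
print(train_df["label"].mean(), test_df["label"].mean())
```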

Poor quality of data

The performance of ML algorithms is very sensitive to the quality of the training samples. A small number of outliers, missing values, or abnormal scenarios can affect the quality of the model significantly, so it is important to treat such cases carefully while analyzing the data before training any ML algorithm. There are multiple methods for identifying and treating outliers; the best one depends upon the nature of the problem and the data itself. Similarly, there are multiple ways of treating missing values; filling them with the mean, median, or mode of a column is a frequently used approach. If the training dataset is sufficiently large, dropping a small number of rows with missing values is also a good option.
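
As a concrete illustration of those two ideas, the pandas sketch below fills missing values with the median and clips extreme values using the common 1.5 × IQR rule of thumb. The column name and data are hypothetical, and the right strategy for a real project depends on the data, as noted above:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric feature with a missing value and an extreme outlier.
df = pd.DataFrame({"income": [42_000, 51_000, np.nan, 48_000, 1_000_000]})

# Fill missing values with the median (mean or mode are common alternatives).
df["income"] = df["income"].fillna(df["income"].median())

# Clip values outside 1.5 * IQR, a widely used rule of thumb for outliers.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income"] = df["income"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print(df)
```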

As discussed, the quality of the training dataset is important if we want our ML system to learn accurately and provide quality results on the unseen dataset. It means that the data pre-processing part of the ML life cycle should be taken very seriously.

Underfitting the training dataset

Underfitting means that the model is too simple to learn the inherent information or structure of the training dataset. It may occur when we try to fit a non-linear distribution using a linear ML algorithm such as linear regression, or when we train the model on only a minimal set of features that carry little information about the target distribution. An underfitted model learns too little from the training data and, thus, makes mistakes on unseen or test datasets.

There are multiple ways to tackle the problem of underfitting. Here is a list of some common methods:

  • Feature engineering – add more features that represent the target distribution
  • Non-linear algorithms – switch to a non-linear algorithm if the target distribution is not linear (see the sketch after this list)
  • Noise removal – remove noise from the data
  • Add more power to the model – increase the number of trainable parameters, or the depth or number of trees in tree-based ensembles
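
To make the second remedy concrete, the following sketch fits both a linear model and a small random forest to a synthetic quadratic relationship; the large gap in training error shows how the linear model underfits the non-linear structure. The data and model choices are illustrative assumptions, not recommendations from the book:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic non-linear (quadratic) relationship with a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

# A linear model underfits the quadratic target ...
linear = LinearRegression().fit(X, y)
# ... while a non-linear model captures the structure far better.
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

print("linear MSE:", mean_squared_error(y, linear.predict(X)))
print("forest MSE:", mean_squared_error(y, forest.predict(X)))
```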

Just like underfitting the model on training data, overfitting is also a big issue. Let’s dive deeper into it.

Overfitting the training dataset

The overfitting problem is the opposite of the underfitting problem. Overfitting is the scenario when the ML model learns too much unnecessary information from the training data and fails to generalize on a test or unseen dataset. In this case, the model performs extremely well on the training dataset, but the metric value (such as accuracy) is very low on the test set. Overfitting usually occurs when we implement a very complex algorithm on simple datasets.

Some common methods to address the problem of overfitting are as follows:

  • Increase the training data size – ML models often overfit on small datasets
  • Use simpler models – when the problem is simple or linear in nature, choose a simple ML algorithm
  • Regularization – multiple regularization methods can prevent complex models from overfitting the training dataset (see the sketch after this list)
  • Reduce model complexity – use a smaller number of trainable parameters, train for fewer epochs, and reduce the depth of tree-based models
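
As one illustration of the regularization point, the sketch below compares an unregularized high-degree polynomial regression with the same features passed through Ridge (L2) regularization on a small synthetic dataset; the regularized model typically generalizes better to the held-out split. The data, polynomial degree, and alpha value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small noisy dataset: high-degree polynomials overfit it easily.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Unregularized degree-15 polynomial vs. the same features with Ridge (L2) regularization.
plain = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X_train, y_train)
ridge = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0)).fit(X_train, y_train)

print("plain test MSE:", mean_squared_error(y_test, plain.predict(X_test)))
print("ridge test MSE:", mean_squared_error(y_test, ridge.predict(X_test)))
```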

Overfitting and underfitting are common challenges and should be addressed carefully, as discussed earlier. Now, let’s discuss some infrastructure-related challenges.

Infrastructure requirements

ML is expensive. A typical ML project often involves crunching large datasets with millions or billions of samples. Slicing and dicing such datasets requires a lot of memory and high-end multi-core processors. Additionally, once the development of the project is complete, dedicated servers are required to deploy the models and serve consumers at scale. Thus, business organizations that want to practice ML need dedicated infrastructure to implement and consume ML efficiently. This requirement grows further when working with large deep learning models such as transformers and large language models (LLMs). Such models usually require a set of accelerators, such as graphics processing units (GPUs) or tensor processing units (TPUs), for training, fine-tuning, and deployment.

As we have discussed, infrastructure is critical for practicing ML. Companies that lack such infrastructure can consult with other firms or adopt cloud-based offerings to start developing ML-based applications.

Now that we understand the common challenges faced during the development of an ML project, we should be able to make more informed decisions about them. Next, let’s learn about some of the limitations of ML.