Pretrain Vision and Large Language Models in Python

By: Emily Webber

Overview of this book

Foundation models have forever changed machine learning. From BERT to ChatGPT, CLIP to Stable Diffusion, when billions of parameters are combined with large datasets and hundreds to thousands of GPUs, the result is nothing short of record-breaking. The recommendations, advice, and code samples in this book will help you pretrain and fine-tune your own foundation models from scratch on AWS and Amazon SageMaker, while applying them to hundreds of use cases across your organization. With advice from seasoned AWS and machine learning expert Emily Webber, this book helps you learn everything you need to go from project ideation to dataset preparation, training, evaluation, and deployment for large language, vision, and multimodal models. With step-by-step explanations of essential concepts and practical examples, you’ll go from mastering the concept of pretraining to preparing your dataset and model, configuring your environment, training, fine-tuning, evaluating, deploying, and optimizing your foundation models. You will learn how to apply the scaling laws to distributing your model and dataset over multiple GPUs, remove bias, achieve high throughput, and build deployment pipelines. By the end of this book, you’ll be well equipped to embark on your own project to pretrain and fine-tune the foundation models of the future.
Table of Contents (23 chapters)

Part 1: Before Pretraining
Part 2: Configure Your Environment
Part 3: Train Your Model
Part 4: Evaluate Your Model
Part 5: Deploy Your Model

State-of-the-art vision and language models

If you’re new to machine learning, there is a key concept you will eventually want to master: the state of the art. There are many different types of machine learning tasks, such as object detection, semantic segmentation, pose detection, text classification, and question answering. For each of these tasks, there are many research datasets. Each dataset provides labels, frequently split into train, test, and validation sets. These datasets tend to be hosted by academic institutions, and each is purpose-built for training machine learning models that solve that type of problem.

When releasing a new dataset, researchers will frequently also release a model that has been trained on the train set, tuned on the validation set, and separately evaluated on the test set. Their evaluation score on that test set establishes the state of the art for this specific type of modeling problem. In subsequent papers, researchers will frequently try to improve on this performance – for example, by increasing accuracy by a few percentage points on a handful of datasets.

The reason state-of-the-art performance matters for you is that it is a strong indication of how well your model is likely to perform in the best possible scenario. It isn’t easy to replicate most research results, and frequently, labs will have developed special techniques to improve performance that may not be easily observed and replicated by others. This is especially true when datasets and code repositories aren’t shared publicly, as is the case with GPT-3. This is acutely true when training methods aren’t disclosed, as with GPT-4.

However, given sufficient resources, it is possible to achieve performance similar to that reported in top papers. An excellent place to find state-of-the-art performance at any given point in time is Papers With Code, a website maintained by Meta and enhanced by the community. By using this free tool, you can easily find top papers, datasets, models, and GitHub repositories with example code. Additionally, it offers great historical views, so you can see how the top models on different datasets have evolved over time.

In later chapters on preparing datasets and picking models, we’ll go into more detail on how to find the right examples for you, including how to determine how similar they are to your own goals. Later in the book, we’ll also help you determine the optimal models and sizes for them. Right now, let’s look at some models that, as of this writing, are sitting at the top of their respective leaderboards.

Top vision models as of April 2023

First, let’s take a quick look at the models performing the best today within image tasks such as classification and generation.

| Dataset | Best model | From Transformer | Performance |
| --- | --- | --- | --- |
| ImageNet | Basic-L (Lion fine-tuned) | Yes | 91.10% top-1 accuracy |
| CIFAR-10 | ViT-H/14 (1) | Yes | 99.5% correct |
| COCO | InternImage-H (M3I Pre-training: https://paperswithcode.com/paper/internimage-exploring-large-scale-vision) | No | 65.0 Box AP |
| STL-10 | Diffusion ProjectedGAN | No | 6.91 FID (generation) |
| ObjectNet | CoCa | Yes | 82.7% top-1 accuracy |
| MNIST | Heterogeneous ensemble with simple CNN (1) | No | 99.91% accuracy (0.09% error) |

Table 1.1 – Top image results

At first glance, these numbers may seem intimidating. After all, many of them are at or close to 99% accuracy! Isn’t that too high a bar for beginning or intermediate machine learning practitioners?

Before we get too carried away with doubt and fear, it’s helpful to understand that most of these accuracy scores came at least five years after the research dataset was published. If we analyze the historical graphs available on Papers With Code, it’s easy to see that when researchers first published their datasets, initial accuracy scores were closer to 60%. It then took many years of hard work, across diverse organizations and teams, to finally produce models capable of hitting the 90s. So, don’t lose heart! If you put in the time, you too can train a model that establishes a new state of the art in a given area. This part is science, not magic.

You’ll notice that while some of these models adopt a Transformer-inspired backbone, some do not. Upon closer inspection, you’ll also see that some of these models rely on the pretrain-and-fine-tune paradigm we’ll be learning about in this book, but not all of them. If you’re new to machine learning, this variety is something to start getting comfortable with! Robust and diverse scientific debate, perspectives, insights, and observations are critical to maintaining healthy communities and increasing the quality of outcomes across the field as a whole. This means that you can, and should, expect some divergence in the methods you come across, and that’s a good thing.

Now that you have a better understanding of top models in computer vision these days, let’s explore one of the earliest methods combining techniques from large language models with vision: contrastive pretraining and natural language supervision.

Contrastive pretraining and natural language supervision

What’s interesting about both modern and classic image datasets, from Fei-Fei Li’s 2006 ImageNet to the LAION-5B dataset used in Stable Diffusion in 2022, is that the labels themselves are composed of natural language. Said another way, because the scope of the images includes objects from the physical world, the labels are necessarily more nuanced than single digits. Broadly speaking, this type of problem framing is called natural language supervision.

Imagine having a large dataset of tens of millions of images, each provided with a caption. Beyond simply naming the objects, a caption gives you more information about the content of the image. A caption can be anything from Stella sits on a yellow couch to Pepper, the Australian pup. In just a few words, we immediately get more context than a simple list of objects would provide. Now, imagine using a pretrained model, such as an encoder, to process the language into a dense vector representation. Then, combine this with another pretrained model, this time an image encoder, to process the image into another dense vector representation. Combine both of these in a learnable matrix, and you are on your way to contrastive pretraining! Presented by Alec Radford and team, just a few years after their work on GPT, this method gives us both a way to jointly learn the relationship between images and language and a model well suited to do so. The model is called Contrastive Language-Image Pretraining (CLIP).
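
To make the mechanics concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss written in PyTorch. The embedding dimension, the temperature value, and the random tensors standing in for real encoder outputs are illustrative assumptions; the actual CLIP training setup is far larger and differs in many details.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text pairs.

    image_features, text_features: (batch, dim) embeddings from two encoders.
    Matching pairs share the same row index; every other row is a negative.
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) matrix of similarities between every image and every caption
    logits = image_features @ text_features.t() / temperature

    # The "correct" caption for image i is caption i, and vice versa
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for real encoder outputs
images = torch.randn(8, 512)    # e.g., output of an image encoder
captions = torch.randn(8, 512)  # e.g., output of a text encoder
print(clip_contrastive_loss(images, captions))
```

The key design choice is that every other caption in the batch acts as a negative example for a given image, which is one reason large batch sizes are so valuable in contrastive pretraining.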

CLIP certainly isn’t the only vision-language pretraining approach that uses natural language supervision. A couple of years earlier, in 2019, a research team from China had proposed a Visual-Linguistic BERT model pursuing a similar goal. Since then, the joint training of vision-and-language foundation models has become very popular, with Flamingo, Imagen, and Stable Diffusion all presenting interesting work.

Now that we’ve learned a little bit about joint vision-and-language contrastive pretraining, let’s explore today’s top models in language.

Top language models as of April 2023

Now, let’s evaluate some of today’s best-in-class models for a task extremely pertinent to foundation models, and thus this book: language modeling. This table shows a set of language model benchmark results across a variety of scenarios.

| Dataset | Best model | From Transformer | Performance |
| --- | --- | --- | --- |
| WikiText-103 | Hybrid H3 (2.7B params) | No | 10.60 test perplexity |
| Penn Treebank (Word Level) | GPT-3 (Zero-Shot) (1) | Yes | 20.5 test perplexity |
| LAMBADA | PaLM-540B (Few-Shot) (1) | Yes | 89.7% accuracy |
| Penn Treebank (Character Level) | Mogrifier LSTM + dynamic eval (1) | No | 1.083 bits per character |
| C4 (Colossal Clean Crawled Corpus) | Primer | No | 12.35 perplexity |

Table 1.2 – Top language modeling results

First, let’s try to answer a fundamental question: what is language modeling, and why does it matter? Language modeling as we know it today appears to have been formalized in two cornerstone papers: BERT (9) and GPT (10). The core concept that inspired both papers is deceptively simple: how do we make better use of unsupervised natural language?

As you might expect, the vast majority of natural language in our world has no direct digital label. Some natural language lends itself well to concrete labels, in cases where objectivity is beyond doubt. This can include accuracy in answering questions, summarization, high-level sentiment analysis, document retrieval, and more.

But the process of finding these labels and producing the datasets necessary for them can be prohibitive, as it is entirely manual. At the same time, many unsupervised datasets get larger by the minute. Now that much of the global dialog is online, datasets rich in variety are easy to access. So, how can ML researchers position themselves to benefit from these large, unsupervised datasets?

This is exactly the problem that language modeling seeks to solve. Language modeling is the process of applying mathematical techniques to large corpora of unlabelled text, relying on a variety of pretraining objectives to enable the model to teach itself about the text. Also called self-supervision, the precise method of learning varies based on the model at hand. BERT applies a mask randomly throughout the dataset and uses an encoder to learn to predict the word hidden by the mask. GPT uses a decoder to predict left to right: starting at the beginning of a sentence, for example, it learns how to predict the end of the sentence. Models in the T5 family use both encoders and decoders to learn text-to-text tasks, such as translation and search. As proposed in ELECTRA (11), another alternative is a token replacement objective, which opts to inject new tokens into the original text rather than masking them.
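
To make the difference between these objectives concrete, here is a rough sketch, in PyTorch, of how labels are typically prepared for a masked (BERT-style) objective versus a causal (GPT-style) objective. The mask token ID, the masking probability, and the -100 ignore index follow common conventions but are assumptions for illustration, not a faithful reproduction of either paper.

```python
import torch

MASK_TOKEN_ID = 103   # assumption: the [MASK] id used by BERT-style vocabularies
IGNORE_INDEX = -100   # positions with this label are excluded from the loss

def masked_lm_inputs(token_ids, mask_prob=0.15):
    """BERT-style objective: hide random tokens, predict only the hidden ones."""
    inputs = token_ids.clone()
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob
    inputs[mask] = MASK_TOKEN_ID      # corrupt the input at masked positions
    labels[~mask] = IGNORE_INDEX      # only masked positions contribute to the loss
    return inputs, labels

def causal_lm_inputs(token_ids):
    """GPT-style objective: predict each next token from everything to its left."""
    inputs = token_ids[:, :-1]        # the model sees tokens 0..n-1
    labels = token_ids[:, 1:]         # and must predict tokens 1..n
    return inputs, labels

# Stand-in for real token IDs produced by a tokenizer
tokens = torch.randint(1000, 2000, (1, 10))
print(masked_lm_inputs(tokens))
print(causal_lm_inputs(tokens))
```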

Fundamentals – fine-tuning

Foundation language models are only useful in applications when paired with their peer method, fine-tuning. The intuition behind fine-tuning is straightforward: we want to take a foundation model pretrained elsewhere and apply a much smaller set of data to make it more focused and useful for our specific task. We can also call this domain adaptation – adapting a pretrained model to a different domain that was not included in its pretraining task.

Fine-tuning tasks are everywhere! You can take a base language model, such as BERT, and fine-tune it for text classification. Or question answering. Or named entity recognition. Or you could take a different model, GPT-2 for example, and fine-tune it for summarization. Or you could take something like T5 and fine-tune it for translation. The basic idea is that you are leveraging the intelligence of the foundation model: by inheriting the researchers’ pretrained artifact, you benefit from the compute, the dataset, the large neural network, and ultimately, the distribution method they used. Then, you can optionally add extra layers to the network yourself, or more likely, use a software framework such as Hugging Face to simplify the process. Hugging Face has done an amazing job building an extremely popular open source framework with tens of thousands of pretrained models, and we’ll see in future chapters how to best utilize their examples to build our own models in both vision and language. There are many different types of fine-tuning, from parameter-efficient fine-tuning to instruction fine-tuning, chain of thought, and even methods that don’t strictly update the core model parameters, such as retrieval-augmented generation. We’ll discuss these later in the book.
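
As a hedged sketch of what this looks like in code, the example below adapts a pretrained BERT checkpoint to binary text classification with the Hugging Face Trainer. The checkpoint name, the IMDb dataset, the subset sizes, and the hyperparameters are placeholder choices for illustration, not recommendations from this book.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a pretrained foundation model and add a classification head
checkpoint = "bert-base-uncased"   # assumption: any BERT-style checkpoint would work here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A small labeled dataset for the downstream task (IMDb sentiment as an example)
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# Fine-tune for a few epochs on a dataset far smaller than anything used in pretraining
args = TrainingArguments(
    output_dir="bert-imdb",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
)
trainer.train()
print(trainer.evaluate())
```

Notice how little code is involved relative to pretraining; the heavy lifting was already done when the checkpoint was pretrained.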

As we will discover in future chapters, foundational language and vision models are not without their negative aspects. For starters, their extremely large compute requirements place significant energy demands on service providers. Ensuring that energy is met through sustainable means and that the modeling process is as efficient as possible are top goals for the models of the future. These large compute requirements are also obviously quite expensive, posing inherent challenges for those without sufficient resources. I would argue, however, that the core techniques you’ll learn throughout this book are relevant across a wide spectrum of computational needs and resourcing. Once you’ve demonstrated success at a smaller scale of pretraining, it’s usually much easier to justify the additional ask.

Additionally, as we will see in future chapters, large models are infamous for their ability to inherit social biases present in their training data. From associating certain occupations with a particular gender to predicting criminal likelihood based on race, researchers have identified hundreds (9) of ways bias can creep into NLP systems. As with all technology, designers and developers must be aware of these risks and take steps to mitigate them. In later chapters, I’ll identify a variety of steps you can take today to reduce these risks.

Next, let’s learn about a core technique used in defining appropriate experiments for language models: the scaling laws!

Language technique spotlight – causal modeling and the scaling laws

You’ve no doubt heard of the now-infamous model ChatGPT. For years, the San Francisco-based AI firm OpenAI has pursued research with a mission to improve humanity’s outcomes around artificial intelligence. Toward that end, they made bold leaps in scaling language models, deriving formulas, as one might in physics, to explain the performance of LLMs at scale. They originally positioned themselves as a non-profit, releasing their core insights and the code to reproduce them. Four years after its founding, however, the company pivoted to cutting exclusive billion-dollar deals with Microsoft. Now, their 600-strong R&D teams focus on developing proprietary models and techniques, and many open source projects attempt to replicate and improve on their offerings. Despite this controversial pivot, the team at OpenAI gave the industry a few extremely useful insights. The first is GPT, and the second is the scaling laws.

As mentioned previously, GPT-based models use causal language modeling to learn how best to complete text. This means using a left-to-right completion criterion: the model’s learnable parameters are updated until its predictions of the next tokens are as accurate as possible. While the first GPT model of 2018 was itself useful, the real excitement came years later in two phases. First, Jared Kaplan led a team at OpenAI to suggest a novel concept: using formulas inspired by his work in physics to estimate the impact that the size of the model, dataset, and overall compute budget has on the loss of the model. These Scaling Laws for Neural Language Models (9) suggested that the optimal model size for a given compute budget was massive.
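
To give a feel for what these formulas look like, here is a small sketch of the power-law form for loss as a function of model size and dataset size. The constants are the approximate values reported by Kaplan et al.; treat them as illustrative, since they depend on the tokenizer, architecture, and fitting procedure used.

```python
def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Kaplan-style estimate of loss (nats/token) vs. non-embedding parameters: L(N) = (Nc / N)**alpha_N."""
    return (n_c / n_params) ** alpha_n

def loss_from_tokens(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """Kaplan-style estimate of loss (nats/token) vs. dataset size in tokens: L(D) = (Dc / D)**alpha_D."""
    return (d_c / n_tokens) ** alpha_d

# Rough illustration: each ~10x jump in parameters shaves a predictable slice off the loss
for n in (117e6, 1.5e9, 175e9):   # GPT-1-, GPT-2-, and GPT-3-scale parameter counts
    print(f"{n:.0e} params -> predicted loss ~{loss_from_params(n):.2f}")
```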

The original GPT model of 2018 was only 117 million parameters, and its second version, aptly named GPT-2, increased the model size by up to 10x. This increase in parameter size more than doubled the overall accuracy of the model. Encouraged by these results, and fuelled by Kaplan’s theoretical and empirical findings, OpenAI boldly increased the model parameter size by another 10x, giving us GPT-3.

As the model increased in size, from 1.3 billion parameters to 13 billion, ultimately hitting 175 billion parameters, accuracy also took a huge leap! This result catalyzed the field of NLP, unleashing new use cases and a flurry of new work exploring and extending these impacts. Since then, new work has explored both larger (PaLM (9)) and smaller (Chinchilla (10)) models, with Chinchilla presenting an update to the scaling laws entirely. Yann LeCun’s team at Meta has also presented smaller models that outperform the larger ones in specific areas, such as question answering (Atlas (9)). Amazon has also presented two models that outperform GPT-3: AlexaTM and MM-CoT. Numerous teams have also undertaken efforts to produce open source versions of GPT-3, such as Hugging Face’s BLOOM, EleutherAI’s GPT-J, and Meta’s OPT.
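
As a rough sketch of how the Chinchilla revision changes the calculus, the snippet below combines two widely cited approximations – training compute of roughly 6 FLOPs per parameter per token, and a compute-optimal budget of roughly 20 tokens per parameter – to back out a model and dataset size from a FLOP budget. Both approximations are simplifications of the full analysis in the Chinchilla paper.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Back-of-the-envelope compute-optimal sizing.

    Uses two common approximations: training compute C ~= 6 * N * D,
    and the Chinchilla-style heuristic D ~= 20 * N tokens per parameter.
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A budget near Chinchilla's (~5.9e23 FLOPs) recovers roughly 70B params and 1.4T tokens
params, tokens = chinchilla_optimal(5.9e23)
print(f"~{params:.2e} parameters, ~{tokens:.2e} tokens")
```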

The rest of this book is dedicated to discussing these models – where they come from, what they are good for, and especially how to train your own! While much excellent work has covered using these pretrained models in production through fine-tuning, such as Hugging Face’s own Natural Language Processing with Transformers (Tunstall et al., 2022), I continue to believe that pretraining your own foundation model is probably the most interesting computational intellectual exercise you can embark on today. I also believe it’s one of the most profitable. But more on that ahead!

Next, let’s learn about two key model components you’ll need to understand in detail: encoders and decoders.