Train Large Language Models Faster - Parallelism Deep Dive
Overview of this book
This course offers an in-depth exploration of parallelism in Large Language Model (LLM) training. Beginning with foundational IT concepts like cloud computing, GPUs, and network communication, the course introduces various parallelism techniques such as data parallelism, model parallelism, hybrid approaches, and pipeline parallelism, explaining their benefits and trade-offs. You’ll then apply these strategies in hands-on demos using real-world datasets like MNIST and WikiText.
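As a taste of the hands-on demos, here is a minimal sketch of data parallelism on MNIST using PyTorch's DistributedDataParallel; the tiny model, hyperparameters, and launch command are illustrative assumptions, not the course's exact code.

```python
# Minimal data-parallelism sketch with PyTorch DDP on MNIST.
# Model size, batch size, and epochs are illustrative assumptions.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                          nn.Linear(256, 10)).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # replicas sync gradients

    dataset = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
    # DistributedSampler hands each rank a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for images, labels in loader:
            images = images.cuda(local_rank)
            labels = labels.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()  # gradient all-reduce happens here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=NUM_GPUS this_script.py
```

Each process trains an identical model replica on its own data shard, and gradients are averaged across GPUs during the backward pass, which is the essence of data parallelism covered in the course.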
As you progress, you’ll work on true parallelism with multiple GPUs through platforms like Runpod.io, and dive into essential topics such as fault tolerance, scalability, and checkpointing strategies. These lessons ensure your training systems are resilient and optimized for large-scale machine learning workflows. With insights into GPU architectures and advanced tools like DeepSpeed, you'll be equipped to handle the complexities of training massive models efficiently.
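For a flavor of the checkpointing material, here is a hedged sketch of periodic checkpointing with resume; the directory layout, file-naming scheme, and restore logic are assumptions for illustration, not the course's exact approach.

```python
# Sketch of periodic checkpointing for fault tolerance.
# The path and naming scheme below are illustrative assumptions.
import os
import torch

CKPT_DIR = "checkpoints"  # hypothetical checkpoint directory

def save_checkpoint(model, optimizer, step):
    """Write model + optimizer state so a crashed run can resume at `step`."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    # Zero-padded step numbers keep lexicographic order == numeric order.
    path = os.path.join(CKPT_DIR, f"step_{step:08d}.pt")
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_latest_checkpoint(model, optimizer):
    """Restore the newest checkpoint; return the step to resume from."""
    files = sorted(os.listdir(CKPT_DIR)) if os.path.isdir(CKPT_DIR) else []
    if not files:
        return 0  # no checkpoint yet: start training from scratch
    state = torch.load(os.path.join(CKPT_DIR, files[-1]), map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```

In a training loop you would call save_checkpoint every N steps and load_latest_checkpoint once at startup; saving the optimizer state alongside the model is what lets a preempted run resume exactly where it left off.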
Whether you're an AI researcher or a data scientist, this course provides the knowledge and practical experience needed to accelerate LLM training and build scalable, efficient AI systems. Through a combination of theoretical lessons and hands-on applications, you’ll master parallelism techniques and become proficient in building and optimizing high-performance LLM training pipelines.
Table of Contents (16 chapters)
Introduction
Strategies for Parallelizing LLMs - Deep Dive
Fundamental IT Concepts
GPU Architecture for LLM Training Deep Dive
Deep Learning and Machine Learning - Deep Dive
Large Language Models - Fundamentals of AI and LLMs
Parallel Computing Fundamentals & Parallelism in LLM Training
Types of Parallelism in LLM Training - Data, Model, and Hybrid Parallelism
Types of Parallelism - Pipeline and Tensor Parallelism
Tensor Parallelism - Deep Dive
HANDS-ON: Strategies for Parallelism - Data Parallelism Deep Dive
HANDS-ON: Data Parallelism with the WikiText Dataset & DeepSpeed Memory Optimization
Running TRUE Parallelism on Multiple GPU Systems - Runpod.io
Fault Tolerance and Scalability & Advanced Checkpointing Strategies - Deep Dive
Advanced Topics and Emerging Trends
Wrap up and Next Steps