Chapter 2: Data Cleaning for LLM Training

LLM Design Patterns

By: Ken Huang

Overview of this book

This practical guide for AI professionals enables you to build on the power of design patterns to develop robust, scalable, and efficient large language models (LLMs). Written by a global AI expert and popular author driving standards and innovation in Generative AI, security, and strategy, this book covers the end-to-end lifecycle of LLM development and introduces reusable architectural and engineering solutions to common challenges in data handling, model training, evaluation, and deployment. You’ll learn to clean, augment, and annotate large-scale datasets, architect modular training pipelines, and optimize models using hyperparameter tuning, pruning, and quantization. The chapters help you explore regularization, checkpointing, fine-tuning, and advanced prompting methods, such as reason-and-act, as well as implement reflection, multi-step reasoning, and tool use for intelligent task completion. The book also highlights Retrieval-Augmented Generation (RAG), graph-based retrieval, interpretability, fairness, and RLHF, culminating in the creation of agentic LLM systems. By the end of this book, you’ll be equipped with the knowledge and tools to build next-generation LLMs that are adaptable, efficient, safe, and aligned with human values.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the color images

Conventions used

Get in touch

Share Your Thoughts

Free Benefits with Your Book

Free Chapter

Part 1: Introduction and Data Preparation

Chapter 1: Introduction to LLM Design Patterns

Understanding LLMs

Understanding design patterns

Design patterns for LLM development

Summary

Chapter 2: Data Cleaning for LLM Training

Understanding the importance of clean data

Common data quality issues in language datasets

Text preprocessing techniques for LLMs

Handling multilingual and code-mixed data

Deduplication strategies for large text corpora

Automated data cleaning pipelines

Data validation and quality assurance

Summary

Chapter 3: Data Augmentation

Text data augmentation techniques

Leveraging existing LLMs for data generation

Multilingual data augmentation strategies

Semantic preservation in text augmentation

Balancing augmentation and data quality

Evaluating the impact of data augmentation

Summary

Chapter 4: Handling Large Datasets for LLM Training

Challenges of large datasets

Data sampling techniques

Distributed data processing

Data sharding and parallelization strategies

Efficient data storage formats

Streaming data processing for continuous LLM training

Memory-efficient data loading techniques

Summary

Chapter 5: Data Versioning

Understanding the need for data versioning

Data versioning strategies for large language datasets

Tools for data versioning

Integrating data versioning in training workflows

Version control for text corpora

Managing dataset variants and experiments

Best practices for data versioning

Summary

Chapter 6: Dataset Annotation and Labeling

The importance of quality annotations

Annotation strategies for different tasks

Tools and platforms for large-scale text annotation

Managing annotation quality

Crowdsourcing annotations – benefits and challenges

Semi-automated annotation techniques

Scaling annotation processes for massive language datasets

Annotation biases and mitigation strategies

Summary

Part 2: Training and Optimization of Large Language Models

Chapter 7: Training Pipeline

Components of a training pipeline

Data input and preprocessing

LLM architecture design considerations

Loss functions and optimization strategies

Logging

Pipeline modularity and reusability

Scaling your training pipeline for larger models

Summary

Chapter 8: Hyperparameter Tuning

Understanding hyperparameters

Manual versus automated tuning

Grid and random search

Bayesian optimization

Population-based methods

Multi-objective hyperparameter optimization

Hyperparameter tuning at scale – challenges and solutions

Summary

Chapter 9: Regularization

L2 regularization (Ridge regression)

Dropout

Layer-wise adaptive regularization

Gradient clipping and noise injection

Regularization in transfer learning and fine-tuning scenarios

Emerging regularization techniques

Summary

Chapter 10: Checkpointing and Recovery

Why is checkpointing important?

Checkpoint frequency and storage strategies

Efficient checkpoint formats

Recovering from failures

Checkpointing in distributed LLM training

Version control for LLM checkpoints

Automated checkpointing and recovery systems

Summary

Chapter 11: Fine-Tuning

Implementing transfer learning and fine-tuning

Strategies for freezing and unfreezing layers

Learning rate scheduling

Domain-specific fine-tuning techniques

Few-shot and zero-shot fine-tuning

Continual fine-tuning and catastrophic forgetting

Summary

Chapter 12: Model Pruning

Magnitude-based pruning

Structured versus unstructured pruning

Iterative pruning techniques

Pruning during training versus post-training pruning

Balancing pruning and model performance

Combining pruning with other compression techniques

Summary

Chapter 13: Quantization

Understanding the basics

Mixed-precision quantization

Hardware-specific considerations

Comparing quantization strategies

Combining quantization with other optimization techniques

Summary

Part 3: Evaluation and Interpretation of Large Language Models

Chapter 14: Evaluation Metrics

NLU benchmarks

Reasoning and problem-solving metrics

Coding and programming evaluation

Conversational ability assessment

Commonsense and general knowledge benchmarks

Other commonly used benchmarks

Developing custom metrics and benchmarks

Interpreting and comparing LLM evaluation results

Summary

Chapter 15: Cross-Validation

Pre-training and fine-tuning data splits

Few-shot and zero-shot evaluation strategies

Domain and task generalization

Continual learning evaluation

Cross-validation challenges and best practices

Summary

Chapter 16: Interpretability

Attention visualization techniques

Probing methods

Explaining LLM predictions with attribution methods

Interpretability in transformer-based LLMs

Mechanistic interpretability

Trade-offs between interpretability and performance

Summary

Chapter 17: Fairness and Bias Detection

Types of bias

Fairness metrics for LLM text generation and understanding

Detecting bias

Debiasing strategies

Fairness-aware training

Ethical considerations

Summary

Chapter 18: Adversarial Robustness

Types of textual adversarial attacks

Adversarial training techniques

Evaluating robustness

Trade-offs in the adversarial training of LLMs

Real-world implications

Summary

Chapter 19: Reinforcement Learning from Human Feedback

Components of RLHF systems

Scaling RLHF

Limitations of RLHF in language modeling

Applications of RLHF

Summary

Part 4: Advanced Prompt Engineering Techniques

Chapter 20: Chain-of-Thought Prompting

Designing effective CoT prompts

Using CoT prompting for problem solving

Combining CoT prompting with other techniques

Evaluating CoT prompting outputs

Limitations of CoT prompting

Future directions

Summary

Chapter 21: Tree-of-Thoughts Prompting

Designing ToT prompts

Search strategies

Pruning and evaluation

Applying ToT to solve a multi-step problem

Challenges in implementation

Future directions

Summary

Chapter 22: Reasoning and Acting

Implementing ReAct in LangChain

Building ReAct agents with LangChain’s Expression Language

Completing tasks and solving problems

Evaluating ReAct’s performance

Safety, control, and ethical considerations

Limitations and future directions

Summary

Chapter 23: Reasoning WithOut Observation

Implementing ReWOO with LangGraph

Advantages of ReWOO

Evaluating quality and ethical considerations

Future directions

Summary

Chapter 24: Reflection Techniques

Designing prompts for self-reflection

Implementing iterative refinement

Correcting errors

Evaluating the impact of reflection

Challenges in implementing effective reflection

Future directions

Summary

Chapter 25: Automatic Multi-Step Reasoning and Tool Use

Designing prompts for complex task decomposition

Integrating external tools

Implementing automatic tool selection and use

Complex problem solving

Evaluating multi-step reasoning and tool use

Challenges and future directions

Summary

Part 5: Retrieval and Knowledge Integration in Large Language Models

Chapter 26: Retrieval-Augmented Generation

Building a simple RAG system for LLMs

Embeddings and indexing for retrieval in LLM applications

Query formulation strategies in LLM-based RAG

Integrating retrieved information with LLM generation

Challenges and opportunities in RAG for LLMs

Summary

Chapter 27: Graph-Based RAG

Introduction to graph-based knowledge representation for LLMs

Designing graph RAG architectures for LLMs

Graph embedding techniques for LLM retrieval

Query expansion using graph structures in LLMs

Applications and use cases of graph RAG in LLMs

Challenges and future directions in graph-based RAG

Summary

Chapter 28: Advanced RAG

Multi-step and iterative retrieval techniques for LLMs

Adaptive retrieval based on context and task in LLMs

Meta-learning for improved retrieval in LLMs

Combining RAG with other LLM prompting techniques

Handling ambiguity and uncertainty in LLM-based RAG

Scaling RAG to very large knowledge bases

Future directions in RAG research for LLMs

Summary

Chapter 29: Evaluating RAG Systems

Challenges in evaluating RAG systems for LLMs

Metrics for assessing retrieval quality in LLM-based RAG

Considerations for retrieval metrics in RAG

Evaluating the relevance of retrieved information for LLMs

Measuring the impact of retrieval on LLM generation

End-to-end evaluation of RAG systems in LLMs

Human evaluation techniques for LLM-based RAG

Benchmarks and datasets for RAG evaluation

Summary

Chapter 30: Agentic Patterns

Introduction to agentic AI systems based on LLMs

Goal-setting and planning in LLM-based agents

Implementing memory and state management for LLM agents

Decision-making and action selection in LLM-based agents

Learning and adaptation in agentic LLM systems

Ethical considerations and safety in LLM-based agentic AI

Future prospects of agentic AI using LLMs

Summary

Future directions in LLM patterns and their development

Chapter 31: Unlock Your Exclusive Benefits

Unlock this Book’s Free Benefits in 3 Easy Steps

Index
