LLM Design Patterns
In this chapter, we’ll dive into the data cleaning pattern for LLM training.
Clean, high-quality data is the foundation of robust and reliable language models. We’ll explore common data quality issues, preprocessing techniques, and strategies for handling diverse data types. Figure 2.1 depicts a data cleaning pipeline specifically designed for processing raw text data before it’s used to train language models.
Figure 2.1 – Data cleaning pipeline
The process begins with an initial data quality check to assess the raw data’s suitability. Following this, text preprocessing and deduplication steps are applied to refine and streamline the dataset. If the data fails to meet the required standards at any point, it is rerouted through an automated cleaning pipeline for additional processing. Successful completion of this stage leads to data validation to ensure the dataset’s integrity...
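The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's implementation: every function name, the 20-character minimum, and the 95% printable-character threshold are hypothetical choices made for the example, and rerouting to automated cleaning is shown only at the initial quality check for brevity.

```python
import re
import unicodedata


def quality_check(text, min_chars=20, min_printable_ratio=0.95):
    """Hypothetical quality gate: long enough and mostly printable."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in stripped)
    return printable / len(stripped) >= min_printable_ratio


def automated_clean(text):
    """Hypothetical fallback cleaner: drop non-printable control characters."""
    return "".join(ch for ch in text if ch.isprintable() or ch.isspace())


def preprocess(text):
    """Normalize Unicode, strip HTML-like tags, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(texts):
    """Remove exact duplicates (case-insensitive), keeping first occurrences."""
    seen = set()
    unique = []
    for text in texts:
        key = text.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def validate(texts):
    """Final integrity check: no empty or whitespace-padded records."""
    return all(text and text == text.strip() for text in texts)


def run_pipeline(raw_texts):
    cleaned = []
    for raw in raw_texts:
        if not quality_check(raw):
            # Reroute failing records through automated cleaning, then re-check.
            raw = automated_clean(raw)
            if not quality_check(raw):
                continue  # still unusable after cleaning, so drop it
        cleaned.append(preprocess(raw))
    cleaned = deduplicate(cleaned)
    if not validate(cleaned):
        raise ValueError("dataset failed validation")
    return cleaned


if __name__ == "__main__":
    samples = [
        "<p>Clean data is the   foundation of robust models.</p>",
        "Clean data is the foundation of robust models.",
        "too short",
        "Noisy\x00\x01\x02\x03 text that needs automated cleaning first.",
    ]
    for line in run_pipeline(samples):
        print(line)
```

In this sketch, the first two samples collapse into one record after preprocessing and deduplication, the third is dropped because it is too short even after cleaning, and the fourth passes only after the control characters are removed. Real pipelines would typically use near-duplicate detection and richer quality heuristics than these exact-match and ratio checks.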