LLM Design Patterns
Quantization refers to reducing the precision of the weights and activations of a model, typically from 32-bit floating point (FP32) to lower precision formats such as 16-bit (FP16) or even 8-bit integers (INT8). The goal is to decrease memory usage, speed up computation, and make the model more deployable on hardware with limited computational capacity. While quantization can lead to performance degradation, carefully tuned quantization schemes usually result in only minor losses in accuracy, especially for LLMs with robust architectures.
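To make the idea concrete, the following is a minimal sketch of affine (asymmetric) INT8 quantization in NumPy: an FP32 tensor is mapped to 8-bit integers via a scale and zero point, then mapped back. The function names (`quantize_int8`, `dequantize`) are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(x):
    """Affine (asymmetric) quantization of an FP32 array to INT8."""
    # One quantization step covers 1/255 of the observed value range.
    scale = (x.max() - x.min()) / 255.0
    # Zero point shifts the range so that x.min() maps near -128.
    zero_point = round(-x.min() / scale) - 128
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Invert the affine mapping; rounding error remains.
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize_int8(w)
w_hat = dequantize(q, s, z)
# The reconstruction error per element is on the order of one
# quantization step (s), which is why accuracy loss is usually small.
print(np.max(np.abs(w - w_hat)))
```

The 4x memory saving (INT8 vs. FP32) comes at the cost of this bounded rounding error, which well-conditioned models tolerate with little accuracy loss.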
There are two primary quantization methods: dynamic quantization and static quantization. In dynamic quantization, the weights are quantized ahead of time while activations are quantized on the fly at inference; in static quantization, activation ranges are also fixed in advance using a calibration dataset.
In the following example, we use torch.quantization.quantize_dynamic to dynamically quantize the linear layers of a model to INT8.
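Since the original example is not reproduced here, the following is a minimal sketch of how `torch.quantization.quantize_dynamic` is typically applied; the toy model and its layer sizes are illustrative assumptions, not from the book.

```python
import torch
import torch.nn as nn

# A small example model; the sizes are illustrative assumptions.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Convert the Linear layers' weights to INT8 ahead of time;
# activations are quantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    out = quantized(x)
print(out.shape)
```

Only the layer types listed in the second argument are replaced; everything else (here, the `ReLU`) is left in FP32, which is why dynamic quantization requires no calibration data.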