Implementation for model size reduction
Although transformer-based models achieve state-of-the-art results across many NLP tasks, they share a common problem: they are large and often too slow for practical use. In business cases where a model must be embedded in a mobile application or served behind a web interface, deploying the original models is usually impractical.
To improve the speed and reduce the size of these models, several techniques have been proposed:
- Distillation (also known as knowledge distillation)
- Pruning
- Quantization
For each of these techniques, we provide a separate subsection addressing the technical and theoretical details.
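Before looking at each technique in turn, it helps to see how large these models actually are. The following is a minimal sketch, assuming the Hugging Face transformers library is installed and the public bert-base-uncased and distilbert-base-uncased checkpoints can be downloaded; it simply counts the parameters of BERT-base and of its distilled counterpart, DistilBERT, which we discuss next:

```python
# Minimal sketch: compare the parameter counts of BERT-base and DistilBERT.
# Assumes the Hugging Face `transformers` library is installed and the public
# checkpoints can be downloaded from the model hub.
from transformers import AutoModel

def count_parameters(model_name: str) -> int:
    """Load a pretrained checkpoint and return its total parameter count."""
    model = AutoModel.from_pretrained(model_name)
    return sum(p.numel() for p in model.parameters())

for name in ("bert-base-uncased", "distilbert-base-uncased"):
    print(f"{name}: {count_parameters(name) / 1e6:.1f}M parameters")
```

Running this shows roughly 110 million parameters for BERT-base versus roughly 66 million for DistilBERT, which illustrates why distillation is often the first technique to try.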
Working with DistilBERT for knowledge distillation
The process of transferring knowledge from a bigger model to a smaller one is called knowledge distillation. In other words, there is a teacher model and a student model; the teacher is typically a bigger and stronger...