Technical considerations for NLP models in production
Deploying machine learning (especially NLP) models differs from deploying other software solutions in one key area – the resources needed to run the service. Hosting a generic backend for a simple mobile or web app can, in theory, be done on any modern device, such as a PC or even a mobile phone. Only when you start scaling the service to a larger audience do you need to put extra thought, effort, and resources into making it more scalable. When serving machine learning models, things often get complicated right from the start. We are dealing with huge models that a typical web server sometimes can't even load into memory. Each request consumes a significant amount of resources, yet we need to serve requests on demand, in real time, and to a large audience. But how do you do that?
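To get a feel for why loading a model can already be a problem, consider a quick back-of-the-envelope estimate. The following sketch (a simplification that counts only the weights, ignoring activation memory, framework overhead, and batching) shows how parameter count and numeric precision translate into memory; the parameter counts used are illustrative examples, not measurements from any specific deployment:

```python
# Rough memory-footprint estimate for holding a model's weights in memory.
# This deliberately ignores activations, framework overhead, and optimizer
# state, all of which add to the real footprint.
def weights_memory_gib(num_parameters: int, bytes_per_param: int = 4) -> float:
    """Approximate memory (in GiB) needed just to store the weights."""
    return num_parameters * bytes_per_param / (1024 ** 3)

# A BERT-large-sized model (~340M parameters) stored as 32-bit floats:
print(f"{weights_memory_gib(340_000_000):.2f} GiB")   # ~1.27 GiB
# A 6B-parameter model stored as 16-bit floats:
print(f"{weights_memory_gib(6_000_000_000, bytes_per_param=2):.2f} GiB")  # ~11.18 GiB
```

Even this optimistic estimate shows why a modest web server, sized for serving static pages or a database-backed API, can struggle before the first request ever arrives.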
When comparing this chapter's topic to what we have covered so far in this book, you will notice that deploying...