Self-serving Flair models
Regardless of whether you decide to self-serve or use a fully managed service, it is always useful to have a basic understanding of how NLP model inference works in production.
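To make "inference in production" concrete, the following is a minimal sketch of the pattern that serving frameworks automate for you: load a model once at startup, then answer prediction requests over HTTP. The `predict` function here is a deliberately simplified stand-in (it just flags capitalized tokens); in a real deployment it would call a loaded Flair tagger, and the endpoint shape and port are illustrative choices, not part of any framework's API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(text: str) -> list[str]:
    # Stand-in for a real model call (e.g. running a Flair SequenceTagger
    # over a Sentence). Here we simply "tag" capitalized tokens so the
    # example stays self-contained and fast.
    return [tok for tok in text.split() if tok[:1].isupper()]


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, run the model, and return JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = {"entities": predict(payload["text"])}
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    # Blocks forever; you would run this in a real process, not in tests.
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Even this toy version hints at the concerns a production framework handles for you: batching, model versioning, worker management, and graceful restarts are all missing here, which is exactly why the next paragraphs argue against building your own serving stack.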
Since deploying NLP models in production is more of a software engineering feat than an NLP challenge, general software engineering rules apply, the main one being: don't reinvent the wheel.
While implementing your own NLP model-serving framework is entirely possible, doing so requires a significant amount of time. Chances are that even with your best efforts, you won't be able to build a product better than the open source solutions already out there, maintained by hundreds of contributors.
There is a wide range of tools and packages out there for self-serving NLP models, but one specific package stands out due to its support for PyTorch-based packages (Flair is built on top of PyTorch) and its support for Hugging Face...