Elasticity in model inference
- If the number of concurrent inference requests increases, we allocate more GPUs to the model-serving job.
- If the number of concurrent inference requests decreases, we shrink the number of GPUs the job uses.
For example, suppose we have just received four concurrent model-serving queries, as shown in the following figure:
As the preceding figure shows, when more queries arrive, we can use more GPUs to serve them concurrently and thus reduce model-serving latency.
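To make the scaling rule concrete, here is a minimal Python sketch of how such request-based GPU autoscaling might look. The function name `gpus_needed` and the parameters `requests_per_gpu`, `min_gpus`, and `max_gpus` are illustrative assumptions, not part of any particular serving framework.

```python
import math

def gpus_needed(concurrent_requests: int,
                requests_per_gpu: int = 2,
                min_gpus: int = 1,
                max_gpus: int = 8) -> int:
    """Map the current number of concurrent inference requests to a GPU count.

    Each GPU is assumed to handle `requests_per_gpu` requests at the target
    latency; the result is clamped to [min_gpus, max_gpus] so the job neither
    scales to zero nor exceeds the cluster quota. These thresholds are
    illustrative assumptions.
    """
    desired = math.ceil(concurrent_requests / requests_per_gpu)
    return max(min_gpus, min(desired, max_gpus))

# Example: four concurrent queries -> two GPUs; when the load drops to
# a single query, the job shrinks back to one GPU.
print(gpus_needed(4))  # 2
print(gpus_needed(1))  # 1
```

In practice, a real autoscaler would also smooth the request count over a time window before resizing, so that short bursts do not cause the GPU allocation to oscillate.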