Summary
In this chapter, we reviewed the available hardware accelerators that are suitable for running DL inference workloads. We also discussed how your models can be optimized for target hardware accelerators using the TensorRT compiler for NVIDIA GPU accelerators and the Neuron SDK for AWS Inferentia accelerators. Then, we reviewed the SageMaker Neo service, which allows you to compile supported models for a wide range of hardware platforms with minimal development effort, and highlighted several limitations of this service. After reading this chapter, you should be able to decide which hardware accelerators to use and how to optimize your models for them based on your specific use case requirements around latency, throughput, and cost.
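To make this concrete, the following is a minimal sketch of how a SageMaker Neo compilation job could be started with the AWS SDK for Python (boto3). The job name, IAM role ARN, S3 locations, framework, and input shape are illustrative placeholders, not values from this chapter; substitute your own before running.

```python
import boto3

# A minimal sketch of starting a SageMaker Neo compilation job with boto3.
# All names, ARNs, and S3 paths below are placeholders for illustration only.
sm_client = boto3.client("sagemaker")

sm_client.create_compilation_job(
    CompilationJobName="resnet50-neo-example",                   # hypothetical job name
    RoleArn="arn:aws:iam::111122223333:role/SageMakerNeoRole",    # placeholder IAM role
    InputConfig={
        "S3Uri": "s3://my-bucket/models/resnet50/model.tar.gz",   # placeholder model artifact
        "DataInputConfig": '{"input0": [1, 3, 224, 224]}',        # example input tensor shape
        "Framework": "PYTORCH",                                   # framework of the source model
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",           # where Neo writes the compiled model
        "TargetDevice": "ml_c5",                                  # example target hardware platform
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)

# Check the job status before deploying the compiled artifact.
status = sm_client.describe_compilation_job(
    CompilationJobName="resnet50-neo-example"
)["CompilationJobStatus"]
print(status)
```

Once the job reports COMPLETED, the compiled artifact in the output location can be deployed to the matching instance family; if the model contains unsupported operators, the job fails, which reflects one of the Neo limitations discussed in this chapter.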
Once you have selected your hardware accelerator and model optimization strategy, you will need to decide which model server to use and how to further tune your inference workload at serving time. In the next chapter, we will discuss popular model server...