Selecting hardware accelerators in AWS Cloud
AWS Cloud and Amazon SageMaker provide a range of hardware accelerators suitable for inference workloads. Choosing a hardware platform usually requires running multiple experiments with different accelerators and serving parameters. Let's look at some key selection criteria that are useful during this evaluation process.
Latency-throughput trade-offs
Inference latency defines how quickly your model returns inference outcomes to the end user, and we want to minimize latency to improve the user experience. Inference throughput defines how many inference requests can be processed per unit of time, and we want to maximize it so that as many inference requests as possible are served. In software engineering, it's common to discuss latency-throughput trade-offs because it's usually impractical to minimize latency and maximize throughput at the same time, so you need to find a balance between these two characteristics.
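A simple way to see this trade-off is through batching. The sketch below is purely illustrative (the cost constants are hypothetical, not measured on any real accelerator): it models batch inference time as a fixed per-invocation overhead plus a per-item cost, so larger batches raise per-request latency but amortize the overhead and raise throughput.

```python
# Hypothetical cost model for one accelerated inference invocation:
# a fixed overhead (kernel launch, data transfer) plus a per-item cost.
# The constants are made up for illustration only.
FIXED_OVERHEAD_MS = 10.0  # per-invocation cost, in milliseconds
PER_ITEM_MS = 2.0         # marginal cost of each item in the batch

def batch_latency_ms(batch_size: int) -> float:
    """Latency to process a single batch, in milliseconds."""
    return FIXED_OVERHEAD_MS + PER_ITEM_MS * batch_size

def throughput_rps(batch_size: int) -> float:
    """Requests served per second at the given batch size."""
    return batch_size / (batch_latency_ms(batch_size) / 1000.0)

for bs in (1, 8, 32, 128):
    print(f"batch={bs:3d}  latency={batch_latency_ms(bs):6.1f} ms  "
          f"throughput={throughput_rps(bs):7.1f} req/s")
```

Under this model, growing the batch from 1 to 128 roughly multiplies latency by 20 while throughput climbs from about 83 to about 481 requests per second, which is exactly the tension the experiments mentioned above aim to balance for a given workload.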