Distributing training jobs
Distributed training lets you scale training jobs by running them on a cluster of CPU or GPU instances. Each instance trains either on the full dataset or on a fraction of it, depending on the distribution policy that we configure.
FullyReplicated distributes the full dataset to each instance.
ShardedByS3Key distributes an equal number of input files to each instance, which is where splitting your dataset into many files comes in handy.
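The difference between the two policies can be sketched in plain Python. This is a conceptual illustration of how input files end up on each instance, not SageMaker's actual implementation; the file names and the helper function are hypothetical.

```python
# Conceptual sketch (not SageMaker's implementation) of how the two
# distribution policies assign S3 input files to training instances.
# assign_files and the file names below are hypothetical.

def assign_files(files, num_instances, policy):
    """Return the list of input files each training instance receives."""
    if policy == "FullyReplicated":
        # Every instance receives a full copy of the dataset.
        return [list(files) for _ in range(num_instances)]
    if policy == "ShardedByS3Key":
        # Files are split roughly evenly across instances by S3 key.
        return [files[i::num_instances] for i in range(num_instances)]
    raise ValueError(f"Unknown policy: {policy}")

files = ["part-0001.csv", "part-0002.csv", "part-0003.csv", "part-0004.csv"]

print(assign_files(files, 2, "FullyReplicated"))
print(assign_files(files, 2, "ShardedByS3Key"))
```

This also shows why splitting the dataset into many files matters: with a single file, ShardedByS3Key would leave all but one instance without data.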
Distributing training for built-in algorithms
As built-in algorithms are implemented with Apache MXNet, training instances exchange results through its Key-Value Store, which SageMaker automatically sets up on one of the training instances. Curious minds can learn more at https://mxnet.apache.org/api/faq/distributed_training.
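To make the exchange mechanism concrete, here is a toy parameter-server-style key-value store in the spirit of what MXNet provides. This is a hedged sketch, not the MXNet KVStore API: the class and method names are hypothetical, and it only illustrates the push/aggregate/pull pattern that instances use to share results.

```python
# Toy sketch of a parameter-server-style key-value store, in the spirit
# of MXNet's Key-Value Store. This is NOT the MXNet API: ToyKVStore and
# its methods are hypothetical, for illustration only.

class ToyKVStore:
    def __init__(self):
        self.values = {}

    def push(self, key, grads):
        # Aggregate the gradients pushed by all workers by averaging them.
        self.values[key] = sum(grads) / len(grads)

    def pull(self, key):
        # Every worker pulls the same aggregated value.
        return self.values[key]

store = ToyKVStore()
# Two training instances each computed a gradient for parameter "w".
store.push("w", [2.0, 4.0])
print(store.pull("w"))  # averaged gradient: 3.0
```

In a real distributed job, each training instance plays the worker role, and SageMaker hosts the store on one of the instances, as described above.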