TensorFlow also supports distributed computing, allowing us to partition a graph and compute its parts in different processes, possibly on different machines. Distributed TensorFlow follows a client-server model, or more specifically, a master-worker model. We first create a cluster of workers, one of which acts as the master; the master coordinates the distribution of tasks to the other workers.
The first step when working with many machines (or processes) is to define their names and job types, that is, to form a cluster. Each process in the cluster is assigned a unique address (for example, worker0.example.com:2222) and belongs to a specific job, such as "ps" (parameter server) or "worker". The TensorFlow server then runs a specific task in each process. To create a cluster, we first need to define a cluster specification: a dictionary that maps each job name to the list of addresses of the processes that perform it. The following code creates a cluster...
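A minimal sketch of such a cluster specification, using `tf.train.ClusterSpec` with one parameter server and two workers; the hostnames and ports below are placeholders chosen for illustration:

```python
import tensorflow as tf

# Map each job name to the addresses of the processes that perform it.
# The hostnames and ports are illustrative placeholders.
cluster_spec = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],            # one parameter-server task
    "worker": ["worker0.example.com:2222",     # two worker tasks
               "worker1.example.com:2222"],
})

print(cluster_spec.jobs)               # job names defined in the cluster
print(cluster_spec.num_tasks("worker"))  # number of worker tasks: 2
```

Each process in the cluster would then typically start a server with this same specification, passing its own job name and task index so it knows which entry in the spec it is responsible for.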