In this section, we will explore the mechanisms through which computation in TensorFlow can be distributed. The first step in running distributed TensorFlow is to specify the architecture of the cluster using tf.train.ClusterSpec:
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223",
                                           "localhost:2224"]})
Nodes are typically divided into two jobs: parameter servers (ps), which host variables, and workers, which perform the heavy computation. In the preceding code, we define one parameter server and two workers, giving the hostname and port of each node.
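Constructing a ClusterSpec opens no network connections; it is only a description of the cluster that each process can query. As a quick sketch (reusing the spec from the code above), the ClusterSpec object exposes accessors such as num_tasks(), job_tasks(), and the jobs property:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223",
                                           "localhost:2224"]})

# The spec can be inspected programmatically:
print(cluster.num_tasks("worker"))  # number of tasks in the "worker" job -> 2
print(cluster.job_tasks("ps"))      # addresses assigned to the "ps" job
print(sorted(cluster.jobs))         # names of all jobs in the cluster
```

Every process in the cluster is usually started with the same ClusterSpec, so that each one agrees on which addresses belong to which job.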
Then we have to build a tf.train.Server for each of the parameter servers and workers defined previously:
ps = tf.train.Server(cluster, job_name="ps", task_index=0)
worker0 = tf.train.Server(cluster, job_name="worker", task_index=0)
worker1 = tf.train.Server(cluster, job_name="worker", task_index=1)
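Each tf.train.Server starts serving over gRPC as soon as it is constructed, and a session can attach to any of them through its target property. The following is a minimal single-process sketch of that pattern: create_local_server() is a convenience that stands in for the three-node setup above, and the tf.compat.v1 alias is only there to keep the graph-mode code runnable on TensorFlow 2.x (on 1.x, plain tf works the same way):

```python
import tensorflow as tf

tf1 = tf.compat.v1          # graph-mode API; on TensorFlow 1.x this is just tf
tf1.disable_eager_execution()

# create_local_server() starts an in-process, single-task server on a free
# port -- a stand-in for the multi-process ps/worker servers built above.
server = tf1.train.Server.create_local_server()

# A session pointed at server.target runs its graph on that server.
with tf1.Session(server.target) as sess:
    result = sess.run(tf1.constant(3) * tf1.constant(14))
```

In a real deployment, the parameter-server process typically calls ps.join() to block and serve variable requests indefinitely, while the worker processes build the graph and drive the training sessions.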