A computation graph is the basic unit of computation in TensorFlow. A computation graph consists of nodes and edges: each node represents an instance of tf.Operation, while each edge represents an instance of tf.Tensor that is transferred between the nodes.
A model in TensorFlow contains a computation graph. First, you must create the graph with the nodes representing variables, constants, placeholders, and operations, and then provide the graph to the TensorFlow execution engine. The execution engine finds the first set of nodes that it can execute, and executing these nodes triggers the execution of the nodes that follow them in the sequence of the computation graph.
Thus, TensorFlow-based programs perform two types of activities on computation graphs:
- Defining the computation graph
- Executing the computation graph
A TensorFlow program starts execution with a default graph. Unless another graph is explicitly specified, a new node gets implicitly added to the default graph. Explicit access to the default graph can be obtained using the following command:
graph = tf.get_default_graph()
For example, the following computation graph represents the addition of three inputs to produce the output, that is, y = x1 + x2 + x3:
In TensorFlow, the add operation node in the preceding diagram would correspond to the code y = tf.add_n([x1, x2, x3]), since tf.add() itself takes exactly two inputs.
The variables, constants, and placeholders get added to the graph as and when they are created. After defining the computation graph, a session object is instantiated that executes the operation objects and evaluates the tensor objects.
Let's define and execute a computation graph to calculate y = w * x + b, just like we saw in the preceding example:
# Linear Model y = w * x + b
# Define the model parameters
w = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
# Define model input and output
x = tf.placeholder(tf.float32)
y = w * x + b
output = 0
with tf.Session() as tfs:
    # initialize and print the variable y
    tf.global_variables_initializer().run()
    output = tfs.run(y, {x: [1, 2, 3, 4]})
    print('output : ', output)
Creating and using a session in the with block ensures that the session is automatically closed when the block finishes. Otherwise, the session has to be closed explicitly with the tfs.close() command, where tfs is the session name.
The nodes in a computation graph are executed in their order of dependency. If node x depends on node y, then y is executed before x when the execution of x is requested. A node is only executed if either the node itself or another node that depends on it is invoked for execution. This execution philosophy is known as lazy loading: node objects are not instantiated and initialized until they are actually required.
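The lazy, dependency-ordered execution described above can be sketched with a tiny graph evaluator in plain Python (not TensorFlow); the graph, node names, and operations here are purely hypothetical illustrations:

```python
# A tiny dependency-ordered evaluator (plain Python, for illustration only).
# Each node maps to (function, list of input node names).
graph = {
    "x1": (lambda: 1, []),
    "x2": (lambda: 2, []),
    "x3": (lambda: 3, []),
    "add": (lambda a, b, c: a + b + c, ["x1", "x2", "x3"]),
    "unused": (lambda: 1 / 0, []),  # never requested, so never executed (lazy)
}

def run(node, cache=None):
    """Evaluate `node`, recursively evaluating its dependencies first."""
    if cache is None:
        cache = {}
    if node not in cache:
        fn, deps = graph[node]
        cache[node] = fn(*(run(d, cache) for d in deps))
    return cache[node]

print(run("add"))  # -> 6; the "unused" node is never touched
```

Requesting "add" pulls in only its dependencies, mirroring how TensorFlow executes just the subgraph needed for the fetched tensor.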
Often, it is necessary to control the order of execution of the nodes in a computation graph. This can be done with the tf.Graph.control_dependencies() function. For example, if the graph has the nodes l, m, n, and o, and we want to execute n and o before l and m, then we would use the following code:
with graph_variable.control_dependencies([n, o]):
    # other statements here
This makes sure that any node in the preceding with block is executed after nodes n and o have been executed.
A graph can be partitioned into several parts, and each part can be placed and executed on different devices, such as a CPU or GPU. All of the devices that are available for graph execution can be listed with the following command:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
The output is listed as follows (the output for your machine will be different because this will depend on the available compute devices in your specific system):
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 12900903776306102093
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 611319808
locality {
  bus_id: 1
}
incarnation: 2202031001192109390
physical_device_desc: "device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
The devices in TensorFlow are identified with a string of the form /device:<device_type>:<device_idx>. In the last output, CPU and GPU denote the device type, and 0 denotes the device index.
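A quick plain-Python sketch of how such a device string decomposes into its parts (the helper function is hypothetical, not part of the TensorFlow API):

```python
def parse_device_string(device):
    """Split a device string such as '/device:GPU:0' into its
    device type and device index."""
    # e.g. '/device:GPU:0' -> ['device', 'GPU', '0']
    _, device_type, device_idx = device.lstrip("/").split(":")
    return device_type, int(device_idx)

print(parse_device_string("/device:GPU:0"))  # -> ('GPU', 0)
print(parse_device_string("/device:CPU:0"))  # -> ('CPU', 0)
```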
One thing to note about the last output is that it shows only one CPU, whereas our computer has 8 CPUs. The reason for this is that TensorFlow implicitly distributes the code across the CPU units and thus, by default, CPU:0 denotes all of the CPUs available to TensorFlow. When TensorFlow starts executing graphs, it runs the independent paths within each graph in a separate thread, with each thread running on a separate CPU. We can restrict the number of threads used for this purpose by setting inter_op_parallelism_threads. Similarly, if, within an independent path, an operation is capable of running on multiple threads, TensorFlow will launch that specific operation on multiple threads. The number of threads in this pool can be changed by setting intra_op_parallelism_threads.
To enable the logging of variable placement, define a config object, set its log_device_placement property to true, and then pass this config object to the session as follows:
tf.reset_default_graph()
# Define model parameters
w = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
# Define model input and output
x = tf.placeholder(tf.float32)
y = w * x + b
config = tf.ConfigProto()
config.log_device_placement = True
with tf.Session(config=config) as tfs:
    # initialize and print the variable y
    tfs.run(tf.global_variables_initializer())
    print('output', tfs.run(y, {x: [1, 2, 3, 4]}))
The output from the console window of the Jupyter Notebook is listed as follows:
b: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
b/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
b/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
w: (VariableV2): /job:localhost/replica:0/task:0/device:GPU:0
w/read: (Identity): /job:localhost/replica:0/task:0/device:GPU:0
mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
add: (Add): /job:localhost/replica:0/task:0/device:GPU:0
w/Assign: (Assign): /job:localhost/replica:0/task:0/device:GPU:0
init: (NoOp): /job:localhost/replica:0/task:0/device:GPU:0
x: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
b/initial_value: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
w/initial_value: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Thus, by default, TensorFlow creates the variable and operation nodes on a device where it can get the highest performance. The variables and operations can be placed on specific devices by using the tf.device() function. Let's place the graph on the CPU:
tf.reset_default_graph()
with tf.device('/device:CPU:0'):
    # Define model parameters
    w = tf.get_variable(name='w', initializer=[.3], dtype=tf.float32)
    b = tf.get_variable(name='b', initializer=[-.3], dtype=tf.float32)
    # Define model input and output
    x = tf.placeholder(name='x', dtype=tf.float32)
    y = w * x + b
config = tf.ConfigProto()
config.log_device_placement = True
with tf.Session(config=config) as tfs:
    # initialize and print the variable y
    tfs.run(tf.global_variables_initializer())
    print('output', tfs.run(y, {x: [1, 2, 3, 4]}))
In the Jupyter console, we can see that the variables have been placed on the CPU and that execution also takes place on the CPU:
b: (VariableV2): /job:localhost/replica:0/task:0/device:CPU:0
b/read: (Identity): /job:localhost/replica:0/task:0/device:CPU:0
b/Assign: (Assign): /job:localhost/replica:0/task:0/device:CPU:0
w: (VariableV2): /job:localhost/replica:0/task:0/device:CPU:0
w/read: (Identity): /job:localhost/replica:0/task:0/device:CPU:0
mul: (Mul): /job:localhost/replica:0/task:0/device:CPU:0
add: (Add): /job:localhost/replica:0/task:0/device:CPU:0
w/Assign: (Assign): /job:localhost/replica:0/task:0/device:CPU:0
init: (NoOp): /job:localhost/replica:0/task:0/device:CPU:0
x: (Placeholder): /job:localhost/replica:0/task:0/device:CPU:0
b/initial_value: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const_1: (Const): /job:localhost/replica:0/task:0/device:CPU:0
w/initial_value: (Const): /job:localhost/replica:0/task:0/device:CPU:0
Const: (Const): /job:localhost/replica:0/task:0/device:CPU:0
TensorFlow applies the following rules for placing variables on devices:
- If the graph was previously run, then the node is left on the device where it was placed earlier
- Else, if the tf.device() block is used, then the node is placed on the specified device
- Else, if a GPU is present, then the node is placed on the first available GPU
- Else, if no GPU is present, then the node is placed on the CPU
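These rules can be sketched as a simple decision cascade in plain Python (the function and its arguments are hypothetical, purely to illustrate the order in which the rules apply):

```python
def place_node(previous_device=None, device_block=None, gpus=()):
    """Illustrative sketch of TensorFlow's device-placement rules,
    checked in order of precedence."""
    if previous_device is not None:    # the graph was previously run
        return previous_device
    if device_block is not None:       # inside a tf.device() block
        return device_block
    if gpus:                           # a GPU is present
        return gpus[0]                 # first available GPU
    return '/device:CPU:0'             # fall back to the CPU

print(place_node(gpus=('/device:GPU:0', '/device:GPU:1')))  # -> /device:GPU:0
print(place_node())                                          # -> /device:CPU:0
```

Note that an explicit tf.device() block overrides GPU availability, but a placement from a previous run overrides everything.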
The tf.device() function can also be given a function name in place of a device string. In that case, the function has to return the device string. Providing a device string through a custom function in this way allows complex algorithms to be used for placing the variables on different devices. For example, TensorFlow provides a round-robin device setter function in tf.train.replica_device_setter().
If a TensorFlow operation is placed on the GPU, then the execution engine must have the GPU implementation of that operation, known as the kernel. If the kernel is not present, then the placement results in a runtime error. Also, if the requested GPU device does not exist, then a runtime error is raised. The best way to handle such errors is to allow the operation to be placed on the CPU if requesting the GPU device results in an error. This can be achieved by setting the following config value:
config.allow_soft_placement = True
By default, a TensorFlow session grabs all of the GPU memory at startup, even if the operations and variables are placed on only one GPU in a multi-GPU system. If another session starts execution at the same time, it will receive an out-of-memory error. This can be solved in multiple ways:
- For multi-GPU systems, set the environment variable CUDA_VISIBLE_DEVICES=<list of device idx>:
os.environ['CUDA_VISIBLE_DEVICES']='0'
The code that's executed after this setting will be able to grab all of the memory of the visible GPU.
- To let the session grab only a part of the memory of the GPU, use the config option per_process_gpu_memory_fraction to allocate a fraction of the memory:
config.gpu_options.per_process_gpu_memory_fraction = 0.5
This will allocate 50% of the memory in all of the GPU devices.
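As a quick sanity check in plain Python, using the GPU memory_limit of 611319808 bytes from the earlier device listing as an example value:

```python
# Roughly how many bytes a 0.5 fraction reserves on the example GPU.
memory_limit = 611319808   # bytes, from the device listing shown earlier
fraction = 0.5             # per_process_gpu_memory_fraction
print(int(memory_limit * fraction))  # -> 305659904
```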
- By combining both of the preceding strategies, you can make only some of the GPUs visible to the process, and only a certain fraction of their memory available to it.
- Limit the TensorFlow process to grabbing only the minimum required memory at the start of the process, and allow this memory to grow as the process executes further, by setting the following config option:
config.gpu_options.allow_growth = True
This option only allows for the allocated memory to grow, so the memory is never released back.
We can create our own graphs, which are separate from the default graph, and execute them in a session. However, creating and executing multiple graphs is not recommended, because of the following disadvantages:
- Creating and using multiple graphs in the same program would require multiple TensorFlow sessions, and each session would consume heavy resources
- Data cannot be directly passed in-between graphs
Hence, the recommended approach is to have multiple subgraphs in a single graph. If we wish to use our own graph instead of the default graph, we can do so with the tf.Graph() command. In the following example, we create our own graph, g, and execute it as the default graph:
g = tf.Graph()
output = 0

# Assume Linear Model y = w * x + b
with g.as_default():
    # Define model parameters
    w = tf.Variable([.3], tf.float32)
    b = tf.Variable([-.3], tf.float32)
    # Define model input and output
    x = tf.placeholder(tf.float32)
    y = w * x + b
    # the initializer op must be created inside graph g
    init = tf.global_variables_initializer()

with tf.Session(graph=g) as tfs:
    # initialize and print the variable y
    tfs.run(init)
    output = tfs.run(y, {x: [1, 2, 3, 4]})
    print('output : ', output)
Now, let's put this learning into practice and implement the classification of handwritten digit images with TensorFlow.