We used the raw TensorFlow API for all the implementations in this book, for better transparency into the actual functionality of the models and for a better learning experience. However, TensorFlow has various libraries that hide the fine-grained details of these implementations. This allows users to implement sequence-to-sequence models, such as the Neural Machine Translation (NMT) model we saw in Chapter 10, Sequence-to-Sequence Learning – Neural Machine Translation, with fewer lines of code and without worrying about the specific technical details of how they work. Knowledge of these libraries is important, as they provide a much cleaner way of using these models in production code and of researching beyond the existing methods. Therefore, we will go through a quick introduction to using the TensorFlow seq2seq library. This code is available as an exercise in the seq2seq_nmt.ipynb file.
We will first define the encoder inputs, decoder inputs, and decoder output placeholders:
enc_train_inputs = []
dec_train_inputs, dec_train_labels = [], []

for ui in range(source_sequence_length):
    enc_train_inputs.append(
        tf.placeholder(tf.int32, shape=[batch_size],
                       name='train_inputs_%d' % ui))

for ui in range(target_sequence_length):
    dec_train_inputs.append(
        tf.placeholder(tf.int32, shape=[batch_size],
                       name='train_inputs_%d' % ui))
    dec_train_labels.append(
        tf.placeholder(tf.int32, shape=[batch_size],
                       name='train_outputs_%d' % ui))
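To see what these per-time-step placeholders expect at feed time, here is a small NumPy sketch (with made-up toy sizes, not the book's actual configuration) that slices a batch-major matrix of token IDs into one [batch_size] array per time step:

```python
import numpy as np

# Hypothetical toy dimensions, for illustration only
batch_size, source_sequence_length = 4, 6

# A batch of token-ID sequences, shape [batch_size, source_sequence_length]
batch = np.arange(batch_size * source_sequence_length).reshape(
    batch_size, source_sequence_length)

# Each placeholder above holds one time step for the whole batch, so the
# values fed to it are the *columns* of the batch matrix
per_step_inputs = [batch[:, ui] for ui in range(source_sequence_length)]

print(len(per_step_inputs))      # one array per time step: 6
print(per_step_inputs[0].shape)  # (4,), i.e. [batch_size]
```

Feeding data at training time then amounts to binding each of these per-step arrays to the corresponding placeholder in the feed dictionary.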
Next, we will define the embedding lookup function for all the encoder and decoder inputs, to obtain the word embeddings:
encoder_emb_inp = [tf.nn.embedding_lookup(encoder_emb_layer, src)
                   for src in enc_train_inputs]
encoder_emb_inp = tf.stack(encoder_emb_inp)

decoder_emb_inp = [tf.nn.embedding_lookup(decoder_emb_layer, src)
                   for src in dec_train_inputs]
decoder_emb_inp = tf.stack(decoder_emb_inp)
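As a rough illustration of what embedding_lookup followed by tf.stack produces, here is a NumPy sketch (toy sizes, random data) showing that the lookup is simple row indexing into the embedding matrix, and that stacking the per-step results yields a time-major [time, batch, embedding] array:

```python
import numpy as np

# Toy sizes; these names are illustrative, not the book's configuration
vocab_size, embedding_size = 10, 8
time_steps, batch_size = 5, 3

# A random embedding matrix standing in for encoder_emb_layer
emb_matrix = np.random.rand(vocab_size, embedding_size)

# One [batch_size] array of token IDs per time step, as in enc_train_inputs
step_ids = [np.random.randint(0, vocab_size, size=batch_size)
            for _ in range(time_steps)]

# embedding_lookup is row indexing; stacking gives the time-major tensor
looked_up = [emb_matrix[ids] for ids in step_ids]
stacked = np.stack(looked_up)
print(stacked.shape)  # (5, 3, 8), i.e. [time, batch, embedding]
```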
The encoder is made with an LSTM cell as its basic building block. Then, we will define tf.nn.dynamic_rnn, which takes the defined LSTM cell as its input and whose state is initialized with zeros. We will set the time_major parameter to True because our data has the time axis as the first axis (that is, axis 0). In other words, our data has the shape [sequence_length, batch_size, embeddings_size], where the time-dependent sequence_length dimension is the first axis. The benefit of dynamic_rnn is its ability to handle dynamically sized inputs. You can use the optional sequence_length argument to specify the length of each sentence in the batch. For example, consider a batch of shape [3, 30] with three sentences of lengths [10, 20, 30] (note that we pad the short sentences up to length 30 with a special token). Passing a tensor with the values [10, 20, 30] as sequence_length will zero out the LSTM outputs computed beyond the length of each sentence. The cell state, however, is not zeroed out; instead, the last cell state computed within the length of each sentence is copied beyond that length, until time step 30 is reached:
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
initial_state = encoder_cell.zero_state(batch_size, dtype=tf.float32)

encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp, initial_state=initial_state,
    sequence_length=[source_sequence_length for _ in range(batch_size)],
    time_major=True, swap_memory=True)
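The output-masking behavior described above can be mimicked in plain NumPy. This is only a sketch of the effect sequence_length has on the outputs of dynamic_rnn, with toy shapes and random data standing in for real LSTM outputs:

```python
import numpy as np

# Toy time-major RNN outputs: [max_time, batch, units]
max_time, batch, units = 30, 3, 2
outputs = np.random.rand(max_time, batch, units)
seq_len = np.array([10, 20, 30])  # per-sentence lengths, as in the text

# dynamic_rnn zeroes outputs past each sentence's length; the same
# effect can be expressed with a boolean mask over the time axis
time_idx = np.arange(max_time)[:, None]   # [max_time, 1]
mask = time_idx < seq_len[None, :]        # [max_time, batch]
masked = outputs * mask[:, :, None]

# At t=15, the length-10 sentence is zeroed; the length-30 one is untouched
print(masked[15, 0])                               # [0. 0.]
print(np.allclose(masked[15, 2], outputs[15, 2]))  # True
```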
The swap_memory option allows TensorFlow to swap the tensors produced during the inference process between the GPU and CPU, in case the model is too complex to fit entirely in GPU memory.
The decoder is defined similarly to the encoder, but has an extra layer called projection_layer, which represents the softmax output layer used to sample the predictions made by the decoder. We will also define a TrainingHelper that properly feeds the decoder inputs to the decoder. In this example, we define two types of decoder: a plain BasicDecoder, and a BasicDecoder whose cell is wrapped with the BahdanauAttention mechanism. (The attention mechanism is discussed in Chapter 10, Sequence-to-Sequence Learning – Neural Machine Translation.) Many other decoders and attention mechanisms exist in the library, such as BeamSearchDecoder and BahdanauMonotonicAttention:
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
projection_layer = Dense(units=vocab_size, use_bias=True)

helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_emb_inp,
    [target_sequence_length for _ in range(batch_size)],
    time_major=True)

if decoder_type == 'basic':
    decoder = tf.contrib.seq2seq.BasicDecoder(
        decoder_cell, helper, encoder_state,
        output_layer=projection_layer)
elif decoder_type == 'attention':
    # BahdanauAttention is an attention mechanism, not a decoder, so we
    # wrap the cell with an AttentionWrapper and decode with BasicDecoder.
    # The attention memory must be batch-major, hence the transpose.
    attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
        num_units, tf.transpose(encoder_outputs, [1, 0, 2]))
    attention_cell = tf.contrib.seq2seq.AttentionWrapper(
        decoder_cell, attention_mechanism)
    attention_initial_state = attention_cell.zero_state(
        batch_size, tf.float32).clone(cell_state=encoder_state)
    decoder = tf.contrib.seq2seq.BasicDecoder(
        attention_cell, helper, attention_initial_state,
        output_layer=projection_layer)
We will use dynamic decoding to get the outputs of the decoder:
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, output_time_major=True, swap_memory=True)
Next, we will define the logits, cross-entropy loss, and train prediction operations:
logits = outputs.rnn_output
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=dec_train_labels, logits=logits)
loss = tf.reduce_mean(crossent)

train_prediction = outputs.sample_id
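For intuition, the sparse softmax cross-entropy computed above can be reproduced in NumPy. This sketch uses random toy logits and integer labels, with shapes chosen for illustration only (mirroring the time-major [time, batch, vocab] logits and [time, batch] labels):

```python
import numpy as np

# Toy shapes standing in for outputs.rnn_output and dec_train_labels
time_steps, batch, vocab = 4, 2, 5
rng = np.random.default_rng(0)
logits = rng.normal(size=(time_steps, batch, vocab))
labels = rng.integers(0, vocab, size=(time_steps, batch))

# Sparse softmax cross-entropy: -log softmax(logits)[label] per position.
# Subtracting the max first keeps the exponentials numerically stable.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
crossent = -np.take_along_axis(
    log_probs, labels[..., None], axis=-1).squeeze(-1)

loss = crossent.mean()  # tf.reduce_mean over all time and batch positions
print(crossent.shape)   # (4, 2)
```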
Then, we will define two optimizers: we use AdamOptimizer for the first 10,000 steps, and the vanilla stochastic GradientDescentOptimizer for the rest of the optimization process. This is because using the Adam optimizer over the long term can give rise to unexpected behavior. Therefore, we use Adam to reach a good initial position for the SGD optimizer, and then use SGD from that point onward:
with tf.variable_scope('Adam'):
    optimizer = tf.train.AdamOptimizer(learning_rate)
with tf.variable_scope('SGD'):
    sgd_optimizer = tf.train.GradientDescentOptimizer(learning_rate)

gradients, v = zip(*optimizer.compute_gradients(loss))
gradients, _ = tf.clip_by_global_norm(gradients, 25.0)
optimize = optimizer.apply_gradients(zip(gradients, v))

sgd_gradients, v = zip(*sgd_optimizer.compute_gradients(loss))
sgd_gradients, _ = tf.clip_by_global_norm(sgd_gradients, 25.0)
sgd_optimize = sgd_optimizer.apply_gradients(zip(sgd_gradients, v))
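To clarify what tf.clip_by_global_norm does in the snippet above, here is a NumPy sketch that mimics its joint rescaling of all gradients (the helper name and the toy gradient values are illustrative, not part of the book's code):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Mimics tf.clip_by_global_norm: rescales all gradients jointly so
    that their combined L2 norm does not exceed clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the global norm is within the limit, gradients pass unchanged;
    # otherwise every gradient is scaled down by the same factor
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Two toy gradient tensors with a large combined norm
grads = [np.full((2, 2), 100.0), np.full((3,), 50.0)]
clipped, norm = clip_by_global_norm(grads, 25.0)

new_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm > 25.0)         # True: the original norm exceeds the threshold
print(round(new_norm, 6))  # 25.0
```

Clipping by the global norm, rather than per tensor, preserves the relative direction of the overall gradient update while bounding its magnitude.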