Bigram Based LSTM with Regularization

In the previous LSTM tutorial, we fed the network a single character at a time. The bigram approach instead predicts the next character from two characters at a time. This tutorial also introduces dropout, a regularization technique, in the context of RNNs.

From the TensorFlow perspective, the code remains almost the same; the small differences are explained below.

For bigrams, we introduce an embedding matrix, as we did in word2vec. The number of possible bigrams is vocabulary_size * vocabulary_size, so feeding them to the network as one-hot vectors would waste computation: each input would be a long vector that is almost entirely zeros. An embedding lookup selects the corresponding row directly.
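To make the size argument concrete, here is a minimal pure-Python sketch. It assumes a vocabulary of 27 characters (a to z plus space; the post does not state vocabulary_size) and a tiny embedding size of 4, and the helper names bigram_id and one_hot_times_matrix are hypothetical. It illustrates that multiplying a one-hot bigram vector by a weight matrix only selects one row, which is what an embedding lookup does without the wasted multiplications.

```python
import random

vocabulary_size = 27   # assumption: 'a'-'z' plus space
embedding_size = 4     # kept tiny for the illustration; the tutorial uses 128

num_bigrams = vocabulary_size * vocabulary_size  # number of rows in the table

def bigram_id(first_id, second_id):
    """Map a pair of character ids to one bigram id (same formula as the tutorial)."""
    return first_id + vocabulary_size * second_id

random.seed(0)
embeddings = [[random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
              for _ in range(num_bigrams)]

def one_hot_times_matrix(index, matrix):
    """Dense alternative: build a one-hot row vector and multiply it by the matrix."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(matrix))]
    return [sum(one_hot[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(len(matrix[0]))]

idx = bigram_id(1, 2)
looked_up = embeddings[idx]                         # one row read
multiplied = one_hot_times_matrix(idx, embeddings)  # num_bigrams * embedding_size multiplies
```

Both paths give the same vector, but the lookup reads a single row, while the dense product touches every entry of the table.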

The following code is a modified version of the previous code. The input dimension of each gate's input weight matrix (ix, fx, cx, ox) is now the embedding size.
embedding_size = 128 # Dimension of the embedding vector.
num_nodes = 64
graph = tf.Graph()
with graph.as_default(): 
  # Parameters:
  vocabulary_embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size * vocabulary_size, embedding_size], -1.0, 1.0))
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias.                             
  cx = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([embedding_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))

Since each input is now a pair of characters, the labels have to be shifted by two time steps instead of one.
  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_chars = train_data[:num_unrollings]
  train_inputs = zip(train_chars[:-1], train_chars[1:])
  train_labels = train_data[2:]  # Labels are the inputs shifted by two time steps.
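The slicing above can be checked on a plain string. The following is a small sketch, not the tutorial's code: the helper name bigram_windows is hypothetical, and single characters stand in for the placeholder tensors.

```python
def bigram_windows(sequence):
    """Return (input_pairs, labels) using the same slicing as the tutorial.

    `sequence` plays the role of train_data and holds num_unrollings + 1 items.
    """
    num_unrollings = len(sequence) - 1
    chars = sequence[:num_unrollings]
    input_pairs = list(zip(chars[:-1], chars[1:]))  # consecutive character pairs
    labels = sequence[2:]                           # the character after each pair
    return input_pairs, labels

pairs, labels = bigram_windows(list("abcdef"))  # num_unrollings = 5
for (first, second), target in zip(pairs, labels):
    print(first + second, "->", target)
```

Note that the number of pairs and labels is num_unrollings - 1, so with num_unrollings = 10 there are nine inputs and nine labels per batch, even though ten or more placeholders are fed.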

The embedding requires a lookup: we call tf.nn.embedding_lookup with vocabulary_embeddings and bigram_index, where bigram_index is computed from the two one-hot inputs using the argmax function. Dropout is applied only to the input and the output of the LSTM cell, not to the recurrent connections; the reasoning is explained well in this paper.
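  for i in train_inputs:
    # Combine the two one-hot characters into a single bigram id.
    bigram_index = tf.argmax(i[0], dimension=1) + vocabulary_size * tf.argmax(i[1], dimension=1)
    i_embed = tf.nn.embedding_lookup(vocabulary_embeddings, bigram_index)
    # Dropout on the input; 0.7 is keep_prob in the TF 1.x signature.
    drop_i = tf.nn.dropout(i_embed, 0.7)
    # The recurrent output and state are passed on without dropout.
    output, state = lstm_cell(drop_i, output, state)
    # Dropout on the output.
    drop_o = tf.nn.dropout(output, 0.7)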
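For readers unfamiliar with what the dropout call does to a tensor, here is a rough pure-Python sketch of inverted dropout. It assumes the TF 1.x behaviour, where the second argument is the keep probability and surviving values are scaled by 1 / keep_prob; the function name dropout and the rng parameter are illustrative only.

```python
import random

def dropout(values, keep_prob, rng):
    """Inverted dropout: zero each value with probability 1 - keep_prob and
    scale the survivors by 1 / keep_prob so the expected value is unchanged."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, 0.7, rng)

# Roughly 70% of the units survive, and the mean stays close to the original 1.0.
kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
```

Because of the rescaling, nothing needs to change at evaluation time: the validation and sampling graphs can simply skip the dropout step.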

A similar modification is made in the validation part; you will see it in the full code, available here.

A few lines in the training part need to be replaced. Since we now have two characters at a time, we need to feed two inputs, so we introduce a for loop and the following two statements:
      sentence = characters(feed[0])[0] + characters(feed[1])[0]
      predictions = sample_prediction.eval({
          sample_input[0]: b[0],
          sample_input[1]: b[1]})

The bigram-based model has a higher perplexity than the previous model, which worked character by character. Even with dropout, the performance remains the same.
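As a reminder of what the perplexity figure measures, here is a minimal sketch. It assumes the usual definition, the exponential of the average negative log-probability assigned to the true characters; the post's own evaluation code is not shown here, so the function below is illustrative rather than the author's implementation.

```python
import math

def perplexity(predicted_probs, true_ids):
    """exp of the mean negative log-probability given to the true characters."""
    eps = 1e-10  # guards against log(0)
    total = 0.0
    for probs, true_id in zip(predicted_probs, true_ids):
        total += -math.log(max(probs[true_id], eps))
    return math.exp(total / len(true_ids))

targets = [0, 0, 0]
uniform = [[0.25] * 4 for _ in range(3)]                   # model has no idea
confident = [[0.97, 0.01, 0.01, 0.01] for _ in range(3)]   # model is nearly sure

uniform_ppl = perplexity(uniform, targets)      # about 4, the number of choices
confident_ppl = perplexity(confident, targets)  # close to 1
```

Lower is better: a perplexity near the vocabulary size means the model is guessing, while a value near 1 means it is almost certain of the right character.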

Dropout does not always improve a model's performance. As we already saw with the earlier neural network, dropout can also reduce accuracy, depending on the setting in which it is used. Dropout usually pays off late in training, and the current 10,000 iterations are far too few for it to help.

I recommend trying it out and sharing your results with the community, especially if dropout helps.


  1. I have a problem with the embedding lookup on vocabulary_embeddings:
    Let num_unrollings = 10. With the time shift, the resulting input has only nine batches, so when running the session the size of feed_dict becomes (batch_size*10, embedding_size) while trained_prediction becomes (batch_size*9, embedding_size).
    Note: I am trying to solve Assignment 6 of the Udacity deep learning course.

  2. Hi Mohbat, sorry for the delay in replying; I didn't receive any email notification.
    I didn't understand your problem. You can see the full code at the GitHub repo; if you copy and paste it, you should be able to identify the problem yourself. Otherwise, send me the code :)