Do Robots Dream of Electric Pentameter? Generative Poetry with LSTM Networks

Can an artificial brain pen an artificial quatrain? Can neural networks master metonymy? Having a taste for both verse and machine learning, I set out to put together a fun little project to apply the things I’ve been learning about neural networks: a Shakespearean sonnet generator powered by a long short-term memory (LSTM) neural network. For those who are not familiar with LSTM networks, they’re a particular type of recurrent neural network that is uniquely suited to working with sequential data like time series. Regular old feedforward networks process each input independently, with no memory of what came before, so they’re pretty bad at learning connections between past data and present data. LSTMs are instead designed to retain information across long stretches of a sequence, and to make intelligent decisions about what to forget and what to keep. This makes them much better than standard neural networks at learning connections over time.

LSTM networks are a useful technique to have under your belt, as they’ve been used to achieve state-of-the-art results in many different fields. But don’t let the data-sci-speak scare you off: with modern machine learning libraries like Keras and TensorFlow, anyone with a basic understanding of Python and linear algebra can get up and running with some basic neural networks in a day or two. If you’re not convinced, read on: hopefully this little toy text-generation project will show you that neural networks are for everybody, and that you don’t need to be a comp sci Ph.D. to start finding useful applications for cutting-edge machine learning in your own data analysis, programming, or new media art.

Introductory matters: imports and corpus

First, let’s get some imports out of the way:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras.callbacks import ModelCheckpoint

Alright, now let’s choose & load our text corpus. You’ll want a corpus of sizable length; a novel would be good. I chose Shakespeare’s sonnets, out of interest in seeing how a simple neural net would handle poetic text. If you don’t have a corpus in mind already, I suggest checking out Project Gutenberg. In addition to loading the text, we’ll need a way to transform its string data into a numerical form that the neural network can understand. We’ll start this process by assigning an integer encoding to each character contained within the corpus. We’ll get rid of all duplicate characters by casting the corpus to a set, then give it a canonical ordering by casting it to a list and sorting it, then build the dictionary of integer encodings with enumerate and a comprehension. Given my highly structured source material, I felt it would be far more interesting to leave all the punctuation and line breaks intact, to see whether the network can learn some of the high-level features of the sonnet form. We’ll also build a decoding dictionary to ease converting our network’s output back into text.

It’s probably a good idea to print some output after completion of each step, so we can check whether our numbers look reasonable.

with open("sonnets.txt") as corpus_file:
    corpus = corpus_file.read()
print("Loaded a corpus of {0} characters.".format(len(corpus)))

# Get a unique identifier for each char in the corpus, then make some dicts to ease encoding and decoding
chars = sorted(list(set(corpus)))
num_chars = len(chars)
encoding = {c: i for i, c in enumerate(chars)}
decoding = {i: c for i, c in enumerate(chars)}
print("Our corpus contains {0} unique characters.".format(num_chars))

My result:

Loaded a corpus of 94654 characters.
Our corpus contains 64 unique characters.

Encoding & vectorizing our data

OK, now that we have dictionaries with which to encode our text, we’ll need to chop it up into examples our neural network can work with, along with corresponding labels to test against. We’re going to build our network to accept a sequence of characters as input (let’s call it a ‘sentence’ for simplicity’s sake), and to predict the character that occurs immediately after that sentence. We’ll need all our sentences to be of equal length, and preferably long enough to give the neural network plenty of data to work with; I went with sentences 50 characters in length. Once we decide on our length, slices and a for-loop make it easy to chop up our corpus into the appropriate chunks & encode the characters into integers.

# it slices, it dices, it makes julienned datasets!
# chop up our data into X and y, slicing into roughly (len(corpus) / skip) overlapping 'sentences'
# of length sentence_length, and encode the chars
sentence_length = 50
skip = 1
X_data = []
y_data = []
for i in range(0, len(corpus) - sentence_length, skip):
    sentence = corpus[i:i + sentence_length]
    next_char = corpus[i + sentence_length]
    X_data.append([encoding[char] for char in sentence])
    y_data.append(encoding[next_char])

num_sentences = len(X_data)
print("Sliced our corpus into {0} sentences of length {1}".format(num_sentences, sentence_length))

You should see something like this:

Sliced our corpus into 94604 sentences of length 50

Suddenly, it’s linear algebra time! We’ve encoded our corpus as lists of integers, but Keras speaks the language of matrices. We’ll also want to convert our integer identifiers to one-hot encoded vectors of dimension num_chars. This makes sense because the integers we are using are just arbitrary nominal identifiers, not ordinal quantities (for example, there’s no sense in which ‘N’ is greater than ‘,’, or in which ‘o’ is closer to ‘t’ than ‘7’). We could use fancy tools like sklearn’s OneHotEncoder or Keras’ np_utils.to_categorical, but for this easy case, I found it instructive to just use a nested for-loop.

Think about it like this: our data structure contains n sentences, each containing m characters, each of which should become a num_chars-dimensional one-hot encoded vector. That is, a num_sentences x sentence_length x num_chars array. Let’s initialize a zero array with those dimensions. Now, if we iterate over the sentences in our data structure, and within that loop, iterate over the characters in each sentence, then the index of our outer loop corresponds to our position on the 0-dimension, the index of the inner loop corresponds to our position on the 1-dimension, and the integer that encodes the character corresponds to our position on the 2-dimension. We set that element to 1, and we have our one-hot encoding!

For our labels, we have one label for each sentence, each of which should become a num_chars-dimensional one-hot encoded vector. This will be a num_sentences x num_chars array, so we’ll initialize another zero array. Since we are already looping over each sentence in the outer loop of the previous step, we can just use that same loop: its index will correspond to the 0-dimension, and the integer character encoding will correspond to the 1-dimension.
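
For what it’s worth, the np_utils shortcut mentioned above handles the labels in one line. This is just a sketch, assuming the same y_data list we built earlier; the resulting array has a float dtype rather than bool, which Keras is equally happy with. The nested loop below still builds X.

# A sketch of the shortcut route for the labels (assumes y_data from above);
# to_categorical turns each integer encoding into a one-hot row
from keras.utils import np_utils
y = np_utils.to_categorical(y_data, num_chars)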

We should make sure to sanity check our matrix dimensions before going any further! It will save you a lot of unnecessary headaches.

# Vectorize our data and labels. We want everything in one-hot
# because smart data encoding cultivates phronesis and virtue.
print("Vectorizing X and y...")
X = np.zeros((num_sentences, sentence_length, num_chars), dtype=np.bool)
y = np.zeros((num_sentences, num_chars), dtype=np.bool)
for i, sentence in enumerate(X_data):
    for n, encoded_char in enumerate(sentence):
        X[i, n, encoded_char] = 1
    y[i, y_data[i]] = 1

# Double check our vectorized data before we sink hours into fitting a model
print("Sanity check y. Dimension: {0} # Sentences: {1} Characters in corpus: {2}".format(y.shape, num_sentences, len(chars)))
print("Sanity check X. Dimension: {0} Sentence length: {1}".format(X.shape, sentence_length))
My output:

Vectorizing X and y...
Sanity check y. Dimension: (94604, 64) # Sentences: 94604 Characters in corpus: 64
Sanity check X. Dimension: (94604, 50, 64) Sentence length: 50

Building our network

It’s finally time to build our neural network! You might expect this to be complex, and normally it would be. But with Keras, it’s remarkably simple. We can simply initialize a model, add layers to it until we are satisfied with its architecture, and then compile it. We will initialize a Sequential model, then add a decent-sized LSTM layer for our inputs (256 units worked quite well for me). We’ll need to specify the dimensions of an input example. Since each sentence is an input, our dimensions will be sentence_length x num_chars. We will choose a Dense layer for our output. Each label is a single character, so our dimension here will just be num_chars. Because our network is basically solving a classification problem with num_chars classes, we’ll choose softmax activation for our output layer, and a cross-entropy loss function. Now we just need an optimizer! I chose to take the Adam optimizer for a spin.

# Define our model
print("Let's build a brain!")
model = Sequential()
model.add(LSTM(256, input_shape=(sentence_length, num_chars)))
model.add(Dense(num_chars))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
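
That’s the whole network. If you’d like to double-check what you just assembled before committing hours to training, Keras will print a layer-by-layer overview:

# Print a layer-by-layer summary of the architecture and parameter counts
model.summary()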

Saving our model

We’re almost ready to train our model! But first, we’ll need a way to save our model and results. To save our model’s architecture, we can use the handy to_yaml() method, like so:

# Dump our model architecture to a file so we can load it elsewhere
architecture = model.to_yaml()
with open('model.yaml', 'w') as model_file:
    model_file.write(architecture)
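
(A small aside: if your Keras version happens to lack YAML support, later releases dropped it, the JSON equivalents behave identically; you’d then load with model_from_json() rather than model_from_yaml() in the generation script. A sketch:)

# Alternative, assuming a Keras version without YAML serialization: dump to JSON instead
architecture = model.to_json()
with open('model.json', 'w') as model_file:
    model_file.write(architecture)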

Saving our weights is pretty easy as well. We’ll do this with a ModelCheckpoint that will be passed into our model’s training function as a callback and executed after each epoch. We’ll set it to monitor our loss function, and save only the epochs that improve our loss.

# Set up checkpoints
file_path="weights-{epoch:02d}-{loss:.3f}.hdf5"
dump_weights = ModelCheckpoint(file_path, monitor="loss", verbose=1, save_best_only=True, mode="min")
callbacks = [dump_weights]

It’s Go Time!

Now we’ll train our model. We just need to specify the number of epochs and the batch size, and to remember to pass in our callbacks:

# Action time! [Insert guitar solo here]
model.fit(X, y, nb_epoch=30, batch_size=128, callbacks=callbacks)
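
(One compatibility note: nb_epoch is the Keras 1 spelling of this argument. Under Keras 2 and later, the same call looks like this:)

# Keras 2+ renamed nb_epoch to epochs; everything else is unchanged
model.fit(X, y, epochs=30, batch_size=128, callbacks=callbacks)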

If all went well, you should find yourself staring at something like this (and stare you indeed shall, as it will take a while to train):

  128/94604 [..............................] - ETA: 1778s - loss: 4.1734
  256/94604 [..............................] - ETA: 1265s - loss: 4.1566
  [...]
  94464/94604 [============================>.] - ETA: 1s - loss: 2.6441
  94592/94604 [============================>.] - ETA: 0s - loss: 2.6437
Epoch 00000: loss improved from inf to 2.64364, saving model to weights-00-2.644.hdf5

Be patient, especially if you’re not running these computations with CUDA. I don’t have an appropriate GPU, so it took my poor old ThinkPad 6-8 hours each time I wanted to train my model. So, make some tea and ponder the mysteries of the human experience, I guess. Once you’ve achieved a Zen-like tranquility and your model has finished training, I recommend saving the file with the best weights under an easily-accessible name, like weights.hdf5.
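
If you’d rather do that renaming from Python than from your file manager, a couple of lines will do; the source filename below is hypothetical, so substitute whichever of your checkpoints ended up with the lowest loss:

# Copy the best checkpoint to a stable name (the source filename here is hypothetical)
import shutil
shutil.copy("weights-29-1.453.hdf5", "weights.hdf5")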

Generating text

OK, it’s time to generate text. Let’s make a new file for this, generate.py. We’ll do some imports, load our corpus, define some useful variables, and make our encoding dictionaries like before:

import numpy as np
from keras.models import model_from_yaml
from random import randint

with open("sonnets.txt") as corpus_file:
    corpus = corpus_file.read()
print("Loaded a corpus of {0} characters".format(len(corpus)))

# Get a unique identifier for each char in the corpus, then make some dicts to ease encoding and decoding
chars = sorted(list(set(corpus)))
encoding = {c: i for i, c in enumerate(chars)}
decoding = {i: c for i, c in enumerate(chars)}

# Some variables we'll need later
num_chars = len(chars)
sentence_length = 50
corpus_length = len(corpus)

…and we’ll load up our model and our best weights file.

with open("model.yaml") as model_file:
    architecture = model_file.read()

model = model_from_yaml(architecture)
model.load_weights("weights.hdf5")
model.compile(loss='categorical_crossentropy', optimizer='adam')

Our high-level strategy is this: we’ll feed a seed phrase into the model, predict the next character and store it, and make a new seed by appending the predicted character to the seed and chopping off the first character. We will then feed the new seed back into the model, and repeat this process until we’ve generated a decently long chunk of text. So let’s start with the seed. We’ll pick a random sentence from the corpus, and then one-hot encode it, just like before:

# Pick a random starting point in the corpus and grab a seed sentence
seed = randint(0, corpus_length - sentence_length)
seed_phrase = corpus[seed:seed + sentence_length]

# One-hot encode the seed, just as we did with the training data
X = np.zeros((1, sentence_length, num_chars), dtype=np.bool)
for i, character in enumerate(seed_phrase):
    X[0, i, encoding[character]] = 1

Now we just need to feed it into the model and iterate as described above. To get our prediction, we’ll feed the seed into model.predict() and take the argmax; popping that integer into the decoding dictionary gives us the predicted character. Then, we simply construct an array with the appropriate number of dimensions, use it to one-hot encode the predicted character, and append it to our seed, dropping the first character. We can do this with a slice and an np.concatenate along the 1-dimension.

generated_text = ""
for i in range(500):
    prediction = np.argmax(model.predict(X, verbose=0))

    generated_text += decoding[prediction]

    activations = np.zeros((1, 1, num_chars), dtype=np.bool)
    activations[0, 0, prediction] = 1
    X = np.concatenate((X[:, 1:, :], activations), axis=1)

print(generated_text)

Here’s what I got during one run:

n the reason you,
And beauty's decays with thee, thou art fow,
The priefther that well knows that have a dection.

O! for my soul false play and heart to such,
And in his preature the time that fook to must know
How have thee to the thing they see should love constance,
As I am not thou gay, the confic'd new;
  That leaves all my beauty of thy self doth.

If those thou art to the ton of thee so great,
That thou are stal to such a beauty shade,
And it is the stars and pain as fair a fear
That hav

We see our model forming mostly-correct words! I was also surprised to see the degree to which my model preserved the form of the sonnet, including line breaks, capitalization, and punctuation at appropriate points, and even sometimes attempting to indent the text in the same way that the final couplet of each sonnet was typeset. It’s pretty neat that a simple model was able to learn those parameters with no explicit guidance. As for the syntactic & semantic coherence of the sonnet… ehh, not so much. This model has a very weak understanding of grammar, and effectively none at all of meaning. But, that’s totally expected: those are more complex problems, requiring more advanced models.

Where to go next?

If we wish to continue tinkering with this model to improve its results, we could play with its hyperparameters by doing things like adding a dropout layer, changing the size of our layers, tinkering with the activation, loss, and optimization functions, and so on (a rough sketch of one such variant appears below). I haven’t explored this avenue very much yet, mostly because of how long it takes me to run each training session, but I hope to explore it more in the future. Alternatively, if we wanted a model more focused on syntactically correct sentences or meaningful phrases, instead of doing character-level modeling, perhaps we could try doing some natural language processing with a library like spaCy and training a network on the results.
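
For the curious, a deeper variant along the lines of that first suggestion might look something like the following. This is an untested sketch (two stacked LSTM layers with dropout in between), not something I’ve benchmarked:

# An untested sketch of a deeper variant: stacked LSTMs with dropout between them.
# return_sequences=True makes the first LSTM emit its full output sequence so the
# second LSTM has a sequence to consume.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Activation

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(sentence_length, num_chars)))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(num_chars))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')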

However, what I think is even cooler is that these LSTM methods are applicable to almost any kind of time-series data! So you’re not stuck with text processing: using the same techniques you just learned, you could work with audio files, economic data, weather, brain waves, or the motion of a physical system. If it’s time-series, an LSTM network can probably do something interesting with it. Your only limit is your own curiosity… and the quality of your GPU.

Personally, I chose to use Flask to build my network into a little web app, so that others might find some amusement in it. The result: a sonnet-spewing interactive robo-Shakespeare. The code for this is up on my GitHub, as always.
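
If you want to try something similar, the core of such an app is tiny. Here’s a minimal sketch, assuming you’ve wrapped the generation loop above in a generate_text() function inside generate.py; the route and function names are hypothetical, not the ones in my repo:

# A minimal Flask wrapper (hypothetical names; generate_text() is assumed to
# wrap the seed-and-predict loop from generate.py)
from flask import Flask
from generate import generate_text

app = Flask(__name__)

@app.route("/sonnet")
def sonnet():
    # Return a freshly generated chunk of text on each request
    return generate_text()

if __name__ == "__main__":
    app.run()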

Hopefully, this has been a helpful intro to what LSTM networks can do and how we can implement them in Keras. If your curiosity is piqued, I implore you to peruse the Keras documentation and try building some networks. With Keras, it’s endlessly rewarding, and much simpler than it sounds. Happy coding!