Recurrent neural networks can also be used as generative models.
This means that in addition to being used as predictive models (making predictions), they can learn the sequences of a problem and then generate entirely new, plausible sequences for the problem domain.
Generative models like this are useful not only to study how well a model has learned a problem, but to learn more about the problem domain itself.
In this post you will discover how to create a generative model for text, character-by-character, using LSTM recurrent neural networks in Python with Keras.
After reading this post you will know:
Where to download a free corpus of text that you can use to train text generative models.
How to frame the problem of text sequences to a recurrent neural network generative model.
How to develop an LSTM to generate plausible text sequences for a given problem.

Let’s get started.
Note: LSTM recurrent neural networks can be slow to train and it is highly recommended that you train them on GPU hardware. You can access GPU hardware in the cloud very cheaply using Amazon Web Services; see the tutorial here.

Text Generation With LSTM Recurrent Neural Networks in Python with Keras
Photo by Russ Sanderlin, some rights reserved.
Problem Description: Project Gutenberg

Many of the classical texts are no longer protected under copyright.
This means that you can download all of the text for these books for free and use them in experiments, like creating generative models.

Perhaps the best place to get access to free books that are no longer protected by copyright is Project Gutenberg.
In this tutorial we are going to use a favorite book from childhood as the dataset: Alice’s Adventures in Wonderland by Lewis Carroll.
We are going to learn the dependencies between characters and the conditional probabilities of characters in sequences so that we can in turn generate wholly new and original sequences of characters.
This is a lot of fun, and I recommend repeating these experiments with other books from Project Gutenberg; here is a list of the most popular books on the site.
These experiments are not limited to text; you can also experiment with other ASCII data, such as computer source code, marked-up documents in LaTeX, HTML or Markdown, and more.
You can download the complete text in ASCII format (Plain Text UTF-8) for this book for free and place it in your working directory with the filename wonderland.txt.
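If you prefer to fetch the file from a script, here is a minimal sketch; the URL is an assumption based on Project Gutenberg ebook #11, so verify it against the book’s page before relying on it.

# download the plain-text book into the working directory
# (the URL is an assumption for ebook #11; check it on gutenberg.org)
import urllib.request

url = "https://www.gutenberg.org/files/11/11-0.txt"
urllib.request.urlretrieve(url, "wonderland.txt")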
Now we need to prepare the dataset for modeling.
Project Gutenberg adds a standard header and footer to each book and this is not part of the original text. Open the file in a text editor and delete the header and footer.
The header is obvious and ends with the text:
*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***

The footer is all of the text after the line of text that says:
THE END

You should be left with a text file that has about 3,330 lines of text.
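If you would rather trim the file in code instead of a text editor, the following is a minimal sketch based on the two marker lines above; the exact marker text can vary between Project Gutenberg books, so treat the strings as assumptions to check against your copy.

# strip the Project Gutenberg header and footer using the marker lines
# (the marker strings are assumptions; confirm them in your downloaded file)
text = open("wonderland.txt").read()
start_marker = "*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***"
end_marker = "THE END"
start = text.find(start_marker)
if start != -1:
    text = text[start + len(start_marker):]
end = text.rfind(end_marker)
if end != -1:
    text = text[:end + len(end_marker)]
open("wonderland.txt", "w").write(text)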
Develop a Small LSTM Recurrent Neural Network

In this section we will develop a simple LSTM network to learn sequences of characters from Alice in Wonderland. In the next section we will use this model to generate new sequences of characters.
Let’s start off by importing the classes and functions we intend to use to train our model.
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Next we need to load the ASCII text for the book into memory and convert all of the characters to lowercase to reduce the vocabulary that the network must learn.
# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()

Now that the book is loaded, we must prepare the data for modeling by the neural network. We cannot model the characters directly; instead we must convert the characters to integers.
We can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

For example, the list of unique sorted lowercase characters in the book is as follows:
['\n', '\r', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xbb', '\xbf', '\xef']

You can see that there may be some characters we could remove to further clean up the dataset, which would reduce the vocabulary and may improve the modeling process.
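As a rough illustration (not part of the tutorial’s code), one way to do that cleanup is to keep only a whitelist of characters; the whitelist below is an assumption you would tune for your own dataset.

# optional cleanup: keep only whitelisted characters to shrink the vocabulary
# (the whitelist is an assumption; adjust it for your dataset)
import string
allowed = set(string.ascii_lowercase + " .,;:'\"!?()-\n")
raw_text = "".join(c for c in raw_text if c in allowed)
chars = sorted(list(set(raw_text)))                       # rebuild the character list
char_to_int = dict((c, i) for i, c in enumerate(chars))   # and the mapping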
Now that the book has been loaded and the mapping prepared, we can summarize the dataset.
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Running the code to this point produces the following output.
Total Characters:  147674
Total Vocab:  47

We can see that the book has just under 150,000 characters and that, when converted to lowercase, there are only 47 distinct characters in the vocabulary for the network to learn, many more than the 26 letters in the alphabet.
We now need to define the training data for the network. There is a lot of flexibility in how you choose to break up the text and expose it to the network during training.
In this tutorial we will split the book text up into subsequences with a fixed length of 100 characters, an arbitrary length. We could just as easily split the data up by sentences, pad the shorter sequences, and truncate the longer ones; a rough sketch of that alternative follows.
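This sketch of the sentence-based framing is for reference only and is not used in the tutorial; the naive split on "." and the maxlen of 100 are assumptions for illustration.

# alternative framing (not used here): one sequence per naive "sentence",
# padded or truncated to a fixed length of 100 characters
from keras.preprocessing.sequence import pad_sequences

sentences = [s for s in raw_text.split(".") if len(s) > 1]
encoded = [[char_to_int[c] for c in s] for s in sentences]
padded = pad_sequences(encoded, maxlen=100, padding="pre", truncating="post")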
Each training pattern of the network is composed of 100 time steps of one character (X) followed by one character output (y). When creating these sequences, we slide this window along the whole book one character at a time, allowing each character a chance to be learned from the 100 characters that preceded it (except the first 100 characters, of course).
For example, if the sequence length is 5 (for simplicity) then the first two training patterns would be as follows:
CHAPT -> E
HAPTE -> R

As we split up the book into these sequences, we convert the characters to integers using the lookup table we prepared earlier.
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)