Recurrent neural networks can also be used as generative models.
This means that in addition to being used as predictive models (making predictions), they can learn the sequences of a problem and then generate entirely new, plausible sequences for the problem domain.
Generative models like this are useful not only to study how well a model has learned a problem, but to learn more about the problem domain itself.
In this post you will discover how to create a generative model for text, character-by-character, using LSTM recurrent neural networks in Python with Keras.
After reading this post you will know:
Where to download a free corpus of text that you can use to train text generative models.
How to frame the problem of text sequences to a recurrent neural network generative model.
How to develop an LSTM to generate plausible text sequences for a given problem.

Let’s get started.
Note: LSTM recurrent neural networks can be slow to train and it is highly recommended that you train them on GPU hardware. You can access GPU hardware in the cloud very cheaply using Amazon Web Services; see the tutorial here.

Text Generation With LSTM Recurrent Neural Networks in Python with Keras
Photo by Russ Sanderlin, some rights reserved.
Problem Description: Project Gutenberg

Many of the classical texts are no longer protected under copyright.
This means that you can download all of the text for these books for free and use them in experiments, like creating generative models.

Perhaps the best place to get access to free books that are no longer protected by copyright is Project Gutenberg.
In this tutorial we are going to use a favorite book from childhood as the dataset: Alice’s Adventures in Wonderland by Lewis Carroll.
We are going to learn the dependencies between characters and the conditional probabilities of characters in sequences so that we can in turn generate wholly new and original sequences of characters.
This is a lot of fun, and I recommend repeating these experiments with other books from Project Gutenberg; here is a list of the most popular books on the site.
These experiments are not limited to text; you can also experiment with other ASCII data, such as computer source code, marked-up documents in LaTeX, HTML or Markdown, and more.
You can download the complete text in ASCII format (Plain Text UTF-8) for this book for free and place it in your working directory with the filename wonderland.txt.
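If you prefer to fetch the file from a script, here is a minimal sketch; the URL is an assumption based on Project Gutenberg ebook #11, so verify it against the book’s page before relying on it.

# download the plain-text book into the working directory
# (the URL is an assumption for ebook #11; check it on gutenberg.org)
import urllib.request

url = "https://www.gutenberg.org/files/11/11-0.txt"
urllib.request.urlretrieve(url, "wonderland.txt")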
Now we need to prepare the dataset for modeling.
Project Gutenberg adds a standard header and footer to each book and this is not part of the original text. Open the file in a text editor and delete the header and footer.
The header is obvious and ends with the text:
*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***

The footer is all of the text after the line of text that says:
THE END

You should be left with a text file that has about 3,330 lines of text.
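If you would rather trim the file in code instead of a text editor, the following is a minimal sketch based on the two marker lines above; the exact marker text can vary between Project Gutenberg books, so treat the strings as assumptions to check against your copy.

# strip the Project Gutenberg header and footer using the marker lines
# (the marker strings are assumptions; confirm them in your downloaded file)
text = open("wonderland.txt").read()
start_marker = "*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***"
end_marker = "THE END"
start = text.find(start_marker)
if start != -1:
    text = text[start + len(start_marker):]
end = text.rfind(end_marker)
if end != -1:
    text = text[:end + len(end_marker)]
open("wonderland.txt", "w").write(text)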
Develop a Small LSTM Recurrent Neural Network

In this section we will develop a simple LSTM network to learn sequences of characters from Alice in Wonderland. In the next section we will use this model to generate new sequences of characters.
Let’s start off by importing the classes and functions we intend to use to train our model.
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Next we need to load the ASCII text for the book into memory and convert all of the characters to lowercase to reduce the vocabulary that the network must learn.
# load ascii text and convert to lowercase
filename = "wonderland.txt"
raw_text = open(filename).read()
raw_text = raw_text.lower()

Now that the book is loaded, we must prepare the data for modeling by the neural network. We cannot model the characters directly; instead we must convert the characters to integers.
We can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

For example, the list of unique sorted lowercase characters in the book is as follows:
['\n', '\r', ' ', '!', '"', "'", '(', ')', '*', ',', '-', '.', ':', ';', '?', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\xbb', '\xbf', '\xef']

You can see that there may be some characters we could remove to further clean up the dataset, which would reduce the vocabulary and may improve the modeling process.
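As a rough illustration (not part of the tutorial’s code), one way to do that cleanup is to keep only a whitelist of characters; the whitelist below is an assumption you would tune for your own dataset.

# optional cleanup: keep only whitelisted characters to shrink the vocabulary
# (the whitelist is an assumption; adjust it for your dataset)
import string
allowed = set(string.ascii_lowercase + " .,;:'\"!?()-\n")
raw_text = "".join(c for c in raw_text if c in allowed)
chars = sorted(list(set(raw_text)))                       # rebuild the character list
char_to_int = dict((c, i) for i, c in enumerate(chars))   # and the mapping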
Now that the book has been loaded and the mapping prepared, we can summarize the dataset.
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Running the code to this point produces the following output.
Total Characters:  147674
Total Vocab:  47

We can see that the book has just under 150,000 characters and that, when converted to lowercase, there are only 47 distinct characters in the vocabulary for the network to learn, many more than the 26 letters in the alphabet.
We now need to define the training data for the network. There is a lot of flexibility in how you choose to break up the text and expose it to the network during training.
In this tutorial we will split the book text up into subsequences with a fixed length of 100 characters, an arbitrary length. We could just as easily split the data up by sentences, pad the shorter sequences, and truncate the longer ones; a rough sketch of that alternative follows.
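This sketch of the sentence-based framing is for reference only and is not used in the tutorial; the naive split on "." and the maxlen of 100 are assumptions for illustration.

# alternative framing (not used here): one sequence per naive "sentence",
# padded or truncated to a fixed length of 100 characters
from keras.preprocessing.sequence import pad_sequences

sentences = [s for s in raw_text.split(".") if len(s) > 1]
encoded = [[char_to_int[c] for c in s] for s in sentences]
padded = pad_sequences(encoded, maxlen=100, padding="pre", truncating="post")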
Each training pattern of the network is composed of 100 time steps of one character (X) followed by one character output (y). When creating these sequences, we slide this window along the whole book one character at a time, allowing each character a chance to be learned from the 100 characters that preceded it (except the first 100 characters, of course).
For example, if the sequence length is 5 (for simplicity) then the first two training patterns would be as follows:
CHAPT -> E
HAPTE -> R

As we split up the book into these sequences, we convert the characters to integers using the lookup table we prepared earlier.
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)