Rochester Institute of Technology Computer Science Master Project

Text generation with Language models

Author: Anil Kumar Behera

Co-Advisors: Dr. Cecilia Ovesdotter Alm

Dr. Christopher Homan

Dr. Emily Prud'hommeaux

Dr. Raymond Ptucha

A Project Submitted in Fulfillment of the Requirements

for the Master Degree of Computer Science

in the Computer Science Department

B. Thomas Golisano College of Computing and Information Sciences

May 26, 2016

Acknowledgments

I wholeheartedly thank the entire faculty of the Computer Science Department for their continuous support and assistance throughout the growth and development of my career at RIT. I specifically thank the following faculty members, who made it possible for me to complete my Master's Degree in Computer Science: Dr. Raymond Ptucha, Dr. Cecilia Ovesdotter Alm, Dr. Christopher Homan, and Dr. Emily Prud'hommeaux. I would also like to thank Mr. Mayuresh Oak and Mr. Titus Thomas for their contributions. I cannot express in words my gratitude and appreciation for being given the opportunity to prove myself regardless of the trying circumstances that make life.

Abstract

Text Generation with Language Models

Anil Kumar Behera

Co-advisors:

Dr. Raymond Ptucha

Dr. Cecilia Ovesdotter Alm

Dr. Emily Prud’hommeaux

Dr. Christopher Homan

Text generation is the task of producing text from a machine representation. It has many useful applications, such as data anonymization, synthetic data generation to address data sparsity, and summarization. We explore text generation on various thematic topics using a statistical language model and a deep learning technique (the LSTM model). We explore different LSTM architectures for text generation and study how the generated text differs across language models. We evaluate the language models using BLEU score and manual inspection.

Contents

Acknowledgments

Abstract

1 Introduction

2 Data
    2.1 Micro-blogging snippets
    2.2 Literary Text

3 Implementation
    3.0.1 Models
    3.0.2 Experimental Setup

4 Results and Analysis

5 Conclusions
    5.1 Future Work

Bibliography

List of Tables

2.1 Text statistics for the data set. TTR = Type-Token Ratio, ATL = Average Tweet Length

2.2 Text statistics for the data set. TTR = Type-Token Ratio, ASL = Average Sentence Length

3.1 Hyperparameters used for training language models

List of Figures

1.1 LSTM Memory cell

3.1 A UML diagram for Language model creation and Text generation

3.2 Architecture of modified model

3.3 A modified LSTM cell

4.1 Comparison of Karpathy's LSTM models and Modified Karpathy's LSTM models on three categories of tweets points to negative results for these data

4.2 Comparison of Karpathy's models and LM models on literary texts shows that LSTM does better with characters vs. words and vice-versa for LM

Chapter 1

Introduction

Generating language from a machine-represented model for a specific domain is known as natural language generation. Random text generation is a branch of language generation in which random text is generated by a model trained on text from a specific domain. Such generated text has multiple uses, including addressing data sparsity and data anonymization. For example, the author of [2] used random text generation to address data sparsity, applying Markov chain, Hidden Markov Model (HMM), and Latent Dirichlet Allocation (LDA) models to generate synthetic text for a sentiment analysis data set. The results showed that the F-measure scores on actual and synthetic data differ by only a small margin, which suggests that synthetic data generation can be useful in data sparsity scenarios. There are various ways to generate random text, some of which are discussed in [2]. In this project, we explore text generation using language models.

A language model computes the probability of an unseen text based on a training data set, and thereby captures the characteristics of a language. Language models are often used as supporting models, for example in speech recognition, where they help find the word sequence that most likely corresponds to the acoustic sequence. A text generation technique based on language models can be beneficial because language models by nature scale to multiple domains and languages. In this project, we explore text generation using language models on different text types: micro-blogging snippets and literary text.

Statistical language models can be broadly divided into two types: n-gram based language models and continuous space language models. The n-gram based language model is the most widely used. OpenGRM [4] is an n-gram language modeling and language generation tool built on finite-state transducers (FSTs).
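As a brief illustration of the n-gram idea (in standard textbook notation, not taken from [4]), the probability of a token sequence $w_1, \dots, w_m$ is factored with the chain rule, and each factor is approximated by conditioning only on the previous $n-1$ tokens:

```latex
P(w_1, \dots, w_m)
  = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
  \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

For a 4-gram model, $n = 4$, so each token is conditioned on the three tokens before it. Practical toolkits typically also apply smoothing so that unseen n-grams do not receive zero probability.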

A continuous space language model uses continuous word embeddings to build the model. Examples include the skip-gram model and models based on neural networks. This technique helps mitigate the curse of dimensionality in the data. In recent years, deep learning models based on Recurrent Neural Networks (RNNs) have become popular and are overtaking traditional language models. One such implementation is presented in [6], whose authors propose a new character-based recurrent neural network architecture. Like most deep learning techniques, recurrent neural networks are powerful models but come with heavy computational cost. Recent advances in Hessian-free optimization, which speed up gradient-based training, provide large computational benefits when training neural networks. This has increased interest in using RNNs for innovative language model architectures. In [6], the authors propose the Multiplicative RNN (MRNN) for character-based language modelling. To better represent conjugative verb stems, the authors recommend that the hidden nodes always depend on the learned sequence of input characters. The authors report that the MRNN outperforms standard RNN models by 7% on correct reordering. We explore a deep learning architecture inspired by this model for our text generation purpose.

A typical RNN suffers from the vanishing gradient problem over long sequences. To overcome this problem, Long Short-Term Memory (LSTM) [1] became a popular alternative to the standard RNN. It is a subtype of RNN in which the standard node is replaced by an LSTM memory cell. A typical LSTM memory cell contains special gated units that learn to open and close access to a constant error flow. This lets the LSTM learn patterns over both long and short periods of time, and thereby overcome the vanishing gradient problem that appears in a typical RNN architecture. An LSTM memory cell is summarised in Figure 1.1. It consists of an input gate ($I_t$), which determines the effect of the current input on the cell; a forget gate ($f_t$), which determines the effect of historical data on the current cell; and an output gate ($O_t$), which determines the output of the cell. In addition, two modular nodes, the input node and the memory cell, are responsible for determining the importance of the previous hidden node output and the previous memory cell output. An LSTM cell can be expressed using equation 1.1.

Figure 1.1: LSTM Memory cell

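\begin{equation}
\begin{aligned}
I_t &= \sigma(W_i [O^{l-1}_t, h_{t-1}, C_{t-1}] + b_i) \\
f_t &= \sigma(W_f [O^{l-1}_t, h_{t-1}, C_{t-1}] + b_f) \\
\tilde{C}_t &= \tanh(W_c [O^{l-1}_t, h_{t-1}, C_{t-1}] + b_c) \\
C_t &= I_t * \tilde{C}_t + f_t * C_{t-1} \\
O^{l}_t &= \sigma(W_o [O^{l-1}_t, h_{t-1}, C_{t-1}] + b_o) \\
h_t &= O_t * \tanh(C_t)
\end{aligned}
\tag{1.1}
\end{equation}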

Here, $W$ represents the weights for each gate, $b$ the corresponding bias, and $\sigma$ the logistic sigmoid function. $O^{l-1}_t$ represents the output of the previous layer (the current input $X_t$ for the first layer), $h_{t-1}$ represents the previous hidden layer output, and $C_{t-1}$ represents the previous memory cell output.
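To make equation 1.1 concrete, the following is a minimal pure-Python sketch of a single forward step of one LSTM cell. It is an illustration only, not the implementation used in this project: the function names, the `params` dictionary keys, and the `zero_params` helper are hypothetical, and the gate input is formed by simple concatenation as written in the equation.

```python
import math


def sigmoid(x):
    """Logistic sigmoid for a scalar."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, z, b):
    """Compute W z + b for a list-of-rows matrix W and vectors z, b."""
    return [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]


def lstm_step(o_prev_layer, h_prev, c_prev, params):
    """One forward step of the LSTM cell in equation 1.1.

    o_prev_layer: output of the previous layer (X_t for the first layer)
    h_prev, c_prev: previous hidden output and memory cell (same length)
    params: dict with hypothetical keys W_i, b_i, W_f, b_f, W_c, b_c, W_o, b_o
    """
    # Concatenated gate input [O^{l-1}_t, h_{t-1}, C_{t-1}]
    z = list(o_prev_layer) + list(h_prev) + list(c_prev)
    i_gate = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]
    f_gate = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]
    c_cand = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]
    # C_t = I_t * C~_t + f_t * C_{t-1} (element-wise)
    c_new = [i * g + f * c for i, g, f, c in zip(i_gate, c_cand, f_gate, c_prev)]
    o_gate = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]
    h_new = [o * math.tanh(c) for o, c in zip(o_gate, c_new)]
    return h_new, c_new


def zero_params(input_size, hidden_size):
    """All-zero parameters, useful only for checking shapes and arithmetic."""
    cols = input_size + 2 * hidden_size
    params = {}
    for gate in ("i", "f", "c", "o"):
        params["W_" + gate] = [[0.0] * cols for _ in range(hidden_size)]
        params["b_" + gate] = [0.0] * hidden_size
    return params
```

With all-zero parameters every gate evaluates to 0.5 and the candidate value to 0, so the new memory cell is simply half of the previous one. This is a convenient sanity check before plugging in trained weights.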

In [5], the authors used LSTMs as language models on various data sets to build a state-of-the-art speech recognition system, reporting a relative perplexity improvement of 8% over a standard recurrent neural network.

In the first part of the paper, we explore text generation for micro-blogging snippets (tweets) using an LSTM language model (Karpathy et al., 2015) and a modified LSTM model. In the second part, we explore text generation for literary text using a standard 4-gram language model built with the OpenGRM tool (Roark et al., 2012) and the LSTM language model (Karpathy et al., 2015).

Chapter 2

Data

In this project we are using two data sets to explore text generation.

• Micro-blogging snippets (Twitter)

• Literary text

2.1 Micro-blogging snippets

Twitter is one of the most prominent micro-blogging sites where people share details of personal events. We obtained the Twitter data set from a previous work []. We applied customized filters to fetch tweets related to the life-changing events of birth, marriage, and death. After collection, a set of 2000 hand-annotated tweets was selected for each event category, and a further 2000 randomly picked tweets form the general category.

The data is then preprocessed according to the language model type. We replaced Twitter user names, URLs, retweets, and emoticons with the keywords @USER, URL, RT, and EMOT, respectively. For the char-based language model, the text is further preprocessed by replacing each space character with <space> and separating characters by spaces. Then 1800 tweets are selected for the training set and 200 tweets for the test set. Each tweet in the data set is separated by a newline character. Some text statistics for this data set are presented in Table 2.1.
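The preprocessing steps described above can be sketched as follows. This is only an illustration under stated assumptions: the exact filters used in the project are not given in the text, so the regular expressions (in particular the small emoticon pattern) and the function names `normalize_tweet` and `to_char_tokens` are hypothetical.

```python
import re

# Hypothetical patterns; the project's actual filters are not specified.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\bRT\b", flags=re.IGNORECASE)
EMOT_RE = re.compile(r"[:;=][-']?[)(DPp]")  # covers only a few common emoticons


def normalize_tweet(text):
    """Replace URLs, user names, retweet markers, and emoticons with keywords."""
    text = URL_RE.sub("URL", text)      # URLs first, so ':/' is not seen as an emoticon
    text = USER_RE.sub("@USER", text)
    text = RT_RE.sub("RT", text)
    text = EMOT_RE.sub("EMOT", text)
    return " ".join(text.split())       # collapse repeated whitespace


def to_char_tokens(text):
    """Char-level format: spaces become <space>, characters are space-separated."""
    return " ".join("<space>" if ch == " " else ch for ch in text)
```

For example, `to_char_tokens("hi you")` yields `h i <space> y o u`, which lets a whitespace-tokenizing tool treat each character as one token.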

Event Category   Sentences   Tokens   Types   TTR    Characters   ATL (tokens)   ATL (char)
Birth            1800        30188    3764    0.12   163215       16.77          90.67
Marriage         1800        22763    2827    0.12   128681       12.64          71.49
Death            1800        23866    2518    0.10   117231       19.89          65.12
General          1800        25945    5873    0.22   137360       14.41          76.31

Table 2.1: Text statistics for the data set. TTR = Type-Token Ratio, ATL = Average Tweet Length
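The statistics in the table follow from simple counts. For the Birth category, for instance, TTR = 3764 / 30188 ≈ 0.12 and the average tweet length in tokens is 30188 / 1800 ≈ 16.77. A minimal sketch of how such statistics could be computed is given below; it assumes whitespace tokenization and counts every character, including spaces, which may differ from the procedure actually used. The function name `text_statistics` is hypothetical.

```python
def text_statistics(lines):
    """Corpus statistics for a list of tweets or sentences (whitespace tokens)."""
    if not lines:
        raise ValueError("need at least one line")
    tokens = [tok for line in lines for tok in line.split()]
    if not tokens:
        raise ValueError("need at least one token")
    types = set(tokens)                      # distinct tokens
    n_chars = sum(len(line) for line in lines)
    return {
        "sentences": len(lines),
        "tokens": len(tokens),
        "types": len(types),
        "ttr": len(types) / len(tokens),     # type-token ratio
        "avg_len_tokens": len(tokens) / len(lines),
        "avg_len_chars": n_chars / len(lines),
    }
```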

2.2 Literary Text

Literary text is widely read and closely reflects natural human language. For this purpose, we use literary texts from two genres, collected from the open-source Project Gutenberg: the novels Emma by Jane Austen and The Return of Sherlock Holmes by Arthur Conan Doyle.

For each novel, we identified the most common characters by applying named entity recognition and frequency counts, using the Stanford CoreNLP tool. We then extracted 1900 sentences from each novel based on the characters selected for that novel. In Emma, the two most common characters are Emma and Harriet. In The Return of Sherlock Holmes, the two most common characters are Holmes and Watson. We applied co-reference resolution to these characters and extracted only the sentences containing noun phrases that co-refer with them. For the char-based language model, the text is further preprocessed by replacing each space character with <space> and separating characters by spaces. From each category, 1800 random sentences are selected for training and 100 sentences for testing. Some text statistics for this data set are presented in Table 2.2.

Text Category   Sentences   Tokens   Types   TTR    Characters   ASL (tokens)   ASL (char)
Emma            1800        51782    4253    0.08   269816       28.77          149.89
Holmes          1800        32029    3567    0.11   163062       17.79          90.59

Table 2.2: Text statistics for the data set. TTR = Type-Token Ratio, ASL = Average Sentence Length

Chapter 3

Implementation

The models we implemented fall into three broad types:

• Standard Language model (LM)

• Karpathy's LSTM model (LSTM)

• Modified Karpathy's LSTM model (Modified-LSTM)

Each model is further categorized into character-based and word-based models.

3.0.1 Models

Standard Language model (LM)

For this model, we used the OpenGRM tool (Roark et al., 2012) to create a language model and generate text. We mainly build 4-gram language models, in both char-based and word-based variants. The workflow for creating the language model and generating text is shown in Figure 3.1. Based on the input type, a 4-gram language model is created. A weighted finite-state transducer (FST) is then built from the probability distribution of the n-grams, and text is generated from the weighted FST.
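The idea of generating text from an n-gram distribution can be sketched without any FST machinery. The following is not OpenGRM and does not use its API; it is a plain-Python stand-in that samples from unsmoothed maximum-likelihood n-gram counts, with hypothetical function names `train_ngram` and `generate`. Each sentence is a list of tokens: words for a word-based model, or single characters for a char-based model.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # sentence boundary markers (an assumption of this sketch)


def train_ngram(sentences, n=4):
    """Count next-token frequencies for every (n-1)-token history."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = [BOS] * (n - 1) + list(tokens) + [EOS]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            counts[history][padded[i]] += 1
    return counts


def generate(counts, n=4, max_len=50, rng=None):
    """Sample one sentence token by token until EOS or max_len tokens."""
    rng = rng or random.Random()
    history = (BOS,) * (n - 1)
    output = []
    while len(output) < max_len:
        followers = counts[history]
        # Sampling is proportional to the observed counts (no smoothing).
        token = rng.choices(list(followers), weights=list(followers.values()))[0]
        if token == EOS:
            break
        output.append(token)
        history = (history + (token,))[1:]  # slide the history window
    return output
```

Because the model is unsmoothed, every history reached during generation was observed in training, so sampling never encounters an empty distribution. A real toolkit would add smoothing and back-off.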

Figure 3.1: A UML diagram for Language model creation and Text generation

Parameter          Value
lmtype             0/1 (word/char)
sequence length    50
rnn size           128
number of layers   2
batch size         200
max epochs         500
learning rate      0.002
dropout            0.5
decay rate         0.95

Table 3.1: Hyperparameters used for training language models

Karpathy’s LSTM model (LSTM)

For this model, we use the deep learning based language model implemented in (Karpathy et al., 2015). It is a simple LSTM model as described in the introduction. The char-based model takes a sequence of characters as its input vector, whereas the word-based model takes a sequence of words. The models are trained with similar hyperparameter values, which are listed in Table 3.1.

Figure 3.2: Architecture of modified model

Figure 3.3: A modified LSTM cell

Modified Karpathy’s LSTM model (Modified-LSTM)

In this model, we modified Karpathy's model to support a new architecture. The architecture, inspired by (Sutskever et al., 2011), is a char-based recurrent neural network. The workflow diagram of the modified model and the modified LSTM cell are shown in Figures 3.2 and 3.3.

In this model, all hidden layers always depend on the learned sequence of the current input character/word. Based on this, the mathematical model for an LSTM cell is shown in equation 3.1.

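\begin{equation}
\begin{aligned}
I_t &= \sigma(W_i [O^{l-1}_t, h_{t-1}, C_{t-1}, X_t] + b_i) \\
f_t &= \sigma(W_f [O^{l-1}_t, h_{t-1}, C_{t-1}, X_t] + b_f) \\
\tilde{C}_t &= \tanh(W_c [O^{l-1}_t, h_{t-1}, C_{t-1}, X_t] + b_c) \\
C_t &= I_t * \tilde{C}_t + f_t * C_{t-1} \\
O^{l}_t &= \sigma(W_o [O^{l-1}_t, h_{t-1}, C_{t-1}, X_t] + b_o) \\
h_t &= O_t * \tanh(C_t)
\end{aligned}
\tag{3.1}
\end{equation}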

In this model, the input sequence $X_t$ is fed to every hidden layer in addition to the output of the previous layer. This helps every hidden layer learn from the current input sequence. This model is trained only on character-level inputs.
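The change relative to the standard cell can be illustrated with a short forward-pass sketch of equation 3.1. It is a hypothetical illustration in pure Python, not the modified Torch code used in the project: the names `modified_lstm_step`, `forward_stack`, and `constant_params` are invented here, and passing a layer's hidden vector to the next layer as its output is an assumption of the sketch.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, z, b):
    """Compute W z + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]


def modified_lstm_step(o_prev_layer, h_prev, c_prev, x_t, params):
    """One step of equation 3.1: x_t is appended to every gate's input."""
    z = list(o_prev_layer) + list(h_prev) + list(c_prev) + list(x_t)
    i_gate = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]
    f_gate = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]
    c_cand = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]
    c_new = [i * g + f * c for i, g, f, c in zip(i_gate, c_cand, f_gate, c_prev)]
    o_gate = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]
    h_new = [o * math.tanh(c) for o, c in zip(o_gate, c_new)]
    return h_new, c_new


def forward_stack(x_t, states, layer_params):
    """Run one time step through all layers; each layer also receives x_t."""
    layer_input = x_t  # the first layer has O^{l-1}_t = X_t
    new_states = []
    for (h_prev, c_prev), params in zip(states, layer_params):
        h_new, c_new = modified_lstm_step(layer_input, h_prev, c_prev, x_t, params)
        new_states.append((h_new, c_new))
        layer_input = h_new  # assumption: a layer's output is its hidden vector
    return layer_input, new_states


def constant_params(n_cols, hidden_size, value=0.0):
    """Hypothetical helper: every weight and bias set to the same value."""
    params = {}
    for gate in ("i", "f", "c", "o"):
        params["W_" + gate] = [[value] * n_cols for _ in range(hidden_size)]
        params["b_" + gate] = [value] * hidden_size
    return params
```

Note that the weight matrices of a layer above the first have `len(x_t)` more columns than in the standard cell, since the raw input is concatenated alongside the previous layer's output.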

3.0.2 Experimental Setup

The experimental setup is divided into two parts. The first experiment compares the performance of Karpathy's model and the modified Karpathy's model. The second experiment explores text generation for the literary text data set.

Experiment 1

This experiment compares the effectiveness of Karpathy's model and the modified Karpathy's model. The micro-blogging snippet data for the event categories Birth, Marriage, and Death is used for the comparison. A total of six char-based models are trained: one per event category for each of the two LSTM architectures. After training, ten files are generated from each model, each containing around 200 tweets.

BLEU score without the brevity penalty is used to capture the effectiveness of each model. The BLEU score is calculated for each candidate tweet against all reference tweets, and is then averaged across all tweets generated by a model.
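The scoring procedure described above can be sketched as follows. This is a minimal illustration rather than the evaluation script used in the project. It assumes the standard BLEU definition of clipped (modified) n-gram precision with uniform weights up to 4-grams, it omits the brevity penalty as stated in the text, and it returns 0 when any n-gram precision is 0 (no smoothing). The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against all references."""
    cand = ngram_counts(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram is credited at most as often as it occurs
    # in the single reference where it is most frequent.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu_no_bp(candidate, references, max_n=4):
    """Geometric mean of the 1..max_n precisions, without brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


def average_bleu(candidates, references, max_n=4):
    """Average the per-candidate scores over all generated texts."""
    return sum(bleu_no_bp(c, references, max_n) for c in candidates) / len(candidates)
```

Here `candidates` would be the tokenized generated tweets and `references` the tokenized held-out tweets for the same category.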

Experiment 2

This experiment explores text generation for literary text. We use the data sets collected from the novels Emma and The Return of Sherlock Holmes. For each data set, four models are trained: char-LM, word-LM, char-LSTM, and word-LSTM. Karpathy's LSTM model is used for this experiment because the modified Karpathy's model showed negative results, as discussed in the Results and Analysis chapter. A held-out set of 100 sentences from the respective data set is used as the reference test set. Ten sets of 100 sentences each are then generated by each model, and the BLEU score is averaged across all generated sets.

Chapter 4

Results and Analysis

Previous work [3] showed that BLEU score is a good measure for evaluating text generation models, so all models are evaluated using BLEU score (Papineni et al., 2002). Figure 4.1 shows the BLEU score results from experiment 1. From the graph, we can clearly see that the modified Karpathy's model consistently performs ten points lower than the original model. Analysis of the model shows that the weight vector for $X_t$ at each layer is never trained properly. This is because the gradient for the weight vector belonging to $X_t$ at each layer is ignored, as $X_t$ is treated as an input vector. This leads to a large loss in the propagated error during backpropagation.

Figure 4.1: Comparison of Karpathy's LSTM models and Modified Karpathy's LSTM models on three categories of tweets points to negative results for these data

In the second experiment, we explored text generation for literary text using different language models. For each thematic type, four language models were trained: char-LM, word-LM, char-LSTM, and word-LSTM. Figure 4.2 shows the BLEU score results from experiment 2.

Figure 4.2: Comparison of Karpathy's models and LM models on literary texts shows that LSTM does better with characters vs. words and vice-versa for LM.

From the graph, we can clearly see that LSTM-char and LM-word perform very similarly, with the highest BLEU scores, while LM-char scores comparatively lower. Emma scores much lower than Holmes. This is probably because Emma has higher lexical complexity than Holmes: Emma has a type-token ratio of 0.08 whereas Holmes has 0.11, and Emma has a higher average sentence length than Holmes.

From an initial analysis, performance on the micro-blogging data set is much better than on the literary data set, even though the two have very similar lexical diversity. An interesting phenomenon is that as the average sentence length (in tokens) increases, the BLEU score decreases. Further analysis shows that the word distribution in micro-blogging snippets is concentrated on very few words, whereas in literary text it is more evenly distributed. After computing frequency counts and removing all words with frequency less than 30 from the birth and Emma data sets, the birth data set has only 102 words remaining above frequency 30, whereas the Emma data set has 213 unique words with frequency above 30, roughly double. This initial analysis suggests that Emma has richer lexical diversity, which makes it harder for the models to predict. Similarly, Holmes has 122 unique words with frequency greater than 30, and marriage and death have 102 and 85, respectively. This indicates that the effectiveness of a model depends on the average sentence length and the lexical diversity of the language. Some example texts generated by the various language models are:

• @USER congratulations on the birth of your daughter URL (LSTM-char, birth)

• we as hours of core sores and must overs for your. (LSTM-char, Holmes)
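The frequency-threshold analysis described above (counting the unique words that occur more than 30 times in a data set) can be sketched as below. It assumes whitespace tokenization without lowercasing and a strict "greater than" threshold. The text does not say how words with frequency exactly 30 were treated, so that choice is an assumption, as is the function name `frequent_words`.

```python
from collections import Counter


def frequent_words(sentences, min_count=30):
    """Return words whose corpus frequency is strictly greater than min_count."""
    counts = Counter(tok for sentence in sentences for tok in sentence.split())
    return {word: c for word, c in counts.items() if c > min_count}
```

Calling `len(frequent_words(sentences))` on each data set would give the number of unique high-frequency words being compared in the analysis.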

Chapter 5

Conclusions

From this analysis, we conclude that text generation using language models can be useful for many applications. In a well-constrained domain, it can be a feasible approach to problems such as data sparsity or data anonymization.

5.1 Future Work

Deep learning methods generally perform better as the training data size increases, so a model trained on larger data sets might show better performance. Training models on other text types, such as medical data or journals, might also offer different perspectives.

Bibliography

[1] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[2] Umar Maqsud. Synthetic text generation for sentiment analysis. In 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA 2015), page 156, 2015.

[3] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.

Anil Behera, Mayuresh Oak, Titus Thomas, Cecilia Ovesdotter Alm, Emily Prud'hommeaux, Chris Homan, and Ray Ptucha. Generating Clinically Relevant Texts: A Case Study on Life-changing Events. CLASPhysc, 2016.

[4] Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61–66. Association for Computational Linguistics, 2012.

[5] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In INTERSPEECH, pages 194–197, 2012.

[6] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
