Rochester Institute of Technology Computer Science Master Project
Text generation with Language models
Author: Anil Kumar Behera
Co-Advisors: Dr. Cecilia Ovesdotter Alm
Dr. Christopher Homan Dr. Emily Prud'hommeaux
Dr. Raymond Ptucha
A Project Submitted in Fulfillment of the Requirements
for the Master Degree of Computer Science
in the Computer Science Department
B. Thomas Golisano College of Computing and Information Sciences
May 26, 2016
Acknowledgments
I wholeheartedly thank the entire faculty of the Computer Science Department for their continuous support and assistance throughout the growth and development of my career at RIT. I specifically thank the following faculty members, who made it possible for me to complete my Master’s Degree in Computer Science: Dr. Raymond Ptucha, Dr. Cecilia Ovesdotter Alm, Dr. Christopher Homan, and Dr. Emily Prud’hommeaux. I would also like to thank Mr. Mayuresh Oak and Mr. Titus Thomas for their contributions. I cannot express in words my gratitude and appreciation for being given the opportunity to prove myself regardless of the trying circumstances that life brings.
Abstract
Text Generation with Language Models
Anil Kumar Behera
Co-advisors:
Dr. Raymond Ptucha
Dr. Cecilia Ovesdotter Alm
Dr. Emily Prud’hommeaux
Dr. Christopher Homan
Text generation is the task of generating text from a machine representation. It has many useful applications, such as data anonymization, synthetic data generation to address data sparsity, and summarization. We explore text generation on various thematic topics using a statistical language model and a deep learning technique (an LSTM model). We explore different LSTM architectures for text generation and study the differences in the text generated by the different language models. We use BLEU score and manual inspection to evaluate the language models.
Contents
Acknowledgments
Abstract
1 Introduction
2 Data
  2.1 Micro-blogging snippets
  2.2 Literary Text
3 Implementation
  3.0.1 Models
  3.0.2 Experimental Setup
4 Results and Analysis
5 Conclusions
  5.1 Future Work
Bibliography
List of Tables
2.1 Text statistics for the data set. TTR = Type-Token Ratio, ATL = Average Tweet Length
2.2 Text statistics for the data set. TTR = Type-Token Ratio, ASL = Average Sentence Length
3.1 Hyperparameters used for training language models
List of Figures
1.1 LSTM Memory cell
3.1 A UML diagram for Language model creation and Text generation
3.2 Architecture of modified model
3.3 A modified LSTM cell
4.1 Comparison of Karpathy’s LSTM models and Modified Karpathy’s LSTM models on three categories of tweets points to negative results for these data
4.2 Comparison of Karpathy’s models and LM models on literary texts shows that LSTM does better with characters vs. words and vice versa for LM
Chapter 1
Introduction
Generating language from a machine-represented model for a specific domain is known as natural language generation. Random text generation is a branch of language generation in which random text is generated using a model trained on domain-specific text data. Such generated text has multiple uses, such as addressing data sparsity and data anonymization. One such use is demonstrated in [2], where the author used random text generation to address data sparsity. The author used Markov chain, Hidden Markov Model (HMM), and Latent Dirichlet Allocation (LDA) models to generate synthetic text for a given data set for sentiment analysis. The results showed that the F-measure scores on actual data and on synthetic data differ by only a small margin, which suggests that synthetic data generation can be useful in data sparsity scenarios. There are various ways to generate random text; some of the methods are discussed in [2]. In this project, we explore text generation using language models.
A language model computes the probability of an unseen text based on the training data set. It captures the characteristics of a language. Language models are often used as auxiliary models, for example in speech recognition, to find the word sequence that most likely corresponds to the acoustic sequence. A text generation technique based on language models can be beneficial because language models are by nature scalable to multiple domains and languages. In this project we explore text generation using language models on different text types, namely micro-blogging snippets and literary text.
Statistical language models can be broadly divided into two types: n-gram based language models and continuous space language models. The n-gram based language model is the most widely used. OpenGRM [4] is an n-gram language modeling and text generation tool built on finite-state transducers (FSTs).
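As a worked illustration in standard textbook notation (the symbols below are assumptions and do not come from the OpenGRM documentation), an n-gram model approximates the probability of a token sequence $w_1, \dots, w_m$ by conditioning each token on only the $n-1$ preceding tokens. Before any smoothing, each conditional probability can be estimated from training counts:

\[
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1}),
\qquad
P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{\operatorname{count}(w_{i-n+1}, \dots, w_i)}{\operatorname{count}(w_{i-n+1}, \dots, w_{i-1})}
\]

For the 4-gram models used in this project, $n = 4$, so each token (a word or a character) is conditioned on the three preceding tokens.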
A continuous space language model uses continuous word embeddings to build the model. Examples of models using this technique are the skip-gram model and neural network language models. The technique is useful for mitigating the curse of dimensionality in the data set. In recent years, deep learning models based on Recurrent Neural Networks (RNNs) have become popular and are overtaking traditional language models. One such implementation is presented in [6], where the authors propose a new character-based recurrent neural network architecture. Like most deep learning techniques, recurrent neural networks are powerful models but come with a heavy computational cost. A recent advance in Hessian-free optimization provides large computational benefits when training neural networks, which has increased the demand for RNNs in innovative language model architectures. In [6], the authors propose the Multiplicative RNN (MRNN) for character-based language modelling. To better represent conjugative verb stems, the authors recommend that the hidden nodes always depend on the learned sequence of input characters. The authors showed that the MRNN outperforms standard RNN models by 7% on correct reordering. We explore a deep learning architecture inspired by this model for our text generation purpose.
A typical RNN suffers from the vanishing gradient problem over long time spans. To overcome this problem, Long Short-Term Memory (LSTM) [1] became a widely used alternative to the standard RNN. It is a subtype of RNN in which the standard node is replaced by an LSTM memory cell. A typical LSTM memory cell contains special gated units that learn to open and close access to a constant error flow. This helps the LSTM learn patterns over both long and short time spans and helps overcome the vanishing gradient problem that appears in a typical RNN architecture. An LSTM memory cell is summarized in Figure 1.1.
Figure 1.1: LSTM Memory cell
An LSTM memory cell consists of an input gate ($I_t$), which determines the effect of the current input on the LSTM cell; a forget gate ($f_t$), which determines the effect of historical data on the current LSTM cell; and an output gate ($O_t$), which determines the output of the LSTM cell. In addition, two modular nodes, the input node and the memory cell, are responsible for determining the importance of the previous hidden node output and the previous memory cell output. An LSTM cell can be expressed using Equation 1.1.
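\[
\begin{aligned}
I_t &= \sigma(W_i [O^{l-1}_t, h_{t-1}, C_{t-1}] + b_i) \\
f_t &= \sigma(W_f [O^{l-1}_t, h_{t-1}, C_{t-1}] + b_f) \\
\tilde{C}_t &= \tanh(W_c [O^{l-1}_t, h_{t-1}, C_{t-1}] + b_c) \\
C_t &= I_t * \tilde{C}_t + f_t * C_{t-1} \\
O^{l}_t &= \sigma(W_o [O^{l-1}_t, h_{t-1}, C_{t-1}] + b_o) \\
h_t &= O_t * \tanh(C_t)
\end{aligned}
\tag{1.1}
\]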
Here, $W$ represents the weight vector for each gate, $O^{l-1}_t$ represents the output of the previous layer (the current input $X_t$ for the first layer), $h_{t-1}$ represents the previous hidden layer output, and $C_{t-1}$ represents the previous memory cell output.
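To make Equation 1.1 concrete, the following is a minimal sketch of one time step of a single LSTM layer in plain Python. It is an illustration only, not the implementation used in this project: the function names (`lstm_step`, `init_params`), the list-based vectors, the small random initialization, and the treatment of each $W$ as a matrix with one row per hidden unit are all assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W and list vectors b, v."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(params, o_prev_layer, h_prev, c_prev):
    """One time step of Equation 1.1 for a single layer (hypothetical helper)."""
    # Concatenation [O^{l-1}_t, h_{t-1}, C_{t-1}] that feeds every gate.
    z = list(o_prev_layer) + list(h_prev) + list(c_prev)
    i_t = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]        # input gate
    f_t = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]        # forget gate
    c_tilde = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]  # candidate memory
    # New memory cell: gated candidate plus gated previous memory.
    c_t = [i * g + f * c for i, g, f, c in zip(i_t, c_tilde, f_t, c_prev)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]        # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                         # hidden output
    return h_t, c_t


def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases; the initialization scheme is an assumption."""
    rng = random.Random(seed)
    cols = input_size + 2 * hidden_size
    params = {}
    for gate in "ifco":
        params["W_" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(hidden_size)
        ]
        params["b_" + gate] = [0.0] * hidden_size
    return params
```

A call such as `lstm_step(init_params(3, 2), [1.0, 0.0, 0.0], [0.0, 0.0], [0.0, 0.0])` returns the new hidden output and memory cell, each of length 2. Because the forget gate multiplies the previous memory cell directly, information can persist across many time steps when that gate stays close to 1.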
In [5], the authors used LSTMs as language models on various data sets to achieve a state-of-the-art speech recognition system. The LSTM achieved an 8% relative improvement in perplexity over a standard recurrent neural network.
In the first part of the paper, we explore text generation for micro-blogging snippets (tweets) using an LSTM language model (Karpathy et al., 2015) and a Modified LSTM model. In the second part of the paper, we explore text generation for literary text data using a standard 4-gram language model built with the OpenGRM tool (Roark et al., 2012) and an LSTM language model (Karpathy et al., 2015).
Chapter 2
Data
In this project we are using two data sets to explore text generation.
• Micro-blogging snippets (Twitter)
• Literary text
2.1 Micro-blogging snippets
Twitter is one of the most prominent micro-blogging sites, where people share details of personal events. We collected the Twitter data set from a previous work []. We applied customized filters to fetch tweets related to the life-changing events of birth, marriage, and death. After collection, a set of 2000 hand-annotated tweets was selected for each category, and a set of 2000 randomly picked tweets forms the general category.
The data is then preprocessed according to the language model type. We replaced Twitter user names, URLs, retweets, and emoticons with the keywords @USER, URL, RT, and EMOT, respectively. For the char-based language models, the text is further preprocessed by replacing each space character with <space> and separating characters by spaces. Then 1800 tweets are selected for the training set and 200 tweets for the test set. Tweets in the data set are separated by a newline character. Some text statistics for this data set are presented in Table 2.1.
Event Category  Sentences  Tokens  Types  TTR   Characters  ATL (tokens)  ATL (char)
Birth           1800       30188   3764   0.12  163215      16.77         90.67
Marriage        1800       22763   2827   0.12  128681      12.64         71.49
Death           1800       23866   2518   0.10  117231      19.89         65.12
General         1800       25945   5873   0.22  137360      14.41         76.31
Table 2.1: Text statistics for the data set. TTR = Type-Token Ratio, ATL = Average Tweet Length
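The tweet preprocessing described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the filters actually used: the regular expressions, the emoticon pattern (which covers only a few common ASCII emoticons), the replacement order, and the function names `normalize_tweet` and `to_char_level` are all hypothetical. The sketch also splits the placeholder keywords into characters for the char-based input, which the text does not specify.

```python
import re

# Hypothetical patterns; the filters used for the real data set are not given in the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\brt\b", re.IGNORECASE)
# A few common ASCII emoticons, matched only when surrounded by whitespace.
EMOT_RE = re.compile(r"(?<!\S)[:;=][-']?[)(\]\[DPp/\\|](?!\S)")


def normalize_tweet(text):
    """Replace URLs, user names, retweet markers and emoticons with keywords."""
    text = URL_RE.sub("URL", text)      # URLs first, so ":/" is never read as an emoticon
    text = USER_RE.sub("@USER", text)
    text = RT_RE.sub("RT", text)
    text = EMOT_RE.sub("EMOT", text)
    return text


def to_char_level(text):
    """Char-based input: spaces become <space>, characters are separated by spaces."""
    return " ".join("<space>" if ch == " " else ch for ch in text)


if __name__ == "__main__":
    tweet = "rt @anna: our baby girl is here :) http://t.co/abc"
    clean = normalize_tweet(tweet)
    print(clean)                  # RT @USER: our baby girl is here EMOT URL
    print(to_char_level("hi yo"))  # h i <space> y o
```

Writing one normalized tweet per line then gives the newline-separated format described above.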
2.2 Literary Text
Literary text is widely read and is close to natural human language. We used literary text from two genres collected from the open-source Project Gutenberg. The texts are literary novels: Emma by Jane Austen and The Return of Sherlock Holmes by Arthur Conan Doyle.
For each text, we identified the most common characters in the novel by applying Named Entity Recognition and a frequency count, using the Stanford CoreNLP tool. Then 1900 sentences were extracted from each novel based on the characters selected for that novel. In Emma, the two most common characters are Emma and Harriet. In The Return of Sherlock Holmes, the two most common characters are Holmes and Watson. We applied co-reference resolution to these characters to extract sentences related to them; only sentences containing noun phrases co-referent with these characters were extracted. For the char-based language models, the text is further preprocessed by replacing each space character with <space> and separating characters by spaces. From each category, 1800 random sentences are selected for training and 100 sentences for testing. Some text statistics for this data set are presented in Table 2.2.
Text    Sentences  Tokens  Types  TTR   Characters  ASL (tokens)  ASL (char)
Emma    1800       51782   4253   0.08  269816      28.77         149.89
Holmes  1800       32029   3567   0.11  163062      17.79         90.59
Table 2.2: Text statistics for the data set. TTR = Type-Token Ratio, ASL = Average Sentence Length
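The statistics in Tables 2.1 and 2.2 follow from simple counts. For example, for Emma the type-token ratio is 4253 / 51782 ≈ 0.08 and the average sentence length is 51782 / 1800 ≈ 28.77 tokens. The sketch below shows one way to compute them; whitespace tokenization, case-sensitive types, and counting every character of a sentence (including spaces) are assumptions, since the exact tokenization is not stated, and `text_statistics` is a hypothetical helper name.

```python
def text_statistics(sentences):
    """Compute the columns of Tables 2.1 and 2.2 for a list of sentence strings."""
    tokens = [tok for sentence in sentences for tok in sentence.split()]
    if not sentences or not tokens:
        raise ValueError("need at least one non-empty sentence")
    n_chars = sum(len(sentence) for sentence in sentences)
    n_types = len(set(tokens))  # distinct tokens
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "types": n_types,
        "ttr": n_types / len(tokens),                 # type-token ratio
        "characters": n_chars,
        "avg_len_tokens": len(tokens) / len(sentences),
        "avg_len_chars": n_chars / len(sentences),
    }


if __name__ == "__main__":
    stats = text_statistics(["a b a", "c d"])
    print(stats["types"], stats["ttr"], stats["avg_len_tokens"])  # 4 0.8 2.5
```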
Chapter 3
Implementation
The implementation models are broadly categorized into three types of models.
• Standard Language model (LM)
• Karpathy’s LSTM model (LSTM)
• Modified Karpathy’s LSTM model (Modified-LSTM)
Each model is further categorized into character-based and word-based models.
3.0.1 Models
Standard Language model (LM)
For this model we used the OpenGRM tool (Roark et al., 2012) to create a language model and generate text. We mainly build a 4-gram language model, in both a char-based and a word-based variant. The workflow for creating the language model and generating text is shown in Figure 3.1. Based on the input type, a 4-gram language model is created. A weighted finite-state transducer (FST) is then built from the probability distribution of the n-grams, and text is generated from the weighted FST.
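The idea behind this workflow can be illustrated without an FST toolkit. The sketch below is not OpenGRM and does not use its API: it swaps in a plain dictionary of unsmoothed n-gram counts and samples each next token in proportion to its count. The names `train_ngram` and `generate`, the `<s>` and `</s>` boundary symbols, and the absence of smoothing are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary symbols


def train_ngram(sentences, n=4):
    """Count next-token frequencies for every (n-1)-token history."""
    if n < 2:
        raise ValueError("this sketch needs n >= 2")
    counts = defaultdict(Counter)
    for tokens in sentences:
        padded = [BOS] * (n - 1) + list(tokens) + [EOS]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            counts[history][padded[i]] += 1
    return counts


def generate(counts, n=4, max_len=50, rng=None):
    """Sample one sequence token by token until EOS or max_len tokens."""
    rng = rng or random.Random()
    history = [BOS] * (n - 1)
    output = []
    while len(output) < max_len:
        options = counts.get(tuple(history))
        if not options:  # unseen history: stop instead of backing off
            break
        tokens, weights = zip(*options.items())
        nxt = rng.choices(tokens, weights=weights, k=1)[0]
        if nxt == EOS:
            break
        output.append(nxt)
        history = history[1:] + [nxt]
    return output


if __name__ == "__main__":
    # Word-based input is a list of words; char-based input would be list("some text").
    model = train_ngram([["a", "b", "c"]], n=2)
    print(generate(model, n=2))  # ['a', 'b', 'c'] (the only sequence this toy model allows)
```

A weighted FST encodes the same conditional distribution as states (histories) and weighted arcs (next tokens), so sampling a path through the FST corresponds to the loop in `generate`.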
Figure 3.1: A UML diagram for Language model creation and Text generation
Parameter         Value
lmtype            0/1 (word/char)
sequence length   50
rnn size          128
number of layers  2
batch size        200
max epochs        500
learning rate     0.002
dropout           0.5
decay rate        0.95
Table 3.1: Hyperparameters used for training language models
Karpathy’s LSTM model (LSTM)
For this model, we use the deep learning based language model implemented in (Karpathy et al., 2015). It is a standard LSTM model as described in the introduction. For the char-based model, a sequence of characters is taken as the input vector, whereas for the word-based model, a sequence of words is taken as the input vector. The models are trained with similar hyperparameter values, which are listed in Table 3.1.
Figure 3.2: Architecture of modified model
Figure 3.3: A modified LSTM cell
Modified Karpathy’s LSTM model (Modified-LSTM)
In this model, we modified Karpathy’s model to support a new architecture. The architecture, inspired by (Sutskever et al., 2011), is a char-based recurrent neural network. The workflow diagram of the modified model and the modified LSTM cell are shown in Figures 3.2 and 3.3.
In this model, every hidden layer always depends on the learned sequence of the current input character/word. The mathematical model of the resulting LSTM cell is shown in Equation 3.1.
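\[
\begin{aligned}
I_t &= \sigma(W_i [O^{l-1}_t, h_{t-1}, C_{t-1}, X_t] + b_i) \\
f_t &= \sigma(W_f [O^{l-1}_t, h_{t-1}, C_{t-1}, X_t] + b_f) \\
\tilde{C}_t &= \tanh(W_c [O^{l-1}_t, h_{t-1}, C_{t-1}, X_t] + b_c) \\
C_t &= I_t * \tilde{C}_t + f_t * C_{t-1} \\
O^{l}_t &= \sigma(W_o [O^{l-1}_t, h_{t-1}, C_{t-1}, X_t] + b_o) \\
h_t &= O_t * \tanh(C_t)
\end{aligned}
\tag{3.1}
\]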
In this model, the input sequence $X_t$ is fed to every hidden layer, which helps each hidden layer learn the current input sequence. This model is trained only on character-level inputs.
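The only structural difference between Equation 1.1 and Equation 3.1 is the vector that feeds each gate. The short sketch below makes that difference explicit and shows how the gate weight shapes grow as a result. The helper names and the character vocabulary size of 70 are hypothetical; the hidden size of 128 is the rnn size from Table 3.1.

```python
def gate_input(o_prev_layer, h_prev, c_prev, x_t=None):
    """Build the vector fed to every gate.

    With x_t=None this is [O^{l-1}_t, h_{t-1}, C_{t-1}] as in Equation 1.1;
    passing x_t appends the current input, as in Equation 3.1.
    """
    z = list(o_prev_layer) + list(h_prev) + list(c_prev)
    if x_t is not None:
        z += list(x_t)
    return z


def gate_weight_shape(hidden_size, prev_layer_size, input_size=0):
    """Rows and columns of one gate's weights for the given gate input."""
    return (hidden_size, prev_layer_size + 2 * hidden_size + input_size)


if __name__ == "__main__":
    # Second layer, rnn size 128; a 70-symbol character vocabulary is assumed.
    print(gate_weight_shape(128, 128))      # (128, 384) for Equation 1.1
    print(gate_weight_shape(128, 128, 70))  # (128, 454) for Equation 3.1
```

The extra columns are the weights that act on $X_t$ in every layer above the first; these are the weights whose training is examined in the Results and Analysis chapter.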
3.0.2 Experimental Setup
The experimental setup is divided into two parts. The first experiment compares the performance of Karpathy’s model and the Modified Karpathy’s model. The second experiment explores text generation for the literary text data set.
Experiment 1
This experiment compares the effectiveness of Karpathy’s model and the Modified Karpathy’s model. The micro-blogging snippet data for the event categories Birth, Marriage, and Death is used for the comparison. A total of six models are trained: one char-based model per event category for each of the two LSTM architectures. After training, ten files are generated from each model, each containing around 200 tweets.
BLEU score without brevity penalty is used to capture the effectiveness of each model. The BLEU score is calculated for each candidate tweet against all reference tweets and then averaged across all tweets generated by a model.
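The scoring procedure above can be sketched as follows. This is a minimal illustration of BLEU without the brevity penalty, not the evaluation script used in the experiments: uniform weights over 1- to 4-grams, the absence of smoothing, word-level tokens, and the function names are assumptions. The sketch returns scores in the range 0 to 1; multiplying by 100 gives the familiar point scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against all references."""
    cand = ngrams(candidate, n)
    total = sum(cand.values())
    if total == 0:  # candidate shorter than n tokens
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu_no_bp(candidate, references, max_n=4):
    """Geometric mean of the modified precisions, with no brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:  # no smoothing: any zero precision gives a zero score
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


def average_bleu(candidates, references, max_n=4):
    """Average the per-candidate score over all generated texts of a model."""
    return sum(bleu_no_bp(c, references, max_n) for c in candidates) / len(candidates)
```

Without the brevity penalty, a short candidate is not punished for its length: `bleu_no_bp("the cat sat on".split(), ["the cat sat on the mat".split()])` evaluates to 1.0 because every n-gram of the candidate occurs in the reference.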
Experiment 2
This experiment explores text generation for literary text data. The data sets collected from the novels Emma and The Return of Sherlock Holmes are used. For each data set, four models are trained: char-LM, word-LM, char-LSTM, and word-LSTM. Karpathy’s LSTM model is used for this experiment because the Modified Karpathy’s model showed negative results, as shown in the Results and Analysis chapter. A held-out set of 100 sentences from each data set is used as the reference test set. Ten sets of 100 sentences each are then generated from each model, and the BLEU score is averaged across all generated sets.
Chapter 4
Results and Analysis
Previous work [3] showed that the BLEU score is a good measure for evaluating text generation models. All models are therefore evaluated using the BLEU score (Papineni et al., 2002). Figure 4.1 shows the BLEU score results from Experiment 1. From the graph, we can clearly see that the Modified Karpathy’s model always performs ten points lower than the original model. Analysis of the model shows that the weight vector for $X_t$ at each layer is never trained properly. This is because the gradient for the weight vector belonging to $X_t$ at each layer is ignored, as $X_t$ is treated as an input vector. This leads to a large loss in the propagated error during backpropagation.
Figure 4.1: Comparison of Karpathy’s LSTM models and Modified Karpathy’s LSTM models on three categories of tweets points to negative results for these data
In the second experiment, we explored text generation for literary text using different language models. For each thematic type, four language models were trained: char-LM, word-LM, char-LSTM, and word-LSTM. Figure 4.2 shows the BLEU score results from Experiment 2.
Figure 4.2: Comparison of Karpathy’s models and LM models on literary texts shows that LSTM does better with characters vs. words and vice versa for LM.
From the graph, we can clearly see that LSTM-char and LM-word perform very similarly, with the highest BLEU scores, while LM-char scores comparatively lower. Emma scores much lower than Holmes. This is probably because Emma has higher lexical complexity than Holmes: Emma has a type-token ratio of 0.08 whereas Holmes has 0.11, and Emma has a higher average sentence length than the Holmes data set.
An initial analysis shows that performance on the micro-blogging data set is much better than on the literary data set, even though the two have very similar lexical diversity. An interesting phenomenon is that as the average sentence length (in tokens) increases, the BLEU score decreases. Further analysis shows that word usage in micro-blogging snippets is concentrated on very few words, whereas in literary text it is more evenly distributed. After computing frequency counts and removing all words with frequency less than 30 from the birth and Emma data sets, the birth data set has only 102 words remaining, whereas the Emma data set has 213 unique words remaining. This is roughly double the diversity. This initial analysis shows that Emma has richer lexical diversity, which makes it more difficult for the models to predict. Likewise, Holmes has 122 unique words with frequency greater than 30, and marriage and death have 102 and 85, respectively. This indicates that the effectiveness of a model depends on the average sentence length and the lexical diversity of the language. Some example texts generated by the language models are:
• @USER congratulations on the birth of your daughter URL (LSTM-char, birth)
• we as hours of core sores and must overs for your. (LSTM-char, Holmes)
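The frequency analysis used in this chapter can be reproduced with a simple count. The sketch below is an illustration under assumptions: tokens are split on whitespace, the cutoff keeps words that occur at least 30 times (the text says both "less than 30" is removed and "above 30" remains, so the exact boundary is a guess), and `frequent_types` is a hypothetical helper name.

```python
from collections import Counter


def frequent_types(sentences, min_count=30):
    """Return the word types that occur at least min_count times, with their counts."""
    counts = Counter(tok for sentence in sentences for tok in sentence.split())
    return {tok: c for tok, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    toy = ["a a a b", "a b c"]
    print(frequent_types(toy, min_count=2))  # {'a': 4, 'b': 2}
```

Applied to a training set, `len(frequent_types(sentences))` gives the number of remaining word types, which is the quantity compared above for the birth, marriage, death, Emma, and Holmes data sets.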
Chapter 5
Conclusions
From the analysis, we can say that text generation using language models can be a useful solution for many applications. In a well-constrained domain, text generation using language models can be a feasible way to address problems such as data sparsity or data anonymization.
5.1 Future Work
Deep learning methods generally perform better as the training data size increases. A model trained on larger data sets might show better performance. Also, training models on other text types, such as medical data or journals, might offer different perspectives.
Bibliography
[1] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[2] Umar Maqsud. Synthetic text generation for sentiment analysis. In 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA 2015), page 156, 2015.
[3] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.
    Anil Behera, Mayuresh Oak, Titus Thomas, Cecilia Ovesdotter Alm, Emily Prud’hommeaux, Chris Homan, and Ray Ptucha. Generating Clinically Relevant Texts: A Case Study on Life-changing Events. CLASPhysc, 2016.
[4] Brian Roark, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61–66. Association for Computational Linguistics, 2012.
[5] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In INTERSPEECH, pages 194–197, 2012.
[6] Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.