Folktale Classification using Learning to Rank
Dong Nguyen, Dolf Trieschnigg, and Mariët Theune
University of Twente
Folktales
• Fairy tales
• Riddles
• Legends
• Urban legends
• Jokes
• Etc.
Folktale researchers
• Folktales are a resource for research
– Variation in tales
– Shifting moral values, beliefs, identities, etc.
– Intertextuality
Classification systems
• Folktale researchers have developed classification systems
• To compare and analyze stories
• To organize stories
Story types
The Vanishing Hitchhiker (BRUN 01000)
A ghostly or heavenly hitchhiker that vanishes from a vehicle, sometimes after giving warning or prophecy.
• A guy bikes through the park at night. He encounters a girl covered in blood. On their way to the police, she suddenly disappears. She resembles a murdered girl…
• A car driver picks up a hitchhiker and lends her his sweater. When he stops by to pick up the sweater, he discovers she passed away in a car accident a while ago. He finds his sweater on her grave.
• A car driver picks up a girl wearing a white dress. He accidentally spills red wine on her dress. The next day he finds out she died a year ago. When the police open her grave, they find the white dress with the red wine spot.
• A car driver picks up a hitchhiker. They talk about spiritual topics in life. Suddenly the hitchhiker vanishes. The police tell him they have heard the story earlier that day.
Story Type Indexes: ATU
Aarne-Thompson-Uther classification system (ATU)
Red Riding Hood (ATU 0333), The Race between Hare and Tortoise (ATU 0275A), etc.
Story Type Indexes: Brunvand
Urban legends
The Microwaved Pet (BRUN 02000), The Kidney Heist (BRUN 06305), The Killer in the Backseat (BRUN 01305), The Vanishing Hitchhiker (BRUN 01000), etc.
Automatic Identification of Story Types
• Increasing digitalization
• Discover relationships between stories
Outline
• Problem description, corpora
• Experimental setup
– Baselines, Learning to Rank
• Results
– Baselines, Feature analysis, Error analysis
• Discussion/Conclusion
Goal and Evaluation
Given an input story, return a ranking of story types
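• Semi-automatic setting: Mean Reciprocal Rank (MRR), the average over all input stories of $\frac{1}{\mathrm{rank}_i}$, where $\mathrm{rank}_i$ is the position of the correct story type in the ranking for story $i$
• Classification: simulating a classification setting. The highest-ranked label is taken as the predicted class. Measured with accuracy.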
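The two measures can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the function names and the convention of scoring 0 when the correct type is missing from the ranking are assumptions.

```python
from typing import List, Sequence


def reciprocal_rank(ranking: Sequence[str], gold: str) -> float:
    """Return 1/rank of the gold story type (assumed 0.0 if it is not ranked)."""
    for position, label in enumerate(ranking, start=1):
        if label == gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings: List[Sequence[str]], golds: List[str]) -> float:
    """Average the reciprocal rank over all input stories."""
    return sum(reciprocal_rank(r, g) for r, g in zip(rankings, golds)) / len(golds)


def accuracy(rankings: List[Sequence[str]], golds: List[str]) -> float:
    """Fraction of stories whose top-ranked label is the gold story type."""
    hits = sum(1 for r, g in zip(rankings, golds) if r and r[0] == g)
    return hits / len(golds)


# Toy rankings of story types for three input stories.
rankings = [
    ["BRUN 01000", "BRUN 02000"],
    ["BRUN 02000", "BRUN 01305", "BRUN 01000"],
    ["BRUN 06305"],
]
golds = ["BRUN 01000", "BRUN 01000", "BRUN 01305"]
mrr = mean_reciprocal_rank(rankings, golds)
acc = accuracy(rankings, golds)
```

MRR rewards a correct type that appears near the top even when it is not first, which matches the semi-automatic setting; accuracy only looks at the first position.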
Folktale database
• Dutch Folktale Database (http://www.verhalenbank.nl)
• Over 42,000 stories
• In our experiments: only stories written in standard Dutch
Dataset I
3 Story Type Indexes
The Dutch Folktale Database is a large collection of Dutch folktales containing a variety of subgenres, including fairy tales, urban legends and jokes. We only consider stories that are written in standard Dutch (the collection also contains many narratives in historical Dutch, Frisian and Dutch dialects). In this paper we restrict our focus to the two type indexes mentioned in the introduction, the ATU index [25] and the Type-Index of Urban Legends [6]. We created two datasets based on these type indexes. For each type index, we only keep the story types that occur at least two times in our dataset. The frequencies of the story types are plotted in Figure 1. Many story types only occur a couple of times in the database, whereas a few story types have many instances.
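The frequency filter described above (keep only story types that occur at least two times) could look like the sketch below. The `(text, label)` pair representation and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

Story = Tuple[str, str]  # (story text, story type label); hypothetical representation


def filter_rare_types(stories: List[Story], min_count: int = 2) -> List[Story]:
    """Keep only stories whose story type occurs at least min_count times."""
    counts = Counter(label for _, label in stories)
    return [(text, label) for text, label in stories if counts[label] >= min_count]


stories = [
    ("story a", "BRUN 01000"),
    ("story b", "BRUN 01000"),
    ("story c", "BRUN 02000"),  # occurs only once, so it is dropped
]
kept = filter_rare_types(stories)
```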
3.1 Aarne-Thompson-Uther (ATU)
Our first type-index is the Aarne-Thompson-Uther classification (ATU) [25]. Examples of specific story types are Red Riding Hood (ATU 0333) and The Race between Hare and Tortoise (ATU 0275A). The index contains story types hierarchically organized into categories (e.g. Fairy Tales and Religious Tales). We discard stories belonging to the Anecdotes and Jokes category (types 1200-1999), since the story types in this category are very different in nature from the rest of the stories (footnote 2). The average number of words per story is 489 words.
3.2 Brunvand
Our second type index is proposed by Brunvand [6] and is a classification of urban legends. Examples of story types are The Microwaved Pet (BRUN 02000), The Kidney Heist (BRUN 06305) and The Vanishing Hitchhiker (BRUN 01000). The stories have on average 158 words.
Fig. 1. Story type frequencies for (a) ATU and (b) Brunvand. x-axis: number of stories; y-axis: number of story types.
2 As was suggested by a folktale researcher. Story types in the Anecdotes and Jokes category are mostly based on thematic similarity, while others are based on plot.
Dataset II
ATU
                  Index  Train  Dev  Test
Nr documents        400     75   25    50
Nr story types       98     59   24    43

Brunvand
                  Index  Train  Dev  Test
Nr documents        687    175   50    75
Nr story types      125     92   40    50
Baselines: big doc
Diagram: the input document (query) is scored against one big document per story type (Vanishing Hitchhiker, Microwaved Pet, Killer in the Backseat), which gives a ranking of story types.
Baselines: small doc
Diagram: the input document (query) is scored against each individual story; the ranked stories are mapped to their story types (Vanishing Hitchhiker, Microwaved Pet, Killer in the Backseat) to give the ranking.
When taking the top-ranked label as the class, this is the same as a Nearest Neighbour classifier (k=1).
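The two baselines differ only in what counts as a document. The sketch below illustrates that difference under several assumptions: a textbook BM25 scorer with common default parameters (k1=1.2, b=0.75), whitespace tokenization, and hypothetical function names. The slides do not specify the retrieval setup used for the baselines, so this is illustrative only.

```python
import math
from collections import Counter, defaultdict


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the tokenized query (textbook BM25)."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_labels(query, docs, labels):
    """Rank labels by descending score, keeping the first occurrence of each label."""
    scores = bm25_scores(query, docs)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    ranking = []
    for i in order:
        if labels[i] not in ranking:
            ranking.append(labels[i])
    return ranking


def smalldoc_ranking(query, stories):
    """Small doc: every story is its own document."""
    docs = [text.split() for text, _ in stories]
    labels = [label for _, label in stories]
    return rank_labels(query.split(), docs, labels)


def bigdoc_ranking(query, stories):
    """Big doc: all stories of one story type are merged into a single document."""
    merged = defaultdict(list)
    for text, label in stories:
        merged[label].extend(text.split())
    labels = list(merged)
    return rank_labels(query.split(), [merged[label] for label in labels], labels)


stories = [
    ("driver picks up hitchhiker who vanishes", "BRUN 01000"),
    ("girl hitchhiker leaves sweater on grave", "BRUN 01000"),
    ("pet dried in microwave", "BRUN 02000"),
    ("killer hides in backseat of car", "BRUN 01305"),
]
query = "hitchhiker vanishes from car"
small = smalldoc_ranking(query, stories)
big = bigdoc_ranking(query, stories)
```

Taking `small[0]` as the predicted class corresponds to the k=1 nearest-neighbour reading given above.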
Results - Baseline
ATU
            MRR     Accuracy
Smalldoc    0.7779  0.72
Bigdoc      0.4423  0.36

Brunvand
            MRR     Accuracy
Smalldoc    0.6430  0.56
Bigdoc      0.6411  0.56
Learning to Rank
1. Retrieve an initial set of candidate stories (small document baseline).
2. Apply learning to rank to rerank the top 50 candidates.
3. Create a final ranked list of story types, by taking the corresponding labels of the ranked stories and removing duplicates.
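Steps 2 and 3 can be sketched as below. The slides do not name the learning-to-rank algorithm, so a linear model with given weights stands in for it here; the feature names, the candidate representation and the function name are all assumptions.

```python
from typing import Dict, List, Tuple

# (story type label, feature values) for one candidate story; hypothetical representation
Candidate = Tuple[str, Dict[str, float]]


def rerank(candidates: List[Candidate], weights: Dict[str, float], top_k: int = 50) -> List[str]:
    """Rerank the top_k candidates with a linear model and return unique story types."""
    head = candidates[:top_k]  # candidates are assumed to arrive in baseline order
    scored = sorted(
        head,
        key=lambda c: sum(weights.get(name, 0.0) * value for name, value in c[1].items()),
        reverse=True,
    )
    ranking: List[str] = []
    for label, _ in scored:
        if label not in ranking:  # step 3: remove duplicate labels
            ranking.append(label)
    return ranking


candidates = [
    ("BRUN 02000", {"ir": 0.9, "bigdoc": 0.1}),
    ("BRUN 01000", {"ir": 0.8, "bigdoc": 0.9}),
    ("BRUN 01000", {"ir": 0.7, "bigdoc": 0.8}),
    ("BRUN 01305", {"ir": 0.2, "bigdoc": 0.2}),
]
weights = {"ir": 0.3, "bigdoc": 0.7}  # illustrative values, not learned weights
final_ranking = rerank(candidates, weights)
```

In this toy example the candidate that the baseline ranked first drops to second place once the big document feature is taken into account.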
Features I
• Small Document Scores (IR)
– Indicates the score of the query on the candidate stories.
– Full text (BM25 - full text), only nouns (BM25 - nouns) and only verbs (BM25 - verbs)
• Big Document Scores (Bigdoc)
– Similarity to all stories of the candidate's story type (bigdoc)
– Full text (Bigdoc - BM25 - full text), only nouns (Bigdoc - BM25 - nouns) and only verbs (Bigdoc - BM25 - verbs)
• Lexical Similarity (LS)
– Jaccard and TF-IDF similarity, calculated on the following token types: unigrams, bigrams, character n-grams (2-5), chunks, named entities, time and locations.
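As an illustration of the lexical similarity features, the sketch below computes Jaccard similarity over word n-grams and character n-grams. Lowercasing and whitespace tokenization are simplifying assumptions, the helper names are hypothetical, and the TF-IDF variant and the chunk, entity, time and location token types are not covered.

```python
from typing import Hashable, Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of word n-grams (n=1 for unigrams, n=2 for bigrams)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def char_ngrams(text: str, n_min: int = 2, n_max: int = 5) -> Set[str]:
    """Set of character n-grams for every length from n_min to n_max."""
    s = text.lower()
    return {s[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(s) - n + 1)}


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


query = "the driver picks up a girl"
candidate = "the driver picks up a hitchhiker"
unigram_sim = jaccard(word_ngrams(query, 1), word_ngrams(candidate, 1))
bigram_sim = jaccard(word_ngrams(query, 2), word_ngrams(candidate, 2))
char_sim = jaccard(char_ngrams(query), char_ngrams(candidate))
```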
Features II
• Verb(Subject, Object) triplets
• Extracted based on dependencies obtained using Frog parser
Lives(princess, castle)
Disappear(driver,)
Features III
• Matches
– Exact, Subject-Object, Verb-Object, Subject-Verb
• Abstraction
– VerbNet
Example: ‘consider-29.9’ class in VerbNet: achten (esteem), bevinden (find), inzien (realise), menen (think/believe), veronderstellen (presume), kennen (know), wanen (falsely believe), denken (think)
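The match types and the verb abstraction can be combined into one similarity function, sketched below as Jaccard overlap between two sets of triplets. The `Triplet` type, the `verb_classes` lookup table and the function names are hypothetical, and treating a missing subject or object as matching another missing one is a simplification. Extracting the triplets themselves (with Frog) is not shown.

```python
from typing import Dict, NamedTuple, Optional, Set


class Triplet(NamedTuple):
    verb: str
    subject: Optional[str]
    obj: Optional[str]


# Each match type compares only part of a triplet.
PROJECTIONS = {
    "exact": lambda t: (t.verb, t.subject, t.obj),
    "subject_object": lambda t: (t.subject, t.obj),
    "verb_object": lambda t: (t.verb, t.obj),
    "subject_verb": lambda t: (t.subject, t.verb),
}


def abstract(t: Triplet, verb_classes: Dict[str, str]) -> Triplet:
    """Replace the verb by its class if one is known, otherwise keep the verb."""
    return t._replace(verb=verb_classes.get(t.verb, t.verb))


def triplet_jaccard(a: Set[Triplet], b: Set[Triplet], match: str = "exact",
                    verb_classes: Optional[Dict[str, str]] = None) -> float:
    """Jaccard overlap of two triplet sets under a match type and optional abstraction."""
    if verb_classes:
        a = {abstract(t, verb_classes) for t in a}
        b = {abstract(t, verb_classes) for t in b}
    project = PROJECTIONS[match]
    pa, pb = {project(t) for t in a}, {project(t) for t in b}
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


# Two verbs from the 'consider-29.9' example above, mapped to the same class.
verb_classes = {"denken": "consider-29.9", "menen": "consider-29.9"}
story_a = {Triplet("denken", "chauffeur", "meisje")}
story_b = {Triplet("menen", "chauffeur", "meisje")}

exact_plain = triplet_jaccard(story_a, story_b, "exact")
exact_abstract = triplet_jaccard(story_a, story_b, "exact", verb_classes)
so_plain = triplet_jaccard(story_a, story_b, "subject_object")
```

Without abstraction the two stories share no exact triplet; mapping both verbs to their shared class, or ignoring the verb with a Subject-Object match, makes them overlap.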
Feature Analysis I
ATU
                       MRR     Accuracy
Baseline (smalldoc)    0.7779  0.72
+ Bigdoc               0.8367  0.78
+ IR                   0.8049  0.76
+ LS                   0.7921  0.72
+ Triplets             0.8016  0.72
All                    0.8569  0.82

Brunvand
                       MRR     Accuracy
Baseline (smalldoc)    0.6430  0.56
+ Bigdoc               0.7933  0.72
+ IR                   0.7247  0.61
+ LS                   0.6810  0.60
+ Triplets             0.6600  0.59
All                    0.8132  0.76
Feature Analysis II
ATU
Feature                                        Weight
Bigdoc: BM25 - nouns                           0.179
Bigdoc: BM25 - full text                       0.158
LS: unigrams - TFIDF                           0.109
Bigdoc: BM25 - verbs                           0.069
Triplets: SO match, Jaccard, no abstraction    0.063

Brunvand
Feature                                        Weight
Bigdoc: BM25 - full text                       0.209
Bigdoc: BM25 - nouns                           0.204
LS: unigrams - TFIDF                           0.065
IR: BM25 - nouns                               0.062
Bigdoc: BM25 - verbs                           0.051
Error analysis
• Matches based on style instead of actual plot. Distinguishing narrator, or very short stories.
• ATU: Also matched on content words that were not core to the plot. Happened in particular with long stories.
Discussion
• High MRR
– Suitable for a semi-automatic setting, where annotators are presented with a ranked list
• What’s next? Other type indexes, dialects, historical variants.
Summary
• Classification of story types.
• Two story type indexes: ATU and Brunvand.
• Nearest Neighbor using Learning to Rank approach.
• Combining a small document and big document model was very effective.