Ideas for 100K Word Data Set for Human and Machine Learning
Lori Levin, Alon Lavie, Jaime Carbonell
Language Technologies Institute, Carnegie Mellon University
The data set should support
Machine learning: Machine learning from small data can work if the data is structured.
Analysis by humans: Humans can learn a lot from a small data set if the form-function mappings are clear.
Concrete Suggestions
1. Hand align a portion of the corpus.
2. Include parse trees and feature structures for a portion of the corpus.
3. Include a representative sample of diversity of phrase structures.
4. Include a representative sample of diversity in function/meaning.
5. Include some simple, single sentences.
6. Include some full texts.
7. Look for well-known divergences.
8. Conduct an evaluation to be sure that the corpus elicits what you want it to elicit.
Hand align a portion of the corpus
Automatic alignment algorithms can be bootstrapped from the hand alignments.
A lexicon can be created from the alignments.
Humans can study word usage.
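As an illustration of how a lexicon might be derived from hand alignments, here is a minimal sketch. The data format (token lists plus index pairs), the function name `build_lexicon`, and the toy English-Spanish examples are all assumptions made for this example; they are not taken from the slides.

```python
from collections import Counter, defaultdict


def build_lexicon(aligned_pairs):
    """Count how often each source word is hand-aligned to each target word.

    aligned_pairs: iterable of (source_tokens, target_tokens, links), where
    links is a list of (source_index, target_index) pairs. This format is
    a hypothetical one chosen for the sketch.
    """
    counts = defaultdict(Counter)
    for source_tokens, target_tokens, links in aligned_pairs:
        for i, j in links:
            counts[source_tokens[i]][target_tokens[j]] += 1
    return counts


# Hypothetical toy corpus with hand-made alignment links.
corpus = [
    (["the", "house"], ["la", "casa"], [(0, 0), (1, 1)]),
    (["the", "white", "house"], ["la", "casa", "blanca"],
     [(0, 0), (1, 2), (2, 1)]),
]

lexicon = build_lexicon(corpus)

# Most frequent aligned translation for each source word.
best = {src: tgt.most_common(1)[0][0] for src, tgt in lexicon.items()}
```

The per-word counts in `lexicon` can also be browsed directly by a human who wants to study word usage.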
Provide parse trees for a portion of the corpus
Parse trees plus alignments can be input to Avenue-style rule learning and to automatic treebanking of the minor language.
Humans can study the translation of specific structures.
There should be semantic and functional information in addition to structural information. See below.
Include a representative example of structural diversity
Part of the corpus can be structured to include simple, common sub-trees from the English Penn TreeBank.
Learn a collection of structural mappings that is compositional.
A lot of mileage from small data.
Preliminary work with Katharina Probst: raw WSJ data requires editing, and redundant examples of each structure are needed.
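One way to check for redundant examples of each structure is to count how many corpus sentences instantiate each sub-tree and flag the ones that fall short. The sketch below assumes each sentence has already been labeled with a structure string; the labels, the threshold `min_examples`, and the function name are hypothetical.

```python
from collections import Counter


def underrepresented(structure_labels, min_examples=3):
    """Return the structures that have fewer than min_examples instances."""
    counts = Counter(structure_labels)
    return sorted(label for label, n in counts.items() if n < min_examples)


# Hypothetical structure labels, one per corpus sentence.
labels = [
    "NP -> DT NN", "NP -> DT NN", "NP -> DT NN",
    "NP -> DT JJ NN",
    "PP -> IN NP", "PP -> IN NP",
]

needs_more = underrepresented(labels)
```

Structures listed in `needs_more` would be candidates for additional example sentences.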
Include a representative example of function or meaning
Finding out how English structures translate into minor language structures is not enough.
For example, finding out how to translate English auxiliary verbs is not useful because they have many functions: tense, aspect, epistemics, evidentials, etc.
Finding out how to express tense, aspect, epistemics, evidentials, etc. is useful.
Include some multi-sentence texts
In order to observe:
Temporal sequencing of events
Causation
Rhetorical relations: contrast, elaboration, etc.
Given and new information
Co-reference
Look for well-known divergences
E.g., run across the street vs cross the street running
But see below for our view of divergences.
Include some simple sentences
So that the form-function mapping is clear to a human without confounding factors
As a seed for machine learning
Evaluation
Test the corpus on a few languages to be sure that the intended structures and functions are elicited.
Need to watch out for idiosyncrasies, lexical gaps, special constructions, etc.
For example, if you want to elicit a noun modified by a prepositional phrase, "the person in the room" will work better than "a bottle of wine".
Hard problems
Body of common phenomena with a tail of phenomena that are individually rare, but collectively massive.
Extra slides
Our view of translation divergences
Elaboration on the different roles of structure and function
Our view of divergences, which differs from some other views of divergences
Divergences arise when the same function is expressed by a different structure.
Many functions are expressed by specialized constructions that do not translate literally into other languages.
Divergences cannot be neatly grouped into a few classes.
Typological differences between languages are relevant:
Embedding vs serialization
Synthetic vs analytic causative constructions
Coverage: Structure and Function
Structural Diversity: appositives, adjuncts, embedded clauses, coordinate structures, ellipsis, etc.
Functional/Meaning Diversity: temporal relations, rhetorical relations, modality, negation, tense, aspect, etc.
Structure and Function
The way you understand a text is by knowing which structure has which function.
The same function is expressed by different structures in different languages.
What a human needs to know (function)
Who did what to who when?
What happened before/after what?
What caused what?
Is it first hand knowledge, hearsay, or inference?
Is it certain, probable, or improbable?
Did it happen or not?
What do these words mean?
How a human knows these things (structure/grammar)
Who did what to who when? Grammatical relations, coreference, time expressions, pronouns/pro-drop, nominalizations, subordinate clauses, case marking, word order, agreement, tense, aspect
What happened before/after what? Time expressions, temporal connectives, tense and aspect morphemes
What caused what? Markers of rhetorical relations between sentences
Is it first hand knowledge, hearsay, or inference? Is it certain, probable, or improbable? Markers of modality and epistemics
Did it happen or not? Markers of negation and counterfactuals
What do these words mean? Vocabulary
Other: questions, existentials, possessives, coordinate structures
How to make sure the corpus captures what a human needs to know
Organize the corpus by function and then a human can observe the corresponding structure.
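A function-first organization could be represented as a simple index from function label to the sentence pairs that illustrate it. The following is a minimal sketch; the `Entry` fields, the function labels, and the placeholder translations are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Entry:
    english: str      # elicitation sentence
    translation: str  # minor-language translation (placeholder here)
    function: str     # function/meaning label assigned by the corpus designer


def group_by_function(entries):
    """Index corpus entries by function so each structure can be inspected."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry.function].append(entry)
    return dict(groups)


# Hypothetical entries; translations are left as placeholders.
entries = [
    Entry("The man did not leave.", "<translation 1>", "negation"),
    Entry("Nobody saw the dog.", "<translation 2>", "negation"),
    Entry("She left before he arrived.", "<translation 3>", "temporal relation"),
]

groups = group_by_function(entries)
```

A human analyst can then read all entries under one label, such as `groups["negation"]`, and observe which structures the language uses for that function.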
Coverage of data for human analysis: basics
Closed Class and Special Constructions: dates, names, numbers, prices, etc.; pronouns, prepositions, etc.
Encoding of grammatical relations and/or semantic roles. How do you know who did what to who? Word order, case marking, agreement
Encoding of old and new information: word order, special constructions (e.g., clefts), etc.
Questions
Negation
Modification
Possession
Coordination
Indirect speech
Coverage of data for human analysis: multi-sentence and multi-clause
Rhetorical relations: cause, elaboration, contrast, etc.
Temporal relations: before, after, during, etc.
Same subject and obviation phenomena
Subordination: as subject or object, as complement, as adjunct
Other grammatically encoded meanings
Modality and Epistemics: certainty, source of information (first hand, second hand, inference), etc.
Conditionals
Comparatives
Existentials
Tense and aspect
Definiteness