Annotation and Feature Engineering
Introduction to Data Science Algorithms
Jordan Boyd-Graber and Michael Paul
HOUSES, SPOILERS, AND TRIVIA
Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 1 of 13
Humans doing Incremental Classification
• Game called “quiz bowl”
• Two teams play each other
◦ Moderator reads a question
◦ When a team knows the answer, they signal (“buzz” in)
◦ If right, they get points; otherwise, rest of the question is read to the other team
• Hundreds of teams in the US alone
• Example . . .
Sample Question 1
With Leo Szilard, he invented a doubly-eponymous refrigerator with no moving parts. He did not take interaction with neighbors into account when formulating his theory of heat capacity, so Debye adjusted the theory for low temperatures. His summation convention automatically sums repeated indices in tensor products. His name is attached to the A and B coefficients for spontaneous and stimulated emission, the subject of one of his multiple groundbreaking 1905 papers. He further developed the model of statistics sent to him by Bose to describe particles with integer spin. For 10 points, who is this German physicist best known for formulating the special and general theories of relativity?
Albert Einstein
Humans doing Incremental Classification
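• This is not Jeopardy
• Jeopardy has buzzers, but players can only buzz at the end of a question
• That format doesn’t discriminate between levels of knowledge
• Quiz bowl questions are pyramidal: clues start hard and get easier, so buzzing earlier shows more knowledge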
Research Question: How do we know if a guess is correct?
• Turn (question, guess) into features
• Treat it as a binary classification problem
• What features help us do this well?
• Subject of HW3
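One way to read the first two bullets, as a minimal sketch: every (question, guess) pair is passed through a set of feature functions, and the target is whether the guess is right. The function names and the two toy features here are illustrative assumptions, not the HW3 code.

```python
from typing import Callable, Dict

# Each feature function maps (question text so far, guessed page) to a number.
FeatureFn = Callable[[str, str], float]


def text_length(text: str, guess: str) -> float:
    """Hypothetical feature: number of characters of the question seen so far."""
    return float(len(text))


def guess_in_text(text: str, guess: str) -> float:
    """Hypothetical feature: 1.0 if the guess string literally appears in the text."""
    return 1.0 if guess.lower() in text.lower() else 0.0


FEATURES: Dict[str, FeatureFn] = {
    "text_length": text_length,
    "guess_in_text": guess_in_text,
}


def featurize(text: str, guess: str) -> Dict[str, float]:
    """Turn one (question, guess) pair into a named feature vector."""
    return {name: fn(text, guess) for name, fn in FEATURES.items()}


def label(guess: str, answer: str) -> int:
    """Binary target: 1 if the guess is the correct answer, else 0."""
    return int(guess == answer)


features = featurize("With Leo Szilard, he invented", "Albert Einstein")
target = label("Albert Einstein", "Albert Einstein")
```

Keeping the features in a dictionary makes it cheap to add or drop one and rerun the classifier, which is how the question "what features help?" gets answered in practice.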
Provided Dataset
• text: the clues revealed so far
• page: a guess at the answer
• answer: the actual answer (closest Wikipedia page)
• body_score: IR (information retrieval) measure of how well the text matches the guessed page
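The four fields can be pictured as one record per guess. This is a rough sketch only: the CSV layout, the two made-up rows, and the `load_examples` helper are assumptions for illustration, and `corr` borrows the legend name used in the plots for "the guess is correct".

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Example:
    text: str          # the clues revealed so far
    page: str          # a guess at the answer
    answer: str        # the actual answer (closest Wikipedia page)
    body_score: float  # IR measure of how well the text matches the guess

    @property
    def corr(self) -> bool:
        """Whether the guess is correct; this is the label to predict."""
        return self.page == self.answer


# Made-up rows for illustration only; the real file layout may differ.
RAW = """text,page,answer,body_score
"With Leo Szilard, he invented",Albert Einstein,Albert Einstein,120.5
"With Leo Szilard, he invented",Enrico Fermi,Albert Einstein,80.0
"""


def load_examples(handle) -> list:
    """Parse CSV rows into Example objects, converting body_score to float."""
    return [
        Example(row["text"], row["page"], row["answer"], float(row["body_score"]))
        for row in csv.DictReader(handle)
    ]


examples = load_examples(io.StringIO(RAW))
labels = [ex.corr for ex in examples]
```

Note that the same question text can appear in several rows, once per candidate guess, so the label belongs to the (text, page) pair and not to the question.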
Baseline
• What if we always say that the answer is wrong?
• Performance: 0.54
• Every feature should do better than this (otherwise, it’s useless)
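The baseline needs no features at all, as the sketch below shows on toy labels (not the course data). If the 0.54 above is accuracy, it is simply the share of guesses in the data that are wrong.

```python
from collections import Counter


def always_wrong_accuracy(labels):
    """Accuracy of always predicting that the guess is wrong (label False)."""
    return sum(1 for y in labels if not y) / len(labels)


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting whichever label is most common."""
    _, majority_count = Counter(labels).most_common(1)[0]
    return majority_count / len(labels)


# Toy labels for illustration: 3 correct guesses, 7 wrong ones.
toy_labels = [True] * 3 + [False] * 7
baseline = always_wrong_accuracy(toy_labels)
```

When wrong guesses are the majority, as in the toy labels, the two functions agree; a feature is only worth keeping if the classifier using it beats this number.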
Page Name
• The titles of Wikipedia pages often have a disambiguator in parentheses
1 Paris (mythology)
2 Paris (song)
3 Paris (genus)
4 Paris (band)
• Feature is 1 if the page’s disambiguator appears in the question text
◦ “This band performed . . . ”, Paris (band) → True
◦ “This band performed . . . ”, Paris (mythology) → False
• Slight improvement: 0.58
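A minimal sketch of the page-name feature, assuming the disambiguator is always a trailing parenthetical and that a whole-word, case-insensitive match is what counts. Both choices, and the function name, are assumptions; the slides only give the two examples.

```python
import re

# Matches a trailing parenthetical such as "(band)" in "Paris (band)".
DISAMBIGUATOR = re.compile(r"\(([^)]+)\)\s*$")


def disambiguator_in_text(text: str, page: str) -> int:
    """Return 1 if the page title's parenthetical appears as a word in the text."""
    match = DISAMBIGUATOR.search(page)
    if match is None:
        return 0  # no disambiguator, so the feature does not fire
    word = match.group(1).lower()
    # Word-boundary match so "band" does not fire on "husband".
    pattern = r"\b" + re.escape(word) + r"\b"
    return int(re.search(pattern, text.lower()) is not None)


band = disambiguator_in_text("This band performed", "Paris (band)")
myth = disambiguator_in_text("This band performed", "Paris (mythology)")
```

Pages without a parenthetical always get 0, which may be part of why the feature helps only slightly on its own.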
Links
• The more links a Wikipedia page has, the more popular it is
• Popularity is often a sign of a wrong answer
• By itself, doesn’t do so well: 0.56
• But improves if we take the log of the value: 0.61
[Figure: density of log_links, split by corr (False vs. True)]
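Link counts are heavy-tailed, so a few very popular pages dominate the raw value; taking the log compresses that tail. The sketch below uses `math.log1p` so that a page with zero links stays finite. That choice and the toy counts are assumptions, and the exact transform behind log_links is not given in the slides.

```python
import math


def log_links(links: int) -> float:
    """Log-transform a raw link count; log1p keeps a count of zero finite."""
    return math.log1p(links)


raw_counts = [0, 9, 99, 9999]  # hypothetical link counts
transformed = [log_links(n) for n in raw_counts]
```

The ordering of pages is unchanged, but a page with 9999 links now sits only a few units from a page with 9, which is easier for a linear classifier to use.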
Score
• We can see how similar the text of a Wikipedia page is to the question
• The higher, the better
• This feature alone gives accuracy of 0.75
[Figure: density of body_score, split by corr (False vs. True)]
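To see how a single feature can act as a classifier, here is a sketch that picks the cutoff on one score that maximizes training accuracy. The slides do not say which classifier produced the 0.75, so a plain threshold is swapped in, and the scores and labels are invented for illustration.

```python
def best_threshold(scores, labels):
    """Pick the cutoff on one feature that maximizes training accuracy.

    Predicts "correct" when score >= cutoff. Returns (cutoff, accuracy).
    """
    # Candidate cutoffs: every observed score, plus one above the max
    # (which corresponds to always predicting "wrong").
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    best = (candidates[0], 0.0)
    for cutoff in candidates:
        hits = sum((s >= cutoff) == y for s, y in zip(scores, labels))
        accuracy = hits / len(labels)
        if accuracy > best[1]:  # ties keep the lower cutoff
            best = (cutoff, accuracy)
    return best


# Hypothetical body_score values and labels, for illustration only.
toy_scores = [20.0, 35.0, 50.0, 90.0, 140.0, 210.0]
toy_labels = [False, False, True, False, True, True]
cutoff, accuracy = best_threshold(toy_scores, toy_labels)
```

Because the cutoff is tuned on the same rows it is scored on, this number is optimistic; held-out data would give a fairer estimate.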
Length
• The more text we see, the more confident we should be
• By itself, doesn’t do so well: 0.56
• But when combined with the IR score, does great: 0.82 (best so far)
[Figure: density of obs_len, split by corr (False vs. True)]
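Why two weak-looking features can be strong together is easier to see on a toy example. The sketch below trains a perceptron (a stand-in; the slides do not name the classifier) on invented (body_score, obs_len) pairs that no single cutoff on either feature separates, but a line through both does. The data, the rescaling constants, and all function names are assumptions.

```python
def train_perceptron(rows, labels, max_epochs=1000):
    """Fit a linear classifier with the perceptron rule; returns weights (bias last)."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for row, y in zip(rows, labels):
            x = list(row) + [1.0]          # append constant bias input
            target = 1.0 if y else -1.0
            activation = sum(w * xi for w, xi in zip(weights, x))
            if target * activation <= 0:   # misclassified (or on the boundary)
                weights = [w + target * xi for w, xi in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:                  # every training row is now correct
            break
    return weights


def predict(weights, row):
    """Return True when the linear score is positive."""
    x = list(row) + [1.0]
    return sum(w * xi for w, xi in zip(weights, x)) > 0


def accuracy(weights, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(weights, r) == y for r, y in zip(rows, labels)) / len(labels)


# Hypothetical (body_score, obs_len, corr) triples; not the course data.
toy = [
    (60, 200, False), (150, 100, False), (60, 600, False), (180, 150, False),
    (240, 1000, True), (150, 800, True), (210, 400, True), (90, 1100, True),
]
# Rescale both features to roughly [0, 1] so neither dominates the updates.
rows = [(score / 300.0, length / 1200.0) for score, length, _ in toy]
labels = [corr for _, _, corr in toy]

weights = train_perceptron(rows, labels)
train_accuracy = accuracy(weights, rows, labels)
```

In the toy data a correct guess with a modest body_score (90) has a long observed text, and a wrong guess with a higher score (180) has a short one, so the classifier needs both numbers to sort them out.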
Others . . .
• Tournament the question was used in
• The type of thing the answer is
• Try your own, be creative!
• Last year’s feature engineering assignment