TRANSCRIPT
TTIC 31190: Natural Language Processing
Kevin Gimpel, Winter 2016
Lecture 1: Introduction
What is natural language processing?

An experimental computer science research area that includes problems and solutions pertaining to the understanding of human language.
Text Classification
• spam / not spam
• priority level
• category (primary / social / promotions / updates)
Sentiment Analysis
Machine Translation

New Poll: Will you buy an Apple Watch?
Question Answering
Summarization

The Apple Watch has drawbacks. There are other smartwatches that offer more capabilities.
Dialog Systems

user: Schedule a meeting with Matt and David on Thursday.
computer: Thursday won't work for David. How about Friday?
user: I'd prefer Monday then, but Friday would be ok if necessary.
Part-of-Speech Tagging

determiner  verb (past)  prep.  proper  proper  poss.  adj.   noun
Some        questioned   if     Tim     Cook    's     first  product

modal  verb  det.  adjective  noun  prep.  proper  punc.
would  be    a     breakaway  hit   for    Apple   .

Named Entity Recognition

Some questioned if [Tim Cook]_PERSON 's first product would be a breakaway hit for [Apple]_ORGANIZATION .
Syntactic Parsing

figure credit: Durrett & Klein (2014)
Coreference Resolution / Entity Linking

Revenues of $14.5 billion were posted by Dell_1. The company_1 ...

en.wikipedia.org/wiki/Dell (infobox type: company): ORGANIZATION
en.wikipedia.org/wiki/Michael_Dell (infobox type: person): PERSON

(figure from Durrett & Klein, 2014; its caption: "Coreference can help resolve ambiguous cases of semantic types or entity links: propagating information across coreference arcs can inform us that, in this context, Dell is an organization and should therefore link to the article on Dell in Wikipedia.")
"Winograd Schema" Coreference Resolution

The man couldn't lift his son because he was so weak.   (he = the man)
The man couldn't lift his son because he was so heavy.  (he = the son)
Reading Comprehension

Once there was a boy named Fritz who loved to draw. He drew everything. In the morning, he drew a picture of his cereal with milk. His papa said, "Don't draw your cereal. Eat it!" After school, Fritz drew a picture of his bicycle. His uncle said, "Don't draw your bicycle. Ride it!" ...

What did Fritz draw first?
A) the toothpaste
B) his mama
C) cereal and milk
D) his bicycle
Conspicuous by their absence...
• speech recognition (see TTIC 31110)
• information retrieval and web search
• knowledge representation
• recommender systems
Computational Linguistics vs. Natural Language Processing
• how do they differ?
Computational Biology vs. Bioinformatics

"Computational biology = the study of biology using computational techniques. The goal is to learn new biology, knowledge about living systems. It is about science.

Bioinformatics = the creation of tools (algorithms, databases) that solve problems. The goal is to build useful tools that work on biological data. It is about engineering."
-- Russ Altman
Computational Linguistics vs. Natural Language Processing
• many people think of the two terms as synonyms
• computational linguistics is more inclusive; more likely to include sociolinguistics, cognitive linguistics, and computational social science
• NLP is more likely to use machine learning and involve engineering/system-building
Is NLP Science or Engineering?
• the goal of NLP is to develop technology, which takes the form of engineering
• though we try to solve today's problems, we seek principles that will be useful for the future
• if science, it's not linguistics or cognitive science; it's the science of computational processing of language
• so I like to think that we're doing the science of engineering
Course Overview
• New course, first time being offered
• Aimed at first-year PhD students
• Instructor office hours: Mondays 3-4pm, TTIC 531
• Teaching assistant: Lifu Tu, TTIC PhD student
Prerequisites
• No course prerequisites, but I will assume:
  – some programming experience (no specific language required)
  – familiarity with basics of probability, calculus, and linear algebra
• Undergraduates with relevant background are welcome to take the course. Please bring an enrollment approval form to me if you can't enroll online.
Grading
• 3 assignments (15% each)
• midterm exam (15%)
• course project (35%):
  – preliminary report and meeting with instructor (10%)
  – class presentation (5%)
  – final report (20%)
• class participation (5%)
• no final
Assignments
• Mixture of formal exercises, implementation, experimentation, analysis
• "Choose your own adventure" component based on your interests, e.g.:
  – exploratory data analysis
  – machine learning
  – implementation/scalability
  – model and error analysis
  – visualization
Project
• Replicate [part of] a published NLP paper, or define your own project.
• The project may be done individually or in a group of two. Each group member will receive the same grade.
• More details to come.
Collaboration Policy
• You are welcome to discuss assignments with others in the course, but solutions and code must be written individually
Textbooks
• All are optional
• Speech and Language Processing, 2nd Ed.
  – some chapters of 3rd edition are online
• The Analysis of Data, Volume 1: Probability
  – freely available online
• Introduction to Information Retrieval
  – freely available online
Roadmap
• classification
• words
• lexical semantics
• language modeling
• sequence labeling
• syntax and syntactic parsing
• neural network methods in NLP
• semantic compositionality
• semantic parsing
• unsupervised learning
• machine translation and other applications
Why is NLP hard?
• ambiguity and variability of linguistic expression:
  – variability: many forms can mean the same thing
  – ambiguity: one form can mean many things
• there are many different kinds of ambiguity
• each NLP task has to address a distinct set of kinds
Word Sense Ambiguity
• many words have multiple meanings
(examples shown as images; credit: A. Zwicky)

Attachment Ambiguity

Meaning Ambiguity
Text Classification
• simplest user-facing NLP application
• email (spam, priority, categories)
• sentiment
• topic classification
• others?
What is a classifier?
• a function from inputs x to classification labels y
• one simple type of classifier:
  – for any input x, assign a score to each label y, parameterized by vector θ:
      score(x, y, θ)
  – classify by choosing highest-scoring label:
      classify(x, θ) = argmax_y score(x, y, θ)
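The classifier just defined can be sketched in a few lines of Python: score every label, take the argmax. The label set, the toy weights, and the particular score function below are illustrative assumptions, not part of the slides (the course defines the score function later).

```python
# Sketch of a score-based classifier: score each label, pick the best.
LABELS = ["positive", "negative"]  # hypothetical label set for illustration

def score(x, y, theta):
    """Placeholder score function: sums weights of (word, label) pairs seen in x."""
    return sum(theta.get((word, y), 0.0) for word in x)

def classify(x, theta):
    """classify(x, theta) = argmax over labels y of score(x, y, theta)."""
    return max(LABELS, key=lambda y: score(x, y, theta))

# Toy weights, made up for the demo:
theta = {("great", "positive"): 1.0, ("boring", "negative"): 1.0}
print(classify(["a", "great", "film"], theta))  # -> positive
```

Note that `classify` does not care how `score` is computed; that separation is exactly the modeling/inference split discussed in the following slides.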
Course Philosophy
• From reading papers, one gets the idea that machine learning concepts are monolithic, opaque objects
  – e.g., naïve Bayes, logistic regression, SVMs, CRFs, neural networks, LSTMs, etc.
• Nothing is opaque
• Everything can be dissected, which reveals connections
• The names above are useful shorthand, but not useful for gaining understanding

Course Philosophy
• We will draw from machine learning, linguistics, and algorithms, but technical material will be (mostly) self-contained; we won't use many black boxes
• We will focus on declarative (rather than procedural) specifications, because they highlight connections and differences
Modeling, Inference, Learning
• Modeling: How do we assign a score to an (x, y) pair using parameters θ?
    modeling: define score function score(x, y, θ)
• Inference: How do we efficiently search over the space of all labels?
    inference: solve argmax_y score(x, y, θ)
• Learning: How do we choose θ?
    learning: choose θ
• We will use this same paradigm throughout the course, even when the output space size is exponential in the size of the input or is unbounded (e.g., machine translation)
Notation
• We'll use boldface for vectors: x
• Individual entries will use subscripts and no boldface, e.g., for entry i: x_i
Modeling: Linear Models
• Score function is linear in θ:
    score(x, y, θ) = θ · f(x, y)
• f: feature function vector
• θ: weight vector
• How do we define f?
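The linear model above, score(x, y, θ) = θ · f(x, y), amounts to a dot product between a weight vector and a feature vector. A minimal sketch, using a dict as a sparse feature vector (feature names and weights here are made-up examples):

```python
def f(x, y):
    """Hypothetical feature function: one indicator feature per (word, label) pair."""
    feats = {}
    for word in x:
        feats[f"word={word},label={y}"] = 1.0
    return feats

def score(x, y, theta):
    """Linear model: dot product of the weight vector and the feature vector."""
    return sum(theta.get(name, 0.0) * value for name, value in f(x, y).items())

theta = {"word=great,label=positive": 1.0}  # toy weight vector
print(score(["a", "great", "film"], "positive", theta))  # -> 1.0
```

Representing both θ and f(x, y) as dicts keyed by feature name is a common trick for sparse NLP feature vectors: the dot product only touches features that actually fire.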
Defining Features
• This is a large part of NLP
• Last 20 years: feature engineering
• Last 2 years: representation learning
• In this course, we will do both
• Learning representations doesn't mean that we don't have to look at the data or the output!
• There's still plenty of engineering required in representation learning
Feature Engineering
• Often decried as "costly, hand-crafted, expensive, domain-specific", etc.
• But in practice, simple features typically give the bulk of the performance
• Let's get concrete: how should we define features for text classification?
Feature Engineering for Text Classification

x is now a vector because it is a sequence of words

let's consider sentiment analysis:

so, here is our sentiment classifier that uses a linear model:
    classify(x, θ) = argmax_y θ · f(x, y)
Feature Engineering for Text Classification
• Two features:
    f_1(x, y) = I[x contains "great" and y = positive]
    f_2(x, y) = I[x contains "great" and y = negative]
  where I[·] returns 1 if its argument is true and 0 otherwise
• What should the weights be?
• Let's say we set θ_1 > 0 > θ_2, e.g., θ_1 = 1, θ_2 = -1
• On sentences containing "great" in the Stanford Sentiment Treebank training data, this would get us an accuracy of 69%
• But "great" only appears in 83/6911 examples
  – variability: many other words can indicate positive sentiment
  – ambiguity: "great" can mean different things in different contexts
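These two "great" features plug directly into the linear classifier. A sketch (the weight values are the illustrative choice θ_1 = 1, θ_2 = -1, not something the data dictates):

```python
def f(x, y):
    """The two slide features: [great & positive, great & negative]."""
    contains_great = "great" in x
    return [1.0 if (contains_great and y == "positive") else 0.0,
            1.0 if (contains_great and y == "negative") else 0.0]

def score(x, y, theta):
    """Linear score: dot product of weight vector and feature vector."""
    return sum(w * v for w, v in zip(theta, f(x, y)))

theta = [1.0, -1.0]  # assumed weights: "great" votes for positive, against negative

x = "it is a great film".split()
prediction = max(["positive", "negative"], key=lambda y: score(x, y, theta))
print(prediction)  # -> positive
```

On a sentence without "great", both features are zero for both labels, so this classifier has no opinion; that is the coverage problem the slide points out (83/6911 examples).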
• Usually, great indicates positive sentiment:
    The most wondrous love story in years, it is a great film.
    A great companion piece to other Napoleon films.
• Sometimes not. Why?
    Negation: It's not a great monster movie.
    Different sense: There's a great deal of corny dialogue and preposterous moments.
    Multiple sentiments: A great ensemble cast can't lift this heartfelt enterprise out of the familiar.
• And there are many other words that indicate positive sentiment
Feature Engineering for Text Classification
• What about a feature like the following?
    f_3(x, y) = I[x contains "great"]   (no dependence on the label y)
• What should its weight be?
• Doesn't matter.
• Why?
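The answer the slide is fishing for can be checked numerically: a feature that ignores y adds the same amount to every label's score, so it can never change the argmax. A small demo (base scores and the bias weights tried are made up for illustration):

```python
def scores(x, theta3):
    """Base scores for each label, plus theta3 * f3(x, y) where f3 ignores y."""
    contains_great = 1.0 if "great" in x else 0.0
    base = {"positive": 1.0 * contains_great, "negative": -1.0 * contains_great}
    # f3(x, y) = I[x contains "great"]: same value for every label y,
    # so theta3 shifts all scores equally.
    return {y: s + theta3 * contains_great for y, s in base.items()}

x = "a great film".split()
preds = []
for theta3 in (0.0, 5.0, -100.0):
    s = scores(x, theta3)
    preds.append(max(s, key=s.get))
print(preds)  # the argmax is "positive" under every choice of theta3
```

Whatever weight f_3 gets, the gap between the labels' scores is untouched, which is why the slide says its weight "doesn't matter."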
Text Classification

Inference for Text Classification
• our linear sentiment classifier:
    classify(x, θ) = argmax_y θ · f(x, y)
• inference: solve argmax_y θ · f(x, y)
• trivial (loop over labels)
Text Classification

Learning for Text Classification
• learning: choose θ
• There are many ways to choose θ
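As a taste of what "choose θ" can mean, here is a deliberately naive way (an illustration only, not what the course will ultimately use): try a few candidate weight vectors and keep the one with the best accuracy on labeled training data. The tiny training set and candidates are made up.

```python
# A naive learning procedure: pick the candidate theta with best training accuracy.
train = [("a great film".split(), "positive"),
         ("a boring film".split(), "negative")]
LABELS = ["positive", "negative"]

def score(x, y, theta):
    return sum(theta.get((w, y), 0.0) for w in x)

def classify(x, theta):
    return max(LABELS, key=lambda y: score(x, y, theta))

def accuracy(theta):
    return sum(classify(x, theta) == y for x, y in train) / len(train)

candidates = [{},
              {("great", "positive"): 1.0},
              {("great", "positive"): 1.0, ("boring", "negative"): 1.0}]
best = max(candidates, key=accuracy)
```

Real learning algorithms search the (continuous, high-dimensional) space of θ far more cleverly than enumerating candidates, but the objective (do well on training data) has the same flavor.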
Experimental Practice
• in the beginning, we just had data
• first innovation: split into train and test
  – motivation: simulate conditions of applying system in practice
• but, there's a problem with this...
  – we need to explore and evaluate methodological choices
  – after multiple evaluations on test, it is no longer a simulation of real-world conditions

Experimental Practice
• we need to explore/evaluate methodological choices
• what should we do?
  – some use cross validation on train, but this is slow and doesn't quite simulate real-world settings (why?)
• second innovation: divide data into train, test, and a third set called development (dev) or validation (val)
  – use dev/val to evaluate choices
  – then, when ready to write the paper, evaluate the best model on test
• are we done yet? no! there's still a problem:
  – overfitting to dev/val

Experimental Practice
• best practice: split data into train, development (dev), development test (devtest), and test
  – train model on train, tune hyperparameter values on dev, do preliminary testing on devtest, do final testing on test a single time when writing the paper
  – even better to have even more test sets! test1, test2, etc.
• experimental credibility is a huge component of doing useful research
• when you publish a result, it had better be replicable without tuning anything on test
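The best practice above can be sketched as a one-shot split routine. The 70/10/10/10 proportions and the fixed seed are assumptions for illustration; any sensible proportions work as long as the split is made once and test is touched a single time.

```python
import random

def split_data(examples, seed=0):
    """Shuffle once, then carve into train / dev / devtest / test."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed so the split is reproducible
    n = len(examples)
    a, b, c = int(0.7 * n), int(0.8 * n), int(0.9 * n)
    return {"train":   examples[:a],    # fit model parameters
            "dev":     examples[a:b],   # tune hyperparameter values
            "devtest": examples[b:c],   # preliminary testing
            "test":    examples[c:]}    # final testing, a single time

splits = split_data(range(100))
```

Shuffling before splitting matters: many corpora are ordered (by source, date, or topic), and an unshuffled split would make the held-out sets systematically different from train.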
Don't Cheat!