Relation Extraction
Many slides from Dan Jurafsky
Extracting relations from text
• Company report: “International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…”
• Extracted Complex Relation: Company-Founding
  Company: IBM
  Location: New York
  Date: June 16, 1911
  Original-Name: Computing-Tabulating-Recording Co.
• But we will focus on the simpler task of extracting relation triples
  Founding-year(IBM, 1911)
  Founding-location(IBM, New York)
Extracting Relation Triples from Text
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California … Leland Stanford … founded the university in 1891
  Stanford EQ Leland Stanford Junior University
  Stanford LOC-IN California
  Stanford IS-A research university
  Stanford LOC-NEAR Palo Alto
  Stanford FOUNDED-IN 1891
  Stanford FOUNDER Leland Stanford
Why Relation Extraction?
• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
  • Adding words to WordNet thesaurus, facts to FreeBase or DBPedia
• Support question answering
  • The granddaughter of which actor starred in the movie “E.T.”?
    (acted-in ?x “E.T.”)(is-a ?y actor)(granddaughter-of ?x ?y)
• But which relations should we extract?
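The question-answering query on this slide can be evaluated by matching its three clauses against a store of relation triples. The following is a minimal sketch, assuming a tiny hand-entered fact set; the names `FACTS` and `answer` are illustrative and not part of any library.

```python
# Toy triple store: (relation, arg1, arg2). Hand-entered facts, for illustration only.
FACTS = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("is-a", "John Barrymore", "actor"),
    ("is-a", "Drew Barrymore", "actor"),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}

def answer(facts, movie):
    """Return every ?y satisfying
    (acted-in ?x movie) (is-a ?y actor) (granddaughter-of ?x ?y)."""
    stars = {x for (rel, x, m) in facts if rel == "acted-in" and m == movie}
    actors = {y for (rel, y, cls) in facts if rel == "is-a" and cls == "actor"}
    return sorted(
        y for (rel, x, y) in facts
        if rel == "granddaughter-of" and x in stars and y in actors
    )

print(answer(FACTS, "E.T."))  # ['John Barrymore']
```

The query only works if the three relations have already been extracted, which is the motivation for the rest of the deck.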
Automated Content Extraction (ACE)
17 relations from 2008 “Relation Extraction Task”
• ARTIFACT: User-Owner-Inventor-Manufacturer
• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
• ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
• PART-WHOLE: Geographical, Subsidiary
• PERSON-SOCIAL: Business, Family, Lasting Personal
• PHYSICAL: Located, Near
Automated Content Extraction (ACE)
• Physical-Located (PER-GPE): He was in Tennessee
• Part-Whole-Subsidiary (ORG-ORG): XYZ, the parent company of ABC
• Person-Social-Family (PER-PER): John’s wife Yoko
• Org-AFF-Founder (PER-ORG): Steve Jobs, co-founder of Apple…
UMLS: Unified Medical Language System
• 134 entity types, 54 relations
  Injury disrupts Physiological Function
  Bodily Location location-of Biologic Function
  Anatomical Structure part-of Organism
  Pharmacologic Substance causes Pathological Function
  Pharmacologic Substance treats Pathologic Function
Extracting UMLS relations from a sentence
Doppler echocardiography can be used to diagnose left anterior descending artery stenosis in patients with type 2 diabetes
↓
Echocardiography, Doppler DIAGNOSES Acquired stenosis
Databases of Wikipedia Relations
[Figure: Wikipedia Infobox]
Relations extracted from Infobox:
  Stanford state California
  Stanford motto “Die Luft der Freiheit weht”
  …
Relation databases that draw from Wikipedia
• Resource Description Framework (RDF) triples
  subject predicate object
  Golden Gate Park location San Francisco
  dbpedia:Golden_Gate_Park dbpedia-owl:location dbpedia:San_Francisco
• DBPedia: 1 billion RDF triples, 385 million from English Wikipedia
• Frequent Freebase relations:
  people/person/nationality, location/location/contains
  people/person/profession, people/person/place-of-birth
  biology/organism_higher_classification, film/film/genre
Ontological relations
Examples from the WordNet Thesaurus
• IS-A (hypernym): subsumption between classes
  • Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal…
• Instance-of: relation between individual and class
  • San Francisco instance-of city
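Because IS-A is a subsumption relation, it chains: giraffe IS-A mammal follows from the intermediate links. A minimal sketch of that reasoning, assuming the hierarchy is stored as a plain dictionary of direct links (the names `IS_A`, `INSTANCE_OF`, `hypernyms`, and `isa` are illustrative, not WordNet's API):

```python
# Direct hypernym (IS-A) links between classes, taken from the giraffe example.
IS_A = {
    "giraffe": "ruminant",
    "ruminant": "ungulate",
    "ungulate": "mammal",
    "mammal": "vertebrate",
    "vertebrate": "animal",
}
# Instance-of links connect an individual to a class rather than a class to a class.
INSTANCE_OF = {"San Francisco": "city"}

def hypernyms(cls, links=IS_A):
    """Follow IS-A links upward and return all ancestors, nearest first."""
    chain = []
    while cls in links:
        cls = links[cls]
        chain.append(cls)
    return chain

def isa(cls, ancestor):
    """IS-A is transitive: true if `ancestor` is reachable from `cls`."""
    return ancestor in hypernyms(cls)

print(hypernyms("giraffe"))
print(isa("giraffe", "mammal"), isa("mammal", "giraffe"))
```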
How to build relation extractors
1. Hand-written patterns
2. Supervised machine learning
3. Semi-supervised and unsupervised
  • Bootstrapping (using seeds)
  • Distant supervision
  • Unsupervised learning from the web
Relation Extraction
What is relation extraction?

Relation Extraction
Using patterns to extract relations
Rules for extracting IS-A relation
Early intuition from Hearst (1992)
• “Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”
• What does Gelidium mean?
• How do you know?
Hearst’s Patterns for extracting IS-A relations
(Hearst, 1992): Automatic Acquisition of Hyponyms
  “Y such as X ((, X)* (, and|or) X)”
  “such Y as X”
  “X or other Y”
  “X and other Y”
  “Y including X”
  “Y, especially X”
Hearst’s Patterns for extracting IS-A relations
Hearst pattern | Example occurrences
X and other Y | ...temples, treasuries, and other important civic buildings.
X or other Y | Bruises, wounds, broken bones or other injuries...
Y such as X | The bow lute, such as the Bambara ndang...
Such Y as X | ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X | ...common-law countries, including Canada and England...
Y, especially X | European countries, especially France, England, and Spain...
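The patterns in which Y precedes the list of X's can be approximated with regular expressions. This is a rough sketch under a strong assumption: each noun phrase is reduced to a single word, whereas Hearst's patterns operate over real noun phrases. The names `W`, `LIST`, `PATTERNS`, and `hearst_pairs` are illustrative.

```python
import re

# One-word stand-in for a noun phrase; "and"/"or" are excluded so that they
# can only act as the conjunction that closes a list.
W = r"(?!(?:and|or)\b)[A-Za-z][A-Za-z-]*"
# "X ((, X)* (, and|or) X)": a conjoined list, or else a single X.
LIST = rf"(?:{W}(?:, {W})*,? (?:and|or) {W}|{W})"

PATTERNS = [
    # "Y such as X", "Y including X", "Y, especially X"
    re.compile(rf"\b(?P<y>{W}),? (?:such as|including|especially) (?P<xs>{LIST})"),
    # "such Y as X"
    re.compile(rf"\bsuch (?P<y>{W}) as (?P<xs>{LIST})"),
]

def hearst_pairs(text):
    """Return (hyponym, hypernym) pairs, i.e. X IS-A Y, found in `text`."""
    pairs = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            for x in re.split(r",? (?:and|or) |, ", m.group("xs")):
                pairs.append((x, m.group("y")))
    return pairs

print(hearst_pairs("Agar is a substance prepared from a mixture of red algae, "
                   "such as Gelidium, for laboratory or industrial use"))
# [('Gelidium', 'algae')]
print(hearst_pairs("such authors as Herrick, Goldsmith, and Shakespeare."))
```

The "X and other Y" and "X or other Y" patterns are left out because multi-word phrases such as "important civic buildings" need a noun-phrase chunker rather than a one-word approximation.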
Extracting Richer Relations Using Rules
• Intuition: relations often hold between specific entities
  • located-in (ORGANIZATION, LOCATION)
  • founded (PERSON, ORGANIZATION)
  • cures (DRUG, DISEASE)
• Start with Named Entity tags to help extract relation!
Named Entities aren’t quite enough. Which relations hold between 2 entities?
  Drug, Disease: Cure? Prevent? Cause?

What relations hold between 2 entities?
  PERSON, ORGANIZATION: Founder? Investor? Member? Employee? President?
Extracting Richer Relations Using Rules and Named Entities
Who holds what office in what organization?
PERSON, POSITION of ORG
• George Marshall, Secretary of State of the United States
PERSON (named|appointed|chose|etc.) PERSON Prep? POSITION
• Truman appointed Marshall Secretary of State
PERSON [be]? (named|appointed|etc.) Prep? ORG POSITION
• George Marshall was named US Secretary of State
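The three office-holder rules can be sketched as regular expressions over text that has already been entity-tagged. The inline `[TYPE text]` tagging format, the `POS` tag for positions, the short verb and preposition lists, and the helper names below are all assumptions made for illustration.

```python
import re

def ent(tag, name):
    """Regex for one tagged span such as [PER George Marshall], captured as `name`."""
    return rf"\[{tag} (?P<{name}>[^\]]+)\]"

PREP = r"(?:(?:as|to) )?"          # Prep?  (assumed preposition list)
BE = r"(?:(?:is|was|were) )?"      # [be]?
VERB = r"(?:named|appointed|chose)"

PATTERNS = [
    # PERSON, POSITION of ORG
    re.compile(rf"{ent('PER', 'person')}, {ent('POS', 'position')} of {ent('ORG', 'org')}"),
    # PERSON (named|appointed|chose) PERSON Prep? POSITION
    re.compile(rf"{ent('PER', 'agent')} {VERB} {ent('PER', 'person')} {PREP}{ent('POS', 'position')}"),
    # PERSON [be]? (named|appointed) Prep? ORG POSITION
    re.compile(rf"{ent('PER', 'person')} {BE}(?:named|appointed) {PREP}{ent('ORG', 'org')} {ent('POS', 'position')}"),
]

def extract_offices(tagged):
    """Return (person, position, org) triples; org is None if the rule has no ORG slot."""
    results = []
    for pattern in PATTERNS:
        for m in pattern.finditer(tagged):
            g = m.groupdict()
            results.append((g["person"], g["position"], g.get("org")))
    return results

print(extract_offices("[PER George Marshall], [POS Secretary of State] of [ORG the United States]"))
print(extract_offices("[PER Truman] appointed [PER Marshall] [POS Secretary of State]"))
print(extract_offices("[PER George Marshall] was named [ORG US] [POS Secretary of State]"))
```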
Hand-built patterns for relations
• Plus:
  • Human patterns tend to be high-precision
  • Can be tailored to specific domains
• Minus:
  • Human patterns are often low-recall
  • A lot of work to think of all possible patterns!
  • Don’t want to have to do this for every relation!
  • We’d like better accuracy
Relation Extraction
Supervised relation extraction
Supervised machine learning for relations
• Choose a set of relations we’d like to extract
• Choose a set of relevant named entities
• Find and label data
  • Choose a representative corpus
  • Label the named entities in the corpus
  • Hand-label the relations between these entities
  • Break into training, development, and test
• Train a classifier on the training set
How to do classification in supervised relation extraction
1. Find all pairs of named entities (usually in same sentence)
2. Decide if 2 entities are related
3. If yes, classify the relation
• Why the extra step?
  • Faster classification training by eliminating most pairs
  • Can use distinct feature-sets appropriate for each task.
Automated Content Extraction (ACE)
17 sub-relations of 6 relations from 2008 “Relation Extraction Task”: ARTIFACT, GENERAL AFFILIATION, ORG AFFILIATION, PART-WHOLE, PERSON-SOCIAL, PHYSICAL
Relation Extraction
Classify the relation between two entities in a sentence
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
Candidate labels: SUBSIDIARY, FAMILY, EMPLOYMENT, NIL, FOUNDER, CITIZEN, INVENTOR, …
Word Features for Relation Extraction
American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
• Headwords of M1 and M2, and combination
  Airlines  Wagner  Airlines-Wagner
• Bag of words and bigrams in M1 and M2
  {American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}
• Words or bigrams in particular positions left and right of M1/M2
  M2: -1 spokesman
  M2: +1 said
• Bag of words or bigrams between the two entities
  {a, AMR, of, immediately, matched, move, spokesman, the, unit}
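The word features listed on this slide can be computed directly from a token list and the two mention spans. A minimal sketch, assuming punctuation has already been dropped, that mentions are given as (start, end) token offsets with the end exclusive, and that the headword is approximated by the last token of a mention; `word_features` is an illustrative name.

```python
def word_features(tokens, m1, m2):
    """Word-level features for a mention pair; m1 must precede m2."""
    def bigrams(seq):
        return [" ".join(pair) for pair in zip(seq, seq[1:])]

    w1, w2 = tokens[m1[0]:m1[1]], tokens[m2[0]:m2[1]]
    feats = {
        # Headwords (approximated as the last token) and their combination.
        "head_m1": w1[-1],
        "head_m2": w2[-1],
        "heads": f"{w1[-1]}-{w2[-1]}",
        # Bag of words and bigrams inside the two mentions.
        "bow_mentions": set(w1 + w2 + bigrams(w1) + bigrams(w2)),
        # Bag of words between the two mentions.
        "bow_between": set(tokens[m1[1]:m2[0]]),
    }
    # Words immediately to the left and right of M2, when they exist.
    if m2[0] > 0:
        feats["m2_-1"] = tokens[m2[0] - 1]
    if m2[1] < len(tokens):
        feats["m2_+1"] = tokens[m2[1]]
    return feats

tokens = ("American Airlines a unit of AMR immediately matched "
          "the move spokesman Tim Wagner said").split()
feats = word_features(tokens, m1=(0, 2), m2=(11, 13))
print(feats["heads"], feats["m2_-1"], feats["m2_+1"])  # Airlines-Wagner spokesman said
```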
Named Entity Type and Mention Level Features for Relation Extraction
American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
• Named-entity types
  • M1: ORG
  • M2: PERSON
• Concatenation of the two named-entity types
  • ORG-PERSON
• Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)
  • M1: NAME [it or he would be PRONOUN]
  • M2: NAME [the company would be NOMINAL]
Parse Features for Relation Extraction
American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
• Base syntactic chunk sequence from one to the other
  NP NP PP VP NP NP
• Constituent path through the tree from one to the other
  NP ↑ NP ↑ S ↑ S ↓ NP
• Dependency path
  Airlines <- matched -> Wagner -> said
Gazetteer and trigger word features for relation extraction
American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
• Trigger list for family: kinship terms
  • parent, wife, husband, grandparent, etc. [from WordNet]
• Gazetteer:
  • Lists of useful geo or geopolitical words
    • Country name list
    • Other sub-entities
Classifiers for supervised methods
• Now you can use any classifier you like
  • MaxEnt
  • Naïve Bayes
  • SVM
  • ...
• Train it on the training set, tune on the dev set, test on the test set
Evaluation of Supervised Relation Extraction
• Compute P/R/F1 for each relation

\[ P = \frac{\text{\# of correctly extracted relations}}{\text{Total \# of extracted relations}} \]

\[ R = \frac{\text{\# of correctly extracted relations}}{\text{Total \# of gold relations}} \]

\[ F_1 = \frac{2PR}{P + R} \]
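These three measures can be computed by comparing the extracted triples with the gold triples as sets. A minimal sketch, assuming triples are (relation, arg1, arg2) tuples; the function names and the toy data are illustrative.

```python
def prf1(extracted, gold):
    """Precision, recall, and F1 of a set of extracted triples against gold triples."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def prf1_by_relation(extracted, gold):
    """Compute P/R/F1 separately for each relation type."""
    relations = {t[0] for t in extracted} | {t[0] for t in gold}
    return {
        rel: prf1([t for t in extracted if t[0] == rel],
                  [t for t in gold if t[0] == rel])
        for rel in sorted(relations)
    }

gold = [("founder", "Jobs", "Apple"), ("founder", "Gates", "Microsoft"),
        ("founder", "Page", "Google"), ("founder", "Brin", "Google")]
extracted = [("founder", "Jobs", "Apple"), ("founder", "Page", "Google"),
             ("founder", "Jobs", "Google")]
p, r, f1 = prf1(extracted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```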
Summary: Supervised Relation Extraction
+ Can get high accuracies with enough hand-labeled training data, if test similar enough to training
- Labeling a large training set is expensive
- Supervised models are brittle, don’t generalize well to different genres
Relation Extraction
Semi-supervised and unsupervised relation extraction
Seed-based or bootstrapping approaches to relation extraction
• No training set? Maybe you have:
  • A few seed tuples or
  • A few high-precision patterns
• Can you use those seeds to do something useful?
  • Bootstrapping: use the seeds to directly learn to populate a relation
Relation Bootstrapping (Hearst 1992)
• Gather a set of seed pairs that have relation R
• Iterate:
  1. Find sentences with these pairs
  2. Look at the context between or around the pair and generalize the context to create patterns
  3. Use the patterns to grep for more pairs
Bootstrapping
• <Mark Twain, Elmira> Seed tuple
• Grep (google) for the environments of the seed tuple
  “Mark Twain is buried in Elmira, NY.”
    X is buried in Y
  “The grave of Mark Twain is in Elmira”
    The grave of X is in Y
  “Elmira is Mark Twain’s final resting place”
    Y is X’s final resting place.
• Use those patterns to grep for new tuples
• Iterate
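The loop on this slide can be sketched in a few lines. This is a simplified illustration, not a faithful reimplementation of any published system: it assumes whole sentences serve as patterns, that a capitalized-word regex stands in for a named-entity tagger, and that the toy `CORPUS` below plays the role of the web. All names are illustrative.

```python
import re

NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"   # crude stand-in for a named-entity tagger

CORPUS = [
    "Mark Twain is buried in Elmira.",
    "The grave of Mark Twain is in Elmira.",
    "Emily Dickinson is buried in Amherst.",
    "The grave of William Faulkner is in Oxford.",
    "Amherst is Emily Dickinson's final resting place.",
    "Concord is Henry David Thoreau's final resting place.",
]

def learn_patterns(sentences, pairs):
    """Find sentences containing a known pair and abstract the pair into {X}, {Y}."""
    return {
        s.replace(x, "{X}", 1).replace(y, "{Y}", 1)
        for x, y in pairs for s in sentences
        if x in s and y in s
    }

def apply_patterns(sentences, patterns):
    """Grep the corpus with each pattern to harvest (X, Y) pairs."""
    found = set()
    for p in patterns:
        regex = (re.escape(p)
                 .replace(r"\{X\}", f"(?P<x>{NAME})")
                 .replace(r"\{Y\}", f"(?P<y>{NAME})"))
        for s in sentences:
            m = re.fullmatch(regex, s)
            if m:
                found.add((m.group("x"), m.group("y")))
    return found

def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning patterns from pairs and pairs from patterns."""
    pairs = set(seeds)
    for _ in range(rounds):
        pairs |= apply_patterns(sentences, learn_patterns(sentences, pairs))
    return pairs

print(sorted(bootstrap(CORPUS, {("Mark Twain", "Elmira")})))
```

With this corpus the first round learns the two "buried in" and "grave of" patterns from the seed, and the second round learns the "final resting place" pattern from a pair found in the first round. Nothing here scores patterns, so a single bad pattern would propagate errors unchecked.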
DIPRE: Extract <author, book> pairs
Brin, Sergey. 1998. Extracting Patterns and Relations from the World Wide Web.
• Start with 5 seeds:
  Author | Book
  Isaac Asimov | The Robots of Dawn
  David Brin | Startide Rising
  James Gleick | Chaos: Making a New Science
  Charles Dickens | Great Expectations
  William Shakespeare | The Comedy of Errors
• Find Instances:
  The Comedy of Errors, by William Shakespeare, was
  The Comedy of Errors, by William Shakespeare, is
  The Comedy of Errors, one of William Shakespeare's earliest attempts
  The Comedy of Errors, one of William Shakespeare's most
• Extract patterns (group by middle, take longest common prefix/suffix)
  ?x , by ?y ,
  ?x , one of ?y 's
• Now iterate, finding new seeds that match the pattern
Snowball
E. Agichtein and L. Gravano 2000. Snowball: Extracting Relations from Large Plain-Text Collections. ICDL
• Similar iterative algorithm
  Organization | Location of Headquarters
  Microsoft | Redmond
  Exxon | Irving
  IBM | Armonk
• Group instances w/ similar prefix, middle, suffix, extract patterns
  • But require that X and Y be named entities
  • And compute a confidence for each pattern
  ORGANIZATION {’s, in, headquarters} LOCATION   .69
  LOCATION {in, based} ORGANIZATION   .75
Distant Supervision
• Combine bootstrapping with supervised learning
  • Instead of 5 seeds,
  • Use a large database to get huge # of seed examples
• Create lots of features from all these examples
• Combine in a supervised classifier

Snow, Jurafsky, Ng. 2005. Learning syntactic patterns for automatic hypernym discovery. NIPS 17
Fei Wu and Daniel S. Weld. 2007. Autonomously Semantifying Wikipedia. CIKM 2007
Mintz, Bills, Snow, Jurafsky. 2009. Distant supervision for relation extraction without labeled data. ACL09
Distant supervision paradigm
• Like supervised classification:
  • Uses a classifier with lots of features
  • Supervised by detailed hand-created knowledge
  • Doesn’t require iteratively expanding patterns
• Like unsupervised classification:
  • Uses very large amounts of unlabeled data
  • Not sensitive to genre issues in training corpus
Distantly supervised learning of relation extraction patterns
1. For each relation
   Born-In
2. For each tuple in big database
   <Edwin Hubble, Marshfield>
   <Albert Einstein, Ulm>
3. Find sentences in large corpus with both entities
   Hubble was born in Marshfield
   Einstein, born (1879), Ulm
   Hubble’s birthplace in Marshfield
4. Extract frequent features (parse, words, etc)
   PER was born in LOC
   PER, born (XXXX), LOC
   PER’s birthplace in LOC
5. Train supervised classifier using thousands of patterns
   P(born-in | f1, f2, f3, …, f70000)
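Steps 1 to 4 amount to using database tuples as noisy labels for sentences. The sketch below illustrates only that labeling and feature-extraction stage, under simplifying assumptions: entities are matched by exact string (so the toy corpus spells out full names), the only feature is the sentence with entities replaced by their types, and `DATABASE`, `CORPUS`, and `distant_examples` are hypothetical names. Step 5 would feed the resulting (feature, relation) examples to any classifier.

```python
import re
from collections import Counter

# Hypothetical miniature knowledge base: relation -> (argument types, known tuples).
DATABASE = {
    "Born-In": (("PER", "LOC"),
                [("Edwin Hubble", "Marshfield"), ("Albert Einstein", "Ulm")]),
}

CORPUS = [
    "Edwin Hubble was born in Marshfield",
    "Albert Einstein was born in Ulm",
    "Albert Einstein, born (1879), Ulm",
    "Edwin Hubble's birthplace in Marshfield",
    "Edwin Hubble worked at Mount Wilson",   # mentions only one entity: ignored
]

def distant_examples(database, corpus):
    """Label every sentence containing both entities of a tuple with its relation."""
    examples = []
    for relation, ((t1, t2), tuples) in database.items():   # 1. each relation
        for e1, e2 in tuples:                               # 2. each tuple
            for sent in corpus:                             # 3. sentences with both
                if e1 in sent and e2 in sent:
                    # 4. feature: entity strings -> types, digits -> X
                    feature = re.sub(r"\d", "X", sent.replace(e1, t1).replace(e2, t2))
                    examples.append((feature, relation))
    return examples

feature_counts = Counter(distant_examples(DATABASE, CORPUS))
for (feature, relation), n in feature_counts.most_common():
    print(n, relation, feature)
```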
Unsupervised relation extraction
M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open information extraction from the web. IJCAI
• Open Information Extraction:
  • extract relations from the web with no training data, no list of relations
1. Use parsed data to train a “trustworthy tuple” classifier
2. Single-pass extract all relations between NPs, keep if trustworthy
3. Assessor ranks relations based on text redundancy
  (FCI, specializes in, software development)
  (Tesla, invented, coil transformer)
Evaluation of Semi-supervised and Unsupervised Relation Extraction
• Since it extracts totally new relations from the web
  • There is no gold set of correct instances of relations!
  • Can’t compute precision (don’t know which ones are correct)
  • Can’t compute recall (don’t know which ones were missed)
• Instead, we can approximate precision (only)
  • Draw a random sample of relations from output, check precision manually

\[ \hat{P} = \frac{\text{\# of correctly extracted relations in the sample}}{\text{Total \# of extracted relations in the sample}} \]

• Can also compute precision at different levels of recall.
  • Precision for top 1000 new relations, top 10,000 new relations, top 100,000
  • In each case taking a random sample of that set
• But no way to evaluate recall
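The sampling procedure can be sketched as follows. The `is_correct` callback stands in for the manual judgment of a human annotator, and the function names, the fixed random seed, and the synthetic ranked list are assumptions made for illustration.

```python
import random

def estimate_precision(extractions, is_correct, sample_size, seed=0):
    """Estimate precision from a random sample judged by `is_correct`."""
    if not extractions:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(extractions, min(sample_size, len(extractions)))
    return sum(1 for e in sample if is_correct(e)) / len(sample)

def precision_at_k(ranked, is_correct, ks, sample_size, seed=0):
    """Estimated precision of the top-k extractions for each k, sampling within each set."""
    return {k: estimate_precision(ranked[:k], is_correct, sample_size, seed) for k in ks}

# Synthetic ranked output: ids 0..99, where (by construction) only the top 50 are correct.
ranked = list(range(100))
judge = lambda extraction: extraction < 50

print(precision_at_k(ranked, judge, ks=[50, 100], sample_size=100))
# {50: 1.0, 100: 0.5}
```

In this toy run the sample size is at least as large as each set, so the estimate is exact. With a smaller sample the result is only an approximation, and recall stays unmeasurable because the missed relations are never observed.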