Natural Language - LIP6
![Page 1: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/1.jpg)
Natural Language
Jean-Gabriel Ganascia
LIP6 – University Pierre et Marie Curie
4, place Jussieu, 75252 Paris, [email protected]
![Page 2: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/2.jpg)
What is a language?
• Languages (langues) and language (langage)
• A language
– A system of signs
– Language sciences: inventory of the signs and of their combinations
– Oral / written
• "Natural language"
– Meaning and understanding
Jean-Gabriel GANASCIA, course: Apprentissage Symbolique et Web Sémantique (Symbolic Learning and the Semantic Web)
![Page 3: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/3.jpg)
Mesopotamia: the invention of writing
"Écrire à Sumer", Jean-Jacques Glassner
• Writing is not a transcription of speech
• Writing is not solely "pictographic"
=> From its origin, writing presents itself as a system of signs
=> Writing, like AI, transcribes knowledge
![Page 4: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/4.jpg)
Science(s) of language
How to inventory the signs and their combinations
• Lexicology
• Grammar / syntax
• Linguistics (sentence)
• Rhetoric (discourse)
• Phonetics
• The question of meaning: semantics?
![Page 5: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/5.jpg)
Cognitive linguistics
• Language
• Computer
[Diagram labels: Generation; Analysis; Understanding]
![Page 6: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/6.jpg)
Natural language processing (NLP; in French TAL, Traitement Automatique des Langues): some historical milestones
• 1950-60: machine translation
– Georgetown University MT Experiment (1952-54): an attempt at word-for-word translation with minimal syntactic analysis.
• 1960-70: approaches without semantics
– string-analysis models based on Chomsky's transformational grammar (1965)
– natural-language dialogue, ELIZA (Weizenbaum 1966): conversation simulated by means of a dictionary of keywords
• 1970-80: artificial intelligence enters the scene
– development of understanding systems
– Augmented Transition Networks (Woods 1970): an implementable formalism with the power of transformational grammars
– SHRDLU (Winograd 1971): natural-language dialogue with a simulated robot operating in the "blocks world". The system is able to act and to plan as well as to answer questions.
– MARGIE (Schank 1973): natural-language understanding through inferences based on conceptual knowledge.
![Page 7: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/7.jpg)
NLP: some historical milestones (continued)
• 1980-90: grammatical formalisms
– "Definite Clause Grammars" (DCGs): parsing based on logic programming (Pereira & Warren 1980).
– unification-based and constraint-based grammars
• 1990-2005: integrated language engineering
– statistical methods: (1) performance models (2) systematic empirical evaluation.
– multimodality: (1) integration of language and speech (2) large-scale language/speech projects.
– multilingualism: (1) multilingual information society (2) machine translation (3) software internationalization (4) European dimension.
– language resources: (1) lexicons, grammars (2) text and speech corpora (3) establishment of representation standards
![Page 8: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/8.jpg)
NLP: some historical milestones (end)
• 2005-present: large masses of data
– statistical translation (Google Translate)
– exploitation of large masses of textual (web) and phonetic data
– collaborative annotation (crowdsourcing)
![Page 9: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/9.jpg)
Syntactic Structures, Noam Chomsky
• Existence of different levels of representation:
– Phonological: phoneme
– Syntactic: morpheme
– Semantic: sememe
– Pragmatic...
• How can syntactically correct sentences be produced? Grammar.
![Page 10: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/10.jpg)
A process that generates syntactically correct sentences
• Goal: build a finite-state grammar (finite-state recognition automaton)
• Model 1: constituent analysis (Phrase = sentence, SN = noun phrase, SV = verb phrase, Nom = noun, Verbe = verb)
(I) Phrase → SN + SV
(II) SN → Article + Nom
(III) SV → Verbe + SN
(IV) Article → The
(V) Nom → man, ball, etc.
(VI) Verbe → hit, took, ...
![Page 11: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/11.jpg)
"The man hit the ball"
(I) Phrase
(II) SN + SV
(III) Art + Nom + SV
(IV) Art + Nom + Verbe + SN
(V) The + Nom + Verbe + SN
(VI) The + man + Verbe + SN
(VII) The + man + hit + SN
(VIII) The + man + hit + Art + Nom
(IX) The + man + hit + the + Nom
(X) The + man + hit + the + ball
Derivation tree, in bracket notation:
[Phrase [SN [Art The] [Nom man]] [SV [Verbe hit] [SN [Art the] [Nom ball]]]]
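The derivation of "The man hit the ball" can be replayed mechanically. The following is a minimal sketch, assuming the six rewrite rules of Model 1 with lowercase terminals; the function name `derive` and the leftmost expansion order are illustrative choices, so the intermediate steps come out in a slightly different order from the slide.

```python
# Rewrite rules (I)-(VI) of Model 1; symbols keep the slide's French names.
RULES = {
    "Phrase": [["SN", "SV"]],
    "SN": [["Art", "Nom"]],
    "SV": [["Verbe", "SN"]],
    "Art": [["the"]],
    "Nom": [["man"], ["ball"]],
    "Verbe": [["hit"], ["took"]],
}


def derive(target):
    """Return a leftmost derivation of `target` (list of words), or None."""

    def expand(form, steps):
        # Locate the leftmost non-terminal symbol.
        for i, sym in enumerate(form):
            if sym in RULES:
                break
        else:
            # Only terminals left: success if they spell the target.
            return steps if form == target else None
        # Prune: words already produced must match, and forms never shrink.
        if form[:i] != target[:i] or len(form) > len(target):
            return None
        for rhs in RULES[sym]:
            new_form = form[:i] + rhs + form[i + 1:]
            result = expand(new_form, steps + [new_form])
            if result is not None:
                return result
        return None

    return expand(["Phrase"], [["Phrase"]])


derivation = derive("the man hit the ball".split())
for step in derivation:
    print(" + ".join(step))
```

A word sequence the grammar cannot generate, such as "the hit man", makes `derive` return `None`.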
![Page 12: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/12.jpg)
Grammar: the phrase-structure model
• Grammar
– A set of derivation rules F
– A set of initial formulas Σ (sentences, interrogative sentences, declarative sentences...)
– Introduction of morphological structure (Passé = past tense):
SNsing + Verbe → SNsing + hits
SN + Verbe + Passé → SN + walk + Passé
walk + Passé → walked
![Page 13: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/13.jpg)
• Formal grammars: a tool for modelling the syntax of languages
• model: a simplified and formalized representation of an object or a process
• Noam Chomsky
- initially trained as a mathematician
- came to linguistics in the lineage of Harris and Bloomfield, but broke with it over the object of study: attested material -> the speaker's "competence"
- 1957: Syntactic Structures, a linguist's book
- for Chomskyan linguists: model the "competence" of the native speaker (in generation)
- for NLP practitioners ("TAListes"): model the syntax of languages, as attested material (in analysis)
- divergence between the two currents around 1971 (Extended Standard Theory)
![Page 14: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/14.jpg)
• Compilation <-> Machine Translation
Programming
– human: instructions in a (human) language
– human translation (analysis and programming): instructions in a programming language
– automatic translation (compilation): instructions in machine language
– processor
Machine translation of languages
– human: texts in a source language
– texts in a target language: human
- the codes analysed are different
![Page 15: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/15.jpg)
Current natural language processing systems
Syntactic Parsing
POS Tagging
Semantic Role Labeling
Named Entity Recognition
Question Answering
• The inputs of NLP systems depend on the outputs of other systems.
![Page 16: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/16.jpg)
Current natural language processing systems (French terminology)
Analyse syntaxique (syntactic parsing)
Étiquetage syntaxique (POS tagging)
Étiquetage de rôles sémantiques (semantic role labeling)
Reconnaissance d'entités nommées (named entity recognition)
Question-réponse (question answering)
• The inputs of NLP systems depend on the outputs of other systems.
![Page 17: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/17.jpg)
POS – Part of Speech Tagging (French: étiqueteur syntaxique)
"The process of assigning a part-of-speech or other lexical class marker to each word in a corpus" (Jurafsky and Martin)
WORDS: the girl kissed the boy on the cheek
TAGS: N, V, P, DET
![Page 18: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/18.jpg)
An Example
WORD:  the   girl   kissed  the   boy    on     the   cheek
LEMMA: the   girl   kiss    the   boy    on     the   cheek
TAG:   +DET  +NOUN  +VPAST  +DET  +NOUN  +PREP  +DET  +NOUN
![Page 19: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/19.jpg)
Techniques
• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based tagging
• Other issues: tagging unknown words, evaluation
![Page 20: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/20.jpg)
Word Classes
• Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …
• Open vs. Closed classes
– Open:
• Nouns, Verbs, Adjectives, Adverbs.
• Why "open"?
– Closed:
• determiners: a, an, the
• pronouns: she, he, I
• prepositions: on, under, over, near, by, …
![Page 21: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/21.jpg)
Prepositions from CELEX
![Page 22: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/22.jpg)
English Single-Word Particles
![Page 23: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/23.jpg)
Pronouns in CELEX
![Page 24: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/24.jpg)
Conjunctions
![Page 25: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/25.jpg)
Auxiliaries
![Page 26: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/26.jpg)
Outline
• Why part of speech tagging?
• Word classes
• Tag sets and problem definition
• Automatic approaches 1: rule-based tagging
• Automatic approaches 2: stochastic tagging
• Automatic approaches 3: transformation-based tagging
• Other issues: tagging unknown words, evaluation
![Page 27: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/27.jpg)
Word Classes: Tag Sets
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and purpose
– Some tagging approaches (e.g., constraint grammar based) make fewer distinctions, e.g., conflating prepositions, conjunctions, particles
– Simple morphology = more ambiguity = fewer tags
![Page 28: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/28.jpg)
Word Classes: Tag set example
PRP PRP$
![Page 29: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/29.jpg)
The Problem
• Words often have more than one word class: this
– This is a nice day = PRP
– This day is nice = DT
– You can go this far = RB
![Page 30: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/30.jpg)
Word Class Ambiguity (in the Brown Corpus)
• Unambiguous (1 tag): 35,340
• Ambiguous (2-7 tags): 4,100
– 2 tags: 3,760
– 3 tags: 264
– 4 tags: 61
– 5 tags: 12
– 6 tags: 2
– 7 tags: 1
(DeRose, 1988)
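As a quick consistency check on these figures, the per-count breakdown can be summed and the share of ambiguous entries computed. This is only arithmetic on the numbers quoted above; the variable names are illustrative.

```python
# Counts quoted on the slide (DeRose, 1988).
unambiguous = 35_340
ambiguous_by_tag_count = {2: 3_760, 3: 264, 4: 61, 5: 12, 6: 2, 7: 1}

ambiguous = sum(ambiguous_by_tag_count.values())  # should equal the quoted 4,100
total = unambiguous + ambiguous
ambiguous_share = ambiguous / total

print(ambiguous, total, f"{ambiguous_share:.1%}")
```

The breakdown does add up to 4,100, so roughly one entry in ten carries more than one possible tag.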
![Page 31: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/31.jpg)
Rule-Based Tagging
• Basic Idea:
– Assign all possible tags to words
– Remove tags according to a set of rules of the type: if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like "consider" then eliminate non-adv else eliminate adv.
– Typically more than 1000 hand-written rules, but may be machine-learned.
![Page 32: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/32.jpg)
Stochastic Tagging
• Based on the probability of a certain tag occurring given various possibilities
• Requires a training corpus
• No probabilities for words not in the corpus.
• Training corpus may be different from test corpus.
![Page 33: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/33.jpg)
Stochastic Tagging (cont.)
• Simple Method: Choose the most frequent tag in the training text for each word!
– Result: 90% accuracy
– Baseline
– Others will do better
– HMM is an example
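The "most frequent tag" baseline is short enough to sketch in full. The toy corpus, the tag names, and the choice of `NOUN` as the default for unknown words are assumptions made for illustration; the result on four words says nothing about the 90% figure quoted above.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_words(words, model, default="NOUN"):
    """Tag each word independently of its context (assumed default for unknowns)."""
    return [(w, model.get(w.lower(), default)) for w in words]


def accuracy(predicted, gold):
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy training corpus: "kiss" is a noun twice and a verb once.
train = [
    [("the", "DET"), ("girl", "NOUN"), ("kissed", "VPAST"), ("the", "DET"), ("boy", "NOUN")],
    [("the", "DET"), ("kiss", "NOUN")],
    [("they", "PRON"), ("kiss", "VERB")],
    [("a", "DET"), ("kiss", "NOUN")],
]
model = train_baseline(train)

gold = [("they", "PRON"), ("kiss", "VERB"), ("the", "DET"), ("cheek", "NOUN")]
predicted = tag_words([w for w, _ in gold], model)
baseline_accuracy = accuracy(predicted, gold)
```

Because the baseline ignores context, it tags the verb "kiss" as a noun here; this is exactly the kind of error that context-aware taggers such as HMMs are meant to fix.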
![Page 34: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/34.jpg)
Transformation-Based Tagging (Brill Tagging)
• Combination of rule-based and stochastic tagging methodologies
– Like rule-based because rules are used to specify tags in a certain environment
– Like the stochastic approach because machine learning is used, with a tagged corpus as input
• Input:
– tagged corpus
– dictionary (with most frequent tags)
• Usually constructed from the tagged corpus
![Page 35: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/35.jpg)
Transformation-Based Tagging (cont.)
• Basic Idea:
– Set the most probable tag for each word as a start value
– Change tags according to rules of the type "if word-1 is a determiner and word is a verb then change the tag to noun" in a specific order
• Training is done on a tagged corpus:
1. Write a set of rule templates
2. Among the set of rules, find the one with the highest score
3. Continue from 2 until the lowest score threshold is passed
4. Keep the ordered set of rules
• Rules make errors that are corrected by later rules
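The training loop above can be sketched for a single, assumed rule template: "change tag A to tag B when the previous tag is C". The score used here (net number of corrected tags) and the toy tag sequences are illustrative assumptions, not Brill's full algorithm.

```python
def apply_rule(tags, rule):
    """Apply (from_tag, to_tag, prev_tag); conditions read the tags before the change."""
    from_tag, to_tag, prev_tag = rule
    return [
        to_tag if i > 0 and t == from_tag and tags[i - 1] == prev_tag else t
        for i, t in enumerate(tags)
    ]


def score(rule, tags, gold):
    """Net gain in correct tags if `rule` were applied."""
    after = apply_rule(tags, rule)
    return sum(a == g for a, g in zip(after, gold)) - sum(t == g for t, g in zip(tags, gold))


def learn_rules(initial_tags, gold, threshold=1):
    """Greedily collect rules until the best score drops below `threshold` (>= 1)."""
    tags, rules = list(initial_tags), []
    while True:
        # Instantiate the template at every position that is currently wrong.
        candidates = {
            (t, g, tags[i - 1])
            for i, (t, g) in enumerate(zip(tags, gold))
            if i > 0 and t != g
        }
        if not candidates:
            break
        best = max(sorted(candidates), key=lambda r: score(r, tags, gold))
        if score(best, tags, gold) < threshold:
            break
        rules.append(best)
        tags = apply_rule(tags, best)
    return rules


# Words: the kiss / they kiss / the kiss ended; "kiss" starts as VERB everywhere.
initial = ["DET", "VERB", "PRON", "VERB", "DET", "VERB", "VERB"]
gold = ["DET", "NOUN", "PRON", "VERB", "DET", "NOUN", "VERB"]
learned = learn_rules(initial, gold)
```

On this toy input the loop learns the slide's own example rule: a verb preceded by a determiner is retagged as a noun.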
![Page 36: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/36.jpg)
Templates for TBL
![Page 37: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/37.jpg)
Evaluation
• The result is compared with a manually coded "Gold Standard"
– Typically accuracy reaches 96-97%
– This may be compared with the result for a baseline tagger (one that uses no context).
• Important: 100% is impossible even for human annotators.
• Factors that affect the performance
– The amount of training data available
– The tag set
– The difference between training corpus and test corpus
– Dictionary
– Unknown words
![Page 38: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/38.jpg)
Named entity recognition
[Figure: growth of named-entity category sets over time. Recoverable labels: ENAMEX (Name / Location / Organization), 1992, 1995, MUC-6; City, State, Country, Politician, Entertainer, 2001, 2002; Timex (time, date), Product, Drug/Chemical, 2003–2005.]
About 200 categories forming an ontology.
![Page 39: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/39.jpg)
NE Definition
• NE involves identification of proper names in texts, and classification into a set of predefined categories of interest.
• Three universally accepted categories: person, location and organisation
• Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc), email addresses etc.
• Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
![Page 40: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/40.jpg)
Problems in NE Task Definition
• Category definitions are intuitively quite clear, but there are many grey areas. Many of these grey areas are caused by metonymy.
Person vs. Artefact: "The ham sandwich wants his bill." vs. "Bring me a ham sandwich."
Organisation vs. Location: "England won the World Cup" vs. "The World Cup took place in England".
Company vs. Artefact: "shares in MTV" vs. "watching MTV"
Location vs. Organisation: "she met him at Heathrow" vs. "the Heathrow authorities"
![Page 41: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/41.jpg)
Basic Problems in NE
• Variation of NEs – e.g. John Smith, Mr Smith, John.
• Ambiguity of NE types
– John Smith (company vs. person)
– May (person vs. month)
– Washington (person vs. location)
– 1945 (date vs. time)
• Ambiguity with common words, e.g. "may"
![Page 42: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/42.jpg)
Shallow Parsing Approach
• Internal evidence – names often have internal structure. These components can be either stored or guessed.
location:
CapWord + {City, Forest, Center}
e.g. Sherwood Forest
CapWord + {Street, Boulevard, Avenue, Crescent, Road}
e.g. Portobello Street
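The two internal-evidence patterns translate almost directly into regular expressions. This is a minimal sketch, assuming that a CapWord is one capital letter followed by lowercase letters; real names (hyphenated, multi-word, accented) would need a richer definition.

```python
import re

# Assumed definition of CapWord: a capital letter followed by lowercase letters.
CAP_WORD = r"[A-Z][a-z]+"

LOCATION_PATTERNS = [
    re.compile(rf"\b{CAP_WORD} (?:City|Forest|Center)\b"),
    re.compile(rf"\b{CAP_WORD} (?:Street|Boulevard|Avenue|Crescent|Road)\b"),
]


def find_locations(text):
    """Return location candidates licensed by their internal structure."""
    return [m.group(0) for p in LOCATION_PATTERNS for m in p.finditer(text)]


sample = "Robin Hood lived in Sherwood Forest, far from Portobello Street."
locations = find_locations(sample)
```

"Robin Hood" is correctly ignored because "Hood" is not one of the stored location keywords.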
![Page 43: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/43.jpg)
Shallow Parsing Approach
• External evidence – names are often used in very predictive local contexts
Location:
"to the" COMPASS "of" CapWord
e.g. to the south of Loitokitok
"based in" CapWord
e.g. based in Loitokitok
CapWord "is a" (ADJ)? GeoWord
e.g. Loitokitok is a friendly city
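External evidence can be sketched the same way: the name itself is unconstrained, and the surrounding context does the work. The word lists for COMPASS and GeoWord and the crude stand-in for an optional adjective (any single lowercase word) are assumptions made for this example.

```python
import re

CAP_WORD = r"([A-Z][a-z]+)"                # captured: the candidate name
COMPASS = r"(?:north|south|east|west)"     # assumed COMPASS list
GEO_WORD = r"(?:city|town|village|region)"  # assumed GeoWord list
OPTIONAL_ADJ = r"(?:[a-z]+ )?"             # crude stand-in for (ADJ)?

CONTEXT_PATTERNS = [
    re.compile(rf"\bto the {COMPASS} of {CAP_WORD}"),
    re.compile(rf"\bbased in {CAP_WORD}"),
    re.compile(rf"\b{CAP_WORD} is a {OPTIONAL_ADJ}{GEO_WORD}\b"),
]


def locations_from_context(text):
    """Return names that appear in a location-predicting context, without duplicates."""
    found = []
    for pattern in CONTEXT_PATTERNS:
        for match in pattern.finditer(text):
            if match.group(1) not in found:
                found.append(match.group(1))
    return found
```

For instance, `locations_from_context("Loitokitok is a friendly city")` picks out the name, while "She is a friendly person" yields nothing because "person" is not a GeoWord.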
![Page 44: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/44.jpg)
Difficulties in Shallow Parsing Approach
• Ambiguously capitalised words (first word in sentence)
[All American Bank] vs. All [State Police]
• Semantic ambiguity
"John F. Kennedy" = airport (location)
"Philip Morris" = organisation
• Structural ambiguity
[Cable and Wireless] vs. [Microsoft] and [Dell]
[Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith].
![Page 45: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/45.jpg)
Learning models
• Supervised learning
– learning from a labelled corpus
• Recall: 76% on locations, 49% on organizations and 70-90% on persons ('03); Precision 76% and Recall 48% on all ENAMEX ('03)
• Unsupervised learning
– categorization without labelling, using co-occurrences, WordNet and other knowledge bases, capitalization, digits…
• Semi-supervised learning
– seeding with a few examples
– use of a (POS) tagger, syntactic relations, …
![Page 46: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/46.jpg)
Representation for learning
• Word-level: capitalization, digit pattern, PoS, morphology, punctuation, function, character
• List-lookup: dictionary, typical organization names, on the list (stemming, fuzzy matching such as edit distance, phonetic matching such as Soundex)
• Document/Corpus: multiple (co-)occurrences, capitalization, and location in the sentence. Aliases and contextual disambiguation. Statistics of multiword units.
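A few of the word-level and list-lookup features can be made concrete as follows. The feature names, the tiny gazetteer, and the distance threshold are hypothetical; the edit distance is the standard Levenshtein dynamic programme.

```python
def word_features(token):
    """A handful of word-level features (names are illustrative)."""
    return {
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(c.isdigit() for c in token),
        "has_punct": any(not c.isalnum() for c in token),
        "suffix3": token[-3:].lower(),
    }


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def fuzzy_lookup(token, gazetteer, max_distance=1):
    """List-lookup feature: gazetteer entries within `max_distance` edits."""
    return [entry for entry in gazetteer
            if edit_distance(token.lower(), entry.lower()) <= max_distance]


features = word_features("MUC-6")
matches = fuzzy_lookup("Microsft", ["Microsoft", "Dell"])
```

Fuzzy matching lets a misspelt or badly OCR-ed token such as "Microsft" still hit its gazetteer entry.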
![Page 47: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/47.jpg)
Evaluation
MUC Evaluation
– overlapping matches on two dimensions (tag-location and tag-type), evaluated using the micro-averaged F-score, i.e. the harmonic mean of Precision and Recall
Exact Match Evaluation
– Much more restrictive, requiring exact matches of entities and types
ACE Evaluation
– Most powerful but very complex. Penalizes mismatched patterns using weighting techniques. Hard to compare with other evaluations unless the weights are fixed
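For the exact-match scheme, precision, recall and their harmonic mean can be computed over sets of (start, end, type) spans. This is a minimal sketch; the span encoding and the example annotations are assumptions, and the MUC and ACE schemes would additionally need partial-match credit and weights.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_match_prf(predicted, gold):
    """Entities are (start, end, type) tuples; only identical tuples count."""
    true_positives = len(set(predicted) & set(gold))
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical annotations: the second prediction has the right span, wrong type.
gold_entities = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
predicted_entities = {(0, 2, "PER"), (5, 6, "ORG")}
p, r, f = exact_match_prf(predicted_entities, gold_entities)
```

Under exact match the mistyped entity earns no credit at all, which is why this scheme is the stricter one.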
![Page 48: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/48.jpg)
Conferences - references
MUC (Message Understanding Conference), IREX, CONLL (Conference on Natural Language Learning), ACE, BioNLP.
Nadeau and Sekine. 2007
Petasis et al. 2001
Poibeau. 2003
![Page 49: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/49.jpg)
Application of Named Entity Recognition
• Context: the Europeana Newspaper project
• Goal: index the newspapers
• Optical character recognition
• Spot individuals, places, institutions, …
• Difficulty: no labelled corpus!
![Page 50: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/50.jpg)
Methods for recognition
UNERD: Unsupervised NER and Disambiguation
• Use of dictionaries
• Disambiguation
– category of the word
– category of the neighbours (context)
![Page 51: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/51.jpg)
UNERD - Results
• MAD: manually annotated data
• Comparison of the different scenarios
– S1 – unsupervised (dict.)
– S2 – learning
– S3 – learning + dict.
– S4 – learning + dict. + disambiguation
– UNERD – dict. + disambiguation
![Page 52: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/52.jpg)
UNERD
![Page 53: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/53.jpg)
Never Ending Language Learning: NELL – CMU
![Page 54: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/54.jpg)
Outline
• Web-scale information extraction:
– discovering facts by automatically reading language on the Web
• NELL: A Never-Ending Language Learner
– Goals, current scope, and examples
• Key ideas:
– Redundancy of information on the Web
– Constraining the task by scaling up
• Related Works:
– SOAR, Eurisko, ACT-R, ….
![Page 55: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/55.jpg)
Information Extraction
• Goal:
– Extract facts about the world automatically by reading text
– IE systems are usually based on learning how to recognize facts in text
• ... and then (sometimes) aggregating the results
• Latest-generation IE systems need not require large amounts of training
• ... and IE does not necessarily require subtle analysis of any particular piece of text
![Page 56: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/56.jpg)
Never Ending Language Learning: NELL – CMU
![Page 57: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/57.jpg)
CPL: Coupled Pattern Learner
• Extraction of categories
• Extraction of patterns such as:
– "mayor of X"
– "X plays for Y"
• Statistical co-occurrences between noun phrases and contextual patterns
• Input: 2 billion sentences (extraction, segmentation, POS tagging)
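The co-occurrence idea can be illustrated with a toy bootstrapping step: a pattern is trusted if most of the noun phrases it occurs with are already known instances, and trusted patterns then propose new instances. This is a sketch under assumed data and an assumed precision threshold, not CPL's actual scoring.

```python
from collections import Counter


def pattern_precision(pairs, known):
    """Share of each pattern's co-occurring noun phrases that are known instances."""
    hits, total = Counter(), Counter()
    for pattern, phrase in pairs:
        total[pattern] += 1
        hits[pattern] += int(phrase in known)
    return {p: hits[p] / total[p] for p in total}


def promote(pairs, known, min_precision=0.6):
    """One bootstrapping step: trusted patterns, then the new instances they yield."""
    precision = pattern_precision(pairs, known)
    good_patterns = {p for p, value in precision.items() if value >= min_precision}
    new_instances = {phrase for pattern, phrase in pairs
                     if pattern in good_patterns and phrase not in known}
    return good_patterns, new_instances


# Hypothetical (pattern, noun phrase) co-occurrences for the category "city".
cooccurrences = [
    ("mayor of X", "Paris"), ("mayor of X", "Pittsburgh"), ("mayor of X", "Berlin"),
    ("live in X", "Seattle"), ("live in X", "denial"),
    ("traits such as X", "anxiety"),
]
city_seeds = {"Paris", "Pittsburgh", "Seattle"}
good_patterns, new_cities = promote(cooccurrences, city_seeds)
```

With a lower threshold, "live in X" would also be trusted and "denial" would be promoted to a city, which is how an under-constrained learner starts to drift.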
![Page 58: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/58.jpg)
Learned extraction patterns: playsSport(arg1,arg2)
arg1_was_playing_arg2 arg2_megastar_arg1 arg2_icons_arg1 arg2_player_named_arg1 arg2_prodigy_arg1 arg1_is_the_tiger_woods_of_arg2 arg2_career_of_arg1 arg2_greats_as_arg1 arg1_plays_arg2 arg2_player_is_arg1 arg2_legends_arg1 arg1_announced_his_retirement_from_arg2 arg2_operations_chief_arg1 arg2_player_like_arg1 arg2_and_golfing_personalities_including_arg1 arg2_players_like_arg1 arg2_greats_like_arg1 arg2_players_are_steffi_graf_and_arg1 arg2_great_arg1 arg2_champ_arg1 arg2_greats_such_as_arg1 arg2_professionals_such_as_arg1 arg2_course_designed_by_arg1 arg2_hit_by_arg1 arg2_course_architects_including_arg1 arg2_greats_arg1 arg2_icon_arg1 arg2_stars_like_arg1 arg2_pros_like_arg1 arg1_retires_from_arg2 arg2_phenom_arg1 arg2_lesson_from_arg1 arg2_architects_robert_trent_jones_and_arg1 arg2_sensation_arg1 arg2_architects_like_arg1 arg2_pros_arg1 arg2_stars_venus_and_arg1 arg2_legends_arnold_palmer_and_arg1 arg2_hall_of_famer_arg1 arg2_racket_in_arg1 arg2_superstar_arg1 arg2_legend_arg1 arg2_legends_such_as_arg1 arg2_players_is_arg1 arg2_pro_arg1 arg2_player_was_arg1 arg2_god_arg1 arg2_idol_arg1 arg1_was_born_to_play_arg2 arg2_star_arg1 arg2_hero_arg1 arg2_course_architect_arg1 arg2_players_are_arg1 arg1_retired_from_professional_arg2 arg2_legends_as_arg1 arg2_autographed_by_arg1 arg2_related_quotations_spoken_by_arg1 arg2_courses_were_designed_by_arg1 arg2_player_since_arg1 arg2_match_between_arg1 arg2_course_was_designed_by_arg1 arg1_has_retired_from_arg2 arg2_player_arg1 arg1_can_hit_a_arg2 arg2_legends_including_arg1 arg2_player_than_arg1 arg2_legends_like_arg1 arg2_courses_designed_by_legends_arg1 arg2_player_of_all_time_is_arg1 arg2_fan_knows_arg1 arg1_learned_to_play_arg2 arg1_is_the_best_player_in_arg2 arg2_signed_by_arg1 arg2_champion_arg1
![Page 59: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/59.jpg)
Semi-supervised learning
Extract cities:
Seeds: Paris, Pittsburgh, Seattle, Cupertino
Learned patterns: mayor of arg1; live in arg1
New instances: San Francisco, Austin, denial
Learned patterns: arg1 is home of; traits such as arg1
New instances: anxiety, selfishness, Berlin
Under-constrained!!
![Page 60: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/60.jpg)
Key to semi-supervised learning
It is easier to learn 100 related tasks than one isolated task
Hard (underconstrained) semi-supervised learning problem:
– learn coach(NP) alone, from "Krzyzewski coaches the Blue Devils."
Much easier (more constrained) semi-supervised learning problem:
– learn coupled categories and relations over NP1 and NP2 in "Krzyzewski coaches the Blue Devils."
– categories: person, coach, athlete, team, sport
– relations: coachesTeam(c,t), playsForTeam(a,t), teamPlaysSport(t,s), playsSport(a,s)
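One way to see why coupling helps is to treat the categories and relations as constraints that a candidate fact must satisfy. In the sketch below, the subset links (coach and athlete are persons), the mutual exclusion between person, team and sport, and the argument types of each relation are assumptions chosen to match the labels on the slide.

```python
# Assumed ontology fragment.
SUBSET_OF = {"coach": "person", "athlete": "person"}
MUTUALLY_EXCLUSIVE = [{"person", "team", "sport"}]
ARGUMENT_TYPES = {
    "coachesTeam": ("coach", "team"),
    "playsForTeam": ("athlete", "team"),
    "teamPlaysSport": ("team", "sport"),
    "playsSport": ("athlete", "sport"),
}


def with_supertypes(categories):
    """Add every category implied by the subset links."""
    closed = set(categories)
    for category in categories:
        while category in SUBSET_OF:
            category = SUBSET_OF[category]
            closed.add(category)
    return closed


def consistent(categories):
    """True if no two mutually exclusive categories are held at once."""
    closed = with_supertypes(categories)
    return all(len(closed & group) <= 1 for group in MUTUALLY_EXCLUSIVE)


def accept(relation, arg1, arg2, beliefs):
    """Accept a candidate fact only if its argument types fit current beliefs."""
    type1, type2 = ARGUMENT_TYPES[relation]
    return (consistent(beliefs.get(arg1, set()) | {type1})
            and consistent(beliefs.get(arg2, set()) | {type2}))


beliefs = {"Krzyzewski": {"coach"}, "Blue Devils": {"team"}, "basketball": {"sport"}}
ok = accept("coachesTeam", "Krzyzewski", "Blue Devils", beliefs)
rejected = accept("coachesTeam", "basketball", "Blue Devils", beliefs)
```

Each extra category or relation adds checks of this kind, so an error in one extractor is more likely to be caught by another.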
![Page 61: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/61.jpg)
CSEAL – Coupled SEAL
• An extractor that queries the internet using a set of beliefs
• Starts from a set of instances and builds queries
• Searches tables and lists for new instances of its beliefs
• Uses mutual-exclusion relations to find counter-examples
![Page 62: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/62.jpg)
Pages for the concept "dictator"
![Page 63: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/63.jpg)
SEAL: Set Expander for Any Language
<li class="honda"><a href="http://www.curryauto.com/">
<li class="toyota"><a href="http://www.curryauto.com/">
<li class="nissan"><a href="http://www.curryauto.com/">
<li class="ford"><a href="http://www.curryauto.com/"> <li class="ford"><a href="http://www.curryauto.com/">
<li class="ford"><a href="http://www.curryauto.com/">
<li class="ford"><a href="http://www.curryauto.com/">
<li class="ford"><a href="http://www.curryauto.com/">
<li class="ford"><a href="http://www.curryauto.com/">
…
…
…
…
…
Seeds: ford, toyota, nissan
Extractions: honda
Single-page patterns
Use lists and tables as well as text
* Richard C. Wang and William W. Cohen: Language-Independent Set Expansion of Named Entities using the Web. In Proceedings of IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA. 2007.
![Page 64: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/64.jpg)
Extrapolating user-provided seeds
• Set expansion (SEAL):
– Given seeds (kdd, icml, icdm), formulate a query to a search engine and collect semi-structured web pages
– Detect lists on these pages
– Merge the results, ranking items "frequently" occurring on "good" lists highest
– Details: Wang & Cohen ICDM 2007, 2008; EMNLP 2008, 2009
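The list-detection step can be sketched as character-level wrapper induction on a single page: find the text that surrounds every seed, then extract whatever else sits between the same left and right contexts. This is a simplified illustration, assuming contexts are confined to the line each seed appears on; the function names are hypothetical and the real SEAL system is considerably more elaborate.

```python
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def learn_wrapper(page, seeds):
    """Return the (left, right) context shared by the first occurrence of every seed."""
    lefts, rights = [], []
    for seed in seeds:
        i = page.find(seed)
        if i < 0:
            return None  # a seed is missing: no wrapper for this page
        lefts.append(page[:i].rsplit("\n", 1)[-1])            # same line, before the seed
        rights.append(page[i + len(seed):].split("\n", 1)[0])  # same line, after the seed
    return common_suffix(lefts), os.path.commonprefix(rights)


def expand(page, seeds):
    """Extract the non-seed items wrapped by the same contexts as the seeds."""
    wrapper = learn_wrapper(page, seeds)
    if wrapper is None or not all(wrapper):
        return []
    left, right = wrapper
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return [item for item in re.findall(pattern, page) if item not in seeds]


# Toy page modelled on the car-maker list from the SEAL slide.
makers = ["honda", "toyota", "nissan", "ford"]
page = "".join(f'<li class="{m}"><a href="http://www.curryauto.com/">\n' for m in makers)
wrapper = learn_wrapper(page, ["ford", "toyota", "nissan"])
extracted = expand(page, ["ford", "toyota", "nissan"])
```

Because the wrapper is learned from raw characters rather than from parsed HTML or from words, the same procedure applies to pages in any language.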
![Page 65: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/65.jpg)
CMC – Coupled Morphological Classifier
• Classifies noun phrases into the categories of the knowledge base
• Uses morphological features (words, capitalization, prefixes or suffixes, position, …)
![Page 66: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/66.jpg)
Rule Learner (RL)
• Learning in first-order logic (FOIL)
• Builds probabilistic Horn clauses
• Used to infer new relations from relations that already exist in the knowledge base
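To make the last bullet concrete, here is a sketch of applying one probabilistic Horn clause, playsSport(a,s) :- playsForTeam(a,t), teamPlaysSport(t,s), to a small knowledge base. The clause, the rule confidence of 0.9, the way confidences are multiplied, and the placeholder entity names are all assumptions for illustration; they are not taken from NELL.

```python
def infer_plays_sport(kb, rule_confidence=0.9):
    """Apply playsSport(a,s) :- playsForTeam(a,t), teamPlaysSport(t,s).

    `kb` maps (relation, arg1, arg2) to a confidence in [0, 1]. The confidence
    of an inferred fact is assumed to be the product of the rule confidence
    and the confidences of the two body facts.
    """
    inferred = {}
    for (rel1, athlete, team), c1 in kb.items():
        if rel1 != "playsForTeam":
            continue
        for (rel2, team2, sport), c2 in kb.items():
            if rel2 == "teamPlaysSport" and team2 == team:
                fact = ("playsSport", athlete, sport)
                if fact not in kb:
                    confidence = rule_confidence * c1 * c2
                    inferred[fact] = max(inferred.get(fact, 0.0), confidence)
    return inferred


# Hypothetical knowledge base with placeholder entities.
kb = {
    ("playsForTeam", "athlete_a", "team_t"): 1.0,
    ("teamPlaysSport", "team_t", "sport_s"): 1.0,
    ("playsForTeam", "athlete_b", "team_u"): 1.0,
}
new_facts = infer_plays_sport(kb)
```

Only `athlete_a` gains a `playsSport` fact, since nothing is known about the sport played by `team_u`.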
![Page 67: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/67.jpg)
Never Ending Language Learning (NELL)
• NELL is a large-scale IE system
– Simultaneously learning 500-600 concepts and relations (person, celebrity, emotion, aquiredBy, locatedIn, capitalCityOf, ..)
– Uses a 500M web page corpus + live queries
– Running (almost) continuously for over a year
– Currently has learned 3.2M low-confidence "beliefs" and over 500K high-confidence beliefs
• about 85% of high-confidence beliefs are correct
![Page 68: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/68.jpg)
Examples of what NELL knows
![Page 69: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/69.jpg)
Examples of what NELL knows
![Page 70: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/70.jpg)
Examples of what NELL knows
![Page 71: Langage Naturel - LIP6](https://reader030.vdocuments.site/reader030/viewer/2022012015/615a64751242b35d300333b5/html5/thumbnails/71.jpg)