convcomp2016: toward the “intelligent chat”: research in natural language processing and machine...
Outline
• Natural Language Processing
• The role of Machine Learning
• Scenario 1: question answering over structured data
• The role of knowledge modeling
• Scenario 2: FAQ retrieval
• The role of text-to-text technologies
• Toward the intelligent chat
• Learning complex dialogues
Pipeline of NLP tools
• HTML cleaner
• Tokenizer and Sentence Splitter
• POS tagging
• Morphological analysis
• Chunking and parsing
• Named entities
• Temporal expressions
• Key concepts
• Geo-coding
Ongoing
• Sentiment analysis
• NER for German
Semantic Annotations
TextPro is freely distributed for research purposes
http://textpro.fbk.eu/
A Pipeline of NLP taggers
[Figure: the TextPro modules. Labels: CleanPro, TokenPro, SentencePro, MorphoPro, TagPro, LemmaPro, EntityPro, ChunkPro, SyntaxPro, TimePro, GeoCoder, KX]
TextPro:
- a cascade of “annotators”
- Input: pure text (UTF-8)
- Output: tabular format (IOB annotation), XML
Output of an NLP Pipeline
token      tokenid  tokenstart  tokenend  pos  lemma      chunk  entity  timex
US         1        0           2         NP0  __NULL__   B-NP   B-GPE   O
President  2        3           12        NP0  president  I-NP   O       O
Barack     3        13          19        NP0  __NULL__   I-NP   B-PER   O
Obama      4        20          25        NP0  __NULL__   I-NP   I-PER   O
delayed    5        26          33        VVD  delay      B-VP   O       O
a          6        34          35        AT0  a          B-NP   O       O
decision   7        36          44        NN1  decision   I-NP   O       O
on         8        45          47        PRP  on         B-PP   O       O
potential  9        48          57        NN1  potential  B-NP   O       O
military   10       58          66        AJ0  military   I-NP   O       O
action     11       67          73        NN1  action     I-PP   O       O
against    12       74          81        PRP  against    B-PP   O       O
Syria      13       82          87        NP0  syria      B-NP   B-LOC   O
after      14       88          93        CJS  after      B-PP   O       O
September  15       94          103       NP0  september  B-NP   O       B-DATE
9          16       104         105       CRD  __NULL__   I-NP   O       I-DATE
Provided by: TokenPro, TagPro, LemmaPro, ChunkPro, EntityPro, TimePro
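The IOB columns of such a tabular output can be grouped back into labelled spans. The following is a minimal sketch, not part of TextPro: it assumes whitespace-separated columns with a header row, and the function names `read_rows` and `iob_spans` are hypothetical.

```python
from typing import Dict, List, Tuple

# A few rows in the tabular layout of the pipeline output (whitespace-separated).
SAMPLE = """\
token tokenid tokenstart tokenend pos lemma chunk entity timex
US 1 0 2 NP0 __NULL__ B-NP B-GPE O
President 2 3 12 NP0 president I-NP O O
Barack 3 13 19 NP0 __NULL__ I-NP B-PER O
Obama 4 20 25 NP0 __NULL__ I-NP I-PER O
delayed 5 26 33 VVD delay B-VP O O
"""


def read_rows(text: str) -> List[Dict[str, str]]:
    """Turn the table into one dict per token, keyed by the header names."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def iob_spans(rows: List[Dict[str, str]], column: str) -> List[Tuple[str, str]]:
    """Group the B-/I- tags of one column into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label, tokens = None, []
    for row in rows:
        tag = row[column]
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if tokens:
                spans.append((label, " ".join(tokens)))
            label, tokens = tag[2:], [row["token"]]
        elif tag.startswith("I-"):
            tokens.append(row["token"])
        else:  # an "O" tag closes any open span
            if tokens:
                spans.append((label, " ".join(tokens)))
            label, tokens = None, []
    if tokens:
        spans.append((label, " ".join(tokens)))
    return spans


rows = read_rows(SAMPLE)
entities = iob_spans(rows, "entity")
chunks = iob_spans(rows, "chunk")
```

The same function serves the chunk, entity and timex columns, since all three use the IOB scheme.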
Graph-based Representation
Abstract Meaning Representation (AMR) format
The AMR above can be expressed variously in English:
The boy wants the girl to believe him.
The boy wants to be believed by the girl.
The boy has a desire to be believed by the girl.
The boy’s desire is for the girl to believe him.
The boy is desirous of the girl believing him.
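The graph itself is not preserved in this transcript. As a sketch, the AMR usually given for these sentences in the AMR guidelines can be written as a list of triples; that the slide showed exactly this graph is an assumption, and the helper `reentrant_nodes` is hypothetical.

```python
# PENMAN notation of the graph usually given for these sentences
# (assumption: the slide showed this standard example).
AMR = "(w / want-01 :ARG0 (b / boy) :ARG1 (b2 / believe-01 :ARG0 (g / girl) :ARG1 b))"

# The same graph as (source, relation, target) triples.
TRIPLES = [
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("b2", "instance", "believe-01"),
    ("g", "instance", "girl"),
    ("w", "ARG0", "b"),
    ("w", "ARG1", "b2"),
    ("b2", "ARG0", "g"),
    ("b2", "ARG1", "b"),
]


def reentrant_nodes(triples):
    """Return the variables that are the target of more than one relation."""
    variables = {s for s, rel, _ in triples if rel == "instance"}
    incoming = {}
    for _, rel, target in triples:
        if rel != "instance" and target in variables:
            incoming[target] = incoming.get(target, 0) + 1
    return sorted(v for v, n in incoming.items() if n > 1)


shared = reentrant_nodes(TRIPLES)
```

The boy is both the one who wants and the one who is believed, so the variable `b` has two incoming relations. This sharing is why the representation is a graph rather than a tree, and why one graph covers all five paraphrases.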
Named Entity Classification
United States presidential election of 2008, scheduled for Tuesday November 4, 2008, will be the 56th consecutive quadrennial United States presidential election and will select the President and the Vice President of the United States. The Republican Party has chosen John McCain, the senior United States Senator from Arizona as its nominee; the Democratic Party has chosen Barack Obama, the junior United States Senator from Illinois, as its nominee.
United_B-LOC States_I-LOC presidential_O election_O of_O 2008_O ,_O scheduled_O for_O Tuesday_O November_O 4_O ,_O 2008_O ,_O will_O be_O the_O 56th_O consecutive_O quadrennial_O United_B-LOC States_I-LOC presidential_O election_O and_O will_O select_O the_O President_B-PER and_O the_O Vice_B-PER President_I-PER of_I-PER the_I-PER United_I-PER States_I-PER . The_O Republican_B-ORG Party_I-ORG has_O chosen_O John_B-PER McCain_I-PER ,_O the_O senior_O United_B-PER States_I-PER Senator_I-PER from_O Arizona_B-LOC as_O its_O
United,null,null,null,States,presidential,election,1,1,1,1,0,1,0,0,1,1,1,1,1,0,1,0,1,1,ed,ted,Uni,Un
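A feature vector like the one for the token "United" combines the token, its neighbours, binary flags, and suffixes and prefixes. The sketch below reproduces that layout under assumptions: the slide does not say what its binary features mean, so the four flags here are hypothetical, and `token_features` is not the system's actual extractor.

```python
from typing import List


def token_features(tokens: List[str], i: int, window: int = 3) -> List[str]:
    """Build a flat feature list for tokens[i]: context, flags, affixes."""
    tok = tokens[i]
    # Previous and next tokens, padded with "null" outside the sentence.
    prev = [tokens[j] if j >= 0 else "null" for j in range(i - window, i)]
    nxt = [
        tokens[j] if j < len(tokens) else "null"
        for j in range(i + 1, i + 1 + window)
    ]
    # Hypothetical binary flags (the slide's actual flags are not documented).
    flags = [
        int(tok[0].isupper()),  # starts with a capital letter
        int(tok.isupper()),     # all capitals
        int(tok.isdigit()),     # digits only
        int(i == 0),            # first token of the sentence
    ]
    # Suffixes of length 2 and 3, prefixes of length 3 and 2, as on the slide.
    affixes = [tok[-2:], tok[-3:], tok[:3], tok[:2]]
    return [tok] + prev + nxt + [str(f) for f in flags] + affixes


sentence = "United States presidential election of 2008".split()
united = token_features(sentence, 0)
year = token_features(sentence, 5)
```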
[Figure: supervised learning. Labels: TRAIN DATA, Learning Algorithm, Trained Machine, TEST DATA, answer]
Machine Learning Development Cycle
• Typical development process: task definition, dataset (training, test), feature extraction, evaluation
• Advantages: low cost, good performance
• Drawbacks: domain portability, poor control over results
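The evaluation step of this cycle typically compares the system output on the test data with the gold annotation. A minimal sketch, assuming that both are available as sets of (label, text) pairs; the example sets are invented for illustration.

```python
from typing import Set, Tuple


def precision_recall_f1(gold: Set, predicted: Set) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 between two sets of annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotation and system output for one sentence.
gold = {("GPE", "US"), ("PER", "Barack Obama"), ("LOC", "Syria")}
predicted = {("PER", "Barack Obama"), ("LOC", "Syria"), ("PER", "President")}
p, r, f1 = precision_recall_f1(gold, predicted)
```

Here two of the three predictions are correct and two of the three gold entities are found, so precision, recall and F1 are all 2/3.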
Active Learning
• A technique for selecting the training examples most likely to correct wrong classifications
• Active learning selection vs. random selection
• Involves the user in the learning process
• Improves performance by correcting the system's errors
• Significantly reduces development time
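The slides do not name the selection strategy. One common choice is uncertainty sampling: the user is asked to label the examples on which the current model is least sure. A minimal sketch under that assumption, with invented scores standing in for a real model:

```python
from typing import Callable, List, Sequence


def select_uncertain(
    pool: Sequence[str], predict_proba: Callable[[str], float], k: int
) -> List[str]:
    """Return the k pool items whose positive-class probability is closest to 0.5."""
    return sorted(pool, key=lambda item: abs(predict_proba(item) - 0.5))[:k]


# Hypothetical model scores: probability that a token is a named entity.
toy_scores = {
    "Obama": 0.95,
    "delayed": 0.05,
    "Syria": 0.55,
    "September": 0.40,
    "a": 0.02,
}
to_annotate = select_uncertain(list(toy_scores), toy_scores.get, k=2)
```

Random selection would pick any two tokens; here the two borderline cases are chosen, which is where a correction from the user is most likely to change the model's decision.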
[Figure: ontology for cultural events. Classes: Cinema, Event, Movie, Director, Destination, Contact, PostalAddress. Relations: hasName, hasInfrastructure, hasEvent, hasEventContent, hasDirector, hasContact, hasPostalAddress, isinSite, isInDestination, duration. Example instances: cinema Astra; movie Titanic (director Cameron, duration 194 mins); movie E.T. (director Spielberg, duration 164 mins)]
• Task: given a question, find a precise answer contained in a database/knowledge base
• Need to model domain knowledge: the ontology
• E.g. an ontology for cultural events (movies, sport events, etc.)
Scenario 1: Question/Answering on Structured Data
Which cinema is showing Titanic by Cameron in Trento?
Madrid, June 1, 2010 - Bernardo Magnini
Titanic is showing today in Trento at Cinema Astra at 8 p.m. Price of the ticket is 10 Euros.
Which cinema is showing Titanic by Cameron in Trento?
EAT: ?CINEMA:X
CONSTRAINT: Movie-hasCinema(“Titanic”, Cinema:X) Movie-director(“Titanic”, “Cameron”) Cinema-loc(Cinema:X, Trento)
CONTEXT: Context-Time(Q, December 14th) Context-LOC(Q, Trento)
CORE: CINEMA:Astra
JUSTIFICATION: Movie-hasCinema(“Titanic”, Cinema:Astra) Movie-director(“Titanic”, “Cameron”) Cinema-loc(Cinema:Astra, Trento)
COMPLEMENTARY: Movie-time(“Titanic”, 8 pm) Movie-price(“Titanic”, 10 euros)
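The step from the question constraints to the core answer can be sketched as matching the constraints against facts in a knowledge base. This is an illustration only, not the system described in the talk: the fact base, the second cinema "Modena", the `?X` variable syntax and the functions `match` and `solve` are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Fact = Tuple[str, str, str]  # (predicate, first argument, second argument)

# Toy knowledge base mirroring the example; "Modena" is an invented cinema.
KB: List[Fact] = [
    ("Movie-hasCinema", "Titanic", "Astra"),
    ("Movie-director", "Titanic", "Cameron"),
    ("Cinema-loc", "Astra", "Trento"),
    ("Movie-hasCinema", "E.T.", "Modena"),
    ("Movie-director", "E.T.", "Spielberg"),
    ("Cinema-loc", "Modena", "Trento"),
    ("Movie-time", "Titanic", "8 pm"),
    ("Movie-price", "Titanic", "10 euros"),
]


def match(pattern: Fact, fact: Fact, binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Unify one constraint with one fact; terms starting with '?' are variables."""
    new_binding = dict(binding)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if new_binding.setdefault(p, f) != f:
                return None  # variable already bound to a different value
        elif p != f:
            return None
    return new_binding


def solve(constraints: List[Fact], binding: Optional[Dict[str, str]] = None) -> List[Dict[str, str]]:
    """Return every variable binding that satisfies all constraints."""
    binding = binding or {}
    if not constraints:
        return [binding]
    results = []
    for fact in KB:
        extended = match(constraints[0], fact, binding)
        if extended is not None:
            results.extend(solve(constraints[1:], extended))
    return results


# CONSTRAINT part of the question, with ?X standing for the expected answer.
question = [
    ("Movie-hasCinema", "Titanic", "?X"),
    ("Movie-director", "Titanic", "Cameron"),
    ("Cinema-loc", "?X", "Trento"),
]
answers = solve(question)

# COMPLEMENTARY information: other facts about the movie in the question.
complementary = [f for f in KB if f[0] in ("Movie-time", "Movie-price") and f[1] == "Titanic"]
```

The binding of `?X` plays the role of the core answer, the matched facts are the justification, and the time and price facts supply the complementary part of the generated reply.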
Scenario 2: FAQ Retrieval
• Task: retrieve the FAQ most similar to the user question
• No need for deep text interpretation
• Text-to-text approaches
• The role of learning technologies
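The retrieval task can be illustrated with the simplest text-to-text measure, bag-of-words cosine similarity. This is a baseline sketch, not the approach evaluated in the talk; the FAQ entries and the function names are invented for the example.

```python
import math
import re
from collections import Counter
from typing import List


def bag(text: str) -> Counter:
    """Lower-cased bag of words."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, faqs: List[str]) -> str:
    """Return the FAQ whose wording is most similar to the user question."""
    q = bag(question)
    return max(faqs, key=lambda faq: cosine(q, bag(faq)))


# Hypothetical FAQ list.
FAQS = [
    "How can I change my ticket?",
    "Where can I find the timetable?",
    "How do I request a refund for a delayed train?",
]
best = retrieve("I want a refund because my train was delayed", FAQS)
```

Word overlap fails when the user and the FAQ use different words for the same thing ("overpriced" vs. "too expensive"), which is where lexical resources and learned similarity come in.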
TOPIC: Reasons for dissatisfaction in railway service
Int-448: Efficient service. Quick through security and check in. But leg room in standard class was quite poor.
Int-202: Everything ran smoothly and well. Only complaint is lack of leg room with seating with tables.
Int-275: Seating is very cramped – my journey has been very uncomfortable with the person next to me taking up most of the space we have.
Extracting Fragments from Interactions
not happy with the catering
coffee is awful
coffee in economy is awful
no refreshments
food on train is too expensive
you charge too much for sandwiches
food quality is disappointing
bad food in premier
not enough food selection
provide veggie meals
not happy with the service
journey is too slow
no clear information
not happy with the staff
staff is unfriendly
no vegetarian food
expand meal options
sandwiches are overpriced
sandwiches are too expensive
disgusting coffee is served
they have horrible coffee
food is bad
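Catering non buono (catering is not good)
Caffè pessimo (awful coffee)
C’era un caffè terribile (there was terrible coffee)
Hanno servito caffè non buono (they served coffee that was not good)
Il caffè in classe economica è cattivo (coffee in economy class is bad)
Nessun rinfresco (no refreshments)
Pasti in prima classe troppo cari (meals in first class are too expensive)
Panini troppo costosi (sandwiches are too expensive)
Panini costosissimi (very expensive sandwiches)
Si paga troppo per i panini (you pay too much for sandwiches)
Cibo di scarsa qualità (poor-quality food)
La qualità del cibo è bassa (food quality is low)
Cibo scadente in prima classe (poor food in first class)
Scelta di cibo non sufficiente (not enough food choice)
Espandere il menu (expand the menu)
Niente cibo vegetariano (no vegetarian food)
Aggiungere scelta vegetariana (add a vegetarian option)
Servizio non soddisfacente (unsatisfactory service)
Viaggio troppo lento (journey too slow)
Informazioni non chiare (unclear information)
Non soddisfatto del personale (not satisfied with the staff)
Personale poco amichevole (unfriendly staff)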
Organising Customer Interactions
Text-to-Text approaches
[Figure: platform architecture, built on UIMA CAS.
Algorithms: Distance-based (EDITS), Classification-based (TIE), Transformation-based (BIUTEE + AdArte), Alignment-based (P1EDA). Configurator.
Components: Distance Component (Edit Distance), Scoring Component (Bag of Words similarity), Lexical Component (Entailment rules), Alignment Component.
Linguistic pipelines (Token, Lemma, POS, dependency parsing): Italian, German, English, Bulgarian.
Resources: WordNet (Italian, German, English), Wikipedia (Italian, English), Distributional Similarity (English, German, Italian), Derivational Morphology (Italian, English, German), Phrase Tables (Italian, English, German)]
• A platform for text-to-text inferences
• Semantic alignment
• Similarity
• Entailment
• Contradiction
• Based on machine learning
• FAQ retrieval at Evalita 2016: http://qa4faq.github.io
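The distance-based family of approaches can be illustrated with a token-level edit distance between two texts. This is a generic Levenshtein sketch and not the EDITS implementation; the function name and the choice of unit costs are assumptions.

```python
def token_edit_distance(a: str, b: str) -> int:
    """Minimum number of token insertions, deletions and substitutions from a to b."""
    source, target = a.lower().split(), b.lower().split()
    # previous[j] holds the distance between the tokens seen so far and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_tok in enumerate(source, 1):
        current = [i]
        for j, t_tok in enumerate(target, 1):
            current.append(
                min(
                    previous[j] + 1,                       # delete s_tok
                    current[j - 1] + 1,                    # insert t_tok
                    previous[j - 1] + (s_tok != t_tok),    # substitute or keep
                )
            )
        previous = current
    return previous[-1]


d_same = token_edit_distance("coffee is awful", "coffee is awful")
d_para = token_edit_distance("sandwiches are overpriced", "sandwiches are too expensive")
```

The two sandwich fragments mean the same thing yet are two edits apart, which suggests why a plain distance is combined with lexical knowledge such as entailment rules, so that substituting related words costs less.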
Toward the Intelligent Chat
• From text-to-text (FAQ retrieval) to sequence-to-sequence learning
• Learning from chats using neural networks
• First results: very realistic dialogue “style”
• Can we learn dialogue schema?
• How to integrate specific knowledge (e.g. in a database)
Link…
• Associazione Italiana di Linguistica Computazionale (Italian Association for Computational Linguistics): ai-lc.it
• CLiC-it, Third Italian Conference on Computational Linguistics
• Evalita: evaluation of written and spoken language technologies for Italian: evalita.it
• NLP at FBK: http://hlt.fbk.eu