© Johan Bos November 2005 Pub Quiz

Post on 19-Dec-2015

Pub Quiz

Question Answering

• Lecture 1 (Last week): Introduction; History of QA; Architecture of a QA system; Evaluation.

• Lecture 2 (Today): Question Classification; NLP techniques for question analysis; Tokenisation; Lemmatisation; POS-tagging; Parsing; WordNet.

• Lecture 3 (Next lecture): Retrieving Answers; Document pre-processing; Named Entity Recognition; Anaphora Resolution; Matching; Reranking; Sanity checking.

Architecture of a QA system

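[Diagram: pipeline of four components]

• Question Analysis: takes the question; produces the query, the answer-type and the question representation
• IR: takes the query and the corpus; produces documents/passages
• Document Analysis: takes the documents/passages; produces the passage representation
• Answer Extraction: takes the answer-type, the question representation and the passage representation; produces the answers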

Syntactically Distinguishing Questions

• Wh-questions:
– Where was Franz Kafka born?
– How many countries are members of OPEC?
– Who is Thom Yorke?
– Why did David Koresh ask the FBI for a word processor?
– How did Frank Zappa die?
– Which boxer beat Muhammad Ali?

• Yes-no questions:
– Does light have weight?
– Scotland is part of England – true or false?

• Choice-questions:
– Did Italy or Germany win the World Cup in 1982?
– Who is Harry Potter’s best friend – Ron, Hermione or Sirius?

• Imperative:
– Name four European countries that produce wine.
– Give the date of birth of Franz Kafka.

• Declarative:
– I would like to know when Jim Morrison was born.

Semantically Distinguishing Questions

• Divide questions according to their expected answer type

• Simple Answer-Type Typology:

– PERSON
– NUMERAL
– DATE
– MEASURE
– LOCATION
– ORGANISATION
– ENTITY

Expected Answer Types

• DATE:
– When was JFK killed?
– In what year did Rome become the capital of Italy?

• PERSON:
– Who won the Nobel prize for Peace?
– Which rock singer wrote Lithium?

• NUMERAL:
– How many inhabitants does Rome have?
– What’s the population of Scotland?

Focus and Topic

• Information expressed in a question can be structured into two parts:
– the focus: information that is asked for
– the topic: information about the focus

• Example: How many inhabitants does Rome have?
– FOCUS: How many inhabitants
– TOPIC: does Rome have

We need to know how to process natural language!

Generating Query Terms

• Example 1:
– Question: Who discovered prions?
– Text A: Dr. Stanley Prusiner received the Nobel prize for the discovery of prions.
– Text B: Prions are a kind of protein that…

• Query terms?

• Example 2:
– Question: When did Franz Kafka die?
– Text A: Kafka died in 1924.
– Text B: Dr. Franz died in 1971.

• Query terms?

• Example 3:
– Question: How did actor James Dean die?
– Text: James Dean was killed in a car accident.

• Query terms?
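
The slides leave "Query terms?" open. One naive baseline, sketched here in Python, is to drop question words and other function words and keep the rest. The stopword list and the function name `query_terms` are assumptions made for illustration, not part of the lecture.

```python
import re

# Hypothetical stopword list: question words, auxiliaries, articles, prepositions.
STOPWORDS = {
    "who", "when", "how", "what", "which", "where", "why",
    "did", "do", "does", "is", "was", "the", "a", "an", "of", "in",
}

def query_terms(question):
    """Lowercase the question, split it into words, drop stopwords."""
    words = re.findall(r"\w+", question.lower())
    return [w for w in words if w not in STOPWORDS]

for q in ["Who discovered prions?",
          "When did Franz Kafka die?",
          "How did actor James Dean die?"]:
    print(q, "->", query_terms(q))
```

With this baseline, `discovered` does not literally occur in Text A of Example 1 (which has "discovery"), and `die` does not occur in the James Dean text (which has "was killed").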

Difference in structure

• Example:
– Question: When did Franz Kafka die?
– Text A: The mother of Franz Kafka died in 1918.
– Text B: Kafka died in 1924.
– Text C: Both Kafka and Lenin died in 1924.
– Text D: Max Brod, a friend of Kafka, died in 1930.
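
To see why word matching alone is not enough here, the following Python sketch counts the words shared by the question and each text. The scoring function `overlap` is a deliberately crude assumption for illustration, not a method from the lecture.

```python
import re

def words(text):
    """Lowercased set of the words in a text."""
    return set(re.findall(r"\w+", text.lower()))

def overlap(question, text):
    """Number of distinct words shared by question and text."""
    return len(words(question) & words(text))

question = "When did Franz Kafka die?"
texts = {
    "A": "The mother of Franz Kafka died in 1918.",
    "B": "Kafka died in 1924.",
    "C": "Both Kafka and Lenin died in 1924.",
    "D": "Max Brod, a friend of Kafka, died in 1930.",
}
for name, text in texts.items():
    print(name, overlap(question, text))
```

Text A shares two words with the question (Franz, Kafka) and the other texts share only one (Kafka), so this score ranks the sentence about Kafka's mother first. It also cannot tell Text B from Text D, although only the former says that Kafka died.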

Natural Language is messy!

• We need ways to automate the process of manipulating natural language
– Punctuation
– The way words are composed
– The relationship between wordforms
– The relationship between words
– The structure of phrases

• This is where NLP (Natural Language Processing) comes in!

NLP Techniques

• Tokenisation

• Lemmatisation

• Part of Speech Tagging

• Syntactic analysis (parsing)

• WordNet

Tokenisation

• Tokenisation is the task that splits words from punctuation
– semicolons, colons ; :
– exclamation marks, question marks ! ?
– commas and full stops , .
– quotes “ ‘ `

• Tokens are normally split by spaces

Tokenisation: Example 1

• Input (9 tokens):

When was the Buckingham Palace built in London, England?

• Output (11 tokens):

When was the Buckingham Palace built in London , England ?

Tokenisation: Example 2

• Input (7 tokens):

What year did "Snow White" come out?

• Output (10 tokens):

What year did " Snow White " come out ?
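
The two tokenisation examples can be reproduced with a single regular expression. This is a minimal Python sketch, not the tokeniser used in the lecture; the function name `tokenise` is an assumption.

```python
import re

def tokenise(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

ex1 = "When was the Buckingham Palace built in London, England?"
ex2 = 'What year did "Snow White" come out?'

# Tokens when splitting on spaces only, then after tokenisation.
print(len(ex1.split()), len(tokenise(ex1)))  # 9 11
print(len(ex2.split()), len(tokenise(ex2)))  # 7 10
print(" ".join(tokenise(ex2)))
```

A pattern this simple also splits "U.S." into four tokens and "don't" into three, so abbreviations and combined words need separate treatment.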

Tokenisation: combined words

• Combined words are split
– I’d → I ’d
– country’s → country ’s
– won’t → will n’t
– “don’t!” → “ do n’t ! ”

• Some Italian examples
– gliel’ha detto → gli lo ha detto
– posso prenderlo → posso prendere lo
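
The English splits above can be sketched in Python as a suffix pattern plus a small exception table. The names `IRREGULAR`, `CLITIC` and `split_clitics` are assumptions for illustration, and the suffix list is not exhaustive.

```python
import re

# Forms whose first part changes when split (from the slide: won't -> will n't).
IRREGULAR = {"won't": ["will", "n't"]}

# A word body followed by one clitic suffix at the end of the token.
CLITIC = re.compile(r"^(.+?)(n't|'s|'d|'ll|'re|'ve|'m)$", re.IGNORECASE)

def split_clitics(token):
    """Split one token such as "don't" into its parts."""
    token = token.replace("\u2019", "'")  # treat the curly apostrophe like '
    if token.lower() in IRREGULAR:
        return list(IRREGULAR[token.lower()])
    match = CLITIC.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]

for word in ["I'd", "country's", "won't", "don't", "Kafka"]:
    print(word, "->", split_clitics(word))
```

The function expects a single word token, so surrounding quotes and the exclamation mark in “don’t!” would have to be split off first.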

Difficulties with tokenisation

• Abbreviations, acronyms
– When was the U.S. invasion of Haiti?

• In particular if the abbreviation or acronym is the last word of a sentence
– Look at the next word: if it starts with an uppercase letter, then assume it is the end of the sentence
– But think of cases such as Mr. Jones
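
The uppercase heuristic and its "Mr. Jones" exception can be written down as a small Python function. The title list `TITLES` and the function name `ends_sentence` are assumptions for illustration; a real system would need a much larger abbreviation list.

```python
# Hypothetical list of titles that are normally followed by a capitalised name.
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof."}

def ends_sentence(token, next_token):
    """Guess whether a token ending in a full stop also ends the sentence."""
    if not token.endswith("."):
        return False
    if token in TITLES:
        return False          # "Mr. Jones": uppercase follows, but no boundary
    if next_token is None:
        return True           # nothing follows, so the text ends here
    return next_token[:1].isupper()

print(ends_sentence("U.S.", "invasion"))  # False
print(ends_sentence("U.S.", "The"))       # True
print(ends_sentence("Mr.", "Jones"))      # False
```

The heuristic still fails when an abbreviation is followed by a proper name inside the sentence, for example "U.S. Army".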

Why is tokenisation important?

• To look up a word in an electronic dictionary (such as WordNet)

• For all subsequent stages of processing
– Lemmatisation
– Parsing

Lemmatisation

• Lemmatising means
– grouping morphological variants of words under a single headword

• For example, you could take the words am, was, are, is, were, and been together under the word be

• Using linguistic terminology, the variants taken together form the lemma of a lexeme

• Lexeme: a “lexical unit”, an abstraction over specific constructions

• Other examples:

dying, die, died, dies → die
car, cars → car
man, men → man
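
The simplest way to implement this grouping is a lookup table from word form to headword, sketched below in Python with the examples from the slides. The table `LEMMAS` and the function `lemmatise` are assumptions for illustration; practical lemmatisers combine such exception lists with suffix rules and part-of-speech information.

```python
# Hand-written table built from the slide examples (word form -> headword).
LEMMAS = {
    "am": "be", "was": "be", "are": "be", "is": "be", "were": "be", "been": "be",
    "dying": "die", "died": "die", "dies": "die",
    "cars": "car",
    "men": "man",
}

def lemmatise(word):
    """Return the headword of a word form, or the lowercased form itself."""
    form = word.lower()
    return LEMMAS.get(form, form)

for word in ["Was", "dies", "men", "car"]:
    print(word, "->", lemmatise(word))
```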

Traditional parts of speech

• Verb

• Noun

• Pronoun

• Adjective

• Adverb

• Preposition

• Conjunction

• Interjection

Parts of speech in NLP

CLAWS1 (132 tags)

Examples:
NN singular common noun (boy, pencil ... )
NN$ genitive singular common noun (boy's, parliament's ... )
NNP singular common noun with word initial capital (Austrian, American, Sioux, Eskimo ... )
NNP$ genitive singular common noun with word initial capital (Sioux', Eskimo's, Austrian's, American's, ...)
NNPS plural common noun with word initial capital (Americans, ... )
NNPS$ genitive plural common noun with word initial capital (Americans', …)

NNS plural common noun (pencils, skeletons, days, weeks ... )

NNS$ genitive plural common noun (boys', weeks' ... )

NNU abbreviated unit of measurement unmarked for number (in, cc, kg …)

Penn Treebank (45 tags)

Examples:
JJ adjective (green, …)

JJR adjective, comparative (greener,…)

JJS adjective, superlative (greenest, …)

MD modal (could, will, …)

NN noun, singular or mass (table, …)

NNS noun plural (tables, …)

NNP proper noun, singular (John, …)

NNPS proper noun, plural (Vikings, …)

PDT predeterminer (both the boys)

POS possessive ending (friend's)

PRP personal pronoun (I, he, it, …)

PRP$ possessive pronoun (my, his, …)

RB adverb (however, usually, naturally, here, good, …)

RBR adverb, comparative (better, …)

POS tagged example

What WP
year NN
did VBD
" ``
Snow NNP
White NNP
" ''
come VB
out IN
? .

Why is POS-tagging important?

• To disambiguate words

• For instance, to distinguish “book” used as a noun from “book” used as a verb
– I like that book
– Did you book a room?

• Prerequisite for further processing stages, such as parsing

What is Parsing?

• Parsing is the process of assigning a syntactic structure to a sequence of words

• The syntactic structure is defined using a grammar

• A grammar consists of a set of symbols (terminal and non-terminal symbols) and production rules (grammar rules)

• The lexicon is built over the terminal symbols (i.e., the words)

Syntactic Categories

• The non-terminal symbols correspond to syntactic categories
– Det (determiner)
– N (noun)
– IV (intransitive verb)
– TV (transitive verb)
– PN (proper name)
– Prep (preposition)
– NP (noun phrase): the car
– PP (prepositional phrase): at the table
– VP (verb phrase): saw a car
– S (sentence): Mia likes Vincent

Example Grammar

Lexicon

Det: which, a, the,…

N: rock, singer, …

IV: die, walk, …

TV: kill, write,…

PN: John, Lithium, …

Prep: on, from, to, …

Grammar Rules

S → NP VP
NP → Det N
NP → PN
N → N N
N → N PP
VP → TV NP
VP → IV
PP → Prep NP
VP → VP PP

The Parser

• A parser automates the process of parsing

• The input of the parser is a string of words (possibly annotated with POS-tags)

• The output of a parser is a parse tree, connecting all the words

• The way a parse tree is constructed is also called a derivation
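
One way to automate parsing with the example grammar is a bottom-up chart parser in the style of CKY, sketched below in Python. This is a different search strategy from a derivation that backtracks: the chart keeps every partial analysis, so nothing has to be undone. The lexicon entry `wrote` is an assumption (the example lexicon lists only the base form `write`), and the bracketed output format is chosen for illustration.

```python
from collections import defaultdict

# Lexicon of the example grammar; "wrote" is added here as an assumption.
LEXICON = {
    "which": "Det", "a": "Det", "the": "Det",
    "rock": "N", "singer": "N",
    "die": "IV", "walk": "IV",
    "kill": "TV", "write": "TV", "wrote": "TV",
    "john": "PN", "lithium": "PN",
    "on": "Prep", "from": "Prep", "to": "Prep",
}

# Grammar rules, indexed by right-hand side.
BINARY = {
    ("NP", "VP"): "S", ("Det", "N"): "NP", ("N", "N"): "N",
    ("N", "PP"): "N", ("TV", "NP"): "VP", ("Prep", "NP"): "PP",
    ("VP", "PP"): "VP",
}
UNARY = {"PN": "NP", "IV": "VP"}

def close_unary(cell):
    """Apply the unary rules NP -> PN and VP -> IV inside one chart cell."""
    for cat, tree in list(cell.items()):
        parent = UNARY.get(cat)
        if parent and parent not in cell:
            cell[parent] = f"[{parent} {tree}]"

def parse(words):
    """Return a bracketed tree for S over all words, or None if there is none."""
    n = len(words)
    chart = defaultdict(dict)  # (start, end) -> {category: bracketed tree}
    for i, word in enumerate(words):
        cat = LEXICON[word.lower()]  # raises KeyError for unknown words
        chart[i, i + 1][cat] = f"[{cat} {word}]"
        close_unary(chart[i, i + 1])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = chart[i, j]
            for k in range(i + 1, j):  # every split point of the span
                for lcat, ltree in chart[i, k].items():
                    for rcat, rtree in chart[k, j].items():
                        parent = BINARY.get((lcat, rcat))
                        if parent and parent not in cell:
                            cell[parent] = f"[{parent} {ltree} {rtree}]"
            close_unary(cell)
    return chart[0, n].get("S")

print(parse("Which rock singer wrote Lithium".split()))
```

Each cell stores only the first tree found per category, so for an ambiguous sentence this sketch returns one analysis rather than all of them.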

Derivation Example

Which rock singer wrote Lithium

Lexical stage

Det N N TV PN
Which rock singer wrote Lithium

Use rule: NP → Det N

[NP Which rock] singer wrote Lithium

Use rule: NP → PN

[NP Which rock] singer wrote [NP Lithium]

Use rule: VP → TV NP

[NP Which rock] singer [VP wrote [NP Lithium]]

Backtracking

No rule can attach the stranded N "singer", so the NP over "Which rock" is undone.

Which rock singer [VP wrote [NP Lithium]]

Use rule: N → N N

Which [N rock singer] [VP wrote [NP Lithium]]

Use rule: NP → Det N

[NP Which [N rock singer]] [VP wrote [NP Lithium]]

Use rule: S → NP VP

[S [NP Which [N rock singer]] [VP wrote [NP Lithium]]]

Syntactic “head”

[S [NP [Det Which] [N [N rock] [N singer]]] [VP [TV wrote] [NP [PN Lithium]]]]

Parse Tree (another example)

[S [NP [Det The] [N [N mother] [PP [Prep of] [NP [PN Franz] [PN Kafka]]]]] [VP [VP [IV died]] [PP [Prep in] [NP 1918]]]]

Syntactic head

[S [NP [Det The] [N [N mother] [PP [Prep of] [NP [PN Franz] [PN Kafka]]]]] [VP [VP [IV died]] [PP [Prep in] [NP 1918]]]]

Using a parser

• Normally expects tokenised and POS-tagged input

• Examples of wide-coverage parsers:
– Charniak parser
– Collins parser
– RASP (Carroll & Briscoe)
– CCG parser (Clark & Curran)

WordNet

• Electronic dictionary

• Not only words and definitions, but also relations between words

• Four parts of speech
– Nouns
– Verbs
– Adjectives
– Adverbs

WordNet SynSets

• Words are organised in SynSets

• A SynSet is a group of words with the same meaning, in other words, a set of synonyms

• Example: {Rome, Roma, Eternal City, Italian Capital, capital of Italy}

Senses

• A word can have several different meanings

• Example: plant
– A building for industrial labour
– A living organism lacking the power of locomotion

• The different meanings of a word are called senses

• Therefore, one word can occur in more than one SynSet in WordNet

SynSet Example

- {mug, mugful} = the quantity that can be held in a mug

- {chump, fool, gull, mark, patsy, fall guy, sucker, soft touch, mug} = a person who is gullible and easy to take advantage of

- {countenance, physiognomy, phiz, visage, kisser, smiler, mug} = the human face
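
The idea that one word belongs to several SynSets can be sketched in Python with the three SynSets above stored as (members, gloss) pairs. The data structure `SYNSETS` and the function `senses` are assumptions for illustration and do not reflect how WordNet itself is stored or queried.

```python
# The three example SynSets, each as (set of synonyms, gloss).
SYNSETS = [
    ({"mug", "mugful"},
     "the quantity that can be held in a mug"),
    ({"chump", "fool", "gull", "mark", "patsy", "fall guy",
      "sucker", "soft touch", "mug"},
     "a person who is gullible and easy to take advantage of"),
    ({"countenance", "physiognomy", "phiz", "visage",
      "kisser", "smiler", "mug"},
     "the human face"),
]

def senses(word):
    """Return the gloss of every SynSet that contains the word."""
    return [gloss for members, gloss in SYNSETS if word in members]

for gloss in senses("mug"):
    print("mug:", gloss)
```

Here `senses("mug")` returns three glosses, one per sense, while `senses("visage")` returns only the gloss for the human face.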

Hypernyms

• Hypernymy is a WordNet relation defined between two SynSets
– If A is a hypernym of B, then A is more generic than B

• The inverse of hypernymy is hyponymy
– If A is a hyponym of B, then A is more specific than B

• Use these relations transitively

• Examples:
– “cow” and “horse” are hyponyms of “animal”
– “publication” is a hypernym of “book”

Examples using WordNet

• Which rock singer …
– singer is a hyponym of person, therefore expected answer type is PERSON

• What is the population of …
– population is a hyponym of number, hence answer type NUMERAL
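
The two examples amount to walking up the hypernym relation until a word with a known answer type is reached. The Python sketch below uses a tiny hand-made hypernym table instead of WordNet. The table entries (including the intermediate steps `musician` and `athlete`), the single-parent simplification and the `ENTITY` fallback are all assumptions for illustration.

```python
# Toy hypernym table: word -> one more generic word (hand-made, not WordNet).
HYPERNYM = {
    "singer": "musician", "musician": "person",
    "boxer": "athlete", "athlete": "person",
    "population": "number",
    "book": "publication",
    "cow": "animal", "horse": "animal",
}

# Words that directly determine an expected answer type.
ANSWER_TYPE = {"person": "PERSON", "number": "NUMERAL"}

def hypernyms(word):
    """Follow the hypernym relation transitively and return the chain."""
    chain, seen = [], {word}
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word in seen:  # guard against cycles in a hand-made table
            break
        seen.add(word)
        chain.append(word)
    return chain

def answer_type(noun):
    """Expected answer type for the head noun of a question."""
    for word in [noun] + hypernyms(noun):
        if word in ANSWER_TYPE:
            return ANSWER_TYPE[word]
    return "ENTITY"  # assumed fallback when nothing more specific is found

print(answer_type("singer"))      # PERSON
print(answer_type("population"))  # NUMERAL
```

In WordNet itself a word has several senses and a SynSet can have more than one hypernym, so a real implementation has to choose a sense and search over several paths.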

How to use NLP tools?

• There is a large set of tools available on the web, most of them free for research

• Examples of integrated text processing environments
– GATE (University of Sheffield)
– TTT (University of Edinburgh)
– LingPipe

• For a general overview of NLP tools, see http://registry.dfki.de/

Question Answering

• Lecture 1 (Last week): Introduction; History of QA; Architecture of a QA system; Evaluation.

• Lecture 2 (Today): Question Classification; NLP techniques for question analysis; Tokenisation; Lemmatisation; POS-tagging; Parsing; WordNet.

• Lecture 3 (Next lecture): Retrieving Answers; Document pre-processing; Named Entity Recognition; Anaphora Resolution; Matching; Reranking; Sanity checking.