
Improving Search through Corpus Profiling

Bonnie Webber

School of Informatics

University of Edinburgh

Scotland

Original motivation

PhD research (Michael Kaisser) on using lexical resources (FrameNet, PropBank, VerbNet) to improve performance in QA.

Developed two methods [Kaisser & Webber, 2007]

Evaluation on Web and AQUAINT corpus produced significantly different results.

Other research shows the same methods on the same input producing significantly different results on different corpora.

FrameNet

Example of annotated FrameNet data:

(Screenshots from framenet.icsi.berkeley.edu)

Two QA methods

Method 1: Use resources to generate templates in which the answer might be found. Project templates onto quoted strings used directly as search queries.

Method 2: Use resources to generate dependency structures in which the answer might occur. Search on lexical co-occurrence. Filter results by comparing the structure of candidate sentences with the structure of the annotated resource sentences.

Method 1

Example:

“Who purchased YouTube?”

Method 1

Extract simplified dependency structure from question using MiniPar:

head: purchase.v

head\subj: “Who”

head\obj: “YouTube”
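The simplified structure above can be thought of as a small record of a head plus its argument slots. A minimal sketch in Python (the dict layout is illustrative, not MiniPar's actual output format):

```python
# Illustrative representation of the question analysis above; MiniPar's real
# output is richer, so treat this dict as an assumption used in later sketches.
question = {
    "head": "purchase.v",
    "subj": "Who",
    "obj": "YouTube",
}
```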

Method 1

Get annotated sentences from FrameNet for purchase.v:

The company had purchased several PDMS terminals ...
  FE:Buyer = “The company”, lexical unit = “purchased”, FE:Goods = “several PDMS terminals”

Method 1


Use MiniPar to associate annotated abstract frame structure with dependency structure:

Buyer[Subject, NP] VERB Goods[Object, NP]

Method 1

Buyer[Subject, NP] VERB Goods[Object, NP]

head=purchase.V, Subject=“Who”, Object=“YouTube”

Buyer[ANSWER] purchase.V Goods[“YouTube”]

Method 1

Generate potential answer templates:

  ANSWER[NP] purchased YouTube
  ANSWER[NP] (has|have) purchased YouTube
  ANSWER[NP] had purchased YouTube
  YouTube (has|have) been purchased by ANSWER[NP]
  ...
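A minimal sketch (not Kaisser & Webber's actual code) of how the instantiated frame pattern Buyer[ANSWER] purchase.V Goods[“YouTube”] might be expanded into the surface templates above. Function and argument names are hypothetical.

```python
def generate_templates(verb_forms, bindings):
    """Expand a Buyer-VERB-Goods pattern into surface answer templates.

    verb_forms: inflected forms of the frame's lexical unit
    bindings:   role -> filler, with the wh-role mapped to "ANSWER[NP]"
    """
    buyer, goods = bindings["Buyer"], bindings["Goods"]
    past, part = verb_forms["past"], verb_forms["past_participle"]
    return [
        f"{buyer} {past} {goods}",                     # ANSWER[NP] purchased YouTube
        f"{buyer} (has|have) {part} {goods}",          # ANSWER[NP] (has|have) purchased YouTube
        f"{buyer} had {part} {goods}",                 # ANSWER[NP] had purchased YouTube
        f"{goods} (has|have) been {part} by {buyer}",  # passive variant
    ]

templates = generate_templates(
    {"past": "purchased", "past_participle": "purchased"},
    {"Buyer": "ANSWER[NP]", "Goods": "YouTube"},
)
print(templates)
```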

Method 1

Use patterns to generate quoted strings as search queries:

  "YouTube has been purchased by"

Extract sentences from snippets. Parse sentences. If structures match, extract answer:

“YouTube has been purchased by Google for $1.65 billion.”
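A hypothetical sketch of this last step: one expanded template becomes a quoted query, and the ANSWER slot is filled from a retrieved sentence. The real method parses the retrieved sentences and checks structural match; a regular expression stands in for that check here purely for illustration.

```python
import re

# Query string sent to the search engine as-is (quotes force exact matching).
query = '"YouTube has been purchased by"'
candidate = "YouTube has been purchased by Google for $1.65 billion."

# Stand-in for the structural match: the NP right after the template text.
m = re.search(r"YouTube has been purchased by (?P<answer>[A-Z][\w.&-]*)", candidate)
if m:
    print(m.group("answer"))   # -> Google
```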

Method 1 (extended)

... the landowner sold the land to developers ...
  FE:Seller = “the landowner”, LU = “sold”, FE:Goods = “the land”, FE:Buyer = “developers”

The company had purchased several PDMS terminals ...
  FE:Buyer = “The company”, LU = “purchased”, FE:Goods = “several PDMS terminals”

Create additional paraphrases using all verbs in the original frame & verbs identified through inter-frame relations:

  ANSWER[NP] bought YouTube
  YouTube was sold to ANSWER[NP]

Method 1 (Web-based evaluation)

Accuracy results on the 264 (of 500) TREC 2002 questions whose head verb is not “be”:

  FN base   PropBank   VerbNet   combined
  0.181     0.227      0.223     0.261

  FN base   all verbs in frame   inter-frame relations
  0.181     0.204                0.215

Method 1 (further extension)

FN often gives ‘interesting’ examples rather than common ones. So assume (as default) that verbs display common patterns:

  Intransitive:   [ARG0] VERB
  Transitive:     [ARG0] VERB [ARG1]
  Ditransitive:   [ARG0] VERB [ARG1] [ARG2]

If one of these patterns is observed in the question but isn’t among those found in FN, just add it (a sketch follows below).

  combined   combined+
  0.261      0.284
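A sketch (assumed, not the authors' code) of the default-pattern fallback just described: if the question exhibits a common argument pattern that the FrameNet examples for the verb do not contain, add it anyway.

```python
# Common argument patterns assumed as defaults for any verb.
COMMON_PATTERNS = {
    ("ARG0", "VERB"),                   # intransitive
    ("ARG0", "VERB", "ARG1"),           # transitive
    ("ARG0", "VERB", "ARG1", "ARG2"),   # ditransitive
}

def extend_patterns(fn_patterns, question_pattern):
    """fn_patterns: set of patterns seen in FrameNet's examples for the verb."""
    patterns = set(fn_patterns)
    if question_pattern in COMMON_PATTERNS and question_pattern not in patterns:
        patterns.add(question_pattern)
    return patterns

# "Who purchased YouTube?" is a plain transitive, so its pattern is added even
# if FrameNet's examples for the verb only show less common constructions.
print(extend_patterns({("ARG0", "VERB", "ARG1", "ARG2")},
                      ("ARG0", "VERB", "ARG1")))
```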

Method 1

Method 1 and its extensions all lead to clear improvements in QA over the Web, but they may be losing answers by finding only exact string matches:

“YouTube was recently purchased by Google for $1.65 billion.”

Method 2 addresses this.

Method 2

Associates each annotated sentence in FN and PB with a set of dependency paths from the head to each of the frame elements.

“The Soviet Union[ARG0] has purchased roughly eight million tons of grain[ARG1] this month[TMP]”.

  head: “purchase”, path = /i
  ARG0: paths = {./s, ./subj}
  ARG1: paths = {./obj}
  TMP:  paths = {./mod}
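A rough sketch of the representation implied above: an annotated example is stored as its head plus a set of dependency paths per role. The dict layout is an assumption; the path notation follows the slides.

```python
# One PropBank example, keyed by role, with the dependency paths from the head
# to each role filler (paths written in the slide's notation).
propbank_example = {
    "head": "purchase",
    "head_path": "/i",
    "roles": {
        "ARG0": {"surface": "The Soviet Union", "paths": {"./s", "./subj"}},
        "ARG1": {"surface": "roughly eight million tons of grain", "paths": {"./obj"}},
        "TMP":  {"surface": "this month", "paths": {"./mod"}},
    },
}
```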

Method 2

Question analysis: same as Method 1.

Search based on keywords from the question:

  purchased YouTube   (no quotes)

Sentences are extracted from the returned snippets, e.g.:

“Their aim is to compete with YouTube, which Google recently purchased for more than $1 billion.”

A dependency parse is produced for each extracted sentence.

Method 2

Eight tests comparing dependency paths:

1a Do the candidate and example sentences share the same head verb?

1b Do the candidate and example sentences share the same path to the head?

2a In the candidate sentence, do we find one or more of the example’s paths to the answer role?

2b In the candidate sentence, do we find all of the example’s paths to the answer role?

Method 2

3a Can some of the paths for the other roles be found in the candidate sentence?

3b Can all of the paths for the other roles be found in the candidate sentence?

4a Do the surface strings of the other roles partially match those of the question?

4b Do the surface strings of the other roles completely match those of the question?

Method 2

Each sentence that passes steps 1a and 2a is assigned a weight of 1. (Otherwise 0.)

For each of the remaining tests that succeeds, that weight is multiplied by 2.
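A minimal sketch of this weighting scheme (the test implementations themselves are omitted): a candidate that passes 1a and 2a starts at weight 1, and each further passing test doubles that weight.

```python
def score(results):
    """results: dict mapping test names ('1a' .. '4b') to True/False."""
    if not (results.get("1a") and results.get("2a")):
        return 0
    weight = 1
    for name, passed in results.items():
        if name not in ("1a", "2a") and passed:
            weight *= 2
    return weight

# The YouTube/Google candidate worked through below passes 1a, 2a, 2b, 3a, 3b
# and fails 1b, 4a, 4b, giving 1 * 2**3 = 8.
print(score({"1a": True, "1b": False, "2a": True, "2b": True,
             "3a": True, "3b": True, "4a": False, "4b": False}))
```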

Method 2

Annotated frame sentence (from PropBank):

“The Soviet Union[ARG0] has purchased roughly eight million tons of grain[ARG1] this month[TMP]”.

Candidate sentence retrieved from the Web:

“Their aim is to compete with YouTube, which Google recently purchased for more than $1 billion.”

N.B. The object relative clause means an exact string match would fail.

Method 2

Candidate sentence:
  head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i
  phrase: “Google”, paths = {./s, ./subj}
  phrase: “which”, paths = {./obj}
  phrase: “YouTube”, paths = {\i\rel}
  phrase: “for more than $1 billion”, paths = {./mod}

PropBank example sentence:
  head: “purchase”, path = /i
  ARG0: “The Soviet Union”, paths = {./s, ./subj}
  ARG1: “roughly eight million tons of grain”, paths = {./obj}
  TMP: “this month”, paths = {./mod}

The results of the tests are:

  1a OK    1b –
  2a OK    2b OK
  3a OK    3b OK
  4a –     4b –

This sentence returns the answer “Google”: it passes 1a and 2a (weight 1), and 2b, 3a and 3b each double that weight, so it is assigned a score of 1 × 2³ = 8.

Method 2

Candidate sentence:
  head: “purchase”, path = /i/pred/i/mod/pcomp-n/rel/i
  phrase: “Google”, paths = {./s, ./subj}
  phrase: “which”, paths = {./obj}
  phrase: “YouTube”, paths = {../..}
  phrase: “for more than $1 billion”, paths = {./mod}

We get a (partially correct) role assignment:
  ARG0: “Google”, paths = {./s, ./subj}
  ARG1: “which”, paths = {./obj}
  TMP: “for more than $1 billion”, paths = {./mod}

Method 2

Evaluation results for Method 2:

  FrameNet   0.030
  PropBank   0.159

PropBank outperforms FrameNet because:
  More lexical entries in PropBank
  More example sentences per entry in PropBank
  FrameNet does not annotate peripheral adjuncts

  Method 1 – FrameNet          0.181
  Method 1 – PropBank          0.227
  Method 1 – VerbNet           0.223
  Method 1 – all resources     0.261
  Method 2 – PropBank          0.159
  All methods – PropBank       0.306
  All methods – all resources  0.367

Evaluation

21% improvement on the 264 non-“be” TREC 2002 questions, when used on the Web.

Problem

Similar levels of improvement were not found when the exact same methods were applied directly to the AQUAINT corpus.

                         Web     AQUAINT
  Method 1 – FrameNet    0.181   0.027
  Method 2 – PropBank    0.159   0.023

Not an isolated case

Across 9 different IR models, [Iwayama et al, 2003] found similar differences when posing the same queries to a corpus of Japanese patent applications (full text) and a corpus of Japanese newspaper articles:

               Patents   Newspapers
  tf           .0227     .1054
  idf          .1577     .2443
  log(tf)      .1255     .2266
  log(tf).idf  .2132     .2853
  BM25         .2503     .3346

But they don’t speculate on the reason for these results.

What makes for such differences?

In Kaisser’s case, the form in which information appears in the corpus may match neither the question nor any form derivable from it via FrameNet, PropBank or VerbNet.

What year was Alaska purchased?

“On March 30, 1867, U.S. Secretary of State William H. Seward reached agreement with Russia to purchase the territory of Alaska for $7.2 million, a deal roundly ridiculed as Seward’s Folly.” (APW20000329.0213)

“But by 1867, when Secretary of State William H. Seward negotiated the purchase of Alaska from the Russians, sweetheart deals like that weren’t available anymore.” (NYT19980915.0275)

Hypothesis

Profiling a corpus and adapting search to its characteristics can improve performance in IR and QA.

Neither new nor surprising: “Genre, like a range of other non-topical features of documents, has been under-exploited in IR algorithms to date, despite the fact that we know that searchers rely heavily on such features when evaluating and selecting documents” [Freund et al, 2006].

Also cf. [Argamon et al, 1998; Finn & Kushmerick, 2006; Karlgren 2004; Kessler et al, 1997]

What basis for profiling?

Documents can be characterised in terms of:
  genre
  register
  domain

These in turn implicate:
  lexical choice
  syntactic choice
  choice of referring expression
  structural choices at the document level
  formatting choices

Definitions

Genre, register, and domain are not completely independent concepts.

Genre: A distinctive type of communicative action, characterized by a socially recognized communicative purpose and common aspects of form [Orlikowski & Yates, 1994].

Register: Generalized stylistic choices due to situational features such as audience and discourse environment [Morato et al., 2003].

Domain: The knowledge and assumptions held by members of a (professional) community.

Assumptions

In IR, it seems worth characterizing documents directly as to genre (and possibly register).

Doing so automatically requires characterising, inter alia, significant linguistic features.

For QA, further benefits will come from profiling the lexical, syntactic, referential, structural and formatting consequences of genre, register and domain, and exploiting these features directly.

Direct use of genre [Freund et al, 2006], [Yeung et al, 2007]

Analysed behavior of software engineering consultants looking for documents they need in order to provide technical services to customers using the company’s software product.

A range of genres identified through both user interviews and analysis of the websites and repositories they used

Direct use of genre

  Manuals
  Presentations
  Product documents
  Technotes, tips
  Tutorials and labs
  White papers
  Best practices
  Design patterns
  Discussions/forums
  …

Direct use of genre

Requires manually labelling each document with its genre, or recognizing its genre automatically. The latter requires characterising genres in terms of automatically recognizable features.

Best practice: Description of a proven methodology or technique for achieving a desired result, often based on practical experience.
  Form: primarily text, many formats, variable length
  Style: imperatives, “best practice”
  Subject matter: new technologies, design, coding
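As a toy illustration of turning a genre profile like the one above into automatically recognizable features: the phrase lists and feature names below are invented for this sketch, and this is not the classifier used in X-Site.

```python
import re

# Invented cues suggested by the "Best practice" profile above.
SIGNAL_PHRASES = ("best practice", "proven", "recommended approach")
IMPERATIVE_STARTERS = {"use", "avoid", "ensure", "configure", "install", "define"}

def best_practice_features(text):
    """Extract simple surface features a genre classifier could consume."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    first_words = [s.split()[0].lower() for s in sentences if s.split()]
    return {
        "has_signal_phrase": any(p in text.lower() for p in SIGNAL_PHRASES),
        "imperative_ratio": (sum(w in IMPERATIVE_STARTERS for w in first_words)
                             / max(len(first_words), 1)),
        "n_sentences": len(sentences),
    }

print(best_practice_features("Use connection pooling. Avoid global locks. "
                             "This is a proven best practice for scaling."))
```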

Direct use of genre (X-Site)

Prototype workplace search tool for software engineers, currently in use [Yeung et al, 2007]. Provides access to ~8GB of content crawled from the Internet, intranet and Lotus Notes data.

Exploits:
  Task profiles
  Task–genre associations: known +/_/- relationships between task and genre pairs
  Automatic genre classifier

Using genre, register, domain in QA

Answers to Qs can be found anywhere, not just in documents on the specific topic.

Q: When did the Titanic sink?

Twelve months have passed since 193 people died aboard the Herald of Free Enterprise. But time has not eased the pain of Evelyn Pinnells, who lost two daughters when the ferry capsized off Belgium. They were among the victims when the Herald of Free Enterprise capsized off the Belgian port of Zeebrugge on March 6, 1987. It was the worst peacetime disaster involving a British ship since the Titanic sank in 1912.

Using genre, register, domain in QA

For this reason, IR for QA differs from general IR, using (instead) passage retrieval, quoted strings, etc.

For the same reason, one may not want to prefilter documents by genre, register or domain labels (as seems useful for IR).

Rather, it may be beneficial to exploit the linguistic features, and the patterns among them, that realize genre, register and domain.

What are those features?

Lexical features

Register strongly affects word choice:
  MedLinePlus: “runny nose”
  PubMed: “rhinitis”, “nasopharyngeal symptoms”
  Clinical notes: “greenish sputum”
  UMLS: informal “greenish” doesn’t appear
  [Bodenreider & Pakhomov, 2004]

Domain also affects word choice:
  “smoltification” occurs ~600 times in a corpus of 1000 papers on salmon, but not at all in AQUAINT [Gabbay & Sutcliffe, 2004].

Lexical features

Register strongly affects type/token ratios:

Only 850 core words (+ inflections) in Basic English, so the type/token ratio is very small.

Federalist papers: ~0.36

King James Version: And God said, Let the waters bring forth abundantly the moving creature that hath life, and fowl that may fly above the earth in the open firmament of heaven.

Bible in Basic English: And God said, Let the waters be full of living things, and let birds be in flight over the earth under the arch of heaven.
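To make the measure concrete, a small sketch of a type/token ratio computation follows; note that ratios computed on fragments this short are not comparable to corpus-level figures, which require large, length-matched samples.

```python
import re

def type_token_ratio(text):
    """Number of distinct word types divided by number of word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens)

kjv = ("And God said, Let the waters bring forth abundantly the moving "
       "creature that hath life, and fowl that may fly above the earth "
       "in the open firmament of heaven.")
print(round(type_token_ratio(kjv), 2))
```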

Lexical features

IR4QA using either keywords or quoted strings for passage retrieval could benefit from responding to both types of lexical divergence between question and corpus.

Syntactic Features: Voice

Active:
  The Grid provides an ideal platform for new ontology tools and data bases, …
  Users log-in using a password which is encrypted using a public key and private key mechanism.

Passive:
  Ontologies are recognized as having a key role in data integration on the computational Grid.
  We store ontology files in hierarchical collections, based on user unique identifiers, ontology identifiers, and ontology version numbers.

Syntactic Features

Passive voice is used significantly more often in the physical sciences than in the social sciences [Bonzi, 1990].

                       Active (%)   Passive (%)
  Air pollution          65.5         34.5
  Infectious disease     66.3         33.7
  Ed administration      77.9         22.1
  Social Psychology      76.6         23.4

Syntactic Features

Passives are also used significantly often in surgical reports [Bross et al, 1972] and in repair reports. For agentive verbs, the missing agent is the surgeon (or surgical team) or the repair person:

  “… the skin was prepared and draped … Incision was made … Axillary fat was dissected and bleeding controlled …”

But not for non-agentive verbs.

Syntactic Features

CasRep (maintenance & repair report):

  DURING NORMAL START CYCLE OF 1A GAS TURBINE, APPROX 90 SEC AFTER CLUTCH ENGAGEMENT, LOW LUBE OIL AND FAIL TO ENGAGE ALARM WERE RECEIVED ON THE ACC. (ALL CONDITIONS WERE NORMAL INITIALLY). SAC WAS REMOVED AND METAL CHUNKS FOUND IN OIL PAN. LUBE OIL PUMP WAS REMOVED AND WAS FOUND TO BE SEIZED. DRIVEN GEAR WAS SHEARED ON PUMP SHAFT.

Syntactic Features: Clause type

Main clause, relative clause:
  In this case, the user (j.bard) has created a private version of the CARO ontology which is shared with stuart.aitken …

Participial clause:
  Users log-in using a password which is encrypted using a public key and private key mechanism.

Infinitive clause:
  …

Syntactic Features

Relative clauses used significantly more often in the social than the physical sciences [Bonzi, 1990].

Participial clauses used significantly more often in the physical than the social sciences [Bonzi, 1990].

                       Rel Clauses (%)   Participial Clauses (%)
  Air pollution             20.5               29.6
  Infectious disease        25.2               28.1
  Ed administration         29.2               18.0
  Social Psychology         33.0               18.6

Why might this matter?

Not all the arguments to verbs in different types of clauses are explicit, and the same techniques cannot be used to recover them:

  For relative clauses, syntax (attachment) suffices.
  For participial (and main) clauses, more general context must be assessed.

The missing argument could be what answers the question.

Structural features

Document structure variation across genres is greater than variation within genres:

  “inverted pyramid” structure of a news article
  IMRD structure of a scientific article
  step structure of instructions
  ingredients list + step structure of recipes
  SOAP structure of clinical records (and systems structure within the Objective section)

Why might this matter?

Can suggest where to look for information.

In news articles, information that defines terms is more likely to be found near the beginning than the end [Joho and Sanderson 2000].

In scientific articles, position isn’t a good indicator for definitional material [Gabbay & Sutcliffe 2004].

Why might this matter?

If information isn’t in its intended section, one might conclude that it’s false, unnecessary, irrelevant, etc., depending on the function of the section:

  Chocolate chips absent from the ingredients list in a recipe.
  No mention of “irregular heart beat” in the report on the CV system in the Objective section of a SOAP note.

Conclusion

The effectiveness of a given technique for IR, QA, or IE can vary significantly, depending on the corpus it’s applied to.

For search among docs (IR), genre and register are clear factors in user relevance decisions.

For search within docs (QA, IE), they appear significant as well.

In particular, search that is sensitive to genre and register may yield better performance than search that isn’t.

References

S Argamon, M Koppel, G Avneri (1998). Routing documents according to style. First International Workshop on Innovative Information Systems, Boston, MA.

O Bodenreider, S Pakhomov (2003). Exploring adjectival modification in biomedical discourse across two genres. ACL Workshop on Biomedical NLP, Sapporo, Japan.

L Freund, C Clarke, E Toms (2006). Towards genre classification for IR in the workplace. First Symposium on Information Interaction in Context (IIiX).

I Gabbay, R Sutcliffe (2004). A qualitative comparison of scientific and journalistic texts from the perspective of extracting definitions. ACL Workshop on QA in Restricted Domains, Sydney.

M Iwayama, A Fujii, N Kando, Y Marukawa (2003). An empirical study on retrieval models for different document genres: Patents and newspaper articles. SIGIR’03, 251-258.

H Joho, M Sanderson (2000). Retrieving descriptive phrases from large amounts of free text. Proc. 9th Intl Conference on Information and Knowledge Management (CIKM), 180-186.

M Kaisser, B Webber (2007). Question answering based on semantic roles. ACL/EACL Workshop on Deep Linguistic Processing, Prague, CZ.

J Karlgren (1999). Stylistic experiments in information retrieval. In T. Strzalkowski (Ed.), Natural Language Information Retrieval. Dordrecht, The Netherlands: Kluwer.

B Kessler, G Nunberg, H Schütze (1997). Automatic detection of text genre. Proc. 35th Annual Meeting of the Association for Computational Linguistics, 32-38.

P Yeung, L Freund, C Clarke (2007). X-Site: A workplace search tool for software engineers. SIGIR’07 demo, Amsterdam.


Thank you!