Patrick Pantel - New York University

Page 1: Patrick Pantel - New York University

Of Search and Semantics Patrick Pantel

NSF Symposium on Semantic Knowledge Discovery, Organization and Use

November 15, 2008


Vannevar Bush proposes to build a body of knowledge for all mankind: Memex

Bush: "technical difficulties of all sorts have been ignored."


Memex, in the form of a desk, would instantly bring files and material on any subject to the operator’s fingertips.

Semantics captured by an associative trail, personal comments and side trails

Gerald Salton, father of modern search: introduces concepts such as vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), and relevancy feedback
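The slide names these concepts without giving formulas. As a minimal sketch, assuming whitespace tokenization, raw term counts for TF and the log(N/df) variant of IDF (one common choice among several), documents can be compared in a vector space like this. The toy documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents, invented for illustration.
docs = [
    "nikon d300 camera review",
    "buy nikon d300 camera",
    "memex associative trail",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # share three terms
sim_unrelated = cosine(vecs[0], vecs[2])  # share no terms
```

Terms that occur in every document get weight zero under this IDF variant, which is why rare terms dominate the comparison.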


1990: the first search engine Archie, from McGill, matches keyword queries against a database of Web filenames

Yahoo! is founded around a taxonomical organization of the Web… manually

Key innovator: Advanced search features + first SE to allow natural language queries

Key innovator: Semantic models for ranking paid search results

Semantics and Query Logs: Technologies developed/applied during the DARPA Tipster program and TREC make it commercial: spelling correction, query reformulations, also try, …

Future of search lies in a deep understanding and matching of information request behind user queries

Natural language questions answered by editors

Semantic repositories and user-annotated content grow rapidly

Semantic search engines emerge

Search Assist Technology

Smart Snippets

Tapas Kanungo et al.

Aggregate star rating

Example review as summary

Current prices at merchant sites

Images, maps, specs, …

Opportunities:

• Marriage of information extraction, content analysis, and query intent modeling

• User experience design

• Key Technologies
  – IE (information extraction): Entity detection and salience; attribute detection
  – CA (content analysis): Text classification, aboutness, information fusion
  – QIM (query intent modeling): Entity detection, intent/task understanding

Task Modeling

Ana-Maria Popescu

Task Modeling

• What is the user trying to do?
  – Find product reviews of the Nikon D300?
  – Buy a Nikon D300?
  – Find support for her camera?

• UED: enriching the search experience
  – How can we make use of this knowledge?

• Key Technologies
  – Entity detection
  – Document classification
  – Intent modeling / detection

Research Problem: Intent Synonymy

Is battery life a synonym of image quality?

How can we automatically discover intent synonyms?

Review Intent

Transactional Intent

Intent Synonymy

Key Assumption: one intent per session
A searcher’s intent remains the same within a search session

Intent Discovery and Chaining via Clustering
Methodology very similar to constructing a dictionary of distributionally similar words

Intent Synonyms

Review                    Price                   Support
best thin                 where can I get cheap   common problems
rate                      black Friday sale       fault codes
compare kodak vs. canon   christmas sales         installing new
small portable            cheap                   easy use
high zoom                 buy now pay later       operating instructions
rate best                 dell discount           won’t heat
battery life              great deals             best schematics
user comments             overstock.com           keeps shutting
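The one-intent-per-session assumption suggests treating the other phrases in a session as the "context" of a phrase, just as neighbouring words are the context of a word in a distributional thesaurus. The following is only a sketch of that idea, not the method used in the talk: the sessions are invented, the entity name is assumed to be already stripped from each query, and `context_vectors`, `cosine` and `nearest` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

def context_vectors(sessions):
    """Map each phrase to counts of the phrases that share a session with it."""
    vectors = defaultdict(Counter)
    for phrases in sessions:
        for a, b in permutations(set(phrases), 2):
            vectors[a][b] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two Counter vectors."""
    dot = sum(c * v[t] for t, c in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0

def nearest(phrase, vectors, k=3):
    """Return up to k phrases whose session contexts best match `phrase`."""
    target = vectors.get(phrase, Counter())
    scored = [
        (cosine(target, vec), other)
        for other, vec in vectors.items()
        if other != phrase
    ]
    return [other for score, other in sorted(scored, reverse=True)[:k] if score > 0]

# Toy sessions, invented for illustration; one list per search session.
sessions = [
    ["rate", "compare kodak vs. canon", "user comments"],
    ["rate", "user comments", "battery life"],
    ["compare kodak vs. canon", "battery life", "user comments"],
    ["great deals", "black friday sale", "cheap"],
    ["cheap", "great deals", "buy now pay later"],
    ["common problems", "fault codes", "keeps shutting"],
]
vecs = context_vectors(sessions)
review_like = nearest("rate", vecs)
```

On this toy data, "battery life" lands next to "rate" because both co-occur with "user comments" and "compare kodak vs. canon", while price phrases such as "cheap" share no context with it. A clustering step over these similarities would then group phrases into intents.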

Road Ahead

Key Applications    Enabling Technologies

Entity Detection

Attribute Mining

Intent Synonymy

Distributional Similarity

Entailment

Anaphora Resolution

Content Analysis

Text Classification

Dynamic Similarity Modeling

Eric Crestan and Vishnu Vyas

SeeLEx: Seed List Expansion

• SeeLEx, developed at Yahoo!, is a weakly supervised system for mining entity lists based on distributional similarity

Example seed list: Jaguar, Honda, Peugeot, Mercedes, Ford

SeeLEx: Seed List Expansion

[Figure: the seed list (Jaguar, Honda, Peugeot, Mercedes, Ford) surrounded by its expansions: Nissan, Opel, Volkswagen, Porsche, Mazda, Fiat, Hyundai, Lexus, Toyota, Renault, Saab, Subaru; cheetah, lion, puma, caracal; Clinton, Carter, Eisenhower]

[Figure: the same expansion with a callout on one term. What is a caracal? endangered, carnivore, fast, hunted, …]

Questions Asked and Conclusions

• What is the effect of corpus size on expansion accuracy?
  – Significant boost in performance

• What is the effect of corpus quality on expansion accuracy?
  – Significant boost in performance

• Does seed quality matter?
  – Great variance in performance depending on seed set composition

• How many seeds are on average optimal for expansion accuracy, and are more seeds better than fewer?
  – Somewhat surprising: 5-20 seeds in general are sufficient; more seeds gain little but don’t hurt

Gold Standard Sets

Archbishops of Canterbury, Astronomers, Australian Airlines, Australian A-League Football Teams, Australian Cities, Australian Prime Ministers, Best Actress Academy Award Winners, Biology Disciplines, Bottled Water Brands, Boxing Weight Classes, California Counties, Canadian Stadiums, Canadian Universities, Charitable Foundations, Classical Pianists, Cocktails

Cognitive Scientists, Composers, Countries, Electronic Companies, Elements, English Cities, English Poets, English Premier Football Clubs, First Ladies, Formula One Drivers, French Artists, Greek Gods, Greek Islands, Irish Theatres, Italian Regions, Japanese Martial Arts, Japanese Prefectures

Male Tennis Players, Maryland Counties, New Zealand Songwriters, NHL Hockey Teams, North American Mountain Ranges, Presidents of Argentina, Rivers in England, Roman Emperors, Russian Authors, Spanish Provinces, Stars, Superheroes, Texas Counties, U.S. Army Generals, U.S. Federal Departments, U.S. Internet Companies

Average of 208 instances
Minimum of 11
Maximum of 1116
Total of 10,377 instances

50 sets extracted from Wikipedia (2007/12)

Corpora

Table: Corpora used to build our expansion models.

CORPORA     UNIQUE SENTENCES (MILLIONS)   TOKENS (MILLIONS)   UNIQUE WORDS (MILLIONS)
k100        5,201                         217,940             542
k020†       1,040                         43,588              108
k004†       208                           8,717               22
Wikipedia   30                            721                 34

†Estimated from k100 statistics.

The Effect of: Corpus Size and Corpus Quality

System and Corpora Analysis (Precision vs. Recall)

[Figure: precision-recall curves for full.k100, full.k020, full.k004 and full.wikipedia; axes: Precision (x) and Recall (y), both from 0 to 1]

Takeaway: Corpus Size Matters
• k100 yields 13% higher R-precision than 1/5th its size
• k100 yields 53% higher R-precision than 1/25th its size

Takeaway: Corpus Quality Matters
• Wikipedia yields similar performance to a web corpus 60 times its size

Opportunity: Model a more natural SeeLEx usage
• What is the performance of the system on typical sets searched by editors?
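The takeaways above are stated in R-precision. Under its standard definition, R-precision is the precision of a ranked list at rank R, where R is the number of items in the gold standard set. A minimal sketch, with an invented ranked expansion and gold set:

```python
def r_precision(ranked, gold):
    """Precision at rank R, where R is the size of the gold standard set."""
    r = len(gold)
    if r == 0:
        return 0.0
    # Count how many of the top R ranked items are in the gold set.
    hits = sum(1 for item in ranked[:r] if item in gold)
    return hits / r

# Invented example: a ranked expansion scored against a three-item gold set.
ranked = ["Nissan", "puma", "Toyota", "Fiat"]
gold = {"Nissan", "Toyota", "Fiat"}
score = r_precision(ranked, gold)  # top 3 contain 2 gold items
```

Because the cutoff equals the gold set size, precision and recall coincide at that rank, which makes the measure convenient when gold sets vary in size from 11 to 1116 instances.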

k100: List Effect (Precision vs. Recall)

[Figure: precision-recall curves per gold standard set on k100; axes: Precision (x) and Recall (y), both from 0 to 1]

Opportunity: Bucket sets to find predictable behaviors
• Our gold standard sets vary significantly in their expansion performance
• Look into predictability of open vs. closed class sets, large vs. small class sets, and types of sets such as locations, people, …

The Effect of: Seed Selection

k100: Seed Selection Effect (Precision vs. Recall)

[Figure: precision-recall curves for full.k100, s010 and the series s010.t01 through s010.t30; axes: Precision (x) and Recall (y), both from 0 to 1]

Takeaway: Seed set composition greatly affects performance
• Best performing seed set had 42% higher precision and 39% higher recall than the worst performing seed set

Opportunity: Reject seeds and/or ask for more
• We are investigating ways to automatically detect which seed elements are better than others in order to reduce the impact of seed selection effect
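The series names s010.t01 through s010.t30 suggest 30 trials of 10 seeds each, though the slides do not say how the seeds were drawn. Assuming uniform random sampling from a gold standard set, the seed selection effect could be measured as below. `toy_score` is a hypothetical stand-in for running the expander and scoring its output, and the data is invented.

```python
import random
import statistics

def sample_seed_sets(gold, seed_size=10, n_trials=30, rng_seed=0):
    """Draw n_trials random seed sets of seed_size items each from a gold set."""
    rng = random.Random(rng_seed)
    pool = sorted(gold)  # sorted so a given rng_seed always gives the same draws
    return [rng.sample(pool, seed_size) for _ in range(n_trials)]

def summarize(scores):
    """Spread of a per-trial score across all trials."""
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
    }

def toy_score(seeds, ambiguous):
    """Hypothetical scorer: a seed set scores lower the more ambiguous seeds it has."""
    return 1.0 - sum(1 for s in seeds if s in ambiguous) / len(seeds)

# Invented gold set of 50 items, 10 of which are marked ambiguous.
gold = {f"item{i:02d}" for i in range(50)}
ambiguous = {f"item{i:02d}" for i in range(10)}
seed_sets = sample_seed_sets(gold)
stats = summarize([toy_score(seeds, ambiguous) for seeds in seed_sets])
```

The gap between `stats["max"]` and `stats["min"]` is the kind of best-versus-worst spread the takeaway reports; with a real expander, `toy_score` would be replaced by precision or recall of the resulting expansion.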

The Effect of: Seed Size

Rate of New Correct Expansions vs. Seed Size

[Figure: rate of new correct expansions (y, 0 to 3) against seed size (x, 0 to 200)]

Takeaway: Small numbers of seeds are sufficient to fully saturate the distributional similarity model
• After 10-20 seeds, more seeds yield few new correct instances in the expansion

Seed Size vs. % of Errors

[Figure: % of errors (y, 0 to 1) against seed size (x, 0 to 200)]

Takeaway: Although bad seeds may be less desirable in a seed set than others, adding them does not seem to degrade performance
• Percentage of errors does not increase with increased seed set size

Questions Asked and Conclusions

• What is the effect of corpus size on expansion accuracy?
  – Significant boost in performance

• What is the effect of corpus quality on expansion accuracy?
  – Significant boost in performance

• Does seed quality matter?
  – Great variance in performance depending on seed set composition

• How many seeds are on average optimal for expansion accuracy, and are more seeds better than fewer?
  – Somewhat surprising: 5-20 seeds in general are sufficient; more seeds gain little but don’t hurt
