Why is Computational Linguistics Not More Used in Search Engine Technology?

John Tait

University of Sunderland, UK

Contents of Talk

• Introduction
  – Search Engines
  – Computational Linguistics

• Three Questions

• Research Agenda

• Conclusions

Introduction

Origins

• TREC has been running since 1992
  – Only two systems using CL techniques (Strzalkowski et al. in the 1990s and, more recently, Stokoe, Oakes and Tait) have ever shown an improvement on the standard search engine task

• Performance on most tasks is improved by using more information
  – Surely dictionaries, grammars, semantics should help ?

• ACL/COLING CLIIR Workshop in Sydney

Search Engine Process

[Diagram of the components: Web, Crawler, Index (Signature, Data), Query Engine, Searcher]

Index

• Compressed and abstracted form of data (web pages) allowing rapid access to some part of that data
  – Simple version
    • Maps key words (signature) to URLs (data)
  – Real systems
    • Compressed vector of weighted terms (signature) to URLs plus snippet generation support (+ ads etc.)
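
A minimal sketch of the two index variants on this slide may make the distinction concrete. The page texts, the URLs and the function names `build_simple_index` and `build_weighted_index` are invented for illustration, and raw term frequency is only an assumed stand-in for the weights a real system would use; compression and snippet support are left out.

```python
from collections import Counter, defaultdict

# Hypothetical pages: URL (data) -> text from which the signature is built.
PAGES = {
    "http://example.org/sailing": "main head design for racing sail",
    "http://example.org/anatomy": "head and neck anatomy",
}

def build_simple_index(pages):
    """Simple version: map each key word (signature) to the URLs (data) containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def build_weighted_index(pages):
    """Closer to a real system: each URL gets a vector of weighted terms.

    Raw term frequency is an assumed, deliberately crude weight.
    """
    return {url: Counter(text.lower().split()) for url, text in pages.items()}

simple = build_simple_index(PAGES)
weighted = build_weighted_index(PAGES)
print(sorted(simple["head"]))
print(weighted["http://example.org/sailing"]["sail"])
```

Looking up `simple["head"]` returns both pages, which is exactly the ambiguity the later "main head design" example turns on.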

Crawler

• Continuously in the background
  – Moves over web pages accumulating
    • Signature data
      – term data
      – metadata
      – links
        • URLs, URIs
        • etc.
  – Updates the index

Query Engine

• User types in a query
  – Usually short list of key words

• Organizes results
  – “Best” pages first
  – Summary to judge page relevance
  – Clickable links
  – Relevant ads
  – Etc.
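
The query-engine steps above can be sketched as a toy function. Everything here is an assumption made for illustration: the pages, the function name `search`, and the scoring rule (counting distinct query key words found in a page), which stands in for a real ranking function.

```python
# Hypothetical pages; a real engine would read these from its index.
PAGES = {
    "http://example.org/sailing": "Main head design for a racing sail explained step by step",
    "http://example.org/anatomy": "Anatomy of the human head and neck",
    "http://example.org/cooking": "Quick recipes for busy evenings",
}

def search(query, pages, snippet_words=6):
    """Return (url, score, snippet) tuples, "best" pages first.

    The score is the number of distinct query key words found in the page,
    an assumed stand-in for a real ranking function.
    """
    terms = set(query.lower().split())
    results = []
    for url, text in pages.items():
        words = set(text.lower().split())
        score = len(terms & words)
        if score > 0:
            # The first few words serve as a summary for judging relevance.
            snippet = " ".join(text.split()[:snippet_words]) + " ..."
            results.append((url, score, snippet))
    # Sort by descending score; ties are broken by URL so the order is stable.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results

for url, score, snippet in search("main head design", PAGES):
    print(score, url, "|", snippet)
```

With the short query "main head design", the anatomy page is still returned because it shares the key word "head"; nothing in the matching step knows which sense was meant.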

IR

It’s all about matching documents and queries

Myths about Statistical IR

I.e. Search Engines

Myths

1. It doesn’t work
   • Google ?

2. It’s a dead subject
   • 40% improvement since TREC began in 1992
   • Recent progress with e.g. Language Modeling and Continuous Relevance Models

3. IR people don’t know/care about language
   • Karen Sparck Jones
   • much early work
   • Strzalkowski et al. …

Computational Linguistics

What is it ?

Characteristics of CL

• Assumes existence of
  – Dictionary
  – Grammar
  – Semantics

• Independent of task

• Dependent on word meanings

• Arrived at through composition of words in sentences

Statistical CL

• Often
  – Aims to make immediate progress with practical tasks
  – Minimizes assumptions about language

• But still shares the common assumptions of CL

IR/Search Engines

Only care about the task

Characteristics of CL

• Assumes existence of
  – Dictionary
  – Grammar
  – Semantics

• Independent of task

• Dependent on word meanings

• Arrived at through composition of words in sentences

Three Questions about

Why Search Engines don’t use Computational Linguistics

Disclaimer

• Question Answering
  – QA systems use CL
  – Do search engines use QA now ?
    • Ask Jeeves ???
  – Will they in the future ?

• Will we ever get general/casual users to type long questions ?

• It has long been known that long queries work well, but they are rarely used

Three Questions

• Are Computational Linguistic Techniques too inaccurate to improve Search Engines?

• Is the Search Engine Task formed in some way which makes CL techniques ineffective?

• Does statistical information retrieval in fact capture the relevant properties of language but in a form which is inaccessible or hidden?

CL too inaccurate ?

• Long Version
  – Is the problem that computational linguistic techniques are too unreliable or narrowly applicable, so improved performance on some documents or queries is masked by worse performance on others?

Example of Problem

• Query “wants” an unusual word sense
  – “main head design”

• Topic “yachting”: “head” means “head of sail”

• Irrelevant retrieved document 1 had a signature generated off an inaccurate word sense (“body part”)
  – CL eliminates

• Irrelevant not retrieved document 2
  – Word Sense Disambiguation inaccurate
    • Added to relevant set
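
One possible reading of this example can be sketched as a toy experiment. The three documents, their sense tags, the cut-off `k=2` and the function names are all invented for illustration; the slide gives only the query and the two failure cases. The baseline ranks by shared key words, while the CL variant keeps only documents whose "head" was tagged with the sail sense.

```python
# Toy collection. "relevant" is ground truth for the yachting query; "sense" is
# the sense a hypothetical WSD component assigned to "head" in that document.
DOCS = {
    "relevant_doc": {"text": "main head design for a sail", "sense": "sail", "relevant": True},
    "doc1": {"text": "main head design of a crash helmet", "sense": "body part", "relevant": False},
    # Assumed WSD error: this "head" is not a sail, but the tagger says it is.
    "doc2": {"text": "head office opening hours", "sense": "sail", "relevant": False},
}
QUERY = ["main", "head", "design"]
QUERY_SENSE = "sail"  # the unusual sense the query "wants"

def keyword_search(docs, query, k=2):
    """Baseline: rank by number of shared key words, keep the top k."""
    ranked = sorted(docs, key=lambda d: -len(set(query) & set(docs[d]["text"].split())))
    return ranked[:k]

def sense_search(docs, query_sense):
    """CL variant: keep only documents whose "head" carries the query's sense."""
    return [d for d in docs if docs[d]["sense"] == query_sense]

def precision(retrieved, docs):
    """Fraction of retrieved documents that are truly relevant."""
    return sum(docs[d]["relevant"] for d in retrieved) / len(retrieved)

baseline = keyword_search(DOCS, QUERY)
with_wsd = sense_search(DOCS, QUERY_SENSE)
print(baseline, precision(baseline, DOCS))
print(with_wsd, precision(with_wsd, DOCS))
```

In this toy setting the sense-based run drops document 1 but admits document 2, so precision is the same in both runs: the gain from one correct disambiguation is masked by the loss from one incorrect one.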

CL too inaccurate ?

• But best systems do >97% on test data - like Penn Treebank ?
  – Overfitted on very sparse data?
  – Don’t do anything like as well on unseen data
  – Especially bad at unseen noun phrases - very common search terms

What CL should do

• Stop working on pitifully small samples:
  – IR researchers consider 18 gigabytes too small for real statistical significance

• Ensure you include overfitting protection in your methodology
  – Always test against genuinely unseen data

• Don’t simplify the data
  – But do use “hacks” to make it tractable

Search task not match CL

• Is the conventional information retrieval task formulated in a way which prevents or obstructs computational linguistics contributing?
  – Short queries
    • Not sentences, running text
  – Short ranked lists of highly relevant documents
  – Predetermined document signatures

Search Task not match CL ?

• CL allows the extraction of structural signatures

1. Bracketing is combinatoric
   – Effect on index size

2. Most queries too short to get structure
   – Remember it’s matching queries and documents (signatures)

3. Many queries too short to disambiguate
   – Really ??? Co-occurrence
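
The first point can be made concrete with a small calculation. This sketch assumes "bracketing" means the possible binary bracketings of a word sequence, which the slide does not spell out; under that assumption the count for n words is the Catalan number C(n-1). The function name `bracketings` is made up for the example.

```python
from math import comb

def bracketings(n_words):
    """Number of distinct full binary bracketings of a sequence of n_words items.

    Under the binary-bracketing assumption this is the Catalan number
    C(m) = comb(2m, m) / (m + 1) with m = n_words - 1.
    """
    if n_words < 1:
        raise ValueError("need at least one word")
    m = n_words - 1
    return comb(2 * m, m) // (m + 1)

for n in (2, 3, 4, 5, 10):
    print(n, bracketings(n))
```

A three-word phrase has only 2 bracketings, but a ten-word phrase already has 4862, which suggests why indexing every candidate structure would inflate the index.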

What CL should do

• Focus on Word Sense Disambiguation
  – Accept the dictionary is more important than grammar
  – Accept proper names/named entities are at least as important as common words

• Focus on chunking/triple/phrase extraction
  – Full parsing will only ever help as an intermediate step

IR Captures Relevant Properties

• Long Version
  – Does statistical information retrieval in fact capture the relevant properties of language but in a form which is inaccessible or hidden?
  – Just like many machine learning techniques

IR captures relevant properties?

• Could be ?
  – Success of corpus linguistics
  – Success of data driven and Machine Learning approaches
    • E.g. Statistical MT
    • E.g. Textual Entailment

What CL should do

• Treat the question of whether, and what, IR term weighting algorithms like BM25 capture about language as a legitimate research topic
  – Observation: BM25 looks very like some Machine Learning generated formulae
    • Hardly surprising as BM25 was derived by optimisation over a very large corpus
    • Like the Porter Stemmer before it

• Consider whether and to what extent division into dictionary, syntax, semantics is “real”
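
For readers who have not seen BM25 written out, here is a minimal sketch of one commonly used form of the Okapi BM25 weighting. The parameter values `k1=1.2` and `b=0.75`, the smoothed IDF variant, the toy documents and the function name `bm25_scores` are assumptions chosen for illustration, not details taken from the talk.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query` with Okapi BM25.

    k1 controls term-frequency saturation and b controls length normalisation;
    both defaults are conventional choices assumed for this sketch.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that never goes negative (an assumption).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "main head design for a racing sail".split(),
    "head and neck anatomy".split(),
    "quick recipes for busy evenings".split(),
]
scores = bm25_scores("main head design".split(), docs)
print([round(s, 3) for s in scores])
```

The formula uses only term counts, document counts and document lengths. Whatever it captures about language is implicit in those statistics, which is the point of the research question above.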

Some more questions

• Are assumptions made in computational linguistics about the nature of lexical semantics and the structural properties of well formed running text in some way ill founded, at least for the information retrieval task?

• Is there some specific property of language (for example semantic redundancy or one topic per document) which means that the relatively crude statistical techniques capture enough information to obtain the available improvements in performance?

Lessons

• CL has much to learn from IR
  – Having a task changes the game
    • Allows the development of effective experimental methodology
    • Effective solutions to task problems become the focus
      – Which might in turn stimulate non-task based research

Lessons 2

• CL for IR
  – Needs to work on better document signatures
    • Small, compressible, characteristics of documents
      – Word sense identifiers
      – Triples
        • Noun verb/prep Noun
        • Chunks
  – Accept probability

Lesson 3

• Show document structure is useful for determining relevance
  – Are sentences useful
    • So can parse trees be useful
  – Human centred evaluation
  – Paragraphs ??
  – Whole Documents ???

Conclusions

• IR can benefit from Computational Linguistics techniques
  – But CL research needs to focus on the relevant problems

• CL can benefit greatly from trying to get acceptance in IR
  – Focussed task
  – Think of statistical MT

Job Ad

• Postdoc positions in multimedia retrieval available in Sunderland

• Search for Sunderland IR Group on the Web

• See:
  – http://my.sunderland.ac.uk/web/services/hr/recruitment/
  – Search for VITALAS

• Email me:
  – [email protected]