Research update – Language modeling, n-grams and search engines

David Johnson
UTas – September 2005

Supervisors: Dr Vishv Malhotra and Dr Peter Vamplew



Page 2:

Overview

► AISAT 2004
► SWIRL 2004
► Language modeling
► A New Information Retrieval Tool
► Questions

Page 3:

AISAT 2004

► AISAT 2004 – The 2nd International Conference on Artificial Intelligence in Science and Technology
► November 2004, UTAS, Hobart
► Hosted by the School of Engineering

Page 4:

AISAT 2004

► Johnson DG, Malhotra VM, Vamplew PW, Patro S, “Refining Search Queries from Examples Using Boolean Expressions and Latent Semantic Analysis”
► Extension of prior research (Vishv M. Malhotra, Sunanda Patro, David Johnson: Synthesize Web Queries: Search the Web by Examples. ICEIS 2005) – the contribution of this paper was investigating the application of LSA to improve the refined queries
► This is only a brief overview – the full paper is available from the UTAS Eprints Repository

Page 5:

AISAT 2004

► Consider the problems facing a web searcher looking for information in an unfamiliar domain:
   Lack of knowledge of domain-specific terms, jargon and concepts
   Difficulty rating the importance of newly discovered terms in targeting relevant information
   Lack of understanding about how extra terms and changing the structure of the query will affect the search results

Page 6:

AISAT 2004

► This often leads to problems with the resulting query:
   Poor recall – most of the relevant documents are not located
   Poor precision – many of the retrieved documents are not relevant
► Frustration often results
   MSN-Harris Survey (August 2004) reports “there is a significant minority -- 29 percent [of search engine users] -- who only sometimes or rarely find what they want” (http://www.microsoft.com/presspass/press/2004/aug04/08-02searchpollpr.mspx)

Page 7:

AISAT 2004

► However – it is usually relatively easy for a searcher to classify some example documents (from an initial search) as relevant or not
► The text from these documents can then be analyzed to build a better query – one that will select more relevant and fewer irrelevant documents

Page 8:

AISAT 2004

[Flow diagram] Naïve Query → Relevant Documents / Irrelevant Documents → Query Enhancement Algorithms → Refined Query → Resubmit to Web Search Engine

Page 9:

AISAT 2004

► The new query is initially built in conjunctive normal form (CNF)
   Original query: (a OR b OR …) AND (a OR c OR …) AND …
   Each maxterm is chosen to select all documents in set “relevant” and reject as many of the documents from “irrelevant” as possible
   The CNF expression (i.e. the conjunction of all maxterms) is chosen to reject all documents from “irrelevant”
► In order to minimise the size of the CNF expression, terms with high selective potential must be used
► In some cases further optimization was required – for instance (at the time) Google would only accept queries of up to 10 keywords in length

Page 10:

AISAT 2004

The potential of a candidate term, t, in a partially constructed maxterm is calculated as:

   Potential(t) = (|TRt| × (|TIR| − |TIRt|)) / ((|TR| − |TRt| + 1) × (|TIRt| + 1))

TR = relevant documents not yet selected
TIR = irrelevant docs still selected by conjunction of prior maxterms
TRt = new relevant documents selected by term t
TIRt = irrelevant documents from TIR selected by term t
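A rough Python sketch of how a greedy maxterm builder could use this kind of potential score follows. The scoring formula and the data structures (`term_docs` mapping each candidate term to the set of document ids it selects) are illustrative reconstructions, not the paper's actual code.

```python
# Hypothetical sketch: greedily OR terms into a maxterm, scoring each
# candidate by how many still-unselected relevant docs it adds versus
# how many still-matching irrelevant docs it keeps selected.

def potential(t, term_docs, TR, TIR):
    """Score term t against remaining relevant (TR) and irrelevant (TIR) docs."""
    TRt = term_docs[t] & TR          # new relevant docs t would select
    TIRt = term_docs[t] & TIR        # irrelevant docs t would select
    return (len(TRt) * (len(TIR) - len(TIRt))) / (
        (len(TR) - len(TRt) + 1) * (len(TIRt) + 1))

def build_maxterm(term_docs, relevant, irrelevant):
    """Greedily OR terms together until every relevant doc is selected."""
    TR, maxterm, selected = set(relevant), [], set()
    while TR:
        best = max(term_docs, key=lambda t: potential(t, term_docs, TR, irrelevant))
        if not (term_docs[best] & TR):
            break                    # no term covers the remaining relevant docs
        maxterm.append(best)
        selected |= term_docs[best]
        TR -= term_docs[best]
    return maxterm, selected

terms = {"a": {1, 2}, "b": {2, 3, 10}, "c": {3}}
mt, sel = build_maxterm(terms, relevant={1, 2, 3}, irrelevant={10, 11})
```

On this toy data the builder prefers "a" and "c" over "b", because "b" also selects irrelevant document 10.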

Page 11:

AISAT 2004

► The Potential(t) function enabled the formulation of effective queries, but they were often counterintuitive, e.g. searching for information related to the Sun (as a star)
► The synthesised query was:

   sun AND (solar AND (recent OR (field AND tour)) OR (lower AND home) OR (million AND core))

► This was felt to be at least partially due to overtraining on the relatively small sets of example documents – but could we improve the generated queries?

Page 12:

AISAT 2004

► Latent Semantic Analysis (LSA) was investigated as a technique with the potential to help in selecting more meaningful terms
► It is “A theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text” (Landauer and Dumais, 1997)
► It uses no humanly constructed dictionaries or knowledge bases, yet has been shown to overlap with human scores on some language-based judgment tasks

Page 13:

AISAT 2004

► LSA can help overcome the problems of polysemy (same word, different meaning) and synonymy (different word, same meaning) by associating words with abstract concepts via its analysis of word usage patterns
► It can be very computationally intensive (requiring the singular value decomposition (SVD) of a large matrix), but for the small collections of relevant / irrelevant documents we are analysing it can be performed quickly (< 1 second CPU time)
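As a toy illustration of the LSA mechanics described here, the sketch below builds a small term-document count matrix, truncates its SVD to a few latent concepts, and compares terms in the reduced space. The data and the chosen rank are invented for illustration; they are not from the paper.

```python
import numpy as np

# Minimal LSA sketch: rows are terms, columns are documents (toy counts).
terms = ["sun", "solar", "star", "recipe"]
A = np.array([[2, 3, 0],
              [1, 2, 0],
              [2, 1, 0],
              [0, 0, 3]], dtype=float)

# Rank-k truncated SVD: keep only the top-k latent "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]        # term coordinates in concept space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_related = cosine(term_vecs[0], term_vecs[1])    # "sun" vs "solar": high
sim_unrelated = cosine(term_vecs[0], term_vecs[3])  # "sun" vs "recipe": near zero
```

Because "sun" and "solar" co-occur in the same documents they land close together in concept space, while "recipe" ends up in its own concept; this is the word-usage-pattern association the slide describes.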

Page 14:

AISAT 2004

► The term weighting was adjusted to account for the LSA weighting as follows:

   Weight(t) = Potential(t) * ((Normalised LSA weight in set “relevant”) – (Normalised LSA weight in set “irrelevant”))

► During the experimentation that led to the development of this weighting scheme, we noted that not allowing the Potential(t) values sufficient weight resulted in very long queries

Note: The Potential(t) value is recalculated after each term is selected, but the LSA values are based on the entire document sets

Page 15:

AISAT 2004

► Results – Example 1
   Naïve query “elephants” – searching for information suitable for a high school project
   7 out of first 20 documents, and around 30% of all initial documents from the naïve query, were judged relevant
   The initial number of relevant documents was quite high, reflecting the reasonably broad nature of the information need

Page 16:

AISAT 2004

► Results – Example 1
   Refined query – no LSA:
   elephant AND (forests OR lived OR calves OR poachers OR maximus OR threaten OR electric OR tel OR kudu)
   Refined query – with LSA:
   elephant AND (climate OR habitats OR grasslands OR quantities OR africana OR threaten OR insects OR electric OR kudu)
► LSA did help in the selection of “intuitive” words, although not as much as hoped – it avoided “lived” and “tel” (abbreviation for telephone no.), although it still selected “quantities”

Page 17:

AISAT 2004

► Results – Example 2
   Naïve query “mushrooms” – searching for information on growth, lifecycle and structure of mushrooms
   NOT interested in recipes, magic mushrooms, or identifying wild edible mushrooms
   3 of first 20 documents and 13% of initial documents relevant

Page 18:

AISAT 2004

► Results – Example 2
   Refined query – no LSA:
   mushrooms AND (cylindrical OR ancestors OR hyphae OR cellulose OR issued OR hydrogen OR developing OR putting)
   Refined query – with LSA:
   mushrooms AND (ascospores OR hyphae OR itis OR peroxide OR discharge OR developing OR pulled OR jean)
► In this example the LSA query has identified some additional technical terms, although still including the quite general term “pulled”. Note that the selection of the term “jean” was due to “Jean” being the first name of several authors mentioned in relevant documents

ITIS = Integrated Taxonomic Information System

Page 19:

Refining Web Queries – Precision in 1st 20 Documents

             Original Query   Refined   Refined with LSA
   Elephant  0.35             0.95      0.95
   Mushrooms 0.15             0.85      0.85

Page 20:

Percentage of Relevant Documents (150 retrieved)

             Original Query   Refined   Refined with LSA
   Elephant  30%              76%       81%
   Mushrooms 13%              57%       61%

Page 21:

AISAT 2004

► Conclusions
   The refined search queries show significant improvement over the original naïve queries, without the need for detailed domain knowledge
   LSA did assist in the formulation of more meaningful Boolean web queries
   LSA did not significantly improve retrieval performance compared to the original query enhancement algorithm

Page 22:

AISAT 2004

► Further Comments (not in original presentation/paper):
   A criticism of this approach is that the requirement for the user to review and classify several documents requires quite a bit of effort – were we really helping the user or not?
   While it is our assertion that the review / classification process required is not very time consuming, we have tried to address this criticism by taking a different approach in current work

Page 23:

SWIRL 2004

► Strategic Workshop on Information Retrieval in Lorne, December 2004
► Organized by Justin Zobel from RMIT and Alistair Moffat from the University of Melbourne
► Funded by the Australian Academy of Technological Sciences and Engineering under their “Frontiers of Science and Technology Missions and Workshops” program
► 17 international and 13 Australian researchers plus 8 research students
► Included researchers from industry and government as well as academia (Microsoft, NIST*, CSIRO)

(* NIST = National Institute of Standards and Technology, a US Government research organization)

Page 24:

SWIRL 2004

► It was a discussion-based residential workshop
► “The aim for the workshop was to try and define what we know (and also don't know) about Information Retrieval, examining past work to identify fundamental contributions, challenges, and turning points; and then to examine possible future research directions, including possible joint projects. That is, our goal was to pause for a few minutes, reflect on the lessons of past research, and reconsider what questions are important and which research might lead to genuine advances.”*

* From the SWIRL 2004 web site – http://www.cs.mu.oz.au/~alistair/swirl2004/

Page 25:

SWIRL 2004

► “Attendees included researchers responsible for many of the key innovations in web searching and the implementation of effective and efficient information retrieval systems.”

John Tait (University of Sunderland)
Dave Harper (The Robert Gordon University, Scotland)
Bill Hersh, M.D. (Oregon Health & Science University)
Robert Dale (Macquarie University)
Kal Järvelin (University of Tampere, Finland)
Andrew Turpin (University of Melbourne)
Bruce Croft (University of Massachusetts, Amherst)
Ross Wilkinson (CSIRO)
Alistair Moffat (University of Melbourne)
Justin Zobel (RMIT)
Jamie Callan (Carnegie Mellon University)
David Hawking (CSIRO)

Page 26:

SWIRL 2004

► Lorne Beach during the site visit (Winter)

Page 27:

SWIRL 2004

► Lorne Beach when we arrived (Summer)

Page 28:

SWIRL 2004

► Program – Day 1
► Travel from Melbourne to Lorne (group bus)
► Keynote Presentation
   The IR Landscape – Bruce Croft
► Presentation
   Adventures in IR evaluation – Ellen Voorhees
► Group Discussion
   IR – where are we, how did we get here, and where might we go? – Mark Sanderson
► Workshop Dinner

Page 29:

SWIRL 2004

► Program – Day 2
► Group Discussion
   What motivates research – Justin Zobel / Alistair Moffat
► Small Group Discussion
   Challenges in information retrieval and language modeling – David Harper, Phil Vines and David Johnson: Challenges in Contextual Retrieval, Challenges in Metasearch, Challenges in Cross Language Information Retrieval (CLIR)
► Group Discussion
   Important papers for new research students to be aware of – David Hawking
► Small Group Project
   How to spend $1M per year for five years – Ross Wilkinson

Page 30:

SWIRL 2004

► Program – Day 3
► Presentations from Group Projects
   Mark Sanderson, Ellen Voorhees, David Johnson, et al – Experimental Educational Search Engine – Targeting information needs for upper primary to grade 10
► Group Discussion
   Where to now with SIGIR? – Jamie Callan
► Group Discussion
   Writing the ideal SIGIR 2005 paper – Susan Dumais
► Return to Melbourne
   Via Port Campbell National Park and Great Ocean Road

Page 31:

SWIRL 2004

► Summary
   An excellent opportunity to meet and talk to many respected IR researchers
   Provided guidance on the current “state of the art” and ideas for the future direction of my research
   Provided many useful contacts

Page 32:

Language Modeling – Introduction

► The seminal paper for language modeling in modern IR: “A Language Modeling Approach to Information Retrieval”, Jay Ponte and Bruce Croft, 1998
► Like LSA, it can deal with the problems of polysemy (same word, different meaning) and synonymy (different word, same meaning)
► Unlike LSA, it has a firm theoretical underpinning
► Less computationally demanding than LSA – particularly important for large document collections

Page 33:

Language Modeling – Introduction

► Documents and queries are considered to be generated stochastically using their underlying language model
► The document in a collection that is considered most relevant to a given query is the one with the highest probability of generating the query from its language model
► The assumption of “word order independence” is usually made to make the model mathematically tractable, although bi-grams, tri-grams, etc. can be accommodated

Page 34:

Language Modeling – Introduction

► A language model of a document is a function that, for any given word, calculates the probability of that word appearing in the document
► The probability of a phrase (or query) is calculated by multiplying together the individual word probabilities (applying the word order independence assumption)
► All we need is a method to estimate the language models of the documents!
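The ranking idea on this slide can be sketched as a minimal unigram query-likelihood scorer: each document's language model is its smoothed word distribution, and documents are ranked by the product of per-word query probabilities. The Laplace smoothing and toy documents below are illustrative stand-ins, not Ponte and Croft's estimator.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, vocab_size, alpha=1.0):
    """Log P(query | M_d) under a unigram model with Laplace smoothing."""
    counts = Counter(doc)
    total = len(doc)
    score = 0.0
    for w in query:
        p = (counts[w] + alpha) / (total + alpha * vocab_size)
        score += math.log(p)      # sum of logs = log of the product
    return score

docs = {
    "d1": "the sun is a star the sun emits light".split(),
    "d2": "mushrooms grow in damp forests".split(),
}
vocab = {w for d in docs.values() for w in d}
query = "sun star".split()

# Rank documents by how likely their model is to generate the query.
ranked = sorted(docs, key=lambda d: query_log_likelihood(query, docs[d], len(vocab)),
                reverse=True)
```

Working in log space avoids underflow when multiplying many small probabilities, which matters for longer queries.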

Page 35:

Language Modeling – Model Estimation

► Obviously the document itself is the primary data source, but
   If a word, w1, doesn’t appear in a document, should we have P(w1|Md) = 0? (in other words, meaning that it is impossible for language model Md to generate word w1)
   If word w2 appears 5 times in a 1,000 word document, is P(w2|Md) = 0.005 really a reasonable estimate?
► It is important to smooth the language model to overcome problems caused by lack of training data

Page 36: Research update – Language modeling, n-grams and search engines David Johnson UTas – September 2005 Supervisors: Dr Vishv Malhotra and Dr Peter Vamplew

36

Language Modeling – Model Language Modeling – Model EstimationEstimation

► There are a number of smoothing methods in use, the There are a number of smoothing methods in use, the basic premise being that some of the “probability basic premise being that some of the “probability mass” in the model should be taken from the observed mass” in the model should be taken from the observed data to be used as an estimate for unseen wordsdata to be used as an estimate for unseen words

► They take into account the “Zipf’s law” nature of word usage (a few words used many times, many words used infrequently) to improve estimates – for instance the Good-Turing algorithm

► Corpus or general English word usage data may also be used to augment the document data, but care has to be taken. The Ponte-Croft paper addresses this problem by calculating a risk-adjustment factor, based on the difference between corpus and document word usage
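Good-Turing and the Ponte-Croft risk adjustment are more involved, but the basic idea of mixing document and background-corpus estimates can be sketched with simple linear interpolation (Jelinek-Mercer smoothing); the mixing weight `lam` is an assumed free parameter, not a value from this work:

```python
def smoothed_prob(word, doc_counts, doc_len, corpus_counts, corpus_len, lam=0.5):
    """Mix the document estimate with a corpus (background) estimate so
    that words unseen in the document keep a non-zero probability."""
    p_doc = doc_counts.get(word, 0) / doc_len
    p_bg = corpus_counts.get(word, 0) / corpus_len
    return lam * p_doc + (1 - lam) * p_bg
```

With this estimate in place of the raw count ratio, the query-likelihood product no longer collapses to zero on a single unseen word.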


Language Modeling – Model Estimation

► Zipf’s law
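Zipf’s law says that a word’s frequency is roughly inversely proportional to its frequency rank, so the product rank × frequency stays roughly constant. A tiny illustrative check (our own sketch, not from the deck):

```python
def rank_frequency_products(freqs):
    """For frequencies sorted in descending order, return rank * frequency
    at each rank; under an ideal Zipf distribution these are all equal."""
    ranked = sorted(freqs, reverse=True)
    return [rank * f for rank, f in enumerate(ranked, start=1)]
```

Real corpora only approximate this, but the near-constant products are what smoothing methods such as Good-Turing exploit.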


Language Modeling – Results

► A basic implementation of the Ponte-Croft method proved very useful in quickly locating relevant documents in small local document collections

► It is planned to implement an improved version as part of the information retrieval tool that is currently under development – discussed next


A New Information Retrieval Tool

► Goals
 Assist the user to get required information from an ad-hoc web search more quickly and effectively
 Data driven to allow quick access to key parts of retrieved documents
 In some cases provide answers to questions directly without the user needing to view documents
 Assist in refining web queries
 Use language modeling techniques locally on retrieved documents to satisfy more complex information needs – i.e. those not able to be expressed adequately in a web query


A New Information Retrieval Tool

► Information Flow

Web query → search engine → links to results → WWW → retrieved document text (150 docs ~35 seconds, 200 docs ~50 seconds) → bi-gram list → user

From the bi-gram list:
• The question may be answered directly
• Jump directly to relevant parts of documents containing bi-grams
• Explore documents further using language modeling
• Formulate and run a refined web query including/rejecting selected bi-grams
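The flow above can be sketched as a small orchestration function; `search`, `fetch` and `extract_bigrams` are hypothetical stand-ins for the search-engine API, the HTTP download step and the bi-gram extractor, supplied by the caller:

```python
def make_retrieval_pass(search, fetch, extract_bigrams, n_docs=150):
    """Build one pass of the pipeline: web query -> result links ->
    downloaded document text -> bi-gram list presented to the user.
    All three callables are caller-supplied stand-ins."""
    def run(query):
        links = search(query)[:n_docs]           # links to results
        texts = [fetch(url) for url in links]    # retrieved document text
        return extract_bigrams(" ".join(texts))  # bi-gram list
    return run
```

Keeping the three stages as injected functions mirrors the diagram: the search engine, the web, and the local analysis are independent components.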


A New Information Retrieval Tool

► Why use bi-grams?
 Express a concept much better than a single word

Example: the single word “peanut” fans out into distinct concepts:
 Foods/cooking: peanut butter, peanut candy, roasted peanut, chocolate peanut, peanut brittle, peanut cookie, peanut recipe, peanut lover, peanut soup, peanut oil
 Agriculture: peanut institute, peanut commission, peanut grower, peanut farmer, peanut producer, peanut plant
 Commercial/Brands: peanut software, peanut linux, peanut van, peanut inn, peanut clothing, peanut apparel, baby peanut, peanut sandals, peanut ties, Mr peanut
 Medical: peanut allergy
 Other: peanut gallery


A New Information Retrieval Tool

► Why use bi-grams?

 Can be used directly as a web search term by using quotes
 Easily combined into well-targeted searches using “OR” and “-” operators
 Reasonably easy to extract meaningful bi-grams from document text using simple rules
 Tri-grams or higher n-grams tend to occur too infrequently
 Also, bi-grams seem to be somewhat neglected in current IR research – possibly because of the “sparse data” problem: a corpus with a vocabulary of 50,000 words has the potential for 2,500,000,000 bi-grams, causing difficulty in language modeling and many other IR techniques


A New Information Retrieval Tool

► Simple bi-gram extraction rules

 Ignore a bi-gram if it contains a stop word (a word that doesn’t convey much meaning, for instance a, an, and, of, the, etc. – without this step the most frequent bi-grams are usually “of the”, “in the”, “to the” and so on)
 If bi-grams are found that are the same except for plurals (e.g. african elephant and african elephants) only present the most common form to the user
 Sort bi-grams by descending occurrence count within descending document occurrence count

► Alternative method
 Part-of-speech filtering – almost all interesting bi-grams are of the form “Adjective Noun” or “Noun Noun”
 It isn’t always possible to determine the part of speech exactly (for example unknown words, words with multiple possible parts of speech), but we can certainly reject many bi-grams that could never be of the required form
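The stop-word rule can be sketched in a few lines; plural folding, the two-level sort, and part-of-speech filtering are omitted, and the stop-word list here is a small illustrative subset rather than the one used in the tool:

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "is", "for", "on"}

def extract_bigrams(text):
    """Count word bi-grams, discarding any pair that contains a stop
    word (otherwise "of the", "in the", etc. dominate the list)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(
        f"{w1} {w2}"
        for w1, w2 in zip(words, words[1:])
        if w1 not in STOP_WORDS and w2 not in STOP_WORDS
    )
```

`Counter.most_common()` then yields the list in descending occurrence order, the first of the two sort keys above.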


A New Information Retrieval Tool

► Example of an automatically generated bi-gram list
 150 documents from simple Google search “Elephants”

1. african elephant
2. elephant man
3. loxodonta africana
4. elephant loxodonta
5. asian elephant
6. south africa
7. privacy policy
8. united states
9. years old
10. elephas maximus
11. 22 months
12. national park
13. endangered species
14. elephants live
15. forest elephants
16. small objects
17. elephant jokes
18. baby elephant
19. elephant elephas
20. 70 years
21. indian elephant
22. elephant society
23. large ears
24. wild elephants
25. land mammal
26. white elephant
27. forest elephant
28. baby elephants
29. elephants eat
30. ivory tusks
31. family elephantidae
32. species survival
33. 13 feet
34. largest living
35. largest land
36. elephant seal
37. natural history
38. young elephants
39. female elephants
40. elephant conservation
41. long time
42. elephant range
43. give birth
44. incisor teeth
45. 20 years
46. ivory trade
47. 60 years
48. blood vessels


A New Information Retrieval Tool

► From browsing the bi-gram list the user

 Gets an idea of the topic areas in the retrieved data
 Can jump directly to relevant parts of retrieved documents
 Can mark bi-grams to include/exclude from a new web search
 Can mark portions of retrieved documents as relevant/irrelevant to use in a more targeted local search using language modeling (this allows drilling down to discover information on concepts that are difficult to express as a web query)

► The user can also use natural language and/or keyword queries using language modeling to assist in examining local data (i.e. the document text downloaded as part of the retrieval process)


A New Information Retrieval Tool

► Example of a new web query formulated by marking relevant/irrelevant bi-grams:

elephant ("african elephant" OR "asian elephants" OR "loxodonta africana") -"elephant man" -"elephant seal" -"elephant jokes" -"white elephant"

► Using the criterion of our test information need (web pages with suitable information for a high school project on the land mammal “elephant”), 48 of the first 50 pages returned by the search were judged to be relevant.

► The web search also indicated there were about 228,000 pages matching our query, so we were still getting very wide coverage
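Assembling such a query from the user's marked bi-grams is mechanical; a minimal sketch (the function name and argument order are our own, not the tool's API):

```python
def build_refined_query(seed, include, exclude):
    """Quote each bi-gram, OR together the wanted ones, and prefix the
    unwanted ones with '-', using the operators noted earlier."""
    wanted = " OR ".join(f'"{b}"' for b in include)
    unwanted = " ".join(f'-"{b}"' for b in exclude)
    return f"{seed} ({wanted}) {unwanted}".strip()
```

The quoted phrases force exact bi-gram matches, which is what makes the refined query so much more precise than the original one-word search.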


A New Information Retrieval Tool

► Are the generated bi-gram lists always useful?
 At least some of the information in the retrieved documents must be relevant for interesting bi-grams to be generated
 On the other hand, if the bi-grams are all way off track, the user knows immediately to rethink the initial search, rather than reviewing many irrelevant documents
 Initial testing with 27 different one-word Google searches (~150 documents retrieved for each search) generated useful bi-gram lists in 21 cases
 By using a two-word search (e.g. “angle geometry” instead of “angle”, “cobra snake” instead of “cobra”), useful bi-gram lists were obtained for five of the remaining six cases


A New Information Retrieval Tool

► Future work
 Continue development and refinement of this information retrieval tool
 Identify examples of information needs that are difficult to satisfy by direct web search alone
 User survey – how does the tool perform in practice? How could it be improved?


Questions/Comments?