Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Post on 21-Dec-2015


Page 1: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Web Search Tutorial

Jan Pedersen and Knut Magne Risvik

Yahoo! Inc. Search and Marketplace

Page 2: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Agenda

• A Short History

• Internet Search Fundamentals
  – Web Pages
  – Indexing

• Ranking and Evaluation

• Third Generation Technologies

Page 3: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

A Short History

Page 4: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Precursors

• Information Retrieval (IR) Systems
  – online catalogs and news
    • Limited scale, homogeneous text
  – recall focus
  – empirical
    • Driven by results on evaluation collections
  – free text queries shown to win over Boolean
• Specialized Internet access
  – Gopher, WAIS, Archie
    • FTP archives and special databases
    • Never achieved critical mass

Page 5: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

First Generation Systems

• 1993: Mosaic opens the WWW
  – 1993 Architext/Excite (Stanford/Kleiner Perkins)
  – 1994 Webcrawler (full text indexing)
  – 1994 Yahoo! (human edited directory)
  – 1994 Lycos (400K indexed pages)
  – 1994 Infoseek (subscription service)
• Power systems
  – 1994 AltaVista (DEC Labs, advanced query syntax, large index)
  – 1996 Inktomi (massively distributed solution)

Page 6: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Second Generation Systems

• Relevance matters
  – 1998 Direct Hit (clickthrough based re-ranking)
  – 1998 Google (link authority based re-ranking)
• Size matters
  – 1999 FAST/AllTheWeb (scalable architecture)
• The user matters
  – 1996 Ask Jeeves (question answering)
• Money matters
  – 1997 Goto/Overture (pay-for-performance search)

Page 7: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Third Generation Systems

• Market consolidation
  – 2002 Yahoo! purchases Inktomi
  – 2003 Overture purchases AV and FAST/AllTheWeb
  – 2003 MSN announces intention to build a search engine
• Search matures
  – $2B market projected to grow to $6B by 2005
  – required capital investment limits new players
    • Gigablast?
  – traffic focused in a few sites
    • Yahoo!, MSN, Google, AOL
  – consumer use driven by brand marketing

Page 8: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Web Search Fundamentals

Page 9: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Web Fundamentals

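[Diagram: User, Browser and Web Server. The browser issues an HTTP Request for a URL; the web server handles Page Serving and returns an HTML Page; the browser handles Page Rendering; Hyper Links in the page point to further URLs.]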

Page 10: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Definitions

• URLs refer to WWW content
  – referential integrity is not guaranteed
  – roughly 10% of URLs go 404 every month
• HTTP requests fetch content from a server
  – stateless protocol
  – cookies provide partial state
• Web servers generate HTML pages
  – can be static or dynamic (output of a program)
  – markup tags determine page rendering
• HTML pages contain hyperlinks
  – a link consists of a URL and anchor text
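As an illustration of the last point, the sketch below pulls (URL, anchor text) pairs out of an HTML fragment with Python's standard `html.parser` module. The class name `LinkExtractor`, the helper `extract_links` and the sample markup are assumptions made for this example; they are not from the tutorial, and a production crawler would also need to resolve relative URLs and cope with malformed markup.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (url, anchor text) pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor_text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, used only to exercise the parser.
sample = '<p>Visit <a href="http://www.cnn.com/">CNN  news</a> today.</p>'
links = extract_links(sample)
```

Here `links` holds one pair: the target URL and the text a reader would click on. The anchor text is the part that later feeds anchor text matching in ranking.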

Page 11: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

URLs

• URL Definition
  – http://host:port/path;params?query#fragment
    • fragment is not considered part of the URL
    • params are considered part of the path
  – params are not frequently used
• Examples
  – http://www.cnn.com/
  – http://ad.doubleclick.net/jump;sz=120x60;ptile=6;ord=6981062172
  – http://us.imdb.com/Title?0068646
  – http://www.sky.com/skynews/article/0,,30000-12261027,00.html
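The URL definition can be checked against Python's standard `urllib.parse` module, which uses the same component names. This is a minimal sketch; the helper names `split_url` and `index_key` are invented for the example, and dropping the fragment is shown only because the slide says the fragment is not part of the URL.

```python
from urllib.parse import urlparse, urldefrag


def split_url(url):
    """Split a URL into the components named in the definition above."""
    p = urlparse(url)
    return {
        "host": p.hostname,
        "port": p.port,          # None when the URL has no explicit port
        "path": p.path,
        "params": p.params,      # the ';params' suffix of the last path segment
        "query": p.query,
        "fragment": p.fragment,
    }


def index_key(url):
    """The URL with its fragment removed."""
    return urldefrag(url).url


template = split_url("http://host:8080/path;params?query#fragment")
ad = split_url("http://ad.doubleclick.net/jump;sz=120x60;ptile=6;ord=6981062172")
title = split_url("http://us.imdb.com/Title?0068646")
```

For the doubleclick example the parser reports the path as `/jump` and everything after the first semicolon as params, while the IMDb example has an empty params field and the query `0068646`.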

Page 12: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Dynamic URLs

• URLs with Dynamic Components
  – Path (including params) and host are not dynamic
    • If you change the path and/or host you will get a 404 or similar error
  – Query is dynamic
    • If you change the query part, you will get a valid page back
    • source of potentially infinite number of pages
• Examples
  – http://www.cnn.com/index.html?test
    • Returns a valid 200 page, even if test is not a valid query term
  – http://www.cnn.com/index.html;test
    • Returns a 404 error page
• Not all URLs Follow this Convention:
  – http://www.internetnews.com/xSP/article.php/1378731

Page 13: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Dynamic Content

• Content Depends on External (to URL) Factors
  – Cookies
  – IP
  – Referrer
  – User-Agent
• Examples
  – http://my.yahoo.com/
  – http://forum.doom9.org/forumdisplay.php?s=af9ddb31710c7b314b75262c1031d8af&forumid=65
• Dynamic URLs and Dynamic Content are Orthogonal
  – static URLs can refer to dynamic content
  – dynamic URLs can refer to static content

Page 14: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

HTML Sample

<html> <head> <title>Andreas S. WEIGEND, PhD</title> </head>

<body>

<blockquote><font face="Verdana,Tahoma,Arial" size=2>

<h2><font size="4" face="Verdana, Arial, Helvetica, sans-serif">Andreas S. WEIGEND,

</font><font size="3" face="Verdana, Arial, Helvetica, sans-serif">Ph.D.</font><font face="Verdana, Arial, Helvetica, sans-serif"><br>

<font size="2">Chief Scientist, Amazon.com</font></font></h2> </font>

<blockquote>

<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"><i>&quot;Sophisticated

algorithms have always been a big part of creating the Amazon.com customer

experience.&quot; (Jeff Bezos, Founder and CEO of Amazon.com)</i></font></p></blockquote>

<p><font face="Verdana, Arial, Helvetica, sans-serif" size="2"> <a href="http://www.amazon.com">Amazon.com</a>

might be the world's largest laboratory to study human behavior and decision

making. It for sure is a place with very smart people, with a healthy attitude

towards data, measurement, and modeling. I am responsible for research in

machine learning and computational marketing. Applications range from real-time

predictions of customer intent and satisfaction, to personalization and long-term

optimization of pricing and promotions.<font size="-2"> [<a href="http://www.weigend.com/amazonjobs.html"

onclick="window.open(this.href);return false;">Job openings.</a>] </font>

I'm also the point person for academic relations.</font></p>

</blockquote>

<font face="Verdana,Tahoma,Arial" size=2>

<h3> <font face="Verdana, Arial, Helvetica, sans-serif"><i><font size="3"> Schedule Summer 2003</font></i></font></h3>

</font>

Page 15: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Rendered Page

Page 16: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

WWW Size

• How many pages are in the WWW?
  – Lawrence and Giles, 1999: 800M pages with most pages not indexed
  – Dynamically generated pages imply effective size is infinite
• How many sites are registered?
  – Churn due to SPAM

Page 17: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Crawling

• Search Engine robot
  – visits every page that will be indexed
  – traversal behavior depends on crawl policy
• Index parameterized by size and freshness
  – freshness is time since last revisit if page has changed
• Batch vs Incremental
  – Batch crawl has several distinct batch processing stages
    • discover, grab, index
    • AV discovery phase takes 10 days, grab another 10, etc.
    • sharp freshness curve
  – Incremental crawl
    • crawler constantly operates, intermixing discovery with grab
    • mild drop-off in freshness
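The way an incremental crawler intermixes discovery with grab can be sketched with a simple frontier queue. This is a toy model under stated assumptions: the function `crawl`, its `fetch` callback and the in-memory `toy_web` are all hypothetical, nothing touches the network, and revisit scheduling (which is what actually drives freshness) is left out.

```python
from collections import deque


def crawl(seed_urls, fetch, max_pages=1000):
    """Breadth-first crawl: grab a page, then discover the links it contains."""
    frontier = deque(seed_urls)   # URLs discovered but not yet grabbed
    seen = set(seed_urls)         # every URL ever placed on the frontier
    pages = {}                    # url -> outlinks of each grabbed page
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        outlinks = fetch(url)     # grab step; None models a dead (404) URL
        if outlinks is None:
            continue
        pages[url] = outlinks
        for link in outlinks:     # discovery step, interleaved with grabbing
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Toy link graph standing in for the web; "dead" has no entry, so it "404s".
toy_web = {"a": ["b", "c"], "b": ["c", "dead"], "c": ["a"]}
crawled = crawl(["a"], toy_web.get)
```

A batch crawler would instead run discovery to completion before any grab starts, which is why its freshness curve is sharper.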

Page 18: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Typical Crawl/Build Architecture

[Diagram. Crawl stage: Seed List, URL DB, Discovery, Grab, Internet, Pagefiles. Index Build stage: Pagefiles, Filtered Pagefiles, Index, with the Anchor Text DB, Connectivity DB, Duplicates DB and Alias DB.]

Page 19: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Relative Size

From Search Engine Showdown

Google claims 3B

Fast claims 2.5B

AV claims 1B

Page 20: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Freshness

From Search Engine Showdown

Note hybrid indices; subindices with differing update rates

Page 21: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Query Language

• Free text with implicit AND and implicit proximity
  – Syntax-free input
• Explicit Boolean
  – AND (+)
  – OR (|)
  – AND NOT (-)
• Explicit Phrasing (“”)
• Filters
  – domain: filetype:
  – host: title:
  – link: image:
  – url: anchor:
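A small parser makes the query language concrete. The sketch below covers only part of it (implicit AND, `+`, `-`, quoted phrases and the listed filters) and deliberately leaves out the `|` operator and proximity. The regular expression, the function `parse_query` and the output layout are assumptions for illustration, not the syntax of any particular engine.

```python
import re

# One token: optional +/- prefix, optional filter name, then a phrase or a word.
TOKEN = re.compile(
    r'([+-]?)'
    r'(?:(domain|filetype|host|title|link|image|url|anchor):)?'
    r'(?:"([^"]*)"|(\S+))'
)


def parse_query(query):
    """Split a query into required terms, excluded terms and filters."""
    parsed = {"required": [], "excluded": [], "filters": []}
    for op, field, phrase, word in TOKEN.findall(query):
        text = phrase or word
        if not text:
            continue                                   # skip an empty "" phrase
        if field:
            parsed["filters"].append((field, text))
        elif op == "-":
            parsed["excluded"].append(text)            # AND NOT
        else:
            parsed["required"].append(text)            # implicit or explicit AND
    return parsed


result = parse_query('travel +hotels -cheap "new york" host:www.cnn.com')
```

Because plain terms are already combined with an implicit AND, `travel` and `+hotels` both end up in the required list; a phrase is kept as a single required string.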

Page 22: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Query Serving Architecture

• Index divided into segments each served by a node

• Each row of nodes replicated for query load

• Query integrator distributes query and merges results

• Front end creates an HTML page with the query results

[Diagram: a Load Balancer in front of front ends FE1, FE2 … FE8; query integrators QI1, QI2 … QI8; and a grid of index nodes Node1,1 … Node4,N (four rows, N segments per row). The query “travel” is shown passing from the load balancer through a front end and a query integrator to the nodes.]
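The role of the query integrator, distributing the query to every segment and merging the partial results, can be sketched as a scatter-gather step. The segment layout (a dict from query to precomputed document scores) and the function names are assumptions for this example; replication of node rows for query load and failure handling are not modeled.

```python
import heapq


def search_segment(segment, query, k):
    """One node: return the top-k (score, doc) hits from its index segment."""
    hits = [(score, doc) for doc, score in segment.get(query, {}).items()]
    return heapq.nlargest(k, hits)


def query_integrator(segments, query, k=10):
    """Send the query to every segment and merge the per-segment top-k lists."""
    partial = [search_segment(segment, query, k) for segment in segments]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [(doc, score) for score, doc in merged]


# Three hypothetical segments with made-up scores.
segments = [
    {"travel": {"d1": 0.9, "d2": 0.2}},
    {"travel": {"d3": 0.7}},
    {"cobra": {"d4": 0.5}},
]
top = query_integrator(segments, "travel", k=2)
```

Each node only needs to return its own top k, since no document outside a segment's top k can appear in the global top k.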

Page 23: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Query Evaluation

• Index has two tables:
  – term to posting
  – document ID to document data
• Postings record term occurrences
  – may include positions
• Ranking employs postings
  – to score documents
• Display employs document info
  – fetched for top scoring documents

[Diagram: the query “travel” enters the Query Evaluator, which uses the Terms to Posting table for ranking and the Doc ID to Doc Data table for display.]
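The two-table layout can be sketched in a few lines. This is a toy model: the function names, the whitespace tokenizer and the occurrence-count score are placeholders chosen for the example, not the ranking the tutorial describes.

```python
from collections import defaultdict


def build_index(docs):
    """docs: doc_id -> (title, text). Returns the two tables."""
    postings = defaultdict(dict)   # term -> {doc_id: [positions]}
    doc_data = {}                  # doc_id -> data needed to display a hit
    for doc_id, (title, text) in docs.items():
        doc_data[doc_id] = {"title": title}
        for pos, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(pos)
    return postings, doc_data


def evaluate(query, postings, doc_data, k=10):
    """Rank with the postings table, then fetch display data for the top k."""
    terms = query.lower().split()
    if not terms:
        return []
    # Implicit AND: keep only documents that contain every query term.
    candidates = set(postings.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(postings.get(term, {}))

    def score(doc_id):
        # Placeholder score: total number of query-term occurrences.
        return sum(len(postings[t][doc_id]) for t in terms)

    ranked = sorted(candidates, key=lambda d: (-score(d), d))
    return [(d, doc_data[d]["title"]) for d in ranked[:k]]


docs = {
    1: ("Travel deals", "cheap travel deals for travel lovers"),
    2: ("Cobra facts", "the cobra is a snake"),
    3: ("Travel guide", "travel guide"),
}
postings, doc_data = build_index(docs)
hits = evaluate("travel", postings, doc_data)
```

Note that the document table is touched only for the final top-k list, which is the point of keeping it separate from the postings.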

Page 24: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Scale

• Indices typically cover billions of pages
  – terabytes of data
• Tens of millions of queries served every day
  – translates to hundreds of queries per second
• Users require rapid response
  – query must be evaluated in under 300 msecs
• Data Centers typically employ thousands of machines
  – Individual component failures are common
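The step from queries per day to queries per second is plain division by the 86,400 seconds in a day. The figure of 30 million queries per day below is an assumed value inside the "tens of millions" range, chosen only to make the arithmetic concrete:

```latex
\frac{3 \times 10^{7}\ \text{queries/day}}{86{,}400\ \text{s/day}} \approx 347\ \text{queries/s}
```

This is a daily average; peak load at busy hours would be higher.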

Page 25: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Search Results Page

• Blended results
  – multiple sources
• Relevance ranked
• Assisted search
  – Spell correction
• Specialized indices
  – via Tabs
• Sponsored listing
  – monetization
• Localization
  – Country language experience

Page 26: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Relevance Evaluation

Page 27: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Relevance is Everything

• The Search Paradigm: 2.4 words, a few clicks, and you’re done
  – only possible if results are very relevant
• Relevance is ‘speed’
  – time from task initiation to resolution
  – important factors:
    • Location of useful result
    • UI Clutter
    • latency
• Relevance is relative
  – context dependent
    • e.g. ‘football’ in the UK vs the US
  – task dependent
    • e.g. ‘mafia’ when shopping vs researching

Page 28: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Relevance is Hard to Measure

• Poorly defined, subjective notion
  – depends on task, user context, etc.
• Analysts have Focused on Easier-to-Measure Surrogates
  – index size, traffic, speed
  – anecdotal relevance tests
    • e.g. Vanity queries
• Requires Survey Methodology
  – averaged over queries
  – averaged over users

Page 29: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Survey Methodologies

• Internal expert assessments
  – assessments typically not replicated
  – models absolute notion of relevance
• External consumer assessments
  – assessments heavily replicated
  – models statistical notion of relevance
• A/B surveys
  – compare whole result sets
  – visual relevance plays a large role
• URL surveys
  – judge relevance of particular URL for query

Page 30: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

A/B Test Design

• Strategy:
  – Compare two ranking algorithms by asking panelists to compare pairs of search results
• Queries:
  – 1000 semi-random queries, filtered for family-friendliness and understandability
    • Users can select from a list of 20 queries
• URLs:
  – Top 10 search results from 2 algorithms
• Voting:
  – 5 point scale, 7 replications
  – Each user rates 6 queries, one of which is a control query
    • Control query has AV results on one side, random URLs on the other
    • Reject voters who take less than 10 seconds to vote

Page 31: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Query selection screen

Page 32: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Rating screen

Page 33: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

A/B Test Scoring

• Test ran until we had 400 decisive votes
  – Margin of error = 5%
• Compute:
  – Majority Vote: count of queries where more than half of the users said one engine was “somewhat better” or “much better”
  – Total Vote: count of users that rated a result set “somewhat better” or “much better” for each engine
• Compare percentages
  – test if one system ‘out votes’ the other
  – determine if the difference is statistically significant
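The scoring rules can be sketched as follows. The 1/sqrt(n) margin of error is the rule of thumb the deck itself uses for its error bars; everything else is an assumption for illustration: the five-point scale is collapsed to three labels ("A" or "B" for a vote of somewhat or much better for that engine, "same" otherwise), and the function names and sample votes are made up.

```python
import math
from collections import Counter


def margin_of_error(n_votes):
    """Rule-of-thumb error bar, 1/sqrt(n)."""
    return 1.0 / math.sqrt(n_votes)


def majority_vote(votes_by_query):
    """Count the queries where more than half of the voters preferred one engine."""
    wins = Counter()
    for votes in votes_by_query.values():
        tally = Counter(votes)
        for engine in ("A", "B"):
            if tally[engine] > len(votes) / 2:
                wins[engine] += 1
    return wins


def total_vote(votes_by_query):
    """Count individual votes for each engine across all queries."""
    total = Counter()
    for votes in votes_by_query.values():
        total.update(votes)
    return total


# Hypothetical votes for three queries, three voters each.
votes = {
    "q1": ["A", "A", "B"],
    "q2": ["B", "same", "B"],
    "q3": ["A", "B", "same"],
}
majority = majority_vote(votes)
totals = total_vote(votes)
```

With 400 decisive votes the margin is 5%, so a gap between the two engines smaller than that should not be read as a win for either.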

Page 34: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Results

• Test One: AV vs SE1 (error bar = 1/sqrt(400) = 5%)

             Queries with winner      All accepted votes
             Majority    Unanimous    “a little better”    “much better”
AltaVista    37.6%       6.1%         24.4%                11.0%
SE1          37.3%       8.0%         22.2%                10.7%
Same         25.1%       2.6%         31.7% (all accepted votes)

• Control Votes (error bar = 1/sqrt(160) = 7.9%)

             Queries with winner      All accepted votes
             Majority    Unanimous    “a little better”    “much better”
Good         98.1%       51.5%        24.4%                59.1%
Bad          0%          0%           4.7%                 1.7%
Same         1.9%        0.6%         10.1% (all accepted votes)

Page 35: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Results

• Test Two: AV vs SE2 (with UI issue)

             Queries with winner      All accepted votes
             Majority    Unanimous    “a little better”    “much better”
AltaVista    58.5%       13.4%        26.5%                16.3%
SE2          28.1%       4.6%         21.8%                8.9%
Same         13.4%       0.9%         26.4% (all accepted votes)

• Test Three: SE1 vs SE2

             Queries with winner      All accepted votes
             Majority    Unanimous    “a little better”    “much better”
SE1          35.4%       4.7%         28.2%                13.2%
SE2          40.6%       4.1%         29.0%                15.6%
Same         24.0%       1.9%         13.8% (all accepted votes)

Page 36: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Ranking

• Given 2.4 query terms, search 2B documents and return 10 highly relevant in 300 msecs
  – Problem queries:
    • Travel (matches 32M documents)
    • John Ellis (which one)
    • Cobra (medical or animal)
• Query types
  – Navigational (known item retrieval)
  – Informational
• Ingredients
  – Keyword match (title, abstract, body)
  – Anchor Text (referring text)
  – Quality (link connectivity)
  – User Feedback (clickrate analysis)

Page 37: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

The Components of Relevance

• First Generation:
  – Keyword matching
    • Title and abstract worth more
• Second Generation:
  – Computed document authority
    • Based on link analysis
  – Anchor text matching
    • Webmaster voting
• Development Cycle:
  – Tune Ranking
  – Evaluate Metrics

Page 38: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Connectivity

Page 39: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Connectivity Goals

• An indicator of authority
  – As measured by static links
  – Each link is a ‘vote’ in favor of a site
  – Webmasters are the voters
• Not all links are equal
  – Links from authoritative sites are worth more
    • Introduces an interesting circularity
  – Votes from sites with many links are discounted
    • Use your vote wisely
  – Discount navigational links
    • Not all links are editorial
  – Account for link SPAM

Page 40: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Connectivity Network

[Figure: example link network containing nodes A and B]

• What is authority score for nodes A and B?
• Inlink computes:
  – A = 3
  – B = 2
• Page Rank computes:
  – A = .225
  – B = .295

Page 41: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Definitions

• Connectivity Graph
  – Nodes are pages (or hosts)
  – Directed edges are links
  – Graph edges can be represented as a transition matrix, A
    • The ith row of A represents the links out from node i
• Authority score
  – Score associated with each node
  – Some function of inlinks to node and outlinks from node
    • Simplest authority score is inlink count
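These definitions translate directly into code. The sketch below builds a row-normalized transition matrix (row i spreads its weight evenly over the links out of node i) and the inlink-count score. The three-node graph, the function names and the choice of plain Python lists are assumptions for the example; each node is assumed to list a target at most once.

```python
from collections import Counter


def transition_matrix(outlinks, nodes):
    """Row i holds 1/e(i) for every node that i links to, where e(i) = out-degree."""
    index = {node: k for k, node in enumerate(nodes)}
    A = [[0.0] * len(nodes) for _ in nodes]
    for i, targets in outlinks.items():
        for j in targets:
            A[index[i]][index[j]] = 1.0 / len(targets)
    return A


def inlink_counts(outlinks):
    """Simplest authority score: the number of links pointing at each node."""
    counts = Counter({node: 0 for node in outlinks})
    for targets in outlinks.values():
        counts.update(targets)
    return counts


# Hypothetical graph: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
A = transition_matrix(graph, ["a", "b", "c"])
inlinks = inlink_counts(graph)
```

Inlink count treats every vote as equal, which is exactly what the link-analysis scores on the following slides set out to refine.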

Page 42: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

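Page Rank (Without Random Jump)

• Contribution averaged over all outlinks
• Node score is the sum of contributions

$$r(i) = \sum_{j:\, j \to i} \frac{r(j)}{e(j)}$$

where e(j) is the number of outlinks from node j

• Fixed point equation
  – If A is normalized
    • Each row sums to 1.0

$$r = A^{T} r$$

[Figure: example network with edge labels .1, .1, .1, 1/2, 1/2 and node scores A (.25), B (.3)]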

Page 43: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Page Rank Implications

• A is a stochastic matrix
  – r(i) can be interpreted as a probability
    • Suppose a surfer takes an outlink at random
    • r(i) is the long run probability of landing at a particular node
  – Solution to fixed point equation is the principal eigenvector
    • principal eigenvalue is 1.0
• Solution can be found by iteration
  – If $A^{T} r = r$ then $(A^{T})^{n} r = r$
  – Start with random initial value for r
  – Iterate multiplication by $A^{T}$
    • Contribution of smaller eigenvalues will drop out
  – Final value is a good estimate of the fixed point solution
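The iteration can be sketched without any matrix library by pushing each node's score along its outlinks. Several things here are assumptions made for the example: the function name, the three-node graph, a uniform rather than random starting vector (so the result is reproducible), and the requirement that every node has at least one outlink.

```python
def pagerank_no_jump(outlinks, iterations=100):
    """Power iteration for r(i) = sum over j -> i of r(j) / e(j)."""
    nodes = list(outlinks)
    r = {node: 1.0 / len(nodes) for node in nodes}   # uniform start
    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        for j, targets in outlinks.items():
            share = r[j] / len(targets)              # r(j) / e(j)
            for i in targets:
                nxt[i] += share
        r = nxt
    return r


# Hypothetical graph: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank_no_jump(graph)
```

For this graph the fixed point can be solved by hand: r(b) = r(a)/2 and r(c) = r(a)/2 + r(b) = r(a), so the scores are 0.4, 0.2 and 0.4, and the iteration settles on the same values.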

Page 44: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

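Page Rank (with random jump)

• What’s the score for a node with no in-links?
• Revised equation

$$r(i) = \frac{\epsilon}{N} + (1 - \epsilon) \sum_{j:\, j \to i} \frac{r(j)}{e(j)}$$

• Fixed point equation

$$r = \left(\epsilon U + (1 - \epsilon) A^{T}\right) r, \qquad U_{i,j} = \frac{1}{N}$$

• Probability interpretation
  – As before, with chance ε of jumping randomly

ε = 0.1

[Figure: example network with edge labels .1, .1, .1, 1/2, 1/2 and node scores A (.225), B (.293)]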
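The revised equation changes the power iteration in one place: every node starts each round with a share of the random-jump mass. In this sketch the function name, the parameter name `eps` and the four-node graph are assumptions; every node is assumed to have at least one outlink, so the question of redistributing the mass of dead-end nodes does not arise.

```python
def pagerank(outlinks, eps=0.1, iterations=100):
    """Power iteration for r(i) = eps/N + (1 - eps) * sum over j -> i of r(j)/e(j)."""
    nodes = list(outlinks)
    n = len(nodes)
    r = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        nxt = {node: eps / n for node in nodes}      # random-jump contribution
        for j, targets in outlinks.items():
            for i in targets:
                nxt[i] += (1 - eps) * r[j] / len(targets)
        r = nxt
    return r


# Hypothetical graph; node d has no in-links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}
scores = pagerank(graph)
```

Node d answers the question on the slide: with no in-links its score is exactly the jump term, eps/N = 0.1/4 = 0.025, rather than zero.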

Page 45: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Eigenrank

• Separates internal from external links
  – Internal transition matrix I
  – External transition matrix E
• Introduces a new parameter
  – ε is the random jump probability
  – γ is the probability of taking an internal link
  – (1 - ε - γ) is the probability of taking an external link

Page 46: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

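Eigenrank

• Revised equation

$$r(i) = \frac{\epsilon}{N} + \gamma \sum_{j:\, j \to i\ \text{internal}} \frac{r(j)}{e(j)} + (1 - \epsilon - \gamma) \sum_{j:\, j \to i\ \text{external}} \frac{r(j)}{e(j)}$$

• Fixed point equation

$$r = \left(\epsilon U + \gamma I + (1 - \epsilon - \gamma) E\right)^{T} r, \qquad U_{i,j} = \frac{1}{N}$$

• Probability interpretation
  – ε chance of random jump
  – γ chance of internal link
  – (1 - ε - γ) chance of external link

ε = 0.1, γ = 0.1

[Figure: example network with edge labels .1, .1, .1, 1/2, 1/2 and node scores A (.2), B (.202)]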

Page 47: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Computational Issues

• Nodes with no outlinks
  – Transition matrix with zero row
    • Internal or external
  – Leave out of computation(?)
  – Redistribute mass to random jump(?)
• Currently mass is redistributed
  – Complex formula that prefers external links

Page 48: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Kleinberg

• Two scores
  – Authority score, a
  – Hub score, h
• Fixed Point equations
  – Authority: $a = A^{T} h = A^{T} A\, a$
  – Hub: $h = A\, a = A A^{T} h$
  – Principal eigenvectors are solutions
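The two coupled scores can be iterated together. The sketch below follows the usual statement of Kleinberg's method, with row i of the link matrix listing the links out of node i: a page's authority sums the hub scores of the pages linking to it, and its hub score sums the authorities of the pages it links to. The function name, the unit-length normalization and the four-node graph are assumptions for the example.

```python
import math


def hits(outlinks, iterations=50):
    """Iterate a = A^T h and h = A a, rescaling both to unit length each round."""
    nodes = set(outlinks) | {t for targets in outlinks.values() for t in targets}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority of i: sum of hub scores of the pages that link to i.
        auth = {node: 0.0 for node in nodes}
        for j, targets in outlinks.items():
            for i in targets:
                auth[i] += hub[j]
        # Hub score of j: sum of authority scores of the pages j links to.
        hub = {j: sum(auth[i] for i in outlinks.get(j, ())) for j in nodes}
        for vec in (auth, hub):
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for node in vec:
                vec[node] /= norm
    return auth, hub


# Hypothetical graph: two hub pages pointing at two content pages.
graph = {"h1": ["x", "y"], "h2": ["x"]}
auth, hub = hits(graph)
```

In this graph x is linked from both hubs and y from only one, so x ends up with the higher authority, and h1, which links to both, ends up as the stronger hub.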

Page 49: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

SPAM

• Manipulation of content purely to influence ranking
  – Dictionary SPAM
  – Link sharing
  – Domain hi-jacking
  – Link farms
• Robotic use of search results
  – Meta-search engines
  – Search Engine optimizers
  – Fraud

Page 50: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Third Generation Technologies

Page 51: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Handling Ambiguity

Results for query: Cobra

Page 52: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Impression Tracking

Incoherent URLs are those that receive high rank for a large diversity of queries. Many incoherent URLs indicate SPAM or a bug (as in this case).
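A minimal sketch of how such URLs might be flagged from an impression log follows. The log format (query, URL, rank), the function name and both thresholds are assumptions made for the example; the tutorial does not say how "high rank" or "large diversity" were defined.

```python
from collections import defaultdict


def incoherent_urls(impressions, top_rank=3, min_queries=3):
    """URLs shown at rank <= top_rank for at least min_queries distinct queries."""
    queries_by_url = defaultdict(set)
    for query, url, rank in impressions:
        if rank <= top_rank:
            queries_by_url[url].add(query)
    return sorted(url for url, queries in queries_by_url.items()
                  if len(queries) >= min_queries)


# Hypothetical impression log: (query, url, rank shown).
log = [
    ("travel", "http://spam.example/", 1),
    ("cobra", "http://spam.example/", 2),
    ("mafia", "http://spam.example/", 1),
    ("travel", "http://spam.example/", 1),
    ("travel", "http://www.cnn.com/", 2),
    ("football", "http://www.cnn.com/", 9),
]
flagged = incoherent_urls(log)
```

Only distinct queries are counted, so a URL that ranks well for one popular query many times over is not flagged.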

Page 53: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Clickrate Relevance Metric

Average highest rank clicked perceptibly increased with the release of a new rank function.

Page 54: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

User Interface

• Ranked result lists
  – Document summaries are critical
    • Hit highlighting
    • Dynamic abstracts
    • URL
  – No recent innovation
    • Graphical presentations not well fit to the task
• Blending
  – Predefined segmentation
    • e.g. Paid listing
  – Intermixed with results from other sources
    • e.g. News

Page 55: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Future Trends

• Question Answering
  – WWW as language model
    • Enables simple methods
    • e.g. Dumais et al. (SIGIR 2002)
• New contexts
  – Ubiquitous Searching
    • Toolbars, desktop, phone
  – Implicit Searching
    • Computed links
• New Tasks
  – e.g. Local / Country Search

Page 56: Web Search Tutorial Jan Pedersen and Knut Magne Risvik Yahoo! Inc. Search and Marketplace

Bibliography

• Modeling the Internet and the Web: Probabilistic Methods and Algorithms, by Pierre Baldi, Paolo Frasconi, and Padhraic Smyth. John Wiley & Sons; May 28, 2003.
• Mining the Web: Analysis of Hypertext and Semi Structured Data, by Soumen Chakrabarti. Morgan Kaufmann; August 15, 2002.
• The Anatomy of a Large-scale Hypertextual Web Search Engine, by S. Brin and L. Page. 7th International WWW Conference, Brisbane, Australia; April 1998.
• Websites:
  – http://www.searchenginewatch.com/
  – http://www.searchengineshowdown.com/
• Presentations
  – http://infonortics.com/searchengines/sh03/slides/evans.pdf