Windows Live Image Search Hugh Williams Senior Software Design Engineer Windows Live Search Microsoft Corporation

Upload: janis-henderson

Post on 21-Jan-2016

TRANSCRIPT

Page 1: Windows Live Image Search

Hugh Williams
Senior Software Design Engineer
Windows Live Search
Microsoft Corporation

Page 2: Overview

Windows Live Image Search

Problem Definition and Background

User Interface

Architecture

Why is it a beta?

Questions?

Page 3: Introduction

Windows Live Image Search is new:
  Released in beta form on March 8, 2006
  Architected, designed, and engineered in Redmond
  Close relative of MSN/Windows Live web search
  Microsoft’s image search is available only at Windows Live
    The MSN Image Search solution is provided by a third party

Strong partnership between the Windows Live Search product team and:
  Microsoft Research, Cambridge UK
  Microsoft Research, Asia (Beijing, China)
  Microsoft Research, Redmond

Page 4: Problem Definition

Find thumbnail images using a text query
  There are no web-scale image search engines based on CBIR (content-based image retrieval)
  All modern image search engines share fundamentals with AltaVista’s original PhotoFinder (1998)

The thumbnail images represent web pages “containing” the original image

We crawl web pages and images
  More than a billion images
  Pages and images regularly refreshed
  Large numbers of images enter and leave the collection daily
  More later…

Page 5: Queries

From an MSN Search sample drawn from a month:
  Most frequent: 65,000+ occurrences
  Median: 2 occurrences
  Most queries are 1 to 3 words in length
  Most popular queries: lindsay lohan, scarlett johansson, angelina jolie, sex, jessica simpson, kate beckinsale, paris hilton, britney spears, shakira, sexy, jessica alba, jennifer lopez
  Random queries: bridge, rodolfo font, playboy, douwe egberts, jesus, tanning, beauty, oakenfold, priyanka chopra, actors

Around 60 of the top 100 queries are adult or celebrity

Other popular scenarios are places, animals, or objects

Page 6: More On Queries…

In the US, around 10% are spelling errors
  Less in some languages, more in others

Word forms are extremely common
  Tom’s Diner, Toms Diner, Tom Diner

Lots of weirdness:
  Math.abs
  3/4” Ply
  103,5 versus 103.5
  www cnn.com
  Every conceivable spelling of “Britney”
  Navigational queries

Page 7: Thumbnail Results

Page 8: Thumbnail Clickthrough

Page 9: How Users Click Through

[Figure: MSN Result Visits for Web and Image Search. X axis: answer rank (0 to 8000). Y axis: cumulative percentage of sessions (0% to 100%). Series: Web search, Image search.]

Around 75% of Web search result page views are page one. For image search it is 43%, and the 75% threshold in image search is reached around page eight.

Page 10: Searching And Ranking

Our ranking process matches queries to documents

So, what is a document?
  We refer to our documents as nodules
  A nodule is created for each link between an HTML document and an image (where we have retrieved both)
  The alternative is a nodule per image, or a nodule per page

A nodule typically contains:
  The thumbnail of the image
  Text and headers from the HTML page
  Image metadata
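
The nodule definition above can be sketched as a small data structure. This is only an illustration in Python: the class, its field names, and the `build_nodules` helper are assumptions, not the production schema, and the raw image bytes stand in for a real thumbnail.

```python
from dataclasses import dataclass, field


@dataclass
class Nodule:
    """One link between an HTML page and an image (field names are illustrative)."""
    page_url: str
    image_url: str
    thumbnail: bytes                 # stand-in for a scaled-down copy of the image
    page_text: str                   # text and headers taken from the HTML page
    metadata: dict = field(default_factory=dict)   # image properties


def build_nodules(links, pages, images):
    """Create one nodule per (page, image) link where both were retrieved.

    links:  iterable of (page_url, image_url) pairs found by the crawler
    pages:  dict of page_url -> extracted page text
    images: dict of image_url -> image bytes
    """
    nodules = []
    for page_url, image_url in links:
        if page_url in pages and image_url in images:
            nodules.append(
                Nodule(page_url, image_url, images[image_url], pages[page_url])
            )
    return nodules
```

With one nodule per link, the same image embedded in two pages yields two nodules, each carrying the text of its own page.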

Page 11: Background: Ranking

So, how do we rank?

We rank using:
  Static Rank: query-independent value
    Image and page properties, web link analysis, junk page probability, and so on
  Dynamic Rank: query-dependent value
    TF-IDF, BM25, and so on
  The overall rank is a combination of Static and Dynamic Rank

Broad answer: we compute the similarity between selected nodules and a query, and order the results by decreasing similarity

The selected nodules are those that contain all query terms (Boolean AND to find a filter set, then similarity-based ordering of the filter set)
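
A minimal sketch of that two-step process, in Python. The talk does not say how Static and Dynamic Rank are combined, so the linear `weight` below is an assumption, as are the function names and the BM25 constants (common textbook defaults).

```python
import math
from collections import Counter

K1, B = 1.2, 0.75   # common BM25 defaults, not values from the talk


def bm25(query_terms, doc_terms, doc_freq, n_docs, avg_len):
    """Dynamic (query-dependent) score of one document."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + K1 * (1 - B + B * len(doc_terms) / avg_len)
        score += idf * tf[term] * (K1 + 1) / norm
    return score


def rank_nodules(query, docs, static_rank, weight=0.5):
    """docs: id -> list of terms; static_rank: id -> query-independent value."""
    q = query.lower().split()
    n_docs = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / n_docs
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    # Boolean AND: the filter set holds only documents containing every query term.
    filter_set = [d for d, terms in docs.items() if all(t in terms for t in q)]
    # Assumed combination: a weighted sum of static and dynamic rank.
    scored = {
        d: weight * static_rank[d]
        + (1 - weight) * bm25(q, docs[d], doc_freq, n_docs, avg_len)
        for d in filter_set
    }
    return sorted(scored, key=scored.get, reverse=True)
```

Only members of the filter set are scored, so a document missing any query term never appears, however high its static rank.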

Page 12: Algorithmic Search

Traditional Information Retrieval focuses on:
  Intelligence
  Recall
  Long queries
  Well-formed documents
  Small (low millions) index

Image search focuses on:
  Precision
  Short queries
  Poor documents
  Billions of nodules in the index

Page 13: Nodule Text

Nodules represent the link between an HTML page and an image

Nodule text includes elements such as:
  The HTML page <title>
  Text from the HTML page
    Text from near the image is a good start…
  ALT or anchor text from the image
    Images can be embedded in a page using the <img> tag or linked to using the <a> tag
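
As an illustration of where that text comes from, the sketch below uses Python's standard `html.parser` to collect the page `<title>`, the ALT text of embedded `<img>` images, and the anchor text of `<a>` links that point at image files. The class name and the extension list are assumptions, and text "near" the image is left out.

```python
from html.parser import HTMLParser

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")   # illustrative list


class ImageTextExtractor(HTMLParser):
    """Collects the page title plus ALT or anchor text for each image URL."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = {}              # image URL -> ALT or anchor text
        self._in_title = False
        self._anchor_target = None    # image URL of the <a> being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and attrs.get("src"):
            # Embedded image: the ALT attribute describes it.
            self.images[attrs["src"]] = attrs.get("alt") or ""
        elif tag == "a":
            href = attrs.get("href") or ""
            if href.lower().endswith(IMAGE_EXTENSIONS):
                # Linked-to image: the anchor text describes it.
                self._anchor_target = href
                self.images.setdefault(href, "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._anchor_target = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._anchor_target is not None:
            self.images[self._anchor_target] += data
```

Calling `feed()` with a page's HTML fills `title` and `images`; each entry in `images` would contribute text to the nodule for that page and image.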

Page 15: Image Metadata

Ranking uses text and image properties (the latter are exclusively for image search)

These include:
  AspectRatio (the ratio of the X dimension to the Y dimension)
  Pixels (the product of X and Y dimensions)
  PhotoGraphic (whether an image is a photograph or a graphic)

Page 16: Aspect Ratio Extremes

Page 17: Throwing Out Junk

The Web is full of balls, lines, and Amazon logos

Right now, we ignore very small images
  Some we don’t fetch (HTML width and height attributes help us), many we drop after fetching

Junk properties help us in ranking:
  We lower the rank of images with extreme aspect ratios
  We lower the rank of images with few pixels
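
A rough sketch of how the AspectRatio and Pixels properties could feed these junk rules. Every threshold and the penalty formula are invented for illustration; the talk gives no numbers. Dimensions are assumed to be positive.

```python
MIN_PIXELS = 32 * 32       # hypothetical cut-off for "very small"
SMALL_PIXELS = 100 * 100   # hypothetical point below which rank is lowered
MAX_ASPECT = 4.0           # hypothetical bound on the long:short side ratio


def image_properties(width, height):
    """AspectRatio is X divided by Y; Pixels is X times Y."""
    return {"AspectRatio": width / height, "Pixels": width * height}


def should_fetch(img_attrs):
    """Skip the fetch when HTML width/height attributes already show a tiny image."""
    try:
        w, h = int(img_attrs["width"]), int(img_attrs["height"])
    except (KeyError, ValueError):
        return True            # dimensions unknown: fetch, decide after fetching
    return w * h >= MIN_PIXELS


def junk_penalty(width, height):
    """Multiplier in (0, 1]; smaller values lower the rank more."""
    props = image_properties(width, height)
    ratio = max(props["AspectRatio"], 1 / props["AspectRatio"])
    penalty = 1.0
    if ratio > MAX_ASPECT:                 # lines, rules, banners
        penalty *= MAX_ASPECT / ratio
    if props["Pixels"] < SMALL_PIXELS:     # bullets, icons, spacers
        penalty *= props["Pixels"] / SMALL_PIXELS
    return penalty
```

A 400 by 300 photograph keeps a multiplier of 1.0, while an 800 by 10 rule is penalized on both counts.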

Page 18: Duplicates And Near Duplicates

Duplication is problematic, particularly for logos, products, and posters

We compute a hash of all images
  All except the highest-ranked exact duplicate are removed from the filter set at query time

We are working on techniques for removing near duplicates
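
Exact-duplicate removal at query time can be sketched as follows. The talk says only that a hash is computed; SHA-256 and the function names are assumptions made for the example.

```python
import hashlib


def image_hash(image_bytes):
    """Digest of the image content (hash function chosen for illustration)."""
    return hashlib.sha256(image_bytes).hexdigest()


def drop_exact_duplicates(ranked):
    """ranked: list of (nodule_id, image_hash) pairs, best-ranked first.

    Keeps the first, and therefore highest-ranked, nodule for each hash.
    """
    seen = set()
    kept = []
    for nodule_id, digest in ranked:
        if digest not in seen:
            seen.add(digest)
            kept.append(nodule_id)
    return kept
```

Byte-identical copies share a digest, so this catches only exact duplicates; a resized or recompressed copy hashes differently, which is the near-duplicate problem mentioned above.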

Page 19: User Interface

The Windows Live image search user interface has five new features:

1. “Infinite scroll” or “smart scroll”

2. Thumbnail size slider

3. Film strip results view

4. Show full image

5. Metadata grow experience

Page 20: Windows Live Image Search

Page 21: Infinite Or Smart Scroll

Results are presented in a single page
  Removes others’ paging model
  Smooths the click curve
  Improves browsability

Motivated by click data
  As discussed previously, only 43% of users stay on page one
  Many sessions show very deep click behaviors

Same motivation for the thumbnail size slider

Page 22: Other Features…

Motivated and reinforced by usability studies

Film Strip Results View:
  Improve results navigation
  Remove unnecessary click actions
  Make it easy to find a page or image

Show full image feature:
  Helps locate original image
  Particularly useful for <a> links

Metadata grow:
  Most users don’t use metadata
  Reduce clutter, improve browse experience

Page 23: Architecture And Design

Crawl and index over a billion nodules every two weeks
  Crawl 750 nodules per second

Answer queries in less than 250 ms, with most answered in less than 50 ms

Serve several million queries per day
  Peak load of 150+ queries per second

Serve 10,000+ thumbnails per second at peak

Manage several petabytes of raw storage
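
A quick back-of-envelope check of how these rounded figures relate, written as Python arithmetic. It assumes the quoted rates are sustained around the clock, which the slide does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TWO_WEEKS = 14 * SECONDS_PER_DAY                  # 1,209,600 seconds

# 750 nodules per second, sustained for two weeks.
crawled_in_two_weeks = 750 * TWO_WEEKS            # 907,200,000 nodules

# Rate needed to cover exactly one billion nodules in two weeks.
rate_for_a_billion = 1_000_000_000 / TWO_WEEKS    # roughly 827 per second

# Daily volume if the 150 queries-per-second peak were held all day.
queries_at_peak_rate = 150 * SECONDS_PER_DAY      # 12,960,000 queries
```

The crawl rate and the two-week refresh agree to within rounding, and a day of constant peak traffic would be about 13 million queries, so "several million per day" implies an average load well below the peak.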

Page 24: Architecture: Serving Queries

[Diagram components: Customer Query, Front End Experience (FEX), Federator, Image Search, Spelling Correction, Mid Level Aggregator (two shown), Index Serving Node]

Page 25: Architecture: Index Building

[Diagram components: Web Servers, Crawler, Static Ranker, Index Builder, Index Serving Node]

Page 26: Indexing: Selection And Crawl

Only way into Search is via our Crawler
  We used to have “paid inclusion” but abandoned it
  Google doesn’t have it, Yahoo! does

Crawl is partly prioritized by Static Rank
  We crawl the top few billion pages

Biggest issue with crawling: politeness
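
One way to picture a crawl that is prioritized by Static Rank yet stays polite is a priority queue with a minimum gap between fetches to the same host. The `Frontier` class, the fixed per-host delay, and the caller-supplied clock are assumptions for this sketch, not a description of the actual crawler.

```python
import heapq
from urllib.parse import urlparse


class Frontier:
    """Crawl queue ordered by static rank, with a per-host politeness delay."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds    # hypothetical minimum gap per host
        self._heap = []               # entries are (-static_rank, url)
        self._next_ok = {}            # host -> earliest time of the next fetch

    def add(self, url, static_rank):
        heapq.heappush(self._heap, (-static_rank, url))

    def pop(self, now):
        """Return the best URL whose host may be fetched at `now`, else None."""
        deferred = []
        chosen = None
        while self._heap:
            entry = heapq.heappop(self._heap)
            host = urlparse(entry[1]).netloc
            if self._next_ok.get(host, 0.0) <= now:
                self._next_ok[host] = now + self.delay
                chosen = entry[1]
                break
            deferred.append(entry)    # host was contacted too recently
        for entry in deferred:
            heapq.heappush(self._heap, entry)
        return chosen
```

A high-ranked URL is skipped, not dropped, while its host is cooling down, so the crawler moves on to the best URL on another host in the meantime.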

Page 27: Distributed Searching I: Single Box

[Diagram: Web Server Frontends and Big Iron (DEC TurboLaser)]

Monolithic Model (AltaVista, WebCrawler): the index goes on a single (big) box.

Advantages:
  Easy to scale query volume: just buy more web server frontends and Big Boxes
  Full visibility on results while ranking

Disadvantages:
  Hard to scale index size: limited by CPU and memory
  Reliability

Page 28: Distributed Searching II: Word-Striping

[Diagram: Web Server Frontends receive the query “Quick brown fox”; separate index servers hold the terms “Quick”, “brown”, and “fox”]

Stripe the index by term across index servers

Have a central box send the query terms to appropriate servers

Merge the results

Advantages:
  Only boxes that have answers get used per query
  Have full visibility of results while ranking

Disadvantages:
  Some boxes are likely to be more loaded than others
  It turns out this creates significant network traffic
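
Word-striping can be sketched in a few lines of Python. The server class, the CRC-based term assignment, and the function names are illustrative assumptions; a real system would move the posting lists over the network instead of reading them in-process.

```python
from collections import defaultdict
from zlib import crc32


class TermServer:
    """Holds the posting lists for the terms assigned to one index server."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of documents containing it


def server_for(term, n_servers):
    """Deterministically assign a term to a server (assumed scheme)."""
    return crc32(term.encode("utf-8")) % n_servers


def build_term_striped(docs, n_servers):
    servers = [TermServer() for _ in range(n_servers)]
    for doc_id, text in docs.items():
        for term in text.lower().split():
            servers[server_for(term, n_servers)].postings[term].add(doc_id)
    return servers


def search_term_striped(query, servers):
    terms = query.lower().split()
    # The central box contacts only the servers that own a query term...
    lists = [
        servers[server_for(t, len(servers))].postings.get(t, set())
        for t in terms
    ]
    # ...but whole posting lists must reach it before they can be intersected.
    return set.intersection(*lists) if lists else set()
```

The intersection step shows both sides of the trade-off: the central box sees every candidate, but a common term means a long posting list crossing the network on every query that uses it.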

Page 29: Distributed Searching III: Document Striping

[Diagram: Web Server Frontends send the full query “Quick brown fox” to every index server]

Stripe documents randomly across boxes

Send query to all boxes

Merge the results from all boxes

Advantages:
  Scales with both index size and query traffic volume
  Minimal network traffic, aggregation is easy

Disadvantage:
  No visibility on all results while ranking
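
The same idea for document striping, again as a hedged Python sketch. The class and function names are made up, and each document carries a precomputed score as a stand-in for its rank.

```python
import heapq
import random


class IndexServingNode:
    """Holds a random slice of the documents and ranks only what it holds."""

    def __init__(self):
        self.docs = {}    # doc id -> (set of terms, score)

    def add(self, doc_id, text, score):
        self.docs[doc_id] = (set(text.lower().split()), score)

    def top_k(self, terms, k):
        hits = [
            (score, doc_id)
            for doc_id, (words, score) in self.docs.items()
            if terms <= words          # Boolean AND over the query terms
        ]
        return heapq.nlargest(k, hits)


def build_shards(docs, n_nodes, seed=0):
    """Stripe documents randomly across nodes."""
    rng = random.Random(seed)
    nodes = [IndexServingNode() for _ in range(n_nodes)]
    for doc_id, (text, score) in docs.items():
        rng.choice(nodes).add(doc_id, text, score)
    return nodes


def aggregate(query, nodes, k):
    """Send the query to every node, then merge the per-node top-k lists."""
    terms = set(query.lower().split())
    partials = [node.top_k(terms, k) for node in nodes]
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]
```

Each node returns at most `k` small (score, id) pairs, so traffic stays low and merging is cheap. The cost is that a node scores its documents without seeing what the other nodes hold.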

Page 30: Why Is It A Beta?

We are working on multiple features
  Continuous improvement of ranking and relevance
  Internationalization and accessibility
  Scaling and reliability
  Adult filtering

New, thought-leading features
  Many of these involve colleagues in Microsoft Research

Page 31:

© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
