Windows Live Image Search
Hugh Williams
Senior Software Design Engineer
Windows Live Search
Microsoft Corporation
Overview
Windows Live Image Search
Problem Definition and Background
User Interface
Architecture
Why is it a beta?
Questions?
Introduction
Windows Live Image Search is new:
Released in Beta form on March 8, 2006
Architected, designed, and engineered in Redmond
Close relative of MSN/Windows Live web search
Microsoft’s Image search is available only at Windows Live
The MSN Image Search solution is provided by a third party
Strong partnership between the Windows Live Search product team and:
Microsoft Research, Cambridge UK
Microsoft Research, Asia (Beijing, China)
Microsoft Research, Redmond
Problem Definition
Find thumbnail images using a text query
There are no CBIR-based web-scale image search engines
All modern image search engines share fundamentals with AltaVista’s original PhotoFinder (1998)
The thumbnail images represent web pages “containing” the original image
We crawl web pages and images
More than a billion images
Pages and images regularly refreshed
Large numbers of images enter and leave the collection daily
More later…
Queries
From an MSN Search sample drawn from a month:
Most frequent: 65,000+ occurrences
Median: 2 occurrences
Most queries are 1 to 3 words in length
Most popular queries: lindsay lohan, scarlett johansson, angelina jolie, sex, jessica simpson, kate beckinsale, paris hilton, britney spears, shakira, sexy, jessica alba, jennifer lopez
Random queries: bridge, rodolfo font, playboy, douwe egberts, jesus, tanning, beauty, oakenfold, priyanka chopra, actors
Around 60 of the top 100 queries are adult or celebrity
Other popular scenarios are places, animals, or objects
More On Queries…
In the US, around 10% are spelling errors
Less in some languages, more in others
Word forms are extremely common
Tom’s Diner, Toms Diner, Tom Diner
Lots of weirdness:
Math.abs
3/4” Ply
103,5 versus 103.5
www cnn.com
Every conceivable spelling of “Britney”
Navigational queries
Thumbnail Results
Thumbnail Clickthrough
How Users Click Through
[Chart: MSN Result Visits for Web and Image Search. X axis: answer rank (0 to 8000). Y axis: cumulative percentage of sessions (0% to 100%). Series: Web search, Image search.]
Around 75% of Web search result page views are of page one. For image search the figure is 43%, and the 75% threshold is reached around page eight.
Searching And Ranking
Our ranking process matches queries to documents
So, what is a document?
We refer to our documents as nodules
A nodule is created for each link between an HTML document and an image (where we have retrieved both)
The alternative is a nodule per image, or a nodule per page
A nodule typically contains:
The thumbnail of the image
Text and headers from the HTML page
Image metadata
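The nodule definition above can be sketched as a small data structure. This is a minimal illustration, not the product's actual schema: the `Nodule` field names, the `build_nodules` helper, and the shapes of the `pages` and `images` inputs are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Nodule:
    """One (HTML page, image) link: the unit that gets indexed and ranked."""
    page_url: str
    image_url: str
    thumbnail: bytes                               # scaled-down copy of the image
    page_text: str                                 # text and headers from the HTML page
    metadata: dict = field(default_factory=dict)   # image metadata, e.g. width, height


def build_nodules(pages, images):
    """pages:  {page_url: (page_text, [image_url, ...])}
    images: {image_url: (thumbnail_bytes, metadata_dict)}
    A nodule is created only where both the page and the image were retrieved."""
    nodules = []
    for page_url, (page_text, linked_images) in pages.items():
        for image_url in linked_images:
            if image_url not in images:            # image was never fetched: no nodule
                continue
            thumbnail, metadata = images[image_url]
            nodules.append(Nodule(page_url, image_url, thumbnail, page_text, metadata))
    return nodules


# Hypothetical data: two pages link the same image, one link points at an unfetched image.
pages = {
    "http://a.example/": ("Golden Gate Bridge photos", ["http://img.example/gg.jpg"]),
    "http://b.example/": ("Bridges of the world", ["http://img.example/gg.jpg",
                                                   "http://img.example/missing.jpg"]),
}
images = {"http://img.example/gg.jpg": (b"\x00thumb", {"width": 800, "height": 600})}
nodules = build_nodules(pages, images)
```

Under these assumptions the same image yields two nodules, one per linking page, each carrying that page's text. That is the difference from a nodule-per-image or nodule-per-page design.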
Background: Ranking
So, how do we rank?
We rank using:
Static Rank: Query Independent value
Image and page properties, web link analysis, junk page probability, and so on
Dynamic Rank: Query Dependent value
TF-IDF, BM25, and so on
The overall rank is a combination of Static and Dynamic Rank
Broad answer: we compute the similarity between selected nodules and a query, and order the results by decreasing similarity
The selected nodules are those that contain all query terms (Boolean AND to find a filter set, then similarity-based ordering of the filter set)
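The two-stage process above (Boolean AND to get a filter set, then similarity ordering) can be sketched as follows. The slides do not say how Static and Dynamic Rank are combined, so the linear blend and its `weight` parameter are assumptions. The `bm25` function is a textbook form of BM25, and `docs` and `static_rank` are hypothetical inputs.

```python
import math
from collections import Counter


def bm25(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 score of one document for the query terms."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def search(query, docs, static_rank, weight=0.5):
    """docs: {doc_id: [term, ...]}; static_rank: {doc_id: value in [0, 1]}.
    Returns doc ids ordered by decreasing combined rank."""
    terms = query.lower().split()
    # Boolean AND: the filter set holds only documents containing every query term.
    filter_set = [d for d, words in docs.items() if all(t in words for t in terms)]
    n_docs = len(docs)
    avg_len = sum(len(words) for words in docs.values()) / n_docs
    doc_freq = {t: sum(1 for words in docs.values() if t in words) for t in terms}
    scored = []
    for d in filter_set:
        dynamic = bm25(terms, docs[d], doc_freq, n_docs, avg_len)
        # Assumed combination: a simple weighted sum of static and dynamic rank.
        scored.append((weight * static_rank[d] + (1 - weight) * dynamic, d))
    return [d for _, d in sorted(scored, reverse=True)]


docs = {"a": ["red", "bridge", "photo"], "b": ["bridge"], "c": ["red", "car"]}
static_rank = {"a": 0.1, "b": 0.9, "c": 0.5}
results = search("red bridge", docs, static_rank)
```

Only document `a` contains both `red` and `bridge`, so it is the only member of the filter set for that query. Documents that miss any query term are never scored.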
Algorithmic Search
Traditional Information Retrieval focuses on:
Intelligence
Recall
Long queries
Well-formed documents
Small (low millions) index
Image search focuses on:
Precision
Short queries
Poor documents
Billions of nodules in the index
Nodule Text
Nodules represent the link between an HTML page and an image
Nodule text includes elements such as:
The HTML page <title>
Text from the HTML page
Text from near the image is a good start…
ALT or anchor text from the image
Images can be embedded in a page using the <img> tag or linked-to using the <a> tag
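As a rough sketch of how some of these elements could be pulled from a page, the example below uses Python's standard `html.parser` to collect the `<title>`, the ALT text of `<img>` tags, and the anchor text of `<a>` tags that link to images. The class name, the `IMAGE_EXTENSIONS` list, and the rule for deciding that an `<a>` points at an image are assumptions for illustration. Extracting text from near the image is not attempted here.

```python
from html.parser import HTMLParser

# Assumed heuristic: an <a> whose href ends in one of these is an image link.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


class NoduleTextParser(HTMLParser):
    """Collects the page title plus ALT or anchor text for each image on the page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = {}            # image URL -> ALT text or anchor text
        self._in_title = False
        self._anchor_href = None    # set while inside an <a> that links to an image

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and attrs.get("src"):
            self.images[attrs["src"]] = attrs.get("alt") or ""
        elif tag == "a":
            href = attrs.get("href") or ""
            if href.lower().endswith(IMAGE_EXTENSIONS):
                self._anchor_href = href
                self.images.setdefault(href, "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._anchor_href = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._anchor_href:
            self.images[self._anchor_href] += data


page = ('<html><head><title>Golden Gate</title></head><body>'
        '<img src="bridge.jpg" alt="Bridge at dusk">'
        '<a href="full.png">Full-size photo</a></body></html>')
parser = NoduleTextParser()
parser.feed(page)
parser.close()
```

After parsing, `parser.title` holds the page title and `parser.images` maps each embedded or linked image URL to its ALT or anchor text, which could then be attached to the corresponding nodule.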
Table Parsing
Image Metadata
Ranking uses text and image properties (the latter are exclusively for image search)
These include:
AspectRatio (the ratio of the X dimension to the Y dimension)
Pixels (the product of X and Y dimensions)
PhotoGraphic (whether an image is a photograph or a graphic)
…
Aspect Ratio Extremes
Throwing Out Junk
The Web is full of balls, lines, and Amazon logos
Right now, we ignore very small images
Some we don’t fetch (HTML width and height attributes help us), many we drop after fetching
Junk properties help us in ranking:
We lower the rank of images with extreme aspect ratios
We lower the rank of images with few pixels
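A minimal sketch of both ideas: skipping tiny images based on HTML attributes, and lowering rank for extreme aspect ratios or few pixels. The slides give no thresholds or formulas, so `MIN_SIDE`, `min_pixels`, `max_log_ratio`, and the multiplicative penalty are all invented for the example.

```python
import math

MIN_SIDE = 50   # assumed cutoff (in pixels) for a "very small" image


def should_fetch(width_attr, height_attr):
    """Decide from the HTML width/height attributes whether an image is worth fetching.
    Attributes may be missing (None) or non-numeric (e.g. "100%"); those are fetched."""
    for value in (width_attr, height_attr):
        if value is not None and value.isdigit() and int(value) < MIN_SIDE:
            return False
    return True


def junk_penalty(width, height, min_pixels=10_000, max_log_ratio=math.log(4)):
    """Return a rank multiplier in (0, 1]: 1.0 for ordinary images, smaller for
    images with an extreme AspectRatio or a small Pixels value."""
    aspect_ratio = width / height    # AspectRatio: X dimension over Y dimension
    pixels = width * height          # Pixels: product of X and Y dimensions
    penalty = 1.0
    skew = abs(math.log(aspect_ratio))   # symmetric: 4:1 and 1:4 are equally extreme
    if skew > max_log_ratio:
        penalty *= max_log_ratio / skew
    if pixels < min_pixels:
        penalty *= pixels / min_pixels
    return penalty


ordinary = junk_penalty(800, 600)    # typical photo, no penalty
line = junk_penalty(1000, 10)        # horizontal rule: extreme aspect ratio
bullet = junk_penalty(40, 40)        # tiny ball or icon: few pixels
```

Taking the logarithm of the aspect ratio treats very wide and very tall images the same way. A real system would presumably tune or learn such cutoffs from data.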
Duplicates And Near Duplicates
Duplication is problematic, particularly for logos, products, and posters
We compute a hash of all images
All except the highest-ranked exact duplicate are removed from the filter set at query time
We are working on techniques for removing near duplicates
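The exact-duplicate step can be sketched as below. The slides do not name the hash function, so SHA-1 over the raw image bytes is an assumption, as are the function names and the `(nodule_id, image_hash)` input shape. A byte-level hash like this catches only exact duplicates, not near duplicates.

```python
import hashlib


def image_hash(image_bytes):
    """Content hash computed once per image at index time (SHA-1 is an assumed choice)."""
    return hashlib.sha1(image_bytes).hexdigest()


def drop_exact_duplicates(ranked):
    """ranked: list of (nodule_id, image_hash) already in decreasing rank order.
    Keeps the first (highest-ranked) nodule for each hash and drops the rest."""
    seen = set()
    kept = []
    for nodule_id, digest in ranked:
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(nodule_id)
    return kept


logo = image_hash(b"same logo bytes")
photo = image_hash(b"some photo bytes")
ranked = [("n1", logo), ("n2", photo), ("n3", logo)]
unique = drop_exact_duplicates(ranked)
```

Because the hashes are precomputed, the query-time cost is a single pass over the ranked filter set with one set lookup per nodule.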
User Interface
The Windows Live image search user interface has five new features:
1. “Infinite scroll” or “smart scroll”
2. Thumbnail size slider
3. Film strip results view
4. Show full image
5. Metadata grow experience
Windows Live Image Search
Infinite Or Smart Scroll
Results are presented in a single page
Removes others’ paging model
Smooths the click curve
Improves browsability
Motivated by click data
As discussed previously, only 43% of users stay on page one
Many sessions show very deep click behaviors
Same motivation for the thumbnail size slider
Other Features…
Motivated and reinforced by usability studies
Film Strip Results View:
Improve results navigation
Remove unnecessary click actions
Make it easy to find a page or image
Show full image feature:
Helps locate original image
Particularly useful for <a> links
Metadata grow:
Most users don’t use metadata
Reduce clutter, improve browse experience
Architecture And Design
Crawl and index over a billion nodules every two weeks
Crawl 750 nodules per second
Answer queries in less than 250ms, with most answered in less than 50ms
Serve several million queries per day
Peak load of 150+ queries per second
Serve 10,000+ thumbnails per second at peak
Manage several petabytes of raw storage
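A quick back-of-the-envelope check of how these figures relate to each other. The only inputs are the numbers on the slide. Treating the crawl rate as sustained around the clock, and treating the query and thumbnail peaks as coinciding, are simplifying assumptions.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400

# Crawl: 750 nodules per second, sustained for a two-week cycle.
crawl_rate = 750
nodules_per_cycle = crawl_rate * 14 * SECONDS_PER_DAY      # 907,200,000

# Queries: if the 150 queries-per-second peak held all day.
peak_qps = 150
queries_per_day_at_peak = peak_qps * SECONDS_PER_DAY       # 12,960,000

# Thumbnails: 10,000 per second against 150 queries per second at peak.
thumbnails_per_query = 10_000 / peak_qps                   # roughly 67
```

So 750 nodules per second works out to roughly 0.9 billion per two weeks, the same order as the "over a billion" figure. Holding the peak rate all day would give about 13 million queries, which fits "several million" per day if traffic is below peak most of the time. The thumbnail and query peaks imply on the order of 67 thumbnails served per query.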
Architecture: Serving Queries
[Diagram components: Customer Query, Front End Experience (FEX), Federator, Image Search, Spelling Correction, Mid Level Aggregator (two shown), Index Serving Node]
Architecture: Index Building
[Diagram components: Web Servers, Crawler, Static Ranker, Index Builder, Index Serving Node]
Indexing: Selection And Crawl
Only way into Search is via our Crawler
We used to have “paid inclusion” but abandoned it
Google doesn’t have it, Yahoo! does
Crawl is partly prioritized by Static Rank
We crawl the top few billion pages
Biggest issue with crawling: politeness
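One common way to reconcile rank-based prioritization with politeness is a priority queue combined with a per-host minimum delay. The sketch below shows that general technique; it is not a description of the actual crawler. The `Frontier` class, the fixed `delay`, and the explicit `now` argument are assumptions for illustration.

```python
import heapq
from urllib.parse import urlparse


class Frontier:
    """Crawl frontier: highest Static Rank first, but never hit a host too often."""

    def __init__(self, delay=1.0):
        self.delay = delay        # assumed minimum gap (seconds) between hits on one host
        self._heap = []           # entries are (-static_rank, url): a max-heap via negation
        self._next_ok = {}        # host -> earliest time it may be contacted again

    def add(self, url, static_rank):
        heapq.heappush(self._heap, (-static_rank, url))

    def pop(self, now):
        """Return the best-ranked URL whose host may be contacted at time `now`,
        or None if every queued URL belongs to a host that must still wait."""
        deferred = []
        chosen = None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlparse(item[1]).netloc
            if self._next_ok.get(host, 0.0) <= now:
                self._next_ok[host] = now + self.delay
                chosen = item[1]
                break
            deferred.append(item)          # host is cooling down: try the next URL
        for item in deferred:              # put skipped URLs back for later
            heapq.heappush(self._heap, item)
        return chosen


frontier = Frontier(delay=10.0)
frontier.add("http://a.example/1", 0.9)
frontier.add("http://a.example/2", 0.8)
frontier.add("http://b.example/1", 0.5)
first = frontier.pop(now=0.0)    # best rank overall
second = frontier.pop(now=1.0)   # a.example is cooling down, so b.example goes next
third = frontier.pop(now=2.0)    # only a.example/2 remains and its host is not ready
fourth = frontier.pop(now=10.0)  # a.example may be contacted again
```

The second pop skips a higher-ranked URL because its host was just contacted, which is why the prioritization can only ever be partial.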
Distributed Searching I: Single Box
[Diagram: Web Server Frontends, Big Iron (DEC TurboLaser)]
Monolithic Model (AltaVista, WebCrawler): the index goes on a single (big) box.
Advantages:
Easy to scale query volume: just buy more web server frontends and Big Boxes
Full visibility on results while ranking
Disadvantages:
Hard to scale index size: limited by CPU and Memory
Reliability
Distributed Searching II: Word-Striping
[Diagram: the query “Quick brown fox” arrives at the Web Server Frontends; index servers are labeled “Quick”, “brown”, and “fox”]
Stripe the index by term across index servers
Have a central box send the query terms to appropriate servers
Merge the results
Advantages:
Only boxes that have answers get used per query
Have full visibility of results while ranking
Disadvantages:
Some boxes are likely to be more loaded than others
It turns out this creates significant network traffic
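Word-striping can be sketched in a few lines. The choice of CRC32 to assign a term to a server, the in-memory dictionaries standing in for index servers, and the function names are assumptions for illustration only.

```python
import zlib


def server_for(term, n_servers):
    """Assign each term to exactly one index server (CRC32 is an assumed choice)."""
    return zlib.crc32(term.encode("utf-8")) % n_servers


def build_term_striped(docs, n_servers):
    """docs: {doc_id: text}. Each term's whole posting list lives on one server."""
    servers = [dict() for _ in range(n_servers)]
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            servers[server_for(term, n_servers)].setdefault(term, set()).add(doc_id)
    return servers


def search_term_striped(query, servers):
    """The central box asks only the servers that own a query term, then merges.
    Returns (matching doc ids, ids of the servers that were contacted)."""
    postings = []
    contacted = set()
    for term in query.lower().split():
        sid = server_for(term, len(servers))
        contacted.add(sid)
        # The full posting list for the term crosses the network to the central box.
        postings.append(servers[sid].get(term, set()))
    matches = set.intersection(*postings) if postings else set()
    return matches, contacted


docs = {1: "quick brown fox", 2: "quick red fox", 3: "lazy brown dog"}
servers = build_term_striped(docs, n_servers=3)
matches, contacted = search_term_striped("Quick brown fox", servers)
```

The sketch makes both disadvantages visible. Whole posting lists are shipped to the central box before they can be intersected, and a server that happens to own a very common term is contacted by far more queries than the others.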
Distributed Searching III: Document Striping
[Diagram: the query “Quick brown fox” arrives at the Web Server Frontends and the full query is sent to every index server]
Stripe documents randomly across boxes
Send query to all boxes
Merge the results from all boxes
Advantages:
Scales with both index size and query traffic volume
Minimal network traffic, aggregation is easy
Disadvantage:
No visibility on all results while ranking
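A minimal sketch of document striping, under assumptions: documents are assigned to shards with a seeded random generator, each shard uses a deliberately simple term-frequency score as a stand-in for real ranking, and the aggregator merges local top-k lists. All names here are hypothetical.

```python
import heapq
import random


def build_doc_striped(docs, n_servers, seed=0):
    """docs: {doc_id: text}. Each document is placed on one randomly chosen shard."""
    rng = random.Random(seed)
    shards = [dict() for _ in range(n_servers)]
    for doc_id, text in docs.items():
        shards[rng.randrange(n_servers)][doc_id] = text.lower().split()
    return shards


def search_shard(shard, terms, k):
    """Local top-k. The score uses only this shard's own documents; the shard
    cannot see what other shards hold while it ranks."""
    hits = []
    for doc_id, words in shard.items():
        if words and all(t in words for t in terms):
            score = sum(words.count(t) for t in terms) / len(words)   # toy score
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search_doc_striped(query, shards, k=10):
    """Send the query to every shard, then merge the small per-shard result lists."""
    terms = query.lower().split()
    partial = [search_shard(shard, terms, k) for shard in shards]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]


docs = {
    1: "quick brown fox",
    2: "the quick brown fox jumps",
    3: "lazy dog",
    4: "quick brown fox quick brown fox",
}
shards = build_doc_striped(docs, n_servers=3)
results = search_doc_striped("Quick brown fox", shards)
```

Each shard returns at most `k` scored hits, so little data crosses the network and aggregation is a simple merge. The cost is that any statistic that depends on the whole collection has to be approximated or distributed separately, because no single box sees all results while ranking.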
Why Is It A Beta?
We are working on multiple features
Continuous improvement of ranking and relevance
Internationalization and accessibility
Scaling and reliability
Adult filtering
New, thought-leading features
Many of these involve colleagues in Microsoft Research
© 2006 Microsoft Corporation. All rights reserved.
Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation.
Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.