Post on 17-Jan-2016
Computer Science 1000
Information Searching II
Permission to redistribute these slides is strictly prohibited without permission
Search Engine
a collection of computer programs designed to help us find information on the Web
    typically served through a website
different search providers exist, but basic functionality is consistent
    type keywords into a text box
    page returns links to other pages
Search Engine
why is a search engine like an index?
    recall that an index maps keywords to a location in some medium (like a page number in a book)
    a search engine does a very similar thing
        takes keywords of interest from a user
        maps these keywords to relevant web pages
    in fact, one of the key components of a search engine is its index
Search Engine
what differentiates a search engine from other indexes (like a book index)?
    the ability to quickly combine keywords in searches
        e.g. search for information on ducks and foxes
    result ranking
    personalization
    among others …
Search Engine – How it Works
different search engines employ different technologies
    the full details of commercial search engines are typically not public
however, some of the basics are consistent
    crawling
    indexing
    query processing
Crawling
for a search engine to be able to link to a web page, it must know about its existence
search engines find pages by crawling the web
    programs called crawlers or spiders
    e.g. Googlebot
a crawler visits web pages, in much the same way that you do
as each page is visited, information is remembered about the page (indexing)
Crawling – Todo List
the todo list is a list of pages that are to be visited by the crawler
the crawling process starts with an initial todo list, populated with sites from previous crawls
however, the list is updated as the crawl takes place
    hyperlinks on visited sites are added to the list
Todo List:
    http://www.uleth.ca
    http://www.tsn.ca
    http://www.usask.ca
    ...
Crawling – Example
suppose that this page was being processed by a crawler
Kev's Page
Favorite Stuff:
• New York Islanders
• Saskatchewan Roughriders
• John Deere
as a consequence of this page being crawled, its links would be added to the todo list (if they aren't already there)
those pages would subsequently be checked by the crawler at some point
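The todo-list process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any commercial crawler works: the web is simulated by a dictionary (WEB) that maps each URL to the hyperlinks found on that page, and every URL other than the three from the todo list slide is made up.

```python
from collections import deque

# Hypothetical in-memory stand-in for the web: URL -> hyperlinks on that page.
WEB = {
    "http://www.uleth.ca": ["http://www.usask.ca", "http://kev.example/"],
    "http://www.tsn.ca": [],
    "http://www.usask.ca": ["http://www.uleth.ca"],
    "http://kev.example/": ["http://islanders.example/", "http://riders.example/"],
    "http://islanders.example/": [],
    "http://riders.example/": [],
    "http://unlinked.example/": [],  # no page links here
}

def crawl(initial_todo):
    """Visit every page on the todo list, adding newly found links as we go."""
    todo = deque(initial_todo)
    seen = set(initial_todo)      # pages already on the list (visited or waiting)
    visited = []
    while todo:
        url = todo.popleft()
        visited.append(url)       # a real crawler would index the page here
        for link in WEB.get(url, []):
            if link not in seen:  # only add links that aren't already there
                seen.add(link)
                todo.append(link)
    return visited

visited = crawl(["http://www.uleth.ca", "http://www.tsn.ca", "http://www.usask.ca"])
```

The pages linked from the made-up "Kev's Page" URL end up in visited even though they were not on the initial list, while the page that nothing links to is never reached.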
The "Invisible Web"
not all information is crawled, which means it is not visible to search engines
some pages are new, and haven't yet had a chance to be crawled
however, there are other reasons that certain information does not get crawled
The "Invisible Web"
1) No hyperlinks to that page
recall that in order for a page to be crawled, it must either:
    be on the todo list, or
    be linked from a page that appears on the todo list
without a hyperlink, that page will never be found
[Figure: a Todo List containing Page 1, Page 2 and Page 3, beside a set of Web pages (Page 1 to Page 6) and the links between them]
Page 5 will not be crawled, as it is not on the to-do list, and no other pages link to it.
The "Invisible Web"
2) The page is synthetic
a synthetic page is created on demand, depending on user input
e.g. the results of a search on another search engine
My personal search for "New York Islanders" on Bing results in an on-demand page that is not stored. Hence, it will not be crawled.
The "Invisible Web"
3) The content is unreadable to the crawler
search engines are primarily text-based
certain data, such as movie content, is not crawlable
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=72746
The webpage containing the movie might be crawled, but not the movie itself.
The "Invisible Web"
4) The content is password-protected
if you require a password to access a page, then so does a search engine*
The "Invisible Web"
5) You ask the search engine to ignore your site
the presence of certain files stored with your website will restrict your site from being crawled
e.g. The Robots Exclusion Protocol
    a file called robots.txt can be stored that will request that your site (or just certain pages) not be indexed
unlike the previous four examples, this does not prevent search engines from crawling your site
    they can choose to ignore robots.txt
http://www.robotstxt.org/
Example:
    User-agent: Google
    Disallow:

    User-agent: *
    Disallow: /
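To see how a well-behaved crawler might interpret the example file, here is a small sketch using Python's standard urllib.robotparser module. The page URL and the agent name "SomeOtherBot" are assumptions for illustration. An empty Disallow line means nothing is disallowed for that agent, while "Disallow: /" asks every other agent to stay out of the whole site.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the example slide.
ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse the text directly, no network access

page = "http://www.example.com/index.html"             # hypothetical page
google_allowed = parser.can_fetch("Google", page)       # empty Disallow: allowed
other_allowed = parser.can_fetch("SomeOtherBot", page)  # falls under "*": refused
```

The check only models a crawler that chooses to honour the file. Nothing in the protocol enforces it.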
Indexing
the primary role of the crawler is to build an index
an index is a list of tokens
    words
    phrases (not considered here)*
each token is associated with a list of URLs
    in other words, like a book index, but with page URLs instead of page numbers
    other information might be stored with URLs (e.g. page location of token)
these indexes are saved by the search provider
    search queries use information from the indexes (fast), rather than crawling the web for each query (slow)
*http://www.google.com/patents/US7536408
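As a rough sketch of the idea, an index can be modelled as a dictionary that maps each token (here, a single word) to the list of URLs whose text contains it. The page texts below are invented for illustration, and real indexes store far more per entry.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text (made up for this sketch).
PAGES = {
    "www.fullredguppy.com": "red guppy fish care",
    "www.fish.com": "fish and more fish",
    "www.red.com": "red cameras",
}

def build_index(pages):
    """Map every token to an alphabetized list of the URLs that contain it."""
    index = defaultdict(list)
    for url in sorted(pages):                      # alphabetical URL order
        for token in set(pages[url].lower().split()):
            index[token].append(url)               # each URL listed once per token
    return dict(index)

index = build_index(PAGES)
```

A query for "red" is then a dictionary lookup (index["red"]) rather than a fresh crawl of the web.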
Index Lists – Example
* from text – Figure number might be different
Indexing – What Makes a Token?
page text
    a common approach
    search providers differ on which text is selected*
        some may use all text
        others may only use certain text, such as:
            titles and headings
            frequently occurring words
            words occurring early in a page
    sometimes, stop words (a, an, the) are ignored
hyperlink text
    the term from a hyperlink on another page may be used to describe the page that it links to
*http://computer.howstuffworks.com/internet/basics/search-engine1.htm
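The two token sources above can be sketched as follows. The stop-word set is just the three words from the slide, and the title and hyperlink text are invented examples. Which text a real search provider selects is not public.

```python
import string

STOP_WORDS = {"a", "an", "the"}   # the stop words named on the slide

def tokens_from_text(text, stop_words=STOP_WORDS):
    """Lower-case the words, trim punctuation, and drop stop words."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    return [w for w in words if w and w not in stop_words]

def tokens_from_links(links):
    """Credit hyperlink text found on other pages to the page it links to.

    links is a list of (hyperlink_text, target_url) pairs.
    """
    by_target = {}
    for link_text, target in links:
        by_target.setdefault(target, []).extend(tokens_from_text(link_text))
    return by_target

title_tokens = tokens_from_text("The Care of a Red Guppy")
link_tokens = tokens_from_links([("the guppy photos", "www.fullredguppy.com")])
```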
Query Processing
the part of the search engine that we see
the query processor:
    reads words/phrases from the user interface
    returns pages that are relevant to that query
modern query processors:
    are extremely fast
    are very accurate
    allow a considerable variety in their capabilities
how does this all work?
Query Processing – How it works
let's start simple: suppose we search for a single word (e.g. cat)
in a nutshell:
    the search engine finds the list for the token 'cat'
        contains list of pages that contain 'cat' in the appropriate text (e.g. title)
    this list is ranked according to perceived relevance
    the ranked list is returned as an ordered set of hyperlinks
Query Processing – How it works
Step 1: the search engine finds the list for the token 'cat'
Query Processing – How it works
Step 2: this list is ranked according to perceived relevance
    www.cat.com
    en.wikipedia.org/wiki/Cat
    www.youtube.com/watch?v=J---aiyznGQ
    ...
Query Processing – How it works
Step 3: the ranked list is returned as an ordered set of hyperlinks
    www.cat.com
    en.wikipedia.org/wiki/Cat
    www.youtube.com/watch?v=J---aiyznGQ
    ...
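The three steps can be put together in a short sketch. The index entry reuses the URLs from the slides, but the relevance scores are invented purely so that there is something to sort by. How a real engine scores relevance is covered later and is largely not public.

```python
# Index list for the token 'cat', stored in alphabetical order.
INDEX = {
    "cat": [
        "en.wikipedia.org/wiki/Cat",
        "www.cat.com",
        "www.youtube.com/watch?v=J---aiyznGQ",
    ],
}

# Hypothetical relevance scores (higher means more relevant).
RELEVANCE = {
    "www.cat.com": 0.9,
    "en.wikipedia.org/wiki/Cat": 0.8,
    "www.youtube.com/watch?v=J---aiyznGQ": 0.5,
}

def process_query(word, index=INDEX, relevance=RELEVANCE):
    pages = index.get(word.lower(), [])                  # Step 1: find the list
    ranked = sorted(pages,                               # Step 2: rank it
                    key=lambda url: relevance.get(url, 0.0),
                    reverse=True)
    return ranked                                        # Step 3: return ordered links

results = process_query("cat")
```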
Query Processing
what about multi-word searching?
    as mentioned, some search engines index phrases as well
    however, what if a particular phrase is not indexed?
        e.g. (text) red fish guppy
solution: intersecting queries
    the webpages that are common to all of the search words are returned
Intersecting Queries
example (text): suppose the query was “red fish guppy”
further suppose that the indexes for each word were as follows:
    red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
    fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
    guppy: en.wikipedia.org/wiki/guppy, www.ifga.org, www.fullredguppy.com, www.sciencedaily.com, www.tropicalfish.com
result is the set of sites that contain all of the keywords
    in other words, the sites that are found on all three lists
Result: www.fullredguppy.com, www.sciencedaily.com
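The same result can be checked with Python's built-in sets. This sketch only shows what an intersecting query computes, using the three lists from the example. It says nothing about how a search engine does it efficiently, which is the next topic.

```python
INDEX = {
    "red": ["en.wikipedia.org/wiki/red", "newsroom.urc.edu",
            "www.fullredguppy.com", "www.red.com", "www.sciencedaily.com"],
    "fish": ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
             "www.fullredguppy.com", "www.sciencedaily.com"],
    "guppy": ["en.wikipedia.org/wiki/guppy", "www.ifga.org",
              "www.fullredguppy.com", "www.sciencedaily.com",
              "www.tropicalfish.com"],
}

def intersect_query(query, index=INDEX):
    """Return the URLs that appear on the index list of every query word."""
    url_sets = [set(index.get(word, [])) for word in query.lower().split()]
    if not url_sets:
        return []
    return sorted(set.intersection(*url_sets))

result = intersect_query("red fish guppy")
```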
Intersecting Queries - Efficiency
the size of index lists can be large
    'cat' returns over 2.3 billion results
modern search engines are fast
hence, clever algorithms must be developed for optimizing queries
    example: intersecting queries
Intersecting Queries - Efficiency
suppose you had two search terms
    e.g. red and fish
the query processor has a list for each token
suppose each list contained 1 billion URLs
let's consider a method for performing the intersecting query
    that is, how do we find all pages that occur on both lists?
The Naive Approach
for each entry in the 'red' list
    search through the entire 'fish' list
    if we find the entry from the red list, then add that to our result
red: www.sciencedaily.com, en.wikipedia.org/wiki/red, newsroom.urc.edu, www.red.com, www.fullredguppy.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
result:
The Naive Approach
First search: www.sciencedaily.com
do we find it in second list?
    yes – add it to result
result: www.sciencedaily.com
The Naive Approach
Second search: en.wikipedia.org/wiki/red
do we find it in second list?
    no
result: www.sciencedaily.com
The Naive Approach
Third search: newsroom.urc.edu
do we find it in second list?
    yes – add it to result
result: www.sciencedaily.com, newsroom.urc.edu
The Naive Approach
Fourth search: www.red.com
do we find it in second list?
    no
result: www.sciencedaily.com, newsroom.urc.edu
The Naive Approach
Fifth search: www.fullredguppy.com
do we find it in second list?
    yes – add it to result
result: www.sciencedaily.com, newsroom.urc.edu, www.fullredguppy.com
The Naive Approach
problems?
    slow!!
    for each URL in left list, we potentially had to compare it to every URL in right list
    under our previous assumption (billion size lists), we have to do 1 billion x 1 billion comparisons
    even for a powerful computer, this would require a considerable amount of time
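A direct transcription of the naive approach, with a counter added to make the cost visible, might look like this. The function name and the counter are illustrative additions. The lists are the five-entry lists from the walkthrough.

```python
def naive_intersect(left, right):
    """For each URL in left, scan the entire right list looking for it."""
    result = []
    comparisons = 0
    for left_url in left:
        found = False
        for right_url in right:        # always walks the whole right list
            comparisons += 1
            if left_url == right_url:
                found = True
        if found:
            result.append(left_url)
    return result, comparisons

red = ["www.sciencedaily.com", "en.wikipedia.org/wiki/red", "newsroom.urc.edu",
       "www.red.com", "www.fullredguppy.com"]
fish = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]

naive_result, naive_comparisons = naive_intersect(red, fish)
```

With two lists of 5 URLs the counter reaches 5 x 5 = 25. The same pattern on billion-entry lists gives the 1 billion x 1 billion figure above.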
Alphabetized Lists
suppose that each list was maintained alphabetically
then we could employ the following approach
    place a marker at start of each list
    if markers point to same URL:
        add URL to result list
        move both markers down
    otherwise, move the marker whose URL is lexicographically smaller
    stop when at least one marker goes off the end of the list
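The marker procedure above can be sketched with two list positions, i and j, playing the role of the markers. The step counter is an illustrative addition, and Python's ordinary string comparison stands in for "lexicographically smaller".

```python
def sorted_intersect(left, right):
    """Intersect two alphabetized lists by walking one marker down each."""
    result = []
    steps = 0
    i = j = 0                              # markers at the start of each list
    while i < len(left) and j < len(right):
        steps += 1
        if left[i] == right[j]:            # same URL: keep it, move both markers
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:           # left URL is smaller: move left marker
            i += 1
        else:                              # right URL is smaller: move right marker
            j += 1
    return result, steps

red = ["en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
       "www.red.com", "www.sciencedaily.com"]
fish = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]

sorted_result, sorted_steps = sorted_intersect(red, fish)
```

On these two lists the loop runs 7 times, compared with 25 comparisons for a full scan of one list per entry of the other.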
The Sorted Approach
place markers at the start of each list (left marker: red list, right marker: fish list)
red: en.wikipedia.org/wiki/red, newsroom.urc.edu, www.fullredguppy.com, www.red.com, www.sciencedaily.com
fish: en.wikipedia.org/wiki/fish, newsroom.urc.edu, www.fish.com, www.fullredguppy.com, www.sciencedaily.com
markers: red at en.wikipedia.org/wiki/red, fish at en.wikipedia.org/wiki/fish
result:
The Sorted Approach
do markers point to same URL?
    no
    since right marker's URL is less than left marker's URL, move right marker down
markers: red at en.wikipedia.org/wiki/red, fish at newsroom.urc.edu
result:
The Sorted Approach
do markers point to same URL?
    no
    since left marker's URL is less than right marker's URL, move left marker down
markers: red at newsroom.urc.edu, fish at newsroom.urc.edu
result:
The Sorted Approach
do markers point to same URL?
    yes
    add URL to result
    move both markers
markers: red at www.fullredguppy.com, fish at www.fish.com
result: newsroom.urc.edu
The Sorted Approach
do markers point to same URL?
    no
    since right marker's URL is less than left marker's URL, move right marker down
markers: red at www.fullredguppy.com, fish at www.fullredguppy.com
result: newsroom.urc.edu
The Sorted Approach
do markers point to same URL?
    yes
    add URL to result
    move both markers
markers: red at www.red.com, fish at www.sciencedaily.com
result: newsroom.urc.edu, www.fullredguppy.com
The Sorted Approach
do markers point to same URL?
    no
    since left marker's URL is less than right marker's URL, move left marker down
markers: red at www.sciencedaily.com, fish at www.sciencedaily.com
result: newsroom.urc.edu, www.fullredguppy.com
The Sorted Approach
do markers point to same URL?
    yes
    add URL to result
    move both markers
markers: both past the end of their lists
result: newsroom.urc.edu, www.fullredguppy.com, www.sciencedaily.com
The Sorted Approach
at least one marker has completed its list, so we can stop
notice that our result contains correct values
result: newsroom.urc.edu, www.fullredguppy.com, www.sciencedaily.com
The Sorted Approach
how many comparisons are done?
    note that every step involves moving at least one marker
    hence, the maximum number of steps is 2 billion
    this is considerably less than (1 billion) squared
    result: a massive speedup
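The size of that speedup can be worked out directly. The only assumption added here is a machine that performs one billion comparisons per second, chosen to turn the counts into rough running times.

```python
n = 1_000_000_000                      # entries in each list (from the slides)

naive_comparisons = n * n              # every left URL against every right URL
sorted_max_steps = n + n               # each step moves at least one marker

speedup = naive_comparisons // sorted_max_steps

# Assumed speed: one billion comparisons per second.
rate = 1_000_000_000
naive_seconds = naive_comparisons / rate
sorted_seconds = sorted_max_steps / rate
naive_years = naive_seconds / (365 * 24 * 3600)
```

Under that assumption the naive approach needs about a billion seconds (roughly 31.7 years), while the sorted approach needs at most 2 seconds, a factor of 500 million.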
The Sorted Approach – Notes
remember: commercial search engines don't fully publicize strategies
    hence, some search engines may use alternate approaches for efficient intersections
the previous strategy applies to more than two lists simultaneously
    hence, we can search for multiple tokens, rather than just two
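One way the strategy might extend to several lists at once is to keep one marker per list and, whenever the marked URLs differ, move every marker that is behind the largest one. This is a sketch of that idea, not a description of any particular search engine. The guppy list from the earlier example is sorted first, since it was not shown in alphabetical order.

```python
def intersect_sorted_lists(lists):
    """Intersect any number of alphabetized lists, one marker per list."""
    result = []
    if not lists:
        return result
    markers = [0] * len(lists)
    while all(m < len(lst) for m, lst in zip(markers, lists)):
        current = [lst[m] for m, lst in zip(markers, lists)]
        largest = max(current)
        if all(url == largest for url in current):
            result.append(largest)                 # on every list: keep it
            markers = [m + 1 for m in markers]     # move all markers down
        else:
            # URLs smaller than the largest cannot be on every list: skip them.
            markers = [m + 1 if url < largest else m
                       for m, url in zip(markers, current)]
    return result

red = ["en.wikipedia.org/wiki/red", "newsroom.urc.edu", "www.fullredguppy.com",
       "www.red.com", "www.sciencedaily.com"]
fish = ["en.wikipedia.org/wiki/fish", "newsroom.urc.edu", "www.fish.com",
        "www.fullredguppy.com", "www.sciencedaily.com"]
guppy = sorted(["en.wikipedia.org/wiki/guppy", "www.ifga.org",
                "www.fullredguppy.com", "www.sciencedaily.com",
                "www.tropicalfish.com"])

three_way = intersect_sorted_lists([red, fish, guppy])
```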
Example (from text):
Ranking Results
a typical search can produce millions of results
however, we often find what we are looking for in the first few results
    according to Optify, first returned result from Google gets clicked 36.4% of time
    first page gets clicked through 90% of the time
how does this occur?
    via a page ranking system
http://searchenginewatch.com/article/2049695/Top-Google-Result-Gets-36.4-of-Clicks-Study
Ranking Results
search providers have different ways of ranking the results of the search
Google: PageRank
    proprietary (not all details available)
    some details are public (considered next)
    the higher the PageRank score, the closer to the top of the search results a page will be
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=70897
PageRank
a scoring system
links from other pages add to a page's score
Web pages:
    Page 1 (links to Page 4, Page 5)
    Page 2 (links to Page 5, Page 6)
    Page 3 (links to Page 5, Page 6)
    Page 4
    Page 5
    Page 6
the link from Page 1 adds to Page 4's score
the links from Pages 1,2,3 add to Page 5's score
the links from Page 2 and 3 add to Page 6's score
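The "who links to whom" bookkeeping behind these three statements can be sketched by inverting the link structure of the six pages. This only collects incoming links. It is not yet a PageRank score.

```python
# Outgoing links of the six pages in the example.
LINKS = {
    "Page 1": ["Page 4", "Page 5"],
    "Page 2": ["Page 5", "Page 6"],
    "Page 3": ["Page 5", "Page 6"],
    "Page 4": [],
    "Page 5": [],
    "Page 6": [],
}

def inbound_links(links):
    """For each page, list the pages that link to it."""
    inbound = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            inbound.setdefault(target, []).append(source)
    return inbound

inbound = inbound_links(LINKS)
```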
PageRank
the score from each page is not weighted equally
    the higher a page's PageRank, the more important its contribution is
suppose that Page 3 has one incoming link (from Page 1), and Page 4 has one incoming link (from Page 2)
since Page 2's rank is higher than Page 1's, Page 4's rank will be higher than Page 3's
[Figure: Web pages. Page 1 (Low Rank) links to Page 3; Page 2 (High Rank) links to Page 4]
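The weighting idea can be illustrated with a simplified iterative calculation in the spirit of the publicly described PageRank formulation, in which each page repeatedly passes a share of its score along its outgoing links. This is a teaching sketch only: Google's production ranking is not public, the damping value 0.85 is the commonly cited textbook choice, and "Page A" and "Page B" are hypothetical pages added so that Page 2 has a higher rank than Page 1.

```python
def simple_pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank: scores always sum to 1 across all pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}           # start with equal scores
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

LINKS = {
    "Page 1": ["Page 3"],
    "Page 2": ["Page 4"],
    "Page A": ["Page 2"],   # hypothetical pages that make Page 2 rank higher
    "Page B": ["Page 2"],
}

ranks = simple_pagerank(LINKS)
```

Page 3 and Page 4 each have exactly one incoming link, yet Page 4 ends up with the higher score because its link comes from the higher-ranked Page 2.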
PageRank – Notes
since a page is not necessarily aware of other pages that point to it, its PageRank must be computed by the crawler
PageRank is only part of the ranking process that you see
    Google uses over 200 factors to determine page relevancy
        PageRank is one of those factors
        others include location, language, personalization, etc.
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=70897