Anatomy of a Large-Scale Hypertextual Web Search Engine

Upload: ganesh-yadav

Post on 03-Apr-2018


  • 7/28/2019 Anatomy of a Large-Scale Hypertextual Web Search Engine


    Presented by: Ganesh C. Yadav
    (Roll No: 09141)


    Presentation Overview

    - Problem Definition
    - Design Goals
    - Google Search Engine Characteristics
    - Google Architecture
    - Scalability
    - Conclusions


    Problem

    - The Web is vast and ever expanding, and it is being flooded with data.
    - This data is heterogeneous and comes in many forms: text, images, ASCII, Java applets.
    - Lists maintained by humans cannot keep track of it.
    - Human attention is confined to 10-1000 documents.
    - Previous search methodologies relied on keyword matching, producing inferior-quality results.


    Solution = Search Engine

    - Search engines let users reach the text or documents of their choice with a click of the mouse.
    - Some examples of search engines: Google, AltaVista, MetaCrawler, Kosmix.
    - For a comprehensive list of search engines, visit:
      http://en.wikipedia.org/wiki/List_of_search_engines


    Specific Design Goals

    - Deliver results that have very high precision, even at the expense of recall.
    - Bring search engine technology into the academic realm in order to support novel research activities.
    - Make search engine technology transparent, i.e. advertising shouldn't bias results.
    - Make the system user friendly.


    Google Search Engine Features

    - Uses the link structure of the web (PageRank).
    - Uses the text surrounding hyperlinks to improve the accuracy of document retrieval.
    - Other features include:
      - Takes into account word proximity in documents.
      - Uses font size, word position, etc. to weight words.
      - Stores the full raw HTML of pages.


    PageRank for the Layman

    - Imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps.
    - Occasionally, the surfer will get bored and, instead of following a link pointing outward from the current page, will jump to another random page.
    - At some point, the percentage of time spent at each page will converge to a fixed value. This value is known as the PageRank of the page.
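The random-surfer picture can be tried out directly. The following is a minimal simulation sketch, not the method Google used: the four-page graph `toy_web`, the step count, and the 0.85 chance of following a link (`follow_prob`) are all illustrative assumptions.

```python
import random
from collections import Counter


def simulate_random_surfer(links, steps=200_000, follow_prob=0.85, seed=0):
    """Estimate the fraction of time a random surfer spends on each page.

    links maps each page to the pages it links to (a hypothetical toy graph).
    follow_prob is the assumed chance of following a link instead of getting bored.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    visits = Counter()
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        outgoing = links[current]
        if outgoing and rng.random() < follow_prob:
            current = rng.choice(outgoing)  # follow an outgoing link
        else:
            current = rng.choice(pages)     # bored, or stuck at a dead end: jump anywhere
    return {page: visits[page] / steps for page in pages}


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
shares = simulate_random_surfer(toy_web)
```

In this toy graph nothing links to D, so the surfer only reaches it by a random jump and it ends up with the smallest share of visits, while C, which three pages point to, gets the largest.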


    PageRank for Techies

    - N(p): number of outgoing links from page p
    - B(p): set of pages that point to p
    - d: damping factor that models the surfer's tendency to get bored
    - R(p): PageRank of p

    \[ R(p) = (1 - d) + d \sum_{q \in B(p)} \frac{R(q)}{N(q)} \]
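One common way to solve this recurrence is to start every page at rank 1 and apply the formula repeatedly until the values settle. The sketch below does that with a fixed number of passes; the graph `toy_web`, the value d = 0.85, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate R(p) = (1 - d) + d * sum(R(q) / N(q) for q in B(p))."""
    pages = sorted(links)
    # B(p): the pages that point to p, derived from the outgoing-link lists.
    backlinks = {p: [q for q in pages if p in links[q]] for p in pages}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        rank = {
            p: (1 - d) + d * sum(rank[q] / len(links[q]) for q in backlinks[p])
            for p in pages
        }
    return rank


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

Page D has no incoming links, so its sum is empty and its rank is just 1 - d. Because every page in this toy graph has at least one outgoing link, the ranks always add up to the number of pages.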


    Why do we need d?

    - In the real world, virtually all web graphs are not fully connected, i.e. they have dead ends, islands, etc.
    - Without d we get "rank leaks" for graphs that are not connected, which leads to numerical instability.
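A rank leak is easy to reproduce on a two-page graph with a dead end. In this sketch the graph `dead_end_web` is a made-up example, and "not having d" is modeled as setting d = 1 in R(p) = (1 - d) + d * sum(R(q) / N(q)), which removes the random-jump term.

```python
def iterate_rank(links, d, iterations=30):
    """Apply R(p) = (1 - d) + d * sum(R(q) / N(q)) a fixed number of times."""
    pages = sorted(links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        rank = {
            p: (1 - d)
            + d * sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return rank


# A links to B, but B is a dead end with no outgoing links.
dead_end_web = {"A": ["B"], "B": []}
without_d = iterate_rank(dead_end_web, d=1.0)   # no random jump: rank drains away
with_d = iterate_rank(dead_end_web, d=0.85)     # the jump term keeps rank in the system
```

With d = 1 the rank that flows into B is never passed on, and after two passes every page is at zero. With d = 0.85 the values settle at 0.15 for A and 0.2775 for B.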


    Justifications for using PageRank

    - Attempts to model user behavior.
    - Captures the notion that the more a page is pointed to by important pages, the more it is worth looking at.
    - Takes into account the global structure of the web.


    Google Architecture

    - Implemented in C and C++ on Solaris and Linux.


    Preliminary

    A hitlist is defined as the list of occurrences of a particular word in a particular document, including additional meta info:

    - position of word in doc
    - font size
    - capitalization
    - descriptor type, e.g. title, anchor, etc.
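To make the idea of a hit concrete, here is a small sketch that packs one hit into a 16-bit integer. The field widths (1 capitalization bit, 3 font-size bits, 12 position bits) are assumed here after the plain-hit layout described in the Brin and Page paper; the `Hit` class and function names are hypothetical, and the descriptor type is left out for brevity.

```python
from dataclasses import dataclass

POSITION_BITS = 12                       # assumed width of the position field
FONT_BITS = 3                            # assumed width of the font-size field
MAX_POSITION = (1 << POSITION_BITS) - 1  # larger positions are clamped to this
MAX_FONT = (1 << FONT_BITS) - 1


@dataclass(frozen=True)
class Hit:
    position: int      # word offset within the document
    font_size: int     # font size relative to the rest of the document
    capitalized: bool


def pack_hit(hit):
    """Encode a hit as capitalization | font size | position in 16 bits."""
    position = min(hit.position, MAX_POSITION)
    font = min(hit.font_size, MAX_FONT)
    return (int(hit.capitalized) << 15) | (font << POSITION_BITS) | position


def unpack_hit(packed):
    """Decode a 16-bit value produced by pack_hit."""
    return Hit(
        position=packed & MAX_POSITION,
        font_size=(packed >> POSITION_BITS) & MAX_FONT,
        capitalized=bool(packed >> 15),
    )


packed = pack_hit(Hit(position=5, font_size=3, capitalized=True))
```

A hitlist is then just a sequence of such integers for one word in one document, which is what makes the encoding compact.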


    Google Architecture (cont.)

    - URL server: keeps track of URLs that have been crawled and URLs that need to be crawled.
    - Store server: compresses and stores web pages.
    - Crawlers: multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once.
    - Indexer: uncompresses and parses documents. Stores link information in the anchors file.
    - Anchors file: stores each link and the text surrounding the link.
    - URL resolver: converts relative URLs into absolute URLs.
    - Repository: contains the full HTML of every web page. Each document is prefixed by docID, length, and URL.


    Google Architecture (cont.)

    - URL resolver: maps absolute URLs into docIDs stored in the Doc Index. Stores anchor text in barrels. Generates a database of links (pairs of docIDs).
    - Indexer: parses documents and distributes hit lists into barrels.
    - Sorter: creates the inverted index, whereby the document list containing docIDs and hitlists can be retrieved given a wordID.
    - Lexicon: in-memory hash table that maps words to wordIDs. Contains a pointer to the doclist in the barrel that the wordID falls into.
    - Barrels: partially sorted forward indexes, sorted by docID. Each barrel stores hitlists for a given range of wordIDs.
    - Doc Index: docID-keyed index where each entry includes info such as a pointer to the doc in the repository, checksum, statistics, status, etc. Also contains URL info if the doc has been crawled; if not, it just contains the URL.
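The step from a forward index (keyed by docID) to an inverted index (keyed by wordID) can be sketched with plain dictionaries. This is only an in-memory illustration: the function name `invert_barrel` and the nested-dictionary layout are assumptions, and real barrels are on-disk structures.

```python
from collections import defaultdict


def invert_barrel(forward_barrel):
    """Turn {docID: {wordID: hitlist}} into {wordID: [(docID, hitlist), ...]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_barrel):  # visit documents in docID order
        for word_id, hitlist in forward_barrel[doc_id].items():
            inverted[word_id].append((doc_id, hitlist))
    return dict(inverted)


# Hypothetical forward barrel: two documents, hitlists given as word positions.
forward = {
    2: {10: [4, 9], 11: [1]},
    1: {10: [7]},
}
inverted = invert_barrel(forward)
```

Each entry of `inverted` is a doclist: given a wordID, it yields the documents containing that word together with their hitlists, ordered by docID.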


    Google Architecture (cont.)

    - The list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create a new lexicon used by the searcher. The lexicon stores ~14 million words.
    - The new lexicon keyed by wordID, the inverted doc index keyed by docID, and PageRanks are used to answer queries.
    - There are 2 kinds of barrels: short barrels, which contain hit lists that include title or anchor hits, and long barrels for all hit lists.


    Google Query Evaluation

    1. Parse the query.
    2. Convert words into wordIDs.
    3. Seek to the start of the doclist in the short barrel for every word.
    4. Scan through the doclists until there is a document that matches all the search terms.
    5. Compute the rank of that document for the query.
    6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
    7. If we are not at the end of any doclist go to step 4.
    8. Sort the documents that have matched by rank and return the top k.
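The eight steps can be sketched in a few lines if the barrels are held in memory as dictionaries. This is a simplification under stated assumptions: a set intersection stands in for the sequential doclist scan, both barrels are always consulted, and `evaluate_query`, `toy_rank`, and all sample data are hypothetical.

```python
def evaluate_query(query, lexicon, short_barrel, full_barrel, rank_fn, k=10):
    # Steps 1-2: parse the query and convert words into wordIDs.
    words = query.lower().split()
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]
    scored = {}
    # Steps 3-7: the short barrel is consulted first, then the full barrel.
    for barrel in (short_barrel, full_barrel):
        doclists = [barrel.get(word_id, {}) for word_id in word_ids]
        # Step 4: a document matches when every word's doclist contains it.
        matching = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in sorted(matching):
            if doc_id not in scored:
                hits = [doclist[doc_id] for doclist in doclists]
                scored[doc_id] = rank_fn(doc_id, hits)  # step 5
    # Step 8: sort the matches by rank and return the top k.
    return sorted(scored, key=lambda doc_id: (-scored[doc_id], doc_id))[:k]


# Hypothetical data: barrels map wordID -> {docID: hitlist of positions}.
lexicon = {"web": 1, "search": 2}
short_barrel = {1: {7: [0]}, 2: {7: [1]}}
full_barrel = {1: {7: [0, 12], 8: [3]}, 2: {7: [1], 8: [4]}}
toy_pagerank = {7: 0.9, 8: 0.2}


def toy_rank(doc_id, hits):
    """Placeholder ranking: PageRank plus the total number of hits."""
    return toy_pagerank[doc_id] + sum(len(h) for h in hits)


results = evaluate_query("web search", lexicon, short_barrel, full_barrel, toy_rank)
```

Document 7 matches in the short barrel and is scored there, while document 8 is only found once the full barrel is searched.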


    Single Word Query Ranking

    - The hitlist is retrieved for the single word.
    - Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
    - Each hit type is assigned its own weight.
    - The type-weights make up a vector of weights.
    - The number of hits of each type is counted to form a count vector.
    - The dot product of the two vectors is used to compute the IR score.
    - The IR score is combined with PageRank to compute the final rank.
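The dot-product step can be illustrated as follows. The weight values in `TYPE_WEIGHTS` are invented, and since the slides do not say how the IR score and PageRank are combined, the linear blend in `final_rank` is purely an assumption.

```python
from collections import Counter

# Hypothetical type-weights; the real values are not given in the slides.
TYPE_WEIGHTS = {
    "title": 5.0,
    "anchor": 4.0,
    "url": 3.0,
    "large_font": 2.0,
    "small_font": 1.0,
}


def ir_score(hit_types):
    """Dot product of the type-weight vector and the count vector."""
    counts = Counter(hit_types)  # count vector: number of hits of each type
    return sum(weight * counts[hit_type] for hit_type, weight in TYPE_WEIGHTS.items())


def final_rank(hit_types, pagerank, ir_weight=0.5):
    """Assumed combination: a linear blend of IR score and PageRank."""
    return ir_weight * ir_score(hit_types) + (1 - ir_weight) * pagerank


hits = ["title", "anchor", "anchor", "small_font"]
score = ir_score(hits)
rank = final_rank(hits, pagerank=2.0)
```

Here the count vector is one title hit, two anchor hits, and one small-font hit, so the IR score is 5 + 8 + 1 = 14.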


    Multi-word Query Ranking

    - Similar to single-word ranking, except that proximity must now be analyzed.
    - Hits occurring closer together are weighted higher.
    - Each proximity relation is classified into 1 of 10 values, ranging from a phrase match to "not even close".
    - Counts are computed for every type of hit and proximity.
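One possible way to classify proximity into 10 values is to bin the distance between hits on a logarithmic scale. The bin boundaries below are an assumption made for this sketch (the slides give only the number of bins), and the function names are hypothetical.

```python
def proximity_bin(distance, bins=10):
    """Map a word distance to a bin: 0 is a phrase match, bins - 1 is 'not even close'."""
    distance = max(distance, 1)
    # Assumed boundaries: distance 1 -> 0, 2 -> 1, 3-4 -> 2, 5-8 -> 3, and so on.
    return min(bins - 1, (distance - 1).bit_length())


def proximity_counts(positions_a, positions_b, bins=10):
    """Count, per bin, how close each hit of word A is to the nearest hit of word B."""
    counts = [0] * bins
    for a in positions_a:
        nearest = min(abs(a - b) for b in positions_b)
        counts[proximity_bin(nearest, bins)] += 1
    return counts


# Hypothetical hit positions of two query words in one document.
counts = proximity_counts([3, 40], [4, 100])
```

The resulting count vector could then be weighted much like the type counts in single-word ranking, with smaller bin numbers given larger weights.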


    Scalability

    - Cluster architecture combined with Moore's Law makes for high scalability.
    - At time of writing:
      - ~24 million documents indexed in one week
      - ~518 million hyperlinks indexed
      - Four crawlers collected 100 documents/sec


    Summary of Key Optimization Techniques

    - Each crawler maintains its own DNS lookup cache.
    - Use flex to generate a lexical analyzer with its own stack for parsing documents.
    - Parallelization of the indexing phase.
    - In-memory lexicon.
    - Compression of the repository.
    - Compact encoding of hitlists, accounting for major space savings.
    - Indexer is optimized so it is just faster than the crawler, so that crawling is the bottleneck.
    - Document index is updated in bulk.
    - Critical data structures placed on local disk.
    - Overall architecture designed to avoid disk seeks wherever possible.


    References

    - http://video.google.com/videoplay?docid=-1400721382961784115
    - http://google.stanford.edu
    - http://en.wikipedia.org/wiki/List_of_search_engines
    - Sergey Brin and Lawrence Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (pdf).

    The audio presentation of my lecture will be posted on my home page and will be given to Dr. Hankley.

    www.cis.ksu.edu/~vamsee
