google paper

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Presented by: Girish Malkarnenkar

Email: [email protected]

INF384H / CS395T Concepts of Information Retrieval and Web Search (Fall 2011) - (12th September 2011)

Lawrence Page & Sergey Brin

Motivation behind Google

• Rapid growth in the amount of information on the web

• Rapid growth in the number of new, inexperienced web users

Motivation behind Google

• Usage of human maintained indices like Yahoo! which were subjective, expensive to build & maintain, slow to improve and did not cover all topics.

• Automated search engines relying on simple keyword matching returned low quality results.

• Attempts by advertisers to mislead automated search engines

How bad were things in 1997?

• “Junk results” washed out any relevant search results.

• Only one of the top 4 commercial search engines at the time could find itself (in the top 10 results)!

• There was a desperate need for a search engine that could cope with the ever-increasing information flow and still return relevant information.

Challenges in scaling with the web!

• In 1994, one of the first web search engines, the World Wide Web Worm (WWWW), indexed around 10^5 pages.

• By November 1997, the top engines claimed to index up to 10^8 web documents!

• In 1994, the WWWW handled 1500 queries per day.

• By November 1997, Altavista handled around 20 million queries per day!

Challenges in scalability

• Fast crawling technology

• Storage Space

• Efficient indexing system

• Fast handling of queries

Google’s design goals

• Aim for very high precision in results, since most users look only at the first few tens of results.

• Precision is important even at the expense of recall (i.e. the total number of relevant documents the system is able to return).
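
As a rough illustration of the trade-off, the sketch below computes precision and recall over the first k results using the standard textbook definitions (fractions rather than raw counts). The document IDs, the `relevant` set and the cutoff `k` are made up for the example.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k results that are relevant."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the first k results."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)


# Hypothetical ranking and relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4", "d8"}

print(precision_at_k(ranked, relevant, 3))  # 2 of the top 3 are relevant
print(recall_at_k(ranked, relevant, 3))     # 2 of the 4 relevant docs are found
```

A user who reads only the first few results experiences precision directly, which is one way to read the design goal above.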

The irony of it all…

• In this paper, the authors criticized the commercialization of academic search engines, as it caused search engine technology to remain a black art.

• They also stated their aim of making Google an open academic environment for researchers working on large-scale web data.

• In the appendix, they also blasted advertising-funded search engines for being “inherently biased”.

System features of Google

• PageRank

– A Top 10 IEEE ICDM data mining algorithm

– Tries to incorporate ideas from the academic community (publishing and citations)

• Anchor Text

– <a href=http://www.com> ANCHOR TEXT </a>

PageRank!

It isn't the only factor that Google uses to rank pages, but it is an important one.

Why does PageRank use links?

• Links represent citations

• Quantity of links to a website makes the website more popular

• Quality of links to a website also helps in computing rank

• Link structure was largely unused before Larry Page proposed it to his thesis advisor

• Idea based on academic citation literature which counted citations or backlinks to a given page.

How does PageRank work?

Counts links from all pages but not equally

Normalizes by the number of links on a page.

Simplified PageRank algorithm

• Assume four web pages: A, B, C and D. Each page begins with an estimated PageRank of 0.25.

• L(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

[Figure: diagram of the four pages A, B, C and D and the links between them]
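
Under the usual simplified formulation, a page's PageRank is the sum, over every page q that links to it, of PR(q) / L(q). If B, C and D all link to A, this gives PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D). The sketch below applies one such update. The link graph is an assumption made for illustration, not necessarily the one drawn on the slide.

```python
# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}


def simplified_pagerank_step(links, rank):
    """One update of PR(p) = sum of PR(q) / L(q) over pages q linking to p."""
    new_rank = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        for target in outgoing:
            # Each page splits its current rank evenly across its out-links.
            new_rank[target] += rank[page] / len(outgoing)
    return new_rank


rank = {page: 0.25 for page in links}  # every page starts at 0.25
new_rank = simplified_pagerank_step(links, rank)
print(new_rank)
```

In this made-up graph, A receives 0.25/2 from B, 0.25/1 from C and 0.25/3 from D, so a link from a page with few out-links counts for more than a link from a page with many.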

PageRank algorithm including damping factor

Assume page A has pages B, C, D, ..., which point to it. The parameter d is a damping factor which can be set between 0 and 1; it is usually set to 0.85. The PageRank of a page A is given as follows:
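
In the notation of this slide, the formula from the paper can be written as below. The paper itself uses T1 ... Tn for the linking pages and C(T) for their out-link counts; mapping those onto B, C, D, ... and L(·) is an assumption made here for consistency with the simplified version.

```latex
PR(A) = (1 - d) + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right)
```

With d = 0.85, a page with no incoming links still receives the baseline 1 - d = 0.15.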

Intuitive Justification

• A "random surfer" is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page.

– The probability that the random surfer visits a page is its PageRank.

– The damping factor d is the probability that, at each page, the "random surfer" keeps following links; with probability 1 − d the surfer gets bored and requests another random page.

• A page can have a high PageRank

– If there are many pages that point to it

– Or if there are some pages that point to it, and have a high PageRank.
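
A minimal sketch of the damped computation, assuming the update PR(p) = (1 - d) + d * sum(PR(q) / L(q)) is simply iterated a fixed number of times. The graph, the page names and the iteration count are made up for illustration; this is not the production algorithm.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / L(q)) over pages q linking to p."""
    rank = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_rank = {page: 1.0 - d for page in links}
        for page, outgoing in links.items():
            for target in outgoing:
                new_rank[target] += d * rank[page] / len(outgoing)
        rank = new_rank
    return rank


# Hypothetical graph: "hub" is linked from many pages; "star" has a single
# in-link, but it comes from the highly ranked hub; "p1" also has a single
# in-link, but from the unlinked page "lonely".
links = {
    "hub": ["star"],
    "star": ["hub"],
    "p1": ["hub"],
    "p2": ["hub"],
    "p3": ["hub"],
    "lonely": ["p1"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:7s} {score:.3f}")
```

Both "star" and "p1" have exactly one incoming link, yet "star" ends up far ahead because its single link comes from a high-PageRank page. This matches the two cases listed above.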

Anchor Text

• <A href="http://www.yahoo.com/">Yahoo!</A>

The text of a hyperlink (anchor text) is associated with the page that the link is on, and it is also associated with the page the link points to.

Why?

• Anchors often provide more accurate descriptions of web pages than the pages themselves.

• Anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.
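
The idea of crediting anchor text to the link target can be sketched with Python's standard `html.parser`. The class name `AnchorCollector`, the example page and its URL are hypothetical; a real crawler would also need to handle malformed HTML and nested tags.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_href = None
        self.current_text = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.current_href = urljoin(self.base_url, href)
                self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = " ".join("".join(self.current_text).split())
            self.anchors.append((self.current_href, text))
            self.current_href = None


# Hypothetical page content and URL.
page_url = "http://example.com/links.html"
html = '<p>Try <A href="http://www.yahoo.com/">Yahoo!</A> or <a href="/logo.gif">our logo</a>.</p>'

collector = AnchorCollector(page_url)
collector.feed(html)

# Index the anchor text under the page the link points to.
text_for_target = defaultdict(list)
for target, text in collector.anchors:
    text_for_target[target].append(text)

print(dict(text_for_target))
```

Note that the image `logo.gif` has no indexable text of its own, yet it becomes findable through the words "our logo".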

Other Features

• It has location information for all hits (uses proximity in search)

• Google keeps track of some visual presentation details such as font size of words.

• Words in a larger or bolder font are weighted higher than other words.

• Full raw HTML of pages is available in a repository

Google Architecture

Implemented in C and C++ on Solaris and Linux

Google Architecture

• URL Server: keeps track of URLs that have been crawled and that still need to be crawled.

• Store Server: compresses and stores web pages.

• Crawlers: multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once.

• Indexer: uncompresses and parses documents. Stores link information in the anchors file.

• Anchors file: stores each link and the text surrounding the link.

• URL Resolver: converts relative URLs into absolute URLs.

• Repository: contains the full HTML of every web page. Each document is prefixed by docID, length, and URL.

Google Architecture

• URL Resolver: maps absolute URLs into docIDs stored in the Doc Index. Stores anchor text in “barrels”. Generates a database of links (pairs of docIDs).

• Indexer: parses documents and distributes hit lists into “barrels”.

• Sorter: creates the inverted index, whereby the document list containing docIDs and hitlists can be retrieved given a wordID.

• Lexicon: in-memory hash table that maps words to wordIDs. Contains a pointer to the doclist in the barrel which the wordID falls into.

• Barrels: partially sorted forward indexes, sorted by docID. Each barrel stores hitlists for a given range of wordIDs.

• Doc Index: docID-keyed index where each entry includes info such as a pointer to the document in the repository, checksum, statistics, status, etc. Also contains URL info if the document has been crawled; if not, it just contains the URL.
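
The step from a forward index (keyed by docID) to an inverted index (keyed by wordID) can be sketched as follows. The in-memory dictionaries, the sample IDs and the function name `build_inverted_index` are assumptions for illustration; the real system works on compressed on-disk barrels.

```python
from collections import defaultdict

# Hypothetical forward index: docID -> {wordID: hit positions in that document}.
forward_index = {
    1: {10: [0, 7], 42: [3]},
    2: {42: [1, 5], 99: [2]},
    3: {10: [4]},
}


def build_inverted_index(forward_index):
    """Re-key the forward index by wordID, giving a doclist per word."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for word_id, hits in forward_index[doc_id].items():
            inverted[word_id].append((doc_id, hits))
    return dict(inverted)


# Hypothetical lexicon mapping words to wordIDs.
lexicon = {"web": 10, "search": 42, "engine": 99}

inverted_index = build_inverted_index(forward_index)

# Look up a word: lexicon gives the wordID, the inverted index gives the doclist.
print(inverted_index[lexicon["search"]])
```

Answering a query then becomes a lookup by wordID rather than a scan over every document.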

Single Word Query Ranking

• Hitlist is retrieved for the single word

• Each hit can be one of several types: title, anchor, URL, large font, small font, etc.

• Each hit type is assigned its own weight

• Type-weights make up a vector of weights

• Number of hits of each type is counted to form a count-weight vector

• Dot product of the type-weight and count-weight vectors is used to compute the IR score

• IR score is combined with PageRank to compute the final rank
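
A minimal sketch of the single-word scoring described above. All numeric weights, the cap `MAX_COUNT` (loosely modelled on the paper's remark that count-weights taper off) and the linear mix in `final_rank` are assumptions; the actual values and the way the IR score is combined with PageRank are not given here.

```python
from collections import Counter

# Hypothetical type-weights; the actual values are not published in the slides.
TYPE_WEIGHTS = {
    "title": 5.0,
    "anchor": 4.0,
    "url": 3.0,
    "large_font": 2.0,
    "small_font": 1.0,
}
MAX_COUNT = 4  # assumed cap, so that piling up hits stops helping


def ir_score(hit_types):
    """Dot product of the type-weight vector and the count-weight vector."""
    counts = Counter(hit_types)
    return sum(
        weight * min(counts[hit_type], MAX_COUNT)
        for hit_type, weight in TYPE_WEIGHTS.items()
    )


def final_rank(hit_types, pagerank, alpha=0.5):
    """Assumed linear mix of IR score and PageRank (illustration only)."""
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank


# One title hit, two anchor hits and six small-font hits for the query word.
hits = ["title", "anchor", "anchor"] + ["small_font"] * 6
print(ir_score(hits))
print(final_rank(hits, pagerank=2.0))
```

Because of the cap, the six small-font hits count as four, so repeating a word many times in body text cannot outweigh a title or anchor hit.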

Multi-word Query Ranking

• Similar to single-word ranking except now must analyze proximity of words in a document

• Hits occurring closer together are weighted higher than those farther apart

• Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close”

• Each type and proximity pair has a type-prox weight

• Counts converted into count-weights

• Take the dot product of the count-weights and the type-prox-weights to compute the IR score
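
The multi-word case can be sketched in the same spirit for a two-word query. Everything numeric here is an assumption: the linear bucketing into 10 bins, the weights, the cap, and the simplification of pairing only hits of the same type.

```python
from collections import Counter
from itertools import product

NUM_BINS = 10  # bin 0 = "phrase match", bin 9 = "not even close"
# Hypothetical weights: closer bins score higher, title hits beat plain hits.
PROX_WEIGHTS = [10 - b for b in range(NUM_BINS)]
TYPE_WEIGHTS = {"title": 3.0, "plain": 1.0}
MAX_COUNT = 4  # assumed cap when converting counts to count-weights


def proximity_bin(pos_a, pos_b):
    """Map a word distance to one of 10 bins (assumed linear bucketing)."""
    return min(max(abs(pos_a - pos_b) - 1, 0), NUM_BINS - 1)


def multiword_ir_score(hits_a, hits_b):
    """hits_* are lists of (hit_type, position) for each query word in one document."""
    counts = Counter()
    for (type_a, pos_a), (type_b, pos_b) in product(hits_a, hits_b):
        if type_a == type_b:  # only pair hits of the same type (simplification)
            counts[(type_a, proximity_bin(pos_a, pos_b))] += 1
    # Dot product of count-weights and type-prox-weights.
    return sum(
        TYPE_WEIGHTS[hit_type] * PROX_WEIGHTS[bin_index] * min(count, MAX_COUNT)
        for (hit_type, bin_index), count in counts.items()
    )


# Adjacent words in the title, plus a distant pair in the body text.
hits_a = [("title", 0), ("plain", 20)]
hits_b = [("title", 1), ("plain", 40)]
print(multiword_ir_score(hits_a, hits_b))
```

The adjacent title pair falls into the "phrase match" bin and dominates the score, while the body-text pair 20 words apart lands in the last bin and contributes very little.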

The Past: Original Page # 1

When Larry Page and Sergey Brin began work on their search engine, it wasn’t called Google. They called it BackRub (a reference to the algorithm, which used backlinks to rank pages) and only changed the name a year into development. And yes, the hand in the logo was Larry Page’s, scanned.

The Past: Original Page # 2

The original Google webpage (in 1997)

The Present

The Future?

“The ultimate search engine would understand exactly what you mean and give back exactly what you want.”

- Larry Page

References…

• Brin, S. and Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine.

• http://www.cs.uvm.edu/~xwu/kdd

• http://www.ics.uci.edu/~scott/google.htm

Thank you!
