
Page 1: DATA MINING - 1DL460

4/26/13 1 Kjell Orsborn - UDBL - IT - UU!

DATA MINING - 1DL460

Spring 2013

A second course in data mining

http://www.it.uu.se/edu/course/homepage/infoutv2/vt13

Kjell Orsborn
Uppsala Database Laboratory
Department of Information Technology, Uppsala University, Uppsala, Sweden

Page 2:

Introduction to Data Mining: Search Engines
Technology and Algorithms for Efficient Searching of Information on the Web
(slides + supplemental articles)

Kjell Orsborn
Department of Information Technology, Uppsala University, Uppsala, Sweden

Page 3:

Lecture’s Outline

•  The Web and its Search Engines
•  Heuristics-based Ranking
•  PageRank (Google)
  –  for discovering the most “important” web pages
•  HITS: hubs and authorities (Clever project)
  –  more detailed evaluation of pages’ importance

Page 4:

The Web in 2001: Some Facts

•  More than 3 billion pages; several terabytes
•  Highly dynamic
  –  More than 1 million new pages every day!
  –  Over 600 GB of pages change per month
  –  Average page changes in a few weeks
•  Largest crawlers
  –  Refresh less than 18% in a few weeks
  –  Cover less than 50% ever (invisible Web)
•  Average page has 7–10 links
  –  Links form content-based communities

Page 5:

Chaos on the Web

•  The Internet lacks organization and structure:
  –  pages written in any language, dialect or style;
  –  different cultures, interests and motivation;
  –  mixes truth, falsehood, wisdom, propaganda…
•  Challenge:
  –  Quickly extract from this digital morass high-quality, relevant, up-to-date pages in response to specific information needs
  –  No precise mathematical measure of “best” results

Page 6:

Search Products and Services

•  Verity
•  Fulcrum
•  PLS
•  Oracle text extender
•  DB2 text extender
•  Infoseek Intranet
•  SMART (academic)
•  Glimpse (academic)
•  BING (Microsoft)
•  Baidu (China)
•  Yandex (Russia)
•  Inktomi (HotBot)
•  Alta Vista
•  Raging Search
•  Google
•  Dmoz.org
•  Yahoo!
•  Infoseek Internet
•  Lycos
•  Excite

* heuristics-based  * humanly-selected

Page 7:

Web Search Queries

•  Web search queries are short:
  –  ~2.4 words on average (Aug. 2000)
  –  Has increased; it was ~1.7 around 1997
•  User expectations:
  –  “The first item shown should be what I want to see!”
  –  This works if the user has the most popular or most common notion in mind; not otherwise

Page 8:

Standard Web Search Engine Architecture

[Diagram: a crawler fetches pages from the web; duplicates are eliminated and an inverted index (keyed by DocIds) is created; search engine servers evaluate the user’s query against the inverted index and show results to the user.]
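The inverted index at the center of this pipeline is easy to sketch. A minimal illustration in Python (the toy pages, DocIds, and the AND-semantics `search` helper are invented for this sketch, not part of any real engine):

```python
# Sketch of the indexing step: map each term to the set of DocIds containing it.
from collections import defaultdict

pages = {  # DocId -> page text (toy stand-ins for crawled pages)
    1: "data mining on the web",
    2: "web search engines",
    3: "mining the web for search",
}

inverted_index = defaultdict(set)
for doc_id, text in pages.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

def search(query):
    """Return DocIds containing every query term (AND semantics)."""
    ids = [inverted_index.get(t, set()) for t in query.split()]
    return sorted(set.intersection(*ids)) if ids else []

print(search("web mining"))  # [1, 3]
```

Real engines layer ranking, positional postings, and duplicate elimination on top of this basic structure.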

Page 9:

Heuristics-based Ranking

•  Naive approach used by many search engines
•  Heuristics employed:
  –  number of times a page contains the query term
  –  favor instances where the term appears early
  –  give weight to a word appearing in a special place or form, e.g. in a title or in bold
•  All such heuristics fail miserably due to:
  –  spamming, or
  –  synonymy and polysemy of natural-language words
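A toy scorer makes the listed heuristics concrete (the weights and the `heuristic_score` helper are arbitrary choices for this sketch, not taken from any real engine):

```python
# Hypothetical scorer combining the three heuristics on the slide.
def heuristic_score(page_words, title_words, query_term):
    score = 0.0
    for pos, word in enumerate(page_words):
        if word == query_term:
            score += 1.0               # count occurrences of the term
            score += 1.0 / (pos + 1)   # favor early occurrences
    if query_term in title_words:
        score += 2.0                   # extra weight for special placement
    return score

doc = "ferrari cars and more ferrari news".split()
print(heuristic_score(doc, ["ferrari"], "ferrari"))  # 5.2
```

Such a score is trivially spammable: repeating the term inflates it, which is exactly the failure mode noted above.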

Page 10:

Hyperlink Graph Analysis

•  Hypermedia is a social network
  –  Telephoned, advised, co-authored, paid
•  Social network theory (cf. Wasserman & Faust)
  –  Extensive research applying graph notions
  –  Centrality and prestige
  –  Co-citation (relevance judgment)
•  Applications
  –  Web search: HITS, Google
  –  Classification and topic distillation

Page 11:

Hypertext Models for Classification

•  c = class, t = text, N = neighbors
•  Text-only model: Pr[t | c]
•  Using neighbors’ text to judge my topic: Pr[t, t(N) | c]
•  Better model: Pr[t, c(N) | c]
•  Recursive relationships

Page 12:

Exploiting the Web’s Hyperlink Structure

Underlying assumption: view each link as an implicit endorsement of the location it points to.

•  Assumption: if the pages pointing to this page are good, then this is also a good page.
  –  References: Kleinberg 98, Page et al. 98
•  Draws upon earlier research in sociology and bibliometrics.
  –  Kleinberg’s model includes “authorities” (highly referenced pages) and “hubs” (pages containing good reference lists).
  –  The Google model is a version with no hubs, and is closely related to work on influence weights by Pinski & Narin (1976).

Page 13:

Link Analysis for Ranking Pages

•  Why does this work?
  –  The official Ferrari site will be linked to by lots of other official (or high-quality) sites
  –  The best Ferrari fan-club sites probably also have many links pointing to them
  –  Less high-quality sites do not have as many high-quality sites linking to them

Page 14:

Page Rank

•  Intuition: a recursive definition of “importance”.
•  A page is important if important pages link to it.
•  Method: create a stochastic matrix of the Web
  –  each page corresponds to a row and a column of the matrix
  –  if page j has L successors (out-links), then the ij-th entry is 1/L if page i is one of these L successors of page j, and 0 otherwise
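The construction can be sketched directly from out-link lists (the three-page link structure below is a hypothetical example; note that every column of the resulting matrix sums to 1):

```python
# Build the column-stochastic matrix from out-link lists.
links = {            # page -> list of successors (out-links)
    "N": ["N", "A"],
    "M": ["A"],
    "A": ["N", "M"],
}
pages = sorted(links)                      # fixed row/column order
idx = {p: i for i, p in enumerate(pages)}

n = len(pages)
M = [[0.0] * n for _ in range(n)]
for j, page in enumerate(pages):           # column j describes page j's out-links
    succ = links[page]
    for s in succ:
        M[idx[s]][j] = 1.0 / len(succ)     # 1/L for each of the L successors

for row in M:
    print(row)
```

Because each column distributes exactly one unit of importance, repeated application of M conserves total importance (as long as every page has at least one out-link).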

Page 15:

Page Rank Intuition

•  Initially, each page has one unit of importance.
•  At each round, each page shares whatever importance it has with its successors, and receives new importance from its predecessors.
•  Eventually, the importance reaches a limit, which happens to be the page’s component of the principal eigenvector of this matrix.
•  Importance = the probability that a random Web surfer, starting from a random Web page and following random links, will be at the page in question after a long series of links.

Page 16:

Page Rank example: the web in 1689

Equation (pages ordered N = Netscape, M = Microsoft, A = Amazon):

  | N′ |   | 1/2  0  1/2 | | N |
  | M′ | = |  0   0  1/2 | | M |
  | A′ |   | 1/2  1   0  | | A |

Solution by relaxation (iterative solution), starting from N = M = A = 1:

  N:  1,  1,   5/4,  9/8,  5/4,   ...  →  6/5
  M:  1,  1/2, 3/4,  1/2,  11/16, ...  →  3/5
  A:  1,  3/2, 1,    11/8, 17/16, ...  →  6/5
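The relaxation above is a power iteration and is easy to reproduce (same matrix and page order as the example; after 100 rounds the values agree with the limit to rounding):

```python
# Power iteration v' = M·v for the three-page example; the limit is
# the principal eigenvector (6/5, 3/5, 6/5).
M = [[0.5, 0.0, 0.5],   # page order N, M, A; column j = page j's out-link shares
     [0.0, 0.0, 0.5],
     [0.5, 1.0, 0.0]]

v = [1.0, 1.0, 1.0]     # one unit of importance per page
for _ in range(100):
    v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

print([round(x, 3) for x in v])  # [1.2, 0.6, 1.2]
```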

Page 17:

Problems with Real Web Graphs

•  Dead ends: a page that has no successors has nowhere to send its importance.
  –  Eventually, all importance will “leak out of” the Web.
•  Spider traps: a group of one or more pages that have no links outside the group.
  –  Eventually, these pages will accumulate all the importance of the Web.

Page 18:

The Bowtie structure of the Web"


5.1.3 Structure of the Web

It would be nice if the Web were strongly connected like Fig. 5.1. However, it is not, in practice. An early study of the Web found it to have the structure shown in Fig. 5.2. There was a large strongly connected component (SCC), but there were several other portions that were almost as large.

1. The in-component, consisting of pages that could reach the SCC by following links, but were not reachable from the SCC.

2. The out-component, consisting of pages reachable from the SCC but unable to reach the SCC.

3. Tendrils, which are of two types. Some tendrils consist of pages reachable from the in-component but not able to reach the in-component. The other tendrils can reach the out-component, but are not reachable from the out-component.

[Figure: the bowtie diagram, with the strongly connected component in the middle, the in-component and out-component on either side, in-tendrils and out-tendrils, tubes, and disconnected components.]

Figure 5.2: The “bowtie” picture of the Web

In addition, there were small numbers of pages found either in tubes (reachable from the in-component and able to reach the out-component without passing through the SCC) or in components disconnected from the rest.

From Rajaraman and Ullman, Mining Massive Data Sets

Page 19:

Dead ends – rank leaks: Microsoft tries to duck monopoly charges...

Equation (pages ordered N, M, A; Microsoft now has no out-links):

  | N′ |   | 1/2  0  1/2 | | N |
  | M′ | = |  0   0  1/2 | | M |
  | A′ |   | 1/2  0   0  | | A |

Solution by relaxation (iterative solution), starting from N = M = A = 1:

  N:  1,  1,   3/4,  5/8,  1/2,   ...  →  0
  M:  1,  1/2, 1/4,  1/4,  3/16,  ...  →  0
  A:  1,  1/2, 1/2,  3/8,  5/16,  ...  →  0
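The leak is easy to observe numerically (same matrix as above; Microsoft’s column is all zeros, so the matrix is no longer column-stochastic):

```python
# With a dead end, column M sums to 0, so total importance shrinks
# on every round and drains toward 0.
M = [[0.5, 0.0, 0.5],   # page order N, M, A; Microsoft has no out-links
     [0.0, 0.0, 0.5],
     [0.5, 0.0, 0.0]]

v = [1.0, 1.0, 1.0]
for _ in range(100):
    v = [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

print(sum(v))  # vanishingly small: the importance has leaked out
```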

Page 20:

Spider traps – rank sinks: Microsoft considers itself the center of the universe...

Equation (Microsoft now links only to itself):

  | N′ |   | 1/2  0  1/2 | | N |
  | M′ | = |  0   1  1/2 | | M |
  | A′ |   | 1/2  0   0  | | A |

Solution by relaxation (iterative solution), starting from N = M = A = 1:

  N:  1,  1,   3/4,  5/8,  1/2,    ...  →  0
  M:  1,  3/2, 7/4,  2,    35/16,  ...  →  3
  A:  1,  1/2, 1/2,  3/8,  5/16,   ...  →  0

Page 21:

Google solution to dead ends and spider traps

•  Instead of applying the matrix directly, “tax” each page with some fraction of its current importance, and distribute the taxed importance equally among all pages:

  | N′ |         | 1/2  0  1/2 | | N |   | 0.2 |
  | M′ | = 0.8 · |  0   1  1/2 | | M | + | 0.2 |
  | A′ |         | 1/2  0   0  | | A |   | 0.2 |

•  The solution to this equation is now: N = 7/11, M = 21/11, A = 5/11
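Iterating the taxed equation confirms the stated solution (spider-trap matrix from the previous example; 0.8 and 0.2 as on the slide):

```python
# Taxed iteration v' = 0.8·M·v + 0.2 per page; it converges to the
# unique fixed point N = 7/11, M = 21/11, A = 5/11.
M = [[0.5, 0.0, 0.5],   # page order N, M, A; Microsoft links only to itself
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 0.0]]

v = [1.0, 1.0, 1.0]
for _ in range(200):
    v = [0.8 * sum(M[i][j] * v[j] for j in range(3)) + 0.2 for i in range(3)]

print([round(x, 4) for x in v])  # [0.6364, 1.9091, 0.4545]
```

The tax bounds the spider trap’s share (Microsoft no longer absorbs everything) and keeps dead-end leakage from draining the system.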

Page 22:

PageRank

•  The defining equation:

      PR(p_i) = (1 − d)/N + d · Σ_{p_j ∈ M(p_i)} PR(p_j)/L(p_j)

  where PR is the PageRank, p_1, p_2, ..., p_N are the pages under consideration, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of outbound links on page p_j, N is the total number of pages, and d is the damping factor.

•  In an iterative formulation we have, at t = 0:

      PR(p_i; 0) = 1/N

•  At each time step, the PageRank can be computed as follows:

      PR(p_i; t+1) = (1 − d)/N + d · Σ_{p_j ∈ M(p_i)} PR(p_j; t)/L(p_j)

Page 23:

PageRank

•  Expressed as a matrix equation (using a modified adjacency matrix):

      R(t+1) = d · M · R(t) + ((1 − d)/N) · 1

  where R(t) = (PR(p_1; t), ..., PR(p_N; t))ᵀ, 1 is an N-length column vector of ones, and the matrix M is given by:

      M_ij = 1/L(p_j) if page p_j links to page p_i, and 0 otherwise

•  Convergence is assumed when |R(t+1) − R(t)| < ε, for some small tolerance ε.
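The matrix formulation can be sketched in a few lines (this assumes the normalized form where R sums to 1; the values of d, eps, and the example matrix are illustrative choices):

```python
# Damped PageRank in matrix form: R(t+1) = d·M·R(t) + ((1-d)/N)·1,
# iterated until the L1 change falls below eps.
def pagerank(M, d=0.85, eps=1e-10):
    n = len(M)
    R = [1.0 / n] * n                      # R(0) = uniform distribution
    while True:
        R_next = [d * sum(M[i][j] * R[j] for j in range(n)) + (1 - d) / n
                  for i in range(n)]
        if sum(abs(a - b) for a, b in zip(R_next, R)) < eps:
            return R_next
        R = R_next

# Three-page example matrix (every page has out-links; columns sum to 1).
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 1.0, 0.0]]
R = pagerank(M)
print(R)  # a probability vector: the components sum to 1
```

Because M is column-stochastic, each iteration contracts the L1 distance to the fixed point by a factor of d, so the loop terminates quickly.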

Page 24:

Google Anti-Spam Devices

•  Spamming: an attempt by many web sites to appear to be about a subject that will attract surfers, without truly being about that subject.
•  Early search engines relied on the words on a page to tell what it is about.
  –  This led to “tricks” in which pages attracted attention by placing false words in the background color of the page.
•  Google trusts the words in anchor text.
  –  It relies on others telling the truth about your page, rather than relying on you.
  –  This makes it harder for a homepage to appear to be about something it is not.
•  The use of PageRank to measure importance, rather than the more naive “number of links into a page”, also protects against spamming: e.g., PageRank recognizes as unimportant 1000 pages that mutually link to one another.

Page 25:

Google search engine evolution (ref: wired.com)

•  Backrub [September 1997] – This search engine, which had run on Stanford’s servers for almost two years, is renamed Google. Its breakthrough innovation: ranking searches based on the number and quality of incoming links.
•  New algorithm [August 2001] – The search algorithm is completely revamped to incorporate additional ranking criteria more easily.
•  Local connectivity analysis [February 2003] – Google’s first patent is granted for this feature, which gives more weight to links from authoritative sites.
•  Fritz [Summer 2003] – This initiative allows Google to update its index constantly, instead of in big batches.
•  Personalized results [June 2005] – Users can choose to let Google mine their own search behavior to provide individualized results.
•  Bigdaddy [December 2005] – Engine update allows for more-comprehensive Web crawling.
•  Universal search [May 2007] – Building on Image Search, Google News, and Book Search, the new Universal Search allows users to get links to any medium on the same results page.
•  Google Caffeine [August 2009] – A new architecture designed to return results faster and to better deal with rapidly updated information from services including Facebook and Twitter. With Caffeine, Google moved its back-end indexing system away from MapReduce and onto BigTable, the company’s distributed database platform.
•  Panda [February 2011] – Change to lower the rank of “low-quality sites” or “thin sites” and return higher-quality sites near the top of the search results, e.g. drop rankings for sites containing large amounts of advertising.
•  Penguin [April 2012] – Change to decrease search engine rankings of websites that violate Google’s Webmaster Guidelines, such as keyword stuffing, cloaking, participating in link schemes, deliberate creation of duplicate content, and others.

Page 26:

Google Facts (from end 2001)

•  Indexes 3 billion web pages
  –  If printed, they would result in a stack of paper 200 km high
  –  If a person reads a page per minute (and does nothing else), (s)he would need 6000 years to read them all
•  200 million search queries a day
  –  Approx. 80 billion searches a year!
•  Most searches take less than half a second
•  Support for 35 non-English languages
•  Searchable index contains 3 trillion items
  –  Updated every 28 days

Page 27:

Google architecture (approx.)

1. The web server sends the query to the index servers. The content inside the index servers is similar to the index in the back of a book – it tells which pages contain the words that match the query.

2. The query travels to the doc servers, which actually retrieve the stored documents. Snippets are generated to describe each search result.

3. The search results are returned to the user, typically in less than a second.

Page 28:

Google Advanced Search

Page 29:

Hubs and Authorities

•  HITS – hypertext-induced topic selection (IBM’s Clever project)
•  Defined in a mutually recursive way:
  –  a hub links to many (valuable) authorities;
  –  an authority is linked to by many (good) hubs.
•  Authorities turn out to be pages that offer the best information about a topic.
•  Hubs are pages that do not provide any information themselves, but specify a collection of links on where to find the information.

Page 30:

Hubs and Authorities

•  Use a matrix formulation similar to that of Page Rank, but without the stochastic restriction.
  –  Rows and columns correspond to web pages;
  –  A[i,j] = 1 if page i links to page j;
  –  A[i,j] = 0 otherwise.
•  The transpose of A looks like the matrix for Page Rank, but it has a 1 where the Page Rank matrix has a fraction.
•  An iterative solution where the matrix A is repeatedly applied will result in a diverging solution vector.
•  However, by introducing scaling factors the computed values of “authority” and “hubbiness” for each page can be kept within finite bounds.

Page 31:

Computing Hubbiness and Authority of Pages

•  Let a and h be vectors.
  –  The i-th components of a and h correspond to the degrees of authority and hubbiness of the i-th page, respectively.
•  Let λ and µ be suitable scaling factors. Then:
  –  the hubbiness of each page is the sum of the authorities it links to, scaled by λ:

      h = λ A a

  –  the authority of each page is the sum of the hubbiness of all pages that link to it, scaled by µ:

      a = µ Aᵀ h

Page 32:

Computing Hubbiness and Authority of Pages

•  By simple substitution, we get two equations that relate the vectors a and h only to themselves:

      a = λµ AᵀA a
      h = λµ A Aᵀ h

•  Thus, we can compute a and h by iteratively solving these equations, giving us the principal eigenvectors of the matrices AᵀA and AAᵀ, respectively.

Page 33:

Hubs and Authorities example: Netscape acknowledges Microsoft’s existence

Matrices (pages ordered N, M, A; Netscape links to all three pages, Microsoft links to Amazon, Amazon links to Netscape and Microsoft):

  A   = | 1 1 1 |      Aᵀ  = | 1 0 1 |
        | 0 0 1 |            | 1 0 1 |
        | 1 1 0 |            | 1 1 0 |

  AAᵀ = | 3 1 2 |      AᵀA = | 2 2 1 |
        | 1 1 0 |            | 2 2 1 |
        | 2 0 2 |            | 1 1 2 |

Page 34:

Hubs and Authorities example: Netscape acknowledges Microsoft’s existence

Iteratively solving the equations for a and h, assuming λ = µ = 1:

  h(N):  1,  6,  28,  132,  ...  →  1.000 · h(N)
  h(M):  1,  2,   8,   36,  ...  →  0.268 · h(N)
  h(A):  1,  4,  20,   96,  ...  →  0.732 · h(N)

  a(N):  1,  5,  24,  114,  ...  →  ≈ 1.37 · a(A)
  a(M):  1,  5,  24,  114,  ...  →  ≈ 1.37 · a(A)
  a(A):  1,  4,  18,   84,  ...  →  a(A)
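The iteration is small enough to verify directly (λ = µ = 1; the matrices are built from the example’s adjacency matrix A):

```python
# Reproducing the hub/authority iterates: h ← A·Aᵀ·h and a ← Aᵀ·A·a.
A = [[1, 1, 1],          # page order N, M, A; row i has a 1 per out-link of page i
     [0, 0, 1],
     [1, 1, 0]]
AT = [[A[j][i] for j in range(3)] for i in range(3)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

AAT = matmul(A, AT)      # hub matrix
ATA = matmul(AT, A)      # authority matrix

h = [1, 1, 1]
a = [1, 1, 1]
for _ in range(3):
    h = matvec(AAT, h)   # hubbiness update
    a = matvec(ATA, a)   # authority update

print(h, a)  # [132, 36, 96] [114, 114, 84]
```

Without the scaling factors the values diverge, as noted earlier; only the ratios between components stabilize.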

Page 35:

Authorities and Hubs Pragmatics

•  The system is jump-started by obtaining a set of root pages from a standard text index such as Google or Bing.
•  The iterative process settles very rapidly.
  –  A root set of 3,000 pages requires just 5 rounds of calculations!
  –  The results are independent of the initial estimates.
•  The algorithm naturally separates web sites into clusters.
  –  e.g. a search for “abortion” partitions the Web into a pro-life and a pro-choice community.

Page 36:

Moore’s Law

Page 37:

Web Sizes over Time

•  ~150M pages in 1998
•  ~5B pages in 2005
  –  33× increase
  –  Moore’s law would predict 25×
•  Monthly refresh in 1998; daily refresh in 2005
•  What about 2010?
  –  40B?
•  Where is the content?
  –  Public Web?
  –  Personal Web?

Page 38:

Philosophical Remarks

•  The Web of today is dramatically different from what it was five years ago.
•  Predicting the next five years seems futile.
  –  Will even the basic act of indexing soon become infeasible?
  –  If so, will our notion of searching the Web undergo fundamental changes?
•  The Web’s relentless growth will continue to generate computational challenges for wading through the ever-increasing volume of on-line information.

Page 39:

World Wide Web Statistics

•  According to a 2001 study, there were massively more than 550 billion documents on the Web, mostly in the invisible Web, or deep Web.
•  A 2002 survey of 2,024 million Web pages determined that by far the most Web content was in English (56.4%); next were pages in German (7.7%), French (5.6%), and Japanese (4.9%).
•  A more recent study, which used Web searches in 75 different languages to sample the Web, determined that there were over 11.5 billion Web pages in the publicly indexable Web as of the end of January 2005.
•  As of March 2009, the indexable web contained at least 25.21 billion pages.
•  On July 25, 2008, Google software engineers Jesse Alpert and Nissan Hajaj announced that Google Search had discovered one trillion unique URLs.
•  As of May 2009, over 109.5 million websites operated. Of these, 74% were commercial or other sites operating in the .com generic top-level domain.