1 Graphs & more on Web search 15-211 Fundamental Data Structures and Algorithms Stefan Niculescu & James Lyons March 21, 2002


Page 1:

Graphs & more on Web search

15-211 Fundamental Data Structures and Algorithms

Stefan Niculescu & James Lyons March 21, 2002

Page 2:

Announcements

Page 3:

Homework 5

Homework Assignment #5 will be out on Friday.

Must do some reading in order to complete it.

Must take a progress quiz.

Get started today and as usual, think b4 u hack!

Page 4:

Reading

About graphs:

Chapter 14

About Web search: http://www.cadenza.org/search_engine_terms/srchad.htm

An HTML tutorial: http://www.cwru.edu/help/introHTML/toc.html

Page 5:

Introduction to Graphs

Page 6:

Graphs: an overview

Vertices (aka nodes)

Page 7:

Graphs: an overview

[Figure: a graph whose vertices (aka nodes) are the airports PIT, BOS, JFK, DTW, LAX, and SFO, joined by edges]

Page 8:

Graphs: an overview

[Figure: the airport graph (PIT, BOS, JFK, DTW, LAX, SFO) drawn as an undirected graph, with vertices (aka nodes) and edges labeled]

Page 9:

Graphs: an overview

[Figure: the undirected airport graph (PIT, BOS, JFK, DTW, LAX, SFO) with a weight on each edge; the weights shown are 1987, 2273, 344, 2145, 2462, 618, 211, 318, and 190]

Page 10:

Terminology

Graph G = (V,E)

Set V of vertices (nodes)

Set E of edges

Elements of E are pairs (v,w) where v,w ∈ V.

An edge (v,v) is a self-loop. (Usually assume no self-loops.)

Weighted graph

Elements of E are (v,w,x) where x is a weight.

Page 11:

Terminology, cont’d

Directed graph (digraph)

The edge pairs are ordered

Every edge has a specified direction

The Web is a directed graph

Undirected graph

The edge pairs are unordered

E is a symmetric relation

(v,w) ∈ E implies (w,v) ∈ E

In an undirected graph (v,w) and (w,v) are usually treated as though they are the same edge

Page 12:

Directed Graph (digraph)

[Figure: the airport graph (PIT, BOS, JFK, DTW, LAX, SFO) drawn with directed edges; vertices (aka nodes) and edges are labeled]

Page 13:

Undirected Graph

[Figure: the airport graph (PIT, BOS, JFK, DTW, LAX, SFO) drawn with undirected edges; vertices (aka nodes) and edges are labeled]

Page 14:

Terminology, cont’d

v and w are adjacent (neighbors) if (v,w) ∈ E or (w,v) ∈ E

d(v) (degree of v) = # of neighbors of v (for undirected graphs)

d+(v) (out-degree of v) = # of edges (v,w) ∈ E

d-(v) (in-degree of v) = # of edges (w,v) ∈ E
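
The degree definitions above can be checked mechanically. What follows is a minimal Python sketch, assuming a directed graph stored as a set of (v, w) pairs. The function names and the sample edges are illustrative and do not come from the lecture.

```python
from collections import Counter

def out_degrees(edges):
    """d+(v): number of edges (v, w) leaving v."""
    return Counter(v for v, _ in edges)

def in_degrees(edges):
    """d-(v): number of edges (w, v) entering v."""
    return Counter(w for _, w in edges)

def neighbors(edges, v):
    """Vertices adjacent to v: every w with (v, w) or (w, v) in the edge set."""
    return {w for x, w in edges if x == v} | {x for x, w in edges if w == v}

# Hypothetical directed edges between three of the airports used in the slides.
edges = {("PIT", "JFK"), ("JFK", "BOS"), ("BOS", "PIT"), ("PIT", "BOS")}

print(out_degrees(edges)["PIT"])        # number of edges leaving PIT
print(in_degrees(edges)["PIT"])         # number of edges entering PIT
print(sorted(neighbors(edges, "PIT")))  # airports adjacent to PIT
```

A Counter returns 0 for a vertex that never appears, which matches the convention that an isolated vertex has degree zero.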

Page 15:

Terminology, cont’d

Path

a list of nodes (v[1], v[2],...,v[n]) such that (v[i],v[i+1]) ∈ E for all 0 < i < n

The length of the above path is n-1

Cycle

a path that begins and ends with the same node

Cyclic graph: contains at least one cycle

Acyclic graph: no cycles
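
As a rough illustration of the path and cycle definitions, here is a small Python sketch. It assumes a directed graph given as a set of (v, w) pairs; the helper names and the sample graph are made up for this example.

```python
def is_path(edges, nodes):
    """True if every consecutive pair (v[i], v[i+1]) is an edge."""
    return all((a, b) in edges for a, b in zip(nodes, nodes[1:]))

def path_length(nodes):
    """A path that lists n nodes uses n - 1 edges."""
    return len(nodes) - 1

def is_cycle(edges, nodes):
    """A cycle is a path with at least one edge that starts and ends at the same node."""
    return len(nodes) > 1 and nodes[0] == nodes[-1] and is_path(edges, nodes)

# Hypothetical directed graph on the vertices 1..4.
edges = {(1, 2), (2, 3), (3, 1), (3, 4)}

print(is_path(edges, [1, 2, 3, 4]))   # follows 1->2, 2->3, 3->4
print(path_length([1, 2, 3, 4]))      # number of edges on that path
print(is_cycle(edges, [1, 2, 3, 1]))  # returns to its starting node
```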

Page 16:

Elements of a Graph

[Figure: a graph on the airports PIT, BOS, JFK, DTW, LAX, and SFO]

Page 17:

Terminology, cont’d

Subgraph of a graph G

a subset of V with the corresponding edges from E.

Connected graph

a graph where for every pair of nodes there exists a sequence of edges starting at one node and ending at the other.

Connected component of a graph G

a maximal connected subgraph of G.
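
One common way to find connected components is a breadth-first search from each unvisited vertex. The sketch below assumes an undirected graph given as a vertex list and a list of edge pairs; the function names and the sample edges are illustrative, not taken from the slides.

```python
from collections import deque

def connected_components(vertices, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adj = {v: set() for v in vertices}
    for v, w in edges:
        adj[v].add(w)   # undirected: record the edge in both directions
        adj[w].add(v)
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:                      # breadth-first search from start
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components

def is_connected(vertices, edges):
    """A graph is connected when it has at most one component."""
    return len(connected_components(vertices, edges)) <= 1

# Hypothetical edges among the airports from the slides.
vertices = ["PIT", "BOS", "JFK", "DTW", "LAX", "SFO"]
edges = [("PIT", "JFK"), ("JFK", "BOS"), ("LAX", "SFO")]

print(connected_components(vertices, edges))
print(is_connected(vertices, edges))
```

Each component returned this way is maximal: the search only stops when no further vertex can be reached.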

Page 18:

Elements of a Graph, cont’d

[Figure: a graph on the airports PIT, BOS, JFK, DTW, LAX, and SFO]

Page 19:

Terminology, cont’d

Unrooted (undirected) tree

an acyclic connected undirected graph

Theorem: in any unrooted tree T=(V,E), |V|=|E|+1.

Proof: by induction on |V|

Base case |V|=1 (|E|=0)

Show there exists a node of degree one

Remove that node and apply induction hypothesis

Page 20:

Example of an unrooted tree

[Figure: an unrooted tree on the airports PIT, BOS, JFK, DTW, LAX, and SFO]

Page 21:

Quiz Break

Page 22:

So, is this a connected graph?

Cyclic or Acyclic?

Directed or Undirected?

[Figure: a graph on the vertices 1 through 6]

Page 23:

Directed graph (unconnected)

Cyclic or Acyclic?

[Figure: a directed graph on the vertices 1 through 6]

Page 24:

Representing Graphs

Page 25:

Representing graphs

Adjacency matrix

2-dimensional array

For each edge (u,v), set A[u][v] to true; otherwise false

[Figure: a directed graph on the vertices 1 through 7]

(x = edge present, . = no edge)

   1  2  3  4  5  6  7
1  .  .  x  x  .  .  .
2  .  .  .  x  x  .  .
3  .  .  .  .  .  x  .
4  .  .  x  .  .  x  x
5  .  .  .  x  .  .  x
6  .  .  .  .  .  .  .
7  .  .  .  .  .  .  .

Adjacency lists

For each vertex, keep a list of adjacent vertices

1: 3, 4
2: 4, 5
3: 6
4: 3, 6, 7
5: 4, 7
6:
7:
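
Both representations can be built from the same edge list. The Python sketch below assumes the slide's example graph has the adjacency lists 1: 3, 4; 2: 4, 5; 3: 6; 4: 3, 6, 7; 5: 4, 7; 6 and 7 empty (a reading of the flattened figure, so treat it as an assumption). The function names are illustrative.

```python
# Directed edges as read from the slide's adjacency lists (assumed reading).
EDGES = [(1, 3), (1, 4), (2, 4), (2, 5), (3, 6),
         (4, 3), (4, 6), (4, 7), (5, 4), (5, 7)]
N = 7  # vertices are numbered 1..N

def adjacency_matrix(n, edges):
    """A[u][v] is True exactly when (u, v) is an edge. Row and column 0 are unused."""
    a = [[False] * (n + 1) for _ in range(n + 1)]
    for u, v in edges:
        a[u][v] = True
    return a

def adjacency_lists(n, edges):
    """For each vertex, the list of vertices it has an edge to."""
    adj = {u: [] for u in range(1, n + 1)}
    for u, v in edges:
        adj[u].append(v)
    return adj

A = adjacency_matrix(N, EDGES)
L = adjacency_lists(N, EDGES)

print(A[4][7], A[7][4])  # constant-time edge test in the matrix
print(L[4])              # neighbors of 4 without scanning a whole row
```

The matrix always uses on the order of |V|^2 cells, while the lists use space proportional to |V| + |E|, which is the trade-off the next slide discusses.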

Page 26:

Choosing a representation

Size of V relative to size of E is a primary factor.

Dense: |E|/|V| is large

Sparse: |E|/|V| is small

Adjacency matrix is expensive in terms of space if the graph is sparse ($O(|V|^2) > O(|E|+|V|)$).

Adjacency list is expensive in terms of checking edges if the graph is dense.

Page 27:

Size of a Graph

How many undirected graphs for a set of n given vertices?

Answer: $2^{\binom{n}{2}} = 2^{n(n-1)/2}$

How many edges in an undirected graph with n vertices?

Minimum: 0

Maximum: $\binom{n}{2} = n(n-1)/2$
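
An undirected graph without self-loops has at most n(n-1)/2 edges, one per unordered pair of vertices. Each of those possible edges is either present or absent, which gives 2^(n(n-1)/2) distinct graphs on n labeled vertices. A short Python check of the arithmetic, with illustrative function names:

```python
from math import comb

def max_edges(n):
    """Largest number of edges in an undirected graph on n vertices (no self-loops)."""
    return comb(n, 2)

def count_graphs(n):
    """Number of undirected graphs on n labeled vertices: each possible edge is in or out."""
    return 2 ** comb(n, 2)

for n in range(1, 6):
    print(n, max_edges(n), count_graphs(n))
```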

Page 28:

Graphs are Everywhere

Page 29:

Graphs as models

The Internet

Communication pathways

DNS hierarchy

The WWW

The physical world

Road topology and maps

Airline routes and fares

Electrical circuits

Job and manufacturing scheduling

Page 30:

Graphs as models

Physical objects are often modeled by meshes, which are a particular kind of graph structure.

By Jonathan Shewchuk

Page 31:

More graph models

See also http://java.sun.com/applets/jdk/1.1/demo/WireFrame/index.html and http://www.mapquest.com

NASA CFD labs

By Paul Heckbert and David Garland

Page 32:

Structure of the Internet

[Figure: map showing Backbone 1, Backbone 2, Backbone 3, and Backbone 4, 5, N; four NAPs; Regional A and Regional B; and Europe, Japan, and Australia]

SOURCE: CISCO SYSTEMS

MAPS UUNET MAP

Page 33:

Relationship graphs

Graphs are also used to model relationships among entities.

Scheduling and resource constraints.

Inheritance hierarchies.

[Figure: a graph relating the courses 15-113, 15-151, 15-211, 15-212, 15-213, 15-251, 15-312, 15-411, 15-412, 15-451, and 15-462]

Page 34:

Where are we right now?

Page 35:

The Web Graph

Page 36:

Web Graph

Documents written in HTML

HTML (HyperText Markup Language)

TAGS:

<head>, <body>, <title>

<H1>, <p>

<a> (anchor, link)

Page 37:

A simple HTML example

<html>
<head>
<TITLE>A Simple HTML Example</TITLE>
</head>
<body>
<a href="http://www.cmu.edu">
Carnegie Mellon University </a>
</body>
</html>

Page 38:

Web Graph

A directed graph where:

V = (all web pages)

E = (all HTML-defined links from one web page to another)
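
To make the edge set concrete, the sketch below pulls the outgoing links out of one HTML page using Python's standard html.parser module and returns them as directed edges (page, target). The class name, the function name, and the example.org address of the page are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def out_edges(page_url, html):
    """Directed edges of the web graph that start at page_url."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return [(page_url, target) for target in parser.links]

PAGE = """<html><head><TITLE>A Simple HTML Example</TITLE></head>
<body><a href="http://www.cmu.edu">Carnegie Mellon University </a></body></html>"""

# The page address below is hypothetical.
print(out_edges("http://example.org/simple.html", PAGE))
```

Relative links are resolved against the page's own URL, so every edge ends at an absolute address that can serve as a vertex name.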

Page 39:

Web Graph

[Figure: web pages joined by <a href …> links]

Web pages are nodes (vertices)

HTML references are links (edges)

Page 40:

Is the Web Graph connected?

Sparse, unconnected graph

AUTHORITIES

web pages containing a “reasonable” amount of relevant information about a specific topic

HUBS

web pages that point (link) to many pages containing relevant information about a given topic

Page 41:

Finding Hubs & Authorities

Nice iterative algorithm by Jon Kleinberg

www.cs.cornell.edu/home/kleinber/auth.ps

HUB: Avrim’s Machine Learning page

www.cs.cmu.edu/~avrim/ML

AUTHORITY: www.java.sun.com

Extra credit opportunity for homework 5
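
Kleinberg's algorithm is commonly known as HITS. Its core idea is mutual reinforcement: a page's authority score is the sum of the hub scores of the pages linking to it, and a page's hub score is the sum of the authority scores of the pages it links to, with the scores renormalized after every round. The following is a minimal Python sketch of that iteration on a tiny made-up link graph. It is not the code from the paper or from the homework, and the page names and iteration count are assumptions.

```python
from math import sqrt

def hits(edges, iterations=20):
    """Return (hub, authority) score dictionaries for a directed link graph."""
    nodes = sorted({v for edge in edges for v in edge})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: add up the hub scores of the pages that link in.
        auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub update: add up the authority scores of the pages linked to.
        new_hub = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = new_hub
        # Normalize both score vectors to unit length so they stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(x * x for x in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
    return hub, auth

# Hypothetical link graph: two hub-like pages and three pages they point to.
EDGES = [("hub1", "a"), ("hub1", "b"),
         ("hub2", "a"), ("hub2", "b"), ("hub2", "c"),
         ("other", "c")]

hub, auth = hits(EDGES)
print(max(hub, key=hub.get))               # page with the highest hub score
print(sorted(auth, key=auth.get)[::-1][:2])  # two highest authority scores
```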

Page 42:

Graphs: Application

Search Engines

Page 43:

Search Engines

Page 44:

What are they?

Tools for finding information on the Web

Problem: “hidden” databases, e.g. New York Times (i.e., databases of keywords hosted by the web site itself; these cannot be accessed by Yahoo, Google, etc.)

Search engine

A machine-constructed index (usually by keyword)

So many search engines, we need search engines to find them: Searchenginecollosus.com

Page 45:

Did you know?

Vivisimo was developed here at CMU

Developed by Prof. Raul Valdes-Perez

Developed in 2000

Page 46:

SE Architecture

Spider

Crawls the web to find pages. Follows hyperlinks.

Never stops

Indexer

Produces data structures for fast searching of all words in the pages (i.e., it updates the lexicon)

Retriever

Query interface

Database lookup to find hits

1 billion documents

1 TB RAM, many terabytes of disk

Ranking

Page 47:

A look at Google

10,000 servers (WOW!)

Web site traffic grows over 20% per month

Spiders over 2 Billion URLs

Supports 28 language searches

Over 100 million searches per day

“Even CMU uses it!”

Page 48:

Google’s server farm

Page 49:

Web Crawlers

Start with an initial page P0. Find URLs on P0 and add them to a queue

When done with P0, pass it to an indexing program, get a page P1 from the queue and repeat

Can be specialized (e.g. only look for email addresses)

Issues

Which page to look at next? (Special subjects, recency)

How deep within a site do you go (depth search)?

How frequently to visit pages?
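
The queue-based loop described above amounts to a breadth-first traversal of the web graph. Here is a minimal Python sketch that runs against a small in-memory "web" instead of the network. The crawl function, its fetch_links and index_page parameters, the max_pages limit, and the toy pages are all assumptions made for illustration; a real spider also has to handle the issues listed on the slide.

```python
from collections import deque

def crawl(start, fetch_links, index_page, max_pages=100):
    """Visit pages breadth-first from start and return them in visiting order.

    fetch_links(page) returns the URLs found on a page.
    index_page(page) hands a finished page to the indexer.
    """
    queue = deque([start])
    seen = {start}            # pages already queued, so none is fetched twice
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        for url in fetch_links(page):   # find the URLs on this page
            if url not in seen:
                seen.add(url)
                queue.append(url)
        index_page(page)                # done with the page: pass it to the indexer
        visited.append(page)
    return visited

# A toy web: each page maps to the pages it links to.
TOY_WEB = {"P0": ["P1", "P2"], "P1": ["P2", "P0"], "P2": ["P3"], "P3": []}

indexed = []
order = crawl("P0", lambda p: TOY_WEB.get(p, []), indexed.append)
print(order)
```

Swapping the queue for a priority queue is one way to express a policy for which page to look at next.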

Page 50:

So, why Spider the Web?

Refresh Collection by deleting dead links

OK if index is slightly smaller

Done every 1-2 weeks in best engines

Finding new sites

Respider the entire web

Done every 2-4 weeks in best engines

Page 51:

Cost of Spidering

Spider can (and does) run in parallel on hundreds of servers

Very high network connectivity (e.g. T3 line)

Servers can migrate from spidering to query processing depending on time-of-day load

Running a full web spider takes days even with hundreds of dedicated servers

Page 52:

Indexing

Arrangement of data (data structure) to permit fast searching

Which list is easier to search?

sow fox pig eel yak hen ant cat dog hog

ant cat dog eel fox hen hog pig sow yak

Sorting helps. Why?

Permits binary search. About $\log_2 n$ probes into list

$\log_2(1\ \text{billion}) \approx 30$

Permits interpolation search. About $\log_2(\log_2 n)$ probes (on average, for roughly uniformly distributed keys)

$\log_2 \log_2(1\ \text{billion}) \approx 5$
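
The probe counts can be checked with a few lines of Python. The sketch below is a standard binary search that also counts its probes, applied to the sorted animal list from the slide; the function name and the return convention are choices made for this example.

```python
from math import log2

def binary_search(sorted_items, target):
    """Return (index, probes); index is None when target is absent."""
    lo, hi = 0, len(sorted_items) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                       # one probe = one look at the list
        if sorted_items[mid] == target:
            return mid, probes
        if sorted_items[mid] < target:
            lo = mid + 1                  # target can only be in the upper half
        else:
            hi = mid - 1                  # target can only be in the lower half
    return None, probes

words = ["ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"]

print(binary_search(words, "pig"))
print(log2(1e9))          # probes needed for binary search over a billion items
print(log2(log2(1e9)))    # the corresponding double-logarithm estimate
```

Each probe halves the remaining range, which is why the count grows with the base-2 logarithm of the list size.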

Page 53:

Inverted Files

A file is a list of words by position

- First entry is the word in position 1 (first word)

- Entry 4562 is the word in position 4562 (4562nd word)

- Last entry is the last word

An inverted file is a list of positions by word!

FILE

[Figure: the FILE, with word positions (POS) 1, 10, 20, 30, and 36 marked]

INVERTED FILE

a (1, 4, 40)
entry (11, 20, 31)
file (2, 38)
list (5, 41)
position (9, 16, 26)
positions (44)
word (14, 19, 24, 29, 35, 45)
words (7)
4562 (21, 27)
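
Building an inverted file is a single pass over the words. The Python sketch below numbers words from 1, as the slide does, and records the positions of each word. The tokenization rule (lowercase runs of letters and digits) and the function name are assumptions for this example, and only the first sentence of the slide's text is used as input.

```python
import re
from collections import defaultdict

def invert(text):
    """Map each word to the list of positions (1-based) where it occurs."""
    positions = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for pos, word in enumerate(words, start=1):
        positions[word].append(pos)
    return dict(positions)

TEXT = "A file is a list of words by position"
inv = invert(TEXT)

for word in sorted(inv):
    print(word, inv[word])
```

Looking up a word is then a dictionary access rather than a scan of the whole file.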

Page 54:

Inverted Files for Multiple Documents

LEXICON

WORD        NDOCS  PTR
jezebel     20
jezer       3
jezerit     1
jeziah      1
jeziel      1
jezliah     1
jezoar      1
jezrahliah  1
jezreel     39

WORD INDEX

DOCID  OCCUR  POS 1  POS 2  . . .

Entries for jezebel:

34   6  1 118 2087 3922 3981 5002
44   3  215 2291 3010
56   4  5 22 134 992
. . .

“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .

Other entries shown in the figure (one block is labeled jezoar):

107  4  322 354 381 405
232  6  15 195 248 1897 1951 2192
677  1  481
713  3  42 312 802

566  3  203 245 287
67   1  132
. . .
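
The lexicon and word index can be modeled as a dictionary from each word to its per-document position lists. The Python sketch below is one possible in-memory layout, not the on-disk format a real engine uses. The function names, the whitespace tokenization, and the two sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """docs maps doc_id to text. Returns {word: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def lexicon_entry(index, word):
    """Return (NDOCS, rows) where each row is (DOCID, OCCUR, positions)."""
    postings = index.get(word, {})
    rows = [(doc_id, len(pos), pos) for doc_id, pos in sorted(postings.items())]
    return len(postings), rows

# Two made-up documents; the document ids are arbitrary.
DOCS = {34: "jezebel went out and jezebel spoke",
        44: "then came jezebel"}

index = build_index(DOCS)
print(lexicon_entry(index, "jezebel"))
```

NDOCS is the number of documents containing the word, and OCCUR is the number of positions recorded for one document, mirroring the column headings on the slide.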

Page 55:

Ranking (Scoring) Hits

Hits must be presented in some order

What order? Relevance, recency, popularity, reliability, alphabetic?

Some ranking methods

Presence of keywords in title of document

Closeness of keywords to start of document

Frequency of keyword in document

Link popularity (how many pages point to this one)
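
One simple way to combine these signals is a weighted sum. The Python sketch below is only an illustration of that idea: the weights, the document fields, and the function names are invented here, and real engines keep their ranking formulas secret.

```python
def score(doc, keyword, in_links, weights=(3.0, 2.0, 1.0, 0.5)):
    """Weighted sum of the four signals listed above (weights are hypothetical)."""
    w_title, w_early, w_freq, w_links = weights
    kw = keyword.lower()
    title_words = doc["title"].lower().split()
    body_words = doc["body"].lower().split()

    in_title = 1.0 if kw in title_words else 0.0          # keyword in title
    if kw in body_words:
        first = body_words.index(kw)
        closeness = 1.0 - first / len(body_words)          # nearer the start is better
    else:
        closeness = 0.0
    frequency = body_words.count(kw) / max(len(body_words), 1)  # keyword frequency

    return (w_title * in_title + w_early * closeness
            + w_freq * frequency + w_links * in_links)     # plus link popularity

def rank(docs, keyword, link_counts):
    """Document ids ordered from highest to lowest score."""
    return sorted(docs,
                  key=lambda d: score(docs[d], keyword, link_counts.get(d, 0)),
                  reverse=True)

# Made-up documents and in-link counts.
DOCS = {"a": {"title": "Graph algorithms", "body": "graph search on a graph"},
        "b": {"title": "Cooking", "body": "soup and a graph of prices"}}
LINKS = {"a": 2, "b": 1}

print(rank(DOCS, "graph", LINKS))
```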

Page 56:

Spamdexing & Link Popularity

Spamdexing means influencing retrieval ranking by altering a web page. (Puts “spam” in the index)

www.linkpopularity.com

Link popularity is used for ranking

Many measures

Number of links in (In-links)

Weighted number of links in (by weight of referring page)

Page 57:

Search Engine Sizes

[Figure: chart of search engine sizes]

AV = AltaVista
EX = Excite
FAST = FAST
GG = Google
INK = Inktomi
NL = Northern Light

SOURCE: SEARCHENGINEWATCH.COM

Page 58:

Historical Notes

WebCrawler: first documented spider

Lycos: first large-scale spider

Top-honors for most web pages spidered: First Lycos, then AltaVista, then Google...

Page 59:

Overview

Engines are a critical Web resource

Very sophisticated, high technology, but secret

Most spidering re-traverses stable web graph

They don’t spider the Web completely

Spamdexing is a problem

New paradigms needed as Web grows

What about images, music, video?

Google’s image search engine

Napster