history of search and web search engines - seminar on web search

2 December 2005

Seminar on Web Search History of Search and Web Search Engines

Prof. Beat Signer

Department of Computer Science

Vrije Universiteit Brussel

http://www.beatsigner.com


Beat Signer - Department of Computer Science - [email protected]

September 5, 2011

Seminar Organisation

Prof. Beat Signer

WISE Lab, Vrije Universiteit Brussel

[email protected]

cross-media information spaces and architectures

interactive paper and augmented reality

multimodal and multi-touch interaction

Content of the Seminar

history of search and web search engines

search engine optimisation (SEO) and search engine marketing (SEM)

current and future trends in web search



Early "Documents"



Papyrus

Greeks and Romans stored information on papyrus scrolls

Tags with a summary of the content facilitated the retrieval of information

Table of contents was introduced around 100 BC

Parchment (vellum) came up as an alternative

bound in book form



Paper

Invented in China (105 AD)

Brought to Europe only in the twelfth century

Took another 300 years before paper became the major writing material

How long will we still use paper?

electronic paper vs. augmented paper



Printing Press

Johann Gutenberg invented the printing press in 1450

Gutenberg Bible published in 1455

Growing libraries and need to search for information



Reading Wheel (Bookwheel)

Described by Agostino Ramelli in 1588

Keep several books open to read from them at the same time

comparable to modern tabbed browsing

The reading wheel has never really been built

Could be seen as a predecessor of hypertext



Dewey Decimal Classification (DDC)

Library classification system developed by Melvil Dewey in 1876

Hierarchical classification

10 main classes with 10 divisions each and 10 sections per division

total of 1000 sections

often separate fiction section

Documents can appear in

more than one class



Dewey Decimal Classification (DDC) ...

After the three numbers, decimals can be used for further subclassification

Different Alternatives

Library of Congress classification

Universal Decimal Classification (UDC)



Dewey Decimal Classification (DDC) ...

000-099 Computer Science, Information and General Works
    000 Computer Science, Knowledge and Systems
        000 Computer Science, Knowledge and General Works
        ...
        005 Computer Programming, Programs and Data
        ...
        009 [Unassigned]
    010 Bibliographies
    ...
100-199 Philosophy and Psychology
200-299 Religion
300-399 Social Sciences
    340 Law
        341 International Law
400-499 Language
500-599 Science
600-699 Technology
700-799 Arts
800-899 Literature
900-999 History, Geography and Biography



"As We May Think" (1945)

... When data of any sort are placed in storage, they are filed alphabetically or numerically, and information is found (when it is) by tracing it down from subclass to subclass. It can be in only one place, unless duplicates are used; one has to have rules as to which path will locate it, and the rules are cumbersome. Having found one item, moreover, one has to emerge from the system and re-enter on a new path. The human mind does not work that way. It operates by association. ...

Vannevar Bush



"As We May Think" (1945) …

... It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing. ...

Vannevar Bush, As We May Think, Atlantic Monthly, July 1945

Vannevar Bush



"As We May Think" (1945) …

Bush's article "As We May Think" (1945) is often seen as the "origin" of hypertext

Article introduces the Memex

prototypical hypertext machine

store and access information

follow cross-references in the form of associative trails between pieces of information (microfilms)

trail blazers are those who find delight in the task of establishing useful trails

Memex



Memex Movie



Hypertext (1965)

Ted Nelson coined the term hypertext

Nelson started Project Xanadu in 1960

first hypertext project

nonsequential writing

referencing/embedding parts of a document in another document (transclusion)

transpointing windows

bidirectional (bivisible) links

version and rights management

XanaduSpace 1.0 was released as part of Project Xanadu in 2007

Ted Nelson



World Wide Web (WWW)

Networked hypertext system (over ARPANET) to share information at CERN

first draft in March 1989

The Information Mine, Information Mesh, …?

Components by end of 1990

HyperText Transfer Protocol (HTTP)

HyperText Markup Language (HTML)

HTTP server software

Web browser (WorldWideWeb)

First public "release" in August 1991

Tim Berners-Lee Robert Cailliau



Search Engine History

Early "search engines" include various systems starting with Bush's Memex

Archie (1990)

first Internet search engine

indexing of files on FTP servers

W3Catalog (September 1993)

first "web search engine"

mirroring and integration of manually maintained catalogues

JumpStation (December 1993)

first web search engine combining crawling, indexing and searching



Search Engine History ...

In the following two years (1994/1995) many new search engines appeared

AltaVista, Infoseek, Excite, Inktomi, Yahoo!, ...

Two categories of early Web search solutions

full text search

- based on an index that is automatically created by a web crawler in combination with an indexer

- e.g. AltaVista or Infoseek

manually maintained classification (hierarchy) of webpages

- significant human editing effort

- e.g. Yahoo!



Information Retrieval

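Precision and recall can be used to measure the performance of different information retrieval algorithms

\[ \text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \]

\[ \text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|} \]

[Figure: document collection D1 to D10 with the relevant documents D3, D5, D8 and D9; the query retrieves D1, D3, D8, D9 and D10]

\[ \text{precision} = \frac{3}{5} = 0.6 \qquad \text{recall} = \frac{3}{4} = 0.75 \]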



Information Retrieval ...

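Often a combination of precision and recall, the so-called F-score (harmonic mean), is used as a single measure

\[ \text{F-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]

[Figure: document collection D1 to D10 with the relevant documents D3, D5, D8 and D9; the query retrieves D1, D2, D3, D5, D8, D9 and D10]

precision = 0.57, recall = 1, F-score = 0.73

[Figure: same document collection; the query retrieves D1, D3, D8, D9 and D10]

precision = 0.6, recall = 0.75, F-score = 0.67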
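The numbers on the two slides above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `precision_recall_f` is made up for illustration, and the relevant set {D3, D5, D8, D9} as well as the two retrieved sets are read off the slide figures, so they are an assumption.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two sets of document ids."""
    hits = len(relevant & retrieved)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Relevant documents as assumed from the slide figure
relevant = {"D3", "D5", "D8", "D9"}

# Narrow query: 5 retrieved documents, 3 of them relevant
narrow = precision_recall_f(relevant, {"D1", "D3", "D8", "D9", "D10"})
# Broad query: 7 retrieved documents, including all 4 relevant ones
broad = precision_recall_f(relevant, {"D1", "D2", "D3", "D5", "D8", "D9", "D10"})

print([round(v, 2) for v in narrow])  # [0.6, 0.75, 0.67]
print([round(v, 2) for v in broad])   # [0.57, 1.0, 0.73]
```

The broad query reaches perfect recall but pays for it with lower precision; the F-score rewards it slightly more than the narrow query in this example.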



Boolean Model

Based on set theory and boolean logic

Exact matching of documents to a user query

Uses the boolean AND, OR and NOT operators

query: Shopping AND Ghent AND NOT Delhaize

computation: 101110 AND 100111 AND 000111 = 000110

result: document set {D4,D5}

[Figure: term-document incidence matrix with one row per index term (Bank, Delhaize, Ghent, Metro, Shopping, Train, ...) and one column per document (D1 to D6); a 1 means that the term occurs in the document. Rows used by the query: Shopping = 101110, Ghent = 100111, Delhaize = 111000]
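The query evaluation on this slide maps directly onto bitwise operations. The following sketch only uses the three incidence rows that can be derived from the computation line (Shopping, Ghent and Delhaize); the helper names `bits` and `docs_of` are hypothetical.

```python
DOCS = ["D1", "D2", "D3", "D4", "D5", "D6"]

# Incidence rows as bit strings, one character per document (D1 first)
INCIDENCE = {
    "Shopping": "101110",
    "Ghent": "100111",
    "Delhaize": "111000",
}

# Mask that keeps the result of NOT within the six document bits
MASK = (1 << len(DOCS)) - 1


def bits(term):
    """Incidence row of a term as an integer bit vector."""
    return int(INCIDENCE[term], 2)


def docs_of(vector):
    """Translate a bit vector back into a list of document names."""
    pattern = format(vector, f"0{len(DOCS)}b")
    return [doc for doc, bit in zip(DOCS, pattern) if bit == "1"]


# Shopping AND Ghent AND NOT Delhaize
result = bits("Shopping") & bits("Ghent") & (~bits("Delhaize") & MASK)
print(format(result, "06b"), docs_of(result))  # 000110 ['D4', 'D5']
```

Since each operator is a single machine instruction per word of the bit vector, this also illustrates why boolean queries are cheap to process.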



Boolean Model ...

Advantages

relatively easy to implement and scalable

fast query processing based on parallel scanning of indexes

Disadvantages

does not pay attention to synonymy

does not pay attention to polysemy

no ranking of output

often the user has to learn a special syntax such as the use of double quotes to search for phrases

Variants of the boolean model form the basis for many search engines



Vector Space Model

Algebraic model representing text documents and queries as vectors based on the index terms

one dimension for each term

Compute the similarity (angle) between the query vector and the document vectors

Advantages

simple model based on linear algebra

partial matching with relevance scoring for results

potential query reevaluation based on user relevance feedback

Disadvantages

computationally expensive (similarity measures for each query)

limited scalability
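A minimal sketch of the vector space model, assuming raw term frequencies as vector components (no tf-idf weighting) and cosine similarity as the angle measure. The three documents and the query are invented for illustration and do not come from the slides.

```python
import math
from collections import Counter


def term_vector(text):
    """Sparse term-frequency vector: one dimension per index term."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini collection
documents = {
    "D1": "shopping in ghent",
    "D2": "train to ghent",
    "D3": "bank holiday",
}

query = term_vector("ghent shopping")
scores = {doc: cosine_similarity(query, term_vector(text)) for doc, text in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D1', 'D2', 'D3']
```

Unlike the boolean model, D2 is still returned as a partial match, only with a lower score than D1.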



Web Search Engines

Most web search engines are based on traditional information retrieval techniques but they have to be adapted to deal with the characteristics of the Web

immense amount of web resources (>50 billion webpages)

hyperlinked resources

dynamic content with frequent updates

self-organised web resources

Evaluation of performance

no standard collections

often based on user studies (satisfaction)

Of course not only the precision and recall but also the query answer time is an important issue



What About Old Content?



The Internet Archive



Web Crawler

A web crawler or spider is used to create an index of webpages to be used by a web search engine

any web search is then based on this index

Web crawler has to deal with the following issues

freshness

- the index should be updated regularly (based on webpage update frequency)

quality

- since not all webpages can be indexed, the crawler should give priority to "high quality" pages

scalability

- it should be possible to increase the crawl rate by just adding additional servers (modular architecture)

- e.g. the estimated number of Google servers in 2007 was 1'000'000 (including not only the crawler but the entire Google platform)



Web Crawler ...

distribution

- the crawler should be able to run in a distributed manner (computer centers all over the world)

robustness

- the Web contains a lot of pages with errors and a crawler has to deal with these problems

- e.g. deal with a web server that creates an unlimited number of "virtual web pages" (crawler trap)

efficiency

- resources (e.g. network bandwidth) should be used in a most efficient way

crawl rates

- the crawler should pay attention to existing web server policies (e.g. revisit-after HTML meta tag or robots.txt file)

robots.txt

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
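A crawler could check the robots.txt rules shown above with Python's standard `urllib.robotparser` module before fetching a page. This is only a sketch: the rules are parsed from a string instead of being downloaded, and the host `www.example.com`, the user agent `ExampleCrawler` and the helper `may_crawl` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the slide
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without a network request


def may_crawl(url, user_agent="ExampleCrawler"):
    """True if the parsed rules allow the given user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(may_crawl("http://www.example.com/index.html"))      # True
print(may_crawl("http://www.example.com/cgi-bin/search"))  # False
print(may_crawl("http://www.example.com/tmp/cache.html"))  # False
```

In a real crawler the file would be fetched once per host and cached, so that the policy check does not cost an extra request per page.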



Web Search Engine Architecture

[Figure: web search engine architecture with the following components: WWW, Crawler, URL Pool, Storage Manager (content already added?), Page Repository, Indexers, Document Index (inverted index), Special Indexes, URL Handler (filter, normalisation and duplicate elimination), URL Repository, Client, Query Handler and Ranking]
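The inverted index named in the architecture can be sketched as a mapping from each term to the set of pages that contain it. The page texts, the URLs and the function names below are hypothetical, and real indexers additionally store term positions and frequencies and apply stemming and stop-word removal.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map every term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical page repository
pages = {
    "http://example.org/a": "history of web search",
    "http://example.org/b": "web crawler architecture",
    "http://example.org/c": "history of paper",
}

index = build_inverted_index(pages)
print(sorted(search(index, "web history")))  # ['http://example.org/a']
```

The query handler only touches the posting sets of the query terms, which is why lookups stay fast even when the page repository is very large.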



Pre-1998 Web Search

Find all documents for a given query term

use information retrieval (IR) solutions

- boolean model

- vector space model

- ...

ranking based on "on-page factors"

problem: poor quality of search results (order)

Larry Page and Sergey Brin proposed to compute the absolute quality of a page called PageRank

based on the number and quality of pages linking to a page (votes)

query-independent



Origins of PageRank

Developed as part of an academic project at Stanford University

research platform to aid understanding of large-scale web data and enable researchers to easily experiment with new search technologies

Larry Page and Sergey Brin worked on the project about a new kind of search engine (1995-1998) which finally led to a functional prototype called Google

Larry Page Sergey Brin



PageRank

A page Pi has a high PageRank Ri if there are many pages linking to it

or, if there are some pages with a high PageRank linking to it

Total score = IR score × PageRank

[Figure: link graph of the pages P1 to P8 with their PageRank values R1 to R8]



Basic PageRank Algorithm

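\[ R(P_i) = \sum_{P_j \in B_i} \frac{R(P_j)}{L_j} \]

where B_i is the set of pages that link to page P_i

L_j is the number of outgoing links for page P_j

[Figure: example graph with the pages P1, P2 and P3. Initial values: R(P1) = 1, R(P2) = 1, R(P3) = 1. After the first iteration: R(P1) = 1.5, R(P2) = 1.5, R(P3) = 0.75. After the second iteration the values are unchanged: R(P1) = 1.5, R(P2) = 1.5, R(P3) = 0.75]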



Matrix Representation

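Let us define a hyperlink matrix H

\[ H_{ij} = \begin{cases} 1/L_j & \text{if } P_j \in B_i \\ 0 & \text{otherwise} \end{cases} \]

[Figure: example graph with the pages P1, P2 and P3]

\[ H = \begin{pmatrix} 0 & \tfrac{1}{2} & 1 \\ 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & 0 \end{pmatrix} \quad \text{and} \quad R = [R(P_i)] \]

\[ R = HR \]

R is an eigenvector of H with eigenvalue 1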



Matrix Representation ...

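We can use the power method to find R

sparse matrix H with 40 billion columns and rows but only an average of 10 non-zero entries in each column

\[ R_{t+1} = H R_t \]

For our example

\[ H = \begin{pmatrix} 0 & \tfrac{1}{2} & 1 \\ 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & 0 \end{pmatrix} \]

this results in

\[ R = \begin{pmatrix} 2 & 2 & 1 \end{pmatrix}^T \quad \text{or} \quad R = \begin{pmatrix} 0.4 & 0.4 & 0.2 \end{pmatrix}^T \]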
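The power method for the three-page example can be sketched in plain Python. The matrix below assumes the example graph with the links P1→P2, P2→P1, P2→P3 and P3→P1 (so that column j holds 1/L_j for the pages P_j links to); the start vector, the iteration limit and the tolerance are arbitrary choices, and a dense list of lists would of course not scale to the real Web.

```python
# Hyperlink matrix H of the assumed example graph
H = [
    [0.0, 0.5, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]


def power_method(matrix, iterations=200, tolerance=1e-10):
    """Iterate R_{t+1} = H R_t from a uniform start vector until R stops changing."""
    n = len(matrix)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [sum(matrix[i][j] * rank[j] for j in range(n)) for i in range(n)]
        if max(abs(new - old) for new, old in zip(new_rank, rank)) < tolerance:
            return new_rank
        rank = new_rank
    return rank


R = power_method(H)
print([round(r, 3) for r in R])  # [0.4, 0.4, 0.2]
```

Because every column of H sums to 1, the entries of R keep summing to 1 in each step, so no renormalisation is needed in this example.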



Dangling Pages (Rank Sink)

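Problem with pages that have no outbound links (e.g. P2)

[Figure: example graph with the pages P1 and P2, where P1 links to P2]

\[ H = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \quad \text{and} \quad R = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \]

Stochastic adjustment

if page P_j has no outgoing links then replace column j with 1/n (where n is the number of pages)

\[ C = \begin{pmatrix} 0 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} \end{pmatrix} \quad \text{and} \quad S = H + C = \begin{pmatrix} 0 & \tfrac{1}{2} \\ 1 & \tfrac{1}{2} \end{pmatrix} \]

New stochastic matrix S always has a stationary vector R

can also be interpreted as a Markov chain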



Strongly Connected Pages (Graph)

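Add new transition probabilities between all pages

with probability d we follow the hyperlink structure S

with probability 1-d we choose a random page

matrix G becomes irreducible

Google matrix G reflects a random surfer

no modelling of back button

\[ G = dS + (1-d)\frac{1}{n}\mathbf{1} \qquad R = GR \]

where \mathbf{1} denotes the n × n matrix of ones

[Figure: example graph with the pages P1 to P5; the added random transitions are labelled 1-d]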
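A sketch of the random surfer model that combines the stochastic adjustment for dangling pages with the damping factor d. The value d = 0.85 is an assumption (the slides do not state one), as are the function name `pagerank` and the dictionary-based graph format; the code never builds G explicitly but applies G = dS + (1-d)/n · 1 to the rank vector in each step.

```python
def pagerank(links, d=0.85, iterations=200, tolerance=1e-10):
    """links maps each page to the list of pages it links to (all targets must be keys)."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by dangling pages is spread evenly over all pages (matrix S)
        dangling = sum(rank[page] for page in pages if not links[page])
        # Random jump with probability 1-d plus the share of the dangling pages
        new_rank = {page: (1.0 - d) / n + d * dangling / n for page in pages}
        # With probability d the surfer follows one of the outgoing links
        for page in pages:
            for target in links[page]:
                new_rank[target] += d * rank[page] / len(links[page])
        if max(abs(new_rank[page] - rank[page]) for page in pages) < tolerance:
            return new_rank
        rank = new_rank
    return rank


# Two-page example from the dangling pages slide: P1 links to P2, P2 has no links
ranks = pagerank({"P1": ["P2"], "P2": []})
print({page: round(value, 3) for page, value in ranks.items()})  # {'P1': 0.351, 'P2': 0.649}
```

In contrast to the unadjusted matrix H, where the rank drains away to zero, the ranks now sum to 1 and P2 ends up with the higher value because it receives the link from P1.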



Examples

[Figure: example link graph with the pages A1, A2 and A3]

A1 = 0.26, A2 = 0.37, A3 = 0.37



Examples ...

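[Figure: example link graph with the pages A1 to A3 and B1 to B3]

A1 = 0.13, A2 = 0.185, A3 = 0.185

B1 = 0.13, B2 = 0.185, B3 = 0.185

P_A = 0.5, P_B = 0.5 (sum of the PageRank values of the A pages and of the B pages)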



Examples

PageRank leakage

[Figure: example link graph with the pages A1 to A3 and B1 to B3]

A1 = 0.10, A2 = 0.14, A3 = 0.14

B1 = 0.22, B2 = 0.20, B3 = 0.20

P_A = 0.38, P_B = 0.62



Examples ...

[Figure: example link graph with the pages A1 to A3 and B1 to B3]

A1 = 0.3, A2 = 0.23, A3 = 0.18

B1 = 0.10, B2 = 0.095, B3 = 0.095

P_A = 0.71, P_B = 0.29



Examples

PageRank feedback

[Figure: example link graph with the pages A1 to A3 and B1 to B3]

A1 = 0.35, A2 = 0.24, A3 = 0.18

B1 = 0.09, B2 = 0.07, B3 = 0.07

P_A = 0.77, P_B = 0.23



Examples ...

[Figure: example link graph with the pages A1 to A4 and B1 to B3]

A1 = 0.33, A2 = 0.17, A3 = 0.175, A4 = 0.125

B1 = 0.08, B2 = 0.06, B3 = 0.06

P_A = 0.80, P_B = 0.20



Implications for Website Development

First make sure that your page gets indexed

on-page factors

Think about your site's internal link structure

create many internal links for important pages

be "careful" about where to put outgoing links

Increase the number of pages

Ensure that webpages are addressed consistently

http://www.vub.ac.be

http://www.vub.ac.be/index.php

Make sure that you get incoming links from good websites



Tools

Google toolbar

shows logarithmic PageRank value (from 0 to 10)

information not frequently updated (Google Dance)

Google webmaster tools

accepts a sitemap (XML document) with the structure of a website

variety of reports that help to improve the quality of a website

- meta description issues

- title tag issues

- non-indexable content issues

- number and URLs of indexed pages

- number and URLs of inbound/outbound links

- ...



Questions

Is PageRank fair?

What about Google's power and influence?

What about Web 2.0 or Web 3.0 and web search?

"non-existent" webpages such as offered by Rich Internet Applications (e.g. Ajax) may bring problems for traditional search engines (hidden web)

new forms of social search

- Wikia Search

- Delicious

- ...

social marketing



HITS Algorithm

Hypertext Induced Topic Search by Jon Kleinberg

developed around the same time when Page and Brin invented PageRank

Uses the link structure like PageRank to compute a popularity score

Differences from PageRank

two popularity values for each page (hub and authority score)

note that the values are not query-independent

user gets a ranked hub and authority list

Jon Kleinberg



HITS Algorithm ...

Good authorities are linked by good hubs and good hubs link to good authorities

Compute impact of authorities and hubs similar to PageRank (but only on limited set of result pages!)

[Figure: page P1 (hub) linking to page P2 (authority)]

    initialise each page with an authority and hub score of 1
    repeat {
      compute new authority scores
      compute new hub scores
      normalise authority and hub scores
    }
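The pseudocode above could be turned into runnable Python roughly as follows. The update rules (authority score = sum of the hub scores of the linking pages, hub score = sum of the authority scores of the linked pages) follow the statement that good hubs link to good authorities; the Euclidean normalisation, the fixed number of iterations and the small example graph are assumptions.

```python
import math


def normalise(scores):
    """Scale the scores to unit Euclidean length (all-zero scores stay zero)."""
    norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
    return {page: value / norm for page, value in scores.items()}


def hits(links, iterations=50):
    """links maps a page to the list of pages it links to; returns (authority, hub)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    # Initialise each page with an authority and hub score of 1
    authority = {page: 1.0 for page in pages}
    hub = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # New authority score: sum of the hub scores of all pages linking to the page
        authority = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                authority[target] += hub[source]
        # New hub score: sum of the new authority scores of all linked pages
        hub = {page: sum(authority[t] for t in links.get(page, [])) for page in pages}
        authority = normalise(authority)
        hub = normalise(hub)
    return authority, hub


# Hypothetical result set: two hub pages pointing to two authority pages
LINKS = {"H1": ["A1", "A2"], "H2": ["A1"]}
authority, hub = hits(LINKS)
print(max(authority, key=authority.get), max(hub, key=hub.get))  # A1 H1
```

In practice the graph passed to `hits` would only contain the result pages for a query and their link neighbourhood, which is what makes the scores query-dependent.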



Meta Search Engines

Search tool that sends a query to multiple search engines

Aggregates the individual results on a single result page

MetaCrawler is an example of a meta search engine that uses different search engines (Google, Bing, Yahoo!, ...)



Search Engine Market Share



Conclusions

Web information retrieval techniques have to deal with the specific characteristics of the Web

PageRank algorithm

absolute quality of a page based on incoming links

based on random surfer model

computed as eigenvector of Google matrix G

PageRank is just one (important) factor

Implications for website development and SEO



References

Vannevar Bush, As We May Think, Atlantic Monthly, July 1945

http://www.theatlantic.com/doc/194507/bush/

http://sloan.stanford.edu/MouseSite/Secondary.html

L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, January 1998

S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Networks and ISDN Systems, 30(1-7), April 1998



References …

Amy N. Langville and Carl D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, July 2006

PageRank Calculator

http://www.webworkshop.net/pagerank_calculator.php

Google Webmaster Tools

http://www.google.com/webmasters/


Next Lecture

Search Engine Optimisation (SEO) and Search Engine Marketing (SEM)
