
Digital Libraries Initiatives: What I learned (and didn't) in 10 years

Hector Garcia-Molina

Stanford University

Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Zoltan Gyongyi, Andy Kacsmar, Sep Kamvar, Wang Lam, Mor Naaman, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley, Rebecca Wesley, and others...

3

Outline

• DLI I & II Experience
  – (with special help from Andreas and Rebecca)

• Stanford Research

• “Controversial” Questions for the Future

4

Disclaimer

• Stanford Perspective

5

DLI Experience

• Lots of great research!

• Lots of great content!

• Main Event was....

6

Main DL Event

IEEE and ACM DL Conferences Merge into JCDL!!

8

9

The WWW Tsunami

• Before the Web:
  – Publishers, catalogs, ...
  – Librarians: see the need for technology
  – CS types: want to have social impact

10

The WWW Tsunami

• The Web Arrives:
  – few coherent collections
  – producers = consumers
  – everything free
  – heterogeneous
  – merge:

• shopping,

• entertainment

• library services ...

11

CS-Library “Tensions”

• Web generated a lot of excitement, but...

• “Friendly tensions” as everyone adjusted:
  – Techies take all the funding!
  – Librarians don’t get it!

12

Example: CS-TR Experience

• History

• Copyright issues

• Pubs servers everywhere

• Citeseer,...

• Organization vs. chaos
• Chaos wins! (this round)

[Timeline: DLI I & II, NCSTRL]

13

Bright Future

• DLI I & II made important contributions (more later...)
• Huge volume of information available
• Direct communication between authors and librarians
• Core library functions needed, more than ever:
  – organization
  – curation
  – trusted information
  – ...

[Timeline: DLI I & II, NCSTRL, today]

14

Stanford DLI Project

Stanford Theme - Phase I

• “GLUE” for accessing diverse libraries and services

[Figure: the InfoBus links Internet libraries, payment institutions, search agents, user interfaces and annotations, commercial information brokers and providers, and copyright services, doing query/data conversion over HTTP, Z39.50, and Telnet]

InfoBus Example

[Figure: sources Folio, Dialog, DigiCash, and F.V. sit behind proxies on the InfoBus, alongside the DLite, GLOSS, QueryTrans, MetaData, U-PAI, and Contracts services]

Q: Find Ti distributed (W) systems

InfoBus Example

[Same InfoBus figure as above]

Q: Find Ti distributed (W) systems

Suggested: Folio, Dialog

InfoBus Example

[Same InfoBus figure as above]

Q: Find Ti distributed (W) systems

Q’: Find Ti distributed AND systems

Query Translation

InfoBus Example

[Same InfoBus figure as above]

Q: Find Ti distributed (W) systems

Pay per View

InfoBus Details

[Figure: DLITE, Z39.50, and SenseMaker clients connect through LSPs and Z39.50 client/server wrappers to libraries L1 ... Ln, services S1 ... Sn, and to payment, translation, metadata, ... services]

• ILU objects
• Information models
• DLI protocol

Querying Sources

• Differences: Language, Operators, Attributes,...

Q1: title contains large AND distributed (W) system

Q2: FIND heading large AND distributed NEAR system

Query Translation

[Figure: the user query goes to a Translator, which uses the target systems’ syntax, capabilities, and schemas to produce per-target queries plus filter queries; a Post-Filter applies the filter queries to produce the final results]

Stop Word Examples

• User query Q1:
  – title contains gone AND with AND the AND wind
• Subsuming query QS (for Dialog):
  – title contains gone AND wind
• Filter query QF:
  – title contains with AND the

[Figure: the query translator sends QS to the source; the post-filter applies QF to the source’s answer AS to produce the final answer A1]

Stop Word Examples

• User query Q1:
  – title contains gone (W) with (W) the (W) wind
• Subsuming query QS (for Dialog):
  – title contains gone (2W) wind
• Filter query QF:
  – title contains gone (W) with (W) the (W) wind

[Same figure: query translator, source, and post-filter pipeline as above]
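A minimal sketch (in Python, not the Stanford translator) of the idea behind the first stop-word example: split an AND query into a subsuming query the target source can run and a filter query that the post-filter re-applies to the returned titles. The stop-word list is an assumption for illustration.

    # Split a stop-word-laden AND query into a subsuming query (sent to
    # the source) and a filter query (checked locally by the post-filter).
    STOP_WORDS = {"with", "the", "a", "an", "of"}   # assumed stop-word list

    def translate(terms):
        """terms joined by AND, e.g. ['gone', 'with', 'the', 'wind']"""
        subsuming = [t for t in terms if t.lower() not in STOP_WORDS]
        filter_q = [t for t in terms if t.lower() in STOP_WORDS]
        return subsuming, filter_q

    def post_filter(title, filter_terms):
        # Re-check the terms the target source would have ignored.
        words = title.lower().split()
        return all(t in words for t in filter_terms)

    qs, qf = translate(["gone", "with", "the", "wind"])
    # qs == ['gone', 'wind']  -> QS, sent to Dialog
    # qf == ['with', 'the']   -> QF, applied by the post-filter
    print(post_filter("Gone with the Wind", qf))    # True

In the proximity ((W)) variant, the subsuming query loosens the word-adjacency constraint, so the filter query has to repeat the entire original query and the post-filter re-checks all of it.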

Translation Overhead: Stop Words

[Chart: size of the subsuming query without stop words vs. size of the user query with the (W) operator, for the text field on Dialog]

Summary

• Option 1: Avoid translation
  – Need: common language and operators
  – Need: common attributes
• Option 2: Translate
  – Need: source meta-data
  – Need: user involvement in translation

27

Stanford DLI II: Technical Barriers

Economic Weaknesses

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers

WebBase Architecture

[Figure: web crawlers feed pages from the WWW into a repository with a multicast engine; a feature repository and retrieval indexes are built on top, and clients access them through the WebBase API]

29

PowerBrowser - Start Screen

30

PowerBrowser - Hypertext View

31

Copy Detection

Copy Detection System

32

Replicated Collections on the Web

33

Archival Repository

[Figure: Stanford TRs and Illinois TRs live on their own servers and are captured into the Stanford and Illinois archival repositories]

34

Archival Repository Design

• If I have $100K/yr
• Want 99.999% “reliability”
  – how many copies
  – how much preventive maintenance
  – ???
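A back-of-the-envelope sketch (Python) of the “how many copies” question above, assuming copies fail independently; the per-copy annual failure probability is an illustrative assumption, not a figure from the talk.

    # Smallest number of independent copies whose joint survival meets the
    # 99.999% "reliability" target for one year.
    p_fail = 0.05        # assumed annual failure probability of one copy
    target = 0.99999     # desired probability that at least one copy survives

    n = 1
    while 1 - p_fail ** n < target:
        n += 1
    print(n, 1 - p_fail ** n)   # 4 copies -> ~0.9999937

The real design question is harder because the same budget also has to pay for preventive maintenance, which changes the failure behavior itself (the chart that follows).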

Preventive Maintenance and Aging

[Chart: MTTF (years) vs. preventive maintenance period (1, 3, 5, 10 years, or never), for different start-of-aging times (1, 3, 5, 10 years, or never)]

35

Crawler Friendly Web Servers

• Year 2000 paper: Onn Brandman, Junghoo Cho, Narayanan Shivakumar
  – Help crawlers identify pages of interest

[Figure: the crawler pulls pages from the web server]

36

Crawler Friendly Web Servers

• Year 2000 paper: Onn Brandman, Junghoo Cho, Narayanan Shivakumar
  – Help crawlers identify pages of interest

[Figure: the crawler pulls a digest from the web server]

Other options:
• Push
• Filter service
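A minimal sketch (Python) of the digest idea: the server publishes (URL, last-modified) pairs so the crawler pulls only what changed since its last visit. The digest format and field names here are assumptions for illustration, not the format from the paper.

    from datetime import datetime, timezone

    # Hypothetical digest exported by a crawler-friendly server.
    digest = [
        ("http://example.org/a.html", datetime(2024, 1, 10, tzinfo=timezone.utc)),
        ("http://example.org/b.html", datetime(2024, 3, 2, tzinfo=timezone.utc)),
    ]

    last_crawl = datetime(2024, 2, 1, tzinfo=timezone.utc)

    # The crawler fetches only pages modified since its last visit.
    to_fetch = [url for url, modified in digest if modified > last_crawl]
    print(to_fetch)   # ['http://example.org/b.html']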

37

Needless Requests

38

Improved Freshness

40

DLI Technology Transfer

• Research Product: Students

• Transfer Takes Time!

[Figure: the DLI II barriers (economic weaknesses, information loss, information overload, service heterogeneity, physical barriers) paired with the project’s technologies for digital libraries: interoperability, value filtering, mobile access, IP infrastructure, archival repository]

41

42

“Controversial” Questions

43

Is Metadata Dead?

[Figure: a document and its metadata]

44

Will the Semantic Web Make It?

• Will tags be generated?
• By whom?
• Agreement

[Figure: can a search engine make use of semantic tags on the web?]

45

Is Google the Future Digital Library?

46

Not Online, Not Worth Having?

• Bill Arms Quote

47

Are Publishers Still Needed?

48

Here Today, Gone Tomorrow?

• Will we find today’s materials in 50 years?

49

Will Lawyers Win?

50

Summary

• We learned a lot from DLI I & II

• Trained students who are changing the world

• Many challenges ahead...

51

Extra Slides

52

Outline

• dli
  – 94-98; 00-05
  – lots of great research; wonderful sites (cervantes)
  – the web; like doing research on tidal pools when a tsunami hits
  – before the web:
    • librarians: catalogs, publishers in control, research funding low
    • com sci: chance to have impact; do good for society
  – the web:
    • blurred distinction between producers and consumers
    • no coherent collections (with a curator who controlled, organized, ...)
    • everything free (expectation that...)
    • heterogeneity (beyond html...)
    • merged shopping, work, library, entertainment... blurred distinctions...
  – tensions cs-librarians:
    • cs folks taking all the funding to work on technology
    • librarians “don’t get it”; times are changing
    • CS-TR experience... copyrights, servers, search, etc.

53

Outline

• dli (continued)
  – Bright Future

• direct communication between librarians and authors (camera ready...)

• huge volume of information available

• core function of librarianship remains (organize, categorize, ...); now more than ever: need to filter out junk, organize, synthesize...

• more on this future later on in talk...

54

Outline

• summary of stanford work

55

Outline

• dli

• summary of stanford work

• future issues
  – will the semantic web ever make it?

– Is metadata really dead?

– Are publishers still needed?

– Is Google the digital library of the future?
  • google scans books

– Is paper relevant?
  • bill arms: “If it is not online, it is not worth having”

• my students do not cite anything not online (Michigan story)

– Will we be able to find today’s digital materials in 50 years?

– How will DLs be funded? DL Research funded?

Research Areas

• Interface: our window to a digital library
• Interoperation: accessing heterogeneous services
• Discovery: finding desired resources
• Translation: speaking the right language
• Payment: multiple policies & currencies
• Interpretation: understanding results
• Creation: generating new information

Outline

• Overview of the Digital Library Initiative

• The Stanford Digital Library Project

– Overview

– The InfoBus
– Internet Meta-Searching

• Discovery

• Querying

• Merging and ranking

– STARTS Protocol

Discovery: Exhaustive Searching

[Figure: queries go to every source, and each source returns its own answer]

Discovery: Full Index

[Figure: an extractor pulls full text from each source into a central index; a query against the index returns document identifiers, and specific documents are then requested from the sources]

Discovery: GLOSS

[Figure: a collector gathers statistics from each source into GLOSS; a query against GLOSS returns hints about which sources to send the query to]

Example:

• query: find author Knuth and title computers

• statistics GLOSS keeps on databases:

DB     #docs   #docs with author Knuth   #docs with title computers
db1      100            0                             3
db2      200           10                           200
db3     1000          100                           100
db4     1000            1                             1

Which database(s) should the user search?

• q = find author Knuth and title computers

DB     #docs   #docs with author Knuth   #docs with title computers
db1      100            0                             3
db2      200           10                           200
db3     1000          100                           100
db4     1000            1                             1

Example (cont.):

• Use IND predictor (others available).

• Resulting rank:
  – ESize(q, db2) = (10/200) * (200/200) * 200 = 10 docs
  – ESize(q, db3) = (100/1000) * (100/1000) * 1000 = 10 docs
  – ESize(q, db4) = (1/1000) * (1/1000) * 1000 = 0.001 docs
  – ESize(q, db1) = (0/100) * (3/100) * 100 = 0 docs
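The computation above is easy to reproduce. A small Python sketch of the IND (independence) estimator, using the statistics table from the previous slide:

    # ESize(q, db) = |db| * product over query terms of (docfreq(term) / |db|)
    stats = {   # db: (total docs, docs with author Knuth, docs with title computers)
        "db1": (100, 0, 3),
        "db2": (200, 10, 200),
        "db3": (1000, 100, 100),
        "db4": (1000, 1, 1),
    }

    def esize(total, *doc_freqs):
        est = total
        for f in doc_freqs:
            est *= f / total
        return est

    for db, (total, knuth, computers) in stats.items():
        print(db, esize(total, knuth, computers))
    # db1 0.0, db2 10.0, db3 10.0, db4 0.001  -> search db2 and db3 first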

GLOSS Results

• Experimental evaluation
• GLOSS hints “very good” 85% to 90% of the time
• GLOSS index is 2% of the size of the full index

Summary

• GLOSS and other resource discovery tools work…
• BUT require meta-data collection facilities

[Figure: a collector sends queries to each source to gather its metadata]

Translation Overhead: Stop Words

[Chart: size of the subsuming query without stop words vs. size of the user query with the AND operator, for the text field on Dialog]

Translation Overhead: Stop Words

[Chart: size of the subsuming query without stop words vs. size of the user query with the (W) operator, for the text field on Dialog; only points with remaining length greater than 1]

Ranking & Interpreting Results

• How do we merge ranked results?
  – Example query: “distributed databases”
  – Source1: (d1, 0.7), (d2, 0.3)
  – Source2: (d3, 100), (d4, 82), (d5, 71)

Ranking & Interpreting Results

• Need additional information from sources
  – Example query: “distributed databases”
  – Source1:
    ( doc = d1, rank = 0.7, frequency[“distributed”] = 100, frequency[“databases”] = 1000, totalDocuments = 5000 ),
    ( doc = d2, rank = 0.3, frequency[“distributed”] = 10, frequency[“databases”] = 300, totalDocuments = 5000 )

Target Ranking

• Compute target ranking:
  – Source1: (d1, T100), (d2, T50)
  – Source2: (d3, T150), (d4, T80), (d5, T25)
• Merge:
  – Combined: (d3, T150), (d1, T100), (d4, T80), (d2, T50), (d5, T25)

Target Ranking

• Compute target ranking:
  – Source1: (d1, T100), (d2, T50)
  – Source2: (d3, T150), (d4, T80), (d5, T25)
• Merge:
  – Combined: (d3, T150), (d1, T100), (d4, T80), (d2, T50), (d5, T25)
• Question: Are we positive (d3, T150) is best?
  – Maybe (dx, 0.25) at Source1 (ranked below d2 there) has a target rank of (dx, T200)??
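The merge step itself is simple once every result carries a score from the same target ranking function; a short Python sketch using the T-scores from the slide:

    # Results already re-scored with a common target ranking function.
    source1 = [("d1", 100), ("d2", 50)]
    source2 = [("d3", 150), ("d4", 80), ("d5", 25)]

    combined = sorted(source1 + source2, key=lambda r: r[1], reverse=True)
    print(combined)
    # [('d3', 150), ('d1', 100), ('d4', 80), ('d2', 50), ('d5', 25)]
    # Caveat from the slide: a document a source ranked low (dx at 0.25 in
    # Source1) could still re-score higher (T200), so truncated source
    # result lists can hide the true best answer.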

Summary

• Sources need to export auxiliary ranking information
• We need some “knowledge” of the source ranking function

STARTS

• Stanford Protocol for Internet Search and Retrieval

• Participants:
  – Fulcrum, Infoseek, Microsoft Network, Verity, WAIS
  – GILS, Harvest, Netscape, PLS, HP, others

• Goal: Simplify the job of meta-searchers
• Goal: Simplicity
• Can be used by different transport protocols
• Visit:

– http://www-db.stanford.edu/~gravano/starts_home.html

STARTS Components

(1) Common scheme for collecting meta-data
(2) Common query language
(3) Common result ranking information

[Figure: the collector sends queries to each source and receives answers; labels (1)-(3) mark where metadata collection, the query language, and result ranking come into play]

Metadata Example (SOIF)

@SMetaAttributes{
Version{10}: STARTS 1.0
SourceID{8}: Source-1
FieldsSupported{17}: [basic-1 author]
ModifiersSupported{19}: {basic-1 phonetics}
FieldModifierCombinations{39}: ([basic-1 author] {basic-1 phonetics})
QueryPartsSupported{2}: RF
ScoreRange{7}: 0.0 1.0
RankingAlgorithmID{6}: Acme-1
...

Sample Query (SOIF)

@SQuery{
Version{10}: STARTS 1.0
FilterExpression{48}: ((author ``Ullman'') and (title stem ``databases''))
RankingExpression{61}: list( (body-of-text ``distributed'') (body-of-text ``databases''))
DropStopWords{1}: T
DefaultAttributeSet{7}: basic-1
DefaultLanguage{5}: en-US
AnswerFields{12}: title author
MinDocumentScore{3}: 0.5
MaxNumberDocuments{2}: 10
}

Meta-Searching Conclusion

• Need extra information from sources:
  – Meta-data
  – Ranking information

• For querying multiple sources:
  – Need standard query language; or
  – Need query translation machinery

Meta-Searching Conclusion

• Other issues:
  – Payment
  – Preserving advertisements
  – Improved “value” filtering

The Stanford Digital Library Project

[Figure: the InfoBus architecture diagram shown earlier, linking Internet libraries, payment institutions, search agents, user interfaces and annotations, commercial information brokers and providers, and copyright services via HTTP, Z39.50, and Telnet]

79

Interoperability Challenges

• Growing number of players, formats, countries, ...
• Repositories and services
• Dynamic artifacts

Digital Libraries

80

Standards

• Too many
  – e.g., Z39.50, HTTP, SDLIP, CORBA, DASL, ...

• Narrow
  – e.g., XML is not a silver bullet

• Nevertheless important... translation

81

Query Translation Example

Q: Find Title contains (“cats” near “dogs”)

Q’: Find Title contains (“cats”) AND contains (“dogs”)

doc 1: blah, blah, cats and dogs, blah, blah
doc 2: blah, cats, blah, blah, blah, blah, dogs

[Figure: Q is translated into Q’ for the target system; Q’ matches {doc1, doc2}, and the filter step keeps only {doc1}]
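A minimal Python sketch of the translate-then-filter pipeline above: the NEAR constraint is relaxed to AND for the target system and re-checked locally over the returned documents. The word-distance threshold is an assumption for illustration.

    docs = {
        "doc1": "blah blah cats and dogs blah blah",
        "doc2": "blah cats blah blah blah blah dogs",
    }

    def matches_and(text, a, b):
        words = text.split()
        return a in words and b in words

    def matches_near(text, a, b, max_dist=3):   # assumed NEAR distance
        words = text.split()
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        return any(abs(i - j) <= max_dist for i in pos_a for j in pos_b)

    # Q' (AND) runs at the target; the filter re-applies NEAR.
    candidates = [d for d, t in docs.items() if matches_and(t, "cats", "dogs")]
    final = [d for d in candidates if matches_near(docs[d], "cats", "dogs")]
    print(candidates, final)   # ['doc1', 'doc2'] ['doc1']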

82

Another Query Translation Example

Q: Find [grade > 8] AND [name = “elton john”]

Q’: Find [score = A] AND [last-name = “john”] AND [first-name = “elton”]

[Figure: Q is translated for the target system]

translate:
• basic rules
• translation algorithm
• error estimation
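A small Python sketch of the rule-based step: map the numeric grade predicate to the target’s letter score and split the single name field into first and last names. The grade-to-letter rule is an assumption; only the input and output queries come from the slide.

    def translate(conditions):
        out = []
        for attr, op, value in conditions:
            if attr == "grade" and op == ">" and float(value) >= 8:
                out.append(("score", "=", "A"))        # assumed mapping rule
            elif attr == "name" and op == "=":
                first, last = value.split(" ", 1)       # assumed name-split rule
                out.append(("last-name", "=", last))
                out.append(("first-name", "=", first))
            else:
                out.append((attr, op, value))           # pass through unchanged
        return out

    print(translate([("grade", ">", "8"), ("name", "=", "elton john")]))
    # [('score', '=', 'A'), ('last-name', '=', 'john'), ('first-name', '=', 'elton')]

Error estimation matters because such rules are lossy: [score = A] is not exactly the same set of records as [grade > 8].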

83

Filtering Challenges

• Too much information

• Not controlled

84

Current Filtering

textual similarity

85

Page Rank Filtering

textual similarity

page rank (Google)

86

Initial Page Rank

[Figure: two example pages with initial ranks 4 and 1]

87

Recursive Page Rank

[Figure: example web graph; a page whose in-links contribute ranks 1, 2, 1, and 2 receives rank 1+2+1+2 = 6]
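A minimal Python sketch of the recursive idea: a page’s rank is (mostly) the sum of the ranks of the pages linking to it, each divided by that page’s out-degree, iterated until it settles. The tiny graph and the damping factor are illustrative, not from the talk.

    links = {          # page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
    }
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    d = 0.85           # standard damping factor

    for _ in range(50):
        new = {p: (1 - d) / len(pages) for p in pages}
        for p, outs in links.items():
            share = d * rank[p] / len(outs)
            for q in outs:
                new[q] += share
        rank = new

    print({p: round(r, 3) for p, r in rank.items()})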

88

Value Filtering

• textual similarity
• page rank
• geography
• context
• opinions
• access

89

Mobile Access Challenges

• Limited Screen Size

• Limited Bandwidth

• Disconnected Operation

• Limited Power

90

Power Browsing

Techniques
• Show only text headers
• Show URLs, anchors, titles
• Order URLs by page rank
• Summarize text
• Summarize set of pages
• Low-resolution pictures
• Site search, word completion
• ...
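One of these techniques is easy to sketch: a stripped-down page view that keeps only headers and link anchors so a page fits a phone screen. This is purely illustrative Python, not the actual PowerBrowser code.

    from html.parser import HTMLParser

    class Skeleton(HTMLParser):
        """Collect only header and anchor text from a page."""
        def __init__(self):
            super().__init__()
            self.keep = False
            self.items = []
        def handle_starttag(self, tag, attrs):
            if tag in ("h1", "h2", "h3", "a"):
                self.keep = True
        def handle_endtag(self, tag):
            if tag in ("h1", "h2", "h3", "a"):
                self.keep = False
        def handle_data(self, data):
            if self.keep and data.strip():
                self.items.append(data.strip())

    page = "<h1>Digital Libraries</h1><p>long body text...</p><a href='/dli'>DLI overview</a>"
    parser = Skeleton()
    parser.feed(page)
    print(parser.items)   # ['Digital Libraries', 'DLI overview']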

91

PowerBrowser - Text View

92

PowerBrowser - History

93

Economic Challenges

• Piracy

• Payment

• Heterogeneity

• Security/Privacy

94

Piracy on the Internet

95

Approaches

• Copy Prevention
  – isolation
  – cryptography
  – secure viewer

• Copy Detection
  – watermarking
  – content based

96

Copy Detection

• Content
  – text

– audio

– video

• Challenges
  – crawling the web, mailing lists, ...

– large scale comparison

– false negatives, positives

– different formats, sampling rates, frame rates,...

– adversary tries to fool system

97

Example: Text Copy Detection

[Figure: registration pipeline: get document → break into chunks → compute chunk signatures → store in the database (hash table); detection pipeline: get document → break into chunks → compute signatures → probe the database → above threshold?]
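A minimal Python sketch of that pipeline: hash fixed-size word chunks into a table, then probe it with a suspect document and flag it when the fraction of matching chunks crosses a threshold. The chunk size and threshold are assumptions; real systems also worry about overlapping chunks and adversarial edits.

    import hashlib

    def chunks(text, size=5):                    # assumed chunk size (words)
        words = text.lower().split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    def signature(chunk):
        return hashlib.md5(chunk.encode()).hexdigest()

    def register(corpus):
        table = set()
        for doc in corpus:
            table.update(signature(c) for c in chunks(doc))
        return table

    def is_copy(table, suspect, threshold=0.5):  # assumed threshold
        sigs = [signature(c) for c in chunks(suspect)]
        return sum(s in table for s in sigs) / len(sigs) >= threshold

    table = register(["the quick brown fox jumps over the lazy dog again"])
    print(is_copy(table, "the quick brown fox jumps over the lazy dog again"))        # True
    print(is_copy(table, "totally unrelated words about something else entirely"))    # False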

98

Text Detection Issues

• What are chunks?
• What is the threshold?
• How to foil an adversary?
• How to compare hypertext documents?

99

Information Preservation Challenges

• Preserving the Bits
  – Evolving hardware
  – Evolving software
  – Evolving organizations

• Preserving the Meaning

100

Archiving the Web

[Figure: documents from servers and web pages from web servers, used by web users, are captured into the Stanford archival repository]

101

InfoMonitor History View

102

InfoMonitor Snapshot View

103

104

Archival Repository

• Object Identifier Signature

• No Deletions (never ever!)

[Figure: a handle maps to sets of stored objects, with a “new version?” check on incoming objects]
