Digital Libraries Initiatives: What I learned (and didn't) in 10 years
Hector Garcia-Molina
Stanford University
Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Zoltan Gyongyi, Andy Kacsmar, Sep Kamvar,
Wang Lam, Mor Naaman, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley, Rebecca Wesley, and others...
3
Outline
• DLI I & II Experience
  – (with special help from Andreas and Rebecca)
• Stanford Research
• “Controversial” Questions for the Future
4
Disclaimer
• Stanford Perspective
5
DLI Experience
• Lots of great research!
• Lots of great content!
• Main Event was....
6
Main DL Event
IEEE and ACM DL Conferences Merge into JCDL!!
8
9
The WWW Tsunami
• Before the Web:
  – Publishers, catalogs, ...
  – Librarians: see the need for technology
  – CS types: want to have social impact
10
The WWW Tsunami
• The Web Arrives:
  – few coherent collections
  – producers = consumers
  – everything free
  – heterogeneous
  – merge:
    • shopping
    • entertainment
    • library services ...
11
CS-Library “Tensions”
• Web generated a lot of excitement, but...
• “Friendly tensions” as everyone adjusted:
  – Techies take all the funding!
  – Librarians don’t get it!
12
Example: CS-TR Experience
• History
• Copyright issues
• Pubs servers everywhere
• Citeseer,...
• Organization vs chaos
• Chaos wins! (this round)
[Timeline annotations: DLI I & II, NCSTRL]
13
Bright Future
• DLI I & II made important contributions (more later...)
• Huge volume of information available
• Direct communication between authors and librarians
• Core library functions needed, more than ever:
  – organization
  – curation
  – trusted information
  – ...
[Timeline annotations: DLI I & II, NCSTRL, today]
14
Stanford DLI Project
Stanford Theme - Phase I
• “GLUE” for accessing diverse libraries and services
[Diagram: the InfoBus connects Internet Libraries, Payment Institutions, Search Agents, User Interfaces and Annotations, Commercial Information Brokers & Providers, and Copyright Services through Query/Data Conversion over HTTP, Z39.50, and Telnet]
InfoBus Example
[Diagram: a DLite client sits on the InfoBus together with the GLOSS, QueryTrans, MetaData, U-Pai, and Contracts services; proxies (FolioProxy, DialogProxy, DigiCashProxy, F.V.Proxy) connect the bus to the Folio, Dialog, DigiCash, and F.V. services]
• Q: Find Ti distributed (W) systems
• GLOSS suggests sources: Folio, Dialog
• Query translation for the target, Q’: Find Ti distributed AND systems
• Pay per view through the payment services (DigiCash, F.V.)
InfoBus Details
[Diagram: DLITE, Z39.50, and SenseMaker clients connect through Library Service Proxies (LSPs) and a Z39.50 client/server pair to libraries L1..Ln, services S1..Sn, a Z39.50 library, and payment, translation, metadata, … services]
• ILU Objects
• Information Models
• DLI Protocol
Querying Sources
• Differences: Language, Operators, Attributes,...
Q1: title contains large AND distributed (W) system
Q2: FIND heading large AND distributed NEAR system
Query Translation
[Diagram: the user query goes to a Query Translator, which issues target-specific queries to each target IR system based on the target syntax, capabilities, and schemas; a Post-Filter applies filter queries to the returned documents to produce the final results]
Stop Word Examples
• User Query Q1:
  – title contains gone AND with AND the AND wind
• Subsuming Query QS (for Dialog):
  – title contains gone AND wind
• Filter Query QF:
  – title contains with AND the
[Diagram: Q1 enters the query translator, QS goes to the source, and the source’s answer AS is post-filtered with QF to give the final answer A1]
Stop Word Examples
• User Query Q1:
  – title contains gone (W) with (W) the (W) wind
• Subsuming Query QS (for Dialog):
  – title contains gone (2W) wind
• Filter Query QF:
  – title contains gone (W) with (W) the (W) wind
[Diagram: Q1 enters the query translator, QS goes to the source, and the source’s answer AS is post-filtered with QF to give the final answer A1]
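To make the subsuming/filter split concrete, here is a minimal sketch (my illustration, not the project's translator) for the AND case; the stop-word list and the query representation are assumptions.

    # Hypothetical sketch: split a conjunctive title query into a subsuming
    # query (sent to a stop-word-dropping target such as Dialog) and a
    # filter query (checked locally in the post-filter).
    STOP_WORDS = {"with", "the", "a", "an", "of"}   # assumed list, not Dialog's

    def split_query(terms):
        """terms: words joined by AND in the user query."""
        subsuming = [t for t in terms if t.lower() not in STOP_WORDS]
        filter_q = [t for t in terms if t.lower() in STOP_WORDS]
        return subsuming, filter_q

    def post_filter(title, filter_terms):
        """Keep a returned document only if it also satisfies the filter query."""
        words = title.lower().split()
        return all(t.lower() in words for t in filter_terms)

    qs, qf = split_query(["gone", "with", "the", "wind"])
    print(" AND ".join(qs))                       # gone AND wind
    print(" AND ".join(qf))                       # with AND the
    print(post_filter("gone with the wind", qf))  # True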
Translation Overhead: Stop Words
[Chart: size of the subsuming query without stop words vs. size of the user query with the (W) operator, for the Text field on Dialog]
Summary
• Option 1: Avoid Translation
  – Need: common language and operators
  – Need: common attributes
• Option 2: Translate
  – Need: source meta-data
  – Need: user involvement in translation
27
Stanford DLI II: Technical Barriers
Economic Weaknesses
Information Loss
Information Overload
Service Heterogeneity
Physical Barriers
WebBase Architecture
[Diagram: Web Crawlers fetch pages from the WWW into a Repository with a Multicast Engine; a Feature Repository and Retrieval Indexes are built on top and exposed to many Clients through the WebBase API]
29
PowerBrowser - Start Screen
30
PowerBrowser - Hypertext View
31
Copy Detection
Copy Detection System
32
Replicated Collections on the Web
33
Archival Repository
[Diagram: a server holding Stanford TRs and a server holding Illinois TRs are preserved in a Stanford archival repository and an Illinois archival repository]
34
Archival Repository Design
• If I have $100K/yr
• Want 99.999% “reliability”
  – how many copies
  – how much preventive maintenance
  – ???
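As a back-of-the-envelope illustration of this design question (my numbers, not the project's), if each copy independently fails in a year with probability p, the object is lost only when all copies fail:

    # Rough sketch: annual probability of losing an object with k independent
    # replicas, each failing with an assumed probability p.
    def loss_probability(p_fail, copies):
        return p_fail ** copies

    p = 0.01   # assumed 1% annual failure rate per copy
    for k in range(1, 6):
        print(k, "copies ->", loss_probability(p, k))
    # With p = 0.01, three copies give 1e-6, i.e. 99.9999% "reliability";
    # preventive maintenance in effect lowers p for each copy.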
Preventive Maintenance and Aging
[Chart: MTTF (years, 0 to 80) vs. preventive maintenance period (1, 3, 5, 10, Never years), with one curve per start-of-aging value (1, 3, 5, 10, Never years)]
35
Crawler Friendly Web Servers
• Year 2000 Paper:
  – Onn Brandman, Junghoo Cho, Narayanan Shivakumar
  – Help crawlers identify pages of interest
[Diagram: the crawler pulls a digest from the web server, then pulls only the pages of interest]
Other options:
• Push
• Filter service
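A minimal sketch of the digest idea; the one-line-per-page format (URL, modification time, content hash) is an assumption for illustration, not the paper's actual protocol.

    # Hypothetical digest exporter for a crawler-friendly web server: one
    # "url<TAB>mtime<TAB>sha1" line per exported page, so a crawler can pull
    # the digest first and fetch only pages that changed.
    import hashlib, os

    def build_digest(doc_root, base_url):
        lines = []
        for dirpath, _, files in os.walk(doc_root):
            for name in files:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
                with open(path, "rb") as f:
                    sha1 = hashlib.sha1(f.read()).hexdigest()
                lines.append(f"{base_url}/{rel}\t{int(os.path.getmtime(path))}\t{sha1}")
        return "\n".join(lines)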
37
Needless Requests
38
Improved Freshness
40
DLI Technology Transfer
• Research Product: Students
• Transfer Takes Time!
Economic Weaknesses
Information Loss
Information Overload
Service Heterogeneity
Physical Barriers
• Interoperability
• Value Filtering
• Mobile Access
• IP Infrastructure
• Archival Repository
Technologies for Digital Libraries
41
42
“Controversial” Questions
43
Is Metadata Dead?
[Diagram: a document and its attached metadata]
44
Will the Semantic Web Make It?
• Will tags be generated?
• By whom?
• Agreement
[Diagram: semantic tags added to the web, feeding a search engine?]
45
Is Google the Future Digital Library?
46
Not Online, Not Worth Having?
• Bill Arms Quote
47
Are Publishers Still Needed?
48
Here Today, Gone Tomorrow?
• Will we find today’s materials in 50 years?
49
Will Lawyers Win?
50
Summary
• We learned a lot from DLI I & II
• Trained students who are changing the world
• Many challenges ahead...
51
Extra Slides
52
Outline
• dli: 94-98; 00-05
  – lots of great research; wonderful sites (Cervantes)
  – the web: like doing research on tidal pools when a tsunami hits
  – before the web:
    • librarians: catalogs, publishers in control, research funding low
    • computer science: a chance to have impact; do good for society
  – the web:
    • blurred the distinction between producers and consumers
    • no coherent collections (with a curator who controlled, organized, ...)
    • everything free (expectation that...)
    • heterogeneity (beyond HTML...)
    • merged shopping, work, library, entertainment... blurred distinctions
  – tensions between CS and librarians:
    • CS folks taking all the funding to work on technology
    • librarians “don’t get it”; times are changing
    • CS-TR experience... copyrights, servers, search, etc.
53
Outline
• dli (continued): Bright Future
  • direct communication between librarians and authors (camera ready...)
  • huge volume of information available
  • core function of librarianship remains (organize, categorize, ...); now more than ever: need to filter out junk, organize, synthesize...
  • more on this future later in the talk...
54
Outline
• summary of stanford work
55
Outline
• dli
• summary of stanford work
• future issues
  – will the semantic web ever make it?
  – Is metadata really dead?
  – Are publishers still needed?
  – Is Google the digital library of the future?
    • Google scans books
  – Is paper relevant?
    • Bill Arms: “If it is not online, it is not worth having”
    • my students do not cite anything not online (Michigan story)
  – Will we be able to find today’s digital materials in 50 years?
  – How will DLs be funded? DL research funded?
Research Areas
• Interface: our window to a digital library
• Interoperation: accessing heterogeneous services
• Discovery: finding desired resources
• Translation: speaking the right language
• Payment: multiple policies & currencies
• Interpretation: understanding results
• Creation: generating new information
Outline
• Overview of Digital Library Initiative
• The Stanford Digital Library Project
– Overview
– The InfoBus
– Internet Meta-Searching
• Discovery
• Querying
• Merging and ranking
– STARTS Protocol
Discovery: Exhaustive Searching
[Diagram: the query is sent to every source, and each source returns an answer]
Discovery: Full Index
[Diagram: an extractor pulls the full text of each source into a central INDEX; a query against the index returns document identifiers, which drive requests for specific documents back to the sources]
Discovery: GLOSS
[Diagram: a collector gathers statistics from each source into GLOSS; a query against GLOSS returns hints about which sources to query directly]
Example:
• query: find author Knuth and title computers
• statistics GLOSS keeps on databases:
  DB    #docs   #docs with author Knuth   #docs with title computers
  db1     100            0                            3
  db2     200           10                          200
  db3    1000          100                          100
  db4    1000            1                            1
Which database(s) should the user search?
• q = find author Knuth and title computers
  DB    #docs   #docs with author Knuth   #docs with title computers
  db1     100            0                            3
  db2     200           10                          200
  db3    1000          100                          100
  db4    1000            1                            1
Example (cont.):
• Use IND predictor (others available).
• Resulting rank:
  – ESize(q, db2) = (10/200) * (200/200) * 200 = 10 docs
  – ESize(q, db3) = (100/1000) * (100/1000) * 1000 = 10 docs
  – ESize(q, db4) = (1/1000) * (1/1000) * 1000 = 0.001 docs
  – ESize(q, db1) = (0/100) * (3/100) * 100 = 0 docs
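A small sketch of the IND (independence) predictor used in this example; the statistics are the ones in the table above, and the estimate is simply the product of per-term selectivities times the collection size.

    # GLOSS-style IND predictor: estimate matching documents per database
    # from per-term document counts, assuming the terms occur independently.
    STATS = {   # db: (total docs, docs with author Knuth, docs with title computers)
        "db1": (100, 0, 3),
        "db2": (200, 10, 200),
        "db3": (1000, 100, 100),
        "db4": (1000, 1, 1),
    }

    def esize(total, df_knuth, df_computers):
        return (df_knuth / total) * (df_computers / total) * total

    for est, db in sorted(((esize(*v), db) for db, v in STATS.items()), reverse=True):
        print(f"ESize(q, {db}) = {est:g} docs")
    # db2 and db3 tie at 10 estimated docs, so GLOSS suggests searching them first.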
GLOSS Results
• Experimental Evaluation
• GLOSS hints “very good” 85% to 90% of the time
• GLOSS index is 2% of the size of a full index
Summary
• GLOSS and other resource discovery tools work…
• BUT they require meta-data collection facilities
[Diagram: a collector sends queries to each source to gather meta-data]
Translation Overhead: Stop Words
[Chart: size of the subsuming query without stop words vs. size of the user query with the AND operator, for the Text field on Dialog]
Translation Overhead: Stop Words
[Chart: size of the subsuming query without stop words vs. size of the user query with the (W) operator, for the Text field on Dialog; remaining length greater than 1]
Ranking & Interpreting Results
• How do we merge ranked results?
  – Example: Query: “distributed databases”
  – Source1: (d1, 0.7), (d2, 0.3)
  – Source2: (d3, 100), (d4, 82), (d5, 71)
Ranking & Interpreting Results
• Need additional information from sources
  – Example: Query: “distributed databases”
  – Source1: ( doc = d1, rank = 0.7, frequency[“distributed”] = 100, frequency[“databases”] = 1000, totalDocuments = 5000 ),
    ( doc = d2, rank = 0.3, frequency[“distributed”] = 10, frequency[“databases”] = 300, totalDocuments = 5000 )
Target Ranking
• Compute target ranking:
  – Source1: (d1, T100), (d2, T50)
  – Source2: (d3, T150), (d4, T80), (d5, T25)
• Merge:
  – Combined: (d3, T150), (d1, T100), (d4, T80), (d2, T50), (d5, T25)
• Question: Are we positive (d3, T150) is best?
  – Maybe (dx, 0.25) at Source1 (ranked below d2 there) has target rank of (dx, T200)??
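A minimal sketch of the merge step using the target scores from this slide; the hard part, converting each source's local scores into comparable target scores from the exported statistics, is assumed to have been done already.

    # Merge results once local scores have been mapped to a common target scale.
    source1 = [("d1", 100), ("d2", 50)]               # target scores from Source1
    source2 = [("d3", 150), ("d4", 80), ("d5", 25)]   # target scores from Source2

    combined = sorted(source1 + source2, key=lambda pair: pair[1], reverse=True)
    print(combined)
    # [('d3', 150), ('d1', 100), ('d4', 80), ('d2', 50), ('d5', 25)]
    # Caveat (as the slide warns): a document ranked low locally, e.g. (dx, 0.25)
    # at Source1, could still map to a higher target score than anything returned,
    # so sources must return enough results or expose their ranking function.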
Summary
• Sources need to export auxiliary ranking information
• We need some “knowledge” of the source ranking function
STARTS
• Stanford Protocol for Internet Search and Retrieval
• Participants:
  – Fulcrum, Infoseek, Microsoft Network, Verity, WAIS
  – GILS, Harvest, Netscape, PLS, HP, others
• Goal: Simplify the job of meta-searchers
• Goal: Simplicity
• Can be used by different transport protocols
• Visit:
  – http://www-db.stanford.edu/~gravano/starts_home.html
STARTS Components
(1) Common scheme for collecting meta-data
(2) Common query language
(3) Common result ranking information
[Diagram: a collector gathers meta-data (1) from each source; queries (2) are sent to the sources and ranked answers (3) come back]
Metadata Example (SOIF)
@SMetaAttributes{
Version{10}: STARTS 1.0
SourceID{8}: Source-1
FieldsSupported{17}: [basic-1 author]
ModifiersSupported{19}: {basic-1 phonetics}
FieldModifierCombinations{39}: ([basic-1 author] {basic-1 phonetics})
QueryPartsSupported{2}: RF
ScoreRange{7}: 0.0 1.0
RankingAlgorithmID{6}: Acme-1
...
Sample Query (SOIF)
@SQuery{
Version{10}: STARTS 1.0
FilterExpression{48}: ((author ``Ullman'') and (title stem ``databases''))
RankingExpression{61}: list( (body-of-text ``distributed'') (body-of-text ``databases''))
DropStopWords{1}: T
DefaultAttributeSet{7}: basic-1
DefaultLanguage{5}: en-US
AnswerFields{12}: title author
MinDocumentScore{3}: 0.5
MaxNumberDocuments{2}: 10
}
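The {n} annotations in these SOIF examples are the lengths of the attribute values; a small hypothetical helper (not part of STARTS itself) that emits such lines with the length prefix computed automatically:

    # Emit SOIF-style "Name{length}: value" lines, computing the value-length
    # prefix the way the STARTS examples above show.
    def soif_line(name, value):
        return f"{name}{{{len(value)}}}: {value}"

    def soif_query(attrs):
        return "@SQuery{\n" + "\n".join(soif_line(n, v) for n, v in attrs) + "\n}"

    print(soif_query([
        ("Version", "STARTS 1.0"),        # Version{10}: STARTS 1.0
        ("DefaultLanguage", "en-US"),     # DefaultLanguage{5}: en-US
        ("AnswerFields", "title author"), # AnswerFields{12}: title author
        ("MaxNumberDocuments", "10"),     # MaxNumberDocuments{2}: 10
    ]))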
Meta-Searching Conclusion
• Need extra information from sources:
  – Meta-data
  – Ranking information
• For querying multiple sources:
  – Need standard query language; or
  – Need query translation machinery
Meta-Searching Conclusion
• Other issues:
  – Payment
  – Preserving advertisements
  – Improved “value” filtering
The Stanford Digital Library Project
[Diagram: the InfoBus connects Internet Libraries, Payment Institutions, Search Agents, User Interfaces and Annotations, Commercial Information Brokers & Providers, and Copyright Services through Query/Data Conversion over HTTP, Z39.50, and Telnet]
79
Interoperability Challenges
• Growing number of players, formats, countries, ...
• Repositories & Services
• Dynamic artifacts
Digital Libraries
80
Standards
• Too many
  – e.g., Z39.50, HTTP, SDLIP, CORBA, DASL, ...
• Narrow
  – e.g., XML not a silver bullet
• Nevertheless important ... translation
81
Query Translation Example
• Q: Find Title contains (“cats” near “dogs”)
• Translated for the target system, Q’: Find Title contains (“cats”) AND contains (“dogs”)
• Target documents:
  – doc 1: blah, blah, cats and dogs, blah, blah
  – doc 2: blah, cats, blah, blah, blah, blah, dogs
• The target returns {doc1, doc2}; the post-filter keeps {doc1}
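A toy sketch of this translate-then-filter pattern; the tokenization and the three-word proximity window are assumptions for illustration.

    # Translate a "near" query into AND for a target without proximity
    # operators, then post-filter to restore the "near" semantics.
    def matches_and(title, terms):
        words = title.lower().split()
        return all(t in words for t in terms)

    def matches_near(title, t1, t2, window=3):
        words = title.lower().split()
        pos1 = [i for i, w in enumerate(words) if w == t1]
        pos2 = [i for i, w in enumerate(words) if w == t2]
        return any(abs(i - j) <= window for i in pos1 for j in pos2)

    docs = {
        "doc1": "blah blah cats and dogs blah blah",
        "doc2": "blah cats blah blah blah blah dogs",
    }
    candidates = [d for d in docs if matches_and(docs[d], ["cats", "dogs"])]   # Q'
    final = [d for d in candidates if matches_near(docs[d], "cats", "dogs")]   # Q
    print(candidates, "->", final)   # ['doc1', 'doc2'] -> ['doc1']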
82
Another Query Translation Example
• Q: Find [grade > 8] AND [name = “elton john”]
• Q’: Find [score = A] AND [last-name = “john”] AND [first-name = “elton”]
• Translation to the target system uses:
  – basic rules
  – a translation algorithm
  – error estimation
83
Filtering Challenges
• Too much information
• Not controlled
84
Current Filtering
• textual similarity
85
Page Rank Filtering
• textual similarity
• page rank (Google)
86
Initial Page Rank
[Diagram: example pages with initial ranks 4 and 1]
87
Recursive Page Rank
[Diagram: a page’s rank is the sum of the ranks contributed by the pages that link to it, e.g., 1 + 2 + 1 + 2 = 6]
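A minimal iterative PageRank sketch to make the recursion concrete; the tiny link graph and the damping factor are assumptions, not the example on the slide.

    # Iterative PageRank: a page's rank is (mostly) the sum of the rank each
    # in-linking page passes along, split evenly over its out-links.
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / len(pages) for p in pages}
            for p, outs in links.items():
                share = rank[p] / len(outs) if outs else 0.0
                for q in outs:
                    new[q] += damping * share
            rank = new
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}   # assumed graph
    for page, r in sorted(pagerank(links).items(), key=lambda x: -x[1]):
        print(page, round(r, 3))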
88
Value Filtering
• textual similarity
• page rank
• geography
• context
• opinions
• access
89
Mobile Access Challenges
• Limited Screen Size
• Limited Bandwidth
• Disconnected Operation
• Limited Power
90
Power Browsing
Techniques
• Show only text headers
• Show URLs, anchors, titles
• Order URLs by page rank
• Summarize text
• Summarize set of pages
• Low-resolution pictures
• Site search, word completion
• ...
91
PowerBrowser - Text View
92
PowerBrowser - History
93
Economic Challenges
• Piracy
• Payment
• Heterogeneity
• Security/Privacy
94
Piracy on the Internet
95
Approaches
• Copy Prevention
  – isolation
  – cryptography
  – secure viewer
• Copy Detection
  – watermarking
  – content based
96
Copy Detection
• Content
  – text
  – audio
  – video
• Challenges
  – crawling the web, mailing lists, ...
  – large scale comparison
  – false negatives, positives
  – different formats, sampling rates, frame rates, ...
  – adversary tries to fool system
97
Example: Text Copy Detection
[Diagram: registration: get document, break into chunks, compute a signature per chunk, store in the chunk-signature database (hash table); checking: get document, break into chunks, compute signatures, probe the database, and ask whether matches are above threshold]
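A small sketch of this chunk-and-signature pipeline; word-level chunks, MD5 signatures, and the overlap threshold are assumptions for illustration, not the system's actual parameters.

    # Registration: chunk each document, hash each chunk, store the hashes.
    # Checking: chunk a new document and measure the fraction of its chunk
    # signatures already present in the database.
    import hashlib

    def chunks(text, size=5):
        words = text.lower().split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    def signatures(text):
        return {hashlib.md5(c.encode()).hexdigest() for c in chunks(text)}

    database = set()   # the chunk-signature "hash table"

    def register(doc):
        database.update(signatures(doc))

    def overlap(doc):
        sigs = signatures(doc)
        return len(sigs & database) / len(sigs) if sigs else 0.0

    register("the quick brown fox jumps over the lazy dog near the river bank")
    print(overlap("the quick brown fox jumps over the lazy dog near my house"))
    # ~0.67, which would be flagged under an (assumed) 0.5 threshold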
98
Text Detection Issues
• What are chunks?
• What is the threshold?
• How to foil an adversary?
• How to compare hypertext documents?
99
Information Preservation Challenges
• Preserving the Bits
  – Evolving hardware
  – Evolving software
  – Evolving organizations
• Preserving the Meaning
100
Archiving the Web
[Diagram: a document server and a web server feed documents and web pages into the Stanford archival repository, which serves web users]
101
InfoMonitor History View
102
InfoMonitor Snapshot View
103
104
Archival Repository
• Object Identifier = Signature
• No Deletions (never ever!)
[Diagram: objects are addressed by handles and grouped into sets; changed content raises the question: new version?]
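To illustrate the 'identifier = signature, no deletions' idea, a toy write-once store (my own sketch, not the project's repository code): handles are content hashes, so modified content gets a new handle and old versions stay retrievable.

    # Toy write-once archival store: the handle of an object is the hash of
    # its content; there is no delete, and changed content yields a new handle
    # (a new version) while the old bytes remain available.
    import hashlib

    class ArchivalStore:
        def __init__(self):
            self._objects = {}

        def put(self, data: bytes) -> str:
            handle = hashlib.sha256(data).hexdigest()
            self._objects.setdefault(handle, data)   # idempotent, never overwrites
            return handle

        def get(self, handle: str) -> bytes:
            return self._objects[handle]

    store = ArchivalStore()
    h1 = store.put(b"technical report, version 1")
    h2 = store.put(b"technical report, version 2")   # new version -> new handle
    print(h1 != h2, store.get(h1))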