
Upload: merryl-newman

Post on 25-Dec-2015

Size matters: quality vs. quantity

Traditionally, libraries spent a lot of effort on selection, choosing what to buy. They then spent a lot of effort organizing and indexing the material. The Internet is the other extreme: everything is available, nothing is organized.

There are two fundamental changes:
• low-cost disks
• full-text indexing

Selection is expensive, storage is cheap; organizing is expensive, searching is cheap.
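To make "searching is cheap" concrete, here is a minimal sketch of full-text indexing using an inverted index. The function names (`build_index`, `search`) and the sample documents are hypothetical, chosen only for illustration; real search engines add ranking, stemming, and much more.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "neural nets lecture notes",
    2: "rsa cryptography faq",
    3: "introduction to neural networks",
}
index = build_index(docs)
```

Once the index is built, a query such as `search(index, "neural nets")` only intersects a few small sets; nobody had to classify or catalog the documents by hand.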


Bush’s memex

As visualized by Life Magazine in 1945.


Size matters: the Internet Archive

The Internet Archive sweeps the Web roughly every two months and saves whatever pages it can find. It buys about 10 TB of disk each month, and now has about 100 TB total.

There are two copies of the Archive (neither quite up to date): one at the Library of Congress and one at the Biblioteca Alexandrina.

In addition to the general Web collection, the Archive has also gathered "curated" collections where specialists chose web sites, e.g., the 2000 Election website for the Library of Congress.


More than the Web: universal access to human knowledge

It is now credible to imagine that all of our creative activity is placed on line. For example, perhaps 100M books have been published; digital versions of these would fit in 1 "petabyte" (the step after the terabyte) and a petabyte of disk today is $1M.
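The petabyte estimate can be checked with a little arithmetic. This is a sketch only: it takes the slide's figures at face value and assumes the decimal convention (1 petabyte = 10^15 bytes); the variable names are illustrative.

```python
BOOKS = 100_000_000            # "perhaps 100M books" (from the slide)
PETABYTE = 10**15              # bytes; decimal convention is an assumption
COST_PER_PETABYTE = 1_000_000  # dollars (from the slide)

# Storage budget per book if all books share one petabyte.
bytes_per_book = PETABYTE // BOOKS

# Disk cost attributable to a single book.
cost_per_book = COST_PER_PETABYTE / BOOKS

print(f"{bytes_per_book / 10**6:.0f} MB per book")
print(f"${cost_per_book:.2f} of disk per book")
```

Under these assumptions each book gets a budget of about 10 MB, at a disk cost of roughly one cent per book.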

The Internet Archive supports, for example:
• The Million Book Project (Profs. Raj Reddy & N. Balakrishnan)
• The Prelinger Archive and the Television Archive (moving images)
• The "etree.org" music files
• Software collections, working with Macromedia
• The Internet Bookmobile


From John McCallum


What to keep: lessons from history

Once upon a time libraries didn't give full respect to:
• Vernacular literature (before the Renaissance)
• Plays, instead of poetry
• Non-European languages
• Films and television scripts and recordings

Today the distinctions between libraries, archives and museums are eroding. Undergraduates are using primary materials online, which they would not have been able to use on paper; even in schools some of these are useful.

As time goes on it is cheaper to collect but more expensive to select; it is cheaper to search and more expensive to organize.


Google vs. ACM DL

Query: neural nets

ACM: 554 hits
• Bounds for the computational power & learning complexity..
• Neural networks & open texture
• Efficient simulation of finite automata....
• Parallel construction of minimal perfect hashing ...

Google: 131,000 hits
• Lecture notes from MSc course on neural nets
• Neural networks at PNNL
• Old neural net FAQ
• FAQ for comp.ai.neural-nets

ACM dates 1991-1993, Google 1995-2001.

On balance, the Google pages are better as an introduction; the ACM hits are too specialized (the ACM DL does not have monographs).


Google vs. ACM DL

Query: rsa cryptography

ACM: 12 hits
• Hardware speedups in long integer multiplication.
• Dynamically reconfigurable architecture for image proc.
• Representation of ASN.1 in APL nested structures
• Architectural tradeoff in implementing RSA procs.

Google: 117,000 hits
• RSA Laboratories cryptography FAQ
• RSA Labs algorithm simulation center (Javascript)
• RSA Cryptography Today FAQ
• RSA cryptography spec 2.0

Again, the ACM hits are very specialized; as an introduction the pages found by Google are better.


Google vs. Art Index

Query: paleography

Art Index: 72 hits
• Cuneiform: The Evolution of a Multimedia Cuneiform Database
• Une Priere de Vengeance sur une Tablette de Plomb a Delos.
• More help from Syria: introducing Emar to biblical study
• The death of Niphururiya and its aftermath

Google: 21,100 hits
• Manuscripts, paleography, codicology, introductory bibliography
• Ductus: an online course in paleography
• BYZANTIUM: Byzantine Paleography
• Texts, manuscripts and paleography

The same general result, that the “selected” material is too specialized, also holds in art, although the advantage for Google was smaller.


What about art history?

I tried four questions in computer science and four questions in art history, comparing Google against the ACM Digital Library and the Art Index. In general:
• Google has more general resources
• Google sometimes gets distracted

It’s hard to find a query that the “official” sources do well and Google doesn’t do at all.


Large image libraries

There are now some very large image collections:
• the National Museum of the American Indian has 800,000
• ARTSTOR will have about 250,000
• commercial sites (e.g. Corbis) have millions of images

Computers are good at matching up images. They are not, today, good at image search: but with a large enough library, the problem will be recognition and not analysis.


Image matching

(from Andrew Zisserman, Oxford)


(Jitendra Malik and David Forsyth, Berkeley)


(from Peter Allen & collaborators at Columbia)

Beauvais Cathedral


Tom Funkhouser, Princeton


The Internet Bookmobile

Van, satellite modem, computers, printer, binding machine; can make a copy of an out-of-print book for $1; van + equipment costs $15,000.


The Million Book Project

Created by Raj Reddy of Carnegie Mellon University; also led by Prof. N. Balakrishnan of the Indian Institute of Science.

The US provides scanners, disks, and computers (about $4.5M is committed); India provides labor (1-2 thousand staff-years).

About 100 Minolta look-down scanners enable non-destructive black & white scanning of books at about one book per hour. With two shifts, for two years, this should scan 1 million books.
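The "1 million books" target can be sanity-checked with a rough throughput calculation. The scanner count, scan rate, shift count, and duration come from the slide; the shift length and the number of working days per year are assumptions made here for illustration.

```python
SCANNERS = 100               # from the slide
BOOKS_PER_SCANNER_HOUR = 1   # from the slide
SHIFTS_PER_DAY = 2           # from the slide
YEARS = 2                    # from the slide
HOURS_PER_SHIFT = 8          # assumption, not stated in the slide
WORKING_DAYS_PER_YEAR = 300  # assumption, not stated in the slide

# Books scanned per day across all scanners and both shifts.
books_per_day = (
    SCANNERS * BOOKS_PER_SCANNER_HOUR * SHIFTS_PER_DAY * HOURS_PER_SHIFT
)

# Total over the planned duration of the project.
total_books = books_per_day * WORKING_DAYS_PER_YEAR * YEARS

print(f"{books_per_day:,} books per day, {total_books:,} books in total")
```

With these assumed values the total comes to 960,000 books, which is consistent with the stated goal of about one million.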

Scanning is 600 dpi, bitonal, with OCR and some image cleanup.

So far about 20,000 books have been scanned in India; this is about 4 months of activity. Centers are running in Bangalore, Hyderabad, Pune, Chennai, Mumbai, Thirupati, and other places.


International Children's Digital Library

Curated collection of children's books; research on interfaces by Ben Bederson and Allison Druin; see www.icdlbooks.org, but only about 200 books so far.


Television Archive

September 11 broadcasts from around the world; one week, news programs only.

Also the Prelinger Archive: about 1,000 films, typically industrial or government.

Online availability caused an increase in commercial licensing.


Internet Archive issues

Copyright. The Archive is generally "opt-out"; is this OK? Some US rights holders are using the DMCA to lean on Google & the Archive.

Economics. The Archive does not charge and believes public domain material, in particular, should be free. Will this work in the long run?

Technology. The more Web pages fill with Javascript and Flash, the harder it is to save them. The collection of Macromedia's CD-ROMs is particularly vulnerable here.

Interfaces. The Archive, in general (ICDL is an exception), builds collections but does not do much research on how to use them.

Impact. How can we get the most from such resources?


Data lookup, not experiment

In the future, many experiments won’t be necessary because the answers will already be online. Data acquisition is being automated and enormous quantities of information are online (petabytes).

Molecular biology is first, replacing wet chemistry with lookups in the protein and genome data banks (e.g., to determine the function of a gene or protein).

Astronomy is probably coming next

Many earth-observing fields are getting ready


National needs

Assisting the intelligence agencies

• photointerpretation

• individual identification

• database fusion

• large scale data mining


Face spotting


Virtual Cities

Above: modern Los Angeles; left, classical Rome. UCLA.


Human motion analysis

Jezekiel Ben-Arie, U of Illinois Chicago


Future Issues

Can we create dictionaries of interesting items?

Can we infer 3-D from 2-D, and build 3-D models?

Can we merge speech, text, and databases?

Can we summarize mixed-media material?

Can we deal with multiple languages?

Can we anticipate scientific and defense needs?

Can we model earth-observing needs?

Can we do this all in real-time?