Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries & Interoperability Washington, DC June 23, 2003 Supercomputer Education and Research Centre Indian Institute of Science Bangalore India School of Computer Science Carnegie Mellon University Pittsburgh USA


Page 1

Million Books to the Web

An Example of Indo-US Collaboration

Lessons Learnt & The Road Ahead

Prof N. Balakrishnan

Indo-US Workshop on Open Digital Libraries & Interoperability

Washington, DC

June 23, 2003

Supercomputer Education and Research Centre

Indian Institute of Science

Bangalore India

School of Computer Science

Carnegie Mellon University

Pittsburgh USA

Page 2

Lessons from the past

• Fires of Alexandria – irrevocably severed our access to any of the works of the ancients.
• Introduction of printing technology – much Indian and Chinese knowledge disseminated by word of mouth and on palm leaves virtually disappeared or became inaccessible
• New cultural revolutions – edifices built by destroying the past irrevocably
– later revolutions seek solace in attempting to preserve what was destroyed
– we need to preserve our heritage independent of the political and social ups and downs

A single wanton act of destruction can destroy an entire line of heritage

Page 3

Lessons from Reality

In a thousand years:

Only a few of the paper documents we have today will survive the ravages of deterioration, loss, and outright destruction.

Existing archives of paper: many other works still in existence today are rare
- accessible only to a small population of scholars and collectors at specific geographic locations

Contrary to popular belief, libraries, museums, and publishers do not routinely maintain broadly comprehensive archives of the considered works of man

No one can afford to do this, unless the archive is digital

Page 4

The Approach

• Technology Driven Vision
• Decide on the stakeholders
– Never make it exclusive
• Pilot projects to perfect technology
• Bring in advanced management concepts
– like People Maturity Models
– Quality assurance
– Automate wherever possible

Continued…

Page 5

The Approach

• Lessons from the past
– Too many Digital Library projects
– with a half-life of less than 2 years from the date of “Launch”, or a long incubation time
– Follow Nike – JUST DO IT
• A Digital Library must have two ingredients
– A knowledge amplifier
– Free access, giving avenues for everyone to make economic benefit
• still contribute to multiplication of knowledge by circulation
• In India, it should be a test bed for our Language Technology Research
– a showcase for our heritage

Page 6

Elements of Technology

• Microprocessors
• Memory
• Connectivity
• Software

All these technologies are growing exponentially

Page 7

Communication Revolution

If you are amazed at the drop in cost of computing, wait till you see what is going to happen to bandwidth.

Network technology will increase 10-100 times faster than processor technology

- Andy Grove, Titan of Intel

Bandwidth will double every year

Network speeds become comparable to interconnect speeds

Page 8

Together, the Computer and Communications Revolutions aim at

Death of Time and Distance

Anytime, Anyplace and Anyone

Page 9

The World of Computers & Communication

• Small fish eat the big fish
– Microprocessors offer performance comparable to supercomputers
– Paradigm shift from dinosaurs to mammals: from performance to functionality
• NETWORK is everywhere
– The Web is a preferred medium of communication for everyone, including the military & the terrorists
• Companies that make more and more software free capitalize more
– Open archives

Page 10

Processor of Tomorrow

• Carbon Nano Tubes
– 5 to 10 atoms wide
– promise to replace silicon soon
• Flexible Transistors
– made from plastic, organic materials
• Silicon will live for 15 years
• Moore’s law will live longer
• 1000 times growth in 10 years

The winner will be decided by: Material Convergence + Human-Like interactions

Page 11

Processor of Tomorrow

• A billion transistors at 10 to 20 GHz clock rates by 2010
• 128 GB of main memory
• A terabyte of disk storage, maybe holographic
• Speech input/output (ASR)
• Multilingual
• Terabit connectivity at the PC
• The DL plans of today must be sensitive to this

Page 12

The Road Ahead

[Figure: Knowledge Content (Poor, Medium, Rich, Brilliant) plotted against the Evolution of systems (Scientific Calculations, Data Analysis, Expert Systems, Super Humans). Other labels in the figure: Emulating Human Performance: See, Hear, Talk, and “Think”; Bill Joy’s Nightmare; Nanosystems.]

Page 13

The future trends:

• Browser will be the only medium of communication.
• It will be active, with voice and video, and language independent.
• Mobility will be the key.
• Small form factor devices such as Palms, PDAs and Tablets would be the future.
• We would soon see TVPCT at the cost of a TV
• We will witness major convergence between ICT, Nano Technologies and Biological Sciences

Page 14

Electronic Resources and the Library of the Future

E-mags; E-books; E-music; E-Movies

Page 15

Dedicated E-book Readers

• Dedicated readers – about 20,000
• Palm devices – 6,000,000
• PCs – hundreds of millions
• “For people accustomed to reading text on a computer for hours at a time, e-book screen clarity is a non-issue.”
• A low-cost E-Book reader design is under way in India

Page 16

http://www.eink.com/technology/index.htm

• E Ink is made up of millions of microcapsules
– each the diameter of a human hair
• Each microcapsule contains
– positively charged white particles &
– negatively charged black particles
– that float in a clear fluid
• A film of transistors supplies the voltage to the capsules
• A negative electric field makes the white particles move to the top of the microcapsule
– an opposite electric field pulls the black particles to the bottom of the microcapsules, mimicking the effect of print.

• Electronic ink is a real power miser

Page 17

E-ink/e-paper (Lucent)

The technology has been identified and development is well under way

By the year 2003, we envision electronic books

• that can display volumes of information as easily as flipping a page,
• permanent newspapers that update themselves daily via wireless broadcast

• Just as today's books give people easy access to everyday information, tomorrow's books will provide the same easy access to the dynamic data of the information age

The world of publishing will never be the same

Page 18

Indian Institute of Science’s Simputer

• A handheld Linux box at around US$ 200
• Has a state-of-the-art browser
• Color screen
• Very good speech synthesizer
– in English and many Indian languages
• A very powerful tool for access with wireless
• Soon to be modified as an E-book

www.simputer.org
www.picopeta.com
www.ncoretech.com

Page 19

The Challenges in Computing

Tomorrow’s computing needs are not in Mflops and Gflops

The computer must process information, perform recognition and make decisions (DM) like a human

Small inexpensive robots and swarms will be a reality

Page 20

Ray Kurzweil: The Age of Spiritual Machines

“A $1,000 PC (in 1999 dollars)…”

– 2009 = trillion calculations/second
– 2019 = 20 million billion calculations/second (the human brain)
– 2029 = 2 × 10^19 calculations/second (1,000 human brains)
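The three quoted figures can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the slide's numbers at face value (a trillion is 10^12, 20 million billion is 2 × 10^16) and derives the doubling time they imply. The function name `doubling_time_years` is made up for this example.

```python
import math

# Calculations per second for a $1,000 PC, as quoted on the slide.
projections = {2009: 1e12, 2019: 2e16, 2029: 2e19}


def doubling_time_years(year_a, year_b):
    """Doubling time (in years) implied by the growth between two quoted years."""
    factor = projections[year_b] / projections[year_a]
    return (year_b - year_a) / math.log2(factor)


# The slide equates the 2019 figure with one human brain.
human_brain = projections[2019]
brains_2029 = projections[2029] / human_brain

print(f"2029 figure expressed in human brains: {brains_2029:.0f}")
print(f"Implied doubling time 2009-2019: {doubling_time_years(2009, 2019):.2f} years")
print(f"Implied doubling time 2019-2029: {doubling_time_years(2019, 2029):.2f} years")
```

Under these assumptions the 2029 figure is indeed 1,000 times the 2019 one, and the first decade implies a noticeably faster doubling than the second.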

Page 21

Page 22

Ray Kurzweil: The Age of Spiritual Machines

• 2009: “Computer displays have all the display qualities of paper- high resolution, high contrast, large viewing angle, and no flicker. Books, magazines, and newspapers are now routinely read on displays that are the size of small books.”

• 2009: “At least half of all (business) transactions are conducted online.”

Page 23

• 2009: “There is effective convergence of all media, which exist as digital objects (that is, files) distributed by the ever-present high-bandwidth, wireless information web. Users can instantly download books, magazines, newspapers, television, radio, movies, and other forms of software to their highly portable personal communication devices.”

Ray Kurzweil: The Age of Spiritual Machines

Page 24

2009

• A $1,000 PC delivers Terahertz speeds
• PCs with high resolution visual displays come in a range of sizes
– from those small enough to be embedded in clothing and jewelry
– to the size of a thin book
• Cables are disappearing
– Communication between components uses wireless technology, as does access to the Web
• The majority of text is created using continuous speech recognition
– Also ubiquitous are language user interfaces.
• Most routine business transactions (purchases, travel, etc.) take place between a human and a virtual personality
– Often the virtual personality includes an animated visual presence that looks like a human face

Page 25

• 2019: “Reading books, magazines, newspapers, and other Web documents; listening to music; watching three-dimensional moving images (for example, television, movies); engaging in three-dimensional visual phone calls; entering virtual environments (by yourself, or with others who may be geographically remote); and various combinations of these activities are all done through the ever-present communications Web and do not require any equipment, devices, or objects that are not worn or implanted.”

Ray Kurzweil: The Age of Spiritual Machines

Page 26

2029: “The ever learning Society”

• Learning now constitutes the primary focus of the human species.
• Human learning is accomplished using virtual teachers (and virtual libraries?).
• Learning is enhanced by widely available neural implants, which improve memory and perception but cannot yet download knowledge directly.
• Automated agents are learning on their own without human assistance. Machines can now create significant new knowledge with little or no human intervention; unlike humans, machines easily share knowledge structures with one another.

Ray Kurzweil: The Age of Spiritual Machines

Page 27

And Then There Was Music

• RealJukebox
• Winamp
• MP3
• Napster

Page 28

The Growth rates

• Processor performance doubles every 18 months
• Network bandwidth doubles every year
• Storage capacity doubles every nine months
• Soon you will have a processor bottleneck
• 1000 times growth in storage in 10 years
– I already have 250 GB on a single disk
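The doubling periods on this slide translate into ten-year growth factors as follows. This is a back-of-the-envelope sketch that assumes steady exponential growth at the quoted rates; `growth_factor` is a name chosen for this example.

```python
def growth_factor(doubling_months, years):
    """Growth multiple after `years` if capacity doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


# Doubling periods in months, as quoted on the slide.
doubling_periods = {"processor": 18, "network bandwidth": 12, "storage": 9}

for name, months in doubling_periods.items():
    print(f"{name}: about {growth_factor(months, 10):,.0f}x in 10 years")
```

On these assumptions a 12-month doubling gives roughly the "1000 times in 10 years" quoted on the slide, a 9-month doubling gives about ten times more than that, and an 18-month doubling gives only about 100 times, which is the sense in which the processor becomes the bottleneck.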

Page 29

Recognition versus Recall

• Recognition is like seeing your friend’s face in a sea of faces
– even if he has changed since you last saw him
– storage intensive and fast
• Recall is like figuring out how to repair your car’s carburetor using a manual when you have never done that before
– applying knowledge to a new situation
– processor intensive and less storage
• The brain works on recognition
• Present day computers prefer recall – remember the Y2K
• Future computers would work like the brain – recognition

Page 30

Recognition versus Recall – what it does to our DL

• We will move away from quantitative search (key word match) to “aboutness” and content based retrieval
• In future, documents will be read more by computers than by humans – will it change the way we write? Would we think in HTML or in XML?
• From mere text data to 3D objects, voice and video
• Multilingual
• Every conceivable form of knowledge expression

Page 31

Technology Driven vision for The Digital Library

• We can store everything
– all the knowledge of the human race
– in all forms
– that is the Universal Digital Library
• Cost of Selection is stationary but storage cost is plummeting

It is not about contents alone – it is about networking of people

Page 32

Education

Real-time Engineering Science Business

Universities, Colleges, Schools

3 Ls of Learning
1. Face-to-Face Lectures
2. Virtual Labs
3. Universal Digital Library

Page 33

Universal Library Vision

All recorded information online
• instantly available
– To anyone
– Anywhere in the world
– In any language
– searchable, browsable, navigable by humans and machines

Page 34

Digital Library Contents

• Books
• Periodicals (journals, newspapers)
• Art, photographs
• Databases, software
• Movies, video
• Music, opera, dance

Suppose all of this were on the Web

Page 35

Digital Library of the future

• Digital library
• Digital museum
• Digital tour guide
• Research assistant
• Knowledge amplifier

Page 36

Can we store all the human knowledge in a digital form?

There are about 100 million books written by the human race
Multiply by 10 for all other forms of knowledge
1 book = 500 pp. = 1 MB uncompressed
– 10^9 books = 10^15 bytes = 1 petabyte

140 million computers on the Internet
– At 20 GB free space each, about 2.8 exabytes now

1 GB of disk costs ~$1
– 1 petabyte ~ $1 million
– Our Peta Byte server Initiative
– Storage is not the limitation but creation and coordination are
– Avoiding duplication and connectivity are
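The storage estimate on this slide can be reproduced directly. The sketch below uses only the slide's own inputs and decimal units (1 GB = 10^9 bytes, which is an assumption of this example); note that 140 million machines with 20 GB free each come to about 2.8 × 10^18 bytes, that is, exabytes.

```python
BYTES_PER_BOOK = 10**6        # slide: 1 book = 500 pp. = 1 MB uncompressed
BOOKS = 100 * 10**6           # slide: about 100 million books
OTHER_FORMS_MULTIPLIER = 10   # slide: multiply by 10 for other forms of knowledge
GB = 10**9                    # assumption: decimal gigabytes

# All recorded knowledge, expressed in bytes and petabytes.
total_bytes = BOOKS * OTHER_FORMS_MULTIPLIER * BYTES_PER_BOOK
petabytes = total_bytes / 10**15

# Free disk space across the Internet: 140 million computers, 20 GB each.
free_bytes = 140 * 10**6 * 20 * GB
exabytes_free = free_bytes / 10**18

# Disk cost at roughly $1 per GB.
cost_usd = total_bytes / GB * 1.0

print(f"All knowledge: {petabytes:.0f} PB")
print(f"Free space on the net: {exabytes_free:.1f} EB")
print(f"Disk cost for the whole collection: ${cost_usd:,.0f}")
```

Under these assumptions the free space already on networked machines exceeds the whole collection by a factor of a few thousand, which supports the slide's point that storage is not the limiting factor.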

Page 37

Universal Digital Library

• More than 120 million PCs on the net
• Each having at least 20 GB of free space
• Peer to peer communication
• Can we store all the human knowledge in the computers?

This is today

The time consuming process is taking the printed books to the web – the technology is not an impediment

Page 38

Technology Driven Vision for the Universal Digital Library

• A vision to store everything that the human race ever produced
• A mission to digitize 1 Million Books and make them freely available

Page 39

The Strategy for Scanning of books

• A planetary scanner like the Minolta PS 7000
• Takes about two hours to scan a 500 page book, crop, OCR and convert it to TIFF, HTML and XML files
• About 10,000 pages to the web in a day
• Storage per book is around 60 MB
• 100 terabytes is not an issue
• Our partner Internet Archive has 370 TB, adding 30 TB a day
• Distributed databases
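A rough capacity check helps relate these figures to each other. The sketch below uses only the numbers on this slide plus the one-million-book target from the mission statement; the assumptions that scanner throughput adds up linearly and that 1 TB = 10^6 MB belong to this example, not to the project.

```python
PAGES_PER_BOOK = 500
HOURS_PER_BOOK = 2               # scan, crop, OCR and convert
PAGES_PER_DAY = 10_000           # pages to the web in a day
MB_PER_BOOK = 60
TARGET_BOOKS = 1_000_000         # the project's stated mission

books_per_day = PAGES_PER_DAY / PAGES_PER_BOOK            # books finished daily
scanner_hours_per_day = books_per_day * HOURS_PER_BOOK    # total scanner time needed
total_storage_tb = TARGET_BOOKS * MB_PER_BOOK / 10**6     # decimal terabytes
days_at_this_rate = TARGET_BOOKS / books_per_day

print(f"Books per day: {books_per_day:.0f}")
print(f"Scanner-hours per day: {scanner_hours_per_day:.0f}")
print(f"Storage for a million books: {total_storage_tb:.0f} TB")
print(f"Days to reach a million books at this rate: {days_at_this_rate:,.0f}")
```

Read this way, 10,000 pages a day already needs several scanners working in parallel, a million books fits comfortably within the 100 terabytes mentioned, and the daily rate would have to grow by a large factor for the target to be reached within a few years.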

Page 40

Process Involved

[Flow diagram with the stages: Identification of Books, Pre-Scanning Process, Scanning Process, Image Processing Process, Conversion Process]

Page 41

Scanning

• 2 pages at a time
• Stored in TIFF format

Page 42

Post scanning operations

• Skew Correction
• Document Registration
• Dot Shading and Speck Removal
• Image centering
• Image Cropping
• Smoothing and Completion
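Two of the operations listed, speck removal and cropping, can be illustrated on a tiny binary bitmap. This is a toy sketch in plain Python, not the project's image-processing software: the page is a list of rows with 1 for ink and 0 for background, a "speck" is taken to be an ink pixel with no ink among its eight neighbours, and both function names are invented for this example.

```python
def remove_specks(img):
    """Clear ink pixels that have no ink among their eight neighbours."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 1:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out


def crop_to_content(img):
    """Trim blank margins so the image is the bounding box of the ink."""
    rows = [y for y, row in enumerate(img) if any(row)]
    if not rows:
        return []
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


page = [
    [1, 0, 0, 0, 0, 0],   # isolated speck in the corner
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],   # a 2x2 block standing in for text
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = crop_to_content(remove_specks(page))
print(cleaned)
```

Running speck removal before cropping matters here: a stray speck near the page edge would otherwise widen the bounding box and keep the blank margin.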

Page 43

Image comparison

Original Image

Page 44

Processed Image SW 1

Page 45

OCR CONVERSION

Page 46

Performance evaluation for various fonts in Kannada language OCR

Series1: Average performance efficiency before using the cropping software.

Series2: Average performance efficiency after using cropping software.

Page 47

The Digitized book

Page 48

• Average book size ~ 500 pages
• Size of page as image ~ 50-150 KB
• Size of page as text file (rtf/htm) ~ 8-15 KB
• Average size of digitized book ~ 60 MB
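The 60 MB average can be sanity-checked against the per-page figures. The sketch below assumes decimal units (1 MB = 1000 KB) and that a digitized book keeps both the page images and the text files; both assumptions are this example's, not the slide's.

```python
PAGES = 500
IMAGE_KB_PER_PAGE = (50, 150)   # low and high estimates from the slide
TEXT_KB_PER_PAGE = (8, 15)


def book_size_mb(kb_range):
    """Return the (low, high) size in MB of a 500-page book."""
    low, high = kb_range
    return PAGES * low / 1000, PAGES * high / 1000


image_mb = book_size_mb(IMAGE_KB_PER_PAGE)
text_mb = book_size_mb(TEXT_KB_PER_PAGE)
combined_mb = (image_mb[0] + text_mb[0], image_mb[1] + text_mb[1])

print(f"Images: {image_mb[0]:.0f}-{image_mb[1]:.0f} MB")
print(f"Text:   {text_mb[0]:.1f}-{text_mb[1]:.1f} MB")
print(f"Both:   {combined_mb[0]:.0f}-{combined_mb[1]:.1f} MB")
```

The quoted 60 MB average sits inside the combined range, closer to the upper end, and the page images account for nearly all of it.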

Page 49

Brightness – Dark (1 on the scale) and contrast – 9 (on the scale)

Original image

Cropped image

Page 50

Million Books to the Web – Stakeholders as Partners

• Academia – CS, IS and users
• Researchers and Language Technologists
• Cultural and Religious Organizations
• Public Libraries
• Government Agencies
• None too exclusive

Page 51

Background and Status

• Collaborative project between India and US
• Lead roles by CMU and IISc
• Initiated by CMU sending scanners free of cost to India; NSF supported
• Initiated by the Office of the Principal Scientific Advisor to GOI by seed funding to IISc
• Fuelled by MCIT’s wholehearted support
• More than 16 centres in academic, religious and government institutions spread across the country
• 69 scanners in place
• China, Egypt (Alexandria Library), Sri Lanka, Australia joining in
• There is light on the other side of the tunnel

Page 52

Hubs of DL Activities in India

Anna University, Chennai, Tamil Nadu
Arulmigu Kalasalingam College of Engineering, Srivilliputur, Madurai, Tamil Nadu
Goa University, Goa
Indian Institute of Information Technology, Allahabad, Uttar Pradesh
International Institute of Information Technology, Hyderabad, Andhra Pradesh
City and State Central Library, Andhra Pradesh
Shanmugha Art, Science, Technology & Research Academy, Thanjavur, Tamil Nadu
Sringeri Mutt, Sringeri, Karnataka
Tirumala Tirupathi Devasthanams, Tirupathi, Andhra Pradesh
Maharashtra Industrial Development Corporation, Maharashtra
University of Pune, Pune
Kanchi University, Kanchi, Tamil Nadu
Indian Institute of Astrophysics, Karnataka

Page 53

Scanner Operation at Hubs

[Bar chart: number of scanners operating at each hub; vertical axis from 0 to 45]

Page 54

Progress of Various Centres in Scanning

[Bar chart: No. of Books scanned per Centre. Centres on the axis: IISc, AKCE, SASTRA, TTD, MIDC, PUNE, AU, Kanchi, CCL, SCL. Data labels, in extracted order: 1704, 1031, 1097, 2000, 504, 465, 273, 158, 6276, 3042.]

Page 55

Number of Pages Scanned

[Bar chart: No. of Pages scanned per Centre; vertical axis from 0 to 1,400,000. Data labels, in extracted order: 837708, 158933, 451452, 500000, 134100, 97334, 152502, 39395, 1319001, 1080759.]

Page 56

Category of Books

[Bar chart: number of books per language; vertical axis from 0 to 5000. Languages on the axis: English, Telugu, Tamil, Sanskrit, Kannada, Others, Urdu. Data labels, in extracted order: 2962, 5596, 836, 430, 176, 168, 384.]

Page 57

Cumulative Status

Books: 16,550

Pages: 4,771,184

Page 58

More Centres and Initiatives

Already 61 scanners in operation + 39 in the pipeline

• Rashtrapati Bhavan
• Punjab Technical University
• IIIT Hyderabad and University of Hyderabad

Page 59

MCIT’s Initiatives

• Mobile van with VSAT for the Book Mobile
• ERNET providing connectivity to all centres
• Many centres supported with funds for computers and for scanning operations
• Total spending from Government support and from the Scanning Centres’ resources is ten times more than the scanning equipment cost and effectively 100 times more

• Support from all quarters of the government, religious leaders, academia and private agencies

• Universal Digital Library of India to be launched

Page 60

Some Observations and the Road Ahead

• More than 5 million pages have been scanned
• The highest average rate of sustained scanning was about 4,000 pages per day at Hyderabad during February.
• Our goal is to establish best practices to reach 6,000 pages a day
• 3 years – 1 M Books
• By 2020 – 20 Million Books, 2 Million Songs, 200,000 Movies
• The most enviable content creation

Page 61

Road Ahead

• Establishing the Digital Library of India on the same lines as the E-Governance Initiative
• Under the MCIT
• Headquartered in AP
• A think tank for content selection, delivery, technology and policy directions for the country
• Creation of special funds for 4C

Page 62

Criteria for Selecting Mega Centres – 5 of them planned

• Geographical distribution
• Availability of contents of interest to a larger user base
• Local enthusiasm to support and sustain this activity
• Budget of US$ 200,000 initially and around 0.5 cent per page of output
• One single scanner can produce 2 million pages a year
• We will have 300 scanners – a million books a year
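The scaling claim on this slide can be checked with simple arithmetic. The sketch below combines the slide's figures with two assumptions of this example: an average of 500 pages per book (the figure used elsewhere in the talk) and scanners running 365 days a year.

```python
PAGES_PER_SCANNER_PER_YEAR = 2_000_000
SCANNERS = 300
PAGES_PER_BOOK = 500            # assumption: average book size used elsewhere in the talk
COST_PER_PAGE_CENTS = 0.5       # slide: around 0.5 cent per page of output
DAYS_PER_YEAR = 365             # assumption: scanners run every day

pages_per_year = PAGES_PER_SCANNER_PER_YEAR * SCANNERS
books_per_year = pages_per_year / PAGES_PER_BOOK
output_cost_usd = pages_per_year * COST_PER_PAGE_CENTS / 100
pages_per_scanner_per_day = PAGES_PER_SCANNER_PER_YEAR / DAYS_PER_YEAR

print(f"Pages per year: {pages_per_year:,}")
print(f"Books per year: {books_per_year:,.0f}")
print(f"Per-page output cost per year: ${output_cost_usd:,.0f}")
print(f"Pages per scanner per day: {pages_per_scanner_per_day:,.0f}")
```

On these assumptions 300 scanners yield a little over a million books a year, and each scanner has to sustain several thousand pages a day to reach 2 million pages a year.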

Page 63

Road Ahead

• Mega Content Creation Centres
• New Delhi, Varanasi, Allahabad, Hyderabad, Far East (Tawang or Guwahati), Kolkata and Chennai
• Each Centre having around 40 scanners and 5 mobile scanners
• Content Creation Centres with up to 5 scanners in Gujarat, Rajasthan so as to cover the entire country
• Spearheading Language Technology Initiatives
• Adding voice and video of our heritage

Page 64

Universal Digital Library

• Goal: to have all public knowledge online, available for free to all, everywhere
• An achievable goal
– There are only some 100,000,000 books in the world
– A few billion dollars could bring these online
• Limitations
– Copyright and licensing issues
– Different language books and character recognition technologies
• We must ensure that English is not necessarily the de facto language
• Universal Library

Page 65

TECHNOLOGICAL CHALLENGES

• Input (scanning, digitizing, OCR)
• Data representation
– text, notations, images, web pages
• Navigation and Search
• Multilingual Issues
• Output (voice, pictures, virtual reality)
• Synthetic Documents

Page 66

SEARCH ENGINE of UDL

• Very powerful, lightweight and scalable CMU search engine
• Greenstone
• Both are working and are being evaluated for the choice
• Both have been modified for use as Indian Language search engines – language independent search
• Future – Semantic web and content based retrieval – Speech input and speech output

Page 67

COMPARATIVE ANALYSIS – GREENSTONE Vs UDL SEARCH ENGINES

Greenstone
• Time taken: not dependent on the number of hits
• Boolean: OR & NOT; default: AND
• Proximity: phrase searching
• Case: user can select the option
• Stemming: allowed

UDL
• Time taken: highly dependent on the number of hits
• Boolean: OR; default: AND
• Proximity: no
• Case: no case sensitivity
• Stemming: not available
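The Boolean behaviour compared above (terms combined with AND by default, with OR and NOT as explicit operators, and optional case sensitivity) can be sketched with a small inverted index. This is a toy illustration only; it is not the Greenstone or UDL implementation, the sample documents are invented, and the query is evaluated strictly left to right with no operator precedence.

```python
from collections import defaultdict


def build_index(docs, case_sensitive=False):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word if case_sensitive else word.lower()].add(doc_id)
    return index


def search(index, query, case_sensitive=False):
    """Evaluate a query left to right; terms are ANDed unless OR is given."""
    all_docs = set().union(*index.values())
    result, operator, negate = None, "AND", False
    for token in query.split():
        if token in ("AND", "OR"):
            operator = token
            continue
        if token == "NOT":
            negate = True
            continue
        term = token if case_sensitive else token.lower()
        hits = index.get(term, set())
        if negate:
            hits = all_docs - hits
        if result is None:
            result = set(hits)
        elif operator == "OR":
            result |= hits
        else:
            result &= hits
        operator, negate = "AND", False   # reset to the default after each term
    return result if result is not None else set()


# Hypothetical documents, for illustration only.
docs = {
    1: "Kannada OCR font evaluation",
    2: "Telugu OCR results",
    3: "Kannada poetry",
}
index = build_index(docs)
print(search(index, "kannada ocr"))         # default AND
print(search(index, "kannada OR telugu"))
print(search(index, "ocr NOT kannada"))
```

Stemming and proximity (phrase) search, the other two columns in the comparison, would need a stemmer and word positions in the index and are left out of this sketch.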

Page 68

Choice of Collection

• Use books from libraries that are beyond copyright
• Administrative metadata from OCLC, ISBN, and other sources
• Dublin Core for Indian Books
• A copyright metadata – aggressive attempts to obtain copyright – free copyright from many agencies including GoI
• Source Library Metadata
• Converge towards focused collection
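As an illustration of what a Dublin Core description of a scanned book might look like, the sketch below lists the fifteen elements of the Dublin Core Metadata Element Set and checks a record against them. The record values are hypothetical placeholders, and the helper names are invented for this example; nothing here is taken from the project's actual metadata schema.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical record for a scanned book; every value is a placeholder.
record = {
    "title": "Example Kannada Book",
    "creator": "Example Author",
    "language": "kn",                      # ISO 639-1 code for Kannada
    "format": "image/tiff",                # scanned pages are stored as TIFF
    "source": "Example Source Library",    # stands in for source library metadata
    "rights": "Out of copyright",          # stands in for copyright metadata
}


def missing_elements(rec):
    """Dublin Core elements that the record does not fill in."""
    return [element for element in DC_ELEMENTS if element not in rec]


def unknown_elements(rec):
    """Keys in the record that are not Dublin Core elements."""
    return [key for key in rec if key not in DC_ELEMENTS]


print("Missing:", missing_elements(record))
print("Unknown:", unknown_elements(record))
```

A check of this kind could flag incomplete records before a book is published to the web; which elements are mandatory is a policy choice the slide does not specify.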

Page 69: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Funding – Road Ahead

• Funding effort must be an organized activity

• Commercial funding unlikely for “public good” activity

– Must go to governments, NGOs

• World Bank

• Qatar (if CMU deal succeeds)

• Benefits of UDL:

– Digital Opportunity

– Use in distance education

– International involvement – cultural diversity

– Technology dissemination

– Low cost v. conventional libraries

• Funding is tied to Outreach (next slide)

Page 70: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Outreach

• The UDL message must be disseminated

• Present at World Summit (WSIS) in Geneva (12/03)

• Pre-WSIS meeting at CERN (12/03)

• Establish liaison with UN Decade of Literacy (2003-2013)

• Points:

– Terabyte servers

– “Free to read” policy

– Universal Dictionary (applicability to other domains)

Page 71: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Access by Public

• All content free to read, print one page at a time

• Restrictions imposed by donors will be respected

• Categories of use will be recognized, e.g. cannot print entire document

• Buttons, links to fulfillment houses and publishers are allowed- to take in “born Digital” copyrighted material

Page 72: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Partner Relations- Future

• All material scanned or input as part of the UDL will be shared by all partners

• Preference for national umbrella organizations to simplify international partner relations

• Relationships between partners and their national DLs encouraged

• Online communication and collaboration tools needed to facilitate partner questions and interchanges

• Written partnership agreement will be made

Page 73: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Standards

• Published standards within the UDL

• Quality control and testing standard

• Funding to be sought to support standards development

• Logo to be developed (graphic device without words). Must appear on all sites, all pages

• Logo should have a hot link to a gateway site that links all UDL sites

• Local variability in look and feel of sites is permitted so long as the logo is displayed

Page 74: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Scanning/OCR Policy

• We scan what gives greatest impetus to continued funding

• Language: majority of content in English; otherwise no restriction

• Scans will be previewed for minimum quality; OCR will not be corrected unless local site desires

Page 75: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Metadata

• All entries MUST have metadata according to MARC or Dublin Core
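As an illustration of what a Dublin Core entry might look like, the sketch below builds a small XML record using the Dublin Core element namespace. The wrapper element `record`, the set of required fields, and the sample values are assumptions made for illustration. The slide does not say which elements are mandatory.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set namespace
REQUIRED = ("title", "creator", "language")  # assumption: not specified on the slide

def missing_elements(fields):
    """Return the required element names that have no non-empty value."""
    present = {name for name, value in fields if value}
    return [name for name in REQUIRED if name not in present]

def dublin_core_record(fields):
    """Build an XML record with one dc:* child per (element, value) pair."""
    missing = missing_elements(fields)
    if missing:
        raise ValueError(f"missing Dublin Core elements: {missing}")
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record

# Hypothetical sample entry
fields = [
    ("title", "Example Book Title"),
    ("creator", "Example Author"),
    ("language", "ta"),
    ("rights", "Public domain"),
]
record = dublin_core_record(fields)
xml_text = ET.tostring(record, encoding="unicode")
```

Rejecting an entry that lacks the required elements mirrors the "MUST have metadata" rule at intake time rather than after the book is published.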

Page 76: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Copyright

• Public domain materials: no restrictions, tools for printing entire document provided

• Works of uncertain copyright status:

– Good faith effort to determine status, locate owner

– Scan and index work

– After a waiting period (at least one month), make work viewable

• Archival material (old but unique)

– Allow resolution restriction to avoid devaluation of original

• Out-of-print in-copyright (OPIC)

– Seek blanket permissions from publishers
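The four categories above can be read as a small decision rule. The sketch below is one possible encoding, not the project's actual system. The status labels, the returned strings, and the choice of 30 days for "at least one month" are all assumptions.

```python
from datetime import date, timedelta

# "At least one month" on the slide; 30 days is an assumed minimum.
WAITING_PERIOD = timedelta(days=30)

def access_policy(status, indexed_on=None, today=None):
    """Return the access rule for a work, given its (hypothetical) status label."""
    today = today or date.today()
    if status == "public_domain":
        return "viewable, full-document printing allowed"
    if status == "uncertain":
        # Scanned and indexed first; viewable only after the waiting period.
        if indexed_on is not None and today - indexed_on >= WAITING_PERIOD:
            return "viewable"
        return "indexed only, waiting period running"
    if status == "archival":
        return "viewable at restricted resolution"
    if status == "opic":
        return "pending blanket permission from publisher"
    raise ValueError(f"unknown copyright status: {status}")
```

For example, a work of uncertain status indexed on 1 June would stay index-only on 15 June and become viewable once 30 days have passed.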

Page 77: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Possible Intake Model

[Diagram. Labels: CMU UL SERVER; INDIA CENTRAL MIRROR SITE; ENGLISH INTAKE; TAMIL INTAKE; GUJARATI INTAKE; HINDI INTAKE; ART INTAKE; SCANNING CENTER (several); LOCAL MATERIALS (several); CHINESE MIRROR SITE; AUSTRALIAN MIRROR SITE. Regions: INDIA; OUTSIDE INDIA]

Page 78: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

The Digital Library a Test Bed for language research

• Rich data in many languages from the Million Books to the Web project - at least 10,000 books in any language

• Translations in many languages- Gita, NBT, NCERT etc- an excellent tool for language translation

• Training data for the OCR

• The case insensitive ITRANS standard

Page 79: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

The Digital Library a Test Bed for language research

• Rich data makes the creation of OCRs in Indian languages easy- In Tamil, Kannada and Malayalam – A rapid prototyping

• Speech synthesis and recognition

• Indian Language Search Engines

• Example Based Machine Translation

• Universal Dictionary

Page 80: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

The Universal Dictionary

Word | English | POS | Pron | Use | Lang
danúbia | linen tape | | | | HUN
danum | water | | | | PMP
danun | early | | | | PMP
danup | hunger | | | | PMP
danup | hunger, starvation | | | | PMP
danupan | hungry, starving | | | | PMP
daný | existent | | | | SLO
daný | existing | | | | SLO
daný | given | | | | SLO
daný číslom | numerical | | | | SLO
daný na pospas | obnoxious | | | | SLO
danyag | landscape | n | | | HIL
daog | overturn | v | | | CEB
daog | prevail | v | | | CEB
daogdaog | manhandle | v | | | CEB
daong | boat with a covered cabin, ark | | | | TAG
daong | bring the ship to shore | | | | TAG
daot | harm | v | | | CEB
daot | mar | v | | | CEB
daotan | bad | adj | | | CEB
daotan'g buut | dislike | n | | | CEB
daotan'g hitabo | mishap | n | | | CEB
daotan'g tinguha | malice | n | | | CEB
daotan'g tuyo | malice | n | | | CEB
dapa | granary | n | | | CEB
dapa | lie flat on stomach or face down | | | | PMP
dapa | lie flat on stomach or face down | | | | TAG
dapače | on the contrary | adv | | | BOS
dapadnúť (na nohy) | to land | | | | SLO
d'apaiser | to appease | v | | | FRE

Language codes: HUN = Hungarian; PMP = Kampampangan; SLO = Slovak; HIL = Hiligaynon; CEB = Cebuano; TAG = Tagalog; BOS = Bosnian; FRE = French
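A dictionary of this shape can be modelled as one lookup table keyed by word, with the language code stored on each sense. The sketch below uses a few rows from the slide. The function names and the record layout are assumptions for illustration, not the project's actual data format.

```python
from collections import defaultdict

# (word, English gloss, part of speech, language code), taken from the slide
ENTRIES = [
    ("danum", "water", "", "PMP"),
    ("daog", "overturn", "v", "CEB"),
    ("daog", "prevail", "v", "CEB"),
    ("dapa", "granary", "n", "CEB"),
    ("dapa", "lie flat on stomach or face down", "", "TAG"),
]

def build_lookup(entries):
    """Group all senses of a word, across languages, under one key."""
    lookup = defaultdict(list)
    for word, english, pos, lang in entries:
        lookup[word.lower()].append({"english": english, "pos": pos, "lang": lang})
    return lookup

def translate(lookup, word, lang=None):
    """Return English glosses for a word, optionally limited to one language."""
    senses = lookup.get(word.lower(), [])
    return [s["english"] for s in senses if lang is None or s["lang"] == lang]

lookup = build_lookup(ENTRIES)
```

Keeping every language in a single table is what lets one query such as `translate(lookup, "dapa")` surface both the Cebuano and the Tagalog senses, while passing `lang="TAG"` narrows it to one.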

Page 81: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Aboutness Hierarchy- Dr Shamos

[Diagram. Labels: Universe; Collection; Book; Newspaper; Chapter; Article; Section; Paragraph; Sentence; Word; Glyph; Photograph; Object; 3D Artifact. Annotations: "KEYWORD SEARCHING OCCURS HERE"; "SUBJECT SEARCHING OCCURS HERE"]

Page 82: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Legal and Business Challenges

• Use of copyrighted material

• Economics (Who pays? Who gets?)

• Privacy

• Reliability of information

• Change in the nature of teaching

• Change in the nature of Information creation and use

Page 83: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Philosophy of Copy Right Laws

• Protect the Inventor so that private investments in R & D would flow

• Disseminate the information so that society grows

• Protect fair use

• Ensure you get what you paid for

Page 84: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

What can be copyrighted ?

• Must be tangible, e.g. a lecture can’t be copyrighted, a transcript of it can

• Work must be original

• Work must be creative - even minimal efforts usually count as creative

Page 85: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Fair use doctrine

Authorizes any person to make fair use of a published or unpublished copyrighted work (including the making of unauthorized copies) in these contexts:

• In connection with criticism of or comment on the work

• In the course of news reporting

• For teaching purposes, or

• As part of scholarship or research activity

Page 86: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Four basic Factors:

1. The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

2. The nature of the copyrighted work

3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

4. The effect of the use upon the potential market for or value of the copyrighted work

Page 87: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

www.library.org principles

1. Scholarly and government information and knowledge is a public good that should be available, maintaining the balance of the rights of the individual creator vs. the needs of the public

2. The Library is the intellectual crossroads of the community.

3. Librarians will conceptualize and ensure implementation of innovative new systems for the creation and dissemination of information for succeeding generations.

Page 88: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

“This rule provides that the first sale of a copy of a work to a member of the public ‘exhausts’ the rights holder’s ability to control further distribution of that copy. A library is thus free to lend, or even rent or sell, its copies of books to patrons”

How does this work in the Digital World ?

Page 89: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Music, Movie and Entertainment Industry

• Much larger part of most of the economies

• Large production costs

• Need to protect business interest

• Need for technology to protect

• NAPSTER – peer to peer communication

• DeCSS

• NAPSTER for video ??

• Consumer is different from the creator

Page 90: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

New paradigms in the Digital Library

• Should the laws used for protecting commercially attractive enterprises such as patents, music and entertainment be applied to DLs?

• The dissemination of information multiplies it, unlike in music etc.

• Shorter life cycles for the information

Page 91: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Copyright Conflicting requirements

Need to protect the financial interests of creators in order to encourage private investments to the economy

Need to create a framework for every human being to create

The 2nd principle should dominate in DL

The 1st principle should dominate the others

Page 92: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

The Concept of FourC

The scientific community is the only one that is creator and consumer of information

It pays for both

The SW industry has shown the way with freeware

Can we do it in scholarly communication, text books etc.?

Page 93: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

The Concept of FourC

In the 20th Century, in the interest of public good the Governments created BBC, PBS, AIR and also the Public Library System- provided compensation for artists and writers while providing free access to public

Total global expenditure on public broadcasting and public libraries exceeds $100 billion

Look at our kings who supported all the poets and scholars

We need to find the 21st Century equivalent of BBC, AIR and PBS.

Page 94: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

The Concept of FourC

Learn from NAPSTER- will we have a video equivalent of NAPSTER?

It is impossible to police and protect IP rights at gigabit rate connections

Some countries and WIPO, under pressure from lobbying groups, frame draconian copyright laws

Remember the FAIR USE Doctrine- and what the creators want- recognition and compensation

Page 95: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

The Solution - FourC

Consortium for Compensation of Creative Contents- FourC

Set aside 25% of the current national expenditure on public broadcasting and public libraries (PLs)

Authors are encouraged to put the work on the web after a few years of commercial exploitation- many models- in return get tax exemption etc.

India showing the way: IASc and INSA

Books out of print

Titanic effect

Authors can take back the copyright

Page 96: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

The Solution -FourC

Authors compensation based on the hits

Future versions of text books may be FAQs and XMLised-

Many economic models- Can work for Courseware as well
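One simple reading of "compensation based on the hits" is a proportional split of a fixed pool. The sketch below shows that arithmetic only. The pool size, author names, and hit counts are made up, and the slides do not commit to this particular model.

```python
def allocate_by_hits(pool, hits):
    """Split a compensation pool among authors in proportion to their hit counts."""
    total = sum(hits.values())
    if total == 0:
        # No hits recorded: nobody is paid from the pool.
        return {author: 0.0 for author in hits}
    return {author: pool * count / total for author, count in hits.items()}

# Hypothetical figures
payouts = allocate_by_hits(
    1000.0,
    {"author_a": 600, "author_b": 300, "author_c": 100},
)
```

Here an author with 60% of the hits receives 60% of the pool. Other models (per-hit rates, caps, minimum payments) would change only the body of `allocate_by_hits`.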

Page 97: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

The Solution -FourC

The changing trend in publications- we want the documents to be readable by machines as well as humans

Born digital documents

Can we compensate those creating contents for the web?

Can we compensate those who create music and movies for the web- really small form factor – small screens

Page 98: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

Conclusion

• Knowledge multiplies whenever bits are circulated on the web

• Technology has a habit of creating a problem (by knowledge explosion) and spending the rest of its time in trying to solve it- through Digital Library

• The Universal Digital Library with 20 Million Books by 2020 – the year by which our President dreams of India becoming a developed nation

• A FourC Policy and a Digital Library Act are on the anvil in India to meet this mission

• If a billion people sneeze together, we can create a hurricane

• With the technology of the two nations we will convert this hurricane into useful energy and light up the world of knowledge

Page 99: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

• If you are creating a digital library, it should be for access by anyone, anytime and from any place

• If Your Digital Library Is For Exclusive Use, Let Us Talk About Weather

• There Is Nothing Called, Your DL, My DL

– It Is Our DL

– The Universal Digital Library

Page 100: Million Books to the Web An Example of Indo-US Collaboration Lessons Learnt & The Road Ahead Prof N. Balakrishnan Indo-US Workshop on Open Digital Libraries

It happens only in

India