search technologies for digital libraries

Contemporary Search Technologies - also for Libraries? Clemens Neudecker, KB – 20/04/2011

Upload: cneudecker

Post on 05-Dec-2014


DESCRIPTION

Introduction Search & Information Retrieval Technologies for Digital Libraries

TRANSCRIPT

Page 1: Search Technologies for Digital Libraries

Contemporary Search Technologies - also for Libraries?

Clemens Neudecker, KB – 20/04/2011

Page 2: Search Technologies for Digital Libraries

Table of contents

Retrieval: Status Quo

New ways of searching

Prototypes & Outlook

Page 3: Search Technologies for Digital Libraries

Lossau (D-Lib Magazine, 2004)

How to position the library as an information provider in the 21st century?

Search services are critical!

http://www.dlib.org/dlib/june04/lossau/06lossau.html

Page 4: Search Technologies for Digital Libraries

Library as a “depot”

Collect

Preserve

Page 5: Search Technologies for Digital Libraries

Library as a “gateway”

New ways of searching and/or browsing

Service infrastructure

User-Generated content

Competition: Internet Search Engines

Page 6: Search Technologies for Digital Libraries

Simple Search

• By keyword

• Boolean operators
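
Keyword search with Boolean operators can be sketched as set operations over an index that maps each word to the documents containing it. This is a minimal illustration only: the three sample titles and all function names are invented for the example and do not describe any particular catalogue.

```python
# Toy collection; the titles are invented for illustration.
DOCS = {
    1: "digital library search services",
    2: "newspaper digitisation and full text search",
    3: "preserving the digital library collection",
}


def build_index(docs):
    """Map each lowercase keyword to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


INDEX = build_index(DOCS)


def lookup(term):
    """Return the ids of all documents that contain the keyword."""
    return INDEX.get(term.lower(), set())


def search_and(a, b):
    """Documents containing both keywords."""
    return lookup(a) & lookup(b)


def search_or(a, b):
    """Documents containing at least one of the keywords."""
    return lookup(a) | lookup(b)


def search_not(a, b):
    """Documents containing a but not b."""
    return lookup(a) - lookup(b)
```

For example, `search_not("library", "search")` keeps only the documents that mention "library" without "search".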

Page 7: Search Technologies for Digital Libraries

Advanced Search

Facets

Views

Phrases

Page 8: Search Technologies for Digital Libraries

Meta-Search

Page 9: Search Technologies for Digital Libraries

Basics

• Crawling

• Indexing

• Searching

• Ranking results

http://nlp.stanford.edu/IR-book/
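
The indexing, searching and ranking steps listed above can be sketched in a few lines (crawling is left out). The example assumes a simple tf-idf weighting, which is one common textbook choice and not necessarily what any given library system uses; the documents and function names are made up for illustration.

```python
import math
from collections import Counter

# Toy documents, invented for illustration.
DOCS = {
    "d1": "library catalogue search",
    "d2": "search engine ranking for web search",
    "d3": "library metadata harvesting",
}


def index_documents(docs):
    """Indexing step: term -> {doc_id: term frequency}."""
    index = {}
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index.setdefault(term, {})[doc_id] = tf
    return index


def rank(query, index, n_docs):
    """Searching and ranking: sum tf * idf over the query terms."""
    scores = Counter()
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # unknown terms contribute nothing
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Best-scoring documents first.
    return [doc_id for doc_id, _ in scores.most_common()]


INDEX = index_documents(DOCS)
```

A call such as `rank("search ranking", INDEX, len(DOCS))` places the document that contains the rarer term "ranking" ahead of the one that only contains "search".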

Page 10: Search Technologies for Digital Libraries

Technology

Apache Lucene/Solr (KB: Migration Verity)

http://lucene.apache.org/

http://lucene.apache.org/solr/

SRU = Search/Retrieve via URL

http://www.loc.gov/standards/sru/

CQL = Contextual Query Language

http://www.loc.gov/standards/sru/specs/cql.html
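
An SRU request is an ordinary URL whose `query` parameter carries a CQL expression. The sketch below assembles such a URL; the endpoint `sru.example.org` is a placeholder, the helper names are invented, and the `dc.title` / `dc.creator` index names are only assumed to be supported by the target server. Quotation marks inside search terms are not escaped here.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
SRU_BASE = "http://sru.example.org/sru"


def cql_all(index_terms):
    """Join index = "term" clauses with the CQL boolean operator 'and'."""
    clauses = [f'{index} = "{term}"' for index, term in index_terms.items()]
    return " and ".join(clauses)


def sru_search_url(cql, max_records=10, start=1):
    """Build a searchRetrieve request URL for an SRU 1.2 server."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql,
        "startRecord": start,
        "maximumRecords": max_records,
    }
    return SRU_BASE + "?" + urlencode(params)


query = cql_all({"dc.title": "faust", "dc.creator": "goethe"})
url = sru_search_url(query, max_records=5)
```

The server would answer with an XML document containing the matching records, which is not shown here.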

Page 11: Search Technologies for Digital Libraries

Retrieval: Status Quo

Catalogue

Metadata

Page 12: Search Technologies for Digital Libraries

Catalogue Search

Page 13: Search Technologies for Digital Libraries

Metadata

Dublin Core (DCMI)

http://dublincore.org/

Z39.50

http://www.loc.gov/z3950/agency/
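
To make the Dublin Core item concrete, the sketch below describes this presentation with four elements from the Dublin Core element set. The wrapping `record` element and the function name are assumptions made for the example; real systems usually embed the elements in a container defined by their own schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a flat record; fields is a list of (element, value) pairs."""
    record = ET.Element("record")  # hypothetical wrapper element
    for element, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return record


record = dublin_core_record([
    ("title", "Contemporary Search Technologies - also for Libraries?"),
    ("creator", "Neudecker, Clemens"),
    ("date", "2011-04-20"),
    ("type", "Text"),
])
xml_text = ET.tostring(record, encoding="unicode")
```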

Page 14: Search Technologies for Digital Libraries

Metadata Harvesting

Open Archives Initiative: OAI-PMH

http://www.openarchives.org/
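
Harvesting with OAI-PMH amounts to requesting `ListRecords` pages and following the resumption token until none is returned. The sketch below builds the request URLs and reads one response page; the base URL, the sample response and the function names are invented for illustration, and no network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", token=None):
    """First request uses metadataPrefix; follow-ups use only resumptionToken."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default=None, namespaces=OAI_NS)
    return ids, (token or None)


# Shortened, invented response used instead of a live repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would call `list_records_url` again with the returned token and stop once `parse_page` yields `None` for it.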

Page 15: Search Technologies for Digital Libraries

Linked Data

Page 16: Search Technologies for Digital Libraries

Authority Data

Named Entities

(Persons, Places, Institutions)

http://viaf.org/

Gazetteers

http://www.world-gazetteer.com/

Other Examples:

LocAuth, PND, NaCo

Page 17: Search Technologies for Digital Libraries

Persistent Identifier

URN = Uniform Resource Name

NBN = National Bibliography Number

Resolver = Translation into web address
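
The resolver idea can be illustrated with a lookup table that translates a URN into its current web address. The URN, the target URL and the function names below are invented; an actual resolver is a service operated by the institution that assigns the identifiers.

```python
# Hypothetical resolver table: persistent name -> current location.
RESOLVER_TABLE = {
    "urn:nbn:xx:example-0001": "http://repository.example.org/objects/0001",
}


def parse_urn(urn):
    """Split 'urn:<namespace>:<namespace-specific string>'."""
    scheme, namespace, nss = urn.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    # Scheme and namespace identifier are case-insensitive.
    return namespace.lower(), nss


def resolve(urn):
    """Translate a URN into a web address, or None if it is unknown."""
    namespace, nss = parse_urn(urn)
    return RESOLVER_TABLE.get(f"urn:{namespace}:{nss}")
```

If the object moves, only the table entry changes, while the URN cited by users stays the same.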

Page 18: Search Technologies for Digital Libraries

Problems

Correctness of data

Coverage

Formats

Alignment

Multilingualism

Page 19: Search Technologies for Digital Libraries
Page 20: Search Technologies for Digital Libraries

What has happened since?

Google Books

The European Library

Europeana

Wolfram/Watson

What’s next?

Page 21: Search Technologies for Digital Libraries

Google Book Search

http://books.google.com/

Page 22: Search Technologies for Digital Libraries

Google Ngram Viewer

http://ngrams.googlelabs.com/

Page 23: Search Technologies for Digital Libraries

The European Library

http://search.theeuropeanlibrary.org

Page 24: Search Technologies for Digital Libraries

Europeana

http://www.europeana.eu/portal/

Page 25: Search Technologies for Digital Libraries

Michael

http://www.michael-culture.org

Page 26: Search Technologies for Digital Libraries

WorldCat

http://www.worldcat.org/

Page 27: Search Technologies for Digital Libraries

IBM Watson

www.ibm.com/uk/Watson

Page 28: Search Technologies for Digital Libraries

Wolfram

http://www.wolframalpha.com/

Page 29: Search Technologies for Digital Libraries
Page 30: Search Technologies for Digital Libraries

The web

The web is not limited to the www!

Data deluge

“Deep web” – not indexed (dynamic) parts

Web of users – currently ~2 billion

Page 31: Search Technologies for Digital Libraries

Internet Archive

http://www.archive.org/

Page 32: Search Technologies for Digital Libraries

Wayback Machine

http://web.archive.org/

Page 33: Search Technologies for Digital Libraries

Web archiving

Page 34: Search Technologies for Digital Libraries

The web as a resource

Knowledge Extraction (not the actual data!)

→ Semantic Web

(web of knowledge,

rather than data)

Page 35: Search Technologies for Digital Libraries

Semantic Web

RDF http://www.w3.org/RDF/

OWL http://www.w3.org/2004/OWL/

SPARQL http://www.w3.org/TR/rdf-sparql-query/

SKOS http://www.w3.org/2004/02/skos/
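
RDF expresses knowledge as subject-predicate-object triples, and a SPARQL query is essentially a set of triple patterns with variables. The sketch below mimics a single pattern in plain Python without an RDF library; the `ex:` identifiers and the `match` helper are invented for the example.

```python
# Each statement is a (subject, predicate, object) triple, as in RDF.
# The "ex:" identifiers are invented for this example.
TRIPLES = [
    ("ex:Faust", "rdf:type", "ex:Book"),
    ("ex:Faust", "ex:creator", "ex:Goethe"),
    ("ex:Goethe", "rdf:type", "ex:Person"),
    ("ex:Goethe", "ex:bornIn", "ex:Frankfurt"),
]


def match(pattern, triples=TRIPLES):
    """Yield variable bindings for one triple pattern; '?x' marks a variable."""
    for triple in triples:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable must bind to the same value everywhere in the pattern.
                if binding.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            yield binding
```

The pattern `("?s", "rdf:type", "ex:Book")` plays the role of the SPARQL clause `?s rdf:type ex:Book`.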

Page 36: Search Technologies for Digital Libraries

Ontologies

Ontology = “Model of the World”

Classes

Instances

Properties
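
A very small ontology can be sketched with three tables, one each for classes, instances and properties. All names below are invented; languages such as OWL offer far richer modelling than this.

```python
# Classes: each class points to its superclass (hypothetical hierarchy).
SUBCLASS_OF = {"Newspaper": "Publication", "Book": "Publication"}
# Instances: each individual points to its direct class.
INSTANCE_OF = {"Faust": "Book"}
# Properties: (instance, property name) -> value.
PROPERTIES = {("Faust", "creator"): "Goethe"}


def classes_of(instance):
    """Return the direct class followed by all its superclasses."""
    result = []
    cls = INSTANCE_OF.get(instance)
    while cls is not None:
        result.append(cls)
        cls = SUBCLASS_OF.get(cls)
    return result


def is_a(instance, cls):
    """True if the instance belongs to cls directly or through inheritance."""
    return cls in classes_of(instance)
```

Because "Book" is declared a subclass of "Publication", a search for publications can return "Faust" even though the record itself only says "Book".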

Page 37: Search Technologies for Digital Libraries

Semantic Graphs

Page 38: Search Technologies for Digital Libraries
Page 39: Search Technologies for Digital Libraries

New resources

Digital libraries (Images + OCR)

Digital born material

The web

→ Interoperability (STITCH, CATCH)

Page 40: Search Technologies for Digital Libraries

Full text (OCR)

"... tte->e°n.m.66-..ie k>okke cire-5^ea. ver.è. 6.or ^ ^ ^ °

kiesrellj-oe-ikei^, v-in eeo ^elj-escdapeo ^UOI^, 7

^n>5«--'-/-r. veel8-Iiec-jc ttui5vroll^ v,a 'z » ^ v e . X. «. ^ ^ I» 2 L t. L ^-i ? > " Z Z^

l»v«e».ic. sx ^ ^ , 6en 2 l8c«. Leb. ^ L I L I tZ.

6eo zc> ^pr>!, >«(ZS. 8 O II 0 v ? L W. . L^-L"

. . ^ ... ,. , ^,a «ore Vrienilea ea Lekenaêll zeven dy aeeea ^^

^ LLQ d2i« 4 urea, 18 myoe ttuisvi-ouiv, van Kenoi5, Sis asr 0v?e darlelvk >zetief6e Vscier', ?. L08, op L

«eea vel.^esckspLa ^5^()I>Z verlof. Ke6ed w»cj6zZ reo l2urev, as eev Verval vsn ^evev^drscdceo, ^ ^ ^ "A.

Oevki>i7L«., K0>.^^Q8N()VL^, secZerr z ''Vckev öeclle^ri^ te , jv6evou6er6oru " ^

<Zen Zv ^pri!, 1806. ^x>0lè:ecsr. vsv dyQ!l 92 ^sr^n, ker ^clelvke vzet det Leu^visie vervvzilelc! 'O L ^ ^ ^ '-

".' «eckea mi6ck»z ruim êên uur verlatte ovvorfpieck-z. i>«kl. ^0-6 k»rskter verdeaxSe »Ue ryve iiinöeren en L--»S « > I L^Z

Page 41: Search Technologies for Digital Libraries

OCR Lexica

Word matching (fuzzy words)

Frequency

Morphology

Historic forms

Inflected forms
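
Fuzzy word matching against a lexicon can be sketched with Python's standard `difflib` module. The five-word lexicon, the misrecognised tokens and the cutoff of 0.75 are assumptions chosen for the example; a production lexicon would also use the frequency, morphology and historic-form information listed above.

```python
import difflib

# Tiny invented lexicon; a real one would hold many thousands of word forms.
LEXICON = ["huisvrouw", "vrienden", "bekenden", "kinderen", "overleden"]


def correct(token, lexicon=LEXICON, cutoff=0.75):
    """Return the closest lexicon word, or the token itself if none is close."""
    matches = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```

At search time the same matching can be applied to the query instead, so that a user typing the correct word still finds pages where the OCR text is slightly wrong.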

Page 42: Search Technologies for Digital Libraries

Visibility

“Hidden” - only indexed

Highlighting in image

Full text behind image (PDF)

Parallel/switched mode

User Correction/Annotation

Page 43: Search Technologies for Digital Libraries

Hidden in index

Page 44: Search Technologies for Digital Libraries

Image highlighting

Page 45: Search Technologies for Digital Libraries

PDF

Page 46: Search Technologies for Digital Libraries

Parallel/Switched

Page 47: Search Technologies for Digital Libraries
Page 48: Search Technologies for Digital Libraries

Crowdsourcing

Page 49: Search Technologies for Digital Libraries

Crowdsourcing examples

UIBK Catalogue

NLA Newspapers

http://trove.nla.gov.au/newspaper

Digitalkoot

http://www.digitalkoot.fi/en/splash

Concert

TranscriBentham

http://www.transcribe-bentham.da.ulcc.ac.uk/td/Transcribe_Bentham

Page 50: Search Technologies for Digital Libraries

UIBK Catalogue

Page 51: Search Technologies for Digital Libraries

Trove I

Page 52: Search Technologies for Digital Libraries

Trove II

Page 53: Search Technologies for Digital Libraries

Digitalkoot

Page 54: Search Technologies for Digital Libraries

Concert

Page 55: Search Technologies for Digital Libraries

TranscriBentham

Page 56: Search Technologies for Digital Libraries
Page 57: Search Technologies for Digital Libraries

Prototypes

Page 58: Search Technologies for Digital Libraries

Prototype: FEP

Page 59: Search Technologies for Digital Libraries

Prototype: Assets

http://virserv.isti.cnr.it:8080/assetsIRService/index

Page 60: Search Technologies for Digital Libraries

Prototype: Semantic Search

http://eculture.cs.vu.nl/europeana/session/search

Page 61: Search Technologies for Digital Libraries

Prototype: Waisda

http://waisda.q42.net/, http://blog.waisda.nl/

Page 62: Search Technologies for Digital Libraries

Prototype: Geospatial Search

Page 63: Search Technologies for Digital Libraries

Prototype: Image Annotation

http://dme.arcs.ac.at/annotation/

Problem: No Flash in Europeana (A/V content)

Page 65: Search Technologies for Digital Libraries

Prototype: Random Image Explorer

http://europeana.fe2.nl/ (Willem Jan Faber, KB)

Page 66: Search Technologies for Digital Libraries
Page 67: Search Technologies for Digital Libraries

Solution: Common API

API = Application Programming Interface

Set of descriptions defining how to access an electronic resource/application through a common interface

Page 68: Search Technologies for Digital Libraries

API

Documented Interface Definition

Machine readable

Public/shared

Page 69: Search Technologies for Digital Libraries

API Benefits

Data/functionality available through documented, public interfaces

Anybody can use it

Can be integrated in other services/tools

Can be compared, combined, linked

Libraries need not be the actual host
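
The benefits above can be illustrated with a sketch in which two collections implement the same documented interface and a third party combines their results. The `SearchAPI` interface, the record fields and the collection names are hypothetical and stand in for whatever a shared specification would actually define.

```python
import json
from abc import ABC, abstractmethod


class SearchAPI(ABC):
    """Hypothetical common interface that every collection would implement."""

    @abstractmethod
    def search(self, query):
        """Return records as dicts with 'id', 'title' and 'source' keys."""


class InMemoryCollection(SearchAPI):
    """Stand-in for a library collection; a real one would query its index."""

    def __init__(self, source, titles):
        self.source = source
        self.titles = titles

    def search(self, query):
        q = query.lower()
        return [
            {"id": f"{self.source}:{i}", "title": t, "source": self.source}
            for i, t in enumerate(self.titles)
            if q in t.lower()
        ]


def federated_search(apis, query):
    """Combine results from several providers and return them as JSON."""
    results = []
    for api in apis:
        results.extend(api.search(query))
    return json.dumps(results)
```

The caller of `federated_search` needs no knowledge of how each collection stores its data, which is why the host of the combined service does not have to be a library.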