CMDI workshop, 2012-09-13, Nijmegen
The Virtual Language Observatory
Dieter Van Uytvanck
Overview
• VLO?
• What is behind it? – Relation to CMDI?
• How do I get my data in there?
• Demo + exercises
Context sketch
• Lots of resources somewhere out there:
  • Data collections
    • Corpora
    • Lexica
    • Grammars
    • Multimedia recordings
  • Software
  • Web applications / services
  • Old-school linguistic resources:
    • Books
    • Articles
    • CD-ROMs
• It's like a jungle, sometimes ...
VLO: the idea
• Researcher: “where do I start?”
• Provide a single entry point giving access to all information
• Because of the large amount of data:
  • Drill-down paradigm (decrease search space gradually)
• Multiple ways of exploring:
  • Full-text search
  • Facet browsing
  • Geographic overlay
• Unified interface, links to the original context
VLO?
• Virtual Language Observatory
• http://www.clarin.eu/vlo/
• Several parts:
  • Facet browser (real search)
  • Google Earth overlay (visualization)
  • LRT inventory (ad-hoc, last resort metadata entry)
Facets?
• A simple way to narrow down the search space, step by step
• Values offered are dynamic: they change with every previous selection made
• Purpose: quickly navigating through a huge amount of metadata
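
The "dynamic values" behaviour can be illustrated with a minimal Python sketch. The records, facet names and helper functions below are invented for illustration and are not taken from the VLO code base; the point is only that the counts offered for one facet are recomputed from whatever records survive the selections made so far.

```python
from collections import Counter

# Hypothetical metadata records; real VLO records carry many more fields.
RECORDS = [
    {"language": "Dutch", "resource_type": "corpus", "country": "Netherlands"},
    {"language": "Dutch", "resource_type": "lexicon", "country": "Belgium"},
    {"language": "Nepali", "resource_type": "audio", "country": "Nepal"},
    {"language": "Nepali", "resource_type": "corpus", "country": "Nepal"},
]


def narrow(records, selection):
    """Keep only the records that match every selected facet value (AND only)."""
    return [
        r for r in records
        if all(r.get(facet) == value for facet, value in selection.items())
    ]


def facet_values(records, facet):
    """Count the values still on offer for one facet in the current result set."""
    return Counter(r[facet] for r in records if facet in r)


# Values offered before any selection, and after selecting language = Dutch.
everything = facet_values(RECORDS, "resource_type")
after_dutch = facet_values(narrow(RECORDS, {"language": "Dutch"}), "resource_type")
print(dict(everything))
print(dict(after_dutch))
```

After the selection, the "audio" value disappears from the resource type facet because no remaining record carries it.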
Facets?
• Purpose: quickly navigating through a huge amount of resources
• Useful too for metadata curation
• Not the tool to answer research questions
VLO Faceted Browser (1)
http://catalog.clarin.eu/ds/vlo
VLO Faceted Browser (2)
VLO Faceted Browser (3)
• Metadata analyzed is in CMDI format
• Metadata sources:
  • CMDI files harvested from CLARIN centres
  • CMDI'fied OLAC records (from CLARIN centres and others)
  • CMDI'fied LRT inventory records
• You can get to resources directly from search results
Exercises (1)
• www.clarin.eu/vlo
• Find some resources in the catalogue:
  • Corpus Gysseling
  • Telephone conversation recordings in Nepal
Exercises (2)
• Find some resources in the Endangered Languages archive which are:
  • (Spoken) discourse with at least two consultants in Asia
  • Or (spoken) discourse with at least two consultants in a Face to Face conversation
Limits
• Inherent limit: simple search
  • No OR combinations possible
  • No sophisticated search operations
• Current limit (to be fixed)
  • Full-text search does not cover all fields, only the ones displayed in the VLO
CMDI architecture
[Diagram: metadata modeler, creator, curator and user roles; component registry & editor; metadata editor; local metadata repository; OAI-PMH data provider; OAI-PMH service provider; joint metadata repository; metadata catalogue; search & semantic mapping; ISOcat; data]
Behind the scenes (1)
• Solr + Lucene
• Tomcat web application
• For parsing the CMDI files: VTD-XML
  • Faster than a SAX parser
  • Still full XPath access
  • Memory-efficient (1.3x~1.5x the size of an XML document)
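
As a rough illustration of how a faceted search maps onto Solr, the sketch below builds a query URL from a search text and a set of facet selections. The request parameters (`q`, `fq`, `facet`, `facet.field`, `rows`, `wt`) are standard Solr parameters; the base URL and the field names `language` and `country` are assumptions made for this example and do not describe the actual VLO index.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical Solr endpoint; the real VLO deployment is not documented here.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_facet_query(text, selected, facet_fields, base_url=SOLR_SELECT):
    """Build a Solr select URL for a full-text search with facet filters.

    text         -- free-text query; an empty string matches all documents
    selected     -- dict of facet field -> chosen value (combined with AND)
    facet_fields -- fields for which Solr should return value counts
    """
    params = [
        ("q", text or "*:*"),
        ("rows", "10"),
        ("wt", "json"),
        ("facet", "true"),
    ]
    params += [("facet.field", field) for field in facet_fields]
    # One filter query per selection. Values containing double quotes
    # would need escaping, which this sketch leaves out.
    params += [("fq", f'{field}:"{value}"') for field, value in selected.items()]
    return base_url + "?" + urlencode(params)


url = build_facet_query("gysseling", {"language": "Dutch"}, ["language", "country"])
print(url)
```

Each selection is sent as its own `fq` filter, and Solr intersects multiple filters, so every extra selection can only shrink the result set.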
Behind the scenes (2)
VLO and ISOcat: natural allies (1)
• The import of metadata files used to be hard-coded
• Now we look at the ISOcat links in the XSDs generated from the CMDI profiles
• Fallback to a hard-coded XPath in case no ISOcat link is found
VLO and ISOcat: natural allies (2)
• Import configuration example:

<facetConcept name="name" allowMultipleValues="false">
  <concept>http://www.isocat.org/datcat/DC-2544</concept>
  <concept>http://www.isocat.org/datcat/DC-2545</concept>
  <concept>http://purl.org/dc/terms/title</concept>
  <!-- no concept in lrt schema -->
  <pattern>/c:CMD/c:Components/c:LrtInventoryResource/c:LrtCommon/c:ResourceName/text()</pattern>
</facetConcept>
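
The lookup order described on these two slides (concept links first, hard-coded pattern as fallback) might be sketched in Python as follows. This is not the VLO importer itself: the function name, the `concept_to_xpaths` index (standing in for what would be derived from a profile's XSD) and the OLAC-style XPath in the usage lines are all hypothetical.

```python
import xml.etree.ElementTree as ET

# The configuration entry from the slide.
CONFIG = """
<facetConcept name="name" allowMultipleValues="false">
  <concept>http://www.isocat.org/datcat/DC-2544</concept>
  <concept>http://www.isocat.org/datcat/DC-2545</concept>
  <concept>http://purl.org/dc/terms/title</concept>
  <!-- no concept in lrt schema -->
  <pattern>/c:CMD/c:Components/c:LrtInventoryResource/c:LrtCommon/c:ResourceName/text()</pattern>
</facetConcept>
"""


def xpaths_for_facet(config_xml, concept_to_xpaths):
    """Return the XPaths to evaluate for one facet.

    concept_to_xpaths is a hypothetical index for a single profile:
    concept link -> XPaths of the elements annotated with that link.
    """
    node = ET.fromstring(config_xml)
    found = []
    for concept in node.findall("concept"):
        found.extend(concept_to_xpaths.get(concept.text.strip(), []))
    if found:
        return found
    # No concept link present in this profile: use the hard-coded pattern(s).
    return [pattern.text.strip() for pattern in node.findall("pattern")]


# A profile whose schema carries a Dublin Core title link (made-up XPath).
olac_like = {
    "http://purl.org/dc/terms/title": ["/c:CMD/c:Components/c:OLAC-DcmiTerms/c:title/text()"]
}
print(xpaths_for_facet(CONFIG, olac_like))
# A profile without any matching link falls back to the pattern.
print(xpaths_for_facet(CONFIG, {}))
```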
How do I get my metadata in there?
• Provide it as CMDI over OAI-PMH
• If that is not possible:
  • Provide it as OLAC over OAI-PMH
  • Provide it as IMDI over OAI-PMH
• If that is not possible either:
  • Enter it into the LRT inventory:
    • www.clarin.eu/inventory
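
For orientation, this is roughly what a harvester's request to an OAI-PMH data provider looks like. The `verb`, `metadataPrefix`, `set` and `resumptionToken` arguments come from the OAI-PMH 2.0 protocol; the endpoint URL is a placeholder, and the prefix `cmdi` is an assumption, since each provider advertises its own prefixes through `ListMetadataFormats`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint; substitute the base URL of your own data provider.
ENDPOINT = "https://example.org/oai"


def list_records_url(endpoint, metadata_prefix="cmdi", oai_set=None,
                     resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH data provider."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it may only be combined with the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if oai_set:
            params["set"] = oai_set
    return endpoint + "?" + urlencode(params)


first_page = list_records_url(ENDPOINT)
# "abc123" stands in for a token returned in the previous response.
next_page = list_records_url(ENDPOINT, resumption_token="abc123")
print(first_page)
print(next_page)
```

Opening the first URL in a browser is a quick check that a provider actually exposes records in the requested format.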
[Diagram: Instances 1-3 (CMDI files) and Profiles 1-3 (XSD files); Component registry; Metadata Repository; Ingester; VLO (XPath = data category); ISOcat.org definitions (Specimen, Habitat, Order)]
Recent Additions
• Links to language information: WALS, Wikipedia, Ethnologue, LinguistList … and the VLO
• Descriptions in the record listing
• National Project facet
• Feedback link
Still to come…
• A faceted browser is only as good as its data, so curation steps are needed
• More CMDI metadata
• Some more facets, e.g. year
• Human-readable hdl links
• Interface improvements
Questions?
• … ask them now
• … or send a mail to [email protected]
• More information:
  • www.clarin.eu/vlo
  • www.clarin.eu/cmdi