lamus & lat archiving software - clarin · 2020. 11. 2. · lamus is a web-application that...
The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands
LAMUS & LAT Archiving software
Daan Broeder
Max-Planck Institute for Psycholinguistics
• MPI for Psycholinguistics research corpora: child language, bilingualism, gesture, sign language, Corpus Spoken Dutch, second-language learner corpora, etc.
• Archive for the DOBES project
• Hosting (and inviting) corpora for other projects in need (UNESCO study: 80% of all material is endangered)
  – DBD, NGT, Leiden Univ. language documentation corpora
  – Donated endangered language corpora
  – Eibl-Eibesfeldt human ethology collection
• Maintain a metadata catalog for properly described resources from other institutes
  – BAS, C-ORAL-ROM (Univ. Florence), …
  – LR from Lund Univ, INL, other archive partners
• Copy of CHILDES and Talkbank corpora from CMU
Mainly annotated audio/video recordings
50 TB: 200k MD records, 250k AV resources, 200k annotation files, lexicons, sketch grammars, etc.
The Language Archive - 2011
History
• Started in 2000 to try to solve the mounting data chaos at the MPI for Psycholinguistics
• First needed proper data descriptions
• Archive software development linked to the IMDI metadata set for Language Resources
• First archive was basically a file-system with metadata descriptions and resource files
• Tools operating directly on the files
• A researcher’s notebook disk was just as sophisticated
IMDI – ISLE Metadata Initiative
• Metadata schema for Language Resources
• Developed from 2000, also in several EU projects: ISLE, ECHO, INTERA
• Especially multi-media/multi-modal recordings
• 3 XML metadata schemas + special profiles for specific communities: Sign-Language, SL-acquisition, …
• Archiving formats only
• Metadata in XML files
• Relations represented by links
• DBs only as helpers
• Data safety through HSM, pushing data to TLs
TLA Archive Organization
[Diagram: TLA ARCHIVE corpus tree with IMDI metadata nodes and resources. Example levels: language, expedition, age group, genre, sessionX, media file, annot. file]
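The file-based organization on this slide (metadata in XML files, relations represented by links) can be illustrated with a small sketch. The element names `corpus`, `session`, `link` and `resource` are simplified stand-ins invented for this example and are not the IMDI schema; an in-memory dict stands in for files on disk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified metadata files keyed by file name.
# Real IMDI records are much richer; only the linking idea is shown here.
FILES = {
    "root.xml": '<corpus name="language"><link href="exp1.xml"/></corpus>',
    "exp1.xml": '<corpus name="expedition"><link href="s1.xml"/><link href="s2.xml"/></corpus>',
    "s1.xml": '<session name="sessionX"><resource href="x.mpg"/><resource href="x.eaf"/></session>',
    "s2.xml": '<session name="sessionY"><resource href="y.wav"/></session>',
}

def collect_resources(name, files):
    """Follow links from one metadata file down to all resource files."""
    node = ET.fromstring(files[name])
    found = [r.get("href") for r in node.findall("resource")]
    for link in node.findall("link"):
        found.extend(collect_resources(link.get("href"), files))
    return found

resources = collect_resources("root.xml", FILES)
print(resources)
```

Because the tree lives entirely in the linked files, a database in such a design would only need to cache what a walk like this can rebuild.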
Archive Access
[Diagram labels: local tools (ARBIL, ELAN); WWW browser; LOCAL DATA (media files, metadata, annotations); ARCHIVE; IMDI-Browser; HTTP server; resource download; Browsing/Search/Visualization; LAMUS; AMS; upload data; LARI TROVE; PID service]
• All resources accessible by HTTP if authorized
• All web-apps can be configured to use either Shibboleth or a local LDAP for authentication
Archive Administration
[Diagram labels: archive with corpus tree; databases imdidb (corpus structure), amsdb, annexdb, lamusdb; LAMUS, crawler, archive manager, content search, IMDI lucene idx, IMDI search, IMDI browser, AMS; API]
Why ‘user managed’ deposition?
• Increasing costs
  – New, cheaper technologies for recording, digitization and storage cause a huge increase in data quantities.
• Using depositor knowledge
  – Researcher/depositor knows where to put the data in the logical structure (catalogue) of the archive.
  – Communication with archive managers is overhead.
• Offer remote archiving services
  – Support distributed projects
• Stricter checking
  – Make checks explicit
  – Archive managers have short contracts, so knowledge tends to get lost.
• Maximizing deposition
  – 80 percent of all recordings are in danger (UNESCO report)
  – We want to open our archive for external depositors
  – But cannot afford extra workload for archive managers
LAMUS
LAMUS is a web-application that allows
• Uploading and naming individual resources (media, annotations, information files)
• Specifying ‘limited’ metadata and mutual relations for and between resources
• Creating relevant linguistic groupings for the data (sub-corpora)
LAMUS will:
• Carry out checks for consistency and coherence: check for accepted formats etc. (configurable list)
• Update databases and indexes
• Issue PIDs for the new resources and metadata records
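A check against a configurable list of accepted formats might look roughly like the sketch below. The extension list and the function name `check_upload` are assumptions made for illustration; the slides do not say how LAMUS identifies formats, and a real check would probably inspect file content rather than file names.

```python
from pathlib import PurePosixPath

# Hypothetical accepted-format list; the slide only says the list is configurable.
ACCEPTED_EXTENSIONS = {".wav", ".mpg", ".eaf", ".txt", ".pdf"}

def check_upload(filenames, accepted=ACCEPTED_EXTENSIONS):
    """Split uploaded file names into accepted and rejected lists."""
    accepted_files, rejected_files = [], []
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in accepted:
            accepted_files.append(name)
        else:
            rejected_files.append(name)
    return accepted_files, rejected_files

ok, bad = check_upload(["session1.WAV", "session1.eaf", "notes.docx"])
print("accepted:", ok)
print("rejected:", bad)
```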
[Diagram labels: ARCHIVE, WORKSPACE, local disk]
Corpus check-out check-in cycle
[Diagram: The Archive, check out, workspace (modify/add/..), check in]
• Add to original after
  – consistency check
  – versioning
• Local tools: Arbil, ELAN, Shoebox, …
• Using Arbil
• Using LAMUS
TLA – Versioning of resources
TLA versioning policy
• Nothing actually gets deleted
• Users can delete resources, which are then removed from the visible collection (corpus tree) but remain in the archive
• Users can update (replace) existing resources
  – The new version will get a new PID
  – The old version will be shelved but keeps its PID
• Access to old versions is managed by the owner
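The versioning policy can be modelled in a few lines. This is a toy sketch under stated assumptions: the class `VersionedArchive`, its attribute names and the `pid-N` identifier format are all hypothetical, and owner-managed access to old versions is left out.

```python
import itertools

class VersionedArchive:
    """Toy model of the policy: nothing is deleted, every version keeps its PID."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.store = {}    # PID -> content; entries are never removed
        self.visible = {}  # resource name -> current PID (the corpus tree view)
        self.history = {}  # resource name -> all PIDs, oldest first

    def _new_pid(self):
        # Hypothetical PID format, for illustration only.
        return f"pid-{next(self._counter)}"

    def deposit(self, name, content):
        """Add a resource or replace an existing one; returns the new PID."""
        pid = self._new_pid()
        self.store[pid] = content
        self.visible[name] = pid
        self.history.setdefault(name, []).append(pid)
        return pid

    def delete(self, name):
        """Hide a resource from the visible collection; its data stays stored."""
        self.visible.pop(name, None)

archive = VersionedArchive()
first = archive.deposit("story.eaf", "v1")
second = archive.deposit("story.eaf", "v2")
archive.delete("story.eaf")
print(first, second, archive.store[first], "story.eaf" in archive.visible)
```

Keeping the PID-to-content store append-only is what lets an old PID keep resolving after a replace or a delete.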
AMS – Access Management System
• User role administration: archive manager, domain curator, domain manager, domain editor
• Set a required license
• Set access rules per media type: annotations, images, audio, video, info
• A rule sets access/denial to user/group for type of data
• Special groups: ‘all’, ‘registered user’
• Rules have priority
• Inheritance of rules by descendant nodes
[Diagram: corpus tree with Rule 1, Rule 2 and Rule 3 attached to nodes; “Sign academic license”]
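How prioritized, inherited rules could combine is sketched below. The slides do not specify how AMS resolves conflicts, so two assumptions are made and should be read as such: the matching rule with the largest priority number wins, and access is denied when no rule matches. The names `Rule`, `Node` and `may_access` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Rule:
    group: str       # e.g. "all", "registered user", or a named group
    media_type: str  # e.g. "audio", "video", "annotations"
    allow: bool
    priority: int    # assumption: a larger number wins

@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    rules: list = field(default_factory=list)

def applicable_rules(node):
    """Yield rules from the node and all its ancestors (inheritance)."""
    while node is not None:
        yield from node.rules
        node = node.parent

def may_access(node, user_groups, media_type):
    """Decision of the highest-priority matching rule; deny if none match."""
    matches = [r for r in applicable_rules(node)
               if r.media_type == media_type and r.group in user_groups]
    if not matches:
        return False
    return max(matches, key=lambda r: r.priority).allow

corpus = Node("corpus", rules=[Rule("all", "annotations", True, 1),
                               Rule("all", "video", False, 1)])
session = Node("session", parent=corpus,
               rules=[Rule("registered user", "video", True, 2)])

anonymous = {"all"}
registered = {"all", "registered user"}
print(may_access(session, anonymous, "video"))
print(may_access(session, registered, "video"))
print(may_access(session, anonymous, "annotations"))
```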
IMDI-Browser & Metadata Search
• Browse the hierarchy of corpora
• Inspect metadata records
• Create bookmarks
  – resources
  – IMDI-Browser showing resources
• Show PIDs, URLs for resources and metadata
• Make resource access requests
• Search the metadata:
  – simple keyword,
  – complex queries
IMDI-Browser as a jump board
http://corpus1.mpi.nl/ds/imdi_browser?openpath=MPI541199%23
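The `%23` at the end of this link is a percent-encoded `#`. A minimal sketch of building such a link follows; it assumes that the `openpath` value is a node identifier of the form `MPI541199#`, which is inferred from decoding the URL above rather than documented in the slides, and the helper name `browser_link` is hypothetical.

```python
from urllib.parse import quote, unquote

BASE = "http://corpus1.mpi.nl/ds/imdi_browser"

def browser_link(node_id):
    """Build a link that opens the browser at one node; '#' is percent-encoded."""
    return f"{BASE}?openpath={quote(node_id, safe='')}"

link = browser_link("MPI541199#")
print(link)
print(unquote("MPI541199%23"))
```

Without the encoding, a browser would treat everything after `#` as a fragment and never send it to the server.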
Publishing resources
Regional Archives Initiative
Cooperation of TLA/MPI-PL with other organizations interested in EL archiving. They use TLA LAT archiving software.
• Encourage local resource collecting & archiving
• Network of South American archives has been established and contacts with CLARA were made
Data Synchronization I
Synchronization of the physical structure
• Use “rsync” software
• Complete replication
• No special conditions possible
• Use for backup to computing centers
Synchronization of the logical structure
• Special software needed
• Per corpus copy to a selected target
• Owner can make special exceptions
• Use to synchronize between archives
[Diagram: Logical synchronization]
Data Synchronization II
[Diagram labels: LAMUS archive, API, HTTP server, COSIX]
COSIX: complex logic to compare corpus trees and determine
• what is new
• what to replace
• what to add
• what to delete
In a cooperation with CMU, COSIX is used to copy the CHILDES and Talkbank corpora into our archive. CMU generates IMDI records on the fly from their DBs.
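The kind of tree comparison attributed to COSIX can be sketched as a diff over two snapshots. This is not the COSIX algorithm, whose details the slides do not give; it assumes each corpus tree has been flattened to a `{path: checksum}` dict, and the function name `compare_trees`, the paths and the checksums are made up for the example.

```python
def compare_trees(source, target):
    """Compare two corpus snapshots given as {path: checksum} dicts."""
    to_add = sorted(p for p in source if p not in target)
    to_delete = sorted(p for p in target if p not in source)
    to_replace = sorted(p for p in source
                        if p in target and source[p] != target[p])
    return {"add": to_add, "replace": to_replace, "delete": to_delete}

# Hypothetical snapshots of a remote corpus (source) and the local copy (target).
source = {"corpus/s1.imdi": "a1", "corpus/s1.wav": "b2", "corpus/s2.imdi": "c3"}
target = {"corpus/s1.imdi": "a1", "corpus/s1.wav": "old", "corpus/s3.imdi": "d4"}

plan = compare_trees(source, target)
print(plan)
```

Under the versioning policy described earlier, a "delete" in such a plan would presumably hide a resource from the corpus tree rather than erase it.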
Technical Info
• Java web-applications running inside Tomcat servlet container
• PostgreSQL DBMS
• Platform: Linux
• Web-app frameworks: JSP, Applets, JSF, FLEX, Wicket, …
• Works with most web browsers (Internet Explorer, Firefox, Opera, Safari)
LAMUS & LAT Future
• TLA is part of CLARIN and is promoting CMDI, so …
• We are planning the transition from LAMUS-IMDI to LAMUS-CMDI
• We analyzed our set-up and still like the LAT foundations, e.g. file based, modularity, …
• But we will also alleviate some current problems and inconveniences:
  – Limited metadata editing in LAMUS
  – Insufficient provenance tracking of resources
  – Better handling of download/modify/upload cycle
  – Better integration with other (LAT) archives and infrastructures.
THANK YOU FOR YOUR ATTENTION
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230