IASA presentation
Presentation for the 40th anniversary IASA Congress in Athens
23-09-09
Hidden treasures lost forever?
Speech technology for the disclosure of Dutch audiovisual archives
Mies Langelaar and Willemijn Heeren
Contents
Introduction & Problem statement
Digitization/standardization in the E-repository
Speech technology for AV archives
System demonstration
Introduction
Hidden treasures of audiovisual archives lost forever?
• Backlog
• Data stored on deteriorating analogue carriers
• Digitized and digital-born data in non-standardized formats
Digitization and international standardization needed
• Often a global level of description only
• A few keywords per data unit (hour, tape, interview)
• Often no content description at all, because annotation is very (time-)costly
Reduce human effort through use of speech technology?
The approach
NWO CATCH project CHoral (2006-2010)
Goal: investigate and develop automatic annotation and search technology for spoken word archives
Cooperation between
1. speech technology researchers, University of Twente
2. archivists, Rotterdam Municipal Archives
The test case
‘Radio Rijnmond’ (RR) archives
• City of Rotterdam's regional radio channel
• Initial broadcast in 1983
• Broadcast recordings, amounting to over 60,000 hours
• Partially digitized, mostly analog
• Partially disclosed, mostly waiting for annotation
• Typical of A/V archives in cultural heritage (CH)
Searching the RR archives I
Minimal content descriptions per hour of data
Searching the RR archives II
?????
Main problems
The main problems with this example collection are:
1. A large backlog of undisclosed material: data are inaccessible to third parties
2. Fairly unspecific annotations, if available: restricted use for answering information needs
3. Audio is kept on analog data carriers or on CDs: interactive or online search cannot be supported
Towards solutions…
Digitization/standardization in the E-repository
Speech technology for AV archives
Digitization/standardization in the E-repository
AV Collection of Rotterdam Municipal Archives
• About 15,000 AV objects in the collection
• Most of this collection is on analog data carriers
• Part of the collection is on CDs, dating from the 1980s onwards
• No standardisation in storage formats
• No or minimal metadata and description of content available
Work in progress
Digitisation of the analogue audio material is done in-house
Standard formats that are used are:
• .WAV for uncompressed PCM audio
• 44.1 kHz, 16-bit stereo for audio CDs that are already digitised but need preservation
• 48 kHz, 24-bit stereo for old recordings
Digitally produced audio is accepted in its own format
Access to the objects is granted by audio CD or MP3
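The audio formats listed above could be checked automatically at ingest. The following is a minimal sketch using Python's standard `wave` module, assuming the files are plain PCM .wav files; the profile names `cd_transfer` and `old_recording` and both function names are hypothetical labels introduced here, not part of the archive's actual tooling.

```python
import wave

# Target formats as listed on the slide: (sample rate in Hz, bits per sample, channels).
# The profile names are hypothetical labels chosen for this sketch.
PRESERVATION_FORMATS = {
    "cd_transfer": (44_100, 16, 2),    # audio CDs that need preservation
    "old_recording": (48_000, 24, 2),  # digitised analogue recordings
}


def wav_format(path):
    """Return (sample rate, bit depth, channel count) of a PCM .wav file."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() returns bytes per sample, so multiply by 8 for bits.
        return (wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels())


def matches_profile(path, profile):
    """True if the file at `path` has exactly the format of the named profile."""
    return wav_format(path) == PRESERVATION_FORMATS[profile]
```

A call such as `matches_profile("tape_0001.wav", "old_recording")` (the file name is made up) would flag transfers that deviate from the agreed format before they enter the repository.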
Work in progress (2)
Digitisation of video and film is done partly in-house, partly by external partners
The standards that are used are:
• Minimum data rate of 50 Mb/s for conservation purposes
• Digital Betacam for VHS and U-matic tapes
• Digital video is accepted in its original recording format (miniDV, DVCam, XDCam etc.)
• Digibeta for 8 mm, 16 mm and 35 mm film (processed by external partners)
• DV25 for 16 mm film (processed in-house)
Digital Betacam is stored as 10-bit uncompressed
How to ensure long term sustainability
Set up a trusted digital repository, consisting of hardware, software, procedures, methods, knowledge and experience
Trusted Digital Repository
[Diagram: repository architecture. Components: Feeder System, Ingest Toolkit, Workflow Job Queue, Workflow Controller, Preservation Controller, Characterisation, Preservation Planning, Migration, Technical Registry, Active Preservation, Passive Preservation, File Storage, Storage Adaptor, Metadata Store, Data Management, Access, Reporting. Roles: User, Administrator, Archivist.]
How to ensure long term access to data?
• Adding a minimal set of metadata, necessary for management, preservation and access
• Using standard archival formats
• Making agreements with producers of AV material about acceptable formats
• Disclosure of content through Automatic Speech Recognition (ASR)
Speech technology for AV archives
Disclosure through speech technology
Disclosure: automatically generate a time-stamped content description
• Allows online retrieval of fragments of AV records
Method depends on:
• Available metadata
• Availability of context documents
When a transcript is available:
• Speech and transcript can be aligned, i.e. automatically couple what was said in the transcript to where it was said in the audio
When there is no transcript:
• Use automatic speech recognition to generate hypotheses of what was said where in the audio
• Word Error Rates under 40% allow automatically generated content descriptions to be used as a search index
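For reference, the word error rate is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the recogniser output into the reference transcript, divided by the number of reference words. The sketch below is one straightforward way to compute it; the function name is an assumption of this example, and text normalisation is reduced to lowercasing and whitespace splitting, which is simpler than what an evaluation toolkit would do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Dynamic programming over words: prev[j] is the edit distance between
    # the reference words processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "de chaotische en onveilige situatie voor de school"
hypothesis = "de grote show in onveilige situatie voor de school"
print(word_error_rate(reference, hypothesis))  # 0.375: 3 errors on 8 reference words
```

The sentence pair is a fragment of the recognition example shown later in this deck; at 37.5% it falls just under the 40% threshold mentioned above.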
CHoral
[Diagram: the AV archiving workflow (content production, indexing, end user) and the CHoral research topics ASR, IR and UI]
Research topics
• ASR: Automatic Indexing
• IR: Information Retrieval
• UI: User Interface Development
Research
Automatic indexing through speech technology:
• Development of robust automatic speech recognition and audio classification tools
Information Retrieval:
• Retrieval of spoken documents based on ASR output
• Bridging the semantic gap between user queries and spoken content
User Interface development:
• Support search and browsing in audio documents
• (Re)presentation of audio content
Alignment
Speech signal + typed transcript: "Landgenooten waar ik enkele"
Begin frame #   End frame #   Word
00000           54400         -silence-
54400           65280         Landgenooten
65280           69120         Waar
69120           73600         Ik
73600           79520         Enkele
…               …             …
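An alignment table like the one above can be turned into a simple lookup from a word to its position in the audio. The sketch below is illustrative only: the slide does not state what unit the frame numbers are in, so the value of `UNITS_PER_SECOND` (16,000, i.e. sample positions in 16 kHz audio) is an assumption, and `find_word` is a hypothetical helper name.

```python
# Alignment rows copied from the slide: (begin, end, word).
ALIGNMENT = [
    (0, 54400, "-silence-"),
    (54400, 65280, "Landgenooten"),
    (65280, 69120, "Waar"),
    (69120, 73600, "Ik"),
    (73600, 79520, "Enkele"),
]

# ASSUMPTION: the slide gives no unit for the frame numbers. Treating them as
# sample positions in 16 kHz audio is a guess made for this sketch only.
UNITS_PER_SECOND = 16_000


def find_word(alignment, query, units_per_second=UNITS_PER_SECOND):
    """Return (start, end) in seconds for every occurrence of `query` (case-insensitive)."""
    wanted = query.lower()
    return [
        (begin / units_per_second, end / units_per_second)
        for begin, end, word in alignment
        if word.lower() == wanted
    ]


print(find_word(ALIGNMENT, "landgenooten"))
```

With start and end times available per word, a player can jump straight to the fragment in which a query term was spoken.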
Automatic speech recognition
[Diagram: automatic speech recognition pipeline]
• Pre-processing
• Classification speech/nonspeech
• Segmentation of speakers
• Speech recognition with an acoustic model (50+ hours of audio), a language model (250-500 M words) and a pronunciation dictionary
• 2nd recognition with adapted models
• Word level index
Types of word level indexes
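Most probable words:
ASR: Er is een bekend beeld voor veel ouders de grote show in onveilige situatie voor de school
(English: There is a familiar sight for many parents the big show in unsafe situation in front of the school)
TXT: ‘t is een bekend beeld voor veel ouders. De chaotische en onveilige situatie voor de school
(English: It is a familiar sight for many parents. The chaotic and unsafe situation in front of the school)
Lattice structures:
“D’66 is z’n ene zetel kwijt”
(English: "D'66 has lost its one seat")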
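The difference between the two index types can be illustrated with a toy example. The sketch below flattens a lattice into a list of time slots with competing word candidates, which is a simplification of a real lattice graph. All times and scores are invented for illustration and are not output of the CHoral recogniser; the candidate words are loosely based on the recognition example above, and `build_index` is a hypothetical helper.

```python
from collections import defaultdict

# HYPOTHETICAL data: times (seconds) and scores are invented for illustration.
# Each slot holds competing candidates for one stretch of audio, best first.
SLOTS = [
    (12.0, [("de", 0.9)]),
    (12.2, [("grote", 0.5), ("chaotische", 0.4)]),
    (12.9, [("show", 0.6), ("en", 0.3)]),
    (13.1, [("onveilige", 0.8)]),
]


def build_index(slots, top_n=None):
    """Map each word to a list of (start time, score); keep top_n candidates per slot."""
    index = defaultdict(list)
    for start, candidates in slots:
        # A slice with top_n=None keeps every candidate in the slot.
        for word, score in candidates[:top_n]:
            index[word].append((start, score))
    return dict(index)


one_best_index = build_index(SLOTS, top_n=1)  # most probable words only
lattice_index = build_index(SLOTS)            # all competing candidates

print("chaotische" in one_best_index)   # False: the best guess was "grote"
print(lattice_index["chaotische"])      # [(12.2, 0.4)]
```

Indexing the alternatives lets a query still find a passage where the top hypothesis was wrong, at the cost of a larger index and more false matches; the score can be used to rank results.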
Discussion ASR
For successful automatic annotation:
• Audio should be digitally available, preferably on a server
• To optimize ASR models for high-quality output, part of the speech should be transcribed, or related documents should be available
? How to validate automatic indexes?
User interface development
Challenges
Understand users’ requirements and information needs
Support selection and browsing of spoken content
Representation of spoken content via ‘surrogate’
Cross-linking to related content within the same or from another collection
IPR issues
CHoral speech technology for GAR (Rotterdam Municipal Archives)
Alignment: Brandgrens interviews Rotterdam
Speech recognition: RR archives
Discussion
Development is ongoing in the workflow and daily practice at audiovisual archives, and in speech technology
Careful tuning of processes is needed for mutual benefit
Examples demonstrate envisioned benefits:
• Potential reduction of human effort for annotation of undisclosed materials
• Online access to fragments of spoken heritage
For more information, see http://hmi.ewi.utwente.nl/project/CHoral
Questions?