IASA presentation

23-09-09

Hidden treasures lost forever?

Speech technology for the disclosure of Dutch audiovisual archives

Mies Langelaar and Willemijn Heeren


Contents

Introduction & Problem statement

Digitization/standardization in the E-repository

Speech technology for AV archives

System demonstration


Introduction

Hidden treasures of audiovisual archives lost forever?

• Backlog
• Data stored on deteriorating analogue carriers
• Digitized and born-digital data in non-standardized formats

Digitization and international standardization needed

• Often only a global level of description
• A few keywords per data unit (hour, tape, interview)
• Often no content description at all, because annotation is very costly, particularly in time

Reduce human effort through the use of speech technology?


The approach

NWO CATCH project CHoral (2006-2010)

Goal: investigate and develop automatic annotation and search technology for spoken word archives

Cooperation between
1. speech technology researchers, University of Twente
2. archivists, Rotterdam Municipal Archives


The test case

‘Radio Rijnmond’ (RR) archives
• City of Rotterdam's regional radio channel
• Initial broadcast in 1983
• Broadcast recordings, amounting to over 60,000 hours
• Partially digitized, mostly analog
• Partially disclosed, mostly waiting for annotation
• Typical of A/V archives in cultural heritage (CH)


Searching the RR archives I

Minimal content descriptions per hour of data


Searching the RR archives II

?????


Main problems

The main problems with this example collection are:

1. A large backlog of undisclosed material: data are inaccessible to third parties
2. Fairly unspecific annotations, if available: restricted use for answering information needs
3. Audio is kept on analog data carriers or on CDs: interactive or online search cannot be supported


Towards solutions…




AV Collection of Rotterdam Municipal Archives

About 15,000 AV objects in the collection

Most of this collection is on analog data carriers

Part of the collection is on CDs, dating from the 1980s onwards

No standardisation in storage formats

No or minimal metadata and content description available


Work in progress

Digitisation of the analogue audio material is done in-house

Standard formats that are used are:

• .WAV for uncompressed PCM audio
• 44.1 kHz, 16-bit stereo for audio CDs that are already digitised but need preservation
• 48 kHz, 24-bit stereo for old recordings

Digitally produced audio is accepted in its own format

Access to the objects is granted by audio CD or MP3
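
The audio formats listed above could be checked automatically at ingest. The following is a minimal sketch using Python's standard `wave` module; the profile names (`cd_transfer`, `old_recording`) and both helper functions are hypothetical and not part of the archive's actual workflow.

```python
import io
import wave

# Profiles taken from the slide: (sample rate in Hz, bytes per sample, channels).
# The profile names are hypothetical labels chosen for this sketch.
PRESERVATION_PROFILES = {
    "cd_transfer": (44_100, 2, 2),    # 44.1 kHz, 16-bit, stereo
    "old_recording": (48_000, 3, 2),  # 48 kHz, 24-bit, stereo
}


def matching_profile(wav_file):
    """Return the name of the profile the WAV file matches, or None."""
    with wave.open(wav_file, "rb") as wav:
        actual = (wav.getframerate(), wav.getsampwidth(), wav.getnchannels())
    for name, expected in PRESERVATION_PROFILES.items():
        if actual == expected:
            return name
    return None


def make_silent_wav(rate, sample_width, channels, n_frames=10):
    """Build a short silent PCM WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (n_frames * sample_width * channels))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(matching_profile(make_silent_wav(48_000, 3, 2)))
    print(matching_profile(make_silent_wav(22_050, 2, 1)))
```

`matching_profile` also accepts a file path, so the same check could be run over a directory of digitised recordings.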


Work in progress (2)

Digitisation of video and film is done partly in-house, partly by external partners

The standards that are used are:

• Minimal data rate of 50 Mb/s for preservation purposes
• Digital Betacam for VHS and U-matic tapes
• Digital video is accepted in its original recording format (miniDV, DVCam, XDCam etc.)
• Digibeta for 8 mm, 16 mm and 35 mm film (processed by external partners)
• DV25 for 16 mm film (processed in-house)

Digital Betacam is stored as 10-bit uncompressed


How to ensure long term sustainability

Set up a trusted digital repository, consisting of hardware, software, procedures, methods, knowledge and experience


Trusted Digital Repository

[Diagram: architecture of the trusted digital repository. Labelled components: Feeder System, Ingest Toolkit, Workflow Controller, Workflow Job Queue, Characterisation, Preservation Planning, Migration, Technical Registry, Preservation Controller, Active Preservation, Passive Preservation, Storage Adaptor, File Storage, Metadata Store, Data Management, Access, Reporting. Roles: User, Administrator, Archivist.]


How to ensure long term access to data?

Adding a minimal set of metadata, necessary for management, preservation and access

Using standard archival formats

Making agreements with producers of AV material about acceptable formats

Disclosure of content through Automatic Speech Recognition (ASR)


Disclosure through speech technology

Disclosure: automatically generate a time-stamped content description
• Allows online retrieval of fragments of AV records

Method depends on:
• Available metadata
• Availability of context documents

When a transcript is available:
• Speech and transcript can be aligned, i.e. what was said in the transcript is automatically coupled to where it was said in the audio

When there is no transcript:
• Use automatic speech recognition to generate hypotheses of what was said where in the audio
• Word Error Rates under 40% allow automatically generated content descriptions to be used as a search index
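
To make the 40% threshold concrete, the sketch below computes Word Error Rate in the usual way: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR hypothesis, divided by the number of reference words. The function name and the English example sentences are hypothetical illustrations, not CHoral data, and lower-casing plus whitespace splitting is a simplifying assumption about text normalisation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing in hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion in six words.
    wer = word_error_rate(
        "the archive holds many radio recordings",
        "the archive hold many radio",
    )
    print(f"WER = {wer:.2f}, usable as search index: {wer < 0.40}")
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, so it is an error ratio rather than a percentage of words recognised.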


CHoral

[Diagram: position of CHoral in the AV archiving workflow. Labels: Content production, Indexing, ASR, IR, UI, End user.]

Research topics
• ASR: Automatic Indexing
• IR: Information Retrieval
• UI: User Interface Development


Research

Automatic indexing through speech technology:
• Development of robust automatic speech recognition and audio classification tools

Information Retrieval:
• Retrieval of spoken documents based on ASR output
• Bridging the semantic gap between user queries and spoken content

User Interface development:
• Support search and browsing in audio documents
• (Re)presentation of audio content


Alignment

Speech signal + typed transcript: Landgenooten waar ik enkele

Begin frame #   End frame #   Word
00000           54400         -silence-
54400           65280         Landgenooten
65280           69120         Waar
69120           73600         Ik
73600           79520         Enkele
…               …             …
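
An alignment like the one above can be turned into a small word-level index that maps each word to the times at which it was spoken. The sketch below uses the frame numbers from the slide. The frame rate is not stated in the source, so `FRAMES_PER_SECOND = 16_000` is purely an assumption; the class and function names are likewise hypothetical.

```python
from dataclasses import dataclass

# ASSUMPTION: the slide does not state the frame rate; 16,000 frames per
# second (one frame per audio sample at 16 kHz) is used only for illustration.
FRAMES_PER_SECOND = 16_000


@dataclass(frozen=True)
class AlignedWord:
    begin_frame: int
    end_frame: int
    word: str


# Rows copied from the alignment table on the slide.
ALIGNMENT = [
    AlignedWord(0, 54400, "-silence-"),
    AlignedWord(54400, 65280, "Landgenooten"),
    AlignedWord(65280, 69120, "Waar"),
    AlignedWord(69120, 73600, "Ik"),
    AlignedWord(73600, 79520, "Enkele"),
]


def build_index(alignment):
    """Map each lower-cased word to a list of (start, end) times in seconds."""
    index = {}
    for entry in alignment:
        if entry.word == "-silence-":
            continue  # silence is not searchable content
        start = entry.begin_frame / FRAMES_PER_SECOND
        end = entry.end_frame / FRAMES_PER_SECOND
        index.setdefault(entry.word.lower(), []).append((start, end))
    return index


if __name__ == "__main__":
    index = build_index(ALIGNMENT)
    for start, end in index.get("landgenooten", []):
        print(f"'landgenooten' spoken from {start:.2f}s to {end:.2f}s")
```

With such an index, a search hit can point straight at the fragment in the recording rather than at the whole tape or hour of audio.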


Automatic speech recognition

[Diagram: automatic speech recognition pipeline]

1. Pre-processing: classification of speech/non-speech, segmentation of speakers
2. Speech recognition, using an acoustic model (50+ hours of audio), a language model (250-500 M words) and a pronunciation dictionary
3. 2nd recognition with adapted models
4. Output: word level index


Types of word level indexes

Most probable words:

Lattice structures:

ASR: Er is een bekend beeld voor veel ouders de grote show in onveilige situatie voor de school

TXT: ‘t is een bekend beeld voor veel ouders. De chaotische en onveilige situatie voor de school

“D’66 is z’n ene zetel kwijt”


Discussion ASR

For successful automatic annotation:
• Audio should be digitally available, preferably on a server
• To optimize ASR models for high-quality output, part of the speech should be transcribed, or related documents should be available

Open question: how to validate automatic indexes?


User interface development

Challenges

Understand users’ requirements and information needs

Support selection and browsing of spoken content

Representation of spoken content via a ‘surrogate’

Cross-linking to related content within the same or from another collection

IPR issues


CHoral speech technology for GAR (Gemeentearchief Rotterdam, the Rotterdam Municipal Archives)

Alignment: Brandgrens interviews Rotterdam

Speech recognition: RR archives


Discussion

Development is ongoing both in the workflow and daily practice at audiovisual archives and in speech technology

Careful tuning of processes is needed for mutual benefit

Examples demonstrate envisioned benefits:
• Potential reduction of human effort for annotation of undisclosed materials
• Online access to fragments of spoken heritage


For more information, see http://hmi.ewi.utwente.nl/project/CHoral

Questions?
