© 2010 IBM Corporation

Synthesized Audio Descriptions

Hironobu Takagi, Chieko Asakawa
IBM Research – Tokyo


National Women's Education Center - July 6th, 2010.

2

IBM History of Accessibility

1960s  Talking Typewriter
1975  1403 Braille Printer
1984  Talking 3270 Terminal
1988  ScreenReader/DOS
1990  VoiceType™
1994  Screen Magnifier™/2
1997  Home Page Reader
1998  ViaVoice®
1999  Home Page Reader (Japanese, Italian, French, German, Spanish, US English, UK English)
2000  Accessibility Center
2004  aDesigner
2007  aiBrowser for Multimedia
2007  Eclipse Accessibility Tools Framework
2008  Social Accessibility
2009  ARIA (Accessible Rich Internet Application)


IBM Research - Tokyo

3

Status of Audio Descriptions in Japan

Movies (2008, from NPO Media Access Support Center)
– Ratio of Japanese movies with captions: 12.0%
– Ratio of Japanese movies with audio descriptions: 0.9%

TV (2008)
– Ratio of TV programs with captions (*1): Public TV 49.4%, Private TV 42.3%
– Ratio of TV programs with audio descriptions (*2): Public TV 5.6%, Private TV 0.4%

*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT: National Institute of Information and Communications Technology



4

Captions and Audio Descriptions for TV Programs

[Figure: line chart for 2001 to 2008, vertical axis 0% to 60%. Series: Captions - Public; Captions - Private; Audio descriptions - Public; Audio descriptions - Public (Education); Audio descriptions - Private. Based on data from MIC and NICT.]



5

Problems: Workload and Cost

Recording an audio description calls for a skilled narrator and a good recording environment.

Writing an audio description script requires special expertise to describe the scenes between dialogues and scene changes.

[Figure: workload comparison. Captions: transcribing. Audio descriptions: transcribing and recording. Vertical axis: workload.]



6

History of Text-to-speech Engines

[Figure: timeline from 1980 to 2010]

1983  DECtalk
1985  IBM
1996  ProTalker (IBM)
2004  Super Voice (IBM)
2008  Emotional TTS (IBM)

Audio samples: Formant ('85), Japanese; waveform superposition, male voice (.wav)



7

Possible Reduction of Workload

[Figure: workload (transcribing and recording) for current audio descriptions compared with synthesized audio descriptions. Labels on the chart: "Reduction by Synthesis" and "Reduction by Tool support".]



8

Acceptance Ratio (United States)

Method: Online survey
Participants: 236 (39 low-vision, 197 blind)
Genre: Education and documentary
Voice quality: Human and TTS (Heather)

Consistently, 70% to 80% of participants answered better than neutral.

[Figure: stacked bar chart for Set 1 to Set 4, 0% to 100%. Categories: Uncomfortable, Slightly Uncomfortable, Neutral, Acceptable, Comfortable.]

Slide footer (translated from Japanese): Research and development of standard specifications for audio guide and caption descriptions for people with visual and hearing impairments



9

Video Accessibility Project: Goals

Prove feasibility of text-based audio descriptions via user studies.
– Work with professional teams for audio descriptions
  • Japan: IBM with CAP and content from NHK
  • U.S.: WGBH

Create an open source platform for audio descriptions and captions
– Authoring tools and players
– Captions and text-based audio descriptions
– Based on Eclipse.org Accessibility Tools Framework (ACTF)

Contribute to standardization of Internet media accessibility
– Focus on “missing markups” in the existing standards.
– Maintain neutrality for existing standards.
– HTML5 is the primary target.

Supported by the Japanese government agency NICT (National Institute of Information and Communications Technology)


Thank you!



11

ACTF Script Editor

Authoring tool, specialized for audio descriptions.

Flexible to import and export various formats.

Planned for release as open source in March.



Audio Guides in Museums and on the Stage

Museums: Audio guides are widely used in museums and art museums. (Their main purpose is not to support visually impaired people but to help every visitor learn about the exhibits.)
– Examples of audio guide providers:
  • National Museum of Nature and Science, Tokyo
  • The National Museum of Western Art
  • Hiroshima Museum of Art
  • Osaka Museum of Natural History
  • Tokyo Museum of Fire Department
  • Shimane Museum of Ancient Izumo
– Almost every museum in Japan provides audio guides.
– Audio guide equipment is generally purpose-built, with voice prerecorded by the manufacturer. A newer approach uses the NINTENDO DS, with the content downloaded to the device at the museum.

The stage: Mostly small drama groups.
– Examples of audio guide providers:
  • The drama group "Bakkari-Bakkari" provides an audio guide once per performance period.
  • A drama group in the city of Kawasaki, Kanagawa Pref.
  • The drama group "DORA"
– As for captions, SHIKI THEATRE COMPANY, for example, provides them. Very few large-scale theatre productions provide audio guides.



Laws and Regulations

1993 Act on Advancement of Facilitation Program for Disabled Persons' Use of Telecommunications and Broadcasting Services, with a View to Enhance Convenience of Disabled Persons

1997 MIC defined a goal to “provide captions to all TV programs by 2007”

1998 BROADCAST LAW
– Article 3-2 (4)
– Any broadcaster shall, in compiling the broadcast programs for domestic broadcasting, provide as many broadcasting programs as possible which provide voices and other sounds to explain about transient images of fixed or moving objects for blind persons, and providing characters or patterns to explain about voices and other sounds for deaf persons.

2007 Signed the “Convention on the Rights of Persons with Disabilities”

2010 New JIS (Japanese Industrial Standard) for Web Accessibility
– Technical guidelines are fully harmonized with WCAG 2.0

13



14

ACTF aiBrowser

1. Direct audio control
– Allow users to increase or lower the volume, stop or play, and control audio speed by using simple keyboard commands.

2. User interface simplification
– Structurally simplify interfaces by converting dynamic visual interfaces into static text-based interfaces
– Dynamically add alternative texts to images and buttons

3. Audio descriptions with text
– Infrastructure to provide video descriptions at low cost



15

Status of Audio Descriptions in Japan

Internet (team investigation)
– Ratio of video content with captions in the Open Courseware project: 0.2% (2 among 1,474)
– Ratio of video content with audio descriptions: 0.0%. Popular video sharing services and educational online videos were examined, but no videos with audio descriptions were found (except for videos prepared as examples of audio descriptions).

Movies (2008, from NPO Media Access Support Center)
– Ratio of Japanese movies with captions: 12.0%
– Ratio of Japanese movies with audio descriptions: 0.9%

TV (2008)
– Ratio of TV programs with captions (*1): Public TV 49.4%, Private TV 42.3%
– Ratio of TV programs with audio descriptions (*2): Public TV 5.6%, Private TV 0.4%

*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT: National Institute of Information and Communications Technology



16

Analysis of Standards and Possible Focus

Voice styles and emotional expressions

Description (textual information)

SRT

W3C SSML, etc.

Unique for audio descriptions (extended, audio control, block, etc.)

Index structure for video (scenes and chapters, etc.)

Association with video contents, multilingual, etc.

Mozilla <itext>, etc.

W3C TT DFXP

Each video format has its own specifications. (DVD, MPEG, etc.)

FOCUS AREA!

W3C Emotion ML

Layer of Markups (vocabulary lists) for text-based audio descriptions

Addressing (timing)

W3C SMIL

Flexible addressing

Personalization



17

2nd study: Level of Description

Using the extended description and listening twice both improved the comprehension.

[Figure: Rate of correct answers for each level of description (Normal, Extended) heard once or twice. Horizontal axis: Number of Listening (1, 2). Vertical axis: Rate of Correct Answers (0% to 100%). The chart carries a "30%" annotation.]



18

Difficulties in Online Videos

Historical Videos

News Entertainment E-Learning

Consumer-Generated Videos

Now is the time to create a new technical framework for audio descriptions!



19

Prior Projects

e-Inclusion project in Canada supported by Canadian Heritage.

– CRIM (Centre de recherche informatique de Montréal)

– Four-year project completed this year

– Authoring tool and playback tool

LiveDescribe by Ryerson University

– Community-based authoring system

– Authoring tool and playback tool

NHK Research

– Prototyped and tested TTS-based audio descriptions

aiBrowser

– Developed by IBM Research and contributed to Eclipse.org

– Audio descriptions with Flash, QuickTime, and Windows Media Player

Other trials

– HTML5 + Live Region demo (Firefox team)

– WebShake

• Japanese online caption provider prototyped with TTS-based audio descriptions.

– ACAV, etc.



20

Distribution Flexibility

Model                            Voice quality   Authoring cost   System cost
Human voice (current model)      High            High             High
Pre-recorded synthesized audio   Low*            Low              High**
Server-side synthesizer          Low*            Low              High
Client-side synthesizer          Lowest          Low              Low***

* Server-side synthesis is better than client-side synthesis.
** The systems for human voices can be reused.
*** Client-side software support is required.

[Diagram: for each model, whether the description is carried as text or as audio at each stage, and where the human narrator or the synthesizer sits.]



21

Experimental Results (Japan)

1st study (Sep 2009)
– 3 blind or visually impaired participants
– Face-to-face, one-to-one sessions
– Focused on the voice quality, level of description, and speech speed

2nd study (Feb 2010)
– 24 blind or visually impaired participants
– Face-to-face, small group sessions
– Consisted of 4 sub-studies for long-term listening, expressive voices, describer expertise, and level of description




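Status of Captions and Audio Descriptions in Japan

Movies (Japanese films released in 2008; from NPO Media Access Support Center materials)
– Ratio of Japanese films released in 2008 that were provided with captions: 12.0%
– Ratio of Japanese films released in 2008 that were provided with audio descriptions: 0.9%

Broadcasting
– Ratio of captioned broadcast time to total broadcast time in fiscal 2008 (*1): NHK General 49.4%, Tokyo commercial broadcasters 42.3%
– Ratio of described broadcasts on the terrestrial channels of the Tokyo key stations in fiscal 2008 (*2): NHK General 5.6%, Tokyo commercial broadcasters 0.4%

Internet
– Caption ratio in OpenCourseWare (educational content): 0.2% (2 of 1,417 videos)
– Independent survey within this project: 0.0%. A sampling survey of major video distribution sites and educational content found no videos with audio descriptions.

*1: From the Ministry of Internal Affairs and Communications press release "Results of caption broadcasting, etc. in fiscal 2008"
*2: From materials by NICT (National Institute of Information and Communications Technology)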



23

1st study: Results

The descriptions greatly improved the user experience regardless of the voice quality.

The participants’ comments indicated that modern TTS was almost comparable to a human voice, though the human voice was still preferred.

[Figure: Effectiveness scores for "drama" videos. Bars for Human, Traditional TTS, and Modern TTS, without AD and with AD. Vertical axis: 0 to 100.]

[Figure: Effectiveness scores for "cooking" videos. Bars for Human, Traditional TTS, and Modern TTS, without AD and with AD. Vertical axis: 0 to 100.]



24

2nd study: Sub-studies

1. Long-term listening
– Assess if TTS-based descriptions are acceptable for listening to full-length programs
– Target videos: cartoon (comedy), drama (tragedy), documentary

2. Expressive voices
– Determine if the expressive TTS improves the user experience
– Target videos: cartoon (comedy), drama (tragedy)

3. Describer expertise
– Assess how the describer expertise affects understanding
– Target video: public service announcement (warning about fraud)

4. Level of description
– Assess how the level of description and repetitive listening affect understanding
– Target video: instructional program (how to fold and store clothing)



25

2nd study



26

2nd study: Long-term Listening

TTS-based descriptions were generally acceptable for full-length programs.

From comments, the documentary film received the highest evaluation, but that was not clear from the effectiveness scores.

[Figure: Effectiveness scores for each video category. Frequency (0 to 20) of each score (1 to 5) for Cartoon (Comedy), Drama (Tragedy), and Documentary.]



27

2nd study: Describer Expertise

Novice (Normal) was not preferred (score: 3.0).

Novice (Extended) was comparable (score: 4.3) to expert descriptions (score: 4.3 for normal, 4.6 for extended).

[Figure: Effectiveness scores for each describer expertise and level of description. Frequency (0 to 12) of each score (1 to 5) for Expert (Normal), Expert (Extended), Novice (Normal), and Novice (Extended).]



28

Typical Client-side TTS Setting

[Diagram: components are Website, Online Video, Metadata Repository, Audio Description Script, Script Editor, and Video Player. Arrows are labeled Refer, Post, Browse, and Fetch.]
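
The following is a minimal Python sketch of one way the client-side flow could work. It assumes that the script editor posts an audio description script to the metadata repository and that the video player fetches it and hands due descriptions to a local TTS engine. The slide does not spell out these details, so the in-memory `REPOSITORY`, the `Description` fields, and all function names are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class Description:
    """One entry of an audio description script (hypothetical structure)."""
    start: float            # playback time in seconds
    text: str
    extended: bool = False  # assumption: pause the video while this is spoken


# Hypothetical in-memory stand-in for the metadata repository, keyed by video URL.
REPOSITORY: dict[str, list[Description]] = {}


def post_script(video_url: str, script: list[Description]) -> None:
    """Script editor side: store the script, sorted by start time."""
    REPOSITORY[video_url] = sorted(script, key=lambda d: d.start)


def fetch_script(video_url: str) -> list[Description]:
    """Video player side: fetch the script for the video being played."""
    return REPOSITORY.get(video_url, [])


def due_descriptions(
    script: list[Description], previous: float, current: float
) -> list[Description]:
    """Return descriptions whose start time lies in [previous, current).

    The player would call this on every playback tick and pass each result
    to its local TTS engine. The script must be sorted by start time.
    """
    starts = [d.start for d in script]
    return script[bisect_left(starts, previous):bisect_left(starts, current)]
```

Keeping the script as text until playback is what lets the client pick its own voice and speed, at the price of requiring client-side software support.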




W3C Web Content Accessibility Guidelines 2.0 (Recommendation, December 2008)

– 1.2.5 Audio Description (Prerecorded) (Level AA)

– 1.2.7 Extended Audio Description (Prerecorded) (Level AAA)

Japan: Revised Copyright Act (enacted June 2009, in force January 1, 2010)

Japan: JIS X 8341-3:2010 (publication expected around June 2010)
