© 2010 IBM Corporation
Synthesized Audio Descriptions
Hironobu Takagi, Chieko Asakawa
IBM Research – Tokyo
National Women's Education Center - July 6th, 2010.
IBM History of Accessibility
1960s Talking Typewriter
1975 1403 Braille Printer
1984 Talking 3270 Terminal
1988 ScreenReader/DOS
1990 VoiceType™
1994 Screen Magnifier™/2
1997 Home Page Reader
1998 ViaVoice®
1999 Home Page Reader (Japanese, Italian, French, German, Spanish, US English, UK English)
2000 Accessibility Center
2004 aDesigner
2007 aiBrowser for Multimedia
2007 Eclipse Accessibility Tools Framework
2008 Social Accessibility
2009 ARIA (Accessible Rich Internet Applications)
Status of Audio Descriptions in Japan
TV (2008)
– Ratio of TV programs with captions (*1): public TV 49.4%, private TV 42.3%
– Ratio of TV programs with audio descriptions (*2): public TV 5.6%, private TV 0.4%
Movies (2008, from NPO Media Access Support Center)
– Ratio of Japanese movies with captions: 12.0%
– Ratio of Japanese movies with audio descriptions: 0.9%
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT (National Institute of Information and Communications Technology)
Captions and Audio Descriptions for TV Programs
[Chart: ratio of TV programs with captions and audio descriptions by year, 2001 to 2008, on a 0% to 60% scale. Series: captions (public), captions (private), audio descriptions (public), audio descriptions (public, education), audio descriptions (private). Based on data from MIC and NICT.]
Problems: Workload and Cost
Recording an audio description calls for a skilled narrator and a good recording environment.
Writing an audio description script requires special expertise to describe the scenes between dialogues and scene changes.
[Diagram: workload comparison. Audio descriptions require transcribing and recording; captions require transcribing.]
History of Text-to-speech Engines
[Timeline, 1980 to 2010]
1983 DecTalk
1985 IBM
1996 ProTalker (IBM)
2004 Super Voice (IBM)
2008 Emotional TTS (IBM)
Audio samples: formant synthesis ('85), Japanese; waveform overlap-add, male voice (.wav); 2004 Super Voice (IBM)
Possible Reduction of Workload
[Diagram: workload (transcribing and recording) of current audio descriptions compared with synthesized audio descriptions. Labels: reduction by synthesis; reduction by tool support.]
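Research and Development of Standard Specifications for Audio Descriptions and Captions for People with Visual or Hearing Impairments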
Acceptance Ratio (United States)
Method: online survey
Participants: 236 (39 low-vision, 197 blind)
Genre: education and documentary
Voice quality: human and TTS (Heather)
Consistently, 70% to 80% of participants gave a rating better than neutral.
[Chart: distribution of responses (0% to 100%) for Set 1 to Set 4. Response scale: Uncomfortable, Slightly Uncomfortable, Neutral, Acceptable, Comfortable.]
Video Accessibility Project: Goals
Prove feasibility of text-based audio descriptions via user studies.
– Work with professional teams for audio descriptions
– Japan: IBM with CAP and content from NHK
– U.S.: WGBH
Create an open source platform for audio descriptions and captions
– Authoring tools and players
– Captions and text-based audio descriptions
– Based on Eclipse.org Accessibility Tools Framework (ACTF)
Contribute to standardization of Internet media accessibility
– Focus on “missing markups” in the existing standards.
– Maintain neutrality for existing standards.
– HTML5 is the primary target.
Supported by the Japanese government agency NICT (National Institute of Information and Communications Technology)
ACTF Script Editor
Authoring tool, specialized for audio descriptions.
Flexible to import and export various formats.
Planned for release as open source in March.
Audio Guides for Museums and the Stage
Museums: Audio guides are widely used in museums and art museums. (Their main purpose is not to support visually impaired people but to help every visitor learn about the exhibits.)
– Examples of audio guide providers:
• National Museum of Nature and Science, Tokyo
• The National Museum of Western Art
• Hiroshima Museum of Art
• Osaka Museum of Natural History
• Tokyo Museum of Fire Department
• Shimane Museum of Ancient Izumo
– Almost every museum in Japan provides an audio guide.
– Generally, audio guide equipment is specially designed and supplied by the manufacturer with prerecorded voice. A newer approach uses the NINTENDO DS, with the content downloaded to it at the museum.
The stage: Audio guides are provided mainly by small drama groups.
– Examples of audio guide providers:
• Drama group "Bakkari-Bakkari" provides an audio guide once per performance period.
• A drama group in the city of Kawasaki, Kanagawa Pref.
• Drama group "DORA"
– As for captions, SHIKI THEATRE COMPANY, for example, provides them. Very few large-scale theatre productions provide audio guides.
Laws and Regulations
1993 Act on Advancement of Facilitation Program for Disabled Persons' Use of Telecommunications and Broadcasting Services, with a View to Enhance Convenience of Disabled Persons (1993)
1997 MIC defined a goal to “provide captions to all TV programs by 2007”
1998 BROADCAST LAW
– Article 3-2 (4)
– Any broadcaster shall, in compiling the broadcast programs for domestic broadcasting, provide as many broadcasting programs as possible which provide voices and other sounds to explain about transient images of fixed or moving objects for blind persons, and providing characters or patterns to explain about voices and other sounds for deaf persons.
2007 Signed the “Convention on the Rights of Persons with Disabilities”
2010 New JIS (Japanese Industrial Standard) for Web Accessibility
– Technical guidelines are fully harmonized with WCAG 2.0
ACTF aiBrowser
1. Direct audio control
– Allow users to increase or lower the volume, stop or play, and control audio speed by using simple keyboard commands.
2. User interface simplification
– Structurally simplify interfaces by converting dynamic visual interfaces into static text-based interfaces
– Dynamically add alternative texts to images and buttons
3. Audio descriptions with text
– Infrastructure to provide video descriptions at low cost
Status of Audio Descriptions in Japan
TV (2008)
– Ratio of TV programs with captions (*1): public TV 49.4%, private TV 42.3%
– Ratio of TV programs with audio descriptions (*2): public TV 5.6%, private TV 0.4%
Movies (2008, from NPO Media Access Support Center)
– Ratio of Japanese movies with captions: 12.0%
– Ratio of Japanese movies with audio descriptions: 0.9%
Internet (team investigation)
– Ratio of video content with captions in the Open Courseware project: 0.2% (2 among 1,474)
– Ratio of videos with audio descriptions: 0.0%. The team examined popular video sharing services and educational online videos but found no videos with audio descriptions (except for videos prepared as examples of audio descriptions).
*1: Ministry of Internal Affairs and Communications (2008)
*2: NICT (National Institute of Information and Communications Technology)
Analysis of Standards and Possible Focus
Layer of Markups (vocabulary lists) for text-based audio descriptions
– Description (textual information)
– Voice styles and emotional expressions
– Unique for audio descriptions (extended, audio control, block, etc.)
– Addressing (timing)
– Flexible addressing
– Personalization
– Index structure for video (scenes and chapters, etc.)
– Association with video contents, multilingual, etc.
Existing standards shown: SRT; W3C SSML, etc.; W3C Emotion ML; W3C SMIL; W3C TT DFXP; Mozilla <itext>, etc.
Each video format has its own specifications. (DVD, MPEG, etc.)
FOCUS AREA!
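As a rough illustration of the description (textual information) and addressing (timing) layers, the following Python sketch parses SRT-style cues into timed description texts. It is only a sketch under stated assumptions: the names `Cue`, `parse_time`, and `parse_srt` and the sample script are hypothetical and are not taken from the ACTF tools or from any of the standards named on the slide.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    """One timed description (hypothetical structure for illustration)."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # description text to be spoken by a TTS engine


# SRT timestamps look like 00:01:02,500 (hours:minutes:seconds,milliseconds).
_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def parse_time(stamp: str) -> float:
    """Convert an SRT timestamp into seconds."""
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(source: str) -> list[Cue]:
    """Parse SRT-style text into a list of cues, skipping malformed blocks."""
    cues = []
    for block in re.split(r"\n\s*\n", source.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # needs an index line, a timing line, and some text
        start, end = lines[1].split("-->")
        text = " ".join(line.strip() for line in lines[2:])
        cues.append(Cue(parse_time(start), parse_time(end), text))
    return cues


# Made-up sample script, for illustration only.
SAMPLE = """1
00:00:05,000 --> 00:00:08,500
A woman folds a shirt on the table.

2
00:00:12,000 --> 00:00:14,000
She places it in a drawer.
"""

cues = parse_srt(SAMPLE)
for cue in cues:
    print(f"{cue.start:6.1f}s to {cue.end:6.1f}s: {cue.text}")
```

A plain timed-text format like this covers only text and timing; voice style, extended descriptions, and audio control would need additional markup, which is the gap the slide points at.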
2nd study: Level of Description
Using the extended description and listening twice both improved comprehension.
[Chart: rate of correct answers for each level of description (normal, extended) heard once or twice. X axis: number of listenings (1, 2); y axis: rate of correct answers (0% to 100%). One chart annotation reads 30%.]
Difficulties in Online Videos
Historical Videos
News Entertainment E-Learning
Consumer-Generated Videos
Now is the time to create a new technical framework for audio descriptions!
Prior Projects
e-Inclusion project in Canada supported by Canadian Heritage.
– CRIM (Centre de recherche informatique de Montréal)
– Four-year project completed this year
– Authoring tool and playback tool
LiveDescribe by Ryerson University
– Community-based authoring system
– Authoring tool and playback tool
NHK Research
– Prototyped and tested TTS-based audio descriptions
aiBrowser
– Developed by IBM Research and contributed to Eclipse.org
– Audio descriptions with Flash, QuickTime, and Windows Media Player
Other trials
– HTML5 + Live Region demo (Firefox team)
– WebShake
• Japanese online caption provider prototyped with TTS-based audio descriptions.
– ACAV, etc.
Distribution Flexibility
Model                            Voice source     Voice quality   Authoring cost   System cost
Human voice (current model)      Human narrator   High            High             High
Pre-recorded synthesized audio   Synthesizer      Low*            Low              High**
Server-side synthesizer          Synthesizer      Low*            Low              High
Client-side synthesizer          Synthesizer      Lowest          Low              Low***
* Server-side synthesis is better than client-side synthesis.
** The systems for human voices can be reused.
*** Client-side software support is required.
[Diagram: for each model, the slide also marks whether text or audio is passed between stages.]
Experimental Results (Japan)
1st study (Sep 2009)
– 3 blind or visually impaired participants
– Face-to-face, one-to-one sessions
– Focused on the voice quality, level of description, and speech speed
2nd study (Feb 2010)
– 24 blind or visually impaired participants
– Face-to-face, small group sessions
– Consisted of 4 sub-studies for long-term listening, expressive voices, describer expertise, and level of description
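Status of Captions and Audio Descriptions in Japan
Movies (Japanese movies released in 2008; from NPO Media Access Support Center materials)
– Ratio of Japanese movies released in 2008 that provided captions: 12.0%
– Ratio of Japanese movies released in 2008 that provided audio descriptions: 0.9%
Broadcasting
– Ratio of caption broadcasting time to total broadcasting time in fiscal 2008 (*1): NHK General 49.4%, Tokyo commercial broadcasters 42.3%
– Ratio of audio-described broadcasts on the terrestrial channels of the Tokyo key stations in fiscal 2008 (*2): NHK General 5.6%, Tokyo commercial broadcasters 0.4%
Internet (independent survey within this project)
– Ratio of captioned videos in OpenCourseWare (educational content): 0.2% (2 of 1417 videos)
– Ratio of videos with audio descriptions: 0.0%. A sampling survey of major video distribution sites and educational content found no videos with audio descriptions.
*1: From the Ministry of Internal Affairs and Communications press release "Results of Caption Broadcasting, etc., in Fiscal 2008"
*2: From NICT (National Institute of Information and Communications Technology) materials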
1st study: Results
The descriptions greatly improved the user experience regardless of the voice quality.
The participants’ comments indicated that modern TTS was almost comparable to a human voice, though the human voice was still preferred.
[Charts: effectiveness scores (0 to 100) for "drama" videos and for "cooking" videos, without AD and with AD, for human, traditional TTS, and modern TTS voices.]
2nd study: Sub-studies
1. Long-term listening
– Assess if TTS-based descriptions are acceptable for listening to full-length programs
– Target videos: cartoon (comedy), drama (tragedy), documentary
2. Expressive voices
– Determine if the expressive TTS improves the user experience
– Target videos: cartoon (comedy), drama (tragedy)
3. Describer expertise
– Assess how the describer expertise affects understanding
– Target video: public service announcement (warning about fraud)
4. Level of description
– Assess how the level of description and repetitive listening affect understanding
– Target video: instructional program (how to fold and store clothing)
2nd study: Long-term Listening
TTS-based descriptions were generally acceptable for full-length programs.
From the comments, the documentary film received the highest evaluation, but that was not clear from the effectiveness scores.
[Chart: effectiveness scores for each video category (cartoon (comedy), drama (tragedy), documentary). X axis: score (1 to 5); y axis: frequency (0 to 20).]
2nd study: Describer Expertise
Novice (Normal) was not preferred (score: 3.0).
Novice (Extended) was comparable (score: 4.3) to expert descriptions (score: 4.3 for normal, 4.6 for extended).
[Chart: effectiveness scores for each describer expertise and level of description (Expert (Normal), Expert (Extended), Novice (Normal), Novice (Extended)). X axis: score (1 to 5); y axis: frequency (0 to 12).]
Typical Client-side TTS Setting
[Diagram: components are a website with an online video, a metadata repository with an audio description script, a script editor, and a video player. Arrow labels: Refer, Post, Browse, Fetch.]
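One way to read this setting is sketched below in Python: a script editor posts a description script that refers to an online video, and a video player fetches that script and hands each description to a client-side synthesizer when playback reaches its start time. Everything here is an assumption for illustration: `MetadataRepository`, `DescribingPlayer`, the `(start_seconds, text)` script format, and the example URL are hypothetical, and the `speak` callback stands in for a real TTS engine.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MetadataRepository:
    """Hypothetical in-memory stand-in for the metadata repository."""
    scripts: dict = field(default_factory=dict)

    def post(self, video_url: str, script: list) -> None:
        # The script editor posts a script that refers to the video by URL.
        # Each entry is a (start_seconds, text) tuple; keep them in time order.
        self.scripts[video_url] = sorted(script)

    def fetch(self, video_url: str) -> list:
        # The video player fetches the script for the video being browsed.
        return list(self.scripts.get(video_url, []))


class DescribingPlayer:
    """Hypothetical player that speaks descriptions with client-side TTS."""

    def __init__(self, repo: MetadataRepository, speak: Callable[[str], None]):
        self.repo = repo
        self.speak = speak  # stand-in for a client-side synthesizer
        self.pending = []

    def open(self, video_url: str) -> None:
        self.pending = self.repo.fetch(video_url)

    def on_time_update(self, position: float) -> None:
        # Speak every description whose start time has been reached.
        while self.pending and self.pending[0][0] <= position:
            _, text = self.pending.pop(0)
            self.speak(text)


# Usage with made-up data; spoken texts are collected instead of synthesized.
repo = MetadataRepository()
repo.post("http://example.com/video", [
    (12.0, "She places it in a drawer."),
    (5.0, "A woman folds a shirt on the table."),
])

spoken = []
player = DescribingPlayer(repo, spoken.append)
player.open("http://example.com/video")
player.on_time_update(6.0)   # reaches the first description
player.on_time_update(13.0)  # reaches the second description
print(spoken)
```

Keeping the script separate from the video is what lets the description be authored after the video is published, without re-encoding or re-hosting the video itself.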
W3C Web Content Accessibility Guidelines 2.0 (Recommendation, December 2008)
– 1.2.5 Audio Description (Prerecorded) (Level AA)
– 1.2.7 Extended Audio Description (Prerecorded) (Level AAA)
Japan: Revised Copyright Act (enacted June 2009, in force January 1, 2010)
Japan: JIS X 8341-3:2010 (publication expected around June 2010)