CRL Speech Team

© 2003 IBM Corporation

Chinese Romanization for Chinese Voice Browsing

IBM China Research Lab

Index

• Motivations & Proposals

• IPA vs. Chinese Romanization

• Chinese Romanization Standards

• Implementations of Chinese Romanization in SSML

• Extensions for other languages

Motivations & Proposals

IBM Speech Synthesis System

• The IBM speech synthesis system supports about 20 languages.

• For Asian languages, we cover:
  – Mandarin
  – Cantonese
  – Korean
  – Japanese
  – Thai

Pronunciation annotations are important for Chinese

• A Chinese character represents a meaning more than a pronunciation.

• Homographs (one character with several readings) are very common among Chinese characters.

• So it is very helpful to give the pronunciation explicitly.

Proposals

• We propose using Chinese Romanization to annotate Chinese pronunciation in the “phoneme” element.

• We also propose that SSML allow diverse, predefined, and widely used pronunciation annotation standards for different languages.

• Thus SSML can be more easily accepted and used around the world.

• Note: Chinese Romanization = Hanyu Pinyin in this PPT.

IPA vs. Chinese Romanization

Comparison Rule: Goal of SSML

• The goal of SSML is to “provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications”.

• To reach this goal, SSML needs more and more users, such as ordinary Web application developers, who can learn and use it easily.

• So SSML should be defined on the basis of ordinary people’s knowledge and skills rather than professional linguists’ knowledge.
  – Otherwise, it will take a long time for SSML to be widely accepted and used around the world.

IPA is not a good fit for Chinese

• IPA tries to collect an exhaustive set of pronunciations for all kinds of languages.
  – It has become very complicated and difficult to input.

• A well-educated Chinese adult cannot annotate Chinese pronunciation in IPA without special training.
  – IPA is not very popular in China.

• Special linguistic phenomena in Chinese, such as tone and retroflexion, cannot be conveniently described in IPA.

Chinese Romanization is a good fit for Chinese

• Chinese Romanization is designed specifically for Chinese rather than for all languages.
  – Adding ‘r’ at the end describes a “retroflex” syllable.
  – Adding a ‘tone’ attribute describes the tone.

• Chinese Romanization is widely used and learnt.
  – Chinese people learn Chinese Romanization in primary school.
  – Many foreigners begin to learn Chinese through Chinese Romanization.
  – Chinese Romanization is widely used to input Chinese characters on computers.

• The Chinese government has put a standard for Chinese Romanization into effect.
  – It applies to education, publishing, information processing, and other related industries in China.

Chinese Romanization Standards

Chinese Romanization Standard

• The writing rules of Chinese Romanization conform to the P.R.C. state standard “Basic rules for Hanyu Pinyin Orthography” [1], published by the CSBQTS* in 1996.

• This orthography is based on the “Hanyu Pinyin Schema” published in 1958.

• Following the alphabet naming convention, we propose “x-CSBQTS-96” as the name of the Chinese Romanization alphabet. We also propose “x-Pinyin-96” as an alternative that is easier to remember.

* CSBQTS: China State Bureau of Quality and Technical Supervision

Hanyu Pinyin Schema (published in 1958)

• Character set:
  – 25 characters, all from ‘a’ to ‘z’ except ‘ü’.
  – (For ease of computer input, ü is replaced by v.)

• Initial set:
  – b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s

• Final set:
  – i, u, ü, a, ia, ua, o, uo, e, ie, üe, ai, uai, ei, uei,
  – ao, iao, ou, iou, an, ian, uan, üan, en, in, uen, ün,
  – ang, iang, uang, eng, ing, ueng, ong, iong

• Tone annotation:
  – mā, má, mǎ, mà, ma

• Separator: '
  – pi’ao
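
The initial set listed above is enough to split a toneless syllable into its initial and final. The Python function below is a minimal sketch and is not part of the original slides: the name `split_syllable` is hypothetical, and it assumes lowercase input in which ü may be typed as v. Spellings that begin with y or w (such as yi or wu) are treated as zero-initial syllables here, because y and w do not appear in the initial set.

```python
# Initials from the 1958 Hanyu Pinyin Schema. The two-letter initials come
# first so that "zh", "ch", "sh" are tried before "z", "c", "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_syllable(syllable):
    """Split a toneless pinyin syllable into an (initial, final) pair."""
    s = syllable.lower().replace("ü", "v")  # keyboard-friendly form of ü
    for initial in INITIALS:
        if s.startswith(initial):
            return initial, s[len(initial):]
    return "", s  # zero-initial syllable such as "ao"


print(split_syllable("zhuang"))
print(split_syllable("ao"))
```

The first call separates the two-letter initial "zh" from the final "uang"; the second returns an empty initial.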

Pinyin vs. IPA

Basic rules for Hanyu Pinyin Orthography (published in 1996)

1. Words are the basic units for spelling the Chinese Common Language. (A space is used to separate words.)

– rén (person/people), péngyou (friend[s]), túshūguǎn (library/libraries)
– gōngrén hé nóngmín (workers and farmers)

2. Structures of two or three syllables that express a complete concept are linked:

– quánguó (the whole nation), duìbuqǐ (sorry)

3. Terms of four or more syllables are separated if they can be divided into words; otherwise all the syllables are linked:

– wúfèng gāngbǐ (seamless pen), Hóngshízìhuì (Red Cross)

Basic rules for Hanyu Pinyin Orthography (published in 1996)

4. Reduplicated monosyllabic words are linked, but reduplicated disyllabic words are separated:

– rénrén (everybody), chángshi chángshi (give it a try)

5. In certain situations, a hyphen can be added to make the words easier to read and understand:

– huán-bǎo (environmental protection), shíqī-bā suì (17 or 18 years old)

Implementations of Chinese Romanization in SSML

Implementation 1

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ">对不起</phoneme>
  <!-- This is an example of Chinese Romanization standard tone annotation -->
</speak>
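
Markup like the example above can also be generated programmatically. The following Python sketch uses the standard-library `xml.etree.ElementTree` module; the helper name `phoneme_ssml` and its defaults are assumptions made for illustration, and "x-CSBQTS-96" is the alphabet name proposed in these slides rather than a registered SSML alphabet. The language tag is a parameter, with "zh-CN" assumed as the default for Mandarin in mainland China.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def phoneme_ssml(text, ph, alphabet="x-CSBQTS-96", lang="zh-CN"):
    """Return an SSML string wrapping `text` in one <phoneme> element."""
    # Serialize the SSML namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SSML_NS)
    speak = ET.Element("{%s}speak" % SSML_NS, {"version": "1.0"})
    speak.set("{%s}lang" % XML_NS, lang)  # written out as xml:lang
    phoneme = ET.SubElement(
        speak, "{%s}phoneme" % SSML_NS, {"alphabet": alphabet, "ph": ph}
    )
    phoneme.text = text
    return ET.tostring(speak, encoding="unicode")


print(phoneme_ssml("对不起", "duìbuqǐ"))
```

Building the element tree instead of concatenating strings keeps attribute quoting and character escaping correct when the annotated text contains characters such as `&` or `<`.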

Implementation 2

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="zh-CN">
  <phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3">对不起</phoneme>
  <!-- This is an example of Chinese Romanization using numbers to describe tone -->
</speak>

Comparison between Two implementations

Implementation 1: <phoneme alphabet="x-CSBQTS-96" ph="duìbuqǐ">对不起</phoneme>

Implementation 2: <phoneme alphabet="x-CSBQTS-96" ph="dui4bu0qi3">对不起</phoneme>

Note: "x-CSBQTS-96" may be replaced by "x-Pinyin-96"
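
The two forms carry the same information, so one can be derived from the other. The Python sketch below converts the numbered form into the tone-mark form. It is an illustration rather than part of the proposal: the function names are hypothetical, and the mark placement rule (mark a or e if present, otherwise the o in "ou", otherwise the last vowel) is the commonly taught convention, assumed here rather than taken from the slides. It expects lowercase input in which every syllable ends in a digit, with 0 or 5 meaning the neutral tone.

```python
import re

VOWELS = "aeiouü"
# Tone marks for tones 1 to 4, indexed by tone - 1.
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def mark_syllable(syllable, tone):
    """Place the tone mark for `tone` on one toneless pinyin syllable."""
    s = syllable.replace("v", "ü")  # undo the keyboard substitution
    if tone not in (1, 2, 3, 4):
        return s  # neutral tone carries no mark
    if "a" in s:
        pos = s.index("a")
    elif "e" in s:
        pos = s.index("e")
    elif "ou" in s:
        pos = s.index("ou")  # position of the "o"
    else:
        positions = [i for i, ch in enumerate(s) if ch in VOWELS]
        if not positions:
            return s  # no vowel to carry a mark
        pos = positions[-1]
    return s[:pos] + TONE_MARKS[s[pos]][tone - 1] + s[pos + 1:]


def numbers_to_marks(text):
    """Convert e.g. "dui4bu0qi3" into the tone-mark form."""
    return re.sub(
        r"([a-zü]+)([0-5])",
        lambda m: mark_syllable(m.group(1), int(m.group(2))),
        text,
    )


print(numbers_to_marks("dui4bu0qi3"))  # duìbuqǐ
```

The numbered form is easier to type and to split into syllables, since each digit ends a syllable; the marked form matches what readers see in textbooks.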

Extensions for other languages

Extension for Cantonese

• The Linguistic Society of Hong Kong published the simple, easy-to-learn, and easy-to-use “LSHK Cantonese Romanization Scheme” in 1993.

• This scheme is widely adopted in various areas, including education, Cantonese information processing, and computer input methods.

• So we also propose to use “The LSHK Cantonese Romanization Scheme” to annotate Cantonese pronunciation.

Extension for more languages

• Though it is possible to devise a general standard to annotate the pronunciation of all languages, such a standard may become very complex to use.

• Another way is to use the predefined and widely accepted pronunciation annotation standards for different languages.

• At least, these diverse standards should be an important complement to the general standard.

Thank you!

Korean Romanization

Korean    Korean Romanization    Meaning in English
밥        pap                    rice
불고기    pul go gi              broiled beef
갈비찜    kal bi jjim            beef rib stew
만두      man tu                 dumplings
홍차      hong ch'a              tea
콜라      k'ol la                cola
우유      u yu                   milk

This romanization is used in our Korean speech synthesis system.

Japanese Romanization

• Japanese:
  – まだ覚えているでしょう 波音に包まれて

• Japanese Romanization:
  – mada oboeteiru deshou nami oto ni tsutsumarete

• English meaning:
  – Do you remember being surrounded by the sound of the tide?

Discussion of “Word”

• What is the definition of “Word” in Chinese?
  – Prosodic word or grammatical word?

• 你来还是不来? nǐ lái háishi bù lái?

• Is “不来” a word?

• What is the difference between ‘Word’ and ‘break’?
  – The misunderstanding problem can be solved by adding ‘break’.

• Can word information be handled by ‘Hanyu Pinyin Orthography’?
  – In ‘Hanyu Pinyin Orthography’, a space is used to separate words.
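
Because the orthography separates words with spaces, word boundaries can be read directly off an annotated string. The Python sketch below is a hypothetical helper, not something defined in the slides. It assumes that hyphens and the syllable separator (') stay inside a word and that punctuation is dropped. Applied to the example sentence, it returns bù and lái as two separate tokens, which is how that romanization answers the question about 不来.

```python
import re
import unicodedata

# A word is a run of letters, optionally joined by a hyphen or apostrophe.
WORD = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def pinyin_words(text):
    """Return the space-delimited words of an orthographic pinyin string."""
    # Normalize so that each tone-marked vowel is a single precomposed letter.
    return WORD.findall(unicodedata.normalize("NFC", text))


print(pinyin_words("nǐ lái háishi bù lái?"))
```

Prosodic boundaries are a separate matter: this sketch recovers only the orthographic words, and pauses would still need the ‘break’ element.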