Speech synthesis (Talsyntes): Joakim Gustafson, Tal, musik och hörsel (Speech, Music and Hearing)
J. Gustafson, CTT, KTH
Speech synthesis
DT2112
Joakim Gustafson, CTT, KTH School for Computer Science and Communication
Many slides prepared by Olov Engwall (and others)
Text-To-Speech synthesis (TTS)
The automatic generation of synthesized sound or visual output from any phonetic string.
Our focus in this course!
Different kinds of speech synthesis
• Recorded speech
  – Words or phrases (telephone banking)
  – Fixed vocabulary: maintenance problems…
• Concatenative speech synthesis
• Parametric synthesis
• Multimodal synthesis
The SCALE Summer Workshop, June 2012
What the voice conveys
• The linguistic component (the words that are said)
• The extralinguistic component (the identity of the speaker)
• The paralinguistic component (the attitude of the speaker)
• The dialog control component (selection of the next speaker)
Desirable synthesis features from a dialogue perspective
• Real-time, incremental, interruptible
• Explicit control of prosodic parameters
  – Fundamental frequency
  – Intensity
  – Natural-sounding lengthening, hesitation, interruptions
• Generation of extra-linguistic sounds
  – Filled pauses
  – Creaks/gargles
  – Smacks, inhalations/exhalations to give turn
The synthesis space
[Figure: dimensions of the synthesis space: intelligibility, naturalness, bit rate, vocabulary, units, complexity, processing needs, flexibility, speech knowledge, cost]
The steps in TTS
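[Figure: processing pipeline from text (“abcd”) to sound, with the knowledge sources used at each step]
• Language identification
• Linguistic analysis: morphological analysis, lexicon and rules, syntax analysis
• Prosodic analysis: rules and lexicon
• Phonetic description: rules and unit selection
• Sound generation: concatenation, rules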
The automatic generation of synthesized sound from any text string.
From text
Preprocessor
• Sentence end detection: punctuation is ambiguous (a colon can mark a ratio or a time, a period a decimal point or a sentence ending)
• Abbreviations (e.g. “e.g.” for “for instance”): changed to their full form with the help of lexicons
• Acronyms: some are read as a sequence of characters (I.B.M.), others in the default way, as a word (NASA)
• Numbers: once detected, interpreted as rational numbers, times of day, dates or ordinals depending on their context
• Idioms (e.g. “in spite of”, “as a matter of fact”): combined into a single FSU using a special lexicon
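A minimal sketch of three of the preprocessing steps above (abbreviations, acronyms, idioms). The tiny lexicons, the function name `normalize` and the underscore used to join idioms are illustrative assumptions, not part of the course material; number handling and sentence end detection are left out.

```python
import re

# Hypothetical mini-lexicons; a real system would load much larger ones.
ABBREVIATIONS = {"e.g.": "for instance", "etc.": "et cetera"}
SPELLED_ACRONYMS = {"IBM"}            # read letter by letter
IDIOMS = ["in spite of", "as a matter of fact"]


def normalize(text):
    """Expand abbreviations, spell out selected acronyms, join idioms."""
    # 1. Abbreviations are replaced by their full form.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)

    # 2. Acronyms: dotted forms (I.B.M.) or all-caps words (NASA).
    def spell(match):
        letters = match.group(0).replace(".", "")
        if letters in SPELLED_ACRONYMS:
            return " ".join(letters)  # "IBM" -> "I B M"
        return letters                # read as an ordinary word

    text = re.sub(r"\b(?:[A-Z]\.){2,}|\b[A-Z]{2,}\b", spell, text)

    # 3. Idioms are merged into one unit so later stages treat them as a whole.
    for idiom in IDIOMS:
        text = re.sub(re.escape(idiom), idiom.replace(" ", "_"),
                      text, flags=re.IGNORECASE)
    return text


print(normalize("In spite of delays, I.B.M. and NASA agreed, e.g. on dates"))
```

The order of the steps matters in this sketch: abbreviations are expanded first so that their periods do not interfere with the acronym pattern.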
Grapheme-to-phoneme conversion
• Dictionary:
  – Store a maximum of phonological knowledge in a lexicon.
  – Compounding rules describe how the morphemes of dictionary items are modified.
  – Hand-corrected, expensive
  – The lexicon is never complete: an out-of-vocabulary pronouncer is needed, transcribing by rule.
• Rules:
  – A set of letter-to-sound (grapheme-to-phoneme) rules.
  – Words pronounced in such a particular way that they need their own rule are stored in an exceptions dictionary.
  – Fast and easy, but lower accuracy
• Machine learning:
  – CART tree
  – Pronunciation by analogy
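The dictionary-plus-rules strategy can be sketched as a lookup with a rule-based fallback for out-of-vocabulary words. The lexicon entries, the rule tables and the phone symbols below are toy assumptions chosen for illustration only.

```python
# Toy data (assumed): a real lexicon has tens of thousands of entries.
LEXICON = {"speech": ["s", "p", "iy", "ch"]}
EXCEPTIONS = {"one": ["w", "ah", "n"]}
MULTI_LETTER = {"ch": "ch", "sh": "sh", "th": "th", "ee": "iy"}
SINGLE_LETTER = {"a": "ae", "e": "eh", "i": "ih", "o": "aa", "u": "ah", "c": "k"}


def letter_to_sound(word):
    """Greedy left-to-right rules: try two-letter graphemes first."""
    phones, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in MULTI_LETTER:
            phones.append(MULTI_LETTER[pair])
            i += 2
        else:
            # Letters without a rule map to themselves in this sketch.
            phones.append(SINGLE_LETTER.get(word[i], word[i]))
            i += 1
    return phones


def g2p(word):
    """Lexicon first, then exceptions, then letter-to-sound rules."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return letter_to_sound(word)


print(g2p("speech"))   # from the lexicon
print(g2p("ship"))     # by rule
print(g2p("cheese"))   # by rule; the final "e" is wrongly pronounced
```

The last call illustrates the "lower accuracy" point: the silent final letter of "cheese" gets a phone because no rule covers it.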
Prosody
• Prosody = melody, rhythm, “tone” of speech
• Not what words are said, but how they are said
• Prosody is conveyed using:
  – Pitch
  – Phone durations
  – Energy
• Human languages use prosody to convey:
  – phrasing and structure (e.g. sentence boundaries)
  – disfluencies (e.g. false starts, repairs, fillers)
  – sentence mode (statement vs question)
  – emotional attitudes (urgency, surprise, anger)
Intonation: F0 contour
[Figure annotations: large pitch range (female); authoritative (final fall); emphasis on “Finance” (H*); final rise signals more information to come]
• Word stress and sentence intonation
  – each word has at least one syllable which is spoken with higher prominence
  – in each phrase the stressed syllable can be accented depending on the semantics and syntax of the phrase
• Prosody relies on syntax, semantics, pragmatics: personal reflection of the reader.
Pitch contour modeling
• Tonetics (the British school)
  – tone groups composed of syllables {unstressed, stressed, accented or nuclear}
  – nuclear syllables have nuclear tones {fall, rise, fall-rise, rise-fall}
• ToBI (Tones and Break Indices)
  – Phrases split into intermediate phrases composed of syllables.
  – Relative tone levels: high (H) or low (L) (plus diacritics) at every intonational or intermediate phrase boundary (%) and on every accented syllable
• Stylization method (prosodic pattern measured from natural speech)
  – Demo
The automatic generation of synthesized sound from any text string.
To Speech
Synthesis approaches
• By concatenation: elementary speech units are stored in a database and then concatenated and processed to produce the speech signal
• By rule: speech is produced by mathematical rules that describe the influence of phonemes on one another
Research trends in speech synthesis
1950 Synthesis by analysis
1960 Phonetic rules
1970 Linguistic processing
1980 Concatenation
1990 Automatic procedures
2000
Text-to-Speech Synthesis Evolution
[Figure: timeline, 1962 to 2002]
• Formant synthesis (Bell Labs; Joint Speech Research Unit; MIT (DECtalk); Haskins Lab; KTH): poor intelligibility; poor naturalness; small footprint
• LPC-based diphone/dyad synthesis (Bell Labs; CNET; Bellcore; Berkeley Speech Technology): good intelligibility; poor naturalness
• Unit selection synthesis (ATR in Japan; CSTR in Scotland; BT in England; AT&T Labs (1998); L&H in Belgium): good intelligibility; customer-quality naturalness (limited context)
• HMM synthesis (HTS in Japan; CSTR in Scotland): multi-speaker training, speaker adaptation; naturalness, generative, small footprint
Synthesis by rule
Models of speech production: a) humans, b) machines
• Carrier: vocal cords, pitch, voice quality
• Modulation: vocal tract, nose, mouth, formants
The LPC Vocoder
• “Vocoder” is a contraction of the terms “voice” and “coder”
• A vocoder aims to replace the carrier with another carrier while keeping the message
• It is used in telephony to decrease bit rate
• The pitch and LPC coefficients are computed from the speech signal
• These are then used as input to the speech generator
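As a rough illustration of the analysis step, the sketch below estimates LPC coefficients with the autocorrelation method and the Levinson-Durbin recursion. This is one standard way to do it, assumed here rather than taken from the slides; pitch estimation, windowing and framing are left out, and the function names are illustrative.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for a[0..order] (a[0] = 1) so that the prediction error
    e[n] = x[n] + a[1]*x[n-1] + ... + a[order]*x[n-order] is minimal."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # remaining prediction error
    return a, err


# Toy frame: impulse response of a one-pole filter with its pole at 0.9.
frame = [0.9 ** n for n in range(200)]
coeffs, error = levinson_durbin(autocorr(frame, 2), 2)
print(coeffs)   # a[1] is close to -0.9 and a[2] is close to 0
```

In a vocoder these coefficients describe the vocal-tract filter (the message), which can then be excited by a different carrier.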
The TANDEM-STRAIGHT Vocoder
• Developed by Hideki Kawahara at Kyoto Univ.
• Speech analysis, modification and resynthesis
• Will be used in Lab 2
Articulatory Synthesis
Articulatory synthesis potential use
• Articulatory synthesis
  – Calculations directly from cross-sectional areas
  – Fluid dynamics calculations
• Visual synthesis
  – Articulation training
• Demonstrations and research
Articulatory parameters
• Jaw opening
• Lip rounding
• Lip protrusion
• Tongue position
• Tongue height
• Tongue tip
• Velum
• Hyoid
From articulation to acoustics
[Figure labels: vocal tract model, cross-sections, area function, tubes, 3D air flow calculations, transfer function, waveform]
Summary: Articulatory Synthesis
Benefits:
• Speech production in the same way as humans
• Can be made with few parameters
• The changes are intuitive (raise the tongue tip, round the lips)
Disadvantages:
• Computationally demanding
• Problems with consonants
• Articulatory measurements required
• State-of-the-art articulatory synthesis still sounds bad
Formant Synthesis
Formant synthesis (1959-1987)
• Haskins, 1959
• KTH, Stockholm, 1962
• Bell Labs, 1973
• MIT, 1976
• MITalk, 1979
• Speak & Spell, 1980
• Bell Labs, 1985
• DECtalk, 1987
Gunnar Fant and OVE 1962
OVE II
Rule-driven formant synthesis
• Parameters are generated by rule (RULSYS , Carlson et al.)
• Formant values are generated by interpolating between target frequencies
• Parameters are fed to a synthesizer (GLOVE, Carlson et al.)
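The interpolation step can be illustrated with a short sketch that turns a list of formant targets into a frame-by-frame track. Linear interpolation, the 10 ms frame step and the function name are assumptions made for this example; the actual RULSYS rules are not described on the slide and are likely to use smoother transitions.

```python
def interpolate_track(targets, frame_ms=10):
    """targets: list of (time_ms, frequency_hz) pairs sorted by time.
    Returns one frequency value per frame, linearly interpolated."""
    track = []
    for (t0, f0), (t1, f1) in zip(targets, targets[1:]):
        t = t0
        while t < t1:
            alpha = (t - t0) / (t1 - t0)    # 0 at the first target, 1 at the next
            track.append(f0 + alpha * (f1 - f0))
            t += frame_ms
    track.append(targets[-1][1])            # include the final target itself
    return track


# First formant moving from 300 Hz to 700 Hz over 40 ms.
print(interpolate_track([(0, 300), (40, 700)]))
# → [300.0, 400.0, 500.0, 600.0, 700]
```

One such track per formant, together with source parameters, would form the parameter stream fed to the synthesizer.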
Summary: formant synthesis
Benefits:
• Possible to change the voice to get different:
  – speakers
  – emotions
  – voice qualities
• Small footprint
Disadvantages:
• Hard to achieve naturalness in voice source
• Some consonant sounds are hard to model with formants (bursts)
From rule-based to concatenative synthesis
• Rule-based synthesis sounds unnatural, while concatenative synthesis provides (piece-wise) high quality speech.
• Certain sounds are hard to produce by rule but easy to concatenate:
  – Bursts and voiceless stops are too difficult to model
• Rule-based synthesis had the advantage of a small footprint, but storing the segment database is no longer an issue
• Change of applications:
  – From reading machines for the blind to spoken dialogue systems
Synthesis by concatenation
Let’s get the terms straight
Concatenative synthesis
• Definition: All kinds of synthesis based on the concatenation of units, regardless of type (sound, formant trajectories, articulatory parameters) and size (diphones, triphones, syllables, longer units). There is only one candidate per setting.
• Everyday use: Concatenation of same-size sound units.
Unit selection synthesis
• Definition: All kinds of synthesis based on the concatenation of units where there are several candidates to choose from, regardless of whether the candidates have the same, fixed size or the size is variable.
• Everyday use: Concatenation of variable-sized sound units.
Database preparation when building a concatenative synthesizer
• Choose the speech units (phone, diphone, sub-word unit, cluster-based unit selection)
• Compile and record utterances
• Segment signal and extract speech units
• Store segment waveforms (along with context) and information in a database: dictionary, waveform, pitch marks
  – e.g. “ch-l r021 412.035 463.009 518.23” (fields: diphone, file, start time, middle time, end time)
• Pitch mark file: a list of each pitch mark position in the file
• Extract parameters; create parametric segment database (for data compaction and prosody matching)
• Perform amplitude equalization (prevents mismatches)
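To make the database entry format concrete, here is a small sketch that parses one line of such a diphone index into named fields. The field names and the `DiphoneEntry` type are assumptions for illustration; the slide does not state the time unit, so the values are kept as plain numbers.

```python
from typing import NamedTuple


class DiphoneEntry(NamedTuple):
    left: str       # first phone of the diphone
    right: str      # second phone of the diphone
    file: str       # identifier of the recording that holds the unit
    start: float    # start of the diphone in the recording
    middle: float   # boundary between the two phones
    end: float      # end of the diphone in the recording


def parse_entry(line):
    """Parse 'ch-l r021 412.035 463.009 518.23' into a DiphoneEntry."""
    name, file_id, start, middle, end = line.split()
    left, right = name.split("-")
    return DiphoneEntry(left, right, file_id,
                        float(start), float(middle), float(end))


entry = parse_entry("ch-l r021 412.035 463.009 518.23")
print(entry.left, entry.right, entry.file, entry.end - entry.start)
```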
Signal manipulations in concatenative synthesis
• Prosodic modifications
  – Possibility to modify F0
  – Possibility to lengthen or shorten segments
• Spectral modifications
  – Interpolation of spectrum at joints
Diphone Synthesis
• Units: sequences of a particular sound/phone in all its environments of occurrence, or all/most two-phone sequences occurring in a language: _auto_ -> _a, au, ut, to, o_
• Rationale: the center of a phonetic realization is the most stable region, whereas the transition from one segment to another contains the most interesting phenomena, and is thus the hardest to model.
Assignment: Diphone “synthesis”; cut and paste
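The _auto_ example can be reproduced with a few lines of code. This sketch treats each letter as one phone and uses "_" for silence, as the slide does; the function name is an illustrative assumption.

```python
def to_diphones(phones, silence="_"):
    """Split a phone sequence into overlapping two-phone units,
    padding with silence at both ends."""
    padded = [silence] + list(phones) + [silence]
    return [a + b for a, b in zip(padded, padded[1:])]


print(to_diphones("auto"))
# → ['_a', 'au', 'ut', 'to', 'o_']
```

A sequence of n phones always yields n + 1 diphones, each cut from the middle of one phone to the middle of the next.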
Diphones
• Need O(phone²) number of units
  – Some combinations don’t exist (hopefully)
  – ATT (Olive et al. 1998) system had 43 phones
    • 1849 possible diphones
    • Phonotactics ([h] only occurs before vowels), don’t need to keep diphones across silence
    • Only 1172 actual diphones
  – May include stress, consonant clusters
    • So could have more
  – Lots of phonetic knowledge in design
• Database relatively small
  – Around 8 megabytes for English (16 kHz, 16 bit)
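The numbers on this slide can be checked with a little arithmetic. The sketch assumes mono audio and reads "8 megabytes" as 8,000,000 bytes; both are assumptions, not statements from the slide.

```python
phones = 43
possible_diphones = phones ** 2          # every ordered pair of phones
actual_diphones = 1172                   # after phonotactic pruning

# 16 kHz sampling, 16 bits (2 bytes) per sample, one channel (assumed).
bytes_per_second = 16_000 * 2
seconds_of_speech = 8_000_000 / bytes_per_second
seconds_per_diphone = seconds_of_speech / actual_diphones

print(possible_diphones)                               # → 1849
print(round(actual_diphones / possible_diphones, 2))   # → 0.63
print(seconds_of_speech)                               # → 250.0
print(round(seconds_per_diphone, 2))                   # → 0.21
```

So roughly two thirds of the theoretical inventory is actually needed, and the whole database amounts to about four minutes of audio.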
Slide from Richard Sproat
Building diphone schemata
• Find list of phones in language:
  – Plus interesting allophones
  – Stress, tones, clusters, onset/coda, etc.
  – Foreign (rare) phones
• Build carriers for:
  – Consonant-vowel, vowel-consonant
  – Vowel-vowel, consonant-consonant
  – Silence-phone, phone-silence
  – Other special cases
• Check the output:
  – List all diphones and justify missing ones
  – Every diphone list has mistakes
Slide from Richard Sproat
Designing a diphone inventory: Nonsense words
• Build set of carrier words:
  – pau t aa b aa b aa pau
  – pau t aa m aa m aa pau
  – pau t aa m iy m aa pau
  – pau t aa m ih m aa pau
• Advantages:
  – Easy to get all diphones
  – Likely to be pronounced consistently
    • No lexical interference
• Disadvantages:
  – (possibly) bigger database
  – Speaker becomes bored
Slide from Richard Sproat
Designing a diphone inventory: Natural words
• Greedily select sentences/words:
  – Quebecois arguments
  – Brouhaha abstractions
  – Arkansas arranging
• Advantages:
  – Will be pronounced naturally
  – Easier for speaker to pronounce
  – Smaller database? (505 pairs vs. 1345 words)
• Disadvantages:
  – May not be pronounced correctly
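Greedy selection of natural words can be sketched as a set-cover loop: repeatedly pick the word that adds the most diphones still missing from the inventory. The candidate words and their transcriptions below are invented placeholders, and the function names are assumptions.

```python
def diphones_of(phones):
    """Set of adjacent phone pairs in a transcription."""
    return {(a, b) for a, b in zip(phones, phones[1:])}


def greedy_select(candidates, wanted):
    """candidates: dict mapping word -> list of phones.
    wanted: set of diphones the inventory should cover.
    Returns the chosen words and the diphones left uncovered."""
    remaining = set(wanted)
    chosen = []
    while remaining:
        best = max(candidates,
                   key=lambda w: len(diphones_of(candidates[w]) & remaining))
        gain = diphones_of(candidates[best]) & remaining
        if not gain:          # no candidate covers anything new
            break
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


words = {"abc": ["a", "b", "c"], "bc": ["b", "c"], "cd": ["c", "d"]}
target = {("a", "b"), ("b", "c"), ("c", "d")}
print(greedy_select(words, target))
# → (['abc', 'cd'], set())
```

Any diphones returned as uncovered would need extra prompts, which is where nonsense carrier words remain useful.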
Slide from Richard Sproat
Making recordings consistent:
• Diphone should come from mid-word
  – Helps ensure full articulation
• Performed consistently
  – Constant pitch (monotone), power, duration
• Use (synthesized) prompts:
  – Helps avoid pronunciation problems
  – Keeps speaker consistent
  – Used for alignment in labeling
Slide from Richard Sproat
Recording conditions
• Ideal:
  – Anechoic chamber
  – Studio quality recording
  – EGG signal
• More likely:
  – Quiet room
  – Cheap microphone/Sound Blaster
  – No EGG
  – Head-mounted microphone
• What we can do:
  – Repeatable conditions
  – Careful setting of audio levels
Slide from Richard Sproat
Labeling Diphones
• Run an ASR in forced alignment mode
  – Forced alignment:
    • In: a trained ASR system, a wavefile, a word transcription of the wavefile
    • Returns: an alignment of the phones in the words to the wavefile
• Much easier than phonetic labeling:
  – The words are defined
  – The phone sequence is generally defined
  – They are clearly articulated
  – But sometimes the speaker still pronounces wrong, so need to check.
• Phone boundaries less important
  – ±10 ms is okay
• Midphone boundaries important
  – Where is the stable part
  – Can it be automatically found?
Slide from Richard Sproat
Finding diphone boundaries
• Stable part in phones
  – For stops: one third in
  – For phone-silence: one quarter in
  – For other diphones: 50% in
• In time alignment case:
  – Given explicit known diphone boundaries in prompt in the label file
  – Use dynamic time warping to find same stable point in new speech
• Optimal coupling
  – Taylor and Isard 1991, Conkie and Isard 1996
  – Instead of precutting the diphones
    • Wait until we are about to concatenate the diphones together
    • Then take the 2 complete (uncut) diphones
    • Find optimal join points by measuring cepstral distance at potential join points, pick best
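Both ideas on this slide can be sketched briefly: a fixed-fraction rule for the stable point, and an optimal-coupling search that compares cepstral frames of two uncut diphones. Euclidean distance between frames, the exhaustive search and the function names are assumptions for this example; the caller is assumed to pass only frames that are plausible join points.

```python
import math


def stable_point(start, end, kind="other"):
    """Cut point inside a phone: one third in for stops,
    one quarter in for phone-silence, halfway otherwise."""
    fraction = {"stop": 1 / 3, "phone-silence": 1 / 4}.get(kind, 0.5)
    return start + fraction * (end - start)


def optimal_join(left_frames, right_frames):
    """Return (i, j, distance) for the pair of frames, one from each
    diphone, whose cepstral vectors are closest."""
    best = None
    for i, a in enumerate(left_frames):
        for j, b in enumerate(right_frames):
            d = math.dist(a, b)
            if best is None or d < best[2]:
                best = (i, j, d)
    return best


print(stable_point(0.0, 1.0))                  # → 0.5
left = [[0.0, 0.0], [1.0, 1.0]]                # candidate frames, left diphone
right = [[5.0, 5.0], [1.0, 1.5]]               # candidate frames, right diphone
print(optimal_join(left, right))               # → (1, 1, 0.5)
```

The left diphone would then be cut at frame `i` and the right one at frame `j`, so the join happens where the two spectra match best.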
Slide from Richard Sproat
Summary: Diphone Synthesis
• Well-understood, mature technology
• Augmentations
  – Stress
  – Onset/coda
  – Demi-syllables
• Problems:
  – Signal processing still necessary for modifying durations
  – Source data is still not natural
  – Units are just not large enough; can’t handle word-specific effects, etc.
Slide from Dan Jurafsky
From diphone synthesis to Unit Selection Synthesis
• Natural data solves problems with diphones
  – Diphone databases are carefully designed but:
    • Speaker makes errors
    • Speaker doesn’t speak intended dialect
    • Require database design to be right
  – If it’s automatic
    • Labeled with what the speaker actually said
    • Coarticulation, schwas, flaps are natural
• “There’s no data like more data”
  – Lots of copies of each unit mean you can choose just the right one for the context
  – Larger units mean you can capture wider effects
Slide from Dan Jurafsky
Unit selection Synthesis
Slide from Tokuda
Unit Selection Intuition
• Given a big database
• For each segment that we want to synthesize
  – Find the unit in the database that is the best to synthesize this target segment
• What does “best” mean?
  – Target cost: closest match to the target description, in terms of
    • Phonetic context
    • Pitch, power, duration, phrase position
  – Concatenation cost: the difference between the end of diphone 1 and the start of diphone 2:
    • Matching formants + other spectral characteristics
    • Matching energy
    • Matching F0
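The slide names the two costs but not the search procedure. A common choice, assumed here, is a Viterbi-style dynamic programming search for the unit sequence with the lowest total of target and concatenation costs. The toy units (start F0, end F0) and the cost functions are illustrative only.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """targets: list of target descriptions.
    candidates: for each target, a list of candidate units.
    Returns (best unit sequence, total cost)."""
    # best[i][k] = (cheapest cost of a path ending in candidate k, backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return path[::-1], total


# Toy example: a unit is (F0 at start, F0 at end); a target is the wanted mean F0.
def f0_target_cost(target, unit):
    return abs(target - (unit[0] + unit[1]) / 2)

def f0_join_cost(prev, unit):
    return abs(prev[1] - unit[0])

targets = [100, 120]
candidates = [[(100, 100), (98, 118)], [(120, 120), (100, 140)]]
print(select_units(targets, candidates, f0_target_cost, f0_join_cost))
# → ([(100, 100), (100, 140)], 0.0)
```

In the example both candidates for the second target match its mean F0, but only (100, 140) joins smoothly to the first unit, so the join cost decides.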
Slide from Dan Jurafsky
The units in Unit Selection
• Different types of units: e.g. diphones, phones, syllables, words, etc.
• Multiple occurrences of the units cover a wide space of the spectral and prosodic parameters
• Units nearest in this space to the targets will be chosen and will require only minor modification
• The corpus is segmented into phonetic units, indexed, and used as-is
• Selection is made on-line
• The trend is towards longer and longer units
[Figure: timeline, 1999 to 2005]
Features of Unit Selection Synthesis
• Large databases of recorded natural speech
• Minimal processing
• Annotation of database: what information is needed?
• Few cuts -> maximally long units selected (but context and prosody must fit well)
• Target and concatenation costs
Slide from Dan Jurafsky
Database creation: a good speaker
• Professional speakers are always better:
  – Consistent style and articulation
  – Although these databases are carefully labeled
• Ideally (according to AT&T experiments):
  – Record 20 professional speakers (small amounts of data)
  – Build simple synthesis examples
  – Get many (200?) people to listen and score them
  – Take best voices
• Correlates for human preferences:
  – High power in unvoiced speech
  – High power in higher frequencies
  – Larger pitch range
Text from Paul Taylor and Richard Sproat
Database creation: good recording conditions
• Good script
  – Application dependent helps
    • Good word coverage
    • News data synthesizes as news data
    • News data is bad for dialog
  – Good phonetic coverage, especially with respect to context
  – Low ambiguity
  – Easy to read
• Annotate at phone level, with stress, word information, phrase breaks
Text from Paul Taylor and Richard Sproat
Creating database
• Unlike diphone synthesis, prosodic variation is a good thing
• Accurate annotation is crucial
• Pitch annotation needs to be very accurate
• Phone alignments can be done automatically, as described for diphones
Slide from Dan Jurafsky
Practical System Issues
• Size of typical system (Rhetorical rVoice):
  – ~300M
• Speed:
  – For each diphone, average of 1000 units to choose from, so:
  – 1000 target costs
  – 1000x1000 join costs
  – Each join cost, say 30x30 floating point calculations
  – 10-15 diphones per second
  – 10 billion floating point calculations per second
• But commercial systems must run ~50x faster than real time
• Heavy pruning essential:
  – 1000 units -> 25 units
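The "10 billion" figure follows from the numbers above, as this short calculation shows. It counts join costs only and uses 12.5 diphones per second, the midpoint of the stated 10-15 range, which is an assumption.

```python
def join_flops_per_second(candidates, flops_per_join=30 * 30,
                          diphones_per_second=12.5):
    """Floating point operations spent on join costs per second of speech."""
    joins_per_diphone = candidates * candidates   # every pair of neighbours
    return joins_per_diphone * flops_per_join * diphones_per_second


full = join_flops_per_second(1000)
pruned = join_flops_per_second(25)

print(full)            # → 11250000000.0, i.e. about 10 billion
print(pruned)          # → 7031250.0
print(full / pruned)   # → 1600.0
```

Because the join cost grows with the square of the candidate count, pruning from 1000 to 25 candidates cuts the work by a factor of 40² = 1600.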
Slide from Paul Taylor
Summary: Unit Selection
• Advantages
  – Quality is far superior to diphones
  – Natural prosody selection sounds better
  – Non-linguistic features of the speaker’s voice built in
• Disadvantages:
  – Fixed voice
  – Quality can be very bad in places
    • HCI problem: mix of very good and very bad is quite annoying
  – Large footprint; it is computationally expensive
  – Can’t synthesize everything you want:
    • Diphone technique can move emphasis
    • Unit selection gives good (but possibly incorrect) result
Slide from Richard Sproat
From Unit selection to HMM synthesis
• Problems with Unit Selection Synthesis
  – Discontinuities: can’t modify signal
  – Hit or miss: database often doesn’t have exactly what you want
  – Fixed voice
• Solution: HMM (Hidden Markov Model) Synthesis
  – Stable, smooth and easy to create multiple voices
  – Sounds unnatural to researchers, but naïve subjects prefer it
Example: Nina as unit selection and HMM synthesis voice
Slide from Dan Jurafsky
HMM Synthesis
Slide from Tokuda
HMMs in synthesis
Relationship between unit selection and HMM synthesis
Slide from Tokuda
Relationship between unit selection and HMM synthesis 2
Slide from Tokuda
The training part
• The training is automatic. You need:
  – The text + recordings of about 1000 sentences
• Training on 1000 sentences
  – takes 24 hours and generates a voice of less than 1 MB
• Separate HMMs for: spectrum, F0, duration
• Training in two steps:
  1. Context-independent models
  2. Use these models to create context-dependent models
• Clustering:
  – Storing all contexts requires much space
  – It may be difficult to find alternatives for missing models
  – Many models are very similar = redundancy
Examples of features in HMM synthesis training
• Segment features:
  – immediate context
  – position in syllable
• Syllable features:
  – stress and lexical accent type
  – position in word and phrase
• Word features:
  – number of syllables
  – position in phrase
  – morphological feature (compound or not)
  – part-of-speech tag (content or function word)
• Phrase features:
  – phrase length in terms of syllables and words
• Utterance features:
  – length in syllables, words and phrases
• Speaker:
  – dialect, speaking style, emotional state
Clustering
• Groups a large database into clusters
• Three decision trees: duration, F0 and spectrum
• Division based on yes/no questions
  – Grouping acoustically similar phonemes
  – Features
  – Context
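A decision tree of yes/no questions can be sketched as nested dictionaries that are walked until a leaf (a cluster) is reached. The questions, feature names and cluster labels below are invented for illustration; real systems learn the tree from data, with separate trees for duration, F0 and spectrum.

```python
VOWELS = {"a", "e", "i", "o", "u"}

# Hypothetical tree: inner nodes ask a yes/no question about the context,
# leaves name the cluster whose model parameters are shared.
TREE = {
    "question": lambda c: c["phone"] in VOWELS,        # Is the phone a vowel?
    "yes": {
        "question": lambda c: c["stressed"],           # Is the syllable stressed?
        "yes": "vowel_stressed",
        "no": "vowel_unstressed",
    },
    "no": "consonant",
}


def find_cluster(tree, context):
    """Follow yes/no answers from the root until a leaf is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if tree["question"](context) else tree["no"]
    return tree


print(find_cluster(TREE, {"phone": "a", "stressed": True}))    # → vowel_stressed
print(find_cluster(TREE, {"phone": "t", "stressed": False}))   # → consonant
```

Because every context answers the questions in some way, even a context never seen in training ends up in a cluster, which addresses the problem of missing models.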
Speaker adaptation
http://homepages.inf.ed.ac.uk/jyamagis/Demo-html/map-new.html
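[Figure: map of Swedish dialect regions: Dalarna, Norrland, Skåne, Gotland, Svealand, Götaland]
Example sentences from the Simulekt project:
• “Vart tar universum slut?” (Where does the universe end?)
• “Centerpartister och kristdemokrater menar dock att brudparet kan slippa betala.” (Centre Party members and Christian Democrats, however, think that the bridal couple can be spared from paying.)
• “Det kan bådas föräldrar göra.” (Both sets of parents can do that.)
• “Gula sidorna finns från och med i fredags sökbart via mobiltelefonen.” (Gula sidorna, the Yellow Pages, has been searchable via mobile phone since last Friday.)
• “Om man till exempel tar en telefon och frågar hur den fungerar så svarar vetenskapsmannen med att lyfta på luren och slå numret.” (If, for example, you take a telephone and ask how it works, the scientist answers by lifting the receiver and dialing the number.)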
Use of HMM synthesis
• Various voices:
  – Speaker adaptation
  – Speaker interpolation
• Security of speaker identification systems
• Very low bit rate speech coder
• Small footprint, for use in mobile phones and web browsers
  – E.g. in Flash: http://www.furui.cs.titech.ac.jp/~dixonp/hts/index.html
Lab 2 synthesis
Two parts:
• Manual diphone synthesis
  – Tool: WaveSurfer (download from SourceForge)
  – Do it at home, or with the lab computers
• Resynthesis experiments
  – Tool: TANDEM-STRAIGHT vocoder
  – You will get an email with a link to the software (PC, Mac, Linux, but you will have to install the MATLAB runtime)
  – The assignments will be sent next week, together with times to come and do/present it with the lab computers