Fundamental Frequency Contour Synthesis for Turkish Text to Speech Erkan Abdullahbeşe

Post on 13-Dec-2015

Slide 1
Fundamental Frequency Contour Synthesis for Turkish Text to Speech
Erkan Abdullahbeşe

Slide 2
Content:
  TTS systems and prosody
  Turkish Intonation, Stress
  Observations on Collected Data
  Methodology
  Improvements on Methodology
  Discussion
  Conclusion

Slide 3
Introduction to Text to Speech (TTS) Systems
  Text -> speech signal
  Widespread applications
    Message to speech generation
    Man-machine dialogue
    Multimedia applications
    Talking aids for handicapped
  CHALLENGE: Machine Accent -> Natural Speech
  SOLUTION: Prosody Generation Modules

Slide 4
What is Prosody?
  Properties of speech that cannot be derived from the phoneme sequence
    Modulation of voice pitch
    Rhythm, changes in durations
    Fluctuations of loudness
  Related to domains larger than one phoneme (supra-segmental properties)

Slide 5
Basic Acoustic Parameters
  Fundamental Frequency F0 (pitch)
  Duration
  Intensity
Prosodic Phenomena
  Modulate the basic acoustic parameters
  Modulation of fundamental frequency
    Intonation
    Stress (accent)

Slide 6
Intonation
  Ensemble of pitch variations
  Perceived as speech melody
Stress
  Modulates all the basic acoustic parameters
    Increase in F0 and intensity (loudness)
    Lengthening in duration
  Three types:
    Word stress
    Phrase stress
    Sentence stress
  Stress on a single syllable
  Phrase and sentence stress coincide with word stress

Slide 7
Prosody Generation Modules in TTS
  Prosodic description
    Prosodic phrasing -> phrase boundaries
    Accent labeling -> accents on syllables
  Prosodic labels -> F0 contour
  PROBLEMS
    Complex linguistic processing units (morphology, syntax, semantics)
    Speaker-dependence
    Articulation-related problems: microprosody vs. macroprosody

Slide 8
Basic Intonation Models
  Tone Sequence Models: pitch contour as a sequence of fluctuations generated by local accents
    Pierrehumbert: a sequence of independent H and L tones (orthography)
      Pitch accent -> pitch movements on stressed syllables
      Boundary tone -> at phrase boundaries
      Phrase accent -> between stressed syllable and phrase boundary
  Superposition Models: pitch contour as the superposition of several components with different domains: syllables, words, phrases, sentences, paragraphs, whole text
    Fujisaki: purely mathematical model -> parametric
      A basic F0
      A phrase component (response of a critically damped second-order system to an impulse)
      An accent component (response of a critically damped second-order system to a rectangular input)
      Optimization of parameter values with respect to F0 (Analysis by Synthesis)
    Möbius -> Fujisaki + Linguistics -> German

Slide 9
Approaches
  Perform an analysis on a speech corpus
    Transcribe the corpus
    Define F0 labels (rise, fall, peak etc.) and boundary labels (minor, major etc.)
    Labeling
      By hand
      Examination -> rules -> automatic
  Automatic learning of: labels -> F0 values (or parametrized)
    Neural Networks
    Stochastic methods
  Intonation pattern dictionary (from natural speech)
    Store pitch values in ST and key information (labels) for each pattern
    For the patterns in the input sentence -> compare key info -> find the closest pattern in the dictionary -> apply pitch

Slide 10
Approaches
  For integration into TTS (labeling the input sentence from text)
    Complex linguistic processing units
      Morphology
      Syntax
      Semantics
    Stochastic methods
      Syntax -> most probable label sequence

Slide 11
Sentence Intonation Types
  Terminal intonation: pitch decreases at the end -> message completed
  Interrogative intonation: pitch slightly increases on the last syllable -> waiting for response
  Progressive intonation: pitch either increases slightly or does not show any lowering at the end -> message not completed yet

Slide 12
Turkish Intonation
  Classification of sentences
    Type: Declaratives(), wh-questions(), yes-no questions()
    Structure:
      Simple
      Compound: () at the end of subordinate
        Meşgul olduğundan() bizimle sinemaya gelemedi().

Slide 13
Turkish Intonation
  Tone groups (phrase or segment)
  Division into tone groups
    / Oraya varınca beni arayın. /
    / Oraya varınca / beni arayın. /
  Focus (new information) in each tone group
    / Oraya varınca beni arayın. /
  Pitch variations on focus

Slide 14
Turkish Intonation
  Four levels of pitch: low(1), mid(2), high(3), extra high(4)
    gi 2 di 3 yoru 1 m
    sa 2 hi 4 mi 1
  Speech melody resembles musical melody (Nash)
    Hierarchy of intonation units (phrase -> text)
    Each intonation unit -> melody
    Successive intonation units related by motifs -> melody of the upper level
    Music: reiteration of motifs -> musical melody

Slide 15
Turkish Stress: Word Stress
  Fixed (bound) stress vs. free stress (Turkish)
  Stress on a single syllable of a word in Turkish
  Effect of suffixes on stress
    Stress on final syllable of root + stressable suffix
      yolcu + -lar -> yolcular
    Stress on final syllable of root, unstressable suffix involved
      oku + -yor -> okuyor, + -lar -> okuyorlar
    Stress on non-final syllable of root
      karınca + -lar -> karıncalar
  May disappear in sentence

Slide 16
Turkish Stress: Sentence Stress
  Signals the prominence of the most information-bearing element in a sentence
  Types
    Unmarked (preverbal position)
      Yarın İstanbul'a gidiyorlar.
    Marked (any position)
      Yarın İstanbul'a gidiyorlar.
  Focusing elements
    Precede focus: sadece, daha
      Mehmet daha bugün ödevine başlayabildi.
    Follow focus: -mi, da, bile
      Ayla mı bugün Ankara'dan dönüyor?

Slide 17
Turkish Stress: Phrase Stress
  Phrase: modifier or complement and head
  Phrase stress on modifier in Turkish
  Types
    Phrases used as nouns: telefon ahizesi, güzel çiçekler
    Phrases used as verbs: hızlı koş, severek yaşa
    Others: senin için, yarından sonra
  Preserved in the sentence

Slide 18
Motivation
  Nevin bugün menemen yemeli. (template)       N Z F V
  Nevin menemen yemeli.                        N F V
  Bizim Nevin domatesli menemen yemeli.        P N A F V
  Nalan yarın ayna alıyor.                     N Z F V
  Nalan ayna alıyor.                           N F V
  Kardeşim Nalan yeni ayna alıyor.             N N A F V

Slide 19
  Nevin bugün menemen yemeli.
  Nevin menemen yemeli.

Slide 20
  Nevin bugün menemen yemeli.
  Bizim Nevin domatesli menemen yemeli.

Slide 21
  Nevin bugün menemen yemeli.
  Nalan yarın ayna alıyor.

Slide 22
  Nevin bugün menemen yemeli.
  Nalan ayna alıyor.

Slide 23
  Nevin bugün menemen yemeli.
  Kardeşim Nalan yeni ayna alıyor.

Slide 24
Sentences
  Sentence Type       Positive   Negative
  Declaratives        25         15
  Wh-questions        10         5
  Yes-no questions    10         5
  Conditionals        6          4
  Imperatives         6          4
  Exclamations        6          4

  100 database sentences
  19 close test sentences (add/remove categories)
  18 random test sentences
  Syllable-based hand labeling
  Pitch extraction

Slide 25
Observations: Declaratives
  Pitch decrease at the end (terminal intonation)
  Division into phrases
  Pitch increase on the phrase-final syllable (progressive intonation)
  Nevin / bugün / menemen yemeli.

Slide 26
Observations: Declaratives
  Pitch decrease at the end (terminal intonation)
  Division into phrases
  Pitch increase on the phrase-final syllable (progressive intonation)
  Evvelki gün / ikimiz de / kuyumcu Ali'ye uğradık.

Slide 27
Observations: Wh-questions
  Pitch increase on the last syllable (interrogative intonation)
  Evident pitch increase on the stressed syllable of the wh-word
  No division into phrases
  Word stress often disappears
  Dün neden zamanımı aldın?

Slide 28
Observations: Wh-questions
  Pitch increase on the last syllable (interrogative intonation)
  Evident pitch increase on the stressed syllable of the wh-word
  No division into phrases
  Word stress often disappears
  Kimler yarın sınıf gezisine katılacaklar?

Slide 29
Observations: Yes-no questions
  Pitch decrease at the end
  Evident pitch increase on the stressed syllable of the word before -mi
  No division into phrases
  Word stress often disappears
  Oralar yine eskisi gibi güzel mi?
Slide 30
Observations: Yes-no questions
  Pitch decrease at the end
  Evident pitch increase on the stressed syllable of the word before -mi
  No division into phrases
  Word stress often disappears
  Mudanya'da bu sene de çok yağmur yağıyor mu?

Slide 31
Observations: Conditionals
  Pitch decrease at the end (terminal intonation)
  Division into phrases
  Pitch increase on the phrase-final syllable (progressive intonation)
  -se is always a phrase-final syllable
  İnsan azimliyse herşeyi başarabilir.

Slide 32
Observations: Conditionals
  Pitch decrease at the end (terminal intonation)
  Division into phrases
  Pitch increase on the phrase-final syllable (progressive intonation)
  -se is always a phrase-final syllable
  Babam keyifsizse ona konuyu bu akşam anlatamam.

Slide 33
Observations: Imperatives
  Pitch decrease at the end (terminal intonation)
  Division into phrases
  Pitch increase on the phrase-final syllable (progressive intonation)
  Akşam yemeği için çarşıdan birşeyler alsınlar.

Slide 34
Observations: Imperatives
  Pitch decrease at the end (terminal intonation)
  Division into phrases
  Pitch increase on the phrase-final syllable (progressive intonation)
  Sevgiyi ve mutluluğu yarınlara erteleme.

Slide 35
Observations: Exclamations
  Diverse
  Pitch decrease at the end (terminal intonation)
  Evident pitch increase on the stressed syllable of the interjection or of another word
  Aman büyüklerine bir saygısızlık yapma!

Slide 36
Observations: Exclamations
  Diverse
  Pitch decrease at the end (terminal intonation)
  Evident pitch increase on the stressed syllable of the interjection or of another word
  Haydi bugün hep birlikte pikniğe gidelim!

Slide 37
Local Observations
  At most a single stressed syllable per word, excluding the phrase-final increase
  Stress within the sentence coincides with the word stress
  Phrase stress preserved
  Ekonomik kriz / her kesimden insanı / olumsuz etkiledi.
Slide 38
Local Observations
  At most a single stressed syllable per word, excluding the phrase-final increase
  Stress within the sentence coincides with the word stress
  Phrase stress preserved
  Evvelki gün / ikimiz de / kuyumcu Ali'ye uğradık.

Slide 39
Local Observations
  Word stress may disappear
  Beden sağlığımız için akşamları erken yatmalıyız.
  Mehmet daha bugün ödevine başlayabildi.

Slide 40
Local Observations
  Word stress disappears at the end of positives (terminal intonation)
  Nevin bugün menemen yemeli.
  Merve evine zamanında dönemez.

Slide 41
Local Observations
  Sentence stress (stress on focus)
  Nevin bugün menemen yemeli.
  Mehmet daha bugün ödevine başlayabildi.

Slide 42
Local Observations
  Effects on neighbour syllables
    Unstressed + stressed (ne+vin)
    Stressed + stressed (nevin+bu+gün)
  Nevin bugün menemen yemeli.

Slide 43
Local Observations
  Effects on neighbour syllables
    Stressed + stressed (partiye+gelmeyeceğim)
  Ben akşam partiye gelmeyeceğim.

Slide 44
Local Observations
  Effects on neighbour syllables
    Stressed + unstressed (gece+rüyasında)
  Kardeşim beni dün gece rüyasında görmüş.

Slide 45
Local Observations
  Effects on neighbour syllables
    Stressed + unstressed (ney+le)
  Bu geç vakitte sizin eve neyle döneceğiz?

Slide 46
Local Observations
  Effects on neighbour syllables
    Stressed + unstressed (last syllable, terminal intonation) (değil+di)
  Akşamki yemek pek güzel değildi.

Slide 47
Local Observations
  Effects on neighbour syllables
    Stressed + unstressed (last syllable, terminal intonation) (güzel+mi)
  Oralar yine eskisi gibi güzel mi?
Slide 48
Methodology: Overview
  Choose the best sentence from a sentence database
  Apply its pitch to the matching regions of the input sentence
    Compression / stretching
    Interpolation
  Fit data to the remaining regions using interpolation
  Blocks: Read Files -> Choose Best Sentence -> Generate Regional Durations -> Apply Pitch

Slide 49
Methodology: Read Files
  Input information used for sentences
    Sentence type (declarative, wh-question, yes-no question, conditional, imperative, exclamation)
    Sentence state (positive or negative)
    Categories of each word
    Number of syllables of each word
    The index of the syllable bearing word stress, for each word (stress in sentence coincides with word stress)

Slide 50
Methodology: Read Files
  Word categories rely mainly on part-of-speech (POS) categories:

  Category               Example            Gloss
  noun                   elma               apple
  adjective              güzel              beautiful
  pronoun                biz                we
  verb                   geliyorum          I'm coming
  adverb                 akşamleyin         in the evening
  postposition           kadar              as ... as
  conjunction            fakat              but
  interjection           aman
  wh-word                hangi              which
  question suffix word   almış mı           did he take
  conditional            iyiyse             if good
  number                 beş                five
  auxiliary              şikayet (etti)     (he complained)
  component              Ali'nin            Ali's
  focus                  kitap (okuyor)     (he reads) book
  comma                  (,)

Slide 51
Methodology: Choose Best Sentence
  Search in the database to find the best sentence
  Search the template sentences with the same
    Type
    State
  as the input sentence
  Two different approaches for
    Sentences other than questions
    Question sentences

Slide 52
Sentences other than Questions
  Calculate sentence resemblance scores based on word resemblance scores (WRS)
  Choose the template sentence having the maximum sentence resemblance score
  Word Resemblance Score (WRS)
    Measure of resemblance of two words
    Consists of
      Regional resemblance score (RRS) -> word stress information
      Category match score (CMS) -> word categories
    WRS = RRS + CMS

Slide 53
Regional Resemblance Score (RRS)
  Makes use of the four regions defined for every word
    Region before the stressed syllable
    Stressed syllable
    Region after the stressed syllable
    Phrase-final syllable
  Measure of resemblance of any two words in terms of these regions
  Based on the number of syllables in each region
  Consists of
    Score of existing regions (ERS)
    Score of lacking regions (LRS)
  RRS = 0.9 x ERS + 0.1 x LRS

Slide 54
Calculation of ERS and LRS
  score = ERS = LRS = 0 (initialization)
  for all regions
    if the region exists in both words
      score = min( 1, (NSRW1 / NSRW2) )
      ERS = ERS + score
    else if the region is lacking in both words
      LRS = LRS + 1
    else
      LRS = LRS - 1
    endif
  endfor

  ERS: score of existing regions
  LRS: score of lacking regions
  NSRW1: number of syllables in the related region for the first word
  NSRW2: number of syllables in the related region for the second word

Slide 55
Category Match Score (CMS)
  Category match -> CMS
  CMS = 3.7 (maximum possible value of RRS)
Example
  Calculation of WRS for the words İstanbul and Ankara:

  Word       Region 1   Region 2   Region 3   Region 4
  Ankara     -          An         kara       -
  İstanbul   İs         tan        bul        -

  ERS = 1/1 + 1/2 = 3/2
  LRS = -1 + 1 = 0
  RRS = 0.9 x 3/2 + 0.1 x 0 = 1.35
  CMS = 3.7
  WRS = 1.35 + 3.7 = 5.05

Slide 56
Sentence Resemblance Score
  I1, I2, ..., IN: words of the input sentence
  D1, D2, ..., DM: words of the template sentence
  S: M x N score matrix with elements S(i,j), where S(i,j) = WRS of the pair (Di, Ij)
  Path: (Da, Ib), (Dc, Id), ..., (De, If) with 1 <= a < c < ... < e <= M and 1 <= b < d < ... < f <= N
  Score of the path: sum of the WRSs of its pairs
  TASK: find the path with the maximum score (maximum score path)
    Score of the maximum score path = sentence resemblance score
    Optimum combination of word pairings preserving order

Slide 57
EXAMPLE
  TEMPLATE: Geçen akşam hepimiz müziğin büyüsüne kapılmıştık.
  INPUT: Büyük dayımız Kadıköy'deki evinde senelerdir yalnız oturuyor.
  (akşam, Büyük), (müziğin, dayımız), (kapılmıştık, evinde): valid
  (hepimiz, dayımız), (geçen, evinde), (büyüsüne, yalnız): invalid
  (akşam, evinde), (müziğin, dayımız), (kapılmıştık, oturuyor): invalid
  (geçen, dayımız), (hepimiz, dayımız), (kapılmıştık, oturuyor): invalid

Slide 58
Procedure
  MPS: M x N matrix of maximum path scores
  CMPS: M x N x 2 matrix of maximum path score coordinates
  MPS(i,j): contains the score of the maximum score path beginning with the pair (Di, Ij)
  CMPS(i,j,k): contains the indices of the next pair in the same path (for example, if the max score path of (Di, Ij) is (Di, Ij), (Dm, In), ..., (Dp, Iq), then CMPS(i,j,1) = m and CMPS(i,j,2) = n)
  Recursive generation of MPS from itself and S
  CMPS generated from MPS

Slide 59
Procedure
  for i = M, M-1, ..., 1
    for j = N, N-1, ..., 1
      if (i = M) or (j = N)
        MPS(i,j) = S(i,j)
        CMPS(i,j,1) = CMPS(i,j,2) = EMPTY
      else
        MPS(i,j) = S(i,j) + value of the max element of { MPS(p,q) | i+1 <= p <= M and j+1 <= q <= N }
        CMPS(i,j,1) = first index of the max element of { MPS(p,q) | i+1 <= p <= M and j+1 <= q <= N }
        CMPS(i,j,2) = second index of the max element of { MPS(p,q) | i+1 <= p <= M and j+1 <= q <= N }
      endif
    endfor
  endfor

Slide 60

Slide 61
Finding the maximum score path from MPS and CMPS
  Sentence resemblance score = max over i,j of MPS(i,j) = MPS(a,b), for example
  MPS(a,b) -> the max score path begins with (Da, Ib)
  Look up CMPS(a,b,1) and CMPS(a,b,2) to obtain the second pair of the path
  If, for example, CMPS(a,b,1) = c and CMPS(a,b,2) = d -> (Dc, Id) is the second pair
  Similarly, look up CMPS(c,d,1) and CMPS(c,d,2) to obtain the third pair of the path, etc.
  The entire path is obtained

Slide 62
We obtained answers to the following questions:
  What is the maximum resemblance capacity of the template sentence to the input sentence?
    Answer: the sentence resemblance score (score of the max score path)
  How do we reach this maximum capacity, i.e. how do we match the words and choose the pairs?
    Answer: as in the max score path

Slide 63
Question Sentences
  Pitch curve of a question resembles the pitch curve of a word
  Whole question regarded as a word
  Use the same regions defined for words
    Region before the stressed syllable
    Stressed syllable (stressed syllable of the wh-word or question suffix word)
    Region after the stressed syllable
    Phrase-final syllable (exists for wh-questions)
  Use the same procedure that assigns RRS to words to assign a sentence resemblance score to the questions

Slide 64
EXAMPLE
  Sentences:
    Ayşe bugün evde hangi yemeği yaptı?
    Bu su sesi yukarıdan mı geliyor?
  Regions:
    Region 1             Region 2   Region 3        Region 4
    Ayşe bugün evde      han        gi yemeği yap   tı
    Bu su sesi yukarı    dan        mı geliyor      -
  Number of syllables per region:
    Region 1   Region 2   Region 3   Region 4
    6          1          5          1
    7          1          4          0

Slide 65
Methodology: Generate Regional Durations
  Region -> one or more syllables
  Inputs (related to input and template sentences):
    The label files
    The number of syllables for each word
    The index of the syllable bearing word stress, for each word
    The information whether the last syllable shows a pitch rise or not, for each word (conditional, wh-question)
  Assumes a perfect duration analysis for the input sentence (label file of the input sentence)
  Determines the durations of each region: the onset and end, for each word in both sentences

Slide 66
Methodology: Apply Pitch
  Inputs:
    Regional durations generated by the previous block
    Pitch contour of the template sentence
    The max score path pertaining to the input and template sentences
  For all pairs of the path, the pitch of the template sentence is applied to the input sentence, for the regions existing in both elements of a pair
  Usage of spline interpolation:
    Stretching / compression in time
    Data fitting for nonexisting regions

Slide 67
Improvements: Discarding Unvoiced Regions
  Problem: unvoiced regions of the template sentence + spline -> distortions
  Example:
    Input: Yıldızlar dünyadan gündüz görülmez
    Template: Zamanımı televizyonun karşısında boş yere harcayamam
    Path: (zamanımı, yıldızlar), (karşısında, dünyadan), (yere, gündüz), (harcayamam, görülmez)
    Problematic pairs: (karşısında, dünyadan) and (yere, gündüz)
    Unvoiced regions in karşısında (/k/, /ş/ and /s/) and yere
  Solution: discard zero samples (unvoiced) and then apply

Slide 68
  Yıldızlar dünyadan gündüz görülmez.

Slide 69
Improvements
  Problem: poor performance of the spline outside the borders of the data points to be interpolated
  Example:
    Input: Didem her akşam odasında günlük gazeteleri okur
    Template: Annem bize her zaman çok lezzetli yemekler pişirir
    Problematic pairs: (annem, didem) and (pişirir, okur)
  Solution: apply the value of the outermost data point to the whole region, if the region goes beyond this data point

  Word      Region 1   Region 2   Region 3   Region 4
  didem     di         dem        -          -
  annem     -          an         nem        -
  okur      o          kur        -          -
  pişirir   pişi       rir        -          -

Slide 70
  Didem her akşam odasında günlük gazeteleri okur.

Slide 71
Improvements
  Problem: the spline sometimes yields unsatisfactory results within the data points
  Example:
    Input: Çocuklar yazın güneşin altında fazla kalmamalı.
    Problematic region: /zın/ of yazın, generated by the spline
  Çocuklar yazın güneşin altında fazla kalmamalı.

Slide 72
Improvements
  Solution: check the spline; spline -> linear interpolation when necessary
  Spline check: linear regression line, upper threshold and lower threshold lines for the pitch of the template sentence
  If the spline exceeds the threshold lines: spline -> linear
  Figure: linear regression and the two threshold lines.

Slide 73
  Çocuklar yazın güneşin altında fazla kalmamalı.

Slide 74
Discussion: Performance at sentence ends
  Good -> choosing from the same type and state -> expected
  Microprosody degrades performance (unvoiced regions of the input sentence unknown)
  Kuzenim Nalan Oya'ya yarın alıyor.

Slide 75
Discussion: Performance at sentence ends
  Good -> choosing from the same type and state -> expected
  Microprosody degrades performance (unvoiced regions of the input sentence unknown)
  Mars'ta hayat var mıdır?
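
The "Choose Best Sentence" block (Slides 52 to 61) can be sketched in a few lines of Python. This is an illustrative reading of the slides, not the author's implementation. Assumptions: each word is represented by a 4-tuple of syllable counts for its regions (0 for a missing region), CMS is taken as 0 when categories differ (the slides only give the value for a match), indices are 0-based, and the names `regional_resemblance`, `word_resemblance` and `max_score_path` are hypothetical.

```python
# Regions per word: syllable counts for (before stress, stressed syllable,
# after stress, phrase-final syllable); 0 means the region is absent.
CMS_MATCH = 3.7  # value given on the slides for a category match


def regional_resemblance(w1, w2):
    """RRS = 0.9 x ERS + 0.1 x LRS for two 4-tuples of syllable counts."""
    ers, lrs = 0.0, 0
    for n1, n2 in zip(w1, w2):
        if n1 > 0 and n2 > 0:          # region exists in both words
            ers += min(1.0, n1 / n2)
        elif n1 == 0 and n2 == 0:      # region lacking in both words
            lrs += 1
        else:                          # region exists in only one word
            lrs -= 1
    return 0.9 * ers + 0.1 * lrs


def word_resemblance(w1, cat1, w2, cat2):
    """WRS = RRS + CMS; a CMS of 0 on mismatch is an assumption."""
    cms = CMS_MATCH if cat1 == cat2 else 0.0
    return regional_resemblance(w1, w2) + cms


def max_score_path(S):
    """Return (score, path) for the best order-preserving pairing in S.

    S[i][j] is the WRS of template word i and input word j. As on the
    slides, MPS[i][j] is S[i][j] plus the best MPS strictly below and to
    the right; `nxt` plays the role of the CMPS matrix.
    """
    M, N = len(S), len(S[0])
    mps = [[0.0] * N for _ in range(M)]
    nxt = [[None] * N for _ in range(M)]
    for i in range(M - 1, -1, -1):
        for j in range(N - 1, -1, -1):
            best = None
            for p in range(i + 1, M):
                for q in range(j + 1, N):
                    if best is None or mps[p][q] > mps[best[0]][best[1]]:
                        best = (p, q)
            mps[i][j] = S[i][j] + (mps[best[0]][best[1]] if best else 0.0)
            nxt[i][j] = best
    # The sentence resemblance score is the largest entry of MPS.
    start = max(((i, j) for i in range(M) for j in range(N)),
                key=lambda ij: mps[ij[0]][ij[1]])
    path, cell = [], start
    while cell is not None:            # follow the stored successors
        path.append(cell)
        cell = nxt[cell[0]][cell[1]]
    return mps[start[0]][start[1]], path
```

With İstanbul as `(1, 1, 1, 0)` and Ankara as `(0, 1, 2, 0)`, `regional_resemblance` reproduces the RRS of 1.35 from Slide 55. The nested search over successors is quadratic per cell, which is acceptable for sentence-sized matrices; a running maximum would be the natural optimization for longer inputs.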
Slide 76
Discussion: Performance at sentence ends
  Erroneous endings (increase instead of decrease) due to template pitch

Slide 77
Discussion: Performance at sentence ends
  Erroneous endings (increase instead of decrease) due to template pitch

Slide 78
Discussion: Performance at movements (rises and falls)
  Limited, since
    the method is confined to
      the capacity of the database (same type, state)
      the capacity of the template sentence
    prosodic boundaries (yazın) and accented syllables are unknown
  Çocuklar yazın güneşin altında fazla kalmamalı.

Slide 79
Discussion: Performance at movements (rises and falls)
  Limited, since
    the slope of the rise or fall may differ in the input and template sentences (bizim)
  Bizim Nevin domatesli menemen yemeli.

Slide 80
Discussion: Performance at movements (rises and falls)
  Limited, since
    there may be an absolute difference between the pitch values of both sentences (gündüz)
  Yıldızlar genellikle gündüz görülmez.

Slide 81
Discussion: Performance at movements (rises and falls)
  Limited, since
    microprosodic effects (kardeşim)
  Kardeşim Nalan yeni ayna alıyor.

Slide 82
Discussion: Performance at movements (rises and falls)
  Limited, since
    effects of rises and falls on neighbouring syllables are handled only partially (only within words)
  Example:
    Input: Merve bu sefer zamanında dönemez
    Template: Akşamki yemek pek güzel değildi
    Merve from yemek (/ye/ of yemek affected by /ki/ of akşamki)

  Word    Region 1   Region 2   Region 3   Region 4
  Merve   Mer        ve         -          -
  yemek   ye         mek        -          -

Slide 83
  Akşamki yemek pek güzel değildi.
  Merve bu sefer zamanında dönemez.

Slide 84
Discussion: Performance at questions
  High success due to their simple nature:
  Niçin sorularıma cevap vermiyorsun?

Slide 85
Discussion: Performance at questions
  High success due to their simple nature:
  Önce nereye bilgi verilmeli?

Slide 86
Discussion: Performance at questions
  High success due to their simple nature:
  Ona bu güzel kolyeyi satın almayacak mısın?
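
The three safeguards from the Improvements slides (67 to 72) can be illustrated with a small, dependency-free Python sketch. It is not the author's code. Assumptions: the spline values are computed elsewhere and passed in, the two threshold lines are taken as the regression line shifted up and down by a fixed `margin` (the slides do not say how they are placed), and all function names are hypothetical.

```python
def voiced_samples(times, pitch):
    """Discard zero (unvoiced) samples before any interpolation."""
    return [(t, f) for t, f in zip(times, pitch) if f > 0]


def interp_hold(points, t):
    """Linear interpolation; outside the data, hold the outermost value."""
    if t <= points[0][0]:
        return points[0][1]
    if t >= points[-1][0]:
        return points[-1][1]
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)


def regression_line(points):
    """Least-squares line through (time, pitch) points: (slope, intercept)."""
    n = len(points)
    mt = sum(t for t, _ in points) / n
    mf = sum(f for _, f in points) / n
    sxx = sum((t - mt) ** 2 for t, _ in points)
    sxy = sum((t - mt) * (f - mf) for t, f in points)
    slope = sxy / sxx if sxx else 0.0
    return slope, mf - slope * mt


def checked_contour(points, query_times, spline_values, margin):
    """Keep the spline unless it leaves the band around the regression line.

    Assumption: the threshold lines are regression line +/- margin (in Hz).
    On violation the whole region falls back to linear interpolation.
    """
    slope, icpt = regression_line(points)
    inside = all(abs(v - (slope * t + icpt)) <= margin
                 for t, v in zip(query_times, spline_values))
    if inside:
        return list(spline_values)
    return [interp_hold(points, t) for t in query_times]
```

For example, template samples `[100, 0, 120, 130]` at times `[0, 1, 2, 3]` lose the unvoiced sample at time 1, and a query before time 0 or after time 3 simply returns 100 or 130 instead of an extrapolated spline value.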
Slide 87
Discussion: Objective Evaluation
  Pitch -> speech melody, human perception -> ST scale
  Distance d in ST between two frequencies f1 and f2 is given as:
    d = 12 x log2(f1 / f2)
  Metrics
    Mean squared distance between original and synthesized, in ST
    Proportion of distances < 2 ST
  Compare with a baseline solution constructed as:
    6 types x 2 states -> 12 groups of DB sentences
    For each sentence -> median of nonzero pitch
    Average of the medians of the sentences in each group -> 12 baselines

Slide 88
Discussion: Objective Evaluation

                          Average mean square distance in ST     Average proportion of distance < 2 ST
  Sentence domain         Method    Baseline   p                 Method    Baseline   p
  Close test sentences    4.6514    10.5670    2.2160 x 10^-5    0.6573    0.4682     0.0043
  Random test sentences   6.9741    8.7683     0.2128            0.6016    0.5928     0.8616
  All sentences           5.7814    9.6920     9.6801 x 10^-5    0.6302    0.5288     0.0160
  All questions           4.3090    9.9181     0.0026            0.7084    0.4547     0.0081

Slide 89
Discussion: Objective Evaluation
  Number of sentences

                          Mean square distance in ST             Proportion of distance < 2 ST
  Sentence domain         Method is better   Baseline is better  Method is better   Baseline is better
  Close test sentences    15                 4                   14                 5
  Random test sentences   14                 4                   11                 7
  All sentences           29                 8                   25                 12
  All questions           101 1

Slide 90
Discussion: Objective Evaluation Results
  Method better than baseline in general
  Performance at close test sentences > performance at random test sentences
  Best results in questions
  Similar results in both metrics
  ANOVA (analysis of variance)
    p = the probability that the means belonging to each method are equal
    Small p -> difference between the averages is statistically significant

Slide 91
Conclusion
  Intonation and stress -> fundamental frequency
  Analysis of pitch contours
  Method based on
    syntactic structure in terms of word categories and word stress information
      Automatic generation of these inputs from text is relatively easy.
    Makes use of
      a sentence database (corpus of natural speech)
      interpolation
  Recordings of a single speaker

Slide 92
Future Work
  Inclusion of other speakers
  A further categorization of words instead of POS categories -> subcategories -> more complex syntactic structures -> larger database for efficiency
  Other inputs:
    prosodic boundaries
    accented syllables
    and their automatic generation from input text (prosodic description)
  Handling microprosody
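
The objective evaluation on Slide 87 reduces to a few formulas, sketched below in Python for illustration. Assumptions not stated on the slides: contours are compared frame by frame, only frames voiced (nonzero) in both contours are scored, and the function names are hypothetical.

```python
import math
from statistics import median


def st_distance(f1, f2):
    """Distance in semitones between two frequencies: 12 x log2(f1 / f2)."""
    return 12 * math.log2(f1 / f2)


def evaluate(original, synthesized):
    """Return (mean squared ST distance, proportion of distances < 2 ST).

    Assumption: frames that are unvoiced (0) in either contour are skipped.
    """
    dists = [st_distance(o, s)
             for o, s in zip(original, synthesized) if o > 0 and s > 0]
    msd = sum(d * d for d in dists) / len(dists)
    prop = sum(abs(d) < 2 for d in dists) / len(dists)
    return msd, prop


def group_baseline(sentences):
    """Baseline pitch of one type/state group of database sentences.

    Each sentence contributes the median of its nonzero pitch samples;
    the baseline is the average of these medians.
    """
    medians = [median([f for f in s if f > 0]) for s in sentences]
    return sum(medians) / len(medians)
```

An octave corresponds to 12 ST, so `st_distance(200, 100)` is 12.0. A constant contour at the value returned by `group_baseline` can be passed to `evaluate` as the synthesized contour to score the baseline in the same way as the method.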