researcharticle end-to-end speech synthesis for tibetan

Research ArticleEnd-to-End Speech Synthesis for Tibetan Multidialect

Xiaona Xu Li Yang Yue Zhao and Hui Wang

School of Information Engineering Minzu University of China Beijing 100081 China

Correspondence should be addressed to Li Yang 654577893qqcom and Yue Zhao zhaoyuesomuceducn

Received 30 October 2020 Revised 27 December 2020 Accepted 12 January 2021 Published 27 January 2021

Academic Editor Ning Cai

Copyright copy 2021 Xiaona Xu et al 1is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

1e research on Tibetan speech synthesis technology has beenmainly focusing on single dialect and thus there is a lack of researchon Tibetan multidialect speech synthesis technology 1is paper presents an end-to-end Tibetan multidialect speech synthesismodel to realize a speech synthesis system which can be used to synthesize different Tibetan dialects Firstly Wylie transliterationscheme is used to convert the Tibetan text into the corresponding Latin letters which effectively reduces the size of training corpusand the workload of front-end text processing Secondly a shared feature prediction network with a cyclic sequence-to-sequencestructure is built which maps the Latin transliteration vector of Tibetan character to Mel spectrograms and learns the relevantfeatures of multidialect speech data 1irdly two dialect-specific WaveNet vocoders are combined with the feature predictionnetwork which synthesizes the Mel spectrum of Lhasa-U-Tsang and Amdo pastoral dialect into time-domain waveform re-spectively 1e model avoids using a large number of Tibetan dialect expertise for processing some time-consuming tasks such asphonetic analysis and phonological annotation Additionally it can directly synthesize Lhasa-U-Tsang and Amdo pastoral speechon the existing text annotation 1e experimental results show that the synthesized speech of Lhasa-U-Tsang and Amdo pastoraldialect based on our proposed method has better clarity and naturalness than the Tibetan monolingual model

1 Introduction

Speech synthesis also known as text-to-speech (TTS)technology mainly solves the problem of converting textinformation into audible sound information Up to nowspeech synthesis technology has become one of the mostcommonly used methods of human-computer interaction Itis gradually replacing traditional human-computer inter-action methods making human-computer interaction moreconvenient and faster With the continuous development ofspeech synthesis technology multilingual speech synthesistechnology has become a research interest for researchers1is technology can realize the synthesis of different lan-guages in a unified speech synthesis system [1ndash3]

1ere are lots of ethnic minorities in China Manyethnic minorities have their own languages and scriptsTibetan is one of the minority languages it can be dividedinto three major dialects U-Tsang dialect Amdo dialectand Kham dialect which are mainly used in Tibet QinghaiSichuan Gansu and Yunnan All dialects use Tibetancharacters as written text but there are some differences in

the pronunciation of each dialect so it is difficult for thepeople who use different dialects to communicate with eachother 1ere have been some research studies on Lhasa-U-Tsang dialect speech synthesis technology [4ndash12] 1eend-to-end method [12] has more training advantages thanthe statistical parameter method and the synthesis effect isbetter 1ere are few existing research studies on the speechsynthesis of Amdo dialect and only the work [13] appliedthe statistical parameter speech synthesis (SPSS) based onthe hidden Markov model (HMM) for Tibetan Amdodialect

For the multilingual speech synthesis the research worksmainly use unit-selection concatenative synthesis techniqueSPSS based on HMM and deep learning technology 1eunit-selection concatenative synthesis technique mainlyincludes selecting an unit scale constructing a corpus anddesigning an algorithm of unit selection and splicing 1ismethod relies on a large-scale corpus [14 15] Additionallythe synthesis effect is unstable and the connection of thesplicing unit may have discontinuities SPSS technologyusually requires a complex text front-end to extract various

HindawiComplexityVolume 2021 Article ID 6682871 8 pageshttpsdoiorg10115520216682871

linguistic features from raw text a duration model anacoustic model which is used to learn the transformationbetween linguistic features and acoustic features and acomplex signal-processing-based vocoder to reconstructwaveform from the predicted acoustic features 1e work[16] proposes a framework for estimating HMM on datacontaining both multiple speakers and multiple languagesaiming to transfer a voice from one language to others 1eworks [2 17 18] propose a method to realize HMM-basedcross-lingual SPSS using speaker adaptive training Forspeech synthesis technology based on deep learning thework [19] realizes a deep neural network- (DNN-) basedMandarin-Tibetan bilingual speech synthesis 1e experi-mental results show that synthesized Tibetan speech is betterthan the HMM-based Mandarin-Tibetan cross-lingualspeech synthesis 1e work [20] trains the acoustic modelswith DNN hybrid long short-term memory (LSTM) andhybrid bidirectional long short-term memory (BLSTM) andimplements a deep learning-based Mandarin-Tibetan cross-lingual speech synthesis under a unique framework Ex-periments demonstrated that the hybrid BLSTM-basedcross-lingual speech synthesis framework was better thanthe Tibetan monolingual framework Additionally there aresome research studies which reveal that multilingual speechsynthesis using the end-to-end method gains a good per-formance 1e work [21] presents an end-to-end multilin-gual speech synthesis model using a Unicode encodingldquobyterdquo input representation to train a model which outputsthe corresponding audio of English Spanish or Mandarin1e work [22] proposes a multispeaker multilingual TTSsynthesis model based on Tacotron which is able to producehigh-quality speech in multiple languages

Taking into account that traditional methods require a lotof professional knowledge for phoneme analysis tone andprosody labelling the work is time-consuming and costly andthe modules are usually trained separately which will lead tothe effect of error stacking [23] while the end-to-end speechsynthesis system can automatically learn alignments andmapping from linguistic features to acoustic features 1esesystems can be trained onlt text audiogt pairs withoutcomplex language-dependent text front-end Inspired byabove works this paper proposes to use an end-to-endmethod to implement speech synthesis in Lhasa-U-Tsang andAmdo pastoral dialect using a single sequence-to-sequence(seq2seq) architecture with attention mechanism as theshared feature prediction network for Tibetan multi-dialectand introducing two dialect-specific WaveNet networks torealize the generation of time-domain waveforms

1ere are some similarities between this work and works[12 24] 1e WaveNet model is used in these worksHowever in our work and [12] the WaveNet model is usedfor the generation of waveform sample with the input ofpredicted Mel spectrogram for speech synthesis In the work[24] about speech recognition the WaveNet model is usedfor the generation of text sequence and the input is MFCCfeatures 1e work [12] achieved the speech synthesis forTibetan Lhasa-U-Tsang by using end-to-end model In thispaper we improved themodel of the work [12] to implementmultidialect speech synthesis

Our contributions can be summarized as follows (1) Wepropose an end-to-end Tibetanmultidialect speech synthesismodel which unifies all the modules into one model andrealizes the speech synthesis for different Tibetan dialectsusing one speech synthesis system (2) Joint learning is usedto train the shared feature prediction network by learningthe relevant features of multidialect speech data and it ishelpful to improve the speech synthesis performance ofdifferent dialects (3) We useWylie transliteration scheme toconvert the Tibetan text into the corresponding Latin letterswhich is used as the training units of the model It effectivelyreduces the size of training corpus reduces the workload offront-end text processing and improves the modellingefficiency

1e rest of this paper is organized as follows Section 2introduces the end-to-end Tibetan multidialect speechsynthesis model 1e experiments are presented in detail inSection 3 and the results are discussed as well Finally wedescribe our conclusions in Section 4

2 Model Architecture

1e end-to-end speech synthesis model is mainly composedof two parts the first part contains a seq2seq feature pre-diction network containing attention mechanism and thesecond part contains two dialect-specific WaveNet vocodersbased on Mel spectrogram 1e model adopts a synthesismethod from text to intermediate representation and in-termediate representation to speech waveform 1e encoderand decoder implement the conversion from text to inter-mediate representation and the WaveNet vocoders restorethe intermediate representation into waveform samplesFigure 1 shows the end-to-end Tibetan multidialect speechsynthesis model

21 Front-End Processing Although Tibetan pronunciationhas evolved over thousands of years the orthography ofwritten language remains unchanged It led to Tibetanspelling becoming very complicated Tibetan sentence iswritten from left to right and consists of a series of singlesyllable Single syllable is also called Tibetan character 1epunctuation mark ldquorsquordquomeans the ldquosoundproof symbolrdquo be-tween the syllables and the single hanging symbol ldquo|rdquo is usedat the end of a phrase or sentence Figure 2 shows a Tibetansentence

Each syllable in Tibetan has a root which is the centralconsonant of the syllable A vowel label can be added aboveor below the root to indicate different vowels Sometimesthere is a superscript at the top of the root one or twosubscripts at the bottom and a prescript at the front in-dicating that the initials of the syllable are compoundconsonants 1e sequence of connection of compoundconsonants is prescript superscript root and subscriptSometimes there is one or two postscripts after the rootwhich means that the syllable has one or two consonantendings 1e structure of Tibetan syllables is shown inFigure 3

2 Complexity

Due to the complexity of Tibetan spelling a Tibetansyllable can have as many as 20450 possibilities If a singlesyllable of a Tibetan character is used as the basic unit ofspeech synthesis a large amount of speech data will need tobe trained and the corpus construction workload will behuge 1e existing Tibetan speech synthesis system [10 13]uses the initials and vowels of Tibetan characters as the inputof the model which requires a lot of professional knowledgeof Tibetan linguistics and the front-end text processing Inthis paper we adopt the Wylie transliteration scheme usingonly the basic 26 Latin letters without adding letters andsymbols to convert the Tibetan text into the correspondingLatin letters It effectively reduces the size of training corpusreduces the workload of front-end text processing and

improves modelling efficiency Figure 4 shows the convertedTibetan sentence obtained by using the Wylie transliterationscheme for the Tibetan sentence in Figure 2

22 e Shared Feature Prediction Network We use Lhasa-U-Tsang and Amdo dialect datasets to train the sharedfeature prediction network and capture the relevant featuresbetween two dialects speech data by joint learning 1eshared feature prediction network is used to map the Latintransliteration vector of Tibetan character to Mel spectro-grams In this process Lhasa-U-Tsang and Amdo dialectshare the same feature prediction network 1e sharedfeature prediction network consists of an encoder an at-tention mechanism and a decoder

221 Encoder 1e encoder module is used to extract thetext sequence representation including a character em-bedding layer 3 convolutional layers and a long short-termmemory (LSTM) layer as shown in the lower left part ofFigure 1 Firstly the input Tibetan characters are embeddedinto sentence vectors using character embedding layer andthen input into 3 convolutional layers 1ese convolutionallayers model longer-term context in the input charactersequence and the output of the final convolutional layer will

Output waveform samples

Intput Latin letters correspondingto Tibetan sentences

WaveNet MoL for Amdo WaveNet MoL for Lhasa

Mel spectrogram

LSTM

3 conv layers

Character embedding

Attention

Postnet

Linear projection

LSTM layers

Prenet

0

0

20

50 100 150 200

4060

Figure 1 End-to-end Tibetan multidialect speech synthesis model

Figure 2 A Tibetan sentence

Prescript

Prescript

Vowel

Vowel

Superscipt

Superscipt

Root

Root

Subscript

Subscript

Postscript

Postscript

Post-postscript

Post-postscript

Figure 3 1e structure of Tibetan syllables

Complexity 3

be used as the input of a single bidirectional LSTM layer togenerate intermediate representations

222 Decoder 1e decoder consists of a prenet layer LSTMlayers and a linear projection layer as shown in the lower rightpart of Figure 1 1e decoder is an autoregressive recurrentneural network which is used to predict the output spectro-gram according to the encoded input sequence 1e result ofthe previous prediction is first input to a prenet layer and theoutput of the prenet layer and the output context vector of theattention mechanism network are concatenated and passedthrough 2 unidirectional LSTM layers 1en the LSTM outputand the attention context vector are concatenated and passedthrough a linear project layer to predict the target spectrogramframe Finally the predicted Mel spectrogram passes through apostnet and the residual connection ismade with the predictedspectrum to obtain the Mel spectrogram

223 Attention Mechanism 1e input sequence in theseq2seq structure will be compiled into a feature vector C of acertain dimension 1e feature vector C always links theencoding and decoding stages of the encoder-decoder model1e encoder compresses the information of the entire sequenceinto a fixed-length vector But with the continuous growth of thesequence this will cause the feature vector to fail to fully rep-resent the information of the entire sequence and the latterinput sequence will easily cover the first input sequence whichwill cause the loss of many detailed information To solve thisproblem an attention mechanism is introduced 1is mecha-nismwill encode the encoder into different ci according to eachtime step of the sequence that is the original unified featurevector Cwill be replacedwith a constantly changing ci accordingto the current generated word When decoding combine eachdifferent ci to decode the output so that when each output isgenerated the information carried by the input sequence can befully utilized and the result will be more accurate

1e feature vector ci is obtained by adding the hiddenvector sequence (h1 h2 hTx

) during encoding accordingto the weight as shown in equation (1) αij is the weightvalue as in equation (2) which represents the matchingdegree between the jth input of the encoder and the ithoutput of the decoder

ci 1113944

Tx

j1αijhj (1)

αij exp eij1113872 1113873

1113936Tx

k1 exp eik( 1113857 (2)

23 WaveNet Vocoder Tacotron [25 26] launched byGoogle can convert phonetic characters or text data intofrequency spectrum and it also needs a vocoder to restore

the frequency spectrum to waveforms to obtain synthe-sized speech Tacotronrsquos vocoder uses the GriffinndashLimalgorithm for waveform reconstruction 1e GriffinndashLimalgorithm is an algorithm to reconstruct speech under thecondition that only the amplitude spectrum is known andthe phase spectrum is unknown It is a relatively classicvocoder with simple and efficient algorithm Howeverbecause the waveform generated by the GriffinndashLim vo-coder is too smooth the synthesized voice has a poorquality and sounds obviously ldquomechanicalrdquo WaveNet is atypical autoregressive generation model which can im-prove the quality of synthetic speech 1erefore this workuses the WaveNet model as a vocoder to cover the lim-itation of the GriffinndashLim algorithm 1e sound waveformis a one-dimensional array in time domain and the audiosampling points are usually relatively large 1e waveformdata at a sampling rate of 16 kHz will have 16000 elementsper second which requires a large amount of calculationusing ordinary convolution In this regard WaveNet usescausal convolution which can increase the receptive fieldof convolution But causal convolution requires moreconvolutional layers which is computationally complexand costly 1erefore WaveNet has adopted the methodof dilated causal convolution to expand the receptive fieldof convolution without significantly increasing theamount of calculation 1e dilated convolution is shownin Figure 5 When the network generates the next elementit can use more previous element values WaveNet iscomposed of stacked dilated causal convolutions andsynthesizes speech by fitting the distribution of audiowaveforms by the autoregressive method that is Wave-Net predicts the next sampling point according to anumber of input sampling points and synthesizes speechby predicting the value of the waveform at each time pointwaveform

In the past traditional acoustic and linguistic featureswere used as the input of the WaveNet model for speechsynthesis In this paper we choose a low-level acousticrepresentation Mel spectrogram as the input of WaveNetfor training 1e Mel spectrogram emphasizes the details oflow frequency which is very important for the clarity ofspeech And compared to the waveform samples the phaseof each frame inMel spectrogram is unchanged it is easier totrain with the square error loss We train WaveNet vocodersfor Lhasa-U-Tsang dialect and Amdo pastoral dialect andthey can synthesize the corresponding Tibetan dialects withthe corresponding WaveNet vocoder

24 Training Process Training process can be summarizedinto 2 steps firstly training the shared feature predictionnetwork secondly training a dialect-specific WaveNet vo-coder for Lhasa-U-Tsang dialect and Amdo pastoral dialectrespectively based on the outputs generated by the networkwhich was trained in step 1

We trained the shared feature prediction network on thedatasets of Lhasa-U-Tsang dialect and Amdo pastoral dia-lect On a single GPU we used the teacher-forcingmethod totrain the feature prediction network and the input of the

da res kyi don rkyen thog nas bzo parsquoi gral rim ni blos lsquogel chog pa zhig yin pa dang

Figure 4 A Tibetan sentence after Wylie transliteration

4 Complexity

decoder was the correct output not the predicted outputwith a batch size of 8 An Adam optimizer was used withβ1 09 β2 0999 and ε 10minus 6 1e learning rate de-creased from 10minus 3 to 10minus 4 after 40000 iterations

1en the predicted outputs from the shared featureprediction network were aligned with the ground truth Wetrained the WaveNet for Lhasa-U-Tsang dialect and Amdopastoral dialect respectively by using the aligned predictedoutputs It means that these predicted data were generated inthe teacher-forcing mode1erefore each spectrum frame isexactly aligned with a sample of the waveform In the processof training the WaveNet network we used an Adam opti-mizer with β1 09 β2 0999 and ε 10minus 6 and thelearning rate was fixed at 10minus 3

3 Results and Analysis

31 Experimental Data 1e training data consist of theLhasa-U-Tsang dialect and Amdo pastoral dialect 1eLhasa-U-Tsang dialect speech data are about 143 hours with2000 text sentences 1e Amdo pastoral dialect speech dataare about 268 hours with 2671 text sentences Speech datafiles are converted to 16 kHz sampling rate with 16 bitquantization accuracy

32 Experimental Evaluation In order to ensure the accu-racy of the experimental results we apply two methodsobjective and subjective experiments to evaluate the ex-perimental results

In objective experiment the root mean square error(RMSE) of the time-domain sequences is calculated tomeasure the difference between the synthesized speech andthe reference speech 1e smaller the RMSE is the closer thesynthesized speech is to the reference and the better theeffect of speech synthesis is 1e formula of RMSE is shownin equation (3) where x1t and x2t respectively representthe value of the time series of reference speech and syn-thesized speech at time t

RMSE

1113936nt1 x1t minus x2t1113872 1113873

2

n

1113971

(3)

For Lhasa-U-Tsang dialect and Amdo pastoral dialectwe randomly select 10 text sentences use end-to-end Ti-betan multi-dialect speech synthesis model for speechsynthesis and calculate the average RMSE to evaluate thecloseness of the synthesized speech of the Lhasa-U-Tsang

dialect and Amdo pastoral dialect to the reference speech Inorder to evaluate the performance of the model we compareit with the end-to-end Tibetan Lhasa-U-Tsang dialect speechsynthesis model and end-to-end Tibetan Amdo pastoraldialect speech synthesis model 1ese two models were usedto synthesize the same 10 text sentences and the averageRMSE was calculated 1e results are shown in Table 1 ForLhasa-U-Tsang dialect the RMSE of the multidialect speechsynthesis model is 02126 which is less than the one ofLhasa-U-Tsang dialect speech synthesis model (02223) ForAmdo pastoral dialect the RMSE of the multidialect speechsynthesis model is 01223 which is less than the one of Amdopastoral dialect speech synthesis model (01253) It meansthat both Lhasa-U-Tsang dialect and Amdo pastoral dialectwhich are synthesized by our model are closer to theirreference speech 1e results show that our method hascapability of the feature representation for both Lhasa-U-Tsang and Amdo pastoral dialect through the sharedfeature prediction network so as to improve the multidialectspeech synthesis performance against single dialect Besidesthe synthetic speech effect of Amdo pastoral dialect is betterthan that of Lhasa-U-Tsang dialect because the data scale ofAmdo pastoral dialect is larger than that of Lhasa-U-Tsangdialect

Figures 6 and 7 respectively show the predicted Melspectrogram and target Mel spectrogram output by thefeature prediction network for Lhasa-U-Tsang dialect andAmdo pastoral dialect It can be seen from the figures thatthe predicted mel spectrograms of Lhasa-U-Tsang dialectand Amdo pastoral dialect are both similar to the target Melspectrograms

In subjective experiment the absolute category rating(ACR) measurement method was used to evaluate thesynthesized speech of the Lhasa-U-Tsang and Amdo pastoraldialects mentioned above In the ACR measurement weselected 25 listeners After listening to the synthesizedspeech we used the original speech as a reference and scoredthe synthesized speech according to the grading standard inTable 2 After obtaining the scores given by all listeners themean opinion score (MOS) of the synthesized speech wascalculated and Table 3 shows the results 1e MOS values ofthe synthesized speech in Lhasa-U-Tsang dialect and Amdopastoral dialects are 395 and 418 respectively which meansthat the synthesized speech has good clarity and naturalness

33 Comparative Experiment In order to verify the per-formance of the end-to-end Tibetan multidialect speechsynthesis system we have compared it with the ldquolinearprediction amplitude spectrum+GriffinndashLimrdquo and ldquoMelspectrogram+GriffinndashLimrdquo speech synthesis system 1eresults of comparison experiment are shown in Table 4According to Table 4 it can be seen that whether it is Lhasa-U-Tsang dialect or Amdo pastoral dialect the MOS value ofthe synthesized speech of ldquoMel spectrogram+GriffinndashLimrdquospeech synthesis system is higher than that of ldquolinear pre-diction amplitude spectrum+GriffinndashLimrdquo speech synthesissystem 1e results show that the Mel spectrogram is moreeffective as a predictive feature than the linear predictive

Input

Output

Hidden layer

Hidden layer

Figure 5 Dilated causal convolution [27]

Complexity 5

0 50 300100 150 200 250 350 400

0

20

40

60

Predicted Mel spectrogram

Target Mel spectrogram

0 50 300100 150 200 250 350 400

0

20

40

60

ndash4 ndash3 ndash2 ndash1 0 1 2 3 4


Figure 6 1e comparison of the output Mel spectrogram and the target Mel spectrogram of Lhasa-U-Tsang dialect

Table 1 Objective evaluation of the results

Tibetan dialect 1e RMSE of end-to-end Tibetanmultidialect speech synthesis model

1e RMSE of end-to-end Tibetan Lhasa-U-Tsang dialect speech synthesis model

1e RMSE of end-to-end Tibetan Amdopastoral dialect speech synthesis model

Lhasa-U-Tsangdialect

02126 02223 mdash

Amdo pastoraldialect 01223 mdash 01253

0

0 50 100 150 200

20

40

60



0

20

40

60

0 50 100 150 200



Figure 7 1e comparison of the output Mel spectrogram and the target Mel spectrogram of Amdo pastoral dialect

6 Complexity

amplitude spectrum and the quality of the generated speechis higher 1e ldquoMel spectrogram+WaveNetrdquo speech syn-thesis system outperforms the ldquoMelspectrogram+GriffinndashLimrdquo speech synthesis system withthe higher MOS value which means that WaveNet has abetter performance in recovering speech phase informationand generating higher quality of the synthesis speech thanthe GriffinndashLim algorithm

4 Conclusion

1is paper builds an end-to-end Tibetan multidialect speechsynthesis model including a seq2seq feature predictionnetwork which maps the character vector to the Melspectrogram and a dialect-specific WaveNet vocoder forLhasa-U-Tsang dialect and Amdo pastoral dialect respec-tively which synthesizes the Mel spectrogram into time-domain waveform Our model can utilize dialect-specificWaveNet vocoders to synthesize corresponding Tibetandialect In the experiments Wylie transcription scheme isused to convert Tibetan characters into Latin letters whicheffectively reduces the number of composite primitives andthe scale of training data Both objective and subjectiveexperimental results show that the synthesized speech ofLhasa-U-Tsang dialect and Amdo pastoral dialect has highqualities

Data Availability

1e data used to support the findings of this studyare available from the corresponding author uponrequest

Conflicts of Interest

1e authors declare that they have no conflicts of interest

Acknowledgments

1is study was supported by the National Natural ScienceFoundation of China under grant no 61976236

References

[1] Y J Wu Y Nankaku and K Tokuda ldquoState mapping basedmethod for cross-lingual speaker adaptation in HMM-basedspeech synthesisrdquo in Proceedings of the Interspeech 10thAnnual Conference of the International Speech Communica-tion Association Brighton UK September 2009

[2] H Liu Research on HMM-Based Cross-Lingual Speech Syn-thesi University of Science and Technology of China HefeiChina 2011

[3] R SproatMultilingual Text-To-Speech Synthesis e Bell LabsApproach Kluwer Academic Publishers Amsterdam Neth-erland 1998

[4] Z M Cairang Research on Tibetan Speech Synthesis Tech-nology Based on Mixed Primitives Shanxi Normal UniversityXirsquoan China 2016

[5] L Gao Z H Yu andW S Zheng ldquoResearch on HMM-basedTibetan Lhasa speech synthesis technologyrdquo Journal ofNorthwest University for Nationalities vol 32 no 2 pp 30ndash35 2011

[6] J X Zhang Research on Tibetan Lhasa Speech Synthesis Basedon HMM Northwest University for Nationalities LanzhouChina 2014

[7] S P Xu Research on Speech Quality Evaluation for TibetanStatistical Parametric Speech Synthesis Northwest NormalUniversity Lanzhou China 2015

[8] X J Kong Research on Methods of Text Analysis for TibetanStatistical Parametric Speech Synthesis Northwest NormalUniversity Lanzhou China 2017

[9] Y Zhou and D C Zhao ldquoResearch on HMM-based Tibetanspeech synthesisrdquo Computer Applications and Softwarevol 32 no 5 pp 171ndash174 2015

[10] G C Du Z M Cairang Z J Nan et al ldquoTibetan speechsynthesis based on neural networkrdquo Journal of Chinese In-formation Processing vol 33 no 2 pp 75ndash80 2019

[11] L S Luo G Y Li C W Gong and H L Ding ldquoEnd-to-endspeech synthesis for Tibetan Lhasa dialectrdquo Journal of PhysicsConference Series vol 1187 no 5 2019

[12] Y Zhao P Hu X Xu L Wu and X Li ldquoLhasa-Tibetanspeech synthesis using end-to-endmodelrdquo IEEE Access vol 7pp 140305ndash140311 2019

[13] L Su Research on the Speech Synthesis of Tibetan AmdoDialect Based on HMM Northwest Normal UniversityLanzhou China 2018

[14] S Quazza L Donetti L Moisa and P L Salza ldquoActor amultilingual unit-selection speech synthesis systerdquo in Proceedingsof the 4th ISCA Workshop on Speech Synthesis Perth Australia2001

[15] F Deprez J Odijk and J D Moortel ldquoIntroduction tomultilingual corpus-based concatenative speech synthesisrdquo inProceedings of the Interspeech 8th Annual Conference of theInternational Speech Communication Association pp 2129ndash2132 Antwerp Belgium August 2007

[16] H Zen N Braunschweiler S Buchholz et al ldquoStatisticalparametric speech synthesis based on speaker and language

Table 2 Grading standards of ACR

Grading value Estimated quality5 Very good4 Good3 Medium2 Bad1 Very bad

Table 3 1e MOS comparison of speech synthesized by differentsynthesis primitive models

Tibetan dialect MOSLhasa-U-Tsang dialect 395Amdo pastoral dialect 418

Table 4 1e MOS comparison of speech synthesized by differentmodels

ModelMOS ofLhasa-U-

Tsang dialect

MOS of Amdopastoraldialect

Linear predictive amplitudespectrum+GriffinndashLim 330 352

Mel spectrogram+GriffinndashLim 355 370Mel spectrogram+WaveNet 395 418

Complexity 7

factorizationrdquo IEEE Transactions on Audio Speech andLanguage Processing vol 20 no 6 pp 1713ndash1724 2012

[17] H Y Wang Research on Statistical Parametric Mandarin-Tibetan Cross-Lingual Speech Synthesis Northwest NormalUniversity Lanzhou China 2015

[18] L Z Guo Research on Mandarin-Xingtai Dialect Cross-Lingual Speech Synthesis Northwest Normal UniversityLanzhou China 2016

[19] P W Wu Research on Mandarin-Tibetan Cross-LingualSpeech Synthesis Northwest Normal University LanzhouChina 2018

[20] W Zhang H Yang X Bu and L Wang ldquoDeep learning forMandarin-Tibetan cross-lingual speech synthesisrdquo IEEE Ac-cess vol 7 pp 167884ndash167894 2019

[21] B Li Y Zhang T Sainath Y HWu andW Chan ldquoBytes areall you need end-to-end multilingual speech recognition andsynthesis with bytesrdquo in Proceedings of the ICASSP BrightonUK May 2018

[22] Y Zhang R J Weiss H Zen et al ldquoLearning to speak fluentlyin a foreign language multilingual speech synthesis and cross-language voice cloningrdquo 2019 httpsarxivorgabs190704448

[23] Z Y Qiu D Qu and L H Zhang ldquoEnd-to-end speechsynthesis based on WaveNetrdquo Journal of Computer Appli-cations vol 39 no 5 pp 1325ndash1329 2019

[24] Y Zhao J Yue X Xu L Wu and X Li ldquoEnd-to-end-basedTibetan multitask speech recognitionrdquo IEEE Access vol 7pp 162519ndash162529 2019

[25] R Skerry-Ryan E Battenberg Y Xiao et al ldquoTowards end-to-end prosody transfer for expressive speech synthesis withtacotronrdquo in Proceedings of the International Conference onMachine Learning (ICML) Stockholm Sweden July 2018

[26] Y Wang D Stanton Y Zhang et al ldquoStyle tokens unsu-pervised style modeling control and transfer in end-to-endspeech synthesisrdquo in Proceedings of the International Con-ference on Machine Learning (ICML) Stockholm SwedenJuly 2018

[27] A V D Oord S Dieleman H Zen et al ldquoWaveNet agenerative model for raw audiordquo 2016 httpsarxivorgabs160903499

8 Complexity

linguistic features from raw text a duration model anacoustic model which is used to learn the transformationbetween linguistic features and acoustic features and acomplex signal-processing-based vocoder to reconstructwaveform from the predicted acoustic features 1e work[16] proposes a framework for estimating HMM on datacontaining both multiple speakers and multiple languagesaiming to transfer a voice from one language to others 1eworks [2 17 18] propose a method to realize HMM-basedcross-lingual SPSS using speaker adaptive training Forspeech synthesis technology based on deep learning thework [19] realizes a deep neural network- (DNN-) basedMandarin-Tibetan bilingual speech synthesis 1e experi-mental results show that synthesized Tibetan speech is betterthan the HMM-based Mandarin-Tibetan cross-lingualspeech synthesis 1e work [20] trains the acoustic modelswith DNN hybrid long short-term memory (LSTM) andhybrid bidirectional long short-term memory (BLSTM) andimplements a deep learning-based Mandarin-Tibetan cross-lingual speech synthesis under a unique framework Ex-periments demonstrated that the hybrid BLSTM-basedcross-lingual speech synthesis framework was better thanthe Tibetan monolingual framework Additionally there aresome research studies which reveal that multilingual speechsynthesis using the end-to-end method gains a good per-formance 1e work [21] presents an end-to-end multilin-gual speech synthesis model using a Unicode encodingldquobyterdquo input representation to train a model which outputsthe corresponding audio of English Spanish or Mandarin1e work [22] proposes a multispeaker multilingual TTSsynthesis model based on Tacotron which is able to producehigh-quality speech in multiple languages

Taking into account that traditional methods require a lotof professional knowledge for phoneme analysis tone andprosody labelling the work is time-consuming and costly andthe modules are usually trained separately which will lead tothe effect of error stacking [23] while the end-to-end speechsynthesis system can automatically learn alignments andmapping from linguistic features to acoustic features 1esesystems can be trained onlt text audiogt pairs withoutcomplex language-dependent text front-end Inspired byabove works this paper proposes to use an end-to-endmethod to implement speech synthesis in Lhasa-U-Tsang andAmdo pastoral dialect using a single sequence-to-sequence(seq2seq) architecture with attention mechanism as theshared feature prediction network for Tibetan multi-dialectand introducing two dialect-specific WaveNet networks torealize the generation of time-domain waveforms

1ere are some similarities between this work and works[12 24] 1e WaveNet model is used in these worksHowever in our work and [12] the WaveNet model is usedfor the generation of waveform sample with the input ofpredicted Mel spectrogram for speech synthesis In the work[24] about speech recognition the WaveNet model is usedfor the generation of text sequence and the input is MFCCfeatures 1e work [12] achieved the speech synthesis forTibetan Lhasa-U-Tsang by using end-to-end model In thispaper we improved themodel of the work [12] to implementmultidialect speech synthesis

Our contributions can be summarized as follows (1) Wepropose an end-to-end Tibetanmultidialect speech synthesismodel which unifies all the modules into one model andrealizes the speech synthesis for different Tibetan dialectsusing one speech synthesis system (2) Joint learning is usedto train the shared feature prediction network by learningthe relevant features of multidialect speech data and it ishelpful to improve the speech synthesis performance ofdifferent dialects (3) We useWylie transliteration scheme toconvert the Tibetan text into the corresponding Latin letterswhich is used as the training units of the model It effectivelyreduces the size of training corpus reduces the workload offront-end text processing and improves the modellingefficiency

1e rest of this paper is organized as follows Section 2introduces the end-to-end Tibetan multidialect speechsynthesis model 1e experiments are presented in detail inSection 3 and the results are discussed as well Finally wedescribe our conclusions in Section 4

2 Model Architecture

1e end-to-end speech synthesis model is mainly composedof two parts the first part contains a seq2seq feature pre-diction network containing attention mechanism and thesecond part contains two dialect-specific WaveNet vocodersbased on Mel spectrogram 1e model adopts a synthesismethod from text to intermediate representation and in-termediate representation to speech waveform 1e encoderand decoder implement the conversion from text to inter-mediate representation and the WaveNet vocoders restorethe intermediate representation into waveform samplesFigure 1 shows the end-to-end Tibetan multidialect speechsynthesis model

21 Front-End Processing Although Tibetan pronunciationhas evolved over thousands of years the orthography ofwritten language remains unchanged It led to Tibetanspelling becoming very complicated Tibetan sentence iswritten from left to right and consists of a series of singlesyllable Single syllable is also called Tibetan character 1epunctuation mark ldquorsquordquomeans the ldquosoundproof symbolrdquo be-tween the syllables and the single hanging symbol ldquo|rdquo is usedat the end of a phrase or sentence Figure 2 shows a Tibetansentence

Each syllable in Tibetan has a root which is the centralconsonant of the syllable A vowel label can be added aboveor below the root to indicate different vowels Sometimesthere is a superscript at the top of the root one or twosubscripts at the bottom and a prescript at the front in-dicating that the initials of the syllable are compoundconsonants 1e sequence of connection of compoundconsonants is prescript superscript root and subscriptSometimes there is one or two postscripts after the rootwhich means that the syllable has one or two consonantendings 1e structure of Tibetan syllables is shown inFigure 3

2 Complexity








Mel spectrogram

LSTM

3 conv layers

Character embedding

Attention

Postnet

Linear projection

LSTM layers

Prenet

0

0

20

50 100 150 200

4060



Prescript

Prescript

Vowel

Vowel

Superscipt

Superscipt

Root

Root

Subscript

Subscript

Postscript

Postscript

Post-postscript

Post-postscript


Complexity 3






ci 1113944

Tx

j1αijhj (1)

αij exp eij1113872 1113873

1113936Tx

k1 exp eik( 1113857 (2)








4 Complexity







RMSE

1113936nt1 x1t minus x2t1113872 1113873

2

n

1113971

(3)






Input

Output

Hidden layer

Hidden layer


Complexity 5

0 50 300100 150 200 250 350 400

0

20

40

60



0 50 300100 150 200 250 350 400

0

20

40

60









02126 02223 mdash


0

0 50 100 150 200

20

40

60



0

20

40

60

0 50 100 150 200




6 Complexity


4 Conclusion


Data Availability




Acknowledgments


References






















ModelMOS ofLhasa-U-

Tsang dialect




Complexity 7













8 Complexity








Mel spectrogram

LSTM

3 conv layers

Character embedding

Attention

Postnet

Linear projection

LSTM layers

Prenet

0

0

20

50 100 150 200

4060



Prescript

Prescript

Vowel

Vowel

Superscipt

Superscipt

Root

Root

Subscript

Subscript

Postscript

Postscript

Post-postscript

Post-postscript


Complexity 3






ci 1113944

Tx

j1αijhj (1)

αij exp eij1113872 1113873

1113936Tx

k1 exp eik( 1113857 (2)








4 Complexity







RMSE

1113936nt1 x1t minus x2t1113872 1113873

2

n

1113971

(3)






Input

Output

Hidden layer

Hidden layer


Complexity 5

0 50 300100 150 200 250 350 400

0

20

40

60



0 50 300100 150 200 250 350 400

0

20

40

60









02126 02223 mdash


0

0 50 100 150 200

20

40

60



0

20

40

60

0 50 100 150 200




6 Complexity


4 Conclusion


Data Availability




Acknowledgments


References






















ModelMOS ofLhasa-U-

Tsang dialect




Complexity 7













8 Complexity






ci 1113944

Tx

j1αijhj (1)

αij exp eij1113872 1113873

1113936Tx

k1 exp eik( 1113857 (2)








4 Complexity







RMSE

1113936nt1 x1t minus x2t1113872 1113873

2

n

1113971

(3)






Input

Output

Hidden layer

Hidden layer


Complexity 5

0 50 300100 150 200 250 350 400

0

20

40

60



0 50 300100 150 200 250 350 400

0

20

40

60









02126 02223 mdash


0

0 50 100 150 200

20

40

60



0

20

40

60

0 50 100 150 200




6 Complexity


4 Conclusion


Data Availability




Acknowledgments


References






















ModelMOS ofLhasa-U-

Tsang dialect




Complexity 7













8 Complexity







RMSE

1113936nt1 x1t minus x2t1113872 1113873

2

n

1113971

(3)






Input

Output

Hidden layer

Hidden layer


Complexity 5

0 50 300100 150 200 250 350 400

0

20

40

60



0 50 300100 150 200 250 350 400

0

20

40

60









02126 02223 mdash


0

0 50 100 150 200

20

40

60



0

20

40

60

0 50 100 150 200




6 Complexity


4 Conclusion


Data Availability




Acknowledgments


References






















ModelMOS ofLhasa-U-

Tsang dialect




Complexity 7













8 Complexity

0 50 300100 150 200 250 350 400

0

20

40

60



0 50 300100 150 200 250 350 400

0

20

40

60









02126 02223 mdash


0

0 50 100 150 200

20

40

60



0

20

40

60

0 50 100 150 200




6 Complexity


4 Conclusion


Data Availability




Acknowledgments


References






















ModelMOS ofLhasa-U-

Tsang dialect




Complexity 7













8 Complexity


4 Conclusion


Data Availability




Acknowledgments


References






















ModelMOS ofLhasa-U-

Tsang dialect




Complexity 7













8 Complexity













8 Complexity

researcharticle end-to-end speech synthesis for tibetan

Documents