performance analysis of bangla speech recognizer model using hmm

Performance analysis of Bangla Speech Recognizer model using Hidden Markov Model (HMM). Submitted by: Md. Abdullah-al-MAMUN

Upload: alpha-reaction

Post on 15-Aug-2015


Category:

Engineering



TRANSCRIPT

Page 1: Performance analysis of bangla speech recognizer model using hmm

Performance analysis of Bangla Speech Recognizer model using Hidden Markov Model (HMM)

Submitted by: Md. Abdullah-al-MAMUN

Page 2: Performance analysis of bangla speech recognizer model using hmm

OUTLINE

- What is speech recognition?
- The Structure of ASR
- Speech Database
- Feature Extraction
- Hidden Markov Model
  - Forward algorithm
  - Backward algorithm
  - Viterbi algorithm
- Training & Recognition
- Result
- Conclusions
- References

Page 3: Performance analysis of bangla speech recognizer model using hmm

What is Speech Recognition?

- In computer science, speech recognition is the translation of spoken words into text.
- It is the process of converting an acoustic signal captured by a microphone into a set of words.
- Speech recognition is also known as "Automatic Speech Recognition (ASR)" or "Speech to Text (STT)".

Page 4: Performance analysis of bangla speech recognizer model using hmm

Model of Bangla Speech Recognition

Fig -1 : Simple model of Bangla Speech Recognition

Page 5: Performance analysis of bangla speech recognizer model using hmm

The Structure of ASR System:

Figure 1: Functional Scheme of an ASR System

[Block diagram with the blocks Signal Interface, Feature Extraction, Training HMM, Recognition and Database(s); signal labels: Speech samples, X, Y, S, W*]

Page 6: Performance analysis of bangla speech recognizer model using hmm

Speech Database:

- A speech database is a collection of recorded speech accessible on a computer and supported with the necessary transcriptions.
- The database collects the observations required for parameter estimation.
- In this ASR system, I have used about 1200 keywords.

Page 7: Performance analysis of bangla speech recognizer model using hmm

Classification of Keywords

[Tree diagram rooted at "Bengal Word" with the nodes Independent, Dependent, Vowel, Consonant, Modifier Character and Compound Character]

Page 8: Performance analysis of bangla speech recognizer model using hmm

Database Creation Process

[Process diagram ending in the Database block]

Page 9: Performance analysis of bangla speech recognizer model using hmm

Speech Signal Analysis

Feature Extraction for ASR:

- The aim is to extract the voice features that distinguish the different phonemes of a language.

[Diagram: Feature Extraction block producing numeric feature vectors]

Page 10: Performance analysis of bangla speech recognizer model using hmm

MFCC extraction

[Block diagram: Speech signal x(n) → Pre-emphasis → x'(n) → Window → x_t(n) → DFT → X_t(k) → Mel filter banks → Y_t(m) → Log(|·|²) → IDFT → MFCC y_t^(m)(k)]

MFCC stands for Mel-frequency cepstral coefficients, a representation of the short-term power spectrum of a sound used in audio processing.

The MFCCs are the amplitudes of the spectrum that results from the final transform.
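The first stages of the MFCC chain (pre-emphasis, windowing, DFT power spectrum) can be sketched in plain Python. This is a minimal illustration, not the code used in the presented system: the function names, the pre-emphasis coefficient 0.97, the 8 kHz sample rate, the 64-sample frame and the naive O(N²) DFT are all assumptions chosen for readability. The mel filter bank, logarithm and final transform are left out.

```python
import cmath
import math


def pre_emphasis(x, alpha=0.97):
    """x'(n) = x(n) - alpha * x(n-1); alpha = 0.97 is an assumed, commonly used value."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def hamming(n_samples):
    """Symmetric Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def power_spectrum(frame):
    """|X_t(k)|^2 for k = 0 .. N/2, computed with a naive DFT (fine for short frames)."""
    n_samples = len(frame)
    spectrum = []
    for k in range(n_samples // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                  for n in range(n_samples))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def hz_to_mel(f_hz):
    """Common mel-scale formula, used to place the mel filter bank centres."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


# Hypothetical test signal: one frame of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
frame_len = 64
signal = [math.sin(2 * math.pi * 1000.0 * n / sample_rate) for n in range(frame_len)]

emphasized = pre_emphasis(signal)                      # x'(n)
window = hamming(frame_len)
windowed = [s * w for s, w in zip(emphasized, window)]  # x_t(n)
spectrum = power_spectrum(windowed)                    # |X_t(k)|^2

peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sample_rate / frame_len
print(peak_bin, peak_hz)
```

A real front end would slide this frame over the utterance with overlap and would use an FFT rather than the naive DFT.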

Page 11: Performance analysis of bangla speech recognizer model using hmm

Explanatory Example

[Figure panels: speech waveform of the phoneme "\ae"; after pre-emphasis and Hamming windowing; power spectrum; MFCC]

Page 12: Performance analysis of bangla speech recognizer model using hmm

Feature Vector to P(O|M) via HMM

[Diagram: feature vector → HMM → P(O|M)]

For each input word O, the HMM M yields a probability P(O|M), the likelihood that the model generated the observed feature sequence.
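One common way to use P(O|M) for isolated words is to keep one model per word and pick the word whose model scores highest. The sketch below illustrates that idea only; the two toy word models, their names `word_a` and `word_b`, and the quantized observation symbols are hypothetical and not taken from the presented system.

```python
def likelihood(model, obs):
    """P(O|M) via the forward recursion for a discrete HMM given as (pi, A, B)."""
    pi, A, B = model
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


def recognize(models, obs):
    """Return the word whose model gives the highest P(O|M), plus all scores."""
    scores = {word: likelihood(model, obs) for word, model in models.items()}
    return max(scores, key=scores.get), scores


# Two hypothetical left-to-right word models over two observation symbols.
transitions = [[0.7, 0.3], [0.0, 1.0]]
models = {
    "word_a": ([1.0, 0.0], transitions, [[0.9, 0.1], [0.1, 0.9]]),
    "word_b": ([1.0, 0.0], transitions, [[0.1, 0.9], [0.9, 0.1]]),
}

observation = [0, 0, 1, 1]  # assumed sequence of quantized feature vectors
best_word, scores = recognize(models, observation)
print(best_word, scores)
```

For long utterances the products underflow, so practical systems work with log-probabilities or scaled forward variables.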

Page 13: Performance analysis of bangla speech recognizer model using hmm

HMM Model

An HMM is specified by a five-tuple $\lambda = (S, O, \pi, A, B)$.

Page 14: Performance analysis of bangla speech recognizer model using hmm

Elements of an HMM

1) Set of hidden states $S = \{1, 2, \ldots, N\}$

2) Set of observation symbols $O = \{o_1, o_2, \ldots, o_M\}$

M: the number of observation symbols

3) The initial state distribution

$\pi = \{\pi_i\}, \quad \pi_i = P(s_0 = i), \quad 1 \le i \le N$

4) State transition probability distribution

$A = \{a_{ij}\}, \quad a_{ij} = P(s_t = j \mid s_{t-1} = i), \quad 1 \le i, j \le N$

5) Observation symbol probability distribution in state j

$B = \{b_j(k)\}, \quad b_j(k) = P(X_t = o_k \mid s_t = j), \quad 1 \le j \le N,\ 1 \le k \le M$
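The elements listed above map naturally onto a small data structure. The sketch below is only an illustration: the class name `DiscreteHMM`, its field names and the `validate` helper are assumptions, not part of the presented system. The numbers are those of the three-state up/down/no-change trellis example that appears later in the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiscreteHMM:
    """Hypothetical container for the parameters (pi, A, B) of a discrete HMM."""
    pi: List[float]        # pi[i]   = P(s_0 = i)
    A: List[List[float]]   # A[i][j] = P(s_t = j | s_{t-1} = i)
    B: List[List[float]]   # B[j][k] = P(X_t = o_k | s_t = j)

    @property
    def n_states(self) -> int:
        return len(self.pi)

    @property
    def n_symbols(self) -> int:
        return len(self.B[0])

    def validate(self, tol: float = 1e-9) -> None:
        """Raise ValueError unless shapes agree and every row is a distribution."""
        n = self.n_states
        if len(self.A) != n or len(self.B) != n:
            raise ValueError("A and B need one row per hidden state")
        if any(len(row) != n for row in self.A):
            raise ValueError("A must be an N x N matrix")
        for row in [self.pi] + self.A + self.B:
            if any(p < 0 for p in row) or abs(sum(row) - 1.0) > tol:
                raise ValueError(f"not a probability distribution: {row}")


model = DiscreteHMM(
    pi=[0.5, 0.2, 0.3],
    A=[[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.4, 0.1, 0.5]],
    B=[[0.7, 0.1, 0.2], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]],  # columns: up, down, no-change
)
model.validate()
print(model.n_states, model.n_symbols)
```

Checking that every row sums to one catches most transcription mistakes in hand-entered models before any algorithm is run on them.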

Page 15: Performance analysis of bangla speech recognizer model using hmm

Three Basic Problems in HMM

1. The Evaluation Problem: Given a model λ = (A, B, π) and a sequence of observations O = (o_1, o_2, o_3, ..., o_T), what is the probability P(O|λ), i.e., the probability that the model generates the observations?

2. The Decoding Problem: Given a model λ = (A, B, π) and a sequence of observations O = (o_1, o_2, o_3, ..., o_T), what is the most likely state sequence in the model that produces the observations?

3. The Learning Problem: Given a model λ = (A, B, π) and a set of observations O = (o_1, o_2, o_3, ..., o_T), how can we adjust the model parameters λ to maximize the probability P(O|λ)?

- How to evaluate an HMM? Forward Algorithm
- How to decode an HMM? Viterbi Algorithm
- How to train an HMM? Baum-Welch Algorithm

Page 16: Performance analysis of bangla speech recognizer model using hmm

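Calculate Probability P(O|M)

Trellis (three states; observation symbols up, down, no-change):

Initial probabilities: state 1 = 0.5, state 2 = 0.2, state 3 = 0.3

Observation probabilities:

| State | P(up) | P(down) | P(no-change) |
| 1 | 0.7 | 0.1 | 0.2 |
| 2 | 0.1 | 0.6 | 0.3 |
| 3 | 0.3 | 0.3 | 0.4 |

Transition matrix (row = from state, column = to state):

| From / To | 1 | 2 | 3 |
| 1 | 0.6 | 0.2 | 0.2 |
| 2 | 0.5 | 0.3 | 0.2 |
| 3 | 0.4 | 0.1 | 0.5 |

First trellis column: 0.35, 0.02, 0.09 (sum 0.46)

Second trellis column:

- state 1: 0.35*0.6*0.7 + 0.02*0.5*0.7 + 0.09*0.4*0.7 = 0.179
- state 2: 0.008
- state 3: 0.35*0.2*0.3 + 0.02*0.2*0.3 + 0.09*0.5*0.3 = 0.036

Third trellis column, state 1: 0.179*0.6*0.7 + 0.008*0.5*0.7 + 0.036*0.4*0.7

Add probabilities! 0.179 + 0.008 + 0.036 = 0.223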
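The trellis numbers on this slide can be reproduced with a few lines of Python. This is a sketch, not the author's implementation: the function name `forward`, the zero-based state and symbol indices, and the reading of the observation sequence as (up, up), inferred from the first-column values 0.35, 0.02 and 0.09, are assumptions.

```python
def forward(pi, A, B, obs):
    """Forward algorithm for a discrete HMM.

    alpha[t][j] is the probability of the first t+1 observations and of being
    in state j afterwards; summing the last column gives P(O|M).
    """
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha, sum(alpha[-1])


UP, DOWN, NO_CHANGE = 0, 1, 2

pi = [0.5, 0.2, 0.3]
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2],
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]

alpha, prob = forward(pi, A, B, [UP, UP])
print([round(a, 4) for a in alpha[0]])  # first trellis column
print([round(a, 4) for a in alpha[1]])  # second trellis column
print(round(prob, 4))                   # sum of the second column, about 0.2234
```

Unrounded, the second column is 0.1792, 0.0085 and 0.0357, which the slide shows as 0.179, 0.008 and 0.036.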

Page 17: Performance analysis of bangla speech recognizer model using hmm

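Forward Calculations – Overview

[Trellis diagram: start state S0 followed by states S1 and S2 at TIME 2, TIME 3 and TIME 4; initial probabilities π1 and π2]

Transition probabilities: a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5

Observation probabilities: b1 = (0.6, 0.1, 0.3), b2 = (0.1, 0.7, 0.2)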

Page 18: Performance analysis of bangla speech recognizer model using hmm

Forward Calculations (t=2)

[Trellis diagram: S0 followed by S1 and S2 at TIME 2; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

$\alpha_1(1) = \pi_1 = 1$

$\alpha_2(1) = \pi_2 = 0$

$\alpha_1(2) = \alpha_1(1)\, b_{13}\, a_{11} + \alpha_2(1)\, b_{23}\, a_{21} = 0.21$

$\alpha_2(2) = \alpha_1(1)\, b_{13}\, a_{12} + \alpha_2(1)\, b_{23}\, a_{22} = 0.09$

NOTE: $\alpha_1(2) + \alpha_2(2)$ is the likelihood of the observation.

Page 19: Performance analysis of bangla speech recognizer model using hmm

Forward Calculations (t=3)

[Trellis diagram: S0 followed by S1 and S2 at TIME 2, TIME 3 and TIME 4, highlighting $\alpha_1(3)$; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

Page 20: Performance analysis of bangla speech recognizer model using hmm

Forward Calculations (t=4)

[Trellis diagram: S0 followed by S1 and S2 at TIME 2, TIME 3 and TIME 4; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

Page 21: Performance analysis of bangla speech recognizer model using hmm

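Forward Calculation of Likelihood Function

| | t=1 | t=2 | t=3 | t=4 |
| $\alpha_1(t)$ | 1.0 ($\pi_1 = 1$) | 0.21 = $\alpha_1(1) a_{11} b_{13} + \alpha_2(1) a_{21} b_{23}$ | 0.0462 = $\alpha_1(2) a_{11} b_{12} + \alpha_2(2) a_{21} b_{22}$ | 0.021294 |
| $\alpha_2(t)$ | 0.0 ($\pi_2 = 0$) | 0.09 = $\alpha_1(1) a_{12} b_{13} + \alpha_2(1) a_{22} b_{23}$ | 0.0378 | 0.010206 |
| $L(t) = p(K_1 \ldots K_t)$ | 1.0 = $\alpha_1(1) + \alpha_2(1)$ | 0.3 = $\alpha_1(2) + \alpha_2(2)$ | 0.084 = $\alpha_1(3) + \alpha_2(3)$ | 0.0315 = $\alpha_1(4) + \alpha_2(4)$ |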
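The forward table can be checked numerically. In the sketch below the recursion $\alpha_j(t+1) = \sum_i \alpha_i(t)\, b_i(o_t)\, a_{ij}$ is read off the table entries, the observation sequence (symbol 3, symbol 2, symbol 1) is inferred from the subscripts of b, and b22 = 0.7 is inferred from the value 0.0462. These inferences and the function name `forward_table` are assumptions rather than statements from the slides.

```python
def forward_table(pi, A, B, obs):
    """Forward recursion in the form used by the table.

    alpha[0] holds the initial distribution (t=1); each observation o_t is
    emitted by the state occupied at time t, followed by the transition to
    the state at time t+1.
    """
    n = len(pi)
    alpha = [list(pi)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * B[i][o] * A[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


pi = [1.0, 0.0]
A = [[0.7, 0.3],
     [0.5, 0.5]]
B = [[0.6, 0.1, 0.3],   # b11, b12, b13
     [0.1, 0.7, 0.2]]   # b21, b22 (inferred), b23
obs = [2, 1, 0]         # symbols 3, 2, 1 as zero-based indices

alpha = forward_table(pi, A, B, obs)
likelihoods = [sum(column) for column in alpha]

for t, (column, lik) in enumerate(zip(alpha, likelihoods), start=1):
    print(f"t={t}", [round(a, 6) for a in column], round(lik, 6))
```

Under these assumptions the printed rows agree with the table, including the final likelihood of 0.0315.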

Page 22: Performance analysis of bangla speech recognizer model using hmm

Backward Calculations – Overview

[Trellis diagram: start state S0 followed by states S1 and S2 at TIME 2, TIME 3 and TIME 4; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

Page 23: Performance analysis of bangla speech recognizer model using hmm

Backward Calculations (t=3)

[Trellis diagram: states S1 and S2 at TIME 3]

Page 24: Performance analysis of bangla speech recognizer model using hmm

Backward Calculations (t=2)

[Trellis diagram: states S1 and S2 at TIME 2, TIME 3 and TIME 4; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

$\beta_1(4) = 1$

$\beta_2(4) = 1$

$\beta_1(3) = 0.6$

$\beta_2(3) = 0.1$

$\beta_1(2) = \beta_1(3)\, a_{11}\, b_{12} + \beta_2(3)\, a_{12}\, b_{12} = 0.045$

$\beta_2(2) = \beta_1(3)\, a_{21}\, b_{22} + \beta_2(3)\, a_{22}\, b_{22} = 0.245$

NOTE: $\beta_1(2) + \beta_2(2)$ is the likelihood of the observation/word sequence.

Page 25: Performance analysis of bangla speech recognizer model using hmm

Backward Calculations (t=1)

[Trellis diagram: start state S0 followed by states S1 and S2 at TIME 2, TIME 3 and TIME 4; a11 = 0.7, a12 = 0.3, a21 = 0.5]

Page 26: Performance analysis of bangla speech recognizer model using hmm

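Backward Calculation of Likelihood Function

| | t=1 | t=2 | t=3 | t=4 |
| $\beta_1(t)$ | 0.0315 | 0.045 = $a_{11} b_{12} \beta_1(3) + a_{12} b_{12} \beta_2(3)$ | 0.6 = $b_{11}$ | 1 |
| $\beta_2(t)$ | 0.029 | 0.245 = $a_{21} b_{22} \beta_1(3) + a_{22} b_{22} \beta_2(3)$ | 0.1 = $b_{21}$ | 1 |
| $L(t) = p(K_t \ldots K_T)$ | 0.0315 = $\pi_1 \beta_1(1) + \pi_2 \beta_2(1)$ | 0.290 = $\beta_1(2) + \beta_2(2)$ | 0.7 = $\beta_1(3) + \beta_2(3)$ | 1 |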
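The backward table can be checked in the same way. The sketch assumes the recursion $\beta_i(t) = \sum_j a_{ij}\, b_i(o_t)\, \beta_j(t+1)$ with $\beta_i(4) = 1$, the observation sequence (symbol 3, symbol 2, symbol 1) and b22 = 0.7, all inferred from the table values. The function name `backward_table` is hypothetical.

```python
def backward_table(A, B, obs):
    """Backward recursion: beta[t][i] is the probability of the observations
    from time t+1 (one-based) onward, given state i at that time."""
    n = len(A)
    beta = [[1.0] * n]                      # final column: beta_i(T+1) = 1
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[i][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


pi = [1.0, 0.0]
A = [[0.7, 0.3],
     [0.5, 0.5]]
B = [[0.6, 0.1, 0.3],   # b11, b12, b13
     [0.1, 0.7, 0.2]]   # b21, b22 (inferred), b23
obs = [2, 1, 0]         # symbols 3, 2, 1 as zero-based indices

beta = backward_table(A, B, obs)
likelihood = sum(p * b for p, b in zip(pi, beta[0]))

for t, column in enumerate(beta, start=1):
    print(f"t={t}", [round(b, 6) for b in column])
print(round(likelihood, 6))  # same value as the forward pass: 0.0315
```

That the forward and backward passes agree on 0.0315 is a useful sanity check for any implementation.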

Page 27: Performance analysis of bangla speech recognizer model using hmm

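Calculate max_S Prob. state sequence S

Trellis (three states; observation symbols up, down, no-change):

Initial probabilities: state 1 = 0.5, state 2 = 0.2, state 3 = 0.3

Observation probabilities:

| State | P(up) | P(down) | P(no-change) |
| 1 | 0.7 | 0.1 | 0.2 |
| 2 | 0.1 | 0.6 | 0.3 |
| 3 | 0.3 | 0.3 | 0.4 |

First trellis column: 0.35, 0.02, 0.09

Second trellis column (keep only the best incoming path):

- state 1: max(0.35*0.6*0.7, 0.02*0.5*0.7, 0.09*0.4*0.7) = 0.147
- state 2: 0.007
- state 3: max(0.35*0.2*0.3, 0.02*0.2*0.3, 0.09*0.5*0.3) = 0.021

Third trellis column, state 1: max(0.147*0.6*0.7, 0.007*0.5*0.7, 0.021*0.4*0.7)

Best path: select the highest probability!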
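Replacing the sum of the forward algorithm with a maximum and remembering which predecessor won gives the Viterbi algorithm. The sketch below reruns the three-state example. The function name `viterbi`, the symbols δ (best path score) and ψ (back-pointer), the zero-based indices and the observation sequence (up, up) are assumptions, and the transition matrix is the one from the earlier P(O|M) slide.

```python
def viterbi(pi, A, B, obs):
    """Return (best state path, its probability, all delta columns)."""
    n = len(pi)
    delta = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    psi = [[0] * n]                         # back-pointers; first column unused
    for o in obs[1:]:
        prev = delta[-1]
        column, back = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            back.append(best_i)
            column.append(prev[best_i] * A[best_i][j] * B[j][o])
        delta.append(column)
        psi.append(back)
    # Backtracking: start from the best final state and follow the pointers.
    last = max(range(n), key=lambda i: delta[-1][i])
    path = [last]
    for back in reversed(psi[1:]):
        path.insert(0, back[path[0]])
    return path, delta[-1][last], delta


UP = 0
pi = [0.5, 0.2, 0.3]
A = [[0.6, 0.2, 0.2],
     [0.5, 0.3, 0.2],
     [0.4, 0.1, 0.5]]
B = [[0.7, 0.1, 0.2],
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]

path, prob, delta = viterbi(pi, A, B, [UP, UP])
print([state + 1 for state in path])        # one-based state labels
print([round(d, 3) for d in delta[1]])      # second trellis column
print(round(prob, 3))
```

With these inputs the best path stays in state 1 for both steps and the second column comes out as 0.147, 0.007 and 0.021, matching the slide.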

Page 28: Performance analysis of bangla speech recognizer model using hmm

Viterbi Algorithm – Overview

[Trellis diagram: start state S0 followed by states S1 and S2 at TIME 2, TIME 3 and TIME 4; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

Page 29: Performance analysis of bangla speech recognizer model using hmm

Viterbi Algorithm (Forward Calculations t=2)

[Trellis diagram: S0 followed by S1 and S2 at TIME 2; π1 = 1, π2 = 0; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

$\delta_1(1) = \pi_1 = 1$

$\delta_2(1) = \pi_2 = 0$

$\delta_1(2) = \max\{\delta_1(1)\, b_{13}\, a_{11},\ \delta_2(1)\, b_{23}\, a_{21}\} = 0.21$

$\delta_2(2) = \max\{\delta_1(1)\, b_{13}\, a_{12},\ \delta_2(1)\, b_{23}\, a_{22}\} = 0.09$

$\psi_1(2) = 1$

$\psi_2(2) = 1$

Page 30: Performance analysis of bangla speech recognizer model using hmm

Viterbi Algorithm (Backtracking t=2)

[Trellis diagram: S0 followed by S1 and S2 at TIME 2; π1 = 1, π2 = 0; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

$\delta_1(1) = \pi_1 = 1$

$\delta_2(1) = \pi_2 = 0$

$\delta_1(2) = \max\{\delta_1(1)\, b_{13}\, a_{11},\ \delta_2(1)\, b_{23}\, a_{21}\} = 0.21$

$\delta_2(2) = \max\{\delta_1(1)\, b_{13}\, a_{12},\ \delta_2(1)\, b_{23}\, a_{22}\} = 0.09$

$\psi_1(2) = 1$

$\psi_2(2) = 1$

Page 31: Performance analysis of bangla speech recognizer model using hmm

Viterbi Algorithm (Forward Calculations)

[Trellis diagram: start state S0 followed by states S1 and S2 at TIME 2, TIME 3 and TIME 4; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

Page 32: Performance analysis of bangla speech recognizer model using hmm

Viterbi Algorithm (Backtracking)

[Trellis diagram: start state S0 followed by states S1 and S2 at TIME 2, TIME 3 and TIME 4; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

Page 33: Performance analysis of bangla speech recognizer model using hmm

Viterbi Algorithm (Forward Calculations t=4)

[Trellis diagram: start state S0 followed by states S1 and S2 at TIME 2, TIME 3 and TIME 4; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

Page 34: Performance analysis of bangla speech recognizer model using hmm

Viterbi Algorithm (Backtracking to Obtain Labeling)

[Trellis diagram: start state S0 followed by states S1 and S2 at TIME 2, TIME 3 and TIME 4; a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5]

Page 35: Performance analysis of bangla speech recognizer model using hmm

Implementing HMM to Speech Modeling (Training and Recognition)

- Building HMM speech models based on the correspondence between the observation sequences Y and the state sequence S (TRAINING).
- Recognizing speech using the stored HMM models and the actual observation Y (RECOGNITION).

[Block diagram with the blocks Speech Samples, Feature Extraction, Training HMM and Recognition; signal labels Y, S, W*]

Page 36: Performance analysis of bangla speech recognizer model using hmm

RECOGNITION Process

- Given an input speech, let S = (s_1, s_2, …, s_T) be the recognized state sequence.
- x_t is the feature sample computed at time t, and the feature sequence from time 1 to t is denoted X = (x_1, x_2, …, x_t).
- The recognized states S* are obtained by:

$S^* = \arg\max_S P(S, X \mid \lambda)$

[Diagram: Search Algorithm with a Dynamic Structure and a Static Structure; inputs x_t and {s_{t-1}}; intermediate s_t, P(x_t, {s_t} | {s_{t-1}}, λ); output S*]

Page 37: Performance analysis of bangla speech recognizer model using hmm

Result (Speaker Recognition)

Table 1: Speaker recognition result

Page 38: Performance analysis of bangla speech recognizer model using hmm

Result (Isolated SR)

Table 3: Result for isolated speech recognition.

Page 39: Performance analysis of bangla speech recognizer model using hmm

Result (Continuous SR)

Table 3: Continuous Speech recognition result

Page 40: Performance analysis of bangla speech recognizer model using hmm

Conclusions

- No speech recognizer to date achieves 100% accuracy.
- Avoid poor-quality microphones; consider using a better microphone.
- One important point is that training the computer will provide an even better experience.

Page 41: Performance analysis of bangla speech recognizer model using hmm

Thank You