PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

Zi-He Chen, Yuan-Fu Liao, and Yau-Tarng Juang

ICASSP 2005

Presenter: Fang-Hui Chu



2005/10/24 Speech Lab NTNU (slide 2/26)

Reference

Combination of Acoustic and Prosodic Information for Robust Speaker Identification, Yuan-Fu Liao, Zhi-Xian Zhuang, Zi-He Chen, and Yau-Tarng Juang, ROCLING 2005

Eigen-Prosody Analysis for Robust Speaker Recognition under Mismatch Handset Environment, Zi-He Chen, Yuan-Fu Liao, and Yau-Tarng Juang, ICSLP 2004

Outline

Introduction

VQ-Based Prosodic Modeling

AKI Unseen Handset Estimation

Eigen-Prosody Analysis

Speaker Identification Experiments

Conclusions

Introduction

A speaker identification system in a telecommunication network environment needs to be robust against the distortion of mismatched handsets.

Prosodic features are known to be less sensitive to handset mismatch.

Several successful techniques:

GMMs: the per-frame pitch and energy values are extracted and modeled using traditional distribution models

These may not adequately capture the temporal dynamic information of the prosodic feature contours.

Introduction (cont.)

Several successful techniques (cont.):

N-gram: the dynamics of the pitch and energy trajectories are described by sequences of symbols and modeled by n-gram statistics

DHMM: the sequences of prosody symbols are further modeled by the state observation and transition probabilities

These usually require a large amount of training/test data to reach reasonable performance.

In this paper, a VQ-based prosody modeling and an eigen-prosody analysis (EPA) approach are integrated to add robustness to a conventional cepstral-feature-based GMM closed-set speaker identification system under mismatched, unseen handsets and limited training/test data.

Introduction (cont.)

[Block diagram] Speaker's enrollment speech → VQ-based prosodic modeling → sequences of prosody states (a text document that records the detailed prosody/speaking style of the speaker) → EPA (LSA) → an eigen-prosody space representing the constellation of speakers.

In this way, the speaker identification problem is transformed into a task similar to full-text document retrieval.

Introduction (cont.)

Since EPA utilizes prosody-level information, it can be further fused with acoustic-level information so that the two complement each other.

The a priori knowledge interpolation (AKI) approach is chosen; it utilizes maximum likelihood linear regression (MLLR) to compensate for handset mismatch.

VQ-Based Prosodic Modeling

In this study, syllables are chosen as the basic processing units.

Five types of prosodic features:

the slope of the pitch contour of a vowel segment

the lengthening factor of a vowel segment

the average log-energy difference between two vowels

the value of the pitch jump between two vowels

the pause duration between two syllables

The prosodic features are normalized according to the underlying vowel class to remove any non-prosodic (context-information) effects:

x̂ = (x − μ_vowel) / σ_vowel

where μ_vowel and σ²_vowel are the mean and variance of the vowel class computed over the whole corpus.
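The per-vowel-class normalization can be sketched as follows; a minimal NumPy version, assuming `feats` holds one prosodic feature vector per syllable and `vowel_ids` labels each syllable's underlying vowel class (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def normalize_prosodic_features(feats, vowel_ids):
    """Z-score each syllable's prosodic feature vector against the mean
    and standard deviation of its underlying vowel class, computed over
    the whole corpus, to remove non-prosodic (context) effects."""
    feats = np.asarray(feats, dtype=float)
    vids = np.asarray(vowel_ids)
    out = np.empty_like(feats)
    for v in np.unique(vids):
        idx = vids == v
        mu = feats[idx].mean(axis=0)
        sigma = feats[idx].std(axis=0) + 1e-12   # guard degenerate classes
        out[idx] = (feats[idx] - mu) / sigma
    return out
```

After this step, each vowel class has zero mean and unit variance, so remaining variation reflects the speaker's prosodic habits rather than the phonetic context.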

Prosodic Modeling

Prosodic Modeling (cont.)

To build the prosodic model, the prosodic features are vector quantized into M codewords using the expectation-maximization (EM) algorithm.
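The paper fits the codebook with EM; as a simpler sketch of the same vector-quantization step, here is its hard-assignment counterpart, LBG binary splitting with k-means refinement (M a power of two, as with the paper's 32-state model; all names are illustrative):

```python
import numpy as np

def train_codebook(feats, M=4, n_iter=20, eps=1e-3):
    """VQ codebook training by LBG binary splitting plus k-means
    refinement. Returns M codeword vectors ('prosodic states').
    For simplicity M should be a power of two."""
    feats = np.asarray(feats, dtype=float)
    codebook = feats.mean(axis=0, keepdims=True)      # start with one codeword
    while len(codebook) < M:
        codebook = np.vstack([codebook + eps,         # split every codeword
                              codebook - eps])
        for _ in range(n_iter):                       # k-means refinement
            d = ((feats[:, None, :] - codebook) ** 2).sum(-1)
            assign = d.argmin(axis=1)
            for k in range(len(codebook)):
                pts = feats[assign == k]
                if len(pts):
                    codebook[k] = pts.mean(axis=0)
    return codebook

def quantize(feats, codebook):
    """Label each syllable's feature vector with its nearest codeword index,
    turning an utterance into a sequence of prosody states."""
    d = ((np.asarray(feats, dtype=float)[:, None, :] - codebook) ** 2).sum(-1)
    return d.argmin(axis=1)
```

`quantize` is what turns each speaker's utterances into the "prosody text document" used in the later EPA steps.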

Prosodic Modeling (cont.)

For example, state 6 is the phrase-start state; states 3 and 4 are the major breaks.

Prosodic labeling

AKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handsets as the a priori knowledge to construct a space of handsets.

Model-space AKI using the MLLR model transformation is chosen to compensate for the handset mismatch.

The estimate of the characteristic of a test handset is defined as

ĥ = Σ_{n=1}^{N} α_n h_n

where H = {h_n = (A_n, b_n, T_n), n = 1~N} is the set of a priori knowledge and the α_n are the interpolation weights.

The MLLR transformation adapts each Gaussian mixture component as

μ̂ = Âμ + b̂,  Σ̂ = BᵀT̂B

where μ and μ̂ are the original and adapted mixture mean vectors, Σ and Σ̂ are the original and adapted mixture variance matrices, Â and T̂ are the MLLR mean and variance transformation matrices, B is the inverse of the Cholesky factor C of the original inverse variance matrix Σ⁻¹ = CᵀC, and b̂ is the bias of the mixture component.
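As a concrete sketch of the interpolation step, assume the a priori set stores per-handset MLLR mean transforms (A_n, b_n) and that the weights α_n have already been estimated (in the paper they come from maximum likelihood on the test utterance; here they are simply given). The variance transform T_n is omitted for brevity; all names are illustrative:

```python
import numpy as np

def interpolate_handset(transforms, alphas):
    """AKI sketch: estimate the unseen test handset's characteristic as a
    weighted combination of seen handsets, h_hat = sum_n alpha_n * h_n.
    `transforms` is a list of (A_n, b_n) MLLR mean transforms; `alphas`
    are interpolation weights summing to one."""
    A_hat = sum(a * A for a, (A, b) in zip(alphas, transforms))
    b_hat = sum(a * b for a, (A, b) in zip(alphas, transforms))
    return A_hat, b_hat

def adapt_mean(mu, A_hat, b_hat):
    """Apply the interpolated MLLR mean transform: mu_hat = A*mu + b."""
    return A_hat @ mu + b_hat
```

Because the interpolation is linear, blending the transforms and then adapting is equivalent to adapting with each seen-handset transform and blending the results.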

AKI Unseen Handset Estimation (cont.)

Eigen-Prosody Analysis

The procedure of the EPA includes:

VQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody keywords

Calculating the occurrence statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrix

Applying the singular value decomposition (SVD) technique to decompose the matrix and build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two speaker vectors

Eigen-Prosody Analysis (cont.)

Prosody keyword extraction

After the prosodic state labeling, the prosody text documents of all speakers are searched to find important prosody keywords in order to establish a prosody keyword dictionary.

First, all possible combinations of the prosody words, including single words and word pairs (uni-grams and bi-grams), are listed and their frequency statistics are computed.

After calculating the histogram of all prosody-word frequencies, thresholds are set to keep only the high-frequency ones.
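A minimal sketch of this keyword extraction, assuming each speaker's prosody document is a list of state labels and using a simple count threshold as a stand-in for the paper's histogram-derived thresholds (names are illustrative):

```python
from collections import Counter

def extract_keywords(state_seqs, min_count=5):
    """Count uni-grams and bi-grams over all speakers' prosody-state
    sequences and keep only those above a frequency threshold, forming
    the prosody keyword dictionary."""
    counts = Counter()
    for seq in state_seqs:
        counts.update((s,) for s in seq)      # uni-grams
        counts.update(zip(seq, seq[1:]))      # bi-grams (adjacent state pairs)
    return {w for w, c in counts.items() if c >= min_count}
```

The surviving uni-grams and bi-grams play the role of "words" in the document-retrieval analogy that follows.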

Prosody keyword-speaker occurrence matrix statistics

The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words.

The occurrence counts of the keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of that specific speaker.

The prosody keyword-speaker occurrence matrix A is made up of the collection of all speakers' prosody keyword list vectors.

To emphasize the uncommon keywords and to deemphasize the very common ones, the inverse document frequency (IDF) weighting method is applied.
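The IDF step can be sketched as below, using the standard log(n_speakers / document-frequency) form; the paper does not spell out its exact variant, so treat this as an assumed instantiation (rows of `A` are keywords, columns are speakers):

```python
import numpy as np

def idf_weight(A):
    """Apply IDF weighting to the keyword-speaker occurrence matrix A:
    keywords used by nearly every speaker are down-weighted, while
    uncommon, speaker-discriminative keywords are boosted."""
    n_speakers = A.shape[1]
    doc_freq = np.count_nonzero(A, axis=1)    # speakers using each keyword
    idf = np.log(n_speakers / np.maximum(doc_freq, 1))
    return A * idf[:, None]
```

A keyword used by every speaker gets weight log(1) = 0 and thus contributes nothing to the speaker vectors.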

Eigen-prosody analysis

In order to reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using SVD to find a compact eigen-prosody space.

Given an m by n (m >> n) matrix A of rank R, A is decomposed and further approximated using only the largest K singular values as

A = UΣVᵀ ≈ A_K = U_K Σ_K V_Kᵀ

where A_K, U_K, V_K, and Σ_K are the rank-reduced versions of the respective matrices.
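The truncation step can be sketched directly with NumPy's SVD; rows of V_K serve as the speaker coordinates in the eigen-prosody space (the paper uses K = 3; the function name is illustrative):

```python
import numpy as np

def eigen_prosody_space(A, K=3):
    """Decompose the (IDF-weighted) keyword-speaker matrix A = U S V^T and
    keep only the K largest singular values, A ~ U_K S_K V_K^T.
    Returns (U_K, S_K, V_K); row i of V_K is speaker i's coordinate."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :K], np.diag(s[:K]), Vt[:K, :].T
```

When K equals the true rank of A the truncation is exact; for the paper's sparse 367-by-302 matrix, K = 3 keeps only the dominant prosodic variation.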

Eigen-Prosody Analysis (cont.)

Score measurement

The test utterances of a test speaker are first labeled and parsed to form the pseudo query document y_Q, which is then transformed into the query vector v_Q in the eigen-prosody speaker space by

v_Q = y_Qᵀ U_K Σ_K⁻¹

and the similarity between v_Q and each enrolled speaker vector v_i is scored by the cosine of the angle between them:

S(v_Q, v_i) = (v_Qᵀ v_i) / (‖v_Q‖ ‖v_i‖)
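The folding-in and scoring steps above can be sketched as follows, assuming `speaker_vecs` holds one enrolled-speaker row each (e.g. the rows of V_K from the SVD step); the function name is illustrative:

```python
import numpy as np

def identify_speaker(y_Q, U_K, S_K, speaker_vecs):
    """Fold the test speaker's keyword-count vector y_Q into the
    eigen-prosody space, v_Q = y_Q^T U_K S_K^{-1}, then score each
    enrolled speaker vector by cosine similarity with v_Q and return
    the index of the best match along with all scores."""
    v_Q = y_Q @ U_K @ np.linalg.inv(S_K)
    cos = (speaker_vecs @ v_Q) / (
        np.linalg.norm(speaker_vecs, axis=1) * np.linalg.norm(v_Q) + 1e-12)
    return int(cos.argmax()), cos
```

If y_Q were exactly an enrolled speaker's column of A, v_Q would coincide with that speaker's row of V_K and the cosine score would be 1.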

Speaker Identification Experiments

HTIMIT database and experimental conditions: in total 384 speakers, each gave ten utterances.

The set of 384×10 utterances was then played back and recorded through nine other different handsets, including four carbon-button (碳墨式) handsets, four electret (電子式) handsets, and one portable cordless phone.

In this paper, all experiments were performed on the 302 speakers (151 females and 151 males) who have all ten utterances.

For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.

The other ten three-utterance sessions of each speaker, from the ten handsets respectively, were used as the evaluation data.

Speaker Identification Experiments (cont.)

To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.

For each speaker, a MAP-adapted GMM (MAP-GMM), adapted from the UBM using his or her own enrollment speech, was built.

A 32-state prosodic model was trained.

367 prosody keywords were extracted to form a sparse 367×302-dimensional matrix A.

The dimension of the eigen-prosody space is 3.

38 mel-frequency cepstral coefficients (MFCCs), including:

Window size of 30 ms and frame shift of 10 ms.

Speaker Identification Experiments (cont.)

Conclusions

This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach.

Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen, mismatched handsets.

It is therefore a promising method for robust speaker identification under mismatched environments and limited available data.

2005102420051024 Speech Lab NTNU 2626

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
Page 2: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 22

ReferenceReference

Combination of Acoustic and Prosodic Information for Robust SpCombination of Acoustic and Prosodic Information for Robust Speaker Identification Yuan-Fu Liao Zhi-Xian Zhuang Zi-He Cheeaker Identification Yuan-Fu Liao Zhi-Xian Zhuang Zi-He Chen and Yau-Tarng Juang ROCLING 2005n and Yau-Tarng Juang ROCLING 2005

Eigen-Prosody Analysis for Robust Speaker Recognition under Eigen-Prosody Analysis for Robust Speaker Recognition under Mismatch Handset Environment Zi-He Chen Yuan-Fu Liao and Mismatch Handset Environment Zi-He Chen Yuan-Fu Liao and Yau-Tarng Juang ICSLP 2004Yau-Tarng Juang ICSLP 2004

2005102420051024 Speech Lab NTNU 33

OutlineOutline

IntroductionIntroduction

VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling

AKI Unseen Handset EstimationAKI Unseen Handset Estimation

Eigen-Prosody AnalysisEigen-Prosody Analysis

Speaker Identification ExperimentsSpeaker Identification Experiments

ConclusionsConclusions

2005102420051024 Speech Lab NTNU 44

IntroductionIntroduction

A speaker identification system in telecommunication network enA speaker identification system in telecommunication network environment needs to be robust against distortion of mismatch handvironment needs to be robust against distortion of mismatch handsetssets

Prosodic features are known to be less sensitive to handset mismatchProsodic features are known to be less sensitive to handset mismatch

Several successful techniques Several successful techniques GMMs the per-frame pitch and energy values are extracted and modeled GMMs the per-frame pitch and energy values are extracted and modeled using traditional distribution modelsusing traditional distribution models

May not adequately capture the temporal dynamic information of the prosMay not adequately capture the temporal dynamic information of the prosodic feature contoursodic feature contours

2005102420051024 Speech Lab NTNU 55

Introduction (cont)Introduction (cont)

Several successful techniques (cont) Several successful techniques (cont) N-gram the dynamics of the pitch and energy trajectories are described bN-gram the dynamics of the pitch and energy trajectories are described by sequences of symbols and modeled by the n-gram statisticsy sequences of symbols and modeled by the n-gram statistics

DHMM the sequences of prosody symbols are further modeled by the staDHMM the sequences of prosody symbols are further modeled by the state observation and transition probabilitieste observation and transition probabilities

usually require large amount of trainingtest data to reach a reasonable perfusually require large amount of trainingtest data to reach a reasonable performanceormance

In this paper In this paper a VQ-based prosody modeling and an eigen-prosody analysis approach are a VQ-based prosody modeling and an eigen-prosody analysis approach are integrated together to add robustness to conventional cepstral features-basintegrated together to add robustness to conventional cepstral features-based GMMs close-set speaker identification system under the situation of mied GMMs close-set speaker identification system under the situation of mismatch unseen handsets and limited trainingtest datasmatch unseen handsets and limited trainingtest data

2005102420051024 Speech Lab NTNU 66

Introduction (cont)Introduction (cont)

Speakerrsquos enrollment

speech

Sequences ofProsody states

EPA

VQ-Based Prosodic Modeling

Text document records the detailprosodyspeaking

style of the speaker

Eigen-prosody space to representthe constellation

of speakers LSA

By this way the speaker identification problem is transformed into a

full-text document retrieval-similar task

2005102420051024 Speech Lab NTNU 77

Introduction (cont)Introduction (cont)

Since EPA utilizes prosody-level information it could be further Since EPA utilizes prosody-level information it could be further fused with acoustic-level information to complement each otherfused with acoustic-level information to complement each other

The The a priori a priori knowledge interpolation (AKI) approach is chosenknowledge interpolation (AKI) approach is chosenIt utilizes the maximum likelihood linear regression (MLLR) to It utilizes the maximum likelihood linear regression (MLLR) to compensate handset mismatchcompensate handset mismatch

2005102420051024 Speech Lab NTNU 88

VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling

In this study syllables are chosen as the basic processing unitsIn this study syllables are chosen as the basic processing units

Five types of prosodic features Five types of prosodic features the slope of pitch contour of a vowel segmentthe slope of pitch contour of a vowel segment

lengthening factor of a vowel segmentlengthening factor of a vowel segment

average log-energy difference between two vowelsaverage log-energy difference between two vowels

value of pitch jump between two vowelsvalue of pitch jump between two vowels

pause duration between two syllablespause duration between two syllables

The prosody features are normalized according to underline vowel class to The prosody features are normalized according to underline vowel class to remove any non-prosodic effects (context-information)remove any non-prosodic effects (context-information)

vowel

voweluxx

ˆ

mean of the vowel class of whole corpusuvowel

vowel variance of the vowel class of whole corpus

2005102420051024 Speech Lab NTNU 99

Prosodic ModelingProsodic Modeling

2005102420051024 Speech Lab NTNU 1010

Prosodic Modeling (cont)Prosodic Modeling (cont)

To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm

2005102420051024 Speech Lab NTNU 1111

Prosodic Modeling (cont)Prosodic Modeling (cont)

For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks

2005102420051024 Speech Lab NTNU 1212

Prosodic labelingProsodic labeling

2005102420051024 Speech Lab NTNU 1313

AKI Unseen Handset EstimationAKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets

Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch

The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined

where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge

N

nnnhh

1

ˆ

h

1

1

ˆˆ

ˆˆˆˆ

CB

CC

BTB

buAuhu

T

T

μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1

is the bias of the mixture component

A T

b

2005102420051024 Speech Lab NTNU 1414

AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)

2005102420051024 Speech Lab NTNU 1515

Eigen-Prosody AnalysisEigen-Prosody Analysis

The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords

Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix

Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors

2005102420051024 Speech Lab NTNU 1616

Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)

2005102420051024 Speech Lab NTNU 1717

Prosody keyword extractionProsody keyword extraction

After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary

First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed

After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones

2005102420051024 Speech Lab NTNU 1818

Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics

The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words

The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker

The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors

To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied

2005102420051024 Speech Lab NTNU 1919

Eigen-Prosody analysisEigen-Prosody analysis

In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space

Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as

Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m

atricesatrices

TKKKK

T VUAVUA

2005102420051024 Speech Lab NTNU 2020

Eigen-Prosody Analysis Eigen-Prosody Analysis

2005102420051024 Speech Lab NTNU 2121

Score measurementScore measurement

The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t

he eigen-prosody speaker space byhe eigen-prosody speaker space by

iKQ

iKT

QiKQ

Pi

KKTQQ

v

vvS

Uy

1

)(

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances

The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone

In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces

For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively

2005102420051024 Speech Lab NTNU 2323

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers

For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built

a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained

367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA

the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3

38 mel-frequency cepstral coefficients (MFCCs) were extracted, using a window size of 30 ms and a frame shift of 10 ms.
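The 30 ms / 10 ms framing above can be sketched as follows (a numpy-only illustration of the analysis windowing; the 8 kHz sampling rate is an assumption based on HTIMIT being telephone-bandwidth speech, and the real front end computes 38 MFCCs on top of these frames):

```python
import numpy as np

def frame_signal(x, sr=8000, win_ms=30.0, shift_ms=10.0):
    """Slice a waveform into overlapping Hamming-windowed analysis frames.

    sr=8000 is an assumed telephone-bandwidth sampling rate; the 30 ms
    window and 10 ms shift follow the front-end setup stated above.
    """
    win = int(sr * win_ms / 1000.0)   # 240 samples at 8 kHz
    hop = int(sr * shift_ms / 1000.0) # 80 samples at 8 kHz
    n_frames = 1 + max(0, (len(x) - win) // hop)
    # Index matrix: row i selects samples [i*hop, i*hop + win)
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)
```

One second of 8 kHz speech thus yields 98 frames of 240 samples each.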


Conclusions

This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach.

Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets.

It is therefore a promising method for robust speaker identification under mismatched environments with limited available data.


Page 4: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 44

IntroductionIntroduction

A speaker identification system in telecommunication network enA speaker identification system in telecommunication network environment needs to be robust against distortion of mismatch handvironment needs to be robust against distortion of mismatch handsetssets

Prosodic features are known to be less sensitive to handset mismatchProsodic features are known to be less sensitive to handset mismatch

Several successful techniques Several successful techniques GMMs the per-frame pitch and energy values are extracted and modeled GMMs the per-frame pitch and energy values are extracted and modeled using traditional distribution modelsusing traditional distribution models

May not adequately capture the temporal dynamic information of the prosMay not adequately capture the temporal dynamic information of the prosodic feature contoursodic feature contours

2005102420051024 Speech Lab NTNU 55

Introduction (cont)Introduction (cont)

Several successful techniques (cont) Several successful techniques (cont) N-gram the dynamics of the pitch and energy trajectories are described bN-gram the dynamics of the pitch and energy trajectories are described by sequences of symbols and modeled by the n-gram statisticsy sequences of symbols and modeled by the n-gram statistics

DHMM the sequences of prosody symbols are further modeled by the staDHMM the sequences of prosody symbols are further modeled by the state observation and transition probabilitieste observation and transition probabilities

usually require large amount of trainingtest data to reach a reasonable perfusually require large amount of trainingtest data to reach a reasonable performanceormance

In this paper In this paper a VQ-based prosody modeling and an eigen-prosody analysis approach are a VQ-based prosody modeling and an eigen-prosody analysis approach are integrated together to add robustness to conventional cepstral features-basintegrated together to add robustness to conventional cepstral features-based GMMs close-set speaker identification system under the situation of mied GMMs close-set speaker identification system under the situation of mismatch unseen handsets and limited trainingtest datasmatch unseen handsets and limited trainingtest data

2005102420051024 Speech Lab NTNU 66

Introduction (cont)Introduction (cont)

Speakerrsquos enrollment

speech

Sequences ofProsody states

EPA

VQ-Based Prosodic Modeling

Text document records the detailprosodyspeaking

style of the speaker

Eigen-prosody space to representthe constellation

of speakers LSA

By this way the speaker identification problem is transformed into a

full-text document retrieval-similar task

2005102420051024 Speech Lab NTNU 77

Introduction (cont)Introduction (cont)

Since EPA utilizes prosody-level information it could be further Since EPA utilizes prosody-level information it could be further fused with acoustic-level information to complement each otherfused with acoustic-level information to complement each other

The The a priori a priori knowledge interpolation (AKI) approach is chosenknowledge interpolation (AKI) approach is chosenIt utilizes the maximum likelihood linear regression (MLLR) to It utilizes the maximum likelihood linear regression (MLLR) to compensate handset mismatchcompensate handset mismatch

2005102420051024 Speech Lab NTNU 88

VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling

In this study syllables are chosen as the basic processing unitsIn this study syllables are chosen as the basic processing units

Five types of prosodic features Five types of prosodic features the slope of pitch contour of a vowel segmentthe slope of pitch contour of a vowel segment

lengthening factor of a vowel segmentlengthening factor of a vowel segment

average log-energy difference between two vowelsaverage log-energy difference between two vowels

value of pitch jump between two vowelsvalue of pitch jump between two vowels

pause duration between two syllablespause duration between two syllables

The prosody features are normalized according to underline vowel class to The prosody features are normalized according to underline vowel class to remove any non-prosodic effects (context-information)remove any non-prosodic effects (context-information)

vowel

voweluxx

ˆ

mean of the vowel class of whole corpusuvowel

vowel variance of the vowel class of whole corpus

2005102420051024 Speech Lab NTNU 99

Prosodic ModelingProsodic Modeling

2005102420051024 Speech Lab NTNU 1010

Prosodic Modeling (cont)Prosodic Modeling (cont)

To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm

2005102420051024 Speech Lab NTNU 1111

Prosodic Modeling (cont)Prosodic Modeling (cont)

For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks

2005102420051024 Speech Lab NTNU 1212

Prosodic labelingProsodic labeling

2005102420051024 Speech Lab NTNU 1313

AKI Unseen Handset EstimationAKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets

Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch

The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined

where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge

N

nnnhh

1

ˆ

h

1

1

ˆˆ

ˆˆˆˆ

CB

CC

BTB

buAuhu

T

T

μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1

is the bias of the mixture component

A T

b

2005102420051024 Speech Lab NTNU 1414

AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)

2005102420051024 Speech Lab NTNU 1515

Eigen-Prosody AnalysisEigen-Prosody Analysis

The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords

Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix

Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors

2005102420051024 Speech Lab NTNU 1616

Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)

2005102420051024 Speech Lab NTNU 1717

Prosody keyword extractionProsody keyword extraction

After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary

First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed

After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones

2005102420051024 Speech Lab NTNU 1818

Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics

The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words

The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker

The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors

To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied

2005102420051024 Speech Lab NTNU 1919

Eigen-Prosody analysisEigen-Prosody analysis

In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space

Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as

Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m

atricesatrices

TKKKK

T VUAVUA

2005102420051024 Speech Lab NTNU 2020

Eigen-Prosody Analysis Eigen-Prosody Analysis

2005102420051024 Speech Lab NTNU 2121

Score measurementScore measurement

The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t

he eigen-prosody speaker space byhe eigen-prosody speaker space by

iKQ

iKT

QiKQ

Pi

KKTQQ

v

vvS

Uy

1

)(

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances

The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone

In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces

For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively

2005102420051024 Speech Lab NTNU 2323

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers

For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built

a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained

367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA

the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3

38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including

Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms

2005102420051024 Speech Lab NTNU 2424

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

2005102420051024 Speech Lab NTNU 2525

ConclusionsConclusions

This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach

Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets

It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata


  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
Page 5: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 55

Introduction (cont)Introduction (cont)

Several successful techniques (cont) Several successful techniques (cont) N-gram the dynamics of the pitch and energy trajectories are described bN-gram the dynamics of the pitch and energy trajectories are described by sequences of symbols and modeled by the n-gram statisticsy sequences of symbols and modeled by the n-gram statistics

DHMM the sequences of prosody symbols are further modeled by the staDHMM the sequences of prosody symbols are further modeled by the state observation and transition probabilitieste observation and transition probabilities

usually require large amount of trainingtest data to reach a reasonable perfusually require large amount of trainingtest data to reach a reasonable performanceormance

In this paper In this paper a VQ-based prosody modeling and an eigen-prosody analysis approach are a VQ-based prosody modeling and an eigen-prosody analysis approach are integrated together to add robustness to conventional cepstral features-basintegrated together to add robustness to conventional cepstral features-based GMMs close-set speaker identification system under the situation of mied GMMs close-set speaker identification system under the situation of mismatch unseen handsets and limited trainingtest datasmatch unseen handsets and limited trainingtest data

2005102420051024 Speech Lab NTNU 66

Introduction (cont)Introduction (cont)

Speakerrsquos enrollment

speech

Sequences ofProsody states

EPA

VQ-Based Prosodic Modeling

Text document records the detailprosodyspeaking

style of the speaker

Eigen-prosody space to representthe constellation

of speakers LSA

By this way the speaker identification problem is transformed into a

full-text document retrieval-similar task

2005102420051024 Speech Lab NTNU 77

Introduction (cont)Introduction (cont)

Since EPA utilizes prosody-level information it could be further Since EPA utilizes prosody-level information it could be further fused with acoustic-level information to complement each otherfused with acoustic-level information to complement each other

The The a priori a priori knowledge interpolation (AKI) approach is chosenknowledge interpolation (AKI) approach is chosenIt utilizes the maximum likelihood linear regression (MLLR) to It utilizes the maximum likelihood linear regression (MLLR) to compensate handset mismatchcompensate handset mismatch

2005102420051024 Speech Lab NTNU 88

VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling

In this study syllables are chosen as the basic processing unitsIn this study syllables are chosen as the basic processing units

Five types of prosodic features Five types of prosodic features the slope of pitch contour of a vowel segmentthe slope of pitch contour of a vowel segment

lengthening factor of a vowel segmentlengthening factor of a vowel segment

average log-energy difference between two vowelsaverage log-energy difference between two vowels

value of pitch jump between two vowelsvalue of pitch jump between two vowels

pause duration between two syllablespause duration between two syllables

The prosody features are normalized according to underline vowel class to The prosody features are normalized according to underline vowel class to remove any non-prosodic effects (context-information)remove any non-prosodic effects (context-information)

vowel

voweluxx

ˆ

mean of the vowel class of whole corpusuvowel

vowel variance of the vowel class of whole corpus

2005102420051024 Speech Lab NTNU 99

Prosodic ModelingProsodic Modeling

2005102420051024 Speech Lab NTNU 1010

Prosodic Modeling (cont)Prosodic Modeling (cont)

To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm

2005102420051024 Speech Lab NTNU 1111

Prosodic Modeling (cont)Prosodic Modeling (cont)

For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks

2005102420051024 Speech Lab NTNU 1212

Prosodic labelingProsodic labeling

2005102420051024 Speech Lab NTNU 1313

AKI Unseen Handset EstimationAKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets

Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch

The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined

where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge

N

nnnhh

1

ˆ

h

1

1

ˆˆ

ˆˆˆˆ

CB

CC

BTB

buAuhu

T

T

μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1

is the bias of the mixture component

A T

b

2005102420051024 Speech Lab NTNU 1414

AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)

2005102420051024 Speech Lab NTNU 1515

Eigen-Prosody AnalysisEigen-Prosody Analysis

The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords

Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix

Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors

2005102420051024 Speech Lab NTNU 1616

Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)

2005102420051024 Speech Lab NTNU 1717

Prosody keyword extractionProsody keyword extraction

After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary

First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed

After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones

2005102420051024 Speech Lab NTNU 1818

Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics

The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words

The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker

The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors

To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied

2005102420051024 Speech Lab NTNU 1919

Eigen-Prosody analysisEigen-Prosody analysis

In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space

Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as

Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m

atricesatrices

TKKKK

T VUAVUA

2005102420051024 Speech Lab NTNU 2020

Eigen-Prosody Analysis Eigen-Prosody Analysis

2005102420051024 Speech Lab NTNU 2121

Score measurementScore measurement

The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t

he eigen-prosody speaker space byhe eigen-prosody speaker space by

iKQ

iKT

QiKQ

Pi

KKTQQ

v

vvS

Uy

1

)(

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances

The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone

In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces

For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively

2005102420051024 Speech Lab NTNU 2323

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers

For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built

a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained

367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA

the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3

38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including

Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms

2005102420051024 Speech Lab NTNU 2424

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

2005102420051024 Speech Lab NTNU 2525

ConclusionsConclusions

This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach

Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets

It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata

2005102420051024 Speech Lab NTNU 2626

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
Page 6: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 66

Introduction (cont)Introduction (cont)

Speakerrsquos enrollment

speech

Sequences ofProsody states

EPA

VQ-Based Prosodic Modeling

Text document records the detailprosodyspeaking

style of the speaker

Eigen-prosody space to representthe constellation

of speakers LSA

By this way the speaker identification problem is transformed into a

full-text document retrieval-similar task

2005102420051024 Speech Lab NTNU 77

Introduction (cont)Introduction (cont)

Since EPA utilizes prosody-level information it could be further Since EPA utilizes prosody-level information it could be further fused with acoustic-level information to complement each otherfused with acoustic-level information to complement each other

The The a priori a priori knowledge interpolation (AKI) approach is chosenknowledge interpolation (AKI) approach is chosenIt utilizes the maximum likelihood linear regression (MLLR) to It utilizes the maximum likelihood linear regression (MLLR) to compensate handset mismatchcompensate handset mismatch

2005102420051024 Speech Lab NTNU 88

VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling

In this study syllables are chosen as the basic processing unitsIn this study syllables are chosen as the basic processing units

Five types of prosodic features Five types of prosodic features the slope of pitch contour of a vowel segmentthe slope of pitch contour of a vowel segment

lengthening factor of a vowel segmentlengthening factor of a vowel segment

average log-energy difference between two vowelsaverage log-energy difference between two vowels

value of pitch jump between two vowelsvalue of pitch jump between two vowels

pause duration between two syllablespause duration between two syllables

The prosody features are normalized according to underline vowel class to The prosody features are normalized according to underline vowel class to remove any non-prosodic effects (context-information)remove any non-prosodic effects (context-information)

vowel

voweluxx

ˆ

mean of the vowel class of whole corpusuvowel

vowel variance of the vowel class of whole corpus

2005102420051024 Speech Lab NTNU 99

Prosodic ModelingProsodic Modeling

2005102420051024 Speech Lab NTNU 1010

Prosodic Modeling (cont)Prosodic Modeling (cont)

To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm

2005102420051024 Speech Lab NTNU 1111

Prosodic Modeling (cont)Prosodic Modeling (cont)

For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks

2005102420051024 Speech Lab NTNU 1212

Prosodic labelingProsodic labeling

2005102420051024 Speech Lab NTNU 1313

AKI Unseen Handset EstimationAKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets

Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch

The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined

where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge

N

nnnhh

1

ˆ

h

1

1

ˆˆ

ˆˆˆˆ

CB

CC

BTB

buAuhu

T

T

μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1

is the bias of the mixture component

A T

b

2005102420051024 Speech Lab NTNU 1414

AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)

2005102420051024 Speech Lab NTNU 1515

Eigen-Prosody AnalysisEigen-Prosody Analysis

The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords

Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix

Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors

2005102420051024 Speech Lab NTNU 1616

Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)

2005102420051024 Speech Lab NTNU 1717

Prosody keyword extractionProsody keyword extraction

After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary

First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed

After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones

2005102420051024 Speech Lab NTNU 1818

Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics

The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words

The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker

The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors

To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied

2005102420051024 Speech Lab NTNU 1919

Eigen-Prosody analysisEigen-Prosody analysis

In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space

Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as

Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m

atricesatrices

TKKKK

T VUAVUA

2005102420051024 Speech Lab NTNU 2020

Eigen-Prosody Analysis Eigen-Prosody Analysis

2005102420051024 Speech Lab NTNU 2121

Score measurementScore measurement

The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t

he eigen-prosody speaker space byhe eigen-prosody speaker space by

iKQ

iKT

QiKQ

Pi

KKTQQ

v

vvS

Uy

1

)(

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances

The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone

In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces

For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively

2005102420051024 Speech Lab NTNU 2323

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers

For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built

a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained

367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA

the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3

38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including

Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms

2005102420051024 Speech Lab NTNU 2424

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

2005102420051024 Speech Lab NTNU 2525

ConclusionsConclusions

This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach

Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets

It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata

2005102420051024 Speech Lab NTNU 2626

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
Page 7: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 77

Introduction (cont)Introduction (cont)

Since EPA utilizes prosody-level information it could be further Since EPA utilizes prosody-level information it could be further fused with acoustic-level information to complement each otherfused with acoustic-level information to complement each other

The The a priori a priori knowledge interpolation (AKI) approach is chosenknowledge interpolation (AKI) approach is chosenIt utilizes the maximum likelihood linear regression (MLLR) to It utilizes the maximum likelihood linear regression (MLLR) to compensate handset mismatchcompensate handset mismatch

2005102420051024 Speech Lab NTNU 88

VQ-Based Prosodic ModelingVQ-Based Prosodic Modeling

In this study syllables are chosen as the basic processing unitsIn this study syllables are chosen as the basic processing units

Five types of prosodic features Five types of prosodic features the slope of pitch contour of a vowel segmentthe slope of pitch contour of a vowel segment

lengthening factor of a vowel segmentlengthening factor of a vowel segment

average log-energy difference between two vowelsaverage log-energy difference between two vowels

value of pitch jump between two vowelsvalue of pitch jump between two vowels

pause duration between two syllablespause duration between two syllables

The prosody features are normalized according to underline vowel class to The prosody features are normalized according to underline vowel class to remove any non-prosodic effects (context-information)remove any non-prosodic effects (context-information)

vowel

voweluxx

ˆ

mean of the vowel class of whole corpusuvowel

vowel variance of the vowel class of whole corpus

2005102420051024 Speech Lab NTNU 99

Prosodic ModelingProsodic Modeling

2005102420051024 Speech Lab NTNU 1010

Prosodic Modeling (cont)Prosodic Modeling (cont)

To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm

2005102420051024 Speech Lab NTNU 1111

Prosodic Modeling (cont)Prosodic Modeling (cont)

For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks

2005102420051024 Speech Lab NTNU 1212

Prosodic labelingProsodic labeling

2005102420051024 Speech Lab NTNU 1313

AKI Unseen Handset EstimationAKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets

Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch

The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined

where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge

N

nnnhh

1

ˆ

h

1

1

ˆˆ

ˆˆˆˆ

CB

CC

BTB

buAuhu

T

T

μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1

is the bias of the mixture component

A T

b

2005102420051024 Speech Lab NTNU 1414

AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)

2005102420051024 Speech Lab NTNU 1515

Eigen-Prosody AnalysisEigen-Prosody Analysis

The procedure of EPA includes:

VQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody keywords

Calculating the occurrence statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrix

Applying the singular value decomposition (SVD) technique to decompose the matrix and build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two speaker vectors

Eigen-Prosody Analysis (cont.)

Prosody keyword extraction

After the prosodic state labeling, the prosody text documents of all speakers are searched to find important prosody keywords in order to establish a prosody keyword dictionary.

First, all possible combinations of the prosody words, including single words and word pairs (uni-grams and bi-grams), are listed and their frequency statistics are computed.

After calculating the histogram of all prosody word frequencies, thresholds are set to keep only the high-frequency ones.
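As a concrete sketch of this dictionary-building step, assuming each speaker's labeled utterances are already available as sequences of prosody-state indices; the function name and the `min_count` threshold parameter are assumptions for illustration.

```python
from collections import Counter

def extract_keywords(state_docs, min_count=2):
    """Collect uni-grams and bi-grams of prosody states over all
    documents and keep only the high-frequency ones as the dictionary.

    state_docs: list of prosody-state sequences, e.g. [[6, 3, 4, 6], ...]
    """
    counts = Counter()
    for doc in state_docs:
        counts.update((s,) for s in doc)   # uni-grams
        counts.update(zip(doc, doc[1:]))   # bi-grams
    return {w for w, c in counts.items() if c >= min_count}
```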

Prosody keyword-speaker occurrence matrix statistics

The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words.

The occurrence counts of the keywords of a speaker are recorded in a prosody keyword list vector to represent the long-term prosodic behavior of that specific speaker.

The prosody keyword-speaker occurrence matrix A is made up of the collection of all speakers' prosody keyword list vectors.

To emphasize the uncommon keywords and de-emphasize the very common ones, the inverse document frequency (IDF) weighting method is applied.
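The parsing and weighting steps above can be sketched as follows. This is an illustrative assumption in two places: the greedy longest-match parse (bi-gram first, then uni-gram) stands in for the "longer words first" rule, and the particular IDF variant, log(n/df) + 1, is a common choice rather than the paper's stated formula.

```python
import math

def keyword_speaker_matrix(keywords, speaker_docs):
    """Build the IDF-weighted keyword-by-speaker occurrence matrix A."""
    kw = sorted(keywords)
    counts = []
    for doc in speaker_docs:
        row, i = {w: 0 for w in kw}, 0
        while i < len(doc):
            bg, ug = tuple(doc[i:i + 2]), (doc[i],)
            if len(bg) == 2 and bg in row:   # longer keyword wins
                row[bg] += 1
                i += 2
            else:
                if ug in row:
                    row[ug] += 1
                i += 1
        counts.append(row)
    n = len(speaker_docs)
    A = []
    for w in kw:
        df = sum(1 for row in counts if row[w] > 0) or 1
        idf = math.log(n / df) + 1.0         # inverse document frequency
        A.append([row[w] * idf for row in counts])
    return kw, A   # A is len(keywords) x len(speakers)
```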

Eigen-Prosody analysis

In order to reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using SVD to find a compact eigen-prosody space.

Given an m by n (m >> n) matrix A of rank R, A is decomposed and further approximated using only the largest K singular values as

A = U Σ V^T ≈ A_K = U_K Σ_K V_K^T

where A_K, U_K, V_K, and Σ_K are the rank-reduced versions of the respective matrices.

Eigen-Prosody Analysis (cont.)

Score measurement

The test utterances of a test speaker are first labeled and parsed to form the pseudo query document y_Q, which is then transformed into the query vector v_Q in the eigen-prosody speaker space by

v_Q = y_Q^T U_K Σ_K^{-1}

S(v_Q, v_i) = (v_Q · v_i) / (||v_Q|| ||v_i||),  i = 1~P

where v_i is the speaker vector of the i-th of the P enrolled speakers.
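The fold-in and cosine scoring above amount to a few lines of NumPy; the function names are assumptions. Since Σ_K is diagonal, multiplying by its inverse is an elementwise division by the singular values.

```python
import numpy as np

def query_vector(y_q, U_K, s_K):
    """Fold a query keyword-count vector into the eigen-prosody space:
    v_Q = y_Q^T U_K Sigma_K^{-1} (Sigma_K diagonal, so divide by s_K)."""
    return y_q @ U_K / s_K

def cosine_score(v_q, v_i):
    """S(v_Q, v_i): cosine of the angle between query and speaker vectors."""
    return float(v_q @ v_i / (np.linalg.norm(v_q) * np.linalg.norm(v_i)))
```

A query built from one speaker's own column of A folds in exactly onto that speaker's vector, so its cosine score against that speaker is 1.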

Speaker Identification Experiments

HTIMIT database and experiment conditions: in total, 384 speakers, each of whom gave ten utterances.

The set of 384 x 10 utterances was then played back and recorded through nine other different handsets, including four carbon-button handsets, four electret handsets, and one portable cordless phone.

In this paper, all experiments were performed on the 302 speakers (151 females and 151 males) who have all ten utterances.

For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.

The other ten three-utterance sessions of each speaker, one from each of the ten handsets, were used as the evaluation data, respectively.

Speaker Identification Experiments (cont.)

To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.

For each speaker, a MAP-adapted GMM (MAP-GMM) was built by adapting the UBM with his or her own enrollment speech.

A 32-state prosodic model was trained.

367 prosody keywords were extracted to form a sparse 367 x 302 dimensional matrix A.

The dimension of the eigen-prosody space is 3.

38 mel-frequency cepstral coefficients (MFCCs) were used

Window size of 30 ms and frame shift of 10 ms

Speaker Identification Experiments (cont.)

Conclusions

This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach.

Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets.

It is therefore a promising method for robust speaker identification under mismatched environments with limited available data.

Page 8: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

VQ-Based Prosodic Modeling

In this study, syllables are chosen as the basic processing units.

Five types of prosodic features are used:

the slope of the pitch contour of a vowel segment

the lengthening factor of a vowel segment

the average log-energy difference between two vowels

the value of the pitch jump between two vowels

the pause duration between two syllables

The prosody features are normalized according to the underlying vowel class to remove any non-prosodic (context-information) effects:

x̂ = (x − u_vowel) / σ_vowel

where u_vowel and σ_vowel are the mean and standard deviation of the vowel class computed over the whole corpus.
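The vowel-class z-score above is a one-liner; the layout of the per-class statistics table is an assumption for illustration.

```python
def normalize_feature(x, vowel, stats):
    """x_hat = (x - u_vowel) / sigma_vowel, where stats maps each vowel
    class to its corpus-wide (mean, standard deviation)."""
    u, sigma = stats[vowel]
    return (x - u) / sigma
```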

Prosodic Modeling

Prosodic Modeling (cont.)

To build the prosodic model, the prosodic features are vector quantized into M codewords using the expectation-maximization (EM) algorithm.
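A sketch of the codebook training step: the paper trains with EM, and the hard-assignment (k-means-style) loop below is a simplified stand-in for it, with illustrative parameter names.

```python
import numpy as np

def train_codebook(X, M, iters=20, seed=0):
    """Vector-quantize prosodic feature vectors X (n x d) into M codewords.
    Hard-assignment (k-means) sketch standing in for the EM training."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), M, replace=False)]
    for _ in range(iters):
        # E-step analogue: assign each vector to its nearest codeword
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # M-step analogue: move each codeword to the mean of its assignments
        for m in range(M):
            if np.any(labels == m):
                centers[m] = X[labels == m].mean(0)
    return centers, labels
```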

2005102420051024 Speech Lab NTNU 1111

Prosodic Modeling (cont)Prosodic Modeling (cont)

For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks

2005102420051024 Speech Lab NTNU 1212

Prosodic labelingProsodic labeling

2005102420051024 Speech Lab NTNU 1313

AKI Unseen Handset EstimationAKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets

Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch

The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined

where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge

N

nnnhh

1

ˆ

h

1

1

ˆˆ

ˆˆˆˆ

CB

CC

BTB

buAuhu

T

T

μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1

is the bias of the mixture component

A T

b

2005102420051024 Speech Lab NTNU 1414

AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)

2005102420051024 Speech Lab NTNU 1515

Eigen-Prosody AnalysisEigen-Prosody Analysis

The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords

Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix

Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors

2005102420051024 Speech Lab NTNU 1616

Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)

2005102420051024 Speech Lab NTNU 1717

Prosody keyword extractionProsody keyword extraction

After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary

First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed

After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones

2005102420051024 Speech Lab NTNU 1818

Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics

The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words

The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker

The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors

To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied

2005102420051024 Speech Lab NTNU 1919

Eigen-Prosody analysisEigen-Prosody analysis

In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space

Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as

Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m

atricesatrices

TKKKK

T VUAVUA

2005102420051024 Speech Lab NTNU 2020

Eigen-Prosody Analysis Eigen-Prosody Analysis

2005102420051024 Speech Lab NTNU 2121

Score measurementScore measurement

The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t

he eigen-prosody speaker space byhe eigen-prosody speaker space by

iKQ

iKT

QiKQ

Pi

KKTQQ

v

vvS

Uy

1

)(

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances

The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone

In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces

For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively

2005102420051024 Speech Lab NTNU 2323

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers

For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built

a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained

367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA

the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3

38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including

Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms

2005102420051024 Speech Lab NTNU 2424

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

2005102420051024 Speech Lab NTNU 2525

ConclusionsConclusions

This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach

Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets

It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata

2005102420051024 Speech Lab NTNU 2626

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
Page 9: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 99

Prosodic ModelingProsodic Modeling

2005102420051024 Speech Lab NTNU 1010

Prosodic Modeling (cont)Prosodic Modeling (cont)

To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm

2005102420051024 Speech Lab NTNU 1111

Prosodic Modeling (cont)Prosodic Modeling (cont)

For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks

2005102420051024 Speech Lab NTNU 1212

Prosodic labelingProsodic labeling

2005102420051024 Speech Lab NTNU 1313

AKI Unseen Handset EstimationAKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handset aThe concept of AKI is to first collect a set of characteristics of seen handset as the s the a priori a priori knowledge to construct a space of handsetsknowledge to construct a space of handsets

Model-space AKI using the MLLR model transformation is chosen to compeModel-space AKI using the MLLR model transformation is chosen to compensate the handset mismatchnsate the handset mismatch

The estimate of the characteristic of a test handset is definedThe estimate of the characteristic of a test handset is defined

where where HH = = hhnn=(=(AAnn bbnn TTnn)) n n=1~=1~NN is the set of is the set of a priori a priori knowledgeknowledge

N

nnnhh

1

ˆ

h

1

1

ˆˆ

ˆˆˆˆ

CB

CC

BTB

buAuhu

T

T

μ and 1048580 are the original and adapted mixture mean vectorsΣ and 1048580 are the original and adapted mixture variance vectors and are MLLR mean and variance transformation matricesB is the inverse function of Choleski factor of the original variance matrix Σ-1

is the bias of the mixture component

A T

b

2005102420051024 Speech Lab NTNU 1414

AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)

2005102420051024 Speech Lab NTNU 1515

Eigen-Prosody AnalysisEigen-Prosody Analysis

The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords

Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix

Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors

2005102420051024 Speech Lab NTNU 1616

Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)

2005102420051024 Speech Lab NTNU 1717

Prosody keyword extractionProsody keyword extraction

After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary

First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed

After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones

2005102420051024 Speech Lab NTNU 1818

Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics

The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words

The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker

The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors

To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied

2005102420051024 Speech Lab NTNU 1919

Eigen-Prosody analysisEigen-Prosody analysis

In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space

Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as

Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m

atricesatrices

TKKKK

T VUAVUA

2005102420051024 Speech Lab NTNU 2020

Eigen-Prosody Analysis Eigen-Prosody Analysis

2005102420051024 Speech Lab NTNU 2121

Score measurementScore measurement

The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t

he eigen-prosody speaker space byhe eigen-prosody speaker space by

iKQ

iKT

QiKQ

Pi

KKTQQ

v

vvS

Uy

1

)(

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances

The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone

In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces

For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively

2005102420051024 Speech Lab NTNU 2323

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers

For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built

a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained

367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA

the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3

38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including

Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms

2005102420051024 Speech Lab NTNU 2424

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

2005102420051024 Speech Lab NTNU 2525

ConclusionsConclusions

This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach

Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets

It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata

2005102420051024 Speech Lab NTNU 2626

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
Page 10: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 1010

Prosodic Modeling (cont)Prosodic Modeling (cont)

To build the prosodic model the prosodic features are vector quaTo build the prosodic model the prosodic features are vector quantized into ntized into M M codewords using the expectation-maximum (EM) acodewords using the expectation-maximum (EM) algorithmlgorithm

2005102420051024 Speech Lab NTNU 1111

Prosodic Modeling (cont)Prosodic Modeling (cont)

For example state 6 is the phrase-start state 3 and 4 are the For example state 6 is the phrase-start state 3 and 4 are the major-breaksmajor-breaks

Prosodic labeling

AKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handsets as the a priori knowledge to construct a space of handsets.

Model-space AKI using the MLLR model transformation is chosen to compensate for the handset mismatch.

The estimate of the characteristic of a test handset is defined as a weighted combination of the a priori transforms:

    ĥ = Σ_{n=1~N} α_n h_n

where H = {h_n = (A_n, b_n, T_n), n = 1~N} is the set of a priori knowledge. Each transform adapts a Gaussian mixture component through MLLR:

    μ̂ = A μ + b,    Σ̂ = Bᵀ T B

where μ and μ̂ are the original and adapted mixture mean vectors; Σ and Σ̂ are the original and adapted mixture variance vectors; A and T are the MLLR mean and variance transformation matrices; b is the bias of the mixture component; and B is the inverse of the Choleski factor of the original inverse variance matrix Σ⁻¹.
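
A minimal numerical sketch of the combination idea (all transform values and weights α_n here are made up for illustration; in practice the weights would be estimated from the test utterance):

```python
import numpy as np

# a priori knowledge: MLLR mean transforms (A_n, b_n) of three seen handsets
H = [
    (np.array([[1.1, 0.0], [0.0, 0.9]]), np.array([0.2, -0.1])),
    (np.array([[0.95, 0.1], [0.0, 1.0]]), np.array([0.0, 0.3])),
    (np.array([[1.0, 0.0], [0.1, 1.05]]), np.array([-0.2, 0.0])),
]
alpha = np.array([0.5, 0.3, 0.2])  # interpolation weights (assumed given here)

# unseen-handset transform as the weighted combination of seen ones
A_hat = sum(a * A for a, (A, b) in zip(alpha, H))
b_hat = sum(a * b for a, (A, b) in zip(alpha, H))

mu = np.array([1.0, 2.0])        # original mixture mean
mu_adapted = A_hat @ mu + b_hat  # MLLR-adapted mean: mu_hat = A mu + b
```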

AKI Unseen Handset Estimation (cont.)

Eigen-Prosody Analysis

The EPA procedure includes:

VQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody keywords

Calculating the occurrence statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrix

Applying the singular value decomposition (SVD) technique to decompose the matrix and build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two speaker vectors

Eigen-Prosody Analysis (cont.)

Prosody keyword extraction

After the prosodic state labeling, the prosody text documents of all speakers are searched to find important prosody keywords in order to establish a prosody keyword dictionary.

First, all possible combinations of the prosody words, including single words and word pairs (uni-gram and bi-gram), are listed and their frequency statistics are computed.

After calculating the histogram of all prosody words, frequency thresholds are set to keep only the high-frequency ones.
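
A toy sketch of the uni-gram/bi-gram counting step (the `min_count` threshold here stands in for the histogram-derived frequency thresholds):

```python
from collections import Counter

def extract_keywords(state_docs, min_count=2):
    """Count uni-grams and bi-grams of prosody states over all speakers'
    prosody text documents and keep only the frequent ones as keywords."""
    counts = Counter()
    for states in state_docs:
        counts.update((s,) for s in states)    # uni-grams
        counts.update(zip(states, states[1:])) # bi-grams
    return {ngram for ngram, c in counts.items() if c >= min_count}

docs = [[6, 3, 3, 4, 6], [6, 3, 4, 4, 6]]  # prosody-state sequences
keywords = extract_keywords(docs, min_count=2)
```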

Prosody keyword-speaker occurrence matrix statistics

The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words.

The occurrence counts of the keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of that specific speaker.

The prosody keyword-speaker occurrence matrix A is made up of the collection of all speakers' prosody keyword list vectors.

To emphasize the uncommon keywords and de-emphasize the very common ones, the inverse document frequency (IDF) weighting method is applied.
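
The matrix construction and IDF weighting might be sketched as follows (the exact IDF formula is an assumption, since the slides do not spell it out):

```python
import numpy as np

def keyword_speaker_matrix(docs, keywords):
    """Build the keyword-by-speaker occurrence matrix A and apply a
    common IDF weighting variant."""
    kw = sorted(keywords)
    A = np.zeros((len(kw), len(docs)))
    for j, doc in enumerate(docs):
        for i, k in enumerate(kw):
            A[i, j] = doc.count(k)
    # IDF: log(n_speakers / n_speakers_containing_keyword)
    df = (A > 0).sum(axis=1)
    idf = np.log(len(docs) / np.maximum(df, 1))
    return A * idf[:, None]

docs = [["a", "b", "a"], ["a", "c"], ["b", "b", "c"]]  # parsed keyword lists
A = keyword_speaker_matrix(docs, {"a", "b", "c"})
```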

Eigen-Prosody analysis

In order to reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using SVD to find a compact eigen-prosody space.

Given an m by n (m >> n) matrix A of rank R, A is decomposed and further approximated using only the largest K singular values as

    A = U Σ Vᵀ ≈ A_K = U_K Σ_K V_Kᵀ

where the A_K, U_K, Σ_K and V_K matrices are the rank-reduced versions of the respective matrices.
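
The rank-K approximation can be sketched with NumPy (toy matrix; in the paper A is 367 × 302):

```python
import numpy as np

# toy keyword-by-speaker matrix (m >> n in practice)
A = np.array([[3., 0., 1.],
              [2., 0., 0.],
              [0., 4., 1.],
              [0., 3., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending

K = 2                                      # keep the K largest singular values
A_K = U[:, :K] @ np.diag(s[:K]) @ Vt[:K]   # rank-K approximation A_K
# rows of V_K give each speaker's coordinates in the K-dim eigen-prosody space
speaker_coords = Vt[:K].T
```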

Eigen-Prosody Analysis (cont.)

Score measurement

The test utterances of a test speaker are first labeled and parsed to form the pseudo query document y_Q, and then transformed into the query vector v_Q in the eigen-prosody speaker space by

    v_Q = y_Qᵀ U_K Σ_K⁻¹

The similarity score between the query vector and the vector v_i of enrolled speaker i is the cosine of the angle between them:

    S(v_Q, v_i) = (v_Qᵀ v_i) / (‖v_Q‖ ‖v_i‖)
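
A toy sketch of the folding-in and cosine scoring, reusing a small made-up keyword-by-speaker matrix (the query counts y_Q resemble speaker 0, so speaker 0 should score highest):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# eigen-prosody space from a toy SVD: A ~= U_K diag(s_K) V_K^T
A = np.array([[3., 0., 1.],
              [2., 0., 0.],
              [0., 4., 1.],
              [0., 3., 2.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
K = 2
U_K, s_K = U[:, :K], s[:K]
speakers = Vt[:K].T               # one K-dim vector per enrolled speaker

# fold a query document (keyword-count vector y_Q) into the space:
# v_Q = y_Q^T U_K Sigma_K^-1
y_Q = np.array([2., 1., 0., 0.])
v_Q = y_Q @ U_K / s_K

scores = [cosine(v_Q, spk) for spk in speakers]
best = int(np.argmax(scores))     # identified speaker index
```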

Speaker Identification Experiments

HTIMIT database and experiment conditions: in total, 384 speakers each gave ten utterances.

The set of 384 × 10 utterances was then played back and recorded through nine other different handsets, including four carbon-button and four electret handsets and one portable cordless phone.

In this paper, all experiments were performed on the 302 speakers (151 females and 151 males) who have all ten utterances.

For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.

The other ten three-utterance sessions of each speaker, one from each of the ten handsets, were used as the evaluation data.

Page 13: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 1313

AKI Unseen Handset Estimation

The concept of AKI is to first collect a set of characteristics of seen handsets as the a priori knowledge to construct a space of handsets.

Model-space AKI using the MLLR model transformation is chosen to compensate for the handset mismatch.

The estimate of the characteristic of a test handset is defined as

  ĥ = Σ_{n=1}^{N} α_n h_n

where H = {h_n = (A_n, b_n, T_n), n = 1, ..., N} is the set of a priori knowledge.

The MLLR transformation of each mixture component is

  μ̂ = Â μ + b̂,   Σ̂ = Bᵀ T̂ B

where μ and μ̂ are the original and adapted mixture mean vectors, Σ and Σ̂ are the original and adapted mixture variance vectors, Â and T̂ are the MLLR mean and variance transformation matrices, B is the inverse of the Choleski factor of the original inverse variance matrix Σ⁻¹, and b̂ is the bias of the mixture component.
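As a rough sketch (not the authors' code), the two operations above can be illustrated with hypothetical transforms; the uniform interpolation weights `alpha` and the toy handset set `H` are assumptions, and the variance transform is omitted for brevity:

```python
import numpy as np

# Hypothetical a priori knowledge: MLLR mean transforms (A_n, b_n)
# estimated from N seen handsets (variance transforms T_n omitted).
N, d = 9, 3
H = [(np.eye(d) * (1 + 0.1 * n), np.full(d, 0.01 * n)) for n in range(N)]

def interpolate_handset(H, alpha):
    """AKI: estimate an unseen handset as a weighted sum of seen ones."""
    A_hat = sum(a * A for (A, _), a in zip(H, alpha))
    b_hat = sum(a * b for (_, b), a in zip(H, alpha))
    return A_hat, b_hat

def mllr_adapt_mean(mu, A_hat, b_hat):
    """MLLR mean adaptation: mu_hat = A_hat @ mu + b_hat."""
    return A_hat @ mu + b_hat

alpha = np.full(N, 1.0 / N)          # uniform weights (illustrative assumption)
A_hat, b_hat = interpolate_handset(H, alpha)
mu = np.zeros(d)                     # a toy mixture mean vector
mu_hat = mllr_adapt_mean(mu, A_hat, b_hat)
```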

AKI Unseen Handset Estimation (cont.)

Eigen-Prosody Analysis

The EPA procedure includes:

VQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody keywords

Calculating the occurrence statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrix

Applying the singular value decomposition (SVD) technique to decompose the matrix to build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two speaker vectors
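The first step can be sketched as follows, assuming a plain k-means VQ codebook and synthetic pitch/energy/duration features; the feature set, state count, and data here are illustrative, not the paper's exact configuration:

```python
import numpy as np

def train_codebook(feats, n_states, iters=20, seed=0):
    """Plain k-means VQ: learn `n_states` prosody-state centroids."""
    rng = np.random.default_rng(seed)
    cb = feats[rng.choice(len(feats), n_states, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((feats[:, None] - cb[None]) ** 2).sum(-1), axis=1)
        for k in range(n_states):
            if np.any(labels == k):          # guard against empty clusters
                cb[k] = feats[labels == k].mean(0)
    return cb

def label_sequence(feats, cb):
    """Quantize each frame's prosodic feature vector to its nearest state."""
    return np.argmin(((feats[:, None] - cb[None]) ** 2).sum(-1), axis=1)

# Synthetic syllable-level prosodic features: (pitch, energy, duration)
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(m, 0.1, (50, 3)) for m in (0.0, 1.0, 2.0, 3.0)])
cb = train_codebook(feats, n_states=4)
states = label_sequence(feats, cb)    # the speaker's "prosody text" sequence
```

The resulting state-label sequence is what the later steps treat as a text document.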

Eigen-Prosody Analysis (cont.)

Prosody keyword extraction

After the prosodic state labeling, the prosody text documents of all speakers are searched to find important prosody keywords in order to establish a prosody keyword dictionary.

First, all possible combinations of the prosody words, including single words and word pairs (uni-gram and bi-gram), are listed and their frequency statistics are computed.

After calculating the histogram of all prosody words, frequency thresholds are set to keep only the high-frequency ones.
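The uni-gram/bi-gram counting with a frequency cut-off might look like this minimal sketch; the threshold value is an arbitrary illustration:

```python
from collections import Counter

def extract_keywords(state_seqs, min_count=3):
    """Count prosody-word uni-grams and bi-grams over all speakers'
    prosody state sequences, keeping only the high-frequency ones."""
    counts = Counter()
    for seq in state_seqs:
        counts.update((s,) for s in seq)      # uni-grams
        counts.update(zip(seq, seq[1:]))      # bi-grams (adjacent pairs)
    return {w for w, c in counts.items() if c >= min_count}

# Toy prosody-state documents for three speakers
docs = [[0, 1, 0, 1, 2], [0, 1, 0, 3], [1, 0, 1, 0]]
keywords = extract_keywords(docs, min_count=3)
```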


Prosody keyword-speaker occurrence matrix statistics

The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words.

The occurrence counts of a speaker's keywords are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of that specific speaker.

The prosody keyword-speaker occurrence matrix A is made up of the collection of all speakers' prosody keyword list vectors.

To emphasize the uncommon keywords and to deemphasize the very common ones, the inverse document frequency (IDF) weighting method is applied.
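A sketch of building the keyword-speaker matrix with IDF weighting; a standard log-IDF form is assumed here, and the paper may use a different variant:

```python
import numpy as np

def build_matrix(counts_per_speaker, vocab):
    """Stack each speaker's keyword-count vector into an m-by-n matrix A
    (m keywords, n speakers), then apply IDF weighting per keyword row."""
    A = np.array([[cnts.get(w, 0) for cnts in counts_per_speaker]
                  for w in vocab], dtype=float)
    n_speakers = A.shape[1]
    df = np.count_nonzero(A, axis=1)              # document frequency
    idf = np.log(n_speakers / np.maximum(df, 1))  # rare keywords weigh more
    return A * idf[:, None]

# Toy dictionary and two speakers' keyword counts
vocab = [(0,), (1,), (0, 1)]
counts = [{(0,): 2, (0, 1): 1}, {(1,): 3, (0, 1): 2}]
A = build_matrix(counts, vocab)
```

Note that a keyword used by every speaker gets weight zero, which is exactly the intended deemphasis of very common keywords.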


Eigen-Prosody analysis

In order to reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using SVD to find a compact eigen-prosody space.

Given an m by n (m >> n) matrix A of rank R, A is decomposed and further approximated using only the largest K singular values as

  A = U Σ Vᵀ ≈ A_K = U_K Σ_K V_Kᵀ

where the A_K, U_K, V_K, and Σ_K matrices are the rank-reduced versions of the respective matrices.
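The rank-K approximation can be computed directly with a standard SVD routine; the sketch below uses random stand-in data of the paper's 367×302 size and K = 3, its reported eigen-prosody dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(367, 302))   # stand-in for the keyword-speaker matrix

def truncated_svd(A, K):
    """Keep only the K largest singular values: A ~ U_K @ diag(S_K) @ V_K^T."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :K], S[:K], Vt[:K, :]

U_K, S_K, Vt_K = truncated_svd(A, K=3)
A_K = U_K @ np.diag(S_K) @ Vt_K   # best rank-3 approximation of A
```

The columns of V_K (rows of Vt_K) are the speakers' coordinates in the compact eigen-prosody space.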


Score measurement

The test utterances of a test speaker are first labeled and parsed to form the pseudo query document y_Q, and then transformed into the query vector v_Q in the eigen-prosody speaker space by

  v_Q = y_Qᵀ U_K Σ_K⁻¹

The score of speaker i is the cosine of the angle between the query vector and the speaker vector v_i:

  S(v_Q, v_i) = (v_Q · v_i) / (‖v_Q‖ ‖v_i‖),  i = 1, ..., P
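Assuming this LSA-style folding-in, projecting a query document and scoring it against all speaker vectors might look like the sketch below, where the speaker vectors are taken as the rows of V_K and the data are toy stand-ins:

```python
import numpy as np

def project_query(y_q, U_K, S_K):
    """Fold a query keyword-count vector into the eigen-prosody space:
    v_Q = y_Q^T U_K S_K^{-1}."""
    return y_q @ U_K / S_K

def cosine_score(v_q, v_i):
    """Speaker score: cosine of the angle between the two vectors."""
    return float(v_q @ v_i / (np.linalg.norm(v_q) * np.linalg.norm(v_i)))

# Toy eigen-prosody space from a small keyword-speaker matrix
rng = np.random.default_rng(0)
A = rng.normal(size=(12, 5))                   # 12 keywords, 5 speakers
U, S, Vt = np.linalg.svd(A, full_matrices=False)
U_K, S_K, V_K = U[:, :3], S[:3], Vt[:3, :].T   # speaker vectors = rows of V_K

y_q = A[:, 2]                                  # query = speaker 2's own counts
v_q = project_query(y_q, U_K, S_K)
scores = [cosine_score(v_q, V_K[i]) for i in range(V_K.shape[0])]
best = int(np.argmax(scores))                  # identified speaker
```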


Speaker Identification Experiments

HTIMIT database and experiment conditions: a total of 384 speakers, each of whom gave ten utterances.

The set of 384 × 10 utterances was then played back and recorded through nine other different handsets, including four carbon-button handsets, four electret handsets, and one portable cordless phone.

In this paper, all experiments were performed on the 302 speakers (151 females and 151 males) who have all ten utterances.

For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech.

The other ten three-utterance sessions of each speaker from the ten handsets were used as the evaluation data, respectively.

Speaker Identification Experiments (cont.)

To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.

For each speaker, a MAP-adapted GMM (MAP-GMM) was built by adapting the UBM with his or her own enrollment speech.

A 32-state prosodic model was trained.

367 prosody keywords were extracted to form a sparse 367 × 302 dimensional matrix A.

The dimension of the eigen-prosody space is 3.

38 mel-frequency cepstral coefficients (MFCCs), including

Window size of 30 ms and frame shift of 10 ms.

Speaker Identification Experiments (cont.)

Conclusions

This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach.

Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets.

It is therefore a promising method for robust speaker identification under mismatched environments and limited available data.

2005/10/24 Speech Lab, NTNU

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
Page 14: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 1414

AKI Unseen Handset Estimation (cont)AKI Unseen Handset Estimation (cont)

2005102420051024 Speech Lab NTNU 1515

Eigen-Prosody AnalysisEigen-Prosody Analysis

The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords

Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix

Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors

2005102420051024 Speech Lab NTNU 1616

Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)

2005102420051024 Speech Lab NTNU 1717

Prosody keyword extractionProsody keyword extraction

After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary

First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed

After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones

2005102420051024 Speech Lab NTNU 1818

Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics

The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words

The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker

The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors

To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied

2005102420051024 Speech Lab NTNU 1919

Eigen-Prosody analysisEigen-Prosody analysis

In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space

Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as

Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m

atricesatrices

TKKKK

T VUAVUA

2005102420051024 Speech Lab NTNU 2020

Eigen-Prosody Analysis Eigen-Prosody Analysis

2005102420051024 Speech Lab NTNU 2121

Score measurementScore measurement

The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t

he eigen-prosody speaker space byhe eigen-prosody speaker space by

iKQ

iKT

QiKQ

Pi

KKTQQ

v

vvS

Uy

1

)(

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances

The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone

In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces

For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively

2005102420051024 Speech Lab NTNU 2323

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers

For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built

a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained

367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA

the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3

38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including

Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms

2005102420051024 Speech Lab NTNU 2424

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

2005102420051024 Speech Lab NTNU 2525

ConclusionsConclusions

This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach

Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets

It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata

2005102420051024 Speech Lab NTNU 2626

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
Page 15: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 1515

Eigen-Prosody AnalysisEigen-Prosody Analysis

The procedures of the EPA includesThe procedures of the EPA includesVQ-based prosodic modeling and labelingVQ-based prosodic modeling and labeling

Segmenting the sequences of prosody states to extract important prosody kSegmenting the sequences of prosody states to extract important prosody keywordseywords

Calculating the occurrences statistics of these prosody keywords for each sCalculating the occurrences statistics of these prosody keywords for each speaker to form a prosody keyword-speaker occurrence matrixpeaker to form a prosody keyword-speaker occurrence matrix

Applying the singular values decomposition (SVD) technique to decompoApplying the singular values decomposition (SVD) technique to decompose the matrix to build an eigen-prosody spacese the matrix to build an eigen-prosody space

Measuring the speaker distance using the cosine of the angle between two Measuring the speaker distance using the cosine of the angle between two speaker vectorsspeaker vectors

2005102420051024 Speech Lab NTNU 1616

Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)

2005102420051024 Speech Lab NTNU 1717

Prosody keyword extractionProsody keyword extraction

After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary

First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed

After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones

2005102420051024 Speech Lab NTNU 1818

Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics

The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words

The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker

The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors

To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied

2005102420051024 Speech Lab NTNU 1919

Eigen-Prosody analysisEigen-Prosody analysis

In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space

Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as

Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m

atricesatrices

TKKKK

T VUAVUA

2005102420051024 Speech Lab NTNU 2020

Eigen-Prosody Analysis Eigen-Prosody Analysis

2005102420051024 Speech Lab NTNU 2121

Score measurementScore measurement

The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t

he eigen-prosody speaker space byhe eigen-prosody speaker space by

iKQ

iKT

QiKQ

Pi

KKTQQ

v

vvS

Uy

1

)(

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances

The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone

In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces

For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively

2005102420051024 Speech Lab NTNU 2323

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

To construct the speaker models a 256-mixture universal backgrTo construct the speaker models a 256-mixture universal background model (UBM) was first built from the enrollment speech of ound model (UBM) was first built from the enrollment speech of all 302 speakersall 302 speakers

For each speaker a MAP-adapted GMM (MAP-GMM) adapted fFor each speaker a MAP-adapted GMM (MAP-GMM) adapted from the UBM using his own enrollment speech was built rom the UBM using his own enrollment speech was built

a 32-state prosodic modeling was traineda 32-state prosodic modeling was trained

367 prosody keywords were extracted to form a sparse 367302 dimensio367 prosody keywords were extracted to form a sparse 367302 dimensional matrix nal matrix AA

the dimension of eigen-prosody space is 3the dimension of eigen-prosody space is 3

38 mel-frequency cepstral coefficiences (MFCCs) including38 mel-frequency cepstral coefficiences (MFCCs) including

Window size of 30 ms and frame shift of 10msWindow size of 30 ms and frame shift of 10ms

2005102420051024 Speech Lab NTNU 2424

Speaker Identification Experiments (cont)Speaker Identification Experiments (cont)

2005102420051024 Speech Lab NTNU 2525

ConclusionsConclusions

This paper presents an EPA+AKI+MAP-GMMCMS fusion This paper presents an EPA+AKI+MAP-GMMCMS fusion approachapproach

Unlike conventional fusion approaches the proposed method Unlike conventional fusion approaches the proposed method requires only few trainingtest utterances and could alleviate the requires only few trainingtest utterances and could alleviate the distortion of unseen mismatch handsetsdistortion of unseen mismatch handsets

It is therefore a promising method for robust speaker It is therefore a promising method for robust speaker identification under mismatch environment and limited available identification under mismatch environment and limited available datadata

2005102420051024 Speech Lab NTNU 2626

  • Slide 1
  • Slide 2
  • Slide 3
  • Slide 4
  • Slide 5
  • Slide 6
  • Slide 7
  • Slide 8
  • Slide 9
  • Slide 10
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Slide 26
Page 16: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 1616

Eigen-Prosody Analysis (cont)Eigen-Prosody Analysis (cont)

2005102420051024 Speech Lab NTNU 1717

Prosody keyword extractionProsody keyword extraction

After the prosodic state labeling the prosody text documents of aAfter the prosodic state labeling the prosody text documents of all speakers are searched to find important prosody keywords in orll speakers are searched to find important prosody keywords in order to establish a prosody keywords dictionaryder to establish a prosody keywords dictionary

First all possible combinations of the prosody words including single worFirst all possible combinations of the prosody words including single words and word pairs (uni-gram and bi-gram) are listed and their frequency stds and word pairs (uni-gram and bi-gram) are listed and their frequency statistics are computedatistics are computed

After calculating the histogram of all prosody words frequency thresholds After calculating the histogram of all prosody words frequency thresholds are set to leave only high frequency onesare set to leave only high frequency ones

2005102420051024 Speech Lab NTNU 1818

Prosody keyword-speaker occurrence matrix Prosody keyword-speaker occurrence matrix statisticsstatistics

The prosody text document of each speaker is then parsed using the generated The prosody text document of each speaker is then parsed using the generated prosody keywords dictionary by simply giving higher priority to longer wordsprosody keywords dictionary by simply giving higher priority to longer words

The occurrence counts of keywords of a speaker are booked in a prosody The occurrence counts of keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of the keyword list vector to represent the long-term prosodic behaviors of the specific speakerspecific speaker

The prosody keyword-speaker occurrence matrix The prosody keyword-speaker occurrence matrix A A is made up of the is made up of the collection of all speaker prosody keyword lists vectorscollection of all speaker prosody keyword lists vectors

To emphasize the uncommon keywords and to deemphasize the very common To emphasize the uncommon keywords and to deemphasize the very common ones the inverse document frequency (IDF) weighting method is appliedones the inverse document frequency (IDF) weighting method is applied

2005102420051024 Speech Lab NTNU 1919

Eigen-Prosody analysisEigen-Prosody analysis

In order to reduce the dimension of the prosody space the sparse prosody keIn order to reduce the dimension of the prosody space the sparse prosody keyword-speaker occurrence matrix yword-speaker occurrence matrix A A is further analyzed using SVD to find a cis further analyzed using SVD to find a compact eigen-prosody spaceompact eigen-prosody space

Given an Given an m m by by n n ((mmgtgtgtgtnn) matrix ) matrix A A of rank of rank R R A A is decomposed and further apis decomposed and further approximated using only the largest proximated using only the largest K K singular values assingular values as

Where AWhere AKK U UKK V VKK and and ΣΣKK matrices are the rank reduced matrices of the respective m matrices are the rank reduced matrices of the respective m

atricesatrices

TKKKK

T VUAVUA

2005102420051024 Speech Lab NTNU 2020

Eigen-Prosody Analysis Eigen-Prosody Analysis

2005102420051024 Speech Lab NTNU 2121

Score measurementScore measurement

The test utterances of a test speaker are first labeled and parsed to form the pThe test utterances of a test speaker are first labeled and parsed to form the pseudo query document seudo query document yyQQ and then transformed into the query vector and then transformed into the query vector vvQQ in tin t

he eigen-prosody speaker space byhe eigen-prosody speaker space by

iKQ

iKT

QiKQ

Pi

KKTQQ

v

vvS

Uy

1

)(

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditionsHTIMIT database and experiment conditionstotal 384 speakers each gave ten utterancestotal 384 speakers each gave ten utterances

The set of 38410 utterances was then playback and recorded through ninThe set of 38410 utterances was then playback and recorded through nine other different handsets include four carbon buttone other different handsets include four carbon button((碳墨式碳墨式 )) four electret four electret ((電子式電子式 ))handsets and one portable cordless phonehandsets and one portable cordless phone

In this paper all experiments were performed on 302 speakers inIn this paper all experiments were performed on 302 speakers including 151 females and 151 males which have all the ten utterancluding 151 females and 151 males which have all the ten utterancesces

For training the speaker models the first seven utterances of each speaker For training the speaker models the first seven utterances of each speaker from the senh handset were used as the enrollment speechfrom the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker from ten handsets wThe other ten three-utterance sessions of each speaker from ten handsets were used as the evaluation data respectivelyere used as the evaluation data respectively


Speaker Identification Experiments (cont.)

To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers.

For each speaker, a MAP-adapted GMM (MAP-GMM) was built by adapting the UBM with his or her own enrollment speech.

A 32-state prosodic model was trained.

367 prosody keywords were extracted to form a sparse 367 × 302-dimensional matrix A.

The dimension of the eigen-prosody space is 3.

38 mel-frequency cepstral coefficients (MFCCs), including:

Window size of 30 ms and frame shift of 10 ms.
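A minimal sketch of the MAP adaptation step above, in the style of Reynolds-type relevance MAP: only the Gaussian means are adapted, covariances are assumed diagonal, and the relevance factor r = 16 and the toy UBM/data are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Minimal sketch of MAP adaptation of GMM means from a UBM (relevance MAP).
# Diagonal covariances; only the means are adapted here.
def map_adapt_means(ubm_means, ubm_weights, ubm_vars, data, r=16.0):
    # Posterior probability of each mixture for each frame (diagonal Gaussians).
    diff = data[:, None, :] - ubm_means[None, :, :]                  # (T, M, D)
    log_gauss = -0.5 * np.sum(diff**2 / ubm_vars + np.log(2 * np.pi * ubm_vars), axis=2)
    log_post = np.log(ubm_weights) + log_gauss                       # (T, M)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Zeroth- and first-order sufficient statistics.
    n = post.sum(axis=0)                                             # (M,)
    ex = post.T @ data / np.maximum(n[:, None], 1e-10)               # (M, D)
    # Interpolate between the data mean and the UBM mean per mixture.
    alpha = (n / (n + r))[:, None]
    return alpha * ex + (1 - alpha) * ubm_means

# Toy 1-D UBM with two mixtures; enrollment data sits near the first one,
# so only that mixture's mean moves noticeably toward the data.
ubm_means = np.array([[0.0], [10.0]])
adapted = map_adapt_means(ubm_means, np.array([0.5, 0.5]),
                          np.ones((2, 1)), np.ones((100, 1)))
```

Mixtures with little enrollment data keep their UBM means, which is why MAP-GMM works with the few utterances available per speaker.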


Conclusions

This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach.

Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets.

It is therefore a promising method for robust speaker identification under mismatched environments with limited available data.



Prosody keyword extraction

After the prosodic state labeling, the prosody text documents of all speakers are searched to find important prosody keywords, in order to establish a prosody keyword dictionary.

First, all possible combinations of the prosody words, including single words and word pairs (uni-grams and bi-grams), are listed and their frequency statistics are computed.

After calculating the histogram of all prosody words, frequency thresholds are set to leave only the high-frequency ones.
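The dictionary step above can be sketched as follows; the prosodic-state labels and the frequency threshold are invented for illustration:

```python
from collections import Counter

# Count uni-grams and bi-grams of prosodic-state labels over all speakers'
# prosody documents, then keep only the high-frequency ones as keywords.
docs = [
    ["s3", "s7", "s3", "s1", "s7", "s3"],   # speaker 1's prosodic-state sequence
    ["s3", "s7", "s3", "s3", "s1"],         # speaker 2's
]

counts = Counter()
for doc in docs:
    counts.update(doc)                                      # uni-grams (single words)
    counts.update("_".join(p) for p in zip(doc, doc[1:]))   # bi-grams (word pairs)

THRESHOLD = 2                      # keep only keywords seen at least this often
dictionary = {w for w, c in counts.items() if c >= THRESHOLD}
```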


Prosody keyword-speaker occurrence matrix statistics

The prosody text document of each speaker is then parsed using the generated prosody keyword dictionary, simply giving higher priority to longer words.

The occurrence counts of the keywords of a speaker are booked in a prosody keyword list vector to represent the long-term prosodic behaviors of that specific speaker.

The prosody keyword-speaker occurrence matrix A is made up of the collection of all speakers' prosody keyword list vectors.

To emphasize the uncommon keywords and to de-emphasize the very common ones, the inverse document frequency (IDF) weighting method is applied.
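The matrix construction and IDF weighting can be sketched as follows; the raw counts are illustrative, and idf = log(n / df) is one common IDF variant rather than necessarily the exact form used in the paper:

```python
import numpy as np

# Keyword-speaker occurrence matrix with IDF weighting (m keywords x n speakers).
counts = np.array([
    [4., 3., 5.],    # keyword used by every speaker (very common)
    [2., 0., 0.],    # keyword used by a single speaker (uncommon)
    [0., 1., 1.],
])
m, n = counts.shape

df = np.count_nonzero(counts, axis=1)   # document frequency: speakers per keyword
idf = np.log(n / df)
A = counts * idf[:, None]               # common keywords are de-emphasized
```

A keyword shared by every speaker gets idf = log(1) = 0, so it contributes nothing to discriminating speakers, which is exactly the intended effect.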


Eigen-Prosody analysis

In order to reduce the dimension of the prosody space, the sparse prosody keyword-speaker occurrence matrix A is further analyzed using singular value decomposition (SVD) to find a compact eigen-prosody space.

Given an m by n (m >> n) matrix A of rank R, A is decomposed and further approximated using only the largest K singular values as

A = U Σ V^T ≈ A_K = U_K Σ_K V_K^T

where A_K, U_K, V_K, and Σ_K are the rank-reduced versions of the respective matrices.
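A minimal NumPy sketch of the rank-K truncation, with an illustrative random matrix standing in for the real A:

```python
import numpy as np

# Rank-K approximation A ~ A_K = U_K Sigma_K V_K^T via truncated SVD.
rng = np.random.default_rng(0)
A = rng.random((8, 4))            # m x n stand-in; the real A is much larger

U, s, Vt = np.linalg.svd(A, full_matrices=False)
K = 2
A_K = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# By the Eckart-Young theorem this is the best rank-K approximation in
# Frobenius norm; the error equals the energy of the discarded singular values.
err = np.linalg.norm(A - A_K, "fro")
```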

Page 22: PROSODY MODELING AND EIGEN-PROSODY ANALYSIS FOR ROBUST SPEAKER RECOGNITION

2005102420051024 Speech Lab NTNU 2222

Speaker Identification ExperimentsSpeaker Identification Experiments

HTIMIT database and experiment conditions: a total of 384 speakers, each of whom gave ten utterances

The set of 384×10 utterances was then played back and recorded through nine other different handsets, including four carbon-button handsets, four electret handsets, and one portable cordless phone

In this paper, all experiments were performed on the 302 speakers (151 females and 151 males) who have all ten utterances

For training the speaker models, the first seven utterances of each speaker from the senh handset were used as the enrollment speech

The other ten three-utterance sessions of each speaker, from the ten handsets, were used as the evaluation data
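The enrollment/evaluation split above can be sketched as follows. This is an illustrative sketch only: the data structures and the placeholder handset ids `h1`–`h9` are hypothetical; only `senh` and the 7/3 split come from the slides.

```python
# Sketch of the enrollment/evaluation split described above (hypothetical
# data layout; "senh" is the enrollment handset named on the slide, the
# other handset ids are placeholders).
def split_sessions(utterances_by_handset, handsets):
    """utterances_by_handset maps handset -> this speaker's 10 utterance ids."""
    enroll = utterances_by_handset["senh"][:7]        # first 7 senh utterances
    eval_sessions = [utterances_by_handset[h][7:10]   # last 3 utterances...
                     for h in handsets]               # ...from each of 10 handsets
    return enroll, eval_sessions

handsets = ["senh"] + [f"h{i}" for i in range(1, 10)]  # placeholder handset ids
utts = {h: [f"{h}_utt{i}" for i in range(10)] for h in handsets}
enroll, sessions = split_sessions(utts, handsets)
```

This yields one 7-utterance enrollment set and ten 3-utterance evaluation sessions per speaker, matching the protocol on the slide.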

2005/10/24 Speech Lab NTNU 23

Speaker Identification Experiments (cont.)

To construct the speaker models, a 256-mixture universal background model (UBM) was first built from the enrollment speech of all 302 speakers

For each speaker, a MAP-adapted GMM (MAP-GMM) was built by adapting the UBM with his or her own enrollment speech
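The mean-adaptation step behind a MAP-GMM can be sketched as below. This is a minimal NumPy sketch of standard relevance-MAP adaptation of the mixture means (weights and variances are kept from the UBM here); the function name and relevance factor `r=16` are assumptions, not taken from the paper.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_vars, X, r=16.0):
    """MAP-adapt UBM mixture means to one speaker's data X (frames x dim).

    Relevance-MAP mean update with diagonal covariances; r is the
    relevance factor (r=16 is a common default, assumed here).
    """
    # log-likelihood of each frame under each mixture (diagonal Gaussians)
    logp = -0.5 * (((X[:, None, :] - ubm_means) ** 2) / ubm_vars
                   + np.log(2 * np.pi * ubm_vars)).sum(-1)
    logp += np.log(ubm_weights)
    # responsibilities gamma: frames x mixtures
    gamma = np.exp(logp - logp.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)
    n = gamma.sum(0)                                   # soft counts per mixture
    Ex = gamma.T @ X / np.maximum(n, 1e-10)[:, None]   # first-order statistics
    alpha = (n / (n + r))[:, None]                     # data/prior interpolation
    return alpha * Ex + (1 - alpha) * ubm_means
```

With little adaptation data (or a large r), the adapted means stay close to the UBM, which is what makes MAP-GMM workable from only seven enrollment utterances.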

A 32-state prosodic model was trained

367 prosody keywords were extracted to form a sparse 367×302-dimensional matrix A

The dimension of the eigen-prosody space is 3

38 mel-frequency cepstral coefficients (MFCCs), including

Window size of 30 ms and frame shift of 10 ms
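The eigen-prosody step above treats the sparse 367×302 keyword-by-speaker matrix A like an LSA term-document matrix and keeps the top singular vectors. A minimal sketch, with a random stand-in for A (the function name and the Poisson stand-in are assumptions):

```python
import numpy as np

def eigen_prosody_space(A, k=3):
    """Project the columns (speakers) of a keyword-by-speaker count matrix A
    onto a k-dimensional eigen-prosody space via truncated SVD (LSA-style)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # each speaker (column of A) -> k-dimensional coordinates
    return (np.diag(s[:k]) @ Vt[:k]).T               # speakers x k

# stand-in for the sparse 367 x 302 prosody-keyword count matrix
rng = np.random.default_rng(0)
A = rng.poisson(0.1, size=(367, 302)).astype(float)
coords = eigen_prosody_space(A, k=3)                 # 302 speakers x 3 dims
```

Keeping only 3 dimensions, as the slide states, compresses each speaker's prosody-keyword profile into a low-dimensional vector that can be compared across handsets.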

2005/10/24 Speech Lab NTNU 24

Speaker Identification Experiments (cont.)

2005/10/24 Speech Lab NTNU 25

Conclusions

This paper presents an EPA + AKI + MAP-GMM/CMS fusion approach

Unlike conventional fusion approaches, the proposed method requires only a few training/test utterances and can alleviate the distortion of unseen mismatched handsets

It is therefore a promising method for robust speaker identification under mismatched environments and limited available data

2005/10/24 Speech Lab NTNU 26
