Acoustic Modeling using Deep Belief Networks

Page 1: Acoustic modeling using deep belief networks

CCNT, ZJU

Acoustic Modeling using Deep Belief Networks

Yueshen Xu, [email protected]

Zhejiang University


Page 2: Acoustic modeling using deep belief networks

Abstract

Problem: achieving better phone recognition

Method: deep neural networks that contain many layers of features and a very large number of parameters, rather than Gaussian mixture models

Steps:
- Step 1: pre-train the network as a multi-layer generative model of the spectral feature vectors, without making use of any discriminative information
- Step 2: use backpropagation to make those features better at predicting a probability distribution over HMM states

Page 3: Acoustic modeling using deep belief networks

Introduction

Typical automatic speech recognition (ASR) system:
- Modeling the sequential structure of speech signals: hidden Markov models (HMMs)
- Spectral representation of the sound wave: per-HMM-state mixtures of Gaussians over Mel-frequency cepstral coefficients (MFCCs)

New research direction: deeper acoustic models containing many layers of features, realized as feedforward neural networks

Advantages:
- Estimating the posterior probabilities of HMM states does not require detailed assumptions about the data distribution
- Suitable for both discrete and continuous features

Page 4: Acoustic modeling using deep belief networks

Introduction

Comparison of MFCCs and GMMs:
- MFCCs: partially overcome the very strong conditional independence assumption of the HMM
- GMMs: easy to fit to data using the EM algorithm, but inefficient at modeling high-dimensional data

Previous work on neural networks:
- Backpropagation algorithms were used to train neural networks discriminatively
- Generative modeling vs. discriminative training: generative modeling can make efficient use of unlabeled speech

Page 5: Acoustic modeling using deep belief networks

Introduction

Main novelty of this paper: achieving consistently better phone recognition performance by pre-training a multi-layer neural network, one layer at a time, as a generative model

General description:
- The generative pre-training creates many layers of feature detectors
- Backpropagation then adjusts the features in every layer to make them more useful for discrimination

Page 6: Acoustic modeling using deep belief networks

Learning a multilayer generative model

Two vital assumptions of this paper:
- Discrimination is more directly related to the underlying causes of the data than to the individual elements of the data itself
- A good feature-vector representation of the underlying causes can be recovered from the input data by modeling its higher-order statistical structure

Directed view: fit a multilayer generative model that has infinitely many layers of latent variables

Undirected view: fit a relatively simple type of learning module that has only one layer of latent variables

Page 7: Acoustic modeling using deep belief networks

Learning a multilayer generative model

Undirected view: the restricted Boltzmann machine (RBM)
- A bipartite graph in which visible units are connected to hidden units by undirected weighted connections
- No visible-visible or hidden-hidden connections

Visible units vs. hidden units:
- Visible units: represent the observations
- Hidden units: represent learned features

RBMs used in this paper:
- Binary RBM: both hidden and visible units are binary and stochastic
- Gaussian-Bernoulli RBM: hidden units are binary, but visible units are linear with Gaussian noise

Page 8: Acoustic modeling using deep belief networks

Learning a multilayer generative model

Binary RBM: the weights on the connections and the biases of the individual units define a probability distribution over the joint states of the visible and hidden units via an energy function:

E(v, h; \theta) = -\sum_{i,j} w_{ij} v_i h_j - \sum_i b_i v_i - \sum_j a_j h_j

p(v, h; \theta) = e^{-E(v, h; \theta)} / Z(\theta)

where \theta = (w, b, a) and Z(\theta) is the partition function.

The conditional distribution p(h | v, \theta):

p(h_j = 1 | v, \theta) = \sigma(a_j + \sum_i w_{ij} v_i)

The conditional distribution p(v | h, \theta):

p(v_i = 1 | h, \theta) = \sigma(b_i + \sum_j w_{ij} h_j)

where \sigma(x) = 1 / (1 + e^{-x}) is the logistic function. A small code sketch of these conditionals follows.
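As a concrete illustration, here is a minimal NumPy sketch of the two conditionals above; the layer sizes and variable names (W, a, b, n_vis, n_hid) are illustrative assumptions, not values from the paper.

```python
# A minimal NumPy sketch of the binary RBM conditionals; sizes are made up.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid = 6, 4
W = 0.01 * rng.standard_normal((n_vis, n_hid))  # weights w_ij
b = np.zeros(n_vis)                             # visible biases b_i
a = np.zeros(n_hid)                             # hidden biases a_j

def p_h_given_v(v):
    # p(h_j = 1 | v, theta) = sigma(a_j + sum_i w_ij v_i)
    return sigmoid(a + v @ W)

def p_v_given_h(h):
    # p(v_i = 1 | h, theta) = sigma(b_i + sum_j w_ij h_j)
    return sigmoid(b + h @ W.T)

# Sample stochastic binary hidden states for one visible vector.
v = rng.integers(0, 2, size=n_vis).astype(float)
h = (rng.random(n_hid) < p_h_given_v(v)).astype(float)
```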

Page 9: Acoustic modeling using deep belief networks

Learning a multilayer generative model

Learning the DBN: update each weight w_{ij} using the difference between two measured, pairwise correlations (see the CD-1 sketch after this slide):

\Delta w_{ij} = \epsilon ( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} )

where \epsilon is a learning rate, \langle v_i h_j \rangle_{data} is the correlation measured with the data clamped on the visible units, and \langle v_i h_j \rangle_{recon} is the correlation measured on reconstructions.

Directed view: a sigmoid belief net consisting of multiple layers of binary stochastic units
- Hidden layers: binary features
- Visible layer: binary data vectors
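A minimal sketch of one contrastive-divergence (CD-1) update implementing the weight rule above; it reuses sigmoid and rng from the previous sketch, and the learning rate eps is an assumed value, not one reported in the paper.

```python
def cd1_update(v0, W, b, a, eps=0.1):
    # Positive phase: <v_i h_j>_data, with the data clamped on the visible units.
    ph0 = sigmoid(a + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One step of alternating Gibbs sampling yields the reconstruction.
    pv1 = sigmoid(b + h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(a + v1 @ W)
    # Delta w_ij = eps * (<v_i h_j>_data - <v_i h_j>_recon), plus bias updates.
    W += eps * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += eps * (v0 - v1)
    a += eps * (ph0 - ph1)
```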

Page 10: Acoustic modeling using deep belief networks

Learning a multilayer generative model

Generating data from the model:
- Binary states are chosen for the top layer of hidden units
- The weights on the top-down connections are adjusted by performing gradient ascent in the expected log probability of the data

Challenge:
- Getting unbiased samples from the exponentially large posterior is intractable
- The posterior lacks conditional independence

Learning with tied weights (1/2):
- Learning context: a sigmoid belief net with an infinite number of layers and tied, symmetric weights between layers
- The posterior can then be computed by simply multiplying the visible vector by the transposed weight matrix

Page 11: Acoustic modeling using deep belief networks

Learning a multilayer generative model

Learning is a little more difficult, because every copy of the tied weight matrix gets different derivatives.

[Figure: an infinite sigmoid belief net with tied weights]

Inference is easy: once posteriors have been sampled for the first hidden layer, the same process can be used for each subsequent hidden layer, as in the sketch below.
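A minimal sketch of this bottom-up inference with tied weights, reusing sigmoid, rng, W, a, and b from the sketches above: each layer's posterior is obtained by multiplying the sampled states of the layer below by W or its transpose, alternating because the layer widths alternate. The depth n_layers is an arbitrary illustrative choice.

```python
def infer_up(v, W, b, a, n_layers=4):
    x, states = v, []
    for k in range(n_layers):
        if k % 2 == 0:
            p = sigmoid(a + x @ W)    # posterior for a hidden-sized layer
        else:
            p = sigmoid(b + x @ W.T)  # posterior for a visible-sized layer
        x = (rng.random(p.shape) < p).astype(float)  # sampled binary states
        states.append(x)
    return states
```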

Page 12: Acoustic modeling using deep belief networks

Learning a multilayer generative model

Unbiased estimate of the sum of derivatives:
- h(2) can be viewed as a noisy but unbiased estimate of the probabilities for the visible units predicted by h(1)
- h(3) can be viewed as a noisy but unbiased estimate of the probabilities for the visible units predicted by h(2)

Page 13: Acoustic modeling using deep belief networks

Learning a multilayer generative model

Learning different weights in each layer: make the generative model more powerful by allowing different weights in different layers
- Step 1: learn with all of the weight matrices tied together
- Step 2: untie the bottom weight matrix from the other matrices
- Step 3: freeze the bottom matrix, obtaining W(1)
- Step 4: keep all of the remaining matrices tied together, and continue to learn the higher matrices

This involves first inferring h(1) from v using W(1), and then inferring h(2), h(3), and h(4) in a similar bottom-up manner using W or W^T. A greedy layer-wise sketch follows this list.
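A minimal sketch of this untie-and-freeze recipe as greedy layer-wise pre-training: fit one RBM, freeze its weights, re-represent the data through it, and repeat for the next layer. The helper train_rbm below is my own simple CD-1 loop (reusing sigmoid and rng from above), and the epoch count, learning rate, and layer sizes are illustrative assumptions.

```python
def train_rbm(x, n_hid, epochs=5, eps=0.1):
    # Assumed helper: fit one binary RBM with CD-1 over the rows of x.
    n_vis = x.shape[1]
    W_k = 0.01 * rng.standard_normal((n_vis, n_hid))
    b_k, a_k = np.zeros(n_vis), np.zeros(n_hid)
    for _ in range(epochs):
        for v0 in x:
            ph0 = sigmoid(a_k + v0 @ W_k)
            h0 = (rng.random(n_hid) < ph0).astype(float)
            pv1 = sigmoid(b_k + h0 @ W_k.T)
            v1 = (rng.random(n_vis) < pv1).astype(float)
            ph1 = sigmoid(a_k + v1 @ W_k)
            W_k += eps * (np.outer(v0, ph0) - np.outer(v1, ph1))
            b_k += eps * (v0 - v1)
            a_k += eps * (ph0 - ph1)
    return W_k, b_k, a_k

def pretrain_stack(data, layer_sizes):
    # Greedy layer-wise: learn one layer, freeze it, re-represent the data.
    weights, x = [], data              # x: (n_examples, layer_width)
    for n_hid in layer_sizes:
        W_k, b_k, a_k = train_rbm(x, n_hid)
        weights.append((W_k, b_k, a_k))
        x = sigmoid(a_k + x @ W_k)     # deterministic re-representation
    return weights
```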

Page 14: Acoustic modeling using deep belief networks

Learning a multilayer generative model

Deep belief net (DBN):
- Having learned K layers of features, we get a directed generative model called a 'deep belief net'
- The DBN has K different weight matrices between its lower layers, plus an infinite number of higher layers with tied weights
- This paper models the whole system as a feedforward, deterministic neural network
- This network is then discriminatively fine-tuned by using backpropagation to maximize the log probability of the correct HMM states, as sketched below
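A minimal sketch of this fine-tuning view: the pre-trained stack is run as a deterministic feedforward network, a softmax output layer over HMM states is added on top, and backpropagation then maximizes the log probability of the correct state. The softmax layer (W_out, b_out) is an added assumption, and 'weights' is a stack as returned by pretrain_stack above.

```python
def forward(x, weights, W_out, b_out):
    for W_k, _, a_k in weights:
        x = sigmoid(a_k + x @ W_k)     # deterministic: probabilities, not samples
    logits = b_out + x @ W_out
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)   # p(HMM state | input)

def nll(x, labels, weights, W_out, b_out):
    # Backpropagation minimizes this negative log probability of the
    # correct HMM states, adjusting every weight matrix in the stack.
    p = forward(x, weights, W_out, b_out)
    return -np.log(p[np.arange(len(labels)), labels]).mean()
```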

Page 15: Acoustic modeling using deep belief networks

Using Deep Belief Nets for Phone Recognition

Visible units: a context window of n successive frames of speech coefficients (see the windowing sketch below)

Generating phone sequences:
- The resulting feedforward neural network is discriminatively trained to output a probability distribution over all possible labels of the central frame
- The pdfs over all possible labels for each frame are then fed into a standard Viterbi decoder
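A minimal sketch of building the network inputs from a context window of n successive frames, centered on the frame being labeled; the default window size and the edge padding (repeating the first/last frame) are my assumptions, not details from the paper.

```python
def context_windows(frames, n=11):
    # frames: (n_frames, n_coeffs) array of per-frame speech coefficients.
    half = n // 2
    padded = np.concatenate([np.repeat(frames[:1], half, axis=0),
                             frames,
                             np.repeat(frames[-1:], half, axis=0)])
    # One row per original frame: n consecutive frames, flattened into a
    # single visible vector centered on that frame.
    return np.stack([padded[t:t + n].ravel() for t in range(len(frames))])
```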

Page 16: Acoustic modeling using deep belief networks

Conclusions

Initiative: this is the first application to acoustic modeling of neural networks in which multiple layers of features are generatively pre-trained.

The approach can be extended:
- To explicitly model the covariance structure of the input features
- To jointly train acoustic and language models
- To large-vocabulary tasks, as a replacement for GMMs

Page 17: Acoustic modeling using deep belief networks

Thank you