04 efficient auditory coding

7/31/2019 04 Efficient Auditory Coding

1/35

Efficient auditory coding

Evan Smith & Michael Lewicki (2006)

Presented by Yuanliang Meng


2/35

Auditory system


3/35

Basilar membrane in the cochlea

A stiff structural element that separates two liquid-filled tubes that run along the coil of the cochlea.

Frequency dispersion: a sound input of a given frequency vibrates certain locations of the basilar membrane more than other locations.


4/35

Basilar membrane impulse response


5/35

Optimal coding of an acoustic waveform

Goal: predict the optimal transformation of the acoustic waveform from the statistics of the environment.

Find an efficient representation.


6/35

Block-based representations: do not yield time-relative codes


7/35

Problems of block-based representations:

They obscure transients and periodicities.

Small time shifts can produce large changes in the representation, e.g. consonants in speech or the onset of a gunshot.

Coding can be optimal only within a block.

Intuitively, we may want to increase the block rate to reduce shift sensitivity. This ultimately leads to a continuous filterbank.
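
To make the shift-sensitivity point concrete, here is a minimal sketch (not from the paper). It assumes a toy block code, the magnitude spectrum of each non-overlapping 64-sample block, and a made-up transient. The names `block_spectra` and `make_click` are hypothetical.

```python
import numpy as np

def block_spectra(signal, block_len):
    """Magnitude spectrum of each non-overlapping block (a toy block-based code)."""
    n_blocks = len(signal) // block_len
    blocks = signal[: n_blocks * block_len].reshape(n_blocks, block_len)
    return np.abs(np.fft.rfft(blocks, axis=1))

def make_click(total_len, onset, width=8):
    """A made-up transient: a short decaying oscillation starting at `onset`."""
    x = np.zeros(total_len)
    t = np.arange(width)
    x[onset : onset + width] = np.exp(-t / 3.0) * np.cos(np.pi * t / 2)
    return x

block_len = 64
# Transient straddles the boundary between block 0 and block 1.
a = block_spectra(make_click(256, onset=60), block_len)
# Same transient, 4 samples later: now it sits entirely inside block 1.
b = block_spectra(make_click(256, onset=64), block_len)

relative_change = np.linalg.norm(a - b) / np.linalg.norm(a)
print(f"relative change in the block code after a 4-sample shift: {relative_change:.2f}")
```

The two inputs are the same sound event moved by a quarter of a millisecond at 16 kHz, yet the energy moves between blocks and the code changes substantially.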


8/35

Convolution representations

The representation is shift-invariant, but it does not reduce the information rate. Therefore it is a highly inefficient code.
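
Both properties can be checked numerically. The sketch below is an illustration under assumed settings: 32 random filters stand in for a real filterbank, and `filterbank` is a hypothetical helper, not a function from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.zeros(1000)
signal[200:300] = rng.standard_normal(100)   # stand-in for a short acoustic event
filters = rng.standard_normal((32, 100))     # 32 hypothetical filters, 100 samples each

def filterbank(x, bank):
    """Convolve x with every filter: one output sample per input sample per filter."""
    return np.stack([np.convolve(x, f, mode="full") for f in bank])

out = filterbank(signal, filters)
out_shifted = filterbank(np.roll(signal, 7), filters)   # same event, 7 samples later

# Shift invariance: the output is the same pattern, moved by 7 samples.
shift_ok = np.allclose(out_shifted[:, 7:], out[:, :-7])

# No rate reduction: the code holds more than 32 times as many numbers as the input.
expansion = out.size / signal.size
print(shift_ok, round(expansion, 1))
```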


9/35

A sparse, shiftable kernel representation

The signal is decomposed in terms of discrete acoustic events, represented by the kernel functions φ_m.

Each kernel function has a precise amplitude s_{m,i} and temporal position τ_{m,i}.

The kernels could be assumed to be any functions, such as gammatones, but they can also be learned from data.
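
A minimal sketch of the generative side of this representation: the signal is a sum of kernels, each scaled by its amplitude and placed at its temporal position. The gammatone parameters (order 4, a fixed 150 Hz bandwidth), the sampling rate, and the spike list are assumptions chosen for illustration. `gammatone` and `synthesize` are hypothetical names.

```python
import numpy as np

def gammatone(center_hz, fs, duration=0.02, order=4, bandwidth_hz=150.0):
    """Unit-norm gammatone: t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*f*t)."""
    t = np.arange(int(duration * fs)) / fs
    g = (t ** (order - 1)
         * np.exp(-2 * np.pi * bandwidth_hz * t)
         * np.cos(2 * np.pi * center_hz * t))
    return g / np.linalg.norm(g)

def synthesize(spikes, kernels, n_samples):
    """Sum amplitude-scaled kernels placed at their spike positions.

    spikes: list of (kernel index m, sample position tau, amplitude s).
    """
    x = np.zeros(n_samples)
    for m, tau, s in spikes:
        k = kernels[m]
        end = min(tau + len(k), n_samples)   # clip kernels that run past the end
        x[tau:end] += s * k[: end - tau]
    return x

fs = 16000
kernels = [gammatone(f, fs) for f in (500.0, 1000.0, 2000.0)]
spikes = [(0, 100, 1.0), (2, 400, 0.5), (1, 900, -0.8)]   # made-up events
x = synthesize(spikes, kernels, n_samples=1600)
print(len(spikes), "spikes describe", len(x), "samples")
```

The whole waveform is described by three (kernel, time, amplitude) triples, which is where the sparseness of the code comes from.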


10/35

The signal (A) is represented in the spikegram (B) as a set of ovals whose size and intensity indicate the amplitude of the spike. The position of each oval indicates the kernel center frequency (CF, y-axis) and timing (x-axis). The kernel functions corresponding to the spikes (each represented by an oval) are overlaid in gray.


11/35

The word canteen, IPA: [kæntin]


12/35

Encoding algorithms

The computational objective is to minimize the error while maximizing coding efficiency.

There is a tradeoff between the error and the computational complexity.

There are many possible encoding algorithms. Matching pursuit is chosen.


13/35

Matching pursuit

Iteratively approximate the input signal with successive orthogonal projections onto kernels.

The projection with the largest inner product will minimize the power of the residual R_x(t), thereby capturing the most structure possible given a single kernel.

In each iteration, the kernel projection is subtracted from the signal, leading to a reduced residual.
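
The iteration described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes unit-norm kernels, stops after a fixed number of spikes, and uses random kernels in place of gammatones or learned kernels. `matching_pursuit` is a hypothetical name.

```python
import numpy as np

def matching_pursuit(signal, kernels, n_spikes):
    """Greedy matching pursuit with shiftable, unit-norm kernels.

    Returns a list of (kernel index, position, amplitude) and the final residual.
    """
    residual = np.asarray(signal, dtype=float).copy()
    spikes = []
    for _ in range(n_spikes):
        best = None
        for m, k in enumerate(kernels):
            # Inner product of the residual with kernel m at every valid shift.
            corr = np.correlate(residual, k, mode="valid")
            pos = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[pos]) > abs(best[2]):
                best = (m, pos, corr[pos])
        m, pos, amp = best
        # Subtract the projection; for a unit-norm kernel the residual
        # power drops by amp**2.
        residual[pos : pos + len(kernels[m])] -= amp * kernels[m]
        spikes.append((m, pos, float(amp)))
    return spikes, residual

rng = np.random.default_rng(1)
kernels = [v / np.linalg.norm(v) for v in rng.standard_normal((2, 32))]

# Plant two well-separated events, then try to recover them.
signal = np.zeros(256)
signal[40:72] += 2.0 * kernels[0]
signal[150:182] += -1.5 * kernels[1]

spikes, residual = matching_pursuit(signal, kernels, n_spikes=2)
print(spikes, np.linalg.norm(residual))
```

The larger event is picked first because it has the largest inner product. A practical encoder would stop on a residual-power or spike-amplitude threshold rather than a fixed count; that threshold sets the tradeoff between error and code size.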


24/35

Learning


25/35

Sounds

Animal vocalizations: Cornell Macaulay library

Natural sounds: vocal:transient:ambient=1.0:0.8:1.2

Speech sounds: TIMIT corpus


28/35

Fidelity curves for Fourier, Daubechies wavelet, gammatone and spike codes.


29/35

Compare the adapted kernels with revcor filters.

We can use the spike-triggered average to estimate the impulse response of an auditory nerve fiber. These response functions are called revcor filters.

Even though the kernels are optimized independently of revcor filters, they turn out to be very similar.
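
As a rough illustration of how a spike-triggered average recovers an impulse response, the sketch below simulates a fiber as a linear filter followed by a threshold, driven by white noise. The filter shape, the threshold, and the name `spike_triggered_average` are all assumptions made for this example. No real recordings are involved.

```python
import numpy as np

def spike_triggered_average(stimulus, spike_times, window):
    """Average the `window` stimulus samples ending at each spike, then reverse
    so that index 0 is the sample at the spike time (lag 0)."""
    segments = [stimulus[t - window + 1 : t + 1]
                for t in spike_times if t >= window - 1]
    return np.mean(segments, axis=0)[::-1]

rng = np.random.default_rng(0)
n = 200_000
stimulus = rng.standard_normal(n)                      # white-noise stimulus

lags = np.arange(40)
true_filter = np.exp(-lags / 8.0) * np.sin(2 * np.pi * lags / 10.0)   # hypothetical fiber

# Simulated fiber: filter the stimulus, spike when the drive exceeds a threshold.
drive = np.convolve(stimulus, true_filter)[:n]
spike_times = np.flatnonzero(drive > 2.0 * drive.std())

sta = spike_triggered_average(stimulus, spike_times, window=40)
similarity = np.dot(sta, true_filter) / (np.linalg.norm(sta) * np.linalg.norm(true_filter))
print(f"cosine similarity between STA and true filter: {similarity:.3f}")
```

With a white-noise stimulus the average is proportional to the underlying filter, which is why the estimate can be compared directly with kernel shapes.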


31/35

The characteristics of sound influence the features of adapted kernels.

Kernels learned from any of the three categories alone cannot reflect the population distribution of revcor filters of natural sounds.


32/35

However, speech sounds seem to represent natural sounds much better than animal vocalizations or environmental sounds!


33/35

Some implications

Revcor filters have sharp onsets and decaying offsets. This had been assumed to be phenomenological, a consequence of the impulse response of the basilar membrane. However, the learned kernels share the same feature, so it may just be the nature of sounds.

Most languages prefer CV (consonant-vowel) structure and dislike VC structure (fast change at the beginning).

A gunshot is likely to produce a very fast-changing onset.

This model still does not explain the role of stimulus intensity.

Speech is a compromise of natural sounds.


35/35

Further question (not in the article)

There is a million-dollar question in speech science: the lack of invariance in the speech signal.

Perception of sounds in connected speech often requires restructuring.

Formants and other acoustic cues do not give you reliable representations.

Speech conditions

Speakers (males, females and children produce different sounds!)

Solutions:

Motor theory: the invariance is rooted in motor control of the articulators.

Top-down processing: you have to anticipate what it is in order to perceive it.

Maybe the invariance can be found in some better representation, like a spike code? Lewicki is trying that, but no good results yet.
