the 1980’s

17
The 1980’s The 1980’s Collection of large standard corpora Front ends: auditory models, dynamics Engineering: scaling to large vocabulary continuous speech Second major (D)ARPA ASR project HMMs become ready for prime time

Upload: gary-dean

Post on 31-Dec-2015

28 views

Category:

Documents


0 download

DESCRIPTION

The 1980’s. Collection of large standard corpora Front ends: auditory models, dynamics Engineering: scaling to large vocabulary continuous speech Second major (D)ARPA ASR project HMMs become ready for prime time. Standard Corpora Collection. Before 1984, chaos TIMIT RM (later WSJ) - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: The 1980’s

The 1980’sThe 1980’s

• Collection of large standard corpora

• Front ends: auditory models, dynamics

• Engineering: scaling to large

vocabulary continuous speech

• Second major (D)ARPA ASR project

• HMMs become ready for prime time

Page 2: The 1980’s

Standard Corpora Standard Corpora CollectionCollection

• Before 1984, chaos

• TIMIT

• RM (later WSJ)

• ATIS

• NIST, ARPA, LDC

Page 3: The 1980’s

Front Ends in the 1980’sFront Ends in the 1980’s

• Mel cepstrum (Bridle, Mermelstein)

• PLP (Hermansky)

• Delta cepstrum (Furui)

• Auditory models (Seneff, Ghitza, others)

Page 4: The 1980’s

Mel Frequency ScaleMel Frequency Scale

Page 5: The 1980’s

Spectral vs TemporalSpectral vs Temporal ProcessingProcessing

Analysis (e.g., cepstral)

Processing(e.g., mean removal)

Time

frequ

ency

frequ

ency

Spectral processing

Temporal processing

Page 6: The 1980’s

Dynamic Speech FeaturesDynamic Speech Features

• temporal dynamics useful for ASR

• local time derivatives of cepstra

• “delta’’ features estimated over

multiple frames (typically 5)

• usually augments static features

• can be viewed as a temporal filter

Page 7: The 1980’s

““Delta” impulse responseDelta” impulse response

.2

.1

0

-.2

-.1

0 1 2-1-2frames

Page 8: The 1980’s

HMM’s for ContinuousHMM’s for ContinuousSpeechSpeech

• Using dynamic programming for cts speech

(Vintsyuk, Bridle, Sakoe, Ney….)

• Application of Baker-Jelinek ideas to

continuous speech (IBM, BBN, Philips, ...)

• Multiple groups developing major HMM

systems (CMU, SRI, Lincoln, BBN, ATT)

• Engineering development - coping with

data, fast computers

Page 9: The 1980’s

2nd (D)ARPA Project2nd (D)ARPA Project

• Common task• Frequent evaluations• Convergence to good, but similar, systems • Lots of engineering development - now up to

60,000 word recognition, in real time, on aworkstation, with less than 10% word error

• Competition inspired others not in project -Cambridge did HTK, now widely distributed

Page 10: The 1980’s

Knowledge vs. Knowledge vs. IgnoranceIgnorance

• Using acoustic-phonetic knowledge

in explicit rules

• Ignorance represented statistically

• Ignorance-based approaches (HMMs)

“won”, but

• Knowledge (e.g., segments) becoming

statistical

• Statistics incorporating knowledge

Page 11: The 1980’s

Some 1990’s IssuesSome 1990’s Issues

• Independence to long-term spectrum

• Adaptation

• Effects of spontaneous speech

• Information retrieval/extraction with

broadcast material

• Query-style systems (e.g., ATIS)

• Applying ASR technology to related

areas (language ID, speaker verification)

Page 12: The 1980’s

Where Pierce Letter Where Pierce Letter AppliesApplies

• We still need science

• Need language, intelligence

• Acoustic robustness still poor

• Perceptual research, models

• Fundamentals of statistical pattern

recognition for sequences

• Robustness to accent, stress,

rate of speech, ……..

Page 13: The 1980’s

Progress in 25 YearsProgress in 25 Years

• From digits to 60,000 words

• From single speakers to many

• From isolated words to continuous

speech

• From no products to many products,

some systems actually saving LOTS

of money

Page 14: The 1980’s

Real UsesReal Uses

• Telephone: phone company services

(collect versus credit card)

• Telephone: call centers for query

information (e.g., stock quotes,

parcel tracking)

• Dictation products: continuous

recognition, speaker dependent/adaptive

Page 15: The 1980’s

But:But:

• Still <97% accurate on “yes” for telephone

• Unexpected rate of speech causes doubling

or tripling of error rate

• Unexpected accent hurts badly

• Accuracy on unrestricted speech at 60%

• Don’t know when we know

• Few advances in basic understanding

Page 16: The 1980’s

Confusion Matrix for Confusion Matrix for Digit RecognitionDigit Recognition

Overall error rate 4.85%

ClassErrorRate

1

2

3

4

5

6

7

8

9

0

1 2 3 4 5 6 8 9 0

191 0 0 5 1 0 0 2 0

0 188 2 0 0 1 0 0 6

0 3 191 0 1 0 0 3 0

8 0 0 187 4 0 0 0 0

0 0 0 0 193 0 0 7 0

2 2 0 2 0 1 0 1 2

0 1 0 0 1 2 196 0 0

5 0 2 0 8 0 0 179 3

1 4 0 0 0 1 0 1 192

7

1

3

2

1

0

190

2

3

1

4.5

6.0

4.5

6.5

3.5

2.0

5.0

2.0

10.5

4.5

0 0 0 0 1 196 2 0 10

Page 17: The 1980’s

Large Vocabulary CSRLarge Vocabulary CSRErrorRate

%

‘88 ‘89 ‘90 ‘91 ‘92 ‘93 ‘94Year

••

12

9

6

3

--- RM ( 1K words, PP 60)

___ WSJØ, WSJ1(5K, 20-60K words, PP 100)

Ø 1•

~~

~~