speech recognition and removal of disfluencies

Automatic Detection of Sentence Boundaries and Disfluencies in Speech Recognition Techniques. Ankit Sharma - 1MJ10EC013

Posted: 27-May-2015



Page 1: speech recognition and removal of disfluencies

Automatic Detection of Sentence Boundaries and Disfluencies in Speech Recognition Techniques

• Ankit Sharma - 1MJ10EC013

Page 2: speech recognition and removal of disfluencies

Speech Processing

Speech is one of the most intriguing signals that humans work with every day.

• Purpose of speech processing:
– To understand speech as a means of communication
– To represent speech for transmission and reproduction
– To analyze speech for automatic recognition and extraction of information
– To discover some physiological characteristics of the talker

Page 3: speech recognition and removal of disfluencies

Automatic speech recognition

• What is the task?
• What are the main difficulties?
• How is it approached?
• How good is it?
• How much better could it be?

Page 4: speech recognition and removal of disfluencies

[Figure: Speech production process in humans - a text/concept drives a sound source (voiced: pulses; unvoiced: noise); air flow through the vocal tract applies frequency transfer characteristics; the resulting carrier wave is modulated by the speech information: fundamental frequency, voiced/unvoiced decisions, frequency transfer characteristics, and magnitude with start/end timing.]

Speech production process in humans

Page 5: speech recognition and removal of disfluencies

How might computers do it?

How might computers do it?

• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation

Acoustic waveform → Acoustic signal → Speech recognition

Page 6: speech recognition and removal of disfluencies

Microsoft Speech Recognition – Windows 7


Page 7: speech recognition and removal of disfluencies

Digitization

• Analog-to-digital conversion: sampling and quantizing
• Use filters to measure energy levels at various points on the frequency spectrum
• Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
– E.g. high-frequency sounds are less informative, so they can be sampled using a broader bandwidth (log scale)
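As a toy illustration of the log-scale idea, the edges of a log-spaced analysis filter bank can be computed directly. The frequency range and band count below are assumptions for illustration, not values from the slides:

```python
import math

# Sketch: log-spaced analysis bands between 100 Hz and 8 kHz.
# Low bands are narrow (fine resolution where speech carries more
# information); high bands are progressively broader, as the slide suggests.
def log_band_edges(f_lo=100.0, f_hi=8000.0, n_bands=8):
    """Return (lower, upper) edge pairs of n_bands log-spaced bands."""
    ratio = (f_hi / f_lo) ** (1.0 / n_bands)
    edges = [f_lo * ratio ** i for i in range(n_bands + 1)]
    return list(zip(edges[:-1], edges[1:]))

for lo, hi in log_band_edges():
    print(f"{lo:7.1f} Hz to {hi:7.1f} Hz (width {hi - lo:7.1f} Hz)")
```

Each successive band is wider than the last, so a fixed number of bands covers the less-informative high frequencies cheaply.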

Page 8: speech recognition and removal of disfluencies

Separating speech from background noise

• Noise-cancelling microphones
– Two mics, one facing the speaker, the other facing away
– Ambient noise is roughly the same for both mics
• Knowing which bits of the signal relate to speech
– Spectrograph analysis
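The two-microphone idea can be sketched in a few lines. This assumes, as the slide does, that the ambient noise reaches both microphones with roughly equal amplitude; the sample values below are invented:

```python
# Sketch: two-microphone noise cancellation. The mic facing the speaker
# captures speech + noise; the mic facing away captures mostly noise,
# so subtracting the two leaves an estimate of the speech.
def cancel_noise(front_mic, rear_mic):
    """Subtract the away-facing mic signal sample-by-sample."""
    return [f - r for f, r in zip(front_mic, rear_mic)]

noise  = [0.2, -0.1, 0.3, 0.05]   # ambient noise (same at both mics)
speech = [0.5, 0.7, -0.4, 0.1]    # what the speaker actually says
front  = [s + n for s, n in zip(speech, noise)]  # speaker-facing mic
rear   = noise[:]                 # away-facing mic: mostly noise

print(cancel_noise(front, rear))  # recovers the clean speech samples
```

Real noise cancellation must also handle differing delays and gains between the mics; plain subtraction is only the idealized case.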

Page 9: speech recognition and removal of disfluencies

Variability in individuals' speech

• Variation among speakers due to:
– Vocal range (F0 and pitch range - see later)
– Voice quality (growl, whisper, physiological elements such as nasality, adenoidality, etc.)
– ACCENT!!! (especially vowel systems, but also consonants, allophones, etc.)
• Variation within speakers due to:
– Health, emotional state
– Ambient conditions
– Speech style: formal read vs. spontaneous

Page 10: speech recognition and removal of disfluencies


Detection of Sentence Boundaries and Disfluencies

Page 11: speech recognition and removal of disfluencies


Divide speech into frames

Speech is a non-stationary signal

… but can be assumed to be quasi-stationary

Divide speech into short-time frames (e.g., 5ms shift, 25ms length)
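The framing step above can be sketched directly. The 5 ms shift and 25 ms length come from the slide; the 16 kHz sample rate is an assumption for illustration:

```python
# Sketch of short-time framing: 25 ms windows taken every 5 ms.
def frame_signal(samples, sample_rate=16000, shift_ms=5, length_ms=25):
    """Split a sample list into overlapping quasi-stationary frames."""
    shift = int(sample_rate * shift_ms / 1000)    # 80 samples at 16 kHz
    length = int(sample_rate * length_ms / 1000)  # 400 samples at 16 kHz
    frames = []
    for start in range(0, len(samples) - length + 1, shift):
        frames.append(samples[start:start + length])
    return frames

one_second = [0.0] * 16000        # one second of (silent) audio
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))  # 196 frames of 400 samples each
```

Within each 25 ms frame the signal is treated as stationary, which is what makes per-frame spectral analysis valid.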

Page 12: speech recognition and removal of disfluencies


Approaches to ASR

• Template based

• Neural network based

• Statistics based

Page 13: speech recognition and removal of disfluencies

Statistics-based approach

• Collect a large corpus of transcribed speech recordings
• Train the computer to learn the correspondences ("machine learning")
• At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one

Page 14: speech recognition and removal of disfluencies


What is a corpus?

A corpus can be defined as a collection of texts, assumed to be representative of a given language, put together so that it can be used for linguistic analysis. Usually the assumption is that the language stored in a corpus is naturally occurring: gathered according to explicit design criteria, with a specific purpose in mind, and with a claim to represent natural chunks of language selected according to a specific typology.

"Nowadays the term 'corpus' nearly always implies the additional feature of 'machine-readable'."

Page 15: speech recognition and removal of disfluencies

Statistics-based approach: acoustic and lexical models

• Analyse training data in terms of relevant features
• Learn from a large amount of data the different possibilities:
– different phone sequences for a given word
– different combinations of elements of the speech signal for a given phone/phoneme
• Combine these into a Hidden Markov Model expressing the probabilities
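A toy forward algorithm shows how an HMM assigns a probability to an observation sequence. The states, transition probabilities, and emission probabilities below are invented for illustration; real acoustic models have many more states and continuous emission densities:

```python
# Toy forward algorithm for a word HMM: sums over all state paths to get
# the total probability of the observations under the model.
def forward(obs, states, start_p, trans_p, emit_p):
    """Total probability of the observation sequence under the HMM."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

states = ["b", "oh", "st", "n"]  # rough phone states for "Boston" (invented)
start_p = {"b": 1.0, "oh": 0.0, "st": 0.0, "n": 0.0}
trans_p = {
    "b":  {"b": 0.3, "oh": 0.7, "st": 0.0, "n": 0.0},
    "oh": {"b": 0.0, "oh": 0.4, "st": 0.6, "n": 0.0},
    "st": {"b": 0.0, "oh": 0.0, "st": 0.5, "n": 0.5},
    "n":  {"b": 0.0, "oh": 0.0, "st": 0.0, "n": 1.0},
}
# Each state emits its "own" symbol with probability 0.7, others with 0.1.
emit_p = {s: {"B": 0.7 if s == "b" else 0.1,
              "O": 0.7 if s == "oh" else 0.1,
              "S": 0.7 if s == "st" else 0.1,
              "N": 0.7 if s == "n" else 0.1} for s in states}

p = forward(list("BOSN"), states, start_p, trans_p, emit_p)
print(f"P(observations | 'Boston' HMM) = {p:.6f}")
```

At recognition time, each candidate word's HMM scores the observed frames this way, and the decoder picks the statistically most likely word sequence.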

Page 16: speech recognition and removal of disfluencies

[Figure: HMM-based speech synthesis system (HTS). Training part: excitation parameters and spectral parameters are extracted from a speech database and, together with labels, used to train context-dependent HMMs and state duration models. Synthesis part: text analysis converts TEXT to labels, excitation and spectral parameters are generated from the HMMs, and excitation generation followed by a synthesis filter produces the synthesized speech.]

HMM-based speech synthesis system (HTS)

Page 17: speech recognition and removal of disfluencies

HMMs for some words


Page 18: speech recognition and removal of disfluencies

• Identify individual phonemes
• Identify words
• Identify sentence structure and/or meaning


Page 19: speech recognition and removal of disfluencies

Performance errors

Performance "errors" include:
• Non-speech sounds
• Hesitations
• False starts, repetitions

Filtering implies handling at the syntactic level or above.

Some disfluencies are deliberate and have pragmatic effect - this is not something we can handle in the near future.

Page 20: speech recognition and removal of disfluencies


Disfluencies

Page 21: speech recognition and removal of disfluencies

Disfluencies: standard terminology (Levelt)

• Reparandum: the thing repaired
• Interruption point (IP): where the speaker breaks off
• Editing phase (edit terms): uh, I mean, you know
• Repair: the fluent continuation

Page 22: speech recognition and removal of disfluencies

Prosodic characteristics of disfluencies

• Fragments are good cues to disfluencies
• Prosody:
– Pause duration is shorter in disfluent silence than in fluent silence
– F0 increases from the end of the reparandum to the beginning of the repair, but only a minor change
– Repair interval offsets have a minor prosodic phrase boundary, even in the middle of an NP: Show me all n- | round-trip flights | from Pittsburgh | to Atlanta

Page 23: speech recognition and removal of disfluencies

Syntactic Characteristics of Disfluencies

• The repair often has the same syntactic structure as the reparandum
• Both are Noun Phrases (NPs) in this example

So if we could automatically find the IP, we could find and correct the reparandum!
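Because the repair often restarts from the same word the reparandum began with, a simple backward-matching heuristic can locate the reparandum once the IP is known. This is a toy sketch only - the edit-term list, the utterance, and the matching rule are assumptions; real systems combine syntactic and prosodic features:

```python
# Toy heuristic: given the interruption point (IP), guess the reparandum
# by finding the earlier occurrence of the word the repair restarts from,
# then splice out reparandum + editing phrase.
def correct_repair(words, ip, edit_terms=("uh", "um", "i", "mean")):
    # Skip the editing phase ("I mean", "uh", ...) that follows the IP.
    j = ip
    while j < len(words) and words[j].lower() in edit_terms:
        j += 1
    repair_start_word = words[j]
    # The reparandum began at the last earlier occurrence of that word.
    for i in range(ip - 1, -1, -1):
        if words[i] == repair_start_word:
            return words[:i] + words[j:]
    return words  # no match: leave the utterance unchanged

utt = "show me flights to Boston I mean to Denver".split()
# IP falls after "Boston" (index 5); "I mean" is the editing phrase.
print(" ".join(correct_repair(utt, 5)))  # show me flights to Denver
```

The heuristic fails when the repair does not lexically overlap the reparandum, which is one reason the prosodic cues above matter.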

Page 24: speech recognition and removal of disfluencies

Disfluencies in language modeling

Should we "clean up" disfluencies before training the LM (i.e. skip over disfluencies)?
• Filled pauses: Does United offer any [uh] one-way fares?
• Repetitions: What what are the fares?
• Deletions: Fly to Boston from Boston
• Fragments (we'll come back to these): I want fl- flights to Boston.
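The "clean up" step for the first two categories can be sketched with simple text processing. This handles only bracketed filled pauses and immediate repetitions; deletions and fragments need deeper analysis, and the bracket convention is assumed from the examples above:

```python
import re

# Sketch: cleaning up disfluencies before language-model training.
# Removes bracketed filled pauses like [uh] and collapses immediate
# word repetitions; fragments (fl-) are deliberately left alone.
def clean_for_lm(utterance):
    utterance = re.sub(r"\[\s*(uh|um)\s*\]", " ", utterance)  # filled pauses
    out = []
    for w in utterance.split():
        if not out or w.lower() != out[-1].lower():           # repetitions
            out.append(w)
    return " ".join(out)

print(clean_for_lm("Does United offer any [uh] one-way fares?"))
print(clean_for_lm("What what are the fares?"))
```

Whether this cleanup actually helps the LM is the empirical question the slide raises; the code only shows what "skipping over" the disfluencies would mean.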

Page 25: speech recognition and removal of disfluencies

Detection of disfluencies

Decision tree at the wi-wj word boundary, using features such as:
• Pause duration
• Word fragments
• Filled pauses
• Energy peak within wi
• Amplitude difference between wi and wj
• F0 of wi
• F0 differences
• Whether wi is accented

Results: 78% recall / 89.2% precision
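A hand-written stand-in for such a decision tree makes the feature set concrete. The tree shape and every threshold below are invented for illustration; a real system learns them from labeled boundaries:

```python
# Sketch of a decision tree over per-boundary features like those listed
# above. Thresholds and structure are invented, not learned.
def is_disfluent_boundary(f):
    if f["word_fragment"] or f["filled_pause"]:
        return True                        # strongest cues fire first
    if f["pause_sec"] < 0.05:              # disfluent pauses tend to be shorter
        if f["f0_jump_hz"] > 20:           # pitch resets into the repair
            return True
        return f["amplitude_diff_db"] > 6  # large loudness change across boundary
    return False

features = {"word_fragment": False, "filled_pause": False,
            "pause_sec": 0.03, "f0_jump_hz": 35, "amplitude_diff_db": 2}
print(is_disfluent_boundary(features))     # True under these toy values
```

The learned tree in the reported experiments reached 78% recall and 89.2% precision; this sketch only illustrates the kind of questions such a tree asks.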

Page 26: speech recognition and removal of disfluencies

Recent work: EARS Metadata Evaluation (MDE)

• Sentence-like Unit (SU) detection:
– find end points of SUs
– detect subtype (question, statement, backchannel)
• Edit word detection: find all words in the reparandum (words that will be removed)
• Filler word detection:
– filled pauses (uh, um)
– discourse markers (you know, like, so)
– editing terms (I mean)
• Interruption point detection

Liu et al. 2003

Page 27: speech recognition and removal of disfluencies

Kinds of disfluencies

• Repetitions: I * I like it
• Revisions: We * I like it
• Restarts (false starts): It's also * I like it

Page 28: speech recognition and removal of disfluencies

MDE transcription

Conventions:
• ./ for statement SU boundaries
• <> for fillers
• [] for edit words
• * for IP (interruption point) inside edits

And <uh> <you know> wash your clothes wherever you are ./ and [ you ] * you really get used to the outdoors ./
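Given these conventions, recovering a fluent transcript from an MDE-annotated string is a small parsing exercise. A minimal sketch, assuming the markers appear exactly as in the example above:

```python
import re

# Sketch: strip MDE markup to get one fluent string per SU.
# ./ ends an SU, <...> marks fillers, [...] marks edit words,
# and * marks the interruption point inside an edit.
def mde_to_fluent(annotated):
    text = re.sub(r"<[^>]*>", " ", annotated)   # drop fillers
    text = re.sub(r"\[[^\]]*\]", " ", text)     # drop edit words
    text = text.replace("*", " ")               # drop IP markers
    sus = [su.strip() for su in text.split("./") if su.strip()]
    return [" ".join(su.split()) for su in sus] # normalize whitespace

annotated = ("And <uh> <you know> wash your clothes wherever you are ./ "
             "and [ you ] * you really get used to the outdoors ./")
for su in mde_to_fluent(annotated):
    print(su)
```

Applied to the example above, this yields two fluent SUs: "And wash your clothes wherever you are" and "and you really get used to the outdoors".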

Page 29: speech recognition and removal of disfluencies

Recent work to improve quality

• Vocoding
– MELP-style / CELP-style excitation
– LF model
– Sinusoidal models
• Acoustic model
– Segment models, trajectory models
– Model combination (product of experts)
– Minimum generation error training
– Bayesian modeling
• Oversmoothing
– Pre- & post-filtering
– Improvements of GV
– Hybrid approaches
• & more…

Page 30: speech recognition and removal of disfluencies

Other challenging topics

• Non-professional speakers
– AVM + adaptation (CSTR)
• Too little speech data
– VTLN-based rapid speaker adaptation (Titech, IDIAP)
• Noisy recordings
– Spectral subtraction & AVM + adaptation (CSTR)
• No labels
– Un-/semi-supervised voice building (CSTR, NICT, CMU, Toshiba)
• Insufficient knowledge of the language or accent
– Letter (grapheme)-based synthesis (CSTR)
– No prosodic contexts (CSTR, Titech)
• Wrong language
– Cross-lingual speaker adaptation (MSRA, EMIME)
– Speaker & language adaptive training (Toshiba)

Page 31: speech recognition and removal of disfluencies

THANK YOU!
