Towards a model of speech production: Cognitive modeling and computational applications Michelle L. Gregory SNeRG 2003

Post on 19-Dec-2015



Towards a model of speech production: Cognitive modeling and computational applications

Michelle L. Gregory

SNeRG 2003


Outline

Where I’ve been

• Predictability effects on word duration

• Predictability effects on pitch accent

Where I’m at

• Computational model of pitch accent

• Prosodic information to aid parsing

• Psychological models of production

Where I’d like to go

• Prosody

• Disfluencies

• Speech synthesis


Where I’ve been

Bad idea

despair

Cute!

knowledge


Where I’m at…

CU-BOULDER LINGUISTICS PROFESSOR WINS 2002 MACARTHUR FELLOWSHIP


Where I’m headed …


Predictability effects on word duration

Methodology

• Corpus and design (Switchboard corpus, regression)

• Measures of predictability (frequency, bigram, joint, mutual information, repetition)

Function words (top ten most frequent)

• Vowel reduction

• Coda deletion

• Duration

Content words

• t/d deletion

• Duration
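
The predictability measures listed above can be sketched in a few lines. This is only an illustration, not the study's actual code: the function names, the toy sentence, and the maximum-likelihood estimates (plain counts, no smoothing) are all assumptions, and "mutual information" is taken here to mean pointwise mutual information of a word pair.

```python
import math
from collections import Counter

def predictability_measures(tokens):
    """Estimate predictability measures for each adjacent word pair.

    All estimates are unsmoothed relative frequencies (an assumption
    made for this sketch).
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_bigrams = n - 1
    measures = {}
    for (prev, word), count in bigrams.items():
        p_prev = unigrams[prev] / n          # relative frequency of the first word
        p_word = unigrams[word] / n          # relative frequency of the second word
        p_joint = count / n_bigrams          # joint probability P(prev, word)
        p_cond = count / unigrams[prev]      # conditional bigram P(word | prev)
        pmi = math.log2(p_joint / (p_prev * p_word))  # pointwise mutual information
        measures[(prev, word)] = {
            "frequency": p_word,
            "conditional": p_cond,
            "joint": p_joint,
            "mutual_information": pmi,
        }
    return measures

def is_repetition(tokens, i):
    """True if the word at position i already occurred earlier in the tokens."""
    return tokens[i] in tokens[:i]

# Toy input, invented for illustration only.
toy = "i would like to go and i would like to stay".split()
toy_measures = predictability_measures(toy)
print(toy_measures[("to", "go")]["conditional"])  # "to" is followed by "go" half the time
```

In a regression design, each of these values would be one predictor per word token, alongside the controls the study used.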


Predictability effects on word duration

The probabilistic reduction hypothesis: The higher the probability of a word, the more it is reduced/shortened/lenited in lexical production. (Gregory et al. 1999; Jurafsky, Bell, Gregory, and Raymond 2000)

Implications

• Any factor that increases the probability of a word also increases phonological reduction.

Is that the only role of probabilistic information?


Predictability effects on pitch accent

Same database, used regression models, but this time coded for pitch accent.

What is pitch accent?

Perceptual phenomenon (Hirschberg, 1993), associated with the duration, amplitude, and F0 of units. Words that appear more intonationally prominent than others are said to bear pitch accent.


Predictability and pitch accent

Even now I would I would LIKE to NOT have to WORK in SOME ways.

[Figure: two pitch tracks plotted against time (s)]

Pitch accent is associated with meaning:


Predictability effects on pitch accent

Results

More predictable words are less likely to bear pitch accent, as measured by the following (this is true for all parts of speech):

• Frequency

• Conditional bigram probability

• Joint probability

• Semantic relatedness

• Repetition

(Not all of these are the same measures that affect reduction; e.g., preceding context is more important with pitch accent.)

Implications

• The role of predictability is not limited to reduction processes.

• Predictability is not just a fact about lexical access; this information is available during phonological encoding.

• Prosody in speech synthesis is rudimentary; a probabilistic model is (relatively) easy to implement.


Current Research

• Computational model of pitch accent

• Prosodic information to aid parsing

• Psychological models of production


Computational model of pitch accent

(joint work with Yasemin Altun)

Problem

Predicting accent is not an exact science.

Hirschberg (1993) and Pan & Hirschberg (2000) demonstrate that frequency and conditional probability increase accuracy in pitch accent prediction.

• Function vs. content only: 68%

• Frequency, conditional probability: 71%

• Both of the above: 73%

Will the addition of more/different probabilistic variables increase accuracy as well?
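
To make the function-vs-content baseline concrete, here is a minimal sketch that scores it on the single example sentence from the earlier slide, reading the capitalized words there as the accented ones. The function-word list is a hypothetical, hand-picked set for this one sentence, and a one-sentence score says nothing about the corpus figures above.

```python
# Hypothetical function-word list, chosen by hand for this example only.
FUNCTION_WORDS = {"i", "would", "to", "have", "in", "not", "some"}

# Example sentence from the slides; ALL CAPS is read here as "bears pitch accent".
SENTENCE = "Even now I would I would LIKE to NOT have to WORK in SOME ways"

def is_accented_in_transcript(word):
    """Treat a fully capitalized word longer than one letter as accented.

    The length check keeps the pronoun "I" from counting as accented.
    """
    return word.isupper() and len(word) > 1

def baseline_predicts_accent(word):
    """Baseline rule: content words are accented, function words are not."""
    return word.lower() not in FUNCTION_WORDS

def baseline_accuracy(sentence):
    """Share of words for which the baseline matches the transcript."""
    words = sentence.split()
    hits = sum(
        baseline_predicts_accent(w) == is_accented_in_transcript(w) for w in words
    )
    return hits / len(words)

print(round(baseline_accuracy(SENTENCE), 3))  # 10 of 15 words match
```

The misses are instructive: the baseline wrongly accents "Even", "now", and "ways", and misses the accented "NOT" and "SOME", which is the kind of error that probabilistic variables are meant to reduce.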


Computational model of pitch accent

Testing more variables

• Joint probability, reverse conditional probability

• The effects of surrounding accents

• More fine-grained part of speech

• Things like rate of speech, etc.
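
Two of the added variables can be sketched as features. This is an assumption-laden illustration, not the model from this work: reverse conditional probability is taken to be P(word | following word), estimated from raw counts, and the "surrounding accents" effect is reduced to a single flag for whether the previous word was accented. All names and the toy data are invented.

```python
from collections import Counter

def reverse_conditional(tokens):
    """Estimate P(w_i | w_{i+1}) as count(w_i, w_{i+1}) / count(w_{i+1})."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (word, nxt): count / unigrams[nxt]
        for (word, nxt), count in bigrams.items()
    }

def token_features(tokens, accents):
    """Build one feature dict per token.

    accents[i] is True if token i is accented (known during training).
    """
    rev = reverse_conditional(tokens)
    rows = []
    for i, word in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        rows.append({
            "word": word,
            # No following word at the end of the sequence, so no estimate.
            "reverse_conditional": rev[(word, nxt)] if nxt is not None else None,
            "prev_accented": accents[i - 1] if i > 0 else False,
        })
    return rows

# Toy tokens and accent labels, invented for illustration only.
toy_tokens = "i would like to go and i would like to stay".split()
toy_accents = [False, False, True, False, True, False,
               False, False, True, False, True]
toy_rows = token_features(toy_tokens, toy_accents)
print(toy_rows[3])  # features for the first "to"
```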


Prosodic information to aid parsing

(Joint work with Mark Johnson and Eugene Charniak)

Problem:

Parsing conversational speech is difficult

Accuracy of parsing

• Wall Street Journal (WSJ): 90% (Charniak 2000)

• Switchboard (swbd): 84.5%

• WSJ, no punctuation: 86%

• swbd, no punctuation: 81%

Add prosodic features instead of punctuation


Prosodic information to aid parsing

Methodology

• Get timing information from the transcripts

• Add pause duration information as a term in the parser (use pauses as a cue instead of punctuation)

• For sentence-internal punctuation only

http://cog.brown.edu:16080/~mj/papers/acl02-emptynodes.pdf

Results

• Accuracy goes down (80%)

• Because the language model is not as strong?
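
The preprocessing step in the methodology above (turning timing information into pause cues) might look roughly like this. It is a sketch under stated assumptions: the 0.2 s threshold, the `<pause>` token name, and the word timings are all invented, and how the pause term actually enters the parser is not shown on the slides.

```python
def insert_pause_tokens(timed_words, min_pause=0.2, token="<pause>"):
    """Insert a pause token wherever the silence between two words is long.

    timed_words: list of (word, start_sec, end_sec) in utterance order.
    min_pause:   assumed threshold in seconds; not taken from the study.
    """
    output = []
    previous_end = None
    for word, start, end in timed_words:
        if previous_end is not None and start - previous_end >= min_pause:
            output.append(token)
        output.append(word)
        previous_end = end
    return output

# Invented timings for illustration only.
timed = [("move", 0.00, 0.30), ("the", 0.30, 0.45),
         ("red", 0.90, 1.20), ("cup", 1.22, 1.60)]
print(insert_pause_tokens(timed))
# → ['move', 'the', '<pause>', 'red', 'cup']
```

A single threshold discards the actual pause length; binning durations into a few categories would be one alternative way to expose more of the timing signal.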


Psychological models of production: Disfluencies

(joint work with Julie Sedivy and Dan Grodner)

Looking at what’s going on during speech and when

Initially, we were interested in how prosody maps to discourse constraints in the production of prenominal adjectives

Move the red cup

Facts:

• Speakers only use scalar or material adjectives in the environment of a contrast. Speakers use color adjectives ALL the time.

• A contrast is prosodically marked (there is an increase in pitch range in the presence of a contrast).

• Despite an increase in pitch range, there is not a duration increase with adjectives produced in a contrastive environment.

• BUT scalar adjectives are longer.


Psychological models of production: Disfluencies

Really neat fact: Speakers produce more disfluencies with scalar adjectives compared to material or color.

Disfluencies account for about 6% of spontaneous speech (Shriberg 2002).

• silent pauses: move the <sil> red …

• elongated pronunciations: move theee

• filled pauses: move the um

• repetitions: move the the

• restarts: move the uh the red …
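
Three of these types can be spotted directly in a word-level transcript. The sketch below is illustrative only: it assumes silences are transcribed as `<sil>` and filled pauses as "um" or "uh", as in the examples above, and it does not attempt elongations or restarts, which need duration information or more context than adjacent tokens provide.

```python
FILLED_PAUSES = {"um", "uh"}  # forms taken from the slide examples
SILENCE = "<sil>"             # silence marker as written in the slide example

def tag_disfluencies(tokens):
    """Return (index, type) pairs for the disfluencies found in the tokens.

    Covers silent pauses, filled pauses, and immediate word repetitions.
    """
    tags = []
    for i, token in enumerate(tokens):
        if token == SILENCE:
            tags.append((i, "silent pause"))
        elif token.lower() in FILLED_PAUSES:
            tags.append((i, "filled pause"))
        elif i > 0 and token.lower() == tokens[i - 1].lower():
            tags.append((i, "repetition"))
    return tags

print(tag_disfluencies("move the the red cup".split()))
# → [(2, 'repetition')]
```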


Psychological models of production: Disfluencies

• Used an eye-tracking device to find out what’s happening during the disfluency

Move the, uh, big car next to the turtle


Psychological models of production: Disfluencies

Results:

• We found that speakers are looking more at the contrasting object in the case of the scalars during the disfluency

• AND during the adjective!

Implications:

• Marking a contrast set does not increase processing load

• Encoding a relative property does increase processing load

• Duration is affected by lexical encoding (suggests a continuum of planning difficulty effects)


Near-future research


Prosody

In general, continue looking at the factors that influence prosodic variation and see if these can be modeled probabilistically.

The challenge:

• Lots of people have found that discourse-pragmatic factors contribute to prosodic marking

• Others, including myself, have found that prosody is affected by probabilistic variables

• How can we model aspects of the speech context probabilistically?


Disfluencies

Disfluencies have proven to be a very useful window into processes of speech production.

• Are there more disfluencies around evaluative terms in general?

• Do different types of disfluencies correspond to difficulties associated with different aspects of production (initial planning versus lexical encoding and access)?

• Investigate more fully the connection between disfluencies and the length of surrounding words.

• Why is it that words following a disfluency are longer?

• How much of duration variation can be accounted for by planning difficulties versus other factors?


Speech synthesis

Three types of TTS systems:

Concatenated or diphone models

• Advantages: the ability to process novel strings of text; does not require a huge database of stored speech.

• Disadvantages: mechanical-sounding speech; a lot of post-processing.

Corpus-based: prosodic patterns (durations, stress, F0 contours) are not defined by the signal processor; rather, the phoneme sequences are chosen based on exact prosodic pattern matches in a corpus.

• Advantage: natural-sounding speech, specifically with regard to prosody.

• Disadvantage: a much larger database is required, with a lot more hand coding involved. It also does not allow for totally novel sequences of sounds or words that are not in the database.

Phrase splicing (unit selection): selects the largest unit possible from a corpus of one speaker.

• Advantage: very natural; requires very little post-speech processing from a signal processor.

• Disadvantage: requires an extremely large (~10 hours) hand-annotated corpus of speech. It also does not allow for novel sequences of speech, and thus must be used in conjunction with a diphone model.
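
The "largest unit possible" idea, together with the fallback to another model for uncovered material, can be sketched as a greedy longest-match over a set of recorded phrases. This is a simplification under assumptions: units are whole words rather than phones, there are no target or join costs as in real unit-selection systems, and the phrase inventory and function name are invented.

```python
def select_units(words, recorded_phrases):
    """Greedily cover `words` with the longest recorded phrases available.

    Returns a list of (unit, source) pairs, where unit is a tuple of words
    and source is "recorded" or "fallback". A "fallback" unit stands in for
    material that a diphone-style synthesizer would have to produce.
    """
    recorded = {tuple(phrase.split()) for phrase in recorded_phrases}
    longest = max((len(phrase) for phrase in recorded), default=0)
    units = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shrink.
        for length in range(min(longest, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + length])
            if candidate in recorded:
                units.append((candidate, "recorded"))
                i += length
                break
        else:
            # No recorded unit starts here: hand one word to the fallback.
            units.append(((words[i],), "fallback"))
            i += 1
    return units

# Invented inventory for illustration only.
inventory = ["move the", "red cup", "move the red cup next to", "the turtle"]
request = "move the red cup next to the blue turtle".split()
for unit, source in select_units(request, inventory):
    print(source, " ".join(unit))
```

Here the six-word recorded phrase is used whole, while "the", "blue", and "turtle" fall through to the fallback because "the blue turtle" was never recorded as a unit.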


Speech synthesis

(joint work with Mike Buckley and Kris Schindler)

Using a Probabilistic Model to Improve Speech Synthesis in the UB Talker

The UB Talker:

• Artificial speaking device

• Menu-driven means of selecting words and phrases

• Menus, words, and phrases can be pre-programmed or entered on-screen

• Uses context-awareness and phrase completion to predict responses

• Statistics are derived using frequency of use, most-recently used, time of day, day of week, and time of year to present most-likely phrases to users.
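
How such statistics might be combined into a ranking is sketched below. This is not the UB Talker's implementation: the scoring formula, the equal weights, and the log format are assumptions, and only three of the listed signals (frequency of use, recency, time of day) are modeled.

```python
def rank_phrases(usage_log, now_hour, top_n=3,
                 w_freq=1.0, w_recent=1.0, w_hour=1.0):
    """Rank phrases by a weighted mix of frequency, recency, and time of day.

    usage_log: list of (phrase, hour_of_day) in chronological order.
    The weights and the additive score are assumptions for this sketch.
    """
    total = len(usage_log)
    stats = {}
    for position, (phrase, hour) in enumerate(usage_log):
        entry = stats.setdefault(phrase, {"freq": 0, "recent": 0.0, "hour": 0})
        entry["freq"] += 1
        entry["recent"] = (position + 1) / total  # later uses score higher
        if hour == now_hour:
            entry["hour"] += 1

    def score(phrase):
        entry = stats[phrase]
        return (w_freq * entry["freq"] / total
                + w_recent * entry["recent"]
                + w_hour * entry["hour"] / total)

    return sorted(stats, key=score, reverse=True)[:top_n]

# Invented usage log for illustration only.
log = [("good morning", 8), ("good morning", 8),
       ("i am hungry", 12), ("good night", 21)]
print(rank_phrases(log, now_hour=8))
# → ['good morning', 'good night', 'i am hungry']
```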


Speech synthesis

Once a string is selected, a synthesizer component produces speech.

Two goals:

1. Add a probabilistic model of prosody to the current free TTS system

2. Build a corpus of speech toward a unit selection model (the Client has about 2,000 phrases in the system that can be pre-recorded)


Speech synthesis

Some academically available and commercially available synthesizers:

http://www.cstr.ed.ac.uk/projects/festival/userin.html

http://www.rhetorical.com/cgi-bin/demo.cgi

http://www.research.att.com/projects/tts/demo.html