
Page 1: Perception + Synthesis April 6, 2010 CP Results

Perception + Synthesis

April 6, 2010

Page 2: Perception + Synthesis April 6, 2010 CP Results

CP Results

[Figure: Discrimination - New Listeners. X-axis: Stimulus Pair (1-3 through 9-11); Y-axis: % Different Responses (0%-100%); series: Predicted vs. Observed.]

Page 3: Perception + Synthesis April 6, 2010 CP Results

[Figure: Discrimination - Experienced Listeners. X-axis: Stimulus Pair (1-3 through 9-11); Y-axis: % Different Responses (0%-100%); series: Predicted vs. Observed.]

Page 4: Perception + Synthesis April 6, 2010 CP Results

[Figure: Discrimination - New Listeners. X-axis: Stimulus Pairs (1-1 through 11-11); Y-axis: % Same Responses (0%-100%); series: Observed vs. Predicted.]

Page 5: Perception + Synthesis April 6, 2010 CP Results

[Figure: Discrimination - Experienced Listeners. X-axis: Stimulus Pairs (1-1 through 11-11); Y-axis: % Same Responses (0%-100%); series: Observed vs. Predicted.]

Page 6: Perception + Synthesis April 6, 2010 CP Results

Interesting Thoughts

• Lisa: “The reason [listeners] perform poorer in the [ga] range…may be because of the wide variation in F2 transitions for velars in real life. For instance, contrast the stops in ‘key’ and ‘coo’, whereas a [b] will always be a [b].”

• Aaron: “…there is an optimal place of articulation for each phoneme. For example, the F2 has a very high anti-node towards the back of the velum. Articulations at the back of the velum will drop the F2 more than articulations at the front of the velum. With a more drastic drop in F2, we are apt to see more of a difference in perception.”

Page 7: Perception + Synthesis April 6, 2010 CP Results

Interesting Thoughts

• Jacqueline: “Subjects may have been primed by the first exercise. When asked to identify ‘ba’, ‘da’, and ‘ga’, subjects are already aware that there are ambiguous sounds, so when asked to identify tones as ‘same’ or ‘different’, they have the past experience with the stimulus to discriminate differences more accurately than the mathematical functions are able to predict.”

• Joey: “What I suggest is that a better understanding of the cues you should be listening for will lead a person to pay closer attention to the precise portion of the stimulus that will show a contrast or not. For example, experienced listeners will be listening only to the formant transitions between onset and nucleus of the syllable, but new listeners will also be (potentially) listening for vowel quality and trying to analyze the entire stimulus.”

Page 8: Perception + Synthesis April 6, 2010 CP Results

Interesting Thoughts

• Emily: “As an ‘experienced’ listener myself, I didn’t find it any easier than the first time. Maybe there is something psychological involved that makes us feel cocky about it and then we don’t listen hard enough because we think we know what we’re doing…”

• Adrienne: “Experienced listeners were better able to discriminate different sounds than the new listeners, however this may be because the experienced listeners knew which acoustic cues to listen for and thus discriminate between the sounds. Inversely, the experienced listeners were less able to discriminate sounds that were the ‘same’ suggesting that they were over-applying their knowledge of this task.”

Page 9: Perception + Synthesis April 6, 2010 CP Results

Interesting Thoughts

• Janet: “I don’t think that this test is designed to determine if these stimuli can be learned from experience. I think that the experiment would need to be set up to test the ‘experienced’ group regularly at more consistent intervals. Then compare their results from the first day to the last day. Then compare those results against a control group of non-linguistic listeners who were only tested once.”

• Michael: “I would say that it is quite possible for one to learn how to categorize sounds the more practice one has and the more familiarity one has with the sounds…This would especially be true if they were trained with the answer in front of them.”

Page 10: Perception + Synthesis April 6, 2010 CP Results

Exemplar Predictions: Traces

• Point: all properties of all exemplars play a role in categorization…

• Not just the “definitive” ones.

• Prediction: non-invariant properties of speech categories should have an effect on speech perception.

• E.g., the voice in which a [b] is spoken.

• Or even the room in which a [b] is spoken.

• Is this true?

• Let’s find out…

Page 11: Perception + Synthesis April 6, 2010 CP Results

Another Experiment!

• Circle whether each word is a new or old word in the list.

[Answer sheet: items 1-24, new/old]

Page 12: Perception + Synthesis April 6, 2010 CP Results

Another Experiment!

• Circle whether each word is a new or old word in the list.

[Answer sheet: items 25-40, new/old]

Page 13: Perception + Synthesis April 6, 2010 CP Results

Evidence for Exemplar Storage

• In a “continuous word recognition” task, listeners hear a long sequence of words…

• some of which are new words in the list, and some of which are repeats.

• Task: decide whether each word is new or a repeat.

• Twist: some repeats are presented in a new voice;

• others are presented in the old (same) voice.

• Finding: repetitions are identified more quickly and more accurately when they’re presented in the old voice. (Palmeri et al., 1993)

• Implication: we store voice + word info together in memory.

Page 14: Perception + Synthesis April 6, 2010 CP Results

Class-Based Data

[Figure: All Stimuli. X-axis: Word Type (New, Repeat-Same, Repeat-Different); Y-axis: Percent Correct (50%-100%).]

Page 15: Perception + Synthesis April 6, 2010 CP Results

Class-Based Data

[Figure: Repeated Stimuli. X-axis: lag (Within 10 Stimuli, After 10 Stimuli); Y-axis: Percent Correct (50%-100%); series: Repeat-Same vs. Repeat-Different.]

Page 16: Perception + Synthesis April 6, 2010 CP Results

More Interactions

• Another task (Nygaard et al., 1994):

• train listeners to identify talkers by their voices.

• Then test the listeners’ ability to recognize words spoken in noise by:

• the talkers they learned to recognize

• talkers they don’t know

• Result: word recognition scores are much better for familiar talkers.

• Implication: voice properties influence word recognition.

• The opposite is also true:

• Talker identification is easier in a language you know.

Page 17: Perception + Synthesis April 6, 2010 CP Results

Variability in Learning

• Caveat: it’s best not to let the relationship between words and voices get too close in learning.

• Ex: training Japanese listeners to discriminate between /r/ and /l/.

• Discrimination training on one voice: no improvement. (Strange & Dittmann, 1984)

• Bradlow et al. (1997) tried:

• training on five different voices

• multiple phonological contexts (onset, coda, clusters)

• 4 weeks of training (with monetary rewards!)

• Result: improvement!

Page 18: Perception + Synthesis April 6, 2010 CP Results

Variability in Learning

• General pattern:

• Lots of variability in training leads to better classification of novel tokens…

• Even though it slows down improvement in training itself.

• Variability in training also helps perception of synthetic speech. (Greenspan et al., 1988)

• Another interesting effect: dialect identification (Clopper, 2004)

• Bradlow et al. (1997) also found that perception training (passively) improved production skills…

Page 19: Perception + Synthesis April 6, 2010 CP Results

Perception → Production

• Japanese listeners performed an /r/ - /l/ discrimination task.

• Important: listeners were told nothing about how to produce the /r/ - /l/ contrast

• …but, through perception training, their productions got better anyway.

• Also note cross-modal voice learning (Rosenblum et al., 2007)

Page 20: Perception + Synthesis April 6, 2010 CP Results

Exemplars in Production

• Goldinger (1998): “shadowing” task.

• Task 1--Production:

• A: Listeners read a word (out loud) from a script.

• B: Listeners hear a word (X), then repeat it.

• Finding: formant values and durations of (B) productions match the original (X) more closely than (A) productions.

• Task 2--Perception: AXB task

• A different group of listeners judges whether X (the original) sounds more like A or B.

• Result: B productions are perceptually more similar to the originals.

Page 21: Perception + Synthesis April 6, 2010 CP Results

Shadowing: Interpretation

• Some interesting complications:

• Repetition is more prominent for low frequency words…

• And also after shorter delays.

• Interpretation:

• The “probe” activates similar traces, which get combined into an echo.

• Shadowing imitation is a reflection of the echo.

• Probe-based activation decays quickly.

• And also has more of an influence over smaller exemplar sets.
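The “trace + echo” language here comes from exemplar models in the MINERVA 2 tradition (Hintzman), which Goldinger’s interpretation draws on. Below is a minimal sketch of that retrieval step; the cubed-similarity activation rule is MINERVA’s, while the vectors and sizes are toy assumptions for illustration:

```python
import numpy as np

def echo(probe, traces):
    """MINERVA-style retrieval: every stored trace is activated by its
    similarity to the probe (cubed, keeping the sign), and the echo is
    the activation-weighted sum of all traces."""
    sims = traces @ probe / (np.linalg.norm(traces, axis=1)
                             * np.linalg.norm(probe))
    activations = sims ** 3        # cubing makes close matches dominate
    return activations @ traces

# Toy demo: 100 stored exemplars with 20 features each.
rng = np.random.default_rng(0)
traces = rng.normal(size=(100, 20))
probe = traces[0] + 0.1 * rng.normal(size=20)   # a noisy repeat of trace 0
print(echo(probe, traces)[:5])
```

Because the echo is a weighted sum over whatever traces the probe activates, a single close match dominates more when the exemplar set is small, which fits the stronger shadowing effect for low-frequency words.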

Page 22: Perception + Synthesis April 6, 2010 CP Results

Moral of the Story

• Remember: categorical perception was initially used to justify the claim that listeners converted a continuous signal into a discrete linguistic representation.

• In reality, listeners don’t just discard all the continuous information.

• (especially for sounds like vowels)

• Perceptual categories have to be more richly detailed than the classical categories found in phonology textbooks.

• (We need the details in order to deal with variability.)

Page 23: Perception + Synthesis April 6, 2010 CP Results

Speech Synthesis: A Basic Overview

• There are four basic types of synthetic speech:

1. Mechanical synthesis

2. Formant synthesis

• Based on Source/Filter theory

3. Concatenative synthesis

• = stringing bits and pieces of natural speech together

4. Articulatory synthesis

• = generating speech from a model of the vocal tract.

Page 24: Perception + Synthesis April 6, 2010 CP Results

1. Mechanical Synthesis

• The very first attempts to produce synthetic speech were made without electricity.

• = mechanical synthesis

• In the late 1700s, models were produced which used:

• reeds as a voicing source

• differently shaped tubes for different vowels

Page 25: Perception + Synthesis April 6, 2010 CP Results

Mechanical Synthesis, part II

• Later, Wolfgang von Kempelen and Charles Wheatstone created a more sophisticated mechanical speech device…

• with independently manipulable source and filter mechanisms.

Page 26: Perception + Synthesis April 6, 2010 CP Results

Mechanical Synthesis, part III

• An interesting historical footnote:

• Alexander Graham Bell and his “questionable” experiments with his dog.

• Mechanical synthesis has largely gone out of style ever since.

• …but check out Mike Brady’s talking robot.

Page 27: Perception + Synthesis April 6, 2010 CP Results

The Voder

• The next big step in speech synthesis was to generate speech electronically.

• This was most famously demonstrated at the New York World’s Fair in 1939 with the Voder.

• The Voder was a manually controlled speech synthesizer.

• (operated by highly trained young women)

Page 28: Perception + Synthesis April 6, 2010 CP Results

Voder Principles

• The Voder basically operated like a vocoder.

• Voicing and fricative source sounds were filtered by 10 different resonators…

• each controlled by an individual finger!

• Only about 1 in 10 had the ability to learn how to play the Voder.

Page 29: Perception + Synthesis April 6, 2010 CP Results

The Pattern Playback

• Shortly after the invention of the spectrograph, the pattern playback was developed.

• = basically a reverse spectrograph.

• The idea at this point was still to use speech synthesis to determine the best cues for particular sounds.

Page 30: Perception + Synthesis April 6, 2010 CP Results

2. Formant Synthesis

• The next synthesizer was PAT (Parametric Artificial Talker).

• PAT was a parallel formant synthesizer.

• Idea: three formants are good enough for intelligible speech.

• Subtitles: What did you say before that? Tea or coffee? What have you done with it?

Page 31: Perception + Synthesis April 6, 2010 CP Results

PAT Spectrogram

Page 32: Perception + Synthesis April 6, 2010 CP Results

2. Formant Synthesis, part II

• Another formant synthesizer was OVE, built by the Swedish phonetician Gunnar Fant.

• OVE was a cascade formant synthesizer.

• In the ‘50s and ‘60s, people debated whether parallel or cascade synthesis was better.

• With weeks and weeks of tuning, each system could produce much better results:
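To make the source/filter idea behind these machines concrete, here is a minimal sketch of cascade formant synthesis in Python. The second-order resonator recipe is standard, but the formant values, bandwidths, and impulse-train source are illustrative assumptions, not PAT’s or OVE’s actual parameters:

```python
import numpy as np
from scipy.signal import lfilter

fs = 10000          # sample rate (Hz)
f0 = 100            # source fundamental frequency (Hz)
n = fs // 2         # half a second of samples

# Crude voicing source: an impulse train at f0.
source = np.zeros(n)
source[::fs // f0] = 1.0

def resonator(x, freq, bw, fs):
    """Second-order digital resonator: one 'formant'."""
    r = np.exp(-np.pi * bw / fs)                # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs               # pole angle from center frequency
    a = [1.0, -2.0 * r * np.cos(theta), r * r]  # denominator: the pole pair
    return lfilter([1.0 - r], a, x)             # small gain to tame the peak

# Cascade: the source passes through each formant filter in series.
# Formant values roughly appropriate for [a] (illustrative only).
y = source
for freq, bw in [(700, 90), (1220, 110), (2600, 170)]:
    y = resonator(y, freq, bw, fs)

y /= np.abs(y).max()    # normalize before playback or writing to disk
```

A parallel synthesizer would instead feed the source to each resonator independently, scale each output with its own amplitude control, and sum the results.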

Page 33: Perception + Synthesis April 6, 2010 CP Results

Synthesis by rule

• The ultimate goal was to get machines to generate speech automatically, without any manual intervention.

• synthesis by rule

• A first attempt, on the Pattern Playback:

(I painted this by rule without looking at a spectrogram. Can you understand it?)

• Later, from 1961, on a cascade synthesizer:

• Note: first use of a computer to calculate rules for synthetic speech.

• Compare with the HAL 9000:

Page 34: Perception + Synthesis April 6, 2010 CP Results

Parallel vs. Cascade

• The rivalry between the parallel and cascade camps continued into the ‘70s.

• Cascade synthesizers were good at producing vowels and required fewer control parameters…

• but were bad with nasals, stops and fricatives.

• Parallel synthesizers were better with nasals and fricatives, but not as good with vowels.

• Dennis Klatt proposed a synthesis (sorry):

• and combined the two…

Page 35: Perception + Synthesis April 6, 2010 CP Results

KlattTalk

• KlattTalk has since become the standard for formant synthesis. (DECTalk)

http://www.asel.udel.edu/speech/tutorials/synthesis/vowels.html

Page 36: Perception + Synthesis April 6, 2010 CP Results

KlattVoice

• Dennis Klatt also made significant improvements to the artificial voice source waveform.

• Perfect Paul:

• Beautiful Betty:

• Female voices have remained problematic.

• Also note: lack of jitter and shimmer

Page 37: Perception + Synthesis April 6, 2010 CP Results

LPC Synthesis

• Another method of formant synthesis, developed in the ‘70s, is known as Linear Predictive Coding (LPC).

• Here’s an example:

• To recapitulate childhood: http://www.speaknspell.co.uk/

• As a general rule, LPC synthesis is pretty lousy.

• But it’s cheap!

• LPC synthesis greatly reduces the amount of information in speech…

Page 38: Perception + Synthesis April 6, 2010 CP Results

Filters + LPC

• One way to understand LPC analysis is to think about a moving average filter.

• A moving average filter reduces noise in a signal by making each point equal to the average of the points surrounding it.

y_n = (x_{n-2} + x_{n-1} + x_n + x_{n+1} + x_{n+2}) / 5
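As a quick sketch (assuming numpy), that five-point moving average is just a convolution with five equal weights:

```python
import numpy as np

x = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 4.0, 2.0, 3.0, 1.0, 0.0])  # toy signal
weights = np.ones(5) / 5        # five equal weights that sum to 1

y = np.convolve(x, weights, mode='same')   # each y[n] averages x[n-2..n+2]
print(y)
```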

Page 39: Perception + Synthesis April 6, 2010 CP Results

Filters + LPC

• Another way to write the smoothing equation is

• y_n = 0.2*x_{n-2} + 0.2*x_{n-1} + 0.2*x_n + 0.2*x_{n+1} + 0.2*x_{n+2}

• Note that we could weight the different parts of the equation differently.

• Ex: y_n = 0.1*x_{n-2} + 0.2*x_{n-1} + 0.4*x_n + 0.2*x_{n+1} + 0.1*x_{n+2}

• Another trick: try to predict future points in the waveform on the basis of only previous points.

• Objective: find the combination of weights that predicts future points as perfectly as possible.
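A minimal sketch of that objective (illustrative; the helper name lpc_weights is made up here): collect each sample’s preceding samples into a matrix and solve for the weights that minimize the squared prediction error.

```python
import numpy as np

def lpc_weights(x, order):
    """Least-squares predictor: find weights w so that
    x[n] is approximated by w[0]*x[n-1] + ... + w[order-1]*x[n-order]."""
    A = np.array([x[n - order:n][::-1] for n in range(order, len(x))])
    b = x[order:]                          # the samples being predicted
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

# Toy example: a decaying sinusoid is predicted well by a few weights.
t = np.arange(400)
x = np.exp(-t / 200.0) * np.sin(2 * np.pi * t / 20.0)
print(lpc_weights(x, order=4))
```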

Page 40: Perception + Synthesis April 6, 2010 CP Results

Deriving the Filter

• Let’s say that minimizing the prediction errors for a certain waveform yields the following equation:

• y_n = 0.5*x_n - 0.3*x_{n-1} + 0.2*x_{n-2} - 0.1*x_{n-3}

• The weights in the equation define a filter.

• Example: how would the values of y change if the input to the equation was a transient where:

• at time n, x = 1

• at all other times, x = 0

• Graph y at times n to n+3.
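Working through it (a small sketch): for a unit impulse, only one term of the equation is nonzero at each time step, so the output simply traces out the weights.

```python
import numpy as np
from scipy.signal import lfilter

b = [0.5, -0.3, 0.2, -0.1]    # the weights from the equation above
x = np.zeros(8)
x[0] = 1.0                    # the transient: x = 1 at time n, 0 elsewhere

y = lfilter(b, [1.0], x)      # the FIR filter these weights define
print(y[:4])                  # [ 0.5 -0.3  0.2 -0.1], i.e. y at times n..n+3
```

So y at times n through n+3 is 0.5, -0.3, 0.2, -0.1: the filter’s impulse response is just its weight sequence.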

Page 41: Perception + Synthesis April 6, 2010 CP Results

Decomposing the Filter

• Putting a transient into the weighted filter equation yields a new waveform:

• The new waveform reflects the weights in the equation.

• We can apply Fourier Analysis to the new waveform to determine its spectral characteristics.

Page 42: Perception + Synthesis April 6, 2010 CP Results

LPC Spectrum

• When we perform a Fourier Analysis on this waveform, we get a very smooth-looking spectrum function:

• This function is a good representation of what the vocal tract filter looks like.

[Figure: LPC spectrum (smooth) overlaid on the original spectrum.]
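In code (a sketch): zero-padding that short impulse response before the FFT yields the smooth spectral curve the slide describes.

```python
import numpy as np

fs = 10000
h = np.array([0.5, -0.3, 0.2, -0.1])   # impulse response = the filter weights

H = np.fft.rfft(h, n=512)              # zero-pad to trace a smooth curve
freqs = np.fft.rfftfreq(512, d=1.0 / fs)
spectrum_db = 20 * np.log10(np.abs(H) + 1e-12)   # smooth spectral envelope
```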

Page 43: Perception + Synthesis April 6, 2010 CP Results

LPC Applications

• Remember: the LPC spectrum is derived from the weights of a linear predictive equation.

• One thing we can do with the LPC-derived spectrum is estimate the formant frequencies of a filter (see the sketch below).

• (This is how Praat does it)

• Note: the more weights in the original equation, the more formants are assumed to be in the signal.

• We can also use that LPC-derived filter, in conjunction with a voice source, to create synthetic speech.

• (Like in the Speak & Spell)
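A common recipe for the formant-estimation step, sketched below (this is the general root-solving approach, not necessarily Praat’s exact algorithm): treat the prediction weights as the polynomial A(z) of an all-pole filter and read candidate formant frequencies from the angles of its complex roots.

```python
import numpy as np

def formants_from_lpc(w, fs):
    """Candidate formants from predictor weights w (as in lpc_weights above):
    the roots of A(z) = 1 - w[0]z^-1 - ... - w[k-1]z^-k give the poles."""
    a = np.concatenate(([1.0], -np.asarray(w)))
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]           # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)  # pole angle -> frequency (Hz)
    return np.sort(freqs)
```

With ten weights (order 10), A(z) has up to five complex pole pairs, hence up to five candidate formants, which is the slide’s point about more weights meaning more assumed formants.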

Page 44: Perception + Synthesis April 6, 2010 CP Results

3. Concatenative Synthesis

• Formant synthesis dominated the synthetic speech world up until the ‘90s…

• Then concatenative synthesis started taking over.

• Basic idea: string together recorded samples of natural speech.

• Most common option: “diphone” synthesis

• Concatenated bits stretch from the middle of one phoneme to the middle of the next phoneme.

• Note: inventory has to include all possible phoneme sequences

• = only possible with lots of computer memory.
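A toy sketch of the joining step (the chunk names and the simple linear crossfade are assumptions for illustration; real diphone systems also match pitch and timing at the joins):

```python
import numpy as np

def concatenate_chunks(chunks, fs, xfade_ms=10):
    """Join recorded chunks (e.g., diphones) with a short linear
    crossfade at each boundary to avoid audible clicks."""
    n = int(fs * xfade_ms / 1000)
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = 1.0 - fade_out
    out = np.asarray(chunks[0], dtype=float)
    for c in chunks[1:]:
        c = np.asarray(c, dtype=float)
        out[-n:] = out[-n:] * fade_out + c[:n] * fade_in  # overlap the seam
        out = np.concatenate([out, c[n:]])
    return out

# Hypothetical usage, with diphone recordings d1, d2, d3 at 16 kHz:
# out = concatenate_chunks([d1, d2, d3], fs=16000)
```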

Page 45: Perception + Synthesis April 6, 2010 CP Results

Concatenated Samples

• Concatenative synthesis tends to sound more natural than formant synthesis.

• (basically because of better voice quality)

• Early (1977) combination of LPC + diphone synthesis:

• LPC + demisyllable-sized chunks (1980):

• More recent efforts with the MBROLA synthesizer:

• Also check out the Macintalk Pro synthesizer!