
Page 1: Automating Phonetic Measurements

Automating Phonetic Measurements

Mark Liberman
http://ling.upenn.edu/~myl

Page 2: Automating Phonetic Measurements

Groningen Quantitative Linguistics 2

ABSTRACT:

For a century and a half, phoneticians have been measuring durations, frequencies, and other physical properties of speech sounds. At first they used purely mechanical devices; then they used electro-mechanical devices; and now they use programs running on general-purpose computers -- but in nearly all cases, each measurement still involves an act of human judgment. This adds a significant amount of labor to the cost-benefit analysis implicit in experimental design; and this in turn has tended to hold the field back, especially in taking advantage of the millions of hours of speech now becoming available.

There were some successful examples of large-scale automated phonetic measurement in the 1970s, and the relevant technology has improved enormously since then. However, it remains very rare for research in speech science to use automated measurement techniques, despite the obvious benefits of vastly increased productivity, potential access to large-scale naturalistic data, and reduction in the effects of experimenter bias. This talk will attempt to identify the reasons for this conservatism, and to discuss the prospects for overcoming them.

6/29/2012

Page 3: Automating Phonetic Measurements

Claim:

For most well-defined phonetic detection, classification, and measurement tasks,

judicious choice of acoustic features and application of modern machine-learning techniques

will produce an automated annotation system that agrees with human annotators about as well as they agree with one another.

Page 4: Automating Phonetic Measurements

Opportunities:

• Automating old-fashioned measurements
  – Formants
  – Segment duration, event timing (e.g. VOT)
  – Variables like “g-dropping”, “t/d deletion”
  – Pronunciation variation in general
  … but on 1K-10K hours instead of 1-10 hours

• Creating new measurements [optional appendices …]
  – Modeling vowel and tone contours
  – /l/ allophony
  – Voice quality, etc.

• Automating experiments [not discussed today]

Page 5: Automating Phonetic Measurements

BUT…

Today’s published phonetics research rarely uses automated measurement techniques.

This talk will attempt to show that such techniques are possible, and also to explain why they are not widely used.

…and how this might change.

Page 6: Automating Phonetic Measurements

An interesting paper that uses automatic timing measurements:

Saul Sternberg et al., “The Latency and Duration of Rapid Movement Sequences”, in G.E. Stelmach, Ed., Information Processing in Motor Control and Learning

“These studies began with two accidental findings […]. The first was that the number of words in a brief rapid utterance influenced the time to initiate the utterance, even though the talker knew what he would have to say well in advance of the reaction signal. This finding seemed surprising, particularly in view of the claim based on previous studies […] that the latency (or reaction time) for saying a single word, known in advance, is not affected by the number of syllables it contains. The second finding was that the functions relating the duration of these rapid utterances to the number of words they contained were concave upward rather than being linear, indicating that words in longer sequences were produced at slower rates.”

“Using an energy-sensitive speech detector with a low threshold, we made two measures of each response: its latency, measured from signal onset to the start of the utterance, and its duration. “
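A minimal sketch of this kind of energy-threshold detector, in Python. The 10 ms frame size and the threshold value are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def latency_and_duration(signal, sr, threshold):
    """Energy-threshold speech detector in the spirit of the setup
    quoted above. Returns (latency, duration) in seconds, or None
    if no frame exceeds the threshold."""
    frame = int(0.010 * sr)              # 10 ms frames (an assumption)
    n = len(signal) // frame
    energy = np.array([np.mean(np.abs(signal[i * frame:(i + 1) * frame]))
                       for i in range(n)])
    above = np.where(energy > threshold)[0]
    if len(above) == 0:
        return None                      # no speech detected on this trial
    latency = above[0] * frame / sr      # signal onset to start of utterance
    offset = (above[-1] + 1) * frame / sr
    return latency, offset - latency     # latency and utterance duration
```

In practice the threshold has to sit above the recording's noise floor but low enough to catch weak onsets; that trade-off is exactly why the paper calls it a "low threshold" detector.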

Page 7: Automating Phonetic Measurements

Page 8: Automating Phonetic Measurements

Random (open circles) vs. fixed (filled circles) fore-periods

Page 9: Automating Phonetic Measurements

Another interesting paper from the same group:

Saul Sternberg et al., “Motor Programs in Rapid Speech: Additional Evidence”, in Cole, Ed., Perception and Production of Fluent Speech.

[W]e report measurements that permit us to segment rapid utterances into their component words, and thereby to specify how the time interval from one word to the next depends on the serial position of the word pair within an utterance, as well as on utterance length. […] [T]he resulting serial-position functions […] provide further evidence for the dependence of local features throughout the utterance on a representation (“motor program”) of the entire utterance.

… [W]e report a first attempt to localize, within the time interval from the start of one response unit to the start of the next, the effects of utterance length and serial position; these measurements then permit relatively clean tests of two implications drawn from our model.

Page 10: Automating Phonetic Measurements

“The experiment described in Section V was designed to permit using the amplitude envelope of each utterance to determine the mean intervals between successive response units: We used words beginning with stop consonants and we balanced word identities across serial positions. However, identification of the “beginning” of the execution of a response unit embedded in continuous speech presents difficulties of definition as well as measurement, and such an effort depends on relatively arbitrary decisions.

[…] In attempting to deal with these difficulties we made two assumptions. First, we assumed that each monosyllabic word is a response unit; second we assumed that the occurrence time of an event relatively early in the word (the transition between the stop consonant and the vowel …) would adequately measure (on the average, and to within an additive constant) the starting time of the command stage for the unit in that serial position.”

Page 11: Automating Phonetic Measurements

“We digitized subjects’ speech at 10kHz and determined the mean absolute sample value in adjacent 10-msec intervals. Linear interpolation between successive means then defined an amplitude envelope that usually showed a single peak for each word in the utterance and a single trough between each word and the next.

On 86% of the correct trials we were able to choose a criterion amplitude value that intersected the ascending flanks of exactly n amplitude peaks. […] The time differences between successive intersections then defined a series of n-1 intervals between the ‘beginnings’ of successive words.”
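The envelope-and-criterion procedure quoted above can be sketched as follows. This is a simplified Python reconstruction, not the original code; the criterion value and the trial-rejection rule are assumptions drawn from the quoted description:

```python
import numpy as np

def interword_intervals(signal, sr, criterion, n_words):
    """Sketch of the envelope method quoted above: mean absolute sample
    value in 10 ms bins, then the times at which the envelope rises
    through a criterion amplitude define the 'beginnings' of words."""
    bin_len = int(0.010 * sr)
    n_bins = len(signal) // bin_len
    env = np.array([np.mean(np.abs(signal[i * bin_len:(i + 1) * bin_len]))
                    for i in range(n_bins)])
    # Ascending-flank crossings of the criterion level
    rising = np.where((env[:-1] < criterion) & (env[1:] >= criterion))[0]
    if len(rising) != n_words:
        return None                       # reject the trial (as on 14% of trials)
    times = (rising + 1) * bin_len / sr   # crossing times, in seconds
    return np.diff(times)                 # the n-1 interword intervals
```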

Page 12: Automating Phonetic Measurements

Mean interword interval for each serial position in lists of monosyllabic words of lengths n=2,3,4,5. Graph (a) shows homogeneous lists, (b) heterogeneous lists.

Page 13: Automating Phonetic Measurements

Why did Sternberg et al. use automated (proxy) measurements rather than hand measurements from waveforms and spectrograms?

1. Accuracy

2. Freedom from experimenter bias

3. Efficiency: Across all their experiments and controls, they ran ~100 subjects, ~1,000 trials per subject

4. And also, …

Page 14: Automating Phonetic Measurements

Something I left out – the dates of publication:

Saul Sternberg et al., “The Latency and Duration of Rapid Movement Sequences”, in G.E. Stelmach, Ed., Information Processing in Motor Control and Learning, 1978.

Saul Sternberg et al., “Motor Programs in Rapid Speech: Additional Evidence”, in R. Cole, Ed., Perception and Production of Fluent Speech, 1980.

Page 15: Automating Phonetic Measurements

Flash forward 32 years…

This year’s Journal of Phonetics:
• Four issues so far (January, March, May, July)
• 45 articles

+ 3 online in advance of publication = 48 articles

• 33 with selection, classification, measurement of time points or regions in acoustic, articulatory or physiological recordings

(others involve only perception, symbol-manipulation or philosophy…)


Page 16: Automating Phonetic Measurements

Use of automated measures?

Depending on how you count, either 0 or 1 out of the 33 papers making production measurements used automatic detection or classification methods

And special note should be taken of
C. Mooshammer et al., “Bridging planning and execution: Temporal planning of syllables”, J. Phonetics 40 (May 2012)

This paper used a variant of the Sternberg “delayed naming” paradigm,

BUT . . .


Page 17: Automating Phonetic Measurements

[W]e employed the delayed naming paradigm (see Kawamoto et al., 2008; Rastle et al., 2005; Sternberg et al., 1988). This was implemented using custom software that controlled presentation of each stimulus on a computer screen as follows: first, subjects were shown the target word with a ‘‘Get ready, say ‘uhhh’’’ prompt, their cue to read silently, inhale, and produce schwa. […] Next, at a randomized delay varying between 1 and 2 s, the screen changed color, the prompt changed to ‘‘Go,’’ and an audible beep was emitted, providing the cue to the subject to produce the target word as quickly as possible.

The stimulus onset was determined by automatic detection of the maximal amplitude of the beep. This signal was a sinusoid of 500 Hz, amplitude filtered using a triangular window with a duration of 90 ms. The onset of the speech signal was labeled manually at the release burst for items beginning with stops, at the onset of frication for initial fricatives, and at the end of the transition from the preceding schwa to a lateral.

Page 18: Automating Phonetic Measurements

One paper used forced alignment:

Barbara Schuppler et al., “How linguistic and probabilistic properties of a word affect the realization of its final /t/: Studies at the phonemic and sub-phonemic level”, J. Phonetics 40(4), July 2012

uses “forced alignment” to choose among pronunciation variants

But then…


Page 19: Automating Phonetic Measurements

Schuppler et al. 2012, the abstract:

This paper investigates the realization of word-final /t/ in conversational standard Dutch. First, based on a large number of word tokens (6747) annotated with broad phonetic transcription by an automatic transcription tool, we show that morphological properties of the words and their position in the utterance’s syntactic structure play a role for the presence versus absence of their final /t/. We also replicate earlier findings on the role of predictability (word frequency and bigram frequency with the following word) and provide a detailed analysis of the role of segmental context.

Second, we analyze the detailed acoustic properties of word-final /t/ on the basis of a smaller number of tokens (486) which were annotated manually. [emphasis added]

Page 20: Automating Phonetic Measurements

Schuppler et al. 2012 -- From the body of the paper:

An ASR system was used to create a broad phonetic transcription for the ECSD. Automatic transcriptions have the advantage that they are consistent and can be more easily obtained than manual transcriptions for large data sets. We used the so-called forced alignment for creating the broad phonetic transcriptions. Input for the forced alignment were the speech files, the orthographic transcriptions of these files, a pronunciation lexicon of the words in these transcriptions, and acoustic models for each phone that had been trained beforehand. First, the words from the orthographic transcriptions were looked up in a pronunciation lexicon containing multiple pronunciation variants per word. Then, given the acoustic signal and the acoustic phone models, the ASR system chose the pronunciation variant that matched best with the speech signal.

Page 21: Automating Phonetic Measurements

Schuppler et al. 2012, continuing:

The automatically generated broad phonetic transcriptions used in Study I treat the signal as if it consists of beads on a string, with each bead representing a single, clearly realized phone. As a consequence, pronunciation variation could only be captured as phone substitutions, insertions, or deletions. However, phonetic reality is more complex. Especially speech of an informal speaking style, like our material, may show realizations resulting from articulatory overlap with neighboring segments. The goal of Study II is to give a detailed analysis of different phonetic properties of /t/, which provides better insight into how reduction is reflected in terms of sub-phonemic properties.

The phonetic analysis was carried out manually by two trained phoneticians, both native speakers of Dutch. They scored the tokens for a set of sub-phonemic properties, based on analytic listening combined with inspection of the waveforms and spectrograms. [...] In case of disagreement, the labelers inspected the signal together to arrive at a consensus judgment.

Page 22: Automating Phonetic Measurements

Schuppler et al. 2012 – basis of the fine phonetic classification:

Canonical /t/ is realized with complete closure. The labelers first determined whether a constriction was present or not. If present, it was classified as (a) complete, (b) realized with friction (i.e., weak alveolar friction partially or completely replacing canonical complete closure [...]), (c) with nasal friction (weak but audible nasal friction replacing complete closure), or (d) with nasal murmur, caused by a preceding nasal consonant (similar to the manifestation of a regular nasal consonant, but with a lower amplitude).

In the next step, the constriction was classified as voiced or unvoiced. Voiced constrictions are characterized by periodicity of relatively strong amplitude that contributes to a segment being perceived as voiced, whereas unvoiced constrictions do not have any periodicity or only contain periodicity of rapidly decreasing amplitude after a voiced segment.

Page 23: Automating Phonetic Measurements

Schuppler et al. 2012, fine phonetic classification cont’d:

Next, the burst was classified as present or absent. If present, it was specified whether there was one burst or multiple bursts. We classified a burst as ‘multiple bursts’ if there were two or more release impulses that are distinct from the friction noise of the next segment by short duration and relatively strong intensity. We classified a burst as ‘single burst’ if there was one short impulse, separated from the friction noise of the next segment.

In addition, bursts were labeled as strong or weak, where weak bursts were characterized by extremely short durations and with energy in only part of the spectrum. All burst labels were based on the bursts’ acoustic representations in the spectrograms.

Page 24: Automating Phonetic Measurements

Why does this matter?

These are terrific papers! So who cares how the measurements are made?

1. Human judgment = wasted time and effort; slower progress of research; possible problems of confirmation bias and other sorts of experimenter error

2. Lost opportunities to do very-large-scale research, especially on “found data” – but also large-scale laboratory speech collections

But is there any alternative?

Page 25: Automating Phonetic Measurements

Claim, again:

For most well-defined phonetic detection, classification, and measurement tasks,

judicious choice of acoustic features and application of modern machine-learning techniques

will produce an automated annotation system that agrees with human annotators about as well as they agree with one another.

Page 26: Automating Phonetic Measurements

Example: VOT measurement for voiceless stops

(Work in progress by Neville Ryant at Penn; supported by NSF grant 0964556, P.I. Jiahong Yuan, “New Tools and Methods for Very Large Scale Phonetics Research”)

Concept:

1. Use forced alignment to find the approximate region for a measurement

2. Use machine-learning methods to classify the region:
…what was the pronunciation like in this particular case?
…should we be making a measurement at all?
…what should we measure, and how?

3. Given the classification, use machine-learning methods to detect relevant events, and measure durations, frequencies, amplitudes etc. as appropriate.

Page 27: Automating Phonetic Measurements

Architecture:

At each identified stop region (as localized by forced alignment):

1. We create a time-series of acoustic feature vectors, extracted every ms.

2. The decision function of a max-margin stop-burst detector is computed for every frame. If the highest peak is non-negative, its location is taken as the start of the burst (tb).

3. The decision function of a max-margin voicing onset detector is computed for every frame following the burst, and the highest non-negative peak is taken to be the onset of voicing (tv).

4. If either the burst onset detector or the voicing onset detector fails to find a non-negative peak, VOT measurement fails for this region. Otherwise, the VOT is recorded as tv-tb.
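The peak-picking logic of steps 2-4 can be sketched as follows. Here `burst_scores` and `voicing_scores` are hypothetical stand-ins for the per-frame decision-function values produced by the two trained detectors:

```python
import numpy as np

def measure_vot(burst_scores, voicing_scores, frame_ms=1.0):
    """Peak-picking logic of steps 2-4 above. Both inputs hold one
    detector decision-function value per 1 ms frame."""
    burst_scores = np.asarray(burst_scores)
    voicing_scores = np.asarray(voicing_scores)
    # Step 2: the highest peak of the burst detector must be non-negative
    tb = int(np.argmax(burst_scores))
    if burst_scores[tb] < 0:
        return None                       # no burst found: measurement fails
    # Step 3: highest non-negative peak after the burst
    post = voicing_scores[tb + 1:]
    if post.size == 0 or post.max() < 0:
        return None                       # no voicing onset found
    tv = tb + 1 + int(np.argmax(post))
    # Step 4: VOT in milliseconds
    return (tv - tb) * frame_ms
```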

Page 28: Automating Phonetic Measurements

Feature extraction:

Five acoustic features are calculated every ms from the short-time spectrum, computed over a 5 ms Gaussian window:

1. log Et(t) – locally-normalized total energy (0-8 kHz)
2. log El(t) – locally-normalized low-frequency energy (0-500 Hz)
3. log Eh(t) – locally-normalized high-frequency energy (3000-8000 Hz)
4. H(t) – spectral entropy
5. C(t) – spectral centroid

Each of these 5 time series, along with the first and second differences of the energy measures, is mapped to a scale-space representation by convolution with 21 Gaussian kernels with σ² = {0, 0.5, …, 10} ms.
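A sketch of that scale-space mapping in Python. The Gaussian smoothing is implemented directly with np.convolve, and the kernel truncation and boundary handling are simplifying assumptions:

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def scale_space(series, sigma2_list, frame_ms=1.0):
    """Map a 1-D feature series (one value per ms) to its scale-space
    representation by convolving with Gaussian kernels of increasing
    width; sigma^2 = 0 leaves the series unsmoothed."""
    rows = []
    for s2 in sigma2_list:
        if s2 == 0:
            rows.append(np.asarray(series, dtype=float))
        else:
            sigma = np.sqrt(s2) / frame_ms
            k = gaussian_kernel(sigma, radius=int(4 * sigma) + 1)
            rows.append(np.convolve(series, k, mode="same"))
    return np.stack(rows)                 # shape: (n_scales, n_frames)

sigmas = np.arange(0, 10.5, 0.5)          # the 21 sigma^2 values from the slide
```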

Page 29: Automating Phonetic Measurements

Burst detector:

We use the scale-space representations of overall energy, high-frequency energy, the first and second derivatives of the two energy measures, and spectral entropy.

This yields a 7*21 = 147 dimensional vector for every 1-ms frame.

These 147-dimensional vectors are then projected into an 800-dimensional space using an approximate RBF kernel (Rahimi & Recht, “Random features for large-scale kernel machines”, NIPS 2007), and fed to a max-margin classifier trained using Stochastic Gradient Descent on all voiceless stops in the TIMIT training set preceding a voiced segment.

(All this is more or less standard current machine-learning practice….)
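As a rough illustration of this standard pipeline, scikit-learn's RBFSampler (an implementation of the Rahimi & Recht random-features approximation) can be chained with a hinge-loss SGDClassifier as the max-margin classifier. The data below is random stand-in data, not TIMIT features, and the kernel width (gamma) would of course need tuning on real features:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler   # Rahimi & Recht features
from sklearn.linear_model import SGDClassifier        # hinge loss = max-margin
from sklearn.pipeline import make_pipeline

# Random stand-in data: in the real system X would hold 147-dimensional
# scale-space feature vectors, with y = 1 for burst-onset frames.
rng = np.random.RandomState(0)
X = rng.randn(1000, 147)
y = (X[:, 0] + 0.1 * rng.randn(1000) > 0).astype(int)

detector = make_pipeline(
    RBFSampler(n_components=800, random_state=0),  # approximate RBF kernel
    SGDClassifier(loss="hinge", random_state=0),   # max-margin, trained by SGD
)
detector.fit(X, y)
scores = detector.decision_function(X)             # one decision value per frame
```

The per-frame values in `scores` are exactly what the peak-picking step of the architecture operates on.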

Page 30: Automating Phonetic Measurements

Voicing onset detector:

We use the scale-space representation of total energy, low-frequency energy, and spectral center of mass, along with their first and second derivatives.

This yields a 9*21 = 189-dimensional vector for each speech frame.

These 189-dimensional vectors are then projected into an 800-dimensional space using an approximate RBF kernel (Rahimi & Recht, “Random features for large-scale kernel machines”, NIPS 2007), and fed to a max-margin classifier trained using Stochastic Gradient Descent on all voiceless stops in the TIMIT training set preceding a voiced segment.

Page 31: Automating Phonetic Measurements

Results on TIMIT:

For all word-initial voiceless stops in the TIMIT test set, compared to VOT measurements from TIMIT segmentation points, the algorithm yields a mean absolute difference of 4.67 msec., with 85.4% within 10 msec.

For all voiceless stops in the TIMIT test set, the result is a mean absolute difference of 4.74 msec., with 85.7% within 10 msec.
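Agreement statistics of this kind are simple to compute once automatic and reference measurements are paired up; a sketch (the values in the test are made up for illustration, not the TIMIT results):

```python
import numpy as np

def agreement(auto_ms, ref_ms, tol_ms=10.0):
    """Mean absolute difference and the fraction of measurements within
    a tolerance -- the two agreement statistics reported above."""
    d = np.abs(np.asarray(auto_ms, dtype=float) - np.asarray(ref_ms, dtype=float))
    return d.mean(), (d <= tol_ms).mean()
```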

Page 32: Automating Phonetic Measurements

On a dataset from Sheila Blumstein’s lab at Brown (including both voiced and voiceless stops), where all stops are word-initial, followed by a stressed vowel, and preceded by an unstressed vowel:

Compared to careful annotation by a human phonetician, the mean absolute difference was 1.9 msec., the standard deviation of the differences was 1.7 msec., with 100% of the measurements within 10 msec.

When two human phoneticians were compared, the mean absolute difference was 1.5 msec., standard deviation of differences 2.5 msec.

Page 33: Automating Phonetic Measurements

Page 34: Automating Phonetic Measurements

Page 35: Automating Phonetic Measurements

Blumstein/Fox data:

Page 36: Automating Phonetic Measurements

TIMIT (all voiceless stops):

Page 37: Automating Phonetic Measurements

Claim:

For most well-defined phonetic detection, classification, and measurement tasks,

judicious choice of acoustic features and application of modern machine-learning techniques

will produce an automated annotation system that agrees with human annotators about as well as they agree with one another.

Page 38: Automating Phonetic Measurements

Furthermore…

It should be possible to create a measurement tool kit

that can be adapted to automate the range of measurements that phoneticians have become used to making

and also to add new measures for phonetic dimensions now without good quantitative acoustic correlates

Page 39: Automating Phonetic Measurements

An offer….

If you have a phonetic measurement problem, with a reasonable amount of training data (i.e. human annotation) and a large body of un-annotated material,

let me know, and maybe we can help.