
Page 1: Automating Phonetic Measurements

Automating Phonetic Measurements

Mark Liberman
http://ling.upenn.edu/~myl

Page 2: Automating Phonetic Measurements

Groningen Quantitative Linguistics 2

ABSTRACT:

For a century and a half, phoneticians have been measuring durations, frequencies, and other physical properties of speech sounds. At first they used purely mechanical devices; then they used electro-mechanical devices; and now they use programs running on general-purpose computers -- but in nearly all cases, each measurement still involves an act of human judgment. This adds a significant amount of labor to the cost-benefit analysis implicit in experimental design; and this in turn has tended to hold the field back, especially in taking advantage of the millions of hours of speech now becoming available.

There were some successful examples of large-scale automated phonetic measurement in the 1970s, and the relevant technology has improved enormously since then. However, it remains very rare for research in speech science to use automated measurement techniques, despite the obvious benefits of vastly increased productivity, potential access to large-scale naturalistic data, and reduction in the effects of experimenter bias. This talk will attempt to identify the reasons for this conservatism, and to discuss the prospects for overcoming them.

6/29/2012

Page 3: Automating Phonetic Measurements

Claim:

For most well-defined phonetic detection, classification, and measurement tasks,

judicious choice of acoustic features and application of modern machine-learning techniques

will produce an automated annotation system that agrees with human annotators about as well as they agree with one another.

Page 4: Automating Phonetic Measurements

Opportunities:

• Automating old-fashioned measurements
  – Formants
  – Segment duration, event timing (e.g. VOT)
  – Variables like “g-dropping”, “t/d deletion”
  – Pronunciation variation in general
  … but on 1K-10K hours instead of 1-10 hours

• Creating new measurements [optional appendices …]
  – Modeling vowel and tone contours
  – /l/ allophony
  – Voice quality, etc.

• Automating experiments [not discussed today]

Page 5: Automating Phonetic Measurements

BUT…

Today’s published phonetics research rarely uses automated measurement techniques.

This talk will attempt to show that such techniques are possible, and also to explain why they are not widely used.

…and how this might change.

Page 6: Automating Phonetic Measurements

An interesting paper that uses automatic timing measurements:

Saul Sternberg et al., “The Latency and Duration of Rapid Movement Sequences”, in G.E. Stelmach, Ed., Information Processing in Motor Control and Learning

“These studies began with two accidental findings […]. The first was that the number of words in a brief rapid utterance influenced the time to initiate the utterance, even though the talker knew what he would have to say well in advance of the reaction signal. This finding seemed surprising, particularly in view of the claim based on previous studies […] that the latency (or reaction time) for saying a single word, known in advance, is not affected by the number of syllables it contains. The second finding was that the functions relating the duration of these rapid utterances to the number of words they contained were concave upward rather than being linear, indicating that words in longer sequences were produced at slower rates.”

“Using an energy-sensitive speech detector with a low threshold, we made two measures of each response: its latency, measured from signal onset to the start of the utterance, and its duration. “
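A minimal sketch of this kind of energy-threshold detector, in Python. The 10 ms frame size and the threshold value are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def latency_and_duration(signal, sr, threshold):
    """Energy-threshold speech detector in the spirit of the setup
    quoted above. Returns (latency, duration) in seconds, or None
    if no frame exceeds the threshold."""
    frame = int(0.010 * sr)              # 10 ms frames (an assumption)
    n = len(signal) // frame
    energy = np.array([np.mean(np.abs(signal[i * frame:(i + 1) * frame]))
                       for i in range(n)])
    above = np.where(energy > threshold)[0]
    if len(above) == 0:
        return None                      # no speech detected on this trial
    latency = above[0] * frame / sr      # signal onset to start of utterance
    offset = (above[-1] + 1) * frame / sr
    return latency, offset - latency     # latency and utterance duration
```

In practice the threshold has to sit above the recording's noise floor but low enough to catch weak onsets; that trade-off is exactly why the paper calls it a "low threshold" detector.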

Page 7: Automating Phonetic Measurements

Page 8: Automating Phonetic Measurements

Random (open circles) vs. fixed (filled circles) fore-periods

Page 9: Automating Phonetic Measurements

Another interesting paper from the same group:

Saul Sternberg et al., “Motor Programs in Rapid Speech: Additional Evidence”, in Cole, Ed., Perception and Production of Fluent Speech.

[W]e report measurements that permit us to segment rapid utterances into their component words, and thereby to specify how the time interval from one word to the next depends on the serial position of the word pair within an utterance, as well as on utterance length. […] [T]he resulting serial-position functions […] provide further evidence for the dependence of local features throughout the utterance on a representation (“motor program”) of the entire utterance.

… [W]e report a first attempt to localize, within the time interval from the start of one response unit to the start of the next, the effects of utterance length and serial position; these measurements then permit relatively clean tests of two implications drawn from our model.

Page 10: Automating Phonetic Measurements

“The experiment described in Section V was designed to permit using the amplitude envelope of each utterance to determine the mean intervals between successive response units: We used words beginning with stop consonants and we balanced word identities across serial positions. However, identification of the “beginning” of the execution of a response unit embedded in continuous speech presents difficulties of definition as well as measurement, and such an effort depends on relatively arbitrary decisions.

[…] In attempting to deal with these difficulties we made two assumptions. First, we assumed that each monosyllabic word is a response unit; second we assumed that the occurrence time of an event relatively early in the word (the transition between the stop consonant and the vowel …) would adequately measure (on the average, and to within an additive constant) the starting time of the command stage for the unit in that serial position.”

Page 11: Automating Phonetic Measurements

“We digitized subjects’ speech at 10kHz and determined the mean absolute sample value in adjacent 10-msec intervals. Linear interpolation between successive means then defined an amplitude envelope that usually showed a single peak for each word in the utterance and a single trough between each word and the next.

On 86% of the correct trials we were able to choose a criterion amplitude value that intersected the ascending flanks of exactly n amplitude peaks. […] The time differences between successive intersections then defined a series of n-1 intervals between the ‘beginnings’ of successive words.”
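The envelope-and-criterion procedure quoted above can be sketched as follows. This is a simplified Python reconstruction, not the original code; the criterion value and the trial-rejection rule are assumptions drawn from the quoted description:

```python
import numpy as np

def interword_intervals(signal, sr, criterion, n_words):
    """Sketch of the envelope method quoted above: mean absolute sample
    value in 10 ms bins, then the times at which the envelope rises
    through a criterion amplitude define the 'beginnings' of words."""
    bin_len = int(0.010 * sr)
    n_bins = len(signal) // bin_len
    env = np.array([np.mean(np.abs(signal[i * bin_len:(i + 1) * bin_len]))
                    for i in range(n_bins)])
    # Ascending-flank crossings of the criterion level
    rising = np.where((env[:-1] < criterion) & (env[1:] >= criterion))[0]
    if len(rising) != n_words:
        return None                       # reject the trial (as on 14% of trials)
    times = (rising + 1) * bin_len / sr   # crossing times, in seconds
    return np.diff(times)                 # the n-1 interword intervals
```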

Page 12: Automating Phonetic Measurements

Mean interword interval for each serial position in lists of monosyllabic words of lengths n=2,3,4,5. Graph (a) shows homogeneous lists, (b) heterogeneous lists.

Page 13: Automating Phonetic Measurements

Why did Sternberg et al. use automated (proxy) measurements rather than hand measurements from waveforms and spectrograms?

1. Accuracy

2. Freedom from experimenter bias

3. Efficiency: Across all their experiments and controls, they ran ~100 subjects, ~1,000 trials per subject

4. And also, …

Page 14: Automating Phonetic Measurements

Something I left out – the dates of publication:

Saul Sternberg et al., “The Latency and Duration of Rapid Movement Sequences”, in G.E. Stelmach, Ed., Information Processing in Motor Control and Learning, 1978.

Saul Sternberg et al., “Motor Programs in Rapid Speech: Additional Evidence”, in R. Cole, Ed., Perception and Production of Fluent Speech, 1980.

Page 15: Automating Phonetic Measurements

Flash forward 32 years…

This year’s Journal of Phonetics:
• Four issues so far (January, March, May, July)
• 45 articles

+ 3 online in advance of publication = 48 articles

• 33 with selection, classification, measurement of time points or regions in acoustic, articulatory or physiological recordings

(others involve only perception, symbol-manipulation or philosophy…)


Page 16: Automating Phonetic Measurements

Use of automated measures?

Depending on how you count, either 0 or 1 out of the 33 papers making production measurements used automatic detection or classification methods

And special note should be taken of
C. Mooshammer et al., “Bridging planning and execution: Temporal planning of syllables”, J. Phonetics 40 (May 2012)

This paper used a variant of the Sternberg “delayed naming” paradigm,

BUT . . .


Page 17: Automating Phonetic Measurements

[W]e employed the delayed naming paradigm (see Kawamoto et al., 2008; Rastle et al., 2005; Sternberg et al., 1988). This was implemented using custom software that controlled presentation of each stimulus on a computer screen as follows: first, subjects were shown the target word with a ‘‘Get ready, say ‘uhhh’’’ prompt, their cue to read silently, inhale, and produce schwa. […] Next, at a randomized delay varying between 1 and 2 s, the screen changed color, the prompt changed to ‘‘Go,’’ and an audible beep was emitted, providing the cue to the subject to produce the target word as quickly as possible.

The stimulus onset was determined by automatic detection of the maximal amplitude of the beep. This signal was a sinusoid of 500 Hz, amplitude filtered using a triangular window with a duration of 90 ms. The onset of the speech signal was labeled manually at the release burst for items beginning with stops, at the onset of frication for initial fricatives, and at the end of the transition from the preceding schwa to a lateral.

Page 18: Automating Phonetic Measurements

One paper used forced alignment:

Barbara Schuppler et al., “How linguistic and probabilistic properties of a word affect the realization of its final /t/: Studies at the phonemic and sub-phonemic level”, J. Phonetics 40(4), July 2012

uses “forced alignment” to choose among pronunciation variants

But then…


Page 19: Automating Phonetic Measurements

Schuppler et al. 2012, the abstract:

This paper investigates the realization of word-final /t/ in conversational standard Dutch. First, based on a large number of word tokens (6747) annotated with broad phonetic transcription by an automatic transcription tool, we show that morphological properties of the words and their position in the utterance’s syntactic structure play a role for the presence versus absence of their final /t/. We also replicate earlier findings on the role of predictability (word frequency and bigram frequency with the following word) and provide a detailed analysis of the role of segmental context.

Second, we analyze the detailed acoustic properties of word-final /t/ on the basis of a smaller number of tokens (486) which were annotated manually. [emphasis added]

Page 20: Automating Phonetic Measurements

Schuppler et al. 2012 -- From the body of the paper:

An ASR system was used to create a broad phonetic transcription for the ECSD. Automatic transcriptions have the advantage that they are consistent and can be more easily obtained than manual transcriptions for large data sets. We used the so-called forced alignment for creating the broad phonetic transcriptions. Input for the forced alignment were the speech files, the orthographic transcriptions of these files, a pronunciation lexicon of the words in these transcriptions, and acoustic models for each phone that had been trained beforehand. First, the words from the orthographic transcriptions were looked up in a pronunciation lexicon containing multiple pronunciation variants per word. Then, given the acoustic signal and the acoustic phone models, the ASR system chose the pronunciation variant that matched best with the speech signal.

Page 21: Automating Phonetic Measurements

Schuppler et al. 2012, continuing:

The automatically generated broad phonetic transcriptions used in Study I treat the signal as if it consists of beads on a string, with each bead representing a single, clearly realized phone. As a consequence, pronunciation variation could only be captured as phone substitutions, insertions, or deletions. However, phonetic reality is more complex. Especially speech of an informal speaking style, like our material, may show realizations resulting from articulatory overlap with neighboring segments. The goal of Study II is to give a detailed analysis of different phonetic properties of /t/, which provides better insight into how reduction is reflected in terms of sub-phonemic properties.

The phonetic analysis was carried out manually by two trained phoneticians, both native speakers of Dutch. They scored the tokens for a set of sub-phonemic properties, based on analytic listening combined with inspection of the waveforms and spectrograms. [...] In case of disagreement, the labelers inspected the signal together to arrive at a consensus judgment.

Page 22: Automating Phonetic Measurements

Schuppler et al. 2012 – basis of the fine phonetic classification:

Canonical /t/ is realized with complete closure. The labelers first determined whether a constriction was present or not. If present, it was classified as (a) complete, (b) realized with friction (i.e., weak alveolar friction partially or completely replacing canonical complete closure [...]), (c) with nasal friction (weak but audible nasal friction replacing complete closure), or (d) with nasal murmur, caused by a preceding nasal consonant (similar to the manifestation of a regular nasal consonant, but with a lower amplitude).

In the next step, the constriction was classified as voiced or unvoiced. Voiced constrictions are characterized by periodicity of relatively strong amplitude that contributes to a segment being perceived as voiced, whereas unvoiced constrictions do not have any periodicity or only contain periodicity of rapidly decreasing amplitude after a voiced segment.

Page 23: Automating Phonetic Measurements

Schuppler et al. 2012, fine phonetic classification cont’d:

Next, the burst was classified as present or absent. If present, it was specified whether there was one burst or multiple bursts. We classified a burst as ‘multiple bursts’ if there were two or more release impulses that are distinct from the friction noise of the next segment by short duration and relatively strong intensity. We classified a burst as ‘single burst’ if there was one short impulse, separated from the friction noise of the next segment.

In addition, bursts were labeled as strong or weak, where weak bursts were characterized by extremely short durations and with energy in only part of the spectrum. All burst labels were based on the bursts’ acoustic representations in the spectrograms.

Page 24: Automating Phonetic Measurements

Why does this matter?

These are terrific papers! So who cares how the measurements are made?

1. Human judgment = wasted time and effort; slower progress of research; possible problems of confirmation bias and other sorts of experimenter error

2. Lost opportunities to do very-large-scale research, especially on “found data” – but also large-scale laboratory speech collections

But is there any alternative?

Page 25: Automating Phonetic Measurements

Claim, again:

For most well-defined phonetic detection, classification, and measurement tasks,

judicious choice of acoustic features and application of modern machine-learning techniques

will produce an automated annotation system that agrees with human annotators about as well as they agree with one another.

Page 26: Automating Phonetic Measurements

Example: VOT measurement for voiceless stops

(Work in progress by Neville Ryant at Penn; supported by NSF grant 0964556, P.I. Jiahong Yuan, “New Tools and Methods for Very Large Scale Phonetics Research”)

Concept:

1. Use forced alignment to find the approximate region for a measurement

2. Use machine-learning methods to classify the region:
…what was the pronunciation like in this particular case?
…should we be making a measurement at all?
…what should we measure, and how?

3. Given the classification, use machine-learning methods to detect relevant events, and measure durations, frequencies, amplitudes etc. as appropriate.

Page 27: Automating Phonetic Measurements

Architecture:

At each identified stop region (as localized by forced alignment):

1. We create a time-series of acoustic feature vectors, extracted every ms.

2. The decision function of a max-margin stop-burst detector is computed for every frame. If the highest peak is non-negative, its location is taken as the start of the burst (tb).

3. The decision function of a max-margin voicing onset detector is computed for every frame following the burst, and the highest non-negative peak is taken to be the onset of voicing (tv).

4. If either the burst onset detector or the voicing onset detector fails to find a non-negative peak, VOT measurement fails for this region. Otherwise, the VOT is recorded as tv-tb.
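The peak-picking logic of steps 2-4 can be sketched as follows. Here `burst_scores` and `voicing_scores` are hypothetical stand-ins for the per-frame decision-function values produced by the two trained detectors:

```python
import numpy as np

def measure_vot(burst_scores, voicing_scores, frame_ms=1.0):
    """Peak-picking logic of steps 2-4 above. Both inputs hold one
    detector decision-function value per 1 ms frame."""
    burst_scores = np.asarray(burst_scores)
    voicing_scores = np.asarray(voicing_scores)
    # Step 2: the highest peak of the burst detector must be non-negative
    tb = int(np.argmax(burst_scores))
    if burst_scores[tb] < 0:
        return None                       # no burst found: measurement fails
    # Step 3: highest non-negative peak after the burst
    post = voicing_scores[tb + 1:]
    if post.size == 0 or post.max() < 0:
        return None                       # no voicing onset found
    tv = tb + 1 + int(np.argmax(post))
    # Step 4: VOT in milliseconds
    return (tv - tb) * frame_ms
```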

Page 28: Automating Phonetic Measurements

Feature extraction:

Five acoustic features are calculated every ms from the short-time spectrum, computed over a 5 ms Gaussian window:

1. log Et(t) – locally-normalized total energy (0-8 kHz)
2. log El(t) – locally-normalized low-frequency energy (0-500 Hz)
3. log Eh(t) – locally-normalized high-frequency energy (3000-8000 Hz)
4. H(t) – spectral entropy
5. C(t) – spectral centroid

Each of these 5 time series, along with the first and second differences of the energy measures, is mapped to a scale-space representation by convolution with 21 Gaussian kernels with σ² = {0, 0.5, …, 10} ms.
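A sketch of that scale-space mapping in Python. The Gaussian smoothing is implemented directly with np.convolve, and the kernel truncation and boundary handling are simplifying assumptions:

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def scale_space(series, sigma2_list, frame_ms=1.0):
    """Map a 1-D feature series (one value per ms) to its scale-space
    representation by convolving with Gaussian kernels of increasing
    width; sigma^2 = 0 leaves the series unsmoothed."""
    rows = []
    for s2 in sigma2_list:
        if s2 == 0:
            rows.append(np.asarray(series, dtype=float))
        else:
            sigma = np.sqrt(s2) / frame_ms
            k = gaussian_kernel(sigma, radius=int(4 * sigma) + 1)
            rows.append(np.convolve(series, k, mode="same"))
    return np.stack(rows)                 # shape: (n_scales, n_frames)

sigmas = np.arange(0, 10.5, 0.5)          # the 21 sigma^2 values from the slide
```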

Page 29: Automating Phonetic Measurements

Burst detector:

We use the scale-space representations of overall energy, high-frequency energy, the first and second derivatives of the two energy measures, and spectral entropy.

This yields a 7*21 = 147 dimensional vector for every 1-ms frame.

These 147-dimensional vectors are then projected into an 800-dimensional space using an approximate RBF kernel (Rahimi & Recht, “Random features for large-scale kernel machines”, NIPS 2007), and fed to a max-margin classifier trained using Stochastic Gradient Descent on all voiceless stops in the TIMIT training set preceding a voiced segment.

(All this is more or less standard current machine-learning practice….)
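As a rough illustration of this standard pipeline, scikit-learn's RBFSampler (an implementation of the Rahimi & Recht random-features approximation) can be chained with a hinge-loss SGDClassifier as the max-margin classifier. The data below is random stand-in data, not TIMIT features, and the kernel width (gamma) would of course need tuning on real features:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler   # Rahimi & Recht features
from sklearn.linear_model import SGDClassifier        # hinge loss = max-margin
from sklearn.pipeline import make_pipeline

# Random stand-in data: in the real system X would hold 147-dimensional
# scale-space feature vectors, with y = 1 for burst-onset frames.
rng = np.random.RandomState(0)
X = rng.randn(1000, 147)
y = (X[:, 0] + 0.1 * rng.randn(1000) > 0).astype(int)

detector = make_pipeline(
    RBFSampler(n_components=800, random_state=0),  # approximate RBF kernel
    SGDClassifier(loss="hinge", random_state=0),   # max-margin, trained by SGD
)
detector.fit(X, y)
scores = detector.decision_function(X)             # one decision value per frame
```

The per-frame values in `scores` are exactly what the peak-picking step of the architecture operates on.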

Page 30: Automating Phonetic Measurements

Voicing onset detector:

We use the scale-space representation of total energy, low-frequency energy, and spectral center of mass, along with their first and second derivatives.

This yields a 9*21 = 189-dimensional vector for each speech frame.

These 189-dimensional vectors are then projected into an 800-dimensional space using an approximate RBF kernel (Rahimi & Recht, “Random features for large-scale kernel machines”, NIPS 2007), and fed to a max-margin classifier trained using Stochastic Gradient Descent on all voiceless stops in the TIMIT training set preceding a voiced segment.

Page 31: Automating Phonetic Measurements

Results on TIMIT:

For all word-initial voiceless stops in the TIMIT test set, compared to VOT measurements from TIMIT segmentation points, the algorithm yields a mean absolute difference of 4.67 msec., with 85.4% within 10 msec.

For all voiceless stops in the TIMIT test set, the result is a mean absolute difference of 4.74 msec., with 85.7% within 10 msec.
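Agreement statistics of this kind are simple to compute once automatic and reference measurements are paired up; a sketch (the values in the test are made up for illustration, not the TIMIT results):

```python
import numpy as np

def agreement(auto_ms, ref_ms, tol_ms=10.0):
    """Mean absolute difference and the fraction of measurements within
    a tolerance -- the two agreement statistics reported above."""
    d = np.abs(np.asarray(auto_ms, dtype=float) - np.asarray(ref_ms, dtype=float))
    return d.mean(), (d <= tol_ms).mean()
```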

Page 32: Automating Phonetic Measurements

On a dataset from Sheila Blumstein’s lab at Brown (including both voiced and voiceless stops), where all stops are word-initial, followed by a stressed vowel, and preceded by an unstressed vowel:

Compared to careful annotation by a human phonetician, the mean absolute difference was 1.9 msec., the standard deviation of the differences was 1.7 msec., with 100% of the measurements within 10 msec.

When two human phoneticians were compared, the mean absolute difference was 1.5 msec., standard deviation of differences 2.5 msec.

Page 33: Automating Phonetic Measurements

Page 34: Automating Phonetic Measurements

Page 35: Automating Phonetic Measurements

Blumstein/Fox data:

Page 36: Automating Phonetic Measurements

TIMIT (all voiceless stops):

Page 37: Automating Phonetic Measurements

Claim:

For most well-defined phonetic detection, classification, and measurement tasks,

judicious choice of acoustic features and application of modern machine-learning techniques

will produce an automated annotation system that agrees with human annotators about as well as they agree with one another.

Page 38: Automating Phonetic Measurements

Furthermore…

It should be possible to create a measurement tool kit

that can be adapted to automate the range of measurements that phoneticians have become used to making

and also to add new measures for phonetic dimensions now without good quantitative acoustic correlates

Page 39: Automating Phonetic Measurements

An offer….

If you have a phonetic measurement problem, with a reasonable amount of training data (i.e. human annotation) and a large body of un-annotated material,

let me know, and maybe we can help.