Spectral factors in the perception of vowel quantity in Icelandic



Scandinavian Journal of Psychology, 1996, 37, 121-131

Spectral factors in the perception of vowel quantity in Icelandic

JÖRGEN PIND
Faculty of Social Sciences, University of Iceland, Reykjavik, Iceland

Pind, J. (1996). Spectral factors in the perception of vowel quantity in Icelandic. Scandinavian Journal of Psychology, 37, 121-131.

Previous research has shown that the ratio of vowel to rhyme (vowel + consonant) duration is a major cue for quantity in Icelandic. In particular it serves as a higher-order invariant which enables the listener to disentangle those durational transformations of the speech signal which are “extrinsic” (e.g. due to changes in speaking rate) from those which are “intrinsic” to the phonemic message, involving a change of phonemic quantity. Previous research has been based on speech segment contrasts which are purely durational, involving vowels with a uniform spectrum whether phonemically long or short, such as [a] or [ɪ]. This paper looks at the role of spectral factors in vowels which are spectrally dissimilar in their long and short varieties. It is shown that in these cases the spectral differences can be sufficiently great to override the previously established relational invariant for quantity. The implications of this finding for a model of quantity perception are discussed.

Key words: Speech perception, higher-order invariants, vowel quantity, Icelandic.

J. Pind, Faculty of Social Sciences, University of Iceland, Oddi, 101 Reykjavik, Iceland.

The investigation of vowel perception has a long history in perceptual research and can be traced back at least to Helmholtz’s classic work on tone perception (Jenkins, 1987). Interest has naturally enough focused primarily on the role of spectral properties, especially of formants, i.e. the vocal tract resonances, in the perception of vowel quality (Strange, 1989).

Vowels, however, are not distinguished only by their spectral properties, since many languages make a distinction between phonemically long and short vowels. Thus in Icelandic we find a contrasting word pair such as man [ma:n] ‘woman (poetic)’ and mann [man:] ‘man, accusative singular’. The quantity opposition in Icelandic, occurring always in the first syllable of non-compound words, involves both vowels and consonants and can be simply described: a vowel is long if followed by one or no consonant, otherwise it is short. An exception to this rule is that a vowel followed by one of p, t, k, s + v, j, r is long. Further examples are words such as ís [i:s] ‘ice’ vs. íss [is:] ‘ice, genitive singular’ and flysja [flɪ:sja] ‘to peel’ vs. flissa [flɪs:a] ‘to giggle’. Note that when the vowel is followed by two identical consonants they are pronounced as a single long consonant. Such cases show complementary opposition of long and short segments: a long vowel is followed by a short consonant and vice versa.

Previous research (Pind, 1986; 1995) on the perception of quantity in Icelandic has shown that the complementary nature of the phonological opposition is reflected in a higher-order invariant expressed through a ratio of vowel to rhyme duration, V/(V + C), which functions as the major temporal cue for quantity in Icelandic.¹

¹ In this paper the letters V and C will be used to refer to vowels and consonants respectively, as is customary in the phonetic literature. A colon is used to signify length; thus a syllable of the type V:C has a (phonemically) long vowel followed by a (phonemically) short consonant, whereas a VC: syllable has a short vowel followed by a long consonant. The term rhyme as used in this paper refers to the part of the syllable consisting of the vowel and following consonant, i.e. the syllable nucleus and coda, less any syllable-initial consonants.

© 1996 Scandinavian University Press. ISSN 0036-5564


[Figure 1: schematic of a V:C syllable under two transformations, labelled “Rate change” (V:C to V:C) and “Phonemic change” (V:C to VC:).]

Fig. 1. The durations of speech sounds are affected by two different transformations. One involves changes in speaking rate, the other a change of phonemic makeup, changing a long vowel to a short vowel and vice versa for the following consonant. Both transformations affect the durations of individual speech segments. A higher-order invariant of vowel to rhyme duration, V/(V + C), is unaffected by the rate transformation but highly sensitive to a change in the phonemic makeup of the syllable. This ratio has been shown to be a major cue for quantity in Icelandic.

Temporal speech cues of the kind explored in this paper pose some interesting challenges for theories of speech perception (Summerfield, 1981; Pind, 1986; Miller, 1987). One issue in particular is of interest here, namely how the listener is able to separate those temporal changes which are in some sense “extrinsic” to the speech cues, such as speaking rate, from those which are “intrinsic”, such as the change from a phonemically short to a phonemically long segment. Both changes affect the same acoustic properties of the speech signal, i.e. the durations of individual segments. If the perception of the quantity contrast is mediated through a higher-order invariant like the V/(V + C) ratio, the listener should in fact have no trouble distinguishing the intrinsic and extrinsic transformations, since these will have differential effects on the manifestation of the speech cues (cf. Fig. 1). An intrinsic transformation will change the V/(V + C) ratio whereas an extrinsic transformation will not; it will simply shorten the durations of individual speech sounds, keeping the ratio unchanged. Perceptual studies have revealed that this is in fact what happens in the perception of quantity in Icelandic (Pind, 1986; 1995).
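The contrast between the two transformations can be illustrated with a minimal sketch. The durations below are hypothetical round numbers, not measurements from this paper, and the rate change is idealised as a uniform scaling of both segments; the function names are likewise illustrative.

```python
def rhyme_ratio(vowel_ms, consonant_ms):
    """Vowel-to-rhyme duration ratio, V/(V + C)."""
    return vowel_ms / (vowel_ms + consonant_ms)


def change_rate(vowel_ms, consonant_ms, factor):
    """Idealised extrinsic change: scale both segments by the same factor."""
    return vowel_ms * factor, consonant_ms * factor


# Hypothetical durations in ms (assumed for illustration only).
long_vowel_syllable = (240.0, 100.0)   # V:C type
short_vowel_syllable = (125.0, 185.0)  # VC: type

ratio_slow = rhyme_ratio(*long_vowel_syllable)
ratio_fast = rhyme_ratio(*change_rate(*long_vowel_syllable, factor=0.8))
ratio_short = rhyme_ratio(*short_vowel_syllable)

print(f"V:C, original rate: {ratio_slow:.3f}")
print(f"V:C, faster rate:   {ratio_fast:.3f}")   # same ratio, shorter segments
print(f"VC:, original rate: {ratio_short:.3f}")  # clearly different ratio
```

Under these assumptions the rate change leaves the ratio untouched while the phonemic change moves it substantially, which is the property that makes the ratio usable as a cue.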

In discussing the V/(V + C) ratio as a higher-order invariant it is necessary to draw a distinction between a ratio for speech production and one for speech perception. The production ratio is not absolutely invariant under transformations of rate, since speaking rate has a greater effect on the phonemically long segment than on the short one. Thus with slower speaking rates the phonemically long member of a VC pair, be it the vowel or the consonant, will show greater lengthening than the phonemically short member (cf. Fig. 2), and the V/(V + C) ratio for production measurements will not stay invariant. Does this mean that production and perception “go their separate ways”? No, not at all. If an attempt is made, on the basis of production measurements, to define an “optimal boundary” separating the two categories of syllables, V:C and VC:, this boundary, which can also be expressed as a V/(V + C) ratio, will in fact be close to invariant, and perception has been shown to adhere closely to this particular V/(V + C) ratio. Perception and production are thus in step, which is quite different from the behavior of a speech cue such as Voice Onset Time under comparable transformations of rate (Pind, 1995). In the following the perceptual ratio (or, alternatively, the calculated optimal boundary) will be denoted by capital letters, V/(V + C), while the ratios based on measurements of vowel and consonant durations in individual word tokens will be denoted by lower case letters, v/(v + c).


[Figure 2: scatter plot of consonant duration against duration of vowel (ms) for the produced tokens, grouped by vowel and speaking rate, with + marking the stimulus series /mɛn/ slow.]

Fig. 2. Durations of vowels and consonants in the syllable rhymes measured in Experiment 1. Ten tokens were produced for each condition. Two clusters of data points emerge in the figure, one for the V:C type syllables, another for the VC: type syllables. The figure also shows the durations of vowels and consonants in one of the continua generated for the perception experiment. Note that the data points of the continuum lie on a line with a slope close to -1, reflecting a constant rhyme duration.

Though the V/(V + C) ratio is highly stable under transformations of rate, it presumably cannot account for all aspects of the perception of quantity in Icelandic, since other factors, previously unexplored, are also likely to play a significant role. One such factor is investigated in this paper, namely the effect of the spectral character of the vowel on the perception of quantity. Icelandic has eight vowel monophthongs [i, ɪ, ɛ, a, ɔ, œ, ʏ, u]. Studies of vowel formants in the monophthongs (e.g. Garnes, 1976; Pétursson, 1974) have revealed that there are some qualitative differences between the long and short vowels, especially for the central vowels [ɛ, ɔ, œ]. Thus Pétursson (1974) reports an F1 of 550 Hz and an F2 of 2400 Hz for the phonemically long [ɛ:]; the corresponding values for a short [ɛ] are 750 Hz and 2050 Hz. These differences are well beyond the jnd's of 3-5% for formant frequencies reported by Flanagan (1972). In Pétursson's data F1 differs by no less than 36%, F2 by 17%. These differences should thus be clearly audible. Similar differences are also shown for the vowels [ɔ] and [œ].

The question to be explored in this paper is how these spectral differences affect speech perception, and in particular how they will affect the previously established role of the higher-order relational cue of V/(V + C). Considering the magnitude of the spectral differences it will be hypothesized that they will in fact override the relational cue.

Hadding-Koch and Abramson (1964) showed in early tape-cutting experiments on Swedish, which has a quantity system similar to that of Icelandic (Elert, 1964), that in some cases the durational contrast was carried as much by the qualitative differences between the long and short phonemes, so that it proved impossible to shorten a long phoneme via tape-cutting to yield a good token of the corresponding short phoneme. Abramson and Ren (1990) have published results of a similar, but more extensive, investigation on the role of spectral and temporal factors in the perception of distinctive vowel length in Thai. Their study of five vowel pairs used computer-edited natural speech and involved both shortening of the long vowels for some stimulus series and lengthening of short vowels in other series. Their results show that both spectral and temporal factors influence the placement of the boundaries between long and short vowels, though duration emerged as the major cue.

In the case of Icelandic the qualitative differences of the three central vowels [ɛ, ɔ, œ] are so great that one would expect to find an effect similar to that found originally by Hadding-Koch and Abramson (1964), i.e. a strong effect of the spectral character on the perception of quantity, making it very difficult to, say, change a long [ɛ:] into a short one and vice versa using waveform editing. In that case the previously established findings on the role of a higher-order relational invariant in the perception of quantity in Icelandic need to be qualified by taking into account the role of spectral factors as well.

EXPERIMENT 1

In two previous experiments on the perception of quantity in Icelandic using natural edited speech (Pind, 1986) the same procedure was employed. A word containing a phonemically long vowel was used as the base stimulus. To obtain a continuum of vowel and consonant durations the long vowels were gradually shortened (by removing pitch periods) and the consonants lengthened by the same amount. In these experiments the words contained the vowels [a] and [ɪ], vowels which are qualitatively quite similar whether long or short. Thus it is to be expected that this procedure would yield similar results whether the stimulus continua are made by shortening a phonemically long vowel (and lengthening the consonant) or lengthening a short vowel (and in that case shortening the consonant).

The data for the vowel formants reviewed previously would lead one to doubt that this procedure would yield comparable results for all vowels. In particular a different outcome is to be expected in the case of the vowels [ɛ, ɔ, œ], where qualitative differences between the long and short phonemes are appreciable. In this experiment one of these vowels, [ɛ], will be compared with a vowel, [a], which has nearly the same formant frequencies in the long and short phonemes.

The experiment consists of two parts. The first is a short production study in which formant frequencies and segment durations are measured for a number of words. On the basis of this production study, words are chosen to form the base stimuli, which are then manipulated using waveform editing techniques to yield the stimulus continua used in the second part, the perceptual experiment.

Production study

In this study two contrasting pairs of words were used: man [ma:n] ‘woman (poetic)’ vs. mann [man:] ‘man, accusative singular’ and men [mɛ:n] ‘necklace’ vs. menn [mɛn:] ‘men, accusative plural’. The author read ten tokens of each word at normal and fast utterance rates for a total of 80 tokens. These tokens were subsequently transferred to a PC (at a 10 kHz sampling rate) for measurements of the durations of individual segments as well as the vowel formants. All these measurements were carried out using the Sensimetrics SpeechStation™.

Figure 2 shows the results of the measurements of vowel and consonant durations, exhibiting the typical pattern of complementary durations of vowels and consonants. Two clusters emerge in the figure, one for the syllable type V:C, the other for the syllable type VC:. In the V:C type words the vowel averages 238 ms and the following consonant 97 ms. In the VC: type syllables these durations are 125 and 184 ms respectively. The rhyme duration (V + C) thus averages 335 ms in the V:C type words and 309 ms in the VC: type words.
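As a worked check, the average durations just quoted can be turned into token-level ratios v/(v + c). These two ratios are computed here from the reported averages; they are not values stated by the author.

```latex
\[
\left.\frac{v}{v+c}\right|_{\mathrm{V{:}C}} = \frac{238}{238+97} = \frac{238}{335} \approx 0.71,
\qquad
\left.\frac{v}{v+c}\right|_{\mathrm{VC{:}}} = \frac{125}{125+184} = \frac{125}{309} \approx 0.40
\]
```

The two clusters therefore differ by roughly 0.3 in ratio even though their rhyme durations differ by only 26 ms.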

The first three vowel formants were measured, from a combined display of short-term FFT spectra and LPC spectra, at three points in each of the words: near the beginning, middle and end of the vowel. Table 1 shows the average formant values at each position in the vowels. Looking at the values in the middle of the vowel it is clear that there is very little difference between the formants of the long and short [a], while [ɛ] shows the expected differences, with F1 being 433 Hz on average in [ɛ:] and 512 Hz on average in [ɛ]. The corresponding values for F2 are 1906 Hz in [ɛ:] and 1600 Hz in [ɛ]. These differences amount to 18-19% and are thus somewhat less than the differences found by Pétursson (1974) as regards F1, but still much higher than the jnd's for formant frequencies.
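The percentages quoted for Pétursson's data and for the present measurements can be reproduced with a short calculation. The sketch below assumes that each difference is expressed relative to the lower of the two frequencies, which is the convention that matches the quoted figures; the function name and the 5% threshold (the upper end of the 3-5% jnd range) are choices made for this illustration.

```python
JND_PERCENT = 5.0  # upper end of the 3-5% jnd range cited from Flanagan (1972)


def percent_difference(freq_a_hz, freq_b_hz):
    """Difference between two frequencies as a percentage of the lower one (assumed convention)."""
    low, high = sorted((freq_a_hz, freq_b_hz))
    return (high - low) / low * 100.0


# (long vowel, short vowel) formant frequencies in Hz, as quoted in the text.
comparisons = {
    "Petursson F1": (550, 750),
    "Petursson F2": (2400, 2050),
    "Present study F1": (433, 512),
    "Present study F2": (1906, 1600),
}

results = {label: percent_difference(*pair) for label, pair in comparisons.items()}

for label, diff in results.items():
    print(f"{label}: {diff:.1f}% (exceeds jnd: {diff > JND_PERCENT})")
```

All four differences come out several times larger than the jnd, in line with the claim that the long and short [ɛ] should be clearly distinguishable by ear.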

Table 1. Average vowel formant frequencies (in Hz) measured at three points in the words read in Experiment 1. Long and short [a] have very similar formant patterns whereas long and short [ɛ] have different sound qualities.

                   Beginning           Middle              End
Vowel  Quantity    F1    F2    F3      F1    F2    F3      F1    F2    F3
[a]    long        795   1127  2508    804   1154  2514    750   1208  2523
[a]    short       794   1126  2502    806   1169  2481    688   1169  2398
[ɛ]    long        408   1906  2426    433   1904  2443    442   1553  2390
[ɛ]    short       496   1575  2395    512   1600  2423    477   1604  2448

Method

The following perception experiment is concerned with establishing the boundaries between long and short vowels under different conditions, making use of edited natural speech to generate the stimulus continua. From the previous discussion of vowel quantity in Icelandic it is clear that rhyme duration remains reasonably invariant at a given speaking rate despite changes in quantity. Because of this, an attempt is made to keep the rhyme duration constant in the stimulus continua generated for the listening tests. The generation of stimulus continua therefore involves two concurrent operations: stepwise lengthening of one segment of the syllable and shortening of the other, aiming to keep the rhyme duration constant. These durational manipulations are accomplished in steps corresponding to individual pitch periods of the segments (the vowel, e.g. [ɛ], or [n]), taking care to cut the waveform at zero-crossings so as to eliminate the risk of introducing spurious clicks into the stimuli. Since the editing steps correspond to the individual pitch periods of the segments, and the fundamental frequency varies slightly within the utterances, it is not to be expected that rhyme durations can be kept completely invariant.

In this experiment the base stimuli (from which the continua are generated) are either of the type V:C (man or men) or VC: (mann or menn). If it is of the former kind the waveform editing consists of shortening the vowel and lengthening the consonant in the stepwise manner described. If the base stimulus is of the type VC: the waveform editing consists of lengthening the vowel and shortening the consonant.
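The stepwise trade-off between vowel and consonant duration can be sketched as follows. The start durations (229 ms and 103 ms) and the number of stimuli (10) are taken from the V:C-slow [ɛ] row of Table 2; the pitch-period lengths of 11.5 ms and 10.0 ms are hypothetical constants assumed for illustration, whereas the real pitch periods varied slightly from step to step. The function name is also illustrative.

```python
def make_continuum(vowel_ms, consonant_ms, vowel_period_ms, consonant_period_ms, n_stimuli):
    """Shorten the vowel and lengthen the consonant by one pitch period per step.

    Returns a list of (vowel_ms, consonant_ms, v/(v + c)) tuples, starting from
    the unedited base stimulus. Pass negative period values to model a VC: base,
    where the vowel is lengthened and the consonant shortened.
    """
    stimuli = []
    for step in range(n_stimuli):
        v = vowel_ms - step * vowel_period_ms
        c = consonant_ms + step * consonant_period_ms
        stimuli.append((v, c, v / (v + c)))
    return stimuli


# Start values from Table 2; pitch periods are assumed, not measured.
continuum = make_continuum(229.0, 103.0,
                           vowel_period_ms=11.5,
                           consonant_period_ms=10.0,
                           n_stimuli=10)

for v, c, ratio in continuum:
    print(f"v = {v:6.1f} ms  c = {c:6.1f} ms  rhyme = {v + c:6.1f} ms  v/(v+c) = {ratio:.2f}")
```

Because the assumed vowel and consonant pitch periods differ, the rhyme duration drifts by a few milliseconds along the continuum, which mirrors the remark above that rhyme durations cannot be kept completely invariant.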

Based on the previous measurements, eight tokens were chosen as the endpoints for generation of the stimulus continua. Four of these continua contained the vowel [a], four [ɛ]. Within vowels the four continua were distinguished by rate and base word. Thus, in the case of [ɛ], the four continua were: 1) normal rate, men base word; 2) fast rate, men base word; 3) normal rate, menn base word; 4) fast rate, menn base word. The same holds for the words containing [a]. The base words were chosen by finding the token in each condition which was most “typical”, i.e. whose segment durations lay closest to the average values for all tokens in that condition.

Table 2 shows the durations of vowels and consonants in the base stimuli in each condition (“start values”), as well as the endpoint durations of vowels and consonants in each stimulus continuum (to the nearest whole ms) and the v/(v + c) ratios for each continuum. The stimulus continua were defined in such a way that they would cover approximately the v/(v + c) range 0.4-0.7. The word durations in the stimulus continua range from 322 to 461 ms. The average duration of the words was 353 ms at the fast rate and 427 ms at the normal rate, a ratio of 1:1.21.

Table 2 also gives F1 and F2 frequencies measured at a single point near the middle of the vowel in the stimulus with the longest vowel in each continuum. The final column of the table shows the number of stimuli constructed in each condition. Figure 2 above shows the location of the stimuli making up one of the continua of the present experiment, the slow [mɛ:n] continuum. The durations of [ɛ] and [n] are indicated in the two-dimensional plane by the symbol +. Note that the stimuli lie on a line with a slope close to -1, i.e. reflecting a constant rhyme duration. The other continua are similarly located in the VC-plane, lying approximately parallel to the stimuli indicated in Fig. 2.

The waveform editing was done on a PC using the Wave for Windows™ waveform editor from Turtle Beach, Inc. In this case the base tokens were resampled into the computer at 11.025 kHz using the Turtle Beach Multisound™ card, which is a high-quality 16-bit stereo card. This card was also used for the playback of the stimuli.

Table 2. The stimulus parameters for Experiment 1. Formants are measured (in Hz) near the middle of those stimuli having the longest vowel in each condition. The start values reflect the durations (in ms) in the base stimuli on which each continuum was based. The end values show the durations at the opposite end of each continuum. Additionally the v/(v + c) ratios have been calculated at both ends of each continuum. Finally the column marked Nstim gives the number of stimuli in each continuum.

                   Formants      Start values (ms)       End values (ms)
Vowel  Continuum   F1    F2      V     C     v/(v+c)     V     C     v/(v+c)   Nstim
[ɛ]    V:C-slow    410   1970    229   103   0.69        125   191   0.40      10
[ɛ]    V:C-fast    470   1740    194   95    0.67        107   165   0.39      9
[ɛ]    VC:-slow    450   1580    126   192   0.40        239   105   0.69      11
[ɛ]    VC:-fast    470   1540    124   144   0.46        202   76    0.73      8
[a]    V:C-slow    800   1130    292   100   0.74        149   211   0.41      13
[a]    V:C-fast    800   1170    230   99    0.70        119   193   0.38      11
[a]    VC:-slow    820   1210    131   229   0.36        260   113   0.70      13
[a]    VC:-fast    800   1190    123   174   0.41        222   89    0.71      10

Procedure

After the stimulus continua had been made using the waveform editor they were recorded onto two separate tapes for the listening tests, one containing the [a] continua, the other the [ɛ] continua. In all cases 11 tokens of each stimulus were recorded, starting with one randomized run of the whole stimulus set which was used for familiarization purposes. Following this the stimulus set was played in five randomized blocks, each containing two repetitions of each stimulus. The inter-stimulus interval was 2.5 seconds in all cases. Subjects listened to the tapes in a quiet room over Sennheiser HD-530-II circumaural headphones at a comfortable listening level. All tests were of the forced two-choice variety and subjects indicated their choices by marking appropriate fields on the response sheets they were provided with.

Subjects

Subjects were nine undergraduate psychology students at the University of Iceland, none of whom had participated in a speech perception experiment before, as well as the author. All reported normal hearing. All subjects listened to both tests in one one-hour session, first the [a] tape followed by the [ɛ] tape, and indicated their responses by marking appropriate fields on response sheets, man vs. mann and men vs. menn. Subjects were instructed to guess if they were unsure of the correct response.

Results

Pooled identification curves for all subjects are presented in Figs. 3 and 4, the former showing the results for the [a] continua, the latter for the [ɛ] continua.

It will be immediately apparent that there is a striking difference between the results depicted in Figs. 3 and 4. Figure 3 shows that it is readily possible to generate [a] continua spanning from a V:C to a VC: type syllable by editing either the phonemically long or the short vowels. Figure 4, on the contrary, shows that as far as the vowel [ɛ] is concerned this is not in general possible: to listeners' ears a shortened long [ɛ] does not sound natural in the same way as a “real” short [ɛ]. A lengthened short [ɛ] yields even fewer responses as [ɛ:].

It is possible to calculate phoneme boundaries for individual subjects for the four [a] continua. This was done using the method of probits (Finney, 1971). On average the phoneme boundaries were found to lie at a vowel duration of 177.6 ms in the fast rate V:C condition and at 174 ms in the fast rate VC: condition. For the normal rate the phoneme boundaries were on average located at a vowel duration of 193.8 ms in the V:C condition and at 203.5 ms in the VC: condition. A two-way repeated measures ANOVA, rate (slow vs. fast) x series (V:C vs. VC: base stimuli), shows that the effect of rate is highly significant, F(1, 9) = 107.16, p < 0.001, while that of series is not significant, F(1, 9) = 1.71, p = 0.22. On average the phoneme boundaries are located at 175.8 ms at the fast rate and at 198.7 ms at the slow rate, a difference of 22.9 ms. This difference is in the expected direction, since with a longer rhyme duration a longer vowel is needed to cue a phonemically long vowel. The interaction of rate and series is significant, F(1, 9) = 15.55, p = 0.003, and reflects the fact that at the fast rate the phoneme boundaries of the V:C type words are located at a longer vowel duration than in the VC: type words; at the slow rate it is the other way around.

[Figure 3: identification curves (0-100%) plotted against duration of vowel (ms), 120-320 ms.]

Fig. 3. Identification curves for all ten subjects in Experiment 1 in the four [a] continua. The rate at which the words were originally spoken, fast or slow, has a major effect on the location of the identification curves.

[Figure 4: identification curves (0-100%) plotted against duration of vowel (ms), 100-260 ms. Legend: Fast rate - V:C; Fast rate - VC:; Slow rate - V:C; Slow rate - VC:.]

Fig. 4. Identification curves for all ten subjects in Experiment 1 in the four [ɛ] continua. These curves show clearly the influence of the spectral differences between [ɛ] and [ɛ:] on the perception of quantity, especially as compared with Figure 3, where the spectral effects are shown to have a minor effect on the perception of short and long [a].

Table 3. The percentage men [mɛ:n] responses for the three subjects (AG, SG, and VV) classified as spectral listeners and two subjects (JP and LH) who are partly spectral listeners, relying on spectral cues in the two VC: continua, not in the V:C continua.

Subject  Fast V:C  Fast VC:  Slow V:C  Slow VC:
AG       96.7      0         99        0.9
SG       81.1      2.5       97        3.6
VV       84.4      10        95        12.7
JP       -         2.5       -         0.9
LH       -         1.3       -         1.8

Figure 4 shows that the results from the [ɛ] condition are quite different. In particular, the overall curves do not reliably cross the phoneme boundaries. Results for individual subjects show interesting differences, and it is possible to distinguish three types of listeners. First, “spectral” listeners base their judgments exclusively, or almost exclusively, on the spectral characteristics of the base vowel, irrespective of its length. Thus they would classify all, or nearly all, words in the V:C conditions as being men, and conversely all stimuli in the VC: conditions as being the word menn. Three subjects can be said to belong to this category. A second group is made up of the “temporal” listeners, who base their judgments primarily on the temporal cue, at least to the extent that they show reliable cross-over of listening judgments depending on the duration of the vowel. Five listeners can be classified as belonging to this group. That leaves two listeners, one of them being the present author, who show a mixture of temporal and spectral judgments, basing their judgments of words in the V:C syllables on spectral cues (thus perceiving them all, or nearly all, as [mɛ:n], reflecting the original vowel) while being influenced by the durational cues in the VC: conditions. Table 3 shows the percentage men judgments for the five subjects classified as “spectral” or “partly spectral” listeners. A pure spectral listener, assuming no error responses, would show the pattern 100-0-100-0. Evidently, subject AG comes very close to this ideal.

Phoneme boundaries were calculated for the five “temporal” listeners. At the fast rate these averaged 138.8 ms in the V:C condition and 184.9 ms in the VC: condition. At the slow rate the averages were 143.3 ms for the V:C type syllables and 210.7 ms for the VC: type syllables. Evidently the effect of syllable type is quite noticeable here, yielding an average difference of 56.7 ms of vowel duration between the syllable types (compared to 3.1 ms in the [a] conditions), this difference reflecting the great spectral differences between the vowels [ɛ] and [ɛ:]. So even the “temporal” listeners are highly sensitive to the spectral differences between long and short [ɛ].
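To make concrete what a boundary of this kind is, the following sketch estimates the 50% cross-over point of an identification curve. It is a simplified, unweighted variant of probit analysis (a straight-line fit to probit-transformed proportions), not the iterative weighted procedure described by Finney (1971), and the listener is simulated: the response proportions are generated from an assumed cumulative normal with a boundary of 190 ms and a spread of 30 ms, so no experimental data are involved. All names are illustrative.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def probit_boundary(durations_ms, prop_long, eps=0.005):
    """Estimate the 50% boundary from proportions of 'long vowel' responses.

    Simplified probit fit: proportions are clipped to [eps, 1 - eps], converted
    to z-scores, and fitted with an ordinary least-squares line z = a + b * x.
    The boundary is the duration at which z = 0, i.e. -a / b.
    """
    zs = [_STD_NORMAL.inv_cdf(min(max(p, eps), 1.0 - eps)) for p in prop_long]
    n = len(durations_ms)
    mean_x = sum(durations_ms) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in durations_ms)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(durations_ms, zs))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return -intercept / slope, slope


# Simulated listener (assumed parameters, not data from the experiment).
simulated = NormalDist(mu=190.0, sigma=30.0)
durations = [130, 150, 170, 190, 210, 230, 250]
proportions = [simulated.cdf(d) for d in durations]

boundary, slope = probit_boundary(durations, proportions)
print(f"estimated boundary: {boundary:.1f} ms")
```

With real response counts the proportions would be noisy and often 0 or 1 at the ends of a continuum, which is why a weighted, iterative fit is normally preferred over this shortcut.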

A two-way repeated measures ANOVA, rate (slow vs. fast) x series (V:C vs. VC:), shows a nonsignificant effect of rate, F(1, 4) = 5.11, p > 0.05, while that of series is highly significant, F(1, 4) = 64.96, p = 0.001. The different rates, while not statistically significant in this test, do yield a shift of the phoneme boundaries in the same direction as in the [a] conditions, amounting to 15.1 ms. The interaction of rate and series is significant, F(1, 4) = 15.64, p < 0.05.
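The rate and series shifts quoted in the Results can be recomputed from the four mean boundaries reported for each vowel. The sketch below takes each shift as the difference between marginal means; the data layout and function name are choices made for this illustration. Because the inputs are already rounded to one decimal, the recomputed values agree with the quoted ones only to within rounding.

```python
from statistics import mean

# Mean phoneme boundaries (vowel duration in ms) as reported in the text,
# keyed by (rate, base syllable type).
boundaries = {
    "a": {("fast", "V:C"): 177.6, ("fast", "VC:"): 174.0,
          ("slow", "V:C"): 193.8, ("slow", "VC:"): 203.5},
    "ɛ": {("fast", "V:C"): 138.8, ("fast", "VC:"): 184.9,
          ("slow", "V:C"): 143.3, ("slow", "VC:"): 210.7},
}


def marginal_shift(cells, factor_index, high, low):
    """Difference between the marginal means of two levels of one factor."""
    high_mean = mean(v for key, v in cells.items() if key[factor_index] == high)
    low_mean = mean(v for key, v in cells.items() if key[factor_index] == low)
    return high_mean - low_mean


shifts = {
    vowel: {
        "rate": marginal_shift(cells, 0, "slow", "fast"),
        "series": marginal_shift(cells, 1, "VC:", "V:C"),
    }
    for vowel, cells in boundaries.items()
}

for vowel, shift in shifts.items():
    print(f"[{vowel}] rate shift: {shift['rate']:.2f} ms, series shift: {shift['series']:.2f} ms")
```

The comparison makes the pattern explicit: for [a] the rate shift dominates and the series shift is small, while for [ɛ] the series shift is several times larger than the rate shift.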

GENERAL DISCUSSION

The present experiment has shown that spectral factors can be of decisive importance in the perception of quantity in Icelandic. Previous studies (Pind, 1986; see also Garnes, 1976), all dealing with the perception of quantity in vowels which are spectrally similar whether long or short, had established that a higher-order relational invariant was the major cue to quantity. This finding now needs to be qualified by taking into account the influence of the vowel spectrum in those cases where long and short phonemes are spectrally dissimilar. In those cases the spectral cue turns out to be of overriding importance.

In a previous paper (Pind, 1986) a two-stage model for the perception of quantity in Icelandic was proposed. This model was meant to deal with two aspects of the perception of temporal speech cues in particular. The first one is the finding mentioned in this paper of a relational basis to the perception of quantity which has been shown to be mediated by a
higher-order invariant, namely the V/(V + C) ratio. Additionally the model dealt with the finding that manipulations of the external speech context, e.g. of speech rate in precursor sentences, exert some influence on the placement of phoneme boundaries. Since these effects are extrinsic to the syllables being investigated they were presumed to arise from an extrinsic type of normalization, e.g. of the kind hypothesized by Nooteboom (1979), where the listener is assumed to set an internal clock on the basis of speaking rate which is then used to gauge the subjective duration of the speech segments. It turned out that the external normalizations seen were an order of magnitude smaller than those involving the manipulations of the intrinsic ratios of the syllable rhyme. Thus it was hypothesized that these effects were only to be found within the “interval of uncertainty” separating the two phonemic categories of long and short phonemes.
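Why a ratio can separate rate changes from quantity changes may be illustrated with a small sketch. It assumes, as an idealization, that a change of speaking rate scales vowel and consonant by the same factor; all durations and function names below are hypothetical and are not measurements from this study.

```python
def vowel_to_rhyme_ratio(vowel_ms, consonant_ms):
    """Return V/(V + C), the share of the rhyme taken up by the vowel."""
    return vowel_ms / (vowel_ms + consonant_ms)


def rescale(segments, factor):
    """Scale vowel and consonant uniformly (an idealized change of speaking rate)."""
    vowel_ms, consonant_ms = segments
    return vowel_ms * factor, consonant_ms * factor


# Hypothetical (vowel, consonant) durations in ms, for illustration only.
long_vowel_syllable = (180.0, 90.0)    # V:C type
short_vowel_syllable = (90.0, 180.0)   # VC: type

fast_long = rescale(long_vowel_syllable, 0.75)    # long vowel, spoken fast
slow_short = rescale(short_vowel_syllable, 1.5)   # short vowel, spoken slowly

for label, (v, c) in [("fast V:C", fast_long), ("slow VC:", slow_short)]:
    print(f"{label}: vowel = {v:.1f} ms, V/(V + C) = {vowel_to_rhyme_ratio(v, c):.3f}")
```

In this sketch both tokens have a 135 ms vowel, so absolute vowel duration is ambiguous, yet their ratios stay at about 0.667 and 0.333, the same values as before rescaling. Real rate changes need not scale the two segments equally, so this shows only the logic of the relational cue.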

To this earlier model must now be added another cue to quantity, i.e. the vowel spectrum which has been shown in the present experiment to exert a decisive influence on the perception of the quantity of the vowel [E]. It is a well-established finding in speech perception research that there are usually multiple speech cues to any phonetic contrast (Lisker, 1986) and even when some cues are quite unambiguous other cues may yet exert an effect (Whalen, Abramson, Lisker, and Mody, 1993). Therefore it is to be expected that even if a durational ratio is a major cue to quantity, other cues would also make a contribution. This has of course been shown in the present experiment, revealing as it does the role of the vowel spectrum in the perception of quantity.

In the previous discussion much has been made of the role of the V/(V + C) ratio as a higher-order invariant. How do the present results affect the status of the V/(V + C) ratio as an invariant? This will of course depend to a considerable extent on the precise significance attached to the term invariant. Fowler (1994) has recently provided a thoughtful discussion of the term as used in speech-perception research. A given acoustic pattern is an invariant for a speech feature if it is “invariably present when its feature is produced, and because the pattern is unique to the feature, it can specify the feature - that is determine it uniquely.” This is the guiding principle behind the well-known work of Stevens and Blumstein (1981) on invariants for the perception of place of articulation for stop consonants. An invariant is thus a specifier: it always and invariably accompanies a particular feature, and being unique to the feature it provides sufficient information for the perception of the feature.

How do the results of the present experiment fit in with this concept of an invariant? Recall that previous results (Pind, 1986, 1995) have shown quite conclusively the importance of speech segment ratios for the perception of quantity in syllables having spectrally uniform long and short vowels. For these experiments the V/(V + C) ratio is thus clearly a specifier in Fowler’s sense: It always accompanies the quantity feature and is unique to it. But it will not account for the perception of quantity in those syllables having spectrally dissimilar long and short vowels, even though these show quite similar durational relationships to those of uniform vowels (Fig. 2); that is, the durational relationships of vowels and consonants (in production) are the same whether quantity is cued by spectral or durational means. We are thus left with a somewhat paradoxical situation where an invariant cue, the V/(V + C) ratio, which presumably could act as a specifier in all cases, is not in fact always the speech cue to which the listener is attuned.

Evidently, a somewhat broader conceptualization of the speech cue for quantity is needed, one that acknowledges the existence of multiple cues to phonetic contrasts. This does not entail that a stimulus-based account of the perception of quantity is not viable, though the notion of an invariant as a specifier is obviously too restrictive. But other stimulus-based accounts of perception come to mind as possible frameworks for models of speech perception, in particular the “probabilistic functionalism” of Brunswik (1956), which holds that
perceptual cues have different “ecological validities”. This theory would posit different ecological validities for the spectral and durational cues to quantity, high for the former in the case of [E], high for the latter in the case of [a].

Such a conceptualization is in many ways similar to a number of phonetic theories which explicitly acknowledge the variability of speech, not as a feature to be “explained away” but rather to “mak[e] sense of it” (Lindblom, 1994). Thus Lindblom has argued forcefully in many papers, e.g. Lindblom (1990, 1994), for the view that speech is adaptive and guided by a principle of “sufficient auditory contrast” rather than acoustic or articulatory invariance. This theory leads to the interesting prediction regarding the data presented here that the speaker would be much more careless in aiming for durational invariants in those cases where the vowel spectra are of paramount importance in the perception of vowel quantity than in those cases where vowel spectra play no such role. To test this theory requires much more extensive phonetic measurements than those presented here. Lindblom’s theory also leads to another prediction as regards the perceiver, namely that he or she would monitor durational ratios less closely in those cases where vowel spectral differences correlate with the quantity differences than in those cases where no such correlation is to be expected. Further experiments are needed to establish whether this is in fact what happens.

In any case it is clear that the previously proposed model for the perception of quantity in Icelandic needs to be revised to take account of the finding that vowel spectra can cue vowel quantity. While the present data do not allow us to explicate in detail the interaction of spectral and durational factors in the perception of quantity, it is obvious that the relational speech cue needs to consider the whole syllable while the spectral speech cue is presumably available to the listener right after the onset of the syllable nucleus. This leads to the interesting hypothesis that the time-course of extraction differs for these two acoustic dimensions in perception. If this were to be modeled with some kind of interactive-activation type of model, it may be hypothesized that the response strengths for quantity would grow much more quickly in the perception of the vowel [E] than for [a].
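The time-course hypothesis can be made concrete with a toy simulation. The sketch below is a single leaky accumulator, far simpler than an interactive-activation model and not a model proposed in this paper. The cue weights, gain, threshold and durations are all hypothetical values chosen only to illustrate the argument: a spectral cue feeds the quantity decision from vowel onset, whereas the relational cue becomes available only once the whole rhyme has been heard.

```python
def time_to_threshold(spectral_weight, ratio_weight, vowel_ms, consonant_ms,
                      threshold=0.5, gain=0.02):
    """Return the time (ms after vowel onset) at which activation reaches threshold.

    Toy leaky accumulator with 1 ms steps. The spectral cue drives the unit
    from vowel onset; the relational V/(V + C) cue contributes only after
    the end of the rhyme. Returns None if threshold is never reached.
    """
    rhyme_ms = vowel_ms + consonant_ms
    activation = 0.0
    for t in range(1, rhyme_ms + 501):  # simulate up to 500 ms past rhyme offset
        drive = spectral_weight          # available right after nucleus onset
        if t > rhyme_ms:
            drive += ratio_weight        # the ratio is known only after the rhyme
        activation += gain * (drive - activation)  # move toward the current drive
        if activation >= threshold:
            return t
    return None


# Hypothetical cue weights: spectrum dominant for [E], duration ratio for [a].
t_spectral_vowel = time_to_threshold(0.9, 0.1, vowel_ms=150, consonant_ms=150)
t_uniform_vowel = time_to_threshold(0.0, 1.0, vowel_ms=150, consonant_ms=150)

print("[E]-like vowel reaches threshold at", t_spectral_vowel, "ms")
print("[a]-like vowel reaches threshold at", t_uniform_vowel, "ms")
```

Under these assumed settings the [E]-like case crosses threshold while the vowel is still being heard, and the [a]-like case only after the rhyme has ended. This mirrors the hypothesis in qualitative terms but makes no quantitative prediction.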

Research supported by the Icelandic Science Foundation and the Research Fund of the University of Iceland. This paper was mostly written while the author was a visiting scientist in the Research Laboratory of Electronics at MIT. I am grateful to Björn Lindblom, Stefanie Shattuck-Hufnagel and an anonymous reviewer for constructive comments on an earlier version of this paper.

REFERENCES

Abramson, A. S. & Ren, N. (1990). Distinctive vowel length: duration vs. spectrum in Thai. Journal of Phonetics, 18, 79-92.

Brunswik, E. (1956). Perception and the representative design of psychological experiments. California: University of California Press.

Elert, C.-C. (1964). Phonologic studies of quantity in Swedish. Uppsala: Almqvist & Wiksell.

Finney, D. J. (1971). Probit analysis. Cambridge: Cambridge University Press.

Flanagan, J. L. (1972). Speech analysis, synthesis and perception. Berlin: Springer-Verlag.

Fowler, C. A. (1994). Invariants, specifiers, cues: An investigation of locus equations as information for place of articulation. Perception & Psychophysics, 55, 597-610.

Garnes, S. (1976). Quantity in Icelandic: Production and perception. Hamburg: Helmut Buske Verlag.

Hadding-Koch, K. & Abramson, A. S. (1964). Duration versus spectrum in Swedish vowels: Some perceptual experiments. Studia Linguistica, 18, 94-107.

Jenkins, J. (1987). A selective history of issues in vowel perception. Journal of Memory and Language, 26, 542-549.

Lindblom, B. (1990). Explaining phonetic variation. A sketch of the H&H theory. In W. Hardcastle & A. Marchal (Eds.), Speech production and speech modeling (pp. 403-439). Dordrecht: Kluwer.

Lindblom, B. (1994). Role of articulation in speech perception: Clues from production. Paper presented at the 127th meeting of the Acoustical Society of America, June 1994.

Lisker, L. (1986). “Voicing” in English: A catalogue of acoustic features signaling /b/ versus /p/ in trochees. Language and Speech, 29, 3-11.

Miller, J. L. (1987). Rate-dependent processing in speech perception. In A. W. Ellis (Ed.), Progress in the psychology of language (Vol. III, pp. 119-157). Hove: Lawrence

Nooteboom, S. G. (1979). Complex control of simple decisions in the perception of vowel length. In: Proceedings of the Ninth International Congress of Phonetic Sciences (Vol. II, pp. 298-304). Copenhagen: Institute of Phonetics.

Pétursson, M. (1974). Peut-on interpréter les données de la radiocinématographie en fonction du tube acoustique à section uniforme? Réflexions à propos de l’analyse du système vocalique de l’islandais moderne. Phonetica, 29, 22-79.

Pind, J. (1986). The perception of quantity in Icelandic. Phonetica, 43, 116-139.

Pind, J. (1995). Speaking rate, voice-onset time, and quantity: The search for higher-order invariants for two Icelandic speech cues. Perception & Psychophysics, 57, 291-304.

Strange, W. (1989). Evolving theories of vowel perception. Journal of the Acoustical Society of America, 85, 2081-2087.

Stevens, K. N. & Blumstein, S. E. (1981). The search for invariant acoustic correlates of acoustic features. In P. D. Eimas & J. L. Miller (Eds.), Perspectives on the study of speech (pp. 1-38). Hillsdale, N.J.: Lawrence Erlbaum Associates.

Summerfield, Q. (1981). On articulatory rate and perceptual constancy in phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 7, 1074-1095.

Whalen, D. H., Abramson, A. S., Lisker, L. & Mody, M. (1993). F0 gives voicing information even with unambiguous voice onset times. Journal of the Acoustical Society of America, 93, 2152-2159.

Received 3 October 1994, accepted 10 March 1995