


Speech considered as modulated voice

Hartmut Traunmüller

Department of Linguistics, Stockholm University, S-106 91 Stockholm

ABSTRACT

In addition to linguistically coded information, speech necessarily also conveys some paralinguistic information of an expressive (affective and adaptive), organic, and perspectival kind, but the corresponding features all lack invariant absolute acoustic correlates. According to the Modulation Theory, a speaker’s voice functions as a carrier that is modulated by speech gestures, and listeners have to demodulate the signal. Speakers freely vary their voice but compensate for impediments to modulation. Listeners “tune in” to a speech signal based on intrinsic and extrinsic cues and evaluate the deviations of its properties from those they expect of a linguistically neutral vocalization with the same paralinguistic quality. It is shown how this is reflected in the results of various investigations. Most organic and some expressive information is conveyed in the properties of the carrier. Expressive factors also affect the amplitude and rate of the linguistic modulations. The acquisition and use of speech require a neural linkage between perceptual demodulation and speech motor control (echo neurons); the imitation of body postures and gestures requires analogous structures, evidenced in mirror neurons. Relations with gestural theories of speech perception and models of production, as well as implications for distinctive feature theory and for the representation of speech in memory, are discussed.

Keywords: Speech perception; Speech production; Speech acquisition; Imitative behavior; Theory of speech; Paralinguistic; Acoustic phonetics.


1. Background

1.1. Problems with the source-filter dichotomy

During the 20th century, acoustic phonetics advanced into a mature science. We now have a solid understanding of the relation between the acoustic properties of speech sounds and their origin in the variable shape and physical properties of the organ of speech (Fant, 1960; Stevens, 1998). Studies within this area resulted in widely useful source-filter models of speech production, and the source-filter dichotomy has also influenced thinking in more psychologically and linguistically oriented areas of speech research.

In descriptions of the acoustic properties of speech sounds it has often been assumed that the properties of the glottal source reflect all the characteristics of the speaker’s voice, as well as intonation, tone and the voicing features of speech sounds, while features such as those that distinguish different vowels and different consonantal places of articulation from each other were thought to be reflected exclusively in the filtering by the vocal tract, in particular in the formant frequencies. In order to study the acoustic properties of different speech sounds, it is advisable to investigate how one and the same speaker produces them and to avoid any non-linguistic variation in voice. Under this precondition, the assumption just mentioned is correct. It appears as if speakers strove for invariant filtering (vocal tract shape) and acoustic output in producing vowels as well as consonants. They even perform immediate compensatory articulations when the mobility of some part of the speech organ is constrained by an extraneous obstacle (Folkins & Abbs, 1975; Fowler & Turvey, 1980; Kelso, Tuller, Vatikiotis-Bateson & Fowler, 1984).

However, from studies of speech perception, where the possible within- and between-speaker variation in voice and in vocal tract geometry has to be taken into account, there emerges a different picture. Perception experiments with synthetic or manipulated stimuli have shown that the perceived phonetic quality of segments such as vowels is quite strongly affected by variation in the properties of the source. A synthetic speech signal that is perceived as a high-pitched [i] can, for instance, be transformed into an [æ] by just lowering its pitch (f0).¹ The perceived voice, in turn, is quite strongly affected by properties of the filter. A loud adult voice can, for instance, be transformed into a moderately loud child voice by elevating the formant frequencies while leaving f0 unchanged.¹ Thus, we see that both source and filter properties are essential for the perception of speech sounds as well as of voices. In common usage, “voice” refers to the overall impression evoked by the vocalizations of a speaker and not just to properties that can be ascribed to the glottal source. We are going to adopt this broad definition of “voice”, which also includes the whispered voice.

It is also generally known that we open the mouth more when we shout than when we talk with moderate effort (Schulman, 1989; Geumann, 2001). This implies a different shape of the vocal tract and different formant frequencies in the same vowels produced by the same speaker. When speakers increase their vocal effort, they do not strive for invariant formant frequencies, but allow these, especially F1 (the first formant), to rise together with f0 (Traunmüller & Eriksson, 2000). This is different from cases in which the mobility of some part of the speech organ is constrained by an extraneous obstacle and no change in voice is involved.

¹ An auditory demonstration that includes such cases is accessible on the Web (Traunmüller, 1998).


The fact that f0 affects the perceived openness or “height” of vowels has been known for some time (Miller, 1953; Fujisaki & Kawashima, 1968; Slawson, 1968; Ainsworth, 1971; Traunmüller, 1981), but it has often been neglected. The way in which the properties of speech sounds are described in textbooks, typically on the basis of one voice only, tends to mislead students into believing that the acoustic attributes that convey the linguistic quality are independent of those that convey the quality of the voice. Either explicitly or tacitly, this false assumption entered into the design and data analysis of many experiments. It has also happened that phonetic qualities were ascribed to vowels on the basis of their formant frequencies rather than by asking listeners, and that listeners were judged ‘unsuccessful in identifying the speech sounds’ when their identifications disagreed with expectations based on filter properties.

In this connection it may also be instructive to consider an investigation by Whalen, Levitt, Hsiao and Smorodinsky (1995), who claimed to have found differences in the intrinsic f0 of vowels in the babbling of infants. This claim was based on measurements showing that f0 was higher in babbled vowels classified as high than in those classified as low by one of the investigators. However, this just confirms that, for this investigator too, f0 affected perceived vowel height in the sense opposite to that of F1. A similar result is to be expected with any arbitrary set of natural or synthetic vowels in which f0 shows the wide range of variation that is typical of babbling and due to variation in voice. It tells us nothing about the presence or absence of f0 differences among different vowels produced without intentional variation in voice, although such differences have been observed in speakers of many languages (Whalen & Levitt, 1995) and appear to be due to an interaction between tongue and larynx (Honda, 1983) that is unlikely to be absent in infants.

The effect of vocal tract length, which follows from the acoustic theory of speech production, has been more widely appreciated. On this basis it is clear that the formant frequencies must decrease as a function of increasing vocal tract length. They cannot be the same in vowels articulated similarly by children and adults. Data on the formant frequencies of vowels produced by men and women also reflect the difference in vocal tract length. In studying the vowel systems of different speaker groups, researchers have commonly tried to remove these differences by normalizing the data (Disner, 1980), and the tacit or explicit assumption that a process of normalization is also involved in human speech perception is inherent in various models of speech perception.

Speech used to be considered as a sequence of discrete segments representing phonemes described in articulatory terms. The analytical distinctions were often those made in Table 1. Menzerath and de Lacerda (1933) observed far-reaching effects of context on the articulation of phonetic segments and they introduced the notion of ‘coarticulation’. While their kymographic investigations suggested invariance to be absent in the acoustic signal as well, they still assumed that the acoustic signal allowed segregating the segments. It was later understood that the boundaries visible in oscillograms and spectrograms are not equivalent to those between phonetic segments (Joos, 1948). Although context-independent articulatory targets appear to exist for most consonants and to be reflected in the ‘loci’ of extrapolated formant transitions (Delattre, Liberman & Cooper, 1955), there is a lack-of-invariance problem in acoustic as well as in articulatory analyses. This is avoided when speech signals are described and compared in a holistic fashion. Coarticulation cannot even be talked of unless the signal is thought of as representing a sequence of phonetic segments or targets.


Table 1: Properties of segmental units (tokens) of speech.

Phonemic, linguistic: phonemic and allophonic qualities (and orthographic equivalents), distinctive features.

Combinatorial, due to segmental context: effects of coarticulation (in the widest sense).

Indexical, due to other factors: particularities of talkers and tokens (due to any idiolectal, stylistic, expressive, organic, or perspectival factors).

The Modulation Theory (Traunmüller, 1994), which will be further elaborated and discussed in the following, presupposes neither segmentation nor categorization. It suggests a description of speech in terms of features, and it is compatible with a segmental analysis, but also, e.g., with polysystemic approaches such as that advocated by Hawkins and Smith (2001). The Modulation Theory is not primarily concerned with the distinctions made in Table 1. Not being bound to a segmental phonemic analysis, it does not have much to say about coarticulation, but the interplay between coarticulation and sensory adaptation is within its frame. The theory has more to say about the co-production of speech and expressions of emotion, since it is primarily concerned with the more fundamental problem of distinguishing the extra- and paralinguistic aspects of speech from its linguistically informative aspects (Table 2).

Table 2: Kinds of information in speech signals. Each row lists a kind of quality, the constituents that carry it, and the information it conveys.

Linguistic (conventional, social). Constituents: words, speech sounds, prosodic patterns, ... Information: message; dialect, accent, speech style, ...

Expressive (psycho-physiological; within-speaker variation). Constituents: vocal effort, speaking rate, pitch dynamics, voice quality, ... Information: emotion, attitude, environment, ...

Organic (morphological; between-speaker variation). Constituents: larynx size, vocal tract length, ... Information: age, sex, pathology, ...

Perspectival (spatial, transmittal). Constituents: acoustic and optic factors. Information: place, distance, orientation, transmission channel, ...

1.2. Kinds of information in speech

Each spoken utterance has a certain linguistic phonetic quality that reflects not only the message but also the speaker’s idiolect (language, dialect, sociolect, possible foreign accent, linguistic idiosyncrasies) and the chosen speech style. All these forms of variation concern the “linguistic quality” in Table 2. It is this quality that phoneticians attempt to reproduce in an accurate phonetic transcription. Real transcriptions of utterances always involve a quantization loss due to segmentation and categorization biased by the transcriber’s knowledge and preconceptions of the phonological system of the language. This is not necessarily objectionable, since a similar bias is likely to be shared by those who are competent in the language in question. However, in the present paper it is the quality prior to any language-specific categorization that is of primary concern. We are going to refer to this as the “linguistically informative quality” of a speech signal where clarity so requires.


In addition to their linguistically informative quality, speech signals necessarily contain several other kinds of information that listeners perceive, but which phoneticians do not usually transcribe. Within this additional information and variation, which is largely perceived as variation in voice, it is often convenient to distinguish the organic variation between speakers from the expressive (affective and adaptive) variation, which occurs within speakers as well.

The organic quality varies with a speaker’s age and occasionally due to afflictions such as a cold. Expressive variation occurs on a shorter time scale given by variations in the speaker’s psychological state. Its scope can be as short as a spoken clause, or even just a single word. The typical time scale of linguistic phonetic variations is of course still shorter. It corresponds to a single phonetic speech segment.

When the linguistic phonetic quality of speech is in focus, the expression “presentational variation” can be used as a cover term for all organic and expressive variation. Presentational variation exists also in written communication and in non-linguistic cases of categorical perception. From the perceiver’s point of view, speech signals are also affected by perspectival variation in the same way as all other acoustic and optic signals. In face-to-face situations, the optic signal is quite important in speech, but in principle, the distinctions drawn in Table 2 are applicable not only to speech but to communication by sign language as well.

Listeners are capable of evaluating the different kinds of information without much cross-interference although the acoustic attributes that convey the linguistic quality are not independent of those that convey the organic and expressive qualities. All the properties usually studied in phonetics and exploited in automatic speech recognition (sound pressure levels, f0, formant frequencies, segment durations, etc.) are affected by organic, expressive and linguistic factors to a similar extent (cf. Traunmüller, 1988). There are no absolute acoustic or optic properties of speech that convey any particular one of these kinds of information invariantly.

Expressive variation is reflected in the kind of phonation, vocal effort, speaking rate, etc. Some of this variation affects the speaker’s ‘Artikulationsbasis’, i.e., the overall ‘articulatory setting’ (Laver, 1978, 1980; Jenner, 2001) of the vocal tract and thereby influences the formant frequencies, e.g., as shown by Story, Titze and Hoffman (2001) for yawny and twangy settings compared with a normal setting. Variation in vocal effort affects not only levels, but also the spectral slope, f0 and F1 substantially (Frøkjær-Jensen, 1966; Rostolland, 1982a, 1982b; Liénard & Di Benedetto, 1999; Huber, Stathopoulos, Curione, Ash & Johnson, 1999; Traunmüller & Eriksson, 2000; Geumann, 2001). There is also variation in distinctness that shows itself in the extent of f0-excursions, sometimes referred to as “pitch span” (Ladd, 1986), and in more or less extreme values of the formant frequencies, as an effect of emotions and attitudes or due to adaptation to communicational needs, as suggested by Lindblom’s (1990) hyper/hypo theory (for emotions see Kienast & Sendlmeier, 2000; for infant-directed speech see Kuhl et al., 1997). To some extent, variation along the hypo-hyper continuum also affects the linguistically informative quality of speech. A clear example is the variation in the frequency of segment deletions. Deletions occur rarely when a speaker is angry (Kienast & Sendlmeier, 2000) and therefore makes an extra effort to get his message through, while they are frequent when a speaker is affected by sadness, which may impede adaptation to communicational needs.

All the acoustic properties that convey the linguistic prosody of utterances (f0, sound pressure level, segment durations, etc.) are also affected by expressive variation, and it may well be that most of the prosodic features of spoken languages have been taken over from the pre-linguistic and now sub-linguistic system of expressive vocal communication. The pitch contours of breath groups in human speech (Vaissière, 1983) are similar to those in the vocalizations of other mammals, such as cats and sheep, in their gradual decrease in f0. As for the linguistic and paralinguistic utilization of pitch, there is an evident connection with organic factors: high frequencies are associated with small vocalizers and low frequencies with large ones. Further, ‘small’ associates naturally with ‘harmless’, ‘submissive’ and ‘unassertive’, while ‘large’ associates with ‘dangerous’, ‘dominant’ and ‘assertive’. This gives rise to a frequency code (Ohala, 1984) that works across species and that is also used linguistically, e.g., in order to distinguish questions from statements. It is also evident that the origin of the most common ways of emphasizing a constituent of an utterance is paralinguistic and quite plausibly pre-linguistic, with analogies in non-human communication. In the case of contrastive stress, one can hesitate between classifying the phenomenon as gradual (paralinguistic) or discrete (linguistic) or both (Gussenhoven, 1999). Features with such an origin are widespread and recur in languages that are not closely related to each other, but they are not universal. Some languages offer, e.g., lexical means of emphasizing constituents. The boundary between linguistic and expressive is to some extent language-specific, since a prosody that has a linguistic function in one language may lack such a function in another, while it may still be interpretable as a meaningful expressive variation.

Organic variation reflects the morphological between-speaker variation in the dimensions of the organs of speech, and it affects f0 as well as all formant frequencies. To a first approximation, the formant frequencies vary in inverse proportion to vocal tract length. There are, however, deviations due to the disproportional growth of the pharynx, which is very short in infants and disproportionately longer in adult males as compared with females (Fant, 1975; Goldstein, 1980; Traunmüller, 1984). The relation between the auditory patterns evoked by the same vowels produced by men, women and children can be described, approximately, as a uniform translation in the tonotopic dimension (equivalent to a translation in critical band rate), with deviations due to the more peripheral quality of female vowels and the dominance of variation in F1 in children (Traunmüller, 1988). While habitual particularities characteristic of the way in which single speakers or the members of a speech community use their voice may be expressive in origin, they can more aptly be classified as quasi-organic if they are persistent.
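The inverse proportionality can be illustrated with the textbook idealization of a neutral vocal tract as a uniform tube, closed at the glottis and open at the lips, whose resonances lie at odd multiples of c/4L. The following sketch is purely illustrative; the vocal tract lengths are round figures assumed here for the example, not data from this paper:

```python
# Formant frequencies of an idealized uniform tube closed at the glottis
# and open at the lips: Fn = (2n - 1) * c / (4 * L).
# The vocal tract lengths below are illustrative round numbers.

C = 35000.0  # approximate speed of sound in warm, humid air, cm/s

def neutral_formants(length_cm: float, n_formants: int = 3) -> list[float]:
    """Odd-quarter-wavelength resonances of a uniform tube of the given length."""
    return [(2 * n - 1) * C / (4.0 * length_cm) for n in range(1, n_formants + 1)]

for label, length in [("adult male", 17.5), ("adult female", 14.5), ("child", 11.0)]:
    f1, f2, f3 = neutral_formants(length)
    print(f"{label:12s} L = {length:4.1f} cm: F1 = {f1:4.0f} Hz, "
          f"F2 = {f2:4.0f} Hz, F3 = {f3:4.0f} Hz")
```

With L = 17.5 cm this yields the familiar neutral values of about 500, 1500 and 2500 Hz, and halving the length roughly doubles all three, making the inverse proportionality concrete.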

Perspectival or ‘transmittal’ variation does not interfere as much with the other kinds of quality. In acoustic speech signals, it affects mainly the sound pressure level (SPL) and its between-ear balance. Typically, it does not affect f0 at all and affects the apparent formant frequencies only marginally, but it may affect segment durations at the listener’s place to a noticeable extent, due to the delay of reflections from walls, etc. Formant frequencies can be affected substantially by the chemical composition of the gas that fills the vocal tract, e.g., a helium-oxygen mixture such as is used in diving, but this may be better considered an “organic” rather than a “perspectival” phenomenon. In the optic channel, perspectival variation has particularly profound consequences, caused by variation in viewing angles.

1.3. Deficiencies in modeling speech perception

Although models of auditory speech perception typically include a signal analysis that simulates the spectral analysis that occurs in the auditory periphery, the variation due to non-linguistic factors, in particular the presentational variation, has often been ignored either implicitly or explicitly. Even in a recent overview of speech perception research (Diehl, Lotto & Holt, 2004), the problem of separating the linguistically informative quality from the additional information that speech signals convey is not treated explicitly. The perspective considers the distinctions made in Table 1 and ignores those made in Table 2.

In order to exemplify implicit ignorance, we can take a look at the model of spectral matching described by Bladon and Lindblom (1981), which, due to its explicitness, also readily exposes the problems that are inherent in this kind of approach. The model is based on a calculation of the tonotopic loudness density pattern (sone/bark) of acoustic stimuli. It assumes the perceived difference between two stimuli to be proportional to the difference between their loudness density patterns. The model has been used successfully in simulating the matching behavior of listeners (matching synthetic two-formant vowels to more elaborate synthetic vowels) when no variation in any but the linguistically informative quality was involved. It does not provide for distinguishing the different kinds of signal quality. Nor does it take into account that the weight listeners attach to different cues in the signal depends on the kind of quality judged. Based on a preparatory investigation (Carlson, Granström & Klatt, 1979), Carlson and Granström (1979) observed, e.g., that listeners attach a higher weight to the frequency positions of single formants (but not to their amplitudes) when they compare vowel quality than when they compare the “psychoacoustic” quality of the same sounds.

If f0 is allowed to vary, the model has a serious problem due to the auditory resolution of single partials, unless f0 is very low. This tends to make exemplars of vowels representing the same category but produced at different f0s more dissimilar from each other than from exemplars of other vowels produced at the same f0. The effect is very substantial when f0 > 400 Hz, yet vowels remain intelligible even there (Maurer & Landis, 1996). A way of solving this problem by effectively attaching zero weight to the spectral regions between partials has been suggested by de Cheveigné and Kawahara (1999). Selective attention to tonotopic places where partials are found or can be expected is also assumed in theories of virtual pitch, such as that described by Terhardt (1972a, b).

An even more basic problem with Bladon and Lindblom's (1981) model resides in the choice of loudness (in sones) for scaling the prominence of spectral components. This causes the calculated difference between two sounds to increase in proportion to their loudness, e.g., as an effect of decreasing distance between speaker and listener. This is at variance with the more realistic (objective) way in which listeners perceive differences between speech as well as non-speech sounds. Scaling spectral prominence in terms of SPL (in dB), which is the normal practice in systems of automatic speech recognition, results in a realistic comparison. This is more in line with the behavior of listeners and with classical Fechnerian psychophysics, in which just noticeable differences are represented by equal intervals.

Explicit ignorance of paralinguistic variation can be seen in some other approaches, such as those of Stevens and Blumstein (1978) and Blumstein and Stevens (1979), and it is also present to some extent in more recent approaches aimed at automatic recognition of spoken utterances (Stevens, 2000). Models of this kind detect certain gross acoustic properties that provide cues for the presence of certain linguistic phonetic features. They use property detectors that are meant to compute such attributes of the signal as can be expected to be resistant to various kinds of variation. Sensitivity to paralinguistic variation is reduced by allowing for some arbitrary shift in frequency. Unlike Bladon and Lindblom’s model, which preserves irrelevant detail, models of this kind deliberately neglect some detail that is less important for the recognition of the linguistic information. While implementations of such models may offer satisfactory recognition, in particular of easily recognized “articulator free” properties and voicing distinctions, they fall short of simulating the sensitivity of listeners to various secondary intrinsic and extrinsic factors that has been observed in speech perception experiments. Nor do they capture the perception of the paralinguistic information. Therefore, they are not good models of human speech perception.

Attempts to separate the linguistic phonetic information from other information in speech signals are not new (Traunmüller, 1981; Syrdal & Gopal, 1986; Hirahara & Kato, 1992). In the cited studies, listeners were assumed to evaluate the tonotopic distances (critical band rate differences in barks) between spectral landmarks shaped by the formants and by f0 in order to extract the linguistic phonetic information. It has subsequently been shown that listeners are likely to relate F1 to an estimate of the prosodic base value of f0 rather than to its instantaneous value (Traunmüller, 1991) and that suppression of the first partial does not lead to any substantial phoneme boundary shifts (Fahey & Diehl, 1996). As for the magnitude of the effect of f0,² there is considerable variation in the literature. While some investigators (Traunmüller, 1981; Ménard, Schwartz, Boë, Kandel & Vallée, 2002) reported effects that were large enough to account for the equivalence of vowels produced by men, women and children (Syrdal & Gopal, 1986; Traunmüller, 1988) or produced at vastly different levels of vocal effort, other investigators, e.g., Slawson (1968), Ainsworth (1971) and Nearey (1989), reported smaller effects. Most of the discrepancies reflect differences in context (Traunmüller, 1990). Approaches in which only the effects of intrinsic cues were considered (Traunmüller, 1981) fall short of capturing extrinsic factors such as the influence of the properties of an introductory phrase on the perception of vowels, observed by Ladefoged and Broadbent (1957), Ainsworth (1975) and Nearey (1989).

In their classical experiment, Ladefoged and Broadbent (1957) used synthetic two-formant versions of the English words “bit”, “bet”, “bat”, and “but”, preceded by the phrase “please say what this word is”. This introductory phrase was produced with altered overall frequency positions of F1 and F2. Such a modification caused the test word to be heard as a different one. When F1 was increased in the introductory phrase, the perceived openness of the test word vowels decreased, and F2 affected the front/back perception in an analogous way. In a subsequent experiment, Broadbent and Ladefoged (1960) showed that a modification of the formant frequencies of the introductory phrase brought about by a change in its wording did not affect the identifications. This showed that the results could not be explained on the basis of an unspecific “adaptation level theory”.

The Modulation Theory aims at explaining the interplay between the linguistic and non-linguistic aspects of human speech in production as well as in perception. It is meant to describe, in principle, how speakers merge and how listeners separate the various kinds of information listed in Table 2. The latter also involves extrinsic factors. A proper understanding of these processes is fundamental to speech science and also relevant to distinctive feature theory. The present version of the theory captures not only the most essential aspects of production and perception as such, but also those of the neural linkage between perception, production and representation of speech. This makes it a theory of imitation that captures the processes that are fundamental to speech acquisition. Its underlying principles are applicable not only to communication by speech, but also to communication by gesture and to the production, perception and imitation of any bodily postures and gestures.

The theory brings many results of previous experimental research into new light. This concerns extra- and paralinguistic variations in the properties of speech, the perception of extra- and paralinguistic information, effects of context and of familiarity with a speaker’s voice on the perception of linguistic information, and cases of deficiency in acoustic information. A selection of these results will be considered in the Discussion section of this paper, where relations with gestural theories of speech perception, speech motor control and theories of concept representation and categorization will also be discussed.

² An auditory demonstration is accessible on the Web (Traunmüller, 1998).

2. The Modulation Theory

2.1. Essentials

Within the frame of the Modulation Theory of speech, abbreviated in the following as MDT (where the “D” can also be interpreted as standing for “demodulation”), the human faculty of communicating by means of speech is seen as a biological innovation founded on a faculty of expressive communication by voice that long predates it and that still plays an important part in human communication. The theory is founded on an analysis of how the different kinds of information are merged in speech production. Its essentials can be captured in four propositions (P1 to P4). P1 and P2 represent first principles that describe the nature of speech signals. P3 expresses a logical conclusion about speech perception drawn from P1 and P2. P4 describes the linkage between perception, production and representation of speech in the brains of speakers. The four propositions constitute a complete listing of the tenets of MDT; in order to falsify the theory, it has to be shown that at least one of them is untenable. While MDT quite severely restricts the range of workable models of speech processes, it does not provide such models without a range of further considerations.

2.1.1. The carrier

P1: A speech signal is basically the result of a process in which a carrier, characterized by the static properties of the speaker's voice, has been modulated by phono-articulatory gestures.

In radio communication it is common to vary either the amplitude or the frequency of a sine wave, referred to as the “carrier”, in proportion to the amplitude of an acoustic signal. In this way, the carrier is “modulated” with the acoustic signal. This must be distinguished from a “superposition”, which involves nothing more than an addition of two concurrent signals. Speech is not superimposed on the voice of a speaker; rather, the speaker’s voice functions as a carrier that is modulated in a complex way by phonatory and articulatory gestures, which are made audible in this way. This involves not only amplitude and frequency modulation of the glottal vibrations but also a modulation of each one of the formants, whose frequency modulation is particularly informative. Considered holistically, the variable resonances of the vocal tract can be said to effectuate a spectral envelope modulation.
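For readers more at home with code than with radio terminology, the three notions can be made concrete with a minimal numerical sketch. It is purely illustrative of the engineering terms used above (the carrier frequency, message signal and modulation depths are arbitrary assumptions), not a model of speech:

```python
import numpy as np

fs = 16000                         # sampling rate, Hz
t = np.arange(0, 1.0, 1.0 / fs)    # one second of time
carrier_f = 440.0                  # carrier frequency, Hz (arbitrary)
msg = np.sin(2 * np.pi * 3.0 * t)  # a slow 3 Hz "message" signal

# Amplitude modulation: the message scales the carrier's envelope.
am = (1.0 + 0.5 * msg) * np.sin(2 * np.pi * carrier_f * t)

# Frequency modulation: the message shifts the carrier's instantaneous
# frequency; the phase is the running integral of that frequency.
inst_f = carrier_f + 50.0 * msg
fm = np.sin(2 * np.pi * np.cumsum(inst_f) / fs)

# Superposition, by contrast, is mere addition of two concurrent signals.
superposed = np.sin(2 * np.pi * carrier_f * t) + msg
```

In the superposed signal the message survives as a separable additive component; in the modulated signals it survives only as a pattern of deviations imposed on the carrier, which is the sense in which MDT claims speech rides on the voice.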

A periodic unmodulated carrier signal can be thought of as an unarticulated, ‘colorless’ and linguistically featureless oral vowel, produced with the vocal folds optimally adducted for phonation and relaxed (slack) to the extent to which expressive factors allow this. Acoustic data from speech in which the expressive quality varied suggest that an f0-value close to the lower end of the current pitch span of a given speaker should be considered representative of the carrier (Traunmüller & Eriksson, 1994, 1995b). This f0c (index c for “carrier”) can be identified with the prosodic baseline that is observable indirectly in speech by linear interpolation between successive f0-minima. It is slightly higher at the beginning of utterances than at their end, except when a singing mode is adopted. While an aperiodic carrier, such as that present in whispered speech, lacks an f0, it has formants that can be modulated.


The properties of the carrier are given by the morphology and biomechanical properties of the speaker's organ of speech (vocal fold mass, length and stiffness, vocal tract length, etc.) and by its expressive settings. While some of the properties of the carrier can be ascribed to the glottal source, others reflect the filtering function of the supraglottal cavities that arises with a linguistically neutral articulatory setting. Although the speech-specific rest position of the articulators, which speakers tend to assume just before beginning to speak (Öhman, 1967; Perkell, 1969) and between utterances (Gick, Wilson, Koch & Cook, 2004), is not itself reflected in the acoustic signal, it is suggestive of the neutral setting that defines the properties of the carrier. Expressive factors affect not only the properties that can be ascribed to the glottal source, but also the setting of the supraglottal vocal tract and thereby its filtering function. Most noticeably, speakers adopt a more open setting while speaking at increased vocal effort, which results in an increased F1 in vowels. Increased formant frequencies are also observed in whispering, but in this case this appears to be mainly due to a more constricted laryngopharynx.

Speech sounds convey more or less information about the carrier, whose properties can best be inferred from the vowels and to some extent from other voiced segments, while they are more or less obscured in obstruents and in voiceless segments. The qualification “basically” in P1 provides room for exceptional speech sounds such as voiceless fricatives and clicks, which can be said to be carried by themselves, i.e., by the sound that arises concomitantly with their articulation, rather than by the speaker’s voice. Out of their context, such sounds are not easily recognized as speech. Even when embedded in voiced speech, click sounds are typically perceived as environmental noises rather than as belonging to the stream of speech by listeners who lack familiarity with a click language. It is, therefore, reasonable to believe that extravagant sounds such as clicks and fricatives were absent in early forms of regular human speech (Traunmüller, 2003).

Although the thought of considering speech as a modulated carrier wave had already occurred to Dudley (1940), the carrier he had in mind was the sound produced in the larynx, which constitutes the source for most speech sounds in source-filter models. For sounds with a glottal source, the source-filter approach involves a physical dichotomy between (a) a speaker ‘decapitated’ at the level just above the glottis and (b) the speaker’s head with its variably shaped supraglottal vocal tract. The present approach is, in contrast, based on a functional dichotomy between (a) a complete speaker who is phonating but not articulating and whose vocal tract setting may be influenced by emotions and attitudes and (b) the speaker’s phono-articulatory gestures. While this dichotomy is different from that between source and filter, which is justified in acoustic phonetics, it is in line with the tradition in practical phonetics represented, e.g., by Catford (2001).

More akin to the present approach is the idea of considering speech gestures as strings of vowel gestures on which consonantal gestures are superimposed (Öhman, 1966; Öhman, Persson & Leanderson, 1967; Carré & Chennoukh, 1995), so that the vowels would function as a carrier that is acoustically modulated by the consonants. In a sense that would satisfy P2 in addition to P1, such an analysis may be applicable when a consonant calls for the same articulator as the vowel. This is the case in velars, whose production and perception strongly depend on adjacent vowels (or vice versa). In contrast, when a labial is produced, the tongue is free to adopt the shape required by an adjacent vowel.³ The extent to which this freedom is used differs between languages. While speakers of French maintain the vowel gesture during the production of the [p] in [VpV]-sequences, the tongue is deactivated during the [p] by speakers of English (cf. Lindblom, Sussman, Modarresi & Burlingame, 2002). Similarly, lip shape has been observed to become more neutral while an [s] is produced between rounded vowels (Gay, 1978; Engstrand, 1981; Perkell, 1986). Coarticulatory behavior remains outside the frame of MDT, which regards the speaker’s voice as a carrier that is modulated by vowels as well. There remains, nevertheless, a shadow of this prior idea: adaptation effects in proprioception are likely to cause the tongue shape and the jaw position speakers consider neutral to be biased in the direction of the vowels in the currently produced syllables. This may go unnoticed by listeners as well as by speakers, since listeners are also affected by adaptation in their perception.

³ Menzerath and de Lacerda (1933) used the term “Koartikulation” for this kind of case, while they spoke of “Steuerung” [control] when the same articulator was involved.

2.1.2. The modulation

P2: The auditory properties of speech signals with the same linguistically informative quality deviate from those of an unmodulated carrier essentially in the same specific way.

This proposition applies to speech signals irrespective of their presentational (expressive and organic) and perspectival quality. It implies that the linguistically informative quality of speech signals is associated with these deviations (the modulation) and not immediately with the absolute properties of the signal. The modulation is claimed to be essentially the same for any speech signals whose linguistically informative quality is the same, while it is assumed to be different whenever this quality is different. In order for communication to succeed, speakers must produce a modulation that corresponds to their intended message. P2 is meant to apply to acoustically clean speech that includes all relevant context. The latter is the case in words or speech sounds produced in isolation, but strings or segments of speech that have been excised from their context are not equivalent to these.

Although a given auditory modulation pattern always signals the same string of speech sounds, P2 does not imply that speech sounds or features must always be signaled by the same modulation pattern. It is, e.g., well known that place of articulation of prevocalic stop consonants can be signaled by formant transitions as well as by the release burst, or even by optic cues. This is not in any way exceptional in perception: The presence of a dog can be signaled by its visible tail as well as by its visible snout and also by auditory or olfactory cues.

P2 applies to auditory proprioception (how a speaker hears self-produced speech) as well as to exteroception (how a speaker hears the speech of others, perhaps influenced by vision). In particular during speech acquisition, auditory proprioception is crucial for correct speech production because the acoustic patterns of modulation have to conform to the relevant standards of the speech community. Conformity is normally also required of the visible articulatory gestures, but this is not the case for gestures that remain invisible.

The “auditory properties” referred to in P2 are not equivalent to the acoustic properties of speech signals. They have to be expressed in auditory terms and, when the speaker is visible, they can be modified by the optic signal (cf. McGurk & MacDonald, 1976). Assuming that listeners consider pitch intervals that span the same number of semitones as equal in size, which appears to be valid in speech (Traunmüller & Eriksson, 1995) as well as in music, the amplitude of the modulations in pitch can be described as

M0 = log(f0) – log(f0c). (1)

H. Traunmüller / Speech considered as modulated voice 12

It can be convenient to express this in semitones:

M0 (in st) = 12 [log(f0) – log(f0c)] / log(2). (2)
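As a worked example of equation (2), the computation is a one-liner; the baseline and peak values below are hypothetical round numbers chosen for illustration:

```python
import math

def m0_semitones(f0: float, f0c: float) -> float:
    """Pitch modulation amplitude, equation (2): semitones above the
    speaker's prosodic baseline f0c."""
    return 12.0 * (math.log(f0) - math.log(f0c)) / math.log(2.0)

# Hypothetical example: a speaker with a baseline of 110 Hz producing
# an accented syllable peak at 165 Hz.
print(m0_semitones(165.0, 110.0))  # about 7.0 st (a fifth above baseline)
```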

In describing the spectral modulation of a signal, the spectrum has to be related to the spectrum of the carrier. The difference between the spectra of different speech sounds can often be described in its essentials considering only the auditory representation of the frequency positions of their formants. These can be adequately expressed in critical band units (bark). The modulation of the first formant

M1 = z(F1) – z(F1c), (3)

where “z” stands for “critical band rate of”, is closely related to vowel openness (negatively to vowel height). The modulation of the second formant

M2 = z(F2) – z(F2c) (4)

is primarily related to the place of articulation (tongue advancement) in vowels and in dorsal consonants, and secondarily, in the opposite sense, to lip protrusion. Modulation amplitudes can also be calculated in an analogous way for F3 and any higher formants, but these are not as informative about the linguistic quality of speech sounds as M1 and M2.

The possible range of variation in the frequency position of any Fn is limited by the formants above and below. It is reasonable to consider the Fc’s above and below as the ultimate limits and to use these for calculating relative modulation amplitudes such as

M2rel = M2 / [z(F3c) – z(F1c)] (5)

In an investigation of the perception of two-formant vowels by speakers of Turkish, Swedish, and other languages (Traunmüller & Lacerda, 1987), there emerged a variable I with a boundary-specific invariant value for the boundaries [u/ɯ], [ɯ/y] and [y/i]. The variable could be interpreted as I = [z(F2’) – z(F1c)] / [z(F3c) – z(F1c)] (Traunmüller, 1994), with F2’ denoting the single upper formant. This is equivalent to a calculation of M2rel by equation (5) if F2 and F2’ behave alike and F2c is predictable from F1c and F3c. Equation (5) and its analog

M1rel = M1 / [z(F2c) – z(Fgc)], (6)

in which the perceptual validity of the denominator still needs to be verified, capture organic variation between speakers as well as variation in articulatory settings between and within speakers. “Fgc” in equation (6) refers to the ‘glottal formant’ or pseudo-formant Fg (Fant, 1979), which describes the spectrum of the unfiltered glottal pulses. Fg is associated with f0 and with voice quality. It is visible as a separate spectral peak in open vowels. The equations (3) to (6) are not immediately applicable to nasal and lateral speech sounds, which involve a more complex spectral envelope modulation due to extra formants.
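The quantities in equations (3) to (6) are straightforward to compute once a critical band rate function z is chosen. The sketch below uses Traunmüller's (1990) analytic approximation to the bark scale; the carrier formant values plugged in at the end are hypothetical illustration values, not measurements:

```python
def z_bark(f_hz: float) -> float:
    """Critical band rate in bark; Traunmüller's (1990) analytic
    approximation to the Zwicker scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def m_n(f_n: float, f_nc: float) -> float:
    """Modulation amplitude of a formant, equations (3) and (4):
    tonotopic deviation from its carrier (neutral) position."""
    return z_bark(f_n) - z_bark(f_nc)

def m2_rel(f2: float, f1c: float, f2c: float, f3c: float) -> float:
    """Relative modulation amplitude of F2, equation (5)."""
    return m_n(f2, f2c) / (z_bark(f3c) - z_bark(f1c))

# Hypothetical carrier (neutral) formants and an [i]-like F2:
f1c, f2c, f3c = 500.0, 1500.0, 2500.0
print(m_n(2200.0, f2c))               # M2, in bark
print(m2_rel(2200.0, f1c, f2c, f3c))  # M2rel, dimensionless
```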

While organic variation typically affects the properties of the carrier, expressive variation typically affects not only the carrier but also the modulations. Normally, it does not affect the polarity of the deviations that convey the linguistic information, but expressive factors may very well affect their amplitudes and also their rate of execution. This is why we need the qualification “essentially” in P2. This also implies that modulation amplitudes calculated with any of the equations (1) to (6) still cannot be regarded as invariants descriptive of the linguistically informative quality alone. The effects of expressive variation are especially noticeable in the f0-excursions, the overall amplitude (pitch span) of which varies widely as a function of liveliness (Traunmüller & Eriksson, 1995). The magnitude of formant frequency excursions varies, to a lesser extent, as a function of articulatory distinctness and, in languages such as Dutch (Bergem, 1993), also as a function of sentence accent and word stress. Increased vocal effort is characterized by an increase in the openness of the mouth, which shows itself mainly in the vowels, and a corresponding increase in the amplitude of articulatory gestures, which is required when producing obstruents in the context of vowels affected in this way (Schulman, 1989; Geumann, Kroos & Tillmann, 1999).

Expressive quality is often attributed to the “voice” of a speaker even if the quality is reflected more in the amplitude of modulations and perhaps in their rate. The “carrier” referred to in P1 is not synonymous with the “voice” in this sense, since the carrier does not reflect dynamic properties of a voice, such as its pitch span. This is why the attribute “static” appears in P1.

While it is claimed that the linguistic phonetic information is conveyed by the modulation, P2 does not disallow expressive modulations. These are, in most cases, slower than linguistic modulations. Expressive information is mainly conveyed by the carrier and by amplitude and rate of the modulations, while the organic information is mainly conveyed by the carrier alone.

2.1.3. Demodulation

P3: For the perception of the different kinds of information in speech, P1 and P2 imply that a demodulation is necessary in order to recover and separate the kinds of information.

If speech is modulated voice, then any approach that cannot be interpreted as a demodulation will fail in the presence of variation in voice. In order to be able to recognize the various kinds of information conveyed, listeners have to recover both the voice and its modulation.⁴

In order to recover the linguistic information, listeners must “tune in” to the speaker’s voice and evaluate the deviations of the current properties of the speech signal (f0, formant frequencies, etc.) from those they expect of a linguistically neutral sound with the same presentational and perspectival quality. An analogous process is assumed to take place in lipreading, which requires tuning in to the speaker’s face. In tuning in to a speaker’s voice, it can be assumed that listeners will mainly be guided by their recent experience (if any) of the voice. The theory sets no restrictions concerning the kinds of additional extrinsic information listeners may rely on or possibly disregard in this process. Although the theory does not tell which weight listeners will attach to the different cues to a speaker’s voice, it can be expected on independent grounds that the weights will be positively correlated with the reliability of the cues. When voices are presented with incongruent cues, which is not uncommon in speech perception experiments, the effects of particular cues will not show up to their full extent.

⁴ Within communications engineering, “demodulation” refers to the process of recovering the informative modulation of a signal, while the carrier is just disregarded. This is different from the present context, in which the carrier is also informative, so that “demodulation” may be interpreted as recovering the carrier as well as its modulation.


Due to its perceptual salience, f0 plays a most important part in the process of tuning in to a speaker, but the formants must also be considered in order to separate the expressive from the organic information. Men, women, and children can all speak at the same f0, but if they do, there will be a substantial difference in expressive quality in addition to the difference in organic quality. The organic information indicative of a speaker’s age and sex can still be distinguished from the expressive information on the basis of the formants above F2. In auditory terms, these formants are not affected as much as f0, F1 and F2 by variation in linguistic and expressive quality.

The presence of f0 and of formants above F2 is not absolutely necessary in order to tune in to an unknown voice. Thanks to probabilistic factors and redundancies inherent in speech signals, even an initially mistuned listener will often be able to recognize some speech sounds. The listener can then adjust his tuning to fit his perception. This post-hoc method is slower than the use of intrinsic cues and prior information. For high precision, it requires several syllables with different vowels as an input.

In order to account for the perceptual aspects of vowel contrast reduction, Koopmans-van Beinum (1980) suggested that listeners arrange the vowel prototypes for a given speaker around the centroid of all the speaker’s vowels. This is the state that results from tuning in to the speaker's voice if the centroid represents the carrier.

In order to recover the frequency modulation of f0, listeners have to evaluate the instantaneous value of f0 in relation to its expected base value f0c, as described by equations (1) and (2). The f0 does not necessarily need to be represented by a partial with that frequency, since it is the virtual pitch that is relevant here.

Although the equations (1) and (2) allow a musical analysis of the f0-excursions in speech, they are not fully sufficient for a linguistic analysis. For this purpose, listeners must relate the amplitudes of excursions to the current pitch span, which represents a dimension of the expressive quality. Instead of the extreme pitch span log(f0max) – log(f0min), it is preferable to consider the RMS of the deviations of log(f0) from its mean, i.e., its standard deviation σ. This gives us a theoretical invariant

M0ling = M0 / σ[log(f0)]. (7)

The accuracy to which listeners need to know a speaker’s current σ is language dependent. Languages that can be specified phonologically by distinguishing only between highs and lows, with no distinctive levels in between, do not require high precision here. Equation (7) is applicable within stretches of speech that exhibit the same type of phonation. Stretches of creaky voice and voiceless segments must not be included together with stretches of modal voice in the calculation of σ. A listener cannot really know the M0ling of an isolated syllable produced by an unknown speaker at a constant pitch, but a guess based on an assumed normal value of σ is possible. For the estimation of f0c, required in calculating M0, the formants provide guidance in such cases.
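A minimal sketch of equation (7), under the assumptions just stated (only modal-voiced frames enter the calculation; the f0 values and the baseline are invented for illustration):

```python
import numpy as np

def m0_ling(f0_track_hz: np.ndarray, f0c_hz: float) -> np.ndarray:
    """Equation (7): pitch modulation normalized by the speaker's current
    pitch span, estimated as the standard deviation of log(f0).
    The track should contain only modal-voiced frames; creaky and
    voiceless stretches must be excluded beforehand."""
    log_f0 = np.log(f0_track_hz)
    m0 = log_f0 - np.log(f0c_hz)  # equation (1)
    return m0 / np.std(log_f0)

# Hypothetical voiced frames (Hz) from one utterance, baseline 100 Hz:
track = np.array([105.0, 130.0, 160.0, 140.0, 110.0, 102.0])
print(m0_ling(track, 100.0))
```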

In the perception of the linguistically informative quality of speech sounds, listeners are very sensitive to small changes in the frequency positions of the formants (Carlson & Granström, 1979). This suggests that listeners evaluate the positions of the formants in the signal in relation to their expected neutral positions. In this process, primary attention must be paid to the direction (polarity) of the deviations, but their amplitude is also important.

The principal role of f0 in spectral demodulation is that of serving as a clue to where F1c is to be expected. When the articulatory settings are otherwise normal, F1c tends to have the same value for a given f0c (roughly 3.5 barks above it) no matter how expressive and organic factors combine. This is compatible with the observation that [z(F1) – z(f0)] is a good but not a perfect predictor of perceived openness (Traunmüller, 1981, 1991; Hirahara & Kato, 1992; Ménard et al., 2002).

While the equations (3) and (4) describe how the modulation amplitudes M1 and M2 corresponding to F1 and F2 can be obtained, they only factor out variation that corresponds to a regionally linear shift in the tonotopic dimension, which accounts for much of the differences between men, women and children. In order to factor out also tonotopically non-linear variation that is typical of expressive variation in vocal effort and of whispered vs. ordinary speech, but which may also be due to organic factors, the equations (5) and (6) have to be applied. However, the relative modulation amplitudes M1rel and M2rel still conserve the variation in articulatory distinctness, which may be due to expressive factors such as emotions (Kienast & Sendlmeier, 2000), but which can also be observed between male and female speakers, where it may be due to organic factors (Traunmüller, 1988; 2001; Simpson, 2001).

In order to perform a complete linguistic evaluation of the amplitude of a deviation, say in F2, it is not enough to relate the deviation to the maximal theoretical range, as in equations (5) and (6), but it is necessary to take the speaker’s current dynamic range of the formant frequency excursions into account. This is analogous to what equation (7) describes for f0.

M2ling = [z(F2) – z(F2c)] / σ[z(F2)] (8)

M1ling = [z(F1) – z(F1c)] / σ[z(F1)]. (9)

As in the case of M0ling (equation 7), the degree of accuracy required in adapting to the speaker's current articulatory distinctness depends on the linguistic use of the dimension, in this case for a possible tense/lax distinction. If variation in articulatory distinctness were factored out completely, this variable could no longer figure in a definition of speech sound prototypes. Since a prototype or “best exemplar” must be distinct, subjects are likely to rate more distinct stimuli as better representatives of their category and perhaps even as better copies of particular exemplars, which would account for the “hyperspace effect” (Johnson, Flemming & Wright, 1993).

In general, normalization is not equivalent to demodulation. Normalization of acoustic speech data typically involves a substitution of the paralinguistic information conveyed by the voice with a different one considered as ‘normal’ and does not result in a separation between the voice and its modulation. However, a self-normalization in which F1 and F2 are expressed in terms of their deviation (in units of σ) from their mean value (Gerstman, 1968; Lobanov, 1970) does not involve such a substitution and is equivalent to a demodulation if the mean value is representative of the carrier. While the latter holds approximately for F1 and F2 if the set of all vowels of a language is considered, the mean value of f0 is not representative of the carrier, except in completely monotonous speech. The equations (5) to (9) all involve the calculation of self-normalized quantities in which properties of speech produced by the speaker himself provide units of measurement that can be considered as a ‘norm’ that is valid at the instant in question.
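The Lobanov-type self-normalization mentioned here is easy to state in code. The sketch below assumes formant values already converted to bark for a single speaker; whether it amounts to a demodulation depends, as stated above, on the mean being representative of the carrier:

```python
import numpy as np

def lobanov(formant_bark: np.ndarray) -> np.ndarray:
    """Self-normalization in the sense of Lobanov (1970): express each
    formant value as its deviation, in units of sigma, from the speaker's
    own mean. If that mean is representative of the carrier, this is
    equivalent to the demodulation of equations (8) and (9)."""
    return (formant_bark - formant_bark.mean()) / formant_bark.std()

# Hypothetical F1 values (already in bark) for one speaker's vowel set:
f1 = np.array([3.2, 4.5, 6.8, 7.4, 5.1, 3.9])
print(lobanov(f1))
```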

The equations (1) to (9) can be used in testing the perceptual validity of the theory with synthetic or manipulated speech stimuli whose formant frequencies are known. This validity can be expected to break down when any Fc goes outside the range of a listener’s prior experience. The equations that involve formants should not be interpreted as immediately descriptive of percepts or of the steps involved in perception. While M0 corresponds to a prominent percept of pitch, M1 and M2 may just be descriptors of the spectral modulation that recommend themselves because of their simple relation with the acoustic variables F1 and F2.

Although listeners are very sensitive to small changes in the frequency positions of the formants, which are defined as the resonances of the vocal tract, these are often not very well represented in the acoustic signal. In the voiced segments of speech, the partials often stand out much more clearly than the formants. The transfer function of the vocal tract, which reflects the formants, is spectrally sampled by the partials. According to the sampling theorem, this allows a faithful representation of spectral detail only down to a resolution of 2 f0. Although formant frequency discrimination degrades with increasing f0 (Kewley-Port, Li, Zheng & Neel, 1996), listeners behave as if they did much better than the sampling theorem seems to allow (cf. de Cheveigné & Kawahara, 1999). They can do this on the basis of their implicit knowledge of the spectra of speech sounds. Since natural speech exposes listeners to exemplars produced with different and dynamically varying f0s, they are in a position to acquire gapless knowledge of the complete spectral envelopes that characterize different speech sounds. In this way, listeners will acquire templates that reflect spectrally uninterrupted transfer functions.
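
The spectral sampling at issue can be made concrete with a small sketch. The two-peak envelope below is an arbitrary stand-in for a vocal tract transfer function, not a vocal tract model; the point is only that voiced speech reveals the envelope at the partials k·f0, so that raising f0 thins out the available samples:

```python
import numpy as np

def toy_envelope(f_hz, p1=500.0, p2=1500.0, bw=120.0):
    """Arbitrary two-peak stand-in for a vocal tract transfer function."""
    return (1.0 / (1.0 + ((f_hz - p1) / bw) ** 2)
            + 1.0 / (1.0 + ((f_hz - p2) / bw) ** 2))

def sampled_by_partials(f0_hz, f_max=4000.0):
    """The envelope as conveyed by voiced speech: known only at the
    partials k*f0. By the sampling theorem, spectral detail finer than
    about 2*f0 cannot be faithfully represented by these samples."""
    partials = np.arange(f0_hz, f_max, f0_hz)
    return partials, toy_envelope(partials)

for f0 in (100.0, 400.0):            # adult male vs. child-like f0
    partials, amps = sampled_by_partials(f0)
    print(int(f0), "Hz:", len(partials), "samples below 4 kHz")
```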

Although the stored templates are spectrally complete, the matching that listeners can be assumed to perform must, in voiced segments, occur only at those tonotopic places where the actual partials are. Visual recognition of shapes drawn with dotted lines requires an analogous process. The spectral comparison does not involve any smoothing of the spectra of incoming sounds, but only a prior process of “tuning in” to the speaker’s voice. This can be modeled as a spectral warping of all the stored templates such that the points that correspond to the formants of a neutral vowel agree with those expected of the speaker’s voice in critical band rate and in auditory prominence. The warping of the templates can be modeled as piecewise linear, with the boundaries of the pieces defined by the low frequency end of the scale and by Fgc (or possibly f0c), F1c, F2c and F3c, as far as these are considered reliably known5. The warping has to be governed by the most reliable Fcs. It can further be assumed that the warping is subject to constraints given by the prior experience of the listeners (discussed in section 3.3). The properties of the carrier are described by the warping, which arises in the process of ‘tuning in’. The modulation is described by a multidimensional vector that includes a measure of similarity with each one of the stored templates at each instant in time. Formants and the modulation amplitudes expressed in the equations (3) to (6), (8) and (9) are useful in analyses, but they cannot be assumed to be directly accessible to listeners. They are not really perceived. Instead, a listener’s templates can be thought of as arranged in a multidimensional space of continuously scaled perceptually distinctive properties (dimensions and features) that emerge from the set of distinctive segments. The openness of vowels represents one of those properties. Each speech signal is represented as a trace in this multidimensional space. This representation still preserves the variation that is factored out by the equations (7) to (9); that factoring out can be accomplished in a subsequent stage of analysis.
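
A deliberately simplified sketch of this ‘tuning in’: stored templates are warped piecewise linearly so that their anchor points (here F1c, F2c and F3c of a neutral vowel) line up with those expected of the current voice, and the match with an incoming voiced spectrum is evaluated only at the tonotopic positions of the actual partials. For brevity the warping is done in Hz rather than in critical band rate, and all anchor values and spectra are invented for the example:

```python
import numpy as np

def warp_template(template_freqs, anchors_stored, anchors_speaker):
    """Piecewise-linear warp of a template's frequency axis, mapping the
    stored anchor points onto those expected of the current voice."""
    return np.interp(template_freqs, anchors_stored, anchors_speaker)

def match_at_partials(partial_freqs, partial_amps, warped_freqs, template_amps):
    """Similarity between an incoming spectrum and a warped template,
    evaluated only where partials actually are (no smoothing involved).
    Returns a negative mean squared difference: higher = more similar."""
    predicted = np.interp(partial_freqs, warped_freqs, template_amps)
    return -np.mean((np.asarray(partial_amps) - predicted) ** 2)

# Invented anchors: a stored (adult male) template and a smaller speaker.
stored  = [0.0, 500.0, 1500.0, 2500.0, 5000.0]  # low end, F1c, F2c, F3c, top
speaker = [0.0, 600.0, 1800.0, 3000.0, 5000.0]
tmpl_f = np.linspace(0.0, 5000.0, 200)              # template frequency axis
tmpl_a = np.exp(-((tmpl_f - 600.0) / 300.0) ** 2)   # toy template spectrum
warped_f = warp_template(tmpl_f, stored, speaker)

partials = np.arange(200.0, 5000.0, 200.0)          # voiced input, f0 = 200 Hz
p_amps = np.exp(-((partials - 700.0) / 350.0) ** 2)
print(match_at_partials(partials, p_amps, warped_f, tmpl_a))
```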

The approach described also allows the treatment of ‘misfits’ such as voiceless fricatives, for which it is questionable whether they constitute modulations of the voice. However, there is a problem concerning spectral warping in the frequency range in which these speech sounds predominate acoustically, i.e., above F3c. In relative terms, the total length of the vocal tract varies more, as a function of age, than the length of the front cavity, which is associated with F2’. The translation required in order to compensate for organic variation is therefore smaller here than for F1 and F2. Perception experiments have also shown that the phoneme boundaries between fricatives based on the front cavity resonance are not affected very much by a tonotopic translation of their vocalic context (Traunmüller & Krull, 1987) when this resonance is above the range of F2. On the other hand, it is also clear that this resonance is very sensitive to expressive variation in lip protrusion and retraction. How this is to be handled remains to be clarified.

5 It is also possible to describe the process on the basis of a logarithmic transform of frequency instead of a CB-rate transform. When the warped pieces are as limited in extent as the distance between subsequent Fcs, the choice of scaling affects the results only marginally.

Listeners are assumed to update the Fcs continuously. In this process, a slight bias arises due to attraction by the actual formants of the present and the immediately preceding sounds heard, in particular of the vowels. The susceptibility to this bias depends on the strength of additional information that may speak against the presence of a change in voice. All the estimated properties of the carrier, and also the estimated speaking rate, are assumed to be subject to an analogous process. This is assumed to occur in addition to peripheral sensory adaptation, which can have similar consequences, most noticeably concerning loudness and spectral prominence.
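
In a deliberately simple form, the updating and its bias might be modeled as below; the gain values are illustrative assumptions, with ‘change_evidence’ standing for the strength of information suggesting an actual change in voice:

```python
def update_fc(fc_estimate, observed_formant, change_evidence):
    """One step of continuously updating a carrier estimate Fc: the
    estimate is attracted by the formants of the sounds just heard.
    With little evidence for a change in voice (change_evidence near 0),
    the attraction remains a slight bias only."""
    gain = 0.02 + 0.3 * change_evidence        # illustrative values
    return fc_estimate + gain * (observed_formant - fc_estimate)

fc = 500.0                                     # current F1c estimate (Hz)
for f1 in (520.0, 530.0, 525.0):               # formants of incoming vowels
    fc = update_fc(fc, f1, change_evidence=0.1)
print(round(fc, 1))
```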

In order to distinguish reliably between organic, expressive and linguistic phonetic information, it can be helpful to consider also the temporal persistence of the phenomena. Even if this is done, in some cases it is not obvious how a phenomenon should be classified. Contrastive stress has been mentioned as a borderline case between linguistic and expressive phenomena, and some seemingly organic phenomena conceal their nature. The sociocultural variation in pitch (Bladon, Henton & Pickering, 1984; Bezooijen, 1995) is a case in point, and it is unclear to what extent the observed gender difference in articulatory explicitness is due to organic and to sociocultural factors. Such problems do not need to be resolved by listeners. Although recognition of the linguistic information requires a sufficiently precise demodulation, it does not require an interpretation of the properties of the voice. The voice does not even need to be human.

MDT embodies the general idea that the auditory system must adjust to a speaker’s speech for effective phonetic perception (Joos, 1948). Some technical systems of speech recognition in which overall speaker adaptation is used (McDonough, Schaaf & Waibel, 2004) come close to functioning like this. However, adapting to the speech of the speaker is not quite the same as adapting to the speaker’s current voice. These systems adapt to some linguistic phonetic peculiarities of speakers as well as to the effects of expressive, organic and perspectival factors without distinction. While this may be a desirable simplification in speech recognition devices, it does not model the human capability of distinguishing between dialectal variation and variation in voice.

2.1.4. Perception and production of speech

P4: Users of spoken language associate the modulation patterns of speech signals, whether perceived by exteroception or by auditory proprioception, with the kinesthetically and haptically sensible properties of their own speech gestures with the same linguistic phonetic quality.

This implies that the somatosensory feedback allows speakers to control their speech production appropriately if they strive to modulate their voice in a desired way.

P4 is meant to hold for language users with normally functioning sensory and neuromotor systems. It describes a connection that is most fundamental for the human faculty of speech. The association is meant to be bi-directional. After speech acquisition, it is assumed to be firmly established in the form of a neural mapping that is likely to be realized by echo neurons (Rizzolatti & Craighero, 2004) in the human brain. Such a mapping presupposes perspectival and presentational variation in exteroception and auditory proprioception to be factored out, as detailed in the preceding sections. This also requires filling the spectral gaps between the partials. Observations with infants, detailed in section 2.2, suggest such a mapping to be present in a rudimentary form already at birth, and it is assumed to be elaborated and fine-tuned by babbling and during the course of speech acquisition, when the sound is heard coincidentally with the motor activity performed in its production. Spectral gaps can be filled when pitch is varied while the articulation is kept constant. The mapping appears to be molded in a language-specific way by cumulative experience, as suggested by the perceptual exaggeration of the difference between stimuli that fall into different categories, which has been observed in investigations of “categorical perception”, and by the related “perceptual magnet effect” (Kuhl, 1991; Guenther & Gjaja, 1996). In the process of speech development, groups of echo neurons appear to become bundled together, each bundle representing a separate phonetic category. The bundles can be thought of as phonetically labeled. P4 describes a cognitive process and applies strictly only when the process of category formation is already completed. If “linguistically informative quality” is substituted for “linguistic phonetic quality”, it is also valid before that stage, but the latter choice ties in with the process of category formation. P4 implies an agreement between perception and production, which may be reflected in the observed covariation between production and perception of phoneme contrasts: the more accurately such contrasts are discriminated, the more accurately they are produced by a speaker (Perkell & al., 2006).

The relation between the variables mapped onto each other in accordance with P4 does not need to be a linear one but will in most instances involve sensory threshold effects as well as biomechanical and acoustic saturation effects (Fujimura & Kakita, 1979; Perkell, 1999; Perkell & al., 1997), which make the amplitude of acoustic modulations less sensitive to variation in neuromotor activity, e.g., when the wall of the vocal tract has been reached by an articulator or when another formant is approached. These saturation effects lend stability to speech communication, so that only extreme expressive settings or abnormalities in vocal tracts impede adequate acoustic modulation. P4 does not even require a one-to-one relationship between modulation patterns and articulatory gestures to hold in general. In some exceptional cases, highly similar modulation patterns can be obtained with alternative articulatory gestures. This can be exemplified by alternative articulations of the /ɹ/-sound of American English. In such cases, language users may fail to discover all the possibilities. If a speaker uses more than one alternative in such cases, it is necessary to assume a private difference in phonetic quality between them.
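
The saturating relation can be caricatured as follows; the tanh form and its parameters are arbitrary stand-ins for the biomechanical and acoustic saturation effects just named, chosen only to show how equal motor increments yield shrinking acoustic increments near the extremes:

```python
import math

def acoustic_modulation(motor_activity):
    """Caricature of a saturating motor-to-acoustic mapping: near the
    extremes (an articulator against the vocal tract wall, two formants
    approaching each other), the acoustic output barely responds to
    further changes in neuromotor activity."""
    return math.tanh(motor_activity)

for m in (0.5, 1.5, 2.5):                      # equal motor increments
    print(m, round(acoustic_modulation(m), 3)) # shrinking acoustic increments
```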

Due to non-linear relationships, it is not strictly true that the vocal tract area function will always deviate in the same way from a linguistically neutral one when a speaker strives for the same linguistic quality while adopting a different expressive setting. However, an investigation of this kind (Story & Titze, 2002) can still be used for gauging P4 in a qualitative way. It will be much more difficult to measure the proprioceptive sensations, if it can be done at all.

Neuromotor control based on somatosensory feedback is an essential feature of the DIVA model suggested by Guenther (1995). Its feedback loop allows compensation for impediments to modulation without neural delay. The effect of somatosensory feedback can be appreciated by considering the results of experiments with impeded articulation. In these, articulation was disturbed by a resistive load applied to the jaw (Folkins & Abbs, 1975) or to the lower lip (Abbs & Gracco, 1984; Munhall, Löfqvist & Kelso, 1994), or by use of a bite-block (Lindblom, Lubker & McAllister, 1977; Lindblom, Lubker & Gay, 1979; Fowler & Turvey, 1980). These experiments showed that an impediment to jaw movement was immediately compensated for by the tongue. The formant frequencies remained nearly unaffected as long as sensory information from the oral region was accessible to the speaker (Hoole, 1987). In these cases, a coordinative structure consisting of tongue and jaw appears to be engaged (Kelso & al., 1984), while the results obtained with an impediment to lip movement demonstrated that speakers use a coordinative structure consisting of upper lip, lower lip and jaw in producing a bilabial consonant. When the movement of one of the articulators that belong to such a coordinative structure is impeded, the others simply continue their action until the feedback tells the neural system that the target, in this particular case labial closure, has been reached.

The observations with impeded articulation contrast with formant frequency measurements in the same vowels produced by the same speakers at different levels of vocal effort (Frøkjær-Jensen, 1966; Rostolland, 1982a, 1982b; etc.). Although even in this case there must be complete closure in stops and nasals, which requires compensation for a variable position of the jaw, speakers do not appear to strive to maintain the same articulation and the same formant frequencies in the same vowels, as they do with a fixed mandible. This is a consequence of the altered linguistically neutral setting of the articulators (mainly of the jaw) that goes with a change in vocal effort but which is absent in experiments with impeded articulation. The somatosensory feedback loop allows adequate production of obstruents under this condition as well.

P4 imposes no restrictions on the carrier (the voice) or on the linguistically neutral setting of the articulators. Auditory feedback normally has an effect on the voice, as evidenced by the Lombard effect, but it is not required for the production of intelligible speech. The speech of adults with acquired deafness normally remains intelligible for many years.

2.2. Imitation and speech acquisition

When an infant says its first word, it demonstrates that it has acquired at least a rudimentary control over all the processes involved in the perception, memorization and production of speech. The infant must first have recognized how a person such as its mother, whose organ of speech is much larger than the infant’s own, had modulated her voice. It must further have stored a representation of the modulation in memory. Thereafter it must have modulated its own voice in a similar way. The similarity must be assumed to reside in those auditory and visual aspects of the utterance that can be perceived by another person. Invisible patterns of articulation are only indirectly relevant by their audible effects.

A capability of imitating visible postures and gestures is present among the great apes. They have a disposition for ‘aping’, but among primates only humans can imitate with their voice. When parrots imitate human speech (Patterson & Pepperberg, 1994), they attempt to copy speech signals with the voice quality preserved together with the linguistic quality. This is what impersonators also do, but it is not typical of the normal behavior of infants acquiring speech. Infants imitate only the modulation.

In humans, a rudimentary capability of imitating facial as well as manual gestures is present already in neonates, which “can equate their own unseen behaviors with gestures they see others perform” (Meltzoff & Moore, 1977). This astonishing capability cannot be understood unless we assume that there is an innate linkage between visual perception and motor control that allows “demodulating” visually perceived facial gestures and translating the modulation into the motor activity that is required in order to “modulate” one’s own face in a similar way. It is interesting, though, that babies do not imitate faces expressive of emotions as readily as other facial gestures (Kaitz, Meschulach-Sarfaty, Auerbach & Eidelman, 1988).

As for a linkage between visual perception and motor control in monkeys, we have neurological evidence in “mirror neurons”, which fire both when the monkey performs an action such as grasping, holding, or tearing, and when it observes another primate performing the same action (Rizzolatti, Fadiga, Fogassi & Gallese, 1996; Rizzolatti & Craighero, 2004). These mirror neurons were found in the monkey homologue of Broca's area, which is central to articulatory encoding in humans. The responses of mirror neurons are typically not affected by perspectival variation, such as variation in viewing distance. Perspectival and also organic variation has already been factored out at this neural level. The mirror neurons observed in monkeys were only responsive to transitive actions directed toward a visible object or a concealed object whose presence was known to the monkey. Mirror neuron activity in the human cortex has been observed in response to intransitive actions as well (Rizzolatti & Craighero, 2004). Pure imitative behavior, in which a visible gesture is imitated as such, requires mirror neurons that are responsive to intransitive actions.

In humans, there is also a linkage between auditory perception and speech motor control, and babbling appears to serve primarily the purpose of fine-tuning this linkage. It allows an optimal two-way mapping to be established between the positions of the articulators, as perceived by the tactile and kinesthetic senses, and the auditory modulation patterns of one’s own voice. In adults, this linkage shows itself in echolalia and in shadowing, an automatic ability of listeners to transform heard speech into vocal motor activity. Shadowing experiments have shown “that the early phases of speech analysis yield information which is directly convertible to information required for speech production” (Porter & Lubker, 1980). Spontaneous silent articulatory shadowing of heard speech can also be observed among healthy people. Although we have no direct neurophysiological evidence for “echo neurons” that would represent the link between auditory perception and speech motor control, there is convincing indirect evidence for their presence in humans (Rizzolatti & Craighero, 2004).

A linkage between visual and auditory perception of speech sounds has been shown to be present at an age of 20 weeks (Kuhl & Meltzoff, 1984; 1996). Babies of that age preferred to look at the matching face when the speech sound [a] or [i] was presented to them. Aldridge, Braga, Walton and Bower (1999) reported having observed similar patterns of intermodal interaction in neonates. This linkage is relevant here, since vision contributes to normal speech perception. Vision can even dominate over audition in the perception of labiality in consonants (McGurk & MacDonald, 1976) and lip rounding in vowels (Traunmüller & Öhrström, in press). The McGurk Effect has been shown to be present in prelingual infants, 4 months of age (Burnham & Dodd, 2004). Adults exposed to stimuli with conflicting audiovisual cues typically report hearing not only what they perceive with their ears but also the features they perceive with their eyes. This shows us that speech is represented primarily in the auditory modality (the only one referred to in P2) although there is an entrance for optic information. The kind of imitation neuron that could bring this about would be an echo neuron that is also responsive to visual input. Imitation neurons responsive to visual as well as to auditory stimuli are also mentioned by Rizzolatti and Craighero (2004).

While visual input is less essential for speech acquisition, an ability to associate auditory patterns of modulation with speech motor control (P4) is a necessary precondition for vocal imitation and provides the biological foundation for human communication by speech. Further reasoning along this line is also presented in Skoyles (1998). Studdert-Kennedy (1983, 2000) underlined that the imitation of sound patterns precedes the discovery of distinctive speech sounds by an infant in the process of speech acquisition.

Except for its linguistic component, the kind of process we observe in speech acquisition is not specific to spoken human language; the imitation of any bodily posture or gesture is analogous. In all such cases of imitation, there is a carrier (a body, hand, face, mouth, etc.) with a persistent underlying, typically skeletal structure. Postures and gestures can be considered as modulations of such a carrier or skeleton. After demodulation and self-normalization, imitators can transpose the modulation onto their own bodies, which may differ in size and proportions. The retinal representation of bodily postures and gestures, including the oro-facial gestures visible in speech, is subject to quite complex perspectival variation. It often varies drastically with the orientation of the carrier in three-dimensional space and with different viewing angles. In the case of acoustic speech signals, the effects of a speaker’s rotation in space are less drastic.

Visual perception of speech can be considered within the wider frame of human face perception. The perception of faces involves a comparison with a neutral reference, which appears to be present at birth, and we attach significance to deviations from the neutral shape. Recovering these deviations is equivalent to a demodulation. It is the essence of caricatures to show these deviations (the modulation) in an exaggerated fashion. However, the face of each person has its own organic quality, which has to be factored out in order to detect the expressive information (emotions and attitudes) and the linguistic information. The recognition of expressive information in facial expressions (Calder & al., 2000) requires processes analogous to those MDT describes. The observer has to “tune in” to the observed face and to evaluate deviations from its known or expected neutral appearance. This is, of course, also required in order to detect any visible linguistic information.

The analogies between speech acquisition and the imitation of bodily gestures suggest that the imitative behavior that is necessary for speech development may have had its phylogenetic origin in a preexisting disposition for ape-like mimicking, but this would rather have disposed our ancestors for sign language. A disposition for communication by manual gestures is actually present not only in humans but also in closely related species such as chimpanzees and bonobos, who regulate their interactions also by means of manual gestures (Plooij, 1978). The unique development of communication by speech in the human species presupposes an additional disposition for babbling and parrot-like vocal imitation. Such a disposition, involving echo neurons, is likely to have distinguished our early human ancestors from other primates.

With the phonetic labeling left aside, MDT may be applicable to vocal communication by several species. Variations in size, age and sex are reflected in the vocalizations of various mammals in a way analogous to humans, and there is a range of different expressive qualities the vocalizations can acquire. If the expressive quality is to be distinguished from the organic, the more permanent properties of the animal’s voice need to be factored out, and it may well be that several species are capable of doing this.

2.3. Implications for phonology

MDT considers speech to be voice modulated with phono-articulatory features, but it does not in itself involve any claims concerning what these features are in detail or how they interact with each other. The theory is, nevertheless, relevant to phonology and distinctive feature theory due to some of its comprehensive implications. Firstly, it requires regarding the role of properties descriptive of the carrier as fundamentally different from that of properties descriptive of its modulation and, secondly, it suggests perception as well as production to be considered in phonology.6

All the features of the carrier are to be considered as linguistically unmarked and incapable of conditioning a phonological process, in accordance with Harris & Urua (2001). Within a framework of single-valued features that can only be present or not present in a segment (Harris & Lindsey, 1995), a schwa whose articulatory target agrees with the speech rest position is simply to be considered as featureless. It is fully specified as soon as the presence of a carrier signal is specified. While there are between-language differences in the properties of true schwas and in speech rest position (Gick & al., 2004), MDT suggests that these should be regarded as conventional particularities in voice, which remain outside the domain of phonology although they can be relevant to language teaching. However, in some languages, the vowel that functions as a schwa appears to retain a trace of a full vowel as whose reduction product it can be seen, and in other cases, one of the full vowels is chosen to serve this function. Some between-language variation has also been observed in hesitation sounds (Eisen, 2001), which often are similar to schwas.

A case in which the systematic distinction between linguistic phonetic properties of segments and those of the carrier leads to a more adequate phonological description is the process of lenition (Harris & Urua, 2001), in which various articulatory properties become neutralized while the properties of the carrier remain. Consonant lenition and vowel reduction degrade phonetic information in the speech signal and this should preferably also be reflected in phonological representations (Harris, 2004).

The distinctive features of speech sounds have traditionally been described either in acoustic/auditory or in articulatory terms. The fact that the acoustic reflection of speech is common to speaker and listener is an argument for adopting acoustic/auditory definitions, while the fact that most phonological processes appear to have a biomechanical origin is an argument for choosing articulatory definitions, but neither argument counts against the other. P2 and P4 rather suggest that phonetic categories should be defined by a conjunction of properties of both kinds, more precisely based on

(a) speech signal properties as perceived, and
(b) sensible properties of speech gestures considered together with the biomechanical propensities of the articulators.

The properties considered under (a) are not only conveyed by audition, but also by vision, and they concern not only exteroception but also (auditory) proprioception. For labial features such as labial closure and lip rounding, the visible reflection of the articulation may be more important than the acoustic. For each feature, listeners appear to attach the highest weight to the modality that provides the most reliable information. The properties of type (b) are conveyed kinesthetically and haptically. This allows the fortis/lenis distinction, whose effect on vocal tract shape and acoustic signal is subtle, to be defined haptically. Although each distinctive feature is reflected in certain properties of type (a) as well as (b), these are not necessarily of equal importance or explanatory power in particular phonological processes.

6 This was not clear from the description of the theory in Traunmüller (1994), which contained no equivalent of proposition (4).

Although production (b) is normally mirrored in perception (a), the mirror image is typically distorted due to non-linearities and saturation effects. The resulting incongruence between perception and production would be likely to trigger a sound shift unless speech sounds or features occupy stable corners in property space. Any instability has the effect of driving speech sounds into stable corners, in the vein of Stevens’ (1972, 1989) quantal theory of speech, or at least into places of maximal distance from neighbors, as predicted by the theory of adaptive dispersion (Liljencrants & Lindblom, 1972; Lindblom, 1986); see also Lindblom and Engstrand (1989) and Johnson (2000). Most phono-articulatory continua have just one or two stable corners, whose stability is due to saturation effects, and this is reflected in the predominance of ‘single-valued’ or ‘binary’ phonological features.

In many cases, the articulatory side (b) appears to be more important in the sense that it provides the stable corners and possesses greater explanatory power for phonological processes. A notable exception is pitch level. In this case, an auditory definition appears to be required at least for some languages, since there are no saturation effects to fall back on when there are more distinctive levels of pitch than just “high” as opposed to the “low” that has to be regarded as neutral within the frame of MDT.

A similar state appears to prevail concerning the vowel height dimension. Even in this case, there are no saturation effects to fall back on when there are more distinctions than just “high” and “low” as opposed to “neutral”. In this case it may appear problematic to propose an auditory definition since a risk of circularity can be suspected when the perception of this dimension has to rely on reference to the vowels, as concluded in section 2.1.3. However, there is no real circularity since the height of each vowel can be specified by reference to that of other vowels in each case.

3. Discussion

3.1. Reanalysis of intrinsic and extrinsic factors in perception experiments

The results of different investigations exploring the effects of intrinsic f0 and F3 on the perception of the information conveyed primarily by F1 and F2 did not always agree with each other. It can now be seen that some of the discrepancies were due to the presence or absence of extrinsic cues that affected the ability of listeners to tune in to the speaker’s voice. Experiments using mixed or blocked voices, and randomized or ordered presentation of stimuli are very different in this respect. In other cases, the presence or absence of additional congruent or incongruent intrinsic voice cues affected the results. The full effects of f0 and F3 can only show up when there is no incongruence among cues. The smaller effects reported by Nearey (1989) could be predicted when the probable expectations of the listeners were taken into account. An analysis explicitly based on MDT is equivalent to the one followed by Traunmüller (1990) in the reanalysis of Nearey’s results.

The effects of f0 and F3 are not always noticeable. Nusbaum and Magnuson (1997) characterized speech perception as “a flexible process in which listeners only pay attention to f0 and F3 when there is talker variability”. They based this characterization on results obtained by Johnson (1990) and on their own observations. This contrasts with the prediction by MDT that f0 and F3 will be fully effective when there is no variation in voice, while the observable effect will be reduced and blurred when these cues vary unpredictably between subsequent stimuli in an experiment and whenever discrepant cues to the speaker's voice are present. However, as long as the frequency positions of f0 and F3 agree with those the listeners would expect even without access to this information, the effects of f0 and F3 will not be visible. While this explains the observations by Nusbaum and Magnuson (1997), it is at variance with their conclusion.

The effects of raising or lowering the formants in an introductory phrase on the perception of the vowel in a following monosyllabic test word (Ladefoged & Broadbent 1957) can be understood as due to the assumptions about the carrier which the listeners establish while hearing the introductory phrase. Thus, when F1 in all the vowels of the introductory phrase had been raised, the listeners assumed a higher F1c, indicative of a more open setting of the speaker’s vocal tract although f0 remained low. The vowels of a following test word were subsequently heard as less open than in the original case. An explanation of this kind also applies to the results obtained by Ainsworth (1975) and Nearey (1989) as detailed in Traunmüller (1990). A modification of the expressive and/or organic quality of an introductory phrase will affect the perceived linguistic phonetic quality of a subsequent test word, while a modification of its linguistic phonetic quality, such as in the experiment by Broadbent and Ladefoged (1960), will not, since it does not alter the listeners’ assumptions about the carrier.

In several investigations it was observed that the percentage of errors in phoneme identifications was larger and the intelligibility of isolated spoken words was reduced when several talkers were used instead of just one. This was found to hold at several signal-to-noise ratios (Strange, Verbrugge, Shankweiler & Edman, 1976; Mullennix, Pisoni & Martin, 1989; Ryalls & Pisoni, 1997). According to MDT, this is to be expected at any S/N-ratio since listeners need some exposure in order to optimize their tuning in to a new voice. With words, the effects were not very large, since most words contain enough voice information to allow a reasonable tuning. Listeners also need some exposure in order to recognize and adjust the pace of their internal clock to a changed speaking rate, and this is why words produced at different speaking rates are more poorly identified than the same words produced by one speaker at a constant speaking rate (Pisoni, 1993).

In an investigation of the detection of vowels, consonants and words, Nusbaum and Morin (1992) observed increased response times when talkers were mixed instead of blocked, but they observed no increase in response time when mixed talkers that were discriminable had similar frequency positions of f0 and F3. This result illustrates that listeners need no extra time if they are already tuned in to a sufficiently similar voice.

Particularly informative is experiment 5 of Nusbaum and Morin (1992). The stimuli used in this experiment, with different groups of listeners, were the following: (a) eight natural vowels from two male and two female talkers, (b) noise-excited versions of these, obtained by LPC analysis and resynthesis, (c) low-pass filtered versions of (a), in which F3 and the higher frequency components were suppressed, and (d) similarly filtered noise-excited versions. The stimuli were presented to the listeners in blocked and in mixed talker conditions. In the blocked talker condition, almost all the natural, noise-excited, and filtered vowels were accurately recognized. For the voiced vowels, the recognition rate was only marginally and not significantly lower in the mixed talker condition. The noise-excited vowels were recognized significantly less accurately when talkers were mixed. The difference was particularly large for the filtered noise-excited vowels. These results demonstrate that listeners are not dependent on the intrinsic presence of f0 and F3 when their tuning does not need to be changed. Retuning is only necessary for the first one or two stimuli in each block when stimuli are blocked by speaker. The lower recognition rate obtained with noise-excited stimuli must, in part, be ascribed to the inappropriate formant frequencies, since these are normally higher in whispered speech. This also holds for a similar experiment by Katz and Assmann (2001), their third one. However, Halberstam and Raphael (2004) used naturally whispered vowels and obtained results that were essentially equivalent to those of Nusbaum and Morin (1992). They pointed out that F3 contributes to the recognition of whispered vowels, though much less than f0 does in phonated speech. Disregard of F3 is possible due to redundancy in cases in which there is no expressive variation in f0c, and listeners’ preferential reliance on f0 can be understood as due to its greater salience as compared with F3.

Johnson (1990) investigated the English [hʊd]-[hʌd] boundary for stimuli with f0=100 and 150 Hz with and without a carrier phrase “This is …”, with rising and falling intonation. According to Johnson, the results “suggest that perceived speaker identity is a better predictor of vowel normalization effects than is intrinsic f0”. If we substitute “demodulation” for “normalization effects”, this is compatible with MDT, but this theory also allows taking variation in the same speaker's voice into account. The pretest described in the same paper illustrates the role of the base value of f0 for speaker perception. It showed that listeners were most likely to perceive the speaker to be the same when the minima of the f0-contours of two utterances were identical. This is in accordance with the observation that speakers keep their f0-lows close to constant when they vary their pitch span (Bruce, 1982; Traunmüller & Eriksson, 1994).

The role of various kinds of information about speaker gender in vowel perception was investigated in an experiment by Johnson, Strand and D'Imperio (1999) in which the effects of several factors were shown very nicely. In three experiments, these researchers investigated the [hʊd]-[hʌd] boundary value of F1 in series of synthetic stimuli in which F2, F3 and F4 were kept unmodified. In their first experiment, they found that the gender of a visually presented face affected the location of the phoneme boundary. They also found that voice stereotypicality had an effect on the phoneme boundary. The female/male frequency shift of the boundary value was greater when the voices were rated by listeners as more “stereotypical”. In another experiment (their third one), they just instructed the listeners to imagine a male or female talker while performing an audio-only identification task with a gender-ambiguous continuum of stimuli, and they found a small boundary shift as a function of the imagined gender of the talker. Apparently not being aware of MDT, these researchers concluded that their results demonstrated “perceptual talker normalization as opposed to vocal tract normalization and radical invariance”. According to MDT, reference to the talker is not always sufficient, since it does not capture any possible expressive variation in a talker’s voice, but the reasoning followed by Johnson et al. (1999) does conform to the other tenets of MDT.

The effects of non-linguistic visual and cognitive information on vowel categorization can hardly be explained on the basis of a theory that does not take the listeners’ expectations into account, but they follow naturally from MDT, which requires listeners to tune in to a talker’s voice and does not disallow visual and cognitive information from having an effect in this process. The shift observed by Johnson et al. (1999) at the [ʊ]-[ʌ] boundary was smaller than the ordinary female-male difference. This was to be expected since the unchanged intrinsic voice cues were salient and accordingly restraining.

Analyses of between-speaker differences in the formant frequencies of vowels (Traunmüller, 1984) and results of vowel perception experiments (Fahey, Diehl & Traunmüller, 1996) suggested that the weight listeners attach to the distances between spectral landmarks decreases with increasing spectral distance. While not immediately evident, this is compatible with MDT: Where the distance between formants is large, such as between F2 and F1 in front vowels, the flexibility of the process of tuning in to the voice (warping of templates) will allow ascribing most of a formant frequency variation to the carrier rather than to its modulation. If there is a similar variation (in barks) where the distance is small, such as between F2 and F1 in back vowels, this would correspond to a more substantial change in voice and require a more drastic warping of templates. The sensitivity to variations in between-peak distances appears to be more constant when the variation is expressed as a percentage, and on this basis it is also predicted to be so.

3.2. Reanalysis of observations concerning perception of voice

Although the sound pressure of vowels varies as a function of their linguistic phonetic quality (intrinsic level), this variation should not affect perceived vocal effort or speaker closeness if these percepts are based on the properties of the carrier and not directly on those of the speech signal. This is in agreement with the experimental finding that judgments of “loudness” (perceived vocal effort) correlate more closely with the subglottal pressure at which syllables were produced than with their SPL (Ladefoged & McKinney, 1963). An analogous result, although not quite free from interference, was obtained in an experiment by Eriksson and Traunmüller (2002), in which vowels that varied in vocal effort and presentation level were presented to listeners who had to estimate both vocal effort (apparent distance between speaker and addressee) and speaker closeness.

The difference between effects of voice and coarticulation showed itself clearly in several experiments done by Mann and Repp (1980), who used synthetic stimuli consisting of fricative noises from a [ʃ]-[s] continuum preceding [a] or [u]. Their listeners perceived more instances of [s] in the context of [u] than in that of [a], but the effect of the vowel was reduced when a gap had been introduced between the consonant and the vowel. There was also a speaker gender effect on the perception of the fricative. The magnitude of this effect was not reduced by the introduction of a gap. If the gender effect is due to the voice, as suggested by MDT, it should persist as long as the listeners perceive no change in voice. Introduction of a gap does not bring about such a change. However, the aforementioned effect of the vowel was not a voice effect. Instead, it can be understood as an adaptation to the effects of coarticulation: In the context of a following [u], the spectral boundary between [ʃ] and [s] is at a lower frequency than in the context of a following [a]. Cases like this can equally well be interpreted as reflecting a holistic perception of the syllables. When a gap had been introduced, the fricatives were perceived more as if produced in isolation.

In an experiment with several talkers and several words by Mullennix and Pisoni (1990), subjects were required to attend selectively to the talker’s voice or to the word that was said. The results showed a significant amount of mutual interference, but this was asymmetrical. Variation in linguistic-phonetic quality did not impair performance in voice recognition as much as variation in voice impaired word recognition. This can also be understood on the basis of MDT, according to which listeners depend on tuning in to the speaker’s voice in order to be able to recognize its linguistic-phonetic modulation reliably. The recognition of the voice is not so crucially dependent on a correct interpretation of its modulation, which can be said to belong to the next higher level of analysis.

Van Lancker, Kreiman and Emmorey (1985) observed that listeners were able to recognize familiar speakers also when their speech had been time-reversed. This is to be expected if they identify the speakers on the basis of the carrier, whose properties are only marginally affected by time reversal. This presupposes that the properties of the carrier can be estimated in the absence of a reliable linguistic demodulation. MDT does not suggest that listeners identify speakers only on the basis of their voice. In ordinary speech, the personal peculiarities in the modulation pattern, which reflect details in linguistic phonetic quality (the speaker’s idiolect), are also likely to contribute substantially to speaker recognition. The “personal quality” of a talker has linguistic as well as organic components and also habitual expressive components. The use of the term “personal quality” by Traunmüller (1994) and by others in previous studies in the sense of “organic quality” was not fully adequate.

Traunmüller and Bezooijen (1994) investigated the auditory perception of children’s age on the basis of speech that had been modified in agreement with MDT to simulate whispered and phonated speech of children 5 to 11 years of age. While there was a bias towards an average age and a contribution of “verbal maturation” (linguistic quality), the results showed the formants (together with speaking rate) to have a slightly higher weight than f0. Therefore, the age ratings of whispered speech were also quite consistent.

Speech with the same essential information (Fns) as in whispered speech can also be obtained by substituting sine waves for the formants of natural (preferably whispered) speech. Remez, Fellowes and Rubin (1997) reported on experiments with stimuli of this kind. They found that listeners were still able to identify the talkers, albeit with reduced accuracy. An analysis based on a dichotomy between laryngeal voice properties and non-laryngeal “phonetic” properties suggested that “the phonetic properties of utterances serve to identify both words and talkers”. According to MDT, a signal cannot be perceived as speech unless it is interpreted as a carrier modulated in a speech-like way. This implies that organic properties can be attributed to the perceived “voice”, while there may also be talker-specific idiolectal information in the modulations. Experiments like this one show that demodulation is possible even when the carrier has properties that are atypical of human voices. The carrier can still be considered as a voice, albeit an unnatural one.

3.3 Limitations and desirable extensions

The flexibility of the sense of hearing is limited in the sense that tuning in to a new voice will only succeed easily when its characteristic frequencies (f0c, Fnc) are within the range of voices experienced previously. Attending to a voice that deviates more than that requires at least some training. This is, e.g., the case when divers’ helium speech is to be understood. There are probably limits also to what can be achieved by such training, but these limits are not yet well known.

From experiments with recorded vowels that were reproduced at a reduced speed (Chiba & Kajiyama, 1941; also Assmann & Nearey, 2003) it can be concluded that listeners (adults) do not expect the formants ever to be lower than they are in speech produced by men at moderate vocal effort, even if the intrinsic cues suggest them to be. A related observation concerns the reduced importance of f0 for f0 < 150 Hz, especially in high vowels (Traunmüller, 1984; Di Benedetto, 1987, 1994). This may be due to a lack of experience with speech of giants with still lower formant frequencies. With a spectral warping model such as the one described in section 2.1.3, such limitations can be simulated by constraining the process of spectral warping.

While it is clear that perceptually distinctive spectral features cannot be strictly equated with modulation amplitudes of single formants, the percept of pitch is very closely associated with f0. However, pitch is affected marginally but significantly by vowel quality (Chuang & Wang, 1978; Stoll, 1984; Hellström, Aaltonen, Raimo & Vilkman, 1994). When f0 is the same, the perceived pitch decreases from [a] to [i] to [u]. This suggests that listeners attach some weight to F1 and apparently also to F2’ (Pape, 2005) when making pitch judgments. The effect is too small for the purpose of compensating for the intrinsic f0 differences between vowels, and these have only been shown to correlate (negatively) with F1.

There is also a perceptual interaction between formant frequencies and pitch span. In experiments with manipulated male and female speech in which the average or base-value of f0 was the same, it was observed that the f0-excursions needed to be larger in the female version than in the male in order for the syllables to be perceived as equally prominent (Gussenhoven & Rietveld, 1998) or for the speech to be perceived as equally lively (Traunmüller & Eriksson, 1995). Listeners appear to evaluate f0-excursions with respect to the spectral space that is available below F1c. Given the same f0, this space is wider in female speech. While this suggestion is in the spirit of MDT, it is not captured by the equations in section 2.1.
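
A hedged sketch of how this suggestion might be formalized: the size of an f0-excursion, in critical band rate, is taken relative to the tonotopic space between the f0 base value and F1c. All numerical values are invented; with the same excursion, the higher (female-like) F1c yields a smaller relative excursion, in line with the observations cited:

```python
def bark(f_hz):
    """Critical-band-rate transform (Traunmüller, 1990)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def relative_excursion(f0_low, f0_high, f1c):
    """Speculative measure: the f0-excursion (in Bark) relative to the
    tonotopic space available between the f0 base value and F1c."""
    return (bark(f0_high) - bark(f0_low)) / (bark(f1c) - bark(f0_low))

print(relative_excursion(180.0, 280.0, f1c=450.0))  # lower (male-like) F1c
print(relative_excursion(180.0, 280.0, f1c=600.0))  # higher (female-like) F1c
```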

A major problem that is not treated in the present paper concerns the effects of variation in speaking rate. This is more difficult to capture than the amplitudes of modulations in f0 and in the formant frequencies since, in addition to the properties of the air-filled vocal tract, it also involves the biomechanical properties of the articulators and of the jaw. The duration of speech segments is largely given by the acceleration and inertia of these more or less massive structures. Due to this variation, substantial deviations from linearity can be observed when aligning the segments of the same utterances produced at different speaking rates.

Another problem that remains to be treated concerns audiovisual integration. It is clear that audiovisual integration requires prior demodulation in both modalities, since perspectival and presentational variation have to be factored out before audiovisual integration can occur, but the further details are not so clear. The fuzzy logical model of perception (Oden & Massaro, 1978; Massaro & Stork, 1998) allows remarkably successful prediction of audiovisual phoneme perception. However, its predictions are based on auditory and visual phoneme perception, not on properties of speech signals. Further, the model treats the visual information as on a par with the auditory and describes the result as amodal, while subjects exposed to incongruent audiovisual stimuli report an auditory percept affected by optic input and a separate visual percept affected by acoustic input (Traunmüller, 2006). This is compatible with MDT, but remains to be modeled in detail.

3.4. Human and non-human perception

Although the faculty of communicating by means of speech is only present in humans, other species share some of the components on which this biological innovation rests. However, other animals with a laryngeal and oral morphology similar to that of humans lack the ability to control their sound production in the way that would be required for the production of speech. Therefore, they cannot associate perceived speech with any corresponding articulations, and they cannot have the same concepts of speech sounds that humans have. Those birds that show an ability to copy human speech acoustically undoubtedly associate human speech sounds with articulations of their own. Since the sound production mechanism of birds is quite different from that of humans, the same must also be assumed of their linkages between perception and production of sounds.

Although there have been experiments in which the perception of human speech sounds by other species has been investigated, it has not been common to consider speech and similar communicative vocalizations as modulated voice. Questions such as whether these animals demodulate vocalizations and, if so, in what way they do it and whether this is more or less human-like have not yet been asked appropriately.

Baru (1975) had trained dogs to discriminate between synthetic male versions of [i] and [a], and the animals promptly generalized the distinction to female versions. A similar observation was made by Burdick and Miller (1975) with chinchillas. When these had been trained to distinguish between [ɑ] and [i], as produced by one speaker, they were able to make the same distinction among tokens produced by many different speakers. Similarly, Kojima and Kiritani (1989) observed that chimpanzees were able to separate variation in vowel quality ([a]/[i], [i]/[u], [u]/[a]) from variation in speaker sex. The results of these experiments showed that the animals perceived similarity between the male and female versions of the vowels. Observations like these were taken to suggest that the animals have a capability of ‘normalizing’ speech (abstracting away variations in voice). However, discriminating auditorily between men and women requires no more than a rudimentary ability of pitch perception, and the formant frequencies and gross acoustic spectra of the extreme vowels chosen in all the mentioned experiments were much more different than those of the same vowels produced by male and female speakers. Since no more than an ability of gross acoustic pattern matching is required to explain the results, these investigations remain inconclusive concerning the question of whether these animals possess a capacity of factoring out the presentational variation in human speech.

Reports on observed between-species differences have not always been conclusive either. It has been reported that monkeys attach comparatively more weight to F1 and less to the higher formants than adult humans do (Sinnott, Brown, Malik & Kressley, 1997). However, a similar observation had been made when human infants were compared with adults (Lacerda, 1992). A study by Sinnott and Williamson (1999) has shown Japanese macaque monkeys to be unable to categorize voiced stop consonants according to place of articulation using formant transitions alone. This is more interesting since human infants appear to pay even more attention to formant transitions than adults do (Nittrouer, 1992).

Most noteworthy within the present context is a study by Polka and Bohn (2003), which compared asymmetries in vowel perception by blackbirds, cats and human infants. This study allows the conclusion that blackbirds and cats do not perform a demodulation of the kind observed in adult humans, while infants do. For cats and blackbirds, an increase in F2 was easier to notice than a decrease, but for humans, a change from a more schwa-like to a more colorful vowel quality was easier to notice than a change in the opposite direction, even when this involved a decrease in F2. It appears that humans, cats and blackbirds all detect a change in sound quality more easily when the second sound is more salient than the first one. An asymmetry like this is suggested by the substantially greater sensitivity of humans to increases than to decreases in prominence, as observed in the jnds of a formant in the vicinity of a more prominent formant (Chistovich & Lublinskaja, 1979). However, in blackbirds and cats, salience is likely to increase when F2 increases, while in humans, it is likely to increase when the amplitude of the spectral modulation increases.

Although we must expect the perception of human speech by any non-human species to differ in some ways from that by humans, it is still reasonable to expect that some species share the ability of perceiving expressive and organic aspects of speech in a more or less human-like way. Successful cross-species communication on the non-linguistic level is, after all, quite common.

3.5. Relations with other theories

3.5.1. Theories of speech perception and production

The Motor Theory of speech perception (Liberman, Cooper, Harris & MacNeilage, 1962; Liberman, Cooper, Shankweiler & Studdert-Kennedy, 1967; Liberman & Mattingly, 1985) and the Direct Realist Theory of speech perception (Fowler, 1986, 1994) both presuppose an unspecified process that allows listeners to derive the phono-articulatory activity that underlies a speech signal. Instead of the sounds, these theories consider their production to be perceived.

In the Motor Theory, the percept is represented by the neuromotor commands that the listener would activate himself if he were producing the perceived utterance. This suggests that the unspecified process could involve an analysis-by-synthesis procedure (Stevens, 1960). The process of spectral matching by selective attention to partials, described for a model based on MDT in section 2.1.3, is also, to some extent, similar to a process of analysis-by-synthesis, but its templates represent prototypical auditory patterns. In the process in which these templates are acquired, MDT, too, suggests the speaker’s own production to be involved, and more specifically the somatosensory feedback from it. The discussion of the Motor Theory has often been focused on the properties of phonetic concepts rather than percepts. It suggests all distinctive features of speech sounds to be defined by way of their production. MDT does not require this, but it is akin to the Motor Theory in so far as the patterns of modulation that it assumes to characterize speech can be thought of as associated with the listener’s own phono-articulatory activity as well as representing the result of auditory processes.

According to Fowler’s Direct Realist Theory, the speech gestures constitute the immediate percepts and their perception is direct. It is assumed that listeners in some way detect acoustic information about articulation. It is true that we often immediately perceive the events that gave rise to a sound that we hear. People may report, e.g., that they hear a ping-pong ball bouncing between rackets and the surface of a table, but auditory perception is not restricted to such cases. We can, e.g., recognize a melody even if we have no idea of the instrument by which it was produced.

The assumption that the perception of speech is direct in the sense of Gibson (1966) implies that no intermediate cognitive steps are involved and no hypothesis-driven processing is required. Neither this nor the ‘realist’ assumption, according to which the objects of perception and cognition exist independently of the mind, implies that the speech gestures have to be perceived instead of the modulations of the voice. The creation of the gestural theories rather reflects a lack of awareness of the second alternative, according to which the ‘objects’ that speakers produce for listeners to perceive are the modulations of their voice. The self-normalized properties of these modulations reflect the linguistically informative quality in a realistic way, i.e., free from presentational and perspectival variation. A speaker’s gestures, as conceived of in Fowler’s theory, do not reflect the linguistically informative quality directly and realistically, since they are subject to expressive variation. As for the role of cognitive factors, MDT does not disallow such factors from contributing to the process of tuning in to a voice. Such an influence, although not a large one, has been observed in the experiments by Johnson et al. (1999).
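One way to make the notion of 'self-normalized properties of the modulations' concrete is to express an observed formant as its deviation, on a tonotopic scale, from the value expected of a linguistically neutral vocalization by the same voice. The sketch below uses an analytic approximation of the Bark scale due to Traunmüller (z = 26.81 f / (1960 + f) - 0.53); the neutral formant values are invented placeholders, so this illustrates the idea rather than the theory's actual machinery.

```python
# Sketch of a "self-normalized" (demodulated) property: the deviation, on a
# tonotopic (Bark) scale, of an observed formant from the value expected of a
# linguistically neutral vocalization by the same voice.

def hz_to_bark(f_hz: float) -> float:
    """Analytic approximation of the tonotopic Bark scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def demodulate(observed_hz: float, neutral_hz: float) -> float:
    """Modulation depth in Bark relative to the carrier's neutral setting."""
    return hz_to_bark(observed_hz) - hz_to_bark(neutral_hz)

# The same [a]-like gesture by an adult male and by a child (placeholder
# figures): very different absolute F1, but self-normalized deviations of a
# similar order.
print(demodulate(700.0, 500.0))   # adult male: F1 700 Hz vs neutral 500 Hz
print(demodulate(1000.0, 730.0))  # child: F1 1000 Hz vs neutral 730 Hz
```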

Accepting that the objects of normal speech perception by listeners who are also able to speak are auditory does not imply an assumption that phonetic units as concepts could be described exhaustively in auditory terms. The knowledge of how the speech sounds are produced must be considered part of a language user’s phonetic concepts. If a syllable-final plosive is identified with a syllable-initial plosive that has no distinctive acoustic properties in common with it, it is reasonable to assume that the identification is based on the similarity of the articulatory gesture. Most other effects of context and the biomechanical saturation effects can also be described more adequately in articulatory terms. It is fully in line with MDT to consider speech as audible gestures (Löfqvist, 1990), but this is primarily descriptive of the relation between the signal and its production - not its perception.


MDT has a more solid foundation and also a more comprehensive scope than the more widely noticed theories mentioned above. As for perception, it is primarily concerned with the perceptual processes that remained completely unspecified in those theories. Having specified these processes, we can now see that there is no need for a detour via speech production in the process of auditory speech perception, although an associative mapping between proprioception of speech gestures and exteroception of modulation patterns is assumed to be present and to be essential for speech acquisition. This linkage may also cause some interference between perception and production in both directions. This mapping, which is likely to be realized by echo neurons in the human brain, presupposes perspectival and presentational variation to be factored out by listeners. Since perspectival and organic variation is not reflected in an individual’s speech production, these kinds of variation remain outside the scope of theories that refer solely to speech production.

MDT requires production to mirror perception, but it does not require motor processes to be engaged in normal speech perception. On the other hand, it requires sensory processes to be engaged in the production of speech. It is well established that somatosensory feedback contributes essentially to articulation and that there is also a role for auditory feedback in normal speech production. Therefore, MDT is more in line with sensory theories of speech production, such as the “auditory-motor theory of speech production” suggested by Ladefoged, DeClerk, Lindau and Papçun (1972), Tillmann’s (1980) model of re-afferent control, and the DIVA model (Guenther, 1994, 1995; Guenther, Ghosh & Nieto-Castanon, 2003), than with motor theories of speech perception. MDT relates only indirectly to the initial stages (efferent neural signals) in the execution of speech gestures and their biomechanical conditions, but it couples to the neuromotor control involved, which is also central to the DIVA model. On this basis, the DIVA model and MDT complement each other. As for perception, MDT likewise does not describe all the processes involved, but it captures some of the essentials of the processing up to the level of analysis at which the ‘realist’ properties descriptive of the speech perceived are obtained.

As for the role of different kinds of feedback in speech production, MDT suggests that speech motor control relies immediately on somatosensory feedback. Auditory feedback can only have a monitoring function, because it is not fast enough. This monitoring function is of particular importance in speech acquisition. In addition, auditory feedback has a function in controlling voice production, which does not require as fast an action as the control of the linguistically informative modulations of the voice. In this connection, it is informative to consider an experiment by Houde and Jordan (1998). In this experiment, speakers whispered /CɛC/ syllables (“C” = any consonant) while receiving auditory feedback in which the formant frequencies were slowly shifted during the course of the experiment. The speakers were observed to adjust their speech production to this condition. It is crucial that the effect generalized not only to other consonantal contexts but also to other vowels. This suggests that the speakers adjusted the neutral setting of their vocal tract, and thereby their voice, rather than its modulation. This distinction has yet to be incorporated into the DIVA model (Guenther, Ghosh & Nieto-Castanon, 2003).
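The interpretation offered here can be stated schematically: if each vowel target is realized as a neutral setting (the carrier) plus a vowel-specific modulation, then a compensation applied to the neutral setting shifts all vowels alike, which is what the observed generalization suggests. The sketch below is my own toy formulation with invented numbers, not the design of the Houde and Jordan (1998) experiment or of the DIVA model.

```python
# Toy formulation of the carrier/modulation distinction in adaptation:
# a vowel target is the (possibly adjusted) neutral setting plus a
# vowel-specific modulation. Adjusting the carrier generalizes to all
# vowels; adjusting one vowel's modulation would not. Numbers are invented.

NEUTRAL_F1 = 500.0                                    # Hz, assumed neutral setting
MODULATION = {"ɛ": +150.0, "i": -200.0, "a": +250.0}  # per-vowel offsets (Hz)

def produced_f1(vowel: str, carrier_shift: float = 0.0) -> float:
    """Target F1 = (possibly adjusted) carrier + vowel-specific modulation."""
    return NEUTRAL_F1 + carrier_shift + MODULATION[vowel]

# Compensation learned on /ɛ/ under shifted feedback, applied to the carrier:
shift = -80.0
for v in MODULATION:
    print(v, produced_f1(v), "->", produced_f1(v, carrier_shift=shift))
# Every vowel moves by the same -80 Hz: generalization across vowels, as
# observed by Houde and Jordan (1998).
```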

3.5.2. Theories of categorization and memory

Among contemporary theories of concept representation and categorization, Exemplar Theory (Medin & Schaffer, 1978; Hintzman, 1986) competes with Prototype Theory (Posner & Keele, 1968; Reed, 1972). These theories, which do not need to be regarded as mutually exclusive, originated from considerations of the nature of categories and categorization in the abstract, i.e., without regard to perspectival and presentational factors. The retinal representations of the shapes drawn on paper, which were often used in investigations within this field (Reed, 1972; Smith & Minda, 1998), and their variation due to perspective (viewing distance and angles) and presentation (size, surface curvature, etc.) were not a matter of concern.

According to Prototype Theory, categories are represented in memory by stored prototypes whose features are abstracted from exposures to many different exemplars; the features describe what is typical of members of a category. Categorization requires that there is sufficient agreement between the discernible features of a stimulus and those of the prototype - and less agreement with competing prototypes. Particularly in visual perception, it is often the case that some of the typical features are hidden due to perspectival factors. In order to obtain a prototype of a speech sound, it is necessary to factor out all variation in perspective and in presentation, i.e., in voice. In speech, prototypes appear to have a function at least in production, as evidenced by the highly self-consistent behavior of speakers.

According to Exemplar Theory, any experience is represented in memory as a configuration of primitive perceived properties. These are ‘realistic’ properties descriptive of the experiences and objects perceived rather than of the excitation they evoke in the sensory organs. The overall similarity of any two experiences can be related to their distance in a multidimensional space of primitive properties. Categories can be thought of as represented in memory by clouds of exemplars that are sufficiently similar in their relevant properties. As for speech, MDT describes how the linguistically relevant primitive properties are obtained in the face of perspectival and presentational variation.
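The contrast between the two accounts can be sketched in a shared multidimensional property space: Prototype Theory classifies by proximity to an abstracted central tendency, Exemplar Theory by summed similarity to all stored traces. The following minimal sketch uses invented two-dimensional data and a simple exponential similarity function; actual models (e.g., Hintzman, 1986) are considerably richer.

```python
import math

# Minimal contrastive sketch of prototype vs. exemplar categorization in an
# abstract two-dimensional space of 'realistic' perceived properties.
# Data points are invented.

EXEMPLARS = {
    "A": [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9)],
    "B": [(3.0, 2.8), (3.2, 3.1), (2.9, 3.0)],
}
# Prototypes as abstracted central tendencies of the exemplar clouds:
PROTOTYPES = {c: (sum(p[0] for p in pts) / len(pts),
                  sum(p[1] for p in pts) / len(pts))
              for c, pts in EXEMPLARS.items()}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def categorize_prototype(x):
    """Prototype Theory: the nearest stored prototype wins."""
    return min(PROTOTYPES, key=lambda c: dist(x, PROTOTYPES[c]))

def categorize_exemplar(x, sensitivity=2.0):
    """Exemplar Theory: summed similarity to all stored exemplars wins."""
    def support(c):
        return sum(math.exp(-sensitivity * dist(x, e)) for e in EXEMPLARS[c])
    return max(EXEMPLARS, key=support)

probe = (1.4, 1.3)
print(categorize_prototype(probe), categorize_exemplar(probe))  # -> A A
```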

When Exemplar Theory was discussed in research concerned with the memorization of spoken utterances, the nature of the properties that were assumed to be memorized was not always made clear (e.g., Goldinger, 1996). Some studies in which an exemplar approach was applied to problems of speech perception and acquisition (Lacerda, 1995; Guenther & Gjaja, 1996; Pierrehumbert, 2001) drew on properties such as formant frequencies or auditory excitation patterns instead of realistic properties. Stimuli were assumed to be represented in episodic memory holistically, without prior separation of their perspectival, presentational and linguistic aspects. The approach appears to be rooted in previous modeling of speech perception in which the presentational and perspectival variation was implicitly ignored. It is afflicted with all the shortcomings of such models mentioned in section 1.3. It implies a denial of the relevance of any neural structures and processes that predispose humans for speech and other species for more primitive vocal communication. This is obviously at variance with MDT, whose objective it is to capture precisely these processes.

In the experiment by Eriksson and Traunmüller (2002), in which vowels that varied in vocal effort and presentation level were presented to listeners who had to estimate both vocal effort (apparent distance between speaker and addressee) and speaker closeness (own apparent distance from the speaker), the order of the two questions was switched between two groups of subjects. The SPL is involved in both distance variables, and it also varies between vowels. However, vowel quality will not affect distance estimates if these are based on the carrier. The results did show some interference, but there was distinctly less of it in the answers to the second question as compared with the first. This can be understood as follows: When the listeners answered the first question, they still had access to a detailed auditory mapping of the stimulus, which provided for precision as well as for interference. When they answered the second question, this detailed information was no longer accessible, but only the abstracted properties of the voice. Therefore, the interference due to between-vowel variation in level vanished, but the overall performance became worse.


When subjects had to estimate vocal effort before speaker closeness, there was a substantial deterioration in their performance in speaker closeness, while there was only a small deterioration in vocal effort when the tasks were ordered in the opposite way. This suggests that the retention of SPL is less persistent than the retention of vocal effort. Such a differential is incompatible with models of episodic memory that assume utterances to be retained in memory in a holistic fashion. The same conclusion can also be drawn from results obtained by Bradlow, Nygaard and Pisoni (1999), who observed that listeners’ performance in deciding whether a word had been presented previously in a list of words was not significantly affected by variation in amplitude (perspectival), while listeners were less accurate when there was variation in speaking rate (expressive) and in speaker (organic).

Direct evidence in support of MDT and in contradiction with the assumption that detailed traces of utterances are retained in memory prior to analysis can also be seen in the results of an experiment by Nygaard, Sommers and Pisoni (1995) (see also Nygaard & Pisoni, 1998). This experiment showed that prior exposure to a talker’s voice facilitates subsequent recognition of novel words produced by the same talker. This finding demonstrates a form of memory for a talker’s voice (the carrier) that is distinct from the retention of individual words (the modulation).

These experimental results, together with the problems that would be caused by the presentational variation in the spectral fine structure of utterances if all its detail were retained in memory, suggest that utterances are memorized after separation of their various perspectival, presentational and linguistic aspects, for which different decay times have been observed. The memory traces of the different aspects of an utterance must be assumed to remain associated with each other in episodic memory, while also being separately accessible.

Acknowledgment

I am grateful to five anonymous reviewers for their valuable and varied critiques of the manuscript.

References

Abbs, J. H., & Gracco, V. L. (1984). Control of complex motor gestures: orofacial muscle responses to load perturbations of lip during speech. Journal of Neurophysiology, 51, 705-723.

Ainsworth, W. A. (1971). Perception of synthesized isolated vowels and h_d words as a function of fundamental frequency. Journal of the Acoustical Society of America, 49, 1323-1324.

Ainsworth, W. A. (1975). Intrinsic and extrinsic factors in vowel judgments. In G. Fant & M. Tatham, (Eds.), Auditory Analysis and Perception of Speech, (pp. 103-113). London: Academic Press.

Aldridge, M. A., Braga, E. S., Walton, G. E., & Bower, T. G. R. (1999). The intermodal representation of speech in newborns. Developmental Science, 2, 42-46.

Assmann, P. F., & Nearey, T. M. (2003). Frequency shifts and vowel identification. In the Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona, pp. 1397-1400.

Baru, A. V. (1975). Discrimination of synthesized vowels [a] and [i] with varying parameters (fundamental frequency, intensity, duration and number of formants) in dog. In G. Fant & M. A. A. Tatham, (Eds.), Auditory Analysis and Perception of Speech, (pp. 91-101). London: Academic Press.


Bergem, D. R. van (1993). Acoustic vowel reduction as a function of sentence accent, word stress, and word class. Speech Communication, 12, 1-23.

Bezooijen, R. van (1995). Sociocultural aspects of pitch differences between Japanese and Dutch women. Language and Speech, 38, 253-265.

Bladon, R. A. W., Henton, C. G., & Pickering, J. B. (1984). Towards an auditory theory of speaker normalization. Language and Communication, 4, 59-69.

Bladon, R. A. W., & Lindblom, B. (1981). Modeling the judgement of vowel quality differences. Journal of the Acoustical Society of America, 69, 1414-1422.

Blumstein, S. E., & Stevens, K. N. (1979). Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. Journal of the Acoustical Society of America, 66, 1001-1017.

Bradlow, A. R., Nygaard, L. C., & Pisoni, D. B. (1999). Effects of talker, rate, and amplitude variation on recognition memory for spoken words. Perception & Psychophysics, 61, 206-219.

Broadbent, D. E., & Ladefoged, P. (1960). Vowel judgements and adaptation level. Proceedings of the Royal Society, Series B-Biological Sciences, 151, 384-399.

Bruce, G. (1982). Developing the Swedish intonation model. In Working Papers, 22, pp. 51-116: Department of Linguistics & Phonetics, Lund University.

Burdick, C. K., & Miller, J. D. (1975). Speech perception by the chinchilla: discrimination of sustained [a] and [i]. Journal of the Acoustical Society of America, 58, 415-427.

Burnham, D., & Dodd, B. (2004). Auditory-visual speech integration by prelinguistic infants: Perception of an emergent consonant in the McGurk effect. Developmental Psychobiology, 45, 204-220.

Calder, A. J., Rowland, D., Young, A. W., Nimmo-Smith, I., Keane, J., & Perrett, D. I. (2000). Caricaturing facial expressions. Cognition, 76, 105-146.

Carlson, R., & Granström, B. (1979). Model predictions of vowel dissimilarity. In STL-QPSR 3-4/1979, pp. 84-104. Department of Speech Transmission and Musical Acoustics, Royal Institute of Technology, Stockholm.

Carlson, R., Granström, B., & Klatt, D. (1979). Vowel perception: The relative perceptual salience of selected acoustic manipulations. In STL-QPSR 3-4/1979, pp. 73-83. Department of Speech Transmission and Musical Acoustics, Royal Institute of Technology, Stockholm.

Carré, R., & Chennoukh, S. (1995). Vowel-consonant-vowel modeling by superposition of consonant closure on vowel-to-vowel gesture. Journal of Phonetics, 23, 231-241.

Catford, J. C. (2001). A Practical Introduction to Phonetics. Oxford: Oxford University Press.

Chiba, T., & Kajiyama, M. (1941). The Vowel, its Nature and Structure. Tokyo: Kaiseikan.

Chistovich, L. A., & Lublinskaja, V. V. (1979). The center of gravity effect in vowel spectra and critical distance between the formants. Hearing Research, 1, 185-195.

Chuang, C.-K., & Wang, W. S.-Y. (1978). Psychophysical pitch biases related to vowel quality, intensity difference, and sequential order. Journal of the Acoustical Society of America, 64, 1004-1014.

de Cheveigné, A., & Kawahara, H. (1999). Missing-data model of vowel identification. Journal of the Acoustical Society of America, 105, 3497-3508.

Delattre, P. C., Liberman, A. M., & Cooper, F. S. (1955). Acoustic loci and transitional cues for consonants. Journal of the Acoustical Society of America, 27, 769-773.

Di Benedetto, M. G. (1987). On vowel height: Acoustic and perceptual representations by the fundamental and the first formant frequency. In the Proceedings of the 11th International Congress of Phonetic Sciences, Tallinn, Vol. 5, pp. 198-201.

Di Benedetto, M. G. (1994). Acoustic and perceptual evidence of a complex relation between F1 and F0 in determining vowel height. Journal of Phonetics, 22, 205-224.

Diehl, R. L., Lotto, A. J., & Holt, L. L. (2004). Speech perception. Annual Review of Psychology, 55, 149-179.

Disner, S. F. (1980). Evaluation of vowel normalization procedures. Journal of the Acoustical Society of America, 67, 253-261.


Dudley, H. (1940). The carrier nature of speech. Bell System Technical Journal, 19, 495-515.

Eisen, B. (2001). Phonetische Aspekte zwischensprachlicher Interferenz: Untersuchungen zur Artikulationsbasis an Häsitationspartikeln nicht-nativer Sprecher des Deutschen [Phonetic aspects of between-language interference: Investigations concerned with the basis of articulation by means of hesitation particles of non-native speakers of German]. Frankfurt am Main: Lang.

Engstrand, O. (1981). Acoustic constraints or invariant input representation? An experimental study of selected articulatory movements and targets. In RUUL 7, pp. 67-95. Department of Linguistics, Uppsala University.

Eriksson, A., & Traunmüller, H. (2002). Perception of vocal effort and distance from the speaker on the basis of vowel utterances. Perception & Psychophysics, 64, 131-139.

Fahey, R. P., & Diehl, R. L. (1996). The missing fundamental in vowel height perception. Perception & Psychophysics, 58, 725-733.

Fahey, R. P., Diehl, R. L., & Traunmüller, H. (1996). Perception of back vowels: Effects of varying F1-F0 bark distance. Journal of the Acoustical Society of America, 99, 2350-2357.

Fant, G. (1960). Acoustic Theory of Speech Production. The Hague: Mouton.

Fant, G. (1975). Non-uniform vowel normalization. In STL-QPSR 2-3/1975, pp. 1-19. Department of Speech Transmission and Musical Acoustics, Royal Institute of Technology, Stockholm.

Fant, G. (1979). Glottal source and excitation analysis. In STL-QPSR 1/1979, pp. 85-107. Department of Speech Transmission and Musical Acoustics, Royal Institute of Technology, Stockholm.

Folkins, J. W., & Abbs, J. H. (1975). Lip and jaw motor control during speech: responses to resistive loading of the jaw. Journal of Speech and Hearing Research, 18, 207-219.

Fowler, C. A. (1986). An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics, 14, 3-28.

Fowler, C. A. (1994). Speech perception: Direct realist theory. In Encyclopedia of Language and Linguistics, (Vol. 8, pp. 4199-4203). Oxford: Pergamon Press.

Fowler, C. A., & Turvey, M. T. (1980). Immediate compensation in bite-block speech. Phonetica, 37, 306-326.

Frøkjær-Jensen, B. (1966). Changes in formant frequencies and formant levels at high voice effort. In ARIPUC, pp. 47-55. Copenhagen: Department of Phonetics, University of Copenhagen.

Fujimura, O., & Kakita, Y. (1979). Remarks on quantitative description of the lingual articulation. In B. Lindblom & S. Öhman (Eds.) Frontiers of Speech Communication Research, (pp. 17-24). London: Academic Press.

Fujisaki, H., & Kawashima, T. (1968). The roles of pitch and higher formants in the perception of vowels. IEEE Transactions on Audio and Electroacoustics, AU-16, 73-77.

Gay, T. (1978). Articulatory units: segments or syllables? In A. Bell & J. Hooper, (Eds.), Syllables and Segments, (pp. 121-131). Amsterdam: North-Holland.

Gerstman, L. J. (1968). Classification of self-normalized vowels. IEEE Transactions on Audio and Electroacoustics, AU-16, 78-80.

Geumann, A. (2001). Vocal intensity: acoustic and articulatory correlates. In the Proceedings of the 4th Conference on Motor Control, Nijmegen, Netherlands, pp. 70-73.

Geumann, A., Kroos, C., & Tillmann, H. G. (1999). Are there compensatory effects in natural speech? In the Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco, Vol. 1, pp. 399-402.

Gibson, J. J. (1966). The Senses Considered as Perceptual Systems. Boston, MA: Houghton Mifflin.

Gick, B., Wilson, I., Koch, K., & Cook, C. (2004). Language-specific articulatory settings: Evidence from inter-utterance rest position. Phonetica, 61, 220-233.


Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology - Learning, Memory and Cognition, 22, 1166-1183.

Goldstein, U. G. (1980). An Articulatory Model for the Vocal Tracts of Growing Children, D. Sc. thesis, Massachusetts Institute of Technology, Cambridge, MA.

Guenther, F. H. (1994). A neural network model of speech acquisition and motor equivalent speech production. Biological Cybernetics, 72, 43-53.

Guenther, F. H. (1995). Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychological Review, 102, 594-621.

Guenther, F. H., Ghosh, S. S., & Nieto-Castanon, A. (2003). A neural model of speech production. In the Proceedings of the 6th International Seminar on Speech Production, Sydney, pp. 85-90.

Guenther, F. H., & Gjaja, M. N. (1996). The perceptual magnet effect as an emergent property of neural map formation. Journal of the Acoustical Society of America, 100, 1111-1121.

Gussenhoven, C. (1999). Discreteness and gradience in intonational contrasts. Language and Speech, 42, 283-305.

Gussenhoven, C., & Rietveld, T. (1998). On the speaker-dependence of the perceived prominence of F0 peaks. Journal of Phonetics, 26, 371-380.

Halberstam, B., & Raphael, L. J. (2004). Vowel normalization: the role of fundamental frequency and upper formants. Journal of Phonetics, 32, 423-434.

Harris, J. (2004). Vowel reduction as information loss. UCL Working Papers in Linguistics, 16, pp. 1-16. Department of Phonetics and Linguistics, University College London: http://www.phon.ucl.ac.uk/home/PUB/WPL/uclwpl.html

Harris, J., & Lindsey, G. (1995). The elements of phonological representation. In J. Durand & F. Katamba, (Eds.), Frontiers of phonology: atoms, structures, derivations, (pp. 34-79). London: Longman.

Harris, J., & Urua, E.-A. (2001). Lenition degrades information: consonant allophony in Ibibio. Speech, Hearing and Language: Work in Progress, 13, 72-105. Department of Phonetics and Linguistics, University College, London: http://www.phon.ucl.ac.uk/home/shl/

Hawkins, S., & Smith, R. (2001). Polysp: a polysystemic, phonetically-rich approach to speech understanding. Italian Journal of Linguistics—Rivista di Linguistica, 13, 99-188.

Hellström, A., Aaltonen, O., Raimo, I., & Vilkman, E. (1994). The role of vowel quality in pitch comparison. Journal of the Acoustical Society of America, 96, 2133-2139.

Hintzman, D. L. (1986). "Schema abstraction" in a multiple-trace memory model. Psychological Review, 93, 411-428.

Hirahara, T., & Kato, H. (1992). The effect of F0 on vowel identification. In Y. Tohkura, E. Vatikiotis-Bateson & Y. Sagisaka, (Eds.), Speech Perception, Production and Linguistic Structure, (pp. 89-112). Tokyo, Amsterdam: Ohmsha, IOS Press.

Honda, K. (1983). Relationship between pitch control and vowel articulation. In D. M. Bless & J. H. Abbs, (Eds.), Vocal Fold Physiology, (pp. 286-297). San Diego: College-Hill Press.

Hoole, P. (1987). Bite-block speech in the absence of oral sensibility. In the Proceedings of the XIth International Congress of Phonetic Sciences, Tallinn, Vol. 4, pp. 16-19.

Houde, J. F. & Jordan, M. I. (1998). Sensorimotor adaptation in speech production. Science, 279, 1213-1216.

Huber, J. E., Stathopoulos, E. T., Curione, G. M., Ash, T. A., & Johnson, K. (1999). Formants of children, women, and men: The effects of vocal intensity variation. Journal of the Acoustical Society of America, 106, 1532-1542.

Jenner, B. (2001). ‘Articulatory setting’: genealogies of an idea. Historiographia Linguistica, 28, 121-141.

Johnson, K. (1990). The role of perceived speaker identity in F0 normalization of vowels. Journal of the Acoustical Society of America, 88, 642-654.

Johnson, K. (2000). Adaptive dispersion in vowel perception. Phonetica, 57, 181-188.

Johnson, K., Flemming, E., & Wright, R. (1993). The hyperspace effect: Phonetic targets are hyperarticulated. Language, 69, 505-528.

Johnson, K., Strand, E. A., & D'Imperio, M. (1999). Auditory-visual integration of talker gender in vowel perception. Journal of Phonetics, 27, 359-384.

Joos, M. A. (1948). Acoustic Phonetics. Language, 24, 1-136.

Kaitz, M., Meschulach-Sarfaty, O., Auerbach, J., & Eidelman, A. (1988). A reexamination of newborns' ability to imitate facial expressions. Developmental Psychology, 24, 3-7.

Katz, W. F., & Assmann, P. F. (2001). Identification of children's and adults' vowels: intrinsic fundamental frequency, fundamental frequency dynamics, and presence of voicing. Journal of Phonetics, 29, 23-51.

Kelso, J. A. S., Tuller, B., Vatikiotis-Bateson, E., & Fowler, C. A. (1984). Functionally specific articulatory cooperation following jaw perturbations during speech: evidence for coordinative structures. Journal of Experimental Psychology: Human Perception and Performance, 10, 812-832.

Kewley-Port, D., Li, X., Zheng, Y., & Neel, A. T. (1996). Fundamental frequency effects on thresholds for vowel formant discrimination. Journal of the Acoustical Society of America, 100, 2462-2470.

Kienast, M., & Sendlmeier, W. F. (2000). Acoustical analysis of spectral and temporal changes in emotional speech. In the Proceedings of the ISCA Workshop on Speech and Emotion, Queen's University, Belfast, pp. 92-97.

Kojima, S., & Kiritani, S. (1989). Vocal-auditory functions in the chimpanzee: Vowel perception. International Journal of Primatology, 10, 199-213.

Koopmans van Beinum, F. J. (1980). Vowel Contrast Reduction: An Acoustic and Perceptual Study of Dutch Vowels in Various Speech Conditions, Ph.D. thesis, University of Amsterdam.

Kuhl, P. K. (1991). Human adults and human infants show a ‘perceptual magnet effect’ for the prototypes of speech categories, monkeys do not. Perception & Psychophysics, 50, 93-107.

Kuhl, P. K., Andruski, J. E., Chistovich, I. A., Chistovich, L. A., Kozhevnikova, E. V., Ryskina, V. L., et al. (1997). Cross-language analysis of phonetic units in language addressed to infants. Science, 277, 684-686.

Kuhl, P. K., & Meltzoff, A. N. (1984). The intermodal representation of speech in infants. Infant Behavior & Development, 7, 361-381.

Kuhl, P. K., & Meltzoff, A. N. (1996). Infant vocalizations in response to speech: vowel imitation and developmental change. Journal of the Acoustical Society of America, 100, 2425-2439.

Lacerda, F. (1992). Young infants prefer high/low vowel contrasts. In PERILUS XV, pp. 85-90. Department of Linguistics, Stockholm University.

Lacerda, F. (1995). The perceptual-magnet effect: An emergent consequence of exemplar-based phonetic memory. In the Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Vol. 2, pp. 140-147.

Ladd, D. R. (1986). Intonational Phonology. Cambridge: Cambridge University Press.

Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 29, 98-104.

Ladefoged, P., DeClerk, J., Lindau, M., & Papçun, G. (1972). An auditory-motor theory of speech production. UCLA Working Papers in Phonetics, 22, 48-75.

Ladefoged, P., & McKinney, N. (1963). Loudness, sound pressure, and subglottal pressure in speech. Journal of the Acoustical Society of America, 35, 454-460.

Laver, J. (1978). The concept of articulatory settings: an historical survey. Historiographia Linguistica, 1-14.

Laver, J. (1980). The Phonetic Description of Voice Quality. Cambridge: Cambridge University Press.


Liberman, A. M., Cooper, F. S., Harris, K. S., & MacNeilage, P. F. (1962). A motor theory of speech perception. In the Proceedings of the Speech Communication Seminar, Stockholm, Vol. II, Paper D3.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.

Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.

Liénard, J.-S., & Di Benedetto, M.-G. (1999). Effect of vocal effort on spectral properties of vowels. Journal of the Acoustical Society of America, 106, 411-422.

Liljencrants, J., & Lindblom, B. (1972). Numerical simulation of vowel quality systems: The role of perceptual contrast. Language, 48, 839-862.

Lindblom, B. (1986). Phonetic universals in vowel systems. In J. J. Ohala & J. J. Jaeger, (Eds.), Experimental Phonology, (pp. 13-44). Orlando: Academic Press.

Lindblom, B. (1990). Explaining phonetic variation: a sketch of the H&H theory. In W. Hardcastle & A. Marchal, (Eds.), Speech Production and Speech Modelling, (pp. 403-439). Dordrecht: Kluwer Academic Publishers.

Lindblom, B., & Engstrand, O. (1989). In what sense is speech quantal? Journal of Phonetics, 17, 107-121.

Lindblom, B., Lubker, J. F., & Gay, T. (1979). Formant frequencies of some fixed-mandible vowels and a model of speech motor programming by predictive simulation. Journal of Phonetics, 7, 141-161.

Lindblom, B., Lubker, J. F., & McAllister, R. (1977). Compensatory articulation and the modeling of normal speech production behavior.

Lindblom, B., Sussman, H. M., Modarresi, G., & Burlingame, E. (2002). The trough effect: implications for speech motor programming. Phonetica, 59, 245-262.

Lobanov, B. M. (1971). Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America, 49, 606-608.

Löfqvist, A. (1990). Speech as audible gestures. In W. J. Hardcastle & A. Marchal, (Eds.), Speech Production and Speech Modelling, (pp. 289-322). Dordrecht: Kluwer Academic Publishers.

Mann, V. A., & Repp, B. H. (1980). Influence of vocalic context on perception of the [ʃ]-[s] distinction. Perception & Psychophysics, 28, 213-228.

Massaro, D. W., & Stork, D. G. (1998). Speech recognition and sensory integration. American Scientist, 86, 236-244.

Maurer, D., & Landis, T. (1996). Intelligibility and spectral differences in high-pitched vowels. Folia Phoniatrica et Logopaedica, 48, 1-10.

McDonough, J., Schaaf, T., & Waibel, A. (2004). Speaker adaptation with all-pass transforms. Speech Communication, 42, 75-91.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746-748.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Meltzoff, A. N., & Moore, M. K. (1977). Imitation of facial and manual gestures by human neonates. Science, 198, 75-78.

Ménard, L., Schwartz, J.-L., Boë, L.-J., Kandel, S., & Vallée, N. (2002). Auditory normalization of French vowels synthesized by an articulatory model simulating growth from birth to adulthood. Journal of the Acoustical Society of America, 111, 1892-1905.

Menzerath, P., & de Lacerda, A. (1933). Koartikulation, Steuerung und Lautabgrenzung: Eine experimentelle Untersuchung [Coarticulation, control and sound segregation: An experimental investigation]. Berlin: Dümmler.

Miller, R. L. (1953). Auditory tests with synthetic vowels. Journal of the Acoustical Society of America, 25, 114-121.

Mullennix, J. W., & Pisoni, D. B. (1990). Stimulus variability and processing dependencies in speech perception. Perception & Psychophysics, 47, 379-390.


Mullennix, J. W., Pisoni, D. B., & Martin, C. S. (1989). Some effects of talker variability on spoken word recognition. Journal of the Acoustical Society of America, 85, 365-378.

Munhall, K. G., Löfqvist, A., & Kelso, J. A. S. (1994). Lip–larynx coordination in speech: Effects of mechanical perturbations to the lower lip. Journal of the Acoustical Society of America, 95, 3605-3616.

Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85, 2088-2113.

Nittrouer, S. (1992). Age-related differences in perceptual effects of formant transitions within syllables and across syllable boundaries. Journal of Phonetics, 20, 351-382.

Nusbaum, H. C., & Magnuson, J. (1997). Talker normalization: Phonetic constancy as a cognitive process. In K. Johnson & J. W. Mullennix, (Eds.), Talker Variability in Speech Processing, (pp. 109-132). San Diego, London: Academic Press.

Nusbaum, H. C., & Morin, T. M. (1992). Paying attention to differences among talkers. In Y. Tohkura, E. Vatikiotis-Bateson & Y. Sagisaka, (Eds.), Speech Perception, Production and Linguistic Structure, (pp. 113-134). Tokyo, Amsterdam: Ohmsha, IOS Press.

Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60, 355-376.

Nygaard, L. C., Sommers, M. S., & Pisoni, D. B. (1995). Effects of stimulus variability on perception and representation of spoken words in memory. Perception & Psychophysics, 57, 989-1001.

Oden, G. C., & Massaro, D. W. (1978). Integration of featural information in speech perception. Psychological Review, 85, 172-191.

Ohala, J. J. (1984). An ethological perspective on common cross-language utilization of F0 of voice. Phonetica, 41, 1-16.

Öhman, S. (1966). Coarticulation in VCV utterances: Spectrographic measurements. Journal of the Acoustical Society of America, 39, 151-168.

Öhman, S. (1967). Peripheral motor commands in labial articulation. In STL-QPSR 4/1967, pp. 30-63. Department of Speech Transmission, Royal Institute of Technology, Stockholm.

Öhman, S., Persson, A., & Leanderson, R. (1967). Speech production at the neuro-muscular level. In STL-QPSR 2-3/1967, pp. 15-19. Department of Speech Transmission, Royal Institute of Technology, Stockholm.

Pape, D. (2005). Is pitch perception and discrimination of vowels language-dependent and influenced by the vowels spectral properties? In the Proceedings of the Eleventh Meeting of the International Conference on Auditory Display, Limerick, Ireland, pp. 340-343.

Patterson, D. K., & Pepperberg, I. M. (1994). A comparative study of human and parrot phonation - Acoustic and articulatory correlates of vowels. Journal of the Acoustical Society of America, 96, 634-648.

Perkell, J. S. (1969). Physiology of speech production: results and implications of a quantitative cineradiographic study. Cambridge MA: MIT Press.

Perkell, J. S. (1986). Coarticulation strategies: preliminary implications of a detailed analysis of lower lip protrusion movements. Speech Communication, 5, 47-68.

Perkell, J. S. (1996). Properties of the tongue help to define vowel categories: hypotheses based on physiologically-oriented modelling. Journal of Phonetics, 24, 3-22.

Perkell, J., Guenther, F., Lane, H., Marrone, N., Matthies, M. L., Stockmann, E., et al. (2006) Production and perception of phoneme contrasts covary across speakers. In J. Harrington & M. Tabain (Eds.) Speech Production: Models, Phonetic Processes & Techniques, (pp. 69-). New York: Psychology Press.

Perkell, J. S., Matthies, M., Lane, H. L., Guenther, F. H., Wilhelms-Tricarico, R., Wozniak, J., et al. (1997). Speech motor control: Acoustic goals, saturation effects, auditory feedback and internal models. Speech Communication, 22, 227-250.

Pierrehumbert, J. B. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In J. Bybee & P. Hopper, (Eds.), Frequency and the Emergence of Linguistic Structure, (pp. 137-157). Amsterdam: Benjamins.

Pisoni, D. B. (1993). Long-term memory in speech perception: Some new findings on talker variability, speaking rate and perceptual learning. Speech Communication, 13, 109-125.

Plooij, F. X. (1978) Some basic traits of language in wild chimpanzees? In A. Lock (Ed.) Action, gesture and symbol, (pp. 111-131). New York: Academic Press.

Polka, L., & Bohn, O.-S. (2003). Asymmetries in vowel perception. Speech Communication, 41, 221-231.

Porter, R. J., Jr., & Lubker, J. F. (1980). Rapid reproduction of vowel-vowel sequences: evidence for a fast and direct acoustic-motoric linkage in speech. Journal of Speech and Hearing Research, 23, 593-602.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3, 382-407.

Remez, R. E., Fellowes, J. M., & Rubin, P. E. (1997). Talker identification based on phonetic information. Journal of Experimental Psychology: Human Perception and Performance, 23, 651-666.

Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27, 169-192.

Rizzolatti, G., Fadiga, L., Fogassi, L., & Gallese, V. (1996). Premotor cortex and the recognition of motor actions. Cognitive Brain Research, 3, 131-141.

Rostolland, D. (1982). Acoustic features of shouted voice. Acustica, 50, 118-125.

Rostolland, D. (1982). Phonetic structure of shouted voice. Acustica, 51, 80-89.

Ryalls, B. O., & Pisoni, D. B. (1997). The developmental course of talker normalization in preschool children. Developmental Psychology, 33, 441-452.

Schulman, R. (1989). Articulatory dynamics of loud and normal speech. Journal of the Acoustical Society of America, 85, 295-312.

Sinnott, J. M., Brown, C. H., Malik, W. T., & Kressley, R. A. (1997). A multidimensional scaling analysis of vowel discrimination in humans and monkeys. Perception & Psychophysics, 59, 1214-1224.

Simpson, A. P. (2001). Dynamic consequences of differences in male and female vocal tract dimensions. Journal of the Acoustical Society of America, 109, 2153-2164.

Sinnott, J. M., & Williamson, T. L. (1999). Can macaques perceive place of articulation from formant transition information? Journal of the Acoustical Society of America, 106, 929-937.

Skoyles, J. R. (1998). Speech phones are a replication code. Medical Hypotheses, 50, 167-173.

Slawson, A. W. (1968). Vowel quality and musical timbre as functions of spectrum envelope and fundamental frequency. Journal of the Acoustical Society of America, 43, 87-101.

Smith, J. D., & Minda, J. P. (1998). Prototypes in the mist: The early epochs of category learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 1411-1436.

Soken, N. H., & Pick, A. D. (1999). Infants' perception of dynamic affective expressions: Do infants distinguish specific expressions? Child Development, 70, 1275-1282.

Stevens, K. N. (1960). Toward a model for speech recognition. Journal of the Acoustical Society of America, 32, 47-55.

Stevens, K. N. (1972). The quantal nature of speech: Evidence from articulatory-acoustic data. In E. E. David & P. B. Denes, (Eds.), Human communication: A unified view, (pp. 51-66). New York: McGraw-Hill.

Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17, 3-45.

Stevens, K. N. (1998). Acoustic Phonetics. Cambridge, Mass.: MIT Press.


Stevens, K. N. (2000). From acoustic cues to segments, features and words. In the Proceedings of the 6th International Conference on Spoken Language Processing, Beijing, Vol. 1, pp. A1-A8.

Stevens, K. N., & Blumstein, S. E. (1978). Invariant cues for place of articulation in stop consonants. Journal of the Acoustical Society of America, 64, 1358-1368.

Stoll, G. (1984). Pitch of vowels: experimental and theoretical investigation of its dependence on vowel quality. Speech Communication, 3, 137-150.

Story, B. H., & Titze, I. R. (2002). A preliminary study of voice quality transformation based on modifications to the neutral vocal tract area function. Journal of Phonetics, 30, 485-509.

Story, B. H., Titze, I. R., & Hoffman, E. A. (2001). The relationship of vocal tract shape to three voice qualities. Journal of the Acoustical Society of America, 109, 1651-1667.

Strange, W., Verbrugge, R. R., Shankweiler, D. P., & Edman, T. R. (1976). Consonant environment specifies vowel identity. Journal of the Acoustical Society of America, 60, 213-224.

Studdert-Kennedy, M. (1983). On learning to speak. Human Neurobiology, 2, 191-195.

Studdert-Kennedy, M. (2000). Imitation and the emergence of segments. Phonetica, 57, 275-283.

Syrdal, A. K., & Gopal, H. S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086-1100.

Terhardt, E. (1972). Zur Tonhöhenwahrnehmung von Klängen. I. Psychoakustische Grundlagen [On the pitch perception of complex tones. I. Psychoacoustic fundamentals]. Acustica, 26, 173-186.

Terhardt, E. (1972). Zur Tonhöhenwahrnehmung von Klängen. II. Ein Funktionsschema [On pitch perception of complex tones. II. A model]. Acustica, 26, 187-199.

Tillmann, H. G. & Mansell, P. (1980) Phonetik; Lautsprachliche Zeichen, Sprachsignale und lautsprachlicher Kommunikationsprozeß [Phonetics; Signs in vocal language, speech signals, and the process of vocal communication]. Stuttgart: Klett-Cotta.

Traunmüller, H. (1981). Perceptual dimension of openness in vowels. Journal of the Acoustical Society of America, 69, 1465-1475.

Traunmüller, H. (1984). Articulatory and perceptual factors controlling the age- and sex-conditioned variability in formant frequencies of vowels. Speech Communication, 3, 49-61.

Traunmüller, H. (1988). Paralinguistic variation and invariance in the characteristic frequencies of vowels. Phonetica, 45, 1-29.

Traunmüller, H. (1990). A note on hidden factors in vowel perception experiments. Journal of the Acoustical Society of America, 88, 2015-2019.

Traunmüller, H. (1991). The context sensitivity of the perceptual interaction between F0 and F1. In the Proceedings of the 12th International Congress of Phonetic Sciences, Aix-en-Provence, Vol. 5, pp. 62-65.

Traunmüller, H. (1994). Conventional, biological and environmental factors in speech communication: A modulation theory. Phonetica, 51, 170-183.

Traunmüller, H. (1998). The role of F0 in vowel perception. Web site of the Department of Linguistics, Stockholm University: http://www.ling.su.se/staff/hartmut/i.htm

Traunmüller, H. (2001). Size and physiological effort in the production of signed and spoken utterances. In Working Papers, 49, pp. 164-167: Department of Linguistics & Phonetics, Lund University.

Traunmüller, H. (2003). Clicks and the idea of a human protolanguage. In the Proceedings of Fonetik 2003, Phonum, 9, pp. 1-4: Department of Philosophy and Linguistics, Umeå University.

Traunmüller, H. (2006). Cross-modal interactions in visual as opposed to auditory perception of vowels. In Working Papers, 52, pp. 137-140: Department of Linguistics & Phonetics, Lund University.


Traunmüller, H., & van Bezooijen, R. (1994). The auditory perception of children's age and sex. In the Proceedings of the Third International Conference on Spoken Language Processing, Yokohama, Vol. 3, pp. 1171-1174.

Traunmüller, H., & Eriksson, A. (1994). The size of F0-excursions in speech production and perception. In the Proceedings of Fonetik 1994, Working Papers, 43, pp. 136-139: Department of Linguistics & Phonetics, Lund University.

Traunmüller, H., & Eriksson, A. (1995). The perceptual evaluation of F0 excursions in speech as evidenced in liveliness estimations. Journal of the Acoustical Society of America, 97, 1905-1915.

Traunmüller, H., & Eriksson, A. (1995b). The frequency range of the voice fundamental in the speech of male and female adults. Web site of the Department of Linguistics, Stockholm University: http://www.ling.su.se/staff/hartmut/f0_m&f.pdf

Traunmüller, H., & Eriksson, A. (2000). Acoustic effects of variation in vocal effort by men, women, and children. Journal of the Acoustical Society of America, 107, 3438-3451.

Traunmüller, H., & Krull, D. (1987). An experiment on the cues to the identification of fricatives. In the Proceedings of the XIth International Congress of Phonetic Sciences, Tallinn, Vol. 5, pp. 205-208.

Traunmüller, H., & Lacerda, F. (1987). Perceptual relativity in identification of two-formant vowels. Speech Communication, 6, 143-157.

Traunmüller, H., & Öhrström, N. (in press). Audiovisual perception of openness and lip rounding in front vowels. Journal of Phonetics.

Vaissière, J. (1983). Language-independent prosodic features. In A. Cutler & D. R. Ladd (Eds.) Prosody: Models and Measurements, (pp. 53-66). Berlin: Springer Verlag.

Van Lancker, D., Kreiman, J., & Emmorey, K. (1985). Familiar voice recognition: Patterns and parameters. Journal of Phonetics, 13, 19-52.

Whalen, D. H., & Levitt, A. G. (1995). The universality of intrinsic F0 of vowels. Journal of Phonetics, 23, 349-366.

Whalen, D. H., Levitt, A. G., Hsiao, P.-L., & Smorodinsky, I. (1995). Intrinsic F0 of vowels in the babbling of 6-, 9- and 12-month-old French- and English-learning infants. Journal of the Acoustical Society of America, 97, 2533-2539.