

Available online at www.sciencedirect.com

www.elsevier.com/locate/actpsy

Acta Psychologica 127 (2008) 12–23

Evaluating the influence of the ‘unity assumption’ on the temporal perception of realistic audiovisual stimuli

Argiro Vatakis*, Charles Spence

Crossmodal Research Laboratory, Department of Experimental Psychology, University of Oxford, Oxford OX1 3UD, UK

Received 8 October 2006; received in revised form 3 December 2006; accepted 4 December 2006; available online 26 January 2007

Abstract

Vatakis, A. and Spence, C. (in press) [Crossmodal binding: Evaluating the ‘unity assumption’ using audiovisual speech stimuli. Perception & Psychophysics] recently demonstrated that when two briefly presented speech signals (one auditory and the other visual) refer to the same audiovisual speech event, people find it harder to judge their temporal order than when they refer to different speech events. Vatakis and Spence argued that the ‘unity assumption’ facilitated crossmodal binding on the former (matching) trials by means of a process of temporal ventriloquism. In the present study, we investigated whether the ‘unity assumption’ would also affect the binding of non-speech stimuli (video clips of object actions or musical notes). The auditory and visual stimuli were presented at a range of stimulus onset asynchronies (SOAs) using the method of constant stimuli. Participants made unspeeded temporal order judgments (TOJs) regarding which modality stream had been presented first. The auditory and visual musical and object action stimuli were either matched (e.g., the sight of a note being played on a piano together with the corresponding sound) or else mismatched (e.g., the sight of a note being played on a piano together with the sound of a guitar string being plucked). However, in contrast to the results of Vatakis and Spence’s recent speech study, no significant difference in the accuracy of temporal discrimination performance for the matched versus mismatched video clips was observed. Reasons for this discrepancy are discussed.

© 2006 Elsevier B.V. All rights reserved.

PsycINFO classification: 2320

Keywords: Unity assumption; Crossmodal binding; Synchrony perception; TOJ; Music; Object action; Speech; Audition; Vision

1. Introduction

According to research on the ‘unity assumption’ (e.g., Bedford, 2001; Vatakis & Spence, in press; Welch, 1999a, 1999b; Welch & Warren, 1980),¹ whenever two or more sensory inputs are highly consistent (in one or more dimension(s), such as time, space, temporal patterning, number, and semantic content), observers will be more likely to treat them as referring to the same underlying multisensory event rather than as referring to separate unimodal events. Consequently, observers will be more likely to assume that the sensory inputs have a common spatiotemporal origin, and hence will be more likely to bind them into a single unified percept. The crossmodal binding of multiple sensory signals into unified multisensory percepts has been demonstrated in studies utilizing both simple light and sound stimuli (e.g., see Thomas, 1941; Witkin, Wapner, & Leventhal, 1952, for early studies) as well as with more complex stimuli, such as speech (e.g., Easton & Basala, 1982; Vatakis & Spence, in press; Walker, Bruce, & O’Malley, 1995; though see Green & Gerdeman, 1995; Green, Kuhl, Meltzoff, & Stevens, 1991, for conflicting results).

0001-6918/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.actpsy.2006.12.002

* Corresponding author. Tel.: +44 1865 271307; fax: +44 1865 310447. E-mail addresses: [email protected], [email protected] (A. Vatakis).

¹ It is unclear whether the ‘unity assumption’ refers to a top-down (i.e., more cognitively mediated) versus bottom-up (i.e., more stimulus-driven) process (see Spence, in press; Welch & Warren, 1980, on this point). This is due to the fact that many of the factors that promote a top-down assumption of unity (such as spatial and temporal coincidence) are also likely to lead to enhanced bottom-up multisensory integration as well (e.g., Stein & Meredith, 1993; Welch, 1999a, 1999b). See Section 5 for further discussion of this point. It should be noted that it is also unclear whether the process of unification occurs consciously or unconsciously (Bertelson & de Gelder, 2004; Spence, in press; Welch & Warren, 1980).

Vatakis and Spence (in press) recently reported four experiments in which they demonstrated that the ‘unity assumption’ modulates the multisensory integration of audiovisual speech stimuli using an audiovisual temporal order judgment (TOJ) task. Video clips of speakers uttering speech sounds (Experiments 1, 3, and 4) or words (Experiment 2) were presented to participants at a range of different stimulus onset asynchronies (SOAs) using the method of constant stimuli (Spence, Shore, & Klein, 2001). The auditory and visual speech stimuli were either gender matched (i.e., a female face presented with a female voice) or else gender mismatched (i.e., a female face presented with a male voice; Experiments 1–3). In Experiment 4, the same speaker was used with matching (e.g., the lip movements of /ba/ together with the /ba/ sound) and mismatching speech tokens (e.g., the lip movements of /ba/ together with the /da/ sound). Vatakis and Spence hypothesized that if the ‘unity assumption’ were to facilitate the multisensory integration of speech stimuli, then participants should find it harder to determine whether the visual lip movements or the auditory speech had been presented first when the two stimuli referred to the same underlying perceptual event than when they did not (though see Radeau & Bertelson, 1977, Experiment 2, for contradictory findings from an early study of the ‘unity assumption’). In line with this prediction, Vatakis and Spence’s results showed that the participants in all four of their experiments were able to discriminate the temporal order of the auditory and visual signals much more accurately in the mismatched speech condition than in the matched condition. That is, the just noticeable difference (JND; conventionally defined as the temporal interval between two sensory signals that is needed in order for participants to judge the temporal order of the auditory and visual signals correctly 75% of the time) was significantly higher for the matched than for the mismatched video clips. This result was obtained even though the complexity (i.e., temporal patterning and informational content) of the auditory and visual stimuli was exactly matched, as a consequence of using exactly the same stimuli to generate both the matching and mismatching video clips (cf. Warren, Welch, & McCarthy, 1981). Vatakis and Spence’s results therefore provided empirical evidence supporting the claim that the ‘unity assumption’ can facilitate the multisensory integration of audiovisual speech stimuli (see Easton & Basala, 1982; Walker et al., 1995, for converging findings from earlier studies).

To date, however, studies that have looked at the multisensory binding of complex stimuli have mainly focused on the use of audiovisual speech stimuli (e.g., Easton & Basala, 1982; Green & Gerdeman, 1995; Green et al., 1991; Jones & Jarick, 2006; Vatakis & Spence, in press; Walker et al., 1995; Warren et al., 1981). It is important to note here, though, that speech represents a highly overlearned stimulus for the majority of people, and that a number of researchers have even argued that it may represent a ‘special’ class of sensory event (e.g., Bernstein, Auer, & Moore, 2004; Jones & Jarick, 2006; Liberman & Mattingly, 1985; Massaro, 2004; Munhall & Vatikiotis-Bateson, 2004; Tuomainen, Andersen, Tiippana, & Sams, 2005). In addition, there are many other types of stimuli that are both complex and at the same time relevant to daily life, such as object actions and/or the playing of music. One might therefore wonder whether the effect of the ‘unity assumption’ on audiovisual temporal perception (be it consciously or unconsciously mediated; see Footnote 1) is restricted to the case of speech, or whether it also extends to the multisensory integration of other classes of complex naturalistic audiovisual stimuli.

In the present study, therefore, we used exactly the same TOJ task and experimental set-up as used recently by Vatakis and Spence (in press) in order to test whether the ‘unity assumption’ would also influence the processing of pairs of auditory and visual non-speech stimuli that were either matched (i.e., that referred to the same underlying complex perceptual event) or else mismatched (created by combining the auditory and visual streams from different perceptual events). In our first experiment, we used audiovisual object action stimuli, as used in the earliest empirical studies relevant to the study of the ‘unity assumption’ (e.g., see Jackson’s (1953) early study of spatial ventriloquism; see also Beauchamp, Argall, Bodurka, Duyn, & Martin, 2004). Specifically, the stimuli in Experiment 1 consisted of a block of ice being smashed once by a hammer and a ball bouncing once on the floor, with the visual image of the action (smashing an ice block versus bouncing a ball) either matched (i.e., matched visual and auditory signals of a hammer smashing an ice block or a ball bouncing) or mismatched (i.e., the visual signal of a hammer smashing a block of ice paired with the sound of a ball bouncing). If the ‘unity assumption’ influences the multisensory integration of non-speech stimuli (as it has recently been shown to do for the case of speech stimuli; Vatakis & Spence, in press), then participants should find it harder to determine whether the visual or auditory stream was presented first when the two stimuli refer to the same perceptual event than when they refer to different events. Such an outcome would provide the first empirical demonstration that the ‘unity assumption’ can also facilitate the crossmodal binding of audiovisual non-speech stimuli at a perceptual level, while at the same time ruling out the alternative decisional-level (i.e., non-perceptual) explanation (e.g., that participants merely adopt a biased decisional strategy depending on, for example, the ‘informational richness’ of the stimuli presented; see Vatakis & Spence, in press) that has rendered the appropriate interpretation of all previous studies on this topic (such as those of Jackson, 1953; Thomas, 1941; Witkin et al., 1952, in the spatial domain) problematic (see Bertelson & Aschersleben, 2003; Vatakis & Spence, in press, on this point; though see also Radeau & Bertelson, 1977, Experiment 1, for null results).

2. Experiment 1

2.1. Methods

2.1.1. Participants

Twelve participants (5 male and 7 female) aged 19–32 years (mean age of 25 years) took part in this experiment. All of the participants were naïve as to the purpose of the study and all reported having normal hearing and normal or corrected-to-normal visual acuity. The experiment was performed in accordance with the ethical standards laid down in the 1990 Declaration of Helsinki, as well as the ethical guidelines laid down by the Department of Experimental Psychology, University of Oxford. The experiment lasted for approximately 40 min.

2.1.2. Apparatus and materials

The experiment was conducted in a completely dark, sound-attenuated testing booth, with the participants seated facing straight ahead. The visual stimuli were presented on a 17-in. (43.18 cm) TFT colour LCD monitor (SXGA, 1280 × 1024 pixel resolution; 60-Hz refresh rate), placed at eye level, approximately 68 cm from the participant. The auditory stimuli were presented by means of two Packard Bell Flat Panel 050 PC loudspeakers, one placed 25.4 cm to either side of the centre of the monitor (i.e., the auditory and visual stimuli were presented from approximately the same spatial location). The audiovisual stimuli consisted of black and white video clips presented on a black background using the Presentation programming software (Version 10.0; Neurobehavioral Systems Inc., CA). The video clips (300 × 280 pixels, Cinepak Codec video compression, 16-bit audio sample size, 24-bit video sample size, 30 frames/s) were processed using Adobe Premiere 6.0. The clips consisted of the following: (a) two different video clips of a man (visible from the waist to the upper chest) smashing a block of ice with a hammer (consisting of a single hitting action; clip A presented the man as seen from the left side, while clip B presented the same individual as seen from the right side) and two video clips of a man bouncing a ball on the ground (consisting of a single bounce of the ball; clip A presented a whole-body view of the man bouncing the ball, while clip B presented just the hand of the man bouncing the ball; all of the clips were 466 ms long; that is, the duration of the event was equal to that of the entire clip, with both the auditory and visual signals being dynamic over the same time period, e.g., starting from the beginning of the downward movement of the hammer and ball and ending with the point of contact with the ice block and ground, respectively; the actual time that the ball was in motion between the hand and the floor was equal to 190 ms); and (b) the same object actions but with the auditory channels swapped over so that, for example, the video image of the ice smashing was paired with the ball bouncing sound and vice versa (see Fig. 1a for still images taken from two of the videos). In order to achieve accurate synchronization of the dubbed video clips, each original clip was re-encoded using the XviD codec (single pass, quality mode of 100%). Using the multitrack setting in Adobe Premiere, the visual and auditory components of the to-be-dubbed videos were aligned based on the peak auditory signals of the two videos and the visual frames of the point of contact with the ice block/ground. A final frame-by-frame inspection of the video clips was performed in order to ensure the correct alignment of the auditory and visual signals.

At the beginning and end of each video clip, a still image and background acoustic noise were presented for a variable duration. The durations of the image and noise were unequal, with the difference in their durations being equivalent to the particular SOA tested (values reported below) in each condition. This aspect of the design ensured that the auditory and visual streams always started at the same time, thus avoiding cuing the participants as to the nature of the audiovisual delay with which they were being presented. In order to achieve a smooth transition at the start and end of each video clip, a 33.33 ms cross-fade was added between the still image and the video clip. The participants responded using a standard computer mouse, which they held with both hands, using their right thumb for ‘vision-first’ responses and their left thumb for ‘audition-first’ responses (or vice versa; the response buttons were counterbalanced across participants).
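To make this stream-alignment logic concrete, the following is a minimal sketch of how the lead durations could be computed so that the two streams always start together while the audiovisual events themselves are offset by the SOA. The function name and the 500 ms base lead are hypothetical illustrations, not values taken from the original experiment.

```python
# Minimal sketch (not the authors' code): compute the still-image and
# noise lead durations for a given SOA so that the visual and auditory
# streams start simultaneously while the events are offset by the SOA.
def lead_durations(soa_ms, base_lead_ms=500):
    """Return (still_image_lead_ms, noise_lead_ms).

    Positive soa_ms = visual event first; negative = auditory event
    first. The 500 ms base lead is an assumed placeholder value.
    """
    if soa_ms >= 0:
        # Visual event first: lengthen the noise lead so that the
        # auditory event occurs soa_ms later than the visual event.
        return base_lead_ms, base_lead_ms + soa_ms
    # Auditory event first: lengthen the still-image lead instead.
    return base_lead_ms - soa_ms, base_lead_ms

# Example: a -133 ms SOA (audition first) yields leads of (633, 500) ms.
print(lead_durations(-133))
```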

Fig. 1. Still images of the object action (a) and musical instrument (b) video clips used in Experiments 1–3.

Fig. 2. Mean percentage of ‘vision-first’ responses plotted as a function of stimulus onset asynchrony (SOA) in Experiments 1–3: (a) Experiment 1, object actions; (b) Experiment 2, musical notes; (c) Experiment 3, musical notes.

2.1.3. Design

Nine possible SOAs between the auditory and visual stimuli were used: ±300, ±200, ±133, ±66, and 0 ms (cf. Vatakis & Spence, in press). Negative SOAs indicate that the auditory stream was presented first, whereas positive values indicate that the visual stream was presented first. The participants completed one block of eight practice trials before the main experimental session in order to familiarize themselves with the task and the video clips. The practice block was followed by five blocks of 144 experimental trials, consisting of two presentations of each of the eight video clips at each of the nine SOAs per block of trials. The various SOAs were presented randomly within each block of trials using the method of constant stimuli (see Spence et al., 2001).
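As an illustration of this design, here is a minimal sketch of how one such block of 144 trials could be assembled under the method of constant stimuli. The clip identifiers are hypothetical labels for the eight videos described above, not names used by the authors.

```python
import itertools
import random

# Hypothetical identifiers for the 8 video clips of Experiment 1:
# 2 actions x 2 camera versions x matched/mismatched soundtrack.
clips = [f"{action}_{version}_{soundtrack}"
         for action in ("ice", "ball")
         for version in ("A", "B")
         for soundtrack in ("matched", "mismatched")]

# The nine SOAs in ms; negative = auditory stream first.
soas = [-300, -200, -133, -66, 0, 66, 133, 200, 300]

# Two presentations of each clip at each SOA: 8 x 9 x 2 = 144 trials,
# presented in a random order within the block.
block = list(itertools.product(clips, soas)) * 2
random.shuffle(block)
assert len(block) == 144
```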

2.1.4. Procedure

The participants were informed that they would be presented with a series of video clips (matched and mismatched clips of the visual and auditory streams of a hammer smashing a block of ice and a ball bouncing). The participants were informed that on each trial they would have to decide whether the auditory or the visual signal appeared to have been presented first, and that they would sometimes find this task difficult, in which case they should make an informed guess as to the order of stimulus presentation (the participants were encouraged to avoid randomly guessing and to try to respond as accurately as possible). The participants did not have to wait until the video clip had finished before making their response, but a response had to be made before the experiment would advance to the next trial.

2.2. Results

The proportions of ‘vision-first’ responses were converted to their equivalent z-scores under the assumption of a cumulative normal distribution (cf. Finney, 1964). The data from the seven intermediate SOAs (±200, ±133, ±66, and 0 ms) were used to calculate best-fitting straight lines for each participant for each condition, which, in turn, were used to derive values for the slope and intercept (matched: ball, r² = .95, p < .01; ice, r² = .94, p < .01; mismatched: ball, r² = .94, p < .01; ice, r² = .98, p < .01; the r² values reflect the correlation between the SOAs and the proportion of ‘vision-first’ responses, and hence provide an estimate of the goodness of the data fits; see Fig. 2a). The ±300 ms points were excluded from this computation due to the fact that most participants performed near-perfectly at this interval and so did not provide any significant information regarding our experimental manipulations (cf. Spence et al., 2001; Vatakis & Spence, in press, for a similar approach). The slope and intercept values were used to calculate the JND (JND = 0.675/slope, since ±0.675 represent the 75% and 25% points on the cumulative normal distribution) and the point of subjective simultaneity (PSS = −intercept/slope) values (see Coren, Ward, & Enns, 2004, for more details).
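The following is a minimal sketch of this probit-style analysis for a single participant and condition; the response proportions are illustrative placeholder values, not data from the experiment.

```python
import numpy as np
from scipy.stats import norm

# Intermediate SOAs in ms (negative = auditory first); the +/-300 ms
# points are excluded, as described above.
soas = np.array([-200, -133, -66, 0, 66, 133, 200])

# Illustrative placeholder proportions of 'vision-first' responses for
# one participant in one condition (not data from the study).
p_vision_first = np.array([0.05, 0.12, 0.30, 0.55, 0.75, 0.90, 0.97])

# Convert proportions to z-scores under a cumulative normal assumption
# (Finney, 1964); norm.ppf is the inverse of the normal CDF.
z = norm.ppf(p_vision_first)

# Best-fitting straight line through the z-transformed data.
slope, intercept = np.polyfit(soas, z, 1)

# JND = 0.675/slope: +/-0.675 are the z-values of the 25% and 75% points.
jnd = 0.675 / slope
# PSS = -intercept/slope: the SOA at which 'vision-first' and
# 'sound-first' responses are chosen equally often (z = 0).
pss = -intercept / slope
print(f"JND = {jnd:.1f} ms, PSS = {pss:.1f} ms")
```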

The PSS provides an estimate of the time interval by which the event in one sensory modality had to lead the event in the other modality in order for synchrony to be perceived (or rather, for the ‘vision-first’ and ‘sound-first’ responses to be chosen equally often). For all of the analyses reported here, Bonferroni-corrected t-tests (where p < .05 prior to correction) were used for all post hoc comparisons. The analysis of audiovisual matching (i.e., matched or mismatched actions) and object action (i.e., bouncing of the ball or smashing of the block of ice) revealed no main effect of object action (F(1, 11) = 3.67, p = .12); therefore, the data from the two object action events were collapsed. The JND and PSS data for each of the matched and mismatched object action events were analysed using a repeated measures analysis of variance (ANOVA) with the factor of audiovisual matching. The ‘unity assumption’ was not expected to have any reliable effect on the PSS values (cf. Vatakis & Spence, 2006b, in press). Hence, the PSS data are reported primarily for the sake of completeness.
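Since audiovisual matching has only two levels, a repeated measures ANOVA of this kind is equivalent to a paired t-test, with F(1, N − 1) = t². A minimal sketch of this check, using randomly generated placeholder JNDs rather than the actual data:

```python
import numpy as np
from scipy import stats

# Placeholder JNDs (ms) for 12 hypothetical participants in the two
# matching conditions; generated at random purely for illustration.
rng = np.random.default_rng(0)
jnd_matched = rng.normal(loc=69, scale=15, size=12)
jnd_mismatched = rng.normal(loc=67, scale=15, size=12)

# With one two-level within-participants factor, the ANOVA F(1, 11)
# equals the square of the paired t statistic.
t, p = stats.ttest_rel(jnd_matched, jnd_mismatched)
print(f"F(1, 11) = {t**2:.2f}, p = {p:.3f}")
```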

Fig. 3. (a) Average JNDs for the matched and mismatched audiovisual stimuli (music and object actions) presented in Experiments 1–3. (b) PSSs for the matched and mismatched music and object action stimuli. The error bars represent the standard errors of the mean.

The analysis of the JND data revealed that the participants’ ability to discriminate the temporal order of the auditory and visual object action stimuli was unaffected by whether the object image and action sound matched (M = 69 ms) or not (M = 67 ms; Fig. 3a) (F(1, 11) < 1, n.s.). Note that the JND values obtained here are within the range of values obtained in Vatakis and Spence’s (in press) recent study (e.g., in their Experiment 1, the JND values were 70 ms for the matched stimuli and 59 ms for the mismatched stimuli), where speech stimuli were used. Hence, a difference in task difficulty cannot account for the null effect of matching reported in the present study. Analysis of the PSS data revealed a significant main effect of audiovisual matching (F(1, 11) = 11.36, p < .01), with the matched (M = 54 ms) stimuli requiring larger visual leads for the PSS than the mismatched clips (M = 31 ms; see Fig. 3b).

2.3. Discussion

The analysis of the JND data from Experiment 1 showed absolutely no sign of any difference in the accuracy with which participants were able to discriminate the temporal order of the auditory and visual object action events as a function of whether they came from the same object action or not. As such, the results of Experiment 1 failed to provide any support for the ‘unity assumption’ in the case of the object action video clips used here (cf. Radeau & Bertelson, 1977, Experiment 1, for a similar null result in the case of spatial ventriloquism using as stimuli the visual image and sound of a set of bongos being played).

We thought it at least possible that the null effect of matching might have been due to the participants simply being unaware of the mismatch between the auditory and visual stimuli in the object action video clips (cf. Saldaña & Rosenblum, 1993). We therefore conducted a follow-up control study that was identical to Experiment 1 with the sole exception that the participants now had to indicate whether the auditory and visual streams matched or not (rather than indicating which sensory modality appeared to have been presented first). The results (N = 5 participants) showed that participants were able to discriminate between the matched and mismatched video clips at well over 90% correct. Thus, the null effect of matching on the JND reported in Experiment 1 cannot be accounted for by any lack of awareness on the part of the participants regarding the presence of a discrepancy between what they were seeing and hearing in the mismatched video clips.

3. Experiment 2

Surprisingly few previous studies in the area of temporal perception have attempted to investigate people’s perception of musical stimuli (though see Arrighi, Alais, & Burr, 2006; Vatakis & Spence, 2006a, 2006b, for recent exceptions). However, music might be thought to represent a class of stimulus that is more similar to speech (than object actions and simple sound bursts and light flashes). In particular, musical stimuli often have a complex time-varying profile that is, in some sense, comparable to speech. Additionally, people are not only familiar with music, but can also extract information from a piece of music in a manner that is somewhat similar to what they do when they listen to a message uttered by a speaker. Thus, in our second experiment, we investigated whether the ‘unity assumption’ might affect the multisensory integration of musical stimuli by presenting participants with matched video clips of musical notes (i.e., the matched visual and auditory signals of the note ‘a’ or ‘d’ being played on a piano) and mismatched video clips of musical notes being played on a guitar and a piano (i.e., the sight of the note ‘a’ or ‘d’ being played on the piano paired with the sound of the same note played on a guitar).

3.1. Methods

Thirteen participants (4 male and 9 female) aged 18–35 years (mean age of 23 years) took part in the experiment.

The apparatus, stimuli, design, and procedure were exactly the same as in Experiment 1, with the sole exception that the audiovisual stimuli now consisted of the following: (a) close-up views of a person’s fingers on a piano and of another person’s fingers on a classical guitar playing the notes ‘a’ and ‘d’ (the clips were 1650 ms long; that is, the duration of the event was equal to that of the entire clip, with both the auditory and visual signals being dynamic over the same time period, e.g., starting from the view of the hand and the piano before the contact with the key or string and ending with the lifting of the finger from the key or string); and (b) the same finger movements but with the auditory channels swapped over so that the visual image of the piano was paired with the guitar sound (playing the same note) and the video image of the guitar was paired with the piano sound (see Fig. 1b). Due to the fact that not all of the participants were musically experienced, they were required to complete a short block of trials at the start of the experiment in order to ensure that they could successfully discriminate the particular notes being played and the instruments from which the sounds originated. The practice block was composed of six trials for each matched video clip (giving rise to a total of 24 trials) and was followed by a test block in which the participants were asked to discriminate the notes and instrument being played. All of the participants performed this task perfectly at their first attempt.

3.2. Results

Six of the participants reported having played musical instruments (piano, guitar, violin, and/or cello) and having extensive previous experience of reading and listening to musical notes. However, a preliminary analysis of the data revealed no difference in their performance (i.e., in terms of their JND or PSS values; F < 1, n.s.) relative to those participants who reported having had no, or only limited, prior musical experience, and so this factor is not considered further here (see also Vatakis & Spence, 2006b, for similar null results of musical expertise on TOJ performance for brief audiovisual musical stimuli).

The goodness of the data fits was significant for all conditions (matched: guitar, r² = .88, p < .01; piano, r² = .90, p < .01; mismatched: guitar, r² = .94, p < .01; piano, r² = .93, p < .01; see Fig. 2b). The analysis of the JND data revealed no significant difference in the accuracy with which the participants could judge the temporal order of the auditory and visual musical stimuli when the auditory and visual streams matched (M = 60 ms) versus when they did not (M = 64 ms; see Fig. 3a) (F(1, 12) = 1.69, p = .22). Note that the JND values obtained in this experiment are similar to those reported in Experiment 1 (see also Vatakis & Spence, in press, Experiment 1). The analysis of the PSS data also failed to reveal any significant main effect of audiovisual matching (F(1, 12) = 1.96, p = .19), with both the matched and mismatched musical clips requiring a visual lead in order for the PSS to be reached (matched: M = 61 ms; mismatched: M = 46 ms; see Fig. 3b).

3.3. Discussion

The results of Experiment 2 showed no significant difference in the accuracy with which participants could discriminate the temporal order of the musical video clips as a function of whether those clips consisted of matched versus mismatched auditory and visual streams. These results confirm the findings of Experiment 1 by failing to provide any support for the claim that the ‘unity assumption’ affects the processing of non-speech stimuli (cf. Radeau & Bertelson, 1977, Experiment 1). According to the ‘unity assumption’, the matching of the auditory and visual stimuli should have led to increased crossmodal binding, and hence higher JNDs should have been observed as compared to the mismatched videos (Vatakis & Spence, in press). A further control study once again demonstrated that the participants were able to distinguish the matched from the mismatched videos. This follow-up study (N = 5 participants) was identical to Experiment 2 with the sole exception that participants had to indicate whether the auditory and visual signals matched or not (rather than indicating which sensory modality appeared to have been presented first). As expected, the participants were able to perform this task at well over 90% correct, thus showing that the audiovisual discrepancy on trials in which the auditory and visual signals belonged to different instruments was easily distinguishable.

4. Experiment 3

We conducted a third and final experiment in order to confirm that the ‘unity assumption’ has no behavioral consequences for the binding of audiovisual non-speech stimuli (in the temporal domain, as tested here). In this experiment, we followed closely the experimental set-up used in Vatakis and Spence’s (in press) study of speech stimuli. This was done by only recruiting participants with a high level of prior expertise with the piano, in order to closely match the high familiarity that participants had with the speech stimuli in Vatakis and Spence’s study. Both the visual and auditory signals were taken from one of four notes being played on the same instrument (a piano), thus matching the piano expertise of our participants, following the logic of Vatakis and Spence’s Experiment 3. These stimuli also eliminated any variability that might have been introduced in Experiment 2 by the visual and motor differences involved in presenting video clips of two visually different instruments (a guitar and a piano). If the pattern of results in this final experiment were found to be similar to those reported in our previous two experiments, then we would be able to rule out any account of the absence of the unity effect in terms of possible differences related to the temporal patterning, familiarity, and composition of our stimuli.

4.1. Methods

Nine participants (5 male and 4 female) aged 19–33 years (mean age of 23 years) took part in Experiment 3. The participants all had extensive experience (at least 10 years) of playing the piano. The apparatus, stimuli, design, and procedure were exactly the same as in Experiment 1, with the sole exception that the audiovisual stimuli now consisted of the following: (a) close-up views of a person’s fingers on a piano playing the notes ‘b’, ‘d’, ‘c’, and ‘f’ (duration of 833 ms; that is, the duration of the event was equal to that of the entire clip, with both the auditory and the visual signals being dynamic over the same time period, e.g., starting from the contact with the key and ending with the lifting of the finger from the key); and (b) the same video clips but with the auditory channels swapped over so that the sight of the note ‘c’ being played on the piano was paired with the ‘f’ sound and the sight of the note ‘b’ was paired with the ‘d’ sound.

4.2. Results

The goodness of the data fits was significant for all conditions (matched ‘d’ and ‘b’, r² = .95, p < .01; matched ‘c’ and ‘f’, r² = .95, p < .01; mismatched ‘d’ and ‘b’, r² = .95, p < .01; mismatched ‘c’ and ‘f’, r² = .97, p < .01; see Fig. 2c). The analysis of the JND data revealed no significant difference in the accuracy with which participants were able to judge the temporal order of the auditory and visual stimuli in the case of the matched (M = 59 ms) versus mismatched piano video clips (M = 67 ms; see Fig. 3a) (F(1, 8) = 1.86, p = .20). Note that the JNDs are similar to those obtained in Experiments 1 and 2, and that the trend is actually in the opposite direction to that reported in all four of Vatakis and Spence’s (in press) experiments. The analysis of the PSS data also failed to reveal any significant main effect of audiovisual matching (F(1, 12) = 1.96, p = .19), with both the matched and mismatched musical clips requiring a visual lead in order for the PSS to be achieved (matched: M = 54 ms; mismatched: M = 31 ms; see Fig. 3b).

4.3. Discussion

Once again, the results of Experiment 3 failed to provide any support for the ‘unity assumption’, at least when the participants had to discriminate the temporal order of auditory and visual musical stimuli. The results of our final experiment are therefore consistent with the null effects of matching reported in Experiments 1 and 2. Taken together, these results provide a striking contrast with the significant effects of audiovisual matching obtained in all four of Vatakis and Spence’s (in press) recent experiments utilizing audiovisual speech stimuli (i.e., where significantly lower JNDs were observed for the mismatched as compared to the matched video clips). Given that even expert musicians cannot necessarily recognize musical notes by their pitch alone, we conducted a final control study in order to ensure that our participants had in fact been able to distinguish the matched from the mismatched videos. This follow-up study (N = 5 participants) was identical to Experiment 3 with the sole exception that the participants (who again were all expert musicians) had to indicate whether the auditory and visual piano signals matched or not. The participants were able to perform this task very accurately (M = 94% correct), thus showing that the audiovisual discrepancy on trials where the auditory and visual signals belonged to different piano notes was easily distinguishable.

5. General discussion

Taken together, the results of the three experiments reported in the present study demonstrate that the ‘unity assumption’ does not seem to influence people’s temporal perception of realistic, multisensory, non-speech stimuli (cf. Radeau & Bertelson, 1977, Experiment 1, for a similar conclusion based on the results of a spatial ventriloquism study in which the experimental stimuli consisted of the sight and sound of a set of bongos being played). More specifically, temporal discrimination performance (in terms of the JND) for object actions (i.e., the bouncing of a ball and the smashing of a block of ice) and musical events (i.e., the playing of a piano and a guitar) did not differ significantly as a function of the matching versus mismatching of the video clips. The same was not true when the participants were presented with speech stimuli in Vatakis and Spence’s (in press) recent study. There, the participants had been able to judge the order of the auditory and visual streams significantly more accurately (i.e., the JNDs were significantly lower) for the mismatched speech conditions than for the matched speech conditions. Taken together, these two sets of empirical results therefore demonstrate that while the ‘unity assumption’ can influence the multisensory integration of speech stimuli, it does not necessarily have any such effect on the integration of non-speech stimuli.

The analysis of the PSS data reported in the present study showed that both the matched and mismatched video clips of the object action and musical events used in Experiments 1–3 required a visual lead for the PSS to be reached. Such visual leads are consistent with the results of a number of previous studies of audiovisual synchrony perception that have utilized musical stimuli, where the auditory stimuli had to be delayed by as much as 150 ms relative to the visual stimuli in order to be perceived as being simultaneous (e.g., Arrighi et al., 2006; Vatakis & Spence, 2006a, 2006b; see also Spence et al., 2001, for a review of audiovisual TOJ studies using simpler stimuli reporting both positive and negative PSS values). Just as in Vatakis and Spence’s (in press) study, we also failed to find any reliable effect of matching on the PSS values reported in the present experiments.

The extant literature on spatial ventriloquism suggests that a greater visual bias of perceived auditory location occurs for more ‘meaningful’ (Warren et al., 1981) combinations of audiovisual stimuli (e.g., the sight of a steaming kettle and a whistling sound; Jackson, 1953) than for more arbitrary stimulus combinations (e.g., a simple light-flash and a burst of sound). However, while several empirical studies have been taken to provide support for the ‘unity assumption’ in the domain of spatial ventriloquism (e.g., Jack & Thurlow, 1973; Jackson, 1953; Warren et al., 1981), the possibility that these results reflect nothing more than a response bias has never been unequivocally ruled out (e.g., see Bertelson & Aschersleben, 1998; Bertelson & de Gelder, 2004; Caclin, Soto-Faraco, Kingstone, & Spence, 2002; Welch, 1999a, 1999b). That is, it is unclear whether the unity effect observed in previous spatial ventriloquism research should be attributed to the consequences of enhanced perceptual integration or simply to post-perceptual judgment processes instead (see Bertelson & de Gelder, 2004, on this point). Furthermore, the ‘unity assumption’ has also been reported to have no effect on audiovisual integration in other studies of spatial ventriloquism (see Radeau & Bertelson, 1977, Experiments 1 and 2).

In the present study (as in Vatakis & Spence’s, in press, study), we were able to rule out a response bias account of our findings by utilizing a TOJ task. Specifically, the participants in our study were presented with audiovisual video clips of matched and mismatched object actions or music and were asked to make a judgment as to the order of stimulus presentation (i.e., to report whether the auditory or the visual stream appeared to have been presented first). The presentation of matched versus mismatched video clips in our study should not have had any effect on the likelihood of our participants making an ‘auditory-first’ as opposed to a ‘visual-first’ response. That is, participants’ judgments of audiovisual temporal order were orthogonal to any variations in the strength of the ‘unity assumption’ in our experiment. (Note that this orthogonality in the design would have been lost had we chosen to use a simultaneity judgment task instead, for under such conditions any tendency that participants may have had to respond ‘simultaneous’ as the strength of the ‘unity assumption’ increased would have been indistinguishable from any bias that might have been induced by a cognitively mediated assumption of unity.)

The null results of the ‘unity assumption’ on the JND obtained in all three of the experiments reported here demonstrate that, in contrast to the case of speech stimuli, the ‘unity assumption’ does not appear to influence the binding of realistic non-speech stimuli (at least for the stimuli utilized in the present study). We believe that these three independent null results represent a ‘good effort’ to demonstrate an effect of the ‘unity assumption’ for non-speech stimuli, given the large number of participants tested (34 participants in total), the large number of experimental trials conducted per participant, our control for variance factors, the use of an experimental manipulation that has been shown elsewhere to provide an effective and sensitive measure of performance (see Vatakis & Spence, in press), and the avoidance of both floor and ceiling effects (cf. Frick, 1995). Crucially, we kept the experimental set-up absolutely identical (e.g., task, instructions, testing settings, practice trials, and number of experimental trials, with 144 trials in each of the 5 blocks) to one that has been shown to be highly effective (since it gave rise to a significant effect in all four of Vatakis & Spence’s, in press, recently reported experiments). We also avoided floor and ceiling effects by varying the SOAs (and thus varying task difficulty) between the onset of the auditory and visual stimuli. Finally, it is worth noting Frick’s (1995) suggestion that a null effect can be considered meaningful if a significant effect can be found in very similar circumstances, as is the case for the present study when our results are compared to those of Vatakis and Spence (in press).

One might wonder whether the fact that, to date, support for the ‘unity assumption’ has primarily come from studies of the perception of audiovisual speech events (i.e., Easton & Basala, 1982; Vatakis & Spence, in press; Walker et al., 1995; Warren et al., 1981; though see Radeau & Bertelson, 1977, Experiment 2; Green et al., 1991, for null results), but not from studies that have utilized non-speech events (such as the musical and object action events reported in the present study; see also Radeau & Bertelson, 1977, Experiment 1), should be explained in terms of the putatively ‘special’ nature of speech processing. As noted earlier, a number of researchers have argued that audiovisual speech processing may be special as compared to the processing of other complex non-speech audiovisual events (e.g., Bernstein et al., 2004; Jones & Jarick, 2006; Liberman & Mattingly, 1985; Massaro, 2004; Munhall & Vatikiotis-Bateson, 2004; Tuomainen et al., 2005).

Researchers have argued for the existence of a ‘specific mode of perception’ that is unique to audiovisual speech. This mode refers to the structural and functional processes related to the articulatory gestures of speech and/or to the perceptual processes associated with the phonetic cues present in speech signals (e.g., Tuomainen et al., 2005). The putatively ‘special’ nature of speech processing is presumably related to the fact that speech represents a very important stimulus for human interaction. According to the findings outlined here, another property of speech stimuli that may be special relates to the fact that our perception of the temporal aspects of audiovisual speech stimuli can be influenced (likely by a process of temporal ventriloquism; Morein-Zamir, Soto-Faraco, & Kingstone, 2003; Vatakis & Spence, in press; Vroomen & Keetels, 2006) by the ‘unity assumption’, whereas our perception of other, non-speech (i.e., music or object action) stimuli is seemingly not. It is worth noting on this point that there is no a priori reason to expect that temporal ventriloquism (i.e., the mechanism of sensory realignment suggested to lie behind the unity effect in Vatakis & Spence’s, in press, study) should operate any differently for the case of speech versus non-speech stimuli (indeed, the temporal ventriloquism effect has itself primarily been demonstrated using simple tones and light flashes; e.g., Morein-Zamir et al., 2003; Vroomen & Keetels, 2006).

Alternatively, however, the difference between the results of the present study and the experiments reported by Vatakis and Spence (in press) may also relate to people’s high levels of everyday exposure to (or familiarity with) speech events, which means that people are generally very good at recognizing articulatory gestures and facial movements as compared to other classes of complex events (such as music; see Summerfield, 1987). Thus, it might be argued that in the mismatched speech condition utilized in Vatakis and Spence’s previous study, the participants may have been particularly sensitive to (or aware of) the incompatibility between the movements of the speakers’ faces and the speech sounds that they heard. The participants may somehow have used this information as a cue to help them discriminate the nature of the asynchrony to which they were being exposed (as opposed to the matched condition, where audiovisual binding would have made it harder for the participants to detect whether the auditory or the visual stream had actually been presented first). It could also be argued that participants found the audiovisual speech events to be somehow more ‘attention-grabbing’ than the non-speech events, since previous studies have shown that face stimuli tend to be much more salient than many other classes of stimuli (e.g., see Bindemann, Burton, Hooge, Jenkins, & de Haan, 2005; Theeuwes & Van der Stigchel, in press).

A final possible account for the absence of the unity effect observed in the present study (as compared to the significant effect reported in Vatakis & Spence’s, in press, study) is that there might be a greater temporal correlation between the auditory and visual streams for speech stimuli than for the object action and musical stimuli utilized in the present study (cf. Jack & Thurlow, 1973; Jones & Jarick, 2006). Speech stimuli are composed of complex and dynamic auditory and visual streams that may lead the perceptual system into relying more heavily on the correlation between the time-varying characteristics of those signals (see Summerfield, 1987) in order to facilitate integration (cf. Van Wassenhove, Grant, & Poeppel, 2005). This might not, however, be the case for non-speech stimuli, such as object actions or music, where the perceptual system may rely primarily on the temporal coincidence of the particular events (onsets and offsets) within the two stimulus streams (Jones & Jarick, 2006).²

It should be noted that multisensory integration can be modulated by both cognitive (i.e., top-down) and structural (i.e., stimulus-driven or bottom-up) factors (e.g., Bertelson & de Gelder, 2004; Spence, in press; Welch, 1999a, 1999b). It is, however, at present unclear to what extent Vatakis and Spence’s (in press) findings (regarding the significant effect of the ‘unity assumption’ on audiovisual TOJ performance) reflect the influence of structural factors (cf. Bermant & Welch, 1976; Radeau & Bertelson, 1977, 1987; Welch, 1999a, 1999b) versus more cognitive factors that affect a participant’s decision about whether or not two signals ‘go together’, or refer to the same event (Radeau & Bertelson, 1977). It is, of course, also possible that the effect of the unity assumption in the experiments reported by Vatakis and Spence reflects some unknown combination of structural and cognitive effects. Indeed, the two are likely influenced by many of the same factors (see Spence, in press, on this point). It would seem, a priori, that any purely cognitive factors underlying the ‘unity assumption’ should, if anything, have contributed more to performance in the present study than in Vatakis and Spence’s (in press) study, since the mismatched conditions in the present study were composed of two very different events (i.e., the smashing of a block of ice versus the bouncing of a ball), whereas all of the videos used by Vatakis and Spence (both matched and mismatched) always contained the sight and sound of someone speaking (i.e., the auditory and visual streams always consisted of the same type of event).

In the present study (just as in Vatakis & Spence’s, in press, study), we attempted to minimize any possible bottom-up differences in the integration of the auditory and visual stimuli by carefully matching the timing of the events. It is, however, possible that the differential contribution of bottom-up factors to multisensory integration affects the perception of speech versus non-speech events differentially: for example, it might be the case that people are more sensitive to bottom-up differences for speech than for non-speech stimuli. Therefore, it could be that the absence of the unity effect observed in the present study reflects a combination of both cognitive and structural factors. Alternatively, any cognitive assumption of unity may actually have had little effect on participants’ performance.

² Radeau and Bertelson (1977) examined the influence of discrepant semirealistic auditory–visual events on spatial adaptation. In their second experiment, the participants were presented with videos of auditory speech paired with either the sight of a speaker’s face or amplitude-modulated light-flashes, and were asked to report pre- and post-test additions or deletions in the auditory stream (an after-effect measure). The results showed no significant differences in the shifts of the localization of the auditory speech whether the speaker’s face or the modulated flashes were presented. These results therefore conflict with those reported by Vatakis and Spence (in press), and thus with the claim that the processing of speech is somehow ‘special’ when compared with non-speech events. It is, however, important to note that the conflicting outcomes reported in the two studies could reflect the fact that different response measures were utilized (i.e., online versus after-effect measures; see also Frissen, Vroomen, de Gelder, & Bertelson, 2003, 2005; Navarra et al., 2005, on this point) or that the present study examined the temporal aspects of perception, while Radeau and Bertelson’s study looked at the spatial aspects of perceptual integration. What is more, Radeau and Bertelson utilized speech stimuli that were composed of long sentences, while Vatakis and Spence utilized brief syllables or words. Recent studies have shown that the use of long sentences can result in higher variability in participants’ performance and also in variable shifts in temporal perception (PSS shifts; e.g., Vatakis & Spence, 2006a, 2006b, submitted for publication), thus perhaps making it harder to detect any subtle effects of the ‘unity assumption’ on performance.

An important issue with regard to the ‘unity assumption’ relates to the question of the extent to which people’s conscious awareness of any variation in the strength of the ‘unity assumption’ influences the magnitude of any multisensory integration effects observed. Researchers have previously argued that the pairing of two sensory events should be distinguished from the conscious impression of a common origin (e.g., spatial fusion), due to the fact that pairing may occur in the absence of any such impression of common origin (that is, it may occur unconsciously; e.g., Bertelson & Aschersleben, 1998; Bertelson & de Gelder, 2004; Bertelson & Radeau, 1981; Caclin et al., 2002; de Gelder & Bertelson, 2003; Radeau & Bertelson, 1977; Welch & Warren, 1980). In the present study, the participants were informed that they would be presented with both matching and mismatching video clips. Thus, one would have assumed that the pairing would have been consciously processed/perceived by the participants. Indeed, as noted already, the mismatch (in the mismatched video clips) should have been even more apparent in the present study than in Vatakis and Spence’s (in press) study (given that different types of events were used in the present study, whereas all of the stimuli in Vatakis and Spence’s study consisted of the sight and sound of speech). The participants’ task was, however, unrelated to the matching/mismatching factor (requiring them instead simply to try to discriminate the order of stimulus presentation); therefore, the pairing of the two sensory events could also have been driven by some form of unconscious processing instead. Consequently, at the present time, we are unable to distinguish whether the pairing of the two sensory events (object actions or music) presented in our study was influenced by unconscious or conscious processes. Resolving the relative contributions of conscious and unconscious factors, and of structural versus cognitive factors, to the unity effect will clearly represent a challenging issue for future research.

Acknowledgements

This work was presented at the Seventh Annual Meeting of the International Multisensory Research Forum, June 18–21, 2006, in Dublin, Ireland. The talk was entitled 'Factors modulating the temporal perception of audiovisual speech stimuli'. The abstract can be downloaded from http://www.science.mcmaster.ca/~IMRF/2006/viewabstract.php?id=94&symposium=0. A.V. was supported by a Newton Abraham Studentship from the Medical Sciences Division, University of Oxford.

References

Arrighi, R., Alais, D., & Burr, D. (2006). Perceptual synchrony of audiovisual streams for natural and artificial motion sequences. Journal of Vision, 6, 260–268.
Beauchamp, M. S., Argall, B. D., Bodurka, J., Duyn, J. H., & Martin, A. (2004). Unraveling multisensory integration: Patchy organization within human STS multisensory cortex. Nature Neuroscience, 7, 1190–1192.
Bedford, F. L. (2001). Towards a general law of numerical/object identity. Cahiers de Psychologie Cognitive/Current Psychology of Cognition, 20, 113–175.
Bermant, R. I., & Welch, R. B. (1976). Effect of degree of separation of visual–auditory stimulus and eye position upon spatial interaction of vision and audition. Perceptual and Motor Skills, 43, 487–493.
Bernstein, L. E., Auer, E. T., & Moore, J. K. (2004). Audiovisual speech binding: Convergence or association? In G. A. Calvert, C. Spence, & B. E. Stein (Eds.), The handbook of multisensory processing (pp. 203–223). Cambridge, MA: MIT Press.
Bertelson, P., & Aschersleben, G. (1998). Automatic visual bias of perceived auditory location. Psychonomic Bulletin & Review, 5, 482–489.
Bertelson, P., & Aschersleben, G. (2003). Temporal ventriloquism: Crossmodal interaction on the time dimension. 1. Evidence from auditory–visual temporal order judgment. International Journal of Psychophysiology, 50, 147–155.
Bertelson, P., & de Gelder, B. (2004). The psychology of multimodal perception. In C. Spence & J. Driver (Eds.), Crossmodal space and crossmodal attention (pp. 141–177). Oxford: Oxford University Press.
Bertelson, P., & Radeau, M. (1981). Cross-modal bias and perceptual fusion with auditory–visual spatial discordance. Perception & Psychophysics, 29, 578–584.
Bindemann, M., Burton, A. M., Hooge, I. T., Jenkins, R., & de Haan, E. H. (2005). Faces retain attention. Psychonomic Bulletin & Review, 12, 1048–1053.
Caclin, A., Soto-Faraco, S., Kingstone, A., & Spence, C. (2002). Tactile 'capture' of audition. Perception & Psychophysics, 64, 616–630.
Coren, S., Ward, L. M., & Enns, J. T. (2004). Sensation & perception (6th ed.). Fort Worth: Harcourt Brace.
de Gelder, B., & Bertelson, P. (2003). Multisensory integration, perception and ecological validity. Trends in Cognitive Sciences, 7, 460–467.
Easton, R. D., & Basala, M. (1982). Perceptual dominance during lipreading. Perception & Psychophysics, 32, 562–570.
Finney, D. J. (1964). Probit analysis: Statistical treatment of the sigmoid response curve. London: Cambridge University Press.
Frick, R. W. (1995). Accepting the null hypothesis. Memory & Cognition, 23, 132–138.
Frissen, I., Vroomen, J., de Gelder, B., & Bertelson, P. (2003). The aftereffects of ventriloquism: Are they sound-frequency specific? Acta Psychologica, 113, 315–327.
Frissen, I., Vroomen, J., de Gelder, B., & Bertelson, P. (2005). The aftereffects of ventriloquism: Generalization across sound-frequencies. Acta Psychologica, 118, 93–100.
Green, K. P., & Gerdeman, A. (1995). Cross-modal discrepancies in coarticulation and the integration of speech information: The McGurk effect with mismatched vowels. Journal of Experimental Psychology: Human Perception and Performance, 21, 1409–1426.
Green, K., Kuhl, P., Meltzoff, A., & Stevens, E. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524–536.
Jack, C. E., & Thurlow, W. R. (1973). Effects of degree of visual association and angle of displacement on the ''ventriloquism'' effect. Perceptual & Motor Skills, 37, 967–979.
Jackson, C. V. (1953). Visual factors in auditory localization. Quarterly Journal of Experimental Psychology, 5, 52–65.
Jones, J. A., & Jarick, M. (2006). Multisensory integration of speech signals: The relationship between space and time. Experimental Brain Research, 174, 588–594.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revisited. Cognition, 21, 1–36.
Massaro, D. W. (2004). From multisensory integration to talking heads and language learning. In G. A. Calvert, C. Spence, & B. E. Stein (Eds.), The handbook of multisensory processing (pp. 153–176). Cambridge, MA: MIT Press.
Morein-Zamir, S., Soto-Faraco, S., & Kingstone, A. (2003). Auditory capture of vision: Examining temporal ventriloquism. Cognitive Brain Research, 17, 154–163.
Munhall, K. G., & Vatikiotis-Bateson, E. (2004). Spatial and temporal constraints on audiovisual speech perception. In G. A. Calvert, C. Spence, & B. E. Stein (Eds.), The handbook of multisensory processing (pp. 177–188). Cambridge, MA: MIT Press.
Navarra, J., Vatakis, A., Zampini, M., Humphreys, W., Soto-Faraco, S., & Spence, C. (2005). Exposure to asynchronous audiovisual speech extends the temporal window for audiovisual integration. Cognitive Brain Research, 25, 499–507.
Radeau, M., & Bertelson, P. (1977). Adaptation to auditory–visual discordance and ventriloquism in semirealistic situations. Perception & Psychophysics, 22, 137–146.
Radeau, M., & Bertelson, P. (1987). Auditory–visual interaction and the timing of inputs: Thomas (1941) revisited. Psychological Research, 49, 17–22.
Saldaña, H. M., & Rosenblum, L. D. (1993). Visual influences on auditory pluck and bow judgments. Perception & Psychophysics, 54, 406–416.
Spence, C. (in press). Audiovisual multisensory integration. Journal of the Acoustical Society of Japan: Acoustical Science and Technology.
Spence, C., Shore, D. I., & Klein, R. M. (2001). Multisensory prior entry. Journal of Experimental Psychology: General, 130, 799–832.
Stein, B. E., & Meredith, M. A. (1993). The merging of the senses. Cambridge, MA: MIT Press.
Summerfield, Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 3–51). London: LEA.
Theeuwes, J., & Van der Stigchel, S. (in press). Faces capture attention: Evidence from inhibition-of-return. Visual Cognition.
Thomas, G. J. (1941). Experimental study of the influence of vision on sound localization. Journal of Experimental Psychology, 28, 163–177.
Tuomainen, J., Andersen, T. S., Tiippana, K., & Sams, M. (2005). Audio-visual speech is special. Cognition, 96, B13–B22.
Van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences, 102, 1181–1186.
Vatakis, A., & Spence, C. (2006a). Audiovisual synchrony perception for speech and music using a temporal order judgment task. Neuroscience Letters, 393, 40–44.
Vatakis, A., & Spence, C. (2006b). Audiovisual synchrony perception for music, speech, and object actions. Brain Research, 1111, 134–142.
Vatakis, A., & Spence, C. (in press). Crossmodal binding: Evaluating the 'unity assumption' using audiovisual speech stimuli. Perception & Psychophysics.
Vatakis, A., & Spence, C. (submitted for publication). Investigating the factors that influence the temporal perception of complex audiovisual events. In Proceedings of EuroCogSci07.
Vroomen, J., & Keetels, M. (2006). The spatial constraint in intersensory pairing: No role in temporal ventriloquism. Journal of Experimental Psychology: Human Perception & Performance, 32, 1063–1071.
Walker, S., Bruce, V., & O'Malley, C. (1995). Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124–1133.
Warren, D. H., Welch, R. B., & McCarthy, T. J. (1981). The role of visual–auditory 'compellingness' in the ventriloquism effect: Implications for transitivity among the spatial senses. Perception & Psychophysics, 30, 557–564.
Welch, R. B. (1999a). The advantages and limitations of the psychophysical staircase procedure in the study of intersensory bias: Commentary on Bertelson. In G. Aschersleben, T. Bachmann, & J. Müsseler (Eds.), Cognitive contributions to the perception of spatial and temporal events (pp. 363–369). Amsterdam: Elsevier Science.
Welch, R. B. (1999b). Meaning, attention, and the ''unity assumption'' in the intersensory bias of spatial and temporal perceptions. In G. Aschersleben, T. Bachmann, & J. Müsseler (Eds.), Cognitive contributions to the perception of spatial and temporal events (pp. 371–387). Amsterdam: Elsevier Science.
Welch, R. B., & Warren, D. H. (1980). Immediate perceptual response to intersensory discrepancy. Psychological Bulletin, 88, 638–667.
Witkin, H. A., Wapner, S., & Leventhal, T. (1952). Sound localization with conflicting visual and auditory cues. Journal of Experimental Psychology, 43, 58–67.