

Int. J. Human-Computer Studies 109 (2018) 26–40


International Journal of Human-Computer Studies

journal homepage: www.elsevier.com/locate/ijhcs

Evaluating a 3-D virtual talking head on pronunciation learning

Xiaolan Peng a, Hui Chen a,d,∗, Lan Wang b, Hongan Wang a,c,d

a Beijing Key Lab of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences, Building 5, No. 4, South Fourth Street, Zhong Guan Cun, Beijing 100190, PR China
b CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen University Town, 1068 Xueyuan Avenue, Shenzhen 518055, PR China
c State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Building 5, No. 4, South Fourth Street, Zhong Guan Cun, Beijing 100190, PR China
d University of Chinese Academy of Sciences, 100049, PR China

Keywords: Evaluation; 3-D talking head; Computer-aided pronunciation training

Abstract

We evaluate a 3-D virtual talking head on non-native Mandarin speakers' pronunciation learning under three language presentation conditions: audio only (AU), human face video (HF) and audio-visual animation of a three-dimensional talking head (3-D). An auto language tutor (ALT) configured with AU, HF and 3-D is developed as the computer-aided pronunciation training system. We apply both subjective and objective methods to study user acceptance of the 3-D talking head, user comparative impressions and pronunciation performance under the different conditions. The subjective ratings show that the 3-D talking head achieved a high level of user acceptance, and both 3-D and HF were preferred to AU. The objective pronunciation learning improvements show that 3-D was more beneficial than AU with respect to blade-alveolar, blade-palatal, lingua-palatal, open-mouth, open-mouth(-i) and round-mouth. Learning with 3-D was better than learning with HF with respect to blade-alveolar, lingua-palatal and round-mouth, and the tones of falling-rising and falling. Learning with AU was better than learning with HF with respect to the falling-rising tone. Neither HF nor AU was superior to 3-D with respect to any of the initials, finals and tones.

© 2017 Elsevier Ltd. All rights reserved.

∗ Corresponding author.
E-mail addresses: [email protected] (X. Peng), [email protected] (H. Chen), [email protected] (L. Wang), [email protected] (H. Wang).
http://dx.doi.org/10.1016/j.ijhcs.2017.08.001
Received 31 May 2017; Received in revised form 7 August 2017; Accepted 9 August 2017; Available online 10 August 2017
1071-5819/© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Learning how to pronounce Mandarin properly presents a number of challenges. Non-native Mandarin learners must practice sounds and tones that do not exist in their own language or are typically different from "European/Western" languages. Proper instruction is needed for Mandarin learning. Traditional second language learning methods depend largely upon printed text, audio and video materials (Kim and Gilman, 2008). Recent developments in teaching Mandarin visual speech through animated talking heads provide an appropriate means of facilitating language learning (Chen and Massaro, 2011; Liu et al., 2013).

Animated talking heads have been applied to computer-aided language learning (CALL) as carriers of audio-visual speech (Hazan et al., 2005; Wang et al., 2012a), and they have been expected to enhance computer-aided language learning systems by instructing learners with visualized pronunciation animations. In particular, both internal and external articulator movements have been studied in a three-dimensional talking head to refine instruction for hearing-loss children or second-language learners (Badin et al., 2010; Gibert et al., 2015; Grauwinkel et al., 2007; Wang et al., 2012a).

Multimodal presentation conditions promote effective learning because humans process information through both visual and verbal channels (Mayer, 2009; Sweller et al., 1998). It is promising that a computer-aided language learning system embedding a three-dimensional talking head could promote learning through multimodal interaction. The three-dimensional talking head provides language learners with audio-visual and face-to-face instruction. Although several studies have focused on the implementation of a three-dimensional talking head, few experiments have been performed to evaluate the effectiveness of the talking head on pronunciation learning (Theobald et al., 2008). There is a lack of comprehensive understanding of whether a three-dimensional talking head can efficiently enhance language learning and effectively instruct language learners.


1.1. Articulation in Mandarin

Mandarin is the standard Chinese spoken across most of northern and southwestern China. There are 900 million Chinese native speakers, which is greater than the number of native speakers of any other language in the world (Lewis et al., 2015). Each Mandarin character is spoken as one syllable, consisting of an initial and a final, encoded by a tone (Chen et al., 2013). Mandarin has a total of 21 initials and 39 finals that can be combined to create more than 400 sounds. Most existing studies on visual speech have been performed on "European/Western" languages, particularly English (Chen and Massaro, 2011). Compared with English, Mandarin has some typical characteristics. For example, some Mandarin sounds do not typically appear in English, including the initials of blade-palatal and lingua-palatal, and the finals of close-mouth and round-mouth (Chen and Massaro, 2011). Moreover, the Mandarin supradental, blade-alveolar, and lingua-palatal are produced primarily with the tongue tip, in contrast with those English sounds produced primarily with the tongue blade (Lee and Zee, 2003). In addition, the perception and production of lexical tones are particularly difficult when learning Mandarin (Chen et al., 2013; Chiu et al., 2009). Chen and Massaro (2011) suggested that studying and applying Mandarin visual speech information contributes to both segmental and tonal aspects of visual speech.

Many studies on Mandarin visual speech have been conducted from the perspectives of computer science and computer engineering (Chen et al., 2005; Pei and Zha, 2006; 2007; Wang et al., 2003; Wu et al., 2006; Zhou and Wang, 2007). However, very few evaluation works that specifically attempt to study the language-learning effectiveness of Mandarin visual speech have been published. In a study that evaluated synthetic and natural Mandarin visual speech, Chen and Massaro (2011) compared participants' visual speech perception responses and then improved the quality of the synthetic Mandarin consonants, vowels, and whole syllables conveyed by an animated talking head. To date, there remains a lack of evaluation of language learners' production of Mandarin initials, finals and tones after learning with a three-dimensional talking head.

1.2. Language learning with talking heads

Audio-visual articulatory instructions conveyed by talking heads are beneficial to language learning (Engwall, 2008; Fagel and Madany, 2008). Fagel and Madany (2008) used a 3-D virtual talking head with visualized articulators to train the German pronunciations of /s/ and /z/ for eight children with speech disorders, in which the children's pronunciations were recorded and scored manually. The results showed that six children could significantly enhance their speech production of the /s,z/ sound. Engwall (2008) used an animated virtual teacher to teach seven French subjects to pronounce nine Swedish words in a 5–10 min training program in which the subjects' acoustic and articulatory data were collected by an ultrasound scanner and an electromagnetic tracking system. The results showed that the subjects' pronunciation improvement was achieved through mimicking the articulations indicated by the virtual teacher (Engwall, 2008).

Massaro et al. (2008) used a between-subjects design in which a talking head was shown to the subjects under different presentation conditions, including audio, audio-visual, and frontal and inside views of the vocal tract. It was demonstrated that visible speech contributed positively to the acquisition of new speech distinctions. In recent work of Wang et al. (2014), two groups of Mandarin speakers were trained to learn single vowels using either an auditory or an audiovisual talking head. The experiment showed that the audio-visual group outperformed the auditory group in the task of immediate repetition of vowels. Liu et al. (2007) conducted an online experiment to compare language learners' performance under different conditions of a talking head, a human face and voice only, showing that learners in the talking head condition outperformed those in the voice only condition with respect to improvement in Mandarin finals, whereas no significant training condition effect was found on Mandarin initials. Hamdan et al. (2015) evaluated the effects of the realism level of talking-head characters on students' pronunciation training. Four groups of students learned 20 English words from different characters, showing that the group of students learning with the 3-D non-realistic animation character obtained the best performance in the pronunciation tests, followed by learning with the actual human character, the 2-D animation character and the 3-D realistic animation character.

Many factors can affect users' language learning performance when using talking heads, such as imprecise articulatory movements, over-realistic appearance, limited language training materials and a short language training period, along with a lack of commonly accepted evaluation criteria or evaluation methods until now. To evaluate talking heads in Mandarin pronunciation learning, there is a need for a well-designed talking head and a proper between-subjects design. In our study, we evaluate a three-dimensional talking head on Mandarin pronunciation learning. The talking head exhibits both the external and internal articulatory movements of speaking and instructs Mandarin learners' pronunciations. We developed an auto language tutor (ALT) configured with audio only (AU), human face video (HF) and audio-visual animation of a three-dimensional talking head (3-D). Sixty-nine non-native speakers were recruited to learn 60 Mandarin syllables under the three conditions (AU, HF and 3-D). Comparative results under these conditions were collected and analyzed to provide a clear insight into 3-D talking head effects on Mandarin pronunciation learning.

1.3. Evaluation methods

Subjective evaluation is required to assess synthesized talking heads in terms of both visual speech synthesis intelligibility and naturalness in the LIPS2008 Visual Speech Synthesis Challenge (Theobald et al., 2008). Mattheyses et al. (2009) obtained participant ratings of visual speech naturalness and synchrony between audio and visual tracks using the LIPS2008 visual speech synthesis challenge database. The subjective ratings of preference and humor between synthetic and natural talkers were collected in the study of Stevens et al. (2013) to evaluate modality effects on speech understanding and cognitive load. Subjective ratings of likeability with respect to different talking faces (a standard face, a texture-mapped face and a sample-based face) have also been used to evaluate synthetic talking faces for a simple interactive real-time system that provides information about theater shows (Pandzic et al., 1999).

Objective evaluation is usually applied when subjects' language learning performance is quantitatively measured. Word selecting accuracy (Fagel and Madany, 2008; Massaro and Light, 2004), pronunciation repeating accuracy (Calka, 2011), reaction time (Bailly, 2003; Stevens et al., 2013) and pronunciation naming accuracy (Ali et al., 2015) are the common quantitative measures employed to assess pronunciation learning performance. Ali et al. (2015) examined the effects of three different multimedia presentations on 3-D talking head Mobile-Assisted Language Learning (MALL). The objective pre-test and post-test pronunciation naming scores were utilized to determine participants' overall performance. The results showed that the participants in the 3-D talking head with spoken text and on-screen text MALL outperformed those in the 3-D talking head with spoken text alone MALL and those with spoken text with on-screen text MALL (Ali et al., 2015).

Converging methods based on both subjective and objective data have been conducted by Stevens et al. (2013), in which a dual-task paradigm was used to investigate the relative cognitive demand of perceiving audio-only versus audio-visual speech produced by a talking head. They collected the objective measures of reaction time, shadowing accuracy and latency data, along with subjective ratings of quality, enjoyment and engagement. The results showed that the audio-visual modality had the advantage in speech understanding but created greater cognitive load.


D'Mello et al. (2010) demonstrated that spoken tutorial dialogues increased learning more than typed dialogues did in a human-computer tutorial dialogue system. They recorded the objective measures of content coverage and learning gains along with subjective ratings of user satisfaction. In the study of Yuen et al. (2011), an online CAPT system with a pronunciation learning cycle of "listen-record-check-learn" was designed wherein an animated talking head was used to provide visual articulatory feedback for Chinese learners to learn English. Both objective measures of mispronunciation detection and subjective ratings of user satisfaction were collected and analyzed. The results showed that the online CAPT system had the capability of mispronunciation detection and that the subjects were satisfied with the system.

The ultimate goal of our work is to augment personalized pronunciation training for non-native Mandarin language learners. Both subjective and objective measures were collected in two independent experiments with well-designed training procedures to obtain a more comprehensive evaluation. In the first experiment, users' acceptance of the 3-D talking head and users' preferences for the three presentation conditions (AU, HF and 3-D) were evaluated. In the second experiment, users' pronunciation learning improvements were measured and compared.

1.4. Research questions

The goal of our study is to assess the degree of pronunciation learning using the auto language tutor (ALT) in three presentation conditions: audio only (AU), human face video (HF) and audio-visual animation of a three-dimensional talking head (3-D). Three research questions are investigated:

RQ1: How do language learners accept the ALT with the 3-D talking head?
RQ2: What are language learners' impressions of the three presentation conditions of the ALT?
RQ3: How do language learners perform using the ALT in the three presentation conditions?

1.4.1. How do language learners accept the ALT with the 3-D talking head?

Different systems have different talking heads to present speech. Many studies have made great efforts with graphics-based approaches to make the talking head appear human-like (Mattheyses et al., 2009; Stevens et al., 2013; Theobald et al., 2008; Wang et al., 2012b). However, the uncanny valley effect (Mori et al., 2012) refers to the idea that as robots appear more human-like, our sense of familiarity first increases but then declines sharply into a valley. Butler and Joschko (2009) suggested that a character that is over-realistic or almost resembles a human could eventually cause users to feel fearful and horrified. Our auto language tutor (ALT) system with the 3-D talking head was designed for computer-aided pronunciation training. Whether the appearance of the 3-D talking head is acceptable should be evaluated. Moreover, the 3-D talking head presents visualized pronunciation movements of external and internal articulators. It is unclear whether the learners would feel uncomfortable when the 3-D talking head speaks, particularly with transparent internal articulatory movements. In our study, we suppose that a high level of learners' acceptance would lead to efficient pronunciation learning. Therefore, language learners' acceptance of the ALT with the 3-D talking head was assessed by collecting subjective ratings at the beginning of the study. The subjective evaluation method that we adopted in our study has been widely used in other talking head system evaluations (Kühnel et al., 2008; Mattheyses et al., 2009; Theobald et al., 2008; Weiss et al., 2010).

1.4.2. What are language learners' impressions of the three presentation conditions of the ALT?

Many works have been conducted on analyzing language perceptual ability between the audio-only and audio-visual presentations of a talking head (Badin et al., 2008; Engwall, 2008; Navarra and Soto-Faraco, 2007; Wang et al., 2014; Wik and Engwall, 2008). However, little is known about language learners' comparative impressions of an animated talking head, a real human instructor and an audio-only condition. What do the learners think about the quality of these presentation conditions, and which condition would they choose for future learning? Theobald et al. (2008) suggested that the synthesized visual speech involved in any application must undergo an evaluation of its perceived quality, which can be judged by asking viewers to rate the visual speech, because viewers are often very sensitive to the errors. Hamdan and Ali (2015) evaluated the functionality, future use and many other aspects of a non-realistic three-dimensional talking-head animation by collecting users' subjective opinions. In our study, to obtain language learners' preference for 3-D, HF and AU, learners' interest and their opinions on the functionality (indicated by the perceived quality of the pronunciation instruction) and future use were evaluated by asking participants to rank the three conditions in order of their preference.

1.4.3. How do language learners perform using the ALT in the three presentation conditions?

Sweller et al. (1998) and Mayer (2009) in their studies of multimedia learning proposed that multimodal presentation conditions promote effective learning because humans process information through both visual and verbal channels. Several studies have demonstrated that language learners' performance of speech perception and speech production improved after learning with a talking head with internal articulatory movements (Massaro et al., 2008; Wang et al., 2012a; 2014). Badin et al. (2010) studied whether subjects can understand speech from seeing the tongue in an augmented speech condition. Four presentation conditions, AU (audio signal alone), AVJ (audio signal + cutaway view of the virtual head without tongue), AVT (audio signal + cutaway view of the virtual head with tongue) and AVF (audio signal + complete face with skin texture), were provided. The results showed that the subjects' speech comprehension score was ranked as AVF > AVT > AVJ > AU, indicating that seeing the tongue benefited speech understanding. Badin et al. (2010) speculated that AVJ might show redundant information of the jaw, lips and cheeks and that AVF might be more natural than AVT and AVJ.

In our study, it is appealing to compare learners' language performance using the ALT under the three presentation conditions (AU, HF and 3-D). Furthermore, to discover in which cases learning with 3-D is better than learning with AU or HF, we introduce a task to collect participants' pronunciation learning performance in AU, HF and 3-D and analyze the results in Mandarin segmental and tonal aspects.

1.5. Research plan

A number of factors affect learners' second language performance, such as the learner's first language, educational level, chronological age, and the necessity to learn the target language (Piske et al., 2001). These factors vary from person to person. In our study, two groups of first-year foreign Ph.D. students were recruited as language learners. All of them have the same educational level, a similar age and the same necessity to learn Mandarin. One group of 33 students participated in experiment 1, and another group of 36 students participated in experiment 2. The learning materials are 60 Mandarin syllables (see Appendix A) with clear phonological structures. Because minimal pairs are commonly used to distinguish similar phonemes of the target language (Wang et al., 2012a), all of the Mandarin syllables in the ALT were shown in terms of minimal pairs. Doing so made it easy for participants to observe and mimic.

Both subjective and objective methods were adopted for investigating the three research questions. In experiment 1, RQ1 and RQ2 were assessed subjectively by conducting questionnaire surveys. Questionnaire I was designed to investigate RQ1, in which eight rating questions were included and each statement was rated by the participants on a five-level Likert scale from strongly agree to strongly disagree (Allen and Seaman, 2007; Seferoĝlu, 2005).


Fig. 1. Implementation of 3-D.

Questionnaire II was designed to investigate RQ2, in which six ranking questions were included and each question asked the participants to rank the three conditions (AU, HF and 3-D) in order of their preference. In experiment 2, RQ3 was studied by comparing the participants' learning performance through a between-subjects design with three groups for the three conditions (AU, HF and 3-D). The participants' learning performance refers to pronunciation learning improvement, which is indicated by the pronunciation naming post − pre scores.

The remainder of the paper is organized as follows. In Section 2, the implementation of the 3-D talking head is briefly described, and the interfaces of the auto language tutor (ALT) are shown. Details of the subjective and objective evaluations are presented in Sections 3 and 4. The conclusion and future work are provided in the last section.

2. Description of auto language tutor (ALT)

2.1. 3-D talking head with internal articulator dynamics

The 3-D talking head with internal articulator dynamics has made significant progress in simulating pronunciation in our previous studies (Chen et al., 2016; 2010; Liu et al., 2013; Wang et al., 2012a). In our study, the 3-D talking head with Mandarin pronunciations was implemented in a Visual C++ program with OpenGL libraries for graphic rendering. The implementation of 3-D is shown in Fig. 1.

A native female language teacher was invited to read Mandarin syllables correctly in standard Mandarin. Her audio and articulatory movements were collected synchronously by using a camera to record the videos and an Electro-Magnetic Articulography (EMA AG500) at 200 frames per second to collect the external and internal articulatory movements. The articulatory motions of thirteen feature points were recorded, in which five facial feature points (nose, left head, right head, right jaw, and left jaw) were used for calibration. In particular, four lip feature points (upper lip, lower lip, right lip corner, and left lip corner) and another four internal feature points (tongue tip, middle tongue, tongue root, and middle jaw) were normalized and smoothed to drive the 3-D animation model to speak. The audio in our study is the natural voice of the native female language teacher, rather than a synthesized voice.
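To make this preprocessing step concrete, the following minimal sketch (illustrative Python, not the authors' Visual C++ implementation; the array shapes, the Savitzky-Golay smoother and all names are assumptions) normalizes a raw EMA feature-point trajectory against the five calibration points and smooths it before it drives the animation model:

import numpy as np
from scipy.signal import savgol_filter

def preprocess_ema(traj, ref_points):
    """Normalize and smooth one EMA feature-point trajectory.

    traj:       (n_frames, 3) raw coordinates of one feature point,
                sampled at 200 frames per second.
    ref_points: (n_frames, 5, 3) the five calibration points
                (nose, left/right head, left/right jaw).
    """
    # Express the trajectory relative to the rigid head pose by
    # subtracting the per-frame centroid of the calibration points.
    head_origin = ref_points.mean(axis=1)            # (n_frames, 3)
    normalized = traj - head_origin
    # Smooth each coordinate over time to suppress sensor jitter
    # (window length and polynomial order are illustrative choices).
    return savgol_filter(normalized, window_length=11, polyorder=3, axis=0)

# Synthetic example: a 2 s recording at 200 fps.
rng = np.random.default_rng(0)
tongue_tip = rng.normal(size=(400, 3)).cumsum(axis=0) * 0.01
calibration = rng.normal(size=(400, 5, 3)) * 0.001
print(preprocess_ema(tongue_tip, calibration).shape)   # (400, 3)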

In the articulatory animation process, a static 3-D talking head model was loaded and initialized. The thirteen feature points were identified manually and then registered to the static 3-D talking head model via affine transformations. In anatomy, the motions of the articulators are divided into muscular soft tissue deformations of the lips and tongue, rotational up-down movements of the chin, and relatively fixed parts (Chen et al., 2010). The lips and tongue are both muscular hydrostats composed entirely of soft tissue and move under local deformation. An effective geometric deformation, the Dirichlet free-form deformation algorithm, was applied to control the soft tissue deformations. The up-down rotation of the jaw affects the animation processes of the jaw, the lower teeth, the linked chin skin, and the tongue body. The degree of the rotation was computed from the displacement of the feature point on the lower incisors. Given that the head does not move, the skull, the upper teeth, the linked upper facial skin and the tone ties were kept still when speaking.

As shown in Fig. 1, the 3-D articulatory animation videos show both the frontal view and the profile view that presents the internal articulatory movements of the lips, tongue and jaw. The videos of each syllable last for approximately 10 s. The results in our previous work (Liu et al., 2013) showed that the videos presented by the 3-D talking head show precise pronunciation movements.
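As a rough illustration of the jaw animation just described, the sketch below (a simplified stand-in, not the paper's pipeline; the pivot location, axis convention and all names are hypothetical) derives the jaw-opening angle from the displacement of the lower-incisor feature point and applies it as a rigid rotation to the jaw vertices:

import numpy as np

def jaw_angle(incisor_rest, incisor_now, pivot):
    """Jaw-opening angle (radians) in the sagittal y-z plane,
    estimated from the lower-incisor feature point."""
    v0 = incisor_rest[1:] - pivot[1:]   # rest-pose lever arm (y, z)
    v1 = incisor_now[1:] - pivot[1:]    # current lever arm (y, z)
    return np.arctan2(v1[1], v1[0]) - np.arctan2(v0[1], v0[0])

def rotate_jaw(vertices, pivot, angle):
    """Rotate jaw, lower-teeth and chin vertices about the x-axis
    passing through the pivot (the jaw joint)."""
    c, s = np.cos(angle), np.sin(angle)
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0,   c,  -s],
                      [0.0,   s,   c]])
    return (vertices - pivot) @ rot_x.T + pivot

# Synthetic example: the incisor point drops 8 mm; the jaw follows.
pivot = np.array([0.0, -0.09, 0.02])
incisor_rest = np.array([0.0, -0.02, 0.10])
incisor_now = incisor_rest + np.array([0.0, -0.008, 0.0])
jaw_vertices = np.array([[0.01, -0.03, 0.08], [-0.01, -0.04, 0.07]])
print(rotate_jaw(jaw_vertices, pivot, jaw_angle(incisor_rest, incisor_now, pivot)))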

2.2. System design

The auto language tutor (ALT) is the pronunciation training tool in our study. The process of learning under the three conditions is similar to that of learning from a human teacher in a language class. That is, learners listen to the audio or watch the videos first and then practice the pronunciations.

As shown in Fig. 2, the learners listen to the audios or watch the videos repeatedly by clicking the "Play" button of the ALT. Two paired syllables are shown side-by-side so that the learners can easily identify the pronunciation differences. The "Record" and "Listen" buttons help the learners to record and replay their own pronunciations. The ALT also presents the corresponding Mandarin characters and English meanings of each syllable to help the learners to remember and understand the words. The learners can change to the next page by clicking the "Next" button.

In the ALT, the Mandarin syllables are presented in three presentation conditions (AU, HF and 3-D). The canonical audio of each syllable is played twice in 10 seconds; the playback cannot be paused or skipped. In Fig. 2(a), the videos of HF (human face video) are the videos of the native female language teacher. Fig. 2(b) is the AU (audio only) condition. In Fig. 2(c), the videos of 3-D (audio-visual animation of a three-dimensional talking head) are played once in a frontal view and then once in a profile view. In the profile view, movements of the internal articulators (tongue, teeth, jaw, and palate) are observed clearly. All of the audios in the three conditions are the same, and the volume is set to a fixed level.


Fig. 2. Interface of the auto language tutor (ALT).

3. Experiment 1 – Questionnaire survey

Experiment 1 investigated language learners' acceptance of the ALT with the 3-D talking head (RQ1) and their comparative impressions of the three presentation conditions of the ALT (RQ2). In this experiment, each participant was assigned to attend an experiential course of pronunciation learning in the conditions of AU, HF and 3-D in a random order. The learning materials are the 60 Mandarin syllables (see Appendix A), which were divided equally into 3 units and then presented in pairs in the ALT. The learning materials include 21 types of Mandarin initials and 39 types of Mandarin finals, which cover all of the Mandarin initials and most of the Mandarin finals. The repetition times of each initial/final in the learning materials ranged from one to eight.

3.1. Methods

3.1.1. Participants

Thirty-three non-Chinese speakers (30 men, 3 women, M = 28 years, SD = 2.9 years) participated in experiment 1. All participants were foreign Ph.D. students from the University of Chinese Academy of Sciences who had attended the same Chinese language learning class for only two weeks. All had self-reported normal or corrected vision and normal hearing. Each participant received $20 for participating. Data from three participants were eliminated due to their inconsistent answers.

3.1.2. Procedure

Fig. 3 shows the procedure of experiment 1. Each participant was asked to complete the procedure in a single quiet room. An introduction to the procedure was provided to the participant before the learning. Each participant then learned three pronunciation training units in three presentation conditions, e.g., unit 1 in AU, unit 2 in HF and unit 3 in 3-D, in which the order was assigned randomly by the ALT. The whole experiment was completed in approximately 90 min. When the 3-D unit was completed, the participant was asked to complete Questionnaire I. After completing all three of the units, the participant was asked to complete Questionnaire II.

Questionnaire I includes eight rating questions for evaluating participants' acceptance level of the ALT with the 3-D talking head (RQ1). Questionnaire II includes six ranking questions for comparing participants' general impressions of the three presentation conditions (RQ2). Both the rating questions in Questionnaire I and the ranking questions in Questionnaire II were refined and finalized through pilot tests.

3.2. Results

3.2.1. How do language learners accept the ALT with the 3-D talking head?

Table 1 shows the eight rating questions and the mean agreement score with standard deviation for each question on the five-level Likert scale (score 5: strongly agree, score 4: agree, score 3: neither agree nor disagree, score 2: disagree and score 1: strongly disagree). Among the eight rating questions, question 1 was an overall question concerning the use of the ALT, and question 2 was an overall question concerning the general acceptance of the 3-D talking head. Questions (5–8) were four reverse questions for consistency checking; the results of these questions were reversed to be consistent with questions (1–4) before analysis.


Fig. 3. Procedure of experiment 1.

Table 1
Questionnaire I: a questionnaire to investigate language learners' acceptance of the ALT with the 3-D talking head (RQ1); maximum score is 5 ("strongly agree"), and minimum score is 1 ("strongly disagree").

Question | Mean score | Standard deviation
(1) The ALT tool is easy to use. | 4.83 | 0.379
(2) I can accept the 3-D talking head for pronunciation training. | 4.20 | 0.714
(3) The 3-D talking head offers me a nice look and feeling. | 4.17 | 0.834
(4) The 3-D talking head shows clear and natural movements of the pronunciation. | 4.13 | 0.819
(5) I feel uncomfortable when looking at the 3-D talking head. | 1.77 | 0.935
(6) I could not accept the internal articulators, which one is normally unaccustomed to seeing in daily life. | 2.17 | 0.874
(7) It is difficult to follow the movements of both external and internal articulators of the 3-D talking head. | 2.40 | 1.070
(8) The transparent face of the 3-D talking head is difficult to accept. | 1.83 | 0.834

Table 2
Questionnaire I: factor naming, internal consistency and explained variance.

Factor | Question | Factor loading | Cronbach's alpha | Eigenvalue | % of variance | Cumulative %
Factor 1. Appearance | (3) | 0.897 | 0.856 | 2.565 | 42.7 | 42.7
Factor 1. Appearance | (4) | 0.916 | | | |
Factor 1. Appearance | (5) | 0.832 | | | |
Factor 2. Articulatory movements | (6) | 0.796 | 0.741 | 1.793 | 29.9 | 72.6
Factor 2. Articulatory movements | (7) | 0.926 | | | |
Factor 2. Articulatory movements | (8) | 0.688 | | | |

A Cronbach's alpha score of 0.758 was reached for Questionnaire I, indicating that Questionnaire I was generally reliable (Nunnally, 1978). The results show that the ALT is quite easy to use (question 1, M = 4.83, SD = 0.379) and that participants can easily accept the 3-D talking head for pronunciation training (question 2, M = 4.20, SD = 0.714). Questions (3–8) asked about several aspects related to the overall questions. For questions (3–8), factor analysis was conducted to identify possible underlying factors. The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy was 0.585, and Bartlett's Test of Sphericity indicated that χ² = 80.172, df = 15, and p < 0.001, suggesting that common factors existed. Factors were extracted through principal component analysis. Factors whose eigenvalues were greater than 1.0 were retained (Gorsuch, 1997). Varimax rotation was used to aid interpretation. Two factors were extracted, as given in Table 2, and the factor loadings are all above 0.50. The two factors accounted for 72.6% of the total variance. The internal consistency (indicated by Cronbach's alpha score) was 0.856 for factor 1 and 0.741 for factor 2.

Factor 1, termed "appearance", describes the look and face of the 3-D talking head and explains the largest percentage of the total variance. Any talking head used for pronunciation learning should have a natural but not overly realistic face to make learners feel good when looking at it. Badin et al. (2010) found that subjects learning with AVF understood speech better than did those with AVJ and AVT, because AVF might have a more natural face. Hamdan et al. (2015) found that students exposed to a 3-D non-realistic character achieved better pronunciation performance than did those with a 3-D realistic character, suggesting that considering the level of realism is important during the design phase of the talking-head animation character.

Factor 2, termed "articulatory movements", includes both the internal and external articulatory movements of the 3-D talking head. Mori et al. (2012) found that subjects' feeling of familiarity about robots changed when the robots began to move. The talking heads used for pronunciation learning should also make learners feel comfortable when looking at the movements of the articulators, particularly the movements of internal articulators, which are rarely seen in daily life.

A stepwise multiple linear regression was conducted. The two factors were used as independent variables, and the overall question about acceptance of 3-D was the dependent variable. The results showed that for factor 1, β = 0.545, t = 3.665, p < 0.05, adjusted R² = 0.272, and R² change = 0.297, suggesting that "appearance" is the most important factor influencing the participants' acceptance of 3-D. For factor 2, β = 0.325, t = 2.183, p < 0.05, adjusted R² = 0.358, and R² change = 0.105, indicating that "articulatory movements" is also an important factor that influences the participants' acceptance of 3-D. We found that the participants had a high acceptance level of both the appearance and the articulatory movements of 3-D.
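The analysis pipeline described above (KMO and Bartlett checks, principal-component extraction keeping factors with eigenvalues above 1.0, varimax rotation, and per-factor Cronbach's alpha) can be sketched as follows. This is illustrative Python on randomly generated stand-in ratings, assuming the factor_analyzer and pingouin packages; it is not the authors' analysis script.

import numpy as np
import pandas as pd
import pingouin as pg
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity, calculate_kmo)

# Stand-in for the 30 participants' answers to questions (3)-(8),
# each on a 1-5 Likert scale (reverse items already recoded).
rng = np.random.default_rng(1)
ratings = pd.DataFrame(rng.integers(1, 6, size=(30, 6)),
                       columns=["q3", "q4", "q5", "q6", "q7", "q8"])

# Sampling adequacy and sphericity, as reported above.
chi2, p = calculate_bartlett_sphericity(ratings)
_, kmo_total = calculate_kmo(ratings)
print(f"Bartlett chi2={chi2:.1f}, p={p:.4f}, KMO={kmo_total:.3f}")

# Principal-component extraction: keep factors with eigenvalue > 1.0,
# then refit with varimax rotation to aid interpretation.
fa = FactorAnalyzer(n_factors=6, rotation=None, method="principal")
fa.fit(ratings)
eigenvalues, _ = fa.get_eigenvalues()
n_keep = int((eigenvalues > 1.0).sum())
fa = FactorAnalyzer(n_factors=n_keep, rotation="varimax", method="principal")
fa.fit(ratings)
print(pd.DataFrame(fa.loadings_, index=ratings.columns))

# Internal consistency of each factor's items (grouping as in Table 2).
alpha_f1, _ = pg.cronbach_alpha(data=ratings[["q3", "q4", "q5"]])
alpha_f2, _ = pg.cronbach_alpha(data=ratings[["q6", "q7", "q8"]])
print(f"alpha(factor 1)={alpha_f1:.3f}, alpha(factor 2)={alpha_f2:.3f}")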

3.2.2. What are language learners' impressions of the three presentation conditions of the ALT?

For each question in Questionnaire II, the participants ranked 3-D, HF and AU in order of their preference on a 1–3 scale (ranking 1: the best one, ranking 2: the less good one, ranking 3: the least good one). If one participant assigned ranking 1 to one condition, then no other condition could be assigned ranking 1. Table 3 shows the six ranking questions with their mean ranking, standard deviation and the percentage of times that each condition received the top ranking (ranking 1).

Questionnaire II examined comparative impressions including learners' interest in, and the functionality and future use of, the three presentation conditions. Question 9 compared learners' interest, questions (10–12) compared the functionality, and questions (13–14) compared future use.


Table 3
Questionnaire II: a questionnaire to compare language learners' impressions of the three presentation conditions (RQ2); maximum ranking is 3 ("the least good one"), and minimum ranking is 1 ("the best one").

Question | Condition | Mean ranking | Standard deviation | Ranked 1 (%)
(9) Which condition are you the most interested in? | 3-D | 1.63 | 0.765 | 53.3
 | HF | 1.63 | 0.615 | 43.3
 | AU | 2.73 | 0.521 | 3.3
(10) Which condition helps you the most to understand the correct movements of pronunciation? | 3-D | 1.43 | 0.679 | 66.7
 | HF | 1.77 | 0.626 | 33.3
 | AU | 2.80 | 0.407 | 0.0
(11) Which condition offers you the most complete instruction? | 3-D | 1.33 | 0.711 | 80.0
 | HF | 1.87 | 0.507 | 20.0
 | AU | 2.80 | 0.407 | 0.0
(12) Which condition helps you the most to identify the differences of each pair? | 3-D | 1.57 | 0.728 | 56.7
 | HF | 1.73 | 0.640 | 36.7
 | AU | 2.70 | 0.596 | 6.7
(13) Which condition helps you the most to learn new pronunciation materials? | 3-D | 1.63 | 0.718 | 50.0
 | HF | 1.60 | 0.621 | 46.7
 | AU | 2.77 | 0.504 | 3.3
(14) If you want to learn more pronunciation materials, which condition would you like the most? | 3-D | 1.80 | 0.761 | 40.0
 | HF | 1.60 | 0.724 | 53.3
 | AU | 2.60 | 0.621 | 6.7
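The agreement and condition-effect analysis reported next can be sketched as follows, on stand-in rankings rather than the study data. For K raters ranking N conditions, Kendall's W relates to the Friedman statistic as chi-square = K(N − 1)W, which is consistent with the thresholds reported below (30 × 2 × 0.280 = 16.8).

import numpy as np
from scipy.stats import friedmanchisquare

# Stand-in rankings: 30 participants each rank the three conditions
# (columns: 3-D, HF, AU) from 1 (best) to 3 (least good), no ties.
rng = np.random.default_rng(2)
ranks = np.array([rng.permutation([1, 2, 3]) for _ in range(30)])

k, n = ranks.shape                    # 30 raters, 3 conditions
stat, p = friedmanchisquare(ranks[:, 0], ranks[:, 1], ranks[:, 2])
kendalls_w = stat / (k * (n - 1))     # chi-square = K * (N - 1) * W
print(f"Friedman chi2={stat:.3f}, p={p:.4f}, Kendall's W={kendalls_w:.3f}")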

The results of Kendall's coefficient test showed that for each question, a high inter-rater reliability among the 30 participants was reached: K = 30, N = 3, df = 2, Kendall's W > 0.280, χ² > 16.800, and p < 0.001. The non-parametric Friedman test followed by all pairwise multiple comparisons was conducted to test the significance of the condition effect for each question; the significance level was .05. For each question, both 3-D and HF received lower (better) rankings than did AU, with significant differences (3-D vs. AU: p < 0.05; HF vs. AU: p < 0.05). No significant differences were found between 3-D and HF (3-D vs. HF: p > 0.05). These results indicate that, for learners' interest, functionality and future use, the participants prefer 3-D and HF at the same level but prefer AU the least.

4. Experiment 2 – pronunciation learning performance

Experiment 2 is the objective evaluation, which was designed to assess how language learners perform using the ALT system in the three presentation conditions (AU, HF and 3-D) (RQ3). Each participant used only one of the three presentation conditions for learning all of the 60 Mandarin syllables.

4.1. Methods

4.1.1. Participants

A new group of 36 first-year foreign Ph.D. students (28 men, 8 women, M = 29 years, SD = 3.1 years) from the University of Chinese Academy of Sciences participated in experiment 2. Among them, 21 were from Pakistan, 3 from Kenya, 2 from Thailand, 2 from Sudan, 2 from Ethiopia, 2 from Nepal, 2 from Egypt, 1 from Guinea and 1 from Bangladesh. All participants had self-reported normal or corrected vision and normal hearing. All of them had just begun learning Mandarin in the same Chinese language class approximately two weeks earlier. Each participant received an award of $20 for their participation, and the participants who studied hard received another award of $15.

4.1.2. Procedure

In this experiment, the 36 participants were assigned to three groups of equal size for the three conditions of AU, HF or 3-D. They were assigned based on their baseline pronunciation levels and their nationalities to distribute the skilled participants and first-language (L1) backgrounds equally among the three groups. Before starting the learning procedure, the participants were asked to read out the learning materials of 60 syllables (see Appendix A), and a language teacher was asked to evaluate their baseline pronunciation quality. After that, the participants were assigned to the experimental groups so that each group contained an equal number of equally skilled participants. Fig. 4 shows the procedure of experiment 2. Each participant first completed a speaking test (pre-test) by recording his/her pronunciations of the 29 syllables one by one. The recording set has 26 syllables (the bolded syllables in Appendix A) selected from the learning materials, whereas the other 3 (diū, huī, quē) are not in the learning materials. The participants then completed a first-round intensive learning of the 60 Mandarin syllables in one of the three conditions. During the process, they were asked to listen, watch, record and compare the syllables by themselves for approximately 90 min, followed by a speaking test as a short quiz after learning. The quiz is designed for participants to review and recall what they have learned. Two days later, they completed a second-round reviewing practice of the same 60 syllables in the same presentation condition. Finally, each participant completed a speaking test (post-test).

4.1.3. Evaluation metrics

Different evaluation metrics are used to judge pronunciation performance in visual speech systems (Theobald et al., 2008). Automatic measurements based on computer technology exist, such as the likelihood-based method of "Goodness of Pronunciation" (GOP) (Witt and Young, 1997), which is designed for automatic assessment of the pronunciation quality of non-native speech at the phone level. The basic GOP algorithm is based on automatic speech recognition techniques using Hidden Markov Models. To assess the usability of the GOP measure, Witt and Young (2000) extended the method to include individual thresholds for each phone based on both averaged native confidence scores and on rejection statistics provided by human judges. They designed a set of four performance measures to measure different aspects of how well the computer-derived phone-level scores agree with the human scores. Their experimental results showed that GOP scoring is likely to be capable of providing similar feedback as human judgments.
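For reference, the basic frame-normalized GOP score of Witt and Young is essentially the duration-normalized log posterior of the intended phone p over the acoustic segment O^(p) aligned to it (notation adapted here):

\mathrm{GOP}(p) \;=\; \frac{1}{\mathrm{NF}(p)}\left|\,\log\frac{p\!\left(O^{(p)}\mid p\right)}{\max_{q\in Q}\, p\!\left(O^{(p)}\mid q\right)}\,\right|

where Q is the phone set and NF(p) the number of frames in the segment; scores near zero indicate pronunciations close to the native acoustic models, and the phone-specific thresholds mentioned above then decide acceptance or rejection.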


Fig. 4. Procedure of experiment 2.

Human judgments were used in research works to obtain more reliable ratings directly. With human judgments, a clear rating scale is always used to assess pronunciation quality. A number of assessment scales were used with varying foci and specific purposes. Witt used a 4-point scale in human judgments (Witt and Young, 2000). Neri used six experts on a 10-point scale to evaluate speech independently (Neri et al., 2006). Müller used an 8-point Likert scale to assess the oral proficiency of proficient L2 speakers (Müller et al., 2009). A 5-point scale with a mean opinion scale was the recommended measuring scale for synthesized speech quality (ITU-T, 1994). To measure the improvements of students' pronunciation, Seferoĝlu also used a 5-point Likert scale (Seferoĝlu, 2005).

In our study, we collected and evaluated the pre-test and post-test recordings of 29 syllables to measure the pronunciation learning improvement. A 5-point Likert scale was used to indicate the pronunciation quality of each syllable (score 5: completely accurate, score 4: accurate, score 3: partially accurate, score 2: inaccurate and score 1: completely inaccurate). Since each Mandarin syllable consists of an initial and a final with a single particular tone, and each initial or final has its own special articulatory characteristics, a separate scoring was performed for each initial/final/tone of a syllable. The pre-test and post-test recordings of each syllable were rated as pre and post in the three types of initial, final and tone.

In a recent study of Ali et al. (2015), which was designed to investigate the benefit of including various verbal elements in the 3-D talking head for pronunciation learning among non-native speakers, three native language teachers were appointed to assess the participants' performance to ensure the reliability of the results. In our study, five native language teachers were invited to score the accuracy of pronunciations independently, without knowledge of the learning procedure or the conditions of the experiment. Each language teacher rated all 36 participants on all of the 29 pre-test and all of the 29 post-test syllables in the three types of initial, final and tone, that is, 6264 (36 × 29 × 2 × 3) scores in total. An inter-rater reliability with Kendall's coefficient of 0.67 (K = 5, N = 6264, df = 6263, W = 0.6733, χ² = 21084, p < 0.001) was reached for the scores given by the five language teachers, which represents high inter-rater reliability. Thereafter, the average score of the five language teachers' ratings was used to measure the pronunciation quality of each syllable (initial, final and tone) for the pre-tests and post-tests.

4.2. Results

4.2.1. Pronunciation improvement in all of the 29 syllables

For each participant, the pronunciation improvement in each syllable in the three types of initial, final and tone was measured by calculating the post minus pre scores. The overall pronunciation improvement of each participant was calculated by taking the average of the improvement of the 29 syllables in the three types (initial, final, tone). The average pronunciation improvement of each presentation condition (AU, HF or 3-D) is shown in Fig. 5.

To analyze the overall pronunciation improvement (12 participants in each condition), a one-way ANOVA was conducted on each type (initial, final, tone) across the three presentation conditions. Effects with a significance level < .05 were treated as statistically significant. For initial and tone (with homogeneous variances), post-hoc multiple comparisons were performed using the least significant difference (LSD) corrections. For final (with inhomogeneous variances), post-hoc multiple comparisons were performed using the Games-Howell corrections. The statistical indices are displayed in Appendix B.

For initial, the ANOVA indicated a significant main effect of the presentation condition on the magnitude: F(2, 33) = 8.578, p < 0.05, and η² = 0.342. The post-hoc pairwise comparisons revealed that participants in 3-D achieved greater pronunciation improvement than did those in HF and AU (3-D vs. HF: p < 0.05; 3-D vs. AU: p < 0.05). For final, there was also a significant main effect of the presentation condition: F(2, 33) = 12.963, p < 0.05, and η² = 0.440. The post-hoc pairwise comparisons revealed that participants in 3-D achieved greater pronunciation improvement than did those in HF and AU (3-D vs. HF: p < 0.05; 3-D vs. AU: p < 0.05). For tone, no significant differences were found: F(2, 33) = 1.707, p > 0.05.
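The per-category analyses in this and the following subsections follow the same recipe, sketched below on stand-in improvement scores (hypothetical data and names, assuming the scipy and pingouin packages; not the study data or the authors' script): a one-way ANOVA with the η² effect size, then a Levene check deciding between LSD-style unadjusted pairwise t-tests and Games-Howell comparisons.

import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import levene

# Stand-in post - pre improvement scores, 12 participants per condition.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "condition": np.repeat(["AU", "HF", "3-D"], 12),
    "improvement": np.concatenate([rng.normal(0.3, 0.4, 12),
                                   rng.normal(0.3, 0.4, 12),
                                   rng.normal(0.8, 0.4, 12)]),
})

# One-way ANOVA; np2 is the eta-squared effect size reported above.
aov = pg.anova(data=df, dv="improvement", between="condition")
print(aov[["F", "p-unc", "np2"]])

# Homogeneous variances -> LSD-style unadjusted pairwise t-tests;
# inhomogeneous -> Games-Howell, mirroring the paper's choice.
groups = [g["improvement"].to_numpy() for _, g in df.groupby("condition")]
_, p_levene = levene(*groups)
if p_levene > 0.05:
    posthoc = pg.pairwise_tests(data=df, dv="improvement",
                                between="condition", padjust="none")
else:
    posthoc = pg.pairwise_gameshowell(data=df, dv="improvement",
                                      between="condition")
print(posthoc)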

4.2.2. Initials in category

According to the International Phonetic Alphabet (Association et al., 2005), there are two ways to classify initials: by the place of articulation and by the manner of articulation. In Mandarin, the initials are normally subdivided into seven categories by the place of articulation: bilabial, labiodental, supradental, blade-alveolar, blade-palatal, lingua-palatal and backlingual (Li and Thompson, 1989; Norman, 1988; Pulleyblank, 2011). In our study, the pronunciation improvement results of initials in categories are shown in Fig. 6.

A one-way ANOVA was conducted for initials in categories across the three presentation conditions. Effects with a significance level < .05 were treated as statistically significant. The post-hoc multiple comparisons were performed using the Games-Howell corrections for the labiodental initials (with inhomogeneous variances). For the other initials (with homogeneous variances), the LSD corrections were used.

For the blade-alveolar initials, the ANOVA indicated a significant main effect of presentation condition on the magnitude: F(2, 33) = 8.886, p < 0.05, and η² = 0.350. The post-hoc pairwise comparisons revealed that participants in 3-D achieved greater pronunciation improvement than did those in HF and AU (3-D vs. HF: p < 0.05; 3-D vs. AU: p < 0.05). For the blade-palatal initials, there was also a significant main effect of presentation condition: F(2, 33) = 3.723, p < 0.05, and η² = 0.184. The participants in 3-D achieved greater pronunciation improvement than did those in AU (3-D vs. AU: p < 0.05). For the lingua-palatal initials, a significant main effect was also found: F(2, 33) = 4.556, p < 0.05, and η² = 0.216. The participants in 3-D achieved greater pronunciation improvement than did those in HF and AU (3-D vs. HF: p < 0.05; 3-D vs. AU: p < 0.05). No significant differences were found with respect to the bilabial, labiodental, supradental or backlingual initials.

4.2.3. Finals in category

Mandarin finals can normally be subdivided into four categories: open-mouth, even-teeth, close-mouth and round-mouth (Li and Thompson, 1989; Norman, 1988; Pulleyblank, 2011). The open-mouth finals do not have a medial, the even-teeth finals begin with [i], the close-mouth finals begin with [u] and the round-mouth finals begin with [y] (Li and Thompson, 1989; Norman, 1988; Pulleyblank, 2011). In our study, we treated the open-mouth(-i) as an independent subcategory because most participants found it very difficult to learn. The pronunciation improvement results of finals in categories are shown in Fig. 7.

A one-way ANOVA was conducted for finals in categories across the three presentation conditions. Effects with a significance level < .05 were treated as statistically significant. The post-hoc multiple comparisons were performed using the Games-Howell corrections for the open-mouth(-i) (with inhomogeneous variances). For the other finals (with homogeneous variances), the LSD corrections were used.


Fig. 5. Experiment 2: the average pronunciation improvement (post–pre) of each condition. Error bars refer to standard deviation.

Fig. 6. Experiment 2: the pronunciation improvement (post–pre) of initials in categories. Error bars refer to standard deviation.


For the open-mouth finals, the ANOVA indicated a significant main effect of presentation condition on the magnitude; F(2,33) = 4.423, p < 0.05, and η² = 0.172. The post-hoc pairwise comparisons revealed that participants in 3-D achieved greater pronunciation improvement than did those in AU (3-D vs. AU: p < 0.05). For the open-mouth(-i) final, there was also a significant main effect of presentation condition; F(2,33) = 8.065, p < 0.05, and η² = 0.328. The participants in 3-D achieved greater pronunciation improvement than did those in AU (3-D vs. AU: p < 0.05). For the round-mouth finals, a significant main effect was also found; F(2,33) = 3.914, p < 0.05, and η² = 0.192. The participants in 3-D achieved greater pronunciation improvement than did those in HF and AU (3-D vs. HF: p < 0.05; 3-D vs. AU: p < 0.05). No significant differences were found on the even-teeth or the close-mouth finals.

4.2.4. Tones

There are four main tones in Mandarin: level, rising, falling-rising and falling (Li and Thompson, 1989).


Fig. 7. Experiment 2: the pronunciation improvement (post–pre) of finals in categories. Error bars refer to standard deviation.


The 29 recorded syllables included 2 level tones, 3 rising tones, 2 falling-rising tones and 2 falling tones. The pronunciation improvement results of the four tones are shown in Fig. 8.

A one-way ANOVA was conducted for tones in categories on the three presentation conditions. Effects with a significance level < .05 were treated as statistically significant. The post-hoc multiple comparisons were carried out using the LSD corrections.

For the falling-rising tone, the ANOVA indicated a significant main effect of presentation condition on the magnitude; F(2,33) = 6.830, p < 0.05, and η² = 0.293. The post-hoc pairwise comparisons revealed that participants in 3-D achieved greater improvement than did those in HF (3-D vs. HF: p < 0.05) and participants in AU achieved greater improvement than did those in HF (AU vs. HF: p < 0.05). For the falling tone, there was also a significant main effect; F(2,33) = 3.443, p < 0.05, and η² = 0.173. The participants in 3-D achieved greater improvement than did those in HF (3-D vs. HF: p < 0.05). No significant differences were found on the level tone or the rising tone.
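For reference, Fisher's LSD procedure used above amounts to pairwise t-tests that pool the within-group variance (the ANOVA's mean square error) across all conditions. Below is a minimal sketch with synthetic scores (hypothetical values, not the study's data):

```python
import numpy as np
from scipy import stats

def fisher_lsd(groups, i, j):
    """Fisher's LSD comparison of groups i and j after a one-way ANOVA:
    a t-test whose variance estimate pools the within-group variance
    (the ANOVA's MSE) over all groups, with df = N - k."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    df_within = n_total - k
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_within
    gi, gj = groups[i], groups[j]
    t = (gi.mean() - gj.mean()) / np.sqrt(mse * (1 / len(gi) + 1 / len(gj)))
    p = 2 * stats.t.sf(abs(t), df_within)   # two-sided p-value
    return t, p

# Hypothetical improvement scores for AU, HF and 3-D (12 learners each).
rng = np.random.default_rng(2)
groups = [rng.normal(m, 0.04, 12) for m in (0.05, 0.06, 0.12)]
t, p = fisher_lsd(groups, 2, 1)   # e.g. 3-D vs. HF
print(f"t(33) = {t:.3f}, p = {p:.3f}")
```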

5. Discussion

5.1. Summary of findings

In experiment 1, we designed Questionnaire I to investigate language learners' acceptance of the ALT with the 3-D talking head (RQ1). The statistical results of Questionnaire I show that the ALT is quite easy to use and participants can easily accept the 3-D talking head for pronunciation training. Moreover, we found that two factors, "appearance" and "articulatory movements", significantly influenced learners' acceptance of the 3-D talking head, and both factors obtained a high level of learners' acceptance.

We designed Questionnaire II to compare language learners' impressions of the three presentation conditions of the ALT (RQ2). The three presentation conditions are: audio only (AU), human face video (HF) and audio-visual animation of a three-dimensional talking head (3-D). The results of Questionnaire II show that, concerning learners' interest, functionality and future use, the participants preferred 3-D and HF at the same level and preferred AU at the lowest level.

In experiment 2, we examined how language learners perform using the ALT system in the three presentation conditions (AU, HF and 3-D) (RQ3). We calculated participants' overall pronunciation improvement in each condition. We found that participants achieved pronunciation improvements to different degrees in initial, final and tone under the three presentation conditions. For both Mandarin initials and finals, participants in the 3-D condition gained greater improvement than did those in HF and AU, with significant differences. No significant differences were found in the Mandarin tones according to the overall results.

The subcategory results of the initials revealed that there was a significant difference with respect to the blade-alveolar, blade-palatal and lingua-palatal initials. For the blade-alveolar and lingua-palatal initials, participants in the 3-D condition achieved greater improvement than did those in HF and AU. For the blade-palatal initials, participants in the 3-D condition improved more than those in AU. No significant differences were found with respect to the bilabial, labiodental, supradental, or backlingual initials.

The subcategory results of the finals revealed that there was a significant difference with respect to the open-mouth, open-mouth(-i) and round-mouth finals. For the round-mouth finals, participants in the 3-D condition improved more than those in HF and AU. For the open-mouth finals and the open-mouth(-i), participants in the 3-D condition achieved greater improvement than did those in AU. No significant differences were found with respect to the even-teeth or close-mouth finals.

The subcategory results of the tones showed that, for the falling-rising tone, the participants in both 3-D and AU outperformed those in HF. For the falling tone, the participants in 3-D outperformed those in HF. No significant differences were found on the level or rising tones.


Fig. 8. Experiment 2: the pronunciation improvement (post–pre) of the tones. Error bars refer to standard deviation.



5.2. Interpretations

According to Mayer's cognitive theory of multimedia learning (Mayer, 2009), humans process information through the visual and verbal channels, and effective learning occurs when a learner retains relevant information in both of the channels. In our study, both 3-D and HF are multimodal presentation conditions that show both audio and visual information. HF is a real human face with visualized external articulatory movements, and 3-D is a talking head animation with both visualized external and internal articulatory movements. We argue that, concerning multimedia language learning, the functional articulation information presented through the visual channel should benefit pronunciation learning, whereas involving irrelevant details should disrupt effective learning. The proper design of the 3-D talking head, with a highly acceptable appearance and precise articulatory movements, could have contributed to the good performance of the participants in the 3-D condition in our study.

Based on the Uncanny Valley phenomenon on humans (Mori, 1970; Mori et al., 2012), Hamdan et al. (2015) suggested that it is important to select the proper level of realism when designing the talking head animation character. They provided conditions of a 3D non-realistic animation character (3D-NR), a 3D realistic animation character (3D-R), an actual human character (HUMAN) and a 2-D animation character (2D) for pronunciation training. 3D-NR has a cartoon-like face, whereas 3D-R has a highly human-like face. Both 3D-NR and 3D-R produce natural lip synchronization but not internal articulations. The results showed that the group of students exposed to 3D-NR obtained the best pronunciation performance, followed by HUMAN, 2D and 3D-R. Hamdan et al. (2015) speculated that a proper mental model could be formed without any influence from the effects of realism levels in 3D-NR; this factor could have contributed to the good performance of the 3D-NR group. Besides, compared with 3D-R, 3D-NR could reduce the burden of cognitive activities in students' brains (Hamdan et al., 2015). However, Hamdan et al. (2015) did not explain why learning with 3D-NR was better than learning with HUMAN in their study.

In our study, the Uncanny Valley effect was avoided when developing the 3-D talking head. Unlike 3D-R in the work of Hamdan et al. (2015), which is covered with a realistic appearance of wrinkles, moustaches, and shadows on the face, the 3-D talking head in our work is more cartoon-like and appears friendlier. Moreover, the 3-D talking head can show transparent internal articulator movements whereas 3D-R cannot. In our study, the results of experiment 1 showed that the 3-D talking head obtained a high level of acceptance of both its appearance and articulatory movements. Besides, concerning learners' interest, functionality and future use, the 3-D talking head obtained the same level of preference as the real human face of a friendly-looking girl.

The subcategory results showed that participants in the 3-D condition achieved greater improvement than did those in AU with respect to more than half of the initials and finals, indicating that the visual information (the animated face and the external and internal articulatory movements) presented by the 3-D talking head promotes effective learning. Furthermore, we found that the initials and finals in which participants in 3-D outperformed those in AU are articulated primarily by moving the internal articulators or by coordinating the internal and external articulators. However, several of the initials and finals for which 3-D showed no difference from AU (bilabial, labiodental, supradental and close-mouth) are articulated primarily by moving the external articulators. We can therefore infer that the visualized internal articulatory movements of 3-D contribute to the better performance of 3-D than AU.

No significant difference was found between HF and AU concerning all of the initials and finals, although HF also presented both the visual information (the real face and the external articulator movements) and the audio information. This result could be because HF is a real human face that conveyed some irrelevant details, such as the girl's vivid mouth, hair and neck, which can seize participants' attention and then disrupt the meaningful learning process. This split-attention effect with the HF condition was identified in experiment 1 when some of the participants talked about the girl's features with great interest.

We found that participants in 3-D outperformed those in HF with respect to blade-alveolar, lingua-palatal and round-mouth, all of which are articulated with highly involved internal articulators.



The blade-alveolar and lingua-palatal initials are articulated primarily by moving the tip of the tongue or the blade of the tongue against the alveolar ridge, termed apical and laminal respectively. The round-mouth finals are articulated primarily by cooperatively moving the blade of the tongue and the lips. We can therefore infer that the visualized internal articulatory movements of 3-D also contribute to the better performance of 3-D than HF. In addition, no significant difference was found between 3-D and HF with respect to some of the initials and finals, which might be because the external articulatory movements (lip movements) are more natural when presented by HF than by 3-D.

In particular, the acquisition of Mandarin tones is also a well-known challenge in Mandarin pronunciation learning. The pitch contour of a syllable defines the lexical tone; thus, the lexical meaning of a Mandarin character is changed by merely changing the pitch contour without altering the phonetic characteristics (Chen et al., 2015). Chen et al. (2015) argued that, for non-native Mandarin learners, the falling-rising tone is the most difficult one and the level tone is the easiest one. In our study, we found that the conditions of 3-D and AU are more beneficial than HF for the learning of the falling-rising tone. Sweller et al. (1998) argued that the visual and verbal channels in the human cognitive system are independent and that each of them has limited capacity. When participants' attention was distracted by the irrelevant details conveyed by the vivid human face, the effective learning of tones could be affected because it depends more upon the audio information instruction.
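To make the pitch-contour point concrete, the four main tones are conventionally described with Chao tone numerals (level 55, rising 35, falling-rising 214, falling 51 on a 1-5 pitch scale), and the sketch below maps these standard contours onto a rough F0 track for a single syllable. The Chao values are a textbook convention and the F0 range is an illustrative choice; neither is taken from the paper.

```python
import numpy as np

# Chao tone numerals (1 = lowest, 5 = highest pitch) for the four main
# Mandarin tones; a textbook convention, not taken from the paper.
CHAO_CONTOURS = {
    "level":          (5, 5),     # tone 1, e.g. ma1 (mother)
    "rising":         (3, 5),     # tone 2, e.g. ma2 (hemp)
    "falling-rising": (2, 1, 4),  # tone 3, e.g. ma3 (horse)
    "falling":        (5, 1),     # tone 4, e.g. ma4 (to scold)
}

def f0_track(tone, f0_min=100.0, f0_max=200.0, n=50):
    """Map the 5-level Chao scale linearly onto [f0_min, f0_max] Hz and
    interpolate a rough F0 contour over one syllable's duration."""
    levels = np.asarray(CHAO_CONTOURS[tone], dtype=float)
    hz = f0_min + (levels - 1.0) / 4.0 * (f0_max - f0_min)
    x = np.linspace(0.0, len(hz) - 1.0, n)
    return np.interp(x, np.arange(len(hz)), hz)

# The same syllable takes four different lexical meanings purely from
# these contour differences.
for tone in CHAO_CONTOURS:
    track = f0_track(tone)
    print(f"{tone:>14}: {track[0]:.0f} Hz -> {track[-1]:.0f} Hz")
```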

One finding of our study is that the visualization of internal articulators is beneficial for (Mandarin) language learning, which is in accordance with most previous studies of (European/Western) language learning (Engwall, 2008; Fagel and Madany, 2008; Liu et al., 2007; Navarra and Soto-Faraco, 2007; Wik and Engwall, 2008). Practicing clear pronunciations of new words in a second language could be particularly difficult for adult learners (Tan et al., 2013). The sounds of Mandarin initials and finals are determined not only by the position in which the exhaled air is blocked in the cavity of the articulators but also by how the air is blocked or released. The 3-D talking head shows clear articulatory movements but is limited in displaying how to control the aspiration. That limitation might be why no significant difference was found with respect to the labiodental initials.

5.3. Implications

Massaro et al. (2008) argued that the effectiveness of showing internal articulatory movements for pronunciation training is hard to prove and evaluate. Massaro et al. (2008) carried out a preliminary study to investigate whether viewing the tongue, palate, and velum during production is more beneficial for learning than a standard frontal view of the speaker, and to determine whether learning differences are due to enhanced receptive learning from additional visual information or to more active learning motivated by the visual presentations. By observing and calculating imitative behaviors, they found that, for two similar Mandarin vowels /i/ and /y/, there appeared to be a trend in which the audiovisual condition was more likely to lead to imitative behavior than the audio only. According to the subcategory results analyzed in our study, most of the initials and finals in which participants in 3-D outperformed those in AU and HF are pronounced primarily by moving the internal articulators, indicating that participants in 3-D really did get additional help, probably by observing or imitating the internal articulatory movements presented by the 3-D talking head.

Although HF is the most natural presentation condition in our study, the participants in HF did not achieve any better performance than did those in 3-D and AU. It appears not to be true that the more natural the presentation condition, the better the pronunciation performance second-language learners achieve. Liu et al. (2007) suggested that emotions that are formed through facial expression also have a significant role in capturing attention and therefore facilitate learning. Studies have striven to develop a talking head with a realistic face or even facial expressions to display visual speech (Ali et al., 2015; Mattheyses et al., 2009; Stevens et al., 2013; Wang et al., 2012b). In contrast, our results suggest that a talking head for second language learning, at least for pronunciation learning, should not convey much vivid information that is either irrelevant or only slightly related to the functional articulations. The design of the talking head should help pronunciation learners easily concentrate their attention on the functional articulations rather than on irrelevant peripheral details.

According to our findings, 3-D and AU are more helpful than HF in the learning of Mandarin tones, especially in the learning of the falling-rising tone. Considering that the simulation of aspiration is important to distinguish some confusable consonants (Chen et al., 2016), whereas the 3-D talking head in our work is limited in displaying how to control the aspiration, we suppose that a 3-D talking head with proper instructions on aspiration might show more potential advantages in Mandarin tone learning.

Before using the ALT, some participants complained to the experimenter that it was quite difficult to learn new pronunciations in their Mandarin language class because there were too many students but only one teacher, and the teacher spoke too fast. Due to the shortage of language professionals and skilled teachers who can offer individual or personalized language training (Massaro, 2006), animated talking heads can be well customized according to individuals' needs at a relatively low cost. Furthermore, the ALT with the 3-D talking head provides a flexible and feasible means of updating the language learning materials.

6. Conclusion and future work

In this study, we evaluate a 3-D virtual talking head on non-native Mandarin speaker pronunciation learning under three language presentation conditions: audio only (AU), human face video (HF) and audio-visual animation of a three-dimensional talking head (3-D). An auto language tutor (ALT) configured with AU, HF and 3-D is developed as the computer-aided pronunciation training system. We apply both subjective and objective methods to study language learners' acceptance of the 3-D talking head, their comparative impressions of the three presentation conditions, and their pronunciation performance under the three presentation conditions.

Experiment 1 studied language learners' acceptance of the 3-D talking head and their comparative impressions of AU, HF and 3-D. Results show that the ALT system is easy to use. Two factors, termed "appearance" and "articulatory movements", significantly influence learners' acceptance of the 3-D talking head, and both factors achieved a high level of learners' acceptance. Results also show that participants preferred 3-D and HF to AU with respect to learners' interest, functionality and future use.

Experiment 2 compared learners' pronunciation learning improvements in AU, HF and 3-D. We found that learning with 3-D was better than learning with AU with respect to blade-alveolar, blade-palatal, lingua-palatal, open-mouth, open-mouth(-i) and round-mouth. Learning with 3-D was better than learning with HF with respect to blade-alveolar, lingua-palatal, round-mouth, and the tones of falling-rising and falling. Learning with AU was better than learning with HF with respect to the falling-rising tone. Neither HF nor AU was superior to 3-D with respect to any of the initials, finals and tones.

We analyzed the results in low-level classifications of Mandarin initials, finals and tones. We conclude that it is valuable to provide instructions on internal articulatory movements for non-native adult learners to learn new pronunciations. The proper design of the 3-D talking head, with a highly acceptable appearance and precise articulatory movements, could have contributed to the better performance of the participants in 3-D than those in HF and AU.



Participants' good performance with respect to the initials and finals dominated by internal articulator movements was probably acquired by observing or imitating the additional information presented by the 3-D talking head. The participants in HF did not achieve any better performance than did those in 3-D and AU, which may be because HF conveyed some irrelevant details that may easily distract participants' attention. In general, we suggest that any talking head used for pronunciation training should be designed with a natural but not overly realistic face, clear and acceptable articulator movements, and little vivid information that is only slightly related to the functionality.

Limitations of the study include the reliance on only one single talking head and a single female native language teacher in the HF videos. The number of sample tones (except the level tone) was small: only 3 rising tones, 2 falling-rising tones and 2 falling tones were included. Most of the recruited Ph.D. students were male. A wider range of participants from other backgrounds should also be recruited to evaluate the validity of the 3-D talking head. According to the findings of our study, we are not sure whether a 3-D talking head with only a good appearance or only precise articulatory movements might also obtain similar results. In future evaluations of a 3-D talking head, more conditions should be included, and users' learning behaviors can be recorded to provide a deeper investigation. More language materials, such as phrases and sentences, should also be presented by the 3-D talking head to conduct an extended evaluation on language learning.

Acknowledgments

This work was supported by the National Key R&D Program of China (2016YFB1001201), the National Natural Science Foundation of China (NSFC: 61232013, 61661146002, 61422212, 91420301) and the ShenZhen Fundamental Research Program (JCYJ20160429184226930).

Appendix A. Pronunciation training stimuli of the 60 Mandarin syllables in which the recorded pronunciations are shown in boldface

Unit 1          Unit 2          Unit 3
kā, kē          mā, hā          niǔ, niǎo
mō, mē          mái, méi        nǚ, lǚ
dē, gē          fān, fēn        qú, qún
tē, hē          fāng, fēng      guī, gān
lē, ēr          pīn, pīng       guā, guān
qiā, qiē        lēng, lōng      quān, chuān
bāo, biāo       qiáng, qióng    xuān, shuān
pān, piān       juān, zhuān     qá, rá
má, mú          zī, zhī         cī, chī
sī, shī         sīi, shīi       cā, chā

Appendix B. Statistics of the ANOVAs and post-hoc pairwise comparisons of the pronunciation improvement, in which * means the p-value is less than 0.05

Pronunciation improvement (post–pre): effect of presentation condition

All the 29 Mandarin syllables
  Initial:        F(2,33) = 8.578, p = 0.001, η² = 0.342; 3-D vs. HF: p = 0.003*; 3-D vs. AU: p = 0.001*; HF vs. AU: p = 0.541
  Final:          F(2,33) = 12.963, p = 0.000, η² = 0.440; 3-D vs. HF: p = 0.016*; 3-D vs. AU: p = 0.001*; HF vs. AU: p = 0.110
  Tone:           F(2,33) = 1.707, p = 0.197; –

Initials in category
  Bilabial:       F(2,33) = 1.545, p = 0.228; –
  Labiodental:    F(2,33) = 1.203, p = 0.313; –
  Supradental:    F(2,33) = 2.487, p = 0.099; –
  Blade-alveolar: F(2,33) = 8.886, p = 0.001, η² = 0.350; 3-D vs. HF: p = 0.002*; 3-D vs. AU: p = 0.000*; HF vs. AU: p = 0.628
  Blade-palatal:  F(2,33) = 3.723, p = 0.035, η² = 0.184; 3-D vs. HF: p = 0.080; 3-D vs. AU: p = 0.012*; HF vs. AU: p = 0.391
  Lingua-palatal: F(2,33) = 4.556, p = 0.018, η² = 0.216; 3-D vs. HF: p = 0.016*; 3-D vs. AU: p = 0.011*; HF vs. AU: p = 0.891
  Backlingual:    F(2,33) = 2.121, p = 0.136; –

Finals in category
  Open-mouth:     F(2,33) = 3.423, p = 0.045, η² = 0.172; 3-D vs. HF: p = 0.086; 3-D vs. AU: p = 0.015*; HF vs. AU: p = 0.439
  Open-mouth(-i): F(2,33) = 8.065, p = 0.001, η² = 0.328; 3-D vs. HF: p = 0.200; 3-D vs. AU: p = 0.003*; HF vs. AU: p = 0.079
  Even-teeth:     F(2,33) = 3.132, p = 0.057; –
  Close-mouth:    F(2,33) = 0.809, p = 0.454; –
  Round-mouth:    F(2,33) = 3.914, p = 0.030, η² = 0.192; 3-D vs. HF: p = 0.019*; 3-D vs. AU: p = 0.023*; HF vs. AU: p = 0.935

The tones
  Level:          F(2,33) = 0.084, p = 0.920; –
  Rising:         F(2,33) = 1.427, p = 0.254; –
  Falling-rising: F(2,33) = 6.830, p = 0.003, η² = 0.293; 3-D vs. HF: p = 0.002*; 3-D vs. AU: p = 0.734; HF vs. AU: p = 0.005*
  Falling:        F(2,33) = 3.443, p = 0.044, η² = 0.173; 3-D vs. HF: p = 0.019*; 3-D vs. AU: p = 0.673; HF vs. AU: p = 0.051

References

Ali, A.Z.M., Segaran, K., Hoe, T.W., 2015. Effects of verbal components in 3D talking-head on pronunciation learning among non-native speakers. J. Educ. Technol. Soc. 18 (2), 313–322.
Allen, I.E., Seaman, C.A., 2007. Likert scales and data analyses. Qual. Prog. 40 (7), 64.
Association, I.P., et al., 2005. International Phonetic Alphabet. Revised to 2005.
Badin, P., Tarabalka, Y., Bailly, G., 2008. Can you 'read tongue movements'? Proceedings of Interspeech, Brisbane, Australia, 2635–2637.
Badin, P., Tarabalka, Y., Elisei, Y., Bailly, G., 2010. Can you 'read' tongue movements? Evaluation of the contribution of tongue display to speech understanding. Speech Commun. 52 (6), 493–503.
Bailly, G., 2003. Close shadowing natural versus synthetic speech. Int. J. Speech Technol. 6 (1), 11–19.
Butler, M., Joschko, L., 2009. Final Fantasy or The Incredibles. Animation Studies: Peer Reviewed Online J. Animat. Hist. Theory 8, 15–24.
Calka, A., 2011. Pronunciation learning strategies – identification and classification. In: Speaking and Instructed Foreign Language Acquisition. Multilingual Matters, Clevedon, pp. 149–168.



Chen, F., Chen, H., Wang, L., Zhou, Y., He, J., Yan, N., Peng, G., 2016. Intelligible enhancement of 3D articulation animation by incorporating airflow information. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 6130–6134.
Chen, F., v. Spinko, Shi, D., 2005. Real-time lip synchronization using wavelet network. In: Proceedings of the 2005 International Conference on Cyberworlds. IEEE, 4 pp.
Chen, H., Wang, L., Liu, W., Heng, P.A., 2010. Combined X-ray and facial videos for phoneme-level articulator dynamics. Vis. Comput. 26 (6–8), 477–486.
Chen, N.F., Shivakumar, V., Harikumar, M., Ma, B., Li, H., 2013. Large-scale characterization of Mandarin pronunciation errors made by native speakers of European languages. In: INTERSPEECH, pp. 2370–2374.
Chen, N.F., Tong, R., Wee, D., Lee, P., Ma, B., Li, H., 2015. iCALL corpus: Mandarin Chinese spoken by non-native speakers of European descent. In: Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association.
Chen, T.H., Massaro, D.W., 2011. Evaluation of synthetic and natural Mandarin visual speech: initial consonants, single vowels, and syllables. Speech Commun. 53 (7), 955–972.
Chiu, C.Y., Lia, Y.F., Külls, D., Mixdorff, H., Chen, S.L., 2009. A preliminary study on corpus design for computer-assisted German and Mandarin language learning. In: Speech Database and Assessments, 2009 Oriental COCOSDA International Conference. IEEE, pp. 154–159.
D'Mello, S.K., Graesser, A., King, B., 2010. Toward spoken human–computer tutorial dialogues. Hum.-Comput. Inter. 25 (4), 289–323.
Engwall, O., 2008. Can audio-visual instructions help learners improve their articulation? An ultrasound study of short term changes. Proceedings of Interspeech, Brisbane, Australia, 2631–2634.
Fagel, S., Madany, K., 2008. A 3-D virtual head as a tool for speech therapy for children. Proceedings of Interspeech, Brisbane, Australia, 2643–2646.
Gibert, G., Leung, K.N., Stevens, C.J., 2015. Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis. Comput. Cognit. Sci. 1 (1), 1–12.
Gorsuch, R.L., 1997. Exploratory factor analysis: its role in item analysis. J. Personal. Assess. 68 (3), 532–560.
Grauwinkel, K., Dewitt, B., Fagel, S., 2007. Visual information and redundancy conveyed by internal articulator dynamics in synthetic audiovisual speech. Proceedings of Interspeech, Antwerp, Belgium, 706–709.
Hamdan, M.N., Ali, A.Z.M., 2015. User satisfaction of non-realistic three-dimensional talking-head animation courseware (3D-NR). Int. J. e-Educ., e-Bus., e-Manag. e-Learn. 5 (1), 23.
Hamdan, M.N., Ali, A.Z.M., Hassan, A., 2015. The effects of realism level of talking-head animated character on students' pronunciation learning. In: 2015 International Conference on Science in Information Technology (ICSITech). IEEE, pp. 58–62.
Hazan, V., Sennema, A., Iba, M., Faulkner, A., 2005. Effect of audiovisual perceptual training on the perception and production of consonants by Japanese learners of English. Speech Commun. 47 (3), 360–378.
ITU-T, 1994. A method for subjective performance assessment of the quality of speech output devices. International Telecommunications Union publication.
Kim, D., Gilman, D.A., 2008. Effects of text, audio, and graphic aids in multimedia instruction for vocabulary learning. Educ. Technol. Soc. 11 (3), 114–126.
Kühnel, C., Weiss, B., Wechsung, I., Fagel, S., Möller, S., 2008. Evaluating talking heads for smart home systems. Proceedings of the 10th International Conference on Multimodal Interfaces, ACM, New York, NY, USA, 81–84.
Lee, W.S., Zee, E., 2003. Standard Chinese (Beijing). J. Int. Phon. Assoc. 33 (01), 109–112.
Lewis, M.P., Simons, G.F., Fennig, C.D., 2015. Summary by language size. Ethnologue: Languages of the World (19th ed.) (online version). SIL International, Dallas, TX. Retrieved February 22, 2016.
Li, C.N., Thompson, S.A., 1989. Mandarin Chinese: A Functional Reference Grammar. University of California Press.
Liu, X., Yan, N., Wang, L., Wu, X., Ng, M.L., 2013. An interactive speech training system with virtual reality articulation for Mandarin-speaking impaired children. 2013 IEEE International Conference on Information and Automation, 191–196.
Liu, Y., Massaro, D.W., Chen, T.H., Chan, D., Perfetti, C., 2007. Using visual speech for training Chinese pronunciation: an in-vivo experiment. In: SLaTE, pp. 29–32.
Massaro, D.W., 2006. Embodied agents in language learning for children with language challenges. In: Computers Helping People with Special Needs. Springer, pp. 809–816.
Massaro, D.W., Bigler, S., Chen, T.H., Perlman, M., Ouni, S., 2008. Pronunciation training: the role of eye and ear. Proceedings of Interspeech, Brisbane, Australia, 2623–2626.
Massaro, D.W., Light, J., 2004. Using visible speech to train perception and production of speech for individuals with hearing loss. J. Speech Lang. Hear. Res. 47 (2), 304–320.
Mattheyses, W., Latacz, L., Verhelst, W., 2009. On the importance of audiovisual coherence for the perceived quality of synthesized visual speech. EURASIP J. Audio Speech Music Process. 2009 (1), 1.
Mayer, R.E., 2009. Multimedia Learning. Cambridge University Press.
Mori, M., 1970. The uncanny valley. Energy 7, 33–35.
Mori, M., MacDorman, K.F., Kageki, N., 2012. The uncanny valley [from the field]. IEEE Robot. Autom. Mag. 19 (2), 98–100.
Müller, P., Wet, F.D., Walt, C.V.D., Niesler, T., 2009. Automatically assessing the oral proficiency of proficient L2 speakers. In: Proc. SLaTE, Warwickshire, UK, pp. 29–32.
Navarra, J., Soto-Faraco, S., 2007. Hearing lips in a second language: visual articulatory information enables the perception of second language sounds. Psychol. Res. 71 (1), 4–12.
Neri, A., Cucchiarini, C., Strik, H., 2006. ASR-based corrective feedback on pronunciation: does it really work? INTERSPEECH.
Norman, J., 1988. Chinese. Cambridge University Press.
Nunnally, J., 1978. Psychometric Methods.
Pandzic, I.S., Ostermann, J., Millen, D., 1999. User evaluation: synthetic talking faces for interactive services. Vis. Comput. 15 (7–8), 330–340.
Pei, Y., Zha, H., 2006. Vision based speech animation transferring with underlying anatomical structure. In: Computer Vision – ACCV 2006. Springer, pp. 591–600.
Pei, Y., Zha, H., 2007. Transferring of speech movements from video to 3D face space. IEEE Trans. Visual. Comput. Graph. 13 (1), 58–69.
Piske, T., MacKay, I.R.A., Flege, J.E., 2001. Factors affecting degree of foreign accent in an L2: a review. J. Phon. 29 (2), 191–215.
Pulleyblank, E.G., 2011. Middle Chinese: A Study in Historical Phonology. UBC Press.
Seferoğlu, G., 2005. Improving students' pronunciation through accent reduction software. Br. J. Educ. Technol. 36 (2), 303–316.
Stevens, C.J., Gibert, G., Leung, Y., Zhang, Z., 2013. Evaluating a synthetic talking head using a dual task: modality effects on speech understanding and cognitive load. Int. J. Hum.-Comput. Stud. 71 (4), 440–454.
Sweller, J., van Merriënboer, J.J.G., Paas, F.G.W.C., 1998. Cognitive architecture and instructional design. Educ. Psychol. Rev. 10 (3), 251–296.
Tan, W.H., Lin, C.Y., Wang, Y., 2013. Mandarin communication learning app: a proof-of-concept prototype of contextual learning. J. Res. Policy Pract. Teach. Teach. Educ. 3 (2), 38–48.
Theobald, B.J., Fagel, S., Bailly, G., Elisei, F., 2008. Lips2008: visual speech synthesis challenge. Proceedings of Interspeech, Brisbane, Australia, 2310–2313.
Wang, L., Chen, H., Li, S., Meng, H.M., 2012a. Phoneme-level articulatory animation in pronunciation training. Speech Commun. 54 (7), 845–856.
Wang, L., Han, W., Soong, F.K., 2012b. High quality lip-sync animation for 3D photo-realistic talking head. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 4529–4532.
Wang, X., Hueber, T., Badin, P., 2014. On the use of an articulatory talking head for second language pronunciation training: the case of Chinese learners of French. In: 10th International Seminar on Speech Production.
Wang, Z.-m., Cai, L.-h., Ai, H.-z., 2003. Text-to-visual speech in Chinese based on data-driven approach.
Weiss, B., Kühnel, C., Wechsung, I., Fagel, S., Möller, S., 2010. Quality of talking heads in different interaction and media contexts. Speech Commun. 52 (6), 481–492.
Wik, P., Engwall, O., 2008. Can visualization of internal articulators support speech perception? Eurospeech, 2627–2630.
Witt, S.M., Young, S.J., 1997. Language learning based on non-native speech recognition. In: Proceedings of the 1997 European Conference on Speech Communication and Technology (Eurospeech).
Witt, S.M., Young, S.J., 2000. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Commun. 30 (2), 95–108.
Wu, Z., Zhang, S., Cai, L., Meng, H., 2006. Real-time synthesis of Chinese visual speech and facial expressions using MPEG-4 FAP features in a three-dimensional avatar. In: INTERSPEECH. Citeseer.
Yuen, K.W., Leung, W.K., Liu, P., Wong, K.H., Qian, X., Lo, W.K., Meng, H., 2011. Enunciate: an internet-accessible computer-aided pronunciation training system and related user evaluations. 2011 IEEE International Conference on Speech Database and Assessments, 85–90.
Zhou, W., Wang, Z., 2007. Speech animation based on Chinese Mandarin triphone model. In: Computer and Information Science, 2007. ICIS 2007. 6th IEEE/ACIS International Conference. IEEE, pp. 924–929.


Xiaolan Peng received her B.S. and M.S. degrees in the Human-Robotic Interaction Lab from University of Science and Technology Beijing, P.R. China. She is now an Engineer with Institute of Software Chinese Academy of Sciences, P.R. China. Her research interests include human–computer interaction, affective interaction and multimodal learning.

Hui Chen received her B.S. and M.S. degrees in computer science from Shandong University, P.R. China, and received her Ph.D. degree in computer science from the Chinese University of Hong Kong, P.R. China. She is now an associate professor with Institute of Software Chinese Academy of Sciences, P.R. China. Her research interests include human–computer interaction, affective interaction, haptics and virtual reality.

Lan Wang received her M.S. degree in the Center of Information Science, Peking University. She obtained her Ph.D. degree from the Machine Intelligence Laboratory of Cambridge University Engineering Department in 2006. She worked on the Autonomous Global Integrated Language Exploitation project funded under DARPA's Global Autonomous Language Exploitation program. She is now a research professor of Shen-Zhen Institutes of Advanced Technology, Chinese Academy of Sciences, P.R. China. Her research interests are large vocabulary continuous speech recognition, speech visualization and audio information indexing.

Hongan Wang received his Ph.D. degree from Institute of Software Chinese Academy of Sciences, P.R. China. He has been a full professor and the director of Beijing Key Lab of Human-Computer Interaction, Institute of Software Chinese Academy of Sciences, P.R. China. His research interests include human-computer interaction and real-time intelligence. He has published over 100 papers in RTAI and HCI fields, including IEEE RTSS, ACM CHI, IJHCS, ACM IUI, ACM TIST, etc.
