

Analysis of Native Listeners' Facial Microexpressions while Shadowing Non-native Speech

– Potential of Shadowers’ Facial Expressions for Comprehensibility Prediction –

Tasavat Trisitichoke, Shintaro Ando, Daisuke Saito, Nobuaki Minematsu

Graduate School of Engineering, The University of Tokyo
{tasavat,s_ando,dsk_saito,mine}@gavo.t.u-tokyo.ac.jp

Abstract

Recently, researchers' attention has shifted to pronunciation assessment based not on comparison between L2 speech and native models, but on the comprehensibility of L2 speech [1, 2, 3]. In our previous studies [4, 5, 6], native listeners' shadowing of L2 speech was examined, and it was shown that the delay of shadowing and the accuracy of articulation in the shadowed utterances, both calculated acoustically, are strongly influenced by the cognitive load imposed by understanding L2 speech, especially when it carries a strong accent. In this paper, aside from acoustic analysis of shadowings, we focus on shadowers' facial microexpressions and examine how they correlate with perceived comprehensibility. To extract facial expression features, two methods are tested. One is a computer-vision-based method, in which recorded videos of shadowers' faces are analyzed. The other uses a physiological sensor that can detect subtle movements of facial muscles. In the experiments, the facial expressions of four shadowers are analyzed, each of whom shadowed approximately 800 L2 utterances. Results show that some of the shadowers' facial expressions are highly correlated with perceived comprehensibility, and that these facial expressions are strongly shadower-dependent. These results indicate the high potential of shadowers' facial expressions for comprehensibility prediction.

Index Terms: pronunciation assessment, comprehensibility, shadowing, facial expressions, FACS, EMG

1. Introduction

Assessment of the naturalness of synthetic voices is often done by asking listeners to rate the voices subjectively on an n-point scale, yielding a mean opinion score. In recent work, listeners' responses and behaviors have been examined to realize objective assessment: [7] examined the quickness of listeners' voluntary responses, and [8] focused on their involuntary physiological reactions. Similarly, in our studies [4, 5, 6], to realize objective assessment of the comprehensibility of L2 speech, native listeners were asked to shadow L2 speech. As smooth shadowing always requires smooth understanding of the input speech, the smoothness of the shadowed utterances, characterized by two acoustic features extracted from them, was found to be well correlated with perceived comprehensibility. The two features were the delay of shadowing and the accuracy of articulation.

In this study, we focus on shadowers' involuntary physiological reactions observed while shadowing. In previous studies, listeners' pupil dilation was measured quantitatively to assess listening effort objectively [9, 10, 11], and listeners' involuntary facial microexpressions were detected and examined to estimate the amount of cognitive load [12]. In this paper, following [12], shadowers' facial expressions are detected and analyzed with respect to perceived comprehensibility.

To investigate the potential of shadowers' facial microexpressions for comprehensibility prediction, four native Japanese speakers were asked to shadow Japanese utterances spoken by Vietnamese learners. The shadowers' microexpressions were detected in two ways: from videos of the shadowers' faces, and from surface electromyography (EMG) signals recorded by a wearable device [13]. In the former, the Facial Action Coding System (FACS) is used to analyze the recorded face videos; in the latter, amplitude analysis is adopted, where RMS values of the EMG envelope are calculated.

These objective measurements are compared with comprehensibility scores, which the shadowers themselves gave subjectively each time they shadowed a given L2 utterance. In addition, after shadowing, the shadowers were asked to give another subjective score, a shadowability score, where shadowability means the subject's self-evaluated smoothness of shadowing. Correlation analysis between the objective measures and the two kinds of subjective scores implies a potential of facial microexpressions for predicting the comprehensibility and shadowability of an L2 utterance, especially when a prediction model is built separately for each shadower.

2. Related work

2.1. Native listeners' shadowing of L2 utterances

In the conventional form of shadowing, native utterances are presented to learners, and the learners have to repeat them as simultaneously as possible. Shadowing is often done in the classroom as multi-task speech training, in which listening, understanding, and speaking have to run at the same time. In contrast, native listeners' shadowing requires native listeners to shadow learners' utterances, as shown in Figure 1. As smooth shadowing always requires smooth understanding of a given utterance, in [4, 5, 6] shadowers' performance was shown to be highly correlated with the comprehensibility and shadowability scores judged subjectively by the shadowers. Two kinds of acoustic features were examined. One is the shadowers' accuracy of articulation, quantified as phoneme posterior probabilities calculated with the front-end module of a DNN-based speech recognizer. The other is the delay of shadowing: for a pair of an L2 utterance and its corresponding native shadowing, it was defined as the average of the temporal gaps between every pair of corresponding phoneme boundaries in the two utterances, where the boundaries were obtained automatically by forced alignment (a sketch of this computation follows). Further details on these acoustic features and validation of native listeners' shadowing can be found in [4, 5, 6].
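For concreteness, the following is a minimal sketch of how the delay of shadowing could be computed from forced-alignment results. It assumes each alignment is given as a list of (phoneme, boundary_time) tuples with a one-to-one phoneme correspondence already established between the L2 utterance and its shadowing; the function name and data format are our illustration, not taken from [4, 5, 6].

# Hypothetical sketch: average shadowing delay from two forced alignments.
# Each alignment: list of (phoneme_label, boundary_time_sec) tuples, assumed
# to correspond one-to-one between the L2 utterance and its shadowing.

def shadowing_delay(l2_boundaries, shadow_boundaries):
    """Mean temporal gap between corresponding phoneme boundaries."""
    assert len(l2_boundaries) == len(shadow_boundaries)
    gaps = [s_t - l_t
            for (_, l_t), (_, s_t) in zip(l2_boundaries, shadow_boundaries)]
    return sum(gaps) / len(gaps)

# Example: the shadower lags the learner by roughly half a second on average.
l2 = [("k", 0.10), ("o", 0.18), ("N", 0.30)]
sh = [("k", 0.62), ("o", 0.71), ("N", 0.79)]
print(shadowing_delay(l2, sh))  # -> 0.513...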



Figure 1: Native listeners' reverse shadowing (SS = smoothness of shadowing)

Figure 2: Facial action units [23]

In the current paper, we examine nonverbal features related to native listeners' shadowing, namely their facial microexpressions.

2.2. Facial microexpressions and their automatic detection

Facial microexpressions are brief and subtle facial expressions that are involuntarily leaked despite a deliberate attempt to conceal them [14]. Facial microexpressions are generally said to last between 1/25 and 1/2 of a second, but [15] argued that the critical duration distinguishing macro- from microexpressions is about 1/5 of a second.

As one method for detecting microexpressions, we use a computer-vision-based extraction method, the Facial Action Coding System (FACS) [16]. FACS encodes movements of individual facial muscles into units called Action Units (AUs) [17]. It is a standard framework for automatically categorizing physical facial expressions of emotion into more than 28 main and 30 miscellaneous AU codes. Figure 2 shows examples of AUs and their descriptions. Moreover, since AUs are independent of any particular interpretation, they are also used in more general applications, such as analysis of depression [18] and prediction of cognitive load [12].

An alternative is surface-EMG-based detection, which is advantageous over CV-based detection in its better temporal resolution and its robustness against occlusion and lighting changes [13]. Electrodes can be placed at specific locations, and amplitude analysis is then applied to detect the activation of the specific facial muscles that correspond to the microexpressions of interest [19, 20]. Recently, in [13, 21], artificial neural networks (ANNs) were adopted to classify microexpressions directly from preprocessed EMG signals.

3. Experiment

3.1. Dataset and subjects

Japanese utterances spoken by Vietnamese learners, collected in [22], were used as the dataset. In total, 3,203 utterances were prepared. The learners' utterances fall into two categories: textbook utterances, for which learners read aloud Japanese paragraphs selected from textbooks, and story utterances, for which learners read their own stories written in Japanese.

Table 1: #utterances assigned to each subject

          S1     S2     S3     S4     Total
Video    408    369    430    378    1,585
EMG      421    417    371    409    1,618
Total    829    786    801    787    3,203

Figure 3: Experimental setup. Subjects wore an EMG device in front of the web interface for shadowing, where their face videos were recorded by the front web camera.

Figure 4: Wearable device using two electrode pairs for facial expression recognition [13]

The story utterances are expected to be lower in comprehensibility, as grammatical errors and proper nouns such as names of persons or cities are likely to occur in them. Therefore, for some utterances, perceived comprehensibility depends not only on the proficiency of the learner but also on the linguistic content of the utterance.

Four female adult native speakers of Japanese participated in the experiments as shadowers. To cover all the utterances, the four subjects were assigned different sets of utterances such that each utterance was assigned to only one subject. Table 1 lists the total number of utterances assigned to each subject and the number assigned to each shadowing session, which is described shortly.

3.2. Setup and equipment

The experimental setup is shown in Figure 3. The front web camera of a MacBook Air (640x480) was used for video recording. Since a subject looks at the web interface during shadowing, the front web camera guarantees that the subject's face stays in frame even if she changes her posture during a session. In addition to video recording, surface EMG signals were recorded from facial muscles using a wearable device with two electrode pairs [13], shown in Figure 4. For stable signal detection, once equipped, subjects were not allowed to take the device off until the end of a shadowing session. In the EMG sessions, videos were also recorded for synchronization with the audio signals.



3.3. Procedures

After practice shadowing, each subject was asked to shadow all of her assigned utterances; repeating an utterance was not allowed unless it was considered necessary. The shadowing experiment was divided into two sessions, in which subjects' facial expressions were recorded either as face videos or as EMG signals. In the video session, the wearable device was not worn, while in the EMG session, face videos were also recorded, but only for synchronization between video and EMG.

After shadowing each utterance, subjects were asked to rate comprehensibility and shadowability subjectively on a 7-point scale. The following questions were asked:

1. How easily did you understand the presented utterance?
2. How smoothly did you shadow the presented utterance?

In our studies [4, 5], these two scores were highly correlated, and the latter is more directly related to the smoothness of shadowing, which was defined acoustically in those studies.

3.4. Analysis

3.4.1. Video-based

OpenFace [24] was used as the toolkit for extracting Action Units from the face videos frame by frame. It contains two prediction models: classification and regression. The former predicts the presence of an AU, i.e., whether the AU is visible, and the latter predicts the intensity of an AU on a continuous scale from 0 (not present) or 1 (present at minimum intensity) to 5 (present at maximum intensity). To prevent confusion between subjects' oral movements while shadowing and their facial expressions, only AUs in the upper face area (AU01, 02, 04, 05, 06, 07, and 09) were chosen for analysis.
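For reference, a minimal sketch of reading OpenFace's per-frame output and keeping only the upper-face AUs: OpenFace's FeatureExtraction tool writes one CSV per video with intensity columns named like AU01_r and presence columns named like AU01_c. The file name here is hypothetical, and some OpenFace versions pad the column names with spaces, hence the strip.

import pandas as pd

UPPER_FACE = ["AU01", "AU02", "AU04", "AU05", "AU06", "AU07", "AU09"]

df = pd.read_csv("shadower_S1.csv")     # hypothetical FeatureExtraction output
df.columns = df.columns.str.strip()     # some versions pad names with spaces
intensity = df[[au + "_r" for au in UPPER_FACE]]  # regression: 0-5 intensity
presence  = df[[au + "_c" for au in UPPER_FACE]]  # classification: 0/1 presence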

For each utterance, the intensity and presence of each AU were predicted frame by frame with the two models. Since the shadowers' ratings were given per utterance, utterance-level measures, namely mean AU intensity and AU presence ratio, were obtained for a given utterance by averaging each measure over all frames of that utterance. As the correspondence between utterances and subjects is one-to-one, the subsequent analysis was done separately for each subject.

After calculating the utterance-level measures of intensity and presence, a second averaging operation was applied. Since subjective rating used a 7-point scale from 1 to 7, the utterance-level measures were averaged, for each score, over all utterances given that score. In this way, for each subject, we obtained typical values of presence and intensity for each score of subjective comprehensibility and shadowability.
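A minimal sketch of this two-stage averaging, assuming a hypothetical per-frame table with an utterance identifier, the per-utterance comprehensibility score, and one AU's outputs (toy values for illustration only):

import pandas as pd

# Hypothetical per-frame table: one row per video frame, with the utterance
# it belongs to, that utterance's rating, and AU07 outputs from OpenFace.
frames = pd.DataFrame({
    "utterance_id": ["u1", "u1", "u1", "u2", "u2", "u2"],
    "score":        [6, 6, 6, 2, 2, 2],
    "AU07_r":       [1.8, 2.1, 1.5, 0.2, 0.0, 0.1],
    "AU07_c":       [1, 1, 1, 0, 0, 1],
})

# Stage 1: frame level -> utterance level (mean intensity, presence ratio).
utt = frames.groupby("utterance_id").agg(
    score=("score", "first"),            # one rating per utterance
    au07_intensity=("AU07_r", "mean"),
    au07_presence=("AU07_c", "mean"),    # fraction of frames with AU07 present
)

# Stage 2: utterance level -> score level ("typical" value per score 1..7).
typical = utt.groupby("score")[["au07_intensity", "au07_presence"]].mean()
print(typical)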

Note that when a subject's typical intensity value falls below 1 (the minimum intensity) for all subjective scores from 1 to 7, the reliability of the analysis is considered low; such AUs are treated as undetectable for that subject and are excluded from further analysis.

Correlations between the two kinds of measures (typical values of intensity and presence ratio) and the two kinds of subjective scores (comprehensibility and shadowability) were then calculated for each of the remaining AUs, separately for each subject.
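Continuing the sketch above, the per-AU correlation and the detectability filter could be computed as follows; the significance threshold of p < 0.10 matches the one used in Tables 2-5, while the variable and AU names are ours, for illustration.

from scipy.stats import pearsonr

scores = typical.index.to_numpy()        # subjective scores 1..7

# Detectability filter: drop an AU whose typical intensity never reaches 1.
if (typical["au07_intensity"] < 1.0).all():
    print("AU07 undetectable for this subject")
else:
    r, p = pearsonr(scores, typical["au07_intensity"])
    flag = "*" if p < 0.10 else ""       # '*' marks significance at p < 0.10
    print(f"AU07(r): r = {r:.2f}{flag}")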

3.4.2. EMG-based

The signal processing procedure for the surface EMG signals follows [13]. As shown in Figure 5, the surface EMG was recorded at a 1 kHz sampling rate with four channels.

Figure 5: Signal processing for EMG-based method [13]

First, all four channels were bandpass-filtered from 50 to 350 Hz and notch-filtered at 50 Hz and its harmonics up to 350 Hz. Second, Independent Component Analysis (ICA) was applied to the signals to separate distal EMG from the different source muscles. Then, Root-Mean-Square (RMS) values were calculated over overlapping 200 ms windows with a 50 ms frame shift.
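A minimal sketch of this preprocessing chain, assuming the raw EMG is a (samples x 4) NumPy array at 1 kHz; the filter order, the notch Q factor, and the choice of FastICA are our assumptions, since the steps, but not these parameters, are fixed by the description above.

import numpy as np
from scipy.signal import butter, filtfilt, iirnotch
from sklearn.decomposition import FastICA

FS = 1000  # 1 kHz sampling rate

def preprocess(emg):                     # emg: (n_samples, 4) raw channels
    # Bandpass 50-350 Hz.
    b, a = butter(4, [50, 350], btype="bandpass", fs=FS)
    x = filtfilt(b, a, emg, axis=0)
    # Notch at 50 Hz and its harmonics up to 350 Hz.
    for f0 in range(50, 351, 50):
        bn, an = iirnotch(f0, Q=30.0, fs=FS)
        x = filtfilt(bn, an, x, axis=0)
    # ICA to separate distal EMG from different source muscles.
    return FastICA(n_components=4, random_state=0).fit_transform(x)

def rms_envelope(x, win=200, hop=50):    # 200 ms windows, 50 ms shift at 1 kHz
    starts = range(0, len(x) - win + 1, hop)
    return np.array([np.sqrt(np.mean(x[i:i + win] ** 2, axis=0))
                     for i in starts])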

The subsequent analysis was done in the same manner as for the video-based method: utterance-level measures of all four signals were obtained by averaging the RMS envelope values over all frames in an utterance, and these utterance-level measures were then averaged for each score.

Again, correlations between the objective measures (the four typical RMS values) and the two kinds of subjective scores (comprehensibility and shadowability) were calculated separately for each subject.

4. Results and discussion

For the video-based method, Tables 2 and 3 show the correlations between the two kinds of subjective scores, comprehensibility and shadowability, and the corresponding typical values of the two kinds of objective measures, intensity (r) and presence (c). These correlations were calculated for each subject. ∗ indicates that the correlation is significant with a p-value below 0.10, and "–" means that the AU was filtered out.

For most subjects, only a few AUs were detectable; S4 in particular had only one detectable AU. For S3, even though more than half of the AUs were detected, only one correlation was found to be significant. These tables also show a large inter-shadower variation in the significant AUs, suggesting that facial expressions in a shadowing task depend on the individual shadower. Nevertheless, for every subject, at least one AU was found to be highly correlated with one of the two subjective scores, indicating the potential of facial expressions as an additional feature for comprehensibility/shadowability prediction when a prediction model is built for each individual shadower.

A positive correlation between the objective measures and the subjective scores indicates that facial expressions tend to become more perceivable as learners' utterances become easier to comprehend or to shadow, and vice versa. Only AU07 (lid tightening) shows significant positive correlations in both intensity and presence ratio, and the correlation of presence ratio (frequency of facial expressions) surpasses that of intensity for both subjective scores in all subjects.

AU01 (inner brow raising) and AU04 (brow lowering) show significant negative correlations. This is quite reasonable, since these AUs are common facial responses when encountering a puzzling or ambiguous situation (see Figure 2).

For S3, AU01(c) was found to be significantly correlated with comprehensibility while AU07(c) was not, even though the latter was more common among the other subjects. Conversely, no subject whose AU07(c) shows a significant correlation exhibits a detectable AU01 response. This difficulty of co-occurrence of AU01 and AU07 can be explained by their being in a negative-competition relation with each other [25].



Table 2: Correlations between measures of intensity (r) & presence (c) of AUs and comprehensibility scores for each subject

            S1       S2       S3       S4
AU01(r)     –        –        –        –
AU02(r)     –        –        –        –
AU04(r)   -0.91∗     –      -0.07      –
AU05(r)     –        –        –        –
AU06(r)     –        –        –        –
AU07(r)   -0.34     0.74∗    0.41      –
AU09(r)     –        –        –        –
AU01(c)     –        –      -0.75∗     –
AU02(c)     –        –      -0.43      –
AU04(c)   -0.57      –       0.34      –
AU05(c)   -0.07    -0.79∗   -0.25     0.92∗
AU06(c)     –       0.14     0.10      –
AU07(c)    0.74∗    0.82∗    0.46      –
AU09(c)     –        –      -0.12      –

Table 3: Correlations between measures of intensity (r) & presence (c) of AUs and shadowability scores for each subject

            S1       S2       S3       S4
AU01(r)     –        –        –        –
AU02(r)     –        –        –        –
AU04(r)   -0.86∗     –      -0.17      –
AU05(r)     –        –        –        –
AU06(r)     –        –        –        –
AU07(r)   -0.39     0.59     0.08      –
AU09(r)     –        –        –        –
AU01(c)     –        –      -0.61      –
AU02(c)     –        –      -0.26      –
AU04(c)   -0.85∗     –       0.17      –
AU05(c)    0.29    -0.68∗   -0.28     0.63
AU06(c)     –       0.26    -0.04      –
AU07(c)    0.74∗    0.76∗    0.39      –
AU09(c)     –        –       0.44      –

It is very interesting that AU05(c) shows both positive and negative correlations with comprehensibility, depending on the shadower. This means that the same type of facial expression is observed when one shadower perceives difficulty of understanding and when another perceives ease of understanding.

It should be noted that these correlations are obtained between the subjective scores and the corresponding typical values of the objective measures. Experimentally, these correlations do not hold at the utterance level: the variance of the objective measures at the utterance level is so high that it would be difficult to use them alone to predict comprehensibility. Hence, to use these measures effectively, they should serve as additional features alongside the acoustic features proposed in our previous studies [5].

For the EMG-based method, Tables 4 and 5 show the correlations between the two kinds of subjective scores and the corresponding typical values of the EMG-based objective measures. CH1 to CH4 denote the four distal EMG channels from different source muscles. For all subjects except S3, significant and very strong correlations were found in almost all channels. For S3, it seems that the comprehensibility of L2 utterances and shadowing performance do not strongly influence her facial expressions; in Tables 2 and 3, only S3 has AU01 as a significant expression, and she exhibits unique behavior.

Table 4: Correlations between average RMS EMG envelopes and comprehensibility scores for each subject

        S1       S2       S3       S4
CH1    0.84∗   -0.95∗    0.25     0.97∗
CH2    0.61    -0.93∗    0.00     0.93∗
CH3    0.87∗   -0.93∗   -0.05     0.94∗
CH4    0.88∗   -0.94∗   -0.13     0.85∗

Table 5: Correlations between average RMS EMG envelopes and shadowability scores for each subject

        S1       S2       S3       S4
CH1    0.77∗   -0.91∗   -0.12     0.78∗
CH2    0.72∗   -0.90∗   -0.34     0.74∗
CH3    0.73∗   -0.91∗   -0.54     0.79∗
CH4    0.74∗   -0.93∗   -0.61     0.76∗

From Tables 4 and 5, the polarity of the correlations is consistent across the four channels. Roughly speaking, different channels in the EMG method should correspond to different AUs in the FACS method, where both positive and negative correlations were found for different AUs within a single shadower. The consistent polarity across channels in Tables 4 and 5 might therefore imply that all four EMG signals come from, or are dominated by, a single source muscle.

The very strong correlations in Tables 4 and 5 indicate that, with EMG sensing, it may be possible to attribute shadowers' perceived comprehensibility and shadowability to subtle movements of a specific source muscle on the face. In the current analysis, however, since the source muscle of each channel is still unknown, further investigation is required into which source muscles are involved in the activation.

As with the video-based method, the correlations of the objective measures are higher with comprehensibility than with self-evaluated shadowability, and they do not hold at the utterance level.

5. Conclusions

In this paper, to investigate the potential of the facial microexpressions of native listeners shadowing L2 utterances, correlation analysis was done between subjective scores and objective measurements. The subjective scores were comprehensibility and shadowability, and the objective measurements were obtained from face videos and surface EMG signals of the shadowers. The analysis showed that the typical values of the objective measurements corresponding to the subjective scores have a very strong linear relation to those scores. This clearly indicates the potential of using facial expressions for comprehensibility prediction, but we must admit that the correlations are highly shadower-dependent and that the objective measurements corresponding to a single subjective score have a very large variance. Thus, a prediction model has to be selected or designed carefully. In future work, facial microexpressions will be used as new additional features in the task of predicting the comprehensibility of L2 utterances; in [5], we used only acoustic features for prediction, and we are interested in the effectiveness of facial features for this task.

6. Acknowledgements

The authors would like to thank Dr. Kenji Suzuki and Dr. Masakazu Hirokawa for the wearable device. This work was supported financially by JSPS/MEXT KAKENHI JP26118002 and JP18H04107.



7. References

[1] T. M. Derwing, M. J. Munro, Pronunciation fundamentals: Evidence-based perspectives for L2 teaching and research, John Benjamins Publishing, 2015.

[2] M. J. Munro, T. M. Derwing, "Foreign accent, comprehensibility, and intelligibility in the speech of second language learners," Language Learning, 45, 1, 73–97, 1995.

[3] M. J. Munro, T. M. Derwing, "The functional load principle in ESL pronunciation instruction: An exploratory study," System, 34, 520–531, 2006.

[4] Y. Inoue, S. Kabashima, D. Saito, N. Minematsu, K. Kanamura, Y. Yamauchi, "A study of objective measurement of comprehensibility through native speakers' shadowing of learners' utterances," Proc. INTERSPEECH, 1651–1655, 2018.

[5] S. Kabashima, Y. Inoue, D. Saito, N. Minematsu, "DNN-based scoring of language learners' proficiency using learners' shadowings and native listeners' responsive shadowings," Proc. Spoken Language Technology, 971–978, 2018.

[6] T. Trisitichoke, S. Ando, Y. Inoue, D. Saito, N. Minematsu, "Influence of content variations on native speakers' fluency of shadowing," IPSJ SIG Technical Reports, 2018-SLP-125, 16, 1–3, 2018.

[7] A. Govender, S. King, "Measuring the cognitive load of synthetic speech using a dual task paradigm," Proc. INTERSPEECH, 2843–2847, 2018.

[8] A. Govender, S. King, "Using pupillometry to measure the cognitive load of synthetic speech," Proc. INTERSPEECH, 2838–2842, 2018.

[9] M. B. Winn, J. R. Edwards, R. Y. Litovsky, "The impact of auditory spectral resolution on listening effort revealed by pupil dilation," Ear and Hearing, 36, 4, 153–165, 2015.

[10] V. Demberg, A. Sayeed, "The frequency of rapid pupil dilations as a measure of linguistic processing difficulty," PLoS ONE, 11(1): e0146194, 2016.

[11] T. Koelewijn, H. de Kluiver, B. Shinn-Cunningham, A. Zekveld, "The pupil response reveals increased listening effort when it is difficult to focus attention," Hearing Research, 323, 81–90, 2015.

[12] M. S. Hussain, R. A. Calvo, F. Chen, "Automatic cognitive load detection from face, physiology, task performance and fusion during affective interference," Interacting with Computers, 26, 3, 256–268, 2014.

[13] M. Perusquía-Hernández, M. Hirokawa, K. Suzuki, "A wearable device for fast and subtle spontaneous smile recognition," IEEE Transactions on Affective Computing, PP, 1–1, 2017.

[14] P. Ekman, W. V. Friesen, "Nonverbal leakage and clues to deception," Psychiatry, 32, 1, 88–106, 1969.

[15] X. Shen, Q. Wu, X. Fu, "Effects of the duration of expressions on the recognition of microexpressions," J. Zhejiang Univ. Sci. B (Biomed. & Biotechnol.), 13, 3, 221–230, 2012.

[16] J. H. Janssen, P. Tacken, J. J. G. de Vries, E. L. van den Broek, J. H. D. M. Westerink, P. Haselager, W. A. IJsselsteijn, "Machines outperform laypersons in recognizing emotions elicited by autobiographical recollection," Human-Computer Interaction, 28, 6, 479–517, 2013.

[17] P. Ekman, W. V. Friesen, "Facial Action Coding System: A technique for the measurement of facial movement," Consulting Psychologists Press, Palo Alto, 1978.

[18] L. I. Reed, M. A. Sayette, J. F. Cohn, "Impact of depression on response to comedy: A dynamic facial coding analysis," Journal of Abnormal Psychology, 116, 4, 804–809, 2007.

[19] U. Hess, A. Kappas, G. J. McHugo, R. E. Kleck, J. T. Lanzetta, "An analysis of the encoding and decoding of spontaneous and posed smiles: The use of facial electromyography," Journal of Nonverbal Behavior, 13, 2, 121–137, 1989.

[20] A. van Boxtel, "Facial EMG as a tool for inferring affective states," Proc. Measuring Behavior, 104–108, 2010.

[21] A. Gruebler, K. Suzuki, "Design of a wearable device for reading positive expressions from facial EMG signals," IEEE Transactions on Affective Computing, 5, 3, 227–237, 2014.

[22] S. Ando, T. Trisitichoke, Y. Inoue, F. Yoshizawa, D. Saito, N. Minematsu, "A large collection of Japanese sentences read aloud by Vietnamese learners and native speakers' responsive shadowings," Proc. Spring Meeting of ASJ, 3-5-5, 2019.

[23] https://www.ecse.rpi.edu/homepages/cvrl/lei/researchactivelabeling.htm

[24] https://github.com/TadasBaltrusaitis/OpenFace/wiki/Action-Units

[25] K. Zhao, W.-S. Chu, F. De la Torre, J. F. Cohn, "Joint patch and multi-label learning for facial action unit detection," Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2207–2216, 2015.
