
ANALYSIS OF AUDIO-VIDEO CORRELATION IN VOWELS IN AUSTRALIAN ENGLISH

Roland Goecke1, J Bruce Millar1, Alexander Zelinsky2, and Jordi Robert-Ribes3

1Computer Sciences Laboratory and 2Robotic Systems Laboratory, Research School of Information Sciences and Engineering, Australian National University, Canberra ACT 0200, Australia

3Cable & Wireless Optus, 101 Miller St, North Sydney NSW 2060, Australia

E-Mail: Roland.Goecke@anu.edu.au URL: http://cslab.anu.edu.au/~rgoecke

ABSTRACT

This paper investigates the statistical relationship between acoustic and visual speech features for vowels. We extract such features from our stereo vision AV speech data corpus of Australian English. A principal component analysis is performed to determine which data points of the parameter curve of each feature are the most important ones for representing the shape of each curve. This is followed by a canonical correlation analysis to determine which principal components, and hence which data points of which features, correlate most across the two modalities. Several strong correlations between acoustic and visual features are reported. In particular, the formants F1 and F2 were strongly correlated with mouth height. Knowledge about the correlation of acoustic and visual features can be used to predict the presence of acoustic features from visual features in order to improve the recognition rate of automatic speech recognition systems in environments with acoustic noise.

1. INTRODUCTION

Although automatic speech recognition (ASR) systems have become common tools in human-computer interaction (HCI), they still have some limitations with respect to the environment in which they can be used. Current commercially available ASR systems employ statistical models of spoken language and enable continuous speech recognition in reasonably good acoustic conditions. However, they can fail unpredictably in noisy conditions. One way of overcoming some of the limitations of audio-only ASR systems is to use the additional visual information of the act of speaking [1-5].

The aim in audio-video (AV) ASR in adverse conditions is to be able to replace less reliable acoustic measurements with more reliable visual measurements. This requires a priori knowledge of the correlation between acoustic and visual speech features for all phonemes. We explore in this paper how this knowledge can be established.

Yehia et al. [6] found a strong correlation (80-91%) between the shape of the vocal tract and the position of facial feature points around the mouth and lower face. They also found that a large part (72-85%) of the variance observed in acoustic parameters can be determined from vocal-tract and facial data together, and even the facial data alone performed well in accounting for the acoustic parameter variance. The drawbacks of their study were the small number of speakers examined (only 2) and the use of intrusive measurement techniques: small transducers placed on the tongue, lips, and teeth for electromagnetic tracking of the vocal-tract motion, and infrared LEDs on the lower half of the face for optical tracking of the facial motion. While Yehia et al. focussed on the relation between acoustic and facial parameters for speech production and animation, we look at the issue from an ASR perspective.

We have recently presented a novel algorithm for the explicit extraction of lip feature points based on a stereo vision head-tracking system [7]. Both head-tracking and lip-tracking are completely non-intrusive and do not use any facial markers. We have also recorded an AV speech data corpus (with 10 speakers) containing all phonemes and visemes in Australian English for the analysis of the correlation between acoustic and visual features [8]. Section 2 briefly describes the experimental design for the recordings and the techniques used for extracting the features used in the correlation analysis. Section 3 details the methods used in the statistical analysis of the relation between acoustic and visual features. The results are presented in Section 4 and discussed in Section 5. Section 6 concludes with a discussion of the future direction of this work.

2. EXPERIMENTAL DESIGN

2.1. Stereo Vision System

The lip-tracking system is integrated with the stereo vision head-tracking system (Figure 1). A stereo vision system has the advantage that depth information can be recovered and that measurements are therefore in 3D, giving real-world distances rather than simply the 2D image coordinates of a mono-vision system.

Figure 1: Overview of the stereo vision system

The head-tracking system is based on template matching using normalised cross-correlation and is able to track the person's movements at a frame rate of 15-30 Hz. The system consists of two calibrated standard colour analog NTSC video cameras. The camera outputs are multiplexed at half the vertical resolution into a single 512x480 image (Figure 2) before being acquired by a Hitachi IP5005 video card on a Pentium II (300 MHz CPU) every 33 ms. Details on the system can be found in [9].
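
As a rough illustration of the matching criterion only (not the optimised implementation described in [9]), normalised cross-correlation between a grey-level template and an equally sized image patch, and an exhaustive search over a larger image region, can be sketched as follows; the function and variable names are illustrative.

```r
# Sketch only: normalised cross-correlation (NCC) template matching.
ncc <- function(template, patch) {
  t0 <- template - mean(template)   # remove means for brightness invariance
  p0 <- patch - mean(patch)
  sum(t0 * p0) / sqrt(sum(t0^2) * sum(p0^2))
}

# Exhaustive search of the template inside a search region,
# returning the position with the highest NCC score.
match_template <- function(region, template) {
  th <- nrow(template); tw <- ncol(template)
  best <- list(score = -Inf, row = NA, col = NA)
  for (r in 1:(nrow(region) - th + 1)) {
    for (c in 1:(ncol(region) - tw + 1)) {
      s <- ncc(template, region[r:(r + th - 1), c:(c + tw - 1)])
      if (s > best$score) best <- list(score = s, row = r, col = c)
    }
  }
  best
}
```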

The lip-tracking algorithm is applied to the mouth windows which are automatically determined during the head tracking based on the head pose (Figure 2). We combine colour information from the images with knowledge about the structure of the mouth area for different degrees of mouth openness in the algorithm. For example, when the mouth is open, we often expect to see teeth, so we can specifically look for them, which improves the robustness of the lip-tracking.

Figure 2: Stereo image with head-tracking templates and mouth windows for lip-tracking.

The algorithm extracts the 3D positions of the two lip corners and the mid-points of the upper and lower lips. Since every person has differently shaped lips, we use the inner lip contour so that the personal characteristic shape of the lips has minimal effect on the measurements. From these four lip points, we derive a feature set which gives a first-order description of the shape of the mouth: mouth width, mouth height, and protrusion of the upper and lower lip. Furthermore, the algorithm labels each frame with the appearance of upper and lower teeth. Figure 3 shows two examples of lip-tracking results. Video clips of the lip-tracking process can be found on our homepage (http://cslab.anu.edu.au/~rgoecke). A detailed description of the lip-tracking algorithm can be found in [7].

Figure 3: Lip-tracking results.
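
As an illustration of how such a first-order feature set could be derived from the four tracked 3D lip points, consider the sketch below; the function name, the argument names, and the choice of the z axis as the protrusion direction are assumptions for illustration only, not the actual algorithm of [7].

```r
# Sketch only: first-order mouth shape features from four 3D inner-lip points,
# each given as c(x, y, z) in millimetres.
# Assumed convention: the z axis points from the face towards the cameras.
mouth_features <- function(left_corner, right_corner, upper_mid, lower_mid) {
  dist3d <- function(a, b) sqrt(sum((a - b)^2))
  list(
    mouth_width      = dist3d(left_corner, right_corner),
    mouth_height     = dist3d(upper_mid, lower_mid),
    upper_protrusion = upper_mid[3],   # displacement along the assumed z axis
    lower_protrusion = lower_mid[3]
  )
}

# Example call with made-up coordinates (mm):
mouth_features(c(-25, 0, 0), c(25, 0, 0), c(0, 8, 2), c(0, -8, 3))
```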

2.2. AV Speech Data Corpus

We have recorded an AV speech data corpus for Australian English (AuE) using the stereo vision system just described. The speakers sat in front of a stereo camera pair with an omnidirectional microphone attached 20-25 cm below their mouth (Fig. 4). The face was well illuminated by a light source just below the cameras so that no shadows appeared on a speaker's face. Recordings were made to digital video (DV) tape because of its ability to play back the recordings many times without a loss of quality. The DV standard also comprises a digital audio component. In our case, the recordings were made at a 30 Hz video frame rate and 16 bit, 48 kHz mono audio in a controlled acoustic environment (almost no external noise, some air conditioning and computer noise in the background).

The data corpus comprises 10 native speakers (5 female and 5 male). It was designed to cover all phonemes and visemes in AuE except for the neutral vowel /@/, because of its great audio variability, and the neutral consonant /h/, which adds little to the correlation analysis. In addition, the voiced fricative /Z/ and the diphthong /u@/ were also omitted because they have a low occurrence in AuE. The core part of the corpus consists of 40 sequences per speaker containing consonant-vowel-consonant (CVC) or vowel-consonant-vowel (VCV) words with the phoneme of interest in the central position. These words were put in a carrier phrase ("You grab word beer.") to overcome articulation patterns associated with reading words from a list. The carrier phrase also facilitates the visual segmentation through the use of bi-labial closings before and after the CVC or VCV word. Full details on the design of the AV speech data corpus can be found in [8].

Figure 4: Setup for AV Recordings.

2.3. Acoustic Feature Extraction

First, the audio component was grabbed from the DV tape and stored as a .wav file without changing the bit rate and sampling rate settings. Then the ESPS signal processing toolkit was used for the extraction of the acoustic features. The acoustic features extracted were the voice source excitation frequency f0, the formant frequencies F1-F3, and the RMS energy. The ESPS function get_f0 was used to extract the f0 and RMS energy values. It uses a 30 ms Hanning window with 5 ms overlap. The ESPS function formant was used to extract F1, F2, and F3 from the audio data by linear prediction analysis. The ESPS standard parameters of 49 ms window length and cos4 window type were applied in the analysis.
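
Purely as an illustration of what the RMS energy track represents (a sketch, not the ESPS get_f0 implementation), a short-time RMS computation over a waveform could look like the following; the function name and the 10 ms step are assumptions, the 30 ms Hanning window follows the analysis settings above.

```r
# Sketch only: short-time RMS energy of a waveform.
# x: numeric vector of samples, fs: sampling rate in Hz.
rms_energy <- function(x, fs, win_ms = 30, step_ms = 10) {
  win  <- round(win_ms  * fs / 1000)
  step <- round(step_ms * fs / 1000)
  # Hanning window, as in the ESPS analysis described above
  w <- 0.5 - 0.5 * cos(2 * pi * (0:(win - 1)) / (win - 1))
  starts <- seq(1, length(x) - win + 1, by = step)
  sapply(starts, function(s) sqrt(mean((x[s:(s + win - 1)] * w)^2)))
}
```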

3. METHODS FOR ANALYSIS

In this paper, we focus on the analysis of the AV correlation in vowels in AuE. While our AV speech data corpus also contains sequences of the consonants in Australian English, their AV correlation analysis will be reported in a different paper. This categorisation into vowels and consonants is based on the phonemes, since we collected data from the phonemes of AuE rather than the visemes. Of course, visemes and phonemes are related, but the sets have a different structure. There is no one-to-one mapping between phonemes and visemes. Leaving diphthongs aside, there are sequences for 6 short and 5 long vowels of AuE in the data corpus.

First, a linear correlation was performed on the visual feature set to see if any of the features were correlated. Not surprisingly, it turned out that the protrusion of the upper and lower lip were highly correlated (r > 0.97), because in normal speech both lips are moved backward and forward simultaneously. Therefore, we retain only one protrusion parameter, the upper lip protrusion, in the analysis. Table 1 shows the feature set used in the statistical analysis.

Acoustic                       Visual
Voice source excitation f0     Mouth height
Formant frequency F1           Mouth width
Formant frequency F2           Lip protrusion
Formant frequency F3
RMS energy

Table 1: Overview of the acoustic and visual features.

For the vowels, the CVC word was of the form /bXb/, where X stands for the particular vowel. The parameters of all features were extracted for the period of time from the beginning of the lip opening after the initial bi-labial /b/ until the end of the lip opening just before the second bi-labial /b/. These time periods were about 200 ms (or 6 video frames) for short vowels and about 300 ms (or 9 video frames) for long vowels. However, durations also vary across speakers for each vowel. In order to have the same number of points on the time scale for all utterances of a vowel from all speakers, a piecewise linear interpolation was performed to give each parameter curve a base of 50 points for acoustic features and 10 points for visual features. We used the R statistical system [10] for the interpolation and all subsequent statistical analyses. Note that our use of the PCA to represent the shape of a curve is immune to the fact that acoustic and visual features have a different number of data points. Obviously, the analysis frame rate is different for acoustic features (every 10 ms) and visual features (every 33 ms).
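
In R, such a piecewise linear resampling can be done with the base function approx(); a minimal sketch, assuming a parameter curve given as vectors of sample times and values (the function and variable names are illustrative):

```r
# Sketch: resample a parameter curve onto a fixed number of equally spaced
# points by piecewise linear interpolation.
resample_curve <- function(times, values, n_points) {
  approx(x = times, y = values, n = n_points)$y
}

# e.g. an F1 track sampled every 10 ms resampled to 50 points,
# or a mouth-height track sampled every 33 ms resampled to 10 points:
# f1_50 <- resample_curve(f1_times, f1_values, 50)
# mh_10 <- resample_curve(mh_times, mh_values, 10)
```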

Having a bi-labial context simplifies the visual analysis. Using /b/ instead of /p/ lengthens each word and thus results in more data to analyse, which is particularly important in the case of short vowels. On the other hand, a bi-labial context causes strong coarticulation effects in the formants. However, these are quite predictable for /b/, and we believe the advantages of a bi-labial context for visual segmentation offset the disadvantages. We deliberately limited the coarticulation contexts in this way to constrain the size and complexity of our data set.

For each parameter data point and for each phoneme, a principal component analysis (PCA) was performed across the 10 speakers. It is important to understand that this use of the PCA technique is a way to represent the shape of a parameter curve. For example, the first principal component (PC) then tells us which linear combination of data points accounts for the largest amount of the variance in the parameter curves of all speakers for that one particular phoneme. This use of the PCA technique is different from applying it to, say, all acoustic features in order to reduce the number of features by selecting the ones that account for most of the variance.
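
A minimal sketch of this use of PCA in R, assuming for one phoneme and one feature a matrix curves with one row per speaker and one column per resampled data point; prcomp() is used here for illustration, the exact routine used in the original analysis is not stated.

```r
# Sketch: PCA describing the shape of the parameter curves of one feature
# for one phoneme. curves: 10 x 50 matrix (speakers x resampled data points).
pca <- prcomp(curves, center = TRUE, scale. = FALSE)

# Proportion of variance accounted for by each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Loadings of the first two PCs: which data points each PC combines
pc_loadings <- pca$rotation[, 1:2]

# Per-speaker scores of the first two PCs (used later as PC vectors)
pc_scores <- pca$x[, 1:2]
```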

Then the top two PCs of each feature, representing about 70% of the variance, were taken, and two features (= 2x2 PCs) at a time were combined into acoustic and visual PC vectors. Finally, a canonical correlation analysis on these PC vectors was performed to quantify which PCs, and subsequently which data points of which features, were correlated across the two modalities. This also reveals temporal information about the relationship between acoustic and visual features. We tested several combinations of PC vectors but have not yet tested all possible combinations. Figure 5 shows a schematic example of the mouth height parameter curve. Black dots on the parameter curve represent data points with correlation values above a certain threshold, for example 0.9.

Figure 5: Schematic example of the relationship between data points and two principal components.

Canonical correlation is a form of correlation relating two sets of variables. Similar to factor analysis, there is more than one canonical correlation, each representing orthogonally separate patterns of relationships between the two sets. The first canonical correlation is always the one which explains most of the relationship. We only look at the first correlation in our analyses.
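
A minimal sketch of this step in R with cancor() from the stats package, assuming the per-speaker scores of the first two PCs of each feature have been computed as in the PCA sketch above; the object names f1_scores, f2_scores, height_scores, and width_scores are illustrative.

```r
# Sketch: first canonical correlation between an acoustic PC vector
# (first two PCs of F1 and of F2 -> 4 columns) and a visual PC vector
# (first two PCs of mouth height and of mouth width -> 4 columns),
# with one row per speaker.
audio_pcs <- cbind(f1_scores[, 1:2], f2_scores[, 1:2])
video_pcs <- cbind(height_scores[, 1:2], width_scores[, 1:2])

cc <- cancor(audio_pcs, video_pcs)
r_first <- cc$cor[1]        # only the first canonical correlation is used
xcoef   <- cc$xcoef[, 1]    # weights of the audio PCs in that correlation
ycoef   <- cc$ycoef[, 1]    # weights of the video PCs
```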

4. RESULTS

4.1. Results of PCA

The first PC of f0 and of the formant features typically accounted for most of the variance in the first and in the central data points. To a lesser extent, the first PC was also related to the data points at the end of the vowel. The second PC of these acoustic features accounted for the variance in the data points surrounding the central ones.

The picture is different for the RMS energy feature. Here, the first PC was much more dominant than for the formant features (about 80% compared to about 40-50% of the variance). It used a linear combination of practically all data points to achieve this.

The first two PCs of the mouth height and lip protrusion features exhibit a behaviour similar to that of the formant features. However, for the mouth width feature, the first PC was rather related to the central data points and the second PC to the first and last data points.

4.2. Results of Canonical Correlation

Table 2 shows, as an example, the r-values of the first canonical correlation with the first two PCs of F1 and F2 as audio variables and the first two PCs of mouth height and mouth width as video variables. Generally, we found that combinations of the first two PCs of F1, F2, and F3 correlated strongly with the PCs of mouth height. We also found a strong correlation between the first PC of f0 and the PCs of the mouth height, but not for the second PC of f0. The PCs of F2 and F3 also correlated well with the first PC of the mouth width. Finally, there was a strong correlation between RMS energy and F2 on one side and mouth height on the other side. No indication of a linear relationship between the data points of the protrusion parameter and the data points of any of the acoustic features was found. Short and long vowels showed similar results.

Short Vowels          Long Vowels
/A/   0.99            /a:/   0.96
/E/   0.96            /@:/   0.98
/I/   0.95            /i:/   0.96
/O/   0.98            /o:/   0.94
/U/   0.96            /u:/   0.97
/V/   0.94

Table 2: r-values of the first canonical correlation with the first two PCs of F1 and F2 as audio variables and the first two PCs of mouth height and mouth width as video variables.

5. DISCUSSION

It must be said that this is an exploratory investigation. We are aware that a data corpus with 10 speakers is still fairly small for such an analysis. In particular, one would like to incorporate more than the first two PCs to account for a higher percentage of the variance in the utterances for each of the vowels. We have immediate plans to extend our data corpus by adding more speakers so that our analysis can include more PCs of each feature and give a more accurate picture.

One can also argue that a piecewise linear interpolation, in the process of giving all parameter curves the same number of data points, is less effective than a spline interpolation, which would smooth the curve and thus reduce the effects of measurement errors. This needs to be investigated. However, for the current analysis we chose a piecewise linear interpolation for simplicity.
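
For reference, both options are readily available in R: approx() performs the piecewise linear resampling used here, while spline() provides a cubic spline alternative (and smooth.spline() an explicitly smoothing fit); a minimal sketch, with times and values as in the resampling step of Section 3:

```r
# Sketch: compare piecewise linear and cubic spline resampling of the same
# parameter curve onto 50 points. Note that spline() still passes through
# every measured point; it only yields a smooth curve between them.
linear_pts <- approx(times, values, n = 50)$y
spline_pts <- spline(times, values, n = 50)$y
```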

We looked at individual phoneme correlations because of the possible many-to-one articulatory-to-acoustic transforms. It has long been known that different articulatory movements can produce acoustically very similar results.

A strong correlation was found between some acoustic and visual features. In particular, the data points of F1 and F2 and the data points of the mouth height were strongly related. Articulatory-to-acoustic speech production theories are known to model the acoustical consequences of the degree of mouth opening well. Opening or closing the lips has almost immediate acoustical consequences, and hence it is no surprise to find a high correlation between the formant frequencies and the mouth height feature. However, since it is possible that a speaker changes the behaviour of the vocal tract behind the lips to compensate for the acoustical consequences of the lip movements, it is useful to actually confirm these strong correlations.

We would have expected the lip protrusion feature to be of more significance. However, fairly diverse parameter curves were found when looking at each vowel separately across the speakers. This diversity could either be due to speakers using different lip positions to produce the same sound, or it could point to inaccuracies in the extraction of the lip protrusion feature from the video signal. It is known that different articulatory movements can still produce the same acoustic result, and it is possible that this plays a role here as well. These results need further investigation.

6. CONCLUSION AND FUTURE WORK

To summarise, we performed an exploratory analysis of the statistical relationship between acoustic and visual speech features. A PCA was used to represent the shape of the parameter curves, and then a canonical correlation analysis was performed using the first two PCs of each feature, with two features combined at a time to form the sets of variables for this analysis. Strong correlations between the data points of F1 and F2 and the data points of mouth height were found. F3, RMS energy, and mouth width also showed some high correlation values. The lip protrusion feature appeared to be of little significance. The reason for the diversity of shapes observed in the lip protrusion parameter curves must be investigated further. As mentioned before, not all possible combinations of features have yet been examined, but we will do so in the future.

The strong correlations between some acoustic and visual features mean that this knowledge could be used in an AV ASR system to predict the presence of these acoustic features from the visual features. This is of particular importance for the use of ASR systems in environments with acoustic noise.

The data corpus needs to be extended by adding more speakers so that more PCs can be included in the canonical correlation analysis, thus incorporating more of the variance in the features. It can also be argued that repetitions from the same speakers would be valuable for looking into the intra-subject variability. This is common in speech corpora for speaker recognition but only to a lesser extent in corpora designed for ASR. We will also explore the use of a spline interpolation technique for producing the same number of data points for each utterance before the PCA. Smoother parameter curves may lead to more accurate results.

ACKNOWLEDGEMENT

The authors would like to thank the reviewer of this paper, whose comments have helped to make this a better paper.

7. REFERENCES

1. Adjoudani, A., and Benoît, C. "On the Integration of Auditory and Visual Parameters in an HMM-based ASR". In Stork, D.G., and Hennecke, M.E. (editors): Speechreading by Humans and Machines, Vol. 150, NATO ASI Series, Springer-Verlag, 1996, pp. 461-471.

2. Bregler, C., and Konig, Y. "'Eigenlips' for Robust Speech Recognition". Proc. of ICASSP'94, Vol. II, Adelaide, Australia, 1994, pp. 669-672.

3. Matthews, I., Cootes, T., Cox, S., Harvey, R., and Bangham, J.A. "Lipreading using shape, shading and scale". Proc. of AVSP'98, Terrigal, Australia, 1998, pp. 73-78.

4. Petajan, E.D. "Automatic Lipreading to Enhance Speech Recognition". PhD thesis, University of Illinois at Urbana-Champaign, 1984.

5. Stork, D.G., and Hennecke, M.E. (editors) "Speechreading by Humans and Machines", Vol. 150, NATO ASI Series, Springer-Verlag, 1996.

6. Yehia, H., Rubin, P., and Vatikiotis-Bateson, E. "Quantitative association of vocal-tract and facial behavior". Speech Communication, 26 (1-2): 23-43, 1998.

7. Goecke, R., Millar, J.B., Zelinsky, A., and Robert-Ribes, J. "Automatic Extraction of Lip Feature Points". Proc. of the Australian Conference on Robotics and Automation ACRA2000, Melbourne, Australia, 2000, pp. 31-36.

8. Goecke, R., Tran, Q.N., Millar, J.B., Zelinsky, A., and Robert-Ribes, J. "Validation of an Automatic Lip-Tracking Algorithm and Design of a Database for Audio-Video Speech Processing". Proc. of the 8th Australian International Conference on Speech Science and Technology SST-2000, Canberra, Australia, 2000, pp. 92-97.

9. Newman, R., Matsumoto, Y., Rougeaux, S., and Zelinsky, A. "Real-time stereo tracking for head pose and gaze estimation". Proc. of Automatic Face and Gesture Recognition FG2000, Grenoble, France, 2000.

10. Gentleman, R., and Ihaka, R. "The R Project for Statistical Computing". http://www.r-project.org, 2000.
