
Feature Selection and Dimension Reduction for Automatic Gender Identification

Mohammad Ali Keyvanrad, Mohammad Mehdi Homayounpour
Laboratory for Intelligent Signal and Speech Processing, Amirkabir University of Technology, Tehran, Iran.
Email: [email protected], [email protected]

Abstract

Gender identification based on the speech signal has gradually become a matter of interest in recent years. In this context, six feature types, namely MFCC, LPC, RC, LAR, pitch values and formants, are compared for automatic gender identification, and the three best feature types are selected using four feature selection techniques: GMM, Decision Tree, Fisher's Discriminant Ratio, and Volume of Overlap Region. Dimension reduction is then performed on the three best feature types, and the best coefficients are selected from each feature vector. The selected coefficients are evaluated for gender classification using three types of classifiers: GMM, SVM and an MLP neural network. The best performance, 96.09% gender identification accuracy, was obtained using the selected coefficients and the MLP classifier.

Keywords—feature comparison; Gaussian Mixture Model; Decision trees; Fisher’s Discriminant Ratio; Volume of Overlap Region; SVM; MLP

1. Introduction

Automatic Gender Identification (AGI) is a technique to determine the sex of the user of a voice processing system through speech signal analysis [1].

Automatically detecting the gender of a speaker has several useful applications. In speech recognition systems, gender-dependent models are more accurate than gender-independent ones [2]. For example, the performance of SPHINX-II (an ASR system developed by Carnegie Mellon University) improved when gender-dependent parameters were used [3].

In the context of speaker recognition, gender detection can improve the performance by limiting the search space to speakers of the same gender [2]. For example, some studies have found that nearly half of the false acceptance errors were caused by impostors of the opposite gender [4].

Also, in the context of content-based multimedia indexing, the speaker's gender is a cue used in the annotation [2]. In addition, gender-dependent speech coders are more accurate than gender-independent ones [5].

Gender identification has gradually become a matter of interest in recent years. Harb and Chen (2005) used pitch and spectral features with a multi-layer perceptron classifier and reported 93% classification accuracy [5]. They also tested their classifier on the Switchboard [6] database and obtained 98.5% gender classification performance. They reported a performance of 92% by combining neural networks and the Kullback-Leibler distance metric [2]. The performance achieved by Azghadi and Bonyadi (2007) was 96% [7].

Some authors used the Hidden Markov Model (HMM) as a classifier. For example, Parris and Carey (1996) used an HMM classifier with pitch and MFCC features and reported 99% gender classification performance on the OGI [8] database when only utterances of 5 seconds length were used as test data [4]. Other classifiers have also been used: for example, Lee et al. (2008) used an SVM (Support Vector Machine) [9] and Silovsky and Nouza (2006) used a GMM [10]. In some articles, preprocessing techniques have been used to improve gender classification performance. Fagundes and Martins (2002) used PCA for dimension reduction on MFCC feature vectors; they used neural networks as the classifier and obtained 100% accuracy. The database they used was a subset of the clean microphone TIMIT dataset including 100 speakers [1].

In this paper, several feature selection techniques were used and the most appropriate features were selected. These features were then concatenated into a feature vector and used for gender classification.


2. Database

The Oregon Graduate Institute (OGI) Multilanguage Telephone Corpus collected by Muthusamy [8] was used in this paper. This database consists of spontaneous speech utterances in eleven languages, uttered over real telephone lines by approximately 90 male and 40 female speakers. The 11 languages in the corpus are English, French, Korean, Mandarin, Farsi, German, Spanish, Hindi, Vietnamese, Tamil and Japanese. The sampling frequency is 16 kHz and the resolution is 16 bits per sample. Speech files are compressed using the Shorten scheme. In our gender identification experiments, speech data from three languages (Spanish, Tamil and Vietnamese) were used as test data (29% of the database) and speech data from the other eight languages were used for training the gender models. This gives roughly one-third of the database for testing and the rest for training: 8001 files for training and 3365 files for testing, with an average file length of 5 seconds. Since the test languages do not appear in the training set, this partitioning also makes our gender classification experiments language independent.

3. Feature comparison techniques

Many features have been used for gender identification. In this paper, some of the features most commonly used in the speech processing literature are compared; the best features are then selected and used in our gender identification experiments. These features are:

- MFCC (Mel Frequency Cepstral Coefficients)
- LPC (Linear Predictive Coefficients)
- RC (Reflection Coefficients)
- LAR (Log Area Ratio coefficients)
- Pitch
- Formants

The feature vector dimension is 25 for each of MFCC, LPC, RC and LAR. Nine parameters, namely the frequencies, bandwidths and amplitudes of the first three formants, are obtained from each analysis frame and used as formant parameters. MATLAB and the Voicebox toolbox [11] were used for extraction of the MFCC, LPC, RC, LAR, pitch and formant parameters. The features were normalized before being used. For normalization, we used the following equation:

$$ N(d) = \frac{F(d) - \mu_d}{\sigma_d} $$

where F(d) is the value at dimension d of the feature vector, and μ_d and σ_d are respectively the mean and the standard deviation in feature dimension d.
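As an illustration, the following minimal Python sketch applies this per-dimension normalization to a matrix of feature vectors. The function name and the choice of estimating the statistics over all training frames are our assumptions, not details taken from the paper.

```python
import numpy as np

def normalize_features(X):
    """Per-dimension z-score normalization, N(d) = (F(d) - mu_d) / sigma_d.

    X: (n_frames, n_dims) array of feature vectors. The statistics are
    estimated over all frames of X (an assumption; the paper does not
    state the estimation set).
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against constant dimensions
    return (X - mu) / sigma
```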

In our experiments, median filtering was used to smooth the pitch contour and remove noisy pitch values. The same procedure was applied to the time contour of each coefficient across successive feature vectors; we believe this removes abrupt variations of the feature values in successive frames. The median filtering window length was 21.
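A possible implementation of this smoothing step is sketched below, assuming SciPy's median filter and a frame-by-coefficient feature matrix (both assumptions on our part).

```python
import numpy as np
from scipy.signal import medfilt

def smooth_trajectories(X, window=21):
    """Median-filter the time contour of each coefficient (each column of X).

    X: (n_frames, n_dims) feature matrix; window=21 follows the paper.
    medfilt requires an odd kernel size.
    """
    return np.column_stack([medfilt(X[:, d], kernel_size=window)
                            for d in range(X.shape[1])])
```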

Four methods were used to select the better features: GMM, Decision Tree, Fisher's Discriminant Ratio (FDR), and Volume of Overlap Region (VOR).

3.1. GMM

GMM is a general classifier that can be used to check feature efficiency. In this method, a GMM is trained and run on each feature type separately to evaluate its classification performance. For GMM initialization, the k-harmonic means algorithm with 64 mixtures was used.
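The sketch below shows one plausible way to score a single feature type with two 64-mixture GMMs (one per gender) using scikit-learn. scikit-learn does not provide k-harmonic means initialization, so k-means initialization is used here as a stand-in; the function and variable names are ours.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def evaluate_feature_with_gmm(X_male, X_female, utt_test, y_test, n_mix=64):
    """Score one feature type with a male and a female GMM.

    X_male, X_female: (n_frames, n_dims) training frames for each gender.
    utt_test: list of (n_frames, n_dims) arrays, one per test utterance.
    y_test: list of 'male' / 'female' reference labels.
    """
    gmm_m = GaussianMixture(n_components=n_mix, covariance_type='diag',
                            init_params='kmeans').fit(X_male)
    gmm_f = GaussianMixture(n_components=n_mix, covariance_type='diag',
                            init_params='kmeans').fit(X_female)
    # decide per utterance by comparing average frame log-likelihoods
    preds = ['male' if gmm_m.score(u) > gmm_f.score(u) else 'female'
             for u in utt_test]
    return np.mean(np.array(preds) == np.array(y_test))
```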

3.2. Decision trees

Decision trees are well-known techniques in the data mining domain. They can be used as classifiers, as they build rules in an IF-THEN fashion that permit a decision about the class of a sample given its different attributes, or features [5].
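A minimal sketch of this check with scikit-learn's DecisionTreeClassifier is given below; the hyper-parameters are not reported in the paper, so the library defaults are assumed.

```python
from sklearn.tree import DecisionTreeClassifier

def tree_feature_score(X_train, y_train, X_test, y_test):
    """Frame-level accuracy of a decision tree on one feature type.

    Hyper-parameters are not given in the paper; scikit-learn defaults
    are used here.
    """
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train, y_train)
    return tree.score(X_test, y_test)
```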

3.3. Fisher’s Discriminant Ratio

The Fisher's Discriminant Ratio permits the estimation of the discrimination capability in each feature dimension. It is given by:

$$ f(d) = \frac{(\mu_{1d} - \mu_{2d})^2}{\sigma_{1d}^2 + \sigma_{2d}^2} $$

where f(d) is the Fisher's Discriminant Ratio for feature dimension d, and μ_1d, μ_2d, σ²_1d, σ²_2d are respectively the means and variances of classes "1" and "2" in feature dimension d.

As in [5], we use the maximum of f(d) over all dimensions. The higher the Fisher's Discriminant Ratio, the better the features are for the given classification problem.
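The ratio is straightforward to compute per dimension; the following sketch (our own helper, not code from the paper) returns both the per-dimension values and their maximum, as used in [5].

```python
import numpy as np

def fisher_discriminant_ratio(X1, X2):
    """f(d) = (mu_1d - mu_2d)^2 / (var_1d + var_2d) for every dimension d.

    X1, X2: (n_samples, n_dims) arrays for class 1 and class 2.
    Returns the maximum over all dimensions and the per-dimension values.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    var1, var2 = X1.var(axis=0), X2.var(axis=0)
    f = (mu1 - mu2) ** 2 / (var1 + var2 + 1e-12)  # epsilon for stability
    return f.max(), f
```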

3.4. Volume of Overlap Region

The Volume of Overlap Region is another measure for analyzing the complexity of a classification problem; it calculates the overlap between the classes in a selected feature space. This can be measured by calculating the maximum and the minimum of the feature values in each feature dimension and then calculating the length of overlap for each dimension. The volume of overlap is the product of the overlap lengths over all dimensions [5]. It is given by the following equation:

$$ \mathrm{VOR} = \prod_{i=1}^{d} \frac{\mathrm{MIN}\big(\max(f_i,c_1),\max(f_i,c_2)\big) - \mathrm{MAX}\big(\min(f_i,c_1),\min(f_i,c_2)\big)}{\mathrm{MAX}\big(\max(f_i,c_1),\max(f_i,c_2)\big) - \mathrm{MIN}\big(\min(f_i,c_1),\min(f_i,c_2)\big)} $$

where max(f_i, c_1) and min(f_i, c_1) are respectively the maximum and the minimum values of feature f_i for class c_1 (and similarly for c_2), and i = 1, ..., d for a d-dimensional feature space.

The VOR is zero if there is at least one feature dimension in which the two classes do not overlap (i.e., the overlap length in that dimension is negative and is taken as zero).
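A direct implementation of this measure, under the assumption that each class is given as a samples-by-dimensions matrix, might look as follows.

```python
import numpy as np

def volume_of_overlap_region(X1, X2):
    """Volume of Overlap Region between two classes, per the equation above.

    X1, X2: (n_samples, n_dims) arrays for classes c1 and c2. A negative
    overlap length in any dimension means the classes are separable there,
    so the VOR is returned as zero.
    """
    hi1, lo1 = X1.max(axis=0), X1.min(axis=0)
    hi2, lo2 = X2.max(axis=0), X2.min(axis=0)
    overlap = np.minimum(hi1, hi2) - np.maximum(lo1, lo2)
    span = np.maximum(hi1, hi2) - np.minimum(lo1, lo2)
    if np.any(overlap < 0):
        return 0.0
    span = np.where(span == 0, 1e-12, span)  # avoid division by zero
    return float(np.prod(overlap / span))
```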

4. Experiments

In this section, we present the feature comparison experiments and a classification experiment that combines the best features.

4.1. Feature comparison

The first experiment conducted in this research was to compare and select the best features for gender identification. The feature comparison techniques explained in the previous section were used. Tables 1, 2, and 3 depict the comparison results.

In these tables, the performance of each feature in discriminating between male and female speakers is presented for each feature comparison method, once when the features are smoothed using median filtering and once without median filtering. Table 1 shows the comparison results using GMM. This table shows that for most features, performance improves when median filtering is used. It can also be observed that the best features are, in decreasing order, pitch, MFCC, RC, LAR, formant and LPC.

Table 1. Feature comparison using GMM classifier. Best performances are denoted by *.

Features | GMM (without median filtering) | GMM (with median filtering)
         | Male   Female  Mean            | Male   Female  Mean
MFCC     | 87.0   82.4    84.7            | 86.6   84.3    85.4*
LPC      | 57.1   63.0    60.0*           | 56.5   61.3    58.9
RC       | 81.1   78.0    79.6            | 81.1   78.2    79.7*
LAR      | 79.8   78.5    79.2*           | 79.1   78.2    78.6
Pitch    | 86.0   97.1    91.5            | 88.0   98.4    93.2*
Formant  | 71.5   73.2    72.3            | 70.7   76.3    73.5*

Table 2 shows the comparison results when a decision tree is used as the comparison method. This table also shows the usefulness of median filtering for most features. The best features are, in decreasing order, pitch, RC, MFCC, LAR, formant and LPC; RC performs better than MFCC in this experiment.

Table 2. Feature comparison using decision tree classifier. Best performances are denoted by *.

Features | Decision Tree (without median filtering) | Decision Tree (with median filtering)
         | Male   Female  Mean                      | Male   Female  Mean
MFCC     | 80.5   74.3    77.4                      | 81.3   76.6    78.9*
LPC      | 74.6   55.2    64.9                      | 73.4   57.7    65.5*
RC       | 83.7   75.7    79.7                      | 82.2   78.6    80.4*
LAR      | 83.2   77.5    80.4*                     | 81.9   76.0    79.0
Pitch    | 88.4   89.1    88.8                      | 88.4   93.3    90.9*
Formant  | 78.0   57.9    68.0                      | 78.9   59.6    69.2*

In Table 3, Fisher's Discriminant Ratio and VOR are used as two statistical measures for feature comparison. Larger values of the Fisher's Discriminant Ratio indicate better feature discrimination, while smaller VOR values indicate smaller overlap and therefore better discrimination. In Table 3, for VOR, five features show better performance when median filtering is used, but for the Fisher's Discriminant Ratio only two features improve when the features are median filtered. It can also be concluded from Table 3 that pitch, MFCC, RC, LAR and formant values outperform the LPC coefficients when the Fisher's Discriminant Ratio is used as the comparison method. The results obtained for VOR, however, suggest that LPC features are better than the other features, which contradicts the results obtained using the other feature comparison methods studied in this paper. Considering this last result, it can be concluded that VOR is not an appropriate measure, since it uses only the maximum and minimum feature values in each dimension, which can be affected by noise.

Our conclusion from the results presented in Tables 1, 2 and 3 is that, for discrimination between male and female speakers:

- the best features are pitch, RC and MFCC;
- median filtering usually improves the performance.


Table 3. Feature comparison using statistical measures (Fisher and VOR). Best values are denoted by *.

Features | without median filtering | with median filtering
         | Fisher    VOR            | Fisher    VOR
MFCC     | 0.3757*   0.0047         | 0.3608    0.0032*
LPC      | 0.0983*   0.0020         | 0.0831    0.0008*
RC       | 1.2792*   0.0095         | 1.2749    0.0088*
LAR      | 1.3032*   0.0083         | 1.2895    0.0074*
Pitch    | 3.9851    0.5221*        | 4.5324*   0.6108
Formant  | 0.5768    0.0351         | 0.7191*   0.0305*

4.2. Feature selection

Based on the results obtained in the previous experiment, the pitch (1 coefficient), MFCC (25 coefficients) and RC (25 coefficients) features were considered the most appropriate features for gender classification. In this experiment we perform a dimension reduction by selecting the best coefficients in each feature vector.

The Fast Feature Ranking Algorithm (SOAP) described in [12] was used as the feature selection method in this paper. In this method, the best attributes are those with the smallest number of label changes (NLC). SOAP is based on counting the class-label changes produced when crossing the projections of the examples onto each dimension. If the attributes are sorted in ascending order of their number of label changes (NLC), we obtain a list that defines the priority of selection from greater to smaller importance.

In intervals containing multiple labels, we consider the worst case, i.e. the maximum number of label changes possible for the same value. Figure 1 depicts this method.

Figure 1. Subsequence of the same value: (a) two changes, (b) seven changes [12]

In other words, this method counts the label changes along each dimension, and when several samples share the same value in a dimension, the maximum possible number of changes is counted.
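The sketch below is a simplified reading of this counting scheme, not the exact algorithm of [12]: it sorts each dimension, counts label changes between consecutive samples, and assumes the worst case inside runs of tied values. Function names and the handling of run boundaries are our assumptions.

```python
import numpy as np

def number_of_label_changes(x, y):
    """Simplified NLC for one feature dimension.

    x: feature values; y: integer class labels (e.g. 0 = male, 1 = female).
    Samples are sorted by value; inside a run of identical values the
    maximum number of alternations allowed by the run's class counts is
    assumed, and boundary changes are counted from the sorted ordering.
    """
    order = np.argsort(x, kind='stable')
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    changes, i, prev_label = 0, 0, None
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[j + 1] == xs[i]:
            j += 1
        run = ys[i:j + 1]
        n, m = len(run), np.bincount(run).max()
        changes += min(n - 1, 2 * (n - m))      # worst case inside the run
        if prev_label is not None and run[0] != prev_label:
            changes += 1                        # change at the run boundary
        prev_label = run[-1]
        i = j + 1
    return changes

def soap_rank(X, y):
    """Rank feature dimensions by ascending NLC (smaller = more important)."""
    nlc = [number_of_label_changes(X[:, d], y) for d in range(X.shape[1])]
    return np.argsort(nlc)
```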

Using the above feature selection procedure, 5 of the 51 features were selected. The selected features are the pitch, the 7th RC coefficient, the 16th MFCC coefficient, the 18th MFCC coefficient and the 9th RC coefficient.

4.3. Classification using GMM, SVM and MLP

In this experiment, a GMM with a mixture of 96 Gaussian components, an SVM, and an MLP classifier were used for gender classification. The 5 selected features obtained from the previous experiment were used. The k-harmonic means algorithm was used for GMM initialization. An experiment was conducted to determine the number of Gaussian components for the GMM model; the result is depicted in Figure 2. As can be seen in this figure, 96 Gaussian components is a good choice. It seems that the training data is not sufficient for estimating more than about 96 Gaussian components.
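The experiment behind Figure 2 could be reproduced roughly as follows; the candidate component counts and the use of scikit-learn's k-means initialization (instead of k-harmonic means) are assumptions on our part.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def sweep_gmm_sizes(X_male, X_female, utt_test, y_test,
                    sizes=(8, 16, 32, 64, 96, 128)):
    """Train male/female GMMs of varying size and record test accuracy.

    utt_test: list of (n_frames, 5) arrays of selected coefficients;
    y_test: list of 'male' / 'female' reference labels.
    """
    results = {}
    for k in sizes:
        gm = GaussianMixture(n_components=k, covariance_type='diag').fit(X_male)
        gf = GaussianMixture(n_components=k, covariance_type='diag').fit(X_female)
        preds = ['male' if gm.score(u) > gf.score(u) else 'female'
                 for u in utt_test]
        results[k] = np.mean(np.array(preds) == np.array(y_test))
    return results
```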

Figure 2. Gender identification performance based on the number of Gaussian components in the GMM models

An RBF kernel was used as the SVM kernel. Since SVM training is very time consuming, k-means clustering (k = 64) was used to cluster the training data, and the cluster centroids were used as training data for the SVM. This method reduces the amount of training data and also removes noisy data. Another experiment was conducted to determine the number of cluster centroids sufficient for use as SVM training data (see Figure 3). It seems that, due to the small number of features in each feature vector (5 coefficients), only 64 clusters are sufficient to cluster the training data.
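A sketch of this training procedure is given below. Clustering each gender separately and leaving the SVC hyper-parameters (C, gamma) at their scikit-learn defaults are assumptions, since the paper does not specify them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_svm_on_centroids(X_train, y_train, n_clusters=64):
    """Cluster each gender's frames with k-means and train an RBF-kernel SVM
    on the centroids only, which shrinks the training set and discards
    noisy frames."""
    centroids, labels = [], []
    for cls in np.unique(y_train):
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_train[y_train == cls])
        centroids.append(km.cluster_centers_)
        labels.extend([cls] * n_clusters)
    svm = SVC(kernel='rbf')
    svm.fit(np.vstack(centroids), labels)
    return svm
```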


Figure 3. SVM gender identification performance based on the number of cluster centroids used as training data for the SVM models

A 5×50×1 (5 input nodes, 50 nodes in the hidden layer and one output node) MLP neural network was also used in our experiments for gender identification. The five input nodes correspond to the input features. The value of 50 for the number of hidden layer nodes was obtained from an experiment conducted for this purpose (see Figure 4). Figure 4 depicts the gender identification performance as a function of the number of nodes in the hidden layer.
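A comparable network can be set up with scikit-learn's MLPClassifier as sketched below; the activation function, optimizer and iteration budget are not reported in the paper and are therefore assumptions.

```python
from sklearn.neural_network import MLPClassifier

def train_mlp(X_train, y_train):
    """5-50-1 MLP as in the paper: one hidden layer with 50 neurons.

    X_train: (n_samples, 5) selected coefficients; y_train: binary labels.
    scikit-learn adds the output layer itself for binary targets.
    """
    mlp = MLPClassifier(hidden_layer_sizes=(50,), activation='logistic',
                        solver='adam', max_iter=500, random_state=0)
    return mlp.fit(X_train, y_train)
```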

Figure 4. Gender identification performance based on the number of neurons in hidden layer in MLP classifier

Table 4 shows the gender classification performance using the GMM, SVM and MLP classifiers. In decreasing order of performance, the classifiers rank MLP, SVM, and GMM.

Table 4. Gender identification performance using GMM, SVM, and MLP.

Classifier | Male     Female   Mean
GMM        | 91.05%   98.15%   94.60%
SVM        | 93.37%   98.15%   95.76%
MLP        | 94.84%   97.33%   96.09%

The best performance, 96.09%, is obtained by the MLP. SVM and MLP, as discriminative classifiers, outperform the GMM, which is a generative classification technique.

5. Conclusion

This paper presented experiments on gender classification. Many features have been proposed and used for gender identification in the literature. In this paper it was shown that some features are more suitable than others for gender classification. It was also shown that, within a feature vector, some coefficients play a more important role in gender discrimination. Feature selection and dimension reduction are two important tasks in any classification process, since they reduce the computational complexity and may help to develop fast, real-time systems. In this paper we showed that only five coefficients, namely the pitch value, two RC coefficients and two MFCC coefficients, lead to a very high gender classification performance. Three classifiers were also compared, and it was shown that discriminative classifiers such as MLP and SVM outperform generative classifiers such as GMM. Using only 5 coefficients and an MLP classifier, 96.09% gender classification performance was obtained.

6. Acknowledgments

The authors would like to thank the Iran Telecommunication Research Center (ITRC) for supporting this work under contract No. T/500/14939.

7. References

[1] R. D. R. Fagundes, A. A. C. Martins, F. Comparsi de Castro et al., "Automatic gender identification by speech signal using eigenfiltering based on Hebbian learning," in Proceedings of the VII Brazilian Symposium on Neural Networks (SBRN 2002), pp. 212-216, 2002.
[2] H. Harb and L. Chen, "Gender identification using a general audio classifier," in Proceedings of the 2003 International Conference on Multimedia and Expo (ICME '03), vol. 2, pp. 733-736, 2003.
[3] W. H. Abdulla and N. K. Kasabov, "Improving speech recognition performance through gender separation," in Artificial Neural Networks and Expert Systems International Conference (ANNES), Dunedin, New Zealand, pp. 218-222, 2001.


[4] E. S. Parris and M. J. Carey, "Language independent gender identification," in Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), vol. 2, pp. 685-688, 1996.
[5] H. Harb and L. Chen, "Voice-based gender identification in multimedia applications," Journal of Intelligent Information Systems, pp. 179-198, 2005.
[6] J. J. Godfrey, E. C. Holliman, and J. McDaniel, "SWITCHBOARD: telephone speech corpus for research and development," in Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-92), vol. 1, pp. 517-520, 1992.
[7] S. M. Rahimi Azghadi, M. R. Bonyadi, and H. Shahhosseini, "Gender Classification Based on FeedForward Backpropagation Neural Network," pp. 299-304, 2007.
[8] Y. K. Muthusamy, R. A. Cole, and B. T. Oshika, "The OGI multilanguage telephone speech corpus," pp. 895-898, 1992.
[9] K.-H. Lee, S.-I. Kang, D.-H. Kim et al., "A Support Vector Machine-Based Gender Identification Using Speech Signal," IEICE Transactions on Communications, pp. 3326-3329, 2008.
[10] J. Silovsky and J. Nouza, "Speech, Speaker and Speaker's Gender Identification in Automatically Processed Broadcast Stream," Radioengineering, pp. 42-48, 2006.
[11] M. Brookes, "Voicebox: Speech Processing Toolbox for Matlab."
[12] R. Ruiz, J. C. Riquelme, and J. S. Aguilar-Ruiz, "Fast Feature Ranking Algorithm," pp. 325-331, 2003.
