
A Neural Network Based Audio Content Classification

Vikramjit Mitra University of Maryland, Dept. of Electrical Engineering, College Park, MD 20770

email: [email protected]

Chia-Jiu Wang University of Colorado, Dept. of Electrical and Computer Engineering, Colorado Springs, CO 80933

email: [email protected]

Abstract— The emergence of digital music on the Internet calls for a reliable real-time tool to analyze and properly categorize it for users. To incorporate content or genre queries in web searches, audio content analysis and classification is imperative. This paper proposes a set of audio content features and a parallel Neural Network architecture that addresses the task of automated content-based audio classification. Feature sets based on signal periodicity, beat information, sub-band energy, Mel-frequency Cepstral coefficients and Wavelet transforms are proposed, and each feature set is individually analyzed for its pertinence to the proposed task. A parallel Multi-layered Perceptron network is proposed which offers a classification accuracy of 84.4% in distinguishing between 6 different genres. The proposed architecture is compared with a Support Vector Machine based classifier and is found to perform better than the latter.

I. INTRODUCTION

Usage of audio-visual files over the Internet has increased exponentially over the past few years. Search engines have been customized to incorporate audio and video file searching. The International Organization for Standardization (ISO) has standardized the process for describing multimedia file content, for example MPEG-7 (Moving Picture Experts Group) [1, 2]. However, these descriptions may not be coherent; hence content analysis is of critical importance for categorizing audio-visual files in a consistent way. This paper addresses the task of content analysis of audio files. Music genres are high-level descriptors used by listeners, dealers, organizations and institutes to describe and organize their audio collections [3]. Unfortunately, these high-level descriptors are often not well defined, which makes the task of automatically analyzing and categorizing audio files into such descriptors a non-trivial one. Despite the lack of systematic definition of music genres, it has been observed [4] that human subjects can perform genre classification with a remarkably high success rate after listening to only 250 milliseconds of audio data. This suggests that perceptual criteria, rather than systematic theoretical definitions, are of critical importance for music genre classification. Huge efforts have been made to categorize audio files into genres manually. Weare et al. [5] reported on such a procedure for manually categorizing a hundred thousand audio files at Microsoft Corp., which required the effort of 30 musicologists for one year. This strongly argues for the necessity and usefulness of an automated genre classifier.

In an effort to seek a taxonomy of musical genres, Pachet et al. [6] have shown that building a genre hierarchy is a non-trivial task. They have also shown that there is no general agreement among genre taxonomies, and that even if such a taxonomy is constructed, a single audio file may be found to belong to multiple taxa. Hence, they claimed that the widely used genres, such as Rock, Jazz etc., represent different song categories, and that the hierarchical structure amongst the genres differs from the taxonomical structure. A number of interesting studies exist that address automatic genre classification of audio files. The approaches differ in that different authors have proposed different feature sets and classifiers to improve the performance of the classification task. Tzanetakis et al. [7] have implemented the Discrete Wavelet Transform (DWT) for non-speech audio analysis and have also extracted beat information from music signals. They achieved a success rate of 61% for classification amongst ten genres. Kosina [8] achieved an accuracy of 88% with three genres. McKinney et al. [9] achieved a success rate of 74% using seven categories. This paper addresses classification among six genres: Classical, Hard-Rock, Jazz, Pop, Rap and Soft-Rock. A set of seven low-level features, based on the acoustic information content of the audio samples, is proposed. Each of these feature sets is individually processed by a Multi-layer Perceptron (MLP) network, and the overall decision about the genre category is made from the combined decisions of the 7 MLPs. Such a parallel structure ensures distributed computation capability, which is desired for real-time applications. The proposed system is also compared with a similar parallel SVM classifier structure; the former was found to offer a higher success rate of 84.4%, whereas the latter offered 58.93%.

II. FEATURE SET

A set of 7 features is proposed in this paper. The first 3 primarily incorporate perceptual information, based upon careful study of the correlation between human perception and the spectrograms of audio signals. The 4th set incorporates vocalist-related information, which can also be used to distinguish segments with vocals from those without. The last three feature sets are obtained using the Discrete Wavelet Transform (DWT), based on the observations in [7, 27], which was selected for its ability to compactly represent audio signals in both time and frequency.


A. Periodicity and beats

The temporal quasi-periodicity of the signal amplitude envelope plays a critical role in distinguishing Rap from the other genres.

Fig. 1. Plot of the (a) spectrogram and (b) temporal data of a segment of audio from the genre 'Rap'.

The periodicity of the envelope can be observed both in the temporal data and in the spectrogram plot, and even more clearly in the autocorrelation function. The repetitiveness of Rap songs and their music is reflected in this quasi-periodicity. It was observed that such quasi-periodicity is not as apparent in the remaining 5 genres; hence this information can be used to distinguish Rap from the rest.

Beat information is also an important feature for identifying the genre of an audio sample. For example, we can expect more frequent beats in a Hard-Rock audio sample than in a Classical sample. The beat cycles for each audio segment are obtained and their distribution over the entire segment is modeled using the first four moments (mean, variance, skewness and kurtosis). Hence the first feature set, V1, has 5 elements: the average signal period followed by the first 4 moments of the beat distribution.
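As an illustration, below is a minimal sketch of how V1 might be computed, assuming a Hilbert-envelope autocorrelation for the period estimate and a simple peak-picking beat tracker; the paper does not specify its exact beat-detection procedure, and all function names here are illustrative.

```python
import numpy as np
from scipy.signal import hilbert, find_peaks
from scipy.stats import skew, kurtosis

def extract_v1(x, fs):
    # Amplitude envelope via the analytic signal, decimated so that the
    # autocorrelation below stays cheap.
    env = np.abs(hilbert(x))[::100]
    env_fs = fs / 100.0
    env = env - env.mean()
    # Lag of the first strong autocorrelation peak approximates the
    # quasi-period of the envelope.
    ac = np.correlate(env, env, mode="full")[len(env) - 1:]
    ac /= ac[0]
    peaks, _ = find_peaks(ac, height=0.1)
    period = peaks[0] / env_fs if len(peaks) else 0.0
    # Crude beat onsets: envelope peaks at least 0.25 s apart (illustrative).
    beats, _ = find_peaks(env, distance=int(0.25 * env_fs))
    iv = np.diff(beats) / env_fs
    if len(iv) < 2:
        return np.array([period, 0.0, 0.0, 0.0, 0.0])
    # Average period plus the first four moments of the beat intervals.
    return np.array([period, iv.mean(), iv.var(), skew(iv), kurtosis(iv)])
```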

B. Subband Energy

Study of the audio signal spectrogram across different genres shows a wide variation of signal energy across the various sub-bands from one genre to another. It can be observed from Fig. 2 that in Classical and Jazz almost no energy exists in the range of 12 to 16 kHz, whereas in Soft Rock and Pop energy is found up to 16 kHz.

Fig. 2. Spectrograms of audio file segments from genres (a) Classical, (b) Jazz, (c) Soft Rock and (d) Pop.

For the second feature set, each audio sample is filtered into 30 sub-bands, where the first 20 sub-bands have a bandwidth of 500 Hz and the last 10 have a bandwidth of 1000 Hz. Gabor band-pass filters [28] with appropriate center frequencies were used for the sub-band filtering. The energy in each sub-band was evaluated and normalized with respect to the sub-band having the maximum energy, giving a vector of 30 normalized energy coefficients. The lowest sub-band, with center frequency at 250 Hz, had the maximum energy for all audio files; consequently the first element of the vector was always equal to 1 and can be ignored. This results in a sub-band energy vector (V2) of 29 coefficients.
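A sketch of the V2 computation follows. Only the band edges follow the paper; Butterworth band-pass filters are assumed here as a stand-in for the Gabor filters of [28].

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def extract_v2(x, fs=44100):
    # 20 bands of 500 Hz (0-10 kHz), then 10 bands of 1 kHz (10-20 kHz).
    edges = [(i * 500.0, (i + 1) * 500.0) for i in range(20)]
    edges += [(10000.0 + i * 1000.0, 10000.0 + (i + 1) * 1000.0)
              for i in range(10)]
    energies = []
    for lo, hi in edges:
        # Butterworth band-pass stands in for the paper's Gabor filters;
        # the lowest edge is clamped to 1 Hz to keep the filter valid.
        sos = butter(4, [max(lo, 1.0), hi], btype="bandpass",
                     fs=fs, output="sos")
        y = sosfiltfilt(sos, x)
        energies.append(np.sum(y ** 2))
    energies = np.asarray(energies) / np.max(energies)
    # The 250 Hz band always carries the maximum energy (always 1), so it
    # is dropped, leaving the 29-element vector V2.
    return energies[1:]
```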

C. Periodicity of the signal power

The signal power provides valuable discriminatory information about the music genres. Due to the quasi-periodicity of the amplitude envelope of Rap music, its power signal is also found to be almost periodic in nature, as observed in Fig. 3(a). Fig. 3 also shows that in Jazz the power is usually high with little variation, whereas considerable variation is observed in Hard-Rock and Pop. Apart from Rap, all other genres have an aperiodic power plot. The power plots in Fig. 3 were obtained using 20-millisecond rectangular windows with 50% overlap. The obtained power functions are modeled by 30 Linear Prediction (LP) coefficients, which generate the 4th feature vector (V4).

Linear Prediction, commonly known as Linear Predictive Coding (LPC), is widely used in speech processing. It was introduced in the late sixties [15] and has evolved considerably over the last three decades [16, 17]. In linear prediction, an estimate ŝ(n) of a sample value s(n) is given as a linear combination of previous samples, as in (1) and (2).


\hat{s}(n+1) = a_0 s(n) + a_1 s(n-1) + a_2 s(n-2) + \cdots + a_N s(n-N)    (1)

\hat{s}[n] = \sum_{j=1}^{N} a_j s[n-j]    (2)

There are different ways to form a linear combination of the signal history and use it to predict the next signal value. The simplest is a weighted sum of a finite number of previous signal values. The underlying mathematical theory is Wold's decomposition principle [18], which states that a regular sequence can be obtained from white noise by filtering with an IIR filter. Traditionally, an LP filter is assumed to be all-pole, with coefficients obtained from the Yule-Walker equations [19].
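A sketch of the V4 computation under these definitions: the short-time power curve is framed as described above, and the LP coefficients are obtained by solving the Yule-Walker normal equations directly. Helper names are illustrative.

```python
import numpy as np

def lpc_yule_walker(s, order):
    # Autocorrelation up to the model order.
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(order + 1)])
    # Toeplitz system R a = r[1:] (the Yule-Walker normal equations).
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:])          # a_1 ... a_order

def extract_v4(x, fs=44100, order=30):
    # 20 ms rectangular windows with 50% overlap, as in the paper.
    win, hop = int(0.020 * fs), int(0.010 * fs)
    power = np.array([np.mean(x[i:i + win] ** 2)
                      for i in range(0, len(x) - win, hop)])
    # 30 LP coefficients modeling the power curve form V4.
    return lpc_yule_walker(power - power.mean(), order)
```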

Fig. 3. Plot of signal power for 6 genres: (a) Rap, (b) Jazz, (c) Classical, (d) Hard-Rock, (e) Pop and (f) Soft-Rock.

D. Mel Frequency Cepstral Coefficients (MFCC)

MFCCs are primarily used as features for speech and speaker recognition. Blum et al. [10] have used MFCCs to model music and audio signals. MFCCs are short-term [11] spectral features obtained from a type of cepstral representation of the audio segment. The main difference between the ordinary cepstrum and the MFCC is that in the MFCC the frequency bands are positioned logarithmically, on the mel scale, which represents the human auditory system's response more closely than the linearly spaced frequency bands obtained directly from the ordinary Fast Fourier Transform (FFT) or Discrete Cosine Transform (DCT). Logan [12] showed that MFCCs work well for speech/music classification. Audio files usually contain both speech and music components, which tends to make the feature extraction task difficult, and MFCCs are primarily used to circumvent this problem. In certain genres music may be more predominant than vocals (for example Classical, where vocals, if present, are less speech-like due to their high pitch) and in other cases it may be just the reverse (for example Rap); in such cases MFCCs can work as a crucial discriminatory feature. It is well known that the first 13 MFCC coefficients provide the best speaker-related information [13]. The third feature vector (V3) is based upon the first 13 MFCC coefficients. As these coefficients retain speaker-related information, they can be effectively used to classify audio samples according to vocalist(s).

The MFCC algorithm is taken from the Auditory Toolbox by Slaney [14], which calculates the MFCC values for 20-millisecond windows (with 50% overlap between successive windows) of a given audio signal. For each window 13 MFCC coefficients were calculated, generating a total of 263 sets of 13 MFC coefficients. Each MFC coefficient thus has a vector of 263 values, corresponding to the 263 windows; hence the overall MFC coefficient set has dimension 13x263. To reduce the dimension of the feature set, each MFCC vector is modeled by linear prediction (LP) of order 8, which generates 9 LP coefficients. The first coefficient is always equal to one and can be ignored, further reducing the feature set to a 13x8 matrix. This matrix is treated as a vector of 104 elements and represents the third feature set (V3).
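A sketch of the V3 pipeline, with librosa assumed as a substitute for the Auditory Toolbox [14]; the frame sizes follow the paper, while the 263-frame count depends on the segment length.

```python
import numpy as np
import librosa

def extract_v3(x, fs=44100):
    # 13 MFCCs per 20 ms frame with 50% overlap -> a 13 x n_frames matrix.
    mfcc = librosa.feature.mfcc(y=x, sr=fs, n_mfcc=13,
                                n_fft=int(0.020 * fs),
                                hop_length=int(0.010 * fs))
    # Order-8 LP per coefficient trajectory; librosa.lpc returns
    # [1, a_1, ..., a_8], and the leading 1 is dropped as in the paper.
    feats = [librosa.lpc(row.astype(float), order=8)[1:] for row in mfcc]
    return np.concatenate(feats)              # 13 x 8 = 104 elements
```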

E. Discrete Wavelet Transform

The Wavelet transform is an orthogonal decomposition technique whose basis functions are known as wavelets. Wavelet functions are obtained from a single prototype called the mother wavelet: they are dilated and shifted versions of the mother wavelet and are represented as

\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}} \psi\left(\frac{t-b}{a}\right)    (3)

where ψ_{a,b} is the wavelet function, ψ is the mother wavelet, a is the scaling parameter and b is the shifting parameter. The Continuous Wavelet Transform of a signal s(t) is given as

W(a,b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} s(t)\, \psi^{*}\left(\frac{t-b}{a}\right) dt    (4)

For a discrete-time signal s[k], the Discrete Wavelet Transform (DWT) is used, which gives a discrete wavelet representation of that signal. The discrete wavelet function, Φ_{a,b}, is expressed [20] in terms of the mother wavelet Φ as

\Phi_{a,b}(n) = 2^{-a/2}\, \Phi\left[2^{-a} n - b\right]    (5)

The variables a and b are integers that dilate and translate the mother wavelet Φ to generate the wavelet functions. The DWT of a discrete signal s[k] is given as

W[j,k] = \sum_{j} \sum_{k} s[k]\, 2^{-j/2}\, \Psi\left[2^{-j} n - k\right]    (6)

The dilation factor in this case is of the form a = 2^j, where j is an integer. The corresponding translation factor takes the


form b = 2^j n, where n is an integer. The DWT works as a multi-rate filter bank and is usually viewed as a constant-Q filter bank with octave spacing between the center frequencies of the filters. The DWT downsamples the input signal by two, producing an approximation coefficient vector, CA, and a detail coefficient vector, CD. The sampling frequency of the audio samples considered in this paper is 44.1 kHz. Downsampling by 2 in the DWT corresponds, in the frequency domain, to splitting the signal into a low-pass and a high-pass component, where CA is the low-pass component (0 Hz to 11.025 kHz) and CD is the high-pass component (11.025 kHz to 22.05 kHz).

The wavelet coefficients provide a compact representation of the energy distribution of the signal in both time and frequency [7]. However, the dimensionality of the DWT coefficients makes the realization of a simple ANN structure difficult: as the input dimension increases, the ANN structure becomes more complicated, increasing the training time and reducing classification performance. To reduce the dimensionality of the extracted DWT coefficients, Tzanetakis et al. [7] considered statistics over non-overlapping coefficient sets, which give statistical information about the energy distribution of the input signal in time and frequency. Using the DWT, three feature vectors (V5, V6 and V7) were constructed. The basic construction procedure for all these feature vectors is the same, except that they use different mother wavelets: V5 uses Haar, V6 uses Daubechies and V7 uses Symlet. To obtain these feature vectors, a 2-level DWT is performed on the windowed signal and the resulting approximation coefficient vector, CA2, is divided into 6 equal-length non-overlapping vectors. The mean and variance of each of these vectors are obtained, giving 12 parameters. This 12-element vector is denoted V5 when the mother wavelet is Haar; using 'Daubechies' and 'Symlet' as the mother wavelet instead gives feature sets V6 and V7.
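A sketch of the V5-V7 construction using PyWavelets. 'db4' and 'sym4' are assumptions, since the paper does not state which Daubechies and Symlet family members were used.

```python
import numpy as np
import pywt

def dwt_features(x, wavelet):
    # 2-level DWT; wavedec returns [cA2, cD2, cD1], so [0] is the
    # level-2 approximation CA2.
    ca2 = pywt.wavedec(x, wavelet, level=2)[0]
    # Split CA2 into 6 non-overlapping segments and take mean and
    # variance of each, giving the 12-element feature vector.
    segments = np.array_split(ca2, 6)
    return np.array([f(seg) for seg in segments for f in (np.mean, np.var)])

v5 = lambda x: dwt_features(x, "haar")   # Haar mother wavelet
v6 = lambda x: dwt_features(x, "db4")    # Daubechies (order assumed)
v7 = lambda x: dwt_features(x, "sym4")   # Symlet (order assumed)
```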

III. THE DATA

MP3 and WAV files are primarily considered in this research. MP3 is the file extension for MPEG audio layer 3. A rule-based algorithm reads the input file, decides whether it is in MP3 or WAV format and, based upon that, reads and processes the file for feature extraction (a minimal illustration of such a format check is sketched below). Random non-overlapping windowed segments of data are then processed to obtain the desired feature sets. Typically, more than one audio segment was selected from each audio file. This paper considers audio files from 6 different genres: Classical, Hard-Rock, Jazz, Pop, Rap and Soft-Rock, with 600 files from each genre, 3600 files altogether. 480 files per genre were used for training the classifiers and the remaining 120 files were used for testing them.
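The paper does not give its rule base, but a format check of this kind can key on the file headers: WAV files begin with a 'RIFF...WAVE' chunk, while MP3 files begin with an 'ID3' tag or an MPEG frame-sync pattern. A hypothetical sketch:

```python
def detect_format(path):
    with open(path, "rb") as f:
        head = f.read(12)
    # WAV: RIFF container with a WAVE form type.
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "WAV"
    # MP3: ID3v2 tag, or an MPEG audio frame sync (11 set bits).
    if head[:3] == b"ID3" or (head[0] == 0xFF and (head[1] & 0xE0) == 0xE0):
        return "MP3"
    return "unknown"
```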

IV. PROPOSED SYSTEM ARCHITECTURE

The proposed system consists of four modules, as depicted in Fig. 4. The first module identifies and reads the audio file, which is then processed by the feature extractor that generates the seven feature vectors. These feature vectors are used by the Classifier to generate predictions. The Classifier predictions are analyzed by the Decision Logic (DL), which generates the final decision regarding the genre of the processed audio file.

Fig. 4. Block Diagram of the proposed system.

As evident from Fig. 4, the proposed system has a parallel architecture; hence it can be implemented as a distributed system, which can ensure real-time processing of the data.

V. THE CLASSIFIER

Two different classifiers are considered in this paper: (1) a simple Multi-Layered Perceptron (MLP) network and (2) a Support Vector Machine (SVM) based classifier.

Artificial Neural Networks (ANNs) are used extensively for classification tasks due to their reliable and efficient classification performance, parallel computation potential and high adaptability [21]. Inspired by the human nervous system, ANNs perform extremely well on large and complex feature sets.

The Support Vector Machine (SVM) is a kernel-based classifier. The most general form of Support Vector classification separates two classes by a function induced from the available examples [22]. An SVM classifier finds an optimal separating hyperplane that maximizes the separation margin, i.e., the distance between the hyperplane and the nearest data point of each class. In this research three different kernel functions were considered. The linear kernel function is defined as

k(x, x') = x^T x'    (7)

where x and x' are data vectors.

The polynomial kernel function is defined as

k(x, x') = (x^T x' + t)^d    (8)

where t is a constant and d is the order of the polynomial. The Gaussian Radial Basis Function (G-RBF) kernel is defined as

k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)    (9)
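Written out directly, the three kernels of (7)-(9) are one-liners; a minimal numpy sketch:

```python
import numpy as np

def linear_kernel(x, xp):
    # Equation (7): inner product of the two data vectors.
    return x @ xp

def poly_kernel(x, xp, t=1.0, d=3):
    # Equation (8): t is a constant, d the polynomial order.
    return (x @ xp + t) ** d

def grbf_kernel(x, xp, sigma=1.0):
    # Equation (9): Gaussian radial basis function kernel.
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))
```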

VI. RESULTS

Using the architecture shown in Fig. 4, two different classifiers, based on SVM and MLP, were considered. SVMs are binary classifiers but can also be applied to multi-class problems. To implement a multi-class SVM, two different approaches are usually adopted: (i) consider the entire data in a single optimization formulation [23], or (ii) perform the multi-class classification using a combination of binary SVMs, by either the 'One Versus All' (OVA) method [22] or the 'One Versus One' (OVO) method [24]. The multi-class SVM used in this


paper is implemented using the LIBSVM [25] based OSU SVM Classifier Matlab Toolbox [26]. On each of the seven training feature sets, the multi-class OVO C-SVM was trained with the linear, polynomial and RBF kernels. The regularization parameters C and σ were determined by cross-validation on the training set: validation performance was measured by training on 50% of the training set and testing on the remaining 50%, and the C and σ that gave the best accuracy were selected. Typically the number of support vectors was in the range of 3000 to 9000, depending on the dimension of the feature set.
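A sketch of this training procedure using scikit-learn (which wraps LIBSVM) as a stand-in for the OSU toolbox; the grid values are illustrative, and gamma = 1/(2σ²) maps σ onto the library's RBF parameterization.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_ovo_svm(X_train, y_train):
    # Illustrative grids for C and gamma (i.e., sigma) selection.
    grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}
    # cv=2 approximates the paper's 50%/50% validation split.
    search = GridSearchCV(SVC(kernel="rbf", decision_function_shape="ovo"),
                          grid, cv=2)
    search.fit(X_train, y_train)
    return search.best_estimator_
```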

TABLE I
SUCCESS RATES IN PERCENTAGE FOR THE DIFFERENT KERNEL FUNCTIONS

Genre      | Linear | Polynomial | G-RBF
Classical  |  99.05 |      91.67 | 83.37
Hard Rock  |  51.31 |      25.12 | 58.33
Jazz       |  16.67 |      17.98 | 51.19
Pop        |   3.02 |       7.51 | 52.33
Rap        |  96.17 |      33.38 | 96.98
Soft Rock  |   2.11 |       1.56 | 11.33
Overall    |  44.72 |      29.54 | 58.93

As evident from Table I, the success rates were not consistent across the different genres, and the overall success rates are also very low. However, the average training times for the three SVM kernels were appreciably low: 0.28 s for the linear kernel, 1.98 s for the polynomial kernel and 1.57 s for the G-RBF kernel.

The MLP based classifier is a parallel neural architecture with 7 parallel branches (corresponding to the 7 feature vectors), as shown in Fig. 5. Each MLP has Ni input nodes (where Ni is the length of the ith feature vector fed to that specific MLP) and 3 output nodes. The decision regarding the ANN architecture was based upon two criteria, prediction accuracy and training time: the architecture providing the best prediction accuracy while requiring lower training time was selected. Each MLP has 3 hidden layers: the first hidden layer has 50 processing elements (PEs), the second has 60 PEs and the third has 40 PEs. The MLPs use a momentum learning rule with a momentum value of 0.9, the excitation function is the sigmoid, and the training termination rule is minimum mean square error.
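A sketch of one branch MLP using scikit-learn's MLPClassifier as a stand-in for the original implementation (it trains on log-loss rather than the paper's MSE criterion); fitting it with the 3-bit codes of Table II below as multi-label targets reproduces the 3-output structure.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def make_branch_mlp():
    # Hidden layers of 50, 60 and 40 sigmoid units; SGD with momentum 0.9,
    # matching the architecture described above.
    return MLPClassifier(hidden_layer_sizes=(50, 60, 40),
                         activation="logistic",
                         solver="sgd", momentum=0.9)

# Usage sketch: X has Ni columns (one branch's feature vector), and Y is
# an (n_samples, 3) binary matrix of the desired outputs from Table II.
# mlp = make_branch_mlp(); mlp.fit(X, Y)
```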

TABLE II
DESIRED OUTPUT FOR EACH GENRE

Class      | Desired Output
Classical  | 0 0 1
Hard Rock  | 0 1 0
Jazz       | 0 1 1
Pop        | 1 0 0
Rap        | 1 0 1
Soft Rock  | 1 1 0

Fig. 5. The Parallel Neural Architecture

The ANNs are trained with a desired output value for each genre. The desired outputs used are shown in Table II.

The prediction decisions obtained from each of the 7 parallel MLPs are vector-summed and then normalized by 7 (as there are 7 parallel MLPs) to obtain the final prediction value. The final prediction is processed by a Decision Logic (DL), which presents the final decision. The rule base of the DL is given in Table III, where X1, X2 and X3 are the values obtained after the vector addition, followed by normalization by 7, of the 3 outputs of the parallel ANN architecture.

TABLE III
RULE BASE FOR THE DECISION LOGIC

Parallel ANN output            | DL output (Genre)
X1 < 0.5, X2 < 0.5, X3 ≥ 0.5   | Classical
X1 < 0.5, X2 ≥ 0.5, X3 < 0.5   | Hard Rock
X1 < 0.5, X2 ≥ 0.5, X3 ≥ 0.5   | Jazz
X1 ≥ 0.5, X2 < 0.5, X3 < 0.5   | Pop
X1 ≥ 0.5, X2 < 0.5, X3 ≥ 0.5   | Rap
X1 ≥ 0.5, X2 ≥ 0.5, X3 < 0.5   | Soft Rock

All MLPs of the parallel branches were trained separately

in supervised training mode. Each trained MLP was also tested individually to analyze its performance on the different classes for the corresponding feature vector. Fig. 6 presents the histogram of the classification success rates for the different genres.
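A sketch of this decision logic, combining the codes of Table II with the thresholds of Table III:

```python
import numpy as np

GENRES = {(0, 0, 1): "Classical", (0, 1, 0): "Hard Rock",
          (0, 1, 1): "Jazz",      (1, 0, 0): "Pop",
          (1, 0, 1): "Rap",       (1, 1, 0): "Soft Rock"}

def decide(branch_outputs):
    # branch_outputs: seven 3-element prediction vectors, one per MLP.
    avg = np.sum(branch_outputs, axis=0) / 7.0   # (X1, X2, X3)
    code = tuple(int(v >= 0.5) for v in avg)     # threshold at 0.5
    # Codes (0,0,0) and (1,1,1) do not appear in Table III.
    return GENRES.get(code, "undecided")
```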

It was observed that certain feature vectors work very well for certain genres but not necessarily for others; for example, the feature vector V1, containing the signal amplitude envelope periodicity and beat information, may work very well for identifying Rap but not for other genres, such as Classical. The parallel MLP architecture addresses this issue: it overcomes the low classification accuracy of individual feature vectors by averaging the prediction results, yielding a more reliable


classification accuracy over all the different feature vectors.

Fig. 6. Success Rates for different Genres.

The other advantage of this parallel architecture is from the implementation point of view: the proposed system can be implemented on a distributed architecture and hence realized as a real-time system.

The overall success rate obtained from the MLP based system is 84.4%, which is much higher than the success rate of 58.93% obtained from the G-RBF SVM. The training times of the SVMs were very low compared to that of the MLP (which had an average training time of 10918 seconds); performance-wise, however, the MLPs were much better than the SVMs.

VII. CONCLUSION

A set of 7 feature sets is proposed which can be used in a parallel architecture to discern between six music genres. The obtained success rate of 84.4% is encouraging and comparable to previously proposed architectures. It was observed that certain feature sets worked extremely well in detecting certain specific genres. This can be exploited by implementing a tree-like structure with MLPs or any suitable classifier at each node, in which case the decision at each node would be made between one specific genre and the others. An important direction would be to detect voicing information, which can provide significant information regarding the genre; for example, in Classical music the voicing is often high-pitched compared to Rap and other genres. A relatively lower success rate was observed in distinguishing Pop from Soft-Rock and vice versa, which suggests that proper selection of a feature set that discriminates between them may help boost the success rate. Future research should address more genres to test the robustness of the proposed feature set, and other classifier types may be considered to improve the success rate.

REFERENCES

[1] MPEG Requirements Group, "MPEG-7: Overview of the MPEG-7 Standard", ISO/IEC JTC1/SC29/WG11 N3752, France, Oct. 1998.

[2] "MPEG-7: ISO/IEC 15938-5 Final Committee Draft - Information Technology - Multimedia Content Description Interface - Part 5: Multimedia Description Schemes", ISO/IEC JTC1/SC29/WG11 MPEG00/N3966, Singapore, May 2001.

[3] N. Scaringella, G. Zoia and D. Mlynek, “Automatic genre classification of music content: a survey”, IEEE Signal Processing Magazine, Mar 2006, Vol. 23, Iss. 2, pp. 133-141.

[4] D. Perrot and R. Gjerdingen, "Scanning the dial: An exploration of factors in the identification of musical style", Proc. of the Society for Music Perception and Cognition, 1999, p. 88.

[5] R. Dannenberg, J. Foote, G. Tzanetakis and C. Weare, "Panel: new directions in music information retrieval", Proc. of International Computer Music Conf., Havana, Cuba, September 2001.

[6] F. Pachet and D. Cazaly, "A taxonomy of musical genres", Proc. of Content-Based Multimedia Information Access, Paris, France, 2000.

[7] G. Tzanetakis, G. Essl and P. Cook, “Audio Analysis using the Discrete Wavelet Transform”, Proc. of WSES International Conf. Acoustics and Music: Theory and Applications, Greece, 2001.

[8] K. Kosina, “Music genre recognition”, Diploma thesis. Technical College of Hagenberg, Austria, 2002.

[9] M. McKinney and J. Breebaart, "Features for audio and music classification", Proc. of International Symposium on Music Information Retrieval, 2003, pp. 151-158.

[10] T. Blum, D. Keislar, J. Wheaton and E. Wold, "Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information", U.S. Patent 5,918,223, 1999.

[11] L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.

[12] B. Logan, “Mel Frequency Cepstral Coefficients for Music Modeling”, Proceedings of the International Symposium on Music Information Retrieval, 1st ISMIR, 2000.

[13] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals", IEEE Transactions on Speech and Audio Processing, Jul. 2002, Vol. 10, Iss. 5, pp. 293-302.

[14] M. Slaney, Auditory Toolbox for Matlab, Interval Research Corporation, Version 2.

[15] B. Atal and M. Schroeder, "Predictive coding of speech signals and subjective error criteria", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 27, Issue 3, pp. 247-254, June 1979.

[16] T. Kailath, "A view of three decades of linear filtering theory", IEEE Transactions on Information Theory, Vol. 20, Issue 2, pp. 146-181, March 1974.

[17] J. Markel and A. Gray, Linear Prediction of Speech, Communication and Cybernetics, Springer Verlag, June 1976.

[18] H. Wold, A Study in the Analysis of Stationary Time Series, Almquist and Wiksell, Stockholm, Sweden, 2nd edition, 1954.

[19] N. Wiener, Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications, Technology Press and John Wiley & Sons, Inc., New York, August, 1949.

[20] A. Graps, "An Introduction to Wavelets", IEEE Computational Science and Engineering, IEEE Computer Society, Vol. 2, No. 2, 1995.

[21] J.C. Principe, N.R. Euliano and W.C. Lefebvre, Neural and Adaptive Systems: Fundamentals through Simulations, John Wiley & Sons, Inc., February 29, 2000.

[22] V. Vapnik, The Nature of Statistical Learning Theory, Springer Publications, 1995.

[23] K. Crammer and Y. Singer, ‘On the algorithmic implementation of multiclass kernel-based vector machines’, Journal of Machine Learning Research, Vol. 2, pp. 265–292, 2001.

[24] U. Kreßel, "Pairwise classification and support vector machines", in Advances in Kernel Methods: Support Vector Learning, pp. 255-268, MIT Press, Cambridge, MA, 1999.

[25] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines", http://www.kernel-machines.org/, 2001.

[26] J. Ma, Y. Zhao and S. Ahalt, "OSU SVM Classifier Matlab Toolbox", http://www.kernel-machines.org/, 2002.

[27] V. Mitra and C.J. Wang, "Content Based Audio Classification: A Neural Network Approach", to appear in the Proceedings of the International Symposium on Neural Networks.

[28] V. Mitra, A Parallel Architecture for Subband Speech Coding, Thesis submitted at University of Denver, SECS, 2004.