Machine-Learning Based Classification of Speech and Music
Khan, MKS; Al-Khatib, WG
Springer, Multimedia Systems; Vol. 12; pp. 55-67
King Fahd University of Petroleum & Minerals
http://www.kfupm.edu.sa
Summary
The need to classify audio into categories such as speech or music is an important
aspect of many multimedia document retrieval systems. In this paper, we investigate
audio features that have not previously been used in music-speech classification, such
as the mean and variance of the discrete wavelet transform, the variance of Mel-frequency
cepstral coefficients, the root mean square of a lowpass-filtered signal, and the
difference between the maximum and minimum zero-crossing counts. We then apply fuzzy
C-means clustering to the problem of selecting a viable set of features that enables
better classification accuracy. Three different classification frameworks have been
studied: multi-layer perceptron (MLP) neural networks, radial basis function (RBF)
neural networks, and hidden Markov models (HMM); the results of each framework have
been reported and compared. Our extensive experimentation has identified a subset of
features that contributes most to accurate classification, and has shown that MLP
networks are the most suitable classification framework for the problem at hand.
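A minimal sketch of how the feature set named above might be extracted, assuming Python with NumPy, PyWavelets, librosa, and SciPy. The wavelet family, decomposition level, frame length, and lowpass cutoff below are illustrative assumptions; this summary does not specify the values used in the paper.

    # Sketch of the four feature groups named in the summary. The wavelet family
    # (db4), decomposition level (3), 20 ms frame length, and 1 kHz lowpass cutoff
    # are assumptions for illustration only, not values taken from the paper.
    import numpy as np
    import pywt
    import librosa
    from scipy.signal import butter, filtfilt

    def speech_music_features(y, sr, frame_sec=0.02, cutoff_hz=1000.0):
        """Return a per-clip feature vector for audio y sampled at sr Hz."""
        hop = int(frame_sec * sr)

        # 1) Mean and variance of the discrete wavelet transform coefficients.
        coeffs = np.concatenate(pywt.wavedec(y, "db4", level=3))
        dwt_mean, dwt_var = coeffs.mean(), coeffs.var()

        # 2) Variance of Mel-frequency cepstral coefficients across frames.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        mfcc_var = mfcc.var(axis=1)          # one variance per coefficient

        # 3) Root mean square of a lowpass-filtered version of the signal
        #    (4th-order Butterworth, assumed 1 kHz cutoff).
        b, a = butter(4, cutoff_hz / (sr / 2.0), btype="low")
        rms_low = np.sqrt(np.mean(filtfilt(b, a, y) ** 2))

        # 4) Difference between the maximum and minimum per-frame
        #    zero-crossing counts over the clip.
        zcr = librosa.feature.zero_crossing_rate(y, frame_length=hop, hop_length=hop)[0]
        zc_counts = zcr * hop
        zc_range = zc_counts.max() - zc_counts.min()

        return np.hstack([dwt_mean, dwt_var, mfcc_var, rms_low, zc_range])

Per-clip vectors produced this way could then be screened with fuzzy C-means clustering and passed to an MLP, RBF, or HMM classifier, which is the comparison the paper reports.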
For pre-prints please write to: [email protected]
Copyright: King Fahd University of Petroleum & Minerals; http://www.kfupm.edu.sa