


Machine-Learning Based Classification Of Speech And Music

Khan, MKS; Al-Khatib, WG

SPRINGER, MULTIMEDIA SYSTEMS; pp: 55-67; Vol: 12

King Fahd University of Petroleum & Minerals

http://www.kfupm.edu.sa

Summary

The need to classify audio into categories such as speech or music is an important aspect of many multimedia document retrieval systems. In this paper, we investigate audio features that have not previously been used in music-speech classification, such as the mean and variance of the discrete wavelet transform, the variance of Mel-frequency cepstral coefficients, the root mean square of a lowpass signal, and the difference between the maximum and minimum zero-crossings. We then apply fuzzy C-means clustering to the problem of selecting a viable set of features that enables better classification accuracy. Three classification frameworks have been studied: Multi-Layer Perceptron (MLP) neural networks, Radial Basis Function (RBF) neural networks, and Hidden Markov Models (HMMs); the results of each framework have been reported and compared. Our extensive experimentation has identified a subset of features that contributes most to accurate classification and has shown that MLP networks are the most suitable classification framework for the problem at hand.
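Two of the time-domain features named in the summary, the root mean square of a lowpass signal and the difference of the maximum and minimum zero-crossing counts, can be sketched in pure Python. This is a minimal illustration, not the paper's implementation; the frame length, hop size, and the use of a simple moving average as the lowpass filter are assumptions made here for clarity.

```python
import math

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def zero_crossings(frame):
    """Count sign changes within one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)

def lowpass_rms(x, k=8):
    """RMS of a crudely lowpass-filtered signal (k-point moving average)."""
    smoothed = [sum(x[i:i + k]) / k for i in range(len(x) - k + 1)]
    return math.sqrt(sum(s * s for s in smoothed) / len(smoothed))

def zcr_spread(x, frame_len=256, hop=128):
    """Difference of the maximum and minimum per-frame zero-crossing counts."""
    counts = [zero_crossings(f) for f in frame_signal(x, frame_len, hop)]
    return max(counts) - min(counts)
```

The intuition behind the zero-crossing spread is that speech alternates between voiced segments (few crossings) and unvoiced fricatives (many crossings), so its per-frame counts vary widely, whereas music tends to produce a narrower spread.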

References:
1. BEIERHOLM T, 2004, P 17 INT C PATT REC, V2, P379
2. BEZDEK JC, 1981, PATTERN RECOGNITION
3. BUGATTI A, 2002, EURASIP J APPL SIG P, V4, P372
4. CAREY MJ, 1999, P IEEE INT C AC SPEE, V1, P149
5. CHOU W, 2001, P ICASSP 01 SALT LAK, V2, P865
6. CYBENKO G, 1989, MATH CONTROL SIGNAL, V2, P303
7. DELFS C, 1998, P INT C AC SPEECH SI, V3, P1569
8. DUDA RO, 2001, PATTERN CLASSIFICATI
9. ELMALEH K, 2000, P ICASSP2000 JUN, V4, P2445
10. HARB H, 2001, P 7 INT C DISTR MULT, P257
11. HARB H, 2003, P 7 INT C SIGN PROC, V2, P125
12. HOYT JD, 1994, P INT C NEUR NETW IE, V7, P4493

Copyright: King Fahd University of Petroleum & Minerals; http://www.kfupm.edu.sa


13. KARNEBACK S, 2001, P EUR C SPEECH COMM, P1891
14. KHAN MKS, 2005, THESIS KING FAHD U P
15. LAMBROU T, 1998, P INT C AC SPEECH SI, V6, P3621
16. LI DG, 2001, PATTERN RECOGN LETT, V22, P533
17. LIPP OV, 2004, EMOTION, V4, P233, DOI 10.1037/1528-3542.4.3.233
18. LU L, 2001, P 9 ACM INT C MULT, P203
19. LU L, 2002, IEEE T SPEECH AUDI P, V10, P504, DOI 10.1109/TSA.2002.804546
20. LU L, 2003, ACM MULTIMEDIA SYSTE, V8, P482
21. MAMMONE RJ, 1994, ARTIFICIAL NEURAL NE
22. PANAGIOTAKIS C, 2004, IEEE T MULTIMEDIA
23. PARRIS ES, 1999, P EUROSPEECH 99 BUD, P2191
24. PELTONEN V, 2001, THESIS TAMPERE U TEC
25. PINQUIER J, 2002, P ICSLP 02, V3, P2005
26. PINQUIER J, 2002, P INT C AC SPEECH SI, V4, P4164
27. PINQUIER J, 2003, P INT C AC SPEECH SI, V2, P17
28. RABINER LR, 1986, IEEE ASSP MAG, V3, P4
29. SAAD EM, 2002, P 19 NAT RAD SCI C N, P208
30. SAUNDERS J, 1996, P INT C AC SPEECH SI, V2, P993
31. SCHEIRER E, 1997, P ICASSP 97, V2, P1331
32. SHAO X, 2003, P 4 INT C INF COMM S, V3, P1823
33. SRINIVASAN SH, 2004, P INT C AC SPEECH SI, V4, P321
34. TZANETAKIS G, 1999, EUROMICRO WORKSH MUS, V2, P61
35. TZANETAKIS G, 2001, P INT S MUS INF RETR, P205
36. TZANETAKIS G, 2002, IEEE T SPEECH AUDI P, V10, P293
37. WANG WQ, 2003, P INF COMM SIGN PROC, V3, P1325

For pre-prints please write to: [email protected]
