
Learning Techniques for Identifying Vocal Regions in Music Using the Wavelet Transformation, Version 1.0

A Thesis

submitted to the Faculty of the

Graduate School of Arts and Sciences

of Georgetown University

in partial fulfillment of the requirements for the

degree of

Master of Science

in Computer Science

By

Michael J. Henry, B.S.

Washington, DC

February 14, 2011


Copyright © 2011 by Michael J. Henry

All Rights Reserved


Learning Techniques for Identifying Vocal Regions in Music Using the Wavelet Transformation, Version 1.0

Michael J. Henry, B.S.

Thesis Advisor: Marcus A. Maloof

Abstract

In this research I present a machine learning method for the automatic detection of vocal regions in music. I employ the wavelet transformation to extract wavelet coefficients, from which I build feature sets capable of constructing a model that can distinguish between regions of a song that contain vocals and those that are purely instrumental. Singing voice detection is an important aspect of the broader field of Music Information Retrieval, and efficient vocal region detection facilitates further research in other areas such as singer identification, genre classification and the management of large music databases. As such, it is important for researchers to be able to detect automatically and accurately which sections of music contain vocals and which do not. Previous methods that used features such as the popular Mel-Frequency Cepstral Coefficients (MFCC) have several disadvantages when analyzing signals in the time-frequency domain that the wavelet transformation can overcome. The models constructed by applying the wavelet transformation to a windowed music signal produce a classification accuracy of 86.66%, 11% higher than models built using MFCCs. Additionally, I show that applying a decision tree algorithm to the vocal region detection problem produces a more accurate model than other, more widely applied learning algorithms, such as Support Vector Machines.

Index words: Vocal Region Detection, Machine Learning, MFCC, Wavelet Transform, SVM, Naïve Bayes, Decision Tree, Thesis (academic)


Dedication

For my wife, Stacy, in appreciation for her endless support and patience.


Acknowledgments

I would like to thank my thesis advisor, Dr. Marcus A. Maloof, for his tireless effort in helping me complete this thesis, and for his support in encouraging me to pursue an interest of mine. This research would not have been possible without his support over the past three years. Also, I would like to thank my committee members, Dr. Lisa Singh and Dr. Ophir Frieder, for inspiring my interest in Machine Learning and for offering continuous guidance and support, without which this would not be possible. I would also like to thank Dr. Brian Blake for allowing me the opportunity to get here.


Table of Contents

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Chapter

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
    2.1 Music File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
    2.2 Audio Signal Transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
    2.3 The Daubechies Wavelet Family . . . . . . . . . . . . . . . . . . . . . . . . . . 13
    2.4 Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Review of Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4 A Method for Detecting Vocal Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
    5.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
    5.2 Experiment 1: Measuring Overall Performance . . . . . . . . . . . . . . . . . . 36
    5.3 Experiment 2: Measuring Performance Across Different Artists . . . . . . . . . 41
    5.4 Experiment 3: Measuring Performance Across Gender . . . . . . . . . . . . . . 43
    5.5 Experiment 4: Measuring Overall Performance Across Groups of Artists . . . . 44
    5.6 Experiment 5: Measuring Performance Across Different Groups of Artists . . . 45

6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
    6.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
    6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
    6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Appendix

A Complete Experiment 1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

B Complete Experiment 2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

C Complete Experiment 3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

D Complete Experiment 4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

E Complete Experiment 5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99


List of Figures

2.1 Spectrogram Progression of Alternating Vocal and Instrumental Regions in a Queen Song, from Left to Right Respectively and Top to Bottom . . . . . . . . . . 9
2.2 Binary Tree Representation of the Discrete Wavelet Transformation Filter Bank . . . . . . . . . . 13
2.3 Wavelet Coefficients Extracted from a Progression of Alternating Vocal and Instrumental Regions in a Queen Song, from Left to Right Respectively and Top to Bottom . . . . . . . . . . 14
2.4 Wavelet (Left) and Scaling (Right) Functions of the Four Daubechies Wavelets used in this Thesis. Magnitude is shown along the y-axis and support length is shown along the x-axis . . . . . . . . . . 16
4.1 The Training Phase of the Presented Methodology . . . . . . . . . . 27
5.1 Experimental Method . . . . . . . . . . 32
5.2 Transcribe! window with Aerosmith's "Dream On" open . . . . . . . . . . 34


List of Tables

5.1 Experiment 1 Results for Vocal Region Detection Using MFCC Features on Individual Frames from Single Artists . . . . . . . . . . 37
5.2 Experiment 1 Results for Vocal Region Detection Using MFDWC Features on Individual Frames from Single Artists . . . . . . . . . . 38
5.3 Experiment 1 Results for Vocal Region Detection Using Wavelet Energy Features on Individual Frames from Single Artists . . . . . . . . . . 39
5.4 Experiment 2 Results for Vocal Region Detection on Individual Frames from Different Individual Artists . . . . . . . . . . 42
5.5 Experiment 3 Results for Gender-Based Vocal Region Detection at a Sampling Rate of 20% . . . . . . . . . . 44
5.6 Experiment 4 Results for Vocal Region Detection on Frames From Multiple Artists at a Sampling Rate of 40% . . . . . . . . . . 45
5.7 Experiment 5 Results for Vocal Region Detection on Individual Frames from Multiple, Different Artists using a 32% Sampling Rate, Average Over 10 Runs . . . . . . . . . . 46
A.1 Complete Experiment 1 Results Using MFCC Features . . . . . . . . . . 54
A.2 Complete Experiment 1 Results Using MFDWC Features and Naïve Bayes . . . . . . . . . . 57
A.3 Complete Experiment 1 Results Using MFDWC Features and J48 . . . . . . . . . . 60
A.4 Complete Experiment 1 Results Using MFDWC Features and SVM . . . . . . . . . . 63
A.5 Complete Experiment 1 Results Using MFDWC Features and JRip . . . . . . . . . . 66
A.6 Complete Experiment 1 Results Using MFDWC Features and IBk . . . . . . . . . . 69
A.7 Complete Experiment 1 Results Using Wavelet Energy Features and Naïve Bayes . . . . . . . . . . 72
A.8 Complete Experiment 1 Results Using Wavelet Energy Features and J48 . . . . . . . . . . 75
A.9 Complete Experiment 1 Results Using Wavelet Energy Features and SVM . . . . . . . . . . 78
A.10 Complete Experiment 1 Results Using Wavelet Energy Features and JRip . . . . . . . . . . 81
A.11 Complete Experiment 1 Results Using Wavelet Energy Features and IBk . . . . . . . . . . 84
B.1 Complete Experiment 2 Results using MFCC Features . . . . . . . . . . 88
B.2 Complete Experiment 2 Results using MFDWC Features (Training set in rows; test set in columns) . . . . . . . . . . 89
B.3 Complete Experiment 2 Results using Wavelet Energy Features (Training set in rows; test set in columns) . . . . . . . . . . 90
C.1 Complete Experiment 3 Results using MFCC Features at 2% Incremental Sampling Rates . . . . . . . . . . 92
C.2 Complete Experiment 3 Results using MFDWC Features at 2% Incremental Sampling Rates . . . . . . . . . . 92
C.3 Complete Experiment 3 Results using Wavelet Energy Features at 2% Incremental Sampling Rates . . . . . . . . . . 92
D.1 Complete Experiment 4 Results using MFCC Features at 2% Incremental Sampling Rates . . . . . . . . . . 94
D.2 Complete Experiment 4 Results using MFDWC Features at 2% Incremental Sampling Rates . . . . . . . . . . 94
D.3 Complete Experiment 4 Results using Wavelet Energy Features at 2% Incremental Sampling Rates . . . . . . . . . . 94


E.1 Complete Experiment 5 Results at 2% Incremental Sampling Rates using MFCC Features . . . . . . . . . . 96
E.2 Complete Experiment 5 Results at 2% Incremental Sampling Rates using MFDWC Features . . . . . . . . . . 97
E.3 Complete Experiment 5 Results at 2% Incremental Sampling Rates using Wavelet Energy Features . . . . . . . . . . 98


Chapter 1

Introduction

The prevalence of digital music in the Internet Age has created a need for music distributors and content creators to develop and deploy systems capable of managing enormous libraries of music. That need contributed to the birth of a subfield of Information Retrieval, referred to as Music Information Retrieval (MIR), established with the goal of finding innovative ways to store, manage and retrieve information from music. MIR encompasses a broad range of research topics, including singing voice identification [1, 16, 19, 39], source separation [29], query-by-humming [9], acoustic fingerprinting [42] and copyright protection [40].

Western popular music is typically structured in a similar fashion across most genres. A primary singing voice, occasionally accompanied by one or more backup or supplementary singers, is typically featured over a number of instrument tracks, which can include such popular instruments as guitar, bass, piano or drums. In many of the MIR research areas mentioned above, most notably singing voice identification, any application that wishes to perform its stated task must first be able to detect the singing voice before performing further analysis.

Modern digital music, however, makes that task difficult. Most digital music available is encoded using the MPEG-1 Audio Layer 3 (MP3) format [13], or the increasingly popular MPEG-4 (MP4, M4A) format [14]. The MP3 encoding format uses a patented lossy data compression algorithm [13]. Audio compression is used in digital music as a way to sharply reduce the storage space required by a digital music file. For MIR researchers, however, this poses a problem. A song encoded using the MP3 format has each of the instrument and vocal channels compressed down to two channels [13], making it difficult to separate an individual contributor from the rest of the audio.

For instance, researchers interested in automatic singer identification must first isolate the vocal regions in the song [1, 16, 19, 39]. In order to properly develop a singer identification method, the researchers must first preprocess their data using a reliable vocal region detection system. These vocal region detection systems present their own set of problems and considerations independent of the underlying task. Fortunately, there are some parallels that can be drawn between singing voice recognition research, which is relatively new, and speaking voice recognition, which is a more mature area of research. But while those similarities do exist, there are special considerations that must be taken into account when approaching a vocal identification task, as opposed to a speech identification task. The singing voice and the speaking voice contain different levels of voiced components (sounds generated by phonation); the singing voice covers a wider range of timbre and frequency variation; and noise in the form of background instrumentation is a much more significant factor in vocal region detection than noise considerations are in speech identification.

These differences are significant enough to motivate MIR research that branches away from research done in the speech discrimination domain. This research most often focuses on exploiting those stated differences between the singing voice and the speaking voice [16], as well as differences between the human voice and instruments, which will enable a model to distinguish between the two.

The current approach most often found in today's literature involves utilizing the Short-Time Fourier Transform (STFT) in the calculation of Mel-Frequency Cepstral Coefficients (MFCC), and using that representation of a signal's power spectrum as the features in a vocal region detection model. MFCCs have wide applicability in both the speech and music analysis domains, and are frequently found in the literature for both research areas [11, 21, 26, 31, 34, 40, 45]. However, there are some limitations to using MFCC features when building a vocal region model. The first is that the Short-Time Fourier Transform is well adapted for stationary signals (signals that repeat with the same periodicity into infinity), whereas music does not show the same periodic behavior. Additionally, Fourier analysis only permits analysis at a single resolution, and once the Fourier Transform is applied to a signal, good time resolution is lost. Therefore, applying the Fourier Transformation to the entire signal produces a single resolution for the entire signal, and applying the STFT produces a single resolution for each portion of the windowed signal. The MFCC algorithm also utilizes the Discrete Cosine Transformation (DCT) in its calculation. While the DCT has demonstrated a wide range of functionality in signal analysis, the coefficients produced as a result of applying the DCT are not well localized; thus, they are susceptible to noise and corruption, negatively affecting the performance of MFCC features in a model where the data is noisy.

The Wavelet Transformation offers a means of overcoming those limitations present in the current research. The Wavelet Transformation gives researchers the ability to analyze a non-stationary signal at a number of different resolutions. In addition, the Wavelet Transformation produces a series of localized wavelet coefficients that are representative of the signal in the time-frequency domain. The wavelet family employed in this research, the Daubechies family, provides minimum support for a given number of vanishing moments, which is helpful in detecting abrupt transitions in the signal, producing coefficients that better capture the distinguishing characteristics of the signal.

As I show in this thesis, the Wavelet Transformation can be incorporated into a number of different feature vectors and can sharply outperform MFCC features when building a vocal region detection system. I show that features built using the Discrete Wavelet Transformation outperform MFCC features by up to 11%.

The purpose of this thesis is to first examine previous research in the area of vocal region detection in an attempt to identify potential limitations that exist within the current body of research. Mitigating those limitations will produce a better vocal region model, which in turn can be used in other MIR applications to help produce better results.

In this research I make four contributions to vocal region detection research. First, I propose the application of two different features with proven applicability in other signal analysis research [8, 11] to the vocal region detection problem. The first is an energy feature vector calculated by using the Discrete Wavelet Transformation (DWT), utilized in [8], which showed promise in building a more accurate model than one built using MFCCs. The second feature vector substitutes the DWT for the Discrete Cosine Transformation in the standard MFCC algorithm [11].

Second, I annotate and use a standard benchmark music corpus, consisting of a wide variety of popular music songs in their entirety, in response to the wide variety of data sets used in the MIR literature. The annotations used in this research will be made available for other researchers to use in vocal region detection tasks.

Third, I present a methodology based on the standard machine learning research framework, adapted to utilize the Discrete Wavelet Transformation. This methodology can be generalized to accommodate a number of different data sets, wavelet families and learning algorithms.

Finally, I evaluate these features with five machine learning algorithms in an attempt to determine which algorithm performs best for the vocal region detection problem. My results are compared to those generated by using the standard MFCC algorithm, and I show that MFCC features are outperformed by feature vectors built using wavelet coefficients, particularly the features built using the wavelet energy algorithms found in [8]. I show that for each of my five experiments, the model built with Wavelet Energy features has a higher accuracy than a model built using MFCC features. Additionally, my results show that a decision tree algorithm is the best learning method to use when attempting to identify vocal regions in music, outperforming the other tested algorithms, which include Support Vector Machines, a popular classification algorithm in MIR applications.

This thesis is organized as follows. First, background information is presented on music file formats, the audio transformations used in this research, the Daubechies wavelet family and the learning algorithms employed in this thesis. Next is a discussion of the previous research in the area of vocal region detection and the limitations found in that research. That is followed by a description of the methodology developed for this research, including a discussion of the preprocessing steps, feature extraction and utilization of learning algorithms. Following that is a discussion of the experimental setup of this research, including a discussion of the data used, as well as a description of each of the five tests that were run. Finally, results are presented, followed by a discussion of the results and the conclusions that can be drawn, motivating follow-on research. The unabridged results are presented in the Appendix.


Chapter 2

Background

Before discussing the details of this thesis it is important to give background information on the music file format, audio signal transforms, machine learning algorithms and wavelet family used in this thesis. Often overlooked, the format in which an audio recording is encoded can have a significant impact on which audio information is available to analyze. Likewise, each signal transformation has specific properties and offers different insight into the input signal. From a machine learning standpoint, the choice of algorithm has a similar effect. Each machine learning algorithm makes different assumptions and performs better on different types of data. Thus, each algorithm is explored in an effort to better evaluate it with the data in this thesis. Additionally, each wavelet family is designed to offer different attractive properties, based on the problems to which it is applied. Understanding the properties of the wavelet family used in this research offers insight into its performance.

2.1 Music File Formats

The file format of a music file plays an important role in an audio analysis task. Raw music from an audio CD has a sampling rate of 44.1 kHz with a bit rate of 1,411.2 kbps. The bit rate of an audio file determines how much space on disk that audio file uses. Due to the large bit rate, CD-quality music files are quite large and can be cumbersome when attempting to store, share or process them. These issues became the motivation behind the research that would eventually lead to the MPEG-1 standard.

While there exist a number of music file formats, including Microsoft's proprietary Windows Media format, Vorbis and AAC (the latter two being open source standards), MPEG-1 Part 3 (commonly referred to as MP3) has become very popular, as it was the file format of choice during the early days of peer-to-peer file sharing, and it is used by large online music retailers like Amazon.com. Encoding files in the MP3 format allows an individual to compress a song down to a tenth or less of its original size, making it much easier to store a large database of music files as well as share those files across a network.

The ability to shrink a music file down to a fraction of its original size while retaining a close approximation of the original sound quality is due to the compression algorithm used by the MP3 encoder. The bit rate of an MP3 encoder is not explicitly set by the MPEG-1 standard, but the bit rate most commonly used is 128 kbps, which is high enough that the common user will be unable to detect any degradation in quality while remaining small enough to yield a much smaller file size. Comparing the common bit rate of an MP3 to the bit rate of a CD yields a ratio of 128:1411.2, or approximately 1:11. In order to achieve such a ratio, the MP3 encoding algorithm relies on a psychoacoustic model known as perceptual coding [13] in order to compress the file. Perceptual coding involves discarding or reducing the components of an audio signal that the human ear either cannot perceive or perceives only faintly. This allows the encoding algorithm to discard much of the information in an audio file that is of little or no importance to the listener while keeping the relevant information and encoding it efficiently. Additionally, the MPEG-1 standard allows an encoder to take advantage of variable bit rates by performing bit rate switching on a frame-by-frame basis. This allows for further compression by using a lower bit rate for less important data and a higher bit rate (and thus more storage space) for more complicated and important audio components.
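As a sanity check on the figures quoted above, the bit-rate arithmetic can be reproduced directly. The brief Python sketch below assumes the standard audio-CD parameters of 16 bits per sample and two channels, which are not stated explicitly in this section; it is only an illustration.

    # Reproduce the CD vs. MP3 bit-rate comparison quoted above.
    # Assumes standard audio-CD parameters: 44.1 kHz sampling, 16 bits/sample, 2 channels.
    sampling_rate_hz = 44_100
    bits_per_sample = 16
    channels = 2

    cd_bitrate_kbps = sampling_rate_hz * bits_per_sample * channels / 1000   # 1411.2 kbps
    mp3_bitrate_kbps = 128

    print(f"CD bit rate: {cd_bitrate_kbps} kbps")
    print(f"Compression ratio: 1:{cd_bitrate_kbps / mp3_bitrate_kbps:.1f}")  # roughly 1:11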

While the compression algorithm used by MP3 encoders is very efficient and capable of storing large audio files in a more compact fashion, there are some drawbacks from a research standpoint. First, the encoding algorithms are patented, and the patent issues surrounding the MP3 format are complicated. The MP3 format is part of an ISO standard, which is freely available for interested parties to seek out and implement, which many have done, leading to a variety of similar yet possibly slightly different MP3 encoding implementations, so researchers must depend on the reliability of the particular encoding/decoding software they choose to use. Numerous companies own patents on certain MP3 encoding implementations, as well as on other pieces that surround the MP3 algorithm. The patent situation is a convoluted issue, and researchers must keep in mind all applicable patent restrictions when conducting their research.

Secondly, the compression algorithm used by MP3 is lossy. This means that the algorithm throws out data while performing its compression, data that cannot be retrieved later. This presents a problem because information is removed to shrink the size of the audio file. This information is irrelevant from the standpoint of the listener, but it might be relevant from the standpoint of an algorithm attempting to deduce identifying information from that audio signal. Additionally, whereas raw recorded audio typically contains a separate channel for each input into the signal (guitars, bass, drums, vocals, etc.), those channels are "mixed" together in formats such as MP3. This prevents a researcher from automatically extracting the audio components that they are interested in, instead forcing them to develop methods (statistical learning methods or otherwise) that will allow them to isolate a particular component of interest from the original signal.

Despite the mentioned drawbacks, it is important for audio researchers to use the MP3 file format in their research due to its popularity. While newer media containers such as MP4, based on Apple's QuickTime container format [14], have grown in popularity in recent years, MP3 remains the de facto standard, as many digital music sellers, such as Amazon.com, continue to distribute their digital music in the MP3 format.

2.2 Audio Signal Transforms

When analyzing audio, it is often useful to transform the signal from one domain to another in order to extract information from it that would not be present in the signal in its original form. Often a transform is applied to obtain frequency information, as with the Fourier Transformation [5] and the Wavelet Transformation [23], or in order to gain a compact representation of a signal for compression purposes, such as the Discrete Cosine Transformation. In the following sections, I will discuss four transformations of interest: the Fourier Transformation, the Discrete Cosine Transformation, the Continuous Wavelet Transformation and the Discrete Wavelet Transformation.

2.2.1 The Fourier Transformation

The Fourier Transform (FT) is one of the most widely used transformations in signal analysis [5]. The FT is an operation that produces a frequency domain representation of an input signal that is in the time domain.

The Fourier Transform is defined [23] as the following:

F(\tau) = \int f(t)\, e^{-i t \tau}\, dt \qquad (2.1)

where \tau is the frequency. The inverse Fourier Transform is defined [23] as the following:

f(t) = \frac{1}{2\pi} \int F(\tau)\, e^{i t \tau}\, d\tau \qquad (2.2)

The FT is derived from the Fourier series, where periodic functions can be written as a sum of sines and cosines. Expanding the input signal into a series of sines and cosines gives the frequency content, over all time, of the input signal.
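For a sampled signal, the integral in (2.1) is computed in practice with the discrete Fourier transform. The short Python sketch below is only an illustration of that idea on a synthetic two-tone signal; NumPy is assumed, and nothing here is part of the thesis pipeline.

    import numpy as np

    # Discrete analogue of (2.1): use the FFT to expose the frequency content of a
    # synthetic two-tone signal.
    fs = 8_000                                # sampling rate in Hz (arbitrary)
    t = np.arange(0, 1.0, 1.0 / fs)           # one second of samples
    x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

    spectrum = np.fft.rfft(x)                 # one-sided DFT of the real-valued signal
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)

    # The two largest-magnitude bins recover the 220 Hz and 880 Hz components.
    peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]
    print(np.sort(peaks))                     # [220. 880.]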

The FT assumes a stationary signal that is the same throughout time, making it difficult to apply the FT to a non-stationary, time-varying signal. As a means of overcoming that limitation, the Short-Time Fourier Transform (STFT) was developed, introducing the concept of time to Fourier analysis. The STFT works by first windowing an input signal so that the resulting output is zero outside of the range permitted by the windowing function. The FT is then computed on that window. That window is shifted along the length of the signal and the FT is calculated at each position.

From the STFT we can generate a spectrogram, which is a time-variant spectral image representing how the energy of a signal is distributed with frequency. Figure 2.1 shows the spectrograms of six sequential segments (alternating vocal and non-vocal) of a Queen song, "Death on Two Legs". These spectrograms were generated with a window size of 256 samples, with a 128-sample overlap. As you can see, the frequency content is different when comparing a vocal segment to a non-vocal segment. The spectrogram shows time along the horizontal axis, frequency along the vertical axis, and intensity (the amplitude of a particular frequency at a given time) represented by a color scale. The maximum amplitude of the frequency content for a given window is different for vocal and non-vocal regions, as illustrated by the circles on the first two images in the series. Viewing the spectrogram gives us a summary view of the frequency content of a signal, and how that frequency content changes over time. As you can see, there is a noticeable difference between vocal regions and non-vocal regions. Therein lies the motivation behind applying frequency analysis techniques to MIR research.
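The spectrogram computation described above (256-sample windows with a 128-sample overlap) can be sketched as follows. A synthetic chirp stands in for the song excerpt of Figure 2.1, and SciPy is assumed; this is an illustration only, since the audio data itself is not distributed with this thesis.

    import numpy as np
    from scipy import signal

    # Spectrogram with the same windowing parameters as Figure 2.1: 256-sample
    # windows, 128-sample overlap. A linear chirp is used as a stand-in signal.
    fs = 22_050
    t = np.arange(0, 2.0, 1.0 / fs)
    x = signal.chirp(t, f0=200, t1=2.0, f1=4_000)      # sweep from 200 Hz to 4 kHz

    freqs, times, Sxx = signal.spectrogram(x, fs=fs, nperseg=256, noverlap=128)

    # Sxx[i, j] is the power at frequency freqs[i] and time times[j]; plotting
    # 10*log10(Sxx) over (times, freqs) yields an image like those in Figure 2.1.
    print(Sxx.shape)                                    # (frequency bins, time frames)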

The STFT can be applied to mitigate the aforementioned limitations of the Fourier Transform, but windowing an input signal does not completely solve the time resolution problem. The windows used in the STFT are of fixed length, therefore the time resolution of STFT analysis is fixed as well. Given the non-stationary property of music signals, a fixed-length window is insufficient to capture all relevant information. Therefore, while a lot of important information can be gained from the STFT, the fixed resolution is a limitation of STFT analysis, a shortcoming that can be overcome by the wavelet transformation.


Figure 2.1: Spectrogram Progression of Alternating Vocal and Instrumental Regions in a Queen Song, from Left to Right Respectively and Top to Bottom


2.2.2 The Discrete Cosine Transformation

The Discrete Cosine Transformation (DCT) is employed in the calculation of Mel-Frequency Cepstral Coefficients (MFCC), providing a compact representation of a cosine series expansion of the filterbank energies [6]. The resulting cepstrum coefficients are the MFCC features that can be used as features in an audio classification task.

The DCT is related to the Fourier series expansion utilized by the DFT, except that instead of decomposing a signal into a series of both cosines and sines, the DCT only expresses a signal as a sum of cosines [32]. This is due to the fact that the DCT enforces different boundary conditions than the DFT [32].

There are several variations of the DCT, but the most common is DCT-II, which is used in a number of signal analysis applications, such as the calculation of MFCC [6]. In the calculation of DCT-II, N real-valued numbers x_0, x_1, \ldots, x_{N-1} are decomposed into N real-valued numbers X_0, X_1, \ldots, X_{N-1} by (2.3):

X_k = \sum_{n=0}^{N-1} x_n \cos\!\left(\frac{\pi k}{N}\left(n + 0.5\right)\right), \quad k = 0, 1, \ldots, N-1 \qquad (2.3)
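Equation (2.3) can be evaluated directly, which is a useful check when relating it to library routines; the sketch below compares a literal implementation with SciPy's DCT-II, which differs from (2.3) only by a constant factor of two in its unnormalized form. The use of NumPy and SciPy here is an assumption for illustration, not part of the thesis implementation.

    import numpy as np
    from scipy.fft import dct

    def dct2(x):
        """Literal evaluation of (2.3): X_k = sum_n x_n cos(pi*k*(n + 0.5)/N)."""
        x = np.asarray(x, dtype=float)
        N = len(x)
        n = np.arange(N)
        return np.array([np.sum(x * np.cos(np.pi * k * (n + 0.5) / N)) for k in range(N)])

    x = np.random.default_rng(0).standard_normal(8)

    # SciPy's unnormalized DCT-II is exactly twice the sum in (2.3).
    assert np.allclose(dct2(x), dct(x, type=2) / 2)
    print(dct2(x).round(3))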

2.2.3 The Continuous Wavelet Transformation

The motivation behind the Wavelet Transform grew out of a need to analyze non-stationary signals, a limitation of other analysis techniques, such as Fourier analysis [23]. Whereas the standard Fourier Transform converts a time-domain signal to its frequency-domain representation (losing good time resolution in the process), the STFT applies a fixed-length window to the input signal and subsequently applies the Fourier Transform to the windowed signal in an attempt to compensate for that limitation and track the frequency changes over time. While applying the STFT gives researchers a way of introducing an element of time to their analysis, the fixed-length window that is applied introduces a time/frequency resolution trade-off. A shorter window gives excellent time resolution, but the frequency resolution is degraded. Applying a longer window improves the frequency resolution, but the time resolution suffers as a result. While the STFT has a fixed resolution, the Wavelet Transform has variable resolution over a number of scales, thus giving good time resolution for the high frequencies and good frequency resolution for the low frequencies [23].

In order to produce a time-frequency representation of a signal, the signal must be cut into pieces and analyzed individually. However, the Heisenberg Uncertainty Principle stipulates that, in the domain of signal processing, it is impossible to know with absolute certainty the point on the time-frequency axis where the signal should be mapped [23]. Therefore, one must be careful when determining how to slice up the signal in preparation for analysis.

The Wavelet Transformation solves the signal cutting problem by using a scaled window that is shifted across the signal, and for every point along the signal, its spectrum is calculated. That process is then repeated with either shorter or longer windows. The end result is a set of time-frequency representations of the signal, all at different resolutions. At this point we switch from speaking in terms of the time-frequency representation and instead discuss the time-scale representation of a signal, where scale is thought of as the opposite of frequency. A large scale presents the big-picture representation of the signal with less detail than a small scale, which shows more detail across a small time frame [2, 7, 11, 35, 43].

The Continuous Wavelet Transformation (CWT) is [23]:

X(s, \tau) = \int f(t)\, \psi^{*}_{s,\tau}(t)\, dt \qquad (2.4)

It can be seen from (2.4) that the input signal f(t) is decomposed into a series of basis functions, or wavelets, denoted by \psi_{s,\tau}(t), where s represents the scale and \tau represents the translation.

The wavelet series, \psi_{s,\tau}(t), is derived from a single basis wavelet, known as the mother wavelet, \psi(t). The mother wavelet is used to generate every other wavelet in the wavelet series. Each wavelet in the series can be extracted by scaling and translating the mother wavelet [23]:

\psi_{s,\tau}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t - \tau}{s}\right) \qquad (2.5)

where s is the scale factor and \tau is the translation factor. The wavelet function derived from the mother wavelet together with the scaling function define a wavelet.
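A direct, if inefficient, way to see (2.4) and (2.5) in action is to build the scaled and translated wavelet numerically and integrate it against the signal. The sketch below uses the Mexican-hat wavelet as the mother wavelet purely because it has a simple closed form; the thesis itself works with Daubechies wavelets through the discrete transform of the next section.

    import numpy as np

    def mexican_hat(t):
        # A convenient closed-form mother wavelet (not the one used in this thesis).
        return (1.0 - t**2) * np.exp(-t**2 / 2.0)

    def cwt_point(f, t, s, tau):
        """X(s, tau) per (2.4), via the trapezoid rule. The Mexican hat is real-valued,
        so the complex conjugate in (2.4) has no effect here."""
        psi = mexican_hat((t - tau) / s) / np.sqrt(s)   # scaled, translated wavelet, (2.5)
        return np.trapz(f * psi, t)

    # Test signal: a slow 5 Hz tone for the first half second, a fast 40 Hz tone after.
    t = np.linspace(0.0, 1.0, 4000)
    f = np.where(t < 0.5, np.sin(2 * np.pi * 5 * t), np.sin(2 * np.pi * 40 * t))

    s = 0.005   # a small scale, tuned near the fast component
    slow = max(abs(cwt_point(f, t, s, tau)) for tau in np.linspace(0.1, 0.4, 60))
    fast = max(abs(cwt_point(f, t, s, tau)) for tau in np.linspace(0.6, 0.9, 60))
    print(slow < fast)   # True: this small scale responds mainly to the fast oscillation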

2.2.4 The Discrete Wavelet Transformation

Due to the fact that the CWT operates over every possible scale and translation, it is inefficient to compute. The DWT, however, uses a subset of scale and translation values, which makes it more efficient to compute. Rather than being continuously scalable and translatable like the CWT, the DWT is scalable and translatable in discrete steps, resulting in a discrete sampling in the time-scale representation. Thus, (2.5) becomes [23]:

\psi_{j,k}(t) = \frac{1}{\sqrt{s_0^j}}\, \psi\!\left(\frac{t - k \tau_0 s_0^j}{s_0^j}\right) \qquad (2.6)

where j and k represent the scale and shift, respectively, and s_0 defines the dilation step. Thus we can describe [23] the DWT as follows:

X(j, k) = \sum_{j} \sum_{k} x(k)\, 2^{-j/2}\, \psi(2^{-j} n - k) \qquad (2.7)

The DWT can be calculated by passing the signal through a filter bank, where the signal is simultaneously passed through both a low pass filter and a high pass filter, which act as the scaling and wavelet function, respectively. Repeating this process produces a binary tree representation of the filter bank, where each node in the tree represents different time/frequency localizations, and each level in the binary tree is a different decomposition level. This representation is shown in Figure 2.2. Passing the signal through the low pass filter results in the production of a series of approximate coefficients, while the detail coefficients are given by passing the signal through the high pass filter. This can be seen [25] from the following equations:

y_{\mathrm{high}}[k] = \sum_{n} x[n]\, g[2k - n] \qquad (2.8)

y_{\mathrm{low}}[k] = \sum_{n} x[n]\, h[2k - n] \qquad (2.9)

where h[\cdot] and g[\cdot] are weighting factors that form the low pass and high pass filters, respectively. In order to keep the output data rate equal to the input data rate (instead of it being doubled after passing through each filter), we apply a factor of two to the high and low pass filters, denoted by 2k in (2.8) and (2.9), effectively subsampling the output of each filter by 2. This process describes one iteration, or level, of the DWT calculation algorithm. The signal can be further decomposed by repeating the process using the approximate coefficients as input to the next level of high and low pass filters. The maximum number of wavelet decomposition levels that can be applied to a signal is related to the length of the signal by N = 2^L, where N is the length of the signal and L is the maximum decomposition level.
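The filter-bank computation of (2.8) and (2.9), the repeated decomposition of the approximation branch, and the limit on the number of levels can all be exercised with the PyWavelets package; the sketch below is an illustration under that assumption and is not the tooling used for the experiments in this thesis.

    import numpy as np
    import pywt   # PyWavelets, assumed available for illustration

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)

    # One level of the analysis filter bank in (2.8)-(2.9): low-pass (approximate)
    # and high-pass (detail) outputs, each roughly half the input length.
    approx, detail = pywt.dwt(x, 'db4')
    print(len(approx), len(detail))

    # Repeating the process on the approximation branch gives a multi-level DWT.
    coeffs = pywt.wavedec(x, 'db4', level=5)
    print([len(c) for c in coeffs])                           # [cA5, cD5, cD4, cD3, cD2, cD1]

    # The signal length bounds how many levels are possible.
    print(pywt.dwt_max_level(len(x), pywt.Wavelet('db4')))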

The extracted coefficients represent the decomposed signal at different resolutions: detailed information is produced by the high pass filter at a low scale and a coarse representation is produced by the low pass filter at a high scale. These coefficients, produced by the DWT, give a time-frequency representation of the signal in a compressed fashion, representing how the energy of the signal is distributed in the time-frequency domain. Compared to the spectrogram shown in Figure 2.1, the graph of the wavelet coefficients is shown in Figure 2.3 for the same sequence of alternating vocal and non-vocal frames, where time is given along the x axis, scale along the y axis and coefficient magnitude along the z axis. The wavelet coefficients give a multiresolution representation of the input signal, and as such, offer an opportunity to extract more information about the nature of the signal.

Figure 2.2: Binary Tree Representation of the Discrete Wavelet Transformation Filter Bank
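As one illustration of how such coefficients can be turned into learning features, the sketch below computes a per-frame energy for each decomposition level and normalizes by the total. This is only a plausible form of a wavelet-energy feature; the exact feature definitions used in this thesis follow [8] and [11] and are described later. PyWavelets is assumed for illustration.

    import numpy as np
    import pywt

    def wavelet_energy_features(frame, wavelet='db4', level=5):
        """Energy of each DWT sub-band of a frame, normalized by the total energy.
        A plausible sketch of a wavelet-energy feature, not the thesis's exact definition."""
        coeffs = pywt.wavedec(frame, wavelet, level=level)
        energies = np.array([np.sum(c ** 2) for c in coeffs])
        return energies / energies.sum()

    frame = np.random.default_rng(1).standard_normal(2048)   # stand-in for a windowed audio frame
    print(wavelet_energy_features(frame).round(4))           # level + 1 values summing to 1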

2.3 The Daubechies Wavelet Family

Wavelet families are typically characterized by the manner in which the scaling and wavelet functions are defined. Daubechies wavelets, however, cannot be described in closed form [23]. Thus, the Daubechies wavelet family is defined by its orthogonal basis and a minimum number of vanishing moments for a given, compact support.


Figure 2.3: Wavelet Coefficients Extracted from a Progression of Alternating Vocal and Instrumental Regions in a Queen Song, from Left to Right Respectively and Top to Bottom


Having an orthonormal basis is an important property of the Wavelet Transformation. L^2 space is the space of square-summable sequences, which is a Hilbert space. A family of vectors \{\psi_n\} in that Hilbert space is said to be an orthogonal basis of that Hilbert space if \langle \psi_n, \psi_p \rangle = 0 where n \neq p. If f \in L^2 and \{\psi_n\} is an orthonormal basis of L^2, then the Plancherel formula for energy conservation is given as follows [23]:

\|f\|^2 = \sum_{n=0}^{+\infty} \left| \langle f, \psi_n \rangle \right|^2 \qquad (2.10)

The conservation of energy asserts that the distances between two objects are not affected by the Wavelet Transformation [18].
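Equation (2.10) can be checked numerically for an orthogonal Daubechies wavelet: with periodic boundary handling, the energy of a signal equals the energy of its wavelet coefficients. The check below assumes PyWavelets; other boundary-extension modes introduce small edge effects, and the equality then holds only approximately.

    import numpy as np
    import pywt

    x = np.random.default_rng(2).standard_normal(1024)
    coeffs = pywt.wavedec(x, 'db3', mode='periodization', level=4)

    signal_energy = np.sum(x ** 2)
    coeff_energy = sum(np.sum(c ** 2) for c in coeffs)
    print(np.isclose(signal_energy, coeff_energy))   # True: energy is conserved, per (2.10)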

Compact support is defined as the finite interval upon which the basis function is supported. This means that when a wavelet is applied to a region of a signal, the parts of the signal that fall outside the support of the wavelet are not affected by the wavelet function. This guarantees that the wavelet coefficients produced by applying the wavelet transformation will be localized with respect to that region of interest [18].

Vanishing moments describe the type of polynomial information that can be represented by the wavelet [23]. The number of vanishing moments represents the order of the wavelet; for example, db3 has three vanishing moments. As the order of the wavelet (the number of vanishing moments) increases, the wavelet becomes better able to approximate the signal. In some of the literature, wavelet order is represented by the length of the wavelet's support; in that case, the number of vanishing moments is given by half of the support length. For the purposes of this thesis, the subscript on the wavelet name will denote the number of vanishing moments.

The Daubechies wavelet family is popular in many research areas, such as dimensionality reduction and denoising [18], as well as privacy-preserving data stream identification [35], pattern recognition [7, 30], general sound and audio analysis [17, 41] and speech analysis [8, 11, 37, 43]. Its popularity in a wide variety of research areas, both closely and tangentially related, motivates its use in this thesis.

The wavelet and scaling functions, discussed in the previous section, for the four Daubechies wavelets [25] used in this thesis are shown in Figure 2.4. As Figure 2.4 illustrates, the first-order Daubechies wavelet, db1, is equivalent to the Haar wavelet [23].
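These properties are easy to inspect programmatically. The sketch below, which assumes PyWavelets, lists db1 through db4 purely for illustration (Figure 2.4 shows the four wavelets actually used): each dbN filter has length 2N, consistent with the support-length convention mentioned above, and db1 reproduces the Haar filter exactly.

    import pywt

    # Filter length of dbN is 2N (twice the number of vanishing moments).
    for name in ('db1', 'db2', 'db3', 'db4'):
        w = pywt.Wavelet(name)
        print(name, 'filter length:', w.dec_len, 'orthogonal:', w.orthogonal)

    # db1 and the Haar wavelet share the same decomposition low-pass filter.
    print(pywt.Wavelet('db1').dec_lo)
    print(pywt.Wavelet('haar').dec_lo)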


Figure 2.4: Wavelet (Left) and Scaling (Right) Functions of the Four Daubechies Wavelets used in this Thesis. Magnitude is shown along the y-axis and support length is shown along the x-axis


2.4 Learning Algorithms

For the purpose of this thesis, models were built using the following learning algorithms: Support Vector Machines, Naïve Bayes, JRip (RIPPER), J48 (C4.5) and IBk. Each of those five learning algorithms represents a unique machine learning approach. In addition to discovering which features work best with which learning algorithm, using a wide variety of algorithms allows me to discuss the stability of the extracted features as learning features. If a feature set works very well with one learning algorithm while performing poorly on all others, a discussion of why that might happen should take place, whereas a feature set that performs well on all or most learning algorithms provides a strong degree of confidence in the usability of that feature set.

2.4.1 Support Vector Machines

One of the main appeals of using Support Vector Machines (SVM) is that they work well with high-dimensional data and are able to avoid the curse of dimensionality. The two main ideas behind SVMs with regard to this research are linear separability and the hyperplane. The SVM algorithm attempts to construct a hyperplane that divides a data set into n regions, where n is the number of class labels in the data set. There are an infinite number of hyperplanes that can be constructed to divide a data set, and as such, SVMs attempt to find the maximum margin hyperplane with which to separate the data [38]. A hyperplane with the maximum margin has the maximum distance between the boundaries of the training data for each class. This provides, in general, better accuracy when testing the model with previously unseen data.

Let N denote the number of training instances in the data set. Each instance has a set of features which are used to train the model, denoted by f_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,m}), i = 1, 2, \ldots, N. For a binary classification task, the class label is denoted by l \in \{c_1, c_2\}, or more generally, l \in \{-1, +1\}. The SVM algorithm then determines the decision boundary of a linear SVM classifier with the following equation [38]:

y \cdot f + b = 0 \qquad (2.11)

where y and b are the parameters of the SVM model. Those parameters are selected such that the following [38] two conditions are met:

y \cdot f_i + b \geq 1 \quad \text{for } l_i = 1 \qquad (2.12)

y \cdot f_i + b \leq -1 \quad \text{for } l_i = -1 \qquad (2.13)

Furthermore, Support Vector Machines find the maximum-margin hyperplane by minimizing the following objective [38]:

f(y) = \frac{\|y\|^2}{2} \qquad (2.14)

Satisfying those equations gives the model parameters, which can then be applied to previously unseen data to determine the most appropriate class label.
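A toy example makes (2.11) through (2.14) concrete: fit a linear maximum-margin classifier to a separable two-class set and read back the hyperplane parameters. The sketch below uses scikit-learn only because it exposes the learned coefficients conveniently; the experiments in this thesis use WEKA's SVM implementation.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    X_pos = rng.normal(loc=+2.0, scale=0.4, size=(50, 2))    # class +1
    X_neg = rng.normal(loc=-2.0, scale=0.4, size=(50, 2))    # class -1
    X = np.vstack([X_pos, X_neg])
    labels = np.array([+1] * 50 + [-1] * 50)

    clf = SVC(kernel='linear', C=1e3).fit(X, labels)         # large C approximates a hard margin
    w, b = clf.coef_[0], clf.intercept_[0]                   # hyperplane w . x + b = 0, cf. (2.11)
    margin_width = 2.0 / np.linalg.norm(w)                   # maximized by minimizing ||w||^2 / 2, cf. (2.14)
    print(w.round(3), round(b, 3), round(margin_width, 3))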

2.4.2 Naïve Bayes

Naïve Bayes (NB) is a simple probabilistic classifier that, despite its simplicity, has been shown to produce good results on a variety of data sets [38]. NB relies heavily on the assumption of independence for each of its training features. That is, NB assumes that the presence or absence of one feature is not related to the presence or absence of any other feature in the model. More formally, the conditional independence condition can be expressed [38] as:

P(X \mid Y, Z) = P(X \mid Z) \qquad (2.15)

where X, Y and Z are all random variables. The conditional independence assumption is important because it allows us to consider the class-conditional probability for each training feature, given a class label, as opposed to computing the class-conditional probability for every combination of training features. Therefore, given a set of features X and a class label L, the posterior probability for each class can be calculated as follows [38]:

P(L \mid X) = P(L) \prod_{i=1}^{d} P(X_i \mid L) \qquad (2.16)


From the training set labels, the prior probability, P(L), can be easily calculated, which leaves only the task of calculating the one remaining model parameter for the NB classifier, P(X_i \mid L). This too can be calculated directly from the features in the training set.
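The sketch below applies (2.16) directly to a tiny, hypothetical data set with two binary features, estimating P(L) and each P(X_i | L) from counts (no smoothing, for brevity). Real features in this thesis are continuous, so WEKA's Naïve Bayes models them with per-class distributions rather than counts.

    import numpy as np

    X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])   # two binary features
    L = np.array([1, 1, 0, 0, 1, 0])                                  # class labels

    def naive_bayes_score(x_new, label):
        """P(L) * prod_i P(X_i | L), estimated from counts, per (2.16)."""
        mask = (L == label)
        prior = mask.mean()
        likelihoods = [(X[mask, i] == x_new[i]).mean() for i in range(X.shape[1])]
        return prior * np.prod(likelihoods)

    scores = {label: naive_bayes_score([1, 1], label) for label in (0, 1)}
    print(scores)   # the label with the larger score is the predicted class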

2.4.3 JRip

JRip is an implementation of the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm [4]. This rule-induction algorithm is attractive because it scales linearly with the number of training examples, works well in situations where the class distribution is uneven, and has been shown to work well with noisy data [38].

In the two-class case, JRip initializes by setting the majority class to be the default class. From there, rules are learned to detect the minority class. The JRip algorithm looks at the minority class and labels all examples that are of that class as being positive, while setting all other examples to be negative.

A rule is generated by examining the information gain of adding a new condition to the rule's antecedent. Conditions are added until the rule no longer covers negative examples, at which point no more conditions are added and the algorithm moves to the pruning phase. The pruning metric for the JRip algorithm is (p + 1)/(p + n + 2) [33], where p and n denote the number of positive and negative examples, respectively. The rule is then added to the rule set provided that it does not cause the rule set to exceed the minimum description length (defaulted to 64 bits in JRip [12, 33]) and that its error rate on the validation set does not exceed 50%.

2.4.4 J48

J48 is a statistical classifier which is an implementation of the C4.5 decision tree algorithm. J48 works by constructing a tree based on the features and associated class labels, which is later pruned to produce an efficient means of classifying a previously unseen feature set.

J48 uses information gain to determine which training feature most effectively splits the training examples into subsets. Information gain, also referred to as Kullback-Leibler divergence, is a measure of the reduction of uncertainty regarding the classification of an example in a training set. Given a feature set X_i for a set of examples E_d and a corresponding class label vector L_d, J48 iterates through X_i and calculates the information gain from splitting on each feature. Whichever feature displays the highest information gain becomes a decision node that splits on that feature. The algorithm is then performed recursively on each subsequent subset, until all the training examples in that subset belong to the same class, at which point a leaf node is created, indicating that the corresponding class label should be selected if that leaf node is reached. After the tree is constructed, a pruning algorithm can be applied to eliminate branches of the tree which do not aid the classification task, replacing them with leaf nodes.

The use of information gain as the cost function helps produce small trees, which in conjunction with the binary tree structure makes them easy to traverse when labeling previously unseen data. Additionally, the pruning involved with J48 also reduces the size of the resulting decision tree. This is an important feature of J48 to consider when dealing with a data set with as many dimensions as the data used in this research.
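For a discrete feature, the quantity J48 evaluates at each candidate split is the reduction in label entropy. The sketch below computes it on a small hypothetical example; gain ratio and the handling of numeric attributes, both part of C4.5 proper, are omitted here.

    import numpy as np

    def entropy(labels):
        """Shannon entropy of a label vector, in bits."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(-np.sum(p * np.log2(p)))

    def information_gain(feature, labels):
        """Entropy of the labels minus the weighted entropy after splitting on the feature."""
        gain = entropy(labels)
        for value in np.unique(feature):
            subset = labels[feature == value]
            gain -= (len(subset) / len(labels)) * entropy(subset)
        return gain

    # Hypothetical frames: feature value 1 mostly coincides with the vocal class (1).
    feature = np.array([0, 0, 0, 1, 1, 1])
    labels = np.array([0, 0, 1, 1, 1, 1])
    print(round(information_gain(feature, labels), 3))   # about 0.459 bits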

2.4.5 IBk

IBk is the implementation of the k-Nearest Neighbor algorithm in the WEKA software package [12]. It is an instance-based learner with a fixed neighborhood determined by k, where k indicates the number of neighbors used (defaulted to 1). The IBk algorithm can select an appropriate value of k via hold-one-out cross validation.

IBk looks to classify previously unseen examples based on their proximity to training data already discovered in the instance space. Because IBk is instance based, it does not create a model using the specified features like the previously discussed algorithms. Instead, it compares each new example of unlabeled data to labeled data already encountered. Instance-based learning is an example of "lazy learning", where training examples are not used to build a class model and are instead stored and accessed when unlabeled data is encountered.

The primary design decision when implementing an IBk test is the selection of the distance measure. The two most common distance measures are the Euclidean Distance, used for continuous values, and the Hamming Distance, more suited for discrete attribute values [38]. In a feature space consisting of two examples with feature sets E_1 and E_2, the Euclidean distance is simply the length of the line segment connecting the two examples, or, more formally:

d(E_1, E_2) = \sqrt{\sum_{i=1}^{n} \left(E_{1,i} - E_{2,i}\right)^2} \qquad (2.17)


The Hamming Distance can be thought of as the number of substitutions that would have to be made in order to turn one feature set into another, or the number of positions at which two equal-length feature vectors differ. Hamming Distances work well in cases where features belong to a discrete set of values, but are meaningless for continuous-valued feature sets.
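Both distance measures are straightforward to state in code; the sketch below implements (2.17) and the Hamming count for illustration.

    import numpy as np

    def euclidean_distance(a, b):
        """Length of the line segment between two feature vectors, per (2.17)."""
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(np.sqrt(np.sum((a - b) ** 2)))

    def hamming_distance(a, b):
        """Number of positions at which two equal-length discrete feature vectors differ."""
        a, b = np.asarray(a), np.asarray(b)
        return int(np.sum(a != b))

    print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))   # 5.0
    print(hamming_distance(['a', 'b', 'c'], ['a', 'x', 'c']))     # 1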

The selection of a value for k is also an important decision when setting up an IBk experiment. As is the case with this research, the binary class problem often forces one to use an odd value of k to prevent ties when doing a majority vote. An optimal value of k can be selected within the IBk algorithm by using hold-one-out cross validation. A larger k can help to reduce noise, but it can blur the boundaries between classes, particularly when the number of classes is greater than 2.

The drawbacks of the IBk algorithm include the fact that if an example set contains an overabundance of a particular class, the labeling of new instances can be dominated by that class. This can be overcome by weighting the votes of the selected nearest neighbors, often by a function related to their distance from the unlabeled example [44].

IBk is resistant to noise and has demonstrable success in Information Retrieval research, such as text processing [38, 44].


Chapter 3

Review of Related Literature

Vocal region detection, like other Music Information Retrieval problems, is a relatively new area of research, but it has been aided by work in the related area of speech analysis. As such, there is a strong relationship between the work being done in speech analysis and similar work done with music analysis [1, 3, 8, 11, 15, 16, 17, 19, 20, 27, 31, 40, 41, 43, 45]. The applicability of speech analysis techniques in music applications encourages research, while the structural differences between speech and music signals should be noted and investigated in an effort to improve the performance of those techniques in music analysis.

Tsai and Wang identified a singer in a song by recognizing that the singing voice has harmonic features that are not present in other instruments [40]. Their task was to automatically recognize a singer in a song, and they separated vocal and non-vocal regions by building a Gaussian Mixture Model (GMM) classifier. They used Mel-scale frequency cepstral coefficients, calculated on a fixed-length sliding window, as their feature vector. This was done to model common speech-recognition tasks, which is a common technique in Music Information Retrieval. They performed their analysis on the vocal region segments in order to identify the singer, using another GMM classifier and a custom decision function. They achieved a vocal region detection accuracy of just over 82% for songs with a solo vocalist. The data set they used consisted of three separate sets of songs: 242 solo tracks, 22 duet tracks and 174 instrument-only tracks. The vocal tracks consisted of 10 male Mandarin pop singers and 10 female Mandarin pop singers.

Ramona, Richard and David proposed a learning methodology for identifying vocal regions by creating an extensive feature vector, to which they applied a classifier to reach their decision [31]. They built their feature vector from a range of song characteristics computed on successive frames of a song. The feature vector contained 116 components, including 13-order Mel-scale frequency cepstral coefficients, linear predictive coding coefficients and the zero-crossing rate. Those feature vectors were fed into a Support Vector Machine classifier with a radial basis function kernel, distinguishing between the two classes: vocal and non-vocal. The results of the classifier were then smoothed by using a two-state Hidden Markov Model. They achieved a frame classification accuracy of around 82% for a data set of 93 songs, all from different artists.

Bartsch and Wakefield proposed an algorithm for identifying the singing voice in a song with no instrumentation using the spectral envelope of the signal [1]. They attempted to estimate the spectral envelope by using a composite transfer function, which utilizes the instantaneous amplitude and frequency of segments of the music signal. The computed features were then fed into a quadratic classifier, which yielded an accuracy of roughly 95%. Their data set consisted of 12 classically-trained female singers vocalizing a series of five-note exercises. This high accuracy demonstrates the importance of being able to properly isolate the vocal regions, since they were only able to achieve such good results by having the singing voice by itself, without any background music.

Mesaros, Virtanen and Klapuri evaluated methods for identifying singers in polyphonic music

using pattern recognition and vocal separation algorithms [26]. They took two approaches to their

singer identification task; one was to extract their model's feature vector directly from the music signal, while in the other they attempted to extract the vocal channel from the polyphonic music

signal. The extraction of the vocal line from the complete music signal required use of a previously

developed melody transcription system, the output of which was then fed to a sinusoidal modeling

re-synthesis system. They used an internally-developed data set of 65 songs from 13 unique singers in

their singer identification task, and found that the models which were fed the extracted vocal channel as opposed to the complete music signal fared much better, achieving up to 67% singer classification

accuracy as compared to 42% for music signals with the worst singer-to-accompaniment ratios. This

demonstrates that being able to effectively isolate the vocal regions has a noticeably positive impact

on tasks that require extracting information about the vocalist in a song.

Tzanetakis, Essl and Cook proposed a learning technique for using the Discrete Wavelet Trans-

form to classify a variety of non-speech audio signals [41]. Their implementation made use of the

level four Daubechies wavelet family to generate wavelet coefficients for their feature vector. For

comparison, they also built models from features extracted from the Short Time Fourier Transform

(STFT) as well as MFCC features. The data they used consisted of audio separated into three classes: MusicSpeech (126 files), Voices (60 files) and Classical (320 files). They then trained a Gaussian classifier on each of the three feature vector sets, and found that the Gaussian classifier trained with DWT coefficients performed better than random classification and its performance was on par


with the classifiers trained on MFCC features and STFT features, producing classification accuracies

that were consistently within 10% of the best performing model.

Kronland, Morlet and Grossman explored how sound patterns can be analyzed by using the

Wavelet Transformation [17]. They built a real-time physical signal processor, through which they

fed a variety of audio signals, including chirps, spoken words and notes played on a clarinet. By doing

so, they were able to note that the Wavelet Transformation produced outputs that they believed

could be used in a variety of signal processing and pattern recognition research areas. For speech in particular, their results showed that the segmentation of speech sounds would be possible using, in part,

information gained by performing the Wavelet Transformation. Their promising results motivated

my work in exploring how the Wavelet Transformation could be used in identifying vocal regions

within music.

Gowdy and Tufekci proposed a new feature vector for speech recognition that adapted the MFCC

algorithm to leverage the desirable properties of the Discrete Wavelet Transformation [11]. They

proposed that the time and frequency localization property exhibited by the coefficients calculated

by performing the Discrete Wavelet Transformation would provide a more noise-resistant learning

feature when substituted for the Discrete Cosine Transformation in the MFCC algorithm. In their

research, they found that models built with these features, named Mel-Frequency Discrete Wavelet Coefficients (MFDWC), outperformed similar models built with the standard MFCC features for

both clean and noisy speech with a phoneme recognition rate increase of up to 10%. While they

performed their analysis on a variety of speech signals, the close relationship between speech and

music analysis suggests that it may be possible to apply the MFDWC features in the vocal region

detection domain and achieve similar results.

Didiot, Illina, Fohr and Mella also explored the applicability of the Wavelet Transformation in

improving the standard MFCC algorithm [8]. In their research, two class/non-class classifiers were

built: one for speech/non-speech and the other for music/non-music. For their models, they calcu-

lated the energy distribution for each frequency band using the detail wavelet coefficients extracted by applying three wavelet families: Daubechies, Symlet and Coiflet. For each energy band, they cal-

culated the instantaneous energy, the Teager energy and the Hierarchical energy. When compared

to models built with the standard MFCC features, the researchers saw a reduction in error rate of

more than 30%. The corpus employed in this research consisted of two hours of audio (both instrumental music and songs) and over four hours of broadcast media taken from French


radio, consisting of news, interview and musical performance segments. While the research presented

in this thesis covers a narrower corpus, strictly music extracted from popular music CDs, the success of Didiot et al. motivates further study in the applicability of wavelet-based energy features

in a vocal region detection model.

Other researchers have explored a variety of other techniques, including utilizing the Fourier

Transform [20, 29], exploring the benefits of the harmonic and energy features of the singing voice [28], and employing various frequency coefficients as the basis for a feature vector used in classification

[15, 19, 22, 27, 34].

Previous research, as noted above, has focused primarily on building a model of the singing voice either from feature vectors created by examining the characteristics of the song, such as zero-crossing rate, sharpness, spread or loudness, among others [31], or by examining the frequency domain representation of that signal using the Short-Time Fourier Transform [20, 29], used in the calculation of the standard Mel-Frequency Cepstral Coefficients. While those features have in some cases been able to act as effective features for some vocal models, each has its own set of drawbacks. Many of the song characteristics that have been used as features are common in the speech analysis domain. However, due to the presence of background instrumentation, the utility of those features is reduced [31]. Additionally, singing is over 90% voiced (voiced sounds are those that are produced by vocal cord vibrations), compared to speech, which is about 60% voiced [16], a difference that forces researchers to add additional steps to account for the differences between the two types of signals, such as

harmonic analysis [16]. Other researchers have decided to forgo analyzing a signal using that set of

characteristics, and instead attempted to use the Short-Time Fourier Transform (STFT) to analyze

the frequency components of the music signal [20, 29].

The fact that the frequency components of a music signal change over time should be a con-

sideration when attempting to detect vocal regions in that signal. Whereas the Short-Time Fourier

Transform is unable to adequately capture those frequency changes, the Wavelet Transform analyzes

a signal at a multitude of resolutions, offering a more complete characterization of the signal for use in classification.


Chapter 4

A Method for Detecting Vocal Regions

This chapter discusses the vocal region detection methodology presented by this thesis, used in

conjunction with both Wavelet Energy features and MFDWC features to offer an improvement on

previous research. The method described herein can be implemented using a variety of machine

learning algorithms and data sets; the following chapter will discuss my selections and my justifications for doing so.

This thesis evaluates a methodology for detecting vocal regions in music using Wavelet Energy

and MFDWC features. As mentioned above, the implementation specifics can be tuned based on the

particular application of the research, but the components chosen for this thesis will be discussed

in detail in the next section. I have developed this method as a means of leveraging the Wavelet

Transformation in the problem of vocal region detection.

The methodology evaluated in this thesis builds upon a common framework for conducting

machine learning / knowledge discovery and data mining research. In general, that framework

involves the following steps:

1. Gather training data

2. Extract features

3. Build model

4. Gather testing data

5. Extract features

6. Test against model

7. Report results


Figure 4.1: The Training Phase of the Presented Methodology

That general methodology is evaluated in this research, and is adapted to utilize the aforemen-

tioned features calculated using the Discrete Wavelet Transformation in the task of identifying vocal

regions in music.

Vocal region detection is a two-class problem, where a learning algorithm attempts to distinguish between frames of music that contain vocals (assigned the class label "vocal") and those which contain pure instrumentation (assigned the class label "inst"). This methodology has two phases: a training

phase and an application phase. The training phase of this methodology is shown visually in Figure

4.1.

The training phase involves the gathering of digital music, independent of format, and first segmenting it into overlapping frames of a predetermined length with a predetermined overlap such that the length is a power of two; for the purpose of this thesis, the window length was set at 1,024 samples with an overlap of 512 samples. Overlapping frames are used because they give the Wavelet Transformation more coverage in order to track the frequency changes of the signal over time. Additionally, the Discrete Wavelet Transformation operates by a series of successive subsamplings by two; therefore, in order for the algorithm to be most efficient (without having to pad the frame with zeros to get it to the right size), the frame length must be a power of two.
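As a concrete illustration of this segmentation step, the following MATLAB sketch splits a mono signal into overlapping 1,024-sample frames with a 512-sample hop; the function and variable names are my own and are not taken from the thesis code.

    function frames = segment_signal(x, frameLen, hop)
    % Split a mono signal x into overlapping frames, one frame per row.
    % A minimal sketch assuming frameLen = 1024 and hop = 512, as in the text;
    % any trailing partial frame is simply dropped.
        x = x(:).';                                      % force a row vector
        nFrames = floor((length(x) - frameLen) / hop) + 1;
        frames = zeros(nFrames, frameLen);
        for k = 1:nFrames
            start = (k - 1) * hop + 1;
            frames(k, :) = x(start:start + frameLen - 1);
        end
    end

At a 16 kHz sampling rate, each 1,024-sample frame covers 64 ms of audio, and successive frames start 32 ms apart.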

After the frames are extracted, successive Discrete Wavelet Transformations are applied to the

frame, until a maximum-level decomposition has been performed on the frame and detail coefficients are extracted at each level up to the maximum level. Once the transformation coefficients are extracted, they are then used in the calculation of one of two features: MFDWC features and

Wavelet Energy features.
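With the MATLAB Wavelet Toolbox used for feature extraction in Chapter 5, a maximum-level decomposition of a single frame can be expressed as in the sketch below; this is an illustrative reconstruction, not the exact extraction code used for the experiments.

    % Maximum-level db4 decomposition of one 1,024-sample frame (illustrative).
    frame  = frames(1, :);                      % one row from segment_signal above
    maxLev = wmaxlev(length(frame), 'db4');     % deepest level supported by db4
    [C, L] = wavedec(frame, maxLev, 'db4');     % full wavelet decomposition
    detail = cell(1, maxLev);
    for j = 1:maxLev
        detail{j} = detcoef(C, L, j);           % detail coefficients at level j
    end

The cell array of detail coefficients is the raw material for both feature types described below.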

Mel-Frequency Cepstral Coefficients are a common feature in vocal region detection research

and in broader signal analysis applications. But while those features tend to perform well in vocal

region-related tasks, there are possible areas of improvement. MFCCs rely on the fixed window size of the STFT and thus have poor time resolution. The Discrete Wavelet Transformation offers


an opportunity to make up for those limitations by providing multiresolution analysis, good time

and frequency resolution, as well as providing features in the form of wavelet coefficients, which are resistant to noise and, in the case of db4, have a high number of vanishing moments.

The first wavelet transformation features tested are Mel-Frequency Discrete Wavelet Coefficients

(MFDWC), as described by Gowdy and Tufekci in [11]. In their research, Gowdy and Tufekci note

that one of the advantages of the Discrete Wavelet Transformation (DWT) is its localization property

in the time and frequency domains. Localization is important because different parts of a signal can contain different distinguishing information. A drawback of using MFCCs is the utilization of the Discrete Cosine Transformation (DCT). The DCT is computed over a fixed-length window, giving

it the same resolution in time and frequency. But in a signal that has changes in time and frequency,

like music, it can be helpful to track those changes at different resolutions. The multiresolution

characteristic of the DWT makes it well adapted to tracking those changes, due to the fact that the

basis vectors of the DWT that capture the high frequency component of the signal have better time

resolution than the basis vectors used to calculate the low frequency components. Additionally, those

basis vectors used to calculate the low frequency components of the signal have better resolution in

the frequency domain than those basis vectors used to find the high frequency components. Therefore,

excellent resolution in both the time and frequency domain is achieved by using the DWT, and thus

the DWT coefficients are better localized in time and frequency than the DCT coefficients.

In order to take advantage of that localization property, the researchers in [11] replace the utiliza-

tion of the DCT in the MFCC algorithm with the DWT and name those coefficients MFDWC, Mel-Frequency Discrete Wavelet Coefficients. I utilize that approach in my research with two changes. First, in their research, Gowdy and Tufekci used the Cohen-Daubechies-Feauveau wavelet

family in order to take advantage of its compactly supported bi-orthogonal spline properties. For

the purposes of my research, I used the Daubechies wavelet family. The Daubechies family was used

due to the high number of vanishing moments for a given support and the fact that they have demon-

strated usefulness in a wide variety of signal analysis domains [7, 8, 11, 43]. Secondly, the researchers

in [11] used eight coefficients at scale four, four coefficients at scale eight, two coefficients at scale 16 and one coefficient at scale 32. I perform a maximum-level decomposition of the frame and utilize all the coefficients. Additionally, Gowdy and Tufekci tested their coefficients on a speech data set, whereas I test the effectiveness of MFDWC features on a music data set.
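The substitution itself is small. The sketch below contrasts the final step of the two calculations, assuming a vector logE of log mel filterbank energies has already been computed for a frame (for example, by the MFCC pipeline described in Chapter 5); the filterbank size and the decision to keep 13 MFCCs are illustrative assumptions rather than the thesis' exact settings.

    % Final step of the cepstral computation, contrasting MFCC and MFDWC.
    % logE is assumed to be a 1-by-M vector of log mel filterbank energies
    % (M a power of two, e.g. 32) computed from one frame.
    mfcc = dct(logE);                      % standard MFCC: Discrete Cosine Transform
    mfcc = mfcc(1:13);                     % keep the first 13 coefficients

    lev = wmaxlev(length(logE), 'db1');    % MFDWC: swap the DCT for a max-level DWT
    [C, L] = wavedec(logE, lev, 'db1');    % C holds approximation and detail coefficients
    mfdwc = C;                             % use all coefficients as the feature vector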


The second feature set I utilize in this thesis that is built using the DWT was based on

research conducted by Didiot, Illina, Fohr and Mella [8]. These researchers recognized the limitations

of the Fourier Transformation in analyzing non-stationary signals. They used the DWT to calculate

three different energy features: Instantaneous energy, Teager energy and Hierarchical energy. For the calculation of each energy feature, the wavelet coefficients are denoted by $w_j(r)$, where $j$ denotes the frequency band and $r$ denotes the time index. $N_j$ is the number of coefficients in band $j$, and $J$ denotes the lowest frequency band. The equations [8] for Instantaneous, Teager and Hierarchical energy are shown in 4.1,

4.2 and 4.3, respectively.

f_j = \log \frac{1}{N_j} \sum_{r=1}^{N_j} \bigl(w_j(r)\bigr)^2 \qquad (4.1)

f_j = \log \frac{1}{N_j} \sum_{r=1}^{N_j - 1} \Bigl| \bigl(w_j(r)\bigr)^2 - w_j(r-1)\, w_j(r+1) \Bigr| \qquad (4.2)

f_j = \log \frac{1}{N_j} \sum_{r=(N_j - N_J)/2}^{(N_j + N_J)/2} \bigl(w_j(r)\bigr)^2 \qquad (4.3)

Instantaneous Energy gives the energy distribution in each frequency band. Teager energy has

been shown in a number of speech applications to give formant information and is resistant to noise

[8]. Whereas instantaneous energy is a measure of the amplitude of the signal at a point in time,

Teager energy shows the variations of the amplitude and frequency of the signal. Hierarchical energy

provides time resolution for a windowed signal and has demonstrated applications in speech analysis.
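Under the assumptions above (a maximum-level decomposition with the detail coefficients stored per level), a MATLAB sketch of the three energy features might look like the following; the handling of the summation bounds at the band edges is my own adjustment and may differ from the exact implementation in [8] or in this thesis.

    function energyFeatures = wavelet_energy(detail)
    % Wavelet Energy features for one frame, following Equations 4.1-4.3.
    % 'detail' is a cell array of per-level detail coefficients, with level 1
    % first and the lowest frequency band last.
        nBands = numel(detail);
        instE = zeros(1, nBands); teagE = zeros(1, nBands); hierE = zeros(1, nBands);
        NJ = length(detail{nBands});                 % size of the lowest frequency band
        for j = 1:nBands
            w  = detail{j}(:).';                     % row vector of band-j coefficients
            Nj = length(w);
            instE(j) = log(sum(w.^2) / Nj);                           % Eq. 4.1
            r = 2:Nj-1;                                               % keep r-1, r+1 in range
            teagE(j) = log(sum(abs(w(r).^2 - w(r-1).*w(r+1))) / Nj);  % Eq. 4.2
            lo = max(1, floor((Nj - NJ)/2)); hi = min(Nj, ceil((Nj + NJ)/2));
            hierE(j) = log(sum(w(lo:hi).^2) / Nj);                    % Eq. 4.3
        end
        energyFeatures = [instE, teagE, hierE];      % concatenated per-frame feature vector
    end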

My utilization of those energy values differs from that of the original researchers in several ways. First, the researchers in [8] used a variety of different decomposition levels, whereas I perform a maximum-level decomposition of the frame. This was done because I ran a series of pilot studies on a subset of the data where I varied the decomposition level and examined its effect on the classification accuracy of a vocal region detection model that used just the raw detail Discrete Wavelet coefficients as the input feature vector. Preliminary results from those pilot studies showed that the accuracy of the system increased as the number of decomposition levels increased. Also, I was unable to find anything in the literature that suggested a certain decomposition level should be used, thus motivating me to use every coefficient at every level. Additionally, the researchers tried a variety of different combinations of energy values for their feature vector, whereas I use every energy value in the feature vector I produce. Also, the scope of each research project differs. They


were attempting to apply these features in a dual binary classification system (speech/non-speech and music/non-music), whereas this thesis looks to distinguish between vocal and non-vocal regions of a song. Additionally, they applied their features to a corpus consisting of 15-second segments of audio (20 files of speech and 21 of music) in addition to French news radio recordings and French entertainment radio broadcasting, which consisted of interviews and musical programs. While their data set is quite diverse, it has a relatively minor music component, which is the sole focus of my

research.

The research in this thesis uses, for both the MFDWC and Wavelet Energy features, all detail

coefficients extracted by performing a maximum-level DWT decomposition. The presented methodology can be generalized to use any number of coefficients from any decomposition level. It may be advantageous to not use every coefficient for reasons such as the wavelet family used, or the prohibitive cost of storing every coefficient.

Once the desired feature is calculated, the training phase is completed when a machine learning

model is built. The specific algorithms I used in this thesis are described in general in the Background section, and with specifics for this research in the next section, but this method generalizes to any

machine learning algorithm.

The second phase of the two-phase methodology described in this thesis is the application phase.

In the application phase, frames are produced from a set of unlabeled digital music and features are

extracted from those frames in the same manner as described above. Those resulting feature vectors

are then applied to the model built in the training phase and the model determines whether these

unlabeled frames contain vocalizations or pure instrumentation and assigns each the appropriate class

label.

As discussed above, the only implementation-specific aspects of this methodology are the features: Wavelet Energy and MFDWC. For the other components, this method generalizes to selections other than what was made in this thesis, meaning that other digital music formats (Vorbis, WMA, etc.), wavelet families (Symlets, Coiflets, Morlet, etc.) and learning algorithms (Hidden Markov Models, Gaussian Mixture Models, etc.) could be applied within this algorithmic

architecture.

The novel contribution of this methodology lies in its adaptation of the standard machine learning / knowledge discovery and data mining research methodology to demonstrate the perfor-

mance of the Discrete Wavelet Transformation on music data. The only details of the methodology


that are pre-determined are the training features, MFDWC and Wavelet Energy. This methodology offers a means of replacing specific aspects of the experimental evaluation presented here with other

components that a researcher may wish to test. Future research may wish to evaluate the impact

of using another wavelet family outside of Daubechies, an evaluation that can be performed using

this methodology. The same is also true for those wishing to evaluate different machine learning techniques or input data. This methodology is novel because of the flexibility it allows in evaluating

the performance of wavelet-related features in a time-series machine learning task.

In the next section I will discuss the specific component choices that were made for this thesis and my justification for choosing them, and I will present the experiments used to evaluate the performance

of this methodology.


Chapter 5

Experiments

5.1 Implementation Details

The aim of the research presented in this thesis is to leverage the favorable properties of the wavelet

transformation to improve the accuracy of a vocal region detection system when compared to the

same models built using a standard musical signal feature, MFCCs. In the following sections I will

describe the implementation details of my experiments. That process is shown visually in Figure 5.1.

5.1.1 Data Preparation

One of the predominant issues facing researchers interested in studying singing voice detection

and identification has been the lack of a standard music data set. Each of the research papers

mentioned in this thesis details research performed on a set of songs unique to that particular

research project. Each of those sets contains songs that vary in length, sampling rate and music

type, and in some cases, the differences can be as large as sets that contain just vocals and sets that


contain a vocal/instrumentation mix, for instance [1] and [16], respectively. These differences make it difficult to confidently compare the research results from one paper to the next.

Figure 5.1: Experimental Method

It is because of those issues that this thesis must use a pre-existing data set that has already been

deployed in a research setting. Out of the numerous music data set options, I use the artist20 dataset

[10], compiled by Dan Ellis at Columbia University. The artist20 set comprises 20 artists, each contributing six full albums, for a total of 1,413 songs. This set grew out of the artist identification

work Dr. Ellis' research team has been performing. The songs are full-length 32 kbps mono-tracked

MP3s. The songs have been down-sampled from the original 44.1 kHz stereo tracks to a sampling

rate of 16 kHz, bandlimited to 7.2 kHz. In their research, the downsampling did not adversely affect their results [10], but the sound quality is reduced enough to avoid conflicts with the owners of the songs' copyrights.

For the purposes of this research, only one album per artist was used. This was done to strike a balance between having confidence in the results and the considerable amount of time required to annotate each song in the set by hand. In total, a subset of the artist20 data set consisting of 221 songs makes up this thesis' data set. The songs can be broadly classified as American pop music; the set consists of four female artists and 16 male artists spanning eras from the 1970s to the 2000s.

In order to extract Wavelet Transformation coefficient (WTC) features from music frames and properly label those frames as belonging to either a vocal region or a non-vocal region, each song is annotated by using the Transcribe! software program [36]. This was done by listening to each of the 221 songs and marking the vocal region boundaries by ear. The Transcribe! application allows the user to slow the song down to as little as 25% of the original playing speed, providing a means of more accurately determining the subtle point at which the singing voice trails off, leaving pure instrumentation. In some cases, that boundary is less clear-cut than what is ideal, particularly with artists such as Prince, Madonna and Tori Amos. While it may be ideal to use multiple people annotating the training data, the presence of other annotators would not have altered the ground-truth data significantly. This is due to the fact that although there does exist some ambiguity around the vocal/non-vocal boundaries in some songs, by allowing the user to slow down a song and place markers along the waveform, Transcribe! offers a means of examining a song in such a way that the effect of those ambiguities is minimized as much as possible. The vocal boundaries can

then be marked and Transcribe! can be used to split the song based on those markers, exporting


the resulting song samples as .wav files for MATLAB to import and extract features from. A screenshot of Transcribe! with Aerosmith's "Dream On" is shown in Figure 5.2.

Figure 5.2: Transcribe! window with Aerosmith's "Dream On" open

5.1.2 Feature Extraction

Features are extracted from the music clips and arranged in a feature vector suitable for a machine

learning task. All the processing was done using MATLAB with the Wavelet Toolbox [25]. The song

clips were broken down into overlapping frames 1,024 samples long with an overlap of 512 samples,

in keeping with a convention used by many researchers [1, 19, 37, 40], while remaining a power of two in size, which is helpful when performing wavelet analysis [23]. The last frame in each segment was dropped, as it was unlikely to contain the correct number of samples and the vocal boundaries

were often blurred at the edges.

Each of the three features used in this thesis was extracted on a per-frame basis. Those were MFCCs, MFDWCs and Wavelet Energy features. The MFCC algorithm is a well-established signal analysis technique and is quite simple to implement, making it an ideal control feature.

The steps are as follows:

1. Take the Fourier Transform of a frame


2. Using a series of triangular overlapping windows, map the powers of the spectrum obtained in

step one to the mel scale

3. Take the log of the values obtained in step two

4. Take the Discrete Cosine Transformation of the values found in step three

5. Return the top 13 coefficients obtained in step four

Thirteen coefficients were used for MFCCs due to the fact that after the 13th coefficient, the remaining values were very small and the variance between the remaining coefficients was also very

small.
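To make those five steps concrete, the following MATLAB sketch implements a bare-bones version of the algorithm; the filterbank size (26 filters here) and the details of the triangular filter construction are illustrative assumptions, not the exact code used for the experiments.

    function c = simple_mfcc(frame, fs, nFilt, nCoef)
    % Minimal per-frame MFCC sketch following steps 1-5 above.
        frame = frame(:).';                    % force a row vector
        N = length(frame);
        P = abs(fft(frame)).^2;                % step 1: power spectrum
        P = P(1:floor(N/2) + 1);               % keep the non-negative frequencies
        hz2mel = @(f) 2595 * log10(1 + f / 700);
        mel2hz = @(m) 700 * (10.^(m / 2595) - 1);
        edges = mel2hz(linspace(0, hz2mel(fs/2), nFilt + 2));   % filter edge frequencies
        bins  = floor(edges / (fs/2) * (length(P) - 1)) + 1;    % map edges to FFT bins
        E = zeros(1, nFilt);
        for m = 1:nFilt                        % step 2: triangular mel filters
            lo = bins(m); ctr = bins(m+1); hi = bins(m+2);
            up   = ((lo:ctr) - lo)  / max(ctr - lo, 1);         % rising edge
            down = (hi - (ctr:hi))  / max(hi - ctr, 1);         % falling edge
            filt = [up, down(2:end)];
            E(m) = sum(P(lo:hi) .* filt);      % filterbank energy
        end
        c = dct(log(E + eps));                 % steps 3-4: log, then DCT
        c = c(1:nCoef);                        % step 5: keep the first nCoef coefficients
    end

A call such as simple_mfcc(frames(1,:), 16000, 26, 13) returns 13 coefficients for one frame.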

For the Wavelet Transformation features, MATLAB's Wavelet Toolbox was utilized. The Wavelet

Toolbox offers a fast implementation of the Discrete Wavelet Transformation, used by other algorithms to calculate the wavelet coefficients that produce their features.

Additionally, MATLAB produces a file with each frame's feature vector and associated class label (vocal vs. instr) in .arff file format, a format used by machine learning software such as Weka [12]

and Rapid Miner [33].
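As an illustration of that export step, a minimal MATLAB routine for writing such a file might look as follows; the relation and attribute names are placeholders rather than the thesis' actual schema.

    function write_arff(filename, featMat, labels)
    % Write per-frame feature vectors and class labels to Weka's .arff format.
    % featMat is nFrames-by-d; labels is a cell array of 'vocal' / 'instr' strings.
        fid = fopen(filename, 'w');
        fprintf(fid, '@relation vocal_region_detection\n');
        for k = 1:size(featMat, 2)
            fprintf(fid, '@attribute coef%d numeric\n', k);
        end
        fprintf(fid, '@attribute class {vocal,instr}\n@data\n');
        for i = 1:size(featMat, 1)
            fprintf(fid, '%g,', featMat(i, :));      % comma-separated feature values
            fprintf(fid, '%s\n', labels{i});         % class label ends the line
        end
        fclose(fid);
    end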

5.1.3 Classification

Those .arff files are fed to the Java-based machine learning library Weka, version 3.6 [12]. Weka is responsible for loading the .arff files, sampling the input as necessary and performing the required classification task for the current experiment. Sampling was performed on the input due to the fact

that the number of instances grew to a point where the available memory of the machine running

the tasks was exhausted. For the experiments that required it, the input data was sampled at rates

starting at 2% and increasing to 40% in 2% increments. By 40% the performance of the classifier was no longer increasing with the number of instances used to train it. Each classifier was run with

the default settings provided by Weka.

5.1.4 Obtaining Results

Additionally, Weka generates accuracy and mean squared error results for each experiment. These

results were then compiled into the tables that can be viewed in the Results subsection under each

experiment described later in this chapter, and in their full form in the Appendix. Accuracy and


mean squared error were selected due to the fact that they are common metrics used when discussing

classification results.

This thesis consists of five experiments, each with a different scope in an effort to test the

robustness of the vocal region detection methodology. The performance of each feature vector during

each experiment offers insight into the scope, robustness and utility of Wavelet Coefficient features

in vocal region detection problems.

5.2 Experiment 1: Measuring Overall Performance

The first experiment is designed as a broad examination of the effectiveness of wavelet coefficient

features in a vocal region detection task when compared to a control feature, MFCCs. Models

are built using MFCC features, Wavelet Energy features using Daubechies (db) 1 through 4 and

MFDWC features using db1-db4. The goal of this experiment is to discover which, out of the five aforementioned classifiers (run using their default settings), performs the best in the vocal region detection task, and also, with regard to wavelet coefficient features, which wavelet out of db1-db4 performs the best. The best performing classifier and wavelets are to be used for the remaining four

experiments.

For Experiment 1, I evaluated the five classification methods using 10-fold cross-validation. All

the feature vectors for a particular artist are grouped together and divided into 10 subsamples. A

single subsample is retained as the validation data for testing the model, and the remaining nine

subsamples are used as training data. This is repeated until each subsample has been used as validation data. The 10 results are averaged to determine a single value for the performance of the system.
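For illustration only, the fold construction and averaging can be sketched in MATLAB as below; the experiments themselves performed this step inside Weka with J48, so the built-in fitctree appears here merely as a stand-in decision tree.

    function meanAcc = crossvalidate_artist(X, y, nFolds)
    % 10-fold cross-validation over one artist's frames (X: features, y: labels).
    % y is a cell array of class label strings; nFolds would be 10 here.
        n = size(X, 1);
        fold = mod(randperm(n), nFolds) + 1;           % random assignment to folds
        acc = zeros(1, nFolds);
        for k = 1:nFolds
            test  = (fold == k);
            model = fitctree(X(~test, :), y(~test));   % train on the other folds
            pred  = predict(model, X(test, :));
            acc(k) = mean(strcmp(pred, y(test)));      % fraction of frames correct
        end
        meanAcc = mean(acc);                           % single averaged performance value
    end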

5.2.1 Experiment 1 Results

As described above, Experiment 1 involves training a model using frames from a single artist, and

validating that model using 10-fold cross validation.

MFCC Features

Table 5.1 shows the accuracy and mean squared error results after running each of the five learning

algorithms using MFCC features. The numbers shown in the table represent the average of the 10


Table 5.1: Experiment 1 Results for Vocal Region Detection Using MFCC Features on Individual Frames from Single Artists

Algorithm       Accuracy    Mean Squared Error
Naïve Bayes     70.49       0.3487
JRip            75.27       0.3589
J48             75.55       0.3019
IBk             75.89       0.2948
SVM             73.67       0.2633

validation runs. As you can see from the table, with the exception of Naïve Bayes, the learning

algorithms all performed roughly the same, within 2% of each other, near 75%.

MFDWC Features

Table 5.2 shows the results of Experiment 1 using MFDWC features. Each of the five learning algorithms is trained using MFDWC features extracted by applying each of the four wavelets: db1, db2, db3, db4. In this experiment, J48 distinguishes itself by having an accuracy above 80% for

three of the four wavelets, while features extracted using db1 also tend to have the best performance.

Wavelet Energy Features

Table 5.3 shows the results of Experiment 1 using the Wavelet Energy features, where energy values for the frames are calculated using the detail coefficients that arise as a result of applying the Discrete Wavelet Transformation. Once again, J48 clearly performs the best out of the five

algorithms, while for this feature set the db4 wavelet produces the highest accuracy scores.

5.2.2 Experiment 1 Discussion

The purpose of Experiment 1 is to introduce wavelet features to the vocal region detection problem,

comparing the results to a well-known and widely-applied feature set, MFCCs. The data set used is

the artist20 data set, which, as discussed in previous sections, is a more comprehensive data set than

some of the others applied in the literature. It contains songs from 20 different artists, spanning decades, genres and vocalist genders. The best performing algorithm was IBk, with an accuracy of

75.89%. However, J48 and JRip were not far behind, posting accuracy scores of 75.55% and 75.27%,


Table 5.2: Experiment 1 Results for Vocal Region Detection Using MFDWC Features on Individual Frames from Single Artists

Algorithm       Wavelet   Accuracy    Mean Squared Error
Naïve Bayes     db1       70.26       0.3598
                db2       67.83       0.3679
                db3       68.75       0.3552
                db4       65.69       0.3893
JRip            db1       72.59       0.3784
                db2       69.86       0.4027
                db3       70.57       0.3968
                db4       66.82       0.4279
J48             db1       83.57       0.1837
                db2       81.40       0.2101
                db3       82.28       0.1987
                db4       75.76       0.2843
IBk             db1       78.26       0.2542
                db2       74.34       0.2984
                db3       74.75       0.2892
                db4       71.00       0.3249
SVM             db1       69.49       0.3056
                db2       68.49       0.3167
                db3       69.27       0.3073
                db4       65.96       0.3404


Table 5.3: Experiment 1 Results for Vocal Region Detection Using Wavelet Energy Features on Individual Frames from Single Artists

Algorithm       Wavelet   Accuracy    Mean Squared Error
Naïve Bayes     db1       59.51       0.4042
                db2       59.64       0.4032
                db3       59.81       0.4014
                db4       59.78       0.4017
JRip            db1       76.42       0.3413
                db2       77.25       0.3298
                db3       77.73       0.3272
                db4       77.97       0.3242
J48             db1       85.86       0.1566
                db2       86.23       0.1551
                db3       86.43       0.1551
                db4       86.66       0.1521
IBk             db1       75.95       0.2756
                db2       78.79       0.2459
                db3       80.53       0.2265
                db4       81.01       0.2215
SVM             db1       72.83       0.2717
                db2       73.42       0.2658
                db3       73.49       0.2651
                db4       73.51       0.2649


respectively. These values are lower than some results in the published literature [20, 31, 40]; however, the difficulty of the artist20 data set accounts for that difference, due to the fact that the number

of songs in the artist20 data set is larger than the number of songs used in [20, 31, 40] and the

songs are more diverse in terms of artist gender, genre and era. In addition to varying the learning

algorithm, for the two features that involved the DWT, the wavelet was varied in order to find the best performing wavelet for that specific feature. As you can see from Table 5.2, MFDWC features

perform below the level of MFCC features, with the lone exception of when MFDWC features were

used with the J48 learning algorithm. In that case, MFDWC features outperformed MFCC features,

and additionally, the db1 wavelet was found to be the best wavelet, posting an accuracy of 83.57%,

a difference of roughly 8% over a J48 model built using MFCCs. As you can see from Table 5.3, this

trend continues, with J48 performing the best. Additionally, the db4 wavelet provided the highest

accuracy score (86.66%), over 11% higher than the equivalent model built using MFCC features.

The results in Experiment 1 are encouraging enough to show that models built using features calculated from the DWT have the potential to outperform models built using standard MIR features, thus motivating

the research in the following four experiments. Additionally, I was able to determine that for MFDWC

features, applying the db1 wavelet produced the best results, while using the db4 wavelet with

Wavelet Energy features produced the best results. Also, J48 outperformed the other four learning

algorithms, an interesting result due to the fact that to the best of my knowledge, J48 does not

make an appearance in the vocal region detection literature. Most researchers favor Support Vector

Machines, which have a long history of use in the Speech/Music Analysis realm. My research,

however, suggests that they may not always be the best algorithm to apply. In order to validate

these results, I conducted a pilot study to determine if J48 would continue to outperform SVMs.

Experiments 2-5 were conducted on a subset of the data and SVMs were employed. In the pilot

study, SVMs were outperformed by J48 models. For example, in Experiment 3, SVMs only managed

an accuracy of 61.98% when a model was trained on Wavelet Energy features extracted from frames

from female artists and tested on frames from male artists. J48, however, in an identical experiment,

achieved an accuracy of 66.14%. These results added confidence in my utilization of J48 for the

remainder of my tests. Thus, for Experiments 2-5, J48 was used as the learning algorithm, db1 was

used in the calculation of MFDWC features, and db4 was used in the calculation of Wavelet Energy

features. The full results for Experiment 1 can be found in Appendix A.


5.3 Experiment 2: Measuring Performance Across Different Artists

Experiment 2 was designed to test how the model would perform when trained on frames from one

artist and then tested on frames from a completely different artist. Performing well on Experiment 2 would show that wavelet features are able to generalize across artists. For Experiment 2, models

are built using the MFCC features as the control features and for the Wavelet Energy and MFDWC

features, the best-performing wavelet from Experiment 1 is chosen. Models are built using features

from one artist, and then that model is tested on frames from the remaining 19 artists, one by one.

This is done for each of the 20 artists. After performing this experiment, a 20x20 matrix is produced

showing the performance of a model trained on each artist when tested with frames from each of the remaining 19 artists. Each row was then averaged to determine a final number showing the average accuracy of a model built from a single artist.

5.3.1 Experiment 2 Results

The best performing setup from Experiment 1 is selected to carry forward to the remaining four

experiments. For the MFDWC features, the db1 wavelet is chosen, while for the wavelet energy

features db4 is selected. J48 was selected as the learning algorithm, due to how well it performs

for each of the features in Experiment 1. For Experiment 2, a model was trained on features extracted from frames from one artist, and then tested on frames from a different artist. Table 5.4

summarizes the results by giving the average performance of a model built using features from the

given artist on each of the remaining 19 artists for each of the three feature sets. To see how each

artist performed on any other artist, please consult Appendix B.

5.3.2 Experiment 2 Discussion

Experiment 1 implied that wavelet features could possibly be applied to the vocal region detection

problem with better results than MFCCs. Experiment 2 was designed as a more rigorous test of that

conclusion by building models with features from one artist and testing that model with features

from a different artist. As you can see from Table 5.4, MFCC features show an improvement in

comparative performance. MFDWCs clearly perform the worst out of the three features, and the

difference between the average performance of MFCC and Wavelet Energy features never varies

by more than 5%. While on average Wavelet Energy features do outperform MFCC features, the

difference is not as stark as it was with Experiment 1. In general, the accuracy scores for Experiment


Table 5.4: Experiment 2 Results for Vocal Region Detection on Individual Frames from Different Individual Artists

MFCC MFDWC Wavelet Energies

Training Artist Accuracy M.S.E. Accuracy M.S.E. Accuracy M.S.E.

Aerosmith 60.94 0.4204 56.42 0.4611 63.36 0.3919

Beatles 63.44 0.3901 56.03 0.4557 65.09 0.3756

CCR 59.03 0.4203 56.43 0.4483 61.89 0.3951

Cure 61.18 0.4096 58.74 0.4364 63.88 0.3829

Dave Matthews Band 64.95 0.3812 58.83 0.4614 66.00 0.3736

Depeche Mode 60.13 0.4245 56.97 0.4498 61.07 0.4054

Fleetwood Mac 64.62 0.3944 59.63 0.4426 65.68 0.3715

Garth Brooks 61.95 0.4024 56.82 0.4664 62.81 0.3942

Green Day 62.29 0.4074 57.42 0.4514 63.32 0.3849

Led Zeppelin 60.37 0.4254 57.78 0.4526 63.11 0.4037

Madonna 60.25 0.4217 54.72 0.4657 60.74 0.4028

Metallica 57.34 0.4208 56.97 0.4409 60.78 0.3947

Prince 58.78 0.4519 60.08 0.4532 59.25 0.4520

Queen 56.17 0.4483 55.04 0.4643 53.94 0.4741

Radiohead 60.56 0.4039 55.96 0.4564 59.89 0.4096

Roxette 64.81 0.3997 55.82 0.4555 66.08 0.3790

Steely Dan 60.67 0.4121 58.13 0.4327 62.24 0.3986

Suzanne Vega 58.68 0.4296 55.35 0.4653 63.56 0.3889

Tori Amos 61.67 0.4344 52.26 0.4819 64.54 0.3992

U2 62.72 0.4058 57.90 0.4500 62.68 0.3917


2 are lower than those in Experiment 1 across each feature, and the highest accuracy values across

each of the five experiments were seen in Experiment 1. This is most likely due to the fact that in

Experiment 1, models were both trained by and tested against frames from a single artist. Artists

were not isolated in the remaining four experiments, showing that a mixture of artists in vocal region

detection research degrades the performance of the system.

5.4 Experiment 3: Measuring Performance Across Gender

Experiment 3 tests the generality of the features across gender. A model is trained on frames from male singers and tested on frames from female singers. Then a model is trained on frames from the four female singers and tested on frames from the remaining 16 male artists. The purpose of

this experiment was to determine if DWT features were capable of generalizing across gender. As

an additional consideration, this test introduces the possible hardware limitations of performing

these tasks. Each artist produces between 50,000 and 70,000 frames. Combining artists into a single

training or testing set quickly leads to an unmanageable number of instances, which the computer

hardware had a difficult time handling. In order to compensate for that problem, the training and testing sets are sampled, starting at 2% and moving up to 40%, stepping up by 2% with each iteration. This is done for the tasks in Experiments 3-5. At each sampling interval, an accuracy score was produced, and it was shown that the accuracy results evened out before reaching the 40% upper limit. The 40% upper limit was chosen based on empirical results from running a series of experiments with higher sampling rates that showed no improvement in performance.

5.4.1 Experiment 3 Results

Experiment 3 examines the robustness of wavelet features in the presence of gender differences.

The combination of data sets required to perform this experiment leads to unmanageably large

training and testing data sets. In order to compensate for the additional complexity introduced in

this experiment, it is performed on a series of sampled training and testing sets.

sampling rate is set at 2%, and is increased by 2% up to 40%, where the accuracy values showed

no significant increase in value. Table 5.5 shows the results of each of the three feature vectors at a

sampling rate of 20%, which is when the accuracy values begin to level out. Full results from 2% to

40% are shown in Appendix C.


Table 5.5: Experiment 3 Results for Gender-Based Vocal Region Detection at a Sampling Rate of 20%

MFCC MFDWC Wavelet Energies

Training Set Test Set Accuracy M.S.E. Accuracy M.S.E. Accuracy M.S.E.

Female Male 62.98 0.4125 55.65 0.4698 66.14 0.3957

Male Female 62.16 0.4126 52.46 0.4779 66.43 0.3820

5.4.2 Experiment 3 Discussion

Experiment 3 was a further test of the performance of the wavelet features by separating training

and test sets based on the gender of the vocalist. As you can see from Table 5.5, MFDWC features

continue to lag in performance behind MFCC and Wavelet Energy features, and that difference starts to become reasonably significant. Wavelet Energy features, however, regain a consistent advantage

over MFCCs, averaging an accuracy improvement of roughly 4%. The calculation of MFDWC fea-

tures differs from the calculation of MFCC features only in the utilization of the Discrete Wavelet Transformation instead of the Discrete Cosine Transformation. While the DWT coefficients are more localized in time and frequency than the DCT coefficients, making them more resistant to noise, the output of the DCT is orthogonal and is related to the Fourier Transform, which is used in the first step of the MFCC and MFDWC calculations. The shift that occurs with MFDWCs when moving from the Fourier domain to the multiresolution DWT domain could account for the degraded performance when compared to MFCCs, whose calculation remains consistently in the Fourier domain. Additionally, it should be noted that the Fourier Transformation

is a decomposition of a signal into a series of sines and cosines, the DCT is a decomposition into a

series of cosines and the DWT using the Daubechies wavelet family is neither.

5.5 Experiment 4: Measuring Overall Performance Across Groups of Artists

In Experiment 4, frames were taken from groups of artists, features were extracted and a model

was built. That model was then evaluated using 10-fold cross-validation. This test was designed to

determine if DWT features generalized to groups of artists. It was modeled after Experiment 1 to

generalize from single artists to multiple artists. Similar to Experiment 1, Experiment 4 is evaluated


Table 5.6: Experiment 4 Results for Vocal Region Detection on Frames From Multiple Artists at a Sampling Rate of 40%

MFCC MFDWC Wavelet Energies

Accuracy M.S.E. Accuracy M.S.E. Accuracy M.S.E.

72.12 0.3374 67.29 0.3795 73.19 0.3286

using 10-fold cross validation. In Experiment 4, frames from all 20 artists are used (sampled at rates starting from 2% to 40% with 2% increases in between) to build and test a model. The sampling rate was increased up to 40% to show that the performance of the model would level out without having

to introduce the added complexity of using the entire training set.

5.5.1 Experiment 4 Results

Experiment 4 mirrors the setup in Experiment 1, except models are trained using features from all artists and tested on features from all artists. The results shown in Table 5.6 are the average of the 10 runs during the validation step, obtained using a sampling rate of 40%, where no significant change in the accuracy values from previous runs is seen. Full results

for each sampling rate can be found in Appendix D.

5.5.2 Experiment 4 Discussion

In Experiment 4, Experiment 1 was expanded to incorporate features from multiple artists, instead

of just one. Each of the 20 artists in the data set contributed frames to the training and test sets

in Experiment 4. As you can see in Table 5.6, MFDWC features lag behind MFCC features and Wavelet Energy features by roughly 5%. MFCC features trail the Wavelet Energy features by around 1%. This is not a significant difference, but it is in keeping with the results shown in the previous

three experiments.

5.6 Experiment 5: Measuring Performance Across Different Groups of Artists

In Experiment 5, a model was trained from features extracted from frames from groups of artists,

and was tested against another set of features extracted from frames from another, distinct, group


Table 5.7: Experiment 5 Results for Vocal Region Detection on Individual Frames from Multiple, Different Artists using a 32% Sampling Rate, Average Over 10 Runs

MFCC MFDWC Wavelet Energies

Accuracy M.S.E. Accuracy M.S.E. Accuracy M.S.E.

65.25 0.3912 59.80 0.4413 66.43 0.3881

of artists. Experiment 5 was set up to determine if DWT features generalized to groups of multiple

artists. A feature that performs well in Experiment 5 suggests that few if any constraints need to be placed on that feature, as it would have performed well regardless of testing data, gender and training set makeup. Experiment 5 was designed to extend Experiment 2 to multiple artists.

Frames from a set of 10 artists are used to train a model, which is then tested against frames from

the remaining 10 artists. Artists are randomly assigned to either set, and 10 runs are

conducted for each of the three features. The results of each of the 10 runs per feature are then

averaged to get an overall accuracy score for groups of frames across multiple artists.

5.6.1 Experiment 5 Results

As Experiment 4 extends Experiment 1, Experiment 5 extends Experiment 2. In Experiment 5, 10

artists are chosen at random to be the training set, while the remaining 10 are chosen to be the test

set. In total, 10 runs are performed; with each run, a new training and test set are generated and

tested. The results of the 10 runs are averaged and shown in Table 5.7 for a sampling rate of 32%.

As in the previous experiments, the sampling rate was initially set at 2%, and on subsequent runs it was increased in 2% intervals up to 40%. By 40%, the model's performance leveled off, showing no improvement with the addition of training data, thereby decreasing the complexity

of the experiment. The full results can be found in Appendix E.

5.6.2 Experiment 5 Discussion

Finally, Experiment 5, shown in Table 5.7, extends Experiment 2 by using 10 artists for both the

training and testing sets. Once again, MFDWC features trail MFCC and Wavelet Energy features, and again, Wavelet Energy features hold a performance advantage over MFCC features, this time by


over 1%. The consistent performance of the Wavelet Energy features shows that they are stable features that offer improvements over standard MFCC features in a vocal region detection task.


Chapter 6

Conclusion

In the preceding chapter I gave results for each of the five experiments designed to evaluate the methodology described in Chapter 4. This chapter offers some conclusions that can be drawn from

those results. First, I will discuss the four contributions of this thesis, then I will discuss some

limitations of this research, before concluding with some ideas for the future direction of research in

vocal region detection.

6.1 Contributions

This thesis offers contributions in four areas: feature sets used for vocal region detection, data set

annotation and use in vocal region detection, an evaluation of a machine learning methodology

for using the DWT in vocal region detection and learning algorithm selection in the vocal region

detection task.

6.1.1 Feature Set Selection for Vocal Region Detection

To the best of my knowledge, no other researchers have employed the Wavelet Transformation in

the same manner as I have. The researchers who developed the MFDWC features [11] tested their

features on a speech corpus, while those who developed the Wavelet Energy features [8] used a data

set consisting of radio broadcasts that included a heavy focus on speech. Meanwhile, most of the

researchers interested in vocal region detection in music and other music-related tasks used MFCC

features. However, the performance of my wavelet-related features suggests alternatives to MFCCs. In

the first experiment, MFDWC features outperformed MFCC features, suggesting that improvements

can be made with regards to the MFCC algorithm's use of the Discrete Cosine Transformation to

possibly improve performance of a vocal region detection system. While MFCCs outperformed the MFDWC features in the other four experiments, the difference in the first test supports the idea that the coefficient localization property of the DWT does offer some benefits to researchers. The


fact that the Wavelet Energy features outperformed MFCC features in all five experiments further illustrates the importance of localized coefficients. The real power of the DWT is its multiresolution analysis approach, removing the fixed-window constraint imposed by the MFCC algorithm's use of the Short-Time Fourier Transform. This offers a better means of tracking the frequency changes in a

signal over time, and that improvement is supported by my results, where I saw the Wavelet Energy

features outperform MFCC features in all tasks. While MFCCs have a demonstrated place in the

analysis of music signals, the fact that their limitations can be overcome by using the DWT offers

an intriguing alternative for researchers wishing to further improve their algorithms.

6.1.2 Data Sets for Vocal Region Detection

As discussed in previous sections, the selection of which data set to use is a crucial first step in designing an experiment. The old adage "garbage in equals garbage out" holds true when discussing

the relevancy of results in any research task, and it is particularly important in the case of vocal

region detection. Researchers must avoid data sets that are too fine-tuned to their experimental method. Any methods developed and tested against one data set ideally should be equipped to handle a new data set without seeing any meaningful degradation in performance. For my thesis,

I chose to use the artist20 data set due to the fact that it has been used in previous research [10]

and represented a range of different artists, artist genders, musical genres and musical eras. The

temptation to develop and test against my own data set was avoided in order to prevent biasing the

system to the point where the results of my experiment could not be reasonably compared to other

research. Unfortunately, there does not exist a standard data set around which all researchers interested in vocal region detection could build their experiments. Additionally, artist20 contains full songs, as opposed to some data sets which only contain clips of songs [26, 34, 41, 45]. The benefit of using the whole song, as opposed to a 30-second (or similar-length) clip, is that a significant amount of information is lost when only a small percentage of the whole signal is used. When using a clip, the researchers must choose very carefully which portion of the song to use without introducing a

bias. Therefore, lacking such a data set, I chose to select one, artist20, that had previously been

used in other areas of Music Information Retrieval. While not a perfect data set selection,

artist20 is attractive because it is larger than most data sets that I was aware of at the time of my

research [20, 26, 27, 31, 41, 45], it covers a reasonably wide range of vocal features and styles, and

has previously been used and shown to be a useful selection. Artist20 was selected so that future


researchers looking to tie their research to a previously evaluated data set have one such data set

available. Additionally, the subset of artist20 that was used in this thesis was annotated and will be

made publicly available for use by future researchers interested in the vocal region detection task.

6.1.3 Experimental Evaluation

The presented methodology is based on the standard machine learning / knowledge discovery and

data mining methodology: gather training data, extract features, build a model, gather testing data,

extract features, test against the model and report the results. That standard approach is utilized to

take advantage of the Discrete Wavelet Transformation in the task of automatically detecting vocal

regions in music. The algorithm developed in this thesis is an extensible, comprehensive method for

detecting vocal regions in music, using the Wavelet Transformation. The fact that MFDWC features

outperformed MFCC features in the first test but failed to replicate that success in subsequent

tests suggests that while there might be a useful application of MFDWC features, they still come

up short in broader vocal region detection research when compared to MFCC features. The fact

that they come up short exposes a weakness inherent in having too limited an experimental

scope when examining a new feature. Wavelet Energy features, however, demonstrated just the

opposite. By outperforming MFCC features in all five experimental setups, Wavelet Energy features

showed generality in their application to the vocal region detection problem. This generality shows

that the Wavelet Energy features are a stable classification feature. In addition, the fact that the

Wavelet Energy features for the most part performed well when applied to each of the selected

learning algorithms further supports that generality statement. This suggests that the Discrete

Wavelet Transformation can be applied in a number of related research areas and possibly continue

to perform well. Having an extensive experimental methodology gives valuable context to the results

being shown.

6.1.4 Learning Algorithm Selection

Perhaps the most interesting conclusion drawn from this research was the fact that a standard

MIR learning algorithm, Support Vector Machines, was outperformed by a lesser-applied decision

tree algorithm, J48. That this was identified in my research is surprising, due to the fact

that SVMs are very common in MIR research. Going back to the experimental evaluation, this is

an important conclusion drawn only by expanding the scope of my research. Five starkly different


algorithms were chosen in Experiment 1. By having all three feature sets applied to each learning

algorithm, I was able to demonstrate the stability of the wavelet features, as discussed in the previous

section, while also showing that an uncommon approach can sometimes lead to interesting results.

As such, it is interesting to see SVMs outperformed in an area where they are a dominant algo-

rithm. Using an SVM classi�er in the vocal region detection task is useful because SVMs are able

to appropriately handle high dimension data, and SVMs have been used in previous research with

consistently solid results [16, 20, 24, 31]. Despite these stated advantages of SVMs, they were consis-

tently outperformed by the less frequency used decision tree algorithm, J48. There is no universally

best learning algorithm; each comes with its pluses and minuses, and each works better on di�erent

data, attacking di�erent problems. While SVMs have a strong hold in Music Information Retrieval

speci�cally and signal analysis in general [16, 20, 21, 24, 31, 34] this thesis o�ers the suggestion that

there might be other algorithms that may be better adapted to MIR problems. Speci�cally, decision

trees prove to be well adapted to Wavelet features and vocal region detection.

6.2 Limitations

One of the things to consider when evaluating this research is the choice to proceed with a decision tree algorithm for Experiments 2-5. SVMs are one of the learning algorithms of choice in MIR [16, 20, 21, 24, 31, 34], and decision trees have limitations. While every learning algorithm has its drawbacks, it is important to consider those drawbacks when attempting to draw conclusions from research. Decision tree algorithms are unstable, and when performing a literature review on the state of vocal region detection, I did not come across a single paper that used decision trees. I attempted to mitigate those concerns by running a pilot study for Experiments 2-5 using a subset of my data and SVMs. In every case, J48 continued to outperform SVMs, which suggests that, for the purposes of my experiments, decision trees are an acceptable choice. Also, while I selected five widely diverse learning algorithms, additional approaches exist, such as Gaussian Mixture Models, Hidden Markov Models, and Neural Networks, that could improve the performance of my system.
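A pilot study of this kind can be approximated in a few lines; the sketch below is illustrative only (a CART tree versus an RBF-kernel SVM under cross-validation on a hypothetical feature matrix X with labels y), not the Weka configuration actually used in the experiments.

    # Illustrative pilot comparison of a decision tree against an SVM.
    # X is a hypothetical (n_windows, n_features) matrix of wavelet features,
    # y the corresponding vocal / non-vocal labels.
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    def pilot_comparison(X, y, folds=10):
        tree_acc = cross_val_score(DecisionTreeClassifier(), X, y, cv=folds).mean()
        svm_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=folds).mean()
        print(f"Decision tree accuracy: {tree_acc:.3f}")
        print(f"SVM (RBF) accuracy:     {svm_acc:.3f}")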

Another shortcoming is that I chose to use only the Daubechies wavelet family. There exist a multitude of wavelet families, all with different properties that make them attractive to researchers seeking to solve a variety of problems. The conclusions I have drawn from my research suggest that it would be a worthwhile endeavor to try to find a better wavelet family for use in the vocal region detection problem. However, the scope of my research was to demonstrate that wavelets could be applied to this particular problem, which I believe I have done, and to offer a jumping-off point for future research. Daubechies wavelets were chosen due to their demonstrated usability in the speech analysis domain and due to the wide overlap between speech and music analysis, so they appeared to be a sensible choice.

An additional shortcoming was the limited scope of my feature extraction method. Many researchers further optimize their features in a number of ways, including pre-processing their data differently, combining a larger set of features into one diverse feature vector, or combining or chaining classifiers to improve performance. Some performance gains could likely be achieved by adopting some or all of these methods. However, I set out to show that the DWT has a useful application in the vocal region detection domain, and I believe I have done that. Further research, discussed in the next section, could build on what I have shown.

Finally, a limitation of my research goes back to the selection of my data set. While artist20 is an attractive data set to use, it is not a research standard, making it difficult to compare results. Additionally, further improvements could be made to the data to include even more diverse music, perhaps including rap, folk, or Eastern music.

6.3 Future Work

While I was able to successfully demonstrate the utility of the DWT in detecting vocal regions in music, there are many areas of research that could build on what has been done here. As mentioned in the previous sections, other data sets could be used within this methodology to demonstrate how well it performs, as well as to tie this work to previous research. Also, more wavelet families, such as Symlets and Coiflets [8, 11, 43], could be used to extract features and address the limitation of using only Daubechies wavelets. Additionally, while MFCCs are a standard feature in MIR research, they are by no means the only features used; wavelets used in conjunction with other features, such as pitch, timbre, and zero-crossing rate, could further improve the performance of this system. The gains seen in this research could also be broadly expanded to cover more areas within MIR, such as singer identification, audio fingerprinting, and genre identification, among many others. And finally, this research employed a classification approach to solve this problem; other approaches, such as clustering or parameter estimation in a semi-supervised environment, have additional applicability and should be pursued further.
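As a pointer for that future work, swapping wavelet families is typically a one-line change in wavelet libraries. The sketch below assumes PyWavelets purely for illustration and computes relative subband energies, in the spirit of the Wavelet Energy features used in this thesis, for a Daubechies, a Symlet, and a Coiflet wavelet on a placeholder window of audio samples.

    # Sketch: the same subband-energy feature extraction with different families.
    import numpy as np
    import pywt

    def subband_energies(window, wavelet="db4", level=4):
        """Relative energy of each DWT subband for the given wavelet family."""
        coeffs = pywt.wavedec(window, wavelet, level=level)
        energies = np.array([np.sum(c ** 2) for c in coeffs])
        return energies / energies.sum()

    window = np.random.randn(2048)            # placeholder audio window
    for name in ("db4", "sym4", "coif2"):     # Daubechies, Symlet, Coiflet
        print(name, subband_energies(window, wavelet=name))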

6.4 Summary

In this thesis, I evaluated a methodology for detecting vocal regions in music that uses the Discrete Wavelet Transformation to calculate features that improve on traditional features in a classification system. Results from a rigorous empirical evaluation suggest that the properties of the Wavelet Transformation, namely coefficient localization, the multiresolution analysis approach, and the absence of a time-frequency resolution trade-off, allow Mel-Frequency Discrete Wavelet Coefficient and Wavelet Energy features calculated from the Discrete Wavelet Transformation to achieve higher classification accuracy than a standard audio analysis feature, Mel-Frequency Cepstral Coefficients, when distinguishing between vocal and non-vocal regions in music. Additionally, this thesis shows that a commonly used classification algorithm, Support Vector Machines, is outperformed by a less frequently applied decision tree classification algorithm. Reliable methods for detecting vocal regions will lead to better analysis results for researchers interested in other areas of Music Information Retrieval, such as singer identification, that depend on reliable vocal region detection to achieve optimal results.


Appendix A

Complete Experiment 1 Results

Table A.1: Complete Experiment 1 Results Using MFCC Features

Artist Algorithm Accuracy Mean Squared Error

Naïve Bayes 67.68 0.3793

JRip 73.19 0.3765

Aerosmith J48 72.40 0.3342

IBk 72.85 0.3314

SVM 73.85 0.2615

Naïve Bayes 75.44 0.2752

JRip 78.51 0.3111

Beatles J48 78.64 0.2594

IBk 78.86 0.2665

SVM 72.27 0.2773

Naïve Bayes 81.93 0.2654

JRip 83.98 0.2551

CCR J48 83.41 0.2203

IBk 86.39 0.1844

SVM 81.68 0.1832

Naïve Bayes 71.90 0.3549

JRip 79.83 0.3118

Cure J48 79.96 0.2664

IBk 79.87 0.2598

SVM 78.14 0.2186

Naïve Bayes 71.27 0.3419

JRip 75.84 0.3529

Dave Matthews Band J48 76.00 0.2993

IBk 75.95 0.2988

SVM 75.10 0.2490

Naïve Bayes 68.06 0.3768

JRip 73.64 0.3734

Depeche Mode J48 72.89 0.3266

IBk 75.65 0.2984


SVM 73.63 0.2637

Naïve Bayes 66.69 0.3789

JRip 72.22 0.3897

Fleetwood Mac J48 73.42 0.3338

IBk 68.74 0.3636

SVM 70.60 0.2940

Naïve Bayes 74.02 0.3114

JRip 78.04 0.3223

Garth Brooks J48 78.39 0.2575

IBk 80.05 0.2514

SVM 77.18 0.2282

Naïve Bayes 66.22 0.3665

JRip 73.51 0.3714

Green Day J48 73.85 0.3251

IBk 72.30 0.3292

SVM 71.46 0.2854

Naïve Bayes 65.98 0.3865

JRip 73.42 0.3771

Led Zeppelin J48 74.10 0.3229

IBk 74.96 0.3014

SVM 71.37 0.2863

Naïve Bayes 71.13 0.3411

JRip 74.88 0.3605

Madonna J48 75.47 0.3053

IBk 74.95 0.3082

SVM 70.91 0.2909

Naïve Bayes 82.18 0.2492

JRip 85.33 0.2384

Metallica J48 85.03 0.2076

IBk 84.97 0.2082

SVM 82.30 0.1770

Naïve Bayes 65.13 0.4133

JRip 66.99 0.4352

Prince J48 67.98 0.3848

IBk 67.97 0.3694

SVM 66.70 0.3330

Naïve Bayes 67.93 0.3646

JRip 70.11 0.4018

Queen J48 72.07 0.3248

IBk 76.32 0.2871


SVM 71.01 0.2899

Naïve Bayes 70.98 0.3226

JRip 80.13 0.2975

Radiohead J48 80.16 0.2362

IBk 80.66 0.2426

SVM 77.75 0.2225

Naïve Bayes 65.52 0.3872

JRip 70.26 0.4083

Roxette J48 71.33 0.3514

IBk 69.20 0.3577

SVM 69.03 0.3097

Naïve Bayes 75.61 0.3210

JRip 78.17 0.3257

Steely Dan J48 78.05 0.2709

IBk 79.36 0.2636

SVM 77.42 0.2258

Naïve Bayes 68.37 0.3792

JRip 74.44 0.3700

Suzanne Vega J48 73.52 0.3212

IBk 75.34 0.3035

SVM 72.16 0.2784

Naïve Bayes 69.16 0.3636

JRip 73.36 0.3795

Tori Amos J48 74.53 0.3242

IBk 74.91 0.3050

SVM 70.58 0.2942

Naïve Bayes 64.66 0.3952

JRip 69.49 0.4187

U2 J48 69.75 0.3669

IBk 68.41 0.3656

SVM 70.32 0.2968

Table A.2: Complete Experiment 1 Results Using MFDWC Features and Naïve Bayes

Artist Wavelet Accuracy Mean Squared Error

db1 69.53 0.3772

db2 67.42 0.3678

Aerosmith db3 68.03 0.3606

db4 65.61 0.3982

db1 71.46 0.3146

db2 69.59 0.3286

Beatles db3 71.77 0.3073

db4 67.63 0.3473

db1 82.40 0.2369

db2 80.84 0.2405

CCR db3 80.68 0.2339

db4 79.83 0.2426

db1 77.80 0.2995

db2 77.16 0.2829

Cure db3 77.60 0.2799

db4 74.96 0.3292

db1 67.20 0.4019

db2 64.44 0.4158

Dave Matthews Band db3 66.31 0.3961

db4 59.68 0.4508

db1 67.91 0.3790

db2 69.54 0.3500

Depeche Mode db3 68.35 0.3567

db4 65.01 0.4045

db1 63.42 0.4248

db2 58.78 0.4529

Fleetwood Mac db3 61.60 0.4242

db4 57.10 0.4636

db1 72.45 0.3250

db2 70.69 0.3112

Garth Brooks db3 70.09 0.3173

db4 62.76 0.3861

db1 70.11 0.3664

db2 65.60 0.3922

Green Day db3 69.50 0.3544

db4 61.38 0.4365


db1 69.30 0.3824

db2 67.33 0.4122

Led Zeppelin db3 67.93 0.3958

db4 67.16 0.4120

db1 67.14 0.3839

db2 64.92 0.4001

Madonna db3 65.31 0.3797

db4 63.02 0.4100

db1 82.76 0.2282

db2 80.27 0.2478

Metallica db3 81.06 0.2394

db4 80.96 0.2634

db1 62.74 0.4428

db2 60.47 0.4450

Prince db3 61.66 0.4324

db4 58.49 0.4646

db1 68.95 0.3560

db2 65.37 0.3665

Queen db3 66.28 0.3554

db4 64.14 0.3725

db1 71.49 0.3357

db2 68.14 0.3521

Radiohead db3 68.48 0.3450

db4 67.74 0.3483

db1 65.44 0.4049

db2 61.38 0.4225

Roxette db3 62.55 0.4086

db4 60.48 0.4282

db1 74.99 0.3317

db2 72.46 0.3334

Steely Dan db3 73.96 0.3176

db4 69.85 0.3755

db1 68.65 0.3929

db2 64.44 0.4171

Suzanne Vega db3 64.67 0.4117

db4 61.99 0.4316

db1 66.59 0.3997

db2 64.36 0.4099

Tori Amos db3 65.10 0.3909

db4 64.25 0.4067


db1 64.95 0.4132

db2 63.34 0.4104

U2 db3 64.09 0.3979

db4 61.74 0.4149

Table A.3: Complete Experiment 1 Results Using MFDWC Features and J48

Artist Wavelet Accuracy Mean Squared Error

db1 82.82 0.1944

db2 81.71 0.2108

Aerosmith db3 82.60 0.1958

db4 77.09 0.2701

db1 83.95 0.1775

db2 82.02 0.1996

Beatles db3 82.89 0.1886

db4 77.33 0.2568

db1 89.76 0.1245

db2 88.29 0.1453

CCR db3 88.39 0.1416

db4 86.26 0.1830

db1 85.69 0.1721

db2 85.36 0.1807

Cure db3 85.29 0.1813

db4 80.92 0.2618

db1 81.58 0.2033

db2 79.67 0.2238

Dave Matthews Band db3 81.34 0.2031

db4 70.96 0.3218

db1 83.25 0.1848

db2 82.38 0.2004

Depeche Mode db3 82.06 0.1996

db4 74.47 0.3004

db1 79.15 0.2269

db2 75.18 0.2709

Fleetwood Mac db3 78.39 0.2347

db4 67.33 0.3597

db1 85.18 0.1621

db2 82.99 0.1880

Garth Brooks db3 83.03 0.1866

db4 74.34 0.2874

db1 83.36 0.1871

db2 80.63 0.2150

Green Day db3 82.48 0.1958

db4 74.21 0.2977


db1 84.03 0.1797

db2 81.04 0.2187

Led Zeppelin db3 81.85 0.2078

db4 76.64 0.2957

db1 82.49 0.1907

db2 80.47 0.2136

Madonna db3 81.23 0.2044

db4 74.35 0.2915

db1 89.22 0.1319

db2 88.38 0.1459

Metallica db3 88.87 0.1365

db4 85.88 0.1896

db1 78.21 0.2458

db2 75.31 0.2816

Prince db3 76.30 0.2690

db4 68.19 0.3639

db1 83.81 0.1767

db2 80.19 0.2190

Queen db3 81.72 0.2011

db4 74.88 0.2900

db1 86.61 0.1487

db2 83.98 0.1801

Radiohead db3 84.04 0.1779

db4 80.29 0.2328

db1 80.43 0.2128

db2 77.80 0.2411

Roxette db3 78.64 0.2328

db4 71.01 0.3263

db1 85.44 0.1668

db2 83.45 0.1927

Steely Dan db3 84.08 0.1832

db4 78.93 0.2634

db1 83.32 0.1843

db2 80.15 0.2228

Suzanne Vega db3 81.18 0.2096

db4 75.06 0.2909

db1 83.38 0.1827

db2 80.56 0.2135

Tori Amos db3 81.72 0.1995

db4 75.72 0.2777


db1 79.81 0.2203

db2 78.38 0.2383

U2 db3 79.50 0.2244

db4 71.25 0.3253

Table A.4: Complete Experiment 1 Results Using MFDWC Features and SVM

Artist Wavelet Accuracy Mean Squared Error

db1 69.21 0.3079

db2 69.52 0.3049

Aerosmith db3 70.11 0.2989

db4 66.11 0.3389

db1 71.06 0.2984

db2 69.91 0.3009

Beatles db3 71.72 0.2828

db4 68.52 0.3148

db1 82.36 0.1764

db2 81.32 0.1868

CCR db3 80.49 0.1951

db4 79.89 0.2011

db1 75.99 0.2401

db2 75.99 0.2401

Cure db3 75.99 0.2401

db4 75.99 0.2401

db1 66.60 0.3340

db2 65.34 0.3466

Dave Matthews Band db3 67.63 0.3237

db4 60.86 0.3914

db1 69.48 0.3052

db2 70.63 0.2937

Depeche Mode db3 69.72 0.3028

db4 61.36 0.3864

db1 61.69 0.3831

db2 59.29 0.4071

Fleetwood Mac db3 62.75 0.3725

db4 57.92 0.4208

db1 72.48 0.2752

db2 72.25 0.2775

Garth Brooks db3 72.67 0.2733

db4 64.55 0.3545

db1 69.36 0.3064

db2 65.18 0.3482

Green Day db3 69.37 0.3063

db4 63.03 0.3697


db1 69.19 0.3081

db2 65.05 0.3495

Led Zeppelin db3 65.05 0.3495

db4 65.05 0.3495

db1 64.54 0.3546

db2 64.54 0.3546

Madonna db3 64.54 0.3546

db4 64.54 0.3546

db1 80.43 0.1957

db2 80.47 0.1953

Metallica db3 82.02 0.1798

db4 80.43 0.1957

db1 62.22 0.3778

db2 60.29 0.3971

Prince db3 61.28 0.3872

db4 55.94 0.4406

db1 71.19 0.2881

db2 69.09 0.3091

Queen db3 69.04 0.3096

db4 66.21 0.3379

db1 69.21 0.3079

db2 69.66 0.3034

Radiohead db3 69.49 0.3051

db4 68.51 0.3149

db1 63.57 0.3643

db2 63.44 0.3656

Roxette db3 63.88 0.3612

db4 61.72 0.3828

db1 73.25 0.2675

db2 73.21 0.2679

Steely Dan db3 73.48 0.2652

db4 69.94 0.3006

db1 68.63 0.3137

db2 65.08 0.3793

Suzanne Vega db3 66.08 0.3392

db4 61.69 0.3831

db1 64.80 0.3520

db2 64.80 0.3520

Tori Amos db3 64.80 0.3520

db4 64.80 0.3520


db1 64.45 0.3555

db2 64.65 0.3535

U2 db3 65.31 0.3469

db4 62.14 0.3786

Table A.5: Complete Experiment 1 Results Using MFDWC Features and JRip

Artist Wavelet Accuracy Mean Squared Error

db1 71.62 0.3882

db2 70.95 0.3969

Aerosmith db3 71.19 0.3925

db4 66.80 0.4347

db1 74.37 0.3647

db2 70.98 0.3969

Beatles db3 72.28 0.3850

db4 68.52 0.4223

db1 85.07 0.2399

db2 83.31 0.2656

CCR db3 82.90 0.2722

db4 81.51 0.2900

db1 79.62 0.3167

db2 79.75 0.3163

Cure db3 79.82 0.3153

db4 77.17 0.3488

db1 66.62 0.4386

db2 64.11 0.4554

Dave Matthews Band db3 65.68 0.4464

db4 58.70 0.4824

db1 72.16 0.3875

db2 71.28 0.3972

Depeche Mode db3 70.84 0.4001

db4 66.34 0.4415

db1 63.18 0.4619

db2 58.41 0.4851

Fleetwood Mac db3 59.99 0.4781

db4 56.18 0.4920

db1 74.90 0.3603

db2 71.60 0.3958

Garth Brooks db3 71.19 0.3991

db4 63.45 0.4588

db1 72.91 0.3699

db2 68.68 0.4153

Green Day db3 71.23 0.3908

db4 62.32 0.4622


db1 73.45 0.3781

db2 70.01 0.4088

Led Zeppelin db3 70.50 0.4066

db4 67.87 0.4307

db1 70.29 0.4043

db2 66.35 0.4398

Madonna db3 66.95 0.4302

db4 65.72 0.4457

db1 84.89 0.2476

db2 84.07 0.2589

Metallica db3 84.36 0.2529

db4 82.19 0.2866

db1 64.34 0.4523

db2 61.18 0.4708

Prince db3 62.56 0.4638

db4 58.56 0.4827

db1 72.28 0.3876

db2 66.94 0.4378

Queen db3 68.12 0.4288

db4 65.36 0.4494

db1 76.17 0.3424

db2 73.32 0.3734

Radiohead db3 73.24 0.3709

db4 70.89 0.3965

db1 65.52 0.4459

db2 61.86 0.4669

Roxette db3 62.56 0.4622

db4 60.66 0.4729

db1 76.98 0.3420

db2 74.99 0.3636

Steely Dan db3 75.67 0.3567

db4 71.13 0.4051

db1 71.28 0.3911

db2 67.93 0.4203

Suzanne Vega db3 68.33 0.4193

db4 65.26 0.4452

db1 70.85 0.4040

db2 67.44 0.4324

Tori Amos db3 68.38 0.4241

db4 66.68 0.4383


db1 65.24 0.4446

db2 64.06 0.4560

U2 db3 65.69 0.4405

db4 61.15 0.4727

Table A.6: Complete Experiment 1 Results Using MFDWC Features and IBk

Artist Wavelet Accuracy Mean Squared Error

db1 75.67 0.2789

db2 73.57 0.3052

Aerosmith db3 73.00 0.3063

db4 70.23 0.3355

db1 82.40 0.2193

db2 78.13 0.2597

Beatles db3 78.57 0.2632

db4 74.34 0.2994

db1 87.29 0.1552

db2 84.42 0.1833

CCR db3 84.02 0.1866

db4 81.41 0.2174

db1 80.74 0.2251

db2 79.70 0.2370

Cure db3 79.39 0.2396

db4 76.05 0.2750

db1 75.36 0.2848

db2 70.48 0.3337

Dave Matthews Band db3 71.73 0.3222

db4 66.35 0.3711

db1 78.22 0.2550

db2 75.22 0.2856

Depeche Mode db3 74.43 0.2935

db4 70.01 0.3362

db1 71.78 0.3249

db2 66.96 0.3674

Fleetwood Mac db3 68.11 0.3536

db4 64.16 0.3885

db1 81.53 0.2234

db2 76.82 0.2736

Garth Brooks db3 76.18 0.2742

db4 70.29 0.3313

db1 75.01 0.2869

db2 70.64 0.3305

Green Day db3 72.88 0.3093

db4 67.15 0.3626


db1 77.64 0.2546

db2 73.68 0.2993

Led Zeppelin db3 73.98 0.2948

db4 70.18 0.3296

db1 76.44 0.2771

db2 71.76 0.3233

Madonna db3 72.59 0.3161

db4 69.72 0.3416

db1 85.50 0.1718

db2 83.68 0.1913

Metallica db3 83.99 0.1871

db4 81.32 0.2185

db1 70.36 0.3351

db2 66.49 0.3682

Prince db3 67.60 0.3600

db4 64.91 0.3823

db1 79.52 0.2424

db2 73.71 0.3032

Queen db3 74.55 0.2934

db4 70.61 0.3298

db1 83.71 0.1952

db2 78.36 0.2523

Radiohead db3 79.01 0.2461

db4 74.68 0.2878

db1 73.55 0.3032

db2 69.55 0.4347

Roxette db3 69.97 0.3379

db4 67.12 0.3625

db1 81.33 0.2232

db2 76.86 0.2685

Steely Dan db3 77.13 0.2632

db4 73.45 0.3018

db1 78.48 0.2544

db2 73.09 0.3055

Suzanne Vega db3 73.32 0.3050

db4 69.35 0.3411

db1 78.62 0.2556

db2 73.97 0.3017

Tori Amos db3 74.88 0.2929

db4 71.36 0.3252


db1 72.02 0.3184

db2 69.66 0.3435

U2 db3 69.72 0.3385

db4 67.39 0.3599

Table A.7: Complete Experiment 1 Results Using Wavelet Energy Features and Naïve Bayes

Artist Wavelet Accuracy Mean Squared Error

db1 56.30 0.4361

db2 57.17 0.4279

Aerosmith db3 57.56 0.4242

db4 57.38 0.4258

db1 74.80 0.2522

db2 74.25 0.2577

Beatles db3 74.04 0.2592

db4 73.91 0.2609

db1 44.87 0.5421

db2 44.74 0.5477

CCR db3 44.82 0.5465

db4 44.95 0.5446

db1 41.41 0.5855

db2 40.83 0.5914

Cure db3 40.85 0.5911

db4 40.82 0.5912

db1 66.64 0.3337

db2 66.85 0.3314

Dave Matthews Band db3 66.91 0.3308

db4 66.84 0.3316

db1 54.86 0.4511

db2 51.97 0.4801

Depeche Mode db3 52.37 0.4761

db4 51.94 0.4806

db1 62.88 0.3711

db2 63.12 0.3685

Fleetwood Mac db3 63.60 0.3641

db4 63.46 0.3650

db1 65.85 0.3415

db2 65.25 0.3475

Garth Brooks db3 64.96 0.3504

db4 64.75 0.3521

db1 55.60 0.4441

db2 56.05 0.4396

Green Day db3 56.26 0.4375

db4 56.36 0.4367


db1 52.15 0.4778

db2 52.06 0.4785

Led Zeppelin db3 52.09 0.4784

db4 51.92 0.4797

db1 68.52 0.3150

db2 68.84 0.3117

Madonna db3 68.96 0.3101

db4 68.98 0.3099

db1 57.30 0.4240

db2 56.90 0.4290

Metallica db3 56.20 0.4343

db4 56.03 0.4371

db1 59.48 0.4045

db2 60.07 0.3988

Prince db3 60.10 0.3981

db4 60.20 0.3970

db1 59.84 0.4015

db2 60.45 0.3959

Queen db3 60.79 0.3925

db4 60.86 0.3921

db1 69.02 0.3102

db2 70.95 0.2907

Radiohead db3 71.69 0.2831

db4 72.02 0.2802

db1 63.90 0.3611

db2 64.01 0.3599

Roxette db3 64.34 0.3572

db4 64.25 0.3579

db1 53.11 0.4684

db2 54.49 0.4550

Steely Dan db3 55.22 0.4481

db4 55.35 0.4461

db1 55.18 0.4482

db2 55.50 0.4452

Suzanne Vega db3 55.76 0.4430

db4 55.91 0.4411

db1 68.95 0.3109

db2 69.46 0.3054

Tori Amos db3 69.73 0.3030

db4 69.74 0.3036


db1 59.45 0.4055

db2 59.86 0.4013

U2 db3 60.02 0.4000

db4 60.01 0.4002

Table A.8: Complete Experiment 1 Results Using Wavelet Energy Features and J48

Artist Wavelet Accuracy Mean Squared Error

db1 84.00 0.1778

db2 84.57 0.1734

Aerosmith db3 84.36 0.1790

db4 84.55 0.1757

db1 87.85 0.1365

db2 87.97 0.1357

Beatles db3 88.31 0.1343

db4 88.34 0.1329

db1 87.45 0.1447

db2 88.38 0.1366

CCR db3 89.12 0.1284

db4 89.14 0.1272

db1 87.11 0.1509

db2 87.59 0.1460

Cure db3 87.81 0.1464

db4 87.95 0.1434

db1 86.02 0.1539

db2 86.39 0.1526

Dave Matthews Band db3 86.43 0.1561

db4 86.55 0.1536

db1 83.89 0.1751

db2 84.57 0.1700

Depeche Mode db3 84.54 0.1736

db4 85.14 0.1663

db1 84.65 0.1716

db2 85.17 0.1674

Fleetwood Mac db3 85.48 0.1671

db4 85.81 0.1633

db1 88.23 0.1293

db2 88.70 0.1279

Garth Brooks db3 88.95 0.1258

db4 89.18 0.1236

db1 86.91 0.1441

db2 87.37 0.1420

Green Day db3 87.79 0.1403

db4 88.09 0.1369


db1 85.18 0.1640

db2 85.66 0.1603

Led Zeppelin db3 86.03 0.1591

db4 86.01 0.1606

db1 86.87 0.1444

db2 87.18 0.1453

Madonna db3 87.18 0.1462

db4 87.41 0.1422

db1 89.63 0.1207

db2 90.03 0.1163

Metallica db3 90.29 0.1170

db4 90.52 0.1129

db1 82.08 0.1980

db2 81.51 0.2096

Prince db3 81.36 0.2155

db4 81.38 0.2140

db1 83.19 0.1798

db2 83.96 0.1746

Queen db3 83.94 0.1770

db4 84.55 0.1702

db1 89.58 0.1153

db2 89.74 0.1150

Radiohead db3 90.25 0.1103

db4 90.50 0.1076

db1 84.95 0.1659

db2 84.74 0.1716

Roxette db3 84.58 0.1748

db4 84.89 0.1712

db1 86.54 0.1468

db2 86.78 0.1477

Steely Dan db3 87.40 0.1425

db4 87.45 0.1413

db1 84.90 0.1655

db2 85.75 0.1586

Suzanne Vega db3 86.04 0.1567

db4 86.55 0.1533

db1 84.84 0.1690

db2 84.96 0.1705

Tori Amos db3 85.40 0.1654

db4 85.86 0.1604


db1 83.42 0.1795

db2 83.61 0.1811

U2 db3 83.28 0.1860

db4 83.38 0.1853

Table A.9: Complete Experiment 1 Results Using Wavelet Energy Features and SVM

Artist Wavelet Accuracy Mean Squared Error

db1 72.18 0.2782

db2 72.98 0.2702

Aerosmith db3 72.99 0.2701

db4 73.03 0.2697

db1 75.94 0.2406

db2 76.09 0.2391

Beatles db3 75.91 0.2409

db4 75.83 0.2417

db1 76.52 0.2348

db2 76.52 0.2348

CCR db3 76.52 0.2348

db4 76.52 0.2348

db1 75.99 0.2401

db2 75.99 0.2401

Cure db3 75.99 0.2401

db4 75.99 0.2401

db1 74.34 0.2566

db2 75.65 0.2435

Dave Matthews Band db3 75.74 0.2426

db4 75.89 0.2411

db1 68.33 0.3167

db2 69.44 0.3056

Depeche Mode db3 70.83 0.2917

db4 71.29 0.2871

db1 73.59 0.2641

db2 74.03 0.2597

Fleetwood Mac db3 74.10 0.2590

db4 74.27 0.2573

db1 75.19 0.2481

db2 75.51 0.2449

Garth Brooks db3 74.70 0.2530

db4 74.09 0.2591

db1 74.43 0.2557

db2 75.37 0.2463

Green Day db3 75.23 0.2477

db4 75.29 0.2471


db1 72.33 0.2767

db2 73.05 0.2695

Led Zeppelin db3 72.77 0.2723

db4 72.85 0.2715

db1 71.07 0.2893

db2 72.13 0.2787

Madonna db3 72.46 0.2754

db4 72.35 0.2765

db1 80.43 0.1957

db2 80.43 0.1957

Metallica db3 80.43 0.1957

db4 80.43 0.1957

db1 67.50 0.3250

db2 68.02 0.3198

Prince db3 67.90 0.3210

db4 67.72 0.3228

db1 62.96 0.3704

db2 63.37 0.3663

Queen db3 63.76 0.3624

db4 63.86 0.3614

db1 78.04 0.2196

db2 78.39 0.2161

Radiohead db3 78.24 0.2176

db4 78.16 0.2184

db1 72.53 0.2747

db2 72.36 0.2764

Roxette db3 71.73 0.2827

db4 71.63 0.2837

db1 75.03 0.2497

db2 75.62 0.2438

Steely Dan db3 76.01 0.2399

db4 76.43 0.2357

db1 71.44 0.2856

db2 72.77 0.2723

Suzanne Vega db3 73.19 0.2681

db4 73.19 0.2681

db1 70.70 0.2930

db2 71.73 0.2827

Tori Amos db3 71.80 0.2820

db4 71.82 0.2818


db1 67.99 0.3201

db2 69.00 0.3100

U2 db3 69.44 0.3056

db4 69.60 0.3040

Table A.10: Complete Experiment 1 Results Using Wavelet Energy Features and JRip

Artist Wavelet Accuracy Mean Squared Error

db1 74.18 0.3640

db2 75.10 0.3596

Aerosmith db3 75.19 0.3566

db4 75.56 0.3526

db1 81.01 0.2884

db2 81.57 0.2812

Beatles db3 81.82 0.2772

db4 81.85 0.2766

db1 81.32 0.2942

db2 82.80 0.2733

CCR db3 83.34 0.2658

db4 83.54 0.2638

db1 80.71 0.3049

db2 81.54 0.2928

Cure db3 81.96 0.2870

db4 82.11 0.2862

db1 77.20 0.3352

db2 77.94 0.3287

Dave Matthews Band db3 78.56 0.3228

db4 78.47 0.3224

db1 72.21 0.3867

db2 73.46 0.3750

Depeche Mode db3 74.50 0.3632

db4 74.63 0.3630

db1 74.97 0.3581

db2 75.78 0.3525

Fleetwood Mac db3 76.31 0.3456

db4 76.42 0.3452

db1 80.32 0.2949

db2 81.07 0.2841

Garth Brooks db3 81.53 0.2803

db4 81.75 0.2748

db1 77.59 0.3218

db2 78.39 0.3133

Green Day db3 79.56 0.3026

db4 79.95 0.2971


db1 75.62 0.3560

db2 76.55 0.3442

Led Zeppelin db3 76.85 0.3405

db4 77.49 0.3356

db1 77.45 0.3307

db2 78.02 0.3265

Madonna db3 77.67 0.3292

db4 78.04 0.3263

db1 84.50 0.2483

db2 85.16 0.2396

Metallica db3 85.62 0.2331

db4 85.92 0.2272

db1 69.31 0.4188

db2 68.77 0.3665

Prince db3 69.65 0.4165

db4 70.08 0.4124

db1 68.76 0.4126

db2 69.40 0.4089

Queen db3 70.54 0.3991

db4 69.93 0.4084

db1 81.77 0.2752

db2 83.04 0.2590

Radiohead db3 83.54 0.2543

db4 84.24 0.2410

db1 73.62 0.3757

db2 74.04 0.3732

Roxette db3 74.20 0.3719

db4 74.38 0.3681

db1 77.66 0.3333

db2 78.70 0.3198

Steely Dan db3 79.29 0.3114

db4 80.16 0.3021

db1 74.56 0.3607

db2 76.56 0.3429

Suzanne Vega db3 77.01 0.3372

db4 77.74 0.3275

db1 75.38 0.3581

db2 75.64 0.3552

Tori Amos db3 75.66 0.3557

db4 76.01 0.3512


db1 70.17 0.4089

db2 71.43 0.3987

U2 db3 71.81 0.3945

db4 71.19 0.4021

Table A.11: Complete Experiment 1 Results Using Wavelet Energy Features and IBk

Artist Wavelet Accuracy Mean Squared Error

db1 72.93 0.3075

db2 75.03 0.2845

Aerosmith db3 76.56 0.2685

db4 76.84 0.2643

db1 80.09 0.2350

db2 81.60 0.2167

Beatles db3 83.32 0.1974

db4 83.97 0.1918

db1 78.25 0.2524

db2 81.91 0.2138

CCR db3 83.64 0.1920

db4 84.11 0.1857

db1 79.62 0.2344

db2 80.74 0.2211

Cure db3 82.14 0.2070

db4 82.34 0.2049

db1 77.04 0.2677

db2 79.06 0.2464

Dave Matthews Band db3 80.58 0.2300

db4 80.40 0.2301

db1 72.60 0.3086

db2 76.41 0.2726

Depeche Mode db3 77.50 0.2568

db4 78.85 0.2451

db1 74.91 0.2843

db2 76.86 0.2675

Fleetwood Mac db3 78.43 0.2491

db4 79.55 0.2404

db1 78.55 0.2496

db2 82.23 0.2082

Garth Brooks db3 84.14 0.1860

db4 84.87 0.1796

db1 75.34 0.2867

db2 78.61 0.2484

Green Day db3 80.37 0.2285

db4 80.65 0.2261


db1 74.34 0.2918

db2 78.44 0.2481

Led Zeppelin db3 80.38 0.2291

db4 81.14 0.2220

db1 78.79 0.2456

db2 82.13 0.2112

Madonna db3 82.95 0.2004

db4 82.94 0.2009

db1 81.35 0.2111

db2 83.89 0.1876

Metallica db3 85.07 0.1749

db4 85.24 0.1716

db1 71.29 0.3240

db2 73.04 0.3055

Prince db3 74.22 0.2931

db4 74.51 0.2888

db1 70.69 0.3289

db2 74.98 0.2870

Queen db3 78.76 0.2491

db4 79.57 0.2381

db1 81.08 0.2234

db2 84.11 0.1886

Radiohead db3 86.07 0.1668

db4 86.88 0.1586

db1 73.77 0.3010

db2 77.26 0.2624

Roxette db3 79.34 0.2398

db4 80.29 0.2302

db1 76.24 0.2699

db2 78.96 0.2418

Steely Dan db3 81.28 0.2162

db4 81.71 0.2107

db1 74.60 0.2904

db2 77.55 0.2596

Suzanne Vega db3 80.27 0.2316

db4 80.17 0.2315

db1 76.12 0.2781

db2 78.76 0.2507

Tori Amos db3 80.30 0.2317

db4 80.60 0.2287


db1 71.49 0.3225

db2 74.19 0.2953

U2 db3 75.33 0.2826

db4 75.51 0.2806


Appendix B

Complete Experiment 2 Results


Table B.1: Complete Experiment 2 Results using MFCC Features (Training set in rows; test set in columns)

AerosmithBeatles

CCR

Cure

DMB

Depeche

Mode

Fleetwood

Mac

Garth

Brooks

Green

Day

Led

Zep-

pelin

Madonna

Metallica

Prince

Queen

RadioheadRoxette

Steely

Dan

Suzanne

Vega

Tori

Amos

U2

Aerosmith

69.1

0.3

6

57.9

0.4

5

67.9

0.3

9

60.3

0.4

2

62.1

0.4

0

61.5

0.4

2

65.1

0.3

9

62.4

0.4

1

62.5

0.4

1

48.3

0.5

0

67.9

0.4

0

59.4

0.4

2

57.0

0.4

4

62.7

0.4

1

57.7

0.4

5

68.5

0.3

5

55.6

0.4

5

55.1

0.4

6

55.8

0.4

5

Beatles

62.0

0.4

1

61.1

0.4

2

69.0

0.3

6

67.3

0.3

6

64.0

0.3

8

65.4

0.3

7

69.8

0.3

3

65.8

0.3

8

63.3

0.3

9

58.4

0.4

2

51.2

0.5

1

60.2

0.4

0

59.3

0.4

1

68.9

0.3

4

61.9

0.4

0

70.9

0.3

2

60.9

0.4

1

64.5

0.3

8

60.6

0.4

1

CCR

55.3

0.4

5

57.6

0.4

3

73.2

0.3

1

58.6

0.4

2

58.8

0.4

2

57.1

0.4

4

60.1

0.4

1

60.6

0.4

1

64.2

0.3

8

48.0

0.5

1

69.4

0.3

6

57.5

0.4

3

55.2

0.4

5

60.9

0.4

0

53.2

0.4

7

68.1

0.3

4

57.7

0.4

3

48.8

0.5

0

56.7

0.4

4

Cure

64.6

0.3

9

63.5

0.4

1

69.4

0.3

8

59.1

0.4

2

61.3

0.3

9

60.6

0.4

1

65.6

0.3

7

60.5

0.4

3

66.2

0.3

7

50.4

0.4

7

72.3

0.3

8

58.6

0.4

2

55.2

0.4

5

59.4

0.4

3

54.3

0.4

6

69.6

0.3

2

61.4

0.4

0

51.5

0.4

8

58.1

0.4

2

DMB

61.1

0.4

1

73.7

0.3

0

63.8

0.4

2

66.3

0.3

8

63.4

0.3

9

67.4

0.3

6

70.5

0.3

2

66.5

0.3

8

63.0

0.3

9

64.2

0.3

8

66.7

0.4

1

60.0

0.4

1

54.2

0.4

6

63.7

0.3

9

63.5

0.3

9

71.8

0.3

2

65.0

0.3

7

66.9

0.3

6

61.5

0.4

0

Depeche

Mode

57.4

0.4

4

58.8

0.4

4

53.7

0.4

7

63.9

0.4

1

59.0

0.4

4

61.1

0.4

3

66.9

0.3

7

58.6

0.4

4

60.9

0.4

2

56.5

0.4

5

63.2

0.3

9

61.4

0.4

0

57.5

0.4

4

64.0

0.4

0

55.6

0.4

5

66.6

0.3

7

61.6

0.4

1

57.0

0.4

5

58.0

0.4

5

Fleetwood

Mac

59.6

0.4

4

70.2

0.3

6

66.3

0.4

1

70.9

0.3

6

66.6

0.3

8

63.7

0.3

8

66.6

0.3

7

64.6

0.4

2

65.7

0.3

8

65.8

0.3

9

57.4

0.4

6

59.8

0.4

1

56.4

0.4

4

63.8

0.3

8

65.5

0.4

0

71.3

0.3

2

64.8

0.3

8

65.9

0.4

0

62.1

0.4

1

Garth

Brooks

57.1

0.4

4

71.3

0.3

1

55.9

0.4

7

67.0

0.3

9

64.0

0.3

8

62.8

0.4

0

63.1

0.3

9

62.1

0.4

0

61.5

0.4

1

61.3

0.3

9

55.5

0.4

7

59.0

0.4

2

60.5

0.4

0

62.8

0.3

9

60.2

0.4

1

63.9

0.3

9

62.2

0.3

9

64.7

0.3

7

61.4

0.4

1

Green

Day

57.4

0.4

5

71.5

0.3

4

68.3

0.4

0

64.8

0.4

0

65.6

0.3

8

56.2

0.4

5

61.1

0.4

1

66.8

0.3

6

62.0

0.4

1

58.7

0.4

2

70.8

0.3

8

60.4

0.4

1

57.2

0.4

3

59.3

0.4

3

61.5

0.4

0

61.6

0.4

1

59.5

0.4

2

61.9

0.4

0

58.1

0.4

3

Led

Zep-

pelin

59.9

0.4

4

67.7

0.3

7

65.1

0.4

2

72.4

0.3

6

61.6

0.4

1

58.3

0.4

3

61.4

0.4

1

61.2

0.4

0

62.5

0.4

3

50.7

0.4

7

53.8

0.5

1

56.5

0.4

5

56.4

0.4

4

61.7

0.4

1

57.6

0.4

4

67.5

0.3

7

59.4

0.4

2

56.7

0.4

4

56.0

0.4

5

Madonna

56.8

0.4

5

73.1

0.3

2

46.8

0.5

4

52.8

0.4

8

68.0

0.3

8

60.5

0.4

2

66.7

0.3

7

65.5

0.3

7

57.6

0.4

4

55.5

0.4

6

39.2

0.6

0

60.2

0.4

4

59.1

0.4

2

62.2

0.4

0

65.1

0.3

8

64.5

0.3

9

61.3

0.4

1

68.9

0.3

4

60.2

0.4

3

Metallica

60.0

0.4

0

56.0

0.4

2

76.7

0.2

6

75.8

0.2

5

53.3

0.4

5

60.2

0.4

0

51.3

0.4

7

52.7

0.4

6

58.7

0.4

1

66.6

0.3

4

38.5

0.5

9

56.3

0.4

4

49.2

0.5

0

62.1

0.3

7

48.8

0.5

0

69.3

0.3

1

58.7

0.4

1

42.6

0.5

5

51.8

0.4

7

Prince

53.8

0.4

8

65.7

0.4

1

53.0

0.5

0

50.8

0.4

9

61.1

0.4

4

62.0

0.4

3

60.3

0.4

4

65.3

0.4

1

62.4

0.4

5

56.6

0.4

5

62.1

0.4

3

47.5

0.5

3

60.1

0.4

3

57.4

0.4

6

60.2

0.4

4

60.7

0.4

3

58.8

0.4

5

59.5

0.4

5

58.7

0.4

5

Queen

55.4

0.4

6

65.4

0.3

7

45.9

0.5

2

60.7

0.4

2

57.0

0.4

4

61.7

0.4

1

53.8

0.4

7

66.0

0.3

7

57.0

0.4

4

58.7

0.4

3

53.8

0.4

6

32.6

0.6

1

57.6

0.4

4

57.7

0.4

4

54.9

0.4

5

60.2

0.4

2

54.8

0.4

5

56.2

0.4

5

56.8

0.4

5

Radiohead57.0

0.4

4

61.1

0.4

0

70.0

0.3

4

72.1

0.3

2

60.7

0.3

9

63.8

0.3

7

54.6

0.4

5

62.6

0.3

8

61.3

0.4

0

64.6

0.3

7

47.3

0.5

0

69.1

0.3

6

56.7

0.4

3

50.8

0.4

9

55.0

0.4

5

70.2

0.3

1

60.9

0.3

9

57.6

0.4

2

54.5

0.4

5

Roxette

62.9

0.4

2

70.5

0.3

5

64.0

0.4

3

67.2

0.4

0

65.3

0.4

0

62.3

0.4

1

67.8

0.3

9

68.6

0.3

6

65.4

0.4

0

61.0

0.4

2

65.0

0.4

0

73.5

0.3

8

58.2

0.4

3

54.3

0.4

5

65.2

0.3

8

69.5

0.3

6

62.6

0.4

1

67.9

0.3

7

59.3

0.4

4

Steely

Dan

60.5

0.4

3

56.3

0.4

6

73.1

0.3

4

71.3

0.3

6

56.5

0.4

4

64.1

0.3

8

61.8

0.4

1

60.4

0.4

1

56.2

0.4

4

65.1

0.3

8

50.6

0.4

7

80.4

0.2

6

61.0

0.4

0

51.6

0.4

8

62.9

0.3

9

54.3

0.4

5

62.2

0.3

9

48.3

0.4

9

55.3

0.4

5

Suzanne

Vega

51.8

0.4

8

56.4

0.4

5

53.5

0.4

7

62.6

0.4

1

57.0

0.4

3

63.2

0.4

0

59.2

0.4

2

64.2

0.3

8

55.6

0.4

5

59.5

0.4

2

61.4

0.4

1

55.6

0.4

5

59.0

0.4

2

53.4

0.4

7

59.5

0.4

2

57.0

0.4

4

67.1

0.3

7

61.5

0.4

1

56.6

0.4

5

Tori

Amos

59.6

0.4

4

72.3

0.3

6

54.3

0.4

6

59.9

0.4

4

66.2

0.4

0

61.5

0.4

1

64.4

0.4

1

68.7

0.3

6

62.5

0.4

3

58.5

0.4

4

63.5

0.4

0

49.0

0.5

1

59.2

0.4

2

56.4

0.4

5

65.3

0.3

9

63.9

0.4

1

64.9

0.3

9

61.2

0.4

1

59.7

0.7

4

U2

55.1

0.4

6

70.5

0.3

7

51.3

0.4

7

66.2

0.3

8

67.8

0.3

8

63.1

0.3

9

66.3

0.3

9

68.8

0.3

6

62.5

0.4

1

63.6

0.3

9

68.8

0.3

8

45.2

0.5

2

61.0

0.4

1

59.0

0.4

3

64.4

0.3

9

62.0

0.4

2

68.3

0.3

5

62.0

0.4

0

65.0

0.4

0


Table B.2: Complete Experiment 2 Results using MFDWC Features (Training set in rows; test set in columns)

AerosmithBeatles

CCR

Cure

DMB

Depeche

Mode

Fleetwood

Mac

Garth

Brooks

Green

Day

Led

Zep-

pelin

Madonna

Metallica

Prince

Queen

RadioheadRoxette

Steely

Dan

Suzanne

Vega

Tori

Amos

U2

Aerosmith

61.2

0.4

1

61.7

0.4

5

60.9

0.4

3

51.3

0.4

9

53.9

0.4

8

55.2

0.4

7

56.7

0.4

5

59.7

0.4

4

61.4

0.4

3

47.1

0.5

2

69.6

0.4

0

52.7

0.4

8

57.0

0.4

5

57.2

0.4

5

55.0

0.4

6

59.9

0.4

4

50.4

0.5

1

48.3

0.5

1

52.0

0.4

8

Beatles

54.0

0.4

7

53.2

0.4

9

55.3

0.4

7

54.7

0.4

6

52.1

0.4

9

57.0

0.4

5

64.4

0.3

9

61.1

0.4

2

53.8

0.4

7

57.8

0.4

3

50.4

0.5

1

56.3

0.4

6

61.2

0.4

1

56.5

0.4

4

56.3

0.4

5

52.1

0.4

9

51.5

0.5

0

58.2

0.4

3

57.8

0.4

4

CCR

56.7

0.4

5

49.7

0.5

1

70.0

0.3

4

51.3

0.4

9

58.1

0.4

3

55.2

0.4

6

60.2

0.4

1

56.9

0.4

4

61.3

0.4

1

47.8

0.5

2

65.3

0.4

0

54.6

0.4

7

57.4

0.4

3

59.0

0.4

2

51.3

0.4

9

63.9

0.3

9

52.4

0.4

8

46.0

0.5

3

54.0

0.4

6

Cure

58.0

0.4

4

50.3

0.4

8

75.4

0.3

4

51.6

0.4

8

62.5

0.4

2

55.9

0.4

6

61.3

0.4

0

56.8

0.4

5

64.5

0.4

0

47.4

0.5

0

76.7

0.3

3

58.5

0.4

4

59.3

0.4

2

61.2

0.4

2

50.3

0.4

9

68.4

0.3

7

56.8

0.4

5

42.7

0.5

5

57.6

0.4

4

DMB

51.4

0.4

9

60.9

0.4

1

63.1

0.4

4

56.4

0.4

73

52.6

0.4

9

57.3

0.4

5

62.7

0.4

1

61.6

0.4

3

56.0

0.4

7

54.3

0.4

7

63.5

0.4

2

55.6

0.4

7

52.1

0.4

8

53.9

0.4

8

49.1

0.5

0

53.2

0.4

9

53.2

0.4

8

50.7

0.4

9

57.0

0.4

5

Depeche

Mode

54.7

0.4

8

47.5

0.5

3

64.7

0.4

1

71.6

0.3

5

51.0

0.4

9

53.3

0.4

8

59.3

0.4

2

52.3

0.4

8

60.4

0.4

3

48.8

0.5

0

72.2

0.3

5

58.3

0.4

4

57.1

0.4

5

57.7

0.4

4

47.6

0.5

1

65.8

0.3

9

60.1

0.4

3

43.4

0.5

4

55.7

0.4

5

Fleetwood

Mac

54.9

0.4

8

59.7

0.4

3

69.2

0.3

9

66.4

0.4

2

55.9

0.4

7

58.6

0.4

5

63.7

0.4

2

59.0

0.4

5

60.1

0.4

5

57.4

0.4

5

69.2

0.3

9

58.8

0.4

5

55.5

0.4

6

56.9

0.4

6

57.0

0.4

5

63.1

0.4

2

56.5

0.4

6

53.7

0.4

7

56.6

0.4

6

Garth

Brooks

52.2

0.4

7

60.6

0.4

0

54.9

0.4

8

63.9

0.7

2

56.7

0.4

5

57.7

0.4

5

57.6

0.4

4

58.1

0.4

5

56.4

0.4

6

59.4

0.4

3

54.8

0.4

9

57.0

0.4

6

58.9

0.4

2

55.4

0.4

4

52.9

0.4

7

55.0

0.4

7

55.8

0.4

6

53.0

0.4

7

58.6

0.4

4

Green

Day

54.6

0.4

8

68.6

0.3

5

64.1

0.4

4

54.5

0.4

9

60.6

0.4

2

49.8

0.5

1

58.1

0.4

4

62.1

0.4

0

56.2

0.4

7

55.5

0.4

5

69.0

0.4

1

54.9

0.4

7

53.9

0.4

6

56.3

0.4

6

57.0

0.4

4

51.2

0.5

1

51.9

0.4

9

56.0

0.4

4

55.7

0.4

6

Led

Zep-

pelin

60.3

0.4

4

58.7

0.4

3

66.2

0.4

2

69.8

0.3

9

52.6

0.4

8

56.4

0.4

7

56.2

0.4

6

58.4

0.4

4

58.7

0.4

4

46.8

0.5

1

68.2

0.4

2

56.9

0.4

6

55.9

0.4

5

57.7

0.4

5

50.7

0.4

9

63.9

0.4

3

56.7

0.4

6

48.9

0.5

0

53.9

0.4

7

Madonna

50.3

0.5

0

59.3

0.4

2

47.5

0.5

1

53.0

0.4

8

53.0

0.4

7

54.0

0.4

8

57.8

0.4

5

59.5

0.4

3

52.1

0.4

8

52.7

0.4

8

50.7

0.5

0

54.9

0.4

7

54.3

0.4

7

54.6

0.4

7

58.7

0.4

4

54.7

0.4

7

56.4

0.4

6

61.9

0.4

1

53.4

0.4

8

Metallica

61.4

0.4

0

56.0

0.4

5

73.6

0.3

2

72.5

0.3

1

50.2

0.5

0

58.7

0.4

3

54.7

0.4

6

53.7

0.4

7

56.5

0.4

4

65.8

0.3

7

44.3

0.5

4

56.3

0.4

5

54.6

0.4

6

54.9

0.4

7

50.1

0.4

9

67.9

0.3

5

55.7

0.4

5

42.0

0.5

6

52.7

0.4

7

Prince

55.5

0.4

7

61.6

0.4

3

66.6

0.4

4

66.5

0.4

3

56.0

0.4

7

62.0

0.4

5

58.4

0.4

6

63.7

0.4

2

57.9

0.4

9

61.3

0.4

5

54.7

0.4

6

73.3

0.4

2

58.2

0.4

5

60.5

0.4

4

53.1

0.4

7

62.9

0.4

5

57.1

0.4

7

53.7

0.4

7

57.5

0.4

6

Queen

57.0

0.4

6

59.7

0.4

2

55.9

0.4

7

67.0

0.4

0

48.3

0.5

1

60.4

0.4

2

52.7

0.4

8

61.8

0.4

0

54.1

0.4

7

58.1

0.4

5

49.6

0.4

9

43.4

0.5

6

54.1

0.4

7

57.1

0.4

5

50.5

0.4

8

59.7

0.4

4

52.9

0.4

8

46.8

0.5

1

55.7

0.4

6

Radiohead53.1

0.4

8

52.2

0.5

1

63.6

0.4

0

57.1

0.4

5

52.2

0.4

8

56.1

0.4

6

54.2

0.4

7

58.4

0.4

4

54.5

0.4

7

58.0

0.4

4

52.4

0.4

9

69.8

0.3

4

57.1

0.4

5

50.6

0.5

0

54.0

0.4

7

58.0

0.4

4

53.1

0.4

8

52.5

0.4

8

55.6

0.4

6

Roxette

55.4

0.4

6

60.7

0.4

2

52.3

0.4

7

49.8

0.4

8

50.0

0.5

0

53.0

0.4

8

56.9

0.4

5

60.1

0.4

3

58.2

0.4

5

52.6

0.4

6

63.3

0.4

2

68.0

0.3

8

52.9

0.4

8

53.3

0.4

7

54.9

0.4

4

52.4

0.4

7

51.5

0.4

8

60.5

0.4

3

53.8

0.4

7

Steely

Dan

58.7

0.4

4

52.6

0.4

8

75.0

0.3

1

74.3

0.3

3

50.6

0.4

9

61.6

0.4

1

56.1

0.4

6

55.5

0.4

5

55.3

0.4

5

64.1

0.3

9

45.0

0.5

2

80.5

0.2

3

58.5

0.4

3

52.9

0.4

8

59.3

0.4

2

49.6

0.4

9

58.5

0.4

3

41.8

0.5

6

54.0

0.4

6

Suzanne

Vega

51.0

0.5

1

49.8

0.5

0

58.4

0.4

6

61.2

0.4

3

49.7

0.4

9

58.6

0.4

5

54.3

0.4

7

54.8

0.4

6

51.8

0.4

9

57.8

0.4

6

56.5

0.4

4

66.3

0.4

1

57.9

0.4

5

50.6

0.5

0

55.1

0.4

6

51.0

0.4

9

61.1

0.4

3

52.8

0.4

7

52.3

0.4

8

Tori

Amos

49.6

0.5

0

61.2

0.4

1

43.5

0.5

4

41.0

0.5

6

53.0

0.4

8

47.9

0.5

1

54.3

0.4

7

57.3

0.4

4

56.1

0.4

6

50.5

0.4

9

63.8

0.4

0

48.5

0.5

3

53.3

0.4

8

51.7

0.4

8

53.9

0.4

7

58.2

0.4

4

43.9

0.5

3

51.5

0.4

9

52.8

0.4

8

U2

54.0

0.4

7

63.1

0.4

2

57.3

0.4

6

63.7

0.4

2

55.2

0.4

7

56.2

0.4

6

57.7

0.4

6

65.4

0.3

9

58.8

0.4

5

58.1

0.4

4

57.5

0.4

5

59.4

0.4

5

56.6

0.4

6

60.2

0.4

3

61.0

0.4

1

54.1

0.4

7

57.3

0.4

5

51.4

0.4

9

52.4

0.4

8


Table B.3: Complete Experiment 2 Results using Wavelet Energy Features (Training set in rows; test set in columns)

AerosmithBeatles

CCR

Cure

Dave

Matthews

Depeche

Mode

Fleetwood

Mac

Garth

Brooks

Green

Day

Led

Zep-

pelin

Madonna

Metallica

Prince

Queen

RadioheadRoxette

Steely

Dan

Suzanne

Vega

Tori

Amos

U2

Aerosmith

70.9

0.3

3

65.6

0.3

8

68.6

0.3

7

64.8

0.3

8

60.1

0.4

1

67.1

0.3

7

62.6

0.3

9

67.9

0.3

6

66.4

0.3

6

52.9

0.4

7

66.5

0.3

8

61.8

0.4

0

55.8

0.4

4

60.0

0.4

2

63.3

0.3

9

70.5

0.3

3

60.1

0.4

2

61.2

0.4

1

57.0

0.4

4

Beatles

63.2

0.4

0

64.9

0.3

9

70.7

0.3

5

69.5

0.3

4

62.2

0.3

9

67.3

0.3

7

70.8

0.3

2

69.2

0.3

5

64.1

0.3

8

59.7

0.4

1

66.7

0.4

0

57.6

0.4

3

52.2

0.4

6

68.5

0.3

2

66.6

0.3

6

71.0

0.3

2

63.7

0.3

8

69.2

0.3

3

59.0

0.4

2

CCR

59.5

0.4

2

65.5

0.3

6

71.7

0.3

2

61.2

0.4

0

59.4

0.4

1

61.6

0.4

0

61.2

0.4

0

65.6

0.3

7

67.8

0.3

5

51.3

0.4

8

73.3

0.3

2

57.3

0.4

3

53.1

0.4

7

65.2

0.3

6

59.5

0.4

2

69.0

0.3

3

61.5

0.3

9

55.5

0.4

4

56.0

0.4

4

Cure

63.8

0.3

9

68.1

0.3

3

63.8

0.4

1

63.9

0.3

8

60.1

0.4

0

67.9

0.3

5

71.6

0.3

1

65.1

0.3

9

68.4

0.3

5

59.0

0.4

1

60.4

0.4

5

57.0

0.4

3

52.6

0.4

6

64.5

0.3

7

64.0

0.3

8

69.5

0.3

3

66.8

0.3

5

65.6

0.3

6

61.0

0.4

0

Dave

Matthews

Band

60.3

0.4

3

73.4

0.3

2

63.8

0.4

0

67.4

0.3

8

65.5

0.3

8

69.6

0.3

5

72.9

0.3

1

69.2

0.3

6

64.3

0.3

9

68.0

0.3

5

67.8

0.3

7

57.6

0.4

3

51.7

0.4

9

61.8

0.3

9

68.0

0.3

6

70.8

0.3

3

68.6

0.3

5

70.7

0.3

3

61.9

0.4

0

Depeche

Mode

60.0

0.4

2

54.9

0.4

6

57.9

0.4

3

57.2

0.4

3

63.0

0.4

0

69.0

0.3

4

63.0

0.3

8

54.0

0.4

6

60.7

0.4

0

67.7

0.3

6

59.9

0.4

1

61.1

0.4

0

53.8

0.4

7

60.4

0.4

0

60.3

0.4

2

67.9

0.3

5

66.8

0.3

6

60.5

0.4

1

61.4

0.4

0

Fleetwood

Mac

65.0

0.3

8

68.9

0.3

4

64.9

0.3

9

71.5

0.3

4

67.1

0.3

6

64.8

0.3

8

70.7

0.3

2

68.6

0.3

5

68.1

0.3

5

64.0

0.3

9

58.7

0.4

5

59.4

0.4

1

55.1

0.4

5

64.7

0.3

7

68.2

0.3

5

72.0

0.3

1

66.7

0.3

6

65.9

0.3

6

62.7

0.3

9

Garth

Brooks

55.9

0.4

6

72.9

0.2

9

56.3

0.4

6

70.4

0.3

6

66.6

0.3

6

61.7

0.4

1

67.7

0.3

6

61.9

0.4

0

62.6

0.4

0

62.9

0.3

8

55.9

0.4

6

56.3

0.4

4

53.0

0.4

7

60.2

0.4

0

65.5

0.3

7

65.4

0.3

8

68.5

0.3

5

69.5

0.3

2

59.5

0.4

2

Green

Day

59.8

0.4

2

76.3

0.2

6

63.6

0.3

9

56.1

0.4

6

69.4

0.3

3

56.5

0.4

5

63.0

0.3

9

67.8

0.3

3

59.4

0.4

2

63.7

0.3

7

70.0

0.3

6

59.5

0.4

1

52.4

0.4

7

62.7

0.3

8

67.1

0.3

5

62.3

0.4

0

63.0

0.3

9

69.7

0.3

2

60.0

0.4

1

Led

Zep-

pelin

62.6

0.4

1

70.8

0.3

2

61.1

0.4

0

72.2

0.3

4

64.9

0.3

8

59.6

0.4

2

65.8

0.3

8

66.0

0.3

7

62.3

0.3

9

53.2

0.4

6

72.7

0.5

9

57.8

0.4

3

56.8

0.4

4

63.0

0.3

9

61.2

0.4

0

65.7

0.3

7

61.6

0.4

0

62.3

0.3

9

58.8

0.4

3

Madonna

59.3

0.4

1

73.2

0.2

8

48.3

0.5

1

50.0

0.4

9

68.4

0.3

5

59.6

0.4

3

65.8

0.3

5

66.7

0.3

4

61.7

0.3

9

57.8

0.4

3

39.5

0.5

8

58.0

0.4

4

57.8

0.4

2

64.0

0.3

7

65.7

0.3

5

63.7

0.3

9

61.5

0.4

0

69.7

0.3

2

62.4

0.4

0

Metallica

63.0

0.3

7

66.0

0.3

5

73.7

0.2

8

76.2

0.2

4

59.0

0.4

1

59.1

0.4

1

56.7

0.4

3

57.5

0.4

3

66.2

0.3

5

67.4

0.3

3

42.3

0.5

7

57.0

0.4

3

49.4

0.5

0

67.0

0.3

3

57.5

0.4

3

69.6

0.3

1

59.0

0.4

1

54.6

0.4

5

52.9

0.4

6

Prince

58.8

0.4

6

68.5

0.4

0

57.2

0.4

9

42.4

0.5

4

58.4

0.4

5

59.7

0.4

4

60.4

0.4

4

60.7

0.4

3

66.3

0.4

2

55.5

0.4

7

61.1

0.4

4

57.6

0.5

1

54.7

0.4

6

60.8

0.4

4

62.0

0.4

4

59.8

0.4

4

57.9

0.4

5

65.5

0.4

2

57.7

0.4

5

Queen

52.5

0.4

8

60.0

0.4

0

43.9

0.5

3

53.1

0.4

7

57.4

0.4

3

58.3

0.4

3

52.8

0.4

7

56.2

0.4

4

53.3

0.4

5

57.6

0.4

3

56.2

0.4

4

28.5

0.6

4

55.6

0.4

5

57.3

0.4

3

55.8

0.4

4

58.4

0.4

3

57.4

0.4

4

55.7

0.4

4

54.0

0.4

6

Radiohead55.1

0.4

4

56.4

0.4

4

65.9

0.3

6

71.9

0.3

9

60.9

0.3

9

61.1

0.3

9

54.9

0.4

4

61.2

0.4

0

60.5

0.4

1

63.8

0.3

7

51.0

0.4

8

66.4

0.3

8

56.2

0.4

4

48.8

0.5

0

54.7

0.4

5

69.5

0.3

1

61.6

0.3

9

61.0

0.4

0

56.0

0.4

4

Roxette

62.6

0.4

1

74.0

0.3

0

67.1

0.3

9

64.7

0.4

0

69.5

0.3

9

62.7

0.4

1

70.1

0.3

6

71.3

0.3

3

70.3

0.3

5

61.9

0.4

0

67.3

0.3

8

71.1

0.3

5

58.3

0.4

3

51.5

0.4

8

66.0

0.3

5

69.9

0.3

5

65.9

0.3

8

71.0

0.3

3

60.6

0.4

2

Steely

Dan

62.4

0.4

0

55.7

0.4

6

71.6

0.3

2

66.9

0.3

8

57.1

0.4

5

66.1

0.3

7

64.1

0.3

8

61.9

0.4

0

58.7

0.4

3

65.2

0.3

7

55.8

0.4

4

78.6

0.2

4

61.2

0.3

9

50.9

0.4

9

67.0

0.3

7

58.6

0.4

4

65.1

0.3

7

58.3

0.4

9

56.4

0.4

3

Suzanne

Vega

55.4

0.4

6

66.7

0.3

5

62.1

0.4

2

66.6

0.3

9

65.5

0.3

7

64.3

0.3

8

63.5

0.3

9

72.1

0.3

1

64.1

0.3

9

61.8

0.4

0

64.8

0.3

7

61.9

0.4

4

57.5

0.4

2

52.2

0.4

7

66.6

0.3

5

66.2

0.3

7

66.8

0.3

6

69.7

0.3

3

58.9

0.4

3

Tori

Amos

61.2

0.4

4

75.5

0.3

1

61.3

0.4

2

69.2

0.4

0

67.8

0.3

8

62.6

0.4

1

66.5

0.4

0

71.8

0.3

5

67.4

0.3

8

63.7

0.4

0

64.2

0.3

9

63.2

0.4

3

55.0

0.4

4

52.5

0.4

6

67.1

0.3

5

66.6

0.3

9

66.1

0.3

9

66.0

0.3

9

57.8

0.4

4

U2

56.9

0.4

4

67.9

0.3

3

51.6

0.4

6

66.3

0.3

8

68.9

0.3

4

61.9

0.4

0

70.3

0.3

4

68.3

0.3

4

58.7

0.4

1

66.0

0.3

7

68.3

0.3

6

42.8

0.5

4

59.9

0.4

1

57.6

0.4

3

61.9

0.4

0

64.4

0.3

8

69.6

0.3

3

63.5

0.3

9

65.5

0.3

7


Appendix C

Complete Experiment 3 Results


Table C.1: Complete Experiment 3 Results using MFCC Features at 2% Incremental Sampling Rates (training set -> testing set)

Sampling Rate (%)  Female->Male Accuracy  Female->Male MSE  Male->Female Accuracy  Male->Female MSE
2   62.21  0.4297  61.19  0.4299
4   62.16  0.4265  59.28  0.4256
6   61.72  0.4263  61.38  0.4219
8   63.21  0.4194  62.61  0.4179
10  63.30  0.4160  60.81  0.4178
12  62.73  0.4188  62.02  0.4139
14  63.20  0.4167  61.13  0.4147
16  62.87  0.4160  62.17  0.4136
18  63.59  0.4125  61.86  0.4122
20  62.98  0.4125  62.16  0.4126
22  62.93  0.4138  62.07  0.4135
24  63.12  0.4118  61.18  0.4130
26  63.36  0.4130  61.41  0.4137
28  62.84  0.4131  60.81  0.4137
30  63.26  0.4111  61.81  0.4111
32  63.14  0.4107  61.77  0.4105
34  62.73  0.4100  62.05  0.4107
36  62.56  0.4119  60.83  0.4109
38  62.94  0.4102  61.52  0.4102
40  62.81  0.4113  61.12  0.4108

Table C.2: Complete Experiment 3 Results using MFDWC Features at 2% Incremental Sampling Rates (training set -> testing set)

Sampling Rate (%)  Female->Male Accuracy  Female->Male MSE  Male->Female Accuracy  Male->Female MSE
2   54.26  0.4767  51.33  0.4907
4   53.85  0.4780  50.98  0.4901
6   55.07  0.4756  51.69  0.4849
8   55.36  0.4731  52.14  0.4828
10  55.21  0.4739  51.86  0.4854
12  54.48  0.4748  52.35  0.4830
14  54.64  0.4725  52.73  0.4793
16  54.32  0.4709  52.63  0.4797
18  56.59  0.4706  52.48  0.4793
20  55.65  0.4698  52.46  0.4779
22  56.29  0.4676  52.71  0.4750
24  55.51  0.4673  52.93  0.4772
26  54.99  0.4682  52.91  0.4750
28  55.78  0.4668  53.02  0.4752
30  55.50  0.4669  52.92  0.4758
32  55.55  0.4676  52.97  0.4754
34  55.54  0.4660  53.00  0.4743
36  55.39  0.4659  53.21  0.4741
38  55.31  0.4654  52.90  0.4735
40  55.49  0.4669  52.92  0.4739

Table C.3: Complete Experiment 3 Results using Wavelet Energy Features at 2% Incremental Sampling Rates (training set -> testing set)

Sampling Rate (%)  Female->Male Accuracy  Female->Male MSE  Male->Female Accuracy  Male->Female MSE
2   66.95  0.4088  66.17  0.3976
4   64.15  0.4091  66.78  0.3930
6   65.68  0.4029  66.76  0.3929
8   65.13  0.3989  67.32  0.3871
10  66.09  0.4004  66.23  0.3862
12  66.10  0.3985  67.06  0.3848
14  65.84  0.3977  66.64  0.3862
16  65.40  0.3972  66.00  0.3842
18  66.07  0.3961  66.58  0.3852
20  66.14  0.3957  66.43  0.3820
22  65.78  0.3928  66.50  0.3811
24  66.27  0.3900  66.73  0.3826
26  65.54  0.3926  66.78  0.3813
28  65.08  0.3926  66.80  0.3799
30  65.37  0.3912  66.28  0.3799
32  65.13  0.3927  66.11  0.3817
34  65.34  0.3913  66.30  0.3783
36  64.53  0.3923  65.84  0.3803
38  64.98  0.3919  66.05  0.3775
40  64.78  0.3918  65.64  0.3808


Appendix D

Complete Experiment 4 Results


Table D.1: Complete Experiment 4 Results using MFCC Features at 2% Incremental Sampling Rates

Sampling Rate (%)  Accuracy  Mean Squared Error
2   67.41  0.3976
4   67.77  0.3899
6   68.38  0.3861
8   68.75  0.3829
10  69.28  0.3773
12  69.48  0.3748
14  69.44  0.3727
16  69.70  0.37
18  69.93  0.3675
20  70.05  0.3657
22  70.36  0.3627
24  70.47  0.3597
26  70.66  0.3577
28  70.83  0.3548
30  71.11  0.3514
32  71.24  0.3493
34  71.39  0.3463
36  71.65  0.3434
38  71.90  0.3405
40  72.12  0.3374

Table D.2: Complete Experiment 4 Results using MFDWC Features at 2% Incremental Sampling Rates

Sampling Rate (%)  Accuracy  Mean Squared Error
2   62.25  0.4383
4   62.25  0.4367
6   62.74  0.4325
8   63.06  0.4288
10  63.49  0.4261
12  63.71  0.4227
14  64.01  0.4196
16  64.24  0.4164
18  64.47  0.4144
20  64.69  0.4109
22  64.84  0.409
24  65.16  0.4054
26  65.44  0.4023
28  65.66  0.3997
30  65.96  0.3959
32  66.22  0.3926
34  66.55  0.3889
36  66.79  0.3852
38  67.13  0.3821
40  67.29  0.3795

Table D.3: Complete Experiment 4 Results using Wavelet Energy Features at 2% Incremental Sampling Rates

Sampling Rate (%)  Accuracy  Mean Squared Error
2   69.57  0.3847
4   70.07  0.3782
6   70.28  0.3753
8   70.57  0.3699
10  70.59  0.3677
12  70.67  0.3663
14  71.01  0.3627
16  71.16  0.3603
18  71.38  0.3579
20  71.44  0.355
22  71.63  0.3526
24  71.68  0.3512
26  72.00  0.3471
28  72.05  0.3453
30  72.34  0.3419
32  72.50  0.3399
34  72.64  0.3374
36  72.91  0.3334
38  73.02  0.331
40  73.19  0.3286


Appendix E

Complete Experiment 5 Results


Table E.1: Complete Experiment 5 Results at 2% Incremental Sampling Rates using MFCC Features

Run

Feature

24

68

10

12

14

16

18

20

22

24

26

28

30

32

34

36

38

40

1MFCC

64.5

0.4

1

65.6

0.4

1

64.7

0.4

1

62.7

0.4

0

65.2

0.4

0

63.2

0.4

1

65.5

0.4

1

64.0

0.4

1

65.1

0.4

0

65.8

0.3

9

63.4

0.4

0

64.9

0.4

0

65.7

0.4

0

64.3

0.4

0

62.7

0.4

1

64.2

0.3

9

64.9

0.3

9

64.2

0.3

9

65.5

0.4

0

64.0

0.4

0

265.2

0.4

2

64.5

0.4

1

64.9

0.4

1

63.5

0.4

2

65.0

0.4

1

64.7

0.4

0

69.9

0.3

9

64.4

0.4

0

64.8

0.4

0

65.3

0.3

9

64.5

0.4

0

65.3

0.4

0

65.7

0.3

9

65.5

0.3

9

63.0

0.4

0

65.6

0.3

5

63.8

0.4

0

63.8

0.4

0

64.9

0.3

9

64.3

0.4

0

366.4

0.4

1

63.2

0.4

2

65.1

0.4

2

64.0

0.3

9

65.8

0.4

0

65.9

0.4

0

64.5

0.4

0

65.7

0.4

0

65.1

0.4

0

65.8

0.3

9

65.7

0.3

9

65.4

0.3

9

65.1

0.4

0

65.3

0.4

0

63.6

0.4

0

65.5

0.3

9

64.4

0.3

9

64.4

0.3

9

64.8

0.4

1

65.5

0.4

0

464.1

0.4

2

62.9

0.4

1

64.8

0.4

2

63.3

0.4

0

66.6

0.4

0

64.5

0.4

0

64.7

0.4

0

64.3

0.4

0

65.0

0.3

9

66.1

0.3

9

64.8

0.4

0

64.7

0.3

9

65.7

0.4

0

65.6

0.3

9

64.1

0.4

1

66.7

0.3

9

63.6

0.3

9

64.8

0.3

9

66.4

0.3

9

65.0

0.3

9

565.1

0.4

1

64.9

0.4

1

65.1

0.4

2

64.8

0.4

0

64.6

0.4

1

65.7

0.4

0

64.0

0.4

1

63.4

0.4

0

64.3

0.4

0

65.3

0.3

9

63.8

0.4

1

64.6

0.4

2

63.3

0.4

0

65.9

0.3

9

64.2

0.4

0

65.2

0.4

0

65.0

0.3

9

63.9

0.3

9

64.4

0.3

9

64.2

0.3

9

664.7

0.4

3

64.2

0.4

1

64.5

0.4

1

64.5

0.4

0

65.2

0.4

1

66.1

0.3

9

64.8

0.4

0

64.6

0.4

0

63.6

0.4

0

64.1

0.3

9

63.7

0.3

9

65.2

0.3

9

65.4

0.4

0

64.8

0.3

9

64.0

0.4

1

64.1

0.3

9

64.4

0.3

9

65.4

0.3

9

64.3

0.3

9

65.1

0.3

8

764.1

0.4

1

64.7

0.4

1

64.2

0.4

2

65.4

0.3

9

65.8

0.4

1

64.4

0.4

0

65.3

0.3

9

65.7

0.4

0

64.7

0.3

9

65.9

0.3

9

63.8

0.4

0

65.3

0.4

0

65.5

0.3

9

66.1

0.4

0

63.9

0.4

0

66.3

0.4

0

64.3

0.4

0

64.2

0.3

9

64.6

0.3

9

65.0

0.3

9

864.0

0.4

1

62.4

0.4

1

64.3

0.4

1

65.4

0.4

1

63.4

0.4

0

63.3

0.4

1

66.9

0.4

0

65.6

0.4

1

65.1

0.4

0

64.9

0.3

9

64.9

0.4

0

67.4

0.3

9

65.3

0.3

9

65.5

0.4

0

65.0

0.3

9

65.3

0.3

9

64.1

0.4

1

64.4

0.4

0

62.7

0.4

1

65.0

0.3

9

963.4

0.4

3

64.4

0.4

1

65.3

0.4

1

66.1

0.4

0

63.8

0.4

0

64.5

0.4

1

65.8

0.4

0

64.4

0.4

0

64.3

0.4

1

64.6

0.4

0

64.6

0.3

9

65.8

0.3

9

64.1

0.4

0

65.5

0.3

8

64.5

0.3

9

64.0

0.3

9

64.4

0.3

9

63.0

0.4

1

66.5

0.3

9

64.5

0.4

0

10

64.8

0.4

1

64.5

0.4

0

65.3

0.4

0

64.1

0.4

0

63.4

0.4

1

64.7

0.4

0

65.5

0.4

0

65.6

0.4

0

65.5

0.3

9

65.3

0.3

9

64.3

0.4

0

64.9

0.4

0

65.9

0.3

9

63.8

0.3

9

61.8

0.4

0

65.3

0.3

8

64.9

0.4

0

64.9

0.3

9

64.2

0.4

0

64.5

0.4

0


Table E.2: Complete Experiment 5 Results at 2% Incremental Sampling Rates using MFDWC Features

Run

Feature

24

68

10

12

14

16

18

20

22

24

26

28

30

32

34

36

38

40

1MFDWC

57.7

0.4

6

60.4

0.4

6

54.4

0.4

7

60.8

0.4

5

59.1

0.4

5

58.8

0.4

5

59.0

0.4

5

61.6

0.4

4

59.3

0.4

5

60.2

0.4

5

59.6

0.4

4

58.8

0.4

5

60.9

0.4

4

59.2

0.4

4

59.8

0.4

5

60.1

0.4

4

59.7

0.4

4

59.5

0.4

4

59.7

0.4

4

59.7

0.4

4

258.9

0.4

6

60.4

0.4

5

58.9

0.4

5

59.3

0.4

5

58.4

0.4

5

60.4

0.4

5

59.9

0.4

5

60.0

0.4

5

60.0

0.4

4

60.7

0.4

4

61.5

0.4

4

58.8

0.4

4

61.6

0.4

4

58.5

0.4

5

58.4

0.4

5

60.2

0.4

4

58.3

0.4

5

59.3

0.4

4

61.0

0.4

4

60.1

0.4

4

356.9

0.4

6

60.3

0.4

6

61.1

0.4

5

59.7

0.4

6

57.2

0.4

6

59.0

0.4

5

59.8

0.4

5

60.7

0.4

5

59.9

0.4

5

59.1

0.4

5

59.2

0.4

4

58.9

0.4

5

59.7

0.4

4

59.9

0.4

4

60.8

0.4

4

59.1

0.4

4

58.9

0.4

5

59.0

0.4

4

59.3

0.4

4

61.5

0.4

4

458.3

0.4

6

60.5

0.4

5

60.1

0.4

5

57.4

0.4

6

59.4

0.4

5

59.0

0.4

5

60.1

0.4

5

59.1

0.4

4

59.2

0.4

5

59.4

0.4

4

60.5

0.4

4

56.7

0.4

5

60.2

0.4

4

59.7

0.4

4

60.3

0.4

4

58.7

0.4

4

58.1

0.4

4

61.3

0.4

4

59.2

0.4

4

61.4

0.4

3

558.1

0.4

5

59.4

0.4

6

59.6

0.4

5

59.4

0.4

6

58.8

0.4

5

59.0

0.4

5

59.9

0.4

5

60.9

0.4

4

58.6

0.4

5

60.8

0.4

4

60.6

0.4

5

60.6

0.4

4

59.2

0.4

5

56.3

0.4

5

60.4

0.4

4

60.3

0.4

4

58.6

0.4

4

59.5

0.4

4

59.6

0.4

4

59.2

0.4

4

660.4

0.4

6

60.0

0.4

6

59.4

0.4

6

59.1

0.4

5

59.7

0.4

5

59.0

0.4

5

59.4

0.4

5

61.2

0.4

3

58.7

0.4

5

59.3

0.4

5

59.5

0.4

4

58.8

0.4

5

61.7

0.4

4

62.6

0.4

3

59.2

0.4

469

60.8

0.4

34

57.5

0.4

5

59.5

0.4

4

58.9

0.4

4

61.0

0.4

4

759.0

0.4

6

60.0

0.4

6

59.9

0.4

5

59.5

0.4

5

59.7

0.4

5

59.4

0.4

5

59.4

0.4

5

60.7

0.4

5

60.0

0.4

5

62.3

0.4

4

58.8

0.4

5

59.5

0.4

5

59.7

0.4

4

60.4

0.4

4

60.1

0.4

4

59.3

0.4

4

58.3

0.4

5

58.6

0.4

418

59.9

0.4

4

60.5

0.4

4

857.6

0.4

6

59.9

0.4

6

60.5

0.4

5

57.4

0.4

55

57.6

0.4

6

60.0

0.4

5

57.1

0.4

5

60.7

0.4

4

59.3

0.4

5

59.0

0.4

5

59.3

0.4

4

59.3

0.4

4

60.7

0.4

4

59.9

0.4

4

59.5

0.4

4

59.1

0.4

4

57.4

0.4

5

59.6

0.4

5

60.7

0.4

4

59.3

0.4

4

960.5

0.4

6

60.5

0.4

5

59.1

0.4

5

58.9

0.4

5

59.4

0.4

5

59.0

0.4

5

60.6

0.4

5

61.8

0.4

4

59.1

0.4

5

59.6

0.4

5

60.2

0.4

4

59.7

0.4

5

60.0

0.4

4

58.5

0.4

4

60.2

0.4

4

59.9

0.4

4

59.1

0.4

4

59.7

0.4

4

59.9

0.4

4

58.7

0.4

5

10

58.9

0.4

5

60.5

0.4

6

57.3

0.4

5

58.8

0.4

5

59.6

0.4

5

59.5

0.4

5

59.6

0.4

5

59.7

0.4

5

59.2

0.4

5

61.0

0.4

5

61.6

0.4

4

59.0

0.4

5

59.7

0.4

4

60.3

0.4

4

60.9

0.4

4

60.2

0.4

4

59.8

0.4

4

60.1

0.4

4

58.6

0.4

4

60.0

0.4

4
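As an illustration of the MFDWC (mel-frequency discrete wavelet coefficient) feature type reported in Table E.2, the sketch below shows one common construction, in which the discrete cosine transform step of the standard MFCC pipeline is replaced by a discrete wavelet transform over the log mel-filterbank energies. This is a minimal sketch only: the librosa and PyWavelets calls, the filterbank size, the Haar wavelet, and the decomposition depth are illustrative assumptions for the example, not the configuration used to produce these results.

import numpy as np
import librosa  # assumed available; used only for the mel filterbank
import pywt     # assumed available; PyWavelets discrete wavelet transform

def mfdwc(y, sr, n_mels=40, n_fft=2048, hop_length=512, wavelet="haar", level=3):
    # Log mel-filterbank energies, one column per analysis frame.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(mel + 1e-10)
    # DWT across the mel-frequency axis of each frame; the concatenated
    # subband coefficients form one MFDWC-style feature vector per frame.
    features = []
    for frame in log_mel.T:
        coeffs = pywt.wavedec(frame, wavelet, level=level)
        features.append(np.concatenate(coeffs))
    return np.array(features)

Each row of the returned array corresponds to one analysis frame and could then be labeled as vocal or instrumental for training.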


Table E.3: Complete Experiment 5 Results at 2% Incremental Sampling Rates using Wavelet Energy Features (classification accuracy in percent with an associated error value for runs 1 through 10 at sampling rates from 2% to 40% in 2% increments).
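For context on the wavelet energy feature type reported in Table E.3, the sketch below shows a typical per-window computation: the relative energy of the coefficients at each level of a multi-level discrete wavelet decomposition. It is an illustrative sketch only; the PyWavelets call, the db4 wavelet, and the five-level decomposition are assumptions for the example and are not necessarily the settings used in these experiments.

import numpy as np
import pywt  # assumed available; PyWavelets discrete wavelet transform

def wavelet_energy(window, wavelet="db4", level=5):
    # Multi-level DWT of one analysis window:
    # coeffs = [coarsest approximation, then detail bands from coarse to fine].
    coeffs = pywt.wavedec(window, wavelet, level=level)
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    total = energies.sum()
    # Relative energy per subband: a (level + 1)-dimensional feature vector.
    return energies / total if total > 0 else energies

Applied to each fixed-length window of a song, this yields one short feature vector per window that a classifier can use to separate vocal regions from purely instrumental ones.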

