Queen Mary University of London
Department of Electronic Engineering
Xiang LI
Musical Instruments Identification System Master of Science Project
Project supervisor: Dr Mark Plumbley
Disclaimer
This report is submitted as part of the requirement for the degree of MSc in Digital Signal Processing at the University of London. It is the product of my own labour except where indicated in the text. The report may be freely copied and distributed provided the source is acknowledged.
Acknowledgements
Foremost, I would like to thank Dr. Mark Plumbley for supervising this project; he gave me inspiration and provided guidance, advice and support for all aspects of this work. Special thanks to Andrew Nesbit, who gave me very helpful advice on this project, provided the musical instrument sample resource, and helped me keep the project in the right direction. I would also like to thank all of my MSc classmates for supporting me in the project. Thanks to the Department of Electronic Engineering office staff for assistance with the project. Finally, I would like to give a big thanks to my parents; though I am far away from home, they always gave me encouragement and confidence.
Abstract
This project presents the identification of musical instruments. The idea is to construct a computer system which can listen to a musical piece mixing several musical instruments and identify which instruments are playing. In this project, we investigate an efficient method to separate the musical instrument mixture and apply different classifiers to the pre-processed signals, namely the separated individual sources. Several feature extraction methods are implemented and features are selected for a higher recognition rate when designing the pattern recognition system. The performance of the system is evaluated in several experiments.
List of Contents:
Disclaimer ...... 2
Acknowledgements ...... 3
Abstract ...... 4
List of Contents ...... 5
Glossary ...... 6
List of Tables and Figures ...... 7
1. Introduction ...... 8
2. Background and related work ...... 9
  2.1 Blind Source Separation ...... 9
  2.2 Musical Instrument Recognition ...... 16
    2.2.1 Recognition of single tones ...... 16
    2.2.2 Spectral shape ...... 19
    2.2.3 Onset and offset transients, and the amplitude envelope ...... 19
    2.2.4 Pitch features ...... 20
    2.2.5 Amplitude and loudness features ...... 20
3. System Design ...... 22
4. System implementation ...... 23
  4.1 DUET algorithm ...... 23
  4.2 Feature extraction ...... 25
    4.2.1 Mel-Frequency Cepstral Coefficient (MFCC) ...... 25
    4.2.2 Root Mean Square (RMS) ...... 28
    4.2.3 Spectral Centroid (SC) ...... 28
    4.2.4 Zero Crossing Rate (ZCR) ...... 29
    4.2.5 Spectral rolloff ...... 29
    4.2.6 Bandwidth ...... 29
  4.3 Classification ...... 30
    4.3.1 The K Nearest Neighbour classifier ...... 30
    4.3.2 The GMM classifier ...... 32
  4.4 Summary ...... 34
5. Results ...... 35
  5.1 Musical Instrument Database ...... 35
  5.2 Experiments ...... 37
  5.3 Discussion ...... 41
6. Conclusion ...... 43
  6.1 Summary ...... 43
  6.2 Future Work ...... 43
References ...... 45
Glossary:
BASS – Blind Audio Source Separation
BSS – Blind Source Separation
ICA – Independent Component Analysis
SDR – Source-to-Distortion Ratio
SAR – Source-to-Artifacts Ratio
SIR – Source-to-Interference Ratio
STFT – Short-Time Fourier Transform
DUET – Degenerate Unmixing Estimation Technique
IID – Inter-channel Intensity Difference
IPD – Inter-channel Phase Difference
MDCT – Modified Discrete Cosine Transform
MFCC – Mel-Frequency Cepstral Coefficient
RMS – Root Mean Square
SC – Spectral Centroid
ZCR – Zero Crossing Rate
k-NN or KNN – k-Nearest Neighbour
GMM – Gaussian Mixture Model
EM – Expectation Maximization
List of Tables and Figures:
Table 1. Vocabulary used to classify mixture types
Table 2. Summary of recognition percentages of isolated note recognition systems using only one example of each instrument
Table 3. The training database
Table 4. The testing database
Table 5. Wav files to be tested
Table 6. SDR of estimated sources
Table 7. Classification results
Table 8. Results of experiments when the number of sources is three
Table 9. Results of experiments when the number of sources is four
Table 10. Results of experiments when the number of sources is five
Figure 1. Some usual ways of obtaining audio mixtures: live recording, studio recording and synthetic mixing
Figure 2. Separation of a three-source instantaneous stereo mixture using IID cues and binary masking
Figure 3. Block diagram of the implemented musical instrument identification system
Figure 4. Block diagram of the MFCC feature extractor
Figure 5. Mel frequency scale with respect to Hertz scale
Figure 6. The K nearest neighbour rule (K = 5)
Figure 7. Representation of an M-component mixture model
Figure 8. Magnitudes of mixtures in the MDCT domain
Figure 9. Estimated sources
Figure 10. Original sources
1. Introduction
The purpose of this project is to build a musical instrument identification system based on blind source separation and pattern classification. Musical instrument identification plays an important role in musical signal indexing and database retrieval [27]. Where musical content is concerned, complex musical mixtures could be labelled and indexed in terms of musical events, meeting the need for multimedia description. This means people can search for music by musical instrument instead of by type or author. For instance, a user is able to query 'find piano mono parts of a musical database'.
Musical instrument identification is a crucial task and many problems remain unsolved given the current state of the art. My approach consists of two stages: the first stage is blind source separation and the second is source recognition. Most musical instrument recognition studies focus mainly on the case where isolated notes are played [27]. The task of blind source separation must be achieved as a first stage of processing, and it can be even more intricate than source recognition.
Blind source separation aims to estimate the individual sources from a number of observed mixtures of those source signals [13]. If the number of mixtures is greater than, or equal to, the number of sources (the over-determined case), independent component analysis (ICA) may be a good solution. But when there are more sources than mixtures, the problem is under-determined and ICA becomes insufficient. In this case, more information is needed to estimate the sources.
In this project, the mixtures we deal with are stereo, which means only two mixtures are available. A method aiming to increase the sparsity of the mixtures, meaning that in each time-frequency point only one source has large energy while the others do not, is applied to the stereo mixtures [13]. The quality of the separated sources is evaluated by performance measurement.
The separated sources are identified by pattern recognition. Features of each instrument are extracted and input into the classifier. To improve the recognition rate, feature and classifier selection are performed.
The system has been built to separate the sources from two mixtures, extract the features from the musical instrument sounds, and recognise their sources. However, the implemented system is still far from being applicable to real-world musical signals in general: it operates on anechoic mixtures composed of isolated notes.
The rest of this report is organised as follows. Section 2 describes the background on musical instrument identification and the related fields of interest. Section 3 presents how the implemented system is designed, and the implementation of this system is discussed in Section 4. Section 5 describes the experiments and the evaluation of the results. Finally, Section 6 summarises the observations of this project and suggests some future work.
2. Background and related work
In this section, we review the relevant background to our project. According to the aims mentioned in Section 1, we divide the background into two main areas:
1. Blind Source Separation
2. Musical Instrument Recognition
2.1 Blind Source Separation
Music sources [1], including musical instruments, singers and synthetic instruments, usually consist of sequences of events called notes or tones. The signal within each note may be composed of several parts: a nearly periodic signal, such as the harmonic sinusoidal partials made by bowing a string or blowing into a pipe, or a transient signal coming from hitting a drum or a bar or plucking a string. Wideband noise is also added in some woodwind instruments due to blowing. In western music, the fundamental frequencies of periodic tones are generally constant or slowly varying, and lie around discrete values on the semitone scale, a 1/12-octave scale spanning the range between 30 Hz and 4 kHz. The timbre of each instrument depends on the note onset time, on the shape of the spectral envelope at a given note intensity, and on the amount of frequency modulation. Musical phrases, which consist of successive notes, are often played without any silence. Some polyphonic instruments, as opposed to monophonic ones, can also play chords with several simultaneous notes. Within a music concert, western harmony rules tend to favour synchronous notes at rational fundamental frequency ratios such as 2, 3/2 or 5/4. Thus harmonic partials from different sources often overlap at some frequencies [2].
Audio mixtures can be classified into two types: live recordings and synthetic mixtures. Figure 1 describes how the mixture signals are acquired. Synthetic mixtures are often derived from studio recordings, which are unmixed signals where each source is recorded separately. Since synthetic mixing effects differ from the natural mixing effects obtained by live recording, the two types of mixture signals have greatly different properties, which accounts for the use of different processing methods [2].
A common way of recording multiple sources is to record each one successively in a studio [2]. Point sources such as speakers or small musical instruments, which have a limited spatial extent, are usually recorded using a single close microphone. Extended sources such as piano or drums, which can produce distinct sounds in different regions of space at the same time, are better recorded using several microphones. Pop music recordings, where the players synchronise by listening on headphones to a mixture of the previously recorded instruments, employ this technique. Similarly, movie dialogue is dubbed by recording each speaker separately, with synchronisation provided by the images. These studio recordings match the actual sound of the sources exactly, since studio walls are covered with sound-dampening materials and residual reflections of the sound waves on the walls can be limited by using directional microphones. Electric or electronic instruments such as electric guitars and keyboards may even be recorded together, since only the direct sound of each instrument is captured. The advantage of studio recordings is that a different special effect can be applied to each source, because the sources are perfectly separated into distinct tracks. Both studio and live recordings can be transformed into synthetic mixtures with a smaller number of channels using a mixing desk or dedicated software. The purpose of this post-production process is most often to create stereo (two-channel) mixtures such as music CDs or radio broadcasts, or five-channel mixtures such as movie soundtracks [2]. Mono (single-channel) mixtures are rare in practice, although they have been studied as a challenging blind audio source separation (BASS) problem [2].
Figure 1: Some usual ways of obtaining audio mixtures: live recording (left), studio recording (top right) and synthetic mixing (bottom right) (from [2]).

The mixing process for live recordings and synthetic mixtures may be formalized mathematically in the same way. When the mixture contains point sources only, the
channels $(x_i(t))_{1 \le i \le I}$ of the mixture signal are given by

$x_i(t) = \sum_{j=1}^{J} \sum_{\tau=-\infty}^{+\infty} \alpha_{ij}(t,\tau)\, s_j(t-\tau)$   (1)

where $I$ is the number of channels and $J$ is the number of sources, $(s_j(t))_{1 \le j \le J}$ are the original single-channel source signals and $(\alpha_{ij}(t,\tau))_{1 \le i \le I,\, 1 \le j \le J}$ is a set of time-varying mixing filters describing either the room transfer functions from the source positions to the microphone positions or the synthetic mixing effects used. This equation becomes irrelevant for extended sources, which cannot be represented as single-channel signals and do not have well-defined positions. More generally, the mixing process can always be written as
$x_i(t) = \sum_{j=1}^{J} S^{IMG}_{ij}(t)$   (2)

where $S^{IMG}_{ij}(t)$ denotes the image of the $j$-th source on the $i$-th mixture channel. This quantity is defined for point sources by $S^{IMG}_{ij}(t) = \sum_{\tau=-\infty}^{+\infty} \alpha_{ij}(t,\tau)\, s_j(t-\tau)$, and it can
still be defined for extended sources by integrating over all the point sources that compose each extended source [2].

The literature on blind source separation has proposed several definitions of the blind source separation (BSS) problem. One definition states that BSS is the problem of estimating the original source signals $(s_j(t))_{1 \le j \le J}$ given the mixture channels $(x_i(t))_{1 \le i \le I}$ [2]. This definition is relevant for point sources only, and it includes the dereverberation or deconvolution problem, which is the estimation of the inverse of the mixing filters to recover the original source signals with the smallest possible distortion. In practice, prior information about the sources or the mixing filters is needed to obtain perfect dereverberation. Thus most source separation algorithms are able to estimate the original source signals at best up to arbitrary gain or filtering distortions [3, 4]. In the following, dereverberation is considered as a separate problem and we assume instead that the BSS problem consists in estimating the source image signals $(S^{IMG}_{ij}(t))_{1 \le i \le I,\, 1 \le j \le J}$ from the mixture channels $(x_i(t))_{1 \le i \le I}$. This definition makes the problem more tractable by reducing the amount of uncertainty about the estimated sources: in theory, only a potential ordering uncertainty remains for some algorithms [5].

In order to measure the approximate difficulty of separating a given mixture, the signal processing literature classifies mixtures according to three distinct criteria: the respective numbers of mixture channels and sources, the length of the mixing filters, and the variation of the mixing filters over time [6]. Table 1 defines the corresponding mixture models. The terms "over-determined", "determined" and "under-determined", which describe whether there are more, as many, or fewer mixture channels than sources, are borrowed from linear algebra. From these definitions, it appears that most live audio recordings are over-determined or under-determined time-varying reverberant mixtures, while synthetic audio mixtures are under-determined time-invariant or time-varying convolutive mixtures. Under-determined, reverberant or time-varying mixtures are typically more difficult
to separate than over-determined, instantaneous or time-invariant mixtures respectively. Thus the separation of realistic audio mixtures is a difficult problem [2].
Table 1: Vocabulary used to classify mixture types (from [2]).

Mixture model     Meaning
Over-determined   More mixture channels than sources
Determined        As many mixture channels as sources
Under-determined  Fewer mixture channels than sources
Instantaneous     Attenuation gains, no time delays
Anechoic          Attenuation gains and time delays
Convolutive       Non-trivial mixing filters
Reverberant       Mixing filters exhibiting a realistic reverberation time
Time-invariant    Mixing filters constant over time
Time-varying      Mixing filters slowly varying over time
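In the instantaneous case of Table 1, the point-source mixing model of equation (2) reduces to a constant gain matrix applied to the sources. The following is a minimal sketch of that special case, with hypothetical gain values rather than the mixtures used later in this report:

```python
import numpy as np

def mix_instantaneous(sources, gains):
    """Instantaneous mixing: x_i(t) = sum_j a_ij * s_j(t).

    sources: array of shape (J, T) -- J single-channel source signals
    gains:   array of shape (I, J) -- time-invariant mixing gains a_ij
    returns: array of shape (I, T) -- I mixture channels
    """
    return gains @ sources

# Two sources, stereo mixture (I = 2, J = 2), hypothetical gains
t = np.linspace(0.0, 1.0, 8000)
s = np.vstack([np.sin(2 * np.pi * 440 * t),   # source 1: 440 Hz tone
               np.sin(2 * np.pi * 660 * t)])  # source 2: 660 Hz tone
a = np.array([[0.8, 0.3],
              [0.2, 0.7]])
x = mix_instantaneous(s, a)  # shape (2, 8000)
```

An anechoic mixture would additionally delay each source by a per-channel lag before applying the gains.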
Since the separated sources are listened to by audiences, the quality of a given separation result is related to the perceptual distortion between each estimated source image and the corresponding unknown true image. This distortion may be described using a global rating together with sub-ratings corresponding to various kinds of distortions, including interference from other sources, musical noise artifacts, timbre distortion and spatial distortion of the target source. Musical noise artifacts are typically caused by nonlinear filtering of the data. These are more annoying than other distortions, especially for musical signals, which demand a high perceptual quality. Currently, the perceptual degradation due to the various kinds of distortions can be closely quantified only by means of listening tests. Objective measures such as the Source-to-Distortion Ratio (SDR), the Source-to-Interferences Ratio (SIR) and the Sources-to-Artifacts Ratio (SAR) [6] may provide rough estimates within an evaluation framework where the true source images are known [2].

Most algorithms treat BASS as a two-part problem: the first part aims to identify the number of sources and a set of separation parameters attached to each source; the second part filters the mixtures based on these parameters to recover the source image signals. The separation parameters must contain enough information to calculate relevant separating filters. Depending on the mixture, these parameters may include spatial parameters such as source directions and/or spectral parameters such as source magnitude spectra. The filtering problem is generally solved using existing filtering techniques such as beamforming or time-frequency masking. The performance of these techniques depends on the type of mixture.
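As a rough illustration of the objective measures mentioned above, an overall Source-to-Distortion Ratio can be approximated by comparing an estimated source with the true one. This is a simplified sketch only; the full measures of [6] further decompose the error into interference and artifact terms:

```python
import numpy as np

def sdr(reference, estimate):
    """Simplified Source-to-Distortion Ratio in dB:
    10 * log10(||s||^2 / ||s - s_hat||^2)."""
    error = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(error ** 2))

rng = np.random.default_rng(0)
s = rng.standard_normal(1000)          # stand-in for a true source signal
s_hat = s + 0.1 * rng.standard_normal(1000)  # estimate with small distortion
value = sdr(s, s_hat)  # typically around 20 dB for this noise level
```

Higher SDR means the estimate is closer to the true source; the experiments in Section 5 report SDR values for the separated sources.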
Experiments on benchmark datasets have shown that time-frequency masking is more powerful than beamforming for the separation of real-world audio mixtures such as under-determined mixtures or determined reverberant mixtures [7]. Much of the current research effort concentrates on the identification problem. Most identification algorithms consider simplified models that can discriminate the sources in simple situations only. For instance, algorithms based on the spatial directions of the sources may suffer a performance decrease on mixtures involving reverberation, extended sources or sources which are close together. Other algorithms, based on fundamental frequency and spectral envelope information, generally struggle to separate sources from the same class. This research shows that solving the identification problem with no restrictions on the mixtures requires the joint exploitation of many source properties, including properties specific to audio data [8, 9].

Time-frequency masking is a particular kind of non-stationary filtering conducted in the time-frequency domain on each mixture channel separately. By carefully designing the time-varying magnitude response of the filters, it is possible to filter out time-frequency regions dominated by interfering sources. More precisely, the
complex STFT $X_i(n,f)$ of the $i$-th mixture channel is computed and the complex STFTs $(S^{IMG}_{ij}(n,f))_{1 \le j \le J}$ of the source images on this channel are derived by

$S^{IMG}_{ij}(n,f) = M_{ij}(n,f)\, X_i(n,f)$   (3)

where $(M_{ij}(n,f))_{1 \le j \le J}$ is a set of time-frequency masks containing gains between 0 and 1. The masks are usually defined from the magnitudes of the STFTs of the source images $(|S^{IMG}_{ij}(n,f)|)_{1 \le i \le I,\, 1 \le j \le J}$, whose estimation constitutes the inference problem. The most popular masking rules are adaptive Wiener filtering

$M_{ij}(n,f) = \dfrac{|S^{IMG}_{ij}(n,f)|^2}{\sum_{k=1}^{J} |S^{IMG}_{ik}(n,f)|^2}$   (4)

and binary masking

$M_{ij}(n,f) = \begin{cases} 1 & \text{if } |S^{IMG}_{ij}(n,f)| = \max_{k} |S^{IMG}_{ik}(n,f)| \\ 0 & \text{otherwise} \end{cases}$   (5)
In the end, each source image signal $S^{IMG}_{ij}(t)$ is estimated by inverting the corresponding STFT using the standard overlap-add method. This filtering method has also been applied to other time-frequency-like representations such as auditory-motivated filter banks [10].

Since the filtering techniques have been defined, the other part of the BSS problem concerns the identification of demixing filters or time-frequency masks from the data. The simplest identification methods focus on multi-channel mixtures and segregate the sources based on the mixing procedure. The assumption allowing the exploitation of this mixing procedure is that the sources are sparse in the time domain or in the time-frequency domain. This is generally true for speech and music signals: speech signals contain silence segments, and music signals, which are transient or periodic, have their energy concentrated respectively in a few time frames or in a few sub-bands at harmonic frequencies [11, 12, 13]. For convolutive mixtures with moderate reverberation, time-frequency sparsity allows the separation of the sources up to an arbitrary ordering in each sub-band. Stronger assumptions are needed to sort them in the same order across sub-bands, such as the correlation of the source magnitudes across sub-bands or knowledge of the distances between microphones. The Degenerate Unmixing Estimation Technique (DUET) [14] uses a different approach based on the assumption that only one source is present at any given time-frequency point. This so-called W-disjoint orthogonality assumption is expressed mathematically using the STFTs of the source signals as [14]
$S_j(n,f)\, S_{j'}(n,f) = 0 \quad \forall j \ne j',\ \forall n, f$   (6)
Under this assumption, it is possible to determine which source is present in each point from the spatial location information contained in the STFT coefficients of the mixture channels. In the case of stereo mixtures, this information is given by the Inter-channel Intensity Difference [14]

$IID(n,f) = 20 \log_{10} \dfrac{|X_2(n,f)|}{|X_1(n,f)|}$   (7)

and the Inter-channel Phase Difference

$IPD(n,f) = \angle\!\left(\dfrac{X_2(n,f)}{X_1(n,f)}\right) \bmod 2\pi$   (8)

where $\angle(.)$ denotes the principal phase in $[-\pi, \pi]$ of a complex number. For an instantaneous or an anechoic mixture, $IID(n,f)$ equals the relative mixing gain $20 \log_{10}(|a_{2j}|/|a_{1j}|)$ of the active source $j$ in this time-frequency point. For an anechoic mixture, $IPD(n,f)/2\pi f$ equals, modulo $1/f$, the mixing delay $d_j$ corresponding to the active source $j$. Note that a phase ambiguity problem appears in the upper frequency range: above the threshold frequency $f_{max} = 1/(2 d_{max})$, where $d_{max}$ is the maximum allowed absolute delay, several values of $d_j$ may be compatible with the observed value of $IPD(n,f)$. The original DUET method is designed for synthetic anechoic mixtures where the mixing filters combine positive gains and fractional delays between -1 and +1 sample, in which case the phase ambiguity problem does not arise up to the Nyquist frequency. The two-dimensional histogram of $(IID(n,f);\, IPD(n,f)/2\pi f)$ is computed and its peaks, which correspond to the values of the relative mixing gains and delays of the sources, are located. Then, all the time-frequency points belonging to each peak are converted into a binary mask that is used to separate the corresponding source. This algorithm is presented in a more developed statistical framework in [13]. A similar algorithm can be derived for purely instantaneous mixtures using the one-dimensional histogram of $IID(n,f)$ [15]. An example of application with an instantaneous music mixture is
provided in Figure 2. These algorithms achieve good performance on stereo instantaneous or anechoic time-invariant mixtures provided the sources have different spatial directions, that is, different mixing gains or delays. The processing can also be done in real time. However, in practice the W-disjoint orthogonality assumption rarely holds completely and musical noise artifacts appear. The perceptual quality of the separated sources can be improved by suppressing the time-frequency points where several sources are present prior to applying DUET. In [15] these points are located based on the value of the inter-channel coherence, which is the absolute value of the correlation between the coefficients of $X_1(n,f)$ and $X_2(n,f)$ computed over a small number of successive frames. This quantity is equal to one when a single source is present and it is generally smaller when several sources overlap [2].
Figure 2: Separation of a three-source instantaneous stereo mixture using IID cues and binary masking
(sources correspond to IID values of -4, 0 and 4 dB respectively, from [2])
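The DUET spatial cues of equations (7) and (8) can be computed from the two mixture STFTs as in the following sketch (cue extraction only; histogram peak picking and mask construction are omitted, and the array contents are hypothetical):

```python
import numpy as np

def duet_cues(X1, X2, eps=1e-12):
    """Per-bin DUET spatial cues from stereo STFTs X1, X2 (complex arrays).

    IID(n,f) = 20 * log10(|X2| / |X1|)          -- equation (7)
    IPD(n,f) = principal phase of X2/X1 in [-pi, pi] -- equation (8)

    eps guards against division by zero in silent bins.
    """
    iid = 20 * np.log10((np.abs(X2) + eps) / (np.abs(X1) + eps))
    ipd = np.angle(X2 / (X1 + eps))
    return iid, ipd

# Toy single-bin check: channel 2 is channel 1 doubled and rotated by pi/4,
# i.e. a relative gain of 2 and a phase shift of pi/4
X1 = np.array([[1.0 + 0.0j]])
X2 = np.array([[2.0 * np.exp(1j * np.pi / 4)]])
iid, ipd = duet_cues(X1, X2)  # iid ~ 6.02 dB, ipd ~ pi/4
```

In a full DUET implementation, the two-dimensional histogram of these cues would then be peak-picked to estimate the mixing gains and delays, and each peak converted into a binary mask as described above.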
The main limitation of DUET is that it cannot deal with realistic convolutive mixtures where the absolute mixing delay is larger than one sample. In [16], a solution to this problem is provided in the case of stereo convolutive mixtures recorded with a particular microphone pair termed binaural pair, where IID and IPD values are dependent on each other and can be roughly expressed as a function of source direction. The observed direction is estimated in each time-frequency point based on the IPD value, where IID is exploited in the upper frequency range to solve the phase ambiguity problem. Then the histogram of directions is computed and the sources are separated by binary masking as previously. This algorithm is thought to mimic the processing of spatial cues by the auditory system [16].
A similar algorithm, based on learning the exact average mapping from IID and IPD to source direction, is devised in [17]. The same idea could be exploited for other types of microphone pairs. These DUET-like methods provide good performance on live recordings of non-moving point sources with moderate reverberation. However, their performance generally decreases in reverberant environments, because reverberation effectively adds interfering sources at random locations. In this situation, interferences are better removed than with ICA, even for determined mixtures [9].
2.2 Musical Instrument Recognition
Various attempts have been made to build automatic musical instrument recognition systems [26]. Different approaches and scopes have been used by researchers, achieving different performances. Most research has focused on isolated notes, often taken from the same, single source and spanning a very small pitch range [26]. Recent systems have operated on solo music taken from commercial recordings. Polyphonic recognition has also received some attention, although the number of instruments has still been very limited [36]. The studies using isolated tones and monophonic phrases are the most relevant to our scope.
2.2.1 Recognition of single tones
These studies have used isolated notes as test material, with varying number of instruments and pitches.
Studies using one example of each instrument
Kaminskyj and Materka [26] used features derived from a root-mean-square (RMS) energy envelope via PCA, and used a neural network or a k-nearest neighbour (k-NN) classifier to classify guitar, piano, marimba and accordion tones over a one-octave band. Both classifiers achieved good performance, approximately 98 %. However, strong conclusions cannot be drawn, since the instruments were very different, there was only one example of each instrument, the note range was small, and the training and test data were from the same recording session. More recently, Kaminskyj [27] extended the system to recognise 19 instruments over a three-octave pitch range from the McGill collection [28].
Using features derived from the RMS-energy envelope and the constant-Q transform [29], an accuracy of 82 % was reported using a classifier combination scheme. Leave-one-out cross-validation was used, and the pitch of the note was provided to the system and used to limit the search set of training examples.
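The k-NN rule used by these systems can be illustrated with a minimal sketch (toy two-dimensional feature vectors with hypothetical values, not the actual features of [26]):

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points under Euclidean distance."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical feature space: (normalised spectral centroid, RMS)
train_X = np.array([[0.20, 0.10], [0.25, 0.12], [0.80, 0.90], [0.85, 0.80]])
train_y = ["piano", "piano", "marimba", "marimba"]
label = knn_classify(train_X, train_y, np.array([0.22, 0.11]))  # "piano"
```

Real systems of this kind differ mainly in which features span the space and how distances are weighted, not in the voting rule itself.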
Fujinaga and Fraser trained a k-NN with features extracted from 1338 spectral slices of 23 instruments playing a range of pitches [30]. Using leave-one-out cross validation and a genetic algorithm for finding good feature combinations, a recognition accuracy of 50 % was obtained with 23 instruments. When the authors
added features relating to the dynamically changing spectral envelope, and the velocity of the spectral centroid and its variance, the accuracy increased to 64 % [31]. Finally, after small refinements and the addition of spectral irregularity and tristimulus features, an accuracy of 68 % was reported [32].

Martin and Kim reported a system operating on the full pitch ranges of 14 instruments [33]. The samples were a subset of the isolated notes in the McGill collection [28]. The best classifier was the k-NN, enhanced with Fisher discriminant analysis to reduce the dimensionality of the data, and a hierarchical classification architecture for first recognising the instrument families. Using 70 % / 30 % splits between the training and test data, they obtained a recognition rate of 72 % for individual instruments, and after finding a 10-feature set giving the best average performance, an accuracy of 93 % in classification between five instrument families.
Kostek calculated several different features relating to the spectral shape and onset characteristics of tones taken from chromatic scales with different articulation styles [34]. A two-layer feed-forward neural network was used as the classifier. The author reports excellent recognition percentages with four instruments: the bass trombone, trombone, English horn and contra bassoon. However, the pitch of the note was provided to the system, and the training and test material were from different channels of the same stereo recording setup. Later, Kostek and Czyzewski also tried wavelet-based features for musical instrument recognition, but their preliminary results were worse than with the earlier features [35]. In their most recent paper, the same authors expanded their feature set to include 34 FFT-based features and 23 wavelet features [36]. A promising percentage of 90 % with 18 classes is reported; however, the leave-one-out cross-validation scheme probably increases the recognition rate. The results obtained with the wavelet features were almost as good as with the other features.
Table 2 summarizes the recognition percentages reported in isolated-note studies. The most severe limitation of all these studies is that they used only one example of each instrument, which significantly decreases the generalizability of the results. The study described next is the only one using isolated tones from more than one source, and it represents the state of the art in isolated-tone recognition.
Table 2: Summary of recognition percentages of isolated note recognition systems using only one example of each instrument.

Study                  Percentage correct   Number of instruments
Kaminskyj and Materka  98                   4 (guitar, piano, marimba, accordion)
Kaminskyj              82                   19
Fujinaga               50                   23
Fraser                 64                   23
Fujinaga               68                   23
Martin                 72                   14
Kostek                 97 / 81              4 (bass trombone, trombone, English horn, contra bassoon) / 20
Kostek and Czyzewski   93 / 90              4 (oboe, trumpet, violin, cello) / 18
A study using several examples of each instrument
Martin used a wide set of features describing the acoustic properties discussed later in this chapter, which were calculated from the outputs of a log-lag correlogram [37]. The classifier used was a Bayesian classifier within a taxonomic hierarchy, enhanced with context dependent feature selection and rule-one-category-out decisions. The computer system was evaluated with the same data as in his listening test.
In classifying 137 notes of 14 instruments from the McGill collection, with 27 target classes, the best accuracy reported was 39 % for individual instruments and 76 % for instrument-family classification. Thus, when more demanding evaluation material is used, the recognition percentages are significantly lower than in the experiments described above [38].
Summarizing the observations from the constructed recognition systems, timbre studies and instrument acoustics, various features can be used for recognizing musical instruments.
2.2.2 Spectral shape
A classic feature is spectral shape, the time-varying relative amplitude of each frequency partial [39]. Various measures can be used to characterize spectral shape. The spectral energy distribution, measured with the spectral centroid, has been an explanation for the results of many perception experiments [40, 41, 42, 43, 44, 45, 24]. It relates to the perceived brightness of tones [44] and also measures harmonic richness. Sounds that have few harmonics sound soft but dark, while those with many harmonics, especially strong high harmonics, have a bright and sometimes sharp tone.
Formants are characteristic of many instruments, and their frequencies would be one characteristic feature. However, the exact frequencies are hard to measure reliably; therefore, formant information is usually represented with an approximation of the smooth spectral envelope. We will discuss this in more depth in the description of cepstral analysis algorithms in section 4.
The variance of component amplitudes, or spectral irregularity, corresponds to the standard deviation of the amplitudes from the spectral envelope [46]. It has also been referred to as spectral flux or spectral fine structure [34], or spectral smoothness [47]. Irregularity of the spectrum can indicate a complex resonant structure, often found in string instruments. Other measures are the even and odd harmonic content, which can be indicative of the cylindrical tube closed at one end used in clarinets [37].
Measuring the spectrum over different portions of the tone reveals information on different properties. In the quasi-steady state (or almost steady state), information on formants is conveyed in the spectrum. During the onset, the spectral shape may reveal the frequency contents of the source vibration, and the differences in the rate of rise of different frequency partials.
2.2.3 Onset and offset transients, and the amplitude envelope
Onset and offset transients can provide a rich source of information for musical instrument recognition. Some instruments have more rapid onsets than others, i.e. the duration of the onset period (also called rise time) is shorter. Rapid onsets indicate tight coupling between the excitation and resonance structures. For instance, the flute has a very slow onset, while other wind instruments generally have quite rapid onsets. Rise time has often been a perceptually salient cue in human perception experiments [48, 39, 45].
The differences in the attack and decay of different partials are important. For string instruments, the differences are due to variations in the method of excitation, whether bowed or plucked, or variations in the damping of different resonance modes. With the wind instruments, nonlinear feedback causes differences in the development of different partials. A property characterizing the onset times of different partials is onset asynchrony, which is usually measured via the deviation in the onset times and durations of different partials. The absolute and relative onset times of partials reveal information about the center frequencies and the Q values of the resonance modes of the sound source [37].
However, there are some problems in using onset features for musical instrument recognition. Iverson and Krumhansl investigated timbre perception using entire tones, onsets only, and tones minus the onsets [41]. Based on subjects’ ratings, they argue that subjects seem to be making judgments on the basis of acoustic properties that are found throughout the tone. Most likely the attack transients become less useful with melodic phrases than with isolated notes, since the music can be continuous having no clear onsets or offsets. In addition, the shapes of onsets and offsets of partials vary across sounds, depending on the pitch and playing style [39].
2.2.4 Pitch features
The pitch period indicates the vibration frequency of the source. Absolute pitch also tells about the size of the instrument: large instruments commonly produce lower pitches than smaller ones. Also, if we can reliably measure the pitch and know the playing ranges of the possible instruments, pitch can be used to rule out those instruments that cannot produce the measured pitch [33].
Variations in the quasi-steady state of musical instrument tones convey much information about the sound source. Vibrato playing is characteristic of many instruments, but it also reveals information about the resonance structures [37, 48]. The frequency modulation causes the instrument's harmonic partials to interact with the resonances, which in turn causes amplitude modulation; by measuring this, information about the resonance structure is obtained. The stability of the source excitation and the strength of the coupling between the excitation and resonance structures are indicated by random variations or fluctuations in pitch [37]. For example, brass tones have an unstable period during the onset, until the pitch stabilizes at its target value. The unstable interaction between the bow and string causes the tones of string instruments to have high amounts of pitch jitter.
2.2.5 Amplitude and loudness features
Besides pitch, amplitude variations in the quasi-steady state of tones convey much important information. Differences in amplitude envelopes contributed to similarity judgments in [41]; furthermore, the dynamic attributes were present not only in the onset, but throughout the tone. Tremolo, i.e. periodic amplitude modulation, is characteristic of many instruments. For instance, flutes produce strong tremolo. In addition, playing the flute in flutter style introduces characteristic amplitude variation into the tones.
Information on the dynamics of an instrument could also aid recognition. Of the Western orchestral instruments, the brass instruments produce the largest sound levels; the strings are about 10 dB quieter, and the woodwind instruments produce slightly louder sounds than the strings [34].
Information on how the qualities of a tone depend on the playing dynamics might also be used for recognition. Beauchamp has suggested using the ratio of spectral centroid to intensity: as tones become louder, they become brighter, in a relationship that is characteristic of a particular instrument.
3. System Design
In this chapter, an overview of the implemented system is presented before a more detailed description is given.
The whole system is composed of two functional components. Figure 3 presents a block diagram of the main components of the implemented system.
Figure 3. Block diagram of the implemented musical instrument identification system
In the separation stage, each source is estimated by the DUET method. The time-frequency representation of the stereo input is constructed by the modified discrete cosine transform. The attenuation and delay parameters are estimated from each nonzero time-frequency point. The pairs of delay and attenuation parameters are pooled and formed into a two-dimensional histogram. The peaks of this histogram are located, and the mixing parameters are estimated from the peak locations. The time-frequency mask corresponding to each pair is constructed by the maximum-likelihood partition and then applied to one channel of the mixtures to yield the time-frequency representation of each original source. The last step is to convert each estimate into the time domain. A detailed description of the time-frequency representation is presented in the next section.
Each estimated source is fed into a feature extractor, and feature vectors are extracted. Before classification, the training stage stores the feature vectors of each type of musical instrument. In the classification step, the tested feature vector is compared with all the trained feature vectors, and the best match gives the recognition result.
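The two-stage architecture just described can be summarised in a few lines of code. This is a hypothetical sketch, not the project's actual implementation: `duet_separate`, `extract_features` and `classify` are placeholder names for the components described above, passed in as callables.

```python
# Hypothetical skeleton of the separate-then-classify pipeline described
# above; the three callables are placeholders for the real components.

def identify_instruments(stereo_mixture, trained_models,
                         duet_separate, extract_features, classify):
    """Separate a stereo mixture and label each estimated source."""
    estimates = duet_separate(stereo_mixture)      # DUET separation stage
    labels = []
    for source in estimates:
        features = extract_features(source)        # e.g. MFCC, RMS, ZCR...
        labels.append(classify(features, trained_models))
    return labels
```

The structure mirrors Figure 3: separation first, then per-source feature extraction and classification against the trained models.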
4. System implementation
In this section, the formulation and calculation of the DUET method are described, along with a detailed description of each step of the DUET algorithm. The second part of this section is devoted to feature extraction and the implemented classifiers. A wide range of features is described and compared, and a list of the features we implement in this system is given as well. Mathematical details are given to support the discussion where necessary.
4.1 DUET algorithm
In this project, the DUET algorithm is implemented to demix the stereo mixture. Ideally, the DUET algorithm can separate an arbitrary number of sources from two mixtures. In practice, and particularly in this project, the number of sources is kept between three and five, because the simulation runs slowly when the number of sources is large.
Before describing the DUET algorithm, we start with the mixture model. As mentioned in Table 1, the mixture model we deal with in this project is an underdetermined, instantaneous, two-channel mixture. Assume we have N sources s_1(t), ..., s_N(t). Let x_1(t) and x_2(t) be the mixtures such that

x_k(t) = \sum_{j=1}^{N} a_{kj} s_j(t - \delta_{kj}),  k = 1, 2    (9)
where the parameters a_{kj} and \delta_{kj} are the attenuation coefficients and the time delays associated with the path from the j-th source to the k-th receiver [13]. In the instantaneous stereo case

\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ a_{21} & \cdots & a_{2n} \end{bmatrix} \begin{bmatrix} s_1 \\ \vdots \\ s_n \end{bmatrix}    (10)

where s_j is the j-th source, x_i is the i-th mixture, a_{ij} is the positive real amplitude (mixing parameter) of the j-th source in the i-th mixture (observation), 1 <= j <= n and i = 1, 2 [49]. Then we take a real- or complex-valued linear transform
on the two mixtures x_1 and x_2 in Equation 10. The transformed mixtures \tilde{x}_1 and \tilde{x}_2 have the same mixing structure as Equation 10. Then we can estimate the individual sources from \tilde{x}_1 and \tilde{x}_2 by constructing binary time-frequency masks. As described in section 2.1, a condition of W-disjoint orthogonality is assumed, which means that at each point in the transform domain, energy from at most one source dominates. We can then use a mask to extract the coefficients labelling a particular source.
If we can construct the corresponding masks for each source, the sources could be demixed from just one of the mixtures. The masks can be constructed as follows.
The attenuation parameter a_j(t, \omega) can be computed as

a_j(t, \omega) = R_{21}(t, \omega) = \tilde{x}_2(t, \omega) / \tilde{x}_1(t, \omega)    (11)
Then each time-frequency point in the transformed domain is labelled with the attenuation parameter a_j(t, \omega). As we have assumed that at most one source is active at each point, there will be N distinct labels. The masks M_j = 1_{\Omega_j} can be constructed by forming the sets \Omega_j, which is achieved by grouping the time-frequency points with the same label. Once the masks are decided, we deduce that
\tilde{s}_j = M_j \tilde{x}_1    (12)

where \tilde{s}_j is the time-frequency representation of one of the original sources. The last step is to convert each \tilde{s}_j into the time domain.
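The labelling-and-masking steps of equations (11) and (12) can be sketched as follows. This is a simplified, attenuation-only illustration for the instantaneous case (delays ignored), with hypothetical inputs: the two mixtures as time-frequency coefficient arrays, and the candidate attenuations assumed to have been estimated already from the histogram peaks.

```python
import numpy as np

def duet_masks(tf_x1, tf_x2, attenuations):
    """Assign each T-F point to the source whose attenuation a_j is
    closest to the observed ratio R = x2/x1 (Eq. 11), build binary
    masks from the labels, and apply them to one mixture (Eq. 12).
    Attenuation-only sketch for the instantaneous case."""
    ratio = tf_x2 / (tf_x1 + 1e-12)                # observed a(t, w)
    a = np.asarray(attenuations)                   # candidate a_j from peaks
    # distance of each point's ratio to each candidate attenuation
    labels = np.argmin(np.abs(ratio[..., None] - a), axis=-1)
    masks = [labels == j for j in range(len(a))]
    return [m * tf_x1 for m in masks]              # s~_j = M_j x~_1
```

With W-disjoint orthogonal sources, each mask keeps exactly the coefficients where its source dominates, so the products recover the individual time-frequency representations.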
Various forms of time-frequency masking can be applied in different transform domains such as the STFT and the MDCT. A sparse transform is desired, one in which most coefficients are very close to zero and only a few are large. Such a transform helps represent the mixtures in the desired way, so that the sources have disjoint support in the transformed domain.
The short-time Fourier transform (STFT) is the most common transform used in the DUET algorithm. It is a Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time
[15]. The mathematical definition of the short-time Fourier transform is:

X(m, \omega) = \sum_{n=-\infty}^{\infty} x[n] \, w[n - m] \, e^{-j \omega n}    (26)

where x[n] is the signal to be transformed and w[n] is the window function.
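A direct (unoptimised) rendering of equation (26) for a finite signal might look like this; the function name and the hop-based framing convention are illustrative, not taken from the project.

```python
import numpy as np

def stft(x, win, hop):
    """Direct sketch of Eq. (26): windowed DFT of local signal
    sections; row i holds the spectrum of the frame starting at
    sample i * hop."""
    frames = []
    for m in range(0, len(x) - len(win) + 1, hop):
        frames.append(np.fft.fft(x[m:m + len(win)] * win))
    return np.array(frames)
```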
Compared to other Fourier-related transforms such as the STFT, the MDCT is a bit unusual since it has half as many outputs as inputs. The 2N real inputs x_0, ..., x_{2N-1} are transformed into N real outputs X_0, ..., X_{N-1} according to the formula

X_k = \sum_{n=0}^{2N-1} x_n \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(k + \frac{1}{2}\right)\right],  k = 0, ..., N-1    (27)
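Equation (27) can be implemented directly as a matrix product. The sketch below is illustrative, not the project's code; it also includes the inverse transform so the well-known time-domain alias cancellation (TDAC) property can be seen: overlap-adding the inverse transforms of consecutive half-overlapping blocks recovers the interior samples exactly.

```python
import numpy as np

def mdct(x):
    """MDCT of Eq. (27): 2N real inputs -> N real outputs."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N)
    C = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return x @ C

def imdct(X):
    """Inverse MDCT (same cosine kernel, scaled by 1/N); each output
    block contains time-domain aliasing that cancels under 50 %
    overlap-add with neighbouring blocks."""
    N = len(X)
    n = np.arange(2 * N)
    k = np.arange(N)
    C = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (C @ X) / N
```

The 2N-in / N-out shape is what makes the MDCT attractive for a sparse, critically sampled time-frequency representation.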
In this project, the MDCT is used to represent the time-frequency points. The performance of the separated sources is measured by the Source to Distortion Ratio (SDR). We will discuss the performance in the results section.
4.2 Feature extraction
In this section, the detailed description of features used for classification is given and the analysis of the features is performed for musical instrument classification.
4.2.1 Mel-Frequency Cepstral Coefficient (MFCC)
Mel-frequency cepstral coefficients have become one of the most popular techniques for the front-end feature-extraction in automatic speech recognition systems [49].
Brown has utilized cepstral coefficients calculated from a constant-Q transform for the recognition of woodwind instruments [50, 51]. Here we use the FFT-based method utilizing a mel-scaling filterbank. A block diagram of the MFCC feature extractor is shown in Figure 4.
Figure 4. Block diagram of the MFCC feature extractor
We pre-emphasize the input signal to flatten the spectrum. Next, a filterbank consisting of triangular filters, spaced uniformly on the mel-frequency scale and with their heights scaled to unity, is simulated. The mel scale is given by

Mel(f) = 2595 \log_{10}(1 + f/700),    (13)

where f is the linear frequency value. Figure 5 gives the relationship between the mel and hertz scales.
Figure 5. Mel frequency scale with respect to the Hertz scale
The filterbank is implemented by transforming the windowed audio data with the DFT and taking its magnitude. A spectral magnitude value for each channel is calculated by multiplying the magnitude spectrum by each triangular filter and summing the values in each channel. The dynamic range of the spectrum is compressed by taking the logarithm of the magnitude at each filterbank channel. Finally,
a discrete cosine transform is applied to the log filterbank magnitudes m_k, and the cepstral coefficients are computed using the equation below:

c_{mel}(n) = \sum_{k=1}^{K} m_k \cos\left[\frac{\pi n}{K}\left(k - \frac{1}{2}\right)\right].    (14)
DCT decorrelates the cepstral coefficients, thereby making it possible to use diagonal covariance matrices in the statistical modeling of the feature observations.
In most cases, it is possible to retain only the lower order cepstral coefficients to obtain a more compact representation. The lower coefficients describe the overall spectral shape, whereas pitch and spectral fine structure information is included in higher coefficients. The zeroth cepstral coefficient is normally discarded, as it is a function of the channel gain.
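The MFCC pipeline described above (DFT magnitude, triangular mel filterbank, log compression, DCT of equation (14)) can be sketched for a single frame as follows. This is a simplified illustration, not the project's implementation: there is no pre-emphasis or liftering, and the filter and coefficient counts are arbitrary choices.

```python
import numpy as np

def mel(f):                                   # Eq. (13)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=20, n_ceps=12):
    """MFCCs of one windowed frame: |DFT| -> triangular mel filterbank
    -> log -> DCT of Eq. (14).  Simplified sketch."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # filter edge frequencies, equally spaced on the mel scale
    edges = inv_mel(np.linspace(0.0, mel(sr / 2.0), n_filters + 2))
    m = np.empty(n_filters)
    for k in range(n_filters):
        lo, cen, hi = edges[k], edges[k + 1], edges[k + 2]
        up = np.clip((freqs - lo) / (cen - lo), 0.0, None)
        down = np.clip((hi - freqs) / (hi - cen), 0.0, None)
        weights = np.minimum(up, down)        # unit-height triangle
        m[k] = np.log(weights @ mag + 1e-10)  # log filterbank magnitude
    # DCT of Eq. (14): c(n) = sum_k m_k cos(pi n (k - 1/2) / K)
    n = np.arange(1, n_ceps + 1)
    k = np.arange(1, n_filters + 1)
    return np.cos(np.pi * n[:, None] * (k[None, :] - 0.5) / n_filters) @ m
```

Keeping only the first dozen coefficients retains the overall spectral shape, as discussed above.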
The dynamic or transitional properties of the overall spectral envelope can be characterized with delta cepstral coefficients [52, 53]. A first order differential logarithmic spectrum is defined by
\frac{\partial}{\partial t} \log S(t, \omega) = \sum_{n=-\infty}^{\infty} \frac{\partial c_n(t)}{\partial t} e^{-j \omega n}    (15)
where c_n(t) is cepstral coefficient n at time t [53, 54]. Usually the time derivative is obtained by polynomial approximation over a finite segment of the coefficient trajectory, since the cepstral coefficient sequence does not have an analytical solution. In the case of fitting a first-order polynomial h_1 + h_2 t to a segment of the cepstral trajectory c_n(t), t = -M, -M+1, ..., M, the fitting error to be minimized is expressed as

E = \sum_{t=-M}^{M} \left[c_n(t) - (h_1 + h_2 t)\right]^2.    (16)
The resulting solution with respect to h_2 is

h_2 = \frac{\sum_{t=-M}^{M} t \, c_n(t)}{\sum_{t=-M}^{M} t^2},    (17)

and is used as an approximation of the first time derivative of c_n [53], which we denote by \delta_n(t). This gives a smoother estimate of the derivative than a direct difference operation. The curve fitting is done individually for each of the cepstral coefficient trajectories c_n, n = 1, 2, ..., L.
More features can be obtained by estimating the second order derivative, and the resulting features are referred as acceleration coefficients in the speech recognition literature. The efficiency of MFCCs is due to the mel-based filter spacing and the dynamic range compression in the log filter outputs, which represent the mechanisms present in human hearing in a simplified way [25].
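The regression of equation (17) can be applied along one coefficient trajectory as follows. The window half-length M = 2 and the edge replication are illustrative choices, not the project's settings.

```python
import numpy as np

def delta(traj, M=2):
    """First-order regression coefficient of Eq. (17), computed over a
    sliding window t = -M..M of one cepstral-coefficient trajectory."""
    t = np.arange(-M, M + 1)
    padded = np.pad(traj, M, mode='edge')      # replicate trajectory ends
    out = np.empty(len(traj))
    for i in range(len(traj)):
        seg = padded[i:i + 2 * M + 1]
        out[i] = (t @ seg) / (t @ t)           # Eq. (17)
    return out
```

Applying the same function to the delta trajectory yields the second-order (acceleration) coefficients mentioned above.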
4.2.2 Root Mean Square (RMS)
For a short audio signal (frame) consisting of N samples, the amplitude of the signal measured by the root mean square is described by equation (18). RMS is a measure of the loudness of an audio signal, and since changes in loudness are important cues for new sound events, it can be used in audio segmentation [57]. In this project, RMS features are used to detect boundaries between different musical instruments. The method for detecting boundaries is based on a dissimilarity measure between these amplitude distributions.
P = \sqrt{\frac{1}{N} \sum_{t=1}^{N} x^2(t)}    (18)
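As a minimal sketch, the RMS of one frame is:

```python
import numpy as np

def rms(frame):
    """Root-mean-square amplitude of one frame (Eq. 18)."""
    frame = np.asarray(frame, dtype=float)
    return np.sqrt(np.mean(frame ** 2))
```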
4.2.3 Spectral Centroid (SC)
Spectral centroid (SC) is a simple but very useful feature [57]. Research has demonstrated that the spectral centroid correlates strongly with the subjective qualities of "brightness" or "sharpness". It can be calculated from different mid-level representations; commonly it is defined as the first moment of the magnitude spectrum with respect to frequency. However, the harmonic spectrum of a musical sound is hard to measure reliably; therefore, more robust feature values are obtained if the spectral centroid is calculated from the outputs of a filterbank.
We calculate the spectral centroid according to the following equation:

f_{sc} = \frac{\sum_{k=1}^{B} P(k) f(k)}{\sum_{k=1}^{B} P(k)}    (19)
where k is the index of the filter channel, P(k) is its RMS power, f(k) is its center frequency, and B is the total number of filterbank channels.
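Equation (19) is a power-weighted mean of the filterbank centre frequencies, which can be sketched as:

```python
import numpy as np

def spectral_centroid(powers, centers):
    """Power-weighted mean of filterbank centre frequencies (Eq. 19)."""
    powers = np.asarray(powers, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return (powers @ centers) / powers.sum()
```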
4.2.4 Zero Crossing Rates (ZCR)
In the case of discrete time signals, a zero crossing is said to occur if there is a sign difference between successive samples. The rate at which zero crossings happen is a simple measure of the frequency content of a signal. For narrow band signals, the average zero crossing rate gives a reasonable way to estimate the frequency content of the signal.
But for a broadband signal such as speech, it is much less accurate. However, by using a representation based on the short-time average zero-crossing rate, rough estimates of spectral properties can be obtained. The expression for the short-time average zero-crossing rate is shown below. In this expression, each pair of samples is checked to determine whether a zero crossing occurs, and the average is then computed over N consecutive samples.
Z_n = \sum_{m} \left| \mathrm{sign}(x[m]) - \mathrm{sign}(x[m-1]) \right| w(n - m)    (20)

where the sign function is

\mathrm{sign}(x[m]) = 1 if x[m] >= 0, and -1 if x[m] < 0,

the weighting window is

w(m) = \frac{1}{2N} for 0 <= m <= N - 1, and 0 otherwise,

and x is the time-domain signal of the frame.
Zero crossing rates have been proven to be useful in characterizing different audio signals and have been popularly used in speech/music classification problems [55].
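A minimal sketch of equation (20) for a single frame, using the rectangular weight w = 1/(2N):

```python
import numpy as np

def zcr(frame):
    """Short-time average zero-crossing rate of one N-sample frame
    (Eq. 20), with rectangular weighting w = 1/(2N)."""
    x = np.asarray(frame, dtype=float)
    s = np.where(x >= 0, 1.0, -1.0)            # sign function of Eq. (20)
    return np.sum(np.abs(s[1:] - s[:-1])) / (2 * len(x))
```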
4.2.5 Spectral rolloff:
The spectral roll-off point measures the frequency below which a certain amount of the power spectrum resides. It is calculated by summing the power spectrum samples until the desired percentage (threshold) of the total energy is reached. This measure distinguishes voiced from unvoiced speech: unvoiced speech has a high proportion of its energy in the high-frequency range of the spectrum, whereas most of the energy of voiced speech and music is contained in the lower bands. It is a measure of the 'skewness' of the spectral shape; the value is higher for right-skewed distributions [56].
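The cumulative-sum computation just described can be sketched as follows; the 85 % default threshold is a common convention, assumed here rather than taken from the project.

```python
import numpy as np

def spectral_rolloff(power_spectrum, threshold=0.85):
    """Lowest bin index at which the cumulative spectral power reaches
    `threshold` of the total (85 % default is an assumed convention)."""
    p = np.asarray(power_spectrum, dtype=float)
    cum = np.cumsum(p)
    return int(np.searchsorted(cum, threshold * cum[-1]))
```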
4.2.6 Bandwidth:
Bandwidth is defined as the width of the range of frequencies that the signal occupies. It is computed as the square root of the power-weighted average of the squared difference between the spectral components and the frequency centroid [57]. In general, the bandwidth of speech ranges from 0.3 kHz to 3.4 kHz. For music the range is much wider, ordinarily up to 22.05 kHz.
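The definition above, sketched directly:

```python
import numpy as np

def bandwidth(powers, freqs):
    """Square root of the power-weighted average squared deviation of
    the spectral components from the spectral centroid."""
    p = np.asarray(powers, dtype=float)
    f = np.asarray(freqs, dtype=float)
    centroid = (p @ f) / p.sum()
    return np.sqrt((p @ (f - centroid) ** 2) / p.sum())
```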
In section 4.2 the main audio features used in the musical instrument classification system have been considered. MFCC features are frequently used by many researchers in music genre classification problems. Spectral rolloff, bandwidth and root mean square are widely used in different audio classification schemes. The zero-crossing rate, being a simple measure of the frequency content of a signal, is also used to distinguish between different audio types.
4.3 Classification
In this section, the problem considered is to classify the extracted audio features into one of a number of audio classes. We can consider the task of classification as a process where unknown input data is assigned to a particular class. A decision rule is established and applied to make such assignments. For example, a simple decision rule could be the assignment of a new data sample to the class whose mean it is closest to in feature space.
Classification algorithms are divided into supervised and unsupervised algorithms. In supervised classification, the algorithm is trained with a labeled set of training samples, whereas in unsupervised classification the data is grouped into clusters without the use of a labeled training set. Parametric versus nonparametric is another way to categorize classification algorithms. In parametric methods, the functional form of the probability density of the feature vectors of each class is known. In nonparametric methods, on the other hand, no specific functional form is assumed in advance; instead, the probability density is approximated locally based on the training data.
4.3.1 The K-Nearest Neighbour classifier
The k nearest neighbour classifier is an example of a nonparametric classifier [50]. The basic algorithm is simple: for each input feature vector to be classified, a search is made for the K nearest training examples, and the input is assigned to the class with the most members among them. The Euclidean distance is commonly used as the metric to measure the neighbourhood. For the special case K = 1 we obtain the nearest neighbour classifier, which simply assigns the input feature vector to the same class as the nearest training vector.
The Euclidean distance between feature vectors X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_n) is given by:

d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (21)
The k-NN algorithm, as mentioned earlier, is very simple yet rather powerful, and is used in many applications. However, some issues need to be considered when k-NN classifiers are used. The Euclidean distance measure is typically used in the k-NN algorithm; in some cases, this metric might produce an undesirable outcome. For instance, when several feature sets, one of which has relatively large values, are combined as input to a k-NN classifier, the k-NN will be biased towards the larger values, which leads to very poor performance. A possible method for avoiding this problem is to normalize the feature sets.
In Figure 6 an example of a three-class classification task is shown. The aim is to use the k-NN classifier to find the class of an unknown feature vector X. As can be seen in the figure, of the K = 5 closest neighbours four belong to class a and only one belongs to class b, and hence X is assigned to class a.
Figure 6. The k nearest neighbour rule (K = 5)
The k nearest neighbour classifier has some disadvantages [50]: it needs the feature vectors of all the training data whenever a new feature vector is to be classified, and hence has large storage requirements, and its classification time is longer compared to some other classifiers. On the other hand, it also has some important qualities: it requires no training, which is helpful especially when new training data are added, and it uses local information and hence can learn complex functions without needing to represent them explicitly. In this project, k-NN is adopted to allow new data to be added.
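The whole classifier fits in a few lines. This sketch assumes the features have already been normalized, as recommended above; it is an illustration, not the project's implementation.

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=5):
    """k-NN with Euclidean distance: majority vote among the k closest
    training vectors (features assumed already normalized)."""
    d = np.linalg.norm(np.asarray(train_X) - np.asarray(x), axis=1)
    nearest = np.asarray(train_y)[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```

Note that the entire training set (`train_X`, `train_y`) must be kept in memory, which is exactly the storage cost discussed above.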
4.3.2 The GMM classifier
The Gaussian Mixture Model (GMM) classifier combines the advantages of parametric and nonparametric methods. As the name indicates, the density function takes the form known as a mixture model. A brief description of the classifier is given in the following paragraphs.
Given a d-dimensional vector X, a Gaussian mixture density is a weighted sum of M component densities, written as

p(X | \theta) = \sum_{j=1}^{M} p_j b_j(X)    (22)

where p_j >= 0, j = 1, 2, ..., M are the mixture weights, with \sum_{j=1}^{M} p_j = 1, and the b_j(X) are the component densities, each with mean vector u_j and covariance matrix \Sigma_j. The number of components M is treated as a parameter of the model and is typically much smaller than the number N of data points.
b_j(X) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\left[-\frac{1}{2}(X - u_j)^T \Sigma_j^{-1} (X - u_j)\right]    (23)
For the Gaussian mixture model given in equation (22), the mixture density is parameterized by the mean vectors, covariance matrices and mixture weights of all component densities:

\theta_j = \{p_j, u_j, \Sigma_j\},  j = 1, 2, ..., M    (24)
Figure 7 Representation of an M component mixture model
Gaussian mixture models can assume many different forms, depending on the type of covariance matrix. The two most commonly used are full and diagonal covariance matrices. When the covariance matrices are diagonal, the number of parameters that need to be optimised is reduced. This constraint reduces the modelling capability, and it might be necessary to increase the number of components; however, in many applications this compromise has proven worthwhile.
For audio classification, the distribution of the feature vectors extracted from a particular audio class is modelled by a mixture of M weighted multidimensional Gaussian distributions. Given a sequence of feature vectors from an audio class, maximum-likelihood estimates of the parameters are obtained using the iterative Expectation Maximization (EM) algorithm. The basic idea of the EM algorithm is, beginning with an initial model \theta', to estimate a new model \theta such that p(x | \theta) >= p(x | \theta'). The new model then becomes the initial model for the next iteration, and the process is continued until some convergence threshold is reached. The class of an unknown audio sample can then be obtained with the log-likelihood ratio. Assuming equal priors for each class, points in feature space for which the likelihood of a class is relatively high are classified as belonging to that class. The likelihood ratio for musical instrument classification can be expressed as follows:
LR = \frac{p(X | \theta_{M_1})}{p(X | \theta_{M_2})}

where p(X | \theta_{M_1}) is the likelihood given that the feature vector X is from musical instrument 1, and p(X | \theta_{M_2}) is the likelihood given that X
is from musical instrument 2. The likelihood ratio in the log domain is given as:

LLR = \log \frac{p(X | \theta_{M_1})}{p(X | \theta_{M_2})}

If the value of the LLR is greater than 0, the unknown audio sample belongs to musical instrument 1; otherwise it belongs to musical instrument 2.
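The two-class decision rule can be sketched with a diagonal-covariance mixture density following equations (22) and (23). This is an illustration only; the model parameters in the example are toy values, not trained instrument models.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Log-likelihood log p(X|theta) of one feature vector under a
    diagonal-covariance Gaussian mixture (Eqs. 22-23)."""
    X = np.asarray(X, dtype=float)
    total = 0.0
    for p, u, v in zip(weights, means, variances):
        u, v = np.asarray(u, float), np.asarray(v, float)
        norm = np.prod(2 * np.pi * v) ** -0.5          # 1/((2pi)^{d/2}|S|^{1/2})
        total += p * norm * np.exp(-0.5 * np.sum((X - u) ** 2 / v))
    return np.log(total)

def llr_classify(X, model1, model2):
    """Instrument 1 if the log-likelihood ratio is positive, else 2."""
    return 1 if gmm_loglik(X, *model1) - gmm_loglik(X, *model2) > 0 else 2
```

In practice the weights, means and variances would come from EM training on each instrument's feature vectors.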
In this section different classification algorithms have been discussed. The classification algorithms were categorized into parametric and nonparametric methods. The k-nearest neighbour classifier is a simple yet powerful classification method; however, its classification time is longer compared to some other classifiers, and it requires storage of all the training vectors. The Gaussian mixture model requires estimation of the parameters of a model and hence is computationally complex. Contrary to the k-NN, the GMM does not require storage of the training vectors and is much faster.
4.4 Summary
In section 4, we presented a detailed description of all the techniques relevant to the implemented system, including the DUET algorithm, feature extraction and classification. The implementation of the DUET algorithm was described step by step and its formulation was given. All the features we implement in this system were discussed, and their definitions were given in support. The k-NN and GMM classifiers, which represent nonparametric and parametric approaches respectively, were described in section 4.3. The whole system uses the DUET algorithm as a pre-processing step to obtain each source of the audio piece, whereas the feature extraction provides the discriminative features of each musical instrument. The extracted feature vectors of each source are used to train the different models, and the queried feature vector is then classified by the implemented classifier. The performance of this system is measured in the next section.
5. Results
This section describes the experiments used to evaluate this system with varying amounts of data. We first introduce the evaluation data set, and then present the computer simulations and experimental results.
5.1 Musical Instrument Database
The sample sources are downloaded from the University of Iowa website [58], called the UIowa database. Each source is a wave file of one instrument, and there are eighteen instruments in total. Moreover, one instrument can have different playing styles among different sources. We generate the training set and the testing set from the sample sources, i.e. we randomly select 2/3 of the sources to form the training set, and the remaining 1/3 to form the testing set. The training set is used to generate the classification models, and the testing set is mixed from several sources. As mentioned before, when the number of sources in a mixture is greater than five, the DUET algorithm works slowly. For efficiency, we choose sources from only five instruments to generate a testing mixture, but our system is trained on the sources from all eighteen instruments. The classification rate is evaluated on the testing set. Table 3 lists the number of training samples in each class of the eighteen instruments. Table 4 summarises which instruments are used to compose the testing mixtures. The pitch range differs slightly from source to source.
The stereo mixture is generated by instantaneous mixing as follows:

\[
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix}
0.90 & 0.71 & 0.50 & 0.28 & 0.64 \\
0.09 & 0.29 & 0.50 & 0.72 & 0.36
\end{bmatrix}
\begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \end{bmatrix}
\tag{25}
\]

where x1 and x2 are the two mixtures, and s1, ..., s5 are five different instrumental sources.
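The instantaneous (delay-free, filter-free) mixing model can be sketched in plain Python. The coefficient values below are illustrative panning weights in the style of equation (25); each column distributes one source between the two channels.

```python
# Illustrative 2 x 5 instantaneous mixing matrix: row 0 feeds mixture x1,
# row 1 feeds mixture x2, and each column pans one source.
A = [[0.90, 0.71, 0.50, 0.28, 0.64],
     [0.09, 0.29, 0.50, 0.72, 0.36]]

def mix(sources, A):
    """sources: list of equal-length sample lists (one per instrument).
    Returns one output channel per row of A."""
    n = len(sources[0])
    return [[sum(A[ch][j] * sources[j][t] for j in range(len(sources)))
             for t in range(n)]
            for ch in range(len(A))]

# Toy check with constant unit "sources": every output sample of a channel
# equals the sum of that channel's mixing coefficients.
x1, x2 = mix([[1.0] * 4 for _ in range(5)], A)
```

Because the mixing is instantaneous, each output sample depends only on the same-time samples of the sources, which is exactly the assumption the DUET attenuation/delay estimation relies on.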
Table 3. The training database

No.  Instrument          Number of sample sources
1    Alto Flute           8
2    Alto Saxophone      12
3    Bass Flute           8
4    Bassoon             12
5    Bass Trombone        8
6    B-flat Clarinet      9
7    Bach Trumpet        18
8    Cello               28
9    Double Bass         23
10   E-flat Clarinet      9
11   Flute               15
12   Horn                 8
13   Oboe                 8
14   Soprano Saxophone    8
15   Tenor Trombone       8
16   Tuba                 6
17   Viola               21
18   Violin              26
Table 4. The testing database

Instrument      Playing styles         Sources
Alto Saxophone  Vibrato, Non vibrato   Novib.mf.C4B4, Vib.ff.Db3B3, Novib.mf.Db3B3, Vib.mf.Db3B3, Novib.pp.C5Ab5, Vib.pp.C4B4
Bassoon         Normal                 ff.C4B4, mf.C3B3, pp.C2B2
Double Bass     Arco                   ff.sulA.C2B2, ff.sulE.C2B2, ff.sulA.C3A3, ff.sulE.E1B1, ff.sulD.C3B3, ff.sulG.C4G4, ff.sulD.D2B2, ff.sulG.G2B2
Flute           Vibrato                Vib.ff.B3B4, Vib.pp.C5B5, Vib.pp.B3B4, Vib.mf.B3B4, Vib.pp.C6B6
Viola           Arco                   sulA.mf.A4B4, sulD.mf.C5B5, sulA.mf.C5B5, sulD.mf.C6Eb6, sulA.mf.C6Ab6, sulD.mf.D4B4, sulC.mf.C3B3, sulG.mf.C4B4, sulC.mf.C4B4, sulG.mf.C5G5, sulG.mf.G3B3
5.2 Experiments
We propose three test cases, in which the mixtures consist of three, four and five sources respectively. We ran five experiments for each case, i.e. fifteen experiments in total. Table 5 illustrates an experiment with three sources. We discuss the evaluation by walking through this three-source experiment, while the complete results of all the experiments are listed in tables 8, 9 and 10 and analysed in section 5.3.
Table 5. Wave files to be tested

No.  Instrument      Source
2    Alto Saxophone  AltoSax.Vib.pp.C4B4
9    Double Bass     Bass.ff.sulD.D2B2
4    Bassoon         Bassoon.mf.C3B3

These three sources are mixed instantaneously by a mixer as follows:
\[
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix}
0.90 & 0.71 & 0.50 \\
0.09 & 0.29 & 0.50
\end{bmatrix}
\begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix}
\tag{24}
\]

where s1, s2 and s3 are the sources in Table 5.
As described in section 4.1, the MDCT with a block size of 4096 is applied to the stereo mixture to transform it into the time-frequency domain. The time-frequency magnitudes of the two mixtures are plotted as follows:
Figure 8. Magnitudes of mixtures in MDCT domain
Figure 8 presents x̃1 and x̃2, the time-frequency representations of the two mixtures in the DUET algorithm. As described in section 4.1, the purpose of the transform is to represent the mixtures in a sparse domain, in which the sources are approximately disjoint and can therefore be separated by time-frequency masking. The two images in Figure 8 map the magnitude values of the mixtures in the MDCT domain to image values between 0 and 255, the 8-bit pixel range. Each block takes 4096 input samples (the block size we used) and produces 2048 coefficients, which appear on the frequency axis of Figure 8. The input signal is divided into 569 frames, as can be seen on the time axis, and the MDCT is applied to each frame. The results of applying the DUET algorithm are displayed in Figure 9 below: the three estimated sources are plotted, and the original three sources are presented in Figure 10 for comparison.
Figure 9. Estimated sources
Figure 10. Original sources
Compared to the plots in Figure 10, the three plots in Figure 9 show different levels of performance. The Bassoon signal is clearly the best estimate of the three. Comparing the Double Bass and Alto Saxophone signals, we can see signal leakage between the two estimates, which makes them deviate from the original signals; in particular, part of the Alto Saxophone signal appears as interference in the Double Bass estimate. However, the envelopes of the Double Bass and Alto Saxophone signals have not changed greatly.
The performance of the DUET separation can be evaluated both by subjective listening tests and by objective measures such as the Signal-to-Distortion Ratio (SDR). Table 6 lists the SDR of each estimated source.
Table 6. SDR of estimated sources

Source               SDR (dB)
AltoSax.Vib.pp.C4B4  17.4453
Bass.ff.sulD.D2B2    10.4249
Bassoon.mf.C3B3       6.0127
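A simplified SDR computation can be sketched as follows. Note that this plain version treats any deviation from the reference as distortion; the BSS_EVAL measure cited in [6] additionally allows for gain and filtering ambiguities before forming the ratio, so this is a deliberate simplification.

```python
import math

def sdr_db(reference, estimate):
    """Signal-to-Distortion Ratio in dB: energy of the reference signal
    divided by the energy of the estimation error."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error == 0.0:
        return float("inf")  # perfect reconstruction
    return 10.0 * math.log10(signal / error)

# An estimate equal to half the reference: the error energy is 1/4 of the
# signal energy, so the SDR is 10*log10(4), roughly 6 dB.
ref = [1.0, -2.0, 3.0, -4.0]
est = [0.5, -1.0, 1.5, -2.0]
```

Higher SDR means less residual interference and artifact energy in the estimate, which is why the best-separated source in Table 6 also has the largest value.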
Now that the separation has been achieved, the remaining task is the classification of the estimated sources. The training data are processed before classification: each feature vector extracted from the training data is labelled with its class number, from one to eighteen, because the classifier we adopt is k-Nearest Neighbour. The training data therefore form a matrix composed of the feature vectors extracted from each class in the training set.
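The classification step can be sketched with a minimal k-NN classifier (Euclidean distance, majority vote). The feature dimensionality, the value of k and the toy vectors below are illustrative, not taken from the report.

```python
from collections import Counter

def knn_classify(train_vectors, train_labels, query, k=3):
    """Minimal k-nearest-neighbour classifier: label the query with the
    majority class among its k closest training vectors."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(vec, query)), label)
        for vec, label in zip(train_vectors, train_labels))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors labelled with class numbers in the report's
# 1..18 style (2 = Alto Saxophone, 9 = Double Bass).
train = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (1.1, 0.9)]
labels = [2, 2, 9, 9]
pred = knn_classify(train, labels, (0.05, 0.0), k=3)
```

Because k-NN stores the labelled training matrix directly, adding new training data requires no retraining, which is the adaptivity advantage mentioned later in the discussion.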
The classification results are given in table 7 below.
Table 7. Classification result

Source name          Original source  Estimated result  Classification result
AltoSax.Vib.pp.C4B4  2                2                 correct
Bass.ff.sulD.D2B2    9                9                 correct
Bassoon.mf.C3B3      4                4                 correct
From the classification results, all three instruments are recognized by the musical instrument identification system. However, the system has to be evaluated over more experiments so that a correct-classification percentage can be obtained. All the experiments are divided into three groups based on the number of sources; e.g. when the number of sources is three, five mixtures are generated by mixing three sources instantaneously. For each mixture, the DUET algorithm is applied and the estimated sources are classified by the musical instrument identification system. The results for each mixture include the SDR, which measures the performance of the DUET algorithm, and the classification results, which reflect the selection of features and classifier. The results for each group of experiments are listed in tables 8, 9 and 10, and a discussion of the results is given in the next section.
Table 8. Results of experiments when the number of sources is three

Mixture  Source name              SDR (dB)  Original source  Estimated result  Classification result
1        AltoSax.Vib.pp.C4B4      17.4453   2                2                 correct
1        Bass.ff.sulD.D2B2        10.4249   9                9                 correct
1        Bassoon.mf.C3B3           6.0127   4                4                 correct
2        AltoSax.Novib.mf.Db3B3   16.0226   2                1                 incorrect
2        Bassoon.mf.C3B3          10.1893   4                4                 correct
2        Flute.vib.pp.C5B5         5.0703   11               6                 incorrect
3        Bassoon.mf.C3B3          19.3207   4                4                 correct
3        Flute.vib.pp.C5B5         8.2396   11               7                 incorrect
3        Viola.sulC.mf.C4B4        5.6172   17               17                correct
4        AltoSax.Vib.pp.C4B4      16.5936   2                2                 correct
4        Bassoon.pp.C2B2          10.5727   4                4                 correct
4        Viola.sulD.mf.C5B5        5.4975   17               17                correct
5        Bass.ff.sulD.D2B2        17.7940   9                9                 correct
5        Bassoon.pp.C2B2          10.3970   4                4                 correct
5        Viola.sulC.mf.C3B3        5.1615   17               17                correct

Average correct percentage: 80%
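The 80% figure for this group can be reproduced directly from the (original, estimated) class pairs of Table 8, which is a convenient sanity check when tallying the per-group scores:

```python
def accuracy(results):
    """Fraction of sources classified correctly over a group of mixtures.
    `results` is a list of (original_class, estimated_class) pairs."""
    correct = sum(1 for orig, est in results if orig == est)
    return correct / len(results)

# The fifteen class pairs of the three-source group (Table 8).
group3 = [(2, 2), (9, 9), (4, 4),      # mixture 1
          (2, 1), (4, 4), (11, 6),     # mixture 2
          (4, 4), (11, 7), (17, 17),   # mixture 3
          (2, 2), (4, 4), (17, 17),    # mixture 4
          (9, 9), (4, 4), (17, 17)]    # mixture 5
rate = accuracy(group3)  # 12 of 15 sources correct
```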
Table 9. Results of experiments when the number of sources is four

Mixture  Source name              SDR (dB)  Original source  Estimated result  Classification result
1        AltoSax.Novib.mf.C4B4    17.5359   2                2                 correct
1        Bass.ff.sulA.C2B2        10.4535   9                8                 incorrect
1        Bassoon.mf.C3B3           6.0304   4                4                 correct
1        Flute.Vib.mf.B3B4         2.8896   17               17                correct
2        AltoSax.Novib.mf.Db3B3   16.0304   2                1                 incorrect
2        Bassoon.mf.C3B3          10.2572   4                4                 correct
2        Flute.vib.pp.B3B4         4.6458   11               7                 incorrect
2        Viola.sulC.mf.C4B4        2.7497   17               16                incorrect
3        AltoSax.Novib.pp.C5Ab5   15.8065   2                1                 incorrect
3        Bassoon.pp.C2B2          10.5570   4                4                 correct
3        Flute.vib.pp.C6B6         6.0216   11               11                correct
3        Viola.sulC.mf.C3B3        2.8737   17               17                correct
4        AltoSax.vib.ff.Db3B3     17.0802   2                2                 correct
4        Bassoon.ff.C4B4          10.5354   4                4                 correct
4        Flute.vib.pp.B3B4         4.5036   11               3                 incorrect
4        Viola.sulD.mf.C6Eb6       2.9021   17               18                incorrect
5        AltoSax.vib.mf.Db3B3     18.2850   2                2                 correct
5        Bassoon.pp.C2B2           9.6308   4                4                 correct
5        Viola.sulC.mf.C4B4        5.3924   17               18                incorrect
5        Bass.ff.sulC.C2B2         2.8477   9                9                 correct

Average correct percentage: 60%
Table 10. Results of experiments when the number of sources is five

Mixture  Source name              SDR (dB)  Original source  Estimated result  Classification result
1        AltoSax.Novib.pp.C5Ab5   15.7971   2                5                 incorrect
1        Bass.ff.sulA.C3A3        10.4974   9                9                 correct
1        Bassoon.pp.C2B2           5.9451   4                4                 correct
1        Flute.vib.ff.B3B4         2.8918   11               8                 incorrect
1        Viola.sulA.mf.C6Ab6       8.6238   17               18                incorrect
2        AltoSax.Novib.mf.C4B4    16.9862   2                2                 correct
2        Bass.ff.sulG.C4G4         4.9220   9                9                 correct
2        Bassoon.ff.C4B4          10.6727   4                4                 correct
2        Flute.vib.pp.B3B4         2.8657   11               7                 incorrect
2        Viola.sulG.mf.C4B4        5.0476   17               17                correct
3        AltoSax.Novib.mf.Db3B3   16.6418   2                2                 correct
3        Bass.ff.sulG.D2B2         9.9178   9                9                 correct
3        Bassoon.mf.C3B3           5.8671   4                4                 correct
3        Flute.vib.pp.B3B4         2.5365   11               1                 incorrect
3        Viola.sulD.mf.D4B4        6.6230   17               8                 incorrect
4        AltoSax.vib.mf.Db3B3     18.0407   2                5                 incorrect
4        Bass.ff.sulA.C2B2         9.9196   9                8                 incorrect
4        Bassoon.mf.C3B3           5.8679   4                4                 correct
4        Flute.vib.pp.B3B4         2.7429   11               1                 incorrect
4        Viola.sulC.mf.C3B3        5.4507   17               16                incorrect
5        AltoSax.vib.mf.Db3B3     17.3504   2                5                 incorrect
5        Bass.ff.sulE.E1B1        10.3685   9                9                 correct
5        Bassoon.mf.C3B3           5.9931   4                4                 correct
5        Flute.vib.mf.B3B4         2.8947   11               8                 incorrect
5        Viola.sulD.mf.C5B5        7.7829   17               17                correct

Average correct percentage: 48%
5.3 Discussion
It is clear from Tables 8, 9 and 10 that the more sources we mix, the worse the system performs. This is not surprising: the system is based on feature selection, and with more sources we would need more discriminant features to represent each musical instrument.
In Table 8, most of the sources have been identified correctly; in particular, three of the five mixtures were identified without any mistake. Overall, the system achieves 80% correctness over these five three-source mixtures.
Table 9 shows that no mixture had all of its sources identified correctly. Nevertheless, three of the five mixtures contain only one misclassification each, while one of the remaining two still achieves 50% correctness. In general, when a mixture has four sources, our system has 60% identification accuracy.
Table 10 is similar to Table 9 in that no mixture is identified 100% correctly. Mixture 2 has the best identification, since only one of its five sources is misclassified. Three of the five sources are identified correctly in mixtures 3 and 5, while two of the five are recognized for mixture 1. The worst is mixture 4, in which only one source is identified correctly. Generally, identification for mixtures with five sources is worse than for mixtures with four sources, and our system shows 48% correctness.
From the above discussion, the system is evaluated by both SDR and classification results. Furthermore, the correct percentages for each group of experiments have been computed from the results in Tables 8, 9 and 10.
The worst result is 48%, obtained when there are five sources in one mixture. For comparison, the expected recognition rate is 1/18 if each source's class is guessed at random, since the database includes 18 classes of musical instruments: only about a 5.6% chance of recognizing a source, which is a very low probability. The system constructed in this project raises this percentage to 80% at best, when each mixture has three sources. Even in the worst case, where each mixture consists of five sources, the recognition percentage of 48% is still much better than the 5.6% obtained by guessing randomly.
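The comparison against chance level can be verified with a few lines:

```python
# Chance level for guessing one of 18 instrument classes at random,
# versus the measured recognition rates reported above.
chance = 1 / 18                      # about 0.0556, i.e. roughly 5.6%
best, worst = 0.80, 0.48             # three-source and five-source cases
improvement_worst = worst / chance   # even the worst case beats chance severalfold
```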
Another benefit of this system is that the separation pre-processing improves recognition. If we classified each mixture without separation, each mixture would yield only one recognition result, rather than a result for each source as we require. Applying the DUET algorithm as a pre-processing step makes it possible to recognize each source individually.
The system is more likely to identify the sources of a mixture when there are fewer of them. The reason is simple: the sources we classify are estimated from mixtures, and more sources generate more interference. Another reason is that the DUET algorithm is based on binary masking, which introduces artifacts, so we cannot expect the estimated sources to be exactly the same as the originals. Even for the original sources, the classification stage plays an important role: the features chosen are crucial for a high correct percentage, and the classifier we adopt also influences the performance of the system. The k-nearest neighbour classifier has some disadvantages, as mentioned in section 4.3, but newly added data does not have to be retrained, which makes the system adaptable to different databases.
6. Conclusion
6.1 Summary
We have described a system that can listen to a mixture of musical instruments and recognize them. The work started by reviewing Blind Source Separation and Musical Instrument Recognition. We then studied time-frequency masking techniques, especially the DUET algorithm, for musical instrument separation. Features that make musical instruments distinguishable from each other were presented and discussed, and the principles of two classifiers, k-NN and GMM, were described.
In general, the system has three functional components, as presented in the system overview. To build the musical instrument identification system, the DUET algorithm is implemented as a pre-processing step, and the estimated sources are evaluated by SDR. The classification stage follows the DUET algorithm: features are extracted from the estimated sources and normalized to preserve generality, and the k-NN classifier is used to evaluate the testing data on this identification system.
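The report does not specify the exact normalisation scheme used before classification; one common choice, assumed here purely for illustration, is z-score normalisation of each feature dimension over the training set:

```python
def normalize_features(vectors):
    """Normalise each feature dimension to zero mean and unit variance
    (an assumed z-score scheme), so that no single feature dominates
    the k-NN distance computation."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((v[d] - means[d]) ** 2 for v in vectors) / n
        stds.append(var ** 0.5 or 1.0)  # guard against constant features
    return [[(v[d] - means[d]) / stds[d] for d in range(dims)]
            for v in vectors]

# Two toy 2-D feature vectors with very different raw scales per dimension.
normed = normalize_features([[1.0, 10.0], [3.0, 30.0]])
```

Whatever scheme is used, the same means and scales fitted on the training set must be applied to the query vectors, otherwise train and test features live in different spaces.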
The results of all the experiments give a quantitative evaluation of this system. The system performs best when each mixture has three sources; when each mixture has more sources, the recognition percentage decreases. The worst percentage is 48%, when each mixture has five sources. This is still much better than the chance level for recognizing an individual instrument.
The musical mixtures we identify are composed of isolated notes. In order to make truly realistic evaluations, more acoustic data would be needed, including monophonic material. In general, the task of reliably recognizing a wide set of instruments from realistic monophonic recordings is not trivial; it is difficult for humans and especially for computers. It becomes easier when recognition is performed at the level of instrument families.
6.2 Future Work
The main challenge in constructing musical instrument recognition systems is increasing their robustness. Many factors influence the performance of this system, such as the time-frequency masking scheme and the selection of features and classifiers. Binary masking has limits, such as introducing artifacts into the estimated sources. Other factors include the different playing styles and dynamics that vary the sound spectrum; very few features are constant across the pitch range of an instrument.
In this project, the k-NN approach is implemented and MFCCs are used as the main features. It is likely that MFCCs alone are not sufficient for this task, and therefore ways to effectively combine the various other features with cepstral features should be examined. Combining classifiers is one interesting alternative [59]. For instance, it would be worth experimenting with combining the GMM or a Hidden Markov Model using cepstral features calculated per frame with the k-NN using features calculated for each note.
References
[1] D. E. Hall, "Musical Acoustics", 3rd Edition. Brooks Cole, 2001.
[2] E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, "Blind audio source separation". Technical Report C4DM-TR-05-01, Centre for Digital Music, Queen Mary, University of London, 24 November 2005.
[3] P. Comon, "Independent component analysis - a new concept?", Signal Processing, vol. 36, no. 3, pp. 287-314, 1994.
[4] K. Torkkola, "Blind separation of convolved sources based on information maximization", in Proc. IEEE Workshop on Neural Networks for Signal Processing (NNSP), 1996, pp. 423-432.
[5] K. Matsuoka and S. Nakashima, "Minimal distortion principle for blind source separation", in Proc. Int. Conf. on Independent Component Analysis and Blind Source Separation (ICA), 2001, pp. 722-727.
[6] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation", IEEE Trans. on Speech and Audio Processing, vol. 14, no. 4, 2006, to appear.
[7] E. Vincent and R. Gribonval, "Construction d'estimateurs oracles pour la separation de sources", in Proc. 20th GRETSI Symposium on Signal and Image Processing, 2005, pp. 1245-1248.
[8] T. Nakatani, "Computational auditory scene analysis based on residue-driven architecture and its application to mixed speech recognition", Ph.D. dissertation, Dept. of Applied Analysis and Complex Dynamical Systems, Kyoto University, 2002.
[9] E. Vincent, "Musical source separation using time-frequency source priors", IEEE Trans. on Speech and Audio Processing, vol. 14, no. 1, 2006, to appear.
[10] G. J. Brown and M. P. Cooke, "Computational auditory scene analysis", Computer Speech and Language, vol. 8, pp. 297-336, 1994.
[11] M. Zibulevsky, B. A. Pearlmutter, P. Bofill, and P. Kisilev, "Blind source separation by sparse decomposition in a signal dictionary", in Independent Component Analysis: Principles and Practice. Cambridge Press, 2001, pp. 181-208.
[12] M. E. Davies, "Audio source separation", in Mathematics in Signal Processing V. Oxford University Press, 2002.
[13] O. Yilmaz and S. T. Rickard, "Blind separation of speech mixtures via time-frequency masking", IEEE Trans. on Signal Processing, vol. 52, no. 7, pp. 1830-1847, 2004.
[14] A. N. Jourjine, S. T. Rickard, and O. Yilmaz, "Blind separation of disjoint orthogonal signals: Demixing N sources from 2 mixtures", in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), 2000, pp. V-2985-2988.
[15] C. Avendano and J.-M. Jot, "Frequency domain techniques for stereo to multichannel upmix", in Proc. AES 22nd Conference on Virtual, Synthetic and Entertainment Audio, 2002, pp. 121-130.
[16] H. Viste and G. Evangelista, "On the use of spatial cues to improve binaural source separation", in Proc. Int. Conf. on Digital Audio Effects (DAFx), 2003, pp. 209-213.
[17] N. Roman, D. Wang, and G. J. Brown, "Speech segregation based on sound localization", Journal of the ASA, vol. 114, no. 4, pp. 2236-2252, 2003.
[19] Feiten, Frank, Ungvary, "Organization of sounds with neural nets", in Proc. International Computer Music Conference, 1991.
[20] De Poli, Prandoni, Tonella, "Timbre clustering by self-organizing neural networks", in Proc. of X Colloquium on Musical Informatics. University of Milan, 1993.
[21] Cosi, De Poli, Lauzzana, "Auditory Modelling and Self-Organizing Neural Networks for Timbre Classification", Journal of New Music Research, vol. 23, pp. 71-98, 1994.
[22] Feiten, Guntzel, "Automatic indexing of a sound database using self-organizing neural nets", Computer Music Journal, vol. 18, no. 3, pp. 53-65, 1994.
[23] Toiviainen, Kaipainen, Louhivuori, "Musical timbre: similarity ratings correlate with computational feature space distances", Journal of New Music Research, vol. 24, no. 3, pp. 282-298, 1995.
[24] Toiviainen, "Optimizing Auditory Images and Distance Metrics for Self-Organizing Timbre Maps", Journal of New Music Research, vol. 25, pp. 1-30, 1996.
[25] De Poli, Prandoni, "Sonological Models for Timbre Characterization", Journal of New Music Research, vol. 26, pp. 170-197, 1997.
[26] Kaminskyj, Materka, "Automatic Source Identification of Monophonic Musical Instrument Sounds", in Proc. of the IEEE Int. Conf. on Neural Networks, 1995.
[27] Kaminskyj, "Multi-feature Musical Instrument Sound Classifier", in Proc. Australasian Computer Music Conference, Queensland University of Technology, July 2000.
[28] F. Opolko and J. Wapnick, "McGill University Master Samples" (compact disc). McGill University, 1987.
[29] Brown, Puckette, "An Efficient Algorithm for the Calculation of a Constant Q Transform", J. Acoust. Soc. Am., vol. 92, pp. 2698-2701, 1992.
[30] Fujinaga, "Machine recognition of timbre using steady-state tone of acoustic musical instruments", in Proc. of the International Computer Music Conference, 1998.
[31] Fraser, Fujinaga, "Towards real-time recognition of acoustic musical instruments", in Proc. of the International Computer Music Conference, 1999.
[32] Fujinaga, "Realtime recognition of orchestral instruments", in Proc. of the International Computer Music Conference, 2000.
[33] Martin, "Musical instrument identification: A pattern-recognition approach", presented at the 136th meeting of the Acoustical Society of America, October 13, 1998.
[34] Kostek, "Soft Computing in Acoustics: Applications of Neural Networks, Fuzzy Logic and Rough Sets to Musical Acoustics". Physica-Verlag, 1999.
[35] "Automatic Classification of Musical Sounds", in Proc. 108th Audio Eng. Soc. Convention.
[36] Kostek, Czyzewski, "Automatic Recognition of Musical Instrument Sounds - Further Developments", in Proc. 110th Audio Eng. Soc. Convention, Amsterdam, Netherlands, May 2001.
[37] Martin, "Sound-Source Recognition: A Theory and Computational Model", Ph.D. thesis, MIT.
[39] Handel, "Timbre Perception and Auditory Object Identification", in Moore (ed.), "Hearing".
[40] Grey, "Multidimensional perceptual scaling of musical timbres", J. Acoust. Soc. Am., vol. 61, no. 5, May 1977.
[41] Iverson, Krumhansl, "Isolating the dynamic attributes of musical timbre", J. Acoust. Soc. Am., vol. 94, pp. 2595-2603, 1993.
[42] Plomp, "Aspects of tone sensation". London, Academic Press, 1976.
[43] Wedin, Goude, "Dimension analysis of the perception of instrumental timbre", Scandinavian Journal of Psychology, vol. 13, pp. 228-240, 1972.
[44] Wessel, "Timbre space as a musical control structure", Computer Music Journal, vol. 3, no. 2, 1979.
[45] De Poli, Prandoni, "Sonological Models for Timbre Characterization", Journal of New Music Research, vol. 26, pp. 170-197.
[46] J. Krimphoff, S. McAdams, and S. Winsberg, "Caractérisation du timbre des sons complexes. II. Analyses acoustiques et quantification psychophysiques", Journal de Physique, pp. 625-628, 1994.
[47] McAdams, Beauchamp, Meneguzzi, "Discrimination of musical instrument sounds resynthesized with simplified spectrotemporal parameters", J. Acoust. Soc. Am., vol. 105, pp. 882-897, 1999.
[48] McAdams, "Recognition of Auditory Sound Sources and Events", in Thinking in Sound: The Cognitive Psychology of Human Audition. Oxford University Press, 1993.
[49] Davis, Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
[50] Brown, "Computer identification of musical instruments using pattern recognition with cepstral coefficients as features", J. Acoust. Soc. Am., vol. 105, no. 3, March 1999.
[51] Brown, "Feature dependence in the automatic identification of musical woodwind instruments", J. Acoust. Soc. Am., vol. 109, no. 3, March 2001.
[52] Soong, Rosenberg, "On the Use of Instantaneous and Transitional Spectral Information in Speaker Recognition", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871-879, 1988.
[53] Rabiner, Juang, "Fundamentals of speech recognition". Prentice-Hall, 1993.
[54] Young, Kershaw, Odell, Ollason, Valtchev, Woodland, "The HTK Book (for HTK Version 3.0)". Cambridge University Engineering Department, July 2000.
[55] Lie Lu, Hong-Jiang Zhang, and Hao Jiang, "Content analysis for audio classification and segmentation", IEEE Transactions on Speech and Audio Processing, vol. 10, no. 7, October 2002.
[56] Eric Scheirer and Malcolm Slaney, "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator", Interval Research Corp., Palo Alto, CA, USA.
[57] Bai Liang, Hu Yanli, Lao Songyang, Chen Jianyun, and Wu Lingda, "Feature Analysis and Extraction for Audio Automatic Classification", Multimedia Research & Development Center, National University of Defense Technology, Changsha, P. R. China.
[58] University of Iowa, University of Iowa Musical Instrument Samples page, http://theremin.music.uiowa.edu, 2000.
[59] Kittler, Hatef, Duin, Matas, "On Combining Classifiers", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, March 1998.