Audio-based Bird Species Classification
Final Report
SGN 81006 - Signal Processing Innovation Project
Version 1.1, June 2015
Shriram Nandakumar
Student Number: 244935
Client: Tuomas Virtanen, Department of Signal Processing, Tampere University of Technology.
Expected Credits: 7
Client: Tuomas Virtanen
Course Responsible: Sari Peltonen
Version history
Sl. No.  Date        Version  Person      Description
01       14.05.2015  1.0      Shriram N.  Preliminary version
02       01.06.2015  1.1      Shriram N.  Final version
Abstract
This document presents the details of a project on hierarchical classification of bird species using audio information. The project has immediate applications in biodiversity conservation and is also of considerable interest to the machine learning community. The preliminary tasks included understanding the data and its taxonomy, followed by its retrieval and organization. Tools were then developed to visualize the hierarchical class structure. Frame-level MFCC features were extracted from the raw audio files and used to train multi-class classifiers at the leaf level of the taxonomy. Naïve Bayes and k-Nearest Neighbour classifiers were investigated. Due to the challenging nature of the database and the wide, flat class structure, the recognition rates were poor. Nevertheless, hierarchical classification with local classifiers was attempted as the next step, specifically the Local Classifier per Parent Node (LCPN) approach. Performance was assessed using hierarchical precision (hP), hierarchical recall (hR) and hierarchical F-score (hF). There was no significant improvement in performance, which opens up opportunities for further investigation.
CONTENTS
1. Introduction
2. Summary of the Project
   2.1 Project Organization
   2.2 Project Objectives
   2.3 Project Resources
3. Project Implementation Details
   3.1 Database & Taxonomy
   3.2 Class Labelling
   3.3 Feature Extraction
   3.4 Training & Test Data Preparation
   3.5 Classifiers Used
   3.6 Hierarchical Classification
      3.6.1 Flat Classification
      3.6.2 Local Classifier per Parent Node Approach
      3.6.3 Performance Measures
4. Project Realization
   4.1 Workload Division
   4.2 Meetings with the Client
   4.3 Problems, Delays and Changes
   4.4 Budget
   4.5 Lessons Learnt
5. Project Results and Conclusions
6. Comments on the Course
7. References
List of Abbreviations
DAG Directed Acyclic Graph
LCPN Local Classifier per Parent Node
MFCC Mel-Frequency Cepstral Coefficients
NB Naïve Bayes
k-NN k-Nearest Neighbours
1. Introduction
Automatic classification and recognition of bird species by their acoustic cues has been a
subject of interest to ornithologists, ecologists, biodiversity conservationists and pattern
detection researchers for many years [1]. Birds have been used widely as indicators of
biodiversity because they provide critical ecosystem services, respond rapidly to change, are
relatively easy to detect, and may reflect changes at lower trophic levels (e.g., insects, plants)
[2]. Hence, the immediate and most often cited application of this project is automatic bird population surveys.
In many application fields, taxonomies and hierarchies are natural ways to organize and classify objects. Machine learning research, however, has largely focused on flat target prediction, where the output is a single binary or multi-valued scalar variable [3]. The natural taxonomical structure of bird species therefore offers ample scope for research in machine learning. Most of the work in the area has concerned signal representation, noise removal, feature extraction and flat target classification; hierarchical classification is seldom attempted [4].
2. Summary of the project
2.1 Project Organization
The project, in all its stages, was done by a single person. The client was Tuomas Virtanen
([email protected]) representing the Audio Research Team (http://arg.cs.tut.fi) at
Tampere University of Technology.
2.2 Project Objectives
The client gave the author multiple objectives. First, methods had to be found to retrieve a subset of the database from Xeno-Canto [5] and organize it. The author also had to develop tools to visualize the hierarchy. Suitable feature extraction and classification tools had to be used to build an audio-based bird species recognition system. Emphasis was to be placed on the implementation of hierarchical classification rather than on improving performance. Finally, the evaluation results had to be reported.
The personal goals of the author were to successfully apply the signal processing and machine learning methods he had learned and to gain hands-on experience. Being more interested in the underlying mathematics of the whole process, the author also wanted to investigate the problem within a more mathematical framework.
2.3 Project Resources
This section lists the project resources that were needed to carry out the project:
1. Audio data from Xeno-Canto [5], which the author had to find ways to obtain.
2. A personal computer with adequate computing resources (CPU 2+ GHz, 2 GB RAM) and ample disk space (at least 20 GB). The author used his own laptop computer.
3. Matlab Software.
4. Openly available tools for feature extraction (MFCC calculation) and classification [6].
3. Project Implementation Details
This chapter discusses the implementation of the developed audio-based bird species classification system. It starts with a general description of the retrieved data and the class taxonomy. This is followed by a description of the feature extraction module and details of the classifiers used. An introduction to hierarchical classification is provided along with the performance measures used in the assessment.
3.1 Database & Taxonomy
In order to solve the bird species identification problem in a machine learning framework, it was necessary to have a database of recorded bird songs labelled with the corresponding species. The site http://www.xeno-canto.org/ [5] contains an extensive collection of recorded songs for bird species, along with the scientific taxonomy of each species. A database of bird audio records therefore had to be obtained from the site by an information extraction procedure. The taxonomical details of the employed dataset are provided in Table 1. The dataset is composed of 3435 recordings of bird songs covering 48 species that appear on the South Atlantic coast of Brazil [4]. The recordings are not standardized: they were made in different environments and are corrupted by sounds from co-habiting species and other noise sources such as wind, rain and vehicles in the background.
For flat classification, only the species was taken into account, and there are altogether 48 species. For hierarchical classification, the order and family were also taken into account. Since every species in the taxonomy of the employed dataset has a unique genus, the species level was skipped from the hierarchy. Hence, a 3-level hierarchy was used for classification. Table 1 is visualized as a tree in Figure 1, where the nodes are the class labels.
3.2 Class Labelling
In order to perform classification, the audio samples had to be tagged with appropriate labels.
For flat classification, the class labels were given as numbers in the range 1 to 48. For hierarchical classification, class labels were given for every level of the hierarchy. The leaf nodes of the tree have four components in their class labels. For example, the first leaf from the left has the class label 0.1.1.1 (0 indicates the root of the tree) and the last leaf has the class label 0.4.1.3.
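As an illustration, the two labelling schemes can be represented in Matlab roughly as follows. This is a minimal sketch with hypothetical variable names; the project's actual label bookkeeping may differ.

    % flatLabels: one species index (1..48) per recording
    % hierLabels: one root-to-leaf label path per recording, e.g. 0.1.1.1 -> [0 1 1 1]
    flatLabels = [1; 2; 48];        % flat labels for three example recordings
    hierLabels = [0 1 1 1;          % first leaf from the left
                  0 1 1 2;
                  0 4 1 3];         % last leaf (0 denotes the root)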
3.3 Feature Extraction

The main task of the project was to concentrate on the implementation of hierarchical classification. Hence, the ubiquitous static MFCCs were used as features. This choice is justified by the fact that MFCCs are the most commonly used features for various audio-related pattern recognition problems and have proven robust in characterizing the amplitude spectrum in a way that corresponds to how the human auditory system processes audio. The openly available VOICEBOX speech processing toolbox [6] was used for this purpose. The default parameters (frame length = largest power of 2 below 0.03 × sampling frequency in Hz, 50% overlap between frames, 12 cepstral coefficients, Hamming window in the time domain, triangular filters in the mel domain) were used. The details of the individual steps are out of scope for this report.
The statistical averages (mean and standard deviation) of the frame-wise MFCCs computed in this manner were used as features. The feature extraction stage thus yielded a representation of each raw audio signal as a 24-dimensional feature vector (the first 12 components being the mean of the MFCCs and the remaining 12 the standard deviation).
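As a rough illustration of this stage, the following Matlab sketch computes the 24-dimensional feature vector for one recording, assuming VOICEBOX's melcepst (with the default parameters described above) is on the path; the file name is hypothetical.

    [s, fs] = audioread('XC_recording.wav');  % load one raw audio file
    c = melcepst(s, fs);                      % frame-wise MFCCs: one 12-dim row per frame
    feat = [mean(c, 1), std(c, 0, 1)];        % 24-dim vector: per-coefficient mean and std

Stacking one such row per recording yields the N-by-24 feature matrix used in the later stages.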
Table 1.
A schema of the employed hierarchy [4]
Figure 1. Class Hierarchy Visualization (a tree of height 3 whose nodes are the class labels)
3.4 Training & Test Data Preparation

Every 24-dimensional feature vector was given the appropriate class label in order to perform supervised classification. The resulting dataset was divided into training and test sets in the ratio 70:30. Care was taken that all 48 species were represented in both the training and test sets. Since simple classifiers were used, no cross-validation was required.
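A stratified hold-out split of this kind can be obtained in Matlab with cvpartition, which stratifies on the class-label grouping variable so that every species appears in both sets. This is a minimal sketch with hypothetical variable names (feats is the N-by-24 feature matrix from the previous section):

    cv     = cvpartition(flatLabels, 'HoldOut', 0.3);   % stratified 70:30 split
    Xtrain = feats(training(cv), :);  ytrain = flatLabels(training(cv));
    Xtest  = feats(test(cv), :);      ytest  = flatLabels(test(cv));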
3.5 Classifiers Used

Simple multi-class classifiers that could easily be extended to hierarchical classification were the main requirement. Hierarchical classification of text is a well-studied problem in which Naïve Bayes is the most commonly used classifier, so it was the natural first choice. Moreover, owing to the popularity and proven efficiency of k-Nearest Neighbours in audio classification tasks, it was also strongly considered. Plain-vanilla neural networks were tried as well but abandoned owing to the limited computing power available.
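For reference, both flat classifiers can be trained along the following lines with recent releases of the Matlab Statistics Toolbox (a minimal sketch, not the project's actual code; older releases offer equivalents such as NaiveBayes.fit):

    nb  = fitcnb(Xtrain, ytrain);                       % Gaussian Naive Bayes
    knn = fitcknn(Xtrain, ytrain, 'NumNeighbors', 5);   % k-NN with k = 5
    yhatNB  = predict(nb,  Xtest);                      % predicted species labels
    yhatKNN = predict(knn, Xtest);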
3.6 Hierarchical Classification

Hierarchical classification is a type of structured classification problem in which the output of the classification algorithm is defined over a class taxonomy. (The term structured classification is broader and denotes any classification problem with some structure, hierarchical or not, among the classes.) “A class taxonomy is a tree-structured regular concept hierarchy defined over a partially ordered set (C, ≺), where C is a finite set that enumerates all class concepts in the application domain, and the relation ≺ represents the ‘IS-A’ relationship” [7]:
- The one and only greatest element “R” is the root of the tree.
- ∀ 𝑐𝑖, 𝑐𝑗 ∈ 𝐶, if 𝑐𝑖 ≺ 𝑐𝑗 then 𝑐𝑗 ⊀ 𝑐𝑖 (asymmetry).
- ∀ 𝑐𝑖 ∈ 𝐶, 𝑐𝑖 ⊀ 𝑐𝑖 (irreflexivity).
- ∀ 𝑐𝑖, 𝑐𝑗, 𝑐𝑘 ∈ 𝐶, if 𝑐𝑖 ≺ 𝑐𝑗 and 𝑐𝑗 ≺ 𝑐𝑘 then 𝑐𝑖 ≺ 𝑐𝑘 (transitivity).
A class taxonomy can be a tree or a Directed Acyclic Graph (DAG). This report does not cover DAGs, as the methods they require are quite different.
Figure 2. An example of a tree-based hierarchical class structure: the root R has children 1 and 2; node 1 has children 1.1 and 1.2, node 2 has children 2.1 and 2.2; node 1.2 has leaves 1.2.1 and 1.2.2, and node 2.2 has leaves 2.2.1 and 2.2.2.
The two conventional types of classification methods, two-class and multi-class classifiers, cannot directly cope with hierarchical classes [7]. In the context of hierarchical classification, most approaches can be multi-label as well. For instance, considering the hierarchical class structure presented in Figure 2 (where R denotes the root node), if the output of a classifier is
class 2.1.1, it is natural to say that it also belongs to classes 2 and 2.1, therefore having three
classes as the output of the classifier.
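This implicit multi-label expansion amounts to collecting the ancestors of the predicted node, e.g. as in the following Matlab sketch over a parent-pointer encoding of the tree in Figure 2 (node numbering and variable names are hypothetical; 0 denotes the root R):

    %   node id:  1    2    3      4      5      6
    %   label:    1    2    1.1    1.2    2.1    2.2
    parent = [0 0 1 1 2 2];        % parent(v) = parent node of v (0 = root)
    node = 5;                      % e.g. the node labelled 2.1
    anc  = node;
    while parent(node) ~= 0        % climb towards (but excluding) the root
        node = parent(node);
        anc  = [node, anc];        % prepend each ancestor
    end
    % anc is now [2 5], i.e. the classes labelled 2 and 2.1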
Hierarchical classification methods differ along a number of criteria. The first criterion is the type of hierarchical structure used: tree or DAG. As previously mentioned, DAGs will not be considered here. The second criterion is how deep in the hierarchy the classification is performed: the method can be implemented so that it always predicts a leaf node (often termed mandatory leaf-node prediction in the literature), or it can stop the classification at a node of any level of the hierarchy (non-mandatory leaf-node prediction).
The third criterion is how the hierarchical structure is explored. The current literature often refers to top-down (or local) classifiers, where the system employs a set of local classifiers; big-bang (or global) classifiers, where a single classifier copes with the entire class hierarchy; and flat classifiers, which ignore the class relationships and typically predict only the leaf nodes.
In this work only flat classifiers and one method of top-down (local) classifiers called Local
Classifier per Parent Node (LCPN) were used. For more information on the definition, scope
and details of hierarchical classification, the reader should refer to [7].
3.6.1 Flat Classification

The simplest approach to hierarchical classification is to completely ignore the class hierarchy, typically predicting only classes at the leaf nodes [7]. This approach is more like a
traditional classification algorithm during training and testing. However, indirectly it provides
a solution for hierarchical classification, because, when a leaf class is assigned to an example,
one can consider that all its ancestor classes are also implicitly assigned to that instance.
However, this very simple approach has the serious disadvantage of having to build a classifier
to discriminate among a large number of classes (all leaf classes), without exploring
information about parent-child class relationships present in the class hierarchy [7]. Figure 3
illustrates this approach.
Figure 3. Flat classification using flat multi-class classification algorithm. Circles represent
classes and shaded circles represent flat classes over which a single multi-class classifier is
trained.
3.6.2 Local Classifier per Parent Node (LCPN) Approach

For every parent node in the class hierarchy, a multi-class classifier is trained to distinguish between its child nodes. Figure 4 illustrates this approach.
3.6.2.1 Training Phase

In order to train the classifiers, either the “siblings” policy or the “exclusive siblings” policy can be employed. The notation used to concisely explain the policies is listed in Table 2.
Figure 4. Local Classifier per Parent Node. Circles represent classes and partially shaded
circles represent multi-class classifiers predicting their child classes.
Table 2. Notations for local classifiers [7]
Symbol Meaning
𝑇𝑟 Set of all training examples
𝑇𝑟+(𝑐𝑗) Set of positive training examples of 𝑐𝑗
𝑇𝑟−(𝑐𝑗) Set of negative training examples of 𝑐𝑗
↑ (𝑐𝑗) Parent category of 𝑐𝑗
↓ (𝑐𝑗) Set of children categories of 𝑐𝑗
⇑ (𝑐𝑗) Set of ancestor categories of 𝑐𝑗
⇓ (𝑐𝑗) Set of descendant categories of 𝑐𝑗
↔ (𝑐𝑗) Set of sibling categories of 𝑐𝑗
∗ (𝑐𝑗) Examples whose most specific known class is 𝑐𝑗
Siblings policy: 𝑇𝑟+(𝑐𝑗) = ∗(𝑐𝑗) ∪ ⇓(𝑐𝑗) and 𝑇𝑟−(𝑐𝑗) = ↔(𝑐𝑗) ∪ ⇓(↔(𝑐𝑗))
Exclusive siblings policy: 𝑇𝑟+(𝑐𝑗) = ∗(𝑐𝑗) and 𝑇𝑟−(𝑐𝑗) = ↔(𝑐𝑗)
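In Matlab, the siblings policy for one node cj could be realized roughly as below, reusing the hypothetical parent-pointer array from the earlier sketch; labels(i) holds the most specific class of training example i (all names are assumptions, not the project's actual code):

    nNodes = numel(parent);
    anc = cell(nNodes, 1);                  % anc{v}: ancestors of node v
    for v = 1:nNodes
        u = v;
        while parent(u) ~= 0
            u = parent(u);
            anc{v}(end+1) = u;
        end
    end
    isUnder = @(S) cellfun(@(a) any(ismember(a, S)), anc)';  % descendants of set S
    sibs    = setdiff(find(parent == parent(cj)), cj);       % siblings of cj
    posCls  = [cj, find(isUnder(cj))];                       % *(cj) and its descendants
    negCls  = [sibs, find(isUnder(sibs))];                   % siblings and their descendants
    TrPos   = ismember(labels, posCls);                      % mask for Tr+(cj)
    TrNeg   = ismember(labels, negCls);                      % mask for Tr-(cj)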
3.6.2.2 Testing Phase

The testing phase is best explained with an example. Considering Figure 4, suppose that the first-level classifier assigns the example to class 2. The second-level classifier, which was trained only with the children of class node 2, in this case 2.1 and 2.2, will then make its class assignment (and so on, if deeper-level classifiers are available).
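The top-down pass itself is a short loop. The sketch below assumes one trained model per parent node stored in a containers.Map keyed by node id, and a cell array children listing each node's children (hypothetical structures, not the project's actual code):

    node = 0;                              % start at the root
    while ~isempty(children{node + 1})     % +1 maps node ids 0,1,... to Matlab indices
        mdl  = localModels(node);          % classifier trained on the children of node
        node = predict(mdl, x);            % x: 1-by-24 MFCC feature vector
    end
    % node now holds the predicted leaf class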
3.6.3 Performance Measures
When dealing with hierarchical classification problems, it is necessary to use evaluation
measures appropriate for such problems. In this work, the metrics of hierarchical precision (hP), hierarchical recall (hR) and hierarchical F-score (hF), as defined by Kiritchenko et al. in [8], were used. The formulas for computing the measures are as follows:
hP = (1/|I|) ∑ᵢ |P̂ᵢ ∩ T̂ᵢ| / |P̂ᵢ|

hR = (1/|I|) ∑ᵢ |P̂ᵢ ∩ T̂ᵢ| / |T̂ᵢ|

hF = (2 · hP · hR) / (hP + hR)
where I is the set of all test examples, P̂ᵢ is the set consisting of the most specific class predicted for test example i and all its ancestor classes, and T̂ᵢ is the set consisting of the true class of test example i and all its ancestor classes. The motivation for these measures is discussed in detail in [7] and [8].
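Given per-example sets of predicted and true classes (each the most specific class together with its ancestors), the three measures follow directly from the formulas above. A minimal Matlab sketch, where predSets and trueSets are hypothetical cell arrays of node-id vectors:

    nI  = numel(predSets);                 % number of test examples |I|
    hPi = zeros(nI, 1);  hRi = zeros(nI, 1);
    for i = 1:nI
        common = numel(intersect(predSets{i}, trueSets{i}));
        hPi(i) = common / numel(predSets{i});
        hRi(i) = common / numel(trueSets{i});
    end
    hP = mean(hPi);  hR = mean(hRi);       % the 1/|I| averaging in the formulas
    hF = 2 * hP * hR / (hP + hR);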
4. Project Realization
This section covers the non-technical aspects of the project, such as the total labour hours spent, the division of the workload among the implementation steps, planned and realized tasks, problems faced and lessons learnt.
4.1 Workload Division
The project commenced in the third week of January 2015 with introductory seminars. The actual work started on 1 February 2015 after the project topic was finalized. The overall implementation of the project and the division of labour hours are visualized in Figure 5.
4.2 Meetings with the client
There were regular e-mail exchanges and bi-weekly personal meetings with the client.
4.3 Problems, delays and changes
The planned schedule was realized almost unchanged, although there were several unforeseen delays. Data retrieval was not easy: there was confusion over the choice between the Macaulay Library database (http://www.birds.cornell.edu/) and Xeno-Canto [5], with the latter finally chosen. Data from Xeno-Canto was not easily downloadable, and some basic web programming, to which the author had never been exposed, had to be done. The feature extraction step, on the other hand, was quite easy with the readily available toolbox and took less time than anticipated. There were delays when the author encountered poor accuracies in flat classification with several classifiers. Under the mistaken assumption that the flat classification accuracy had to be improved, the author spent more time on that task than planned. Moreover, the exploration of neural network training, and the multiple machine crashes caused by the massive training data, further delayed the progress. Around the same time, the author also had some health issues due to Seasonal Affective Disorder. The project regained its lost momentum with the implementation of hierarchical classification.
4.4 Budget
With the 190 labour hours spent on the tasks and a labour cost of 14 €/h, the total labour cost comes to 2660 euros.
Figure 5. Project Implementation Chart: division of the 190 labour hours
- Familiarization with taxonomy, data retrieval and organization (for flat classification): 16%
- Literature review: 10%
- Writing the project plan report: 5%
- Feature extraction & preparation of training and test sets: 11%
- Flat classification, exploration of suitable classifiers & choice of features: 11%
- Development of the visualization tool for the class hierarchy: 13%
- Implementation of hierarchical classification & evaluation: 13%
- Writing the final report: 13%
- Attending seminars & presentations: 8%
4.5 Lessons learnt
With a heavy workload from other courses and unforeseen health issues, the author learnt valuable lessons in time management. He also learnt the importance of clarifying any technical difficulty by talking openly to the client/guide instead of pondering over it alone.
5. Project Results and Conclusions

The first important result of the project was the visualization of the class hierarchy of the bird sound database, already shown in Figure 1.
Figure 6 shows the main results of hierarchical classification. The flat and hierarchical (LCPN) classification schemes were compared in terms of the hierarchical precision, recall and F-score measures. The number of neighbours k in k-Nearest Neighbours was varied between 1 and 10, jointly for all local classifiers; in other words, no two local classifiers, irrespective of their position in the class hierarchy, could have different k values. This was done simply to avoid an unnecessarily exhaustive analysis. The performance of the hierarchical Naïve Bayes classifier is shown in the same figure for easy comparison.
Figure 6. Performance measures for various classifiers: hP, hR and hF plotted against the number of neighbours k (1 to 10) for flat k-NN and LCPN k-NN, together with the corresponding measures for the LCPN Naïve Bayes classifier.
It can be observed from Figure 6 that hierarchical classification by LCPN has an edge over the flat-classification approach, albeit not a significant one. The reasons are manifold. One may be the presence of background noise in the audio data, which calls for pre-processing to isolate the bird song segments in the raw recordings. Another may be the unbalanced class hierarchy: most of the bird species belong to the order Passeriformes, which means there is little information in the class hierarchy to be carried down to the leaf nodes. These are open questions that demand more thorough investigation. Other hierarchical classification methods, especially big-bang (global) classifiers [7], can also be explored.
6. Comments on the Course

The course provided the author with hands-on experience, giving him an opportunity to apply the learned methods and do real programming. It also gave him a chance to hone his project management skills at all levels. The mandatory seminars, especially the one on report writing, were highly useful and helped refine the author's writing skills. Overall, the course was extremely satisfying, and the author thanks his guide/client Tuomas Virtanen for the opportunity and the coordinator Sari Peltonen for the smooth conduct of the course.
7. References
[1] Z. Chen and R. C. Maher, "Semi-automatic classification of bird vocalizations using spectral peak tracks," Journal of the Acoustical Society of America, vol. 120, pp. 2974-2982, 2006.
[2] F. Briggs, R. Raich and X. Z. Fern, "Audio classification of bird species: a statistical manifold approach," in Proc. 9th IEEE International Conference on Data Mining, 2009, pp. 51-60.
[3] N. Cesa-Bianchi, C. Gentile and L. Zaniboni, "Hierarchical classification: combining Bayes with SVM," in Proc. 23rd International Conference on Machine Learning, 2006.
[4] C. N. Silla Jr. and C. A. A. Kaestner, "Hierarchical classification of bird species using their audio recorded songs," in Proc. IEEE International Conference on Systems, Man, and Cybernetics, 2013, pp. 1895-1900.
[5] Xeno-Canto, Sharing bird sounds from around the world [online]. Available: http://www.xeno-canto.org/
[6] M. Brookes, Department of Electrical & Electronic Engineering, Imperial College London, VOICEBOX: Speech Processing Toolbox for Matlab [online]. Available: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
[7] C. N. Silla Jr. and A. A. Freitas, "A survey of hierarchical classification across different application domains," Data Mining and Knowledge Discovery, vol. 22, pp. 31-72, 2011.
[8] S. Kiritchenko, S. Matwin and A. F. Famili, "Functional annotation of genes using hierarchical text categorization," in Proc. ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, 2005.