
Segmentation of musical items: A Computational Perspective

A THESIS

submitted by

SRIDHARAN SANKARAN

for the award of the degree

of

MASTER OF SCIENCE (by Research)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY, MADRAS.

Oct 2017

THESIS CERTIFICATE

This is to certify that the thesis entitled Segmentation of musical items: A Computational Perspective, submitted by Sridharan Sankaran, to the Indian Institute of Technology, Madras, for the award of the degree of Master of Science (by Research), is a bonafide record of the research work carried out by him under my supervision. The contents of this thesis, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma.

Dr. Hema A. Murthy
Research Guide
Professor
Dept. of Computer Science and Engineering
IIT Madras, 600 036

Place: Chennai

Date:

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my guide, Prof. Hema A. Murthy, for

the excellent guidance, patience and for providing me with an excellent atmosphere

for doing research. She helped me to develop my background in signal processing

and machine learning and to experience the practical issues beyond the textbooks.

The endless sessions that we had about research, music and beyond have not only

helped in improving my perspective towards research but also towards life.

I would like to thank my collaborator Krishnaraj Sekhar PV. The completion

of this thesis would not have been possible without his contribution. He helped

me in building datasets, carrying out the experiments, analyzing results and in

writing research papers.

Thanks to Venkat Viraraghavan, Jom and Krishnaraj for proof reading this

thesis.

I am grateful to the members of my General Test Committee, for their sugges-

tions and criticisms with respect to the presentation of my work. I am also grateful

for being a part of the CompMusic project. It was a great learning experience

working with the members of this consortium.

I would like to thank Dr Muralidharan Somasundaram, my guide at Tata

Consultancy Services for making maths look simple.

I am grateful to Prof V.Kamakoti, who encouraged me to pursue this pro-

gramme at IIT and connected me to my guide.


I would like to thank my employer Tata Consultancy Services for sponsoring me

for this external programme and accommodating my absence from work whenever

I am at the institute.

I would like to thank Anusha, Jom, Karthik, Manish, Padma, Praveen, Raghav,

Sarala, Saranya, Shrey, and other members of Donlab for their help and support

over the years. It would have been a lonely lab without them.

I am also obliged to the European Research Council for funding the research under the European Union's Seventh Framework Programme, as part of the CompMusic [14] project (ERC grant agreement 267583).

I would like to thank my family for their support and for tolerating my "non-cooperation" at home, citing my academic pursuits.


ABSTRACT

KEYWORDS: Carnatic Music, Pattern matching, Segmentation, Query,

Cent filter bank

Carnatic music is a classical music tradition widely performed in the southern

part of India. While Western classical music is primarily polyphonic, meaning that different notes are sounded at the same time to create "harmony", Carnatic music is essentially monophonic, meaning that only a single note is sounded at a time. Carnatic music focusses on expanding those notes and expounding the melodic and emotional aspects. Carnatic music also gives importance to (on-the-stage) manodharma (improvisations).

Carnatic music, which is one of the two styles of Indian classical music, has rich

repertoires with many traditions and genres. It is primarily an oral tradition with

minimal codified notations. Hence it has well established teaching and learning

practices. Carnatic music has hardly been archived with the objective of music

information retrieval (MIR), nor has it been studied scientifically until recently.

Since Carnatic music is rich in manodharma, it is difficult to analyse and represent

it by adopting techniques used for Western music. With MIR, there are many aspects

that can be analysed and retrieved from a Carnatic music item such as the raga,

tala, the various segments of the item, the rhythmic strokes used by the percussion

instruments, the rhythmic patterns used, etc. Any such MIR task will not only enhance the listening pleasure but also serve as a learning aid for students.

In Carnatic music, musical items are made up of multiple segments. The main

segment is the composition (kriti) which has melody, rhythm and lyrics and it can

be optionally preceded by a pure melody segment (alapana) without lyrics or beats

(talam). The alapana segment, if present, will have a sub-segment rendered by

the vocalist, optionally followed by a sub-segment rendered by the accompanying vio-

linist. The kriti in turn is generally made of three sub-segments - pallavi, anupallavi

and caranam. The goal of this thesis is to segment a musical item into its various

constituent segments and sub-segments mentioned above.

We first attempted to segment the musical item into alapana and kriti using

an information theoretic approach. Here, the symmetric KL divergence (KL2)

distance measure between the alapana segment and the kriti segment was used to identify the boundary between the alapana and kriti segments. We obtained around 88% accuracy in segmenting between alapana and kriti.

Next we attempted to segment the kriti into pallavi, anupallavi and caranam

using the pallavi (or part of it) as the query template. A sliding window approach, in which a time-frequency template of the pallavi slides across the entire composition, was used, and the peaks of correlation were identified as matching pallavi repetitions.

Using these pallavi repetitions as the delimiter, we were able to segment the kriti

with 66% accuracy.

In all these approaches, it was observed that Cent filterbank based features

provided better results than the traditional MFCC based features.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS

1 Introduction
1.1 Overview of the thesis
1.2 Music Information retrieval
1.3 Carnatic Music - An overview
1.3.1 Raga
1.3.2 Tala
1.3.3 Sahitya
1.4 Carnatic Music vs Western Music
1.4.1 Harmony and Melody
1.4.2 Composed vs improvised
1.4.3 Semitones, microtones and ornamentations
1.4.4 Notes - absolute vs relative frequencies
1.5 Carnatic Music - The concert setting
1.6 Carnatic Music segments
1.6.1 Composition
1.6.2 Alapana
1.7 Contribution of the thesis
1.8 Organisation of the thesis

2 Literature Survey
2.1 Introduction
2.2 Segmentation Techniques
2.2.1 Machine Learning based approaches
2.2.2 Non machine learning approaches
2.3 Audio Features
2.3.1 Temporal Features
2.3.2 Spectral Features
2.3.3 Cepstral Features
2.3.4 Distance based Features
2.4 Discussions
2.4.1 CFB Energy Feature
2.4.2 CFB Slope Feature
2.4.3 CFCC Feature

3 Identification of alapana and kriti segments
3.1 Introduction
3.2 Segmentation Approach
3.2.1 Boundary Detection
3.2.2 Boundary verification using GMM
3.2.3 Label smoothing using Domain Knowledge
3.3 Experimental Results
3.3.1 Dataset Used
3.3.2 Results
3.3.3 Discussions

4 Segmentation of a kriti
4.1 Introduction
4.2 Segmentation Approach
4.2.1 Overview
4.2.2 Time Frequency Templates
4.3 Experimental Results
4.3.1 Finding Match with a Given Query
4.3.2 Automatic Query Detection
4.3.3 Domain knowledge based improvements
4.3.4 Repetition detection in an RTP
4.3.5 Discussions

5 Conclusion
5.1 Summary of work done
5.2 Criticism of the work
5.3 Future work

LIST OF TABLES

1.1 Differences in frequencies of the 12 notes for Indian Music and Western Music
3.1 Division of dataset
3.2 Confusion matrix: Frame-level labelling
3.3 Performance: Frame-level labelling
3.4 Confusion matrix: Item Classification
3.5 Performance: Item Classification
3.6 Confusion matrix: Frame-level labelling
3.7 Performance: Frame-level labelling
3.8 Confusion matrix: Item Classification
3.9 Performance: Item Classification
4.1 Comparison between various features
4.2 Manual vs automatic query extraction (CFB Energy: Cent filter bank cepstrum, CFB Slope: Cent filterbank energy slope). Time is given in seconds.

LIST OF FIGURES

1.1 A typical Carnatic music concert
1.2 Tonic normalisation of two similar phrases
1.3 Typical melodic variations in repetitions
1.4 Pitch histogram of raaga Sankarabharanam with its Hindustani and Western classical equivalents
1.5 Effect of gamakas on pitch trajectory
1.6 Concert Segmentation
1.7 Item segmentation
2.1 Block diagram of MFCC extraction
2.2 Block diagram of HCC analysis
2.3 Filter-banks and filter-bank energies of a melody segment in the mel scale and the cent scale with different tonic values
3.1 KL2 Values and possible segment boundaries
3.2 GMM Labels
3.3 Entire song label generated using GMM
3.4 Entire song label generated using GMM after smoothing
4.1 Time-frequency template of music segments using FFT spectrum (X axis: Time in frames, Y axis: Frequency in Hz)
4.2 Time-frequency template of music segments using cent filterbank energies (X axis: Time in frames, Y axis: Filter)
4.3 Time-frequency template of music segments using cent filterbank slope (X axis: Time in frames, Y axis: Filter)
4.4 Correlation as a function of time (cent filterbank energies)
4.5 Correlation as a function of time (cent filterbank slope)
4.6 Spectrogram of query and matching segments as found out by the algorithm
4.7 Query matching with Cent filterbank slope feature
4.8 Query matching with Chroma feature (no overlap)
4.9 Query matching with Chroma feature (with overlap)
4.10 Query matching with MFCC feature
4.11 Intermediate output (I) of the automatic query detection algorithm using slope feature
4.12 Intermediate output (II) of the automatic query detection algorithm using slope feature
4.13 Final output of the automatic query detection algorithm using slope feature
4.14 Correlation for full query vs. half query
4.15 False positive elimination using rhythmic cycle information
4.16 Repeating pattern recognition in other Genres
4.17 Normal tempo
4.18 Half the original tempo
4.19 tisram tempo
4.20 Double the original tempo

ABBREVIATIONS

AAS Automatic Audio Segmentation

BFCC Bark Filterbank Cepstral Coefficients

BIC Bayesian Information Criterion

CFCC Cent Filterbank Cepstral Coefficients

CNN Convolutional Neural Networks

DCT Discrete Cosine Transform

DFT Discrete Fourier Transform

EM Expectation Maximisation

FFT Fast Fourier Transform

GLR Generalised Likelihood Ratio

GMM Gaussian Mixture Model

HMM Hidden Markov Model

IDFT Inverse Discrete Fourier Transform

KL Divergence Kullback-Leibler divergence

LP Linear Prediction

LPCC Linear Prediction Cepstral Coefficients

MFCC Mel Filterbank Cepstral Coefficients

MIR Music Information Retrieval

PSD Power Spectral Density

RMS Root Mean Square

STE Short Term Energy

SVM Support Vector Machine

t-f Time Frequency

ZCR Zero Crossings Rate

CHAPTER 1

Introduction

1.1 Overview of the thesis

Carnatic music is a classical music tradition performed largely in the southern

states of India, namely Tamil Nadu, Kerala, Karnataka, Telangana and Andhra

Pradesh. Carnatic music and Hindustani music form the two sub genres of Indian

classical music, the latter being more popular in the Northern states of India.

Though the origin of Carnatic and Hindustani music can be traced back to the

theory of music written by Bharata Muni around 400 BCE, these two sub-genres

have evolved differently over a period of time due to the prevailing socio-political

environments in various parts of India, while still retaining certain core principles in

common. In this work, we will focus mainly on Carnatic music, though some of the

challenges and approaches to MIR described under Carnatic music are applicable

to Hindustani music also.

Raga (melodic modes), tala (repeating rhythmic cycle) and sahitya (lyrics) form

the three pillars on which Carnatic music rests.

The concept of raga is central to Carnatic music. While it can be grossly ap-

proximated to a 'scale' in Western music, in reality a raga encompasses the collective

expression of melodic phrases that are formed due to movement or trajectories of

notes that conform to the grammar of that raga. The trajectories themselves are

defined variously by gamakas which are movement, inflexion and ornamentation

of notes [40, Chapter 5]. While a note corresponds to a pitch position in Western

music, a note in Carnatic music (called the svara) need not be a singular pitch

position but a pitch contour or a pitch trajectory as defined by the grammar of that

raga. In other words, a note in Western music corresponds to a point value in a

time- frequency plane while a svara in Carnatic music can correspond to a curve

in the time- frequency plane. The shape of this curve can vary from svara to svara

and raga to raga. The set of svaras that define the raga are dependent on the tonic.

Unlike Western music, the main performer of a concert is at liberty to choose any

frequency as the tonic of that concert. Once a tonic is chosen, the pitch positions

of other notes are derived from the tonic. In a Carnatic music concert, this tonic is

maintained by an instrument called tambura.

The next important concept in Carnatic music is tala. It is related to rhythm,

speed, metre etc and it is a measure of time. There are various types of tala that

are characterised by different matra (beat) count per cycle. The matra is further

subdivided into akshara. The count of akshara per matra is decided by the nadai/gati

of that tala. For every composition, the main artist chooses a speed of the item to

render. Once the speed is chosen, Carnatic music is reasonably strict about keeping

the speed constant, except for inadvertent minor variations in speed due to human error.

The third important concept in Carnatic music is sahitya or lyrics. Most of the

lyrical compositions that are performed today have been written a few centuries

ago. The composers were both musicians and poets and hence it can be seen that

music and lyrics go together in their compositions.

A Carnatic music concert is performed by a small ensemble of musicians as

shown in Figure 1.1


Figure 1.1: A typical Carnatic music concert

The main artiste is usually a vocalist but an artiste playing flute / veena /

violin can also be a main artiste. Where the main artiste is a vocalist or flautist, the

melodic accompaniment is given by a violin artist. The percussion accompaniment

is always given by a mrudangam artist. An additional percussion accompaniment

in the form of ghatam / khanjira / morsing is optional. A tambura player maintains

the tonic.

A typical Carnatic music concert varies in duration from 90 mins to 3 hours

and is made up of a succession of musical items. These items are standard lyrical

compositions (kritis) with melodies in most cases set to a specific raga and rhythm

structure set to a specific tala. The kritis can be optionally preceded by an alapana

which is the elaboration of the raga.

The main musician chooses a set of musical items that forms the concert. The

choice of items is an important part of concert planning. While there are no hard and fast rules governing the choice of items, certain traditions have been in vogue for the past 70 years. There are various musical forms in Carnatic music, namely


varnam, kriti, javali, thillana, viruttam, tiruppugazh, padam, raga malika and ragam

tanam pallavi (RTP). Typically, a concert will start with a varnam, followed by a set

of kritis. One or two kritis will be taken up for detailed exploration and rendering.

In certain concerts, an RTP is taken up for detailed rendition. Towards the end of

the concert, items such as javali, thillana, viruttam are rendered.

There are around 100 varnams, 5000 kritis and a few hundred other forms

available to choose from. Musicians choose a set of items for a concert based on many

parameters such as:

• The occasion of the concert - e.g. thematic concerts based on a certain raga or a certain composer.

• Voice condition of the artiste - musical compositions that cover a limited octave range may be chosen in such cases. Also, fast tempo items may be omitted.

• Contrast - contrast in ragas, variety in composers, rotation of various types of tala, compositions covering lyrics of various languages, variation in tempo, etc.

While planning the items of a concert provides the outline of the concert, the

worth of the musician is evident only in the creativity and spontaneity demon-

strated by the artiste while presenting these chosen items. The creativity of the

artiste is exhibited through manodharma, or improvised music, which is made up of:

1. alapana (melodic exploration of a raga without rhythm and lyrical composition)

2. niraval (spontaneous repetitions of a line of a lyric in melodically different ways conforming to the raga and rhythm cycle)

3. kalpana svara (spontaneous svara passages that conform to raga grammar with a variety of rhythmic structures)

Manodharma in Carnatic music is akin to an impromptu speech. The speaker

himself/herself will not know what the next sentence is going to be. The quality of

manodharma depends on a few factors such as :


1. Technical capability of the artiste - A highly accomplished artiste will have the confidence to take higher risks during improvisations and will be able to create attractive melodic and rhythmic patterns on the fly.

2. Technical capability of the co-artists - Since improvisations are not rehearsed, the accompanying artistes have to be on high alert, closely following the moves of the main artiste, and be ready to do their part of improvisations when the main artiste asks them to.

3. The mental and physical condition of the artiste - The artiste may decide not to exert himself/herself too much while traversing higher octaves or faster rhythmic neravals.

4. Audience response and their pulse - If the audience consists largely of people who do not have deep knowledge, it is prudent not to demonstrate too much technical virtuosity.

As we can see, unlike Western classical music, Carnatic music rendition clearly

has two components - the taught / practised / rehearsed part (called kalpita) and

the spontaneous on-the-stage improvisations (manodharma). A MIR system for

Carnatic music should be able to identify / analyse / retrieve information pertain-

ing not only to the kalpita part but also pertaining to the manodharma part. The

unpredictability of the manodharma aspect makes MIR techniques used in Western

classical music ineffective in Carnatic music.

At this stage, it is reiterated that the notes that make up the melody in Carnatic

music are defined with respect to a reference called tonic frequency. Hence the

analysis of a concert depends on the tonic. This makes MIR of Carnatic music

non-trivial. A melody, when heard without a reference tonic, can be perceived as

a different raga depending on the svara that is assumed as tonic. Two melodic

motifs rendered with two different tonics will not show similarity unless they are

tonic normalised. Figure 1.2 (image courtesy Shrey Dutta) shows the pitch contours of two similar melodic

phrases rendered at different tonics. Any time series algorithm will give a high

distance between the two phrases without tonic normalisation. The effect of tonic

normalisation is also shown in this figure. Hence normalisation of these phrases

with respect to tonic is important before any comparison is made.

Figure 1.2: Tonic normalisation of two similar phrases

While there are many aspects of Carnatic music that are candidates for

analysis and retrieval, in this thesis we discuss the various segmentation techniques

that can be applied for segmenting a Carnatic music item.


In Carnatic music, a song or composition typically comprises three segments - pallavi, anupallavi and caranam - although in some cases there can be more segments.

Segmentation of compositions is important both from the lyrical and musical

aspects, as detailed in chapter 4. The three segments pallavi, anupallavi and caranam

have different significance. From a musical perspective, one segment builds on the

other. Also, one of the segments is usually taken up for elaboration in the form

of neraval and kalpana svaram. From an MIR perspective, it is important to know

which segment has been taken up by an artiste for elaboration.

The alapana is another creative exercise and the duration of an alapana directly

reflects a musician’s depth of knowledge and creativity. So it is informative to

know the duration of an alapana performed by a main artiste and an accompanying

artist. In this context, segmenting an item into alapana and musical composition is of interest.

Segmentation of compositions directly from audio is a well researched problem

reported in the literature. Segmentation of a composition into its structural com-

ponents using repeating musical structures (such as chorus) as a cue has several

applications. The segments can be used to index the audio for music summarisa-

tion and browsing the audio (especially when an item is very long). While these

techniques have been attempted for Western music where the repetitions have

more or less static time-frequency melodic content, finding repetitions in impro-

visational music such as Carnatic music is a difficult task. This is because the

repetitions vary melodically from one repetition to another as illustrated in Figure

1.3. Here the pitch contours of four different repetitions (out of eight rendered) of

the opening line of the composition vatapi are shown.

Figure 1.3: Typical melodic variations in repetitions

As we can see, unlike in Western music where segmentation using the melodically invariant chorus is straightforward, segmentation using melodically varying repetitions of the pallavi is a non-trivial task. In this thesis, we discuss segmenting

a composition into its constituent parts using pallavi (or a part of pallavi) of the

composition as the query template. As detailed in Chapter 4, the segments of a

composition have a lot of melodic and lyrical significance. Hence segmentation of a

composition is a very important MIR task in Carnatic music.

The repeated line of a pallavi is seen as a trajectory in the time-frequency plane.

A sliding window approach is used to determine the locations of the query in the

composition. The locations at which the correlation is maximum correspond to matches with the query.
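As a rough illustration of this idea, the sketch below slides a time-frequency query template over a feature matrix of the composition and picks correlation peaks as candidate repetitions. It is a minimal sketch, not the exact implementation used in this thesis; the normalisation, threshold and minimum-gap values are assumptions for illustration.

```python
import numpy as np

def sliding_correlation(song_feats, query_feats):
    """Correlate a time-frequency query template against a song.

    song_feats : (n_bins, n_frames) feature matrix of the composition
    query_feats: (n_bins, q_frames) template of the repeated pallavi line
    Returns one correlation value per possible start frame."""
    n_bins, n_frames = song_feats.shape
    _, q_frames = query_feats.shape
    q = (query_feats - query_feats.mean()) / (query_feats.std() + 1e-9)
    scores = np.zeros(n_frames - q_frames + 1)
    for t in range(len(scores)):
        w = song_feats[:, t:t + q_frames]
        w = (w - w.mean()) / (w.std() + 1e-9)
        # normalised cross-correlation between the two t-f patches
        scores[t] = np.sum(w * q) / q.size
    return scores

def pick_peaks(scores, threshold=0.6, min_gap=200):
    """Greedy peak picking: keep maxima above a threshold, at least min_gap frames apart."""
    peaks = []
    for idx in np.argsort(scores)[::-1]:
        if scores[idx] < threshold:
            break
        if all(abs(idx - p) >= min_gap for p in peaks):
            peaks.append(idx)
    return sorted(peaks)
```

The indices returned by pick_peaks would then be interpreted as the start frames of candidate pallavi repetitions.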


We further distinguish the composition segment from the alapana segment using the difference in timbre due to the absence of percussion in the alapana segment. This is done by evaluating the KL2 distance (an information theoretic distance measure between two probability density functions) between adjacent samples and thereby locating the boundary of change.
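A minimal sketch of this boundary detection is given below, assuming a single Gaussian is fitted to the features of each of the two adjacent windows; the window and hop sizes are illustrative assumptions rather than the values used in Chapter 3.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL divergence KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) for full-covariance Gaussians."""
    d = len(mu_p)
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    logdet_p = np.linalg.slogdet(cov_p)[1]
    logdet_q = np.linalg.slogdet(cov_q)[1]
    return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff - d + logdet_q - logdet_p)

def kl2_curve(feats, win=500, hop=100):
    """Symmetric KL (KL2) between Gaussians fitted to adjacent windows.

    feats: (n_frames, n_dims) frame-level features. Peaks of the returned curve
    suggest timbre changes such as the alapana/kriti boundary."""
    positions, values = [], []
    for start in range(0, feats.shape[0] - 2 * win, hop):
        a = feats[start:start + win]
        b = feats[start + win:start + 2 * win]
        ridge = 1e-6 * np.eye(feats.shape[1])           # keeps covariances invertible
        mu_a, cov_a = a.mean(axis=0), np.cov(a, rowvar=False) + ridge
        mu_b, cov_b = b.mean(axis=0), np.cov(b, rowvar=False) + ridge
        kl2 = gaussian_kl(mu_a, cov_a, mu_b, cov_b) + gaussian_kl(mu_b, cov_b, mu_a, cov_a)
        positions.append(start + win)                   # candidate boundary frame
        values.append(kl2)
    return np.array(positions), np.array(values)
```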

For all these types of segmentation, we use Cent filterbank cepstral coefficients (CFCC) as features, which are tonic independent and hence comparable across musicians and concerts where variation in tonic is possible.
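The tonic normalisation underlying these features amounts to mapping each frequency to cents above the tonic, 1200 log2(f / tonic). The snippet below illustrates this mapping; the filterbank spacing shown is a hypothetical choice, not necessarily the configuration used in this work.

```python
import numpy as np

def hz_to_cents(freq_hz, tonic_hz):
    """Map frequencies (Hz) to cents above the tonic: 1200 * log2(f / tonic)."""
    return 1200.0 * np.log2(np.asarray(freq_hz, dtype=float) / tonic_hz)

# The same interval sung against two different tonics maps to the same cent
# value, which is what makes cent-scale features tonic independent.
print(hz_to_cents(392.00, 261.63))   # a fifth above a C tonic -> ~700 cents
print(hz_to_cents(220.00, 146.83))   # a fifth above a D tonic -> ~700 cents

# Centre frequencies of a hypothetical cent filterbank spanning four octaves
# above the tonic, with one filter every 60 cents (an assumed spacing).
tonic_hz = 146.83
centres_cents = np.arange(0, 4800, 60)
centres_hz = tonic_hz * 2.0 ** (centres_cents / 1200.0)
```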

1.2 Music Information retrieval

There is an ever-increasing availability of music in digital format, which requires the development of tools for music search, access, filtering, classification, visualisation and retrieval. Music Information Retrieval (MIR) covers many of these

aspects. Technology for music recording, digitization, and playback allows users access that is almost comparable to listening to a live performance. Two

main approaches to MIR are common 1) metadata-based and 2) content-based. In

the former, the issue is mainly to find useful categories for describing music. These

categories are expressed in text. Hence, text-based retrieval methods can be used

to search those descriptions. The more challenging approach in MIR is the one

that deals with the actual musical content, e.g. melody and rhythm.

In information retrieval, the objective is to find documents that match a user's

information need, as expressed in a query. In content-based MIR, this aim is

usually described as finding music that is similar to a set of features or an example

(query string). There are many different types of musical similarity, such as:


• Musical works that bring out the same emotion (e.g. romance, sadness)

• Musical works that belong to the same genre (e.g. classical, jazz)

• Musical works created by the same composer

• Music originating from the same culture (e.g. Western, Indian)

• Varied repetitions of a melody

In order to perform analyses of various kinds on musical data, it is sometimes

desirable to divide it up into coherent segments. These segmentations can help in

identifying the high-level musical structure of a composition or a concert and can

help in better MIR. Segmentation also helps in creating "thumbnails" of tracks that

are representative of a composition, thereby enabling surfers to sample parts of

composition before they decide to listen / buy. The identification of musically rele-

vant segments in music requires the use of a large amount of contextual information

to assess what distinguishes different segments from each other.

In this work, we focus on segmentation as a tool for MIR in the context of

Carnatic music items.

1.3 Carnatic Music - An overview

The three basic elements of Carnatic music are raga(melody), tala (rhythm) and

sahitya (lyrics).

1.3.1 Raga

Each raga consists of a series of svaras, which bear a definite relationship to the

tonic note (equivalent of key in Western music) and occur in a particular sequence


in ascending scale and descending scale. The ragas form the basis of all melody

in Indian Music. The character of each raga is established by the order and the

sequence of notes used in the ascending and descending scales and by the manner

in which the notes are ornamented. These ornamentations, called gamakas, are

subtle, and they are an integral part of the melodic structure. In this respect, raga is

neither a scale, nor a mode. In a concert, ragas can be sung by themselves without

any lyrics (called alapana) and then be followed by a lyrical composition set to tune

in that particular raga. There are finite (72 to be exact) janaka (parent) ragas and

theoretically infinite possible janya (child) ragas born out of these 72 parent ragas.

Ragas are said to evoke moods such as tranquillity, devotion, anger, loneliness,

pathos, etc. [42, Page 80]. Ragas are also associated with certain times of the day,

though it is not strictly adhered to in Carnatic music.

1.3.2 Tala

Tala or the time measure is another principal element in Carnatic music. Tala is the

rhythmical groupings of beats in repeating cycles that regulates music composi-

tions and provides a basis for rhythmic coordination between the main artistes and

the accompanying artists. Hence, it is the theory of time measure. The beats (called

the matras) are further divided into aksharas. Tala encompasses both structure and

tempo of a rhythmic cycle. Almost all musical compositions other than those sung

as pure ragas (alapana) are set to a tala. There are 108 talas in theory [41, Page 17],

out of which fewer than 10 are commonly in practice. Adi tala (8 beats/cycle) is

the one most commonly used and is also universal. The laya is the tempo, which

keeps the uniformity of time span. In a Carnatic music concert, the tala is shown

with standardized combination of claps and finger counts by the musician.


1.3.3 Sahitya

The third important element of Carnatic music is the sahitya (lyrics). A musical

composition presents a concrete picture of not only the raga but the emotions

envisaged by the composer as well. If the composer also happens to be a good

poet, the lyrics are enhanced by the music, while preserving the metre in the lyrics

and the music, leading to an aesthetic experience, where a listener not only enjoys

the music but also the lyrics. The claim of a musical composition to permanence lies

primarily in its musical setting. In compositions considered to be of high quality,

the syllables of the sahitya blend beautifully with the musical setting. Sahitya serves as a model for the structure of a raga. In rare ragas such as kiranavali, even solitary works of great composers have brought out the nerve-centre of the

raga. The aesthetics of listening to the sound of these words is an integral part

of the Carnatic experience, as the sound of the words blends seamlessly with the

sound of the music. Understanding the actual meanings of the words seems quite

independent of this musical dimension, almost secondary or even peripheral to

the ear that seeks out the music. The words provide a solid yet artistic grounding

and structure to the melody.

1.4 Carnatic Music vs Western Music

While one may be tempted to approach MIR in Carnatic music similar to MIR in

Western music, such attempts are quite likely to fail. There are some fundamental

differences between Western and Indian classical music systems, which are impor-

tant to understand as most of the available techniques on repetition detection and

segmentation for Western music are ineffective for Carnatic music. The differences


between these two systems of music are outlined below:

1.4.1 Harmony and Melody

This is the prime difference between the two classical music systems. Western classical music is primarily polyphonic, i.e. different notes are sounded at the same time. The concept of Western music lies in the "harmony" created by the different

notes. Thus, we see different instruments sounding different notes being played

at the same time, creating a different feel. It is the principle of "harmony". The Indian music system is essentially monophonic, meaning only a single note is sung/played

at a time [13, Chapter 1.3]. Its focus is on melodies created using a sequence

of notes. Indian music focusses on expanding those svaras and expounding the melodic and emotional aspects.

1.4.2 Composed vs improvised

Western music is composed whereas Indian classical music is improvised. All

Western compositions are formally written using the staff notation, and performers

have virtually no latitude for improvisation. The converse is the case with Indian

music, where compositions have been passed on from teacher to student over

generations with improvisations in creative segments such as alapana, niraval and

kalpana svaras on the spot, on the stage.

1.4.3 Semitones, microtones and ornamentations

Western music is largely restricted to 12 semi tones whereas Indian classical music

makes extensive use of 22 microtones (called 22 shrutis though only 12 semi tones


are represented formally). In addition to microtones, Indian classical music makes

liberal use of inflexions and oscillations of notes. In Carnatic Music, they are called

gamakas. These gamakas act as ornamentations that describe the contours of a raga.

It is widely accepted that there are ten types of gamakas [45, Page 152]. A svara in

Carnatic music is not a single point of frequency although it is referred to with a

definite pitch value. It is perceived as movements within a range of pitch values

around a mean. Figure 1.4 (image courtesy Shrey Dutta) compares the histograms of pitch values in a melody of

raga Sankarabharanam with its Hindustani equivalent (bilaval) and Western classical

counterpart (major scale). We can see that the pitch histogram is continuous for

Carnatic music and Hindustani music but it is almost discrete for Western music.

It is clearly seen that the svaras are a range of pitch values in Indian classical music

and this range is maximum for Carnatic music.

Figure 1.4: Pitch histogram of raaga Sankarabharanam with its Hindustani andWestern classical equivalents

The effect of gamakas on the note positions is illustrated in figure 1.5. The pitch

trajectory of the arohana (ascending scale) of sankarabharanam raaga in tonic "E" is compared with that of the ascending scale of E-major, which is its equivalent. We can

see that the pitch positions of many svaras of sankarabharanam move around their

intended pitch values as the result of ornamentations.

Figure 1.5: Effect of gamakas on pitch trajectory

1.4.4 Notes - absolute vs relative frequencies

In Western music, the positions of the notes are absolute. For instance, middle C is

fixed at 261.6 Hz. In Carnatic music, the frequencies of the various notes (svaras) are

relative to the tonic note (called Sa or shadjam). Hence the svara ”Sa” may be sung

at C (261.63 Hz) or G (392 Hz) or at any other frequency as chosen by the performer.

The relationship between the notes remains the same in all cases. Hence Ga1 is

always three chromatic steps higher than Sa. Once the key/tonic for the svara Sa is

chosen, then the frequencies for all the other notes are fully determined. There are

also differences in ratios among the 12 notes between Western music and Indian

music, as provided in Table 1.1 (table courtesy: M V N Murthy, Professor, IMSc). In this table, the columns labelled Harmonic refer to Western music.

Table 1.1: Differences in frequencies of the 12 notes for Indian Music and Western Music (harmonic ratio step = 2^(1/12))

No. | Note (Indian) | Natural frequency ratio | Frequency (Hz, C-4) | Ratio (Indian) | Ratio (Harmonic) | Harmonic (Hz)
1   | S             | 1                       | 261.63              | 1.000          | 1.000            | 261.63
2   | R1            | 16/15                   | 279.07              | 1.067          | 1.059            | 277.19
3   | R2/G1         | 9/8                     | 294.33              | 1.125          | 1.122            | 293.69
4   | R3/G2         | 6/5                     | 313.96              | 1.200          | 1.189            | 311.16
5   | G3            | 5/4                     | 327.04              | 1.250          | 1.260            | 329.68
6   | M1            | 4/3                     | 348.84              | 1.333          | 1.335            | 349.19
7   | M2            | 17/12                   | 370.64              | 1.417          | 1.414            | 370.08
8   | P             | 3/2                     | 392.45              | 1.500          | 1.499            | 392.00
9   | D1            | 8/5                     | 418.61              | 1.600          | 1.588            | 415.32
10  | D2/N1         | 5/3                     | 436.05              | 1.667          | 1.682            | 440.00
11  | D3/N2         | 9/5                     | 470.93              | 1.800          | 1.782            | 466.22
12  | N3            | 15/8                    | 490.56              | 1.875          | 1.888            | 493.96
13  | (S)           | 2                       | 523.26              | 2.000          | 2.000            | 523.35
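For illustration, the frequencies in Table 1.1 can be regenerated for any chosen tonic from the natural (Indian) ratios listed in the table and the equal-tempered step 2^(1/12). The sketch below assumes the ratio list of the table; the helper name and rounding are hypothetical.

```python
# Regenerating the Table 1.1 frequencies for an arbitrary tonic.
natural_ratios = {
    "S": 1, "R1": 16/15, "R2/G1": 9/8, "R3/G2": 6/5, "G3": 5/4, "M1": 4/3,
    "M2": 17/12, "P": 3/2, "D1": 8/5, "D2/N1": 5/3, "D3/N2": 9/5, "N3": 15/8, "(S)": 2,
}

def svara_frequencies(tonic_hz=261.63):
    """Return (svara, natural-ratio Hz, equal-tempered Hz) rows for a given tonic."""
    rows = []
    for k, (name, ratio) in enumerate(natural_ratios.items()):
        natural_hz = tonic_hz * ratio
        harmonic_hz = tonic_hz * 2 ** (k / 12.0)   # k semitone steps above the tonic
        rows.append((name, round(natural_hz, 2), round(harmonic_hz, 2)))
    return rows

for row in svara_frequencies(261.63):   # middle C as tonic approximately reproduces the table
    print(row)
```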

1.5 Carnatic Music - The concert setting

A typical Carnatic music concert has a main artiste, usually a vocalist, who is

accompanied on the violin, mrudangam (a percussion instrument) and optionally

other percussion instruments. The main artiste chooses a tonic frequency to which

the other accompanying artistes tune their instruments. This tonic frequency

becomes the concert pitch for that concert. The tonic frequency for male vocal

artistes is typically in the range 100-140 Hz and for female vocalists in the range 180-220 Hz.

1.6 Carnatic Music segments

Typically, a Carnatic music concert is 1.5 to 3 hours in duration and is comprised

of a series of musical items. A musical item in Carnatic music is broadly made up


of two segments: 1) a composition segment, and 2) an optional alapana segment which precedes the composition segment. These two segments can be further segmented as

below:

1.6.1 Composition

The central part of every item is a song or a composition which is characterised by

participation of all the artistes on the stage. This segment has some lyrics (sahitya)

that is set to a certain melody (raga) and rhythm (tala). Typically this segment

comprises three sub-segments - pallavi, anupallavi and caranam - although in some

cases there can be more segments due to multiple caranam segments. While many

artistes render only one caranam segment (even if the composition has multiple

caranam segments), some artistes do render multiple caranams or all the caranams.

The pallavi part is repeated at the end of anupallavi and caranam.

1.6.2 Alapana

The composition can be optionally preceded by an alapana segment. If alapana

is present, the percussion instruments do not participate in it. Only the melodic

aspect is expanded and explored without rhythmic support by the main artiste

supported by the violin artist. There are no lyrics for alapana. The main artiste

does the alapana followed by an optional alapana sub-segment by the violin artiste.

The above description is depicted in Figures 1.6 and 1.7.


Figure 1.6: Concert Segmentation

Figure 1.7: Item segmentation

1.7 Contribution of the thesis

The following are the main contributions of the thesis.

1. Relevance of MIR for Carnatic music

2. Challenges in MIR for Carnatic music

3. Representation of a musical composition as a time frequency trajectory.

4. Template matching of audio using t-f representation

5. Information theoretic approach to differentiate between a composition segment that has percussion and a melody segment without percussion


1.8 Organisation of the thesis

The organization of the thesis is as follows:

Chapter 1 outlines the work done and gives a brief introduction to Carnatic

music that will help appreciate this work.

In Chapter 2, some of the related work on music segmentation and various

commonly used features have been discussed and their suitability to Carnatic

music has been studied.

Chapter 3 elaborates the approach and results for segmenting an item into

alapana and kriti.

Chapter 4 elaborates the approach to segment a kriti into pallavi, anupallavi and

caranam along with experimental results.

Finally, Chapter 5 summarizes the work and discusses possible future work.


CHAPTER 2

Literature Survey

2.1 Introduction

The manner in which humans listen to, interpret and describe music implies that

it must contain an identifiable structure. Musical discourse is structured through

musical forms such as repetitions and contrasts. The forms of the Western music

have been studied in depth by music theorists and codified. Musical forms are

used for pedagogical purposes, in composition as well as in music analysis, and some of

these forms (such as variations or fugues) are also principles of composition.

Musical forms describe how pieces of music are structured. Such forms explain

how the sections/ segments work together through repetition, contrast and vari-

ations. Repetition brings unity, while variation brings novelty and sparks interest.

The study of musical forms is fundamental in musical education as among other

benefits, the comprehension of musical structures leads to a better knowledge of

composition rules, and is the essential first approach for a good interpretation of

musical pieces. Every composition in Indian classical music has these forms, and they are often an important aspect of what one expects when listening to music.

The terms used to describe that structure vary according to musical genre.

However it is easy for humans to commonly agree upon musical concepts such as

melody, beat, rhythm, repetitions etc. The fact that humans are able to distinguish

between these concepts implies that the same may be learnt by a machine using

signal processing and machine learning. Over the last decade, increase in comput-

ing power and advances in music information retrieval have resulted in algorithms

which can extract features such as timbre [3], [29], [50], tempo and beats [35], note

pitches [26] and chords [32] from polyphonic, mixed source digital music files e.g.

mp3 files, as well as other formats.

Structural segmentation of compositions directly from audio is a well re-

searched problem in the literature, especially for Western music. Automatic audio

segmentation (AAS) is a subfield of music information retrieval (MIR) that aims

at extracting information on the musical structure of songs in terms of segment

boundaries, repeating structures and appropriate segment labels. With advancing

technology, the explosion of multimedia content in databases, archives and digital

libraries has resulted in new challenges in efficient storage, indexing, retrieval and

management of this content. Under these circumstances, automatic content anal-

ysis and processing of multimedia data becomes more and more important. In

fact, content analysis, particularly content understanding and semantic informa-

tion extraction, have been identified as important steps towards a more efficient

manipulation and retrieval of multimedia content. Automatically extracted struc-

tural information about songs can be useful in various ways, including facilitating

browsing in large digital music collections, music summarisation, creating new

features for audio playback devices (skipping to the boundaries of song segments)

or as a basis for subsequent MIR tasks.

Structural music segmentation consists of dividing a musical piece into several

parts or sections and then assigning to those parts identical or distinct labels

according to their similarity. The founding principles of structural segmentation

are homogeneity, novelty or repetition.


Repetition detection is a fundamental requirement for music thumbnailing and

music summarisation. These repetitions are also often the "chorus" part of a popular music piece, which is thematic and musically uplifting. For these MIR

tasks, a variety of approaches have been discussed in the past.

Previous attempts at music segmentation involved segmenting by spectral

shape, segmenting by harmony, and segmenting by pitch and rhythm. While

these methods exhibited some amount of success, they generally resulted in over-segmentation (identification of segments at locations where segments do not exist).

In this chapter, under section 2.2, we will summarise some of the approaches

attempted by the research community for segmentation and repetition detection

tasks. In Section 2.3, we will review the various audio features commonly used

by the speech and music community. We will conclude with our chosen feature and

its suitability for Carnatic music.

2.2 Segmentation Techniques

The authors of [39] discuss three fundamental approaches to music segmenta-

tion - a) novelty-based where transitions are detected between contrasting parts

b) homogeneity-based where sections are identified based on consistency of their

musical properties, and c) repetition-based where recurring patterns are deter-

mined.

In the following subsections, we present a literature study of segmentation approaches based on machine learning as well as other approaches.


2.2.1 Machine Learning based approaches

In model-based segmentation approaches used in Machine learning, each audio

frame is separately classified to a specific sound class, e.g. speech vs music, vocal

vs instrumental, melody vs rhythm etc. In particular, a model is used to represent

each sound class. The models for each class of interest are trained using training

data. During the testing (operational) phase, a set of new frames is compared

against each of the models in order to provide decisions (sound labelling) at the

frame-level. Frame labelling is improved using post processing algorithms. Next,

adjacent audio frames labelled with the same sound class are merged to construct

the detected segments. In the model-based approaches the segmentation process is

performed together with the classification of the frames to a set of sound categories.

The most commonly used machine learning algorithms in audio segmentation are

the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Support

Vector Machine (SVM) and Artificial Neural Network (ANN).
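A generic sketch of this model-based pipeline is shown below: one GMM is fitted per sound class, each frame is labelled by the most likely model, and runs of identical labels are merged into segments. It uses scikit-learn's GaussianMixture; the class names, number of mixture components and post-processing are assumptions for illustration, not the settings of any cited work.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(train_feats, n_components=16):
    """Fit one GMM per sound class (e.g. 'alapana', 'kriti').

    train_feats: dict mapping class name -> (n_frames, n_dims) feature matrix."""
    models = {}
    for label, X in train_feats.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        models[label] = gmm.fit(X)
    return models

def label_frames(models, feats):
    """Assign each frame to the class whose GMM gives the highest log-likelihood."""
    labels = list(models)
    loglik = np.column_stack([models[l].score_samples(feats) for l in labels])
    return [labels[i] for i in loglik.argmax(axis=1)]

def merge_runs(frame_labels):
    """Merge consecutive identically-labelled frames into (label, start, end) segments."""
    segments, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i))
            start = i
    return segments
```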

In [4] a 4-state ergodic HMM is trained with all possible transitions to discover

different regions in music, based on the presence of steady statistical texture fea-

tures. The Baum-Welch algorithm is used to train the HMM. Finally, segmentation

is deduced by interpreting the results from the Viterbi decoding algorithm for the

sequence of feature vectors for the song.

In [30], an automatic segmentation approach is proposed that combines SVM

classification and audio self-similarity segmentation. This approach firstly sep-

arates the sung clips and accompaniment clips from pop music by using SVM

preliminary classification. Next, heuristic rules are used to filter and merge the

classification result to further determine potential segment boundaries. Finally, a self-similarity detection algorithm is introduced to refine the segmentation

results in the vicinity of potential points.

In [31], HMM is used as one of the methods to discover song structure. Here

the song is first parameterised using MFCC features. Then these features are used

to discover the song structure either by clustering fixed-length segments or by an

HMM. Finally using this structure, heuristics are used to choose the key phrase.

In [16], techniques such as Wolff-Gibbs algorithm, HMM and prior distribution

are used to segment audio.

In [38], a fitness function for the sectional form descriptions is used to select

the one with the highest match with the acoustic properties of the input piece. The

features are used to estimate the probability that two segments in the description

are repeats of each other, and the probabilities are used to determine the total fitness

of the description. Since creating the candidate descriptions is a combinatorial

problem, a novel greedy algorithm that constructs descriptions gradually is proposed

to solve it.

In [1], the audio frames are first classified based on their audio properties,

and then agglomerated to find the homogeneous or self-similar segments. The

classification problem is addressed using an unsupervised Bayesian clustering

model, the parameters of which are estimated using a variant of the EM algorithm.

This is followed by beat tracking, and merging of adjacent frames that might belong

to the same segment.

In [43], segmentation of a full length Carnatic music concert into individual

items using applause as a boundary is attempted. Applauses are identified for

a concert using spectral domain features. GMMs are built for vocal solo, violin

solo and composition ensemble. Audio segments between a pair of the applauses


are labeled as vocal solo, violin solo, composition ensemble etc. The composition

segments are located and the pitch histograms are calculated for the composition

segments. Based on a similarity measure, the composition segment is labelled as inter-item or intra-item. Based on the inter-item locations, intra-item segments are merged into the corresponding items.

In [48], a Convolutional Neural Network (CNN) is trained directly on mel-

scaled magnitude spectrograms. The CNN is trained as a binary classifier on

spectrogram excerpts, and it includes a larger input context and respects the higher

inaccuracy and scarcity of segment boundary annotations.

The author(s) of [23] use CNN with spectrograms and self-similarity lag matri-

ces as audio features, thereby capturing more facets of the underlying structural

information. A late time-synchronous fusion of the input features is performed in

the last convolutional layer, which yielded the best results.

2.2.2 Non machine learning approaches

Non machine learning approaches have primarily used time frequency features or

distance measures to identify segment boundaries.

Distance based audio segmentation algorithms estimate segments in the audio

waveform, which correspond to specific acoustic categories, without labelling

the segments with acoustic classes. The chosen audio is blocked into frames,

parametrised, and a metric based on distance is applied to feature vectors that are

adjacent thus estimating what is called a distance curve. The frame boundaries

correspond to peaks of the distance curve where the distance is maximized. These

are positions with high acoustic change, and hence are considered as candidate


audio segment boundaries. Post-processing is done on the candidate boundaries

for the purpose of selecting which of the peaks on the distance curve will be

identified as audio segment boundaries. The sequence of segments will not be

classified to a specific audio sound category at this stage. The categorization is

usually performed by a machine learning based classifier as the next stage.
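A minimal sketch of such a distance-based scheme is given below: a Euclidean distance curve computed between the mean feature vectors on either side of each frame, followed by simple peak picking. The context length, threshold and minimum separation are illustrative assumptions.

```python
import numpy as np

def distance_curve(feats, context=50):
    """Euclidean distance between the mean feature vectors of the windows
    immediately before and after each frame.  feats: (n_frames, n_dims)."""
    n = feats.shape[0]
    curve = np.zeros(n)
    for t in range(context, n - context):
        left = feats[t - context:t].mean(axis=0)
        right = feats[t:t + context].mean(axis=0)
        curve[t] = np.linalg.norm(left - right)
    return curve

def candidate_boundaries(curve, min_separation=100):
    """Local maxima of the distance curve that exceed mean + one standard deviation."""
    threshold = curve.mean() + curve.std()
    peaks = [t for t in range(1, len(curve) - 1)
             if curve[t] > threshold and curve[t] >= curve[t - 1] and curve[t] >= curve[t + 1]]
    kept = []
    for p in peaks:                        # enforce a minimum gap between candidates
        if not kept or p - kept[-1] >= min_separation:
            kept.append(p)
    return kept
```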

Foote was the first to use an auto-correlation matrix in which a song's frames are

matched against themselves. The author(s) of [18] describe methods for automat-

ically locating points of significant change in music or audio, by analysing local

self-similarity. This approach uses the signal to model itself, and thus does not

rely on particular acoustic cues nor require training.
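In the spirit of this self-similarity idea, the sketch below builds a cosine self-similarity matrix from frame-level features (e.g. MFCCs) and derives a Foote-style novelty curve by correlating a checkerboard kernel along the diagonal. The kernel size and similarity measure are assumptions for illustration, not the exact choices of [18].

```python
import numpy as np

def self_similarity_matrix(feats):
    """Cosine self-similarity between every pair of frames.

    feats: (n_frames, n_dims).  Repeated sections appear as diagonal stripes
    and section changes as block boundaries."""
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-9)
    return unit @ unit.T

def novelty_from_ssm(ssm, kernel_size=32):
    """Foote-style novelty: slide a checkerboard kernel along the SSM diagonal."""
    k = kernel_size // 2
    kernel = np.block([[np.ones((k, k)), -np.ones((k, k))],
                       [-np.ones((k, k)), np.ones((k, k))]])
    n = ssm.shape[0]
    novelty = np.zeros(n)
    for t in range(k, n - k):
        novelty[t] = np.sum(ssm[t - k:t + k, t - k:t + k] * kernel)
    return novelty
```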

This approach was further enhanced in [6], where a self similarity matrix fol-

lowed by dynamic time warping (DTW) was used to find segment transitions and

repetitions.

In [51], unsupervised audio segmentation using the Bayesian Information Crite-

rion is used. After identifying the candidate segments using Euclidean distance,

delta-BIC integrating energy-based silence detection is employed to perform the

segmentation decision to pick the final acoustic changes.

In [52], anchor speaker segments are identified using Bayesian Information

Criterion to construct a summary of broadcast news.

In [7], three divide-and-conquer approaches for Bayesian information criterion

based speaker segmentation are proposed. The approaches detect speaker changes

by recursively partitioning a large analysis window into two sub-windows and

recursively verifying the merging of two adjacent audio segments using Delta BIC.
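A generic Delta-BIC computation, as commonly defined in this line of work, is sketched below: a candidate boundary is supported when modelling the two adjacent blocks with separate Gaussians scores better than a single Gaussian after the model-complexity penalty. This is a sketch of the general criterion, not the exact formulation of any one cited paper; the covariance regularisation is an implementation convenience.

```python
import numpy as np

def delta_bic(X, Y, penalty_weight=1.0):
    """Delta-BIC between modelling X and Y as one Gaussian versus two.

    X, Y: (n_frames, n_dims) feature blocks on either side of a candidate change.
    A positive value supports a segment boundary between X and Y."""
    Z = np.vstack([X, Y])
    n1, n2, n = len(X), len(Y), len(Z)
    d = Z.shape[1]

    def logdet(cov):
        return np.linalg.slogdet(cov + 1e-6 * np.eye(d))[1]

    cov_x = np.cov(X, rowvar=False)
    cov_y = np.cov(Y, rowvar=False)
    cov_z = np.cov(Z, rowvar=False)
    # complexity penalty: mean + full covariance parameters of one extra Gaussian
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n) * penalty_weight
    return 0.5 * (n * logdet(cov_z) - n1 * logdet(cov_x) - n2 * logdet(cov_y)) - penalty
```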

In [9], a two pass approach is used for speaker segmentation. In the first pass,


GLR distance is used to detect potential speaker changes, and in second pass, BIC

is used to validate these potential speaker changes.

In [8], the authors describe a system that uses agglomerative clustering in music

structure analysis of a small set of Jazz and Classical pieces. Pitch, which is used

as the feature, is extracted and the notes are identified from the pitch. Using

the sequence of notes, the melodic fragments that repeat are identified using a

similarity measure. Then clusters are formed from pairs of similar phrases and

used to describe the music in terms of structural relationships.

In [17], the authors propose a dissimilarity matrix containing a measure of

dissimilarity for all pairs of feature tuples using MFCC features. The acoustic

similarity between any two instants of an audio recording is calculated and dis-

played as a two-dimensional representation. Similar or repeating elements are

visually distinct, allowing identification of structural and rhythmic characteristics.

Visualization examples are presented for orchestral, jazz, and popular music.

In [21], a feature-space representation of the signal is generated; then, sequences

of feature-space samples are aggregated into clusters corresponding to distinct

signal regions. The clustering of feature sets is improved via linear discriminant

analysis; dynamic programming is used to derive optimal cluster boundaries.

In [22], the authors describe a system called RefraiD that locates repeating

structural segments of a song, namely chorus segments and estimates both ends

of each section. It can also detect modulated chorus sections by introducing a

perceptually motivated acoustic feature and a similarity measure that enables detection of

a repeated chorus section even after modulation. Chorus extraction is done in

four stages - computation of acoustic features and similarity measures, repeti-

tion judgement criterion, estimating end-points of repeated sections and detecting


modulated repetitions.

In [33], the structure analysis problem is formulated in the context of spectral

graph theory. By combining local consistency cues with long-term repetition

encodings and analyzing the eigenvectors of the resulting graph Laplacian, a

compact representation is produced that effectively encodes repetition structure at

multiple levels of granularity.

In [46], the authors describe a novel application of the symmetric Kullback-

Leibler distance metric that is used as a solution for segmentation where the

goal is to produce a sequence of discrete utterances with particular characteristics remaining constant even when the speaker and the channel change independently.

In [34], a supervised learning scheme using ordinal linear discriminant analysis

and constrained clustering is used. To facilitate abstraction over multiple training

examples, a latent structural repetition feature is developed, which summarizes the

repetitive structure of a song of any length in a fixed-dimensional representation.

2.3 Audio Features

In machine learning, choosing a feature, an individual measurable prop-

erty of the phenomenon being observed, is critical. Extracting or selecting features is both an art and a

science, as it requires experimentation with multiple candidate features combined with

domain knowledge. Features are usually numeric and represented by feature vec-

tors. Perception of music is based on the temporal, spectral and spectro-temporal

features. For our work, we could broadly divide the audio features into the fol-

lowing groups :

• Temporal


• Spectral

• Cepstral

• Distance based

2.3.1 Temporal Features

Speech and vocal music are produced from a time varying vocal tract system with

time varying excitation. For musical instruments the audio production model is

different from vocal music. Still, the system and the excitation are time variant. As a

result, speech and music signals are non-stationary in nature. Most classical signal

processing approaches assume a time-invariant system

and time-invariant excitation, i.e. a stationary signal. Hence these approaches are

not directly applicable for speech and music processing. While the speech signal

can be considered to be stationary when viewed in blocks of 10-30 msec windows,

music signal can be considered to be stationary when viewed in blocks of 50 - 100

msec windows. Some of the short term parameters are discussed here.

• Short-Time Energy (STE): The short-time energy of an audio signal is defined as

  E_n = \sum_{m=-\infty}^{\infty} (x[m])^2 \, w[n-m]    (2.1)

  where w[n] is a window function. Normally, a Hamming window is used.

• RMS: The root mean square of the waveform, calculated in the time domain to indicate its loudness. It is a measure of amplitude in one analysis window and is defined as

  RMS = \sqrt{\frac{x_1^2 + x_2^2 + x_3^2 + \ldots + x_n^2}{n}}

  where n is the number of samples within an analysis window and x_i is the value of the i-th sample.

• Zero-Crossing Rate (ZCR): It is defined as the rate at which the signal crosses zero. It is a simple measure of the frequency content of an audio signal. Zero crossings are also useful to detect the amount of noise in a signal. The ZCR is defined as

  Z_n = \sum_{m=-\infty}^{\infty} |\,\mathrm{sgn}(x[m]) - \mathrm{sgn}(x[m-1])\,| \, w[n-m]    (2.2)

  where

  \mathrm{sgn}(x[n]) = \begin{cases} 1, & x[n] \geq 0 \\ -1, & x[n] < 0 \end{cases}

  x[n] is a discrete-time audio signal, sgn(x[n]) is the signum function and w[n] is a window function. ZCR can also be used to distinguish between voiced and unvoiced speech signals, as unvoiced speech segments normally have much higher ZCR values than voiced segments.

• Pitch: Pitch is an auditory sensation in which a listener assigns musical tones to relative positions on a musical scale based primarily on frequency of vibration. Pitch, often used interchangeably with fundamental frequency, provides important information about an audio signal that can be used for different tasks including music segmentation [25], speaker recognition [2] and speech analysis and synthesis purposes [47]. Generally, audio signals are analysed in the time domain and spectral domain to characterise a signal in terms of frequency, amplitude, energy etc. But some audio characteristics, such as pitch, are missing from spectra even though they are useful for characterising a music signal. Spectral characteristics of a signal can be affected by channel variations, whereas pitch is unaffected by such variations. There are different ways to estimate the pitch of an audio signal, as explained in [20].

• Autocorrelation: It is the correlation of a signal with a delayed copy of itself, as a function of the delay. It is computed by applying different time lags to the sequence and correlating each lagged version with the original sequence as reference. (A short illustrative sketch of the temporal features above follows this list.)
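As a concrete illustration of the temporal features listed above, the following is a minimal sketch in Python/NumPy (not part of any cited system; the function name frame_features and the frame/hop sizes are illustrative assumptions) that computes short-time energy, RMS, zero-crossing rate and autocorrelation per frame.

import numpy as np

def frame_features(x, frame_len=2205, hop=441):
    """Simple temporal features per frame of a mono signal x.
    Frame and hop sizes are illustrative (50 ms / 10 ms at 44.1 kHz)."""
    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = np.asarray(x[start:start + frame_len], dtype=float)
        ste = np.sum(frame ** 2)                          # short-time energy (rectangular window)
        rms = np.sqrt(np.mean(frame ** 2))                # root mean square amplitude
        signs = np.sign(frame)
        signs[signs == 0] = 1                             # treat zero samples as positive, as in sgn above
        zcr = np.sum(np.abs(np.diff(signs))) / (2.0 * frame_len)       # zero-crossing rate
        acf = np.correlate(frame, frame, mode="full")[frame_len - 1:]  # autocorrelation, non-negative lags
        feats.append((ste, rms, zcr, acf))
    return feats

In practice the audio would first be loaded and normalised, and a Hamming window could be applied before computing the energy, as in Eq. (2.1).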

2.3.2 Spectral Features

A temporal signal can be transformed into the spectral domain using a suitable

transformation, such as the Fourier transform. A number of descriptors

can be derived from the Fast Fourier Transform (FFT), such as:

• Spectral centroid: It indicates the region with the greatest concentration of spectral energy in the audio signal. The spectral centroid is commonly associated with the brightness of a sound. It is obtained by evaluating the centre of gravity of the spectrum using the Fourier transform's frequency and magnitude information. The centroid of a spectral frame is defined as the amplitude-weighted average frequency, divided by the sum of the amplitudes:

  C = \frac{\sum_{k=1}^{N} k\, X[k]}{\sum_{k=1}^{N} X[k]}

  where X[k] is the magnitude of the FFT at frequency bin k and N is the number of frequency bins.

Using this feature, in [10], a sound stream is segmented by classifying each sub-segment into silence, pure speech, music, environmental sound, speech over music, and speech over environmental sound classes in multiple steps.

• Spectral Flatness: It is the flatness of the spectrum, expressed as the ratio between the geometric and arithmetic means. Its output can be seen as a measure of the tonality/noisiness of the sound. A high value indicates a flat spectrum, typical of noise-like sounds or ensemble sections. On the other hand, harmonic sounds produce low flatness values, an indicator of solo phrases.

  SF_n = \frac{\left( \prod_{k} X_n[k] \right)^{1/K}}{\frac{1}{K} \sum_{k} X_n[k]}

  where k is the frequency bin index of the magnitude spectrum X at frame n and K is the number of bins.

In [24] and [19], a method that utilizes a spectral flatness based tonality feature for segmentation and content-based retrieval of audio is outlined.

• Spectral flux: Spectral flux is a measure of how quickly the power spectrum of a signal is changing, calculated by comparing the power spectrum of one frame against that of the previous frame. More precisely, it takes the Euclidean norm between the two spectra, each one normalised by its energy. It is defined as the 2-norm between two adjacent frames:

  SF[n] = \int_{\omega} \left( |X_n(e^{j\omega})| - |X_{n+1}(e^{j\omega})| \right)^2 d\omega    (2.3)

  where X_n(e^{j\omega}) is the Fourier transform of the n-th frame of the input signal, defined as

  X_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} w[n-m]\, x[m]\, e^{-j\omega m}    (2.4)

In [53], spectral flux is one of the features used to segment an audio stream on the basis of its content into four main audio types: pure speech, music, environment sound, and silence.

• Spectral Crest: This feature describes the shape of the spectrum. It is a measure of the peakiness of a spectrum and is inversely proportional to the spectral flatness. It is used to distinguish between sounds that are noise-like and tone-like. Noise-like spectra will have a spectral crest near 1. It is calculated by the formula

  SC_n = \frac{\max_{k} X_n[k]}{\frac{1}{K} \sum_{k} X_n[k]}

In [19], spectral crest is used as one of the features to detect solo phrases in music.

• Spectral roll-off: It determines a threshold below which the biggest part of the signal energy resides, and is a measure of spectral shape. The spectral roll-off point is defined as the N-th percentile of the power spectral distribution, where N is usually 85; i.e. the roll-off point is the frequency below which N% of the magnitude distribution is concentrated.

In [27], a modified spectral roll-off is used to segment between speech and music.

• Spectral skewness: This is a statistical measure of the asymmetry of the probability distribution of the audio signal spectrum. It indicates whether or not the spectrum is skewed towards a particular range of values.

• Spectral slope: It characterises the loss of the signal's energy at higher frequencies. It is a measure of how quickly the spectrum of an audio sound tails off towards the high frequencies, calculated using a linear regression on the amplitude spectrum.

• Spectral Entropy: It is a measure of the randomness of a system. It is calculated as below:

  - Calculate the spectrum X[k] of the signal.
  - Calculate the power spectral density (PSD) of the signal by squaring its amplitude and normalising by the number of bins.
  - Normalise the calculated PSD so that it can be viewed as a probability density function (its integral is equal to 1).
  - The power spectral entropy can now be calculated using the standard entropy formula

    PSE = - \sum_{i=1}^{n} p_i \ln p_i

  where p_i is the normalised PSD. (A sketch illustrating several of these spectral descriptors follows this list.)
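The sketch below illustrates several of the spectral descriptors defined above for a single frame (Python/NumPy; the function name and the 85% roll-off default are illustrative, and small epsilons are added purely for numerical stability).

import numpy as np

def spectral_descriptors(frame, prev_frame, rolloff_pct=0.85):
    """Illustrative spectral features for one frame (previous frame needed for flux)."""
    X = np.abs(np.fft.rfft(frame)) + 1e-12          # magnitude spectrum (epsilon for stability)
    Xp = np.abs(np.fft.rfft(prev_frame)) + 1e-12
    k = np.arange(1, len(X) + 1)

    centroid = np.sum(k * X) / np.sum(X)                        # amplitude-weighted mean bin
    flatness = np.exp(np.mean(np.log(X))) / np.mean(X)          # geometric / arithmetic mean
    crest = np.max(X) / np.mean(X)                              # peakiness of the spectrum
    cum = np.cumsum(X)
    rolloff_bin = np.searchsorted(cum, rolloff_pct * cum[-1])   # bin below which 85% of magnitude lies
    Xn, Xpn = X / np.linalg.norm(X), Xp / np.linalg.norm(Xp)    # normalise each spectrum
    flux = np.sum((Xn - Xpn) ** 2)                              # squared 2-norm between adjacent frames

    return dict(centroid=centroid, flatness=flatness, crest=crest,
                rolloff_bin=rolloff_bin, flux=flux)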


2.3.3 Cepstral Features

Cepstral analysis originated from speech processing. Speech is composed of two

components - the glottal excitation source and the vocal tract system. These two

components have to be separated from the speech in order to analyze and model

independently. The objective of cepstral analysis is to separate the speech into its

source and system components without any a priori knowledge about source and

/ or system. Because these two component signals are convolved, they cannot be

easily separated in the time domain.

The cepstrum c is defined as the inverse DFT of the log magnitude of the DFT

of the signal x.

c[n] = \mathcal{F}^{-1}\{ \log |\mathcal{F}\{x[n]\}| \}

where \mathcal{F} denotes the DFT and \mathcal{F}^{-1} the inverse DFT.

Cepstral analysis measures rate of change across frequency bands. The cepstral

coefficients are a very compact representation of the spectral envelope. They

are also (to a large extent) uncorrelated. Glottal excitation is captured by the

coefficients where n is high and the vocal tract response, by those where n is

low. For these reasons, cepstral coefficients are widely used in speech recognition,

generally combined with a perceptual auditory scale.
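A minimal sketch of the real cepstrum as defined above, for one windowed frame, is given below (Python/NumPy; illustrative only).

import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one frame: inverse DFT of the log magnitude of the DFT."""
    spectrum = np.fft.fft(frame * np.hamming(len(frame)))    # windowed DFT
    log_mag = np.log(np.abs(spectrum) + 1e-12)               # log magnitude (epsilon for stability)
    return np.real(np.fft.ifft(log_mag))                     # low quefrencies capture the spectral envelope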

We discuss some types of cepstral coefficients used in speech and music analysis

fields:

• Linear Prediction Cepstral Coefficients (LPCC): For finding the source (glottal excitation) and system (vocal tract) components in the time domain itself, linear prediction analysis was proposed by Gunnar Fant [15] as a linear model of speech production in which glottis and vocal tract are fully decoupled. Linear prediction calculates a set of coefficients which provide an estimate, or a prediction, for a forthcoming output sample. The commonest form of linear prediction used in signal processing is where the output estimate is made entirely on the basis of previous output samples. The result of LPC analysis is then a set of coefficients a[1..k] and an error signal e[n]; the error signal will be as small as possible and represents the difference between the predicted signal and the original.

According to the model, the speech signal is the output y[n] of an all-pole representation 1/A(z) excited by x[n]. The filter 1/A_p(z) is known as the synthesis filter. This implicitly introduces the concept of linear predictability, which gives the model its name. Using this model, the speech signal can be expressed as

  y[n] = \sum_{k=1}^{p} a_k\, y[n-k] + e[n]

which states that the speech sample can be modelled as a weighted sum of the p previous samples plus an excitation contribution. In linear prediction, the term e[n] is usually referred to as the error (or residual). The LP parameters {a_i} are estimated such that the error is minimised, typically using the covariance method or the auto-correlation method.

The LP coefficients are too sensitive to numerical precision: a very small error can distort the whole spectrum or make the prediction filter unstable. So it is often desirable to transform LP coefficients into cepstral coefficients. LPCCs are Linear Prediction Coefficients (LPC) represented in the cepstrum domain. The cepstral coefficients of LPCC are derived as below:

  c(n) = \begin{cases} 0, & n < 0 \\ \ln(G), & n = 0 \\ a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, c(k)\, a_{n-k}, & 0 < n \leq p \\ \sum_{k=n-p}^{n-1} \frac{k}{n}\, c(k)\, a_{n-k}, & n > p \end{cases}

Though LP coefficients and LPCCs are widely used in speech analysis and synthesis tasks, they are not directly used for audio segmentation. However, a related feature called Line Spectral Frequencies (LSF) has been used for audio segmentation. LSFs are an alternative to the direct-form linear predictor coefficients: they are an alternate parametrisation of the filter with a one-to-one correspondence with the direct-form predictor coefficients. They are not very sensitive to quantisation noise and are also stable. Hence they are widely used for quantising LP filters.

In [11], LSFs are used as the core feature for speech-music segmentation. In addition to this, a new feature, the linear prediction zero-crossing ratio (LP-ZCR), is also used, which is defined as the ratio of the zero-crossing count of the input to the zero-crossing count of the output of the LP analysis filter.

• Mel-Frequency Cepstrum Coefficients (MFCC): The motivation for using Mel-Frequency Cepstrum Coefficients is that the auditory response of the human ear resolves frequencies non-linearly. MFCC was first proposed in [36]. The mapping from linear frequency to mel frequency is defined as

  f_{mel} = 2595 \cdot \log_{10}\left(1 + \frac{f}{700}\right)

  The steps involved in extracting the MFCC feature are shown in the figure below:

Figure 2.1: Block diagram of MFCC extraction
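The block diagram of Fig. 2.1 can be summarised by the following illustrative sketch (Python with NumPy and SciPy); the triangular mel filterbank construction, 40 filters and 13 retained coefficients are common choices and are assumptions here, not the exact configuration used elsewhere in this thesis.

import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=22050, n_filters=40, n_coeffs=13):
    """Illustrative MFCC for a single frame: DFT -> mel filterbank -> log -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2   # power spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)

    # Triangular filters with centres equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    energies = np.zeros(n_filters)
    for i in range(n_filters):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = (freqs - lo) / (ctr - lo)          # rising edge of the triangle
        down = (hi - freqs) / (hi - ctr)        # falling edge of the triangle
        tri = np.clip(np.minimum(up, down), 0.0, None)
        energies[i] = np.sum(tri * spectrum)

    log_energies = np.log(energies + 1e-12)
    return dct(log_energies, type=2, norm='ortho')[:n_coeffs]        # keep the first few coefficients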

• Bark Frequency Cepstral Coefficients (BFCC): The Bark scale, another perceptual scale, divides the audible spectrum into 24 critical bands that try to mimic the frequency response of the human ear. Critical bands refer to frequency ranges corresponding to regions of the basilar membrane that are excited when stimulated by specific frequencies. Critical band boundaries are not fixed according to frequency, but depend upon the specific stimuli. Relative bandwidths are more stable, and repeated experiments have found consistent results. In frequency, these widths remain more or less constant at 100 Hz for centre frequencies up to 500 Hz, and are proportional to higher centre frequencies by a factor of 0.2.

  The relation between the frequency scale and the Bark scale is as below:

  Bark = 6 \ln\left( \frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1} \right)

  In [37], BFCC is used for real-time instrumental sound segmentation and labeling.

• Harmonic Cepstral Coefficients (HCC): In the MFCC approach, the spectrum envelope is computed from energy averaged over each mel-scaled filter. This may not work well for voiced sounds with quasi-periodic features, as the formant frequencies tend to be biased toward pitch harmonics, and formant bandwidths may be mis-estimated. To overcome this shortcoming, instead of averaging the energy within each filter, which results in a smoothed spectrum in MFCC, harmonic cepstral coefficients (HCC) are derived from the spectrum envelope sampled at pitch harmonic locations. This requires robust pitch estimation and voiced/unvoiced/transition (V/UV/T) classification, accomplished using spectro-temporal auto-correlation (STA) followed by a peak-picking algorithm. The block diagram of HCC analysis is shown below:

Figure 2.2: Block diagram of HCC analysis

2.3.4 Distance based Features

Distance-based methods perform an analysis over

a stream of data to find the point that best marks a characteristic event of

interest. Many distance functions have been proposed in the audio segmentation literature,

mainly because they can be blind to the audio stream characteristics, i.e. the type

of audio (recording conditions, number of acoustic sources, etc) or type of the

upcoming audio classes (speech, music, etc). The most commonly used are:

• The Euclidean distance

  This is the simplest distance metric for comparing two windows of feature vectors. For the distance between two distributions, we take the distance between only the means of the two distributions. For two windows of audio data described as Gaussian models G_1(\mu_1, \Sigma_1) and G_2(\mu_2, \Sigma_2), the Euclidean distance metric is given by:

  (\mu_1 - \mu_2)^T (\mu_1 - \mu_2)

• The Bayesian Information Criterion (BIC)

  The Bayesian information criterion aims to find the best models that describe a set of data. From the two given windows of the audio stream, the algorithm computes three models representing the windows separately and jointly. From each model, the formula extracts the likelihood and a complexity term that expresses the number of model parameters. For two windows of audio data described as Gaussian models G_1(\mu_1, \Sigma_1) and G_2(\mu_2, \Sigma_2), and with their combined windows described as G(\mu, \Sigma), the \Delta BIC distance metric is evaluated as below:

  \Delta BIC = BIC(G_1) + BIC(G_2) - BIC(G)

  BIC(G) = -\frac{N \log|\Sigma|}{2} - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N - \frac{dN \log 2\pi}{2} - \frac{N}{2}

  \Delta BIC = \frac{N \log|\Sigma|}{2} - \frac{N_1 \log|\Sigma_1|}{2} - \frac{N_2 \log|\Sigma_2|}{2} - \left(\frac{\lambda d}{2} + \frac{\lambda d(d+1)}{4}\right)\left(\log N_1 + \log N_2 - \log N\right)

  where N, N_1, N_2 are the number of frames in the corresponding streams, d is the dimension of the feature vectors and \lambda is an experimentally determined factor.

  In [5], BIC is used to detect acoustic change due to speaker change, which in turn is used for segmentation based on speaker change.
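A minimal sketch of the ΔBIC computation for two windows of feature vectors is given below (Python/NumPy; delta_bic is a hypothetical name). Note that this sketch uses the common variant in which the penalty is proportional to log N of the combined window, which differs slightly from the penalty term written above.

import numpy as np

def delta_bic(win1, win2, lam=1.0):
    """Delta-BIC between two windows of feature vectors (rows = frames, columns = features).
    Large positive values suggest the windows are better modelled separately (a change point)."""
    n1, n2 = len(win1), len(win2)
    n = n1 + n2
    d = win1.shape[1]
    both = np.vstack([win1, win2])

    def logdet_cov(w):
        cov = np.cov(w, rowvar=False) + 1e-6 * np.eye(d)   # small ridge for numerical stability
        return np.linalg.slogdet(cov)[1]

    data_term = 0.5 * (n * logdet_cov(both) - n1 * logdet_cov(win1) - n2 * logdet_cov(win2))
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)   # model complexity penalty
    return data_term - penalty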

• The Generalized Likelihood Ratio (GLR): When we process music, context is very important; we therefore want to understand the trajectory of features as a function of time. GLR is a simplification of the Bayesian Information Criterion. Like BIC, it finds the difference between two windows of the audio stream using the three Gaussian models that describe these windows separately and jointly. For two windows of audio data described as Gaussian models G_1(\mu_1, \Sigma_1) and G_2(\mu_2, \Sigma_2), the GLR distance is given by:

  GLR = w \left( 2\log|\Sigma| - \log|\Sigma_1| - \log|\Sigma_2| \right)

  where w is the window size and \Sigma is the covariance of the two windows combined.

  In [49], segmenting an audio stream into homogeneous regions according to speaker identities, background noise, music, environmental and channel conditions is proposed using GLR.


• KL2 Distance Metric based segmentation is a popular technique. It relies on the computation of a distance between two acoustic segments to determine whether they have similar timbre or not. A change in timbre is an indicator of a change in acoustic characteristics such as speaker, musical instrument, background ambience etc.

  KL divergence is an information-theoretic, likelihood-based, non-symmetric measure that gives the difference between two probability distributions P and Q. The larger this value, the greater the difference between these PDFs. It is given by:

  D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}    (2.5)

  As mentioned in [46], since the D_{KL}(P \| Q) measure is not symmetric, it cannot be used as a distance metric. Hence its variation, the KL2 metric, is used here for distance computation. It is defined as follows:

  D_{KL2}(P, Q) = D_{KL}(P \| Q) + D_{KL}(Q \| P)    (2.6)

  A Gaussian distribution computed on a window of the Fourier-transformed, cent-normalised spectrum is considered as a probability density function. The KL2 distance is computed between adjacent frames to determine the divergence between two adjacent spectra.

  In [46], the KL2 distance is used to detect segment boundaries where a speaker change or channel change occurs.
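For Gaussian-distributed windows, the KL divergence and hence the KL2 distance have a closed form; the following sketch (Python/NumPy; kl2_gaussian is a hypothetical name) computes it from the empirical means and covariances of two windows of feature vectors, with a small ridge term added as an illustrative stabiliser.

import numpy as np

def kl_gaussian(mu1, cov1, mu2, cov2):
    """KL divergence D(N1 || N2) between two multivariate Gaussians."""
    d = len(mu1)
    inv2 = np.linalg.inv(cov2)
    diff = mu2 - mu1
    trace_term = np.trace(inv2 @ cov1)
    mahal_term = diff @ inv2 @ diff
    logdet_term = np.linalg.slogdet(cov2)[1] - np.linalg.slogdet(cov1)[1]
    return 0.5 * (trace_term + mahal_term - d + logdet_term)

def kl2_gaussian(win1, win2, ridge=1e-6):
    """Symmetric KL2 distance between two windows of feature vectors (rows = frames)."""
    d = win1.shape[1]
    mu1, mu2 = win1.mean(axis=0), win2.mean(axis=0)
    cov1 = np.cov(win1, rowvar=False) + ridge * np.eye(d)
    cov2 = np.cov(win2, rowvar=False) + ridge * np.eye(d)
    return kl_gaussian(mu1, cov1, mu2, cov2) + kl_gaussian(mu2, cov2, mu1, cov1)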

• The Hotelling T^2 statistic is another popular tool for comparing distributions. The main difference from KL2 is the assumption that the two windows of the audio stream being compared do not differ in their covariances. For two windows of audio data described as Gaussian models G_1(\mu_1, \Sigma_1) and G_2(\mu_2, \Sigma_2), the Hotelling T^2 distance metric is given by:

  T^2 = \frac{N_1 N_2}{N_1 + N_2} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)

  where \Sigma is the common covariance (\Sigma = \Sigma_1 = \Sigma_2) and N_1, N_2 are the number of frames in the corresponding streams.

  In [54], the Hotelling T^2 statistic is used to pre-select candidate segmentation boundaries, followed by BIC to perform the segmentation decision.


2.4 Discussions

While these techniques have been attempted for Western music where the repeti-

tions have more or less static time-frequency melodic content, finding repetitions

in improvisational music is a difficult task. In Indian music, the melody content

of the repetitions varies significantly (Fig 1.3) during repetitions within the same

composition due to the improvisations performed by the musician. A musician's

rendering of a composition is considered rich if (s)he is able to improvise and pro-

duce a large number of melodic variants of the line while preserving the grammar,

rhythmic structure and the identity of the composition. Another issue that needs

to be addressed is of the tonic. The same composition when rendered by different

musicians can be sung in different tonics. Hence matching a repeating pattern of a

composition across recordings of various musicians requires a tonic-independent

approach.

The task of segmenting an item into alapana and kriti in Carnatic music involves

differentiating between the textures of the music during alapana and kriti. While

the kriti segment involves both melody and rhythm and hence includes the par-

ticipation of percussion instruments, the alapana segment involves only melody

contributed to by lead performer and the accompanying violinist.

It has been well established in [43] that MFCC features are not suitable for

modelling music analysis tasks where there is a dependency on the tonic. When

MFCCs are used to model music, a common frequency range is used for all musi-

cians, which does not give the best results when variation in tonic is factored in.

With machine learning techniques, when MFCC features are used, training and

testing datasets should have the same tonic. This creates problems when music is


compared across tonics as the tonic can vary from concert to concert and musician

to musician. To address the issue of tonic dependency, a new feature called cent

filterbank (CFB) energies was introduced in [43]. Hence, modelling of Carnatic

music using cent filter-bank (CFB) based features that are normalised with respect

to the tonic of the performance, namely CFB Energy and Cent Filterbank Cepstral

Coefficients (CFCC), is the preferred approach for this thesis.

Figure 2.3: Filter-banks and filter-bank energies of a melody segment in the mel scale and the cent scale with different tonic values

2.4.1 CFB Energy Feature

The cent is a logarithmic unit of measure used for musical intervals. Twelve-tone

equal temperament divides the octave into 12 semitones of 100 cents each. An


octave (two notes that have a frequency ratio of 2:1) spans twelve semitones and

therefore 1200 cents.

As mentioned earlier, notes that make up a melody in Carnatic music are

defined with respect to the tonic. The tonic chosen for a concert is maintained

throughout the concert using an instrument called the tambura (drone). The anal-

ysis of a concert therefore should depend on the tonic. The tonic ranges from 180

Hz to 220 Hz for female and 100 Hz to 140 Hz for male singers. Tonic normali-

sation in CFB removes the spectral variations. This is illustrated in Fig. 2.3 (footnote 1), which

shows time filter-bank energy plots for both the mel scale and cent scale. The time

filter-bank energies are shown for the same melody segment as sung by a male

and a female musician. Filter-bank energies and filter-banks are plotted for two

different musicians (male motif with tonic 134 Hz and female motif with tonic 145

Hz) with different tonic values. In the case of mel scale, filters are placed across

the same frequencies for every concert irrespective of the tonic values, whereas, in

the case of the cent scale, the filter-bank frequencies are normalised with respect

to the tonic. The male and female motifs are clearly emphasised irrespective of the

tonic values in the cent scale, and are not clearly emphasised in the mel scale.

CFB energy feature extraction is carried out as below:

1. The audio signal is divided into frames.

2. The short-time DFT is computed for each frame.

3. The frequency scale is normalised by the tonic. The cent scale is defined as: Cent = 1200 · log_2(f / tonic).

4. Six octaves corresponding to [-1200 : 6000] cents are chosen for every musician. While up to 3 octaves can be covered in a concert, the instruments produce harmonics which are critical to capture the timbre. The choice of six octaves is to capture the rich harmonics of musical instruments.

1Image courtesy Padi Sarala


5. The cent-normalised power spectrum is then multiplied by a bank of 80 filters that are spaced uniformly in the linear scale to account for the harmonics of pitch. The choice of 80 filters is based on experiments in [43].

6. The filterbank energies are computed for every frame and used as a feature after removing the bias.

CFB energy features were extracted for every frame of length 100 ms of the

musical item, with a shift of 10 ms. Thus, an 80-dimensional feature vector is obtained for

every 10 ms of the item, resulting in N feature vectors for the entire item.
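The steps above can be summarised by the following illustrative sketch (Python/NumPy). The 100 ms frames, 10 ms shift, 80 filters over [-1200, 6000] cents and mean (bias) removal follow the description in this section; the triangular filter shape and per-filter mean subtraction are assumptions made for illustration.

import numpy as np

def cfb_energies(x, sr, tonic_hz, n_filters=80, frame_len_s=0.1, hop_s=0.01,
                 cent_lo=-1200.0, cent_hi=6000.0):
    """Cent filterbank energies: tonic-normalised log filterbank energies per frame."""
    frame_len, hop = int(frame_len_s * sr), int(hop_s * sr)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    cents = 1200.0 * np.log2(np.maximum(freqs, 1e-6) / tonic_hz)   # cent scale w.r.t. the tonic

    # Triangular filters spaced uniformly on the chosen cent range (an illustrative choice).
    edges = np.linspace(cent_lo, cent_hi, n_filters + 2)
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        tri = np.minimum((cents - lo) / (ctr - lo), (hi - cents) / (hi - ctr))
        fbank[i] = np.clip(tri, 0.0, None)

    feats = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2
        feats.append(np.log(fbank @ power + 1e-12))
    feats = np.array(feats)
    return feats - feats.mean(axis=0)          # remove the bias (mean) per filter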

2.4.2 CFB Slope Feature

In Carnatic music, svaras (ornamented notes) arranged in a well-defined order

constitute the phrases (aesthetic threads of ornamented notes) of a raga. Melodic

motifs are those unique phrases of a raga

that collectively give the raga its identity. In Fig. 4.2, it can be seen that the

presence of the strokes of the mrudangam destroys the melodic motif. To ad-

dress this issue, a cent filterbank based slope was computed along frequency. Let
the vector of log filterbank energy values be represented as F_i = (f_{1,i}, f_{2,i}, \ldots, f_{n_f,i})^t,
where n_f is the number of filters. Mean subtraction on the sequence F_i, where i =
1, 2, \ldots, n, is applied as before. Here, n is the number of feature vectors in the query.
To remove the effect of percussion, slope values across consecutive values in each
vector F_i are calculated. Linear regression over 5 consecutive filterbank energies is
performed. A vector of slope values s = (s_{1,i}, s_{2,i}, \ldots, s_{F-1,i})^t for each frame of music
is obtained as a result.
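A minimal sketch of this slope computation, fitting a least-squares line over every 5 consecutive filterbank energies within each frame, is given below (Python/NumPy; cfb_slope is a hypothetical name).

import numpy as np

def cfb_slope(fb_energies, reg_len=5):
    """Per-frame spectral slope of log filterbank energies.
    fb_energies: array of shape (n_frames, n_filters); returns (n_frames, n_filters - reg_len + 1)."""
    n_frames, n_filters = fb_energies.shape
    x = np.arange(reg_len)
    x_centered = x - x.mean()
    denom = np.sum(x_centered ** 2)
    slopes = np.zeros((n_frames, n_filters - reg_len + 1))
    for j in range(n_filters - reg_len + 1):
        window = fb_energies[:, j:j + reg_len]                    # 5 consecutive filter energies
        y_centered = window - window.mean(axis=1, keepdims=True)
        slopes[:, j] = y_centered @ x_centered / denom            # least-squares slope per frame
    return slopes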


2.4.3 CFCC Feature

To arrive at the Cent Filterbank Cepstral Coefficient (CFCC) feature, after carrying out

the steps enumerated in Section 2.4.1, a DCT is applied on the filterbank energies to

de-correlate them, and the required coefficients are retained.


CHAPTER 3

Identification of alapana and kriti segments

3.1 Introduction

Alapana (Sanskrit: dialogue) is a way of rendition to explore the features and beauty

of a raga. Since alapana is purely melodic with no lyrical and rhythmic components,

it is best suited to bring out the various facets of a raga [40, Chapter 4]. The per-

former brings out the beauty of a raga using creativity and internalised knowledge

about the grammar of the raga. During alapana, the performer improvises each

note or a set of notes gradually gliding across octaves, emphasising important

notes and motifs thereby evoking the mood of the raga. After the main artiste

finishes the alapana, optionally the accompanying violinist may perform an alapana

in the same raga.

The kritis are central to any Carnatic music concert. Every kriti is a confluence

of three aspects: lyrics, melody and rhythm. Every musical item in a concert will

have the mandatory kriti segment and optionally an alapana segment. The syllables of

the lyrics of the kriti go hand in hand with the melody of the raga, thereby enriching

the listening experience. The lyrics are also important in Carnatic music. While

the raga evokes certain emotional feelings, the lyrics further accentuate it, adding

to the aesthetics and listening experience.

In this chapter, we will describe an approach to identify the boundary separat-

ing alapana and kriti using the KL2 distance, GMMs and the CFB energy feature. In Section 3.2, we

will describe the algorithm used for the segmentation. In Section 3.3, we will

be discussing the results of our experiments. We will conclude this chapter with

discussions on the results.

3.2 Segmentation Approach

3.2.1 Boundary Detection

In order to detect the boundary separating alapana and kriti, individual feature

vectors need to be labelled. One naive approach to find the boundary would be

to label each and every feature vector. Since each feature vector corresponds to 10

ms, and a musical item can last anywhere between 3 and 30 minutes, there is a

need to label too many feature vectors for the entire musical item. Moreover, there

would be small intervals of time during the kriti, when percussion content would

be absent either due to inter-stroke silence or due to aesthetic pauses deliberately

introduced by the percussionist.

So, a better approach would be to extract a segment of feature vectors from

the item and try to label the segment as a whole. Hence, finding the boundary

between alapana and kriti would involve:

• Iterate over the N feature vectors, one at a time.

• Consider a segment of specified length to the left and right of the current feature vector.

• Use a machine learning technique to label these two segments as a whole. This reduces the resolution of the segmentation process to the segment length.

• Use music domain knowledge to correct and agglomerate the labels to find the boundary between alapana and kriti.


This approach is computationally intensive. To further improve the efficiency

of this process, we have to reduce the search space for the boundary. The following

approach using KL2 was used:

• Iterate over the N feature vectors, one at a time.

• Consider a sliding window consisting of a sequence of 500 feature vectors (5 seconds), W_n, where n denotes the starting position, n = 1, 2, ..., N - 500.

• Average the density function obtained from the feature vectors over the entire window length.

• Calculate the KL2 distance between two successive windows of music, W_n and W_{n+1}.

• Larger values of the KL2 distance denote a large change in distribution.

• A threshold was automatically chosen such that there is a 3-second spacing between adjacent peaks of the KL2 values. This is to prevent the algorithm from generating too many change points. The choice of 3 seconds was empirically arrived at, as a trade-off between accuracy and efficiency.

• The peaks extracted correspond to an array of K possible boundaries B = [b_1, b_2, ..., b_K] between alapana and kriti (a sketch of this search is given after Fig. 3.1).

Figure 3.1: KL2 values and possible segment boundaries (X axis: feature vector index, Y axis: KL2 value; the plot marks the boundary points, the automatically chosen threshold and the selected peak, with the alapana and kriti regions indicated).

Fig. 3.1 shows the output of the algorithm described above.
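A compact sketch of this candidate-boundary search is given below (Python, using NumPy and SciPy's peak picker, and reusing the hypothetical kl2_gaussian sketch from Section 2.3.4). It compares the windows immediately to the left and right of each point, a common variant of the scheme described above, and the height threshold shown is only an illustrative stand-in for the automatically chosen threshold.

import numpy as np
from scipy.signal import find_peaks

def candidate_boundaries(feats, win=500, min_gap=300):
    """Candidate change points from KL2 distances between adjacent 5 s windows.
    feats: (N, d) feature matrix at 10 ms hops; win = 500 frames (5 s); min_gap = 300 frames (3 s)."""
    n = len(feats)
    kl2 = np.zeros(n)
    for i in range(win, n - win):
        left, right = feats[i - win:i], feats[i:i + win]
        kl2[i] = kl2_gaussian(left, right)          # symmetric KL2 between the two windows
    # Keep peaks at least 3 s apart; the height threshold here is a simple illustrative choice.
    peaks, _ = find_peaks(kl2, distance=min_gap, height=np.mean(kl2) + np.std(kl2))
    return peaks, kl2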


3.2.2 Boundary verification using GMM

From the K possible boundary values, the actual boundary between kriti and

alapana needs to be identified. In order to verify the boundaries GMMs were

used. GMMs were trained using CFB energy features after applying DCT for

compression. GMMs were trained for both the classes kriti and alapana using a

training dataset with 32 mixtures per class. The approach is as follows:

• A window of length 1000 feature vectors (10 seconds) was extracted to the left and right of the possible boundary points, B.

• Labels for the left segments, LSL, and the right segments, RSL, were estimated using the GMMs (as shown in Fig. 3.2).

Figure 3.2: GMM Labels
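A sketch of this verification step is shown below, using scikit-learn's GaussianMixture as a stand-in implementation (the thesis does not name a particular GMM library; diagonal covariances and the helper names are illustrative assumptions). One 32-mixture model is trained per class and the 10-second windows on either side of each candidate boundary are labelled by comparing average log-likelihoods.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_class_gmms(alapana_feats, kriti_feats, n_mix=32):
    """Train one GMM per class on DCT-compressed CFB energy features (rows = frames)."""
    gmm_a = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(alapana_feats)
    gmm_k = GaussianMixture(n_components=n_mix, covariance_type='diag').fit(kriti_feats)
    return gmm_a, gmm_k

def label_around_boundaries(feats, boundaries, gmm_a, gmm_k, win=1000):
    """For each candidate boundary, label the 10 s segments to its left and right."""
    def label(segment):
        # The average log-likelihood under each class model decides the label.
        return 'alapana' if gmm_a.score(segment) > gmm_k.score(segment) else 'kriti'
    lsl, rsl = [], []
    for b in boundaries:
        lsl.append(label(feats[max(0, b - win):b]))
        rsl.append(label(feats[b:b + win]))
    return lsl, rsl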

3.2.3 Label smoothing using Domain Knowledge

Now, using the set of possible boundaries (B) and their left and right segment

labels (LSL and RSL), we need to assign a label to each individual feature vector, giving the label sequence L.

The following approach was used to find L.


L[n] = \begin{cases}
LSL[1], & 1 \leq n \leq B[1] \\
RSL[k], & B[k] < n < (B[k] + B[k+1])/2, \quad k = 1, \ldots, K-1 \\
LSL[k], & (B[k-1] + B[k])/2 \leq n \leq B[k], \quad k = 2, \ldots, K \\
RSL[K], & B[K] < n \leq N
\end{cases}

The labels after applying the above approach are as shown in Figure 3.3.

Figure 3.3: Entire song label generated using GMM

Domain information was used to improve the results. To agglomerate the

labels, a smoothing algorithm was used as described below:

• An item can have at most 2 segments: alapana and kriti.

• If present, the alapana must be at least 30 seconds long.

• A kriti may be preceded by an alapana, and not vice versa.

• If a smaller segment of a particular label (alapana or kriti) is identified in between two larger segments of a different label, the smaller segment is relabelled and merged with the adjacent larger segments (a sketch of this smoothing is given below).
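A minimal sketch of the label assignment and the domain-knowledge smoothing is given below (Python/NumPy; at a 10 ms frame rate, 30 seconds corresponds to 3000 frames). The smoothing here collapses the rules into the simplest consistent form, choosing the single alapana/kriti split that best agrees with the raw frame labels; the function names are hypothetical.

import numpy as np

def assign_labels(n_frames, boundaries, lsl, rsl):
    """Expand the per-boundary segment labels (LSL/RSL) into a frame-level label sequence,
    following the piecewise definition of L[n] above (0-indexed)."""
    labels = np.empty(n_frames, dtype=object)
    labels[:boundaries[0] + 1] = lsl[0]
    for k in range(len(boundaries)):
        start = boundaries[k]
        end = (boundaries[k] + boundaries[k + 1]) // 2 if k + 1 < len(boundaries) else n_frames
        labels[start:end] = rsl[k]
        if k + 1 < len(boundaries):
            labels[end:boundaries[k + 1] + 1] = lsl[k + 1]
    return labels

def smooth_labels(labels, min_alapana=3000):
    """Enforce the rules above: a single optional alapana followed by a kriti, alapana >= 30 s.
    The split point is chosen to agree with as many raw frame labels as possible."""
    is_alapana = (labels == 'alapana').astype(int)
    prefix_alapana = np.concatenate([[0], np.cumsum(is_alapana)])                    # alapana frames before b
    suffix_kriti = np.concatenate([np.cumsum((1 - is_alapana)[::-1])[::-1], [0]])    # kriti frames from b on
    best_split = int(np.argmax(prefix_alapana + suffix_kriti))
    if best_split < min_alapana:
        best_split = 0          # alapana absent or shorter than 30 s: treat the whole item as kriti
    return np.array(['alapana'] * best_split + ['kriti'] * (len(labels) - best_split), dtype=object)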

The final song label is as shown in Figure 3.4.


Figure 3.4: Entire song label generated using GMM after smoothing

3.3 Experimental Results

3.3.1 Dataset Used

Experiments were conducted on 40 live concert recordings. Of these 40 concerts, 6

were multi-track recordings and the remaining were single-track recordings. The

details of the dataset used are given in Table 3.1. Durations are given in approximate

hours (h).

Table 3.1: Division of dataset.

                              Male   Female   Total
No. of artistes                 11       11      22
No. of concerts                 26       14      40
No. of items with alapana       95       59     154
No. of items without alapana   104       63     167
Total no. of items             199      122     321
Total duration of kriti       30 h     18 h    48 h
Total duration of alapana     12 h      7 h    19 h


3.3.2 Results

Experiments were performed using both MFCC and CFB based features. Two

metrics were used to calculate the accuracy of segmentation: frame-level accuracy

and item classification accuracy. As mentioned earlier, a musical item in a concert

can be a kriti optionally preceded by an alapana. Assuming that the alapana-kriti

boundary was detected properly, item classification was pursued.

Results using CFB features

Table 3.2 shows the confusion matrix for the frame-level classification using CFB

based feature. Table 3.3 shows the performance for the frame-level classification.

Table 3.2: Confusion matrix: Frame-level labelling

             kriti          alapana
kriti        1,64,11,759    7,77,804
alapana      13,16,925      56,87,274

Table 3.3: Performance: Frame-level labelling

             kriti     alapana
Precision    0.9257    0.8797
Recall       0.9548    0.8120
F-measure    0.9400    0.8445
Accuracy (overall)     0.9134

Table 3.4 shows the confusion matrix for the item classification using CFB based

feature. Table 3.5 shows the corresponding performance for the item classification.

Table 3.4: Confusion matrix: Item Classification

                    Without alapana   With alapana
Without alapana           155               12
With alapana               26              128


Table 3.5: Performance: Item Classification

             Without alapana   With alapana
Precision         0.8564           0.9143
Recall            0.9281           0.8312
F-measure         0.8908           0.8707
Accuracy (overall)       0.8816

Results using MFCC features

Table 3.6 shows the confusion matrix for the frame-level classification using MFCC

feature. Table 3.7 shows the performance for the frame-level classification.

Table 3.6: Confusion matrix: Frame-level labelling

             kriti          alapana
kriti        1,39,58,342    32,31,221
alapana      52,60,214      17,43,985

Table 3.7: Performance: Frame-level labelling

             kriti     alapana
Precision    0.7263    0.3505
Recall       0.8120    0.2490
F-measure    0.7668    0.2912
Accuracy (overall)     0.6490

Table 3.8 shows the confusion matrix for the item classification using MFCC

feature. Table 3.9 shows the corresponding performance for the item classification.

Table 3.8: Confusion matrix: Item Classification

                    Without alapana   With alapana
Without alapana           145               22
With alapana              115               39

3.3.3 Discussions

It can be observed that, using this approach, a frame-level labelling accuracy of

91.34% and an item classification accuracy of 88.16% have been achieved using the CFB


Table 3.9: Performance: Item Classification

             Without alapana   With alapana
Precision         0.5577           0.6393
Recall            0.8683           0.2532
F-measure         0.6792           0.3628
Accuracy (overall)       0.5732

Energy feature. In contrast, using the MFCC feature, a frame-level labelling accuracy

of 64.90% and an item classification accuracy of 57.32% were achieved. The accuracy with

MFCC is low due to the common frequency range assumed in the feature extraction

process. Also, some of the recordings are not clean, and in some cases the alapana

was very short. These factors have contributed to errors in classification.


CHAPTER 4

Segmentation of a kriti

4.1 Introduction

In Carnatic music, a kriti or composition typically comprises three segments, pallavi,

anupallavi and caranam, although in some cases there can be more segments due to

multiple caranam segments. While many artistes render only one caranam segment

(even if the composition has multiple caranam segments), some artistes do render

several or all of the caranams.

The pallavi in a composition in Carnatic music is akin to the chorus or refrain

in Western music albeit with a key difference; the pallavi (or part of it) can be

rendered with a number of variations in melody, without any change in the lyrics,

and is repeated after each segment of a composition. Segmentation and detection of

repeating chorus phrases in Western music is a well researched problem. A number

of techniques have been proposed to segment a Western music composition. While

these techniques have been attempted for Western music where the repetitions

have more or less static time-frequency melodic content, finding repetitions in

improvisational music is a difficult task. In Indian music, the melody content of

the repetitions varies significantly during repetitions within the same composition

due to the improvisations performed by the musician. A musician’s rendering of

a composition is considered rich, if (s)he is able to improvise and produce a large

number of melodic variants of the line while preserving the grammar, identity of

the composition and the raga. Further, the same composition when rendered by

different musicians can be sung in different tonics. Hence matching a repeating

pattern of a composition across recordings of various musicians requires a tonic-

independent approach.

Segmentation of compositions is important both from the perspective of lyrics

and melody. Pallavi, being the first segment, also plays a major role in presenting

a gist of the raga, which gets further elaborated in anupallavi and caranam. In the

pallavi, a musical theme is initiated with key phrases of the raga, developed a

little further in the anupallavi and further enlarged in the caranam, maintaining a

balanced sequence - one built upon the other. Similar stage-by-stage development

from lyrical aspect can also be observed. An idea takes form initially in the

pallavi, which is the central lyrical theme, further emphasised in the anupallavi and

substantiated in the caranam.

Let us illustrate this with an example, a kriti of Saint Tyagaraja. The cen-

tral theme of this composition is "why there is a screen between us". The lyrical

meaning of this kriti is as below:

Pallavi: Oh Lord, why this screen (between us)?

Anupallavi: Oh Lord of moving and non-moving forms, who has the sun and moon

as eyes, why this screen?

Caranam: Having searched my inner recess, I have directly perceived that every-

thing is You alone. I shall not even think in my mind of anyone other than You.

Therefore, please protect me. Oh Lord, why this screen?

The pallavi or a part of pallavi is repeated multiple times with improvisation for

the following reasons: 1) The central lyrical theme that gets expressed in the pallavi

is highlighted by repeating it multiple times, 2) the melodic aspects of the raga


and the creative aspects of the artiste (or the music school) jointly get expressed

by repetitions of pallavi. These improvisations in a given composition also stand

out as signatures to identify an artiste or the music school. Since pallavi serves

as a delimiter or separator between the various segments, locating the pallavi

repetitions also leads to knowledge of the number of segments in a composition

(>=3) as rendered by a certain performer.

A commonly observed characteristic of improvisation of pallavi (or a part of it)

is that for a given composition, a portion (typically half) of the repeating segment

will remain more or less constant in melodic content through out the composition

while the other portion varies from one repetition to another. For instance, if

the first half of the repeating segment remains constant in melody, the second half

varies during repetitions and vice-versa. This property is used to locate repetitions

of pallavi inspite of variations in melody from one repetition to another.

In this chapter, under Section 4.2, we will discuss the algorithm used to segment

a kriti. Then, under Section 4.3, we will present the results of our

experiments. We will conclude with discussions on our findings.

4.2 Segmentation Approach

4.2.1 Overview

The structure of a composition in Carnatic music is such that, the pallavi or part of

it gets repeated at the end of anupallavi and caranam segments. Hence our overall

approach is to use the pallavi or a part of it as a query to look for repetitions of

the query in the composition, and thereby segment the composition into pallavi,


anupallavi and caranam.

In our initial attempts, the query was first manually extracted from 75 popular

Carnatic music compositions. In 65 of these compositions, the lead artiste was a

vocalist accompanied by a violin and one or more percussion instruments while in

the remaining 10 compositions, an instrumentalist was the lead artiste accompa-

nied by one or more percussion instruments. The pallavi lines were converted to

time-frequency motifs. These motifs were then used to locate the repetitions of this

query in the composition. Cent-filterbank based features were used to obtain tonic

normalised features. Although the pallavi line of a composition can be improvised

in a number of different ways with variations in melody, the timbral characteristics

and some parts of the melodic characteristics of the pallavi query do have a match

across repetitions. The composition is set to a specific tala (rhythmic cycle), and

lines of a pallavi must preserve the beat structure. With these as the cues, given the

pallavi or a part of it as the query, an attempt was made to segment the composition.

The time-frequency motif was represented as a matrix of mean normalised cent

filterbank based features. Cent filterbank based energies and slope features were

extracted for the query and the entire composition. The correlation co-efficients

between the query and the composition were obtained while sliding the query

window across the composition. The locations of the peaks of correlation indicate

the locations of the pallavi. We also attempted to extract the query automatically

for all the compositions using the approach described in Section 4.3.2 and cross-checked

the query length with the manual approach.


4.2.2 Time Frequency Templates

The spectrogram is a popular time-frequency representation. The repeated line of

a pallavi is a trajectory in the time-frequency plane. Fig. 4.1 shows spectrograms

of the query and the matched and unmatched time-frequency segments of the

same length in a composition using linear filterbank energies. One can see some

similarity of structure between query and matched segments. Such a similarity

of structure is absent between query and un-matched segments. The frequency

range is set appropriately to occupy about 6 octaves for any musician. Although

the spectrogram does show some structure, the motifs corresponding to that of

the query are not evident. This is primarily because the motif is sung to a specific

tonic. Therefore the analysis of a concert also crucially depends on the tonic.

Figure 4.1: Time-frequency templates of music segments using the FFT spectrum (X axis: Time in frames, Y axis: Frequency in Hz; panels show the query, matched segments and unmatched segments).


The cent filterbank energies were computed for both the query and the compo-

sition. The time-dependent filterbank energies were then used as a query. Fig. 4.2

shows a time-frequency template of the query and some matched and unmatched

examples from the composition. A sliding window approach was used to deter-

mine the locations of the query in the composition. The locations at which the

correlation is maximum corresponds to matches with the query. Fig. 4.4 shows

a plot of the correlation as a function of time. The location of the peaks in the

correlation, as verified by a musician, correspond to the locations of the repeating

query.

Figure 4.2: Time-frequency templates of music segments using cent filterbank energies (X axis: Time in frames, Y axis: Filter index; panels show the query, matched and unmatched segments).

As mentioned earlier in Section 2.4.2, percussion strokes destroy the motif of the

melody, so cent filterbank slope features were also used as an alternative feature.

Fig. 4.3 shows a plot of the time-dependent query based on filter bank slope and


corresponding matched and unmatched segments in the composition. One can

observe that the motifs are significantly emphasised, while the effect of percussion

is almost absent.

Figure 4.3: Time-frequency templates of music segments using cent filterbank slope (X axis: Time in frames, Y axis: Filter index; panels show the query, matched and unmatched segments).

4.3 Experimental Results

The experiments were performed primarily on Carnatic music, though limited

experiments were done on other musical genres - Hindustani and Western music.

For Carnatic music, a database of 75 compositions by various artistes was used.

The database comprised compositions rendered by a lead vocalist or lead in-

strumentalist, the instruments being flute, violin and veena. The tonic information

was determined for each composition. Cent filterbank based energies and cent


filter bank based slope features were extracted for each of these compositions and

used for the experiments. For every 100 millisecond frame of the composition,

80 filters were uniformly placed across 6 octaves (the choice of number of filters

was experimentally arrived at to achieve the required resolution). The correlation

between the query and the moving windows of the composition was computed.

4.3.1 Finding Match with a Given Query

The query for each composition was extracted manually and the cent filterbank

based features were computed. Then Algorithm 1 was used for both CFB based

energy and slope features. Fig. 4.4 and Fig. 4.5 show correlation plots using CFB

energy and slope features for the composition janani ninnu vina. We can see that

the identified repeating patterns clearly stand out among the peaks due to higher

correlation. The spectrogram of the initial portion of the same composition with

the query and the matching sections is shown in Fig. 4.6.

Figure 4.4: Correlation as a function of time (cent filterbank energies), with the automatically chosen threshold and ground truth marked; the pallavi, anupallavi and caranam regions are indicated.


Figure 4.5: Correlation as a function of time (cent filterbank slope), with the threshold and ground truth marked; the pallavi, anupallavi and caranam regions are indicated.

Figure 4.6: Spectrogram of the query and matching segments as found by the algorithm.

The experiments were repeated with MFCC and with chroma features with and without

overlapping filters. For MFCC, 20 coefficients were extracted with 40 filters placed

in the frequency range 0 Hz to 8000 Hz. The chroma filter-banks [12] used for West-

ern classical music use non-overlapping filters as the scale is equi-temperament

and hence is characterised by a unique set of 12 semitones, subsets of which are

used in performances. Indian music pitches follow a just intonation rather than

an equi-temperament intonation [44]. Even just intonation is also not adequate

as shown in [28] because the pitch histograms across all ragas of Carnatic music

appear to be more or less continuous. To account for this, the chroma filter-banks

with a set of overlapping filters were experimented with, in addition to chroma filter-

banks without overlapping filters. The comparative performance of these four

features is tabulated in Table 4.1, and the correlation plots for one kriti using these

four features are included in Figures 4.7, 4.8, 4.9 and 4.10. It is evident that the CFB-

based feature outperforms the other three.

Figure 4.7: Query matching with Cent filterbank slope feature


Figure 4.8: Query matching with Chroma feature (no overlap)

Figure 4.9: Query matching with Chroma feature (with overlap)


Figure 4.10: Query matching with MFCC feature

Algorithm 1: Composition-Query Comparison
1: Extract CFCC Energy feature for the composition and for the query.

2: Using a sliding window approach, move across the composition in one frame steps.

3: Extract composition segments of length same as query at each step.

4: Compute the correlation between extracted composition segments and the query segment.

5: Locate the positions which give high correlation, which are the matches.

6: Repeat the above steps for CFCC slope feature also.
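A sketch of the sliding-window correlation in Algorithm 1 is given below (Python/NumPy). The feature matrices are frames-by-filters and mean-normalised as described earlier; the normalised two-dimensional correlation used here is an illustrative choice, since the exact correlation formula is not spelled out.

import numpy as np

def query_correlation(comp_feats, query_feats):
    """Correlation between a query template and every same-length window of the composition.
    comp_feats: (N, F) composition features; query_feats: (M, F) query features, M <= N."""
    m = len(query_feats)
    q = query_feats - query_feats.mean()
    qnorm = np.linalg.norm(q)
    corr = np.zeros(len(comp_feats) - m + 1)
    for i in range(len(corr)):
        win = comp_feats[i:i + m]
        w = win - win.mean()
        denom = np.linalg.norm(w) * qnorm
        corr[i] = np.sum(w * q) / denom if denom > 0 else 0.0   # normalised 2-D correlation
    return corr

# Peaks of corr above a threshold indicate candidate repetitions of the pallavi query.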

4.3.2 Automatic Query Detection

It is possible to find the query automatically if the query occurs at the beginning

of the composition. This is indeed true for Carnatic music, as the composition

rendering starts with the pallavi. The approach mentioned in Algorithm 2 was

used for the automatic query detection.

As mentioned in the algorithm, the correlations of the composition with pro-


gressively increasing query lengths can be seen in Fig. 4.11 and Fig. 4.12. The

product of these correlation values is computed at each time instance within the

composition and the result is plotted in Fig. 4.13. As we can see, the unwanted

spurious peaks have all been smoothed out resulting in clear identification of the

actual query length.

Figure 4.11: Intermediate output (I) of the automatic query detection algorithm using the slope feature (correlation vs. time in seconds for query lengths of 0.5, 1 and 1.5 seconds).

Algorithm 2: Automatic Query Detection
1: Extract CFCC Energy feature for the composition.
2: Extract segments of varying lengths of 0.5 to 3 seconds (50 to 300 frames) in steps of 0.5 seconds (50 frames).
3: For each of these segments, considering this as the query, calculate the correlation as in Algorithm 1.
4: Multiply the above computed correlations corresponding to each frame.
5: Look for the first significant peak, which corresponds to the query length.
6: Repeat the above steps for CFCC Slope feature also.
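A minimal sketch of Algorithm 2, reusing the hypothetical query_correlation function from the previous sketch, is given below; the prominence-based peak picking at the end is a simplified stand-in for locating the first significant peak.

import numpy as np
from scipy.signal import find_peaks

def detect_query_length(comp_feats, lengths=(50, 100, 150, 200, 250, 300)):
    """Multiply correlation curves obtained with composition prefixes of increasing length
    (0.5 s steps at a 10 ms hop). The first significant peak of the product indicates the
    query (pallavi line) length in frames."""
    max_len = max(lengths)
    product = None
    for m in lengths:
        corr = query_correlation(comp_feats, comp_feats[:m])     # prefix of the composition as query
        corr = corr[:len(comp_feats) - max_len + 1]               # align all curves to a common length
        product = corr if product is None else product * corr
    peaks, _ = find_peaks(product, prominence=np.std(product))
    return peaks[0] if len(peaks) > 0 else None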

It was observed that for all the 75 compositions, the durations of their queries

calculated using the automatic method matched closely with the actual query

length, thereby producing similar segmentation results. A subset of the results is

tabulated in Table 4.2.


Figure 4.12: Intermediate output (II) of the automatic query detection algorithm using the slope feature (correlation vs. time in seconds for query lengths of 2, 2.5 and 3 seconds).

Figure 4.13: Final output of the automatic query detection algorithm using the slope feature (product of correlations across the various query lengths vs. time in seconds; the detected query is marked).


Table 4.1: Comparison between various features

Feature Name              Total Songs   Successfully Segmented   % Success
CFCC                           75                  35               46.67
MFCC                           75                  23               30.67
Chroma with overlap            75                  17               22.67
Chroma without overlap         75                   8               10.67

Table 4.2: Manual vs automatic query extraction (CFB Energy: cent filter bank cepstrum, CFB Slope: cent filterbank energy slope). Time is given in seconds.

Composition   Manual   CFB Energy   CFB Slope
S1             4.90       5.09         5.08
S2            12.07      12.52        12.52
S3             6.03       6.10         6.11
S4            11.84       9.75        11.73
S5             8.71       8.79         8.76
S6             5.58       5.75         5.80
S7             8.50       8.59         8.59
S8             4.65       4.86         4.85
S9            11.79      11.73        11.70
S10           12.84      12.92        12.85

4.3.3 Domain knowledge based improvements

Out of the 75 compositions, 35 were correctly segmented into pallavi,

anupallavi and caranam using the full query length, when compared with the ground

truth marked by a musician. The segment lengths were incorrectly detected in the

remaining compositions primarily due to the following reasons:

1. False negatives: It is quite possible that, while repeating the pallavi (query) with melodic variations, the artiste may sometimes repeat only a part of the query. In such cases, the correlation of the partial match will be low. Also, the melodic content may often vary drastically during repetitions, leading to low correlation with the reference query.

2. False positives: Some portions of a composition (such as the anupallavi or caranam) may have melodic content similar to that of the pallavi query, though the lyrics of these portions can be entirely different. These portions result in higher correlation due to melodic similarity.


In order to address the false negative results, we further experimented with

half the length of those queries. In Carnatic music, though the pallavi is repeated

with various melodic variations, usually either the 1st half or the 2nd half of the

query remains static in melodic content. In other words, if the 1st half of the query

undergoes melodic variation during repetitions, the 2nd half remains melodically

invariant and vice versa. Taking this as a cue, we experimented by considering the

1st half and 2nd half of the original query as the new queries.

Figure 4.14: Correlation for the full query vs. the half query (top: full query, bottom: half query; correlation, threshold and ground truth are marked).

The results showed that, using one of the two half-length queries, better cor-

relation of matched segments was obtained, thereby increasing our segmentation

success from 35 to 48 compositions, a 64% success rate. Fig. 4.14 shows

the correlation plot of a composition with the full query and the corresponding

half query.

To address the false positives, we used the rhythm cycle. If the query length is


L seconds, the repetitions should ideally occur at nL (n = 1,2,3..N), with margins

for human errors in maintaining the rhythm. Any instances of elevated correlation

that are not around nL are not likely to be repetition of the query and hence can

be discarded. Fig. 4.15 shows the correlation plot using this approach. As we can

see, only the false positive peaks have been discarded, resulting in only the true

positive peaks. Using this approach our segmentation success increased from 48

to 50 compositions resulting in 66.66% as the overall success rate. However, this

approach will be effective only when the artiste maintains the rhythm cycle more

or less accurately.

Figure 4.15: False-positive elimination using rhythmic cycle information (top: normal correlation, bottom: correlation considering cycle information; correlation, threshold and ground truth are marked).
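The rhythm-cycle filtering described above can be sketched as follows (Python/NumPy); the tolerance parameter is an assumption standing in for the margin allowed for human error in maintaining the rhythm.

import numpy as np

def filter_peaks_by_cycle(peak_times, query_len_s, tolerance_s=1.0):
    """Keep only correlation peaks that fall near an integer multiple of the query length L.
    peak_times: peak positions in seconds; query_len_s: L in seconds."""
    kept = []
    for t in peak_times:
        n = max(1, round(t / query_len_s))            # nearest multiple of L
        if abs(t - n * query_len_s) <= tolerance_s:   # within the allowed margin for human error
            kept.append(t)
    return np.array(kept)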

We repeated the experiment on a small sample of Hindustani and Western

music compositions, 5 compositions in each genre. Since these genres of music

do not have segments similar to pallavi, anupallavi and caranam, we restricted our

experiments to locating the repetitions of a query segment. In the case of Western

music, we were able to locate all the repetitions corresponding to the query. This


is because of little melodic variation between repetitions in Western music. In

the case of Hindustani music, we were able to identify all the repetitions with a

false positive rate of 33%. Fig. 4.16 shows the correlation plot of a Hindustani

composition and a Western music composition.

[Figure omitted: two plots of correlation vs. time (in seconds), panels "Hindustani" and "Western", each showing the correlation curve, the detection threshold and the ground truth.]

Figure 4.16: Repeating pattern recognition in other genres

4.3.4 Repetition detection in an RTP

Ragam Tanam Pallavi (RTP) is a form of singing unique to Carnatic music that allows the musicians to improvise to a great extent. It incorporates raga alapana, tanam, niraval, and kalpanasvara and may be followed by a tani avartanam.

Unlike a regular kriti that has pallavi, anupallavi and caranam segments, an RTP has only a pallavi. Also, the pallavi is composed such that it occupies only one tala cycle. After elaborate alapana and tanam renditions, the pallavi is taken up for melodic improvisations. This is followed by a rhythmic exercise called tri kala tisram, during which the pallavi line is sung at various speeds. First the pallavi is sung at normal tempo two times (thus occupying 2 cycles), then it is sung once at half the speed, followed by the same pallavi rendered three times in tisra nadai (the equivalent of an eighth/sixteenth-note triplet in Western music) occupying the same 2 cycles, and finally by the pallavi rendered four times at double the original tempo.

We tried to apply our query matching technique on this rhythmic exercise to locate all the repetitions of the query, which is the pallavi line sung at normal speed. While matching the query at normal speed is similar to how we did it for kritis, matching repetitions sung at different tempos required tweaking the earlier approach. Our approach to repetition matching for the various tempos is listed below, followed by a short sketch of the sample-dropping step:

• In the case of half the original tempo, we dropped every alternate sample in the repetition, so that the durations of the query and the repetition became almost equal. Then we used the sliding window technique to locate the repetition.

• For matching the repetitions rendered at double the speed, we dropped every alternate sample of the query, thus shortening the length of the query by half. Using this shortened query as the new query, the sliding window technique was applied.

• For matching the repetitions rendered in tisram, one out of every three samples of the query was dropped, thus shortening the length of the query to 2/3 of its original length. Using this shortened query as the new query, the sliding window technique was again applied to locate the matches.
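The sample-dropping step referred to in the list above can be sketched as follows. The decimation is shown on the t-f feature frames, the function names are ours, and the exact indices dropped are an illustrative reading of the approach rather than the precise implementation.

import numpy as np

def drop_every_other(frames):
    # Keep every alternate frame (simple 2:1 decimation).
    return frames[::2]

def drop_one_in_three(frames):
    # Drop one frame out of every three, keeping 2/3 of the original length.
    return np.delete(frames, np.arange(2, frames.shape[0], 3), axis=0)

def tempo_matched_pairs(query, audio):
    # Prepare (query, search-audio) pairs for the non-normal tempos of the
    # tri kala tisram exercise, to be fed to the usual sliding-window matcher:
    #  - half tempo  : decimate the audio so the repetition shrinks to query length
    #  - double tempo: decimate the query, halving its length
    #  - tisram      : shorten the query to 2/3 of its length
    return {
        "half": (query, drop_every_other(audio)),
        "double": (drop_every_other(query), audio),
        "tisram": (drop_one_in_three(query), audio),
    }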

The experiments were carried out on 5 RTPs. It was observed that while query matching worked well at normal tempo, the approach of dropping frames for the other tempos was not always successful. This is perhaps because, when a motif is slowed down, the gamakas responsible for the melody are not, in reality, slowed down uniformly and linearly by the performer. A possible solution for properly slowing down motifs and the associated gamakas is based on an approach being explored by V. Viraraghavan (whose related paper is awaiting acceptance), by which the tempo of transients (stationary points) is preserved but flat notes and silences are slowed down (see footnote 1). The query matching plots for the various tempos for one of the RTPs are given below:

Figure 4.17: Normal tempo

Figure 4.18: Half the original tempo

1. https://www.iitm.ac.in/donlab/pctestmusic/index.html?owner=venkat&testid=test1&testcount=6


Figure 4.19: tisram tempo

Figure 4.20: Double the original tempo


4.3.5 Discussions

For those Carnatic music compositions where segmentation results are accurate, it is possible to automatically match the audio segments with the pallavi, anupallavi and caranam lyrics of the composition by looking up a lyrics database. This can enhance the listening experience by displaying the lyrics as captions when the composition is played. We were able to automatically generate captions as SRT (SubRip Text) files for each composition using automatic segmentation and lyrics database lookup.
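As an illustration of this step, the snippet below writes an SRT file from a list of detected segment boundaries and the corresponding lyric lines; the segment list and lyric strings are placeholders standing in for the output of our segmentation and the lyrics database lookup.

def srt_timestamp(seconds):
    # Format seconds as an SRT timestamp HH:MM:SS,mmm.
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return "{:02d}:{:02d}:{:02d},{:03d}".format(h, m, s, ms)

def write_srt(segments, path):
    # segments: list of (start_sec, end_sec, lyrics) tuples, e.g. one tuple
    # each for the pallavi, anupallavi and caranam of a composition.
    with open(path, "w", encoding="utf-8") as f:
        for idx, (start, end, text) in enumerate(segments, start=1):
            f.write("{}\n{} --> {}\n{}\n\n".format(
                idx, srt_timestamp(start), srt_timestamp(end), text))

# Hypothetical usage with boundaries in seconds:
# write_srt([(12.0, 58.5, "pallavi lyrics"), (58.5, 131.0, "anupallavi lyrics")],
#           "kriti.srt")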

It was observed that the greater the melodic variation across repetitions, the lower the accuracy in segmenting the kriti. While the half-query method increased the accuracy, it can be computationally intensive to correlate using the full query and then using both half queries before choosing the right approach. Cent filter bank features performed better than MFCC due to tonic normalisation. Though chroma is tonic normalised, the folding of octaves distorted the t-f templates of the motifs and hence its accuracy was lower than that of CFB based features. Based on the limited experiments done with other genres such as Western music and Hindustani music, we can say that identification of repetitions using t-f templates can be extended to other genres.


CHAPTER 5

Conclusion

5.1 Summary of work done

Automatic segmentation of Carnatic music items is required for automatic archival of a concert for music information retrieval tasks. At present, there is no automated way to seek directly to a given segment within an item. In this thesis, an attempt was made to segment an item into its various segments automatically. A concert is made up of a number of items. Each item can be segmented into alapana and kriti, and the kriti can be further segmented into pallavi, anupallavi and caranam.

The key acoustic differentiator between alapana and kriti is that the alapana is rendered as pure melody without percussion support, while the kriti is rendered with accompanying percussion instruments. In this thesis, the boundary between alapana and kriti was located using this key differentiator. Our approach was to extract a segment of feature vectors from the item and try to label the segment as a whole. As this approach was computationally intense, to further improve the efficiency of this process we had to reduce the search space for the boundary. We used the KL2 distance metric to look for timbre similarity between windows. The spectrum of a frame of audio was computed and converted to a PDF. The KL2 distance was computed between adjacent frames to determine the divergence between two adjacent spectra. Larger values of the KL2 distance denoted a large change in distribution. A threshold was automatically chosen to identify the possible boundaries. Then, using domain knowledge, these boundaries were agglomerated to identify the actual boundary between alapana and kriti.
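A minimal sketch of this computation is given below, assuming frame-wise magnitude spectra are already available; the smoothing constant and the mean-plus-one-standard-deviation thresholding rule are illustrative assumptions rather than the exact settings used in this work.

import numpy as np

def kl2_adjacent(spectra, eps=1e-10):
    # Symmetric KL (KL2) divergence between adjacent frame spectra.
    # spectra: 2-D array (frames, bins); each spectrum is normalised to a PDF.
    p = spectra + eps
    p = p / p.sum(axis=1, keepdims=True)
    a, b = p[:-1], p[1:]
    kl_ab = np.sum(a * np.log(a / b), axis=1)
    kl_ba = np.sum(b * np.log(b / a), axis=1)
    return kl_ab + kl_ba        # larger values => larger timbral change

def candidate_boundaries(spectra):
    # Frames whose KL2 distance to the next frame exceeds an automatically
    # chosen threshold (here: mean + one standard deviation, an assumption).
    d = kl2_adjacent(spectra)
    return np.where(d > d.mean() + d.std())[0]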

The kriti was further segmented into pallavi, anupallavi and caranam. Our approach to segmenting a kriti was to use the repetition of the pallavi as the boundary. In any kriti, the pallavi gets repeated after the anupallavi and again after the caranam. By using the pallavi or part of the pallavi as a query, we attempted to look for the repetitions and hence the segment boundaries. In our initial attempts, the query was first manually extracted from 75 popular Carnatic music kritis. We then used time-frequency motifs to locate the repetitions of this query in the kriti. Cent-filterbank based features were used to obtain tonic normalised features. Although the pallavi line of a kriti can be improvised in a number of different ways with variations in melody, the timbral characteristics and some parts of the melodic characteristics of the pallavi query do match across repetitions. The composition is set to a specific tala (rhythmic cycle), and the lines of a pallavi must preserve the beat structure. With these as the cues, given the pallavi or a part of it as the query, an attempt was made to segment the composition. CFB based energies and slope features were extracted for the query and the entire composition. The correlation coefficients between the query and the kriti were obtained while sliding the query window across the kriti. The locations of the peaks of correlation indicate the locations of the pallavi.
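For concreteness, a sketch of tonic-normalised cent-filterbank energies and their slope is given below. It assumes a frame's magnitude spectrum and the tonic are known; the number of filters, the cent range and the rectangular band pooling are illustrative assumptions and not the exact filter design used here.

import numpy as np

def cfb_energies(mag_spectrum, freqs_hz, tonic_hz, n_filters=80, cent_range=(0.0, 4800.0)):
    # Map spectral bins to cents relative to the tonic and pool their power
    # into equal-width cent bands; returns log band energies for one frame.
    valid = freqs_hz > 0
    cents = 1200.0 * np.log2(freqs_hz[valid] / tonic_hz)
    power = mag_spectrum[valid] ** 2
    edges = np.linspace(cent_range[0], cent_range[1], n_filters + 1)
    energies = np.zeros(n_filters)
    for i in range(n_filters):
        in_band = (cents >= edges[i]) & (cents < edges[i + 1])
        energies[i] = power[in_band].sum()
    return np.log(energies + 1e-10)

def slope_features(cfb_frames):
    # First-order difference of the CFB energies across frames ("slope").
    return np.diff(cfb_frames, axis=0)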

Next, we embarked on identifying the query automatically. It is possible to find the query automatically if the query occurs at the beginning of the composition. This is indeed true for Carnatic music, as the rendering of the composition starts with the pallavi. It was observed that for all 75 compositions, the durations of the queries calculated using the automatic method matched closely with the actual query lengths, thereby producing similar segmentation results.


We compared the performance of CFB features with that of MFCC and chroma features. It was found that CFB features outperformed MFCC and chroma features, justifying our choice of features.

5.2 Criticism of the work

Currently, for segmenting a kriti, the threshold setting for peak picking is done manually. This manual intervention should be addressed in future work.

The automatic query detection algorithm works only if the query is exactly at the start of the item. It is possible that there is some gap or drone sound at the beginning; in such cases automatic query detection will fail. This has to be addressed.

Currently, locating the boundary between alapana and kriti is computationally intensive. A trade-off has to be made between accuracy and performance.

Pallavi repetition detection in RTPs did not give consistent results for slower and faster tempos of rendition of the pallavi line. Dropping samples to increase the speed of the audio appears to spoil the motif and the embedded gamakas, and hence alternative approaches that preserve the shape of the gamaka need to be explored.

5.3 Future work

In this work, we have presented a novel approach to composition segmentation using cent filterbank based features. This approach is particularly suited to Carnatic music as compared to other fingerprinting algorithms. In Indian music, the same composition can be sung in different tonics. Further, a number of different variants of the pallavi can be sung. This can vary from musician to musician and with the position of the composition in the concert. A large number of variations is an indication that the musician has chosen that particular composition for a more detailed exploration.

This segmentation of the composition into pallavi, anupallavi and caranam using the repeating pallavi line can be extended to locate repeating niraval patterns and kalpana-svara portions of an item.

Knowledge of intra-kriti segment boundary locations can also be used to look up a lyrics database and display the lyrics of those segments as captions when the music is played. Since lyrics play a pivotal role in Carnatic music, such an effort will enhance the pleasure of listening to the music.

While we have identified the boundary between the alapana and the kriti, it is possible to take up identification of the boundary between the vocal alapana and the violin alapana to further sub-segment the alapana segment.

The main musical item of a concert usually features a thaniavarthanam, during which only the percussion instruments play pure rhythmic patterns. Future work can take up identification of the thaniavarthanam portion in a main item. This will be of immense value to students of percussion music and rhythm enthusiasts.

