Automatic Music Transcription using Autoregressive Frequency Estimation

14 June 2001

Fredrik Hekland
NTNU, Norwegian University of Science and Technology
[email protected]

Under the direction of:
Corinne Mailhes (ENSEEIHT)
David Bonacci (ENSEEIHT)

ENSEEIHT, 2 rue Charles Camichel, BP 7122, 31071 Toulouse Cedex 7
Télécommunications Spatiales et Aéronautiques (TéSA), 17 bis, rue Paul Riquet, F-31000 Toulouse, France



Preface

Where and why

With help from the Erasmus student exchange programme, I had the opportunity to spend my fourth year as an engineering student abroad. I chose France, since I wanted to learn to speak French and to see parts of Europe I hadn't seen before. I found ENSEEIHT in Toulouse, which offered signal processing, and luckily they accepted my application. Since only the fifth year at this school offered enough courses within signal processing, I followed the option "Traitement du Signal et des Images" even though I was one year short. This also meant that a "Stage", a four-month final project, was ahead of me.

I was kindly given the chance to do the Stage at TéSA, a research laboratory at the school. Some possible subjects were presented to me, and after some advice from my responsible professor at NTNU I chose the subject regarding music transcription.

TéSA (Télécommunications Spatiales et Aéronautiques) is a newly created research laboratory, a collaboration between several schools and companies. The lab is well equipped with both hardware and software, and has a good library, which made my literature review easy.

The work

In the beginning I spent a two-week period doing a literature review, trying to find previous work in the field and all the necessary background information. A lot of articles were found with help from google.com and citeseer.nj.nec.com, while the library contained most of the IEEE publications. After having gained an overview of the problems, a longer period of testing was conducted. Both synthetic signals and samples from real instruments were used. The different frequency estimators and model order criteria were explored, and it was decided to use the Modified Covariance Method coupled with a simple AIC/MDL order selection criterion for the transcriber. A working monophonic transcriber was built, and some of the findings in the project may permit an extension to polyphonic operation.

Software

Matlab 5.2 and 5.3 were used for all coding purposes and most of the analysis work, Spectrogram 6.0.4 helped analyse the instrument samples, and Yamaha's free wave editor was indispensable for mixing and manipulating the necessary wav files.

The Matlab files referred to in the text are not rugged enough to be used for any serious transcribing. For that reason, the code is not available to the public.

Personal outcome

Even though my work is not exactly ground-breaking, I have gained much personally, especially concerning the process of doing research work and writing a report. I now know better how to proceed, what to do along the way, and certainly some of the pitfalls to avoid.

Within both Matlab programming and signal processing I have seen great development in my skills, and within parametric modelling I have gained a deeper understanding.

Finally, it has been interesting to see what life in a laboratory is like, and to observe the cultural differences and similarities between Norway and France. At last, it must be mentioned that I have learned a lot of French: starting from zero before my arrival in France, I am now able to communicate without too much trouble at a basic level. I hope I'll be able to maintain and improve the language in the future.


Acknowledgements

I would like to thank the director of the laboratory, Prof. Francis Castanié, and the person responsible for "Traitement du Signal et des Images", Dr. Corinne Mailhes, for giving me the opportunity to do this work in the labs.

Mme Mailhes was also the supervisor of my Stage, and David Bonacci, being the person working on the subject closest to mine, was my most important advisor, offering good ideas and tips. Both deserve thanks.

The guys at "Bureau treize", for having received me well, and for accepting me even though my French is at best confusing, and at times incomprehensible. It's a pity that I couldn't take part in your antics; I had a good time with you all the same.

Thanks to all the other people at the lab for helping me out and being nice to me.

Lots of moral support and positive words from Tonje helped me when I needed it most. I love you.


Abstract

This project studies the use of Principal Component AR frequency estimation in automatic music transcription and discusses some of the problems arising when using AR models, among them model order selection. Some comparisons to classical Fourier methods are made.

A well-functioning monophonic transcriber using the Modified Covariance Method as pitch estimator is implemented in Matlab, and some suggestions for further work are given.


Table of Contents

Preface
Acknowledgements
Abstract
Chapter 1 – Initial theoretical studies
    Introduction
        Presentation of the problem
        The goal of this project
    Literature review
        Papers dealing with musical transcription
        Commercial or free transcription programs available
    A closer look at the challenges
        Common features of instruments
        Problems encountered when estimating pitch
        Other problems related to transcription
        MIDI file format
    Pitch estimators
        Cross-correlation between signal and sinusoids
        Filter banks
        Fourier-based methods
        Autoregressive methods
        Model order selection
            Estimation of the Prediction Error Power
            Order estimation methods based on Singular Values or noise subspace estimation
Chapter 2 – Implementation of a music transcriber
    Initial testing on real and synthetic musical signals
        Fourier methods, real instruments
        Yule-Walker, synthetic signals
        Modified covariance method, synthetic signals
    Implementation of the transcriber
        The structure of the transcriber
        Transcribing real music
            Flute5.wav – a simple flute solo
            Flute00.wav – a quicker flute solo
            Some different sound files
    Some final words
        Limitations of the transcriber
        Improvements for the transcriber and ideas for future work
        Conclusion
Appendix A – Converting between Note, MIDI and Frequency
Appendix B – Matlab code for the transcriber
References


Chapter 1 – Initial theoretical studies

Introduction

Presentation of the problem

Music can be represented in many different ways, ranging from simple monophonic PCM files with low sample rates to highly symbolic multitrack representations containing control information for various musical parameters. Every abstraction level has its use, and being able to easily switch to another representation would certainly be useful. The most obvious application is to aid musicians or composers in writing music by playing the actual instrument rather than writing notes or playing on a keyboard. This could also extend to the analysis of existing music where nothing but the sound recordings exists. An application that could be of greater commercial interest is the possibility to search for music on the Internet by simply whistling the tune, so-called content-based retrieval or query by audio [McNab], [Vercoe97]. One could also think of a scheme that tracked all radio stations for a particular music style. Areas which would certainly appreciate perfect transcription are the coming standards MPEG-4 for structured audio [Vercoe97], and MPEG-7 for description of audio-visual content.

While the process of converting music from a symbolic format to a waveform representation (synthesis) has evolved over the years, and now gives a fairly realistic sound for a reasonable price, the opposite process (analysis or transcription) is far from ready for commercialisation. There exist several monophonic solutions that are reported to work well in real time, but as soon as we want to analyse polyphonic music our options are effectively reduced to zero. A quick search on the Internet found some shareware programs claiming to perform transcription (see table 1), but trials showed these programs to perform rather poorly even after substantial parameter adjustments.

Speech recognition is very similar to the problem of musical transcription, but while the former has experienced much interest and successful applications during the last three decades, research on music recognition has mainly been done by a few individuals with special interests in the subject. One reason for this is the apparent lack of commercial applications. Another is the complexity of the problem compared to speech recognition. While speech is limited to frequencies between 50 Hz and 4 kHz and the sources all have similar characteristics, musical frequencies range from 20 Hz to 20 kHz and there are many different instrument models. However, the main problem is the fact that western music is constructed upon harmonic relations (i.e. different instruments playing frequencies that are in fractional relation), which gives rise to spectral overlapping and possibly complete masking of certain notes. When we think of a symphonic orchestra with many musicians playing simultaneously, the task of separating and recognising each one of them from a simple two-channel recording seems (and might prove to be) impossible.

Currently most systems work in a bottom-up fashion, that is, all decisions are based upon the frequency and segmentation information one obtains from the music recording. This works satisfactorily for monophonic music where the notes are well separated in time and frequency. But these systems are unable to correct even the most obvious errors, since they don't possess any knowledge of compositional styles (rules). This problem is addressed by putting a top-down engine alongside the normal bottom-up recognising engine, for example letting the "knowledge-loaded" top-down engine monitor the transcription process and intervene when it disagrees with the estimations found. The common way to implement such systems is the so-called Blackboard system, whose name stems from the idea that several 'experts', each having knowledge of a certain parameter, are 'standing in front of a blackboard' and solving the given problem together. These systems are very flexible, since it is easy to add experts and the system can be driven in a bottom-up correcting fashion or in a top-down predicting fashion.


The goal of this project

Given the restricted time and lack of experience, the goal of this project was to explore the problems related to transcription and review the existing solutions, and thereby try to implement a monophonic transcriber that works better than the existing affordable programs. Instruments that are non-harmonic in nature, such as percussion instruments, are left out, while any kind of harmonic instrument is targeted. In fact, independence of instrument was wanted, and additionally a possible later extension to polyphonic recognition was desired. It was therefore decided to try other ways to do frequency estimation and possibly obtain higher resolution and more precise estimates than what is realisable using standard Fourier methods. A natural choice was to investigate the different parametric methods available, their advantages and disadvantages, and the well-known problem of model order selection. A very limited top-down based correcting system is implemented, so as to improve the transcription process in the absence of a segmentation system. As output format the standard MIDI file format was chosen, because of its widespread use and its relative simplicity. All the programs were to be built in Matlab, since it provides most of the needed routines and enables rapid development.


Literature review

Papers dealing with musical transcription

Only one article on the use of parametric methods in music recognition was found [Schro00]. Here, three algorithms are discussed and analysed with respect to relative frequency precision. It was concluded that the Modified Covariance Method (MODCOVAR) was superior to both the standard Maximum Entropy Method (Yule-Walker) and Prony Spectral Line Estimation. Additionally, the two former methods give us the relative size of the spectral peaks for free. The "Modcovar" method was applied to a short piano recording using a fixed order of 20, showing a promising result. No order estimation was discussed.

Another and more elaborate work is the master's thesis of Anssi Klapuri [Klapuri98], where some of the problems related to automatic transcription are discussed and a system trying to resolve the problem of harmonic overlapping is described. The shortcomings of the purely bottom-up approach when it comes to polyphonic music, and the necessity to employ a top-down "knowledge system", are discussed. The thesis treats very general problems and their possible solutions, and different techniques for extracting information are presented. Klapuri has also released several papers related to this thesis, more directed towards a specific implementation; [Klapuri01A], [Klapuri01B]. Even more can be found at <http://www.cs.tut.fi/sgn/arg/publications.html> and <http://www.cs.tut.fi/~klap/iiro/>

Top-down systems are probably the most promising approach for polyphonic music recognition, and despite the fact that such systems will need a greater understanding of human musical perception and a huge amount of musical knowledge at their hands, some systems with limited knowledge have been implemented and show improvements over the usual bottom-up systems. Examples of an 'expert' in a top-down system are the probability of transition between different chords, and rules for which notes can be played against which chord. See [Kashino98] and [Martin96].

Two other areas of importance, yet untreated in this project, are segmentation and instrument recognition. Segmentation of events (i.e. notes and pauses) gives us the possibility to respect the duration of each note, and to avoid including two notes in one frequency analysis frame, so as to obtain better frequency estimates. The master's thesis of T. Jehan deals with this problem [Jehan97], and proposes methods both in the time domain and in the frequency domain. Especially the third chapter, using changes in the AR model as a basis for segmentation, could be an interesting future addition to this project. Recognition of instruments gives us the opportunity to improve the process even more by using instrument models in the frequency analysis, and by automatically setting correct instruments in the output file. Methods seen tested are various kinds of neural networks working on cepstral coefficients or on features from log-lag correlograms, see [Brown99] and [Martin98].


Commercial or free transcription programs available

Name and URL                                       Technology
WIDI Recognition System 2.6                        FFT based; polyphonic
  http://www.midi.ru/w2m
AudioToMidi 1.01                                   Unsure; possibly cross-correlation
  http://www.midi.ru/AudioToMidi/                  with synthetic sinusoids; polyphonic
AmazingMIDI 1.60                                   Unsure; single instrument, polyphonic
  http://www.pluto.dti.ne.jp/~araki/amazingmidi/
WAV2MID 1.5a                                       Unknown; monophonic
  http://www.audioworks.com
DigitalEar 3.0                                     Unknown; monophonic
  http://www.digital-ear.com/index2.html

Table 1

As said in the introduction, none of these programs offer fully automatic operation and independence of instrument, and even after many parameter adjustments, none of the polyphonic ones give an impressive result even on monophonic music. A more comprehensive list of existing programs can be found at: <http://www.s-line.de/homepages/gerd_castan/compmus/audio2midi_e.html>

The two monophonic transcribers perform quite well. Digital Ear was difficult to test, because the demo version was very restricted. From what could be tested, it seemed a bit less powerful than both the project transcriber and the AudioWorks transcriber.

The AudioWorks transcriber performed nearly perfectly, with a minimum of parameter setting and excellent speed. Testing the program with the same wav files as in the project showed a performance similar to what was obtained in this project, with fewer parameters. A small bug in the program caused the instrument in the MIDI file to always be a grand piano.


A closer look at the challenges

Common features of instruments

Apart from percussion instruments, the creation of sound in most instruments is based on generating standing waves on strings or in hollow tubes of some geometry. Since the length of the string or tube is normally fixed for a given note, the possible wavelengths are constrained to fulfil the wave equation for the given length, geometry and speed of sound:

\frac{\partial^2 u(x,t)}{\partial x^2} - \frac{1}{v^2}\,\frac{\partial^2 u(x,t)}{\partial t^2} = 0    (1)

Usually strings are connected at both ends, while tubes can be open at one or both ends, with either circular or conical inner geometry. These constraints only allow certain modes to operate in the resonator, and for strings and circular open tubes of length L the possible solutions are of the form

A_k \sin\!\left(\frac{2\pi k v}{2L}\, t\right), \quad k \in \mathbb{N}    (2)

eventually as a sum of cosines if the tube is circular and half open. In the conical case, the solution consists of spherical harmonics. A more elaborate explanation can be found at the website [WolfeA].

This gives rise to a harmonic spectrum, with all over-harmonics being an integer multiple of the fundamental frequency. Figure 1 shows a typical frequency spectrum, with the fundamental frequency at 392 Hz and all the over-harmonics being 2 to 12 times the fundamental frequency. Another typical feature is that most of the energy is located in the lower harmonics, leading to weaker over-harmonics. For example, in piano sounds more than 90% of the energy is contained in the fundamental [Schro00]. In western musical notation, it is the frequency of the fundamental of the harmonic series, also called the pitch¹, that names the note. The relation between the notes is given in (3), with 440 Hz being the so-called chamber tone (A4), and k ranging from -48 to 39 for a standard piano. To convert between frequency, note name and MIDI number, see table 4, appendix A.

f = 440\,\mathrm{Hz} \cdot 2^{k/12}, \quad k \in \mathbb{Z}    (3)

¹ The term pitch is not well defined, and at least three different meanings exist: a) production pitch: the rate at which the excitation device opens and closes (e.g. the glottal closure rate in human speech); b) mathematical pitch: the fundamental of the harmonic series; c) perceptual pitch: the actual frequency perceived by the listener. In this project, definition b) is used, since it is directly connected to the searched notes.
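As a small aside on (3) and the conversion table in appendix A: since A4 = 440 Hz corresponds to MIDI note 69 (a standard convention, not stated in the text), frequency and MIDI number are related by a base-2 logarithm. A minimal Matlab sketch:

    % Frequency <-> MIDI number: A4 = 440 Hz is MIDI 69, one semitone = 2^(1/12).
    f    = 392;                          % e.g. the G4 of figure 1
    midi = round(69 + 12*log2(f/440))    % -> 67, i.e. G4
    fq   = 440 * 2^((midi - 69)/12)      % back to Hz -> 392.0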

Figure 1: A G4 note from a flute – magnitude [dB] versus frequency [Hz] (0-5000 Hz), fundamental at 392 Hz.

Problems encountered when estimating pitch

It is clear that reliable pitch estimation is essential to successfully do the transcription, and the relative frequency error must not exceed 2.5% if we are to avoid picking the note a semitone away from the real note. Further, we want to be able to resolve several instruments playing simultaneously, so we must find as many harmonics as possible, where the harmonics can be closely spaced or even overlapping. Instruments without pitch that have noise-like spectra (percussion/drums) can obviously not be identified by tracking the pitch, and are not treated in this project either. Another problem is irregularities in the harmonic spectrum. A classic example is the clarinet, where only the odd harmonics are present, because of the constraints the circular half-open resonator imposes on the wave equation.

Guitars can sometimes play without the first harmonic present, or at least very weak. This leads to the phenomenon of virtual pitch, where the human brain deduces the missing fundamental from the distance between the harmonics [Terhardt]. (It is this concept that is used in some headphones to obtain a lower frequency response than what is possible with the given membrane radius.) Another problem related to string instruments is inharmonicity in the spectra. This is due to the fact that the string isn't infinitely thin, but has a certain mass, and when this mass is unevenly distributed some of the harmonics can be displaced from their 'correct' positions [WolfeB].

Even if these exceptions cause problems for the transcription process, they can be countered fairly easily, and the most concerning issue is still estimating the pitch of multiple instruments whose harmonic series are overlapping. This might seem like a special case which only happens from time to time, but the fact is that this is the rule rather than the exception. The reason for composing music in this fashion is rooted in the way the brain experiences the music: entities related harmonically are put together in a bigger entity which is experienced as a whole, that is, one cannot distinguish the different notes. This helps reduce the complexity of the listening process, and makes it easier to listen to the music. The more the elements in the music are in harmonic relation, the more the human brain groups the elements together, making the music more pleasing to listen to. So what makes the listening easier makes the transcription process harder.

Many different approaches exist for estimating the pitch, many of which seem to be adapted from methods originally developed for speech processing. Still, most of these methods are made for monophonic music and will perform poorly when applied to polyphonic music. The most used is the spectrogram, based on the Short-Time Fourier Transform (STFT). Other examples are the correlogram, filter banks, the Constant Q Transform, cepstrum, auto-correlation, cross-correlation, zero-crossing detection, the wavelet transform, sinusoidal modelling, AR or ARMA methods, and Prony modelling. Some efforts have been made to improve multipitch estimation by subtracting the spectrum associated with an estimated note, and thereby continuing to search for notes [Klapuri01B]. Some of the methods mentioned above are discussed further down; others can be found in [Klapuri98], [Schro00], [Fitch00]. Even though it is not directly applicable to music, the tutorial paper on signal modelling in speech recognition by Picone is worth a read to get an overview of available techniques within speech processing [Picone93].


Other problems related to transcription

It is a known fact that it is the attack portion of the note (the transient state before the harmonics in steady state take over) that dictates the way we experience the timbre, and it is probably this region that can provide the necessary information needed to identify the instrument. It is therefore important to determine the location of the on- and off-sets of the notes. This will also help avoid placing a pitch analysis frame over two different notes, and thereby give more accurate estimates. More importantly, when the order rises, segmentation enables us to adjust the analysis windows to cover the whole note, giving us as many datapoints as possible. Many schemes for segmenting music exist, see [Jehan97], [Klapuri98].

Successfully identifying the instruments would enable us to apply instrument models to the pitch estimation process. This could help detect notes with overlapping harmonics, since we would to some extent know the expected amplitude signatures. Also, it would provide automatic selection of instruments in the output file.

Processing cost and storage requirements are not often discussed in connection with automatic transcription. They might not be an issue in laboratory tests or professional applications not in need of real-time operation, but if used for music search on the Internet, the transcriber would most probably be implemented as a Java applet that is downloaded from a server. It is clear that as long as not everybody is blessed with a high-speed Internet connection and the fastest processors, only a limited amount of knowledge and calculation can be put into the applet. Thus, a system requiring as few resources as possible while still being able to do the job would be desired.

MIDI file format

The MIDI file format is relatively simple, and much documentation can be found on the Internet. It is in binary format, which helps keep the files small. Both single-track and multi-track files are supported. In brief, the file is organised in chunks. Every chunk starts with a four-byte ID field telling what type of chunk it is, and a four-byte length field giving the number of bytes of data following this header. All MIDI files begin with the MThd chunk containing tempo information, followed by one or more MTrk chunks containing meta-data and the actual music. All events in the track chunk, such as note-on and note-off, are equipped with a delta-time field, indicating the duration (in MIDI clocks) between the event and the preceding event. This delta-time is stored in a variable-length format, enabling shorter events to be represented with fewer bytes than with a fixed format.
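As an illustration of the delta-time coding, a minimal Matlab sketch (vlq is a hypothetical helper name, not one of the transcriber's modules): seven data bits per byte, most significant group first, with the top bit set on every byte except the last.

    function bytes = vlq(dt)
    % Encode a non-negative integer delta-time as a MIDI variable-length
    % quantity (saved as vlq.m).
    bytes = bitand(dt, 127);              % least significant 7 bits
    dt = floor(dt/128);
    while dt > 0
        bytes = [bitor(bitand(dt, 127), 128), bytes];  % prepend, MSB set
        dt = floor(dt/128);
    end
    % vlq(0) -> 0,  vlq(128) -> [129 0]  (0x81 0x00, as in the MIDI spec)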

Three types of files exist: types 0, 1 and 2. Type 0 contains only one track and is thereby monophonic. Type 1 contains one or more tracks, where all tracks are played simultaneously. Type 2 contains one or more tracks, but the tracks are sequentially independent. Type 1 was the natural choice for this project, since an extension to polyphony is accomplished simply by adding MTrk chunks.

A good overview of the event codes in the specification can be found at <http://crystal.apana.org.au/ghansper/midi_introduction/midi_file_format.html>, while a more textual introduction can be found at <http://jedi.ks.uiuc.edu/~johns/links/music/midifile.html>


Pitch estimators

Cross-correlation between signal and sinusoids

Since we assume that the signal consists of a sinusoid with several harmonics (also sinusoids), we could find these sinusoids by taking the cross-correlation between the signal block and test sinusoids with frequencies given by (3). The frequencies found are those that yield the highest cross-correlation values. These test sinusoids could be synthetic ones or samples of real instruments. This method will of course be a bit costly in terms of computation, since every possible note must be tested. Further, we can't reveal notes where all harmonics are masked by a lower note. To be able to do that, we must apply instrument models and compare the correlation value given by (4) with the expected value from the used model; a deviation from the expected value could indicate a hidden note.

c_{xy}(k) = \sum_{l=0}^{N-1} x(l)\, y(l+k), \quad k = 0, 1, \dots, k_{\max}    (4)

The calculation is rather expensive, since we have to do N multiplications for every possible note (128 MIDI notes), and the lower the note is, the bigger k_max gets, to account for the longer sine period.

Fig. 2 shows an example of the G4, 392 Hz (the same as in fig. 1), where the cross-correlation between the signal and 8 sinusoids has been calculated.
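A minimal Matlab sketch of this estimator, assuming a mono signal block x at 11025 Hz and synthetic unit-amplitude test sinusoids only (no instrument models):

    % Score each candidate note by its peak cross-correlation with a
    % test sinusoid, cf. (4).
    fs = 11025;
    x = x(:); N = length(x);
    t = (0:N-1)'/fs;
    kmax = round(fs/55);                      % longest lag: one period of A1
    cand = -48:39;                            % piano range relative to A4, cf. (3)
    score = zeros(size(cand));
    for i = 1:length(cand)
        fc = 440 * 2^(cand(i)/12);
        c  = xcorr(x, sin(2*pi*fc*t), kmax);  % cross-correlation up to lag kmax
        score(i) = max(abs(c));
    end
    [m, best] = max(score);
    f0 = 440 * 2^(cand(best)/12)              % winning candidate pitch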

Filter banks

Some of the earliest attempts to estimate pitch were made with filter banks. One simply divides the spectrum by filtering the signal with several bandpass filters, and then measures the energy present in the different filters. A better and more promising approach is to use a dyadic filter bank in conjunction with wavelets, see [Karaj99] and [Fitch00].

Figure 2: Cross-correlation (c_xy versus sample lag) between the G4 from a flute and 8 test sinusoids; the first candidates are 392 Hz, 784 Hz, 1176 Hz and 1568 Hz.

Fourier-based methods

As mentioned earlier, the spectrogram has been the most popular way to obtain a pitch estimate. To find the frequencies of the sinusoids, one simply picks the peaks in the spectrum. The spectrogram, based on the Short-Time Fourier Transform (5), is an extension of the normal Fourier transform where a small analysis window is moved along the time axis and successive transforms are taken.

S_x(u,f) = \int_{-\infty}^{\infty} x(t)\, w(t-u)\, e^{-i 2\pi f t}\, dt    (5)

The spectrogram gives the energy density, similar to the periodogram, and is given by (6):

P_S(u,f) = \left| S_x(u,f) \right|^2    (6)
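For reference, a direct Matlab sketch of the discrete versions of (5) and (6) for a sampled column vector x, with window and hop sizes chosen arbitrarily (Matlab's specgram wraps the same computation):

    % Short-Time Fourier Transform: slide a Hamming window along x in
    % hops of Nh samples and take an Nfft-point FFT per frame.
    Nw = 512; Nh = 128; Nfft = 1024;
    w  = hamming(Nw);
    nf = floor((length(x) - Nw)/Nh) + 1;
    S  = zeros(Nfft/2 + 1, nf);
    for m = 1:nf
        seg = x((m-1)*Nh + (1:Nw)) .* w;     % windowed segment, cf. (5)
        X   = fft(seg, Nfft);
        S(:,m) = abs(X(1:Nfft/2+1)).^2;      % energy density, cf. (6)
    end
    % Row r of S corresponds to frequency (r-1)*fs/Nfft.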

Two problems are associated with the spectrogram. The first is the windowing function. The windowing introduces side-lobes which can mask weaker signals, and trying to reduce the side-lobes eventually leads to a wider main lobe. Further, the resolution in time and frequency is limited by Heisenberg's uncertainty theorem, which states that the area of the rectangle defined by the windowing function has a minimum given by (7):

\Delta f \cdot \Delta t \geq \frac{1}{2}    (7)

This windowing of the data clearly limits the obtainable frequency resolution, and we are given the choice between high frequency resolution or high time resolution, but not both. Additionally, we have some minimal time resolution to respect if we want to avoid mixing several notes in the same analysis frame, and to know where the note actually starts and stops. For example, a musical piece playing at 200 BPM (beats per minute, or quarter notes per minute) sampled at 44.1 kHz gives us 60 s/(200*4) = 75 ms per sixteenth note, which again gives us 44100*75 ms ≈ 3308 samples per sixteenth note. A lower sample rate means fewer datapoints, and if we want to capture ornamental tones², we'll have even fewer sample points.

Then to the second problem, namely the linearly spaced frequency bins of the Fourier transform. Musical tones are ordered in a logarithmic fashion similar to the sensitivity of human hearing, so the distance between two neighbouring notes in the higher end of the musical scale is greater than in the lower end. That allows for higher time resolution, by lowering the frequency resolution, in the higher end of the spectrum. The Constant Q Transform obtains such a constant ratio between centre frequency and frequency resolution for each frequency partition by combining several bins of an FFT. This, however, fails to give better time resolution for the higher frequencies, and we haven't won much.

² Ornamental notes are very short notes not written on the music sheet, but added by the musician.


Autoregressive methods

Since the windowing function is a problem for the resolution, we want to avoid windowing our data. Parametric methods like AR models and eigenvector methods make this possible. We assume our signal to be composed of a number of sinusoids in white noise, something that enables us to make use of an eigendecomposition of the estimated auto-correlation matrix. This is possible since the matrix is composed of a signal auto-correlation matrix and a noise auto-correlation matrix, where the eigenvalues associated with the noise are generally small compared to those of the signal.

Since we cannot find the theoretical Maximum Likelihood Estimator (MLE) for more than one sinusoid analytically [Kay88], we must use a sub-optimal method. The most promising estimator that is not too computationally expensive is the Principal Component AR frequency estimator.

\hat{\mathbf{a}} = -\hat{\mathbf{R}}_{xx}^{-1}\, \hat{\mathbf{r}}_{xx}    (8)

Eq. (8) is the standard AR parameter estimation, with the auto-correlation matrix being positive definite Hermitian. This allows for an eigendecomposition with real, positive eigenvalues and orthonormal eigenvectors. These eigenvectors span the entire auto-correlation space, so we can write the auto-correlation vector as a linear combination of the eigenvectors, which inserted into (8) gives:

\hat{\mathbf{a}} = -\sum_{i=1}^{M} \frac{1}{\hat{\lambda}_i}\, \hat{\mathbf{v}}_i \hat{\mathbf{v}}_i^H \sum_{j=1}^{M} \alpha_j \hat{\mathbf{v}}_j    (9)

Simplifying and discarding the (M-p) smallest eigenvalues, we obtain:

\hat{\mathbf{a}}_{PC} = -\sum_{i=1}^{p} \frac{\alpha_i}{\hat{\lambda}_i}\, \hat{\mathbf{v}}_i    (10)

Finally, the frequency estimates are obtained by picking the peaks of the spectral estimator in (11), which is done by solving for the roots of the polynomial given by (10) and taking the angles of these poles.

\hat{P}_{pc}(f) = \frac{\hat{\sigma}_{pc}^2}{\left| 1 + \sum_{k=1}^{p} \hat{a}_{pc}(k)\, e^{-j 2\pi f k} \right|^2}    (11)

In practice, we can estimate the AR coefficients with the Modified Covariance Method (MODCOV), pick the p poles closest to the unit circle, and again obtain the frequency estimates from the angles of these poles.

The modified covariance method has proven to be insensitive to the initial phases of the sinusoids, and spectral peak shifting due to noise is low; in fact, in the absence of noise the true frequencies are found. Tests show that the Principal Component method reaches the Cramér-Rao bound for SNRs higher than 10 dB. Since the MODCOV method is least-squares optimised without constraints, the poles can end up outside the unit circle, leading to unstable models. This is not a problem, since we are only interested in the angles of the poles.

Further, the AR spectrum can be calculated from (11).
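In Matlab 5.3 this chain takes only a few lines. A sketch, assuming a signal block x at sample rate fs, with the order p and the number of sinusoids p0 supplied by one of the criteria discussed below (armcov is the Signal Processing Toolbox routine for the modified covariance method):

    % AR frequency estimation: fit AR(p), root the prediction polynomial,
    % keep the p0 poles closest to the unit circle, convert angles to Hz.
    p = 12; p0 = 5;                       % example values
    [a, pep] = armcov(x, p);              % a = [1 a(1)..a(p)], pep = PEP
    r = roots(a);                         % poles of the AR model
    r = r(imag(r) > 0);                   % one pole per conjugate pair
    [d, idx] = sort(abs(abs(r) - 1));     % distance to the unit circle
    r = r(idx(1:min(p0, length(r))));     % keep the p0 closest poles
    f = sort(angle(r)) * fs/(2*pi)        % frequency estimates in Hz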


Model order selection

The choice of model order is crucial in order to make the AR methods perform acceptably, but unfortunately no perfect method exists. This problem gets even harder as the number of available sample points is reduced. As soon as the possible order is greater than 0.1 times the number of available datapoints, we are dealing with a finite sample, and the criteria for larger samples are no longer valid and need corrections. Most of the methods are based on the maximum likelihood estimator of some parameter with some bias/variance correcting factor.

Estimation of the Prediction Error Power

Most of the existing methods try to fit several model structures and orders to the data, and the order is chosen by minimising a cost function. This cost function is usually based on the Maximum Likelihood Estimate of the prediction error power (PEP) of the model, but since the PEP is on average decreasing for increasing model order, we have to add a certain penalty to the cost function. This is because a higher order leads to higher variance in the prediction coefficient estimates, and thereby higher variance in the PEP. The information-based criteria require two passes over the data: one for calculating the cost function and another for choosing the right order.

\hat{\sigma}_{pep}^2 = E\left\{ \left| x(n) - \hat{x}(n) \right|^2 \right\} = \hat{r}_{xx}(0) + \sum_{k=1}^{p} \hat{a}(k)\, \hat{r}_{xx}(k)    (12)

The â(k) in the PEP are calculated using the actual estimator type (YW, ModCov (also called FB or Forward-Backward), etc.), and PEP_FB indicates that the Modified Covariance method is used.

The best known criterion (for AR models) is Akaike's Information Criterion (AIC):

\mathrm{AIC}(\hat{k}) = \arg\min_{k \in \{0,1,\dots,p_{\max}\}} \left[ \ln\!\left( \hat{\sigma}_{pep}^2(k) \right) + \frac{k}{N} \cdot 2 \right]    (13)

This criterion is however not consistent, and tends to overestimate the model order. The minimum description length (MDL) criterion fixes this problem by having a higher penalty factor, and is asymptotically consistent:

\mathrm{MDL}(\hat{k}) = \arg\min_{k \in \{0,1,\dots,p_{\max}\}} \left[ \ln\!\left( \hat{\sigma}_{pep}^2(k) \right) + \frac{k}{N} \ln N \right]    (14)

These two criteria impose an equal penalty on all unknown parameters (amplitudes, phases, frequencies etc.). Better performance can be obtained with the maximum a posteriori (MAP) criterion, where different penalties can be attributed to the different unknown parameters. For example, for AR models the MAP is equal to the MDL, and for sinusoids with unknown frequencies in white noise the penalty is (5k/N) ln N. For a development of the MAP, see [Djuric96] and [Djuric98].

A problem arises when the number of datapoints is less than ten times the model order, as we are then dealing with finite samples and the criteria given above no longer work. This problem is especially present in small samples, where they can even fail to give a minimum. In larger samples they tend to choose too high orders, as they don't account for the increased variance in the modelling error. The Finite Sample Information Criterion (FSIC) [Broer00] tries to handle this overestimation by changing the penalty factor to better suit the variance from the given AR estimation method and model order.

\mathrm{FSIC}(\hat{k}) = \arg\min_{k \in \{0,1,\dots,p_{\max}\}} \left[ \ln\!\left( \hat{\sigma}_{pep}^2(k) \right) + \left( \prod_{i=0}^{k} \frac{1 + v_i}{1 - v_i} - 1 \right) \right]    (15)

where the v_i depend on the AR method; for MODCOV they are given as v_i = (N + 1.5 - 1.5i)^{-1}. The combined information criterion (CIC) uses the maximum of the penalty in (15) and 3·Σ v_i, see [Broer00], eq. 13.
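A Matlab sketch of (13)-(15), assuming a signal block x and taking the PEP from armcov (i.e. the slower PEP_FB variant; chapter 2 mostly uses the Yule-Walker PEP for speed):

    % Sweep the candidate orders and pick the minimum of each criterion.
    N = length(x); pmax = 100;
    aic = zeros(1,pmax); mdl = aic; fsic = aic;
    for k = 1:pmax
        [a, pep] = armcov(x, k);          % prediction error power, cf. (12)
        aic(k) = log(pep) + 2*k/N;        % (13)
        mdl(k) = log(pep) + k*log(N)/N;   % (14)
        v = 1 ./ (N + 1.5 - 1.5*(0:k));   % finite-sample factors for MODCOV
        fsic(k) = log(pep) + prod((1+v)./(1-v)) - 1;   % (15)
    end
    [m, kaic]  = min(aic);
    [m, kmdl]  = min(mdl);
    [m, kfsic] = min(fsic);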


Order estimation methods based on Singular Values or noise subspace estimation

Another group of methods for estimating orders are the ones based on determining the singular values of the signal auto-correlation or covariance matrix. These methods are more specialised, as they are usually based upon the assumption that the signal consists of sinusoids in white noise. This makes it possible to decompose the auto-correlation matrix into a signal matrix and a noise matrix, because the singular values associated with the signal are generally much higher than those of the noise. Additionally, these two subspaces are orthogonal. The drawbacks of these methods based on eigen-calculations are that they need O(p³) operations, and that the true order is not always the best order (at least for autoregressive models, which are AR(∞) when corrupted with white noise).

The AIC and MDL criteria based on eigenvalues are given below; the development can be found in [Wax85].

\mathrm{AIC}_{svd}(\hat{k}) = \arg\min_{k \in \{0,1,\dots,p_{\max}\}} \left[ -\ln\left\{ \frac{\left( \prod_{l=k+1}^{p} \hat{\lambda}_l \right)^{\frac{1}{p-k}}}{\frac{1}{p-k} \sum_{l=k+1}^{p} \hat{\lambda}_l} \right\} + \frac{k}{N} \cdot \frac{2p-k}{p-k} \right]    (16)

\mathrm{MDL}_{svd}(\hat{k}) = \arg\min_{k \in \{0,1,\dots,p_{\max}\}} \left[ -\ln\left\{ \frac{\left( \prod_{l=k+1}^{p} \hat{\lambda}_l \right)^{\frac{1}{p-k}}}{\frac{1}{p-k} \sum_{l=k+1}^{p} \hat{\lambda}_l} \right\} + \frac{k}{N} \cdot \frac{(2p-k) \ln N}{2(p-k)} \right]    (17)

where k is the estimated order, p is the order of the covariance matrix, N is the number of datapoints used to calculate the covariance matrix, and the λ's are the eigenvalues of the covariance matrix ordered as λ₁ > λ₂ > ... > λ_p. The expression in the braces is simply the ratio of the geometric mean to the arithmetic mean of the p-k smallest eigenvalues. Similar to the PEP-based criteria, the AIC tends to overestimate the order, while the MDL has been shown to be consistent. It should be possible to merge the order estimation and the parameter estimation in order to reduce the computational load.
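A Matlab sketch of (17), with the covariance matrix estimated directly from a lagged data matrix and the order p chosen ad hoc:

    % Eigenvalue-based MDL: compare geometric and arithmetic means of
    % the smallest eigenvalues of the estimated covariance matrix.
    p = 30; N = length(x) - p;
    X = zeros(N, p+1);
    for i = 1:p+1
        X(:,i) = x(i:i+N-1);              % lagged data matrix
    end
    lam = flipud(sort(eig((X'*X)/N)));    % eigenvalues, largest first
    M = p + 1; mdl = zeros(1, M-1);
    for k = 0:M-2
        tail = lam(k+1:M);                % the M-k smallest eigenvalues
        g = exp(mean(log(tail)));         % geometric mean
        a = mean(tail);                   % arithmetic mean
        mdl(k+1) = -log(g/a) + k*(2*M-k)*log(N)/(2*N*(M-k));
    end
    [m, i] = min(mdl);
    khat = i - 1                          % estimated signal subspace dimension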

Another possible method is to continuously estimate the noise subspace by means of a QR factorisation of the covariance matrix, or even of the data matrix itself. The idea is to decompose a matrix X into a square matrix Q with orthonormal columns and an upper-triangular matrix R,

X E = Q R    (18)

where E is a permutation matrix that orders the diagonal of R in descending order.

R = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}    (19)

The factorisation in (18) is called a rank-revealing QR factorisation if R₂₂ has a small norm. The dimension of R₂₂ is equal to the rank deficiency of X, or equivalently to the dimension of the noise subspace. The dimension of R₁₁ is then equal to that of the signal subspace. An effective implementation requiring on average O(p²) operations can be found in [Bisc92].

Of course, one can always decompose the signal or covariance matrix and determine the rank of the noise subspace by finding the smallest singular values. However, this proves to be difficult when the order of the matrix is small.
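Matlab's column-pivoted QR returns exactly the factorisation in (18); a sketch reading the subspace split off the diagonal of R, using the data matrix X from the previous sketch (the threshold is an ad hoc assumption, and this is the full O(p³) factorisation, not the O(p²) scheme of [Bisc92]):

    % Rank-revealing QR: E orders abs(diag(R)) in decreasing magnitude,
    % so a sharp drop marks the signal/noise subspace split, cf. (19).
    [Q, R, E] = qr(X);                    % X*E = Q*R
    d = abs(diag(R));
    ksig   = sum(d > 1e-3*d(1));          % dimension of R11 (signal subspace)
    knoise = length(d) - ksig             % dimension of R22 (noise subspace)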


Chapter 2 – Implementation of a music transcriber

Initial testing on real and synthetic musical signals

Unless otherwise stated, all sound files are sampled at 11025 Hz with 16 bits in one channel. Some single-note sound files were found on the Internet: a series of tones from a piano, plus some single clarinet and saxophone examples. The most useful was a series of flute tones from different flutes, found at <http://www.phys.unsw.edu.au/music/flute/flute.html>, with impedance and spectral measurements available. These sound clips were chosen as the basis for the project. The synthetic signals were created with the script synthsign.m, and the signals are analysed with the script process.m. A description of the different program modules is found in appendix B.

Fourier methods, real instruments

In the beginning some testing was done on single notes from real instruments, in order to better see the difference between parametric and non-parametric methods, and to get a picture of what instrument spectra look like. Additionally, some modules for file handling and note/frequency decision were built, which were also needed for the real transcriber. Synthetic signals were not tested, since it was assumed that as long as the peaks are well separated the correct frequencies are found, and that the resolution is limited by the number of datapoints N in the analysis windows.

A crude periodogram was first implemented, and a simple peak-picking method was used to estimate the frequencies. The peak-picking worked by searching for the maximum value and then deleting a certain number of samples around this maximum. An example from the file a4b.wav is shown in fig. 3, with the Matlab output found in table 2. A problem is that we have to use a threshold and thereby miss some weaker peaks. Instead of a fixed-value threshold, a more adaptive method that estimates the noise floor to use as threshold [Durne98] could be used.

Peak 1 is at freq 0.000 Hz       Difference from 0 Hz:            0.000 Hz
Peak 2 is at freq 425.954 Hz     Difference from previous peak: 425.954 Hz
Peak 3 is at freq 438.066 Hz     Difference from previous peak:  12.112 Hz
Peak 4 is at freq 864.693 Hz     Difference from previous peak: 426.627 Hz
Peak 5 is at freq 877.142 Hz     Difference from previous peak:  12.449 Hz
Peak 6 is at freq 888.245 Hz     Difference from previous peak:  11.103 Hz
Peak 7 is at freq 1303.432 Hz    Difference from previous peak: 415.187 Hz
Peak 8 is at freq 1314.871 Hz    Difference from previous peak:  11.440 Hz
Peak 9 is at freq 1325.301 Hz    Difference from previous peak:  10.430 Hz
Peak 10 is at freq 1753.274 Hz   Difference from previous peak: 427.972 Hz
Peak 11 is at freq 2200.088 Hz   Difference from previous peak: 446.814 Hz

Table 2

It is clear that the Fourier transform will work fine for monophonic music, but a certain smoothing is necessary before the peaks are searched for. A different search routine would be desirable, since one never knows how many points to delete around the peaks.
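The peak-picking itself reduces to a few lines; a sketch assuming a one-sided periodogram Pxx on the frequency grid fgrid, with the fixed threshold thr and deletion half-width wdel that the text complains about:

    % Grab the global maximum, blank wdel bins on each side, and repeat
    % until everything left is below the threshold.
    P = Pxx(:);
    fpeaks = [];
    while max(P) > thr
        [m, i] = max(P);
        fpeaks(end+1) = fgrid(i);                    % record peak frequency
        P(max(1, i-wdel):min(length(P), i+wdel)) = 0; % delete around it
    end
    fpeaks = sort(fpeaks)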

Figure 3: Periodogram of a4b.wav – magnitude [dB] versus frequency [Hz] (0-5000 Hz).

An averaged periodogram with overlapping windows would be a better alternative, as it would reduce both the amplitude and frequency variance. Matlab provides this as a Welch periodogram. The same sound is tested in fig. 4, and we see a clear improvement over the periodogram: a more precise frequency estimate is obtained, and the spurious peaks present in table 2 are gone. This file has 18743 samples (1.7 s), and is of course a bit longer than the average note duration.

Peak 1 is at freq 441.431 Hz     Difference from 0 Hz:          441.431 Hz
Peak 2 is at freq 872.095 Hz     Difference from previous peak: 430.664 Hz
Peak 3 is at freq 1313.525 Hz    Difference from previous peak: 441.431 Hz
Peak 4 is at freq 1754.956 Hz    Difference from previous peak: 441.431 Hz

Table 3

In music, we can expect the signal to be stationary for about 25 ms, so a window size of that duration is realistic. Fig. 5 shows a 28 ms clip (308 points) of the previous A4, and the same frequencies as in table 3 are found.

With such a short window we have a possible resolution of (1/300)*11025 Hz = 36.75 Hz, and a smoothed periodogram makes it even worse. For example, an A1 (55 Hz) would prove difficult to resolve, and overlapping harmonics would be unresolvable.

Figure 4: A4 – Welch periodogram, 1024-point FFT, 512-point window, 25% overlap (magnitude [dB] versus frequency [Hz]).

Figure 5: The 28 ms clip of the A4 – Welch periodogram, 1024-point FFT, 512-point window, 25% overlap (magnitude [dB] versus frequency [Hz]).

Yule-Walker, synthetic signals

Even though the Yule-Walker method is reported to perform poorly as a frequency estimator in noisy signals [Kay88], it was implemented, partly because of its computational simplicity and partly because it was the only method available for estimating the AR coefficients in Matlab 5.2. A 25 ms synthetic noise-free signal corresponding to an A4 with four over-harmonics was created, and the frequencies were estimated by rooting the AR coefficient polynomial. The resulting spectra given by (11) are seen in fig. 6, and the frequencies found are 436.639, 879.691, 1321.294, 1763.093 and 2207.368 Hz. We see that even though the signal is noise free, the true frequencies are not found. This is due to the 'zeroing' of the auto-correlation values outside the auto-correlation matrix, something that smooths and displaces the peaks.
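For reference, this kind of test signal can be generated as follows (a sketch, not the actual synthsign.m; the amplitude roll-off is an assumption for illustration):

    % 25 ms A4 with four over-harmonics at fs = 11025 Hz, optionally in noise.
    fs = 11025; f0 = 440;
    t = (0:round(0.025*fs)-1)'/fs;
    A = [1 0.5 0.3 0.2 0.1];              % assumed amplitudes per harmonic
    x = zeros(size(t));
    for k = 1:5
        x = x + A(k)*sin(2*pi*k*f0*t);    % fundamental plus 4 over-harmonics
    end
    x = x + sqrt(0.16)*randn(size(t));    % optional white noise, variance 0.16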

Figure 7 shows the pole-zero plot of the noise-free A4. We see that the poles not modelling the sinusoids are fairly close to the unit circle. This will lead to problems selecting the threshold for which poles to accept. The reason for the relatively high order selected, compared to the true order (10), is the same as above: the zeroing of the auto-correlation values. To minimise the modelling error a higher order is necessary. The AIC function (with penalty k set to 1) is shown in figure 8. The same value is obtained for the MDL and FPE, while the eigenvalue-based criteria choose 55, which is far too high.

This is probably due to the fact that without noise, the auto-correlation matrix is singular.

When the signal is corrupted with noise, the order stays the same, but the poles modelling the noise get even closer to the unit circle.

Figure 6: Welch periodogram of the signal and estimated AR(27) spectrum (versus frequency [Hz]).

Figure 7: Pole-zero plot of the noise-free A4 (imaginary versus real part).

Figure 8: AIC versus model order.

One of the reasons for using AR models instead of Fourier methods was to obtain higher resolution, and thereby be able to resolve harmonics a semitone apart, or maybe even resolve overlapping harmonics.

A 25 ms synthetic signal consisting of a G#4 (415.3 Hz) and an A4 (440 Hz) in white noise with variance 0.16 was created. At a sampling rate of 11025 Hz we have 276 datapoints, which gives us a Fourier resolution of (1/276)*11025 = 39.95 Hz. This means that a standard periodogram should not be able to resolve the first harmonics (24.7 Hz apart), while the over-harmonics could be found (>49 Hz apart).

Still using the YW method to find the prediction coefficients, we analyse the signal, using a standard AIC with penalty 1 to estimate the order. Fig. 9 shows the result, and as predicted we see that the Welch periodogram is unable to resolve the first harmonics. While not visible in this figure, the two peaks are found in the AR model, see fig. 10. However, one of the poles is too far from the unit circle, and is considered as noise. This limited capability to separate signal and noise probably makes the Yule-Walker method for estimating the poles less attractive, and other methods should be tested.

On the other hand, one could resolve all of the frequencies in the model, and from that search for possible harmonic series. However, care must be taken to avoid the creation of non-existing notes, since we see from figs. 10 and 7 that the poles not associated with the signal are not exactly randomly distributed.

One could also consider Burg's algorithm, since it has the same computational cost as YW but in general performs better. The problem is the phenomenon of line splitting when the order increases.

Figure 9: Welch periodogram of the signal and estimated AR(51) spectrum (versus frequency [Hz]).

Figure 10: Pole-zero plot, 415.30 Hz + 440 Hz, 25 ms / 276 points.

Modified covariance method, synthetic signals

As reported in [Schro00], Marple's modified covariance method holds the promise of better frequency estimates than the traditional Yule-Walker method, and it was claimed that in the absence of noise the true frequencies are found.

Matlab 5.2 does not include the algorithm, but 5.3 does; it is called armcov.m.

The same noiseless A4 as used with the YW method is tested with the ModCov algorithm. We see in fig. 11 that the correct model order is found with the AIC when using the ModCov to find the PEP. The problem is that some of these poles are displaced more than 0.035 from the unit circle.

We obtain 444.87, 899.84, 1332.10, 1764.21 and 2200.45 Hz, which is even worse than YW. This could be due to round-off errors when forming the covariance matrix, or inaccuracies when inverting it.

To improve the estimates, one could try to increase the order of the model. By doubling the order, we see that the poles are on the unit circle and give the exact frequencies. Fig. 13 shows the result of adding two poles; in fact, just increasing the order to 12 gives a better result, with deviations of at most 0.12 Hz, see figs. 13 and 14.

Figure 11: AIC (penalty k=1), PEP from ModCov, versus model order.

Figure 12: Pole-zero plot, A4, no noise, ModCov.

Again, as with the Yule-Walker method, the signal with notes a semitone apart is examined. While using the ModCov to estimate the PEP, the MDL, FPE, FSIC and AIC with penalty 2 all give order 21, which is approximately the correct order (20) for the noiseless case, but too low to model all the sinusoids in noise. Using penalty k=1 for the AIC with PEP_FB gives order 76, which successfully finds all the sinusoids with less than 5 Hz error. This overestimation is typical for the AIC. On the other hand, using order 76 gives us five extra sinusoids, not in harmonic relation to those really existing. This shows that letting the order become too high quickly gives rise to spurious peaks, which have to be 'filtered' away by some means. It is clear that this method is not optimal. If we look at fig. 9 again, we see that order 51 was chosen for exactly the same signal using AIC with k=1 and PEP_YW. In fact, this combination seems to work fine for selecting the order to use with the ModCov, and was most often used when doing the transcription, largely because the calculation using the Levinson algorithm is faster than the ModCov algorithm and thus speeds up the analysis.

Figure 13: Welch periodogram of the signal and estimated AR(12) spectrum (versus frequency [Hz]).

Figure 14: Pole-zero plot.

Figure 15: Welch periodogram of the signal and estimated AR(76) spectrum (versus frequency [Hz]).

Synthetic signals are useful since we have complete knowledge of the signal, but real instruments don't create perfect sinusoids, so it is interesting to test order estimation on real instrument samples.

Again an A4 was chosen, but this time from a flute. The number of datapoints was 308, which is about 28 ms at the 11025 Hz sample rate, and this should be short enough to catch most of the notes. Fig. 17 shows the MDL with penalty k=2 using PEP_FB, and looking at the periodogram in fig. 16, we see that the correct order is found again (or more precisely: the correct number of sinusoids compared with the number of peaks). The same holds for the CIC and FSIC, while the AIC, FPE and the eigenvalue-based methods in (16) and (17) overestimate the order.

The order chosen in the model is twice the order estimated. This seems to be a good balance between the correct order, which gives too smooth a spectrum, and a too high order, which gives spurious peaks. We see that one extra peak appears, which is tolerable. Experiments showed that model orders of 1.5 to 2 times the number of sinusoids gave good results.


Figure 17: MDLk=2 using PEPFB versus model order, A4 flute, 308 points.

Figure 16: Welch periodogram of the A4 flute signal and the estimated AR(38) spectrum.

Figure 18: Pole plot (real vs. imaginary part), A4 flute, 308 points, ModCov, order 38.


Implementation of the transcriber

The structure of the transcriber

The transcriber WavToMid is completely implemented in Matlab; it takes a wav-file from the subdirectory ".\wav\" as input and gives a midi-file in ".\mid\" as output. The instrument type in the midi-file must be specified manually. For the moment the program first finds all midi notes, and then does the post-processing and the writing to a MIDI file.

The processing is done block-wise, with each block fixed to 250 samples (25 ms). This is done to simplify the program, but it is obvious that better results can be obtained if a proper segmentation is implemented. The block size was chosen from the fact that at 200 BPM a sixteenth-note is 75 ms long, and a 25 ms window should then be able to capture most of the changes in the music. Each block is tested for silence before it is passed to frequency estimation. After the preliminary testing, it was clear that the Modified Covariance Method was the way to go for the frequency estimation. The frequencies are found by using Matlab's tf2zp.m and calculating the angles of the poles returned. Finding the roots of the polynomials implies some eigenvalue calculation; whether this can be avoided is unknown. Model order selection was most of the time done by AICYW with penalty k=1 because of the speed advantages, but using MDLFB with penalty 2 gives a closer estimate of the number of sinusoids and can in some cases give better results.
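
The main loop can be sketched as follows (the helper names match the modules in fig.19, but the signatures used here are hypothetical simplifications):

% Block-wise analysis: silence test, order selection, ModCov, pole angles.
blockSize = 250;                              % fixed 25 ms blocks
nBlocks = floor(length(x)/blockSize);
notes = zeros(1, nBlocks);                    % MIDI number per block, 0 = silence
for b = 1:nBlocks
    blk = x((b-1)*blockSize + (1:blockSize));
    if sum(blk.^2) < silenceThresh            % skip silent blocks
        continue
    end
    order = orderselect(blk);                 % AIC(k=1) on PEP from Yule-Walker
    a = armcov(blk, order);                   % Modified Covariance model
    f = coeff2freq(a, fs);                    % frequencies from the pole angles
    notes(b) = freq2mid(f);                   % harmonic search -> MIDI number
end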

After the frequency estimation, harmonic series are searched for in order to find potential fundamental frequencies, which are afterwards converted to the corresponding midi-numbers. At this moment only monophonic music is supported, but an extension to polyphony should be straightforward. One problem is to decide which note belongs to which instrument.
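
The conversion itself follows directly from equation 3 (MIDI number = k + 69, with A4 = 440 Hz); a minimal sketch:

% Convert a fundamental frequency f0 [Hz] to the nearest MIDI number.
k = 12*log2(f0/440);                 % semitones relative to A4
midiNr = 69 + round(k);              % nearest equal-tempered note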

When the notes are determined, some 'top-down processing' is done. Pauses and notes that are too short are removed. Reverb is removed by trying to detect whether a note continues to play while another note is present. Finally the midi numbers are converted to binary Midi 1.1 format and written to disk. The relations between the different tempo parameters used in the MIDI specification were not completely understood, so changing the analysis block size is not possible without some manual adjustment of these parameters.
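
As an illustration of the first clean-up step (the names are hypothetical, and the handling of short pauses is simplified):

% Remove notes shorter than minBlocks analysis blocks.
% notes is the per-block MIDI number, 0 meaning silence.
starts = find([1 diff(notes) ~= 0]);          % first block of each run
lens = diff([starts length(notes)+1]);        % run lengths in blocks
for i = find(lens < minBlocks)
    notes(starts(i):starts(i)+lens(i)-1) = 0; % too short: treat as silence
end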

The different modules in the transcriber are shown in fig.19. The shaded boxes are shared with process.m used in the initial testing, while the hatched box is non-essential for the program. It is just an early pitch estimator calculating the most frequently occurring distance between the frequencies found in the Fourier spectrum, and it is used in the time-frequency plot of the data. The numbers in the boxes indicate the order in which the functions are called. All parameters not set interactively can be found in the main script.


Figure 19: Module structure of the transcriber WavToMid.m, calling loadfile.m, orderselect.m, ar_cov.m, coeff2freq.m, freq2mid.m, most_freq.m, fixed2var.m and midiwrite.m (the numbers 1-5 in the diagram indicate the call order).


Transcribing real music

Since flute samples were used throughout the preliminary testing, a short flute solo (flute5.wav) was chosen as the reference transcription clip. It is recorded at 11025 Hz with 8-bit resolution, so the quality is average, with not too many partials present and a small amount of reverb. The tempo is modest, with the shortest notes being around 260 ms. To test the ability to transcribe quicker passages, some seconds of Bach's "Badinerie" were used (flute00.wav). Here the shortest note is about 100 ms, and there is almost no reverb present. Another quick passage with a lot of reverb was also tested (bachfast11.wav): Bach's "Partita in A Minor". A number of other small clips with different instruments were tested as well, to see how sensitive the transcriber was to the instrument used.

Flute5.wav – a simple flute solo

Looking at the spectrogram in fig.20, we could expect to find three to five harmonics, and a visual comparison between this spectrogram and the time-frequency plot of the music clip was used as a benchmark when testing the order selection criteria. Additionally, when a MIDI file was made, the wav-file and the midi-file were compared by listening.

In all of the following time-frequency plots, the blue circles indicate the frequencies found from the angles of the poles, while the red crosses are estimates of the pitch calculated with most_freq.m and are not used in the written MIDI-file.


Figure 20: Spectrogram of flute5.wav.


Testing the different order selection criteria confirmed the one-note results: when we use PEPYW and use the order found directly, we have to use AICk=1 to avoid underestimation. In fact, this seems to be the most practical setting, as it performs very well for different types of instruments and tempos. In fig.21 we see the time-frequency plot with this setting, and it is not too far from the real spectrum; the conversion to MIDI format is perfect. We see, however, that there are some spurious poles where the frequencies are changing. This effect could be reduced if the analysis windows were dynamically adjusted.

A simpler approach is to 'smooth' the order selection curve, taking the average of the estimated order and some of the last orders found. In fig.22 we see the result of averaging with the two preceding orders, and the non-continuous areas are better estimated. However, this solution is probably not a good choice in polyphonic music, since the orders will change more rapidly. Smoothing would then lead to instruments not being detected immediately because the order is too low. In this project the method seems to work fine, and it was used most of the time.
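
A minimal sketch of this smoothing (orders is assumed to hold the order estimate for each block):

% Average each order estimate with the two preceding ones.
smoothed = orders;
for b = 3:length(orders)
    smoothed(b) = round(mean(orders(b-2:b)));
end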

Looking at the spectrum in fig.1 again, we see that the over-harmonics get weaker and weaker, and are often modelled too far from the unit circle to be chosen. This phenomenon is also present in speech, and sometimes a 'pre-flattening' filter that emphasises the higher frequencies is used [Picone93]. This filter is most often a one-tap FIR filter with a ∈ [-1, -0.4]. The problem with this filter is that the noise is boosted as well. In fig.23 we see just a minor improvement using a = -0.85. The gains might be higher when dealing with string-based instruments, where the over-harmonics tend to die out quickly.
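
The filter itself is a single line in Matlab; a sketch with the coefficient used here:

% Pre-flattening (pre-emphasis): y[n] = x[n] + a*x[n-1], with a = -0.85.
a = -0.85;
y = filter([1 a], 1, x);             % boosts the weak high harmonics (and the noise)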


Figure 21: Time-frequency plot of flute5.wav, AIC k=1, PEPYW, order = 1 x estimated (Hz vs. block number).

Figure 22: Time-frequency plot of flute5.wav, AIC k=1, PEPYW, order = 1 x estimated, smoothed.

Figure 23: Time-frequency plot of flute5.wav, AIC k=1, PEPYW, order = 1 x estimated/smoothed, with pre-flattening.


Using the method that seemed successful in the one-note case on page 24 (with four times the number of sinusoids) misses many of the over-harmonics. That is simply because the signal-to-noise ratio (SNR) is worse in this case, and the number of poles has to be further augmented in order to cope with the noise. The SNR will without doubt be important when adding even more instruments, and thereby increasing the order. Using a higher sampling rate (and possibly lowpass filtering to obtain the same signal bandwidth as before) will probably help, since we then have more datapoints on which to base our modelling.

This need to adjust the order according to the noise level is not good for our goal of creating an automatic transcriber. Perhaps an estimate of the SNR could be calculated, and from that a multiplication factor for the number of sinusoids could be selected, giving the best order to use. This might require PEPFB in order to have a precise estimate of the number of sinusoids.

If we knew the average number of sinusoids and the SNR, we could of course use a fixed order with some success. Alternatively, the order could be estimated for bigger blocks, speeding up the analysis.

Figure 26 shows the output of a modified version of the transcriber, using the correlation-based pitch tracker described on page 13. Even though the calculations are optimised to use test sinusoids that are as short as possible, the analysis is rather lengthy, and using segmentation to reduce the number of blocks would help here as well. The resulting MIDI file was perfect.

This method will not be able to distinguish harmonics in the same frequency band. The only hope is to apply signal models and compare the correlation values found with those expected, assuming overlapping harmonics where the value found is higher than expected.


Figure 24: Time-frequency plot of flute5.wav, MDL k=2, PEPFB, order = 2 x estimated.

Figure 25: Time-frequency plot of flute5.wav, fixed order AR(32).

Figure 26: Output of the correlation-based pitch tracker on flute5.wav.


Flute00.wav – a quicker flute solo

This clip is quicker than the previous one, with no reverb and more harmonics. However, it proved to be a bit difficult for the transcriber.

We see that the time-frequency plot is not too far from the spectrogram, and the conversion to MIDI format is not too bad either. The problems that arise when the tempo is increased appear to be more related to the lack of segmentation and to the post-processing of the notes found. Both PEPYW with AICk=1 and PEPFB with MDL or CIC work fine.


Figure 27: Spectrogram of flute00.wav.

Figure 28: Time-frequency plot of flute00.wav, AIC k=1, PEPYW, order = 1 x estimated.


Some different sound files

Fig.29 shows an example file from AudioWorks, this time with a clarinet. Many more harmonics are present, and the chosen order is between 21 and 67. For such a file a fixed order would not work well. A spectrogram of this clip shows weak even harmonics, which is typical for a clarinet.

The transcription is on par with the result from AudioWorks' own transcriber.

Using PEPFB in this case is painfully slow, since the maximum allowable order must be set to 70. It is clear that an analysis window matched to each note must be used, since that would reduce the number of order estimations from about 470 to 35 in this example.

The clip bachfast.wav is another flute solo with a lot of reverb. Energy from up to four notes can be observed simultaneously. This leads to high model orders and higher demands on the post-processing to eliminate the echo. Both the spectrogram and the time-frequency plot are cluttered, but the result of the transcription is not too bad, and is still on par with the AudioWorks transcriber. Such heavy reverb is a problem for the autoregressive methods, since the number of sinusoids explodes. If such files are expected to be converted successfully, some sort of echo cancellation should be employed.

Oboe.wav, being of modest tempo and order, was converted easily.


Figure 29: Time-frequency plot of clarinet example.wav, AIC k=1, PEPYW, order = 1 x estimated/smoothed.

Figure 30: Time-frequency plot of bachfast.wav, CIC k=3, PEPYW, order = 1.5 x estimated.

Figure 31: Spectrogram of bachfast.wav.


Some final words

Limitations of the transcriber

All of the music clips tested share some common characteristics: 1. They are all created by tube resonators, which means that all harmonics 'live' equally long. 2. They have a minimum frequency that is not too low, which means a limited number of harmonics. The reason for omitting piano and guitar music is twofold. Most importantly, these instruments are seldom played monophonically. Another aspect is the lower (possible) fundamental frequencies. Looking at fig.32, which shows an A0 from a grand piano, we see that there are a lot of harmonics and that the higher harmonics die out quickly.

The dying harmonics are not a problem, since the transcriber needs only one harmonic to decide a note. The high number of harmonics is worse, especially if we are dealing with polyphony. It means that it is harder to find the best order, and the frequency estimates will be less accurate. To remedy this, we have to increase the number of datapoints by increasing the sampling rate and/or segmenting the music to form bigger analysis blocks.

The program uses a fixed key-press (loudness) for all notes; this is seldom the case in real music.

The durations of the notes are not rigorously respected, since the analysis is done with fixed blocks. Additionally, the relations between the timing/tempo parameters in the MIDI specification were not completely understood, so some files experience incorrect conversion with the standard setup. Some manual adjustment of the parameters fixes the problem.

The speed is also a problem. Reducing the number of calculations by reducing the number of analysis blocks is desired; in other words, segmentation is needed. Of course, critical parts could be coded in C.

Different noise levels are not accounted for. Since more noise implies higher AR model orders, some automatic adjustment of the model order according to the noise level should be employed if less hand-adjusting is desired.


Figure 32: An A0 from a grand piano.


Improvements for the transcriber and ideas for future work

No further work is planned in this project, at least not on a professional basis. However, some suggestions for further work are given.

Without doubt, the most important modification of the transcriber is to do segmentation, allowing the analysis window to cover the whole note. This has several advantages:

1. More datapoints are available in the analysis window, leading to better frequency and order estimations.

2. Two consecutive notes are not mixed in the same analysis window.

3. The note duration will be respected in the MIDI representation.

4. The number of calculations can be reduced, since fewer order estimations are needed.

5. The attack of the note can be analysed to identify the instrument.

6. The relative loudness of each note can easily be determined.

Further work includes implementing point 6 above to respect each note's loudness, and making a GUI for easier testing of the different important parameters.

Some ideas that require a bit more research before integration into the transcriber are:

AR modelling in sub-bands. If D. Bonacci's research is successful, we would be able to do AR modelling in sub-bands. This would enable us to use many low-order models in place of one high-order model, possibly making the estimations more reliable (and faster).

Adaptive sequential algorithms. Many adaptive algorithms for spectrum estimation exist, updating the spectrum for every data point arriving and requiring only O(9m) multiplications instead of the usual O(p²) multiplications [Kalou87]. These algorithms could be used for frequency estimation in real time, possibly adding the benefits of detecting note changes and avoiding order estimations.

Directionality. A stereo signal usually contains information about spatial placement, and using for example MUSIC or Capon's method, this information could be exploited to suppress all but one instrument, thus improving polyphonic transcription. Frequency estimation could be done simultaneously.

Phase information. No work regarding the phase information in a music signal was found, not even work excluding the possibility. The idea is that if there is any relation between the phases of the harmonics created in the instrument, one could detect whether a harmonic is mixed with a harmonic from another instrument (having a different phase).



Conclusion

It has been demonstrated that AR frequency estimation is a viable approach to transcribing music. The AR methods are able to provide higher resolution than their Fourier counterparts, which is important when the number of simultaneous notes increases. The frequency estimates from Principal Component Frequency Estimation using the Modified Covariance Method are reliable even in the presence of noise, and even if some of the harmonics are completely masked. Fig.33 shows two flutes played simultaneously, where the C5 is completely masked. With 1000 datapoints, using MDL with PEPFB gives almost the correct number of sinusoids, and using an order of four times the number of sinusoids actually makes it possible to find the hidden C5. This is a promising result when considering polyphonic transcription.

Order estimation is crucial for optimal performance. The best order is not equal to 2*(number of real sinusoids), but somewhat higher, depending on the amount of noise. Using PEPFB together with the information criteria MDL or CIC seems to give the best estimate of the number of sinusoids. Then using an estimate of the noise level to find a multiplier for the order found seems to be a way to cope with different SNRs. However, calculating all the orders up to the maximum allowable order for every block using the Modified Covariance Method is slow. A faster and 'less correct', but well-performing, method is utilised in the project, namely using the order found from AICk=1 with PEPYW. This works remarkably well, even for different types of instruments.

A monophonic transcriber has been built, taking a wav-file as input and giving a MIDI-file as output. The program performs on par with the commercially available monophonic transcribers, and has been built with a possible extension to polyphonic operation in mind.


Figure 33: Pole plot (real vs. imaginary part), C4+C5, 91 ms, MDLk=2, PEPFB, order = 2 x estimated.


A – Converting between Note, MIDI and Frequency

The A with MIDI number 21 is called A0, the next one A1, and so on; the other notes are named similarly. The table is created from equation 3, where the MIDI number equals k+69.


Table 4

Note  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]  MIDI  Freq.[Hz]
C        0      8.176    12     16.352    24     32.703    36     65.406    48    130.813    60    261.626
Db       1      8.662    13     17.324    25     34.648    37     69.296    49    138.591    61    277.183
D        2      9.177    14     18.354    26     36.708    38     73.416    50    146.832    62    293.665
Eb       3      9.723    15     19.445    27     38.891    39     77.782    51    155.563    63    311.127
E        4     10.301    16     20.602    28     41.203    40     82.407    52    164.814    64    329.628
F        5     10.913    17     21.827    29     43.654    41     87.307    53    174.614    65    349.228
Gb       6     11.562    18     23.125    30     46.249    42     92.499    54    184.997    66    369.994
G        7     12.250    19     24.500    31     48.999    43     97.999    55    195.998    67    391.995
Ab       8     12.978    20     25.957    32     51.913    44    103.826    56    207.652    68    415.305
A        9     13.750    21     27.500    33     55.000    45    110.000    57    220.000    69    440.000
Bb      10     14.568    22     29.135    34     58.270    46    116.541    58    233.082    70    466.164
B       11     15.434    23     30.868    35     61.735    47    123.471    59    246.942    71    493.883

Note  MIDI  Freq.[Hz]  MIDI   Freq.[Hz]  MIDI   Freq.[Hz]  MIDI   Freq.[Hz]  MIDI    Freq.[Hz]
C       72    523.251    84   1046.502    96    2093.005   108   4186.009   120    8372.018
Db      73    554.365    85   1108.731    97    2217.461   109   4434.922   121    8869.844
D       74    587.330    86   1174.659    98    2349.318   110   4698.636   122    9397.273
Eb      75    622.254    87   1244.508    99    2489.016   111   4978.032   123    9956.063
E       76    659.255    88   1318.510   100    2637.020   112   5274.041   124   10548.082
F       77    698.456    89   1396.913   101    2793.826   113   5587.652   125   11175.303
Gb      78    739.989    90   1479.978   102    2959.955   114   5919.911   126   11839.822
G       79    783.991    91   1567.982   103    3135.963   115   6271.927   127   12543.854
Ab      80    830.609    92   1661.219   104    3322.438   116   6644.875
A       81    880.000    93   1760.000   105    3520.000   117   7040.000
Bb      82    932.328    94   1864.655   106    3729.310   118   7458.620
B       83    987.767    95   1975.533   107    3951.066   119   7902.133
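
The table can be regenerated from equation 3 in a few lines; a minimal sketch:

% Equal-tempered frequencies for all MIDI numbers (MIDI number = k + 69).
midiNr = 0:127;
k = midiNr - 69;                     % semitones relative to A4 = 440 Hz
freq = 440 * 2.^(k/12);              % frequencies in Hz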


B – Matlab code for the transcriber



References

[Bisc92]      C.H. Bischof, M. Shroff, "On Updating Signal Subspaces", IEEE Trans. Sig. Proc., Vol. 40, No. 1, 1992.
[Broer00]     P.M.T. Broersen, "Finite Sample Criteria for Autoregressive Order Selection", IEEE Trans. Sig. Proc., Vol. 48, No. 12, 2000.
[Brown99]     J. Brown, "Computer identification of musical instruments using pattern recognition with cepstral coefficients as features", MIT Media Labs, 1999.
[Dick94]      J.R. Dickie, A.K. Nandi, "On the Performance of AR Model Order Selection Methods", Signal Processing VII: Theories and Applications, 1994.
[Djuric96]    P.M. Djuric, "A Model Selection Rule for Sinusoids in White Gaussian Noise", IEEE Trans. Sig. Proc., Vol. 44, No. 7, 1996.
[Djuric98]    P.M. Djuric, "Asymptotic MAP Criteria for Model Selection", IEEE Trans. Sig. Proc., Vol. 46, No. 10, 1998.
[Durne98]     M. Durnerin, Operation ASPECT, 1998.
[Feldman]     J. Feldman, "Derivation of the Wave Equation", http://www.math.ubc.ca/~feldman/apps/wave.pdf.
[Fitch00]     J. Fitch, W. Shabana, "A Wavelet-based Pitch Detector For Musical Signals", University of Bath, UK.
[Fuchs88]     J.J. Fuchs, "Estimating the Number of Sinusoids in Additive White Noise", IEEE Trans. ASSP, Vol. 36, No. 12, 1988.
[Jehan97]     T. Jehan, "Musical Signal Parameter Estimation", http://www.cnmat.berkeley.edu/~tristan/Thesis/thesis.html, 1997.
[Kalou87]     N. Kalouptsidis, S. Theodoridis, "Fast Adaptive L-S Algorithms for Power Spectral Estimation", IEEE Trans. ASSP, Vol. 35, pp. 95-108, May 1987.
[Karaj99]     M. Karjalainen, T. Tolonen, "Multi-Pitch and Periodicity Analysis Model for Sound Separation and Auditory Scene Analysis", http://citeseer.nj.nec.com/411704.html, 1999.
[Kashino95]   Kashino, Nakadai, Kinoshita, Tanaka, "Organization of Hierarchical Perceptual Sounds", http://citeseer.nj.nec.com/27731.html, 1995.
[Kashino98]   Kashino, Nakadai, Kinoshita, Tanaka, "Application of Bayesian Probability Network to Musical Scene Analysis", http://citeseer.nj.nec.com/kashino98application.html, 1998.
[Kay88]       S.M. Kay, "Modern Spectral Estimation, Theory & Application", Prentice Hall, 1988.
[Klapuri01A]  A. Klapuri, "Means of Integrating Audio Content Analysis Algorithms", 2001.
[Klapuri01B]  A. Klapuri, "Multipitch Estimation And Sound Separation By The Spectral Smoothness Principle", 2001.
[Klapuri98]   A. Klapuri, "Automatic transcription of music", http://www.cs.tut.fi/sgn/arg/music/klapthes.pdf.zip, 1998.
[Lee92]       H.B. Lee, "Eigenvalues and Eigenvectors of Covariance Matrices for Signals Closely Spaced in Frequency", IEEE Trans. Sig. Proc., Vol. 40, No. 10, 1992.
[Mallat99]    S. Mallat, "A Wavelet Tour of Signal Processing", Academic Press, 1999.
[Martin96]    K. Martin, "Automatic Transcription of Simple Polyphonic Music: Robust Front End Processing", Third Joint Meeting of the Acoustical Societies of America and Japan, ftp://sound.media.mit.edu/pub/Papers/kdm-TR399.ps.gz, 1996.
[Martin98]    K. Martin, "Musical instrument identification: A pattern-recognition approach", 136th meeting of the Acoustical Society of America, 1998.
[McNab]       R.J. McNab, L.A. Smith, I.H. Witten, C.L. Henderson, S.J. Cunningham, "Towards the Digital Music Library: Tune Retrieval from Acoustic Input", University of Waikato, Hamilton, New Zealand, 1996.
[Picone93]    J.W. Picone, "Signal Modeling Techniques in Speech Recognition", Proc. of the IEEE, pp. 1215-1247, Sept. 1993.
[Proakis]     J.G. Proakis, D.G. Manolakis, "Digital Signal Processing: Principles, Algorithms, and Applications", Prentice Hall, 1996.
[Schro00]     T. von Schroeter, "Auto-regressive spectral line analysis of piano tones", 2000.
[Terhardt]    E. Terhardt, "Psychoacoustics related to musical perception", http://www.mmk.ei.tum.de/persons/ter.html.
[Vercoe97]    B.L. Vercoe, W.G. Gardner, E.D. Scheirer, "Structured Audio: Creation, Transmission, and Rendering of Parametric Sound Representations", Proc. of the IEEE, pp. 922-940, May 1998.
[Wax85]       M. Wax, T. Kailath, "Detection of Signals by Information Theoretic Criteria", IEEE Trans. ASSP, Vol. 33, No. 2, 1985.
[WolfeA]      J. Wolfe, "Music Acoustics Group, The University of New South Wales, Australia", http://www.phys.unsw.edu.au/music/.
[WolfeB]      J. Wolfe, "How harmonic are harmonics", http://www.phys.unsw.edu.au/~jw/harmonics.html.
