
HONOURS RESEARCH REPORT

Automatic Music Transcription for Music Information Retrieval ofPolyphonic Piano Pieces

AUTHOR: BONGANI SHONGWE

SUPERVISOR: PRAVESH RANCHOD

DEPARTMENT OF COMPUTER SCIENCE

UNIVERSITY OF THE WITWATERSRAND, JOHANNESBURG

November 23, 2011


Abstract

Music Information Retrieval (MIR) is a growing field of study concerned with obtaining information from audio recordings using computational algorithms. Many applications have been developed from the research conducted in MIR which are able to categorize, manipulate, query and even create music. These applications have had a profound effect on computer science, musicology and society. Within the MIR field, there has been a strong focus on developing music transcription applications. Music transcription is the process of producing symbolic notation information, such as a score sheet or a MIDI file, from an audio recording of music. Obtaining the symbolic notation is an important factor in MIR, as the notation obtained from a song can be used for a number of tasks. As an example, the notation acquired from two audio recordings could be used to measure similarity between the songs to resolve copyright issues. The notation acquired from an audio recording can also be used to classify a song into a specific music genre.

This document outlines the development of an application which is able to retrieve information about an audio recording. This is done by obtaining notation of the recording through transcription and using that notation to search a database, retrieving information by matching it against the music notation of the songs contained in the database. A problem which arises is that the notation produced by a transcription application may contain errors. These errors may include missing notes or incorrectly labelled notes. To accommodate these errors, approximate string matching is used to match songs.

The system developed is tested to see if it can correctly identify various songs from two differently structured databases, in order to gauge the system's accuracy and efficiency. The results obtained from the tests are very positive, even though there are cases in which songs were incorrectly identified. These cases lead to a clearer insight into music transcription and music information retrieval.


Declaration

I, Bongani Shongwe, hereby declare the contents of this research report to be my own work. This report is submitted for the degree of Bachelor of Science with Honours at the University of the Witwatersrand. This work has not been submitted to any other university, or for any other degree.


Acknowledgements

I would like to thank Pravesh Ranchod for offering me this topic and for all the advice and guidance given. His support and input into the research, most of all this document, has been invaluable. I would also like to thank Alan Gostin for providing his piano note detection code; thanks to this, the research was completed in time. Furthermore, I would like to thank the National Research Foundation (NRF) for the financial assistance provided.


Contents

Abstract

Declaration

Acknowledgements

1 Introduction

2 Background
2.1 Introduction
2.2 Music Transcription
2.2.1 Background
2.2.2 Traditional form
2.2.3 Automatic Transcription
2.2.4 Note Onset Detection
2.2.5 Decoding Frequencies into Notes
2.2.6 Usage
2.3 Evaluating Automatic Music Transcription Systems
2.4 Digital Music Notation
2.4.1 Background
2.4.2 MIDI files
2.5 Approximate String Matching
2.5.1 Background
2.5.2 Levenshtein Distance
2.6 Conclusion

3 Research Methodology
3.1 Introduction
3.2 Research Hypothesis
3.3 Research Method
3.3.1 Audio File Database
3.3.2 Digital Notation Database
3.4 Implementation of System Design
3.4.1 Automatic Transcription System
3.4.2 Reading and Writing of MIDI files
3.4.3 Levenshtein distance algorithm
3.5 Experimentation
3.5.1 System Specification
3.5.2 Timing
3.6 Finding Songs of Similar Nature
3.7 Conclusion

4 Results & Discussion
4.1 Introduction
4.2 Observations from the Automatic Music Transcription System
4.3 Time Taken to Identify a Song
4.4 Identifying Songs of Similar Nature
4.5 Conclusion

5 Future Work
5.1 Introduction
5.2 Incorrect Song Matched
5.2.1 Improvement of the Automatic Transcription System
5.2.2 Tie-breaker
5.2.3 MIDI Database
5.2.4 Finding Songs of Similar Nature using Local Alignment
5.3 Client-Server Implementation
5.4 Computational Time
5.5 Programming Language
5.6 Further Development
5.7 Conclusion

6 Conclusion

References

A MIDI Note Number Table


List of Figures

2.1 Spectrogram representation using Fast Fourier Transform from a piano piece
2.2 Note Onset Detection (red) against a WAVE file (blue)
2.3 Sheet music representation of Georges Bizet's Carmen Suite - Act 1
2.4 A piano roll representation of Georges Bizet's Carmen Suite - Act 1
3.1 A visual layout of the system architecture
4.1 Comparison of piano rolls from Yoko Kanno's Gravity (top) and the transcription of that song (below)
4.2 The time it took to return a result from the digital notation database vs. that of the audio file database


Chapter 1

Introduction

In the world of music production there has been a change in the way music is created. The shift from traditional analogue recording to digital recording has been driven by ever evolving technology. Digital Audio Workstations (DAWs) have become useful tools in the process of music creation, where music can be recorded and produced on a computer with DAW software. DAWs can be used to create music by taking in a digital musical score sheet created by the musician, or a digital score file, and producing sound with the use of synthesizers (or samplers), which can then be written into an audio file.

The existence of such synthesizer programs shows that it is possible to take information about the pitch, timing and velocity of a music piece and translate it into an audio file. The reverse of this process would be to take an audio file and transcribe a digital musical score by inferring that information from the audio file.

Music transcription is the process of transforming a musical audio recording into its symbolic note representation. The problem of performing music transcription can be solved easily for monophonic music sequences but is more difficult for polyphonic sequences, which is of particular interest as most real world music sequences are polyphonic.

Music information retrieval is the process of obtaining attributes from an audio recording for use in various tools and applications. Examples of attributes which can be retrieved from an audio recording are the tempo of a song, the frequencies produced at any point in a song, or the volume levels produced by a song. Music information retrieval has become a growing field of study where transcription applications are used and evaluated in obtaining information from audio recordings [Downie 2008].

With the information obtained from an audio recording, a number of tasks can be performed. For example, an application can be created to automatically 'tag' a song. By automatically tagging audio recordings, a song can be classified into a certain genre, playlists can be created dependent on the user's mood, songs can be automatically rated according to the user's musical taste and suggestions of songs to listen to can be made. There are also music information retrieval query tasks which involve getting information from a database. These are the focus of this document.

The basis of this research is as follows: given an audio recording without any information on the musician or the name of the song, use music information retrieval techniques (such as music transcription) to obtain information on the song by searching through a database.

The document is divided into the following chapters. Chapter 2 provides background information on the research topic. The chapter establishes the finer details of music transcription and how musical notation can be obtained from an audio file using computational procedures. The chapter also discusses MIDI files, from which musical notation can also be obtained.

Chapter 3 focuses on the methodology behind the research. A research hypothesis is given with motivation as to why the hypothesis may be true. The chapter outlines the design of a music transcription application and how it will be used for music information retrieval. Experiments to prove the hypothesis are also discussed.


The results from the experiments are presented in Chapter 4. From the results, observations are made on the music information retrieval system. The chapter also provides an answer to the hypothesis stated in Chapter 3. Given the observations made, future work which should be conducted is given in Chapter 5. Chapter 5 lists ideas for creating a more sophisticated music information retrieval application, and also states what improvements can be made.

The document concludes in Chapter 6, summarising the main points from the research conducted.


Chapter 2

Background

2.1 Introduction

In order to have an understanding of music transcription and the proposed research methodology, it is necessary to establish the context of the research. In Section 2.2 the idea of music transcription is introduced, together with its different forms and uses, as well as other related factors. Details are given on how computer applications perform music transcription by taking the frequencies obtained from an audio recording and assigning the frequencies to notes.

In Section 2.3 a list of parameters is given that assess a transcription application's accuracy in transcribing a song. Digital music notation is covered in Section 2.4, where its use in digital music recording is discussed along with how it assists in music transcription. MIDI files are also addressed, as they are a popular form of digital music notation. Section 2.5 gives an overview of approximate string matching, in which the Levenshtein distance algorithm is reviewed. Finally, Section 2.6 concludes this chapter.

2.2 Music Transcription

2.2.1 Background

Music transcription is a discipline in Music Information Retrieval (MIR). It is the process of converting a musical audio recording into symbolic notation such as sheet music or any equivalent representation, usually containing event information associated with pitch, note duration and intensity [Argenti et al. 2010].

2.2.2 Traditional form

The task of transcribing music can be done by a musician by listening to a piece of music, distinguishing which notes are played, and then manually creating the score. For a trained professional it may not be a difficult task, but for others it may be time consuming, and spending money on hiring a trained professional to transcribe the music is undesirable.

The effort involved in transcribing music from recordings varies depending on the complexity of the music, how well trained the musician is at distinguishing notes and how detailed the transcription must be [Robinson 2001]. For example, a simple chord from a song can easily be identified and written down, but a full, detailed transcription of a complex song with many polyphonic sounds will take some time.

2.2.3 Automatic Transcription

Automatic transcription is the use of a machine to infer information from an audio file and output the notation of the music piece in a format that can be understood by humans or digital audio workstations [Argenti et al. 2010].


There are relatively few cases where a piece of music is monophonic (one note played at a time), such as a trombone solo. In such cases, a musician may easily transcribe the audio piece without the assistance of automatic transcription software. Since most pieces of music are polyphonic (more than one note played at a time), it becomes much more difficult for someone to transcribe the audio piece. This is where automatic transcription systems can assist.

The basic requirement of any automatic transcription system is that it correctly identifies the notes within each time frame of the audio recording. Though the methodology used in automatic transcription differs from system to system, there are two common phases.

The first phase consists of calculating the time-frequency representation of the musical signal. That is, the frequency content of the sound in the audio recording is calculated by transforming it from its time domain to its frequency domain, which can be accomplished with various techniques ranging from the Fast Fourier Transform to bilinear distributions. This phase will be referred to as the frequency analysis phase. Figure 2.1 shows a spectrogram representation of the frequencies of a piano piece after a Fast Fourier Transform was performed. The spectrogram representation was created using Sonic Visualiser [Cannam et al. 2010]. The second phase consists of the time-frequency representation being refined by segmenting it into a sequence of frames, which allows the system to identify which notes are present. Each segment is then labelled with a note, and the segments are then grouped back together using different algorithms. This is often referred to as pitch detection/estimation, which will be described in more detail further on in this document. A simplified sketch of the frequency analysis phase is given after Figure 2.1.

Figure 2.1: Spectrogram representation using Fast Fourier Transform from a piano piece
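To make the frequency analysis phase concrete, the short MATLAB sketch below splits a mono signal x (a column vector with sample rate fs) into frames and applies a Fast Fourier Transform to each one, producing the kind of time-frequency representation shown in Figure 2.1. It is an illustrative example only, with arbitrarily chosen frame and hop sizes, and is not the transcription system used later in this document.

% Frame-wise FFT of a mono signal x (column vector) sampled at fs Hz.
frameLen = 4096;                                                  % samples per analysis frame
hop      = 2048;                                                  % hop between consecutive frames
win      = 0.54 - 0.46*cos(2*pi*(0:frameLen-1)'/(frameLen-1));    % Hamming window
nFrames  = floor((length(x) - frameLen) / hop) + 1;
spec     = zeros(frameLen/2, nFrames);                            % magnitude spectrogram (bins x frames)

for k = 1:nFrames
    idx        = (k-1)*hop + 1;
    frame      = x(idx : idx+frameLen-1) .* win;   % windowed frame
    X          = fft(frame);                       % time domain -> frequency domain
    spec(:, k) = abs(X(1:frameLen/2));             % keep positive-frequency magnitudes
end
freqs = (0:frameLen/2 - 1)' * fs / frameLen;       % frequency (Hz) of each FFT bin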

2.2.4 Note Onset Detection

Another problem within automatic music transcription is how, when identifying a set of notes, it can be determined that a new set of notes has started playing in a musical piece. Note onset detection is a technique which has been used in some automatic transcription systems to detect the start of new musical events (i.e. notes) [v. d. Boogaart and Lienhart 2009]. Note onset detection is said to be more accurate in detecting new events compared to other event detecting techniques such as frame-wise detection [v. d. Boogaart and Lienhart 2009]. Figure 2.2 shows note onset detection (the red stems) performed on a WAVE file (the blue waves) containing a piano piece. Figure 2.2 was created using Sonic Visualiser [Cannam et al. 2010]. It can be stated that new events occur in an audio recording when a sudden change in energy within the frequency domain is detected; the stems in Figure 2.2 peak at these points of sudden energy change.

Figure 2.2: Note Onset Detection (red) against a WAVE file (blue)

2.2.5 Decoding Frequencies into Notes

Piano’s which are tuned correctly have a mathematical correlation between the notes and the frequenciesproduced by each note. Most standard pianos contain 88 keys, with each key producing a frequency inhertz (cycles per second) [Kammler 2007]. Each frequency is known to be 12

√2 times larger than the

preceding note [Rossing 1982]. The relation between the notes and the frequency produced by eachnote is presented in Table 2.1. The full list containing all the note-frequency relations can be found atWikipedia [2011] or in Kammler [2007].

Key Number   Note   Frequency (Hz)
47           G4     391.995
48           G#4    415.305
49           A4     440.000
50           A#4    466.164
51           B4     493.883
52           C5     523.251
53           C#5    554.365
54           D5     587.330
55           D#5    622.254

Table 2.1: Extract of Piano Note Frequency Table taken from Wikipedia [2011]

The 49th key, A4, produces a frequency of 440 Hz. Using the mathematical function given below, the frequencies of other keys can be computed, where n is the key number.

f(n) = 440 · (2^(1/12))^(n − 49)


As an example, substituting n = 50 into the equation gives f(50) = 466.164, the exact frequency of the 50th key given in Table 2.1. From this, a logarithmic property can be deduced which makes it possible to compute the key number given the frequency [Gostin 2006]. The function is given below, where f is the frequency in hertz; the result is rounded to the nearest whole number.

n = 12 · log2(f / 27.5) + 1
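As an illustration, the two formulas above can be written directly as MATLAB functions. This is a minimal sketch; the function names keyToFrequency and frequencyToKey are chosen here for illustration and are not part of the system described in Chapter 3.

% Frequency (Hz) of piano key n, with key 49 (A4) fixed at 440 Hz.
function f = keyToFrequency(n)
    f = 440 * (2^(1/12))^(n - 49);
end

% Nearest piano key number for a frequency f in Hz, using key 1 (A0) = 27.5 Hz.
function n = frequencyToKey(f)
    n = round(12 * log2(f / 27.5) + 1);
end

For example, keyToFrequency(50) returns 466.164 Hz and frequencyToKey(466.164) returns 50, matching Table 2.1.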

2.2.6 Usage

Transcribing music has been the basis of creating sheet music for songs which do not have score sheets. There are times when a musician wishes to perform a previously recorded song, known as performing a cover version. In order to perform the song, sheet music is required and may be unobtainable, so the original song has to be transcribed. Another reason to transcribe music is for the act of taking a portion or sample of a sound recording and reusing it as an instrument or sound in a new song, known as sampling [Wikipedia 2010]. An issue with this is that the sound recording may not be clear enough to be used in a song. In this situation it would be best to recreate the music audio from the score sheet obtained from the transcribed audio piece.

2.3 Evaluating Automatic Music Transcription Systems

The Music Information Retrieval Evaluation eXchange (MIREX) is a community-based framework for the formal evaluation of Music Information Retrieval (MIR) systems and algorithms [Downie 2008]. The community annually assesses new techniques and algorithms tailored for music information retrieval. As of 2009, the following evaluation parameters had been gathered from the various research conducted by MIREX for evaluating automatic transcription systems [Argenti et al. 2010]:

• Accuracy: Developed by Dixon [2000], this formula calculates the overall transcription performance,

Acc = TP / (TP + FN + FP)

where TP (true positives) is the number of correctly identified notes, FP (false positives) is the number of notes reported by the system that were not played and FN (false negatives) is the number of notes played that were not reported by the system. An incorrectly identified note is counted as both a false positive (an incorrectly reported note) and a false negative (a note that should have been reported). It must be noted that the formula does not give insight into the trade-off between notes that are missed and notes that are identified [Poliner and Ellis 2007].

• Precision: the ratio of correctly transcribed pitches to all transcribed pitches for each frame,

Prec = TP / (TP + FP)

• Recall: the ratio of correctly transcribed pitches to all ground truth reference pitches for each frame,

Rec = TP / (TP + FN)

• F-measure: a measure yielding information about the balance between FP and FN,

F = (2 × Prec × Rec) / (Prec + Rec)

There are other evaluation formulas that assess automatic transcription systems, which won't be discussed here but can be found in Raphael [2002], Poliner et al. [2007] and Poliner and Ellis [2007] for the interested reader.
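These metrics are straightforward to compute once the counts are known. The MATLAB helper below is an illustrative sketch only, not part of the system described in Chapter 3: it returns all four values given the TP, FP and FN counts for one transcription.

% Evaluation metrics for a transcription, given counts of true positives (tp),
% false positives (fp) and false negatives (fn).
function [acc, prec, rec, f] = transcriptionMetrics(tp, fp, fn)
    acc  = tp / (tp + fn + fp);            % overall accuracy (Dixon 2000)
    prec = tp / (tp + fp);                 % precision
    rec  = tp / (tp + fn);                 % recall
    f    = 2 * prec * rec / (prec + rec);  % F-measure
end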


2.4 Digital Music Notation

2.4.1 Background

Digital music notation may be represented in a number of ways, such as piano roll based notes or sheet music. The format in which the notation is represented depends on the user and the sequencing software used. A problem is that not all systems use the same format for representing music notation. Some people may prefer composing with traditional sheet music such as in Figure 2.3, while others may have adapted to the piano roll used by multi-track recording systems, such as in Figure 2.4. As a result, many musicians may not be able to use a form other than the one they commonly use to create music pieces, and a problem may arise when a musician is working on the same composition with other musicians who use different formats of note representation.

For computer based sequencing software, MIDI files are common data files which can be used among different recording software. MIDI files also provide the information needed to convert one form of music notation to another.

Figure 2.3: Sheet music representation of Georges Bizet's Carmen Suite - Act 1

Figure 2.4: A piano roll representation of Georges Bizet's Carmen Suite - Act 1

2.4.2 MIDI files

Musical Instrument Digital Interface (MIDI) files do not contain the actual sounds of instruments (or voices) like other digital audio formats. All that a MIDI file contains is the "information of the performance", representing the musical piece's notes, volume and instruments [Association 2010]. MIDI files are extremely compact and contain a lot of information for content based analysis and queries [Raphael 2002].


A reason why MIDI files are popular is that they only contain a list of events. Therefore MIDI files can be imported into any music software environment (given support) and the playback device (recording software, sound card etc.) will be set to generate the sounds specified by the MIDI file. MIDI files are also flexible, since a MIDI file can be easily edited.

A standard MIDI file consists of two chunks. The first is the header, which appears at the beginning of the file; the second chunk is the track [Selfridge-Field 1997]. The header chunk indicates the format of the MIDI file, such as the number of tracks contained in the file. In the second chunk, the track(s) contain events pertaining to the song. These events are known as track events. There are several track events which can occur in a MIDI file; only two of these will be mentioned, and the interested reader may look at Selfridge-Field [1997]. The two events which will be discussed are the note-on event and the note-off event. The note-on event indicates when a key is pressed, while the note-off event indicates when a key is released. Each of these events is associated with a number indicating which key is pressed/released. Appendix A gives a table which maps each MIDI note number to a key.
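As a small illustration of these two events, the MATLAB snippet below builds the three bytes of a hypothetical note-on and note-off pair on channel 1 (status bytes 0x90 and 0x80). This is a simplified example: in an actual MIDI file each event is also preceded by a variable-length delta time, which is omitted here.

% A hypothetical note-on / note-off pair for MIDI note number 60 (middle C).
noteOn  = uint8([hex2dec('90'), 60, 64]);   % note-on : channel 1, note 60, velocity 64
noteOff = uint8([hex2dec('80'), 60, 0]);    % note-off: channel 1, note 60, velocity 0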

2.5 Approximate String Matching

2.5.1 Background

String matching algorithms are used to find matching sequences of strings. Many algorithms have been developed and improved upon to ensure that strings can be easily found [Gusfield 1997]. String matching algorithms can be used in querying tasks such as finding out if a substring is contained in a sequence. Such queries are often done in bioinformatics when searching for a DNA sequence [Gusfield 1997].

Due to some error (or DNA evolution), a problem arises where the string which is being queried is not contained in the sequence. Take, for example, using an electronic dictionary to search for a word when the word is misspelled. The search would normally return nothing because the word is not found, but it is possible to return a list of words closely related to the misspelled word. This is possible by using approximate string matching. Instead of trying to match the word character for character, approximate string matching algorithms mitigate character errors and return the closest matching word.

2.5.2 Levenshtein Distance

The Levenshtein distance algorithm computes the numerical difference between two strings. This is done by calculating the minimum number of edits it would take to transform one string into another. The allowable edits are a deletion, insertion or substitution of a single character [Gusfield 1997]. For example, the minimal number of edits which can be performed on the word 'bomb' to change it to 'boom' is 2:

1. bomb → bom (deletion of b)

2. bom → boom (insertion of o)

The Levenshtein algorithm has a complexity of O(mn), where m and n are the lengths of the two strings being compared. The Levenshtein algorithm can be used for many approximate string matching applications. For example, it can be used to build up a short list of suggested words for a spellchecker, by calculating the edit distance between the incorrect word and each word in a dictionary. The algorithm can also be used in bioinformatics, such as for the analysis of DNA evolution [Gusfield 1997].

2.6 Conclusion

This chapter sought to establish the context of the research. The framework of music transcription was addressed in Section 2.2. Automatic music transcription systems were also brought to light, where the mechanics of transcribing were discussed. The differences between the traditional form of music transcription and automatic music transcription were compared, and it was discussed why automatic music transcription may prevail over the traditional form when transcribing complex pieces of music. Section 2.4 introduced digital notation files. Musical notation can be retrieved from a digital notation file by simply extracting the contents of the file. With this established, the background given in this chapter provides context for the research methodology given in the next chapter.


Chapter 3

Research Methodology

3.1 Introduction

In the previous chapter, it was shown that there are two methods by which an application can obtain music notation. The first method discussed was to use an automatic music transcription system to obtain the notation from an audio file. The second method is to read the contents of a MIDI file and retrieve the events listed in the track chunk. Using the background from the previous chapter, this chapter presents the research methodology.

Section 3.2 gives a research hypothesis based on the background concerning music information retrieval. In Section 3.3 a research method is outlined describing how to address the hypothesis by creating a music information retrieval application. Motivation for why the approach was chosen and how it relates to the hypothesis is also stated. The research methodology is divided into the following sections: Section 3.4 discusses the implementation of the application; Section 3.5 discusses the experimentation procedure used to test the hypothesis; and Section 3.6 extends the research methodology by introducing another experiment which will be conducted.

3.2 Research Hypothesis

In Section 2.2 it was shown that automatic transcription systems can be used to obtain notation from audio recordings. With this notation, a search can be done amongst a collection of songs to retrieve information about the input song. A problem arises, which is: what type of file should be stored in the database to be used for comparison against a song? Using digital notation files seems logical, as audio files would each require transcribing, which takes more time than reading the notation from a digital notation file. Thus the research hypothesis can be stated as follows:

• Automatic music transcription systems can be used to match notation files for song information retrieval, which will provide more efficient results than using audio files for querying information on a song.

3.3 Research Method

The aim of this research is to create a music information retrieval system which takes as input an audio recording containing a polyphonic piano piece. From the input audio, the system must correctly identify the title and artist of the piano piece. A brief layout of the architecture is given in Figure 3.1.

An audio file is taken in as input to be transcribed. A MIDI file is the output of the transcription, which can also be used for analysis. The transcribed notation is then compared against either the digital notation database or the audio file database: it is compared to every song in the chosen database using the Levenshtein distance algorithm, and the song with the smallest edit distance is considered the matching song and is returned as the result. A minimal sketch of this matching step is given below, after Figure 3.1.


Figure 3.1: A visual layout of the system architecture
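The sketch below illustrates the matching step in MATLAB. It assumes the transcribed notation and each database entry have already been reduced to character strings (Section 3.3.2) and that a levenshtein function like the one in Section 3.4.3 is available; the variable names queryStr, dbStrings and dbTitles are illustrative and not taken from the actual implementation.

% Find the database song whose notation is closest to the query notation.
% queryStr  : char string of the transcribed input song
% dbStrings : cell array of notation strings, one per database song
% dbTitles  : cell array of the corresponding "title - artist" labels
bestDist  = inf;
bestTitle = '';
for i = 1:numel(dbStrings)
    d = levenshtein(queryStr, dbStrings{i});   % edit distance to this song
    if d < bestDist
        bestDist  = d;
        bestTitle = dbTitles{i};
    end
end
fprintf('Best match: %s (edit distance %d)\n', bestTitle, bestDist);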

Motivation

During the investigation of automatic transcription systems, it was discovered that no automatic transcription system is able to produce a perfect note-for-note transcription of every song [Poliner et al. 2007]. This means that a notation file received from a transcription system may have notes missing (false negatives), notes which were never played but are present (false positives) and incorrectly labelled notes. Directly matching the notation sequence against the database would therefore likely not return any result. To solve this problem, approximate string matching is used.

3.3.1 Audio File Database

The audio file database consists of WAVE files. The WAVE files were created from the digital notation files contained in the digital notation database. This was done using Propellerhead's Reason 5 NN-XT Advanced Sampler [Hylvander and Nordmark 2010]. The WAVE files were saved in CD quality format, i.e. a 44100 Hz sample rate with 16 bit depth [Watkinson 1994]. When using the WAVE files for comparison, transcription is performed on each file in order to obtain the notation from the audio.

Motivation

WAVE files were selected because they are an uncompressed format, so there is no loss of quality to factor in when transcribing the audio. Due to the popularity of WAVE audio encoding, a lot of systems and applications are able to use WAVE files. Recall from Section 2.2.5 that the mathematical function used in determining the notes holds for pianos which are tuned correctly; Propellerhead's Reason 5 NN-XT Advanced Sampler was used to create the audio files because it is able to produce theoretically ideal piano sounds.


3.3.2 Digital Notation Database

The digital notation database consists of MIDI files. The MIDI files were obtained freely from various sources on the internet. Each MIDI file was modified using Propellerhead's Reason 5 so that it could be used for testing. Modification consisted of removing the other instruments in the MIDI file (bass, drums etc.), leaving only the piano notes in the MIDI file.

When using the MIDI files for comparison, the contents of each MIDI file are extracted. The notation from the file is then structured into a string so it can be used for approximate string matching, as sketched below.
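One simple way to structure the extracted notation as a string is to map each MIDI note number to a single character, so that the Levenshtein distance can operate on ordinary character arrays. The sketch below assumes that only the sequence of note numbers (values 0-127) is compared; this is an illustrative choice, not necessarily the exact encoding used by the implemented system.

% Convert a sequence of MIDI note numbers (0-127) into a character string
% suitable for approximate string matching.
function s = notesToString(noteNumbers)
    s = char(noteNumbers(:)' + 1);   % offset by 1 to avoid the character code 0
end

For example, notesToString([60 64 67]) produces a three-character string that can be compared directly against the string built from a database MIDI file.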

Motivation

MIDI files were chosen for use in the digital notation database because they can be easily obtained online. MIDI files can also be edited using most digital music composition software.

3.4 Implementation of System Design

MATLAB is a programming language with many scientific tools, including signal processing functions [MathWorks 1994]. MATLAB was therefore chosen to implement the proposed application. MATLAB also provides the option of creating an executable file (.exe). This allows users who do not have the MATLAB programming environment to run an executable created from MATLAB, provided that the user has the MATLAB Compiler Runtime (MCR) installed on their system. MATLAB is of limited use to people who want to develop applications because it is not free; fortunately there are a number of third party tools which can be obtained from MATLAB's file exchange website [MathWorks 1994].

3.4.1 Automatic Transcription System

Gostin's [2006] automatic transcription system is used to obtain notation from audio files. Using MATLAB's wavread function, the sample data in an audio file can be acquired. Using the sample data, Gostin's [2006] algorithm detects the points where keys are pressed by using a note onset detection algorithm. This is done by taking the absolute value of the sample data and convolving this signal with an edge detection filter using a fast-convolution technique. This outputs a positive valued spike in the sample data for positive edges and a negative valued spike for negative edges (or drop-offs). Points where keys are pressed can now be assigned: when a relative minimum occurs above an assigned threshold (0.1) amongst the positive spikes, it is counted as a key being pressed. Negative peaks indicate the decline of a note and are ignored. For each point where a key is pressed, a Fast Fourier Transform is performed to obtain the frequencies of the song at that point. The frequencies are then labelled with the notes which they represent, using the formulation presented in Section 2.2.5. A simplified sketch of these steps is given below.
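The MATLAB sketch below is a much-simplified illustration of these steps and is not Gostin's [2006] actual code: the edge filter length, the normalisation of the filtered signal and the peak picking are reduced to their simplest forms purely to show the flow from samples to key numbers.

% Simplified note detection: onsets from an edge-detected envelope, then an
% FFT at each onset to label the strongest frequency with a piano key number.
[x, fs] = wavread('song.wav');                  % sample data and sample rate
x   = x(:, 1);                                  % use a single channel
env = abs(x);                                   % absolute value of the sample data

edgeF = [ones(256, 1); -ones(256, 1)];          % crude rising-edge detection filter
d = filter(edgeF, 1, env);                      % positive spikes at sudden energy rises
d = d / max(d);                                 % normalise so the 0.1 threshold is relative (an assumption)

isPeak = d(2:end-1) > 0.1 & d(2:end-1) > d(1:end-2) & d(2:end-1) >= d(3:end);
onsets = find(isPeak) + 1;                      % sample indices of detected key presses

frameLen = 8192;
keys = zeros(size(onsets));
for k = 1:numel(onsets)
    stop     = min(onsets(k) + frameLen - 1, length(x));
    X        = abs(fft(x(onsets(k):stop), frameLen));  % spectrum just after the onset
    [~, bin] = max(X(1:frameLen/2));                   % strongest frequency bin
    f        = (bin - 1) * fs / frameLen;              % bin index -> frequency in Hz
    keys(k)  = round(12 * log2(f / 27.5) + 1);         % frequency -> piano key number
end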

3.4.2 Reading and Writing of MIDI files

Schutte [2010] has developed MATLAB functions which are able to read and write MIDI files. In order to use Schutte's [2010] functions to write a MIDI file, an Nx6 matrix is used, where N is the number of events which occurred in a song, i.e. the number of notes. The six columns are as follows:

1. Track number

2. Channel number

3. Note number (MIDI encoding of pitch)

4. Velocity

5. Note start time (seconds)

6. Note end time (seconds)


For every song, the track and channel numbers are set to 1. The note number is obtained by transforming the note calculated from the transcription, using a table similar to that found in Appendix A. The velocity of every note is set by the user (default value: 64). The start time and end time indicate at what time in the song a key is pressed and released. The matrix is then given to Schutte's [2010] writemidi function, which creates a MIDI file from the information contained in the matrix. An Nx8 matrix is obtained from reading a MIDI file using Schutte's [2010] readmidi function. The first six columns are the same as in the matrix for writing a MIDI file; the seventh and eighth columns give the message numbers for the note-on and note-off events respectively. A sketch of this usage is given below.
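The sketch below illustrates how a few transcribed notes might be assembled into the Nx6 matrix and passed to the functions named above. The note data are invented for illustration, and the calls follow the description in the text; the exact signatures of Schutte's [2010] functions may differ slightly in his toolbox.

% Three made-up notes assembled into the Nx6 matrix described above:
% [track, channel, note number, velocity, start time (s), end time (s)]
notes = [1, 1, 60, 64, 0.00, 0.50;    % C4 held for half a second
         1, 1, 64, 64, 0.50, 1.00;    % E4
         1, 1, 67, 64, 1.00, 1.50];   % G4

writemidi(notes, 'transcription.mid');     % write the matrix out as a MIDI file
notesIn = readmidi('transcription.mid');   % read back an Nx8 matrix of note events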

Motivation

The velocity contained in the MIDI file has no effect on the comparison of songs; therefore the velocity of the notes which are written to file is simply set by the user.

3.4.3 Levenshtein distance algorithm

The Levenshtein distance algorithm is used to calculate the minimal number of edits it takes to transform one sequence of notes into another. A MATLAB function was written to the specification of Wagner and Fischer's [1974] algorithm, which uses dynamic programming to compute the Levenshtein distance. The pseudocode is provided in Algorithm 1.

Algorithm 1 Pseudocode for the Levenshtein distance algorithm using the Wagner and Fischer [1974] approach

Receive as input string S with m characters and string T with n characters
{For all i and j, d[i, j] will hold the Levenshtein distance between the first i characters of S and the first j characters of T}
Create integer array d[0..m, 0..n]
for i from 0 to m do
    d[i, 0] ← i   {the distance from any prefix of the first string to an empty second string}
end for
for j from 0 to n do
    d[0, j] ← j   {the distance from an empty first string to any prefix of the second string}
end for
for j from 1 to n do
    for i from 1 to m do
        if S[i] == T[j] then
            d[i, j] ← d[i−1, j−1]   {no operation required}
        else
            d[i, j] ← minimum(d[i−1, j] + 1, d[i, j−1] + 1, d[i−1, j−1] + 1)   {minimum of (a deletion, an insertion, a substitution)}
        end if
    end for
end for
return d[m, n]   {the Levenshtein distance between S and T}
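A direct MATLAB translation of Algorithm 1 might look as follows. This is a minimal sketch; the function name levenshtein is chosen here for illustration.

% Levenshtein distance between strings s and t, following the
% Wagner-Fischer dynamic programming approach of Algorithm 1.
function dist = levenshtein(s, t)
    m = length(s);
    n = length(t);
    d = zeros(m + 1, n + 1);      % d(i+1, j+1) = distance between the first i
    d(:, 1) = 0:m;                % characters of s and the first j characters of t
    d(1, :) = 0:n;
    for j = 1:n
        for i = 1:m
            if s(i) == t(j)
                d(i+1, j+1) = d(i, j);                   % characters match
            else
                d(i+1, j+1) = min([d(i,   j+1) + 1, ...  % deletion
                                   d(i+1, j  ) + 1, ...  % insertion
                                   d(i,   j  ) + 1]);    % substitution
            end
        end
    end
    dist = d(m + 1, n + 1);
end

For example, levenshtein('bomb', 'boom') returns 2, matching the example in Section 2.5.2.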

3.5 Experimentation

3.5.1 System Specification

System specifications are included to put the measured run times into perspective. The MATLAB environment specifications are also stated, as tests done on another machine may produce different results due to MATLAB memory management [MathWorks 2011].

• Processor: Intel Core 2 Duo T5800, 2.00GHz, 2MB L2 cache


• Memory (RAM): 2GB

• Operating system: Windows Vista SP 2, 32 bit

• MATLAB Environment: MATLAB R2010a, MCR version 7

• Maximum possible array in MATLAB: 1011 MB

• Memory available for all arrays in MATLAB: 2479 MB

The maximum possible array is the amount of space a single array/matrix can consume, while the memory available for all arrays is the total amount of space assigned to all arrays. These memory limits are set for 32 bit versions of MATLAB and cannot be increased by the user [MathWorks 2011]. While conducting the tests, only essential services were allowed to run, such as MATLAB and the operating system's services. This was done so that no other service or application would interfere with the outcome of the results by consuming the computer's resources.

3.5.2 Timing

MATLAB's tic and toc functions were used to measure the time it takes for a song to be identified from either of the databases. The tic function starts the clock, and is called once the notation has been retrieved from the input audio. The toc function stops the clock and returns the amount of time passed (in seconds) since the tic function was called; it is called once the matching song for the input audio has been identified.

Motivation

In order to prove the hypothesis, the time taken to return the correct song from a database is analysed. There is no need to measure how much time it takes to transcribe the input audio or perform any other miscellaneous activity, as this would hinder the analysis. The sketch below summarises the timing procedure.
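The timing procedure can be summarised in a few lines of MATLAB. The function names transcribeAudio and findBestMatch below are placeholders for the transcription and matching steps and are not the actual function names used in the system.

queryStr = transcribeAudio('input.wav');            % transcription is NOT timed
tic;                                                % start the clock after transcription
[title, dist] = findBestMatch(queryStr, database);  % search the chosen database
elapsed = toc;                                      % seconds spent on matching only
fprintf('Identified "%s" in %.1f seconds (edit distance %d)\n', title, elapsed, dist);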

3.6 Finding Songs of Similar Nature

Section 2.2.6 discussed some of the uses of transcribed audio. Often in popular music a song may contain borrowed elements (e.g. a chord progression or drum pattern) from another song. These elements are replayed to create a new song; this is known as interpolation [Wikipedia 2010]. In order to gauge the system's accuracy in correctly identifying songs, the system will be used to identify songs which contain similar elements. This will be done by selecting a song which is known to be an interpolation of another song in the database (or vice versa) and using the application to identify the song with similar elements.

Motivation

Other than conducting another test to make sure that the application developed returns the expected result, the main motivation for having an application which is able to determine songs that contain borrowed elements is copyright infringement. A musician may use such a tool to assert that elements of a song have been used without the musician's permission.

3.7 Conclusion

The goal of this research is to create an application which can efficiently identify a song amongst a collection of songs. This chapter described the research methodology for accomplishing that goal. A hypothesis was given in Section 3.2. The design of the application which aims to prove the hypothesis was discussed in Section 3.3. Section 3.4 gave the finer details of the system. Section 3.5 discussed how the tests were conducted in order to prove the hypothesis.

In Section 3.6 the subject of songs containing borrowed elements from other songs was introduced. It was also discussed how testing for these songs will give a broader analysis of the research being conducted. The results from the experiments discussed in this chapter are presented in the following chapter. The code, executable file and the MIDI file database used are provided online1.

1 http://www.cs.wits.ac.za/ bonganis


Chapter 4

Results & Discussion

4.1 Introduction

In the previous chapter an application was described in which an audio recording would be identified by searching through a database, and the experiments to be conducted were listed. This chapter contains the results obtained from the experiments discussed in Chapter 3. Section 4.2 discusses observations made from the notation produced by the automatic transcription system. Section 4.3 focuses on the results used to prove the hypothesis established in Section 3.2, which is to compare the time taken to retrieve information from two different databases. Section 4.4 gives results from the tests conducted amongst songs of similar nature. A summary of all the results is given in Section 4.5.

4.2 Observations from the Automatic Music Transcription System

Given that the automatic music transcription employed in the research application was taken from Gostin's [2006] piano note detection implementation, brief remarks will be given on the transcription system. Some of the observations made by Gostin [2006] which are significant to the testing of the application are highlighted below:

Accomplishments:

• The transcription system has a high precision in hit detection for reasonably paced songs. This means that as long as the tempo of a song is not too fast, no notes will be missed or mislabelled. It is unknown at exactly what tempo notes begin to be missed.

• Monophonic piano notes were correctly labelled.

• The note-length determination (how long a key is pressed) for notes detected and analysed is precise.

Flaws:

• Overlapping notes, such as chords, may escape detection.

• The more notes which are played at any single point in a song, the higher the risk of not detecting some of the notes.

• A note which is still being played while a new note is introduced will appear as two new notes, instead of a single new note and a note whose key is still pressed.

A problem which Gostin [2006] did not state is that the velocity with which a key is pressed will affect the labelling of the notes. If a key is pressed with a lot of force by the musician, the volume level from that key will be very high. This may end up masking other, softer keys, which may result in notes being missed or incorrectly labelled. Given these observations, the automatic transcription system used in the application is fairly accurate. The automatic transcription system also has some faults, which were expected: as stated in Section 3.3, no music transcription system can transcribe every song without errors. If the transcription system worked perfectly in transcribing all songs, there would be no need for approximate string matching.

In Figure 4.1 piano rolls of Yoko Kanno's Gravity are presented. The first piano roll is of the original song, obtained from a MIDI file. The second piano roll is obtained from the MIDI file produced by transcribing the song. It can be seen that the notes produced from the transcription are closely related to the song's notation. At first the transcribed MIDI file matches that of the song. As the song progresses and more notes are played, the transcribed MIDI file shows errors from the transcription process, such as some notes being incorrectly labelled. For the interested reader, the MIDI files produced from the transcription of songs during the tests are available online1.

Figure 4.1: Comparison of piano rolls from Yoko Kanno's Gravity (top) and the transcription of that song (below)

1 http://www.cs.wits.ac.za/ bonganis


4.3 Time Taken to Identify a Song

Several songs were chosen to compare the time it would take to identify a song between the digital notation database and the audio file database. The songs were chosen for various reasons, such as genre and the playing time of the song. Each song was tested twice and it was checked that the times measured were similar; this was to ensure that no unseen factor would taint the results. The results obtained from the tests are provided in Table 4.1. The first column identifies the name and artist of a song. The second column gives the playing time of the song. The third column states whether the input audio file was correctly identified from the songs contained in the database. The fourth and fifth columns give the time it took for a song to be identified in the digital notation (MIDI) database and the audio file (WAVE) database respectively. The times given are the mean of the two tests which were conducted.

Song: Artist(s)                     Time of Song   Correctly    MIDI Time   WAVE Time
                                    (min:sec)      Identified   (seconds)   (seconds)
Beautiful: Christina Aguilera       2:31           Yes          58          1864
Changes: Tupac Shakur               2:36           Yes          51          1841
Don't Know Why: Norah Jones         1:22           Yes          45          1733
Fur Elise: Ludwig van Beethoven     1:51           Yes          54          1849
Gravity: Yoko Kanno                 1:45           Yes          50          1774
Kiss From A Rose: Seal              2:08           Yes          51          1838
Lilium: K. Kayo & K. Yukio          2:35           Yes          47          1759
Mean                                -              -            51          1808

Table 4.1: Results from measuring the time it took to identify a song

The results are very positive. The application designed was able to correctly identify all the songs. This confirms that automatic music transcription systems, along with approximate string matching algorithms, can be used to retrieve information on a song.

There is an obvious time difference in how long it took for a song to be identified in the two different databases. Figure 4.2 shows a clear distinction between searching for a song in the digital notation database and searching for a song in the audio file database.

The time it took to correctly identify a song in the audio file database is clearly longer than the time it took to identify a song in the digital notation database. This supports the stated hypothesis: using digital notation files to match a song is more efficient than using audio files. The average time to correctly identify a song from the audio file database was around 30 minutes, while it took an average of 51 seconds to identify a song from the digital notation database.

An interesting observation which was made is that there is a relation between the number of notes a song contains and the time it took to return a result. Take for example Lilium, which has a playing time of 2 minutes and 35 seconds, while Fur Elise has a playing time of 1 minute and 51 seconds. Though the playing time of Lilium is longer than that of Fur Elise, Fur Elise actually took longer to return a result from either database than Lilium did. Examining the notation files, Fur Elise had more notes than Lilium. This meant that for Fur Elise the approximate string matching algorithm had a longer string to compare, therefore it took more time to identify the song.


Figure 4.2: The time it took to return a result from the digital notation database vs. that of the audio file database

4.4 Identifying Songs of Similar Nature

Six songs were used to conduct the experiments. Each song had another song in the database which contained similar elements to the chosen song. The MIDI database was used to conduct the tests, as it is now known to be more efficient than the audio file database. The tests were conducted as follows for each song:

1. Test if the song can be correctly identified in the database

2. Remove the song from the database

3. Test if the song containing similar elements can be correctly identified

The results acquired from the tests are given below in Table 4.2. The numbers in the first column indicate which two songs have elements of a similar nature. The second column gives the name and artist of the song. The third column states whether the song was correctly identified while its notation was still in the database. The fourth column states which song was identified when the notation file of the input song was removed from the database.

     Song: Artist(s)                          Identification with     Identification without
                                              Song in Database        Song in Database
1.a  The Way It Is: Bruce Hornsby             Matched input           Hurt - Johnny Cash
1.b  Changes: Tupac Shakur                    Matched input           The Way It Is - Bruce Hornsby
2.a  Gimme Gimme Gimme: Abba                  Matched input           Ghetto Vet - Ice Cube
2.b  Hung Up: Madonna                         Matched input           Loving You - Minnie Riperton
3.a  Kashmir: Led Zeppelin                    Did not match input     Come With Me - Puff Daddy & Jimmy Page
3.b  Come With Me: Puff Daddy & Jimmy Page    Matched input           Kashmir - Led Zeppelin
     Success Rate                             83.3%                   50%

Table 4.2: Results from identifying songs of similar nature

The initial tests, conducted to check whether the input song could be correctly identified, were all successful except for one. Kashmir by Led Zeppelin was identified as Come With Me by Puff Daddy and Jimmy Page, the song which contains replayed elements from Kashmir. Kashmir and Come With Me bear a strong similarity to each other, but there should be a reason why Come With Me was chosen rather than Kashmir. Examining the MIDI file which was used to create the audio for Kashmir reveals that Kashmir has a chord progression running alongside the rest of the song, which Come With Me does not have. This means that Kashmir has many more notes played at the same time than Come With Me. It was also discovered that the velocity of Kashmir is extremely high. These are exactly the conditions stated in Section 4.2 under which notes are missed when transcribing audio. This means that notes were missed in the transcription of Kashmir, and the Levenshtein approximate string matching algorithm found Come With Me to be a better match.

Tests conducted when the digital notation file for the input song was removed showed interesting results. Only three tests identified the best matching song as the song containing similar elements, those songs being Changes, Come With Me and Kashmir. Gimme Gimme Gimme and Hung Up were not matched to the songs containing elements of a similar nature when each was tested. This was to be expected, as both of these song pairs only share a small section of similar elements; the remaining parts of the songs are very different from each other. The Way It Is was not linked to Changes as having a similar sound, but Changes was able to pick out The Way It Is. Examining The Way It Is MIDI file which was used to create the audio file reveals that, though the song initially has a slow pace (102 beats per minute), the tempo soon picks up (116 beats per minute). In Section 4.2 it was stated that if the pace of a song is too fast, notes will be missed when transcribing the audio recording. This seems to be the case for The Way It Is: notes were missed during transcription, and therefore when comparing the song to other songs with the Levenshtein distance algorithm, another song was picked as more closely related to it.

4.5 Conclusion

This chapter presented results and observations for the experiments conducted on the information retrieval system developed. Section 4.2 discussed results from the automatic transcription. The transcription system was fairly accurate, with some flaws which were highlighted. This was to be expected: if the transcription system were able to perfectly transcribe any song (which is unlikely), there would have been no need to implement approximate string matching when comparing the notation obtained to the songs in the databases.


In Section 4.3 the results from measuring the time it took to return information on a song were given. It was shown that obtaining information from a database containing digital notation files is much quicker than obtaining information from a database containing audio files. From this, the hypothesis stated in Section 3.2 was shown to be true. It was also shown that the time taken to identify a song depends on how many notes the song contains.

Another task for this research was to ensure that the correct song is always identified from the input audio. Initially it was believed that the application developed was highly accurate in identifying the input audio, but Section 4.4 showed that there are cases in which the wrong song may be identified. A song containing elements similar to the input song may be identified as the input song even though it is not. This also happens due to notes missed during the transcription process, showing that the system developed relies not only on the approximate matching algorithm but also on the accuracy of the transcription system. The next chapter discusses future work related to the results presented in this chapter and other possibilities related to this research.


Chapter 5

Future Work

5.1 Introduction

An application such as the one developed in this research would serve well as a consumer application for identifying unknown songs. Applications like Shazam and Tunatic are examples of tools that query databases for song information after analysing audio [Cozma 2010] [Barker 2006]. In Section 5.3 a proposition is given on how the application developed in the research could be implemented as a commercial product. From the investigation conducted throughout this research, several areas have been identified in which future work can be conducted. This includes proposing methods to ensure that the input audio is never matched with the incorrect song (Section 5.2). Section 5.4 discusses how the time taken to identify a song can be improved.

Limitations of the programming language were not given in detail in previous chapters. Section 5.5 aims to shed light on some issues with MATLAB and how these problems can be resolved. Lastly, some recommendations are given in Section 5.6 to make the tool developed more flexible.

5.2 Incorrect Song Matched

Results from the tests conducted showed that song identification using approximate string matching worked successfully, except for the cases where songs contain similar elements. It is unknown whether this success in correctly identifying songs will continue, especially in the case of an ever growing database of songs to be matched against an input song. One example would be a tie occurring between two songs when the edit distance is computed: the wrong song could be chosen as the top match.

5.2.1 Improvement of the Automatic Transcription System

The automatic transcription system applied is not fully accurate, i.e. the notation produced from transcribing a song may not be a note-for-note match to the original song. In order to guarantee that no song will be incorrectly matched (provided the song is also contained in the database), it is suggested that the accuracy of the automatic transcription system be improved: the more accurate the transcribed notation, the lower the probability of incorrectly matching a song. Listed below are possible solutions to some of the flaws in the transcription system applied in this research [Gostin 2006]:

• The music transcription system is not able to detect when a note is released. The issue is that if a note is released while other notes are playing, the energy given off by the other notes causes the released note to still be viewed as if it were being pressed. Using the negative peaks from the note onset detection phase might fix this problem.


• It was stated in Section 4.2 that if a note is still being played while a new note is introduced, it will appear as though two new notes have been played. Using a Hamming window may resolve this issue.

• The transcription system does not account for the tempo of a song speeding up or slowing down. This is a major problem, especially when a song speeds up, as notes can be missed. One technique to solve this is to scan the song for any noticeable change in note duration. If a major change is found, the song can be separated into two parts, with the section having the faster tempo being analysed again to make sure no notes were missed (a rough sketch of this check is given after this list).
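
As a rough illustration of the last point, the sketch below scans a vector of note onset times (in seconds) for a sustained drop in the inter-onset interval and reports where the song could be split. The function name, the window size and the threshold are assumptions made purely for this example; they are not part of the transcription system used in this research.

    function splitIdx = findTempoChange(onsets, ratioThreshold)
    % findTempoChange - locate a sustained change in note spacing.
    %   onsets         - vector of note onset times in seconds (ascending)
    %   ratioThreshold - e.g. 0.8; a drop of the local median inter-onset
    %                    interval below this fraction of the opening value
    %                    is treated as a speed-up
    %   splitIdx       - index of the onset where the song could be split,
    %                    or [] if no sustained change is found

    ioi = diff(onsets);                     % inter-onset intervals
    win = 8;                                % intervals per local estimate
    baseline = median(ioi(1:min(win, numel(ioi))));
    splitIdx = [];
    for k = win+1 : numel(ioi)
        localMedian = median(ioi(k-win+1 : k));
        if localMedian < ratioThreshold * baseline
            splitIdx = k;                   % tempo has picked up here
            return;
        end
    end
    end

The two parts of the recording could then be passed through the frequency analysis stage separately, so that the faster section is analysed with an appropriately shorter frame spacing.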

5.2.2 Tie-breaker

If a tie between two songs were to occur, a tie-breaker would be required so that the incorrect song is not chosen. Levenshtein distance can be used again, but this time assigning weights to specific edit operations. When the edit distance is computed between two notation sequences, a substitution corresponds to a note that was incorrectly labelled, an insertion corresponds to a note that was not detected (a false negative), and a deletion corresponds to a note that was labelled but never occurred in the song (a false positive). The deletion operation could, for example, be weighted more heavily than the insertion operation; this would allow ties to be broken.
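
A minimal sketch of such a weighted edit distance is given below, assuming the two songs are represented as vectors of MIDI note numbers. The function name and the particular weights are illustrative choices, not values used in the system described in this report.

    function d = weightedEditDistance(a, b, wIns, wDel, wSub)
    % weightedEditDistance - Levenshtein distance with per-operation weights.
    %   a, b  - vectors of MIDI note numbers (query and database song)
    %   wIns  - cost of inserting a note missed by the transcription
    %   wDel  - cost of deleting a note that was wrongly reported
    %   wSub  - cost of substituting an incorrectly labelled note

    m = numel(a); n = numel(b);
    D = zeros(m + 1, n + 1);
    D(2:end, 1) = (1:m)' * wDel;
    D(1, 2:end) = (1:n)  * wIns;
    for i = 1:m
        for j = 1:n
            subCost = wSub * (a(i) ~= b(j));    % 0 when the notes agree
            D(i+1, j+1) = min([D(i,   j+1) + wDel, ...
                               D(i+1, j  ) + wIns, ...
                               D(i,   j  ) + subCost]);
        end
    end
    d = D(m + 1, n + 1);
    end

For instance, weightedEditDistance(query, candidate, 1, 1.5, 1) would penalise spurious notes more heavily than missed notes, so two candidates that tie under the unweighted distance can be separated.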

5.2.3 MIDI Database

Currently the MIDI database contains accurately notated MIDI songs. A more suitable method of guaranteeing that no song will be incorrectly matched would be to build the MIDI database from files which have been transcribed by the automatic transcription system; those MIDI files would then be used for song identification. Since the same transcription system would both create the MIDI database and transcribe a song for matching, both sides of the comparison would contain the same systematic errors. This would decrease the possibility of any song being incorrectly matched.

5.2.4 Finding Songs of Similar Nature using Local Alignment

In Section 4.4 it was stated that songs which share only a small section of similarity won't be identified as having elements of a similar nature. To solve this issue, an approximate string matching algorithm with a local alignment property can be used to match the strings. An example of such an algorithm is Smith-Waterman [Gusfield 1997].
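
The following is a minimal sketch of the Smith-Waterman local alignment score applied to two vectors of MIDI note numbers; the scoring values (+2 for a match, -1 for a mismatch or gap) are illustrative assumptions. A high score indicates that the two songs share a strongly similar passage somewhere, even if the songs as a whole are different.

    function score = localAlignmentScore(a, b)
    % localAlignmentScore - Smith-Waterman score for two note sequences.
    %   a, b - vectors of MIDI note numbers
    % The score of the best locally aligned passage is returned; alignments
    % are never forced to span the whole of either song.

    matchScore = 2; mismatch = -1; gap = -1;   % illustrative scoring scheme
    m = numel(a); n = numel(b);
    H = zeros(m + 1, n + 1);                   % first row/column stay zero
    score = 0;
    for i = 1:m
        for j = 1:n
            if a(i) == b(j)
                s = matchScore;
            else
                s = mismatch;
            end
            H(i+1, j+1) = max([0, ...
                               H(i,   j  ) + s, ...    % align a(i) with b(j)
                               H(i,   j+1) + gap, ...  % gap in b
                               H(i+1, j  ) + gap]);    % gap in a
            score = max(score, H(i+1, j+1));
        end
    end
    end

Tracing back from the cell holding the maximum score would recover the shared passage itself, which would allow the system to report which part of the two songs is similar.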

5.3 Client-Server Implementation

MIDI files are very small; the database used contained MIDI files totalling 334KB. There should be no problem with the user downloading the database and storing it on a local machine. An issue may arise when the database is updated; updates may include adding new songs or fixing errors in files. In such a case the user would have to re-download the whole database, and having to do so each time there is an update would become a nuisance. For future work, the system developed could be split into two applications: one for the user (a client application) and another to be used on a dedicated server.

The user application would take an audio file as input and transcribe the song as before. The difference is that the MIDI file produced from the transcription would be sent to a server which contains the MIDI database. The server would search the database and return a match for the song to the user (a sketch of the server-side search is given after the list of benefits below). Implementing this would have many benefits, such as:

• The computational load of identifying a song is moved off the user's computer. This would be a benefit as the user's resources (memory, CPU) are freed.


• The user no longer has to worry about updating songs. All that needs to be updated (by an administrator) is the database on the server.

• Updating the application becomes easier, since the client and the server can be maintained as two separate applications.
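
As a rough sketch of the server side, the function below takes the note sequence extracted from the user's transcription and compares it against every entry in the MIDI database, returning the title of the closest match. The function and field names, and the use of a struct array for the database, are assumptions made for illustration; the weightedEditDistance sketch from Section 5.2.2 is reused with equal weights, which makes it the ordinary Levenshtein distance.

    function [bestTitle, bestDistance] = identifySong(queryNotes, database)
    % identifySong - return the database entry closest to the query notes.
    %   queryNotes - vector of MIDI note numbers from the transcribed input
    %   database   - struct array with fields 'title' and 'notes'

    bestDistance = inf;
    bestTitle = '';
    for k = 1:numel(database)
        d = weightedEditDistance(queryNotes, database(k).notes, 1, 1, 1);
        if d < bestDistance
            bestDistance = d;
            bestTitle = database(k).title;
        end
    end
    end

A client would then only need to upload the transcribed note sequence, which is a few kilobytes at most, rather than the audio itself, keeping the network cost of a query small.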

5.4 Computational Time

With advancing technology, users of computing applications have become accustomed to receiving results from queries in a very short time span. The application developed took an average of 51 seconds to return a result on a database containing 70 songs. The application is reasonably quick but is bound to get slower as the database grows. In order to avoid frustrating waits for a result, investigation must be done to improve the return time when querying the database. Given below are a few suggested methods:

• It was proposed in Section 5.3 that a server should be used to identify a song from a given MIDI file. If the server which hosts the application is powerful enough, a parallel computing technique can be implemented to split up the file comparisons performed for approximate string matching, resulting in higher throughput (a sketch is given after this list).

• The application developed was tested only using Levenshtein distance for approximate string matching. Other approximate string matching algorithms (such as the Smith-Waterman algorithm) need to be investigated in order to adopt the quickest approximate string matching algorithm into the system. MATLAB contains functions for the Smith-Waterman and Needleman-Wunsch algorithms [Burstein 2011]; the reason they were not used in this research is that they are intended for bioinformatics and only accept strings of DNA/RNA sequences.
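
As an illustration of the first suggestion, the database loop sketched in Section 5.3 could be distributed across workers with MATLAB's parfor, assuming the Parallel Computing Toolbox is available on the server. The sketch below is an assumption about how that might look rather than a measured implementation.

    function [bestTitle, bestDistance] = identifySongParallel(queryNotes, database)
    % identifySongParallel - parallel variant of the database search.
    % Each worker computes the edit distance for a share of the database;
    % the minimum is taken once all distances are known.

    n = numel(database);
    distances = zeros(1, n);
    parfor k = 1:n          % requires the Parallel Computing Toolbox
        distances(k) = weightedEditDistance(queryNotes, database(k).notes, 1, 1, 1);
    end
    [bestDistance, idx] = min(distances);
    bestTitle = database(idx).title;
    end

Because each comparison is independent of the others, the speed-up should scale roughly with the number of workers, up to the cost of distributing the database to them.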

5.5 Programming Language

The downside of using MATLAB as a development environment is that MATLAB is not free, which is a considerable problem for researchers who want to share code and collaborate [J.Glover et al. 2010]. If a user does not have MATLAB, they will be required to download the MATLAB Compiler Runtime (MCR) in order to run the executable file created from MATLAB. The 32-bit version of MATLAB also has memory limitations [MathWorks 2011]: a limit of 2GB of memory available for all arrays, and a maximum possible array size of 1GB. The application developed would therefore only accept audio files below a certain size (less than 30MB). If the audio file was larger than 30MB, the matrix created during the audio analysis phase would be too large, the MATLAB environment would run out of memory, and the application would close abruptly. Investigation should be done into methods of solving this problem, such as splitting the matrix and performing the computations on each part individually.
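
One way of splitting the work, sketched below, is to read the WAVE file in fixed-length blocks rather than loading it whole, analysing each block before the next is read. The block length and the placeholder analyseBlock function are assumptions made for illustration; the sample-range form of wavread was available in the MATLAB release used at the time of writing (it has since been superseded by audioread).

    % Process a large WAVE file in blocks so that no single matrix
    % exceeds the 32-bit memory limits.
    blockSeconds = 30;                          % illustrative block length
    info = wavread('recording.wav', 'size');    % [totalSamples, channels]
    [~, Fs] = wavread('recording.wav', 1);      % read one sample to get Fs
    totalSamples = info(1);
    blockLen = blockSeconds * Fs;

    for startSample = 1:blockLen:totalSamples
        stopSample = min(startSample + blockLen - 1, totalSamples);
        y = wavread('recording.wav', [startSample stopSample]);
        analyseBlock(y, Fs);                    % hypothetical per-block analysis
    end

Care would be needed at the block boundaries (for example by overlapping consecutive blocks slightly) so that notes spanning a boundary are not missed.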

The application created for this research was initially to be developed in Python. It was discovered that the tools used for audio signal processing had problems; therefore Python was abandoned in favour of MATLAB. Investigation needs to be done into other free programming languages that can be used for scientific applications.

5.6 Further Development

The application developed can be seen as a prototype which requires further testing and additional functionality to be added.

• The application currently only accepts WAVE files. This must be expanded to more popular formats such as MP3, WMA and AAC.


• The user may not have a piano recording of the song for which they want to retrieve information from the database. Research must be done to examine how the system developed fares with other instruments.

• An investigation also needs to be conducted into the correctness of the system when the audio file contains more than one instrument.

• Tests should be done on audio files of different quality, and also on live recordings. If this is successful, research can be done to make the query system “live”, i.e. as the musician records a song the application can start querying the system for information.

5.7 Conclusion

This chapter discussed future tasks which should be investigated. If the topics addressed here are pursued, a versatile music information retrieval application can be produced.


Chapter 6

Conclusion

Automatic music transcription, though a popular area in music information retrieval, is one of the most challenging tasks in the field [Argenti et al. 2010]. Traditionally, transcribing music is a difficult task performed only by trained professionals. As with many tasks, computing technology has been introduced to ease the work of transcribing musical recordings. As a result, automatic transcription systems can be used for various music information retrieval applications.

This document aimed to investigate the use of automatic transcription systems in querying a database containing a collection of songs. The objective was, given an audio recording, to determine the name and musician of a song by searching a database. Such an application is required to be quick and accurate if it is to be used as a commercial product.

Chapter 2 introduced the context of automatic music transcription systems. Automatic music transcription systems perform transcription of a music piece by breaking the work into two phases: a frequency analysis phase and a pitch estimation phase. Frequency analysis consists of transforming an audio recording from its time domain to its frequency domain. For the second phase it was shown that, due to the mathematical properties of piano frequencies, a logarithmic function could be used to label the frequencies.
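
As a small illustration of that logarithmic labelling, the standard equal-temperament relation maps a detected fundamental frequency f to a MIDI note number via n = 69 + 12*log2(f/440). The snippet below rounds that value to label a frequency with the nearest piano key; it is the textbook formula rather than a verbatim excerpt of the transcription code used in this research.

    function noteNumber = frequencyToMidiNote(f)
    % frequencyToMidiNote - label a fundamental frequency with a MIDI note.
    %   f - detected fundamental frequency in Hz (e.g. 261.63 for middle C)
    % MIDI note 69 is A4 (440 Hz); each semitone is a factor of 2^(1/12).

    noteNumber = round(69 + 12 * log2(f / 440));
    end

For example, frequencyToMidiNote(261.63) returns 60, the note number for middle C in the table in Appendix A.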

Given that it is possible to obtain key notation from an audio file, it was stated that the notation can be used to match a song in a database. An issue which arose is that no transcription system is guaranteed to produce a perfect transcription of every song it encounters [Poliner et al. 2007]. To overcome this problem, the Levenshtein distance algorithm was used to find the closest matching song.

Chapter 2 noted that two types of files can be used to retrieve the notation of a song: audio files and digital music notation files. It was proposed that using digital notation files to retrieve information from a database would be more efficient than using audio files, the reason being that in an audio database time is consumed transcribing each audio file to obtain its notation. The time it took to identify a song amongst a collection of songs was measured using two different databases, the first consisting of digital music notation files and the second of audio files. By measuring the time it took to identify a song, it was shown that digital notation files are far more efficient for obtaining song information from a database.

Though it was believed that the application designed was highly accurate in retrieving the correct song, it was shown in Section 4.4 that there are instances in which the wrong song can be identified. This occurred when a song had borrowed elements from another song. Nevertheless, the system implemented is still considered to be a success.

The research conducted has given insight into the field of music information retrieval and its applications. This is only a stepping stone, as Chapter 5 listed a number of potential tasks. With the work presented in this research, together with the future work suggested, an efficient and accurate music information retrieval application can be created.


References

[Argenti et al. 2010] F. Argenti, P. Nesi, and G. Pantaleo. Automatic Transcription of Polyphonic Music Based on the Constant-Q Bispectral Analysis. IEEE, 2010.

[Association 2010] MIDI Manufacturers Association. The technology of MIDI, 2010. [Online; accessed 23 April 2011].

[Barker 2006] G. Barker. Now what's that song? The Sydney Morning Herald, 21 January 2006.

[Burstein 2011] L. Burstein. MATLAB in Bioscience and Biotechnology. Biohealthcare Publishing (Oxford) Limited, 2011.

[Cannam et al. 2010] C. Cannam, C. Landone, and M. Sandler. Sonic Visualiser: An open source application for viewing, analysing, and annotating music audio files. In Proceedings of the ACM Multimedia 2010 International Conference, pages 1467–1468, Firenze, Italy, October 2010.

[Cozma 2010] N. Cozma. Behind the App: Shazam Encore, 2010.

[Dixon 2000] S. Dixon. On the Computer Recognition of Solo Piano Music. In Proceedings of the Australasian Computer Music Conference, pages 31–37, Brisbane, Australia, July 2000.

[Downie 2008] J. Stephen Downie. The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4):247–255, 2008.

[Gostin 2006] A. Gostin. MATLAB Code that Implements Piano Note Detection, 2006.

[Gusfield 1997] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York, NY, USA, 1997.

[Hylvander and Nordmark 2010] F. Hylvander and A. Nordmark. Reason Operation Manual. Propellerhead, 5th edition, 2010.

[J.Glover et al. 2010] J. Glover, V. Lazzarini, and J. Timoney. Python for Audio Signal Processing, 2010.

[Kammler 2007] D.W. Kammler. A First Course in Fourier Analysis. Cambridge University Press, 2007.

[MathWorks 1994] MathWorks. MATLAB - The Language of Technical Computing, 1994. [Online; accessed 13 November 2011].

[MathWorks 2011] MathWorks. R2011b Documentation, MATLAB - Resolving "Out of Memory" Errors, 2011. [Online; accessed 13 November 2011].

[Poliner and Ellis 2007] G.E. Poliner and D.P.W. Ellis. A Discriminative Model for Polyphonic Piano Transcription. EURASIP Journal on Advances in Signal Processing, 2007.


[Poliner et al. 2007] G.E. Poliner, D.P.W. Ellis, A.F. Ehmann, E. Gomez, S. Streich, and B. Ong. Melody Transcription from Music Audio: Approaches and Evaluation. IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1247–1256, May 2007.

[Raphael 2002] C. Raphael. Automatic Transcription of Piano Music. In Proceedings of the International Conference on Music Information Retrieval, 2002.

[Robinson 2001] Andy Robinson. Introduction to Transcribing Music, 2001. [Online; accessed 23 April 2011].

[Rossing 1982] T.D. Rossing. The Science of Sound. Addison Wesley, 1982.

[Schutte 2010] K. Schutte. MATLAB and MIDI, 2010. [Online; accessed 13 November 2011].

[Selfridge-Field 1997] Eleanor Selfridge-Field, editor. Beyond MIDI: The Handbook of Musical Codes. MIT Press, Cambridge, MA, USA, 1997.

[v. d. Boogaart and Lienhart 2009] C.G. v. d. Boogaart and R. Lienhart. Note Onset Detection for the Transcription of Polyphonic Piano Music. IEEE International Conference on Multimedia and Expo, pages 446–449, 2009.

[Wagner and Fischer 1974] R.A. Wagner and M.J. Fischer. The String-to-String Correction Problem. Journal of the ACM, 21:168–173, January 1974.

[Watkinson 1994] J. Watkinson. The Art of Digital Audio. Focal Press, Waltham, Massachusetts, 1994.

[Wikipedia 2010] Wikipedia. Sampling (music) — Wikipedia, The Free Encyclopedia, 2010. [Online; accessed 24 April 2011].

[Wikipedia 2011] Wikipedia. Piano key frequencies — Wikipedia, The Free Encyclopedia, 2011. [Online; accessed 10 November 2011].


Appendix A

MIDI Note Number Table

Octave     C    C#    D    D#    E     F    F#    G    G#    A    A#    B

  -2       0     1    2     3    4     5     6    7     8    9    10   11
  -1      12    13   14    15   16    17    18   19    20   21    22   23
   0      24    25   26    27   28    29    30   31    32   33    34   35
   1      36    37   38    39   40    41    42   43    44   45    46   47
   2      48    49   50    51   52    53    54   55    56   57    58   59
   3      60    61   62    63   64    65    66   67    68   69    70   71
   4      72    73   74    75   76    77    78   79    80   81    82   83
   5      84    85   86    87   88    89    90   91    92   93    94   95
   6      96    97   98    99  100   101   102  103   104  105   106  107
   7     108   109  110   111  112   113   114  115   116  117   118  119
   8     120   121  122   123  124   125   126  127

Table A.1: The 128 MIDI note numbers mapped to piano note names and octaves
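
A note name and octave from this table can be converted to its MIDI number directly, since each octave spans twelve consecutive numbers starting from C. The small helper below assumes the octave numbering used in Table A.1 (octave -2 starting at note number 0) and is included purely for illustration.

    function noteNumber = noteToMidi(noteName, octave)
    % noteToMidi - map a note name and octave (as in Table A.1) to a MIDI number.
    %   noteName - one of 'C','C#','D','D#','E','F','F#','G','G#','A','A#','B'
    %   octave   - octave number as used in Table A.1 (-2 to 8)

    names = {'C','C#','D','D#','E','F','F#','G','G#','A','A#','B'};
    pitchClass = find(strcmp(names, noteName)) - 1;   % 0 for C, 11 for B
    noteNumber = (octave + 2) * 12 + pitchClass;
    end

For example, noteToMidi('A', 3) returns 69, which is concert A (440 Hz) in the table above.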
