
Retrieval of Guitarist Fingering Information using Computer Vision

Joseph Scarr¹, Richard Green²

Department of Computer Science and Software Engineering
University of Canterbury, New Zealand

[email protected], [email protected]

Abstract

Writing musical notation for guitar, otherwise known as tablature (or colloquially, “tab”) can be a tedious and time-consuming task when performed by hand. Software exists that can detect the pitch of a signal emitted from a musical instrument, but this is insufficient for guitar tablature, which also requires spatial information. Previous vision-based methods had low accuracy rates and required the camera to be fixed to the guitar neck.

In this paper we propose an algorithm that uses a markerless approach to successfully locate a guitar fretboard in a webcam image, normalise it and detect the individual locations of the guitarist’s fretting fingers. Preliminary testing of this system shows that it is more accurate at note recognition than existing methods which require a camera mounted on the guitar neck.

Keywords: guitar, computer vision, automatic transcription

1 Introduction

Writing musical notation for guitar, otherwise known as tablature (or colloquially, “tab”) can be a tedious and time-consuming task. There exist automatic musical transcription methods that use audio-based analysis, but this does not produce all the required information for guitar players. Unlike a piano, which has a one-to-one relationship between musical notes and the physical keys on the instrument, the same single note can be played in several different places on the guitar fretboard. Guitar players therefore need positional information in their musical notation, which is difficult to retrieve using audio-based methods alone [1].

2 Related Work

In this section we note the lack of available software for automatic music transcription, and review recent research into vision-based, audio-based and multi-modal methods.

2.1 Existing software

There exist programs, both free and commercial, that can detect the note being played on a guitar for tuning purposes [2]. Furthermore, there are many programs that aid the composition process by providing score and tab editors [3, 4]. However, we were unable to find any downloadable software claiming to provide automatic transcription.

2.2 Audio processing research

A significant amount of research has been published on automatic music transcription via signal processing. A robust polyphonic signal processing system was developed by Klapuri, 1998 [5, 6], and various research directions have been explored since then. Raphael, 2002 investigated the use of Hidden Markov Models for analysing musical signals, and Reis, 2007 [7] successfully demonstrated the feasibility of genetic algorithms. All of the above systems have a severe limitation when it comes to transcribing guitar music, because they record only the tone, not the position, of played notes. Traube, 2001 [1] attempted to retrieve spatial information using audio analysis alone, but was ultimately only successful in determining the plucking position of open strings, not the fretting position.

2.3 Computer vision techniques

A computer vision-based system was demonstrated by Motokawa and Saito, 2006 [8]. Their system required the use of an AR marker, and was restricted to fretboard detection only. Burns, 2007 [9] developed a prototype that also included fingering detection, but had several limitations. First, it required a camera that was fixed to the head of the guitar, and second, it suffered from accuracy issues due to the top-down camera perspective, amongst other reasons. They achieved a 40.2% recognition rate across three test cases: a C major chord progression¹, where each strum plays a chord spanning multiple strings, as well as the C major scale and an arrangement of “Ode an die Freude”, both of which only involve one string being played at a time.

Vision-based fingering detection systems have also been developed for other instruments, such as the piano, with modest success [10].

2.4 Multi-modal systems

Finally, research is currently being performed into multi-modal transcription systems, which utilise both visual and auditory inputs. Frisson, 2009 [11] discussed a comprehensive system for extracting information from a guitar performance, including a computer vision component. Their implementation relied on AR markers for visual tracking. Quested, 2008 [12] describes the progress of a multi-modal system for guitar note detection. The computer vision component of their work was successful in its ability to output the set of frets covered by the hand in any given video frame. Paleari, 2008 [13] demonstrated a complete multi-modal system for guitar note detection, with impressive accuracy rates – 89%. However, neither of the above papers produced accuracy results for the computer vision component alone, and neither of them was able to gain information about individual fingers – they simply detected the approximate location of the entire fretting hand.

This work aims to develop a uni-modal system that does not require the mounted camera used in [9] or the markers used in [8], while providing a more complete information set than the computer vision components of the systems described in [12] and [13].

3 Design and Implementation

3.1 Aims

Our system was designed with the following specific goals in mind:

1. There should be no interference with the musician’s playing of the instrument.

2. The perspective of the camera should not be overly constrained.

¹ C, Am, Dm, G7, C

3. The system should work without any special markers.

4. The system should work without expensive components; a low-cost webcam should suffice.

By successfully achieving these goals, future iterations of our proposed method will be able to analyse and transcribe not only a live video feed through an ordinary webcam, but also previously recorded musical performances.

3.2 Method

Our system uses an algorithm that processes each frame of the video feed separately. No tracking algorithms are used – the advantages and disadvantages of this are discussed in Section 5.

3.2.1 Locating the fretboard

First, background subtraction is performed, using a frame taken while the guitar is out of shot. This frame is nominated by the user. The absolute difference is dilated and used as a mask on the source image, to reduce interference caused by straight lines in the background.
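
As an illustration of this masking step, a minimal OpenCV-style sketch is given below. The grey-level threshold of 30 and the two dilation iterations are our own assumptions; the paper does not state the exact values used.

```cpp
// Sketch of the background-subtraction mask (illustrative; thresholds are assumptions).
#include <opencv2/opencv.hpp>

cv::Mat maskForeground(const cv::Mat& frame, const cv::Mat& backgroundFrame)
{
    cv::Mat diff, gray, mask, masked;
    cv::absdiff(frame, backgroundFrame, diff);                // absolute difference
    cv::cvtColor(diff, gray, cv::COLOR_BGR2GRAY);
    cv::threshold(gray, mask, 30, 255, cv::THRESH_BINARY);    // 30 is an assumed threshold
    cv::dilate(mask, mask, cv::Mat(), cv::Point(-1, -1), 2);  // grow the mask slightly
    frame.copyTo(masked, mask);                               // suppress background lines
    return masked;
}
```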

Line detection is then performed on the image using the Canny edge detector and a probabilistic Hough transform. Of the lines returned, the median gradient is assumed to be the approximate gradient of the fretboard strings. We can then safely ignore any lines with a significantly different gradient. The largest cluster of the remaining lines is then found and assumed to constitute the fretboard. A bounding rectangle, rotated to match the new median gradient, is constructed to contain this cluster (see Figure 1). The algorithm then performs a crop and transform based on the bounding rectangle, and the resulting image is the approximate vicinity of our fretboard. The advantage of this method is that it allows us to detect low-contrast guitar necks – for example, our test guitar had a dark neck and three black strings.
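
The sketch below illustrates the gradient-filtering idea in OpenCV. For brevity it bounds every line whose angle is close to the median rather than clustering the lines first, and all numeric parameters (Canny thresholds, Hough parameters, angle tolerance) are assumptions rather than values from the paper.

```cpp
// Sketch: detect line segments, keep those near the median gradient, and bound
// the surviving endpoints with a rotated rectangle. Parameters are illustrative.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

cv::RotatedRect findFretboardVicinity(const cv::Mat& maskedGray)
{
    cv::Mat edges;
    cv::Canny(maskedGray, edges, 50, 150);

    std::vector<cv::Vec4i> lines;
    cv::HoughLinesP(edges, lines, 1, CV_PI / 180, 60, 80, 10);
    if (lines.empty()) return cv::RotatedRect();

    // Angle of each detected line segment.
    std::vector<double> angles;
    for (size_t i = 0; i < lines.size(); ++i)
        angles.push_back(std::atan2(double(lines[i][3] - lines[i][1]),
                                    double(lines[i][2] - lines[i][0])));

    std::vector<double> sorted(angles);
    std::sort(sorted.begin(), sorted.end());
    double medianAngle = sorted[sorted.size() / 2];   // assumed string direction

    // Keep the endpoints of lines whose angle is close to the median
    // (clustering of those lines is omitted here for brevity).
    std::vector<cv::Point> cluster;
    for (size_t i = 0; i < lines.size(); ++i) {
        if (std::fabs(angles[i] - medianAngle) < 0.1) {  // ~6 degrees, assumed tolerance
            cluster.push_back(cv::Point(lines[i][0], lines[i][1]));
            cluster.push_back(cv::Point(lines[i][2], lines[i][3]));
        }
    }
    return cv::minAreaRect(cluster);   // rotated bounding rectangle of the cluster
}
```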

3.2.2 Refining the fretboard area

Now that the algorithm has a view of the neck, it attempts to detect the frets (which are vertical lines in the transformed vicinity image). To do this, we first process the image using a horizontal Sobel filter, then threshold it to a binary image and remove noise via median filtering and the morphological close operator.
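
A minimal sketch of this fret-enhancement step is shown below; the threshold and kernel sizes are assumed values, not those used in the prototype.

```cpp
// Sketch of the fret-enhancement step (threshold and kernel sizes are assumptions).
#include <opencv2/opencv.hpp>

cv::Mat enhanceFrets(const cv::Mat& neckGray)
{
    cv::Mat sobel, binary;
    cv::Sobel(neckGray, sobel, CV_8U, 1, 0, 3);                // horizontal gradient -> vertical edges
    cv::threshold(sobel, binary, 60, 255, cv::THRESH_BINARY);  // assumed threshold
    cv::medianBlur(binary, binary, 3);                         // remove speckle noise
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(1, 5));
    cv::morphologyEx(binary, binary, cv::MORPH_CLOSE, kernel); // join broken fret lines
    return binary;
}
```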

A Hough transform is then performed on the image. Each cluster of approximately vertical lines is considered to be a fret. With height and x-position information for each fret, a very close estimate

Figure 1: Detecting the vicinity of the fretboard.

Figure 2: Estimating the fretboard location. Detected frets are highlighted in blue and the resulting fretboard bounds are highlighted in red.

for the location of the fretboard can be acquired. The long edges of the fretboard are estimated by fitting a line to the top and bottom points of the frets. The fitting algorithm used has to be very robust to outliers, such as spurious frets (which are mostly caused by vertical lines from the hand). Initial testing revealed that simple linear regression performed poorly, and we instead opted for a simple algorithm that creates candidate lines using pairs of points from adjacent frets, and chooses the candidate that minimises the sum of y-distances for each of the detected points.
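
A sketch of this candidate-pair fit, under the assumption that the fret endpoint coordinates have already been collected into parallel arrays, is given below.

```cpp
// Sketch of the outlier-robust edge fit described above: build candidate lines
// from pairs of points on adjacent frets and keep the one with the smallest
// total vertical distance to all points. Illustrative, not the authors' code.
#include <cmath>
#include <vector>

struct Line { double slope, intercept; };   // y = slope * x + intercept

Line fitEdge(const std::vector<double>& xs, const std::vector<double>& ys)
{
    Line best = { 0.0, 0.0 };
    double bestCost = 1e30;
    for (size_t i = 0; i + 1 < xs.size(); ++i) {          // adjacent fret pairs
        if (xs[i + 1] == xs[i]) continue;
        Line cand;
        cand.slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]);
        cand.intercept = ys[i] - cand.slope * xs[i];
        double cost = 0.0;
        for (size_t j = 0; j < xs.size(); ++j)             // sum of y-distances
            cost += std::fabs(ys[j] - (cand.slope * xs[j] + cand.intercept));
        if (cost < bestCost) { bestCost = cost; best = cand; }
    }
    return best;
}
```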

The leftmost detected fret is used as the bridge-end edge of the bounding rectangle. Note that the location of this edge is of little importance; the algorithm does not need to know the location of the bridge, and works well with a partial view of the fretboard. Detection of the nut, however, is extremely important, because it acts as the baseline that allows us to estimate fret numbers. Therefore, the nut is detected separately using a scanline algorithm that finds the rightmost bright vertical line of sufficient height.
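
One possible form of the nut scanline, assuming the binary fret image from the previous step and an assumed minimum-height fraction of 0.6, is sketched below.

```cpp
// Sketch of the nut search: scan columns from right to left in the binary fret
// image and return the first column with enough bright pixels. The 0.6 height
// fraction is an assumed value.
#include <opencv2/opencv.hpp>

int findNutColumn(const cv::Mat& binaryNeck)   // CV_8U image; frets and nut are white
{
    int minHeight = int(0.6 * binaryNeck.rows);
    for (int x = binaryNeck.cols - 1; x >= 0; --x) {
        if (cv::countNonZero(binaryNeck.col(x)) >= minHeight)
            return x;                          // rightmost sufficiently tall bright column
    }
    return -1;                                 // nut not found
}
```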

An example of the fret detection and estimation of the fretboard location is shown in Figure 2.

3.2.3 Fretspace normalisation

Now that we have a close-cut view of the fretboard, we need to normalise it horizontally so that the detected frets correspond to their expected locations in fretspace. We can use the luthier’s formula (Equation 1) to build an expected set of x co-ordinates for frets on the guitar neck. Rearranging this gives a simple ratio between adjacent

fret spacings that doesn’t require knowledge of the fretboard length L (Equation 2).

x_0 = 0
x_i = x_{i-1} + (L − x_{i-1}) / 17.817                    (1)

x_{i+1} − x_i = 0.94387 (x_i − x_{i-1})                    (2)

After first excluding frets detected within the horizontal range occupied by the hand (using the finger detection method described in the next section), we can examine the spacings between the detected frets, and automatically infer the locations of any frets that might be missing. To do this, we first assume that the two detected frets with the minimum separation distance are adjacent. Starting with these two frets, we can work backward toward the nut, comparing fret spacings in turn and adding in missing fret numbers when the distance is too large. In this way, detected fret lines can be accurately numbered according to their actual location on the fretboard.
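
The sketch below is our illustrative reconstruction of this inference. It assumes the detected fret x-positions are sorted from nut to bridge with the nut at index 0, and it approximates multi-fret gaps as multiples of the local single-fret spacing; the rounding tolerances are our own choices.

```cpp
// Sketch of missing-fret inference using the ratio from Equation 2.
// Not the authors' exact code; an approximation for illustration only.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<int> numberFrets(const std::vector<double>& x)   // x[0] = nut, sorted nut -> bridge
{
    const double r = 0.94387;                  // Equation 2: spacings shrink toward the bridge
    size_t n = x.size();
    std::vector<int> fret(n, 0);               // fret[0] = 0 is the nut itself
    if (n < 3) { if (n == 2) fret[1] = 1; return fret; }

    // Assume the two detections with the smallest separation are adjacent frets.
    size_t k = 1;
    for (size_t i = 1; i + 1 < n; ++i)
        if (x[i + 1] - x[i] < x[k + 1] - x[k]) k = i;

    // steps[i] = number of fret intervals between detection i-1 and detection i.
    std::vector<int> steps(n, 0);
    steps[k + 1] = 1;                          // the assumed-adjacent pair
    double s = x[k + 1] - x[k];                // current single-fret spacing estimate

    // Walk back toward the nut: the expected spacing grows by 1/r per fret.
    for (size_t i = k; i >= 1; --i) {
        s /= std::pow(r, steps[i + 1]);
        steps[i] = std::max(1, (int)std::floor((x[i] - x[i - 1]) / s + 0.5));
    }
    // Walk toward the bridge: the expected spacing shrinks by r per fret.
    s = x[k + 1] - x[k];
    for (size_t i = k + 2; i < n; ++i) {
        s *= std::pow(r, steps[i - 1]);
        steps[i] = std::max(1, (int)std::floor((x[i] - x[i - 1]) / s + 0.5));
    }
    // Accumulate interval counts into fret numbers relative to the nut.
    for (size_t i = 1; i < n; ++i) fret[i] = fret[i - 1] + steps[i];
    return fret;
}
```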

After assigning fret numbers to each of our detected frets, we can approximate the mapping to fretspace as a linear transformation. Simple linear regression can be used to derive this transformation, which is then applied to the fretboard image to normalise it. Figure 3 shows the resulting view of the fretboard.

Note that due to camera distortion and the angle of the guitar neck, the true relationship is not exactly linear. Due to this, our linear regression performed best when only considering the first 7-8 detected frets. Calibration and undistortion of the camera prior to running the algorithm would help with this issue, but the accuracy of the current system was sufficient for our tests, which mostly played notes near the nut. Accurate mapping of the higher frets would require non-linear regression techniques, and possibly a higher image resolution.
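
A sketch of this normalisation step is given below: ideal nut-relative fret positions are generated from Equation 1 with a unit scale length, and a least-squares line maps detected pixel positions onto them. Restricting the input to the first few numbered frets, as described above, is left to the caller; the function names are our own.

```cpp
// Sketch of the fretspace normalisation fit (illustrative reconstruction).
#include <vector>

// Ideal nut-relative position of fret n for a scale length of 1.0 (Equation 1).
double idealFretPosition(int n)
{
    double x = 0.0;
    for (int i = 0; i < n; ++i) x += (1.0 - x) / 17.817;
    return x;
}

// Least-squares fit of fretspaceX = a * pixelX + b over matched
// (fret number, detected pixel x) pairs.
void fitFretspaceMap(const std::vector<int>& fretNumbers,
                     const std::vector<double>& pixelX,
                     double& a, double& b)
{
    size_t n = fretNumbers.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; ++i) {
        double px = pixelX[i];
        double fx = idealFretPosition(fretNumbers[i]);
        sx += px; sy += fx; sxx += px * px; sxy += px * fx;
    }
    double denom = double(n) * sxx - sx * sx;
    if (n < 2 || denom == 0.0) { a = 1.0; b = 0.0; return; }   // degenerate input
    a = (double(n) * sxy - sx * sy) / denom;
    b = (sy - a * sx) / double(n);
}
```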

Figure 3: The normalised view of the fretboard. The image is flipped horizontally to simplify calculations and provide a more intuitive view to the user.

3.2.4 Finger detection

Previous work used the K-means algorithm to detect clusters of skin-coloured pixels [12]. However, this technique was not able to successfully locate individual fingers. Using the spatially localised fretboard view, our algorithm can detect individual fingers using the method illustrated in Figure 4.

First, Canny edge detection is performed on the normalised fretboard image. The output from this is dilated to close up small gaps in the image. We can now detect individual contours within the image, some of which will be fingers. It should be noted that while this method improves on previous attempts, it is not 100% accurate. If the camera gain is too strong, the edge between two adjacent fingers sometimes becomes partially invisible, joining the two contours together. The effects of this error are discussed in Section 4.
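
A minimal sketch of this contour-extraction step follows; the Canny thresholds and the number of dilation iterations are assumptions.

```cpp
// Sketch of the finger-contour step (thresholds and iterations are assumptions).
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<std::vector<cv::Point> > findCandidateContours(const cv::Mat& fretboardGray)
{
    cv::Mat edges;
    cv::Canny(fretboardGray, edges, 60, 120);
    cv::dilate(edges, edges, cv::Mat(), cv::Point(-1, -1), 2);  // close small gaps

    std::vector<std::vector<cv::Point> > contours;
    cv::findContours(edges, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    return contours;   // some of these contours will be fingers (filtered later)
}
```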

R > 95,   G > 40,   B > 20,   R ≥ G                        (3)

To determine which contours are the fingers, we detect the number of skin-coloured pixels within each contour, using a simplification of the skin classifier described in Peer, 2003 [14] (see the conditions in Equation 3). We modified this to be more tolerant than the original equation in order to mitigate the negative effects of glare caused by the camera gain. Furthermore, false negatives were less of an issue due to the other criteria involved in classifying the fingers. We also filter out contours that do not intersect the bottom of the image, and those that are disproportionately wide. Formally, we use the largest four (if available) contours that satisfy the conditions in Equation 4.

bbox.y + bbox.height ≥ image.height − 5
bbox.width ≤ bbox.height × 4
num_skin_pixels × 3 ≥ total_num_pixels                     (4)
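
The sketch below combines the skin test of Equation 3 with the geometric filter of Equation 4 for a single contour. For simplicity it counts skin pixels over the contour's bounding box rather than strictly inside the contour, and the selection of the largest four surviving contours is omitted.

```cpp
// Sketch combining the skin test (Equation 3) and contour filter (Equation 4).
// Assumes a BGR input image; an illustrative reconstruction, not the authors' code.
#include <opencv2/opencv.hpp>
#include <vector>

static bool isSkin(const cv::Vec3b& bgr)                     // Equation 3
{
    return bgr[2] > 95 && bgr[1] > 40 && bgr[0] > 20 && bgr[2] >= bgr[1];
}

bool isFingerContour(const std::vector<cv::Point>& contour,
                     const cv::Mat& fretboardBgr)
{
    cv::Rect box = cv::boundingRect(contour);

    // Must touch the bottom of the image and not be disproportionately wide.
    if (box.y + box.height < fretboardBgr.rows - 5) return false;
    if (box.width > box.height * 4) return false;

    // At least one third of the pixels (here: of the bounding box) must be skin-coloured.
    int skin = 0, total = 0;
    for (int y = box.y; y < box.y + box.height; ++y)
        for (int x = box.x; x < box.x + box.width; ++x) {
            ++total;
            if (isSkin(fretboardBgr.at<cv::Vec3b>(y, x))) ++skin;
        }
    return skin * 3 >= total;
}
```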

In future, it might be beneficial to replace the simple classifier with a more adaptive one. This possibility is further discussed in Section 5.

Figure 4: Finger detection. First, Canny edge detection is performed and the result is dilated to create detectable contours. These contours are then filtered by the criteria in Equation 4 to identify the fingers.

3.2.5 Output

Now that we have the contour of each finger, we can determine where the fingertip is located, as well as the rest of the finger. We can use this information to heuristically determine which frets are active in any given frame. In the current system, the fingertips are located in fretspace and output to the user as shown in Figure 5.
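
The paper does not spell out the final mapping from fingertip coordinates to tablature positions, so the sketch below is only our interpretation: fret wires are assumed to be available as x-positions in the normalised view, and the six strings are assumed to be evenly spaced horizontal bands.

```cpp
// Illustrative sketch of mapping a fingertip to a (fret, string) pair.
// fretX holds the x-position of each fret wire (index 0 = nut); the string
// numbering convention is an assumption.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

void fingertipToTab(const cv::Point& tip, const std::vector<int>& fretX,
                    int imageHeight, int& fret, int& stringNumber)
{
    // The active fret is the first fret wire at or beyond the fingertip.
    fret = (int)fretX.size();                    // beyond the last known fret
    for (size_t i = 1; i < fretX.size(); ++i) {
        if (tip.x <= fretX[i]) { fret = (int)i; break; }
    }
    // Strings as six evenly spaced bands across the neck height.
    stringNumber = 1 + std::min(5, std::max(0, tip.y * 6 / imageHeight));
}
```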

Figure 5: The output shown to the user. Detected active frets are highlighted in red. In this case, the fourth finger is incorrectly output as being active – this is an ambiguity which cannot be easily resolved with a purely vision-based system using only a single camera.

While the prototype only uses fingertip information at present, heuristics could be added in future to calculate extra output, such as whether or not a finger is barred (i.e. activating all strings beneath the finger contour). For example, a barred finger cannot cross frets, and a barred finger will never override a fret on the same string being activated closer to the nut².

4 Evaluation

We performed an evaluation to analyse the accuracy of each individual stage of the algorithm using the testbed described by Burns, 2007 [9]. While more sophisticated computer vision methods have since been developed, their evaluations were limited and they did not produce the same granularity of output; specifically, they were unable to detect individual finger positions. We therefore chose Burns’ algorithm as the most appropriate baseline for comparison.

4.1 Apparatus

The system used for testing was a Windows XP machine with a 2.66 GHz Core 2 Quad processor and an nVidia GeForce 9800 GT graphics card. The camera was a Logitech webcam running at a resolution of 640×480 pixels. The algorithm was implemented in C++, using Visual Studio 2008 and version 2.0 of the OpenCV library.

² Most barring exclusively uses the index finger.

4.2 Method

The testbed consisted of the three musical pieces shown in Figure 6: a C major scale, a C major chord progression, and an arrangement of Ode an die Freude. Each piece was recorded using the webcam and analysed by the algorithm post-hoc. At each step, the algorithm provided an output image and a set of intermediary processing images. This allowed us to identify the processing stage at which any errors occurred.

(a) The C major scale.

(b) A C major chord progression.

(c) Beethoven’s Ode an die Freude.

Figure 6: Transcriptions of the musical pieces used for evaluation (images taken from [9]). The letters indicate the finger used for each note (i = index, m = middle, a = ring, o = little).

The output was manually inspected. For each note that was played, the guitarist selected the corresponding frame in the output. For the C major chord progression, a block of 3-4 contiguous frames was chosen per chord, to provide more accurate results for the longer strum times. Intermediate or “transition” frames, where the guitarist was moving between hand positions, were ignored in the analysis (temporal segmentation is an issue not covered by our algorithm). We only considered fretted notes in each analysed frame – not those played on open strings. Similarly, because the system is unable to tell if fingers are touching strings or not, we only looked for the presence of the expected notes – any extraneous output notes were ignored.

In each frame selected by the guitarist, each expected note was checked for correctness. If it was present in the output, it was marked as correct. If it was not, the intermediary debug output was used to determine the cause of failure. Five types of failure were recorded: incorrect neck bounds, incorrect linear transform, joined fingers, finger not detected, and camera angle error. The error types are further explained below.

Incorrect neck bounds: This means that Step 2 of the algorithm failed to accurately detect the corners of the fretboard.

Incorrect linear transform: This means that fretspace normalisation (Step 3 of the algorithm) failed.

Joined finger contours: This means the algorithm detected a single contour containing multiple fingers, resulting in incorrect output.

Finger not detected: This means that the relevant finger contour was not classified as a finger.

Camera angle error: This is an error caused by the camera angle, such that from the camera’s point of view, a fingertip appears to be resting on a different string than it actually is (due to occlusion of the actual fingertip). These errors normally result in the detection of a fingertip on the string above its actual location, and are difficult to resolve with a single camera view.

4.3 Results

A table of results is displayed in Figure 7. 69% of the individual finger positions were detected correctly by the algorithm. It is worth noting here that the chord progression performed the worst, with only a 52% recognition rate. This was likely due to the condensed hand shapes required when playing chords; these hand shapes are more likely to cause false fret detections, throwing off the fretboard location, as well as making it slightly harder to distinguish between fingers.

Figure 8 displays a breakdown of the results in terms of string detection and fret detection. The results of 87% and 75% for fret and string detection

                      Fret                    String
                   1     2     3      1     2     3     4     5     6
C major scale     100%   75%   75%   N/A  100%  100%  100%   50%   N/A
C major chords     92%   79%   82%   29%   67%   83%   44%   60%  100%
Ode an die Freude  83%   88%  100%   N/A   67%   88%  100%    0%   N/A
TOTAL              90%   83%   91%   29%   68%   88%   81%   54%  100%
Overall                  87%                     75%

Figure 8: A breakdown of the recognition rate per string and fret.

                   Correct  Total  Percent
C major scale            7      9      78%
C major chords          25     48      52%
Ode an die Freude       38     45      84%
TOTAL                   70    102      69%

Figure 7: The table of results. The Correct column represents the number of individual finger positions detected correctly in the piece. The Total column represents the total number of detectable finger positions in the piece.

Figure 9: A chart displaying the frequency of each error type.

are fairly high; this shows that many of the classifications marked as incorrect in Figure 7 were in fact partially correct. It also demonstrates a significant increase in accuracy over Burns’ results of 76% and 48% respectively.

Errors

A summary of the different error types can be seen in Figure 9. By far the most prevalent cause of error was incorrect detection of the fretboard boundaries. Most of these error cases were due to erroneous detection of the top neck edge – often due to the camera angle, the topmost edge was detected further in than it should have been, thus omitting some of the area that would have otherwise been allocated to the low E string. This meant that the perspective transform stretched the neck a little

Figure 10: An example of a camera angle error. In this case, the second (middle) finger appears to be touching the fifth string, when in fact it is resting on the fourth.

too far vertically, resulting in off-by-one errors on the strings.

The second largest cause of error was the camera angle. During testing, the guitar neck was tilted slightly upward so the guitarist could have a view of the strings. This meant that the viewing angle of the desk-mounted camera was slightly below the horizontal. Under these circumstances, a bent finger can occlude its own fingertip, and can appear to be resting a string higher than it actually is. An example of this situation is shown in Figure 10.

5 Conclusions and Future Work

In this paper we presented a markerless approach for detecting fretting hand positions using computer vision. We evaluated it against the testbed used by Burns, 2007 [9] and demonstrated a significant increase in accuracy over their approach.

While the system’s performance results are promising, it is still affected by a number of issues. Here we outline some future work that needs to be done in order to improve the robustness of the system.

5.1 Tracking

In its current state, the prototype uses no tracking algorithms to smooth fretboard detection between frames. This results in a somewhat jumpy, but fairly accurate, result that is not affected by sharp movements of the fretboard. However, the efficiency and frame rate could be improved by adding a tracking calculation for the vicinity detection step. This would remove the requirement of performing line detection on the source image at full resolution, which is fairly intensive. Furthermore, if tracking latency was not an issue (i.e. if the user is guaranteed not to move the guitar neck too much), then a Kalman filter or similar could be implemented to decrease jumpiness.
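
As one possible shape for such a future extension (not part of the current prototype), a constant-velocity Kalman filter over the centre of the detected vicinity rectangle could be sketched with OpenCV's cv::KalmanFilter as follows; the noise covariances are arbitrary illustrative values.

```cpp
// Illustrative sketch only: smoothing the detected vicinity centre with a
// constant-velocity Kalman filter. A possible future extension, not part of
// the prototype described above.
#include <opencv2/opencv.hpp>

class VicinitySmoother {
public:
    VicinitySmoother() : kf_(4, 2, 0) {          // state: x, y, vx, vy; measurement: x, y
        kf_.transitionMatrix = (cv::Mat_<float>(4, 4) <<
            1, 0, 1, 0,
            0, 1, 0, 1,
            0, 0, 1, 0,
            0, 0, 0, 1);
        cv::setIdentity(kf_.measurementMatrix);
        cv::setIdentity(kf_.processNoiseCov, cv::Scalar::all(1e-3));
        cv::setIdentity(kf_.measurementNoiseCov, cv::Scalar::all(1e-1));
    }
    cv::Point2f update(const cv::Point2f& measuredCentre) {
        kf_.predict();
        cv::Mat m = (cv::Mat_<float>(2, 1) << measuredCentre.x, measuredCentre.y);
        cv::Mat corrected = kf_.correct(m);       // smoothed state estimate
        return cv::Point2f(corrected.at<float>(0), corrected.at<float>(1));
    }
private:
    cv::KalmanFilter kf_;
};
```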

5.2 Lighting

The system is also currently quite susceptible to lighting conditions and camera gain. In the development environment, we were not able to control the camera gain, and the auto-correction sometimes led to images that were either too bright or too dark. For example, auto-correction sometimes made the fingers appear too white, decreasing the accuracy of skin detection and reducing the detectable edges between fingers (which is critical for finger contour detection).

The algorithm would benefit significantly from some control over the camera gain. Furthermore, certain parameters and thresholds were manually optimised during development to perform well in the brightly-lit development environment, and as a result the system sometimes performs badly under different conditions. In future, we would like to include automatic optimisation of these parameters to make the system more robust.

References

[1] C. Traube and J. Smith, “Extracting the fingering and the plucking points on a guitar string from a recording,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA’01), 2001.

[2] NCH Software, “Pitch Perfect guitar tuner,” 2010, http://www.nch.com.au/tuner/index.html.

[3] Arobas Music, “Guitar Pro 6,” 2010, http://www.guitar-pro.com/en/index.php.

[4] B. Larsen, “Power Tab,” 2010, http://www.power-tab.net/.

[5] A. Klapuri, “Automatic transcription of music,” Master’s thesis, Tampere University of Technology, 1998.

[6] A. Klapuri, T. Virtanen, A. Eronen, and J. Seppanen, “Automatic transcription of musical recordings,” in Consistent & Reliable Acoustic Cues Workshop, CRAC-01, Aalborg, Denmark. Citeseer, 2001.

[7] G. Reis and F. Vega, “A novel approach to automatic music transcription using electronic synthesis and genetic algorithms,” in GECCO ’07: Proceedings of the 2007 GECCO conference companion on Genetic and evolutionary computation. New York, NY, USA: ACM, 2007, pp. 2915–2922.

[8] Y. Motokawa and H. Saito, “Support system for guitar playing using augmented reality display,” in Proceedings of the 2006 Fifth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR’06), Volume 00. IEEE Computer Society, 2006, pp. 243–244.

[9] A. Burns, “Computer Vision Methods for Guitarist Left-Hand Fingering Recognition,” Master’s thesis, McGill University, 2007.

[10] M. Sotirios and P. Georgios, “Computer vision method for pianist’s fingers information retrieval,” in Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services. ACM, 2008, pp. 604–608.

[11] C. Frisson, L. Reboursiere, W. Chu, O. Lahdeoja, J. Mills III, C. Picard, A. Shen, and T. Todoroff, “Multimodal Guitar: Performance Toolbox and Study Workbench,” Quarterly Progress Scientific Report of the numediart research program, vol. 2, no. 2, p. 67, 2009.

[12] G. Quested, R. Boyle, and K. Ng, “Polyphonic note tracking using multimodal retrieval of musical events,” in Proceedings of the International Computer Music Conference (ICMC), 2008.

[13] M. Paleari, B. Huet, A. Schutz, and D. Slock, “A multimodal approach to music transcription,” in Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, Oct. 2008, pp. 93–96.

[14] P. Peer, J. Kovac, and F. Solina, “Human skin colour clustering for face detection,” in EUROCON 2003 – International Conference on Computer as a Tool. Citeseer, 2003.