
Multimodal Navigation Within Multilayer-Modeled Gregorian Chant Information

Adriano Baratè, Goffredo Haus, Luca A. Ludovico and Damiano Triglione
Department of Computer Science
Università degli Studi di Milano
Via Comelico 39, 20135 Milano, Italy
Email: {adriano.barate, goffredo.haus, luca.ludovico, damiano.triglione}@unimi.it

Abstract—Information entities can often be described from many different points of view. In the digital domain, a multimedia environment is fit for describing, and a multimodal framework is fit for enjoying, those information entities which are rich in heterogeneous facets. The proposed approach can be applied to all the fields where symbolic, textual, graphical, audio and video contents are involved. This paper aims at deepening the concept of multimodal navigation for multilayer-modeled information. Addressing the specific case study of music, our work first presents a suitable format to encode heterogeneous information within a single document. Then it proposes a computer-based approach for analyzing recordings whose musical scores are known, identifying their semantic relationships automatically. Finally, the method is tested on a set of Gregorian chants, in order to produce a multimodal Web application.

Keywords—Multimodal navigation, multilayer models, music.

I. INTRODUCTION

In the digital era, the concept of multimedia - i.e. content which uses a combination of different forms and media - is strictly related to computers. An early definition can be found in [12], where multimedia is described as “any combination of text, graphic art, sound, animation, and video that is delivered by computer”. Digital devices natively and intrinsically support multimedia, and those allowing the user to control the information are said to support interactive multimedia.

Different media types have to cooperate in order to provide heterogeneous but integrated descriptions of the same entity. It is like looking at the many facets of a physical object from different perspectives in order to have a complete view of the object itself. Such a concept is trivial when we refer, for instance, to a movie: image, sound, music and even text (e.g. subtitles) are strictly integrated inside a unique information stream. In this case, the audience would be annoyed if images and audio were not perfectly synchronized, or if the movie had no spoken dialogue nor soundtrack.¹ But the concept becomes more relevant if we apply it to contexts where typically only one sense is involved, e.g. music. People are used to listening to music; nevertheless, a music piece can also be enjoyed or analyzed through other senses. For example, in some cases visual aspects are very relevant, such as for opera

¹This was the case of silent films, namely early movies with no synchronized recorded sound. However, in order to fill this gap, they were often accompanied by live music.

or ballet performances. From another perspective, a musician is also interested in the symbolic and graphic contents of a score. Finally, a music piece could also be touched, as in Braille music, namely a Braille code that allows music to be notated and read by visually impaired musicians [5].

These aspects will be addressed in the next sections, using a top-down approach. After introducing some key ideas, i.e. multilayer description and multimodal navigation, music will be treated as a particular field these concepts can be applied to. Finally, a very specific example about Gregorian chant will be discussed.

II. MULTILAYER MODELING OF HETEROGENEOUS INFORMATION

As mentioned before, a single information entity - e.g. a physical object, a digital representation, a concept, etc. - can often be described from many different perspectives. Multilayer structures are valid abstractions to take into account multifarious contents. The idea of layer implies adding a new surface to the previous ones, so that the overall aspect becomes more and more articulated. In our approach, layers correspond to different kinds of information related to a single object. As regards digital descriptions of an entity, homogeneous media types should be contained within a single layer, whereas different layers refer to different media types. This implies the possibility to describe an information entity by a multiplicity of descriptors, organized along two dimensions. We can provide a number of different ways to look at an object, corresponding to the concept of layers (vertical dimension), and for each layer a number of concurrent descriptions, corresponding to the concept of instances (horizontal dimension). In the aforementioned movie example, audio, video and text contents represent the different layers; the original audio and the tracks containing other local languages, on the contrary, can be considered instances of the same layer. A more detailed example, focusing on the music field, will be provided in Section IV.
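The two-dimensional organization described above can be sketched as a simple data structure. The layer and instance names below are illustrative examples, not prescribed by any format:

```python
# A minimal sketch of the layers/instances model: each layer (vertical
# dimension) groups one media type; each layer holds several concurrent
# instances (horizontal dimension) describing the same entity.
# All file names below are made-up examples.
movie = {
    "audio": ["original_english.wav", "italian_dub.wav", "french_dub.wav"],
    "video": ["theatrical_cut.mp4"],
    "text":  ["subtitles_en.srt", "subtitles_it.srt"],
}

def instances_of(entity, layer):
    """Return the concurrent descriptions stored in one layer."""
    return entity.get(layer, [])

print(len(movie))                    # number of layers
print(instances_of(movie, "audio"))  # instances within a single layer
```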

Now we want to stress another concept related to multimedia but more focused on interaction, namely multimodality. Multimodal interaction provides the user with multiple modes of interfacing with a system. A common application of this concept is using different senses to enjoy a unique information entity, as shown in Figure 1.

978-1-4673-2565-3/12/$31.00 ©2012 IEEE


Fig. 1. The five senses, representing five possible modes for multimodal interaction with an information entity.

Using several senses to interact with an information entity is a typical behaviour of people in real life. On the contrary, providing the user with a multisensorial interface in the digital domain is not a trivial task, and this problem has been addressed in a number of scientific works (e.g. in [10] and [11]). A multilayer structuring of information can result in advanced multimodal interfaces to enjoy contents. Some examples will be provided in Section III.

Whenever many layers are involved to describe a unique object, it is necessary to find an entity which works as a sort of glue among layers, keeping them synchronized. This concept, sometimes called spine, can be implemented e.g. through a list of unique event identifiers. Events are both the atomic entities to describe and the points where synchronization among layers can occur. For example, if we want to describe a theatrical performance, the minimum granularity can be set to single words. In this case, each word represents an event, has its own id and can be referenced by each layer (e.g. the symbolic layer with text contents, the audio layer containing recordings of the evening, etc.).

The presence of the spine, which acts as the common data structure for the multilayer environment, is fundamental for synchronization purposes. In fact, it is possible to jump from one representation of an event to another, either within the same or in another layer, by referring to their common id. Besides, adding new synchronized contents to the document means finding the occurrence of the events listed in the spine within the new material: thanks to the spine, the network of interconnections among layers is automatically produced.
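The spine mechanism can be sketched as follows: a shared list of event ids, with each layer instance localizing every event in its own domain (seconds for an audio track, page coordinates for a score scan, and so on). All identifiers and values below are made-up examples, not data from any actual document:

```python
# Sketch of the spine concept: events are referenced by id from every
# layer instance, so jumping between representations is an id lookup.
spine = ["e1", "e2", "e3"]  # unique event identifiers

# Each instance anchors every spine event in its own domain.
layers = {
    "audio/performance_A": {"e1": 0.0, "e2": 1.8, "e3": 3.5},  # seconds
    "audio/performance_B": {"e1": 0.0, "e2": 2.1, "e3": 4.0},  # seconds
    "notational/scan_1": {"e1": (120, 80), "e2": (160, 80), "e3": (200, 80)},  # (x, y)
}

def jump(event_id, source, target):
    """Move from one representation of an event to another via its common id."""
    assert event_id in layers[source], "event not localized in source instance"
    return layers[target][event_id]

# From event e2 in performance A, jump to the same event on the score scan:
print(jump("e2", "audio/performance_A", "notational/scan_1"))
```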

III. MULTIMODAL QUERIES ON HETEROGENEOUS CONTENTS

One of the aspects of interest when dealing with heterogeneous but linked contents is the possibility of performing multimodal queries.

In a context where the description of an entity relates to a unique domain, the only way to retrieve information is using data from that domain: this is the case of a plain-text search on a corpus of text documents. Even in this simple example, multimodality can find an application. In fact, it is possible to employ different ways to interact with the system - e.g. speech-to-text translation - in order to obtain symbolic contents from audio contents. A further example is visual indexing for text contents (like in some LaTeX guides), which implies using graphical materials to find symbolic descriptions. In the music field, a well-known case is query by humming: the user-hummed melody is converted into a sequence of music symbols which is used to search into a database of symbolic scores.

In a multilayer environment an object can be described from many points of view. These descriptions are not independent, as they are linked together by the concept of spine, which provides both time-based synchronization and any other form of relationship. In this sense, all the facets of an object can be employed to drive queries on either homogeneous or heterogeneous contents. A possible application is using known information contained in one layer to find unknown information stored in another, thus belonging to a different domain. A multilayer approach makes it possible to avoid domain conversions, since there is a previously-established correspondence among different representations of the same information entity. In the case of query by humming, for instance, audio contents could be compared directly to an audio track, thus finding the occurrence of the music theme also in the symbolic score.

In this context, multimodality can be pushed to a higher level: many interaction modes can be used together, working in parallel to query many layers. We refer to this concept as polymodality. The presence of heterogeneous descriptors for a unique entity, belonging to multifarious domains, allows the user to adopt the most adequate sense to interact with the system, or even to employ many senses as well as many interaction modes simultaneously. An example is using voice and gesture together to control a system which provides both visual and aural feedback. This approach becomes particularly interesting in an augmented reality environment.

IV. THE MUSIC CASE STUDY

Music is a relevant example of a field where multiple symbolic and media contents can provide a richer description of a unique information entity.

As stated in [7], music can be described in a multilayer environment, where each layer corresponds from a theoretical perspective to a different degree of abstraction, and from a practical perspective to a different media type. In particular, six layers can be identified:



• General - Music-related metadata, i.e. catalogue information about the piece;
• Logic - The logical description of score symbols;
• Structural - Identification of music objects and their mutual relationships;
• Notational - Graphical representations of the score;
• Performance - Computer-based descriptions and executions of music according to performance languages;
• Audio - Digital or digitised recordings of the piece.

The previous list highlights the heterogeneous aspects that can be involved in the description of a music piece.

The concepts previously introduced, namely the multilayer representation of digital information and the multimodal way to enjoy it, are the bases in the music field for the IEEE 1599 standard [2].

IEEE 1599 is an XML-based format whose development followed the guidelines of the IEEE Computer Society, called Recommended Practice Dealing With Applications and Representations of Symbolic Music Information Using the XML Language. This initiative aimed at representing music information in a comprehensive way. Its ultimate goal is providing a highly integrated representation of music, where score, audio, video, and graphical contents can be appreciated together. This IEEE standard, sponsored by the Computer Society Standards Activity Board, was launched by the Technical Committee on Computer Generated Music (IEEE CS TC on CGM).

From a technical point of view, the format implements the six-layer layout described above by using as many sub-elements of the XML root element. Besides, the concept of spine introduced in Section II plays a key role inside an IEEE 1599 document, allowing both intra- and inter-layer synchronization² of music contents.
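The six-layer layout can be pictured as an XML skeleton, built here with Python's ElementTree. The element and attribute names below are simplified illustrations and are not guaranteed to match the official IEEE 1599 schema:

```python
import xml.etree.ElementTree as ET

# Sketch of a six-layer document with a spine of event ids.
# Element/attribute names are illustrative, not the official schema.
root = ET.Element("ieee1599")

# The spine: a list of unique event identifiers.
spine = ET.SubElement(root, "spine")
for event_id in ("e1", "e2", "e3"):
    ET.SubElement(spine, "event", {"id": event_id})

# One sub-element of the root per layer.
for layer in ("general", "logic", "structural",
              "notational", "performance", "audio"):
    ET.SubElement(root, layer)

# Layers refer to spine events by id: this enables inter-layer synchronization.
audio = root.find("audio")
ET.SubElement(audio, "track_event", {"event_ref": "e1", "start_time": "0.0"})

print(ET.tostring(root, encoding="unicode"))
```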

IEEE 1599 will be adopted to encode Gregorian chant, as explained in Sections VI and VII.

V. AUTOMATIC TAGGING OF MUSIC CONTENTS

In the context of a multilayer modeling of information, one of the most relevant problems is keeping heterogeneous media synchronized by creating pointers, both inside a single layer and among layers.

For the sake of clarity, an example is called for. Let m be the number of different digital objects belonging to n different layers. The synchronization tags obviously differ from one media type to another from a semantic point of view, since e.g. a time position cannot be identified like a score area. But, more surprisingly, we need distinct synchronization tags also for the instances inside a unique layer, since versions may differ. Let us consider a number of performances of the same operatic aria: each one has its own attack times, base metronomes, tempo changes, etc. Similarly, different score editions have different layouts, namely they use different margins, spacings

²Intra-layer synchronization occurs among instances belonging to a single layer, whereas inter-layer synchronization takes place among instances in different layers.

and fonts, etc. When adding the (m+1)-th instance, from a theoretical point of view it should be synchronized to the m already available instances. Let k be the number of events identified in the spine: this would imply computing k · m points of synchronization for the new instance. Luckily, the concept of spine itself makes it possible to create only the link between each event and its unique id inside the common data structure, thus reducing to k the number of synchronization points to be determined. In this way, the complete network of intra- and inter-layer synchronizations can be reconstructed on the basis of the spine.
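The saving described above (k synchronization points instead of k · m) can be sketched as follows: a new instance is anchored only to the spine, and any pairwise mapping between two instances is then derived by composition. Instance names and timestamps are made up for illustration:

```python
# Each instance stores only k anchors (event id -> position in that
# instance). The direct map between any two instances is derived through
# the spine, so adding the (m+1)-th instance never requires k*m direct
# comparisons, only its own k anchors. All values are illustrative.
spine_events = ["e1", "e2", "e3"]  # k = 3

performance_A = {"e1": 0.0, "e2": 1.8, "e3": 3.5}  # seconds
performance_B = {"e1": 0.2, "e2": 2.1, "e3": 4.0}  # new instance: k anchors only

def derived_sync(src, dst):
    """Compose src->spine and spine->dst into a direct src->dst map."""
    return {src[e]: dst[e] for e in spine_events}

# Pairwise synchronization between A and B, reconstructed via the spine:
print(derived_sync(performance_A, performance_B))
```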

As already mentioned, IEEE 1599 allows the synchronization of many performances of the same music piece, as the generic music event e_i can be associated to a different timestamp in each performance.

It would be desirable to create those pointers automatically, in an efficient and effective way. Of course manual annotation could be a solution, but it is certainly neither the fastest nor the cheapest. In the field of Music Information Retrieval (MIR), the research area on semi-automated or fully automated synchronization is very active. In our recent works, we are looking with great interest at the ideas of inspired authors like Dan Ellis [6] and Meinard Müller [9], trying to improve their approach where possible.

Our original contribution in this field focuses on: modularity of the Matlab functions; efficiency of the computation; code readability; and flexibility in enabling and choosing suitable parameters to adjust the behaviour of the algorithms. Another step forward is the ability to perform some numerical processing in a more precise fashion, when needed. Lastly, we are going to consider some metrics to validate the outputs through statistically proven descriptors, both intrinsic and relative to other outputs taken as reference. At present, the consequences of a low signal-to-noise ratio are not included in our investigations, but they are planned to be.

Now we will provide a specific example of automatic tagging for audio contents, starting from a symbolic score and an audio track. The proposed algorithm suitably transforms the symbolic score into a MIDI file, and then extracts from the latter the fingerprints to be compared to the audio file fingerprints, as explained below.

The basic concept is considering two digital objects, namely the original audio track and the MIDI rendition of the corresponding score, as two streams of information where:

• time can be stretched non-linearly in both streams;
• timbre information (i.e. the instruments used in the performances) can be different, but there is still a common spectral fingerprint that can be identified in both tracks. This fingerprint allows time to be dynamically adjusted in order to make the two streams play synchronously.

Let fp_n(t) be the fingerprint of the n-th track in a time-frame of length ∆ (i.e. in the interval [t, t + ∆]). We can evaluate, for both files, the values of fp_n(t) with a sampling period δ. For instance, in the case of the first file, we can collect the column vector:

V_1 = {fp_1(t) | t = k · δ}  (1)

with k ∈ N_0, which is

[fp_1(0), fp_1(δ), fp_1(2δ), ...]^T  (2)

We should observe that δ, namely the sampling interval, is responsible for the time-wise accuracy: smaller values bring higher accuracy, even if the algorithm becomes more and more time- and memory-consuming. The time-frame length ∆ is responsible for the frequency-wise accuracy: smaller values bring lower accuracy, because fewer bins are available for the Fourier Transform; nevertheless, too large values have a negative impact on the ability to be robust against rapid relative changes in tempo. Therefore, extensive tests and in-depth studies must be conducted to find suitable values for the pair of parameters (∆, δ).

Please note that ∆ and δ are not strictly related to the original sampling frequency fs of the audio file, or the original time granularity Tg of the MIDI file. Obviously, it must be the case that δ ≫ max(1/fs, Tg), since a collection of many audio samples is needed to build each fingerprint sample; besides, ∆ ≥ δ should be verified too, otherwise some audio samples of the original files would be neglected. Consequently, the two audio files do not have to share the same sampled time domain; in other words, we do not require Tg = 1/fs.
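Collecting the fingerprint vector of Eq. (1) can be sketched as below. The paper does not fix a specific fingerprint function, so the choice made here - a unit-normalized magnitude spectrum per frame - is only one plausible option, and the parameter values are arbitrary examples:

```python
import numpy as np

def fingerprints(signal, frame_len, hop):
    """Collect the column vector of Eq. (1): one fingerprint per sampling
    period delta (= hop samples), over frames of length Delta (= frame_len
    samples). The fingerprint used here, a unit-normalized magnitude
    spectrum, is an assumed choice, not prescribed by the paper.
    """
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame))
        norm = np.linalg.norm(spectrum)
        frames.append(spectrum / norm if norm > 0 else spectrum)
    return np.array(frames)  # shape: (num_frames, frame_len // 2 + 1)

# delta >> 1/fs: many audio samples contribute to each fingerprint sample,
# and Delta >= delta, so no sample of the original file is neglected.
fs = 8000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz sine
V1 = fingerprints(signal, frame_len=1024, hop=512)
print(V1.shape)
```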

The similarity matrix S between V_1 and V_2 is defined as follows:

S = V_1 · V_2^T  (3)

The generic element of S is

{S}_{i,j} = fp_1[(i − 1) · δ] • fp_2[(j − 1) · δ]  (4)

where i = 1, 2, ..., n_1 and j = 1, 2, ..., n_2.

From the similarity matrix S, an operational research problem is solved by finding the optimal path that minimizes the cumulative cost associated to the dissimilarity matrix D = 1 − S. D must be traversed from the element (1, 1) to the element (n_1, n_2). In the computation of the optimal path, a diagonal step could be weighted more than a horizontal or a vertical step. The resulting path is the map which fulfils the synchronization.

The described algorithm has been employed to synchronize the music contents presented in the next section. We have applied it to two opposite cases, trying to synchronize a MIDI file with both its audio rendition (trivial case) and an audio recording which also contains an instrumental accompaniment (complex case). The results are shown in Figures 2 and 3 respectively. In the former case, the optimal path is represented by a perfectly straight line; in the latter case, the line is broken due to the differences in the audio contents.

Fig. 2. The optimal path for the synchronization between a MIDI file and its audio rendition.

Fig. 3. The optimal path for the synchronization between a MIDI file and a real-world recording.

VI. THE GREGORIAN CHANT CASE STUDY

As mentioned before, the chosen format to encode music in a multilayer environment is IEEE 1599. This international standard represents heterogeneous information within a unique framework, where contents are intrinsically synchronized. More precisely, this format provides two different kinds of synchronization, namely intra-layer and inter-layer synchronization. The former occurs among contents stored in a single layer, e.g. many audio performances of the same piece; the latter regards contents placed in different layers, e.g. a graphical score and an audio track. IEEE 1599 is an XML-based format, whose international standardization dates back to 2008. For further information, please refer to [8].

The typical use of such a language is encoding Common Western Notation (CWN) together with many other related materials, e.g. audio tracks and score scans, and in this sense many examples and applications are available ([1], [4], and [3]).

In the framework of a cultural heritage project, a campaign held at the Certosa of Pavia (Italy) produced the digitization of 13 graduals dating back to the XVI century and containing masses in neumatic notation.

In this case study, the notation in use substantially differs from CWN. There are methods and transcription rules, well-known in the literature, to convert neumes into modern notation, so that a symbolic encoding can also be provided inside the XML document. The part of the IEEE 1599 format devoted to the symbolic description of music is the Logic layer. Here we can describe both neumes and their modern transcription. This case study also includes lyrics, which obviously play a fundamental role in Gregorian chant, and they are encoded within the Logic layer as well.

As regards the Notational layer, devoted to graphical scores, in this case it is very rich in digital objects. The original score, a more recent printed version and a transcription in modern notation are all linked to music events. An aspect of further interest emerges: neumatic notation, due to the extremely wide diffusion of the repertoire and to the hand-made process of copying, often presents small variants as regards note pitches or their grouping into neumes. For this reason, it is interesting to compare other score versions.

Audio contents in IEEE 1599 are placed within the Audio layer. Also in this case, intra-layer synchronization can be provided, by adding many performances of the same piece. For instance, the XML document makes it possible to link both a vocal version, recorded in a schola cantorum and linked to the Audio layer, and an instrumental one, performed on a synthesized church organ and linked to the Performance layer. Inter-layer synchronization will be automatically obtained thanks to the structure of IEEE 1599 documents.

In conclusion, the case study regarding Gregorian chant is relevant as regards both the number and the heterogeneity of the materials involved, which can be synchronized within a unique document. A graphical representation is shown in Figure 4. All this information requires a specifically-designed viewer to be enjoyed; the subject will be treated in the next section.

VII. INTERACTING WITH GRADUALE 814

The ideas illustrated in the previous sections have been applied to this case study, thus producing an online viewer to enjoy the Gregorian chant repertory (see Figure 5). In this project, IEEE 1599 has been employed to revive Gregorian chant and to make it enjoyable also by non-experts. The link to the Web application is:

http://graduali.lim.dico.unimi.it.

The screenshot in Figure 5 illustrates the user interface of the application, split in two main parts: the left panel, which lists the available sections of the Graduale 814, and the right panel, which presents the main part of the interface, described later. Music browsing is based on controls that operate in synchronism while the music is being played.

For every selected music piece, the upper right part of the main panel hosts the media controls, i.e. the play/pause button, the position bar, and the volume control. The upper left part hosts a choice among three scores that can be loaded: the original manuscript in neumatic notation, a later transcription, and the corresponding score in CWN. Upon that selection, several synchronized activities start and execute in real time. The music starts playing, while on the score the running cursor indicates what is being played.

The user can move the red cursor with the mouse and initiate playing from another point in the score, and the audio/video player cursor changes its current position accordingly. During the performance, users can change the score, with a real-time re-synchronization on both the graphical and the audio media. This is particularly useful to compare the neumatic notation with the CWN transcription, in a fully-synchronized multimodal environment.

In conclusion, for Gregorian chant, an IEEE 1599-based application provides not only a way to make scores enjoyable by untrained people through its score-following features, but also a professional tool to explore melodic and notational variants in real time.

VIII. CONCLUSION

In this paper we have illustrated an advanced approach to multilayer modeling and multimodal enjoyment of information. The aforementioned concepts can be applied to any field and any kind of information where heterogeneous points of view can be combined in order to provide a unique picture of the object to describe. Music in general, and Gregorian chant in particular, has constituted a case study to show the applicability of our approach to real-world examples.

Further applications can be conceived, ranging from education to entertainment, from multimedia installations to digital archives.

Fig. 4. The layers used for Gregorian chant.



Fig. 5. A Web interface for Gregorian chant.

ACKNOWLEDGMENT

The authors would like to thank all the colleagues at the Laboratorio di Informatica Musicale (LIM), and in particular Stefano Baldan and Davide Andrea Mauro, for their cooperation in the IEEE 1599 initiative. This work has been produced in the framework of the TIVal (Tecnologie Integrate per la documentazione e la Valorizzazione dei beni culturali lombardi) project, supported by the Regione Lombardia.

REFERENCES

[1] D. Baggi, A. Baratè, G. Haus, L. A. Ludovico, NINA - Navigating and Interacting with Notation and Audio, SMAP 2007 proceedings: Second International Workshop on Semantic Media Adaptation and Personalization, London, United Kingdom, 17-18 December 2007, pp. 134-139, Los Alamitos, California, USA: IEEE Computer Society, 2007.

[2] D. Baggi, G. Haus, Music Navigation and Interaction With Symbols and Layers - From Binary Audio to Interactive Musical Forms, Hoboken, New Jersey, USA: John Wiley & Sons Inc., 2012.

[3] A. Baratè, G. Haus, L. A. Ludovico, Music representation of score, sound, MIDI, structure and metadata all integrated in a single multilayer environment based on XML, Intelligent Music Information Systems: Tools and Methodologies, pp. 305-328, Hershey, Pennsylvania, USA: Information Science Reference, 2008.

[4] A. Baratè, L. A. Ludovico, A. Pinto, A Computer Tool to Visualize Score Analysis, Proceedings of the 2008 Computers in Music Modeling and Retrieval and Network for Cross-Disciplinary Studies of Music and Meaning Conference, pp. 315-326, Copenhagen, Denmark: Re:New, 2008.

[5] M. T. De Garmo, Introduction to Braille music transcription, Washington, DC, USA: Library of Congress, 1974.

[6] B. Gold, N. Morgan, D. Ellis, Speech and Audio Signal Processing, Hoboken, New Jersey, USA: John Wiley & Sons Inc., 2011.

[7] G. Haus, M. Longari, A multi-layered, time-based music description approach based on XML, Computer Music Journal, 29(1), pp. 70-85, MIT Press, 2005.

[8] L. A. Ludovico, Key concepts of the IEEE 1599 Standard, Proceedings of the IEEE CS Conference The Use of Symbols To Represent Music And Multimedia Objects, pp. 15-26, Lugano, Switzerland: IEEE CS, 2008.

[9] M. Müller, Information Retrieval for Music and Motion, Berlin, Germany: Springer, 2007.

[10] S. Oviatt, Ten Myths of Multimodal Interaction, Commun. ACM, 42(11), pp. 74-81, New York, NY, USA: ACM, 1999.

[11] S. Oviatt, A. De Angeli, K. Kuhn, Integration and Synchronization of Input Modes During Multimodal Human-Computer Interaction, Computer Human Interaction '97, pp. 415-422, New York, NY, USA: ACM, 1997.

[12] T. Vaughan, Multimedia: Making it work, Tata McGraw-Hill Education, 2006.
