
Molecular clock synthesis

18th January 2015

Contents

1 Introduction

I Extracting symbols from signal
2 Data structure
3 Harmonic Information
  3.1 From Chords to symbols
  3.2 Harmonic Analysis (in modern Music)
  3.3 Energy variations
4 Going further

II Aligning symbolic sequences
5 Alignment of symbolic sequences
  5.1 An introduction to pairwise alignment
  5.2 Levenshtein (edit) distance
  5.3 Dynamic Programming
  5.4 The Needleman-Wunsch algorithm
  5.5 On the influence of the scoring matrix
6 Multiple alignments algorithms
  6.1 An introduction to multiple alignment
  6.2 Dynamic programming (exact)
  6.3 Center-star method (approximation)
  6.4 Heuristics methods
    6.4.1 ClustalW - Progressive alignment
    6.4.2 MUSCLE - Iterative method
  6.5 Computing the consensus sequence
7 Going further


1 Introduction

This document details the ATIAM project called Molecular clock synthesis. The objective of this project is to work on the relationships between signal and symbolic layers of information in order to develop a new type of synthesis relying on the joint analysis of symbolic and signal representations. The project is divided into three main modules aimed at providing you with different knowledge through a set of corresponding exercises. Each of these modules will be combined to form the global project.

Module.1 Extraction of symbolic information from signal data. This module covers different ways to extract high-level structural information (chords, degrees) from raw data.

Module.2 Alignment of sequences for structure discovery and similarity analysis. This module entails the generic alignment of symbolic sequences and the use of alignment as a measure of similarity and for different types of structure discovery (such as multiple sequence alignment summarizing the structure of a selected group).

Module.3 Synthesis of audio from symbolic alignment data. Based on the aligned symbolic sequences, we can project this information onto the signal layer in order to re-synthesize a sequence based on the consensus of different audio files.

The goal of this project is to learn the interplay between signal processing, mathematics and computer science in the context of analyzing musical information in both symbolic and signal representations.

Figure 1: Overall workflow of the project. Block labels in the diagram: song database, database (metadata) gathering, data structure, chords to symbols, harmonic analysis, energy variations (higher-level symbolic sequence extraction); global pairwise alignment, influence of score matrix, multiple sequence alignment, consensus sequence, clustering / classification (multiple sequences alignment); synthesis. Example symbol strings shown: CCDF#GDB [...], CFBBBD#B [...], CCDF#GB [...], C--F--B [...].


Part I

Extracting symbols from signal

At first we want to be able to extract from a set of songs a whole range of information summarizing different symbolic aspects of the underlying music. Hence, the project will be based on the research database known as the Quaero set, which was used in several research fields related to Music Information Retrieval (MIR). The goal of this section is therefore to understand various ways to extract higher-level information out of continuous raw signal data.

Objectives

1. Creating a database and establishing a proprietary format for large-scale analyses.

2. Extracting structured higher-level information out of a large dataset.

3. Understanding algorithms to extract musical knowledge out of signal.

4. Comparing the quality and interest of different types of information extracted.

5. Targeting different applications and kinds of knowledge depending on the data at hand.

Database description. Quaero is one of the most suitable databases to study modern popular music. It contains 138 songs performed by 59 artists. It is important to observe that the adjective popular does not only concern Justin Bieber's last hit but, in a wider sense, everything that is interesting for and understandable by the majority of the population. Hence Quaero includes a huge set of different styles and genres.

Given a database, it is important to understand whether it is necessary (when possible) to collect some metadata. In this case it would be interesting to create a metadata structure (data about the data) that would allow you to sort songs by artist, year, genre, et cetera.

Quaero’s artists

50 Cents, ACDC, Aerosmith, Ali Farka Toure, Amy Winehouse, Bjork, Bobby McFerrin, Britney Spears, Buckcherry, Buenavista social club, Carl Douglas, Cher, CoCo Lee, Cranberries, D'Angelo,

Daughtry, Destiny's Child, Dillinger, Dolly Parton, Eminem, Eric Clapton, Faith Hill, Finger Eleven, Flo Rida, Franck Zappa, Georges Michael, Goran Bregovic, Gwen Stefani, Jedi Mind Tricks, Jim Jones,

Joan Baez, Judas Priest, Justin Timberlake, Kiss, Lil Wayne, Ludacris, Madcon, Madonna, Mariah Carey, Massive Attack, Michael Jackson, Moby, Neil Young, Obituary, Patrick Hernandez,

Pink Floyd, Platiskman, Pucho and his latin soulbros, Puff Daddy, Faith Evans, Radiohead, Ray Charles, Run DMC, Scorpions, Shack, Sweet, The Beatles, The Cure, The Fall A Sides

Some features and three different kinds of symbol sequences have already been computed in Matlab and stored in a structure variable. In the next section we briefly describe the nature of this data, which you can use, modify and improve during the project. You can always refer to that section as a summary to find a particular feature or sequence. However, it is necessary to read section 3 for a detailed analysis of the methods used to generate each sequence and of its musical meaning.

2 Data structure

The information deduced from the audio is stored in a Matlab structure variable called AllSongsData. To access the data in Matlab, type the command:


>> AllSongsData(k)

ans =
                  name: '0020-Queen-A Night at the Opera...'
                  Data: [68x25 dataset]
                  Stat: [3x3 dataset]
             ChordType: [68x1 double]
           ChordString: 'ZHXAHAHVOHAFHAFTAQJCAKLFHA...'
                Degree: [68x1 double]
          DegreeString: 'ZACDADABBEADEADBEBEAEDEDE...'
             Variation: [68x1 double]
          RelVariation: [68x1 double]
    RelVariationString: 'Z16A1C1D1A1D1A1B1B0...'

The command

>> AllSongsData(k).name

will display the name of the k-th audio file of the dataset. Here follows the description of each component of the data structure.

name: the name of the .wav file associated with the song.

Data: a #Chords×18 dataset containing information about the song: Chord_name, arpeggio, arpeggio in Z/12Z, Chord_Duration (ms), Chord_Duration (beat), Average BPM, Most_Used_Chord, Tonality for each chord, Discrete Hamiltonian Equation data. You can use it to make your own experiments and try to generate new sequences of symbols.

Stat: a simple statistical analysis of the graph of the discrete Hamiltonian equation associated with the song.

ChordType: mapping from chord names to integers.

ChordString: mapping from chord integers to symbols.

Degree: degree of each chord in relation to its tonality.

Variation: modulations, i.e. changes from one key to another in the harmonic structure of the song.

RelVariation: distance between the tonal regions of the song and the most used tonality.

RelVariationString: string in which each degree is followed by a number which identifies its tonality in terms of distance from a central one.

3 Harmonic Information

It is natural for a (trained) listener to intuitively segment and classify music while listening to it, see figure 2. Our conjecture is that the high-level, symbolic interpretation of a song is a good descriptor of the intrinsic properties that allow us to naturally classify music into genres or moods. Thus we used a classifier based on support vector machines to transcribe a simplified version of the harmonic structure of Quaero's songs. The class of chords we transcribed is limited to the standard major and minor triads. The hypothesis is that augmented and diminished triads are negligible in the study of modern popular music. In addition, this hypothesis allows us to work with a small alphabet, composed of 24 symbols plus the identity element (no chord).


Figure 2: Intuitive music segmentation. © John Atkinson, Wrong Hands.

In the next paragraphs several interpretations of the harmonic information are examined. At first we will simply try to convert chords into symbols, wondering whether it is possible to define a distance among chords that respects the music theory constraints of modern harmony. For instance, we want the C major triad to be close both to Cm and to its relative minor Am, and far from G. After this first step we will climb to the next level of harmonic analysis, creating a tonal region interpretation of each song based on the harmonization of the major scale, see figure 3.

Degree   1   2    3    4   5   6    7
Chord    C   Dm   Em   F   G   Am   B°

Figure 3: C major harmonization

In the last part a more exotic feature related to the energy variation in music will be considered. In this case alignment will not refer to genre, but to the listening effort, or symmetrically to the divergence of the song from the static state (trivial harmony).

3.1 From Chords to symbols

The set of chords involved in this kind of study is limited to the 24 major and minor triads:

\[ T := \{C, C\sharp, D, D\sharp, E, F, F\sharp, G, G\sharp, A, A\sharp, B\} \times \{\text{Major}, \text{Minor}\}. \]

The association among chords and symbols is described in table 1.


Chord   Symbol      Chord   Symbol
C       A           Cm      M
C#      B           C#m     N
D       C           Dm      O
D#      D           D#m     P
E       E           Em      R
F       F           Fm      S
F#      G           F#m     T
G       H           Gm      U
G#      I           G#m     V
A       J           Am      W
A#      K           A#m     X
B       L           Bm      Y

NoChord Z

Table 1: Association among chords and characters

The whole dictionary is composed of 25 symbols. This information is contained in the string

>> AllSongsData(k).Data.ChordString

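It is possible to define a distance among chords. Let $P = (p_1, \dots, p_n)$ and $Q = (q_1, \dots, q_n)$ be two chords of $n$ voices, where $(p_i)_{i=1}^{n} \in (\mathbb{Z}/12\mathbb{Z})^n$ and $(q_i)_{i=1}^{n} \in (\mathbb{Z}/12\mathbb{Z})^n$, i.e. the voices of the chords are pitches computed from frequencies with the following formula

\[ p = 69 + 12 \log_2(f/440), \]

where $f$ is the frequency (in Hz) we want to represent as the real number $p$. This representation allows us to define the family of distances $d^{\alpha}_{\text{chords}} : (\mathbb{Z}/12\mathbb{Z})^n \times (\mathbb{Z}/12\mathbb{Z})^n \to \mathbb{R}$ for $\alpha \in \mathbb{N} \cup \{\infty\}$, as follows:

\[ d^{\alpha}_{\text{chords}}(P, Q) = \left( \sum_{i=1}^{n} |p_i - q_i|^{\alpha} \right)^{1/\alpha} \quad \text{if } \alpha < \infty, \]

and

\[ d^{\infty}_{\text{chords}}(P, Q) = \max_{1 \le i \le n} |p_i - q_i| \quad \text{if } \alpha = \infty. \]

These distances are the ones usually defined on the $\ell^{\alpha}$ spaces.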
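As a small illustration of the pitch formula above, the following Python sketch converts a frequency to the real-valued pitch $p$ and then to a pitch class. The function names are ours, and rounding to the nearest semitone before reducing modulo 12 is an assumption about how the voices end up in Z/12Z.

```python
import math

def frequency_to_pitch(f_hz):
    """Real-valued pitch p = 69 + 12 * log2(f / 440)."""
    return 69 + 12 * math.log2(f_hz / 440.0)

def pitch_class(f_hz):
    """Nearest semitone reduced modulo 12 (0 = C, 1 = C#, ..., 11 = B).

    Assumption: the reduction to Z/12Z is done by rounding, then taking mod 12.
    """
    return round(frequency_to_pitch(f_hz)) % 12

print(frequency_to_pitch(440.0))  # the reference A
print(pitch_class(261.63))        # a frequency close to middle C
```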

Example. Distances $d^{1}_{\text{chords}}$, $d^{2}_{\text{chords}}$ and $d^{\infty}_{\text{chords}}$ between the C major triad and the set of all possible triads.

1. We order each chord from its lowest to its highest pitch in Z/12Z: C = (0, 4, 7), C# = (1, 5, 8), ... and Cm = (0, 3, 7), C#m = (1, 4, 8), ...

2. We compute the distances for each pair of the form (C, P), where P is one of the 24 major and minor triads: $d^{1}_{\text{chords}}(C, C) = 0$, $d^{1}_{\text{chords}}(C, C\sharp) = |0 - 1| + |4 - 5| + |7 - 8| = 3$, ...

See table 2 and figure 4 for further examples.

Chord P   d1(C,P)   d2(C,P)   d∞(C,P)
C         0         0         0
C#        3         1.7321    1
D         6         3.4641    2
Cm        1         1         1
C#m       2         1.4142    1
Dm        5         3         2

Table 2: Chord distances
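The entries of table 2 can be reproduced with a few lines of Python. This is only a sketch: the helper names are ours, and we assume that "ordering a chord from its lowest to its highest pitch in Z/12Z" means sorting its three pitch classes.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def triad(name):
    """Sorted pitch classes of a major ("C") or minor ("Cm") triad."""
    minor = name.endswith("m")
    root = NOTE_NAMES.index(name[:-1] if minor else name)
    third = 3 if minor else 4
    # Assumption: voices are pitch classes sorted in increasing order.
    return tuple(sorted((root % 12, (root + third) % 12, (root + 7) % 12)))

def chord_distance(p, q, alpha):
    """l-alpha distance between two chords with the same number of voices."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if alpha == float("inf"):
        return max(diffs)
    return sum(d ** alpha for d in diffs) ** (1.0 / alpha)

c_major = triad("C")
for name in ["C", "C#", "D", "Cm", "C#m", "Dm"]:
    row = [chord_distance(c_major, triad(name), a) for a in (1, 2, float("inf"))]
    print(name, [round(v, 4) for v in row])
```

Running the loop over all 24 triads gives the data needed to build a substitution matrix for each choice of α.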


Figure 4: Chord distances: comparison. The l1, l2 and l∞ distances from C major are plotted for each of the 24 triads (horizontal axis: C, C#, D, ..., B, Cm, C#m, ..., Bm; vertical axis: distance, from 0 to 12).

Exercise 1. It is possible to build a different substitution matrix for every distance $d^{\alpha}_{\text{chords}} : (\mathbb{Z}/12\mathbb{Z})^n \times (\mathbb{Z}/12\mathbb{Z})^n \to \mathbb{R}$ with $\alpha \in \mathbb{N} \cup \{\infty\}$, $1 \le \alpha \le \infty$.

From figure 4 it is clear that the choice of the distance is essential when comparing harmonic structures. What are the implications of this choice in musical terms?

3.2 Harmonic Analysis (in modern Music)

A musician studying a piece of music for the first time often performs an analysis of its harmonic behavior, trying to work out the relation between each chord and a tonal (or modal) centre that may vary during the song. Given the chord sequence of a song, it is possible to define a set of functions mapping each chord to its tonality. The tonality dictionary is limited to the 15 major tonalities (12 + 3 enharmonic equivalents) and their relative minors, thus it is possible to associate a symbol to each tonality as shown in table 3.


Tonality             Symbol
C major / A minor    A
G major / E minor    B
D major / B minor    C
A major / F# minor   D
E major / C# minor   E
B major / G# minor   F
F# major / D# minor  G
C# major / A# minor  H
F major / D minor    I
Bb major / G minor   J
Eb major / C minor   K
Ab major / F minor   L
Db major / Bb minor  M
Gb major / Eb minor  N
Cb major / Ab minor  O
NoChord              Z

Table 3: Association among tonalities and symbols

This first analysis allows us to associate to a song a symbol sequence describing it in terms of tonal regions, see the first and second rows of table 5 for an example. However, this kind of segmentation is pretty coarse: it is possible to find songs built on a single tonality, which would be associated with a sequence composed of only one symbol.

It is possible to refine this segmentation by associating each chord to its tonality using the tonal degree of the chord: in figure 3 these degrees are represented as natural numbers from 1 to 7. The association among degrees and symbols (see table 4) leads to a really small alphabet composed of only 7 symbols. This step simply mimics the standard harmonic analysis process: at first we listed the tonal centers of the song, and now we are associating each chord to its degree.

Chord (tonality: C)   CM   Dm   Em   FM   GM   Am   Bm(b5)
Degree                1    2    3    4    5    6    7
Symbol                A    B    C    D    E    F    G

No chord: degree 0, symbol Z

Table 4: Tonality degrees, C major key

To go further in this approximation of the human harmonic analysis, it is necessary to introduce a distance among tonalities. Let us call $K$ the set of all tonalities. We can define a distance $d_{\text{circle}} : K \times K \to \mathbb{N}$ as intuitively described in figure 5.

Figure 5: Distance among tonalities: modulation from D major to F major: dcircle(D,F ) = 3.
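One possible way to compute such a distance is sketched below in Python. It is based on assumptions that the text does not state: enharmonic tonalities are identified (so only 12 positions are used), and the distance is the number of steps along the shortest arc of the circle of fifths. The names `PITCH_CLASS`, `circle_position` and `d_circle` are ours.

```python
# Pitch class of the tonic of each major tonality (enharmonic names share a value).
PITCH_CLASS = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4,
               "F": 5, "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9,
               "A#": 10, "Bb": 10, "B": 11, "Cb": 11}

def circle_position(tonality):
    # A fifth is 7 semitones and 7 * 7 = 49 = 1 (mod 12), so multiplying a
    # pitch class by 7 gives its position on the circle of fifths (C = 0, G = 1, ...).
    return (7 * PITCH_CLASS[tonality]) % 12

def d_circle(k1, k2):
    """Steps between two tonalities along the shortest arc of the circle."""
    diff = abs(circle_position(k1) - circle_position(k2))
    return min(diff, 12 - diff)

print(d_circle("D", "F"))  # the modulation of figure 5
```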


To understand what we are doing in practice, we will use the song represented in figure 6 as a guiding example to explain what kind of information can be deduced using tonalities, degrees and distances. So far, what we are able to compute corresponds to the rows Degree and Variation of table 5. The components of the vector Degree relate each chord to its tonality, while the components of Variation are related to a kind of absolute distance among the song's tonalities: the 0th tonality is represented by the first tonality of the song. Let's sum up what we did so far:

Tonal Regions

It is possible to build a sequence of symbols describing the song in terms of tonal regions, following the dictionary of table 3. In the case of Tune Up, it would be:

CCCAAAJJJCJJL.

This information is contained in the vector

>> AllSongsData(k).TONALITIES

Degrees

Another possibility is to consider only the degree of each chord modulo the tonality. This means translating the vector Degree into symbols (see table 4). In this case the string associated to Tune Up is

BEABEABEABEAE.

The advantage in this case is that the dictionary is incredibly small (7 symbols); at the same time, the risk is to create patterns that do not really exist. However, this is a really common way to conceive music in the standard improvisational ontology.

Type

>> AllSongsData(k).DegreeString

to obtain this information for the k-th song.
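Both encodings described so far are plain table lookups. The Python sketch below rebuilds the two Tune Up strings from the Tonality and Degree rows of table 5, using the dictionaries of tables 3 and 4. The variable names are ours and the song data is typed in by hand rather than read from AllSongsData.

```python
# Table 3: major tonality -> symbol (relative minors share the same symbol).
TONALITY_SYMBOL = {"C": "A", "G": "B", "D": "C", "A": "D", "E": "E", "B": "F",
                   "F#": "G", "C#": "H", "F": "I", "Bb": "J", "Eb": "K",
                   "Ab": "L", "Db": "M", "Gb": "N", "Cb": "O"}
# Table 4: degree -> symbol, with 0 standing for "no chord".
DEGREE_SYMBOL = {0: "Z", 1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G"}

# Tonality and Degree rows of table 5 (Tune Up).
tune_up_tonalities = ["D", "D", "D", "C", "C", "C", "Bb", "Bb", "Bb",
                      "D", "Bb", "Bb", "Ab"]
tune_up_degrees = [2, 5, 1, 2, 5, 1, 2, 5, 1, 2, 5, 1, 5]

tonal_string = "".join(TONALITY_SYMBOL[t] for t in tune_up_tonalities)
degree_string = "".join(DEGREE_SYMBOL[d] for d in tune_up_degrees)

print(tonal_string)   # CCCAAAJJJCJJL
print(degree_string)  # BEABEABEABEAE
```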

Variations

It is also possible to build sequences from the Variation vector. In this case it would be

0002002004402

This sequence means that the song starts in a certain tonality, let's call it $K_1$, such that $d_{\text{circle}}(K_1, K_1) = 0$ (see figure 5 for the intuitive definition of $d_{\text{circle}}$). After 3 chords, the song modulates to a new tonality $K_2$ such that $d_{\text{circle}}(K_1, K_2) = 2$; after three more chords a new modulation occurs, with $d_{\text{circle}}(K_2, K_3) = 2$. Clearly, in this case we are only evaluating variation: the sequence gives a segmentation of the song into stable tonal regions and a classification in terms of the size of the modulation gaps.

This information is stored in the structure variable as

>> AllSongsData(k).Variation

Chords          Em7   A7   D∆   Dm7   G7   C∆
Tonality        D     D    D    C     C    C
Degree          2     5    1    2     5    1
Variation       0     0    0    2     0    0
Rel Variation   4     4    4    2     2    2

Chords          Cm7   F7   Bb∆   Em7   F7   Bb∆   Eb7
Tonality        Bb    Bb   Bb    D     Bb   Bb    Ab
Degree          2     5    1     2     5    1     5
Variation       2     0    0     4     4    0     2
Rel Variation   0     0    0     4     0    0     2

Table 5: Tune up - Tonal Analysis


Figure 6: Miles Davis - Tune Up (lead sheet, concert key, medium swing, quarter note = 160).

The last step in the development of the algorithm mimicking harmonic analysis consists in identifying a central tonality which is specific to the song, and not simply the first one or an a priori choice.

Local Variation. The criterion we used to establish a preferred tonality among the set of tonalities detected in the song is really simple: the preferred tonality is the most used one. However, it is possible to introduce different constraints. For instance, one could consider the central tonality to be the most stable one (the one associated with the longest sequence of consecutive chords). For an example see the last row of table 5.

Tonal regions and local variations

It is possible to combine the data and create a new string taking into account both the information derived from the analysis of tonal region distances (RelVariation) and the one coming from the association between each chord and its tonal degree (Degree). In this case we work with a 2×#chords matrix. In Tune Up we have

Tonality       D  D  D  C  C  C  Bb Bb Bb D  Bb Bb Ab
Degree         B  E  A  B  E  A  B  E  A  B  E  A  E
RelVariation   4  4  4  2  2  2  0  0  0  4  0  0  2

The dictionary will be the set

\[ \underbrace{\{A, B, C, D, E, F, G\}}_{\text{degree symbols}} \times \underbrace{\{0, 1, 2, \dots, 16\}}_{\text{tonalities + no chord}} \]

[Hint: This is a very large alphabet (112 symbols), however it is possible to build a substitution matrix in which...]

In the case of Tune Up the modulations-degrees string will be

B4E4A4B2E2A2B0E0A0B4E0A0B2.

Here we assumed the most used tonality to be the 0-th tonal region. The other integers appearing in the string have been computed from their distance to the central tonality on the circle of fifths, see figure 5. It is possible to find the tonal regions/local variations string associated to the k-th song by typing

>> AllSongsData(k).RelVariationString

Exercise 2. In (modern) music, cadences have a central role. Here follows the list of the most common patterns in music. The notation we will use is of the form 1-2-3, where each number represents the degree of the chord in relation to a tonality. Thus the cadence 1-2-3 means C-Dm-Em in C major, D-Em-F#m in D major, et cetera.


Authentic or Perfect cadence: 5-1 or, typically in jazz music, 2-5-1 (we are considering the chords as the triple (root, third, fifth) ordered from the lowest to the highest pitch);

Deceptive cadence: 5-6;

Plagal cadence: 4-1;

Half cadence: 4-5, 2-5, 6-5, 5/5-5, where 5/5 means that we are considering the fifth degree of the fifth degree of a certain tonality. For instance, in the tonality of C major the fifth degree is the major triad G, and thus the fifth of the fifth degree is the major triad D.

Is it possible to take these cadences into account during the pairwise alignment of two strings in which tonal region and degree information are combined?
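As a possible first step towards this exercise, the Python sketch below simply locates the cadences listed above inside a sequence of degrees. It is a simplification: tonality changes are ignored, the 5/5-5 pattern is left out because it cannot be read from the degrees alone, and the names `CADENCES` and `find_cadences` are ours.

```python
# Degree patterns of the cadences listed above (5/5-5 is deliberately omitted).
CADENCES = {
    "authentic": [(5, 1), (2, 5, 1)],
    "deceptive": [(5, 6)],
    "plagal": [(4, 1)],
    "half": [(4, 5), (2, 5), (6, 5)],
}

def find_cadences(degrees):
    """Return sorted (position, cadence name, pattern) for every occurrence."""
    hits = []
    for name, patterns in CADENCES.items():
        for pattern in patterns:
            width = len(pattern)
            for i in range(len(degrees) - width + 1):
                if tuple(degrees[i:i + width]) == pattern:
                    hits.append((i, name, pattern))
    return sorted(hits)

# Degree row of table 5 (Tune Up).
tune_up_degrees = [2, 5, 1, 2, 5, 1, 2, 5, 1, 2, 5, 1, 5]
for position, name, pattern in find_cadences(tune_up_degrees):
    print(position, name, pattern)
```

Note that overlapping patterns (2-5 inside 2-5-1, for instance) are all reported; deciding which one to keep is part of the modelling choice.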

3.3 Energy variations

The dataset AllSongs(k).Data contains several observations concerning the data necessary to plot a graph representing the variations of energy (in a mechanical sense) during the song. AllSongs(k).Data.[Haminfty,Hamone,Hamtwo] is the list of the values, in (absolute or relative) time¹, of the energy variation during the k-th song. For instance, plotting the vector Hamtwo for the song Seaside rendez vous - Queen gives the curve depicted in figure 7.

Figure 7: Energy variations. Three panels, each with a horizontal axis running from 0 to 80: "Energetic variations, Max norm", "Energetic variations, norm 1" and "Energetic variations, norm 2".

We work with the $\ell^2$ norm (vector Hamtwo), which usually gives the graph a more balanced shape, and we delete the last two values of the vector, which are misleading; see figure 8.

It is possible to associate a symbol sequence to the graph. In the figure, the x and y axes have been divided into a square grid whose squares have edges of length 5 units. Considering each column of the grid from left to right and each row (top to bottom), we can associate the following string to the graph:

ABBFBAFBBBDCCCBDCCCBFD...

¹ Each point of the graph is related to a chord of the song, to its position in terms of beats and to its position in time (ms).


or

ABFBAFBDCBDCBFD...

or, taking into account the vertical grid

A1B1B1F1B1A1F1B1B2B2D2C2C2C2B2D2C2C2C3B3F3D3...
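The three string variants above can be obtained from a sampled curve along the following lines. This Python sketch is an illustration under assumptions not fixed by the text: rows are lettered from the lowest band (A) upwards, the grid is anchored at the smallest value of the curve and at the first x position, and the sample data is invented for the example.

```python
import string

def curve_to_symbols(x, y, step):
    """Row letter and 1-based column index of each point on a square grid.

    Assumptions: band A starts at min(y), column 1 starts at x[0], and the
    curve spans at most 26 bands.
    """
    y_min = min(y)
    rows = [string.ascii_uppercase[int((yi - y_min) // step)] for yi in y]
    cols = [int((xi - x[0]) // step) + 1 for xi in x]
    return rows, cols

def collapse(symbols):
    """Remove consecutive repetitions of the same symbol."""
    out = []
    for s in symbols:
        if not out or out[-1] != s:
            out.append(s)
    return "".join(out)

# Hypothetical samples: positions (in beats) and energy variations.
x = [0, 2, 4, 6, 8, 11]
y = [-12, -8, -7, 14, -9, -12]
rows, cols = curve_to_symbols(x, y, step=5)

print("".join(rows))                                   # one symbol per chord
print(collapse(rows))                                  # repetitions removed
print("".join(f"{r}{c}" for r, c in zip(rows, cols)))  # with the vertical grid
```

Applied to the first string of the text, `collapse` yields the second one; the `step` argument plays the role of the sampling quality.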

Figure 8: From the Discrete Hamiltonian Equations to Symbols. Horizontal axis: time (chord), from 10 to 60; vertical axis: energy variation, divided into bands labelled A (bottom) to F (top).

Exercise 3. Energy variation alignment:

Write a Matlab function that associates a string of symbols to the Hamtwo graph. The edge length of the squares composing the grid has to be an input parameter, so that the quality of the sampling on the graph can be set.

4 Going further

Based on the algorithms you were presented, several questions can lead you to further investigations and discoveries related to this field.

1. How to handle temporality in symbolic representations?

2. How to evaluate the quality of symbolic data extraction without relying on ground truth human annotation?

3. How to extend the alignment paradigm to include multiple sources of information?

4. Is it possible to devise an adaptive alignment that could use the alignment obtained from one source of information in order to improve the alignment of another source?

5. How to detect common patterns inside a set of symbolic sequences?

6. How to compare the fitness of various representations?


Project module assignments

Part 1 (Database and metadata)

Collect metadata for Quaero's songs.

Organize the data taking advantage of the metadata collected at the previous point.

Part 2 (from chords to symbols)

Is it possible to define different metrics able to capture different harmonic properties? For instance, it would be interesting to use the Hamming distance. Show that it is possible to use this distance on chords as they have been defined in this paragraph. Try different kinds of metrics and analyze their harmonic meaning.

A crucial role in harmony is played by cadences, tuples of chords (in general sequences of 2 or 3 chords) that form common harmonic patterns which are invariant modulo key changes. Start to think about this kind of structure and try to express cadences as points in the space $(\mathbb{Z}/12\mathbb{Z})^{n \cdot m}$, where $n$ is the cardinality of the chord and $m$ is the length of the cadence.

Part 3 (harmonic analysis)

Given a sequence of chords (AllSongs.Data.CHORD), write an algorithm to associate a tonality to each chord and find the segmentation of the song into tonal regions. A constraint has to be respected: when writing the sequence of tonalities, the movement on the cycle of fifths has to be minimized. For instance

1. The chord Dm can be associated with three different tonalities: C, Bb and F.

2. If the following chord is Am, then it can be associated with G, F and C.

3. The tonality of the pair (Dm, Am) has to be in the intersection of the two sets we wrote above:

\[ \{C, B\flat, F\} \cap \{G, F, C\} = \{C, F\} \]

To minimize the movement on the cycle of fifths means to minimize the distance we defined above. Thus for the sequence of chords Dm | Am the best solution is the pair of tonalities (C, C) or (F, F), the second choice would be (F, C) or (F, Bb) and the worst one is (Bb, C).
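The candidate sets used in this example can be generated mechanically, as in the Python sketch below. It only covers the first step of the assignment (listing the major tonalities that contain a given triad) and relies on the harmonization of the major scale of figure 3; the names are ours, and tonalities are spelled with flats only.

```python
NOTE_NAMES = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]
PITCH_CLASS = {name: i for i, name in enumerate(NOTE_NAMES)}

# Semitones above the tonic of the degrees carrying each kind of triad
# in a major key (figure 3): major on 1, 4, 5 and minor on 2, 3, 6.
MAJOR_TRIAD_OFFSETS = (0, 5, 7)
MINOR_TRIAD_OFFSETS = (2, 4, 9)

def candidate_tonalities(chord):
    """Set of major tonalities in which the triad `chord` ("Dm", "G", ...) appears."""
    minor = chord.endswith("m")
    root = PITCH_CLASS[chord[:-1] if minor else chord]
    offsets = MINOR_TRIAD_OFFSETS if minor else MAJOR_TRIAD_OFFSETS
    return {NOTE_NAMES[(root - offset) % 12] for offset in offsets}

dm = candidate_tonalities("Dm")
am = candidate_tonalities("Am")
print(sorted(dm), sorted(am), sorted(dm & am))
```

Choosing one tonality per chord so that the total distance on the cycle of fifths is minimal is the remaining part of the exercise.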

Part 4

Select one of the open questions above and propose a simple yet efficient way to deal with this problem. Provide an algorithmic sketch of the implementations required for your idea.


Part II

Aligning symbolic sequences

Once symbolic information has been extracted from audio data, we obtain a higher-level view of each element of the corresponding dataset. However, most of the MIR field is based on finding (dis-)similarities between various songs. The alignment of symbolic sequences is a prolific and still active field of research, which finds applications in areas such as text mining (web search engines), genetics (phylogeny, mutation tracking), shape analysis (face recognition) or healthcare (diagnostic tools). The idea is that sequences with a high similarity are bound to align very well. Conversely, sequences which only share some common patterns should still be able to align partially.

Objectives

1. Comparing the high-level structure of several items in a set of raw signal data

2. Understanding algorithms to align two symbolic sequences

3. Comparing the quality and interest of different algorithms based on similarities and clustering

4. Discovering the inner structure of a whole set of sequences through multiple alignment

5. Finding the global common patterns through consensus sequences

6. Constructing the ancestor sequences of groups based on these sequences

5 Alignment of symbolic sequences

In a first step, we will try to align sequences pairwise. This means that we will only try to align two sequences at a time. This alignment can of course be based on any type of symbolic information, provided that the two sequences are composed of symbols with the same meaning. The goal here is to understand the different algorithms for pairwise alignment, but also the influence of the different weighting matrices that can be used with the same algorithm.

5.1 An introduction to pairwise alignment

The goal of pairwise alignment is the following: given a pair of symbolic sequences $S_1$ and $S_2$ of lengths $(n, m) \in \mathbb{N}^2$ and a scoring function $\delta(x, y)$ that defines the similarity between two symbols $(x, y) \in A^2$, we want to find the sequences $S_1^*$ and $S_2^*$ of equal length $k$ such that the sum of similarity scores is maximized.

Pairwise alignment allows us to gain information about the similarity between two sequences, but also about their structure. Hence, it can be used to find common patterns, to assemble a set of sequences (fragment assembly) or even to compare the inner structure of two sequences. We can infer the relationships and the sub-sequence structure directly from this alignment.

The different issues related to pairwise alignment are that

Most of the sequences we are comparing will differ in length

There may be only relatively small matching regions in the sequences

We want to allow partial matches (i.e. some symbols are more similar to each other than others)

It should also be noted that three types of alignments can be performed

Global Find the best match of both sequences in their entirety

Local Find the best subsequence match (even small portions of the sequences)

Semi-global Find the best global match without penalizing gaps on the ends of the alignment

An example of alignment between (apparently) distantly related sequences is displayed in Figure 9.


GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
:: ::::|: || : :| :: :|: |:::|: |
NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

Figure 9: Example of global alignment between two (apparently) distantly related sequences, in which exact matches are identified by a pipe (|) and related matches are identified by a colon (:). Even though the symbols in both sequences are highly different, most of them are actually closely related, which implies that the sequences show a high amount of similarity.

5.2 Levenshtein (edit) distance

The simplest way to obtain the similarity between two symbolic sequences is through the Levenshtein distance (sometimes also called edit distance). The idea is to consider that three types of differences can arise when comparing two symbolic sequences

Substitutions   ACGA ⇒ AGGA
Insertions      ACGA ⇒ ACCGA
Deletions       ACGA ⇒ ACA

The latter two will result in gaps in the alignments. Hence, the goal of the Levenshtein distance is to find the minimal number of applications of these operations needed to transform one sequence into the other. The distance itself is therefore defined as the number of times any of these operations is applied. The main problems of the Levenshtein distance are that

1. All operations are considered equivalent (the same score is assigned to any change)

2. Only the binary match/mismatch relationship is considered (symbols cannot be more or less related)

3. The computation cannot be divided (as the alignment in any subsequence depends on the alignment of previous subsequences)
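For reference, here is a minimal Python sketch of the Levenshtein distance, computed with the usual row-by-row table (one common way of obtaining it, not necessarily the one used in the project). The function name is ours. Every substitution, insertion or deletion costs exactly 1, which is the first limitation listed above.

```python
def levenshtein(a, b):
    """Minimal number of substitutions, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion of ca
                           cur[j - 1] + 1,             # insertion of cb
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

# The three elementary operations shown above each cost one edit.
for target in ["AGGA", "ACCGA", "ACA"]:
    print("ACGA ->", target, levenshtein("ACGA", target))
```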

5.3 Dynamic Programming

Dynamic programming solves an instance of a problem by taking advantage of already computed solutions for smaller subparts of the problem. Hence, we can determine the alignment of two sequences by determining the alignment of all prefixes of the sequences.

Scoring scheme. This approach relies first on a substitution (or distance) matrix $\delta(a, b)$ which indicates the score of aligning any characters $a$ and $b$ from our dictionary. Second, the scoring uses a gap penalty function $w(k)$ which indicates the cost of a gap of length $k$. We will first consider the simplest case of a linear gap penalty function $w(k) = g \cdot k$, where $g$ is a constant (which means that every position of a run of consecutive gaps costs the same).

Idea. The idea of dynamic programming is that given a sequence $x$ of length $n$ and a sequence $y$ of length $m$, we can construct an $(n + 1) \times (m + 1)$ matrix $F$ such that $F_{i,j}$ is the score of the best alignment of $x[1 \dots i]$ with $y[1 \dots j]$. This means that the score of any cell can be deduced from the scores of its three preceding neighbours (upper-left, up and left). Therefore, when extending an alignment in the cell $F_{i,j}$, three choices can be made

align $x[1 \dots (i-1)]$ with $y[1 \dots (j-1)]$ and match $x[i]$ with $y[j]$.

align $x[1 \dots i]$ with $y[1 \dots (j-1)]$ and match a gap with $y[j]$.

align $x[1 \dots (i-1)]$ with $y[1 \dots j]$ and match a gap with $x[i]$.

Hence one way to specify the DP problem is in terms of its recurrence relation

\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + \delta(x[i], y[j]) \\
F(i-1, j) + g \\
F(i, j-1) + g
\end{cases}
\]

The algorithm for dynamic programming can be sketched as


Figure 10: Differences between the Levenshtein distance and Dynamic Programming (DP) in aligning two symbolic sequences (here The Beatles - Come together and The Cure - The Figurehead). Based on two symbolic sequences (top), the Levenshtein distance provides an alignment solely based on binary match/mismatch information (middle), while DP allows for much more subtlety by introducing the concept of continuous distance between symbols (bottom), represented here by lighter bars indicating partial matches.

DP-alignParameters: Two sequences S1 and S2, a scoring function δ (a, b) and a gap cost function w (k)Return: A global alignment S∗1 and S∗2 .1: Initialize the first row and column of the matrix2: Fill the matrix from top to bottom, left to right3: For each Fi,j save pointers to cells that resulted in best score4: Trace the pointers back from Fm,n to F0,0 to obtain the alignment
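To make the four steps of DP-align concrete, here is a minimal Python sketch for the linear gap penalty case. The function names `dp_align` and `unit_score`, the example sequences and the tie-breaking rule (diagonal first, then up, then left) are our own assumptions for illustration and are not prescribed by the procedure above.

```python
def unit_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def dp_align(x, y, delta, g):
    """Global alignment of x and y with a linear gap penalty g (the score is maximized)."""
    n, m = len(x), len(y)
    # F[i][j]: best score for x[:i] against y[:j]; ptr[i][j]: move that produced it
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):            # first column: x[:i] against gaps only
        F[i][0], ptr[i][0] = i * g, "up"
    for j in range(1, m + 1):            # first row: y[:j] against gaps only
        F[0][j], ptr[0][j] = j * g, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + delta(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] + g, "up"),
                (F[i][j - 1] + g, "left"),
            ]
            # max() returns the first best candidate, so ties favour "diag", then "up"
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from F[n][m] to F[0][0]
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(dp_align("ACGT", "AGT", unit_score, -2))   # (1, 'ACGT', 'A-GT')
```

Storing a single pointer per cell returns only one optimal alignment; keeping every tied pointer would be needed to enumerate all of them.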

Exercise 4. Based on the two sequences AAAC and AGC, suppose that we choose the scoring scheme δ (a, b) = 1 if a = b and δ (a, b) = −1 otherwise, with a gap penalty g = −2.

Construct and fill the scoring matrix (with corresponding path pointers)

Find the best alignment between the two sequences by tracing the pointers

What is the theoretical worst-case complexity of this method?

Can there be cases of ambiguity? Which are they, and how can you resolve them?

Analysis of dynamic programming Given two sequences of length n, we can compute the number of possible alignments as

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

Therefore, for two sequences of length 1000, there are approximately $10^{600}$ possible alignments, but the DP approach finds an optimal alignment efficiently. Compared to the Levenshtein distance, the dynamic programming approach can yield a markedly higher quality of alignment, as displayed in Figure 10.
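The order of magnitude quoted above can be checked numerically. The short sketch below (the function names are ours) compares the exact central binomial coefficient with the approximation $2^{2n}/\sqrt{\pi n}$, working with base-10 logarithms to avoid printing a 600-digit number.

```python
import math


def log10_alignment_count(n):
    """log10 of the binomial coefficient C(2n, n), computed from the exact integer."""
    return math.log10(math.comb(2 * n, n))


def log10_stirling_estimate(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


# Both values are close to 600.3 for n = 1000, i.e. roughly 10^600 alignments
print(log10_alignment_count(1000))
print(log10_stirling_estimate(1000))
```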

5.4 The Needleman-Wunsch algorithm

We have seen in the previous section that the DP approach for global alignment normally relies on a linear gap penalty function w (k) = gk. This implies that a long gap of 20 positions between highly matching subsequences has the same impact on the alignment as 20 gaps scattered along the sequences. However, we would rather prefer one long gap between highly matching subsequences (which emphasizes the local structures shared between sequences). Therefore, the Needleman-Wunsch (NW) algorithm handles this mechanism by providing an affine gap penalty function defined as

$$w(k) = \begin{cases} \alpha + \beta k & k \geq 1 \\ 0 & k = 0 \end{cases}$$

Here α defines the cost of opening a gap and β defines the cost of extending it (they are written h and g, respectively, in the recurrences and initialization that follow). This penalizes small sporadic gaps, as we choose α > β, meaning that it costs more to open a gap than to extend an existing one. In order to still perform this refined alignment in $O(n^2)$ time, the NW algorithm relies on three different matrices instead of one. First, the matrix M (i, j) defines the best score given that x [i] is aligned to y [j]. Second, Ix (i, j) defines the best score given that x [i] is aligned to a gap, and Iy (i, j) defines the best score given that y [j] is aligned to a gap. The NW algorithm also redefines the recurrence relations of the classical DP approach as

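$$M(i, j) = \max \begin{cases} M(i-1, j-1) + \delta(x_i, y_j) \\ I_x(i-1, j-1) + \delta(x_i, y_j) \\ I_y(i-1, j-1) + \delta(x_i, y_j) \end{cases}$$

$$I_x(i, j) = \max \begin{cases} M(i-1, j) + h + g \\ I_x(i-1, j) + g \end{cases}$$

$$I_y(i, j) = \max \begin{cases} M(i, j-1) + h + g \\ I_y(i, j-1) + g \end{cases}$$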

The NW algorithm can then be sketched in three steps

1. Initialization

(a) M (0, 0) = 0

(b) Ix (i, 0) = h + g·i

(c) Iy (0, j) = h + g·j

2. Fill the three matrices (M, Ix and Iy) together from top to bottom, left to right

3. Traceback

(a) Start at the largest value among M (n, m), Ix (n, m) and Iy (n, m)

(b) Stop at any of M (0, 0), Ix (0, 0) and Iy (0, 0)
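The initialization and fill steps above can be sketched in Python as follows. This version only returns the best score (the traceback is left out), and it assumes that every cell not listed in the initialization starts at minus infinity, which the text does not state explicitly. The names `affine_align_score` and `unit_score` are ours; h and g are negative numbers since the score is maximized.

```python
import math


def unit_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def affine_align_score(x, y, delta, h, g):
    """Best global alignment score with gap-opening cost h and gap-extension cost g."""
    n, m = len(x), len(y)
    neg = -math.inf                                  # assumption: unreachable cells
    M = [[neg] * (m + 1) for _ in range(n + 1)]      # x[i] aligned to y[j]
    Ix = [[neg] * (m + 1) for _ in range(n + 1)]     # x[i] aligned to a gap
    Iy = [[neg] * (m + 1) for _ in range(n + 1)]     # y[j] aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = h + g * i
    for j in range(1, m + 1):
        Iy[0][j] = h + g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = delta(x[i - 1], y[j - 1])
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(M[i - 1][j] + h + g, Ix[i - 1][j] + g)
            Iy[i][j] = max(M[i][j - 1] + h + g, Iy[i][j - 1] + g)
    return max(M[n][m], Ix[n][m], Iy[n][m])


# One gap of length 2 (A CG T against A -- T): 1 + (h + 2g) + 1 = -3
print(affine_align_score("ACGT", "AT", unit_score, h=-3, g=-1))   # -3
```

With h = 0 the three matrices collapse to the linear gap penalty of the previous section.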

Exercise 5. Needleman-Wunsch

5.5 On the influence of the scoring matrix

As we have seen in the previous sections, one of the core concepts of the different DP algorithms is that they can include a scoring function δ (x, y), which provides a more subtle notion of matching. Hence, one of the key aspects for tuning the quality of alignments is the way we assess the differences between symbols in the dictionary. This can be defined through a distance measure δ (x, y), a function taking two symbols x and y and returning the distance d between them. This distance has to be non-negative, i.e. δ (x, y) ≥ 0. If this measure also satisfies the symmetry property δ (x, y) = δ (y, x) and subadditivity δ (x, z) ≤ δ (x, y) + δ (y, z) (also known as the triangle inequality), the distance is said to be a metric.
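On a finite dictionary, the three properties listed above can be verified by brute force. The sketch below is only an illustration: the function `is_metric` and the two example distances on the 12 pitch classes are hypothetical choices of ours, not distances defined in this document.

```python
from itertools import product


def is_metric(symbols, delta):
    """Check non-negativity, symmetry and the triangle inequality on a finite set of symbols."""
    symbols = list(symbols)
    for x, y in product(symbols, repeat=2):
        if delta(x, y) < 0 or delta(x, y) != delta(y, x):
            return False
    return all(delta(x, z) <= delta(x, y) + delta(y, z)
               for x, y, z in product(symbols, repeat=3))


def circular(a, b):
    """Hypothetical distance: number of semitones between two pitch classes on a circle."""
    d = abs(a - b) % 12
    return min(d, 12 - d)


def squared_circular(a, b):
    """Squaring the distance breaks subadditivity: d(0, 2) = 4 > d(0, 1) + d(1, 2) = 2."""
    return circular(a, b) ** 2


print(is_metric(range(12), circular))           # True
print(is_metric(range(12), squared_circular))   # False
```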

The definition of this scoring matrix will highly influence the final alignment, as we can then evaluate the score of an alignment either by using the sum of all distances

$$D(S_1, S_2) = \sum_{k<l} \delta\left(s_1^k, s_2^l\right)$$

or by trying to minimize the entropy of each column, given by

$$D(m_i) = -\sum_a c_{ia} \log_2(p_{ia})$$

where $m_i$ is the ith column of an alignment m, $c_{ia}$ is the count of character a in column i and $p_{ia}$ is the probability of character a in column i. The effect of devising different scoring matrices is displayed in Figure 11.
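As an illustration of the column entropy, here is a small Python sketch. It assumes that the probability of a character in a column is estimated by its relative frequency in that column, and that a gap is counted like any other character; the function names are ours.

```python
import math
from collections import Counter


def column_entropy(column):
    """-sum_a c_a * log2(p_a), with p_a estimated as c_a / (number of rows)."""
    counts = Counter(column)
    total = len(column)
    return -sum(c * math.log2(c / total) for c in counts.values())


def alignment_entropy(rows):
    """Sum of the column entropies of an alignment given as equal-length strings."""
    return sum(column_entropy(col) for col in zip(*rows))


# Column "AAAC": -(3 * log2(3/4) + 1 * log2(1/4)), approximately 3.245
print(column_entropy("AAAC"))
# A perfectly conserved column has zero entropy
print(column_entropy("AAAA") == 0)   # True
```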

[Figure 11 image: two dendrograms of the same set of songs (tracks by The Beatles, Eric Clapton, Pink Floyd, The Cure, Eminem and others), with scales running from 0.9 to 0.6 and from 0.9 to 0.3 respectively]

Figure 11: The effect of using different grammars (symbolic information) and different weighting matrices can lead to dramatically different results in the final alignments and similarities between the sets of sequences.

6 Multiple alignments algorithms.

Once it is well understood that pairwise alignments define a form of similarity and reveal common local structures between sequences, we can move on to a higher level of reasoning by trying to align multiple sequences at the same time. Hence, we will compare different algorithms for multiple sequence alignment with respect to their quality and compactness. This can yield higher-quality consensus sequences between symbolic representations. Studying different grammars and types of symbolic information (as seen in the first part), different weighting matrices and finally various multiple alignment algorithms will hence highlight different intrinsic properties of music, which will in turn influence the resulting consensus sequences.

6.1 An introduction to multiple alignment

Definition of the problem The problem of Multiple Sequence Alignment (MSA) can be defined as finding, from a set of k sequences $S = \{S_1, S_2, \ldots, S_k\}$, the aligned set of k equal-length sequences $S^* = \{S_1^*, S_2^*, \ldots, S_k^*\}$, where $S_i^*$ is obtained by inserting gaps into $S_i$ for all $i \in [1, k]$, while minimizing a score function.

Sum-of-Pair score The most straightforward way to define a fitness value for a given solution of the MSA, in order to optimize the quality of the result, is to rely on the Sum-of-Pair (SP) score

$$\mathrm{SPscore}(a_1, \ldots, a_k) = \sum_{1 \leq i < j \leq k} \delta(a_i, a_j)$$

where $a_i$ is any symbol from our dictionary and $\delta(a_i, a_j)$ is the distance defined in our weight matrix. Then, the overall score of an alignment $S^*$ can be defined as

$$\mathrm{SPscore}(S^*) = \sum_x \mathrm{SPscore}\left(S_1^*[x], \ldots, S_k^*[x]\right)$$

Interpreting this formula, we are trying to minimize the position-wise differences in symbols across all sequences in the alignment.
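The two formulas translate directly into code. In the sketch below, the 0/1 `mismatch` distance and the choice of treating the gap as an ordinary symbol are assumptions made for the example; any weight matrix could be passed instead.

```python
from itertools import combinations


def mismatch(a, b):
    """Hypothetical 0/1 distance; the gap '-' is treated as an ordinary symbol."""
    return 0 if a == b else 1


def sp_column(column, delta):
    """SP-score of one column: sum of delta over all pairs of rows."""
    return sum(delta(a, b) for a, b in combinations(column, 2))


def sp_score(alignment, delta):
    """SP-score of an alignment given as a list of equal-length strings."""
    return sum(sp_column(col, delta) for col in zip(*alignment))


alignment = ["VSN-S", "-SNAS", "---AS"]
print([sp_column(col, mismatch) for col in zip(*alignment)])   # [2, 2, 2, 2, 0]
print(sp_score(alignment, mismatch))                           # 8
```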

Exercise 6. (Manual alignment) Given the set of sequences

Try to perform a manual multiple alignment of these sequences

Compute the SP-Score at each position of the alignment

[Figure 12 image: the three sequences VSNS, SNAS and AS, the 3-dimensional matrix built from them, and the resulting multiple alignment VSN-S / -SNAS / ---AS]

Figure 12: Multiple sequence alignment of 3 sequences through dynamic programming. (a) Given a set of 3 sequences to align, (b) we can construct a 3-dimensional matrix in which (c) each cell defines $2^3 - 1 = 7$ different possible paths. (d) Following the same procedure as pairwise alignment, we can find the optimal alignment between sequences, which leads to (e) the multiple sequence alignment. (f) An interesting property is that we can actually project this multidimensional path on bi-dimensional planes to obtain the pairwise alignments between any sequences of the set.

The multiple sequence alignment algorithms that will be used in this project are ClustalW, MUSCLE, MAFFT, ProbCons and T-Coffee; refer to their respective documentation for further information on the exact methods and algorithms.

6.2 Dynamic programming (exact)

We have seen in Section 5.3 that dynamic programming is an excellent tool to perform the alignment of two sequences. In fact, this technique can be extended to perform the alignment of a set of k sequences and can provide the optimal solution for this set. We can rewrite the dynamic programming equation as

$$V(i_1, i_2) = \max_{(b_1, b_2) \in \{0,1\}^2 \setminus \{(0,0)\}} V(i_1 - b_1, i_2 - b_2) + \delta\left(S_1[i_1 b_1], S_2[i_2 b_2]\right)$$

where $S_j[i_j b_j]$ is the symbol $S_j[i_j]$ when $b_j = 1$ and a gap when $b_j = 0$ (that is, $S_j[0]$ denotes a gap).

This equation simply states that the best path to a cell depends on its neighboring, previously computed cells of the scoring matrix. As we can see, this form is closely related to that of the SP-score. We can extend it by considering that

$$V(i_1, \ldots, i_k) = \mathrm{SPscore}\left(\mathrm{align}\left(S_1[1 \ldots i_1], \ldots, S_k[1 \ldots i_k]\right)\right)$$

We can observe here that the value of the last cell will therefore be the SP-score of the optimal alignment between the k sequences. Therefore we have that at each cell of the multi-dimensional matrix

$$V(i_1, \ldots, i_k) = \max_{(b_1, \ldots, b_k) \in \{0,1\}^k \setminus \{(0, \ldots, 0)\}} V(i_1 - b_1, \ldots, i_k - b_k) + \mathrm{SPscore}\left(S_1[i_1 b_1], \ldots, S_k[i_k b_k]\right)$$

Therefore the SP-score of the optimal multiple alignment of $S = \{S_1, S_2, \ldots, S_k\}$ is $V(n_1, \ldots, n_k)$, where $n_i$ is the length of $S_i$. So we can actually fill a k-dimensional scoring matrix the same way that we did for two sequences to compute $V(n_1, \ldots, n_k)$. This process is detailed in Figure 12.
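The k-dimensional recurrence can be sketched with a memoized recursion over index tuples. Since the recurrence is written with a max, this sketch treats δ as a similarity score to be maximized. It also assumes that a symbol facing a gap costs a fixed `gap` value and that two gaps facing each other cost nothing, which the text does not specify. The names `msa_optimal_score` and `unit_score` are ours, and the sketch is only practical for a few short sequences.

```python
from functools import lru_cache
from itertools import combinations, product


def unit_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def msa_optimal_score(seqs, delta, gap):
    """Optimal (maximal) SP-score of a multiple alignment by k-dimensional DP."""
    k = len(seqs)
    # All vectors b in {0,1}^k except (0, ..., 0): 2^k - 1 possible moves per cell
    moves = [b for b in product((0, 1), repeat=k) if any(b)]

    def pair(a, b):
        if a == "-" and b == "-":
            return 0                      # assumption: gap against gap is free
        if a == "-" or b == "-":
            return gap
        return delta(a, b)

    @lru_cache(maxsize=None)
    def V(idx):
        if not any(idx):                  # V(0, ..., 0) = 0
            return 0
        best = None
        for b in moves:
            prev = tuple(i - bi for i, bi in zip(idx, b))
            if min(prev) < 0:             # move would leave the matrix
                continue
            # S_j[i_j * b_j]: the symbol if b_j = 1, a gap otherwise
            column = [s[i - 1] if bi else "-" for s, i, bi in zip(seqs, idx, b)]
            score = V(prev) + sum(pair(p, q) for p, q in combinations(column, 2))
            if best is None or score > best:
                best = score
        return best

    return V(tuple(len(s) for s in seqs))


print(msa_optimal_score(["ACGT", "AGT"], unit_score, -2))      # 1
print(msa_optimal_score(["AC", "AC", "AC"], unit_score, -2))   # 6
```

For k = 2 this reduces to the pairwise recurrence of Section 5.3 with a linear gap penalty.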

Exercise 7. (Complexity) Given the k-dimensional dynamic programming equation

Compute the theoretical worst-case time complexity of the algorithm

Compute the space requirements of the algorithm (storage required)

What are the pros and cons of using this algorithm, and in which cases could we use it?

6.3 Center-star method (approximation)

As we have seen in the previous section, finding the optimal multiple alignment takes exponential time, which makes it impractical for large sets of sequences. Therefore, we would like to devise a good approximation of this alignment in polynomial time. The center-star method was one of the first methods to minimize the SP-score in an efficient way.

The main idea behind this method is to find a reference (center) sequence inside the set of sequences to align, and then to align all other sequences with this reference. In order to find the reference sequence, we can compute the pairwise alignments of all pairs of sequences in the set and then select the sequence that minimizes the sum of distances to all the other sequences of the set (it will therefore represent a form of “centroid” of the set). Then, based on the pairwise alignments, we can iteratively build the multiple alignment simply by adding gaps to the current alignment. The overall workflow for the center-star method is presented in Figure 13.

[Figure 13 image: the input sequences ATTGCCATT, ATGGCCATT, ATCCAATTTT, ATCTTCTT and ATTGCCGATT; the pairwise alignment of each sequence with the center ATTGCCATT; and the multiple alignment after each merge, ending with the alignment reproduced here]

ATTGCC-ATT--
ATGGCC-ATT--
ATC-CA-ATTTT
ATCTTC--TT--
ATTGCCGATT--

Figure 13: Summary of the center-star algorithm.

Center_Star
Parameters: A set S of sequences
Return: A multiple alignment M with an SP-score at most twice that of the optimal alignment of S.
1: Compute D (Si, Sj) for all Si, Sj ∈ S
2: Find the center sequence Sc which minimizes $\sum_{i=1}^{k} D(S_c, S_i)$
3: For every Si ∈ S − {Sc}, choose the optimal pairwise alignment between Sc and Si
4: Introduce gaps into Sc so that the multiple alignment M satisfies the alignments found in Step 3.
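Step 2 of Center_Star is a simple argmin over the rows of the pairwise distance matrix. The sketch below only covers that step (merging the pairwise alignments is left out), and the distance values are invented for the example.

```python
def find_center(D):
    """Index of the sequence whose summed distance to all sequences is minimal."""
    sums = [sum(row) for row in D]
    return min(range(len(D)), key=sums.__getitem__)


# Hypothetical symmetric distance matrix for four sequences
D = [
    [0, 1, 4, 3],
    [1, 0, 2, 2],
    [4, 2, 0, 5],
    [3, 2, 5, 0],
]
print(find_center(D))   # 1 (row sums are 8, 5, 11 and 10)
```

When several rows share the minimal sum, `min` keeps the first one; another tie-breaking rule may be preferable in practice.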

Exercise 8. (Complexity) Given the center-star method

What is the time complexity of this algorithm?

Prove that the center-star method provides an alignment with an SP-score at most twice that of the optimal alignment.

Implement the center-star method based on the results of a Needleman-Wunsch pairwise alignment algorithm.

6.4 Heuristics methods

Even though the center-star method is a good way to obtain a multiple alignment, it still suffers from several flaws in terms of time and space requirements, but also in the quality of the final alignment. Indeed, the center-star method is highly sensitive to the choice of the reference sequence. Hence, several heuristics have been devised to alleviate this problem.

6.4.1 ClustalW - Progressive alignment

In order to improve quality, progressive alignment is based on the idea of first aligning the two closest sequences, and then progressively aligning the next most closely related sequences until all sequences are aligned. Hence it can be seen as the same idea as the center-star method, but selecting two references at each iteration. Several algorithms have been developed based on this idea, such as ClustalW, T-Coffee and ProbCons. We will analyze ClustalW, as it is one of the most popular multiple alignment programs. The algorithm can be divided into three main steps

1. Compute pairwise distance scores for all pairs of sequences

2. Generate the guide tree, which ensures that similar sequences are nearer in the tree

3. Align the sequences one by one according to the guide tree

[Figure 14 image: the input sequences, their pairwise distance matrix, the guide tree over S1 to S5 and the resulting alignments, in panels (a) to (e)]

Input sequences:
S1: PPGVKSDCAS
S2: PADGVKDCAS
S3: PPDGKSDS
S4: GADGKDCCS
S5: GADGKDCAS

Pairwise distance matrix:
      S1    S2    S3    S4    S5
S1    0     0.11  0.25  0.55  0.44
S2          0     0.37  0.22  0.11
S3                0     0.5   0.5
S4                      0     0.11
S5                            0

Final multiple alignment:
S1: P-PGVKSDCAS
S2: PADGVK-DCAS
S3: PPDG-KSD--S
S4: GADG-K-DCCS
S5: GADG-K-DCAS

Figure 14: Summary of the ClustalW algorithm.

Step 1. Pairwise distance score In this step we compute the (triangular) distance matrix containing D (Si, Sj) for all Si, Sj ∈ S. The idea here is to rely on a pairwise alignment and to express the distance as the ratio of mismatched symbols (the percentage of symbols that differ in the aligned versions of both sequences). It is important to note here (as already mentioned in Section 5.5) that the choice of the pairwise alignment algorithm and of its corresponding weight matrix will both highly influence the final multiple alignment.
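The mismatch ratio of Step 1 can be sketched as follows, once a pairwise alignment is available. How gaps are counted is not specified in the text: this sketch assumes that a symbol facing a gap counts as a mismatch and that columns where both rows hold a gap are ignored. The function name and the example strings are ours.

```python
def mismatch_ratio(a, b):
    """Fraction of differing columns between two aligned (equal-length) sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: skip columns where both rows are gaps
    columns = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    if not columns:
        return 0.0
    return sum(p != q for p, q in columns) / len(columns)


# Columns 3 and 5 differ (a symbol against a gap): 2 mismatches out of 6 columns
print(mismatch_ratio("ACGT-A", "AC-TTA"))
```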

Step 2. Generate the guide tree By computing the pairwise distance matrix, we obtain a representation from which we can easily generate a hierarchical clustering (as we have seen in Section 5). Hence any linkage function and clustering algorithm (single, average, complete) can be used to obtain the iterative grouping of sequences. The ClustalW algorithm relies on neighbor-joining in order to generate the dendrogram of sequence similarities.
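As an illustration of how a guide tree can be derived from the distance matrix, here is a small average-linkage (UPGMA) sketch. Note that this is not neighbor-joining, which is what ClustalW relies on; UPGMA is used here only because it is shorter to write. The function name `upgma`, the nested-tuple representation of the tree and the distance values are all assumptions made for the example.

```python
def upgma(D, names):
    """Average-linkage clustering; returns the guide tree as nested tuples of names."""
    # clusters: id -> (subtree, number of leaves); dist: (id_a, id_b) with id_a < id_b
    clusters = {i: (names[i], 1) for i in range(len(names))}
    dist = {(i, j): D[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)               # closest pair of clusters
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        del dist[(a, b)]
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            # Distance to the merged cluster: size-weighted average
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical distances between four sequences
D = [
    [0.0, 0.1, 0.5, 0.6],
    [0.1, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.2],
    [0.6, 0.7, 0.2, 0.0],
]
print(upgma(D, ["S1", "S2", "S3", "S4"]))   # (('S1', 'S2'), ('S3', 'S4'))
```

Each internal node of the returned tree corresponds to one alignment step of the progressive procedure.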

Step 3. Align the sequences according to the guide tree Based on the tree constructed at the previous step, we can now progressively align sequences, by simply following the guide tree. The idea is to rely on the same method used to trim down the tree in order to obtain clusters. Therefore, we start from the leaves and move up in the tree. Each time an internal node connecting several sequences is met, we align the corresponding sequences. This process is repeated until we reach the root node.

You might have noted that this process requires performing several multiple alignments between subsets of our full sequence set. Hence, in order to perform the alignment of subsets (that might have already been aligned in a previous step), ClustalW relies on a Profile-Profile alignment detailed in the next paragraph.

Profile-Profile Alignment Given two aligned sets of sequences A1 and A2, the profile-profile alignment introduces gaps into A1 and A2 so that both of them have the same length. In order to determine this alignment, we need a scoring function PSP (A1 [i], A2 [j]). In ClustalW, the score is defined as follows

$$\mathrm{PSP}(A_1[i], A_2[j]) = \sum_{x,y} g_x^i \, g_y^j \, \delta(x, y)$$

where $g_x^i$ is the observed frequency of symbol x in column i of A1 (and $g_y^j$ that of symbol y in column j of A2) and δ (x, y) is the distance between symbols x and y (as defined in our weight matrix). Hence, our aim is to find an alignment between the two sets that maximizes the PSP score. As we can see, this clearly delineates an alignment problem similar to those defined in a classical pairwise alignment. Therefore, ClustalW relies on a dynamic programming algorithm to find the optimal alignment. The overall workflow for the ClustalW method is presented in Figure 14.
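The PSP score of a pair of columns can be sketched as below. Columns are given as strings holding one symbol per aligned sequence. The `unit_score` function is a hypothetical stand-in for the weight matrix, and a gap, if present, is treated like any other symbol.

```python
from collections import Counter


def unit_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def column_frequencies(column):
    """Observed frequency of each symbol in one column of a profile."""
    total = len(column)
    return {symbol: count / total for symbol, count in Counter(column).items()}


def psp_score(col1, col2, delta):
    """Sum over all symbol pairs (x, y) of freq1[x] * freq2[y] * delta(x, y)."""
    f1, f2 = column_frequencies(col1), column_frequencies(col2)
    return sum(f1[x] * f2[y] * delta(x, y) for x in f1 for y in f2)


# Column "AAC" (A: 2/3, C: 1/3) against column "AG" (A: 1/2, G: 1/2):
# 1/3 - 1/3 - 1/6 - 1/6 = -1/3
print(psp_score("AAC", "AG", unit_score))
```

Using this function as the scoring function of a pairwise DP over columns, instead of over single symbols, gives the profile-profile alignment.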


Exercise 9. ClustalW alignment

Based on the two sets of sequences, compute the PSP score at each column and propose an optimal alignment

Manually perform three iterations of the ClustalW algorithm on the set of sequences

Compute the theoretical worst-case time complexity of each step of the algorithm.

Deduce the overall worst-case time complexity of the ClustalW algorithm.

6.4.2 MUSCLE - Iterative method

The biggest limitation of progressive alignment methods is that they never realign the sequences. Hence, the final alignment remains highly sensitive to the quality of the initial alignments, and progressive methods are not guaranteed to converge to the global optimum. In order to alleviate these flaws, iterative methods introduce heuristics that start with a progressive alignment and then iteratively improve the multiple alignment. Examples of iterative methods are PRRP, MAFFT and MUSCLE. We will detail the MUltiple Sequence Comparison by Log-Expectation (MUSCLE). This algorithm is based on two ideas

1. Generating a draft multiple alignment as fast as possible, then iteratively improving it.

2. Introducing the log-expectation score for profile-profile alignment

The PSP score used in ClustalW, $\mathrm{PSP}(A_1[i], A_2[j]) = \sum_{x,y} g_x^i g_y^j \delta(x, y)$, may favor gaps as it relies on the direct sum of weighted symbol distances (and gaps are considered as equivalent symbols). Therefore, MUSCLE introduces the log-expectation (LE) score

$$\mathrm{LE}(A_1[i], A_2[j]) = \left(1 - f_G^i\right)\left(1 - f_G^j\right) \log\left(\sum_{x,y} f_x^i f_y^j \frac{p_{xy}}{p_x p_y}\right)$$

where $f_G^i$ is the proportion of gaps in column i of A1, $f_x^i$ is the proportion of symbol x in column i of A1 (and likewise $f_G^j$ and $f_y^j$ for column j of A2), $p_x$ is the background proportion of symbol x and $p_{xy}$ is the probability that x aligns with y. It should be noted that $p_{xy}/(p_x p_y) = e^{\delta(x, y)}$. MUSCLE is then based on three different stages

Stage 1. Draft progressive Here, we generate an initial alignment based on the progressive alignment method. Therefore, the steps are the same as in ClustalW, with several modifications. First, the distance matrix is computed faster by discriminating sequences based on their symbol frequencies. Second, MUSCLE relies on the Unweighted Pair-Group Method using Arithmetic mean (UPGMA) to perform clustering instead of neighbor-joining. Finally, the profile-profile alignment relies on the log-expectation score instead of the PSP score.
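The log-expectation score used in this stage can be sketched as follows, using the identity between the ratio of probabilities and the exponential of δ (x, y) so that only the weight matrix is needed. This sketch assumes that the proportions of symbols are computed over all rows of the column, gaps included, and that a column made only of gaps scores 0. The names `le_score` and `unit_score` are ours.

```python
import math
from collections import Counter


def unit_score(a, b):
    """Hypothetical scoring function: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1


def le_score(col1, col2, delta, gap="-"):
    """(1 - fG1) * (1 - fG2) * log(sum_xy f1[x] * f2[y] * exp(delta(x, y)))."""
    def gap_fraction(col):
        return col.count(gap) / len(col)

    def frequencies(col):
        # Assumption: proportions over all rows, gaps excluded from the symbols
        return {s: c / len(col) for s, c in Counter(col).items() if s != gap}

    f1, f2 = frequencies(col1), frequencies(col2)
    if not f1 or not f2:              # a column of gaps only: the factor is 0
        return 0.0
    total = sum(f1[x] * f2[y] * math.exp(delta(x, y)) for x in f1 for y in f2)
    return (1 - gap_fraction(col1)) * (1 - gap_fraction(col2)) * math.log(total)


print(le_score("AA", "AA", unit_score))   # log(e) = 1.0
print(le_score("A-", "AA", unit_score))   # 0.5 * log(0.5 * e), about 0.153
```

The factor in front of the logarithm scales the score down as soon as one of the two columns contains gaps, which is what prevents gap-rich columns from being favored.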

Stage 2. Improved progressive The second stage follows exactly the same steps as stage 1 (progressive alignment). However, the distance matrix is computed by first finding the fraction D of identical symbols shared by two aligned sequences. The distance is then computed as $-\log_e\left(1 - D - D^2/5\right)$. The guide tree is still built using UPGMA and the profile-profile alignment is still performed with the log-expectation score. However, we re-align the sequences only where there are changes relative to the original tree.

Stage 3. Refinement This stage is optional, but it refines the multiple alignment in order to maximize the SP-score. It is based on the following steps.

1. Visit the edges e in decreasing distance from the root

(a) Partition the alignment into two sets by deleting the edge e from the guide tree

(b) The two sets are re-aligned using profile-profile alignment

(c) Compute the SP-score for the new alignment

(d) If the SP-score is improved, we keep the new alignment

2. Iterate Step 1 until there is no improvement in SP-score

Exercise 10. MUSCLE alignment

[Figure 15 image: a sequence-logo style plot, with a vertical axis in bits (0 to 2) and positions −6 to 19 along the horizontal axis]

Figure 15: Possible representations of the consensus sequences

Based on the two sets of sequences, compute the log-expectation score at each column and propose an optimal alignment

Manually perform three iterations of the MUSCLE algorithm on the set of sequences

Compute the theoretical worst-case time complexity of each step of the algorithm.

Deduce the overall worst-case time complexity of the MUSCLE algorithm.

6.5 Computing the consensus sequence

Adapting different consensus sequence algorithms makes it possible to study different properties of the alignment. Even though several statistical methods have been developed for constructing consensus sequences, we will focus here only on simple methods based on frequencies. The consensus sequence algorithms that will be used in this project are those of Seqtrace and UGene; refer to these tools for further information and more advanced statistical techniques. Some of the possible final representations of the consensus sequences are displayed in Figure 15.
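A simple frequency-based consensus can be sketched as a majority vote per column. This is only one possible convention and not necessarily the one implemented by the tools named above: the sketch assumes that gaps are ignored in the vote, that columns made only of gaps are dropped, and that ties go to the symbol met first in the column.

```python
from collections import Counter


def consensus(alignment, gap="-"):
    """Most frequent non-gap symbol of each column of an alignment."""
    result = []
    for column in zip(*alignment):
        counts = Counter(symbol for symbol in column if symbol != gap)
        if counts:                                    # skip all-gap columns
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


print(consensus(["VSN-S", "-SNAS", "---AS"]))   # VSNAS
```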

7 Going further

Based on the algorithms presented so far, several questions can lead you to further investigations and discoveries related to this field.

1. How can the alignment paradigm be extended to include multiple sources of information?

2. Is it possible to devise an adaptive alignment that uses the alignment obtained from one source of information in order to improve the alignment of another source?

3. How to detect motifs (common patterns) inside a set of symbolic sequences based solely on the multiple alignment?

Project module assignment

Part 1. (Pairwise alignment)

1. Implement one of the methods presented in the pairwise alignment state of the art

2. Propose several distance (scoring) matrices for each of the symbolic representations presented in the first part.

3. Compare the use of different scoring matrices, and evaluate their accuracy by

(a) Computing the pairwise alignment for all songs (some of which are covers of others)

(b) Comparing visually the results (you can either use a hierarchical clustering or an NMDS for display)

4. Evaluate the quality of different distance matrices, gap penalty function parameters and alignment choices

(a) First by an agnostic approach, computing the Normalized Mutual Information, B-Cubed and inertia ratio of different results

(b) Second by relying on the metadata gathered in the first part of the project and considering this as a classifier

(c) You can use the F-score, or any clustering measure, to check whether the classification is accurate

(d) Plot all your results as a polar dendrogram (code provided)

Part 2. (Multiple alignment)

1. Implement one of the methods in the Multiple Sequence Alignment (MSA) section

2. Compare all the MSA methods (Center-star, ClustalW, ClustalΩ, MAFFT, MUSCLE, ProbCons and TCoffee) provided in the code section

(a) Set up the whole evaluation framework for comparing different multiple alignment methods

(b) Rely on all the different symbolic information and their corresponding distance matrix for evaluation

(c) Implement multiple

3. Evaluate the quality of different results by relying on a criterion of compactness, coverage and accuracy

4. Evaluate the alignments at different branching nodes of the tree.

Part 3. (Consensus sequence)

1. Compute the various consensus sequences for all previous variations and evaluate their qualities.

2. Display the consensus sequences and evaluate their compactness and accuracy at different branching nodes of the tree.

Part 4. (Research)

Select one of the open questions stated in Section 7 and propose a simple yet efficient way to deal with this problem. Provide an algorithmic sketch of the implementations required for your idea.
