algorithmic contributions to computational molecular biology · on computational molecular biology,...

145
Algorithmic Contributions to Computational Molecular Biology St´ ephane Vialette To cite this version: St´ ephane Vialette. Algorithmic Contributions to Computational Molecular Biology. Data Structures and Algorithms [cs.DS]. Universit´ e Paris-Est, 2010. <tel-00862069> HAL Id: tel-00862069 https://tel.archives-ouvertes.fr/tel-00862069 Submitted on 23 Sep 2013 HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destin´ ee au d´ epˆ ot et ` a la diffusion de documents scientifiques de niveau recherche, publi´ es ou non, ´ emanant des ´ etablissements d’enseignement et de recherche fran¸cais ou ´ etrangers, des laboratoires publics ou priv´ es.

Upload: others

Post on 20-Aug-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Algorithmic Contributions to Computational Molecular

Biology

Stephane Vialette

To cite this version:

Stephane Vialette. Algorithmic Contributions to Computational Molecular Biology. DataStructures and Algorithms [cs.DS]. Universite Paris-Est, 2010. <tel-00862069>

HAL Id: tel-00862069

https://tel.archives-ouvertes.fr/tel-00862069

Submitted on 23 Sep 2013

HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, estdestinee au depot et a la diffusion de documentsscientifiques de niveau recherche, publies ou non,emanant des etablissements d’enseignement et derecherche francais ou etrangers, des laboratoirespublics ou prives.

Page 2: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

UNIVERSITE PARIS-EST MARNE-LA-VALLEE

INSTITUT GASPARD MONGE

HABILITATION DIRIGER DES RECHERCHES

prsente parStphane VIALETTE

Algorithmic Contributions toComputational Molecular Biology

Soutenue publiquement le 1 Juin 2010devant le jury compos de

Marie-Pierre Beal Professeur, Universit Paris-Est, ExaminateurChristian Choffrut Professeur, Universit Paris 7, ExaminateurMaxime Crochemore Professeur, Universit Paris-Est, ExaminateurAlain Denise Professeur, Universit Paris-Sud 11, ExaminateurGregory Kucherov Directeur de Recherche CNRS, RapporteurAndras Sebo Directeur de Recherche CNRS, Rapporteur

Page 3: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular
Page 4: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Contents

I Structures: from 2-intervals to annotated sequences . . . throught permutations 1

1 Algorithmic aspects of 2-interval sets 51.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2 Bestiary and definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.3 Recognizing multidimensional interval graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4 Combinatorial problems on 2-intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 From linear graphs to permutations 172.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3 From linear graphs to permutations . . . and back . . . . . . . . . . . . . . . . . . . . . . . . . . 182.4 Pattern matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.5 Finding common structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.6 Separable patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3 Arc-annotated sequences 333.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343.3 Maximum common patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363.4 Pattern matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.5 Extending the standard model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

II Pattern Matching in Graphs 39

4 Pattern matching in graphs 434.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434.3 Finding exact occurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444.4 Approximate occurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464.5 Replacing lists by colors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

i

Page 5: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

ii

5 Searching for connected occurrences 515.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525.3 Searching for exact connected occurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525.4 Minimizing the number of connected components . . . . . . . . . . . . . . . . . . . . . . . . . 555.5 Maximizing the size of the connected occurrence . . . . . . . . . . . . . . . . . . . . . . . . . . 585.6 Further variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

6 Querying PPI Networks 636.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636.2 A feedback vertex set approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646.3 Practical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

III Genome Rearrangements 67

7 Genome rearrangements with duplicate genes 717.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717.2 From genomes to permutations . . . and back . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727.3 Comparing two compatible genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757.4 Exact algorithms and heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

8 Exemplar common subsequences 858.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 858.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 858.3 Key results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

IV Additional topics 89

9 Selenocysteine-like insertion 939.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 939.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949.3 Key results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

10 How many words are needed to build up all words ? 9910.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9910.2 Approximation and inapproximation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10110.3 Jumping to numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

Bibliographie 127

Index 112

Page 6: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Remerciements

Gad Landau, Gregory Kucherov et Andras Sebo m’ont honor en acceptant d’łtre les rapporteurs de cemmoire. Je les en remercie vivement.

Je remercie sincrement Marie-Pierre Bal, Alain Denise, Maxime Crochemore et Christian Choffrut dem’avoir fait l’honneur de participer au jury.

Merci Marie-Piere Bal de m’avoir chaleureusement accueilli mon arrive Marne-la-Valle et y avoir favorisle dveloppement de l’algorithmique pour la biologie. Toujours disponible pour m’couter et m’encouragermalgr le peu de temps que lui laisse ses responsabilits.

Je me rappelle trs bien ma premire rencontre avec Alain Denise. Dbut 2002, il m’avait demand une copiede mes transparents la suite d’un expos l’ENS o je venais de commencer mon postdoc. Qu’il puisse trouverquelque intrłt mes travaux m’avait alors la fois tonn et enthousiasm. Hasard (ou pas?), je le rejoignais auLRI l’anne suivante. J’ai normment appris ses cots, bien plus qu’il pourrait le penser.

Maxime Crochemore tait dj membre du jury de ma thse, il en tait en fait rapporteur. Sa bonne humeur etses qualits humaines font de nos (trop rares !) rencontres des moments des plus agrables.

Christian Choffrut. Que dire . . . ? Je sais tout ce que je lui dois.

Mon parcours est jalonn de rencontres et d’expriences qui seront dterminantes. Merci donc . . .

tous les membres du LIGM pour leur accueil. Un merci tout particulier Line Fonfrde et Gabrielle Brossardpour leur aide et leur disponibilit.

l’quipe Bioinformatique du LRI, Christine Froidevaux et Alain Denise (encore lui !) en tłte, pour m’avoirdonn ma chance.

Claude Jacq pour m’avoir donn l’opportunit de travailler dans un laboratoire de gntique. Ces deux annesde postdoc ont t – et reste – une exprience extraordinaire (pour łtre tout fait honnłte, je pouvais mettre uneblouse si l’envie m’en prenait mais il m’tait interdit de toucher quoi que ce soit sur les paillasses !).

Guillaume Fertin pour les heures de discussion, le travail en commun, les coups de fil hebdomadaires(pas forcment pour parler travail), les voyages (malheureusement bien moins frquents depuis 2006), . . . moncollaborateur prfr en somme!

iii

Page 7: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

iv

Guillaume Blin pour toutes ces petites choses qui font de notre bureau un endroit particulirement agrable. . . en n’oubliant pas les musiques improbables qu’il m’inflige (j’ai bien failli ne pas survivre la compil’spciale annes 80).

Romeo – incredible reduction – Rizzi et Danny Hermelin, collgues et amis, pour tout le travail en communet le temps pass ensemble. C’est toujours un rel plaisir de les retrouver.

mes (autres) co-auteurs et/ou amis : Gaelle Brevier, Riccardo Dondi, Mike Fellows, Isabelle Fagnot,Philippe Gambette, Sylvain Guillemot, Anthony Labarre, Matthieu Raffinot, Dror Rawitz, Irena Rusu, EricTannier, . . .

Annelyse Thvenin et Florian Sikora qui ont support mes techniques exprimentales de direction de thse etde pdagogie.

Merci Patricia, Jeanne et Nathan sans qui, tout simplement, rien ne serait possible.

Page 8: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Foreword

The time has come to write it! For most of us, writing our habilitation thesis is not an easy matter, and I didnot depart from the rule. How to expose our works? How to establish a logical link between our works?Some of many questions that should be answered before anything else.

So first the haunting question: What should be the best format? Fortunately or not (I would go for nothere), there are as many answers as people! Answers go from “extended abstract”to “PhD-like manuscript”,and, as for what should be in, “cumulative”(based on previous research) or “monographical”(specificunpublished results) . . . well, answers do not help much here. The most appropriate format is often thesimplest but I guess it varies from person to person and depends on what you are expecting or looking for inthat exercise. So what am I expecting from this writing exercise? To cut a long story short, something useful(at least for me). Therefore, to avoid a tedious and useless exercise, my choice has been to write a survey ofmy works together with a brief exposition of the related results; something to which I can refer to for furtherresearch. This should explain the rather long form of the present manuscript.

As for this famous “logical link”, quite naturally, I decided to focus on algorithms, and more specificallyon computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected tomolecular biology. With this in mind, I have wilfully chosen to move some of my published papers apartfrom my habilitation. It is not that there are too many of them but some definitively do not fit in the scope ofthis manuscript. First, some are clearly totally (I did not want to write “too”here) biologically oriented andhave very little algorithmic content (and actually I am not really involved anymore in this activity). Thisincludes [Lelandais et al., 2004a,b, 2006; Margeot et al., 2002; Sylvestre et al., 2003]. I shall not discuss thispart here. Neither shall I discuss about my interest in graph labelings and our recent works on alliancesand secure sets in graphs. Whereas my interest in these topics is clearly algorithmic, they are not relatedin any way to any of the four parts of the present manuscript. This includes [Fertin and Vialette, 2009;Vialette, 2006] as well as [Blin et al., 2009a]. One may argue that some parts of this habilitation thesis arequite far from any combinatorial topic connected to molecular biology (still this famous “logical link”). Ihave, for example, in mind Section 2.3, Section 2.4, Section 2.6, Section 10.3, . . . , and I do agree these topicsare very far – not to say independent – from any computational molecular biology consideration. However,I came across these topics as special cases or relaxations of combinatorial objects that are, from my pointof view, clearly connected to exploring molecular biology (I do no claim practical applications for all ofthem): d-intervals, linear graphs, exon shuffling, . . . My opinion is thus that these topics have their rightfulplaces in my manuscript and, even more important from my point of view, gathering together these topicswith more practical issues such as “querying PPI network”or “designing fast heuristics for comparativegenomics”clearly reflects my way of doing research.

v

Page 9: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

vi

Page 10: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Introduction

Computational biology is (should be?) an interdisciplinary field that applies the techniques of computerscience, applied mathematics and statistics to address biological problems. It encompasses many fields,ranging from Computational biomodeling to Computational biochemistry. If I would really have to place myselfin this field (always such a difficult question for me, my preferred answer would be just algorithmic as Inever have laid down myself to restrict to this area) I would say: Bioinformatics (in the very very precisesense of designing algorithms to the interpretation, classification and understanding of biological datasets)and Comparative genomics (as a part of Computational genomics).

For the prerequisites, the reader is expected to be familiar with basic graph theory, classical complexitytheory and parameterized complexity theory. We only recall some basic definitions (the two followingparagraphs should constitute sufficient preparation).

In computer science and operations research, approximation algorithms are algorithms used to find ap-proximate solutions to optimization problems (the best general references are [Vazirani, 2002] and [Ausielloet al., 1999]). Approximation algorithms are often associated with NP-hard problems since it is unlikelythat there can ever be efficient polynomial-time exact algorithms solving NP-hard problems, one settlesfor polynomial-time sub-optimal solutions. Unlike heuristics, which usually only find reasonably goodsolutions reasonably fast, one wants provable solution quality and provable run time bounds. Ideally, theapproximation is optimal up to a small constant factor. Given an instance x of an optimization problem P,the performance guarantee (or approximation ratio ) R(x, y) of a solution y to the instance x is defined as

R(x, y) = max(

opt(x)f(y)

,f(y)

opt(x)

),

where opt(x) is the value of an optimum solution for the instance x and f(y) is the value of the solutiony for the instance x. Clearly, the performance guarantee is greater than or equal to 1 (and equal to 1 ifand only if y is an optimal solution). If an algorithm A guarantees to return solutions with a performanceguarantee of at most r(n), then A is said to be an r(n)-approximation algorithm and has an approximationratio of r(n). Likewise, a problem with an r(n)-approximation algorithm is said to be r(n)-approximable orhave an approximation ratio of r(n). The class APX (an abbreviation of “approximable”) is the set of (NPO)

vii

Page 11: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

viii

optimization problems that allow polynomial-time approximation algorithms with approximation ratiobounded by a constant (or constant-factor approximation algorithms for short). In simple terms, problemsin this class have efficient algorithms that can find an answer within some fixed percentage of the optimalanswer. A PTAS is an algorithm which takes an instance of an optimization problem and a parameterε > 0 and, in polynomial-time, produces a solution that is within a factor ε of being optimal. Notice thatthe running time of a PTAS is required to be polynomial in n for every fixed ε but can be different fordifferent ε. Thus, an algorithm, running in O(n1/ε) time or even O(nexp(1/ε)) time counts as a PTAS. Apractical problem with PTAS algorithms is that the exponent of the polynomial could increase dramaticallyas ε shrinks, for example if the runtime is O(n1/ε). One way of addressing this is to define the efficientpolynomial-time approximation scheme or EPTAS, in which the running time is required to be O(nc) for aconstant c independent of ε. This ensures that an increase in problem size has the same relative effect onruntime regardless of what ε is being used; however, the constant under the big-O can still depend on εarbitrarily.

For many applications the trade-offs inherent to approximation algorithms and heuristics are not satis-factory. Fixed-parameter algorithms can provide an alternative by providing optimal solutions with usefulruntime guarantees (the best general references are [Downey and Fellows, 1999; Flum and Grohe, 2006;Niedermeier, 2006]). The core concept is formalized as follows: An instance of a parameterized problemconsists of a problem instance x and a parameter k. A parameterized problem is fixed-parameter tractable ifit can be solved in f(k)|x|O(1)) time, where f is a computable function solely depending on the parameterk, not on the input size |x|. For NP-hard problems, f(k) will of course not be polynomialsince otherwisewe would have an overall polynomial-time algorithm – but typically be exponential like 2k . Clearly, fixed-parameter tractability captures the notion of “efficient for small parameter values”: for any constant k, weobtain a polynomial-time algorithm. Moreover, the exponent of the polynomial must be independent ofk, which means that the combinatorial explosion is completely confined to the parameter. The standardparameterization of an optimization problem such as VERTEX COVER or CLIQUE takes the size of the solutionas the parameter. Accompanying the work on designing efficient and practical parameterized algorithms, atheory of parameterized intractability has been developed (Downey and Fellows 1999 monograph [Downeyand Fellows, 1999] gives a fairly complete picture of the theory then). In parameterized complexity, toclassify fixed-parameter intractable problems, a hierarchy, the so-called W[-]hierarchy

⋃t≥0W[t], where

W[t] ⊆W[t + 1] for all t ≥ 0 has been introduced, in which the 0-th level W[0] is the class FPT. The hardnessand completeness have been defined for each level W[t] of the W[-]hierarchy for t ≥ 1, and a large numberof W [i]-hard parameterized problems have been identified [Downey and Fellows, 1999]. For example, theVERTEX COVER problem is known to be fixed-parameterized tractable for the standard parameterizationwhereas the CLIQUE problem has been proved to W[1]-complete. The fundamental conjecture FPT 6= W[1] isvery much analogous (but clearly weaker) to the conjecture that P 6= NP. Notice that, from an algorithmicpoint of view, it is usually sufficient to distinguish between W[1]-hardness and membership in FPT.

This habilitation thesis is organized in four parts. I would say (i) algorithmic of (not so) linear structures,(ii) pattern matching in graphs, (iii) comparative genomics, and (iv) what is left and does not fit well in anyof the three first parts. If I would have to give a chronology, Part I contains the problems I was first interestedin as I came across 2-intervals as early as during my PhD thesis (actually as a naive attempt to build anabstract model for autocatalytic group I introns). Part II and Part IV follow. My interest in comparativegenomics (Part III) is more recent; as far as I remember my first research activity in this field dates back to2005. Let me now introduce briefly these four parts (to facilitate access to the individual topics, the chaptersare rendered as self-contained as possible).

Page 12: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

ix

Part I is concerned with families of high-dimensional intervals, linear graphs, permutations and arc-annotated sequences, all those graph-like combinatorial objects I can draw from left to right, align andsearch for a pattern in. It is composed of three chapters. Chapter 1 is devoted to algorithmic aspects of high-dimensional intervals and more specifically to algorithmic aspects of 2-intervals. This chapter encompassesrecognition of restricted 2-interval graphs and combinatorial problems on families of 2-intervals. Chapter 2focuses on linear graphs and linear matchings, those graph with linearly ordered vertices. This chapter canbe seen as a follow-up of Chapter 1 as most questions could have been raised in the general framework of2-intervals (for disjoint interval ground sets). However, as we shall see, linear graphs and linear matchingsdeserve a separate chapter as they raise specific and important (and hard!) questions about permutationpatterns. In Chapter 3, we consider some algorithmic issues of arc-annotated sequences, a popular objectto represent RNA sequences. Common to these three chapters is the notion of relative positioning: for anytwo disjoint objects, either one precedes the other, is included in the other, or they are crossing each over.I have tried to develop in this manuscript a general and common framework (including notations) thatencompasses 2-intervals, linear graphs and arc-annotated sequences. Clearly, this is the part of my researchthat was the most followed [Li and Li, 2006, 2009a; Jiang, 2007b, 2008, 2007a; Chen et al., 2007a; Gramm,2004a,b; Gramm et al., 2002; P. Thebault et al., 2006].

Part II is devoted to pattern matching issues (in the broad sense) in graphs. It is composed of threechapters. Chapter 4 is concerned with pattern matching in the common sense: finding an exact or anapproximate occurrence of a motif (given in the form of a graph) in a target graph. We focus in Chapter 4on edge-conservation and injective mappings. With protein-protein interaction (PPI) networks in mind,we consider additional restrictions: (i) each vertex of the pattern is associated to a (small) set of vertices ofthe target graph it can be mapped to, and (ii) both the motif and the target graphs are vertex-colored andany vertex of the motif must be mapped to a vertex of the target graph with the same color. We do believethat a better approach would consist in using a set of colors instead of one color (thereby allowing for agreater flexibly in the design) but we will not develop this point in this manuscript. Chapter 5 differs fromChapter 4 by renouncing to topology conservation (this is actually a weak renouncement as we shall see),we only require the occurrences to be connected. This recent problem (introduced in the context of metabolicnetworks [Lacroix et al., 2006]) raises new, elegant and original questions. Finally, brief Chapter 6 is devotedto presenting our contribution to a somewhat more classical view of pattern matching in PPI networks whereone is allowed to insert and delete vertices in the occurrence. Most of the interest in our contribution (basedon feedback vertex sets) is the PADA1 software that performs as well as QNet (the state-of-the-art softwareto query PPI networks) on tree patterns while allowing for general graph patterns (the tree decompositionbased approach of QNet for dealing with general graph patterns has never been implemented due to itscomplexity).

Part III is concerned with comparative genomics. Comparative genomics is a field by itself and we shallonly consider genome rearrangements with duplicate genes. It is composed of two chapters. Chapter 7is by far the longest of the two. In this chapter we consider the problem of computing a distance (or a(dis)similarity) between two genomes with duplicate genes from a pure algorithmic point of view. As weshall see, most – not to say all - problems are intractable and sometimes even hard to approximate withinany ratio. Again, I have tried in this chapter to develop a common general framework (in the form ofpermutations associated to matchings) that encompasses all these problems and allows me give a unifiedexposition. Chapter 8 is devoted to presenting and analyzing a simple LCS-like problem that aims atovercoming the difficulties I have raised in Chapter 7.

Part IV is actually concerned with two different topics. Chapter 9 is devoted to algorithmic aspectsof selenocysteine insertion and could be seen as a follow-up of [Backofen et al., 2002] where the problemof computing an mRNA sequence of maximum codon-wise similarity to a given mRNA (and hence, to agiven protein) that additionally satisfies some secondary structure constraints was introduced. Chapter 10

Page 13: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

x

is devoted to a covering problem. My initial motivation for studying this problem came from a paper byBodlaender et al. [Bodlaender et al., 1995], who described an application in the context of protein folding (theauthors actually referred to this problem as the DICTIONARY GENERATION problem). Indeed, many proteinsseem to be composed of relatively small regions which fold independently of other regions, and the theoryof exon shuffling proposes that all proteins are concatenations of such regions, where the regions are drawnfrom a common ancestral dictionary [Dorit and Gilbert, 1991; Patthy, 1991].

To avoid any confusion, the citations of external items (algorithms, theorem, . . . ) appear in the bodyof the text as references whereas my results are – most of time, I have tried to stick to this rule as much aspossible – given in the form of propositions. Notable exceptions for this rule are pages 28 (a lemma by NogaAlon) and 101 (the Local-Ratio lemma [Bar-Yehuda and Even, 1985]).

Finally, all along this document, I use thinking notes or perspective notes (sometimes also referred to asheadache notes in the text for obvious reasons) in the following form

to point out facts, important questions or even perspectives. It is not about all problems or special cases leftopen in this manuscript (I would have to put such a note on each page I guess), but to shed light on points Iam particularly interested in. Therefore, it is worth keeping in mind that these notes are concerned withboth problems I have spent weeks (sometimes months) on . . . without much success, and perspectives forfurther research.

Last point, some new results are announced in this manuscript, most without proof. Two notableexceptions are Proposition 2.4.1 (page 20) and Proposition 10.3.2 (page 103).

Page 14: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Part I

Structures: from 2-intervals to annotatedsequences . . . throught permutations

1

Page 15: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular
Page 16: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Introduction

This part is devoted to presenting our works on (not so) linear structures. Well, what are those (not so) linearstructures? In our context these will be all those combinatorial objects that I can draw from left to right, alignand search for a pattern in. More specifically, we will be concern with high-dimensional intervals, lineargraphs, linear matchings, permutations and arc-annotated sequences. The rationale for bringing togetherthese combinatorial objects is a common notion of relative positioning I am particularly interested in: for anytwo disjoint objects, either one precedes the other, one is included in the other, or they are crossing each over.My interest in such a property started with 2-intervals. It turns out that this property is also at the heart ofthe algorithmic of linear graphs and arc-annotated sequences. This manuscript gave me the opportunity todevelop a general and common framework (including notations) that encompasses 2-intervals, linear graphsand arc-annotated sequences: an family of objects is typeM if any two objects in it are comparable for abinary relation inM.

High-dimensional (or multi-dimensional, or d-interval) intervals are the union of disjoint intervals, and amulti-dimensional interval graph is the intersection graph of a family of multi-dimensional intervals. Multi-dimensional intervals together with multi-dimensional interval graphs constitute a natural generalization ofintervals and interval graphs (one of the most studied class of intersection graphs). We shall be interestedmainly on 2-intervals, i.e., unions of pairs of disjoint intervals. Our concern is twofold: recognition of somerestricted 2-interval graphs and algorithmic aspects of 2-intervals.

Linear graphs are graphs with linearly ordered vertices. These graphs certainly constitute a special caseof 2-intervals. Adopting the same strategy as for 2-intervals, we will be concerned with finding motifs inlinear graphs. This general problem includes both finding an occurrence of a linear graph in another lineargraph and finding a common motif (a linear graph here) that occurs in each input linear graph. Of particularimportance, linear graphs are a generalization of permutations, and hence a portion of this chapter will bedevoted to the problem of finding motifs in permutations. Indeed, as we shall see, hardness of finding motifsin linear graphs (and in 2-intervals) originates from permutations.

Arc-annotated sequences can be seen both as a generalization of standard sequences (a string togetherwith some edges) and as a special case of linear graphs (a vertex-labeled linear graph). Arc-annotatedsequences have recently proved to be useful for modeling RNA structures. Again, we will adopt the verysame strategy as for 2-intervals and linear graphs.

3

Page 17: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

4

Page 18: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

1Algorithmic aspects of 2-interval sets

Contents1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2 Bestiary and definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.3 Recognizing multidimensional interval graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 71.4 Combinatorial problems on 2-intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.2 The 2-INTERVAL PATTERN problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.3 Algorithms and complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.4.4 Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.4.5 Parameterized complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.1 Introduction

Let F = S1, S2, . . . , Sn be a family of sets. The intersection graph of F , usually denotedΩ(F), is the graphhaving F as vertex set with Si adjacent to Sj if and only if i 6= j and Si ∩ Sj = ∅ ([McKee and McMorris,1999] and [Brandstadt et al., 1999] are our favorite references here). A graph G is an intersection graphif there exists a family F such that Ω(F) ' G, where we typically display this isomorphism by writingV(G) = u1, u2, . . . , un with each ui corresponding to Si and ui, uj ∈ E(G) if and only if Si ∩ Sj 6= ∅.When Ω(F) ' G, the family F is called a representation if G. Notice that every graph is an intersection graph(this property is ascribed to Marczewski in [McKee and McMorris, 1999]) Therefore, while every graph has aset representation, intersection graph theory uses properties of the set representations and various conditionsimposed thereon, rather than the conventional graph-theoretic approach that, in some sense, forgets the sets.

We shall be concerned in this chapter with high-order intervals (also referred as multidimensional intervals).Notice, however, that, whereas we will present all definitions in the general setting of d-dimensional intervals,most of out our concern will be 2-dimensional intervals. The term “d-dimensional interval”originated in thelate 1970s [Trotter and Harary, 1979; Griggs and West, 1979; Scheinerman and West, 1983], where the focuswas on determining how small d can be so that a given graph is a d-interval graph. The first referencesdevoted to algorithmic aspects of d-interval graphs are [Golumbic, 1980] (actually in the form of an exercise)and [West and Shmoys, 1984]. For an up-to-date survey of the algorithmic aspects of 2-intervals, we refer thereader to our recent entry in the Encyclopedia of Algorithms [Vialette, 2008].

5

Page 19: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

6

1.2 Bestiary and definitions

Let us start by presenting some intersection graphs we will be concerned with in this manuscript (d-boxgraphs are presented for the sake of completeness – as another way to generalize interval graphs – but weshall not consider them in the sequel).

A d-interval (or multiple interval) is a set of the real line which can be written as the union of d disjointclosed intervals [ai, bi]. Clearly, 1-intervals are the intervals. The intersection graph of a family of d-intervalsis a d-interval graph. The smallest d for which G is a d-interval graph is the interval number i(G).

A d-track interval is a union of d intervals, one each from d parallel lines (actually separate lines wouldbe a better definition as, for example, defining piercing sets for d-track intervals by vertical lines is a bitconfusing if they are defined on parallel lines). A graph is a d-track interval graph if it is the intersection graphof d-track intervals. The intervals graphs are precisely the 1-track interval graphs (and also the 1-intervalgraphs). The multitrack interval number of a graph G is the smallest d for which G is a d-track interval graph.Notice that a d-track interval graph is the union of d interval graphs with the same vertex set.

Closely related are d-boxes and d-box graphs. A d-box is the Cartesian product of intervals [ai, bi],1 ≤ i ≤ d. A graph is a d-box graph if it is the intersection graph of d-boxes. Hence interval graphs areprecisely the 1-box graphs. Of interest in our context, a d-box graph is the intersection of d interval graphswith the same vertex set. Notice that d-box graphs are not contained in d-interval graphs, neither d-intervalgraphs are included in d-box graphs. Indeed, K3,6 is a 2-box graph but not a 2-interval graph, and the graphobtained by subdivising edge each of K5 is a 2-interval graph but not a 2-box graph. The boxicity of a graphG is the minimum d for which G is a d-box graph. We shall not develop d-box graphs in the sequel.

For a d-interval (resp, d-track interval, d-box) graph G, a d-interval (resp, d-track interval, d-box) representa-tion of G is a family of d-intervals (resp, d-track intervals, d-boxes) F for which G is the intersection graphof.

Example 1 Let G be the graph defined as follows:

u1

u4

u5

u3

u2

A 2-interval representation of G is given by

u4u1 u1

u2u3

u5u4

u3u2u5

a 2-track interval representation of G is given by:

track 1: u1u2

u3u5

u4

track 2: u2 u1u4

u3u5

and a 2-box representation of G is given by:

Page 20: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

7

u1

u2

u3

u4

u5

For algorithmic considerations, convenient d-intervals are needed. For the sake of brevity, we definerestricted d-interval object but the same definitions do apply in a natural way for d-track intervals as well.

A d-interval I = (I1, I2, . . . , Id) is balanced if |I1| = |I2| = . . . = |Id|. Notice that this restriction has beenintroduce for RNA considerations [Crochemore et al., 2008]. A balanced d-interval graph is the intersectiongraph of a family of balanced d-intervals.

A d-interval I = (I1, I2, . . . , Id) is unit if it is composed of d intervals of length 1. A unit d-interval graph isthe intersection graph of a family of unit d-intervals. Clearly, unit d-interval graphs are balanced d-intervalgraphs whereas the converse is not necessarily true.

A d-interval I = (I1, I2, . . . , Id) with integer endpoints is type (l1, l2, . . . , ld) if |Ii| = li for all 1 ≤ i ≤ d.A d-interval graph type (l1, l2, . . . , ld) is the intersection graph of a family of d-intervals type (l1, l2, . . . , ld). Notice that unit d-intervals are d-intervals type (1, 1, . . . , 1) and that d-intervals type (l, l, . . . , l), l ∈ N∗,are balanced d-intervals. We can also notice that 2-interval graphs type (1, 1) are exactly line graphs: eachinterval of length 1 of the ground set can be considered as the vertex of a root graph and each 2-intervalas an edge in the root graph. This implies, for example, that the coloration problem is also NP-completefor 2-interval graphs type (2, 2) and wider classes of graphs. It is also known that the complexity of theMAXIMUM INDEPENDENT SET problem is NP-complete on 2-interval graphs type (2, 2) [Bafna et al., 1996].Recognition of 2-union graphs type (1, 2), a related class (restriction of multitrack interval graphs), has beenalso proved to be NP-complete [Halldorsson and Karlsson, 2006].

The depth of a family of d-intervals is the maximum number of intervals that share a common point. Therepresentation depth of a d-interval graph is the minimum depth of any d-interval representation of the graph.Notice that any d-interval (or d-track interval) representation of a triangle-free graph must have depth atmost 2. On the other hand, for any constant d ≥ 2, it is easy to construct a d-interval (or d-track interval)representation of depth 2 of a triangle.

1.3 Recognizing multidimensional interval graphs

In this section, we shall mostly focus on d = 2. We study some restrictions of 2-interval graphs, and theirposition in the hierarchy of graph classes as illustrated Figure 1.1.

Recognizing restricted graph classes in an ubiquitous problem in intersection graph theory, and indeedthere has been considerable interest in recognizing d-interval graphs (and related graph classes). The firstexplicit reference to this question we are aware of is in [Golumbic, 1980]. A classical result of West andShmoys [West and Shmoys, 1984] states that, for any constant d ≥ 2, recognizing d-interval graphs isNP-complete (moreover, for any constants d ≥ 2 and r ≥ 3, recognizing d-interval graphs of representationdepth at most r is also NP-complete). The class of d-track interval graphs is clearly contained in the class ofd-interval graphs. Notice that the containment is proper as the complete bipartite graph Kd2+d−1,d+1 is a

Page 21: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

8

Figure 1.1: Graph classes related to 2-interval graphs and its restrictions. A class pointing towards anotherstrictly contains it, and the dashed lines mean that there is no inclusion relationship between the two. Darkclasses correspond to classes not yet present in the ISGCI Database.

.

d-interval graph but is not a d-track interval graph [West and Shmoys, 1984]. Gyarfas and West [Gyarfas andWest, 1995] have however proved that recognizing 2-track interval graphs is NP-complete (their proof alsoimplies that, for any constant r ≥ 3, recognizing 2-track interval graphs of representation depth at most r isNP-complete). It is still an open problem (but conjectured to be true) to prove that, for any constant d ≥ 2,recognizing d-track interval graph is NP-complete . . . To be honest, the problem is not really open any longeras M. Jiang has recently communicated us a – correct as far as we can assess – NP-hardness proof for thisproblem.

Our contributions for balanced 2-interval graphs is two-fold. We have shown that the class of balanced2-interval graphs is strictly included in the class of 2-interval graphs. The rationale for this question wasconcerned with approximation: does any approximation result for balanced 2-intervals propagate to (general)2-intervals? The answer is No, unfortunately. Moreover, we have settled the complexity of recognizingbalanced 2-interval graphs.

Proposition 1.3.1 ([Gambette and Vialette, 2007]). The class of balanced 2-interval graphs is strictly included inthe class of 2-interval graphs.

Our proof is by exhibiting a 2-interval graph that has no balanced 2-interval realization (this latter pointbeing of course the hardest part of the proof). Without going into the details, the construction is by connectinga bunch of gadgets K5,3 (the complete bipartite graph K5,3 is indeed not a unit 2-interval graph) togetherwith additional vertices to enforce an unbalanced 2-interval representation. Notice that we also proved thatthe class of balanced 2-interval graphs strictly contains circular-arc graphs (see [Brandstadt et al., 1999] fordefinitions).

Proposition 1.3.2 ([Gambette and Vialette, 2007]). Recognizing balanced 2-interval graphs is NP-complete.

To prove Proposition 1.3.2, we have adapted the proofs of [West and Shmoys, 1984] and [Gyarfas andWest, 1995], and gave a reduction from the HAMILTONIAN CYCLE problem for 2-regular triangle-free graphs, aproblem which has been proved to be NP-complete in [West and Shmoys, 1984]. Moreover, it is easy enough

Page 22: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

9

to check that Gyarfas and West’s proof of NP-hardness of recognizing 2-track interval graphs [Gyarfas andWest, 1995] can be adapted, by adjusting the interval lengths in the representation (more or less as we did for2-interval graphs [Gambette and Vialette, 2007]), to show that recognizing balanced 2-track interval graphsis NP-complete as well.

As for 2-interval graphs type (l, l), we have obtained the following results.

Proposition 1.3.3 ([Gambette and Vialette, 2007]). For any l ∈ N∗, l ≥ 2, the class of 2-interval graphs type (l, l)is strictly contained in the class of 2-interval graphs type (l+ 1, l+ 1).

Proper containment of 2-interval graphs type (l, l) in 2-interval graphs type (l+ 1, l+ 1) is illustrated inFigure 1.2.

(a)

(b)

Figure 1.2: The graph K ′4 (a) is (5,5)-interval but not (4,4)-interval.

Proposition 1.3.4 ([Gambette and Vialette, 2007]).

unit 2-interval graphs =⋃l∈N∗

2-interval graphs type (l, l).

According to Proposition 1.3.4, if recognizing 2-interval graphs type (l, l) is polynomial-time solvable forany l ∈ N∗, then recognizing unit 2-interval graphs is polynomial-time solvable. This problem has not beensettled yet.

Aiming at further deciphering the precise nature of unit 2-interval graphs, we have obtained the followinginclusion between proper circular-arc graphs (circular-arc graphs such that no arc is included in another inthe representation) and 2-interval graphs (recall that circular-arc graphs are balanced 2-interval graphs butthat circular-arc graphs are not necessarily unit 2-interval graphs).

Proposition 1.3.5 ([Gambette and Vialette, 2007]). The class of proper circular-arc graphs is strictly included inthe class of unit 2-interval graphs.

Page 23: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

10

Determining the complexity of recognizing unit 2-interval graphs is still an open problem. Moregenerally, what is the complexity of recognizing d-interval graphs type (2, 2, . . . , 2)? The questionis also open for 2-interval graphs type (l, l). A first step could be to focus on 2-track intervalgraphs type (l, l). Indeed, 2-track interval graphs type (l, l) are subclasses on unit 2-intervalgraphs.

In [Gambette and Vialette, 2007], we have considered a class of graphs that generalizes quasi-line graphsand contains unit 2-interval graphs. Quasi-line graphs are those graphs whose vertices are bisimplicial, i.e.,the closed neighborhood of each vertex is the union of two cliques. This graph class has been introducedas a generalization of line graphs and as a useful subclass of claw-free graphs [Ben Rebea, 1981; Faudreeet al., 1997; Chudnovsky and Seymour, 2005; King and Reed, 2007]. Let k ∈ N∗. A graph G is all-k-simplicialif the neighborhood of each vertex u ∈ V(G) can be partitioned into at most k cliques. The class of quasi-linegraphs is thus exactly the class of all-2-simplicial graphs.

Proposition 1.3.6 ([Gambette and Vialette, 2007]). The class of unit 2-interval graphs is strictly included in theclass of all-4-simplicial graphs.

1.4 Combinatorial problems on 2-intervals

1.4.1 Introduction

Multiple-interval graphs are a natural generalization of interval graphs, and hence there is a natural interestin studying standard combinatorial problems for multiple-intervals (and multiple-interval graphs, thedistinction is important since computing a multiple-interval representation of a graph is NP-complete). Ofparticular interest, three standard graph problems, namely MINIMUM VERTEX COVER, MINIMUM DOMINATINGSET and MAXIMUM CLIQUE, for d-intervals are considered in [Butman et al., 2007]. Their results can besummarized as follows: the MINIMUM VERTEX COVER problem is approximable within ratio (2− 1/d) (aratio which equals the best known ratio for 2d1 bounded degree graphs), the MINIMUM DOMINATING SETproblem is approximable within ratio d2, and the MAXIMUM CLIQUE problem is approximable within ratio(d2d+ 1)/2.

We present here two contributions in this area. First, we discuss the 2-INTERVAL PATTERN problem whichcan be seen as a generalization of the MAXIMUM INDEPENDENT SET for 2-intervals. Standard complexity andapproximation are considered. Second, we consider parameterized issues of some natural combinatorialproblems for 2-intervals. Parameterized issues of the 2-INTERVAL PATTERN problem are part of an ongoingwork with S. Guilemot and D. Hermelin, we shall only mention them briefly.

1.4.2 The 2-INTERVAL PATTERN problem

The 2-INTERVAL PATTERN problem is concerned with finding large constrained patterns in families of 2-intervals. Given a single-stranded RNA molecule, a sequence of contiguous bases of the molecule canbe represented as an interval on a single line, and a possible pairing between two disjoint sequences canbe represented as a 2-interval, which is merely the union of two disjoint intervals. Therefore, 2-intervalrepresentation considers thus only the bonds between the bases and the pattern of the bonds, such as hairpinstructures, knots and pseudoknots. A maximum cardinality pairwise disjoint subfamily of a candidate family

Page 24: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

11

Does there exist a polynomial-time algorithm for finding a maximum cardinality clique in afamily of 2-intervals? In other words, given an family of 2-intervals F = D1, D2, . . . , Dn, is theproblem of finding a maximum cardinality subset F ′ ⊆ F of pairwise intersecting 2-intervalsin P? The MAXIMUM CLIQUE problem in its natural decision setting is NP-complete for familiesof 3-intervals [Butman et al., 2007], approximable within ratio (d2 − d + 1)/2 for families ofd-intervals (and hence within ratio 3/2 for families of 2-intervals) [Butman et al., 2007], andfixed-parameter tractable for families of d-intervals for parameters d and k (k is the size of theclique we are looking for) [Fellows et al., 2007].The MAXIMUM CLIQUE problem for families of d-intervals is (or is likely to be) strongly related tothe interval piercing number. Let F be a family of d-intervals. A piercing set for F is a set of points Pon the real line such that, for any d-intervalD ∈ F ,D∩P 6= ∅. The piercing number of F , denotedτ(F), is the size of a minimum cardinality piercing set of F . Gyarfas has proved that, for anyfamily of pairwise intersecting 2-intervals F , it holds that τ(F) ≤ 3. [Gyarfas and Lehel, 1970](see also [Gyarfas, 2003]).A stronger result holds for families of 2-track intervals and has proved to be useful. Indeed,if F is a set of pairwise intersecting 2-track interval set, then τ(F) ≤ 2 and there is a piercingset p1, p2 of F with p1 on track one and p2 on track two [Gyarfas and Lehel, 1970] (see also[Gyarfas, 2003]). Starting from this property, we can prove that the MAXIMUM CLIQUE problem for2-track intervals is in P (D. Hermelin, R. Rizzi and S. Vialette, Unpublished result). As anotherstep towards determining the complexity of the MAXIMUM CLIQUE problem for 2-intervals, weannounce the following result: The MAXIMUM CLIQUE problem for 3-track intervals is APX-hard(D. Hermelin, M. Jiang and S. Vialette, Unpublished result).

of 2-intervals restricted to certain prespecified geometrical constraints can provide useful valid approximationfor RNA secondary structure determination. Therefore, the geometric properties of 2-intervals providea possible guide for understanding the computational complexity of finding structured patterns in RNAsequences. Using a model to represent non sequential information allows us for varying restrictions on thecomplexity of the pattern structure. Indeed, two disjoint 2-intervals, i.e., two 2-intervals that do not intersectin any point, can be in precedence order (<), be allowed to nest (@) or be allowed to cross (G). Furthermore,the family of 2-intervals and the pattern can have different restrictions, e.g., all intervals have the samelength or all the intervals are disjoint. These different combinations of restrictions alter the computationalcomplexity of the problems, and need to be examined separately. This examination produces efficientalgorithms for more restrictive structured patterns, and hardness results for those less restrictive.

Let I = [a, b] be an interval on the line. Write start(I) = a and end(I) = b. A 2-interval is the union oftwo disjoint intervals defined over a single line and is thus denoted by D = (I, J), I is completely to theleft of J. Write left(D) = I and right(D) = J. Two 2-intervals D1 = (I1, J1) and D2 = (I2, J2) are said to bedisjoint (or non-intersecting) if the two 2-intervals share no common point, i.e., (I1 ∪ J1) ∩ (I2 ∪ J2) = ∅. Forsuch disjoint pairs of 2-intervals, three natural binary relations, denoted <, @ and G, are of special interest(these three relations will be recurrent in the first part of this manuscript):

• D1 < D2 (D1 precedes D2), if I1 ≺ J1 ≺ I2 ≺ J2,

• D1 @ D2 (D1 is nested in D2), if I2 ≺ I1 ≺ J1 ≺ J2, and

• D1 G D2 (D1 crosses D2), if I1 ≺ I2 ≺ J1 ≺ J2,

Page 25: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

12

where ≺ denotes the usual precedence between (1-)intervals.A pair of 2-intervals D1 and D2 is said to be R-comparable for some R ∈ <,@, G, if either D1RD2 or

D2RD1, i.e., D1 and D2 are comparable by R. Note that any two disjoint 2-intervals are R-comparable forsome R ∈ <,@, G (good, we shall not miss something). A family of pairwise disjoint 2-intervals F is said tobe typeM for someM⊆ <,@, G,M 6= ∅, if any pair of distinct 2-intervals in F is R-comparable for someR ∈ M. The non-empty subsetR is usually called a model for F . It is implicitly assumed here thatR is assmall as possible.

Given a family of 2-intervals, the 2-INTERVAL PATTERN problem asks to find in a maximum cardinalitysubset of pairwise compatible 2-intervals. In the present context, compatibility denotes the fact that any two2-intervals in the solution are (i) non-intersecting and (ii) satisfy some prespecified geometrical constraints.The 2-INTERVAL PATTERN problem is formally defined as follows.

2-INTERVAL PATTERN

• Input : A family of 2-intervals F and a modelM⊆ <,@, G.• Solution : A subfamily F ′ ⊆ F (of pairwise disjoint 2-intervals) typeM.• Measure : The size of F ′, i.e., |F ′|.

Some additional definitions are needed for further algorithmic analysis. Let F be a family of 2-intervals.The width (resp. height, depth) is the size of a maximum cardinality subset F ′ ⊆ F type < (resp @, G). Theinterleaving distance of a 2-interval Di ∈ F is defined to be the distance between the two intervals of Di, i.e.,start(right(Di)) − end(left(Di)). The total interleaving distance of the family of 2-intervals F , written L(F), isthe sum of all interleaving distances, i.e., L(F) =

∑Di∈F start(right(Di)) − end(left(Di)). The density of F ,

written d(F), is the maximum number of 2-intervals in F over a single point. Formally,

d(F) = maxx∈X(F)

D ∈ F : end(left(D) ≤ x < start(right(D)).

The structure of the set of all (simple) intervals involved in a family of 2-intervals F turns out to be ofparticular importance for algorithmic analysis of the 2-INTERVAL PATTERN problem. The interval ground setof F , denoted I(F), is the set of all intervals involved in F , i.e., I(F) = left(Di) : Di ∈ F ∪ right < (Di) :Di ∈ F . In [Vialette, 2004; Crochemore et al., 2008], we have introduced four types of interval ground sets:

• UNLIMITED: no restriction on the structure,

• BALANCED: each 2-interval Di ∈ F is composed of two intervals having the same length, i.e.,| left(Di)| = | right(Di)|,

• UNIT: the interval ground set I(F) is solely composed of unit length intervals,

• DISJOINT: no two distinct intervals in the interval ground set I(F) intersect.

Recall that family of unit 2-intervals is balanced, while the converse is not necessarily true. Furthermore,for most applications, one may assume that a family of pairwise disjoint 2-intervals is unit. Observe that inthis latter case, a family of 2-intervals reduces to a graph G equipped with a numbering of its vertices from 1to |V | (see Chapter 2 for a complete treatment of this restriction). Considering additional restrictions such as(i) bounding the width, the height or the depth of either the input family of 2-intervals or the solution subset,or (ii) bounding the interleaving distances are also of interest for practical applications.

Page 26: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

13

Before going to algorithms, I would like to take the opportunity of this manuscript to clarify onepoint about 2-intervals I never explicitly stated. This is the right place I guess. Indeed, I wasasked many times whether 2-intervals (and actually the same question would make sense forlinear graphs) would ever yield any competitive algorithm for RNA structure prediction. Theanswer is No, and actually I did not ever think the answer could be Yes. Biology is of course toocomplicated for such a simplistic solution (“Biology easily has 500 years of exciting problems to workon, it’s at that level”, D. Knuth, Computer Literacy Bookshops Interview, 1993). Nobody wouldthink that 2-intervals and linear graphs are accurate models for RNA structure, ever! There are somany parameters here, so many exceptions, so many exceptions to exceptions, . . . “Parfois, maispas toujours, oui, non, enfin parfois, a dpend, pas toujours, non, voil, pas toujours”would say Jean-PierreRousset. But this is precisely for this reason that I believe simple enough combinatorial objects areneeded to deal with specific issues involved in the big picture (I have for example in mind thealgorithmic impact of crossing structures). This was my guideline for introducing and studying2-intervals in computational biology.

1.4.3 Algorithms and complexity

Let us start with some easy observations and statements. For one, the 2-INTERVAL PATTERN problem forM = <,@, G is related to the MAXIMUM INDEPENDENT SET problem in 2-interval graphs with a given2-interval representation (recall that, as we have seen, recognizing 2-interval graphs, and hence computinga 2-interval representation, is NP-complete). For another, graphs of maximum degree ∆ are d(∆+ 1)/2e-interval graphs [Griggs and West, 1979], and hence any graph with maximum degree 3 is a 2-interval graph.Therefore, since the MAXIMUM INDEPENDENT SET problem in its natural decision setting is NP-completefor planar graphs with maximum degree 3 [Garey et al., 1976] (we note in passing that any planar graphis a 3-interval graph [Scheinerman and West, 1983]), it follows that the 2-INTERVAL PATTERN problem isNP-complete in its whole generality. This is actually not very surprising (but we know what one is lettingoneself in for).

The best complexity results for the 2-INTERVAL PATTERN problem are given in Table 1.3 for variousmodels and interval ground sets, and we shall only discuss some very specific points in the sequel.

First, the O(n log(n) + L) time algorithm of [Chen et al., 2007a] forM = @, G and disjoint intervalground set now supersedes our O(n2 log(n)) time algorithm [Blin et al., 2007c]. However, the techniques arequite comparable and are based on the following property. For a 2-interval D, define its covering interval tobe c(D) = [start(left(D)), end(right(D))], i.e., the least interval that coversD. Let F be a family of 2-intervalsand let G be the intersection graph of the intervals c(D) : D ∈ F ; G is certainly an interval graph. Moreover,any subfamily F ′ ⊆ F type @, G induces a clique in G. It is thus enough to focus on the maximal cliques ofG. But an interval graph G is a chordal graph and as such has at most |V(G)| maximal cliques [Fulkersonand Gross, 1965]. Furthermore, all the maximal cliques of a chordal graph can be found in O(n+m) time,where n = |V(G)| and m = |E(G)|, by a modification of Maximum Cardinality Search (MCS) [Tarjan andYannakakis, 1984; Blair and Peyton, 1993]. The O(n2 log(n)) time algorithm follows.

Second, if the 2-INTERVAL PATTERN problem is solvable in O(n log(n)) time for both M = < (thealgorithm is trivial) andM = @, the two algorithms actually use different geometrical objects: intervals forthe former and trapezoids [Felsner et al., 1997] for the latter (a structure which will prove extremely useful inthe sequel).

Page 27: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

14

Interval Ground Set I(F)ModelM

Unlimited, Balanced, Unit Disjoint<,@, G APX-hard [Bar-Yehuda et al., 2002] O(n

√n) [Micali and Vazirani, 1980]

<, G NP-complete [Blin et al., 2007c] NP-complete [Li and Li, 2009a]@, G APX-hard [Vialette, 2004] O(n log(n) + L) [Chen et al., 2007a]<,@ O(n log(n) + nd) [Chen et al., 2007a]< O(n log(n)) [Vialette, 2004]@ O(n log(n)) [Blin et al., 2007c]G O(n log(n) + L) [Chen et al., 2007a]

Figure 1.3: Best complexity results for the 2-INTERVAL PATTERN problem for all combinations of models andinterval ground sets. For the polynomial-time cases, n = |F |, L = L(F) and d = d(F).

Interval Ground Set I(F)ModelM

Unlimited Balanced Unit Disjoint<,@, G 4 1 [Bar-Yehuda et al., 2002] 4 2 [Crochemore et al., 2008] 3 2 [Bar-Yehuda et al., 2002] N/A@, G 4 1 [Bar-Yehuda et al., 2002] 4 3 [Crochemore et al., 2008] 3 3 [Crochemore et al., 2008] N/A<, G PTAS [Jiang, 2007b] (or effective 2 4 [Jiang, 2007a])

Figure 1.4: Performance ratios for hard instances of the 2-INTERVAL PATTERN problem (the 2-INTERVALPATTERN problem for disjoint interval ground set and modelsM = <,@, G andM = @, G is polynomial-time solvable).

1.4.4 Approximation

The best approximation ratio for the 2-INTERVAL PATTERN problem are given Figure 1.4 for various modelsand interval ground sets, and, once again, we shall only discuss specific points.

First, we paid special attention to efficient approximation algorithms, and most of the results presentedFigure 1.4 support implementation. However, the 4-approximation for unlimited 2-intervals is by linearprogramming and we are still not able to design a simple and practical approximation algorithm with thevery same performance ratio.

Second, the PTAS of Jiang [Jiang, 2007b] forM = <, G supersedes our results [Crochemore et al., 2008],i.e., an approximation ratio (i) 6 for general 2-intervals, (ii) 4 for balanced 2-intervals, (iii) 3 for unit 2-intervals,and (iv) 2 when the 2-intervals reduce to a linear graph (see next chapter). It is worth noticing that, like mostPTAS, Jiang’s algorithm does not support implementation (however, Jiang has proposed an approximationalgorithm with performance ratio 2well-suited for practical applications).

Third, we considered in [Crochemore et al., 2008] a weighted variant of the 2-INTERVAL PATTERN problem:each 2-interval is associated to a weight and the goal is to find a maximum weight subfamily of pairwisedisjoint 2-intervals with respect to a prespecified modelM. Here, one can for instance weight a 2-intervalby the total sum of the lengths of its intervals, thereby allowing more refined solutions in the biologicalapplication of the problem. We have shown in [Crochemore et al., 2008] that our results can be extended tothe weighted variant, while still maintaining the same approximation factors.

1.4.5 Parameterized complexity

We have considered in [Hermelin et al., 2009] the parameterized issues of some standard combinatorialproblems restricted to d-intervals. It is understood here that, even if the considered problems are defined forgraphs, we consider these problems on families of 2-intervals in terms of the associated 2-interval graphs.For example, the MINIMUM DOMINATING SET problem (given a graph G, find a minimum cardinality set of

Page 28: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

15

vertices V ′ ⊆ V(G) such that any vertex in V(G) \ V ′ has at least one neighbor in V ′) reduces for a family of2-intervals F to finding a subfamily of 2-intervals F ′ ⊆ F such that any 2-interval in F \ F ′ intersects atleast one 2-interval in F ′. Our results (negative and positive) can be summarized as follows.

According to Proposition 1.4.3, the 2-INTERVAL PATTERN problem forM = @, G is W[1]-hardfor its standard parameterization The parameterized complexity (standard parameterization)of the 2-INTERVAL PATTERN problem forM = <, G is more intriguing. So far, we are still notable to determine the parameterized complexity of this problem. Recall that the 2-INTERVALPATTERN problem forM = <, G has a polynomial-time approximation scheme (PTAS) [Jiang,2007b], and hence proving that it is W[1]-hard would show that, in some sense, a PTAS is the bestapproximation one can obtain (i.e., no efficient PTAS) for this problem.

Conjecture 1. The 2-INTERVAL PATTERN problem forM = <, G is W[1]-hard for its standard parame-terization.

Proposition 1.4.1 ([Hermelin et al., 2009]). The following problems are W[1]-hard for 2-interval graphs (assuminga 2-interval representation is given along with the graph):

• the MAXIMUM INDEPENDENT SET problem parameterized by the size of the solution,

• the MINIMUM DOMINATING SET problem parameterized by the size of the solution, and

• the INDEPENDENT MINIMUM DOMINATING SET problem parameterized by the size of the solution.

It is worth noticing that [Hermelin et al., 2009] has rapidly become a well-cited paper, not in reason ofProposition 1.4.1 but for the multicolored clique technique we have introduced. In a nutshell, the multicoloredclique technique allows for an almost systematic gadget-construction and helps in eliminating severaltechnical details (see [Hermelin et al., 2009] where a large portion is devoted to presenting this generaltechnique).

Proposition 1.4.2 ([Hermelin et al., 2009]). The MAXIMUM CLIQUE problem for d-intervals is fixed-parametertractable when parameterized by d and k (k is the size of the clique we are looking for).

Central in the proof of Proposition 1.4.2 is the fact a d-interval graphs with no clique of size k has a vertexof degree less than 2k. However, the algorithm is of limited practical interest due to the huge exponentialterm in the running time, i.e., O

(k2(2dkk

)). Notice, however, that Jiang has recently proposed a better a

maxdO(k), 2O(klogk) · poly(n) time algorithm for MAXIMUM CLIQUE problem for d-intervals, where n isthe number of vertices in the graph Jiang [2010]. Designing a practical fixed-parameter algorithm for theCLIQUE problem for d-intervals remains a challenging problem.

We conclude this chapter by discussing briefly parameterized issues of the 2-INTERVAL PATTERN problem.First, according to Proposition 1.4.1 forM = <,@, G is W[1]-hard for its standard parameterization. Tocomplement this result, the following proposition is announced with proof.

Proposition 1.4.3. The 2-INTERVAL PATTERN problem forM = @, G is W[1]-hard for its standard parameteriza-tion

Page 29: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

16

Page 30: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

2From linear graphs to permutations

Contents2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3 From linear graphs to permutations . . . and back . . . . . . . . . . . . . . . . . . . . . . . . . 182.4 Pattern matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.5 Finding common structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.5.2 Structured patterns type <,@ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.5.3 Structured patterns type <, G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.5.4 Structured patterns type @, G . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.5.5 Putting everything together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.5.6 Towards biologically sounding models . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.6 Separable patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.1 Introduction

This chapter is devoted to algorithmic aspects of linear graphs and is a natural follow-up of Chapter 1 aslinear graphs can be viewed as families of 2-intervals over a disjoint ground set (this was actually our initialmotivation for studying linear graphs). In Section 2.2 we set up notation and terminology. Section 2.3 isdevoted to introducing the relationship between permutations and linear matchings. The three followingsections are devoted to algorithmic considerations: In Section 2.4 we consider the pattern matching forpermutations problem, Section 2.5 is concerned with finding large common patterns in linear graphs whereasSection 2.6 aims at bringing together common patterns and permutations.

2.2 Definitions

We follow standard notations in graph theory (see for example [Diestel, 2000]). The order (resp. size) of agraph G is defined as the number of vertices (resp. edges) of G.

17

Page 31: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

18

Definition 2.2.1 (Linear graph). A linear graph of order n is a vertex-labeled graph where each vertex is labeled bya distinct label from 1, 2, . . . , n.

It is worth mentioning that we shall always assume in this chapter that a linear graph has no degree 0vertices (see also note Page 30).

A linear graph can be thus viewed as a graph with vertices embedded on the integral line, yielding atotal order amongst them. In case of linear graphs, we write an edge between vertices i and j, i < j, as thepair (i, j). By convention, if G is a linear graph, we let G[i . . . j], 1 ≤ i ≤ j ≤ |V(G)|, denote the subgraphinduced by all vertices labeled kwith i ≤ k ≤ j. Two edges of a linear graph are disjoint if they do not share acommon vertex. Of particular interest in our context are edge-disjoint linear graphs.

Definition 2.2.2 (Linear matching). A linear matching is an edge-disjoint linear graph.

A linear matching with 2n vertices (should be indeed even) has thus n edges. Similarly to 2-intervals,relative positioning of disjoint edges is of particular interest. Let e = (i, j) and e ′ = (i ′, j ′) be two disjointedges in a linear graph or a linear matching G. We write:

• e < e ′ (e precedes e ′) if i < j < i ′ < j ′,

• e @ e ′ (e is nested in e ′) if i ′ < i < j < j ′, and

• e G e ′ (e and e ′ cross) if i < i ′ < j < j ′.

Two edges e and e ′ are R-comparable, for some R ∈ <,@, G, if eRe ′ or e ′Re. For a subsetM⊆ <,@, G,M 6= ∅, edges e and e ′ are said to beM-comparable if e and e ′ are R-comparable for some R ∈ M. A set ofedges E isM-comparable if any pair of distinct edges e, e ′ ∈ E areM-comparable. A linear matching whoseedge set isM-comparable is said to be typeM. A subgraph of a linear graph G is a linear graph H whichcan be obtained from G by a series of vertex and edge deletions, where the deletion of vertex i results inremoving vertex i and all edges incident to it from the graph, and then relabeling all vertices j with j > i toj− 1. In our context, an edge-disjoint subgraph of a linear graph is also called a structured pattern.

2.3 From linear graphs to permutations . . . and back

It is folklore that linear matchings type <,@, G of order 2n are in bijection with fixed-point free (fpf)involutions, i.e., permutations of S2n with n cycles, each of length 2. The number of fpf involutions of S2nis the double factorial number (2n− 1)!! = 1 · 3 · · · (2n− 1) (see the On-Line Encyclopedia of Integer Sequencesfor references).

It is also a simple observation that linear matchings type @, G of size n are in bijection with permutationsof Sn. To see this, let us consider a linear matching G type @, G of size n. Then the vertices in G whichare left endpoints of edges are labeled 1, 2, . . . , n and the right endpoints are labeled n+ 1, n+ 2, . . . , 2n.The permutation πG corresponding to G is defined by πG(j− n) = i if and only if (i, j) ∈ E(G). Clearly, alllinear matchings type @, G have corresponding permutations, and vice versa. It follows from this bijectivecorrespondence that the number of different linear matchings type @, G of G of size n is n!. Interestinglyenough, notice that increasing subsequences in πG correspond to subgraphs type G of G, while decreasingsubsequences correspond to subgraphs type @. See Figure 2.1 for an illustration.

Both linear matchings type <,@ and <, G of order 2n are in bijection with Dyck words of length 2n(should we call a linear matching type <, G an anti-Dyck pattern?). Recall that a Dyck word of length 2nis a string consisting of n a’s and n b’s such that no initial segment of the string has more b’s than a’s (forexample, the following are the Dyck words of length 6: aaabbb, abaabba, ababab, aabbab, aababb). Thenumber of Dyck words of length 2n is n-th Catalan number Cn =

(2nn

)/(n+ 1) (the best general reference

here is [Stanley, 1999]).

Page 32: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

19

G1 2 3 4 5 6 7 8 9 5 9 4 7 6 3 2 1 8

πG = 5 9 4 7 6 3 2 1 8

1 2 3 6 7 94 5 8 9 6 3 2 15 4 7 8

Decreasing subsequence 9 6 3 2 1

1 2 3 4 6 95 7 8 9 4 6 3 2 15 7 8

Increasing subsequence 5 7 8

Figure 2.1: A linear matching G type @, G and the corresponding permutation πG = 5 9 4 7 6 3 2 1 8. Alsoillustrated is the bijective correspondence between decreasing subsequences (resp. increasing subsequences)of πG and patterns of G type @ (resp. type G).

2.4 Pattern matching

We consider in this section the pattern matching problem for linear matchings: Given a pattern (in the formof a linear matching) and a target linear matching, decide whether there is an occurrence of the motif in thetarget. We refer to this problem as the PATTERN MATCHING FOR LINEAR MATCHINGS problem. Accordingto the preceeding section, if both the pattern and the target are linear matching type @, G we are left withthe pattern matching problem for permutations (the bijection is indeed pattern-preserving). We refer to thislater problem as the PERMUTATION PATTERN problem. As we shall see, most of the difficulties in trying tosolve the PATTERN MATCHING FOR LINEAR MATCHINGS problem originate from the PERMUTATION PATTERNproblem.

Let us embed the PATTERN MATCHING FOR LINEAR MATCHINGS problem into permutations. A permuta-tion π is said to contain the pattern (shorter permutation) σ, in symbols σ π, if there exists a subsequenceof entries of π that has the same relative order as σ (alternatively, σ is involved in π). Otherwise, π avoidsσ. For example, 3215674 contains the pattern 132 since the subsequence 154 is ordered in the same wayas 132. Pattern involvement in permutations has become a very active area of research. For one, patterncontainment restrictions are often used to describe classes of permutations that are sortable under variousconditions [Knuth, 1973]. For another, a great deal of study has been devoted to counting pattern-avoidingpermutations [Bona, 2004], probably culminating in the proof of the Stanley-Wilf conjecture [Marcus and

Page 33: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

20

Tardos, 2004].

Does there exist a fixed-parameter algorithm (standard parameterization) for finding an occur-rence of a permutation σ ∈ Sk in a permutation π ∈ Sn? In other words, does there exist analgorithm for finding an occurrence of σ in π is f(k)nO(1) time, where f is an arbitrary functiondepending only on k?We thought for ages that the answer should be just No, unless FPT = W[1]. We changed our mindradically about this issue. Every attempt to design a parameterized reduction leaded us to – moreor less – the same cul-de-sac.

Conjecture 2. The pattern matching for permutations problem is fixed-parameter tractable for its standardparameterization.

We, however, do believe that proving this conjecture is quite a difficult difficult task . . . far beyondthe reach of our arms for the time being.

Given two permutations σ and π, the PERMUTATION PATTERN problem is thus to decide whether σ π(this problem is ascribed to H. Wilf in [Bose et al., 1998]). The PERMUTATION PATTERN problem is NP-hard[Bose et al., 1998] (see [Vialette, 2004] for an alternate proof), but is clearly polynomial-time solvable if σhas bounded size. Indeed, if σ has size k and π has size n, a straightforward brute-force algorithm solvesthe problem in O(nk) time. Improvements to this algorithm were presented in [Albert et al., 2001] and [?],the latter describing a O(n0.47k+o(k)) time algorithm. Also, the problem is known to be polynomial-timesolvable (in k and n) if σ is separable, i.e., σ contains neither the pattern 2413 nor 3142 [Bose et al., 1998; Ibarra,1997]. In case σ is monotone, i.e., σ = 1 . . . k or σ = k . . . 1, a nice O(n log log(n)) time algorithm is known[Hunt and Szymanski, 1977a].

Is the PERMUTATION PATTERN problem fixed-parameter tractable for its standard parameterization? Ifonly one question were to be asked in this manuscript this has to be this one. We still don’t have any answerhere. Could it be the case that the PERMUTATION PATTERN problem is a special case of a more generalproblem in FPT? Or in other words, does there exist a FPT proof for free? Since the PERMUTATION PATTERNproblem is a special case of the general PATTERN MATCHING FOR LINEAR MATCHINGS problem, one cannaturally reduces this question (in our context) to: is the PATTERN MATCHING FOR LINEAR MATCHINGSfixed-parameter tractable for its standard parameterization? We answer this latter question in the negative(unless FPT = W[1], a fairly unexpected event) by proving the following new result (observe that thisdoes not, however, rule out the existence of another simpler problem in FPT containing the PERMUTATIONPATTERN problem . . . hence the main question remains asked).

Proposition 2.4.1. The PATTERN MATCHING FOR LINEAR MATCHINGS problem is W[1]-hard for its standardparameterization, i.e, the size of the pattern we are looking for.

Here is a sketch of the proof.

Proof. We propose a parameterized reduction from the CLIQUE problem which is known to be W[1]-hardwhen parameterized by the size of the clique we are looking for [Downey and Fellows, 1999].

Let (G, k) be an arbitrary instance of the CLIQUE problem. Write V(G) = u1, u2, . . . , un and E(G) =e1, e2, . . . , em). Furthermore, let us write di for the degree of ui ∈ V(G), and for convenience let d0 = 0.For each 1 ≤ i ≤ n, write Di for d1 + d2 + . . . + di−1. Finally, for 1 ≤ i, j ≤ n, write li,j for the number of

Page 34: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

21

The following items might be useful tools towards proving fixed-parameter tractability of thePERMUTATION PATTERN problem (Join work with S. Guillemot).The all 1’s k×k binary matrix is denoted Jk. LetA = [ai,j] be am×n binary matrix. It is said to bepruned if it contains neither a all 0’s row nor a all 0’s column. For now on, we assume A is pruned.A (p/m, q/n)-partition P is (i) a partition of 1, 2, . . . ,m into p intervals R1, R2, . . . , Rp, and (ii) apartition of 1, 2, . . . , n into q intervals C1, C2, . . . , cq. The quotient of A by P, in symbols A/P,is the p× q binary matrix A/P = [a/pi,j] defined by a/pi,j = 1 if and only if ak,l = 1 for somek ∈ Ri and l ∈ Cj. Given two pruned binary matricesA and B of sizem×n and p×q, respectively,we say that B is contained in A, denoted by B A, if there exists a (p/m, q/n)-partition P suchthat B ≤ A/P (this latter notation means that a/pi,j = 1 whenever bi,j = 1 for 1 ≤ i ≤ p andA ≤ j ≤ q) .For am× n pruned binary matrix A = [ai,j], it will be convenient to define ones(A) as follows

ones(A) = (i, j) ∈ 1, 2, . . . ,m× 1, 2, . . . , n : ai,j = 1.

For any e = (i, j) ∈ ones(A) and e ′ = (i ′, j ′) ∈ ones(A), define the distance between e and e ′,denoted dA(e, e ′), by dA(e, e ′) = max|i− i ′|, |j− j ′|. The matrix A is k-locally-dense if there existsdistinct e, e ′ ∈ ones(A) such that dA(e, e ′) ≤ k (by convention A is k-dense if | ones(A)| ≤ 1).Define the local-density of A, simply denoted d(A), to be the minimum k for which A is k-locally-dense.

Conjecture 3. For any permutation matrix π, if d(π) ≥ k then Jk π.

Although at first odd, no counter-example has yet been found. Notice that the above conjectureholds for k = 2 since non-separable permutations contain 3142 or 3142 and hence does containJ2. It also holds for |π| = k2 (details omitted). Finally, observe that, if True, the above conjecturewould imply D(π) ≤ ∆(π), where D(π) = maxd(σ) : σ pi and ∆(π) is the maximum k forwhich Jk π holds.

neighbors ux of ui such that x < j, i.e.,

li,j = |ux : ui, ux ∈ E(G) ∧ x < j|.

We construct the corresponding instance (Gtarget, Gpattern) of the PATTERN MATCHING FOR LINEAR MATCH-INGS problem as follows. The linear matching Gtarget has order 4n + 8m + 4 and its edge set E(Gtarget) isdefined by

E(Gtarget) = E1target ∪ E2target ∪ E3target ∪ E4target ∪ E5target ∪ E6target

Page 35: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

22

where

E1target = (i, 4m+ 4n+ 4i+ 1) : 1 ≤ i ≤ m

E2target = (i+ 1, 4m+ 4n+ 4i+ 4) : 1 ≤ i ≤ m

E3target = (2m+ 2i− 1, 2m+ 2n+ 2i+Di + 1) : 1 ≤ i ≤ n

E4target = (2m+ 2i, 2m+ 2n+ 2i+Di+1 + 2) : 1 ≤ i ≤ n

E5target = (2m+ 2n+ 1, 2m+ 2n+ 2), (4m+ 4n+ 3, 4m+ 4n+ 4).

Let us now describe E6target which is the only part of Gtarget that depends on the input (apart from m andn of course). We add two edges to E6target for each edge of G. More precisely, if ei = up, uq, p < q,is an edge of G, we add to E6target the two edges (2m + 2n + 2i + Dp + lp,q + 2, 4m + 4n + 4i + 2) and(2m+ 2n+ 2i+Dp + lq,p + 2, 4m+ 4n+ 4i+ 2). This completes the construction of Gtarget.

We now turn to constructing Gpattern. The linear matching Gpattern has order 4k2 + 4 (depending solely ofk, good!) and its edge set E(Gtarget) is defined by

E(Gpattern) = E1pattern ∪ E2pattern ∪ E3pattern ∪ E4pattern ∪ E5pattern ∪ E6pattern

using the very same construction as for Gtarget but considering as input the complete graph Kk on k vertices(and 1

2k(k− 1) edges) instead of G.

It can be proved that there is an occurrence of Gpattern in Gtarget if and only if G has a clique of size k. Onedirection is trivial (by construction). For the other direction, we only mention that the following observationis crucial for correctness (and for reducing the proof to a sequence of easy steps): the two edges of E5pattern

must match the two edges of E5target. Indeed, a careful observation of Gpattern shows that, for any two edgesei, ej ∈ E(Gpattern), ei < ej if and only if ei = (2m+2n+1, 2m+2n+2) and ej = (4m+4n+3, 4m+4n+4).Similarly, for any two edges ei, ej ∈ E(Gtarget), ei < ej if and only if ei = (2m + 2n + 1, 2m + 2n + 2) andej = (4m+4n+3, 4m+4n+4). This property allows us to draw the following crucial property: for 1 ≤ i ≤ 6,all edges of Eipattern are matched to edges in Eitarget. From this point, the rest of the proof is just a sequence ofeasy readings of the construction.

In the light of the present situation (the parameterized complexity of the PERMUTATION PATTERNfor its natural parameterization is still open), we have considered in [Guillemot and Vialette, 2009] thePERMUTATION PATTERN problem in case σ (possibly σ and π) avoids a pattern of length 3. Recall that Knuthproved in [Knuth, 1998] that for all six of the patterns of length 3 it is true that the number of permutationsof size n that avoid the pattern is the Catalan number Cn =

(2nn

)/(n + 1). First, it is easy to see that the

PERMUTATION PATTERN problem is polynomial-time solvable if the pattern σ avoids 132, 312, 213 or 231since σ is clearly separable in this case. Monotone patterns, i.e., 123 and 321, however, deserve separateconsideration (we focus here on 321-avoiding permutations but if a permutation avoids 123 then its reverseavoids 321). The rest of this section is devoted to presenting our results.

First, combining ordered forest embeddings with labeled DAG morphisms, we have shown that thePERMUTATION PATTERN problem is polynomial-time solvable if both π and σ are 321-avoiding.

Proposition 2.4.2 ([Guillemot and Vialette, 2009]). In case both π and σ are 321-avoiding, the PERMUTATIONPATTERN problem is solvable in O(k2n6) time, where k = |σ| and n = |π|.

Second, if we relax the problem to only one 321-avoiding permutation (and it has to be σ of course), wehave obtained the following result.

Proposition 2.4.3 ([Guillemot and Vialette, 2009]). In case only σ (the pattern) is 321-avoiding, the PERMUTATION

PATTERN problem is solvable in O(kn4√k+12) time, where k = |σ| and n = |π|.

Page 36: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

23

Notice that the above proposition does not settle the complexity of the PERMUTATION PATTERN problemin case only σ is 321-avoiding. In [Guillemot and Vialette, 2009], we have conjectured this problem to beNP-complete. Unfortunately (or fortunately, we were not wrong!) this conjecture is true as shown in thefollowing proposition (definitively too long proof omitted).

Proposition 2.4.4. The PERMUTATION PATTERN problem is NP-complete even if σ is 321-avoiding.

We close this section by mentioning a generalization of the PERMUTATION PATTERN problem that may beof independent interest. The c-COLORED PERMUTATION PATTERN problem is defined as follows: given twopermutations σ and π, and a stair decompositionD of σ (see [Atkinson et al., 2005] and [Guillemot and Vialette,2009] for details), where σ and π are c-colored permutations (i.e., a color in 1, 2, . . . , c is associated to eachpoint of the permutation), find a color-preserving embedding of σ into π. We have obtained the followingresult (recall that the Exponential-Time Hypothesis (ETH) is the assumption that the 3-SAT problem cannotbe solved in 2o(n) time, where n is the number of variables).

Proposition 2.4.5 ([Guillemot and Vialette, 2009]). The 2-COLORED PERMUTATION PATTERN problem parame-terized by k is W[1]-hard and cannot be solved in no(

√k) time assuming ETH.

We also refer the reader to [Guillemot and Vialette, 2009] for WNL-hardness issues of Proposition 2.4.5(the parameterized class WNL was introduced in [Guillemot, 2008] to capture the parameterized complexityof problems solvable by k-dimensional dynamic programming).

2.5 Finding common structures

2.5.1 Introduction

This section is devoted to finding common structures in linear graphs and linear matchings. We begin bypresenting this problem in its original setting. RNA and proteins exhibit a three-dimensional structurethat determines most of their functionality. This three dimensional structure can be modeled (at theprice of simplifications!) in two dimensions by a linear graphs. The corresponding structure-similarity orstructure-prediction problems that arise in such contexts usually translate to finding common linear matchingsubgraphs, or common structured patterns, that occur in a family of general linear graphs. Examples of suchproblems are the LONGEST COMMON SUBSEQUENCE problem [Hirschberg, 1977; Hunt and Szymanski,1977b], the MAXIMUM COMMON ORDERED TREE INCLUSION problem [Alonso and Schott, 1993; Chung, 1998;Kilpelainen and Mannila, 1995], the ARC-PRESERVING SUBSEQUENCE problem [Blin et al., 2005b; Evans,1999c; Gramm et al., 2002], and the MAXIMUM CONTACT MAP OVERLAP problem [Goldman et al., 1999]problem (more on this in the perspective note Page 30). A general framework for such problems is known asthe e MAXIMUM COMMON STRUCTURED PATTERN (MCSP) problem.

The MCSP problem was originally introduced (under a different name) by Davydov and Batzoglou[Davydov and Batzoglou, 2006] in the context of non-coding RNA secondary structure prediction via multiplestructural alignment. There, an RNA sequence of n nucleotides is represented by a linear graph with nvertices, and an edge connects two vertices if and only if their corresponding nucleotides are complementary.A family of linear graphs is then used to represent a family of functionally-related RNAs, and a commonstructured pattern in such a family is considered to be a putative common secondary structure element ofthe family.

The MAXIMUM COMMON STRUCTURED PATTERN (MCSP) problem is formally defined as follows (seeFigure 2.2 for an illustrative example).

Page 37: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

24

(a) Linear graph G1

(b) Linear graph G2

(c) Linear graph G3

(d) Linear graph G4

e1 e2 e3 e4 e5

(e) Linear matching Gsol

Figure 2.2: Four linear graphs G1, G2, G3 and G4 and a common structured pattern (depicted as Gsol. Theoccurrence of the structured pattern Gsol in each graph is emphasized in bold. Edges e2, e3, e4 and e5 arenested in edge e1 ; edges e2 and e3 precede edge e5 ; edge e2 precedes edge e4 and crosses edge e3, whileedge e3 crosses both edges e2 and e4.

MCSP

• Input : A family of linear graphs G = G1, G2, . . . , Gn and a non-empty subsetM⊆ <,@, G.• Solution : A common structured pattern Gsol typeM of G, i.e., a linear matching typeM thatoccurs in each input linear graph of G.• Measure : The size of Gsol, i.e., |E(Gsol)|.

Page 38: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

25

It will be convenient to introduce some special linear matching. A linear matching type < (resp. @, G)is called a sequence (resp. tower, staircase). Define the width (resp. height, depth) of a linear graph to be the sizeof a maximum cardinality sequence (resp. tower, staircase) subgraph of the graph. A linear matching type<,@ with the additional property that any two maximal towers in it do not share an edge is called a sequenceof towers . Similarly, a linear matching type <, G is a sequence of staircases if any two maximal staircases donot share an edge. A tower of staircases is a linear matching type @, G where any pair of maximal staircasesdo not share an edge, and a staircase of towers is a comparable linear matching type @, G where any pair ofmaximal towers do not share an edge. A sequence of towers (resp. sequence of staircases, tower of staircases,staircase of towers) is balanced if all of its maximal towers (resp. staircases, staircases, towers) are of equalsize. Figure 2.3 illustrates an example of the above types of linear graphs.

(a) A <,@-structured pattern of width 4and height 4

(b) A <, G-structured pattern of width 4and depth 4.

(c) A @, G-structured pattern of height 6and depth 3

(d) A sequence of towers of width 5 andheight 3.

(e) A sequence of balanced towers of width3 and height 3.

(f) A sequence of staircases of width 4 anddepth 4.

(g) A sequence of balanced staircases ofwidth 3 and depth 3.

(h) A tower of staircases of height 4 anddepth 3.

(i) A tower of balanced staircases of height3 and depth 3.

(j) A staircase of towers of height 3 anddepth 4.

(k) A staircase of balanced towers of height3 and depth 3.

Figure 2.3: Some restricted structured patterns. Edges are drawn above or below the vertices with noparticular signification.

The MCSP problem is relatively easy for simpleM. Indeed, it is solvable in O(nm) time forM = <[Gupta et al., 1982], in O(nm log log(m)) time forM = @ [Whang and Wang, 1992], and in O(nm1.5) timeforM = G [Tiskin, 2006], where n = |G| andm = maxG∈G |E(G)|.

We briefly review some related results. Valiente gave a dynamic programming algorithm for finding alargest nested linear graph that occurs in two nested linear graphs [Lozano and Valiente, 2004] (see also[Zhang and Shasha, 1989]). In a totally different context, Felsner et al. considered the matching problemregardless of precise pattern definition and proved that given a linear graph G of sizem, a maximum size

Page 39: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

26

nested subgraph of G can be found in O(m2) time [Felsner et al., 1997]. The general problem of finding amaximum size edge-independent subgraph of G is the well-known maximum matching problem [Diestel,2000].

2.5.2 Structured patterns type <,@

Of particular importance in the context of computational molecular biology, is the fact that the MCSPproblem for structured patterns type <,@ has been shown to be NP-complete [Davydov and Batzoglou,2006]. We have strengthen this result in a drastic way.

Proposition 2.5.1 ([Kubica et al., 2006]). The MCSP problem for structured patterns type <,@ is NP-hard evenif each input linear matching is a sequence of towers of height at most 2.

Notice, however, that we have proved the MCSP problem to be polynomial-time solvable in case thenumber of input linear graphs is a fixed integer [Kubica et al., 2006]. As for the approximation, the MCSPproblem for structured patterns type <,@ was proved to be approximable with ratio O(log2(k)) [Davydovand Batzoglou, 2006], where k is the size of an optimal solution. We have improved this result in [Kubicaet al., 2006]

Proposition 2.5.2 ([Kubica et al., 2006]). The MCSP problem for structured patterns type <,@ is approximablewithin ratio O(logk) in O(nm2) time, where k is the size of an optimal solution, n = |G|, and m is the maximumsize of any linear graph in G.

2.5.3 Structured patterns type <, G

Focusing onM = <, G, we have obtained the following results (the first one is a straightforward conse-quence of Proposition 2.5.1 whereas the second one requires two ingredients: (i) any structured patterntype <, G contains a sequence of staircases of substantial size and (ii) any sequence of staircases contains abalanced subgraph of substantial size, details omitted).

Proposition 2.5.3 ([Fertin et al., 2007]). The MCSP problem for structured patterns type <, G is NP-hard even ifeach input linear graph is a sequence of staircases of depth at most 2.

Notice that a recent result in [Li and Li, 2009b] implies that the MCSP problem for structured patternstype <, G is hard even if G consists of only two linear graphs. However, the input linear graphs used in [Liand Li, 2009b] are of unlimited structure, unlike Proposition 2.5.3. Interestingly enough, the case |G| = 1 hasbeen recently proved to be NP-hard [Li and Li, 2009b].

Proposition 2.5.4 ([Fertin et al., 2007]). The MCSP problem for structured patterns type <, G is approximablewithin ratio 2H(k) inO(nm3.5 log(m)) time, where k is the size of an optimal solution, n = |G|,m = maxG∈G |E(G)|,and H(k) =

∑ki=1 1/i is the k-th harmonic number.

2.5.4 Structured patterns type @, G

Not surprisingly, the MCSP problem for structured patterns type @, G is hard even for quite simple instance.

Proposition 2.5.5 ([Fertin et al., 2007]). The MCSP problem for structured patterns type @, G is NP-hard even ifeach input linear graph is a tower of staircases of depth at most 2. The same result applies for staircases of towers

As the reader might have guessed, the above result is an easy consequence of Proposition 2.5.1. Asobserved before, this case in strongly related to pattern matching for permutations. The well-known Erdos-Szekeres Theorem [Erdos and Szekeres, 1935] states that any permutation of Sk contains either an increasing

Page 40: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

27

Proposition 2.5.1 raises the following entertaining problem. Let A = [ai,j] be a m × n matrixwith non-negative entries. Define a run r in A to be a mapping r : m → n. The weight of therun r is defined by ω(r) = minai,r(i) : 1 ≤ i ≤ m. Let r1 and r2 be two runs of A. The run r1precedes the run r2, in symbols r1 < r2, if r1(i) < r2(i) for all 1 ≤ i ≤ m. We consider the problemdefined as follows: Given a m × n matrix with non-negative entries, find a sequence of runsr1 < r2 < . . . < rk with maximum total weight

∑ki=1ω(ri). Observe that the number of runs in

the solution is not part of the input, one is only interested in maximizing the total weight of thesolution. Below is an illustration for a 5× 6 matrix with three runs r1 < r2 < r3 of total weightω(r1) +ω(r2) +ω(r3) = 2+ 2+ 5 = 9.

r1 r2 r3

4 1 3 2 6 0

2 1 4 2 7 9

4 3 1 6 1 2

3 9 2 5 7 1

3 3 3 2 1 5

According to Proposition 2.5.1, the above problem is NP-complete even if every entry of thematrix is one of the integers 0, 1 and 2 (each row denotes a sequence of tower and each entrydenotes the height of a tower). Moreover, it is easy to show that the problem is polynomial-timesolvable if every entry of the matrix is one of the integers 0 and 1 (Proposition 2.5.1 is tight). Howapproximable is the general problem? Is it APX-hard? We would be very surprised if the answerto this latter question for a fixed number of distinct non-negative integers was Yes. In other words,we believe a PTAS exists for this special case. Notice that the straighforward greedy algorithm(repeatedly select a maximum weight run) does not yield any approximation result. To see this,consider them×mmatrix A = [ai,j] defined by ai,j = 2 if i = j and ai,j = 1 otherwise.

According to [Li and Li, 2009b], the PATTERN MATCHING FOR LINEAR MATCHINGS problem isNP-complete is the pattern is type <, G. What about the complexity if both the target graph andthe pattern are linear matchings type <, G?

Conjecture 4. Finding an occurrence of a linear matching type <, G in another linear matching type<, G is polynomial-time solvable.

Page 41: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

28

or a decreasing subsequence of size at least√k. It is worth noticing that extremal Erdos-Szekeres (EES)

permutations, i.e., permutations that do not contain monotone subsequences longer than√k, are known

to exist (for example, there are 4 EES permutations of length 4: 2 1 4 3, 2 4 1 3, 3 1 4 2 and 3 4 1 2). Combiningthis with algorithms for finding a largest common structured pattern type @ or G, we have obtained thefollowing result.

Proposition 2.5.6 ([Fertin et al., 2007]). The MCSP problem for modelM = @, G is approximable within ratio√k in O(nm1.5) time, where k is the size of an optimal solution n = |G|, andm = maxG∈G |E(G)|.

How tight Proposition 2.5.6 is? Central in Proposition 2.5.6 is the use of family of patterns to probe theinput. Can we use more complicated families of patterns to improve the approximation (notice that thereis tradeoff to be made here, the family should be large enough, but only polynomially large if we want topossibly consider each member)? Unfortunately, the answer is negative. For k ∈ N, let Πk ⊆ Sk be a set of|Πk| permutations on k elements. Each permutation in Πk can be equivalently regarded as a linear matchingtype @, G. Alon recently communicated us a proof of essentially the following lemma.

Lemma 2.5.7 (N. Alon, Private communication). For every family of permutationsΠk ⊆ Sk, k ∈ N and |Πk| ≤ 2k,there exists a permutation π ∈ SK, K = Ω(k2), which avoids all permutations in Πk.

Notice that Alon’s lemma shows that there exists a linear matching type @, G of size K = Ω(k2) whichdoes not contain any linear matching type @, G out of a family of at most 2k such graphs. Hence, evenusing more involved or interesting families of linear matchings type @, G to be used to probe our inputgraphs, no approximation guarantee better than O(

√k) for maximum common structured patterns type

@, G can be possibly achieved.

2.5.5 Putting everything together

We now consider structured patterns type <,@, G. We gave in [Fertin et al., 2007] three approximationalgorithms with increasing time complexities but decreasing approximation ratios. Roughly speaking, thesethree algorithms rely on sufficiently large sub-patterns that occur in any structured pattern type <,@, G,and the fact that finding maximum common structured patterns of these types is polynomial-time solvable.

Our first approximation algorithm is based on Dilworth’s Theorem [Dilworth, 1950] and uses thefollowing simple structure lemma.

Lemma 2.5.8 ([Fertin et al., 2007]). Let G be a linear matching type <,@, G of size k. Then H contains a simple(i.e.,M = <,M = @ orM = G) structured pattern of size at least k1/3.

It is easily seen, however, that Lemma 2.5.8 is tight. One way to obtain an extremal example of this is asfollows: take k1/3 balanced towers of staircases, each one of depth k1/3 and height k1/3, and concatenatethem one next to the other into one supergraph of size k, reassigning labels accordingly. Combining the abovelemma with the fact that a maximum common simple structured pattern of G can be found in O(nm1.5)time, we have obtained the following approximation algorithm for general structured patterns.

Proposition 2.5.9 ([Fertin et al., 2007]). The MCSP problem for patterns type <,@, G is approximable withinratio O(k2/3) in O(nm1.5) time, where k is the size of an optimal solution, n = |G|, andm = maxG∈G |E(G)|.

Our second approximation algorithm is based on [Kostochka, 1988] (see also [Kostochka and Kratocvil,1997]). It uses the following structure lemma.

Lemma 2.5.10 ([Fertin et al., 2007]). Let G be a linear matching type <,@, G of size k. Then G contains a subgraphof sizeΩ

(√k/ log(k)

)which is either type <,@ or type G.

Page 42: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

29

Combining the above lemma with an algorithm for finding a maximum structured pattern type G andthe O(log(n))-approximation algorithm for structured patterns type <,@, we have obtained the followingresult.

Proposition 2.5.11 ([Fertin et al., 2007]). The MCSP problem for structured patterns type <,@, G is approximable

within ratio O(√k log2(k)) in O(nm2) time, where k is the size of an optimal solution, n = |G|, and m =

maxG∈G |E(G)|.

We now turn to our third approximation algorithm. It uses the following structure lemma.

Lemma 2.5.12 ([Fertin et al., 2007]). Let G be a linear matching type <,@, G of size k. Then G contains either atower or a balanced sequence of staircases of sizeΩ

(√k/ log(k)

).

Combining the above lemma with algorithms for finding a maximum common tower and a balancedsequence of staircases in G (see [Fertin et al., 2007] for details) we have obtained the following results.

Proposition 2.5.13 ([Fertin et al., 2007]). The MCSP problem for structured patterns type <,@, G is approximablewithin ratio O(

√k log(k)) in O(nm3.5 logm) time, where k is the size of an optimal solution, n = |G|, and

m = maxG∈G |E(G)|.

It remains a challenging problem to improve the approximation ratio for the MCSP problem for struc-tured patterns type <,@, G.

Let us consider here subgraphs of linear matchings typeM for someM⊆ <,@, G with |M| = 2.Prove or disprove the following (we would go for prove): Any linear matching type <,@, G ofsize k contains either a type <,@ subgraph of size k2/3, a type <, G subgraph of size k2/3, or a type@, G subgraph of size k2/3. Notice that, unfortunately, true or false, this cannot be applied forapproximation purposes (approximating the MCSP problem for patterns type @, G definitivelyremains the bottleneck).Let us put this problem in perspective. For one, we have shown in [Fertin et al., 2007] that anylinear matching type <,@, G of size k contains a subgraph of size ε k2/3, where ε = (

√17−1)/8 ≈

0.39, which is either type <,@, type <, G, or type @, G. For another, let k be an integer suchthat k1/3 is an integer. A simple construction shows that there exists a linear matching type<,@, G of size k that contains neither a subgraph type <,@, nor a subgraph type <, G, nora subgraph type @, G of size least ε k2/3 for any ε > 1. Indeed, assuming the contrary, thenany linear matching type <,@, G of size k contains a subgraph with at least

√ε k2/3 = ε1/2 k1/3

edges which is either type <, type @, or type G. A patent contradiction since it can be proved(see [Fertin et al., 2007]) that there exists a linear matching type <,@, G of size k that does notcontain a simple structured pattern of size ε k1/3 for any ε > 1.

2.5.6 Towards biologically sounding models

As we have observed in [Herrbach and Vialette, 2005], the MCSP problem does not completely succeedin accurately modeling RNA structures. To this end, we have proposed in [Herrbach and Vialette, 2005] toconsider the MCSP problem together with a simple RNA stacking-pair scoring scheme.

Page 43: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

30

A contact map is a useful graph-theoretic abstraction (and two-dimensional depiction) of thestructure of a protein. For a protein of size n, and a given threshold ε, the contact map Mε =[mε(i, j)] is an n × n 0–1 matrix whose entry mε(i, j) equals 1 if the distance between aminoacids i and j is less than or equal to ε, and 0 otherwise [Goldman et al., 1999]. Of particularimportance in our context, the contact map can also be viewed as a Hamiltonian path (usuallydepicted horizontally) with nodes representing the amin acids and with edges added that joinpairs of nodes whose centers of gravity have been found to be closer to each other than a fixedthreshold ε. Contact maps are thus linear graphs. Contact maps have been used for secondarystructure prediction, fold assignment, protein structure alignment, and threading (see [Goldmanet al., 1999] and references therein). A Contact map overlap is a measure of similarity of proteinstructures based on maximum size common subgraphs in contact maps) [Goldman et al., 1999].In their pioneered work, Goldman et al. [Goldman et al., 1999] have laid the foundations of thealgorithmic issues of contact map overlap: the general problem is NP-complete and approximablewithin a constant ratio for some restricted but interesting special cases. Also, they have introducesspecial structures (stacks, queues and staircases) that are of particular importance for proteins. Anotable breakthrough in algorithmic aspects of contact map overlap is a recent paper of Xu et al.[Xu et al., 2007] where some preliminary fixed-parameter algorithms are presented. Below aresome lines of research we plan to explore.

• Of particular importance, there exists a O(n6) time algorithm for finding the maximumoverlap of two degree 2 contact maps, one of which is either a stack or a staircase [Goldmanet al., 1999]. However, due to the high degree of complexity, it is not practical. Improvingthe time complexity of this problem is a challenging and important problem.

• It is intuitively clear – and widely accepted – that contact maps of real proteins are far frombeing arbitrary collections of edges since they have a specialized structure reflecting thegeometry of proteins. To this end, Goldman et al. [Goldman et al., 1999] have introduced aspecial class of contact maps (self-avoiding walk on the 2D grid) that seem to be a long waytoward capturing this structure (the problem of computing the maximum overlap remains,however, hard for self-avoiding walks on the 2D grid . . . not a real surprise as self-avoidingwalks do capture the complexity of real instances). Interesting questions include:

Computing the maximum overlap between two self-avoiding walks on the 2D gridis known to be approximable within ratio 4. Improving this ratio is crucial to bridgethe theory–practice gap. A promising line of research could be to improve the decom-position of self-avoiding walks on the 2D grid (it is only known that a self-avoidingwalk can be decomposed into 2 stacks and 1 queue). Notice that it is not even knownwhether this problem is APX-hard (self-avoiding walks on the 2D grid enjoy somedegree of planarity).

Parameterized issues of self-avoiding walks on the 2D grid are completely unexplored.Obtaining efficient fixed-parameter algorithms would be of particular interest forpractical perspectives.

It would be of particular interest to discover favorable properties of 3-dimensionalself-avoid walks. Current approaches seem inherently 2-dimensional in that theyexploit topological properties of the plane. These aspects (properties, algorithmicissues, . . . ) are completely unexplored yet.

Page 44: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

31

If we assume that the secondary structure of an RNA contains no pseudoknots, the secondary structurecan be decomposed into a few types of loops: stacking pairs, hairpins, bulges, internal loops and multipleloops [Waterman, 1995]. A stacking-pair is a loop formed by two pairs of consecutive bases (i, j) and(i+ 1, j− 1). By definition, a stacking-pair contains no unpaired bases, and any other kinds of loops containone or more unpaired bases. Since unpaired bases are destabilizing and have positive free energy, stakingpairs are the only type of loops that have negative free energy and stabilize the secondary structure, see[Ieong et al., 2003] for a nice application of the stacking-pair scoring scheme to RNA secondary structures.We need some new easy definitions.

Definition 2.5.14 (SP scoring scheme). Let G be a linear matching type <,@. The SP-score of G, denoted SP(G),is defined by

SP(G) = |(i, j) : i+ 1 < j− 1 ∧ (i, j) ∈ E(G) ∧ (i+ 1, j− 1) ∈ E(G)|.

Definition 2.5.15 (SP-Trim linear matching). A linear matching graph G type <,@ is called a SP-trim nestedlinear graph provided that SP(G) > SP(G[E(G) − e]) for all e ∈ E(G).

The MCSP for pseudoknot-free RNA under the stacking-pair scoring scheme can be rephrased as follows.

MCSP-SP

• Input : A family of linear graphs G = G1, G2, . . . , Gn.• Solution : A common SP-Trim pattern Gsol type <,@ of G, i.e., a SP-Trim pattern Gsol type<,@ that occurs in each input linear graph of G.• Measure : The SP-score of Gsol, i.e., SP(Gsol).

We briefly review the results we have obtained for the MCSP-SP problem. Not surprisingly (in the lightof Proposition 2.5.1), the MCSP-SP problem is computationally hard even for quite simple instances (noticethat Proposition 2.5.1 does not apply here as it is concerned with towers of height 1 or 2).

Proposition 2.5.16 ([Herrbach and Vialette, 2005]). The MCSP-SP problem – in its natural decision form – isNP-complete even if G is composed of SP-trim linear matchings type <,@.

The above proposition may be contrasted with the following positive result.

Proposition 2.5.17 ([Herrbach and Vialette, 2005]). The MCSP-SP problem is solvable inO(4kk√knm4 log(m))

time, where n = |G|, m = max|E(Gi) : Gi ∈ G, and k is the SP-score of the sought common SP-Trim structuredpattern type <,@.

The proof of Proposition 2.5.17 is by enumeration and dynamic programming. The following result,well-suited for fixed |G|, complements Proposition 2.5.17.

Proposition 2.5.18 ([Herrbach and Vialette, 2005]). The MCSP-SP problem is solvable in O(m2n logn−1(mn))time, where n = |G|, m = max|E(Gi)| : Gi ∈ G, and k is the SP-score of the sought common SP-Trim structuredpattern type <,@.

Interestingly enough, the proof of Proposition 2.5.18 is by a combination of high-dimensional trapezoidsdiagrams and high-dimensional trapezoids graphs [Felsner et al., 1997]. Crucial in our algorithm is aprocedure to compute a maximum weighted disjoint subset of high-dimensional trapezoids (in terms ofdisjoint induced closed polygons).

According to Proposition 2.5.18, the MCSP-SP problem is polynomial-time solvable for fixed |G|. Fur-thermore, as we have seen, the MCSP-SP problem is NP-complete even if G is composed of linear matchingstype <,@. Therefore, there is very little hope that a polynomial-time algorithm exists for this restricted case.However, this raises the important question of whether there exists a more efficient algorithm, althoughexponential in n, for fixed |G| in case G is composed of linear matchings type <,@. The answer is positive.

Page 45: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

32

Proposition 2.5.19 ([Herrbach and Vialette, 2005]). The MCSP-SP problem for linear matchings type <,@ issolvable in O(nm2n) time, where n = |G| andm = max|E(Gi) : Gi ∈ G.

As the reader might have guessed, there is strong relationships between tree alignment problems and theMCSP-SP problem in case G is composed of linear matchings type <,@, and this similarity is indeed atthe heart of Proposition 2.5.19. First, observe that a linear matching type <,@ can easily be mapped to anordered tree structure. Second, grouping together stacking edges, we can assume that the ordered trees arevertex weighted by the number of stacking edges, i.e., thickness. The goal is thus clearly to find a commonhomeomorphic ordered subtree with additional constraints on the weights (see [Herrbach and Vialette, 2005]for details).

2.6 Separable patterns

The preceding sections were concerned with two problems: (i) searching for an occurrence of a permutationin another one and (ii) finding a maximum size common pattern in a collection of linear graphs. But, as wehave observed, a permutation is nothing but a special linear graph, and hence there is a natural interestin combining the two above-mentioned problems, i.e., finding a maximum size common permutation in acollection of permutation. But the general problem of finding an occurrence of a permutation in anotherone is NP-complete [Bose et al., 1998], and hence, for algorithmic purposes, we need to restrict ourselves toclasses of permutations for which the PERMUTATION PROBLEM problem is polynomial-time solvable. Wefocus in this section on separable permutations (this is actually not really a choice as, as far as we are aware of,separable permutations constitute the only non-trivial class of permutations for which the PERMUTATIONPATTERN problem is polynomial-time solvable).

Recall that a permutation is separable if it contains neither the subpattern 3142 nor 2413 [Bose et al.,1998; Ibarra, 1997] and that the PERMUTATION PATTERN problem is polynomial-time solvable for separablepatterns. There are actually numerous characterizations of separable permutations, for example in terms ofpermutation graphs [Bose et al., 1998], of interval decomposition [Bui-Xuan et al., 2005; Bergeron et al., 2005;Rossin and Bouvel, 2006], or with ad-hoc structures [Bose et al., 1998]. Our results can be stated as follows.

We are not aware of any finitely based pattern-avoiding permutation classes C for which therecognition problem (i.e., π ∈ C?) is NP-hard. This remarks motivates the following problem: Isthe problem of finding a maximum size C-pattern in a collection of n permutations polynomial-time solvable for any finitely based C? If not, exhibit such a class C for which this problem isNP-hard. As far as we know, this problem is completely open.

Proposition 2.6.1 ([Bouvel et al., 2007]). The problem of finding a common maximum size separable permutation ina collection ofm permutations, each of size at most n, is solvable in O(n6m+1) time.

Being exponential in the number of permutation is roughly the best one can obtain as shown in thefollowing proposition (we cannot exclude, however, a O(nO(

√m) time algorithm).

Proposition 2.6.2 ([Bouvel et al., 2007]). The problem of finding a common maximum size separable permutation ina collection of permutations is NP-complete, even if each input permutation is separable.

Page 46: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

33

We end this section with a first attempt to address the following question: How much finding a commonmaximum size separable permutation does help when looking for a common maximum size (general) patternin a collection of permutations. Put it in our context, is a common maximum size separable permutation agood approximation of a common maximum size permutation? We have obtained the following – somewhatnegative – general result.

Proposition 2.6.3 ([Bouvel et al., 2007]). Let Π be a set of permutations and C be any pattern avoiding permutationclass. Furthermore, let k (resp. kC) be the maximum size of a common pattern (resp. maximum size of a common Cpattern) in Π. Then, k/kC ≤

√k, and the inequality is tight.

In other words, the problem of finding a common maximum size pattern in a collection of permutationscannot be approximated within a performance ratio better than

√opt by the problem of finding a common

maximum size C-pattern, where opt is the size of an optimal solution and C is any pattern-avoidingpermutation class.

Page 47: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

34

Page 48: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

3Arc-annotated sequences

Contents3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3 Maximum common patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.4 Pattern matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.5 Extending the standard model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.1 Introduction

Structure comparison for RNA has become a central computational problem bearing many computer sciencechallenging questions. Indeed, RNA secondary structure comparison is essential for (i) identification ofhighly conserved structures during evolution (which cannot always be detected in the primary sequence,since it is often unpreserved) which suggest a significant common function for the studied RNA molecules,(ii) RNA classification of various species (phylogeny), (iii) RNA folding prediction by considering a set ofalready known secondary structures and (iv) identification of a consensus structure and consequently of acommon role for molecules.

From an algorithmic point of view, RNA structure comparison was first considered in the frameworkof ordered trees [Shasha and Zhang, 1989]. More recently, it has also been considered in the framework ofarc-annotated sequences [Evans, 1999c]. An arc-annotated sequence is a pair (u, P) where u is a sequence ofRNA bases and P represents hydrogen bonds between pairs of elements of u. From a purely combinatorialpoint of view, arc-annotated sequences are a natural extension of simple sequences. However, using arcs formodeling non-sequential information together with restrictions on the relative positioning of arcs allow forvarying restrictions on the structure of arc-annotated sequences.

Different pattern matching and motif search problems have been considered in the context of arc-annotated sequences among which we can mention the Longest Arc-Annotated Subsequence (LAPCS)problem, the Arc Preserving Subsequence (APS) problem, the Maximum Arc-Preserving Common Subse-quence (MAPCS) problem, and the Edit-distance for arc-annotated sequence (EDIT) problem [Evans, 1999a;Jiang et al., 2000b; Alber et al., 2004; Gramm et al., 2006]. For an up-to-date survey of this area we refer thereader to our chapter [Blin et al., 2010a].

35

Page 49: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

36

This chapter is devoted to presenting our algorithmic results for arc-annotated based problems. It isorganized as follows. Section 3.2 presents some preliminaries. Section 3.3 deals with the problem of findinga common arc-annotated sequence (the LAPCS problem) whereas Section 3.4 is concerned with patternmatching issues in arc-annotated sequences. In Section 3.5, we extend the standard model with RNAapplications in mind.

3.2 Definitions

Definition 3.2.1 (Arc-annotated sequence). An arc-annotated sequence over alphabet A is a pair (u, P), whereu (the sequence) is a string over A∗ and P (the annotation) is a set of arcs (i, j) : 1 ≤ i < j ≤ |u|.

Notice that even if these objects are described in terms of arcs, the orientation is not relevant and weare actually concerned with edges (but we follow the standard terminology here). In the context of RNAstructures, we have A = A,C,G,U, and u and P represent the nucleotide sequence and the hydrogenbonds of the RNA structure, respectively. Characters in u are thus often referred to as bases. A letter of u issaid to be free if it is no incident to an arc of P (this point is crucial if one compare arc-annotated sequenceswith linear graphs as the latters do not allow for free vertices). Two arcs of P are independent if they do notshare a vertex.

Definition 3.2.2 (Occurrence). Let (u, P) and (v,Q) be two arc-annotated sequences. The arc-annotated sequence(v,Q) occurs in (u, P) if (v,Q) can be obtained from (u, P) by letter deletions.

Notice that the above definition does not allow for edge deletion, i.e., a a does not occurs in a b a .The definition 3.2.2 is illustrated Figure 3.1.

(u, p) = a b c d b a c a c a d c b

(v, q) = b c a c d

Figure 3.1: Occurrence of an arc-annotated sequence in another arc-annotated sequence.

Again, the relative positioning of arcs is of particular importance for arc-annotated sequences. Followingthe example of 2-intervals and linear graphs, this relative positioning is completely described by three binaryrelations: (i) the precedence, denoted <, (ii) the inclusion, denoted @, and (iii) the crossing, denoted G. For thesake of consistency, we choose to adopt this general framework for presenting our results.

Definition 3.2.3. Let (u, P) be an arc-annotated sequence, and (i, j) and (k, l) be two independent arcs of P. Thebinary relations <, @ and G are defined as follows:

Page 50: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

37

• precedence: (i, j) < (k, l) if and only if j < k,

• inclusion: (k, l) @ (i, j) if and only if i < k and l < j, and

• crossing: (i, j) G (k, l) if and only if i < k < j < l.

The notations (k, l) @ (i, j) and (i, j) A (k, l), and (i, j) < (k, l) and (k, l) > (i, j) are of course equivalent(note, however, that there does not exist an equivalent for the relation G). The binary relations <, @ et G areillustrated Figure 3.2.

(i, j) < (k, l) ui uj uk ul

(k, l) @ (i, j) ui uk ul uj

(i, j) G (k, l) ui uk uj ul

Figure 3.2: The binary relations <, G, G.

For the sake of presentation, for two arcs (i, j), (k, l) ∈ p, we write

• (i, j) ∼< (k, l) if and only if (i, j) < (k, l) or (i, j) > (k, l),

• (i, j) ∼@ (k, l) if and only if (i, j) @ (k, l) or (i, j) A (k, l), and

• (i, j) ∼G (k, l) if and only if (i, j) G (k, l) or (k, l) G (i, j).

In her pioneering work [Evans, 1999a], Evans has introduced a five level hierarchy for arc-annotatedsequences that is described as follows

• UNLIMITED: No restriction.

• CROSSING: for any two arcs (i, j), (k, l) ∈ p, either (i, j) ∼< (k, l), (i, j) ∼@ (k, l), or (i, j) ∼G (k, l).

• NESTED: for any two arcs (i, j), (k, l) ∈ p, either (i, j) ∼< (k, l) or (i, j) ∼@ (k, l).

• CHAIN: for any two arcs (i, j), (k, l) ∈ p, (i, j) ∼< (k, l).

• PLAIN: No arc, i.e., p = ∅.

The PLAIN level thus corresponds to sequences in the usual sense. This hierarchy is clearly organizedaccording to the following chain of inclusions:

PLAIN ⊂ CHAIN ⊂ NESTED ⊂ CROSSING ⊂ UNLIMITED.

The hierarchy introduced by Evans is clearly incomplete (with respect to the combinatorics induced bythe three binary relations <, @ and G). In particular, in the context of RNA secondary structures, the abovehierarchy does not allow to precisely describe stems. To this end, a first refinement of the NESTED level hasbeen proposed [Guignon et al., 2005]: for any two arcs (i, j), (k, l) ∈ p, (i, j) ∼@ (k, l).

Extending our works on 2-intervals and linear graphs, we have proposed in [Blin et al., 2005a] a clearunified framework for arc-annotated sequences. While we claim no novelty at all, we do believe this general

Page 51: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

38

PLAIN

Type <Type G Type @

Type <, G Type @, G Type <,@

Type <,@, G

UNLIMITED

Figure 3.3: The refined arc-annotated sequences hierarchy.

framework allows us for varying restrictions in a clear and precise way. We give an outline of this approach.LetM ⊆ <,@, G,M 6= ∅. An arc-annotated sequence (u, P) is typeM if for any two arcs (i, j), (k, l) ∈ p,there exists a binary relation R ∈ M such that (i, j) ∼R (k, l). According to this definition, CROSSINGcorresponds to type <,@, G, NESTED correspond to type <,@, and PLAIN is type <. If we define thePLAIN level as the class of all arc-annotated sequences with at most one arc, we are left with the refinedhierarchy given Figure 3.3. It is this hierarchy we have chosen to adopt in the sequel.

3.3 Maximum common patterns

Evans has introduced in [Evans, 1999b] the natural extension of longest common subsequences [Bergrothet al., 2000; Crochemore et al., 2007] to arc-annotated sequences. This problem is known as the LAPCS(LONGEST ARC-PRESERVING COMMON SUBSEQUENCE) problem.

LAPCS

• Input : Two arc-annotated sequences (u, p) and (v,Q).• Solution : An arc-annotated sequence (w,R) that occurs in both (u, P) and (v,Q).• Measure : The number of letters of (w,R), i.e., |w|.

For two subsetsM,M ′ ∈ <,@, G,M 6= ∅,M ′ 6= ∅, we let LAPCS(M,M ′) stand for the LAPCS problemwhere (u, P) (resp. (v,Q)) is an arc-annotated sequence typeM (resp.M ′).

The complexity (standard, approximation and parameterized) has been studied in numerous papersand manuscripts [Evans, 1999b,c; Jiang et al., 2000b; Lin et al., 2002; Guo, 2002; Hamel et al., 2010]. Evans

Page 52: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

39

[Evans, 1999b] was the first to prove that the LAPCS(<,@, G, PLAIN) problem is NP-complete. On thepositive side, she proved that the LAPCS(<, <) problem is polynomial-time solvable. Jiang et al. [Jianget al., 2000b] have proposed aO(nm3) time algorithm for both the LAPCS(<,@, <) and LAPCS(<, <)problems. Of particular importance for pseudoknot-free RNA secondary structures, Lin et al. [Lin et al., 2002]have proved (such a clever reduction!) that the LAPCS(<,@, <,@) problem is NP-complete even if (u, P)and (v,Q) are built over a one letter alphabet. This latter result leaves the possibility of a polynomial-timealgorithm for the LAPCS(@, @) problem computational biologists are particularly interested in (see also[Guignon et al., 2005]). Unfortunately (and surprisingly I would say . . . I used to think that the problem waspolynomial-time solvable), our recent result rules out such a positive issue.

Proposition 3.3.1 ([Hamel et al., 2010]). The LAPCS(@, @) problem is NP-complete.

The complexity of the LAPCS(@, @) problem remains, however, open if the two arc-annotatedsequences are built over a fixed-size alphabet (quite a natural restriction for practical applications). Never-theless, we conjecture that the LAPCS(@, @) problem remains NP-complete for a fixed-size alphabet (wealso conjecture that it would not be a piece of cake to prove hardness of this restriction).

It is shown in [Alber et al., 2002] that a simple enumerative brute-force algorithm solves theLAPCS(<,@, <,@) in O((3|A|)l n l) time, where l is the length of the common subsequenceand |A| is the size of the underlying alphabet. Central in this approach is a dynamic programmingalgorithm [Gramm et al., 2006] that determines, given two arc-annotated sequences (u, P) and(v, q)), in O(|u| |v|) time whether (v,Q) is an arc-preserving subsequence of (u, v).Clearly, the above presented result only leads to an efficient exact algorithm if parameter l(subsequence length) is small. To get round this limitation, a more involved – and probablymore practical – algorithm in presented in [Alber et al., 2002] to determine in O(3.31k1+k2 n)time whether an arc-preserving common subsequence can be obtained by deleting (together withincident arcs) k1 letters from (u, v) and k2 from (v,Q). It is a challenging problem to adapt thissearch tree based algorithm or to develop a new approach for the NP-complete LAPCS(@, @)problem. This problem deserves deep consideration.

3.4 Pattern matching

The APS (ARC-PRESERVING SUBSEQUENCE) problem is the natural extension of the usual pattern matching[Crochemore et al., 2007] to arc-annotated sequences.

APS

• Input : Two arc-annotated sequences (u, p) and (v,Q).

• Question : Does there exist an occurrence of (u, P) in (v,Q)?

Notice that, oppositely to the LAPCS problem, the APS problem is a pure decision problem. Again, for twosetsM,M ′ ∈ <,@, G,M 6= ∅,M ′ 6= ∅, we let APS(M,M ′) stand for the APS problem where (u, P) (resp.(v,Q)) is an arc-annotated sequence typeM (resp.M ′).

Page 53: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

40

Guo [Guo, 2002] has shown that the APS(<,@, G, <) problem is NP-complete. He has also observed in[Gramm et al., 2002, 2006] that the hardness of both the APS(<,@, G, <,@, G) and APS(UNLIMITED, PLAIN)problems are direct consequences of Evans’ works on the LAPCS problem [Evans, 1999b]. Algorithms withO(nm) and O(n+m) running times, n = |u| etm = |v|, are described in [Gramm et al., 2002, 2006] for theAPS(<,@, <,@) and APS(<, PLAIN) problem, respectively.

Trying to precisely confine the intractability of the APS problem, quite arduous polynomial reductionshave allowed us to refine and complete the results of Guo [Guo, 2002].

Proposition 3.4.1 ([Blin et al., 2005a]). The APS(@, G, PLAIN) and the APS(<, G, PLAIN) problems areNP-complete.

In other words, using the binary relation G in conjunction with < or @ is enough to get intractability. Thisnegative result is confirmed by the following easy proposition that shows that the binary relation G alonedoes not result in intractability.

Proposition 3.4.2 ([Blin et al., 2005a]). The APS(G, G) problem is solvable in O(nm2) time, where n = |u| andm = |v|.

3.5 Extending the standard model

Whereas clearly defined, one may naturally argues that the objective function of the LAPCS problemconsiders letters only. If this definition makes sense, undoubtedly, for the standard pattern matchingframework, one may reasonably doubt about the adequacy and the accuracy of this model for RNA wherethe weight of the arcs cannot be neglected. On that account, we have proposed in [Blin et al., 2007a] a simpleextension of the LAPCS problem, referred hereafter as the MAPCS (MAXIMUM ARC-PRESERVING COMMONSUBSEQUENCE) problem, where the objective function is concerned with both the number of letters and thenumber of arcs in a solution.

MAPCS

• Input : Two arc-annotated sequences (u, p) and (v,Q) built over alphabet A, and functionsf : A→ N∗ and g : A2 → N∗.• Solution : An arc-annotated sequence (w,R) that occurs both in (u, P) and in (v,Q).•Measure : The score s((w, r)) =

∑a∈w f(a)+

∑(i,j)∈r g(w[i], w[j]) of the arc-annotated sequence

(w,R).

Once again, for any two setsM,M ′ ∈ <,@, G,M 6= ∅,M ′ 6= ∅, we simply let MAPS(M,M ′) stand forthe MAPS problem where (u, P) (resp. (v,Q)) is an arc-annotated sequence typeM (resp.M ′).

Observe that the LAPCS problem is nothing but the MAPCS problem for zero everywhere g, i.e.,g((x, y)) = 0 for all (x, y) ∈ A2, and hence all negative results of the LAPCS problem directly propagate tothe MAPCS problem. Focusing on zero everywhere f, i.e., f(x) = 0 for all x ∈ A, results in a more interestingproblem that deserves separate consideration. Our positive and negative contributions are given in thefollowing propositions (the first two are concerned with zero everywhere f).

Not surprisingly, the MAPCS problem is hard for simple instance.

Proposition 3.5.1 ([Blin et al., 2007a]). The MAPCS(<,@, <,@) and MAPCS(<,@, G, PLAIN) problemsare NP-complete.

Proposition 3.5.2 ([Blin et al., 2007a]). The MAPCS(<,@, G, <,@, G) problem is NP-complete even for zeroeverywhere f.

Page 54: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

41

The two following propositions give some positive results.

Proposition 3.5.3 ([Blin et al., 2007a]). For zero everywhere f, the MAPCS(<,@, <,@), MAPCS(<,@, <),and MAPCS(<, <) problems are solvable in O(n2m2), O(m2, n) and O(nm) time, respectively, where n = |u|andm = |v|.

Proposition 3.5.4 ([Blin et al., 2007a]). The MAPCS(<,@, <) and MAPCS (<, <) problems are solvablein O(nm3) and O(nm) time, respectively, where n = |u| andm = |v|.

Page 55: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

42

Page 56: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Part II

Pattern Matching in Graphs

43

Page 57: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular
Page 58: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Introduction

High-throughput analysis makes possible the study of protein-protein interactions at a genome-wise scale[Gavin et al., 2002; Ho et al., 2002; Uetz et al., 2000], and comparative analysis tries to determine the extent towhich protein networks are conserved among species. Indeed, mounting evidence suggests that proteinsthat function together in a pathway or a structural complex are likely to evolve in a correlated fashion, and,during evolution, all such functionally linked proteins tend to be either preserved or eliminated in a newspecies [Pellegrini et al., 1999].

Protein interactions identified on a genome-wide scale are commonly visualized as protein interactiongraphs, where proteins are vertices and interactions are edges [Titz et al., 2004]. Experimentally derivedinteraction networks can be extremely complex, so that it is a challenging problem to extract biologicalfunctions or pathways from them. However, biological systems are hierarchically organized into functionalmodules. Several methods have been proposed for identifying functional modules in protein-proteininteraction graphs. As observed in [Pereira-Leal et al., 2004], cluster analysis is an obvious choice ofmethodology for the extraction of functional modules from protein interaction networks. Comparativeanalysis of protein-protein interaction graphs aims at finding complexes that are common to differentspecies. Kelley et al. [Kelley et al., 2003] developed the program PathBlast, which aligns two protein-proteininteraction graphs combining topology and sequence similarity. Sharan et al. [Sharan et al., 2004] studied theconservation of complexes (they focused on dense, clique-like interaction patterns) that are conserved inSaccharomyces cerevisae (a species of budding yeast) and Helicobacter pylori (a gram-negative, microaerophilicbacterium that infects various areas of the stomach and duodenum), and found 11 significantly conservedcomplexes (several of these complexes match very well with prior experimental knowledge on complexesin yeast only). They actually recasted the problem of searching for conserved complexes as a problem ofsearching for heavy subgraphs in an edge- and node-weighted graph, whose vertices are orthologous proteinpairs. A promising computational framework for alignment and comparison of more than one proteinnetwork together with a three-way alignment of the protein-protein interaction networks of Caenorhabditiselegans, Drosophila melanogaster and Saccharomyces cerevisae is presented in [Sharan et al., 2005].

This part is devoted to graph-based algorithmic aspects of this topic. We have divided our presentationinto three chapters. Chapter 4 is devoted to graph homomorphisms-like aspects. The rationale for thisresearch is that graph-homomorphisms do preserve adjacencies and hence are a natural choice for patternmatching problems in biological networks (as long as injectivity is also required!). Chapter 5 is concernedwith a more recent view of graph motifs in biological networks. Here, topology is of lesser importance butthe functionalities of network nodes (expressed by colors) form the governing principle. Finally, we presentin Chapter 6 our contribution to a somewhat more classical view of pattern matching in biological networks.

45

Page 59: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

46

Page 60: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

4Pattern matching in graphs

Contents4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434.3 Finding exact occurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.3.1 Polynomial cases, hardness and coping with hardness . . . . . . . . . . . . . . . . . . 444.3.2 The corresponding number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.4 Approximate occurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464.5 Replacing lists by colors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.1 Introduction

We consider in this chapter two edge-preserving pattern matching problems in graphs (one being a restrictionof the other). Common to these two problems are the fact that each vertex of the motif (given in the form of agraph) is allowed to match to only few vertices of the target graph. First, we shall consider the case whereeach vertex of the motif is associated with the list of vertices of the target graph it is allowed to match. Noticethat we shall only discuss about “lists”whereas we actually mean “sets”as order is not relevant here . . . butmost – not to say all – references in this area use the term list. Our interest in this problem will be for fixedcardinality lists. Second, we shall consider a natural restriction on lists: any two intersecting lists are equal.It will be more convenient to use colors instead of lists for this particular problem.

This chapter is organized as follows. Section 4.2 presents some preliminaries. Section 4.3 is devoted tolist graph matching and we consider in Section 4.4 approximate occurrences. Section 4.5 is concerned withthe relaxation to colors.

4.2 Definitions

A graph homomorphism θ from a graph G to a graph H, written θ : G → H, is a mapping θ : V(G) → V(H)from the vertex set of G to the vertex set of H such that u, v ∈ E(G) implies θ(u), θ(v) ∈ E(H). A graphhomomorphism is thus a mapping between two graphs that respects their structure; more concretely it mapsadjacent vertices to adjacent vertices. If θ : G→ H, G is said to be homomorphic to H or H-colorable. Indeed,

47

Page 61: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

48

in terms of graph coloring, k-colorings of G are precisely homomorphisms θ : G → Kk, where Kk is thecomplete graph with k vertices. As a consequence if G→ H, the chromatic number of G is at most that of H.The best general reference is [Hell and Nesetril, 2004]. If the homomorphism θ : G→ H is a bijection whoseinverse function is also a graph homomorphism, then θ is a graph isomorphism. Determining whether thereis an isomorphism between two graphs is an important (but hard!) problem in computational complexitytheory (see [Garey and Johnson, 1979]).

Given graphs G and H, and lists L(u) ⊆ V(H), u ∈ V(G), a list homomorphism of G to Hwith respect tothe lists L(u), u ∈ V(G), is an homomorphism θ : G→ H such that θ(u) ∈ L(u) for all u ∈ V(G). By abuseof notation, for any v ∈ V(H), we let L−1(v) stand for u ∈ V(G) : v ∈ L(u). Recall that the degree of a vertexδ(u) is the number of vertices adjacent to u and that the degree of G is ∆(G) = maxδ(u) : u ∈ V(G). Agraph is regular of degree ∆ or ∆-regular if the degree of all vertices equal ∆.

4.3 Finding exact occurrences

4.3.1 Polynomial cases, hardness and coping with hardness

The problem we are interested in is formally defined as follows.

Exact-(µG, µH)-Matching

• Input : Two graphs G and H, and the lists L(u) ⊆ V(H), u ∈ V(G), such that (i) max|L(u)| :u ∈ V(G) ≤ µG and (ii) max|L−1(v)| : v ∈ V(H) ≤ µH.

• Question : Does there exist an injective list homomorphism θ : G→ H with respect to the listsL(u), u ∈ V(G)?

Clearly, we may assume |L(u)| > 0 for all u ∈ V(G), and |L−1(v)| > 0 for all ∈ V(H) (a trivial clean-upprocedure would apply otherwise). For now on, unless explicitly stated, we assume µG and µH to befixed constant. Indeed, observe that for unbounded µG and µH, the EXACT-(µH, µG)-MATCHING problemtrivially contains the CLIQUE problem, and hence is NP-complete [Garey and Johnson, 1979]. Furthermore,to avoid heavy notations, we will let n andm stand for the number of vertices and the number of edges of G,respectively, and p and q stand for the number of vertices and the number of edges of H, respectively.

As sketched above, most on our interest in the EXACT-(µH, µG)-MATCHING problem is concerned withsmall (and actually fixed) µG and µH. We have proved in [Fagnot et al., 2008] that the problem we areinterested in is polynomial-time for µG ≤ 2.

Proposition 4.3.1 ([Fagnot et al., 2008]). The EXACT-(2, µH)-MATCHING is solvable in O(n3 + q) time. Thisreduces to O(n+ q) time if µH = O(1).

Notice that the counting problem associated to EXACT-(2, µH)-MATCHING is #P-complete and is solvablein O(1.3247n) time [Fagnot et al., 2008]. We have completed the above proposition in [Fertin et al., 2009b].

Proposition 4.3.2 ([Fertin et al., 2009b]). The EXACT-(µG, 1)-MATCHING is solvable in linear time for ∆(G) ≤ 2,for any constant µG.

One may argue, however, that the above proposition is too constrained to be of interest (each vertexu ∈ V(G) has private vertices in H and G is collection of paths and cycles). Unfortunately, despite thesimplicity of Proposition 4.3.2, the result is quite tight - taking into consideration both ∆(G) and ∆(H) - asshown in the following proposition that summarize our negative results.

Page 62: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

49

Proposition 4.3.3 ([Fagnot et al., 2008; Fertin et al., 2009b]). The following problems are NP-completes:

• the EXACT-(3, 2)-MATCHING problem for ∆(G) ≤ 1 and ∆(H) ≤ 2,

• the EXACT-(3, 1)-MATCHING problem for bipartite G and H, and

• the EXACT-(3, 1)-MATCHING for ∆(G) ≤ 3 and ∆(H) ≤ 4.

In the light of the negative results presented in Proposition 4.3.3, a substantial part of our work presented[Fagnot et al., 2008] was devoted to coping with hardness by means of exponential-time algorithms.

Proposition 4.3.4 ([Fagnot et al., 2008]). The EXACT-(µG, µH)-MATCHING problem is solvable

• in O(1.1889n) time and exponential space,

• in O(1.2025n) time and polynomial-space,

• in O(1.2388n+m) time, and

• in (2− 2/(µG + 1))n time within a polynomial factor.

Parameterized complexity issues of the EXACT-(µG, µH)-MATCHING problem have been initiated in[Fagnot et al., 2004] and further investigated in [Fagnot et al., 2008]. We have considered two naturalparameters: (i) the number of ambiguous vertices in G (those vertices that can match different vertices in H),and (ii) an objective with respect to a weight function. It turns out that the first parameterization yields tofixed-parameter tractability whereas the second yields to parameterized intractability.

Proposition 4.3.5 ([Fagnot et al., 2008]). The EXACT-(µG, µH)-MATCHING problem is solvable inO(k (µG)k (n+m)) time, where k = |u ∈ V(G) : |L(u)| > 1|.

Proposition 4.3.6 ([Fagnot et al., 2008]). Let G and H be two graphs, L(u) ⊆ V(H), u ∈ V(G) be lists, andω : (V(G)×V(H)) → N+ be a scoring such that ω(u, v) > 0 only if v ∈ L(u). Deciding whether there exists aninjective homomorphism θ : G→ H with respect to the lists L(u), u ∈ V(G), such that

∑u∈V(G)ω(u, θ(u)) ≥ k is

a W[1]-hard problem with respect to parameter k.

In other words, under a reasonable and commonly accepted complexity hypothesis Downey and Fellows[1999], there does not exist an algorithm exponential in k only to determine whether there exists an injectivehomomorphism θ : G→ Hwith respect to the lists L(u), u ∈ V(G), such that

∑u∈V(G)ω(u, θ(u)) ≥ k It is

worth noticing that Proposition 4.3.6 holds even ifω(u, v) ∈ 0, 1 for all (u, v) ∈ (V(G)×V(H)).

4.3.2 The corresponding number

Aiming at separating Yes instances from possibly No instances (and hence speeding-up our algorithms forsome special instances), we have introduced in [Fertin et al., 2009b] the correspondence number C(G,H,L) ofany instance of the EXACT-(µG, 1)-MATCHING problem. It is defined as follow:

C(G,H,L) = min|u ′, v ′ : u ′ ∈ L(u) ∧ v ′ ∈ L(v) ∧ u ′, v ′ ∈ E(H)|

|L(u)| |L(v)|: u, v ∈ E(G)

.

For now on, we assume that, for each edge u, v ∈ E(G), there exists an edge u ′, v ′ ∈ E(H) such thatu ′ ∈ L(u) and v ′ ∈ L(v) (see [Fertin et al., 2009b] for details). The rationale for introducing the correspondingnumber C(G,H,L) stems from the following observations. For one (µG)

−2 ≤ C(G,H,L) ≤ 1. For another,if C(G,H,L) = 1, then there exists an injective homomorphism θ : G → H with respect to the lists L(u),u ∈ V(G). Indeed, any injective mapping of G to H with respect to the lists L(u), u ∈ V(G), is a solution

Page 63: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

50

(recall that µH = 1). Ideally, one would like to determine a bound c∗ as small as possible, µG−2 < c∗ < 1,such that if C(G,H,L) > c∗ then (G,H,L) is a Yes instance and if C(G,H,L) ≤ c∗ then (G,H,L) is possiblya No instance. Unfortunately, we did not succeed in obtaining such a precise bound and we have thusfocused in [Fertin et al., 2009b] on the determination of two bounds clow and cup, clow ≤ cup, such that ifC(G,H,L) > cup then (G,H,L) is a Yes instance, and if C(G,H,L) ≤ clow then (G,H,L) is possibly a Noinstance. Of course, the smaller cup and cup − clow are, the better our estimation is.

Proposition 4.3.7 ([Fertin et al., 2009b]). Let (G,H,L) be any instance of the EXACT-(µG, 1)-MATCHING problem.If C(G,H,L) > 2∆(G)−1−e−1

2∆(G)−1 then there exists an injective homomorphism θ of G to H with respect to the lists L(u),u ∈ V(G). If C(G,H,L) ≤ ∆(G)−1

∆(G) then an injective homomorphism θ of G to H with respect to the lists L(u),u ∈ V(G), might not exist.

The upper-bound is by the Lovasz local lemma [Erdos and Lovasz, 1975]. According to this bound, if∆(G) = 1 (resp. ∆(G) = 2, ∆(G) = 3) and C(G,H,L) > 0.633 (resp. C(G,H,L) > 0.878, C(G,H,L) > 0.927)then there exists an injective homomorphism θ ofG toHwith respect to the listsL(u). As for the lower-bound,for any d > 1, we provided a generic construction of an instance (G,H,L) of the EXACT-(µG, 1)-MATCHING

problem with ∆(G) = d and C(G,H,L) ≤ ∆(G)−1∆(G) for which there does not exist an injective homomorphism

θ of G to Hwith respect to the lists L(u), u ∈ V(G).

Subsection 4.3.2 is concerned exclusively with the EXACT-(µG, 1)-MATCHING problem. The ratio-nale for considering µh = 1 is that the problem of finding an injective homomorphism θ : G→ Hwith respect to the lists L(u), u ∈ V(G), enjoys some “degree of independence”. Indeed, for anydistinct u, v ∈ V(G), we have θ(u) 6= θ(v) in any solution θ since L(u) ∩ L(v) = ∅. ExtendingProposition 4.3.7 to any instance of the EXACT-(µG, µH)-MATCHING problem would be of particu-lar interest. In particular, does there exist a constant c∗ (possibly depending on µG and µH) suchthat if C(G,H,L) > c∗ then there exists an injective homomorphism θ of G to Hwith respect tothe lists L(u), u ∈ V(G)? Most of these issues are completely unexplored.

4.4 Approximate occurrences

Requiring an injective homomorphism, i.e., an injective mapping that preserves all edges of G, might resultin an over-constrained problem, though it may exist good approximate solutions, i.e., solutions that matchmany but not all edges of G. This remark is just common sense for practical considerations. We consideredin [Fertin et al., 2009b] one possible approach to deal with approximate occurrences (see also upcomingSection 4.5 for a – from our point of view – more practical approach). We refer to this optimization problemas the MAX-(µG, µH)-MATCHING problem.

Page 64: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

51

Max-(µG, µH)-Matching

• Input : Two graphs G and H, and the lists L(u) ⊆ V(H), u ∈ V(G), such that (i) max|L(u)| :u ∈ V(G) ≤ µG and (ii) max|L−1(v)| : v ∈ V(H) ≤ µH.•Solution : A mapping θ : V(G)→ V(H) with respect to the listsL(u), u ∈ V(G), i.e., θ(u) ∈ L(u)for all u ∈ V(G).• Measure : The number of edges conserved by θ, i.e., |u, v ∈ E(G) : θ(u), θ(v) ∈ E(H)|.

Notice that for the MAX-(µG, µH)-MATCHING problem the solution mapping θmay not be (and in generalis not) an injective graph homomorphism as it is not required to preserve all edges. Being a natural but mererestriction of the EXACT-(µG, µH)-MATCHING problem, the MAX-(µG, µH)-MATCHING problem inherits ofall the negative results of it (see Section 4.3). Therefore, we focus on approximation and (unfortunately) onhardness of approximation. Indeed (and not surprisingly, I admit it), as we have shown in [Fertin et al.,2009b], turning the pure decision EXACT-(µG, µH)-MATCHING problem into an optimization one results in aharder problem (considering parameters µG and µH).

Proposition 4.4.1 ([Fertin et al., 2009b]). The MAX-(1, 2)-MATCHING is APX-hard even if G and H are bipartitegraphs with ∆(G) ≤ 3 and ∆(H) ≤ 2.

The above proposition gains in interest if we compare it with Proposition 4.3.1 and Proposition 4.3.2.Actually, it is an immediate consequence of Proposition 4.4.1 that the MAX-(1, 2)-MATCHING problem for∆(G) = 3 and ∆(H) = 2 (resp. ∆(G) = 6 and ∆(H) = 5) is not approximable within ratio 1.0005 (resp.1.0014), unless P = NP.

Proposition 4.4.2 ([Fertin et al., 2009b]). The MAX-(µG, 1)-MATCHING problem is approximable within ratio2 d3∆(G)/5e if ∆(G) is even and within ratio 2 d(3∆(G) + 2)/5e if ∆(G) is odd.

Actually, we have shown a somewhat stronger result in [Fertin et al., 2009b]: if the linear arboricityconjecture (see [Akiyama et al., 1981]) is true, then the MAX-(µG, 1)-MATCHING is approximable within ratio∆(G) + 2 if ∆(G) is even and within ratio ∆(G) + 1 if ∆(G) is odd, for any ∆(H) and any fixed µG. Noticethat the linear arboricity conjecture has been shown to be asymptotically correct as d→∞ [Alon, 1988]

Using a straightforward application of the probabilistic method [Alon and Spencer, 1992] – a powerfultool for demonstrating the existence of combinatorial objects – we gave in [Fertin et al., 2005] a linear-timerandomized (µG)

2-approximation algorithm for the MAX-(µG, 1)-MATCHING problem. We have improvedthis result in [Fertin et al., 2009b].

Proposition 4.4.3 ([Fertin et al., 2009b]). There exists a randomized 2µG-approximation algorithm for the MAX-(µG, 1)-MATCHING problem, for any µG.

We close this section by discussing exponential issues of the MAX-(µG, 1)-MATCHING problem. For anyinstance (G,H,L) of the MAX-(µG, 1)-MATCHING problem, we have shown in [Fertin et al., 2009b] that onemay construct in polynomial-time a (unfortunately complicated) graph I[G,H,L], called the incompatibilitygraph of the instance, that satisfies the following properties:

1. there exists an injective mapping θ : V(G)→ V(H) with respect to the lists L(u), u ∈ V(G), such that|u, v ∈ E(G) : θ(u), θ(v) ∈ E(H)| ≥ k if and only if the stability number of I[G,H,L] is at least k,and

2. ∆(I[G,H,L]) ≤ (µG − 1)(2µG ∆(G) − µG + 1).

Combining these properties, we have obtained the following result.

Page 65: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

52

Proposition 4.4.4 ([Fertin et al., 2009b]). The MAX-(µG, 1)-MATCHING problem is solvable in O(m (D + 1)k)time, wherem is the number of edges of G and D = ∆(I[G,H,L]).

Notice that D is fixed as long as ∆(G), µG and µH are fixed. Therefore, the MAX-(µG, 1)-MATCHINGproblem is fixed-parameter tractable for parameter “number of conserved edges”.

As the reader may have noticed, the approximation of the general MAX-(µG, µH)-MATCHINGproblem is almost completely unexplored. Indeed, we are still not be able to tackle the caseµH > 1. As noticed in the headache note Page 46, in case µH = 1, any injective mappingθ : G→ Hwith respect to the lists L(u), u ∈ V(G), enjoys some “degree of independence” that we douse for approximation design. Overcoming this difficulty remains a totally open but challengingproblem.

4.5 Replacing lists by colors

We consider in this section a restriction of the MAX-(µG, µH)-MATCHING problem well-suited for bettermodeling specific applications. Indeed, one may argue that using lists L(u), u ∈ V(G), to represent theputative correspondences is not restrictive enough for most practical applications (although allowing alarge degree of freedom in the design!). One very important objection is that one may reasonably asks forL(u) = L(v) as soon as L(u) ∩ L(v) 6= ∅ as a golden rule. This objection becomes evident in the context ofprotein-protein interaction networks where it is folklore to construct the putative correspondences (the lists)by (i) (BLAST) comparing the sequences two by two, (ii) adjusting a cutoff to construct a correspondencegraph, and finally (iii) computing the connected components of the correspondence graph. See [Brevieret al., 2007] and [Brevier et al., 2009]. This additional constraint is better taken into account by usingcolored-vertices instead of the lists L(u), u ∈ V(G), in the MAX-(µG, µH)-MATCHING problem.

Let col be a set of colors and G equipped with a coloring mapping λ : V(G)→ col. For any color ci ∈ col,we denote by colG(ci) the set of vertices of G that are colored with color ci, i.e., colG(ci) = u ∈ V(G) :λ(u) = ci. The multiplicity of λ in G, written mult(G, λ), is the maximum number of occurrences of a colorin G, i.e., mult(G, λ) = max| colG(ci)| : ci ∈ col. Let G and H be two graphs and let θ : V(G) → V(H) bean injective mapping. The set of edges of G that are preserved in H by θ is denoted by match(G,H, θ), i.e.,match(G,H, θ) = u, v ∈ E(G) : θ(u), θ(v) ∈ E(H). If both G and H are equipped with some coloringsλG : V(G)→ col and λH : V(H)→ col of their vertices, a mapping θ : V(G)→ V(H) is said to be with respectto λG and λH if λG(u) = λH(θ(u)) for every u ∈ V(G), i.e., θ is a color-preserving mapping. For simplicity,

we shall usually abbreviate such a mapping as θ : V(G)λG,λH−−−−→ V(H).

Max-(ρ, σ)-Matching-Colors

• Input : Two graphs G and H together with the coloring mappings λG : V(G) → col andλH : V(H)→ col with mult(G, λG) = ρ and mult(H, λH) = σ.• Solution : An injective mapping θ : V(G)

λG,λH−−−−→ V(H).•Measure : The number of edges of Gmatched by the injective mapping θ, i.e., |match(G,H, θ)|.

Page 66: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

53

We let EXACT-(ρ, σ)-MATCHING-COLORS stand for the extremal problem of finding an injective mapping

θ : V(G)λG,λH−−−−→ V(H) that matches all the edges of G, i.e., θ is required to be an injective graph homomor-

phism as we have considered in Section 4.3. Also, we call an instance of both MAX-(ρ, σ)-MATCHING-COLORSand EXACT-(ρ, σ)-MATCHING-COLORS colorful if ρ = 1, i.e., each color occurs once in the motif graph G.

Clearly, MAX-(1, σ)-MATCHING–COLORS and MAX–(µG, 1)–MATCHING are equivalent problems (colorfulinstances and disjoint lists do represent the same configuration). Then it follows that the MAX-(1, σ)-MATCHING-COLORS is approximable within ratio 2 d3∆(G)/5e if ∆(G) is even and within ratio 2 d(3∆(G) +2)/5e if ∆(G) is odd, for any ∆(H) and any fixed σH (see Proposition 4.4.2).

We have proposed in [Brevier et al., 2009] a random walk algorithm to deal with exact colorful instances(recall that theO∗ notation suppresses polynomial terms) Observe that the EXACT-(1, σ)-MATCHING-COLORSproblem is easily solvable in O∗(σn) time (n is the order of G): the easy brute-force algorithm tries all

possible injective mappings θ : V(G)λG,λH−−−−→ V(H) and returns the best one.

Proposition 4.5.1 ([Brevier et al., 2009]). There exists a randomized algorithm that, given any instance (G,H) ofthe EXACT-(1, σ)-MATCHING-COLORS problem, returns an injective homomorphism θ : V(G)

λG,λH−−−−→ V(H) (ifsuch a mapping exists) in O(f(σ)n) expected time (ignoring polynomial factors), where

f(σ) =4σ(2σ− 2)3

4(2σ− 2)3 + 27(2σ− 3).

Recall that the MAX-(1, 2)-MATCHING-COLORS problem for bipartite graphs G and H with ∆(G) = 3 and∆(H) = 2 (resp. with ∆(G) = 6 and ∆(H) = 5) is APX-hard and is not approximable within ratio 1.0005(resp. 1.0014), unless P = NP [Fertin et al., 2009b]. Therefore, there is a natural interest to investigate thecomplexity issues of MAX-(ρ, σ)-MATCHING-COLORS for restricted graph classes. Our results are technicalbut quite negative.

Proposition 4.5.2 ([Brevier et al., 2009]). The MAX-(3, 3)-MATCHING-COLORS (resp. MAX-(2, 2)-MATCHING-COLORS) problem is APX-hard even if both G and H are linear forests (resp. trees with maximum degree 3).

It remains open, however, whether the MAX-(ρ, σ)-MATCHING-COLORS problem for linear forests G andH is polynomial-time solvable in case ρ < 3. In the light of the negative results of Proposition 4.5.2, there is anatural interest on approximating the MAX-(ρ, σ)-MATCHING-COLORS problem for bounded-degree graphs.We have shown the following result.

Proposition 4.5.3 ([Brevier et al., 2009]). For any ρ and σ, the MAX-(ρ, σ)-MATCHING-COLORS problem isapproximable within ratio 3/2(∆min + 1) + ε for any ε > 0, where ∆min = min∆(G), ∆(H).

Central in the above result is a (3/2+ ε)-approximation algorithm, for any ε > 0, for a new combinatorialproblem that may be of independent interest [Brevier et al., 2009]: Given a graph G and a symmetric matrixA = [ai,j] of ordermwhose entries are natural integers, find a maximum cardinality matchingM⊆ E(G)subject to the constraint that, for 1 ≤ i ≤ j ≤ m, the number of edges inM having one end-vertex colored ciand one end-vertex colored cj is at most ai,j.

Combining a random 2-labeling procedure (together with its induced cut) and a weighted bipartitematching algorithm, we have obtained in [Brevier et al., 2009] the following result.

Proposition 4.5.4 ([Brevier et al., 2009]). There exists a randomized algorithm for the MAX-(ρ, σ)-MATCHING-COLORS problem with expected performance ratio 4σ.

We have already mentioned that the MAX-(3, 3)-MATCHING-COLORS problem is APX-hard even if bothG and H are linear forests. Furthermore, according to Proposition 4.5.3, the MAX-(ρ, σ)-MATCHING-COLORSproblem for linear forests is approximable within ratio 2(∆min + 1) = 6. This is strengthened by the followingproposition (it is worth mentioning that the technique is based on 2-interval patterns).

Page 67: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

54

Proposition 4.5.5 ([Brevier et al., 2009]). For any ρ and σ, the MAX-(ρ, σ)-MATCHING-COLORS problem isapproximable within ratio 4 in case both G and H are linear forests.

We close this chapter by mentioning that we have proposed a better approximation for the MAX-(2, 2)-MATCHING-COLORS problem.

Proposition 4.5.6 ([Brevier et al., 2009]). MAX-(2, 2)-MATCHING-COLORS is approximable within ratio 1.1442.

Page 68: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

5Searching for connected occurrences

Contents5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525.3 Searching for exact connected occurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.3.1 Polynomial-time and hardness results . . . . . . . . . . . . . . . . . . . . . . . . . . . 525.3.2 Parameterized complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.4 Minimizing the number of connected components . . . . . . . . . . . . . . . . . . . . . . . . 555.4.1 Algorithms and hardness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565.4.2 Parameterized complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.5 Maximizing the size of the connected occurrence . . . . . . . . . . . . . . . . . . . . . . . . . 585.5.1 Algorithms and hardness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595.5.2 Algorithms and parameterized complexity . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.6 Further variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605.6.1 Practical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.1 Introduction

With the advent of network biology [Sharan and Ideker, 2006; Alm and Arkin, 2003] and complex networkanalysis in general, the study of pattern matching problems in graphs has become more and more important.In this context, the term “graph motif ”plays a central role.

Roughly speaking, there are two views of graph (or network) motifs. The older is the topologicalview where one basically ends up with certain subgraph isomorphism problems. For instance, the term“network motif ” has been used to represent patterns of interconnections that occur in a network at frequenciesmuch higher than those found in random networks [Shen-Orr et al., 2002; Wernicke, 2006] (Chapter 4 wasconcerned with such a view). By way of contrast, the second and more recent view on graph motifs takes amore “functional approach”. Here, topology is of lesser importance but the functionalities of network nodes(expressed by colors) form the governing principle. This approach has been propagated by Lacroix et al.[Lacroix et al., 2006].

This chapter is organized as follows. Section 5.2 presents some preliminaries. Section 5.3 is devotedto studying algorithmic aspects of the problem of finding a connected occurrence of a motif in a graph.

55

Page 69: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

56

Section 5.4 and Section 5.5 are concerned with optimization issues of this topic, and we briefly present inSection 5.6 further variants of this problem we are particularly interested in.

5.2 Definitions

A multiset (or bag) is a pair (A,mult), whereA is some set and mult : A→ N∗ The setA is called the underlyingset of elements. For each a ∈ A, the multiplicity (that is, the number of occurrences) of a is the numbermult(a). The maximum multiplicity of (A,mult) is defined to be maxmult(a) : a ∈ A. It is common to writethe function mult as a set of ordered pairs (a,mult(a)) : a ∈ A. For example, (a, 2), (b, 3), (c, 1) is themultiset (a, b, c, d,mult), where mult : A→ N∗ is defined by mult(a) = 2, mult(b) = 3, and mult(c) = 1.

We shall consider here motifs given in the form of multisets of colors and we writeM = (col,mult). AmotifM is called colorful if it has maximum multiplicity 1, i.e.,M reduces to a set.

The problem we are interested in is formally defined as follows.

Color-Matching

• Input : A set of colors col, a motifM = (col,mult), and a vertex colored graph (G, λ), whereλ : V(G)→ col is the coloring mapping.

• Question : Does there exist a connected induced subgraph of G colored byM, i.e., a subsetV ′ ⊆ V(G) such that (i) G[V ′] is connected, and (ii) λ(V ′) =M ?

In other words, we are asked to find a connected subgraph of Gwith |M| vertices which is exactly coloredwith the colors ofM (including multiplicities, if any). See Figure 5.1 for an illustration.

The different vertex colors are used to model different functionalities. Although originally introduced ina biological context [Lacroix et al., 2006; Fellows et al., 2007], it is conceivable that the GRAPH MOTIF is aninteresting problem not only for biological networks, but also may prove useful when studying complexsocial or technical networks (this remark is also in Betzler et al. [2008]).

5.3 Searching for exact connected occurrences

5.3.1 Polynomial-time and hardness results

The GRAPH MOTIF problem has been shown to be NP-complete even if the target graphG is a tree in [Lacroixet al., 2006]. The following proposition complete this result (one may easily notice that the GRAPH MOTIFproblem is polynomial-time solvable if ∆(G) ≤ 2).

Proposition 5.3.1 ([Fellows et al., 2007]). The two following variants of the GRAPH MOTIF problem are NP-complete:

1. the target G is a bipartite graph, ∆(G) = 4, and λ is a proper 2-coloring of G, and

2. the target G is a tree, ∆(G) = 3, each color occurs at most three times in G, andM is a colorful motif.

We did not succeed in proving that the GRAPH MOTIF problem is NP-complete if the targetG is a bipartitegraph, ∆(G) = 3, and λ is a (not necessarily) proper 2-coloring of G. However, we conjecture this restrictionto be NP-complete.

Defining the precise tractability landscape of the GRAPH MOTIF is of particular interest to strengthenhardness results. The following proposition shows that the jump in complexity is sudden and confirms thatthe second item of Proposition 5.3.1 is the best possible.

Page 70: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

57

c1

c2

c1 c2

c2

c2

c3 c2

c3c1

c4c3

c6c1

c4 c3

c6

c1 c3

c2

c1

c3c5

c1

Figure 5.1: A vertex-colored graph (the pancake network of order 4) together with an occurrence (in bold) ofthe motifM = c1, c1, c2, c3, c3, c3, c4, c5, c6, c6.

Proposition 5.3.2 ([Fellows et al., 2007]). The GRAPH MOTIF problem is solvable in polynomial-time if the targetG is a tree, each color occurs at most two times in G, andM is a colorful motif.

5.3.2 Parameterized complexity

In their pioneered work [Lacroix et al., 2006], the GRAPH MOTIF problem was proved to fixed-parametertractable when parameterized by the size of the given motif (i.e., |M|), in case the target graph is a tree.However, as observed in [Lacroix et al., 2006], their fixed-parameter algorithm does not apply when thevertex-colored graph is a general graph. For this case, they only provided a heuristic algorithm which workswell in practice. This motivates us to further investigate the tractability landscape of the GRAPH MOTIFproblem. From our point of view, a notable breakthrough in the study of the GRAPH MOTIF problem forgeneral graphs is that it is, as we shall see soon, fixed-parameter tractable when parameterized by the sizeof the motifM [Fellows et al., 2007]. This result is important in many ways. For one, it rests on a firmfoundation and paves the way to further fixed-parameter algorithms (current approaches are still limitedto motifs of size about 15 whereas practical applications do ask for motifs of size about 25–30 [Bruckneret al., 2009b], not an order of magnitude difference). For another, it motivates the investigation of the GRAPHMOTIF problem under different parameters which govern the structure of its input.

At the heart of our approach is the color-coding technique introduced by Alon et al. [Alon et al., 1995],whose derandomized version crucially relies on the notion of perfect hash families.

Definition 5.3.3 (Perfect Hash Family [Alon et al., 1995]). Let G be a graph. A family F of functions from V(G)to 1, 2, . . . , k is perfect if for any subset V ′ ⊆ V(G) of k vertices there is a function f ∈ F which is one-to-one onV ′.

Page 71: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

58

Aiming at accurate models, variants of the GRAPH MOTIF problem are greatly needed. To this aim,Betzler et al. [Betzler et al., 2008] replaced connectedness demand by more robust requirements,and proved the problem of finding a biconnected occurrence ofM inG to be W[1]-complete whenthe parameter is the size of the motif. This result is of particular importance as it sheds light onthe fact that a seemingly small step towards motif topology results in parameterized intractability.What about replacing the connectedness demand by modularity? Recall that a module in a graphG is a subset V ′ ⊆ V(G) such that the neighborhoods outside the module of the vertices withinthe module are all equal [McConnell and Spinrad, 1999]. The problem now becomes: Given avertex-colored graph G and a motifM, find a subset V ′ ⊆ V(G) such that (i) V ′ is colored byM, and (ii) V ′ is a module in G? We do believe this direction is a promising line of research thatwe plan to expand in future works (this is actually a direction we are currently pursuing withF. Sikora). For one, considering modules makes sense in the general setting of graphs motifs.Indeed, a module is a set of vertices such that each vertex not in the module has a uniformrelationship to all members of the module, i.e., vertices of the module are indistinguishable fromthe outside. For another, the notion of modules is shipped with modular decomposition trees, i.e.,an organization in a tree of the strong modules. Below is an example of a graph together with itsmodular decomposition tree

1

4

2

3

5

6

7

8

10

9

11

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11

1 2, 3, 4

2, 3

2 3

4

6, 7

6 7

8, 9, 10, 11

8 9 10, 11

10 11

Modular decomposition trees should certainly help for algorithm design. Sure enough, replacingconnectedness by the notion of modules is not a strong enough relaxation (is it a relaxation?) topush the GRAPH MOTIF problem towards tractability (actually, definitively not!). However, weexpect the modular tree decomposition to be a useful structure to design efficient fixed-parameteralgorithms. At a more general level, there does not exist any precise definition of what a motif is(or should be) in a biological network, and hence we think that providing a parameterized toolboxincorporating several definitions of graph motifs could be of particular interest for practicalapplications.

Our result can be stated as follows.

Proposition 5.3.4 ([Fellows et al., 2007]). The GRAPH MOTIF problem is solvable in 2O(k) n2 log(n) time, wherek = |M| and n is the number of vertices in the target graph G.

The GRAPH MOTIF problem is thus fixed-parameter tractable when parameterized by |M|. We onlysketch the main ideas to prove Proposition 5.3.4. Suppose M has an occurrence V ′ in G, and suppose

Page 72: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

59

we are provided with a perfect family F of functions from V(G) to 1, 2, . . . , k. Since F is perfect, we areguaranteed that at least one function in F assigns V ′ with k distinct labels. Let f ∈ F be such a function.For a given L ⊆ 1, 2, . . . , k, we defineML(v) to be the family of all motifsM ′ ⊆M, |M ′| = |L|, for whichthere exists an occurrence V ′′ with v ∈ V ′, such that the set of (unique) labels that f assigns to V ′′ is exactlyL. SinceM occurs in G, we know thatM ∈ M1,2,...,k(v) for some v ∈ V(G). To determine whetherMoccurs in G, we apply a dynamic programming to computeML(v) for all v ∈ V(G) and L ⊆ 1, 2, . . . , k.Now, fix L to be some subset of 1, 2, . . . , k, and let v be any vertex of G. Our goal is thus to computeML(v)assumingML ′(u) has been previously computed for every vertex u ∈ V(G) and any L ′ ⊆ L \ f(v). Thestraightforward approach is to consider small motifs occurring at neighbors of v. However, a motif occurringat v might be the union of motifs occurring at any number of neighbors of v, and so this approach mightrequire exponential running time in n. We have shown in [Fellows et al., 2007] that there exists an alternativemethod for computingML(v) that uses an even more naive approach, but one that requires exponential-timeonly with respect to k. Notice that while the motifs computed by our algorithm are in general multisets ofcolors, the procedure always considers sets of distinct labels.

The tree-width parameter of graphs [Robertson and Seymour, 1986] plays a central role in designing exactalgorithms for many NP-hard graph problems [Arnborg and Proskurowski, 1989; Bodlaender, 1993]. Usingtree decompositions and nice tree decompositions of arbitrary graphs (in a somewhat nonstandard way totailor-fit our purposes), we have shown that the GRAPH MOTIF problem is polynomial-time solvable whenthe target graph G has constant tree-width andM consists of a constant number of colors (but arbitrarynumber of elements). This should be compared with the sharp hardness result of Proposition 5.3.1 whichstates that there are rather restricted classes of graphs, such as bounded degree bipartite graphs, where theGRAPH MOTIF problem is NP-complete even whenM is built over only two colors.

Proposition 5.3.5 ([Fellows et al., 2007]). The GRAPH MOTIF problem is solvable in O∗(2ω22c

)time, where ω is

the tree-width of the target graph G, and c is the number of distinct colors in the motif |M|.

Although Proposition 5.3.5 gives a nice complementary result to the sharp hardness result of Proposi-tion 5.3.1, it still leaves a certain gap. The following proposition closes this gap (by the negative).

Proposition 5.3.6 ([Fellows et al., 2007]). The GRAPH MOTIF problem, parameterized by the number of distinctcolors c in the motifM, is W[1]-hard for trees.

Even if we do believe that the GRAPH MOTIF problem introduced by Lacroix et al. [Lacroix et al., 2006] hasshed new light on graph motifs, it suffers from much the same weaknesses as all pure decision problem: itdoes not allow for approximate occurrences. The rest of this chapter is devoted to analyzing natural variantsof the GRAPH MOTIF problem that deal with approximate occurrences. As we shall see, the GRAPH MOTIFproblem enjoys several variants (reflecting different points of view) that deserve separate considerations.

5.4 Minimizing the number of connected components

We consider in this section the problem of finding an occurrence of a motifM in a vertex-colored graph thatresults in a minimum number of connected components. This problem has been first considered in [Dondiet al., 2007], and [Betzler et al., 2008] presents some additional interesting results. We refer to this problem asthe MINIMUM CC problem (MINIMIZING THE NUMBER OF CONNECTED COMPONENTS). It is formally definedas follows.

Minimum CC

• Input : A set of colors col, a motifM = (col,mult), and a vertex colored graph (G, λ).• Solution : A subset V ′ ⊆ V(G) such that λ(V ′) =M.• Measure : The number of connected components in the induced subgraph G[V ′].

Page 73: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

60

In other words, we are asked to find a subgraph of G with |M| vertices which is exactly colored with thecolors ofM (including multiplicities, if any) that induces a minimum number of connected components. SeeFigure 5.3 for an illustration.

c1

c2

c1 c2

c2

c2

c3 c2

c3c1

c4c3

c6c1

c4 c3

c6

c1 c3

c2

c1

c3c5

c1

Figure 5.2: A vertex-colored graph together with an occurrence (in bold) of the motif M =c1, c1, c2, c3, c3, c4, c5, c6 that results in two connected components, i.e., c1, c2, c3, c4, c6 and c3, c3, c5.

5.4.1 Algorithms and hardness

It turns out that the MINIMUM CC problem is the most difficult variant of the GRAPH MOTIF problem if onefocuses on graph classes. Let us explain this point. We need some new definitions. Define an isogram to be aword in which no letter is used more than once, and a pair isogram to be a word in which each letter occursexactly twice. A cover of size k of a word u is an ordered collection of words C = (v1, v2, . . . , vk) such thatu = w1v1w2v2 . . . wkvkwk+1 for some words w1, w2, . . . , wk+1 and v = v1v2 . . . vk is an isogram The coveris called prefix (resp. suffix) if w1 (resp. wk+1) is the empty word. Strongly related are proper 2-colorings. Aproper 2-coloring of a pair isogram u is an assignment f of colors c1 and c2 to the letters of u such that everyletter of u is colored with color c1 once and colored with color c2 once. If two adjacent letters x and y arecolored with different colors we say that there is a color change between x and y.

Example 2 Consider the pair-isogram u = abb cdd e e ca f g g f hh. A cover C of u of size 4 is given byC = (ab, de, c, gfh) and the associated proper 2-coloration of u is given by

u = a b b c d d e e c a f g g f h h.

Page 74: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

61

A word u is said to be crossing-free (resp. inclusion-free) if there do not exist indices 1 ≤ i1 <i2 < i3 < i4 ≤ |u| such that u[i1] = u[i3] 6= u[i2] = u[i4] (resp. u[i1] = u[i4] 6= u[i2] = u[i3]).Does there exist a polynomial-time for computing a minimum cardinality cover of a crossing-freepair isogram? What about inclusion-free pair isograms?

How approximable is the MINIMUM CC problem for paths?

The 1-REGULAR-2-COLORS-PAINT-SHOP problem is defined as follows (see [Bonsma et al., 2006; Bonsma,2003; Epping and Oertel, 2004] for the general PAINTSHOP-FOR-WORDS problem): Given a pair isogramu, find a 2-coloring f of u that minimizes the number of color changes in (u, f). Bonsma [Bonsma, 2003]proved that the 1-REGULAR-2-COLORS-PAINT-SHOP problem is APX-hard. Combining this with the factthat a minimum cardinality cover of a pair isogram cannot be both prefix and suffix (see [Dondi et al., 2007]),we have obtained the following result.

Proposition 5.4.1 ([Dondi et al., 2007]). The MINIMUM CC problem is APX-hard even ifM is colorful and thetarget graph G is a path in which each color appears at most twice.

Quite a negative result! The following proposition moderates the above proposition.

Proposition 5.4.2 ([Dondi et al., 2007]). The MINIMUM CC problem for paths is solvable inO(nc+4) time, where nis the number of vertices in the path and c is the number of distinct colors in the motifM.

Focusing on trees, we have obtained the following positive and negative results (the positive resultsbeing actually exponential-time algorithms).

Proposition 5.4.3 ([Dondi et al., 2007]). There exists a constant c > 0 such that the MINIMUM CC problem for treescannot be approximated within performance ratio c log(n), where n is the number of vertices in the tree.

Proposition 5.4.4. The MINIMUM CC problem for trees is solvable in O(n2k(c+1)2+1) time, where n is the number

of vertices in the tree, k is the size of the motifM, and c is the number of distinct colors inM.

Proposition 5.4.5 ([Dondi et al., 2007]). The MINIMUM CC problem for trees is solvable in O(n222n3 ) time, where

n is the number of vertices in the tree.

5.4.2 Parameterized complexity

Extending our result (Proposition 5.3.4), we have shown the MINIMUM CC problem for its standard parame-terization to be fixed-parameter tractable as well.

Proposition 5.4.6 ([Dondi et al., 2007]). The MINIMUM CC problem is fixed-parameter tractable when parameterizedby the size of the motif.

Page 75: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

62

The O(4.32kk2m | log(ε)|) time algorithm of [Betzler et al., 2008] uses (to speed-up the dynamicprogramming procedure) the technique of fast subset convolution. This novel technique wasdeveloped by Bjorklund et al. [Bjorklund et al., 2007], who used it to speed-up several dynamicprogramming algorithms including the algorithm by Scott et al. [Ideker et al., 2006] for computingminimum weight size k trees in signaling networks.From our point of view, fixed-parameter algorithmic results should support implementationand experimental work. It would be of particular interest to investigate whether the recentlyintroduced subset convolution technique, which so far has been studied purely from a theoreticalpoint of view, also yields a significant speed-up in practice.A similar question may be asked as to how much color coding techniques [Alon et al., 1995]support implementation. (indeed, it turns out that the O(4.32kk2m | log(ε)|) time algorithm of[Betzler et al., 2008] increases the number of colors that are used for color-coding in order toincrease the probability of an occurrence ofM to receive a colorful recoloring, see [Huffner et al.,2007]). Is anybody aware of any implementation of perfect hash families to derandomize thisapproach? I suspect there isn’t one, most approaches use a randomized color procedure and notperfect hash families. However, recent research on implementation on (randomized) color-codingbased graph algorithms [Dost et al., 2007; Huffner et al., 2007; Ideker et al., 2006] are, undoubtly,positive experiences.

Our result is now superseded by [Betzler et al., 2008] where it is shown (in a clever way) that the MINIMUMCC problem can be solved with error probability ε in O(4.32kk2m | log(ε)|) time, where k is the size of themotifM andm is the number of edges in the target graph (see thinking note page 58).

The following proposition shows a sharp contrast in complexity if one considers the number of connectedcomponents as the parameter of interest.

Proposition 5.4.7 ([Dondi et al., 2007]). The MINIMUM CC problem is W[2]-hard when parameterized by thenumber of connected components, even if the target graph G is a tree.

It is worth mentioning that, answering an question we have raised in [Dondi et al., 2007], N. Betzler et al.[Betzler et al., 2008] have recently proved the MINIMUM CC problem to be W[1]-hard only for paths (prooffrom the W[1]-hard PERFECT CODE problem).

5.5 Maximizing the size of the connected occurrence

We now turn to another variant of the GRAPH MOTIF problem where one is interested in obtaining a singleconnected component (as in the original GRAPH MOTIF problem) at the cost of “loosing”some colors. Severaldefinitions actually would perfectly fit to this. We focus here on finding a connected occurrence that uses asmuch as possible colors from the motif (see next section for other possible definitions). We have introducedthis variant of the GRAPH MOTIF in [Dondi et al., 2009] and we refer to it as the MAXIMUM MOTIF problem.

Page 76: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

63

c1

c2

c1 c2

c2

c2

c3 c2

c3c1

c4c3

c6c1

c4 c3

c6

c1 c3

c2

c1

c3c5

c1

Figure 5.3: A vertex-colored graph together with an maximum cardinality occurrence (in bold) of a submotifM ′ = c1, c1, c2, c3, c3, c4, c5, c6, c6, of the motifM = c1, c1, c2, c3, c3, c4, c4, c5, c6, c6, .

Maximum Motif

• Input : A set of colors col, a motifM = (col,mult), and a vertex colored graph (G, λ).• Solution : A subset V ′ ⊆ V(G) such that (i) G[V ′] is connected, and (ii) λ(V ′) =M ′ for somesubmotifM ′ ⊆M.• Measure : The size ofM ′, i.e., |M ′|.

Intuitively, the MAXIMUM MOTIF problem thus asks for the largest submotifM ′ ⊆ M that occurs in G asa connected component. See Figure 5.3 for an illustration. Being a mere restriction of the GRAPH MOTIFproblem, the MAXIMUM MOTIF problem is NP-complete as well [Lacroix et al., 2006].

5.5.1 Algorithms and hardness

Not surprisingly, the MAXIMUM MOTIF problem is hard to approximate. The following proposition provethat the MAXIMUM MOTIF problem does not enjoy a PTAS even for trees and colorful motifs.

Proposition 5.5.1 ([Dondi et al., 2009]). The MAXIMUM MOTIF problem is APX-hard even if the motif is colorful,the target graph is a tree with maximum degree 3, and each color occurs at most twice in the tree.

It is worth mentioning that we do believe Proposition 5.5.1 not to be tight since we seriously doubt theMAXIMUM MOTIF problem for colorful motifs and trees with bounded number of occurrences of colors tobe even in APX. The following proposition supports this sentiment (the rather technical proof is by the

Page 77: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

64

self-improvement technique, see for example [Hein et al., 1996; Jiang and Li, 1995; Karger et al., 1995] to seethis powerful technique in action).

Proposition 5.5.2 ([Dondi et al., 2009]). For any constant δ < 1, the MAXIMUM MOTIF problem for trees andcolorful motifs cannot be approximated within performance ratio 2logδ n, unless NP ⊆ DTIME[2poly logn].

First, notice that NP 6⊆ DTIME[2poly logn] is considered to be a reasonable complexity hypothesis (and isactually widely believed to be true). It is also worth noticing that, as we have shown in [Dondi et al., 2009],substituting the complexity hypothesis NP ⊆ DTIME[2poly logn] by the classical P = NP yields inapproxima-bility within a constant ratio. Second, the only difference in the instances between Proposition 5.5.1 andProposition 5.5.2 is that the number of occurrences of each color is fixed in the former. Although at first odd,we believe that this restriction is not stronger enough to imply membership to APX.

5.5.2 Algorithms and parameterized complexity

In the light of the negative results for approximating the MAXIMUM MOTIF problem, we turn to exponential-time algorithms and parameterized complexity. We gave in [Dondi et al., 2009] two exact branch-and-boundalgorithms for the MAXIMUM MOTIF problem in case the target graph is a tree. The two results are summarizedin the following proposition.

Proposition 5.5.3 ([Dondi et al., 2009]). The MAXIMUM MOTIF problem for trees of size n can be solved inO(1.62n poly(n)) time. In case the motif is colorful, the time complexity reduces to O(1.33n poly(n)).

Based on the color-coding technique and perfect hash families [Alon et al., 1995], we have consideredin [Dondi et al., 2009] parameterized issues of the MAXIMUM MOTIF problem. Our results can be stated asfollows.

Proposition 5.5.4 ([Dondi et al., 2009]). The MAXIMUM MOTIF problem for trees with n vertices is solvable inO(k2kn3 logn) 2O(k) time, where k is the size of the submotif occurring in the tree. For general graphs of order n, theMAXIMUM MOTIF problem is solvable in O(25kkn2 log2 n) 4O(k) time, where k is the size of the submotif occurringin the graph.

One should admit that our parameterized results are still far from being able to support implementation.However, we believe that the color coding approach reaches its limits here and that improving the running-time of our algorithms requires different techniques.

5.6 Further variants

This short section is devoted to briefly presenting further variants of the GRAPH MOTIF problem we areinterested in. Indeed, as stated in the preceding section, relaxing the GRAPH MOTIF problem to allow forapproximate solutions may lead to distinct combinatorial problems (the MINIMUM CC and MAXIMUM MOTIFSproblems are two such possibilities). We are especially interested in variants where one is searching fora single connected component at the cost of loosing/adding/modifying some colors as it seems that thisrequirement is well-suited for practical applications.

• The MINIMUM DELETE MOTIF problem: Find a submotifM ′ ⊆M that occurs as a connected componentin the target graph. The optimization is concerned with deleting a minimum number of colors inM.

• The MINIMUM ADD MOTIF problem: Find a supermotifM ′ ⊇M that occurs as a connected componentin the target graph. The optimization is concerned with adding a minimum number of colors toM, orequivalently minimizing |M ′|.

Page 78: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

65

• The MINIMUM SUBSTITUTION MOTIF problem: Find a motifM ′ (related toM) that occurs as a connectedcomponent in the target graph. The optimization is concerned with modifying a minimum number ofcolors inM to obtainM ′.

Being mere variants of the GRAPH MOTIF problem, all these variants are of course NP-complete (actually, itturns out that they are all APX-hard or not approximable). However, they correspond to different questionsone may ask for connected motifs. Hereafter we mention some thoughts about these three problems.

• In terms of optimal solutions, the MAXIMUM MOTIF and the MINIMUM DELETE MOTIF problems areclearly equivalent. The comparison stops there. Indeed, considering approximation, dual combinatorialproblems usually enjoys opposite properties (this is not a rule, I admit, only a trend). As for theparameterized complexity, the two problems seem to behave radically differently. Without going intothe details, we just mention that, oppositely to the MAXIMUM MOTIF problem, the MINIMUM DELETEMOTIF problem for its standard parameterization is not fixed-parameter tractable.

• The MINIMUM ADD MOTIF problem seems to be more well-suited than the MINIMUM CC problem formost applications. Indeed, in the MINIMUM CC problem, the focus is on the number of connectedcomponents, regardless whether these connected components are arbitrary far in the graph. In somesense, the MINIMUM ADD MOTIF problem allows us to control this aspect by putting the focus on thenumber of colors to add to the motif to connect all those connected components into a single one.

• Although being a natural variant of the GRAPH MOTIF problem, the MINIMUM SUBSTITUTION MOTIFproblem is quite intriguing from an algorithmic point of view as it seems to add one level of freedom.Interestingly enough (but unfortunately), the MINIMUM SUBSTITUTION MOTIF problem parameterizedby the number of substitutions is W[2]-hard.

5.6.1 Practical issues

As we have seen in this chapter, we are still far from being able to provide a complete algorithmic toolboxto deal with the many flavors of the GRAPH MOTIF problem. Most fixed-parameter algorithms (includingours) do not support implementation yet. However, bridging the gap between theory and practice is greatlyneeded for practical applications.

A first step towards providing an integrated algorithmic solution is the TORQUE web server [Bruckneret al., 2009b] (it implements the algorithms in [Bruckner et al., 2009a] for querying protein sets across species).TORQUE combines three approaches: a dynamic programming method utilizing color coding, integer linearprogramming and a fast heuristic based on shortest paths. Quoting the authors [Bruckner et al., 2009a]:“TORQUE automatically selects the best method to apply at each stage and outputs the highest scoring match”.Actually, there is no magical trick, TORQUE relies on color coding if the motif is small enough (about 15elements) and switches to linear programming otherwise.

In collaboration with G. Blin and F. Sikora, we have also developed an integrated algorithmic toolbox,named GraMoFoNe, to deal with the many flavors of the GRAPH MOTIF problem [Blin et al., 2010b]. Noticethat, oppositely to TORQUE, GraMoFoNe is not a web server but a plugin for the popular cytoscapeopen source platform (http://www.cytoscape.org/). Another notable difference with TORQUE is thatGraMoFoNe does not combine two techniques (color coding and linear programming) but uses boolean linearprogramming. The rationale for this choice is twofold. For one, equipped with such a framework, TORQUEis superseded by GraMoFoNe in terms of modeling as it allows to consider most variant of the GRAPH MOTIFproblem (TORQUE is indeed limited to colorful motifs). Without going into the details, it is worth noticingthat TORQUE and GraMoFoNe notably differ in the consideration of the connectedness property: whereasTORQUE simulates a flow algorithm and hence does need linear programming, GraMoFoNe simulates abreadth first search (BFS) procedure that can be performed by boolean linear programming. For another,

Page 79: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

66

Figure 5.4: Screenshot of the GraMoFoNe software: Querying the mouse DNA synthesome complex in theyeast PPI network (see [Bruckner et al., 2009b]).

GraMoFoNe uses a pure pseudo-boolean programming engine together with some data reduction rules tospeed-up the computations. See Figure 5.4 for a screenshot of GraMoFoNe in action.

TORQUE and GraMoFoNe perform more or less the same in terms of performances for moderate size treemotifs (GraMoFoNe is, however, not limited to trees). They also suffer from the same drawbacks: they arenot able to deal with large motifs. However, GraMoFoNe is by far more scalable and is completely integratedin the cytoscape software (and hence can be easily used in combination with other cytoscape plugins). It isa challenging and important problem to improve GraMoFoNe so that it can tackle motifs of size about 30.

Page 80: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

6Querying PPI Networks

Contents6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636.2 A feedback vertex set approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646.3 Practical issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6.1 Introduction

This short chapter is devoted to a somewhat more classical view of pattern matching in protein-proteininteraction (PPI) networks. Comparative analysis of PPI tries to determine the extent to which proteinnetworks are conserved among species. Indeed, it was observed that proteins functioning together ina pathway (i.e., a path in the interactions graph) or a structural complex (i.e., an assembling of stronglyconnected proteins) are likely to evolve in a correlated fashion and during evolution, all such functionallylinked proteins tend to be either preserved or eliminated in a new species [Pellegrini et al., 1999].

The classical view of PPI network querying is as follows: Given a PPI network and a pattern with agraph topology, find a subnetwork of the PPI network that is as similar as possible to the pattern, in respectto the initial topology. Similarity is measured both in terms of sequence similarity and graph topologyconservation.

Unfortunately, this problem is clearly equivalent to the NP-complete SUBGRAPH HOMEOMORPHISMproblem [Garey and Johnson, 1979]. Recently, several techniques have been proposed to overcome thedifficulty of this problem. By restricting the query to a path of length less than five, Kelley et al. [Kelleyet al., 2003] developped PathBlast, an exponential-time algorithm which allows one consecutive mismatch.Later on, Shlomi et al. [Shlomi et al., 2006] proposed an alternative, called QPath, for querying paths in a PPInetwork (the algorithm is based on the color coding technique [Alon et al., 1995]). By restricting the query toa tree, Pinter et al. [Pinter et al., 2005] proposed an algorithm that is restricted to forest PPI networks, i.e.,collection of trees. Finally, Dost et al. [Dost et al., 2007] developed QNet, a software to handle tree query inthe general context of PPI networks. Of particular importance, [Dost et al., 2007] proposed an algorithmbased on tree-decomposition for querying general graphs.

Indisputably, QNet is the state-of-the-art software to query PPI networks. Let us thus present it briefly(more precisely, let us present the implemented part of QNet since, as we shall see soon, this distinctionis crucial in the present context). QNet is a fixed-parameter tractable randomized algorithm for querying

67

Page 81: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

68

trees in (general) PPI networks. The complexity ism2O(k) log(ε−1) time, where k is the number of proteinsin the query, m the number of edges of the PPI network and 1 − ε is the probability of success (for anyε > 0). Following the example of QPath, QNet combines (in a non-trivial way) a dynamic programmingprocedure together with the color-coding technique. This completely described the implemented part ofQNet. However, QNet is shipped with an additional algorithm to query general graphs in PPI networks.At the heart of this algorithm is a procedure to transform the query graph into a tree (the technique is bytree-decomposition [Bodlaender, 1993]). The running time is 2O(k)nω+1, whereω is the tree-width of thequery graph (recall, however, that computing the tree-width of a graph is NP-complete [Arnborg et al.,1987]). A word of caution is necessary here. Indeed, the authors were not be able to implement this algorithm(this is probably due to its inherent difficulties at dealing with tree-decomposition). Nevertheless, evenif they would succeed in implementing it, we highly suspected the huge constants hidden by the big-Onotation to make it useless.

This chapter is devoted to presenting an effective alternative to QNet called PADA1 (I am not being heldresponsible for this Star Wars name!). It is organized as follows. Chapter 6.2 is intended to motivate ourinvestigation. Chapter 6.3 is devoted to practical considerations of our contribution.

6.2 A feedback vertex set approach

PADA1 [Blin et al., 2009c] is an effective network querying algorithm that extends QNet to more generalquery graphs. Following the example of QNet, PADA1 is a two-step procedure: it first transforms the querygraph into a tree and next uses that tree to effectively perform the query. Notice that it allows for insertionsand deletions in the occurrence. While both QNet and PADA1 use a tree-like query, the two algorithms usetotally different approaches. Indeed, whereas QNet is based on tree-decomposition, PADA1 focuses on thefact that most query graphs have relatively small feedback vertex set in practice (recall that a feedback vertexset is subset of vertices whose removal leaves us with a cycle-free graph).

Finding a smallest feedback vertex set is a well-known NP-complete problem [Garey and Johnson, 1979].The current implementation of PADA1 transforms the query graph into a tree by iteratively finding a cycle,duplicating (and storing) a node on that cycle and finally breaking the cycle by edge deletion. More efficientapproaches, including iterative compression [Guo et al., 2006] and reduction to kernel [Thomasse, 2009], maybe used to identify a feedback vertex set and transform the query graph into a tree, but experimentationsshow that our “brute-force”algorithm turns out to be the fastest in practice. Indeed, (i) iteratively findingcycles relies on a fast BFS search (a O(n + m) time procedure), (ii) the feedback vertex set of most realinstances is very small, and finally (iii) finding an occurrence of the constructed tree into the PPI network isdefinitively the most time-consuming part of our approach.

The second step of PADA1 consists in finding an occurrence (allowing insertions and deletions) ofthe constructed tree into the PPI network. Our approach is by combining random coloring and dynamicprogramming (see [Blin et al., 2009c] for details). The main difficulty in this second step is to take intoaccount all those duplicated vertices, and more precisely to group process all the copies of a same vertex(done by dynamic programming in the current implementation).

On the whole, the complexity of PADA1 is O(mn|f|Ndel2O(k+Nins) log(ε−1)) time, where k is the number

of proteins in the query,m the number of edges of the PPI network, 1− ε is the probability of success (forany ε > 0),Nins is the maximum number of insertionsNdel is the maximum number of deletions, and f is thefeedback vertex identifies in the very first part of the algorithm. Of particular importance, observe that thetime complexity does not depend on the total number of duplicated nodes but on the size of the identifiedfeedback vertex set (good, exactly what we were looking for).

Page 82: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

69

6.3 Practical issues

We briefly discuss practical issues of PADA1. Since we wanted PADA1 to be an effective alternative to QNet,we have confronted our results in [Blin et al., 2009c] with thoses obtained by QNet. PADA1 has proved toperform as well as QNet (we refer the reader to [Blin et al., 2009c] for details). For example, our secondexperiment was performed across species. The Mitogen-Activated Protein Kinase (MAPK) is a collectionof signal transduction queries. According to [Dent et al., 2003], they have a critical function in the cellularresponse to extracellular stimuli. They are known to be conserved through different species. We obtained thehuman MAPK from the KEGG database [Kanehisa et al., 2004] and tried to retrieve them in the fly networkas done in QNet. While QNet uses only trees, we were able to query general graphs (See Figure 6.1 for anillustrative output of PADA1). The results were quite satisfying since we retrieved all of them, with only fewmodifications.

Figure 6.1: An automatic dot file generated by PADA1 (verbose output omitted). Left: A MAPK human query[Kanehisa et al., 2004]. Right: The alignment graph given provide by PADA1 in the fly PPI network. Dashedlines denotes the BLAST homology scores between the proteins. See [Dost et al., 2007] for a discussion onthis particular query.

Page 83: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

70

Page 84: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Part III

Genome Rearrangements

71

Page 85: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular
Page 86: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Introduction

The third part of this manuscript is concerned with comparative genomics. The combinatorial studyof genome rearrangements started with permutations, but permutations lack the possibility of takingduplications into account. Duplications are however a major evolutionary event, believed to be one of themost important mechanisms for novel generations in evolution, and almost all datasets on eukaryotescontain duplicated genes (see e.g. [Ohno, 1970]). An appropriate tool for studying genomes with duplicatedgenes was therefore needed, and strings are a very natural generalization of permutations that fit this purposewell. It allows to add two possible rearrangement events: duplications and deletions. We shall see in this partthat NP-completeness and even inapproximability results are very numerous. The subject was surveyed in2005 by El-Mabrouk [2005]. The most up-to-date reference to this field is our recent monograph [Fertin et al.,2009a].

Biological motivations

Duplications can occur at several levels, ranging from the duplication of a single gene or small segment ofDNA to the duplication of a whole chromosome, and even whole genome duplications are known to occur.These evolutionary events result in genomes in which where some markers are undifferentiable, and we callthem duplicated genes.

Given a set of genomes, all copies of a given gene among those genomes are said to be homologous, whichmeans that they originate from a common ancestral gene, and form a gene family. The presence of two copiesof a gene in a set of genome may be explained by speciation events, that is, the apparition of two distinctspecies, each genome carrying the gene; it can also be explained by duplication events, which result in twocopies of a gene in the same genome. The relationships of the copies of a gene in a gene family can thus beof several type. Two copies of a gene are said to be orthologous if they derive from a speciation event, andParalogous if they derive from a duplication event. Given two genomes and a gene family, a distinction ismade between out-paralogs, which are paralogous gene copies which derive from a duplication that occuredbefore the last common ancestor of the two genomes, and in-paralogs, that derive from a duplication thatoccured after the last common ancestor. Note that the name gene is classically a bit ambiguous, as it referseither to a family (there are several copies of a same gene in the genomes), or to copies (two genes mayderive from a duplication).

Those different situations motivate and justify the use of the models we will consider in this part. Indeed,when comparing two sequences under the assumption that all copies of a given element in a single string are

73

Page 87: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

74

in-paralogs, the goal will be to identify the position of the unique ancestor. If there can be out-paralogs, thenthe goal will be to detect orthologs by matching some copies. The distances between two strings will varyaccording to which model is chosen. Every combinatorial problem we have seen so far can be reformulatedin terms of strings, but the algorithmic treatment is usually completely different. For instance, the breakpointgraph, which is a ubiquitous objects when dealing with permutations, is not used on strings, in spite of someattempts to define them in the case of whole genome duplications by [Alekseyev and Pevzner, 2007].

How to deal with multiple copies?

Two strings on the same alphabet that contain the same number of occurrences of each gene family are saidto be balanced. Two balanced strings obviously have same length. This property ensures that it is possible totransform one string into another without deletions and duplications as rearrangements.

It is not equally difficult to take into account multiple copies when considering two balanced or twogeneral (that is, not necessarily balanced) strings. By convention, balanced strings are supposed to containonly out-paralogs, which means that each of the hmembers of some gene family present on each string S andT originates from one of the hmembers of the same gene family present on their last common ancestor. Thedifficulty is then to identify (that is, to match) the pairs of members, one on each string, which originate fromthe same member of the last common ancestor (i.e. are orthologous). On the contrary, general strings allowto assume the existence of both out-paralogs and in-paralogs on each string, so that deletion and insertionevents have to be considered additionally to the rearrangement events when comparing general strings. Theassignation of ortholog pairs of genes given two strings reduces to finding a matching between them.

While comparing two strings u and v, a matching between u and v is aimed at representing the commoncomposition of the strings, as supported by their last common ancestor and regardless of (but without losingtouch with) the order of the characters. Any pair of matched characters is then assumed to correspond toorthologous genes, while the unmatched characters are supposed to be in-paralogs. Here rearrangementstudies meet the important problem of ortholog identifications. The members of the same gene familypresent on the same string and which are matched are out-paralogs. In order to distinguish out-paralogsfrom each other, a relabeling may be performed, which gives new and distinct names. to out-paralogs andrenames the orthologous of each out-paralog accordingly. The last step of such a treatment of the strings isthe obtention of a pruning. The good news at this stage is that if we assume the relabeling is done such thatthe characters in the pruned strings are integers, both strings are permutations and may be compared usingthe usual distances on permutations.

Now, going back to our initial question “How to deal with multiple copies?”two answers are available: eitherdefine a collection of possible rearrangement and compute the minimum number of operations needed totransform one genome into the other, or reduce genomes to permutations using matchings and pruning andthen compute the distance (or (dis)similarity) between the permutations. We refer to these two approachesas the block edit model and the match-an-prune model , respectively. Part III of this manuscript is devotedto the match-and-prune model. For a thorough introduction to the block edit model we refer the reader to[Fertin et al., 2009a].

Page 88: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

7Genome rearrangements with duplicate genes

Contents7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717.2 From genomes to permutations . . . and back . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

7.2.1 Genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727.2.2 Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727.2.3 Turning a genome into a permutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

7.3 Comparing two compatible genomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757.3.2 Breakpoint distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767.3.3 Signed reversal distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777.3.4 Adjacency similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787.3.5 Common intervals similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787.3.6 Dissimilarity measures MAD and SAD . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

7.4 Exact algorithms and heuristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 807.4.1 Boolean linear programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 807.4.2 Heuristic approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

7.1 Introduction

This chapter is devoted to algorithmic aspects of the match-an-prune model for genome rearrangements.All problems follow the same guideline: start with two genomes with duplicate genes, i.e., strings, andtransform them into two permutations (by means of some special matching between the two genomes)so as to optimize a given distance or (dis)similarity measure. The rationale for this guideline is that mostdistances and (dis)similarity measures (more precisely those that are of interest in comparative genomics)are computable in polynomial-time for permutations. The difficulty of the problem is thus to compute a“good”transformation.

This chapter is organized as follows. Section 7.2 presents the relevant material thus making our expositionself-contained. Section 7.3 is concerned with complexity issues of genome comparisons and we discuss inSection 7.4 a family of fast general heuristics to cope with intractability.

75

Page 89: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

76

7.2 From genomes to permutations . . . and back

7.2.1 Genomes

A signed genome G is a string over the alphabet of integers (excluding 0, to avoid having to write +0 and−0). An unsigned genome is defined analogously by forbidding negative integers, this is a thus string overthe alphabet of positive integers. In the context of comparative genomics, we refer to the letters of G asgenes (the sign denotes the orientation of the gene). We follow standard string terminology and the size of agenome G is denoted |G|. We write G[i] for the gene at position i in G, 1 ≤ i ≤ |G|, and we denote its signby sign(G[i]). A gene family of G is a positive integer that occurs in G regardless of its sign (here are thosefamous duplications we are interested in). We will denote by F(G) the set of gene families that occur in G.For simplicity of notation, we write g ∈ G if g ∈ F(G) (thus we may write g ∈ G even if the gene g occursonly negatively in G). We will denote by |G|g the number of occurrences of a gene family g ∈ G, and we letdeg(G) stand for the maximum number of occurrences of a gene family in G, i.e., deg(G) = max|G|g : g ∈ G.Of particular importance, notice that deg(G) is computed independently of the signs of the genes. A genomeG is duplication-free if |G[i]| 6= |G[j]| for all 1 ≤ i < j ≤ |G|. In other words, a genome is duplication-free if anygene occurs exactly once, regardless of its sign.

Example 3 For genome G = 1 − 4 2 3 − 1 2, we have F(G) = 1, 2, 3, 4, |G|1 = 2, |G|2 = 2, |G|3 = 1, |G|4 = 1,and hence deg(G) = 2. On the other hand, genome H = −2 − 1 3 5 4 is duplication-free.

Let G be a duplication-free genome of size n, and i and j, 1 ≤ i < j ≤ n. If g = G[i] and g ′ = G[j], thedistance between gene g and gene g ′ in G, denoted dist(G, g, g ′), is defined by dist(G, g, g ′) = j− i.

Definition 7.2.1 (Pegged genome). A genome G is pegged if each interval between two genes in the same genefamily contains at least one singleton (a gene that occurs exactly once in G).

Pegged genomes have the interesting property that singletons act as markers helping to uniquely identifyeach occurrence of a non-singleton by its position with respect to these markers.

7.2.2 Permutations

The symmetric group on set 1, 2, . . . , n is written Sn = S(1, 2, . . . , n), and we let S0n = S(0, 1, . . . , n)stand for the symmetric groups on 0, 1, . . . , n. A permutation π of size n is a bijection π : Sn → Sn. Aclassical notation used in combinatorics to denote a permutation π is the two-row notation , where onearranges the “natural”ordering of the elements being permuted on a row, and the new ordering on anotherrow. For example,

π =

(1 2 3 4 52 5 4 3 1

)stands for the permutation π of the set 1, 2, 3, 4, 5 defined by π(1) = 2, π(2) = 5, π(3) = 4, π(4) = 3, andπ(5) = 1. We will, however, adopt the more convenient – and standard – one-row notation that keeps onlythe second row. Going back to our example, π = (2 5 4 3 1). The identity permutation (1 2 . . . n) is denoted ι,regardless of n.

The composition of two permutations π, σ ∈ Sn, denoted π σ, is defined by π σ = (πσ1 πσ2 . . . πσn).For example for π = (3 1 4 2) and σ = (4 1 3 2), we have π σ = (2 3 4 1). The inverse permutation of π ∈ Snis the permutation π−1 defined by π−1i = i for all 1 ≤ i ≤ n.

Signed permutations model the organization of genomes better than unsigned permutations, because theytake into account the double helix structure of DNA. A signed permutation on 1, 2, . . . , n is a permutation πof the set −n, . . . ,−2,−1, 1, 2, . . . , n such that π−i = −πi for all 1 ≤ i ≤ n. The one-row notation is alsoused for signed permutations. For example, the permutation

π =

(−5 −4 −3 −2 −1 1 2 3 4 5−5 3 −1 4 2 −2 −4 1 −3 5

)

Page 90: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

77

is simply written π = (−2 − 4 1 − 3 5).We recall here some basic definitions about permutations (we refer the reader to [Bona, 2004] for general

combinatorial aspects of permutations and to [Fertin et al., 2009a; Estivill-Castro and Wood, 1992] forapplications to comparative genomics).

Definition 7.2.2 (linear extension). The linear extension of a (signed or unsigned) permutation π ∈ Sn is thepermutation πl ∈ S0n+1 defined by πl = (0 π1 π2 . . . πn n+ 1).

Definition 7.2.3. Let πl be the linear extension of a (signed or unsigned) permutation π ∈ Sn. A point is an orderedpair (πli, π

li+1), 0 ≤ i ≤ n. This point is called

• an adjacency if πli+1 = πli + 1,

• a reverse adjacency if πli+1 = πli − 1,

• a breakpoint if it is not an adjacency, and

• a strong breakpoint if it is neither an adjacency nor a reverse adjacency.

Definition 7.2.4 (Interval (in a permutation)). An interval in a (signed or unsigned) permutation π ∈ Sn is a setI = |πi|, |πi+1|, . . . , |πj|, 1 ≤ i ≤ j ≤ n. The elements πi and πj of I are called the extremities of the interval.

Definition 7.2.5 (Common interval). An interval I is a common interval of permutations π, σ ∈ Sn if it is aninterval of both π and σ.

In case σ = ι, an interval I = |πi|, |πi+1|, . . . , |πj|, 1 ≤ i ≤ j ≤ n, is a common interval of π and ι if I isa set of consecutive integers. The number of common intervals of two permutations π and σ is denotedCI(π, σ). For π, σ ∈ Sn, it is easily seen that n + 1 ≤ CI(π, σ) ≤

(n2

)+ n, The lower bound is attained, for

example, if we take π = (1 2 3 4) and σ = (2 4 1 3) or σ = (3 1 4 2) (the excluded patterns of separablepermutations [Bose et al., 1998]!). The upper bound is attained for π = σ or σ = (πn . . . π2 π1).

7.2.3 Turning a genome into a permutation

How to deal with multiple copies? The match-and-prune model addresses the following question, whicharises naturally when trying to discover the relationships between two genomes G and H: how can we takeinto account, when comparing genomes with duplicates, that the structure of the last common ancestor of Gand H plays an important role in the evolutionary distance between the two genomes? As this structure isunknown, unless we have very good reasons to conclude we can afford to keep it unknown, the solution isto attempt to model it.

Three (sub)models are used to this aim. They have essential differences and essential common points. Thedifferences come from different assumptions with respect to composition of the last common ancestor. Themain common point is the method to compute measures (distances and similarities) between strings, whichonly takes into account their common composition, as identified by the composition of the last commonancestor. Speaking about the differences, the three models have the following features. In the exemplarmodel , the last common ancestor is assumed to contain exactly one member of each gene family which hasmembers both on G and H. In the intermediate model , the last common ancestor is assumed to contain at leastone members of each gene family which is common to G and H. Finally, in the full model, the last commonancestor is assumed to contain as many members as possible from any gene family.

We want to draw the attention of the reader on the fact that the intermediate introduce a level of difficulty.Indeed, observe that for the exemplar and full models, we know in advance the size of the resultingpermutations as we keep exactly one gene or as many genes as possible from each gene family. The situationis different for the intermediate model as we do not know in advance how many genes of each gene familywill be kept in an optimal solution.

Page 91: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

78

The history of these three models starts with Sankoff’s paper [Sankoff, 1999], who put the basis ofthe exemplar model and, in the same time, of the most general match-and-prune model. Besides itsbiological motivations, Sankoff’s idea has two important features, that make it attractive. Reducing to onethe cardinality of each gene family in each string implies that (1) the resulting strings are permutations, andcomputing distances on permutations is both already studied and often polynomial; and (2) the one-to-onecorrespondence of genes in the same family on the two strings is obvious, and thus may avoid furthercomplications. The next model to be defined was the full model, whose first ideas are suggested in Sankoff’spaper [Sankoff, 1999] and that was more precisely defined by [Tang and Moret, 2003] for balanced strings.The most recent one is the intermediate model we have introduced in [Angibaud et al., 2006].

Definition 7.2.6 (Matching). Let G and H be two genomes. A matchingM between G and H is a set of pairs

M = (i1, j1), (i2, j2), . . . , (ik, jk) ⊆ P(1, 2, . . . , |G|× 1, 2, . . . , |H|)

such that

1. |G[i]| = |H[j]| for all pairs (i, j) ∈M, and

2. if (il, jl) and (il ′ , jl ′) are two distinct pairs ofM, then l 6= l ′.

The 2k genes G[i1],G[i2], . . .G[ik],H[i1],H[i2], . . .H[ik] are said to be saturated by the matchingM.

Notice here that we do allow G[i] and H[j], (i, j) ∈ M, to have opposite signs. In the sequel, it will beenough to focus on compatible genomes as defined below.

Definition 7.2.7 (Compatible genomes). Two genomes G and H are said to be compatible if F(G) = F(H).

Definition 7.2.8 (Exemplar, intermediate and full matching). A matchingM between two compatible genomesG and H is an intermediate matching ifM saturates at least one gene of each gene family of F(G) = F(H). Anintermediate matching is called an exemplar matching (resp. full matching) if it is of minimum (resp. maximum)cardinality.

Roughly speaking, intermediate matchings correspond to standard matchings in graphs (in bipartitegraphs here) whereas exemplar and full matchings have additional constraints. In other words, exemplarand full matchings are in intermediate matchings.

Example 4 Consider the two following compatible genomes G and H:

G = 1 2 − 4 − 2 3 1 4 − 3 4

H = 4 1 − 3 − 2 2 1 2 4.

with F(G) = 1, 2, 3, 4 = F(H).

1. The matchingM = (1, 2), (2, 4), (5, 3), (6, 6), (7, 8) is an intermediate matching, and the pruned genomesGM and HM induced byM reduce to (indices a and b are used to disambiguate the induced mapping andclarify the presentation):

GM = 1a 2 3 1b 4

HM = 1a − 3 − 2 1b 4.

The associated permutations (according to a relabeling so that πG,M = ι) are thus given by:

πG,M = (1 2 3 4 5)

πH,M = (1 − 3 − 2 4 5).

Page 92: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

79

2. The matchingM ′ = (1, 2), (2, 4), (5, 3), (9, 8) is an exemplar matching. The pruned genomes GM ′ andHM ′ induced byM ′ reduce to:

GM ′ = 1 2 3 4

HM ′ = 1 − 3 − 2 4.

The associated permutations (according to a relabeling so that πG,M ′ = ι) are thus given by:

πG,M ′ = (1 2 3 4)

πH,M ′ = (1 − 3 − 2 4).

3. The matchingM ′′ = (1, 6), (2, 7), (3, 1), (4, 5), (5, 3), (6, 2), (9, 8) is a full matching. The pruned genomesGM‘ ′ and HM‘ ′ induced byM‘ ′ reduce to (once again, indices a and b are used to disambiguate the inducedmapping):

GM ′′ = 1a 2a − 4a − 2b 3 1b 4b

HM ′′ = 4b 1b − 3 2b 1a 2a 4a

The associated permutations (according to a relabeling so that πG,M ′′ = ι) are thus given by:

πG,M ′′ = (1 2 − 3 − 4 5 6 7)

πH,M ′′ = (7 6 5 4 1 2 3).

It is now a simple matter to see that exemplar, intermediate and full matchings coincide if deg(G) = 1 ordeg(H) = 1, i.e., G or H are duplication-free.

Definition 7.2.9 (Mexpl(G,H),Mint(G,H), andMfull(G,H)). For any two compatible genomes G and H, we letMexpl(G,H) (resp. Mint(G,H),Mfull(G,H)) stand for the set of all exemplar (resp. intermediate, full) matchingsbetween G and H.

The following definition will facilitate the exposition of subsequent sections.

Definition 7.2.10 (Πexpl(G,H), Πint(G,H), and Πfull(G,H)). For any two compatible genomes G and H, we defineΠexpl(G,H), Πint(G,H) and Πfull(G,H) by:

Πexpl(G,H) = (πG,M, πH,M) :M∈Mexpl(G,H),

Πint(G,H) = (πG,M, πH,M) :M∈Mint(G,H), andΠfull(G,H) = (πG,M, πH,M) :M∈Mfull(G,H).

The set Πexpl(G,H) (resp. Πint(G,H), Πfull(G,H)) is thus the set of all permutations that correspond tovalid exemplar (resp. intermediate, full) matchings between G and H.

7.3 Comparing two compatible genomes

7.3.1 Introduction

We are now ready to compare genomes with respect to the match-and-prune model. Given two genome Gand H, our steps will be

1. find an matching (exemplar, intermediate or full)M between G and H,

Page 93: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

80

2. Construct the associated permutations πG,M and πH,M, and

3. Compute the distance, similarity or dissimilarity measure we are interested in between πG,M andπH,M.

Let π and σ be two signed permutations. We will focus in this section on the following standard measures:

• the breakpoint distance between π and σ, denoted BK(π, σ), is the number of breakpoints between πl

and σl.

• the signed reversal distance between π and σ, denoted SR(π, σ), is the minimum number of signedreversals to transform π into σ, where a signed reversal is the operation of reversing an interval of π(together with signs).

• the adjacency similarity between π and σ, denoted ADJ(π, σ), is the number of adjacencies between πl

and σl.

• the common intervals similarity between π and σ, denoted CI(π, σ), is the number of common intervalsbetween π and σ.

• the MAD and SAD numbers whose precise definitions are deferred to the related subsection.

7.3.2 Breakpoint distance

Together with the signed reversal distance, the breakpoint distance is one of the first applications of Sankoff’sexemplar model. The two distances are defined on different bases: the reversal distance counts a minimumnumber of operations to transform one genome into another one, whereas the breakpoint distance countsthe structural differences between the two genomes. However, as we shall see soon, they are closely related.

Computing the breakpoint distance between two compatible signed genomes G and H reduces to findinga (exemplar, intermediate or full) matching between these two genomes that induces a minimum number ofbreakpoints between the two associated permutations πG and πH.

Definition 7.3.1. Let G and H be two signed compatible genomes. The measures BKexpl, BKint and BKfull of G and Hare defined by:

BKexpl(G,H) = minBK(πG, πH) : (πG, πH) ∈ Πexpl(G,H),

BKint(G,H) = minBK(πG, πH) : (πG, πH) ∈ Πint(G,H), andBKfull(G,H) = minBK(πG, πH) : (πG, πH) ∈ Πfull(G,H).

The measure BKexpl(G,H) has been introduced in [Sankoff, 1999], and the measure BKfull(G,H) in [Blinet al., 2004]. As for the measure BKint(G,H), we have introduced it in [Angibaud et al., 2006]. In 2000, Bryanthas shown that computing any of BKexpl(G,H), BKint(G,H) and BKfull(G,H) is an NP-complete problem,even for pegged genomes G and H [Bryant, 2000]. Notice that Bryant‘s proof does not actually need toconsider three separate cases as it holds even if deg(G) = 1 and deg(H) = 2. Nguyen has strengthened theseresults by proving that computing any of BKexpl(G,H) and BKint(G,H) is an NP-complete problem even ifboth G and H are unsigned pegged genomes [Nguyen, 2005]. The strongest inapproximability result knownso far is ours.

Proposition 7.3.2 ([Angibaud et al., 2008b]). Computing any of the three measures BKexpl(G,H), BKint(G,H) andBKfull(G,H) is an APX-hard problem.

Page 94: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

81

The proof of Proposition 7.3.2 actually holds for deg(G) = 1 and deg(H) = 2, and hence does not need toconsider three separate cases.

Chen et al. [Chen et al., 2006] have shown that there exists a constant c > 0 such that there does not exist apolynomial-time algorithm with performance guarantee c log(n), n = max|G|, |H|, to compute BKexpl(G,H)and BKint(G,H). Of particular importance in this context, they have also shown that deciding whetherBKexpl(G,H) = 0 is an NP-complete problem even if deg(G) = 3 and deg(H) = 3. The same result holds ifwe replace BKexpl(G,H) = 0 by BKint(G,H) = 0. We have first completed the result of [Chen et al., 2006] in[Angibaud et al., 2008b] before proving a stronger result.

Proposition 7.3.3 ([Blin et al., 2009b]). Deciding whether equality BKexpl(G,H) = 0 holds is an NP-completeproblem even if deg(G) = 2 and deg(H) = 2.

Notice that the above Proposition holds if we replace BKexpl(G,H) = 0 by BKint(G,H) = 0. The aboveproposition carries definitive implications for research design in the form of the following corollary.

Corollary 7.3.4 ([Blin et al., 2009b]). There does not exist any approximation algorithm to compute BKexpl(G,H) orBKint(G,H), even if deg(G) = 2 and deg(H) = 2.

The above corollary gains in interest if we notice that it precisely defines the inapproximability landscape.Indeed, if deg(G =) = 1 or deg(H) = 1, it can be shown that deciding whether equality BKexpl(G,H) = 0holds is linear-time solvable. In other words, Proposition 7.3.3 is tight.

We mention to finish two results in this context that may be of independent interest. We have shown in[Angibaud et al., 2008b] that (i) deciding whether BKfull(G,H) = 0 holds is solvable in O(nm log log(nm))time, where n = |G| and m = |H|, and that (ii) computing BKexpl(G,H) and BKint(G,H) for two genomesG and H such that deg(G) = 2 and deg(H) = 2 is solvable in O(poly(k) 1.61822k) time, where k is upper-bounded by the number of gene families that occur exactly twice in G and in H.

7.3.3 Signed reversal distance

The signed reversal distance is the second distance considered by [Sankoff, 1999] to illustrate his theory ofexemplar distances. Under the full model, the signed reversal distance is very well studied on balancedstrings and not studied at all on general strings.

Computing the signed reversal distance between two compatible signed genomes G and H reduces tofinding a (exemplar, intermediate or full) matching between these two genomes that induces a minimumsigned reversal distance between the two associated permutations πG and πH.

Definition 7.3.5. Let G and H be two signed compatible genomes. The distances SRexpl, SRint and SRfull of G and Hare defined by:

SRexpl(G,H) = minSR(πG, πH) : (πG, πH) ∈ Πexpl(G,H),

SRint(G,H) = minSR(πG, πH) : (πG, πH) ∈ Πint(G,H), andSRfull(G,H) = minSR(πG, πH) : (πG, πH) ∈ Πfull(G,H).

The distance SRexpl has been introduced in [Sankoff, 1999], and the distance SRfull in [Chen et al., 2005].As far as we know, no specific result exists for the intermediate model.

Bryant has shown that computing DRexpl(G,H) is an NP-complete problem even if G and H are pegged,and deg(G) = 2 and deg(H) = 2 [Bryant, 2000]. It turns out that the inapproximability results we gave forthe breakpoint distance propagate to the signed reversal distance. Indeed, if G and H are two compatiblegenomes, then 2 SRexpl(G,H) ≤ BKexpl(G,H) ≤ SRexpl(G,H) (we refer the reader to our monograph [Fertinet al., 2009a] for an elementary proof).

Page 95: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

82

7.3.4 Adjacency similarity

We consider in this subsection a similarity measure which is the complement of the breakpoint distance. Thebasis of this measure is the preserved adjacency between two consecutive characters in G and H.

Computing the adjacency similarity between two compatible signed genomes G and H reduces to findinga (exemplar, intermediate or full) matching between these two genomes that induces a maximum number ofadjacencies between the two associated permutations πG and πH.

Definition 7.3.6. Let G and H be two signed compatible genomes. The measures ADJexpl, ADJint and ADJfull of Gand H are defined by:

ADJexpl(G,H) = minADJ(πG, πH) : (πG, πH) ∈ Πexpl(G,H),

ADJint(G,H) = minADJ(πG, πH) : (πG, πH) ∈ Πint(G,H), andADJfull(G,H) = minADJ(πG, πH) : (πG, πH) ∈ Πfull(G,H).

We have introduced the similarities ADJexpl and ADJfull in [Angibaud et al., 2007a] and ADJint in [An-gibaud et al., 2008a]. Chen et al. [Chen et al., 2007b] have proved that computing any of ADJexpl(G,H),ADJint(G,H) and ADJfull(G,H) is an NP-complete problem and is not approximable within ratio n1−ε evenwhen deg(G) = 1 and deg(H) = 2. The problem is also known to W[1]-hard. For restricted instances (i.e.,full matching and balanced genomes), we have obtained the following approximation results.

Proposition 7.3.7 ([Angibaud et al., 2008b]). Let G and H be two balanced genomes with deg(G) = deg(H) = k.If k = 2, there exists an algorithm to compute ADJfull(G,H) with performance ratio 1.1442. If k = 3, there exists analgorithm to compute ADJfull(G,H) with performance ratio 3+ ε for any ε > 0. Finally, for general k, there exists analgorithm to compute ADJfull(G,H) with performance ratio 4.

It is worth noticing that the above ratio 4 uses 2-intervals (more precisely a result of [Crochemore et al.,2008]) thereby fueling our arguments on the importance of 2-intervals in the design of approximationalgorithms.

7.3.5 Common intervals similarity

Common intervals are a natural generalization of adjacencies, as they identify subsets of characters thatappear contiguously, but possibly in a different order, in both genomes.

Computing the common intervals similarity between two compatible signed genomes G and H reducesto finding a (exemplar, intermediate or full) matching between these two genomes that induces a maximumnumber of common intervals between the two associated permutations πG and πH.

Definition 7.3.8. Let G and H be two signed compatible genomes. The similarities CIexpl, CIint and CIfull of G and Hare defined by:

CIexpl(G,H) = minCI(πG, πH) : (πG, πH) ∈ Πexpl(G,H),

CIint(G,H) = minCI(πG, πH) : (πG, πH) ∈ Πint(G,H), andCIfull(G,H) = minCI(πG, πH) : (πG, πH) ∈ Πfull(G,H).

Computing the number of common intervals of two permutations on n elements is done inO(n+k) timeby [Uno and Yagiura, 2000], where k is the number of common intervals. Notice that [Heber and Stoye, 2001]achieve the same running time for q ≥ 3 permutations (in this case, n is the total size of the q permutations).

We have introduced the similarities CIexpl and CIfull in [Blin et al., 2007b] and CIint in [Angibaud et al.,2008b] (common intervals are indeed quite common in comparative genomics). Unfortunately, we are onlyable to prove inapproximability.

Page 96: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

83

Proposition 7.3.9 ([Blin et al., 2007b]). Computing any of the two three measures CIexpl(G,H), CIint(G,H) andCIfull(G,H)S is an APX-hard problem.

Notice that the above (negative) result holds even if deg(G) = 1 and deg(G) = 2. No positive results areknown.

7.3.6 Dissimilarity measures MAD and SAD

The measures to estimate the (dis)similarity between two permutations that we have mentioned so far fallinto two categories: they estimate either the distance or the similarity between the two permutations. Thetwo measures considered here (MAD and SAD), both defined by Sankoff and Haque [Sankoff and Haque,2005], belong to neither category. Unlike distances, their value is never zero, and unlike similarities, theirvalue grows as the dissimilarity of the permutations grows.

Intuitively, MAD and SAD measure how far genes have to move from their initial position in one genomein order to yield the other genome, and this measure focuses either on each gene (MAD) or on all genesaltogether (SAD).

For the sake of presentation, we introduce a new notation. For any two permutations π, σ ∈ Sn, we letπσ stand for the permutation obtained from σ by renaming the elements of π so as to obtain the identitypermutation ι, and then renaming the elements of σ accordingly.

Example 5 For π = (1 3 5 2 4) and σ = (5 3 2 1 4), we obtain πσ = (4 2 1 3 5) and σπ = (3 2 4 1 5).

Definition 7.3.10 (Maximum Adjacency Disruption (MAD)). The dissimilarity measure MAD between twopermutations π, σ ∈ Sn, denoted MAD(π, σ), is defined by:

MAD(π, σ) = max1≤i≤n−1

max|σπi − σπi+1|, |πσi − πσi+1|.

Intuitively, the MAD dissimilarity measure of two permutations π and σ is the largest gap between twoconsecutive elements in σπ and πσ. Notice that MAD(π, π) = 1 for all π ∈ Sn.

Example 6 Pursuing Example 5, i.e., π = (1 3 5 2 4) and σ = (5 3 2 1 4), we get MAD(π, σ) =maxmax1, 2,max2, 1,max3, 2,max4, 2 = 4.

Definition 7.3.11. Let G and H be two compatible genomes. The dissimilarity measures MADexpl, MADint andMADfull of G and H are defined by:

MADexpl(G,H) = minMAD(πG, πH) : (πG, πH) ∈ Πexpl(S, T),

MADint(G,H) = minMAD(πG, πH) : (πG, πH) ∈ Πint(S, T), etMADfull(G,H) = minMAD(πG, πH) : (πG, πH) ∈ Πfull(S, T).

We have obtained the following result.

Proposition 7.3.12 ([Blin et al., 2007b]). Unless P = NP, there does not exist an approximation algorithm withperformance guarantee 2− ε, for any ε > 0, to compute the dissimilarity measures MADexpl(G,H), MADint(G,H)and MADfull(G,H).

The above proposition holds even if deg(G) = 1 and deg(H) = 9. It is worth mentioning that 2 − ε isnot the best bound. Indeed, for the sake of proof simplicity, the proof of Proposition 7.3.12, as presented in[Blin et al., 2007b], does not use the PCP theorem. However, resorting to the PCP theorem (more precisely,resorting to the non-approximation of a restriction of the MAX 3-SAT problem), we can show that there existsa constant c > 2 such that there does not exist an approximation algorithm with performance guarantee cto compute the dissimilarity measures MADexpl(G,H), MADint(G,H) and MADfull(G,H) (the exact value

Page 97: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

84

of c has not been precisely computed). No algorithmic positive result to compute the MAD dissimilaritymeasure is known so far, but we conjecture Proposition 7.3.12 to be far from being tight.

We now turn to considering the SAD dissimilarity measure that takes into account all differences betweenconsecutive elements.

Definition 7.3.13 (Summed Adjacency Disruption (SAD)). The SAD dissimilarity measure of two permutationsπ, σ ∈ Sn, denoted SAD(π, σ), is defined by:

SAD(π, σ) =

n−1∑i=1

(|σπi − σπi+1|+ |πσi − πσi+1|

).

Intuitively, the SAD dissimilarity measure is the sum of all gaps between two consecutive elements in σπ

and πσ. Notice that SAD(π, π) = 2(n− 1) for all π ∈ Sn.

Example 7 For π = (1 3 5 2 4) and σ = (5 3 2 1 4), we obtain SAD(π, σ) = (1+2)+(2+1)+(3+2)+(4+2) = 17.

Definition 7.3.14. Let G and H be two compatible genomes. The dissimilarity measures SADexpl, SADint and SADfullof G and H are defined by:

SADexpl(G,H) = minSAD(πG, πH) : (πG, πH) ∈ Πexpl(S, T),

SADint(G,H) = minSAD(πG, πH) : (πG, πH) ∈ Πint(S, T), etSADfull(G,H) = minSAD(πG, πH) : (πG, πH) ∈ Πfull(S, T).

Not surprisingly, computing the SAD dissimilarity measure turns out the be more difficult than com-puting the MAD dissimilarity measure (as long as we cannot prove stronger inapproximability for thelatter).

Proposition 7.3.15 ([Blin et al., 2007b]). Unless P = NP, there exists a constant c > 0 such that no approximationalgorithm with performance guarantee c logn is achievable to compute the dissimilarity measures SADexpl(G,H),SADint(G,H) and SADfull(G,H).

Once again, no positive algorithmic result to compute the dissimilarity measure SAD is known sofar. Whereas Sankoff and Haque [Sankoff and Haque, 2005] claim that MAD and SAD are relevant forcomparative genomics, algorithmic cannot help much here.

How related are the different measures discussed in Section 7.3? Do they answer differentquestions? Do they answer different parts of a research question? There are some of manyimportant questions left open. Our woks [Angibaud et al., 2008a, 2007a,b] only provide partialanswers.At a more general level, our contributions are only algorithmic and actually we didn’t introduceany new measure but express in our terms measures that are used by the comparative genomicscommunity. We might regret this abundance of measures to the detriment of comparative analyses.As an example, MAD and SAD turn out to be very complicated measures (from an algorithmicpoint of view) but, to the best of our knowledge, no evidence has proved or disprove the benefitof such tortuous measures.

Page 98: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

85

7.4 Exact algorithms and heuristics

7.4.1 Boolean linear programming

We have considered in [Angibaud et al., 2006], [Angibaud et al., 2007a] and [Angibaud et al., 2008a] exactalgorithms for computing various genome rearrangement distances. It is worth noticing that the rationalfor considering such an approach was not to propose efficient general algorithms but to compute for areference dataset a bunch of exact results new heuristic approaches can confront with. Our approach wasby pseudo-boolean programming (linear integer programming where all variables are restricted to takevalues of either 0 or 1). All our experiments were conducted with Minisat+ [Een and Sorensson., 2006] andILOG CPLEX (http://www.ilog.com/products/cplex). This part was the subject of the PhD thesisof Annelyse Thevenin (defended November 2009).

To illustrate our approach, we present in Figure 7.1 a pseudo-boolean program to compute the minimumnumber of breakpoints between two genomes under the full matching viewpoint [Angibaud et al., 2007a]. Werefer the reader to [Angibaud et al., 2006], [Angibaud et al., 2007a] and [Angibaud et al., 2008a] for a thoroughdiscussion on pseudo-boolean-programming for genome comparison. In particular, we investigated in thesepapers the impact of the choice of the matching (exemplar, covering or maximum) on a γ-proteobacteriadataset [Lerat et al., 2003].

7.4.2 Heuristic approaches

We focus in this subsection on heuristics to deal with the aforementioned problems, and the motivationfor our choice to present heuristics rather than exact algorithms (or rather than both heuristics and exactalgorithms) relies on their universality. These heuristics are identical for all distances and need only minorchanges to go from one model to another one.

The three heuristics presented here use the notion of a longest common substring, up to a complete reversaland are all based on the following easy idea. Assuming temporarily that one aims at finding a full matchingbetween G and H which intuitively preserves the most conserved regions between the two strings, an easyway to have such a matching is given by Algorithm ILCS where we assume that each longest commonsubstring found on G and H is identified by one precise occurrence on each of G and H. Figure 7.2 shows anexample.

Algorithm 1: ILCS Heuristic (full matching)Data: Two genomes G and H.Result: A matching between G and H.1 Compute a longest common substring L of G and H, up to a complete reversal, exclusively made of unmatchedcharacters from G and H.2 Match the characters of G and H belonging to the occurrences of L according to their positions in L.3 Iterate the process until all possible characters have been matched.4 Remove all unmatched characters.return The required distance on the resulting permutations

As far as we know, this idea appeared in [Tichy, 1984] and was often used since. In [Angibaud et al., 2007b]and [Angibaud et al., 2008a] we proceeded to a large number of time consuming distance computations, andnoticed that even small changes in the ILCS algorithm might improve both the execution time, the quality ofthe result and its applicability to various models. The Algorithm IILCS we proposed in [Angibaud et al.,2007b] is such a variant of ILCS where the removal of characters which cannot be matched is done before

Page 99: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

86

starting a new iteration.

Algorithm 2: IILCS Heuristic (exemplar/intermediate/full matching)Data: Two genomes G and H.Result: A matching between G and H.1 Compute a longest common substring L of G and H, up to a complete reversal, exclusively made of unmatchedcharacters from G and H.2 Match (all or part of) the characters of G and H belonging to the occurrences of L according to their positions in L,so as to fit the exemplar/intermediate/full model constraints.3 Iterate the process until all possible characters have been matched.4 Remove all unmatched characters.return The required distance on the resulting permutations

This new heuristic allows to obtain in step 2 one or several pairs of matched characters in each genefamily, according to the model, and to discard in step 3 all characters which become useless. Behind theflexibility introduced by this variant of ILCS in regard to the model, an improvement of the results mayalso be expected as IILCS better takes into account the final goal of matching characters, which is to identifyas many conserved regions as possible in the resulting pruning, and not in G and H. Indeed, the resultingpruning has consecutive characters which were not consecutive in the initial strings, and thus has conservedregions which were possibly not conserved in the initial strings. The early removal of characters by IILCSallows non-neighboring characters on G and H to become neighbors at the end of some iteration, if thecharacters between them are not matched. New longest common substrings may then be formed in this way,thus improving the identification of common regions in the final pruning.

The argument these heuristics relies on is that long common substrings are strongly conserved regionsthat strongly affect the values of all measures, either distances or similarities. Such an argument is supportedby the good performances of these heuristics (see below), but cannot be invoked when the longest commonsubstrings are short, i.e., not exceeding some given length `. Consequently, it could be a reasonable ideato stop the execution of the IILCS heuristic when the threshold ` is reached for the length of the longestcommon substring, and then to apply some exact (and thus exponential) algorithm to optimally match theremaining characters according to the problem P to solve. Problem P is defined by the measure to compute,and the model to use. This idea yields the hybrid method we have developed in [Angibaud et al., 2007b].

The IILCS and hybrid heuristics were systematically evaluated together with ILCS in [Angibaud et al.,2007b] and [Angibaud et al., 2008a] on several problems and data sets for which exact results are known(these results are actually ours, see 7.4.1). These evaluations show that our heuristics perform very well onexperimental data.

Page 100: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

87

Program Max-Matching-Breakpoint

Objective :

Maximize∑

0≤i<nG

∑i<j≤nG

∑0≤k<nH

∑k<l≤nH

d(i, j, k, l)

Constraints :

(C.01) ∀ 1 ≤ i ≤ nG,∑

1≤k≤nH

|G[i]|=|H[k]|

a(i, k) = bG(i)

∀ 1 ≤ k ≤ nH,∑

1≤i≤nG

|G[i]|=|H[k]|

a(i, k) = bH(k)

(C.02) ∀ X ∈ G,H, ∀ g ∈ A,∑

1≤i≤nX|X[i]|=|g|

bX(i) = min|G|g, |H|g

(C.03) ∀ X ∈ G,H, ∀1 ≤ i ≤ j − 1 < nX, cX(i, j) +∑

i<p<j

bX(p) ≥ 1

(C.04) ∀ X ∈ G,H, ∀ 1 ≤ i < p < j ≤ nX, cX(i, j) + bX(p) ≤ 1

(C.05) ∀ 1 ≤ i < j ≤ nG, ∀ 1 ≤ k < l ≤ nH,

such that G[i] = H[k] and G[j] = H[l],a(i, k) + a(j, l) + cG(i, j) + cH(k, l) − d(i, j, k, l) ≤ 3

(C.06) ∀ 1 ≤ i < j ≤ nG, ∀ 1 ≤ k < l ≤ nH,

such that G[i] = H[k] and G[j] = H[l],a(i, k) − d(i, j, k, l) ≥ 0a(j, l) − d(i, j, k, l) ≥ 0cG(i, j) − d(i, j, k, l) ≥ 0cH(k, l) − d(i, j, k, l) ≥ 0

(C.07) ∀ 1 ≤ i < j ≤ nG, ∀ 1 ≤ k < l ≤ nH,

such that G[i] = −H[l] and G[j] = −H[k],a(i, l) + a(j, k) + cG(i, j) + cH(k, l) − d(i, j, k, l) ≤ 3

(C.08) ∀ 1 ≤ i < j ≤ nG, ∀ 1 ≤ k < l ≤ nH,

such that G[i] = −H[l] and G[j] = −H[k],a(i, l) − d(i, j, k, l) ≥ 0a(j, k) − d(i, j, k, l) ≥ 0cG(i, j) − d(i, j, k, l) ≥ 0cH(k, l) − d(i, j, k, l) ≥ 0

(C.09) ∀ 1 ≤ i < j ≤ nG, ∀ 1 ≤ k < l ≤ nH,

such that |G[i]|, |G[j]| 6= |H[k]|, |H[l]| ou G[i] − G[j] 6= H[k] −H[l],d(i, j, k, l) = 0

(C.10) ∀ 1 ≤ i < j ≤ nG,∑

1≤k<nH

∑k<l≤nH

d(i, j, k, l) ≤ 1

Domains :

∀ X ∈ G,H, ∀ 1 ≤ i < j ≤ nG,∀ 1 ≤ k < l ≤ nH,

a(i, k), bX(i), cX(i, k), d(i, j, k, l) ∈ 0, 1

Figure 7.1: Program Max-Matching-Breakpoint to compute ADJfull(G,H), where nG and nH denote thesize of G and H, respectively.

Page 101: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

88

ILCS: G = −1 2 5 3 5 −3 −2 −1 4

H = −3 −2 −5 −2 1 3 −3 −2 1 −1 −4

1 3 2 4 5

2 1 3 4 5

Result: G ′ = −1 2 5 3 ′ −3 −2 ′ −1 4

H ′ = −3 −2 ′ −5 −2 1 3 ′ 1 ′ −4

IILCS: G = −1 2 5 3 5 −3 −2 −1 4

H = −3 −2 −5 −2 1 3 −3 −2 1 −1 −4

1 2 3 4

1 2 3 4

Result: G ′ = −1 2 5 3 −3 ′ −2 ′ −1 ′ 4

G ′ = −5 −2 1 3 −3 ′ −2 ′ 1 ′ −4

HYBP(2): G = −1 2 5 3 5 −3 −2 −1 4

H = −3 −2 −5 −2 1 3 −3 −2 1 −1 −4

1 2 3 3

1 2 3 3

Result: G ′ = −1 2 5 3 −3 ′ −2 ′ −1 ′ 4

H ′ = −5 −2 1 3 −3 ′ −2 ′ −1 ′ −4

Figure 7.2: Execution and results of the three heuristics, seeking for a full matching, on the strings G =−1 2 5 3 5 − 3 − 2 − 1 4 and H = −3 − 2 − 5 − 2 1 3 − 3 − 2 1 − 1 − 4. As an example, the problem P inthe HYB heuristic asks to compute the conserved intervals similarity. The circled numbers indicate in whichorder the LCS were identified, except for the 3© in the HYB heuristic which signifies that the matchings weredecided simultaneously by the exact algorithm evoked in the last step of the HYB heuristic.

Page 102: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

8Exemplar common subsequences

8.1 Introduction

This short chapter is devoted to presenting some results on exemplar longest common subsequences.In the genome rearrangement domain, gene duplication is rarely considered as, as we have seen, it

usually makes the problem at hand harder (see Chapter 7 for a patent illustration). Sankoff [Sankoff, 1999]proposed the so-called exemplar model, which consists in searching, for each family of duplicated genes,an exemplar representative in each genome. In biological terms, the exemplar gene may correspond tothe original copy of the gene, which later originated all other copies. Following the parsimony principle,the choices of exemplars should be made so as to minimize the reversal distance between the two simplerversions of both genomes, composed only by the exemplar genes. An alternative to the exemplar modelis the multigene family model, which consists in maximizing the number of paired genes among a family.Again, the gene pairs should be chosen so as to minimize the reversal distance between the genomes. Bothexemplar and multigene models were proven to lead to NP-hard problems [Bryant, 2000; Blin et al., 2004].

To compare two sequences, we have proposed in [Bonizzoni et al., 2007] to study a similarity measurethat takes into account the concept of exemplar genes. Observe that a repetition-free LCS can be seen as theedit distance between two sequences where only deletions are allowed and, furthermore, for each familywith k duplicated genes, at least k1 of them must be deleted [Sadique Adi et al., 2008]. At a more generallevel, we considered both mandatory and optional letters and the measure we have proposed is the length ofa constrained common subsequence (LCS) (see [Bergroth et al., 2000; Crochemore et al., 2007]) subject to twoconstraints: (i) the common LCS is required to contains exactly or at least one occurrence of each mandatoryletter, and (ii) the common LCS is required to contains at most one occurrence of each optional letter (thisconstraint may be relaxed). Additional restrictions on optional letters allow us for varying restriction onthe solution we are looking for. Notice that a complete treatment of pure exemplar LCS has been recentlyproposed in [Sadique Adi et al., 2008].

8.2 Definitions

We define four variants of the EXEMPLAR-LCS problem we are interested in. These different variants considerdifferent situations: each mandatory letter is required to occur exactly or at most once in the common LCS,and the number of occurrences of each optional letter in the common LCS may be upper-bounded.

89

Page 103: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

90

Exemplar-LCS-i

• Input : Two strings u and v over alphabet A = Ao∪Am, where Ao is the set of optional lettersand Am is the set of mandatory letters.• Solution : A common subsequence w of u and v such that (depending on i):

i = 1: w contains exactly one occurrence of each letter in Am and at most one occurrence ofeach letter in Ao,

i = 2: w contains at least one occurrence of each letter in Am and at most one occurrence ofeach letter in Ao,

i = 3: w contains exactly one occurrence of each letter in Am,

i = 4: w contains at least one occurrence of each letter in Am.

• Measure : The length of the common subsequence, i.e., |w|.

Notice, that, as we shall see soon, whereas the classical LCS problem is well-known to be polynomial-time solvable for two stings [Bergroth et al., 2000; Crochemore et al., 2007], adding various constraints onmandatory and optional letters result in much harder problems (not a big surprise).

8.3 Key results

We summarize in this section the results we have obtained for the EXEMPLAR-LCS-i, 1 ≤ i ≤ 4, problem.

Proposition 8.3.1 ([Bonizzoni et al., 2007]). Both the EXEMPLAR-LCS-1 problem and the EXEMPLAR-LCS-2problem are APX-hard.

It is worth noticing that Proposition 8.3.1 holds even if each letter occurs at most twice in both u andv. Observe that, even if these two problems are very similar in their definition, two distinct proofs wereneeded.

Strongly related to the the EXEMPLAR-LCS-i, 1 ≤ i ≤ 4, problems are the one of determining whetherany feasible solution does exist. Let us focus on the EXEMPLAR-LCS-4 problem, i.e., find for a commonsubsequence w that contains at least one occurrence of each mandatory letter. Clearly, optimality aside,the existence of a solution for the EXEMPLAR-LCS-4 problem implies the existence of a solution for allEXEMPLAR-LCS-i problems.

Proposition 8.3.2 ([Bonizzoni et al., 2007]). Let (u, v) be an instance of the EXEMPLAR-LCS-i problem, 1 ≤ i ≤ 4.If |u|a + |v|a ≤ 3 for each letter a ∈ A, then there exists a polynomial-time algorithm to decide whether there exists afeasible solution.

Proposition 8.3.3 ([Bonizzoni et al., 2007]). Let (u, v) be an instance of the EXEMPLAR-LCS-i problem, 1 ≤ i ≤ 4.If |u|a + |v|a ≤ 3 for each letter a ∈ A, deciding whether there exists any feasible solution is NP-complete.

This latter result have a definitive consequence on the approximability of the EXEMPLAR-LCS-i problemwhen each mandatory letter occurs at most 3 times in each input string as it rules out any polynomial-timeapproximation algorithm.

In the light of the above negative results, we have considered in [Bonizzoni et al., 2007] parameterizedissues of the EXEMPLAR-LCS-3 and EXEMPLAR-LCS-4 problems when the parameter is the number ofmandatory letters, i.e., |Am|.

Page 104: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

91

Proposition 8.3.4 ([Bonizzoni et al., 2007]). Both the EXEMPLAR-LCS-3 and the EXEMPLAR-LCS-4 problemsare solvable in O(m2mn2) time, wherem = |Am| and n = max|u|, |v|.

Our fixed-parameter algorithm for the EXEMPLAR-LCS-3 problem has been implemented and testedon randomly generated strings. Whereas the running time is acceptable, the space complexity (that growsexponentially in the size of Am) makes the algorithm not practical for |Am| ≥ 20.

Page 105: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

92

Page 106: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Part IV

Additional topics

93

Page 107: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular
Page 108: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Introduction

This part is devoted to presenting two additional contributions to computational molecular biology. Thefirst one is concerned with selenocysteine-like insertion and more precisely the problem of computing anmRNA sequence of maximum codon-wise similarity to a given mRNA (and hence, to a given protein) thatadditionally satisfies some structure constraints. We do not write secondary structure, i.e., pseudoknot-freesecondary structures, intentionally as we shall consider pseudoknoted structures. The second problem isdevoted to covering strings by substrings. Our initial motivation came from a paper by Bodlaender et al.[Bodlaender et al., 1995], who described an application for this problem in the context of protein folding.

This part is organized as follows. Chapter 9 is devoted generalized selenocysteine insertion and Chap-ter 10 with covering strings by substrings. The two following chapters are independent and as self-containedas possible.

95

Page 109: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

96

Page 110: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

9Selenocysteine-like insertion

Contents9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 939.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949.3 Key results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

9.1 Introduction

Perhaps the most significant process in molecular biology known today is the transformation of geneticinformation encoded in DNA into proteins. In this process, segments of DNA are transcribed into messengerRNA (mRNA), which in turn serve as blueprints for manufacturing proteins. This protein blueprint isprovided by triplets of nucleotides known as codons, which compose the mRNA nucleotide sequence, whereeach codon encodes a specific amino acid. An mRNA is thus translated into a protein by reading eachof its codons in sequential fashion, and creating a chain of amino acids which forms the target protein.Recently, biologists found out that according to the folding structure of an mRNA molecule, a certain codonmight encode for different amino acids. This folding structure is captured in many ways, in what is calledthe mRNA secondary structure, the set of all hydrogen bonds, or base pairings, formed by the molecule’snucleotides.

In [Backofen et al., 2002], Backofen et al. introduced the problem of computing an mRNA sequence ofmaximum codon-wise similarity to a given mRNA (and consequently, to a given protein) that additionallysatisfies some secondary structure constraints, the so-called mRNA Structure Optimization (MRSO) problem.The initial motivation of the MRSO problem is concerned with selenocysteine insertion, i.e. generating newamino acid sequences containing selenocysteine. This rare amino acid was discovered as the 21-st amino acid[Boch et al., 1991], giving another clue to the complexity and flexibility of the mRNA translation mechanism.Selenocysteine is encoded by the UGA codon, which is usually a stop codon encoding the end of translation.It has been shown [Boch et al., 1991] that in case of selenocysteine, termination of translation is inhibited inthe presence of a sequence of nucleotides, the SECIS element, which forms a hairpin-like structure in the3 ′-region after the UGA codon (see Figure 9.1). It is even argued in [Backofen et al., 2002] that modifyingexisting proteins by incorporating selenocysteine instead of a catalytic cysteine is an important problem forcatalytic activity enhancement and X-ray crystallography.

Selenocysteine insertion is concerned with a restricted type of secondary structure, i.e., a secondary

97

Page 111: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

98

Figure 9.1: The translation of UGA into selenocysteine. Termination of translation is inhibited in the presenceof the SECIS element.

structure without pseudoknots, and for this type of structure the linear-time algorithm presented in [Backofenet al., 2002] provides an optimal solution. However, it is reasonable to assume that the discovery ofselenocysteine will lead to the discovery of several other amino acids of similar kind, some of which arelikely to require more complex secondary structures. Even today, similar problems occur in programmedframeshifts which allow to encode two different amino acid sequences in one mRNA sequence [Jacks et al.,1988; Jacks and Varmus, 1985]. This motivates the investigation of the MRSO problem for more elaboratesecondary structures (actually, this issues have been suggested in [Backofen et al., 2002; Bongartz, 2004]).

9.2 Preliminaries

An mRNA molecule is viewed as a string over the alphabet A = A,C,G,U, where A represents the fourdifferent types of nucleotides in the molecule. The pairs A,U, G,C, and G,U are known as complementarynucleotide pairs. Hydrogen bonds can only be formed between complementary nucleotides in an mRNAfolding. A codon of an mRNA sequence is a segment of three nucleotides, i.e., a string in A3. Thus, anmRNA sequence S = s1s2 . . . s3n is a concatenation of n consecutive codons, where the i-th codon of S iss3i−2s3i−1s3i.

Given a source mRNA sequence S = s1s2 . . . s3n, we consider the problem of evaluating the codon-wisesimilarity between S and another target mRNA sequence T = t1t2 . . . t3n. For this, we are provided witha set of n functions, F = f1, f2 . . . , fn, called similarity functions of S, such that for all i, 1 ≤ i ≤ n, eachfunction fi is of the form fi : Σ

3 → Q. Thus, fi assigns a value to the i-th codon of T according to its level ofsimilarity in comparison with the i-th codon of S. The total level of similarity between S and T is then givenby∑ni=1 fi(t3i−2t3i−1t3i). Note that given a set of similarity functions F = f1, f2, . . . , fn for S, one does not

need to know anything else about S in order to compute the similarity score of S and T .The structure constraints Γ ⊆ i, j : 1 ≤ i < j ≤ 3n for a target mRNA sequence T of length 3n, are

pairings between distinct integers in 1, 2, . . . , 3n. These represent necessary hydrogen bonds in the foldingof T . It is assumed that each nucleotide can pair with at most one other nucleotide in any folding, hence eachinteger appears in at most one pair in Γ . Furthermore, there are no pairs of the form i, i+ 1 or i, i+ 2 inΓ , for all i, 1 ≤ i ≤ 3n− 2. Given a set of structure constraints Γ ⊆ i, j : 1 ≤ i < j ≤ 3n, and an arbitrarytarget mRNA sequence T = t1t2 . . . t3n, we say that nucleotides ti and tj in T are compatible with respectto Γ , if either ti, tj is a complementary nucleotide pair or i, j /∈ Γ . The entire sequence T is compatiblewith respect to Γ , if all pairs of nucleotides in T are compatible with respect to Γ . We are now in position toformally define the MRNA STRUCTURE OPTIMIZATION (MRSO) problem we are interested in (see [Backofen

Page 112: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

99

et al., 2002]).

MRSO

• Input : A set F of n similarity functions for a source mRNA sequence of length 3n, and and aset of structure constraints Γ ⊆ i, j : 1 ≤ i < j ≤ 3n. .• Solution : A target mRNA sequence which is compatible with respect to Γ .• Measure : The similarity score of the target mRNA sequence with respect to F .

It will convenient to formalize the MRSO problem in a slightly different manner using graph-theoreticconcepts, as we shall consider Γ as a linear graph (see Definition 2.2.1 page 18) with 3n vertices which has amaximum degree of one. However, since we are really interested in codon-wise similarity, we use a moresuitable representation of Γ :

Definition 9.2.1 (Implied structure graph [Backofen et al., 2002]). Let Γ ⊆ i, j : 1 ≤ i < j ≤ 3n be a set ofstructure constraints for a target mRNA sequence of length 3n. The implied structure graph of Γ is the linear graphGΓ defined by:

V(GΓ ) = 1, 2, . . . , n, andE(GΓ ) = i, j : ∃x, y ∈ Γ : x ∈ 3i− 2, 3i− 1, 3i ∧ y ∈ 3j− 2, 3j− 1, 3j .

In this way, a vertex i in V(GΓ ) corresponds to the i-th codon of the target mRNA sequence, andi, j ∈ V(GΓ ) are connected in E(GΓ ) if there are any structure constraints in Γ between the i-th and j-thcodons of the sequence. Note that there can be at most three structure constraints between any pair ofcodons, therefore GΓ has maximum degree of three, i.e., it is a subcubic graph. Also note that, while thisrepresentation may seem lossy, Γ can be retained from GΓ by adding up to three labels for each edge inE(GΓ ).

Given a subset of vertices V ⊆ V(GΓ ), we let GΓ [V ] denote the subgraph of GΓ induced by V , i.e., thesubgraph with V as its vertex set, and E(GΓ ) ∩ (V×V) as its edge set. Similarly, given a subset of edgesE ⊆ E(GΓ ), GΓ [E] denotes the subgraph of GΓ with vertex set i | i, j ∈ E and edge set E. Also, we useGΓ [i, . . . , j] to denote the subgraph of GΓ induced by i, . . . , j ⊆ V(GΓ ). Two edges i, j and i ′, j ′ crossin GΓ if either i < i ′ < j < j ′ or i ′ < i < j ′ < j. Note that two crossing edges might not cross under adifferent ordering of V(GΓ ). If there exists an ordering of V(GΓ ) which introduces no edge crossings thenGΓ is outerplanar. Recall that if GΓ is outerplanar, the MRSO problem is solvable O(n) time [Backofen et al.,2002].

A codon assignment for GΓ is a mapping from some V ⊆ V(GΓ ) to Σ3. An assignment for a pair of verticesi, j ∈ V(GΓ ), i→ t3i−2t3i−1t3i and j→ t3j−2t3j−1t3j, is compatible with respect toGΓ , if either i, j /∈ E(GΓ )or ti ′ and tj ′ are complementary nucleotides for any i ′, j ′ ∈ Γ ∩ 3i− 2, 3i− 1, 3i× 3j− 2, 3j− 1, 3j. Moregenerally, an assignment φ : V → Σ3 for some V ⊆ V(GΓ ) is compatible with respect to GΓ , if for any i, j ∈ V ,the assignment i→φ(i) and j→φ(j) is compatible with respect to GΓ . Henceforth, we consider instances forthe MRSO problem of the form (GΓ ,F). Our goal in this setting is to find an assignment φ : V(GΓ )→ Σ3

(i.e., a target mRNA sequence T = φ(1)φ(2) . . . φ(n)), which is compatible with GΓ , and which maximizes∑ni=1 fi(φ(i)).For the MRSO problem, it has been shown in [Backofen et al., 2002] that there exists a linear-time

algorithm if the considered secondary structure corresponds to an outerplanar graph (as it is the case forselenocysteine insertion). In this sequel, we refer to this algorithm as AOP. For the general case, the problemwas proved to be NP-complete [Backofen et al., 2002], and Bongartz showed that in fact the problem isAPX-hard [Bongartz, 2004]. An algorithm for approximating the MRSO problem within ratio 2 is givenin [Backofen et al., 2002]. A slightly slower but somewhat simpler 4-approximation algorithm is given in[Bongartz, 2004]. We mention also that an extension of the MRSO problem, where insertions and deletionsare allowed in the amino acid sequence is presented in [Backofen and Busch, 2004].

Page 113: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

100

9.3 Key results

We shall be concerned with two natural parameters for the MRSO problem. These are the number ofcrossings edges in GΓ , and the number of degree three vertices in GΓ , denoted p(GΓ ) and q(GΓ ), respectively.

Our initial interest in parameters p(GΓ ) and q(GΓ ) stems from the fact that we strongly believe them tobe small for most practical applications. Consider parameter p(GΓ ), the number of edge crossings in GΓ(this parameter was previously suggested in [Bongartz, 2004]). Indeed, almost all currently known mRNAshave secondary structures which induce outerplanar formations, i.e., formations with no edge crossings.Furthermore, many secondary structure prediction algorithms restrict their search space to structures withbounded edge crossings, since prediction with unbounded edge crossings usually becomes NP-hard, andis anyhow assumed unnatural (see for instance [Akutsu, 2000]). As for parameter q(GΓ ), the number ofdegree three vertices, recall that a vertex of degree three in GΓ represents a codon with three nucleotides,each pairing with complementary nucleotides in three different codons. Although this situation can occur ina folding of an mRNA molecule, it can be expected to be quite rare due to the geometric constraints imposedon any such folding. Also, pairs of hydrogen bonds of the form i, j and i+ 1, j− 1, called stacking pairs,tend to contribute quite substantially to the overall stability of the folding structure of the mRNA [Ieonget al., 2003; Lyngsø and Pedersen, 2000]. A secondary structure is hence expected to have a relatively highnumber of stacking pairs, and therefore to induce an implied structure graph with a relatively small numberof degree three vertices.

It turns out that the MRSO problem is polynomial-time solvable when either p(GΓ ) or q(GΓ ) are fixed.The notion of edge bipartition plays a central role in this setting.

Definition 9.3.1 ((Nice) Edge bipartition). Let GΓ be an implied structure graph with n vertices. An edgebipartition P = (Et, Eb) of GΓ is a partitioning of the edges in GΓ into Et and Eb, the top and bottom edges of Prespectively, such that Et ∪ Eb = E(GΓ ), Et ∩ Eb = ∅ and Et 6= ∅. Furthermore, P is said to be nice , if the subgraphGΓ [Et] is outerplanar.

Recall that a graph is called outerplanar if it has an embedding in the plane such that the vertices lie on afixed circle and the edges lie inside the disk of the circle and don’t intersect.

Central in our approach is Algorithm ANEB (page 97). We shall use Algorithm ANEB only for a nice edgebipartition of GΓ with a fixed number of bottom edges. At the heart of algorithm ANEB lies the followingsimple observation. Suppose we want to find the highest-scoring compatible mRNA sequence which startswith codon AAA. For this, we can replace the similarity function f1 ∈ F by a different function f ′, wheref ′(AAA) = f1(AAA) and f ′(C) = −∞ for all codons C 6= AAA. Solving the MRSO problem for the instance(GΓ ,F ′), whereF ′ = f ′, f2, . . . , fn, will then give us our desired mRNA. The following definition generalizesthis example.

Definition 9.3.2 (Corresponding similarity functions). Let (GΓ ,F) be an instance of the MRSO problem withF = f1, f2, . . . , fn. Also, let φ : V → Σ3 be a codon assignment for some V ⊆ V(GΓ ). The corresponding set ofsimilarity functions of assignment φ, denoted Fφ = fφ1 , . . . , f

φn , is defined as follows:

• For all i ∈ V : fφi (φ(i)) = fi(φ(i)), and fφi (C) = −∞ for any C 6= φ(i).

• For all j ∈ V(GΓ ) − V : fφj = fj.

Algorithm ANEB uses Algorithm AOP, the algorithm given in [Backofen et al., 2002] for outerplanarimplied structure graphs, as a subprocedure in its computation. At its core, Algorithm ANEB is basicallyan exhaustive search procedure that searches through all possible codon assignments for vertices whichare incident to edges in Eb. For each such assignment, Algorithm ANEB first checks if the assignment iscompatible with respect to GΓ [Eb], and if so, it invokes Algorithm AOP with the set of similarity functionscorresponding to this assignment. Any solution returned by Algorithm AOP is guaranteed to be compatible

Page 114: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

101

Algorithm 3: Algorithm ANEB(GΓ ,F ,P)Data: An implied structure graphGΓ of order n, a set of similarity functions F = f1, f2, . . . , fn and a

nice edge bipartition P = (Et, Eb).Result: An optimal target mRNA sequence T which is compatible with respect to GΓ .foreach codon assignment φ to vertices incident to edges in Eb do

if φ is compatible with respect to GΓ [Eb] then(a) Construct Fφ, the similarity functions corresponding to φ.(b) Invoke AOP(GΓ [Et],Fφ).

endendreturn the target mRNA sequence found in Step (b) with the highest similarity score.

with GΓ since it is simultaneously compatible with both GΓ [Eb] and GΓ [Et]. Finally, Algorithm ANEBterminates by outputting the maximum solution over all target mRNAs returned by Algorithm AOP. Most ofour interest in Algorithm ANEB stems from the following lemma.

Lemma 9.3.3 ([Blin et al., 2008]). Given an instance (GΓ ,F) for the MRSO problem accompanied by a nice edgebipartition P = (Et, Eb) of GΓ , ANEB computes an optimal target mRNA sequence for this instance in O(212mn)time, where n = |V(GΓ )| andm = |Eb|.

Thanks to this general lemma, we have obtained the following results (details omitted).

Proposition 9.3.4 ([Blin et al., 2008]). The MRSO problem is solvable in O(212p(GΓ )n) time.

Proposition 9.3.5 ([Blin et al., 2008]). The MRSO problem is solvable in O(212q(GΓ )n) time.

Also, omitted here are our results that the MRSO problem is solvable in polynomial-time if GΓ hasbounded cutwidth (although cutwidth is perhaps not as natural as the two previously discussed parameters,it has been studied quite considerably for other problems dealing with RNAs Evans [1999c]; Evans andWareham [2001]; Jiang et al. [2000a]) or bounded treewidth (see [Blin et al., 2008] for details). Finally, noticethat in a recent paper, the cliquewidth of GΓ was also argued to be an interesting parameter for the MRSOproblem [Gurski, 2008].

Page 115: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

102

Page 116: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

10How many words are needed to build up all words ?

Contents10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9910.2 Approximation and inapproximation results . . . . . . . . . . . . . . . . . . . . . . . . . . . 10110.3 Jumping to numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

10.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10210.3.2 Hardness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10310.3.3 Put the blame on rk2(S) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

10.1 Introduction

In a covering problem we are faced with the following situation: We are given two (not necessarily disjoint)sets of elements, the base elements and the covering elements, and the goal is to find a minimum (weight) subsetof covering elements that “covers”all the base elements. The exact notion of covering differs from problem toproblem, yet this abstract setting is common to many classical combinatorial problems in various applicationareas. Two famous examples are (i) the MINIMUM SET COVER problem where the covering elements aresubsets of the base elements and the notion of covering corresponds to set inclusion, and (ii) the MINIMUMVERTEX COVER problem where the setting is graph-theoretic and the notion of covering corresponds toincidence between vertices and edges. Ever since the early days of combinatorial optimization, research oncovering problems such as the two examples above proved extremely fruitful in laying down fundamentaltechniques and ideas. The early work of Johnson [Johnson, 1974] and Lovasz [Lovasz, 1974] on the MINIMUMSET COVER problem pioneered the greedy analysis approach, while Chvatal [Chvatal, 1979] gave thefirst analysis based on linear programming (LP) while tackling the same problem. The first LP-roundingalgorithm by Hochbaum [Hochbaum, 1982] was also designed for MINIMUM SET COVER, while Bar-Yehudaand Even gave the first Primal-Dual [Bar-Yehuda and Even, 1981] and Local-Ratio [Bar-Yehuda and Even,1985] algorithms for the MINIMUM VERTEX COVER problem.

In this chapter we introduce a new covering problem which resides in the realm of strings. A string u is asubstring of a string v, if u can be obtained by deleting any number of consecutive letters from both ends ofv. In our covering problem, the base elements are strings and the covering elements are their substrings.The notion of covering corresponds to string-factorization, or to the generation of strings by substringconcatenation. More formally, for a given set of strings S, let C(S) denote the set of all substrings of strings in

103

Page 117: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

104

S. We define a cover of S to be a subset C ⊆ C(S) such that any string s ∈ S can be written as a concatenationof strings in C. If each string in S can be written as a concatenation of at most ` strings in C, we say that C isan `-cover of S. Given a weight function w : C(S)→ Q+, we are interested in computing an `-cover of S withminimum possible weight.

The problem is formally defined as follows.

Minimum Substring Cover

• Input : A set of strings S, a weight functionω : C(S)→ Q+, and an integer ` ≥ 2.• Solution : An `-cover C of S. That is, a set of strings C ⊆ C(S), where for each s ∈ S there existc1, . . . , cp ∈ C, p ≤ `, with s = c1 · · · cp.• Measure : The total weight of the cover, i.e.,ω(C) =

∑c∈Cw(c).

Example 8 Consider the set of strings S = a, aab, aba. Then

C(S) = a, b, aa, ab, ba, aab, aba

and C1 = a, b and C2 = a, ab are covers of S. The cover C1 is a 3-cover of S, while C2 is a 2-cover.

Throughout this chapter, we use n to denote the number of strings in S, andm to denote the maximumlength of any string in S, i.e., n = |S| andm = max|s| : s ∈ S.

Note that in case ` ≥ m, there is no actual bound on the concatenation length of the required cover, andthis case is denoted by ` = ∞. An ∞-cover is referred to simply as a cover. Another interesting specialcase is when ` = 2. In this case, we are required to cover Swith a set of prefixes and suffixes in S, where aprefix (resp. suffix) of a string s is a substring of s which is obtained by removing consecutive letters onlyfrom the end (resp. beginning) of s. As we will see, these two extremal cases both give a certain amount ofcombinatorial leverage, and therefore deserve particular consideration. We also wish to point out that ouruse of general weight functionsω : C(S)→ Q+ allows for more robustness in modeling different scenarios.For instance, whenω is the unitary function, i.e.,ω(c) = 1 for every c ∈ C(S), this corresponds to the situationwhere we want to minimize the size of a cover of S. Whenω(c) = |c|, i.e., the weight of every substring is itslength (w is the length-weighted function), this corresponds to the case where we want to minimize the totallength of the cover. Often some sort of middle ground between these two situations might also be desirable.

Example 9 Consider the two coversC1 andC2 of the set of strings S in Example 8. Ifω is the unitary function,thenω(C1) = ω(C2) = 2. However, ifω is the length-weighted function, we haveω(C1) = 2 < ω(C2) = 3.

Our initial inspiration for studying the MINIMUM SUBSTRING COVER problem came from a paper byBodlaender et al. [Bodlaender et al., 1995], who described an application for this problem in the context ofprotein folding (The authors of [Bodlaender et al., 1995] actually referred to our problem as the DICTIONARYGENERATION problem, and considered its unweighted variant under the parameterized complexity frame-work.) Protein folding is the problem of determining the folding structure of proteins using their amino-acidsequential description. This problem is extremely important, since most of the functionally of a protein isdetermined by its folding structure, and because current biological methods for extracting the sequentialdescription of a given protein exceed by far the methods for extracting the folding structure of the protein.In [Bodlaender et al., 1995], it is argued that since all known approaches for protein folding are NP-hard, apossible heuristic for this problem is to break the protein sequence into small segments, small enough forallowing efficient folding computation. This heuristic is justified by the fact that many proteins seem to becomposed of relatively small regions which fold independently of other regions. The theory of exon shufflingproposes that all proteins are concatenations of such regions, where the regions are drawn from a commonancestral dictionary [Dorit and Gilbert, 1991; Patthy, 1991].

Page 118: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

105

The MINIMUM SUBSTRING COVER problem can also model interesting computational issues whicharise in formal language theory, and in particular, in the area of combinatorics of words. Our notion of∞-cover actually corresponds to the notion of combinatorial rank, an important parameter of a set of words(the best general reference here is [Choffrut and Karhumaki, 1997]). Neraud [Neraud, 1990] studied theproblem of determining whether a given set of words is elementary, where a set of strings is said to beelementary if it does not have a cover of size strictly less than its own. Neraud describes a direct applicationof this notion to the famous D0L-sequence equivalence problem (see [Rozenberg and Salomaa, 1980]) viaso-called elementary morphisms [Ehrenfeucht and Rozenberg, 1978]. He also argues that this notion appearsfrequently in numerous sub-areas such as test sets, code theory, representation of formal languages, andthe theory of equations in free monoids. His main result is in showing that deciding whether a given set ofwords is elementary is coNP-complete, which implies that MINIMUM SUBSTRING COVER is NP-hard.

Apart from the work of Bodlaender et al. [Bodlaender et al., 1995] and Neraud [Neraud, 1990], therehas also been some recent work on problems closely related to MINIMUM SUBSTRING COVER, especiallyfor the case of ` = 2. The MINIMUM SET COVER WITH PAIRS problem introduced by Hassin and Segev in[Hassin and Segev, 2005], is a variant of the MINIMUM SET COVER problem where base elements are nowcovered by pairs of sets, and the goal is to cover all base elements using a minimum weight collection of sets.Hassin and Segev gave an O(

√n log(n)) approximation algorithm for this problem, along with a few other

algorithms for special cases of this problem. Another closely related problem is the HAPLOTYPE INFERENCEBY MAXIMUM PARSIMONY problem, an important problem in computational molecular biology. Huang et al.[Huang et al., 2005] gave an algorithm for this problem, which translates to an O(m log(n)) algorithm forthe MINIMUM SUBSTRING COVER problem with ` = 2. Hajiaghayi et al. [Hajiaghayi et al., 2006] introducedthe MINIMUM MULTICOLORED SUBGRAPH problem within the same context, and gave an algorithm which inour terms obtains a performance guarantee of O(

√m log(n)) with high probability.

10.2 Approximation and inapproximation results

We begin the presentation by giving some negative results (to fix the context). Combining an approximationpreserving reduction from the MINIMUM HYPERGRAPH VERTEX COVER problem to the MINIMUM SUBSTRINGCOVER problem together with the results of [Raz and Safra, 1997] and [Dinur et al., 2005], we have obtainedthe following inapproximability result.

Proposition 10.2.1 ([Hermelin et al., 2008]). It is NP-hard to approximate the MINIMUM SUBSTRING COVERproblem (i) within ratio c log(n) for some constant c, and (ii) within ratio bm/2c− 1− ε for any ε > 0.

It is worth noticing that the above proposition relies (i) on a somewhat unnatural weight functionω, and(ii) on the fact that the strings in S are allowed to be fairly long. The following proposition relaxes both theseconditions at the cost of reducing the lower bounds to only a constant.

Proposition 10.2.2 ([Hermelin et al., 2008]). The MINIMUM SUBSTRING COVER problem is NP-hard to approxi-mate (i) within ratio c log(n) for some constant c, (ii) within ratio bm/2c− 1− ε for any ε > 0, and (iii) within someconstant c, whenm and ` are constant, andω is either the unitary or the length-weighted function.

We now turn to presenting some positive results in the form of approximations algorithms with per-formance ratios depending on the length of the longest word in S. We first apply the local-ratio technique[Bar-Yehuda, 2000; Bar-Yehuda and Even, 1985], and next linear programming techniques. The local-ratiotechnique [Bar-Yehuda, 2000] is based on the Local-Ratio Lemma [Bar-Yehuda and Even, 1985], which in ourterms is stated as follows.

Lemma 10.2.3 (Local-Ratio). Let C be a cover for S, and let ω1 and ω2 be weight functions for C(S). If C is anα-approximate solution, both with respect tow1 and with respect tow2, then C is also an α-approximate solution withrespect toω1 +ω2.

Page 119: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

106

We need a new definition. Given a weight function ω : C(S) → Q+, we say that ω is proper if for anyc, c1 ∈ C(S),ω(c1) ≤ ω(c) whenever c1 is a prefix or a suffix of c. For example, unitary and length-weightedfunctions are proper. Based on the local-ratio technique, we have obtained the following positive results.

Proposition 10.2.4 ([Hermelin et al., 2008]). The MINIMUM SUBSTRING COVER problem is approximable (i)within ratio

(m+12

)− 1 for general values of `, (ii) within ratio m − 1 for ` = 2, and (iii) within ratio m for ` =∞

and proper weight functionω.

Is the MINIMUM SUBSTRING COVER problem approximable within ratiom for ` =∞ if the weightfunction is not proper?

We now consider the MINIMUM SUBSTRING COVER problem from a linear programming point of view.We have shown in [Hermelin et al., 2008] that the MINIMUM SUBSTRING COVER problem can be formulatedas follows:

minimizem∑i=1

(1+ xi)2

4

subject to∑

vp vq=ui

(1+ xp)(1+ xq)

4≥ 1, i = 1, 2, . . . , n

xi ∈ −1,+1, j = 1, 2, . . . ,m

(10.1)

By combining the above linear program with a randomized rounding procedure we have obtained thefollowing approximation result.

Proposition 10.2.5 ([Hermelin et al., 2008]). With high probability, the MINIMUM SUBSTRING COVER problem isapproximable within ratio O(m(`−1)2/` log1/`(n)).

10.3 Jumping to numbers

10.3.1 Introduction

What is the complexity of the MINIMUM SET COVER problem for a one-letter alphabet? If ` = 2 in addition?Unfortunately, we don’t have any answer for that, even for ` = 2. The problem is, however, known to beAPX-hard for a binary alphabet (D. Hermelin, and S. Vialette, Unpublished result), but none of our attemptssucceeded in proving either NP-hardness or membership to P for a one-letter alphabet. We consider in thissection the MINIMUM SET COVER problem for a one-letter alphabet . . . in quite a relaxed form. More precisely,what about the complexity of the MINIMUM SET COVER problem for a one-letter alphabet in case the stringsare given by their length? Indeed, in the special case of a one-letter alphabet, there is no ambiguity in givingeither the string or their length. Observe that the two problems are, however, different from an algorithmictheory point of view. Indeed, if S is composed of n strings, each of length at most m, the size of the input isO(nm) whereas it reduces to O(n log(m)) if the strings are presented by their length (assuming a naturalbinary encoding). This section is devoted to investigating this problem, referred hereafter as the MINIMUMk-GENERATING SET problem.

We use N∗ to refer to the set of all natural numbers excluding zero, i.e., N∗ = 1, 2, . . .. Let S =s1, s2, . . . , sn ⊂ N∗. For any k ∈ N∗, we write kS for the set of all integers that can be expressed as the

Page 120: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

107

What is the complexity of the MINIMUM SET COVER problem for a one-letter alphabet and ` = 2?This question to equivalent to deciding whether the MINIMUM 2-GENERATING SET problem isstrongly NP-complete.

sum of exactly k non necessarily distinct integers of S, i.e., kS = si1 + si2 + . . . + sik : si1 , si2 , . . . , sik ∈ S.According to this definition, for any set S, S = 1S. A set X ⊂ N∗ is a k-generating set of S (or k-generates S)if S ⊆

⋃ki=1 iX. (Notice here that we do not require the additional constraint

⋃ki=1 iX ⊆ S.) It is called a

minimum k-generating set of S if X is a k-generating set of S of minimum cardinality. The k-rank of S, denotedrkk(S), is the cardinality of a minimum k-generating set of S. A set S ⊂ N∗ is k-elementary if rkk(S) = |S|. Letmin(S) and max(S) stand for minsi : si ∈ S and maxsi : si ∈ S, respectively.

MINIMUM k-GENERATING SET

• Input : A set S = s1, s2, . . . , sn ⊂ N∗, and a positive integer k.• Solution : A k-generating set X of S, i.e., a set X ⊂ N∗ such that S ⊆

⋃1≤i≤k iX.

• Measure : The cardinality of X.

Notice that it has been shown (in the context of conformal radiotherapy and Intensity ModulatedRadiation Therapy) that computing the k-ranks of a set of integers for unbounded k is NP-complete [Collinset al., 2007].

We focus here only on the case k = 2. In [Fagnot et al., 2009], some easy properties (including expansion,contraction and shift) of minimum 2-generating sets are given. Actually, from our point of view, the onlyproperty that does not follow intuition is that, for any S ⊂ N∗ and any c ∈ N∗ common divisor of S, we haverk2(S/c) = rk2(S) if c is odd and rk2(S) ≤ rk2(S/c) ≤ 2 rk2(S) if c is even, where S/c = si/c : si ∈ S. Inother words, dividing the elements of S by a common divisor does not reduce the 2-rank (thereby ruling outthis approach for approximation considerations), and the result is quite tight (see [Fagnot et al., 2009]).

10.3.2 Hardness

The complexity of the MINIMUM 2-GENERATING SET problem is settled in [Fagnot et al., 2009].

Proposition 10.3.1 ([Fagnot et al., 2009]). The MINIMUM 2-GENERATING SET problem is APX-hard.

The proof is from the VERTEX COVER problem for cubic graphs and uses combinatorial properties (moreprecisely matching properties) of intervals [Gyarfas, 2003].

It remains open, however, whether the MINIMUM 2-GENERATING SET problem is NP-complete if everyinteger in S is bounded by a polynomial in the length of the input. We now complement the above hardnessresult by presenting a new result that shows intractability of finding any non-trivial 2-generating set. Weneed a new definition. A set S ⊂ N∗ is said to be k-simplifiable , k ∈ N∗, if there exists a k-generating set X ofS such that |X| < |S|. In other words, S is k-simplifiable if rkk(S) < |S|.

Proposition 10.3.2. Deciding whether a set S ⊂ N∗ is 2-simplifiable is coNP-hard.

Proof. The reduction if from the EQUAL SUM SUBSET OF EQUAL CARDINALITY problem: Given a set T ⊂ N∗,are there two disjoint nonempty subsets A,B ⊆ T with |A| = |B| such that

∑a∈A a =

∑b∈B b? The EQUAL

SUM SUBSET OF EQUAL CARDINALITY problem has been shown to be NP-complete in [Cieliebak et al., 2008].

Page 121: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

108

Computing a minimum 2-generating set of 1, 2, . . . , n turns out to be a special case of thePOSTAGE STAMP problem ([Alter and Barnett, 1980; Tripathi, 2006]). The POSTAGE STAMPproblem is defined as follows: for fixed positive integers h, k ∈ N∗, find N(h, k), i.e., thelargest n for which a h-generating set of 1, 2, . . . , n of size k exists. It is easily seen thatN(1, k) = k (for 1, 2, . . . , k) and N(h, 1) = h (for 1). Stohr [Stohr, 1955a,b] proved thatN(h, 2) =

⌊(h2 + 6h+ 1)/4

⌋(Tripathi gives an alternate proof in [Tripathi, 2006]). Surprisingly

enough, no other closed-form formula is known for any other pair (h, k) where one of h andk is fixed [Tripathi, 2006]. Computing a closed-form formula for N(2, k) (or, going back to ourvocabulary, “computing rk2(1, 2, . . . , n”) remains a challenging open problem (an asymptoticbound for N(2, k) is given in [Moser, 1960]).To give some insights of the context, we give the minimum size of of each 2-generating set of1, 2, . . . , n, i.e., rk2(1, 2, . . . , n), for n from 1 to 64 (notice that rk2(1, 2, . . . , 65) = 13).

n rk2(1, 2, . . . , n) length of the slot1, 2 1 2

3, 4 2 2

5, 6, 7, 8 3 4

9, 10, 11, 12 4 4

13, 14, 15, 16 5 4

17, 18, 19, 20 6 4

21, 22, . . . , 26 7 6

27, 28, . . . , 32 8 6

33, 34, . . . , 40 9 8

41, 42, . . . , 46 10 6

47, 48, . . . , 54 11 8

55, 56, . . . , 64 12 10

Let n ∈ N∗, and X be a minimum 2-generating set of 1, 2, . . . , n. We certainly have max(X) ≥⌈n2

⌉(indeed, (

⌈n2

⌉− 1) + (

⌈n2

⌉− 1) < n). We conjecture that max(X) ≤

⌈n2

⌉is actually enough.

Conjecture 5. For every n ∈ N∗, there exists a minimum 2-generating set X of 1, 2, . . . , n such thatmax(X) =

⌈n2

⌉.

A computer-aided verification (linear programming) shows the conjecture to be correct for 1 ≤n ≤ 80. If True, this property might help in better understanding the structure of optimal 2-generating sets (more precisely, the structure of some interesting optimal 2-generating sets) ofconsecutive integers.

Let T = t1, t2, . . . tn ⊂ N∗ be an arbitrary instance of the EQUAL SUM SUBSET OF EQUAL CARDINALITYproblem. We are due to find out whether this set contains two disjoint subsets with the same cardinality andwith the same sum. We take a rather big natural B (B = 1+

∑ni=1 ti is actually enough) and define the set

S = s0, s1, . . . , sn ⊂ N∗ by s0 = 1 and si = (2(ti +B) + 1, 1 ≤ i ≤ n. We claim that S is 2-simplifiable if andonly if the original set T admits two disjoint subsets of equal sum and cardinality.

Assume for simplicity that t1+ t3+ . . .+ t2k−1 = t2+ t4+ . . .+ t2k for some k. This is clearly equivalentto assuming that s1 + s3 + . . .+ s2k−1 = s2 + s4 + . . .+ s2k. Assume further that t2i−1 ≤ t2j−1 and t2i ≥ t2j

Page 122: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

109

for all 1 ≤ i < j ≤ k. This can be safely assumed through a relabeling of the indexes of t1, t3, . . . , t2k−1 (andof s1, s3, . . . , s2k−1) and a relabeling of the indexes of t2, t4, . . . , t2k (and of s2, s4, . . . , s2k). Now, we havethat

s1 + s3 + . . .+ s2q−1 ≤ s2 + s4 + . . .+ s2q (10.2)

for every 1 ≤ q ≤ k. Consider now the 2k naturals d0, d1, d2, . . . , d2k−1 defined by d0 = 1 and di = si−di−1for 1 ≤ i ≤ 2k− 1. We claim that each di, 1 ≤ i ≤ 2k− 1, is a positive integer. Indeed, for 1 ≤ i ≤ k− 1, wehave

d2i+1 = s2i+1 − d2i

= (s1 + s3 + . . .+ s2i+1) − s0 − (s2 + s4 + . . .+ s2i)

= 2(t1 + t2 + . . .+ t2i+1) + 2B(i+ 1) + (i+ 1) − s0 − 2(t2 + t4 + . . .+ t2i) − 2Bi− i

= 2(t1 + t2 + . . .+ t2i+1) − 2(t2 + t4 + . . .+ t2i) + 2B+ 1

> 0

for B >∑ni=1 ti, and

d2i = s2i − d2i−1

= (s2 + s4 + . . .+ s2i) + s0 − (s1 + s3 + . . .+ s2i−1)

≥ s0= 1

by (10.2). Notice that it follows from the above that

d2k−1 = (s1 + s3 + . . .+ s2k−1) − s0 − (s2 + s4 + . . .+ s2k−2).

Combining this with our hypothesis

s2k = (s1 + s3 + . . .+ s2k−1) − (s2 + s4 + . . .+ sk−2)

yields s2k = d2k−1 + s0 = d2k−1 + d0, and hence si = di−1 + di mod 2k holds for 1 ≤ i ≤ 2k. Then it followsthat X = di : 0 ≤ i ≤ 2k− 1 ∪ si : 2k+ 1 ≤ i ≤ n is a 2-generating set of S. But |X| = n− 1 < n = |S|, andhence S is 2-simplifiable.

For the other direction, assume the set S = s0, s1, . . . sn is 2-simplifiable and let D = d1, d2, . . . , dk,k ≤ n, be a 2-generating set for S. Let us represent this situation with a graph G (for the very first time inthis manuscript we shall allow for loops in a graph). The graph G has k + 1 vertices labeled 0, d1, . . . dk.As for the edges, for every 0 ≤ i ≤ n, we have an edge labeled si between two vertices dp and dq suchthat si = dp + dq (if there exists severals possibilities for 2-generating si with D we choose one arbitrarily).Notice that, we have a loop labeled si if si = dp + dp for some dp ∈ D, and en edge labeled si with anendpoint labeled 0 allows us to conveniently represent the case si = dp as si = dp + 0, dp ∈ D.

Since G has k+ 1 ≤ n+ 1 vertices and n+ 1 edges then Gmust contain some cycle C. Furthermore, thesum of the labels on the edges of C is twice the sum of the labels on the nodes of C, and must therefore bean even number. Now, remembering that the si‘s are all odd (by construction from the ti‘s) it follows thatC is an even cycle. Moreover, the edges of C can be partitioned into two matchingsM1 andM2. Noticenow that the sum of the labels on the edges ofM1 equals the sum of the labels on the edges ofM2. Fromthis, and since B is so big, we can exclude the possibility that one edge of C is labelled with s0 = 1. Then itfollows that the edges ofM1 and the edges ofM2 give two disjoint subsets of T = t1, t2, . . . , tn with equalsum and cardinality.

Observe, however, that Proposition 10.3.2 does rule out the existence of an approximation algorithm forfinding a minimum 2-generating set. Quite a surprising fact at first glance!

Page 123: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

110

10.3.3 Put the blame on rk2(S)

Write n = |S|,m = max(S) and k = rk2(S). We now consider the problem of finding a minimum cardinality2-generating set of S from an effective computational point of view [Downey and Fellows, 1999; Niedermeier,2006]. As a first attempt, let us consider the brute-force approach (we do think it is good practices to alwaysbegin with the naive brute-force algorithm): generate all k-subsets X of 1, 2, . . . ,m and check for each ofthem whether 2-generates S, i.e., S ⊆ X ∪ 2X. Correctness of this algorithm is of course immediate. Thereare

(mk

)such subsets and each subset X can be identified as a 2-generating set of S in O(k2 log(k)) time

(assuming a unit-cost RAM model with log(m) word size). Therefore, the brute-force algorithm is, as a whole,a O(mkk2 log(k)) time procedure. Butm (and even log(m)) can be arbitrarily large compared to n = O(k2)and this naturally leads us to the problem of trying to confine the seemingly inevitable combinatorialexplosion of computational difficulty to a function of k only [Downey and Fellows, 1999; Niedermeier, 2006].We prove here that such an algorithm does exist for finding a minimum cardinality 2-generating set of S.The following lemma is central in our approach.

Lemma 10.3.3 ([Fagnot et al., 2009]). Let S = si : 1 ≤ i ≤ n ⊂ N∗ and X be a minimum 2-generating set of S.There exist rationals αi,j ∈ −1,−2−1, 0, 2−1, 1, 1 ≤ i ≤ rk2(S) and 1 ≤ j ≤ n, such that

X =

n∑j=1

αi,j sj : 1 ≤ i ≤ rk2(S)

.

Combining Lemma 10.3.3 with a brute-force algorithm for finding a (representation of a) minimumcardinality 2-generating set of a set S ∈ N∗ yields the following result.

Proposition 10.3.4. Assuming a unit-cost RAM model with log(m) word size (m = max(S)), there exists a

O(5k2(k+3)

2 k2 log(k)) time algorithm for finding a minimum cardinality 2-generating set of S, where k = rk2(S).

Let us conclude by clarifying a point that we think some people have found confusing: Can integerlinear programming techniques cope here with trying to confine the seemingly inevitable combinatorialexplosion to a function of k only? For one a classical result in parameterized algorithms is that the INTEGERLINEAR PROGRAMMING problem parametrized by the number of variables is fixed-parameter tractable. Thispowerful result, first proved by Lenstra in [Lenstra, 1983] (this paper received Fulkerson Prize in 1985 for anoutstanding contribution in the area of discrete mathematics) and later improved by Kannan [Kannan, 1987].For another, it is not hard to design an integer linear program for finding a minimum cardinality 2-generatingset of a set of integers. Therefore, integer linear programming seems to be an appealing approach in ourcontext, and Yes this is indeed a good question. However, as long as our integer linear program uses anumber of variables depending – at least linearly – on max(S), Lenstra’s result does not apply since max(S)can be arbitrarily big compared to |S|. We did not succeed in obtaining such an integer linear program (andactually we doubt such an approach is possible).

Page 124: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

111

Before concluding, we would like to address the following question: does there exist a betterrepresentation lemma to improve Proposition 10.3.4 ? To this end, consider the set A ⊂ P(Q)defined as follows: A ∈ A if and only if, for any set S = si : 1 ≤ i ≤ n ⊂ Z+, rk2(S) = k, thereexist rationals αi,j ∈ A, 1 ≤ i ≤ n and 1 ≤ j ≤ k, such that X =

∑nj=1 αi,jsj : 1 ≤ i ≤ k

is a

(necessarily minimum) 2-generating set of S. According to Lemma 10.3.3, A is not empty as itcontains −1,−1/2, 0, 1/2, 1. Furthermore, as shown in Proposition 10.3.4, the time complexity ofour algorithm to compute a minimum 2-generating set depends - in an exponential way - on thesize of the minimum cardinality set in A. For any i ∈ Z+, define Ai ⊆ A to be the set of those setsof A of cardinality i. Our question thus reduces to finding the minimum i ∈ Z+ such that Ai 6= ∅.Lemma 10.3.3 shows that A5 6= ∅. Proving or disproving Ai 6= ∅ for some 1 ≤ i < 5 remains achallenging problem.Here are some lines of thought to reduce the above problem to “Is A4 empty?”. Let A ∈ A. We firstobserve that 0 ∈ A (it is enough to consider the set S = 1, n, for any unbounded n). Furthermore,we must have 1 ∈ A (it is enough here to consider the set S = 1). As an immediate consequence,A1 = ∅ since |A| ≥ 2. We now show that A2 = ∅, i.e., 0, 1 /∈ A2. Indeed, the set S1 = 4, 5, 6has a unique minimum cardinality 2-generating set, namely X1 = 2, 3. But 2 6=

∑i∈S1 αi i for

αi ∈ 0, 1, i ∈ S1, and hence 0, 1 /∈ A2. Therefore, A2 = ∅. We now turn to proving that A3 = ∅.Suppose, aiming at a contradiction, that A3 6= ∅, and let A ∈ A3. According to the above, A hasthe general form 0, 1, x for some x ∈ Q. For one, as we already noticed, the set S1 = 4, 5, 6 hasa unique minimum cardinality 2-generating set, namely X1 = 2, 3. Considering the pair (S1, X1),an exhaustive computation now shows that A ∈ A3 ⊆ AS1 , where

AS1 = −1, 0, 1 , −3/4, 0, 1 , −1/2, 0, 1 , −1/3, 0, 1 , 0, 1/5, 1 ,

0, 1/3, 1 , 0, 1/2, 1 .

For another, the set S2 = 6, 7, 8 has a unique minimum cardinality 2-generating set, namely X2 =3, 4. Considering the pair (S2, X2), an exhaustive computation shows that A ∈ A3 ⊆ AS2 ,where

AS2 = −3/8, 0, 1 , −1/2, 0, 1 , −2/3, 0, 1 , −2/7, 0, 1 , 0, 1/2, 1 .

Then it follows that A ∈ A3 ⊆ AS1 ∩ AS2 = −1/2, 0, 1, 0, 1/2, 1. We now proceed to showthat neither −1/2, 0, 1 nor 0, 1/2, 1 belongs to A3, thereby proving A3 = ∅. Indeed, considerfirst the set S ′ = 3, 4, 6, 7, 8, 10, 14, 15, 16, 18, 22, 30. It has a unique minimum cardinality 2-generating set, namely X ′ = 1, 3, 7, 15. But 1 6=

∑i∈S ′ αi i for αi ∈ 0, 1/2, 1, i ∈ S ′, since

1 < min(S ′) and all rationals in 0, 1/2, 1 are positive. Then it follows that 0, 1/2, 1 /∈ A3.Consider now the set S ′′ = 4, 9, 11. It has a unique minimum cardinality 2-generating set,namely X ′′ = 2, 9. But 2 6=

∑i∈S ′′ i αi for αi ∈ −1/2, 0, 1, i ∈ S ′′. Indeed, we must have

αi = −1/2 for some i ∈ S ′′ since 2 < mini : i ∈ S ′′. We now need to consider three cases. First,if α9 = −1/2 then we must have α11 = −1/2 since 9 and 11 are odd integers and 4 is even. But−9+11

2+ 4 = −6 < 2, a contradiction. Second, if α11 = −1/2, we also end up with a contradiction

by a symmetric argument to one from the previous case. Third, if α4 = −1/2, α9 ∈ 0, 1 andα11 ∈ 0, 1, we obtain 4 = 9α9 + 11α11 for α9 ∈ 0, 1 and α11 ∈ 0, 1, which is impossible. Thenit follows that −1/2, 0, 1 /∈ A3. Combining A3 ⊆ −1/2, 0, 1, 0, 1/2, 1 with 0, 1/2, 1 /∈ A3and −1/2, 0, 1 /∈ A3, we obtain A3 = ∅.Finally, notice that, according to the above, if any setA ∈ A has to be symmetric around zero (notethat since our solution subset −1,−1/2, 0, 1/2,+1 is, this may be an intrinsic property), thenwe would immediately obtain the desired result A4 = ∅. Proving or disproving this symmetryproperty remains an intriguing open problem as well.

Page 125: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

112

Page 126: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Perspectives

Thanks to my thinking notes along this manuscript, I think I have already pointed out most of the problemsand questions related to my exposition I am interested in. I hope I have succeeded in trying to convincethe reader that multidimensional intervals, linear matchings, permutations, connected occurrences andk-generating sets are nice, interesting and useful (and fun? quite an important point for me) combinatorialobjects for algorithmic investigations. It is thus understood that the above-mentioned topics will constitutedefinitively a large part of my research in the coming years. Therefore I shall not reproduce nor discuss anylonger any of them here, and I shall focus instead on some new additional topics.

In the sequel, I present and briefly discuss four research topics I am particularly interested in and onwhich I plan to work in the very near future. Let me first put these different topics in their context. Both theMINIMUM COMMON STRING PARTITION problem and the Tandem duplication-random loss model are relatedto comparative genomics. Actually, I think that these two problems (together with algorithmic aspects ofnext generation sequencing) will constitute in a short-term – not to say are now – my main research activityin comparative genomics (I am now left with the feeling that heuristic approaches is the best we can do forthe match-an-prune model). The MEDIAN problem for the Kendall-Tau distance is related to rank aggregation.Finally, I present one combinatorial problem on graphs that has recently caught most of our attention.

The MAXIMUM COMMON STRING PARTITION problem

A partition of a string u is a sequence P = (p1, p2, . . . , pm) of strings, called the blocks, whose concatenationis equal to u, i.e., u = p1 p2 . . . pm. Given a partition P of a string u and a partition Q of a string v, the pair(P,Q) is a common partition of u and v if Q is a permutation of P. The MINIMUM COMMON STRING PARTITION(MCSP) problem is to find a common partition of two strings u and v with the minimum number of blocks.The restricted version of MCSP where each letter occurs at most k times in each input string, is denoted byk-MCSP. It has been shown that the 2-MCSP problem is NP-hard and, moreover, even APX-hard [Kolmanet al., 2005]. The 2-MCSP and 3-MCSP problems are approximable within ratio 1.1037 and 4, respectively[Kolman et al., 2005]. The MCSP problem is approximable within ratio O(k), where k is the maximumnumber of occurrences of a letter, and within ratio O(log(n) log∗(n)) in the general case [Kaplan and Shafrir,2006].

Let us embed the MCSP problem into multidimensional interval graphs. Recall that the MINIMUMINDEPENDENT DOMINATING SET problem is to find in a graph G is minimum cardinality subset V ′ ⊆ V(G)such that (i) V ′ is an independent set and (ii) for each u ∈ V(G) \V ′,NG(u)∩V ′ 6= ∅. Most of our interest in

113

Page 127: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

114

the MINIMUM INDEPENDENT DOMINATING SET problem stems from the following easy observation: Giventwo strings u and v, both of length n, built over some alphabetA such that |u|a = |v|a for all a ∈ A, one can constructin polynomial-time a 2-track interval graph G = Ω(D) with at most n(n + 1)/2 vertices such that independentdominating sets of size k of G are in one-to-one correspondence with common partitions of size k of u and v. Whatabout approximating the MINIMUM INDEPENDENT DOMINATING SET problem for 2-track interval graphs?The nice results of [Butman et al., 2007] do not help much here as it seems impossible (let me moderatethis, D. Rawitz and I did not succeed) to push up the local-ratio based approximation for the MINIMUMDOMINATING SET problem in d-interval graphs, d ≥ 2, to the MINIMUM INDEPENDENT DOMINATING SETproblem for d-interval (or even d-track interval) graphs. However, it is conceivable that this problem is inAPX for d-track interval (and actually even for d-interval) graphs (even if we have proved this problem to beW[1]-hard for 2-interval graphs [Hermelin et al., 2009]). Notice that, even if most – not to say all – standardcombinatorial graph problems remain NP-complete for 2-interval graphs, they usually belong to APX. Inany case, I believe this point of view might shed some new light upon the approximation of the MCSPproblem.

Tandem duplication-random loss model

In the Tandem duplication-random loss model, a genome (given in the form of a permutation) evolves viathe tandem duplication of a contiguous segment of genes (i.e., the duplicated copy is inserted immediatelyafter the original copy), followed by the loss of one copy of each of the duplicated genes. In most, thoughnot all cases, this process will result in a genome rearrangement [Chaudhuri et al., 2006]. See Figure 10.1for an illustration; the goal is to find a minimum cost sequence of duplication-loss steps to transform theidentity into a target permutation.

1 2 3 4 5 6

1 2 3 4 5 2 3 4 5 6

1 2/3 4/5 2 3/4 5/6 = 1 3 5 2 4 6

1 3 5 1 3 5 2 4 6

1/3 5 1 3/5/2 4 6 = 3 5 1 2 4 6

duplication 2 3 4 5

loss

duplication 1 3 5

loss

Figure 10.1: Example of a genome rearrangement caused by two rounds of tandem duplication and randomloss.

Although it seems intuitively clear that the cost of a duplication event should be some non-decreasingfunction of the length of the duplication, it is not clear what functional form this cost duplication shouldtake. If the cost of a duplication of a segment of k genes is αk for some constant parameter α ≥ 1, it has been

Page 128: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

115

proved in [Chaudhuri et al., 2006] that computing a minimum cost sequence of duplication-loss steps totransform the identity into a target permutation is solvable in linear-time if α = 1 and inO(n log(n)) time forany α ≥ 2 (in the latter case, the problem reduces to computing the Kendall-Tau distance). The complexityis open for 1 < α < 2. More important, the complexity of computing this distance is unknown for affinefunctions.

The MEDIAN problem for the Kendall-Tau distance

In the traditional (undirected) version of the MEDIAN problem, we are given k genomes and we are askedto find the genome which minimizes the sum of the distances to the other k genomes (see our monograph[Fertin et al., 2009a] for an up-to-date survey of this field). The MEDIAN problem for the Kendall-Tau distancehas also been studied in the context of social choice theory and rank aggregation. It has been shown tobe NP-complete for k = 4 [Dwork et al., 2001] (actually, it is claimed in [Dwork et al., 2001] that the resultholds for k ≥ 4 but it is not clear that the hardness propagates upwards to odd n, they had no success atall in convincing us), and to be approximable within ratio 1.57 [Ailon et al., 2005]. For k = 3, the case ofmost interest in the context of phylogeny since it is often use as a subroutine in phylogenetic reconstruction,NP-hardness has been established and nothing but a trivial 4/3 approximation ratio is known. What aboutother distances? We have presented some preliminary results of this topic in [Blin et al., 2009a].

Security in graphs

Given a graph G, a subset S ⊆ V(G) is a defensive alliance if for all u ∈ S, |NG[u] ∩ S| ≥ |NG[u] \ S|, i.e., everyvertex u ∈ S has as many neighbors, including u itself, in S as those in V(G) \ S [Kristiansen et al., 2004] Thevarious concepts of alliances in graphs are motivated from a security issue that whether defenders of u ∈ Scan defeat the attackers of v ∈ V(G) \ S. Brigham et al. have described a global concept of alliances based onthe idea that defensive alliances do not accurately model real-world situations [Brigham et al., 2007]. Givena graph G, a subset S ⊆ V(G) is a secure set if for all X ⊆ S, |NG[X] ∩ S| ≥ |NG[X] \ S|, i.e., every subset X ⊆ Shas as many neighbors, including X, in S as those in V(G) \ S. The security number of G is the cardinality of aminimum secure set of G. The complexity of determining the security number of a graph is completely open(it has, however, be found for some families of graphs). In fact, there is no known polynomial algorithm fordetermining if a given set S ⊆ V(G) is a secure set [Dutton, 2009].

Recently, security in graphs has caught most of our attention, both from the standard complexity and theparameterized point of view. Our recent results (collaboration with R. Rizzi, Univ. Udine, Italy) suggest that(i) determining if a given set S ⊆ V(G) is a secure set is coNP-hard, and that (ii) the problem of computingthe security number is not approximable but is fixed-parameter tractable for the standard parameterization(quite an unusual situation in parameterized complexity theory). Determining the security number ofbipartite and bounded degree graphs is still completely open. More generally, very little is known aboutapproximation issues of the various concepts of alliances in graphs [Rodrıguez-Velazquez and Sigarreta,2009; Favaron et al., 2004].

Page 129: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

116

Page 130: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

Bibliography

Ailon, N., Charikar, M., and Newman, A. (2005). Aggregating inconsistent information. In Gabow, H. andFagin, R., editors, Proc. of the 37th Annual ACM Symposium on Theory of Computing (STOC), Baltimore, MD,USA, pages 684–693. ACM.

Akiyama, J., Exoo, G., and Harary, F. (1981). Covering and packing in graphs IV: Linear arboricity. Networks,11:69–72.

Akutsu, T. (2000). Dynamic programming algorithms for RNA secondary structure prediction with pseudo-knots. Discrete Applied Mathematics, 104:45–62.

Alber, J., Gramm, J., Guo, J., and Niedermeier, R. (2002). Towards optimally solving the longest commonsubsequence problem for sequences with nested arc annotations in linear time. In Apostolico, A. andTakeda, M., editors, Proc. 13th Annual Symposium on Combinatorial Pattern Matching (CPM), Fukuoka, Japan,volume 2373 of Lecture Notes in Computer Science, pages 99–114. Springer.

Alber, J., Gramm, J., Guo, J., and Niedermeier, R. (2004). Computing the similarity of two sequences withnested arc annotations. Theoretical Computer Science, 312(2-3):337–358.

Albert, M., Aldred, R., Atkinson, M., and Holton, D. (2001). Algorithms for pattern involvement inpermutations. In Proc. International Symposium on Algorithms and Computation (ISAAC), volume 2223 ofLecture Notes in Computer Science, pages 355–366.

Alekseyev, M. and Pevzner, P. (2007). Colored de Bruijn graphs and the genome halving problem. IEEE/ACMTransactions on Computational Biology and Bioinformatics, 4(1):98–107.

Alm, E. and Arkin, A. (2003). Biological networks. Curr. Opin. Struct. Biol., 13(2):193–202.

Alon, N. (1988). The linear arboricity of graphs. Israel J. of Mathematics, 62(3):311–325.

Alon, N. and Spencer, J. (1992). The Probabilistic Method. Wiley.

Alon, N., Yuster, R., and Zwick, U. (1995). Color coding. Journal of the ACM, 42(4):844–856.

Alonso, L. and Schott, R. (1993). On the tree inclusion problem. In Borzyszkowski, A. and Sokolowski, S.,editors, Proc. 18th Mathematical Foundations of Computer Science (MFCS), Gdansk, Poland, volume 711 ofLecture Notes in Computer Science, pages 211–221.

117

Page 131: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

118

Alter, R. and Barnett, J. (1980). A postage stamp problem. Amer. Math. Montly, 87:206–210.

Angibaud, S., Fertin, G., Rusu, I., Thevenin, A., and Vialette, S. (2007a). A pseudo-boolean programmingapproach for computing the breakpoint distance between two genomes with duplicate genes. In Proc.5th RECOMB Comparative Genomics Satellite Workshop (RECOMB-CG), volume 4751 of Lecture Notes inBioinformatics, pages 16–29. Springer.

Angibaud, S., Fertin, G., Rusu, I., Thevenin, A., and Vialette, S. (2008a). Efficient tools for computing thenumber of breakpoints and the number of adjacencies between two genomes with duplicate genes. Journalof Computational Biology. To appear.

Angibaud, S., Fertin, G., Rusu, I., Thevenin, A., and Vialette, S. (2008b). On the approximability of comparinggenomes with duplicates. Journal of Graph Algorithms and Applications, 13(1):19–53.

Angibaud, S., Fertin, G., Rusu, I., and Vialette, S. (2006). How pseudo-boolean programming can helpgenome rearrangement distance computation. In Proc. 4th RECOMB Comparative Genomics Satellite Workshop(RECOMB-CG), Montreal, Canada, volume 4205 of Lecture Notes in Bioinformatics, pages 75–86.

Angibaud, S., Fertin, G., Rusu, I., and Vialette, S. (2007b). A general framework for computing rearrangementdistances between genomes with duplicates. Journal of Computational Biology, 14(4):379–393.

Arnborg, S., Corneil, D., and Proskurowski, A. (1987). Complexity of finding embeddings in a k-tree. Journalon Algebraic and Discrete Methods, 8(2):277–284.

Arnborg, S. and Proskurowski, A. (1989). Linear time algorithms for NP-hard problems restricted to partialk-trees. Discrete Applied Mathematics, 23:11–24.

Atkinson, M., Murphy, M., and Ruskuc, N. (2005). Pattern Avoidance Classes and Subpermutations. EuropeanJournal of Combinatorics, 12(1).

Ausiello, G., Crescenzi, P., Gambosi, G., Kann, V., Marchetti-Spaccamela, A., and Protasi, M. (1999). Complexityand Approximation: Cominatorial optimization problems and their approximability properties. Springer-Verlag.

Backofen, R. and Busch, A. (2004). Computational design of new and recombinant selenoproteins. In Sahinalp,S. C., Muthukrishnan, S., and Dogrusoz, U., editors, Proc. 15th Annual Symposium on Combinatorial PatternMatching (CPM), Istanbul, Turkey, volume 3109 of Lecture Notes in Computer Science, pages 270–284.

Backofen, R., Narayanaswamy, N., and Swidan, F. (2002). Protein similarity search under mRNA structuralconstraints: application to targeted selenocystein insertion. In Silico Biology, 2(3):275–290.

Bafna, V., Narayanan, B., and Ravi., R. (1996). Non-overlapping local alignments (weighted independentsets of axis-parallel rectangles). Discrete Applied Mathematics, 71(1):41–54.

Bar-Yehuda, R. (2000). One for the price of two: A unified approach for approximating covering problems.Algorithmica, 27(2):131–144.

Bar-Yehuda, R. and Even, S. (1981). A linear time approximation algorithm for the weighted vertex coverproblem. Journal of Algorithms, 2:198–203.

Bar-Yehuda, R. and Even, S. (1985). A local-ratio theorem for approximating the weighted vertex coverproblem. Annals of Discrete Mathematics, 25:27–46.

Bar-Yehuda, R., Halldorsson, M., Naor, J., Shachnai, H., and Shapira, I. (2002). Scheduling split intervals. InProc. 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 732–741.

Ben Rebea, A. (1981). Etude des stables dans les graphes quasi-adjoints. PhD thesis, Universite de Grenoble.

Page 132: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

119

Bergeron, A., Chauve, C., de Montgolfier, F., and Raffinot, M. (2005). Computing common intervals of kpermutations, with applications to modular decomposition of graphs. In Brodal, G. S. and Leonardi, S.,editors, Proc. 13th Annual European Symposium, Palma de Mallorca, Spain, volume 3669 of Lecture Notes inComputer Science, pages 779–790.

Bergroth, L., Hakonen, H., and Raita, T. (2000). A survey of longest common subsequence algorithms. InProc. of the 7th International Symposium on String Processing Information Retrieval (SPIRE), Coruna, Spain,pages 39–48. IEEE Computer Society.

Betzler, N., Fellows, M., Komusiewicz, C., and Niedermeier, R. (2008). Parameterized algorithms andhardness results for some graph motif problems. In Proc. 19th Annual Symposium on Combinatorial PatternMatching (CPM), Pisa, Italy, volume 5029 of Lecture Notes in Computer Science, pages 31–43. Springer.

Bjorklund, A., Husfeldt, T., Kaski, P., and Koivisto, M. (2007). Fourier meets mobius: fast subset convolution.In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 67–74. ACM NewYork, NY, USA.

Blair, J. and Peyton, B. (1993). An introduction to chordal graphs and clique trees. Graph Theory and SparseMatrix Computation, 56:1–29.

Blin, G., , Fertin, G., Herry, G., and Vialette, S. (2007a). Comparing RNA structures: towards an intermediatemodel between the edit and the lapcs problems. In Sagot, M.-F., Walter, W. T., and Maria, E., editors,1st Brazilian Symposium on Bioinformatics (BSB), Angra dos Reis, Brazil, volume 4643 of Lecture Notes inBioinformatics, pages 101–112. Springer.

Blin, G., Chauve, C., and Fertin, G. (2004). The breakpoint distance for signed sequences. In Proc. 1stAlgorithms and Computational Methods for Biochemical and Evolutionary Networks (CompBioNets), Recife, Brazil,pages 3–16. KCL publications.

Blin, G., Chauve, C., Fertin, G., Rizzi, R., and Vialette, S. (2007b). Comparing genomes with duplications:a computational complexity point of view. ACM/IEEE Trans. Computational Biology and Bioinformatics,14(4):523–534.

Blin, G., Crochemore, M., Hamel, S., and Vialette, S. (2009a). Finding the median of three permutationsunder the Kendall-Tau distance. In Proc. 7th annual international conference on Permutation Patterns, Firenze,Italy. electronic version (6 pp).

Blin, G., Crochemore, M., and Vialette, S. (2010a). Algorithms in Computational Molecular Biology: Techniques,Approaches and Applications, chapter Algorithmic Aspects of Arc-Annotated Sequences. Wiley. To appear.

Blin, G., Fertin, G., Hermelin, D., and Vialette, S. (2008). Fixed-parameter algorithms for protein similaritysearch under mrna structure constraints. Journal of Discrete Algorithms, 6(4):618–626.

Blin, G., Fertin, G., Rizzi, R., and Vialette, S. (2005a). What makes the arc-preserving subsequence problemhard ? T. Comp. Sys. Biology, 2:1–36.

Blin, G., Fertin, G., Sikora, F., and Vialette, S. (2009b). The exemplar breakpoint distance for non-trivialgenomes cannot be approximated. In Das, S. and Uehara, R., editors, Proc. 3rd Annual Workshop onAlgorithms and Computation (WALCOM’09), Kolkata, India, volume 5431 of Lecture Notes in Computer Science,pages 357–368. Springer.

Blin, G., Fertin, G., and Vialette, S. (2005b). What makes the arc-preserving subsequence problem hard ?LNCS Transactions on Computational Systems Biology, 2:1–36.

Page 133: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

120

Blin, G., Fertin, G., and Vialette, S. (2007c). Extracting constrained 2-interval subsets in 2-interval sets.Theoretical Computer Science, 385(1-3):241–263.

Blin, G., Sikora, F., and Vialette, S. (2009c). Querying Protein-Protein Interaction Networks. In Istrail,S., Pevzner, P., and Waterman, M., editors, 5th International Symposium on Bioinformatics Research andApplications (ISBRA’09), volume 5542 of LNBI, pages 52–62, Fort Lauderdale, FL, USA. Springer-Verlag.

Blin, G., Sikora, F., and Vialette, S. (2010b). GraMoFoNe: a cytoscape plugin for querying motifs withouttopology in protein-protein interactions networks. In Al-Mubaid, H., editor, 2nd International Conference onBioinformatics and Computational Biology (BICoB-2010), page 3843, Honolulu, USA. International Society forComputers and their Applications (ISCA).

Boch, A., Forchhammer, K., Heider, J., and Baron, C. (1991). Selenoprotein synthesis: a review. Trends inBiochemical Sciences, 16(2):463–467.

Bodlaender, H. (1993). A tourist guide through treewidth. Acta Cybernetica, 11:1–23.

Bodlaender, H., Downey, R., Fellows, M., Hallett, M., and Wareham, H. (1995). Parameterized complexityanalysis in computational biology. Computer Applications in the Biosciences, 11(1):49–57.

Bona, M. (2004). Combinatorics of permutations. Discrete Mathematics and its Applications. Chapman &Hall/CRC.

Bongartz, D. (2004). Some notes on the complexity of protein similarity search under mRNA structureconstraints. In Proc. of the 30th Conference on Current Trends in Theory and Practice of Computer Science(SOFSEM), pages 174–183.

Bonizzoni, P., Vedova, G. D., Dondi, R., Fertin, G., Rizzi, R., and Vialette, S. (2007). Exemplar longest commonsubsequences. IEEE/ACM Transactions on Computational Biology and Bionformatics, 4(4):535–543.

Bonsma, P. (2003). Complexity results for restricted instances of a paint shop problem. Technical Report1681, Dep. of Applied Mathematics, Univ. of Twente.

Bonsma, P., Epping, T., and Hochstattler, W. (2006). Complexity results on restricted instances of a paintshop problem for words. Discrete Applied Mathematics, 154(9):1335–1343.

Bose, P., J.F.Buss, and Lubiw, A. (1998). Pattern matching for permutations. Information Processing Letters,65(5):277–283.

Bouvel, M., Rossin, D., and Vialette, S. (2007). Longest common separable pattern between permutations.In Ma, B. and Zhang, K., editors, Proc. Symposium on Combinatorial Pattern Matching (CPM’07), London,Ontario, Canada, volume 4580 of Lecture Notes in Computer Science, pages 316–327.

Brandstadt, A., Le, V. B., and Spinrad, J. (1999). Graph Classes: A Survey. volume of the SIAM Monographson Discrete Mathematics and Applications. SIAM, Philadelphia.

Brevier, G., Rizzi, R., and Vialette, S. (2007). Pattern matching in protein-protein interaction graphs. InCsuhaj-Varju, E. and Esik, Z., editors, Proc. 16th Fundamentals of Computation Theory, 16th InternationalSymposium (FCT), Budapest, Hungary, Lecture Notes in Computer Science, pages 137–148. Springer.

Brevier, G., Rizzi, R., and Vialette, S. (2009). Complexity issues in color-preserving graph embeddings.Theoretical Computer Science. In press.

Brigham, R., Dutton, R., and Hedetniemi, S. (2007). Security in graphs. Discrete Applied Mathematics,155(13):1708–1714.

Page 134: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

121

Bruckner, S., Huffner, F., Karp, R., Shamir, R., and Sharan, R. (2009a). Topology-free querying of proteininteraction networks. In Proc. 13th Annual International Conference on Computational Molecular Biology(RECOMB), Tucson, USA, page 74. Springer.

Bruckner, S., Huffner, F., Karp, R., Shamir, R., and Sharan, R. (2009b). Torque: topology-free querying ofprotein interaction networks. Nucleic Acids Research, 37 (Web-Server-Issue:106–108.

Bryant, D. (2000). The complexity of calculating exemplar distances. In Sankoff, D. and Nadeau, J., editors,Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment, and theEvolution of Gene Families, volume 1, pages 207–212. Kluwer Academic Publisher.

Bui-Xuan, B.-M., Habib, M., and Paul, C. (2005). Revisiting t. uno and m. yagiura’s algorithm. In Deng, X.and Du, D.-Z., editors, Proc. 16th International Symposium on Algorithms and Computation (ISAAC), Sanya,Hainan, China, volume 3827 of Lecture Notes in Computer Science, pages 156–165.

Butman, A., Hermelin, D., Lewenstein, M., and Rawitz, D. (2007). Optimization problems in multiple-intervalgraphs. In Proc. 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). To appear.

Chaudhuri, K., Chen, K., Mihaescu, R., and Rao, S. (2006). On the tandem duplication-random loss modelof genome rearrangement. In Giancarlo, R. and Sankoff, D., editors, Proc. of the 17th annual ACM-SIAMsymposium on Discrete algorithm (SODA), Miami, Florida, USA, pages 564–570.

Chen, E., Yang, L., and Yuan, H. (2007a). Improved algorithms for largest cardinality 2-interval patternproblem. Journal of Combinatorial Optimization, 13(3):262–275.

Chen, X., Zheng, J., Fu, Z., Nan, P., Zhong, Y., Lonardi, S., and Jiang, T. (2005). Assignment of orthologousgenes via genome rearrangement. IEEE/ACM Transactions on Computational Biology and Bioinformatics,2(4):302–315.

Chen, Z., Fu, B., Xu, J., Yang, B., Zhao, Z., and Zhu, B. (2007b). Non-breaking similarity of genomes withgene repetitions. In Ma, B. and Zhang, K., editors, Proc. Symposium on Combinatorial Pattern Matching(CPM’07), London, Ontario, Canada, volume 4580 of Lecture Notes in Computer Science, pages 119–130.

Chen, Z., Fu, B., and Zhu, B. (2006). The approximability of the exemplar breakpoint distance problem. In2nd International Conference on Algorithmic Aspects in Information and Management (AAIM), volume 4041 ofLecture Notes in Computer Science, pages 291–302. Springer.

Choffrut, C. and Karhumaki, J. (1997). Combinatorics of Words, in G. Rozenberg and A. Salomaa (eds), Handbookof Formal Languages. Springer-Verlag.

Chudnovsky, M. and Seymour, P. (2005). The structure of claw-free graphs. In Surveys in Combinatorics,volume 327 of London. Math. Soc. Lecture Notes, pages 153–172. Cambridge University Press.

Chung, M.-J. (1998). More efficient algorithm for ordered tree inclusion. Journal of Algorithms, 26(2):370–385.

Chvatal, V. (1979). A greedy heuristic for the set-covering problem. Mathematics of Operations Research,4(3):233–235.

Cieliebak, M., Eidenbenz, S., Pagourtzis, A., and Schlude, K. (2008). On the complexity of variations of equalsum subsets. Nordic Journal of Computing, 14(3):151–172.

Collins, M., Kempe, D., Saia, J., and Young, M. (2007). Nonnegative integral subset representations of integersets. Information Processing Letters, 101(3):129–133.

Crochemore, M., Hancart, C., and Lecroq, T. (2007). Algorithms on Strings. Cambridge.

Page 135: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

122

Crochemore, M., Hermelin, D., Landau, G., Rawitz, D., and Vialette, S. (2008). Approximating the 2-intervalpattern problem. Theoretical Computer Science, 395(2-3):283–297. (special issue for Alberto Apostolico).

Davydov, E. and Batzoglou, S. (2006). A computational model for rna multiple structural alignment.Theoretical Computer Science, 368(3):205–216.

Dent, P., Yacoub, A., Fisher, P., Hagan, M., and Grant, S. (2003). MAPK pathways in radiation responses.Oncogene, 22:5885–5896.

Diestel, R. (2000). Graph Theory. Number 173 in Graduate texts in Mathematics. Springer-Verlag, secondedition.

Dilworth, R. (1950). A decomposition theorem for partially ordered sets. Ann. Math., 51:161–166.

Dinur, I., Guruswami, V., Khot, S., and Regev, O. (2005). A new multilayered PCP and the hardness ofhypergraph vertex cover. SIAM Journal on Computing, 34(5):1129–1146.

Dondi, R., Fertin, G., and Vialette, S. (2007). Weak pattern matching in colored graphs: Minimizing thenumber of connected components. In Proc.10th Italian Conference on Theoretical Computer Science (ICTCS),Roma, Italy, pages 27–38. World-Scientific.

Dondi, R., Fertin, G., and Vialette, S. (2009). Maximum motif problem in vertex-colored graphs. In Kucherov,G. and Ukkonen, E., editors, Proc. 20th Annual Symposium on Combinatorial Pattern Matching (CPM’09), Lille,France, volume 5577 of Lecture Notes in Computer Science, pages 221–235.

Dorit, R. and Gilbert, W. (1991). The limited universe of exons. Current Opinions in Structural Biology,1:973–977.

Dost, B., Shlomi, T., Gupta, N., Ruppin, E., Bafna, V., and Sharan, R. (2007). QNet: A Tool for QueryingProtein Interaction Networks. RECOMB, pages 1–15.

Downey, R. and Fellows, M. (1999). Parameterized Complexity. Springer-Verlag.

Dutton, R. (2009). On a graph’s security number. Discrete Mathematics, 309:4443–4447.

Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. (2001). Rank aggregation methods for the web. In Proc.of the 10th international conference on World Wide Web (WWW), Hong Kong, Hong Kong, pages 613–622.

Een, N. and Sorensson., N. (2006). Translating pseudo-boolean constraints into SAT. Journal on Satisfiability,Boolean Modeling and Computation, 2:1–26.

Ehrenfeucht, A. and Rozenberg, G. (1978). Elementary homomorphisms and a solution of the D0L sequenceequivalence problem. Theoretical Computer Science, 7:169–183.

El-Mabrouk, N. (2005). Genome rearrangements with gene families. In Mathematics of evolution and phylogeny,pages 291–320. Oxford University Press.

Epping, W. H. T. and Oertel, P. (2004). Complexity results on a paint shop problem. Discrete AppliedMathematics, 136(2-3):217–226.

Erdos, P. and Szekeres, G. (1935). A combinatorial problem in geometry. Compositio Mathematica, 2:463–470.

Erdos, P. and Lovasz, L. (1975). Problems and results on 3-chromatic hypergraphs and some related questions.In Hajnal, A., Rado, R., and Sos, V., editors, Infinite and Finite Sets (Colloq., Keszthely, 1973; dedicated to P.Erdos on his 60th birthday), volume 2 of Coll. Math. Soc. J. Bolyai, pages 609–627. North-Holland, Amsterdam.

Page 136: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

123

Estivill-Castro, S. and Wood, D. (1992). A survey of adaptive sorting algorithms. ACM Computing Surveys,24(9):441–476.

Evans, P. (1999a). Algorithms and complexity for annotated sequence analysis. PhD thesis, University of Victoria.

Evans, P. (1999b). Algorithms and Complexity for Annotated Sequences Analysis. PhD thesis, University ofVictoria.

Evans, P. and Wareham, H. (2001). Exact algorithms for computing d pairwise alignments and 3-mediansfrom structure-annotated sequences (extended abstract). In Proc. of the 6th Pacific Symposium on Biocomputing(PSB), pages 559–570.

Evans, P. A. (1999c). Finding common subsequences with arcs and pseudoknots. In Crochemore, M.and Paterson, M., editors, Proc. 10th Annual Symposium Combinatorial Pattern Matching (CPM), WarwickUniversity, UK, volume 1645 of Lecture Notes in Computer Science, pages 270–280. Springer.

Fagnot, I., Fertin, G., and Vialette, S. (2009). On finding small 2-generating sets. In Ngo, H., editor, Proc.15th Annual International Conference (COCOON), Niagara Falls, NY, USA, volume 5609 of Lecture Notes inComputer Science, pages 378–387. Springer.

Fagnot, I., Lelandais, G., and Vialette, S. (2004). Bounded list injective homomorphism for comparativeanalysis of protein-protein interaction graphs. In Proc. 1st Algorithms and Computational Methods forBiochemical and Evolutionary Networks (CompBioNets), Recife, Brazil, pages 45–70. KCL publications.

Fagnot, I., Lelandais, G., and Vialette, S. (2008). Bounded list injective homomorphism for comparativeanalysis of protein-protein interaction graphs. Journal of Discrete Algorithms, 6(2):178–191.

Faudree, R., Flandrin, E., and Ryjacek, Z. (1997). Claw-free graphs - a survey. Discrete Mathematics, 164:87–147.

Favaron, O., Fricke, G., Goddard, W., Hedetniemi, S., Hedetniemi, S., Kristiansen, P., Laskar, R., and Skaggs,R. (2004). Offensive alliances in graphs. Discussiones Mathematicae Graph Theory, 24(2):263–275.

Fellows, M., Fertin, G., Hermelin, D., and Vialette, S. (2007). Sharp tractability borderlines for findingconnected motifs in vertex-colored graphs. In Proc. 34th International Colloquium on Automata, Languagesand Programming (ICALP), Wroclaw, Poland, volume 4596 of Lecture Notes in Computer Science, pages 340–351.Springer.

Felsner, S., Muller, R., and Wernisch, L. (1997). Trapezoid graphs and generalizations: Geometry andalgorithms. Discrete Applied Math., 74:13–32.

Fertin, G., Hermelin, D., Rizzi, R., and Vialette, S. (2007). Common structured patterns in linear graphs:Approximations and combinatorics. In Ma, B. and Zhang, K., editors, Proc. Symposium on CombinatorialPattern Matching (CPM’07), London, Ontario, Canada, volume 4580 of Lecture Notes in Computer Science,pages 241–252.

Fertin, G., Labarre, A., Rusu, I., Tannier, E., and Vialette, S. (2009a). Combinatorics of Genome Rearrangements.MIT press.

Fertin, G., Rizzi, R., and Vialette, S. (2005). Finding exact and maximum occurrences of protein complexes inprotein-protein interaction graphs. In Jedrzejowicz, J. and Szepietowski, A., editors, Proc. 30th InternationalSymposium on Mathematical Foundations of Computer Science (MFCS), Gdansk, Poland, volume 3618 of LectureNotes in Computer Science, pages 328–339. Springer.

Fertin, G., Rizzi, R., and Vialette, S. (2009b). Finding occurrences of protein complexes in protein-proteininteraction graphs. Journal of Discrete Algorithms, 7(1):90–101.

Page 137: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

124

Fertin, G. and Vialette, S. (2009). On the s-labeling problem. In Proc. 5th European conference on Combinatorics,Graph Theory and Applications (EuroComb), Bordeaux, France, volume 34 of Electronic Notes on DiscreteMathematics, pages 273–277.

Flum, J. and Grohe, M. (2006). Parameterized Complexity Theory. Springer Verlag.

Fulkerson, D. and Gross, O. (1965). Incidence matrices and interval graphs. Pacific Journal of Mathematics,15:835–855.

Gambette, P. and Vialette, S. (2007). On restrictions of balanced 2-interval graphs. In Brandstadt, A., Kratsch,D., and Muller, H., editors, 33rd International Workshop on Graph-Theoretic Concepts in Computer Science(WG’07), volume 4769 of Lecture Notes in Computer Science, pages 55–65.

Garey, M. and Johnson, D. (1979). Computers and Intractability: A guide to the theory of NP-completeness.W.H. Freeman, San Francisco.

Garey, M., Johnson, D., and Stockmeyer, L. (1976). Some simplified NP-complete graph problems. TheoreticalComputer Science, 1:237–267.

Gavin, A., Boshe, M., et al. (2002). Functional organization of the yeast proteome by systematic analysis ofprotein complexes. Nature, 414(6868):141–147.

Goldman, D., Istrail, S., and Papadimitriou, C. (1999). Algorithmic aspects of protein structure similarity. InProc. 40th Annual Symposium of Foundations of Computer Science (FOCS), New York, NY, USA, pages 512–522.IEEE Computer Society.

Golumbic, M. (1980). Algorithmic Graph Theory and Perfect Graphs. Academic Press, New York.

Gramm, J. (2004a). A polynomial-time algorithm for the matching of crossing contact-map patterns.IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4(1):171–180.

Gramm, J. (2004b). A polynomial-time algorithm for the matching of crossing contact-map patterns.IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(4):171–180.

Gramm, J., Guo, J., and Niedermeier, R. (2002). Pattern matching for arc-annotated sequences. In Agrawal,M. and Seth, A., editors, Proc. 22nd Foundations of Software Technology and Theoretical Computer Science(FSTTCS), Kanpur, India, Lecture Notes in Computer Science, pages 182–193.

Gramm, J., Guo, J., and Niedermeier, R. (2006). Pattern matching for arc-annotated sequences. ACMTransactions on Algorithms, 2(1):44–65.

Griggs, J. and West, D. (1979). Extremal values of the interval number of a graph, I. SIAM Journal on Algebraicand Discrete Methods, 1:1–7.

Guignon, V., Chauve, C., and Hamel, S. (2005). An edit distance between rna stem-loops. In Consens, M. P.and Navarro, G., editors, 12th International Conference SPIRE, volume 3772 of LNCS, pages 335–347.

Guillemot, S. (2008). Parameterized complexity and approximability of the SLCS problem. In Proc. Interna-tional Workshop on Parameterized and Exact Computation (IWPEC), volume 5018 of Lecture Notes in ComputerScience, pages 115–128. Springer.

Guillemot, S. and Vialette, S. (2009). Pattern matching for 321-avoiding permutations. In Dong, Y., Du, D.-Z.,and Ibarra, O., editors, Proc. 20-th International Symposium on Algorithms and Computation (ISAAC), Hawaii,USA, volume 5878 of Lecture Notes in Computer Science, page 10641073.

Page 138: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

125

Guo, J. (2002). Exact algorithms for the longest common subsequence problem for arc-annotated sequences.Master’s thesis, Univeristy of Tubingen.

Guo, J., Gramm, J., Huffner, F., Niedermeier, R., and Wernicke, S. (2006). Compression-based fixed-parameter algorithms for feedback vertex set and edge bipartization. Journal of Computer and SystemSciences, 72(8):1386–1396.

Gupta, U., Lee, D., and Leung, J.-T. (1982). Efficient algorithms for interval graph and circular-arc graphs.Networks, 12:459–467.

Gurski, F. (2008). Polynomial algorithms for protein similarity search for restricted mrna structures. Informa-tion Processing Letters, 105(5):170–176.

Gyarfas, A. (2003). Combinatorics of intervals, preliminary version. Institute for Mathematics and itsApplications (IMA) Summer Workshop on Combinatorics and Its Applications. available online athttp://www.math.gatech.edu/news/events/ima/newag.pdf.

Gyarfas, A. and Lehel, J. (1970). A helly-type problem in trees. In os, P. E., Renyi, A., and Sos, V., editors,Combinatorial Theory and its Application, pages 571–584. North-Holland.

Gyarfas, A. and West, D. (1995). Multitrack interval graphs. Congressus Numerantium, 109:109–116.

Hajiaghayi, M., Jain, K., Lau, L., Mandoiu, I., Russell, A., and Vazirani, V. (2006). Minimum multicoloredsubgraph problem in multiplex PCR primer set selection and population haplotyping. In Proceedings of the6th International Conference on Computational Science (ICCS), pages 758–766.

Halldorsson, M. and Karlsson, R. (2006). Strip graphs: Recognition and scheduling. In Fomin, F., editor, Proc.32nd International Workshop on Graph-Theoretic Concepts in Computer Science (WG), Bergen, Norway, volume4271 of Lecture Notes in Computer Science, pages 137–146.

Hamel, S., Blin, G., and Vialette, S. (2010). Comparing RNA structures with biologically relevant operationscannot be done without strong combinatorial restrictions. In Fujita, S. and Rahman, S., editors, Proc. 4thWorkshop on Algorithms and Computation (WALCOM’10), Dhaka, Bangladesh, Lecture Notes in ComputerScience. To appear.

Hassin, R. and Segev, D. (2005). The set cover with pairs problem. In Ramanujam, R. and .Sen, S., editors,Proceedings of the 25th international conference on Foundations of Software Technology and Theoretical ComputerScience (FSTTCS), Hyderabad, India, pages 164–176.

Heber, S. and Stoye, J. (2001). Finding all common intervals of k permutations. In Amir, A. and Landau, G.,editors, Proc. Annual Symposium on Combinatorial Pattern Matching (CPM), Jerusalem, Israel, volume 2089 ofLecture Notes in Computer Science, pages 207–218. Springer.

Hein, J., Jiang, T., Wang, L., and Zhang, K. (1996). On the complexity of comparing evolutionary trees.Discrete Applied Mathematics, 71:153–169.

Hell, P. and Nesetril, J. (2004). Graphs and Homomorphisms. Lecture Series in Mathematics and Its Applications.Oxford University Press.

Hermelin, D., Fellows, M., Rosamond, F., and Vialette, S. (2009). On the parameterized complexity ofmultiple-interval graph problems. Theoretical Computer Science, 410(1):53–61.

Hermelin, D., Rawitz, D., Rizzi, R., and Vialette, S. (2008). The minimum substring cover problem. Informationand Computation, 206(11):1303–1312.

Page 139: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

126

Herrbach, C. and Vialette, S. (2005). Linear graph non-crossing structural alignment under the rna stacking-pair scoring scheme. In Proc. 2nd Algorithms and Computational Methods for Biochemical and EvolutionaryNetworks (CompBioNets), Lyon, France. KCL publications.

Hirschberg, D. (1977). Algorithms for the longest common subsequence problem. Journal of the ACM,24(4):664–675.

Ho, Y., Gruhler, A., et al. (2002). Systematic identification of protein complexes in Saccharomyces cerevisaeby mass spectrometry. Nature, 415(6868):180–183.

Hochbaum, D. (1982). Approximation algorithms for the set covering and vertex cover problems. SIAMJournal on Computing, 11(3):555–556.

Huang, Y.-T., Chao, K.-M., and Chen, T. (2005). An approximation algorithm for haplotype inference bymaximum parsimony. In Proceedings of the 20-th ACM Symposium on Applied Computing (SAC), pages146–150.

Huffner, F., Wernicke, S., and Zichner, T. (2007). Algorithm Engineering For Color-Coding To FacilitateSignaling Pathway Detection. In Proceedings of the 5th Asia-Pacific Bioinformatics Conference. Imperial CollegePress.

Hunt, J. and Szymanski, T. (1977a). A fast algorithm for computing longest common subsequences. Commu-nications of the ACM, 20:350353.

Hunt, J. and Szymanski, T. (1977b). A fast algorithm for computing longest common subsequences. Commu-nications of the ACM, 20:350–353.

Ibarra, L. (1997). Finding pattern matchings for permutations. Information Processing Letters, 61(6):293–295.

Ideker, T., Karp, R., Scott, J., and Sharan, R. (2006). Efficient algorithms for detecting signaling pathways inprotein interaction networks. Journal of Computational Biology, 13(2):133–144.

Ieong, S., Kao, M.-Y., Lam, T.-W., Sung, W.-., and Yiu, S.-. (2003). Predicting RNA secondary structureswith arbitrary pseudoknots by maximizing the number of stacking pairs. Journal of Computational Biology,10(6):981–995.

Jacks, T., Masiarz, M. P. F., Luciw, P., Barr, P., and Varmus, H. (1988). Characterization of ribosomalframeshifting in HIV-1 gag-pol expression. Nature, 331:280–283.

Jacks, T. and Varmus, H. (1985). Expression of the Rous sarcoma virus pol gene by ribosomal frameshifting.Science, 230:1237–1242.

Jiang, M. (2007a). A 2-approximation for the preceding-and-crossing structured 2-interval pattern problem.Journal of Combinatorial Optimization, 13:217–221. Special Issue on Bioinformatics.

Jiang, M. (2007b). A PTAS for the weighted 2-interval pattern problem over the preceding-and-crossingmodel. In Dress, A., Xu, Y., and Zhu, B., editors, 1st Annual International Conference on CombinatorialOptimization and Applications (COCOA’07), Xi’an, Shaanxi, China, volume 4616 of Lecture Notes in ComputerScience, pages 378–387.

Jiang, M. (2008). Approximation algorithms for predicting RNA secondary structures with arbitrary pseudo-knots. IEEE/ACM Transactions on Computational Biology and Bioinformatics. To appear.

Jiang, M. (2010). On the parameterized complexity of some optimization problems related to multiple-interval graphs. In Proc. 21th Annual Symposium on Combinatorial Pattern Matching (CPM), New York, USA,volume 6129 of Lecture Notes in Computer Science, pages 125–137. Springer.

Page 140: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

127

Jiang, T. and Li, M. (1995). On the approximation of shortest common supersequences and longest commonsubsequences. SIAM Journal on Computing, 24:1122–1139.

Jiang, T., Lin, G., Ma, B., and Zhang, K. (2000a). The longest common subsequence problem for arc-annotatedsequences. In Giancarlo, R. and Sankoff, D., editors, Proc. of the 11th annual symposium on CombinatorialPattern Matching (CPM), Montreal, Canada, pages 154–165.

Jiang, T., Lin, G.-H., Ma, B., and Zhang, K. (2000b). The longest common subsequence problem for arc-annotated sequences. In Giancarlo, R. and Sankoff, D., editors, Proc. 11th Annual Symposium on CombinatorialPattern Matching (CPM), Montreal, Canada, volume 1848 of Lecture Notes in Computer Science, pages 154–165.Springer.

Johnson, D. (1974). Approximation algorithms for combinatorial problems. Journal of Computer and SystemSciences, 9:256–278.

Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M. (2004). The KEGG resource for decipheringthe genome. Nucleic acids research, 32:277–280.

Kannan, R. (1987). Minkowskis convex body theorem and integer programming. Mathematics of OperationsResearch, 12:415–440.

Kaplan, H. and Shafrir, N. (2006). The greedy algorithm for edit distance with moves. Information ProcessingLetters, 97(1):27–37.

Karger, D., Motwani, R., and Ramkumar, G. (1995). On approximating the longest path in a graph. SIAMJournal on Computing, 24:1122–1139.

Kelley, B., Sharan, R., Karp, R., Sittler, T., Root, D. E., Stockwell, B., and Ideker, T. (2003). Conserved pathwayswithin bacteria and yeast as revealed by global protein network alignment. Proceedings of the NationalAcademy of Sciences, 100(20):11394–11399.

Kilpelainen, P. and Mannila, H. (1995). Ordered and unordered tree inclusion. SIAM J. Comp., 24(2):340–356.

King, A. and Reed, B. (2007). Bounding χ in terms ofω ans δ for quasi-line graphs. Article in preparation.

Knuth, D. (1973). Fundamental Algorithms, volume 1 of The Art of Computer Programming. Addison-Wesley,Reading MA, 3rd edition.

Knuth, D. (1998). Sorting and Searching, The Art of Computer Programming Vol. 3 (Second Edition). Addison-Wesley.

Kolman, P., Goldstein, A., and Zheng, J. (2005). Minimum common string partition problem: Hardness andapproximations. The Electronic Journal of Combinatorics, 12(1). Paper R50.

Kostochka, A. (1988). On upper bounds on the chromatic numbers of graphs. Transactions of the Institute ofMathematics (Siberian Branch of the Academy of Sciences in USSR), 10:204–226.

Kostochka, A. and Kratocvil, J. (1997). Covering and coloring polygon-circle graphs. Discrete Mathematics,163:299–305.

Kristiansen, P., Hedetniemi, S., and Hedetniemi, S. (2004). Alliances in graphs. Journal of CombinatorialMathematics and Combinatorial Computing, 48:157–177.

Kubica, M., Rizzi, R., Vialette, S., and Walen, T. (2006). Approximation of RNA multiple structural alignment.In Lewenstein, M. and Valiente, G., editors, Proc. 17th Annual Symposium on Combinatorial Pattern Matching(CPM),Barcelona, Spain, volume 4009 of Lecture Notes in Computer Science. Springer.

Page 141: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

128

Lacroix, V., Fernandes, C., and Sagot, M.-F. (2006). Motif search in graphs: application to metabolic networks.IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 3(4):360–368.

Lelandais, G., Crom, S. L., Devaux, F., Vialette, S., Church, G., Jacq, C., and Marc, P. (2004a). yMGV: across-species expression data mining tool. Nucl. Acids. Res., 32:D323–D325.

Lelandais, G., Marc, P., Vincens, P., Jacq, C., and Vialette, S. (2004b). MiCoViTo: a tool for gene-centriccomparison and visualization of yeast transcriptome states. BMC Bioinformatics, 5(20).

Lelandais, G., Vincens, P., Badel-Chagnon, A., Vialette, S., Jacq, C., and Hazout, S. (2006). Comparing geneexpression networks in multi-dimensional space to extract similarities and differences between organisms.Bioinformatics, 22(11):1359–1366.

Lenstra, H. (1983). Integer programming with a fixed number of variables. Mathematics of Operations Research,8:538–548.

Lerat, E., Daubin, V., and Moran, N. (2003). From gene trees to organismal phylogeny in prokaryotes: thecase of the γ-proteobacteria. PLoS biology, 1(1):101–109.

Li, S. and Li, M. (2006). On the complexity of the crossing contact map pattern matching problem. In Proc.Proc. of 6th Workshop on Algorithms in Bioinformatics (WABI), volume 4175 of Lecture Notes in ComputerScience, pages 231–241.

Li, S. and Li, M. (2009a). On two open problems of 2-interval patterns. Theoretical Computer Science, 410(24-25):2410–2423.

Li, S. and Li, M. (2009b). On two open problems of 2-interval patterns. Theoretical Computer Science,410(24-25):2410–2423.

Lin, G., Chen, Z.-Z., Jiang, T., and Wen, J. (2002). The longest common subsequence problem for sequenceswith nested arc annotations. Journal of Computer and System Sciences, 65(3):465–480. Special issue oncomputational biology.

Lovasz, L. (1974). On the ratio of optimal integeral and fractional solutions. Discrete Mathematics, 13:383–390.

Lozano, A. and Valiente, G. (2004). On the maximum common embedded subtree problem for ordered trees.In Iliopoulos, C. and Lecroq, T., editors, String Algorithmics, chapter 7. King’s College London Publications.

Lyngsø R. and Pedersen, C. (2000). RNA pseudoknot prediction in energy-based models. Journal ofComputational Biology, 7(3-4):409–427.

Marcus, A. and Tardos, G. (2004). Excluded permutation matrices and the Stanley-Wilf conjecture. Journal ofCombinatorial Series A, 107(1):153–160.

Margeot, E., Blugeon, C., Sylvestre, J., Vialette, S., Jacq, C., and Corral-Debrinski, M. (2002). In saccharomycescerevisae, ATP2 mRNA sorting to the vicinity of mitochondria is essential for respiratory function. EMBOJournal, 21(24):6893–6904.

McConnell, R. and Spinrad, J. (1999). Modular decomposition and transitive orientation. Discrete Mathematics,201:189–241.

McKee, T. and McMorris, F. (1999). Topics in intersection graph theory. SIAM monographs on discretemathematics and applications.

Micali, S. and Vazirani, V. (1980). AnO(√

|V ||E|) algorithm for finding maximum matching in general graphs.In Proc. 21st Annual Symposium on Foundation of Computer Science (FOCS), pages 17–27. IEEE.

Page 142: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

129

Moser, L. (1960). On the representation of 1, 2, . . . , n by sums. Acta Arith., 6:11–13.

Neraud, J. (1990). Elementariness of a finite set of words is co-NP-complete. Theoretical Informatics andApplications, 24(5):459–470.

Nguyen, C. (2005). Algorithms for calculating exemplar distances. Technical report, National University ofSingapore. Honours Year Project.

Niedermeier, R. (2006). Invitation to Fixed Parameter Algorithms. Lecture Series in Mathematics and ItsApplications. Oxford University Press.

Ohno, S. (1970). Evolution by gene duplication. Springer-Verlag.

P. Thebault, S. d. G., Schiex, T., and Gaspin, C. (2006). Searching rna motifs and their intermolecular contactswith constraint networks. Bioinformatics, 22(17):2074–2080.

Patthy, L. (1991). Exons - original building blocks of proteins? BioEssays, 13(4):187–192.

Pellegrini, M., Marcotte, E., Thompson, M., Eisenberg, D., and Yeates, T. (1999). Assigning protein functionsby comparative genome analysis: protein phylogenetic profiles. PNAS, 96(8):4285–4288.

Pereira-Leal, J., Enright, A., and Ouzounis, C. (2004). Detection of functional modules from protein interactionnetworks. Proteins, 54(1):49–57.

Pinter, R., Rokhlenko, O., Yeger-Lotem, E., and Ziv-Ukelson, M. (2005). Alignment of metabolic pathways.Bioinformatics, 21(16):3401–3408.

Raz, R. and Safra, S. (1997). A sub-constant error-probability low-degree test, and a sub-constant error-probability PCP characterization of NP. In Proceedings of the 29th ACM Symposium on the Theory OfComputing (STOC), pages 475–484.

Robertson, N. and Seymour, P. (1986). Graph minors. II. Algorithmic aspects of tree-width. SIAM Journal ofAlgorithms, 7:309–322.

Rodrıguez-Velazquez, J. A. and Sigarreta, J. (2009). Global defensive k-alliances in graphs. Discrete AppliedMathematics, 157(2):211–218.

Rossin, D. and Bouvel, M. (2006). The longest common pattern problem for two permutations. PureMathematics and Applications, 17:55–69.

Rozenberg, G. and Salomaa, A. (1980). The Mathematical Theory of L Systems. Academic Press, New York.

Sadique Adi, S., Braga, M., Fernandes, C., Eduardo Ferreira, C., Viduani Martinez, F., M.-F Sagot, M. S.,Tjandraatmadja, C., and Wakabayashi, Y. (2008). Repetition-free longest common subsequence. ElectronicNotes in Discrete Mathematics, 30:243–248.

Sankoff, D. (1999). Genome rearrangement with gene families. Bioinformatics, 15(11):909–917.

Sankoff, D. and Haque, L. (2005). Power boosts for cluster tests. In Proc. 3rd RECOMB Comparative GenomicsSatellite Workshop, Dublin, Ireland, volume 3678 of Lecture Notes in Bioinformatics, pages 11–20.

Scheinerman, E. and West, D. (1983). The interval number of a planar graph: three intervals suffice. Journalof Combinatorial Theory - Series B, 35:224–239.

Sharan, R. and Ideker, T. (2006). Modeling cellular machinery through biological network comparison.Nature Biotechnology, 24:427–433.

Page 143: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

130

Sharan, R., Ideker, T., Kelley, B., Shamir, R., and Karp, R. (2004). Identification of protein complexes bycomparative analysis of yeast and bacterial protein interaction data. In Proc. 8th annual internationalconference on Computational molecular biology (RECOMB), San Diego, California, USA, pages 282–289. ACMPress.

Sharan, R., Suthram, S., Kelley, R., Kuhn, T., McCuine, S., Uetz, P., Sittler, T., Karp, R., and Ideker, T. (2005).Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA, 102(6):1974–1979.

Shasha, D. and Zhang, K. (1989). Simple fast algorithms for the editing distance between trees and relatedproblems. SIAM Journal on Computing, 18(6):1245–1262.

Shen-Orr, S., Milo, R., Mangan, S., and Alon, U. (2002). Network motifs in the transcriptional regulationnetwork of escherichia coli. Nature Genetics, 31(1):64–68.

Shlomi, T., Segal, D., Ruppin, E., and Sharan, R. (2006). QPath: a method for querying pathways in aprotein-protein interaction network. BMC Bioinformatics, 7:199.

Stanley, R. (1999). Enumerative Combinatorics, volume 2. Cambridge University Press, Cambridge.

Stohr, A. (1955a). Geloste und ungeloste fragen uber basen der naturlichen zahlenreihe, i. J. reine Angew.Math., 194:40–65.

Stohr, A. (1955b). Geloste und ungeloste fragen uber basen der naturlichen zahlenreihe, ii. J. reine Angew.Math., 194:111–140.

Sylvestre, J., Vialette, S., Corral-Debrinski, M., and Jacq, C. (2003). Long mRNA coding for yeast mitochon-drial proteins of prokaryotic origin preferentially localize to the vicinity of mitochondria. Genome Biology,4(7):1–9.

Tang, J. and Moret, B. (2003). Phylogenetic reconstruction from gene-rearrangement data with unequal genecontent. In Dehne, F., Sack, J.-R., and Smid, M., editors, Proc. 8-th International Workshop on Algorithms andData Structures (WADS), Ottawa, Ontario, Canada, volume 2748 of Lecture Notes in Computer Science, pages37–46. Springer.

Tarjan, R. and Yannakakis, M. (1984). Simple linear-time algorithms to test chordality of graphs, test acyclicityof hypergraphs, and selectively reduce acyclic hypergraphs. SIAM J. Comput., 13:566–579.

Thomasse, S. (2009). A quadratic kernel for feedback vertex set. In Proceedings of the Nineteenth AnnualACM-SIAM Symposium on Discrete Algorithms.

Tichy, W. (1984). The string-to-string correction problem with block moves. ACM Transactions on ComputerSystems, 2(4):309–321.

Tiskin, A. (2006). Longest common subsequences in permutations and maximum cliques in circle graphs. InLewenstein, M. and Valiente, G., editors, Proc. 17th Combinatorial Pattern Matching (CPM), Barcelona, Spain,volume 4009 of Lecture Notes in Computer Science, pages 270–281.

Titz, B., Schlesner, M., and Uetz, P. (2004). What do we learn from high-throughput protein interaction data?Expert Review of Anticancer Therapy, 1(1):111–121.

Tripathi, A. (2006). A note on the postage stamp problem. Journal of Integer Sequences, 9. 06.013.

Trotter, W. and Harary, F. (1979). On double and multiple interval graphs. J. Graph Theory, 3:205–211.

Uetz, P., Giot, L., et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomycescerevisae. Nature, 403(6770):623–627.

Page 144: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

131

Uno, T. and Yagiura, M. (2000). Fast algorithms to enumerate all common intervals of two permutations.Algorithmica, 26(2).

Vazirani, V. (2002). Approximation Algorithms. Springer. 1st ed. 2001. Corr. 2nd printing, 2002.

Vialette, S. (2004). On the computational complexity of 2-interval pattern matching problems. TheoreticalComputer Science, 312(2-3):223–249.

Vialette, S. (2006). Packing of (0, 1)-matrices. Theoretical Informatics and Applications RAIRO, 40(4):519–536.

Vialette, S. (2008). Two-interval pattern problems. In Kao, M.-Y., editor, Encyclopedia of Algorithms, pages985–989. Springer.

Waterman, M. (1995). Introduction to computational biology - Maps, sequences and genomes. Chapman and Hall,London.

Wernicke, S. (2006). Efficient detection of network motifs. IEEE/ACM Transactions on Computational Biologyand Bioinformatics, 3(4):347–359.

West, D. and Shmoys, D. (1984). Recognizing graphs with fixed interval number is NP-complete. DiscreteApplied Mathematics, 8:295–305.

Whang, M.-S. and Wang, G.-H. (1992). Efficient algorithms for the maximum weight clique and maximumweight independent set problems on permutation graphs. Information Processing Letters, 43:293–295.

Xu, J., Jiao, F., and Berger, B. (2007). A parameterized algorithm for protein structure alignment. Journal ofComputational Biology, 14(5):564–577.

Zhang, K. and Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and relatedproblems. SIAM journal of computing, 18(6):1245–1262.

Page 145: Algorithmic Contributions to Computational Molecular Biology · on computational molecular biology, i.e., those algorithmic and combinatorial topics that are connected to molecular

132