
Master ECD

Universite Lyon 1 - Universite Lyon 2 - Universite de Nantes - Universite Paris 11 Orsay

Comparison of Graphs

Application to the Analysis of Social Networks in the Middle Ages

- Internship Report -

Daniel PORUMBEL

Coordinator: Prof. Pascale KUNTZ

Nantes 2006

Abstract

This project is about network analysis and comparison. Nowadays, networks are objects present almost everywhere around us, influencing our lives. The Internet and the WWW are changing the way we think and learn. Our very physical existence is based on various biological nets. The development of communication networks is one of the best indicators of a country's development.

In this environment, graph theory has made great progress. Various needs for graph comparison and classification have appeared, in order to better understand the structures sustaining our existence.

We first present the issue from a theoretical viewpoint and afterwards describe several very practical applications - chemical structures, image recognition, voice timbre matching, etc. We also present an original approach to the analysis of social networks in the Middle Ages, based on historical documents available for a region of South-Western France.

Contents

1 Introduction
  1.1 General notions of graph similarity
  1.2 Project outline
    1.2.1 Application on the analysis of social structures

2 Graph Matching in Mathematics
  2.1 Basic Notions
    2.1.1 Definition of graph isomorphism
  2.2 Graph isomorphism as an NP-complete problem
  2.3 Graph isomorphism algorithms
    2.3.1 Ullmann's algorithm
    2.3.2 The literature on similar algorithms
  2.4 Expanding the basic measures
  2.5 Heuristics
  2.6 The edit distance

3 Applied Graph Matching
  3.1 Similarity searches in bioinformatics
    3.1.1 Chemical structure query systems
    3.1.2 Molecular similarity in organic syntheses
    3.1.3 A generalization of the edit distance for functional similarity of protein molecules
  3.2 Similarity searches in image recognition
    3.2.1 Face recognition using bunch graphs
  3.3 Timbre matching
  3.4 Graph similarity and web-searching

4 Social structures in the Middle Ages
  4.1 Questions of historians
  4.2 The input data
  4.3 The data digitization
  4.4 The plan of our analysis

5 Conclusions and further work

Chapter 1

Introduction

Following the recent evolution of network science, graphs have been extensively studied over the last years, for example in numerous link analysis workshops around the world. They are powerful and general tools that allow us to represent relational information including at the same time semantic, combinatorial and even geometrical features.

Networks are present everywhere around us; in fact they are the core of our modern civilization. Communication networks, the WWW, the Internet, peer-to-peer networks, chemical structures, genome, neural networks, transportation networks, river networks, social networks, collaborations, terrorist networks, friendship networks, telephone call graphs, mail networks, power grids, relations between enterprises, nets of ownership, networks of influence, electronic circuits, nets of software components, geometric graphs... - we reside in a world of networks.

Graph theory has made great progress. Recently, many extensive studies of large nets were performed in statistical physics, as for example in [1, 7]. Usually, the objective is to understand the structural and functional features of networks. Another point of view comes into sight when we have a collection of networks gathered from a single application domain: the issue then becomes applying graph comparison techniques in order to construct typologies and classifications. This problem appears in very diverse applications, as for example in:

• chemistry and pharmacology, where we need to define similarity measures between molecules in order to execute queries (e.g. searching for compounds, finding similar structures) in chemical databases [13, 25]

• sociology, where the analysis of social networks (nets of acquaintances, networks of influence, networks of friendships) serves to compare and describe social structures, relationship preferences, dynamics of linking, small world effects¹ (studied by D. J. Watts and S. H. Strogatz, as described for example in [23] and [7])

• web searching, which uses graph matching techniques in Web search technology [15]

• robotics, in which we want to compare graphs of states

• pattern recognition, for example image recognition (see section 3.2) or voice timbre recognition (see section 3.3)

1.1 General notions of graph similarity

Intuitively, the generic problem can be stated like this: given two graphs G1 and G2, to what extent is G1 similar to G2? How can one decide whether they are very similar, resemble each other a little or do not resemble each other at all? There is no clear or straightforward answer to this question, but there are 3 important approaches presented in the literature:

1. We can describe each graph by a set of indicators used in statistical physics (e.g. degree distribution, efficiency, robustness, clustering, etc.). This way, the initial problem is transformed into a classic problem of data analysis: each graph is described by a set of variables (or indicators) and one can try to classify the graphs just in terms of them using traditional methods of data mining [3]. The literature in this area is very rich, but the results cannot always take into consideration all aspects of the combinatorial features of graphs.

2. One can try to construct a projection of graphs into a geometrical space (often the Euclidean space); this kind of technique allows the application of methods from the comparison of point sets (clustering points). The projection is difficult, but it can be made by spectral decomposition of graphs [14, 16].

3. In this project, I will try to approach similarity directly (without any kind of projection) by constructing distances and measurements on pairs of graphs. This problem is particularly complex due to the combinatorial features of graphs. A natural idea concerning this issue is to search for maximum common structures inside G1 and G2; this can be done by using pattern-matching algorithms on graphs, algorithms which usually have limited applicability to large databases because the underlying problems are NP-complete.

¹ We should mention an interesting concept called "six degrees of separation"; after a 1967 small world experiment, the social psychologist Stanley Milgram suggested that two random US citizens were connected on average by a chain of six acquaintances. The small world effect is the hypothesis that everyone in the world can be reached through a short chain of social acquaintances.
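
The first of the three approaches (describing each graph by a vector of indicators) can be sketched as follows. This is only an illustration under our own assumptions: the choice of the three indicators (density, mean degree, mean clustering coefficient), the Euclidean distance between indicator vectors and the function names are not taken from the cited literature.

```python
from itertools import combinations

def indicators(n, edges):
    """Return (density, mean degree, mean clustering coefficient) of a simple
    undirected graph with nodes 0..n-1 (assumes no duplicate edges)."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degrees = [len(adj[v]) for v in range(n)]
    density = 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0
    clustering = []
    for v in range(n):
        k = degrees[v]
        if k < 2:
            clustering.append(0.0)  # convention: no triangle possible
            continue
        # number of edges among the neighbours of v
        links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
        clustering.append(2 * links / (k * (k - 1)))
    return (density, sum(degrees) / n, sum(clustering) / n)

def indicator_distance(g1, g2):
    """Euclidean distance between the indicator vectors (an arbitrary choice)."""
    x, y = indicators(*g1), indicators(*g2)
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

triangle = (3, [(0, 1), (1, 2), (0, 2)])
path = (3, [(0, 1), (1, 2)])
print(indicators(*triangle), indicator_distance(triangle, path))
```

Two graphs with the same indicator vector are at distance 0 even when they are not isomorphic, which is exactly the loss of combinatorial information mentioned in the first item.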

1.2 Project outline

One of the goals of this thesis is to compare the results of the direct combinatorial method of comparison with other methods studied and applied by our team (COD - COnnaissances & Decision) in the LINA laboratory (Laboratoire d'Informatique de Nantes Atlantique). After an overview of theoretical issues in graph similarity (chapter 2), the most important part of the project is devoted to a state of the art on applied graph similarity (chapter 3). As we will explain, similarity indicators are very often defined ad hoc for specific application fields but, still, they can be very general. We tried to highlight their wide-ranging features in order to extract valuable information for many applications other than their own.

1.2.1 Application on the analysis of social structures

An experimental application (chapter 4) works on an original graph corpus representing relationships between persons involved in official commercial transactions (sales, purchases, investitures, etc.) in the French Middle Ages. In fact, this work is part of a multidisciplinary project involving computer scientists, historians and experts of the medieval period.

Generally speaking, the documents available for this period involve just the higher social categories (i.e. the clerics, the nobility, etc.), ignoring the most numerous category - the peasantry, which made up more than 90% of the total population. Yet the historians' starting questions concern precisely the evolution of peasant society over the Middle Ages. For these reasons, the geographical area of study was restricted to South-West France, a region for which historians have gathered extensive information concerning the whole population.

Our database includes hundreds of official acts; these acts contain dates, names of persons and regions, and transaction types, enabling us to construct various graphs of relationships (graphs of names, graphs of sales and purchases, graphs of person intersection, etc.). In this thesis, the study is concerned with comparisons of graphs associated with periods of one generation (50 years in the Middle Ages).


In the second part of chapter 4, the database is explained in detail, along with the interface created for the acquisition of digital data; one of the goals of this interface was to create graphs automatically. Due to the complexity of the original data (available in the form of manuscripts hand-written in old French), this phase of data acquisition is the most complex, and it is actually still under construction with the historians. The graphs that we have for the moment are still very small (tens of nodes, in sharp contrast with the predicted graphs of hundreds of nodes) and, consequently, the algorithms had not yet been tested at the time of the report's deadline; the experiments are due by the end of the internship.


Chapter 2

Graph Matching in Mathematics

Given two graphs (or networks), the first natural questions to ask about them are:

• what do they look like?

• do they both have dense connections?

• do they have similar features?

• do they share common components?

• why do they have the similarities they have?

In fact, this chapter is devoted to a theoretical discussion of just these issues. In the beginning, the structure of graphs was an object of immense interest especially in mathematics. Thus, graph theory already provides us with a very important technique of graph matching: the isomorphism function.

Isomorphism is an exact criterion for determining whether two graphs have identically patterned structures, or for checking whether one graph is part of another; two compared graphs can be perfectly matched if and only if there exists a bijection between the nodes (the isomorphism function) which preserves adjacency. This approach provides us with a very general form of pattern matching that finds practical application in many real networks.

In modern database systems, many traditional techniques of graph isomorphism can be applied to measure the similarity of network-like objects. Examples of such objects can be found in various areas, namely computer science, communications, biology, sociology and economics, and all these sciences provide different views. However, a difficult problem is the computational complexity of the purely mathematical techniques, which limits their applicability to large databases.

This chapter aims at introducing the theoretical work done over many years in graph theory, while in the next chapter we present the contributions of similarity search to applied areas of science. In the next three sections we give an overview of the isomorphism approach to graph similarity, and in the last section we present a different similarity measure based on a so-called edit distance.

2.1 Basic Notions

From a formal point of view, a graph can be described as a binary relation (edges) between nodes. In other words, it is a set of vertices (nodes) linked via edges (arcs). Graphs with labels on nodes are called attributed graphs; graphs with weights (distances) on edges are called weighted graphs. If the edges have directions (or weights of values ±1), the graph is called directed. Usually, we denote the set of vertices by V and the set of edges by E, and thus the graph G is described as a pair of nodes and edges:

G = (V, E), E ⊆ V × V

The total number of connections of a node is called its degree k; the degree distribution is one of the most important characteristics of graphs. The adjacency matrix of a graph provides its complete image. It indicates which of the vertices are connected (adjacent). This is a square N × N matrix, where N is the total number of vertices in the network. For simple undirected graphs, its element aij is equal to 1 if there is an edge between the vertex i and the vertex j, and to 0 otherwise. For weighted graphs, the element aij is equal to the distance from the vertex i to the vertex j, or ∞ if the nodes are not connected.
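
As a small illustration of these notions, the sketch below builds the adjacency matrix of a simple undirected graph and derives the degrees and the degree distribution from it. The function names and the list-of-lists representation are our own choices for the example.

```python
def adjacency_matrix(n, edges):
    """N x N matrix with a[i][j] = 1 when nodes i and j are adjacent, else 0."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
        a[j][i] = 1  # undirected graph: the matrix is symmetric
    return a

def degrees(a):
    """Degree of each node = sum of its row in the adjacency matrix."""
    return [sum(row) for row in a]

def degree_distribution(a):
    """Map each degree k to the number of nodes having that degree."""
    dist = {}
    for k in degrees(a):
        dist[k] = dist.get(k, 0) + 1
    return dist

# A star with centre 0 and three leaves.
star = adjacency_matrix(4, [(0, 1), (0, 2), (0, 3)])
print(degrees(star))              # [3, 1, 1, 1]
print(degree_distribution(star))  # {3: 1, 1: 3}
```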

2.1.1 Definition of graph isomorphism

Two graphs G1 and G2 are isomorphic if there exists a one-to-one mapping of their nodes that preserves the graph structure (the edges). A simple example of graph isomorphism is shown in figure 2.1. More formally, two graphs G1 = (V1, E1) and G2 = (V2, E2) are isomorphic if and only if there exists a bijective function f : V1 → V2 (an isomorphism) such that

∀(u, v) ∈ V1 × V1, (u, v) ∈ E1 ⇔ (f(u), f(v)) ∈ E2.

Figure 2.1: Example of an isomorphism between two graphs

The notion of isomorphism is very general in mathematics, especially in abstract algebra. Intuitively, it is a kind of mapping between objects which shows a relationship between two properties or operations. If there exists an isomorphism between two structures, we call the two structures isomorphic. In a certain sense, isomorphic structures are structurally identical, if one chooses to ignore finer-grained differences that may arise from how they are defined.

Consequently, the above definition is not limited to simple undirected graphs. Let us consider weighted graphs and denote by w((u, v)) = d the weight (distance) associated with an edge (u, v). The definition of the isomorphism function can be naturally generalized:

∀(u, v) ∈ V1 × V1, w((u, v)) = d ⇔ w((f(u), f(v))) = d

where w is a weight function defined on all edges. For simple undirected graphs, one can consider that w((u, v)) = 1 for all (u, v) ∈ E.
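
The definition can be checked directly by trying every bijection f, which is feasible only for very small graphs. The sketch below does this for weighted undirected graphs; the representation (a dictionary from node pairs (u, v) with u < v to weights, nodes numbered from 0) and the function name are assumptions made for the example.

```python
from itertools import permutations

def find_isomorphism(n1, w1, n2, w2):
    """Return a tuple f (f[u] = image of node u) that maps the weighted edges
    of the first graph exactly onto those of the second, or None.
    w1 and w2 map pairs (u, v) with u < v to the edge weight."""
    if n1 != n2 or len(w1) != len(w2):
        return None
    for f in permutations(range(n1)):  # every bijection V1 -> V2
        mapped = {tuple(sorted((f[u], f[v]))): d for (u, v), d in w1.items()}
        if mapped == w2:  # same edges carrying the same weights
            return f
    return None

# Two weighted paths on three nodes that differ only by the node numbering.
g1 = {(0, 1): 1, (1, 2): 2}
g2 = {(1, 2): 1, (0, 2): 2}
print(find_isomorphism(3, g1, 3, g2))  # (1, 2, 0)
```

Setting all weights to 1 gives the test for simple undirected graphs. The loop visits up to n! bijections, which is why the complexity questions of the next section matter.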

2.2 Graph isomorphism as an NP-complete problem

Even if the mathematical aspects of isomorphism are clear, how can they help a program decide algorithmically (automatically) whether two graphs are similar or not? There is no simple, clear or straightforward answer to this question. Many methods were developed over the years to test the degree of similarity between two graphs, and they may even give different results.

From a combinatorial perspective, the solution may be related to several classical graph-theoretical problems involving subgraph matching. In complexity theory, most of them are regarded as very challenging problems, since no polynomial algorithm is known for them and the question of NP-completeness is not always settled [11]. At this point, we will briefly discuss the formal descriptions of the best-known such problems: Maximum Common Subgraph, Subgraph Isomorphism and Graph Isomorphism.


1. Maximum common subgraph problem (MCS problem)
Input: Two graphs G1 and G2.
Question: What is the largest subgraph of G1 isomorphic to a subgraph of G2?
This problem is known to be NP-complete [11]. It is also referred to as the maximum common subgraph isomorphism problem. The relative size of the maximum common subgraph is a simple and convincing indicator of the general similarity of the input graphs.

2. Subgraph isomorphism problem
Input: Two graphs G1 and G2.
Question: Is G1 isomorphic to a subgraph of G2?
This problem is known to be NP-complete [11]. It is also referred to as the subgraph matching problem and it is very often used in connection with the first problem. G1 is called the query graph and G2 the target graph. Usually, subgraph isomorphism algorithms search through large target graphs to find regions that are instances of a specific pattern (query) graph.

3. Graph isomorphism problem
Input: Two graphs G1 and G2.
Question: Is G1 isomorphic to G2?
The NP-completeness of this problem is one of the most challenging open problems in complexity theory. It is not known to be NP-complete, nor is a polynomial algorithm known to solve it [11, 8]. Still, there are particular graph classes (see section 2.5) with tractable isomorphism algorithms [8]. For example, J. L. Faulon proved the problems of graph isomorphism, automorphism partitioning, and canonical labeling to be polynomial-time in the context of chemistry, because molecules are a restricted class of graphs [9].

Perhaps the first idea that comes to mind is to use the relative size of the maximum common subgraph as the similarity measure for two input graphs. Indeed, many articles and algorithms on the general issue of graph similarity deal in fact with efficient methods to solve the maximum common subgraph problem or the subgraph isomorphism problem.
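
To make the idea concrete, here is a brute-force sketch of a similarity based on the maximum common subgraph. Several choices are ours and not prescribed by the text: the common subgraph is taken to be induced, its size is counted in nodes, and the relative size is obtained by dividing by the size of the larger graph. The exhaustive search is exponential and only meant for toy graphs.

```python
from itertools import combinations, permutations

def mcs_size(n1, e1, n2, e2):
    """Number of nodes of a maximum common induced subgraph (exhaustive search)."""
    e1 = {frozenset(e) for e in e1}
    e2 = {frozenset(e) for e in e2}
    for k in range(min(n1, n2), 0, -1):          # try the largest size first
        for s1 in combinations(range(n1), k):      # k nodes of G1
            for s2 in permutations(range(n2), k):  # ordered k nodes of G2
                # s1[i] -> s2[i] must preserve both edges and non-edges
                if all((frozenset((s1[i], s1[j])) in e1)
                       == (frozenset((s2[i], s2[j])) in e2)
                       for i, j in combinations(range(k), 2)):
                    return k
    return 0

def mcs_similarity(n1, e1, n2, e2):
    """Relative size of the maximum common subgraph, in [0, 1]."""
    return mcs_size(n1, e1, n2, e2) / max(n1, n2)

triangle = [(0, 1), (1, 2), (0, 2)]
path3 = [(0, 1), (1, 2)]
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
path4 = [(0, 1), (1, 2), (2, 3)]
print(mcs_similarity(3, triangle, 3, path3))  # common part: a single edge
print(mcs_similarity(4, cycle4, 4, path4))    # common part: a path on 3 nodes
```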


2.3 Graph isomorphism algorithms

Since the basic graph matching problems are NP-complete or have no known polynomial algorithm, the worst-case time requirements of the known algorithms increase exponentially with the size of the input graphs, restricting the applicability of many graph-based techniques to very small graphs [5]. On the other hand, enormous graph structures in biological and engineering databases require fast methods for searching for a given subgraph. Algorithms with time requirements suited for matching large graphs have been a subject of research during the last three decades [20]. Some of them reduce the computational complexity of the matching process by imposing topological restrictions on the graphs. Below we give an overview of the best-known efficient algorithms developed on this issue.

2.3.1 Ullmann’s algorithm

Published in 1976 [22], Ullmann's algorithm is still today one of the most commonly used techniques of exact graph matching because of its generality and effectiveness. It is devised for both graph isomorphism and graph-subgraph isomorphism, and it is a starting point for many other methods.

Given two graphs Ga and Gb, the algorithm tries to assign every node of the query graph (Ga) to a node of the target graph (Gb) by recursively generating a match matrix. In this algorithm, a technique called refinement is adopted in every recursion step in order to reduce the number of candidates [21].

The algorithm represents the graphs and the mappings between the nodes using three matrices: $A$ (the adjacency matrix of $G_a$), $B$ (the adjacency matrix of $G_b$), and $M$ (the 'match matrix'). The algorithm for graph-subgraph isomorphism deals only with symmetric matrices (for undirected graphs). An element $m_{ij} \in M$ is 1 if the node $G_a^i \in G_a$ is a candidate for a mapping to the node $G_b^j \in G_b$. One can easily form the initial matrix $M$ by iterating over all $i$ and $j$ and setting $m_{ij}$ to 1 if the arity of $G_a^i$ is less than or equal to the arity of $G_b^j$; otherwise $m_{ij}$ is set to zero.

Each step of the backtracking procedure chooses a node of $G_a$ (a line of $M$) and applies a refinement procedure to it. The refinement keeps a candidate matching $m_{ij}$ if and only if $\neg \exists x \, (a_{ix} \wedge \neg \exists y \, (m_{xy} \wedge b_{yj}))$. This logical criterion can be expressed in a more intuitive way: if $m_{ij}$ is 1, then every $G_a^x$ linked to $G_a^i$ must have a matching $G_b^y$ ($m_{xy} = 1$) linked to $G_b^j$. Practically, the algorithm applies the following matrix operations, where $\times$ denotes the Boolean matrix product:

$$M' = M \wedge \neg (A \times \neg (M \times B))$$


If at any time we encounter a line consisting only of zeroes, the refinement step fails - we have no correspondent in the target graph for a node in the query graph. Otherwise, the refinement iterates until the M matrix does not change anymore. In this case we check whether our matrix is a valid solution (exactly one 1 on every line and at most one 1 in every column, which means exactly one for graph isomorphism). Otherwise, the backtracking portion recursively continues the search by arbitrarily selecting another node in Ga and calling the refinement again, until it finds one isomorphism or until it has tried all possibilities for M [21].
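
The procedure can be sketched in a few lines. This is a simplified reading of the description above, not Ullmann's original code: it processes the nodes of Ga in index order instead of selecting them arbitrarily, it updates M entry by entry rather than with whole-matrix Boolean products (both reach the same fixed point), and it searches for a subgraph in the non-induced sense (every edge of Ga must map to an edge of Gb). All function names are ours.

```python
def adjacency(n, edges):
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a

def refine(M, A, B):
    """Drop candidates m_ij for which some neighbour x of i has no candidate y
    adjacent to j, until M is stable. Returns False if a line becomes all zero."""
    na, nb = len(A), len(B)
    changed = True
    while changed:
        changed = False
        for i in range(na):
            for j in range(nb):
                if M[i][j] and not all(
                        any(M[x][y] and B[y][j] for y in range(nb))
                        for x in range(na) if A[i][x]):
                    M[i][j] = 0
                    changed = True
            if not any(M[i]):
                return False
    return True

def ullmann(A, B):
    """Return f with f[i] = node of Gb assigned to node i of Ga, or None."""
    na, nb = len(A), len(B)
    deg_a = [sum(r) for r in A]
    deg_b = [sum(r) for r in B]
    # initial match matrix: candidate if the degree of i does not exceed that of j
    M = [[1 if deg_a[i] <= deg_b[j] else 0 for j in range(nb)] for i in range(na)]

    def search(M, row):
        if not refine(M, A, B):
            return None
        if row == na:  # every line holds exactly one 1
            return [r.index(1) for r in M]
        for j in range(nb):
            if M[row][j]:
                M2 = [r[:] for r in M]
                M2[row] = [1 if y == j else 0 for y in range(nb)]
                for x in range(row + 1, na):
                    M2[x][j] = 0  # column j is taken: keep the mapping injective
                result = search(M2, row + 1)
                if result is not None:
                    return result
        return None

    return search(M, 0)

triangle = adjacency(3, [(0, 1), (1, 2), (0, 2)])
target = adjacency(4, [(0, 1), (1, 2), (2, 3), (1, 3)])  # contains a triangle
path4 = adjacency(4, [(0, 1), (1, 2), (2, 3)])           # contains none
print(ullmann(triangle, target), ullmann(triangle, path4))
```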

While Ullmann's algorithm tries to match all possible combinations of nodes, it often redundantly matches the same regions when there are permutable nodes in the query graph. These matches cause redundant calculations that decrease the performance of the algorithm.

2.3.2 The literature on similar algorithms

Apart from Ullmann's algorithm, which is the most famous, there are many other methods proposed in the literature [20, 5], and one may find a lot of useful ideas in them. A state space search algorithm for detecting the maximum common subgraph is described in [4]. The Durand-Pasari algorithm is a technique based on clique detection which also generates the maximum common subgraph, as explained in [4]. The RASCAL (Rapid Similarity Calculation¹) method is also an algorithm based on clique detection, and has some characteristics that make it particularly suitable for the measurement of chemical similarity.

2.4 Expanding the basic measures

Until now, we have compared graphs by taking into account just their maximum common (matching) part. But what if we had two graphs sharing many small subgraphs, but no large common one?

In this case, we would need to define a combined measure that takes into account all common subgraphs. A good example can be found in [18]. Basically, to compare Ga and Gb, one can apply the following operations:

• Decompose the graph Ga into a set of non-overlapping subgraphs Ga1, Ga2, ..., Gan.

• Decompose the graph Gb into a set of non-overlapping subgraphs Gb1, Gb2, ..., Gbn.

• Extract a set G1, G2, ..., Gm of common (isomorphic) subgraphs.

• Define a similarity measure, as for example:

$$\delta(G_a, G_b) = \frac{\sum_{G_i} p(G_i, G_a) \, p(G_i, G_b)}{p(G_a, G_a) \, p(G_b, G_b)},$$

where p(Gi, Ga) is a similarity measure between a subgraph and its supergraph. Thus, p(Gi, Ga) can be the number of edges from Ga that also appear in Gi, as in [18].

¹ www.daylight.com/meetings/emug01/Willett/tsld027.htm

These formulas are not very rigid, and one can make a lot of variations and experiments in order to obtain the best results. The idea is to split the two input graphs into smaller components and to derive a general similarity indicator based on the subgraph similarities.
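
A small numerical sketch of the measure δ, taking p as the number of shared edges. The decomposition into common subgraphs is supplied by hand here (finding it is the hard part and is not shown), and the representation of each common subgraph as a pair of edge lists, one in each graph, is an assumption of the example.

```python
def edge_set(edges):
    return {frozenset(e) for e in edges}

def p(sub_edges, graph_edges):
    """Number of edges of the graph that also appear in the subgraph."""
    return len(edge_set(sub_edges) & edge_set(graph_edges))

def delta(common, ga_edges, gb_edges):
    """common: list of pairs (edges of G_i inside Ga, edges of G_i inside Gb)."""
    total = sum(p(ea, ga_edges) * p(eb, gb_edges) for ea, eb in common)
    return total / (p(ga_edges, ga_edges) * p(gb_edges, gb_edges))

# Ga: a triangle with a pendant edge; Gb: a triangle and a separate 2-edge path.
ga = [(0, 1), (1, 2), (0, 2), (2, 3)]
gb = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5)]
common = [
    ([(0, 1), (1, 2), (0, 2)], [(0, 1), (1, 2), (0, 2)]),  # the shared triangle
    ([(2, 3)], [(3, 4)]),                                    # a shared single edge
]
print(delta(common, ga, gb))  # (3*3 + 1*1) / (4*5) = 0.5
```

With this choice of p, two identical graphs split into several pieces score less than 1, which is one reason why variations of the formula are worth experimenting with.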

2.5 Heuristics

There are many heuristics for partial similarity that are very useful within their own applications. For example, in chemical substructure search, a user may not know the exact composition of the full structure he wants, but requires that it contain a set of small functional fragments. Even inexact structure matching can reveal important behavioral similarities between molecules in special cases.

There are also several ways to obtain polynomial-time solutions to graph matching problems, as for example:

1. Limit the type of graphs
Certain graph matching problems do have efficient, polynomial-time solutions in some particular cases. For example, the graph isomorphism problem has polynomial algorithms for:

(a) graphs of bounded degree (in particular k-regular graphs),

(b) graphs of bounded genus (in particular, the genus is 0 for planar graphs - including trees),

(c) graphs with adjacency matrices with bounded eigenvalue multiplicity [8].

2. Approximate the common subgraph
One can just approximate the largest common subgraph to within a factor, meaning that one has no guarantee of finding the largest subgraph, only a good approximation.


2.6 The edit distance

Another very common similarity measure for graphs is based on the concept of edit distance (Levenshtein distance). This notion was first applied to strings, where it denotes the minimum number of operations (insertion, deletion, or substitution) needed to transform one string into the other. It can be considered a generalization of the Hamming distance, which is used for strings of the same length and only considers substitution edits.
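
For strings, the edit distance can be computed with the classical dynamic programming scheme, sketched below with unit costs for the three operations.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute (free if equal)
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```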

In graph theory, the main idea is similar; the edit distance is given by the minimum number of operations (i.e. insertions, deletions, relabelings) on vertices and edges that one needs to apply to make the compared graphs isomorphic [13]. For a given correspondence between the nodes, such a distance is easy to read from the adjacency matrices. If two graphs have identical adjacency matrices, they are isomorphic (at distance 0); if almost all corresponding elements are equal, the graphs are at distance at most n, where n is the number of differing elements. There are also further generalizations of the edit distance that consider, for example, a cost function on edit operations and determine a more complex form of the minimum cost transformation from one graph to another (more details in subsection 3.1.3).
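
The reading of the distance from adjacency matrices can be sketched as follows, under restrictive assumptions of ours: both graphs are simple, undirected, unlabeled and have the same number of nodes, and only edge insertions and deletions are counted. Since the right correspondence between nodes is not known in advance, the sketch takes the minimum over all of them, which is exponential and suitable only for tiny graphs.

```python
from itertools import permutations

def edge_edit_distance(A, B):
    """Minimum number of edge insertions/deletions that make the graph with
    adjacency matrix A isomorphic to the one with adjacency matrix B."""
    n = len(A)
    best = None
    for f in permutations(range(n)):  # candidate node correspondences
        # count differing entries above the diagonal (each edge counted once)
        diff = sum(A[i][j] != B[f[i]][f[j]]
                   for i in range(n) for j in range(i + 1, n))
        if best is None or diff < best:
            best = diff
    return best

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
relabeled_path = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
print(edge_edit_distance(triangle, path))        # 1: delete one edge
print(edge_edit_distance(path, relabeled_path))  # 0: the graphs are isomorphic
```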

Although it is simple in principle, this distance also has a major drawback - it is very expensive to compute. It has been proven that, even for unordered labeled trees, computing the edit distance is MAX-SNP-hard [13] (an NP-hard problem for which, unless P = NP, there is no efficient approximation scheme that finds solutions within an arbitrarily small constant factor of the optimum).


Chapter 3

Applied Graph Matching

Since graphs are very common structures, graph matching problems have been studied not only in mathematics, but also in many other fields of science. Some of the techniques use particular models of graphs that come with specific features depending on the application's needs. For example, the attributes in a labeled graph could be atoms and bonds in chemical compounds, genes in biological networks, or object descriptors in images. We should mention that many graph applications come with good general ideas that can be applied in very different fields.

The remainder of this chapter is organized as follows. In the next section we present graph similarity techniques used in chemistry and biology. Afterwards, in section 3.2, we present a similarity model for images represented as graphs. Graph-based timbre matching algorithms and a graph similarity method derived from web search technology are discussed in the last two sections.

3.1 Similarity searches in bioinformatics

One of the significant problems in bioinformatics was to represent and manipulate computationally complex biological information such as complex chemical structures, neural networks¹, genome and protein networks, reaction prediction, or metabolic interactions. This challenge was successfully met by representing chemical information as graph objects; for example, a classical chemical compound (atoms and bonds) is nothing more than a set of labeled nodes linked via weighted edges. As a result, there has been a large body of work on subgraph search, as it is an integral part of many areas of computational chemistry.

¹ The organization of neural networks is an incredibly wide field. The number of neurons in the human brain is of the order of 100 billion (10^11), which is larger than the WWW [7].

3.1.1 Chemical structure query systems

A chemical structure similarity search finds all molecules (target graphs) in a database that are like the query structure (query graph) in some respects. As stated in [25], most queries fall into the following three categories:

1. full structure search: find the structure exactly the same as the query graph; from the perspective of computer science this is nothing more than the graph isomorphism problem discussed in section 2.2.

2. substructure search: find structures that contain the query graph, or vice versa; this reduces to the subgraph isomorphism problem.

3. inexact structure similarity search: find structures that are similar to the query graph - not necessarily exactly the same as the query; in this case one needs to take into account distance costs between structures.

The last query type is the most applicable in chemical problems, due to the large number of target graphs that nearly match the query graph without matching it completely. In this situation, a query refinement process has to be carried out by letting the user define the portion of the query for exact matching and by allowing the system to change the remaining portion slightly. The query can be relaxed progressively until a relaxation threshold is reached or a reasonable number of matches is found [25].

An example from caffeine chemistry

The substances from caffeine chemistry are a good example of three distinct compounds that nearly match: caffeine, theophylline, and theobromine [10]. These compounds bear many structural and pharmacological similarities and differ only by the presence of methyl groups in two positions of the chemical structure. They are all cardiac stimulants and, in nature, caffeine is found with widely varying concentrations of the other two substances. When caffeine appears to have different effects depending on the source, it is due primarily to varying concentrations of other stimulants and to the absorption rates of the mixture.

If one takes caffeine as the query graph (figure 3.1 below), obviously no exact match exists between caffeine and the other two target graphs. If we relax the query with one vertex label miss, caffeine and theophylline will be good matches. If we relax the query further, theobromine could also be an answer.

Figure 3.1: An example of the three almost identical substances from the caffeine family. Methylation refers to the replacement of a hydrogen atom (H) with a methyl group (CH3)

This kind of inexact similarity approach can be used in many classification and filtering systems. All three compounds are from a group of substances (xanthines) that are commonly used for their effects as stimulants and as bronchodilators (important in treating the symptoms of asthma). The core compound, xanthine, is a product on the pathway of purine² degradation.

Applying inexact similarity searches on a database is not an easy task; it is much more difficult than exact substructure search. A naive solution is to form a set of subgraph queries with one or more edge/vertex deletions or changes and then use the exact substructure search. This does not scale when the number of deletions is more than 1. For example, if we allow three edges to be deleted in a 20-edge query graph, it may generate $\binom{20}{3} = 1140$ substructure queries, which is too expensive to check, so better solutions need to be investigated [25].
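
The blow-up of the naive solution is easy to reproduce. The sketch below enumerates every relaxed query obtained by deleting exactly k edges; the 20-edge path used as the query graph is a made-up stand-in, not a real molecule.

```python
from itertools import combinations
from math import comb

def relaxed_queries(edges, k):
    """Yield every subgraph query obtained by deleting exactly k edges."""
    for removed in combinations(range(len(edges)), k):
        drop = set(removed)
        yield [e for i, e in enumerate(edges) if i not in drop]

query = [(i, i + 1) for i in range(20)]  # hypothetical 20-edge query graph
count = sum(1 for _ in relaxed_queries(query, 3))
print(count, comb(20, 3))  # 1140 1140
```

Each of these queries would then have to be passed to an exact substructure search, itself an expensive operation.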

² Two of the bases in the nucleic acids in the genome, adenine and guanine, are also purines. Their structures are not very different from that of caffeine.

3.1.2 Molecular similarity in organic syntheses

Organic synthesis is the construction of organic molecules via chemical processes. Organic molecules often have a higher level of complexity than purely inorganic compounds³, so the synthesis of organic compounds has developed into one of the most important aspects of organic chemistry.

The connection between a target compound (the compound to be constructed) and available starting materials is achieved by similarity searches or substructure searches. For example, WODCA⁴ contains forty different criteria for molecular similarity [12], similarities specifically defined for the purposes of synthesis planning.

The similarity definitions are based either on substructures such as the largest carbon skeleton or on generalized chemical reactions like oxidation. Depending on the type of structure search allowed by the system, the complete molecule or any compound containing the structure of the molecule will be retrieved as an answer set.

After selecting a similarity criterion (e.g. carbon skeleton, largest ring systems), the molecular similarity problem is often transformed into a pure graph similarity problem. Usually, the graph representation of a molecule takes into account not just the link structure of nodes, but also encodes biochemical information in each vertex.

3.1.3 A generalization of the edit distance for functional similarity of protein molecules

To represent complex molecules, we need a graph model that integrates geometrical, structural and chemical information. We therefore employ the concept of attributed graphs, which attaches information to each vertex together with a cost function for vertex matching (for example, two carbon vertices are more similar to each other than a carbon vertex and a hydrogen vertex).

The similarity of attributed graphs can be described simply as an assignment problem, where the distance is given by the minimal total cost of matching the vertices of one graph to those of another. Such a measure, called the vertex matching distance, takes into account only the vertex information and ignores the link structure of the graph. The other basic idea commonly used in attributed graph matching is the edit distance, which considers only the relationships between vertices by counting the minimal number of nodes or edges that must be added to make two graphs isomorphic.

³ Inorganic compounds include all chemical compounds except the many that are based upon chains or rings of carbon atoms, which are termed organic compounds and are studied under the separate heading of organic chemistry.

⁴ Workbench for the Organization of Data for Chemical Applications, a computer program of the newest generation for interactive planning of organic syntheses.

The edge matching distance

Both graph distances presented so far, the edit distance (section 2.6) and the vertex matching distance (described above), have certain drawbacks. The first is very time-consuming to compute, and the second does not really take into account the structure of the graphs (i.e. they are treated as sets of vertices). Consequently, a new similarity measure that overcomes these issues was proposed in [13, 17]. It is called the edge matching distance and it considers both the structural relationships between vertices and the similarity of node attributes.

The basic idea is that, instead of matching the vertices of two graphs, one can introduce a cost function for edge matching and then derive a minimum-weight maximal matching between the edge sets of the two graphs.

Given two graphs $G_1(V_1, E_1)$ and $G_2(V_2, E_2)$, the edge matching distance is formally defined using a complete bipartite graph $G_X(V_X, E_X)$ in which $V_X = E_1 \cup E_2 \cup \{\delta\}$ and $E_X = E_1 \times (E_2 \cup \{\delta\})$. Without loss of generality, we assume that $|E_1| \geq |E_2|$. In this new matching graph, the vertex set consists of all edges of the initial graphs together with a dummy edge $\delta$ whose role is to “match” all unmatched edges from $E_1$. An edge matching between $G_1$ and $G_2$ is defined as a maximal matching in $G_X$, i.e. a function that associates an element of $E_2 \cup \{\delta\}$ with each edge of $E_1$. Given a cost function $c : E_1 \times (E_2 \cup \{\delta\}) \to \mathbb{R}_0^+$, the matching distance between $G_1$ and $G_2$, denoted by $d_{match}(G_1, G_2)$, is defined as the cost of the minimum-weight edge matching between $G_1$ and $G_2$ with respect to the cost function $c$.
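The definition above can be made concrete with a small brute-force sketch. It is only an illustration under stated assumptions: edges are hypothetical `(u, v, weight)` tuples, the cost function `weight_cost` is invented for the example, the cost is assumed symmetric so that the two graphs can be swapped, and the single dummy edge is emulated by padding the smaller edge set with several copies of a `DUMMY` marker. Enumerating all assignments takes exponential time; in practice a polynomial-time assignment solver such as the Hungarian method would replace the enumeration.

```python
from itertools import permutations

DUMMY = None  # plays the role of the dummy edge delta


def edge_matching_distance(edges1, edges2, cost):
    """Brute-force minimum-cost edge matching between two edge lists."""
    # Enforce |E1| >= |E2| (the cost function is assumed to be symmetric).
    if len(edges1) < len(edges2):
        edges1, edges2 = edges2, edges1
    # Pad the smaller edge set with dummy entries so both sides have equal size.
    padded = list(edges2) + [DUMMY] * (len(edges1) - len(edges2))
    # Try every assignment of the padded E2 entries to the edges of E1.
    return min(
        sum(cost(e1, e2) for e1, e2 in zip(edges1, assignment))
        for assignment in permutations(padded)
    )


def weight_cost(e1, e2):
    """Hypothetical cost: weight difference, or the full weight if unmatched."""
    if e2 is DUMMY:
        return e1[2]
    return abs(e1[2] - e2[2])


# Two toy attributed graphs given by their edge lists (u, v, weight).
E1 = [("a", "b", 1.0), ("b", "c", 2.0), ("c", "d", 4.0)]
E2 = [("x", "y", 1.5), ("y", "z", 4.0)]

d_match = edge_matching_distance(E1, E2, weight_cost)
print(d_match)
```

In this toy case the cheapest matching pairs the two heaviest edges with each other, pairs the edge of weight 2.0 with the edge of weight 1.5, and leaves the lightest edge of `E1` to the dummy.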

As in many other fields, molecule classification can be done in more than one step by employing a filter step, which should be cheap, in order to reduce the overall search time. For example, such a filter could very quickly discard molecules whose size is very different from that of the query molecule. After the filter step, the exact similarity distance is determined for the remaining candidate objects in the refinement step, and only those fulfilling the query predicate are reported.
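A minimal sketch of such a filter and refinement pipeline is given below. Both measures are stand-ins chosen for this example and are not the ones used in [13, 17]: graphs are hypothetical sets of edges, the filter compares edge counts, and the "exact" distance is the size of the symmetric difference of the edge sets. With this choice the difference in size never exceeds the distance, so the filter discards no true answer.

```python
def filter_and_refine(query, database, threshold):
    """Return (candidates after the cheap filter, final answers)."""
    # Filter step: cheap size comparison, a lower bound on the distance below.
    candidates = [g for g in database if abs(len(g) - len(query)) <= threshold]
    # Refinement step: the (stand-in) exact distance, computed on candidates only.
    answers = [g for g in candidates if len(g ^ query) <= threshold]
    return candidates, answers


# Hypothetical toy graphs, each represented as a frozenset of edges.
query = frozenset({(1, 2), (2, 3), (3, 4)})
g_a = frozenset({(1, 2), (2, 3), (3, 4), (4, 5)})   # one extra edge
g_b = frozenset({(1, 2)})                           # much smaller
g_c = frozenset({(5, 6), (6, 7), (7, 8)})           # same size, different edges
g_d = frozenset((i, i + 1) for i in range(10, 17))  # much larger

candidates, answers = filter_and_refine(query, [g_a, g_b, g_c, g_d], threshold=1)
print(len(candidates), len(answers))
```

Here the filter removes `g_b` and `g_d` without computing any distance, and the refinement step then rejects `g_c`.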

The edge matching distance has several properties that make it particularly well suited to the classification of large databases of protein molecules. One important property is its polynomial time complexity [13], as opposed to the exponential time complexity of the edit distance or of the exact isomorphism methods. Such complexity is mandatory for a practical application of the measure to large databases, where it has to be computed many times.

3.2 Similarity searches in image recognition

In the field of image retrieval, the similarity of images can also be described simply as an assignment problem, or by a feature-based similarity model [13] that does not take into account any relationship structure. An interesting approach, described in [17], encodes images as attributed graphs in order to apply graph matching algorithms to them. The encoding idea is illustrated intuitively in figure 3.2 below. The graph representation encodes the color and the size of each uni-color area as attributes and the neighboring relations in the link structure.

Figure 3.2: Example of the attributed graph representation of an image.

The same idea of edge matching distance as for protein graphs was used by the same authors for image classification, and the results were encouraging. To demonstrate this, in [17] the average filter selectivity was measured over 100 queries. For each of their image datasets, the filter retrieved various fractions of the database, and the results were that:

(1) more than 87% of the database objects are filtered out for a query result of 5% of the database size, and

(2) more than 82% of the database objects are filtered out for a query result of 10% of the database size.


3.2.1 Face recognition using bunch graphs

Generally speaking, research in image recognition involves a great deal of graph matching theory, since all 2D face views can be represented by general labeled graphs. Such methods can be found in [24], where bunch graph techniques are employed to build a system for recognizing human faces in a large database of images. Such a task is difficult due to variations in position, size, expression and pose, but all these aspects can be “collapsed” by extracting concise face descriptions in the form of image graphs.

When comparing and analyzing faces, we can consider them as defined by a set of 45 facial points (fiducial points) at which the graph nodes are positioned. The nodes should represent compatible (x, y, z) relative coordinates (considering a fixed pose, e.g. frontal) in all faces, such as the left and right pupils, the corners of the mouth, the tip of the nose, etc. In this approach, graph edges are labeled with distance vectors and graph nodes are labeled with jets. Jets are a concise and robust representation of local grey-level value regions of the image.

The model of the face bunch graph is constructed from a representative set of model graphs having the same pose. All jets referring to the same fiducial point (e.g. all left-eye jets) are bundled together in a bunch, from which one can select any jet as an alternative description. The left-eye bunch might contain a male eye, a female eye, both closed or open, etc. New faces can be encoded by taking jets from different models at each node, e.g. the left-eye jet taken from model 3 and the nose jet taken from model 25. Such an approach is very easy to use, at least for determining general attributes of a face (whether a person is male or female, has glasses or not).

New graphs can be generated automatically using a method called elastic bunch graph matching, which is guided by a graph similarity function defined between a face bunch graph and a new image graph. This similarity function accounts for spatial distortion and is based on the jet similarities between image jets and the best-fitting jets in each bunch. Finally, once generated, the new graph is compared with the stored graphs and the example with the highest average similarity is taken as the recognized person. This procedure is also possible between different views, because jets representing the same fiducial points across views can be associated [24].
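The "best-fitting jet in each bunch" idea can be sketched as follows. This is a simplified illustration rather than the similarity function of [24]: jets are assumed to be plain lists of numbers, the jet similarity is assumed to be a normalized dot product, the spatial distortion term is left out entirely, and all names and values are hypothetical.

```python
from math import sqrt


def jet_similarity(jet1, jet2):
    """Assumed jet similarity: normalized dot product of two feature vectors."""
    dot = sum(a * b for a, b in zip(jet1, jet2))
    norm = sqrt(sum(a * a for a in jet1)) * sqrt(sum(b * b for b in jet2))
    return dot / norm if norm else 0.0


def bunch_graph_similarity(image_jets, bunch_graph):
    """Average, over fiducial points, of the similarity to the best jet in the bunch."""
    best = [
        max(jet_similarity(jet, model_jet) for model_jet in bunch_graph[point])
        for point, jet in image_jets.items()
    ]
    return sum(best) / len(best)


# Toy data: two fiducial points, two model jets per bunch.
image_jets = {"left_eye": [1.0, 0.0], "nose": [1.0, 1.0]}
bunch_graph = {
    "left_eye": [[0.0, 1.0], [2.0, 0.0]],
    "nose": [[1.0, 0.0], [0.0, 1.0]],
}

score = bunch_graph_similarity(image_jets, bunch_graph)
print(round(score, 4))
```

Taking the maximum inside each bunch is what lets a new face borrow, for instance, its eye description from one model and its nose description from another.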

3.3 Timbre matching

Timbre is what, with a little practice, people use to pick out the saxophone from the trumpet in a jazz group or the flute from the violin in an orchestra, even if they are playing exactly the same notes. It represents the temporal development of harmonics within a musical sound and it can denote many apparently unrelated aspects of a sound.

Partials, pitch and amplitude

Each note produced by a musical instrument is made of a number of partials, distinct sounds with distinct frequencies, measured in hertz (Hz). Each partial is characterized by its pitch and amplitude. The pitch is the physical frequency as perceived by the human ear in terms of highness or lowness; the amplitude is perceived as loudness.

The lowest frequency is called the fundamental, and the pitch produced by this frequency is used to name the note (Do, Re, Mi...). For example, in western music, instruments are normally tuned to A = 440 Hz⁵. However, the richness of the sound is produced by the combination of this fundamental with a series of partials and other harmonics. In general, the timbre of a tone depends upon:

• the number of partials present (number of independent sounds)

• the relative location or locations of these partials in the range from the lowest to the highest

• the relative strength or dominance of each partial

Figure 3.3: The development of a note.

⁵ The other frequencies in an orchestra are called overtones of the fundamental frequency, and may include harmonics and partials. Harmonics are whole-number multiples of the fundamental frequency (×2, ×3, ×4, etc.). Partials are other overtones. Most western instruments produce harmonic sounds, but many instruments produce partials and inharmonic tones.


The wave’s development over a note’s life (or over one partial’s life) is divided into four phases (the ADSR envelope model): Attack, Decay, Sustain and Release (fig. 3.3). For example, when you press a key on the piano, the sound rises sharply (Attack), then falls quickly (Decay) to a rather constant level (Sustain), and when you take your finger off, the sound fades away (Release).

Graph representation

To sum up the discussion above, the spectral shape of a note can be approximated using the peaks of the attack features as:

$S_{shape} = \{(f_1, a_1), (f_2, a_2), \ldots, (f_l, a_l)\}$

where $l$ is the number of partials under consideration and $f_i$ and $a_i$ are, respectively, the frequency and the amplitude of the peak of attack along each partial [19].

In constructing graph representations of timbre, the spectral shape is represented as a connected labeled graph whose nodes encode the pairs $(f_i, a_i)$. Each node (peak of attack) is viewed as being connected to every other node on tracks with a higher harmonic number. This yields a directed graph whose connectivity specifies monotonically increasing sequences of partials [19].
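The construction just described can be sketched in a few lines. The amplitudes below are invented for illustration, the partials are assumed to be exact harmonics of a 440 Hz fundamental, and the function name `timbre_graph` is hypothetical; none of these values come from [19].

```python
def timbre_graph(spectral_shape):
    """Build the directed graph of a spectral shape [(f_1, a_1), ..., (f_l, a_l)].

    Nodes are indexed by harmonic number (starting at 1) and labeled with the
    pair (frequency, amplitude); there is an edge i -> j whenever i < j.
    """
    nodes = {i: peak for i, peak in enumerate(spectral_shape, start=1)}
    edges = [(i, j) for i in nodes for j in nodes if i < j]
    return nodes, edges


fundamental = 440.0                 # A = 440 Hz
amplitudes = [1.0, 0.6, 0.4, 0.25]  # hypothetical peak-of-attack amplitudes

# Harmonics are whole-number multiples of the fundamental frequency.
shape = [(k * fundamental, a) for k, a in enumerate(amplitudes, start=1)]

nodes, edges = timbre_graph(shape)
print(nodes[2], len(edges))
```

Every path in the resulting graph visits partials in increasing harmonic order, which is the monotonicity property mentioned above.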

Figure 3.4: A normal representation of the relative amplitudes of the violin peaks of attack features (a) and the graph representation of the first 7 violin peaks (b)

Correspondences are then established between the spectral ‘shapes’ or graphs of both sounds, or, more precisely, between sequences of partials from both sounds. The problem of finding correspondences between timbres is therefore cast as a structural matching problem, or more specifically as a graph similarity problem. Figure 3.4 shows a partial graph representation of a violin timbre, along with a plot of the spectral shape generated by the 20 most significant peaks.

3.4 Graph similarity and web-searching

Another graph similarity measure is inspired by web search technology. An important step when searching for web pages relevant to a given query is the identification of hubs and authorities. Kleinberg’s hub and authority method [15] can be viewed as a special case of a graph similarity algorithm in which one of the graphs has two vertices and a single directed edge between them. Kleinberg derived an algorithm that assigns an authority score and a hub score to every vertex of a given graph by using the eigenvectors of certain matrices associated with the link graph; his work in turn motivated additional heuristics for link-based analysis. In [2], Kleinberg’s idea was used to study the similarity of directed graphs by means of a concept of similarity between vertices of directed graphs and a similarity matrix.
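For readers unfamiliar with the method, the mutual reinforcement between hubs and authorities can be sketched with a simple power iteration, which approximates the principal eigenvectors mentioned above. The tiny link graph, the fixed number of iterations and the function names are assumptions made for this example; the sketch covers only the hub and authority scores, not the generalized similarity matrix of [2].

```python
from math import sqrt


def normalize(scores):
    """Scale a score dictionary to unit Euclidean norm."""
    norm = sqrt(sum(s * s for s in scores.values())) or 1.0
    return {v: s / norm for v, s in scores.items()}


def hubs_and_authorities(edges, iterations=50):
    """Power iteration for hub and authority scores on a directed edge list."""
    nodes = sorted({v for edge in edges for v in edge})
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # A good authority is pointed to by good hubs.
        auth = normalize({v: sum(hub[u] for u, w in edges if w == v) for v in nodes})
        # A good hub points to good authorities.
        hub = normalize({u: sum(auth[w] for x, w in edges if x == u) for u in nodes})
    return hub, auth


# Hypothetical link graph: a -> c, b -> c, a -> d.
links = [("a", "c"), ("b", "c"), ("a", "d")]
hub, auth = hubs_and_authorities(links)

best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
print(best_hub, best_authority)
```

In this toy graph, the page with the most incoming links ends up as the strongest authority and the page linking to both targets as the strongest hub.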


Chapter 4

Social structures in the Middle Ages

4.1 Questions of historians

Our main applied objective is to contribute to a better understanding of the social evolution of the peasant class in the Middle Ages. Due to the lack of written sources on this major component of society (in contrast with the clergy or the nobility), the picture we have of it is often distorted. The region studied, Quercy (South-West of France), is a special case because the land contracts there provide us with precious information concerning the social relationships of the peasantry.

The most important questions of the experts on the Middle Ages are organized around three issues: the peasantry’s environment, the relationships between peasants and the aristocracy, and the sociability of the peasant family.

1. The peasantry’s environment

• The family: where do they marry? In which social or geographical circle?

• The economic environment: where do they work the land? How did the fluidity of agrarian transactions change over the generations?

• The juridical environment: under what conditions could peasants become nobles?

2. Relationships between peasants and the aristocracy

• Are there any strategies for building up a heritage of land possessions over several generations?

• What are the relationships between the inhabitants of the country farms and the inhabitants of the towns? The vast majority of the population lived in country houses; was the countryside less affected by crises than the towns, which could even be destroyed by wars?

3. The sociability of the peasant family

• Alliances at the family level: is there any logic of alliances in the peasantry, as in the aristocracy?

• In the region studied (Quercy), there was a large number of immigrants from the Massif Central in the 15th century; half of the names in the documents changed over this period. What happened to the old families: did they disappear, or did they ally with the immigrants? Generally speaking, how did they overcome the Hundred Years’ War (between 1350 and 1420, the population decreased by 50% due to plagues and wars)?

• How did they choose their solicitor (lawyer)? Is there any logic in their preferences?

4.2 The input data

To give an idea of the complexity of our historical data, a first evaluation counts more than 10,000 mentions of individuals. All the data (in fact, land contracts) were collected over a geographical area situated 100 km north of Toulouse (Chatellenie de Castelnau-Montratier). From a historical point of view, this area is well known for its rich written and archaeological documentation. The sources cover a period from 1240 AD to the end of the 16th century. The documentation comprises more than 6000 official papers from different backgrounds and gives us access to three kinds of information:

• the official document: its nature, its date, the lawyer, the lord in charge of the area involved¹

• the persons involved: the identity of each person in the document; there are many ambiguities in the names, and they can be resolved only by experienced historians.

• the places: the documentation mentions many places, such as parcels of land, houses, farms, forests, etc.

¹ In the French Middle Ages, all parcels of land were grouped into communities (i.e. parishes) and each community was represented by a lord; no legal operation could occur in a community without the written permission of the lord.

Social relationships are always very complex objects and, for our data, one can find several types of links to embed into graphs. Following the historians’ ideas, relations between individuals can be established in the following ways:

• by transaction: two persons are linked by a contract. Several persons are linked by one transaction: the seller, the buyer, the lawyer, and the lord in charge of the land.

• by family

• by neighboring relations between workplaces

• by the parish (the rural community they are part of)
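As an illustration of the first kind of link, the sketch below builds an undirected graph in which all persons appearing in the same transaction are pairwise connected. The record layout (a dictionary from role to person identifier) and the identifiers `P1` to `P5` are purely hypothetical; they reflect neither the actual database schema nor real individuals from the documents.

```python
from collections import defaultdict
from itertools import combinations


def transaction_graph(transactions):
    """Link every pair of persons who take part in the same transaction."""
    graph = defaultdict(set)
    for roles in transactions:
        persons = sorted(set(roles.values()))
        for a, b in combinations(persons, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


# Two hypothetical land contracts sharing the same lawyer and lord.
contracts = [
    {"seller": "P1", "buyer": "P2", "lawyer": "P3", "lord": "P4"},
    {"seller": "P2", "buyer": "P5", "lawyer": "P3", "lord": "P4"},
]

graph = transaction_graph(contracts)
print(sorted(graph["P3"]))
```

Other link types (family, neighboring workplaces, parish) could be stored as separate edge sets over the same set of persons, which keeps the different relations distinguishable when the graphs are compared.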

4.3 The data digitization

Figure 4.1 shows an example of an original document. All the documents are hand-written in old French; moreover, some of them are damaged, so automatic handwriting recognition is out of the question. Taking into account the problems of ambiguity as well, the team of historians and computer scientists decided to build and use a dedicated interface for the task of data digitization.

Figure 4.2 shows a part of the web interface that was built. Its current state is the result of numerous discussions and planning sessions held over time with the historians. It helps us store all the information in the documents while taking into account all the constraints necessary for the construction of all the graphs. Our interface is available online (http://graphe.dyndns.org) and was built using PHP and MySQL; of course, the tools that create graphs can access the SQL database directly (without using this interface).

4.4 The plan of our analysis

At this moment, the team is working on digital data acquisition using the developed interface. The first family of graphs that I will study concerns the transactions over periods of one generation (around 50 years at that time).

For our application, there are two specific problems that did not arise in the same form in the classic issues described so far:

• the dimension of the networks

• the semantics of nodes and edges in relation to the combinatorial component

Figure 4.1: An example of an official document in the Middle Ages

Many similarity heuristics have a high complexity that makes it difficult to process networks of thousands of nodes; reducing this complexity is thus a great challenge, made particularly difficult by the combinatorial explosion. Still, in our project computing time is not considered a major constraint and we do not address this problem directly. We are, however, interested in ways of adapting existing algorithms ([4, 9, 20]) to the semantics of our data, and we can evaluate the computing time experimentally.

The vast majority of studies on graph isomorphism do not take into account the whole semantic component associated with nodes and edges. To begin with, we intend to integrate semantic closeness by defining a measure of similarity on the set of nodes; we consider for the moment that the relations defined by edges are completely independent. The isometric condition is thus more relaxed: in addition to the distance between combinatorial structures, we take into account the semantic distance between pairs of nodes.

Figure 4.2: A part of our interface - an example of inserting a new document

In the particular case where the semantic distance takes a finite number of values, we can apply distances studied for colored graphs [6]. Still, this approach cannot be completely realistic in our case, and we are also considering the integration of a distance defined on continuous intervals.


Chapter 5

Conclusions and further work

The study of graphs and of their applications is an incredibly wide field nowadays. Following the advances of the Internet and the WWW, there is a vast recent literature on link analysis (and more generally on network theory). Link analysis reveals the crucial relationships and associations between a great many objects of different types that are not apparent from isolated pieces of information.

Until the end of the internship, we want to explore fully the issues concerning the Middle Ages documents, and we are ready to continue the work on graph comparison in order to:

• combine the combinatorial, semantic and geometrical information into a graph representation (e.g. for structures of concepts, ontologies)

• analyze the statistical distribution of our similarity measures on particular graphs

• analyze the metric properties of our measures in order to shed light on our classification methods in various applications

Old ideas from mathematics, statistical physics, biology, computer science, and so on take on quite a few new forms when applied to real evolving networks. Our general goal is to combine various approaches from the most important application domains towards a better understanding of the networks in our lives.


Bibliography

[1] R. Albert and A.-L. Barabasi. Statistical mechanics of complex networks. Rev. of Modern Physics, 74:47–97, 2002.

[2] V. D. Blondel, A. Gajardo, M. Heymans, P. Senellart, and P. Van Dooren. A measure of similarity between graph vertices with applications to synonym extraction and web searching. SIAM Review, 46(4):647–666, 2004.

[3] J. Buhl, L. Lasserre, P. Kuntz, G. Theraulaz, and I. Kojadinovic. Contribution a l’analyse d’un corpus de reseaux de galeries chez les fourmis. In 11emes Rencontres de la Societe Francophone de Classification, pages 127–130, 2006.

[4] H. Bunke, P. Foggia, C. Guidobaldi, C. Sansone, and M. Vento. A comparison of algorithms for maximum common subgraph on randomly connected graphs.

[5] L.P. Cordella, P. Foggia, C. Sansone, and M. Vento. An improved algorithm for matching large graphs. In 3rd IAPR-TC15 Workshop on Graph-based Representations in Pattern Recognition, pages 149–159, 2001.

[6] Peter Dankelmann, Wayne Goddard, and Peter Slater. Average distance in colored graphs. Journal of Graph Theory, 38(1):1–17, 2001.

[7] S.N. Dorogovtsev and J.F.F. Mendes. Evolution of networks - From biological nets to the Internet and WWW. Oxford Univ. Press, 2003.

[8] R. T. Faizullin and A. V. Prolubnikov. The direct algorithm for solving of the graph isomorphism problem. ArXiv Mathematics e-prints, Feb. 2005.

[9] Jean-Loup Faulon. Isomorphism, automorphism partitioning, and canonical labeling can be solved in polynomial-time for molecular graphs. Journal of Chemical Information and Computer Sciences, 38(3):432–444, 1998.

[10] Bertil B. Fredholm, Karl Bättig, Janet Holmén, Astrid Nehlig, and Edwin E. Zvartau. Actions of caffeine in the brain with special reference to factors that contribute to its widespread use. Pharmacological Reviews, 51(1):83–133, 1999.

[11] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[12] Molecular Networks GmbH. WODCA - User Manual. Computer-Chemie-Centrum und Institut fur Organische Chemie, April 2004. Available from http://www.mol-net.de/software/wodca/wodca manual.pdf.

[13] H.-P. Kriegel, M. Pfeifle, and S. Schonauer. Similarity search in biological and engineering databases. IEEE Data Engineering Bulletin, 27(4):37–44, 2004.

[14] B. Jouve, P. Kuntz, and F. Velin. Extraction de structures macroscopiques dans des grands graphes par une approche spectrale. Extraction des Connaissances et Apprentissage, 1(4), 2001.

[15] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.

[16] R.I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proc. of the 19th Int. Conf. on Machine Learning, pages 315–322, 2002.

[17] Hans-Peter Kriegel and Stefan Schonauer. Similarity search in structured data. In 5th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK’03), pages 188–199, 2001.

[18] Si Quang Lee, Tu Bao Ho, and T.T Hang Phan. A novel graph-based similarity measure for 2D chemical structures. Genome Informatics, 15(2):82–91, 2004.

[19] T. Lysaght and J.T. Timoney. A graph theoretic approach to timbre matching. In IX Brazilian Symposium on Computer Music: Music as Emergent Behaviour, September 2003.

[20] P. Foggia, C. Sansone, and M. Vento. A performance comparison of five algorithms for graph isomorphism. In 3rd IAPR TC-15 Workshop on Graph-based Representations in Pattern Recognition, pages 188–199, 2001.

[21] James D. Roberts. Parallelizing subgraph isomorphism refinement for classification and retrieval of conceptual structures. Technical Report UCSC-CRL-94-48, 1994.

[22] J. R. Ullmann. An algorithm for subgraph isomorphism. Journal of the Association for Computing Machinery, 23:31–42, 1976.

[23] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440, 1998.

[24] Laurenz Wiskott, Jean-Marc Fellous, Norbert Kruger, and Christoph von der Malsburg. Face recognition by elastic bunch graph matching. In Gerald Sommer, Kostas Daniilidis, and Josef Pauli, editors, Proc. 7th Intern. Conf. on Computer Analysis of Images and Patterns, CAIP’97, Kiel, number 1296, pages 456–463, Heidelberg, 1997. Springer-Verlag.

[25] Xifeng Yan, Philip S. Yu, and Jiawei Han. Substructure similarity search in graph databases. In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 766–777, New York, NY, USA, 2005. ACM Press.
