lncs 6426 - fast business process similarity search with feature … · fast business process...

Fast business process similarity search with feature-basedsimilarity estimationCitation for published version (APA):Yan, Z., Dijkman, R. M., & Grefen, P. W. P. J. (2010). Fast business process similarity search with feature-basedsimilarity estimation. In R. Meersman, T. Dillon, & P. Herrero (Eds.), Proceedings of the 18th internationalconference on cooperative information systems (Vol. 1, pp. 60-77). (Lecture Notes in Computer Science; Vol.6426). Springer. https://doi.org/10.1007/978-3-642-16934-2_8

DOI:10.1007/978-3-642-16934-2_8

Document status and date:Published: 01/01/2010

Document Version:Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)

Please check the document version of this publication:

• A submitted manuscript is the version of the article upon submission and before peer-review. There can beimportant differences between the submitted version and the official published version of record. Peopleinterested in the research are advised to contact the author for the final version of the publication, or visit theDOI to the publisher's website.• The final author version and the galley proof are versions of the publication after peer review.• The final published version features the final layout of the paper including the volume, issue and pagenumbers.Link to publication

General rightsCopyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright ownersand it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal.

If the publication is distributed under the terms of Article 25fa of the Dutch Copyright Act, indicated by the “Taverne” license above, pleasefollow below link for the End User Agreement:www.tue.nl/taverne

Take down policyIf you believe that this document breaches copyright please contact us at:[email protected] details and we will investigate your claim.

Download date: 03. Aug. 2020

https://doi.org/10.1007/978-3-642-16934-2_8

https://doi.org/10.1007/978-3-642-16934-2_8

https://research.tue.nl/en/publications/fast-business-process-similarity-search-with-featurebased-similarity-estimation(ff25dcc2-d27c-4fdb-be33-395e32c9e456).html

Fast Business Process Similarity Search with

Feature-Based Similarity Estimation

Zhiqiang Yan, Remco Dijkman, and Paul Grefen

Eindhoven University of TechnologyP.O. Box 513, NL-5600 MB Eindhoven, The Netherlands

{z.yan,r.m.dijkman,p.w.p.j.grefen}@tue.nl

Abstract. Nowadays, business process management plays an importantrole in the management of organizations. More and more organizationsdescribe their operations as business processes, and the intra- and inter-organizational interactions between operations as services. It is commonfor organizations to have collections of hundreds or even thousands ofbusiness processes. Consequently, techniques are required to quickly findrelevant business process models in such a collection. Currently, tech-niques exist that can rank all business process models in a collectionbased on their similarity to a query business process model. However,those techniques compare the query model with each model in the col-lection in terms of graph structure, which is inefficient and computa-tionally complex. Therefore, this paper presents a technique to makethis more efficient. The technique selects small characteristic model frag-ments, called features, which are used to efficiently estimate model simi-larities and classify them as relevant, irrelevant or potentially relevant toa query model. Only potentially relevant models must be compared us-ing the existing techniques. Experiments show that this helps to retrievesimilar models at least 3.5 times faster without impacting the quality ofthe results; and 5.5 times faster if a quality reduction of 1% is acceptable.

1 Introduction

Nowadays, service and business process management technologies developquickly in both academic and industrial fields. To increase the flexibility andcontrollability of the management of organizations, business processes are usedto describe functions organization provides and services are used to describe theboth intra- and inter-organizational cooperations. As a result, it is common tosee collections of hundreds or even thousands of business process models. Forexample, the SAP reference model consists of more than 600 business processmodels [16], and the reference model for Dutch Local Governments containsa similar number of models [9]. As business process model collections increasein size, tools and techniques are required to manage them. This includes toolsand techniques for quickly searching a collection for business process modelsthat meet certain criteria. These criteria can be specified by means of a searchquery [2,5], but also by means of (a part of) a business process model for whichsimilar models must be retrieved [6,7].

R. Meersman et al. (Eds.): OTM 2010, Part I, LNCS 6426, pp. 60–77, 2010.c© Springer-Verlag Berlin Heidelberg 2010

Fast Business Process Similarity Search 61

Buy Goods

X

Receive Goods

Verify Invoice

O +

Store Goods

+

Query model

Buy Goods

X

Receive Goods

Verify Invoice

X

Model 1

Buy SpecialGoodsOnline

GoodsReceipt

ConsumeGoods

Model 2

PurchaseCommodities

X

GetCommodities

CheckReceipt

O +

WarehouseCommodities

+

Model 3

Order a book

Receive the Book

Storethe Book

+

Read the Book

+

Model 4

Inform candidate

Assess request

Sendresponse

Model 5

Fig. 1. Searching a collection of business process models

This paper focuses on the second class of search techniques, which are alsocalled similarity search techniques. Similarity search can, for example, be ap-plied to search a collection of reference process models for the model that bestmatches a process model from a specific organization; or in case of merger be-tween two organizations to search which business process model from one orga-nization matches which business process model of the other. Figure 1 shows anexample of a business process similarity search. It shows one query model andfive process models in the BPMN notation. Given a search query model, a simi-larity search technique should only return those process models that are similarto the search query model and it should return those similar process models inorder of their similarity to the search query model. In the example, the techniquecould return models 1, 2 and 3.

There currently exist similarity search techniques [6,7]. However, these tech-niques focus on defining a metric to compute the similarity between two processmodels. To rank the business process models in a collection, the similarity ofeach of the process models to the query model must be computed. Subsequently,the process models are ordered according to their similarity. This is time con-suming and can cause a similarity search operation to take multiple seconds oreven minutes, depending on the metric and algorithm that is used, while a searchquery should be performed within milliseconds by a search engine, e.g., Google.

Therefore, the goal of this paper is to develop a similarity search techniquethat is both accurate and fast. The technique works by quickly classifying processmodels as ‘relevant’, ‘irrelevant’ or ‘potentially relevant’ to a search query model,based on an estimation of their similarity to the search model. Existing similaritysearch techniques then only have to be used to rank the process models in the‘potentially relevant’ category, which typically contains much fewer models thanthe collection as a whole (in our evaluation set only 10% of the number of modelsin the collection), therewith significantly reducing the search time.

62 Z. Yan, R. Dijkman, and P. Grefen

Step 1:Identity features

Step 2:Compute feature

similarity and matching

Step 3:Estimate process

similarity and classify processes

Step 4:Compare potentially

relevant models with the query

model

Step 5:Rank retrieved

models

Fig. 2. Steps of the technique in this paper

The technique classifies process models based on the number of features thatthey have that match with the search model. The technique consists of fivesteps, as shown in Figure 2. First, features are identified in the process modelsthat have to be searched. Features are simple but representative abstractionsof a business process model, e.g., tasks and task succession. Second, the searchmodel and the process models are compared based by looking at the featuresthat they have in common, i.e.: that are similar enough such that we say thatthey are ‘matched’. For example, suppose that we use tasks and task successionas features in figure 1. We can observe that model 1 has six matching featureswith the search model: the task features ‘Buy Goods’, ‘Receive Goods’ and ‘Ver-ify Invoice’; and the succession features (‘Buy Goods’,‘Receive Goods’), (‘BuyGoods’,‘Verify Invoice’) and (‘Receive Goods’,‘Verify Invoice’). Models 2 and3 have fewer matches or have weaker similarity with respect to their matches(e.g.: ‘Buy Goods’, ‘Buy Special Goods Online’ and ‘Purchase Commodities’ aresimilar but not identical labels). Models 4 and 5 have an even weaker matchor no match at all with respect to the features that they have in common withthe search model. Third, an estimation of the similarity of process models tothe search model is made, based on the ratio of matching features, and modelsare classified based on their estimated similarity. For example, depending on theexact metrics that are used, model 1 could be considered as relevant based onmatching features, models 2 and 3 considered as potentially relevant, and mod-els 4 and 5 as irrelevant. Fourth, using existing technologies [6], process modelsimilarities are computed for potentially relevant models, e.g., models 2 and 3.Fifth, retrieved models are ranked. Relevant models, which are ranked by theirestimated similarity, (e.g., model 1), are followed by potentially relevant models,which are ranked by the similarities computed in step 4 (e.g., model 2 and 3).This finally leads to a ranked list of search results.

Experiments show that the technology helps to retrieve similar models at least3.5 times faster without impacting the quality of the results; and 5.5 times fasterif a quality reduction of 1% as a tradeoff is acceptable.

The rest of the paper is organized as follows. Section 2 defines the conceptof feature and presents features that can be used for business process similar-ity estimation. Section 3 defines metrics for measuring the similarity of featuresand checking whether features match. Section 4 presents metrics to determinewhether a model is relevant, irrelevant or potentially relevant to a search query,based on the features that match with features from the query model. Section 5presents the experiments to determine the search time and quality of the simi-larity search technique. Section 6 presents related work and section 7 concludes.


2 Business Process Model Features

In this paper features are defined as simple but representative abstractions ofbusiness process models. Their simplicity allows similarity computation basedon them to be fast and their representativeness ensures that their similarity isstrongly related to similarity of the business process models themselves. Thismakes features very suitable as means to quickly estimate the similarity of busi-ness process models.

Provided that we choose business process model features carefully, we canfurther speed up similarity search by building an index of business process modelsbased on those features. In this section, we present the business process modelfeatures that we explore in this paper.

In previous work [6,7] we have shown that the most representative features ofbusiness process models are their task labels and their structure. This means thatif two business processes have similar labels for their activities, it is also likelythat the processes themselves are similar. Similarly, if two business processes havea similar structure, it is also likely that the processes themselves are similar.

Labels can be conveniently used as features, because they are simple stringsand therefore qualify as simple abstractions. In addition, indexing mechanismsfor strings are well-known, which enables indexing of label features. However,it is harder to use the structure of a business process model as a feature. Infact, considering the structure of a graph when computing the similarity be-tween business process models in our previous work is what makes the problemcomputationally hard. Therefore, we consider the structure of a business processmodel in terms the more simple structural features: start, stop, sequence, splitand join. We define these features on the abstraction of a business process graph.

Definition 1 (Business Process Graph, Pre-set, Post-set). Let L be a setof labels. A business process graph is a tuple (N, E, λ), in which:

– N is the set of nodes;– E ⊆ N × N is the set of edges; and– λ : N → L is a function that maps nodes to labels.

Let G = (N, E, λ) be a business process graph and n ∈ N be a node: •n ={m|(m, n) ∈ E} is the pre-set of n, while n• = {m|(n, m) ∈ E} is the post-setof n.

A business process graph is a graph representation of a business process model.As such, it is an abstraction of a business process model that focuses purely onthe structure of that model, while abstracting from other aspects, e.g., differenttypes of process modeling notations (BPMN, EPC, Petri net, etc.). However,because our measure of similarity is defined on the structure of a business processmodel, abstracting from these aspects is acceptable. Optionally, certain types ofnodes can be disregarded in a business process graph. For example, figure 3shows the business process graphs for the models from figure 1. Only tasks areconsidered in these graphs. Events and gateways are disregarded. We define our


Buy Goods

Receive Goods

Query graph

StoreGoods

Verify Invoice

Graph 1

Buy SpecialGoodsOnline

Goods Receipt

ConsumeGoods

Graph 2

BuyGoods

Receive Goods

Verify Invoice

Graph 3

Purchase Commodities

Get Commodities

WarehouseCommodities

CheckReceipt

Graph4

Inform candidate

Assess request

Graph 5

Sendresponse

Receive the Book

Readthe BookStore

the Book

Order a book

Fig. 3. Business process graphs

similarity search techniques on business process graphs to be independent of aspecific notation.

Based on this the structural features are defined in Definition 2.

Definition 2 (Structural Business Process Model Features). Let G =(N, E, λ) be a business process graph.

– A start feature is a node n ∈ N that has an empty pre-set;– A stop feature is a node n ∈ N that has an empty post-set;– A sequence feature of size s is a list of nodes [n1, n2, n3, . . . , ns] ⊆ N , such

that (n1, n2) ∈ E, (n2, n3) ∈ E, . . . , (ns−1, ns) ∈ E, for s ≥ 2;– A split feature of size s is a split node n and a set of nodes {n1, n2, . . . , ns−1} ⊆

N , such that (n, n1) ∈ E, (n, n2) ∈ E, . . . , (n, ns−1) ∈ E, for s ≥ 3;– A join feature of size s is a join node n and a set of nodes {n1, n2, . . . , ns−1} ⊆

N , such that (n1, n) ∈ E, (n2, n) ∈ E, . . . , (ns−1, n) ∈ E, for s ≥ 3;

For example, for graph 1 in figure 3, the label feature set is {Buy Goods, ReceiveGoods, Verify Invoice}, the start feature set is {Buy Goods} (using node labels toidentify nodes), the stop feature set is {Verify Invoice}, the sequence feature setis {(Buy Goods, Receive Goods), (Buy Goods, Verify Invoice), (Receive Goods,Verify Invoice)}, the split feature set is {(Buy Goods,{Receive Goods, VerifyInvoice})}, and the join feature set is {({Buy Goods, Receive Goods},VerifyInvoice)}.

Many more possible features can be considered in business process models,depending on the business process model aspects that are taken into account(e.g. the organizational aspect or the data aspect), the desired performance ofthe algorithm (adding more features decreases performance) and the desiredquality of the results (adding more features is expected to increase the qualityof the search results). In this paper we focus on the most basic process modelfeatures. Extensions are possible and are a topic for future work.

Structural features can be considered in two fundamentally different ways:

1. by focusing on a single node and the interconnections that it has; or2. by focusing on a small number of connected nodes.

More explanation will be given in Section 3.


3 Feature Similarity, Matching and Indexing

It is possible to use the similarity of the features of two business process modelsas an estimator of the similarity of the business process models themselves. Tothis end, metrics must be defined that quantify the similarity of the businessprocess model features. We say that two features that are sufficiently similarare matching features and we show how we can determine feature matchingbased on their similarity. The ratio of matching features will be used in the nextsection as an estimator of the similarity of business process models. To be ableto quickly identify matching features, and therewith similar business processmodels, feature indices must be defined.

This section first presents metrics to quantify feature similarity. Second, itexplains how a feature match can be determined based on feature similarityand, third, it presents feature-based indices.

3.1 Feature Similarity

Label feature similarity can be measured in a number of different ways [7]. Forillustration purposes we will use a syntactic similarity metric, which is basedon string edit-distance, in this paper. However, in realistic cases more advancedmetrics should be used that take synonyms and stemming [7] and, if possible,domain ontologies into account [10]. The label feature similarity is defined in theformer work [6].

Definition 3 (Label Feature Similarity). Let G = (N, E, λ) be a businessprocess graph and n, m ∈ N be two nodes and let |x| represent the number ofcharacters in a label x. The string edit distance of the labels λ(n) and λ(m) of thenodes, denoted ed(λ(n), λ(m)) is the minimal number of atomic string operationsneeded to transform λ(n) into λ(m) or vice versa. The atomic string operationsare: inserting a character, deleting a character or substituting a character foranother. The label feature similarity of λ(n) and λ(m), denoted lsim(n, m) is:

lsim(n, m) = 1.0 − ed(λ(n), λ(m))max(|λ(n)|, |λ(m)|)

For example, the string edit distance between ‘Transportation planning and pro-cessing’ and ‘Transporting’ is 26: delete ‘ion planning and process’. Consequently,the label feature similarity is 1.0− 26

38 ≈ 0.32. Optional pre-processing steps, suchas lower-casing and removing special characters, can improve the results of fea-ture similarity measurements.

The drawback of measuring feature similarities only by labels is that relatedtasks can be labeled differently. For example, in figure 3, ‘Buy Special GoodsOnline’ of Graph 2 and ‘Purchase Commodities’ of Graph 3 are related to ‘BuyGoods’ of Query graph. However, compared with ‘Buy Goods’, the former oneis more verbose and the later one uses synonyms. They may not match based onthe label similarity. To deal with this situation, we use structural informationtogether with the labels.


We can measure the structural similarity of two nodes, by determining thesimilarity of the (structural) roles that they have in their business process graphs.We distinguish five different roles that nodes can have: start, stop, sequence, splitor join. We do not distinguish the type of splits or joins (e.g.: XOR or AND),because we established in previous work [6] that the similarity of the types oftwo splits or two joins is a bad indication for whether they are similar. Althoughwe ackowledge that the disctinction between different types of splits or joins isof utmost importance when determining behavioral equivalence, we point outthat measuring similarity is different from determining equivalence.

Definition 4 (Role Feature). Let n ∈ N be a node and R = {start, stop, split,join, regular} be a set of roles that a node can have. The roles of n are determinedby the function roles : N → IP(R), such that

start ∈ roles(n) ⇔ | • n| = 0stop ∈ roles(n) ⇔ |n • | = 0split ∈ roles(n) ⇔ |n • | ≥ 2join ∈ roles(n) ⇔ | • n| ≥ 2regular ∈ roles(n) ⇔ | • n| = 1 ∧ |n • | = 1

Roles of nodes are considered to be similar or not with respect to the input andoutput paths of the nodes. The definition of role feature similarity is inspired bystring edit-distance, i.e., mainly considering the differences between numbers ofinput (output) paths of two nodes. Formally, role feature similarity is defined asfollows:

Definition 5 (Role Feature Similarity). Let n, m ∈ N be two nodes. Therole feature similarity of these two nodes, denoted rsim(n, m), is defined as:1

rsim(n, m) =

⎧⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎩

1 if start ∈ croles ∧ stop ∈ crolesavg(1 − abs(|n•|−|m•|)

|n•|+|m•| , 1) if start ∈ croles ∧ stop /∈ croles

avg(1, 1 − abs(|•n|−|•m|)|•n|+|•m| ) if start /∈ croles ∧ stop ∈ croles

avg(1 − abs(|n•|−|m•|)|n•|+|m•| ,

1 − abs(|•n|−|•m|)|•n|+|•m| ) otherwise

Where croles = roles(n) ∩ roles(m).

This formula covers all possible combinations of roles that nodes can have. Forexample, the situation in which both nodes are split nodes as well as join nodesis covered by the case ‘otherwise’ (start /∈ croles ∧ stop /∈ croles). The situationin which both nodes are sequence nodes is covered by the same case and leadsto a role feature similarity score of 1.

The drawback of measuring role similarity in this way is that it does notdiscount for the fact that there is a large difference between the frequency ofthe occurrence of the different role features. Therefore, using the role similarity1 avg returns the average value; abs returns the absolute value.


metric in this way is ineffective. Since, if we give a bonus for matching rolefeatures, most nodes would receive that bonus. Therewith, the effect of the bonuswould be minimal.

For that reason we refine the role similarity metric to take this effect intoaccount. We do that by not considering features that appear too frequently inthe dataset; we say that those features lack ‘discriminative power’.

Definition 6 (Discriminative Role Features). Let N be the set of nodes ofall business process models in a collection. We say that a role feature r ∈ R isdiscriminative, denoted discriminative(r) if and only if the fraction of the nodesthat have the feature is sufficiently small:

|{n|n ∈ N , r ∈ roles(n)}||N | ≤ dcutoff

Where dcutoff is a cutoff value that determines when the fraction of nodes thathave the feature is sufficiently small. This cutoff value is a parameter that canbe set as desired, to produce the best results.

In general a good setting for dcutoff is easy to determine, because there is a largedifference between the frequency of features with a low frequency of occurrenceand features with a high frequency of occurrence. For example, in the set of busi-ness process models that we use for evaluation in this paper, there are 374 nodesin total and 178 the ‘stop’ role, 153 have the ‘start’ role, 58 the ‘seq’ role, 52 the‘split’ role, 36 the ‘join’ role. Here, we have far more nodes with the ‘start’ and‘stop’ roles than other nodes. Hence, if we set the dcutoff anywhere between 0.16and 0.40, ‘start’ and ‘stop’ role features are not considered discriminative, whileother role features are considered discriminative. We incorporate the discrimina-tive power of role features into their similarity using the following formula.

Definition 7 (Role Feature Similarity with Discriminative Power). Letn, m ∈ N be two nodes. Their role feature similarity with discriminative power,denoted rdsim(n, m), is defined as:

rdsim(n, m) ={

rsim(n, m) if ∀r ∈ roles(n) ∩ roles(m) : discriminative(r)0 otherwise

3.2 Feature Matching

We say that two features are matched if they are sufficiently similar. Whatis considered to be sufficient is determined by cutoff parameters that can beset accordingly. If two business process models have sufficiently many matchingfeatures, we consider them similar. This is explained in the next section.

We consider two node features to be matched, if their component features(label features and role features) are matched. Strong label feature similarity is astrong indication that two nodes are matched, while a combination of role featuresimilarity and (less strong) label feature similarity is also an indication that two


nodes are matched. We distinguish between these two cases when determining anode feature match, such that we can set different thresholds for label similarityin case there is also role similarity and in case there is no role similarity.

Definition 8 (Node Feature Match). Let n, m ∈ N be two node features withtheir respective label features and role features. The node features are matched ifthey satisfy one of the following two rules:

– their label features are similar to a high degree, i.e., lsim(n, m) ≥ lcutoffhigh;– their role features are similar, and their label features are similar to a medium

degree, i.e., rdsim(n, m) ≥ rcutoff and lsim(n, m) ≥ lcutoffmed.

Where lcutoffhigh, rcutoff and lcutoffmed are parameters that determine what isconsidered to be a similar to what degree. The parameters can be set as desired,to produce the best results.

We consider two structural features to be matched, if their component features(node features) are matched.

Definition 9 (Structural Feature Match). Two start features with nodes nand m are matched if and only if their node features are matched. A stop featurematch is defined similarly.

Two sequence features of size s with lists of nodes Ln = [n1, n2, n3, . . . , ns]and Lm = [m1, m2, m3, . . . , ms] are matched if and only if for each 1 ≥ i ≥ s:the node features of ni and mi are matched.

Two split features of size s with split nodes n and m and sets of nodes Sn ={n1, n2, . . . , ns−1} and Sm = {m1, m2, . . . , ms−1} are matched if and only ifthe node features of nodes n and m are matched and there exists a mappingMap : Sn → Sm holds that for each (sn, sm) ∈ Map: the node features of sn andsm are matched. A join feature match is defined similarly.

Features of different types or sizes are never matched with each other. We canuse these two definition to define general feature matching.

Definition 10 (Feature Match). Let f1 and f2 be two features. f1 and f2

are matched, denoted match(f1, f2), if and only if they are of the same typeand they are matched according to definition 8 in case they are node features ordefinition 9 in case they are structural features.

3.3 Index Construction

Node feature matching is mainly based on label similarity and structural featurematching is as well, because it is based on node feature matching. Therefore, ifwe can speed up finding similar labels, we can speed up feature matching. Weuse two index techniques to speed up finding similar labels.

First, we use an M-Tree index [3] on node labels. An M-Tree index is specifi-cally meant for quickly finding items that are similar to a given item to a givendegree. In our case, we can use it to quickly find node labels that have a label


feature similarity (definition 3) with a given node label that is higher than aspecified cutoff (lcutoffhigh or lcutoffmed in definition 8).

Second we use an inverted index [14] that maps node labels to nodes, to obtainthe set of similar nodes from the set of similar labels. We use the inverted index,because multiple nodes with the same label may exist in a collection of businessprocess models. For example, in the set of business process models that we usefor evaluation in this paper, there are 374 labels, but only 190 distinct ones. Theinverted index can prevent comparing identical labels repeatedly.

For other features, we also build inverted indices to prevent comparing iden-tical ones repeatedly. Furthermore, we can build another index, if we exploit thefact that, to match a sequence of two nodes, we always need to match a nodefeature and, to match a sequence of three nodes, we always need to match a se-quence of two nodes. In other words to match a ‘larger’ feature, we always needto match a ‘smaller’ feature. Using this observation we can build a parent/childindex. This index functions like a cache that stores the similarity between smaller‘parent features’ as well as the relation between larger ‘child features’ and theirsmaller ‘parent features’. Using this cache, to match two ‘child features’ we canlook up the corresponding ‘parent features’ and their similarity.

Definition 11 (Parent Feature, Child Feature). If feature A can generatefeature B by adding some node(s), feature A is a parent feature of feature B,and feature B is a child feature of feature A. If feature A can generate feature Bby adding only one node, feature A is a direct parent feature of feature B, andfeature B is a direct child feature of feature A.

For example, node feature ‘Buy Goods’ is a direct parent feature of sequencefeature (‘Buy Goods’, ‘Receive Goods’), and sequence feature (‘Buy Goods’,‘Receive Goods’) is a direct child feature of node feature ‘Buy Goods’. An exam-ple of the parent/child feature index is shown in figure 4, which is an exampleof for graph 2 in figure 3. The index has different feature size levels, and thefeatures of the same size are at the same level. Between levels features connectto their direct parent and child features. By doing this, we can start comparingfeatures from node feature level and only need to compare larger features whoseparent features are matched.

Buy SpecialGoods Online

ConsumeGoodsGoods Receipt

(Buy SpecialGoods Online, Goods Receipt)

(Goods Receipt, Consume Goods)

(Buy Special Goods Online, Goods Receipt, Consume Goods)

Size = 1

Size = 2

Size = 3

Fig. 4. Feature Index


4 Feature-Based Similarity Estimation

We use the fraction of features that match between two business process modelsto estimate their similarity, as shown in Definition 12.

Definition 12 (Estimated Business Process Model Similarity). Given aquery process graph Gq and another process graph G, with feature sets Fq and Fderived automatically from Gq and G. The estimated business process similarity,denoted GSim(Gq, G) is the number of features in Gq or G that are matched bya feature in the other process graph, divided by the number of all features in Gq

and G:

|{fq ∈ Fq|∃f ∈ F : match(fq, f)}| + |{f ∈ F |∃fq ∈ Fq : match(fq, f)}||Fq | + |F |

Note that we count the number of features in Gq that match a feature in Gseparately from the number of features in G that match a feature in Gq, becausethe match is not necessarily one-to-one. For example, a label feature ‘Fill-outRequest Forms’ can match with label features ‘Fill-out Requester’s Detail’ and‘Fill-out Request Details’ in the other process graph.

Based on the estimated graph similarity, we can classify graphs as relevant,irrelevant or potentially relevant to a search graph. We do that by definingthe minimal estimated similarity that a graph must have to the search querygraph to be considered relevant and the minimal estimated similarity that agraph must have to be considered potentially relevant. We retrieve relevant onedirectly, check the potentially relevant ones with expensive similarity searchalgorithms [6,7], and discard irrelevant ones.

Definition 13 (Graph Relevance Classification). Given a query processgraph Gq and another process graph G, we classify G as:

– relevant to Gq if and only if GSim(Gq, G) ≥ ratior

– potentially relevant to Gq if and only if ratior > GSim(Gq, G) > ratiop

– irrelevant to Gq if and only if ratiop ≥ GSim(Gq, G)

Where ratior and ratiop are parameters that determine when a process graphis considered to be relevant, potentially relevant or irrelevant and can be set asdesired, to produce the best results.

After determining the sets of process graphs that are relevant and potentiallyrelevant to a query graph, we can rank them. We rank the process graphs inthe ‘relevant’ set according to their estimated similarities (GSim) with respectto the query graph. We rank the process graphs in the ‘potentially relevant’set by computing their similarity to the query graph, using an existing processsimilarity metric (e.g. one from [6,7]). The complete ranked list is formed bythe ranked relevant process models followed by the ranked potentially relevantprocess models.


5 Evaluation

This section shows how the estimation step affects process similarity search interms of the quality of the retrieved results and the execution time. It firstexplains the setup of the evaluation and second the results. We implemented atool for the technique in this paper (including transferring process models toprocess graphs automatically).2

5.1 Evaluation Setup

We have two experimental setups one for evaluating the quality of retrieved re-sults and one for evaluating the execution time, respectively. We use 10 searchmodels for both of the evaluations, which are derived from the SAP referencemodel. We made some adaption to these 10 models to better evaluate the tech-nique in this paper, e.g., rephrasing labels (synonyms) and using sub-models.

To evaluate the quality of retrieved results, we use the same evaluation datasetas in [6]. This dataset consists of 100 process models that are derived from theSAP reference model. This is a collection of 604 business process models (de-scribed as EPCs) capturing business processes supported by the SAP enterprisesystem. On average the process models contain 21.6 nodes with a minimum of 3and a maximum 130 nodes. The average size of node labels is 3.8 words. For eachof the 1000 combinations of a search model and a process model a human ob-server judged whether the process model was relevant to the search model. TheR-Precision [4] of the search results returned by a particular similarity searchalgorithm can then be computed to indicate the quality of those results.

Definition 14 (R-Precision). Let D be the set of process models, Q be the set ofquery models and relevant : Q → IP(D) be the function that returns the set of rele-vant process models for each query model (as determined by the human observer).

Given the list of search results D = [d1, d2, ..., dn] for a query q with ∀di ∈ D,the R-Precision is the precision of the first R results, where R = |relevant(q)| isthe total number of process models that is relevant to the query:

R-Precision =|[di ∈ D|i ≤ n, i ≤ R, di ∈ relevant(q)]|

R

To evaluate the quality of the retrieved results, we compare the R-Precisionof the greedy graph similarity search algorithm that we developed in previouswork [6] to the R-Precision of the same algorithm with feature-based similarityestimation applied first. We use the greedy graph similarity search algorithm,because it is the fastest algorithm of the ones we studied [6] and, therefore,provides a lower-bound for improvements in execution time.

To evaluate the execution time, we compare the 10 queries with all 604 busi-ness process models in the SAP reference model, instead of just the 100 processmodels. We can do this, because to compute the execution time we do not needthe relevance scores.2 http://is.tm.tue.nl/research/apromore.html


5.2 Evaluation Results

Table 1 shows the quality of the results that are retrieved by the original greedysimilarity search algorithm and those returned by the algorithm in combinationwith similarity estimation, based on various feature types.

Table 1. Quality of retrieved results

Feature(n) Occurrences Matches Relevant Potential Irrelevant R-Precision

Former Work [6] - - 0 100 0 0.841:Node(1) 374 581 5.5 10.9 83.6 0.842:1+Seq(2) +267 +197 8.1 8 83.9 0.833:2+Seq(3) +175 +96 7.8 10.1 82.1 0.834:2+Split(3) +87 +93 7.8 10.1 82.1 0.835:4+Split(4) +23 +11 7.8 10.1 82.1 0.836:2+Join(3) +58 +18 7.8 10.1 82.1 0.837:6+Join(4) +14 +1 7.8 10.1 82.1 0.83

The rows in the table show the features that are used to do the feature-basedsimilarity estimation. In the first row no feature-based similarity estimation isdone, so this row lists the performance of the original algorithm. In the secondrow similarity estimation is done based only on node features (of size 1). In thethird row similarity estimation is done based on node features plus sequencefeatures of size 2 and so on and so forth (when a feature, whose size is biggerthan 1, is evaluated, its parent features are always included).

The columns in the table show the properties of the features and similarityestimation based on the features. First, they show the number of times features ofa given type occur in the set of process models and the number of times featuresof a certain type match in the set of process models. For example, in the setof process models, there could be four nodes labeled ‘A’. These nodes count asfour occurrences of the node feature type. Because of their high label featuresimilarity, these nodes can be considered to match. This leads to six matches,because each of the four nodes can be matches to each of the others. Second,the columns show the average number of process models that is estimated asbeing relevant, potentially relevant and irrelevant over the ten queries. Third,the columns show the average R-Precision over the ten queries.

The table shows that when similarity estimation is done based only on nodefeatures on average 5.5 models are estimated to be relevant, 10.9 models to bepotentially relevant and 83.6 models as irrelevant. Therefore, in this situation,the original algorithm only has to be used to measure the similarity of about 11%of the total number of process models. In this case the R-Precision remains thesame. If sequences of size two are also used to perform the similarity estimation,only 8% of the process models has to be compared using the original algorithm.However, this does lead to a slightly lower R-Precision. Inclusion of other typesof features does not improve the similarity estimation any further.


Table 2. Execution time of similarity search

Features(n) Relevant Potential Irrelevant Testimate Tcompute Tavgtotal Tmin

total Tmaxtotal

Former Work [6] 0 604 0 0.00s 0.60s 0.60s 0.16s 1.45s1:Node(1) 7 73 524 0.05s 0.12s 0.17s 0.03s 0.40s2:1+Seq(2) 13.7 44.9 554.4 0.05s 0.06s 0.11s 0.03s 0.24s3:2+Seq(3) 9.5 73.2 521.3 0.05s 0.16s 0.21s 0.03s 0.43s4:2+Split(3) 9.5 73.2 521.3 0.05s 0.16s 0.21s 0.03s 0.43s5:4+Split(4) 9.5 73.2 521.3 0.05s 0.16s 0.21s 0.03s 0.43s6:2+Join(3) 9.5 73.2 521.3 0.05s 0.16s 0.21s 0.03s 0.43s7:6+Join(4) 9.5 73.2 521.3 0.05s 0.16s 0.21s 0.03s 0.43s

Table 2 shows the execution time of the similarity search both when onlyusing the original algorithm and when using certain feature types for similarityestimation. All the experiment are run on a laptop with the Intel Core2 Duoprocessor T7500 CPU (2.2GHz, 800MHz FSB, 4MB L2 cache), 4 GB DDR2memory, the Windows Vista operating system and the SUN Java Virtual Ma-chine version 1.6.

The execution time consists of two parts: the total time it takes to estimatethe similarity and classify process models as relevant, potentially relevant or ir-relevant, denoted Testimate; and the time it takes to compute the similarity forthe models classified as potentially relevant, denoted Tcompute. Table 2 showsthe average estimation and computation times over the ten search queries. Inaddition to that it shows the average total time over the ten queries and the(minimum) time of processing the query that takes the least time and the (max-imum) time of processing the query that takes the most time.

The table shows that if we, on average, estimating similarity based on nodefeatures helps to retrieve similar models 3.5 times faster and from table 1 weknow that this does not impact the quality of the search results. Also involvingthe sequence of size two feature type even helps retrieve similar models 5.5 timesfaster, but from table 1 we know that this reduces the quality of the results byabout 1% as a tradeoff.

The table also shows that, on average, the total search time for the originalalgorithm is quite acceptable and takes only 0.60 seconds. However, in the worstcase the total search time for the original algorithm is already 1.45 seconds andthis time is linear over the number of models in the collection, meaning thatif we were to search a collection of 2000 models (the size of the collection ofbusiness process of a large telecom provider in the Netherlands) the search timewould already be around 5 seconds in the worst case. This is still much slowerthan common search engines, e.g., Google.

The technique depends on several parameters:

– dcutoff, which is a parameter that determines whether a role feature is con-sidered to be discriminative (Definition 6).

– lcutoffhigh, rcutoff and lcutoffmed, which are parameters that determine whatis considered to be a sufficiently similar for a feature to match (Definition 10).


– ratior and ratiop, which are parameters that determine which class a processmodel belongs to based on the fraction of features that match with the searchquery model (Definition 13).

We vary each of these parameters from 0 to 1 in increments of 0.1 and ranthe experiments with all possible combinations of parameter values within thisrange. We use the parameters that, on average, gives the highest R-Precision orthe fewest potentially relevant models with respect to the queries to show at-tractive tradeoffs. The values that we use are dcutoff = 0.3, lcutoffhigh = 0.8,rcutoff = 1.0 and lcutoffmed = 0.2. The other two parameters also depend onthe type of features we use. For the node feature (the second row in Table 1 or 2),ratior = 0.5; otherwise, ratior = 0.2. For the node and sequence (with twonodes) features (the second and third rows in Table 1 or 2), ratiop = 0.1; oth-erwise, ratiop = 0.0. The values of the parameters are specific to this dataset.More general values for should be obtained by doing more experiments.

6 Related Work

The work presented in this paper is related to business process similarity searchtechniques and to general graph similarity search techniques, which we use as abasis for the work in this paper.

Business process similarity search techniques have been developed from dif-ferent angles [1,10,11,12,13,15,19]. These techniques mainly vary with respect tothe information, incorporated in the business process models, that they use todetermine similarity [6] and the underlying formalism that they use to determinesimilarity [8]. The work described in this paper complements existing businessprocess similarity search techniques, because it focuses on estimating businessprocess similarity, rather than measuring it exactly, and using that estimate toimprove the time performance of existing techniques. As such it can be com-bined with any of the existing techniques to improve their performance. Lu andSadiq [12] also use features to determine similarity, but because their goal dif-fers from the goal of this paper (they want to measure similarity exactly), theirfeatures are larger than ours, potentially consisting of a complete process model.This makes their features suitable for measuring similarity exactly, but not forestimating it quickly.

General graph similarity search has been applied in various application do-mains, including fingerprint search, DNA search and chemical compound search.In these domains feature-based methods have been used for similarity estima-tion and measurement. Willett et al. [18] describe feature-based similarity searchin a chemical compound databases. For the structure-based method, ShaSha etal. [17] propose a path-based approach; Yan et al. [20] use discriminative frequentstructures to index graphs; Zhao et al. [22] prove that using tree structures and asmall number of discriminative graph structures to index graphs is a good choice.Furthermore, Yan et al. [21] also investigate the relationship between feature-based and structure-based methods and built an connection between the two.The main difference between the work that has been done in this area and the


work in this paper, is the different nature of business process graphs as comparedto graphs in other domains. In particular, there is practically no restriction tothe number of possible node labels in a business process graphs and matchingnodes do not necessarily have the identical labels. In comparison dna nodes havefour possible labels, chemical compound nodes have 117 possible labels, and inboth cases matching nodes have identical labels. Also, business process graphshave different structural properties and patterns. These characteristics requirethat feature types are defined specifically for business process graphs. In addi-tion to that processing feature similarity is different, because business processgraphs do not require features to match exactly for graphs to be similar, whilegraphs in other domains do require features to match exactly.

7 Conclusion

This paper presents a technique for improving the speed of business process sim-ilarity search. The evaluation shows that the search time of the fastest algorithmfor business process similarity search that exists today can be reduced by a factor3.5 on average, without impacting the quality of the results. The execution timecan even be reduced by a factor 5.5, if a reduction of the quality of the results of1% is acceptable. These reductions are computed as the average reduction overten queries. The reduction for the most complex query is a factor 16.5 and thereduction for the least complex query is a factor 2.5.

The technique works by quickly classifying models in a collection as eitherrelevant, irrelevant or potentially relevant to a search query. The original algo-rithm then only has to be used to classify the potentially relevant models in thecollection. The classification is done based on simple, but representative, partsof business process models, also called features. The evaluation shows that thefactor 3.5 reduction of processing time is achieved, if the labels and the intercon-nections of individual nodes from a business process model are used as featuresto estimate the similarity. The factor 5.5 reduction of processing time with areduction of 1% on the quality of the search results is achieved, if individualnodes, their interconnections and sequences of two nodes are used as featuresto estimate the similarity. Other features that have been used are sequences ofthree nodes and splits and joins. However, these features do not further improvethe quality of the search results or reduce the search time.

Let k be the number of process models in the collection and n be the maximumnumber of nodes in a single process model. To determine similarity of processmodels based on node features, for each node in the search graph, similar nodesneed to be selected from all the nodes in the M-Tree. There are at most k·n nodesin the M-Tree when all the nodes are distinct from each other. Therefore, thetime complexity of node feature similarity has an upper bound of O(n·log(k ·n)),while the time complexity of the, currently fastest, greedy algorithm for processsimilarity search is O(k · n3) [6]. The complexity of similarity with respect toother features is linear, because it depends on node feature similarity of whichthe results can be stored.


There are some drawbacks to the technique in this paper. First, the estima-tion is mainly based on label similarity. However, similar tasks can be labeleddifferently, e.g., synonyms, different levels of verbosity. The paper uses discrim-inative role features to deal with this issue, which have proven to be effective inthe context of our evaluation data set. However, in our evaluation data set pro-cess models were constructed by the same people, leading to models that havesimilar labels for similar tasks. This may not always be the case. Therefore, weapplied more advanced metrics for label similarity that consider synonyms [7]and domain ontologies [10]. The integration of these advanced metrics into thetechnique described in this paper is left for future work. Second, the techniquein this paper mainly focuses on tasks and connections between them. However,process models often contain more information that may be exploited when de-termining their similarity, e.g., resources and data used. Using this informationwhen determining process similarity is left for future work.

Acknowledgement

The research is supported by the China Scholarship Council (CSC).

References

1. van der Aalst, W.M.P., de Medeiros, A.K.A., Weijters, A.J.M.M.T.: Process Equiv-alence: Comparing Two Process Models based on Observed Behavior. In: Dustdar,S., Fiadeiro, J.L., Sheth, A.P. (eds.) BPM 2006. LNCS, vol. 4102, pp. 129–144.Springer, Heidelberg (2006)

2. Awad, A.: BPMN-Q: A language to query business processes. In: Proceedings ofEMISA 2007, Nanjing, China, pp. 115–128 (2007)

3. Bartolini, I., Ciaccia, P., Patella, M.: String Matching with Metric Trees Usingan Approximate Distance. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002.LNCS, vol. 2476, p. 271. Springer, Heidelberg (2002)

4. Buckley, C., Voorhees, E.M.: Evaluating Evaluation Measure Stability. In: Pro-ceedings of the ACM SIGIR Conference, pp. 33–40 (2000)

5. Choi, I., Kim, K., Jang, M.: An xml-based process repository and process querylanguage for integrated process management. Knowledge and Process Manage-ment 14(4), 303–316 (2007)

6. Dijkman, R.M., Dumas, M., Garcıa-Banuelos, L.: Graph Matching Algorithms forBusiness Process Model Similarity Search. In: Dayal, U., Eder, J., Koehler, J.,Reijers, H.A. (eds.) Business Process Management. LNCS, vol. 5701, pp. 48–63.Springer, Heidelberg (2009)

7. van Dongen, B.F., Dijkman, R.M., Mendling, J.: Measuring Similarity betweenBusiness Process Models. In: Bellahsene, Z., Leonard, M. (eds.) CAiSE 2008.LNCS, vol. 5074, pp. 450–464. Springer, Heidelberg (2008)

8. Dumas, M., Garcıa-Banuelos, L., Dijkman, R.M.: Similarity Search of BusinessProcess Models. Technical Committee on Data Engineering 32(3), 23–28 (2009)

9. Documentair structuurplan, http://www.model-dsp.nl10. Ehrig, M., Koschmider, A., Oberweis, A.: Measuring similarity between semantic

business process models. In: Proceedings of the 4th Asia-Pacific Conference onConceptual Modelling, Ballarat, Victoria, Australia, pp. 71–80 (2007)

http://www.model-dsp.nl


11. Li, C., Reichert, M.U., Wombacher, A.: On Measuring Process Model Similaritybased on High-level Change Operations. In: Li, Q., Spaccapietra, S., Yu, E., Olive,A. (eds.) ER 2008. LNCS, vol. 5231, pp. 248–264. Springer, Heidelberg (2008)

12. Lu, R., Sadiq, S.K.: On the Discovery of Preferred Work Practice through BusinessProcess Variants. In: Parent, C., Schewe, K.-D., Storey, V.C., Thalheim, B. (eds.)ER 2007. LNCS, vol. 4801, pp. 165–180. Springer, Heidelberg (2007)

13. Madhusudan, T., Zhao, L., Marshall, B.: A Case-based Reasoning Framework forWorkflow Model Management. Data Knowledge Engineering 50(1), 87–115 (2004)

14. Manning, C.D., Raghavan, P., Schutze, H.: Introduction to Information Retrieval.Cambridge University Press, Cambridge (2008)

15. Minor, M., Tartakovski, A., Bergmann, R.: Representation and structure-basedsimilarity assessment for agile workflows. In: Weber, R.O., Richter, M.M. (eds.)ICCBR 2007. LNCS (LNAI), vol. 4626, pp. 224–238. Springer, Heidelberg (2007)

16. Curran, T.A., Keller, G.: SAP R/3 Business Blueprint - Business Engineering mitden R/3-Referenzprozessen. Addison-Wesley, Bonn (1999)

17. ShaSha, D., Wang, J., Giugno, R.: Algorithmics and applications of tree and graphsearching. In: Proceedings of the 21th ACM International Symposium on Principlesof Database Systems, pp. 39–53 (2002)

18. Willett, P., Barnard, J., Downs, G.: Chemical similarity searching. J. Chem. Inf.Comput. Sci. 38, 983–996 (1998)

19. Wombacher, A.: Evaluation of technical measures for workflow similarity based ona pilot study. In: Meersman, R., Tari, Z. (eds.) OTM 2006. LNCS, vol. 4275, pp.255–272. Springer, Heidelberg (2006)

20. Yan, X., Yu, P.S., Han, J.: Graph Indexing: A Frequent Structure-based Approach.In: Proceedings of the 2004 ACM SIGMOD, pp. 335–346 (2004)

21. Yan, X., Yu, P.S., Han, J.: Substructure Similarity Search in Graph Databases. In:Proceedings of the 2005 ACM SIGMOD, pp. 766–777 (2005)

22. Zhao, P., Yu, J.X., Yu, P.S.: Graph Indexing: Tree + Delta >= Graph. In: Pro-ceedings of the 2007 ACM VLDB, pp. 938–949 (2007)

lncs 6426 - fast business process similarity search with feature … · fast business process...

Documents