Fuzzy mining - adaptive process simplification based on multi-perspective metrics Günther, C.W.; van der Aalst, W.M.P. Published in: Proceedings of the 5th International Conference on Business Process Management (BPM 2007) 24-28 September 2007, Brisbane, Australia DOI: 10.1007/978-3-540-75183-0_24 Published: 01/01/2007 Document Version Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers) Please check the document version of this publication: • A submitted manuscript is the author's version of the article upon submission and before peer-review. There can be important differences between the submitted version and the official published version of record. People interested in the research are advised to contact the author for the final version of the publication, or visit the DOI to the publisher's website. • The final author version and the galley proof are versions of the publication after peer review. • The final published version features the final layout of the paper including the volume, issue and page numbers. Link to publication Citation for published version (APA): Günther, C. W., & Aalst, van der, W. M. P. (2007). Fuzzy mining - adaptive process simplification based on multi-perspective metrics. In G. Alonso, P. Dadam, & M. Rosemann (Eds.), Proceedings of the 5th International Conference on Business Process Management (BPM 2007) 24-28 September 2007, Brisbane, Australia (pp. 328-343). (Lecture Notes in Computer Science; Vol. 4714). Berlin: Springer. DOI: 10.1007/978-3-540-75183- 0_24 General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. 
• You may not further distribute the material or use it for any profit-making activity or commercial gain • You may freely distribute the URL identifying the publication in the public portal ? Take down policy If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. Download date: 09. Sep. 2018



Fuzzy Mining – Adaptive Process Simplification Based on Multi-perspective Metrics

Christian W. Günther and Wil M.P. van der Aalst

Eindhoven University of Technology, P.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands
{c.w.gunther, w.m.p.v.d.aalst}@tue.nl

Abstract. Process Mining is a technique for extracting process models from execution logs. This is particularly useful in situations where people have an idealized view of reality. Real-life processes turn out to be less structured than people tend to believe. Unfortunately, traditional process mining approaches have problems dealing with unstructured processes. The discovered models are often “spaghetti-like”, showing all details without distinguishing what is important and what is not. This paper proposes a new process mining approach to overcome this problem. The approach is configurable and allows for different faithfully simplified views of a particular process. To do this, the concept of a roadmap is used as a metaphor. Just like different roadmaps provide suitable abstractions of reality, process models should provide meaningful abstractions of operational processes encountered in domains ranging from healthcare and logistics to web services and public administration.

1 Introduction

Business processes, whether defined and prescribed or implicit and ad-hoc, drive and support most of the functions and services in enterprises and administrative bodies of today’s world. For describing such processes, modeling them as graphs has proven to be a useful and intuitive tool. While modeling is well-established in process design, it is complicated to do for monitoring and documentation purposes. However, especially for monitoring, process models are valuable artifacts, because they allow us to communicate complex knowledge in intuitive, compact, and high-level form.

Process mining is a line of research which attempts to extract such abstract, compact representations of processes from their logs, i.e. execution histories [1,2,3,5,6,7,10,14]. The α-algorithm, for example, can create a Petri net process model from an execution log [2]. In the last years, a number of process mining approaches have been developed, which address the various perspectives of a process (e.g., control flow, social network), and use various techniques to generalize from the log (e.g., genetic algorithms, theory of regions [12,4]). Applied to explicitly designed, well-structured, and rigidly enforced processes, these techniques are able to deliver an impressive set of information, yet their purpose is somewhat limited to verifying the compliant execution. However, most processes in real life have not been purposefully designed and optimized, but have evolved over time or are not even explicitly defined. In such situations, the application of process mining is far more interesting, as it is not limited to re-discovering what we already know, but it can be used to unveil previously hidden knowledge.

G. Alonso, P. Dadam, and M. Rosemann (Eds.): BPM 2007, LNCS 4714, pp. 328–343, 2007. © Springer-Verlag Berlin Heidelberg 2007


Over the last couple of years we obtained much experience in applying the tried-and-tested set of mining algorithms to real-life processes. Existing algorithms tend to perform well on structured processes, but often fail to provide insightful models for less structured processes. The phrase “spaghetti models” is often used to refer to the results of such efforts. The problem is not that existing techniques produce incorrect results. In fact, some of the more robust process mining techniques guarantee that the resulting model is “correct” in the sense that reality fits into the model. The problem is that the resulting model shows all details without providing a suitable abstraction. This is comparable to looking at the map of a country where all cities and towns are represented by identical nodes and all roads are depicted in the same manner. The resulting map is correct, but not very suitable. Therefore, the concept of a roadmap is used as a metaphor to visualize the resulting models. Based on an analysis of the log, the importance of activities and relations among activities are taken into account. Activities and their relations can be clustered or removed depending on their role in the process. Moreover, certain aspects can be emphasized graphically just like a roadmap emphasizes highways and large cities over dirt roads and small towns. As will be demonstrated in this paper, the roadmap metaphor allows for meaningful process models.

In this paper we analyze the problems traditional mining algorithms have with less-structured processes (Section 2), and use the metaphor of maps to derive a novel, more appropriate approach from these lessons (Section 3). We abandon the idea of performing process mining confined to one perspective only, and propose a multi-perspective set of log-based process metrics (Section 4). Based on these, we have developed a flexible approach for Fuzzy Mining, i.e. adaptively simplifying mined process models (Section 5).

2 Less-Structured Processes – The Infamous Spaghetti Affair

The fundamental idea of process mining is both simple and persuasive: There is a process which is unknown to us, but we can follow the traces of its behavior, i.e. we have access to enactment logs. Feeding those into a process mining technique will yield an aggregate description of that observed behavior, e.g. in form of a process model.

In the beginning of process mining research, mostly artificially generated logs were used to develop and verify mining algorithms. Then, also logs from real-life workflow management systems, e.g. Staffware, could be successfully mined with these techniques. Early mining algorithms had high requirements towards the qualities of log files, e.g. they were supposed to be complete and limited to events of interest. Yet, most of the resulting problems could be easily remedied with more data, filtering the log and tuning the algorithm to better cope with problematic data.

While these successes were certainly convincing, most real-life processes are not executed within rigid, inflexible workflow management systems and the like, which enforce correct, predictable behavior. It is the inherent inflexibility of these systems which drove the majority of process owners (i.e., organizations having the need to support processes) to choose more flexible or ad-hoc solutions. Concepts like Adaptive Workflow or Case Handling either allow users to change the process at runtime, or define processes in a somewhat more “loose” manner which does not strictly define a specific path of execution. Yet the most popular solutions for supporting processes do not enforce any defined behavior at all, but merely offer functionality like sharing data and passing messages between users and resources. Examples for these systems are ERP (Enterprise Resource Planning) and CSCW (Computer-Supported Cooperative Work) systems, custom-built solutions, or plain E-Mail.

It is obvious that executing a process within such less restrictive environments will lead to more diverse and less-structured behavior. This abundance of observed behavior, however, unveiled a fundamental weakness in most of the early process mining algorithms. When these are used to mine logs from less-structured processes, the result is usually just as unstructured and hard to understand. These “spaghetti” process models do not provide any meaningful abstraction from the event logs themselves, and are therefore useless to process analysts. It is important to note that these “spaghetti” models are not incorrect. The problem is that the processes themselves are really “spaghetti-like”, i.e., the model is an accurate reflection of reality.

[Figure 1: process model graph; node and arc labels not reproduced]

Fig. 1. Excerpt of a typical “Spaghetti” process model (ca. 20% of complete model)

An example of such a “spaghetti” model is given in Figure 1. It is noteworthy that this figure shows only a small excerpt (ca. 20%) of a highly unstructured process model. It has been mined from machine test logs using the Heuristics Miner, one of the traditional process mining techniques which is most resilient towards noise in logs [14]. Although this result is rather useful, certainly in comparison with other early process mining techniques, it is plain to see that deriving helpful information from it is not easy.

Event classes found in the log are interpreted as activity nodes in the process model. Their sheer number makes it difficult to focus on the interesting parts of the process. The abundance of arcs in the model, which constitute the actual “spaghetti”, introduces an even greater challenge for interpretation. Separating cause from effect, or determining the general direction in which the process is executed, is not possible because virtually every node is transitively connected to any other node in both directions. This mirrors the crux of flexibility in process execution: when people are free to execute anything in any given order they will usually make use of that freedom, which renders monitoring business activities an essentially infeasible task.


We argue that the fault for these problems lies neither with less-structured processes, nor with process mining itself. Rather, it is the result of a number of, mostly implicit, assumptions which process mining has historically made, both with respect to the event logs under consideration, and regarding the processes which have generated them. While being perfectly sound in structured, controlled environments, these assumptions do not hold in less-structured, real-life environments, and thus ultimately make traditional process mining fail there.

Assumption 1: All logs are reliable and trustworthy. Any event type found in the log is assumed to have a corresponding logical activity in the process. However, activities in real-life processes may raise a random number of seemingly unrelated events. Activities may also go unrecorded, while other events do not correspond to any activity at all.

The assumption that logs are well-formed and homogeneous is also often not true. For example, a process found in the log is assumed to correspond to one logical entity. In less-structured environments, however, there are often a number of “tacit” process types which are executed, and thus logged, under the same name.

Also, the idea that all events are raised on the same level of abstraction, and are thus equally important, is not true in real-life settings. Events on different levels are “flattened” into the same event log, while there is also a high amount of informational events (e.g., debug messages from the system) which need to be disregarded.

Assumption 2: There exists an exact process which is reflected in the logs. This assumption implies that there is one perfect solution out there, which needs to be found. Consequently, the mining result should model the process completely, accurately, and precisely. However, as stated before, spaghetti models are not necessarily incorrect: the models look like spaghetti because they precisely describe every detail of the less-structured behavior found in the log. A more high-level solution, which is able to abstract from details, would thus be preferable.

Traditional mining algorithms have also been confined to a single perspective (e.g., control flow, data), as such an isolated view is supposed to yield higher precision. However, perspectives interact in less-structured processes, e.g. the data flow may complement the control flow, and thus also needs to be taken into account.

In general, the assumption of a perfect solution is not well-suited for real-life application. Reality often differs significantly from theory, in ways that had not been anticipated. Consequently, useful tools for practical application must be explorative, i.e. support analysts in tweaking results and thus capitalizing on their knowledge.

We have conducted process mining case studies in organizations like Philips Medical Systems, UWV, Rijkswaterstaat, the Catharina Hospital Eindhoven and the AMC hospital Amsterdam, and the Dutch municipalities of Alkmaar and Heusden. Our experiences in these case studies have shown the above assumptions to be violated in all ways imaginable. Therefore, to make process mining a useful tool in practical, less-structured settings, these assumptions need to be discarded. The next section introduces the main concept of our mining approach, which takes these lessons into account.

3 An Adaptive Approach for Process Simplification

Process mining techniques which are suitable for less-structured environments need to be able to provide a high-level view on the process, abstracting from undesired details. The field of cartography has always been faced with a quite similar challenge, namely to simplify highly complex and unstructured topologies. Activities in a process can be related to locations in a topology (e.g. towns or road crossings) and precedence relations to traffic connections between them (e.g., railways or motorways).

Fig. 2. Example of a road map

When one takes a closer look at maps (such as the example in Figure 2), the solution cartography has come up with to simplify and present complex topologies, one can derive a number of valuable concepts from them.

Aggregation: To limit the number of information items displayed, maps often show coherent clusters of low-level detail information in an aggregated manner. One example is cities in road maps, where particular houses and streets are combined within the city’s transitive closure (e.g., the city of Eindhoven in Figure 2).

Abstraction: Lower-level information which is insignificant in the chosen context is simply omitted from the visualization. Examples are bicycle paths, which are of no interest in a motorist’s map.

Emphasis: More significant information is highlighted by visual means such as color, contrast, saturation, and size. For example, maps emphasize more important roads by displaying them as thicker, more colorful and contrasting lines (e.g., motorway “E25” in Figure 2).

Customization: There is no one single map for the world. Maps are specialized on a defined local context, have a specific level of detail (city maps vs highway maps), and a dedicated purpose (interregional travel vs alpine hiking).

These concepts are universal, well-understood, and established. In this paper we explore how they can be used to simplify and properly visualize complex, less-structured processes. To do that, we need to develop appropriate decision criteria on which to base the simplification and visualization of process models. We have identified two fundamental metrics which can support such decisions: (1) significance and (2) correlation.

Significance, which can be determined both for event classes (i.e., activities) and binary precedence relations over them (i.e., edges), measures the relative importance of behavior. As such, it specifies the level of interest we have in events, or their occurring after one another. One example for measuring significance is by frequency, i.e. events or precedence relations which are observed more frequently are deemed more significant.

Correlation on the other hand is only relevant for precedence relations over events. It measures how closely related two events following one another are. Examples for measuring correlation include determining the overlap of data attributes associated to two events following one another, or comparing the similarity of their event names. More closely correlated events are assumed to share a large amount of their data, or have their similarity expressed in their recorded names (e.g. “check customer application” and “approve customer application”).
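The two correlation examples named above can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the function names `data_overlap` and `name_similarity`, the use of a Jaccard ratio over attribute key/value pairs, and the use of `difflib.SequenceMatcher` for name similarity are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import Mapping


def data_overlap(attrs_a: Mapping[str, str], attrs_b: Mapping[str, str]) -> float:
    """Hypothetical data-overlap correlation: Jaccard ratio of the
    (key, value) pairs carried by two consecutive events."""
    items_a, items_b = set(attrs_a.items()), set(attrs_b.items())
    union = items_a | items_b
    if not union:
        return 0.0  # no attributes at all: treat as uncorrelated
    return len(items_a & items_b) / len(union)


def name_similarity(name_a: str, name_b: str) -> float:
    """Hypothetical name-based correlation: string similarity in [0, 1]."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


if __name__ == "__main__":
    print(name_similarity("check customer application",
                          "approve customer application"))
    print(data_overlap({"customer": "42", "amount": "10"},
                       {"customer": "42", "amount": "12"}))
```

Under these assumptions, the two event names from the text score higher than an unrelated pair of names, and two events sharing one of three distinct attribute pairs get an overlap of one third.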

Based on these two metrics, which have been defined specially for this purpose, we can sketch our approach for process simplification as follows.

– Highly significant behavior is preserved, i.e. contained in the simplified model.
– Less significant but highly correlated behavior is aggregated, i.e. hidden in clusters within the simplified model.
– Less significant and lowly correlated behavior is abstracted from, i.e. removed from the simplified model.
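The three rules above can be read as a small decision procedure. The following Python sketch shows one way to express it; the threshold parameters `sig_threshold` and `corr_threshold`, their default values, and the names `Action` and `simplification_action` are hypothetical and not taken from the paper.

```python
from enum import Enum


class Action(Enum):
    PRESERVE = "preserve"    # keep in the simplified model
    AGGREGATE = "aggregate"  # hide inside a cluster
    ABSTRACT = "abstract"    # remove from the simplified model


def simplification_action(significance: float,
                          correlation: float,
                          sig_threshold: float = 0.5,
                          corr_threshold: float = 0.5) -> Action:
    """Classify behavior by significance first, then by correlation.
    Both thresholds are assumed, user-chosen cut-offs in [0, 1]."""
    if significance >= sig_threshold:
        return Action.PRESERVE
    if correlation >= corr_threshold:
        return Action.AGGREGATE
    return Action.ABSTRACT


if __name__ == "__main__":
    print(simplification_action(0.9, 0.1))  # Action.PRESERVE
    print(simplification_action(0.2, 0.8))  # Action.AGGREGATE
    print(simplification_action(0.2, 0.1))  # Action.ABSTRACT
```

Checking significance before correlation mirrors the order of the rules: correlation only matters for behavior that is not significant enough to be preserved on its own.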

This approach can greatly reduce and focus the displayed behavior, by employing the concepts of aggregation and abstraction. Based on such a simplified model, we can employ the concept of emphasis, by highlighting more significant behavior.

Fig. 3. Excerpt of a simplified and decorated process model

Figure 3 shows an excerpt from a simplified process model, which has been created using our approach. Bright square nodes represent significant activities, the darker octagonal node is an aggregated cluster of three less-significant activities. All nodes are labeled with their respective significance, with clusters displaying the mean significance of their elements. The brightness of edges between nodes emphasizes their significance, i.e. more significant relations are darker. Edges are also labeled with their respective significance and correlation values. By either removing or hiding less significant information, this visualization enables the user to focus on the most interesting behavior in the process.

Yet, the question of what constitutes “interesting” behavior can have a number of answers, based on the process, the purpose of analysis, or the desired level of abstraction. In order to yield the most appropriate result, significance and correlation measures need to be configurable. We have thus developed a set of metrics, which can each measure significance or correlation based on different perspectives (e.g., control flow or data) of the process. By influencing the “mix” of these metrics and the simplification procedure itself, the user can customize the produced results to a large degree.
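One plausible reading of such a configurable “mix” is a weighted average over several metric values for the same node or edge. The Python sketch below illustrates that reading only; the function `mix_metrics`, the weighting scheme, and the metric names in the example are assumptions, and the paper may combine metrics differently.

```python
from typing import Mapping


def mix_metrics(values: Mapping[str, float],
                weights: Mapping[str, float]) -> float:
    """Weighted average of metric values; metrics without a configured
    weight are ignored (weight 0). Purely illustrative."""
    total_weight = sum(weights.get(name, 0.0) for name in values)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(value * weights.get(name, 0.0)
                       for name, value in values.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical metric names and user-chosen weights.
    node_values = {"frequency": 0.8, "hypothetical_metric": 0.2}
    user_weights = {"frequency": 3.0, "hypothetical_metric": 1.0}
    print(mix_metrics(node_values, user_weights))
```

Raising the weight of one metric moves the combined value toward that metric's view of the process, which is how a user could shift emphasis between perspectives.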

The following section introduces some of the metrics we have developed for significance and correlation in more detail.

4 Log-Based Process Metrics

Significance and correlation, as introduced in the previous section, are suitable concepts for describing the importance of behavior in a process in a compact manner. However, because they represent very generalized, condensed metrics, it is important to measure them in an appropriate manner. Taking into account the wide variety of processes, analysis questions and objectives, and levels of abstraction, it is necessary to make this measurement adaptable to such parameters.

Our approach is based on a configurable and extensible framework for measuring significance and correlation. The design of this framework is introduced in the next subsection, followed by detailed introductions to the three primary types of metrics: unary significance, binary significance, and binary correlation.

4.1 Metrics Framework

An important property of our measurement framework is that for each of the three primary types of metrics (unary significance, binary significance, and binary correlation) different implementations may be used. A metric may either be measured directly from the log (log-based metric), or it may be based on measurements of other, log-based metrics (derivative metric).

When the log contains a large number of undesired events, which occur in between desired ones, actual causal dependencies between the desired event classes may go unrecorded. To counter this, our approach also measures long-term relations, i.e. when the sequence A, B, C is found in the log, we will not only record the relations A → B and B → C, but also the length-2-relationship A → C. We allow measuring relationships of arbitrary length, while the measured value will be attenuated, i.e. decreased, with increasing length of relationship.
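
The attenuated measurement of long-term relations can be illustrated with a minimal Python sketch. It is not the ProM implementation: the function name `measure_relations`, the geometric attenuation factor, and the `max_distance` limit are assumptions made for illustration, since the text only states that longer relationships contribute less.

```python
from collections import defaultdict


def measure_relations(traces, max_distance=3, attenuation=0.5):
    """Accumulate attenuated follows-weights for pairs of event classes.

    A pair observed at distance d (d = 1 means direct succession)
    contributes attenuation ** (d - 1). The geometric decay is an
    assumption; the paper does not fix the attenuation function.
    """
    weights = defaultdict(float)
    for trace in traces:
        for i, source in enumerate(trace):
            for d in range(1, max_distance + 1):
                if i + d >= len(trace):
                    break
                weights[(source, trace[i + d])] += attenuation ** (d - 1)
    return dict(weights)


# The sequence A, B, C from the text, measured up to length-2 relationships.
relations = measure_relations([["A", "B", "C"]], max_distance=2)
```

For this single trace, A → B and B → C each receive weight 1.0, while the length-2 relationship A → C receives the attenuated weight 0.5.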

4.2 Unary Significance Metrics

Unary significance describes the relative importance of an event class, which will be represented as a node in the process model. As our approach is based on removing less significant behavior, and as removing a node implies removing all of its connected arcs, unary significance is the primary driver of simplification.

One metric for unary significance is frequency significance, i.e. the more often a certain event class was observed in the log, the more significant it is. Frequency is a log-based metric, and is in fact the most intuitive of all metrics. Traditional process mining techniques are built solely on the principle of measuring frequency, and it remains an important foundation of our approach. However, real-life logs often contain a large number of events which are in fact not very significant, e.g. an event which describes saving the process state after every five activities. In such situations, frequency plays a diminished role and can rather distort results.
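
As a small illustration, frequency significance could be computed as follows. This is a sketch under assumptions: the name `frequency_significance` is hypothetical, and scaling by the most frequent event class (so that values fall into [0, 1]) is our choice for the example, not something the text prescribes.

```python
from collections import Counter


def frequency_significance(traces):
    """Relative frequency of each event class, scaled so the maximum is 1.0.

    Normalizing by the most frequent class is an assumption made here.
    """
    counts = Counter(event for trace in traces for event in trace)
    top = max(counts.values())
    return {event: n / top for event, n in counts.items()}


# Hypothetical log with three event classes.
unary = frequency_significance([["A", "B", "A"], ["A", "save"]])
```

Note that a frequent but uninteresting event class, such as a periodic "save" event, would score highly under this metric, which is the distortion described above.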

Another, derivative metric for unary significance is routing significance. The idea behind routing significance is that points at which the process either forks (i.e., split nodes) or synchronizes (i.e., join nodes) are interesting in that they substantially define the structure of a process. The more the number and significance of predecessors for a node (i.e., its incoming arcs) differ from the number and significance of its successors (i.e., outgoing arcs), the more important that node is for routing in the process. Routing significance is important as an amplifier metric, i.e. it helps to separate important routing nodes (whose significance it increases) from those less important.

4.3 Binary Significance Metrics

Binary significance describes the relative importance of a precedence relation between two event classes, i.e. an edge in the process model. Its purpose is to amplify and to isolate the observed behavior which is supposed to be of the greatest interest. In our simplification approach, it primarily influences the selection of edges which will be included in the simplified process model.

As with unary significance, the log-based frequency significance metric is also the most important implementation for binary significance. The more often two event classes are observed after one another, the more significant their precedence relation.

The distance significance metric is a derivative implementation of binary significance. The more the significance of a relation differs from its source and target nodes’ significances, the lower its distance significance value. The rationale behind this metric is that globally important relations are also always the most important relations for their endpoints. Distance significance locally amplifies crucial key relations between event classes, and weakens already insignificant relations. Thereby, it can clarify ambiguous situations in edge abstraction, where many relations “compete” over being included in the simplified process model. Especially in very unstructured execution logs, this metric is an indispensable tool for isolating behavior of interest.

4.4 Binary Correlation Metrics

Binary correlation measures the distance of events in a precedence relation, i.e. how closely related two events following one another are. Distance, in the process domain, can be equated to the magnitude of context change between two activity executions. Subsequently occurring activities which have a more similar context (e.g., which are executed by the same person or in a short timeframe) are thus evaluated to be more highly correlated. Binary correlation is the main driver of the decision between aggregation or abstraction of less-significant behavior.

Proximity correlation evaluates event classes which occur shortly after one another as highly correlated. This is important for identifying clusters of events which correspond to one logical activity, as these are commonly executed within a short timeframe.

Another feature of such clusters of events occurring within the realm of one higher-level activity is that they are executed by the same person. Originator correlation between event classes is determined from the names of the persons who have triggered two subsequent events. The more similar these names, the more highly correlated the respective event classes. In real applications, user names often include job titles or function identifiers (e.g. “sales John” and “sales Paul”). Therefore, this metric implementation is a valuable tool also for unveiling implicit correlation between events.

Endpoint correlation is quite similar; however, instead of resources, it compares the activity names of subsequent events. More similar names will be interpreted as higher correlation. This is important for low-level logs including a large amount of less significant events which are closely related. Most of the time, events which reflect similar tasks also are given similar names (e.g., “open valve13” and “close valve13”), and this metric can unveil these implicit dependencies.
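
Both originator and endpoint correlation rest on a notion of name similarity. The following Python sketch illustrates the idea; the text does not specify the similarity measure, so the use of `difflib.SequenceMatcher` from the standard library is a stand-in chosen for this example, and `name_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity of two names in [0, 1]; 1.0 means identical strings.

    SequenceMatcher is an illustrative stand-in, not the measure
    used by the Fuzzy Miner.
    """
    return SequenceMatcher(None, a, b).ratio()


# Activity names from the text, plus one unrelated (hypothetical) name.
related = name_similarity("open valve13", "close valve13")
unrelated = name_similarity("open valve13", "send invoice")
```

Under this measure, the two valve activities score higher than the unrelated pair, so their precedence relation would be rated as more highly correlated. The same helper could be applied to originator names such as “sales John” and “sales Paul”.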

In most logs, events also include additional attributes, containing snapshots from the data perspective of the process (e.g., the value of an insurance claim). In such cases, the selection of attributes logged for each event can be interpreted as its context. Thus, the data type correlation metric evaluates event classes, where subsequent events share a large amount of data types (i.e., attribute keys), as highly correlated. Data value correlation is more specific, in that it also takes the values of these common attributes into account. In that, it uses relative similarity, i.e. small changes of an attribute value will compromise correlation less than a completely different value.
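
A minimal sketch of the two data-perspective metrics follows. All names are hypothetical, and the concrete formulas (key overlap as intersection over union, relative numeric difference for values) are assumptions that merely satisfy the properties stated above; they are not taken from the implementation.

```python
def data_type_correlation(attrs_a, attrs_b):
    """Share of attribute keys that two events have in common (assumed: Jaccard overlap)."""
    keys_a, keys_b = set(attrs_a), set(attrs_b)
    if not keys_a and not keys_b:
        return 0.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)


def value_similarity(x, y):
    """Relative similarity: small numeric changes cost less than large ones."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        scale = max(abs(x), abs(y))
        return 1.0 if scale == 0 else max(0.0, 1.0 - abs(x - y) / scale)
    return 1.0 if x == y else 0.0


def data_value_correlation(attrs_a, attrs_b):
    """Mean value similarity over the attribute keys shared by both events."""
    shared = set(attrs_a) & set(attrs_b)
    if not shared:
        return 0.0
    return sum(value_similarity(attrs_a[k], attrs_b[k]) for k in shared) / len(shared)


# Hypothetical events from an insurance claim process.
first = {"claim": 1000, "customer": "c1"}
small_change = {"claim": 1100, "customer": "c1", "agent": "x"}
large_change = {"claim": 5000, "customer": "c1"}
```

Here `small_change` shares two of three attribute keys with `first`, and its slightly different claim value reduces data value correlation far less than the very different value in `large_change`.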

Currently, all implementations for binary correlation in our approach are log-based. The next section introduces our approach for adaptive simplification and visualization of complex process models, which is based on the aggregated measurements of all metric implementations which have been introduced in this section.

5 Adaptive Graph Simplification

Most process mining techniques follow an interpretative approach, i.e. they attempt to map behavior found in the log to typical process design patterns (e.g., whether a split node has AND- or XOR-semantics). Our approach, in contrast, focuses on high-level mapping of behavior found in the log, while not attempting to discover such patterns. Thus, creating the initial (non-simplified) process model is straightforward: All event classes found in the log are translated to activity nodes, whose importance is expressed by unary significance. For every observed precedence relation between event classes, a corresponding directed edge is added to the process model. This edge is described by the binary significance and correlation of the ordering relation it represents.

Subsequently, we apply three transformation methods to the process model, which will successively simplify specific aspects of it. The first two phases, conflict resolution and edge filtering, remove edges (i.e., precedence relations) between activity nodes, while the final aggregation and abstraction phase removes and/or clusters less-significant nodes. Removing edges from the model first is important: due to the less-structured nature of real-life processes and our measurement of long-term relationships, the initial model contains deceptive ordering relations, which do not correspond to valid behavior and need to be discarded. The following sections provide details about the three phases of our simplification approach, given in the order in which they are applied to the initial model.

5.1 Conflict Resolution in Binary Relations

Whenever two nodes in the initial process model are connected by edges in both directions, they are defined to be in conflict. Depending on their specific properties, conflicts may represent one of three possible situations in the process:

– Length-2-loop: Two activities A and B constitute a loop in the process model, i.e. after executing A and B in sequence, one may return to A and start over. In this case, the conflicting ordering relations between these activities are explicitly allowed in the original process, and thus need to be preserved.

– Exception: The process orders A → B in sequence; however, during real-life execution the exceptional case of B → A also occurs. Most of the time, the “normal” behavior is clearly more significant. In such cases, the “weaker” relation needs to be discarded to focus on the main behavior.

– Concurrency: A and B can be executed in any order (i.e., they are on two distinct, parallel paths). The log will most likely record both possible cases, i.e. A → B and B → A, which will create a conflict. In this case, both conflicting ordering relations need to be removed from the process model.

Conflict resolution attempts to classify each conflict as one of these three cases, and then resolves it accordingly. For that, it first determines the relative significance of both conflicting relations.

Fig. 4. Evaluating the relative significance of conflicting relations

Figure 4 shows an example of two activities A and B in conflict. The relative significance for an edge A → B can be determined as follows.

Definition 1 (Relative significance). Let $N$ be the set of nodes in a process model, and let $\mathrm{sig} : N \times N \to \mathbb{R}^{+}_{0}$ be a relation that assigns to each pair of nodes $A, B \in N$ the significance of a precedence relation over them. $\mathrm{rel} : N \times N \to \mathbb{R}^{+}_{0}$ is a relation which assigns to each pair of nodes $A, B \in N$ the relative importance of their ordering relation:

$$\mathrm{rel}(A, B) = \frac{1}{2} \cdot \frac{\mathrm{sig}(A, B)}{\sum_{X \in N} \mathrm{sig}(A, X)} + \frac{1}{2} \cdot \frac{\mathrm{sig}(A, B)}{\sum_{X \in N} \mathrm{sig}(X, B)}.$$

Every ordering relation A → B has a set of competing relations Comp_AB = A_out ∪ B_in. This set of competing relations is composed of A_out, i.e. all edges starting from A, and of B_in, i.e. all edges pointing to B (cf. Figure 4). Note that this set also contains the reference relation itself, i.e. more specifically: B_in ∩ A_out = {A → B}. By dividing the significance of an ordering relation A → B by the sum of all its competing relations’ significances, we get the importance of this relation in its local context.
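
Definition 1 translates directly into code. The following Python sketch assumes that binary significance is stored as a dictionary mapping (source, target) pairs to non-negative values; this representation, the function name, and the example values are illustrative assumptions.

```python
def relative_significance(sig, a, b):
    """rel(A, B) from Definition 1.

    sig maps (source, target) edges to non-negative significance;
    missing edges count as significance 0.
    """
    edge = sig.get((a, b), 0.0)
    if edge == 0.0:
        return 0.0
    out_total = sum(v for (s, _), v in sig.items() if s == a)  # edges starting from A
    in_total = sum(v for (_, t), v in sig.items() if t == b)   # edges pointing to B
    return 0.5 * edge / out_total + 0.5 * edge / in_total


# Hypothetical significance values for a small model.
sig = {("A", "B"): 0.6, ("A", "C"): 0.2, ("D", "B"): 0.2, ("B", "A"): 0.1}
rel_ab = relative_significance(sig, "A", "B")
rel_ba = relative_significance(sig, "B", "A")
```

In this example, A → B accounts for 0.6 of the 0.8 total significance leaving A and of the 0.8 total entering B, so rel(A, B) = 0.75. The weak relation B → A has no competitors at all, so rel(B, A) = 1.0, which shows that relative significance is strictly local.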

If the relative significance of both conflicting relations, rel(A, B) and rel(B, A), exceeds a specified preserve threshold value, this signifies that A and B are apparently forming a length-2-loop, which is their most significant behavior in the process. Thus, in this case, both A → B and B → A will be preserved.

In case at least one conflicting relation’s relative significance is below this threshold, the offset between both relations’ relative significances is determined, i.e. ofs(A, B) = |rel(A, B) − rel(B, A)|. The larger this offset value, the more the relative significances of both conflicting relations differ, i.e. one of them is clearly more important. Thus, if the offset value exceeds a specified ratio threshold, we assume that the relatively less significant relation is in fact an exception and remove it from the process model.

Otherwise, i.e. if at least one of the relations has a relative significance below the preserve threshold and their offset is smaller than the ratio threshold, this signifies that both A → B and B → A are relations which are of no greater importance for both their source and target activities. This low, yet balanced relative significance of conflicting relations hints at A and B being executed concurrently, i.e. in two separate threads of the process. Consequently, both edges are removed from the process model, as they do not correspond to factual ordering relations.
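
The three-way decision described above can be summarized in a short Python sketch. The function name, the string labels for the edges, and the threshold values used in the examples are hypothetical; only the decision logic follows the text.

```python
def resolve_conflict(rel_ab, rel_ba, preserve_threshold, ratio_threshold):
    """Return the set of conflicting edges that survive conflict resolution."""
    # Length-2-loop: both relations are locally important, keep both.
    if rel_ab > preserve_threshold and rel_ba > preserve_threshold:
        return {"A->B", "B->A"}
    # Exception: one relation clearly dominates, drop the weaker one.
    offset = abs(rel_ab - rel_ba)
    if offset > ratio_threshold:
        return {"A->B"} if rel_ab > rel_ba else {"B->A"}
    # Concurrency: low, balanced relative significance, drop both.
    return set()


# Illustrative threshold values, not recommended defaults.
loop = resolve_conflict(0.8, 0.7, preserve_threshold=0.6, ratio_threshold=0.3)
exception = resolve_conflict(0.7, 0.1, preserve_threshold=0.6, ratio_threshold=0.3)
concurrent = resolve_conflict(0.2, 0.25, preserve_threshold=0.6, ratio_threshold=0.3)
```

The three calls yield both edges, only A → B, and no edge, respectively, matching the length-2-loop, exception, and concurrency cases.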

5.2 Edge Filtering

Although conflict resolution removes a number of edges from the process model, the model still contains a large amount of precedence relations. To infer further structure from this model, it is necessary to remove most of these remaining edges by edge filtering, which isolates the most important behavior. The obvious solution is to remove the globally least significant edges, leaving only highly significant behavior. However, this approach yields sub-optimal results, as it is prone to create small, disparate clusters of highly frequent behavior. Also, in the subsequent aggregation step, highly correlated relations play an important part in connecting clusters, even if they are not very significant.

Therefore, our edge filtering approach evaluates each edge A → B by its utility util(A, B), a weighted sum of its significance and correlation. A configurable utility ratio ur ∈ [0, 1] determines the weight, such that util(A, B) = ur · sig(A, B) + (1 − ur) · cor(A, B). A larger value for ur will preserve more significant edges, while a smaller value will favor highly correlated edges.

Figure 5 shows an example for processing the incoming arcs of a node A. Using a utility ratio of 0.5, i.e. taking significance and correlation equally into account, the utility value is calculated, which ranges from 0.4 to 1.0 in this example.

Fig. 5. Filtering the set of incoming edges for a node A

Filtering edges is performed on a local basis, i.e. for each node in the process model, the algorithm preserves its incoming and outgoing edges with the highest utility value. The decision of which edges get preserved is configured by the edge cutoff parameter co ∈ [0, 1]. For every node N, the utility values for each incoming edge X → N are normalized to [0, 1], so that the weakest edge is assigned 0 and the strongest one 1. All edges whose normalized utility value exceeds co are added to the preserved set. In the example in Figure 5, only two of the original edges are preserved, using an edge cutoff value of 0.4: P → A (with normalized utility of 1.0) and R → A (norm. utility of 0.56). The outgoing edges are processed in the same manner for each node.

The edge cutoff parameter determines the aggressiveness of the algorithm, i.e. the higher its value, the more likely the algorithm is to remove edges. In very unstructured processes, where precedence relations are likely to have a balanced significance, it is often useful to use a lower utility ratio, such that correlation will be taken more into account and resolve such ambiguous situations. On top of that, a high edge cutoff will act as an amplifier, helping to distinguish the most important edges.

Note that our edge filtering approach starts from an empty set of precedence relations, i.e. all edges are removed by default. Only if an edge is selected locally for at least one node will it be preserved. This approach keeps the process model connected, while clarifying ambiguous situations. Whether an edge is preserved depends on its utility for describing the behavior of the activities it connects, and not on global comparisons with other parts of the model with which it does not even interact.
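
The local edge filtering procedure can be sketched in Python as follows. The function names and the data layout (a dictionary from edges to significance and correlation pairs) are assumptions. Two further choices are ours: a group with identical utilities is normalized to 1.0, and the comparison uses `>=` rather than a strict "exceeds", so that the strongest edge of every node survives even at a cutoff of 1.0.

```python
def select_local(utilities, cutoff):
    """Select edges from one node's incoming (or outgoing) set.

    utilities maps edge -> utility. Values are normalized so the weakest
    edge gets 0 and the strongest 1; edges at or above the cutoff are kept.
    """
    lo, hi = min(utilities.values()), max(utilities.values())
    keep = set()
    for edge, u in utilities.items():
        norm = 1.0 if hi == lo else (u - lo) / (hi - lo)
        if norm >= cutoff:  # assumption: inclusive comparison
            keep.add(edge)
    return keep


def filter_edges(edges, ur, cutoff):
    """edges maps (source, target) -> (significance, correlation)."""
    util = {e: ur * s + (1.0 - ur) * c for e, (s, c) in edges.items()}
    preserved = set()  # start from an empty set of precedence relations
    for node in {n for e in edges for n in e}:
        incoming = {e: u for e, u in util.items() if e[1] == node}
        outgoing = {e: u for e, u in util.items() if e[0] == node}
        for group in (incoming, outgoing):
            if group:
                preserved |= select_local(group, cutoff)
    return preserved


# Incoming edges of a node A with utilities between 0.4 and 1.0 (values hypothetical,
# chosen so that R -> A has the normalized utility 0.56 mentioned in the text).
incoming_a = {("P", "A"): 1.0, ("Q", "A"): 0.4, ("R", "A"): 0.736, ("S", "A"): 0.5}
kept_for_a = select_local(incoming_a, cutoff=0.4)

# A small hypothetical model with two sources and two targets.
model = {("X", "A"): (0.9, 0.9), ("X", "B"): (0.1, 0.1),
         ("Y", "A"): (0.2, 0.2), ("Y", "B"): (0.8, 0.8)}
kept = filter_edges(model, ur=0.5, cutoff=0.4)
```

With a cutoff of 0.4, only P → A and R → A are selected for node A. In the second example, X → A and Y → B survive because each is the best edge for both of its endpoints, while the weak cross edges are selected by no node and are therefore dropped.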

Fig. 6. Example of a process model before (left) and after (right) edge filtering

Figure 6 shows the effect of edge filtering applied to a small, but very unstructured process. The number of nodes remains the same, while removing an appropriate subset of edges clearly brings structure to the previously chaotic process model.

5.3 Node Aggregation and Abstraction

While removing edges brings structure to the process model, the most effective tool for simplification is removing nodes. It enables the analyst to focus on an interesting subset of activities. Our approach preserves highly correlated groups of less-significant nodes as aggregated clusters, while removing isolated, less-significant nodes. Removing nodes is based on the node cutoff parameter. Every node whose unary significance is below this threshold becomes a victim, i.e. it will either be aggregated or abstracted from. The first phase of our algorithm builds initial clusters of less-significant behavior as follows.

– For each victim, find the most highly correlated neighbor (i.e., connected node).

– If this neighbor is a cluster node, add the victim to this cluster.

– Otherwise, create a new cluster node, and add the victim as its first element.
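
The steps above can be sketched in Python. The function name, the dictionary-based inputs, and the integer cluster identifiers are illustrative assumptions; the sketch only assigns victims to clusters and leaves out the inheritance of arcs.

```python
def build_initial_clusters(significance, correlation, node_cutoff):
    """Assign every victim to a cluster; returns victim -> cluster id.

    significance maps node -> unary significance.
    correlation maps (source, target) edges -> binary correlation.
    """
    victims = [n for n, s in significance.items() if s < node_cutoff]
    cluster_of = {}
    next_id = 0
    for victim in victims:
        # Neighbors are nodes connected to the victim in either direction.
        neighbors = {}
        for (a, b), cor in correlation.items():
            if a == victim and b != victim:
                neighbors[b] = max(neighbors.get(b, 0.0), cor)
            elif b == victim and a != victim:
                neighbors[a] = max(neighbors.get(a, 0.0), cor)
        best = max(neighbors, key=neighbors.get) if neighbors else None
        if best in cluster_of:
            # The most highly correlated neighbor is already clustered.
            cluster_of[victim] = cluster_of[best]
        else:
            # Otherwise the victim becomes the first element of a new cluster.
            cluster_of[victim] = next_id
            next_id += 1
    return cluster_of


# Hypothetical model: A is significant; b1, b2 and c fall below the node cutoff.
significance = {"A": 0.9, "b1": 0.1, "b2": 0.2, "c": 0.1}
correlation = {("A", "b1"): 0.3, ("b1", "b2"): 0.9, ("b2", "A"): 0.2, ("A", "c"): 0.5}
clusters = build_initial_clusters(significance, correlation, node_cutoff=0.5)
```

In this example, b1 and b2 are most highly correlated with each other and end up in the same cluster, while c, whose only neighbor is the regular node A, forms a cluster of its own.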

Whenever a node is added to a cluster, the cluster will “inherit” the ordering relations of that node, i.e. its incoming and outgoing arcs, while the actual node will be hidden. The second phase is merging the clusters, which is necessary as most clusters will, at this stage, only consist of one single victim. The following routine is performed to aggregate larger clusters and decrease their number.

– For each cluster, check whether all predecessors or all successors are also clusters.

– If all predecessor nodes are clusters as well, merge with the most highly correlated one and move on to the next cluster.

– If all successors are clusters as well, merge with the most highly correlated one.

– Otherwise, i.e. if both the cluster’s pre- and postset contain regular nodes, the cluster is left untouched.
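
The merge decision for a single cluster can be sketched as follows. This Python fragment covers only the choice of a merge partner, not the rewiring of the graph after a merge. The function name, the set-based graph representation, and the example edges (loosely modeled on the situation of clusters A, B, C and node X discussed for Figure 7) are assumptions.

```python
def merge_target(cluster, edges, clusters, correlation):
    """Return the cluster to merge with, or None if the cluster is left untouched.

    edges is a set of (source, target) pairs, clusters is the set of nodes
    that are cluster nodes, and correlation maps edges to binary correlation.
    """
    preds = {a for a, b in edges if b == cluster and a != cluster}
    succs = {b for a, b in edges if a == cluster and b != cluster}
    if preds and preds <= clusters:
        # All predecessors are clusters: pick the most highly correlated one.
        return max(preds, key=lambda p: correlation[(p, cluster)])
    if succs and succs <= clusters:
        # All successors are clusters: pick the most highly correlated one.
        return max(succs, key=lambda s: correlation[(cluster, s)])
    return None


# Hypothetical edges and correlation values.
edges = {("A", "B"), ("A", "X"), ("X", "B"), ("B", "C"), ("C", "Y")}
correlation = {("A", "B"): 0.56, ("A", "X"): 0.45, ("X", "B"): 0.9,
               ("B", "C"): 0.9, ("C", "Y"): 0.68}
cluster_nodes = {"A", "B", "C"}

target_a = merge_target("A", edges, cluster_nodes, correlation)
target_b = merge_target("B", edges, cluster_nodes, correlation)
```

Cluster A is left untouched because the regular node X is among its successors, whereas cluster B, whose only successor is cluster C, is merged with C.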

Fig. 7. Excerpt of a clustered model after the first aggregation phase (nodes X and Y; clusters A and B with two primitives each, cluster C with one primitive)

It is important that clusters will only be merged if the “victim” has only clusters in its pre- or postset. Figure 7 shows an example of a process model after the first phase of clustering. Cluster A cannot merge with cluster B, as they are also both connected to node X. Otherwise, X would be connected to the merged cluster in both directions, making the model less informative. However, clusters B and C can merge, as B’s postset consists only of C. This simplification of the model does not lessen the amount of information, and is thus valid.

The last phase, which constitutes the abstraction, removes isolated and singular clusters. Isolated clusters are detached parts of the process, which are less significant and highly correlated, and which have thus been folded into one single, isolated cluster node. It is obvious that such detached nodes do not contribute to the process model, which is why they are simply removed. Singular clusters consist only of one, single activity node. Thus, they represent less-significant behavior which is not highly correlated to adjacent behavior. Singular clusters are undesired, because they do not simplify the model. Therefore, they are removed from the model, while their most significant precedence relations are transitively preserved (i.e., their predecessors are artificially connected to their successors, if such an edge does not already exist in the model).

6 Implementation and Application

We have implemented our approach as the Fuzzy Miner plugin for the ProM framework [8]. All metrics introduced in Section 4 have been implemented, and can be configured by the user. Figure 8 shows the result view of the Fuzzy Miner, with the simplified graph view on the left, and a configuration pane for simplification parameters on the right. Alternative views allow the user to inspect and verify measurements for all metrics, which helps to tune these metrics to the log.

Fig. 8. Screenshot of the Fuzzy Miner, applied to the very large and unstructured log also used for mining the model in Figure 1

Note that this approach is the result of valuable lessons learnt from a great number of case studies with real-life logs. As such, both the applied metrics and the simplification algorithm have been optimized using large amounts of actual, less-structured data. While it is difficult to validate the approach formally, the Fuzzy Miner has already become one of the most useful tools in case study applications.

For example, Figure 8 shows the result of applying the Fuzzy Miner to a large test log of manufacturing machines (155,296 events in 16 cases, 826 event classes). The approach has been shown to scale well and is of linear complexity. Deriving all metrics from the mentioned log was performed in less than ten seconds, while simplifying the resulting model took less than two seconds on a 1.8 GHz dual-core machine. While this model has been created from the raw logs, Figure 1 has been mined with the Heuristics Miner, after applying log filters [14]. It is obvious that the Fuzzy Miner is able to clean up a large amount of confusing behavior, and to infer and extract structure from what is chaotic. We have successfully used the Fuzzy Miner on various machinery test and usage logs, development process logs, hospital patient treatment logs, logs from case handling systems and web servers, among others. These are notoriously flexible and unstructured environments, and we hold our approach to be one of the most useful tools for analyzing them so far.

7 Related Work

While parting with some characteristics of traditional process mining techniques, such as absolute precision and describing the complete behavior, it is obvious that the approach described in this paper is based on previous work in this domain, most specifically control-flow mining algorithms [2,14,7]. The mining algorithm most related to Fuzzy Mining is the Heuristics Miner, which also employs heuristics to limit the set of precedence relations included in the model [14]. Our approach also incorporates concepts from log filtering, i.e. removing less significant events from the logs [8]. However, the foundation on multi-perspective metrics, i.e. looking at all aspects of the process at once, its interactive and explorative nature, and the integrated simplification algorithm clearly distinguish Fuzzy Mining from all previous process mining techniques.

The adaptive simplification approach presented in Section 5 uses concepts from the domains of data clustering and graph clustering. Data clustering attempts to find related subsets of attributes, based on a binary distance metric inferred upon them [11]. Graph clustering algorithms, on the other hand, are based on analyzing structural properties of graphs, from which they derive partitioning strategies [13,9]. Our approach, however, is based on a unique combination of analyzing the significance and correlation of graph elements, which are based on a wide set of process perspectives. It integrates abstraction and aggregation, and is also more specialized towards the process domain.

8 Discussion and Future Work

We have described the problems traditional process mining techniques face when applied to large, less-structured processes, as often found in practice. Subsequently, we have analyzed the causes for these problems, which lie in a mismatch between fundamental assumptions of traditional process mining, and the characteristics of real-life processes. Based on this analysis, we have developed an adaptive simplification and visualization technique for process models, which is based on two novel metrics, significance and correlation. We have described a framework for deriving these metrics from an enactment log, which can be adjusted to particular situations and analysis questions.

While the fine-grained configurability of the algorithm and its metrics makes our approach universally applicable, it is also one of its weaknesses, as finding the “right” parameter settings can sometimes be time-consuming. Thus, our next steps will concentrate on deriving higher-level parameters and sensible default settings, while preserving the full range of parameters for advanced users. Further work will concentrate on extending the set of metric implementations and improving the simplification algorithm.

It is our belief that process mining, in order to become more meaningful, and to become applicable in a wider array of practical settings, needs to address the problems it has with unstructured processes. We have shown that the traditional desire to model the complete behavior of a process in a precise manner conflicts with the original goal, i.e. to provide the user with understandable, high-level information. The success of process mining will depend on whether it is able to balance these conflicting goals sensibly. Fuzzy Mining is a first step in that direction.

Acknowledgements. This research is supported by the Technology Foundation STW, applied science division of NWO and the technology programme of the Dutch Ministry of Economic Affairs.

References

1. van der Aalst, W.M.P., van Dongen, B.F., Herbst, J., Maruster, L., Schimm, G., Weijters, A.J.M.M.: Workflow Mining: A Survey of Issues and Approaches. Data and Knowledge Engineering 47(2), 237–267 (2003)

2. van der Aalst, W.M.P., Weijters, A.J.M.M., Maruster, L.: Workflow Mining: Discovering Process Models from Event Logs. IEEE Transactions on Knowledge and Data Engineering 16(9), 1128–1142 (2004)

3. Agrawal, R., Gunopulos, D., Leymann, F.: Mining Process Models from Workflow Logs. In: Sixth International Conference on Extending Database Technology, pp. 469–483 (1998)

4. Badouel, E., Bernardinello, L., Darondeau, P.: The Synthesis Problem for Elementary Net Systems is NP-complete. Theoretical Computer Science 186(1-2), 107–134 (1997)

5. Cook, J.E., Wolf, A.L.: Discovering Models of Software Processes from Event-Based Data. ACM Transactions on Software Engineering and Methodology 7(3), 215–249 (1998)

6. Datta, A.: Automating the Discovery of As-Is Business Process Models: Probabilistic and Algorithmic Approaches. Information Systems Research 9(3), 275–301 (1998)

7. van Dongen, B.F., van der Aalst, W.M.P.: Multi-Phase Process Mining: Building Instance Graphs. In: Atzeni, P., Chu, W., Lu, H., Zhou, S., Ling, T.-W. (eds.) ER 2004. LNCS, vol. 3288, pp. 362–376. Springer, Heidelberg (2004)

8. van Dongen, B.F., de Medeiros, A.K.A., Verbeek, H.M.W., Weijters, A.J.M.M., van der Aalst, W.M.P.: The ProM framework: A New Era in Process Mining Tool Support. In: Ciardo, G., Darondeau, P. (eds.) ICATPN 2005. LNCS, vol. 3536, pp. 444–454. Springer, Heidelberg (2005)

9. van Dongen, S.: Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht (2000)

10. Herbst, J.: A Machine Learning Approach to Workflow Management. In: Lopez de Mantaras, R., Plaza, E. (eds.) ECML 2000. LNCS (LNAI), vol. 1810, pp. 183–194. Springer, Heidelberg (2000)

11. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: A review. ACM Computing Surveys 31(3), 264–323 (1999)

12. de Medeiros, A.K.A., Weijters, A.J.M.M., van der Aalst, W.M.P.: Genetic Process Mining: A Basic Approach and its Challenges. In: Bussler, C., Haller, A. (eds.) BPM 2005. LNCS, vol. 3812, pp. 203–215. Springer, Heidelberg (2006)

13. Pothen, A., Simon, H.D., Liou, K.: Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl. 11(3), 430–452 (1990)

14. Weijters, A.J.M.M., van der Aalst, W.M.P.: Rediscovering Workflow Models from Event-Based Data using Little Thumb. Integrated Computer-Aided Engineering 10(2), 151–162 (2003)