process mining chapter_05_process_discovery

54
Chapter 5 Process Discovery: An Introduction prof.dr.ir. Wil van der Aalst www.processmining.org

Upload: muhammad-ajmal

Post on 25-Dec-2014

103 views

Category:

Documents


3 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Process mining chapter_05_process_discovery

Chapter 5Process Discovery: An Introduction

prof.dr.ir. Wil van der Aalstwww.processmining.org

Page 2: Process mining chapter_05_process_discovery

Overview

PAGE 1

Part I: Preliminaries

Chapter 2 Process Modeling and Analysis

Chapter 3Data Mining

Part II: From Event Logs to Process Models

Chapter 4 Getting the Data

Chapter 5 Process Discovery: An Introduction

Chapter 6 Advanced Process Discovery Techniques

Part III: Beyond Process Discovery

Chapter 7 Conformance Checking

Chapter 8 Mining Additional Perspectives

Chapter 9 Operational Support

Part IV: Putting Process Mining to Work

Chapter 10 Tool Support

Chapter 11 Analyzing “Lasagna Processes”

Chapter 12 Analyzing “Spaghetti Processes”

Part V: Reflection

Chapter 13Cartography and Navigation

Chapter 14Epilogue

Chapter 1 Introduction

Page 3: Process mining chapter_05_process_discovery

Process discovery

PAGE 2

software system

(process)model

eventlogs

modelsanalyzes

discovery

records events, e.g., messages,

transactions, etc.

specifies configures implements

analyzes

supports/controls

enhancement

conformance

“world”

people machines

organizationscomponents

businessprocesses

Page 4: Process mining chapter_05_process_discovery

Process discovery = Play-In

PAGE 3

event log process model

Play-In

event logprocess model

Play-Out

event log process model

Replay

• extended model showing times, frequencies, etc.

• diagnostics• predictions• recommendations

Page 5: Process mining chapter_05_process_discovery

Example

PAGE 4

a

b

c

de

p2

end

p4

p3p1

start

Event log contains all possible traces of model and vice versa.

Page 6: Process mining chapter_05_process_discovery

Another example

PAGE 5

a

b

c

df

p2

end

p4

p3p1

start

e

p5

Generalization: event log contains only subset of all possible traces of model.

Page 7: Process mining chapter_05_process_discovery

Notation is less relevant (e.g. BPMN)

PAGE 6

a

b

c

de

p2

end

p4

p3p1

start

a

start end

b

c

e

d

Page 8: Process mining chapter_05_process_discovery

Another BPMN example

PAGE 7

a

b

c

df

p2

end

p4

p3p1

start

e

p5

a

start end

b

c

e

d

f

Page 9: Process mining chapter_05_process_discovery

Challenge

• In general, there is a trade-off between the following four quality criteria:

1.Fitness: the discovered model should allow for the behavior seen in the event log.

2.Precision (avoid underfitting): the discovered model should not allow for behavior completely unrelated to what was seen in the event log.

3.Generalization (avoid overfitting): the discovered model should generalize the example behavior seen in the event log.

4.Simplicity: the discovered model should be as simple as possible.

PAGE 8

Page 10: Process mining chapter_05_process_discovery

Process Discovery: example of algorithm

PAGE 9

α

Page 11: Process mining chapter_05_process_discovery

PAGE 10

>,→,||,# relations

• Direct succession: x>y iff for some case x is directly followed by y.

• Causality: x→y iff x>y and not y>x.

• Parallel: x||y iff x>y and y>x

• Choice: x#y iff not x>y and not y>x.

a>ba>ca>eb>cb>dc>bc>de>d

a→ba→ca→eb→dc→de→d

b||cc||b

abcdacbdaed

b#ee#bc#ea#d…

Page 12: Process mining chapter_05_process_discovery

PAGE 11

Basic Idea Used by α Algorithm (1)

a b

(a) sequence pattern: a→b

Page 13: Process mining chapter_05_process_discovery

PAGE 12

Basic Idea Used by α Algorithm (2)

a

b

c

(b) XOR-split pattern:a→b, a→c, and b#c

a

b

c

(c) XOR-join pattern:a→c, b→c, and a#b

a

b

c

(b) XOR-split pattern:a→b, a→c, and b#c

Page 14: Process mining chapter_05_process_discovery

PAGE 13

Basic Idea Used by α Algorithm (3)

a

b

c

(d) AND-split pattern:a→b, a→c, and b||c

a

b

c

(e) AND-join pattern:a→c, b→c, and a||b

a

b

c

(d) AND-split pattern:a→b, a→c, and b||c

Page 15: Process mining chapter_05_process_discovery

Example Revisited

PAGE 14

a

b

c

de

p2

end

p4

p3p1

start

Result produced by α algorithm

a>ba>ca>eb>cb>dc>bc>de>d

a→ba→ca→eb→dc→de→d

b||cc||b

b#ee#bc#ea#d…

Page 16: Process mining chapter_05_process_discovery

Footprint of L1

PAGE 15

a

b

c

de

p2

end

p4

p3p1

start

Page 17: Process mining chapter_05_process_discovery

Footprint of L2

PAGE 16

a

b

c

df

p2

end

p4

p3p1

start

e

p5

Page 18: Process mining chapter_05_process_discovery

Simple patterns

PAGE 17

a b

(a) sequence pattern: a→b

a

b

c

(b) XOR-split pattern:a→b, a→c, and b#c

a

b

c

(c) XOR-join pattern:a→c, b→c, and a#b

a

b

c

(d) AND-split pattern:a→b, a→c, and b||c

a

b

c

(e) AND-join pattern:a→c, b→c, and a||b

Page 19: Process mining chapter_05_process_discovery

Algorithm

PAGE 18

Let L be an event log over T. α(L) is defined as follows. 1. TL = { t ∈ T | ∃σ ∈ L t ∈ σ}, 2. TI = { t ∈ T | ∃σ ∈ L t = first(σ) }, 3. TO = { t ∈ T | ∃σ ∈ L t = last(σ) }, 4. XL = { (A,B) | A ⊆ TL ∧ A ≠ ø ∧ B ⊆ TL ∧ B ≠ ø ∧

∀a ∈ A∀b ∈ B a →L b ∧ ∀a1,a2 ∈ A a1#L a2 ∧ ∀b1,b2 ∈ B b1#L b2 }, 5. YL = { (A,B) ∈ XL | ∀(A′,B′) ∈ XL

A ⊆ A′ ∧B ⊆ B′⇒ (A,B) = (A′,B′) }, 6. PL = { p(A,B) | (A,B) ∈ YL } ∪{iL,oL}, 7. FL = { (a,p(A,B)) | (A,B) ∈ YL ∧ a ∈ A } ∪ { (p(A,B),b) | (A,B) ∈

YL ∧ b ∈ B } ∪{ (iL,t) | t ∈ TI} ∪{ (t,oL) | t ∈ TO}, and 8. α(L) = (PL,TL,FL).

Page 20: Process mining chapter_05_process_discovery

Key idea: find places

PAGE 19

a1

...

a2

am

b1

b2

bn

p(A,B) ...

A={a1,a2, … am} B={b1,b2, … bn}

4. XL = { (A,B) | A ⊆ TL ∧ A ≠ ø ∧ B ⊆ TL ∧ B ≠ ø ∧∀a ∈ A∀b ∈ B a →L b ∧ ∀a1,a2 ∈ A a1#L a2 ∧ ∀b1,b2 ∈ B b1#L b2 },

5. YL = { (A,B) ∈ XL | ∀(A′,B′) ∈ XLA ⊆ A′ ∧B ⊆ B′⇒ (A,B) = (A′,B′) },

Page 21: Process mining chapter_05_process_discovery

Places as footprints

PAGE 20

a1

...

a2

am

b1

b2

bn

p(A,B) ...

A={a1,a2, … am} B={b1,b2, … bn}

Page 22: Process mining chapter_05_process_discovery

PAGE 21

a

b

c

de

p2

end

p4

p3p1

start

Page 23: Process mining chapter_05_process_discovery

Another event log L3

PAGE 22

Page 24: Process mining chapter_05_process_discovery

Model for L3

PAGE 23

a b

c

d

e

p({a,f},{b})iL

p({b},{c})

p({b},{d})

p({c},{e})

p({d},{e})

g

oLp({e},{f,g})

f

Page 25: Process mining chapter_05_process_discovery

Another event log L4

PAGE 24

b

c

p({a,b},{c}) oL

a

iL e

d

p({c},{d,e})

Page 26: Process mining chapter_05_process_discovery

Event log L5

PAGE 25

Page 27: Process mining chapter_05_process_discovery

PAGE 26

Page 28: Process mining chapter_05_process_discovery

Discovered model

PAGE 27

b

p({a},{e})

oLaiL

c

e

f

d

p({e},{f})

p({b},{c,f})p({a,d},{b})

p({c},{d})

Page 29: Process mining chapter_05_process_discovery

Limitation of α algorithm(implicit places)

PAGE 28

g

a

c

d

e

fb

p1

p2

Green places are implicit!

Page 30: Process mining chapter_05_process_discovery

Limitation of α algorithm(loops of length 1)

PAGE 29

a c

b

a c

b

Page 31: Process mining chapter_05_process_discovery

Limitation of α algorithm(loops of length 2)

PAGE 30

a b d

c

a

b

d

c

Page 32: Process mining chapter_05_process_discovery

Limitation of α algorithm(non-local dependencies)

PAGE 31

b

c

a

e

dp1

p2

Green places are not discovered!

Page 33: Process mining chapter_05_process_discovery

Difficult constructs for α algorithm

PAGE 32

a

b

c

Page 34: Process mining chapter_05_process_discovery

Taking the transactional life-cycle into account

PAGE 33

start

assigned

assign

running

complete

a

start

running

complete

b

suspended

suspend

resume

start

assigned

assign

running

complete

c

Page 35: Process mining chapter_05_process_discovery

Rediscovering process models

PAGE 34

The rediscovery problem: Is the discovered model N’ equivalent to the original model N?

discoveredprocessmodel

original process model event

log

simulate discover

N=N’ ?N N’

Page 36: Process mining chapter_05_process_discovery

Equivalence: trace equivalence, bisimilarity, and branching bisimilarity

PAGE 35

s1

birth

s2

s3 s4

hell

curse

curse

curse

birth

curse

hell

s5

s6

s7

birth

curse

s8

s9

s11

curse

curses10

TS1 TS2 TS3

heaven

heaven heaven hell

heaven

hell

?

heaven

Three trace equivalent transition systems: TS1 and TS2are not bisimilar, but TS2 and TS3 are bisimilar

Page 37: Process mining chapter_05_process_discovery

Branching bisimilarity defined for YAWL

PAGE 36

s1

TS1 TS2

start

check

end

reject accept

c1 c2

s2

acceptreject

s3 s4

s5

s6

check

s7

acceptreject

s8

τ τ

start

end

reject accept

c3

check check

?

TS1 and TS2 are not branching bisimilar (although trace equivalent).

Page 38: Process mining chapter_05_process_discovery

Challenge: finding the right representational bias

PAGE 37

a a

start endp

There is no WF-net with unique visible labels that exhibits this behavior.

Page 39: Process mining chapter_05_process_discovery

Another example

PAGE 38

a b

start endp1

c

τ

p1

(a)

a b

start endp1

c

p1

(b)

a

a b

start endp1

c

p1

(c)

There is no WF-net with unique visible labels that exhibits this behavior.

Page 40: Process mining chapter_05_process_discovery

Challenge: noise and incompleteness

• To discover a suitable process model it is assumed that the event log contains a representative sample of behavior.

• Two related phenomena:−Noise: the event log contains rare and

infrequent behavior not representative for the typical behavior of the process.

− Incompleteness: the event log contains too few events to be able to discover some of the underlying control-flow structures.

PAGE 39

Page 41: Process mining chapter_05_process_discovery

More on incompleteness

PAGE 40

See also chapter 3 (cross-validation, precision, recall, etc.)

Page 42: Process mining chapter_05_process_discovery

PAGE 41

Challenge: Balancing Between Underfitting and Overfitting

Page 43: Process mining chapter_05_process_discovery

Challenge: four competing quality criteria

PAGE 42

process discovery

fitness

precisiongeneralization

simplicity

“able to replay event log” “Occam’s razor”

“not overfitting the log” “not underfitting the log”

Page 44: Process mining chapter_05_process_discovery

Flower model

PAGE 43

g

ac

d

ef

b

start end

h

Page 45: Process mining chapter_05_process_discovery

PAGE 44

What is the best model?

A D

C

EB

A D

C

EB

ACDACEBCEBCD

990850

Page 46: Process mining chapter_05_process_discovery

PAGE 45

What is the best model?

A D

C

EB

A D

C

EB

ACDACEBCEBCD

99888578

Page 47: Process mining chapter_05_process_discovery

PAGE 46

What is the best model?

A D

C

EB

A D

C

EB

ACDACEBCEBCD

992853

Page 48: Process mining chapter_05_process_discovery

Example: one log four models

PAGE 47

astart register

request

bexamine thoroughly

cexamine casually

d checkticket

decide

pay compensation

reject request

reinitiate requeste

g

hfend

astart register

request

cexamine casually

dcheckticket

decide reject request

e hend

N3 : fitness = +, precision = -, generalization = +, simplicity = +

N2 : fitness = -, precision = +, generalization = -, simplicity = +

astart register

request

bexamine

thoroughly

cexamine casually

dcheck ticket

decide

pay compensation

reject request

reinitiate request

e

g

h

f

end

N1 : fitness = +, precision = +, generalization = +, simplicity = +

astart register

request

cexamine casually

dcheckticket

decide reject request

e hend

N4 : fitness = +, precision = +, generalization = -, simplicity = -

aregister request

dexamine casually

ccheckticket

decide reject request

e h

a cexamine casually

dcheckticket

decide

e g

a dexamine casually

ccheckticket

decide

e g

register request

register request

pay compensation

pay compensation

aregister request

b dcheckticket

decide reject request

e h

aregister request

d bcheckticket

decide reject request

e h

a b dcheckticket

decide

e gregister request

pay compensation

examine thoroughly

examine thoroughly

examine thoroughly

… (all 21 variants seen in the log)

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

process discovery

fitness

precisiongeneralization

simplicity

“able to replay event log” “Occam’s razor”

“not overfitting the log” “not underfitting the log”

Page 49: Process mining chapter_05_process_discovery

Model N1

PAGE 48

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

astart register

request

bexamine

thoroughly

cexamine casually

dcheck ticket

decide

pay compensation

reject request

reinitiate request

e

g

h

f

end

N1 : fitness = +, precision = +, generalization = +, simplicity = +

Page 50: Process mining chapter_05_process_discovery

Model N2

PAGE 49

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

astart register

request

cexamine casually

dcheckticket

decide reject request

e hend

N2 : fitness = -, precision = +, generalization = -, simplicity = +

Page 51: Process mining chapter_05_process_discovery

Model N3

PAGE 50

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

astart register

request

bexamine thoroughly

cexamine casually

d checkticket

decide

pay compensation

reject request

reinitiate requeste

g

hfend

N3 : fitness = +, precision = -, generalization = +, simplicity = +

Page 52: Process mining chapter_05_process_discovery

Model N4

PAGE 51

acdeh

abdeg

adceh

abdeh

acdeg

adceg

adbeh

acdefdbeh

adbeg

acdefbdeh

acdefbdeg

acdefdbeg

adcefcdeh

adcefdbeh

adcefbdeg

acdefbdefdbeg

adcefdbeg

adcefbdefbdeg

adcefdbefbdeh

adbefbdefdbeg

adcefdbefcdefdbeg

455

191

177

144

111

82

56

47

38

33

14

11

9

8

5

3

2

2

1

1

1

# trace

1391

astart register

request

cexamine casually

dcheckticket

decide reject request

e hend

N4 : fitness = +, precision = +, generalization = -, simplicity = -

aregister request

dexamine casually

ccheckticket

decide reject request

e h

a cexamine casually

dcheckticket

decide

e g

a dexamine casually

ccheckticket

decide

e g

register request

register request

pay compensation

pay compensation

aregister request

b dcheckticket

decide reject request

e h

aregister request

d bcheckticket

decide reject request

e h

a b dcheckticket

decide

e gregister request

pay compensation

examine thoroughly

examine thoroughly

examine thoroughly

… (all 21 variants seen in the log)

Page 53: Process mining chapter_05_process_discovery

Why is process mining such a difficult problem?

• There are no negative examples (i.e., a log shows what has happened but does not show what could not happen).

• Due to concurrency, loops, and choices the search space has a complex structure and the log typically contains only a fraction of all possible behaviors.

• There is no clear relation between the size of a model and its behavior (i.e., a smaller model may generate more or less behavior although classical analysis and evaluation methods typically assume some monotonicity property).

PAGE 52

Page 54: Process mining chapter_05_process_discovery

Creating a 2-D slice of a 3-D reality

PAGE 53

Creating a 2-D slice of a 3-D reality: the process is viewed from a specific angle, the process is scoped using a frame, and the resolution determines the granularity of the resulting model