cluster analysis in data mining - university of...

24
1 Page 1 Cluster Analysis in Data Mining Part II Andrew Kusiak Intelligent Systems Laboratory 2139 Seamans Center The University of Iowa The University of Iowa Intelligent Systems Laboratory The University of Iowa Iowa City, IA 52242 - 1527 [email protected] http://www.icaen.uiowa.edu/~ankusiak Tel. 319-335 5934 Cluster Analysis Decomposition Aggregation The University of Iowa Intelligent Systems Laboratory Grouping

Upload: others

Post on 19-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

1

Page 1

Cluster Analysis in Data Mining

Part II

Andrew KusiakIntelligent Systems Laboratory

2139 Seamans CenterThe University of Iowa

The University of Iowa Intelligent Systems Laboratory

The University of IowaIowa City, IA 52242 - [email protected]

http://www.icaen.uiowa.edu/~ankusiakTel. 319-335 5934

Cluster Analysis

Decomposition

Aggregation

The University of Iowa Intelligent Systems Laboratory

Grouping

Page 2: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

2

Page 2

Models

• Matrix formulation

• Mathematical programming formulation

G h f l ti

The University of Iowa Intelligent Systems Laboratory

• Graph formulation

Cluster Representation 1

A

BC

D

EF

G

H

The University of Iowa Intelligent Systems Laboratory

Mutually exclusive regions

Page 3: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

3

Page 3

Cluster Representation 2

A

B C D

EF

G

H

The University of Iowa Intelligent Systems Laboratory

Overlapping regions

Cluster Representation 3

F DE G HA B C

The University of Iowa Intelligent Systems Laboratory

Hierarchy

Page 4: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

4

Page 4

Cluster Representation 41 2 3

Clusters

A .6 .3 .1B .5 .2 .3C .6 .3 .1D .3 .6 .1E .2 .1 .7F .1 .2 .7

Objects

The University of Iowa Intelligent Systems Laboratory

Fuzzy clusters

G .2 .6 .2H .3 .6 .1

Two types of data

• Object with decisions

• Object without decisions

The University of Iowa Intelligent Systems Laboratory

Page 5: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

5

Page 5

Clustering Data with Decisions

Method 1

• Group objects according to the decisions

• Ignore decisions• Treat the objects as data without decisions

Method 2

The University of Iowa Intelligent Systems Laboratory

• Group objects according to the decisions• Treat each group of objects with decisions as

a separate data set

Si il it ffi i t th d

Solving the Clustering Problem:Binary Matrix Formulation

• Similarity coefficient methods• Sorting based algorithms• Bond energy algorithm• Cost-based method• Cluster identification algorithm

The University of Iowa Intelligent Systems Laboratory

• Cluster identification algorithm• Extended cluster identification algorithm

Page 6: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

6

Page 6

Clustering Algorithms

Desired Properties p

• Low computational time complexity• Structure enhancement• Enhancement of human interaction and

simplification of interface with software

The University of Iowa Intelligent Systems Laboratory

simplification of interface with software• Generalizability

Similarity Coefficient Methodn

δ1(aik, ajk )∑k =1

sij = _________________n

δ2(aik, ajk )k =1∑

The University of Iowa Intelligent Systems Laboratory

1 if aik = ajk

where δ1 (aik , ajk ) = {0 otherwise

Generalizationto qualitativefeatures

Page 7: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

7

Page 7

0 if aik = ajkδ2 (aik, ajk ) = {

1 th i1 otherwise

n = the number of features

The University of Iowa Intelligent Systems Laboratory

Consider1 2 3 4 5

1 GO-1 {

2

3

Yes 1 1

Yes 1

1 1 1

Consider

The University of Iowa Intelligent Systems Laboratory

3 GO-2 {

4

1 1 1

1 1

Page 8: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

8

Page 8

2s12 = s34 = ------- = 0.66

3

1

1 if aik = ajkδ1 (aik, ajk ) = {

0 otherwise

1s13 = ----- = 0.25

4

0s14 = s24 = s23 = ----- = 0

1 2 3 4 5

1 Yes 1 1

0 if aik = ajkδ2 (aik, ajk ) = {

1 otherwise

The University of Iowa Intelligent Systems Laboratory

5GO-1 {

2

3 GO-2 {

4

Yes 1

1 1 1

1 1

Object number

Tree Representing Similarity Coefficients

mila

rity

(%) 100

6650

1 2 3 4

1 GO-1

2

1 3 2 4 5

GO -1 GF-2

Yes 1

Yes 1

The University of Iowa Intelligent Systems Laboratory

Sim

200

3 GO-2

4

1 1 1

1 1

Page 9: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

9

Page 9

C fCluster Identification Algorithm

The University of Iowa Intelligent Systems Laboratory

Decomposition Problem

Decompose an object-feature incidence matrix into mutually separable submatrices (groups of objects and groups of features) with the minimum number of overlapping objects (or features) subject to the following constraints:

Constraint C1: Empty groups of objects or features are not allowed.

Constraint C2: The number of objects in a group does not exceed

The University of Iowa Intelligent Systems Laboratory

Constraint C2: The number of objects in a group does not exceed an upper limit, b (or alternatively, the number of features in a group does not exceed, d).

Page 10: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

10

Page 10

Cluster Identification Algorithm

Step 0. Set iteration number k = 1.Step 1. Select row i of incidence matrix [aij](k) and draw a horizontal

line hi through it ([aij](k) is read: matrix [aij] at iteration k )line hi through it ([aij](k) is read: matrix [aij] at iteration k ).Step 2. For each entry of * crossed by the horizontal line hi draw a

vertical line vj .Step 3. For each entry of * crossed-once by the vertical line vj draw a

horizontal line hk .Step 4. Repeat steps 2 and 3 until there are no more crossed-once

entries of * in [aij](k). All crossed-twice entries * in [aij](k) form row cluster RC-k and column cluster CC-k .

The University of Iowa Intelligent Systems Laboratory

Step 5. Transform the incidence matrix [aij](k) into [aij](k+1) by removing rows and columns corresponding to the horizontal and vertical lines drawn in steps 1 through 4.

Step 6. If matrix [aij](k+1) = 0 (where 0 denotes a matrix with all empty elements ), stop; otherwise set k = k + 1 and go to step 1.

Example 11 2 3 4 5 6 7 8

Incidence matrix

Feature

12

34

* * *

* *

* ** *

Incidence matrix

Object

The University of Iowa Intelligent Systems Laboratory

567

* **

* * * *

Page 11: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

11

Page 11

1 2 3 4 5 6 7 8 12

* * ** *2

345

* ** *

* ** *

The University of Iowa Intelligent Systems Laboratory

67

** * * *

Next Step

Delete allDelete all double-crossed elements

The University of Iowa Intelligent Systems Laboratory

Page 12: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

12

Page 12

1 2 3 4 5 6 7 8

1 * * * 1 4 6 712

345

* *

* ** *

* *

1 4 6 72346

* ** *

* *

The University of Iowa Intelligent Systems Laboratory

67

** * * *

6 *

Resultant Matrix

1 4 6 71 4 6 7234

* ** *

* *

The University of Iowa Intelligent Systems Laboratory

6 *

Page 13: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

13

Page 13

Iteration 21 4 6 7

23

* ** *

h2346

* ** *

*h4

v1 v6

4 7

The University of Iowa Intelligent Systems Laboratory

Iteration 34 7

36

h3

h6

v4 v7

* **

Final decomposition result

CC-1 CC-2 CC-3

RC-1

RC-2

1572

2 3 5 8 1 6 4 7

* * ** * *

* * * ** *

The University of Iowa Intelligent Systems Laboratory

RC-3436

* ** **

Page 14: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

14

Page 14

Three types of matrices

* * *

(a)

* * * * * * * *

(b) (c)a b c d e f g h a b c d e f g h a b c d e f 1 1 1* * *

* * * * * * * * *

* * * * * *

* * * * * * * * * *

* * * * * * * * * * * * * * *

* * * * * *

* * * * * *

* * * * * * * * * * * *

1 2 3 4 5 6 7

1 2 3 4 5

1 2 3 4 5 6

The University of Iowa Intelligent Systems Laboratory

(a) decomposable matrix(b) non-decomposable matrix with overlapping features(c) non-decomposable matrix with objects

Step 0. Set iteration number k = 1.Step 1. Select those objects (rows of matrix [aij] ) that based

on the user's expertise, are potential candidates for inclusion in machine cell GO-k Draw a horizontal line hi through

(k)

Extended CI Algorithm

in machine cell GO-k. Draw a horizontal line hi through each row of matrix [aij] corresponding to these objects. In the absence of the user's expertise any machine can be selected.

Step 2. For each column in [aij] corresponding to entry 1, single-crossed by any of the horizontal lines hi , draw a vertical line vj .

(k)

(k)

(k)

The University of Iowa Intelligent Systems Laboratory

Step 3. For each row in [aij] corresponding to the entry 1, single-crossed by the vertical line vj, drawn in Step 2, draw a horizontal line hi. Based on the objects corresponding to all the horizontal lines drawn in Step 1and Step 3, a temporary machine cell GO-k , is formed.

(k)

Page 15: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

15

Page 15

If the user's expertise indicates that some of the objects cannot be included in the temporary group GO-k, erase the corresponding horizontal lines in the matrix [aij]. Removal of these horizontal lines results in group GO'-k .

(k)

GO k .Delete from matrix [aij] features (columns) that arecontained in at least one of the objects already included in GO-k. Place these features on a separate list. Draw a vertical line vj through each single-crossed entry 1 in [aij] which does not involve any other objects than

(k )

(k)

The University of Iowa Intelligent Systems Laboratory

those included in GO-k .

(k)Step 4.For all the double-crossed entries 1 in [aij] , form a

cluster GO-k and a group GF-k

Step 5. Transform the incidence matrix [aij] into [aij] by removing all the rows and

(k)(k+1)

columns included in GO-k and GF-k, respectively.

Step 6. If matrix [aij] = 0 (where 0 denotes a matrix with all elements equal to zero), stop; otherwise set k = k + 1 and

(k +1)

The University of Iowa Intelligent Systems Laboratory

), p;go to step 1.

Page 16: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

16

Page 16

Extended CI AlgorithmConstraints:Max |GO| = 4

Feature number

Object

1 2 3 4 5 6 7 8 9 10 11

1 2 3

1 1 1 1 1 1

1

Objects 1 and 4 in the cluster

The University of Iowa Intelligent Systems Laboratory

j3 4 5 6 7

1 1 1 1

1 1 1 1 1 1 1

1 1 1 1 1

number

Step 0. Set iteration number k = 1.Step 1. Since user's expertise indicates that

objects 1 and 4 should be included clusterGO-1, two horizontal lines h1 and h 4 are drawn.

1 2 3 4 5 6 7 8 9 10 11

1 2 3

1 1 1 1 1 1

1

Feature number

h1

The University of Iowa Intelligent Systems Laboratory

4 5 6 7

1 1 1 1 1

1 1 1 1 1 1 1 1 1 1

h4

Page 17: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

17

Page 17

Step 2. For columns 1, 2, 3, 6, and 7 crossed by the horizontal lines h1 and h4, five vertical lines, v1, v2, v3, v6, and v7 are drawn.

Feature numberFeature number1 2 3 4 5 6 7 8 9 10 11

1 2 3 4 5

1 1 1 1 1 1

1 1 1 1

1 1

h1

h4

The University of Iowa Intelligent Systems Laboratory

6 7

1 1 1 1 1 1 1 1 1 1

v1 v2 v3 v6 v7

Step 3. Three horizontal lines, h2, h6, and h7 are drawn through rows 2, 5, and 7 corresponding of the single-crossed elements 1.

Feature number1 2 3 4 5 6 7 8 9 10 11

h1h2

h4

h6

1 1 1 1 1 1

1 1 1 1

1 1 1 1 1 1 1

1 2 3 4 5 6

The University of Iowa Intelligent Systems Laboratory

v1 v2 v3 v6 v7

h6h7

1 1 1 1 1 1 1 1 1 1

6 7

Page 18: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

18

Page 18

• A temporary machine cell GO'-1 with objects {1, 2, 4, 6, 7} is formed.

• objects 2 and 6 are excluded from in GO'-1as they include less double-crossed (committed) elements than object 7 The horizontal lines h2elements than object 7. The horizontal lines h2 and h6 are erased.

2 3 5 6 7 8 10 11

h1

h4

1 1 1 1 1

1 1 1

1 2 3 4

The University of Iowa Intelligent Systems Laboratoryv2 v3 v6 v7

h4

h7

1 1 1 1

1 1 1 1 1

4 5 6 7

Step 4. The double-crossed entries 1 of matrix indicate:• Object cluster GO-1 = {1, 4, 7} and• Feature cluster GF-1 = {2, 3, 6, 7}{ }

Step 5. Matrix is transformed into the matrix below.

5 8 10 11

1 1 2

The University of Iowa Intelligent Systems Laboratory

1 1 1

1 1

3 5 6

Page 19: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

19

Page 19

Set k = k + 1 = 2 and go to Step 1.The second iteration (k = 2) results in:

• Object cluster GO-2 = {2, 3, 5, 6} and

Step 6.

• Feature cluster GF-2 = {5, 8, 10, 11}

5 8 10 11

1 1 1

1 1

2 3 5

h2h3h

The University of Iowa Intelligent Systems Laboratory

1 1 1 1

5 6

v5 v8 v10 v11

h5h6

The final result

GF - 1 GF - 22 3 6 7 5 8 10 11 1 4 9

1 4 7 2 3 5

1 1 1 1 1 1 1 1 1 1 1

1 1 1 1

1 1

GO - 1

GO - 2

The University of Iowa Intelligent Systems Laboratory

5 6

1 1 1 1 1 1 1

Page 20: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

20

Page 20

Notationf jm = number of objects

n = number of featuresp = number of clusters

1 if feature i belongs toxij = { cluster j

The University of Iowa Intelligent Systems Laboratory

j { j0 otherwise

dij = distance between objects i and j

n nmin dij xij

i =1 j =1 subject to:

∑ ∑

jn

xij = 1 for all i = 1, ..., nj =1

nxjj = p

The University of Iowa Intelligent Systems Laboratory

j =1 xij <= xjj for all i = 1, ..., n , j = 1, ..., n

xij = 0,1 for all i = 1, ..., n , j = 1, ..., n

Page 21: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

21

Page 21

Hamming Distance

1 0m

dij = d(aki , akj )∑Blue

0101

Blue1011

O =1 O =5

Hamming distance

Two objects

j ( , j )k =1

The University of Iowa Intelligent Systems Laboratory

Hamming distance

d15 = 1 + 0 + 1 + 1+ 1 + 0 = 4

xij

Decision variable

xij

Obj t ( f t ) b

The University of Iowa Intelligent Systems Laboratory

Object (or feature)number

Object (or feature) numberCluster number

Page 22: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

22

Page 22

1 2 3 4 5

1 1 1

1 1

1 1

1

2

3

Feature number

Object number

Givenp = 2

1 2 3 4 5Feature number

1 1

1 1

3

4

p 2m

dij = d(aki, akj )k =1∑

The University of Iowa Intelligent Systems Laboratory

0 4 0 4 4 4 0 4 0 1 0 4 0 4 3 4 0 4 0 1 4 1 3 1 0

1 2 3 4 5

Featurenumber[dij] =

MIN 4X12+4X14+3X15+4X21+4X23+1X25+4X32+4X34+3X35+4X41+4X43+1X45+3X51+1X52+3X53+1X54

nSUBJECT TO xij = 1 for all i = 1, ..., n

j =1

X11+X12+X13+X14+X15=1X21+X22+X23+X24+X25=1X31+X32+X33+X34+X35=1 1 2 3 4 5

Feature number

The University of Iowa Intelligent Systems Laboratory

X31+X32+X33+X34+X35 1X41+X42+X43+X44+X45=1X51+X52+X53+X54+X55=1X11+X22+X33+X44+X55=2

3

0 4 0 4 4 4 0 4 0 1 0 4 0 4 3 4 0 4 0 1 4 1 3 1 0

1 2 3 4 5

Feature number[dij] =

Page 23: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

23

Page 23

X21-X11<=0X31-X11<=0X41-X11<=0

X14-X44<=0X24-X44<=0X34-X44<=0

Constraint xij <= xjj

X51-X11<=0X12-X22<=0X32-X22<=0X42-X22<=0X52-X22<=0X13 X33< 0

X54-X44<=0X15-X55<=0X25-X55<=0X35-X55<=0X45-X55<=0END

The University of Iowa Intelligent Systems Laboratory

X13-X33<=0X23-X33<=0X43-X33<=0X53-X33<=0

END INTEGER 25

Solution x11 = 1, x31 = 1x24 = 1, x44 = 1, x54 = 1

Based on the definition of xij , two groups of features are formed:formed:

GF-1 = {1, 3} GF-2 = {2, 4, 5}GO-1 = {2, 4} GO-2 = {1, 3}

1 3 2 4 5

GF -1 GF-2

1 2 3 4 5

Feature number

The University of Iowa Intelligent Systems Laboratory

2 GO-1

4

1 GO-2

3

1 1

1 1

1 1 1

1 1

1 1 1

1 1

1 1

1 1

1

2

3

4

Object number

Page 24: Cluster Analysis in Data Mining - University of Iowauser.engineering.uiowa.edu/~comp/Public/Clustering2.pdf · Cluster Representation 4 12 3 Clusters A .6.3.1 B .5 .2 .3 C .6 .3 .1

24

Page 24

1 2 3 4 5

1 GO-1 {

2

1 1 1

1 12

3 GO-2 {

4

1 1

1 1 1

1 1

The University of Iowa Intelligent Systems Laboratory

Solution: x13=1, x23 = 1x35 =1, x45 =1, x55=1