quality of clustering

Upload: naseer-mohammad

Post on 01-Jun-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/9/2019 quality of clustering

    1/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Measurement of quality in cluster analysis

    Christian Hennig

    July 24, 2013

    Christian Hennig Measurement of quality in cluster analysis

    http://-/?-http://-/?-http://find/http://goback/http://-/?-
  • 8/9/2019 quality of clustering

    2/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    1. Introduction

    IFCS task force forcluster benchmarking

    (Nema Dean, Iven van Mechelen, Fritz Leisch, Doug Steinley,

    Bernd Bischl, Isabelle Guyon, Christian Hennig)

    Data repository for systematic comparison of quality

    of different cluster analysis algorithms

    In this presentation: compare quality of clusteringsbased on clustering and data alone,

    without reference to known truth.

    Christian Hennig Measurement of quality in cluster analysis

    http://find/http://goback/http://-/?-
  • 8/9/2019 quality of clustering

    3/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Which clustering is better?

    (Old faithful geyser data)

    2 1 0 1 2

    2

    1

    0

    1

    mclust

    waiting

    duration

    2 1 0 1 2

    2

    1

    0

    1

    pam

    waiting

    duration

    Christian Hennig Measurement of quality in cluster analysis

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    4/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Which clustering is better?

    10 0 10 20 30

    0

    10

    20

    30

    40

    xy[,1]

    xy[,2

    ]

    10 0 10 20 30

    0

    10

    20

    30

    40

    xy[,1]

    xy[,2

    ]

    Christian Hennig Measurement of quality in cluster analysis

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    5/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Which tower is better?

    Christian Hennig Measurement of quality in cluster analysis

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    6/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Why datasets without known truth?

    Benchmarking approaches:

    Real datasets with known classes

    Simulated datasets from mixture distributionsDatasets with intuitive classesby fiat

    Real datasets without known classes

    Christian Hennig Measurement of quality in cluster analysis

    I t d ti

    http://find/http://goback/http://-/?-
  • 8/9/2019 quality of clustering

    7/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Why datasets without known truth?

    Benchmarking approaches:

    Real datasets with known classes

    Simulated datasets from mixture distributionsDatasets with intuitive classesby fiat

    Real datasets without known classes

    Misclassification rates or Rand index

    are (more or less) straightforward.

    So why use datasets without known truth?

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://find/http://goback/http://-/?-
  • 8/9/2019 quality of clustering

    8/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Whats wrong with knowing the truth?

    Disclaimer: knowing the truth is not evil.

    There is definitely a role for datasetswith known truth in cluster benchmarking.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    9/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Whats wrong with knowing the truth?

    Disclaimer: knowing the truth is not evil.

    There is definitely a role for datasetswith known truth in cluster benchmarking.

    Measuring cluster quality

    ignoring the truth can be of use

    even if truth is known.(May explain which truths a method can discover.)

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    10/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Whats wrong with knowing the truth?

    But...

    In datasets with known classes

    clustering is not of real scientific interest.

    (Or one may want to finddifferentclusterings.)Deviate systematically from real clustering problems.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    11/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Whats wrong with knowing the truth?

    But...

    In datasets with known classes

    clustering is not of real scientific interest.

    (Or one may want to finddifferentclusterings.)Deviate systematically from real clustering problems.

    The fact that we know certain true classes

    doesnt preclude other legitimate/true clusterings.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    12/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Whats wrong with knowing the truth?

    But...

    In datasets with known classes

    clustering is not of real scientific interest.

    (Or one may want to finddifferentclusterings.)Deviate systematically from real clustering problems.

    The fact that we know certain true classes

    doesnt preclude other legitimate/true clusterings.

    Classes in supervised classification problemsmay not qualify as data analytic clusters.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://goforward/http://find/http://-/?-
  • 8/9/2019 quality of clustering

    13/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Whats wrong with knowing the truth?

    But...

    In datasets with known classes

    clustering is not of real scientific interest.

    (Or one may want to finddifferentclusterings.)Deviate systematically from real clustering problems.

    The fact that we know certain true classes

    doesnt preclude other legitimate/true clusterings.

    Classes in supervised classification problemsmay not qualify as data analytic clusters.

    So there could be better truths than the known one.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    14/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Which clustering is better?

    (10-d vowel data; Hastie, Tibshirani and Friedman ESL)

    1

    2

    3

    4

    5

    6 7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6 7

    8

    9

    0

    a 1

    2

    3

    4

    5

    6 7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    23

    4

    5

    6

    78

    9

    0

    a

    1

    2

    34

    5

    678

    9

    0

    a

    1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8 9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    89

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    89

    0

    a

    1

    2

    3

    45

    67

    89

    0

    a

    1

    2

    3

    45

    6

    7

    8

    9

    0

    a 12

    3

    4

    56

    7

    8

    9

    0

    a

    12

    3

    4

    56

    7

    89

    0

    a

    12

    3

    4

    56

    7

    8

    9

    0

    a

    12

    3

    4

    5

    6

    7

    8

    9

    0

    a

    12

    3

    4

    5

    6

    7

    89

    0

    a

    12

    3

    4

    5

    67

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    89

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    12

    3

    4

    5

    6

    7

    8

    9

    0

    a1

    2

    3

    4

    5

    67

    8

    9

    0

    a

    1

    2

    34

    5

    6

    78

    90

    a

    1

    2

    34

    5

    67

    8

    9 0

    a

    1

    2

    34

    5

    67

    8

    9

    0

    a

    1

    2

    34

    5

    67

    8

    9

    0

    a

    1

    2

    34

    5

    6

    7

    8

    90

    a

    1

    2

    3

    45

    678

    90

    a

    12

    3

    4

    5

    6

    7

    8 9 0

    a

    1

    2

    3

    4

    5

    6

    7

    89

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5 6

    7

    89

    0

    a

    1

    2

    3

    4

    5 6

    7

    89

    0

    a

    12

    3

    456

    7

    8

    9

    0

    a

    1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    90

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    23

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    67

    8

    9

    0

    a

    12

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5 67

    89

    0

    a

    1

    23

    4

    5 6

    7

    8

    9

    0

    a

    1

    23

    4

    5

    6

    7

    8

    9

    0

    a

    12

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    90

    a 1

    2

    3

    4

    56

    7

    89

    0

    a

    1

    2 3

    4

    56

    7

    89

    0

    a

    1

    23

    4

    5

    6

    7

    89

    0

    a

    1

    2 3

    4

    5

    6

    7

    89

    0

    a

    1

    23

    4

    5

    6

    7

    8

    9

    0

    a

    1

    23

    45

    67

    8

    9

    0

    a

    1

    2

    3

    4

    5 6

    78

    9

    0

    a

    1

    2

    3

    4

    5

    6

    78

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    67

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5 6

    7

    8

    9

    0

    a 1

    2

    3

    4

    56

    7

    8

    9

    0

    a1

    2

    3

    4

    56

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    67

    8

    9

    0

    a

    1

    2

    3

    4

    5 6

    7

    8

    90

    a

    1

    2

    3

    4

    5 6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5 6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5 6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    3

    4

    5

    6

    7

    8

    9

    0

    a

    1

    2

    34

    5

    6

    7

    8 9

    0 a

    1

    2

    3

    4

    5

    6

    7

    8

    90

    a

    1

    2 34

    5

    6

    7

    8

    90

    a

    1

    234

    5

    6

    7

    8

    9

    0 a

    1

    23

    4

    5

    67

    8

    9

    0

    a

    1

    23

    4

    5

    6

    7

    8

    9

    0

    a

    8 6 4 2 0 2

    12

    10

    8

    6

    4

    2

    dc 1

    dc

    2

    9

    5

    6

    6

    8

    0

    a

    1

    a

    a

    0

    9

    9

    6

    6

    8 0

    a

    1

    a

    a

    0

    99

    7

    6

    1

    0

    a

    1

    a

    a

    0

    5

    9

    5

    6

    1

    0

    a1

    a

    a

    05

    9

    9

    6

    1

    0

    a1

    1

    a

    05

    99

    6

    0

    0

    a

    1

    1

    a

    6

    5

    5

    96

    8 88

    1

    1

    1

    3

    5

    5

    96

    8 8

    8

    1

    2

    1

    3

    5

    5

    9

    6

    8 8

    8

    1

    2

    1

    3

    5

    5

    9

    6

    8 88

    12

    1

    3

    5

    5

    96

    88

    8

    1 2

    1

    3

    5

    5

    96

    8

    3

    8

    1

    2

    1

    3 5

    56

    68

    88

    1

    1

    1

    6

    5

    56

    68

    8

    8

    1

    2

    1

    45

    56

    6

    8

    8

    8

    1

    2

    1

    4

    5

    5

    6

    68 8

    81

    1

    1

    4 5

    5

    6

    68

    8

    11

    11

    6

    5

    5

    6

    68

    8

    1

    1 1

    1

    6

    5

    5

    6

    6

    6 6

    8

    2

    2

    5

    6

    5

    5

    6

    6

    6 6

    82

    25

    6

    5

    5

    6

    6

    6 6

    82 2

    5

    6

    5

    5

    6

    60 68

    2 2

    5

    6

    5

    5

    6

    60

    6

    8

    22

    5

    6

    9

    5

    6

    60

    6

    8

    21

    5

    6

    77

    6

    6

    88

    8

    8

    2

    43

    77

    a

    6

    88

    8

    8

    2

    43

    77

    a

    68 8

    8

    8

    1

    43

    77a

    63 8

    8

    8

    1

    4 3

    7

    a

    a

    63

    3

    8

    8

    2

    43 7

    a

    a

    63

    3

    2

    2

    2

    43

    a

    a

    a

    33

    3

    32

    3

    a

    3

    a

    a

    a

    33

    3

    3

    2

    3

    a

    3

    a

    a

    a

    33

    3

    3

    2

    3

    a

    3

    a

    a

    a

    0

    3

    3

    3

    2

    3

    a

    3

    a

    a

    a

    3

    3

    3

    3

    2

    3

    a

    3

    a

    a

    a

    3

    3

    3

    3

    2

    3

    a

    3

    a

    a

    a

    3

    33

    82

    4

    43

    a

    7

    a

    3

    3

    3

    82

    4

    4

    3

    7

    7a

    3

    3

    3

    82

    4

    4

    3

    7

    7a

    3

    3

    3

    82

    4

    4

    3

    7

    7

    a

    3

    3

    3

    82

    4

    4

    3

    7

    7

    a

    3

    3

    3

    8

    2

    4

    4

    3

    55

    9

    0

    00

    1

    1

    0 5

    0

    5

    5

    9

    00

    0

    1

    1

    0

    5

    0

    5

    5

    9

    0

    0

    0

    1

    1

    0

    5

    0

    5

    5

    9

    0

    0

    8

    1

    1

    0

    5

    0

    5

    5

    9

    0

    0

    80

    1

    0

    5

    0

    5

    5

    9

    0

    080

    1

    0

    5

    0

    7

    9

    9

    6

    86

    8

    1

    11

    6

    7

    9

    9

    6

    8

    61

    1

    1 1

    6

    7

    9

    9

    6

    8

    61

    1

    1 1

    6

    7

    9

    9

    6

    8

    6

    1

    1

    11

    6

    7

    99

    6

    8

    6

    1

    1 1

    1

    6

    7

    9

    9

    6

    8

    6

    1

    1

    1

    1

    6

    9

    9

    9

    0

    0

    0

    0

    1

    01

    0

    99

    9

    0

    0

    0

    0

    1

    01

    0

    99

    9

    0

    0

    0

    01

    0

    1

    0

    99

    9

    0

    0

    0

    01

    0

    1

    0

    9

    9

    9

    0

    0

    0

    0

    1

    0

    1

    0

    9

    9

    9

    0

    0

    0

    0

    1

    0

    1

    0

    9

    9

    6

    6

    8 6

    0

    a

    0 0

    0

    9

    9

    6

    6

    0 6

    a

    a

    0

    0

    0

    9

    9

    6

    6

    0

    0

    a

    a

    0

    0

    0

    9

    9

    9

    6

    0

    0

    a

    a

    00

    0

    99

    9

    6

    0

    0

    a

    a

    0

    0

    0

    99

    9

    6

    0

    0

    a

    a

    0

    0

    0

    7

    7

    7

    68 6

    11

    4

    4

    6

    7

    7

    7

    68 6

    1

    1

    4

    4

    6

    777

    68 6

    11

    4

    4

    6

    77

    a

    686

    11

    4

    4

    6

    77

    a

    68

    6

    8

    1

    4

    4

    6

    7

    aa

    68

    68

    1

    4

    4

    6

    7

    7

    76

    868

    1

    8

    2

    0

    7

    7

    7

    6

    8 68

    1

    0

    2

    0

    757

    6

    868

    1

    0

    2

    0

    7

    5

    7

    6

    8 68

    1

    0

    2

    0

    7

    5

    7

    6

    8 08

    1

    0

    2

    0

    7

    5

    76

    80

    8

    1 5

    2

    0

    7

    7

    7

    6

    8

    802 4

    4

    6

    7

    7

    7

    6

    8

    802 4

    4

    6

    7

    7

    7

    6

    8

    8

    12

    4

    4

    3

    7

    77

    6

    8

    3

    82

    4

    4

    3

    7

    77

    6

    8

    3

    82

    4

    4

    3

    7

    77

    6

    8

    3

    8

    2

    4

    4

    3

    a

    a

    a

    3

    8

    3

    22

    2

    2

    3

    a a

    a

    3

    8

    3

    22

    2

    23

    aa

    a

    3

    8

    3

    22

    2

    2 3

    a a

    a

    38

    3

    2

    2

    2

    2

    3

    aa

    a

    33

    3

    22

    2

    2

    3

    a a

    a

    3

    3

    3

    22

    2

    2

    4

    10 8 6 4 2 0 2

    8

    6

    4

    2

    0

    2

    dc 1

    dc

    2

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://goforward/http://find/http://-/?-
  • 8/9/2019 quality of clustering

    15/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Whats wrong with knowing the truth?

    Identification mixture components/clusters

    is problematic.

    Mixture of two components may be unimodal.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    16/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Whats wrong with knowing the truth?

    Identification mixture components/clusters

    is problematic.

    Mixture of two components may be unimodal.

    Observations in tails may rather be outliers

    than cluster members (t-distributions).

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    17/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Whats wrong with knowing the truth?

    Identification mixture components/clusters

    is problematic.

    Mixture of two components may be unimodal.

    Observations in tails may rather be outliers

    than cluster members (t-distributions).

    Clustering aims may deviate from

    finding intuitive clusters or mixture components.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    B i th ht

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    18/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Which clustering is better?

    Why datasets without known truth?

    Which clustering is better?

    10 0 10 20 30

    0

    10

    20

    30

    40

    xy[,1]

    xy[,2

    ]

    10 0 10 20 30

    0

    10

    20

    30

    40

    xy[,1]

    xy[,2

    ]

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic tho ghts Cl ster alidation inde es

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    19/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Cluster validation indexes

    General philosophy

    Typical clustering aims

    2. Basic thoughts

    There is a range ofcluster validation indexes

    measuring clustering quality, such as

    Average silhouette width (ASW)(Kaufman and Rouseeuw 1990)

    sw(i, C) = b(i,C)a(i,C)max(a(i,C),b(i,C)) ,

    a(i, C) = 1

    |Cj| 1xCj

    d(xi, x), b(i,C) = minxiCl

    1

    |Cl|xCl

    d(xi, x).

    Maximum averageswgood C.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts Cluster validation indexes

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    20/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Cluster validation indexes

    General philosophy

    Typical clustering aims

    Most such indexes balance within-cluster homogeneity

    against between-cluster separation.

    One size fits it all-approach.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts Cluster validation indexes

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    21/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Cluster validation indexes

    General philosophy

    Typical clustering aims

    Most such indexes balance within-cluster homogeneity

    against between-cluster separation.

    One size fits it all-approach.

    Homogeneity will always dominate here:

    10 0 10 20 30

    0

    10

    20

    30

    40

    xy[,1]

    xy[,2

    ]

    10 0 10 20 30

    0

    10

    20

    30

    40

    xy[,1]

    xy[,2

    ]

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts Cluster validation indexes

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    22/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Cluster validation indexes

    General philosophy

    Typical clustering aims

    General philosophy

    There are various different aims of clustering.

    Depending on application,

    these aims carry different weights.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts Cluster validation indexes

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    23/94

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Cluster validation indexes

    General philosophy

    Typical clustering aims

    General philosophy

    There are various different aims of clustering.

    Depending on application,

    these aims carry different weights.

    Measure them separatelyto characterise

    what a method does best,

    instead of producing a single ranking.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts Cluster validation indexes

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    24/94

    g

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    General philosophy

    There are various different aims of clustering.

    Depending on application,

    these aims carry different weights.

    Measure them separatelyto characterise

    what a method does best,

    instead of producing a single ranking.

    Can piece together overall quality

    as weighted mean of separate statistics.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts Cluster validation indexes

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    25/94

    g

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    Typical clustering aims Between-cluster separation

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts Cluster validation indexes

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    26/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    Typical clustering aims Between-cluster separation

    Within-cluster homogeneity (low distances)

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts Cluster validation indexes

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    27/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    Typical clustering aims Between-cluster separation

    Within-cluster homogeneity (low distances)

    Within-cluster homogeneous distributional shape

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts Cluster validation indexes

    http://goforward/http://find/http://goback/http://-/?-
  • 8/9/2019 quality of clustering

    28/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    Typical clustering aims Between-cluster separation

    Within-cluster homogeneity (low distances)

    Within-cluster homogeneous distributional shape

    Good representation of data by centroids

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cl li i i

    Cluster validation indexes

    G l hil h

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    29/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    Typical clustering aims Between-cluster separation

    Within-cluster homogeneity (low distances)

    Within-cluster homogeneous distributional shape

    Good representation of data by centroids Good representation of dissimilarity

    by clustering-induced metric

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cl t lit t ti ti

    Cluster validation indexes

    G l hil h

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    30/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    Typical clustering aims Between-cluster separation

    Within-cluster homogeneity (low distances)

    Within-cluster homogeneous distributional shape

    Good representation of data by centroids Good representation of dissimilarity

    by clustering-induced metric

    Clusters are regions of high density

    without within-cluster gaps

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Cluster validation indexes

    General philosophy

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    31/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    Typical clustering aims Between-cluster separation

    Within-cluster homogeneity (low distances)

    Within-cluster homogeneous distributional shape

    Good representation of data by centroids Good representation of dissimilarity

    by clustering-induced metric

    Clusters are regions of high density

    without within-cluster gaps Uniform cluster sizes

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Cluster validation indexes

    General philosophy

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    32/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    Typical clustering aims Between-cluster separation

    Within-cluster homogeneity (low distances)

    Within-cluster homogeneous distributional shape

    Good representation of data by centroids Good representation of dissimilarity

    by clustering-induced metric

    Clusters are regions of high density

    without within-cluster gaps Uniform cluster sizes

    Stability (requires knowledge of method)

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Cluster validation indexes

    General philosophy

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    33/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    E.g., pattern recognition in images

    requires separation,

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Cluster validation indexes

    General philosophy

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    34/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    E.g., pattern recognition in images

    requires separation,

    clustering for information reduction requires

    good representation by centroids,

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Cluster validation indexes

    General philosophy

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    35/94

    Cluster quality statistics

    Examples

    Discussion

    General philosophy

    Typical clustering aims

    E.g., pattern recognition in images

    requires separation,

    clustering for information reduction requires

    good representation by centroids,

    groups in social network analysis shouldnt have

    large within-cluster gaps,

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Cluster validation indexes

    General philosophy

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    36/94

    q y

    Examples

    Discussion

    p p y

    Typical clustering aims

    E.g., pattern recognition in images

    requires separation,

    clustering for information reduction requires

    good representation by centroids,

    groups in social network analysis shouldnt have

    large within-cluster gaps,

    underlying true classes (biological species)may cause homogeneous distributional shapes.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Principle of direct interpretation

    Measuring between-cluster separation

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    37/94

    Examples

    Discussion

    Other statistics

    3. Cluster quality statistics

    Aim: measure all thats of interest

    by statistics in[0, 1](1 is good)

    (so that different statistics are comparableand weighted means make sense).

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Principle of direct interpretation

    Measuring between-cluster separation

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    38/94

    Examples

    Discussion

    Other statistics

    3. Cluster quality statistics

    Aim: measure all thats of interest

    by statistics in[0, 1](1 is good)

    (so that different statistics are comparableand weighted means make sense).

    Principle of direct interpretation:

    Aim attranslatingrequirements directly into formulae;thats not optimisation, not estimation of any truth.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Principle of direct interpretation

    Measuring between-cluster separation

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    39/94

    Examples

    Discussion

    Other statistics

    Warning: requires bold subjective tuning decisions.

    And its work in progress.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    E l

    Principle of direct interpretation

    Measuring between-cluster separation

    O h i i

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    40/94

    Examples

    Discussion

    Other statistics

    Measuring between-cluster separation

    several ways measuring separation (as for other aims).Straightforward: min distance between any two clusters,

    or distance between centroids (e.g.,k-means).

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    E l

    Principle of direct interpretation

    Measuring between-cluster separation

    Oth t ti ti

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    41/94

    Examples

    Discussion

    Other statistics

    2 1 0 1 2

    2

    1

    0

    1

    waiting

    duration

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    42/94

    Examples

    Discussion

    Other statistics

    2 1 0 1 2

    2

    1

    0

    1

    waiting

    duration

    M

    M

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    43/94

    Examples

    Discussion

    Other statistics

    Measuring between-cluster separation

    several ways measuring separation (as for other aims).

    Straightforward: min distance between any two clusters,or distance between centroids (e.g.,k-means).

    These measure quite different concepts of separation.

    (min distance relies on only two points;

    centroid distance ignores what goes on at border.)

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    44/94

    Examples

    Discussion

    Other statistics

    p-separation index:

    More stable version of min distance:Average distance to nearest point in different cluster for

    p=10%border points in any cluster.(ASW averages 100% to all in neighbouring cluster.)

    2 1 0 1 2

    2

    1

    0

    1

    waiting

    du

    ration

    X

    X

    X

    X

    X

    X

    X

    XX

    XX X

    X

    X

    X

    X

    X

    X

    X

    XX

    XX

    X

    X

    X X

    X

    X

    X

    X

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    45/94

    Examples

    Discussion

    Other statistics

    p-stability index:

    Average distance to nearest point in different cluster for

    p=10%border points in any cluster.

    Problems: choice ofp, standardisation.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    46/94

    Examples

    Discussion

    Other statistics

    p-stability index:

    Average distance to nearest point in different cluster for

    p=10%border points in any cluster.

    Problems: choice ofp, standardisation.

    May standardise by maximum distance;

    range then is[0, 1], but values may be very small,max distance may be outlying,

    implicit downweighting if used inoverall quality weighted mean.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    47/94

    p

    Discussion

    Could use average/median distance etc. and bound by 1(separation larger ave distance perfect).

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    48/94

    p

    Discussion

    Could use average/median distance etc. and bound by 1(separation larger ave distance perfect).

    Probably not fully satisfactory.

    May use nonlinear transformation to[0, 1]pronouncing differences between lower values,

    taking into account whether

    Max distance >>ave/median distance.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    49/94

    Discussion

    Could use average/median distance etc. and bound by 1(separation larger ave distance perfect).

    Probably not fully satisfactory.

    May use nonlinear transformation to[0, 1]pronouncing differences between lower values,

    taking into account whether

    Max distance >>ave/median distance.

    Stick to max distance standardisation here.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    50/94

    Discussion

    Could use average/median distance etc. and bound by 1(separation larger ave distance perfect).

    Probably not fully satisfactory.

    May use nonlinear transformation to[0, 1]pronouncing differences between lower values,

    taking into account whether

    Max distance >>ave/median distance.

    Stick to max distance standardisation here.

    p=0.1 intuitive; sensitivity?

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    51/94

    Discussion

    Alternative concept:

    Distance-based knn density index

    Measures whether border points have lowest density,

    highest density is within clustersi.

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Di i

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    52/94

    Discussion

    Alternative concept:

    Distance-based knn density index

    Measures whether border points have lowest density,

    highest density is within clustersi.

    Border points here:nBi points that have points

    from other clusters amongk=4-nearest neighbours,nIi interior points.

    Pointwise density:k/(2mean distance tok-nn).

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    53/94

    Discussion

    2 1 0 1 2

    2

    1

    0

    1

    waiting

    duration

    BB

    B

    B

    B

    B

    Christian Hennig Measurement of quality in cluster analysis

    IntroductionBasic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    54/94

    Discussion

    2 1 0 1 2

    2

    1

    0

    1

    waiting

    duration

    BB

    B

    B

    B

    B

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    55/94

    Discussion

    Clusterwise density indexri :(mean border density)/(mean interior density),

    0 ifnBi =0, 1 ifnIi =0.

    Aggregation ([0, 1]):

    ID=1 ((1 q)r1+qr2)1((1 q)r1+qr21),

    r1=

    wiri , r2 =

    bi,

    q=0.5 |nP

    nIi

    n

    0.5|, wi= nI

    i

    Pn

    Ii

    ,

    Overall: bmean border density, imean interior density.

    Christian Hennig Measurement of quality in cluster analysis

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    56/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

  • 8/9/2019 quality of clustering

    57/94

    Discussion

    Aggregation ([0, 1]):

    ID=1 ((1 q)r1+qr2)1((1 q)r1+qr21),

    r1=

    wiri , r2 =

    bi,

    q=0.5 |nP

    nIin 0.5|, wi=

    nIiPnI

    i

    ,

    Idea: r1 measures cluster-relative density ratio,

    r2 overall.

    Both of interest, but fornIinor 0,

    one side ofr2 relies on very weak information.

    Althoughr1 downweights clusters withnIi small,

    outlier one-point clusters still produce too good ID.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    58/94

    Other statistics Within-cluster average distance

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    59/94

    Other statistics Within-cluster average distance

    Aggregated within-cluster similarity

    (Kolmogorov etc.) to normal/uniform

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://goforward/http://find/http://goback/http://-/?-
  • 8/9/2019 quality of clustering

    60/94

    Other statistics Within-cluster average distance

    Aggregated within-cluster similarity

    (Kolmogorov etc.) to normal/uniform

    Within-cluster (squared) distance to centroid

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://goforward/http://find/http://goback/http://-/?-
  • 8/9/2019 quality of clustering

    61/94

    Other statistics Within-cluster average distance

    Aggregated within-cluster similarity

    (Kolmogorov etc.) to normal/uniform

    Within-cluster (squared) distance to centroid (distance, cluster induced distance) (Huberts)

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    62/94

    Other statistics Within-cluster average distance

    Aggregated within-cluster similarity

    (Kolmogorov etc.) to normal/uniform

    Within-cluster (squared) distance to centroid (distance, cluster induced distance) (Huberts)

    Entropy of cluster sizes

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    63/94

    Other statistics Within-cluster average distance

    Aggregated within-cluster similarity

    (Kolmogorov etc.) to normal/uniform

    Within-cluster (squared) distance to centroid (distance, cluster induced distance) (Huberts)

    Entropy of cluster sizes

    Within-cluster nearest neighbour distances

    coefficient of variation

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    64/94

    Other statistics Within-cluster average distance

    Aggregated within-cluster similarity

    (Kolmogorov etc.) to normal/uniform

    Within-cluster (squared) distance to centroid (distance, cluster induced distance) (Huberts)

    Entropy of cluster sizes

    Within-cluster nearest neighbour distances

    coefficient of variation Average largest within-cluster gap

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    65/94

    All need standardisation/transformation.

    Most are dissimilarity-based,allow flexible use with non-Euclidean data,

    given meaningful distance measure.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    Principle of direct interpretation

    Measuring between-cluster separation

    Other statistics

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    66/94

    Data set submission to benchmarking repository

    requires filling in questionnaire, e.g.

    Should clusters be similar or dissimilar in size?

    Are there requirements on what should be the unifying/common ground for elements to belong to the samecluster? Small within-cluster dissimilarities (and, if yes, in which respect)?

    Are there requirements on what should be the discriminating ground for elements to belong to differentclusters? Large between-cluster dissimilarities (and, if yes, in which respect)? Separation (and, if yes, ofwhich kind)? Other (and, if yes, what form do these requirements take)?

    Are there requirements on the between-cluster heterogeneity, that is, the structure of between-clusterdifferences (e.g., should lie in low-dimensional space, other)?

    (Some other; not all yet formalised by indexes)

    Please indicate the importance of those criteria selected by filling in a numerical weight.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    67/94

    4. Examples

    10 0 10 20 30

    0

    10

    20

    30

    40

    xy[,1]

    xy[,2

    ]

    10 0 10 20 30

    0

    10

    20

    30

    40

    xy[,1]

    xy[,2

    ]

    3-means mclust-3

    ave within 0.811 0.643sep index 0.163 0.306

    density index 0.460 0.876

    within gap 0.927 0.949

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    68/94

    2 1 0 1 2

    2

    1

    0

    1

    mclust

    waiting

    duration

    2 1 0 1 2

    2

    1

    0

    1

    pam

    waiting

    duration

    2 1 0 1 2

    2

    1

    0

    1

    spectral

    waiting

    duration

    2 1 0 1 2

    2

    1

    0

    1

    ave.linkage

    waiting

    duration

    2 1 0 1 2

    2

    1

    0

    1

    single linkage

    waiting

    duration

    2 1 0 1 2

    2

    1

    0

    1

    pdfCluster (3)

    waiting

    duration

    Christian Hennig Measurement of quality in cluster analysis

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    69/94

    Introduction

    Basic thoughts

    Cluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    70/94

    Weighthed mean:

    full weight: ave within, sep index

    0.8 weight: entropy

    half weight: within nn cov, gap, min separation,

    density index, hubert gamma, normality, uniformity

    pdf3 spect ave.l mclust kmeans comp.l pam sing.l

    0.624 0.622 0.622 0.619 0.618 0.610 0.601 0.460

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    71/94

    Problem with pam captured.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    72/94

    Problem with pam captured. Single linkage: useless (entropy 0.023; one-point cluster),

    good values in some indexes (careful!), bad in others.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    73/94

    Problem with pam captured. Single linkage: useless (entropy 0.023; one-point cluster),

    good values in some indexes (careful!), bad in others.

    Comparison 2-cluster vs. 3-cluster (pdfCluster):

    individual indexes unfair;

    ave within better, separation worse with largerk (etc.)

    Depends on proper weighting.

    Could add parsimony index.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    74/94

    Problem with pam captured. Single linkage: useless (entropy 0.023; one-point cluster),

    good values in some indexes (careful!), bad in others.

    Comparison 2-cluster vs. 3-cluster (pdfCluster):

    individual indexes unfair;

    ave within better, separation worse with largerk (etc.)

    Depends on proper weighting.

    Could add parsimony index.

    mclust not best in normality,

    ave.l not best in ave within!Individual indexes may favour certain methods,

    but not as obvious as it seems.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    European land snails data (Hausdorf Hennig 2003 2006)

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    75/94

    European land snails data (Hausdorf, Hennig 2003,2006)

    Presence-absence (0-1) data for species in regions;

    geographical Kulczynski dissimilarity;clustering for biotic elements (natural history),

    originally clustered with mclust after MDS.

    Can compare with distance-based.

    0.4 0.2 0.0 0.2 0.4 0.6

    0.6

    0.4

    0.

    2

    0.

    0

    0.

    2

    0.

    4

    datanp[,1]

    datanp[,2]

    0.4 0.2 0.0 0.2 0.4 0.6

    0.6

    0.4

    0.

    2

    0.

    0

    0.

    2

    0.

    4

    datanp[,1]

    datanp[,2]

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    76/94

    0.4 0.2 0.0 0.2 0.4 0.6

    0.

    6

    0.4

    0.

    2

    0.

    0

    0.

    2

    0.4

    datanp[,1]

    datanp[,2]

    0.4 0.2 0.0 0.2 0.4 0.6

    0.

    6

    0.4

    0.

    2

    0.

    0

    0.

    2

    0.4

    datanp[,1]

    datanp[,2]

    mclust ave.l

    ave within 0.619 0.645gap 0.766 0.771

    density index 0.503 0.852sep index 0.055 0.126

    entropy 0.929 0.717normality (MDS) 0.805 0.781

    uniformity (MDS) 0.393 0.302

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    77/94

    0.4 0.2 0.0 0.2 0.4 0.6

    0.6

    0.4

    0.2

    0.0

    0.2

    0.4

    datanp[,1]

    datanp[,2]

    0.4 0.2 0.0 0.2 0.4 0.6

    0.6

    0.4

    0.2

    0.0

    0.2

    0.4

    datanp[,1]

    datanp[,2]

    mclust only better in entropy

    and MDS-distribution based indexes.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    78/94

    0.4 0.2 0.0 0.2 0.4 0.6

    0.6

    0.4

    0.2

    0.0

    0.2

    0.4

    datanp[,1]

    datanp[,2]

    0.4 0.2 0.0 0.2 0.4 0.6

    0.6

    0.4

    0.2

    0.0

    0.2

    0.4

    datanp[,1]

    datanp[,2]

    mclust only better in entropy

    and MDS-distribution based indexes.

    Perturbed by noise cluster;

    better cluster with noise component.

    How to use indexes with unclustered data?

    Could just ignore them but that gives

    clustering with noise unfair advantage.

    Christian Hennig Measurement of quality in cluster analysis

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    79/94

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    80/94

    true 11-means spectralave within 0.691 0.734 0.692

    sep index 0.069 0.093 0.130

    hubert gamma 0.224 0.411 0.400

    entropy 1.000 0.983 0.739

    ARI 1.000 0.205 0.142

    true no good clustering.

    Ave within, entropy areonlyindexes

    positively correlated with ARI!

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    81/94

    true 11-means spectral

    ave within 0.691 0.734 0.692

    sep index 0.069 0.093 0.130

    hubert gamma 0.224 0.411 0.400

    entropy 1.000 0.983 0.739

    ARI 1.000 0.205 0.142

    true no good clustering.

    Ave within, entropy areonlyindexes

    positively correlated with ARI!

    Good ARI needs good ave within and nothing else here.

    Use to explain results in data with known classes.

    Christian Hennig Measurement of quality in cluster analysis

    Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://goforward/http://find/http://goback/http://-/?-
  • 8/9/2019 quality of clustering

    82/94

    Crabs data (2 species, m/f):

    FL

    6 8 12 16 20 20 30 40 50

    10

    15

    20

    6

    8

    12

    16

    20

    RW

    CL

    15

    25

    35

    45

    20

    30

    40

    50

    CW

    10 15 20 15 25 35 45 10 15 20

    10

    15

    20

    BD

    FL

    6 8 12 16 20 20 30 40 50

    10

    15

    20

    6

    8

    12

    16

    20

    RW

    CL

    15

    25

    35

    45

    20

    30

    40

    50

    CW

    10 15 20 15 25 35 45 10 15 20

    10

    15

    20

    BD

    spectral

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://goforward/http://find/http://goback/http://-/?-
  • 8/9/2019 quality of clustering

    83/94

    Crabs data (2 species, m/f):

    FL

    6 8 12 16 20 20 30 40 50

    10

    15

    20

    6

    8

    12

    16

    20

    RW

    CL

    15

    25

    35

    45

    20

    30

    40

    50

    CW

    10 15 20 15 25 35 45 10 15 20

    10

    15

    20

    BD

    FL

    6 8 12 16 20 20 30 40 50

    10

    15

    20

    6

    8

    12

    16

    20

    RW

    CL

    15

    25

    35

    45

    20

    30

    40

    50

    CW

    10 15 20 15 25 35 45 10 15 20

    10

    15

    20

    BD

    mclust

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    84/94

    true mclust spectral

    ave within 0.761 0.828 0.908density 0.167 0.027 0.246

    hubert gamma 0.060 0.291 0.591

    ARI 1.000 0.316 0.023

    true is worst according to most indexes.

    But thereisa visible pattern!

    Allindexes (except entropy) are

    negatively correlated with ARI.

    mclust has best ARI out of 8 methods

    but quite bad index values.

    Indexes fail to capture what goes on here:-(

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    85/94

    5. Discussion Clustering quality is multidimensional.

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-
  • 8/9/2019 quality of clustering

    86/94

    5. Discussion

    Clustering quality is multidimensional.

    Provide multidimensional evaluation,

    characterising a methods behaviour.

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 quality of clustering

    87/94

    5. Discussion

    Clustering quality is multidimensional.

    Provide multidimensional evaluation,

    characterising a methods behaviour.

    Can aggregate criteria by weighted mean

    given well justified weights.

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 quality of clustering

    88/94

    5. Discussion

    Clustering quality is multidimensional.

    Provide multidimensional evaluation,

    characterising a methods behaviour.

    Can aggregate criteria by weighted mean

    given well justified weights.

    Can use to explain performance

    in data with known truth

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    http://find/http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 quality of clustering

    89/94

    5. Discussion

    Clustering quality is multidimensional.

    Provide multidimensional evaluation,

    characterising a methods behaviour.

    Can aggregate criteria by weighted mean

    given well justified weights.

    Can use to explain performance

    in data with known truth

    Designers of new methods should specify

    what aspects of clustering they aim at,so that it can be tested.

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    O bl

    http://find/http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 quality of clustering

    90/94

    Open problems

    Dependence on subjective tuning hurts(weights,k-nn, percentage, standardisation).

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    O bl

    http://find/http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 quality of clustering

    91/94

    Open problems

    Dependence on subjective tuning hurts(weights,k-nn, percentage, standardisation).

    But its honest; such decisions are needed

    to define a good clustering in practice.

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    O bl

    http://find/http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 quality of clustering

    92/94

    Open problems

    Dependence on subjective tuning hurts(weights,k-nn, percentage, standardisation).

    But its honest; such decisions are needed

    to define a good clustering in practice.

    Proper behaviour of criteria(standardisation, transformation,

    different numbers of clusters)

    for fair aggregation and comparability?

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    Open problems

    http://find/http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 quality of clustering

    93/94

    Open problems

    Dependence on subjective tuning hurts(weights,k-nn, percentage, standardisation).

    But its honest; such decisions are needed

    to define a good clustering in practice.

    Proper behaviour of criteria(standardisation, transformation,

    different numbers of clusters)

    for fair aggregation and comparability?

    Index choice vs. method definition

    (average linkage not always optimal for ave. distance)

    Christian Hennig Measurement of quality in cluster analysis Introduction

    Basic thoughtsCluster quality statistics

    Examples

    Discussion

    Open problems

    http://find/http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/9/2019 quality of clustering

    94/94

    Open problems

    Dependence on subjective tuning hurts(weights,k-nn, percentage, standardisation).

    But its honest; such decisions are needed

    to define a good clustering in practice.

    Proper behaviour of criteria(standardisation, transformation,

    different numbers of clusters)

    for fair aggregation and comparability?

    Index choice vs. method definition

    (average linkage not always optimal for ave. distance)

    With given weights, optimise quality?

    Christian Hennig Measurement of quality in cluster analysis

    http://find/http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-