on clustering with chernoff-type faces

This article was downloaded by: [University of North Carolina] On: 13 November 2014, At: 12:46 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Communications in Statistics - Theory and Methods Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/lsta20 On clustering with chernoff-type faces Eugene F. Tidmore a & Danny W. Turner a a Baylor University , Waco, Texas Published online: 27 Jun 2007. To cite this article: Eugene F. Tidmore & Danny W. Turner (1983) On clustering with chernoff-type faces, Communications in Statistics - Theory and Methods, 12:4, 381-396, DOI: 10.1080/03610928308828466 To link to this article: http://dx.doi.org/10.1080/03610928308828466 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. 
Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/ terms-and-conditions



COMMUN. STATIST.-THEOR. METH., 12(4), 381-396 (1983)

ON CLUSTERING WITH CHERNOFF-TYPE FACES

F. Eugene Tidmore and Danny W. Turner

Baylor University, Waco, Texas

Key Words and Phrases: line printer faces; clustering multivariate data; evaluation of clustering with line printer faces; Rand statistic; single linkage; complete linkage; average linkage; Ward's method.

ABSTRACT

Chernoff (1973) introduced a new procedure for representing multidimensional data by using cartoon-like faces drawn by a pen plotter, while Turner and Tidmore (1977) introduced asymmetric Chernoff-type faces which can be generated on a line printer. The use of such faces for clustering multivariate data is a well known technique. However, there have been few attempts to evaluate this graphical procedure in a systematic fashion. This paper reports results obtained in a comparison of the line printer faces clustering method with several nongraphical hierarchical clustering algorithms, including single, complete, and average linkage and Ward's minimum variance method.

1. INTRODUCTION

The concept of using two-dimensional cartoon-like faces to represent multivariate data points was originated by Chernoff (1973). Each original Chernoff face could accommodate up to 18 variables per case, although certain normalizations virtually reduced this to 16.

Copyright © 1983 by Marcel Dekker, Inc.


The Chernoff faces were drawn by a Calcomp plotter. Turner and Tidmore (1977) introduced asymmetric Chernoff-type faces that could be generated on a line printer. Our goal was not to duplicate the original Chernoff face on a line printer, but to capture the essence of Chernoff faces using an inexpensive procedure which would be readily available to data analysts. A typical line printer face appears in Figure 1. Our FACES program (written in FORTRAN) generates 8 such faces per page of output.

Line printer faces can handle up to 12 variables per case. The 12 features that can be controlled are the 4 corners, 2 eyebrows, 2 eye frames, 2 pupils, nose, and mouth. These features operate independently, with each one being controlled by a single variable. (Some features are dormant if there are fewer than 12 variables.) The manner in which a variable $X_i$ controls a feature $F_i$ of the printer face is basically as follows. The range of $X_i$ is broken into 10 equal-length intervals. Feature $F_i$ of face $j$ (drawn for case $j$) is determined by the interval that $X_i$ falls in for case $j$. Thus, the number of distinct faces that can be produced is $10^{12}$.

Attempts at analytically evaluating the faces technique seem to be few. Chernoff and Rizvi (1975) address the problem of assigning variables to features and the effect this has on classification error using plotter faces. Our early work on evaluating the printer faces procedure for clustering data followed the example of Chernoff (1973) and involved the use of data with no predetermined "right" answers or clusters. Moreover, the results (Turner and Tidmore [1977]) provided no comparisons with other clustering algorithms. In this paper we take a more systematic approach to evaluating the effectiveness of the faces clustering procedure by applying it to data sets for which it is known that clusters of a particular type do exist. Several nongraphical clustering algorithms are also applied to these same data sets to provide a basis for comparison. These algorithms include single, complete, and average linkage and Ward's minimum variance method.
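The interval rule described in this introduction, by which a variable selects one of 10 settings for the feature it controls, can be sketched in a few lines of Python. This is an illustration only and not the authors' FORTRAN program FACES: the function name `feature_levels`, the zero-based level numbers, the use of the observed minimum and maximum as the range, and the handling of the upper endpoint are all assumptions.

```python
def feature_levels(values, n_levels=10):
    """Map each value of one variable to an interval index 0..n_levels-1.

    Assumption: the range is the observed [min, max] of the variable, and
    the maximum value is placed in the last interval.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable carries no information; use a single level.
        return [0] * len(values)
    width = (hi - lo) / n_levels
    return [min(int((v - lo) / width), n_levels - 1) for v in values]


# One variable observed on four cases; each case gets a feature setting.
levels = feature_levels([0.0, 2.5, 5.0, 10.0])

# With 12 independent features and 10 settings each, the count of
# distinct faces matches the figure quoted in the text.
distinct_faces = 10 ** 12
```

Binning each variable separately is what lets the 12 features operate independently, as the text requires.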


FIG. 1. Typical Line Printer Face

2. PRELIMINARY DEFINITIONS

This section introduces two parameters, $\beta$ and $\gamma$, that measure geometric properties of a collection of subsets (clusters) of a given set. Each involves critical ratios of certain distances computed from a given clustering (partition) of a set and is intended to provide a measure of difficulty for the task of recovering the given clustering from the unclustered data.

A cluster will be viewed as a set of points satisfying some similarity condition within a larger set of points from some finite dimensional space. Given that a data set contains clusters of a particular type, it is the task of a clustering procedure to locate these clusters. In the case of clustering based on geometric distance, the recognition of such clusters can have varying levels of difficulty, depending upon how close similar points are to each other compared with distances between clusters. For this reason, a parameter $\beta$ will be defined below that gives the critical ratio of dissimilar pair distances to similar pair distances. Roughly, if $\beta = 2$, the distance between two data points not considered close to each other would be at least twice as great as the distance between two data points which are judged as close or similar to each other.

Let $S$ be a set and let $x$ and $y$ be points in $S$. A finite ordered subset $C = \{z_0, z_1, \ldots, z_n\}$ is called a chain connecting $x$ and $y$ in $S$ if $z_0 = x$, $z_n = y$ and $z_i \in S$, $1 \le i \le n-1$. If there is a distance function $d$ defined on $S$, then $\|C\|$ (its defining expression is missing from this copy) is called the norm of the chain $C$. If $S$ is finite, then for any pair $x$ and $y$ in $S$ there exists a chain connecting $x$ and $y$ in $S$ with minimum norm. Such a chain will be called a minimal connecting chain for $x$ and $y$ in $S$.

Let $S$ be a finite set, with distance function $d$, which has been partitioned into clusters. This partition is said to satisfy the T property if for each cluster $A$ and each pair $x$ and $y$ in $A$, there exists a chain $C$ connecting $x$ and $y$ in $A$ such that $\|C\| < d(A, S-A)$, where $-$ denotes set difference.

A partition $P$ is nontrivial if $P$ contains at least two clusters, at least one of which has two or more points.

For any finite set $S$ with distance $d$ and any nontrivial partition $P$ of $S$ there will exist a unique largest real number $\beta$ such that $\beta \|C\| \le d(A, S-A)$ for all clusters $A$ in $P$ and all minimal connecting chains $C$ for pairs $x$ and $y$ in $A$.

Let $x$ and $y$ be points of $A$, a subset of $S$ which has some distance function $d$. The linkage distance between $x$ and $y$, relative to $A$, will be the norm of a minimal connecting chain for $x$ and $y$ in $A$, and will be denoted by $L_A(x, y)$. Let

$M_A = \max \{ L_A(x, y) : x, y \in A \}$.

Clearly, $M_A \ne 0$ iff $A$ contains more than one point.


Theorem 1 gives a computational procedure for obtaining $\beta$ for nontrivial partitions.

Theorem 1.

Let $S$ be a finite set with distance $d$ and let $P$ be a nontrivial partition of $S$. Then $\beta = m$, where $m = \min \{ d(A, S-A)/M_A : A \text{ is a cluster in } P \text{ having at least two points} \}$.

Proof: The finiteness of $S$ implies that there exists some cluster $A'$ having at least two points such that $m = d(A', S-A')/M_{A'}$. For any cluster $A$ with at least two points and any minimal connecting chain $C$ in $A$, $m \le d(A, S-A)/M_A$, hence $m M_A \le d(A, S-A)$. Since $\|C\| \le M_A$, $m\|C\| \le m M_A \le d(A, S-A)$. Also, $M_{A'} = \|C'\|$ for some minimal connecting chain $C'$ in $A'$; therefore, if $r > m$ then $r\|C'\| = r M_{A'} > m M_{A'} = d(A', S-A')$. For singleton clusters $A$, $\|C\| = 0$ for any chain $C$ in $A$. Then clearly $m\|C\| \le d(A, S-A)$ in this case. It follows that $m = \beta$.

The following easily proved theorem relates the T property and the quantity $\beta$.

Theorem 2.

A nontrivial partition $P$ of a finite set $S$ satisfies the T property iff $\beta > 1$.
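Theorem 1 reduces the computation of $\beta$ to a minimum of ratios, which the following Python sketch carries out for points given as coordinate tuples. Several things here are assumptions rather than the authors' definitions: the chain norm is taken to be the largest single step $d(z_{i-1}, z_i)$ along the chain (the defining formula is not legible in this copy), $d(A, S-A)$ is taken to be the smallest distance between a point of $A$ and a point outside $A$, the distance is Euclidean, and the points within a cluster are distinct. The function names are hypothetical.

```python
from math import dist


def linkage_distances(cluster):
    """Matrix of L_A(x, y) under the ASSUMED max-step chain norm.

    A Floyd-Warshall style pass finds, for every pair, the chain inside
    the cluster whose largest step is as small as possible.
    """
    n = len(cluster)
    L = [[dist(p, q) for q in cluster] for p in cluster]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(L[i][k], L[k][j])
                if via_k < L[i][j]:
                    L[i][j] = via_k
    return L


def beta(clusters):
    """beta = min over clusters A with >= 2 points of d(A, S-A) / M_A."""
    ratios = []
    for idx, cluster in enumerate(clusters):
        if len(cluster) < 2:
            continue  # singleton clusters do not enter the minimum
        m_a = max(max(row) for row in linkage_distances(cluster))
        outside = [q for j, c in enumerate(clusters) if j != idx for q in c]
        gap = min(dist(p, q) for p in cluster for q in outside)
        ratios.append(gap / m_a)
    if not ratios or len(clusters) < 2:
        raise ValueError("partition must be nontrivial")
    return min(ratios)


def satisfies_t_property(clusters):
    """Theorem 2: the T property holds iff beta > 1."""
    return beta(clusters) > 1
```

For example, the one-dimensional partition `[[(0,), (1,), (2,)], [(5,), (6,)]]` has steps of length 1 inside each cluster and a gap of 3 between clusters, so under these assumptions the sketch returns 3.0.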

The use of a visual clustering procedure also suggested that another ratio might be related to successful identification of existing clusters. Let $S$ be a finite set with distance function $d$ which has been partitioned into clusters $C_1, C_2, \ldots, C_k$, $k > 1$. Let $\delta(S) = \max \{ d(x, y) : x, y \in S \}$ and let $\rho = \min \{ d(C_j, S - C_j) : 1 \le j \le k \}$. Then the parameter $\gamma = \delta(S)/\rho$ is a ratio which relates the diameter of the set $S$ to distances between points in different clusters. For large values of $\gamma$ the distinction between certain pairs of clusters may be difficult to detect.
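As a small illustration of the ratio $\gamma$, the sketch below divides the diameter of the whole set by the smallest between-cluster distance. It assumes Euclidean distance and takes the distance from a cluster to its complement to be the smallest point-to-point distance; the function name `gamma` is hypothetical.

```python
from math import dist


def gamma(clusters):
    """Diameter of the whole set divided by the smallest cluster gap."""
    points = [p for cluster in clusters for p in cluster]
    diameter = max(dist(p, q) for p in points for q in points)
    # Smallest distance between two points lying in different clusters,
    # which equals min over j of d(C_j, S - C_j) under the assumption above.
    gap = min(
        dist(p, q)
        for i, a in enumerate(clusters)
        for j, b in enumerate(clusters)
        if i != j
        for p in a
        for q in b
    )
    return diameter / gap
```

A large value means the clusters are close together relative to the overall spread of the data, which is the situation the text describes as hard to detect visually.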

3. DESIGN OF EXPERIMENT I

Our first experiment involved 48 data sets with the number of points in an individual data set ranging (randomly) from 28 to 40. A factorial design involving four factors was implemented as follows: Factor 1: $\beta$ at four levels ($\beta$ in the following intervals: [1.1, 1.3], [1.4, 1.6], [1.8, 2.2], [2.4, 2.8]); Factor 2: $\gamma$ at three levels ($\gamma$ = 2, 6, 10); Factor 3: dimension of space at two levels (6, 12); Factor 4: constructed number of clusters at two levels (2, 3). This structure accounts for the 48 sets of data that were constructed using an interactive FORTRAN program (SIMDAT) we designed for this purpose. The "shape" of an individual cluster in this experiment is best described as a rectangular parallelepiped and, moreover, clusters within a single data set do not overlap (i.e., there exist separating hyperplanes).
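The count of 48 data sets follows from crossing the four factors, which a short enumeration makes explicit. The sketch assumes one constructed data set per cell of the design, as the text implies; the variable names are illustrative only.

```python
from itertools import product

# Factor levels as stated in the design above.
beta_intervals = [(1.1, 1.3), (1.4, 1.6), (1.8, 2.2), (2.4, 2.8)]
gamma_levels = [2, 6, 10]
dimensions = [6, 12]
cluster_counts = [2, 3]

# Every combination of levels is one cell of the factorial design.
design = list(product(beta_intervals, gamma_levels, dimensions, cluster_counts))

# Five clustering methods applied to every data set give the number of
# Rand values analysed in the experiment.
n_methods = 5
n_rand_values = n_methods * len(design)
```

Here `len(design)` is 4 x 3 x 2 x 2 = 48, and `n_rand_values` is 240.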

Five clustering procedures were applied to each of the 48 data sets described above. The five are single linkage, complete linkage, average linkage (group average method), Ward's minimum variance method, and line printer faces. Single, complete and average linkage were implemented using BMDP program P1M with Euclidean distance (distance matrix input, see Dixon and Brown (1979)).

Ward's method was implemented using program CLUSTAR (see Romesburg and Marshall (1980)) and line printer faces were generated using our program FACES (see Turner and Tidmore (1980)). Since the four nongraphical methods are hierarchical, their output does not automatically include a choice for the number of clusters to use in the final partition of a data set. In this experiment, for the nongraphical hierarchical methods, we always chose the algorithm's solution which corresponded to the designed correct number of groups. However, when faces was applied to a set of data, the only knowledge concerning correct number of groups was that it was either two or three.

The dependent variable in this experiment is a measure of agreement between two partitions of a set called the Rand statistic (see Rand (1971)). It is the proportion of all pairs of points that the two partitions agree on, where agreement means the partitions both have the pair together in some cluster or they both have the pair in separate clusters. Clearly, the value of the Rand statistic ranges from 0 to 1, with 0 being total disagreement and 1 being total agreement. Our data consist of the 240 values of Rand obtained by applying each of our 5 clustering methods to each of the 48 constructed data sets and computing the value of Rand between the method's partition and the correct (constructed) partition. (The Rand value used for faces for a data set was the average of two values, one for each author's face partition of the data set. The authors' face partitions were generally quite similar.)
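The definition of the Rand statistic given above translates directly into a pair count. In the sketch below each partition is represented as a list of cluster labels, one per point; that representation and the function name `rand_statistic` are choices made for this illustration.

```python
from itertools import combinations


def rand_statistic(labels_a, labels_b):
    """Proportion of point pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two points
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of the same set of >= 2 points")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Four points; the partitions agree on 3 of the 6 pairs.
example = rand_statistic([0, 0, 0, 1], [0, 0, 1, 1])
```

Only co-membership matters, so relabeling the clusters of either partition leaves the value unchanged.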

Before reporting the results of experiment I, we illustrate two of the data sets used by showing the partitions obtained using faces. Figure 2 shows a partition obtained using the faces procedure on a data set constructed for the pair $\beta = 1.1$, $\gamma = 2$. Cases 602, 108, and 930 were incorrectly placed in cluster 1. Figure 3 shows the partition obtained using the faces procedure for another data set. The two constructed clusters, with $\beta = 2.2$, $\gamma = 10$, can be obtained by combining clusters I and II into a single cluster, with cluster III as the second cluster. A partition with the T property, $\beta = 1.7$, can be obtained by deleting 704 from the partition represented in Figure 3. The reader should note carefully in Figures 2 and 3 how the faces are arranged. In particular, a cluster may contain two faces that do not look alike, but there is a chain of faces connecting the two. Thus, we see that using a graphical method, like faces, is not necessarily restricted to just grouping objects that are (pairwise) similar to each other.

FIG. 2. Faces for Data Set 122

FIG. 3. Faces for Data Set 235

4. RESULTS OF EXPERIMENT I

Most of the results reported below are based on output generated by running our data through BMDP analysis of variance program P2V. Since the values of the dependent variable (Rand statistic) tended to be in the upper end of the interval [0, 1], an arcsine transformation was used.
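The text does not say which form of the arcsine transformation was applied. A common choice for proportions that crowd toward 0 or 1 is the angular transformation $\arcsin\sqrt{p}$, and the sketch below assumes that form; the function name is hypothetical.

```python
from math import asin, sqrt


def arcsine_transform(p):
    """Angular transformation arcsin(sqrt(p)) for a proportion p in [0, 1].

    Assumption: this is the variant used; the paper says only that an
    arcsine transformation was applied to the Rand values.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return asin(sqrt(p))


# Rand values near 1 are spread out more than values near the middle.
transformed = [arcsine_transform(r) for r in (0.84, 0.89, 0.98)]
```

The transformation is increasing, so the ranking of methods by Rand value is unaffected; it changes only the scale on which the analysis of variance is run.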

The $\beta$ factor had a significant effect (P-value = .08) with larger values of Rand associated with larger values of $\beta$, and the main effect occurring between level 1 and level 2 of $\beta$. The $\gamma$ factor was highly significant (P-value = .0007) with the trend being larger values of Rand associated with smaller values of $\gamma$, and the main effect occurring between levels 1 and 2. The dimension of space and correct number of groups factors were both influential (respective P-values of .07 and .01) with higher dimension or higher number of groups having a diminishing effect on Rand.

Clustering method was found to have a highly significant effect (P-value < $10^{-4}$). The single linkage algorithm was significantly better than the other clustering procedures, having an average Rand value of .98. The other methods, ranked by average Rand value (which is shown in parentheses) were faces (.89), average linkage (.88), Ward's method (.87), and complete linkage (.84).

There were also interaction terms that were significant. Gamma by true number of groups had P-value = .02 with Rand having aberrant high values when $\gamma = 10$ and correct number of groups equals 3. Also, clustering method by true number of groups was important (P-value ≈ .01) with faces bucking the trend by having an "out-of-line" high mean Rand value when the true number of groups was 3. The last significant two-way interaction was between clustering method and gamma, with P-value ≈ .0003. The nature of this interaction was that mean Rand values tended to drop sharply from level one of gamma ($\gamma = 2$) to level two ($\gamma = 6$) and then remain relatively constant or rise slightly at level three ($\gamma = 10$) except for single linkage, which remained constant from level one to level two and then dropped moderately at level three ($\gamma = 10$). Faces also exhibited a crossover effect by having a lower mean than the other methods at level one of gamma, but higher mean than the others (except for single linkage) at levels two and three. Finally, one three-way interaction involving clustering method, beta, and true number of groups was significant but we provide no description of the nature of this effect in this report.

Up to this point we have examined relationships among partitions generated by various clustering methods and corresponding correct partitions of a set of multidimensional points. Another interesting question is to ask which nongraphical clustering algorithm was faces most like? To answer this, we computed (for each data set) the value of Rand between the partition generated using faces and the partition generated by each of the other four algorithms. These values were averaged across all the data sets with the following results. Faces partitions were most similar to single linkage partitions with the mean value of Rand being .91. Average linkage followed with mean Rand of .88, so the margin of victory was not large, but important, when combined with the results reported in section 6 below.

5. DESIGN OF EXPERIMENT II

Our second experiment involved 10 sets of data having the following general structure: 60 to 90 six-dimensional points per set; 2 to 5 multivariate normal clusters (groups) per set; covariance matrices not necessarily equal; clusters may overlap. There was no systematic variation of model parameters for these data sets. However, each data set was constructed with a definite geometric configuration in mind for the underlying cluster structure. We included no really "easy" configurations (e.g., widely separated spherical groups) and, in fact, some would be considered quite difficult (e.g., intersecting hyperellipsoids with equal mean vectors and nontrivial covariance structure). Since a detailed description of all 10 configurations would be lengthy, we shall describe only one, data set 858, in detail.


The geometric idea of set 858 was three multivariate normal groups arranged in a triangular configuration with some overlap at the "vertices". Thinking in three dimensions (even though our data is six-dimensional here) one can visualize an ellipsoid with major axis along the x axis and one end at the origin, a similar ellipsoid along the y axis, and a third ellipsoid connecting the ends of the first two. By starting with three multivariate normal populations having zero mean vectors and diagonal covariance matrices, one can use various linear transformations to manipulate the populations into the desired arrangement. Data set 858, specifically, consisted of the following:

Group: 1, 2, 3

Sample Size: 20

Mean Vector: (4.6, 0, 4.6, 0, 4.6, 0); (2.4, 0, -4.9, 0, 2.4, 0); (9, 0, -.8, 0, 9, 0)

Covariance Matrix: 6 0 .5 0 5 0 2.3 0 -2.7 0 1.3 0 1 0 .5 0 0 0

To measure the degree of overlap between a pair of groups of sample points, we compute the percentage of all sample points that intrude the opposite group's 90% probability ellipsoid. For data set 858, the percentages were 4.3 for groups 1 and 2, 13.5 for 1 and 3, and 0 for 2 and 3.
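One way to read this overlap measure is sketched below: a point intrudes when its squared Mahalanobis distance from the other group's mean is no larger than the 0.90 quantile of a chi-square distribution with as many degrees of freedom as there are dimensions. This reading is an assumption, as are the use of supplied (rather than estimated) means and covariance matrices, the choice of both groups' points as the denominator, and all function names. The threshold is passed in explicitly; the constants shown are the standard table values for 2 and 6 dimensions.

```python
# 0.90 quantiles of the chi-square distribution (standard table values).
CHI2_90 = {2: 4.605, 6: 10.645}


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)' cov^{-1} (x - mean)."""
    n = len(x)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Solve cov * y = diff by Gauss-Jordan elimination with pivoting.
    aug = [list(row) + [d] for row, d in zip(cov, diff)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    y = [aug[i][n] / aug[i][i] for i in range(n)]
    return sum(d * yi for d, yi in zip(diff, y))


def percent_overlap(group_a, group_b, mean_a, cov_a, mean_b, cov_b, threshold):
    """Percentage of the points in both groups lying inside the other
    group's probability ellipsoid."""
    inside = sum(mahalanobis_sq(p, mean_b, cov_b) <= threshold for p in group_a)
    inside += sum(mahalanobis_sq(p, mean_a, cov_a) <= threshold for p in group_b)
    return 100.0 * inside / (len(group_a) + len(group_b))
```

For six-dimensional data such as set 858, `CHI2_90[6]` would be the threshold under this reading.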

Figure 4 shows the clustered faces for data set 858. As in experiment I, each of the five clustering methods was applied to each of the 10 sets of data with the Rand statistic being the measure of agreement between two partitions.


FIG. 4. FACES for Data Set 858 as clustered by one author. The Rand statistic for this partition (relative to the correct partition) is .76. Ward's method generated the best partition, with Rand value equal to .81.


6. RESULTS OF EXPERIMENT II

We again used BMDP program P2V to perform the necessary computations with the arcsine transformation being applied to the dependent variable Rand.

Blocking on the ten data sets was effective (P-value < [value illegible in this copy]) with data sets having higher percentage overlap being more difficult for the clustering methods to partition correctly. Method of clustering was also significant (P-value < $10^{-4}$) with Ward's method being better than the other methods. The methods, ranked by mean value of Rand, were Ward's (.79), faces (.73), average linkage (.70), complete linkage (.69), and single linkage (.47).

Using computations like those described at the end of section 4, it turned out that faces partitions were most similar to Ward's method partitions (average Rand = .75) with average linkage next (average Rand = .71).

7. SUMMARY AND CONCLUDING REMARKS

In this paper we have reported on two experiments that were designed mainly to help evaluate the faces method for visually clustering multivariate data. Two new measures, β and γ, of certain geometric properties of a clustering (partition) of a given set were introduced and found to be related to the capability of clustering algorithms to recover the given partition.

The most interesting feature of the faces method is that it seems to be flexible. While faces was not the overall "winner" in either experiment with respect to recovering correct partitions, it was second both times. Moreover, in each experiment, faces generated partitions most similar to those generated by the winning algorithm, which was single linkage in experiment I and Ward's method in experiment II. Needless to say, these two algorithms have quite different clustering strategies.


The results of these experiments suggest the following strategy for utilizing the graphical procedure faces. Suppose we have a set of data that is to be partitioned into a number of clusters and there is, a priori, no compelling reason to prefer any specific clustering method. Then cluster the data using several procedures like the nongraphical methods illustrated herein, and also cluster using faces. Now determine which of the methods faces is most similar to, and use it for your analysis. The idea is that faces "points to the winner".
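The strategy just described can be sketched as follows, assuming that similarity between partitions is measured by the Rand statistic, as elsewhere in this paper. The function names and the six-point label vectors are hypothetical; in practice the faces partition would come from a person sorting the printed faces.

```python
from itertools import combinations


def rand_statistic(labels_a, labels_b):
    """Proportion of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def method_closest_to_faces(faces_labels, method_labels):
    """Return the method whose partition is most similar to the faces partition.

    method_labels maps a method name to its cluster labels. Ties go to
    the method listed first.
    """
    scores = {
        name: rand_statistic(faces_labels, labels)
        for name, labels in method_labels.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical partitions of six points (not data from the experiments).
faces = [0, 0, 0, 1, 1, 1]
methods = {
    "ward": [0, 0, 1, 1, 1, 1],
    "single": [0, 0, 0, 0, 0, 1],
    "complete": [0, 1, 0, 1, 0, 1],
}
best, scores = method_closest_to_faces(faces, methods)
```

In this made-up example the faces partition agrees with the Ward partition on 10 of 15 pairs and with each of the others on 7 of 15, so Ward's method would be the one chosen for the analysis.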

Interesting examples and more discussion of the use of faces and other graphical clustering methods can be found in Wang (1978), Fienberg (1979), and Turner (1981).

BIBLIOGRAPHY

Chernoff, Herman (1973). Using Faces to Represent Points in k-Dimensional Space Graphically. J. Amer. Statist. Assoc. 68, 361-68.

Chernoff, Herman & Rizvi, M. Haseeb (1975). Effect on Classification Error of Random Permutations of Features in Representing Multivariate Data by Faces. J. Amer. Statist. Assoc. 70, 548-54.

Dixon, W. J. & Brown, M. B., Editors (1979). BMDP Biomedical Computer Programs. University of California Press.

Fienberg, S. E. (1979). Graphical Methods in Statistics. Amer. Statist. 33 (4), 165-178.

Rand, W. M. (1971). Objective Criteria for the Evaluation of Clustering Methods. J. Amer. Statist. Assoc. 66, 846-850.

Romesburg, C. H. & Marshall, K. (1980). CLUSTAR and CLUSTID: Computer Programs for Hierarchical Cluster Analysis. Amer. Statist. 34 (3), 186.

Turner, D. W. (1981). Graphical Methods for Representing Points in n-Dimensional Space. Abstracts Amer. Math. Soc. 2 (6), 516. (Reprints available from the author.)

Turner, D. W. & Tidmore, F. E. (1977). Clustering with Chernoff-type Faces. Proceedings of the American Statistical Association, Statistical Computing Section, 372-377.


Turner, D. W. & Tidmore, F. E. (1980). FACES - A FORTRAN Program for Generating Chernoff-type Faces on a Line Printer. Amer. Statist. 34 (3), 187.

Wang, Peter C. C. (1978). Graphical Representation of Multivariate Data. New York: Academic Press.

Received February, 1980; Revised January, 1982.

Recommended by William H. Rogers, The Rand Corp., Santa Monica, CA
