ibm research reportibm research report kmap - a visualizer for kohonen self-organizing map...

RC22517 (W0207-065) July 15, 2002Computer Science

IBM Research Report

KMAP - A Visualizer for Kohonen Self-organizing MapClustering Results

George S. Almasi, Richard D. LawrenceIBM Research Division

Thomas J. Watson Research CenterP.O. Box 218

Yorktown Heights, NY 10598

Research DivisionAlmaden - Austin - Beijing - Delhi - Haifa - India - T. J. Watson - Tokyo - Zurich

LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Reportfor early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests.After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g. , payment of royalties). Copies may be requested from IBM T. J. Watson Research Center , P. O. Box 218, Yorktown Heights, NY 10598 USA (email: [email protected]). Some reports are available on the internet at http://domino.watson.ibm.com/library/CyberDig.nsf/home .

KMAP - A Visualizer for KohonenSelf-organizing Map Clustering Results

George S. Almasi and Richard D. LawrenceIBM T.J. Watson Research Center

P.O. Box 218Yorktown Heights, New York 10598

[email protected], [email protected]

August 17, 2002

Abstract

We describe a Java-based visualizer that preserves the topological information in-herent in the results of cluster data mining that used the Kohonen self-organizing mapapproach. The visualizer provides a tool with a number of views and drilldowns that ananalyst can use to circumvent the data interpretation bottleneck and quickly evaluatethe results of the map produced by the datamining run. Being written in Java, it isportable across a large number of computing platforms.

Keywords: visualizer, data mining, Kohonen, self-organizing map, clustering,Java.

1 Introduction

The Kmap visualizer described in this paper was designed to allow rapid evaluation ofthe results of data mining performed using the Kohonen self-organizing map [8] clusteringapproach. A key feature of this approach is that, in addition to grouping similar datarecords into clusters, it places the clusters on a two-dimensional map in such a way thatsimilar clusters are near each other. The nature of a cluster in a particular neighborhood ofthe map says something about that neighborhood. In other words, the contents of individualclusters and the topography of the cluster map are both of interest.

The current authors wrote a highly scalable parallel version of the self-organizing mapalgorithm [9] which is used by IBM’s Intelligent Miner for Data [7] . This program wasrecently put to the test in a demonstration of mining very large databases [4], and theparallel speedup that was obtained was so good (over 14X for a 16-processor SP2) that thenormal ratio of time needed to perform the datamining vs. interpreting its results becameinverted – data analysis became the new bottleneck. The reason was that the existing

1

visualizer was good at conveying the nature of individual clusters, but the map had to beconstructed basically by hand. Kmap was designed to remedy this problem.

This paper is organized as follows: we first describe the design of the Kmap visualizer,its modes of operation and its drilldown capabilities, and its implementation in Java. Thisis followed by four case studies that show examples of Kmap’s features. We then presentillustrative examples of Kmap’s features in the form of four case studies using stock market,retail supermarket, credit bureau, and census data.

2 Implementation

The main Java classes in Kmap reflect the program’s functionality:

• Kmap collects the clustering results from the input file (generic neural weights file orIntelligent Miner results file) and creates and manipulates the rectangular grid map ofclusters.

• Kdetail handles the drilldowns into an individual cluster• Kplot creates the contents of the rectangles for Kmap and Kdetail, using the JClass

chart and table java components provided by KL Group Inc. [5].

• Kmenubar handles user interactions via the menu bar that is used for both Kmapand Kdetail.

• NNClustRes provides an API to Intelligent Miner results files in the form of a dozen“get” methods for obtaining

1. the number of clusters (neural nodes) and data attributes (fields),

2. the neural weights (mean clustering attributes for each node), and

3. several sets of attribute statistics, used for drilldowns and described further below.

There are three different “modes” for viewing clustering results in Kmap and its drill-downs:

• “weights” mode, in which a plot of all the neural weights (i.e., attribute mean valuse)is shown for each cluster. This mode is useful when the attributes consist of a logicalsequence like month-by-month prices or spending. Figure 1 shows such an example.

• “topten” mode, in which the names of the attributes with the top 3 or top 10 weightsare shown for each cluster. This mode can be very useful when the number of attributesis large. Figure 3 shows a comparison of results viewed in weights and in topten modes,and Figure 4 gives a detailed example of topten mode in action.

• “sizes” mode, which simply shows the size of each cluster as a filled circle. Figure 3contains an example of a “sizes” view.

2

The mode can be selected from the “View” menu, or by specifying w, t, or s after the filenamewhen Kmap is called from the command line.

The cluster size is indicated by the background color in weights mode, by the graynessof the frame in topten mode, and by the size of the circle in sizes mode.

The available supplemental attributes, such as Total Spending or Age, can also be usedto color the inner rectangle in topten mode or the background in weights or sizes mode. Thisis done via the “Color By” pulldown on the menu bar. The functions provided by the othermenu bar pulldowns are as follows:

• “File” handles the opening of files and the printing of the displays.• “View” provides a large number of methods for controlling the appearance of the

visualization.

• “Sort By” allows sorting the attributes in topten mode by weights or by six otherquantities.

The attributes used in neural clustering are often normalized first, and so the neuralweights (the means of the normalized attributes) for each cluster can be different from themeans of the raw values. The un-normalized values of the attributes can be included assupplemental fields, and Intelligent Miner will compute their statistics along with those ofthe active fields. Kmap has provision to read these supplemental field statistics and makethem available for alternative ways to sort the attributes in topten mode, using the ”SortBy” menu. These quantities available for each cluster are:

1. “mean”, the mean value of the un-normalized attribute

2. “partcp”, the fraction of records with non-zero values of this attribute

3. “meanNZ”, the mean non-zero value of the un-normalized attribute, computed fromthe ratio of the two quantities above

4. “Rmean”, the mean value of the un-normalized attribute for this cluster divided bythe mean value of this attribute for the entire data set.

5. “Rpartcp”, the quantity “partcp” normalized to the entire data set as above

6. “RmeanNZ”, the quantity “meanNZ” normalized to the entire data set as above

There are also popup menus and other panels available for drilldown and other purposes;these are shown in the examples that follow.

3 Case Studies

This section describes the use of Kmap to visualize several different kinds of dataminingresults.

3

• Standard and Poor 500 monthly closings (weights, drilldowns)• Safeway (topten mode, more drilldowns)• Credit Bureau(bankruptcy colorby)• Census (categorical variables)

3.1 Stock Prices

For our first example, Figure 1 shows Kmap in “weights” mode, displaying the results of a64-node clustering performed on the monthly closing prices of the Standard & Poor’s stocksfor the period September 1995 - August 1996. Each stock price is normalized by its maximumvalue during this period. The action of the Kohonen algorithm in placing similar clustersnear each other on the map shows clearly in the gradual transition from gainers on the leftto decliners on the right.

Kmap offers two special drilldowns for relatively small datasets like this one. It canread a file listing the cluster for each record and display the membership of each clusteron a popup menu, as shown in Figure 1. In addition, the “View” menu has an entry thatleads to a “Record Finder” panel that can find which cluster a particular record is in andhighlight that cluster, as shown for the case of Western Atlas and cluster 25 in the figure.A hypothetical scenario for using these capabilities would be to check the membership of againer cluster like node 32, find that it contains two petroleum industry members (Mobiland Texaco), wonder where Western Atlas (also a petroleum industry member) had beenplaced, use the Record Finder, and discover it in cluster 25, a direct neighbor of cluster 32.There is a substantial number of such intuitively satisfying relationships in the map.

If the cluster membership file mentioned above also contains the attribute values for theindividual records, Kmap can provide an additional drilldown that allows comparing thebehavior of individual records in a cluster, as shown in Figure 2.

3.2 Supermarket Shoppers

The example in this section explains the motivation for “topten” mode and describes someof the features of that mode. The clustering in this case was performed on a month’sspending data by some 36,000 supermarket customers in about 100 product categories (dairy,confectionery, baby products, etc.). As shown in Figure 3, the Kmap “weights” mode usedin the previous example can be used to view these clustering results as well. Kmap providespoint labels to help identify each point even in a crowded plot, and this feature can beused to identify the three attributes with the largest weights in cluster 3 as Sugar, Tea,and Desserts/Puddings, for example. But the large number of attributes makes this a slowprocess. Furthermore, unlike the months of the previous example, there is no sequentialrelationship among the attributes that makes a plot view particularly useful. A much fastergrasp of the clusters’ nature can be obtained in this case from an initial map that shows whatthe top 3 attributes are for each cluster, with a rich set of available drilldowns for furtherinvestigation. This is what “topten” mode does.

4

Figure 1: Monthly closing prices for the Standard & Poor’s 500 stocks over a one-year period

5

Figure 2: Drilldowns available in weights mode (for example in Figure 1)

In topten mode, each rectangle in the map conveys the following information about itscorresponding cluster: the outer frame’s degree of grayness is proportional to its population.the inner rectangle’s color indicates the value of the supplemental attribute chosen from the“Color By” menu. In Figure 3, for example, blue indicates above-average customer Age,while pink denotes below-average attribute values. The numerical values of the populationand Age are also shown for each cluster.

By default, the top 3 attributes are sorted by their neural weights, but this can be changed(for the main map as well as the drilldowns) via the “Sort By” menu, which allows re-sortingby the attribute mean value or the five other statistics mentioned earlier. The color of thebullets in this example correspond to a rainbow spectrum matched to the range of categorynumbers, so attributes with similar colors have similar category numbers. The bullet shapescarry no information in this example. However, for the wine recommender described in [3],we used a different scheme in which the bullet color indicated the wine color and the bulletshape indicated price (squares cost more than circles, etc.).

Clicking on a cluster’s rectangle pops up a menu that can be used for further drilldown,as shown in Figure 3. The first item on the menu allows getting more information on aparticular attribute (field), such as the distribution of its revenue among the clusters. Forexample, the 11% of the total customer population represented by cluster 3 accounted for24% of the spending on Tea, for a “wallet share enrichment” of 2.18 – i.e., chances of findinga tea buyer in this cluster are more than twice that for the overall customer population.

The second item on the drilldown menu allows getting more detail on a particular cluster.Clicking it brings up a scrollable table showing the neural weights and other field statisticsfor all the attributes. For convenience, the distribution map for a given attribute can bebrought up by clicking the attribute in this table.

The third item on the drilldown menu brings up an enlarged plot of the neural weightsfor that cluster, as shown in the figure.

The third Kmap view, “sizes”, is shown at the top left of Figure 3 for reference. Inthis mode, the cluster population is conveyed by the size of the corresponding circle (thepopulations are very close in this case), and the color convention is the same as for the“topten” view.

The drilldowns shown are available in all 3 modes, and each drilldown figure can beprinted individually.

6

Figure 3: Comparison of views and drilldowns. Starting at top left, the “sizes”, “weights”,and “topten” views of the supermarket shopper clustering results for 9 clusters. The topright view also shows popup menu leading to several drilldowns: into a specific attribute(left center and right), into the weights view of a cluster (right center), and into the toptenview of a cluster (bottom right). Clicking an attribute’s button in the view at bottom rightis a second way to obtain the attribute map at bottom left.

7

Figure 4: Supermarket retail data clustering results

8

One application of this clustering capability is in the generation of personalized recom-mendations, as we describe in [2]. These recommendations were being generated for partic-ipants in a trial program using PDAs(personal digital assistants) for supermarket shopping.Since all participants had above-average spending 1, we clustered only the above-averagespenders and obtained the results shown in Figure 4. Note the emergence of a cluster (4)with strong spending on baby products. As the three drilldown maps of attribute shareshow, these clusters are sharper than the previous ones in that the “wallet share enrich-ments” are larger. For example, cluster 4 has 10% of the total population but 56% of thetotal spending on baby products, for a an enrichment of 5.6. This sort of enrichment canbe quite valuable for customer relationship and marketing purposes. It also impacted thegeneration of recommendations. For example, the most popular chocolates in cluster 4 turnout to be quite different from those most frequently preferred by the population as a whole.[2].

3.3 Credit Data

The example in this section shows the value of adroitness in the selection of the attributesused for clustering. Figure 5 shows clustering results from a study on mining a large databasefrom a credit bureau. Twelve of the 44 attributes that were used were created by categorizingeach outstanding credit card balance into a small matrix of several ranges of balance andseveral ranges of the age of the account. This helped the domain expert to find about adozen different segments on the map. For example, the top attributes for clusters 18 and28 are “BANK BAL03” and “BANK BAL06” , which both correspond to large balances inaccounts less than 6 months old, hence the analyst’s label ”BIG CHURN” for people whocontinually accept bank offers for new credit cards with low initial interest rates, and thentransfer to another card at the end of the low interest rate period. This ability to discernvarious segments of credit usage can be very helpful to a bank’s efforts to market its variousproducts.

The second example in this section shows the clustering results on a subset of the popu-lation whose credit rating is within a “gray area” that normally makes it difficult to obtaincredit. One of the supplemental attributes (not used in the clustering, of course) was whetheran individual in this subset actually went bankrupt within the next six months. When thisattribute is used to color the cluster map (Figure 6) , an interesting pattern appears: farfrom being randomly distributed, the actual bankruptcies are concentrated on the right edgeof the map. The bankruptcy rate for the population as a whole is about 0.5%, and most ofthe clusters have a rate below this value, while one of the clusters at the right of the maphas an 8% bankruptcy rate, which is 16 times higher than the average value. The dominantattributes of this cluster are all old accounts with very large balances, which tend to beprominent in the other clusters with above-average bankruptcies as well. The suggestion isthat a number of people are being denied credit without good reason, and that a more refinedrating system could free up credit to a group that could probably really use it. Customersand the bank would both benefit.

1like the children in Lake Wobegon

9

File View Color By Sort By Exit

credit100 Kmap -- grayest = 30% of 1073960 records, sorted by weights, colored by BEACON_96_SCR (mean=633.494)

BANK_BAL05BANK_BAL08BANK_BAL11

564.5 116799


592.03 19598


609.34 24897

BANK_BAL11BANK_BAL10POPDEPT_BAL

549.62 155396

AUTO_BALINSTFIN_BALOTHDEPT_BAL

471.49 431(95

OTHDEPT_BALPOPDEPT_BALBANK_BAL11

568.35 1725(0%)94

POPDEPT_BALBANK_BAL11OTHDEPT_BAL

597.69 370793


604.53 494492

HOMEFURN_BALPOPDEPT_BALHOME_BAL

553.64 1371(91

HOMEFURN_BALHOME_BALOTHDEPT_BAL

581.72 2373(90


587.01 28089


584.87 28888


610.58 44787


596.92 470586

OTHDEPT_BALBANK_BAL11BANK_BAL10

581.69 165785

OTHDEPT_BALBANK_BAL11POPDEPT_BAL

588.18 3649(0%)84

POPDEPT_BALOTHDEPT_BALBANK_BAL11

585.21 404483

POPDEPT_BALBANK_BAL11MORT_BAL

612.58 10K(82

HOMEFURN_BALHOME_BALAUTO_BAL

607.99 7956(81

ELEC_BALHOMEFURN_BALBANK_BAL11

630.97 4319(80


611.89 36479


625.23 63378

BANK_BAL08BANK_BAL11MORT_BAL

639.46 77277

BANK_BAL11MORT_BALPOPDEPT_BAL

620.66 10K(76

BANK_BAL11OTHDEPT_BALPOPDEPT_BAL

608.83 428675

OTHDEPT_BALAUTO_BALMORT_BAL

600.93 4763(0%)74


610.86 948073


619.82 16K(72

CREDITU_BALAUTO_BALMORT_BAL

680.76 9648(71


635.64 3146(70


620.21 30369


640 6698(068


666.35 94567


643.55 599566

BANK_BAL11MORT_BALBANK_BAL10

646.42 17K(65

CLOTHING_BALOTHDEPT_BALBANK_BAL11

605.9 1443(0%)64

OTHDEPT_BALPOPDEPT_BALAUTO_BAL

625.28 14K(63

AUTO_BALPOPDEPT_BALOTHDEPT_BAL

599.59 582662

AUTO_BALBANK_BAL11MORT_BAL

648.65 13K(161

AUTO_BALMORT_BALBANK_BAL11

595.68 3060(60


621.92 24759


656.61 54458


678.22 79357


592.4 11K(156


572.57 643255


585.82 2455(0%)54

BANK_BAL10BANK_BAL11OTHDEPT_BAL

592.15 16K(53


673.48 43K(52

SECUR_BALMORT_BALAUTO_BAL

658.7 2911(051

STUDENT_BALINSTFIN_BALBANK_BAL11

638.2 5578(050


671.21 44949


685.3 11K48


705.06 15K47


541.33 199246


579.84 525045

OILCARD_BALPOPDEPT_BALBANK_BAL11

546.99 1071(0%)44

OILCARD_BALOTHDEPT_BALBANK_BAL11

571.79 185143

SECUR_BALINSTFIN_BALAUTO_BAL

628.16 396442

INSTFIN_BALMORT_BALAUTO_BAL

607.79 2162(41


630.63 2416(40


622.64 18939


640.46 11238


614.82 28337


571.51 191636


608.55 711835


643.51 17K(2%)34

OILCARD_BALOTHDEPT_BALPOPDEPT_BAL

587.68 458033

POPDEPT_BALOTHDEPT_BALAUTO_BAL

636.55 21K(32


659.73 9814(31


647.16 13K(130


688.92 43329


687.38 18228


688.73 16127

MORT_BALBANK_BAL01BANK_BAL02

651.77 461726

REVFINAN_BALMORT_BALBANK_BAL11

638.43 316225

JUMBO_BALBANK_BAL12CHARGECARD_BAL

733.6 7103(0%)24


692.11 558623


712.44 11K(22

JEWELRY_BALOTHDEPT_BALAUTO_BAL

576.49 2296(21

GOV_BALSTUDENT_BALMORT_BAL

593.14 1820(20


684.04 39919


689.5 414318


697 4349(017

MORT_BALAUTO_BALBANK_BAL11

694.96 14K(16


666.58 454115

CHARGECARD_BALBANK_BAL11MORT_BAL

659.36 2232(0%)14


703.63 14K(13

OTHDEPT_BALBANK_BAL11POPDEPT_BAL

661.06 27K(12

GOV_BALSTUDENT_BALOTHDEPT_BAL

575.52 6223(11

OTHUTIL_BALOTHDEPT_BALAUTO_BAL

561.92 1608(10

MORT_BALBANK_BAL11AUTO_BAL

677.99 2719


701.01 8118


727.16 18K7

MORT_BALBANK_BAL11OTHDEPT_BAL

723.64 26K(6


725.88 34K(5


727.19 30K(3%)4


690.89 22K(3


693.1 13K(12

HOMEFURN_BALHOME_BALELEC_BAL

621.66 12K(11

BANK_BAL11INSTFIN_BALBANK_BAL12

718.52 326K(0

Figure 5: Credit usage customer segments

10

File View Color By Sort By Exit

seg650E Kmap -- grayest = 18% of 95157 records, sorted by weights, colored by ENH_DAS_SCR (mean=0.003)

POPDEPT_BALBANK_BAL11BANK_BAL12

0 208(0%)99


0.02 458(98


0 957(1%)97

BANK_BAL11OTHDEPT_BALBANK_BAL08

0.01 252(096

OTHDEPT_BALBANK_BAL11CLOTHING_BAL

0 402(0%)95

OTHDEPT_BALBANK_BAL12BANK_BAL11

0 112(0%)94


0 576(0%)93


0 279(0%)92


0 379(0%)91


0 463(0%)90


0.01 261(089


0.01 391(88

BANK_BAL11POPDEPT_BALOTHDEPT_BAL

0.02 308(087


0 534(0%)86


0 231(0%)85


0 428(0%)84


0 589(0%)83


0 345(0%)82


0 654(0%)81


0 2237(2%)80


0.04 236(079


0.03 308(78


0.02 154(077


0 398(0%)76


0 656(0%)75

POPDEPT_BALMORT_BALBANK_BAL11

0 637(0%)74


0 1200(1%)73


0 517(0%)72


0 1377(1%)71


0 4365(5%)70


0.08 208(069


0.02 310(68


0 254(0%)67


0.02 250(066


0 1310(1%)65


0 2143(2%)64

HOMEFURN_BAHOME_BALMORT_BAL

0 896(0%)63

HOMEFURN_BAHOME_BALBANK_BAL11

0.02 177(0%62

HOMEFURN_BALHOME_BALMORT_BAL

0 509(0%)61

CREDITU_BALAUTO_BALBANK_BAL11

0 1568(2%)60


0.01 353(059


0.02 493(58


0 520(0%)57


0.01 579(056


0 430(0%)55

BANK_BAL11POPDEPT_BALBANK_BAL10

0 869(0%)54

ELEC_BALHOMEFURN_BAMORT_BAL

0 569(0%)53

ELEC_BALHOMEFURN_BABANK_BAL11

0 289(0%)52

CREDITU_BALMORT_BALAUTO_BAL

0 616(0%)51

CREDITU_BALAUTO_BALBANK_BAL11

0 571(0%)50


0.02 347(049


0 478(0%)48


0 860(0%)47

BANK_BAL08BANK_BAL05AUTO_BAL

0 877(0%)46


0 951(0%)45


0 697(0%)44


0 1736(2%)43

BANK_BAL11AUTO_BALMORT_BAL

0 628(0%)42

SECUR_BALMORT_BALAUTO_BAL

0 206(0%)41


0 365(0%)40


0.03 309(039


0 628(0%)38


0 594(0%)37


0 1436(2%)36


0 563(0%)35


0 1393(1%)34

BANK_BAL11AUTO_BALOTHDEPT_BAL

0 3050(3%)33

OTHDEPT_BALBANK_BAL11MORT_BAL

0 653(0%)32

SECUR_BALAUTO_BALINSTFIN_BAL

0 656(0%)31

AUTO_BALBANK_BAL11POPDEPT_BAL

0 1412(1%)30


0 401(0%)29


0.02 492(28


0 1064(1%)27


0 807(0%)26


0 144(0%)25


0 1686(2%)24


0 287(0%)23

OTHDEPT_BALPOPDEPT_BALMORT_BAL

0 895(0%)22

OTHDEPT_BALBANK_BAL11AUTO_BAL

0 1892(2%)21

AUTO_BALBANK_BAL11MORT_BAL

0 4098(4%)20


0 684(0%)19


0 1332(1%18


0 1762(2%)17


0 1404(1%)16


0 2132(2%)15


0 1427(1%)14


0 570(0%)13


0 378(0%)12

CHARGECARD_BAMORT_BALBANK_BAL12

0 284(0%)11

BANK_BAL10OILCARD_BALBANK_BAL11

0 1276(1%)10


0 389(0%)9


0 944(0%)8


0 356(0%)7


0 307(0%)6


0 640(0%)5


0 310(0%)4

INSTFIN_BALMORT_BALBANK_BAL11

0 174(0%)3


0 831(0%)2

GOV_BALAUTO_BALBANK_BAL11

0 155(0%)1

OTHDEPT_BALBANK_BAL11INSTFIN_BAL

0 17K(18%)0

Figure 6: Actual bankruptcies among customers rated as marginal credit risks.

11

File

Vie

wC

olor

By

Sor

t By

Exi

t

seg

650E

Km

ap -

- g

raye

st =

18%

of

9515

7 re

cord

s, s

ort

ed b

y w

eig

hts

, co

lore

d b

y E

NH

_DA

S_S

CR

(m

ean

=0.0

03)

PO

PD

EP

T_B

AL

BA

NK

_BA

L11

BA

NK

_BA

L12

0 208(0%)

99

BA

NK

_BA

L11

BA

NK

_BA

L12

BA

NK

_BA

L08

0.02 458(

98

BA

NK

_BA

L11

MO

RT

_BA

LB

AN

K_B

AL

12

0 957(1%)

97

BA

NK

_BA

L11

OT

HD

EP

T_B

AL

BA

NK

_BA

L08

0.01 252(0

96

OT

HD

EP

T_B

AL

BA

NK

_BA

L11

CL

OT

HIN

G_B

AL

0 402(0%)

95

OT

HD

EP

T_B

AL

BA

NK

_BA

L12

BA

NK

_BA

L11

0 112(0%)

94

BA

NK

_BA

L12

MO

RT

_BA

LB

AN

K_B

AL

11

0 576(0%)

93

BA

NK

_BA

L06

BA

NK

_BA

L09

MO

RT

_BA

L

0 279(0%)

92

BA

NK

_BA

L03

BA

NK

_BA

L06

MO

RT

_BA

L

0 379(0%)

91

MO

RT

_BA

LB

AN

K_B

AL

11A

UT

O_B

AL

0 463(0%)

90

BA

NK

_BA

L12

BA

NK

_BA

L11

BA

NK

_BA

L08

0.01 261(0

89

BA

NK

_BA

L11

BA

NK

_BA

L02

BA

NK

_BA

L05

0.01 391(

88

BA

NK

_BA

L11

PO

PD

EP

T_B

AL

OT

HD

EP

T_B

AL

0.02 308(0

87

PO

PD

EP

T_B

AL

BA

NK

_BA

L11

OT

HD

EP

T_B

AL

0 534(0%)

86

OT

HD

EP

T_B

AL

PO

PD

EP

T_B

AL

BA

NK

_BA

L11

0 231(0%)

85

BA

NK

_BA

L11

BA

NK

_BA

L12

MO

RT

_BA

L

0 428(0%)

84

BA

NK

_BA

L12

BA

NK

_BA

L11

BA

NK

_BA

L02

0 589(0%)

83

BA

NK

_BA

L09

BA

NK

_BA

L12

MO

RT

_BA

L

0 345(0%)

82

BA

NK

_BA

L06

BA

NK

_BA

L09

MO

RT

_BA

L

0 654(0%)

81

MO

RT

_BA

LB

AN

K_B

AL

11O

TH

DE

PT

_BA

L

0 2237(2%)

80

BA

NK

_BA

L12

BA

NK

_BA

L05

BA

NK

_BA

L08

0.04 236(0

79

BA

NK

_BA

L11

BA

NK

_BA

L08

BA

NK

_BA

L05

0.03 308(

78

PO

PD

EP

T_B

AL

BA

NK

_BA

L08

BA

NK

_BA

L11

0.02 154(0

77

PO

PD

EP

T_B

AL

BA

NK

_BA

L11

OT

HD

EP

T_B

AL

0 398(0%)

76

PO

PD

EP

T_B

AL

OT

HD

EP

T_B

AL

AU

TO

_BA

L

0 656(0%)

75

PO

PD

EP

T_B

AL

MO

RT

_BA

LB

AN

K_B

AL

11

0 637(0%)

74

BA

NK

_BA

L12

BA

NK

_BA

L11

MO

RT

_BA

L

0 1200(1%)

73

MO

RT

_BA

LB

AN

K_B

AL

12B

AN

K_B

AL

11

0 517(0%)

72

MO

RT

_BA

LA

UT

O_B

AL

BA

NK

_BA

L11

0 1377(1%)

71

MO

RT

_BA

LB

AN

K_B

AL

11A

UT

O_B

AL

0 4365(5%)

70

BA

NK

_BA

L05

BA

NK

_BA

L08

BA

NK

_BA

L11

0.08 208(0

69

BA

NK

_BA

L08

BA

NK

_BA

L11

BA

NK

_BA

L05

0.02 310(

68

PO

PD

EP

T_B

AL

BA

NK

_BA

L08

BA

NK

_BA

L11

0 254(0%)

67

PO

PD

EP

T_B

AL

BA

NK

_BA

L02

BA

NK

_BA

L05

0.02 250(0

66

PO

PD

EP

T_B

AL

OT

HD

EP

T_B

AL

AU

TO

_BA

L

0 1310(1%)

65

PO

PD

EP

T_B

AL

OT

HD

EP

T_B

AL

BA

NK

_BA

L11

0 2143(2%)

64

HO

ME

FU

RN

_BA

HO

ME

_BA

LM

OR

T_B

AL

0 896(0%)

63

HO

ME

FU

RN

_BA

HO

ME

_BA

LB

AN

K_B

AL

11

0.02 177(0%

62

HO

ME

FU

RN

_BA

LH

OM

E_B

AL

MO

RT

_BA

L

0 509(0%)

61

CR

ED

ITU

_BA

LA

UT

O_B

AL

BA

NK

_BA

L11

0 1568(2%)

60

BA

NK

_BA

L08

BA

NK

_BA

L05

BA

NK

_BA

L11

0.01 353(0

59

BA

NK

_BA

L08

BA

NK

_BA

L05

BA

NK

_BA

L11

0.02 493(

58

BA

NK

_BA

L08

MO

RT

_BA

LB

AN

K_B

AL

05

0 520(0%)

57

BA

NK

_BA

L11

BA

NK

_BA

L08

BA

NK

_BA

L05

0.01 579(0

56

PO

PD

EP

T_B

AL

BA

NK

_BA

L08

BA

NK

_BA

L05

0 430(0%)

55

BA

NK

_BA

L11

PO

PD

EP

T_B

AL

BA

NK

_BA

L10

0 869(0%)

54

EL

EC

_BA

LH

OM

EF

UR

N_B

AM

OR

T_B

AL

0 569(0%)

53

EL

EC

_BA

LH

OM

EF

UR

N_B

AB

AN

K_B

AL

11

0 289(0%)

52

CR

ED

ITU

_BA

LM

OR

T_B

AL

AU

TO

_BA

L

0 616(0%)

51

CR

ED

ITU

_BA

LA

UT

O_B

AL

BA

NK

_BA

L11

0 571(0%)

50

BA

NK

_BA

L05

BA

NK

_BA

L08

BA

NK

_BA

L02

0.02 347(0

49

BA

NK

_BA

L05

BA

NK

_BA

L08

BA

NK

_BA

L02

0 478(0%)

48

BA

NK

_BA

L05

BA

NK

_BA

L08

BA

NK

_BA

L02

0 860(0%)

47

BA

NK

_BA

L08

BA

NK

_BA

L05

AU

TO

_BA

L

0 877(0%)

46

BA

NK

_BA

L08

BA

NK

_BA

L11

AU

TO

_BA

L

0 951(0%)

45

BA

NK

_BA

L11

BA

NK

_BA

L05

BA

NK

_BA

L08

0 697(0%)

44

BA

NK

_BA

L11

MO

RT

_BA

LB

AN

K_B

AL

10

0 1736(2%)

43

BA

NK

_BA

L11

AU

TO

_BA

LM

OR

T_B

AL

0 628(0%)

42

SE

CU

R_B

AL

MO

RT

_BA

LA

UT

O_B

AL

0 206(0%)

41

AU

TO

_BA

LM

OR

T_B

AL

BA

NK

_BA

L11

0 365(0%)

40

BA

NK

_BA

L02

BA

NK

_BA

L05

BA

NK

_BA

L08

0.03 309(0

39

BA

NK

_BA

L05

BA

NK

_BA

L02

BA

NK

_BA

L08

0 628(0%)

38

BA

NK

_BA

L05

MO

RT

_BA

LB

AN

K_B

AL

08

0 594(0%)

37

BA

NK

_BA

L05

BA

NK

_BA

L08

BA

NK

_BA

L02

0 1436(2%)

36

BA

NK

_BA

L07

BA

NK

_BA

L04

BA

NK

_BA

L08

0 563(0%)

35

BA

NK

_BA

L08

BA

NK

_BA

L11

BA

NK

_BA

L05

0 1393(1%)

34

BA

NK

_BA

L11

AU

TO

_BA

LO

TH

DE

PT

_BA

L

0 3050(3%)

33

OT

HD

EP

T_B

AL

BA

NK

_BA

L11

MO

RT

_BA

L

0 653(0%)

32

SE

CU

R_B

AL

AU

TO

_BA

LIN

ST

FIN

_BA

L

0 656(0%)

31

AU

TO

_BA

LB

AN

K_B

AL

11P

OP

DE

PT

_BA

L

0 1412(1%)

30

BA

NK

_BA

L02

BA

NK

_BA

L05

BA

NK

_BA

L08

0 401(0%)

29

BA

NK

_BA

L02

BA

NK

_BA

L08

BA

NK

_BA

L05

0.02 492(

28

BA

NK

_BA

L05

BA

NK

_BA

L02

BA

NK

_BA

L08

0 1064(1%)

27

BA

NK

_BA

L01

BA

NK

_BA

L04

BA

NK

_BA

L05

0 807(0%)

26

BA

NK

_BA

L04

BA

NK

_BA

L07

BA

NK

_BA

L08

0 144(0%)

25

BA

NK

_BA

L07

BA

NK

_BA

L04

BA

NK

_BA

L08

0 1686(2%)

24

BA

NK

_BA

L10

BA

NK

_BA

L11

MO

RT

_BA

L

0 287(0%)

23

OT

HD

EP

T_B

AL

PO

PD

EP

T_B

AL

MO

RT

_BA

L

0 895(0%)

22

OT

HD

EP

T_B

AL

BA

NK

_BA

L11

AU

TO

_BA

L

0 1892(2%)

21

AU

TO

_BA

LB

AN

K_B

AL

11M

OR

T_B

AL

0 4098(4%)

20

BA

NK

_BA

L02

BA

NK

_BA

L05

MO

RT

_BA

L

0 684(0%)

19

BA

NK

_BA

L02

BA

NK

_BA

L05

BA

NK

_BA

L11

0 1332(1%

18

BA

NK

_BA

L02

BA

NK

_BA

L05

AU

TO

_BA

L

0 1762(2%)

17

BA

NK

_BA

L01

BA

NK

_BA

L02

BA

NK

_BA

L04

0 1404(1%)

16

BA

NK

_BA

L01

BA

NK

_BA

L04

BA

NK

_BA

L02

0 2132(2%)

15

ST

UD

EN

T_B

AL

INS

TF

IN_B

AL

BA

NK

_BA

L11

0 1427(1%)

14

RE

VF

INA

N_B

AL

MO

RT

_BA

LB

AN

K_B

AL

11

0 570(0%)

13

RE

VF

INA

N_B

AL

MO

RT

_BA

LB

AN

K_B

AL

11

0 378(0%)

12

CH

AR

GE

CA

RD

_BA

MO

RT

_BA

LB

AN

K_B

AL

12

0 284(0%)

11

BA

NK

_BA

L10

OIL

CA

RD

_BA

LB

AN

K_B

AL

11

0 1276(1%)

10

BA

NK

_BA

L02

MO

RT

_BA

LB

AN

K_B

AL

05

0 389(0%)

9

BA

NK

_BA

L02

MO

RT

_BA

LB

AN

K_B

AL

05

0 944(0%)

8

BA

NK

_BA

L02

BA

NK

_BA

L01

BA

NK

_BA

L05

0 356(0%)

7

BA

NK

_BA

L01

BA

NK

_BA

L04

BA

NK

_BA

L02

0 307(0%)

6

ST

UD

EN

T_B

AL

INS

TF

IN_B

AL

BA

NK

_BA

L11

0 640(0%)

5

ST

UD

EN

T_B

AL

INS

TF

IN_B

AL

BA

NK

_BA

L11

0 310(0%)

4

INS

TF

IN_B

AL

MO

RT

_BA

LB

AN

K_B

AL

11

0 174(0%)

3

INS

TF

IN_B

AL

MO

RT

_BA

LA

UT

O_B

AL

0 831(0%)

2

GO

V_B

AL

AU

TO

_BA

LB

AN

K_B

AL

11

0 155(0%)

1

OT

HD

EP

T_B

AL

BA

NK

_BA

L11

INS

TF

IN_B

AL

0 17K(18%)

0

Fig

ure

7:A

ctual

ban

kru

ptc

ies

amon

gcu

stom

ers

rate

das

mar

ginal

cred

itri

sks.

12

3.4 Census Data

The example in this section shows how Kmap can handle categorical values. The clusteringwas performed using publicly available census data with a mixture of 12 continuous andcategorical attributes for each person. The data comes from a municipality located near aNational Laboratory, which accounts for the unusually large number of advanced degreesand large diversity of backgrounds.

Categorical attributes pose an extra degree of complexity in neural clustering becausethey are expanded to a one-to-N mapping: for example, if the original attribute were “animaltype” with values “cat, dog, cow”, it would be replaced for clustering purposes by threeattributes “is a cat”, “is a dog”, “is a cow” that can each assume values 1 or 0. So the totalnumber of attributes actually used in the clustering is data-dependent.

Kmap handles this by making two passes through the results file, obtaining the number ofdistinct values for each categorical attribute on the first pass and constructing the effectiveattributes coming from each original attribute on the second pass. Sample Kmap resultsare shown in Figure 9 and are compared to the default Intelligent Miner visualizer’s viewin Figure 8. The two views each present interesting information in different ways, but theyare indeed consistent with each other. For example, one of the supplemental attributes (i.e.,not used for clustering) is whether household income is above or below $50,000. Using thisattribute to color the Kmap in Figure 9 points to clusters 18 and 24 as being highest inthis attribute, with 70% of the households in cluster 24 having income above $50,000. Thisagrees with the corresponding pie chart in Figure 8.

One of the original categorical attributes was “education”, and one of its values was“doctorate”. Drilling down into the effective categorical attribute “Doctorate(education)”(or “has a PhD” in our earlier terminology) in Figure 9 (comfortingly, perhaps) an overlapbetween clusters with high PhD populations and clusters with high income. Drilling downinto the “educ num” attribute, a number proportional to the number of years of education,shows similar results. Both results are consistent with the drilldowns in Figure 8. Kmapdoes a good job showing the attribute distributions on the map. The default visualizer doesa good job showing the categorical attributes in pie charts. Both can be useful.

3.5 Appendix: A Tale of Two Clusters

Figure 10 offers a more detailed look at the kind of information that Kmap can extract fromthe supermarket customer clustering results mentioned earlier.

4 Summary

Kmap has proved to be a versatile visualizer for Kohonen self-organizing map clusteringresults obtained from a variety of datamining engagements. Analysts value being able tosee the map topography, as well as the drilldowns provided, many of which resulted fromtheir suggestions. We provided examples of Kmap’s use in four different areas of commerce.Currently we are applying Kmap to technical data generated by the simulators and hardwarecontrol systems of massively parallel supercomputers.

13

census36

8 seg0

7seg30

6seg185

seg34

5

seg32

5

seg12

5

seg24

5

seg5

4

seg23

education

HS-grad

relationship

Own-childHusband

marital_status

Married-AF-spouseMarried-civ-spouse

educ_num sex

MaleFemale

occupation

Exec-managerialSales

Transport-movingMachine-op-inspct

Craft-repairOther

workclass

Federal-govState-gov

Self-emp-incLocal-gov

Self-emp-not-incOther

relationship

Not-in-familyOther-relative

Own-child

Husband

marital_status


occupation

Transport-movingSales

Farming-fishingExec-managerial

Craft-repairOther

sex

MaleFemale

education

10thAssoc-voc

7th-8thSome-college

HS-gradOther

education

Assoc-acdmDoctorate

Prof-schoolMasters

Bachelors

educ_num relationship

WifeNot-in-family

Own-childOther-relative

Husband

marital_status

Married-civ-spouse

occupation

Tech-supportCraft-repair

SalesProf-specialty

Exec-managerialOther

sex

Male

relationship

Unmarried

marital_status

Divorced

sex

MaleFemale

age occupation

SalesOther-service

Exec-managerialProf-specialty

Adm-clericalOther

hours_per_week

relationship

Own-childWife

sex

MaleFemale

marital_status


occupation

?Other-service

Exec-managerialProf-specialty

Adm-clericalOther

age hours_per_week

education

Some-college

relationship

Own-childOther-relative

Husband

marital_status


educ_num sex

Male

workclass

State-govFederal-gov

Private

workclass

Federal-govState-govLocal-gov

Self-emp-incSelf-emp-not-inc

education

Assoc-acdmDoctorate

Prof-schoolMasters

BachelorsOther

educ_num relationship

UnmarriedOwn-child

Other-relative

Husband

marital_status

WidowedMarried-AF-spouse

Married-civ-spouse

occupation

Adm-clericalProtective-serv

SalesExec-managerial

Prof-specialtyOther

relationship

Own-child

education

Some-college

age marital_status

WidowedMarried-civ-spouse

Married-spouse-absentSeparated

Never-married

educ_num hours_per_week

relationship

Not-in-family

marital_status

Never-married

sex

Female

occupation

Exec-managerialSales

Prof-specialtyOther-service

Adm-clericalOther

education

Assoc-vocAssoc-acdm

MastersHS-grad

Some-collegeOther

age

ev36_11 Cluster seg24 4.57% of population

[income]

>50K

Figure 9: Census data clustering results seen with Kmap

15

Figure 10: A Tale of Two Clusters: rather different lifestyles emerge during examination ofthe supermarket shopper clustering results. Cluster 4 seems to be an older group of serioushomebakers likely to be found having a tea party, whereas cluster 9 seems to be a younger,health-conscious group more likely to be found jogging - and not smoking!

16

5 Acknowledgements

We are grateful for many helpful discussions with our colleagues Joseph Bigus and MichaelRothman.

References

[1] Kotlyar V., Viveros M.S., Duri S.S., Lawrence R.D., Almasi G.S. “A Case Studyin Information Delivery to Mass Retail Markets” In the Proceedings of the 10thInternational Conference on Database and Expert Systems Applications (DEXA),Florence, Italy, August/September 1999, pp. 842-851. Published by Springer-Verlagas Lecture Notes in Computer Science, vol 1677

[2] R. D. Lawrence, G. S. Almasi, V. Kotlyar, M. S. Viveros, and S. S. Duri , “Person-alization of Product Recomendations in Mass Retail Markets”, Data Mining andDiscovery 3(2): 171-195 (1999).

[3] G. S. Almasi and A. J. Lee, “PDA-based Personalized Recommender Agent”, Pro-ceedings of the 5th International Conference on the Practical Application of In-telligent Agent and Multi Agent Technology (PAAM 2000), Manchester, England,April 2000, pp. 299-309.

[4] H. Edelstein, “Mining Large Databases: A Case Study”,www.twocrows.com/largedb.pdf

[5] “JClass Java Components”, www.klgroup.com

[6] G. Almasi, “Kmap Demo”, www.gsalmasi.com

[7] Intelligent Miner for Data, www.ibm.com/software/data/iminer/fordata

[8] T. Kohonen, “Self-Organizing Maps”, Springer-Verlag, 1995.

[9] R. D. Lawrence, G. S. Almasi, and H. E. Rushmeier, “A Scalable Parallel Algorithmfor Self-Organizing Maps with Applications to Sparse Data Mining Problems”,Data Mining and Knowledge Discovery, Vol. 3, 1999, pp. 171-195. (U. S. Patent6,260,036, July 10, 2001.)

[10] J. C. Bezdek and N. R. Pal, “Some New Indexes of Cluster Validity”, IEEE Trans-actions on Systems, Man, and Cybernetics – Part B: Cybernetics Vol. 28, No. 3,June 1998, pp. 301 - 315.

17

ibm research reportibm research report kmap - a visualizer for kohonen self-organizing map...

Documents