ibm research reportibm research report kmap - a visualizer for kohonen self-organizing map...

18
RC22517 (W0207-065) July 15, 2002 Computer Science IBM Research Report KMAP - A Visualizer for Kohonen Self-organizing Map Clustering Results George S. Almasi, Richard D. Lawrence IBM Research Division Thomas J. Watson Research Center P.O. Box 218 Yorktown Heights, NY 10598 Research Division Almaden - Austin - Beijing - Delhi - Haifa - India - T. J. Watson - Tokyo - Zurich LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g. , payment of royalties). Copies may be requested from IBM T. J. Watson Research Center , P. O. Box 218, Yorktown Heights, NY 10598 USA (email: [email protected]). Some reports are available on the internet at http://domino.watson.ibm.com/library/CyberDig.nsf/home .

Upload: others

Post on 26-Jan-2021

18 views

Category:

Documents


0 download

TRANSCRIPT

  • RC22517 (W0207-065) July 15, 2002Computer Science

    IBM Research Report

    KMAP - A Visualizer for Kohonen Self-organizing MapClustering Results

    George S. Almasi, Richard D. LawrenceIBM Research Division

    Thomas J. Watson Research CenterP.O. Box 218

    Yorktown Heights, NY 10598

    Research DivisionAlmaden - Austin - Beijing - Delhi - Haifa - India - T. J. Watson - Tokyo - Zurich

    LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Reportfor early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests.After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g. , payment of royalties). Copies may be requested from IBM T. J. Watson Research Center , P. O. Box 218, Yorktown Heights, NY 10598 USA (email: [email protected]). Some reports are available on the internet at http://domino.watson.ibm.com/library/CyberDig.nsf/home .

  • KMAP - A Visualizer for KohonenSelf-organizing Map Clustering Results

    George S. Almasi and Richard D. LawrenceIBM T.J. Watson Research Center

    P.O. Box 218Yorktown Heights, New York 10598

    [email protected], [email protected]

    August 17, 2002

    Abstract

    We describe a Java-based visualizer that preserves the topological information in-herent in the results of cluster data mining that used the Kohonen self-organizing mapapproach. The visualizer provides a tool with a number of views and drilldowns that ananalyst can use to circumvent the data interpretation bottleneck and quickly evaluatethe results of the map produced by the datamining run. Being written in Java, it isportable across a large number of computing platforms.

    Keywords: visualizer, data mining, Kohonen, self-organizing map, clustering,Java.

    1 Introduction

    The Kmap visualizer described in this paper was designed to allow rapid evaluation ofthe results of data mining performed using the Kohonen self-organizing map [8] clusteringapproach. A key feature of this approach is that, in addition to grouping similar datarecords into clusters, it places the clusters on a two-dimensional map in such a way thatsimilar clusters are near each other. The nature of a cluster in a particular neighborhood ofthe map says something about that neighborhood. In other words, the contents of individualclusters and the topography of the cluster map are both of interest.

    The current authors wrote a highly scalable parallel version of the self-organizing mapalgorithm [9] which is used by IBM’s Intelligent Miner for Data [7] . This program wasrecently put to the test in a demonstration of mining very large databases [4], and theparallel speedup that was obtained was so good (over 14X for a 16-processor SP2) that thenormal ratio of time needed to perform the datamining vs. interpreting its results becameinverted – data analysis became the new bottleneck. The reason was that the existing

    1

  • visualizer was good at conveying the nature of individual clusters, but the map had to beconstructed basically by hand. Kmap was designed to remedy this problem.

    This paper is organized as follows: we first describe the design of the Kmap visualizer,its modes of operation and its drilldown capabilities, and its implementation in Java. Thisis followed by four case studies that show examples of Kmap’s features. We then presentillustrative examples of Kmap’s features in the form of four case studies using stock market,retail supermarket, credit bureau, and census data.

    2 Implementation

    The main Java classes in Kmap reflect the program’s functionality:

    • Kmap collects the clustering results from the input file (generic neural weights file orIntelligent Miner results file) and creates and manipulates the rectangular grid map ofclusters.

    • Kdetail handles the drilldowns into an individual cluster• Kplot creates the contents of the rectangles for Kmap and Kdetail, using the JClass

    chart and table java components provided by KL Group Inc. [5].

    • Kmenubar handles user interactions via the menu bar that is used for both Kmapand Kdetail.

    • NNClustRes provides an API to Intelligent Miner results files in the form of a dozen“get” methods for obtaining

    1. the number of clusters (neural nodes) and data attributes (fields),

    2. the neural weights (mean clustering attributes for each node), and

    3. several sets of attribute statistics, used for drilldowns and described further below.

    There are three different “modes” for viewing clustering results in Kmap and its drill-downs:

    • “weights” mode, in which a plot of all the neural weights (i.e., attribute mean valuse)is shown for each cluster. This mode is useful when the attributes consist of a logicalsequence like month-by-month prices or spending. Figure 1 shows such an example.

    • “topten” mode, in which the names of the attributes with the top 3 or top 10 weightsare shown for each cluster. This mode can be very useful when the number of attributesis large. Figure 3 shows a comparison of results viewed in weights and in topten modes,and Figure 4 gives a detailed example of topten mode in action.

    • “sizes” mode, which simply shows the size of each cluster as a filled circle. Figure 3contains an example of a “sizes” view.

    2

  • The mode can be selected from the “View” menu, or by specifying w, t, or s after the filenamewhen Kmap is called from the command line.

    The cluster size is indicated by the background color in weights mode, by the graynessof the frame in topten mode, and by the size of the circle in sizes mode.

    The available supplemental attributes, such as Total Spending or Age, can also be usedto color the inner rectangle in topten mode or the background in weights or sizes mode. Thisis done via the “Color By” pulldown on the menu bar. The functions provided by the othermenu bar pulldowns are as follows:

    • “File” handles the opening of files and the printing of the displays.• “View” provides a large number of methods for controlling the appearance of the

    visualization.

    • “Sort By” allows sorting the attributes in topten mode by weights or by six otherquantities.

    The attributes used in neural clustering are often normalized first, and so the neuralweights (the means of the normalized attributes) for each cluster can be different from themeans of the raw values. The un-normalized values of the attributes can be included assupplemental fields, and Intelligent Miner will compute their statistics along with those ofthe active fields. Kmap has provision to read these supplemental field statistics and makethem available for alternative ways to sort the attributes in topten mode, using the ”SortBy” menu. These quantities available for each cluster are:

    1. “mean”, the mean value of the un-normalized attribute

    2. “partcp”, the fraction of records with non-zero values of this attribute

    3. “meanNZ”, the mean non-zero value of the un-normalized attribute, computed fromthe ratio of the two quantities above

    4. “Rmean”, the mean value of the un-normalized attribute for this cluster divided bythe mean value of this attribute for the entire data set.

    5. “Rpartcp”, the quantity “partcp” normalized to the entire data set as above

    6. “RmeanNZ”, the quantity “meanNZ” normalized to the entire data set as above

    There are also popup menus and other panels available for drilldown and other purposes;these are shown in the examples that follow.

    3 Case Studies

    This section describes the use of Kmap to visualize several different kinds of dataminingresults.

    3

  • • Standard and Poor 500 monthly closings (weights, drilldowns)• Safeway (topten mode, more drilldowns)• Credit Bureau(bankruptcy colorby)• Census (categorical variables)

    3.1 Stock Prices

    For our first example, Figure 1 shows Kmap in “weights” mode, displaying the results of a64-node clustering performed on the monthly closing prices of the Standard & Poor’s stocksfor the period September 1995 - August 1996. Each stock price is normalized by its maximumvalue during this period. The action of the Kohonen algorithm in placing similar clustersnear each other on the map shows clearly in the gradual transition from gainers on the leftto decliners on the right.

    Kmap offers two special drilldowns for relatively small datasets like this one. It canread a file listing the cluster for each record and display the membership of each clusteron a popup menu, as shown in Figure 1. In addition, the “View” menu has an entry thatleads to a “Record Finder” panel that can find which cluster a particular record is in andhighlight that cluster, as shown for the case of Western Atlas and cluster 25 in the figure.A hypothetical scenario for using these capabilities would be to check the membership of againer cluster like node 32, find that it contains two petroleum industry members (Mobiland Texaco), wonder where Western Atlas (also a petroleum industry member) had beenplaced, use the Record Finder, and discover it in cluster 25, a direct neighbor of cluster 32.There is a substantial number of such intuitively satisfying relationships in the map.

    If the cluster membership file mentioned above also contains the attribute values for theindividual records, Kmap can provide an additional drilldown that allows comparing thebehavior of individual records in a cluster, as shown in Figure 2.

    3.2 Supermarket Shoppers

    The example in this section explains the motivation for “topten” mode and describes someof the features of that mode. The clustering in this case was performed on a month’sspending data by some 36,000 supermarket customers in about 100 product categories (dairy,confectionery, baby products, etc.). As shown in Figure 3, the Kmap “weights” mode usedin the previous example can be used to view these clustering results as well. Kmap providespoint labels to help identify each point even in a crowded plot, and this feature can beused to identify the three attributes with the largest weights in cluster 3 as Sugar, Tea,and Desserts/Puddings, for example. But the large number of attributes makes this a slowprocess. Furthermore, unlike the months of the previous example, there is no sequentialrelationship among the attributes that makes a plot view particularly useful. A much fastergrasp of the clusters’ nature can be obtained in this case from an initial map that shows whatthe top 3 attributes are for each cluster, with a rich set of available drilldowns for furtherinvestigation. This is what “topten” mode does.

    4

  • Figure 1: Monthly closing prices for the Standard & Poor’s 500 stocks over a one-year period

    5

  • Figure 2: Drilldowns available in weights mode (for example in Figure 1)

    In topten mode, each rectangle in the map conveys the following information about itscorresponding cluster: the outer frame’s degree of grayness is proportional to its population.the inner rectangle’s color indicates the value of the supplemental attribute chosen from the“Color By” menu. In Figure 3, for example, blue indicates above-average customer Age,while pink denotes below-average attribute values. The numerical values of the populationand Age are also shown for each cluster.

    By default, the top 3 attributes are sorted by their neural weights, but this can be changed(for the main map as well as the drilldowns) via the “Sort By” menu, which allows re-sortingby the attribute mean value or the five other statistics mentioned earlier. The color of thebullets in this example correspond to a rainbow spectrum matched to the range of categorynumbers, so attributes with similar colors have similar category numbers. The bullet shapescarry no information in this example. However, for the wine recommender described in [3],we used a different scheme in which the bullet color indicated the wine color and the bulletshape indicated price (squares cost more than circles, etc.).

    Clicking on a cluster’s rectangle pops up a menu that can be used for further drilldown,as shown in Figure 3. The first item on the menu allows getting more information on aparticular attribute (field), such as the distribution of its revenue among the clusters. Forexample, the 11% of the total customer population represented by cluster 3 accounted for24% of the spending on Tea, for a “wallet share enrichment” of 2.18 – i.e., chances of findinga tea buyer in this cluster are more than twice that for the overall customer population.

    The second item on the drilldown menu allows getting more detail on a particular cluster.Clicking it brings up a scrollable table showing the neural weights and other field statisticsfor all the attributes. For convenience, the distribution map for a given attribute can bebrought up by clicking the attribute in this table.

    The third item on the drilldown menu brings up an enlarged plot of the neural weightsfor that cluster, as shown in the figure.

    The third Kmap view, “sizes”, is shown at the top left of Figure 3 for reference. Inthis mode, the cluster population is conveyed by the size of the corresponding circle (thepopulations are very close in this case), and the color convention is the same as for the“topten” view.

    The drilldowns shown are available in all 3 modes, and each drilldown figure can beprinted individually.

    6

  • Figure 3: Comparison of views and drilldowns. Starting at top left, the “sizes”, “weights”,and “topten” views of the supermarket shopper clustering results for 9 clusters. The topright view also shows popup menu leading to several drilldowns: into a specific attribute(left center and right), into the weights view of a cluster (right center), and into the toptenview of a cluster (bottom right). Clicking an attribute’s button in the view at bottom rightis a second way to obtain the attribute map at bottom left.

    7

  • Figure 4: Supermarket retail data clustering results

    8

  • One application of this clustering capability is in the generation of personalized recom-mendations, as we describe in [2]. These recommendations were being generated for partic-ipants in a trial program using PDAs(personal digital assistants) for supermarket shopping.Since all participants had above-average spending 1, we clustered only the above-averagespenders and obtained the results shown in Figure 4. Note the emergence of a cluster (4)with strong spending on baby products. As the three drilldown maps of attribute shareshow, these clusters are sharper than the previous ones in that the “wallet share enrich-ments” are larger. For example, cluster 4 has 10% of the total population but 56% of thetotal spending on baby products, for a an enrichment of 5.6. This sort of enrichment canbe quite valuable for customer relationship and marketing purposes. It also impacted thegeneration of recommendations. For example, the most popular chocolates in cluster 4 turnout to be quite different from those most frequently preferred by the population as a whole.[2].

    3.3 Credit Data

    The example in this section shows the value of adroitness in the selection of the attributesused for clustering. Figure 5 shows clustering results from a study on mining a large databasefrom a credit bureau. Twelve of the 44 attributes that were used were created by categorizingeach outstanding credit card balance into a small matrix of several ranges of balance andseveral ranges of the age of the account. This helped the domain expert to find about adozen different segments on the map. For example, the top attributes for clusters 18 and28 are “BANK BAL03” and “BANK BAL06” , which both correspond to large balances inaccounts less than 6 months old, hence the analyst’s label ”BIG CHURN” for people whocontinually accept bank offers for new credit cards with low initial interest rates, and thentransfer to another card at the end of the low interest rate period. This ability to discernvarious segments of credit usage can be very helpful to a bank’s efforts to market its variousproducts.

    The second example in this section shows the clustering results on a subset of the popu-lation whose credit rating is within a “gray area” that normally makes it difficult to obtaincredit. One of the supplemental attributes (not used in the clustering, of course) was whetheran individual in this subset actually went bankrupt within the next six months. When thisattribute is used to color the cluster map (Figure 6) , an interesting pattern appears: farfrom being randomly distributed, the actual bankruptcies are concentrated on the right edgeof the map. The bankruptcy rate for the population as a whole is about 0.5%, and most ofthe clusters have a rate below this value, while one of the clusters at the right of the maphas an 8% bankruptcy rate, which is 16 times higher than the average value. The dominantattributes of this cluster are all old accounts with very large balances, which tend to beprominent in the other clusters with above-average bankruptcies as well. The suggestion isthat a number of people are being denied credit without good reason, and that a more refinedrating system could free up credit to a group that could probably really use it. Customersand the bank would both benefit.

    1like the children in Lake Wobegon

    9

  • File View Color By Sort By Exit

    credit100 Kmap -- grayest = 30% of 1073960 records, sorted by weights, colored by BEACON_96_SCR (mean=633.494)

    BANK_BAL05BANK_BAL08BANK_BAL11

    564.5 116799

    BANK_BAL08BANK_BAL11BANK_BAL05

    592.03 19598

    BANK_BAL11BANK_BAL05BANK_BAL08

    609.34 24897

    BANK_BAL11BANK_BAL10POPDEPT_BAL

    549.62 155396

    AUTO_BALINSTFIN_BALOTHDEPT_BAL

    471.49 431(95

    OTHDEPT_BALPOPDEPT_BALBANK_BAL11

    568.35 1725(0%)94

    POPDEPT_BALBANK_BAL11OTHDEPT_BAL

    597.69 370793

    POPDEPT_BALBANK_BAL11OTHDEPT_BAL

    604.53 494492

    HOMEFURN_BALPOPDEPT_BALHOME_BAL

    553.64 1371(91

    HOMEFURN_BALHOME_BALOTHDEPT_BAL

    581.72 2373(90

    BANK_BAL08BANK_BAL05BANK_BAL02

    587.01 28089

    BANK_BAL08BANK_BAL11BANK_BAL05

    584.87 28888

    BANK_BAL11BANK_BAL08BANK_BAL05

    610.58 44787

    BANK_BAL11BANK_BAL12BANK_BAL10

    596.92 470586

    OTHDEPT_BALBANK_BAL11BANK_BAL10

    581.69 165785

    OTHDEPT_BALBANK_BAL11POPDEPT_BAL

    588.18 3649(0%)84

    POPDEPT_BALOTHDEPT_BALBANK_BAL11

    585.21 404483

    POPDEPT_BALBANK_BAL11MORT_BAL

    612.58 10K(82

    HOMEFURN_BALHOME_BALAUTO_BAL

    607.99 7956(81

    ELEC_BALHOMEFURN_BALBANK_BAL11

    630.97 4319(80

    BANK_BAL08BANK_BAL05BANK_BAL11

    611.89 36479

    BANK_BAL08BANK_BAL05BANK_BAL11

    625.23 63378

    BANK_BAL08BANK_BAL11MORT_BAL

    639.46 77277

    BANK_BAL11MORT_BALPOPDEPT_BAL

    620.66 10K(76

    BANK_BAL11OTHDEPT_BALPOPDEPT_BAL

    608.83 428675

    OTHDEPT_BALAUTO_BALMORT_BAL

    600.93 4763(0%)74

    OTHDEPT_BALPOPDEPT_BALBANK_BAL11

    610.86 948073

    POPDEPT_BALOTHDEPT_BALBANK_BAL11

    619.82 16K(72

    CREDITU_BALAUTO_BALMORT_BAL

    680.76 9648(71

    CREDITU_BALAUTO_BALMORT_BAL

    635.64 3146(70

    BANK_BAL05BANK_BAL02BANK_BAL08

    620.21 30369

    BANK_BAL05BANK_BAL08BANK_BAL02

    640 6698(068

    BANK_BAL05BANK_BAL08MORT_BAL

    666.35 94567

    BANK_BAL11BANK_BAL08BANK_BAL05

    643.55 599566

    BANK_BAL11MORT_BALBANK_BAL10

    646.42 17K(65

    CLOTHING_BALOTHDEPT_BALBANK_BAL11

    605.9 1443(0%)64

    OTHDEPT_BALPOPDEPT_BALAUTO_BAL

    625.28 14K(63

    AUTO_BALPOPDEPT_BALOTHDEPT_BAL

    599.59 582662

    AUTO_BALBANK_BAL11MORT_BAL

    648.65 13K(161

    AUTO_BALMORT_BALBANK_BAL11

    595.68 3060(60

    BANK_BAL02BANK_BAL05BANK_BAL08

    621.92 24759

    BANK_BAL02BANK_BAL05BANK_BAL08

    656.61 54458

    BANK_BAL05BANK_BAL02BANK_BAL08

    678.22 79357

    BANK_BAL07BANK_BAL04BANK_BAL08

    592.4 11K(156

    BANK_BAL10BANK_BAL11POPDEPT_BAL

    572.57 643255

    BANK_BAL10BANK_BAL11POPDEPT_BAL

    585.82 2455(0%)54

    BANK_BAL10BANK_BAL11OTHDEPT_BAL

    592.15 16K(53

    AUTO_BALMORT_BALBANK_BAL11

    673.48 43K(52

    SECUR_BALMORT_BALAUTO_BAL

    658.7 2911(051

    STUDENT_BALINSTFIN_BALBANK_BAL11

    638.2 5578(050

    BANK_BAL02BANK_BAL05MORT_BAL

    671.21 44949

    BANK_BAL02MORT_BALBANK_BAL05

    685.3 11K48

    BANK_BAL02BANK_BAL05MORT_BAL

    705.06 15K47

    BANK_BAL07BANK_BAL04BANK_BAL08

    541.33 199246

    BANK_BAL04BANK_BAL01BANK_BAL07

    579.84 525045

    OILCARD_BALPOPDEPT_BALBANK_BAL11

    546.99 1071(0%)44

    OILCARD_BALOTHDEPT_BALBANK_BAL11

    571.79 185143

    SECUR_BALINSTFIN_BALAUTO_BAL

    628.16 396442

    INSTFIN_BALMORT_BALAUTO_BAL

    607.79 2162(41

    STUDENT_BALINSTFIN_BALBANK_BAL11

    630.63 2416(40

    BANK_BAL12BANK_BAL11BANK_BAL08

    622.64 18939

    BANK_BAL09BANK_BAL05BANK_BAL02

    640.46 11238

    BANK_BAL02BANK_BAL01BANK_BAL05

    614.82 28337

    BANK_BAL01BANK_BAL04BANK_BAL05

    571.51 191636

    BANK_BAL01BANK_BAL04BANK_BAL02

    608.55 711835

    BANK_BAL01BANK_BAL04BANK_BAL02

    643.51 17K(2%)34

    OILCARD_BALOTHDEPT_BALPOPDEPT_BAL

    587.68 458033

    POPDEPT_BALOTHDEPT_BALAUTO_BAL

    636.55 21K(32

    INSTFIN_BALMORT_BALAUTO_BAL

    659.73 9814(31

    STUDENT_BALINSTFIN_BALBANK_BAL10

    647.16 13K(130

    BANK_BAL12MORT_BALBANK_BAL11

    688.92 43329

    BANK_BAL06BANK_BAL09MORT_BAL

    687.38 18228

    BANK_BAL03BANK_BAL06MORT_BAL

    688.73 16127

    MORT_BALBANK_BAL01BANK_BAL02

    651.77 461726

    REVFINAN_BALMORT_BALBANK_BAL11

    638.43 316225

    JUMBO_BALBANK_BAL12CHARGECARD_BAL

    733.6 7103(0%)24

    BANK_BAL12BANK_BAL11MORT_BAL

    692.11 558623

    BANK_BAL12MORT_BALBANK_BAL11

    712.44 11K(22

    JEWELRY_BALOTHDEPT_BALAUTO_BAL

    576.49 2296(21

    GOV_BALSTUDENT_BALMORT_BAL

    593.14 1820(20

    MORT_BALBANK_BAL12BANK_BAL11

    684.04 39919

    BANK_BAL09BANK_BAL06MORT_BAL

    689.5 414318

    BANK_BAL03BANK_BAL06MORT_BAL

    697 4349(017

    MORT_BALAUTO_BALBANK_BAL11

    694.96 14K(16

    REVFINAN_BALMORT_BALBANK_BAL11

    666.58 454115

    CHARGECARD_BALBANK_BAL11MORT_BAL

    659.36 2232(0%)14

    BANK_BAL08BANK_BAL05BANK_BAL11

    703.63 14K(13

    OTHDEPT_BALBANK_BAL11POPDEPT_BAL

    661.06 27K(12

    GOV_BALSTUDENT_BALOTHDEPT_BAL

    575.52 6223(11

    OTHUTIL_BALOTHDEPT_BALAUTO_BAL

    561.92 1608(10

    MORT_BALBANK_BAL11AUTO_BAL

    677.99 2719

    MORT_BALBANK_BAL11AUTO_BAL

    701.01 8118

    MORT_BALBANK_BAL11BANK_BAL12

    727.16 18K7

    MORT_BALBANK_BAL11OTHDEPT_BAL

    723.64 26K(6

    MORT_BALAUTO_BALBANK_BAL11

    725.88 34K(5

    MORT_BALBANK_BAL11OTHDEPT_BAL

    727.19 30K(3%)4

    BANK_BAL11BANK_BAL10BANK_BAL12

    690.89 22K(3

    CREDITU_BALAUTO_BALMORT_BAL

    693.1 13K(12

    HOMEFURN_BALHOME_BALELEC_BAL

    621.66 12K(11

    BANK_BAL11INSTFIN_BALBANK_BAL12

    718.52 326K(0

    Figure 5: Credit usage customer segments

    10

  • File View Color By Sort By Exit

    seg650E Kmap -- grayest = 18% of 95157 records, sorted by weights, colored by ENH_DAS_SCR (mean=0.003)

    POPDEPT_BALBANK_BAL11BANK_BAL12

    0 208(0%)99

    BANK_BAL11BANK_BAL12BANK_BAL08

    0.02 458(98

    BANK_BAL11MORT_BALBANK_BAL12

    0 957(1%)97

    BANK_BAL11OTHDEPT_BALBANK_BAL08

    0.01 252(096

    OTHDEPT_BALBANK_BAL11CLOTHING_BAL

    0 402(0%)95

    OTHDEPT_BALBANK_BAL12BANK_BAL11

    0 112(0%)94

    BANK_BAL12MORT_BALBANK_BAL11

    0 576(0%)93

    BANK_BAL06BANK_BAL09MORT_BAL

    0 279(0%)92

    BANK_BAL03BANK_BAL06MORT_BAL

    0 379(0%)91

    MORT_BALBANK_BAL11AUTO_BAL

    0 463(0%)90

    BANK_BAL12BANK_BAL11BANK_BAL08

    0.01 261(089

    BANK_BAL11BANK_BAL02BANK_BAL05

    0.01 391(88

    BANK_BAL11POPDEPT_BALOTHDEPT_BAL

    0.02 308(087

    POPDEPT_BALBANK_BAL11OTHDEPT_BAL

    0 534(0%)86

    OTHDEPT_BALPOPDEPT_BALBANK_BAL11

    0 231(0%)85

    BANK_BAL11BANK_BAL12MORT_BAL

    0 428(0%)84

    BANK_BAL12BANK_BAL11BANK_BAL02

    0 589(0%)83

    BANK_BAL09BANK_BAL12MORT_BAL

    0 345(0%)82

    BANK_BAL06BANK_BAL09MORT_BAL

    0 654(0%)81

    MORT_BALBANK_BAL11OTHDEPT_BAL

    0 2237(2%)80

    BANK_BAL12BANK_BAL05BANK_BAL08

    0.04 236(079

    BANK_BAL11BANK_BAL08BANK_BAL05

    0.03 308(78

    POPDEPT_BALBANK_BAL08BANK_BAL11

    0.02 154(077

    POPDEPT_BALBANK_BAL11OTHDEPT_BAL

    0 398(0%)76

    POPDEPT_BALOTHDEPT_BALAUTO_BAL

    0 656(0%)75

    POPDEPT_BALMORT_BALBANK_BAL11

    0 637(0%)74

    BANK_BAL12BANK_BAL11MORT_BAL

    0 1200(1%)73

    MORT_BALBANK_BAL12BANK_BAL11

    0 517(0%)72

    MORT_BALAUTO_BALBANK_BAL11

    0 1377(1%)71

    MORT_BALBANK_BAL11AUTO_BAL

    0 4365(5%)70

    BANK_BAL05BANK_BAL08BANK_BAL11

    0.08 208(069

    BANK_BAL08BANK_BAL11BANK_BAL05

    0.02 310(68

    POPDEPT_BALBANK_BAL08BANK_BAL11

    0 254(0%)67

    POPDEPT_BALBANK_BAL02BANK_BAL05

    0.02 250(066

    POPDEPT_BALOTHDEPT_BALAUTO_BAL

    0 1310(1%)65

    POPDEPT_BALOTHDEPT_BALBANK_BAL11

    0 2143(2%)64

    HOMEFURN_BAHOME_BALMORT_BAL

    0 896(0%)63

    HOMEFURN_BAHOME_BALBANK_BAL11

    0.02 177(0%62

    HOMEFURN_BALHOME_BALMORT_BAL

    0 509(0%)61

    CREDITU_BALAUTO_BALBANK_BAL11

    0 1568(2%)60

    BANK_BAL08BANK_BAL05BANK_BAL11

    0.01 353(059

    BANK_BAL08BANK_BAL05BANK_BAL11

    0.02 493(58

    BANK_BAL08MORT_BALBANK_BAL05

    0 520(0%)57

    BANK_BAL11BANK_BAL08BANK_BAL05

    0.01 579(056

    POPDEPT_BALBANK_BAL08BANK_BAL05

    0 430(0%)55

    BANK_BAL11POPDEPT_BALBANK_BAL10

    0 869(0%)54

    ELEC_BALHOMEFURN_BAMORT_BAL

    0 569(0%)53

    ELEC_BALHOMEFURN_BABANK_BAL11

    0 289(0%)52

    CREDITU_BALMORT_BALAUTO_BAL

    0 616(0%)51

    CREDITU_BALAUTO_BALBANK_BAL11

    0 571(0%)50

    BANK_BAL05BANK_BAL08BANK_BAL02

    0.02 347(049

    BANK_BAL05BANK_BAL08BANK_BAL02

    0 478(0%)48

    BANK_BAL05BANK_BAL08BANK_BAL02

    0 860(0%)47

    BANK_BAL08BANK_BAL05AUTO_BAL

    0 877(0%)46

    BANK_BAL08BANK_BAL11AUTO_BAL

    0 951(0%)45

    BANK_BAL11BANK_BAL05BANK_BAL08

    0 697(0%)44

    BANK_BAL11MORT_BALBANK_BAL10

    0 1736(2%)43

    BANK_BAL11AUTO_BALMORT_BAL

    0 628(0%)42

    SECUR_BALMORT_BALAUTO_BAL

    0 206(0%)41

    AUTO_BALMORT_BALBANK_BAL11

    0 365(0%)40

    BANK_BAL02BANK_BAL05BANK_BAL08

    0.03 309(039

    BANK_BAL05BANK_BAL02BANK_BAL08

    0 628(0%)38

    BANK_BAL05MORT_BALBANK_BAL08

    0 594(0%)37

    BANK_BAL05BANK_BAL08BANK_BAL02

    0 1436(2%)36

    BANK_BAL07BANK_BAL04BANK_BAL08

    0 563(0%)35

    BANK_BAL08BANK_BAL11BANK_BAL05

    0 1393(1%)34

    BANK_BAL11AUTO_BALOTHDEPT_BAL

    0 3050(3%)33

    OTHDEPT_BALBANK_BAL11MORT_BAL

    0 653(0%)32

    SECUR_BALAUTO_BALINSTFIN_BAL

    0 656(0%)31

    AUTO_BALBANK_BAL11POPDEPT_BAL

    0 1412(1%)30

    BANK_BAL02BANK_BAL05BANK_BAL08

    0 401(0%)29

    BANK_BAL02BANK_BAL08BANK_BAL05

    0.02 492(28

    BANK_BAL05BANK_BAL02BANK_BAL08

    0 1064(1%)27

    BANK_BAL01BANK_BAL04BANK_BAL05

    0 807(0%)26

    BANK_BAL04BANK_BAL07BANK_BAL08

    0 144(0%)25

    BANK_BAL07BANK_BAL04BANK_BAL08

    0 1686(2%)24

    BANK_BAL10BANK_BAL11MORT_BAL

    0 287(0%)23

    OTHDEPT_BALPOPDEPT_BALMORT_BAL

    0 895(0%)22

    OTHDEPT_BALBANK_BAL11AUTO_BAL

    0 1892(2%)21

    AUTO_BALBANK_BAL11MORT_BAL

    0 4098(4%)20

    BANK_BAL02BANK_BAL05MORT_BAL

    0 684(0%)19

    BANK_BAL02BANK_BAL05BANK_BAL11

    0 1332(1%18

    BANK_BAL02BANK_BAL05AUTO_BAL

    0 1762(2%)17

    BANK_BAL01BANK_BAL02BANK_BAL04

    0 1404(1%)16

    BANK_BAL01BANK_BAL04BANK_BAL02

    0 2132(2%)15

    STUDENT_BALINSTFIN_BALBANK_BAL11

    0 1427(1%)14

    REVFINAN_BALMORT_BALBANK_BAL11

    0 570(0%)13

    REVFINAN_BALMORT_BALBANK_BAL11

    0 378(0%)12

    CHARGECARD_BAMORT_BALBANK_BAL12

    0 284(0%)11

    BANK_BAL10OILCARD_BALBANK_BAL11

    0 1276(1%)10

    BANK_BAL02MORT_BALBANK_BAL05

    0 389(0%)9

    BANK_BAL02MORT_BALBANK_BAL05

    0 944(0%)8

    BANK_BAL02BANK_BAL01BANK_BAL05

    0 356(0%)7

    BANK_BAL01BANK_BAL04BANK_BAL02

    0 307(0%)6

    STUDENT_BALINSTFIN_BALBANK_BAL11

    0 640(0%)5

    STUDENT_BALINSTFIN_BALBANK_BAL11

    0 310(0%)4

    INSTFIN_BALMORT_BALBANK_BAL11

    0 174(0%)3

    INSTFIN_BALMORT_BALAUTO_BAL

    0 831(0%)2

    GOV_BALAUTO_BALBANK_BAL11

    0 155(0%)1

    OTHDEPT_BALBANK_BAL11INSTFIN_BAL

    0 17K(18%)0

    Figure 6: Actual bankruptcies among customers rated as marginal credit risks.

    11

  • File

    Vie

    wC

    olor

    By

    Sor

    t By

    Exi

    t

    seg

    650E

    Km

    ap -

    - g

    raye

    st =

    18%

    of

    9515

    7 re

    cord

    s, s

    ort

    ed b

    y w

    eig

    hts

    , co

    lore

    d b

    y E

    NH

    _DA

    S_S

    CR

    (m

    ean

    =0.0

    03)

    PO

    PD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L12

    0 208(0%)

    99

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L12

    BA

    NK

    _BA

    L08

    0.02 458(

    98

    BA

    NK

    _BA

    L11

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    12

    0 957(1%)

    97

    BA

    NK

    _BA

    L11

    OT

    HD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L08

    0.01 252(0

    96

    OT

    HD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L11

    CL

    OT

    HIN

    G_B

    AL

    0 402(0%)

    95

    OT

    HD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L12

    BA

    NK

    _BA

    L11

    0 112(0%)

    94

    BA

    NK

    _BA

    L12

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    11

    0 576(0%)

    93

    BA

    NK

    _BA

    L06

    BA

    NK

    _BA

    L09

    MO

    RT

    _BA

    L

    0 279(0%)

    92

    BA

    NK

    _BA

    L03

    BA

    NK

    _BA

    L06

    MO

    RT

    _BA

    L

    0 379(0%)

    91

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    11A

    UT

    O_B

    AL

    0 463(0%)

    90

    BA

    NK

    _BA

    L12

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L08

    0.01 261(0

    89

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L05

    0.01 391(

    88

    BA

    NK

    _BA

    L11

    PO

    PD

    EP

    T_B

    AL

    OT

    HD

    EP

    T_B

    AL

    0.02 308(0

    87

    PO

    PD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L11

    OT

    HD

    EP

    T_B

    AL

    0 534(0%)

    86

    OT

    HD

    EP

    T_B

    AL

    PO

    PD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L11

    0 231(0%)

    85

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L12

    MO

    RT

    _BA

    L

    0 428(0%)

    84

    BA

    NK

    _BA

    L12

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L02

    0 589(0%)

    83

    BA

    NK

    _BA

    L09

    BA

    NK

    _BA

    L12

    MO

    RT

    _BA

    L

    0 345(0%)

    82

    BA

    NK

    _BA

    L06

    BA

    NK

    _BA

    L09

    MO

    RT

    _BA

    L

    0 654(0%)

    81

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    11O

    TH

    DE

    PT

    _BA

    L

    0 2237(2%)

    80

    BA

    NK

    _BA

    L12

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L08

    0.04 236(0

    79

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L05

    0.03 308(

    78

    PO

    PD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L11

    0.02 154(0

    77

    PO

    PD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L11

    OT

    HD

    EP

    T_B

    AL

    0 398(0%)

    76

    PO

    PD

    EP

    T_B

    AL

    OT

    HD

    EP

    T_B

    AL

    AU

    TO

    _BA

    L

    0 656(0%)

    75

    PO

    PD

    EP

    T_B

    AL

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    11

    0 637(0%)

    74

    BA

    NK

    _BA

    L12

    BA

    NK

    _BA

    L11

    MO

    RT

    _BA

    L

    0 1200(1%)

    73

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    12B

    AN

    K_B

    AL

    11

    0 517(0%)

    72

    MO

    RT

    _BA

    LA

    UT

    O_B

    AL

    BA

    NK

    _BA

    L11

    0 1377(1%)

    71

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    11A

    UT

    O_B

    AL

    0 4365(5%)

    70

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L11

    0.08 208(0

    69

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L05

    0.02 310(

    68

    PO

    PD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L11

    0 254(0%)

    67

    PO

    PD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L05

    0.02 250(0

    66

    PO

    PD

    EP

    T_B

    AL

    OT

    HD

    EP

    T_B

    AL

    AU

    TO

    _BA

    L

    0 1310(1%)

    65

    PO

    PD

    EP

    T_B

    AL

    OT

    HD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L11

    0 2143(2%)

    64

    HO

    ME

    FU

    RN

    _BA

    HO

    ME

    _BA

    LM

    OR

    T_B

    AL

    0 896(0%)

    63

    HO

    ME

    FU

    RN

    _BA

    HO

    ME

    _BA

    LB

    AN

    K_B

    AL

    11

    0.02 177(0%

    62

    HO

    ME

    FU

    RN

    _BA

    LH

    OM

    E_B

    AL

    MO

    RT

    _BA

    L

    0 509(0%)

    61

    CR

    ED

    ITU

    _BA

    LA

    UT

    O_B

    AL

    BA

    NK

    _BA

    L11

    0 1568(2%)

    60

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L11

    0.01 353(0

    59

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L11

    0.02 493(

    58

    BA

    NK

    _BA

    L08

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    05

    0 520(0%)

    57

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L05

    0.01 579(0

    56

    PO

    PD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L05

    0 430(0%)

    55

    BA

    NK

    _BA

    L11

    PO

    PD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L10

    0 869(0%)

    54

    EL

    EC

    _BA

    LH

    OM

    EF

    UR

    N_B

    AM

    OR

    T_B

    AL

    0 569(0%)

    53

    EL

    EC

    _BA

    LH

    OM

    EF

    UR

    N_B

    AB

    AN

    K_B

    AL

    11

    0 289(0%)

    52

    CR

    ED

    ITU

    _BA

    LM

    OR

    T_B

    AL

    AU

    TO

    _BA

    L

    0 616(0%)

    51

    CR

    ED

    ITU

    _BA

    LA

    UT

    O_B

    AL

    BA

    NK

    _BA

    L11

    0 571(0%)

    50

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L02

    0.02 347(0

    49

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L02

    0 478(0%)

    48

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L02

    0 860(0%)

    47

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L05

    AU

    TO

    _BA

    L

    0 877(0%)

    46

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L11

    AU

    TO

    _BA

    L

    0 951(0%)

    45

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L08

    0 697(0%)

    44

    BA

    NK

    _BA

    L11

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    10

    0 1736(2%)

    43

    BA

    NK

    _BA

    L11

    AU

    TO

    _BA

    LM

    OR

    T_B

    AL

    0 628(0%)

    42

    SE

    CU

    R_B

    AL

    MO

    RT

    _BA

    LA

    UT

    O_B

    AL

    0 206(0%)

    41

    AU

    TO

    _BA

    LM

    OR

    T_B

    AL

    BA

    NK

    _BA

    L11

    0 365(0%)

    40

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L08

    0.03 309(0

    39

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L08

    0 628(0%)

    38

    BA

    NK

    _BA

    L05

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    08

    0 594(0%)

    37

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L02

    0 1436(2%)

    36

    BA

    NK

    _BA

    L07

    BA

    NK

    _BA

    L04

    BA

    NK

    _BA

    L08

    0 563(0%)

    35

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L11

    BA

    NK

    _BA

    L05

    0 1393(1%)

    34

    BA

    NK

    _BA

    L11

    AU

    TO

    _BA

    LO

    TH

    DE

    PT

    _BA

    L

    0 3050(3%)

    33

    OT

    HD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L11

    MO

    RT

    _BA

    L

    0 653(0%)

    32

    SE

    CU

    R_B

    AL

    AU

    TO

    _BA

    LIN

    ST

    FIN

    _BA

    L

    0 656(0%)

    31

    AU

    TO

    _BA

    LB

    AN

    K_B

    AL

    11P

    OP

    DE

    PT

    _BA

    L

    0 1412(1%)

    30

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L08

    0 401(0%)

    29

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L08

    BA

    NK

    _BA

    L05

    0.02 492(

    28

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L08

    0 1064(1%)

    27

    BA

    NK

    _BA

    L01

    BA

    NK

    _BA

    L04

    BA

    NK

    _BA

    L05

    0 807(0%)

    26

    BA

    NK

    _BA

    L04

    BA

    NK

    _BA

    L07

    BA

    NK

    _BA

    L08

    0 144(0%)

    25

    BA

    NK

    _BA

    L07

    BA

    NK

    _BA

    L04

    BA

    NK

    _BA

    L08

    0 1686(2%)

    24

    BA

    NK

    _BA

    L10

    BA

    NK

    _BA

    L11

    MO

    RT

    _BA

    L

    0 287(0%)

    23

    OT

    HD

    EP

    T_B

    AL

    PO

    PD

    EP

    T_B

    AL

    MO

    RT

    _BA

    L

    0 895(0%)

    22

    OT

    HD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L11

    AU

    TO

    _BA

    L

    0 1892(2%)

    21

    AU

    TO

    _BA

    LB

    AN

    K_B

    AL

    11M

    OR

    T_B

    AL

    0 4098(4%)

    20

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L05

    MO

    RT

    _BA

    L

    0 684(0%)

    19

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L05

    BA

    NK

    _BA

    L11

    0 1332(1%

    18

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L05

    AU

    TO

    _BA

    L

    0 1762(2%)

    17

    BA

    NK

    _BA

    L01

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L04

    0 1404(1%)

    16

    BA

    NK

    _BA

    L01

    BA

    NK

    _BA

    L04

    BA

    NK

    _BA

    L02

    0 2132(2%)

    15

    ST

    UD

    EN

    T_B

    AL

    INS

    TF

    IN_B

    AL

    BA

    NK

    _BA

    L11

    0 1427(1%)

    14

    RE

    VF

    INA

    N_B

    AL

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    11

    0 570(0%)

    13

    RE

    VF

    INA

    N_B

    AL

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    11

    0 378(0%)

    12

    CH

    AR

    GE

    CA

    RD

    _BA

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    12

    0 284(0%)

    11

    BA

    NK

    _BA

    L10

    OIL

    CA

    RD

    _BA

    LB

    AN

    K_B

    AL

    11

    0 1276(1%)

    10

    BA

    NK

    _BA

    L02

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    05

    0 389(0%)

    9

    BA

    NK

    _BA

    L02

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    05

    0 944(0%)

    8

    BA

    NK

    _BA

    L02

    BA

    NK

    _BA

    L01

    BA

    NK

    _BA

    L05

    0 356(0%)

    7

    BA

    NK

    _BA

    L01

    BA

    NK

    _BA

    L04

    BA

    NK

    _BA

    L02

    0 307(0%)

    6

    ST

    UD

    EN

    T_B

    AL

    INS

    TF

    IN_B

    AL

    BA

    NK

    _BA

    L11

    0 640(0%)

    5

    ST

    UD

    EN

    T_B

    AL

    INS

    TF

    IN_B

    AL

    BA

    NK

    _BA

    L11

    0 310(0%)

    4

    INS

    TF

    IN_B

    AL

    MO

    RT

    _BA

    LB

    AN

    K_B

    AL

    11

    0 174(0%)

    3

    INS

    TF

    IN_B

    AL

    MO

    RT

    _BA

    LA

    UT

    O_B

    AL

    0 831(0%)

    2

    GO

    V_B

    AL

    AU

    TO

    _BA

    LB

    AN

    K_B

    AL

    11

    0 155(0%)

    1

    OT

    HD

    EP

    T_B

    AL

    BA

    NK

    _BA

    L11

    INS

    TF

    IN_B

    AL

    0 17K(18%)

    0

    Fig

    ure

    7:A

    ctual

    ban

    kru

    ptc

    ies

    amon

    gcu

    stom

    ers

    rate

    das

    mar

    ginal

    cred

    itri

    sks.

    12

  • 3.4 Census Data

    The example in this section shows how Kmap can handle categorical values. The clusteringwas performed using publicly available census data with a mixture of 12 continuous andcategorical attributes for each person. The data comes from a municipality located near aNational Laboratory, which accounts for the unusually large number of advanced degreesand large diversity of backgrounds.

    Categorical attributes pose an extra degree of complexity in neural clustering becausethey are expanded to a one-to-N mapping: for example, if the original attribute were “animaltype” with values “cat, dog, cow”, it would be replaced for clustering purposes by threeattributes “is a cat”, “is a dog”, “is a cow” that can each assume values 1 or 0. So the totalnumber of attributes actually used in the clustering is data-dependent.

    Kmap handles this by making two passes through the results file, obtaining the number ofdistinct values for each categorical attribute on the first pass and constructing the effectiveattributes coming from each original attribute on the second pass. Sample Kmap resultsare shown in Figure 9 and are compared to the default Intelligent Miner visualizer’s viewin Figure 8. The two views each present interesting information in different ways, but theyare indeed consistent with each other. For example, one of the supplemental attributes (i.e.,not used for clustering) is whether household income is above or below $50,000. Using thisattribute to color the Kmap in Figure 9 points to clusters 18 and 24 as being highest inthis attribute, with 70% of the households in cluster 24 having income above $50,000. Thisagrees with the corresponding pie chart in Figure 8.

    One of the original categorical attributes was “education”, and one of its values was“doctorate”. Drilling down into the effective categorical attribute “Doctorate(education)”(or “has a PhD” in our earlier terminology) in Figure 9 (comfortingly, perhaps) an overlapbetween clusters with high PhD populations and clusters with high income. Drilling downinto the “educ num” attribute, a number proportional to the number of years of education,shows similar results. Both results are consistent with the drilldowns in Figure 8. Kmapdoes a good job showing the attribute distributions on the map. The default visualizer doesa good job showing the categorical attributes in pie charts. Both can be useful.

    3.5 Appendix: A Tale of Two Clusters

    Figure 10 offers a more detailed look at the kind of information that Kmap can extract fromthe supermarket customer clustering results mentioned earlier.

    4 Summary

    Kmap has proved to be a versatile visualizer for Kohonen self-organizing map clusteringresults obtained from a variety of datamining engagements. Analysts value being able tosee the map topography, as well as the drilldowns provided, many of which resulted fromtheir suggestions. We provided examples of Kmap’s use in four different areas of commerce.Currently we are applying Kmap to technical data generated by the simulators and hardwarecontrol systems of massively parallel supercomputers.

    13

  • census36

    8 seg0

    7seg30

    6seg185

    seg34

    5

    seg32

    5

    seg12

    5

    seg24

    5

    seg5

    4

    seg23

    education

    HS-grad

    relationship

    Own-childHusband

    marital_status

    Married-AF-spouseMarried-civ-spouse

    educ_num sex

    MaleFemale

    occupation

    Exec-managerialSales

    Transport-movingMachine-op-inspct

    Craft-repairOther

    workclass

    Federal-govState-gov

    Self-emp-incLocal-gov

    Self-emp-not-incOther

    relationship

    Not-in-familyOther-relative

    Own-child

    Husband

    marital_status

    Married-AF-spouseMarried-civ-spouse

    occupation

    Transport-movingSales

    Farming-fishingExec-managerial

    Craft-repairOther

    sex

    MaleFemale

    education

    10thAssoc-voc

    7th-8thSome-college

    HS-gradOther

    education

    Assoc-acdmDoctorate

    Prof-schoolMasters

    Bachelors

    educ_num relationship

    WifeNot-in-family

    Own-childOther-relative

    Husband

    marital_status

    Married-civ-spouse

    occupation

    Tech-supportCraft-repair

    SalesProf-specialty

    Exec-managerialOther

    sex

    Male

    relationship

    Unmarried

    marital_status

    Divorced

    sex

    MaleFemale

    age occupation

    SalesOther-service

    Exec-managerialProf-specialty

    Adm-clericalOther

    hours_per_week

    relationship

    Own-childWife

    sex

    MaleFemale

    marital_status

    Married-AF-spouseMarried-civ-spouse

    occupation

    ?Other-service

    Exec-managerialProf-specialty

    Adm-clericalOther

    age hours_per_week

    education

    Some-college

    relationship

    Own-childOther-relative

    Husband

    marital_status

    Married-AF-spouseMarried-civ-spouse

    educ_num sex

    Male

    workclass

    State-govFederal-gov

    Private

    workclass

    Federal-govState-govLocal-gov

    Self-emp-incSelf-emp-not-inc

    education

    Assoc-acdmDoctorate

    Prof-schoolMasters

    BachelorsOther

    educ_num relationship

    UnmarriedOwn-child

    Other-relative

    Husband

    marital_status

    WidowedMarried-AF-spouse

    Married-civ-spouse

    occupation

    Adm-clericalProtective-serv

    SalesExec-managerial

    Prof-specialtyOther

    relationship

    Own-child

    education

    Some-college

    age marital_status

    WidowedMarried-civ-spouse

    Married-spouse-absentSeparated

    Never-married

    educ_num hours_per_week

    relationship

    Not-in-family

    marital_status

    Never-married

    sex

    Female

    occupation

    Exec-managerialSales

    Prof-specialtyOther-service

    Adm-clericalOther

    education

    Assoc-vocAssoc-acdm

    MastersHS-grad

    Some-collegeOther

    age

    ev36_11 Cluster seg24 4.57% of population

    [income]

    >50K

  • Figure 9: Census data clustering results seen with Kmap

    15

  • Figure 10: A Tale of Two Clusters: rather different lifestyles emerge during examination ofthe supermarket shopper clustering results. Cluster 4 seems to be an older group of serioushomebakers likely to be found having a tea party, whereas cluster 9 seems to be a younger,health-conscious group more likely to be found jogging - and not smoking!

    16

  • 5 Acknowledgements

    We are grateful for many helpful discussions with our colleagues Joseph Bigus and MichaelRothman.

    References

    [1] Kotlyar V., Viveros M.S., Duri S.S., Lawrence R.D., Almasi G.S. “A Case Studyin Information Delivery to Mass Retail Markets” In the Proceedings of the 10thInternational Conference on Database and Expert Systems Applications (DEXA),Florence, Italy, August/September 1999, pp. 842-851. Published by Springer-Verlagas Lecture Notes in Computer Science, vol 1677

    [2] R. D. Lawrence, G. S. Almasi, V. Kotlyar, M. S. Viveros, and S. S. Duri , “Person-alization of Product Recomendations in Mass Retail Markets”, Data Mining andDiscovery 3(2): 171-195 (1999).

    [3] G. S. Almasi and A. J. Lee, “PDA-based Personalized Recommender Agent”, Pro-ceedings of the 5th International Conference on the Practical Application of In-telligent Agent and Multi Agent Technology (PAAM 2000), Manchester, England,April 2000, pp. 299-309.

    [4] H. Edelstein, “Mining Large Databases: A Case Study”,www.twocrows.com/largedb.pdf

    [5] “JClass Java Components”, www.klgroup.com

    [6] G. Almasi, “Kmap Demo”, www.gsalmasi.com

    [7] Intelligent Miner for Data, www.ibm.com/software/data/iminer/fordata

    [8] T. Kohonen, “Self-Organizing Maps”, Springer-Verlag, 1995.

    [9] R. D. Lawrence, G. S. Almasi, and H. E. Rushmeier, “A Scalable Parallel Algorithmfor Self-Organizing Maps with Applications to Sparse Data Mining Problems”,Data Mining and Knowledge Discovery, Vol. 3, 1999, pp. 171-195. (U. S. Patent6,260,036, July 10, 2001.)

    [10] J. C. Bezdek and N. R. Pal, “Some New Indexes of Cluster Validity”, IEEE Trans-actions on Systems, Man, and Cybernetics – Part B: Cybernetics Vol. 28, No. 3,June 1998, pp. 301 - 315.

    17