
INTEGRAL, Vol. 8 No. 2, Oktober 2003


TOWARDS THE USE OF C4.5 ALGORITHM FOR CLASSIFYING BANKING DATASET

Veronica S. Moertini

Department of Computer Science, Faculty of Mathematics and Natural Sciences, Universitas Katolik Parahyangan, Bandung.

E-mail: [email protected]

Abstract
C4.5 is a well-known algorithm used for classifying datasets. It induces decision trees and rules from datasets, which may contain categorical and numerical attributes. The rules can be used to predict categorical attribute values of new records. This paper discusses an overview of data classification and its techniques, the basic methods of the C4.5 algorithm, and the process and result analysis of an experiment that utilizes C4.5 for classifying a banking dataset. C4.5 performs well in classifying the dataset, but more data needs to be collected in order to obtain useful rules.

Intisari
C4.5 is a widely known algorithm used to classify data with numerical and categorical attributes. The rules produced by the classification process can be used to predict the values of discrete attributes of new records. This paper discusses data classification techniques in general, the basic methodology of the C4.5 algorithm, and the process and result analysis of an experiment using C4.5 to classify banking data. C4.5 performs well, but to obtain useful rules, more complete data needs to be collected.

Received: 27 June 2003

Approved for publication: 10 July 2003

1. Introduction
Databases are rich with hidden information that can be used for making intelligent business decisions. Classification is a form of data analysis that can be used to extract models describing important data classes or to predict categorical labels. An example application of such a model is to categorize a bank loan application as either safe or risky [1]. Banks' databases are rich with data. Banks can take advantage of the data they have to characterize the behavior of their customers [2] and, based on that behavior, take business actions, such as holding on to good customers and weeding out bad ones [3]. An experiment in analyzing a banking dataset, with the goal of generating knowledge about bank customers, has been conducted. The task chosen is to classify the customers, and the technique used in the experiment is mainly the C4.5 algorithm. This paper discusses an overview of data classification and its techniques, the basic methods of the C4.5 algorithm, and the process and result analysis of the experiment in utilizing C4.5 for classifying a banking dataset.


2. Data Classification
Data classification is a two-step process (see Figure 1). In the first step, a model is built describing a predetermined set of data classes or concepts. The model is constructed by analyzing database tuples (records) described by attributes. Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute. In the context of classification, data tuples are also referred to as samples, examples or objects. The data tuples analyzed to build the model collectively form the training dataset. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population. Since the class label of each training sample is provided, this step is also known as supervised learning (i.e., the learning of the model is "supervised" in that it is told to which class each training sample belongs). In the second step (Figure 1(b)), the model is used for classification.

Figure 1. The data classification process: (a) Learning: Training data are analyzed by a classification algorithm. The class label attribute is credit_rating, and the learned model is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is acceptable, the rules can be used to classify new data tuples [1].
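As an illustration of the two-step process, the following minimal sketch uses scikit-learn's DecisionTreeClassifier (a CART-style learner, not C4.5 itself); the library choice, the attributes and the toy data are assumptions for illustration:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical, already-encoded samples [age, income] with credit_rating labels.
    X = [[25, 30000], [40, 60000], [35, 25000], [50, 80000], [23, 20000], [45, 70000]]
    y = ["fair", "excellent", "fair", "excellent", "fair", "excellent"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
    model = DecisionTreeClassifier().fit(X_train, y_train)     # step 1: supervised learning
    print("estimated accuracy:", model.score(X_test, y_test))  # step 2: evaluate, then classify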


3. Preparing Data for Classification

The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency and scalability of the classification process (a small sketch of these steps follows the list):
- Data cleaning: removing noise and treating missing values. In real datasets, noise can be viewed as legitimate records exhibiting abnormal behavior.
- Relevance analysis: removing any irrelevant or redundant attributes from the learning process.
- Data transformation: generalizing the data to higher-level concepts; the data may also be normalized.
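A minimal sketch of the three steps with pandas; the library choice, the column names and the values are assumptions for illustration, not taken from the paper's dataset:

    import pandas as pd

    # Hypothetical raw records; in the paper's setting these come from the bank database.
    df = pd.DataFrame({
        "card_type": ["junior", "classic", None, "gold"],
        "balance": [1200.0, None, 3400.0, 9800.0],
        "internal_record_id": [1, 2, 3, 4],
    })

    # Data cleaning: drop records with a missing label, fill missing numeric values.
    df = df.dropna(subset=["card_type"])
    df["balance"] = df["balance"].fillna(df["balance"].median())

    # Relevance analysis: drop an attribute judged irrelevant to the task.
    df = df.drop(columns=["internal_record_id"])

    # Data transformation: normalize a numeric attribute to the range [0, 1].
    bal = df["balance"]
    df["balance_norm"] = (bal - bal.min()) / (bal.max() - bal.min())
    print(df)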

4. Overview of Data Classification Techniques

There are several basic techniques for data classification. Among them are decision tree induction, Bayesian classification and Bayesian belief networks [1], neural networks [5, 6, 7], and association-based classification. There are also other approaches to classification that are less commonly used in commercial data mining systems, such as the k-nearest neighbor classifier, case-based reasoning, genetic algorithms, rough sets [4] and fuzzy logic techniques.

The basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees from the dataset in a top-down, recursive, divide-and-conquer manner.

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. There are two types of Bayesian classifiers: the naïve Bayesian classifier and Bayesian belief networks. Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. Bayesian belief networks are graphical models which, unlike naïve Bayesian classifiers, allow the representation of dependencies among subsets of attributes.
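In symbols, the naïve independence assumption means that a sample X = (x_1, ..., x_n) is assigned to the class C_i maximizing (this compact statement of the standard naïve Bayes rule is added here for clarity):

P(C_i \mid X) \propto P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)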

The neural networks commonly used for data classification are of the backpropagation type. Backpropagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class.

5. Classification by Decision Tree Induction: ID3 and C4.5

Decision trees are powerful and popular tools for classification and prediction [3]. The attractiveness of tree-based methods is due in large part to the fact that, in contrast to neural networks, decision trees represent rules. Rules can readily be expressed in a language that humans can understand, or in a database access language such as SQL. In some applications, the accuracy of a classification or prediction is the only thing that matters, for example in selecting (or predicting) the most promising customers; in such cases, neural networks can be used. In other situations, however, the ability to explain the reason for a decision is crucial: rejecting a loan applicant, for example, requires some explanation. There are a variety of algorithms for building decision trees; the most popular are CART, CHAID and C4.5 [3]. A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The top-most node in a tree is the root node. An example of a tree is given in Figure 2.


Figure 2. A decision tree example.

Decision Tree Induction
This section discusses a well-known decision tree induction algorithm, C4.5, by first introducing the basic methods of its predecessor, the ID3 algorithm; the enhancements applied in C4.5 are then given. As mentioned previously, the basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner. Figure 3 shows the basic algorithm of ID3. The basic strategy is as follows [1]:
- The tree starts as a single node representing the training samples (step 1).
- If the samples are all of the same class, then the node becomes a leaf and is labeled with that class (steps 2 and 3).
- Otherwise, the algorithm uses an entropy-based measure known as information gain as a heuristic for selecting the attribute that will best separate the samples into individual classes (step 6). This attribute becomes the "test" or "decision" attribute at the node (step 7). (All of the attributes must be categorical, i.e. discrete-valued; continuous-valued attributes must be discretized.)
- A branch is created for each known value of the test attribute, and the samples are partitioned accordingly (steps 8-10).
- The algorithm applies the same process recursively to form a decision tree for the samples at each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants (step 13).
- The recursive partitioning stops only when one of the following conditions is true:
  o All the samples for a given node belong to the same class (steps 2 and 3), or
  o There are no remaining attributes on which the samples may be further partitioned (step 4). In this case, majority voting is employed (step 5): the given node is converted into a leaf and labeled with the majority class among its samples. Alternatively, the class distribution of the node samples may be stored.
  o There are no samples for the branch test-attribute = ai (step 11). In this case, a leaf is created with the majority class in samples (step 12).


Algorithm: Generate_decision_tree.
Narrative: Generate a decision tree from the given training data.
Input: The training samples, samples, represented by discrete-valued attributes; the set of candidate attributes, attribute-list.
Output: A decision tree.
Method:
(1)  create a node N;
(2)  if samples are all of the same class, C, then
(3)    return N as a leaf node labeled with the class C;
(4)  if attribute-list is empty then
(5)    return N as a leaf node labeled with the most common class in samples; // majority voting
(6)  select test-attribute, the attribute among attribute-list with the highest information gain;
(7)  label node N with test-attribute;
(8)  for each known value ai of test-attribute:
(9)    grow a branch from node N for the condition test-attribute = ai;
(10)   let si be the set of samples in samples for which test-attribute = ai; // a partition
(11)   if si is empty then
(12)     attach a leaf labeled with the most common class in samples;
(13)   else attach the node returned by Generate_decision_tree(si, attribute-list - test-attribute);

Figure 3. Basic algorithm for inducing a decision tree from training samples [1].

Attribute Selection Measure
The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure, or a measure of the goodness of split. The attribute with the highest information gain (equivalently, the greatest entropy reduction) is chosen as the test attribute for the current node. Let S be a set consisting of s data samples, and suppose the class label attribute has m distinct values defining m distinct classes C_i (for i = 1, ..., m). Let s_i be the number of samples of S in class C_i. The expected information needed to classify a given sample is given by

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability that an arbitrary sample belongs to class C_i, estimated by s_i / s. The log function to base 2 is used because the information is encoded in bits. Let attribute A have v distinct values {a_1, a_2, \ldots, a_v}. Attribute A can be used to partition S into v subsets {S_1, S_2, \ldots, S_v}, where S_j contains those samples in S that have value a_j of A. If A were selected as the test attribute (i.e., the best attribute for splitting), then these subsets would correspond to the branches grown from the node containing the set S. Let s_{ij} be the number of samples of class C_i in subset S_j. The entropy, or expected information based on the partitioning into subsets by A, is given by

E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})

The term (s_{1j} + \cdots + s_{mj}) / s acts as the weight of the j-th subset: it is the number of samples in the subset (having value a_j of A) divided by the total number of samples in S. The smaller the entropy value, the greater the purity of the subset partitions. For a given subset S_j,

I(s_{1j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2(p_{ij})

where p_{ij} = s_{ij} / |S_j| is the probability that a sample in S_j belongs to class C_i. The encoding information that would be gained by branching on A is

Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)

In other words, Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A. The algorithm computes the information gain of each attribute; the attribute with the highest information gain is chosen as the test attribute for the given set S. A node is created and labeled with that attribute, branches are created for each value of the attribute, and the samples are partitioned accordingly.
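A minimal, runnable sketch of these measures and of the recursive strategy of Figure 3 (no pruning, and the empty-partition case of steps 11-12 cannot arise here because branches are grown only for observed values); the toy samples and attribute names are assumptions for illustration:

    import math
    from collections import Counter

    def info(labels):
        # I(s1, ..., sm): expected information (entropy) of a list of class labels.
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def gain(samples, labels, attr):
        # Gain(A) = I(s1, ..., sm) - E(A) for a categorical attribute attr.
        total = len(labels)
        e_a = 0.0
        for value in set(s[attr] for s in samples):
            subset = [lab for s, lab in zip(samples, labels) if s[attr] == value]
            e_a += (len(subset) / total) * info(subset)
        return info(labels) - e_a

    def id3(samples, labels, attributes):
        # Leaf if pure (steps 2-3); majority vote if no attributes remain (steps 4-5);
        # otherwise split on the highest-gain attribute (steps 6-13).
        if len(set(labels)) == 1:
            return labels[0]
        if not attributes:
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: gain(samples, labels, a))
        tree = {best: {}}
        for value in set(s[best] for s in samples):
            part = [(s, lab) for s, lab in zip(samples, labels) if s[best] == value]
            sub_samples = [p[0] for p in part]
            sub_labels = [p[1] for p in part]
            rest = [a for a in attributes if a != best]
            tree[best][value] = id3(sub_samples, sub_labels, rest)
        return tree

    # Hypothetical training data in the spirit of the paper's card dataset.
    samples = [{"AgeGroup": "young", "HasLoan": "no"},
               {"AgeGroup": "young", "HasLoan": "yes"},
               {"AgeGroup": "adult", "HasLoan": "no"},
               {"AgeGroup": "adult", "HasLoan": "yes"}]
    labels = ["junior", "junior", "classic", "gold"]
    print(id3(samples, labels, ["AgeGroup", "HasLoan"]))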

Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improved ability of the tree to correctly classify independent test data. There are two common approaches to tree pruning: prepruning and postpruning. In the prepruning approach, a tree is "pruned" by halting its construction early (by deciding not to further split or partition the subset of training samples at a given node); upon halting, the node becomes a leaf. In the postpruning approach, a tree is pruned after it is "fully grown": a tree node is pruned by removing its branches, and the lowest unpruned node becomes a leaf labeled with the most frequent class among its former branches.

Extracting Classification Rules from Decision Trees
The knowledge represented in decision trees can be extracted and represented in the form of IF-THEN rules. One rule is created for each path from the root to a leaf node. Each attribute-value pair along a given path forms a conjunction in the rule antecedent (the "IF" part), and the leaf node holds the class prediction, forming the rule consequent (the "THEN" part). IF-THEN rules may be easier for humans to understand, especially when the tree is very large.

C4.5: An Enhancement to ID3
Several enhancements to the basic decision tree (ID3) algorithm have been proposed. C4.5 (discussed in detail in [8]), a successor algorithm to ID3, proposes mechanisms for three types of attribute tests:
1. The "standard" test on a discrete attribute, with one outcome and branch for each possible value of that attribute.

2. A more complex test, based on a discrete attribute, in which the possible values are allocated to a variable number of groups, with one outcome for each group rather than for each value.

3. If attribute A has continuous numeric values, a binary test with outcomes A <= Z and A > Z, based on comparing the value of A against a threshold value Z. Given v values of A, v-1 possible splits are considered in determining Z, namely the midpoints between each pair of adjacent values (a sketch of this threshold search follows the list).
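A minimal sketch of this threshold search; the entropy helper repeats the I(...) measure above, and the ages and labels are hypothetical, chosen to echo the Age <= 20 split of Figure 5:

    import math
    from collections import Counter

    def info(labels):
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def best_threshold(values, labels):
        # Candidate thresholds Z are the v-1 midpoints between adjacent sorted values;
        # keep the Z whose binary split A <= Z / A > Z yields the highest gain.
        xs = sorted(set(values))
        best_z, best_gain = None, -1.0
        for lo, hi in zip(xs, xs[1:]):
            z = (lo + hi) / 2.0
            left = [lab for v, lab in zip(values, labels) if v <= z]
            right = [lab for v, lab in zip(values, labels) if v > z]
            g = info(labels) - (len(left) / len(labels)) * info(left) \
                             - (len(right) / len(labels)) * info(right)
            if g > best_gain:
                best_z, best_gain = z, g
        return best_z, best_gain

    ages = [18, 19, 20, 25, 30, 40]
    cards = ["junior", "junior", "junior", "classic", "classic", "classic"]
    print(best_threshold(ages, cards))   # -> (22.5, 1.0)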

The information gain measure is biased in that it tends to prefer attributes with many values. C4.5 therefore proposes the gain ratio, which takes the probability of each attribute value into account.
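In the notation of Section 5, with attribute A partitioning S into subsets S_1, ..., S_v, the gain ratio (as defined in [8]) normalizes the gain by the split information of the partition:

SplitInfo(A) = -\sum_{j=1}^{v} \frac{|S_j|}{|S|} \log_2 \frac{|S_j|}{|S|}

GainRatio(A) = \frac{Gain(A)}{SplitInfo(A)}

An attribute with many distinct values produces a large SplitInfo, which penalizes its otherwise inflated information gain.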

6. Experiment
An experiment was conducted with the goal of finding the steps needed to utilize the C4.5 algorithm for classifying a real banking dataset, discovering the rules generated from the dataset, and interpreting their meaning.

Banking Dataset Description
The original banking dataset used for the experiment was obtained from [10]. It consists of several text files, as described in [9]. The data was exported and stored in an Access database. The database contains data related to a bank's clients; its schema is given in Figure 4. As Figure 4 shows, there are relations Account, Client, Disposition, PermanentOrder, Transaction, Loan, CreditCard and District, which are related to one another. There are 4,500 tuples in Account, 5,369 in Client, 5,369 in Disposition, 6,471 in PermanentOrder, 1,056,320 in Transaction, 682 in Loan, 892 in CreditCard, and 77 in District. A detailed description of the data can be found in [9].

Classifying the Banking Dataset
Suppose the bank's marketing managers need to classify the customers who hold credit cards, so that they can offer the right card to customers who currently hold none. Likewise, the loan division needs to classify the customers who have loans, so that it can predict whether new loan applicants will be good customers. The tasks chosen for analyzing the data are therefore to classify customers who hold a credit card and customers who have a loan. The data is considered clean and complete, so no treatment is applied to improve its quality. To select the relevant data from the database, two datasets are created: one for analyzing credit card holders and the other for analyzing loan owners.

The original C4.5 requires three files as its inputs: filename.names, filename.data and filename.test [8]. Filename.names contains the definition of the label attribute and the names of the attributes together with their categorical values, or their designation as continuous. Filename.data contains the training data (one tuple per line) and filename.test contains the test data (one tuple per line).
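For instance, a card.names file for the card task might begin as follows; the format follows [8], text after "|" is a comment, and the attribute names here are illustrative assumptions:

    junior, classic, gold.      | the values of the label attribute
    Age: continuous.            | a numeric attribute
    Gender: male, female.      | a categorical attribute and its values
    AvgBalance: continuous.

Each line of card.data and card.test then holds one tuple: the comma-separated attribute values followed by the class label.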


Dataset for card holders: The data considered relevant for analysis is stored in the tables Client, District, Account, Transaction, Loan and CreditCard. The tables are joined by properly constructed SQL statements. The attributes selected are birth number from table Client; the sum of amount from table Loan; the count of order id from table PermanentOrder; the average of balance from table Transaction; A4, A10 and A11 from table District; and type from table CreditCard. From the result of the join operation, the age and gender of each customer are then computed from the birth number. The result is exported to two text files: card.data, containing 810 lines, and card.test, containing 82 lines or tuples.
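The derivation of age and gender can be sketched as follows. In the source dataset, birth number is encoded as YYMMDD, with 50 added to the month for female clients [9]; the helper below and its reference year 1999 are illustrative assumptions:

    def age_and_gender(birth_number, ref_year=1999):
        # birth_number is YYMMDD; a month value above 50 marks a female client.
        yy = birth_number // 10000
        mm = (birth_number // 100) % 100
        gender = "female" if mm > 50 else "male"
        return ref_year - (1900 + yy), gender   # all clients are born in the 1900s

    print(age_and_gender(706213))   # 70-62-13 -> (29, 'female')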

Figure 4. The database schema in MS Access, showing the names of the relations and the relationships among them.

Dataset for loan owners: The data considered relevant for analysis is stored in the tables Client, District, Account, PermanentOrder, Loan and CreditCard. Transaction data could actually be useful in classifying loan owners; unfortunately, the transaction data stored in table Transaction is incomplete. The table contains only some of the transactions made by some of the customers (not all of the tuples in Loan are related to tuples in Transaction), so it could not be used. The selected tables are joined by properly constructed SQL statements. The attributes selected are birth number from table Client; A4, A10 and A11 from table District; the count of order id and the sum of amount from table PermanentOrder; type from table CreditCard; and the sum of amount, the duration, and the status from table Loan. The loan statuses A and C are converted to good, and B and D to bad (see [9] for the description of loan status). The result is exported to two text files: loan.data, containing 600 lines, and loan.test, containing 83 lines or tuples. The chosen datasets are not normalized and are not generalized to higher-level concepts, as the database schema does not exhibit hierarchies.

The result of presenting the training and test data of the card dataset to the C4.5 program (downloaded from [11]) is given in Figure 5. It turns out that C4.5 classifies the data by the attribute Age alone. Of the 810 records in the training data, 131 are classified as junior card holders and 679 as classic card holders. The evaluation on the training and test data (Figure 6) shows that some of the customers are misclassified: 79 customers who hold a gold card are classified as classic card holders. This happens due to the tree pruning discussed in Section 5. The error percentage is 9.8% on the training data and 11.0% on the test data. If this error is acceptable, then the rules given in Figure 5(b) can be applied to new customer records to predict the type of card a customer would buy. However, it can easily be learned from the rules that they are already known and would never predict any gold card holder. Therefore these rules, despite the acceptable error percentage, would not be applicable or useful in making business decisions, and would not help the bank's managers improve their marketing strategies. To generate better rules, clearly, more data that "tells" more about the bank customers needs to be gathered.


Figure 5. The output of the C4.5 algorithm for the card dataset: (a) Decision tree. (b) Rules generated from the tree.


Figure 6. The evaluation on (a) the training data and (b) the test data of the card dataset.

The result of presenting the training and test data of the loan dataset to C4.5 is given in Figure 7. Here, C4.5 generates a decision tree and rules that use several attributes. As can be seen in Figure 7(b), the attributes used in the rules are NoPermOrder, PermOrderAmt and AvgSalary. NoPermOrder denotes the number of permanent-order services a customer subscribes to. One purpose of subscribing to this service is to pay loans periodically (for example, monthly) and automatically, so loan owners may subscribe to it only after they are granted loans. PermOrderAmt states the amount deducted from the customer's account for this service, so it, too, may only appear after loan owners have loans. AvgSalary is the average salary of the district where the customer lives; this may be a useful attribute for characterizing loan owners. The rules using this attribute are rather suspicious, however: Rule 5 states that customers living in districts with an average salary above 9624 are bad customers, while Rule 2 states that customers living in districts with an average salary of at most 9624 are good customers. These two rules need further investigation to prove their correctness.
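The extracted rules can be applied programmatically to new applicant records. A minimal sketch encoding the four rules of Figure 7(b), with C4.5's ordered-rules semantics simplified to first-match-wins (an assumption of this sketch):

    def classify_loan(no_perm_order, perm_order_amt, avg_salary):
        # Conditions transcribed from Figure 7(b); rule confidences in comments.
        if no_perm_order <= 1.0 and 7512.7 < perm_order_amt <= 7742.0:
            return "Bad"    # Rule 1 [79.4%]
        if avg_salary > 9624.0 and no_perm_order <= 1.0 and perm_order_amt > 7512.7:
            return "Bad"    # Rule 5 [66.2%]
        if no_perm_order > 1.0:
            return "Good"   # Rule 6 [94.4%]
        if avg_salary <= 9624.0 and perm_order_amt > 7742.0:
            return "Good"   # Rule 2 [91.1%]
        return "Good"       # default class

    print(classify_loan(no_perm_order=1, perm_order_amt=7600.0, avg_salary=9000.0))  # -> Bad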

C4.5 [release 8] decision tree generator
----------------------------------------
Read 810 cases (8 attributes) from card.data

Decision Tree:
 Age <= 20.0 : junior (131.0)
 Age > 20.0 : classic (679.0/79.0)

Tree saved

C4.5 [release 8] rule generator
-------------------------------
Final rules from tree 0:
 Rule 1: Age <= 20.0 -> class junior [98.9%]
 Rule 2: Age > 20.0 -> class classic [87.4%]
Default class: classic

Evaluation on training data (810 items):
Tested 810, errors 79 (9.8%)   <<
  (a)   (b)   (c)   <-classified as
 ----  ----  ----
         79         (a): class gold
        600         (b): class classic
              131   (c): class junior

Evaluation on test data (82 items):
Tested 82, errors 9 (11.0%)   <<
  (a)   (b)   (c)   <-classified as
 ----  ----  ----
          9         (a): class gold
         59         (b): class classic
               14   (c): class junior


Beyond the error percentages, Figure 8 also shows that most of the loan owners are good ones. Therefore, in analyzing the bank dataset, it may be more appropriate to focus the analysis on the bad customers and to gather more facts about them.


Figure 7. The output of the C4.5 algorithm for the loan dataset: (a) Decision tree. (b) Rules generated from the tree.

Figure 8. The evaluation on (a) the training data and (b) the test data of the loan dataset.

Evaluation on training data (600 items):
Tested 600, errors 60 (10.0%)   <<
  (a)   (b)   <-classified as
 ----  ----
  529     1   (a): class Good
   59    11   (b): class Bad

Evaluation on test data (83 items):
Tested 83, errors 7 (8.4%)   <<
  (a)   (b)   <-classified as
 ----  ----
   76         (a): class Good
    7         (b): class Bad

C4.5 [release 8] decision tree generator
----------------------------------------
Read 600 cases (9 attributes) from loan.data

Decision Tree:
 NoPermOrder > 1.0 : Good (385.0/18.0)
 NoPermOrder <= 1.0 :
 |  PermOrderAmt <= 7512.7 : Good (189.0/38.0)
 |  PermOrderAmt > 7512.7 :
 |  |  PermOrderAmt <= 7742.0 : Bad (6.0)
 |  |  PermOrderAmt > 7742.0 :
 |  |  |  AvgSalary > 9624.0 : Bad (6.0/1.0)
 |  |  |  AvgSalary <= 9624.0 :
 |  |  |  |  NofInhabitans > 70699.0 : Good (9.0)
 |  |  |  |  NofInhabitans <= 70699.0 :
 |  |  |  |  |  NofInhabitans <= 45714.0 : Good (3.0/1.0)
 |  |  |  |  |  NofInhabitans > 45714.0 : Bad (2.0)

C4.5 [release 8] rule generator
-------------------------------
Read 600 cases (9 attributes) from loan
Processing tree 0
Final rules from tree 0:

Rule 1:
    NoPermOrder <= 1.0
    PermOrderAmt > 7512.7
    PermOrderAmt <= 7742.0
    ->  class Bad  [79.4%]

Rule 5:
    AvgSalary > 9624.0
    NoPermOrder <= 1.0
    PermOrderAmt > 7512.7
    ->  class Bad  [66.2%]

Rule 6:
    NoPermOrder > 1.0
    ->  class Good  [94.4%]

Rule 2:
    AvgSalary <= 9624.0
    PermOrderAmt > 7742.0
    ->  class Good  [91.1%]

Default class: Good


Another experiment, with the intention of visualizing and then clustering the two datasets, has also been conducted. The techniques used were Self-Organizing Maps (SOM) and the K-Means algorithm. Due to space limitations, however, the results could not be presented in this paper. The clustering results show similarities with the results of the tree induction experiment: for the card dataset, only the attributes age and card type are important, whereas for the loan dataset, the attributes NoPermOrder, PermOrderAmt and loan status play a significant role in forming clusters.

7. Conclusion
The C4.5 algorithm performs well in constructing decision trees and extracting rules from the banking dataset. However, an application with a graphical user interface implementing the C4.5 algorithm is needed in order to provide ease of use and better visualization of the decision trees. Such an application should also provide features for accessing databases directly, as most business data is stored in databases. From the experiment results, it can be learned that several attributes go unused in classification, and some attributes used in the resulting rules carry little meaning for business decisions. Hence, selecting the proper attributes from the dataset plays a significant role in data classification. For classifying banking datasets, a banking knowledge base and statistical methods for analyzing the attributes relevant to the task must be employed. In order to discover new, meaningful and "actionable" knowledge from the banking dataset, more data needs to be collected. This might include data related to the customers, such as detailed demographic data, and more varied as well as complete transactional data.

8. References
[1] Han, Jiawei; Kamber, Micheline; "Data Mining: Concepts and Techniques", Morgan Kaufmann Pub., USA, 2001.

[2] IBM, “Mellon Bank Forecasts a Bright Future for Data Mining”, Data Management Solutions Banking, http://www.software.ibm.com/data, 1998.

[3] Berry M.J., Linoff G., “Data Mining Techniques for Marketing, Sales and Customer Support”, John Wiley & Sons Inc., USA, 1997.

[4] Hu, Xiaohua, “Using Rough Sets Theory and Database Operations to Construct a Good Ensemble of Classifiers for Data Mining Applications”, IEEE ICDM Proceedings, December, 2001.

[5] Brause, R., Langsdorf, T., Hepp, M., “Neural Data Mining for Credit Card Fraud Detection”, J.W. Goethe-University, Frankfurt, Germany.

[6] Kao, L.J., Chiu, C.C.; "Mining the Customer Credit by Using the Neural Network Model with Classification and Regression Tree Approach", IEEE Transactions on Data Engineering and Knowledge Discovery, Vol. 1, p. 923, 2001.

[7] Syeda,M., Zhang, Y.Q., Pan, Y.; “Parallel Granular Neural Networks for Fast Credit Card Fraud Detection”, IEEE Transaction on Neural Networks, Vol.2, p.572, 2002.

Page 12: C45 Algorithm

INTEGRAL, Vol. 8 No. 2, Oktober 2003

116

[8] Quinlan, J. Ross; "C4.5: Programs for Machine Learning", Morgan Kaufmann Pub., USA, 1993.

[9] Berka, Petr; “Guide to the Financial Data Set”, Laboratory for Intelligent Systems, Univ. of Economics, Prague, Czech Republic, http://lisp.vse.cz/pkdd99.

[10] http://lisp.vse.cz/pkdd99.
[11] http://www.mkp.com/c45.
[12] Connolly, Thomas; Begg, Carolyn; "Database Systems: A Practical Approach to Design, Implementation and Management", 3rd ed., Addison Wesley Pub., USA, 2002.