
Special Issue Article

(wileyonlinelibrary.com) DOI: 10.1002/qre.1675 Published online 20 June 2014 in Wiley Online Library

A Method for Supporting the Domain Expert by the Interpretation of Different Decision Trees Learnt from the Same Domain

Petra Perner*†

Data mining methods are widely used across many disciplines to identify patterns, rules, or associations among huge volumes of data. Data mining methods with explanation capability such as decision tree induction are preferred in many domains. The aim of this paper is to discuss how to deal with the result of decision tree induction methods. This paper has been prompted by the fact that domain experts are able to use the tools for decision tree induction but have great difficulties in interpreting the results. When the domain expert has learnt two decision trees that are from the same domain but based on different data sets as a result of further data collection, he is faced with the problem of how to interpret the different trees. The comparison of two decision trees is therefore an important issue as the user needs such a comparison in order to understand what has changed. We have proposed to provide him with a measure of correspondence between the two trees that allows him to judge if he can accept the changes or not. In this paper, we propose a proper similarity measure. In case of a low similarity value, the expert has evidence to start exploring the reason for this change. Often, he can find things in the data acquisition that might have resulted in some noise and might be fixed. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: decision tree induction; explanation; similarity measure; decision tree; comparison

1. Introduction

Data mining methods are widely used across many disciplines to identify patterns, rules, or associations among huge volumes of data. Different methods can be applied to accomplish this. Among them are neural nets [1], support vector machines (SVMs) [2], regression models [3,4], Bayesian networks [5] and graphical models [6], decision forests [7], and decision trees [8]. While in the past mostly black-box methods such as neural nets and SVMs have been heavily used in technical domains, methods that have explanation capability have been particularly used in medical domains because a physician likes to understand the outcome of a classifier and map it to his domain knowledge; otherwise, the level of acceptance of an automatic system is low. Nowadays, data mining methods with explanation capability are also used for technical domains as more work on advantages and disadvantages of the methods has been carried out [9].

The most preferred method among the methods with explanation capability is the decision tree induction method [10]. This method can easily learn a decision tree without heavy user interaction, whereas in neural nets, a lot of time is spent on training the net. Cross-validation methods can be applied to decision tree induction methods; these methods ensure that the calculated error rate comes close to the true error rate. Many decision tree methods exist, but the method that works well on average on all kinds of data sets is still the C4.5 decision tree method [11] and some of its variants. Although the user can easily apply this method to his data set thanks to all the different tools that are available and set up in such a way that a user who is not a computer-science specialist can use them without any problem, the user is still faced with the problem of how to interpret the result of a decision tree induction method. This problem especially arises when two different data sets for one problem are available or when the data set is collected in temporal sequence. Then the data set grows over time, and the resulting tree might change.

The aim of this paper is to give an overview of the problems that arise when interpreting two or more decision trees learnt from the same domain. This paper is aimed at providing the user with a methodology on how to use the resulting model of decision tree induction methods.

Model comparison is known from structural regression models based on statistical measures [12]. In decision forests, the tree properties are used to judge the learnt decision forest [13], as well as statistical measures that describe the solution space covered by the tree [7]. In this paper, we develop a method that takes into account the tree structure as well as the rules in the tree.

Institute of Computer Vision and Applied Computer Sciences IBaI, PO Box 30 11 14, 04251 Leipzig, Germany
*Correspondence to: Petra Perner, Institute of Computer Vision and Applied Computer Sciences IBaI, PO Box 30 11 14, 04251 Leipzig, Germany.
†E-mail: [email protected]

Copyright © 2014 John Wiley & Sons, Ltd. Qual. Reliab. Engng. Int. 2014, 30 985–992


In Section 2, we explain the data collection problem, and we introduce how the comparison measure works based on the IRIS data set. In Section 3, we review how to interpret a learnt decision tree. Section 4 deals with the comparison of two different decision trees learnt from the same domain. We explain the preprocessing of a decision tree in Section 4.1 and describe our developed similarity measure in Section 4.2. Finally, we summarize our work in Section 5.

2. The problem

Many factors influence the result of the decision tree induction process. The data collection is a tricky pitfall. The data might become very noisy due to some subjective or system-dependent problems during the data collection process.

Newcomers to data mining approach it step by step. First, they will acquire a small database that allows them to test what can be achieved by data mining methods. Then, they will enlarge the database, hoping that a larger data set will result in better data mining results. But often this is not the case.

Others may have large data collections gathered in their daily practice, such as in marketing and finance. At a certain point, they want to analyze these data with data mining methods. If they do this based on all the data, they might be faced with a lot of noise in the data because customer behavior might have changed over time due to some external factors such as economic factors and climate condition changes in a certain area.

Web data can change severely over time. People worldwide can access a website and leave a distinct track depending on the geographic area they are from and the nation they belong to.

If the user has to label the data, the subjective decision about the class to which the data belong might result in some noise. Depending on his current form or his level of experience, the expert will label the data properly or not as well as he should.

If the data have been collected over an extended period of time, there might be some data drift. In case of a Web-based shop, the customers frequenting the shop might have changed because the products now attract other groups of people. In a medical application, the data might change because the medical treatment protocol has been changed. This must be taken into consideration when using the data.

It is also possible that the data are collected in time intervals. The data in time period_1 might have other characteristics than the data collected in time period_2. In agriculture, this might be true because the weather conditions have changed. If this is the case, the data cannot make up a single data set. The data must be kept separate, with a tag indicating that they were collected under different weather conditions.

In this paper, we describe the behavior of decision tree induction under changing conditions (Figure 1) in order to give the user a methodology for using decision tree induction methods.

To give you a real example, we have learnt two decision trees from the IRIS data set, one based on the ID3 algorithm and one based on the Gini-Index-based algorithm. The resulting trees are shown in Figure 2 for the ID3 algorithm and in Figure 3 for the Gini-Index-based algorithm. Up to the third level, the trees are similar. After level three, the structure changes. The similarity measure we propose in this paper as a comparison measure gives a similarity value of

$$\mathrm{Sim}_{1,2} = \frac{2 \cdot 2}{8 + 11} = 0.21$$

This means that the two decision trees are rather dissimilar. They are only identical up to the second depth level of the tree. Because the tree has five depth levels, the resulting similarity value is low. Because the structure changes at a high level, some serious changes have happened in the explanation. This change can also be seen in the data portion belonging to the leaf nodes. In a real-world experiment, it is now up to the expert to investigate the reason for this change. In this experiment, we know the reason to be the two different decision tree induction algorithms.
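The experimental set-up can be reproduced in spirit with standard tools. The following minimal Python sketch (our own assumption; the paper does not prescribe a tool) learns two trees on the IRIS data with an entropy-based and a Gini-based split criterion using scikit-learn. Note that scikit-learn builds CART-style binary trees, so criterion="entropy" only approximates ID3 and the resulting trees may differ from Figures 2 and 3.

# Sketch: learn two decision trees on the IRIS data with two split criteria.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree_id3_like = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(iris.data, iris.target)
tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(iris.data, iris.target)

# Print both trees as nested IF-THEN structures for a first visual comparison.
print(export_text(tree_id3_like, feature_names=iris.feature_names))
print(export_text(tree_gini, feature_names=iris.feature_names))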

3. How to interpret the results of a decision tree?

Suppose we have a data set X with n samples. The outcome of the data mining process is a decision tree that represents the model in a hierarchical rule-based fashion. One path from the top to the leaf of a tree can be described by a rule that combines the decisions of each node by a logical AND.

Figure 1. The data collection problem (time axis with data stream DS_n, DS_n+1, …; change in data collection protocol, influence from outside, data sampling (DS) strategy)

Figure 2. Decision tree 1

Figure 3. Decision tree 2


The closer the decision is to the leaf, the more noise is contained in the decision because the entire data set is subsequently split into two parts from the top to the bottom and, in the end, only a few samples are contained in the two data sets. Pruning is performed to prevent the model from overfitting the data. Pruning provides a more compact tree and often a better model in terms of accuracy.


The pruning algorithm is based on an assumption regarding the distribution of the data. If this assumption does not fit the data, the pruned model will not have better accuracy. Then it is better to stay with the unpruned tree.

When users feel confident about the data mining process, they are often keen on getting more data. Then they apply the updated data set, combined from the data set X with n samples and the new data set X′ and containing n + t samples (n < n + t), to decision tree induction. If the resulting model only changes in nodes close to the leaves of the decision tree, the user will accept the result.

There will be confusion when the whole structure of the decision tree has changed, especially when the attribute in the root node changes. The root node decision should be the most confident decision. The reason for a change can be that there were always two competing attributes having slightly different values for the attribute selection criterion. Now, based on the data, the attribute ranked second in the former procedure is ranked first. When this happens, the whole structure of the tree will change because a different attribute in the first node will result in a different first split of the entire data set.

It is important that this situation is visually presented to the user so that he can judge the result. Often, the user already has some domain knowledge and prefers a certain attribute to be ranked first. A way to enable such a preference is to allow the user to actively pick the attribute for the node.

Such visualization techniques should enable the user to see the location of the class-specific data depending on two attributes. This helps the user to understand what has changed in the data. From a list of attributes, the user can pick two attributes, and the respective graph will be presented.
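As an illustration of such a visualization (our own example, not part of the paper), the following sketch plots the class-specific IRIS data over two user-chosen attributes with matplotlib.

# Sketch: scatter plot of the class-specific data for two attributes picked
# from the attribute list; the concrete attribute pair is an arbitrary choice.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
a, b = 2, 3  # e.g. petal length and petal width, chosen by the user
for cls, name in enumerate(iris.target_names):
    mask = iris.target == cls
    plt.scatter(iris.data[mask, a], iris.data[mask, b], label=name)
plt.xlabel(iris.feature_names[a])
plt.ylabel(iris.feature_names[b])
plt.legend()
plt.show()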

Another way to judge this situation is to look at the variance of the accuracy. If the variance is high, the model is not stable yet. The data do not give enough confidence in regard to the decision.
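A hedged sketch of how the variance of the accuracy could be estimated, here with 10-fold cross-validation in scikit-learn (the estimator and fold count are our assumptions, not prescribed by the paper):

# Sketch: estimate mean and standard deviation of the cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# A large standard deviation indicates that the learnt model is not stable yet.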

The described situation can indicate that something is wrong with the data. It is often helpful to talk to the user and figure out how the new data set has been obtained. To give an example, a database contains information about the mortality rate of patients that have been treated for breast cancer. Information about the patients, such as age, size, weight, measurements taken during the treatment, and finally the success or failure, is reported. In the time period T1, treatment with a certain cocktail of medicine, radioactive treatment, and physiotherapy has taken place; this kind of treatment is called a protocol. In the time period T2, the physicians changed the protocol because other medicine is available or other treatment procedures have been reported to be more successful in the medical literature. The physicians know about the change in protocol, but they did not inform you accordingly. Then the whole tree might change; as a result, the decision rules change, and the physicians cannot confirm the resulting knowledge because it does not fit their knowledge about the disease as established in the meantime. The resulting tree has to be discussed with the physicians; the outcome may be that, in the end, the new protocol is simply not good.

However, we want to give the user a 'how to' method for comparing two decision trees learnt from the same domain. We have developed a similarity-based approach that results in measurement values for the similarity.

4. Comparison of two decision trees

Two data sets of the same domain that might have been collected at different times and used separately, or combined into one large data set, for decision tree induction might result in two different decision trees. Then the question arises of how similar these two decision trees are and how to interpret the changes in the decision trees. If the models are not similar, then something significant has changed in the data set. This is evidence that the user should check the data or investigate the whole application.

We propose in this section a proper similarity measure for comparing the two learnt trees and answering the question of how similar these two trees are. In order to compute the similarity measure, we need to preprocess the trees and transform each tree into a rule set first.

4.1. Preprocessing of the decision tree

The path from the top of a decision tree to a leaf is described by a rule such as 'IF attribute A ≤ x and attribute B ≤ y and attribute C ≤ z and … THEN Class_1'. The transformation of a decision tree into a rule-like representation can be easily carried out. The location of an attribute is fixed by the structure of the decision tree.
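As an illustration, the following sketch extracts such rules from a decision tree fitted with scikit-learn by walking its internal tree structure; the function name and rule representation are our own choices, not prescribed by the paper.

# Sketch: turn a fitted scikit-learn decision tree into IF-THEN rules, each
# rule being the ordered chain of node decisions from the root to a leaf.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def tree_to_rules(clf, feature_names):
    tree = clf.tree_
    rules = []

    def walk(node, conditions):
        if tree.children_left[node] == -1:           # leaf node
            label = tree.value[node].argmax()        # majority class at the leaf
            rules.append((tuple(conditions), label))
            return
        name = feature_names[tree.feature[node]]
        thr = tree.threshold[node]
        walk(tree.children_left[node],  conditions + [f"{name} <= {thr:.2f}"])
        walk(tree.children_right[node], conditions + [f"{name} > {thr:.2f}"])

    walk(0, [])
    return rules

# Usage example:
iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(iris.data, iris.target)
for conditions, label in tree_to_rules(clf, iris.feature_names):
    print("IF " + " AND ".join(conditions) + f" THEN class {iris.target_names[label]}")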

Comparison of rule sets is known from rule induction methods in different domains [14]. Here the induced rules are usually compared to the human-built rules [15,16]. Often, this is carried out manually and should give a measure of how good the constructed rule set is.

These kinds of rules can also be automatically compared by substructure mining. The following questions can be asked: (i) How many rules are identical? (ii) How many of them are identical compared to all rules? (iii) Which rules contain partial structures of the decision tree?

4.2. Similarity measure for comparing the decision trees

We propose a similarity measure for the differences of the two models as follows:

(i) Transform the two decision trees d1 and d2 into rule sets and order the rules of the two decision trees according to the number n of attributes in a rule.

(ii) Then build the substructures of all l rules by decomposing the rules into their substructures (Figure 7), starting at the root node (a sketch of this decomposition follows the list).


Note: it is important to always start at the root node when building a particular substructure because the root node decides how the data set is recursively split.

(iii) Compare two rules i and j of the two decision trees di and dj for each of the nj and ni substructures with s attributes.
(iv) Build the similarity measure SIMij according to formulas (1)–(5).
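A minimal sketch of one plausible reading of step (ii): every substructure of a rule is a prefix of its condition chain, always starting at the root node. The rule representation is the one produced by the rule-extraction sketch in Section 4.1.

# Sketch: decompose rules into substructures (prefixes of the condition chain).
def rule_substructures(conditions):
    # conditions: tuple of node decisions ordered from the root to the leaf,
    # e.g. ("A <= 5.00", "B <= 3.00") -> {("A <= 5.00",), ("A <= 5.00", "B <= 3.00")}
    return {conditions[:k] for k in range(1, len(conditions) + 1)}

def tree_substructures(rules):
    # rules: list of (conditions, label) pairs for one decision tree
    subs = set()
    for conditions, _label in rules:
        subs |= rule_substructures(conditions)
    return subs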

We want to know how many substructures are similar and how many substructures one subset has but not the other. The Tversky similarity measure [17] gives us a measure that can be applied to this problem:

$$\mathrm{Sim}(d_i, d_j) = \frac{f(A \cap B)}{f(A \cap B) + \alpha f(A - B) + \beta f(B - A)} \qquad (1)$$

where A is the set of substructures of the decision tree d_i, and B is the set of substructures of the decision tree d_j; f(A ∩ B) are the substructures common to both decision trees d_i and d_j; f(A − B) are the substructures that belong to set A but not to set B, and f(B − A) are the substructures that belong to set B but not to set A.

If α = β = 0.5, then

$$\mathrm{Sim}(d_i, d_j) = \frac{2 f(A \cap B)}{f(A \cup B)} = \frac{2 f(A \cap B)}{l} \qquad (2)$$
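A sketch of formula (2), reading f(A ∪ B) as the total number of substructures of both trees, as in the worked IRIS example of Section 2 (this reading is our interpretation):

# Sketch: structural similarity of two trees given their substructure sets.
def structural_similarity(subs_a, subs_b):
    common = len(subs_a & subs_b)        # f(A ∩ B): substructures common to both trees
    l = len(subs_a) + len(subs_b)        # total substructure count of both trees
    return 2 * common / l

# IRIS example of Section 2: 2 common substructures, 8 and 11 substructures
# per tree, giving 2 * 2 / (8 + 11) = 0.21.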

If the rule contains a numerical attribute A ≤ k_1 and A′ ≤ k_2 = k_1 + x, then the similarity measure is

$$\mathrm{Sim}_{num} = 1 - \frac{|A - A'|}{t} = 1 - \frac{|k_1 - (k_1 + x)|}{t} = 1 - \frac{|x|}{t} \quad \text{for } x < t$$
$$\mathrm{Sim}_{k} = 0 \quad \text{for } x \geq t \qquad (3)$$

with t a user-chosen value that allows x to lie within a tolerance range of s % (e.g., 10%) of k_1. This means that, as long as the cut-point k_1 is within the tolerance range, we consider the term as similar; outside the tolerance range, it is dissimilar. Small changes around the first cut-point are allowed, while a cut-point far from the first cut-point means that something serious has happened with the data and the tree might change significantly. The rule can then be considered as not corresponding to the other rule, or we can calculate a numerical similarity based on the formula in the succeeding text.
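A small sketch of formula (3); the tolerance t is passed in explicitly, and an out-of-tolerance cut-point contributes zero similarity (our reading of the formula):

# Sketch: similarity of two cut-points k1 and k2 on the same numerical attribute.
def sim_num(k1, k2, t):
    x = abs(k1 - k2)
    return 1 - x / t if x < t else 0.0

# e.g. sim_num(2.45, 2.50, t=0.245) is about 0.80; sim_num(2.45, 3.20, t=0.245) is 0.0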

The similarity measure for the whole substructure is

$$\mathrm{Sim}_k = \frac{1}{s} \sum_{z=1}^{s} \begin{cases} \mathrm{Sim}_{num} & \text{for a numerical attribute} \\ 1 & \text{for } A = A' \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

The overall similarity between two decision trees d_1 and d_2 is

$$\mathrm{Sim}_{d_i,d_j} = \frac{2}{l} \sum_{k=1}^{l} \max_{\forall j} \mathrm{Sim}_k \qquad (5)$$

for comparing the rules of decision tree d_i with the rules of decision tree d_j.

How the similarity measure works is shown by five artificial examples represented in Figures 4–8. The separation of the rules into their substructures is shown in Figure 9 for all trees. Table I shows the pairwise similarity values between the decision trees. The deeper the correspondence between the two substructures reaches, the higher the similarity value. If both trees are deep and the correspondence between the two trees stops at a high depth level, the similarity value is low. This means that the lower sub-trees of the two trees have a high impact on how the data are interpreted.
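A sketch of formulas (4) and (5) combined. Terms such as "petal width (cm) <= 0.80" are compared positionally from the root; the condition parsing and the normalisation by the total substructure count l follow our reading of the paper, so treat this as an illustration rather than the reference method.

# Sketch: substructure similarity (formula 4) and overall tree similarity (formula 5).
def sim_num(k1, k2, t):
    x = abs(k1 - k2)
    return 1 - x / t if x < t else 0.0

def term_similarity(term_a, term_b, tol=0.10):
    attr_a, val_a = term_a.rsplit(" ", 1)      # "petal width (cm) <=", "0.80"
    attr_b, val_b = term_b.rsplit(" ", 1)
    if attr_a != attr_b:                       # different attribute or operator
        return 0.0
    k1, k2 = float(val_a), float(val_b)
    t = tol * abs(k1)                          # tolerance range of e.g. 10% of k1
    return sim_num(k1, k2, t) if t > 0 else float(k1 == k2)

def substructure_similarity(sub_a, sub_b):     # formula (4)
    if len(sub_a) != len(sub_b):
        return 0.0
    return sum(term_similarity(a, b) for a, b in zip(sub_a, sub_b)) / len(sub_a)

def tree_similarity(subs_a, subs_b):           # formula (5)
    l = len(subs_a) + len(subs_b)              # total substructure count of both trees
    best = sum(max(substructure_similarity(a, b) for b in subs_b) for a in subs_a)
    return 2 * best / l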

Figure 4. Decision tree_1


Figure 5. Decision tree_2

Figure 6. Decision tree_3

Figure 7. Decision tree_4

Figure 8. Decision tree_5


If the first tree corresponds to the second tree but the second tree is deeper than the first tree, the similarity value is high but does not reach the value of one. This situation can be seen as if the first tree were a pruned-back version of the corresponding second tree.

The similarity measure has values between 0 and 1. The value is 1 for identity and 0 for no correspondence. A similarity value of 0.5 means neutral. A value of less than 0.5 means more dissimilar, and a value of greater than 0.5 means more similar.

Such a similarity measure can help an expert understand the developed model and also aid in comparing two models that have been built based on two different data sets from the same domain.


Figure 9. Comparison of DT_1 to all decision trees, decomposition of rules, and similarity values

Table I. Pairwise similarity value between the decision trees

        DT_1    DT_2    DT_3    DT_4    DT_5
DT_1    1       0.833   0.33    0       0.75
DT_2    0.833   1       0.285   0       0.6
DT_3    0.33    0.285   1       0       0.4
DT_4    0       0       0       1       0
DT_5    0.75    0.6     0.4     0       1


There might be other options for constructing the similarity measure. This consideration is left for our future work.


5. Conclusions

The aim of this paper is to discuss how to deal with the result of data mining methods such as decision tree induction. This paper has been prompted by the fact that domain experts are able to use the tools for decision tree induction but have a hard time interpreting the results. A lot of factors have to be taken into consideration. The quantitative error measures give a good overview of the quality of the learnt model. But computer-science experts claim that decision trees have explanation capabilities and that, compared to neural nets and SVMs, the user can understand the decision. This is only partially true. Of course, the user can follow the path of a decision from the top to the leaves of the tree, and this provides him with a rule where the decisions in the nodes are combined by logical ANDs. But often this is tricky. When the domain expert has learnt two decision trees from the same domain but based on different data sets obtained by further collection of data, he is faced with the problem of how to interpret the different trees.


The comparison of two trees is therefore an important issue, as the user needs this comparison in order to understand what has changed. We have proposed to provide him with a measure of correspondence between the two trees that allows him to judge if he can accept the changes or not. We have proposed a proper similarity measure that gives a measure of quality. The measure has values between one and zero. If the two trees are identical, the value is one; if they are dissimilar, the value is zero. A value of 0.5 means neutral. In case of a similarity value under 0.5, the expert should explore the reason for this change. Often he can find things in the data acquisition that might have resulted in some noise and might be fixed.

The developed method is applicable to decision trees but can also be extended to the comparison of Bayesian networks, structural regression models, and decision forests.

Acknowledgement

This work has been sponsored by the German Ministry of Science and Technology under the grant 'Quantitative Measurement of Dynamic Time Dependent Cellular Events, QuantPro', grant no. 0313831B.

References
1. Michie D, Spiegelhalter DJ, Taylor CC. Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence, 1994.
2. Vapnik V. The Nature of Statistical Learning Theory. Springer-Verlag: Berlin, Heidelberg, 1998.
3. Kenett R, Zacks S. Modern Industrial Statistics: with Applications in R, MINITAB and JMP. John Wiley and Sons: Chichester, 2014.
4. Elek P, Zempléni A. Tail behaviour and extremes of two-state Markov-switching autoregressive models. Computers & Mathematics with Applications 2008; 55(12):2839–2855.
5. Ben-Gal I. Bayesian networks. In Encyclopedia of Statistics in Quality and Reliability, vol. 1, Ruggeri F, Kenett RS, Faltin FW (eds). John Wiley and Sons: Chichester, 2007; 179–185.
6. Kuhnt S, Becker Cl. Sensitivity of graphical modeling against contamination. In Between Data Science and Applied Data Analysis, Schader M, Gaul W, Vichi M (eds). Studies in Classification, Data Analysis, and Knowledge Organization. Berlin, Heidelberg, 2003; 279–287.
7. Ho TK. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 1998; 20(8):832–844.
8. Quinlan JR. Induction of decision trees. Machine Learning 1986; 1(1):81–106.
9. Perner P, Zscherpel U, Jacobsen C. A comparison between neural networks and decision trees based on data from industrial radiographic testing. Pattern Recognition Letters 2001; 22:47–54.
10. Perner P. Data Mining on Multimedia Data. Lecture Notes in Computer Science, vol. 2558. Springer Verlag: Berlin, Heidelberg, 2002.
11. Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc.: San Francisco, 1993.
12. Schermelleh-Engel K, Moosbrugger H, Müller H. Evaluating the fit of structural equation models: tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research Online 2003; 8(2):23–74.
13. Murphy PM, Pazzani MJ. Exploring the decision forest: an empirical investigation of Occam's Razor in decision tree induction. Journal of Artificial Intelligence Research 1994; 1:257–275.
14. Georg G, Séroussi B, Bouaud J. Does GEM-encoding clinical practice guidelines improve the quality of knowledge bases? A study with the rule-based formalism. In AMIA Annual Symposium Proceedings 2003; 254–258.
15. Lee S, Lee SH, Lee KC, Lee MH, Harashima F. Intelligent performance management of networks for advanced manufacturing systems. IEEE Transactions on Industrial Electronics 2001; 48(4):731–741.
16. Bazijanec B, Gausmann O, Turowski K. Parsing effort in a B2B integration scenario - an industrial case study. In Enterprise Interoperability II, Part IX. Springer Verlag: London, 2007; 783–794.
17. Tversky A. Features of similarity. Psychological Review 1977; 84(4):327–350.

Authors' biography

Petra Perner (IAPR Fellow) is the director of the Institute of Computer Vision and Applied Computer Sciences, IBaI. She received her Diploma degree in electrical engineering and her PhD degree in computer science. Her habilitation thesis was about 'A Methodology for the Development of Knowledge-Based Image-Interpretation Systems.' She has been the principal investigator of various national and international research projects. She has received several research awards for her research work and has been awarded three business awards for her work on bringing intelligent image interpretation methods and data mining methods into business. Her research interests are image analysis and interpretation, machine learning, data mining, image mining, and case-based reasoning. Recently, she has been working on various medical, chemical and biomedical applications, information management applications, technical diagnosis, and e-commerce applications. Most of the developments are protected by legal patent rights and can be licensed to qualified industrial companies. She has published numerous scientific publications and patents and is often requested as a plenary speaker in distinct research fields, as well as across disciplines. Her vision is to build intelligent, flexible, and robust data-interpreting systems that are inspired by the human case-based reasoning process.
