Treatment Learning: Implementation and Application

Ying Hu, Electrical & Computer Engineering, University of British Columbia


TRANSCRIPT

Page 1: Treatment Learning: Implementation and Application

Treatment Learning: Implementation and Application

Ying Hu

Electrical & Computer Engineering

University of British Columbia

Page 2: Treatment Learning: Implementation and Application

Ying Hu http://www.ece.ubc.ca/~yingh 2

Outline

1. An example
2. Background Review
3. TAR2 Treatment Learner
   • TARZAN: Tim Menzies
   • TAR2: Ying Hu & Tim Menzies
4. TAR3: improved TAR2
   • TAR3: Ying Hu
5. Evaluation of treatment learning
6. Application of treatment learning
7. Conclusion

Page 3: Treatment Learning: Implementation and Application

First Impression

Boston Housing Dataset (506 examples, 4 classes)

• C4.5's decision tree: [shown as a figure]
• Treatment learner:
  – high: 6.7 <= rooms < 9.8 and 12.6 <= parent-teacher ratio < 15.9
  – low: 0.6 <= nitric oxide < 1.9 and 17.16 <= living standard < 39

Page 4: Treatment Learning: Implementation and Application

Review: Background

What is KDD?
– KDD = Knowledge Discovery in Databases [fayyad96]
– Data mining: one step in the KDD process
– Machine learning: learning algorithms

Common data mining tasks
– Classification
  • Decision tree induction (C4.5) [quinlan86]
  • Nearest neighbors [cover67]
  • Neural networks [rosenblatt62]
  • Naive Bayes classifier [duda73]
– Association rule mining
  • APRIORI algorithm [agrawal93]
  • Variants of APRIORI

Page 5: Treatment Learning: Implementation and Application

Treatment Learning: Definition

– Input: a classified dataset
  • Assumption: classes are ordered
– Output: Rx = a conjunction of attribute-value pairs
  • Size of Rx = number of pairs in the Rx
– confidence(Rx w.r.t. Class) = P(Class | Rx)
– Goal: find an Rx that has different levels of confidence across classes
– Evaluating Rx: lift
– Output is presented in a visualization form
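The two measures above can be sketched in a few lines. The slide does not spell out the lift formula, so the version below is an assumption: the mean (ordered) class score of the examples matching Rx, divided by the baseline mean over all examples. The dataset, attribute names, and class scores are made up for illustration.

```python
def matches(row, rx):
    """True if the row satisfies every attribute-value pair in the treatment."""
    return all(row[a] == v for a, v in rx.items())

def confidence(rows, rx, cls):
    """confidence(Rx w.r.t. cls) = P(cls | Rx)."""
    hit = [r for r in rows if matches(r, rx)]
    return sum(1 for r in hit if r["class"] == cls) / len(hit) if hit else 0.0

def lift(rows, rx, score):
    """Assumed lift: mean class score of Rx-matching rows over the baseline mean."""
    hit = [r for r in rows if matches(r, rx)]
    base = sum(score[r["class"]] for r in rows) / len(rows)
    return (sum(score[r["class"]] for r in hit) / len(hit)) / base

# Toy classified dataset with ordered classes: low < high
rows = [
    {"rooms": "many", "nox": "low",  "class": "high"},
    {"rooms": "many", "nox": "low",  "class": "high"},
    {"rooms": "few",  "nox": "high", "class": "low"},
    {"rooms": "few",  "nox": "low",  "class": "low"},
]
rx = {"rooms": "many"}                    # a size-1 treatment
print(confidence(rows, rx, "high"))       # 1.0
print(round(lift(rows, rx, {"low": 1, "high": 2}), 2))  # 1.33
```

A treatment is "good" when it pushes the class distribution toward the best class, which is exactly what a lift above 1 indicates here.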

Page 6: Treatment Learning: Implementation and Application

Motivation: Narrow Funnel Effect

When is enough learning enough?
– Attributes reduced to < 50%, accuracy decreases only 3-5% [shavlik91]
– A 1-level decision tree is comparable to C4 [holte93]
– Data engineering: ignoring 81% of features results in a 2% increase in accuracy [kohavi97]
– Scheduling: random sampling outperforms complete (depth-first) search [crawford94]

Narrow funnel effect
– Control variables vs. derived variables
– Treatment learning: finding the funnel variables

Page 7: Treatment Learning: Implementation and Application

TAR2: The Algorithm

Search + attribute utility estimation
– Estimation heuristic: confidence1
– Search: depth-first search
  • Search space: confidence1 > threshold

Discretization: equal-width interval binning

Reporting Rx
– Lift(Rx) > threshold

Software package and online distribution
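The discretization step is the only part of the algorithm the slide pins down precisely; a minimal sketch of equal-width interval binning follows (the bin count and sample values are illustrative):

```python
def equal_width_bins(values, k):
    """Equal-width interval binning: split [min, max] into k equal intervals
    and map each value to its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # clamp the maximum value into the last bin (index k - 1)
    return [min(int((v - lo) / width), k - 1) for v in values]

print(equal_width_bins([0.0, 1.0, 2.5, 9.9, 10.0], 5))  # [0, 0, 1, 4, 4]
```

After binning, each numeric attribute becomes a small set of discrete attribute-value pairs, which is what the confidence1 heuristic and depth-first search operate over.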

Page 8: Treatment Learning: Implementation and Application

The Pilot Case Study

Requirement optimization
– Goal: an optimal set of mitigations in a cost-effective manner

[Diagram: mitigations reduce risks and incur cost; risks relate to requirements; requirements achieve benefit]

Iterative learning cycle

Page 9: Treatment Learning: Implementation and Application

The Pilot Study (continued)

Cost-benefit distribution (30/99 mitigations)

Compared to Simulated Annealing

Page 10: Treatment Learning: Implementation and Application

Problem of TAR2

Runtime vs. Rx size
– To generate Rx of size r: C(N, r) candidate conjunctions, where N is the number of attribute-value pairs
– To generate Rx of sizes [1..N]: Σr C(N, r) = 2^N − 1 candidates, i.e. exponential in N

Page 11: Treatment Learning: Implementation and Application

TAR3: the improvement

Random sampling
– Key idea:
  • Treat the confidence1 distribution as a probability distribution
  • Sample Rx from the confidence1 distribution
– Steps:
  • Place the items (ai) in increasing order of confidence1 value
  • Compute the CDF of each ai
  • Sample a uniform value u in [0..1]
  • The sample is the least ai whose CDF > u
  • Repeat until an Rx of the given size is obtained
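The sampling steps above translate almost line for line into code; a sketch follows (the item names and confidence1 values are made up for illustration):

```python
import random

def sample_rx(items, conf1, size, rng=random):
    """Sample a treatment of `size` distinct items, where each item is drawn
    with probability proportional to its confidence1 value."""
    ordered = sorted(items, key=lambda a: conf1[a])   # increasing confidence1
    total = sum(conf1[a] for a in ordered)
    cdf, acc = [], 0.0
    for a in ordered:                                 # cumulative distribution
        acc += conf1[a] / total
        cdf.append(acc)
    rx = set()
    while len(rx) < size:
        u = rng.random()                              # uniform value in [0..1)
        # the sample is the least item whose CDF exceeds u
        rx.add(next(a for a, c in zip(ordered, cdf) if c > u))
    return rx

conf1 = {"a1": 0.1, "a2": 0.3, "a3": 0.6}
random.seed(0)
print(sample_rx(list(conf1), conf1, size=2))
```

This biases the search toward high-confidence1 items without enumerating all C(N, r) conjunctions, which is the source of TAR3's speedup over TAR2.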

Page 12: Treatment Learning: Implementation and Application

Comparison of Efficiency

[Charts: Runtime vs. data size; Runtime vs. attribute# (10-99 attributes, 0-30 sec, R² = 0.9436); Runtime vs. Rx size (treatment sizes 1-8, 0-120 sec, R² = 0.8836); Runtime vs. TAR2]

Page 13: Treatment Learning: Implementation and Application

Comparison of Results

Mean and STD in each round

Final Rx: TAR2=19, TAR3=20

10 UCI domains, identical best Rx

pilot2 dataset (58 × 30k)

Page 14: Treatment Learning: Implementation and Application

External Evaluation

[Diagram, FSS framework: all attributes (10 UCI datasets) → learning; TAR2 as feature subset selector → fewer attributes → learning; compare accuracy using C4.5 and Naive Bayes]

Page 15: Treatment Learning: Implementation and Application

The Results

Accuracy using Naïve Bayes (avg increase = 0.8%)

Accuracy using C4.5 (avg decrease = 0.9%)

Number of attributes

Page 16: Treatment Learning: Implementation and Application

Compare to other FSS methods

# of attributes selected (C4.5)

# of attributes selected (Naive Bayes)

17/20, fewest attributes selected
Further evidence for funnels

Page 17: Treatment Learning: Implementation and Application

Applications of Treatment Learning

Download site: http://www.ece.ubc.ca/~yingh/
Collaborators: JPL, WV, Portland, Miami
Application examples:
– pair programming vs. conventional programming
– identify software metrics that are superior error indicators
– identify attributes that make FSMs easy to test
– find the best software inspection policy for a particular software development organization
Other applications:
– 1 journal, 4 conference, 6 workshop papers

Page 18: Treatment Learning: Implementation and Application

Main Contributions

• New learning approach
• A novel mining algorithm
• Algorithm optimization
• Complete package and online distribution
• Narrow funnel effect
• Treatment learner as FSS
• Application in various research domains

Page 19: Treatment Learning: Implementation and Application

======================

Some notes follow

Page 20: Treatment Learning: Implementation and Application

Rx Definition example

Input example:
– a classified dataset
Output example:
– Rx = conjunction of attribute-value pairs
– confidence(Rx w.r.t. C) = P(C | Rx)

Page 21: Treatment Learning: Implementation and Application

TAR2 in practice

Domains containing narrow funnels show:
– A tail in the confidence1 distribution
– A small number of variables with disproportionately large confidence1 values
– Satisfactory Rx of small size (< 6)

Page 22: Treatment Learning: Implementation and Application

Background: Classification

2-step procedure
– The learning phase
– The testing phase

Strategies employed
– Eager learning
  • Decision tree induction (e.g. C4.5)
  • Neural networks (e.g. backpropagation)
– Lazy learning
  • Nearest neighbor classifiers (e.g. k-nearest neighbor)

Page 23: Treatment Learning: Implementation and Application

Background: Association Rule

Possible rule: B => C,E [support = 2%, confidence = 80%]

where
support(X -> Y) = P(X)
confidence(X -> Y) = P(Y | X)

Representative algorithms
– APRIORI
  • Apriori property of large itemsets
– Max-Miner
  • More concise representation of the discovered rules
  • Different pruning strategies

ID | Transactions
---|--------------
1  | A, B, C, E, F
2  | B, C, E
3  | B, C, D, E
4  | … …
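Using the slide's own definitions (note that it defines support(X -> Y) as P(X)), both measures can be checked against the three transactions shown; the elided fourth transaction is left out.

```python
# The three transactions listed on the slide
transactions = [
    {"A", "B", "C", "E", "F"},
    {"B", "C", "E"},
    {"B", "C", "D", "E"},
]

def support(x):
    """support(X -> Y) = P(X), per the slide's definition."""
    return sum(1 for t in transactions if x <= t) / len(transactions)

def confidence(x, y):
    """confidence(X -> Y) = P(Y | X)."""
    has_x = [t for t in transactions if x <= t]
    return sum(1 for t in has_x if y <= t) / len(has_x)

print(support({"B"}), confidence({"B"}, {"C", "E"}))  # 1.0 1.0
```

On just these three transactions the rule B => C,E holds everywhere B does, so both values are 1.0; the 2%/80% figures on the slide come from a larger dataset.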

Page 24: Treatment Learning: Implementation and Application

Background: Extension

CBA classifier
– CBA = Classification Based on Association
– X => Y, where Y = class label
– More accurate than C4.5 (16/26)

JEP classifier
– JEP = Jumping Emerging Patterns
  • Support(X w.r.t. D1) = 0, Support(X w.r.t. D2) > 0
  • Model: a collection of JEPs
  • Classify: maximum collective impact
– More accurate than both C4.5 & CBA (15/25)

Page 25: Treatment Learning: Implementation and Application

Background: Standard FSS Method

• Information Gain attribute ranking
• Relief
• Principal Component Analysis (PCA)
• Correlation-based feature selection
• Consistency-based subset evaluation
• Wrapper subset evaluation

Page 26: Treatment Learning: Implementation and Application

Comparison

Relation to classification
– Class boundary / class density
– Class weighting

Relation to association rule mining
– Multiple classes / no class
– Confidence-based pruning

Relation to change detection algorithms
– support: |P(X|y=c1) − P(X|y=c2)|
– confidence: |P(y=c1|X) − P(y=c2|X)|
– Bayes' rule

Page 27: Treatment Learning: Implementation and Application

Confidence Property

Universal-existential upward closure
R1: Age.young -> Salary.low
R2: Age.young, Gender.m -> Salary.low
R3: Age.young, Gender.f -> Salary.low

Long rules tend to have high confidence
Large Rx tend to have high lift values

Page 28: Treatment Learning: Implementation and Application

TAR3: Usability

Usability: more user-friendly
– Intuitive default settings