![Page 1: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/1.jpg)
Classification by CUT: Clearance Under Threshold
Ryan McBride, Ke Wang, and Wenyuan Li
June 17, 2015
![Page 2: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/2.jpg)
Summary
- Domain knowledge helps identify "bad" cases.
- Usual domain knowledge: each outcome's cost or relative benefit (cost-sensitive classification).
- But costs are too hard to specify in practice.
- Our idea: model with a regulatory threshold, a maximum acceptable frequency in future cases.
- Experiments: CUT+ clears more cases than competing classifiers.
![Page 3: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/3.jpg)
- Problem: Given a collection of sampled electrical transformers, predict the ones that contain carcinogenic polychlorinated biphenyls (PCBs), which are known to be harmful to humans and the environment.
![Page 4: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/4.jpg)
Similar Problems
- Predict a cancer patient
- Predict an unqualified applicant
- Predict a broken car brake
![Page 5: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/5.jpg)
Conventional Solution
- User sets a cost matrix (note: negative = bad):

| | Object class j: Positive | Object class j: Negative |
|---|---|---|
| Predicted class i: Positive | C1 | C2 |
| Predicted class i: Negative | C3 | C4 |

- Issue: What is the cost of not removing a public health hazard?
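
As a rough illustration of the conventional approach, the sketch below picks the prediction with the lowest expected cost under a user-supplied cost matrix. The function name `min_cost_prediction` and every cost value are hypothetical and do not come from the talk; they only show that the decision depends on numbers the user must invent.

```python
def min_cost_prediction(p_positive, cost):
    """Return the predicted class with the lowest expected cost.

    cost[i][j] is the cost of predicting class i when the true class is j,
    mirroring the C1..C4 layout of the cost matrix.
    """
    probs = {"positive": p_positive, "negative": 1.0 - p_positive}
    expected = {
        pred: sum(cost[pred][actual] * probs[actual] for actual in probs)
        for pred in cost
    }
    return min(expected, key=expected.get), expected


# Hypothetical costs: a missed hazard (predicted positive, actually negative)
# is assumed to be 50 times worse than a needless intervention.
example_cost = {
    "positive": {"positive": 0.0, "negative": 50.0},  # C1, C2
    "negative": {"positive": 1.0, "negative": 0.0},   # C3, C4
}

label, expected = min_cost_prediction(0.95, example_cost)
print(label, expected)
```

Changing the assumed factor of 50 changes which objects are flagged, which is the difficulty the slide raises.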
![Page 6: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/6.jpg)
Our Solution: Thresholds
- Insight: Problems without costs focus on acceptable rates of negatives:
  1. Regulations: At most "1 hazard out of 100".
  2. Power Industries: Too frequent outages in equipment ⇒ Strengthen equipment.
- Idea: Model to find "under threshold" groups.
![Page 7: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/7.jpg)
- CUT Classification: Given t, partition the attribute space:

[Figure: scatter plot of "+" and "-" objects in the x-y attribute space, partitioned into groups Gi.]

- Gi Over Threshold ⇒ Mitigate Risk.
- Gi Under Threshold ⇒ Delay Action.
![Page 8: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/8.jpg)
Defining Cleared Groups
- When is a group "under threshold"?
  - One sample that isn't contaminated?
  - One hundred samples with no PCBs?
  - A million samples with no PCBs?
- Only "clear" if there are enough observations...
- Use statistics to estimate potential frequencies.
![Page 9: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/9.jpg)
Statistical Clearance
- Use a confidence interval with some confidence level (e.g. 99%):
  - The frequency in future cases is no more than the upper bound: ub(Gi)
- Example: There is a 99% chance that no more than 5% of Dynamo Incorporated transformers are contaminated.
- An object o of unknown class is cleared if it falls in a group Gi where ub(Gi) ≤ t.
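
The slides do not say which confidence interval produces ub(Gi). The sketch below assumes a one-sided Wilson score bound, so its values will not necessarily match the ub figures quoted in the talk. The names `upper_bound` and `is_cleared` are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def upper_bound(bad, n, confidence=0.99):
    """One-sided Wilson score upper bound on the true rate of bad cases.

    Assumption: the talk does not name its interval; Wilson is a stand-in.
    """
    if n <= 0:
        return 1.0  # no observations: nothing can be ruled out
    z = NormalDist().inv_cdf(confidence)
    p = bad / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return min(1.0, (centre + spread) / (1 + z * z / n))


def is_cleared(bad, n, t, confidence=0.99):
    """A group is cleared when its upper bound does not exceed t."""
    return upper_bound(bad, n, confidence) <= t


# Zero contaminated samples out of 1, 100, and 1,000,000, with t = 5%.
for n in (1, 100, 1_000_000):
    print(n, round(upper_bound(0, n), 6), is_cleared(0, n, 0.05))
```

Under this assumed bound, a clean record alone is not enough: the group is cleared only once the number of observations pushes the upper bound below t.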
![Page 10: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/10.jpg)
Partitioning Objective
- Goal: Prove that many future cases are cleared.
- CUT+ Algorithm: Repeated search for large cleared groupings.

Example with t = 5% on the next slide.
![Page 11: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/11.jpg)
- List valid partitions and choose one:

Partition A: Region for t=5%

| Group | PCB samples | ub | Status |
|---|---|---|---|
| Lowlands | 2 of 300 | 1.6% | 300 CLEARED |
| Midlands | 103 of 150 | 76.3% | NON-CLEARED |
| Highlands | 45 of 550 | 10.3% | NON-CLEARED |

Partition B: Manufacturer for t=5%

| Group | PCB samples | ub | Status |
|---|---|---|---|
| Made-Up Electric | 130 of 400 | 36.4% | NON-CLEARED |
| Dynamo Inc | 20 of 600 | 4.8% | 600 CLEARED |

- Partition A clears 300 samples.
- Partition B clears 600 samples.
- Partition B is preferred because it clears more objects.
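
The comparison above can be sketched as a small scoring step. The group sizes and ub values are copied from the slide. The data layout and the helper `cleared_count` are assumptions made for illustration.

```python
T = 0.05  # clearance threshold used in the example

# (group name, number of samples, ub) as listed on the slide.
partitions = {
    "A: Region": [
        ("Lowlands", 300, 0.016),
        ("Midlands", 150, 0.763),
        ("Highlands", 550, 0.103),
    ],
    "B: Manufacturer": [
        ("Made-Up Electric", 400, 0.364),
        ("Dynamo Inc", 600, 0.048),
    ],
}


def cleared_count(groups, t):
    """Total number of samples that fall in groups with ub <= t."""
    return sum(size for _, size, ub in groups if ub <= t)


scores = {name: cleared_count(groups, T) for name, groups in partitions.items()}
best = max(scores, key=scores.get)
print(scores, best)
```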
![Page 12: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/12.jpg)
Current Tree Partition:
- All Objects
  - Produced by Dynamo Inc: 20 PCB of 600, ub(Dynamo): 4.8%, 600 CLEARED
  - Produced by Made-Up Electric: 130 PCB of 400, ub(Made-Up): 36.4%, NON-CLEARED

Improvement 1: Repeat the partition search in non-cleared groups.
![Page 13: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/13.jpg)
Final Tree
- All Objects
  - Produced by Dynamo Inc: 20 PCB of 600, ub(Dynamo): 4.8%, 600 CLEARED
  - Produced by Made-Up Electric, split by location (branch labels on the slide: In Surrey, In Lowlands, In Midlands, In Highlands) into leaf groups:
    - 98 PCB of 100, ub(G): 100%, NON-CLEARED
    - 30 PCB of 150, ub(G): 25.8%, NON-CLEARED
    - 2 PCB of 150, ub(G): 4.2%, 150 CLEARED

Improvement 2: Merge all non-cleared regions, then search again.
![Page 14: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/14.jpg)
CUT+ Algorithm
- Given a set of training objects, G, and a clearance threshold, t
- REPEAT UNTIL no cleared group is found:
  - CUT Tree(G, t)
  - Remove the objects assigned to a cleared group from G
- Three heuristics for building trees:
  1. Immediate Clearance
  2. Risk Reduction
  3. Pure Potential
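
The outer loop can be sketched as follows. The slides do not specify how CUT Tree works or how the three heuristics are defined, so `cut_tree` here is a simplified greedy stand-in that splits on whichever attribute clears the most objects. The Wilson upper bound, the dictionary-based object layout with a `"bad"` flag, and all function names are likewise assumptions.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.99)  # assumed 99% one-sided confidence


def ub(group):
    """Assumed Wilson upper bound on the rate of bad objects in a group."""
    n = len(group)
    if n == 0:
        return 1.0
    p = sum(o["bad"] for o in group) / n
    centre = p + Z * Z / (2 * n)
    spread = Z * sqrt(p * (1 - p) / n + Z * Z / (4 * n * n))
    return min(1.0, (centre + spread) / (1 + Z * Z / n))


def split(group, attr):
    """Partition a group by the value of one attribute."""
    parts = {}
    for o in group:
        parts.setdefault(o[attr], []).append(o)
    return list(parts.values())


def cut_tree(group, t, attrs):
    """Return (subgroup, cleared) leaves; a greedy stand-in for CUT Tree."""
    if ub(group) <= t:
        return [(group, True)]
    best_parts, best_attr, best_cleared = None, None, 0
    for attr in attrs:
        parts = split(group, attr)
        cleared = sum(len(p) for p in parts if ub(p) <= t)
        if cleared > best_cleared:
            best_parts, best_attr, best_cleared = parts, attr, cleared
    if best_parts is None:
        return [(group, False)]  # no split clears anything
    rest = [a for a in attrs if a != best_attr]
    leaves = []
    for part in best_parts:
        leaves.extend(cut_tree(part, t, rest))
    return leaves


def cut_plus(objects, t, attrs):
    """Repeat tree building on the leftovers until no group is cleared."""
    remaining, cleared_groups = list(objects), []
    while remaining:
        newly = [g for g, ok in cut_tree(remaining, t, attrs) if ok]
        if not newly:
            break
        cleared_groups.extend(newly)
        gone = {id(o) for g in newly for o in g}
        remaining = [o for o in remaining if id(o) not in gone]
    return cleared_groups, remaining
```

Calling `cut_plus(objects, 0.05, ["maker", "region"])` on a list of distinct dictionaries returns the cleared groups and the objects that could not be cleared. Each pass restarts from the merged leftovers with the full attribute list, which loosely mirrors the "merge then search again" step.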
![Page 15: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/15.jpg)
Experiments (1)
- Use cross-validation and compare:
  - 3 CUT+ algorithms.
  - Competitors from other classification areas.
- Problem Set: PCB identification problems.
![Page 16: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/16.jpg)
Experiments (2)
- Evaluate partition {G1, ..., Gn} with a test set by:
  - Percent of positives cleared (TPR).
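
A minimal sketch of this metric, assuming that "positive" means a non-hazardous object (the deck notes that negative = bad) and that each test object can be mapped to a group label. The `assign_group` callback, the `"bad"` flag, and the sample data are hypothetical.

```python
def true_positive_rate(test_objects, assign_group, cleared_groups):
    """Share of positive (non-hazardous) test objects placed in a cleared group."""
    positives = [o for o in test_objects if not o["bad"]]
    if not positives:
        return 0.0
    cleared = sum(1 for o in positives if assign_group(o) in cleared_groups)
    return cleared / len(positives)


# Hypothetical test set, grouped by manufacturer.
test_set = [
    {"maker": "Dynamo Inc", "bad": False},
    {"maker": "Dynamo Inc", "bad": False},
    {"maker": "Dynamo Inc", "bad": False},
    {"maker": "Dynamo Inc", "bad": True},
    {"maker": "Made-Up Electric", "bad": False},
]

tpr = true_positive_rate(test_set, lambda o: o["maker"], {"Dynamo Inc"})
print(tpr)
```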
![Page 17: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/17.jpg)
PCB Experiment (1)
- t ranges from 0% to p̂.
- p̂ is the observed rate of PCB cases.

[Figure: Results for PCB50. Two panels plotted against the clearance threshold t (0% to 1.0 p̂): FPR(t) on a 0% to 5% axis, and TPR on a 0% to 100% axis. Series: Pure Potential, Baseline1: C4.5, Baseline2: SMOTE, Baseline3: MetaCost.]

CUT+ clears more non-PCB transformers.
Paper results show that there are not too
many “over threshold” errors.
![Page 18: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/18.jpg)
PCB Experiment (2)
[Figure: Results for PCB50, repeating the FPR(t) and TPR plots against the clearance threshold t (0% to p̂, the observed rate of PCB cases) from PCB Experiment (1).]

Competitors have few cleared groups since:
- Too few observations to clear a group.
- Or the frequency is too high to clear a group.
![Page 19: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/19.jpg)
- More Experiments on UCI Sets: Pure Potential is the best algorithm in 22 out of 25 tests.
- Code available at http://www.cs.sfu.ca/~wangk/software/CUT_classification
![Page 20: Wang ke classification by cut clearance under threshold](https://reader031.vdocuments.site/reader031/viewer/2022022200/58aa05921a28abec248b6b3d/html5/thumbnails/20.jpg)
Acknowledgments
- Funding: BC Hydro R&D program and Canada's NSERC.
- Transformer Image Source: Wikipedia user Benutzer:Stahlkocher; License: GFDL.