tree based methods (ensemble schemes) · 2020-01-15 · • greedy strategy. • split the records...

68
Tree Based Methods (Ensemble Schemes) Machine Learning Spring 2018 Feb 26 2018 Kasthuri Kannan [email protected]

Upload: others

Post on 06-Mar-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Tree Based Methods (Ensemble Schemes)

MachineLearningSpring2018Feb262018

[email protected]

Page 2: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Overview •  DecisionTrees

•  Overview•  Spli1ngnodes•  Limita8ons

•  Bagging/BootstrapAggrega8ngandBoos8ng•  Howbaggingreducesvariance?•  Boos8ng

•  RandomForests•  Overview•  WhyRFworks?

•  CancerGenomicsApplica8on

Page 3: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Classifica=on

• Givenacollec8onofrecords(trainingset)•  EachrecordcontainsasetofaLributes,oneoftheaLributesistheclass/label

•  FindamodelforclassaLributeasafunc8onofthevaluesofotheraLributes.

• Goal:previouslyunseenrecordsshouldbeassignedaclassasaccuratelyaspossible.

•  Atestsetisusedtodeterminetheaccuracyofthemodel.Usually,thegivendatasetisdividedintotrainingandtestsets,withtrainingsetusedtobuildthemodelandtestsetusedtovalidateit.

Page 4: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Classifica=on Illustra=on

Apply Model

Induction

Deduction

Learn Model

Model

Tid Attrib1 Attrib2 Attrib3 Class

1 Yes Large 125K No

2 No Medium 100K No

3 No Small 70K No

4 Yes Medium 120K No

5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No

8 No Small 85K Yes

9 No Medium 75K No

10 No Small 90K Yes 10

Tid Attrib1 Attrib2 Attrib3 Class

11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?

14 No Small 95K ?

15 No Large 67K ? 10

Test Set

Learningalgorithm

Training Set

Courtesy:www.cs.kent.edu/~jin/DM07/

Page 5: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Classifica=on Examples

•  Predic8ngtumorcellsasbenignormalignant

•  Classifyingcreditcardtransac8onsaslegi8mateorfraudulent

•  Classifyingsecondarystructuresofproteinasalpha-helix,beta-sheet,orrandomcoil

•  Categorizingnewsstoriesasfinance,weather,entertainment,sports,etc.

•  Severalalgorithms–Decisiontrees,SupportVectorMachines,Rule-basedMethodsetc.

Page 6: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Decision Tree (Example)

Tid Refund MaritalStatus

TaxableIncome Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes10

Refund

MarSt

TaxInc

YES NO

NO

NO

Yes No

Married Single, Divorced

< 80K > 80K

Splitting Attributes

Training Data Model: Decision Tree

Page 7: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Decision Tree (Another Example)

Tid Refund MaritalStatus

TaxableIncome Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes10

MarSt

Refund

TaxInc

YES NO

NO

NO

Yes No

Married Single,

Divorced

< 80K > 80K

Therecouldbemorethanonetreethatfitsthesamedata!

Page 8: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Applying the model Oncethedecisiontreeisbuilt,themodelcanbeusedtotestanunclassifieddata

Refund

MarSt

TaxInc

YES NO

NO

NO

Yes No

Married Single, Divorced

< 80K > 80K

Refund Marital Status

Taxable Income Cheat

No Married 80K ? 10

Test Data

Assign Cheat to “No”

Page 9: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

x = 1

y = 0.5

x < 1

y < 1 y < 0.5

(1.5,0.8)

Tumor samples/patients given by expression in (gene1,gene2)

Normal samples/patients given by expression in (gene1,gene2)

T

Yes No

N

No Yes

T N

No Yes

gene1

gene

2

y = 1

Tumor Classifica=on (Example)

Page 10: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Decision Tree Algorithms

•  ManyAlgorithms:•  Hunt’sAlgorithm(oneoftheearliest)•  CART•  ID3,C4.5•  SLIQ,SPRINT

Hunt’sAlgorithm(GeneralIdea)LetDtbethesetoftrainingrecordsthatreachanodetGeneralProcedure:

IfDtcontainsrecordsthatbelongthesameclassyt,thentisaleafnodelabeledasytIfDtisanemptyset,thentisaleafnodelabeledbythedefaultclass,ydIfDtcontainsrecordsthatbelongtomorethanoneclass,useanaLributetesttosplitthedataintosmallersubsets.Recursivelyapplytheproceduretoeachsubset.

Page 11: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Hunt’s Algorithm (Illustra=on)

Don’t Cheat

Refund

Don’t Cheat

Don’t Cheat

Yes No

Refund

Don’t Cheat

Yes No Marital Status

Don’t Cheat

Cheat

Single,Divorced Married

Taxable Income

Don’t Cheat

<80K >=80K

Refund

Don’t Cheat

Yes No Marital Status

Don’t Cheat Cheat

Single,Divorced Married

Tid Refund MaritalStatus

TaxableIncome Cheat

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes10

Page 12: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Tree Induc=on

•  Greedystrategy.

•  SplittherecordsbasedonanaLributetestthatop8mizescertaincriterion.

• Mainques8ons

•  Determinehowtosplittherecords

•  Binaryormul8-waysplit?

•  Howtodeterminethebestsplit?

•  Determinewhentostopspli1ng

Page 13: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Test Condi=on (SpliHng Based on Nominal/Ordinal AKributes)

Mul8-waysplit:Useasmanypar88onsasdis8nctvalues.Binarysplit:Dividesvaluesintotwosubsets.

Needtofindop8malpar88oning.

Car Type Family

SportsLuxury

Car Type {Family,Luxury} {Sports}

Car Type {Sports,Luxury} {Family} OR

Page 14: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Test Condi=on (SpliHng Based on Con=nuous AKributes)

TaxableIncome> 80K?

Yes No

TaxableIncome?

(i) Binary split (ii) Multi-way split

< 10K

[10K,25K) [25K,50K) [50K,80K)

> 80K

Page 15: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Binary vs. Mul=-way split – which is the best?

hLp://www.cse.msu.edu/~cse802/DecisionTrees.pdf

Page 16: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Binary vs. Mul=-way split – which is the best?

hLp://www.cse.msu.edu/~cse802/DecisionTrees.pdf

Page 17: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Determining Best Split

OwnCar?

C0: 6C1: 4

C0: 4C1: 6

C0: 1C1: 3

C0: 8C1: 0

C0: 1C1: 7

CarType?

C0: 1C1: 0

C0: 1C1: 0

C0: 0C1: 1

StudentID?

...

Yes No Family

Sports

Luxury c1c10

c20

C0: 0C1: 1

...

c11

BeforeSpli1ng: 10recordsofclass0, 10recordsofclass1

WhichaLributeisthebestforspli1ng?

Page 18: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Determining Best Split

• Greedyapproach:•  Nodeswithhomogeneousclassdistribu8onarepreferred

• Needameasureofnodeimpurity:

C0: 5C1: 5

C0: 9C1: 1

Non-homogeneous,

Highdegreeofimpurity

Homogeneous,

Lowdegreeofimpurity

Page 19: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Measures of Node Impurity

• GiniIndex•  Entropy

• Misclassifica8onerror

∑−=j

tjptGINI 2)]|([1)(

∑−=j

tjptjptEntropy )|(log)|()(

)|(max1)( tiPtErrori

−=

Page 20: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Compu=ng Measures of Node Impurity

B?

Yes No

Node N3 Node N4

A?

Yes No

Node N1 Node N2

BeforeSpliWng:

C0 N10 C1 N11

C0 N20 C1 N21

C0 N30 C1 N31

C0 N40 C1 N41

C0 N00 C1 N01

M0

M1 M2 M3 M4

M12 M34Gain = M0 – M12 vs M0 – M34

Page 21: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Compu=ng Measures of Node Impurity (Gini Index)

•  GiniIndexforagivennodet:

(NOTE:p( j | t) istherela8vefrequencyofclassjatnodet)

•  Maximum(1-1/nc)whenrecordsareequallydistributedamongallclasses,implyingleastinteres8nginforma8on

•  Minimum(0.0)whenallrecordsbelongtooneclass,implyingmostinteres8nginforma8on

∑−=j

tjptGINI 2)]|([1)(

Page 22: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

M1 M2

Example (Gini Index of a Node)

∑−=j

tjptGINI 2)]|([1)(A?

Yes No

Node N1 Node N2

C0# 0"C1# 6"

#

C0# 1"C1# 5"

#

C0# 0"C1# 6"

#

C0# 1"C1# 5"

#

P(C0) = 0/6 = 0 P(C1) = 6/6 = 1

Gini = 1 – P(C0)2 – P(C1)2 = 1 – 0 – 1 = 0

P(C1) = 1/6 P(C2) = 5/6

Gini = 1 – (1/6)2 – (5/6)2 = 0.278

M1

M2

Page 23: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

SpliHng based on Gini Index

•  UsedinCART,SLIQ,SPRINT

• Whenanodepissplitintokpar88ons(children),thequalityofsplitiscomputedas,

where,nL=numberofrecordsatthelejchildnode, nR=numberofrecordsattherightchildnode

•  SplitontheaLributethatmaximizesGinisplit

Page 24: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

SpliHng Based on Gini Index

•  Splitsintotwopar88ons

•  EffectofWeighingpar88ons:•  Largerandpurerpar88onsaresoughtfor.

B?

Yes No

Node N1 Node N2

Parent C1 6

C2 6 Gini = 0.500

N1 N2 C1 5 1 C2 2 4 Gini=0.333

Gini(N1) = 1 – (5/6)2 – (2/6)2 = 0.194

Gini(N2) = 1 – (1/6)2 – (4/6)2 = 0.528

Gini(Children)=7/12*0.194+5/12*0.528=0.333

Ginisplit(B)=0.5-0.3=0.2

Page 25: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Measures of Node Impurity (Entropy)

•  Entropyatagivennodet:

•  (NOTE:p( j | t) istherela8vefrequencyofclassjatnodet).

•  Measureshomogeneityofanode.

•  Maximum(lognc)whenrecordsareequallydistributedamongallclassesimplyingleastinforma8on

•  Minimum(0.0)whenallrecordsbelongtooneclass,implyingmostinforma8on

•  Entropybasedcomputa8onsaresimilartotheGINIindexcomputa8ons

∑−=j

tjptjptEntropy )|(log)|()(

Page 26: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Example (Entropy)

C1 0 C2 6

C1 2 C2 4

C1 1 C2 5

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1

Entropy = – 0 log 0 – 1 log 1 = – 0 – 0 = 0

P(C1) = 1/6 P(C2) = 5/6

Entropy = – (1/6) log2 (1/6) – (5/6) log2 (1/6) = 0.65

P(C1) = 2/6 P(C2) = 4/6

Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92

Page 27: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

SpliHng Based on Entropy: Informa=on Gain

•  Informa8onGain:

ParentNode,pissplitintokpar88ons;niisnumberofrecordsinpar88onI

•  Measuresreduc8oninentropyachievedbecauseofthesplit.Choosethesplitthatachievesmostreduc8on(maximizesGAIN)

•  UsedinID3andC4.5•  Disadvantage:Tendstoprefersplitsthatresultinlargenumberofpar88ons,eachbeingsmallbutpure.

⎟⎠⎞

⎜⎝⎛−= ∑

=

k

i

i

splitiEntropy

nnpEntropyGAIN

1)()(

Page 28: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Measures of Node Impurity (Misclassifica=on Error)

•  Classifica8onerroratanodet:

•  Measuresmisclassifica8onerrormadebyanode.

•  Maximum(1-1/nc)whenrecordsareequallydistributedamongallclasses,implyingleastinteres8nginforma8on

•  Minimum(0.0)whenallrecordsbelongtooneclass,implyingmostinteres8nginforma8on

•  Misclassifica8onbasedcomputa8onsaresimilartotheGINIindexandentropycomputa8ons

)|(max1)( tiPtErrori

−=

Page 29: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Example (Misclassifica=on Error)

C1 0 C2 6

C1 2 C2 4

C1 1 C2 5

P(C1) = 0/6 = 0 P(C2) = 6/6 = 1

Error = 1 – max (0, 1) = 1 – 1 = 0

P(C1) = 1/6 P(C2) = 5/6

Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

P(C1) = 2/6 P(C2) = 4/6

Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

)|(max1)( tiPtErrori

−=

Page 30: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Comparison (SpliHng Criteria)

Two-classproblem

E/G/

MCVa

lues

Page 31: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Summary

•  Greedystrategy.

•  SplittherecordsbasedonanaLributetestthatop8mizescertaincriterion.

•  Ques8ons

•  Determinehowtosplittherecords

•  Binaryormulb-waysplit?Mulb-wayissameasbinarysplit.

•  HowtodeterminethebestspliWngfeatures?Gini/Entropy/Misclassificabon

•  WhentostopspliWng?

Page 32: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

When to stop spliHng?

• Amaximalnodedepthisreached

•  Ifspli1ngisstoppedtooearly,errorontrainingdataisnotsufficientlylowandperformanceonthetestdatawillsuffer(underfi1ng)

•  Theleafnodesdoesn’thaveanyimpurity

•  Ifwecon8nuetogrowthetreefullyun8leachleafnodecorrespondstothelowestimpurity,thenthedatahavetypicallybeenoverfit;inthelimit,eachleafnodehasonlyonepaLern!

Page 33: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

#ofnodesTomMitchell,1997

Page 34: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

When to stop spliHng?

•  Stopspli1ngwhenthebestcandidatesplitatanodereducestheimpuritybylessthanthepresetamount(threshold)

•  Howtosetthethreshold?

•  Spli1nganotedoesnotleadtoaninforma8ongain

•  Dependsonthemetricusedforinforma8ongain

Page 35: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Comparison (Gini Vs. Misclassifica=on)

A?

Yes No

Node N1 Node N2

Parent C1 7

C2 3 Gini = 0.42

! N1# N2#C1! 3# 4#C2! 0# 3####Gini#=#0.34#

!

Gini(Children)=3/10*0+7/10*0.489=0.342

Gini(N1) = 1 – (3/3)2 – (0/3)2 = 0

Gini(N2) = 1 – (4/7)2 – (3/7)2 = 0.489

MC(Parent)=1-max(7/10,3/10)=1-7/10=3/10

MC(N1)=1-max(3/3,0/3)=1-1=0MC(N2)=1-max(4/7,3/7)=1-4/7=3/7

Mcsplit(A)=3/10-(3/10)0-(7/10)(3/7)=0Ginisplit(A)=0.42–0.34=0.08

Page 36: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Comparison (Entropy Vs. Misclassifica=on)

OriginalTree

hLps://sebas8anraschka.com/faq/docs/decisiontree-error-vs-entropy.html

Page 37: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Advantages & Disadvantages

•  Inexpensivetoconstructifclassifyingfeaturesaregivenorknown•  Extremelyfastatclassifyingunknownrecords•  Easytointerpretforsmall-sizedtrees•  Accuracyiscomparabletootherclassificabontechniquesformanysimpledatasets

•  UnderfiWngandOverfiWng

•  Costsofclassificabonhighastreeconstrucboniscombinatorialinfeatures

•  Sensibvetodata-highvariance–lessaccuratepredicbon

Page 38: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

When to stop spliHng?

•  Usevalida8on/cross-valida8ontechniques

•  Con8nuespli1ngun8lerroronvalida8onsetisminimum

•  Cross-valida8onreliesonseveralindependentlychosensubsets

•  Cross-valida8onmethods

•  Hold-outmethod(50-70%Training,50-30%tes8ng/valida8on)•  Pronetoselec8onbias

•  K-foldcross-valida8on

•  Leaveoneoutcross-valida8on(lotsofcomputa8on8me)•  Bootstrapmethods

Page 39: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Bootstrap Aggrega=on

• Keyvariancereduc8ontechnique

•  Ifeachclassifierhasahighvariancetheaggregatedclassifierhasasmallervariancethaneachsingleclassifier

•  Thebaggingclassifierislikeanapproxima8onofthetrueaveragecomputedbyreplacingtheprobabilitydistribu8onwithbootstrapapproxima8on

Page 40: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

normaltumor

Classifier1 Classifier2

SmallchangeinthedatacouldresultinaradicallydifferenttreestructureLargevarianceinclassifica8onVariancereducesbyaggrega8ng/averaging

Variance Classifiedasnormal Classifiedastumor

Page 41: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Why Bagging Works?

Page 42: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Why Bagging Works?

x

x

x

x

y1

y2

y3

y4

Ifyi’sareuncorrelated

E[Z ]= 1m

E[yi ]=1m(mE[y])∑ = E[y]

(x11, y1

1), (x21, y2

1 ),!, (xk1, yk

1 ) ↔ (x, y1)(x1

2, y12 ), (x2

2, y22 ),!, (xk

2, yk2 ) ↔ (x, y2 )

"

(x1m, y1

m ), (x2m, y2

m ),!, (xkm, yk

m )↔ (x, ym )

Page 43: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Why bagging works?

•  Asmincreases,varianceisreducedandtheaggregatedpredic8onisclosertothetruevalue.

•  Unfortunately,weDON’Thaveseveralsetsofsamples!!!

• Whatshouldwedo?

•  Ofcourse,makeauniformsamplingfromgivensampleset,withreplacementanduseeachsamplesetasabootstrapsampleset!!!

Page 44: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

40%

Ensemble Classifiers

•  Subsamplewithreplacement

•  Uniformlysubsample

Classifier1 Classifier2 Classifier3

Trainingset

Tes8ngset

Consensus/Aggregatepredic8ons

Predic8on1 Predic8on2 Predic8on3

Page 45: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Bagging (Error Curves)

web1.sph.emory.edu/users/tyu8/740/Lecture%2011%20forest.pptx

Page 46: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Boos=ng

•  Powerfultechniqueforcombiningmul8ple“base”classifierstoformacommiLeewhoseperformancecanbesignificantlybeLerthananyofthebaseclassifiers

•  AdaBoost(adap8veboos8ng)cangivegoodresultsevenifbase

classifierperformanceisonlyslightlybeLerthanrandom•  Hencebaseclassifiersareknownasweaklearners•  Designedforclassifica8on,canbeusedforregressionaswell

Page 47: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

40%

Boos=ng •  Uniformlysubsamplewithreplacement

•  Testonthetrainingdataandweightappropriatelyaccordingtoerror

•  Sequen8allysubsamplewithweightsbasedonmisclassifica8on

Classifier1 Classifier2 Classifier3

Trainingset

Tes8ngset

Predic8on1 Predic8on2 Predic8on3

test

Predic8on

Page 48: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Bagging vs. Boos=ng Comparison

web1.sph.emory.edu/users/tyu8/740/Lecture%2011%20forest.pptx

Page 49: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Random Forest

•  Baggingcanbeseenasamethodtoreducevarianceofanes8matedpredic8onfunc8on.Itmostlyhelpshigh-varianceclassifiers.

•  Compara8vely,boos8ngbuildweakclassifiersone-by-one,allowingthecollec8ontoevolvetotherightdirec8on.

•  Randomforestisasubstan8almodifica8ontobagging

•  Buildacollec8onofde-correlatedtrees.

•  Similarperformancetoboos8ng

•  Simplertotrainandtunecomparedtoboos8ng

Page 50: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Random Forest (Intui=on)

Page 51: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Random Forest (Algorithm)

Page 52: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Why Random Forest Works?

Page 53: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Bagging vs. Boos=ng vs. Random Forest Comparison

ElementsofSta8s8calLearning(2ndEd.)cHas8e,Tibshirani&Friedman2009Chap15

Page 54: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Cancer Genomics Applica=ons I: Soma=c Score

•  Fewsoma8cmuta8onsbelow15(max255).•  47isthevalida8onnecessity•  Takenasaminimumthreshold

Distribu8onofsoma8cscores

in-validated

validated

Page 55: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Cancer Genomics Applica=ons II: Determining key muta=ons

• Howdowedeterminekeymuta8ons?

• Lookintoevolu8onforanswers

• "NothinginBiologyMakesSenseExceptintheLightofEvolu8on”

-Dobzhansky

Page 56: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Soma=c Muta=ons

• Allcancersarisebecauseofsoma8callyacquiredchanges

• Notallsoma8cvaria8onswillresultincancer–theonesthatresultarecalled“drivers”.“Passengers”arepassive.

•  Ideabehind‘driver’and‘passenger’muta8onsisbasedonevolu8on

• Certainmuta8onsarecasuallyselectedinthetumormicro-environmentthatwouldleadtoincreasedsurvival/reproduc8on.

•  Selec8on-Whatdoesthismean??*?#$%$%*&???!!!!

Page 57: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Non-selec=on •  Understandingselec8onrequiresunderstandingnon-selec8on.

•  Non-selec8on–essen8allystochas8cprocess.•  e.g.Randomwalk/Brownianmo8onetc.

•  Doesnotmeanuncertainty•  ifuncertaintymeanstheprobabilityofobservingan

eventisnot1.e.g.,Drunkard'swalk.

•  Expecta8oniszeroatanyinstanceof8me.

•  Gene8cdrij•  Changeintheallelefrequencyduetorandomness.

•  Drivermuta8onsarenotoutcomesofgene8cdrij,butduetoselecbon.

Page 58: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Natural Selec=on and Cancer

• Naturalselec8onisanon-random*processbywhichphenotypesbecomefixatedinapopula8onasafunc8onofdifferen8alreproduc8on/survival

• Cellsinpre-malignantandmalignantstate(tumors)evolvebyselec8nggenesthatfavorstumorgenesis

*Asopposedtogene8cdrij,NSisdeterminis8candhasdirec8on.

Recurrent mutations

TP53 KRAS

Classic approach to finding driver genes

DriverMuta8ons

Page 59: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Muta=onal Frequency

• Muta8onalfrequencyadefiningcriteria?

Page 60: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

CHASM

• Methodsindependentoffrequencyofoccurrencearerequiredtodeterminedrivergenes.

• Cancer-specificHigh-throughputAnnota8onofSoma8cMuta8ons(CHASM)*isamachinelearningalgorithmdesignedtoiden8fydrivermuta8ons.

• Basedonrandomforestclassifier

*BozicIetal.,PNAS,2010

Page 61: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Natural Selec=on and Cancer

• Naturalselec8onisanon-random*processbywhichphenotypesbecomefixatedinapopula8onasafunc8onofdifferen8alreproduc8on/survival

• Cellsinpre-malignantandmalignantstate(tumors)evolvebyselec8nggenesthatfavorstumorgenesis

* As opposed to genetic drift, NS is deterministic and has direction.

Recurrent mutations

TP53 KRAS

Classic approach to finding driver genes

Driver Mutations

Page 62: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Muta=onal Frequency

• Muta8onalfrequencyadefiningcriteria?

Page 63: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

CHASM

• Methodsindependentoffrequencyofoccurrencearerequiredtodeterminedrivergenes.

• Cancer-specificHigh-throughputAnnota8onofSoma8cMuta8ons(CHASM)*isamachinelearningalgorithmdesignedtoiden8fydrivermuta8ons.

• Basedonrandomforestclassifier

* Bozic I et al., PNAS, 2010

Page 64: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

CHASM Features Set

•  Foreachmuta8on,56defaultfeatureslike,

PTM enzyme

SNP density

DNA binding

Orthologcompatibleamino acid

Exonconservation

Superfamilyconservation

Page 65: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

CHASM Overview

•  CHASMhasacuratedsetof2488drivergenestakenfromCOSMICandothercancerdatasets

•  CHASMsimulatesmuta8onsbasedongivencontextsandcallsthempassengermuta8ons.Thesearemuta8onsthatoccurbyrandomchancealone

•  Apor8onofthesepassengermuta8onsareisolatedtoformanulldistribu8on

Page 66: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

CHASM Overview

Known driver mutations Simulated passenger mutations NULL

Random forest classifier

Our mutations Classify

Get CHASM scores (NULL)

Get CHASM scores (Our mutations)

Report CHASM scores, significance, FDR Statistics (t-test)

Page 67: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

CHASM Pipeline

Identify somatic mutations

Validated mutations

Compute context table

Get CHASM scores, significance, FDR

Set threshold & identify key genes

Find pathways

Page 68: Tree Based Methods (Ensemble Schemes) · 2020-01-15 · • Greedy strategy. • Split the records based on an aribute test that op8mizes certain criterion. • Quesons • Determine

Summary •  DecisionTrees

•  Overview•  Spli1ngnodes•  Limita8ons

•  Bagging/BootstrapAggrega8ngandBoos8ng•  Howbaggingreducesvariance?•  Boos8ng

•  RandomForests•  Overview•  WhyRFworks?

•  CancerGenomicsApplica8on