
Additive Models, Trees, and Related Methods (Part I)

Joy, Jie, Lucian
Oct 22nd, 2002

Outline

• Tree-Based Methods
  – CART (30 minutes) 1:30pm – 2:00pm
  – HME (10 minutes) 2:00pm – 2:20pm
• PRIM (20 minutes) 2:20pm – 2:40pm
• Discussions (10 minutes) 2:40pm – 2:50pm

Tree-Based Methods

• Overview
  – Principle behind: divide and conquer
    • Variance will be increased
  – Finesse the curse of dimensionality at the price of possibly mis-specifying the model
  – Partition the feature space into a set of rectangles
    • For simplicity, use recursive binary partitions
  – Fit a simple model (e.g. a constant) in each rectangle
  – Classification and Regression Trees (CART)
    • Regression Trees
    • Classification Trees
  – Hierarchical Mixture of Experts (HME)

CART

• An example (in regression case):

How CART Sees An Elephant

It was six men of Indostan
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind ...

-- “The Blind Men and the Elephant” by John Godfrey Saxe (1816-1887)

Basic Issues in Tree-based Methods

• How to grow a tree?

• How large should we grow the tree?

Regression Trees

• Partition the space into M regions: R1, R2, …, RM.

$$f(x) = \sum_{m=1}^{M} c_m\, I(x \in R_m), \qquad \text{where } c_m = \operatorname{average}(y_i \mid x_i \in R_m)$$
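To make the piecewise-constant model above concrete, here is a minimal sketch (the split point, data, and region shapes are made up for illustration) that averages y within each rectangle and predicts by region membership:

```python
import numpy as np

# Hypothetical 1-D example: two regions split at x = 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
y = np.where(x < 0.5, 1.0, 3.0) + rng.normal(scale=0.3, size=200)

# Regions R_1 = {x < 0.5}, R_2 = {x >= 0.5}; c_m = average of y_i in R_m.
regions = [x < 0.5, x >= 0.5]
c = [y[r].mean() for r in regions]

def f(x_new):
    """Piecewise-constant prediction: f(x) = sum_m c_m I(x in R_m)."""
    return np.where(x_new < 0.5, c[0], c[1])

print(f(np.array([0.2, 0.8])))  # roughly [1.0, 3.0]
```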

Regression Trees – Grow the Tree

• The best partition: minimize the sum of squared errors
  $$\sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2$$
• Finding the global minimum is computationally infeasible
• Greedy algorithm: at each level, choose the splitting variable j and split point s as (a sketch follows this list)
  $$(j, s) = \arg\min_{j,\,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 \;+\; \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]$$
• The greedy algorithm makes the tree unstable
  – An error made at an upper level is propagated to the lower levels
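The greedy criterion above can be sketched in a few lines; the helper below is a hypothetical, unoptimized implementation (not the CART code itself) that scans every variable j and candidate split point s and keeps the pair with the smallest summed squared error:

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the split (j, s) that minimizes the summed
    squared error of the two resulting half-planes (a sketch of the
    CART splitting criterion, not an optimized implementation)."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            # The inner minimization over c1, c2 is solved by the region means.
            err = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if err < best[2]:
                best = (j, s, err)
    return best  # (variable index, split point, squared error)

# Hypothetical usage on random data:
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 0] > 0, 2.0, -2.0) + rng.normal(scale=0.1, size=100)
print(best_split(X, y))  # should pick j = 0 with s near 0
```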

Regression Tree – How Large Should We Grow the Tree?

• Trade-off between accuracy and generalization
  – Very large tree: overfits
  – Small tree: might not capture the structure
• Strategies:
  – 1: Split only when we can decrease the error (short-sighted, e.g. fails on XOR)
  – 2: Cost-complexity pruning (preferred)

Regression Tree - Pruning

• Cost-complexity pruning:
  – Pruning: collapsing some internal nodes
  – Cost complexity (see the sketch below):
    $$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|$$
    The first term is the cost (the sum of squared errors over the terminal nodes); the second is a penalty on the complexity/size of the tree.
  – Choose the best alpha: weakest-link pruning
    • Each time, collapse the internal node that adds the smallest per-node error
    • Choose the best tree from this tree sequence by cross-validation
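A minimal sketch of the cost-complexity criterion, assuming we already have the per-leaf sample counts N_m and errors Q_m(T) (the numbers below are made up); it only evaluates C_alpha(T), leaving the weakest-link search implicit:

```python
def cost_complexity(node_errors, alpha):
    """Cost-complexity criterion C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|.

    node_errors: list of (N_m, Q_m) pairs, one per terminal node of T.
    alpha: penalty per terminal node.
    (A sketch; the weakest-link search over subtrees is omitted.)"""
    cost = sum(n_m * q_m for n_m, q_m in node_errors)
    return cost + alpha * len(node_errors)

# Hypothetical example: a tree with 3 leaves vs. its 2-leaf pruned version.
big_tree   = [(40, 1.0), (30, 0.8), (30, 0.9)]
small_tree = [(40, 1.0), (60, 1.4)]
for alpha in (0.0, 50.0):
    print(alpha, cost_complexity(big_tree, alpha), cost_complexity(small_tree, alpha))
# With alpha = 0 the larger tree wins; a large enough alpha favors the pruned tree.
```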

Classification Trees

• Classify the observations in node m to the majority class in the node:
  $$k(m) = \arg\max_k \hat{p}_{mk}$$
  – $\hat{p}_{mk}$ is the proportion of observations of class k in node m
• Define impurity for a node:
  – Misclassification error: $1 - \hat{p}_{m,k(m)}$
  – Cross-entropy: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$

Classification Trees

  – Gini index (a famous index for measuring income inequality):
    $$\sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$$
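The three impurity measures are easy to compare side by side; the sketch below (a hypothetical helper, using natural logs for the cross-entropy) evaluates them for a given vector of node class proportions:

```python
import numpy as np

def impurities(p):
    """Node impurity measures for class proportions p_mk (a sketch).

    p: array of class proportions in a node, summing to 1."""
    p = np.asarray(p, dtype=float)
    misclass = 1.0 - p.max()                        # misclassification error
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))  # cross-entropy / deviance
    gini = np.sum(p * (1.0 - p))                    # Gini index
    return misclass, entropy, gini

# Example: a pure node vs. a maximally mixed 2-class node.
print(impurities([1.0, 0.0]))   # (0.0, 0.0, 0.0)
print(impurities([0.5, 0.5]))   # (0.5, log 2 ≈ 0.693, 0.5)
```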

Classification Trees

• Cross-entropy and Gini are more sensitive to changes in the node probabilities than misclassification error
• To grow the tree: use cross-entropy or Gini
• To prune the tree: use the misclassification rate (or any of the three measures)

Discussions on Tree-based Methods

• Categorical Predictors
  – Problem: consider splitting node t into t_L and t_R on a categorical predictor x with q possible unordered values: there are $2^{q-1} - 1$ possible splits!
  – Theorem (Fisher 1958)
    • There is an optimal partition B_1, B_2 of the value set B such that
      $$E(Y \mid X \in t_m, x = b_1) \le E(Y \mid X \in t_m, x = b_2) \quad \text{for } b_1 \in B_1 \text{ and } b_2 \in B_2$$
  – So: order the predictor's categories according to the mean of the outcome Y (sketched below)
  – Intuition: treat the categorical predictor as if it were ordered
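A sketch of the ordering trick above, using a hypothetical categorical predictor and outcome: sort the levels by the mean of Y, after which only q-1 ordered splits need to be examined:

```python
import numpy as np

def order_categories(x_cat, y):
    """Order the levels of a categorical predictor by the mean of the
    outcome Y, so they can be split as if they were ordered (a sketch;
    names and data below are hypothetical)."""
    levels = np.unique(x_cat)
    means = {lvl: y[x_cat == lvl].mean() for lvl in levels}
    ordered = sorted(levels, key=lambda lvl: means[lvl])
    return ordered, means

# Hypothetical data with q = 4 categories.
x_cat = np.array(["a", "b", "c", "d", "a", "b", "c", "d"])
y     = np.array([5.0, 1.0, 3.0, 9.0, 6.0, 2.0, 4.0, 8.0])
ordered, means = order_categories(x_cat, y)
print(ordered)  # ['b', 'c', 'a', 'd'] -- only q-1 ordered splits to consider
```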

Discussions on Tree-based Methods

• The Loss Matrix
  – The consequences of misclassification depend on the class
  – Define a loss matrix L, with L_{kk'} the loss for classifying a class-k observation as class k'
  – Modify the Gini index as
    $$\sum_{k \ne k'} L_{kk'}\, \hat{p}_{mk}\, \hat{p}_{mk'}$$
  – In a terminal node m, classify to the class
    $$k(m) = \arg\min_k \sum_{l} L_{lk}\, \hat{p}_{ml}$$
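A sketch of the loss-matrix decision rule in a terminal node, with a made-up 2-class loss matrix in which one kind of mistake is five times costlier:

```python
import numpy as np

def classify_with_loss(p_m, L):
    """Pick the class k that minimizes the expected loss
    sum_l L[l, k] * p_m[l] in a terminal node (a sketch; L and p_m
    below are made-up numbers)."""
    expected_loss = L.T @ p_m          # entry k is sum_l L[l, k] * p_m[l]
    return int(np.argmin(expected_loss))

# Two classes; misclassifying class 1 as class 0 is 5x worse.
L = np.array([[0.0, 1.0],
              [5.0, 0.0]])
p_m = np.array([0.7, 0.3])             # node class proportions
print(classify_with_loss(p_m, L))      # 1: the rarer class wins because of its loss
```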

Discussions on Trees

• Missing Predictor Values
  – If we have enough training data: discard observations with missing values
  – Fill in (impute) the missing value, e.g. with the mean of the known values
  – Create a category called “missing”
  – Surrogate variables
    • Choose the primary predictor and split point
    • The first surrogate predictor best mimics the split by the primary predictor, the second does second best, ...
    • When sending observations down the tree, use the primary predictor first. If its value is missing, use the first surrogate; if the first surrogate is missing, use the second; ...

Discussions on Trees

• Binary Splits?
  – Question (Yan Liu): This question is on the limitation of multiway splits for building a tree. It is said on page 273 that the problem with a multiway split is that it fragments the data too quickly, leaving insufficient data at the next level down. Can you give an intuitive explanation of why binary splits are preferred? In my understanding, one of the problems with multiway splits might be that it is hard to find the best attributes and split points. Is that right?
  – Answer: why are binary splits preferred?
    • A more standard framework to train
    • “To be or not to be” is easier to decide

Discussions on Trees

• Linear Combination Splits
  – Split the node based on $\sum_j a_j X_j \le s$ (see the sketch below)
  – Improves the predictive power
  – Hurts interpretability
• Instability of Trees
  – Inherited from the hierarchical nature of the splitting process
  – Bagging (Section 8.7) can reduce the variance
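For illustration, a minimal sketch of a linear-combination split (the weights a and threshold s are fixed by hand here, whereas tree variants would optimize them):

```python
import numpy as np

def linear_combination_split(X, a, s):
    """Split observations on a linear combination sum_j a_j * X_j <= s
    instead of a single coordinate (a sketch; a and s are hypothetical
    rather than optimized)."""
    score = X @ a
    return score <= s            # boolean mask: True -> left child

X = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [0.2, 0.1]])
a = np.array([0.6, 0.8])         # direction of the split
print(linear_combination_split(X, a, s=1.5))  # [False False  True]
```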

Discussions on Trees

[Figures: bagged trees, combined by majority vote for classification and by averaging for regression]

Hierarchical Mixture Experts

• The gating networks provide a nested, “soft” partitioning of the input space
• The expert networks provide local regression surfaces within the partitions
• Both the mixture coefficients and the mixture components are Generalized Linear Models (GLIMs)

Hierarchical Mixture Experts

• Expert node output:
  $$\mu_{ij} = f(U_{ij} x)$$
• Lower-level gate:
  $$g_{j|i} = \frac{e^{v_{ij}^T x}}{\sum_k e^{v_{ik}^T x}}$$
• Lower-level gate output:
  $$\mu_i = \sum_j g_{j|i}\, \mu_{ij}$$
• Overall output:
  $$\mu = \sum_i g_i\, \mu_i$$
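Putting the four formulas together, here is a sketch of the forward pass of a two-level HME with softmax gates and linear experts (the parameter shapes and the identity choice for f are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hme_predict(x, V_top, V, U):
    """Forward pass of a two-level HME with linear experts (a sketch with
    made-up parameter shapes, following the formulas above).

    V_top: (n_top, d)        top-level gate parameters (g_i)
    V:     (n_top, n_low, d) lower-level gate parameters (g_{j|i})
    U:     (n_top, n_low, d) expert parameters (linear experts, f = identity)"""
    g_top = softmax(V_top @ x)                       # g_i
    mu = 0.0
    for i in range(V.shape[0]):
        g_low = softmax(V[i] @ x)                    # g_{j|i}
        mu_ij = U[i] @ x                             # expert outputs mu_ij
        mu_i = g_low @ mu_ij                         # mu_i = sum_j g_{j|i} mu_ij
        mu += g_top[i] * mu_i                        # mu   = sum_i g_i mu_i
    return mu

# Hypothetical 2x2 hierarchy on a 3-dimensional input.
rng = np.random.default_rng(2)
d, n_top, n_low = 3, 2, 2
x = rng.normal(size=d)
print(hme_predict(x, rng.normal(size=(n_top, d)),
                  rng.normal(size=(n_top, n_low, d)),
                  rng.normal(size=(n_top, n_low, d))))
```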

Hierarchical Mixture Experts

• Likelihood of the training data (a sketch of its evaluation follows this list):
  $$l(\theta; \{x^{(t)}, y^{(t)}\}_{t=1}^{N}) = \sum_{t=1}^{N} \ln \Big( \sum_i g_i^{(t)} \sum_j g_{j|i}^{(t)}\, P\big(y^{(t)} \mid x^{(t)}\big) \Big)$$
• Gradient descent learning algorithm to update U_ij
• Applying EM to HME for training
  – Latent variables: indicators z_i of which branch to take
  – See Jordan 1994 for details
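A sketch of evaluating this log-likelihood, assuming Gaussian linear experts with a fixed variance (an assumption for illustration) and reusing the gate/expert parameter shapes of the forward-pass sketch above:

```python
import numpy as np
from scipy.stats import norm

def hme_log_likelihood(X, y, V_top, V, U, sigma=1.0):
    """Log-likelihood of a two-level HME with Gaussian linear experts:
    l = sum_t log( sum_i g_i sum_j g_{j|i} P(y_t | x_t, theta_ij) ).
    A sketch; the Gaussian expert with fixed variance is an assumption."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    ll = 0.0
    for x_t, y_t in zip(X, y):
        g_top = softmax(V_top @ x_t)
        mix = 0.0
        for i in range(V.shape[0]):
            g_low = softmax(V[i] @ x_t)
            mu_ij = U[i] @ x_t
            mix += g_top[i] * np.sum(g_low * norm.pdf(y_t, loc=mu_ij, scale=sigma))
        ll += np.log(mix)
    return ll
```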

Hierarchical Mixture Experts -- EM

[Figure: each histogram displays the distribution of posterior probabilities across the training set at each node in the tree]

Comparison

  Architecture    Relative Error    # Epochs
  Linear          .31               1
  BackProp        .09               5,500
  HME (Alg. 1)    .10               35
  HME (Alg. 2)    .12               39
  CART            .17               N/A
  CART (linear)   .13               N/A
  MARS            .16               N/A

(16 experts for HME, four-level hierarchy; 16 basis functions for MARS. From Jordan 94.)

• All methods perform better than Linear
• BackProp has the lowest relative error
• BackProp is hard to converge (5,500 epochs)

Model Selection for HME

• Structural parameters need to be decided
  – Number of levels
  – Branching factor K of the tree
• There is no method for finding a good tree topology, as there is in CART

Questions and Discussions

• CART:
  – Rong Jin: 1. According to Eqn. (9.16), a successful split in a large subtree is more valuable than one in a small subtree. Could you justify that?
  – Rong Jin: The discussion of general regression and classification trees only considers partitioning the feature space in a simple binary way. Has any work been done along the lines of nonlinear partitions of the feature space?
  – Rong Jin: Does it make any sense to do overlapping splits?
  – Ben: Could you make clearer the difference between using L_{k,k'} as a loss vs. as weights? (p. 272)

Questions and Discussions

• Locality:
  – Rong Jin: Both tree models and kernel functions try to capture locality. For tree models, the locality is created through the partition of the feature space, while kernel functions express locality through a special distance function. Please comment on these two methods' ability to express localized functions.
  – Ben: Could you compare the tree methods introduced here with kNN and kernel methods?

Questions and Discussions

• Gini and other measures:
  – Yan: The classification (and regression) trees discussed in this chapter use several criteria to select the attributes and split points, such as misclassification error, the Gini index, and cross-entropy. When should we use each of these criteria? Is entropy preferred over the other two? (Also, a clarification: does the Gini index refer to gain ratio, and cross-entropy to information gain?)
  – Jian Zhang: From the book we know that the Gini index has many nice properties, like being a tight upper bound on the error, a training error rate under probabilistic class assignment, etc. Should we prefer it for classification for those reasons?
  – Weng-keen: How is the Gini index equal to the training error?
  – Yan Jun: For tree-based methods, can you give an intuitive explanation of the Gini index measure?
  – Ben: What does minimizing node impurity mean? Is it just to decrease the overall variance? Is there any implication for the bias? How does the usual bias-variance tradeoff play a role here?

Questions and Discussions

• HME:
  – Yan: In my understanding, HME is more like a neural network combined with Gaussian linear regression (or logistic regression), in the sense that the input of the neural network is the output of the Gaussian regression. Is my understanding right?
  – Ben: For the 2-class case depicted in Fig. 9.13 (HME), why do we need two levels of mixtures? (two layers)
  – Ben: In Eqn. 9.30, why are the two upper bounds of the summations the same K? -- Yes, they are.

References

• Fisher, W. D. (1958). On grouping for maximum homogeneity. J. Amer. Statist. Assoc., 53: 789-798.
• Breiman, L. (1984). Classification and Regression Trees. Wadsworth International Group.
• Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.

Hierarchical Mixture Experts

• “The softmax function derives naturally from log-linear models and leads to convenient interpretations of the weights in terms of odds ratios. You could, however, use a variety of other nonnegative functions on the real line in place of the exp function. Or you could constrain the net inputs to the output units to be nonnegative, and just divide by the sum--that's called the Bradley-Terry-Luce model”
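As a quick illustration of the alternative mentioned in the quote, the sketch below (with hypothetical scores) contrasts the softmax gate with simply dividing nonnegative scores by their sum:

```python
import numpy as np

def softmax(z):
    """Softmax gate: exponentiate, then normalize."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sum_normalize(z_nonneg):
    """Bradley-Terry-Luce-style gate: nonnegative scores divided by their sum."""
    return z_nonneg / z_nonneg.sum()

scores = np.array([1.0, 2.0, 4.0])      # hypothetical net inputs
print(softmax(scores))                   # ≈ [0.04, 0.11, 0.84] -- sharper
print(sum_normalize(scores))             # ≈ [0.14, 0.29, 0.57] -- proportional
```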

Hierarchical Mixture Experts

[Figure: “Proportion vs. Softmax” -- gating values G_i (y-axis, 0 to 0.25) across 15 data points, with one series for the raw proportion and one for the softmax output]