TRANSCRIPT
Nearest Neighbor and Kernel Survival Analysis
George H. Chen, Assistant Professor of Information Systems
Carnegie Mellon University
June 11, 2019
Nonasymptotic Error Bounds and Strong Consistency Rates
Survival Analysis
[Diagram: three patients with feature vectors (gluten allergy; immuno-suppressant; low resting heart rate, high BMI, irregular heart beat) and times of death: Day 2, Day 10, Day ≥ 6]
Feature vector X, observed time Y
When we stop collecting training data, not everyone has died!
Goal: Estimate S(t|x) = P(survive beyond time t | feature vector x)
Problem Setup
Model: Generate a data point $(X, Y, \delta)$ as follows:
1. Sample feature vector $X \sim P_X$
2. Sample time of death $T \sim P_{T \mid X}$
3. Sample time of censoring $C \sim P_{C \mid X}$
4. If death happens before censoring ($T \le C$): set $Y = T$, $\delta = 1$. Otherwise: set $Y = C$, $\delta = 0$.
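As a concrete illustration of this model, here is a minimal simulation sketch in Python; the specific distributions chosen for $P_X$, $P_{T \mid X}$, and $P_{C \mid X}$ (uniform features, exponential times) are assumptions for the example only, not from the talk.

import numpy as np

def generate_data(n, d=2, seed=0):
    """Simulate n right-censored data points (X, Y, delta)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))                    # 1. X ~ P_X (illustrative: uniform on [0,1]^d)
    T = rng.exponential(scale=1.0 + X.sum(axis=1))  # 2. T ~ P_{T|X} (illustrative: exponential)
    C = rng.exponential(scale=2.0, size=n)          # 3. C ~ P_{C|X} (illustrative: exponential)
    Y = np.minimum(T, C)                            # 4. observe Y = min(T, C)
    delta = (T <= C).astype(int)                    #    delta = 1 iff death observed before censoring
    return X, Y, delta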
Estimator (Beran 1981): find the $k$ training points closest to $x$, then apply the Kaplan-Meier estimator to those $k$ data points to obtain $\hat{S}(t \mid x)$. The kernel variant is similar.
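A sketch of this k-NN estimator under the setup above; this is an illustration of the Beran-style construction, not the talk's code. The kernel variant would replace the uniform weights over the $k$ nearest neighbors with kernel weights $K(\rho(x, X_i)/h)$ for a bandwidth $h$.

import numpy as np

def knn_kaplan_meier(x, X, Y, delta, k, t_grid):
    """Estimate S(t | x): Kaplan-Meier fit to the k nearest neighbors of x."""
    dist = np.linalg.norm(X - x, axis=1)        # Euclidean distance here; any metric could be used
    nn = np.argsort(dist)[:k]                   # indices of the k nearest training points
    Yk, dk = Y[nn], delta[nn]

    # Product-limit formula over the neighbors' distinct observed death times:
    #   S(t) = prod_{death times s <= t} (1 - deaths(s) / at_risk(s))
    S = np.ones_like(t_grid, dtype=float)
    for s in np.unique(Yk[dk == 1]):
        d_s = np.sum((Yk == s) & (dk == 1))     # deaths at time s
        n_s = np.sum(Yk >= s)                   # number at risk just before s
        S[t_grid >= s] *= 1.0 - d_s / n_s
    return S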
Error: $\sup_{t \in [0,\tau]} |\hat{S}(t \mid x) - S(t \mid x)|$ for time horizon $\tau$
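In simulation, where the true $S(t \mid x)$ is known, this error can be approximated by a maximum over a fine grid on $[0, \tau]$. A usage sketch building on the code above; the query point x, the value of tau, the grid resolution, and the true curve S_true are all assumptions of the example.

tau = 3.0
t_grid = np.linspace(0.0, tau, 200)
S_hat = knn_kaplan_meier(x, X, Y, delta, k=50, t_grid=t_grid)
sup_err = np.max(np.abs(S_hat - S_true))   # S_true: true S(t|x) on t_grid (known only in simulation)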
Assumptions:
• Enough of the $n$ training data have $Y$ values $> \tau$
• The feature space is a separable metric space (intrinsic dimension $d$)
• $P_X$ is a Borel probability measure
• The survival time is a continuous random variable in time and smooth with respect to the feature space (Hölder index $\alpha$)
Theory (Informal)
The k-NN estimator with $k = \tilde{\Theta}(n^{2\alpha/(2\alpha+d)})$ has strong consistency rate
$\sup_{t \in [0,\tau]} |\hat{S}(t \mid x) - S(t \mid x)| \le \tilde{O}(n^{-\alpha/(2\alpha+d)})$.
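For intuition, the theory's choice of $k$, ignoring the log factors hidden inside $\tilde{\Theta}$, is easy to compute; $\alpha$ and $d$ are the (usually unknown) smoothness and intrinsic dimension from the assumptions above.

def theory_k(n, alpha, d):
    """Rate-optimal k = n^(2*alpha / (2*alpha + d)), up to log factors."""
    return max(1, round(n ** (2 * alpha / (2 * alpha + d))))

# Example: alpha = 1, d = 2, n = 10000 gives k = n^(1/2) = 100.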
If there is no censoring, the problem reduces to conditional CDF estimation → the error upper bound matches, up to a log factor, the conditional CDF estimation lower bound of Chagny & Roche 2014.
Proof ideas also give finite sample rates for:
• Kernel Kaplan-Meier estimators
• k-NN & kernel Nelson-Aalen cumulative hazard estimators ($-\log S(t \mid x)$); see the sketch after this list
• A generalization bound for automatically choosing $k$ using validation data
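A sketch of the k-NN Nelson-Aalen estimator mentioned in the list above; like the earlier Kaplan-Meier sketch, this is an illustration under the same setup rather than the talk's implementation.

import numpy as np

def knn_nelson_aalen(x, X, Y, delta, k, t_grid):
    """Estimate the cumulative hazard H(t | x) = -log S(t | x) on the k nearest neighbors."""
    dist = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(dist)[:k]
    Yk, dk = Y[nn], delta[nn]

    # Nelson-Aalen: H(t) = sum_{death times s <= t} deaths(s) / at_risk(s)
    H = np.zeros_like(t_grid, dtype=float)
    for s in np.unique(Yk[dk == 1]):
        d_s = np.sum((Yk == s) & (dk == 1))   # deaths at time s
        n_s = np.sum(Yk >= s)                 # number at risk just before s
        H[t_grid >= s] += d_s / n_s
    return H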
Most general finite sample theory for k-NN and kernel survival estimators.
Existing kernel results hold only for Euclidean feature spaces (Dabrowska 1989; Van Keilegom & Veraverbeke 1996; Van Keilegom 1998).
Experiments
[Plot: concordance indices (c-index, roughly 0.60–0.70) on dataset "gbsg2" for: adaptive kernel, random survival forest, kernel (triangle) L1/L2, kernel (box) L1/L2, k-NN (triangle) L1/L2, k-NN L1/L2, and Cox]
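The plot ranks methods by concordance index (c-index): the fraction of comparable patient pairs that a model orders correctly. A minimal, ties-ignored version of Harrell's c-index for illustration; real evaluations typically use a library implementation.

import numpy as np

def c_index(risk, Y, delta):
    """Harrell-style concordance: pair (i, j) is comparable when i is an
    observed death with Y[i] < Y[j]; concordant when risk[i] > risk[j]."""
    concordant, comparable = 0, 0
    for i in range(len(Y)):
        if delta[i] == 0:
            continue                           # i must be an observed death
        for j in range(len(Y)):
            if Y[i] < Y[j]:                    # i died before j's observed time
                comparable += 1
                concordant += int(risk[i] > risk[j])
    return concordant / comparable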
Distance/kernel choice matters a lot in practice.
Learning the kernel typically has the best performance (but no theory yet!)