machine learning for healthcare - github pages• san antonio model • easy to ask/measure in the...
TRANSCRIPT
MachineLearningforHealthcare6.S897,HST.S53
Prof.DavidSontagMITEECS,CSAIL,IMES
Lecture2:Riskstratification
(Thanks to Narges Razavian for some of the slides)
Outlinefortoday’sclass
1. Casestudyforriskstratification:EarlydetectionofType2diabetes
2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity
Outlinefortoday’sclass
1. Casestudyforriskstratification:EarlydetectionofType2diabetes
2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity
Type2Diabetes:AMajorpublichealthchallenge
1994 2000
<4.5%4.5%–5.9%6.0%–7.4%7.5%–8.9%>9.0%
2013
$245billion:Totalcostsofdiagnoseddiabetes intheUnitedStatesin2012$831billion:Totalfiscalyearfederal budgetforhealthcare intheUnitedStatesin2014
Type2DiabetesCanBePrevented*
Requirement for successful large scale prevention program1. Detect/reach truly at risk population
2. Improve the interventions
3. Lower the cost of intervention
*DiabetesPreventionProgramResearchGroup."Reductionintheincidenceoftype2diabeteswithlifestyleinterventionormetformin."TheNewEnglandjournalofmedicine346.6(2002):393.
Type2DiabetesCanBePrevented*
Requirement for successful large scale prevention program1. Detect/reach truly at risk population
2. Improve the interventions
3. Lower the cost of intervention
*DiabetesPreventionProgramResearchGroup."Reductionintheincidenceoftype2diabeteswithlifestyleinterventionormetformin."TheNewEnglandjournalofmedicine346.6(2002):393.
TraditionalRiskPredictionModels• SuccessfulExamples
• ARIC• KORA• FRAMINGHAM• AUSDRISC• FINDRISC• SanAntonioModel
• Easytoask/measureintheoffice,orforpatientstodoonline
• Simplemodel:cancalculatescoresbyhand
ChallengesofTraditionalRiskPredictionModels• A screening step needs to be done for every
member in the population• Either in the physician’s office or as surveys• Costly and time-consuming• Infeasible for regular screening for millions of individuals
• Models not easy to adapt to multiple surrogates, when a variable is missing• Discovery of surrogates not straightforward
Population-LevelRiskStratification
• Keyidea:Usereadilyavailableadministrative,utilization,andclinicaldata
• Machinelearningwillfindsurrogatesforriskfactorsthatwouldotherwisebemissing
• Performriskstratificationatthepopulationlevel– millionsofpatients
[Razavian, Blecker, Schmidt,Smith-McLallen,Nigam,Sontag.BigData.‘16]
AData-DrivenapproachonLongitudinalData
• Lookingatindividualswhogotdiabetestoday, (comparedtothosewhodidn’t)– Canweinferwhichvariables intheir recordcouldhavepredicted their
healthoutcome?
TodayAFewYearsAgo
Reminder:Administrative&ClinicalData
Patient:
EligibilityRecord:-MemberID-Age/gender-IDofsubscriber-Companycode
MedicalClaims:-ICD9diagnosiscodes-CPTcode(procedure)-Specialty-Locationofservice-DateofService
LabTests:-LOINCcode(urineorbloodtestname)-Results(actualvalues)-LabID-Rangehigh/low-Date
Medications:-NDCcode(drugname)-Daysofsupply-Quantity-ServiceProviderID-Dateoffill
time
Disease count4011Benignhypertension 4470172724Hyperlipidemia NEC/NOS 3820304019HypertensionNOS 37247725000DMIIwocmp nt st uncntr 3395222720Purehypercholesterolem 2326712722Mixedhyperlipidemia 180015V7231Routinegyn examination 1787092449HypothyroidismNOS 16982978079MalaiseandfatigueNEC 149797V0481Vaccin forinfluenza 1478587242Lumbago 137345V7612ScreenmammogramNEC 129445V700Routinemedicalexam 127848
Disease count71947Jointpain-ankle 286483004Dysthymicdisorder 285302689VitaminDdeficiencyNOS 28455V7281Preopcardiovsclrexam 278977243Sciatica 2760478791Diarrhea 27424V221Supervis oth normalpreg 2732036501Opnanglbrderln lorisk 2603337921Vitreousdegeneration 255924241Aorticvalvedisorder 2542561610VaginitisNOS 2473670219Othersborheickeratosis 244533804Impactedcerumen 24046
Disease count53081Esophagealreflux 12106442731Atrialfibrillation 1137987295Paininlimb 11244941401Crnry athrscl natve vssl 1044782859AnemiaNOS 10335178650ChestpainNOS 919995990Urin tractinfectionNOS 87982V5869Long-termusemedsNEC 85544496Chr airwayobstructNEC 785854779Allergic rhinitisNOS 7796341400Cor ath unsp vsl ntv/gft 75519
Outof135K patients who hadlaboratorydata
Topdiagnosiscodes
Labtest2160-0Creatinine 12847373094-0Ureanitrogen 12823442823-3Potassium 12808122345-7Glucose 12998971742-6Alanineaminotransferase 11878091920-8Aspartateaminotransferase 11879652885-2Protein 12773381751-7Albumin 12741662093-3Cholesterol 12682692571-8Triglyceride 125775113457-7Cholesterol.inLDL 124120817861-6Calcium 11653702951-2Sodium 1167675
Labtest
2085-9Cholesterol.in HDL 1155666718-7Hemoglobin 11527264544-3Hematocrit 11478939830-1Cholesterol.total/Cholesterol.in HDL 103773033914-3Glomerularfiltrationrate/1.73sqM.predicted 561309
785-6Erythrocytemeancorpuscularhemoglobin 10708326690-2Leukocytes 1062980789-8Erythrocytes 1062445
787-2Erythrocytemeancorpuscularvolume 1063665
Labtest770-8Neutrophils/100leukocytes 952089731-0Lymphocytes 943918704-7Basophils 863448711-2Eosinophils 9357105905-5Monocytes/100leukocytes 943764706-2Basophils/100leukocytes 863435751-8Neutrophils 943232742-7Monocytes 942978713-8Eosinophils/100leukocytes 9339293016-3Thyrotropin 8918074548-4HemoglobinA1c/Hemoglobin.total 527062
Countofpeoplewhohavethetestresult(ever)
Toplabtestresults
Outlinefortoday’sclass
1. Casestudyforriskstratification:EarlydetectionofType2diabetes
2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity
Alternativeframings• Alignbyrelativetime,e.g.
– 2hoursintopatientstayinER– EverytimepatientseesPCP– Whenindividualturns40yrs old
• Alignbydataavailability
NOTE:• Ifmultipledatapointsperpatient,makesureeachpatientinonly train,validate,ortest
Methods• L1RegularizedLogisticRegression
– Simultaneouslyoptimizespredictiveperformanceand– Performsfeatureselection,choosingthesubsetofthefeaturesthataremostpredictive
• Thispreventsoverfittingtothetrainingdata
Demographics(age,sex,etc.)
Healthinsurancecoverage
Proceduresperformed(457features)
Specialtyofdoctorsseen(cardiology,rheumatology,…)
FeaturesusedinmodelsServiceplace(urgentcare,inpatient,outpatient,…)
Laboratoryindicators(7000features)
Forthe1000mostfrequent labtests:• Wasthetesteveradministered?• Wastheresulteverlow?• Wastheresulteverhigh?• Wastheresultevernormal?• Isthevalueincreasing?• Isthevaluedecreasing?• Isthevaluefluctuating?
Medicationstaken(999features)(laxatives,metformin,anti-arthritics,…)
Demographics(age,sex,etc.)
Healthinsurancecoverage
Proceduresperformed(457features)
Specialtyofdoctorsseen(cardiology,rheumatology,…)
FeaturesusedinmodelsServiceplace(urgentcare,inpatient,outpatient,…)
Laboratoryindicators(7000features)
Medicationstaken(999features)(laxatives,metformin,anti-arthritics,…)
16,000ICD-9diagnosiscodes(allhistory)
Allhistory 24monthhistory
6monthhistory
Totalfeaturesperpatient:42,000
Outlinefortoday’sclass
1. Casestudyforriskstratification:EarlydetectionofType2diabetes
2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity
Wheredothelabelscomefrom?
1. Manuallylabeldatabychartreview2. Electronicphenotypingfrommedicalrecords3. Usemachinelearningtogetthelabels
themselves
Visualization(lookingatindividualpatients)isimportanttosanitycheck
labelingmethod
Demographic informationPatientevents list
Events,astheyoccurforthefirsttime inpatienthistory
GettingthelabelsusingtheAnchor&LearnFramework
• Useacombinationofdomainexpertise(simplerules)andvastamountsofdata(machinelearning)
• Methoddoesnotrequireanymanuallabeling• Anchorsarehighlytransferablebetweeninstitutions
[Halpernetal.,AMIA2014]
Whatareanchors?• Ratherthanprovidegold-standardlabels,constructasimplerulethatcancatchsomepositivecases.
• Examples:
Clin.statevar PossibleAnchorDiabetic gsn:016313(insulin)inMedications
Cardiac ICD9:428.X (heartfailure)inDiagnoses
Nursinghome “fromnursinghome”intext
Socialwork “socialworkconsulted”intext
Whatareanchors?• Ratherthanprovidegold-standardlabels,constructasimplerulethatcancatchsomepositivecases. Lowsensitivityhereisok!
• Examples:
Clin.statevar PossibleAnchorDiabetic gsn:016313(insulin)inMedications
Cardiac ICD9:428.X (heartfailure)inDiagnoses
Nursinghome “fromnursinghome”intext
Socialwork “socialworkconsulted”intext
LearningwithAnchorsLOINC& UMLS&CUID& RXnorm& ICD9& Unstructured&Data&
Patientdatabase
101
100
1
• Identifyanchors• Learntopredicttheanchors(anchoraspseudo-labels)• Accountforthedifferencebetweenanchorsandlabels
Transform
Predictanchor Predictlabel
Theoreticalbasisforanchors• Unobservedvariable:Y,Observation:A• A isananchorforY ifconditioningonA=1givesuniformsamplesfromthesetofpositivecases.
Theoreticalbasisforanchors• Unobservedvariable:Y,Observation:A• A isananchorforY ifconditioningonA=1givesuniformsamplesfromthesetofpositivecases.
• Alternativeformulation– twonecessaryconditions:
P (Y = 1|A = 1) = 1Positivecondition
A ? X|YConditionalindependence
AND
X representsallother observations.
Theoreticalbasisforanchors• Unobservedvariable:Y,Observation:A• A isananchorforY ifconditioningonA=1givesuniformsamplesfromthesetofpositivecases.
• Alternativeformulation– twonecessaryconditions:
P (Y = 1|A = 1) = 1Positivecondition
A ? X|YConditionalindependence
AND
X representsallother observations.
e.g.Ifpatient istakinginsulin,thepatient issurelydiabetic.
e.g.Ifweknowthepatient hadheartfailure,knowingwhetherthediagnosiscodeappearsdoesnotinformusabouttherestoftherecord.
Theoreticalbasisforanchors• Unobservedvariable:Y,Observation:A• A isananchorforY ifconditioningonA=1givesuniformsamplesfromthesetofpositivecases.
• Theorem[Elkan &Noto 2008]:Intheabovesetting,afunction topredictA
canbetransformedtopredict Y
• Canalsousemorerecentadvancesonlearningwithnoisylabels(e.g.,Natarajan etal.,NIPS‘13)
LearningwithanchorsInput:anchorA
unlabeledpatientsOutput:predictionrule1. Learnacalibratedclassifier (e.g.
logisticregression) topredict:
2. Usingavalidateset,letP bethepatientswithA=1.Compute:
3. Forapreviouslyunseenpatientt,predict:
Pr(A = 1 | X̃ )
C =1
|P|X
k2PPr(A = 1 | X̃ (k))
[Elkan&Noto 2008]
1
CPr(A = 1|X (t)) if A(t) = 0
1 if A(t) = 1
CalibrationC istheaveragemodel
prediction forpatientswithanchors.
LearningLearntopredictA fromtheothervariables.
TransformationIfnoanchorpresent,
accordingtoascaledversionoftheanchor-prediction
model.
Outlinefortoday’sclass
1. Casestudyforriskstratification:EarlydetectionofType2diabetes
2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity
WhataretheDiscoveredRiskFactors?
• 769variableshavenon-zeroweight
TopHistoryofDisease Odds RatioImpaired Fasting Glucose (Code 790.21) 4.17
(3.87 4.49)
Abnormal Glucose NEC (790.29) 4.07 (3.76 4.41)
Hypertension (401) 3.28 (3.17 3.39)
Obstructive Sleep Apnea (327.23) 2.98 (2.78 3.20)
Obesity (278) 2.88 (2.75 3.02)
Abnormal Blood Chemistry (790.6) 2.49 (2.36 2.62)
Hyperlipidemia (272.4) 2.45 (2.37 2.53)
Shortness Of Breath (786.05) 2.09 (1.99 2.19)
Esophageal Reflux (530.81) 1.85(1.78 1.93)
Diabetes1-yeargap
WhataretheDiscoveredRiskFactors?
TopHistoryofDisease Odds RatioImpaired Fasting Glucose (Code 790.21) 4.17
(3.87 4.49)
Abnormal Glucose NEC (790.29) 4.07 (3.76 4.41)
Hypertension (401) 3.28 (3.17 3.39)
Obstructive Sleep Apnea (327.23) 2.98 (2.78 3.20)
Obesity (278) 2.88 (2.75 3.02)
Abnormal Blood Chemistry (790.6) 2.49 (2.36 2.62)
Hyperlipidemia (272.4) 2.45 (2.37 2.53)
Shortness Of Breath (786.05) 2.09 (1.99 2.19)
Esophageal Reflux (530.81) 1.85(1.78 1.93)
Additional DiseaseRiskFactors Include:Pituitarydwarfism (253.3),Hepatomegaly(789.1), ChronicHepatitisC(070.54),Hepatitis (573.3),CalcanealSpur(726.73),Thyrotoxicosiswithoutmentionofgoiter(242.90),Sinoatrial Nodedysfunction(427.81),Acute frontalsinusitis(461.1),Hypertrophicandatrophicconditionsofskin(701.9),Irregularmenstruation(626.4), …
• 769variableshavenon-zeroweight
Diabetes1-yeargap
Top Lab Factors Odds RatioHemoglobin A1c /Hemoglobin.Total (High - past 2 years) 5.75
(5.42 6.10)
Glucose (High- Past 6 months) 4.05 (3.89 4.21)
Cholesterol.In VLDL (Increasing - Past 2 years) 3.88(3.53 4.27)
Potassium (Low - Entire History) 2.58(2.24 2.98)
Cholesterol.Total/Cholesterol.In HDL (High - Entire History) 2.29(2.19 2.40)
Erythrocyte mean corpuscular hemoglobin concentration -(Low - Entire History)
2.25(1.92 2.64)
Eosinophils (High - Entire History) 2.11(1.82 2.44)
Glomerular filtration rate/1.73 sq M.Predicted (Low -Entire History) 2.07(1.92 2.24)
Alanine aminotransferase (High Entire History) 2.04(1.89 2.19)
WhataretheDiscoveredRiskFactors?
• 769variableshavenon-zeroweight
Diabetes1-yeargap
Top Lab Factors Odds RatioHemoglobin A1c /Hemoglobin.Total (High - past 2 years) 5.75
(5.42 6.10)
Glucose (High- Past 6 months) 4.05 (3.89 4.21)
Cholesterol.In VLDL (Increasing - Past 2 years) 3.88(3.53 4.27)
Potassium (Low - Entire History) 2.58(2.24 2.98)
Cholesterol.Total/Cholesterol.In HDL (High - Entire History) 2.29(2.19 2.40)
Erythrocyte mean corpuscular hemoglobin concentration -(Low - Entire History)
2.25(1.92 2.64)
Eosinophils (High - Entire History) 2.11(1.82 2.44)
Glomerular filtration rate/1.73 sq M.Predicted (Low -Entire History) 2.07(1.92 2.24)
Alanine aminotransferase (High Entire History) 2.04(1.89 2.19)
WhataretheDiscoveredRiskFactors?
Additional LabTestRiskFactors Include:Albumin/Globulin (Increasing -Entirehistory),Ureanitrogen/Creatinine -(high-EntireHistory),Specificgravity(Increasing,Past2years),Bilirubin (high-Past2years),…
• 769variableshavenon-zeroweight
Diabetes1-yeargap
Receiver-operatorcharacteristiccurve
FullmodelTraditionalriskfactors
Falsepositiverate
Truepositiverate
Wanttobehere Obtainedbyvaryingpredictionthreshold
Diabetes1-yeargap
Receiver-operatorcharacteristiccurve
FullmodelTraditionalriskfactors
Falsepositiverate
Truepositiverate
AreaundertheROCcurve(AUC)
AUC=Probabilitythatalgorithmranksapositivepatientoveranegativepatient
Invarianttoamountofclassimbalance
Diabetes1-yeargap
Receiver-operatorcharacteristiccurve
FullmodelAUC=0.78TraditionalriskfactorsAUC=0.74
Falsepositiverate
Truepositiverate
Riskstratificationusuallyfocusesonjustthisregion
(becauseofthecostofinterventions)
RandomAUC=0.5
Diabetes1-yeargap
Positivepredictivevalue(PPV)
0.060.07
0.06
0.15
0.17
0.1
Top100Predictions Top1000Predictions Top10000Predictions
Traditionalriskfactors Fullmodel
Diabetes1-yeargap
Calibration(note:differentdataset)
Predicted Probability
Actual Probability
0
1
0.5
1
fraction of patients the model predicts to have this
probability of infection
Model
PredictinginfectionintheER
Outlinefortoday’sclass
1. Casestudyforriskstratification:EarlydetectionofType2diabetes
2. Framingassupervisedlearningproblem3. Derivinglabels4. Evaluatingriskstratificationalgorithms5. Non-stationarity
Majorchallenge:non-stationarity• ICD10rolledoutin2015:predictivemodelslearnedusingICD9featuresarenolongeruseful!
• Logisticalissues=>somefeaturesmaynotbeavailable!
• Prevalenceandsignificanceoffeaturesmaychangeovertime
• Automaticallyderivedlabelsmaychangemeaning
DiabetesOnsetafter2009
Months,after2009/01/01
Num
bero
fpeo
plewho
haddiabetesonset
Jan/Feb2010
Aslightdownwardtrendofnewdiagnoses(Amongpatientswhoare enrolled between2009and2013,howmanynewlydiagnoseddiabetics dowehaveeachmonth?)
Geiss LS,WangJ,ChengYJ,etal.PrevalenceandIncidenceTrendsforDiagnosedDiabetes AmongAdultsAged20to79Years,UnitedStates,1980-2012.JAMA.
2014;312(12):1218-1226.
DiabetesOnsetafter2009