next-generation phenotyping
![Page 1: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/1.jpg)
Next-generation phenotyping
George Hripcsak, MD, MS
Department of Biomedical Informatics
Columbia University, New York, USA
![Page 2: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/2.jpg)
Electronic health record
![Page 3: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/3.jpg)
National EHR data, per year
1,000,000,000 visit notes
35,000,000 admit notes, discharge summaries
46,000,000 procedure notes
3,000,000,000 prescriptions
1,000,000,000 laboratory tests
>50,000,000,000 facts
• Healthcare is a $2.5 trillion industry in the US
  – can’t duplicate
![Page 4: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/4.jpg)
Data quality
• “All medical record information should be regarded as suspect; much of it is fiction.”
  – Burnum JF ... Ann Intern Med 1989
• “Data shall be used only for the purpose for which they were collected. If no purpose was defined prior to the collection of the data, then the data should not be used.”
  – van der Lei J ... Method Inform Med 1991
![Page 5: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/5.jpg)
EHRs augment research databases
1. Data: “manually curated”
  – read record, enter into research database
2. Subjects: patient recruitment
3. Knowledge: sample size
4. Continuity: long-term follow-up
5. Fully EHR-based observational studies
  – without case-specific curation
6. Fully EHR-based interventional trials
![Page 6: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/6.jpg)
Solvable challenges
• Lack of penetration of EHRs
  – $30B HITECH in US
• Distributed systems, inconsistent formats
  – HL7, CDISC, …
• Privacy
  – policy
![Page 7: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/7.jpg)
Hard challenges
• Quality of the data
  – Ambiguous or unknown meaning
  – Accuracy
    • 50-100% accuracy [Hogan JAMIA 1997]
  – Completeness
    • mostly missing
  – Complexity
    • disease ontologies
• Bias
![Page 8: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/8.jpg)
Meaning
• PERRLA: Pupils equal, round, reactive to light and accommodation
![Page 9: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/9.jpg)
Missing
• Data are mostly missing
  – Sampled when sick
• Implicit information
  – Pertinent negatives by attending vs CC3
[Figure: plot; only the axis tick values (0 to 600 and 60 to 120) survive]
![Page 10: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/10.jpg)
Missing
• Missing completely at random (MCAR)
• Missing at random (MAR)
• Not missing at random (NMAR)
![Page 11: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/11.jpg)
Missing
• Missing completely at random (MCAR)
• Missing at random (MAR)
• Not missing at random (NMAR)
• Almost completely missing (ACM)
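The first three mechanisms can be illustrated with a small simulation. This is only a sketch: the glucose distributions, the inpatient flag, and every probability below are made-up values chosen for illustration, and `observed_means` is a hypothetical helper, not something from the talk. It compares the mean of the recorded values with the mean of all values under each mechanism.

```python
import random

def observed_means(n=20000, seed=0):
    """Mean glucose among recorded values under three missingness mechanisms.

    All distributions and probabilities are illustrative assumptions.
    """
    rng = random.Random(seed)
    seen = {"all": [], "mcar": [], "mar": [], "nmar": []}
    for _ in range(n):
        inpatient = rng.random() < 0.3          # an observed covariate
        glucose = rng.gauss(140 if inpatient else 100, 15)
        seen["all"].append(glucose)
        # MCAR: the chance of being recorded ignores everything.
        if rng.random() < 0.2:
            seen["mcar"].append(glucose)
        # MAR: the chance depends only on the observed covariate.
        if rng.random() < (0.5 if inpatient else 0.1):
            seen["mar"].append(glucose)
        # NMAR: the chance depends on the value that may go unrecorded.
        if rng.random() < (0.5 if glucose > 130 else 0.05):
            seen["nmar"].append(glucose)
    return {name: sum(v) / len(v) for name, v in seen.items()}

means = observed_means()
for name, value in means.items():
    print(f"{name:>4}: {value:6.1f}")
```

In this toy setup the MCAR mean stays close to the true mean, while the MAR and NMAR means are pulled upward. The MAR bias could be corrected by stratifying on the inpatient flag; the NMAR bias cannot be corrected from the recorded data alone.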
![Page 12: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/12.jpg)
Noisy
• As low as 50% accuracy (Hogan JAMIA 1997)
• … 36 year old man … 27 year old woman …
![Page 13: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/13.jpg)
[Diagram]
• Truth: health status of the patient
  – observe & interpret →
• Concept: clinician or patient’s conception
  – author →
• Record: EHR/PHR
  – read → Concept: 2nd clinician’s conception of the patient (or self, lawyer, compliance, ...)
  – process → Model: computable representation
![Page 14: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/14.jpg)
[Diagram, annotated]
• Truth: health status of the patient
  – observe & interpret →
• Concept: clinician or patient’s conception
  – author →
• Record: EHR/PHR
  – read → Concept: 2nd clinician’s conception of the patient (or self, lawyer, compliance, ...)
  – process → Model: computable representation
• Annotations on the diagram: Error (three places), Implicit
![Page 15: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/15.jpg)
Complex
• Narrative text holds much of the useful info
  – Slight increase of pulmonary vascular congestion with new left pleural effusion, question mild congestive changes
  – s/p LURT 1998 c/b 1A rejection 7/07 back on HD
![Page 16: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/16.jpg)
Natural language processing
“Slight increase of pulmonary vascular congestion with new left pleural effusion, question mild congestive changes”
• pulmonary vascular congestion
  – change: increase
  – degree: low
• pleural effusion
  – region: left
  – status: new
• congestive changes
  – certainty: moderate
  – degree: low
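One way to hold this kind of output in code is a concept plus a dictionary of modifiers. The sketch below is not an NLP system and does not reproduce any particular tool's schema; the `Finding` class and `select` helper are hypothetical names used only to show how the structured form of the sentence above could be queried.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    """One extracted concept with its modifiers (hypothetical structure)."""
    concept: str
    modifiers: dict = field(default_factory=dict)

# Structured form of the example sentence, copied from the slide.
findings = [
    Finding("pulmonary vascular congestion", {"change": "increase", "degree": "low"}),
    Finding("pleural effusion", {"region": "left", "status": "new"}),
    Finding("congestive changes", {"certainty": "moderate", "degree": "low"}),
]

def select(findings, **required):
    """Return findings whose modifiers include every required key/value pair."""
    return [f for f in findings
            if all(f.modifiers.get(k) == v for k, v in required.items())]

new_findings = [f.concept for f in select(findings, status="new")]
print(new_findings)
```

Once the narrative is in this form, a phenotype query can filter on modifiers such as certainty or status instead of matching raw text.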
![Page 17: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/17.jpg)
Complex
• Which is the right time?
  – When specimen drawn
  – When specimen received
  – When test performed
  – When result updated
  – When result received by the patient
  – When patient told clinician
  – When clinician wrote the note
![Page 18: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/18.jpg)
Biased
• Completeness, noise, and complexity depend on the state we are trying to measure
• Billing and liability are motivations
![Page 19: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/19.jpg)
Completeness, sampling bias
[Figure: value over time for a patient who is stable, ill, stable, has a lapse in visits, then stable (?); rows show patient state, clinician sampling, theoretical predictability w.r.t. time (delta-t), and predictability w.r.t. sequence (tau)]
![Page 20: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/20.jpg)
Biased
[Diagram relating: Patient state, Electronic health record, Care team, Therapy, Objective tests, Environment]
![Page 21: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/21.jpg)
Inpatient mortality for community-acquired pneumonia
[Figure: mortality (%), 0 to 35, by Fine class (1 to 5) for three series: 18715 cohort, 1935 cohort, and Fine]
• 18715 cohort: +CXR +fdg -recent pneu -recent visit
• 1935 cohort: above plus +DSUM exist +ICD9 (pneu not sepsis)
Hripcsak ... Comput Biol Med 2007;37:296-304
![Page 22: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/22.jpg)
Good news
• Clinicians use the record for patient care
  – Human interpretation
• Can we deconvolve the truth?
  – Need new tools to handle it
![Page 23: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/23.jpg)
EHR-derived phenotype
• Clinically relevant feature derived from EHR
  – Patient has (a diagnosis of) type II diabetes
  – Recent rash and fever
  – Drug-induced liver injury
• Then use the phenotype in correlation studies, etc.
![Page 24: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/24.jpg)
State of the art
• Knowledge engineer and domain expert iterate on a query that combines information from multiple sources
  – Diagnosis, medication, laboratory tests, etc.
• Can take months per query
  – eMERGE
• Bias of developers, generalizability, ...
• How to improve time and accuracy
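To make the idea of a multi-source query concrete, here is a deliberately simple sketch for a type II diabetes phenotype. It is not the eMERGE algorithm or any validated definition: the record layout, the `has_t2dm_phenotype` name, the choice of metformin as the only drug, and the "two of three sources" rule are all assumptions made for illustration.

```python
def has_t2dm_phenotype(patient, min_sources=2):
    """Flag a patient when enough independent sources agree.

    `patient` is a hypothetical dict with optional keys
    'icd9', 'medications', and 'hba1c_percent'.
    """
    evidence = [
        # Diagnosis: ICD-9 250.x0 or 250.x2 (type II codes).
        any(len(code) == 6 and code.startswith("250.") and code[-1] in "02"
            for code in patient.get("icd9", [])),
        # Medication: metformin only, to keep the example short.
        any(drug.lower() == "metformin"
            for drug in patient.get("medications", [])),
        # Laboratory: any HbA1c at or above 6.5%.
        any(value >= 6.5 for value in patient.get("hba1c_percent", [])),
    ]
    return sum(evidence) >= min_sources

example = {"icd9": ["250.00"], "medications": [], "hba1c_percent": [7.1]}
print(has_t2dm_phenotype(example))
```

Even in this toy form, each threshold and code list is a judgment call, which is where the months of iteration and the developer bias come from.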
![Page 25: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/25.jpg)
High-throughput phenotyping
• Elimination of case-by-case curation through queries
• Generate thousands of phenotype queries with minimal human intervention such that they can be maintained over time
![Page 26: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/26.jpg)
Solution
• Top-down knowledge engineering + bottom-up machine learning
• Study the EHR as an object in itself
• Health care process model
• Quantify bias to avoid it or correct for it
![Page 27: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/27.jpg)
Methods
• Characterization
• Dimension reduction
• Latent variables
• Temporal processing
• Natural language processing
• Derived properties
• Causality
![Page 28: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/28.jpg)
Health care process model
![Page 29: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/29.jpg)
“Physics” of the medical record
1. Study EHR as if it were a natural object
  – Use EHR to learn about EHR
  – Not studying patient, but recording of patient
2. Aggregate across units and model
3. Borrow methods from non-linear time series
![Page 30: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/30.jpg)
Glucose by Δt and tau
[Figure: surface plot of mutual information (MI, scale from -0.1 to 0.45) for glucose as a function of tau and delta-t (days)]
Albers ... Translational Bioinformatics 2009
![Page 31: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/31.jpg)
Correlate lab tests and concepts
• 22 years of data on 3 million patients
• 21 laboratory tests
  – sodium, potassium, bicarbonate, creatinine, urea nitrogen, glucose, and hemoglobin
• 60 concepts derived from signout notes
  – written by residents caring for inpatients to facilitate the transfer of care for overnight coverage
  – concepts likely to have an association + controls
![Page 32: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/32.jpg)
Methods
• Extract concepts using case-insensitive stemmed search phrases in signout notes, and assign time of note
• Normalize each laboratory test within patient to eliminate inter-patient effects
• Interpolate both time series so every point has a partner
  – Treat concepts as 0/1
• Time lag by +/− 60 days
• Calculate Pearson’s linear correlation
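The steps above can be sketched in plain Python. This is a minimal reading of the slide, not the published implementation: it assumes z-score normalization within patient, linear interpolation, pooling of all patients' pairs before a single Pearson correlation per lag, and a sign convention in which a positive lag means the lab follows the concept. The function names and the toy patient are hypothetical.

```python
from bisect import bisect_left
from math import sqrt

def zscore(values):
    """Normalize one patient's lab values to mean 0 and SD 1."""
    mean = sum(values) / len(values)
    sd = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd if sd else 0.0 for v in values]

def interp(times, values, t):
    """Linear interpolation at t; None outside the observed span.
    Assumes times are sorted and strictly increasing."""
    if t < times[0] or t > times[-1]:
        return None
    i = bisect_left(times, t)
    if times[i] == t:
        return values[i]
    w = (t - times[i - 1]) / (times[i] - times[i - 1])
    return values[i - 1] + w * (values[i] - values[i - 1])

def pearson(xs, ys):
    """Pearson's linear correlation; 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0

def lagged_correlation(patients, lags=range(-60, 61)):
    """patients: list of (lab_days, lab_values, concept_days, concept_flags).
    Returns {lag: r}; positive lag pairs a concept with the lab `lag` days later."""
    normalized = [(lt, zscore(lv), ct, cv) for lt, lv, ct, cv in patients]
    curve = {}
    for lag in lags:
        xs, ys = [], []
        for lab_t, lab_z, con_t, con_v in normalized:
            # Give every lab point a (possibly interpolated) concept partner.
            for t, z in zip(lab_t, lab_z):
                c = interp(con_t, con_v, t - lag)
                if c is not None:
                    xs.append(c)
                    ys.append(z)
            # Give every concept point a (possibly interpolated) lab partner.
            for t, c in zip(con_t, con_v):
                z = interp(lab_t, lab_z, t + lag)
                if z is not None:
                    xs.append(c)
                    ys.append(z)
        curve[lag] = pearson(xs, ys) if len(xs) > 1 else 0.0
    return curve

# Toy patient: concept noted on day 100, lab elevated on days 110 to 114.
days = list(range(201))
lab = [5.0 if 110 <= d <= 114 else 4.0 for d in days]
flag = [1 if d == 100 else 0 for d in days]
curve = lagged_correlation([(days, lab, days, flag)])
best_lag = max(curve, key=curve.get)
print(best_lag, round(curve[best_lag], 3))
```

For this toy patient the curve is positive only at lags of 10 to 14 days, which is where the lab elevation follows the concept.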
![Page 33: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/33.jpg)
Lagged linear correlation
[Figure: lagged linear correlation (y-axis -0.15 to 0.15) between the lab potassium and the concepts aldactone, dialysis, hyperkalemia, hypokalemia, and hypomagnesemia, for lags of -60 to 60 days; negative lag: lab precedes concept, positive lag: lab follows concept; positive and negative correlation regions are marked]
![Page 34: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/34.jpg)
Definitional association
[Figure: lagged linear correlation (y-axis -0.3 to 0.3) between sodium and the concepts aldactone, hctz, hypernatremia, hyponatremia, and lasix, for lags of -60 to 60 days]
Hripcsak ... JAMIA 2011
![Page 35: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/35.jpg)
Intentional and physiologic associations
[Figure: lagged linear correlation (y-axis -0.15 to 0.15) between potassium and the concepts aldactone, dialysis, hyperkalemia, hypokalemia, and hypomagnesemia, for lags of -60 to 60 days]
![Page 36: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/36.jpg)
Timing of cause in disease vs. treatment
[Figure: lagged linear correlation (y-axis -0.04 to 0.12) between glucose and the concepts hyperglycemia, hypernatremia, hypoglycemia, insulin, metformin, and pancreatitis, for lags of -60 to 60 days]
![Page 37: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/37.jpg)
Shape of curve cause vs. definition
[Figure: lagged linear correlation (y-axis -0.04 to 0.14) between creatinine and the concepts aldactone, dialysis, diarrhea, diuretic, hctz, hyperglycemia, hypernatremia, and vomiting, for lags of -60 to 60 days]
![Page 38: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/38.jpg)
Specificity of the concept
[Figure: lagged linear correlation (y-axis -0.04 to 0.14) between creatinine and the concepts aldactone, dialysis, diarrhea, diuretic, hctz, hyperglycemia, hypernatremia, and vomiting, for lags of -60 to 60 days]
![Page 39: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/39.jpg)
Value of aggregation
• Blood potassium vs aldactone
  – all values: 5424 pts, 570,000 values
  – ≤10 values: 444 pts, 2534 values (.4%), 6/pt
[Figure: lagged correlation (y-axis -0.03 to 0.04) for lags of -60 to 60 days, two series: ≤10 values and all values]
![Page 40: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/40.jpg)
Value of using all time and normalization
[Figure: potassium vs. Aldactone lagged correlation (y-axis -0.12 to 0.04) for lags of -60 to 60 days, four series: corrected, no time, no normalize, no interpolation]
![Page 41: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/41.jpg)
Ranking association curves
• Actual correlation is only 0.05
  – Most are significant (not just 500 of 10000)
• How to order association curves
  – Size of association: maximum correlation
  – Consistency of association: area under the curve
  – Time dependence of association: range
    • maximum correlation minus minimum correlation over +/– 60 days
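The three candidate orderings reduce each association curve to one number. The sketch below shows one way to compute them, under assumptions the slide does not settle: correlations are used with their sign (not absolute values), and the area is a trapezoidal integral over the lag axis. `curve_summaries` is a hypothetical helper name.

```python
def curve_summaries(curve):
    """curve: {lag_days: correlation}. Returns max, area, and range."""
    lags = sorted(curve)
    r = [curve[lag] for lag in lags]
    # Signed trapezoidal area under the curve across the lag window.
    area = sum((lags[i + 1] - lags[i]) * (r[i] + r[i + 1]) / 2
               for i in range(len(lags) - 1))
    return {"max": max(r), "area": area, "range": max(r) - min(r)}

# A flat curve versus one that changes sign around lag 0.
flat = {lag: 0.05 for lag in range(-60, 61)}
swing = {lag: 0.04 * lag / 60 for lag in range(-60, 61)}
print(curve_summaries(flat))
print(curve_summaries(swing))
```

The flat curve wins on maximum and area but has zero range, while the sign-changing curve has almost no area but a large range. That contrast is why range captures time dependence rather than overall level.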
![Page 42: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/42.jpg)
Ranking association curves
• 21 lab tests, 60 concepts
• Expert: for each concept, 0-6 lab tests that ought to be most strongly correlated with the concept based on medical knowledge
  – Anemia: hematocrit, hemoglobin, RBC
  – Hyponatremia: sodium
  – Diuretics: six electrolytes
• Measure match between system and expert
  – Proportion of labs algorithm places in “top”
  – “Top” is number of labs selected by expert for concept
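The match measure can be written in a few lines. This is a sketch under the reading that "top" means the first k entries of the algorithm's ranking, with k equal to the number of labs the expert chose; the function name and the example ranking are hypothetical, and a concept for which the expert chose no labs is simply skipped.

```python
def proportion_in_top(ranked_labs, expert_labs):
    """Fraction of expert-selected labs found in the algorithm's top k,
    where k is the number of labs the expert selected."""
    k = len(expert_labs)
    if k == 0:
        return None  # expert chose no labs for this concept
    top = set(ranked_labs[:k])
    return len(top & set(expert_labs)) / k

# Made-up ranking for the concept "anemia" (best first).
ranking = ["hematocrit", "hemoglobin", "glucose", "RBC", "sodium"]
expert = ["hematocrit", "hemoglobin", "RBC"]
print(proportion_in_top(ranking, expert))
```

Here two of the expert's three labs fall in the top three, giving 2/3. Averaging this proportion over concepts gives a single score per ranking method.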
![Page 43: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/43.jpg)
Ranking association curves
• Examples:
  – the six labs selected by the expert (potassium, sodium, urea nitrogen, creatinine, chloride, bicarbonate) had the six highest ranges for spironolactone
  – anemia's three (hematocrit, hemoglobin, RBC) were also at its top
  – for atrial fibrillation, the expert chose anticoagulation tests, but the white blood count and bicarbonate ranked higher, perhaps reflecting the role of infection and electrolyte disturbance in atrial fibrillation
![Page 44: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/44.jpg)
Ranking association curves

| Algorithm | Proportion within top |
| --- | --- |
| Maximum correlation | 0.44* |
| Area under the curve | 0.33* |
| Range | 0.62* |

*all differ by paired t-test
Hripcsak ... Translational Bioinformatics 2012
![Page 45: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/45.jpg)
Ranking association curves
• In 19 concepts, expert picked 1 lab
  – Range ranked that test at the very top in 12 cases (63%)
![Page 46: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/46.jpg)
Ranking association curves
• How to factor out other effects
  1. Normalize one variable to reduce inter-patient effects
  2. Look for time dependence of the association
![Page 47: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/47.jpg)
Meaning of lagged linear correlation
• Usually used in surveillance to detect lag in information
• What if one variable is dichotomous
  – Concept in clinical notes
• What if dichotomous one is rare and short lived
  – Start of medication
![Page 48: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/48.jpg)
Hripcsak ... Translational Bioinformatics 2012
![Page 49: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/49.jpg)
[Figure: schematic; labels: x (start of medication), y (sodium), lag, start of medication]
![Page 50: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/50.jpg)
![Page 51: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/51.jpg)
[Figure: mean in bin and median in bin (y-axis 0 to 1200) plotted over -80 to 80]
![Page 52: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/52.jpg)
[Figure: curve (y-axis -0.04 to 0.02) plotted over -80 to 80]
![Page 53: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/53.jpg)
Drug interaction example
[Figure: curves (y-axis -0.02 to 0.14) plotted over -80 to 80 for three series: glucose paxil_pravastatin, glucose paxil, and glucose pravastatin]
From Tatonetti, et al.
![Page 54: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/54.jpg)
[Figure: schematic; labels: x, y, sodium, serum concentration]
![Page 55: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/55.jpg)
Meaning
• If one is dichotomous
  – Lagged linear correlation is equivalent to aligning all instances of the condition and averaging the other variable forwards and backwards in time (window)
• Virtual alignment
  – While it is difficult to align cases for symbolic methods, numeric methods may accommodate the fuzzy and ambiguous start times
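The equivalence can be checked numerically on a regular grid. For a 0/1 variable, Pearson's r at a given lag equals the difference between the mean of the other variable at aligned event times and its mean elsewhere, scaled by sqrt(p(1-p)) / SD, where p is the fraction of event points (the point-biserial identity). The sketch below uses made-up data and hypothetical function names; it illustrates the identity and is not the analysis from the talk.

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def lagged_pairs(events, values, lag):
    """Pair events[t] with values[t + lag] wherever both indices exist."""
    n = len(events)
    return [(events[t], values[t + lag]) for t in range(n) if 0 <= t + lag < n]

def correlation_at_lag(events, values, lag):
    xs, ys = zip(*lagged_pairs(events, values, lag))
    return pearson(xs, ys)

def aligned_average_at_lag(events, values, lag):
    """Align on events, average the other variable at this lag,
    compare with the non-event average, and rescale."""
    pairs = lagged_pairs(events, values, lag)
    on = [v for e, v in pairs if e == 1]
    off = [v for e, v in pairs if e == 0]
    ys = [v for _, v in pairs]
    mean_y = sum(ys) / len(ys)
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in ys) / len(ys))
    p = len(on) / len(pairs)
    diff = sum(on) / len(on) - sum(off) / len(off)
    return diff * sqrt(p * (1 - p)) / sd_y

# Made-up data: two events, with a bump in the value three steps later.
events = [1 if t in (10, 30) else 0 for t in range(50)]
values = [(t * 7) % 11 + (3 if t in (13, 33) else 0) for t in range(50)]
for lag in (-5, 0, 3, 5):
    print(lag, round(correlation_at_lag(events, values, lag), 6),
          round(aligned_average_at_lag(events, values, lag), 6))
```

When the event is rare, the non-event mean is close to the overall mean, so the lagged correlation curve is essentially the event-aligned average of the other variable, shifted and rescaled.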
![Page 56: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/56.jpg)
Population physiology
Albers DJ, Chaos 2012, and Albers DJ, Physics Letters A 2010
![Page 57: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/57.jpg)
![Page 58: Next-generation phenotyping](https://reader036.vdocuments.site/reader036/viewer/2022062310/56816231550346895dd2630f/html5/thumbnails/58.jpg)
Conclusion
• Numeric methods may be able to extract knowledge from noisy EHRs
• Better performance when extraneous effects can be factored out
• EHR research can benefit from collaboration
  – Informatics, Computer science, Statistics/Epi
  – Non-linear physics (aggregation of short time series)
  – Philosophy (causation)
Funded by National Library of Medicine, USA R01 LM006910 and T15 LM007079