Using Association Rule Mining for Phenotype Extraction from Electronic Health Records
2013 Summit on Clinical Research Informatics
Dingcheng Li, PhD1
Gyorgy Simon, PhD2
Christopher G. Chute, MD, DrPH1
Jyotishman Pathak, PhD1
1Mayo Clinic, Rochester
2University of Minnesota, Twin Cities
2013 AMIA Clinical Research Informatics Summit
High-Throughput Phenotyping from EHRs
Outline
• Clinical phenotyping from electronic health records (EHRs)
• Machine learning techniques for phenotyping
• Association Rule Mining and T2DM
• Results
• Discussion
©2013 MFMER | slide-2
EHR-driven Phenotyping: The Process
[Figure: process diagram. Labels: Data, Transform, Transform, Phenotype Algorithm, Visualization, Evaluation; annotations: NLP, SQL; Rules; Mappings] [eMERGE Network]
Example: Hypothyroidism Algorithm
[Figure: algorithm flowchart with inputs labeled Drugs, Labs, Diagnosis, NLP, and Procedures] [Conway et al. AMIA 2011: 274-83]
http://gwas.org [eMERGE Network]
Genotype-Phenotype Association Results
[Figure: odds ratios, observed vs. published, by disease and gene/region marker. Diseases: atrial fibrillation, Crohn's disease, multiple sclerosis, rheumatoid arthritis, type 2 diabetes. Markers: rs2200733 (Chr. 4q25), rs10033464 (Chr. 4q25), rs11805303 (IL23R), rs17234657 (Chr. 5), rs1000113 (Chr. 5), rs17221417 (NOD2), rs2542151 (PTPN22), rs3135388 (DRB1*1501), rs2104286 (IL2RA), rs6897932 (IL7RA), rs6457617 (Chr. 6), rs6679677 (RSBN1), rs2476601 (PTPN22), rs4506565 (TCF7L2), rs12255372 (TCF7L2), rs12243326 (TCF7L2), rs10811661 (CDKN2B), rs8050136 (FTO), rs5219 (KCNJ11), rs5215 (KCNJ11), rs4402960 (IGF2BP2)] [Ritchie et al. AJHG 2010; 86(4):560-72]
EHR-driven Phenotyping: The Process
[Figure: the same process diagram, annotated "Time consuming!"] [eMERGE Network]
Our research agenda
• Develop effective machine learning methods for automatic phenotype extraction to reduce the workload of manually developing phenotyping algorithms
• Explore effective ways to extract features from EHR data and generate highly predictive models
• Study methods for phenotype extraction from EHRs to facilitate population-based studies for clinical and translational research
Common Modeling Approaches
• Logistic regression/Survival Analysis
  • No ability to discover interactions
• Decision Trees/RandomForest/Gradient-boosted Trees
  • Greedy approach to discover interaction
• Association Rule Mining (ARM)
  • Specifically designed to discover interactions
Association rule mining
• Proposed by Agrawal et al. (1993); the Apriori algorithm was published at VLDB 1994
• An important data mining model, studied extensively by the database and data mining community
• Assumes all data are categorical
• No good algorithm for numeric data
• Initially used for Market Basket Analysis to find how items purchased by customers are related
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t: a set of items, with t ⊆ I.
• Transaction Database T: a set of transactions T = {t1, t2, …, tn}.
The model: data
• Market basket transactions:
  t1: {bread, cheese, milk}
  t2: {apple, eggs, salt, yogurt}
  … …
  tn: {biscuit, eggs, milk}
• Concepts:
  • An item: an item/article in a basket
  • I: the set of all items sold in the store
  • A transaction: items purchased in a basket; it may have a TID (transaction ID)
  • A transactional dataset: a set of transactions
The model: rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form X → Y, where X, Y ⊂ I, and X ∩ Y = ∅.
• An itemset is a set of items.
  • E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
  • E.g., {milk, bread, cereal} is a 3-itemset.
Rule strength measures
• Support: The rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
  • sup = Pr(X ∪ Y), i.e., the fraction of transactions that contain every item of both X and Y.
• Confidence: The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
  • conf = Pr(Y | X)
• An association rule is a pattern stating that when X occurs, Y occurs with a certain probability.
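The two measures can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `support` and `confidence` and the toy baskets are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Estimate Pr(Y | X) as sup(X ∪ Y) / sup(X)."""
    sup_x = support(x, transactions)
    if sup_x == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support(frozenset(x) | frozenset(y), transactions) / sup_x


# Toy baskets, made up for illustration only.
baskets = [
    frozenset({"bread", "cheese", "milk"}),
    frozenset({"bread", "milk"}),
    frozenset({"eggs", "milk"}),
    frozenset({"bread", "eggs"}),
]

sup_bread_milk = support({"bread", "milk"}, baskets)            # 2 of 4 baskets
conf_bread_to_milk = confidence({"bread"}, {"milk"}, baskets)   # 2 of the 3 bread baskets
```

Confidence is computed from two support values, so a miner only needs itemset counts to score every rule.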
An example
• Transaction data:
  t1: Beef, Chicken, Milk
  t2: Beef, Cheese
  t3: Cheese, Boots
  t4: Beef, Chicken, Cheese
  t5: Beef, Chicken, Clothes, Cheese, Milk
  t6: Chicken, Clothes, Milk
  t7: Chicken, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
  Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
  … …
  Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
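The example can be checked with a short script. The sketch below uses the seven transactions from the slide and enumerates every rule that can be formed from the frequent itemset {Chicken, Clothes, Milk}; the helper names `count` and `rules_from_itemset` are hypothetical.

```python
from itertools import combinations

# The seven transactions t1..t7 from the example slide.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def rules_from_itemset(itemset, minconf):
    """All rules X -> (itemset - X) meeting minconf.

    Assumes `itemset` is frequent, so every antecedent has a non-zero count.
    Returns tuples (antecedent, consequent, support, confidence).
    """
    itemset = frozenset(itemset)
    n = len(transactions)
    both = count(itemset)
    found = []
    for k in range(1, len(itemset)):
        for x in combinations(sorted(itemset), k):
            x = frozenset(x)
            conf = both / count(x)
            if conf >= minconf:
                found.append((x, itemset - x, both / n, conf))
    return found


rules = rules_from_itemset({"Chicken", "Clothes", "Milk"}, minconf=0.8)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), f"sup={sup:.3f} conf={conf:.2f}")
```

With minconf = 80%, three of the six possible rules survive, including the two shown on the slide; rules such as Chicken → Clothes, Milk fail because Chicken appears in five transactions but the full itemset in only three.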
Distributional Association Rule Mining
Distributional Association Rules associate an itemset with a continuous outcome. [Simon et al. KDD 2011; 823-831]
Based on the Apriori Algorithm (Agrawal, VLDB 1994)
Algorithm Apriori(T)
  C1 ← init-pass(T);
  F1 ← {f | f ∈ C1, f.count/n ≥ minsup};   // n: no. of transactions in T
  for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck ← candidate-gen(Fk-1);
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
      end
    end
    Fk ← {c ∈ Ck | c.count/n ≥ minsup}
  end
  return F ← ∪k Fk;
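A runnable version of the pseudocode might look like the following Python sketch. It is an unoptimized illustration, not the implementation used in the study: candidate generation here is a simple pairwise join followed by downward-closure pruning, and the names `apriori`, `example`, and `frequent_itemsets` are assumptions.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    # init-pass: count single items and keep the frequent ones (F1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # candidate-gen: join frequent (k-1)-itemsets into k-itemsets and keep
        # only those whose (k-1)-subsets are all frequent (downward closure).
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count each candidate with one scan over the transactions.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        result.update(frequent)
        k += 1
    return result


# The seven transactions from the worked example, with minsup = 30%.
example = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]
frequent_itemsets = apriori(example, minsup=0.3)
```

On this data the only frequent 3-itemset is {Chicken, Clothes, Milk}; a candidate such as {Beef, Chicken, Cheese} is pruned before counting because its subset {Chicken, Cheese} is not frequent.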
Use Case: Type 2 Diabetes
[Slide shows a page excerpt from a Mayo Clinic Proceedings article on the Mayo Genome Consortia (MayoGC):]
genetic research within EMR systems [1,2]. Successful use of this approach in the eMERGE Network has inspired the creation of the intramural Mayo Genome Consortia (MayoGC). The goal of the MayoGC is to assemble a large cohort of participants from research studies across Mayo Clinic with high-throughput genetic data and to use EMR for phenotype extraction for cost-effective genetic research. Herein, we describe the design of the MayoGC, including the current participating cohorts, expansion efforts, data processing, and study management and organization. As a test of the genetic research capability of the MayoGC, we conducted a GWA study to identify genetic variants associated with total bilirubin levels. Bilirubin levels have a large variability in the population, with heritability of roughly 0.50 [3]. Two previous GWA studies identified variants from similar genomic locations with strong and moderate effects on bilirubin levels [4,5], making this phenotype an ideal candidate for testing. The MayoGC provides a model of a unique collaborative effort in the environment of a common EMR for the investigation of genetic determinants of diseases.
PARTICIPANTS AND METHODS
MayoGC is a large cohort of Mayo Clinic patients with EMR and genotype data. Eligible participants include those who gave general research (ie, not disease-specific) consent in the contributing studies to share high-throughput genotyping data with other investigators. This cohort is being built in 2 phases. Phase 1, which has been completed, includes participants from 3 studies funded by the National Institutes of Health, which sought to identify genetic determinants of peripheral arterial disease (PAD), venous thromboembolism, and pancreatic cancer, respectively, with a combined total sample size of 6307 unique participants (Table 1). The eMERGE study contributed genotype data for 3336 participants with PAD and control participants recruited from Mayo Clinic’s noninvasive vascular and exercise stress testing laboratories, respectively [2]. Peripheral arterial disease was defined by documentation of at least 1 of the following: (1) an ankle-brachial index (ABI) of 0.9 or less at rest or 1 minute after exercise, (2) the presence of poorly compressible arteries, or (3) a normal ABI but history of revascularization for PAD. Control participants had a normal ABI and no history of PAD [2].
The GENEVA (Gene Environment Association Studies) Study of Venous Thromboembolism of the National Human Genome Research Institute enrolled consecutive Mayo Clinic outpatients with objectively diagnosed deep venous thrombosis and/or pulmonary embolism who resided in the upper Midwest and had been referred by a Mayo Clinic physician to the Mayo Clinic Special Coagulation Laboratory or to the Mayo Clinic Thrombophilia Center [6]. A deep venous thrombosis or pulmonary embolism was categorized as objectively diagnosed (1) when it was confirmed by venography or pulmonary angiography or via a pathology examination of a thrombus removed at surgery or (2) if findings on at least 1 noninvasive test (compression duplex ultrasonography, lung scan, computed tomography, magnetic resonance imaging) were positive. Persons with venous thromboembolism related to active cancer were excluded. A control group was prospectively recruited for this study. Control participants were frequency-matched
TABLE 1. MayoGC Phase 1 Studies (a, b)
Characteristics | eMERGE Network (PAD) [2]: Cases (n=1612) | eMERGE Network (PAD) [2]: Controls (n=1585) | GENEVA (VTE) [6]: Cases (n=1233) | GENEVA (VTE) [6]: Controls (n=1264) | PANC [7,8]: Controls (n=613)
Age (y), mean ± SD | 66.0±10.7 | 61.0±7.4 | 55.0±16.2 | 56.0±15.8 | 66.0±10.0
Female (%) | 36 | 40 | 50 | 52 | 45
Medical record length (y), mean ± SD | 23.4±20.0 | 26.1±20.3 | 13.7±16.3 | 21.1±15.4 | 30.2±16.5
Medical record length (y), median (range) | 18.7 (1.0-78.6) | 23.0 (1.0-79.2) | 6.3 (1.0-71.8) | 17.8 (1.0-70.2) | 29.8 (1.0-75.0)
White (%) | 94 | 94 | 96 | 99 | 100
Geographic location, No. (%) (c) | | | | |
  Olmsted County | 328 (20) | 590 (37) | 7 (1) | 10 (1) | 64 (10)
  Southeast Minnesota | 191 (12) | 62 (4) | 205 (17) | 378 (30) | 107 (17)
  Greater Minnesota | 393 (24) | 343 (22) | 314 (25) | 317 (25) | 135 (22)
  Iowa | 212 (13) | 97 (6) | 176 (14) | 191 (15) | 65 (11)
  South and North Dakota | 50 (3) | 31 (2) | 79 (6) | 71 (6) | 19 (3)
  Wisconsin | 128 (8) | 68 (4) | 121 (10) | 138 (11) | 32 (5)
  Other states or international | 309 (19) | 394 (25) | 330 (27) | 159 (13) | 191 (31)
(a) eMERGE = Electronic Medical Records and Genomics; GENEVA = Gene Environment Association Studies; MayoGC = Mayo Genome Consortia; PAD = peripheral arterial disease; PANC = Mayo Clinic Molecular Epidemiology of Pancreatic Cancer Study; VTE = venous thromboembolism.
(b) Percentages may not total 100% because of rounding.
(c) Southeast Minnesota includes 7 counties in the southeast corner of Minnesota: Dodge, Goodhue, Wabasha, Winona, Houston, Fillmore, and Mower. Olmsted County, Minnesota, is a mutually exclusive category.
Use Case: Type 2 Diabetes
• Find all item sets I of co-morbid conditions such that the distribution of risk R differs significantly between patients who have I and patients who do not
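One way to picture this criterion is to split patients by whether their record contains the item set and compare the risk scores of the two groups. The sketch below is only an illustration under stated assumptions: the slides do not say which statistical test is used, so Welch's t statistic stands in for it; the patient records, item codes ("A", "B", "C"), and risk values are made up; and `welch_t` and `risk_split` are hypothetical helper names.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def risk_split(itemset, patients):
    """Split risk scores by whether a patient's codes contain the item set."""
    items = frozenset(itemset)
    with_i = [risk for codes, risk in patients if items <= codes]
    without_i = [risk for codes, risk in patients if not items <= codes]
    return with_i, without_i


# Made-up patients: (set of condition codes, continuous risk score).
patients = [
    (frozenset({"A", "B"}), 0.9),
    (frozenset({"A", "B", "C"}), 0.8),
    (frozenset({"A", "B"}), 0.7),
    (frozenset({"A"}), 0.3),
    (frozenset({"C"}), 0.2),
    (frozenset({"B", "C"}), 0.4),
]

with_ab, without_ab = risk_split({"A", "B"}, patients)
t_stat = welch_t(with_ab, without_ab)
print(f"mean risk with I: {mean(with_ab):.2f}, without I: {mean(without_ab):.2f}, t = {t_stat:.2f}")
```

A full miner would run such a comparison for every frequent item set and keep those whose difference passes a significance threshold, with some correction for the number of item sets tested.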
Items and Frequencies (based on AHRQ CCS)
Items | Times | Diagnosis meaning
V48 | 10 | Diabetes mellitus without complication
V82080 | 10 | Hemoglobin, A1c
V86 | 8 | Hypertension with complications and secondary hypertension
V56 | 6 | Deficiency and other anemia
V217 | 4 | Other fractures
V52 | 3 | Gout and other crystal arthropathies
V245 | 3 | Residual codes; unclassified
V246 | 3 | Adjustment disorders
V221 | 3 | Open wounds of head; neck; and trunk
V73 | 3 | Retinal detachments; defects; vascular occlusion; and retinopathy
V244 | 2 | Other screening for suspected conditions (not mental disorders or infectious diseases)
V216 | 2 | Fracture of lower limb
V143 | 1 | Chronic renal failure
V142 | 1 | Acute and unspecified renal failure
Rule Ranking – Top 5
Rank | Support | SupportD | Precision | Item Set
1 | 281 | 270 | 0.961 | V48 V86 V142 V245 V82080
2 | 280 | 269 | 0.96 | V48 V57 V74 V86 V245 V82080
3 | 274 | 263 | 0.95 | V48 V52 V57 V74 V244 V246 V82080
4 | 278 | 263 | 0.94 | V48 V52 V57 V87 V82080
5 | 278 | 263 | 0.94 | V48 V57 V86 V216 V221 V82080
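The ranking appears to follow the ratio SupportD / Support. The slides do not define SupportD; the sketch below assumes it is the number of patients matching the item set who have diabetes, so that the ratio is the rule's precision. Under that assumption the ratios come out close to the Precision column (the slide's rounding is not specified), and sorting by them reproduces the rank order.

```python
# (rank, Support, SupportD) copied from the rule ranking table.
rows = [
    (1, 281, 270),
    (2, 280, 269),
    (3, 274, 263),
    (4, 278, 263),
    (5, 278, 263),
]


def precision(support, support_d):
    """Assumed definition: share of matching patients who have the phenotype."""
    return support_d / support


# Python's sort is stable, so tied rules keep their original order.
ranked = sorted(rows, key=lambda r: precision(r[1], r[2]), reverse=True)
for rank, sup, sup_d in ranked:
    print(rank, f"{precision(sup, sup_d):.4f}")
```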
Confusion Matrix
Model | Cutoff | Actual N, predicted False | Actual N, predicted True | Actual Y, predicted False | Actual Y, predicted True
ARM | 0.93 | 801 | 6 | 429 | 54
ARM | 0.92 | 752 | 55 | 51 | 432
ARM | 0.95 | 736 | 71 | 17 | 466
D-Tree | 0.88 | 766 | 42 | 393 | 393
D-Tree | 0.75 | 754 | 54 | 53 | 429
D-Tree | 0.70 | 747 | 60 | 36 | 447
LR | 0.95 | 772 | 35 | 149 | 335
LR | 0.7 | 755 | 48 | 98 | 384
LR | 0.6 | 752 | 52 | 88 | 395
SVM | 0.7 | 768 | 40 | 104 | 378
SVM | 0.6 | 758 | 50 | 68 | 414
SVM | 0.55 | 751 | 57 | 55 | 424
Measure Metrics for All Models
Model | Cutoff | Precision | Recall | F-score
ARM | 0.95 | 0.868 | 0.966 | 0.914
ARM | 0.92 | 0.887 | 0.894 | 0.895
ARM | 0.93 | 0.9 | 0.112 | 0.199
D-Tree | 0.88 | 0.903 | 0.812 | 0.855
D-Tree | 0.75 | 0.888 | 0.889 | 0.889
D-Tree | 0.70 | 0.881 | 0.925 | 0.902
LR | 0.95 | 0.904 | 0.693 | 0.785
LR | 0.7 | 0.889 | 0.796 | 0.840
LR | 0.6 | 0.883 | 0.819 | 0.849
SVM | 0.7 | 0.901 | 0.784 | 0.839
SVM | 0.6 | 0.893 | 0.858 | 0.875
SVM | 0.55 | 0.881 | 0.878 | 0.879
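These metrics follow from the confusion-matrix counts by the standard definitions of precision, recall, and F-score. The sketch below applies them to the ARM counts at cutoff 0.95 (predicted True: 71 actual N and 466 actual Y; predicted False among actual Y: 17); the function name `prf` is an assumption.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-score (harmonic mean) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# ARM at cutoff 0.95, taken from the confusion matrix slide.
p, r, f = prf(tp=466, fp=71, fn=17)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

The computed values land within rounding distance of the reported row for ARM at cutoff 0.95.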
ROC for ARM and SVM
Discussion
• The space of all association rules is exponential, O(2^m), where m is the number of items in I.
• Mining stays tractable by exploiting the sparseness of the data and by using high minimum support and minimum confidence thresholds.
• Even so, it typically produces a huge number of rules: thousands, tens of thousands, or even millions.
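To give a feel for the growth: m items yield 2^m − 1 non-empty itemsets, and the number of distinct rules X → Y with X and Y non-empty and disjoint is 3^m − 2^(m+1) + 1, a standard counting result. The short sketch below checks that closed form against a direct count; the function names are illustrative.

```python
from math import comb


def count_rules(m):
    """Count rules X -> Y over m items with X, Y non-empty and disjoint.

    Each k-itemset (k >= 2) can be split into antecedent and consequent
    in 2**k - 2 ways (every non-empty proper subset as the antecedent).
    """
    return sum(comb(m, k) * (2 ** k - 2) for k in range(2, m + 1))


def closed_form(m):
    return 3 ** m - 2 ** (m + 1) + 1


for m in (3, 6, 10, 20):
    print(m, count_rules(m), closed_form(m))
```

Even 20 items allow more than three billion possible rules, which is why the support and confidence thresholds carry so much of the load.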
Discussion
• A machine learning framework for semi-automatic phenotype extraction from EHRs
• Initial results on DM classification with ARM seem encouraging: the approach appears scalable, robust, and efficient
• Item Sets and Association Rules are human interpretable
• Next steps will explore more complex phenotypes, and incorporate additional items (e.g., medications, procedures)
Acknowledgment
• Material adapted from Agrawal and Liu
• Mayo Clinic SHARP project on Secondary Use of EHR data (90TR002)
• Mayo Clinic eMERGE project (HG006379)
• Mayo Clinic Career Development Award (FP00058504)