Using Association Rule Mining for Phenotype Extraction from Electronic Health Records

Dingcheng Li, PhD1

Gyorgy Simon, PhD2

Christopher G. Chute, MD, DrPH1

Jyotishman Pathak, PhD1

1 Mayo Clinic, Rochester; 2 University of Minnesota, Twin Cities

2013 AMIA Clinical Research Informatics Summit

High-Throughput Phenotyping from EHRs

Outline

• Clinical phenotyping from electronic health records (EHRs)
• Machine learning techniques for phenotyping
• Association Rule Mining and T2DM
• Results
• Discussion



EHR-driven Phenotyping: The Process

[Figure: process diagram with the components Data, Transform (two steps), Phenotype Algorithm, Visualization, and Evaluation, annotated with NLP, SQL, Rules, and Mappings] [eMERGE Network]



Example: Hypothyroidism Algorithm

[Figure: hypothyroidism phenotyping algorithm drawing on Drugs, Labs, Diagnosis, NLP, and Procedures] [Conway et al. AMIA 2011: 274-83]


http://gwas.org [eMERGE Network]


Genotype-Phenotype Association Results

[Figure: odds ratios (axis from 0.5 to 5.0), observed vs. published, by disease and gene / region marker] [Ritchie et al. AJHG 2010; 86(4):560-72]

Diseases: Atrial fibrillation, Crohn's disease, Multiple sclerosis, Rheumatoid arthritis, Type 2 diabetes

Markers (gene / region): rs2200733 (Chr. 4q25), rs10033464 (Chr. 4q25), rs11805303 (IL23R), rs17234657 (Chr. 5), rs1000113 (Chr. 5), rs17221417 (NOD2), rs2542151 (PTPN22), rs3135388 (DRB1*1501), rs2104286 (IL2RA), rs6897932 (IL7RA), rs6457617 (Chr. 6), rs6679677 (RSBN1), rs2476601 (PTPN22), rs4506565 (TCF7L2), rs12255372 (TCF7L2), rs12243326 (TCF7L2), rs10811661 (CDKN2B), rs8050136 (FTO), rs5219 (KCNJ11), rs5215 (KCNJ11), rs4402960 (IGF2BP2)



EHR-driven Phenotyping: The Process

[Figure: process diagram with the components Data, Transform (two steps), Phenotype Algorithm, Visualization, and Evaluation, annotated with NLP, SQL, Rules, and Mappings; overlaid note: "Time consuming!"] [eMERGE Network]




Our research agenda

• Develop effective machine learning methods for automatic phenotype extraction, to reduce the manual work of developing phenotyping algorithms

• Explore effective ways to extract features from EHR data and generate highly predictive models

• Study methods for phenotype extraction from EHRs to facilitate population-based studies for clinical and translational research



Common Modeling Approaches

• Logistic Regression / Survival Analysis
  • No ability to discover interactions automatically

• Decision Trees / Random Forest / Gradient-boosted Trees
  • Greedy approach to discovering interactions

• Association Rule Mining (ARM)
  • Specifically designed to discover interactions



Association rule mining

• Proposed by Agrawal et al., VLDB 1994
• It is an important data mining model studied extensively by the database and data mining community
• Assumes all data are categorical
• No good algorithm for numeric data
• Initially used for Market Basket Analysis to find how items purchased by customers are related



The model: data

• I = {i1, i2, …, im}: a set of items.
• Transaction t: a set of items, with t ⊆ I.
• Transaction database T: a set of transactions, T = {t1, t2, …, tn}.



The model: data

• Market basket transactions:
  t1: {bread, cheese, milk}
  t2: {apple, eggs, salt, yogurt}
  … …
  tn: {biscuit, eggs, milk}

• Concepts:
  • An item: an item/article in a basket
  • I: the set of all items sold in the store
  • A transaction: items purchased in a basket; it may have a TID (transaction ID)
  • A transactional dataset: a set of transactions



The model: rules

• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.

• An association rule is an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅.

• An itemset is a set of items.
  • E.g., X = {milk, bread, cereal} is an itemset.

• A k-itemset is an itemset with k items.
  • E.g., {milk, bread, cereal} is a 3-itemset.



Rule strength measures

• Support: The rule holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
  • sup = Pr(X ∪ Y)

• Confidence: The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y.
  • conf = Pr(Y | X)

• An association rule is a pattern stating that when X occurs, Y occurs with a certain probability.
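The two measures can be sketched in a few lines of Python. This is a minimal illustration, assuming transactions are stored as sets; the function names `support` and `confidence` and the toy baskets are hypothetical and not taken from the slides.

```python
from fractions import Fraction


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(x, y, transactions):
    """Pr(Y | X): support of X ∪ Y divided by support of X."""
    x, y = frozenset(x), frozenset(y)
    sup_x = support(x, transactions)
    if sup_x == 0:
        raise ValueError("X never occurs, so confidence is undefined")
    return support(x | y, transactions) / sup_x


# Hypothetical toy baskets, for illustration only.
baskets = [
    frozenset({"bread", "cheese", "milk"}),
    frozenset({"bread", "milk"}),
    frozenset({"eggs", "milk"}),
    frozenset({"bread", "eggs"}),
]

sup = support({"bread", "milk"}, baskets)        # 2 of the 4 baskets
conf = confidence({"bread"}, {"milk"}, baskets)  # 2 of the 3 bread baskets
print(sup, conf)
```

Exact fractions are used here so that thresholds such as minsup can be compared without floating-point rounding.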



An example

• Transaction data:
  t1: Beef, Chicken, Milk
  t2: Beef, Cheese
  t3: Cheese, Boots
  t4: Beef, Chicken, Cheese
  t5: Beef, Chicken, Clothes, Cheese, Milk
  t6: Chicken, Clothes, Milk
  t7: Chicken, Milk, Clothes

• Assume: minsup = 30%, minconf = 80%

• An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]

• Association rules from the itemset:
  Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
  … …
  Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
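The example can be checked mechanically. The sketch below, with hypothetical helper names `count` and `rules_from`, enumerates every split of the frequent itemset {Chicken, Clothes, Milk} into a left-hand and right-hand side and keeps the rules that meet minsup = 30% and minconf = 80% on the seven transactions from the slide.

```python
from fractions import Fraction
from itertools import combinations

# The seven transactions t1..t7 from the slide.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]
MINSUP, MINCONF = Fraction(30, 100), Fraction(80, 100)


def count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from(itemset):
    """All rules lhs -> (itemset - lhs) that meet MINSUP and MINCONF."""
    itemset = frozenset(itemset)
    sup = Fraction(count(itemset), len(transactions))
    if sup < MINSUP:
        return []
    found = []
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = frozenset(lhs)
            conf = Fraction(count(itemset), count(lhs))
            if conf >= MINCONF:
                found.append((lhs, itemset - lhs, sup, conf))
    return found


rules = rules_from({"Chicken", "Clothes", "Milk"})
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"sup={sup}", f"conf={conf}")
```

Rules whose left-hand side is Chicken or Milk alone fall below minconf, because those items also appear in transactions without Clothes.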



Distributional Association Rule Mining

Distributional Association Rules associate an itemset with a continuous outcome.


[Simon et al. KDD 2011; 823-831]


Based on the Apriori Algorithm (Agrawal, VLDB 1994)

Algorithm Apriori(T)
    C1 ← init-pass(T);
    F1 ← {f | f ∈ C1, f.count/n ≥ minsup};    // n: no. of transactions in T
    for (k = 2; Fk-1 ≠ ∅; k++) do
        Ck ← candidate-gen(Fk-1);
        for each transaction t ∈ T do
            for each candidate c ∈ Ck do
                if c is contained in t then
                    c.count++;
            end
        end
        Fk ← {c ∈ Ck | c.count/n ≥ minsup}
    end
    return F ← ∪k Fk;
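The pseudocode above can be turned into a short runnable sketch. This Python version is an illustration under simplifying assumptions, not the authors' implementation: `candidate_gen` is a hypothetical name for a plain join-and-prune step, and the data are the seven transactions from the earlier example slide.

```python
from itertools import combinations


def candidate_gen(prev_frequent, k):
    """Join (k-1)-itemsets into k-itemsets; prune any with an infrequent subset."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev_frequent for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions, minsup):
    """Return {itemset: count} for every itemset with count/n >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # init-pass: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        counts = {c: 0 for c in candidate_gen(set(frequent), k)}
        for t in transactions:
            for c in counts:
                if c <= t:
                    counts[c] += 1
        frequent = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent


transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]
result = apriori(transactions, minsup=0.30)
for itemset, cnt in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), cnt)
```

The pruning step relies on the downward-closure property: an itemset can only be frequent if all of its subsets are frequent, which is what keeps the candidate sets small on sparse data.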


Use Case: Type 2 Diabetes


[Excerpt from the Mayo Genome Consortia (MayoGC) article in Mayo Clinic Proceedings, p. 607]

… genetic research within EMR systems [1,2]. Successful use of this approach in the eMERGE Network has inspired the creation of the intramural Mayo Genome Consortia (MayoGC). The goal of the MayoGC is to assemble a large cohort of participants from research studies across Mayo Clinic with high-throughput genetic data and to use EMR for phenotype extraction for cost-effective genetic research. Herein, we describe the design of the MayoGC, including the current participating cohorts, expansion efforts, data processing, and study management and organization. As a test of the genetic research capability of the MayoGC, we conducted a GWA study to identify genetic variants associated with total bilirubin levels. Bilirubin levels have a large variability in the population, with heritability of roughly 0.50 [3]. Two previous GWA studies identified variants from similar genomic locations with strong and moderate effects on bilirubin levels [4,5], making this phenotype an ideal candidate for testing. The MayoGC provides a model of a unique collaborative effort in the environment of a common EMR for the investigation of genetic determinants of diseases.

PARTICIPANTS AND METHODS

MayoGC is a large cohort of Mayo Clinic patients with EMR and genotype data. Eligible participants include those who gave general research (ie, not disease-specific) consent in the contributing studies to share high-throughput genotyping data with other investigators. This cohort is being built in 2 phases. Phase 1, which has been completed, includes participants from 3 studies funded by the National Institutes of Health, which sought to identify genetic determinants of peripheral arterial disease (PAD), venous thromboembolism, and pancreatic cancer, respectively, with a combined total sample size of 6307 unique participants (Table 1). The eMERGE study contributed genotype data for 3336 participants with PAD and control participants recruited from Mayo Clinic's noninvasive vascular and exercise stress testing laboratories, respectively [2]. Peripheral arterial disease was defined by documentation of at least 1 of the following: (1) an ankle-brachial index (ABI) of 0.9 or less at rest or 1 minute after exercise, (2) the presence of poorly compressible arteries, or (3) a normal ABI but history of revascularization for PAD. Control participants had a normal ABI and no history of PAD [2].

The GENEVA (Gene Environment Association Studies) Study of Venous Thromboembolism of the National Human Genome Research Institute enrolled consecutive Mayo Clinic outpatients with objectively diagnosed deep venous thrombosis and/or pulmonary embolism who resided in the upper Midwest and had been referred by a Mayo Clinic physician to the Mayo Clinic Special Coagulation Laboratory or to the Mayo Clinic Thrombophilia Center [6]. A deep venous thrombosis or pulmonary embolism was categorized as objectively diagnosed (1) when it was confirmed by venography or pulmonary angiography or via a pathology examination of a thrombus removed at surgery or (2) if findings on at least 1 noninvasive test (compression duplex ultrasonography, lung scan, computed tomography, magnetic resonance imaging) were positive. Persons with venous thromboembolism related to active cancer were excluded. A control group was prospectively recruited for this study. Control participants were frequency-matched …

TABLE 1. MayoGC Phase 1 Studies (a, b)

                                eMERGE Network (PAD) [2]              GENEVA (VTE) [6]                      PANC [7,8]
Characteristics                 Cases (n=1612)     Controls (n=1585)  Cases (n=1233)     Controls (n=1264)  Controls (n=613)
Age (y), mean ± SD              66.0±10.7          61.0±7.4           55.0±16.2          56.0±15.8          66.0±10.0
Female (%)                      36                 40                 50                 52                 45
Medical record length (y)
  Mean ± SD                     23.4±20.0          26.1±20.3          13.7±16.3          21.1±15.4          30.2±16.5
  Median (range)                18.7 (1.0-78.6)    23.0 (1.0-79.2)    6.3 (1.0-71.8)     17.8 (1.0-70.2)    29.8 (1.0-75.0)
White (%)                       94                 94                 96                 99                 100
Geographic location, No. (%) (c)
  Olmsted County                328 (20)           590 (37)           7 (1)              10 (1)             64 (10)
  Southeast Minnesota           191 (12)           62 (4)             205 (17)           378 (30)           107 (17)
  Greater Minnesota             393 (24)           343 (22)           314 (25)           317 (25)           135 (22)
  Iowa                          212 (13)           97 (6)             176 (14)           191 (15)           65 (11)
  South and North Dakota        50 (3)             31 (2)             79 (6)             71 (6)             19 (3)
  Wisconsin                     128 (8)            68 (4)             121 (10)           138 (11)           32 (5)
  Other states or international 309 (19)           394 (25)           330 (27)           159 (13)           191 (31)

(a) eMERGE = Electronic Medical Records and Genomics; GENEVA = Gene Environment Association Studies; MayoGC = Mayo Genome Consortia; PAD = peripheral arterial disease; PANC = Mayo Clinic Molecular Epidemiology of Pancreatic Cancer Study; VTE = venous thromboembolism.
(b) Percentages may not total 100% because of rounding.
(c) Southeast Minnesota includes 7 counties in the southeast corner of Minnesota: Dodge, Goodhue, Wabasha, Winona, Houston, Fillmore, and Mower. Olmsted County, Minnesota, is a mutually exclusive category.


Use Case: Type 2 Diabetes

• Find all item sets I of co-morbid conditions such that the distribution of risk R differs significantly between the patients who have I and the patients who do not
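The comparison described above can be sketched as follows. This is only an illustration: the slides do not say which statistical test is used, so Welch's t statistic stands in as an assumption, the helper names `split_by_itemset` and `welch_t` are hypothetical, and the patient records and risk values are invented toy numbers.

```python
from math import sqrt
from statistics import mean, variance


def split_by_itemset(patients, itemset):
    """Split risk values into patients who have every item in `itemset` and those who do not."""
    itemset = frozenset(itemset)
    inside = [risk for items, risk in patients if itemset <= items]
    outside = [risk for items, risk in patients if not itemset <= items]
    return inside, outside


def welch_t(a, b):
    """Welch's t statistic for two samples (assumed stand-in for the significance test)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Invented toy records: (set of condition codes, risk score). Not real data.
patients = [
    ({"V48", "V86"}, 0.9),
    ({"V48", "V86", "V56"}, 0.8),
    ({"V48", "V86"}, 0.7),
    ({"V56"}, 0.2),
    ({"V217"}, 0.3),
    ({"V48"}, 0.4),
]

inside, outside = split_by_itemset(patients, {"V48", "V86"})
t_stat = welch_t(inside, outside)
print(len(inside), len(outside), round(t_stat, 2))
```

In a full miner this comparison would be run for each candidate item set generated by the Apriori-style search, keeping those whose statistic clears a chosen significance threshold.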


Items and Frequencies (based on AHRQ CCS)

Item    Times  Diagnosis meaning
V48     10     Diabetes mellitus without complication
V82080  10     Hemoglobin, A1c
V86     8      Hypertension with complications and secondary hypertension
V56     6      Deficiency and other anemia
V217    4      Other fractures
V52     3      Gout and other crystal arthropathies
V245    3      Residual codes; unclassified
V246    3      Adjustment disorders
V221    3      Open wounds of head; neck; and trunk
V73     3      Retinal detachments; defects; vascular occlusion; and retinopathy
V244    2      Other screening for suspected conditions (not mental disorders or infectious diseases)
V216    2      Fracture of lower limb
V143    1      Chronic renal failure
V142    1      Acute and unspecified renal failure


Rule Ranking – Top 5

Rank Support SupportD Precision Item Set

1 281 270 0.961 V48 V86 V142 V245 V82080

2 280 269 0.96 V48 V57 V74 V86 V245 V82080

3 274 263 0.95 V48 V52 V57 V74 V244 V246 V82080

4 278 263 0.94 V48 V52 V57 V87 V82080

5 278 263 0.94 V48 V57 V86 V216 V221 V82080
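The slide does not define the columns. One reading, which is an assumption, is that Support counts patients matching the item set, SupportD counts those among them with diabetes, and Precision is their ratio. The short check below compares that ratio with the listed values; the function name `rule_precision` is hypothetical.

```python
# (rank, Support, SupportD, listed Precision) copied from the table above.
rows = [
    (1, 281, 270, 0.961),
    (2, 280, 269, 0.96),
    (3, 274, 263, 0.95),
    (4, 278, 263, 0.94),
    (5, 278, 263, 0.94),
]


def rule_precision(support, support_d):
    """Assumed definition: share of patients matching the rule who have the disease."""
    return support_d / support


for rank, sup, sup_d, listed in rows:
    ratio = rule_precision(sup, sup_d)
    print(rank, f"SupportD/Support = {ratio:.4f}", f"listed = {listed}")
```

Under this reading the computed ratios land within 0.01 of every listed value, with the small gaps plausibly due to rounding or truncation on the slide.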



Confusion Matrix

Model   Cutoff  Actual class  Predicted False  Predicted True
ARM     0.93    N             801              6
                Y             429              54
        0.92    N             752              55
                Y             51               432
        0.95    N             736              71
                Y             17               466
D-Tree  0.88    N             766              42
                Y             393              393
        0.75    N             754              54
                Y             53               429
        0.70    N             747              60
                Y             36               447
LR      0.95    N             772              35
                Y             149              335
        0.7     N             755              48
                Y             98               384
        0.6     N             752              52
                Y             88               395
SVM     0.7     N             768              40
                Y             104              378
        0.6     N             758              50
                Y             68               414
        0.55    N             751              57
                Y             55               424



Measure Metrics for All Models

Model   Cutoff  Precision  Recall  F-score
ARM     0.95    0.868      0.966   0.914
        0.92    0.887      0.894   0.895
        0.93    0.9        0.112   0.199
D-Tree  0.88    0.903      0.812   0.855
        0.75    0.888      0.889   0.889
        0.70    0.881      0.925   0.902
LR      0.95    0.904      0.693   0.785
        0.7     0.889      0.796   0.840
        0.6     0.883      0.819   0.849
SVM     0.7     0.901      0.784   0.839
        0.6     0.893      0.858   0.875
        0.55    0.881      0.878   0.879
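These metrics follow from the confusion matrix counts by the standard definitions. As a worked check, the sketch below uses the ARM counts at cutoff 0.93, read from the confusion matrix as 6 false positives, 429 false negatives, and 54 true positives; the function name `metrics` is illustrative.

```python
def metrics(tp, fp, fn):
    """Precision, recall, and F-score from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# ARM at cutoff 0.93: actual N -> (801 predicted False, 6 predicted True),
#                     actual Y -> (429 predicted False, 54 predicted True).
p, r, f = metrics(tp=54, fp=6, fn=429)
print(round(p, 3), round(r, 3), round(f, 3))
```

The rounded values reproduce the ARM row at cutoff 0.93 in the table: high precision but very low recall, because most actual positives fall below the cutoff.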



ROC for ARM and SVM



Discussion

• Clearly the space of all association rules is exponential, O(2^m), where m is the number of items in I.

• The mining exploits sparseness of data, and high minimum support and high minimum confidence values.

• Still, it always produces a huge number of rules, thousands, tens of thousands, millions, ...
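To make the growth concrete: m items have 2^m subsets, and the number of rules X → Y with X and Y non-empty and disjoint is 3^m − 2^(m+1) + 1. The sketch below checks this closed form against brute-force enumeration for small m; the function names are illustrative.

```python
from itertools import product


def count_rules_brute_force(m):
    """Count rules X -> Y over m items by assigning each item to X, Y, or neither."""
    total = 0
    for assignment in product(("X", "Y", "neither"), repeat=m):
        # A valid rule needs a non-empty X and a non-empty Y.
        if "X" in assignment and "Y" in assignment:
            total += 1
    return total


def count_rules_closed_form(m):
    """3^m assignments, minus those with empty X or empty Y (inclusion-exclusion)."""
    return 3**m - 2 ** (m + 1) + 1


for m in range(1, 9):
    print(m, count_rules_brute_force(m), count_rules_closed_form(m))
```

Even before any data are examined, a few dozen items give far more candidate rules than can be enumerated, which is why the support and confidence thresholds matter.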



Discussion

•  A machine learning framework for semi-automatic phenotype extraction from EHRs

• Initial results on DM classification with ARM seem encouraging: scalable, robust, and efficient

•  Item Sets and Association Rules are human interpretable

• Next steps will explore more complex phenotypes, and incorporate additional items (e.g., medications, procedures)



Acknowledgment

• Material adapted from Agrawal and Liu

• Mayo Clinic SHARP project on Secondary Use of EHR data (90TR002)

• Mayo Clinic eMERGE project (HG006379)

• Mayo Clinic Career Development Award (FP00058504)



Thank You!

