mining association rules from microarray gene expression data

Mining Association Rules from Microarray Gene Expression Data

2

Association Rules

• Form: LHS => RHS, where LHS and RHS are disjoint itemsets.

• Frequent itemset: supportT(I) ≥ α, where supportT(I) is the number of transactions in T that contains all the items in I and α is the minimum support.

3

Objective Measurements of Rule Interestingness

• Support of the rule LHS => RHS:

Support(LHS U RHS), frequency that LHS and RHS occur together in a transaction.

• Confidence of the rule LHS => RHS:

Support (LHS U RHS)/Support (LHS), frequency that RHS occurs when LHS occurs.

4

Association Rule Discovery (ARD) algorithms

• Originally used in market basket analysis

• Unsupervised and domain independent

• Example: Apriori algorithm (Agrawal, 1993)

• Advantage: Finds all associations

• Disadvantage: Very large number of associations

5

Microarray technology• Measurement of gene expression levels in cells• Simultaneous measurement of thousands of genes• Facilitates the study of gene interactions• Expression profile=set of values for different conditions

– One condition = one slide– condition: tissue, treatment or time point

• Data matrices

6

Rules Applied to Gene Expression

• Each gene expression experiment is a single transaction and each gene is an item.– Gene value may be numerical, may need to be binned

as being up (expressed), down (repressed), or neither.

• Items in a gene expression transaction can also include relevant facts describing the cellular environment.

• Find frequent itemsets: apply Apriori algorithm.

• Generate association rules from the frequent itemsets.

7

Example - association rules

• DatabaseT1: A B C D ET2: A B C D ET3: A B C D ET4: A B C D ET5: A B C D E

• RulesR1: A B C E (sup=0.40, conf=0.67)

R2: A B D (sup=0.40, conf=1.00)

R3: A B (sup=0.60, conf=1.00)

8

Rule induction algorithms G1 G2 G3 G4 ClassS1 CancerS2 Non-Cancer. . . . . .. . . . . .. . . . . .Sn Non-Cancer

R1 : G1 G3 CancerR2 : G1 G4 Non-CancerR3: G1 G3 G2 Cancer

G1 is Gene 1, S1 is sample 1, means overexpressed , and means underexpressed.

Rule set

Association rule discovery algorithm

*

9

Forming Association rules• Any frequent itemset of size greater than

one can be divided into two itemsets, LHS and RHS.

• Using objective measures: If the confidence of a candidate rule exceeds a speficied minimum confidence criterion, the rule may be included.

• A very large set of rules is usually generated.

10

Example

• 28 treatments were recorded on 28 microarray chips– 28 transactions.

• Each chip contains expression levels of approximately 6200 genes (items).

• After Apriori algorithm, 70,000,000 rules were generated.

11

Subjective Measures for Selecting Rules

• Select genes before finding frequent items – limit the number of items.

• Limit the size of LHS or RHS

E.g., LHS contains only one item.

• Domain specific measures -- Use knowledge about the domain under study to specify patterns.

12

Rule Filtering and Group (Tuzhilin and Adomavicius 2002)

• Rule templates – specify restrictions on the combinations of genes and their expression levels that can appear in the body and head of the rule.

RulePart HAS Quantifier OF C1, C2, ..., CN [ONLY]– RulePart: BODY, HEAD, or RULE.

– C1, C2, ..., CN is a comparison set, representing a list of genes against which the discovered rules will be compared:

• A gene, e.g., G17

• A gene with a particular expression level, e.g., G17↑

• A group (category) of genes, e.g., [DNA_Repair]

• A group of genes with an expression level, e.g., [DNA_Repair]

• A group of genes with a list of allowable or unallowable expression levels, e.g., [DNA_Repair] = {↑, #}

13

Rule Filtering con’t• Quantifier: a keyword or an expression

specifying how many genes specified by C1, C2, ..., CN List have to be contained in RulePart.

– ALL, ANY, NONE, specifying the number of genes from C1, C2, …, CN the RulePart must have

– A numeric value; e.g., 2, specifying a rule must have exactly 2 genes from the comparison set

– A range of numeric values; e.g., 1,3,5-7

• ONLY is used to indicate RulePart can have only the genes in the C1, C2, …, CN list.

14

Rule Filtering con’t

• All rules that contain at least one of the following genes: G1, G5, G7:

– RULE HAS (ANY) of G1, G5, G7,

– Matching rules: G1↑ => G3↑.

• "When genes involved in the DNA repair are upregulated, what other gene categories are also up- or downregulated?"– BODY HAS (ANY) OF [DNA_Repair] AND HEAD HAS

(ANY) OF [All_Genes]={, }

15

Macro Templates (Tuzhilin and Adomavicius 2002)

• Detect unexpected rules :

CONTRADICT (GeneExprSet, G, ExpLevel)=

BODY HAS (ALL) of GeneExprSet AND

HEAD HAS (ALL) OF G ≠ {ExpLevel}

– CONTRADICT({G1, G2},G4,}– Unexpected rule: G1G2G3 G4

16

Rule Grouping• Group similar rules together into classes to

be analyzed

• Gene hierarchy: group genes based on their functions.

ALL

F1 F2

G1 G2 G3 G4 G5

17

Rule Grouping con’t

• Aggregated rules– Groups: F1={G1,G2,G3}, F2={G4,G5}– Rules: R1={G1G4}, R2={G1G5},

R3={G1G3G5}– Aggregated rule: R=F1F2, R'=F1F2– {R1,R2,R3}R, {R1,R2}R', R3R'

18

Rule Derivation Procedure

Feature selection

Rule induction

Rule selection

Rule validation

Database of transactions

Reduced database

Initial rule set

Relevant rule set

Rules representing knowledge

Some features may be irrelevant or a reduction may be required for efficiency reasons

A rule induction algorithm is applied to the database (e.g. association rule algorithm or decision tree algorithm)

Rules are assessed and ranked using different measures of interestingness

The rules need to be validated to be accepted as knowledge. Can be done by more detailed biological experiments in the context of gene expression data

Mining Association Rules from Clinical Data

20

Association Rules in Medical Data

• Medical record data– Millions of “claims” for medical procedures– Each patient may have several claims– Each claim may have several line items or records,

one for each procedure performed.– A diagnosis was reported with each procedure.

• Data: patient code, procedure code, diagnosis code

• Goal: discover relationships between procedure performed on a patient and the reported diagnosis.

21

Con’t

• Data items: the set of all procedure and diagnosis codes (7,365 + 9,383=16,748)

• Transaction: the set of procedure and diagnosis codes for each patient (1,257,645 patients)– May cause unexpected rule: {cast} => {heart disease}

• Each itemset consists of one or more procedure/diagnosis codes.

• The support of an itemset I is the number of patients whose set of items include all the items in I (> 1%).

22

Formulating Rules

• Applying Aprior algorithm to generate all frequent itemsets.

• Restricting to those frequent itemsets which contain both procedures and diagnoses

• For each selected frequent itemset, one rule is formulated with all procedure codes on the left and all diagnosis codes on the right.

• Computing confidence and eliminating rules with confidence less than 65%.

mining association rules from microarray gene expression data

Documents