
Peano Count Trees and Association Rule Mining for Gene Expression Profiling using DNA Microarray Data

Dr. William Perrizo, Willy Valdivia, Dr. Edward Deckard, Francis Larson; North Dakota State University

{william.perrizo, willy.valdivia, edward.deckard, francis.larson}@ndsu.nodak.edu

Patents pending on bSQ and Ptree technology

The Problem

•There is a lot of data available today (e.g., gene expression data), but too little information.

•Data Mining attempts to reduce raw data to information for decision support.

[Figure: raw data (gigs, teras, petas, exas…) is reduced by data mining to decisions (often 1 bit, 0/1: Y/N, T/F, Do/Don’t_do).]

•Data mining: Classification (supervised learning), Clustering (unsupervised learning), Association Rule Mining (ARM)

•Statistics, Machine Learning, Data Structuring, Signal Processing

A Solution?

•Currently the predominant method employed in bioinformatics is clustering (with a little classification) on isolated microarray datasets.

•Needed: a data mining software suite able to:

– transform copies of pertinent data from a variety of databases into a data-mining-ready form in real time (our solution, based on P-trees?). “Transform copies” rather than “standardize”, since standardization rarely works! There will always be an MS (and I don’t mean Martha Stewart) to frustrate/destroy the standardization effort.

– facilitate Association Rule Mining, Clustering, and Classification in a uniform way (so data mining results from other areas can be used).

Bioinformatics: a Walmart or a Kmart?!?

•Walmart took DM seriously (an early, comprehensive approach, borrowing useful techniques from a variety of application areas).

•Kmart? Too little, too late.

Can data mining techniques developed for other application areas be used in bioinformatics?

[Figure: a TIFF image and a Yield Map]

Remotely Sensed Images (RSI) can be viewed as collections of pixels. Each pixel has a value for each feature attribute.

For example, the RSI dataset above has 1320 rows and 1320 columns of pixels (1,742,400 pixels) and 4 feature attributes (Red,Green,Blue,Yield). The (R,G,B) feature bands are in the TIFF image and the Y feature is color coded in the Yield Map.

Microarray or DNA chip data is not much different (multiple attributes corresponding to treatments or conditions). Much data mining (ARM) has been done on RSI data.

Can it be useful in bioinformatics?

Regulation Pathway Discovery is not very different from Market Basket Research (à la Walmart)

The results of clustering microarray data may indicate that genes 1 to 9 are involved in a regulation pathway.

High-confidence rule mining on that cluster can discover the relationships among those genes (e.g., the expression of one gene, Gene2, might be discovered to be regulated by Genes 1, 3, 5, 6, 8 and 9, while Gene4 and Gene7 may not be directly regulating Gene2 and can therefore be excluded).

[Figure: clustering groups Gene1 to Gene9 into one cluster; ARM on that cluster then links Gene2 to Gene1, Gene3, Gene5, Gene6, Gene8 and Gene9, leaving Gene4 and Gene7 aside.]

ARM for Microarray Data

•A gene regulatory pathway component can be represented as an association rule, {G1, …, Gn} → Gm, where {G1, …, Gn} is the antecedent and Gm is the consequent.

•Microarray data is most often represented as a relation G(Gid, T1, …, Tn), where Gid is the gene identifier, T1, …, Tn are the treatments (or conditions), and the data values represent gene expression levels. Call this the “Gene Table”.

•Currently, data mining techniques concentrate on the Gene Table, specifically on finding clusters of genes that exhibit similar expression patterns under selected treatments (clustering the Gene Table).

Gene Table (cells hold gene expression values):

Gene-ID \ Trmt-ID    T1    T2    T3    T4
G1                   ….    ….    ….    ….
G2                   ….    ….    ….    ….
G3                   ….    ….    ….    ….
G4                   ….    ….    ….    ….

ARM for Microarray Data (Contd.)

•An alternate data format exists, called the “Treatment Table”: T(Tid, G1, G2, …, Gn), where Tid is the treatment identifier and G1, …, Gn are the gene identifiers.

•The Treatment Table provides a convenient form for ARM of gene expression levels.

•The goal is to mine for rules among genes by associating Treatment Table columns.

Treatment Table (cells hold gene expression values):

Trtmt-ID \ Gene-ID   G1    G2    G3    G4
T1                   ….    ….    ….    ….
T2                   ….    ….    ….    ….
T3                   ….    ….    ….    ….
T4                   ….    ….    ….    ….

The form of the Treatment Table with binary values (coding only whether an expression level exceeds or does not exceed a threshold) is identical to Market Basket Data, for which a wealth of rule mining techniques has been developed in the last 8 years.

Treatment Table

      G1    G2    G3    G4
T1    …     …     …     …
T2    …     …     …     …
T3    …     …     …     …
T4    …     …     …     …
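The market basket analogy can be made concrete with a small sketch. Everything below is illustrative: the expression values, the threshold of 1.0, and the helper names `to_baskets`, `support` and `confidence` are assumptions, not part of the original slides. Each treatment becomes a "basket" holding the genes whose expression exceeds the threshold, and a rule {G1, …, Gn} → Gm is scored by the usual support and confidence measures.

```python
# Hypothetical expression levels (not from the slides): one row per treatment,
# one value per gene, in the column order given by GENES.
GENES = ["G1", "G2", "G3", "G4"]
TREATMENT_TABLE = {
    "T1": [2.1, 1.8, 0.3, 1.9],
    "T2": [1.7, 1.6, 0.2, 0.4],
    "T3": [0.5, 0.4, 1.9, 0.3],
    "T4": [1.9, 2.2, 0.1, 1.5],
}
THRESHOLD = 1.0  # assumed cut-off for "expressed"


def to_baskets(table, genes, threshold):
    """Binarize each treatment row into the set of genes above the threshold."""
    return {
        tid: frozenset(g for g, level in zip(genes, levels) if level > threshold)
        for tid, levels in table.items()
    }


def support(baskets, itemset):
    """Fraction of treatments whose basket contains every gene in itemset."""
    wanted = frozenset(itemset)
    return sum(1 for basket in baskets.values() if wanted <= basket) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """support(antecedent plus consequent) divided by support(antecedent)."""
    base = support(baskets, antecedent)
    if base == 0:
        return 0.0
    return support(baskets, set(antecedent) | {consequent}) / base


baskets = to_baskets(TREATMENT_TABLE, GENES, THRESHOLD)
print(sorted(baskets["T1"]))              # genes "in the basket" for treatment T1
print(confidence(baskets, {"G1"}, "G2"))  # confidence of the rule {G1} -> G2
```

A high-confidence rule found this way is only a candidate regulatory relationship; with real data the threshold choice would need to be justified per experiment.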

The Gene Table is usually given as a standard (MS Excel) spreadsheet of gene expression levels coming from microarray experiments. It is a 2-D data cube which can be rotated (to the Treatment Table), rolled up, sliced, diced, drilled down, association rule mined, etc.

Gene Table

      T1    T2    T3    T4
G1    …     …     …     …
G2    …     …     …     …
G3    …     …     …     …
G4    …     …     …     …
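Rotating the 2-D cube from the Gene Table to the Treatment Table amounts to swapping rows and columns. The sketch below assumes a table stored as a dictionary from row identifier to a list of values; the function name `rotate` and the sample numbers are hypothetical.

```python
def rotate(table, column_ids):
    """Swap rows and columns of a table stored as {row_id: [values...]}.

    Returns the rotated table plus the old row ids, which become the
    column ids of the rotated table.
    """
    row_ids = list(table)
    rotated = {
        col: [table[row][j] for row in row_ids]
        for j, col in enumerate(column_ids)
    }
    return rotated, row_ids


# Hypothetical Gene Table: rows are genes, columns are treatments T1..T4.
gene_table = {
    "G1": [0.9, 1.4, 0.2, 2.0],
    "G2": [1.1, 1.3, 0.1, 1.8],
    "G3": [0.3, 0.2, 1.7, 0.4],
}
treatment_ids = ["T1", "T2", "T3", "T4"]

treatment_table, gene_ids = rotate(gene_table, treatment_ids)
print(treatment_table["T2"])  # expression of G1, G2, G3 under treatment T2
```

Rotating twice returns the original table, so no information is lost by choosing one orientation for mining.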

What are Peano Trees? First, what are the spatial data formats?

Example: a 2×2 image with two bands (binary values in parentheses):

BAND-1
254 (1111 1110)   127 (0111 1111)
 14 (0000 1110)   193 (1100 0001)

BAND-2
 37 (0010 0101)   240 (1111 0000)
200 (1100 1000)    19 (0001 0011)

Band SeQuential (BSQ) format (2 files):
Band 1: 254 127 14 193
Band 2: 37 240 200 19

Spatial Data Formats (Cont.)

Band Interleaved by Line (BIL) format (1 file):
254 127 37 240 14 193 200 19

Band Interleaved by Pixel (BIP) format (1 file):
254 37 127 240 14 200 193 19

bit SeQuential (bSQ) format (16 files) (related to bit planes in graphics). Each column below is one file; each row is one pixel:

Pixel   B11 B12 B13 B14 B15 B16 B17 B18   B21 B22 B23 B24 B25 B26 B27 B28
1        1   1   1   1   1   1   1   0     0   0   1   0   0   1   0   1
2        0   1   1   1   1   1   1   1     1   1   1   1   0   0   0   0
3        0   0   0   0   1   1   1   0     1   1   0   0   1   0   0   0
4        1   1   0   0   0   0   0   1     0   0   0   1   0   0   1   1

Reasons for using bSQ format

– Different bits contribute to the value differently.
– bSQ format facilitates representation of a precision hierarchy (1-bit, 2-bit, …, n-bit precision).
– bSQ format facilitates the creation of an efficient P-tree data structure and P-tree algebra.
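The four layouts can be reproduced from the two-band example with a few lines of code. This is a minimal sketch: the function names are assumptions, and the bSQ file names follow the slide's Bij convention (band i, bit j, with j = 1 the high-order bit).

```python
# The 2x2, two-band example from the slides.
BAND1 = [[254, 127], [14, 193]]
BAND2 = [[37, 240], [200, 19]]


def to_bsq(bands):
    """Band SeQuential: one flat file per band, pixels in raster order."""
    return [[value for row in band for value in row] for band in bands]


def to_bil(bands):
    """Band Interleaved by Line: line 1 of every band, then line 2, ..."""
    out = []
    for r in range(len(bands[0])):
        for band in bands:
            out.extend(band[r])
    return out


def to_bip(bands):
    """Band Interleaved by Pixel: all band values of pixel 1, then pixel 2, ..."""
    out = []
    for r in range(len(bands[0])):
        for c in range(len(bands[0][0])):
            out.extend(band[r][c] for band in bands)
    return out


def to_bit_files(bands, bits=8):
    """bit SeQuential: one bit file per bit position of each band."""
    files = {}
    for i, pixels in enumerate(to_bsq(bands), start=1):
        for j in range(1, bits + 1):
            shift = bits - j  # j = 1 selects the high-order bit
            files[f"B{i}{j}"] = [(value >> shift) & 1 for value in pixels]
    return files


bands = [BAND1, BAND2]
print(to_bil(bands))
print(to_bip(bands))
print(to_bit_files(bands)["B11"])  # high-order bit of band 1 for the four pixels
```

Keeping only the files Bi1 … Bik for some k < 8 gives a k-bit-precision view of band i, which is one way to read the "precision hierarchy" point above.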

BSQ and bSQ formats

– BSQ and bSQ are “tabular” formats:

• BSQ consists of a separate table for each band (e.g., Gene or Treatment).
• bSQ consists of a separate table for each bit of each band.

– One can view it this way:

• The data set is initially one relation or table, R(K1, …, Kk, A1, A2, …, An), where K1, …, Kk are structure attributes and each Ai is a feature attribute.

– The structure attributes of an RSI are the X and Y coordinates (one could put the same structure on the Gene Table, but I want to focus on the Treatment Table).

– The structure attributes of the Treatment Table might be a collection of treatment dimensions, based on the MIAME standard (Minimum Information About a Microarray Experiment): http://www.mged.org/Annotations-wg/index.html

» Experimental design
» Array design
» Samples
» Hybridisations
» Measurements
» Normalization Control

A Universal Format?

E.g., one large universal table with 5 dimensions based on the MIAME standard, for data mining across all treatments and genes?

– E = Experimental design and Hybridisation Procedures
– A = Array design
– S = Samples
– M = Measurements
– N = Normalization Control

"GREASMN" (5-D Universal Gene Expression Cube): the Gene-Rep table, with one row per treatment identifier Tid(E,A,S,M,N) and one column per gene. Cells hold gene expression values.

Tid(E,A,S,M,N)    G1    G2    …    Gn
(E,A,S,M,N)1      ….    ….    …    ….
(E,A,S,M,N)2      ….    ….    …    ….
. . .
(E,A,S,M,N)m      ….    ….    …    ….

Cardinality is high, but compression will be substantial (next slide).

GREASMN datacube rolled up onto (E,S)

[Figure: a 2-D grid with E (Lab…) along one axis (E1, E2, …, En) and S (Organism…) along the other (S1, S2, …, Sn; e.g., Yeast). A few blocks hold non-zero counts; the rest of the grid is zeros.]

The non-zero blocks may occur off the diagonal.

The Point: Massive but very sparse dataset!

Peano Count Tree (P-tree)

•A P-tree represents spatial bSQ data bit-by-bit in a recursive quadrant-by-quadrant arrangement.

•A P-tree is a lossless, compressed, data-mining-ready representation of the data:

– it is partially run-length compressed using the structure attributes;

– its counts are pre-computed.

An example of Peano Count tree

Given a bSQ file, Bij (shown in spatial positions below), we create its basic PC-tree, Pij, as follows.

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Root count and quadrant counts, level by level (quadrants in Peano order):

55
16   8   15   16
3 0 4 1   4 4 3 4
1110   0010   1101

Only mixed quadrants are expanded further (8 and 15 at the second level; 3, 1 and 3 at the third). Pure-1 and pure-0 quadrants are leaves.
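The construction can be sketched as a short recursive routine. This is an illustrative implementation under stated assumptions, not the authors' patented code: nodes are plain dictionaries, the names `build_pc_tree` and `level_counts` are hypothetical, and quadrants are visited in the order upper-left, upper-right, lower-left, lower-right.

```python
# The 8x8 bit file from the slide, one string per row.
GRID = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11111111",
    "11111111",
    "11111111",
    "01111111",
]


def build_pc_tree(grid, top=0, left=0, size=None):
    """Return a node {"count": n, "children": [...]} for one square quadrant.

    Pure quadrants (all 0s or all 1s) keep an empty child list, which is
    where the compression comes from.
    """
    if size is None:
        size = len(grid)
    if size == 1:
        return {"count": int(grid[top][left]), "children": []}
    half = size // 2
    # Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    offsets = ((0, 0), (0, half), (half, 0), (half, half))
    children = [build_pc_tree(grid, top + dr, left + dc, half) for dr, dc in offsets]
    count = sum(child["count"] for child in children)
    if count == 0 or count == size * size:
        return {"count": count, "children": []}  # pure quadrant: prune
    return {"count": count, "children": children}


def level_counts(root):
    """List the counts stored at each depth, left to right."""
    levels, frontier = [], [root]
    while frontier:
        levels.append([node["count"] for node in frontier])
        frontier = [child for node in frontier for child in node["children"]]
    return levels


tree = build_pc_tree(GRID)
for counts in level_counts(tree):
    print(counts)
```

The printed rows correspond to the root count, the four quadrant counts, the sub-quadrant counts of the two mixed quadrants, and the leaf bits.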

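An example of PC-tree (Cont.)

Terms: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID).

Quadrants are numbered in Peano order: 0 (upper left), 1 (upper right), 2 (lower left), 3 (lower right).

Levels run from Level-3 (the root, covering the whole 8×8 image) through Level-2 and Level-1 down to Level-0 (single pixels).

QID example: the pixel at (7, 1) = (111, 001) has QID 2.2.3, i.e., 10.10.11, obtained by interleaving the bits of the two coordinates.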
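The quadrant ID of a pixel can be computed by bit interleaving, as in the (7, 1) example. The sketch below assumes the first coordinate is the row and the second the column (0-based), which is the reading consistent with the quadrant numbering 0 = upper left, 1 = upper right, 2 = lower left, 3 = lower right; the function names are hypothetical.

```python
def qid(row, col, levels):
    """Quadrant path from the root to a pixel, one digit (0-3) per level."""
    digits = []
    for shift in range(levels - 1, -1, -1):
        row_bit = (row >> shift) & 1
        col_bit = (col >> shift) & 1
        digits.append(2 * row_bit + col_bit)  # interleave: row bit, then column bit
    return digits


def qid_as_bits(digits):
    """Render a quadrant path in the dotted binary form used on the slides."""
    return ".".join(format(d, "02b") for d in digits)


def pixel_from_qid(digits):
    """Inverse of qid(): recover (row, col) from a full-depth quadrant path."""
    row = col = 0
    for d in digits:
        row = (row << 1) | (d >> 1)
        col = (col << 1) | (d & 1)
    return row, col


path = qid(7, 1, levels=3)
print(path, qid_as_bits(path))
```

Because the path is just the interleaved coordinate bits, sorting pixels by QID yields the Peano (Z) ordering.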

Alternative forms for Ptrees (all lossless)

P1: 1 means the quadrant is pure-1, 0 otherwise (a 0 quadrant is pure-0 if it has no sub-tree pointers, otherwise mixed).

root:                0
children of root:    1 0 0 1
children of [01]:    0 0 1 0
children of [10]:    1 1 0 1
leaf bits:           [01.00] 1110   [01.11] 0010   [10.10] 1101

P0: 1 means the quadrant is pure-0, 0 otherwise.

root:                0
children of root:    0 0 0 0
children of [01]:    0 1 0 0
children of [10]:    0 0 0 0
leaf bits:           [01.00] 0001   [01.11] 1101   [10.10] 0010

PNZ (= P0’): 1 means the quadrant is Not pure-Zero, 0 otherwise. (Note: PM = PNZ XOR P1, where PM marks the mixed quadrants.)

root:                1
children of root:    1 1 1 1
children of [01]:    1 0 1 1
children of [10]:    1 1 1 1
leaf bits:           [01.00] 1110   [01.11] 0010   [10.10] 1101

Vector forms: one table entry for each mixed inode, containing its qid and its children bit-vector. This eliminates the need for subtree pointers.

P1V (as a table):

qid        vector
[ ]        1001
[01]       0010
[10]       1101
[01.00]    1110
[01.11]    0010
[10.10]    1101

P0V:

qid        vector
[ ]        0000
[01]       0100
[10]       0000
[01.00]    0001
[01.11]    1101
[10.10]    0010

PNZV:

qid        vector
[ ]        1111
[01]       1011
[10]       1111
[01.00]    1110
[01.11]    0010
[10.10]    1101

Since there is no qid = [01.01] in the P1V table, we know that quadrant is pure-0, not mixed.
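The three vector tables can be derived directly from the 8×8 bit file. The following is a sketch under assumptions: the function name `vector_tables` is hypothetical, the root qid is written "[ ]" to match the tables, and the image is assumed to be mixed at the root (true for this example).

```python
GRID = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11111111",
    "11111111",
    "11111111",
    "01111111",
]


def vector_tables(grid):
    """Build the P1V, P0V and PNZV tables: one entry per mixed quadrant."""
    p1v, p0v, pnzv = {}, {}, {}

    def ones(top, left, size):
        return sum(int(grid[r][c])
                   for r in range(top, top + size)
                   for c in range(left, left + size))

    def visit(top, left, size, path):
        half = size // 2
        p1 = p0 = nz = ""
        offsets = ((0, 0), (0, half), (half, 0), (half, half))  # Peano order
        for digit, (dr, dc) in enumerate(offsets):
            n = ones(top + dr, left + dc, half)
            p1 += "1" if n == half * half else "0"
            p0 += "1" if n == 0 else "0"
            nz += "1" if n > 0 else "0"
            if 0 < n < half * half:  # mixed child gets its own table entry
                visit(top + dr, left + dc, half, path + [format(digit, "02b")])
        key = "[" + (".".join(path) or " ") + "]"
        p1v[key], p0v[key], pnzv[key] = p1, p0, nz

    visit(0, 0, len(grid), [])  # assumes the whole image is a mixed quadrant
    return p1v, p0v, pnzv


p1v, p0v, pnzv = vector_tables(GRID)
for key in sorted(p1v, key=len):
    print(key, p1v[key], p0v[key], pnzv[key])

# Mixed children of the root, via the identity PM = PNZ XOR P1.
root_mixed = format(int(pnzv["[ ]"], 2) ^ int(p1v["[ ]"], 2), "04b")
print(root_mixed)
```

A lookup that finds a 0 bit in a P1V vector and no table entry for the corresponding child qid can conclude that the child is pure-0, which is the reasoning applied to [01.01] above.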

Basic, Value and Tuple Ptrees

•Basic Ptrees (e.g., P11, P12, …, P18, P21, …, P28, …, P71, …, P78)

•Value Ptrees, obtained by ANDing basic Ptrees and their complements (e.g., P1,001 = P11’ AND P12’ AND P13)

•Tuple Ptrees, obtained by ANDing value Ptrees (e.g., P001,010,111 = P1,001 AND P2,010 AND P3,111)
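The AND relationships can be illustrated on uncompressed bit vectors, which stand in here for the P-trees themselves (an assumption made for brevity: a real P-tree AND works on the compressed quadrant structure). The sample 3-bit band values and the function names are hypothetical.

```python
def basic_bit_vectors(values, bits):
    """Bit planes of one band: element j-1 holds bit j (j = 1 is high-order)."""
    return [[(v >> (bits - j)) & 1 for v in values] for j in range(1, bits + 1)]


def value_vector(basic, pattern):
    """AND the planes, complementing plane j wherever pattern has a '0'.

    For pattern "001" this computes P_i1' AND P_i2' AND P_i3.
    """
    result = [1] * len(basic[0])
    for plane, wanted in zip(basic, pattern):
        chosen = plane if wanted == "1" else [1 - bit for bit in plane]
        result = [r & c for r, c in zip(result, chosen)]
    return result


def tuple_vector(value_vectors):
    """AND several value vectors, one per band."""
    result = [1] * len(value_vectors[0])
    for vector in value_vectors:
        result = [r & v for r, v in zip(result, vector)]
    return result


# Hypothetical 3-bit values of three bands for four pixels.
band1, band2, band3 = [1, 5, 1, 7], [2, 2, 3, 0], [7, 7, 7, 1]

p1_001 = value_vector(basic_bit_vectors(band1, 3), "001")
p2_010 = value_vector(basic_bit_vectors(band2, 3), "010")
p3_111 = value_vector(basic_bit_vectors(band3, 3), "111")
p_tuple = tuple_vector([p1_001, p2_010, p3_111])

print(p1_001, p2_010, p3_111)
print(p_tuple, sum(p_tuple))  # sum plays the role of the root count
```

The sum of the final vector is the number of pixels carrying the tuple (001, 010, 111), which is the count a tuple P-tree would hold at its root.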

Distributed P-trees?

Vector-form tables for three basic Ptrees (leaf-level rows carry a single vector, since NZ and P1 coincide at the pixel level):

P11:
qid        NZ      P1
[ ]        1111    1001
[01]       1011    0010
[10]       1111    1101
[01.00]    1110
[01.11]    0010
[10.10]    1101

P12:
qid        NZ      P1
[ ]        1010    1000
[10]       1111    1110
[10.11]    0111

P13:
qid        NZ      P1
[ ]        0111    0001
[01]       1111    1110
[10]       1110    0110
[01.11]    0110
[10.00]    1000

Assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11. Root entries go to NodeC; every other entry is sent to Nodeij if its qid ends in ij. Bp identifies the basic Ptree an entry belongs to.

NodeC:
Bp    qid        NZ      P1
11    [ ]        1111    1001
12    [ ]        1010    1000
13    [ ]        0111    0001

Node00:
Bp    qid        NZ      P1
11    [01.00]    1110
13    [10.00]    1000

Node01:
Bp    qid        NZ      P1
11    [01]       1011    0010
13    [01]       1111    1110

Node10:
Bp    qid        NZ      P1
11    [10]       1111    1101
11    [10.10]    1101
12    [10]       1111    1110
13    [10]       1110    0110

Node11:
Bp    qid        NZ      P1
11    [01.11]    0010
12    [10.11]    0111
13    [01.11]    0110

A data mining request involves a series of multicast invocations and at most one unicast reply for each receiving node.

A distributed genomic data mining federation of Beowulf clusters? Each node computes only a tiny portion of the necessary count information, then sends it to the requesting node?
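The routing rule can be sketched in a few lines. This is only an illustration of the placement step: the function names are hypothetical, the root qid is written as an empty string, and the multicast/unicast messaging between nodes is not modeled.

```python
# Vector-form entries from the slide: {Bp: {qid: (NZ, P1) or (leaf vector,)}}.
PTREES = {
    "11": {"": ("1111", "1001"), "01": ("1011", "0010"), "10": ("1111", "1101"),
           "01.00": ("1110",), "01.11": ("0010",), "10.10": ("1101",)},
    "12": {"": ("1010", "1000"), "10": ("1111", "1110"), "10.11": ("0111",)},
    "13": {"": ("0111", "0001"), "01": ("1111", "1110"), "10": ("1110", "0110"),
           "01.11": ("0110",), "10.00": ("1000",)},
}


def node_for(qid):
    """Root entries go to NodeC; others go to the node named by the last qid digit pair."""
    return "NodeC" if not qid else "Node" + qid.split(".")[-1]


def distribute(ptrees):
    """Group (Bp, qid, vectors) entries by the cluster node that stores them."""
    nodes = {}
    for bp, entries in ptrees.items():
        for qid, vectors in entries.items():
            nodes.setdefault(node_for(qid), []).append((bp, qid, vectors))
    return nodes


placement = distribute(PTREES)
for node in sorted(placement):
    print(node)
    for bp, qid, vectors in placement[node]:
        print("   ", bp, "[" + (qid or " ") + "]", *vectors)
```

Because placement depends only on the last quadrant digit pair, any node can work out where an entry lives from its qid alone, without a directory lookup.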

1 2 3 4 5 6 7 8 87865676…5

55                        depth=0  level=3
16   8   15   16          depth=1  level=2
3 0 4 1   4 4 3 4         depth=2  level=1
1110   0010   1101        depth=3  level=0

bSQ format: bit files of intervalized, normalized red/green ratios for each microarray.

Ptree format: one P-tree for each bit position of each bSQ file (e.g., the high-order bit).

Non-ARM Ptree-based Microarray data mining methods

•Hierarchical Clustering: Agglomerative, Divisive

•Non-Hierarchical Clustering: K-clustering, PCA, SOM

•Supervised Learning or Classification: SVM, Decision Trees, KNN

A plan

[Figure: architecture diagram with the following components]

•Temporal Gene Exp. Analysis, Spatial Gene Exp. Analysis, Genotypic Gene Exp. Analysis

•Data Repository (bSQ, Ptrees)

•Development of Data Mining Tools

•User Java Graphical Interface (SQL, XML)

•Other Microarray Data Repositories (Stanford, EMBL, SGDB)

Data Mining in Genomics: Conclusion

•Data mining in application areas with huge raw data stores, such as Market Basket Research, Remotely Sensed Imagery, and Genomics (Proteomics?, Transcriptomics, Metabolomics?), is remarkably similar in terms of data and data mining needs.

•There should be more collaboration across applications.

•In these application areas, data cube rotation can open data mining possibilities.

•We suggest a universal data structure (GREASMN Table and P-trees):

– striped across a wide federation of computer nodes,

– using P-tree technology to facilitate data mining,

– to eliminate barriers introduced by scale limitations, incompatible data formats, etc.
