
Page 1: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Data Warehouse Mining (DWM)
For any DataWarehouse with
Fact file, F(d1..dn, m1..mk) (the mi's are measurements), and
Dimension files, Di(di, ai1...airi), i = 1..n.

Method-1 (to simplify): Convert to a Boolean DW by applying a predicate to the measurements {m1...mk}, replacing each measurement vector with a 1-bit if the predicate is true and a 0-bit if false (e.g., predicates can be simple thresholds and may include dimensions).

Predicated Fact file, PF(d1...dn, m0) (m0 = Boolean predicate result)
Dimension files, Di(di, ai1...airi)

Next, theta-join the Dimension files (doing selections and projections first?) using PF as the theta condition, ending up with one large relation, the Universal Predicated Fact file:
UF(d1...dn, a11...a1r1, ..., an1...anrn)

Next, (possibly) structure UF vertically (e.g., using basic Ptrees?).
Approach? Avoid actually creating the large UF relation at all (it is very large!). Create UF basic Ptrees directly from the Fact and Dimension basic Ptrees?

Method-2: Create the full equi-join of F and all Di (no predication); also denote the result UF.

UF can then be fully vertically partitioned and data mined (e.g., Nearest Neighbor Classification, NNC, or any other data mining method).

Universal Fact file, UF(d1...dn, a11...a1r1, ..., an1...anrn, m1..mk)
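A minimal sketch of Method-1's predication step (the table contents and the threshold predicate are illustrative assumptions, not from the slides):

```python
# Method-1 sketch: convert a fact file F(d1, d2, m1, m2) into a Boolean
# predicated fact file PF(d1, d2, m0) by thresholding the measurements.
# The threshold value and the sample rows are illustrative assumptions.

def predicate(measurements, threshold=5):
    """m0 = 1 if the measurement vector's total exceeds the threshold."""
    return 1 if sum(measurements) > threshold else 0

# F: rows of (d1, d2, m1, m2)
F = [(0, 0, 4, 7), (0, 1, 1, 2), (1, 0, 3, 3), (1, 1, 6, 5)]

# PF: each measurement vector replaced by a single predicate bit m0
PF = [(d1, d2, predicate((m1, m2))) for (d1, d2, m1, m2) in F]
```

The dimension files are untouched by this step; only the fact file's measurement columns collapse to the single bit m0.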

Page 2: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

A UF example

date dimension (date_key d, Day a, day_of_wk w, Month m, Quarter q, Year 200y):
d=0: a=4, w=m, m=2, q=1, y=2
d=1: a=9, w=f, m=7, q=3, y=1
d=2: a=2, w=t, m=6, q=3, y=3

product dimension (prod_key p, prod_name n, Brand b, Supplier s):
p=0: n=j, b=r, s=0
p=1: n=i, b=r, s=0
p=2: n=k, b=u, s=2

country dimension (country_key c, Legalname l, Continent o):
c=0: l=us, o=0
c=1: l=gb, o=1

Sales Fact Table (date_key d, product_key p, country_key c, Total-$-sold measurement t), 18 tuples, shown column-wise:
d  000000111111222222
p  001122001122001122
c  010101010101010101
t  470121336504642517

UFF (the Universal Fact File: the fact columns joined with every dimension attribute), column-wise:
d  000000111111222222
p  001122001122001122
c  010101010101010101
t  470121336504642517
a  444444999999222222
w  mmmmmmfffffftttttt
m  222222777777666666
q  111111333333333333
y  222222111111333333
n  jjiikkjjiikkjjiikk
b  rrrruurrrruurrrruu
s  000022000022000022
l  us gb us gb us gb us gb us gb us gb us gb us gb us gb
o  010101010101010101

Page 3: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Nearest Neighbor Classification (NNC)

Many UF mining research topics can be pursued. E.g., for any DW data area: Association Rule Mining (ARM), Clustering, Classification (e.g., NNC and other NN methods), Iceberg Querying, CaseBased & RoughSet Classification, NN search, Outlier/Noise Analysis, OLAP operator implementation, Query Processing, and Vertical DW maintenance (e.g., upon inserting next-day data...).

The research may be quite different depending on the data area. E.g., Dr. Slator is interested in classification of Virtual Cell data with respect to which students do well.

NNC: Given a Training Set, a similarity measure, and an unclassified tuple, find a set of nearest neighbors from the Training Set. Those neighbors predict the class through a plurality vote (or a similarity-weighted vote). How many neighbors? E.g., kNNC finds the k nearest neighbors; dNNC finds all neighbors within a similarity d. Note: NNC requires a similarity measure on pairs of tuples for "nearest" to make sense.

Classification: Choose a feature attribute as the "class label" (it may be composite) ( = the column(s) you want to classify tuples with respect to).

A Classifier is a program whose input is an unclassified tuple (no class label yet) and whose output is a predicted class label for that input. How is that prediction made? It is based on already-classified tuples (the Training Set) of historical data.
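The kNNC idea above can be sketched in a few lines of Python; this is a generic sketch with illustrative training data, using Hamming distance as the similarity measure (as the later slides do):

```python
# kNN classification by scanning: Hamming distance on bit-vector tuples,
# plurality vote among the k nearest. Training data is illustrative.
from collections import Counter

def hamming(x, y):
    """Number of mismatching positions between two equal-length tuples."""
    return sum(a != b for a, b in zip(x, y))

def knn_classify(training, sample, k=3):
    """training: list of (features, class_label); returns predicted label."""
    nearest = sorted(training, key=lambda t: hamming(t[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # plurality vote

training = [((1, 0, 1), 1), ((1, 1, 1), 1), ((0, 0, 0), 0), ((0, 0, 1), 0)]
prediction = knn_classify(training, (0, 0, 1), k=3)
```

A dNNC variant would simply keep every training tuple with distance at most d instead of the k closest.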

Page 4: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

NNC example from Precision Agriculture

The Training Set, T, consists of an aerial photograph (a TIFF image taken during a growing season) and a synchronized yield map (crop yield taken that same year at harvest): T(R, G, B, Y), ~100,000 tuples.

[Figures: TIFF image and Yield Map]

Producers want to classify Y = yield (e.g., Hi, Med, Low) based on color intensity (R, G, B); Y = Yield is the class label attribute. Using last year's data set as Training Data, producers want a classifier that takes an (R, G, B) triple as input (from an image taken during the current growing season) and outputs a predicted yield for that pixel of their field.

Then they can apply additional nitrogen (N) on just those parts of the field that need it to increase yield, without wasting N on the parts that will likely have a high enough yield anyway (and avoiding application of excess N there, which would just run off into rivers and contaminate ground water).

This classifier would help save N costs, maximize yield, and protect the environment!

Page 5: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

UF (predicated) example. The Fact file is F(d1, d2, d3, m1, m2, m3). Predicating on the mi's results in PF(d1, d2, d3, m0). Dimensions: D1(d1, a10, a11, a12, a13), D2(d2, a20, a21, a22) and D3(d3, a30, a31).

D1(d1, a10, a11, a121, a122, a123, a13) (a12 shown as three bit-slices):
d1=0: 0 1 1 1 1 c
d1=1: 1 1 1 0 0 s
d1=2: 0 1 1 1 1 c
d1=3: 0 0 0 1 0 s

D2(d2, a201, a202, a21, a22):
d2=0: 0 1 1 c
d2=1: 0 1 0 a
d2=2: 1 1 1 b
d2=3: 0 1 0 a

D3(d3, a30, a31):
d3=0: c 1
d3=1: a 0
d3=2: b 1
d3=3: a 0

[Slide figure: the predicated fact cube PF(d1, d2, d3, m0), drawn as a 4x4x4 cube with 1-bits marking the tuples where the predicate holds, laid out in generalized Peano order.]

NNC example: Choose D2.a22 as the Class Attribute, C.

Page 6: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

The ordering used on the previous slide is shown here: generalized Peano order, sorting on d11, then d21, then d31, then d12, then d22, then d32, ... (the origin is in the top back left corner).

[3-D cube figure with axes d1, d2, d3]

Page 7: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

[The same 3-D cube figure, axes d1, d2, d3] Spread out, so you can see what's going on.

Page 8: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Using the standard orientation (origin in the bottom back left corner) and generalized Peano order, (x1,y1,z1,x2,y2,z2,x3,y3,z3).

[3-D cube figure with axes X=d1, Y=d2, Z=d3]

Page 9: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Enlarged: standard orientation and generalized Peano order, (x1,y1,z1,x2,y2,z2,x3,y3,z3).

[3-D cube figure with axes X=d1, Y=d2, Z=d3]

Page 10: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Example UF with a 2-D Reflexive Fact File (a graph)

E.g., a protein-protein interaction graph. Note: the dimension files are identical copies of the gene table; i.e., this is a 2-D reflexive relationship on a single dimension file.

Graph G (as an Edge Table), G(Tid1, Tid2):
t1t2, t1t3, t1t5, t1t6, t2t1, t2t7, t3t1, t3t2, t3t3, t3t5, t5t1, t5t3, t5t5, t5t7, t6t1, t7t2, t7t5

Single Dimension File, R(Tid, a1..a9, C):
t1  1 0 1 0 0 0 1 1 0  1
t2  0 1 1 0 1 1 0 0 0  1
t3  0 1 0 0 1 0 0 0 1  1
t4  1 0 1 1 0 0 1 0 1  1
t5  0 1 0 1 0 0 1 1 0  0
t6  1 0 1 0 1 0 0 0 1  0
t7  0 0 1 1 0 0 1 1 0  0

Graph G (as a reflexive 2-D relationship, i.e., an adjacency matrix with rows and columns t1..t7):
t1  0 1 1 0 1 1 0
t2  1 0 0 0 0 0 1
t3  1 1 1 0 1 0 0
t4  0 0 0 0 0 0 0
t5  1 0 1 0 1 0 1
t6  1 0 0 0 0 0 0
t7  0 1 0 0 1 0 0

Note: Given any 2-D reflexive Fact file (graph), the standard Universal Fact file will be denoted UF1. UF2 will denote the UF coming from the "2-hop graph" Fact file, the join of G with itself: G2 = (G g2JOINg1' G')[Tid1, Tid2']. UF3 will come from the "3-hop graph" Fact file, G3 = (G2 g3JOINg1' G')[Tid1, Tid2'].

Page 11: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

For this example: UF = UF1 = R THETAJOIN R' (theta-join using THETA = G).

UF1(d1, d2, a1..a9, C, a1'..a9', C'):
t1 t2  1 0 1 0 0 0 1 1 0 1  0 1 1 0 1 1 0 0 0 1
t1 t3  1 0 1 0 0 0 1 1 0 1  0 1 0 0 1 0 0 0 1 1
t1 t5  1 0 1 0 0 0 1 1 0 1  0 1 0 1 0 0 1 1 0 0
t1 t6  1 0 1 0 0 0 1 1 0 1  1 0 1 0 1 0 0 0 1 0
t2 t1  0 1 1 0 1 1 0 0 0 1  1 0 1 0 0 0 1 1 0 1
t2 t7  0 1 1 0 1 1 0 0 0 1  0 0 1 1 0 0 1 1 0 0
t3 t1  0 1 0 0 1 0 0 0 1 1  1 0 1 0 0 0 1 1 0 1
t3 t2  0 1 0 0 1 0 0 0 1 1  0 1 1 0 1 1 0 0 0 1
t3 t3  0 1 0 0 1 0 0 0 1 1  0 1 0 0 1 0 0 0 1 1
t3 t5  0 1 0 0 1 0 0 0 1 1  0 1 0 1 0 0 1 1 0 0
t5 t1  0 1 0 1 0 0 1 1 0 0  1 0 1 0 0 0 1 1 0 1
t5 t3  0 1 0 1 0 0 1 1 0 0  0 1 0 0 1 0 0 0 1 1
t5 t5  0 1 0 1 0 0 1 1 0 0  0 1 0 1 0 0 1 1 0 0
t5 t7  0 1 0 1 0 0 1 1 0 0  0 0 1 1 0 0 1 1 0 0
t6 t1  1 0 1 0 1 0 0 0 1 0  1 0 1 0 0 0 1 1 0 1
t7 t2  0 0 1 1 0 0 1 1 0 0  0 1 1 0 1 1 0 0 0 1
t7 t5  0 0 1 1 0 0 1 1 0 0  0 1 0 1 0 0 1 1 0 0

Recursively, for k > 1 (letting G1 = G):
Gk = (Gk-1 gkJOINg1' G')(g1, ..., gk+1), where gk+1 = g2'
UFk = R Gk-join R', where Gk-join is the theta-join using Gk[g1, gk+1]
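The Gk recursion above amounts to composing the edge relation with itself; a minimal sketch (the three-node graph is illustrative, not the slides' seven-node example):

```python
# k-hop graph sketch: Gk[g1, g_{k+1}] is obtained by repeatedly composing
# the edge relation G with itself, matching the recursion above.

def compose(pairs_a, pairs_b):
    """All (x, z) such that (x, y) is in pairs_a and (y, z) is in pairs_b."""
    return {(x, z) for (x, y) in pairs_a for (y2, z) in pairs_b if y == y2}

def k_hop(G, k):
    """Pairs of nodes connected by a path of exactly k edges."""
    Gk = set(G)
    for _ in range(k - 1):
        Gk = compose(Gk, G)
    return Gk

# Tiny illustrative directed 3-cycle
G = {("t1", "t2"), ("t2", "t3"), ("t3", "t1")}
G2 = k_hop(G, 2)
```

UFk is then just R theta-joined with R' using membership in k_hop(G, k) as the theta condition.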

Page 12: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

UF1 laid out in raster order over all pairs tij = (ti, tj): only the 17 edge rows (t12, t13, t15, t16, t21, t27, t31, t32, t33, t35, t51, t53, t55, t57, t61, t72, t75) carry values; all other rows are blank.

The UF1 template: a 1-bit wherever there are values and a 0-bit wherever there are blanks; i.e., row tij is all 1-bits iff (ti, tj) is an edge of G, and all 0-bits otherwise. Note: tij means (ti, tj).
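The template construction can be sketched directly from the edge set; a minimal version (the sizes and edges here are illustrative, not the slides' 8x8 example):

```python
# UF1 template sketch: an n x n family of bit rows in raster order, where
# row (i, j) is all 1-bits iff (i, j) is an edge of G, else all 0-bits.

def uf1_template(n, edges, width):
    """edges: set of (i, j) index pairs; width: number of UF1 columns."""
    rows = {}
    for i in range(n):
        for j in range(n):
            bit = 1 if (i, j) in edges else 0  # 1-row for edges, 0-row otherwise
            rows[(i, j)] = [bit] * width
    return rows

# Illustrative 3-node graph with 2 edges and a 4-column UF1
template = uf1_template(3, {(0, 1), (2, 0)}, width=4)
```

ANDing any UF1 column pattern with this template zeroes out the padded non-edge rows.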

Page 13: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

The full relation, UF1 (in raster order, with padded zeros): each of the 64 rows t00..t77 is either the UF1 tuple for the pair (ti, tj), when (ti, tj) is an edge of G, or an all-zero padding row otherwise. E.g., t12 = 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1, while t14 is all 0-bits.

Each column is a 0-dimensional basic Ptree (just a sequence: a fanout = 0 tree, no compression).

Later in these notes there is a discussion of techniques for building the 1-D basic Ptree set and the 2-D basic Ptree set for this Universal Fact File.

Page 14: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

G2 = (G g2JOINg1' G')(g1, g2, g2'). Projecting gives the 2-hop pairs, G2[g1, g3]:
t11, t12, t13, t15, t17, t22, t23, t25, t26, t31, t32, t33, t35, t36, t37, t51, t52, t53, t55, t56, t57, t62, t63, t65, t66, t71, t72, t73, t75, t76

UF2(d1, d2, a1..a9, C, a1'..a9', C'): one row per 2-hop pair tij, concatenating R's ti tuple with R's tj tuple (30 rows).

Page 15: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

G3 = (G2 g3JOINg1' G')[g1, g4]: t4 is absent (no interactions), and every other pair over {t1, t2, t3, t5, t6, t7} appears except t2t6 and t6t6 (34 pairs).

UF3(d1, d2, a1..a9, C, a1'..a9', C'): one row per pair in G3[g1, g4], again concatenating R's ti tuple with R's tj tuple (34 rows).

Page 16: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

 

G4 = (G3 g4JOINg1' G')[g1, g5] = G3[g1, g4]: t4 doesn't appear (no interaction), and every other pair over {t1, t2, t3, t5, t6, t7} appears except t2t6 and t6t6.

UF4 = UF3. Note: UF3 = UF4 = UF5 = ... = UFi for all i > 2, since Gi = G3 for i > 2.

Page 17: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

[Slide figure: F (the Edge Table) plotted as an 8x8 grid of points (d1, d2), one point per edge, e.g. t2t1.]

From the R Ptrees and the PF Ptrees, can we create the Ptrees for UF?

Dimension file, R(Tid, a1..a9, C):
t1  1 0 1 0 0 0 1 1 0  1
t2  0 1 1 0 1 1 0 0 0  1
t3  0 1 0 0 1 0 0 0 1  1
t4  1 0 1 1 0 0 1 0 1  1
t5  0 1 0 1 0 0 1 1 0  0
t6  1 0 1 0 1 0 0 0 1  0
t7  0 0 1 1 0 0 1 1 0  0

PF (as an 8x8 bit mask in raster order over pairs (ti, tj)):
00110110
01010001
01010100
00000000
01010101
01000000
00100100
00000000

For UF1[a1]: replicate R[a1] = 01001010 as the columns of an 8x8 matrix, then AND with PF:

UF1[a1] =
00000000
00110110
00000000
00000000
00000000
00000000
01000000
00000000
(1-bits at t12, t13, t15, t16 and t61)

For UF1[a1']: replicate R'[a1] = R[a1]tr = 0 1 0 0 1 0 1 0 as the rows of the matrix, then AND with PF:

UF1[a1'] =
00000000
00000010
01000000
01000000
00000000
01000000
01000000
00000000
(1-bits at t16, t21, t31, t51 and t61)
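The replicate-and-AND construction can be sketched directly; this is a minimal version with illustrative 3x3 data, not the slides' 8x8 example:

```python
# Replicate-and-AND sketch for building UF1 attribute bit-matrices without
# materializing the join: UF1[a] = (R[a] replicated as columns) AND PF,
# and UF1[a'] = (R[a] transposed, replicated as rows) AND PF.

def uf1_column(r_a, pf, primed=False):
    """r_a: bit list for attribute a of R; pf: n x n 0/1 edge mask."""
    n = len(r_a)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # unprimed attribute comes from the left tuple (row index i),
            # primed attribute from the right tuple (column index j)
            bit = r_a[j] if primed else r_a[i]
            out[i][j] = bit & pf[i][j]
    return out

r_a1 = [0, 1, 0]                        # attribute a1 over tuples t1..t3
pf = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]  # illustrative edge mask
uf_a1 = uf1_column(r_a1, pf)
```

The result never requires building the joined relation itself; only R's column and the PF mask are touched.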

Page 18: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

[Slide figure: the compressed (fanout > 0) Ptrees built from the previous slide's matrices: P_R[a1], R[a1] replicated, the PG pattern, and P_R[a1]-replicated.]

Class research project? Develop the algorithm and code for creating the basic P_R[ai]-replicated Ptrees and (therefore) the P_UF[ai] Ptrees from the PF and R Ptrees.

Page 19: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Similarly for a2: for UF1[a2], replicate R[a2] = 00110100 as the columns of the 8x8 matrix and AND with the pattern; for UF1[a2'], replicate R[a2]tr as the rows and AND with the pattern.

UF1[a2] =
00000000
00000000
01000001
01110100
00000000
01010101
00000000
00000000

UF1[a2'] =
00000000
00110100
00000000
00110100
00000000
00010100
00000000
00100100

Page 20: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Note that the cardinality of the UFk file may fill up quickly (with respect to k).

E.g., in the previous example, for k > 2 the cardinality is maximal (34), almost full (full = 49). Even for k = 1 the cardinality is already 17, more than double that of k = 0 (7) and 35% of full. If there are 100,000 genes involved, e.g., the full size is 10,000,000,000 (10 billion). Instead of joining, one can simply apply quantifiers across the graph. E.g., quantifying universally across the graph:

UFU(a1..a9, C, a1'..a9'):
t1  1 0 1 0 0 0 1 1 0 1  0 0 0 0 0 0 0 0 0
t2  0 1 1 0 1 1 0 0 0 1  0 0 1 0 0 0 1 1 0
t3  0 1 0 0 1 0 0 0 1 1  0 0 0 0 0 0 0 0 0
t5  0 1 0 1 0 0 1 1 0 0  0 0 0 0 0 0 0 0 0
t6  1 0 1 0 1 0 0 0 1 0  1 0 1 0 0 0 1 1 0
t7  0 0 1 1 0 0 1 1 0 0  0 1 0 0 0 0 0 0 0

The existential quantifier across the graph yields:

UFE(a1..a9, C, a1'..a9'):
t1  1 0 1 0 0 0 1 1 0 1  1 1 1 1 1 1 1 1 1
t2  0 1 1 0 1 1 0 0 0 1  1 0 1 1 0 0 1 1 0
t3  0 1 0 0 1 0 0 0 1 1  1 1 1 1 1 1 1 1 1
t5  0 1 0 1 0 0 1 1 0 0  1 1 1 1 1 0 1 1 1
t6  1 0 1 0 1 0 0 0 1 0  1 0 1 0 0 0 1 1 0
t7  0 0 1 1 0 0 1 1 0 0  0 1 1 1 1 1 1 1 0
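The quantifier idea can be sketched as follows (a minimal sketch with tiny illustrative data; the slides' R and graph are not reproduced here):

```python
# Quantifier sketch: instead of joining R with itself across the graph,
# fill each primed attribute by quantifying over a tuple's neighbors.
# UFU uses "for all neighbors" (AND), UFE uses "there exists" (OR).

def quantify(R, adj, universal=True):
    """R: {tid: bit tuple}; adj: {tid: set of neighbor tids}."""
    out = {}
    for t, bits in R.items():
        nbrs = [R[n] for n in adj.get(t, ())]
        if not nbrs:
            primed = tuple(0 for _ in bits)          # no neighbors at all
        elif universal:
            primed = tuple(min(col) for col in zip(*nbrs))  # AND over nbrs
        else:
            primed = tuple(max(col) for col in zip(*nbrs))  # OR over nbrs
        out[t] = bits + primed
    return out

R = {"t1": (1, 0), "t2": (0, 1)}
adj = {"t1": {"t2"}, "t2": {"t1", "t2"}}
UFE = quantify(R, adj, universal=False)
```

This keeps the result at one row per tuple of R, regardless of how dense the graph is.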

Page 21: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

UF NNC scan example: Find the 3 nearest neighbors in UF1. Current practice is to find the 3NN set by scanning. E.g., use Hamming distance, d(x, y) = # of mismatches, to C-classify the sample (a1..a9) = 0 0 1 1 0 0 1 0 0.

Choose class label = C in UF1 (the 17-tuple Training Set shown earlier).

Scanning the sample against each tuple's (a1..a9):
t1t2 d=3, t1t3 d=3, t1t5 d=3 (the 3NN set so far);
then t1t6 d=3, t2t1 d=5, t2t7 d=5, t3t1 d=6, t3t2 d=6, t3t3 d=6, t3t5 d=6, t5t1 d=3, t5t3 d=3, t5t5 d=3, t5t7 d=3, t6t1 d=5: don't replace;
t7t2 d=1: replace; t7t5 d=1: replace.

The final 3NN set is one of the d=3 neighbors (C=1) plus t7t2 (C=0) and t7t5 (C=0). Final plurality vote winner: C = 0.
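The scan above, which keeps a running 3NN set and evicts its worst member when a closer tuple arrives, can be sketched as follows (illustrative data, not the UF1 table):

```python
# Scan sketch: maintain a running 3NN set over one pass of the training
# data, replacing its worst member whenever a closer tuple is found.

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def scan_3nn(training, sample):
    """training: list of (features, label); returns the final 3NN set."""
    nn = list(training[:3])  # seed with the first three tuples
    for row in training[3:]:
        d = hamming(row[0], sample)
        worst = max(range(3), key=lambda i: hamming(nn[i][0], sample))
        if d < hamming(nn[worst][0], sample):
            nn[worst] = row  # replace the current worst neighbor
    return nn

training = [((0, 0), 0), ((1, 1), 1), ((1, 0), 0), ((0, 1), 1)]
final_3nn = scan_3nn(training, (0, 0))
```

The class prediction is then a plurality vote over the labels in the returned set, exactly as in the slide's walk-through.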

Page 22: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

UF NNC scan example 2: Take (a5, a6, a1', a2', a3', a4') as the feature attributes, sample = (0 0 0 0 0 0), class label = C', using Hamming distance d(x, y) = # of mismatches (Training Set: UF1 as before).

Scanning against the sample:
t1t2 d=2, t1t3 d=1, t1t5 d=2 (the 3NN set so far);
then t1t6 d=2, t2t1 d=4, t2t7 d=4, t3t1 d=3, t3t2 d=3, t3t3 d=2, t3t5 d=3, t5t1 d=2: don't replace;
t5t3 d=1: replace;
then t5t5 d=2, t5t7 d=2, t6t1 d=3, t7t2 d=2, t7t5 d=2: don't replace.

Final winner: C' = 1.

Page 23: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

UF NNC scan example 2 (cont.): With (a5, a6, a1', a2', a3', a4') and sample (0 0 0 0 0 0), finding all training points within distance 2 of the sample takes another scan, using scan methods.

Scanning again with threshold d <= 2:
include t1t2 (d=2), t1t3 (d=1), t1t5 (d=2), t1t6 (d=2), t3t3 (d=2), t5t1 (d=2), t5t3 (d=1), t5t5 (d=2), t5t7 (d=2), t7t2 (d=2) and t7t5 (d=2);
exclude t2t1 (d=4), t2t7 (d=4), t3t1 (d=3), t3t2 (d=3), t3t5 (d=3) and t6t1 (d=3).

The vote histogram over C' for this neighbor set again gives the winner C' = 1.

Page 24: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

UF NNC Ptree Ex. 1, using 0-D Ptrees (sequences). a = (a5 a6 a1' a2' a3' a4') = (000000)

Tuple order (d1 d2): t1t2 t1t3 t1t5 t1t6 t2t1 t2t7 t3t1 t3t2 t3t3 t3t5 t5t1 t5t3 t5t5 t5t7 t6t1 t7t2 t7t5

Basic Ptrees (one bit per tuple, in the order above):
a1   11110000000000100
a2   00001111111111000
a3   11111100000000111
a4   00000000001111011
a5   00001111110000100
a6   00001100000000000
a7   11110000001111011
a8   11110000001111011
a9   00000011110000100
C    11111111110000000
a1'  00011010001000100
a2'  11100001110110011
a3'  10011111001001110
a4'  00100100010011001
a5'  11010001100100010
a6'  10000001000000010
a7'  00101110011011101
a8'  00101110011011101
a9'  01010000100100000
C'   11001011101100110

Complement Ptrees (we use _ for complement):
_a5   11110000001111011
_a6   11110011111111111
_a1'  11100101110111011
_a2'  00011110001001100
_a3'  01100000110110001
_a4'  11011011101100110

Identifying all training tuples in the distance=0 ring (0-ring) centered at a, i.e. the exact matches, as the 1-bits of the Ptree
P = _a5 ^ _a6 ^ _a1' ^ _a2' ^ _a3' ^ _a4'
P = 00000000000000000

There are no training points in a's 0-ring! We must look further out, i.e. to a's 1-ring.

Vote histogram (so far): no votes yet for C=0 or C=1.
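With 0-D Ptrees stored as machine words, the 0-ring Ptree is just a bitwise AND of complements. A sketch using Python integers as 17-bit sequences (the helper names are ours, not from the notes):

```python
N = 17                      # number of UF tuples
MASK = (1 << N) - 1

def pt(bits):
    """Parse a Ptree bit string (leftmost bit = first tuple) into an int."""
    return int(bits, 2)

def comp(p):
    """Complement Ptree within the 17-tuple universe."""
    return ~p & MASK

# Basic Ptrees of the six attributes of the sample space, from the slide.
a5  = pt("00001111110000100")
a6  = pt("00001100000000000")
a1p = pt("00011010001000100")   # a1'
a2p = pt("11100001110110011")   # a2'
a3p = pt("10011111001001110")   # a3'
a4p = pt("00100100010011001")   # a4'

# 0-ring of a = (000000): AND the complements of all six attribute Ptrees.
P0 = comp(a5) & comp(a6) & comp(a1p) & comp(a2p) & comp(a3p) & comp(a4p)
print(format(P0, "017b"))   # 00000000000000000 -> no exact matches
```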

Page 25: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

UF NNC Ptree ex-1 (cont.): a's 1-ring? a = (a5 a6 a1' a2' a3' a4') = (000000)

C    11111111110000000
_C   00000000001111111

(Tuple order, basic Ptrees, and complements as on the previous slide.)

Training points in the 1-ring centered at a are given by the 1-bits in the Ptree, P, constructed as follows:

P = 01000000000100000

The C=1 vote count = RootCount(P ^ C). The C=0 vote count = RootCount(P ^ _C). (We never need to know which tuples voted.)
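Root counts are population counts on 0-D Ptrees, so the vote tally needs no tuple access at all. A sketch (names are ours), assuming the 1-ring Ptree P and class Ptree C from this slide:

```python
N = 17
MASK = (1 << N) - 1
P = int("01000000000100000", 2)   # 1-ring Ptree from the slide
C = int("11111111110000000", 2)   # class Ptree

def root_count(p):
    """Root count of a 0-D Ptree is just its population count."""
    return bin(p).count("1")

votes_c1 = root_count(P & C)            # C=1 vote count
votes_c0 = root_count(P & ~C & MASK)    # C=0 vote count
print(votes_c1, votes_c0)   # 1 1 -- one voter per class, no winner yet
```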

P is the OR of the six 1-ring conjunctions, each flipping exactly one attribute of a:
(100000)  a5 ^ _a6 ^ _a1' ^ _a2' ^ _a3' ^ _a4'
(010000)  _a5 ^ a6 ^ _a1' ^ _a2' ^ _a3' ^ _a4'
(001000)  _a5 ^ _a6 ^ a1' ^ _a2' ^ _a3' ^ _a4'
(000100)  _a5 ^ _a6 ^ _a1' ^ a2' ^ _a3' ^ _a4'
(000010)  _a5 ^ _a6 ^ _a1' ^ _a2' ^ a3' ^ _a4'
(000001)  _a5 ^ _a6 ^ _a1' ^ _a2' ^ _a3' ^ a4'

Page 26: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

a's 2-ring? a = (a5 a6 a1' a2' a3' a4') = (000000)

(Tuple order and basic Ptrees as before.)

For each of the following 15 Ptrees, a 1-bit corresponds to a training point in a's 2-ring (flip exactly two of the six attributes):
line 1: (110000) (101000) (100100) (100010) (100001)
line 2: (011000) (010100) (010010) (010001)
line 3: (001100) (001010) (001001)
line 4: (000110) (000101)
line 5: (000011)

1st line first -- the five conjunctions that flip a5 and one other attribute:
(110000)  a5 ^ a6 ^ _a1' ^ _a2' ^ _a3' ^ _a4'
(101000)  a5 ^ _a6 ^ a1' ^ _a2' ^ _a3' ^ _a4'
(100100)  a5 ^ _a6 ^ _a1' ^ a2' ^ _a3' ^ _a4'
(100010)  a5 ^ _a6 ^ _a1' ^ _a2' ^ a3' ^ _a4'
(100001)  a5 ^ _a6 ^ _a1' ^ _a2' ^ _a3' ^ a4'

Stop here? But the other 10 Ptrees should also be considered. The fact that the 2-ring includes so many new training points is "the curse of dimensionality".
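The k-ring always needs C(6, k) AND-Ptrees (one per choice of k attributes to flip), which is exactly why the 2-ring balloons. A small sketch of the enumeration (illustrative names):

```python
from itertools import combinations
from math import comb

ATTRS = ["a5", "a6", "a1'", "a2'", "a3'", "a4'"]

def ring_specs(k):
    """The k-ring needs one AND-Ptree per choice of k attributes to flip."""
    return [set(flips) for flips in combinations(ATTRS, k)]

for k in range(len(ATTRS) + 1):
    assert len(ring_specs(k)) == comb(len(ATTRS), k)
    print(k, len(ring_specs(k)))   # 1, 6, 15, 20, 15, 6, 1
```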

Page 27: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Enfranchising the rest of a's 2-ring? a = (a5 a6 a1' a2' a3' a4') = (000000)

2nd line -- the four conjunctions that flip a6 (but not a5) and one other attribute:
(011000)  _a5 ^ a6 ^ a1' ^ _a2' ^ _a3' ^ _a4'
(010100)  _a5 ^ a6 ^ _a1' ^ a2' ^ _a3' ^ _a4'
(010010)  _a5 ^ a6 ^ _a1' ^ _a2' ^ a3' ^ _a4'
(010001)  _a5 ^ a6 ^ _a1' ^ _a2' ^ _a3' ^ a4'

Page 28: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Enfranchising the rest of a's 2-ring (cont.). a = (a5 a6 a1' a2' a3' a4') = (000000)

3rd line -- the three conjunctions that flip a1' and one of a2', a3', a4':
(001100)  _a5 ^ _a6 ^ a1' ^ a2' ^ _a3' ^ _a4'
(001010)  _a5 ^ _a6 ^ a1' ^ _a2' ^ a3' ^ _a4'
(001001)  _a5 ^ _a6 ^ a1' ^ _a2' ^ _a3' ^ a4'

Page 29: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Enfranchising the rest of a's 2-ring (cont.). a = (a5 a6 a1' a2' a3' a4') = (000000)

4th line -- the two conjunctions that flip a2' and one of a3', a4':
(000110)  _a5 ^ _a6 ^ _a1' ^ a2' ^ a3' ^ _a4'
(000101)  _a5 ^ _a6 ^ _a1' ^ a2' ^ _a3' ^ a4'

P2 = 10100000000010011  (the OR of these two: five more voters)

Page 30: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Enfranchising the rest of a's 2-ring (cont.). a = (a5 a6 a1' a2' a3' a4') = (000000)

5th line -- the last conjunction, flipping a3' and a4':
(000011)  _a5 ^ _a6 ^ _a1' ^ _a2' ^ a3' ^ a4'

P3 = 00000000000001000  (one more voter)

Page 31: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Justification for using vertical structures (once again)?

• For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing?

R( A1  A2  A3  A4)
   010 111 110 001
   011 111 110 000
   010 110 101 001
   010 111 101 111
   101 010 001 100
   010 010 001 101
   111 000 001 100
   111 000 001 100

(stored vertically as the bit slices R11 R12 R13, R21 R22 R23, R31 R32 R33, R41 R42 R43)

• For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result (a histogram), where there is no reconstructive post-processing and the actual data records need never be involved?

Page 32: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Paper Topics in the area of NNC on UF?

If you decide to do a research project in this area, you might pick a particular DW area (VirtualCell data, Bioinformatics data, Market Basket data, Text data, Sales data, Scientific data, Astronomical data, ….).

Then discover an interpretation of the results of NNC that gives new, useful info.

e.g., in the last example NNC problem, if the data is gene expression data and C=1 means the gene is associated with a particular cancer, the previous results might be interpreted as "if none of the treatments a5, a6, a1', a2', a3', a4' express at a threshold level, then the dissolved tissue is predicted to be cancerous (2/3 probability in the scan-based NNC algorithm and 6/11 probability in the Ptree-based NNC algorithm)".

Other research projects in this setting could involve:

1. Looking at one of the other data mining techniques (clustering, ARM…) and applying it to a new data area.

2. Developing efficient algorithms (implement them and prove that they are efficient) of the various steps in this data mining methodology (or any other).

E.g., an efficient algorithm for "producing the basic Ptrees for a UF from the basic Ptrees for F and the Di's, without having to actually construct the massive UF in the process" is suggested in these notes, but the details (or a better method?) and performance work would make a good topic.

Page 33: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Paper Topics in the area of NNC on UF (continued)

Stopping conditions in NNC:

Note that we have assumed the user picks a k ahead of time (in our example, k=3), then finds the k nearest training neighbors to vote on the class assignment (or the 1st ring in which at least k voters appear -- the closed kNNC method).

In kNNC the prior choice of k determines when to stop accumulating voters.

Other methods (address the curse of dimensionality)?

Weight the votes by similarity distance from the unclassified sample? By weighting attributes beforehand? By weighting votes depending upon distance out? Or both? (Or something else?)

All training points within a predefined similarity level (rather than count level)?

Build out in rings until the histogram shows a clear enough winner?

Note that the histogram doesn't necessarily get good and stay good, so build out past the 1st good histogram to see if the 2nd "good" histogram is even better? …
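The "build out in rings until the histogram shows a clear enough winner" idea can be sketched as a loop with a margin-based stopping rule (hypothetical names and toy data; not the notes' algorithm):

```python
from collections import Counter

def ring_vote(training, sample, margin=2):
    """Build out rings (distance 0, 1, 2, ...) and stop as soon as the vote
    histogram shows one class leading by at least `margin` votes."""
    tally = Counter()
    ranked = []
    for d in range(len(sample) + 1):
        for point, cls in training:
            if sum(p != s for p, s in zip(point, sample)) == d:
                tally[cls] += 1
        ranked = tally.most_common()
        if ranked:
            lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
            if lead >= margin:
                return ranked[0][0], d   # (winning class, last ring built)
    return (ranked[0][0] if ranked else None), len(sample)

# Toy data: two C=1 points near the sample, one C=0 point farther out.
TRAIN = [((0, 0, 1), 1), ((0, 1, 1), 1), ((1, 1, 1), 0)]
print(ring_vote(TRAIN, (0, 0, 0)))   # (1, 2): C=1 wins after the 2-ring
```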

Page 34: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Another example of Ptree NNC, using weights (about the only way to address the curse of dimensionality).

a = (a5 a6 a1' a2' a3' a4') = (010010)
attribute weights (1, 1, 3, 3, 3, 3); vote weight = 1/(1+distance)
d(p,q) = sum of the weights at the positions i where p and q differ

(Tuple order, basic Ptrees, and complements as before.)

Identifying all training tuples in the 0-ring centered at a (exact matches) as the 1-bits of the Ptree
P = _a5 ^ a6 ^ _a1' ^ _a2' ^ a3' ^ _a4'
P = 00000000000000000
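The weighted distance and vote weight are straightforward to compute per tuple. A sketch (names are ours), using the sample a = (010010) and a training tuple transcribed from the UF relation:

```python
WEIGHTS = (1, 1, 3, 3, 3, 3)   # for (a5, a6, a1', a2', a3', a4')

def weighted_distance(p, q, weights=WEIGHTS):
    """d(p,q) = sum of the weights at the positions where p and q differ."""
    return sum(w for pi, qi, w in zip(p, q, weights) if pi != qi)

def vote_weight(d):
    """A nearer voter counts for more: vote weight = 1/(1+distance)."""
    return 1 / (1 + d)

a = (0, 1, 0, 0, 1, 0)   # the sample (010010)
t = (0, 0, 0, 1, 0, 0)   # training tuple t1 t3 (and t5 t3) from the UF table
d = weighted_distance(a, t)
print(d, vote_weight(d))   # differs at a6, a2', a3': d = 1+3+3 = 7, vote 1/8
```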

Page 35: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

a's 1-ring? a = (a5 a6 a1' a2' a3' a4') = (010010)
attribute weights (1, 1, 3, 3, 3, 3); d(p,q) = sum of weights where p and q differ; vote weight = 1/(1+distance)

Only a5 and a6 carry weight 1, so the points at weighted distance 1 are found by flipping one of them:
(110010)  a5 ^ a6 ^ _a1' ^ _a2' ^ a3' ^ _a4'
(000010)  _a5 ^ _a6 ^ _a1' ^ _a2' ^ a3' ^ _a4'

P = 00000000000000000 -- still no voters.

Page 36: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

a's 2-ring? a = (a5 a6 a1' a2' a3' a4') = (010010)
attribute weights (1, 1, 3, 3, 3, 3); d(p,q) = sum of weights where p and q differ; vote weight = 1/(1+distance)

Weighted distance 2 means flipping both weight-1 attributes, a5 and a6:
(100010)  a5 ^ _a6 ^ _a1' ^ _a2' ^ a3' ^ _a4'

P = 00000000000000000 -- still no voters.

Page 37: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

a's 3-ring? a = (a5 a6 a1' a2' a3' a4') = (010010)

Identify all training pts in the 3-ring centered at a (weighted distance 3 = flip exactly one weight-3 attribute). Check each of a1', a2', a3', a4' as the single difference:
(011010)  _a5 ^ a6 ^ a1' ^ _a2' ^ a3' ^ _a4'
(010110)  _a5 ^ a6 ^ _a1' ^ a2' ^ a3' ^ _a4'
(010000)  _a5 ^ a6 ^ _a1' ^ _a2' ^ _a3' ^ _a4'
(010011)  _a5 ^ a6 ^ _a1' ^ _a2' ^ a3' ^ a4'

P = 00000000000000000 -- still no voters.

Page 38: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

a's 4-ring? a = (a5 a6 a1' a2' a3' a4') = (010010)

Identify all training pts in the 4-ring centered at a (weighted distance 4 = flip a5 together with one weight-3 attribute). Check a1', a2', a3', a4' as differing, in turn:
(111010)  a5 ^ a6 ^ a1' ^ _a2' ^ a3' ^ _a4'
(110110)  a5 ^ a6 ^ _a1' ^ a2' ^ a3' ^ _a4'
(110000)  a5 ^ a6 ^ _a1' ^ _a2' ^ _a3' ^ _a4'
(110011)  a5 ^ a6 ^ _a1' ^ _a2' ^ a3' ^ a4'

attribute weights (1, 1, 3, 3, 3, 3); d(p,q) = sum of weights where p and q differ; vote weight = 1/(1+distance)

Vote tally so far: C=0: 0; C=1: 1/5, then 2/5 (two voters at distance 4, 1/5 each)

Page 39: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

a's 5-ring? a = (a5 a6 a1' a2' a3' a4') = (010010)

Identify all training pts in the 5-ring centered at a (weighted distance 5 = flip a5, a6, and one weight-3 attribute). Check a1', a2', a3', a4' as the weight-3 difference, in turn:
(101010)  a5 ^ _a6 ^ a1' ^ _a2' ^ a3' ^ _a4'
(100110)  a5 ^ _a6 ^ _a1' ^ a2' ^ a3' ^ _a4'
(100000)  a5 ^ _a6 ^ _a1' ^ _a2' ^ _a3' ^ _a4'
(100011)  a5 ^ _a6 ^ _a1' ^ _a2' ^ a3' ^ a4'

attribute weights (1, 1, 3, 3, 3, 3); d(p,q) = sum of weights where p and q differ; vote weight = 1/(1+distance)

Vote tally: C=1: 2/5 + 1/6 = 17/30, then 17/30 + 5/30 = 22/30; C=0: 5/30

Stop here? (C=1 is the winner.) An Interactive Stop-on-Command (ISoC) system?

Note: an ISoC system seems easy with vertical Ptree methods, but hard with horizontal scan methods?

Projects?
Implement such an ISoC NNC?
Allow users to decide attribute weights interactively also?
A Ptree NNC which stops only after all classes have at least 1 vote (or after certain thresholds are achieved)? How does this perform w.r.t. standard stopping methods?
I think users would really like a system in which they could interactively control the vote and also do a "recall" vote if they don't like the outcome (a California NNC? or CNNC).

Page 40: Data Warehouse Mining  ( DWM ) For any DataWarehouse with

Iceberg Queries

• On any relation (not just the UF of a DW), R(a1,…,an,b), find all tuples for which an aggregate (e.g., sum) over a set of attribute(s) exceeds a threshold. (Why "iceberg"? Because the result set is small and therefore the tip of the iceberg.)

– SELECT * FROM R GROUP BY ai1,…,aik HAVING aggr(b) >= threshold;

– E.g., SALES( CUST, ITEM, TIME, CTRY, $SOLD )
• e.g., the typical "who? what? when? where?" data cube (wwww data cube) with measurement "how much?"

SELECT * FROM SALES GROUP BY CUST, ITEM HAVING SUM($SOLD) >= $10M
(i.e., "Which are our big customer-item match-ups over all time and locations?")

Ptrees: for a = (a1,…,an), output a if SUM(i=1..8) RootCount(Pa ^ Pbi) * 2^(8-i) >= threshold (b = b1…b8 in bits).

Still we must sequence through all a values? Assuming very few meet the threshold (the iceberg assumption), devise a pruning mechanism for the search by considering each bit in turn (from the high order bit on down).

First, all combos: SUM(i=1..8) RootCount(Pa11 ^ … ^ Pan1 ^ Pbi) * 2^(8-i), where an underscore indicates a choice of complement or not.

Then SUM(i=1..8) RootCount(Pa11 ^ Pa12 ^ … ^ Pan1 ^ Pan2 ^ Pbi) * 2^(8-i), etc.

Whenever an attribute makes the threshold for only one choice (complement or not), eliminate the other. Whenever an attribute makes the threshold for no choices, we are done (no iceberg).

SUM(i=1..8) RootCount(Pa1..aj1..ajk..an ^ Pbi) * 2^(8-i), where k=1..8 enumerates the bits of aj and j enumerates the attributes. If this falls below the threshold for some k (i.e., if the sum falls below the level at which the remaining bits of aj can't possibly make it to the threshold), prune aj. From the surviving aj's …

(Assume the numbers run from 0 to 255, i.e., 8 bits.)
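Assuming each 8-bit measurement b (0..255) is stored as bit-slice Ptrees Pb1..Pb8, the group sum is recovered from root counts alone. A toy sketch (the representation and names are assumptions, not the notes' implementation):

```python
def root_count(p):
    """Root count of a 0-D Ptree is its population count."""
    return bin(p).count("1")

def group_sum(P_a, P_b):
    """SUM(b) over the tuples selected by P_a, from bit-slice Ptrees alone:
    sum over slices i of RootCount(P_a ^ P_bi) * 2^(8-i)."""
    return sum(root_count(P_a & P_b[i]) * 2 ** (7 - i) for i in range(8))

# Toy check: 3 tuples with b = 5, 200, 17; the group selects tuples 0 and 2.
values = [5, 200, 17]
P_b = [0] * 8                         # P_b[0] = high-order bit slice
for t, v in enumerate(values):
    for i in range(8):
        if (v >> (7 - i)) & 1:
            P_b[i] |= 1 << t          # tuple t contributes a 1-bit to slice i
P_a = 0b101                           # bit t set <=> tuple t is in the group
print(group_sum(P_a, P_b))            # 5 + 17 = 22
```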

Page 41: Data Warehouse Mining  ( DWM ) For any DataWarehouse with


Appendix: scratch slides
