Data Warehouse Mining (DWM)

For any Data Warehouse with
Fact file, F(d1..dn, m1..mk) (the mi's are measurements), and
Dimension files, Di(di, ai1...airi), i = 1..n.
Method-1 (to simplify): Convert to a Boolean DW by applying a predicate to the measurements {m1..mk}, replacing each measurement vector with a 1-bit if the predicate is true and a 0-bit if it is false (e.g., predicates can be simple thresholds and may also involve dimensions).
Predicated Fact file, PF(d1...dn, m0) (m0 = Boolean predicate result); Dimension files, Di(di, ai1...airi).
Next, theta-join the Dimension files (doing selections and projections first?) using PF as the theta condition, ending up with one large relation, the Universal Predicated Fact file:
UF(d1...dn, a11...a1r1, ..., an1...anrn)
Next, (possibly) structure UF vertically (e.g., using basic Ptrees?). Approach: avoid actually creating the large UF relation at all (it is very large!). Create the UF basic Ptrees directly from the Fact and Dimension basic Ptrees?
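A minimal sketch of the Method-1 predication step in Python; the fact rows, the threshold value, and the field layout are all hypothetical, not from these notes:

```python
# Method-1 sketch: collapse each measurement vector (m1..mk) to one
# Boolean m0 via a predicate. Threshold 100 is a hypothetical choice.
def predicate(measurements, threshold=100):
    """True if the summed measurements exceed a simple threshold."""
    return sum(measurements) > threshold

def predicate_fact_file(fact_rows, n_dims):
    """Map F(d1..dn, m1..mk) rows to PF(d1..dn, m0) rows."""
    pf = []
    for row in fact_rows:
        dims, meas = row[:n_dims], row[n_dims:]
        pf.append(dims + (1 if predicate(meas) else 0,))
    return pf

# Tiny hypothetical fact file: (d1, d2, m1, m2)
F = [(0, 0, 40, 70), (0, 1, 10, 20), (1, 0, 90, 30)]
print(predicate_fact_file(F, n_dims=2))  # [(0, 0, 1), (0, 1, 0), (1, 0, 1)]
```

The measurement columns disappear; only the dimension keys and the single predicate bit m0 remain, which is what makes the Boolean DW simplification possible.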
Method-2: Create the full equi-join of F and all the Di (no predication); the result is also denoted UF. UF can be fully vertically partitioned and data mined (e.g., Nearest Neighbor Classification, NNC, or any other data mining method).
Universal Fact file, UF(d1...dn, a11...a1r1, ..., an1...anrn, m1..mk)
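A minimal sketch of the Method-2 equi-join in Python, using a nested-loop join over tiny hypothetical tables (the table contents and attribute names are illustrative only):

```python
# Method-2 sketch: equi-join the fact file with each dimension file on
# its key column to form the Universal Fact file UF.
def equi_join_dim(rows, dim, key_pos):
    """Extend each row with the dim entry whose key matches row[key_pos]."""
    lookup = {d[0]: d[1:] for d in dim}
    return [row + lookup[row[key_pos]] for row in rows]

F = [(0, 0, 47), (0, 1, 12), (1, 0, 33)]   # hypothetical (d1, d2, m1)
D1 = [(0, 'mon', 2), (1, 'fri', 7)]        # hypothetical (d1, a11, a12)
D2 = [(0, 'usa'), (1, 'gbr')]              # hypothetical (d2, a21)

# Join on d1 first, then on d2: UF(d1, d2, m1, a11, a12, a21)
UF = equi_join_dim(equi_join_dim(F, D1, 0), D2, 1)
print(UF)  # [(0, 0, 47, 'mon', 2, 'usa'), (0, 1, 12, 'mon', 2, 'gbr'), (1, 0, 33, 'fri', 7, 'usa')]
```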
A UF example

Date dimension: date_key (d), Day (a), day_of_wk (w), Month (m), Quarter (q), Year (200y)
  d: 0 1 2
  a: 4 9 2
  w: m f t
  m: 2 7 6
  q: 1 3 3
  y: 2 1 3

Product dimension: prod_key (p), prod_name (n), Brand (b), Supplier (s)
  p: 0 1 2
  n: j i k
  b: r r u
  s: 0 0 2

Country dimension: country_key (c), Legalname (l), Continent (o)
  c: 0  1
  l: us gb
  o: 0  1

Sales Fact Table: date_key (d), product_key (p), country_key (c), Total-$-sold meas. (t)
(18 tuples, written as vertical columns, one character per tuple)
  d: 000000111111222222
  p: 001122001122001122
  c: 010101010101010101
  t: 470121336504642517

UFF (the Universal Fact File: the same 18 tuples with all dimension attributes joined in):
  d: 000000111111222222
  p: 001122001122001122
  c: 010101010101010101
  t: 470121336504642517
  a: 444444999999222222
  w: mmmmmmfffffftttttt
  m: 222222777777666666
  q: 111111333333333333
  y: 222222111111333333
  n: jjiikkjjiikkjjiikk
  b: rrrruurrrruurrrruu
  s: 000022000022000022
  l: usgbusgbusgbusgbusgbusgbusgbusgbusgb
  o: 010101010101010101
Nearest Neighbor Classification (NNC)
Many UF mining research topics can be pursued. E.g., for any DW data area: Association Rule Mining (ARM), Clustering, Classification (e.g., NNC and other NN methods), Iceberg Querying, Case-Based and Rough-Set Classification, NN search, Outlier/Noise Analysis, OLAP operator implementation, Query Processing, and Vertical DW maintenance (e.g., upon inserting next-day data).
The research may be quite different depending on the data area; e.g., Dr. Slator is interested in Classification of Virtual Cell data with respect to which students do well.
NNC: Given a Training Set, a similarity measure, and an unclassified tuple, find a set of nearest neighbors from the Training Set. Those neighbors predict the class through a plurality vote (or a similarity-weighted vote). How many neighbors? E.g., kNNC finds the k nearest neighbors; dNNC takes all neighbors within a similarity d. Note: NNC requires a similarity measure on pairs of tuples for "nearest" to make sense.
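The kNNC procedure just described can be sketched in Python; the training set, the Hamming similarity measure, and k = 3 here are illustrative assumptions:

```python
from collections import Counter

def hamming(x, y):
    """Number of mismatching positions (a distance, i.e., inverse similarity)."""
    return sum(a != b for a, b in zip(x, y))

def knn_classify(train, sample, k=3):
    """train: list of (features, class_label). Plurality vote of the k nearest."""
    nearest = sorted(train, key=lambda t: hamming(t[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set of bit tuples with class labels.
train = [((0, 0, 1), 'hi'), ((0, 1, 1), 'hi'), ((1, 1, 0), 'lo'), ((1, 0, 0), 'lo')]
print(knn_classify(train, (0, 0, 0), k=3))  # 'hi'
```

Swapping the `sorted(...)[:k]` slice for a distance-threshold filter turns this into the dNNC variant (all neighbors within similarity d).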
Classification: Choose a feature attribute as the "class label" (it may be composite?), i.e., the column(s) you want to classify tuples with respect to.
A Classifier is a program whose input is an unclassified tuple (no class label yet) and whose output is a predicted class label for that input. How is that prediction made? It is based on already-classified tuples (the Training Set) of historical data.
NNC example from Precision Agriculture
The Training Set, T, consists of an aerial photograph (a TIFF image taken during a growing season) and a synchronized yield map (crop yield taken that same year at harvest): T(R, G, B, Y), ~100,000 tuples.
Producers want to classify Y = yield (e.g., Hi, Med, Low) based on color intensity (R, G, B). Y = Yield is the class label attribute. Using last year's data set as Training Data, producers want a classifier that takes an (R, G, B) triple as input (from an image taken during the current growing season) and outputs a predicted yield for that pixel of their field.
Then they can apply additional Nitrogen on just those parts of the field that need it to increase yield, without wasting N on the parts that will likely have high enough yield anyway (avoiding application of excess N, which would just run off into rivers and contaminate the ground water).
This classifier would help save N costs, maximize yield, and protect the environment!
UF (predicated) example: the Fact file is F(d1, d2, d3, m1, m2, m3). Predication on the mi's results in PF(d1, d2, d3, m0). Dimensions: D1(d1, a10, a11, a12, a13), D2(d2, a20, a21, a22), and D3(d3, a30, a31).

D1( d1  a10  a11  a121 a122 a123  a13 )
     0    0    1    1    1    1    c
     1    1    1    1    0    0    s
     2    0    1    1    1    1    c
     3    0    0    0    1    0    s

D2( d2  a201 a202  a21  a22 )
     0    0    1    1    c
     1    0    1    0    a
     2    1    1    1    b
     3    0    1    0    a

D3( d3  a30  a31 )
     0    c    1
     1    a    0
     2    b    1
     3    a    0

[Figure: the predicated fact file PF(d1, d2, d3, m0) drawn as a 4x4x4 bit cube over d1, d2, d3, with a 1 plotted in each cell where m0 = 1.]
NNC example: Choose D2.a22 as the Class Attribute, C.

The ordering used on the previous slide is shown here: Generalized Peano order, sorting on d11, then d21, then d31, then d12, then d22, then d32, ... (the origin is in the top back left corner).

[Figure: the cube spread out along the d1, d2, d3 axes, so you can see what's going on.]

Using the standard orientation (origin in the bottom back left corner, X = d1, Y = d2, Z = d3) and Generalized Peano order, (x1, y1, z1, x2, y2, z2, x3, y3, z3).

[Figure: the cube enlarged, in the standard orientation and Generalized Peano order.]
Example UF with a 2-D Reflexive Fact File (a graph), i.e., a 2-D reflexive relationship on a single dimension file; e.g., a Protein-Protein interaction graph. Note: the dimension files are identical copies of the gene table.

Graph G (as Edge Table), G(Tid1 Tid2):
t1 t2, t1 t3, t1 t5, t1 t6, t2 t1, t2 t7, t3 t1, t3 t2, t3 t3, t3 t5, t5 t1, t5 t3, t5 t5, t5 t7, t6 t1, t7 t2, t7 t5

Single Dimension File, R(Tid a1 a2 a3 a4 a5 a6 a7 a8 a9 C):
t1  1 0 1 0 0 0 1 1 0  1
t2  0 1 1 0 1 1 0 0 0  1
t3  0 1 0 0 1 0 0 0 1  1
t4  1 0 1 1 0 0 1 0 1  1
t5  0 1 0 1 0 0 1 1 0  0
t6  1 0 1 0 1 0 0 0 1  0
t7  0 0 1 1 0 0 1 1 0  0
Note: Given any 2-D Reflexive Fact File (Graph), the standard Universal Fact File will be denoted UF1. UF2 will denote the UF coming from the "2-hop Graph" Fact File (the join of G with itself), G2 = (G Tid2 JOIN Tid1' G')[Tid1, Tid2']. UF3 will come from the "3-hop Graph" Fact File, G3 = (G2 Tid2 JOIN Tid1' G')[Tid1, Tid2'], and so on.
Graph G (as a Reflexive 2-D relationship; rows Tid1, columns Tid2):
     t1 t2 t3 t4 t5 t6 t7
t1    0  1  1  0  1  1  0
t2    1  0  0  0  0  0  1
t3    1  1  1  0  1  0  0
t4    0  0  0  0  0  0  0
t5    1  0  1  0  1  0  1
t6    1  0  0  0  0  0  0
t7    0  1  0  0  1  0  0
For this example: UF = UF1 = R THETAJOIN R' (theta-join with THETA = G).

UF1
d1 d2  a1 a2 a3 a4 a5 a6 a7 a8 a9  C   a1'a2'a3'a4'a5'a6'a7'a8'a9' C'
t1 t2   1  0  1  0  0  0  1  1  0  1    0  1  1  0  1  1  0  0  0  1
t1 t3   1  0  1  0  0  0  1  1  0  1    0  1  0  0  1  0  0  0  1  1
t1 t5   1  0  1  0  0  0  1  1  0  1    0  1  0  1  0  0  1  1  0  0
t1 t6   1  0  1  0  0  0  1  1  0  1    1  0  1  0  1  0  0  0  1  0
t2 t1   0  1  1  0  1  1  0  0  0  1    1  0  1  0  0  0  1  1  0  1
t2 t7   0  1  1  0  1  1  0  0  0  1    0  0  1  1  0  0  1  1  0  0
t3 t1   0  1  0  0  1  0  0  0  1  1    1  0  1  0  0  0  1  1  0  1
t3 t2   0  1  0  0  1  0  0  0  1  1    0  1  1  0  1  1  0  0  0  1
t3 t3   0  1  0  0  1  0  0  0  1  1    0  1  0  0  1  0  0  0  1  1
t3 t5   0  1  0  0  1  0  0  0  1  1    0  1  0  1  0  0  1  1  0  0
t5 t1   0  1  0  1  0  0  1  1  0  0    1  0  1  0  0  0  1  1  0  1
t5 t3   0  1  0  1  0  0  1  1  0  0    0  1  0  0  1  0  0  0  1  1
t5 t5   0  1  0  1  0  0  1  1  0  0    0  1  0  1  0  0  1  1  0  0
t5 t7   0  1  0  1  0  0  1  1  0  0    0  0  1  1  0  0  1  1  0  0
t6 t1   1  0  1  0  1  0  0  0  1  0    1  0  1  0  0  0  1  1  0  1
t7 t2   0  0  1  1  0  0  1  1  0  0    0  1  1  0  1  1  0  0  0  1
t7 t5   0  0  1  1  0  0  1  1  0  0    0  1  0  1  0  0  1  1  0  0
Recursively, for k > 1 (letting G1 = G):
Gk = (Gk-1 gk JOIN g1' G')(g1, ..., gk+1), where gk+1 = g2'
UFk = R Gk-join R', where Gk-join is the theta-join using Gk[g1, gk+1].
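The Gk recursion can be sketched as plain relational composition over an edge set; the 3-node cycle used here is a hypothetical stand-in for G:

```python
# k-hop graph sketch: match G^{k-1}'s end node with G's start node,
# then project to (start, end) pairs.
def compose(gk, g):
    """(a,b) in gk and (b,c) in g  =>  (a,c) in the result."""
    return {(a, c) for (a, b) in gk for (b2, c) in g if b == b2}

G = {(1, 2), (2, 3), (3, 1)}  # hypothetical directed 3-cycle
G2 = compose(G, G)            # 2-hop pairs
G3 = compose(G2, G)           # 3-hop pairs
print(sorted(G2))  # [(1, 3), (2, 1), (3, 2)]
print(sorted(G3))  # [(1, 1), (2, 2), (3, 3)]
```

Iterating `compose` until the edge set stops changing finds the k at which Gk (and hence UFk) stabilizes, as happens at k = 3 in the protein-interaction example.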
UF1 in raster order, rows t00..t77 (tij means (ti, tj)): only the 17 edge rows, t12, t13, t15, t16, t21, t27, t31, t32, t33, t35, t51, t53, t55, t57, t61, t72, t75, carry values; the other 47 rows are blank.

A UF1 template: a 1-bit wherever there are values, a 0-bit wherever there are blanks; i.e., the 64-row bit matrix whose 17 edge rows are all 1-bits and whose other 47 rows are all 0-bits.

The full relation, UF1, in raster order with padded zeros: each edge row tij holds R(ti) followed by R'(tj), and every non-edge row is all 0-bits.

Each column is a 0-dimensional basic Ptree (just a bit sequence; a fanout = 0 tree, no compression). Later in these notes, there is discussion of techniques for building the 1-D basic Ptree set and the 2-D basic Ptree set for this Universal Fact File.
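The vertical partitioning step can be sketched directly: each attribute column of the relation becomes a 0-dimensional basic Ptree, i.e., an uncompressed bit sequence (the rows below are hypothetical):

```python
# Vertical partitioning sketch: rows of equal-length bit tuples become
# one bit string per attribute column (a fanout=0 "basic Ptree").
def vertical_columns(rows):
    """Return the column bit strings of a list of bit tuples."""
    return [''.join(str(r[i]) for r in rows) for i in range(len(rows[0]))]

rows = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]  # hypothetical 3-attribute relation
print(vertical_columns(rows))  # ['101', '011', '110']
```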
G2 = (G g2 JOIN g1' G')(g1, g2, g2'), projected to G2[g1, g3]; the 2-hop pairs are
G2 = {t1t1, t1t2, t1t3, t1t5, t1t7, t2t2, t2t3, t2t5, t2t6, t3t1, t3t2, t3t3, t3t5, t3t6, t3t7, t5t1, t5t2, t5t3, t5t5, t5t6, t5t7, t6t2, t6t3, t6t5, t6t6, t7t1, t7t2, t7t3, t7t5, t7t6}.

UF2: one row per pair titj in G2 (30 rows); as in UF1, row titj is R(ti) followed by R'(tj) (including the class columns C and C').
G3 = (G2 g3 JOIN g1' G')[g1, g4]. t4 is absent (no interactions). All other pairs appear except two: t2t6 and t6t6 are absent.

UF3: one row per pair titj in G3 (34 rows: every pair over {t1, t2, t3, t5, t6, t7} except t2t6 and t6t6); again, row titj is R(ti) followed by R'(tj).

G4 = (G3 g4 JOIN g1' G')[g1, g5]: t4 doesn't appear (no interaction), and every other possibility appears except the same two, t2t6 and t6t6.

Note: UF3 = UF4 = UF5 = ... = UFi for all i > 2, since Gi = G3.
[Figure: the graph G drawn from F (the Edge Table), with nodes t1..t7 and its 17 directed edges.]
From the R Ptrees and the PF (Fact file) Ptrees, create the Ptrees for UF?

PF, written as an 8x8 bit pattern in raster order (rows and columns t0..t7, zero-padded; the t0 and t4 rows and columns are all zero):

00000000
00110110
01000001
01110100
00000000
01010101
01000000
00100100

For UF1[a1]: replicate the padded column R[a1] = 01001010 as the columns of an 8x8 matrix (so row ti is constantly R[a1](ti)), then AND with PF:

UF1[a1] = 00000000 00110110 00000000 00000000 00000000 00000000 01000000 00000000

with 1-bits at t12, t13, t15, t16, and t61.

For UF1[a1']: replicate R'[a1] = R[a1]tr as the rows of the matrix (each row is 0 1 0 0 1 0 1 0), then AND with PF:

UF1[a1'] = 00000000 00000010 01000000 01000000 00000000 01000000 01000000 00000000

with 1-bits at t16, t21, t31, t51, and t61.
[Figure: the quadrant-by-quadrant (Peano-order) construction of the compressed Ptrees, the PG-pattern Ptree and P R[a1]-replicated, from their bit matrices.]
Class research project? Develop the algorithm and code for creating the basic P R[ai]-replicated Ptrees and (therefore) the P UF[ai] Ptrees from the PF and R Ptrees.
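A possible starting point for that project, sketched in Python: build the UF1[ai] bit matrix directly from the dimension column R[ai] and the PF template, without materializing UF1 as a relation (the 3x3 template and column below are hypothetical; `primed` selects the row-replication used for the ai' columns):

```python
# Sketch: UF1[ai](i,j) = R[ai](i) AND PF(i,j)   (unprimed: column replication)
#         UF1[ai'](i,j) = R[ai](j) AND PF(i,j)  (primed: row replication)
def uf_column(r_col, pf, primed=False):
    """r_col: list of n bits; pf: n x n 0/1 template matrix."""
    n = len(r_col)
    if primed:
        return [[r_col[j] & pf[i][j] for j in range(n)] for i in range(n)]
    return [[r_col[i] & pf[i][j] for j in range(n)] for i in range(n)]

pf = [[0, 1, 1], [1, 0, 0], [0, 1, 0]]   # hypothetical 3x3 template
print(uf_column([1, 0, 1], pf))          # [[0, 1, 1], [0, 0, 0], [0, 1, 0]]
print(uf_column([1, 0, 1], pf, True))    # [[0, 0, 1], [1, 0, 0], [0, 0, 0]]
```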
For UF1[a2]: replicate R[a2] = 00110100 as the columns of the matrix, then AND with the PF pattern:

UF1[a2] = 00000000 00000000 01000001 01110100 00000000 01010101 00000000 00000000

with 1-bits at t21, t27, t31, t32, t33, t35, t51, t53, t55, and t57.

For UF1[a2']: replicate R[a2]tr as the rows of the matrix, then AND with the PF pattern:

UF1[a2'] = 00000000 00110100 00000000 00110100 00000000 00010100 00000000 00100100

with 1-bits at t12, t13, t15, t32, t33, t35, t53, t55, t72, and t75.
Note that the cardinality of the UFk file may fill up quickly (with respect to k). E.g., in the previous example, for k > 2 the cardinality is maximal (34), almost full (49 would be full). Even for k = 1, the cardinality is already 17, more than double that of k = 0 (7) and 35% of full. If there were 100,000 genes involved, e.g., the full size would be 10,000,000,000 (10 billion) rows. Instead of joining, one can simply apply quantifiers across the graph. E.g., quantifying universally across the graph:

UFU (a1 a2 a3 a4 a5 a6 a7 a8 a9  C  a1'a2'a3'a4'a5'a6'a7'a8'a9')
t1   1  0  1  0  0  0  1  1  0   1   0  0  0  0  0  0  0  0  0
t2   0  1  1  0  1  1  0  0  0   1   0  0  1  0  0  0  1  1  0
t3   0  1  0  0  1  0  0  0  1   1   0  0  0  0  0  0  0  0  0
t5   0  1  0  1  0  0  1  1  0   0   0  0  0  0  0  0  0  0  0
t6   1  0  1  0  1  0  0  0  1   0   1  0  1  0  0  0  1  1  0
t7   0  0  1  1  0  0  1  1  0   0   0  1  0  0  0  0  0  0  0

The existential quantifier across the graph yields:

UFE (a1 a2 a3 a4 a5 a6 a7 a8 a9  C  a1'a2'a3'a4'a5'a6'a7'a8'a9')
t1   1  0  1  0  0  0  1  1  0   1   1  1  1  1  1  1  1  1  1
t2   0  1  1  0  1  1  0  0  0   1   1  0  1  1  0  0  1  1  0
t3   0  1  0  0  1  0  0  0  1   1   1  1  1  1  1  1  1  1  1
t5   0  1  0  1  0  0  1  1  0   0   1  1  1  1  1  0  1  1  1
t6   1  0  1  0  1  0  0  0  1   0   1  0  1  0  0  0  1  1  0
t7   0  0  1  1  0  0  1  1  0   0   0  1  1  1  1  1  1  1  0
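The two quantifier variants can be sketched as a neighbor-wise OR (existential, UFE) and a neighbor-wise AND (universal, UFU) over the graph; the tiny R and G below are hypothetical, and giving tuples with no out-neighbors an all-zero primed half is an assumed convention:

```python
# Quantify across the graph instead of joining: the primed half of each
# tuple t aggregates the attribute rows of t's out-neighbors.
def quantify(R, G, op):
    """R: {tid: bit tuple}; G: edge set; op: any (UFE) or all (UFU)."""
    out = {}
    for t, feats in R.items():
        nbrs = [R[v] for (u, v) in G if u == t]
        primed = tuple(int(op(row[i] for row in nbrs)) if nbrs else 0
                       for i in range(len(feats)))
        out[t] = feats + primed
    return out

R = {1: (1, 0), 2: (0, 1), 3: (1, 1)}  # hypothetical 2-attribute table
G = {(1, 2), (1, 3)}                   # hypothetical edges
print(quantify(R, G, any))  # existential: neighbor-OR
print(quantify(R, G, all))  # universal: neighbor-AND
```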
UF NNC scan example: Find the 3 nearest neighbors in UF1. Current practice is to find the 3NN set by scanning. E.g., use Hamming distance, d(x,y) = the number of mismatches, to C-classify the sample (a1..a9) = 0 0 1 1 0 0 1 0 0. Choose class label = C in UF1 (the Training Set listed earlier).
Sample: (a1 a2 a3 a4 a5 a6 a7 a8 a9) = 0 0 1 1 0 0 1 0 0.

3NN set so far (seeded with the first three tuples; columns a1..a9, C, d):
t1 t2  1 0 1 0 0 0 1 1 0  C=1  d=3
t1 t3  1 0 1 0 0 0 1 1 0  C=1  d=3
t1 t5  1 0 1 0 0 0 1 1 0  C=1  d=3

Scanning the remaining tuples: t1t6 d=3 (don't replace); t2t1 d=5; t2t7 d=5; t3t1 d=6; t3t2 d=6; t3t3 d=6; t3t5 d=6; t5t1 d=3; t5t3 d=3; t5t5 d=3; t5t7 d=3; t6t1 d=5; then t7t2, 1 mismatch, d=1, replace:
t7 t2  0 0 1 1 0 0 1 1 0  C=0  d=1
and t7t5, 1 mismatch, d=1, replace:
t7 t5  0 0 1 1 0 0 1 1 0  C=0  d=1

Final plurality vote winner: C=0.
UF NNC scan example 2: the sample is (a5 a6 a1' a2' a3' a4') = (0 0 0 0 0 0); class label = C. Again use Hamming distance, d(x,y) = the number of mismatches, over UF1 (the Training Set listed earlier).
3NN set so far (seeded with the first three tuples; columns a5, a6, C, a1', a2', a3', a4', d):
t1 t2  0 0 1 0 1 1 0  d=2
t1 t3  0 0 1 0 1 0 0  d=1
t1 t5  0 0 1 0 1 0 1  d=2

Scanning the remaining tuples: t1t6 d=2 (don't replace); t2t1 d=4; t2t7 d=4; t3t1 d=3; t3t2 d=3; t3t3 d=2; t3t5 d=3; t5t1 d=2; then t5t3, d=1, replace:
t5 t3  0 0 0 0 1 0 0  d=1
then t5t5 d=2; t5t7 d=2; t6t1 d=3; t7t2 d=2; t7t5 d=2.

Final winner: C=1.
UF NNC scan example 2 (cont.): for the same sample, (a5 a6 a1' a2' a3' a4') = (0 0 0 0 0 0), finding all training points within distance 2 of the sample takes another scan, using the same scan methods over UF1 (the Training Set listed earlier).
Besides t1t3 and t5t3 (d=1), this scan also includes every tuple at d=2: t1t2, t1t5, t1t6, t3t3, t5t1, t5t5, t5t7, t7t2, t7t5. The tuples at d >= 3 (t2t1, t2t7, t3t1, t3t2, t3t5, t6t1) are not included. The vote histogram is then taken over this within-distance-2 set.
UF NNC Ptree Example 1, using 0-D Ptrees (plain bit sequences). Sample: a = (a5 a6 a1' a2' a3' a4') = (0 0 0 0 0 0).

The UF1 columns as 17-bit sequences (tuple order: t1t2, t1t3, t1t5, t1t6, t2t1, t2t7, t3t1, t3t2, t3t3, t3t5, t5t1, t5t3, t5t5, t5t7, t6t1, t7t2, t7t5):

a1   11110000000000100
a2   00001111111111000
a3   11111100000000111
a4   00000000001111011
a5   00001111110000100
a6   00001100000000000
a7   11110000001111011
a8   11110000001111011
a9   00000011110000100
C    11111111110000000
a1'  00011010001000100
a2'  11100001110110011
a3'  10011111001001110
a4'  00100100010011001
a5'  11010001100100010
a6'  10000001000000010
a7'  00101110011011101
a8'  00101110011011101
a9'  01010000100100000
C'   11001011101100110

Identify all training tuples in the distance-0 ring ("0ring") centered at a, i.e., the exact matches, as the 1-bits of the Ptree (writing ~ for complement):

P = ~a5 ^ ~a6 ^ ~a1' ^ ~a2' ^ ~a3' ^ ~a4' = 00000000000000000

There are no training points in a's 0ring! We must look further out, i.e., at a's 1ring.
UF NNC Ptree ex-1 (cont.) a’s 1ring? a=a5 a6 a1’a2’a3’a4’ = (000000)
C'11001011101100110
d1 d2
t1 t2
t1 t3
t1 t5
t1 t6
t2 t1
t2 t7
t3 t1
t3 t2
t3 t3
t3 t5
t5 t1
t5 t3
t5 t5
t5 t7
t6 t1 t7 t2
t7 t5
a1
11110000000000100
a2 00001111111111000
a3
11111100000000111
a4
000000000 01111011
a5
00001111110000100
a6
00001100000000000
a7
11110000001111011
a8
11110000001111011
a9
00000011110000100
C11111111110000000
a1‘00011010001000100
a2‘ 11100001110110011
a3‘ 10011111001001110
a4‘ 00100100010011001
a5‘ 110100011001000 10
a6‘10000001000000010
a7‘ 00101110011011101
a8‘00101110011011101
a9‘01010000100100000
Training pts in the 1ring centered at a are given by 1-bits in the Ptree, P, constructed as follows:
0 1P01000000000100000
The C=1 vote count = root count of P^C.The C=0 vote count = root count of P^C.(never need to know which tuples voted)
a4‘
11011011101100110
a3‘
01100000110110001
a2‘
00011110001001100
a1‘
11100101110111011
a6
11110011111111111
a5
00001111110000100
a4‘
11011011101100110
a3‘
01100000110110001
a2‘
00011110001001100
a1‘
11100101110111011
a6
0 0001100000000000
a5
1111000000111 1011
a4‘
11011011101100110
a3‘
01100000110110001
a2‘
00011110001001100
a1‘
00011010001000100
a6
11110011111111111
a5
1111000000111 1011
a4‘
11011011101100110
a3‘
01100000110110001
a2‘
1 1100001110110011
a1‘
11100101110111011
a6
11110011111111111
a5
1111000000111 1011
a4‘
11011011101100110
a3‘
1 0011111001001110
a2‘
00011110001001100
a1‘
11100101110111011
a6
11110011111111111
a5
1111000000111 1011
a4‘
0 0100100010011001
a3‘
01100000110110001
a2‘
00011110001001100
a1‘
11100101110111011
a6
11110011111111111
a5
1111000000111 1011
The six resulting Ptrees, P(100000), P(010000), P(001000), P(000100), P(000010), P(000001), where each pattern gives the bits of (a5 a6 a1' a2' a3' a4'), are then ORed together.
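A sketch of that 1-ring construction: for each of the six candidate patterns, AND the six Ptrees with exactly one left uncomplemented, then OR the six results (bit strings copied from the slide's listing):

```python
MASK = (1 << 17) - 1
ptrees = [int(s, 2) for s in (
    "00001111110000100",  # a5
    "00001100000000000",  # a6
    "00011010001000100",  # a1'
    "11100001110110011",  # a2'
    "10011111001001110",  # a3'
    "00100100010011001",  # a4'
)]

P1 = 0
for j in range(6):            # j = the one attribute where a neighbor differs
    t = MASK
    for i, p in enumerate(ptrees):
        t &= p if i == j else (~p & MASK)
    P1 |= t

print(format(P1, "017b"))     # 01000000000100000, matching the slide's P
```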
a’s 2-ring? a=a5 a6 a1’a2’a3’a4’ = (000000)
For each of the following 15 Ptrees, a 1-bit corresponds to a training point in a's 2-ring: one Ptree per way of choosing two of the six attributes to leave uncomplemented in the AND, i.e., the C(6,2) = 15 patterns (110000), (101000), (100100), (100010), (100001), (011000), (010100), (010010), (010001), (001100), (001010), (001001), (000110), (000101), (000011). 1st line first:
[Slide ANDs the operand stacks for the 1st line of patterns, (110000), (101000), (100100), (100010), (100001), and updates the vote histogram from the results.]
Stop here? But the other 10 Ptrees should also be considered. The fact that the 2-ring brings in so many new training points is "the curse of dimensionality".
Enfranchising the rest of a’s 2-ring? a=a5 a6 a1’a2’a3’a4’ = (000000)
[Slide ANDs the operand stacks for the 2nd line of patterns: (011000), (010100), (010010), (010001).]
Enfranchising the rest of a’s 2-ring (cont.) a=a5 a6 a1’a2’a3’a4’ = (000000)
[Slide ANDs the operand stacks for the 3rd line of patterns: (001100), (001010), (001001).]
Enfranchising the rest of a’s 2-ring (cont.) a=a5 a6 a1’a2’a3’a4’ = (000000)
[Slide ANDs the operand stacks for the 4th line of patterns: (000110), (000101).]
P2 = 10100000000010011
Enfranchising the rest of a’s 2-ring (cont.) a=a5 a6 a1’a2’a3’a4’ = (000000)
[Slide ANDs the operand stack for the 5th line's single pattern, (000011).]
P3 = 00000000000001000
Justification for using vertical structures (once again)?
• For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing?
R( A1  A2  A3  A4)
   010 111 110 001
   011 111 110 000
   010 110 101 001
   010 111 101 111
   101 010 001 100
   010 010 001 101
   111 000 001 100
   111 000 001 100
vertically partitioned into bit slices R11 R12 R13 | R21 R22 R23 | R31 R32 R33 | R41 R42 R43
• For data mining workloads, the result is often a single bit (Yes/No, True/False) or another unstructured result (e.g., a histogram); there is no reconstructive post-processing, and the actual data records need never be involved?
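The point can be illustrated on the relation R(A1 A2 A3 A4) above: slice A1 vertically into three bit vectors and answer a yes/no question with one AND and a zero test, never touching a horizontal record. A sketch; the query "any tuple with A1 >= 6?" is an invented example:

```python
# A1 column of the slide's relation R: 3 bits per value, 8 tuples.
a1_col = [0b010, 0b011, 0b010, 0b010, 0b101, 0b010, 0b111, 0b111]
n = len(a1_col)

# Vertical partition: R11, R12, R13 = one bit vector per bit position.
def bit_slice(b):  # b = 0 is the high-order bit
    return sum(((v >> (2 - b)) & 1) << (n - 1 - i) for i, v in enumerate(a1_col))

R11, R12, R13 = bit_slice(0), bit_slice(1), bit_slice(2)

# "Any tuple with A1 >= 6?"  A1 >= 6 iff its top two bits are both 1.
hit = R11 & R12
print(hit != 0, bin(hit).count("1"))   # True 2 -> yes, two such tuples
```

The answer (a bit, plus optionally a count) comes straight from the slices; no record is ever reassembled.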
Paper Topics in the area of NNC on UF?
If you decide to do a research project in this area, you might pick a particular DW area (VirtualCell data, Bioinformatics data, Market Basket data, Text data, Sales data, Scientific data, Astronomical data, ….).
Then discover an interpretation of the results of NNC that gives new, useful info.
e.g., in the last example NNC problem, if the data is gene-expression data and C=1 means the gene is associated with a particular cancer, the previous results might be interpreted as: "if none of the treatments a5, a6, a1', a2', a3', a4' express at a threshold level, then the dissolved tissue is predicted to be cancerous" (2/3 probability in the scan-based NNC algorithm and 6/11 probability in the Ptree-based NNC algorithm).
Other research projects in this setting could involve:
1. Looking at one of the other data mining techniques (clustering, ARM…) and applying it to a new data area.
2. Developing efficient algorithms (implement them and prove that they are efficient) of the various steps in this data mining methodology (or any other).
E.g., an efficient algorithm for "producing the basic Ptrees for a UF from the basic Ptrees for F and the Di's without having to actually construct the massive UF in the process" is suggested in these notes, but the details (or a better method?) and performance work would make a good topic.
Paper Topics in the area of NNC on UF (continued)
Stopping conditions in NNC:
Note that we have assumed the user picks a k ahead of time (in our example, k=3), then finds the k nearest training neighbors to vote on the class assignment (or the 1st ring in which at least k voters appear: the closed kNNC method).
In kNNC the prior choice of k determines when to stop accumulating voters.
Other methods (address the curse of dimensionality)?
Weight the votes by similarity (distance) from the unclassified sample? By weighting attributes beforehand? By weighting votes depending upon distance out? Both? (Or something else?)
All training points within a predefined similarity level (rather than count level)?
Build out in rings until the histogram shows a clear enough winner?
Note that the histogram doesn't necessarily get good and stay good, so build out past the 1st good histogram to see if the 2nd "good" histogram is even better?…
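The "build out in rings until the histogram shows a clear enough winner" idea can be sketched as follows (unweighted, on the example's bit strings; the margin of 3 is an arbitrary illustrative stopping threshold, not from these notes):

```python
from itertools import combinations

MASK = (1 << 17) - 1
ptrees = [int(s, 2) for s in (
    "00001111110000100", "00001100000000000", "00011010001000100",
    "11100001110110011", "10011111001001110", "00100100010011001")]
C = int("11111111110000000", 2)

def ring(d):
    # OR of the Ptrees with exactly d attributes uncomplemented.
    P = 0
    for flips in combinations(range(6), d):
        t = MASK
        for i, p in enumerate(ptrees):
            t &= p if i in flips else (~p & MASK)
        P |= t
    return P

votes = [0, 0]                  # votes[c] = accumulated vote count for class c
for d in range(7):              # build outward, ring by ring
    P = ring(d)
    votes[1] += bin(P & C).count("1")
    votes[0] += bin(P & ~C & MASK).count("1")
    if abs(votes[1] - votes[0]) >= 3:   # "clear enough winner" margin
        break

print(d, votes)   # stops at d = 4 with votes [7, 10] -> class 1 wins
```

Each ring costs only ANDs and root counts, so a stop-anytime loop like this is cheap with the vertical representation.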
Another example of Ptree NNC, using weights (about the only way to address the curse of dimensionality): a = (a5 a6 a1' a2' a3' a4') = (010010), attribute weights (1, 1, 3, 3, 3, 3), vote weight = 1/(1+distance).
d(p,q) = Σ{weight_i : p and q differ at i}. Identifying all training tuples in the 0-ring centered at a (the exact matches) as 1-bits of the Ptree P = a5_^a6^a1'_^a2'_^a3'^a4'_ (factors complemented where a's bit is 0; here a = (010010)).
P = 00000000000000000
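The weighted distance and vote weight can be written directly from the definitions on this slide. A sketch; in practice the tuple bit patterns come from the Ptrees, but here one is spelled out by hand from the slide's bit strings:

```python
WEIGHTS = (1, 1, 3, 3, 3, 3)        # for (a5, a6, a1', a2', a3', a4')
a = (0, 1, 0, 0, 1, 0)              # the unclassified sample, a = (010010)

def dist(p, q):
    # d(p,q) = sum of weight_i over the positions where p and q differ
    return sum(w for w, x, y in zip(WEIGHTS, p, q) if x != y)

def vote_weight(p, q):
    return 1 / (1 + dist(p, q))

t1t2 = (0, 0, 0, 1, 1, 0)           # training tuple (t1,t2)'s six bits
print(dist(a, t1t2), vote_weight(a, t1t2))   # 4 0.2
```

With these weights a ring is a set of flip patterns of equal total weight, not of equal Hamming distance, which is why the ring-by-ring slides below check specific attribute combinations.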
a’s 1ring? a=a5 a6 a1’a2’a3’a4’ = (010010)
P = 00000000000000000
[Slide ANDs the operand stacks for the two distance-1 neighbor patterns, (110010) and (000010); both yield empty Ptrees.]
attribute weights (1, 1, 3, 3, 3, 3); vote weight = 1/(1+distance); d(p,q) = Σ{weight_i : p and q differ at i}
a’s 2ring? a=a5 a6 a1’a2’a3’a4’ = (010010)
P = 00000000000000000
[Slide ANDs the operand stack for the single distance-2 neighbor pattern, (100010), flipping both weight-1 attributes; it yields an empty Ptree.]
a’s 3-ring? a=a5 a6 a1’a2’a3’a4’ = (010010)
Identify all training points in the 3-ring centered at a: check each of a1', a2', a3', a4' (weight 3) as the single difference, in turn.
[Slide ANDs the four operand stacks, one per weight-3 attribute flipped.]
P = 00000000000000000
a’s 4-ring? a=a5 a6 a1’a2’a3’a4’ =(010010)
Identify all training points in the 4-ring centered at a: check each listed combination of attribute flips totaling weight 4 (one weight-1 attribute together with one weight-3 attribute), in turn.
[Slide ANDs the four operand stacks for the 4-ring patterns.]
Vote Tally (so far): C=0: 0; C=1: 1/5, then 2/5
a’s 5-ring? a=a5 a6 a1’a2’a3’a4’ =(010010)
Identify all training points in the 5-ring centered at a: check a5 and a6 together with each of a1', a2', a3', a4' (weight 1+1+3 = 5) as differing, in turn.
[Slide ANDs the four operand stacks for the 5-ring patterns.]
Vote Tally: C=0: 5/30; C=1: 2/5 + 1/6 = 17/30, then 17/30 + 5/30 = 22/30
Stop here? (C=1 is the winner.) An Interactive Stop-on-Command (ISoC) system?
Note: an ISoC system seems easy with vertical Ptree methods, but hard with horizontal scan methods?
Projects?: Implement such an ISoC NNC? Allow users to decide attribute weights interactively as well?
A Ptree NNC that stops only after all classes have at least 1 vote (or after certain thresholds are achieved?): how does this perform w.r.t. standard stopping methods?
I think users would really like a system in which they could interactively control the vote and also do a "recall" vote if they don't like the outcome (a California NNC, or CNNC?).
Iceberg Queries
• On any relation (not just the UF of a DW), R(a1,…,an,b), find all tuples for which an aggregate (e.g., sum) over a set of attribute(s) exceeds a threshold. (Why "iceberg"? Because the result set is small and therefore the tip of the iceberg.)
– SELECT * FROM R GROUP BY ai1,…,aik HAVING aggr(b) ≥ threshold;
– E.g., SALES(CUST, ITEM, TIME, CTRY, $SOLD): a typical "who? what? when? where?" data cube (wwww data cube) with measurement "how much?"
SELECT * FROM SALES GROUP BY CUST, ITEM HAVING SUM($SOLD) ≥ $10M
(i.e., "Which are our big customer-item match-ups over all time and locations?")
Ptrees: for a = (a1,…,an), output a if Σi=1..8 RootCount(Pa ∧ Pb,i) · 2^(8−i) ≥ threshold (b = b1…b8 in bits).
Still, must we sequence through all a-values? Assuming very few meet the threshold (the iceberg assumption), devise a pruning mechanism for the search by considering each bit in turn (from the high-order bit on down).
First, all combos: Σi=1..8 RootCount(Pa1,1 ∧ … ∧ Pan,1 ∧ Pb,i) · 2^(8−i), where an underscore indicates a choice of complemented or not.
Then Σi=1..8 RootCount(Pa1,1 ∧ Pa1,2 ∧ … ∧ Pan,1 ∧ Pan,2 ∧ Pb,i) · 2^(8−i), etc.
Whenever an attribute makes the threshold for only one choice (complemented or not), eliminate the other. Whenever an attribute makes the threshold for no choice, we are done (no iceberg).
Σi=1..8 RootCount(Pa1,…,aj,1..aj,k,…,an ∧ Pb,i) · 2^(8−i) (k = 1..8 enumerates the bits of aj; j = 1..n enumerates the attributes). If this falls below the threshold for some k (i.e., below the level at which the remaining bits of aj could still reach the threshold), prune aj. Continue from the surviving aj's.
Assume the measurement values range over 0..255 (8 bits).
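A sketch of the bit-sliced sum at the heart of the Ptree iceberg query. The toy rows and the customer names are invented for illustration; the measure b fits in 8 bits as assumed above:

```python
# Toy SALES-like rows: (customer, $sold), $sold in 0..255.
rows = [("c1", 200), ("c1", 180), ("c2", 40), ("c1", 250), ("c2", 90)]
n = len(rows)

def bitvec(pred):
    # Bit i (from the left) is 1 iff row i satisfies pred.
    return sum(1 << (n - 1 - i) for i, r in enumerate(rows) if pred(r))

group_c1 = bitvec(lambda r: r[0] == "c1")                 # the group's mask
b_slices = [bitvec(lambda r, j=j: (r[1] >> (7 - j)) & 1)  # 8 bit slices of b
            for j in range(8)]

# SUM(b) over the group = sum over slices of rootcount * 2^(7 - j).
total = sum(bin(group_c1 & s).count("1") << (7 - j)
            for j, s in enumerate(b_slices))
print(total, total >= 500)   # 630 True -> "c1" clears a $500 threshold
```

The grouped sum is recovered from root counts alone, which is what makes the high-order-bit-first pruning above possible: a partial sum over the leading slices already bounds the final value.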
Appendix: scratch slides