![Page 1: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/1.jpg)
Information Theory For Data Management
Divesh Srivastava
Suresh Venkatasubramanian
![Page 2: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/2.jpg)
-- Abstruse Goose (177)
Motivation
Information Theory is relevant to all of humanity...
![Page 3: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/3.jpg)
Background
Many problems in data management need precise reasoning about information content, transfer and loss:
– Structure extraction
– Privacy preservation
– Schema design
– Probabilistic data?
![Page 4: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/4.jpg)
Information Theory
First developed by Shannon as a way of quantifying capacity of signal channels.
Entropy, relative entropy and mutual information capture intrinsic informational aspects of a signal
Today:
– Information theory provides a domain-independent way to reason about structure in data
– More information = interesting structure
– Less information linkage = decoupling of structures
![Page 5: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/5.jpg)
Tutorial Thesis
Information theory provides a mathematical framework for the quantification of information content, linkage and loss.
This framework can be used in the design of data management strategies that rely on probing the structure of information in data.
![Page 6: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/6.jpg)
Tutorial Goals
– Introduce information-theoretic concepts to VLDB audience
– Give a ‘data-centric’ perspective on information theory
– Connect these to applications in data management
– Describe underlying computational primitives
Illuminate when and how information theory might be of use in new areas of data management.
![Page 7: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/7.jpg)
Outline
Part 1
– Introduction to Information Theory
– Application: Data Anonymization
– Application: Data Integration
Part 2
– Review of Information Theory Basics
– Application: Database Design
– Computing Information Theoretic Primitives
– Open Problems
![Page 8: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/8.jpg)
Histograms And Discrete Distributions
x1
x2
x1
x1
x4
x2
x3
x1
X
Column of data
X f(X)
x1 4
x2 2
x3 1
x4 1
Histogram
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Probability distribution
Column of data → (aggregate counts) → Histogram → (normalize) → Probability distribution
![Page 9: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/9.jpg)
Histograms And Discrete Distributions
x1
x2
x1
x1
x4
x2
x3
x1
X
Column of data
X f(X)
x1 4
x2 2
x3 1
x4 1
Histogram
X p(X)
x1 0.667
x2 0.2
x3 0.067
x4 0.067
Probability distribution
aggregate counts
X f(x)*w(X)
x1 4*5=20
x2 2*3=6
x3 1*2=2
x4 1*2=2
Histogram → (reweight) → weighted counts f(x)*w(x) → (normalize) → Probability distribution
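The aggregate, reweight and normalize steps can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_distribution` and its `weights` parameter are assumptions made here, not part of the tutorial.

```python
from collections import Counter

def empirical_distribution(column, weights=None):
    """Return {value: probability} for a column of discrete values.

    weights (hypothetical parameter) maps value -> weight w(x); when given,
    each count f(x) is multiplied by w(x) before normalizing.
    """
    counts = Counter(column)                      # aggregate counts f(x)
    if weights is not None:
        counts = {x: c * weights[x] for x, c in counts.items()}
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}  # normalize

column = ["x1", "x2", "x1", "x1", "x4", "x2", "x3", "x1"]
p = empirical_distribution(column)
p_weighted = empirical_distribution(
    column, weights={"x1": 5, "x2": 3, "x3": 2, "x4": 2}
)
```

With the column above, `p` matches the unweighted table (0.5, 0.25, 0.125, 0.125) and `p_weighted` matches the reweighted one (20/30, 6/30, 2/30, 2/30).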
![Page 10: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/10.jpg)
From Columns To Random Variables
We can think of a column of data as “represented” by a random variable:
– X is a random variable
– p(X) is the column of probabilities p(X = x1), p(X = x2), and so on
– Also known (in unweighted case) as the empirical distribution induced by the column X.
Notation:
– X (upper case) denotes a random variable (column)
– x (lower case) denotes a value taken by X (field in a tuple)
– p(x) is the probability p(X = x)
![Page 11: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/11.jpg)
Joint Distributions
Discrete distribution: probability p(X,Y,Z)
p(Y) = ∑x p(X=x,Y) = ∑x ∑z p(X=x,Y,Z=z)
X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
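As a rough sketch, marginalization is just summing the joint probabilities over the coordinates being dropped. The helper name `marginal` and its `keep` argument are illustrative assumptions; the joint table is the one shown above.

```python
from collections import defaultdict

# Joint distribution p(X, Y, Z): every listed tuple has probability 0.125.
joint_xyz = {
    ("x1", "y1", "z1"): 0.125, ("x1", "y2", "z2"): 0.125,
    ("x1", "y1", "z2"): 0.125, ("x1", "y2", "z1"): 0.125,
    ("x2", "y3", "z3"): 0.125, ("x2", "y3", "z4"): 0.125,
    ("x3", "y3", "z5"): 0.125, ("x4", "y3", "z6"): 0.125,
}

def marginal(joint, keep):
    """Sum out every coordinate whose position is not listed in `keep`."""
    out = defaultdict(float)
    for outcome, prob in joint.items():
        out[tuple(outcome[i] for i in keep)] += prob
    return dict(out)

p_x = marginal(joint_xyz, keep=(0,))      # p(X)
p_y = marginal(joint_xyz, keep=(1,))      # p(Y)
p_xy = marginal(joint_xyz, keep=(0, 1))   # p(X, Y)
```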
![Page 12: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/12.jpg)
Entropy Of A Column
Let h(x) = log2 1/p(x); h(X) is the column of h(x) values.
H(X) = EX[h(x)] = ∑x p(x) log2 1/p(x)
Two views of entropy:
– It captures uncertainty in data: higher entropy, more unpredictability
– It captures information content: higher entropy, more information.
X p(X) h(X)
x1 0.5 1
x2 0.25 2
x3 0.125 3
x4 0.125 3
H(X) = 1.75 < log |X| = 2
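A minimal sketch of the entropy computation for the table above, in bits. The function name `entropy` is an assumption for illustration.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a distribution given as {value: probability}."""
    return sum(q * math.log2(1.0 / q) for q in p.values() if q > 0)

p_x = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
h_x = entropy(p_x)             # expected value of h(x) = log2 1/p(x)
h_max = math.log2(len(p_x))    # entropy of the uniform distribution on 4 values
```

Here `h_x` is 1.75 bits, below the 2 bits of a uniform distribution over the same four values.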
![Page 13: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/13.jpg)
Examples
X uniform over [1, ..., 4]. H(X) = 2
Y is 1 with probability 0.5, and otherwise uniform over [2, 3, 4].
– H(Y) = 0.5 log 2 + 0.5 log 6 ~= 1.8 < 2
– Y is more sharply defined, and so has less uncertainty.
Z uniform over [1, ..., 8]. H(Z) = 3 > 2
– Z spans a larger range, and captures more information
[Figure: the distributions of X, Y and Z]
![Page 14: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/14.jpg)
Comparing Distributions
How do we measure the difference between two distributions?
Kullback-Leibler divergence:
– dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi)
[Figure: an inference mechanism takes a prior belief to a resulting belief]
![Page 15: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/15.jpg)
Comparing Distributions
Kullback-Leibler divergence:
– dKL(p, q) = Ep[ h(q) – h(p) ] = ∑i pi log(pi/qi)
– dKL(p, q) >= 0
– Captures the extra information needed to describe p given q
– Is asymmetric! dKL(p, q) != dKL(q, p)
– Is not a metric (does not satisfy triangle inequality)
There are other measures:– 2-distance, variational distance, f-divergences, …
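The asymmetry of the Kullback-Leibler divergence is easy to see numerically. The sketch below uses two made-up distributions `p` and `q` (hypothetical values, not from the slides) and a helper named `kl_divergence` introduced here for illustration.

```python
import math

def kl_divergence(p, q):
    """d_KL(p, q) in bits; inf if q gives zero mass to an outcome that p supports."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

p = {"a": 0.5, "b": 0.5}   # hypothetical distribution
q = {"a": 0.9, "b": 0.1}   # hypothetical distribution
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```

Both values are non-negative, and they differ (roughly 0.74 versus 0.53 bits), which is why the divergence cannot be treated as a symmetric distance.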
![Page 16: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/16.jpg)
Conditional Probability
Given a joint distribution on random variables X, Y, how much information about X can we glean from Y ?
Conditional probability: p(X|Y)
– p(X = x1 | Y = y1) = p(X = x1, Y = y1)/p(Y = y1)
X Y p(X,Y) p(X|Y) p(Y|X)
x1 y1 0.25 1.0 0.5
x1 y2 0.25 1.0 0.5
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 1.0
x4 y3 0.125 0.25 1.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
![Page 17: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/17.jpg)
Conditional Entropy
Let h(x|y) = log2 1/p(x|y)
H(X|Y) = Ex,y[h(x|y)] = ∑x ∑y p(x,y) log2 1/p(x|y)
H(X|Y) = H(X,Y) – H(Y) = 2.25 – 1.5 = 0.75
If X, Y are independent, H(X|Y) = H(X)
X Y p(X,Y) p(X|Y) h(X|Y)
x1 y1 0.25 1.0 0.0
x1 y2 0.25 1.0 0.0
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 2.0
x4 y3 0.125 0.25 2.0
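The two routes to H(X|Y), the direct expectation of h(x|y) and the difference H(X,Y) – H(Y), can be checked against each other on the table above. Function names are illustrative assumptions.

```python
import math
from collections import defaultdict

p_xy = {
    ("x1", "y1"): 0.25, ("x1", "y2"): 0.25, ("x2", "y3"): 0.25,
    ("x3", "y3"): 0.125, ("x4", "y3"): 0.125,
}

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X|Y) = sum over (x, y) of p(x,y) * log2(1 / p(x|y)), joint keyed by (x, y)."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    # 1 / p(x|y) = p(y) / p(x,y)
    return sum(p * math.log2(p_y[y] / p) for (_, y), p in joint.items() if p > 0)

h_x_given_y = conditional_entropy(p_xy)   # direct definition
h_xy = entropy(p_xy.values())             # H(X,Y)
h_y = entropy([0.25, 0.25, 0.5])          # H(Y)
```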
![Page 18: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/18.jpg)
Mutual Information
Mutual information captures the difference between the joint distribution on X and Y, and the marginal distributions on X and Y.
Let i(x;y) = log p(x,y)/(p(x)p(y))
I(X;Y) = Ex,y[i(x;y)] = ∑x ∑y p(x,y) log p(x,y)/(p(x)p(y))
X Y p(X,Y) h(X,Y) i(X;Y)
x1 y1 0.25 2.0 1.0
x1 y2 0.25 2.0 1.0
x2 y3 0.25 2.0 1.0
x3 y3 0.125 3.0 1.0
x4 y3 0.125 3.0 1.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
![Page 19: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/19.jpg)
Mutual Information: Strength of linkage
I(X;Y) = H(X) + H(Y) – H(X,Y) = H(X) – H(X|Y) = H(Y) – H(Y|X)
If X, Y are independent, then I(X;Y) = 0:
– H(X,Y) = H(X) + H(Y), so I(X;Y) = H(X) + H(Y) – H(X,Y) = 0
I(X;Y) <= min (H(X), H(Y)):
– Suppose Y = f(X) (deterministically)
– Then H(Y|X) = 0, and so I(X;Y) = H(Y) – H(Y|X) = H(Y)
Mutual information captures higher-order interactions:
– Covariance captures “linear” interactions only
– Two variables can be uncorrelated (covariance = 0) and have nonzero mutual information:
– X ∈R [-1,1], Y = X^2. Cov(X,Y) = 0, I(X;Y) = H(Y) > 0
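A small numerical check of the "uncorrelated but dependent" point, under one loud assumption: instead of a continuous X on [-1, 1], the sketch takes X uniform on the three values {-1, 0, 1}, which keeps every quantity a finite sum. The helper name `mutual_information` is illustrative.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution keyed by (x, y)."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return sum(
        p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items() if p > 0
    )

# Assumed discrete stand-in: X uniform on {-1, 0, 1}, Y = X^2.
joint_sq = {(x, x * x): 1 / 3 for x in (-1, 0, 1)}

mean_x = sum(x * p for (x, _), p in joint_sq.items())
mean_y = sum(y * p for (_, y), p in joint_sq.items())
cov = sum(x * y * p for (x, y), p in joint_sq.items()) - mean_x * mean_y
i_sq = mutual_information(joint_sq)
```

In this discrete stand-in the covariance is 0, while the mutual information is about 0.92 bits, which is exactly H(Y) because Y is a deterministic function of X.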
![Page 20: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/20.jpg)
Information-Theoretic Clustering
Clustering takes a collection of objects and groups them.
– Given a distance function between objects
– Choice of measure of complexity of clustering
– Choice of measure of cost for a cluster
Usually,
– Distance function is Euclidean distance
– Number of clusters is measure of complexity
– Cost measure for cluster is sum-of-squared-distance to center
Goal: minimize complexity and cost
– Inherent tradeoff between the two
![Page 21: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/21.jpg)
Feature Representation
v1
v2
v1
v1
v4
v2
v3
v1
X
Column of data
X f(X)
v1 4
v2 2
v3 1
v4 1
Histogram
X p(X)
v1 0.5
v2 0.25
v3 0.125
v4 0.125
Probability distribution
normalizeaggregate counts
Let V = {v1, v2, v3, v4}
X is “explained” by distribution over V.
“Feature vector” of X is [0.5, 0.25, 0.125, 0.125]
![Page 22: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/22.jpg)
Feature Representation
Feature vectors p(V|X), one row per object X:
      v1    v2    v3     v4
X1    0.5   0.25  0.125  0.125
X2    0.5   0.2   0.15   0.15
Example: p(v2|X2) = 0.2
![Page 23: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/23.jpg)
Information-Theoretic Clustering
Clustering takes a collection of objects and groups them.
– Given a distance function between objects
– Choice of measure of complexity of clustering
– Choice of measure of cost for a cluster
In information-theoretic setting
– What is the distance function?
– How do we measure complexity?
– What is a notion of cost/quality?
Goal: minimize complexity and maximize quality
– Inherent tradeoff between the two
![Page 24: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/24.jpg)
Measuring complexity of clustering
Take 1: complexity of a clustering = #clusters
– standard model of complexity.
Doesn’t capture the fact that clusters have different sizes.
![Page 25: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/25.jpg)
Measuring complexity of clustering
Take 2: Complexity of clustering = number of bits needed to describe it.
Writing down “k” needs log k bits.
In general, let cluster t ∈ T have |t| elements.
– set p(t) = |t|/n
– #bits to write down cluster sizes = H(T) = ∑t pt log 1/pt
[Figure: two example clusterings compared by H(T), one with smaller entropy than the other; images not recoverable]
![Page 26: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/26.jpg)
Information-theoretic Clustering (take I)
Given data X = x1, ..., xn explained by variable V, partition X into clusters (represented by T) such that
H(T) is minimized and quality is maximized
![Page 27: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/27.jpg)
Soft clusterings
In a “hard” clustering, each point is assigned to exactly one cluster.
Characteristic function
– p(t|x) = 1 if x ∈ t, 0 if not.
Suppose we allow points to partially belong to clusters:
– p(T|x) is a distribution.
– p(t|x) is the “probability” of assigning x to t
How do we describe the complexity of a clustering ?
![Page 28: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/28.jpg)
Measuring complexity of clustering
Take 1:
– p(t) = ∑x p(x) p(t|x)
– Compute H(T) as before.
Problem:
H(T1) = H(T2) !!
Memberships p(t|x), with the averaged cluster masses p(T) in the last row:
T1    t1    t2        T2    t1    t2
x1    0.5   0.5       x1    0.99  0.01
x2    0.5   0.5       x2    0.01  0.99
p(T)  0.5   0.5       p(T)  0.5   0.5
![Page 29: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/29.jpg)
Measuring complexity of clustering
By averaging the memberships, we’ve lost useful information.
Take II: Compute I(T;X) !
Even better: If T is a hard clustering of X, then I(T;X) = H(T)
X T1 p(X,T) i(X;T)
x1 t1 0.25 0
x1 t2 0.25 0
x2 t1 0.25 0
x2 t2 0.25 0
I(T1;X) = 0
X T2 p(X,T) i(X;T)
x1 t1 0.495 0.99
x1 t2 0.005 -5.64
x2 t1 0.005 -5.64
x2 t2 0.495 0.99
I(T2;X) = 0.92
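The following sketch recomputes I(T;X) for the two soft clusterings, taking p(x1) = p(x2) = 0.5 and the membership rows (0.5, 0.5), (0.5, 0.5) for T1 and (0.99, 0.01), (0.01, 0.99) for T2. The function name `mutual_information_soft` is an assumption for illustration.

```python
import math

def mutual_information_soft(p_x, p_t_given_x):
    """I(T;X) in bits for a soft clustering.

    p_x[i] is p(x_i); p_t_given_x[i][k] is the membership p(t_k | x_i).
    """
    k = len(p_t_given_x[0])
    # Averaged cluster mass p(t) = sum_x p(x) p(t|x)
    p_t = [sum(px * row[j] for px, row in zip(p_x, p_t_given_x)) for j in range(k)]
    total = 0.0
    for px, row in zip(p_x, p_t_given_x):
        for j, ptx in enumerate(row):
            if ptx > 0:
                # p(x,t) * log2( p(x,t) / (p(x) p(t)) ), using p(x,t) = p(x) p(t|x)
                total += px * ptx * math.log2(ptx / p_t[j])
    return total

p_x = [0.5, 0.5]
i_t1 = mutual_information_soft(p_x, [[0.5, 0.5], [0.5, 0.5]])
i_t2 = mutual_information_soft(p_x, [[0.99, 0.01], [0.01, 0.99]])
```

Both clusterings have the same averaged cluster masses, so H(T) cannot tell them apart, but `i_t1` is zero while `i_t2` is strictly positive.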
![Page 30: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/30.jpg)
Information-theoretic Clustering (take II)
Given data X = x1, ..., xn explained by variable V, partition X into clusters (represented by T) such that
I(T;X) is minimized and quality is maximized
![Page 31: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/31.jpg)
Measuring cost of a cluster
Given objects Xt = {X1, X2, …, Xm} in cluster t,
Cost(t) = (1/m) ∑i d(Xi, C) = ∑i p(Xi) dKL(p(V|Xi), C)
where C = (1/m) ∑i p(V|Xi) = ∑i p(Xi) p(V|Xi) = p(V)
![Page 32: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/32.jpg)
Mutual Information = Cost of Cluster
Cost(t) = (1/m) ∑i d(Xi, C) = ∑i p(Xi) dKL(p(V|Xi), p(V))
∑i p(Xi) dKL(p(V|Xi), p(V)) = ∑i p(Xi) ∑j p(vj|Xi) log p(vj|Xi)/p(vj)
= ∑i,j p(Xi, vj) log p(vj, Xi)/(p(vj)p(Xi))
= I(Xt; V) !!
Cost of a cluster = I(Xt; V)
![Page 33: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/33.jpg)
Cost of a clustering
If we partition X into k clusters X1, ..., Xk
Cost(clustering) = ∑i pi I(Xi; V)
(pi = |Xi|/|X|)
![Page 34: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/34.jpg)
Cost of a clustering
Each cluster center t can be “explained” in terms of V:
– p(V|t) = ∑i p(Xi) p(V|Xi)
Suppose we treat each cluster center itself as a point:
![Page 35: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/35.jpg)
Cost of a clustering
We can write down the “cost” of this “cluster”
– Cost(T) = I(T;V)
Key result [BMDG05]: Cost(clustering) = I(X;V) – I(T;V)
Minimizing cost(clustering) => maximizing I(T;V)
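The decomposition of the clustering cost into a difference of two mutual informations can be sanity-checked numerically. In this sketch the first two feature vectors are X1 and X2 from the feature-representation slide; the other two, the cluster assignment, and all helper names are hypothetical, and every object is given equal weight 1/n.

```python
import math

def kl(p, q):
    """d_KL(p, q) in bits for two probability vectors of equal length."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def mix(dists, weights):
    """Weighted average of probability vectors (a cluster center)."""
    return [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(len(dists[0]))]

features = [                       # rows are p(V | x)
    [0.5, 0.25, 0.125, 0.125],     # X1 from the slides
    [0.5, 0.2, 0.15, 0.15],        # X2 from the slides
    [0.1, 0.1, 0.4, 0.4],          # hypothetical
    [0.05, 0.15, 0.4, 0.4],        # hypothetical
]
clusters = [[0, 1], [2, 3]]        # hypothetical hard clustering
n = len(features)

p_v = mix(features, [1 / n] * n)                         # p(V)
i_xv = sum(kl(f, p_v) / n for f in features)             # I(X;V)
centers = [mix([features[i] for i in c], [1 / len(c)] * len(c)) for c in clusters]
i_tv = sum(len(c) / n * kl(ctr, p_v) for c, ctr in zip(clusters, centers))  # I(T;V)
# Cost(clustering) = sum_t p_t * (average KL from each member to its cluster center)
cost = sum(
    len(c) / n * sum(kl(features[i], ctr) / len(c) for i in c)
    for c, ctr in zip(clusters, centers)
)
```

For this toy input `cost` and `i_xv - i_tv` agree up to floating-point error, consistent with the stated result.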
![Page 36: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/36.jpg)
Information-theoretic Clustering (take III)
Given data X = x1, ..., xn explained by variable V, partition X into clusters (represented by T) such that
I(T;X) – β I(T;V) is minimized
This is the Information Bottleneck Method [TPB98]
Agglomerative techniques exist for the case of ‘hard’ clusterings
β is the tradeoff parameter between complexity and cost
I(T;X) and I(T;V) are in the same units.
![Page 37: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/37.jpg)
Information Theory: Summary
We can represent data as discrete distributions (normalized histograms)
Entropy captures uncertainty or information content in a distribution
The Kullback-Leibler distance captures the difference between distributions
Mutual information and conditional entropy capture linkage between variables in a joint distribution
We can formulate information-theoretic clustering problems
![Page 38: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/38.jpg)
Outline
Part 1
– Introduction to Information Theory
– Application: Data Anonymization
– Application: Data Integration
Part 2
– Review of Information Theory Basics
– Application: Database Design
– Computing Information Theoretic Primitives
– Open Problems
![Page 39: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/39.jpg)
Data Anonymization Using Randomization
Goal: publish anonymized microdata to enable accurate ad hoc analyses, but ensure privacy of individuals’ sensitive attributes
Key ideas:
– Randomize numerical data: add noise from known distribution
– Reconstruct original data distribution using published noisy data
Issues:
– How can the original data distribution be reconstructed?
– What kinds of randomization preserve privacy of individuals?
Information Theory for Data Management - Divesh & Suresh
![Page 40: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/40.jpg)
Data Anonymization Using Randomization
Many randomization strategies proposed [AS00, AA01, EGS03]
Example randomization strategies: X in [0, 10]
– R = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– R = X + μ (mod 11), μ is in {-1 (p = 0.25), 0 (p = 0.5), 1 (p = 0.25)}
– R = X with p = 0.6; otherwise (p = 0.4) R = μ, where μ is uniform in [0, 10]
Question:
– Which randomization strategy has higher privacy preservation?
– Quantify loss of privacy due to publication of randomized data
![Page 41: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/41.jpg)
Data Anonymization Using Randomization
X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
Id X
s1 0
s2 3
s3 5
s4 0
s5 8
s6 0
s7 6
s8 0
![Page 42: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/42.jpg)
Data Anonymization Using Randomization
X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
Id X μ
s1 0 -1
s2 3 0
s3 5 1
s4 0 0
s5 8 1
s6 0 -1
s7 6 1
s8 0 0
→
Id R1
s1 10
s2 3
s3 6
s4 0
s5 9
s6 10
s7 7
s8 0
![Page 43: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/43.jpg)
Data Anonymization Using Randomization
X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
Id X μ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
→
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
![Page 44: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/44.jpg)
Reconstruction of Original Data Distribution
X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– Reconstruct distribution of X using knowledge of R1 and the distribution of μ
– EM algorithm converges to MLE of original distribution [AA01]
Id X μ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
→
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
→
Id X | R1
s1 {10, 0, 1}
s2 {1, 2, 3}
s3 {4, 5, 6}
s4 {0, 1, 2}
s5 {8, 9, 10}
s6 {9, 10, 0}
s7 {4, 5, 6}
s8 {0, 1, 2}
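The candidate sets X | R1 and an EM-style reconstruction can be sketched as follows. This is a generic expectation-maximization iteration for estimating p(X) from the published values and the known noise distribution, written here under that assumption; it is not taken from [AA01], and the names `candidate_set` and `em_reconstruct` are hypothetical.

```python
def candidate_set(r, noise=(-1, 0, 1), modulus=11):
    """Values of X that could have produced r under R = X + mu (mod modulus)."""
    return sorted((r - mu) % modulus for mu in noise)

def em_reconstruct(published, noise_pmf, modulus=11, iterations=200):
    """Estimate p(X) on {0, ..., modulus-1} from published values R = X + mu (mod modulus)."""
    # Index the noise by residue, so that mu = -1 is stored under modulus - 1.
    noise = {mu % modulus: w for mu, w in noise_pmf.items()}
    p = [1.0 / modulus] * modulus              # start from the uniform distribution
    for _ in range(iterations):
        updated = [0.0] * modulus
        for r in published:
            # E-step: posterior over X for this record under the current estimate
            post = [p[x] * noise.get((r - x) % modulus, 0.0) for x in range(modulus)]
            z = sum(post)
            for x in range(modulus):
                updated[x] += post[x] / z
        # M-step: the new estimate is the average of the per-record posteriors
        p = [u / len(published) for u in updated]
    return p

published = [0, 2, 5, 1, 9, 10, 5, 1]          # R1 column from the table above
uniform_noise = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
estimate = em_reconstruct(published, uniform_noise)
```

With only eight records the estimate is necessarily rough; the point of the sketch is the shape of the iteration. Values that no published record could have come from, such as X = 7 here, receive zero mass after the first pass.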
![Page 45: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/45.jpg)
Analysis of Privacy [AS00]
X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– If X is uniform in [0, 10], privacy determined by range of μ
Id X μ
s1 0 0
s2 3 -1
s3 5 0
s4 0 1
s5 8 1
s6 0 -1
s7 6 -1
s8 0 1
→
Id R1
s1 0
s2 2
s3 5
s4 1
s5 9
s6 10
s7 5
s8 1
→
Id X | R1
s1 {10, 0, 1}
s2 {1, 2, 3}
s3 {4, 5, 6}
s4 {0, 1, 2}
s5 {8, 9, 10}
s6 {9, 10, 0}
s7 {4, 5, 6}
s8 {0, 1, 2}
![Page 46: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/46.jpg)
Analysis of Privacy [AA01]
X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– If X is uniform in [0, 1] ∪ [5, 6], privacy smaller than range of μ
Id X μ
s1 0 0
s2 1 -1
s3 5 0
s4 6 1
s5 0 1
s6 1 -1
s7 5 -1
s8 6 1
→
Id R1
s1 0
s2 0
s3 5
s4 7
s5 1
s6 0
s7 4
s8 7
→
Id X | R1
s1 {10, 0, 1}
s2 {10, 0, 1}
s3 {4, 5, 6}
s4 {6, 7, 8}
s5 {0, 1, 2}
s6 {10, 0, 1}
s7 {3, 4, 5}
s8 {6, 7, 8}
![Page 47: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/47.jpg)
Analysis of Privacy [AA01]
X in [0, 10], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– If X is uniform in [0, 1] ∪ [5, 6], privacy smaller than range of μ
– In some cases, sensitive value revealed
Id X μ
s1 0 0
s2 1 -1
s3 5 0
s4 6 1
s5 0 1
s6 1 -1
s7 5 -1
s8 6 1
→
Id R1
s1 0
s2 0
s3 5
s4 7
s5 1
s6 0
s7 4
s8 7
→
Id X | R1
s1 {0, 1}
s2 {0, 1}
s3 {5, 6}
s4 {6}
s5 {0, 1}
s6 {0, 1}
s7 {5}
s8 {6}
![Page 48: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/48.jpg)
Quantify Loss of Privacy [AA01]
Goal: quantify loss of privacy based on mutual information I(X;R)
– Smaller H(X|R) ⇒ more loss of privacy in X by knowledge of R
– Larger I(X;R) ⇒ more loss of privacy in X by knowledge of R
– I(X;R) = H(X) – H(X|R)
I(X;R) used to capture correlation between X and R
– p(X) is the prior knowledge of sensitive attribute X
– p(X, R) is the joint distribution of X and R
![Page 49: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/49.jpg)
Quantify Loss of Privacy [AA01]
Goal: quantify loss of privacy based on mutual information I(X;R)
– X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4
5 5
5 6
6 5
6 6
6 7
X p(X) h(X)
5
6
R1 p(R1) h(R1)
4
5
6
7
![Page 50: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/50.jpg)
Quantify Loss of Privacy [AA01]
Goal: quantify loss of privacy based on mutual information I(X;R)
– X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4 0.17
5 5 0.17
5 6 0.17
6 5 0.17
6 6 0.17
6 7 0.17
X p(X) h(X)
5 0.5
6 0.5
R1 p(R1) h(R1)
4 0.17
5 0.33
6 0.33
7 0.17
![Page 51: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/51.jpg)
Quantify Loss of Privacy [AA01]
Goal: quantify loss of privacy based on mutual information I(X;R)
– X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4 0.17 2.58
5 5 0.17 2.58
5 6 0.17 2.58
6 5 0.17 2.58
6 6 0.17 2.58
6 7 0.17 2.58
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R1 p(R1) h(R1)
4 0.17 2.58
5 0.33 1.58
6 0.33 1.58
7 0.17 2.58
![Page 52: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/52.jpg)
Quantify Loss of Privacy [AA01]
Goal: quantify loss of privacy based on mutual information I(X;R)
– X is uniform in [5, 6], R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– I(X;R1) = 0.33
X R1 p(X,R1) h(X,R1) i(X;R1)
5 4 0.17 2.58 1.0
5 5 0.17 2.58 0.0
5 6 0.17 2.58 0.0
6 5 0.17 2.58 0.0
6 6 0.17 2.58 0.0
6 7 0.17 2.58 1.0
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R1 p(R1) h(R1)
4 0.17 2.58
5 0.33 1.58
6 0.33 1.58
7 0.17 2.58
![Page 53: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/53.jpg)
Quantify Loss of Privacy [AA01]
Goal: quantify loss of privacy based on mutual information I(X;R)
– X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1}
– I(X;R1) = 0.33, I(X;R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
X R2 p(X,R2) h(X,R2) i(X;R2)
5 5 0.25 2.0 1.0
5 6 0.25 2.0 0.0
6 6 0.25 2.0 0.0
6 7 0.25 2.0 1.0
X p(X) h(X)
5 0.5 1.0
6 0.5 1.0
R2 p(R2) h(R2)
5 0.25 2.0
6 0.5 1.0
7 0.25 2.0
![Page 54: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/54.jpg)
Quantify Loss of Privacy [AA01]
Equivalent goal: quantify loss of privacy based on H(X|R)
– X is uniform in [5, 6], R2 = X + μ (mod 11), μ is uniform in {0, 1}
– Intuition: we know more about X given R2, than about X given R1
– H(X|R1) = 0.67, H(X|R2) = 0.5 ⇒ R2 is a bigger privacy risk than R1
X R2 p(X,R2) p(X|R2) h(X|R2)
5 5 0.25 1.0 0.0
5 6 0.25 0.5 1.0
6 6 0.25 0.5 1.0
6 7 0.25 1.0 0.0
X R1 p(X,R1) p(X|R1) h(X|R1)
5 4 0.17 1.0 0.0
5 5 0.17 0.5 1.0
5 6 0.17 0.5 1.0
6 5 0.17 0.5 1.0
6 6 0.17 0.5 1.0
6 7 0.17 1.0 0.0
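The numbers for R1 and R2 can be reproduced by building the joint distribution p(X, R) directly from the randomization rule. This is a sketch; the helper names (`joint_additive`, `marginals`, and so on) are assumptions made for illustration.

```python
import math
from collections import defaultdict

def joint_additive(x_values, noise, modulus=11):
    """p(X, R) for X uniform on x_values and R = X + mu (mod modulus), mu uniform on noise."""
    joint = defaultdict(float)
    weight = 1.0 / (len(x_values) * len(noise))
    for x in x_values:
        for mu in noise:
            joint[(x, (x + mu) % modulus)] += weight
    return dict(joint)

def marginals(joint):
    p_x, p_r = defaultdict(float), defaultdict(float)
    for (x, r), p in joint.items():
        p_x[x] += p
        p_r[r] += p
    return p_x, p_r

def mutual_information(joint):
    """I(X;R) in bits."""
    p_x, p_r = marginals(joint)
    return sum(p * math.log2(p / (p_x[x] * p_r[r])) for (x, r), p in joint.items())

def conditional_entropy(joint):
    """H(X|R) in bits, using 1 / p(x|r) = p(r) / p(x,r)."""
    _, p_r = marginals(joint)
    return sum(p * math.log2(p_r[r] / p) for (_, r), p in joint.items())

joint_r1 = joint_additive([5, 6], [-1, 0, 1])
joint_r2 = joint_additive([5, 6], [0, 1])
i1, h1 = mutual_information(joint_r1), conditional_entropy(joint_r1)
i2, h2 = mutual_information(joint_r2), conditional_entropy(joint_r2)
```

Since H(X) = 1 bit in both cases, `i1 + h1` and `i2 + h2` both equal 1, so ranking the two strategies by larger I(X;R) or by smaller H(X|R) gives the same answer.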
![Page 55: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/55.jpg)
Quantify Loss of Privacy
Example: X is uniform in [0, 1]
– R3 = e (p = 0.9999), R3 = X (p = 0.0001)
– R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
Is R3 or R4 a bigger privacy risk?
![Page 56: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/56.jpg)
Worst Case Loss of Privacy [EGS03]
Example: X is uniform in [0, 1]
– R3 = e (p = 0.9999), R3 = X (p = 0.0001)
– R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
I(X;R3) = 0.0001 << I(X;R4) = 0.029
X R3 p(X,R3) h(X,R3) i(X;R3)
0 e 0.49995 1.0 0.0
0 0 0.00005 14.29 1.0
1 e 0.49995 1.0 0.0
1 1 0.00005 14.29 1.0
X R4 p(X,R4) h(X,R4) i(X;R4)
0 0 0.3 1.74 0.26
0 1 0.2 2.32 -0.32
1 0 0.2 2.32 -0.32
1 1 0.3 1.74 0.26
![Page 57: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/57.jpg)
Worst Case Loss of Privacy [EGS03]
Example: X is uniform in [0, 1]
– R3 = e (p = 0.9999), R3 = X (p = 0.0001)
– R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
I(X;R3) = 0.0001 << I(X;R4) = 0.029
– But R3 has a larger worst case risk
X R3 p(X,R3) h(X,R3) i(X;R3)
0 e 0.49995 1.0 0.0
0 0 0.00005 14.29 1.0
1 e 0.49995 1.0 0.0
1 1 0.00005 14.29 1.0
X R4 p(X,R4) h(X,R4) i(X;R4)
0 0 0.3 1.74 0.26
0 1 0.2 2.32 -0.32
1 0 0.2 2.32 -0.32
1 1 0.3 1.74 0.26
![Page 58: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/58.jpg)
Worst Case Loss of Privacy [EGS03]
Goal: quantify worst case loss of privacy in X by knowledge of R
– Use max KL divergence, instead of mutual information
Mutual information can be formulated as expected KL divergence
– I(X;R) = ∑x ∑r p(x,r)*log2(p(x,r)/(p(x)*p(r))) = KL(p(X,R) || p(X)*p(R))
– I(X;R) = ∑r p(r) ∑x p(x|r)*log2(p(x|r)/p(x)) = ER [KL(p(X|r) || p(X))]
– [AA01] measure quantifies expected loss of privacy over R
[EGS03] propose a measure based on worst case loss of privacy
– IW(X;R) = MAXr [KL(p(X|r) || p(X))]
![Page 59: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/59.jpg)
Worst Case Loss of Privacy [EGS03]
Example: X is uniform in [0, 1]
– R3 = e (p = 0.9999), R3 = X (p = 0.0001)
– R4 = X (p = 0.6), R4 = 1 – X (p = 0.4)
IW(X;R3) = max{0.0, 1.0, 1.0} > IW(X;R4) = max{0.029, 0.029}
X R3 p(X,R3) p(X|R3) i(X;R3)
0 e 0.49995 0.5 0.0
0 0 0.00005 1.0 1.0
1 e 0.49995 0.5 0.0
1 1 0.00005 1.0 1.0
X R4 p(X,R4) p(X|R4) i(X;R4)
0 0 0.3 0.6 0.26
0 1 0.2 0.4 -0.32
1 0 0.2 0.4 -0.32
1 1 0.3 0.6 0.26
![Page 60: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/60.jpg)
Worst Case Loss of Privacy [EGS03]
Example: X is uniform on {5, 6}
– R1 = X + μ (mod 11), μ is uniform in {-1, 0, 1}
– R2 = X + μ (mod 11), μ is uniform in {0, 1}
IW(X;R1) = max{1.0, 0.0, 0.0, 1.0} = IW(X;R2) = max{1.0, 0.0, 1.0}
– Unable to capture that R2 is a bigger privacy risk than R1
X R1 p(X,R1) p(X|R1) i(X;R1)
5 4 0.17 1.0 1.0
5 5 0.17 0.5 0.0
5 6 0.17 0.5 0.0
6 5 0.17 0.5 0.0
6 6 0.17 0.5 0.0
6 7 0.17 1.0 1.0
X R2 p(X,R2) p(X|R2) i(X;R2)
5 5 0.25 1.0 1.0
5 6 0.25 0.5 0.0
6 6 0.25 0.5 0.0
6 7 0.25 1.0 1.0
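The tie can be reproduced from the noise model itself. The sketch below builds p(X,R) for R = X + μ (mod 11) and reports both the worst-case value and the expected value for comparison. The helper names are hypothetical and exact fractions are used only to keep the probabilities clean.

```python
from collections import defaultdict
from fractions import Fraction
from math import log2

def per_output_kl(xs, noise, modulus=11):
    """Map each output r of R = X + mu (mod modulus) to (p(r), KL(p(X|r) || p(X)))."""
    weight = Fraction(1, len(xs) * len(noise))  # X and mu uniform and independent
    joint = defaultdict(Fraction)
    for x in xs:
        for mu in noise:
            joint[(x, (x + mu) % modulus)] += weight
    p_r = defaultdict(Fraction)
    for (x, r), p in joint.items():
        p_r[r] += p
    p_x = Fraction(1, len(xs))
    table = {}
    for r, pr in p_r.items():
        kl = 0.0
        for x in xs:
            if (x, r) in joint:
                cond = joint[(x, r)] / pr  # p(x | r)
                kl += float(cond) * log2(float(cond / p_x))
        table[r] = (float(pr), kl)
    return table

def expected_and_worst(xs, noise):
    table = per_output_kl(xs, noise)
    expected = sum(pr * kl for pr, kl in table.values())
    worst = max(kl for _, kl in table.values())
    return expected, worst

exp1, worst1 = expected_and_worst([5, 6], [-1, 0, 1])
exp2, worst2 = expected_and_worst([5, 6], [0, 1])
print(f"R1: expected={exp1:.3f} worst={worst1:.1f}")
print(f"R2: expected={exp2:.3f} worst={worst2:.1f}")
```

On this example the worst-case values coincide, while the expected value is larger for R2 than for R1.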
![Page 61: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/61.jpg)
Data Anonymization: Summary
Randomization techniques useful for microdata anonymization
– Randomization techniques differ in their loss of privacy
Information theoretic measures useful to capture loss of privacy
– Expected KL divergence captures expected loss of privacy [AA01]
– Maximum KL divergence captures worst case loss of privacy [EGS03]
– Both are useful in practice
![Page 62: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/62.jpg)
Outline
Part 1
– Introduction to Information Theory
– Application: Data Anonymization
– Application: Data Integration
Part 2
– Review of Information Theory Basics
– Application: Database Design
– Computing Information Theoretic Primitives
– Open Problems
![Page 63: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/63.jpg)
Schema Matching
Goal: align columns across database tables to be integrated
– Fundamental problem in database integration
Early useful approach: textual similarity of column names
– False positives: Address ≠ IP_Address
– False negatives: Customer_Id = Client_Number
Early useful approach: overlap of values in columns, e.g., Jaccard
– False positives: Emp_Id ≠ Project_Id
– False negatives: Emp_Id = Personnel_Number
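A small sketch of the value-overlap approach shows how both failure modes arise. The column contents below are made-up illustrative values, not data from the tutorial.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of the value sets of two columns."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical columns: small integer ids overlap even though they mean different things,
# while the same employees encoded differently share no values at all.
emp_id = [101, 102, 103, 104]
project_id = [101, 102, 103, 105]
personnel_number = ["P-101", "P-102", "P-103", "P-104"]

false_positive = jaccard(emp_id, project_id)        # high overlap, different semantics
false_negative = jaccard(emp_id, personnel_number)  # zero overlap, same semantics
print(false_positive, false_negative)
```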
![Page 64: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/64.jpg)
Opaque Schema Matching [KN03]
Goal: align columns when column names, data values are opaque
– Databases belong to different government bureaucracies
– Treat column names and data values as uninterpreted (generic)
Example: EMP_PROJ(Emp_Id, Proj_Id, Task_Id, Status_Id)
– Likely that all Id fields are from the same domain
– Different databases may have different column names
W X Y Z
w2 x1 y1 z2
w4 x2 y3 z3
w3 x3 y3 z1
w1 x2 y1 z2
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
![Page 65: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/65.jpg)
Opaque Schema Matching [KN03]
Approach: build complete, labeled graph G_D for each database D
– Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
– Perform graph matching between G_D1 and G_D2, minimizing distance
Intuition:
– Entropy H(X) captures distribution of values in database column X
– Mutual information I(X;Y) captures correlations between X, Y
– Efficiency: graph matching between schema-sized graphs
![Page 66: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/66.jpg)
Opaque Schema Matching [KN03]
Approach: build complete, labeled graph G_D for each database D
– Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A p(A)
a1 0.5
a3 0.25
a4 0.25
B p(B)
b1 0.25
b2 0.25
b3 0.25
b4 0.25
C p(C)
c1 0.5
c2 0.5
D p(D)
d1 0.25
d2 0.5
d3 0.25
![Page 67: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/67.jpg)
Opaque Schema Matching [KN03]
Approach: build complete, labeled graph G_D for each database D
– Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A h(A)
a1 1.0
a3 2.0
a4 2.0
B h(B)
b1 2.0
b2 2.0
b3 2.0
b4 2.0
C h(C)
c1 1.0
c2 1.0
D h(D)
d1 2.0
d2 1.0
d3 2.0
![Page 68: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/68.jpg)
Opaque Schema Matching [KN03]
Approach: build complete, labeled graph G_D for each database D
– Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5, I(A;B) = 1.5
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
A h(A)
a1 1.0
a3 2.0
a4 2.0
B h(B)
b1 2.0
b2 2.0
b3 2.0
b4 2.0
A B h(A,B) i(A;B)
a1 b2 2.0 1.0
a3 b4 2.0 2.0
a1 b1 2.0 1.0
a4 b3 2.0 2.0
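The node and edge labels can be computed directly from the instance. The following is a minimal sketch that treats each row as equally likely; variable names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log2

rows = [
    ("a1", "b2", "c1", "d1"),
    ("a3", "b4", "c2", "d2"),
    ("a1", "b1", "c1", "d2"),
    ("a4", "b3", "c2", "d3"),
]
cols = "ABCD"

def entropy(values):
    """Empirical entropy in bits of a list of (hashable) values."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

# Node labels: H(X) for every column X.
H = {cols[i]: entropy([r[i] for r in rows]) for i in range(len(cols))}

# Edge labels: I(X;Y) = H(X) + H(Y) - H(X,Y) for every pair of columns.
I = {}
for i, j in combinations(range(len(cols)), 2):
    joint = entropy([(r[i], r[j]) for r in rows])
    I[cols[i] + cols[j]] = H[cols[i]] + H[cols[j]] - joint

print(H)
print(I)
```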
![Page 69: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/69.jpg)
Opaque Schema Matching [KN03]
Approach: build complete, labeled graph G_D for each database D
– Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
[Figure: complete labeled graph G_D on nodes A, B, C, D]
Node labels: H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5
Edge labels: I(A;B) = 1.5, I(A;C) = 1.0, I(A;D) = 1.0, I(B;C) = 1.0, I(B;D) = 1.5, I(C;D) = 0.5
![Page 70: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/70.jpg)
Opaque Schema Matching [KN03]
Approach: build complete, labeled graph G_D for each database D
– Nodes are columns, label(node(X)) = H(X), label(edge(X, Y)) = I(X;Y)
– Perform graph matching between G_D1 and G_D2, minimizing distance
[KN03] uses Euclidean and normal distance metrics
[Figure: the two labeled graphs to be matched]
Graph on W, X, Y, Z: H(W) = 2.0, H(X) = 1.5, H(Y) = 1.0, H(Z) = 1.5; I(W;X) = 1.5, I(W;Y) = 1.0, I(W;Z) = 1.5, I(X;Y) = 0.5, I(X;Z) = 1.0, I(Y;Z) = 1.0
Graph on A, B, C, D: H(A) = 1.5, H(B) = 2.0, H(C) = 1.0, H(D) = 1.5; I(A;B) = 1.5, I(A;C) = 1.0, I(A;D) = 1.0, I(B;C) = 1.0, I(B;D) = 1.5, I(C;D) = 0.5
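For schema-sized graphs the matching step can be sketched by brute force over all column permutations. This is an illustration only: it uses a plain Euclidean distance over node and edge labels and exhaustive search, which is an assumption for the sketch and not necessarily the search procedure or exact metric of [KN03].

```python
from collections import Counter
from itertools import combinations, permutations
from math import log2, sqrt

def entropy(values):
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def labeled_graph(names, rows):
    """Return (node labels H, edge labels I) for the columns of one table."""
    cols = {n: [r[i] for r in rows] for i, n in enumerate(names)}
    H = {n: entropy(cols[n]) for n in names}
    I = {frozenset((a, b)): H[a] + H[b] - entropy(list(zip(cols[a], cols[b])))
         for a, b in combinations(names, 2)}
    return H, I

def best_matching(g1, g2):
    """Exhaustively search for the column mapping with minimum Euclidean distance."""
    (h1, i1), (h2, i2) = g1, g2
    names1, names2 = list(h1), list(h2)
    best = None
    for perm in permutations(names2):
        m = dict(zip(names1, perm))
        sq = sum((h1[a] - h2[m[a]]) ** 2 for a in names1)
        sq += sum((i1[frozenset((a, b))] - i2[frozenset((m[a], m[b]))]) ** 2
                  for a, b in combinations(names1, 2))
        if best is None or sq < best[0]:
            best = (sq, m)
    return sqrt(best[0]), best[1]

g1 = labeled_graph("WXYZ", [("w2", "x1", "y1", "z2"), ("w4", "x2", "y3", "z3"),
                            ("w3", "x3", "y3", "z1"), ("w1", "x2", "y1", "z2")])
g2 = labeled_graph("ABCD", [("a1", "b2", "c1", "d1"), ("a3", "b4", "c2", "d2"),
                            ("a1", "b1", "c1", "d2"), ("a4", "b3", "c2", "d3")])
distance, mapping = best_matching(g1, g2)
print(distance, mapping)
```

On the two example tables the search finds a mapping whose node and edge labels agree exactly. Exhaustive search is factorial in the number of columns, so it is only reasonable for small schemas.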
![Page 71: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/71.jpg)
![Page 72: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/72.jpg)
![Page 73: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/73.jpg)
Heterogeneity Identification [DKOSV06]
Goal: identify columns with semantically heterogeneous values
– Can arise due to opaque schema matching [KN03]
Key ideas:
– Heterogeneity based on distribution, distinguishability of values
– Use Information Bottleneck to compute soft clustering of values
Issues:
– Which information theoretic measure characterizes heterogeneity?
– How to set parameters in the Information Bottleneck method?
![Page 74: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/74.jpg)
Heterogeneity Identification [DKOSV06]
Example: semantically homogeneous, heterogeneous columns
Customer_Id
Customer_Id
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
![Page 75: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/75.jpg)
![Page 76: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/76.jpg)
Heterogeneity Identification [DKOSV06]
Example: semantically homogeneous, heterogeneous columns
More semantic types in column ⇒ greater heterogeneity
– Only email versus email + phone
Customer_Id
Customer_Id
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
![Page 77: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/77.jpg)
Heterogeneity Identification [DKOSV06]
Example: semantically homogeneous, heterogeneous columns
Customer_Id
(877)-807-4596
Customer_Id
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
![Page 78: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/78.jpg)
Heterogeneity Identification [DKOSV06]
Example: semantically homogeneous, heterogeneous columns
Relative distribution of semantic types impacts heterogeneity
– Mainly email + few phone versus balanced email + phone
Customer_Id
(877)-807-4596
Customer_Id
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
![Page 79: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/79.jpg)
Heterogeneity Identification [DKOSV06]
Example: semantically homogeneous, heterogeneous columns
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
Customer_Id
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
![Page 80: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/80.jpg)
![Page 81: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/81.jpg)
Heterogeneity Identification [DKOSV06]
Example: semantically homogeneous, heterogeneous columns
More easily distinguished types ⇒ greater heterogeneity
– Phone + (possibly) SSN versus balanced email + phone
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
Customer_Id
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
![Page 82: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/82.jpg)
Heterogeneity Identification [DKOSV06]
Heterogeneity = space complexity of soft clustering of the data
– More, balanced clusters ⇒ greater heterogeneity
– More distinguishable clusters ⇒ greater heterogeneity
Soft clustering
– Soft: assign probabilities to membership of values in clusters
– How many clusters: tradeoff between space and quality
– Use Information Bottleneck to compute soft clustering of values
![Page 83: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/83.jpg)
Heterogeneity Identification [DKOSV06]
Hard clustering
X = Customer_Id T = Cluster_Id
187-65-2468 t1
987-64-6837 t1
789-15-4321 t1
987-65-4321 t1
(908)-555-1234 t2
973-360-0000 t1
360-0007 t3
(877)-807-4596 t2
![Page 84: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/84.jpg)
Heterogeneity Identification [DKOSV06]
Soft clustering: cluster membership probabilities
How to compute a good soft clustering?
X = Customer_Id T = Cluster_Id p(T|X)
789-15-4321 t1 0.75
987-65-4321 t1 0.75
789-15-4321 t2 0.25
987-65-4321 t2 0.25
(908)-555-1234 t1 0.25
973-360-0000 t1 0.5
(908)-555-1234 t2 0.75
973-360-0000 t2 0.5
![Page 85: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/85.jpg)
Heterogeneity Identification [DKOSV06]
Represent strings as q-gram distributions
X = Customer_Id V = 4-grams p(X,V)
987-65-4321 987- 0.10
987-65-4321 87-6 0.13
987-65-4321 7-65 0.12
987-65-4321 -65- 0.15
987-65-4321 65-4 0.05
987-65-4321 5-43 0.20
987-65-4321 -432 0.15
987-65-4321 4321 0.10
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
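Extracting the q-grams of a value is a sliding-window operation. The sketch below lists the 4-grams of one value and turns them into a distribution using plain relative frequency. The uniform weighting is an assumption of this sketch; the p(X,V) values in the table above are not uniform, so the original presumably applies a weighting that is not spelled out here.

```python
from collections import Counter

def qgrams(s, q=4):
    """All overlapping substrings of length q, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_distribution(s, q=4):
    """Relative frequency of each q-gram within a single string (unweighted)."""
    grams = qgrams(s, q)
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}

grams = qgrams("987-65-4321")
dist = qgram_distribution("987-65-4321")
print(grams)
```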
![Page 86: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/86.jpg)
Heterogeneity Identification [DKOSV06]
iIB: find soft clustering T of X that minimizes I(T;X) – β·I(T;V)
Allow iIB to use arbitrarily many clusters, use β* = H(X)/I(X;V)
– Closest to point with minimum space and maximum quality
X = Customer_Id V = 4-grams p(X,V)
987-65-4321 987- 0.10
987-65-4321 87-6 0.13
987-65-4321 7-65 0.12
987-65-4321 -65- 0.15
987-65-4321 65-4 0.05
987-65-4321 5-43 0.20
987-65-4321 -432 0.15
987-65-4321 4321 0.10
Customer_Id
187-65-2468
987-64-6837
789-15-4321
987-65-4321
(908)-555-1234
973-360-0000
360-0007
(877)-807-4596
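The objective can be evaluated for any candidate soft clustering without running the iterative algorithm. The sketch below does only that evaluation, on a small made-up joint distribution p(X,V); it is not an implementation of iIB, and all names and data are hypothetical. It also illustrates one property of β* = H(X)/I(X;V): at this value the two trivial clusterings (every value in its own cluster, and all values in one cluster) both score zero.

```python
from math import log2

def marginals(joint):
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return pa, pb

def mutual_information(joint):
    """I(A;B) in bits for a joint table {(a, b): prob}."""
    pa, pb = marginals(joint)
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

def ib_objective(p_xv, p_t_given_x, beta):
    """I(T;X) - beta * I(T;V), where T depends on V only through X."""
    p_x, _ = marginals(p_xv)
    p_tx, p_tv = {}, {}
    for (x, t), q in p_t_given_x.items():
        if q > 0:
            p_tx[(t, x)] = p_x[x] * q
    for (x, v), p in p_xv.items():
        for (x2, t), q in p_t_given_x.items():
            if x2 == x and q > 0:
                p_tv[(t, v)] = p_tv.get((t, v), 0.0) + p * q  # p(t,v) = sum_x p(x,v) p(t|x)
    return mutual_information(p_tx) - beta * mutual_information(p_tv)

# Hypothetical joint distribution over four values and three q-grams.
p_xv = {("x1", "g1"): 0.25, ("x2", "g1"): 0.125, ("x2", "g2"): 0.125,
        ("x3", "g2"): 0.25, ("x4", "g3"): 0.25}
p_x, _ = marginals(p_xv)
h_x = sum(p * log2(1 / p) for p in p_x.values())
beta_star = h_x / mutual_information(p_xv)

identity = {(x, x): 1.0 for x in p_x}    # every value is its own cluster
single = {(x, "t"): 1.0 for x in p_x}    # all values share one cluster
obj_identity = ib_objective(p_xv, identity, beta_star)
obj_single = ib_objective(p_xv, single, beta_star)
print(beta_star, obj_identity, obj_single)
```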
![Page 87: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/87.jpg)
Heterogeneity Identification [DKOSV06]
Rate distortion curve: I(T;V)/I(X;V) vs I(T;X)/H(X)
[Figure: rate distortion curve with the point for β* marked]
![Page 88: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/88.jpg)
Heterogeneity Identification [DKOSV06]
Heterogeneity = mutual information I(T;X) of iIB clustering T at β*
0 ≤ I(T;X) (= 0.126) ≤ min{H(X) (= 2.0), H(T) (= 1.0)}
– Ideally use iIB with an arbitrarily large number of clusters in T
X = Customer_Id T = Cluster_Id p(T|X) i(T;X)
789-15-4321 t1 0.75 0.41
987-65-4321 t1 0.75 0.41
789-15-4321 t2 0.25 -0.81
987-65-4321 t2 0.25 -0.81
(908)-555-1234 t1 0.25 -1.17
973-360-0000 t1 0.5 -0.17
(908)-555-1234 t2 0.75 0.77
973-360-0000 t2 0.5 0.19
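The heterogeneity value can be recomputed from the membership probabilities. The sketch assumes X is uniform over the four listed values, which is consistent with H(X) = 2.0. Evaluated without rounding it gives I(T;X) of about 0.130 and H(T) of about 0.989; the 0.126 above corresponds to summing the rounded pointwise i(T;X) entries.

```python
from math import log2

# p(T | X) from the table; X is assumed uniform over these four values.
p_t_given_x = {
    "789-15-4321": {"t1": 0.75, "t2": 0.25},
    "987-65-4321": {"t1": 0.75, "t2": 0.25},
    "(908)-555-1234": {"t1": 0.25, "t2": 0.75},
    "973-360-0000": {"t1": 0.5, "t2": 0.5},
}
p_x = 1 / len(p_t_given_x)

# Cluster marginal p(t) = sum_x p(x) p(t|x).
p_t = {}
for dist in p_t_given_x.values():
    for t, q in dist.items():
        p_t[t] = p_t.get(t, 0.0) + p_x * q

h_t = sum(p * log2(1 / p) for p in p_t.values())
# I(T;X) = sum_x sum_t p(x) p(t|x) log2(p(t|x) / p(t)).
i_tx = sum(p_x * q * log2(q / p_t[t])
           for dist in p_t_given_x.values() for t, q in dist.items())
print(f"p(T)={p_t} H(T)={h_t:.3f} I(T;X)={i_tx:.3f}")
```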
![Page 89: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/89.jpg)
Heterogeneity Identification [DKOSV06]
Heterogeneity = mutual information I(T;X) of iIB clustering T at β*
![Page 90: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/90.jpg)
Data Integration: Summary
Analyzing database instance critical for effective data integration
– Matching and quality assessments are key components
Information theoretic measures useful for schema matching
– Align columns when column names, data values are opaque
– Mutual information I(X;Y) captures correlations between columns X, Y
Information theoretic measures useful for heterogeneity testing
– Identify columns with semantically heterogeneous values
– I(T;X) of iIB clustering T at β* captures column heterogeneity
![Page 91: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/91.jpg)
Outline
Part 1
– Introduction to Information Theory
– Application: Data Anonymization
– Application: Data Integration
Part 2
– Review of Information Theory Basics
– Application: Database Design
– Computing Information Theoretic Primitives
– Open Problems
![Page 92: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/92.jpg)
Review of Information Theory Basics
Discrete distribution: probability p(X)
p(X,Y) = ∑z p(X,Y,Z=z)
X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
![Page 93: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/93.jpg)
Review of Information Theory Basics
Discrete distribution: probability p(X)
p(Y) = ∑x p(X=x,Y) = ∑x ∑z p(X=x,Y,Z=z)
X Y Z p(X,Y,Z)
x1 y1 z1 0.125
x1 y2 z2 0.125
x1 y1 z2 0.125
x1 y2 z1 0.125
x2 y3 z3 0.125
x2 y3 z4 0.125
x3 y3 z5 0.125
x4 y3 z6 0.125
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
X Y p(X,Y)
x1 y1 0.25
x1 y2 0.25
x2 y3 0.25
x3 y3 0.125
x4 y3 0.125
![Page 94: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/94.jpg)
Review of Information Theory Basics
Discrete distribution: conditional probability p(X|Y)
p(X,Y) = p(X|Y)*p(Y) = p(Y|X)*p(X)
X Y p(X,Y) p(X|Y) p(Y|X)
x1 y1 0.25 1.0 0.5
x1 y2 0.25 1.0 0.5
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 1.0
x4 y3 0.125 0.25 1.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
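Marginal and conditional probabilities follow mechanically from the joint table p(X,Y,Z). A minimal sketch, with the helper name `marginal` chosen for illustration:

```python
from collections import defaultdict

p_xyz = {
    ("x1", "y1", "z1"): 0.125, ("x1", "y2", "z2"): 0.125,
    ("x1", "y1", "z2"): 0.125, ("x1", "y2", "z1"): 0.125,
    ("x2", "y3", "z3"): 0.125, ("x2", "y3", "z4"): 0.125,
    ("x3", "y3", "z5"): 0.125, ("x4", "y3", "z6"): 0.125,
}

def marginal(joint, keep):
    """Sum out every position of the key that is not listed in `keep`."""
    out = defaultdict(float)
    for key, p in joint.items():
        out[tuple(key[i] for i in keep)] += p
    return dict(out)

p_xy = marginal(p_xyz, (0, 1))   # p(X,Y) = sum_z p(X,Y,Z=z)
p_x = marginal(p_xyz, (0,))
p_y = marginal(p_xyz, (1,))

# Conditionals via p(X,Y) = p(X|Y) * p(Y) = p(Y|X) * p(X).
p_x_given_y = {(x, y): p / p_y[(y,)] for (x, y), p in p_xy.items()}
p_y_given_x = {(x, y): p / p_x[(x,)] for (x, y), p in p_xy.items()}
print(p_xy)
```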
![Page 95: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/95.jpg)
Review of Information Theory Basics
Discrete distribution: entropy H(X)
h(x) = log2(1/p(x))
– H(X) = ∑x p(x)*h(x) = 1.75
– H(Y) = ∑y p(y)*h(y) = 1.5 (≤ log2(|Y|) = 1.58)
– H(X,Y) = ∑x ∑y p(x,y)*h(x,y) = 2.25 (≤ log2(|X,Y|) = 2.32)
X Y p(X,Y) h(X,Y)
x1 y1 0.25 2.0
x1 y2 0.25 2.0
x2 y3 0.25 2.0
x3 y3 0.125 3.0
x4 y3 0.125 3.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
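The entropy values and their log2 upper bounds can be checked with a few lines. This is a plain sketch of the definition above, with distributions copied from the tables.

```python
from math import log2

def entropy(dist):
    """H = sum p * log2(1/p) in bits; zero-probability outcomes contribute nothing."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

p_x = {"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125}
p_y = {"y1": 0.25, "y2": 0.25, "y3": 0.5}
p_xy = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25, ("x2", "y3"): 0.25,
        ("x3", "y3"): 0.125, ("x4", "y3"): 0.125}

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(p_xy)
print(h_x, h_y, h_xy)
```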
![Page 96: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/96.jpg)
Review of Information Theory Basics
Discrete distribution: conditional entropy H(X|Y)
h(x|y) = log2(1/p(x|y))
– H(X|Y) = ∑x ∑y p(x,y)*h(x|y) = 0.75
– H(X|Y) = H(X,Y) – H(Y) = 2.25 – 1.5
X Y p(X,Y) p(X|Y) h(X|Y)
x1 y1 0.25 1.0 0.0
x1 y2 0.25 1.0 0.0
x2 y3 0.25 0.5 1.0
x3 y3 0.125 0.25 2.0
x4 y3 0.125 0.25 2.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
![Page 97: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/97.jpg)
Review of Information Theory Basics
Discrete distribution: mutual information I(X;Y)
i(x;y) = log2(p(x,y)/(p(x)*p(y)))
– I(X;Y) = ∑x ∑y p(x,y)*i(x;y) = 1.0
– I(X;Y) = H(X) + H(Y) – H(X,Y) = 1.75 + 1.5 – 2.25
X Y p(X,Y) h(X,Y) i(X;Y)
x1 y1 0.25 2.0 1.0
x1 y2 0.25 2.0 1.0
x2 y3 0.25 2.0 1.0
x3 y3 0.125 3.0 1.0
x4 y3 0.125 3.0 1.0
X p(X) h(X)
x1 0.5 1.0
x2 0.25 2.0
x3 0.125 3.0
x4 0.125 3.0
Y p(Y) h(Y)
y1 0.25 2.0
y2 0.25 2.0
y3 0.5 1.0
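Both the pointwise definitions and the entropy identities for H(X|Y) and I(X;Y) can be verified numerically on the running example. A minimal sketch:

```python
from math import log2

p_xy = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25, ("x2", "y3"): 0.25,
        ("x3", "y3"): 0.125, ("x4", "y3"): 0.125}
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

def entropy(dist):
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

# H(X|Y) = sum p(x,y) * log2(1 / p(x|y)), with p(x|y) = p(x,y) / p(y).
h_x_given_y = sum(p * log2(p_y[y] / p) for (x, y), p in p_xy.items())
# I(X;Y) = sum p(x,y) * log2(p(x,y) / (p(x) * p(y))).
i_xy = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

print(h_x_given_y, entropy(p_xy) - entropy(p_y))
print(i_xy, entropy(p_x) + entropy(p_y) - entropy(p_xy))
```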
![Page 98: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/98.jpg)
Outline
Part 1
– Introduction to Information Theory
– Application: Data Anonymization
– Application: Data Integration
Part 2
– Review of Information Theory Basics
– Application: Database Design
– Computing Information Theoretic Primitives
– Open Problems
![Page 99: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/99.jpg)
Information Dependencies [DR00]
Goal: use information theory to examine and reason about information content of the attributes in a relation instance
Key ideas:
– Novel InD measure between attribute sets X, Y based on H(Y|X)
– Identify numeric inequalities between InD measures
Results:
– InD measures are a broader class than FDs and MVDs
– Armstrong axioms for FDs derivable from InD inequalities
– MVD inference rules derivable from InD inequalities
![Page 100: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/100.jpg)
Information Dependencies [DR00]
Functional dependency: X → Y
– FD X → Y holds iff ∀ t1, t2 ((t1[X] = t2[X]) ⇒ (t1[Y] = t2[Y]))
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
![Page 101: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/101.jpg)
![Page 102: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/102.jpg)
Information Dependencies [DR00]
Result: FD X → Y holds iff H(Y|X) = 0
– Intuition: once X is known, no remaining uncertainty in Y
H(Y|X) = 0.5 ≠ 0, so X → Y does not hold in this instance
X Y p(X,Y) p(Y|X) h(Y|X)
x1 y1 0.25 0.5 1.0
x1 y2 0.25 0.5 1.0
x2 y3 0.25 1.0 0.0
x3 y3 0.125 1.0 0.0
x4 y3 0.125 1.0 0.0
X p(X)
x1 0.5
x2 0.25
x3 0.125
x4 0.125
Y p(Y)
y1 0.25
y2 0.25
y3 0.5
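The entropy test for a functional dependency can be sketched on a relation instance by treating each tuple as equally likely. The function names are illustrative; the second instance is the X → Y example used in the Database Normal Forms slides.

```python
from collections import Counter
from math import log2

def conditional_entropy(rows, lhs, rhs):
    """H(rhs | lhs) in bits; lhs and rhs are tuples of column indexes."""
    n = len(rows)
    joint = Counter(tuple(r[i] for i in lhs + rhs) for r in rows)
    left = Counter(tuple(r[i] for i in lhs) for r in rows)
    return sum((c / n) * log2(left[key[:len(lhs)]] / c) for key, c in joint.items())

def fd_holds(rows, lhs, rhs, tol=1e-12):
    """FD lhs -> rhs holds in the instance iff H(rhs | lhs) = 0."""
    return conditional_entropy(rows, lhs, rhs) < tol

r1 = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x1", "y1", "z2"), ("x1", "y2", "z1"),
      ("x2", "y3", "z3"), ("x2", "y3", "z4"), ("x3", "y3", "z5"), ("x4", "y3", "z6")]
r2 = [("x1", "y1", "z1"), ("x1", "y1", "z2"), ("x2", "y2", "z3"),
      ("x2", "y2", "z4"), ("x3", "y3", "z5"), ("x4", "y4", "z6")]

X, Y = (0,), (1,)
print(conditional_entropy(r1, X, Y), fd_holds(r1, X, Y))
print(conditional_entropy(r2, X, Y), fd_holds(r2, X, Y))
```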
![Page 103: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/103.jpg)
Information Dependencies [DR00]
Multi-valued dependency: X →→ Y
– MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
![Page 104: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/104.jpg)
Information Dependencies [DR00]
Multi-valued dependency: X →→ Y
– MVD X →→ Y holds iff R(X,Y,Z) = R(X,Y) ⋈ R(X,Z)
R(X,Y,Z):
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
= R(X,Y) ⋈ R(X,Z), where R(X,Y):
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
and R(X,Z):
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
![Page 105: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/105.jpg)
![Page 106: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/106.jpg)
Information Dependencies [DR00]
Result: MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
– Intuition: once X known, uncertainties in Y and Z are independent
H(Y|X) = 0.5, H(Z|X) = 0.75, H(Y,Z|X) = 1.25 = H(Y|X) + H(Z|X)
X Y h(Y|X)
x1 y1 1.0
x1 y2 1.0
x2 y3 0.0
x3 y3 0.0
x4 y3 0.0
X Z h(Z|X)
x1 z1 1.0
x1 z2 1.0
x2 z3 1.0
x2 z4 1.0
x3 z5 0.0
x4 z6 0.0
X Y Z h(Y,Z|X)
x1 y1 z1 2.0
x1 y2 z2 2.0
x1 y1 z2 2.0
x1 y2 z1 2.0
x2 y3 z3 1.0
x2 y3 z4 1.0
x3 y3 z5 0.0
x4 y3 z6 0.0
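The entropy characterization and the join characterization of an MVD can be checked side by side. The sketch treats each tuple of the instance as equally likely; function names are illustrative, and the counterexample (the same instance with one tuple removed) is constructed here only for contrast.

```python
from collections import Counter
from math import log2

def conditional_entropy(rows, lhs, rhs):
    """H(rhs | lhs) in bits; lhs and rhs are tuples of column indexes."""
    n = len(rows)
    joint = Counter(tuple(r[i] for i in lhs + rhs) for r in rows)
    left = Counter(tuple(r[i] for i in lhs) for r in rows)
    return sum((c / n) * log2(left[key[:len(lhs)]] / c) for key, c in joint.items())

def mvd_holds(rows, x, y, z, tol=1e-9):
    """Entropy test: H(Y,Z|X) = H(Y|X) + H(Z|X)."""
    total = conditional_entropy(rows, x, y + z)
    return abs(total - conditional_entropy(rows, x, y) - conditional_entropy(rows, x, z)) < tol

def join_reconstructs(rows):
    """Join test: R(X,Y,Z) = R(X,Y) joined with R(X,Z) on X."""
    r_xy = {(x, y) for x, y, _ in rows}
    r_xz = {(x, z) for x, _, z in rows}
    joined = {(x, y, z) for x, y in r_xy for x2, z in r_xz if x == x2}
    return joined == set(rows)

r = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x1", "y1", "z2"), ("x1", "y2", "z1"),
     ("x2", "y3", "z3"), ("x2", "y3", "z4"), ("x3", "y3", "z5"), ("x4", "y3", "z6")]
broken = [t for t in r if t != ("x1", "y2", "z1")]  # hypothetical counterexample

X, Y, Z = (0,), (1,), (2,)
print(conditional_entropy(r, X, Y), conditional_entropy(r, X, Z), conditional_entropy(r, X, Y + Z))
print(mvd_holds(r, X, Y, Z), join_reconstructs(r))
print(mvd_holds(broken, X, Y, Z), join_reconstructs(broken))
```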
![Page 107: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/107.jpg)
Information Dependencies [DR00]
Result: Armstrong axioms for FDs derivable from InD inequalities
Reflexivity: If Y ⊆ X, then X → Y
– H(Y|X) = 0 for Y ⊆ X
Augmentation: X → Y ⇒ X,Z → Y,Z
– 0 ≤ H(Y,Z|X,Z) = H(Y|X,Z) ≤ H(Y|X) = 0
Transitivity: X → Y & Y → Z ⇒ X → Z
– 0 = H(Y|X) + H(Z|Y) ≥ H(Z|X) ≥ 0
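The inequality used for transitivity follows from two standard facts: the chain rule for conditional entropy, and the fact that conditioning on additional attributes cannot increase entropy. A short derivation sketch:

```latex
\begin{align*}
H(Z \mid X) &\le H(Y, Z \mid X) \\
            &= H(Y \mid X) + H(Z \mid X, Y) \\
            &\le H(Y \mid X) + H(Z \mid Y).
\end{align*}
```

If both H(Y|X) and H(Z|Y) are 0, the right-hand side is 0, which forces H(Z|X) = 0.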
![Page 108: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/108.jpg)
Database Normal Forms
Goal: eliminate update anomalies by good database design
– Need to know the integrity constraints on all database instances
Boyce-Codd normal form:
– Input: a set Σ of functional dependencies
– For every (non-trivial) FD R.X → R.Y ∈ Σ+, R.X is a key of R
4NF:
– Input: a set Σ of functional and multi-valued dependencies
– For every (non-trivial) MVD R.X →→ R.Y ∈ Σ+, R.X is a key of R
![Page 109: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/109.jpg)
![Page 110: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/110.jpg)
Database Normal Forms
Functional dependency: X → Y
– Which design is better?
Decomposition is in BCNF
R(X,Y,Z):
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
= R(X,Y) ⋈ R(X,Z), where R(X,Y):
X Y
x1 y1
x2 y2
x3 y3
x4 y4
and R(X,Z):
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
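That the decomposition loses no information, and that X is a key of the projected R(X,Y), can be checked on the instance. This is a sketch on this one instance only, not a schema-level BCNF test.

```python
rows = [("x1", "y1", "z1"), ("x1", "y1", "z2"), ("x2", "y2", "z3"),
        ("x2", "y2", "z4"), ("x3", "y3", "z5"), ("x4", "y4", "z6")]

# Project onto (X,Y) and (X,Z); sets remove the duplicated (x, y) pairs.
r_xy = {(x, y) for x, y, _ in rows}
r_xz = {(x, z) for x, _, z in rows}

# Natural join on X rebuilds the original relation when the decomposition is lossless.
joined = {(x, y, z) for x, y in r_xy for x2, z in r_xz if x == x2}
lossless = joined == set(rows)

# X is a key of R(X,Y) in this instance if no X value appears twice.
x_is_key = len({x for x, _ in r_xy}) == len(r_xy)
print(lossless, x_is_key, sorted(r_xy))
```

Each Y value is stored once per X value in R(X,Y), instead of once per tuple in the original relation.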
![Page 111: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/111.jpg)
![Page 112: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/112.jpg)
Database Normal Forms
Multi-valued dependency: X →→ Y
– Which design is better?
Decomposition is in 4NF
R(X,Y,Z):
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
= R(X,Y) ⋈ R(X,Z), where R(X,Y):
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
and R(X,Z):
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
![Page 113: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/113.jpg)
Well-Designed Databases [AL03]
Goal: use information theory to characterize “goodness” of a database design and reason about normalization algorithms
Key idea:
– Information content measure of cell in a DB instance w.r.t. ICs
– Redundancy reduces information content measure of cells
Results:
– Well-designed DB ⇔ each cell has information content > 0
– Normalization algorithms never decrease information content
![Page 114: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/114.jpg)
Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
– Uniform distribution p(V) on values for c consistent with D\c and FD
– Information content of cell c is entropy H(V)
H(V62) = 2.0
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
V62 p(V62) h(V62)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
![Page 115: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/115.jpg)
Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
– Uniform distribution p(V) on values for c consistent with D\c and FD
– Information content of cell c is entropy H(V)
H(V22) = 0.0
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
V22 p(V22) h(V22)
y1 1.0 0.0
y2 0.0
y3 0.0
y4 0.0
![Page 116: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/116.jpg)
Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
– Information content of cell c is entropy H(V)
Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
– Technicalities w.r.t. size of active domain
X Y Z
x1 y1 z1
x1 y1 z2
x2 y2 z3
x2 y2 z4
x3 y3 z5
x4 y4 z6
c H(V)
c12 0.0
c22 0.0
c32 0.0
c42 0.0
c52 2.0
c62 2.0
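The per-cell values in this table can be reproduced with a small brute-force sketch. It assumes the simplified definition used on these slides: try every value from a fixed active domain in the cell, keep those for which the instance still satisfies the FD X → Y, and take log2 of the count (the entropy of a uniform distribution). The function names and the active domain {y1, ..., y4} are illustrative assumptions; the full [AL03] definition has further technicalities.

```python
from math import log2

def satisfies_fd(rows, lhs, rhs):
    """True if column `lhs` functionally determines column `rhs` in `rows`."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

def cell_entropy(rows, i, col, domain, lhs=0, rhs=1):
    """Entropy in bits of the uniform distribution over the values for cell
    (row i, column col) that keep the FD lhs -> rhs satisfied.
    Assumes the cell's current value is in `domain`, so the count is >= 1."""
    consistent = 0
    for value in domain:
        candidate = list(rows)
        changed = list(rows[i])
        changed[col] = value
        candidate[i] = tuple(changed)
        if satisfies_fd(candidate, lhs, rhs):
            consistent += 1
    return log2(consistent)

# Instance from the slides, with FD X -> Y (column 0 determines column 1).
D = [("x1", "y1", "z1"), ("x1", "y1", "z2"), ("x2", "y2", "z3"),
     ("x2", "y2", "z4"), ("x3", "y3", "z5"), ("x4", "y4", "z6")]
Y_DOMAIN = ["y1", "y2", "y3", "y4"]  # assumed active domain for column Y

y_column = [cell_entropy(D, i, 1, Y_DOMAIN) for i in range(len(D))]
print(y_column)  # [0.0, 0.0, 0.0, 0.0, 2.0, 2.0]
```

Cells whose X value is repeated in another row are forced by the FD and carry no information; the cells for x3 and x4 are free to take any of the four values.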
![Page 117: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/117.jpg)
Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
– Information content of cell c is entropy H(V)
H(V12) = 2.0, H(V42) = 2.0
V42 p(V42) h(V42)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
V12 p(V12) h(V12)
y1 0.25 2.0
y2 0.25 2.0
y3 0.25 2.0
y4 0.25 2.0
![Page 118: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/118.jpg)
Well-Designed Databases [AL03]
Information content of cell c in database D satisfying FD X → Y
– Information content of cell c is entropy H(V)
Schema S is in BCNF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
X Y
x1 y1
x2 y2
x3 y3
x4 y4
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c12 2.0
c22 2.0
c32 2.0
c42 2.0
![Page 119: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/119.jpg)
Well-Designed Databases [AL03]
Information content of cell c in DB D satisfying MVD X →→ Y
– Information content of cell c is entropy H(V)
H(V52) = 0.0, H(V53) = 2.32
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
V52 p(V52) h(V52)
y3 1.0 0.0
V53 p(V53) h(V53)
z1 0.2 2.32
z2 0.2 2.32
z3 0.2 2.32
z4 0.0
z5 0.2 2.32
z6 0.2 2.32
![Page 120: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/120.jpg)
Well-Designed Databases [AL03]
Information content of cell c in DB D satisfying MVD X →→ Y
– Information content of cell c is entropy H(V)
Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
c H(V)
c12 0.0
c22 0.0
c32 0.0
c42 0.0
c52 0.0
c62 0.0
c72 1.58
c82 1.58
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
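The two tables above (Y column and Z column) can be reproduced with the same brute-force idea, now checking the MVD X →→ Y. This is a sketch under two assumptions that are not spelled out on the slides: the active domains are {y1, y2, y3} and {z1, ..., z6}, and a candidate value is rejected if it would make the row a duplicate of another row (this is what excludes z4 for cell c53). All names are illustrative.

```python
from math import log2

def satisfies_mvd(rows):
    """MVD X ->> Y on rows (x, y, z): for each x, the set of (y, z) pairs
    must be the full cross product of that x's y-values and z-values."""
    groups = {}
    for x, y, z in rows:
        groups.setdefault(x, set()).add((y, z))
    for pairs in groups.values():
        ys = {y for y, _ in pairs}
        zs = {z for _, z in pairs}
        if len(pairs) != len(ys) * len(zs):
            return False
    return True

def cell_entropy(rows, i, col, domain):
    """log2 of the number of domain values that can be placed in cell
    (row i, column col) without creating a duplicate row or violating the MVD."""
    consistent = 0
    for value in domain:
        changed = list(rows[i])
        changed[col] = value
        candidate = rows[:i] + [tuple(changed)] + rows[i + 1:]
        # Assumption: values that duplicate an existing tuple are not allowed.
        if len(set(candidate)) == len(candidate) and satisfies_mvd(candidate):
            consistent += 1
    return log2(consistent)

D = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x1", "y1", "z2"), ("x1", "y2", "z1"),
     ("x2", "y3", "z3"), ("x2", "y3", "z4"), ("x3", "y3", "z5"), ("x4", "y3", "z6")]
Y_DOMAIN = ["y1", "y2", "y3"]                    # assumed active domain
Z_DOMAIN = ["z1", "z2", "z3", "z4", "z5", "z6"]  # assumed active domain

y_column = [round(cell_entropy(D, i, 1, Y_DOMAIN), 2) for i in range(len(D))]
z_column = [round(cell_entropy(D, i, 2, Z_DOMAIN), 2) for i in range(len(D))]
print(y_column)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.58, 1.58]
print(z_column)  # [0.0, 0.0, 0.0, 0.0, 2.32, 2.32, 2.58, 2.58]
```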
![Page 121: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/121.jpg)
Well-Designed Databases [AL03]
Information content of cell c in DB D satisfying MVD X →→ Y
– Information content of cell c is entropy H(V)
H(V32) = 1.58, H(V34) = 2.32
V34 p(V34) h(V34)
z1 0.2 2.32
z2 0.2 2.32
z3 0.2 2.32
z4 0.0
z5 0.2 2.32
z6 0.2 2.32
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
V32 p(V32) h(V32)
y1 0.33 1.58
y2 0.33 1.58
y3 0.33 1.58
![Page 122: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/122.jpg)
Well-Designed Databases [AL03]
Information content of cell c in DB D satisfying MVD X →→ Y
– Information content of cell c is entropy H(V)
Schema S is in 4NF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
c H(V)
c12 1.0
c22 1.0
c32 1.58
c42 1.58
c52 1.58
c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
![Page 123: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/123.jpg)
Well-Designed Databases [AL03]
Normalization algorithms never decrease information content
– Information content of cell c is entropy H(V)
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
![Page 124: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/124.jpg)
Well-Designed Databases [AL03]
Normalization algorithms never decrease information content
– Information content of cell c is entropy H(V)
c H(V)
c14 2.32
c24 2.32
c34 2.32
c44 2.32
c54 2.58
c64 2.58
X Y Z
x1 y1 z1
x1 y2 z2
x1 y1 z2
x1 y2 z1
x2 y3 z3
x2 y3 z4
x3 y3 z5
x4 y3 z6
X Y
x1 y1
x1 y2
x2 y3
x3 y3
x4 y3
X Z
x1 z1
x1 z2
x2 z3
x2 z4
x3 z5
x4 z6
=
c H(V)
c13 0.0
c23 0.0
c33 0.0
c43 0.0
c53 2.32
c63 2.32
c73 2.58
c83 2.58
![Page 126: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/126.jpg)
Database Design: Summary
Good database design is essential for preserving data integrity
Information theoretic measures are useful for integrity constraints
– FD X → Y holds iff InD measure H(Y|X) = 0
– MVD X →→ Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)
– Information theory to model correlations in a specific database
Information theoretic measures are useful for normal forms
– Schema S is in BCNF/4NF iff ∀ D ∈ S, H(V) > 0, for all cells c in D
– Information theory to model distributions over possible databases
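The two entropy characterizations of FDs and MVDs in the summary can be checked directly on the example instances from the earlier slides. The sketch below treats each row as equally likely (the empirical distribution of the instance) and computes conditional entropies as H(A|B) = H(A,B) – H(B); the helper names and the floating-point tolerance are illustrative choices.

```python
from collections import Counter
from math import log2

def entropy(rows, cols):
    """Empirical entropy (bits) of the projection of `rows` onto `cols`."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum(k / n * log2(k / n) for k in counts.values())

def cond_entropy(rows, target, given):
    """H(target | given) = H(given, target) - H(given)."""
    return entropy(rows, given + target) - entropy(rows, given)

X, Y, Z = 0, 1, 2
TOL = 1e-9  # tolerance for floating-point comparisons

def fd_holds(rows):
    """FD X -> Y holds iff H(Y|X) = 0."""
    return abs(cond_entropy(rows, [Y], [X])) < TOL

def mvd_holds(rows):
    """MVD X ->> Y holds iff H(Y,Z|X) = H(Y|X) + H(Z|X)."""
    joint = cond_entropy(rows, [Y, Z], [X])
    return abs(joint - cond_entropy(rows, [Y], [X]) - cond_entropy(rows, [Z], [X])) < TOL

fd_instance = [("x1", "y1", "z1"), ("x1", "y1", "z2"), ("x2", "y2", "z3"),
               ("x2", "y2", "z4"), ("x3", "y3", "z5"), ("x4", "y4", "z6")]
mvd_instance = [("x1", "y1", "z1"), ("x1", "y2", "z2"), ("x1", "y1", "z2"),
                ("x1", "y2", "z1"), ("x2", "y3", "z3"), ("x2", "y3", "z4"),
                ("x3", "y3", "z5"), ("x4", "y3", "z6")]

print(fd_holds(fd_instance))    # True
print(fd_holds(mvd_instance))   # False: x1 maps to both y1 and y2
print(mvd_holds(mvd_instance))  # True
```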
![Page 127: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/127.jpg)
Outline
Part 1
– Introduction to Information Theory
– Application: Data Anonymization
– Application: Data Integration
Part 2
– Review of Information Theory Basics
– Application: Database Design
– Computing Information Theoretic Primitives
– Open Problems
![Page 128: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/128.jpg)
Domain size matters
For a random variable X, domain size = |supp(X)|, where supp(X) = {xi | p(X = xi) > 0}
Different solutions exist depending on whether domain size is “small” or “large”
Probability vectors usually very sparse
![Page 129: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/129.jpg)
Entropy: Case I - Small domain size
Suppose the number of unique values of a random variable X is small (i.e., the counts fit in memory)
Maximum likelihood estimator:
– p(x) = (number of times x is encountered) / (total number of items in the set)
[Figure: histogram of counts over the values 1 to 5]
![Page 130: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/130.jpg)
Entropy: Case I - Small domain size
H_MLE = Σx p(x) log 1/p(x)
This is a biased estimate:
– E[H_MLE] < H
Miller-Madow correction:
– H’ = H_MLE + (m’ – 1)/(2n)
– m’ is an estimate of the number of non-empty bins
– n = number of samples
Bad news: ALL estimators for H are biased.
Good news: we can quantify the bias and variance of the MLE (m = number of bins):
– |Bias| <= log(1 + m/n)
– Var(H_MLE) <= (log n)²/n
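As a minimal sketch of the small-domain case, the plug-in (maximum likelihood) estimate and the Miller-Madow correction can be computed from explicit counts. The function names are illustrative. Natural logarithms are used here because the (m’ – 1)/(2n) correction term is stated in nats; to work in bits, divide both terms by ln 2. The number of distinct observed values is used as m’.

```python
from collections import Counter
from math import log

def entropy_mle(samples):
    """Plug-in entropy estimate in nats: sum over x of p(x) * log(1/p(x))."""
    n = len(samples)
    return -sum(c / n * log(c / n) for c in Counter(samples).values())

def entropy_miller_madow(samples):
    """Plug-in estimate plus the Miller-Madow bias correction (m' - 1) / (2n),
    with m' taken to be the number of distinct values observed."""
    n = len(samples)
    nonempty_bins = len(set(samples))
    return entropy_mle(samples) + (nonempty_bins - 1) / (2 * n)

samples = [1, 1, 2, 2]
print(entropy_mle(samples))           # log(2), about 0.693
print(entropy_miller_madow(samples))  # log(2) + 1/8
```

The correction is always non-negative, which matches the direction of the bias: the plug-in estimate underestimates H on average.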
![Page 131: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/131.jpg)
Entropy: Case II - Large domain size
|X| is too large to fit in main memory, so we can’t maintain explicit counts.
Streaming algorithms for H(X):
– Long history of work on this problem
– Bottom line: (1+ε)-relative-approximation for H(X) that allows for updates to frequencies, and requires “almost constant”, and optimal, space [HNO08].
![Page 132: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/132.jpg)
Streaming Entropy [CCM07]
High level idea: sample randomly from the stream, and track counts of elements picked [AMS]
PROBLEM: skewed distribution prevents us from sampling lower-frequency elements (and entropy is small)
Idea: estimate largest frequency, and distribution of what’s left (higher entropy)
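The sampling idea can be illustrated with the basic AMS-style estimator: pick a random position j in a stream of length m, let r be the number of occurrences of that element from position j onward, and output r·log(m/r) – (r–1)·log(m/(r–1)). Its expectation over j is exactly the empirical entropy. The sketch below is an offline illustration with hypothetical function names (it indexes into a stored stream rather than running in one pass), and it omits the handling of the most frequent element that [CCM07] adds.

```python
import random
from collections import Counter
from math import log2

def basic_estimate(stream, j):
    """One sample of the basic estimator, taken at position j of the stream."""
    m = len(stream)
    r = stream[j:].count(stream[j])  # occurrences of stream[j] from j onward

    def g(k):
        return k * log2(m / k) if k > 0 else 0.0

    return g(r) - g(r - 1)

def sampled_entropy(stream, trials, seed=0):
    """Average of `trials` independent basic estimates at random positions."""
    rng = random.Random(seed)
    total = sum(basic_estimate(stream, rng.randrange(len(stream)))
                for _ in range(trials))
    return total / trials

def exact_entropy(stream):
    """Empirical entropy (bits) from explicit counts, for comparison."""
    m = len(stream)
    return -sum(c / m * log2(c / m) for c in Counter(stream).values())

stream = list("aaaabbc")
# Averaging over every position gives the empirical entropy exactly,
# because the per-element terms telescope.
mean_over_positions = sum(basic_estimate(stream, j)
                          for j in range(len(stream))) / len(stream)
print(mean_over_positions, exact_entropy(stream))
```

When one element dominates the stream, the entropy is small and the variance of this estimator is large relative to it, which is the problem the slide describes.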
![Page 133: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/133.jpg)
Streaming Entropy [CCM07]
Maintain set of samples from original distribution and distribution without most frequent element.
In parallel, maintain estimator for frequency of most frequent element
– normally this is hard
– but if frequency is very large, then simple estimator exists [MG81] (Google interview puzzle!)
At the end, compute function of these two estimates
Memory usage: roughly (1/ε²) log(1/δ) (ε is the error, δ the failure probability)
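A simple estimator for a very frequent element can be sketched with the Misra-Gries frequent-items summary, which with a single counter reduces to the well-known majority-vote puzzle. That this is the estimator meant by [MG81] is an assumption here, and the function name is illustrative. With k – 1 counters, every element occurring more than n/k times in a stream of length n survives in the summary, and each stored count undercounts the true frequency by at most n/k.

```python
def misra_gries(stream, k):
    """Frequent-items summary using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a"] * 6 + ["b", "c", "d"]
summary = misra_gries(stream, k=2)
print(summary)  # {'a': 3}: 'a' survives; its true count is 6
```

A second pass (or a parallel exact counter for the surviving candidate) recovers the exact frequency when it is needed.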
![Page 134: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/134.jpg)
Entropy and MI are related
I(X;Y) = H(X) + H(Y) – H(X,Y)
Suppose we can c-approximate H(X) for any c > 0:
– Find H’(X) s.t. |H(X) – H’(X)| <= c
Then we can 3c-approximate I(X;Y):
– I(X;Y) = H(X) + H(Y) – H(X,Y)
  <= (H’(X) + c) + (H’(Y) + c) – (H’(X,Y) – c)
  = H’(X) + H’(Y) – H’(X,Y) + 3c
  = I’(X;Y) + 3c
– The lower bound I(X;Y) >= I’(X;Y) – 3c follows symmetrically
Similarly, we can 2c-approximate H(Y|X) = H(X,Y) – H(X)
Estimating entropy allows us to estimate I(X;Y) and H(Y|X)
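To make the reduction concrete, the sketch below computes mutual information and conditional entropy using nothing but an entropy routine, via the standard identities I(X;Y) = H(X) + H(Y) – H(X,Y) and H(Y|X) = H(X,Y) – H(X). Here the entropy routine is the exact plug-in estimate on a list of samples; any approximate entropy estimator could be substituted, with the errors adding up as described. Function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical entropy in bits of a list of hashable values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def mutual_information(pairs):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), from three entropy computations."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return entropy(xs) + entropy(ys) - entropy(pairs)

def conditional_entropy(pairs):
    """H(Y|X) = H(X,Y) - H(X), from two entropy computations."""
    return entropy(pairs) - entropy([x for x, _ in pairs])

copies = [(0, 0), (1, 1)] * 4                     # Y is a copy of X
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]    # X and Y independent

print(mutual_information(copies))        # 1.0
print(mutual_information(independent))   # 0.0
print(conditional_entropy(copies))       # 0.0
print(conditional_entropy(independent))  # 1.0
```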
![Page 135: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/135.jpg)
Computing KL-divergence: Small Domains
“Easy algorithm”: maintain counts for each of p and q, normalize, and compute KL-divergence.
PROBLEM! Suppose qi = 0 (and pi > 0):
– pi log pi/qi is undefined!
General problem with ML estimators: all events not seen have probability zero!
– Laplace correction: add one to the count of every element of the domain (seen or not)
– Slightly better: add 0.5 to the count of every element [KT81]
– Even better, more involved: use Good-Turing estimator [GT53]
These yield non-zero probability for “things not seen”.
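A minimal sketch of the small-domain computation with additive smoothing follows. It assumes the domain is known in advance and adds a pseudo-count alpha to every element of it (alpha = 1 for the Laplace correction, alpha = 0.5 for the [KT81] variant); Good-Turing is not shown. Function and variable names are illustrative.

```python
from math import isfinite, log2

def smoothed(counts, domain, alpha):
    """Turn raw counts into probabilities after adding `alpha` to the count
    of every element of `domain`, including elements never seen."""
    total = sum(counts.get(v, 0) for v in domain) + alpha * len(domain)
    return {v: (counts.get(v, 0) + alpha) / total for v in domain}

def kl_divergence(p, q):
    """KL-divergence in bits; q must be non-zero wherever p is non-zero."""
    return sum(p[v] * log2(p[v] / q[v]) for v in p if p[v] > 0)

domain = ["a", "b"]
p_counts = {"a": 3, "b": 1}
q_counts = {"a": 4}  # "b" never seen under q: the unsmoothed estimate is 0

p = smoothed(p_counts, domain, alpha=0.5)
q = smoothed(q_counts, domain, alpha=0.5)
print(q)                    # {'a': 0.9, 'b': 0.1}
print(kl_divergence(p, q))  # finite, because q["b"] is no longer zero
```

Without smoothing, q["b"] would be 0 while p["b"] > 0, and the divergence would be undefined (a division by zero in this code).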
![Page 136: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/136.jpg)
Computing KL-divergence: Large Domains
Bad news: No good relative-approximations exist in small space.
(Partial) good news: additive approximations in small space under certain technical conditions (no pi is too small).
(Partial) good news: additive approximations for symmetric variant of KL-divergence, via sampling.
For details, see [GMV08,GIM08]
![Page 137: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/137.jpg)
Information-theoretic Clustering
Given a collection of random variables X, each “explained” by a random variable Y, we wish to find a (hard or soft) clustering T such that I(T;X) – β I(T;Y) is minimized.
Features of solutions thus far:
– heuristic (general problem is NP-hard)
– address both small-domain and large-domain scenarios.
![Page 138: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/138.jpg)
Agglomerative Clustering (aIB) [ST00]
Fix number of clusters k; start with each element in its own cluster
1. While number of clusters > k
   1. Determine two clusters whose merge loses the least information
   2. Combine these two clusters
2. Output clustering
Merge criterion:
– merge the two clusters so that change in I(T;V) is minimized
Note: no consideration of β (number of clusters is fixed)
![Page 139: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/139.jpg)
Agglomerative Clustering (aIB) [S]
Elegant way of finding the two clusters to be merged:
Let dJS(p,q) = (1/2)(dKL(p,m) + dKL(q,m)), m = (p+q)/2
dJS(p,q) is a symmetric distance between p, q (Jensen-Shannon distance)
We merge the clusters that have the smallest dJS(p,q) (weighted by cluster mass)
[Figure: distributions p and q with their midpoint m]
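The merge rule can be sketched as a greedy loop. Each cluster is represented as a pair (mass, distribution over V). The merge cost used below is the mass-weighted Jensen-Shannon form, (p(t1) + p(t2)) · JS with mixing weights proportional to the cluster masses, which is the usual reading of “weighted by cluster mass”; with equal weights the `js` function is the dJS formula above. Names and the toy data are illustrative, and all masses are assumed positive.

```python
from math import log2

def kl(p, q):
    """KL-divergence in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q, wp=0.5, wq=0.5):
    """Jensen-Shannon divergence; the defaults give (1/2)(KL(p,m) + KL(q,m))."""
    m = [wp * pi + wq * qi for pi, qi in zip(p, q)]
    return wp * kl(p, m) + wq * kl(q, m)

def merge_cost(a, b):
    """Information lost by merging clusters a and b (mass-weighted JS)."""
    (mass_a, dist_a), (mass_b, dist_b) = a, b
    mass = mass_a + mass_b
    return mass * js(dist_a, dist_b, mass_a / mass, mass_b / mass)

def merge(a, b):
    """Merged cluster: masses add, distributions are mass-weighted averages."""
    (mass_a, dist_a), (mass_b, dist_b) = a, b
    mass = mass_a + mass_b
    return (mass, [(mass_a * x + mass_b * y) / mass for x, y in zip(dist_a, dist_b)])

def agglomerative_ib(clusters, k):
    """Greedily merge the cheapest pair until only k clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

items = [(0.25, [0.9, 0.1]), (0.25, [0.8, 0.2]),
         (0.25, [0.1, 0.9]), (0.25, [0.2, 0.8])]
print(agglomerative_ib(items, k=2))
```

On this toy input the two items concentrated on the first value end up in one cluster and the other two in the second, each with mass 0.5.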
![Page 140: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/140.jpg)
Iterative Information Bottleneck (iIB) [S]
aIB yields a hard clustering with k clusters. If you want a soft clustering, use iIB (variant of EM)
– Step 1: p(t|x) ← exp(–β dKL(p(V|x), p(V|t))), normalized over t
  assign elements to clusters with weight that decays exponentially with distance from the cluster center!
– Step 2: Compute new cluster centers by computing weighted centroids:
  p(t) = Σx p(t|x) p(x)
  p(V|t) = Σx p(V|x) p(t|x) p(x)/p(t)
– Choose β according to [DKOSV06]
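One iIB iteration can be sketched in plain Python as below. The code recomputes the cluster centers from the current soft assignment (Step 2) and then reassigns each element with weight exp(–β · KL) normalized over clusters (Step 1). It follows the slide's form of the assignment rule; the standard information bottleneck update additionally multiplies each weight by p(t). The trade-off parameter β (`beta`), the function names, and the toy data are illustrative assumptions, and the initial soft assignment is assumed strictly positive.

```python
from math import exp, log

def kl(p, q):
    """KL-divergence in nats; infinite if q is zero where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return float("inf")
            total += pi * log(pi / qi)
    return total

def iib_step(p_x, p_v_given_x, p_t_given_x, beta):
    """One iteration: centers from current assignment, then new assignment."""
    n, k, d = len(p_x), len(p_t_given_x[0]), len(p_v_given_x[0])
    # Step 2: cluster masses p(t) and centers p(V|t) as weighted centroids.
    p_t = [sum(p_t_given_x[x][t] * p_x[x] for x in range(n)) for t in range(k)]
    p_v_given_t = [
        [sum(p_v_given_x[x][v] * p_t_given_x[x][t] * p_x[x] for x in range(n)) / p_t[t]
         for v in range(d)]
        for t in range(k)
    ]
    # Step 1: soft assignment decaying exponentially with distance to center.
    new_assignment = []
    for x in range(n):
        weights = [exp(-beta * kl(p_v_given_x[x], p_v_given_t[t])) for t in range(k)]
        norm = sum(weights)
        new_assignment.append([w / norm for w in weights])
    return new_assignment, p_t, p_v_given_t

p_x = [0.25, 0.25, 0.25, 0.25]
p_v_given_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
init = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]

assignment, p_t, centers = iib_step(p_x, p_v_given_x, init, beta=5.0)
print(assignment)
```

Repeating the step until the assignment stops changing gives the soft clustering; larger β makes the assignment closer to a hard one.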
![Page 141: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/141.jpg)
Dealing with massive data sets
Clustering on massive data sets is a problem
Two main heuristics:
– Sampling [DKOSV06]: pick a small sample of the data, cluster it, and (if necessary) assign remaining points to clusters using soft assignment.
  How many points to sample to get good bounds?
– Streaming: scan the data in one pass, performing clustering on the fly
  How much memory is needed to get a reasonable-quality solution?
![Page 142: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/142.jpg)
LIMBO (for aIB) [ATMS04]
BIRCH-like idea:
– Maintain (sparse) summary for each cluster (p(t), p(V|t))
– As data streams in, build clusters on groups of objects
– Build next-level clusters on cluster summaries from lower level
![Page 143: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/143.jpg)
Outline
Part 1
– Introduction to Information Theory
– Application: Data Anonymization
– Application: Data Integration
Part 2
– Review of Information Theory Basics
– Application: Database Design
– Computing Information Theoretic Primitives
– Open Problems
![Page 144: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/144.jpg)
Open Problems
Data exploration and mining – information theory as first-pass filter
Relation to nonparametric generative models in machine learning (LDA, PPCA, ...)
Engineering and stability: finding right knobs to make systems reliable and scalable
Other information-theoretic concepts ? (rate distortion, higher-order entropy, ...)
THANK YOU !
![Page 145: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/145.jpg)
References: Information Theory
[CT] Tom Cover and Joy Thomas: Information Theory.
[BMDG05] Arindam Banerjee, Srujana Merugu, Inderjit Dhillon, Joydeep Ghosh. Learning with Bregman Divergences, JMLR 2005.
[TPB98] Naftali Tishby, Fernando Pereira, William Bialek. The Information Bottleneck Method. Proc. 37th Annual Allerton Conference, 1998
![Page 146: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/146.jpg)
References: Data Anonymization
[AA01] Dakshi Agrawal, Charu C. Aggarwal: On the design and quantification of privacy preserving data mining algorithms. PODS 2001.
[AS00] Rakesh Agrawal, Ramakrishnan Srikant: Privacy preserving data mining. SIGMOD 2000.
[EGS03] Alexandre Evfimievski, Johannes Gehrke, Ramakrishnan Srikant: Limiting privacy breaches in privacy preserving data mining. PODS 2003.
![Page 147: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/147.jpg)
References: Data Integration
[AMT04] Periklis Andritsos, Renee J. Miller, Panayiotis Tsaparas: Information-theoretic tools for mining database structure from large data sets. SIGMOD 2004.
[DKOSV06] Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, Suresh Venkatasubramanian: Rapid identification of column heterogeneity. ICDM 2006.
[DKSTV08] Bing Tian Dai, Nick Koudas, Divesh Srivastava, Anthony K. H. Tung, Suresh Venkatasubramanian: Validating multi-column schema matchings by type. ICDE 2008.
[KN03] Jaewoo Kang, Jeffrey F. Naughton: On schema matching with opaque column names and data values. SIGMOD 2003.
[PPH05] Patrick Pantel, Andrew Philpot, Eduard Hovy: An information theoretic model for database alignment. SSDBM 2005.
![Page 148: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/148.jpg)
References: Database Design
[AL03] Marcelo Arenas, Leonid Libkin: An information theoretic approach to normal forms for relational and XML data. PODS 2003.
[AL05] Marcelo Arenas, Leonid Libkin: An information theoretic approach to normal forms for relational and XML data. JACM 52(2), 246-283, 2005.
[DR00] Mehmet M. Dalkilic, Edward L. Robertson: Information dependencies. PODS 2000.
[KL06] Solmaz Kolahi, Leonid Libkin: On redundancy vs dependency preservation in normalization: an information-theoretic study of XML. PODS 2006.
![Page 149: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/149.jpg)
References: Computing IT quantities
[P03] Liam Paninski. Estimation of entropy and mutual information. Neural Computation 15: 1191-1254, 2003.
[GT53] I. J. Good. Turing’s anticipation of Empirical Bayes in connection with the cryptanalysis of the Naval Enigma. Journal of Statistical Computation and Simulation, 66(2), 2000.
[KT81] R. E. Krichevsky and V. K. Trofimov. The performance of universal encoding. IEEE Trans. Inform. Th. 27 (1981), 199-207.
[CCM07] Amit Chakrabarti, Graham Cormode and Andrew McGregor. A near-optimal algorithm for computing the entropy of a stream. Proc. SODA 2007.
[HNO08] Nick Harvey, Jelani Nelson, Krzysztof Onak. Sketching and Streaming Entropy via Approximation Theory. FOCS 2008.
[ATMS04] Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller and Kenneth C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. EDBT 2004.
![Page 150: Information Theory For Data Management Divesh Srivastava Suresh Venkatasubramanian](https://reader036.vdocuments.site/reader036/viewer/2022062304/56649c765503460f9492a1bd/html5/thumbnails/150.jpg)
References: Computing IT quantities
[S] Noam Slonim. The Information Bottleneck: theory and applications. Ph.D Thesis. Hebrew University, 2000.
[GMV08] Sudipto Guha, Andrew McGregor, Suresh Venkatasubramanian. Streaming and sublinear approximations for information distances. ACM Trans Alg. 2008
[GIM08] Sudipto Guha, Piotr Indyk, Andrew McGregor. Sketching Information Distances. JMLR, 2008.