Emiran Curtmola @ UC San Diego
Alin Deutsch @ UC San Diego
K.K. Ramakrishnan @ AT&T
Divesh Srivastava @ AT&T
SIGMOD, June 2010
DATA ONLINE COMMUNITIES
Such applications are typically centralized
▪ Hosted online communities
▪ Search engines
Limitations
▪ Disintermediation of publishers from queriers
▪ Publishers need to give up their data
▪ Central site controls visibility of publishers to queriers
▪ Publishers lose their right to privacy
Free data exchange within the community
Some users want to remain autonomous
User privacy (i.e., not all users may want to reveal their true identity)
▪ Publishers express their opinions anonymously to avoid association with sensitive or controversial issues (e.g., politics, race, religion)
User autonomy + privacy suggest a decentralized infrastructure
Make it safer for publishers to join and post data
Prevent association of sensitive topics with the publishers that contribute to them, even if nodes are compromised
Publisher k-anonymity: for every publisher p and data item d, hide p in a k-protected crowd of publishers: there are at least k-1 other potential publishers of the same d
News & Blogs
Advertised data items about the publisher’s articles
P1 Beijing, Tibet, stocks, poverty, money
P2 Beijing, yak tea, Hong Kong, poverty
P3 Beijing, Tibet, yak tea, Hong Kong, money
P4 Beijing, Olympics, yak tea, stocks, money
P5 Beijing, Olympics, yak tea, stocks, money
P6 Olympics, Tibet, stocks, money
P7 Olympics, yak tea, stocks, money
P8 Olympics, yak tea, stocks, money
Query Q1: find the articles mentioning the Olympics in Beijing
Query Q2: find the articles about Tibet
Query Q3: find the articles mentioning poverty
Query Q4: find the articles mentioning money in Hong Kong
[Figure: the community data collection. Each publisher P1 to P8 keeps its own local XML data.]
How to query ad-hoc distributed data sources while preserving user privacy?
Allow publishers to keep complete control over their data
▪ Disseminate queries in the network, not data
▪ Publishers answer queries at their own discretion
▪ Published data is not traceable back to publishers, even if nodes are compromised
Infrastructure setup that supports
▪ Distribution of data
▪ Large nr. of decentralized publishers and consumers
▪ User privacy
▪ Efficient query routing (to avoid flooding the network)
Build an overlay network to act as a distributed index
Peers are organized into logical query dissemination trees (QDTs)
Use QDTs to disseminate queries using node summaries
P1’s advertised set of terms: Beijing, Tibet, stocks, poverty, money
P2’s advertised set of terms: Beijing, yak tea, Hong Kong, poverty
Node 3’s summary (set of terms): Beijing, Tibet, stocks, poverty, money, yak tea, Hong Kong
[Figure: a QDT of numbered routers with the publishers P1 to P8 attached below them. Legend: numbered node = router, P = publisher.]
A node’s summary is the union of its subtrees’ summaries
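The union rule can be checked against the running example. The sketch below assumes, as the callouts suggest, that node 3 aggregates the advertised sets of P1 and P2, and it uses plain Python sets in place of the real summary structure; `summarize` is a hypothetical helper name.

```python
# Advertised term sets copied from the running example.
p1_terms = {"Beijing", "Tibet", "stocks", "poverty", "money"}
p2_terms = {"Beijing", "yak tea", "Hong Kong", "poverty"}


def summarize(*child_summaries):
    """Return a node summary: the union of its children's summaries."""
    summary = set()
    for child in child_summaries:
        summary |= child
    return summary


# Assumption: node 3 sits above P1 and P2.
node3_summary = summarize(p1_terms, p2_terms)
print(sorted(node3_summary))
```

The result matches the seven terms listed for node 3’s summary.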
[Figure: Q3 = “poverty” is disseminated down the QDT, from router to router, toward the publishers.]
Only P1 and P2 publish articles about poverty
At each node, check set inclusion: query into node’s summary (Bloom filter)
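A minimal sketch of summary-based pruning, under stated assumptions: the topology in `TREE` is hypothetical (it is not the tree drawn in the figure), summaries are exact Python sets rather than Bloom filters, and a parent checks a child's summary before forwarding. The advertised sets are the ones from the running example.

```python
ADVERTISED = {
    "P1": {"Beijing", "Tibet", "stocks", "poverty", "money"},
    "P2": {"Beijing", "yak tea", "Hong Kong", "poverty"},
    "P3": {"Beijing", "Tibet", "yak tea", "Hong Kong", "money"},
    "P4": {"Beijing", "Olympics", "yak tea", "stocks", "money"},
    "P5": {"Beijing", "Olympics", "yak tea", "stocks", "money"},
    "P6": {"Olympics", "Tibet", "stocks", "money"},
    "P7": {"Olympics", "yak tea", "stocks", "money"},
    "P8": {"Olympics", "yak tea", "stocks", "money"},
}

# Hypothetical topology for illustration only: router -> children.
TREE = {
    "root": ["rA", "rB", "rC"],
    "rA": ["P1", "P2", "P3"],
    "rB": ["P4", "P5"],
    "rC": ["P6", "P7", "P8"],
}


def summary(node):
    """A publisher's summary is its advertised set; a router's is the union below it."""
    if node in ADVERTISED:
        return ADVERTISED[node]
    result = set()
    for child in TREE[node]:
        result |= summary(child)
    return result


def disseminate(node, query, reached):
    """Forward the query only into subtrees whose summary includes every query term."""
    reached.append(node)
    if node in ADVERTISED:
        return [node]
    matches = []
    for child in TREE[node]:
        if query <= summary(child):  # set inclusion test
            matches += disseminate(child, query, reached)
    return matches


reached_q3 = []
print(disseminate("root", {"poverty"}, reached_q3), reached_q3)
```

With this topology, Q3 reaches only `rA`, P1 and P2; the subtrees under `rB` and `rC` are pruned. For multi-term queries a router's union can contain all terms even when no single publisher below it does, so the final check at the publisher remains necessary.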
Pruning
Minimum information at each node
▪ No node has global information
▪ Node summaries are vectors of counters (Bloom filters) representing hash values of advertised data items
Queries reach publishers in such a manner that users cannot tell whether a publisher chose not to respond or has no matching documents
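A summary built as a vector of counters can be sketched as a counting Bloom filter. Everything below is an illustrative assumption rather than the paper's construction: the class name, the vector size, the number of hash functions, and the use of SHA-256 to derive positions.

```python
import hashlib


class CountingBloomFilter:
    """Vector of counters indexed by hash values of advertised terms."""

    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, term):
        # Derive num_hashes positions by salting the term with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, term):
        for pos in self._positions(term):
            self.counters[pos] += 1

    def remove(self, term):
        # Only valid for a term that was added before.
        for pos in self._positions(term):
            self.counters[pos] -= 1

    def might_contain(self, term):
        # False means "definitely absent"; True may be a false positive.
        return all(self.counters[pos] > 0 for pos in self._positions(term))

    def merge(self, other):
        # A parent's summary: element-wise sum of its children's counters.
        merged = CountingBloomFilter(self.size, self.num_hashes)
        merged.counters = [a + b for a, b in zip(self.counters, other.counters)]
        return merged


p1 = CountingBloomFilter()
for term in ["Beijing", "Tibet", "stocks", "poverty", "money"]:
    p1.add(term)
print(p1.might_contain("poverty"))
```

Counters, unlike single bits, let a publisher withdraw a term without rebuilding the summary, and element-wise addition gives the summary of a subtree.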
[Figure: the QDT with Q3 = “poverty” and the publishers holding “poverty” articles.]
▪ If an edge node is compromised
▪ Risk: individual updates of node summaries (from publishers to edge routers) may expose the publishers
[Figure: the QDT with a protected crowd of publishers marked.]
▪ Solution: publisher k-anonymity
Hide users in protected crowds of at least k publishers and use secure multi-party (SMP) computation inside crowds to advertise updates of published terms to the edge routers
[Figure: edge router 4 and a 3-anonymous protected crowd of publishers P1, P2, P3. A random mask is added (+R), the publishers add their updates (+Upd1, +Upd2, +Upd3), and the mask is removed (-R), so that edge router 4 receives Upd1 + Upd2 + Upd3.]
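One common way to realize such an aggregate is a masked ring sum. The sketch below is a reading of the figure labels (+R, +Upd, -R), not the paper's protocol: it assumes the first publisher draws the mask R, updates are counter vectors, arithmetic is modulo an assumed `MODULUS`, and crowd members do not collude.

```python
import secrets

MODULUS = 2**32  # assumed; must exceed any real counter sum


def add_vectors(a, b):
    return [(x + y) % MODULUS for x, y in zip(a, b)]


def subtract_vectors(a, b):
    return [(x - y) % MODULUS for x, y in zip(a, b)]


def aggregate_updates(updates):
    """Simulate the ring: mask, accumulate every update, then unmask."""
    mask = [secrets.randbelow(MODULUS) for _ in range(len(updates[0]))]
    running = mask                          # +R, drawn by the first publisher
    for update in updates:                  # +Upd1, +Upd2, +Upd3
        # Each publisher only ever sees a masked running total.
        running = add_vectors(running, update)
    return subtract_vectors(running, mask)  # -R, before sending to the edge router


# Hypothetical counter-vector updates from P1, P2, P3.
print(aggregate_updates([[1, 0, 2], [0, 1, 0], [3, 0, 0]]))  # [4, 1, 2]
```

The edge router learns only the element-wise sum, so it cannot attribute an individual counter change to one member of the crowd.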
▪ If an internal node is compromised
▪ Risk: the node summary of advertised terms is exposed
→ Downstream may contain sensitive content, but the crowd of publishers is even bigger now
[Figure: the QDT with the protected crowd below an internal node.]
The tree topology introduces congestion at upper QDT levels during query dissemination
How to relieve the congestion?
Overlay multiple logical QDTs over the same underlay network
A physical node belongs to multiple logical QDTs, but at different levels
Goal: organize the nodes into QDTs such that the distribution of tree levels for a node is uniform across the QDTs
[Figure: QDT1, QDT2, QDT3, QDT4 overlaid on the same nodes, with node 11 marked in each tree.]
Partition the community data collection into disjoint blocks
Build one QDT per block: QDTi groups all publishers with terms in Bi
Routing a query
▪ Terms in the query determine the relevant blocks
▪ Send the query to the corresponding QDT
▪ Check the full query with the publishers
Block  Terms
B1     Beijing, Olympics
B2     Tibet, yak tea
B3     Hong Kong, stocks
B4     poverty, money
[Figure: QDT1, QDT2, QDT3, QDT4, one tree per block.]
Q3 = “poverty”: Q3 falls in B4, so use QDT4
[Figure: QDT1, QDT2, QDT3, QDT4, with the queries Q3 = “poverty” and Q1 = “Olympics”, “Beijing”.]
Q4 = “Hong Kong”, “money”
Route Q4 on both trees?
Query selectivity optimization techniques: choose the more selective QDT to route on by maintaining only 1-3% of popular data items (see paper)
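The block lookup can be sketched with the block table from the example. `relevant_qdts` is a hypothetical helper; it only finds the candidate trees and does not implement the selectivity-based choice, which is described in the paper.

```python
# Block table from the running example: block -> terms.
BLOCKS = {
    "B1": {"Beijing", "Olympics"},
    "B2": {"Tibet", "yak tea"},
    "B3": {"Hong Kong", "stocks"},
    "B4": {"poverty", "money"},
}


def relevant_qdts(query_terms):
    """Return the QDTs whose block contains at least one query term."""
    terms = set(query_terms)
    return ["QDT" + block[1:] for block, block_terms in BLOCKS.items()
            if block_terms & terms]


print(relevant_qdts(["poverty"]))              # Q3 -> ['QDT4']
print(relevant_qdts(["Olympics", "Beijing"]))  # Q1 -> ['QDT1']
print(relevant_qdts(["Hong Kong", "money"]))   # Q4 -> ['QDT3', 'QDT4']
```

Assuming conjunctive queries, a publisher that matches all of Q4 advertises terms in both B3 and B4 and therefore belongs to both QDT3 and QDT4, so routing on either candidate tree reaches it; the full query is still checked at the publisher.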
[Figure: the block table repeated, with QDT3 and QDT4 as the two trees that Q4 touches.]
Our solution
Empirical fact: Upper two levels in a QDT are the most congested
Model: cyclical permutation of nodes on the tree levels
Nr. of QDTs for load balance = nr. of legal permutations (i.e., without breaking the fairness property)
Fairness property: all routers appear precisely once in the top two levels of any QDT
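A rough sketch of the cyclic-permutation idea, under assumptions that go beyond the slide: each QDT is described by a breadth-first list of routers, every router has the same fanout, and the fairness property is read as "across the QDTs, each router occupies the top two levels exactly once". `build_level_orders` is a hypothetical helper, and the paper's model may differ.

```python
def build_level_orders(routers, fanout):
    """Cyclically shift the router list so each QDT gets a fresh set of top nodes."""
    top = 1 + fanout                  # root plus its children = top two levels
    num_qdts = len(routers) // top    # shifts that keep the top sets disjoint
    orders = []
    for i in range(num_qdts):
        shift = i * top
        orders.append(routers[shift:] + routers[:shift])
    return orders, top


# Hypothetical example: 12 routers, fanout 2 -> 4 QDTs.
orders, top = build_level_orders(list(range(1, 13)), fanout=2)
for order in orders:
    print("top two levels:", order[:top])
```

When the size of the top two levels divides the number of routers, every router lands in the top two levels of exactly one tree, which is the sense in which the number of legal shifts bounds the number of QDTs in this sketch.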
Overall throughput depends heavily on the most congested node
Look at node stress in terms of nr. of messages
▪ going into a node: Processing Load at a node (PLoad)
▪ going out of a node: Forwarding Load at a node (FLoad)
Throughput indicator: compare how far the peak load (k QDTs) is from the ideal load, for both PLoad (P) and FLoad (F)
Ideal load (avg. load for 1 QDT): $\text{ideal load} = \frac{\text{nr. msgs}}{\text{nr. nodes}}$
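The indicator can be computed as below. Expressing the gap as a percentage above the ideal load is an assumption about how "how far" is measured, the function names are hypothetical, and the numbers in the example are made up for illustration.

```python
def ideal_load(num_messages, num_nodes):
    """Average load per node: nr. msgs / nr. nodes."""
    return num_messages / num_nodes


def distance_from_ideal(peak_load, num_messages, num_nodes):
    """Gap between the most loaded node and the ideal load, in percent of the ideal."""
    ideal = ideal_load(num_messages, num_nodes)
    return 100.0 * (peak_load - ideal) / ideal


# Made-up numbers: 1000 messages over 20 nodes, busiest node handles 75.
print(ideal_load(1000, 20))               # 50.0
print(distance_from_ideal(75, 1000, 20))  # 50.0
```

The same function applies to PLoad (messages going into a node) and FLoad (messages going out of a node); only the per-node counts differ.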
Experiment 1: PLoad for Scribe QDT topology
▪ Result: the nr. of QDTs for load balance found experimentally coincides with that given by our analytical model
▪ Load balance
  ▪ How close: at closest, 32% from ideal PLoad
  ▪ How close: at closest, 923% from ideal FLoad
To balance FLoad, node fanouts need to be the same
Experiment 2: FLoad for fanout-balanced QDT topologies
▪ How close: at closest, 18% from ideal PLoad
▪ How close: at closest, 130% from ideal FLoad
Propose a novel publishing infrastructure
Empowers publishers to join and post without being associated with (sensitive) content
Generic solution: it extracts the maximum load balance supported by the QDT topology