michael melliar-smith university of california, santa barbara
DESCRIPTION
Trustworthy Information Distribution and Retrieval. Michael Melliar-Smith University of California, Santa Barbara. Research conducted in collaboration with Louise E. Moser, Isai Michel Lombera and Yung-Ting Chuang - PowerPoint PPT PresentationTRANSCRIPT
Trustworthy Information Distribution and Retrieval
Michael Melliar-Smith
University of California, Santa Barbara
Research conducted in collaboration withLouise E. Moser, Isai Michel Lombera and Yung-Ting Chuang
Supported in part by NSF Grant CNS 10-16103
METIS'2011 iTrust Michael Melliar-Smith 2
Information Access over the Internet Modern society and commerce depend on access to
information over the Internet Information is accessed over the Internet using
centralized search engines and search indexes– Internet search engines are centralized for efficiency
and scalability We cannot assume that centralized search engines
will always deliver the information we seek, uncensored and unbiased
iTrust is a system for publishing, searching for, and retrieving information over the Internet that provides trustworthy access to information
METIS'2011 iTrust Michael Melliar-Smith 3
What Is Trust?
"Trust is the mental background to delegation" (Castelfranchi and Falcone [1])
When you delegate an activity, you trust that:– Good things will happen– Bad things will not happen
However, what is good and what is bad is a matter of your intent
Thus, it is not possible to provide a universal definition of trust
Several useful surveys [2], [3] on trust, related to reliability, security, and privacy
METIS'2011 iTrust Michael Melliar-Smith 4
Trust in Algorithms
Trust in a strong encryption algorithm is based on estimates of the cost of breaking the keys– Publication of the encryption algorithm– Independent validation of the strength of the algorithm– No need to trust the people who created the algorithm
We aim to provide the same kind of trust for iTrust– Published algorithms– Statistical analysis– No need to trust an administrator
Trust a social network with a large number of users
METIS'2011 iTrust Michael Melliar-Smith 5
Related Work Useful surveys [4], [5], [6] on publish/subscribe and
distributed search, categorized as:– Structured – Require nodes to be organized in an overlay network,
e.g., Distributed Hash Tables (DHTs), rings, trees More efficient than unstructured
Involve administrative control and additional overhead for constructing
and maintaining the overlay network Still incur a trust risk through administrative control
– Unstructured – Gossip-based; typically use randomization Gnutella [7] - Great grandfather of unstructured systems; uses
flooding of requests to find information Freenet [8] - More sophisticated and efficient than Gnutella, because
it learns from previous requests Zhong and Shen [9] - Uses random walks; number of nodes visited by
a request is proportional to the square root of the content popularity Ferreira, et al. [10] - Uses random walks; replicates both the queries
and the data in a sparse network
METIS'2011 iTrust Michael Melliar-Smith 6
Related Work Several recent information distribution and
retrieval systems are concerned with security, privacy, and trust– Quasar [11] – A probabilistic publish/subscribe system,
using a sparse structured overlay that is concerned with the release of sensitive information by a central node
– OneSwarm [12] – A peer-to-peer data sharing system that uses a combination of trusted and untrusted peers
Part of an effort to provide an alternative to cloud computing that does not depend on centralized trust
Initial goal is to protect the privacy of the users Uses trusted intermediary nodes to preserve anonymity
METIS'2011 iTrust Michael Melliar-Smith 7
Objectives of iTrust
Main Objective: Provide users with information that is important to them
Publish information for other people to access Search for, and retrieve, published information Detect that the system is under attack When the system is under attack, adapt to increase
the probability of information distribution and retrieval, with some increase in costs
METIS'2011 iTrust Michael Melliar-Smith 8
Non-Objectives of iTrust
No attempt to prevent the distribution of misinformation
No attempt to maintain secrecy and privacy No attempt to minimize communication,
processing and storage costs– Costs are greater than for conventional, centralized
Internet search engines– We assume that the additional costs are acceptable,
given the primary objective – Information distribution and retrieval are probabilistic
METIS'2011 iTrust Michael Melliar-Smith 9
Characteristics of iTrust
Falls into the general category of random walk publish/subscribe systems – All nodes are equal– No central search engine– Replication of metadata and requests– Random search or random walk– No supernodes– Distributed membership– No central control over membership
METIS'2011 iTrust Michael Melliar-Smith 10
Basic Idea of iTrust The participating nodes constitute the membership of iTrust A source node is a participating node that
– Produces information, which it makes available to other participating nodes
– Produces metadata (keywords) describing the information, and distributes the metadata to a subset of participating nodes chosen at random
A requesting node is a participating node that– Generates requests containing metadata and distributes the
requests to a subset of participating nodes chosen at random A participating node that receives a request
– Compares the metadata in the request with the metadata it holds– If it finds a match, it returns the URL of the associated
information to the requesting node The requesting node then uses the URL to retrieve the
information from the source node
METIS'2011 iTrust Michael Melliar-Smith 11
Distribution of Metadata
Source ofInformation
METIS'2011 iTrust Michael Melliar-Smith 12
Distribution of a Request
Source ofInformation
Requester ofInformation
RequestEncounters
Metadata
METIS'2011 iTrust Michael Melliar-Smith 13
Retrieval of Information
Source ofInformation
Requester ofInformation
RequestMatched
METIS'2011 iTrust Michael Melliar-Smith 14
Probabilistic Analysis
Membership contains n participating nodes Metadata are distributed to m nodes Requests are distributed to r nodes Of the participating nodes, a proportion x
are operational
METIS'2011 iTrust Michael Melliar-Smith 15
Probabilistic Analysis
A request is distributed to r nodes (r trials)
Probability of no match on 1st trial: n-m
Probability of no match on 2nd trial: n-m-1
Probability of no match on rth trial: n-m-r+1
Probability q of no match on r trials: n-m n-m-1 … n-m-r+1 = (n-m)! (n-r)!
n
n-1
n-r+1
n n-1 n-r+1 n! (n-m-r)!
METIS'2011 iTrust Michael Melliar-Smith 16
Probabilistic Analysis
Probability p of a match on r trials:
1 - n-m n-m-1 … n-m-r+1
= 1 - (n-m)! (n-r)!
m=r =n p > 1 – e-1 0.6321
m=r =2n p > 1 – e-2 0.8647
m=r =2n p > 1 – e-4 0.9817
If m+r > n, then p = 1
n n-1 n-r+1
n! (n-m-r)!where n ≥ m+r
METIS'2011 iTrust Michael Melliar-Smith 17
Probabilistic Analysis
Now assume that only a proportion x of the participating nodes are operational
Probability that 1st node has the metadata: m
Probability that 1st node has the metadata and is operational: mx
Probability of no match on 1st trial: 1- mx = n-mx
Probability of no match on 2nd trial: n-mx-1
Probability of no match on rth trial: n-mx-r+1
n
n
n n
n-r+1
n-1
METIS'2011 iTrust Michael Melliar-Smith 18
Probability q of no match on r trials: n-mx n-mx-1 … n-mx-r+1
= (n-mx)! (n-r)!
Probability p of a match on r trials:
1 - n-mx n-mx-1 … n-mx-r+1
= 1 - (n-mx)! (n-r)!
If mx+r > n, then p = 1
Probabilistic Analysis
n n-1 n-r+1
n! (n-mx-r)!
n n-1 n-r+1
n! (n-mx-r)!where n ≥ mx+r
METIS'2011 iTrust Michael Melliar-Smith 19
Probability of a Match
METIS'2011 iTrust Michael Melliar-Smith 20
Probability of a Match
METIS'2011 iTrust Michael Melliar-Smith 21
Time-to-Live
Metadata can be provided with a time-to-live– Receiver of the metadata deletes the metadata when
the time-to-live expires
Similarly, a request can be provided with a time-to-live– Receiver of the request stores the request until the
time-to-live and then deletes the request– Receiver attempts to match newly arrived metadata
with the metadata in the request until the time-to-live
METIS'2011 iTrust Michael Melliar-Smith 22
Small Information
Many information items are small Distribute the information itself, rather than
the metadata about the information
METIS'2011 iTrust Michael Melliar-Smith 23
Different Classes of Nodes
Some nodes are less capable, or are only intermittently connected to the network
Distribute the metadata and the requests only to the more capable nodes
Less capable nodes might have more powerful proxy nodes or home agents
METIS'2011 iTrust Michael Melliar-Smith 24
Forwarding Metadata and Requests
To exploit the parallelism of the Internet, the originator of the metadata or request does not necessarily send the metadata or request to all m or r of the participating nodes
When a node receives the metadata or request, with some probability, it forwards the metadata or request to another participating node selected at random
Doing so introduces some variability in the number of nodes to which the metadata and requests are distributed
METIS'2011 iTrust Michael Melliar-Smith 25
Differential Distribution
If there are many more requests than metadata, it might be appropriate to distribute the metadata to more nodes and the requests to fewer nodes
Similarly, long-lived metadata and requests might be distributed to more nodes than short-lived metadata and requests
Likewise, frequently requested metadata might be distributed to more nodes than rarely requested metadata
METIS'2011 iTrust Michael Melliar-Smith 26
Network Load
We are investigating the effects of metadata and request distribution on the network load and also on the load of participating nodes
If the network load is too high, it might be necessary to reduce the number of nodes to which the metadata and requests are distributed
METIS'2011 iTrust Michael Melliar-Smith 27
Membership
The membership of participating nodes need not be exact and up-to-date
Small differences in the membership are equivalent to small proportions of non-operational nodes
It is essential, to the iTrust strategy, that the membership should not be centrally managed
Thus, we employ a membership algorithm that is based on iTrust itself
METIS'2011 iTrust Michael Melliar-Smith 28
Membership Algorithm
A node wishing to join the membership contacts any current member to obtain the current membership
– It does so using mechanisms that are outside the iTrust strategy, perhaps Email, Twitter, etc.
It then publishes its joining the membership, through the iTrust distribution and retrieval mechanisms
All nodes periodically request and retrieve information about new nodes that have joined the membership
METIS'2011 iTrust Michael Melliar-Smith 29
Joining the Membership
METIS'2011 iTrust Michael Melliar-Smith 30
Discovering New Members
METIS'2011 iTrust Michael Melliar-Smith 31
Leaving the Membership
METIS'2011 iTrust Michael Melliar-Smith 32
Rapidly Changing Memberships
At times of rapid membership change, it might be appropriate to request and retrieve membership information more frequently, with increased computation and communication costs
At times of rapid membership change, it might be appropriate to distribute the metadata and requests to more nodes to compensate for inaccurate membership information
METIS'2011 iTrust Michael Melliar-Smith 33
Large Memberships
Large memberships (perhaps millions of nodes) might be expensive to retrieve and store
The potentially high rate of notifications of membership changes for a large membership might impose a heavy load on the network
We are investigating strategies for creating and maintaining memberships in which each node is aware of only a small subset of the membership
We are also investigating the effects of small subsets of the membership, on the effectiveness of information distribution and retrieval
METIS'2011 iTrust Michael Melliar-Smith 34
Encryption
In iTrust, there is no intention to use encryption to ensure secrecy or privacy at the node level
Necessarily, metadata and requests must be readable by large numbers of nodes and, thus, they are public
However, encryption can be used to make it prohibitively expensive for routers to use deep packet inspection to censor metadata or requests
For this purpose, iTrust uses standard public key encryption
METIS'2011 iTrust Michael Melliar-Smith 35
Encrypted Metadata and Requests
When a node sends metadata or requests, it encrypts the message with
– Its private key– Destination’s public key
The sending node includes its public key in the message– Some receiver nodes might not yet have
its information in their membership tables
When a node finds that a request matches its metadata,
– It uses the public key in the request to encrypt the response reporting the match to the requester
– The response supplies the URL that directs the requester to the source of the information, and also the source’s public key
METIS'2011 iTrust Michael Melliar-Smith 36
Potential Malicious Attacks
A malicious attacker might seed the network with covertly subverted nodes that behave normally, except that they fail to report matches involving information that the attacker wants to suppress
A malicious attacker must ensure that a large number of nodes that participate in matching have been subverted
In iTrust, it is important to detect a malicious attack, and to prevent the malicious attack from being effective
METIS'2011 iTrust Michael Melliar-Smith 37
Detecting a Malicious Attack
In iTrust, it is likely that a request will result in several reports of matches
The probability of multiple reports of matches depends on: – Number n of participating nodes– Number m of nodes to which the metadata are distributed– Number r of nodes to which the requests are distributed– Probability x that a node is operational
The effect of a malicious attack is to increase the probability that a substantial number of subverted nodes appear to be non-operational for certain metadata or requests
METIS'2011 iTrust Michael Melliar-Smith 38
Detecting a Malicious Attack
0
0.1
0.2
0.3
0.4
0.5
0.6
k=0 k=1 k=2 k=3 k=4 k=5
80%
60%
40%
20%
1000 Node Network withDistribution to 60 Nodes
Percentage of nodesoperational 100%
Number of Matches
Pro
bab
ilit
y o
f N
um
ber
of
Mat
ches
METIS'2011 iTrust Michael Melliar-Smith 39
Responding to a Malicious Attack
If the iTrust network is under attack, it is appropriate to increase the number of nodes to which the metadata and requests are distributed
We are investigating an adaptive algorithm that increases the number of nodes to which the metadata and requests are distributed, as the probability of an attack increases, i.e., our estimate of the number of subverted nodes increases
METIS'2011 iTrust Michael Melliar-Smith 40
Prototype Implementation of iTrust
Based on the Apache Web server, compiled with several PHP standard modules and library extensions
Uses HTTP for distribution of metadata and requests, and for retrieval of documents
Multiple iTrust nodes can be installed on a single Web server by creating multiple virtual Web sites on that server
Comprises three components:– Web Server Foundation– Application Infrastructure– Public Interface
METIS'2011 iTrust Michael Melliar-Smith 41
apachePHP
public interface
delete nodes
leave membership
query
search
inbox
statistics
user settings
tools
metadata inbox
tika / lucene / dictionary
metadata functionsmetadata xml engine
register metadata list
apply xml
publish xml list
helper functions
nodes wrapper
keywords wrapper
resource wrapper
tag keyword resource
search functions
globals / navigation
cURL
SQLite
session
log
PECL http
(a) (b) (c)
The Components of iTrust
METIS'2011 iTrust Michael Melliar-Smith 42
Web Server Foundation
cURL is used for inter-node communication and resource-specific actions
SQLite database tables are used to store node, metadata, and resource information– A node uses the SQLite LIKE function to match the
metadata in a request with the metadata that it holds The session module tracks and distinguishes users The log module is used for debugging and
for simulation The PHP Extension Community Library (PECL)
for HTTP is used for inter-node search and requests
METIS'2011 iTrust Michael Melliar-Smith 43
Application Infrastructure
The metadata XML engine scans the resources and creates an XML list describing the relationship between the metadata and the resource
The node and resource-related helper functions insert nodes into the membership, insert keywords into the database, and upload or fetch resources
The Apache Tika and Lucene packages are used to generate metadata from resources, if the user opts not to generate the metadata manually
The WordNet dictionary is used to provide spell checking and synonym suggestions
METIS'2011 iTrust Michael Melliar-Smith 44
Public Interface
Comprises two kinds of interfaces:– Computer interfaces – Handle all inter-node
communication such as queries, resource distribution, and metadata list distribution
Request is sent to participating nodes using computer interfaces in a simple inbox-type fashion
A participating nodes reads its inbox for queries, and sends back a response if it has a match
– Human interfaces – Consist of PHP HTML Web pages Administrator can add nodes or metadata keywords using
HTML form text boxes User generates requests using HTML form text boxes User settings and statistics Web pages provide feedback on
the membership size, resource count, etc.
METIS'2011 iTrust Michael Melliar-Smith 45
Prototype Implementation of iTrust
METIS'2011 iTrust Michael Melliar-Smith 46
Prototype Implementation of iTrust
METIS'2011 iTrust Michael Melliar-Smith 47
Prototype Implementation of iTrust
METIS'2011 iTrust Michael Melliar-Smith 48
Simulation Results Based on the Implementation
METIS'2011 iTrust Michael Melliar-Smith 49
Conclusion and Future Work
We have described iTrust [13], [14], a trustworthy information distribution and retrieval network
We plan to do experimental evaluations of the prototype implementation using PlanetLab
We are investigating other implementations of iTrust based on:– SMS– Twitter– What else? We need your advice.
We plan to make the iTrust source code, tools, documentation, etc. freely available for all to use
METIS'2011 iTrust Michael Melliar-Smith 50
References[1] C. Castelfranchi and R. Falcone, Principles of Trust for MAS: Cognitive Anatomy, Social Importance, and Quantification,
Proceedings of the International Conference on Multi-Agent Systems, Paris, France, 72-79. [2] D. Artz and Y. Gil, A Survey of Trust in Computer Science and the Semantic Web, Journal
of Web Semantics, 5:58-71, Elsevier, 2007.[3] T. Grandison and M. Sloman, A Survey of Trust in Internet Applications, IEEE
Communications Survey Tutorials, 4(4):2-16, 2000.[4] P. T. Eugster, P. A. Felber, R. Guerraoui and A. M. Kermarrec, The Many Faces of Publish/Subscribe, ACM
Computing Surveys, 35(2):114-131, June 2005.[5] J. Mischke and B. Stiller, A Methodology for the Design of Distributed Search in P2P Middleware, IEEE
Network 19(1):30-37, January 2004.[6] J. Risson and T. Moors, Survey of Research Towards Robust Peer-to-Peer Networks: Search Methods, Technical Report
UNSW-EE-P2P-1-1. University of New South Wales, September 2007, RFC 4982, http://tools.ietf.org/html/rfc-4821.[7] Gnutella, http://gnutella.wego.com/.[8] I. Clarke, O. Sandberg, B. Wiley and T. Hong, Freenet: A Distributed Anonymous Information Storage and Retrieval System,
Proceedings of the Workshop on Design Issues in Anonymity and Unobservability, Lecture Notes in Computer Science, Berkeley, CA, July 2000, 46-66.
[9] M. Zhong and K. Shen, Popularity-Based Random Walks for Peer-to-Peer Search under the Square-Root Principle, Lecture Notes in Computer Science 4490, 2007, 877-880.
[10] R. A. Ferreira, M. K. Ramanathan, A. Awan, A. Grama and S. Jagannathan, Search with Probabilistic Guarantees in Unstructured Peer-to-Peer Networks, Proceedings of the Fifth IEEE International Conference on Peer-to-Peer Computing, Konstance, Germany, August 2005, 165-172.
[11] B. Wong and S. Guha, Quasar: A Probabilistic Publish-Subscribe System for Social Networks, Proceedings of the 7th International Workshop on Peer-to-Peer Systems, Tampa Bay, FL, February 2008.
[12] T. Isdal, M. Piatek, A. Krishnamurthy and T. Anderson, Privacy Preserving P2P Data Sharing with OneSwarm, Technical Report UW-CSE, Department of Computer Science, University of Washington, 2009.
[13] I. Michel Lombera, Y. T. Chuang, P. M. Melliar-Smith and L. E. Moser, Trustworthy Distribution and Retrieval of Information over HTTP and the Internet, Proceedings of the Third International Conference on the Evolving Internet, INTERNET 2011, Luxembourg, June 2011.
[14] Y. T. Chuang, I. Michel Lombera, L. E. Moser and P. M. Melliar-Smith, Trustworthy Distributed Search and Retrieval over the Internet, Proceedings of the 2011 WORLDCOMP, International Conference on Internet Computing, ICOMP, Las Vegas, NV, July 2011.
METIS'2011 iTrust Michael Melliar-Smith 51
Thank You!
Questions?
Comments…