
Trustworthy Information Distribution and Retrieval

Michael Melliar-Smith

University of California, Santa Barbara

Research conducted in collaboration with Louise E. Moser, Isai Michel Lombera, and Yung-Ting Chuang

Supported in part by NSF Grant CNS 10-16103

METIS'2011 iTrust Michael Melliar-Smith

Information Access over the Internet

Modern society and commerce depend on access to information over the Internet

Information is accessed over the Internet using centralized search engines and search indexes
– Internet search engines are centralized for efficiency and scalability

We cannot assume that centralized search engines will always deliver the information we seek, uncensored and unbiased

iTrust is a system for publishing, searching for, and retrieving information over the Internet that provides trustworthy access to information


What Is Trust?

"Trust is the mental background to delegation" (Castelfranchi and Falcone [1])

When you delegate an activity, you trust that:
– Good things will happen
– Bad things will not happen

However, what is good and what is bad is a matter of your intent

Thus, it is not possible to provide a universal definition of trust

Several useful surveys [2], [3] cover trust as it relates to reliability, security, and privacy


Trust in Algorithms

Trust in a strong encryption algorithm is based on estimates of the cost of breaking the keys
– Publication of the encryption algorithm
– Independent validation of the strength of the algorithm
– No need to trust the people who created the algorithm

We aim to provide the same kind of trust for iTrust
– Published algorithms
– Statistical analysis
– No need to trust an administrator

Trust a social network with a large number of users


Related Work

Useful surveys [4], [5], [6] on publish/subscribe and distributed search, categorized as:

– Structured – Require nodes to be organized in an overlay network, e.g., Distributed Hash Tables (DHTs), rings, trees
  • More efficient than unstructured
  • Involve administrative control and additional overhead for constructing and maintaining the overlay network
  • Still incur a trust risk through administrative control

– Unstructured – Gossip-based; typically use randomization
  • Gnutella [7] - Great grandfather of unstructured systems; uses flooding of requests to find information
  • Freenet [8] - More sophisticated and efficient than Gnutella, because it learns from previous requests
  • Zhong and Shen [9] - Uses random walks; number of nodes visited by a request is proportional to the square root of the content popularity
  • Ferreira, et al. [10] - Uses random walks; replicates both the queries and the data in a sparse network


Related Work

Several recent information distribution and retrieval systems are concerned with security, privacy, and trust

– Quasar [11] – A probabilistic publish/subscribe system, using a sparse structured overlay, that is concerned with the release of sensitive information by a central node

– OneSwarm [12] – A peer-to-peer data sharing system that uses a combination of trusted and untrusted peers
  • Part of an effort to provide an alternative to cloud computing that does not depend on centralized trust
  • Initial goal is to protect the privacy of the users
  • Uses trusted intermediary nodes to preserve anonymity


Objectives of iTrust

Main Objective: Provide users with information that is important to them

Publish information for other people to access

Search for, and retrieve, published information

Detect that the system is under attack

When the system is under attack, adapt to increase the probability of information distribution and retrieval, with some increase in costs


Non-Objectives of iTrust

No attempt to prevent the distribution of misinformation

No attempt to maintain secrecy and privacy

No attempt to minimize communication, processing and storage costs
– Costs are greater than for conventional, centralized Internet search engines
– We assume that the additional costs are acceptable, given the primary objective
– Information distribution and retrieval are probabilistic


Characteristics of iTrust

Falls into the general category of random walk publish/subscribe systems
– All nodes are equal
– No central search engine
– Replication of metadata and requests
– Random search or random walk
– No supernodes
– Distributed membership
– No central control over membership


Basic Idea of iTrust

The participating nodes constitute the membership of iTrust

A source node is a participating node that
– Produces information, which it makes available to other participating nodes
– Produces metadata (keywords) describing the information, and distributes the metadata to a subset of participating nodes chosen at random

A requesting node is a participating node that
– Generates requests containing metadata and distributes the requests to a subset of participating nodes chosen at random

A participating node that receives a request
– Compares the metadata in the request with the metadata it holds
– If it finds a match, it returns the URL of the associated information to the requesting node

The requesting node then uses the URL to retrieve the information from the source node
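
The publish-and-search cycle described above can be sketched as a small Monte Carlo simulation. This is an illustrative sketch only, not the iTrust implementation: the function names, the integer node identifiers, and the use of sampling without replacement are assumptions made for the example.

```python
import random


def simulate_request(n, m, r, rng):
    """Run one publish-then-search round and return the number of matches.

    n: number of participating nodes (hypothetical ids 0..n-1)
    m: number of nodes that receive the metadata
    r: number of nodes that receive the request
    """
    nodes = range(n)
    # The source node distributes its metadata to m nodes chosen at random.
    metadata_holders = set(rng.sample(nodes, m))
    # The requesting node distributes its request to r nodes chosen at random.
    request_receivers = rng.sample(nodes, r)
    # Every receiver that also holds the metadata would return the URL.
    return sum(1 for node in request_receivers if node in metadata_holders)


def estimate_match_probability(n, m, r, trials=2000, seed=1):
    """Estimate the probability that a request finds at least one match."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if simulate_request(n, m, r, rng) > 0)
    return hits / trials


if __name__ == "__main__":
    print(estimate_match_probability(n=1000, m=60, r=60))
```

Because both subsets are chosen independently and at random, no single node decides whether a request succeeds; the estimate depends only on n, m, and r.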


Distribution of Metadata

[Figure; label: Source of Information]


Distribution of a Request

[Figure; labels: Requester of Information, Request Encounters Metadata]


Retrieval of Information

[Figure; labels: Requester of Information, Request Matched]

Probabilistic Analysis

Membership contains n participating nodes

Metadata are distributed to m nodes

Requests are distributed to r nodes

Of the participating nodes, a proportion x are operational



A request is distributed to r nodes (r trials)

Probability of no match on 1st trial: $\frac{n-m}{n}$

Probability of no match on 2nd trial: $\frac{n-m-1}{n-1}$

Probability of no match on rth trial: $\frac{n-m-r+1}{n-r+1}$

Probability q of no match on r trials:

$$q = \frac{n-m}{n} \cdot \frac{n-m-1}{n-1} \cdots \frac{n-m-r+1}{n-r+1} = \frac{(n-m)!\,(n-r)!}{n!\,(n-m-r)!}$$



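Probability p of a match on r trials:

$$p = 1 - \frac{n-m}{n} \cdot \frac{n-m-1}{n-1} \cdots \frac{n-m-r+1}{n-r+1} = 1 - \frac{(n-m)!\,(n-r)!}{n!\,(n-m-r)!} \quad \text{where } n \ge m+r$$

$m = r = \sqrt{n}$: $p > 1 - e^{-1} \approx 0.6321$

$m = r = \sqrt{2n}$: $p > 1 - e^{-2} \approx 0.8647$

$m = r = 2\sqrt{n}$: $p > 1 - e^{-4} \approx 0.9817$

If m+r > n, then p = 1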
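
A short sketch of why lower bounds of this form hold, assuming n ≥ m+r and using only the product form of q given above. Each factor of q is at most 1 − m/n, and 1 − y ≤ e^{−y} for every real y:

$$q = \prod_{i=0}^{r-1} \left(1 - \frac{m}{n-i}\right) \le \left(1 - \frac{m}{n}\right)^{r} \le e^{-mr/n}, \qquad \text{so} \qquad p = 1 - q \ge 1 - e^{-mr/n}$$

The exponents 1, 2, and 4 in the bounds above correspond to mr/n = 1, 2, and 4.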



Now assume that only a proportion x of the participating nodes are operational

Probability that 1st node has the metadata: $\frac{m}{n}$

Probability that 1st node has the metadata and is operational: $\frac{mx}{n}$

Probability of no match on 1st trial: $1 - \frac{mx}{n} = \frac{n-mx}{n}$

Probability of no match on 2nd trial: $\frac{n-mx-1}{n-1}$

Probability of no match on rth trial: $\frac{n-mx-r+1}{n-r+1}$


Probability q of no match on r trials:

$$q = \frac{n-mx}{n} \cdot \frac{n-mx-1}{n-1} \cdots \frac{n-mx-r+1}{n-r+1} = \frac{(n-mx)!\,(n-r)!}{n!\,(n-mx-r)!}$$

Probability p of a match on r trials:

$$p = 1 - \frac{n-mx}{n} \cdot \frac{n-mx-1}{n-1} \cdots \frac{n-mx-r+1}{n-r+1} = 1 - \frac{(n-mx)!\,(n-r)!}{n!\,(n-mx-r)!} \quad \text{where } n \ge mx+r$$

If mx+r > n, then p = 1
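
The match probability can be evaluated numerically from the product form, which avoids large factorials. The following is a minimal sketch; the function name `match_probability` and its default argument are assumptions made for illustration.

```python
def match_probability(n, m, r, x=1.0):
    """Probability p of at least one match on r trials.

    Evaluates p = 1 - prod_{i=0}^{r-1} (n - m*x - i) / (n - i),
    where x is the proportion of operational nodes.
    """
    mx = m * x
    if mx + r > n:
        # With mx + r > n, a match is certain.
        return 1.0
    q = 1.0
    for i in range(r):
        # Probability of no match on trial i+1, given no match so far.
        q *= (n - mx - i) / (n - i)
    return 1.0 - q


if __name__ == "__main__":
    for x in (1.0, 0.8, 0.6, 0.4, 0.2):
        print(x, match_probability(1000, 60, 60, x))
```

Lowering x while holding n, m, and r fixed lowers p, which is the effect that distributing to more nodes is meant to offset.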


Probability of a Match


Time-to-Live

Metadata can be provided with a time-to-live
– Receiver of the metadata deletes the metadata when the time-to-live expires

Similarly, a request can be provided with a time-to-live
– Receiver of the request stores the request until the time-to-live expires and then deletes the request
– Receiver attempts to match newly arrived metadata with the metadata in the request until the time-to-live expires
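
One way a receiving node might handle time-to-live is sketched below. The class `TtlStore`, the function `match_new_metadata`, and the representation of metadata as keyword sets are hypothetical and chosen only for illustration; the slides do not specify the data structures used.

```python
import time


class TtlStore:
    """Holds items (metadata or requests) until their time-to-live expires."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (value, expiry time)

    def put(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)

    def purge_expired(self):
        """Delete every item whose time-to-live has expired."""
        now = self._clock()
        expired = [k for k, (_, expiry) in self._items.items() if expiry <= now]
        for k in expired:
            del self._items[k]

    def live_items(self):
        self.purge_expired()
        return {k: v for k, (v, _) in self._items.items()}


def match_new_metadata(pending_requests, keywords, url):
    """Match newly arrived metadata against stored, unexpired requests.

    Returns (request id, url) pairs for requests sharing a keyword.
    """
    return [
        (request_id, url)
        for request_id, wanted in pending_requests.live_items().items()
        if wanted & keywords
    ]
```

A node would call `match_new_metadata` whenever metadata arrive, so a request stored earlier can still be answered until it expires.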


Small Information

Many information items are small

Distribute the information itself, rather than the metadata about the information


Different Classes of Nodes

Some nodes are less capable, or are only intermittently connected to the network

Distribute the metadata and the requests only to the more capable nodes

Less capable nodes might have more powerful proxy nodes or home agents


Forwarding Metadata and Requests

To exploit the parallelism of the Internet, the originator of the metadata or request does not necessarily send the metadata or request to all m or r of the participating nodes

When a node receives the metadata or request, with some probability, it forwards the metadata or request to another participating node selected at random

Doing so introduces some variability in the number of nodes to which the metadata and requests are distributed
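
The variability introduced by probabilistic forwarding can be seen in a small simulation. This is a sketch under assumed parameters: the function names, the forwarding probability, and the number of initial sends are hypothetical and not taken from iTrust.

```python
import random


def forward_chain(n, forward_prob, rng, max_hops=10_000):
    """Nodes reached by one message that each recipient forwards
    with probability forward_prob to a node selected at random."""
    reached = set()
    node = rng.randrange(n)  # first recipient, chosen by the originator
    for _ in range(max_hops):
        reached.add(node)
        if rng.random() >= forward_prob:
            break  # this recipient does not forward the message
        # The next recipient is chosen at random; a repeat visit to the
        # same node collapses in the set and does not add a new holder.
        node = rng.randrange(n)
    return reached


def distribute(n, initial_sends, forward_prob, seed=None):
    """Originator sends to a few nodes; forwarding reaches the rest."""
    rng = random.Random(seed)
    reached = set()
    for _ in range(initial_sends):
        reached |= forward_chain(n, forward_prob, rng)
    return reached


if __name__ == "__main__":
    sizes = [len(distribute(1000, 6, 0.9, seed=s)) for s in range(10)]
    print(sizes)
```

With a forwarding probability of 0.9, each chain reaches about ten nodes on average, so six initial sends reach roughly sixty nodes, but the count differs from run to run.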


Differential Distribution

If there are many more requests than metadata, it might be appropriate to distribute the metadata to more nodes and the requests to fewer nodes

Similarly, long-lived metadata and requests might be distributed to more nodes than short-lived metadata and requests

Likewise, frequently requested metadata might be distributed to more nodes than rarely requested metadata


Network Load

We are investigating the effects of metadata and request distribution on the network load and also on the load of participating nodes

If the network load is too high, it might be necessary to reduce the number of nodes to which the metadata and requests are distributed


Membership

The membership of participating nodes need not be exact and up-to-date

Small differences in the membership are equivalent to small proportions of non-operational nodes

It is essential to the iTrust strategy that the membership not be centrally managed

Thus, we employ a membership algorithm that is based on iTrust itself


Membership Algorithm

A node wishing to join the membership contacts any current member to obtain the current membership

– It does so using mechanisms that are outside the iTrust strategy, perhaps email, Twitter, etc.

It then publishes the fact that it has joined the membership, through the iTrust distribution and retrieval mechanisms

All nodes periodically request and retrieve information about new nodes that have joined the membership


Joining the Membership


Discovering New Members


Leaving the Membership


Rapidly Changing Memberships

At times of rapid membership change, it might be appropriate to request and retrieve membership information more frequently, with increased computation and communication costs

At times of rapid membership change, it might be appropriate to distribute the metadata and requests to more nodes to compensate for inaccurate membership information


Large Memberships

Large memberships (perhaps millions of nodes) might be expensive to retrieve and store

The potentially high rate of notifications of membership changes for a large membership might impose a heavy load on the network

We are investigating strategies for creating and maintaining memberships in which each node is aware of only a small subset of the membership

We are also investigating the effects of small subsets of the membership on the effectiveness of information distribution and retrieval


Encryption

In iTrust, there is no intention to use encryption to ensure secrecy or privacy at the node level

Necessarily, metadata and requests must be readable by large numbers of nodes and, thus, they are public

However, encryption can be used to make it prohibitively expensive for routers to use deep packet inspection to censor metadata or requests

For this purpose, iTrust uses standard public key encryption


Encrypted Metadata and Requests

When a node sends metadata or requests, it encrypts the message with
– Its private key
– Destination’s public key

The sending node includes its public key in the message
– Some receiver nodes might not yet have its information in their membership tables

When a node finds that a request matches its metadata,
– It uses the public key in the request to encrypt the response reporting the match to the requester
– The response supplies the URL that directs the requester to the source of the information, and also the source’s public key


Potential Malicious Attacks

A malicious attacker might seed the network with covertly subverted nodes that behave normally, except that they fail to report matches involving information that the attacker wants to suppress

A malicious attacker must ensure that a large number of nodes that participate in matching have been subverted

In iTrust, it is important to detect a malicious attack, and to prevent the malicious attack from being effective


Detecting a Malicious Attack

In iTrust, it is likely that a request will result in several reports of matches

The probability of multiple reports of matches depends on:
– Number n of participating nodes
– Number m of nodes to which the metadata are distributed
– Number r of nodes to which the requests are distributed
– Probability x that a node is operational

The effect of a malicious attack is to increase the probability that a substantial number of subverted nodes appear to be non-operational for certain metadata or requests
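
The distribution of the number of matches can be sketched with a hypergeometric model. This is an assumption made for illustration: it treats mx, rounded to a whole number, as the count of operational nodes holding the metadata, which agrees with the no-match probability q for k = 0. The function name `match_count_probability` is hypothetical.

```python
from math import comb


def match_count_probability(n, m, r, x, k):
    """Probability that exactly k of the r request receivers report a match.

    Assumption: round(m * x) operational nodes hold the metadata, and the
    r request receivers are drawn at random without replacement.
    """
    holders = round(m * x)
    if k < 0 or k > min(holders, r) or r - k > n - holders:
        return 0.0
    return comb(holders, k) * comb(n - holders, r - k) / comb(n, r)


if __name__ == "__main__":
    # 1000-node network with distribution to 60 nodes.
    for x in (1.0, 0.8, 0.6, 0.4, 0.2):
        row = [match_count_probability(1000, 60, 60, x, k) for k in range(6)]
        print(x, [round(p, 4) for p in row])
```

A node that sees consistently fewer matches than this distribution predicts for the assumed x has some evidence that nodes are failing to report matches.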


Detecting a Malicious Attack

[Figure: probability of the number of matches (k = 0 to 5) for a 1000-node network with distribution to 60 nodes, with 100%, 80%, 60%, 40%, and 20% of the nodes operational. Horizontal axis: Number of Matches. Vertical axis: Probability of Number of Matches (0 to 0.6).]


Responding to a Malicious Attack

If the iTrust network is under attack, it is appropriate to increase the number of nodes to which the metadata and requests are distributed

We are investigating an adaptive algorithm that increases the number of nodes to which the metadata and requests are distributed, as the probability of an attack increases, i.e., our estimate of the number of subverted nodes increases


Prototype Implementation of iTrust

Based on the Apache Web server, compiled with several PHP standard modules and library extensions

Uses HTTP for distribution of metadata and requests, and for retrieval of documents

Multiple iTrust nodes can be installed on a single Web server by creating multiple virtual Web sites on that server

Comprises three components:
– Web Server Foundation
– Application Infrastructure
– Public Interface


The Components of iTrust

[Figure: the components of iTrust, in panels (a), (b), and (c). Labels: Apache, PHP; public interface: delete nodes, leave membership, query, search, inbox, statistics, user settings, tools, metadata inbox; tika / lucene / dictionary, metadata functions, metadata xml engine, register metadata list, apply xml, publish xml list, helper functions, nodes wrapper, keywords wrapper, resource wrapper, tag keyword resource, search functions, globals / navigation; cURL, SQLite, session, log, PECL http]


Web Server Foundation

cURL is used for inter-node communication and resource-specific actions

SQLite database tables are used to store node, metadata, and resource information
– A node uses the SQLite LIKE function to match the metadata in a request with the metadata that it holds

The session module tracks and distinguishes users

The log module is used for debugging and for simulation

The PHP Extension Community Library (PECL) for HTTP is used for inter-node search and requests
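
Matching with SQLite LIKE might look like the following sketch. It uses Python's sqlite3 module rather than the prototype's PHP, and the table name, column names, and example URLs are hypothetical; the actual iTrust schema is not shown in these slides.

```python
import sqlite3


def create_metadata_table(conn):
    """Create and fill a hypothetical metadata table (keyword, url)."""
    conn.execute("CREATE TABLE metadata (keyword TEXT, url TEXT)")
    conn.executemany(
        "INSERT INTO metadata (keyword, url) VALUES (?, ?)",
        [
            ("trustworthy search", "http://node-a.example/doc1"),
            ("peer-to-peer", "http://node-b.example/doc2"),
            ("distributed TRUST", "http://node-c.example/doc3"),
        ],
    )


def find_matches(conn, request_keyword):
    """Return the URLs whose stored keyword contains the requested keyword.

    SQLite's LIKE is case-insensitive for ASCII characters by default.
    Any % or _ in request_keyword is treated as a wildcard.
    """
    pattern = f"%{request_keyword}%"
    rows = conn.execute(
        "SELECT DISTINCT url FROM metadata WHERE keyword LIKE ? ORDER BY url",
        (pattern,),
    )
    return [url for (url,) in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_metadata_table(conn)
    print(find_matches(conn, "trust"))
```

Passing the pattern as a bound parameter keeps the request text out of the SQL string itself.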


Application Infrastructure

The metadata XML engine scans the resources and creates an XML list describing the relationship between the metadata and the resource

The node and resource-related helper functions insert nodes into the membership, insert keywords into the database, and upload or fetch resources

The Apache Tika and Lucene packages are used to generate metadata from resources, if the user opts not to generate the metadata manually

The WordNet dictionary is used to provide spell checking and synonym suggestions


Public Interface

Comprises two kinds of interfaces:

– Computer interfaces – Handle all inter-node communication such as queries, resource distribution, and metadata list distribution
  • A request is sent to participating nodes using computer interfaces in a simple inbox-type fashion
  • A participating node reads its inbox for queries, and sends back a response if it has a match

– Human interfaces – Consist of PHP HTML Web pages
  • Administrator can add nodes or metadata keywords using HTML form text boxes
  • User generates requests using HTML form text boxes
  • User settings and statistics Web pages provide feedback on the membership size, resource count, etc.


Simulation Results Based on the Implementation


Conclusion and Future Work

We have described iTrust [13], [14], a trustworthy information distribution and retrieval network

We plan to do experimental evaluations of the prototype implementation using PlanetLab

We are investigating other implementations of iTrust based on:
– SMS
– Twitter
– What else? We need your advice.

We plan to make the iTrust source code, tools, documentation, etc. freely available for all to use


References

[1] C. Castelfranchi and R. Falcone, Principles of Trust for MAS: Cognitive Anatomy, Social Importance, and Quantification, Proceedings of the International Conference on Multi-Agent Systems, Paris, France, 72-79.

[2] D. Artz and Y. Gil, A Survey of Trust in Computer Science and the Semantic Web, Journal of Web Semantics, 5:58-71, Elsevier, 2007.

[3] T. Grandison and M. Sloman, A Survey of Trust in Internet Applications, IEEE Communications Surveys and Tutorials, 4(4):2-16, 2000.

[4] P. T. Eugster, P. A. Felber, R. Guerraoui and A. M. Kermarrec, The Many Faces of Publish/Subscribe, ACM Computing Surveys, 35(2):114-131, June 2005.

[5] J. Mischke and B. Stiller, A Methodology for the Design of Distributed Search in P2P Middleware, IEEE Network 19(1):30-37, January 2004.

[6] J. Risson and T. Moors, Survey of Research Towards Robust Peer-to-Peer Networks: Search Methods, Technical Report UNSW-EE-P2P-1-1, University of New South Wales, September 2007, RFC 4982, http://tools.ietf.org/html/rfc-4821.

[7] Gnutella, http://gnutella.wego.com/.

[8] I. Clarke, O. Sandberg, B. Wiley and T. Hong, Freenet: A Distributed Anonymous Information Storage and Retrieval System, Proceedings of the Workshop on Design Issues in Anonymity and Unobservability, Lecture Notes in Computer Science, Berkeley, CA, July 2000, 46-66.

[9] M. Zhong and K. Shen, Popularity-Based Random Walks for Peer-to-Peer Search under the Square-Root Principle, Lecture Notes in Computer Science 4490, 2007, 877-880.

[10] R. A. Ferreira, M. K. Ramanathan, A. Awan, A. Grama and S. Jagannathan, Search with Probabilistic Guarantees in Unstructured Peer-to-Peer Networks, Proceedings of the Fifth IEEE International Conference on Peer-to-Peer Computing, Konstanz, Germany, August 2005, 165-172.

[11] B. Wong and S. Guha, Quasar: A Probabilistic Publish-Subscribe System for Social Networks, Proceedings of the 7th International Workshop on Peer-to-Peer Systems, Tampa Bay, FL, February 2008.

[12] T. Isdal, M. Piatek, A. Krishnamurthy and T. Anderson, Privacy Preserving P2P Data Sharing with OneSwarm, Technical Report UW-CSE, Department of Computer Science, University of Washington, 2009.

[13] I. Michel Lombera, Y. T. Chuang, P. M. Melliar-Smith and L. E. Moser, Trustworthy Distribution and Retrieval of Information over HTTP and the Internet, Proceedings of the Third International Conference on the Evolving Internet, INTERNET 2011, Luxembourg, June 2011.

[14] Y. T. Chuang, I. Michel Lombera, L. E. Moser and P. M. Melliar-Smith, Trustworthy Distributed Search and Retrieval over the Internet, Proceedings of the 2011 WORLDCOMP, International Conference on Internet Computing, ICOMP, Las Vegas, NV, July 2011.


Thank You!

Questions?

Comments…
