![Page 1: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/1.jpg)
Text Retrieval in Peer to Peer Systems
David Karger
MIT
![Page 2: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/2.jpg)
Information Retrieval before P2P
The traditional approach
![Page 3: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/3.jpg)
Information Retrieval
- Most of our information base is text
  - academic journals
  - books and encyclopedias
  - news feeds
  - world wide web pages
  - email
- How do we find what we need?

[Slide scale: neat to messy]
![Page 4: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/4.jpg)
The Classic IR Model
- User has information need
- User formulates textual query
- System processes corpus of documents
- System extracts relevant documents
- User refines query
- Metrics:
  - recall: % of relevant documents retrieved
  - precision: % of retrieved documents that are relevant
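
The two metrics above can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function name `precision_recall` and the convention that an empty result set has precision 1.0 are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Assumed convention: fetching nothing counts as perfectly precise.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall

# Four documents retrieved, three relevant overall, two in common.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 4, 6})
print(precision, recall)
```

Here half of the retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).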
![Page 5: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/5.jpg)
Precision-Recall Tradeoff
[Figure: precision-recall plot. Axes: recall (to 100%) and precision (to 100%). Labeled points: Fetch Nothing, Fetch Everything, CIA, Web Search, Library.]
![Page 6: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/6.jpg)
Specific Retrieval Algorithms
- Define relevance
  - Build a model of documents, meanings
  - Ignore computational cost
- Implement efficiently
  - Preprocessing
    - TB corpora call for big-iron machines (or simulations)
  - Interaction:
    - after 1/2 second, user notices delay
    - after 10 seconds, user gives up (historical perspective; changed by web)
![Page 7: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/7.jpg)
Boolean Keyword Search
- Q: “Do harsh winters affect steel production?”
- Query: steel AND winter
- Output:
  - “Last WINTER, overproduction of STEEL led to...”
  - “STEEL automobiles resist WINTER weather poorly.”
  - “Boston must STEEL itself for another bad WINTER”
  - “the Pittsburgh STEELers started WINTER training...”
- Not Output:
  - “Cold weather caused increased metal prices as orders for radiators and automobiles picked up...”
![Page 8: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/8.jpg)
Implementing Boolean Search
- Typical: OR of ANDs
  - handle each OR'ed clause separately, then aggregate
- For ANDs, inverted index:
  - Per term, list of documents containing that term
  - Intersect lists for query terms
  - Basically a database join
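
The inverted-index approach above can be sketched as follows. This is an illustration under assumptions: the names `build_inverted_index` and `boolean_search` are hypothetical, and tokens are matched exactly, so unlike the earlier slide's example, “Steelers” does not match the term `steel` (that would need stemming or prefix matching).

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

def boolean_search(index, query):
    """query is an OR of ANDs: a list of clauses, each a list of terms."""
    results = set()
    for clause in query:
        postings = [index.get(term, set()) for term in clause]
        if postings:
            # AND: intersect the term lists; OR: union across clauses.
            results |= set.intersection(*postings)
    return results

docs = {
    1: "Last winter, overproduction of steel led to...",
    2: "Steel automobiles resist winter weather poorly.",
    3: "Boston must steel itself for another bad winter",
    4: "the Pittsburgh Steelers started winter training...",
    5: "Cold weather caused increased metal prices...",
}
index = build_inverted_index(docs)
print(sorted(boolean_search(index, [["steel", "winter"]])))
print(sorted(boolean_search(index, [["steel", "winter"], ["metal"]])))
```

The first query returns documents 1, 2 and 3; adding the OR'ed clause `metal` also brings in document 5.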
![Page 9: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/9.jpg)
Intersection Algorithms (as in DB)
- Method 1: direct list merge
  - Linear work in summed size of lists
- Method 2: examine candidates
  - Start with shortest term list
  - For each list entry, check for other search terms
  - Linear in smallest list
  - Good if at least one rare term, but
    - requires forward index (list of terms in each document)
    - no gain if all search terms common (“flying fish”)
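
A minimal sketch of the two methods, assuming sorted posting lists of integer document ids. The function names and the sample data are hypothetical.

```python
def intersect_merge(a, b):
    """Method 1: merge two sorted posting lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_candidates(postings, forward_index, terms):
    """Method 2: scan the shortest list and check each candidate document
    against the forward index (doc id -> set of terms in that document)."""
    shortest = min(terms, key=lambda t: len(postings[t]))
    others = [t for t in terms if t != shortest]
    return [d for d in postings[shortest]
            if all(t in forward_index[d] for t in others)]

postings = {
    "flying": [1, 2, 3, 5, 8],
    "fish": [2, 3, 4, 8, 9],
    "coelacanth": [8],
}
# Derive the forward index from the posting lists for this toy example.
forward_index = {}
for term, doc_ids in postings.items():
    for d in doc_ids:
        forward_index.setdefault(d, set()).add(term)

print(intersect_merge(postings["flying"], postings["fish"]))
print(intersect_candidates(postings, forward_index,
                           ["flying", "fish", "coelacanth"]))
```

With the rare term `coelacanth` in the query, method 2 inspects a single candidate; for `flying` and `fish` alone, both lists are long and the merge does comparable work.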
![Page 10: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/10.jpg)
Problems with Boolean Approach
- Synonymy
  - several words for same thing
  - if author used different one, query won’t match
- Polysemy
  - one word can mean many things (“bank”)
  - query matches wrong meanings
- Harsh cutoffs (1 wrong keyword kills)
  - user can’t type descriptive paragraph...
- Terms have uniform influence
  - repeated occurrence same as single occurrence
  - common terms treated same as rare ones
![Page 11: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/11.jpg)
Fixing Problems
- Synonymy
  - thesaurus can add equivalent terms to query
  - increases recall, but lowers precision
  - expensive to construct (semantics, so manual)
- Polysemy
  - use more query terms to disambiguate
  - user might not know more terms
  - increases precision, but lowers recall
- Harsh cutoffs
  - quorum system (maximize # matching terms)
- Uniformity?
![Page 12: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/12.jpg)
Vector Space Model
- Document is a vector with one coordinate per term
  - 0-1 for presence/absence of term (quorum)
  - real valued to represent term “importance”
    - term frequency in document increases value
    - term frequency in corpus decreases value
- Dot product with query measures similarity
- Best known implementation: inverted index
  - for each query term, list documents containing it
  - accumulate dot products
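
The accumulation of dot products over an inverted index can be sketched as below. The slides do not give a weighting formula; the tf × log(N/df) weight used here is one common choice and is an assumption, as are the function names. No length normalization is applied.

```python
import math
from collections import Counter, defaultdict

def build_weighted_index(docs):
    """Return {term: {doc_id: weight}} with weight = tf * log(N / df).
    The weighting formula is an assumption, not taken from the slides."""
    tf = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n = len(docs)
    index = defaultdict(dict)
    for d, counts in tf.items():
        for term, count in counts.items():
            index[term][d] = count * math.log(n / df[term])
    return index

def rank(index, query_weights):
    """Accumulate query-document dot products one query term at a time."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for d, weight in index.get(term, {}).items():
            scores[d] += q_weight * weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {
    "a": "steel steel winter",
    "b": "steel production",
    "c": "winter weather",
}
index = build_weighted_index(docs)
ranked = rank(index, {"steel": 1.0, "winter": 1.0})
print(ranked)
```

Document `a` ranks first because it contains both query terms and repeats `steel`. Only documents on some query term's list are ever touched, which is why the inverted index suits this model.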
![Page 13: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/13.jpg)
Vector Space Advantages
- Smoother than Boolean search
- Provides ranking rather than sharp cut-off
- Tends to allow/encourage queries with many nonzero terms
  - Easy to “expand query” with synonyms
  - Hopefully polysemes will “interfere constructively”
  - May even add relevant documents to query
    - 100s or 1000s of terms
![Page 14: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/14.jpg)
P2P IR
Simulating big iron
![Page 15: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/15.jpg)
Web Search Info From Google
- Web queries
  - Almost all queries 2 terms only
  - “Boolean vector space” model (tiny recall OK)
  - Zipf distribution, so caching queries helps some
- Corpus
  - 3B pages, 10 KB average size, 30 TB total
  - Inverted index: roughly the same size
  - Fits in a “moderate” P2P system of 30K nodes
  - But must be partitioned. How?
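
The corpus arithmetic above works out as follows, assuming decimal units (10 KB = 10,000 bytes) and an even split across nodes. The variable names are illustrative.

```python
pages = 3_000_000_000        # 3B pages
avg_page_bytes = 10_000      # 10 KB average size (decimal units assumed)
nodes = 30_000               # "moderate" P2P system

corpus_bytes = pages * avg_page_bytes
per_node_bytes = corpus_bytes / nodes  # assumes a perfectly even partition

print(corpus_bytes / 1e12, "TB total")        # 30.0 TB total
print(per_node_bytes / 1e9, "GB per node")    # 1.0 GB per node
```

Since the inverted index is roughly the same size as the corpus, each node would hold on the order of 1 GB of index.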
![Page 16: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/16.jpg)
Obvious: Partition Documents
- Node builds full inverted index for its subset
- Query quite tractable per node
- Merge results sent back from each node
- Used by Google (in data center) and Gnutella
- Drawback: query broadcast to all nodes
  - OK for Google data center; bad for P2P
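
A toy sketch of document partitioning, with in-process objects standing in for network nodes. The class `DocumentNode` and function `broadcast_search` are hypothetical names; the message counter only illustrates that every node is contacted per query.

```python
class DocumentNode:
    """Holds a full inverted index over its own subset of documents."""

    def __init__(self, docs):
        self.index = {}
        for doc_id, text in docs.items():
            for term in set(text.lower().split()):
                self.index.setdefault(term, set()).add(doc_id)

    def search(self, terms):
        postings = [self.index.get(term, set()) for term in terms]
        return set.intersection(*postings) if postings else set()

def broadcast_search(nodes, terms):
    """Send the query to every node and merge the answers."""
    results, messages = set(), 0
    for node in nodes:
        messages += 1  # one query message per node, regardless of content
        results |= node.search(terms)
    return results, messages

nodes = [
    DocumentNode({1: "steel winter", 2: "steel production"}),
    DocumentNode({3: "winter weather", 4: "winter steel prices"}),
    DocumentNode({5: "summer fish"}),
]
results, messages = broadcast_search(nodes, ["steel", "winter"])
print(sorted(results), messages)
```

The third node holds nothing relevant but is queried anyway, which is the broadcast drawback noted above.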
![Page 17: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/17.jpg)
Alternative: Partition Terms
- One node owns a few terms of inverted index
  - Term pair is “key” for distributed hash table
- Talk only to nodes that own query terms
- They return desired inverted-index lists
- Results intersected at query issuer
- Drawback: transfer huge inverted index lists
- Alternative: send first term-list to second
  - Ships 1 (perhaps small) list instead of 2
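
A sketch of term partitioning with the “send first term-list to second” variant. Everything here is illustrative: hashing a term modulo the node count is an assumption standing in for a real distributed hash table lookup, and plain dictionaries stand in for nodes.

```python
import hashlib

def owner(term, num_nodes):
    """Pick the node that owns a term. Hash-mod placement is an assumption
    standing in for a real DHT lookup."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def publish(nodes, doc_id, text):
    """Add a document's terms to the posting lists held by their owners."""
    for term in set(text.lower().split()):
        nodes[owner(term, len(nodes))].setdefault(term, set()).add(doc_id)

def search_pair(nodes, first, second):
    """Ship the first term's list to the second term's owner and intersect
    there. Returns (matches, number of list entries shipped)."""
    first_list = nodes[owner(first, len(nodes))].get(first, set())
    second_list = nodes[owner(second, len(nodes))].get(second, set())
    entries_shipped = len(first_list)  # only the first list crosses the network
    return first_list & second_list, entries_shipped

nodes = [dict() for _ in range(8)]
publish(nodes, 1, "steel winter")
publish(nodes, 2, "steel production")
publish(nodes, 3, "winter weather")
matches, shipped = search_pair(nodes, "production", "steel")
print(matches, shipped)
```

Putting the rarer term first keeps the shipped list small; how a query issuer would learn which term is rarer is left open here.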
![Page 18: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/18.jpg)
Avoiding Communication (Om Gnawali et al. @MIT)
- Build inverted index on term pairs
  - Pre-answering all queries
- Partition pairs among nodes
- Search contacts one node
- Problem: pre-computation cost
  - Size-n document generates n² pairs
  - Each pair must be communicated
  - Each pair must be stored
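
The quadratic blow-up can be seen by enumerating pairs directly. This sketch counts unordered pairs of distinct terms, so a document with n distinct terms yields n(n-1)/2 pairs, which grows as n². The function name is hypothetical.

```python
from itertools import combinations

def all_term_pairs(text):
    """All unordered pairs of distinct terms in a document."""
    terms = sorted(set(text.lower().split()))
    return list(combinations(terms, 2))

print(all_term_pairs("steel winter production"))

# A document with 1,000 distinct terms (synthetic terms t0..t999).
big_document = " ".join(f"t{i}" for i in range(1000))
print(len(all_term_pairs(big_document)))  # 1000 * 999 / 2 = 499500
```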
![Page 19: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/19.jpg)
Good Cases
- Music search
  - “document” is song title + author
  - n small, so n² factor unimportant
- Document windows
  - Usually, good docs have query terms “nearby”
  - Scan window of length 5, take pairs in window
  - 10 pairs/window, so 10n per document
  - So linear in corpus size as before
- Bundle pairs to ship over sparse overlay
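
The document-window idea can be sketched as below, with a hypothetical function `window_pairs`. A window of 5 positions contains C(5,2) = 10 pairs, which gives the 10n figure when each window is counted separately. This sketch deduplicates pairs shared by overlapping windows, so it emits at most 4 new pairs per position; either way the count is linear in document length.

```python
def window_pairs(tokens, window=5):
    """Unordered pairs of distinct terms that fall inside some window of
    `window` consecutive positions."""
    pairs = set()
    for i, a in enumerate(tokens):
        # Pair each token with the (window - 1) tokens that follow it.
        for b in tokens[i + 1 : i + window]:
            if a != b:
                pairs.add((a, b) if a < b else (b, a))
    return pairs

tokens = "a b c d e f g".split()
pairs = window_pairs(tokens)
print(len(pairs), ("a", "e") in pairs, ("a", "f") in pairs)
```

For these seven tokens, `a` and `e` share a window while `a` and `f` are too far apart, so a query for both `a` and `f` would not be pre-answered by this index.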
![Page 20: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/20.jpg)
What About Vector Space?
- Weighting terms is easy
- But cannot limit search to pair list
  - However, need only highest-scored documents on individual terms
  - So, pre-compute and store small “winner list”
- Vector space encourages many-term queries
  - Find pairs with small intersection
  - Index triples, quadruples, etc
  - Apply branch and bound techniques
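
One way to read the “winner list” idea is sketched here: for each term, keep only the k highest-weighted documents, and score queries over those short lists. This is an interpretation, not the authors' stated method; the function names and the sample weights are made up, and the result is approximate because documents outside every winner list are never scored.

```python
import heapq

def build_winner_lists(weighted_index, k):
    """Keep the k highest-weighted documents for each term."""
    return {
        term: dict(heapq.nlargest(k, postings.items(), key=lambda kv: kv[1]))
        for term, postings in weighted_index.items()
    }

def search_winners(winners, query_terms, top=10):
    """Score documents using only the per-term winner lists."""
    scores = {}
    for term in query_terms:
        for doc_id, weight in winners.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return heapq.nlargest(top, scores.items(), key=lambda kv: kv[1])

# Hypothetical precomputed term weights: {term: {doc_id: weight}}.
weighted_index = {
    "steel": {"a": 3.0, "b": 1.0, "c": 0.5},
    "winter": {"a": 1.0, "c": 2.0, "d": 0.1},
}
winners = build_winner_lists(weighted_index, k=2)
print(search_winners(winners, ["steel", "winter"], top=2))
```

With k = 2, document `c` is scored only on `winter` because it missed the `steel` winner list, which shows the kind of error the truncation introduces.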
![Page 21: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/21.jpg)
Google Pushback
- No need for P2P
  - More precisely: “keep peers in our data center”
  - Exploit high local communication bandwidth
- Economics support large server farm
  - More load? Buy more servers
- Main bottleneck: content provider bandwidth
  - Limits rate of crawl
  - Google index often weeks out of date
  - Distributed crawler won’t help
![Page 22: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/22.jpg)
Google Pushback Pushback
- P2P might help
  - Let each node build own index
  - Ship changes to Google
- Potential applications
  - real-time index
  - new-relevant-content notification
- Problem: SPAM
  - Content providers will lie about index changes
  - Use P2P system to spot-check?
![Page 23: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/23.jpg)
Person-to-Person IR
New modalities
![Page 24: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/24.jpg)
P2P: Systems Perspective
- Distributed system has more resources
  - Computation/Storage
  - Reliability
- Can exploit, if successfully hide
  - Latency
  - Bandwidth
- Goal: simulate reliable big iron
  - Solve traditional problems that need resources
  - File storage, factoring, database queries, IR
![Page 25: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/25.jpg)
P2P: Social Perspective
- Applications based on person-to-person interactions
  - Messaging
  - Linking/community building (the web)
  - Reputation management (Mojo Nation)
  - File-sharing collaborations (just now)
- Need not run on top of P2P network
![Page 26: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/26.jpg)
The “Pathetic Fallacy” of P2P
- Assumption that network layer should mirror social layer
  - E.g. “peers should be nodes with similar interests”
- Many applications work fine on one (big, reliable) machine
  - Placement on P2P system is “coincidental”
  - On other side of “one big machine” abstraction
- Breaching abstraction has bad consequences
  - Peering to “friends” unlikely to optimize efficiency, reliability
![Page 27: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/27.jpg)
P2P Opportunity: Leverage Involvement of People
- Each individual manipulates information
  - In much more powerful, semantic ways than machines can achieve
- Record that manipulation
- Exploit to help others do better retrieval
![Page 28: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/28.jpg)
Link-based Retrieval
- Simultaneous work:
  - Kleinberg at IBM
  - Brin/Page at Stanford/Google
- People find “good” web pages, link to them
- So, a page with large in-degree is good
- Refine: target of many good nodes is good
- Mathematically, random walk model
  - Page rank = stationary probability of random walk
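
The random-walk view can be sketched with power iteration. The slide only says that page rank is the stationary probability of a random walk; the damping factor of 0.85 (a random jump with probability 0.15), the even spreading of rank from pages without out-links, and the function name are assumptions added so that the walk has a unique stationary distribution.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate the stationary distribution of a random walk over `links`
    ({page: [pages it links to]}). Damping value is an assumption."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random page.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: spread its rank evenly (assumption).
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get), ranks)
```

Page `c` comes out on top because both other pages link to it, and `a` outranks `b` because its single in-link comes from the highly ranked `c`: the “target of many good nodes is good” refinement.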
![Page 29: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/29.jpg)
Applications
- Search
  - Raise relevance of high page-rank pages
  - If lazy, limit corpus to high page-rank
  - Anchor text better description than page contents
- Crawl
  - Page rank computed before page is seen
  - Prioritize high page-rank pages for crawl
- People add usable info no system could find
![Page 30: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/30.jpg)
P2P: Systems/Social Interactions
- Distributed system has novel properties
- Exploit them to enable novel capabilities
- E.g., anonymity
  - Relies on partition of control/knowledge
- E.g., privacy
  - Allow limited access to my private information
  - Gain (false, but important) sense of safety by keeping it on my machine
![Page 31: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/31.jpg)
Expertise Networks
- Haystack (Karger et al), Shock (Adar et al)
- Route questions to appropriate expert
  - Use text to describe knowledge
  - Based on human entry, or indexing of human’s personal files
  - Might be unwilling to admit knowledge
- P2P framework can protect anonymity
  - Shock achieves by Gnutella-style query broadcast
  - More efficient approach?
![Page 32: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/32.jpg)
Other New Aspects
- Personal information sharing
  - Unwilling to “publish” mail, documents to world
  - But might allow search, access in some cases
  - Keeping data, index on own machine gives (false) sense of security, privacy
- Anonymity
  - P2P provides strong anonymity primitives
  - Can be exploited, e.g., for “recommending” embarrassing content
![Page 33: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/33.jpg)
Sample Application
- Social: “Secret Web”
  - Maintain links for use by page-rank algorithm
  - But, links are secret from most others
- Need random walk through link path
  - Implement via recursive lookup
- Censorproof?, spamproof?
![Page 34: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/34.jpg)
Semantics vs. Syntax
- Clearly, using word meanings would help
- Some systems try to implement semantics
- But this is a core AI problem, unsolved
- Current attempts don’t scale to large corpora
- All current large systems are syntactic only
- Idea: use computational power of P2P
- Idea: use humans to attach semantics
![Page 35: Text Retrieval in Peer to Peer Systems David Karger MIT](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649c6f5503460f949221cd/html5/thumbnails/35.jpg)
Conclusion: Two Approaches to P2P
- Hide P2P (Partition to Partition)
  - Goal: illusion of single server
  - Know how to do task on single server
  - Devise tools to achieve same in distributed sys.
  - Focus on surmounting drawbacks: systems
- Exploit P2P (Person to Person)
  - Determine new opportunities afforded by P2P
  - Perhaps impossible on single server
  - Focus on new applications: AI? HCI?