document clustering and collection selection diego puppin web mining, 2006-2007

38
Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Upload: stanley-jacobs

Post on 18-Jan-2018

223 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Document Clustering and Collection

SelectionDiego Puppin

Web Mining, 2006-2007

Page 2: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Web Search Engines: Parallel Architecture

To improve throughput, latency, res. quality A broker works as a unique interfaceComposed of several computing serversQueries are routed to a subset The broker collects and merges results

Page 3: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007
Page 4: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 d7d7 d8d8t1t1 x xt2t2 x xt3t3 x x xt4t4 x xt5t5 x x x x x x xt6t6 x x xt7t7 x x x xt8t8 x x xt9t9

Page 5: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 d7d7 d8d8t1t1 x xt2t2 x xt3t3 x x xt4t4 x xt5t5 x x x x x x xt6t6 x x xt7t7 x x x xt8t8 x x xt9t9

Page 6: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 d7d7 d8d8t1t1 x xt2t2 x xt3t3 x x xt4t4 x xt5t5 x x x x x x xt6t6 x x xt7t7 x x x xt8t8 x x xt9t9

Page 7: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Doc-Partitioned Approach

The document base is split among serversEach server indexes and manages queries for its own documentsIt knows all terms of some documentsBetter scalability of indexing/search

Each server is independentDocuments can be easily added/removed

Page 8: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Term-Partitioned Approach

The dictionary is split among serversEach server stores the index for some termsIt knows documents where its terms occurPotential for load reductionPoor load balancing (some work...)

Page 9: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Some considerations

Every time you add/remove a docYou must update MANY servers

With queries:only relevant servers are queried but... servers with hot terms are overloaded

Page 10: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

How To Load BalancePut together related terms

This minimizes the number of hit servers

Try to put together group of documents with similar overall frequence

Servers shouldn’t be overloadedQuery logs could be used to predict

Page 11: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Multiple indexesTerm-based index

Query-vector or Bag-of-wordHot text index

Titles, Anchor, Bold etcLink-based index

It can find related pages etcKey-phrase index

For some idioms

Page 12: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

terms links hot

query terms

referrer, relatedquery terms

Page 13: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

How To Doc-Partition?

1. Random doc assignment + Query broadcast

2. Collection selection for independent collections (meta-search)

3. Smart doc assignment + Collection selection

4. Random assignment + Random selection

Page 14: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

1. Random + Broadcast

Used by commercial WSEsNo computing effort for doc clusteringVery high scalability

Low latency on each serverResult collection and merging is the heaviest part

Page 15: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

IR Core1idx

IR Core2idx

IR Corekidx

t1,t2,…tq r1,r2,…rr

query results

Broker

Distributed/Replicated Documents

Page 16: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

2. Independent collections

The WSE uses data from several sourcesIt routes the query to the most authoritative collection(s)It collects the results according to independent ranking choices (HARD)Example: Biology, News, Law

Page 17: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Coll1 Coll2 Coll.k

t1,t2,…tq r1,r2,…rr

query results

Broker

Page 18: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

3. Assignment + Selection

WSE creates document groupsEach server holds one groupThe broker has a knowledge of group placementThe selection strategy routes the query suitably

Page 19: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Doc cluster1idx

Doccluster2idx

Doccluster kidx

t1,t2,…tq r1,r2,…rr

query results

Broker

Page 20: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

4. Random + Random

If data (pages, resources...) are replicated, interchangeable, hard to indexData are stored in the server that publishes themWe query a few servers hoping to get something

Page 21: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Doc 1idx

Doc2idx

Dockidx

t1,t2,…tq r1,r2,…rr

query results

Broker

Page 22: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

CORI

The Effect of Database Size Distribution on Resource Selection Algorithms, Luo Si and Jamie CallanExtends the concept of TF.IDF to collections

Page 23: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007
Page 24: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007
Page 25: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

CORINeeds a deep collaboration from the collections:

Data about terms, documents, sizeUnfeasible with independent collections

Statistical sampling, Query-based sampling

Term-based: no links, no anchorsVery large footprint

Page 26: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007
Page 27: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Querylog-based Collection SelectionUsing Query Logs to Establish Vocabularies in Distributed Information RetrievalMilad Shokouhi; Justin Zobel; S.M.M. Tahaghoghi; Falk ScholerThe collections are sampled using data from a query logBefore: Queries over a dictionary (QBS)

Page 28: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

RecallNumber of found documents (not for web)

Precision at X (P@X)Number of relevant documents out of the first X results (X= 5, 10, 20, 100)

Average precisionAverage of Precision for increasing X

MAP (Mean Average Precision)The average over all queries

Page 29: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007
Page 30: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

And now for something completely

different!

Page 31: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Important features

It can use any underlying WSELinks, snippets, anchors...

Good or bad as the WSE it uses!Small footprint

It can be added as another index

Page 32: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

terms links hot query-vector

query terms

referrer, relatedquery terms

query mapping

Page 33: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

DevelopmentsQuery suggestions

The system finds related queriesResult grouping

Documents are already organized into groups

Query expansionCan find more complex queries still matching

Page 34: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 d7d7 d8d8Q1Q1 x xQ2Q2 x xQ3Q3 x x xQ4Q4 x xQ5Q5 x x x x x x xQ6Q6 x x xQ7Q7 x x x xQ8Q8 x x xQ9Q9

Page 35: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 d7d7 d8d8Q1Q1 x xQ2Q2 x xQ3Q3 x x xQ4Q4 x xQ5Q5 x x x x x x xQ6Q6 x x xQ7Q7 x x x xQ8Q8 x x xQ9Q9

Page 36: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 d7d7 d8d8Q1Q1 x xQ2Q2 x xQ3Q3 x x xQ4Q4 x xQ5Q5 x x x x x x xQ6Q6 x x xQ7Q7 x x x xQ8Q8 x x xQ9Q9

Page 37: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Still missingUsing the QV model as a full IR model:

Is it possible to perform queries over this representation?

New query terms cannot be foundCORI should be used for unseen queries

Better testing with topic shift

Page 38: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007

Possible seminars/projects

Advanced collection selectionStatistical sampling, Query-based sampling

Partitioning a LARGE collection (1 TB)Load balancing for doc- and term-partitioningTopic shift

Query log analysis