distributed search over the hidden web

46
Distributed Search over the Hidden Web Hierarchical Database Sampling and Selection Panagiotis G. Ipeirotis & Luis Gravano

Upload: ryu

Post on 05-Jan-2016

30 views

Category:

Documents


0 download

DESCRIPTION

Distributed Search over the Hidden Web. Hierarchical Database Sampling and Selection. Panagiotis G. Ipeirotis & Luis Gravano. Outline. Introduction Background Focused Probing for Content Summary Construction Exploiting Topic Hierarchies for Database Selection - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Distributed Search over the Hidden Web

Distributed Search over the Hidden Web

Hierarchical Database

Sampling and Selection

Panagiotis G. Ipeirotis & Luis Gravano

Page 2: Distributed Search over the Hidden Web

Outline

Introduction Background Focused Probing for Content Summary

Construction Exploiting Topic Hierarchies for Database

Selection Experiments: Data and Metrics Experimental Results Conclusion and Future Work

Page 3: Distributed Search over the Hidden Web

Introduction

Search engines create their indexes by spidering or crawling Web pages

Hidden Web sources store their content in searchable databases

Page 4: Distributed Search over the Hidden Web

An Example…

Searching in the medical database CANCERLIT – www.cancer.gov, the query [ lung and cancer ] returns 68,430 matches.

Searching Google with the query [“lung and cancer” site: www.cancer.gov ] returns 23 matches

Page 5: Distributed Search over the Hidden Web

Meta - searchers

Tools for searching Hidden data sources

Relies on statistical or content summaries Performs three main tasks:

Database SelectionQuery Translation Result Merging

Page 6: Distributed Search over the Hidden Web

This Paper Presents

An algorithm to derive content summaries from “uncooperative” databases

A Database Selection Algorithm that exploits: Extracted content summaries Hierarchical classification of the databases

Page 7: Distributed Search over the Hidden Web

Background:Existing Database Selection Algorithms

Page 8: Distributed Search over the Hidden Web

Contd. : Database Selection Algorithms

Assumption: Query words are independently distributed over database documents.

The answer to a query is the set of all the documents that satisfy the Boolean expression.

Deficiency: Are the content summaries accurate and up to date?

Page 9: Distributed Search over the Hidden Web

Uniform Probing for Content Summary Construction

Extracts a document sample from a given database, D and computes the frequency of each observed word w in the sample, SampleDF(w)

Page 10: Distributed Search over the Hidden Web

The Algorithm: Start with an empty content summary where

SampleDF(w) = 0 for each word w, and a general (i.e., not specific to D), comprehensive word dictionary.

Pick a word and send it as a query to database D.

Retrieve the top-k documents returned. If the number of retrieved documents exceeds a

pre-specified threshold, stop. Else continue the sampling process by returning to Step 2.

Page 11: Distributed Search over the Hidden Web

2 Versions of Algorithm

RS-Ord :

RandomSampling-OtherResource RS-Lrd :

RandomSampling-LearnedResource

Page 12: Distributed Search over the Hidden Web

Deficiencies:

ActualDF(w) for each word w is not revealed

RS-Ord tends to produce inefficient executions in which it repeatedly issues queries to databases that produce no matches

Page 13: Distributed Search over the Hidden Web

Database Classification

Rationale : Queries closely associated with topical categories retrieve mainly documents about that category

Place the database in a classification scheme based on the number of matches

Page 14: Distributed Search over the Hidden Web

Automation and Hierarchical Classification

Automates classification by queries derived automatically from a rule-based document classifier.

A rule-based classifier is a set of logical rules defining classification decisions. jordan AND bulls-->Sports, hepatitis-->Health

Apply this principle recursively to create a hierarchical classifier.

Page 15: Distributed Search over the Hidden Web

Focused Probing

Sends query probes, and extracts number of matches without retrieving any documents.

Calculates two metrics, the Coverage(Ci) and Specificity(Ci) for the subcategory Ci

If the values of Coverage(Ci) and Specificity(Ci) exceed two pre-specified thresholds Tc and Ts, respectively, classify the database into a category Ci

Page 16: Distributed Search over the Hidden Web

Author’s Algorithm

Exploit Topic Hierarchy

Produce a document sample that Is topically representative of the contents Gives accurate and efficient content

summary

Page 17: Distributed Search over the Hidden Web

Content-Summary Construction

Steps of the algorithm:

Query the database using focused probing to: Retrieve a document sample. Generate a preliminary content summary Categorize the database.

Estimate the absolute frequencies of the words retrieved from the database.

Page 18: Distributed Search over the Hidden Web

Building Content Summaries from Extracted Documents

ActualDF(w): The actual number of documents in the

database that contain word w. The algorithm knows this number only if [w] is

a single word query probe that was issued to the database

SampleDF(w): The number of documents in the extracted

sample that contain word w.

Page 19: Distributed Search over the Hidden Web

Focused Probing for Content Summary Construction

Page 20: Distributed Search over the Hidden Web

Focused Probing for Content Summary Construction

Page 21: Distributed Search over the Hidden Web

Focused Probing for Content

Summary Construction

Page 22: Distributed Search over the Hidden Web

Estimating Absolute Document Frequencies

Use Mandelbrot’s equation P(r+p)-B for distribution of words for estimating unknown ActualDF (¢) frequencies. Sort words in descending order of their SampleDF(¢)

frequencies  Focus on words with known ActualDF (¢) frequencies. Find the P, B, and p parameter values that best fit the

data. Estimate ActualDF (wi) for all words wi with unknown

ActualDF (wi) as P(ri+p)-B

Page 23: Distributed Search over the Hidden Web

Example

Page 24: Distributed Search over the Hidden Web

Creating Content Summaries for Topic CategoriesExample: “metastasis” did not appear in any of the documents

sampled from CANCERLIT during probing Cancer-BACUP classified under “Cancer”, has a high

ActualDFest(metastasis) = 3, 569 Convey this information by associating a content summary

with category “Cancer” that is obtained by merging the summaries of all databases under this category

In merged summary, ActualDFest(w) is sum of the document frequency of w for databases under this category

Page 25: Distributed Search over the Hidden Web

Creating Content Summaries for Topic Categories

Page 26: Distributed Search over the Hidden Web

Selecting Databases Hierarchically: Algorithm Inputs : a query Q, target databases K, top category C Steps: HierSelect(Query Q, Category C, int K)

1: Use a flat database selection algorithm to assign a score for Q to each subcategory of C

2: if there is a subcategory C with a non-zero score

3: Pick the subcategory Cj with the highest score

4: if NumDBs(Cj) >= K //Cj has enough databases

5: return HierSelect(Q,Cj ,K)

6: else // Cj does not have enough databases

7: return DBs(Cj) FlatSelect(Q,C-Cj,K-NumDBs(Cj))

8: else // no subcategory C has non-zero score

9: return FlatSelect(Q,C,K)

Page 27: Distributed Search over the Hidden Web

Example: Topic hierarchy for database selection (babe AND ruth ,k=3)

Page 28: Distributed Search over the Hidden Web

Experiments :Data and Metrics

Evaluate two main sets of techniques:1. Content-summary construction techniques

2. Database selection techniques Evaluate the algorithms, using two data

sets Controlled Database Set Web Database Set

Page 29: Distributed Search over the Hidden Web

Data Sets

Controlled Database Set 500,000 newsgroup articles from 54

newsgroup 81,000 articles to train documents classifiers

over the 72 – node topic hierarchy 419,000 articles to build the set of Controlled

Databases Contained 500 databases ranging in size from

25 to 25,000 documents.

Page 30: Distributed Search over the Hidden Web

Data Sets

Web Database Set 50 real web accessible databases with no

control over it.

Databases picked randomly from two directories of hidden-web databases, namely InvisibleWeb and Complete Planet

Page 31: Distributed Search over the Hidden Web

Content-summary construction Test variations of Focused Probing technique

against RS-Ord and RS-Lrd. Focused Probing:

Evaluated configurations with different underlying document classifiers for query-probe creation.

Different values for the thresholds Ts and Tc Varied the specificity threshold Ts from 0 to 1 Fixed coverage threshold to Tc = 10.

Page 32: Distributed Search over the Hidden Web

Database Selection Effectiveness

Underlying Database selection algorithm: Hierarchical algorithm

Relies on a “flat” database selection algorithm.

Chose algorithms: CORI, bGlOSS

Adapted both algorithms to work with category content summary.

Page 33: Distributed Search over the Hidden Web

Database Selection Effectiveness

Content Summary Construction Evaluated how the hierarchical database

selection algorithm behaved over content summaries generated by different techniques

Also studied QPilot Strategy Exploits HTML links to characterize text

databases.

Page 34: Distributed Search over the Hidden Web

Content Summary Quality Metric : content summaries coverage of the actual

database vocabulary ctf = ΣwєTr ActualDF(w) / ΣwєTd ActualDF(w) Tr = set of terms in content summary, Td =

complete set of words in vocabulary Results:

Focused Probing techniques achieve much higher ctf ratios than RS-Ord and RS-Lrd.

The coverage of the Focused Probing summaries increases for lower thresholds of Ts

Page 35: Distributed Search over the Hidden Web

Content Summary Quality

Correlation of word rankings: Used Spearman Rank Correlation Coefficient (SRCC ) –

to measure how well a content summary orders words by frequencies with respect to the actual word frequency order in the database.

Result :The Focused Probing method have higher SRCC values than the RS-Ord and RS-Lrd.

Page 36: Distributed Search over the Hidden Web

Content Summary Quality

Page 37: Distributed Search over the Hidden Web

Content Summary Quality - Efficiency

Focused Probing techniques on average retrieve one document per query sent

RS-Lrd retrieves about one document per two queries.

RS-Ord unnecessarily issues many queries that produce no document matches.

Page 38: Distributed Search over the Hidden Web

Content Summary Quality

Produce significantly better-quality summaries than RS-Ord and RS-Lrd do in terms of vocabulary coverage and word ranking preservation.

Page 39: Distributed Search over the Hidden Web
Page 40: Distributed Search over the Hidden Web

Database Selection Effectiveness

Methodology: Web set of real web-accessible databases 50 queries from the Web Track of TREC Each database selection algorithm picked 3

databases for the query

Retrieved the top 5 documents for the query.

Human evaluators to judge

the relevance of each retrieved document for the query

Page 41: Distributed Search over the Hidden Web

Database Selection Effectiveness

Measured the precision of a technique for each query q as :

Average precision of different database selection algorithms.

Page 42: Distributed Search over the Hidden Web

Database Selection Effectiveness

Analysis: All the flat selection techniques suffer from

incomplete coverage of the underlying probing-generated summaries.

QPilot summaries do not work well for database selection because they generally contain only a few words and are hence highly incomplete.

Page 43: Distributed Search over the Hidden Web

Hierarchical vs. flat database selection

The hierarchical algorithm using CORI as flat database selection has 50% better precision than CORI for flat selection with the same content summaries.

For bGlOSS, the improvement is 92%.

Reason: Topic hierarchy compensates for incomplete content summaries.

Page 44: Distributed Search over the Hidden Web

Hierarchical vs. flat database selection

Measured fraction of times that hierarchical database selection algorithm picked a database for a query That produced matches for the query And was given a zero score by the flat

database selection algorithm of choice.

Page 45: Distributed Search over the Hidden Web

Conclusion

Presented a novel and efficient method for the construction of content summaries of web accessible text databases

Presented a hierarchical database selection algorithm that exploits the database content summaries

Algorithm generated classification to produce accurate results even for imperfect content summaries.

Page 46: Distributed Search over the Hidden Web

Future Work

Alternative hierarchy traversing techniques. For example, “route” queries to multiple categories if appropriate.

Examine the effect of absolute frequency estimation on database selection.

Alternative methods for creating content summaries.