Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection

Panagiotis G. Ipeirotis & Luis Gravano

Outline

Introduction

Background

Focused Probing for Content Summary Construction

Exploiting Topic Hierarchies for Database Selection

Experiments: Data and Metrics

Experimental Results

Conclusion and Future Work

Introduction

Search engines create their indexes by spidering or crawling Web pages

Hidden Web sources store their content in searchable databases

An Example…

Searching the medical database CANCERLIT (www.cancer.gov) with the query [lung and cancer] returns 68,430 matches.

Searching Google with the query [“lung and cancer” site:www.cancer.gov] returns 23 matches.

http://www.cancer.gov/


Metasearchers

Tools for searching hidden-web data sources

Rely on statistical or content summaries of the databases

Perform three main tasks:

Database selection

Query translation

Result merging

This Paper Presents

An algorithm to derive content summaries from “uncooperative” databases

A Database Selection Algorithm that exploits: Extracted content summaries Hierarchical classification of the databases

Background: Existing Database Selection Algorithms

Database Selection Algorithms (continued)

Assumption: Query words are independently distributed over database documents.

The answer to a query is the set of all the documents that satisfy the Boolean expression.

Deficiency: Are the content summaries accurate and up to date?
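The independence assumption above can be made concrete with a small sketch. It estimates the number of matches for a conjunctive query from per-word document frequencies in a content summary, in the spirit of bGlOSS-style estimators. The function name `estimated_matches` and the sample numbers are assumptions made for illustration, not values from the slides.

```python
from math import prod

def estimated_matches(db_size, doc_freqs, query_words):
    """Estimate how many documents contain ALL query words, assuming the
    words are distributed independently over the database documents."""
    if db_size == 0:
        return 0.0
    # Probability that a document contains every word =
    # product of the per-word document fractions (independence assumption).
    fraction = prod(doc_freqs.get(w, 0) / db_size for w in query_words)
    return db_size * fraction

# Hypothetical content summary: word -> number of documents containing it.
summary = {"lung": 200, "cancer": 500}
print(estimated_matches(1000, summary, ["lung", "cancer"]))
```

The estimate is only as good as the summary: a word missing from an incomplete or stale summary drives the whole estimate to zero, which is the deficiency the slide points to.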

Uniform Probing for Content Summary Construction

Extracts a document sample from a given database, D and computes the frequency of each observed word w in the sample, SampleDF(w)

The algorithm:

1. Start with an empty content summary, where SampleDF(w) = 0 for each word w, and a general (i.e., not specific to D), comprehensive word dictionary.

2. Pick a word and send it as a query to database D.

3. Retrieve the top-k documents returned.

4. If the number of retrieved documents exceeds a pre-specified threshold, stop. Otherwise, continue the sampling process by returning to Step 2.
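A minimal sketch of this sampling loop is given here, run against a toy in-memory database. The `search` callable, the toy documents, the `max_queries` safety limit, and the choice to stop once the sample reaches the threshold are all assumptions made for illustration. Words are drawn at random from the dictionary, as in the RS-Ord variant.

```python
import random
from collections import Counter

def sample_content_summary(search, dictionary, k=4, max_docs=300,
                           max_queries=1000, seed=0):
    """Build SampleDF(w) by sending random single-word queries to a database.

    `search(word)` is a hypothetical interface that returns a ranked list of
    (doc_id, text) pairs for a one-word query.
    """
    rng = random.Random(seed)
    sample = {}                              # doc_id -> text, deduplicated
    for _ in range(max_queries):             # safety limit (an assumption)
        if len(sample) >= max_docs:          # stop at the document threshold
            break
        word = rng.choice(dictionary)        # pick a word
        for doc_id, text in search(word)[:k]:  # keep only the top-k results
            sample[doc_id] = text
    sample_df = Counter()
    for text in sample.values():
        # Count each word once per document (document frequency).
        sample_df.update(set(text.lower().split()))
    return sample_df

# Toy database standing in for a real hidden-web source.
TOY_DOCS = {1: "lung cancer treatment", 2: "cancer research", 3: "heart disease"}

def toy_search(word):
    return [(doc_id, text) for doc_id, text in TOY_DOCS.items()
            if word in text.split()]

summary = sample_content_summary(toy_search, ["cancer", "heart", "zebra"],
                                 k=2, max_docs=3)
print(summary.most_common(3))
```

Queries such as "zebra" return nothing from this toy database, which illustrates why drawing words from a general dictionary can waste queries.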

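Two Versions of the Algorithm

RS-Ord (RandomSampling-OtherResource): picks query words at random from the dictionary.

RS-Lrd (RandomSampling-LearnedResource): picks query words from the documents retrieved so far.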

Deficiencies:

ActualDF(w) for each word w is not revealed

RS-Ord tends to produce inefficient executions in which it repeatedly issues queries to databases that produce no matches

Database Classification

Rationale: Queries closely associated with a topical category retrieve mainly documents about that category.

Place the database in a classification scheme based on the number of matches

Automation and Hierarchical Classification

Automates classification by queries derived automatically from a rule-based document classifier.

A rule-based classifier is a set of logical rules defining classification decisions, for example:

jordan AND bulls → Sports

hepatitis → Health

Apply this principle recursively to create a hierarchical classifier.
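As a rough illustration of how classifier rules become query probes, the sketch here stores each rule as a tuple of terms plus a category and turns the terms into a conjunctive query string. The data layout and the function names are assumptions, not the representation used by the authors.

```python
# Each rule: (terms that must all appear, category it implies).
RULES = [
    (("jordan", "bulls"), "Sports"),
    (("hepatitis",), "Health"),
]

def rule_to_probe(terms):
    """Turn the antecedent of a rule into a Boolean AND query."""
    return " AND ".join(terms)

def probes_by_category(rules):
    """Group the generated query probes under the category they test for."""
    probes = {}
    for terms, category in rules:
        probes.setdefault(category, []).append(rule_to_probe(terms))
    return probes

print(probes_by_category(RULES))
```

For a hierarchical classifier, the same construction would be repeated with a separate rule set for the subcategories of each category.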

Focused Probing

Sends query probes, and extracts number of matches without retrieving any documents.

Calculates two metrics, the Coverage(Ci) and Specificity(Ci) for the subcategory Ci

If the values of Coverage(Ci) and Specificity(Ci) exceed two pre-specified thresholds Tc and Ts, respectively, classify the database into a category Ci
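The threshold test can be sketched as follows. This sketch assumes that Coverage(Ci) is the total number of matches reported for the probes of Ci and that Specificity(Ci) is the share of Ci among the matches of all sibling categories; the slides do not spell out these definitions, so treat both as assumptions. It also treats a value equal to the threshold as passing.

```python
def classify(match_counts, t_coverage, t_specificity):
    """Return the subcategories whose coverage and specificity pass the thresholds.

    `match_counts` maps each sibling subcategory to the number of matches
    reported by the database for that subcategory's query probes.
    """
    total = sum(match_counts.values())
    selected = []
    for category, coverage in match_counts.items():
        specificity = coverage / total if total else 0.0
        # Assumption: meeting a threshold exactly counts as passing.
        if coverage >= t_coverage and specificity >= t_specificity:
            selected.append(category)
    return selected

# Hypothetical match counts for two sibling categories.
counts = {"Sports": 5, "Health": 95}
print(classify(counts, t_coverage=10, t_specificity=0.5))
```

Only match counts are needed, so this step requires no document retrieval.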

Authors’ Algorithm

Exploit the topic hierarchy

Produce a document sample that:

Is topically representative of the database contents

Yields an accurate content summary efficiently

Content-Summary Construction

Steps of the algorithm:

1. Query the database using focused probing to:

Retrieve a document sample.

Generate a preliminary content summary.

Categorize the database.

2. Estimate the absolute frequencies of the words retrieved from the database.

Building Content Summaries from Extracted Documents

ActualDF(w): The actual number of documents in the database that contain word w. The algorithm knows this number only if [w] is a single-word query probe that was issued to the database.

SampleDF(w): The number of documents in the extracted sample that contain word w.

Focused Probing for Content Summary Construction


Estimating Absolute Document Frequencies

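Use Mandelbrot’s equation, $f = P(r + p)^{-B}$, for the distribution of word frequencies to estimate the unknown ActualDF(·) frequencies, where $f$ is the frequency of the word with rank $r$ and $P$, $B$, and $p$ are parameters of the collection.

Sort words in descending order of their SampleDF(·) frequencies to obtain the rank $r_i$ of each word $w_i$.

Focus on words with known ActualDF(·) frequencies.

Find the $P$, $B$, and $p$ parameter values that best fit the data.

Estimate ActualDF($w_i$) for all words $w_i$ with unknown ActualDF($w_i$) as $P(r_i + p)^{-B}$.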
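One possible way to carry out the fitting step is sketched here. For a fixed p, taking logarithms gives log f = log P - B log(r + p), which is linear in log(r + p), so the sketch scans a grid of p values and solves an ordinary least-squares line for each. This log-space grid search is a swapped-in technique chosen for simplicity; the slides do not say how the authors fit the parameters. The function names and the sample frequencies are hypothetical.

```python
import math

def fit_mandelbrot(ranks, freqs, p_grid=None):
    """Fit f = P * (r + p) ** (-B) to (rank, frequency) pairs.

    Scans candidate p values and, for each, fits a least-squares line to
    log f versus log(r + p). Returns (P, B, p) for the best candidate.
    """
    if len(ranks) < 2:
        raise ValueError("need at least two (rank, frequency) pairs")
    if p_grid is None:
        p_grid = [i * 0.1 for i in range(101)]   # p in [0, 10], an assumption
    ys = [math.log(f) for f in freqs]
    n = len(ys)
    y_mean = sum(ys) / n
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r in ranks]
        x_mean = sum(xs) / n
        sxx = sum((x - x_mean) ** 2 for x in xs)
        if sxx == 0:
            continue
        slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
        intercept = y_mean - slope * x_mean
        err = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(intercept), -slope, p)
    _, P, B, p = best
    return P, B, p

def estimate_df(rank, P, B, p):
    """Estimated ActualDF for the word at the given SampleDF rank."""
    return P * (rank + p) ** (-B)

# Hypothetical words with known ActualDF: SampleDF rank -> ActualDF.
known = {1: 5000, 4: 1500, 10: 600, 25: 220}
P, B, p = fit_mandelbrot(list(known), list(known.values()))
print(round(estimate_df(7, P, B, p)))   # estimate for an unprobed word at rank 7
```

Ranks come from the SampleDF ordering, so the sample supplies the shape of the curve while the single-word probes with known match counts anchor its scale.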

Example

Creating Content Summaries for Topic Categories

Example: “metastasis” did not appear in any of the documents sampled from CANCERLIT during probing.

Cancer-BACUP, classified under “Cancer”, has a high ActualDFest(metastasis) = 3,569.

Convey this information by associating a content summary with category “Cancer” that is obtained by merging the summaries of all databases under this category.

In the merged summary, ActualDFest(w) is the sum of the document frequencies of w for the databases under this category.
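The merging rule amounts to a word-by-word sum, as this short sketch shows. Only the value 3,569 for “metastasis” comes from the example above; every other number, and the function name `merge_summaries`, is made up for illustration.

```python
from collections import Counter

def merge_summaries(summaries):
    """Category summary: sum the estimated document frequency of each word
    over all databases classified under the category."""
    merged = Counter()
    for summary in summaries:
        merged.update(summary)   # Counter.update adds counts per key
    return merged

# Hypothetical per-database summaries (word -> estimated ActualDF).
cancerlit = {"cancer": 90000, "lung": 12000}      # "metastasis" never sampled
cancer_bacup = {"cancer": 8000, "metastasis": 3569}

cancer_category = merge_summaries([cancerlit, cancer_bacup])
print(cancer_category["metastasis"])
```

A query containing “metastasis” now gets a non-zero score for the “Cancer” category even though the CANCERLIT summary alone would score zero.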


Selecting Databases Hierarchically: Algorithm

Inputs: a query Q, the number of target databases K, and the top category C

Steps:

HierSelect(Query Q, Category C, int K)

    Use a flat database selection algorithm to assign a score for Q to each subcategory of C
    if there is a subcategory of C with a non-zero score
        Pick the subcategory Cj with the highest score
        if NumDBs(Cj) >= K    // Cj has enough databases
            return HierSelect(Q, Cj, K)
        else                  // Cj does not have enough databases
            return DBs(Cj) ∪ FlatSelect(Q, C - Cj, K - NumDBs(Cj))
    else                      // no subcategory of C has a non-zero score
        return FlatSelect(Q, C, K)
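The recursion can be sketched in runnable form as follows. The `Category` class, the toy `score` function (a stand-in for a flat scorer such as CORI or bGlOSS), the decision to drop zero-score databases inside `flat_select`, and the sample hierarchy are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    databases: dict = field(default_factory=dict)  # db name -> {word: df}
    children: list = field(default_factory=list)

    def all_dbs(self):
        """DBs(C): every database classified under this category."""
        dbs = dict(self.databases)
        for child in self.children:
            dbs.update(child.all_dbs())
        return dbs

def score(query, summary):
    """Toy flat score: non-zero only if the summary contains every query word."""
    result = 1
    for word in query:
        result *= summary.get(word, 0)
    return result

def category_summary(category):
    """Merge the summaries of all databases under a category."""
    merged = Counter()
    for summary in category.all_dbs().values():
        merged.update(summary)
    return merged

def flat_select(query, dbs, k):
    """Top-k databases by flat score; zero-score databases are dropped."""
    ranked = sorted(dbs, key=lambda name: score(query, dbs[name]), reverse=True)
    return [name for name in ranked if score(query, dbs[name]) > 0][:k]

def hier_select(query, category, k):
    scored = [(score(query, category_summary(c)), c) for c in category.children]
    scored = [(s, c) for s, c in scored if s > 0]
    if not scored:                       # no subcategory has a non-zero score
        return flat_select(query, category.all_dbs(), k)
    _, best = max(scored, key=lambda pair: pair[0])
    best_dbs = best.all_dbs()
    if len(best_dbs) >= k:               # Cj has enough databases: go deeper
        return hier_select(query, best, k)
    # Cj does not have enough databases: take all of them and fill the
    # remaining slots with a flat selection over the rest of C.
    rest = {n: s for n, s in category.all_dbs().items() if n not in best_dbs}
    return list(best_dbs) + flat_select(query, rest, k - len(best_dbs))

# Hypothetical hierarchy and content summaries.
baseball = Category("Baseball", {"b1": {"babe": 5, "ruth": 7}, "b2": {"babe": 2}})
hockey = Category("Hockey", {"h1": {"ruth": 1}})
sports = Category("Sports", children=[baseball, hockey])
health = Category("Health", {"m1": {"hepatitis": 10}})
root = Category("Root", children=[sports, health])

print(hier_select(["babe", "ruth"], root, 3))
```

In this toy run the database `b2` is selected even though its own summary lacks “ruth” and would score zero under flat selection; it is picked because its category scores well.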

Example: Topic hierarchy for database selection (query [babe AND ruth], K = 3)

Experiments: Data and Metrics

Evaluate two main sets of techniques:

1. Content-summary construction techniques

2. Database selection techniques

Evaluate the algorithms using two data sets:

Controlled Database Set

Web Database Set

Data Sets

Controlled Database Set

500,000 newsgroup articles from 54 newsgroups

81,000 articles to train document classifiers over the 72-node topic hierarchy

419,000 articles to build the set of Controlled Databases

Contained 500 databases ranging in size from 25 to 25,000 documents

Data Sets

Web Database Set

50 real web-accessible databases over which the authors had no control

Databases picked randomly from two directories of hidden-web databases, namely InvisibleWeb and CompletePlanet

Content-Summary Construction

Test variations of the Focused Probing technique against RS-Ord and RS-Lrd.

Focused Probing:

Evaluated configurations with different underlying document classifiers for query-probe creation.

Evaluated different values for the thresholds Ts and Tc:

Varied the specificity threshold Ts from 0 to 1.

Fixed the coverage threshold to Tc = 10.

Database Selection Effectiveness

Underlying database selection algorithm:

The hierarchical algorithm relies on a “flat” database selection algorithm.

Chosen algorithms: CORI and bGlOSS

Adapted both algorithms to work with category content summaries.


Content Summary Construction

Evaluated how the hierarchical database selection algorithm behaved over content summaries generated by different techniques.

Also studied the QPilot strategy, which exploits HTML links to characterize text databases.

Content Summary Quality

Metric: the content summary’s coverage of the actual database vocabulary (the ctf ratio)

$\mathrm{ctf} = \frac{\sum_{w \in T_r} \mathrm{ActualDF}(w)}{\sum_{w \in T_d} \mathrm{ActualDF}(w)}$

where $T_r$ is the set of terms in the content summary and $T_d$ is the complete set of words in the database vocabulary.

Results:

Focused Probing techniques achieve much higher ctf ratios than RS-Ord and RS-Lrd.

The coverage of the Focused Probing summaries increases for lower values of the threshold Ts.
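The ctf ratio can be computed directly from its definition, as in this small sketch. The function name and the sample frequencies are hypothetical.

```python
def ctf_ratio(summary_words, actual_df):
    """Fraction of the database's total document-frequency mass that is
    covered by the words present in the content summary."""
    total = sum(actual_df.values())                 # sum over Td
    covered = sum(df for word, df in actual_df.items()
                  if word in summary_words)         # sum over Tr
    return covered / total if total else 0.0

# Hypothetical ActualDF values for a three-word vocabulary.
actual = {"cancer": 60, "lung": 30, "metastasis": 10}
print(ctf_ratio({"cancer", "lung"}, actual))
```

Because each word is weighted by its frequency, missing a rare word costs much less than missing a frequent one.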

Content Summary Quality

Correlation of word rankings:

Used the Spearman Rank Correlation Coefficient (SRCC) to measure how well a content summary orders words by frequency with respect to the actual word frequency order in the database.

Result: The Focused Probing methods have higher SRCC values than RS-Ord and RS-Lrd.
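For reference, a minimal SRCC computation is sketched here using the textbook formula $1 - 6\sum d_i^2 / (n(n^2 - 1))$. It assumes there are no tied frequencies and at least two words shared by both rankings; handling of ties is left out. The function name and the frequencies are hypothetical.

```python
def spearman(estimated, actual):
    """Spearman rank correlation between two word -> frequency mappings,
    computed over the words they share. Assumes no tied frequencies."""
    words = [w for w in estimated if w in actual]
    n = len(words)
    if n < 2:
        raise ValueError("need at least two shared words")

    def ranks(freqs):
        ordered = sorted(words, key=lambda w: freqs[w], reverse=True)
        return {w: i + 1 for i, w in enumerate(ordered)}

    est_rank, act_rank = ranks(estimated), ranks(actual)
    d_squared = sum((est_rank[w] - act_rank[w]) ** 2 for w in words)
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical frequencies: the summary swaps the order of two words.
summary = {"cancer": 40, "lung": 10, "tumor": 25}
database = {"cancer": 900, "lung": 500, "tumor": 300}
print(spearman(summary, database))
```

A value of 1 means the summary ranks the words exactly as the database does, and -1 means the order is fully reversed.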

Content Summary Quality - Efficiency

Focused Probing techniques on average retrieve one document per query sent

RS-Lrd retrieves about one document per two queries.

RS-Ord unnecessarily issues many queries that produce no document matches.


Focused Probing techniques produce significantly better-quality summaries than RS-Ord and RS-Lrd in terms of vocabulary coverage and word-ranking preservation.


Methodology:

Used the Web set of real web-accessible databases.

Used 50 queries from the Web Track of TREC.

Each database selection algorithm picked 3 databases for each query.

Retrieved the top 5 documents for the query.

Human evaluators judged the relevance of each retrieved document for the query.


Measured the precision of a technique for each query q as the fraction of the retrieved documents that were judged relevant.

Average precision of different database selection algorithms.
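The averaging can be sketched as follows, assuming the standard definition of precision (relevant retrieved documents divided by all retrieved documents) and a plain mean over queries. The judgments are invented for illustration and are not results from the experiments.

```python
def precision(judgments):
    """Precision for one query: share of retrieved documents judged relevant.
    `judgments` holds one boolean per retrieved document."""
    return sum(judgments) / len(judgments) if judgments else 0.0

def mean_precision(per_query_judgments):
    """Average of the per-query precision values for one technique."""
    values = [precision(j) for j in per_query_judgments]
    return sum(values) / len(values) if values else 0.0

# Hypothetical relevance judgments for two queries.
judged = [
    [True, False, True, True, False],
    [False, False, True, False, False],
]
print(mean_precision(judged))
```

A query for which a technique retrieves nothing is scored 0 here, which is a design choice of this sketch.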


Analysis:

All the flat selection techniques suffer from the incomplete coverage of the underlying probing-generated summaries.

QPilot summaries do not work well for database selection because they generally contain only a few words and are hence highly incomplete.

Hierarchical vs. flat database selection

The hierarchical algorithm using CORI as flat database selection has 50% better precision than CORI for flat selection with the same content summaries.

For bGlOSS, the improvement is 92%.

Reason: Topic hierarchy compensates for incomplete content summaries.

Hierarchical vs. flat database selection

Measured the fraction of times that the hierarchical database selection algorithm picked a database for a query that:

Produced matches for the query, and

Was given a zero score by the flat database selection algorithm of choice.

Conclusion

Presented a novel and efficient method for the construction of content summaries of web accessible text databases

Presented a hierarchical database selection algorithm that exploits the database content summaries

The algorithm exploits the automatically generated database classification to produce accurate results even with imperfect content summaries.

Future Work

Alternative hierarchy traversing techniques. For example, “route” queries to multiple categories if appropriate.

Examine the effect of absolute frequency estimation on database selection.

Alternative methods for creating content summaries.