distributed search over the hidden web
DESCRIPTION
Distributed Search over the Hidden Web. Hierarchical Database Sampling and Selection. Panagiotis G. Ipeirotis & Luis Gravano. Outline. Introduction Background Focused Probing for Content Summary Construction Exploiting Topic Hierarchies for Database Selection - PowerPoint PPT PresentationTRANSCRIPT
Distributed Search over the Hidden Web
Hierarchical Database
Sampling and Selection
Panagiotis G. Ipeirotis & Luis Gravano
Outline
Introduction Background Focused Probing for Content Summary
Construction Exploiting Topic Hierarchies for Database
Selection Experiments: Data and Metrics Experimental Results Conclusion and Future Work
Introduction
Search engines create their indexes by spidering or crawling Web pages
Hidden Web sources store their content in searchable databases
An Example…
Searching in the medical database CANCERLIT – www.cancer.gov, the query [ lung and cancer ] returns 68,430 matches.
Searching Google with the query [“lung and cancer” site: www.cancer.gov ] returns 23 matches
Meta - searchers
Tools for searching Hidden data sources
Relies on statistical or content summaries Performs three main tasks:
Database SelectionQuery Translation Result Merging
This Paper Presents
An algorithm to derive content summaries from “uncooperative” databases
A Database Selection Algorithm that exploits: Extracted content summaries Hierarchical classification of the databases
Background:Existing Database Selection Algorithms
Contd. : Database Selection Algorithms
Assumption: Query words are independently distributed over database documents.
The answer to a query is the set of all the documents that satisfy the Boolean expression.
Deficiency: Are the content summaries accurate and up to date?
Uniform Probing for Content Summary Construction
Extracts a document sample from a given database, D and computes the frequency of each observed word w in the sample, SampleDF(w)
The Algorithm: Start with an empty content summary where
SampleDF(w) = 0 for each word w, and a general (i.e., not specific to D), comprehensive word dictionary.
Pick a word and send it as a query to database D.
Retrieve the top-k documents returned. If the number of retrieved documents exceeds a
pre-specified threshold, stop. Else continue the sampling process by returning to Step 2.
2 Versions of Algorithm
RS-Ord :
RandomSampling-OtherResource RS-Lrd :
RandomSampling-LearnedResource
Deficiencies:
ActualDF(w) for each word w is not revealed
RS-Ord tends to produce inefficient executions in which it repeatedly issues queries to databases that produce no matches
Database Classification
Rationale : Queries closely associated with topical categories retrieve mainly documents about that category
Place the database in a classification scheme based on the number of matches
Automation and Hierarchical Classification
Automates classification by queries derived automatically from a rule-based document classifier.
A rule-based classifier is a set of logical rules defining classification decisions. jordan AND bulls-->Sports, hepatitis-->Health
Apply this principle recursively to create a hierarchical classifier.
Focused Probing
Sends query probes, and extracts number of matches without retrieving any documents.
Calculates two metrics, the Coverage(Ci) and Specificity(Ci) for the subcategory Ci
If the values of Coverage(Ci) and Specificity(Ci) exceed two pre-specified thresholds Tc and Ts, respectively, classify the database into a category Ci
Author’s Algorithm
Exploit Topic Hierarchy
Produce a document sample that Is topically representative of the contents Gives accurate and efficient content
summary
Content-Summary Construction
Steps of the algorithm:
Query the database using focused probing to: Retrieve a document sample. Generate a preliminary content summary Categorize the database.
Estimate the absolute frequencies of the words retrieved from the database.
Building Content Summaries from Extracted Documents
ActualDF(w): The actual number of documents in the
database that contain word w. The algorithm knows this number only if [w] is
a single word query probe that was issued to the database
SampleDF(w): The number of documents in the extracted
sample that contain word w.
Focused Probing for Content Summary Construction
Focused Probing for Content Summary Construction
Focused Probing for Content
Summary Construction
Estimating Absolute Document Frequencies
Use Mandelbrot’s equation P(r+p)-B for distribution of words for estimating unknown ActualDF (¢) frequencies. Sort words in descending order of their SampleDF(¢)
frequencies Focus on words with known ActualDF (¢) frequencies. Find the P, B, and p parameter values that best fit the
data. Estimate ActualDF (wi) for all words wi with unknown
ActualDF (wi) as P(ri+p)-B
Example
Creating Content Summaries for Topic CategoriesExample: “metastasis” did not appear in any of the documents
sampled from CANCERLIT during probing Cancer-BACUP classified under “Cancer”, has a high
ActualDFest(metastasis) = 3, 569 Convey this information by associating a content summary
with category “Cancer” that is obtained by merging the summaries of all databases under this category
In merged summary, ActualDFest(w) is sum of the document frequency of w for databases under this category
Creating Content Summaries for Topic Categories
Selecting Databases Hierarchically: Algorithm Inputs : a query Q, target databases K, top category C Steps: HierSelect(Query Q, Category C, int K)
1: Use a flat database selection algorithm to assign a score for Q to each subcategory of C
2: if there is a subcategory C with a non-zero score
3: Pick the subcategory Cj with the highest score
4: if NumDBs(Cj) >= K //Cj has enough databases
5: return HierSelect(Q,Cj ,K)
6: else // Cj does not have enough databases
7: return DBs(Cj) FlatSelect(Q,C-Cj,K-NumDBs(Cj))
8: else // no subcategory C has non-zero score
9: return FlatSelect(Q,C,K)
Example: Topic hierarchy for database selection (babe AND ruth ,k=3)
Experiments :Data and Metrics
Evaluate two main sets of techniques:1. Content-summary construction techniques
2. Database selection techniques Evaluate the algorithms, using two data
sets Controlled Database Set Web Database Set
Data Sets
Controlled Database Set 500,000 newsgroup articles from 54
newsgroup 81,000 articles to train documents classifiers
over the 72 – node topic hierarchy 419,000 articles to build the set of Controlled
Databases Contained 500 databases ranging in size from
25 to 25,000 documents.
Data Sets
Web Database Set 50 real web accessible databases with no
control over it.
Databases picked randomly from two directories of hidden-web databases, namely InvisibleWeb and Complete Planet
Content-summary construction Test variations of Focused Probing technique
against RS-Ord and RS-Lrd. Focused Probing:
Evaluated configurations with different underlying document classifiers for query-probe creation.
Different values for the thresholds Ts and Tc Varied the specificity threshold Ts from 0 to 1 Fixed coverage threshold to Tc = 10.
Database Selection Effectiveness
Underlying Database selection algorithm: Hierarchical algorithm
Relies on a “flat” database selection algorithm.
Chose algorithms: CORI, bGlOSS
Adapted both algorithms to work with category content summary.
Database Selection Effectiveness
Content Summary Construction Evaluated how the hierarchical database
selection algorithm behaved over content summaries generated by different techniques
Also studied QPilot Strategy Exploits HTML links to characterize text
databases.
Content Summary Quality Metric : content summaries coverage of the actual
database vocabulary ctf = ΣwєTr ActualDF(w) / ΣwєTd ActualDF(w) Tr = set of terms in content summary, Td =
complete set of words in vocabulary Results:
Focused Probing techniques achieve much higher ctf ratios than RS-Ord and RS-Lrd.
The coverage of the Focused Probing summaries increases for lower thresholds of Ts
Content Summary Quality
Correlation of word rankings: Used Spearman Rank Correlation Coefficient (SRCC ) –
to measure how well a content summary orders words by frequencies with respect to the actual word frequency order in the database.
Result :The Focused Probing method have higher SRCC values than the RS-Ord and RS-Lrd.
Content Summary Quality
Content Summary Quality - Efficiency
Focused Probing techniques on average retrieve one document per query sent
RS-Lrd retrieves about one document per two queries.
RS-Ord unnecessarily issues many queries that produce no document matches.
Content Summary Quality
Produce significantly better-quality summaries than RS-Ord and RS-Lrd do in terms of vocabulary coverage and word ranking preservation.
Database Selection Effectiveness
Methodology: Web set of real web-accessible databases 50 queries from the Web Track of TREC Each database selection algorithm picked 3
databases for the query
Retrieved the top 5 documents for the query.
Human evaluators to judge
the relevance of each retrieved document for the query
Database Selection Effectiveness
Measured the precision of a technique for each query q as :
Average precision of different database selection algorithms.
Database Selection Effectiveness
Analysis: All the flat selection techniques suffer from
incomplete coverage of the underlying probing-generated summaries.
QPilot summaries do not work well for database selection because they generally contain only a few words and are hence highly incomplete.
Hierarchical vs. flat database selection
The hierarchical algorithm using CORI as flat database selection has 50% better precision than CORI for flat selection with the same content summaries.
For bGlOSS, the improvement is 92%.
Reason: Topic hierarchy compensates for incomplete content summaries.
Hierarchical vs. flat database selection
Measured fraction of times that hierarchical database selection algorithm picked a database for a query That produced matches for the query And was given a zero score by the flat
database selection algorithm of choice.
Conclusion
Presented a novel and efficient method for the construction of content summaries of web accessible text databases
Presented a hierarchical database selection algorithm that exploits the database content summaries
Algorithm generated classification to produce accurate results even for imperfect content summaries.
Future Work
Alternative hierarchy traversing techniques. For example, “route” queries to multiple categories if appropriate.
Examine the effect of absolute frequency estimation on database selection.
Alternative methods for creating content summaries.