Query Rewriting for Extracting Data Behind HTML Forms
Xueqi Chen,¹ David W. Embley,¹ Stephen W. Liddle²
¹Department of Computer Science
²Rollins Center for eBusiness
Brigham Young University
November 9, 2004
Funded by the National Science Foundation under grant IIS-0083127
3
Motivation
• Web information is stored in databases
• Databases are accessed through forms
• Forms are designed in various ways
• Automated agents are of great value
4
Prototype System Flowchart
[Flowchart components: User Query, Site Form, Application Extraction Ontology, Input Analyzer, Retrieved Page(s), Output Analyzer, Extracted Information]
5
Input Analyzer – User Query Acquisition
System creates a form based on application-specific ontology
6
Input Analyzer – User Query Acquisition (cont.)
7
Input Analyzer – Site Form Analysis
Understand name, type, and/or values for each field
8
Input Analyzer – Form Query Generation
Form field name recognition
– For all fields
Form field value recognition
– For range fields only
Form field matching (Case 0 – 5)
– For all fields
9
Form Field Name Recognition
Match by value
– Application extraction ontology
Match by name
– WordNet-based C4.5 decision tree learning algorithm
– Levenshtein edit distance, SoundEx, and longest common subsequence (LCS)
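The three string measures named above can be sketched as follows. This is a minimal illustration, not the system's implementation: the slides do not say how the measures are combined or thresholded, and the WordNet-based decision tree is left out. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


_SOUNDEX_CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items() for c in letters}


def soundex(word: str) -> str:
    """American Soundex code: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":            # h and w do not separate equal codes
            continue
        code = _SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code              # a vowel resets prev, so repeats are coded again
    return (letters[0].upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for name in ["min_price", "pricelow", "lo_p"]:
        print(name, levenshtein("price", name), lcs_length("price", name))
```

With the ontology label "price", a form field name such as "min_price" shares the full five-character subsequence, while "lo_p" shares only one character, which suggests why purely string-based measures can miss heavily abbreviated field names.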
10
Form Field Value Recognition
For range fields only
11
Form Field Value Recognition: Type 1
Lower value list: [0, 1, 5000, 10000, 15000, 20000, 30000];
Upper value list: [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999];
Paired = false.
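One plausible way to use unpaired lists like these is to pick the tightest lower and upper options that still enclose the user's range. The sketch below assumes that reading; the function name `cover_unpaired` and the sample query range are hypothetical, and the slides do not state the system's actual selection rule.

```python
from bisect import bisect_left, bisect_right


def cover_unpaired(lowers, uppers, q_low, q_high):
    """Pick the largest lower option <= q_low and the smallest upper option >= q_high.

    If no option encloses an end of the query range, fall back to the
    extreme option on that side.
    """
    lows = sorted(set(lowers))
    ups = sorted(set(uppers))
    i = bisect_right(lows, q_low) - 1
    j = bisect_left(ups, q_high)
    low = lows[i] if i >= 0 else lows[0]
    high = ups[j] if j < len(ups) else ups[-1]
    return low, high


if __name__ == "__main__":
    lower_values = [0, 1, 5000, 10000, 15000, 20000, 30000]
    upper_values = [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]
    # Hypothetical user query: price between 6000 and 12000.
    print(cover_unpaired(lower_values, upper_values, 6000, 12000))
```

Because the chosen form range (5000 to 15000 in this run) is wider than the query range, the retrieved records would still need to be filtered against the original query afterwards.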
12
Form Field Value Recognition: Type 2
Lower value list: [0, 0, 5001, 10001, 15001, 20001];
Upper value list: [999999, 5000, 10000, 15000, 20000, 999999];
Paired = true.
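When the lists are paired, each position describes one selectable range, so a query range may need several options. The sketch below selects every option that overlaps the query and then drops catch-all options that merely enclose a narrower selected one. Both helper names and the enclosing-option heuristic are assumptions for illustration, not the system's documented behavior.

```python
def overlapping_pairs(lowers, uppers, q_low, q_high):
    """Return the (lower, upper) options whose range intersects [q_low, q_high]."""
    return [(lo, hi) for lo, hi in zip(lowers, uppers)
            if lo <= q_high and hi >= q_low]


def drop_enclosing(options):
    """Remove options that fully contain another, different option in the list."""
    kept = []
    for a in options:
        encloses_other = any(a != b and a[0] <= b[0] and b[1] <= a[1]
                             for b in options)
        if not encloses_other:
            kept.append(a)
    return kept


if __name__ == "__main__":
    lower_values = [0, 0, 5001, 10001, 15001, 20001]
    upper_values = [999999, 5000, 10000, 15000, 20000, 999999]
    # Hypothetical user query: price between 6000 and 12000.
    hits = overlapping_pairs(lower_values, upper_values, 6000, 12000)
    print(hits)
    print(drop_enclosing(hits))
```

Submitting one query per remaining option trades extra requests for fewer irrelevant records than the single catch-all range would return.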
13
Form Field Value Recognition: Type 3
Lower value list: [25, 25, 25, 25, 25, 25, 25];
Upper value list: [25, 50, 100, 300, 500, 500, 500];
Paired = true.
14
Form Field Matching: Case 0
Field specified in user query (Q) is the same as in a site form (F)
15
Form Field Matching: Case 1
Field in Q is not contained in F, but is in the returned information
16
Form Field Matching: Case 2
Field in Q is not contained in F, and is not in the returned information
Example: Color
17
Form Field Matching: Case 3
Field required by F is not provided in Q, but a general default value, such as “All” or “Any”, is provided by F
18
Form Field Matching: Case 4
Field required by F is not provided in Q, and the default value provided by the site form is specific, not “All” or “Any”
19
Form Field Matching: Case 5
Values specified in Q do not match values provided in F
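The six cases can be read as a small decision procedure over a few facts about each field. The sketch below encodes that reading; the `FieldFacts` record, its attribute names, and the treatment of any form field absent from the query as "required" are assumptions made for illustration, and the slides do not specify what action the system takes in each case.

```python
from dataclasses import dataclass
from typing import Optional

# Default values treated as general (Case 3); anything else counts as specific.
GENERAL_DEFAULTS = {"all", "any"}


@dataclass
class FieldFacts:
    in_query: bool                      # field is specified in the user query Q
    in_form: bool                       # field appears in the site form F
    in_results: bool = False            # field appears in the returned information
    values_match: bool = True           # Q's values correspond to values offered by F
    form_default: Optional[str] = None  # default value the form provides, if any


def matching_case(f: FieldFacts) -> int:
    """Map the facts about one field to matching Case 0 through 5."""
    if f.in_query and f.in_form:
        return 0 if f.values_match else 5
    if f.in_query:
        return 1 if f.in_results else 2
    if f.in_form:
        default = (f.form_default or "").strip().lower()
        return 3 if default in GENERAL_DEFAULTS else 4
    raise ValueError("field appears in neither the query nor the form")


if __name__ == "__main__":
    print(matching_case(FieldFacts(in_query=True, in_form=False, in_results=False)))
```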
20
Output Analyzer
Form results processor
– Record separator
– BYU Ontos
Final results generator
– Database manipulation
  Single table
  Multiple tables
21
A Car-ads Search Example
22
A Car-ads Search Example (cont.)
23
Measurements
Field-matching efficiency

\[ R_{fm} = \frac{\text{number of correctly matched fields}}{\text{total number of fields that should have been matched}} \]

\[ P_{fm} = \frac{\text{number of correctly matched fields}}{\text{total number of matched fields}} \]
24
Measurements (cont.)
Field-matching efficiency
Query-submission efficiency

\[ R_{qs} = \frac{\text{number of correct queries submitted}}{\text{total number of queries that should have been submitted}} \]

\[ P_{qs} = \frac{\text{number of correct queries submitted}}{\text{total number of queries submitted}} \]
25
Measurements (cont.)
Field-matching efficiency
Query-submission efficiency
Overall efficiency

\[ R_{overall} = R_{fm} \times R_{qs} \]

\[ P_{overall} = P_{fm} \times P_{qs} \]
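As a worked check of these definitions, the short sketch below computes recall and precision as simple ratios and multiplies the field-matching and query-submission values to obtain the overall figures. The helper names are hypothetical; the counts plugged in are the ones reported in the experimental results for the car-ads and digital-camera domains.

```python
def ratio(correct: int, total: int) -> float:
    """Recall or precision as a plain fraction of correct items over a total."""
    return correct / total


def overall(fm: float, qs: float) -> float:
    """Overall efficiency: product of field-matching and query-submission values."""
    return fm * qs


def pct(x: float) -> float:
    """Express a fraction as a percentage rounded to one decimal place."""
    return round(100 * x, 1)


if __name__ == "__main__":
    # Car-ads: 21/21 fields matched, 249 correct queries out of 301 submitted.
    car_precision = overall(ratio(21, 21), ratio(249, 301))
    # Digital cameras: 21 of 23 applicable fields matched, 31/31 queries correct.
    camera_recall = overall(ratio(21, 23), ratio(31, 31))
    print(pct(car_precision), pct(camera_recall))
```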
26
Experimental Results
Car-ads search
Number of Forms: 7
Number of Fields in Forms: 31
Number of Fields Applicable to Ontology: 21 (67.7%)
            Field Matching   Query Submission                      Overall
Recall      100% (21/21)     100% (249/249)                        100%
Precision   100% (21/21)     82.7% (249/301)                       82.7%
                             [97.1% (249+1847)/(301+1858)]*        [97.1%]*
* Numbers in square brackets are calculated including queries submitted for retrieving next links.
27
Experimental Results (cont.)
Digital-camera search
Number of Forms: 7
Number of Fields in Forms: 41
Number of Fields Applicable to Ontology: 23 (56.1%)
            Field Matching   Query Submission              Overall
Recall      91.3% (21/23)    100% (31/31)                  91.3%
Precision   100% (21/21)     100% (31/31)                  100%
                             [100% (31+85)/(31+85)]*       [100%]*
* Numbers in square brackets are calculated including queries submitted for retrieving next links.
28
Results Discussion
Field matching
– By value
  Successful: 100%
– By name
  Successful example: price vs. myprice, pricelow, pricehigh, _extern_price, min_price, max_price
  Failed example: price vs. lo_p, hi_p
29
Results Discussion (cont.)
Query submission
30
Conclusion
Our system’s performance (averaged over the two test domains)
– Fields applicable to extraction ontologies: 61.9%
– Fields system matched: 95.7%
– Queries submitted that are necessary: 91.4%
To improve the performance
– Field labels
– The quality of the extraction ontologies
Forms our system does not handle
– Multiple forms
– Forms whose actions are coded inside scripts
31
Contributions
Enables directed hidden Web crawling
– Accurate field matching
– Efficient form filling and submission
– Post processing for precise results
Ontology based
– Extensible to multiple domains
– Resilient to page changes