Query Rewriting for Extracting Data Behind HTML Forms. Xueqi Chen,1 David W. Embley,1 Stephen W. Liddle2. 1Department of Computer Science, 2Rollins Center for eBusiness, Brigham Young University. November 9, 2004. Funded by the National Science Foundation under grant IIS-0083127.



Query Rewriting for Extracting Data Behind HTML Forms

Xueqi Chen,1 David W. Embley,1 Stephen W. Liddle2

1Department of Computer Science
2Rollins Center for eBusiness

Brigham Young University

November 9, 2004

Funded by the National Science Foundation under grant IIS-0083127


Motivation

• Web information is stored in databases
• Databases are accessed through forms
• Forms are designed in various ways
• Automated agents are of great value


Prototype System Flowchart

[Figure: flowchart of the prototype system. Labeled components: User Query, Site Form, Input Analyzer, Retrieved Page(s), Output Analyzer, Extracted Information, Application Extraction Ontology.]


Input Analyzer – User Query Acquisition

System creates a form based on application-specific ontology


Input Analyzer – User Query Acquisition (cont.)


Input Analyzer – Site Form Analysis

Understand name, type, and/or values for each field
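As a rough illustration of what collecting the name, type, and values of each field can look like, the sketch below walks a form with Python's standard `html.parser`. The class name `FormFieldParser` and the sample form are hypothetical and are not taken from the prototype system.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collects name, type, and listed values for each form field."""

    def __init__(self):
        super().__init__()
        self.fields = []        # one dict per field, in document order
        self._select = None     # the <select> currently being read, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            values = [a["value"]] if "value" in a else []
            self.fields.append(
                {"name": a.get("name"), "type": a.get("type", "text"), "values": values}
            )
        elif tag == "select":
            self._select = {"name": a.get("name"), "type": "select", "values": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None:
            self._select["values"].append(a.get("value"))

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


# Hypothetical site form used only for illustration.
SAMPLE_FORM = """
<form action="/search">
  <select name="make">
    <option value="">Any</option>
    <option value="ford">Ford</option>
  </select>
  <input type="text" name="max_price">
  <input type="submit" value="Search">
</form>
"""

parser = FormFieldParser()
parser.feed(SAMPLE_FORM)
for form_field in parser.fields:
    print(form_field)
```

A real analyzer would also need to handle radio buttons, checkboxes, text areas, and the labels displayed next to each field.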


Input Analyzer – Form Query Generation

Form field name recognition
– For all fields

Form field value recognition
– For range fields only

Form field matching (Case 0 – 5)
– For all fields


Form Field Name Recognition

Match by value
– Application extraction ontology

Match by name
– WordNet-based C4.5 decision tree learning algorithm
– Levenshtein edit distance, SoundEx, and longest common subsequence (LCS)
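Two of the string measures named above, Levenshtein edit distance and LCS, can be sketched in a few lines of Python. The scoring rule in `name_matches` and its 0.8 threshold are assumptions made for illustration; the slides do not say how the system combines these measures with SoundEx or with the WordNet-based decision tree.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def name_matches(concept: str, field_name: str, threshold: float = 0.8) -> bool:
    """Hypothetical rule: enough of the concept name appears, in order, in the field name."""
    concept, field_name = concept.lower(), field_name.lower()
    return lcs_length(concept, field_name) / len(concept) >= threshold


for candidate in ["myprice", "pricelow", "min_price", "lo_p", "hi_p"]:
    print(candidate, name_matches("price", candidate), levenshtein("price", candidate))
```

Under this simple rule, abbreviated names such as lo_p and hi_p share too few characters with "price" to match, which is the kind of label that pure string similarity cannot resolve.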


Form Field Value Recognition

For range fields only


Form Field Value Recognition: Type 1

Lower value list: [0, 1, 5000, 10000, 15000, 20000, 30000];

Upper value list: [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999];

Paired = false.
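One plausible way to use an unpaired (Type 1) range field is to pick the tightest lower and upper options that still cover the user's range, then filter the returned records afterwards. The sketch below assumes that strategy; the function name `choose_unpaired` is hypothetical, and the slides do not state the selection rule the system actually applies.

```python
from bisect import bisect_left, bisect_right

# Option lists copied from the Type 1 slide.
LOWER = [0, 1, 5000, 10000, 15000, 20000, 30000]
UPPER = [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]


def choose_unpaired(lo, hi, lower_options, upper_options):
    """Return the tightest (lower, upper) option pair that covers [lo, hi]."""
    lowers, uppers = sorted(lower_options), sorted(upper_options)
    i = bisect_right(lowers, lo) - 1   # largest lower option <= lo
    j = bisect_left(uppers, hi)        # smallest upper option >= hi
    if i < 0 or j == len(uppers):
        raise ValueError("the form cannot cover the requested range")
    return lowers[i], uppers[j]


# A user asking for 6000 to 12000 is mapped to the form's 5000 to 15000.
print(choose_unpaired(6000, 12000, LOWER, UPPER))
```

Because the chosen form range is wider than the requested one, results outside the user's range still have to be discarded in post-processing.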


Form Field Value Recognition: Type 2

Lower value list: [0, 0, 5001, 10001, 15001, 20001];

Upper value list: [999999, 5000, 10000, 15000, 20000, 999999];

Paired = true.
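With paired (Type 2) values, each option is a fixed (lower, upper) interval, so a user range may need several submissions. The sketch below assumes one query per overlapping option and skips any option that merely contains a narrower overlapping one (such as the catch-all 0 to 999999). The function name `choose_paired` and this selection rule are illustrative assumptions, not the system's documented behavior.

```python
# Option lists copied from the Type 2 slide; option k is (LOWER[k], UPPER[k]).
LOWER = [0, 0, 5001, 10001, 15001, 20001]
UPPER = [999999, 5000, 10000, 15000, 20000, 999999]


def choose_paired(lo, hi, lowers, uppers):
    """Return the narrowest form intervals that together overlap [lo, hi]."""
    options = list(zip(lowers, uppers))
    overlapping = [(a, b) for a, b in options if a <= hi and b >= lo]

    def strictly_contains(outer, inner):
        return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]

    # Drop an option when a narrower overlapping option lies inside it.
    minimal = [o for o in overlapping
               if not any(strictly_contains(o, other) for other in overlapping)]
    return sorted(set(minimal))


# 6000 to 12000 needs two submissions: 5001-10000 and 10001-15000.
print(choose_paired(6000, 12000, LOWER, UPPER))
```

This assumes the narrow options tile the value space without gaps; if they did not, the catch-all option would have to be kept.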


Form Field Value Recognition: Type 3

Lower value list: [25, 25, 25, 25, 25, 25, 25];

Upper value list: [25, 50, 100, 300, 500, 500, 500];

Paired = true.


Form Field Matching: Case 0

Field specified in user query (Q) is the same as in a site form (F)


Form Field Matching: Case 1

Field in Q is not contained in F, but is in the returned information


Form Field Matching: Case 2

Field in Q is not contained in F, and is not in the returned information (field annotated on the slide: Color)


Form Field Matching: Case 3

Field required by F is not provided in Q, but a general default value, such as “All” or “Any”, is provided by F


Form Field Matching: Case 4

Field required by F is not provided in Q, and the default value provided by the site form is specific, not “All” or “Any”



Form Field Matching: Case 5

Values specified in Q do not match values provided in F
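The six cases can be read as a small decision procedure. The sketch below is one interpretation under stated assumptions: the `FormField` record, the function names, and the order in which conditions are tested are all hypothetical, and only the case definitions themselves come from the slides.

```python
from dataclasses import dataclass
from typing import Optional, Set

GENERAL_DEFAULTS = {"all", "any"}


@dataclass
class FormField:
    name: str
    required: bool = False
    default: Optional[str] = None
    options: Optional[Set[str]] = None   # None means free-text input


def classify_query_field(name, value, form, returned_fields):
    """Cases 0, 1, 2, 5: a field the user specified in Q."""
    form_field = form.get(name)
    if form_field is None:
        # Not in F: Case 1 if the site returns it anyway, otherwise Case 2.
        return 1 if name in returned_fields else 2
    if form_field.options is not None and value not in form_field.options:
        return 5   # values in Q do not match the values offered by F
    return 0       # same field, usable value


def classify_unfilled_field(form_field):
    """Cases 3, 4: a field required by F that Q does not provide."""
    if not form_field.required or form_field.default is None:
        return None   # situation not covered by the slides
    return 3 if form_field.default.lower() in GENERAL_DEFAULTS else 4


# Hypothetical car-ads form.
form = {
    "make": FormField("make", options={"ford", "honda"}),
    "zip": FormField("zip", required=True, default="84602"),
    "body": FormField("body", required=True, default="Any"),
}
returned = {"make", "price", "year"}

print(classify_query_field("make", "ford", form, returned))    # Case 0
print(classify_query_field("price", "5000", form, returned))   # Case 1
print(classify_query_field("color", "red", form, returned))    # Case 2
print(classify_unfilled_field(form["body"]))                   # Case 3
print(classify_unfilled_field(form["zip"]))                    # Case 4
print(classify_query_field("make", "toyota", form, returned))  # Case 5
```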


Output Analyzer

Form results processor
– Record separator
– BYU Ontos

Final results generator
– Database manipulation
  Single table
  Multiple tables


A Car-ads Search Example


A Car-ads Search Example (cont.)


Measurements

Field-matching efficiency

$$R_{fm} = \frac{\text{number of correctly matched fields}}{\text{total number of fields that should have been matched}}$$

$$P_{fm} = \frac{\text{number of correctly matched fields}}{\text{total number of matched fields}}$$


Measurements (cont.)

Field-matching efficiency

Query-submission efficiency

$$R_{qs} = \frac{\text{number of correct queries submitted}}{\text{total number of queries that should have been submitted}}$$

$$P_{qs} = \frac{\text{number of correct queries submitted}}{\text{total number of queries submitted}}$$


Measurements (cont.)

Field-matching efficiency

Query-submission efficiency

Overall efficiency

$$R_{overall} = R_{fm} \times R_{qs}$$

$$P_{overall} = P_{fm} \times P_{qs}$$


Experimental Results

Car-ads search

Number of Forms: 7

Number of Fields in Forms: 31

Number of Fields Applicable to Ontology: 21 (67.7%)

            Field Matching   Query Submission                    Overall
Recall      100% (21/21)     100% (249/249)                      100%
Precision   100% (21/21)     82.7% (249/301)                     82.7%
                             [97.1% (249+1847)/(301+1858)]*      [97.1%]*

* Numbers in square brackets are calculated including queries submitted for retrieving next links.


Experimental Results (cont.)

Digital-camera search

Number of Forms: 7

Number of Fields in Forms: 41

Number of Fields Applicable to Ontology: 23 (56.1%)

            Field Matching   Query Submission             Overall
Recall      91.3% (21/23)    100% (31/31)                 91.3%
Precision   100% (21/21)     100% (31/31)                 100%
                             [100% (31+85)/(31+85)]*      [100%]*

* Numbers in square brackets are calculated including queries submitted for retrieving next links.
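The reported percentages can be re-derived from the counts on the two results slides. The sketch below assumes that overall efficiency is the product of the field-matching and query-submission figures, which is consistent with the reported totals. The dictionary layout and helper names are illustrative only.

```python
from fractions import Fraction


def percent(x) -> str:
    """Format a ratio as a percentage with one decimal place."""
    return f"{float(x) * 100:.1f}%"


# Counts copied from the car-ads and digital-camera results slides.
car_ads = {
    "R_fm": Fraction(21, 21), "P_fm": Fraction(21, 21),
    "R_qs": Fraction(249, 249), "P_qs": Fraction(249, 301),
    # Precision when queries for retrieving next links are included.
    "P_qs_next": Fraction(249 + 1847, 301 + 1858),
}
digital_camera = {
    "R_fm": Fraction(21, 23), "P_fm": Fraction(21, 21),
    "R_qs": Fraction(31, 31), "P_qs": Fraction(31, 31),
    "P_qs_next": Fraction(31 + 85, 31 + 85),
}


def overall(m):
    """Assumed combination rule: overall = field matching x query submission."""
    return m["R_fm"] * m["R_qs"], m["P_fm"] * m["P_qs"]


for label, m in (("car ads", car_ads), ("digital camera", digital_camera)):
    recall, precision = overall(m)
    print(label, percent(recall), percent(precision),
          percent(m["P_fm"] * m["P_qs_next"]))
```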


Results Discussion

Field matching
– By value
  Successful: 100%
– By name
  Successful example: price vs. myprice, pricelow, pricehigh, _extern_price, min_price, max_price
  Failed: price vs. lo_p, hi_p


Results Discussion (cont.)

Query submission


Conclusion

Our system’s performance
– Fields applicable to extraction ontologies: 61.9%
– Fields system matched: 95.7%
– Queries submitted that are necessary: 91.4%

To improve the performance
– Field labels
– The quality of the extraction ontologies

Forms our system does not handle
– Multiple forms
– Forms whose actions are coded inside scripts
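The three performance percentages are not derived on the slides. As an observation rather than a statement of the authors' method, they agree with unweighted means of the per-domain figures reported for the car-ads and digital-camera experiments, as this short check shows.

```python
def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Per-domain ratios taken from the experimental results (car ads, digital camera).
applicable = mean([21 / 31, 23 / 41])    # fields applicable to the ontology
matched = mean([21 / 21, 21 / 23])       # field-matching recall
necessary = mean([249 / 301, 31 / 31])   # query-submission precision

print(f"{applicable:.1%} {matched:.1%} {necessary:.1%}")
```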


Contributions

Enables directed hidden Web crawling
– Accurate field matching
– Efficient form filling and submission
– Post processing for precise results

Ontology based
– Extensible to multiple domains
– Resilient to page changes