Automating the Extraction of Data Behind Web Forms

Brigham Young University

Sai Ho Yau

Post on 19-Dec-2015

Hurdles Against Automating Data Extraction

An enormous amount of information is available on the Web, but it is difficult to extract the data automatically for several reasons:

Web information is stored in databases.

The databases are reached through form interfaces.

Relevant information can be obtained only after a Web form is filled out and submitted.

Problems Dealing with Forms

No general Web form design

Required text fields

One form may lead to another

Resulting information embedded within forms

Returned error messages versus valid data

Elimination of possible duplicate data

Motivations

We want to automatically:

Fill in Web forms.

Extract information behind forms.

Screen out errors.

Eliminate duplicate data and merge resulting information.

The Framework

Method: Construct the Query String

Domain_Path: http://www.automobilesearch.com/
win2form_action: search.html
win2Elem_length_0: 9
win2Elem_name_0: cat2
win2Elem_type_0: select-one
win2Elem_value_0: 0
win2Elem_option_length: 16
win2Elem_option_Text_0: All Types
win2Elem_option_0: 0
win2Elem_option_Text_1: Accessories
win2Elem_option_1: 4940
win2Elem_option_Text_2: Classic Cars
win2Elem_option_2: 4981
:
:
win2Elem_name_1: manufacturer
win2Elem_type_1: select-one
win2Elem_value_1:
win2Elem_option_length: 43
win2Elem_option_Text_0: Any Manufacturer
win2Elem_option_0:
win2Elem_option_Text_1: AM General
win2Elem_option_1: AM General
:
:
win2Elem_name_6: minyear
win2Elem_type_6: text
win2Elem_value_6:
win2Elem_name_7: maxyear
win2Elem_type_7: text
win2Elem_value_7:
win2Elem_name_8: go
win2Elem_type_8: submit
win2Elem_value_8: Search

Query String:

http://www.automobilesearch.com/search.html?cat2=0&manufacturer=&searcharea=0&mincost=&maxcost=&currency=USD&minyear=&maxyear=&go=Search
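
A query string like the one above can be assembled mechanically from the form description. The following is a minimal Python sketch, assuming the form is submitted with GET and that each field's default value has already been read from the form. The helper name `build_query_url` and the `fields` list are illustrative only and are not taken from the original system.

```python
from urllib.parse import urlencode, urljoin


def build_query_url(domain_path, form_action, fields):
    """Join the form action to the domain path and append name=value pairs."""
    base = urljoin(domain_path, form_action)
    # urlencode percent-encodes values and keeps empty values as "name=".
    return base + "?" + urlencode(fields)


# Field names and default values as they appear in the query string above.
fields = [
    ("cat2", "0"),
    ("manufacturer", ""),
    ("searcharea", "0"),
    ("mincost", ""),
    ("maxcost", ""),
    ("currency", "USD"),
    ("minyear", ""),
    ("maxyear", ""),
    ("go", "Search"),
]

url = build_query_url("http://www.automobilesearch.com/", "search.html", fields)
print(url)
```

Keeping the fields as an ordered list of pairs, rather than a dictionary keyed by name, preserves the order in which the form declares its elements.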

Returned Web Page

Solutions

Two phases to deal with many possible responses to a query*:

• Sampling phase

• Exhaustive phase

* Assuming no HTTP error

Sampling Phase

• Submit the default form.

• Randomly select N form-field settings and submit the form N times.

• If no new information, STOP and send the result downstream (N is set so that the probability of subsequent submissions yielding new data is less than 5%).

• Otherwise, ENTER the Exhaustive Phase.
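
The sampling steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: `submit` is a hypothetical callable that sends one form setting and returns the records found on the returned page, the first option of each field is treated as its default, and `sample_size` reflects one possible reading of the 5% criterion (choose the smallest N for which N consecutive submissions with no new data would occur with probability at most 5% if each submission had a 5% chance of yielding new data). The slides do not say how N is actually computed.

```python
import math
import random


def sample_size(p_new=0.05, miss_risk=0.05):
    """Smallest N with (1 - p_new) ** N <= miss_risk (an assumed rule)."""
    return math.ceil(math.log(miss_risk) / math.log(1.0 - p_new))


def sampling_phase(field_options, submit, n, rng=random):
    """Submit the default form, then n random settings.

    Returns the set of records seen and a flag telling whether any random
    submission produced a record the default submission did not.
    """
    # Assumption: the first listed option of each field is its default.
    defaults = {name: options[0] for name, options in field_options.items()}
    seen = set(submit(defaults))
    found_new = False
    for _ in range(n):
        setting = {name: rng.choice(options)
                   for name, options in field_options.items()}
        records = set(submit(setting))
        if not records <= seen:  # at least one record not seen before
            found_new = True
            seen |= records
    return seen, found_new
```

If `found_new` is false, the collected records would be sent downstream; otherwise the process would continue with the Exhaustive Phase.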

Exhaustive Phase

• Estimate the total time and quantity of data.

• If below threshold, exhaustively obtain the rest of the information.

• Otherwise, return the results of the sampling and report to the user the estimate of time and quantity of data.
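
One way to picture the estimate and the threshold test is the Python sketch below. It assumes, purely for illustration, that the exhaustive phase would submit every combination of the selectable options, and that the average time per query and the average page size were measured during sampling. The function names and the thresholds are hypothetical; the slides do not specify how the estimate is made.

```python
import math


def estimate_exhaustive_cost(field_options, seconds_per_query, bytes_per_page):
    """Estimate query count, total time, and total data for all settings."""
    # Assumption: one query per combination of options across all fields.
    n_queries = math.prod(len(options) for options in field_options.values())
    return n_queries, n_queries * seconds_per_query, n_queries * bytes_per_page


def choose_action(est_seconds, est_bytes, max_seconds, max_bytes):
    """Proceed exhaustively only if both estimates are below the thresholds."""
    if est_seconds < max_seconds and est_bytes < max_bytes:
        return "exhaustive"
    return "report_estimate"
```

For two select fields with 16 and 43 options, this counts 16 × 43 = 688 queries; at an assumed 2 seconds per query that is 1376 seconds.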

Data Retrieving Strategy

• Locate possible duplicate information in subsequently retrieved Web pages during the Sampling and Exhaustive Phases.

Retrieved Web Pages

Data Retrieving Strategy

• Discard duplicates and merge new information.

Duplicates Discarded and New Information Merged

Data Retrieving Strategy

• Send fully merged data downstream for data extraction.
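
The discard-and-merge step could be sketched in Python as below. This is an assumption-laden illustration, not the method used in the original work: it treats each retrieved page as a list of text records, normalizes whitespace and case, and uses a hash of the normalized text to recognize duplicates. The function names and the sample records are made up for the example.

```python
import hashlib


def normalize(record):
    """Collapse whitespace and case so trivially different copies match."""
    return " ".join(record.split()).lower()


def merge_new_records(merged, seen_hashes, page_records):
    """Append records not seen before to merged; return how many were new."""
    added = 0
    for record in page_records:
        digest = hashlib.sha1(normalize(record).encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            merged.append(record)
            added += 1
    return added


# Made-up records standing in for rows extracted from two returned pages.
merged, seen = [], set()
first = merge_new_records(merged, seen, ["Listing A  price 100", "Listing B price 200"])
second = merge_new_records(merged, seen, ["listing b price 200", "Listing C price 300"])
print(merged)
```

The count returned by each call also tells the sampling phase whether a submission produced any new information.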

Conclusions

We can automate the data extraction process by automatically:

Filling in Web forms.

Retrieving information behind forms.

Handling errors.

Filtering duplicate data and merging the resulting information.