Post on 19-Dec-2015
Automating the Extraction of Data Behind Web Forms
Brigham Young University
Sai Ho Yau
Hurdles Against Automating Data Extraction
Enormous amounts of information are available on the Web, but extracting the data automatically is difficult for several reasons:
• Web information is stored in databases.
• The databases are accessed through form interfaces.
• Relevant information can be obtained only after a Web form is filled out and submitted.
Problems Dealing with Forms
• No general Web form design
• Required text fields
• One form may lead to another
• Resulting information embedded within forms
• Returned error messages versus valid data
• Elimination of possible duplicate data
Motivations
We want to automatically:
• Fill in Web forms.
• Extract information behind forms.
• Screen out errors.
• Eliminate duplicate data and merge resulting information.
The Framework
Method: Construct the Query String
Domain_Path: http://www.automobilesearch.com/
win2form_action: search.html
win2Elem_length_0: 9
win2Elem_name_0: cat2
win2Elem_type_0: select-one
win2Elem_value_0: 0
win2Elem_option_length: 16
win2Elem_option_Text_0: All Types
win2Elem_option_0: 0
win2Elem_option_Text_1: Accessories
win2Elem_option_1: 4940
win2Elem_option_Text_2: Classic Cars
win2Elem_option_2: 4981
:
:
win2Elem_name_1: manufacturer
win2Elem_type_1: select-one
win2Elem_value_1:
win2Elem_option_length: 43
win2Elem_option_Text_0: Any Manufacturer
win2Elem_option_0:
win2Elem_option_Text_1: AM General
win2Elem_option_1: AM General
:
:
win2Elem_name_6: minyear
win2Elem_type_6: text
win2Elem_value_6:
win2Elem_name_7: maxyear
win2Elem_type_7: text
win2Elem_value_7:
win2Elem_name_8: go
win2Elem_type_8: submit
win2Elem_value_8: Search
Query String:
http://www.automobilesearch.com/search.html?cat2=0&manufacturer=&searcharea=0&mincost=&maxcost=&currency=USD&minyear=&maxyear=&go=Search
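As a rough illustration of how such a query string could be assembled from the form description, here is a minimal Python sketch. It assumes the form is submitted with GET, and the function name `build_query_url` is hypothetical. The field names and default values are read off the form description and the query string shown above.

```python
from urllib.parse import urlencode, urljoin


def build_query_url(domain_path, form_action, fields):
    """Join the form action onto the base path and append the encoded fields."""
    base = urljoin(domain_path, form_action)
    return base + "?" + urlencode(fields)


# Default settings for the nine form elements, in form order.
# A list of pairs (not a dict) keeps the order explicit.
default_fields = [
    ("cat2", "0"),
    ("manufacturer", ""),
    ("searcharea", "0"),
    ("mincost", ""),
    ("maxcost", ""),
    ("currency", "USD"),
    ("minyear", ""),
    ("maxyear", ""),
    ("go", "Search"),
]

url = build_query_url(
    "http://www.automobilesearch.com/", "search.html", default_fields
)
print(url)
```

Changing one pair, for example `("manufacturer", "AM General")`, yields the query string for a different form setting. `urlencode` takes care of escaping the space.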
Returned Web Page
Solutions
Two phases to deal with many possible responses to a query*:
• Sampling phase
• Exhaustive phase
* Assuming no HTTP error
Sampling Phase
• Submit the default form.
• Randomly select N form-field settings and submit the form N times.
• If no new information, STOP and send the result downstream (N is set so that the probability of subsequent submissions yielding new data is less than 5%).
• Otherwise, ENTER the Exhaustive Phase.
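The sampling steps above might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `sampling_phase`, `submit`, and `fake_submit` are hypothetical names, the default setting is taken to be the first option of each field, and each returned record is treated as a plain string. How N is derived from the 5% bound is not shown here, so `n` is simply a parameter.

```python
import random


def sampling_phase(field_options, submit, n, rng=random):
    """Submit the default form, then n randomly chosen settings.

    Returns (records, found_new): all distinct records seen, and whether
    any random submission returned a record the default form did not.
    """
    # Assumption: the first option listed for each field is its default.
    default = {name: opts[0] for name, opts in field_options.items()}
    seen = set(submit(default))
    found_new = False
    for _ in range(n):
        setting = {name: rng.choice(opts) for name, opts in field_options.items()}
        records = set(submit(setting))
        if not records <= seen:  # at least one record not seen before
            found_new = True
            seen |= records
    return seen, found_new


# Stand-in for a real HTTP submission, used only for this demonstration.
def fake_submit(setting):
    cars = {"Ford": ["Ford Focus"], "Honda": ["Honda Civic"]}
    if setting["manufacturer"] == "":
        return cars["Ford"] + cars["Honda"]  # default query returns everything
    return cars[setting["manufacturer"]]


options = {"manufacturer": ["", "Ford", "Honda"]}
records, found_new = sampling_phase(options, fake_submit, n=5, rng=random.Random(0))
```

Here the default query already returns every record, so `found_new` stays `False` and the process would stop after sampling. If it were `True`, control would pass to the exhaustive phase.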
Exhaustive Phase
• Estimate the total time and quantity of data.
• If below threshold, exhaustively obtain the rest of the information.
• Otherwise, return the results of the sampling and report to the user the estimate of time and quantity of data.
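The slides do not say how the estimate is computed. One plausible approach, sketched below, is to multiply the number of remaining form settings by the average time and size per response observed during sampling. The function names, the averages, and the thresholds are all assumptions made for illustration. Free-text fields are ignored because their settings cannot be enumerated.

```python
from math import prod


def estimate_exhaustive_cost(option_counts, submitted, avg_seconds, avg_bytes):
    """Estimate time and data volume for the settings not yet submitted.

    option_counts: number of choices for each enumerable field.
    submitted:     number of settings already tried during sampling.
    """
    total_settings = prod(option_counts)
    remaining = max(total_settings - submitted, 0)
    return remaining * avg_seconds, remaining * avg_bytes


def below_threshold(est_seconds, est_bytes, max_seconds, max_bytes):
    """True when both estimates are below the user's limits."""
    return est_seconds < max_seconds and est_bytes < max_bytes


# Example: two select fields with 16 and 43 options, 11 settings already
# sampled, and assumed averages of 2.0 s and 20,000 bytes per response.
seconds, size = estimate_exhaustive_cost([16, 43], 11, 2.0, 20_000)
run_exhaustive = below_threshold(seconds, size, max_seconds=3600, max_bytes=50_000_000)
```

With these numbers, 677 settings remain. The estimate is 1354 seconds and about 13.5 MB, which is below both example limits, so the exhaustive phase would run.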
Data Retrieving Strategy
• Locate possible duplicate information in subsequently retrieved Web pages during the Sampling and Exhaustive Phases.
Retrieved Web Pages
Data Retrieving Strategy (continued)
• Discard duplicates and merge new information.
Duplicates Discarded and New Information Merged
Data Retrieving Strategy (continued)
• Send fully merged data downstream for data extraction.
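The discard-and-merge step can be illustrated with a short sketch. It assumes each retrieved page has already been split into record strings, and that two records count as duplicates when they match after whitespace and case normalization. The slides do not specify how duplicates are detected, so this matching rule and the name `merge_pages` are hypothetical.

```python
def merge_pages(pages):
    """Merge records from several pages, keeping the first copy of each.

    pages: a list of pages, each a list of record strings.
    Records are compared after collapsing whitespace and lowercasing.
    """
    seen = set()
    merged = []
    for records in pages:
        for record in records:
            key = " ".join(record.split()).lower()
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged


page_1 = ["1999 Ford Focus  $4,000", "2001 Honda Civic $6,500"]
page_2 = ["2001 honda civic $6,500", "1995 BMW 318i $3,200"]
merged = merge_pages([page_1, page_2])
```

The second page's Honda entry is dropped as a duplicate and the BMW entry is appended, so `merged` holds three records in first-seen order, ready to be sent downstream.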
Conclusions
We can automate the data extraction process by automatically:
• Filling in Web forms.
• Retrieving information behind forms.
• Handling errors.
• Filtering duplicate data and merging the resulting information.