Data collection with Web crawlers
(Web-crawl graphs)
further experience:
• technical/technological
– “treading lightly”
– incremental versus batch crawling
– HTTP headers
– character sets and malformed headers/urls
– shallow/deep queries
• methodological
– minimise modification/distortion of data
– maximise accessibility to the data
incremental versus batch crawling
HTTP headers
character sets and malformed headers/urls
• cannot assume ASCII
– WISER needs support for EU languages!
– characters are no longer bytes
• cannot assume either HTTP headers or html urls are well formed
– may contain arbitrary characters
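A minimal sketch of how a crawler might cope with both points, using only the Python standard library. The function names `decode_bytes` and `tidy_url` and the fallback order (declared charset, then UTF-8, then Latin-1) are assumptions made for illustration; the slides do not say how blinker handles this.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def decode_bytes(raw, declared_charset=None):
    """Decode bytes with the declared charset if it works, else fall back."""
    for name in (declared_charset, "utf-8"):
        if not name:
            continue
        try:
            return raw.decode(name.strip().strip("\"'"))
        except (LookupError, ValueError):
            # Unknown charset name, or bytes that are invalid in that charset.
            continue
    # Latin-1 maps every byte to a character, so this never raises.
    return raw.decode("latin-1")


def tidy_url(url):
    """Percent-encode stray characters in a url; return None if unparseable."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return None
    # "%" is kept safe so that existing percent-escapes are not encoded twice.
    path = quote(parts.path, safe="/%:@!$&'()*+,;=")
    query = quote(parts.query, safe="/%:@!$&'()*+,;=?")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    print(decode_bytes(b"caf\xe9"))             # not valid UTF-8, falls back
    print(tidy_url("http://example.org/a b?q=x y"))
    print(tidy_url("http://[bad"))
```

Because any such repair modifies the collected data, a crawler following this approach would probably want to store the raw bytes or original url alongside the tidied form.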
blinker (Web link crawler) development
blinker is a stable parameterised link crawler based on standard software components
• objectives
– to identify problem issues in crawling, e.g. non-standard servers, malformed data
– to demonstrate ethical crawling
– to provide Web-crawl graphs
– to compare the effect of varying crawling parameters
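The slides do not spell out what ethical crawling involves for blinker. One common ingredient, assumed here purely for illustration, is honouring a site's robots.txt before fetching. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the user-agent name `example-crawler`, and the helper `make_policy` are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def make_policy(robots_txt):
    """Build a robots.txt policy from text that has already been fetched."""
    policy = RobotFileParser()
    policy.parse(robots_txt.splitlines())
    return policy


# Hypothetical robots.txt content, used only for this example.
ROBOTS = """User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

if __name__ == "__main__":
    policy = make_policy(ROBOTS)
    agent = "example-crawler"  # hypothetical user-agent name
    print(policy.can_fetch(agent, "http://example.org/index.html"))
    print(policy.can_fetch(agent, "http://example.org/private/x"))
    print(policy.crawl_delay(agent))
```

A requested crawl delay is also a natural candidate for one of the crawling parameters whose effect could be compared, although the slides do not list which parameters blinker exposes.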
shallow / deep queries
• the query url problem
– query urls are not necessarily dynamic
– they are routinely collected by search engine crawlers
– they may lead to recursion, but recursion is not eliminated by ignoring them
• collecting shallow queries is a compromise
– a shallow query is a query url from a Web-page that does not itself have a query url
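The definition of a shallow query can be sketched as a small link filter. The names `classify_link` and `should_follow` are hypothetical, and treating a "deep" query as a query url found on a page that itself has a query url is an inference from the slide title rather than a definition given in the source.

```python
from urllib.parse import urlsplit


def has_query(url):
    """True if the url carries a non-empty query string."""
    return bool(urlsplit(url).query)


def classify_link(source_url, target_url):
    """Classify a link found on source_url that points to target_url."""
    if not has_query(target_url):
        return "plain"
    # A query url is shallow only if the page it was found on has no query.
    return "deep" if has_query(source_url) else "shallow"


def should_follow(source_url, target_url):
    """Follow plain links and shallow queries; skip deep queries."""
    return classify_link(source_url, target_url) != "deep"


if __name__ == "__main__":
    page = "http://example.org/index.html"
    listing = "http://example.org/list?page=1"
    print(classify_link(page, listing))
    print(classify_link(listing, "http://example.org/list?page=2"))
    print(should_follow(listing, "http://example.org/about.html"))
```

Under this rule a chain of query urls is cut after its first step, which limits one source of recursion while still collecting the query urls that plain pages link to.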
(further) methodological goals
• minimise modification/distortion of data
• maximise accessibility
These are discussed next in more detail in the context of using xml to exchange Web-crawl graphs.