Data collection with Web crawlers (Web-crawl graphs)

Upload: kamal-huff

Post on 31-Dec-2015



TRANSCRIPT

Page 1: Data collection with Web crawlers (Web-crawl graphs)

Data collection with Web crawlers

(Web-crawl graphs)

Page 2: Data collection with Web crawlers (Web-crawl graphs)

further experience:

• technical/technological

– “treading lightly”

– incremental versus batch crawling

– HTTP headers

– character sets and malformed headers/urls

– shallow/deep queries

• methodological

– minimise modification/distortion of data

– maximise accessibility to the data

Page 3: Data collection with Web crawlers (Web-crawl graphs)

incremental versus batch crawling
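The slides do not show blinker's implementation, so the following is a hypothetical sketch of the distinction: a batch crawl fetches every known URL in one pass, while an incremental crawl revisits only URLs whose recrawl interval has elapsed. The `interval` parameter and record format are assumptions for illustration.

```python
import time

def batch_crawl(urls, fetch):
    """Batch crawling: fetch every URL in a single pass."""
    return {url: fetch(url) for url in urls}

def incremental_crawl(records, fetch, now=None, interval=86400):
    """Incremental crawling: refetch only URLs last seen more than
    `interval` seconds ago, updating the visit time as we go."""
    now = time.time() if now is None else now
    fetched = {}
    for url, last_seen in records.items():
        if now - last_seen >= interval:
            fetched[url] = fetch(url)
            records[url] = now  # record the new visit time
    return fetched
```

Incremental crawling spreads load over time and supports "treading lightly"; batch crawling gives a single consistent snapshot but revisits everything.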

Page 4: Data collection with Web crawlers (Web-crawl graphs)

HTTP headers
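As an illustration of the headers a polite crawler typically sends (the User-Agent string and contact URL below are made-up placeholders, not blinker's actual values): identifying the crawler via `User-Agent`, and using a conditional GET via `If-Modified-Since` so an unchanged page costs only a 304 response, which supports incremental crawling cheaply.

```python
import urllib.request
from email.utils import formatdate

def make_request(url, last_fetch_timestamp=None):
    """Build a request with polite crawler headers; values are placeholders."""
    headers = {
        "User-Agent": "example-crawler/0.1 (+http://example.org/crawler)",
        "Accept": "text/html",
    }
    if last_fetch_timestamp is not None:
        # Conditional GET: the server replies 304 Not Modified if the
        # page has not changed since this HTTP-date.
        headers["If-Modified-Since"] = formatdate(last_fetch_timestamp, usegmt=True)
    return urllib.request.Request(url, headers=headers)

req = make_request("http://example.org/", last_fetch_timestamp=0)
```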

Page 5: Data collection with Web crawlers (Web-crawl graphs)

character sets and malformed headers/urls

• cannot assume ASCII

– WISER needs support for EU languages!

– characters are no longer bytes

• cannot assume either HTTP headers or HTML URLs are well formed

– may contain arbitrary characters
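A minimal sketch of the decoding problem the slide describes, not blinker's actual code: trust the charset declared in the Content-Type header when one is present, but fall back to Latin-1 (which never raises) when the charset is missing, unknown, or mis-labelled. The UTF-8 default is an assumption for illustration.

```python
def decode_body(body_bytes, content_type_header):
    """Decode a response body using the charset from the Content-Type
    header, falling back to Latin-1 on malformed or wrong charsets."""
    charset = "utf-8"  # assumed default when none is declared
    if content_type_header:
        for part in content_type_header.split(";"):
            part = part.strip()
            if part.lower().startswith("charset="):
                charset = part[len("charset="):].strip('"\' ')
    try:
        return body_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # unknown charset name, or bytes that don't match the label
        return body_bytes.decode("latin-1")
```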

Page 6: Data collection with Web crawlers (Web-crawl graphs)

blinker (Web link crawler) development

blinker is a stable parameterised link crawler based on standard software components

• objectives

– to identify problem issues in crawling, e.g. non-standard servers, malformed data

– to demonstrate ethical crawling

– to provide Web-crawl graphs

– to compare the effect of varying crawling parameters
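The slides say only that blinker is parameterised so the effect of varying crawling parameters can be compared; they do not list the parameters. The parameter names below are therefore hypothetical, chosen to match the topics covered elsewhere in the deck.

```python
from dataclasses import dataclass

@dataclass
class CrawlParams:
    """Hypothetical parameter set for a parameterised link crawler."""
    max_depth: int = 3             # how far to follow links
    delay_seconds: float = 1.0     # per-host politeness delay ("treading lightly")
    follow_queries: str = "shallow"  # "none", "shallow", or "all" query URLs
    incremental: bool = True       # incremental versus batch crawling

# Comparing runs then amounts to varying one parameter at a time:
baseline = CrawlParams()
deeper = CrawlParams(max_depth=5)
```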

Page 7: Data collection with Web crawlers (Web-crawl graphs)

shallow / deep queries

• the query url problem

– query urls are not necessarily dynamic

– are routinely collected by search engine crawlers

– may lead to recursion, but recursion is not eliminated by ignoring them

• collecting shallow queries is a compromise

– a shallow query is a query url from a Web page that does not itself have a query url
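The shallow-query rule defined on this slide can be sketched directly (the function names are mine, not blinker's): a query URL is collected only when the page linking to it is not itself a query URL, which admits one level of query links while cutting off the recursive chains they can generate.

```python
from urllib.parse import urlparse

def is_query_url(url):
    """A query URL is one that carries a '?' query component."""
    return urlparse(url).query != ""

def should_collect(link_url, source_page_url):
    """Shallow-query compromise: collect plain URLs always, and query
    URLs only when the linking page is not itself a query URL."""
    if not is_query_url(link_url):
        return True
    return not is_query_url(source_page_url)
```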

Page 8: Data collection with Web crawlers (Web-crawl graphs)

(further) methodological goals

• minimise modification/distortion of data

• maximise accessibility

These goals are discussed in more detail next, in the context of using XML to exchange Web-crawl graphs.
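The slides say that XML is used to exchange Web-crawl graphs but do not give the schema, so the element and attribute names below are hypothetical. The sketch serialises a crawl graph (a list of link pairs) as a small XML document; a plain-text, self-describing format of this kind serves both stated goals, since it neither distorts the link data nor ties it to one tool.

```python
import xml.etree.ElementTree as ET

def graph_to_xml(edges):
    """Serialise a Web-crawl graph, given as (source, target) link
    pairs, into an XML string with one element per link."""
    root = ET.Element("webcrawlgraph")
    for src, dst in edges:
        link = ET.SubElement(root, "link")
        link.set("from", src)
        link.set("to", dst)
    return ET.tostring(root, encoding="unicode")

xml_doc = graph_to_xml([("http://a/", "http://b/")])
```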