Crawl RSS. Kristinn Sigurðsson, National and University Library of Iceland. IIPC GA 2014 – Paris


Crawl RSS
Kristinn Sigurðsson
National and University Library of Iceland

IIPC GA 2014 – Paris

The problem

• Certain sites change very frequently
  ▫ News sites especially

• While we can capture all the stories by visiting once per day, week, month or even year, they may have been modified several times in between, and front page changes will be missed

RSS feed advantages

• A change to the feed is highly likely to signify that an actual change has occurred

• A single RSS feed reports changes both to the presumed “front page” and to article or item pages

• RSS feeds are generally smaller (in bytes) than the front page (just HTML) of a site
  ▫ Crawling the RSS feed frequently is more likely to be tolerated

How it works 1/4

• On first load all feed elements are loaded
  ▫ A feed element is uniquely identified by its URL + timestamp

• Each element, plus the front page, is visited
  ▫ Embeds are downloaded
  ▫ No further links are followed
  ▫ Strict controls need to be in place to halt scope leakage
    ▫ Each feed element should lead to a small, finite number of URLs to crawl
    ▫ Basically, just get minimal embeds, do not follow links
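The URL + timestamp identity described above can be sketched in a few lines of Python. This is only an illustration, not the add-on's code: it assumes an RSS 2.0 feed where each item's `link` supplies the URL and `pubDate` supplies the timestamp, and `SAMPLE_FEED` and `feed_elements` are hypothetical names made up for the example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical two-item feed, used only to exercise the function below.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.org/</link>
    <item>
      <link>http://example.org/story-1</link>
      <pubDate>Mon, 12 May 2014 14:24:34 GMT</pubDate>
    </item>
    <item>
      <link>http://example.org/story-2</link>
      <pubDate>Mon, 12 May 2014 14:11:50 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def feed_elements(rss_xml):
    """Return the set of (url, timestamp) keys for the items in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    keys = set()
    for item in root.iter("item"):
        url = item.findtext("link")
        published = item.findtext("pubDate")
        if url is None or published is None:
            continue  # an item without both fields cannot be keyed
        # pubDate uses the RFC 822 date format, which email.utils can parse.
        keys.add((url.strip(), parsedate_to_datetime(published.strip())))
    return keys


keys = feed_elements(SAMPLE_FEED)
```

Because the key is the pair rather than the URL alone, a story that is republished with a new timestamp yields a new key and is treated as a changed element.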

How it works 2/4

• Once all the URLs generated by the initial feed elements have been crawled, the RSS feed may be revisited
  ▫ IF the minimum wait between visits has elapsed
  ▫ ELSE wait until the minimum time has elapsed

• The second visit will (probably) show many already seen elements
  ▫ Identified by URL + timestamp
  ▫ If the feed is entirely unchanged, then the content hash will likely be unchanged
  ▫ If a URL has a new timestamp, it is probable that the content of the item has changed
  ▫ Only load items that have a timestamp that is more recent than the ‘most recently seen’ timestamp for each feed
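The ‘most recently seen’ rule amounts to keeping a per-feed high-water mark. The sketch below assumes items are already available as (URL, timestamp) pairs; `select_new_items` and the example.org URLs are hypothetical and not taken from the add-on.

```python
from datetime import datetime, timezone


def select_new_items(items, most_recent_seen):
    """Return the items newer than most_recent_seen and the updated mark.

    items: iterable of (url, timestamp) pairs from one visit to a feed.
    most_recent_seen: newest timestamp from earlier visits, or None on first load.
    """
    if most_recent_seen is None:
        fresh = list(items)  # first load: every element is new
    else:
        fresh = [(url, ts) for url, ts in items if ts > most_recent_seen]
    # If nothing is new, the mark stays where it was.
    newest = max((ts for _, ts in fresh), default=most_recent_seen)
    return fresh, newest


def utc(hour, minute, second):
    """Helper for the example: a time on 12 May 2014, in UTC."""
    return datetime(2014, 5, 12, hour, minute, second, tzinfo=timezone.utc)


first_visit = [("http://example.org/a", utc(14, 11, 50)),
               ("http://example.org/b", utc(14, 24, 34))]
fresh_1, seen = select_new_items(first_visit, None)

# On the second visit, item "a" reappears with a newer timestamp.
second_visit = first_visit + [("http://example.org/a", utc(14, 30, 0))]
fresh_2, seen = select_new_items(second_visit, seen)
```

On the second visit only the re-stamped item is selected; the two already seen elements are skipped without being fetched.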

How it works 3/4

• If there are changed or new elements
  ▫ Fetch the ‘front page’ URI and the URIs of changed and new elements
    ▫ If they match existing content hashes, they may be discarded; otherwise they are written to (W)ARCs
  ▫ Do not revisit embedded content that we have already crawled
    ▫ This massively reduces the amount of time it takes to complete each RSS visit
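The two filters on this slide (discard on matching content hash, skip embeds already crawled) could look roughly like the following. It is a sketch under assumptions: the `VisitFilter` class is hypothetical, SHA-1 is chosen only for illustration, and state is kept in memory, whereas a real crawl would persist it.

```python
import hashlib


class VisitFilter:
    """Tracks the last stored content hash per URL and the embeds already crawled."""

    def __init__(self):
        self.last_hash = {}       # url -> digest of the last stored payload
        self.seen_embeds = set()  # embedded URLs fetched on any earlier visit

    def should_store(self, url, payload):
        """Return True if payload differs from what was last stored for url."""
        digest = hashlib.sha1(payload).hexdigest()
        if self.last_hash.get(url) == digest:
            return False  # identical content: the capture may be discarded
        self.last_hash[url] = digest
        return True

    def embeds_to_fetch(self, embed_urls):
        """Return only the embeds not crawled before, and remember them."""
        todo = [u for u in dict.fromkeys(embed_urls) if u not in self.seen_embeds]
        self.seen_embeds.update(todo)
        return todo


visit_filter = VisitFilter()
stored_first = visit_filter.should_store("http://example.org/", b"<html>v1</html>")
stored_same = visit_filter.should_store("http://example.org/", b"<html>v1</html>")
stored_changed = visit_filter.should_store("http://example.org/", b"<html>v2</html>")

embeds_first = visit_filter.embeds_to_fetch(["http://example.org/logo.png",
                                             "http://example.org/site.css"])
embeds_second = visit_filter.embeds_to_fetch(["http://example.org/logo.png",
                                              "http://example.org/photo.jpg"])
```

The second call to `embeds_to_fetch` returns only the new photo, which is where the time saving per visit comes from: shared assets such as logos and stylesheets are fetched once.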

How it works 4/4

• Once visit 2 is over
  ▫ Check whether the minimum wait has elapsed,
  ▫ rinse,
  ▫ repeat
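The check, rinse, repeat cycle can be written as a small loop. This is a minimal sketch, assuming the minimum wait is measured from the start of the previous visit; `run_visits`, `visit_feed` and the injectable `clock` and `sleep` parameters are hypothetical names, included so the timing can be exercised without real delays.

```python
import time


def run_visits(visit_feed, min_wait_s, max_visits,
               clock=time.monotonic, sleep=time.sleep):
    """Call visit_feed() max_visits times, with starts at least min_wait_s apart."""
    last_start = None
    for _ in range(max_visits):
        if last_start is not None:
            remaining = min_wait_s - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)  # minimum wait not yet elapsed: wait out the rest
        last_start = clock()
        visit_feed()  # expected to return once every URL from this visit is crawled


# Simulated time: each visit "takes" 4 seconds and the minimum wait is 10 seconds.
now = [0.0]
starts = []


def fake_visit():
    starts.append(now[0])
    now[0] += 4.0


run_visits(fake_visit, 10.0, 3,
           clock=lambda: now[0],
           sleep=lambda seconds: now.__setitem__(0, now[0] + seconds))
```

If a visit takes longer than the minimum wait, the next one starts immediately after it finishes, since the wait has already elapsed.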

Sites

• Many sites have multiple feeds
• Sometimes items will appear in more than one feed at a time
• It is therefore possible to have multiple related feeds for one site
• Such feeds are always crawled jointly and duplicate items are discarded
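Crawling related feeds jointly and discarding duplicates can be illustrated as a merge over (URL, timestamp) keys. The function name `merge_feeds`, the feed URLs and the timestamps below are all made up for the example.

```python
def merge_feeds(feeds):
    """Combine the items of several feeds for one site, keeping each key once.

    feeds: dict mapping feed URL -> list of (item_url, timestamp) pairs.
    Returns the unique pairs in first-seen order.
    """
    seen = set()
    merged = []
    for items in feeds.values():
        for key in items:
            if key not in seen:  # the same item listed in a second feed is dropped
                seen.add(key)
                merged.append(key)
    return merged


site_feeds = {
    "http://example.org/rss/news": [
        ("http://example.org/story-1", "2014-05-12T14:24:34Z"),
        ("http://example.org/story-2", "2014-05-12T14:11:50Z"),
    ],
    "http://example.org/rss/domestic": [
        ("http://example.org/story-1", "2014-05-12T14:24:34Z"),  # also in "news"
        ("http://example.org/story-3", "2014-05-12T13:02:10Z"),
    ],
}
to_crawl = merge_feeds(site_feeds)
```

Here story-1 appears in both feeds but is queued only once, so `to_crawl` holds three items.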

Example

RSS Site: ruv.is
State: HOLD_FOR_FEED_EMIT
Number of discovered items: 0
Minimum wait between emitting feeds (ms): 600000
Earliest next feed emission: Mon May 12 14:49:48 GMT 2014
URLs being crawled: 0
Feeds last emitted: Mon May 12 14:39:48 GMT 2014
Feeds:
  Feed: http://www.ruv.is/rss/frettir
    Most recent seen: Mon May 12 14:24:34 GMT 2014
    http://www.ruv.is/
  Feed: http://www.ruv.is/rss/erlent
    Most recent seen: Mon May 12 14:11:50 GMT 2014
    http://www.ruv.is/
    http://www.ruv.is/erlent
  Feed: http://www.ruv.is/rss/sport
    Most recent seen: Sun May 11 22:55:17 GMT 2014
    http://www.ruv.is/
    http://www.ruv.is/ithrottir
  Feed: http://www.ruv.is/rss/innlent
    Most recent seen: Mon May 12 14:24:34 GMT 2014
    http://www.ruv.is/
    http://www.ruv.is/innlent
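The timing fields in the status report above are consistent with each other: the last emission plus the 600000 ms minimum wait gives the earliest next emission. The check below is only arithmetic on the reported values; `STATUS_DATE_FORMAT` and `earliest_next_emission` are hypothetical names, and parsing the day and month abbreviations assumes an English (C) locale.

```python
from datetime import datetime, timedelta, timezone

# Matches dates as printed in the report, e.g. "Mon May 12 14:39:48 GMT 2014".
STATUS_DATE_FORMAT = "%a %b %d %H:%M:%S GMT %Y"


def earliest_next_emission(last_emitted, min_wait_ms):
    """Add the minimum wait (in milliseconds) to the last emission time."""
    last = datetime.strptime(last_emitted, STATUS_DATE_FORMAT)
    last = last.replace(tzinfo=timezone.utc)  # the report states GMT
    return last + timedelta(milliseconds=min_wait_ms)


next_emission = earliest_next_emission("Mon May 12 14:39:48 GMT 2014", 600000)
print(next_emission.strftime(STATUS_DATE_FORMAT))  # Mon May 12 14:49:48 GMT 2014
```

600000 ms is ten minutes, so 14:39:48 becomes 14:49:48, matching the "Earliest next feed emission" line.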

Configuration

• Either via Heritrix’s CXML

• Or using the database interface
  ▫ Maintaining the DB is outside the scope of the add-on
• Easy to add new configuration handlers

Crawl RSS - Heritrix 3 add-on

• Available on GitHub:
  ▫ https://github.com/Landsbokasafn/crawlrss
• Requires Heritrix 3.1.2 or newer
• Stable, but still technically in ‘beta’
• In use at NULI for almost a year now
  ▫ First news sites
  ▫ Now also select blogs and government sites
