Advanced Archive-It Application Training: Crawl Scoping

Upload: ethel-ellis

Post on 14-Jan-2016



Agenda

• Basic Crawl Scoping and Seed Types
• What to look for in your reports
• How to Change your Crawl Scope
  – Crawl Limits
  – Expand scope
  – Host rules
  – Actionable Host report

Archive-It Crawling Scope

• Scope: How to specify what you want to archive
  • In-scope URLs will be archived
  • Out-of-scope URLs are not archived
• The scope of a crawl is determined by:
  • the seed URLs added to a collection
  • any scoping rules specified for your collection

Archive-It Crawling Scope

• The crawler will start with your seed URL and follow links within your seed site to archive pages
• Only links associated with your seeds will be archived
• All embedded content on an in-scope page is captured

Example seed: www.archive.org/
• Link: www.archive.org/about.html is in scope
• Link: www.ca.gov is NOT in scope
• Embedded image: www.ala.org/logo.jpg is in scope
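The three examples on this slide can be read as a small decision rule. The sketch below is a simplified model of that rule, not Archive-It's actual implementation: the function name `in_scope` and the `embedded` flag are assumptions made for illustration, and directories and sub-domains are deliberately ignored.

```python
from urllib.parse import urlsplit


def _host(url):
    """Return the host of a URL, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url).hostname or ""


def in_scope(seed, url, embedded=False):
    """Toy scope test: links must share the seed's host;
    embedded content on an in-scope page is always captured."""
    if embedded:
        return True
    return _host(url) == _host(seed)


seed = "www.archive.org/"
print(in_scope(seed, "www.archive.org/about.html"))           # link on the seed site
print(in_scope(seed, "www.ca.gov"))                           # link to another site
print(in_scope(seed, "www.ala.org/logo.jpg", embedded=True))  # embedded image
```

The `embedded` flag stands in for the crawler knowing that a URL was found as page content (an image, stylesheet, and so on) rather than as a link.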

Archive-It Crawling Scope

• Seed URLs can limit the crawl to a single directory of a site.

• ex: www.archive.org/about/

* A trailing / at the end of your seed URL can have a big effect on scope

* Parts of the site not included in your seed directory will NOT be archived

Example seed: www.archive.org/about/

Link: www.archive.org/webarchive.html NOT in scope

Example seed: www.archive.org/about

Link: www.archive.org/webarchive.html IS in scope
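One way to picture the trailing-slash effect is to treat the seed as a prefix that is cut off after its last slash. This is an assumption that happens to reproduce both examples on the slide; the helper names `scope_prefix` and `in_directory_scope` are hypothetical and the real crawler logic may differ.

```python
def scope_prefix(seed):
    """Cut the seed after its last '/' (assumed rule, see lead-in)."""
    seed = seed.split("://", 1)[-1]      # drop the scheme if present
    if "/" not in seed:
        return seed + "/"                # bare host: whole site
    return seed[: seed.rfind("/") + 1]


def in_directory_scope(seed, url):
    """A link is in scope when it starts with the seed's prefix."""
    url = url.split("://", 1)[-1]
    return url.startswith(scope_prefix(seed))


link = "www.archive.org/webarchive.html"
print(scope_prefix("www.archive.org/about/"))   # prefix keeps the directory
print(scope_prefix("www.archive.org/about"))    # prefix falls back to the host root
print(in_directory_scope("www.archive.org/about/", link))
print(in_directory_scope("www.archive.org/about", link))
```

With the slash, the prefix is the `about/` directory and the link falls outside it; without the slash, the prefix is the whole host and the link is included.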

Archive-It Crawling Scope

• Sub-domains are divisions of a larger site, named to the left of the host name (ex. crawler.archive.org)
• Sub-domains of seed URLs are NOT automatically in scope
• To crawl sub-domains, either:
  • Add individual sub-domains as separate seed URLs
  • Or add an ‘Expand Scope’ rule to allow all or specific sub-domains

Example seed: www.archive.org
• Link: crawler.archive.org NOT in scope

Example seed: archive.org
• Link: crawler.archive.org IS in scope
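The two sub-domain examples are consistent with a simple suffix test on host names. The sketch below assumes that rule; `host_in_scope` is a hypothetical helper, not part of any Archive-It interface.

```python
def host_in_scope(seed_host, link_host):
    """A host is in scope if it equals the seed host or sits beneath it,
    i.e. ends with '.' + seed_host (assumed rule)."""
    seed_host = seed_host.lower()
    link_host = link_host.lower()
    return link_host == seed_host or link_host.endswith("." + seed_host)


print(host_in_scope("www.archive.org", "crawler.archive.org"))  # sibling of www
print(host_in_scope("archive.org", "crawler.archive.org"))      # beneath archive.org
```

Because `crawler.archive.org` is a sibling of `www.archive.org` rather than a child of it, only the shorter seed covers it.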

Seed Types

Default

– Used for the majority of seeds; the standard setting for most crawls. Captures all links that are in scope.

Crawl One Page Only

– Capture just your seed URL and embedded content

RSS/News Feed

– Capture any linked pages from your seed URL as one page only
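The difference between the three seed types can be illustrated on a toy link graph. Everything below is an assumption for illustration: the `SeedType` enum, the `pages_captured` function, and the page names are hypothetical, embedded content is left out, and the RSS case assumes every page linked from the feed is taken.

```python
from enum import Enum


class SeedType(Enum):
    DEFAULT = "Default"
    ONE_PAGE = "Crawl One Page Only"
    RSS = "RSS/News Feed"


def pages_captured(seed, links, seed_type, in_scope):
    """links maps a page to the pages it links to;
    in_scope is a predicate applied to each linked page."""
    if seed_type is SeedType.ONE_PAGE:
        return {seed}                        # the seed page only
    if seed_type is SeedType.RSS:
        return {seed, *links.get(seed, [])}  # seed plus each linked page, one page each
    # DEFAULT: keep following in-scope links until none are left
    seen, queue = {seed}, [seed]
    while queue:
        page = queue.pop()
        for nxt in links.get(page, []):
            if nxt not in seen and in_scope(nxt):
                seen.add(nxt)
                queue.append(nxt)
    return seen


links = {"seed": ["a", "offsite"], "a": ["b"]}
not_offsite = lambda page: page != "offsite"
for seed_type in SeedType:
    print(seed_type.value, sorted(pages_captured("seed", links, seed_type, not_offsite)))
```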

Analyzing Crawl Scope

• How to analyze the scope of your crawls:
  – Run a test crawl on new collections or seeds
  – Review the reports of the test crawl (or, for an existing crawl, the reports of the actual crawl)
  – Based on the reports, you will be able to add the appropriate scoping rules
  – It is a good idea to run a test crawl with your scoping rules in place to ensure they are correct
  – Note: Running a test crawl is just the first step. You may need to run additional tests to perfect your scoping rules.

Hosts Report

• Are there numbers in the “Queued” or “Robots.txt Blocked” column?
  – Check the URL lists to see whether you want to capture these URLs
• Are there hosts with fewer or more archived URLs than you expected?
  – Fewer: Are any expected URLs “Out of Scope”?
  – More: Are there parts of the site or specific URLs you want to block?
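The two questions on this slide amount to a short checklist per host. As a rough sketch, assuming the report has been copied into dictionaries by hand (the keys `archived`, `queued`, and `robots_blocked`, the function name, and the sample numbers are all made up for illustration and are not an Archive-It export format):

```python
def review_flags(row, expected_archived=None):
    """Return the follow-up questions a Hosts Report row raises."""
    flags = []
    if row.get("queued", 0) > 0:
        flags.append("queued: check the URL list")
    if row.get("robots_blocked", 0) > 0:
        flags.append("robots.txt blocked: check the URL list")
    if expected_archived is not None:
        if row["archived"] < expected_archived:
            flags.append("fewer than expected: any wanted URLs out of scope?")
        elif row["archived"] > expected_archived:
            flags.append("more than expected: anything to block?")
    return flags


row = {"host": "www.example.org", "archived": 120, "queued": 40, "robots_blocked": 0}
for flag in review_flags(row, expected_archived=500):
    print(row["host"], "->", flag)
```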

Common Reasons to Limit Crawl Scope

• Crawler traps (ex: calendars)
• “Duplicate” URLs (ex: print version URLs)
• If there are certain areas of the site you do not care about or do not want to archive
• If you just want a snapshot of the site, and don’t necessarily want to crawl it to completion
• If you only want to capture one page of a site

Modify Crawl Scope

Crawl Limits

2 Different Types of Rules

Host Constraints
– Ignore Robots.txt
– Block a host
– Limit the kinds of URLs from a specific host
  - by text match
  - by regular expression

Expand Scope
– Include URLs in a crawl that would not be in scope by default
  - by text match
  - by regular expression
  - by SURT
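The three ways of matching a URL (text, regular expression, SURT) can be compared side by side. The sketch below is illustrative only: `matches` and `to_surt` are hypothetical helpers, `to_surt` is a simplified SURT conversion that only reverses the host labels, the example URLs are made up, and whether a real regular-expression rule must match the whole URL or just part of it is not stated in the slides (this sketch uses a partial match).

```python
import re
from urllib.parse import urlsplit


def to_surt(url):
    """Simplified SURT form: host labels reversed and comma-separated,
    e.g. http://www.archive.org/about/ -> http://(org,archive,www,)/about/
    Expects a full URL including its scheme."""
    parts = urlsplit(url)
    labels = ",".join(reversed(parts.hostname.split(".")))
    return f"{parts.scheme}://({labels},){parts.path}"


def matches(rule_type, pattern, url):
    """Apply one rule of the given type to a URL."""
    if rule_type == "text":
        return pattern in url                        # plain substring
    if rule_type == "regex":
        return re.search(pattern, url) is not None   # partial match (assumption)
    if rule_type == "surt":
        return to_surt(url).startswith(pattern)      # SURT prefix
    raise ValueError(f"unknown rule type: {rule_type}")


print(to_surt("http://www.archive.org/about/"))
print(matches("text", "/calendar/", "http://www.example.org/calendar/2016/01"))
print(matches("regex", r"[?&]print=1", "http://www.example.org/page?print=1"))
print(matches("surt", "http://(org,archive,", "http://crawler.archive.org/docs"))
```

A SURT prefix that stops after the domain labels, such as `http://(org,archive,`, covers every host under that domain, which is why SURT is convenient for sub-domain rules.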

Host Constraints

• Specific to a host

http://www.facebook.com/archiveitorg is a URL

www.facebook.com is the HOST

facebook.com is also a host; a rule on it applies to all sub-domains, including photos.facebook.com
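To make the URL/host distinction concrete, the host can be pulled out of a URL with Python's standard `urllib.parse`. The `constraint_applies` helper is a hypothetical illustration of a host rule covering sub-domains; it is not an Archive-It function, and the photo URL is made up.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return just the host part of a full URL."""
    return urlsplit(url).hostname


def constraint_applies(rule_host, url):
    """A rule on a host also covers hosts beneath it (sub-domains)."""
    host = host_of(url) or ""
    return host == rule_host or host.endswith("." + rule_host)


print(host_of("http://www.facebook.com/archiveitorg"))
print(constraint_applies("facebook.com", "http://photos.facebook.com/album/1"))
print(constraint_applies("www.facebook.com", "http://photos.facebook.com/album/1"))
```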

Adding Host Constraints

Actionable Hosts Report

• Available in 5.0 Reports: Allows you to quickly add and review rules that were in place for specific hosts, as well as run a patch crawl for URLs blocked by Robots.txt.

Expand Crawl Scope

• How do you know you need to expand your scope?
  – Review the ‘Out of Scope’ column in the Hosts Report.
  – While clicking around your archived site, look for ‘Not in Archive’ trends that could be addressed by an expand scope rule.

Expand Crawl Scope

– Include all (or only specific) subdomains
– Include certain parts of the site that may not have been included based on the seed URL
  • Ex: seed URL is: http://mgahouse.maryland.gov/
    But you also want to archive pages such as:
  • http://files.maryland.gov/House/report.pdf

Expand Crawl Scope

Solution: Add an expand scope rule to include URLs that contain: “files.maryland.gov”
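The effect of this rule can be sketched as a default scope test plus a list of text-match expansions. This is a toy model under stated assumptions: `SEED_HOST`, `EXPAND_RULES`, and `in_scope` are names invented for the example, default scope is reduced to an exact host match, and `http://www.maryland.gov/` is included only as a made-up out-of-scope URL.

```python
from urllib.parse import urlsplit

SEED_HOST = "mgahouse.maryland.gov"     # host of the seed URL
EXPAND_RULES = ["files.maryland.gov"]   # text-match expand scope rules


def in_scope(url):
    """In scope by default (same host as the seed) or via an expand rule."""
    if urlsplit(url).hostname == SEED_HOST:
        return True
    return any(text in url for text in EXPAND_RULES)


for url in (
    "http://mgahouse.maryland.gov/",
    "http://files.maryland.gov/House/report.pdf",
    "http://www.maryland.gov/",
):
    print(url, in_scope(url))
```

Without the expand rule, the report PDF on `files.maryland.gov` would fail the default test because its host differs from the seed's.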

Expand Crawl Scope

WARNING: Expanding scope is a powerful tool, and the more specific the rule, the better. Expand scope rules do not help the crawler discover URLs.

Common mistake scenario: I’m responsible for archiving amazinguniversity.edu, so I’m going to create an expand scope rule to include any URL with amazinguniversity.edu.
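The slide does not spell out why this is a mistake. One plausible reading is that a plain text match on the domain also matches URLs on other sites that merely contain that string. The sketch below shows that effect and one narrower alternative; the function names and the `social.example.com` URL are hypothetical.

```python
from urllib.parse import urlsplit

DOMAIN = "amazinguniversity.edu"


def broad_rule(url):
    """'Contains the domain anywhere in the URL' (the scenario on the slide)."""
    return DOMAIN in url


def host_rule(url):
    """Narrower: the URL's own host must be the domain or one of its sub-domains."""
    host = urlsplit(url).hostname or ""
    return host == DOMAIN or host.endswith("." + DOMAIN)


offsite = "http://social.example.com/share?u=http://amazinguniversity.edu/news"
onsite = "http://library.amazinguniversity.edu/hours"

print(broad_rule(offsite), host_rule(offsite))   # broad rule pulls in the off-site URL
print(broad_rule(onsite), host_rule(onsite))     # both accept the university's own page
```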

Play it safe

1. Run Test Crawls

2. Deactivate Rules when appropriate.

Q&A