Advanced Archive-It Application Training: Reviewing Reports and Crawl Scoping

Upload: junior-weaver

Post on 18-Jan-2018


Agenda

• Basic Crawl Scoping
• What to look for in your reports
• How to Change your Crawl Scope
• Scope-It
• Live examples


Archive-It Crawling Scope

• Scope: how to specify what you want to archive
  • In-scope URLs will be archived
  • Out-of-scope URLs are not archived
• The scope of a crawl is determined by:
  • the seed URLs added to a collection
  • any scoping rules specified for the collection


Archive-It Crawling Scope

• The crawler will start with your seed URL and follow links within your seed site to archive pages
• Only links associated with your seeds will be archived
• All embedded content on an in-scope page is captured

Example seed: www.archive.org/
• Link: www.archive.org/about.html is in scope
• Link: www.ca.gov is NOT in scope
• Embedded image: www.ala.org/logo.jpg is in scope
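The three outcomes in the example above can be sketched in a few lines of Python. This is a simplified illustration of the rule as stated on the slide, not Archive-It's actual implementation: the names `host_of` and `is_captured` and the plain host-equality test are assumptions made for this example.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase host of a URL, adding a scheme if it is missing."""
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url).hostname or ""


def is_captured(url: str, seed: str, embedded: bool = False) -> bool:
    """Simplified rule from the slide (hypothetical helper).

    Embedded content found on an in-scope page is always captured;
    ordinary links are captured only when they stay on the seed's host.
    """
    if embedded:
        return True
    return host_of(url) == host_of(seed)


seed = "www.archive.org/"
print(is_captured("www.archive.org/about.html", seed))           # True: link on the seed site
print(is_captured("www.ca.gov", seed))                           # False: link to another site
print(is_captured("www.ala.org/logo.jpg", seed, embedded=True))  # True: embedded image
```

The `embedded` flag is what separates the last two cases: both point off-site, but only the image is needed to render an in-scope page.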


Archive-It Crawling Scope

• Seed URLs can limit the crawl to a single directory of a site
  ex: www.archive.org/about/
• A / at the end of your URL can have a big effect on scope
• Parts of the site not included in your seed directory will NOT be archived
• Example seed: www.archive.org/about/
  www.archive.org/webarchive.html is NOT in scope
• Example seed: www.archive.org/about
  www.archive.org/webarchive.html IS in scope
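One way to see why the trailing slash matters is to treat everything after the last "/" in the seed as a file name and keep only the directory in front of it. The sketch below assumes that simplified rule; `scope_prefix` and `in_scope` are hypothetical helpers for illustration and cover links only, not embedded content or scoping rules.

```python
from urllib.parse import urlsplit


def scope_prefix(seed: str) -> str:
    """Return the host + directory prefix implied by a seed (simplified).

    Text after the last "/" in the path is treated as a file name and
    dropped, so "host/about/" keeps the directory while "host/about"
    falls back to the site root.
    """
    if "://" not in seed:
        seed = "http://" + seed
    parts = urlsplit(seed)
    path = parts.path or "/"
    directory = path[: path.rfind("/") + 1]
    return (parts.hostname or "") + directory


def in_scope(url: str, seed: str) -> bool:
    """True if the URL falls under the seed's prefix."""
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    target = (parts.hostname or "") + (parts.path or "/")
    return target.startswith(scope_prefix(seed))


link = "www.archive.org/webarchive.html"
print(scope_prefix("www.archive.org/about/"))    # www.archive.org/about/
print(in_scope(link, "www.archive.org/about/"))  # False
print(scope_prefix("www.archive.org/about"))     # www.archive.org/
print(in_scope(link, "www.archive.org/about"))   # True
```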


Archive-It Crawling Scope

• Sub-domains are divisions of a larger site named to the left of the host name (ex. crawler.archive.org)

• Sub-domains of seed URLs are NOT automatically in scope

• To crawl sub-domains, either:
  • Add individual sub-domains as separate seed URLs
  • Or add an ‘Expand Scope’ rule to allow all or specific sub-domains
– Example seed: www.archive.org
– Link: crawler.archive.org is NOT in scope
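The sub-domain behavior can be illustrated with a small host check. This is a sketch under stated assumptions: the parameter `allow_subdomains_of` is a hypothetical stand-in for an 'Expand Scope' rule and does not reflect how Archive-It stores or evaluates its rules.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase host of a URL, adding a scheme if it is missing."""
    if "://" not in url:
        url = "http://" + url
    return urlsplit(url).hostname or ""


def host_in_scope(url: str, seed: str, allow_subdomains_of=()) -> bool:
    """By default only the seed's exact host is in scope.

    `allow_subdomains_of` (hypothetical) lists domains whose sub-domains
    should also be accepted, mimicking an expand-scope rule.
    """
    host = host_of(url)
    if host == host_of(seed):
        return True
    return any(host == d or host.endswith("." + d) for d in allow_subdomains_of)


seed = "www.archive.org"
print(host_in_scope("crawler.archive.org", seed))                                        # False
print(host_in_scope("crawler.archive.org", seed, allow_subdomains_of=("archive.org",)))  # True
```

The `"." + d` suffix test keeps look-alike hosts such as `notarchive.org` out of scope even when `archive.org` is allowed.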


Analyzing Crawl Scope

• How to analyze the scope of your crawls:
  – Run a test crawl on new collections or seeds
  – Review reports of the test crawl (or, for an existing crawl, review reports of the actual crawl)
  – Based on the reports, you will be able to add the appropriate scoping rules
  – It is a good idea to run a test crawl with your scoping rules in place to ensure they are correct
  – Note: This can be a trial-and-error process, so be patient


Reviewing Reports

• How to make the most of your time reviewing reports:
  – Review high-level reports first (Seed Status and Seed Source) for seed-level issues
  – Then review more detailed reports (Hosts report and file-type-specific reports)
  – Run a QA Report to see if any embedded content on your seed pages was not captured


Seed Status Report

• Are there any seeds not being crawled?
  – Double check your seed URLs are correct
  – Ignore robots.txt


Seed Source Report

• Are there any seeds that are capturing far fewer or far more URLs than others?
  – Fewer: Was the seed “Not Crawled” in the Seed Status report?
  – More: Check the Hosts report for any obvious area to limit your crawl


Hosts Report

• Are there numbers in the “Queued” or “Robots.txt Blocked” column?
  – Check the URL lists to see if you want to capture these URLs or not
• Are there hosts with fewer or more archived URLs than you expected?
  – Fewer: Are any expected URLs “Out of Scope”?
  – More: Are there parts of the site or specific URLs you want to block?


QA Report

• Is there embedded content on your seed pages that was not captured?
  – Run a Patch Crawl!


File Type/PDF/Videos Reports

• Are there file types you expected to archive that were not archived?
  – Check the “Out of Scope” column of the Hosts report for files not captured


Changing Crawl Scope

• The default Archive-It crawl settings can be adjusted

• Use Modify Crawl Scope options to limit or expand scope for specific websites

• Use Seed Types other than default to change the scope of a seed in specific ways

• Use Scope-It to refine the scope of your collection


Common Reasons to Limit Crawl Scope

• Crawler traps (ex: calendars)
• “Duplicate” URLs (ex: print version URLs)
• If there are certain areas of the site you do not care about or do not want to archive
• If you just want a snapshot of the site, and don’t necessarily want to crawl it to completion
• If you only want to capture one page of a site


Changing Crawl Scope

• How do you know you need to limit your scope?
  – You are using up more of your document budget than you want to or expected
  – Reviewing the Queued Docs in the Hosts report shows many URLs that you do not want or need


Common Reasons to Expand Crawl Scope

– Include all (or only specific) subdomains
– Include certain parts of the site that may not have been included based on the seed URL
  • Ex: seed URL is: http://mgahouse.maryland.gov/House/Catalog/catalogs/default.aspx
  • But you also want to archive pages such as http://mgahouse.maryland.gov/House/report.pdf
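The Maryland example can be worked through with a short sketch. It assumes the seed limits the crawl to the seed's own directory, and it models an expand rule as a plain "URL contains" test. The rule text `mgahouse.maryland.gov/House/` and the helper `in_scope` are assumptions chosen for illustration, not values taken from the training.

```python
# Directory implied by the seed URL in the example (assumed prefix rule).
SEED_PREFIX = "http://mgahouse.maryland.gov/House/Catalog/catalogs/"

# Hypothetical "URL contains" expand rule; the exact text is an assumption.
EXPAND_CONTAINS = ["mgahouse.maryland.gov/House/"]


def in_scope(url: str, seed_prefix: str = SEED_PREFIX, expand_contains=()) -> bool:
    """A URL is in scope if it sits under the seed directory or
    contains the text of any expand rule."""
    if url.startswith(seed_prefix):
        return True
    return any(text in url for text in expand_contains)


report = "http://mgahouse.maryland.gov/House/report.pdf"
print(in_scope(report))                                   # False: outside the seed directory
print(in_scope(report, expand_contains=EXPAND_CONTAINS))  # True: matched by the expand rule
```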


Changing Crawl Scope

• How do you know you need to expand your scope?
  – Review the ‘Out of Scope’ column in the Hosts report for a real or test crawl. If you see URLs you would like to be archived, make the appropriate scoping rule
  – If, in clicking around your archived site, you find ‘Not in Archive’ pages that you want captured, make the appropriate scoping rule and recrawl


Changing Crawl Scope

• “Not in Archive” example for http://www.epa.gov/climatechange/ seed


Changing Crawl Scope

• Other ways to expand scope:
  – Ignoring robots.txt blocks
    • Not available by default, but the feature can be turned on by request for a partner
    • Can ignore robots.txt blocks on a per-host basis
    • Can be helpful for capturing social media sites, stylesheets, and sites not in your organization's domain


Modify Crawl Scope

– Host constraints
  • Block completely
  • Block certain URLs (URL contains, regular expression)
  • Limit host to maximum number of URLs (documents)
  • (optional) Ignore robots.txt block
– Crawl Limits
  • Limit by number of URLs (documents) or amount of data
  • Crawl PDFs only
  • Change maximum crawl duration
– Expand Scope Rules
  • Crawl certain URLs (URL contains, regular expression, SURTs)
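To make the host constraints concrete, here is a minimal sketch of how "URL contains" blocks, regular-expression blocks, and a per-host document limit might combine. Everything in it is hypothetical: the rule values, the `example.org` URLs, and the helper names are illustrative assumptions, and the real crawler applies its rules internally rather than through code like this.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical rules: block a calendar trap and print-version duplicates.
BLOCK_CONTAINS = ["/calendar/"]
BLOCK_REGEX = [re.compile(r"[?&]print=1\b")]
MAX_DOCS_PER_HOST = 2  # hypothetical per-host document limit


def is_blocked(url: str) -> bool:
    """True if any 'URL contains' or regular-expression rule matches."""
    if any(text in url for text in BLOCK_CONTAINS):
        return True
    return any(pattern.search(url) for pattern in BLOCK_REGEX)


def apply_constraints(urls):
    """Drop blocked URLs, then stop accepting a host once it hits its limit."""
    kept, per_host = [], Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if is_blocked(url) or per_host[host] >= MAX_DOCS_PER_HOST:
            continue
        per_host[host] += 1
        kept.append(url)
    return kept


candidates = [
    "http://example.org/news/1",
    "http://example.org/calendar/2018/01/",
    "http://example.org/news/1?print=1",
    "http://example.org/news/2",
    "http://example.org/news/3",
]
print(apply_constraints(candidates))
# ['http://example.org/news/1', 'http://example.org/news/2']
```

The calendar and print-version URLs are dropped by the block rules, and the third news page is dropped because the host has already reached its limit.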


Host Constraints


Crawl Limits


Expand Crawl Scope
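Expand Scope rules can be written as SURTs. A SURT rewrites a URL so that the host reads from most general to most specific, which lets a single prefix cover a whole domain and its sub-domains. The sketch below shows a simplified version of that transformation; `to_surt` is a hypothetical helper that ignores ports, user info, and query strings, so it should be read as an illustration rather than the crawler's own canonicalization.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Convert a URL to a simplified SURT form, for example
    http://www.archive.org/about/ -> http://(org,archive,www,)/about/
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    reversed_host = ",".join(reversed(labels)) + ","
    return f"{parts.scheme}://({reversed_host}){parts.path or '/'}"


print(to_surt("http://www.archive.org/about/"))  # http://(org,archive,www,)/about/

# A prefix that stops before the sub-domain label matches every host under archive.org.
domain_prefix = "http://(org,archive,"
print(to_surt("http://crawler.archive.org/").startswith(domain_prefix))  # True
print(to_surt("http://www.ca.gov/").startswith(domain_prefix))           # False
```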


Seed Types

Crawl One Page Only

• Capture just your seed URL and embedded content

RSS/News Feed

• Capture any linked pages from your seed URL as one page only


Scope-It

• A tool for limiting the scope of new or existing collections


Why Use Scope-It?

• For existing collections/crawls:
  – Analyze completed crawls in existing collections and revise the scope for future crawls
  – View host report information and add rules to your collection from the same screen
  – Add the same scoping rules to multiple collections at once


Why use Scope-It?

• To test and scope new seeds before creating a collection:
  – Run a “Scope-It” test crawl on a set of seeds that are not yet part of a collection
  – Analyze the results of the crawl and potentially create a new collection with scoping rules in place


Changing Crawl Scope

• And now for some real-life examples...


Thank you!

• Any Questions, Discussion and/or Feedback?

• Please take our quick survey: http://www.surveymonkey.com/s/GZ8CWC8