9:00am – Welcome/Setting the Agenda for the Day
9:10am - 10:30am – Challenges of the Web Now & in the Future
Response to these Challenges
10:30am – BREAK
11:00am - 12:30pm – Intro to Metadata extraction, Data Mining & the Web Archiving Lifecycle
12:30pm – LUNCH
1:30pm - 3pm – Data Mining Breakout sessions/Deep Dives
3pm – BREAK
3:15pm - 4:30pm – Data Mining Breakout sessions/Deep Dives
4:30pm – Wrap-up & Next Steps
Welcome/Agenda
IIPC GA Meeting Ljubljana, Slovenia April 26, 2013
Data Mining & Web Archiving ‘Lifecycles’
Kris Carpenter Negulescu
Internet Archive
Use Cases
Election 2012 Collaborative
NLNZ 2013 Domain and GOV Collections
Wide00002/00005 Crawls
http://home.us.archive.org/~vinay/wide/wide-00002.html
http://home.us.archive.org/~vinay/wide/wide-00005.html
https://webarchive.jira.com/wiki/display/~vinay/Embed+Analysis+for+the+Wide00005+Crawl
IIPC General Assembly, The Hague, May 9, 2011
Traditional “Crawl” Lifecycles
CDXs/WATs
WARCs
Lucene Shards
Analyzing Scope & Quality
Preparing to Collect/Scoping/Framing a Crawl/Collection
Pre “Crawl” Workflows
Target identification (beyond curatorial selection…)
• Automated Filtering of Data Sources by Topic, Geo IP, file format, robots policy or other criteria
• Out-link analyses and ranking from selected sources, In-link analyses
• Mining Anchor text/Page Descriptions/Title tags (if not full text)
“Test” Capture Analyses (…routing to proper capture mechanisms)
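A minimal sketch of the automated filtering step listed above, assuming candidate URLs are screened by file extension and by a stored robots policy. The host names, the `ROBOTS` table, the user-agent string, and the extension list are all hypothetical and chosen only for illustration; the slides do not describe the actual tooling.

```python
import posixpath
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies keyed by host, so no network access is needed.
ROBOTS = {
    "example.org": ["User-agent: *", "Disallow: /private/"],
}

# Illustrative file-format criterion: keep pages and PDFs, skip everything else.
WANTED_EXTENSIONS = {"", ".html", ".htm", ".pdf"}


def allowed_by_robots(url, user_agent="example-crawler"):
    """Check a URL against the stored robots.txt lines for its host."""
    host = urlsplit(url).hostname or ""
    parser = RobotFileParser()
    parser.parse(ROBOTS.get(host, []))  # a host with no stored rules allows everything
    return parser.can_fetch(user_agent, url)


def select_targets(candidates):
    """Keep candidates that match the format criterion and the robots policy."""
    keep = []
    for url in candidates:
        extension = posixpath.splitext(urlsplit(url).path)[1].lower()
        if extension in WANTED_EXTENSIONS and allowed_by_robots(url):
            keep.append(url)
    return keep


candidates = [
    "http://example.org/index.html",
    "http://example.org/private/a.html",
    "http://example.org/video.mp4",
    "http://other.example/report.pdf",
]
selected = select_targets(candidates)
```

Topic or Geo IP criteria could be added as further predicates in `select_targets` in the same way.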
Your Browser: Behind the Scenes
Extracted Metadata & Links (WAT)
WAT is WARC ☺
WAT records are WARC metadata records
WARC-Refers-To header identifies original WARC record
WAT payload is JSON
Can be combined with Curator generated metadata
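As a rough illustration of the points above, the sketch below reads one hand-written, simplified WAT-style record: it splits the WARC headers from the JSON payload, picks out `WARC-Refers-To`, and walks the payload for extracted links. The sample record and the nested key names (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) are assumptions made for the example, and a real record carries more headers (such as `Content-Length`) than shown here.

```python
import json

# Hand-written, simplified record for illustration only; not taken from a real crawl.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"Envelope": {"Payload-Metadata": {"HTTP-Response-Metadata": '
    '{"HTML-Metadata": {"Links": '
    '[{"path": "A@/href", "url": "http://example.org/a"}]}}}}}'
)


def parse_wat_record(record):
    """Split a record into a header dict and the decoded JSON payload."""
    head, _, body = record.partition("\r\n\r\n")
    headers = {}
    for line in head.split("\r\n")[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers, json.loads(body)


def extract_links(payload):
    """Walk the assumed nesting and return the URLs of any extracted links."""
    node = payload
    for key in ("Envelope", "Payload-Metadata",
                "HTTP-Response-Metadata", "HTML-Metadata"):
        node = node.get(key, {})
    return [link["url"] for link in node.get("Links", []) if "url" in link]


headers, payload = parse_wat_record(SAMPLE_RECORD)
original_record_id = headers["WARC-Refers-To"]  # points back at the original WARC record
links = extract_links(payload)
```

Curator-generated metadata could be merged in by keying it on the same `WARC-Refers-To` identifier.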
Monitoring/Enhancing/Confirming Capture
Comparing Live Resources to Files Written
Evaluating Completeness (at all levels)
Generating Snapshots of Live and Archived resources
Eliminating Spam/Detecting Scoping Mistakes & Issues
Mining Crawl Logs (HIVE)
Mining Browser Logs
Mining/Analyzing Links
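The slide names Hive for crawl-log mining; the sketch below is a small Python stand-in, not the Hive workflow itself. It tallies fetch status codes per host, which is one simple way to spot hosts that dominate a crawl or return many errors. The log lines are invented, and the column order (timestamp, status, size, URI, discovery path, referrer, MIME type) is an assumption loosely modeled on crawler logs.

```python
from collections import Counter
from urllib.parse import urlsplit

# Invented log lines; assumed columns:
# timestamp, status, size, URI, discovery path, referrer, MIME type.
SAMPLE_LOG = """\
2013-04-26T09:00:00.000Z 200 5120 http://example.org/ - - text/html
2013-04-26T09:00:01.000Z 404 312 http://example.org/missing L http://example.org/ text/html
2013-04-26T09:00:02.000Z 200 20480 http://spam.example/ad1 L http://example.org/ text/html
2013-04-26T09:00:03.000Z 200 20480 http://spam.example/ad2 L http://spam.example/ad1 text/html
"""


def summarize(log_text):
    """Return {host: Counter of status codes} for well-formed log lines."""
    by_host = {}
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip lines that do not match the assumed layout
        status, uri = fields[1], fields[3]
        host = urlsplit(uri).hostname or "(unknown)"
        by_host.setdefault(host, Counter())[status] += 1
    return by_host


summary = summarize(SAMPLE_LOG)
```

A host whose counts are unexpectedly large, or mostly errors, is a candidate for a scoping review.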
Characterizing/Documenting/Preserving Captures & Collections
Enabling Access & Research
Host profiles
Link Graphs, Tag Clouds, & Visualizations
Collection Based: http://home.us.archive.org/~vinay/eot08-explore-data.html
Archive wide: http://home.us.archive.org/~vinay/global/1995-2011/stats.html
http://home.us.archive.org/~vinay/tld.html
Site/Page Evolution http://archive.org/details/TheNewYorkTimesTimelapse1996-2010
Portal Browse/Search http://eotarchive.cdlib.org/
Research Use/Access: History Tracker (Weber/Lazer), ARCLink (AlSum/Nelson)
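To make the link-graph idea concrete, here is a minimal sketch that collapses page-level links into a host-level graph and counts how many distinct hosts link to each host. The input pairs are invented, and the aggregation rule (drop same-host links, weight edges by link count) is an assumption for the example rather than the method behind the visualizations linked above.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_graph(link_pairs):
    """Aggregate (source URL, target URL) pairs into weighted host-to-host edges."""
    edges = Counter()
    for source, target in link_pairs:
        source_host = urlsplit(source).hostname
        target_host = urlsplit(target).hostname
        # Ignore unparseable URLs and links that stay on the same host.
        if source_host and target_host and source_host != target_host:
            edges[(source_host, target_host)] += 1
    return edges


def in_degree(edges):
    """Count the distinct hosts that link to each target host."""
    degree = Counter()
    for (_source_host, target_host) in edges:
        degree[target_host] += 1
    return degree


pairs = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/y"),
    ("http://c.example/1", "http://b.example/x"),
    ("http://a.example/1", "http://a.example/2"),
]
edges = host_graph(pairs)
degree = in_degree(edges)
```

The resulting edge list can be handed to any graph or visualization tool.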
HistoryTracker Tool
Beta Version!
Pig Scripts in Hadoop Environment
RU High-Speed Computing Cluster
Link Lists
Curated Data Sets