SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Luis Faria [email protected] KEEP SOLUTIONS www.keep-solutions.com SCAPE webinar July 26, 2014 Tools for uncovering preservation risks in your large repositories

Upload: scape-project

Post on 05-Dec-2014

Category:

Technology


DESCRIPTION

This presentation originates from a webinar presented by Luís Faria. The webinar presents the SCAPE-developed tools Scout and C3PO and demonstrates how to identify preservation risks in your content while sharing your content profile information with others to open new opportunities. Scout, the preservation watch system, centralizes the necessary knowledge on a single platform and cross-references it to uncover preservation risks. Scout automatically fetches information from several sources to populate its knowledge base; for example, it integrates with C3PO to get large-scale characterization profiles of content. Scout also aims to be a knowledge exchange platform that lets the community bring all the necessary information together in one system. Sharing information opens new opportunities for joining forces against common problems. The webinar was held 26 June 2014.

TRANSCRIPT

Page 1: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Luis Faria [email protected] KEEP SOLUTIONS www.keep-solutions.com

SCAPE webinar July 26, 2014

Tools for uncovering preservation risks in your large repositories

Page 2: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Repository

Format obsolescence

Emerging technology

Consumer trends

New standards

Organisation mission

Bit rot

Resource capability

System availability

Security breach

Economic limitations

Social and political factors

Producer trends

Organisation policies

Why do we need monitoring?

Page 3: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Repository

Format obsolescence

Emerging technology

Consumer trends

New standards

Organisation mission

Bit rot

Resource capability

System availability

Security breach

Economic limitations

Social and political factors

Producer trends

Organisation policies

Why do we need monitoring?

Risks

Opportunities

Page 4: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).

Preservation monitoring survey

181 valid participants

What descriptions fit your organization?

• University: 28.57%
• Memory institution or content holder: 26.64%
• National government institution: 15.83%
• Local government institution: 9.27%
• Small or medium enterprise: 7.34%
• Publisher or content producer: 5.02%
• Large enterprise: 2.70%
• Research funder: 2.70%
• Digital preservation vendor: 2.32%
• Big data science: 1.93%
• Non affiliated: 1.54%
• Data intensive industry: 0.77%
• Other: 5.41%

Page 5: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

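Preservation monitoring survey

Chart series per risk: Importance (normalized mean), Monitoring, Not monitoring, Uncertain or No answer

• File corruption: importance 92%, monitoring 41%, not monitoring 18%, uncertain or no answer 41%
• Backup failure: importance 89%, monitoring 51%, not monitoring 9%, uncertain or no answer 40%
• Staff not enough or adequate: importance 78%, monitoring 41%, not monitoring 18%, uncertain or no answer 41%
• Software platform obsolescence: importance 77%, monitoring 40%, not monitoring 13%, uncertain or no answer 46%
• Hardware platform obsolescence: importance 76%, monitoring 44%, not monitoring 12%, uncertain or no answer 44%
• Lack of context information: importance 76%, monitoring 23%, not monitoring 24%, uncertain or no answer 53%
• Incorrect action results: importance 75%, monitoring 27%, not monitoring 22%, uncertain or no answer 51%
• Consumers misalignment: importance 74%, monitoring 17%, not monitoring 25%, uncertain or no answer 58%
• Outdated preservation plans: importance 69%, monitoring 28%, not monitoring 25%, uncertain or no answer 47%
• Producers misalignment: importance 68%, monitoring 25%, not monitoring 19%, uncertain or no answer 55%
• Content not aligned with policies: importance 64%, monitoring 30%, not monitoring 23%, uncertain or no answer 46%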

Page 6: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Tools for uncovering preservation risks

Content → FITS → C3PO → Scout

• FITS: FITS output (XML)
• C3PO: File characteristics distribution (graphs and drill-down analysis)
• Scout: File and world properties throughout time and notifications
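To make the FITS-to-C3PO step concrete, here is a minimal sketch of the kind of aggregation a content profile is built from: counting MIME types across a set of FITS XML documents. It is not C3PO's implementation. It assumes FITS reports each identification as an `identity` element with `format` and `mimetype` attributes in the `http://hul.harvard.edu/ois/xml/ns/fits/fits_output` namespace; the sample documents and the names `make_sample` and `mimetype_distribution` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed FITS output namespace; check it against the XML your FITS version writes.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}


def mimetype_distribution(fits_documents):
    """Count the first reported MIME type of each FITS XML document."""
    counts = Counter()
    for xml_text in fits_documents:
        root = ET.fromstring(xml_text)
        identity = root.find("fits:identification/fits:identity", FITS_NS)
        mimetype = identity.get("mimetype") if identity is not None else None
        counts[mimetype or "unknown"] += 1
    return counts


def make_sample(fmt, mimetype):
    """Build a tiny, hypothetical FITS-like document for the demonstration."""
    return (
        f'<fits xmlns="{FITS_NS["fits"]}">'
        "<identification>"
        f'<identity format="{fmt}" mimetype="{mimetype}"/>'
        "</identification></fits>"
    )


SAMPLES = [
    make_sample("Portable Network Graphics", "image/png"),
    make_sample("Portable Network Graphics", "image/png"),
    make_sample("Portable Document Format", "application/pdf"),
    # A file that no tool could identify: empty identification section.
    f'<fits xmlns="{FITS_NS["fits"]}"><identification/></fits>',
]

if __name__ == "__main__":
    for mimetype, count in mimetype_distribution(SAMPLES).most_common():
        print(f"{mimetype}: {count}")
```

In practice the documents would be read from the FITS output directory rather than built in memory; the counting logic stays the same.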

Page 7: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

FITS - File Information Tool Set

• http://fitstool.org
• Characterization
  • Identification
  • Feature extraction
  • Validation
• Support for:
  • DROID
  • JHove
  • Apache Tika
  • ADL Tool
  • Exiftool
  • FFIdent
  • File Utility (windows port)
  • NLNZ Metadata Extractor
  • OIS Audio, File and XML Information

• https://github.com/keeps/fits/tree/keeps
• Developed by KEEPS
• Added support for:
  • FIDO
  • Microsoft Office
  • Adobe Illustrator
  • Corel Draw
  • Email (EML)
  • Autocad (DWG)
  • Shapefile
  • RTF, TXT
  • Databases (DBML)

Page 8: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

FITS - File Information Tool Set

• Demonstration
• Download from http://fitstool.org
• Execute for a file

    ./fits.sh -i file.png

• Execute for a directory

    ./fits.sh -r -i source_directory/ -o output_directory/
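The two invocations above can be wrapped for scripting. This is a minimal sketch that uses only the `-i`, `-o` and `-r` options shown on the slide; the helper names `fits_command` and `run_fits` and the `fits_home` parameter are hypothetical, and FITS must be installed for `run_fits` to work.

```python
import os
import subprocess


def fits_command(input_path, output_path=None, recursive=False, fits_home="."):
    """Build the fits.sh argument list from the options shown on the slide."""
    command = [os.path.join(fits_home, "fits.sh")]
    if recursive:
        command.append("-r")
    command += ["-i", str(input_path)]
    if output_path is not None:
        command += ["-o", str(output_path)]
    return command


def run_fits(input_path, output_path=None, recursive=False, fits_home="."):
    """Run FITS and return whatever it wrote to standard output."""
    result = subprocess.run(
        fits_command(input_path, output_path, recursive, fits_home),
        check=True,           # raise if FITS exits with an error
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    print(" ".join(fits_command("file.png")))
    print(" ".join(fits_command("source_directory/", "output_directory/", recursive=True)))
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.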

Page 9: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

FITS performance

• https://github.com/keeps/fits-testing
• 3 to 6 seconds per file
• 12 TB - A year
  • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits
• Other options for scalability:
  • Fido
  • Apache Tika
  • Nanite
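The per-file cost above can be turned into a rough estimate for a whole collection. This sketch assumes that characterization time scales linearly with the number of files and divides evenly across parallel workers, both of which are simplifications; the collection size and worker count in the example are made up.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_days(file_count, seconds_per_file, workers=1):
    """Days needed to characterize file_count files at a fixed per-file cost."""
    return file_count * seconds_per_file / workers / SECONDS_PER_DAY


def estimate_range(file_count, workers=1, fastest=3.0, slowest=6.0):
    """Best and worst case, using the 3 to 6 seconds per file from the slide."""
    return (
        estimated_days(file_count, fastest, workers),
        estimated_days(file_count, slowest, workers),
    )


if __name__ == "__main__":
    for workers in (1, 8):
        low, high = estimate_range(1_000_000, workers=workers)
        print(f"1,000,000 files, {workers} worker(s): {low:.1f} to {high:.1f} days")
```

Even a modest collection takes weeks on a single process at this rate, which is why the slide lists lighter-weight alternatives for scalability.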

Page 10: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

C3PO - Clever, Crafty Content Profile of Objects

• http://ifs.tuwien.ac.at/imp/c3po
• Web application
• Content characteristics aggregation
• Drill-down analysis

Page 11: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

C3PO install

• Download binaries at:
  • http://dl.bintray.com/peshkira/c3po/
• Install MongoDB:
  • http://www.mongodb.org/
• Install Apache Tomcat:
  • http://tomcat.apache.org/
• Put the C3PO web app in Apache Tomcat
  • Remove the ROOT dir from webapps and rename the C3PO web app to ROOT.war
• Start Apache Tomcat and connect to:
  • http://localhost:8080/
• Usage guide:
  • https://github.com/peshkira/c3po/wiki/Usage-Guide
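The Tomcat deployment step can be scripted. This is a minimal sketch assuming a standard Tomcat layout in which web applications live in a `webapps` directory; the function name `deploy_as_root`, the WAR file name and the paths are placeholders, not taken from the C3PO documentation. It copies the WAR instead of renaming it, so the downloaded file is kept.

```python
import shutil
import tempfile
from pathlib import Path


def deploy_as_root(war_file, webapps_dir):
    """Remove any existing ROOT application and install the WAR as ROOT.war."""
    webapps = Path(webapps_dir)
    root_dir = webapps / "ROOT"
    if root_dir.is_dir():
        shutil.rmtree(root_dir)  # drop the default ROOT application
    target = webapps / "ROOT.war"
    shutil.copyfile(war_file, target)  # copy, so the original download is kept
    return target


if __name__ == "__main__":
    # Dry run in a temporary directory standing in for a Tomcat installation.
    with tempfile.TemporaryDirectory() as tmp:
        webapps = Path(tmp) / "webapps"
        (webapps / "ROOT").mkdir(parents=True)
        war = Path(tmp) / "c3po-webapp.war"  # placeholder file name
        war.write_bytes(b"not a real war file")
        print("Deployed to", deploy_as_root(war, webapps))
```

Stop Tomcat before running this against a real installation, then start it again and open http://localhost:8080/.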

Page 12: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

C3PO performance

Dataset: Statsbiblioteket (Denmark)

• Size: 440M files (12 TB)
• Process time: 388h (16 days) / 50h for XML report
• Average time: 2.5s per 1000 files
• Web application has 2.5 million FITS files limit
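The figures above can be cross-checked with simple arithmetic. This sketch uses only the numbers on the slide. Dividing the total process time by the file count gives roughly 3.2 s per 1000 files, the same order as the quoted 2.5 s; the slide does not say which steps the quoted average covers. The batch count assumes the collection would be loaded into the web application in pieces of at most 2.5 million FITS files, which is an assumption and not something the slide describes.

```python
import math

FILES = 440_000_000          # 440M files, from the slide
PROCESS_HOURS = 388          # total process time, from the slide
WEB_APP_LIMIT = 2_500_000    # FITS files the web application can hold


def seconds_per_thousand(files, hours):
    """Average processing time per 1000 files, derived from the totals."""
    return hours * 3600 / files * 1000


def batches_needed(files, limit):
    """Number of loads needed if each load holds at most `limit` files."""
    return math.ceil(files / limit)


if __name__ == "__main__":
    print(f"Process time: {PROCESS_HOURS / 24:.1f} days")
    print(f"Derived average: {seconds_per_thousand(FILES, PROCESS_HOURS):.2f} s per 1000 files")
    print(f"Batches at the web application limit: {batches_needed(FILES, WEB_APP_LIMIT)}")
```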

Page 13: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Scout: a preservation watch system

Monitors aspects of the world to detect preservation risks and opportunities

Content, Policies, Web, Human knowledge, Registries → Scout → Risk notification

Page 14: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Information Sources

• Format registries & software catalogues

• Digital repositories & web archives

• Organizational objectives

• Experiments

• Simulation

• Human knowledge

Page 15: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Current information sources

• Repository content and events

• SCAPE Policy model

• PRONOM

• Web semantic extraction

• Web page renderability experiments

Page 16: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Define triggers

• Notify me when there are tools that can render the format X.

Page 17: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Define triggers

Simple query with templates

Page 18: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Receive notifications

Email

HTTP Push API

There are tools that can render format X.
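The trigger-and-notification idea from the last three slides can be illustrated with a small sketch. This is not Scout's data model or API: the `Tool` records, the `render_trigger` function and the example tools and formats are all hypothetical. In Scout itself the trigger is defined through the query templates and the result is delivered by email or HTTP push.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical knowledge-base entry: a tool and the formats it renders."""
    name: str
    renders: frozenset


def render_trigger(knowledge_base, format_name):
    """Return a notification message if any known tool renders the format, else None."""
    matches = sorted(
        tool.name for tool in knowledge_base if format_name in tool.renders
    )
    if not matches:
        return None
    return (
        f"There are tools that can render format {format_name}: "
        f"{', '.join(matches)}."
    )


# Made-up knowledge base for the demonstration.
KNOWLEDGE_BASE = [
    Tool("viewer-a", frozenset({"PNG", "TIFF"})),
    Tool("viewer-b", frozenset({"PDF"})),
]

if __name__ == "__main__":
    for fmt in ("PDF", "DWG"):
        message = render_trigger(KNOWLEDGE_BASE, fmt)
        print(message if message else f"No notification for format {fmt}.")
```

A watch system would re-evaluate such a condition whenever the knowledge base changes and notify only when the answer changes.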

Page 19: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Interfaces

Web page

REST API

Page 20: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

How to be a part of Scout

• Check out
  • Site: http://openplanets.github.io/scout/
  • Report: http://www.scape-project.eu/deliverable/d12-2-final-version-of-the-preservation-watch-component
  • Demo: http://scout.scape.keep.pt
• Integrate your content
• Contribute with information (soon)
  • Use Scout form for manual input of knowledge

Page 21: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Roadmap

• User support
• More trigger templates
• More adaptors
  • KrakeN / Propminer
  • Software catalogues
  • Other format registries
  • Other experiments information sources
  • Manual input (human knowledge)
  • Simulation

Page 22: SCAPE Webinar: Tools for uncovering preservation risks in large repositories

Luis Faria [email protected] KEEP SOLUTIONS www.keep-solutions.com

SCAPE webinar July 26, 2014

Tools for uncovering preservation risks in large repositories