Post on 25-Apr-2019

Columbia’s Born-Digital Preservation Infrastructure

& Ford International Fellowships Program

Columbia University

Columbia Libraries Digital Program

1. Collection-based digitization

2. Long-term digital preservation

3. Website development

4. Digital Library infrastructure development

5. Born-digital collection archiving and access

Ford International Fellowships Program

“The IFP has since 2001 offered fellowships for post-graduate study to leaders from underserved communities in Asia, Africa, Latin America, and Russia, and will complete its work in 2014. Their archives include documentation and videos of the more than 3,300 IFP fellows who passed through the program as well as comprehensive planning and administrative files.”

Ford IFP Grant

• Received in October 2011

• Become permanent archive for IFP’s paper and digital archives

• $1 million

• 3 years (technology portion)

• Archive and provide access to IFP’s archives

• And …

“ … to build out a full set of repository-based systems and services so that it can more easily acquire, ingest, process, preserve and make accessible both paper and born-digital organizational records.”

High-Level View

Countries Harvested From:

• Brazil

• Chile and Peru

• China

• Egypt

• Ghana

• Guatemala

• India

• Indonesia

• Kenya

• Mexico

• Mozambique

• Nigeria

• Palestine (Gaza and West Bank)

• Philippines

• Russia

• Senegal

• South Africa

• Tanzania

• Thailand

• Uganda

• United States - NYC Secretariat

• Vietnam

Languages Encountered:

• English

• Russian

• Portuguese

• Spanish

• Chinese

• Arabic

• Indonesian

• French

• Thai

• Vietnamese

Files Harvested:

334,000 and counting …

File Formats Encountered:

32, 3gp, accdb, adb, adp, adx, ai, aif, amr, asf, avi, axd, back, bat, bin, bk, blb, bmp, BridgeSort, btr, bup, cab, cat, cda, cdr, cfg, chm, cnf, cnm, con, css, cst, csv, cxt, d, dat, db, dbf, ddb, ddx, dfont, dir, dll, dmi, doc, doc-MRB, docm, docx, dot, ds_store, dtd, dwz, dxr, edb, edx, eml, emz, eps, exe, F&A, fcp, fff, fh9, fil, flp, flv, fol, frm, gdb, gdx, gif, hdb, hdx, hk4, hlp, hta, htm, html, ico, idx, ifo, inc, indd, inf, info, ini, itc2, itdb, itl, jar, jp2, jpe, jpeg, jpg, js, l, lck, ldb, lnk, log, m4a, m4v, mbx, mdb, mde, mdi, mdx, mht, mid, mls, mno, mov, mp3, mp4, mpeg, mpg, mpp, msf, msg, msi, mso, msv, mswmm, nri, ocx, odc, odt, ofa, oft, opd, opf, otf, p65, pab, pages, pcx, pdf, php, pif, plist, pm, pm!, pm0, pm5, pmd, pmh, pmi, pmj, pml, pmm, pmo, pmr, pms, pmx, pnc, pnd, png, pns, pnx, pot, pps, ppsx, ppt, pptx, prod, prod1, properties, psd, psp, pst, pub, qpw, qxd, r, ra, ra-att, rar, rdp, rel, rels, rem, rex, rpt, rsc, rtf, sav, sc4, sdb, sdx, sh, shs, snm, spi, spss, spv, spx, sql, svn-base, swa, swf, sys, tdb, tdx, thm, thmx, tif, tiff, tlb, tmp, toc, tpl, ttf, txt, txz, up, url, usr, utf8, vcf, vdproj, vob, vsd, wav, wbk, webarchive, wma, wmf, wmv, wmz, wpd, wpl, wps, xla, xlk, xls, xlsb, xlsm, xlsx, xlw, xml, xps, zip (243 different file formats)
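An inventory like the one above can be approximated by walking a harvested directory tree and tallying file extensions. The following is a minimal sketch, not the project's actual tooling: the function name `extension_inventory` is hypothetical, and lower-casing extensions is an assumption made here so that `DOC` and `doc` are counted together.

```python
from collections import Counter
from pathlib import Path


def extension_inventory(root):
    """Count files under `root` by lower-cased extension (without the dot).

    Hypothetical helper for illustration; files whose names carry no
    extension are tallied under the label "(none)".
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Path.suffix is "" when the name has no extension.
            ext = path.suffix.lstrip(".").lower() or "(none)"
            counts[ext] += 1
    return counts
```

Calling `extension_inventory("/path/to/harvest").most_common()` would list extensions from most to least frequent. Note that an extension is only a hint about a file's format; identifying the actual format requires inspecting the file contents.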

High Level Workflow:

Technology Tools:

• FRED (Forensic Recovery of Evidence Device) – hardware / OS

• Forensic Toolkit (FTK) – suite of tools

• Archivematica – preservation analysis and packaging

• Fedora – enterprise-level repository solution

• Archivists’ Toolkit – archival processing tool

• Solr – powerful Lucene-based search server

• Blacklight – open source discovery interface

Preservation ‘Curation’:

• From original file, generate format versions that are more preservable and more accessible than the original file form, e.g.,

• From MS .doc and .docx files generate .rtf and/or PDF/A

• From MS .xls and .xlsx files generate tab-delimited format

• From HD video files generate Motion JPEG 2000

• Database files? SPSS files? Pro Tools audio files?

• ‘Legacy’ file formats?
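The rules above can be read as a lookup from source format to preservation targets, with everything else set aside for a curatorial decision. The sketch below is one way to express that, assuming extension-based matching; `NORMALIZATION_TARGETS`, `plan_normalization`, and the target labels `pdfa` and `tsv` are hypothetical names, not part of Archivematica or any other tool listed here. The HD video rule is omitted because the slide does not name source extensions for it.

```python
from pathlib import PurePath

# Targets follow the slide's examples; the table layout and the labels
# "pdfa" and "tsv" are hypothetical shorthand for PDF/A and tab-delimited.
NORMALIZATION_TARGETS = {
    "doc": ["rtf", "pdfa"],
    "docx": ["rtf", "pdfa"],
    "xls": ["tsv"],
    "xlsx": ["tsv"],
}


def plan_normalization(filenames):
    """Split filenames into (planned, unresolved).

    `planned` maps each filename with a known rule to its target formats;
    `unresolved` lists files that still need a preservation decision
    (databases, SPSS files, legacy formats, and so on).
    """
    planned = {}
    unresolved = []
    for name in filenames:
        ext = PurePath(name).suffix.lstrip(".").lower()
        targets = NORMALIZATION_TARGETS.get(ext)
        if targets:
            planned[name] = list(targets)
        else:
            unresolved.append(name)
    return planned, unresolved
```

For example, `plan_normalization(["report.DOC", "budget.xlsx", "survey.sav"])` plans targets for the first two files and returns `survey.sav` as unresolved, which mirrors the open questions on the slide.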

Special IFP Project Challenges …

Determining and encoding intellectual property rights

Determining and encoding information relating to privacy and access

Metadata creation / extraction

Working with 23 separate entities in advance of their data deliveries and office closings

Building scalable workflows

Building scalable storage infrastructure
