Columbia’s Born-Digital Preservation Infrastructure
& Ford International Fellowships Program
Columbia University
Columbia Libraries Digital Program
1. Collection-based digitization
2. Long-term digital preservation
3. Website development
4. Digital Library infrastructure development
5. Born-digital collection archiving and access
Ford International Fellowships Program
“The IFP has since 2001 offered fellowships for post-graduate study to leaders from underserved communities in Asia, Africa, Latin America, and Russia, and will complete its work in 2014. Their archives include documentation and videos of the more than 3,300 IFP fellows who passed through the program as well as comprehensive planning and administrative files.”
Ford IFP Grant
• Received in October 2011
• Become the permanent archive for IFP’s paper and digital archives
• $1 million
• 3 years (technology portion)
• Archive and provide access to IFP’s archives
• And …
“ … to build out a full set of repository-based systems and services so that it can more easily acquire, ingest, process, preserve and make accessible both paper and born-digital organizational records.”
High-Level View
Countries Harvested From:
• Brazil
• Chile and Peru
• China
• Egypt
• Ghana
• Guatemala
• India
• Indonesia
• Kenya
• Mexico
• Mozambique
• Nigeria
• Palestine (Gaza and West Bank)
• Philippines
• Russia
• Senegal
• South Africa
• Tanzania
• Thailand
• Uganda
• United States - NYC Secretariat
• Vietnam
Languages Encountered:
• English
• Russian
• Portuguese
• Spanish
• Chinese
• Arabic
• Indonesian
• French
• Thai
• Vietnamese
Files Harvested:
334,000 and counting …
File Formats Encountered:
32, 3gp, accdb, adb, adp, adx, ai, aif, amr, asf, avi, axd, back, bat, bin, bk, blb, bmp, BridgeSort, btr, bup, cab, cat, cda, cdr, cfg, chm, cnf, cnm, con, css, cst, csv, cxt, d, dat, db, dbf, ddb, ddx, dfont, dir, dll, dmi, doc, doc-MRB, docm, docx, dot, ds_store, dtd, dwz, dxr, edb, edx, eml, emz, eps, exe, F&A, fcp, fff, fh9, fil, flp, flv, fol, frm, gdb, gdx, gif, hdb, hdx, hk4, hlp, hta, htm, html, ico, idx, ifo, inc, indd, inf, info, ini, itc2, itdb, itl, jar, jp2, jpe, jpeg, jpg, js, l, lck, ldb, lnk, log, m4a, m4v, mbx, mdb, mde, mdi, mdx, mht, mid, mls, mno, mov, mp3, mp4, mpeg, mpg, mpp, msf, msg, msi, mso, msv, mswmm, nri, ocx, odc, odt, ofa, oft, opd, opf, otf, p65, pab, pages, pcx, pdf, php, pif, plist, pm, pm!, pm0, pm5, pmd, pmh, pmi, pmj, pml, pmm, pmo, pmr, pms, pmx, pnc, pnd, png, pns, pnx, pot, pps, ppsx, ppt, pptx, prod, prod1, properties, psd, psp, pst, pub, qpw, qxd, r, ra, ra-att, rar, rdp, rel, rels, rem, rex, rpt, rsc, rtf, sav, sc4, sdb, sdx, sh, shs, snm, spi, spss, spv, spx, sql, svn-base, swa, swf, sys, tdb, tdx, thm, thmx, tif, tiff, tlb, tmp, toc, tpl, ttf, txt, txz, up, url, usr, utf8, vcf, vdproj, vob, vsd, wav, wbk, webarchive, wma, wmf, wmv, wmz, wpd, wpl, wps, xla, xlk, xls, xlsb, xlsm, xlsx, xlw, xml, xps, zip
(243 different file formats)
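A format inventory like the one above can be approximated by tallying file extensions across the harvested material. The following is a minimal sketch, not the project's actual tooling: the function names and the idea of walking a single harvest directory are assumptions made for illustration, and an extension is only a rough proxy for the real file format.

```python
from collections import Counter
from pathlib import Path


def extension_of(filename):
    """Return the text after the last dot, lowercased; '' if there is no dot."""
    _, dot, ext = filename.rpartition(".")
    return ext.lower() if dot else ""


def tally_extensions(filenames):
    """Count how many of the given file names carry each extension."""
    return Counter(extension_of(name) for name in filenames)


def filenames_under(root):
    """Yield the name of every regular file below `root`.

    `root` is a hypothetical harvest directory, assumed for this sketch.
    """
    for path in Path(root).rglob("*"):
        if path.is_file():
            yield path.name
```

Calling `tally_extensions(filenames_under(some_directory))` returns a `Counter` whose keys are the distinct extensions found. Note that this sketch lowercases every extension, so mixed-case entries such as BridgeSort would appear in lowercase.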
High-Level Workflow:
Technology Tools:
• FRED (Forensic Recovery of Evidence Device) – hardware / OS
• Forensic Toolkit (FTK) – suite of tools
• Archivematica – preservation analysis and packaging
• Fedora – enterprise-level repository solution
• Archivists’ Toolkit – archival processing tool
• Solr – powerful Lucene-based search server
• Blacklight – open source discovery interface
Preservation ‘Curation’:
• From the original file, generate format versions that are more preservable and more accessible than the original file form, e.g.,
  • From MS .doc and .docx files generate .rtf and/or PDF/A
  • From MS .xls and .xlsx files generate tab-delimited format
  • From HD video files generate Motion JPEG 2000
• Database files? SPSS files? Pro Tools audio files?
• ‘Legacy’ file formats?
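The normalization rules listed above can be written down as a small lookup table. This is a sketch under stated assumptions, not the project's actual configuration: the rule table and function name are hypothetical, and the set of extensions treated as "HD video" is a guess, since an extension alone does not reveal resolution. Formats the slide leaves as open questions (database, SPSS, Pro Tools, legacy formats) deliberately have no rule.

```python
# Targets taken from the slide; the dictionary itself is a hypothetical encoding.
NORMALIZATION_RULES = {
    "doc": ["rtf", "pdf-a"],
    "docx": ["rtf", "pdf-a"],
    "xls": ["tab-delimited text"],
    "xlsx": ["tab-delimited text"],
}

# Assumption: these extensions stand in for "HD video files". A real workflow
# would inspect the file itself rather than trust the extension.
HD_VIDEO_EXTENSIONS = {"mov", "mp4", "m4v"}


def plan_derivatives(filename):
    """Return the derivative formats to generate, or [] if no rule is defined."""
    _, dot, ext = filename.rpartition(".")
    ext = ext.lower() if dot else ""
    if ext in HD_VIDEO_EXTENSIONS:
        return ["motion jpeg 2000"]
    # Copy the list so callers cannot mutate the shared rule table.
    return list(NORMALIZATION_RULES.get(ext, []))
```

An empty result marks a file whose preservation treatment is still undecided, which keeps the open questions visible instead of silently skipping those files.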
Special IFP Project Challenges …
Determining and encoding intellectual property rights
Determining and encoding information relating to privacy and access
Metadata creation / extraction
Working with 23 separate entities in advance of their data deliveries and office closings
Building scalable workflows
Building scalable storage infrastructure