leveraging correspondence management systems - wordpress… · 2019. 5. 20. · leveraging...
LEVERAGING CORRESPONDENCE
MANAGEMENT SYSTEMS
(FOR DIGITAL OBJECT METADATA)
BRIAN THOMAS
ELECTRONIC RECORDS SPECIALIST
TEXAS STATE LIBRARY AND ARCHIVES COMMISSION
DISCLAIMER
This presentation and any subsequent discussion represent work and perspectives on
work completed at the Texas State Library and Archives Commission by the presenter.
Opinions and perspectives provided by this presenter are their own and may not indicate
the official stance of the agency.
CTS: THE CORRESPONDENCE TRACKING SYSTEM
Some details
1. Completely homegrown system
2. Interface written in Visual Basic 6
3. Running against an MS SQL Server
database
4. The database itself is a record
5. Covers physical mail, webmail, phone calls
6. Each mail/webmail item was supposed to
have a corresponding image file or PDF
WHAT IF…
The content in the database could be extracted in
a way that captured the elements of the
Governor’s staff interface?
And then paired with the individual images
themselves in the preservation/access system for
staff research?
And possibly indexed for some linked data fun?
FROM: HTTPS://WWW.YOUTUBE.COM/WATCH?V=AOF5LCT5JD0
IF YOU HAVE A HAMMER, EVERYTHING LOOKS LIKE A NAIL
About me and the tools at my disposal
1. I had been working on database preservation
2. I love virtualization
3. I had also been using Python extensively for API
and data manipulation in other projects
4. Therefore almost all work was done with Python
in a virtual machine for this project
5. I like the new Doctor
Courtesy https://imgur.com/gallery/NIgUNZZ
OVERVIEW OF THE WORK
Preserve database
Study database structure
Export and manipulate
data
Export data to valid
sidecar files
Final data manipulation
Fix miscellaneous
problems
THE ACTUAL STEPS
● Get SQL Server 2018 running
● Preserve the database into SIARD format
● Review tables in SQL Server Management
Studio and Database Visualization Toolkit to
understand data structure
● Review fields in CTS GUI to see what staff
would have worked with
● Determine how tables should be connected
● Export tables to CSV format
● Use Python PANDAS to merge tables
● Replace illegal characters in spreadsheets
● Use Python script to export metadata into
individual files
● Use Python script to create valid XML
● Use Python script to validate the XML
● Fix broken XML, re-validate until all good
● Transform metadata export to desired schema
(x2, see later explanation)
● Use Python script to remove artifacts from
transforms
● Use Python to correct filenaming/pairing errors
● Re-upload files with sidecar metadata
STEP ONE
Preserve the database
STEP 1: PRESERVE THE DATABASE
Running SQL Server
● First step, see the database in its actual
unmediated format
● Take SQL dump and import it into SQL Server
● Use SQL Server Management Studio or similar
software to review structure and contents
● Maybe can export directly to a spreadsheet?
Run XML export?
SQL Server Management Studio available here:
https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-2017
Run Database Preservation Toolkit
● SIARD format, XML-based
○ captures all database content and most
functions
● Invented by Swiss Federal Archives
○ SIARD Suite app converted databases to SIARD
● Database Preservation Toolkit is a product of the E-ARK
project and seeks to automate conversion to the more detailed
SIARD2 standard
● http://www.database-preservation.com/
● Later Swiss Federal Archives released a tool for
SIARD2.1 standard
○ https://www.bar.admin.ch/bar/en/home/archiving/tools/siard-suite.html
IN SQL SERVER MANAGEMENT STUDIO
IN DATABASE VISUALIZATION TOOLKIT
WHAT IT SHOULD HAVE LOOKED LIKE
STEP TWO
Study database structure
STEP 2: STUDY THE DATABASE STRUCTURE
1. Review staff GUI for essential
elements
2. Find elements in database tables
3. Develop a plan on how to
reconstruct the information
elements from all tables
4. Beware programmatic joins not
represented in linked tables
STEP THREE
Export and manipulate
data
STEP 3: EXPORT AND MANIPULATE DATA
1. Export each table to CSV using a
DBVTK export function
2. Load individual CSVs using Python
PANDAS
3. Merge CSV files on shared column
data
a. Use an outer, inner, left/right
join?
4. Iteratively save, slice and dice the
output
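The merge step above can be sketched as follows. This is a minimal illustration, not the script used for CTS: the DataFrames and the `constituent_id` column are hypothetical stand-ins, since the real table and column names are not shown in these slides. The `how` argument selects the join type, and `indicator=True` adds a `_merge` column that shows which rows found a partner.

```python
import pandas as pd


def merge_tables(left, right, key, how="outer"):
    """Merge two DataFrames on a shared column and flag unmatched rows."""
    # indicator=True adds a "_merge" column holding
    # "both", "left_only" or "right_only" for every row.
    return left.merge(right, on=key, how=how, indicator=True)


# Hypothetical stand-ins for two exported tables. With real exports,
# pd.read_csv(path, dtype=str) keeps IDs as text so leading zeros survive.
mail = pd.DataFrame({"constituent_id": ["1", "2"], "subject": ["Roads", "Parks"]})
people = pd.DataFrame({"constituent_id": ["2", "3"], "name": ["B. Smith", "C. Jones"]})

merged = merge_tables(mail, people, key="constituent_id", how="outer")
```

An outer join keeps everything, which makes it easier to see what failed to match before deciding whether an inner or left join is the right final choice.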
STEP FOUR
Export data to valid
sidecar files
STEP 4: EXPORT DATA TO VALID SIDECAR FILES
● Eliminate the illegal characters from the CSV(s)
first
○ I didn’t the first time and spent over a
day correcting the results
● Load each CSV and run a script to export that
data into a metadata file per ???
○ Make sure it appends data, not
overwrites. You may have multiple
entries for the same thing
● Run a script to encapsulate the data to create
valid XML
● Run another script to validate your XML
This Photo by Unknown Author is licensed under CC BY-SA
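For the first bullet, a sketch of stripping illegal characters from a CSV before any export. The slides do not say which characters caused trouble, so this assumes "illegal" means characters that XML 1.0 does not allow at all (mostly control characters). Characters such as `&` and `<` are legal once escaped, so they are left alone here. The function names are made up for illustration.

```python
import re

# Anything outside the XML 1.0 "Char" production: tab, LF, CR and the
# printable ranges are allowed, other control characters are not.
ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_value(value, replacement=""):
    """Remove (or replace) characters that can never appear in XML 1.0."""
    return ILLEGAL_XML_CHARS.sub(replacement, value)


def clean_csv(src_path, dst_path, encoding="utf-8"):
    """Copy a CSV line by line, dropping illegal characters on the way."""
    # newline="" keeps the original line endings untouched. Commas, quotes
    # and line breaks are all legal XML characters, so the CSV structure
    # is not altered by the cleaning.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(clean_value(line))
```

Writing to a new file rather than cleaning in place keeps the original export available for comparison.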
STEP FIVE
Final data
manipulation
STEP 5: FINAL DATA MANIPULATION
● Check existing XML schemas for fit
○ 95 data points
○ TEI too simple
○ Qualified Dublin Core not a good fit
● Write your own?
○ Yes!
● Run XSLTs against XML files to match
chosen/written schema
● Run more XSLTs to de-dupe content
● Re-arrange XML into correct directory structure
● Pair with files in-system or re-upload files
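The de-dupe step was done with XSLT in this project. Python's standard library has no XSLT engine, so the sketch below swaps in `xml.etree.ElementTree` to show the same idea: drop any element that exactly repeats an earlier sibling. The element names in the example are invented, and "exact repeat" (same tag, attributes, text and children) is an assumption about what counted as a duplicate.

```python
import xml.etree.ElementTree as ET


def dedupe_children(parent):
    """Remove child elements that exactly repeat an earlier sibling."""
    seen = set()
    for child in list(parent):      # iterate over a copy while removing
        dedupe_children(child)      # clean nested levels first
        # Serialize the child; strip() ignores trailing indentation.
        key = ET.tostring(child, encoding="unicode").strip()
        if key in seen:
            parent.remove(child)
        else:
            seen.add(key)
    return parent


record = ET.fromstring(
    "<record><subject>Roads</subject><subject>Roads</subject>"
    "<subject>Parks</subject></record>"
)
dedupe_children(record)
```

After the call, `record` keeps one `Roads` subject and one `Parks` subject, in their original order.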
STEP SIX
Fix miscellaneous
problems
PROBLEM ONE: MISSING IMAGES AND DB ENTRIES
● Everything should have been there
● Paper correspondence only sampled
● Some images had no metadata.
Outgoing/incoming correspondence not
logged? Log name is correct?
● Some metadata had no images. Missing
files? Never scanned?
● 353,674 Mail entries without any logged
scan. Never scanned? Forgot to add
filename?
● Yes to all
PROBLEM ONE: SOLUTION(S)
● Develop a script to identify what might be
missing
● Including specific filepaths for processing
● Create a cute no-scan placeholder file for
missing scans so metadata is preserved
● Leave items without metadata as is. Still text
searchable
PROBLEM TWO: CAPITALIZATION ERRORS
● False negatives for matching XML because…
● Staff did not capitalize database entries
the same way they capitalized the images
● Problem because metadata pairing process
is sensitive to exact filename
Solution
● Use comparative script to generate a list of
image/metadata files without matches (with
filepath)
● Use a script to de-capitalize listed filenames
and compare.
● If there is a match, use the image version of the
filename to rename the metadata file
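A sketch of the capitalization fix, working on one folder's worth of unmatched files at a time. It assumes an image and its metadata file share the same name apart from the extension (for example `AB123.tif` and `AB123.xml`); the actual pairing convention of the preservation system is not stated in the slides, and the function names are made up.

```python
from pathlib import Path


def plan_case_renames(image_paths, metadata_paths):
    """Return (old, new) metadata paths where only capitalization differs."""
    by_lower_stem = {}
    for image in map(Path, image_paths):
        by_lower_stem.setdefault(image.stem.lower(), image)

    renames = []
    for meta in map(Path, metadata_paths):
        image = by_lower_stem.get(meta.stem.lower())
        # Rename only when a lower-cased match exists and the exact
        # spelling differs; the image's spelling wins.
        if image is not None and image.stem != meta.stem:
            renames.append((meta, meta.with_name(image.stem + meta.suffix)))
    return renames


def apply_renames(renames):
    """Carry out the planned renames on disk."""
    for old, new in renames:
        old.rename(new)
```

Splitting the planning from the renaming makes it possible to review the list before anything on disk changes.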
PROBLEM THREE: SAME IMAGE IN MULTIPLE PLACES
● False negatives for matching XML
because…
● The file is in another folder altogether
● And it is in multiple places
Solution
● Use comparative script to generate a list of
image/metadata files without matches
(with filepath)
● Use a script to de-capitalize listed
filenames, drop the filepath and compare.
● If there is a match, copy the file to a new
location with the correct filepath
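A sketch of this solution: index every image by its lower-cased filename with the filepath dropped, then copy the first hit to the location where it was expected. The list of unmatched relative paths is assumed to come from an earlier comparison, and the function names and folder layout are illustrative only.

```python
import shutil
from pathlib import Path


def index_by_name(root):
    """Map lower-cased filename -> first path found anywhere under root."""
    index = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            index.setdefault(path.name.lower(), path)
    return index


def copy_to_expected(unmatched, image_root, dest_root):
    """Copy stray images to the relative paths where they were expected."""
    index = index_by_name(image_root)
    copied = []
    for rel in map(Path, unmatched):
        found = index.get(rel.name.lower())
        if found is None:
            continue  # still missing; handle separately
        target = Path(dest_root) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(found, target)
        copied.append(target)
    return copied
```

Copying rather than moving leaves the other copies in place, which matters when the same image legitimately belongs in several folders.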
PROBLEM FOUR: MISFILED/MISNAMED FILES
● Files put in the wrong directory
● E.G. 200106110167.tif filed in directory
2001/01/0111
● Files misnamed
● E.G. 200106110167.tif misnamed as
200101110167.tif
Solution
● If no matches in metadata, generate a
generic metadata file suggesting a search for the
correct metadata based on the content of the file
● SIP creator tool catches duplicate names;
correct them at the point that it finds errors.
PROBLEM FIVE: LOGGED PHONE CALLS
● 771,825 logged phone calls
● No document for these
● Need an object to pair metadata to OR
● Upload metadata only and rely on text
search?
● Create an html version of metadata?
Solution
● Find a cool icon
● Use a script to generate a list of metadata
files but with the file extension changed to
match the icon file extension
● Use a script to mass copy the icon into an
image that can be uploaded
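The icon step can be sketched in a single pass: for every metadata file, copy the icon next to it under the same name with the icon's extension. The `.xml` suffix and the pairing-by-name convention are assumptions, and `make_icon_copies` is a made-up name.

```python
import shutil
from pathlib import Path


def make_icon_copies(metadata_dir, icon_path, metadata_suffix=".xml"):
    """Give every metadata file an icon copy with a matching name."""
    icon_path = Path(icon_path)
    created = []
    for meta in sorted(Path(metadata_dir).rglob("*" + metadata_suffix)):
        # Same folder and name as the metadata, icon's file extension.
        target = meta.with_suffix(icon_path.suffix)
        if not target.exists():
            shutil.copyfile(icon_path, target)
            created.append(target)
    return created
```

Skipping targets that already exist makes the script safe to re-run after an interruption, which is worth having with hundreds of thousands of files.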
LESSONS LEARNED/COULD HAVE DONE BETTER
● Expanded conversation to account for
more internal stakeholder/staff requests
● Don’t trust anybody (that they did
100% of what they said they did)
● Direct database SQL queries?
● Before the fact contingency planning
http://4.bp.blogspot.com/-pOMrxILoPV8/TgOfWqGU8SI/AAAAAAAAAlU/XXDsDr4BaS8/s1600/mistake3.jpg
NOW LET’S DISCUSS...
1. How could this have been done better?
2. What situations are other people facing?
3. What limitations do you have to work
around?
4. Any other thoughts?
Courtesy NBC.com
(https://www.nbc.com/saturday-night-live/video/coffee-talk/n10457)
CONTACT INFORMATION
BRIAN THOMAS
NON-GOVERNMENTAL EMAIL:
BRIAN.THE.ARCHIVIST@GMAIL.COM
GOVERNMENTAL EMAIL:
WORK PHONE: 512-475-3374
SOME USEFUL SCRIPTS/TRICKS
MERGING SPREADSHEETS USING PYTHON/PANDAS
EXPORTING TO XML FROM CSV USING PYTHON
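The original script for this slide is not preserved in the transcript. As a rough sketch of the idea, the function below writes one `<entry>` block per CSV row into a file named after an ID column, opening the file in append mode so that repeated IDs accumulate instead of overwriting each other. The column names, the `<entry>` wrapper and the `.xml` extension are all assumptions, and CSV headers are assumed to be usable as XML element names.

```python
import csv
from pathlib import Path
from xml.sax.saxutils import escape


def export_rows(csv_path, out_dir, id_column):
    """Append one <entry> block per CSV row to a per-item file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            lines = ["<entry>"]
            for column, value in row.items():
                if column is None:
                    continue  # surplus cells beyond the header row
                # escape() turns &, < and > into XML entities.
                lines.append("  <{0}>{1}</{0}>".format(column, escape(value or "")))
            lines.append("</entry>")
            target = out_dir / (row[id_column] + ".xml")
            # "a" appends, so multiple rows for one item all survive.
            with open(target, "a", encoding="utf-8") as out:
                out.write("\n".join(lines) + "\n")
```

Each output file is only a fragment at this point; it becomes valid XML once the entries are wrapped in a single root element.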
XML ENCAPSULATION AND VALIDATION USING PYTHON
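The original script for this slide is not preserved either. A minimal sketch of both halves, under stated assumptions: `encapsulate` wraps a fragment in a declaration and a single root element (the `record` tag is invented), and `find_broken` lists files that fail to parse. Note that this only checks well-formedness; validating against a schema would need a third-party library, which is not shown here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def encapsulate(fragment, root_tag="record"):
    """Wrap an XML fragment in a declaration and one root element."""
    return '<?xml version="1.0" encoding="UTF-8"?>\n<{0}>\n{1}</{0}>\n'.format(
        root_tag, fragment
    )


def find_broken(xml_dir):
    """Return (path, error message) for every file that fails to parse."""
    broken = []
    for path in sorted(Path(xml_dir).rglob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            # The message includes the line and column of the problem.
            broken.append((path, str(err)))
    return broken
```

Running `find_broken` repeatedly after each round of fixes matches the "fix, re-validate until all good" loop described earlier.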
BATCH FILING IN WINDOWS COMMAND LINE
● Print file list to a text file (Karen’s Directory
printer works wonders)
● In Excel or another spreadsheet program
○ “mid” function to pull source directory
○ “mid” function to pull full filename
○ “mid” function to pull subdirectory 1, 2, etc.
○ “concat” function to assemble parts for a Windows “mkdir” Powershell command
■ Don’t forget to dedupe
○ “concat” function to assemble parts for Windows “move” cmd to file into new directories
● Copy finished mkdir and move commands and
paste as values to remove formulas
● Copy mkdir and move to Powershell and cmd,
respectively. Wait… … …
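The same mkdir-and-move work could also be scripted in Python instead of assembled in a spreadsheet. The sketch below is an alternative, not what was done here. It assumes filenames begin with a `yyyymmdd` date and belong in a `yyyy/mm/mmdd` folder, a layout inferred only from the `200106110167.tif` example in Problem Four, so treat the pattern as a guess to adjust.

```python
import re
import shutil
from pathlib import Path

# Assumed naming pattern: yyyymmdd followed by a running number.
NAME_PATTERN = re.compile(r"^(\d{4})(\d{2})(\d{2})\d+$")


def target_dir(filename):
    """Derive the relative folder (yyyy/mm/mmdd) from a filename."""
    match = NAME_PATTERN.match(Path(filename).stem)
    if match is None:
        return None
    year, month, day = match.groups()
    return Path(year) / month / (month + day)


def file_into_tree(src_dir, dest_root):
    """Move each recognizable file into its derived folder."""
    moved = []
    for path in sorted(Path(src_dir).iterdir()):
        sub = target_dir(path.name) if path.is_file() else None
        if sub is None:
            continue  # leave unrecognized names where they are
        dest = Path(dest_root) / sub
        dest.mkdir(parents=True, exist_ok=True)  # no dedupe step needed
        shutil.move(str(path), str(dest / path.name))
        moved.append(dest / path.name)
    return moved
```

Because `mkdir` is called with `exist_ok=True`, the "don't forget to dedupe" step for directory names drops out.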
MASS MANIPULATION WITH STYLESHEETS AND PYTHON
XSL transform engine
Example De-dupe transform
COMPARING DIRECTORIES USING PYTHON
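The original comparison script is not preserved in the transcript. A sketch of the approach: walk the image tree and the metadata tree, key each file by its relative path without the extension, and report what has no partner on the other side, with full filepaths. The suffix lists and the same-relative-path pairing rule are assumptions. The comparison is deliberately case-sensitive, since the pairing process itself is.

```python
from pathlib import Path


def stems_under(root, suffixes):
    """Map relative path without extension -> actual path."""
    root = Path(root)
    found = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            key = path.relative_to(root).with_suffix("").as_posix()
            found[key] = path
    return found


def compare_trees(image_root, metadata_root,
                  image_suffixes=(".tif", ".pdf"),
                  metadata_suffixes=(".xml",)):
    """Return (images without metadata, metadata without images)."""
    images = stems_under(image_root, image_suffixes)
    metadata = stems_under(metadata_root, metadata_suffixes)
    images_only = sorted(str(images[k]) for k in images.keys() - metadata.keys())
    metadata_only = sorted(str(metadata[k]) for k in metadata.keys() - images.keys())
    return images_only, metadata_only
```

The two lists map directly onto the cases described earlier: images with no metadata, and metadata with no images.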