Slinging Data: Data Loading and Cleanup in Evergreen

Growing Evergreen Conference, 22 April 2010

Upload: galen-charlton


DESCRIPTION

Presentation for the 2010 Evergreen Conference on migrating data to the Evergreen open source ILS.

TRANSCRIPT

Page 1: Slinging Data: Data Loading and Cleanup in Evergreen

Slinging Data: Data Loading and Cleanup in Evergreen

Growing Evergreen Conference

22 April 2010

Page 2: Slinging Data: Data Loading and Cleanup in Evergreen

To migrate data …

Extract from the old, map and load into the new, clean up along the way, and keep the auditor happy.

Page 3: Slinging Data: Data Loading and Cleanup in Evergreen

Whence

Extract data in a convenient form:

• Sometimes that means whatever you can get

• But better is

• MARC

• Flat text

• XML

Page 4: Slinging Data: Data Loading and Cleanup in Evergreen

All over the map

• Map entities

• Map fields

• Map values

• Map policies

Page 5: Slinging Data: Data Loading and Cleanup in Evergreen

All over the map

• Entities

• What is an item?

• What is a patron?

• Fields

• Where does the patron PIN come from?

Page 6: Slinging Data: Data Loading and Cleanup in Evergreen

All over the map

• Values

• Legacy item types

• 0

• 1

• 45

• 123

• 234

Quick: which is the one for journal loan?

Page 7: Slinging Data: Data Loading and Cleanup in Evergreen

All over the map

Legacy Item Type    Circ Modifier
0                   Regular
1                   Media
45                  AV
123                 Reference
234                 Reference
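The mapping above can be applied mechanically once it is written down. A minimal Python sketch, assuming legacy item types arrive as strings or integers: the dictionary values come from the slide, while `map_item_type` and the sample input are hypothetical. Unknown codes are collected for review rather than guessed at.

```python
# Legacy item type -> circ modifier, copied from the mapping on the slide.
CIRC_MODIFIER_MAP = {
    "0": "Regular",
    "1": "Media",
    "45": "AV",
    "123": "Reference",
    "234": "Reference",
}


def map_item_type(legacy_type, unmapped):
    """Return the circ modifier for a legacy item type, or None if unknown.

    Unknown codes are recorded in `unmapped` (a set) for manual review.
    """
    key = str(legacy_type).strip()
    modifier = CIRC_MODIFIER_MAP.get(key)
    if modifier is None:
        unmapped.add(key)
    return modifier


unmapped_codes = set()
sample = ["0", " 45", 234, "999"]  # hypothetical legacy values
mapped = [map_item_type(code, unmapped_codes) for code in sample]
```

The contents of `unmapped_codes` after a full pass are one of the "things you'll have to fix manually", so they are worth counting.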

Page 8: Slinging Data: Data Loading and Cleanup in Evergreen

Cleaning up

What?

• Bad data

• Ancient data

• Data it is too expensive to deal with later

When?

• Extract

• Load

• Post-load

Page 9: Slinging Data: Data Loading and Cleanup in Evergreen

Don’t box me in!

• The case of the dreaded double-encoding

• The even more dreadful case of the duplicitous and multiplicitous character encoding
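One common form of double-encoding is UTF-8 bytes that were misread as Latin-1 and then encoded as UTF-8 a second time. A minimal Python sketch of undoing that, assuming Latin-1 was the intermediate encoding (in practice it may be Windows-1252 or something else); `fix_double_encoding` is a hypothetical helper. Strings that do not round-trip are returned untouched, which lets it be applied field by field to data with mixed encodings.

```python
def fix_double_encoding(text):
    """Undo one round of UTF-8-read-as-Latin-1 double-encoding.

    Assumption: the intermediate misreading was Latin-1. If the text does
    not round-trip cleanly, it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either a character is outside Latin-1, or the resulting bytes
        # are not valid UTF-8: treat the text as not double-encoded.
        return text


damaged = "Caf\u00c3\u00a9"  # how a double-encoded "Café" displays
repaired = fix_double_encoding(damaged)
```

The check is a heuristic: a string that legitimately contains such a character sequence would also be rewritten, so it is worth counting and spot-checking the changed values.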

Page 10: Slinging Data: Data Loading and Cleanup in Evergreen

Yes, those fixed fields really matter

The purpose of every modern ILS and discovery layer …

Page 11: Slinging Data: Data Loading and Cleanup in Evergreen

Yes, those fixed fields really matter

… is to point out every fixed field coding error in a form convenient for catalogers to identify and

fix.

Page 12: Slinging Data: Data Loading and Cleanup in Evergreen

Fixed fields

Page 13: Slinging Data: Data Loading and Cleanup in Evergreen

Oops!

create or replace function m_foo.set_leader (TEXT, INT, TEXT) RETURNS TEXT AS $$
    # Set one character of the MARC leader in a MARCXML record.
    my ($marcxml, $pos, $value) = @_;

    use MARC::Record;
    use MARC::File::XML;

    # If anything fails, fall back to returning the input unchanged.
    my $xml = $marcxml;
    eval {
        my $marc = MARC::Record->new_from_xml($marcxml, 'UTF-8');
        my $leader = $marc->leader();
        substr($leader, $pos, 1) = $value;
        $marc->leader($leader);
        $xml = $marc->as_xml_record;
        # Strip the XML declaration and collapse whitespace between tags.
        $xml =~ s/^<\?.+?\?>$//mo;
        $xml =~ s/\n//sgo;
        $xml =~ s/>\s+</></sgo;
    };
    return $xml;
$$ LANGUAGE PLPERLU STABLE;
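The heart of the function above is a one-character substitution in the leader. To try that logic outside the database, here is a minimal Python sketch that works on the bare leader string. `set_leader_position` is a hypothetical helper, not part of Evergreen, and unlike the PL/Perl version it raises an error on bad input instead of returning the record unchanged.

```python
LEADER_LENGTH = 24  # a MARC 21 leader is always 24 characters


def set_leader_position(leader, pos, value):
    """Return a copy of `leader` with the character at `pos` (0-based) replaced."""
    if len(leader) != LEADER_LENGTH:
        raise ValueError("leader must be exactly 24 characters")
    if not 0 <= pos < LEADER_LENGTH:
        raise ValueError("position must be between 0 and 23")
    if len(value) != 1:
        raise ValueError("value must be a single character")
    return leader[:pos] + value + leader[pos + 1:]


# Example: change the type of record (leader position 6) from 'a' to 'g'.
original = "00000nam a2200000 a 4500"
updated = set_leader_position(original, 6, "g")
```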

Page 14: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

Postgres lets us create an elegant mechanism for staging data to be loaded into an Evergreen database:

• Table inheritance

• Sequences

Page 15: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

We want to be able to

• Load and manipulate the data

• … using every tool on our belt

• … while ensuring that it doesn’t show up in production until it’s ready (and we’re ready)

Page 16: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

• Make a separate schema

psql> create schema m_foo;

• Mirror a real table

create table m_foo.asset_copy …

Page 17: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

• Use the sequence

…id bigint not null default nextval('asset.copy_id_seq'::regclass)…

Page 18: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

• Make space for the legacy

create table m_foo.asset_copy_legacy (

l_call_number TEXT

) inherits (m_foo.asset_copy);

Page 19: Slinging Data: Data Loading and Cleanup in Evergreen

On stage

• Munge

• Munge

• Munge some more, then …

• Insert into production:

insert into asset.copy

select * from m_foo.asset_copy;

Page 20: Slinging Data: Data Loading and Cleanup in Evergreen

Counting

Who is the auditor?

It is you … and your patrons … and maybe even an actual auditor.

Page 21: Slinging Data: Data Loading and Cleanup in Evergreen

Counting

• Count what matters

• Number of records

• Number of dollars

• Number of things you’ll have to fix manually

• Don’t count what doesn’t matter

• Header rows

• Junk
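Counting only what matters can be as simple as skipping header rows and junk while tallying an extract. A minimal Python sketch, assuming a CSV extract with a known number of header rows and treating rows with no non-blank cells as junk. The function name, the sample data, and that definition of junk are all illustrative assumptions.

```python
import csv
import io


def count_data_rows(handle, header_rows=1):
    """Return (counted, skipped) for a CSV extract.

    Header rows and rows with no non-blank cells are skipped, not counted.
    """
    counted = 0
    skipped = 0
    for index, row in enumerate(csv.reader(handle)):
        if index < header_rows:
            skipped += 1  # header row
        elif not any(cell.strip() for cell in row):
            skipped += 1  # blank or whitespace-only row
        else:
            counted += 1
    return counted, skipped


# Hypothetical extract: one header row, two items, two junk rows.
sample = io.StringIO("barcode,title\n1001,Dune\n\n , \n1002,Emma\n")
counted, skipped = count_data_rows(sample)
```

Keeping the skipped tally alongside the count makes it possible to explain later why the file's line count and the record count differ.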

Page 22: Slinging Data: Data Loading and Cleanup in Evergreen

Counting

• Count early and often

• Conservation of library data is Newton’s 42nd law!
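"Conservation of library data" can be read as a bookkeeping identity: every record extracted should end up loaded, rejected, or deliberately excluded. A minimal Python sketch of that check, run after each stage. The function name and all the figures are made up for illustration.

```python
def unaccounted_records(extracted, loaded, rejected, excluded):
    """Return how many extracted records are not accounted for.

    Zero means the counts balance. A positive number means records went
    missing; a negative number means something was counted twice.
    """
    return extracted - (loaded + rejected + excluded)


# Hypothetical counts taken after a patron load.
balanced = unaccounted_records(extracted=1000, loaded=950, rejected=30, excluded=20)
missing = unaccounted_records(extracted=1000, loaded=950, rejected=30, excluded=10)
```

The same identity applies to dollars (fines and bills), not just to record counts.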

Page 23: Slinging Data: Data Loading and Cleanup in Evergreen

Tools

• The usual suspects

• MARC::Record (or pymarc, or ruby-marc, or …)

• MarcEdit

• yaz-marcdump

• Spreadsheets

Page 24: Slinging Data: Data Loading and Cleanup in Evergreen

And now something new

Page 25: Slinging Data: Data Loading and Cleanup in Evergreen

Equinox Migration Tools

What?

MARC processing

Non-MARC processing

And more …

Where?

git://git.esilibrary.com/git/migration-tools.git

Page 26: Slinging Data: Data Loading and Cleanup in Evergreen

Thanks!

Galen Charlton

VP for Data Services, Equinox Software Inc.

[email protected]