A Relational Data Safari: Migrating Transactional Data to DataStax/Cassandra
Confidential © DataStax, All Rights Reserved.
Who am I?
Ryan Weal
Kafei Interactive Inc.
https://kafei.dev
● 10+ years of content / data migrations
● Focused on Content Management Systems
● MySQL, Microsoft SQL Server, MongoDB, SQLite, Oracle... can get data out of anything
Our example project
● Travel planning application
● Some data feeds into some public pages
● Was originally architected inside a Content Management System (CMS)
● Lots of functionality was added over time
● 100s of database reads per request
● Very typical CMS experience...
● Let's change that!
Why we used Cassandra for this project
● Each interaction ("transaction") is essentially a fork of the original document
− Reference data is NEVER used again
● Redundancy!
− Much safer choice than MongoDB and others
● Flexible architecture
− Configurable partition keys
− JSON structure: no accidental schema changes
● Fast
− Even if we don't optimize it should still be OK
● Familiarity (using C* for something else)
PLANNING THE MIGRATION
Strategies & tools to get started
Tools for migrating to a CMS we will use to migrate from this CMS...
● Makes use of spreadsheets
● Something we have done 100s of times over
● Denormalizes the data based on entity types
● Attempts to make a 1:1 mapping
● ETL (extract, transform, load) type of process
Spreadsheets: where is the legacy data?
Create a document with one tab for each source->destination entity mapping.
Make sure to track implementation progress with colors (column F) or columns (such as I-L).
Example IDs and the number of expected items help the developer.
Planning for a Cassandra migration
● Entity Relationship Diagrams (Chen)
− Denormalization of the data
● Logical & Physical mappings
● Lots of examples of these elsewhere...
● ETL type of process again
Remember CMS relational entities?
Almost all of the data was set up to be relational, resulting in a cascade of database lookups.
From a design standpoint it is probably the way to do it with a CMS.
Many CMS projects will include some "custom functionality" and a lot of them have similar stories.
These diagrams help people to understand and have discussions about architecture.
New Entity Relationship Diagram
ONE QUERY!!!
Our most complex set of data is now all part of one object.
Only hitting the database once to get all the data we need.
Except for "files"...
User Workflows
This is the entire scope of the public CMS site.
Numbers represent the new C* entity types.
Even though we are already down from 29 types to 6, for the public we only need 5 (files is the 5th).
Logical & Physical diagrams
Reveals what performance issues you may have.
● Go back to the entity relationship diagram
● Make a big circle around all of the fields you need to satisfy a query in the user workflow
● Make a list of these fields, and put the partition key fields at the top of the list
● Add C* metadata to the list of fields (partition key, etc.)
● Do this for each of the destination entities
● Now you can calculate how large your partitions should be!
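As a rough sketch of that last step: a formula commonly taught in Cassandra data modeling estimates the number of values stored in one partition as rows × (columns − primary key columns − static columns) + static columns. The function below applies it. The table shape in the example (100 columns, 2 key columns, 50 rows per partition) is an assumption for illustration, not a figure from this project.

```python
def partition_values(rows: int, columns: int,
                     primary_key_columns: int, static_columns: int = 0) -> int:
    """Estimate how many values (cells) one partition holds.

    rows: expected rows per partition
    columns: total number of columns in the table
    primary_key_columns: partition key + clustering columns
    static_columns: columns stored once per partition
    """
    regular = columns - primary_key_columns - static_columns
    if regular < 0:
        raise ValueError("key and static columns exceed total columns")
    return rows * regular + static_columns


# Hypothetical entity: 100 fields, 2 in the primary key, 50 rows per partition.
estimate = partition_values(rows=50, columns=100, primary_key_columns=2)
```

Running the same calculation for each destination entity gives a quick way to spot a partition that will grow too large before any data is migrated.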
Entity Review Process
Cassandra Data Modeling
● Create entity relationship diagram
● Determine user access flows
● Create logical + physical model
● Get acceptance, ensure performant
Migration Spreadsheets
● Create spreadsheet for existing data
● Map to the new fields
● Joins, joins, joins!
● Track progress, ensure complete
Migrations: keeping a list of migrations in code
● Migrate-minimal.sh
− Partial import
− Specific entity import (field-specific tests)
● Migrate-all.sh
− To test performance
− To do the launch
● Each entity type has a migration (all the joins)
− If there are multiple source entities, consider separate migrations
− Use the language you prefer, probably not a .sh!
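One possible way to keep that list in code, sketched in Python rather than shell. The entity type names, the item counts, and the `limit` parameter are all hypothetical. The point is only that "minimal" and "all" can be two calls into the same registry.

```python
from typing import Callable, Dict, List, Optional

# Registry of migrations, keyed by destination entity type.
MIGRATIONS: Dict[str, Callable[[Optional[int]], int]] = {}


def migration(entity_type: str):
    """Register a function as the migration for one destination entity type."""
    def register(func):
        MIGRATIONS[entity_type] = func
        return func
    return register


@migration("general_content")
def migrate_general_content(limit: Optional[int] = None) -> int:
    source_ids = list(range(1, 501))   # stand-in for the real source query
    return len(source_ids[:limit])     # limit=None keeps every item


@migration("product")
def migrate_product(limit: Optional[int] = None) -> int:
    source_ids = list(range(1, 121))   # stand-in for the real source query
    return len(source_ids[:limit])


def run(entity_types: Optional[List[str]] = None,
        limit: Optional[int] = None) -> Dict[str, int]:
    """Run the named migrations (or all of them) and report item counts."""
    names = entity_types or list(MIGRATIONS)
    return {name: MIGRATIONS[name](limit) for name in names}


# Equivalent of migrate-minimal: one entity type, a handful of items.
minimal = run(["general_content"], limit=10)
# Equivalent of migrate-all: every registered migration, no limit.
full = run()
```

The returned counts can be compared against the "number of expected items" column in the mapping spreadsheet.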
Migrations: writing a migration script
1. Extract: Pull together all the data
2. Transform: Remove numbered entity references
− Field mappings
3. Load: Use the API we are building to do the insert
● No special functionality in the API for the migration
● Migrate as you go... when you add each type
● The migration script (from the old system) should not be in the API codebase (delete later)
− Also mitigates memory issues, which can be common when working with unknown data
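The three steps can be sketched as follows. This is a minimal illustration, not the script from the project. It uses an in-memory SQLite database as a stand-in for the legacy CMS, the table and field names are invented, and `api_post` is a hypothetical callable standing in for a request to the new API.

```python
import sqlite3
from typing import Callable, Dict, Iterator


def extract(db: sqlite3.Connection) -> Iterator[sqlite3.Row]:
    """Extract: one joined row per source item, streamed instead of loaded at once."""
    db.row_factory = sqlite3.Row
    cursor = db.execute(
        "SELECT n.id, n.title, c.name AS country_name "
        "FROM node n JOIN country c ON c.id = n.country_id "
        "ORDER BY n.id"
    )
    for row in cursor:
        yield row


def transform(row: sqlite3.Row) -> Dict[str, str]:
    """Transform: map fields and replace the numeric reference with its value."""
    return {"title": row["title"], "country": row["country_name"]}


def load(entity: Dict[str, str], api_post: Callable[[str, dict], None]) -> None:
    """Load: insert through the same API the application uses."""
    api_post("/content", entity)


# Made-up legacy data for the demonstration.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE node (id INTEGER PRIMARY KEY, title TEXT, country_id INTEGER);
    INSERT INTO country VALUES (1, 'Kenya'), (2, 'Peru');
    INSERT INTO node VALUES (10, 'Safari week', 1), (11, 'Inca trail', 2);
""")

posted = []
for source_row in extract(db):
    load(transform(source_row), lambda path, body: posted.append((path, body)))
```

Because `extract` is a generator, only one source row is held in memory at a time, which is one way to address the memory concern mentioned above.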
Finally we have a project plan…
1. Spreadsheets to review data, potential joins, and track progress
2. Entity Relationship Diagrams to understand
3. Logical/Physical mappings to implement
4. Create base entities
5. Write a migration script
6. Resolve "relationships" in our app
7. Do something about locking
8. Deal with misunderstood cases
DENORMALIZED DATA: GATHERING 100+ FIELDS
Flatten all the data.
Re-usable base entities
● We re-used the same entity where use cases were similar
● Partition key was a bonus to keep the data light
● Data modeling helps with this step
● "General Content" has:
− Pages
− Blog posts
− Media Room
● All data is self-contained, easily versionable
● Fields are hidden if not used
SURPRISE! Keyspace definitions are relational!
● We used the DataStax KillrVideo example as the base for our connection kit
− Automatic retries, woot!
● Use promises to await the creation of tables
− UDTs are interrelated when being created!
● We don't start Express (our API server) until all the types & tables have been created
− This allowed us to run the API in a serverless lambda!
− Using now.sh, not AWS
− C* was in VMs, not lambdas
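The ordering problem can be illustrated with a topological sort: a UDT that embeds another UDT can only be created after it. The sketch below is synchronous Python, not the promise-based code from the talk. The type names and the three callables passed to `bootstrap` are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Callable, Dict, List, Set

# Hypothetical UDTs, each mapped to the UDTs its fields embed.
UDT_DEPENDENCIES: Dict[str, Set[str]] = {
    "reference": set(),
    "image": set(),
    "region": {"reference"},
    "itinerary_day": {"reference", "image"},
}


def creation_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return type names so that every type comes after the types it uses."""
    return list(TopologicalSorter(dependencies).static_order())


def bootstrap(create_type: Callable[[str], None],
              create_tables: Callable[[], None],
              start_server: Callable[[], None]) -> None:
    """Create types in dependency order, then tables, then start the API."""
    for name in creation_order(UDT_DEPENDENCIES):
        create_type(name)
    create_tables()
    start_server()


order = creation_order(UDT_DEPENDENCIES)
```

The API server is only started as the final step, which mirrors the "don't start Express until everything exists" rule above.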
REFERENCE DATA? UDTS & URLS
Related content is simply embedded now
If you are referencing something, it has a URL
● We made a generic "reference" type
● It uses a common concept called a "URL"
● Some extra data add context:
− Link title
− Image
− Description
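A generic reference type along these lines might look like the sketch below. The field names and the CQL type name are assumptions based on the bullets above, not the project's actual schema.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional

# CQL for a hypothetical "reference" UDT; names are illustrative only.
CREATE_REFERENCE_TYPE = """
CREATE TYPE IF NOT EXISTS reference (
    url text,
    title text,
    image text,
    description text
)
"""


@dataclass(frozen=True)
class Reference:
    """An embedded pointer to other content, identified by its URL."""
    url: str
    title: Optional[str] = None
    image: Optional[str] = None
    description: Optional[str] = None


def embed(ref: Reference) -> Dict[str, str]:
    """Return the reference as a plain dict, dropping unset context fields."""
    return {key: value for key, value in asdict(ref).items() if value is not None}


example = embed(Reference(url="/countries/kenya", title="Kenya"))
```

Because the link title, image, and description travel with the URL, a page can render the reference without a second lookup.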
Two-way references: how to deal with them
● We deal with it at write time
● Must compare to the old copy of the entity to see if we need to remove referenced content
− We therefore need two queries at write time:
− Update entity
− Update related entities
● Keep records on both sides
− Use "application level" tombstones when submitting new data
− New "delete me" field
● Submit the changes as a batch!
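The write-time comparison can be sketched as a set difference between the old and new references. Everything here is illustrative: the dict layout of each write, the `delete_me` flag name, and the URLs are assumptions, and the resulting list stands in for whatever would actually be sent as one batch.

```python
from typing import Dict, List, Set, Tuple


def reference_changes(old_urls: Set[str], new_urls: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed) references between two copies of an entity."""
    return new_urls - old_urls, old_urls - new_urls


def build_batch(entity_url: str, old_urls: Set[str], new_urls: Set[str]) -> List[Dict]:
    """List the writes to submit together: the entity itself, plus one write
    for every related entity whose back-reference is added or removed."""
    added, removed = reference_changes(old_urls, new_urls)
    batch: List[Dict] = [{"target": entity_url, "set_references": sorted(new_urls)}]
    for url in sorted(added):
        batch.append({"target": url, "back_reference": entity_url, "delete_me": False})
    for url in sorted(removed):
        # Application-level tombstone: mark the old back-reference for removal.
        batch.append({"target": url, "back_reference": entity_url, "delete_me": True})
    return batch


batch = build_batch(
    "/trips/1",
    old_urls={"/hotels/a", "/hotels/b"},
    new_urls={"/hotels/b", "/hotels/c"},
)
```

Unchanged references (here `/hotels/b`) produce no extra write, so the batch only grows with what actually changed.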
THE LOCKLESS MONSTER
Improve performance by changing the way you think
Old system had locks... everywhere!
● Locks aren't really possible in Cassandra
● The old system had locks applied when someone clicked edit, even if no changes!
− Locks were never released!!!
− Huge performance issues were caused by this
● Many lock systems are quirky like this...
Locks? We don't need locks!
● Cassandra provided us with better options:
● We now track version numbers
− Writes are only accepted if the values match
− if version == X (lightweight transactions, LWTs)
● Prevents overwrites... but we can do better!
− Break the page into patches/chunks of ~10 fields (of 100), do partial updates, and compare old and new versions
− Users can edit the same page and have no conflicts if different chunks are submitted, even if the version number has changed
− Have to do a diff in this case
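The version check plus chunk-level diff can be modeled in memory as below. This is a sketch of the idea only. In Cassandra the final write would be a conditional update such as the CQL shown in the constant, where the table and column names are hypothetical.

```python
from typing import Dict

# Hypothetical conditional write (LWT); table and column names are assumptions.
CONDITIONAL_UPDATE = "UPDATE content SET title = ?, version = ? WHERE id = ? IF version = ?"


class Conflict(Exception):
    """Raised when a patch touches fields someone else changed in the meantime."""


def apply_patch(stored: Dict, base: Dict, patch: Dict[str, str]) -> Dict:
    """Apply the fields in `patch` to `stored` and return the new copy.

    `base` is the copy the editor originally loaded. The write is accepted
    when the versions match, or when none of the patched fields were changed
    by anyone else since `base` was loaded.
    """
    if stored["version"] != base["version"]:
        changed_since = {
            field for field, value in stored["fields"].items()
            if base["fields"].get(field) != value
        }
        overlap = changed_since & set(patch)
        if overlap:
            raise Conflict(f"fields changed by someone else: {sorted(overlap)}")
    merged = dict(stored["fields"], **patch)
    return {"version": stored["version"] + 1, "fields": merged}


base = {"version": 1, "fields": {"title": "Kenya", "summary": "Old", "price": "100"}}
after_alice = apply_patch(base, base, {"title": "Kenya safari"})
# Bob loaded version 1 too, but edits a different chunk, so there is no conflict.
after_bob = apply_patch(after_alice, base, {"price": "120"})
```

A third editor who also loaded version 1 and now submits a new `title` would get a `Conflict`, because that field changed after their copy was loaded.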
A HIERARCHY OF DATA: TREES FOR EVERYONE
It may be easier to invert the relationship than to write code around the wrong design
Country, Region, Subregion. …easy, right?
● These values are in a natural hierarchy
− Country is on the public site, most of the time
− Regions get displayed on Country pages (ok)
− Subregion is not visible to the public (hmm)
● The editors, on the other hand, ONLY use subregion to input the data
− Really subregion was the most important
● Our access pattern was backwards!
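One reading of "inverting the relationship" is to store records at the level editors actually work with (subregion), embed the parents, and derive the public country view from them. The sketch below assumes that reading. The record layout and the sample place names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical records keyed the way editors enter data: by subregion,
# with the parent region and country embedded rather than referenced.
SUBREGIONS = [
    {"subregion": "Maasai Mara", "region": "Rift Valley", "country": "Kenya"},
    {"subregion": "Lake Nakuru", "region": "Rift Valley", "country": "Kenya"},
    {"subregion": "Diani Beach", "region": "Coast", "country": "Kenya"},
    {"subregion": "Sacred Valley", "region": "Cusco", "country": "Peru"},
]


def regions_by_country(subregions: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Derive the public 'regions on a country page' view from subregion records."""
    grouped = defaultdict(set)
    for record in subregions:
        grouped[record["country"]].add(record["region"])
    return {country: sorted(regions) for country, regions in grouped.items()}


country_pages = regions_by_country(SUBREGIONS)
```

Editors write one flat record, and the tree the public sees is computed from it instead of being walked level by level.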
Being careful with trees
● UDTs mean lots of loops if poorly thought out
− Can be bad for front-end performance
− Each layer is another loop (or filter, etc.)
● UDTs are "relational" when creating the keyspace, which can complicate things depending on how you bootstrap your servers
IMAGES...
Where do we store them so we never lose them?
Images... in the database?!
● We had a really hard time being OK with this
● This is our 6th and final entity type!
● We know from CMS migrations that disks will lose anywhere from 1-10% of the files, due to either user error or hardware failure...
● There is no way we want to be looking around for lost files, ever.
● We write the files to disk to save a bit of bandwidth during development
− Our serverless lambdas are limited to 50 MB, so chances are we will kill the process if we cache too much!
Challenges with Images
● These want to be relational, so we use a URL
● Some parts of the site re-use images
− Use browser's cache by default
● Very tempting to over-optimize these
− One big image or smaller chunked versions?
− Multiple sizes / crop variations
● How should our API deal with it?
− Send as chunks?
− Stream the results?
− Etc...
● Optimize on write: only upload scaled/ready content. We need predictable sizes.
UNEXPECTED ACCESS PATTERNS
Microservices architecture introduced ways of getting the data that did not follow our models
Editor app uses different access pattern...
● We did workflows for the public, not for admins!
● For some of our data we just wanted to SELECT * to load data into the app
− This is SLOW... surprisingly!
− It does make sense: it may have to query many/all nodes rather than one
− Try to make queries correspond with the partition key
− Preload that data when possible
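A toy model shows why the two kinds of read differ. This is not Cassandra's real partitioner and it ignores replication. It only illustrates that a partition key identifies one node to ask, while an unrestricted select has to visit every node. The class and key names are invented.

```python
import hashlib
from typing import Dict, List, Tuple


class ToyCluster:
    """A toy model of partition placement across nodes."""

    def __init__(self, node_count: int) -> None:
        self.nodes: List[Dict[str, dict]] = [{} for _ in range(node_count)]

    def _node_for(self, partition_key: str) -> int:
        # Hash the key to pick the single node that owns this partition.
        digest = hashlib.md5(partition_key.encode()).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def insert(self, partition_key: str, row: dict) -> None:
        self.nodes[self._node_for(partition_key)][partition_key] = row

    def get(self, partition_key: str) -> Tuple[dict, int]:
        """Read one partition; returns (row, number of nodes contacted)."""
        node = self.nodes[self._node_for(partition_key)]
        return node[partition_key], 1

    def select_all(self) -> Tuple[List[dict], int]:
        """Read everything; returns (rows, number of nodes contacted)."""
        rows = [row for node in self.nodes for row in node.values()]
        return rows, len(self.nodes)


cluster = ToyCluster(node_count=6)
for i in range(20):
    cluster.insert(f"trip-{i}", {"id": i})

row, nodes_for_key_read = cluster.get("trip-7")
all_rows, nodes_for_full_read = cluster.select_all()
```

In this model the keyed read contacts one node and the full read contacts all six, which is the shape of the slowdown described above.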
Standalone entities vs. dynamic select lists...
● Some of the select lists on our forms depend on dynamic data. What if the user had a connection failure and didn't get the whole set?
● The select list will not be populated until another entity type has been loaded
− This is fine, we are storing the value as text
− It would be possible for the user to overwrite it with null
● When the user edits the entity, the select list may not find that item if we have not (yet) migrated that type
− Need to deal with this case by adding the existing value if it is not already in the list
− Otherwise users could save and overwrite the data, which is not ideal!
● Write code to never lose data due to dependencies
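A minimal sketch of that safeguard, with hypothetical function names. `value_to_save` additionally assumes the field is never cleared on purpose, which may not hold for every form.

```python
from typing import List, Optional


def select_options(loaded: List[str], current: Optional[str]) -> List[str]:
    """Build select-list options that always include the stored value,
    even when the dynamic list is empty or only partly loaded."""
    options = list(loaded)
    if current is not None and current not in options:
        options.insert(0, current)
    return options


def value_to_save(submitted: Optional[str], current: Optional[str]) -> Optional[str]:
    """Keep the stored value when the form submits nothing for this field.
    Assumption: this field is never intentionally cleared."""
    return submitted if submitted is not None else current


# The regions list failed to load, but the stored value is still selectable.
options_when_offline = select_options([], "Rift Valley")
```

With the stored value always present in the list, saving the form cannot silently replace it with null.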
No more "overrides" functionality, now always override
● The old app had the ability to "override" some reference data in the destination entity
● It turns out it was actually cloning the data!
− Easy to migrate now! Same as C* architecture!
● The data was very buggy in the old system; we found outdated references and lots of junk data. Helped us find bugs in the new app.
● We eliminated an "override" checkbox as it was not adding value.
● Created a new accidental feature: the ability to override descriptions on a per-use basis. Very useful when a product is displayed in multiple places.
WE HAVE ARRIVED!
Just run that migration and we're done.
Some Final Thoughts
● Expect to constantly iterate on the schema
● Lots of different pieces to configure across the app when coming from a CMS
− Plan for unexpected functionality
● The database itself was never a problem
● Try to make the hard decisions before launch
● Measure the amount of time it takes to run the migrations; your client will want to know
Ryan Weal
Kafei Interactive Inc.
https://kafei.dev
THANK YOU