a relational data safari: migrating transactional data to ...kafei.dev/migration-safari.pdf ·...

A Relational Data Safari: Migrating Transactional Data to DataStax/Cassandra

© DataStax, All Rights Reserved. Confidential

Ryan Weal

Kafei Interactive Inc.

https://kafei.dev

Who am I?

● 10+ years of content / data migrations

● Focused on Content Management Systems

● MySQL, Microsoft SQL Server, MongoDB, SQLite, Oracle... can get data out of anything


Our example project

● Travel planning application

● Some data feeds into some public pages

● Was originally architected inside a Content Management System (CMS)

● Lots of functionality was added over time

● 100s of database reads per request

● Very typical CMS experience...

● Let's change that!


Why we used Cassandra for this project

● Each interaction ("transaction") is essentially a fork of the original document

− Reference data is NEVER used again

● Redundancy!

− Much safer choice than MongoDB and others

● Flexible architecture

− Configurable partition keys

− JSON Structure – no accidental schema changes

● Fast

− Even if we don't optimize, it should still be OK

● Familiarity (using C* for something else)

PLANNING THE MIGRATION

Strategies & tools to get started


Tools for migrating to a CMS we will use to migrate from this CMS...

● Makes use of spreadsheets

● Something we have done 100s of times over

● Denormalizes the data based on entity types

● Attempts to make a 1:1 mapping

● ETL (extract, transform, load) type of process


Spreadsheets: where is the legacy data?

Create a tabbed document for each source->destination entity mapping.

Make sure to track progression of implementation with colors (column F) or columns (such as I-L).

Example IDs and the number of expected items help the developer.


Planning, for a Cassandra migration

● Entity Relationship Diagrams (Chen)

− Denormalization of the data

● Logical & Physical mappings

● Lots of examples of these elsewhere...

● ETL type of process again


Remember CMS relational entities?

Almost all of the data was set up to be relational, resulting in a cascade of database lookups.

From a design standpoint it is probably the way to do it with a CMS.

Many CMS projects will include some "custom functionality" and a lot of them have similar stories.

These diagrams help people to understand and have discussions about architecture.


New Entity Relationship Diagram

ONE QUERY!!!

Our most complex set of data is now all part of one object.

Only hitting the database once to get all the data we need.

Except for "files"...


User Workflows

This is the entire scope of the public CMS site.

Numbers represent the new C* entity types.

Even though we are already down from 29 types to 6, for the public we only need 5 (files is the 5th).


Logical & Physical diagrams

Reveals what performance issues you may have.

● Go back to the entity relationship diagram

● Make a big circle around all of the fields you need to satisfy a query in the user workflow

● Make a list of these fields, and put the partition key fields at the top of the list

● Add C* metadata to the list of fields (partition key, etc.)

● Do this for each of the destination entities

● Now you can calculate how large your partitions should be!
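
The last step can be sketched in Python. This uses the commonly taught estimate for the number of cells (values) in a partition: rows × (columns − primary key columns − static columns) + static columns. The function name and the example numbers are illustrative assumptions, not figures from this project.

```python
def estimate_partition_cells(rows: int, columns: int,
                             primary_key_columns: int,
                             static_columns: int = 0) -> int:
    """Estimate the number of cells stored in one partition.

    rows: expected rows per partition
    columns: total columns in the table
    primary_key_columns: partition key plus clustering columns
    static_columns: columns stored once per partition
    """
    regular_columns = columns - primary_key_columns - static_columns
    if regular_columns < 0:
        raise ValueError("key and static columns exceed total columns")
    return rows * regular_columns + static_columns


# Hypothetical table: 100 columns, 2 of them in the primary key,
# 1 static column, and about 50 rows per partition.
cells = estimate_partition_cells(rows=50, columns=100,
                                 primary_key_columns=2, static_columns=1)
print(cells)
```

Running the estimate for each destination entity gives an early warning when a partition key choice would produce very large partitions.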


Entity Review Process

Cassandra Data Modeling

● Create entity relationship diagram

● Determine user access flows

● Create logical + physical model

● Get acceptance, ensure it performs well

Migration Spreadsheets

● Create spreadsheet for existing data

● Map to the new fields

● Joins, joins, joins!

● Track progress, ensure complete


Migrations: keeping a list of migrations in code

● Migrate-minimal.sh

− Partial import

− Specific entity import (field-specific tests)

● Migrate-all.sh

− To test performance

− To do the launch

● Each entity type has a migration (all the joins)

− If there are multiple source entities, consider separate migrations

− Use the language you prefer, probably not a .sh!
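
A minimal sketch of such a list in Python, assuming one function per destination entity type. The entity names, the item counts, and the `limit` parameter are hypothetical placeholders; a real migration would run the joins against the legacy database.

```python
from typing import Callable, Dict, Iterable, Optional

# Registry of migrations, one per destination entity type.
MIGRATIONS: Dict[str, Callable[[Optional[int]], int]] = {}


def migration(entity_type: str):
    """Register a migration function under its destination entity type."""
    def register(func):
        MIGRATIONS[entity_type] = func
        return func
    return register


@migration("general_content")
def migrate_general_content(limit: Optional[int] = None) -> int:
    source_ids = list(range(1, 101))      # stand-in for the legacy rows
    selected = source_ids[:limit] if limit is not None else source_ids
    return len(selected)                  # number of items migrated


@migration("files")
def migrate_files(limit: Optional[int] = None) -> int:
    source_ids = list(range(1, 21))       # stand-in for the legacy rows
    selected = source_ids[:limit] if limit is not None else source_ids
    return len(selected)


def run(entity_types: Optional[Iterable[str]] = None,
        limit: Optional[int] = None) -> Dict[str, int]:
    """Run the chosen migrations (all of them by default)."""
    names = list(entity_types) if entity_types is not None else list(MIGRATIONS)
    return {name: MIGRATIONS[name](limit) for name in names}


# "Minimal" run: one entity type, a handful of items.
print(run(["general_content"], limit=5))
# "All" run: every entity type, every item.
print(run())
```

The same registry serves both entry points: the minimal run for field-specific tests and the full run for performance testing and launch.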


Migrations: writing a migration script

1. Extract: Pull together all the data;

2. Transform: Remove numbered entity references

− Field mappings

3. Load: Use the API we are building to do the insert

● No special functionality in the API for the migration

● Migrate as you go... when you add each type

● The migration (from old system) script should not be in the API codebase (delete later)

− Also mitigates memory issues which can be common working with unknown data
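
The three steps can be sketched in Python as follows. All table names, field names, and the `/general-content` path are hypothetical, and the `post` callable stands in for an HTTP client calling the new API; none of these come from the actual project.

```python
from typing import Any, Callable, Dict, List

# Hypothetical legacy rows: a node plus the lookup table it references by number.
LEGACY_NODES = [{"nid": 12, "title": "Lisbon weekend", "country_id": 3}]
LEGACY_COUNTRIES = {3: {"name": "Portugal", "path": "/countries/portugal"}}

# Field mappings from legacy column names to new field names (assumed names).
FIELD_MAP = {"title": "title"}


def extract() -> List[Dict[str, Any]]:
    """Pull together all the data for one entity type (the joins)."""
    rows = []
    for node in LEGACY_NODES:
        row = dict(node)
        row["country"] = LEGACY_COUNTRIES.get(node["country_id"])
        rows.append(row)
    return rows


def transform(row: Dict[str, Any]) -> Dict[str, Any]:
    """Apply field mappings and replace numbered references with embedded data."""
    entity = {new: row[old] for old, new in FIELD_MAP.items()}
    country = row.get("country")
    if country is not None:
        entity["country"] = {"url": country["path"], "title": country["name"]}
    return entity


def load(entity: Dict[str, Any],
         post: Callable[[str, Dict[str, Any]], None]) -> None:
    """Insert through the same API call the application uses."""
    post("/general-content", entity)


def migrate(post: Callable[[str, Dict[str, Any]], None]) -> int:
    """Run extract, transform, load for every row and count the inserts."""
    count = 0
    for row in extract():
        load(transform(row), post)
        count += 1
    return count


# Stand-in for an HTTP client; a real script would POST to the API instead.
sent = []
migrate(lambda path, body: sent.append((path, body)))
print(sent)
```

Because the load step only talks to the API, the script can live outside the API codebase and be deleted after launch.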


Finally we have a project plan…

1. Spreadsheets to review data, potential joins, and track progress

2. Entity Relationship Diagrams to understand

3. Logical/Physical mappings to implement

4. Create base entities

5. Write a migration script

6. Resolve "relationships" in our app

7. Do something about locking

8. Deal with misunderstood cases

DENORMALIZED DATA: GATHERING 100+ FIELDS

Flatten all the data.


Re-usable base entities

● We re-used the same entity where use cases were similar

● Partition key was a bonus to keep the data light

● Data modeling helps with this step

● "General Content" has:

− Pages

− Blog posts

− Media Room

● All data is self-contained, easily versionable

● Fields are hidden if not used


SURPRISE! Keyspace definitions are relational!

● We used the DataStax (KillrVideo) example as the base for our connection kit

− Automatic retries, woot!

● Use promises to await the creation of tables

− UDTs are interrelated when being created!

● We don't start express (our API server) until all the types & tables have been created

− This allowed us to run API in a serverless lambda!

− Using now.sh, not AWS

− C* was in VMs not lambdas
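
The ordering problem can be sketched in Python (the talk's own stack used promises in Node.js; this only illustrates the dependency ordering). A UDT that embeds another UDT can only be created after the embedded type exists, so the types are sorted topologically before anything is created. The type names and their dependencies are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical UDTs mapped to the UDTs they embed.
UDT_DEPENDENCIES = {
    "reference": set(),
    "image": set(),
    "region": {"reference", "image"},
    "country": {"region", "reference"},
}


def creation_order(dependencies):
    """Return type names so that every UDT comes after the UDTs it embeds."""
    return list(TopologicalSorter(dependencies).static_order())


order = creation_order(UDT_DEPENDENCIES)
print(order)
```

With the order known, the bootstrap code can create each type in turn, then the tables, and only then start the API server.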

REFERENCE DATA? UDTS & URLS

Related content is simply embedded now


If you are referencing something, it has a URL

● We made a generic "reference" type

● It uses a common concept called a "URL"

● Some extra data add context:

− Link title

− Image

− Description
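
A minimal sketch of such a generic reference in Python. The field names follow the list above; the types, and treating the image as a URL string, are assumptions.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class Reference:
    """A generic pointer to related content, identified by its URL."""
    url: str
    title: Optional[str] = None
    image: Optional[str] = None        # assumed here to be an image URL
    description: Optional[str] = None


# Hypothetical example: a link to a country page, embedded in another entity.
ref = Reference(url="/countries/portugal", title="Portugal")
print(asdict(ref))
```

In Cassandra the same shape would naturally map to a user-defined type embedded in the referencing entity.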


Two-way references: how to deal

● We deal with it at write time

● Must compare to old copy of entity in order to see if we need to remove referenced content

− We are therefore in need of two queries at write time:

− Update entity

− Update related entities

● Keep records on both sides

− Use "application level" tombstones when submitting new data

− New "delete me" field

− New "delete me" field

● Submit the changes as a batch!
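
The write-time comparison can be sketched in Python as a diff between the old and new sets of referenced URLs. The shape of the write records and the `delete_me` flag name are assumptions made for illustration; in practice the resulting writes would be sent together as one batch.

```python
from typing import Dict, List, Set


def plan_reference_updates(entity_url: str,
                           old_refs: Set[str],
                           new_refs: Set[str]) -> List[Dict[str, object]]:
    """Build the writes needed to keep both sides of a reference in sync."""
    # First write: the entity itself, with its new list of references.
    writes: List[Dict[str, object]] = [
        {"target": entity_url, "set_references": sorted(new_refs)}
    ]
    # Newly referenced entities get a back-reference to this entity.
    for url in sorted(new_refs - old_refs):
        writes.append({"target": url, "add_backref": entity_url})
    # Entities no longer referenced get an application-level tombstone.
    for url in sorted(old_refs - new_refs):
        writes.append({"target": url, "backref": entity_url, "delete_me": True})
    return writes


# Hypothetical edit: the entity drops its link to /a and gains one to /c.
batch = plan_reference_updates("/trips/lisbon", {"/a", "/b"}, {"/b", "/c"})
for write in batch:
    print(write)
```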

THE LOCKLESS MONSTER

Improve performance by changing the way you think


Old system had locks... everywhere!

● Locks aren't really possible in Cassandra

● The old system had locks applied when someone clicked edit, even if no changes!

− Locks were never released!!!

− Huge performance issues were caused by this

● Many lock systems are quirky like this...


Locks? We don't need locks!

● Cassandra provided us with better options:

● We now track version numbers

− Writes are only accepted if the values match

− if version == X (lightweight transactions, LWTs)

● Prevents overwrites... but we can do better!

− Break the page into patches/chunks of ~10 fields (of 100) and do partial updates, comparing old and new versions

− Users can edit the same page and have no conflicts if different chunks are submitted, even if the version number has changed.

− Have to do a diff in this case
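
The chunked merge can be sketched in Python with in-memory dictionaries. The function names and sample fields are hypothetical; in Cassandra the final write would be a conditional update of the form `UPDATE ... SET ... WHERE ... IF version = ?`, so that the version check and the write happen atomically.

```python
from typing import Any, Dict, Tuple

Doc = Dict[str, Any]


def changed_fields(base: Doc, submitted: Doc) -> Doc:
    """Fields in the submitted chunk that differ from the copy the editor loaded."""
    return {k: v for k, v in submitted.items() if base.get(k) != v}


def apply_patch(stored: Doc, stored_version: int,
                base: Doc, base_version: int,
                submitted: Doc) -> Tuple[Doc, int]:
    """Merge one submitted chunk into the stored entity.

    base is the copy the editor loaded at base_version; stored is what the
    database holds now. Returns the merged document and its new version.
    """
    patch = changed_fields(base, submitted)
    if stored_version != base_version:
        # Someone else saved in between: only reject if they changed
        # the same fields this editor is trying to change.
        conflicts = [k for k in patch if stored.get(k) != base.get(k)]
        if conflicts:
            raise ValueError(f"conflicting fields: {sorted(conflicts)}")
    return {**stored, **patch}, stored_version + 1


# Both editors loaded version 1 of the same page.
base = {"title": "Lisbon", "summary": "Old", "price": 100}
# Editor A saves a new summary, producing version 2.
stored, version = apply_patch(base, 1, base, 1, {"summary": "New"})
# Editor B submits a different chunk based on version 1: no conflict.
stored, version = apply_patch(stored, version, base, 1, {"price": 120})
print(stored, version)
```

A third editor who also loaded version 1 and now submits a changed summary would be rejected, because that field no longer matches the copy they started from.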

A HIERARCHY OF DATA: TREES FOR EVERYONE

It may be easier to invert the relationship than to write code around the wrong design


Country, Region, Subregion. …easy, right?

● These values are in a natural hierarchy

− Country is on the public site, most of the time

− Regions get displayed on Country pages (ok)

− Subregion is not visible to the public (hmm)

● The editors, on the other hand, ONLY use subregion to input the data

− Really subregion was the most important

● Our access pattern was backwards!
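
Inverting the relationship can be sketched in Python: instead of walking down from country to subregion, key the data by subregion and embed its ancestors. The place names and the dictionary shapes are hypothetical.

```python
from typing import Dict, List

# Hypothetical top-down tree, as the public site sees it.
TREE = {"Portugal": {"Algarve": ["Faro", "Lagos"], "Norte": ["Porto"]}}


def invert(tree: Dict[str, Dict[str, List[str]]]) -> Dict[str, Dict[str, str]]:
    """Key the data by subregion, embedding its region and country."""
    by_subregion = {}
    for country, regions in tree.items():
        for region, subregions in regions.items():
            for subregion in subregions:
                by_subregion[subregion] = {"region": region, "country": country}
    return by_subregion


lookup = invert(TREE)
print(lookup["Faro"])
```

An editor working on a subregion then gets the region and country in a single lookup, with no loops over the tree.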


Being careful with trees

● UDTs mean lots of loops if poorly thought out

− Can be bad for front-end performance

− Each layer is another loop (or filter, etc.)

● UDTs are "relational" when creating the keyspace, which can complicate things depending on how you bootstrap your servers

IMAGES...

Where do we store them so we never lose them?


Images... in the database?!

● We had a really hard time being OK with this

● This is our 6th and final entity type!

● We know from CMS migrations that disks will lose anywhere from 1-10% of the files either due to user error or hardware failure...

● There is no way we want to be looking around for lost files, ever.

● We write the files to disk to save a bit of bandwidth during development

− Our serverless lambdas are limited to 50 MB, so chances are we will kill the process if we cache too much!


Challenges with Images

● These want to be relational so we use a URL

● Some parts of the site re-use images

− Use browser's cache by default

● Very tempting to over-optimize these

− One big image or smaller chunked versions?

− Multiple sizes / crop variations

● How should our API deal with it?

− Send as chunks?

− Stream the results?

− Etc...

● Optimize on write: only upload scaled/ready content. We need predictable sizes.

UNEXPECTED ACCESS PATTERNS

Microservices architecture introduced ways of getting the data that did not follow our models


Editor app uses different access pattern...

● We did workflows for the public, not for admins!

● For some of our data we just wanted to SELECT * to load data into the app

− This is SLOW... surprisingly!

− It does make sense: it may have to query many/all nodes rather than one

− Try to make queries correspond with partition key

− Preload that data when possible


Standalone entities vs. dynamic select lists...

● Some of our select lists on our forms depend on dynamic data. What if the user had a connection failure and didn't get the whole set?

● The select list will not be populated until another entity type has been loaded

− This is fine, we are storing value as text

− It would be possible for user to overwrite with null

● When user edits the entity the select list may not find that item if we have not (yet) migrated that type

− Need to deal with this case by adding the existing value if it is not already in the list

− Otherwise users could save and overwrite the data, which is not ideal!

● Write code to never lose data due to dependencies
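
A minimal Python sketch of the "add the existing value" rule. The function name and sample values are hypothetical; the point is only that the stored value always stays selectable, even when the list it came from has not loaded or has not been migrated yet.

```python
from typing import List, Optional


def options_with_current(options: List[str], current: Optional[str]) -> List[str]:
    """Return the select options, keeping the stored value selectable."""
    if current and current not in options:
        return [current] + options
    return list(options)


# The regions list has not loaded yet, but the entity already stores "Algarve".
print(options_with_current([], "Algarve"))
# The list has loaded and already contains the stored value.
print(options_with_current(["Algarve", "Norte"], "Algarve"))
```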


No more "overrides" functionality, now always override

● The old app had the ability to "override" some reference data in the destination entity

● It turns out it was actually cloning the data!

− Easy to migrate now! Same as C* architecture!

● The data was very buggy in the old system, we found outdated references and lots of junk data. Helped us find bugs in the new app.

● We eliminated an "override" checkbox as it was not adding value.

● Created a new accidental feature: the ability to override descriptions on a per-use basis. Very useful when a product is displayed in multiple places.

WE HAVE ARRIVED!

Just run that migration and we're done.


Some Final Thoughts

● Expect to iterate on the schema constantly

● Lots of different pieces to configure across the app, when coming from a CMS

− Plan for unexpected functionality

● The database itself was never a problem

● Try to make the hard decisions before launch

● Measure the amount of time it takes to run the migrations, your client will want to know
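
Timing the migrations can be as simple as the following Python sketch, assuming each migration is a callable. The migration names and the stand-in functions are hypothetical.

```python
import time
from typing import Callable, Dict


def timed_run(migrations: Dict[str, Callable[[], None]]) -> Dict[str, float]:
    """Run each migration and record how long it took, in seconds."""
    durations = {}
    for name, func in migrations.items():
        start = time.perf_counter()
        func()
        durations[name] = time.perf_counter() - start
    return durations


# Stand-ins for real migrations.
report = timed_run({
    "general_content": lambda: time.sleep(0.01),
    "files": lambda: None,
})
for name, seconds in report.items():
    print(f"{name}: {seconds:.3f}s")
```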

Ryan Weal

Kafei Interactive Inc.

https://kafei.dev


THANK YOU
