Serverless Realtime Backup and Restore of DynamoDB with AWS Lambda
Ian Meyers, Principal Solution Architect
Amazon Web Services EMEA
Not your regular Serverless Talk
!Slackbots
!Pizza
!Alexa
!Connected Home
!Driverless Cars
!Smart Cities
Computational Primitives
Logic: AWS Lambda
State: Amazon DynamoDB, Amazon S3
Message Passing: Amazon API Gateway, Amazon Kinesis, Amazon SNS
Serverless Compute
AWS Lambda
Fully Managed Event Processor (Node.js, Python, or Java)
Natively Compile & Install Any Type Of Dependency
Specify Runtime RAM & Timeout
Automatically Scaled to support Event Volume
Integrated CloudWatch Logging
REST Interface with API Gateway
Compute Storage
AWS Global Infrastructure
Database
App Services
Deployment & Administration
Networking
Analytics
Amazon Kinesis
Managed Service for Real Time Big Data Processing
Create Streams to Produce & Consume Data
Elastically Add and Remove Shards for Performance
Use Kinesis Worker Library to Process Data
Integration with S3, Redshift and DynamoDB
Serverless Messaging
• Zero administration: Capture and deliver streaming data into S3, Redshift, and other
destinations without writing an application or managing infrastructure.
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for
delivery into data destinations in as little as 60 secs using simple configurations.
• Seamless elasticity: Seamlessly scales to match data throughput without intervention
Capture and submit streaming data to Firehose
Firehose loads streaming data continuously into S3 and Redshift
Analyze streaming data using your favorite BI tools
Serverless Stream Archive
Serverless Database
DynamoDB
Provisioned throughput NoSQL database
Fast, predictable, configurable performance
Fully distributed, fault tolerant HA architecture
Update Streams provide DB Notifications
Integration with EMR & Hive
DynamoDB Internals - Partitions
[Diagram: a DynamoDB table whose hash range 0000 to NNNN is split across Partition 1, Partition 2 ... Partition N]
Each partition: 1000 Write IOPS* and 3000 Read IOPS**
* 1KB Write ** 4KB Read
Partitions are three-way replicated
Example items (Id = 1, Name = Jim; Id = 2, Name = Andy, Dept = Engg; Id = 3, Name = Kim, Dept = Ops) are each stored in Facility 1, Facility 2 and Facility 3
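The per-partition figures above lend themselves to a quick back-of-the-envelope calculation. The following is a minimal sketch, assuming the slide's ceilings of 1000 write IOPS (1 KB units) and 3000 read IOPS (4 KB units) per partition. The function names are hypothetical, and the estimate deliberately ignores storage size, which can also drive partitioning.

```python
import math

WRITE_UNIT_BYTES = 1024        # one write unit covers up to 1 KB (from the slide)
READ_UNIT_BYTES = 4 * 1024     # one read unit covers up to 4 KB (from the slide)
PARTITION_WRITE_IOPS = 1000    # per-partition write ceiling (from the slide)
PARTITION_READ_IOPS = 3000     # per-partition read ceiling (from the slide)


def write_units(item_bytes):
    """Write units consumed by one write of an item of this size."""
    return max(1, math.ceil(item_bytes / WRITE_UNIT_BYTES))


def read_units(item_bytes):
    """Read units consumed by one read of an item of this size."""
    return max(1, math.ceil(item_bytes / READ_UNIT_BYTES))


def min_partitions_for_throughput(writes_per_sec, reads_per_sec, item_bytes):
    """Rough lower bound on partitions needed to serve the given request rates.

    Assumption: throughput only; table size is not taken into account.
    """
    write_load = writes_per_sec * write_units(item_bytes) / PARTITION_WRITE_IOPS
    read_load = reads_per_sec * read_units(item_bytes) / PARTITION_READ_IOPS
    return max(1, math.ceil(write_load + read_load))
```

For example, 2000 writes/sec and 3000 reads/sec of 2 KB items would need at least five partitions under these assumptions, since each write costs two write units.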
DynamoDB Internals - Streams
[Diagram: a DynamoDB table with two hash ranges, each feeding its own Update Stream]
Hash Range 0000 to NNNN: Update Stream, 2MB/sec
Hash Range NNNN to MMMM: Update Stream, 2MB/sec
Stream events: INSERT, UPDATE, DELETE
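To make the stream contents concrete, here is a minimal sketch of a function that walks the records a stream delivers to AWS Lambda and summarizes each change. Note that the DynamoDB Streams API labels the three change types INSERT, MODIFY and REMOVE in the `eventName` field. The function name `summarize_changes` and the sample event are hypothetical.

```python
def summarize_changes(event):
    """Return one small summary dict per change record in a stream event."""
    changes = []
    for record in event.get("Records", []):
        data = record["dynamodb"]
        changes.append({
            "event": record["eventName"],          # INSERT, MODIFY or REMOVE
            "keys": data["Keys"],                  # primary key of the item
            "sequence": data["SequenceNumber"],    # ordering within the shard
            "has_new_image": "NewImage" in data,   # absent for REMOVE
        })
    return changes


# Hypothetical event with a single insert, shaped like a stream record.
sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "Keys": {"Id": {"N": "1"}},
                "NewImage": {"Id": {"N": "1"}, "Name": {"S": "Jim"}},
                "SequenceNumber": "111",
            },
        }
    ]
}
```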
What about backups?
DynamoDB is multi-AZ durable, always…
Why do I need backups?
Human Error.
Application Error.
Serverless Full Backups of DynamoDB
Input Datanode: This could be an S3 bucket, RDS table, EMR Hive table, etc.
Activity: This is a data aggregation, manipulation, or copy that runs on a user-configured schedule.
Output Datanode: This supports all the same data sources as the input datanode, but they don’t have to be the same type.
Serverless Orchestration
Data Pipeline
Automatically Provision EC2 & EMR Resources
Manage Dependencies & Scheduling
Automatically Retry and Notify of Success & Failure
Elastic MapReduce
Managed, elastic Yarn (1.x & 2.x) cluster
Integrates with S3, DynamoDB and Redshift
Install Spark, Presto, Impala, Hive, Pig & End User Tools Automatically
Integrated HBase NoSQL Database
Support for Spot Instances
Support for Transparent HDFS Encryption
Big Data Analytics
2 Important Concepts
RTO (Recovery Time Objective): How long will it take to get data back?
RPO (Recovery Point Objective): When I do restore, how much data can I lose?
Serverless Full Backups of DynamoDB
RPO: How often do you do a full backup?
RTO: How long does an import take?
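The two questions above can be turned into rough arithmetic. This is a sketch under stated assumptions, not a sizing tool: it assumes the worst-case data loss with full backups alone equals the interval between backups, and that an import is limited only by the table's provisioned write capacity at 1 KB per write unit. The function names are hypothetical.

```python
import math


def worst_case_rpo_hours(backup_interval_hours):
    """With periodic full backups only, everything written since the last
    backup can be lost, so the worst-case RPO equals the backup interval."""
    return backup_interval_hours


def import_seconds(total_items, item_bytes, write_capacity_units):
    """Rough import time, assuming writes are the only bottleneck.

    Each item costs one write unit per started 1 KB.
    """
    units_per_item = max(1, math.ceil(item_bytes / 1024))
    return total_items * units_per_item / write_capacity_units
```

For instance, re-importing 3.6 million 1 KB items into a table provisioned with 1000 write units would take about an hour under these assumptions.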
Serverless Stream Replication
http://bit.ly/2eNimjv
LambdaStreamsToFirehose
meh, that’s easy
• Supports AWS Lambda and DynamoDB Update Streams
• User Defined transformers
• Kinesis Producer Library Deaggregation from Protocol Buffers
• Deterministic ordering to destination
• User defined routing rules (coming soon)
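The core idea of forwarding stream records to Firehose can be sketched in a few lines. This is not the LambdaStreamsToFirehose code, just an illustrative approximation. It assumes each change is written as one JSON line containing the record's `dynamodb` section plus its `eventName`, and that the `firehose` argument is a client exposing `put_record_batch` (for example `boto3.client("firehose")`). The names `to_backup_line` and `forward` are hypothetical.

```python
import json

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_backup_line(record):
    """Serialize one stream record as a newline-terminated JSON document."""
    body = dict(record["dynamodb"])
    body["eventName"] = record["eventName"]
    return (json.dumps(body, sort_keys=True) + "\n").encode("utf-8")


def forward(event, firehose, delivery_stream):
    """Send every record in the event to the delivery stream, in order.

    Raises on any rejected record so the caller can retry the whole event
    rather than silently dropping changes.
    """
    entries = [{"Data": to_backup_line(r)} for r in event.get("Records", [])]
    sent = 0
    for start in range(0, len(entries), MAX_BATCH):
        batch = entries[start:start + MAX_BATCH]
        response = firehose.put_record_batch(
            DeliveryStreamName=delivery_stream, Records=batch
        )
        failed = response.get("FailedPutCount", 0)
        if failed:
            raise RuntimeError("Firehose rejected %d records" % failed)
        sent += len(batch)
    return sent
```

Passing the client in as an argument keeps the function easy to exercise with a stub, and sending batches sequentially keeps the records in the order the stream delivered them.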
Serverless Continuous Backup
so we’re done…?
we need to ensure that we can’t do the wrong thing...
DynamoDB Update Stream
Kinesis Firehose Delivery Stream
AWS Lambda Streams to Firehose
DDB⇒Lambda Event Source
all mandatory configurations
Policies, or technology?
Serverless Account Audit Trails
Backup
Provisioning
Architecture
http://bit.ly/2dWqNMO
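One way to enforce the mandatory configurations with technology rather than policy is to react to the account audit trail whenever a table is created. The sketch below only plans the work and calls no AWS APIs. It assumes the audit event arrives with the API call name under `detail.eventName` and the table name under `detail.requestParameters.tableName`. The function name and step descriptions are hypothetical.

```python
# The four mandatory pieces listed on the slide, in provisioning order.
MANDATORY_STEPS = (
    "enable DynamoDB Update Stream",
    "create Kinesis Firehose Delivery Stream",
    "deploy AWS Lambda Streams to Firehose function",
    "map DynamoDB stream to Lambda as event source",
)


def plan_backup_provisioning(event):
    """Return a provisioning plan for a newly created table, or None.

    Assumption: `event` is an audit-trail notification describing one
    API call; anything other than CreateTable is ignored.
    """
    detail = event.get("detail", {})
    if detail.get("eventName") != "CreateTable":
        return None
    table = (detail.get("requestParameters") or {}).get("tableName")
    if not table:
        return None
    return {"table": table, "steps": list(MANDATORY_STEPS)}
```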
OK – backup covered. Now we’re done…?
Components of a Backup System
Periodic full backups
Incremental change capture
Ability to restore data
Restore
{"Keys":{"MyHashKey":{"S":"abc"}},"NewImage":{"123":{"S":"asdfasdf
{"Keys":{"MyHashKey":{"S":"abc"}},"NewImage":{"123":{"S":"asdfasq223qdf"},"
Restore
add jar s3://mybucket/prefix/json-serde-1.3.8-SNAPSHOT-jar-with-dependencies.jar;
create external table MyTable_<YYYY><MM><DD>_<HH> (
  Keys map<string,map<string,string>>, NewImage map<string,map<string,string>>, OldImage map<string,map<string,string>>,
  SequenceNumber string, SizeBytes bigint, eventName string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
location 's3://backup-bucket/backup-prefix/MyTable/<YYYY>/<MM>/<DD>/<HH>';
select OldImage['attribute1']['s'], NewImage['attribute1']['s'], SequenceNumber, SizeBytes, EventName
from MyTable_<YYYY><MM><DD>_<HH>
where Keys['MyHashKey']['s'] = <some hash key value of the item>
order by SequenceNumber desc;
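The same lookup can be sketched outside Hive for a single item. This minimal example assumes the backup files contain one JSON document per line with `Keys`, `NewImage` and `SequenceNumber` fields, as in the samples above, and that the hash key is a string attribute. The function name `latest_image` and the sample data are hypothetical.

```python
import json


def latest_image(lines, hash_key_name, hash_key_value):
    """Return the most recent NewImage for one item, or None.

    None means either the item never appears in the backup lines, or its
    latest change carries no NewImage (for example, a delete).
    """
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        change = json.loads(line)
        key = change["Keys"].get(hash_key_name, {})
        if key.get("S") == hash_key_value:
            matches.append(change)
    if not matches:
        return None
    # Sequence numbers are numeric strings, so compare them as integers.
    newest = max(matches, key=lambda c: int(c["SequenceNumber"]))
    return newest.get("NewImage")


# Hypothetical backup lines for one item that was written twice.
sample_lines = [
    '{"Keys":{"MyHashKey":{"S":"abc"}},"NewImage":{"attribute1":{"S":"old"}},"SequenceNumber":"9"}',
    '{"Keys":{"MyHashKey":{"S":"abc"}},"NewImage":{"attribute1":{"S":"new"}},"SequenceNumber":"10"}',
    '{"Keys":{"MyHashKey":{"S":"xyz"}},"NewImage":{"attribute1":{"S":"other"}},"SequenceNumber":"11"}',
]
```

Comparing sequence numbers as integers avoids the pitfall of string ordering, where "9" would sort after "10".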
Restore
Now comes the hard part…
Serverless isn't just Lambda…
...it’s also...
…streaming replication
…streaming data archival
…audit logging and stream production
…long term, high durability storage
…orchestration of full backups
Serverless Conf London 2016