walb: real-time and incremental backup system for block devices

Post on 21-Jan-2018


WalB: A Fast and Low Latency Backup System for Block Devices

Cybozu Meetup #8 SRE WalB

Kota Uchida

September 25, 2017


About me

▌Kota Uchida

▌SRE team at Cybozu, Inc.

▌A WalB developer


About Cybozu

▌A large cloud service vendor in Japan.

▌Largest market share in the field of collaborative software.

▌We serve web applications on our own cloud platform.

kintone: a low-code business app platform

and more

Customer companies: 20,000+

Accesses per day: 210 million

Write IOs per day: 24.5 TiB

Service Level Objective

▌24/7 nonstop service

▌99.99% availability (4 min / month)

▌Daily backup (retention period is 14 days)

▌Disaster recovery: copy data to a remote site once a day
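The "4 min / month" figure follows from the availability target by simple arithmetic. The sketch below is only an illustration; the 30-day month and the function name are assumptions, not something stated in the talk.

```python
# Downtime budget implied by an availability target.
MINUTES_PER_DAY = 24 * 60

def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Return the allowed downtime in minutes over `days` days."""
    return (1.0 - availability) * days * MINUTES_PER_DAY

# Assumption: a month is counted as 30 days.
budget = downtime_budget_minutes(0.9999)
print(f"{budget:.2f} minutes per 30-day month")
```

With these assumptions the budget comes to a little over four minutes, which matches the rounded value on the slide.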

Architecture of our platform

[Figure: platform architecture. Labels: L7LB, Application Server, Database Server, Blob Server, Mapping Info, RAID 1, two Storage Servers (each with dm-snap), Backup Server, Remote Site, and Diff files. One region is marked "The scope of this talk".]

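Snapshot Management with dm-snap

[Figure: the logical structure shows the latest image and the snapshot image; the physical structure shows the original volume area and the snapshot area, with block addresses 0 to 4. When A' and B' are written over A and B: (1) CoW copies the old blocks A and B to the snapshot area, then (2) the new blocks A' and B' are written to the original volume area.]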
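The copy-on-write behavior described for dm-snap can be modeled in a few lines. This is a toy in-memory sketch, not dm-snap's implementation; the class name, the method names, and the choice of addresses 1 and 4 for blocks A and B are all assumptions made for illustration.

```python
class CowSnapshotVolume:
    """Toy model of an origin volume with one copy-on-write snapshot."""

    def __init__(self, blocks):
        self.origin = list(blocks)   # original volume area (latest image)
        self.snapshot_area = {}      # address -> preserved old block
        self.snapshot_active = False

    def take_snapshot(self):
        self.snapshot_area = {}
        self.snapshot_active = True

    def write(self, addr, data):
        # (1) CoW: preserve the old block the first time it is overwritten.
        if self.snapshot_active and addr not in self.snapshot_area:
            self.snapshot_area[addr] = self.origin[addr]
        # (2) Write the new data in place.
        self.origin[addr] = data

    def read_latest(self, addr):
        return self.origin[addr]

    def read_snapshot(self, addr):
        # Preserved blocks come from the snapshot area, the rest from the origin.
        if addr in self.snapshot_area:
            return self.snapshot_area[addr]
        return self.origin[addr]


# Hypothetical layout: A at address 1, B at address 4.
vol = CowSnapshotVolume([None, "A", None, None, "B"])
vol.take_snapshot()
vol.write(1, "A'")
vol.write(4, "B'")
latest = [vol.read_latest(a) for a in range(5)]
frozen = [vol.read_snapshot(a) for a in range(5)]
```

In this model the first write to each block after a snapshot costs an extra copy, which is consistent with the talk's remark that CoW adds latency.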

Backup using dm-snap

(1) Full-scan an old snapshot (Snapshot0)

(2) Full-scan a new snapshot (Snapshot1)

(3) Generate a diff image by comparing the two snapshots

[Figure: logical structure. Snapshot0 holds A and B, Snapshot1 holds A' and B', and the diff image holds A' and B'.]
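A diff made by full-scanning two snapshots amounts to a block-by-block comparison of two images. The following is a minimal sketch under that reading; the function name, the 4 KiB block size, and the dict-based diff format are assumptions, not the format used in production.

```python
def full_scan_diff(old_image: bytes, new_image: bytes, block_size: int = 4096):
    """Compare two same-sized images block by block.

    Returns {byte offset: new block} for every block that differs.
    Both images are read in full, whatever the amount of change.
    """
    if len(old_image) != len(new_image):
        raise ValueError("images must have the same size")
    diff = {}
    for off in range(0, len(new_image), block_size):
        new_block = new_image[off:off + block_size]
        if old_image[off:off + block_size] != new_block:
            diff[off] = new_block
    return diff


# Four-block image in which only the second block changes.
old = bytes(4 * 4096)
changed = bytearray(old)
changed[4096:8192] = b"\x01" * 4096
diff = full_scan_diff(old, bytes(changed))
```

The loop always visits every block, so the cost scales with the volume size rather than with the amount of changed data.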

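Full-scan at night

[Figure: timeline over the hours of the day (axis: o'clock), with the daytime and the backup processing time marked.]

UX degradation during a full-scan

[Figure: graph with the full-scanning period marked.]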

We have no more “nights”

▌Until now:

Full scan is allowed only when access rate is low, i.e., at night.

▌From now on:

We have to handle accesses from multiple timezones.

▌We must be able to back up at any time without UX degradation.


New Solution

▌We need a new solution with:

No IO spikes

Short backup time

▌We compared dm-thin with WalB


What is dm-thin?

▌dm-thin provides thin-provisioning volume management to

share the same data among volumes

reduce disk usage using snapshots

▌In the mainline Linux kernel

Snapshot Management with dm-thin

[Figure, in three steps, each showing a logical and a physical structure.
Step 1: the latest image is mapped by the Latest Tree to block A.
Step 2: after a snapshot is taken, the Snapshot Tree and the Latest Tree both refer to the same block A.
Step 3: on Write A', (1) CoW is performed, then (2) A' is written and the Latest Tree is updated to refer to it; the snapshot still refers to A.]

Backup using dm-thin

Generate a diff image using dm-thin metadata

[Figure: logical and physical structure. Snapshot0 refers to blocks A and B, Snapshot1 refers to A' and B'; the diff image contains A' and B'.]
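The idea behind dm-thin snapshots and metadata-based diffs can be sketched as volumes that are mappings onto shared physical blocks. This is a simplified model: real dm-thin keeps its mappings in on-disk trees, whereas this sketch uses plain dicts. The class, its methods, and the block addresses are hypothetical.

```python
class ThinPool:
    """Toy thin-provisioning pool: volumes are mappings onto shared blocks."""

    def __init__(self):
        self.blocks = []    # physical blocks, append-only in this sketch
        self.volumes = {}   # volume name -> {logical addr: physical index}

    def create(self, name):
        self.volumes[name] = {}

    def snapshot(self, src, dst):
        # Only the mapping is copied; the data blocks stay shared.
        self.volumes[dst] = dict(self.volumes[src])

    def write(self, name, addr, data):
        # Always allocate a new physical block, so a block that another
        # volume still references is never modified.
        self.blocks.append(data)
        self.volumes[name][addr] = len(self.blocks) - 1

    def read(self, name, addr):
        return self.blocks[self.volumes[name][addr]]

    def diff(self, old, new):
        """Return {addr: data} for addresses whose mapping differs."""
        old_map, new_map = self.volumes[old], self.volumes[new]
        return {a: self.blocks[p] for a, p in new_map.items()
                if old_map.get(a) != p}


pool = ThinPool()
pool.create("latest")
pool.write("latest", 1, "A")
pool.write("latest", 4, "B")
pool.snapshot("latest", "snapshot0")
pool.write("latest", 1, "A'")
pool.write("latest", 4, "B'")
pool.snapshot("latest", "snapshot1")
changed = pool.diff("snapshot0", "snapshot1")
```

Here the diff is found by comparing mappings only, and data is read just for the changed addresses. The sketch ignores discarded blocks.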

What is WalB?

▌A real-time and incremental backup system

developed at Cybozu Labs

▌Can back up block devices without IO spikes

[Figure: two graphs. dm-snap: full scanning. WalB: no spikes.]

Special Block Devices for WalB

[Figure: any application (file system, DBMS, etc.) reads from and writes to a WalB device. The WalB device is backed by a data device (linearly mapped) and a log device (ring buffer).]

Write IO Logging and Backup with WalB

[Figure, in three steps, showing a data device with block addresses 0 to 4, a log device, and a time series of write IOs.
Step 1: the data device holds A and B.
Step 2: Write A' updates the data device and appends the record (1, A') to the log device.
Step 3: Write B' updates the data device and appends the record (4, B') to the log device.
Scanning the log device generates a diff image containing A' and B'.]
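The logging scheme can be illustrated with a toy model in which every write is appended to a bounded log and also applied to the data device, and a diff is later built from the log alone. This is a sketch of the idea, not WalB's actual code or on-disk format; all names are hypothetical, and raising an error on a full log is a simplification of this sketch.

```python
from collections import deque

class WalbLikeDevice:
    """Toy model: a data device plus a bounded log of write IOs."""

    def __init__(self, num_blocks, log_capacity):
        self.data = [None] * num_blocks
        self.log = deque()            # stands in for the ring buffer
        self.log_capacity = log_capacity
        self.next_seq = 0             # sequence number of the next record

    def write(self, addr, block):
        if len(self.log) >= self.log_capacity:
            raise RuntimeError("log full; extract logs first")
        self.log.append((self.next_seq, addr, block))
        self.next_seq += 1
        self.data[addr] = block

    def read(self, addr):
        return self.data[addr]

    def extract_logs(self):
        """Remove and return all pending log records, oldest first."""
        records = list(self.log)
        self.log.clear()
        return records


def logs_to_diff(records):
    """Collapse log records into a diff; the latest write per address wins."""
    diff = {}
    for _seq, addr, block in records:
        diff[addr] = block
    return diff


dev = WalbLikeDevice(num_blocks=5, log_capacity=8)
dev.write(1, "A")
dev.write(4, "B")
dev.extract_logs()                 # treat A and B as the initial state
dev.write(1, "A'")
dev.write(4, "B'")
diff = logs_to_diff(dev.extract_logs())
```

Building the diff touches only the log records, so its cost depends on the volume of writes, not on the size of the data device.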

Performance test

▌Compared dm-snap, dm-thin, and WalB

▌Executed a workload during a backup

The workload & the backup will affect each other

▌Measured the following metrics:

Latencies of the workload

Backup time


Environment & Settings

▌Test environment:

CPU: 2.40 GHz x 12 cores

MEM: 192 GiB

HDD: 4 TB HDD, RAID 6 (8D2P)

NIC: 10 Gbps x 2

Kernel: 4.11 (latest upstream)

▌Test settings:

100 GiB volumes

Workload: 4 KiB Random writes for a 5 GiB range


Measuring the Backup Time (dm-snap, dm-thin)

▌dm-snap: take a snapshot & scan the full image

▌dm-thin: get the structure of the snapshot trees, find modified blocks & read these blocks

[Figure: a volume with 4 KiB random writes to a 5 GiB range and 95 GiB unchanged. dm-snap: scan full image. dm-thin: scan changed chunks (tree traversal).]

Measuring the Backup Time (WalB)

▌WalB: scan logs from a log device & send them to a backup server continuously

[Figure: a volume with 4 KiB random writes to a 5 GiB range and 95 GiB unchanged. Write IO logs go from the WalB device to the log device; WalB scans the logs and sends them over the network to the backup server, where diffs are stored.]

Write I/O latency

[Figure: write I/O latency graphs for dm-thin, dm-snap, WalB, and no-backup. Annotations: dm-thin: IO spikes due to CoW, worse than dm-snap! dm-snap: large due to CoW. WalB: small overhead.]

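Backup time

[Figure: bar chart of backup time; the unit is not preserved in this transcript.]

dm-snap: 1146

dm-thin: 2260 (slower than dm-snap)

WalB: 1.2 (so fast!)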
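To put the three backup-time figures in proportion, the ratios can be computed directly. This assumes that 1146, 2260, and 1.2 are the backup times of dm-snap, dm-thin, and WalB respectively (the assignment suggested by the slide's annotations) and that all three share one unit, which the transcript does not preserve.

```python
# Assumed assignment of the slide's numbers; the unit is unknown but
# cancels out when taking ratios.
backup_time = {"dm-snap": 1146, "dm-thin": 2260, "WalB": 1.2}

ratios = {
    name: backup_time[name] / backup_time["WalB"]
    for name in ("dm-snap", "dm-thin")
}
for name, ratio in ratios.items():
    print(f"{name} / WalB = {ratio:.0f}x")
```

Under that assumption, WalB's backup completed roughly three orders of magnitude faster than either snapshot-based method in this test.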

Conclusion

▌dm-snap & dm-thin

High I/O latency during a backup

Long backup time

▌WalB

Stable and low I/O latency (no spikes)

Short backup time

WalB satisfies our requirements for production use.


Try WalB!

▌Project page

https://walb-linux.github.io/

▌Tutorial

https://github.com/walb-linux/walb-tools/tree/master/misc/vagrant/

Vagrantfile for Ubuntu 16.04 and CentOS 7

Incremental backup

▌Daily backup (retention period is 14 days)

▌A WalB worker daemon selects diff files older than 14 days and applies them to a base image.

[Figure: on the backup host, diffs from the volume are kept as diff files for 14 days and applied to the base image every day. A remote host is also shown.]
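The retention rule can be sketched as a function that folds expired diffs into the base image. This is an illustration of the rule as described on the slide, not the worker daemon's code; the function name, the dict-based images, and the in-memory representation are assumptions.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=14)

def apply_expired_diffs(base, diffs, now):
    """Apply diffs older than RETENTION to `base`, oldest first.

    base:  dict {addr: block}, modified in place
    diffs: list of (timestamp, {addr: block}), sorted by timestamp
    Returns the diffs that are still within the retention period.
    """
    kept = []
    for ts, diff in diffs:
        if now - ts > RETENTION:
            base.update(diff)      # fold the old diff into the base image
        else:
            kept.append((ts, diff))
    return kept


now = datetime(2017, 9, 25)
base = {1: "A", 4: "B"}
diffs = [
    (now - timedelta(days=16), {1: "A1"}),
    (now - timedelta(days=15), {1: "A2", 4: "B1"}),
    (now - timedelta(days=3), {4: "B2"}),
]
kept = apply_expired_diffs(base, diffs, now)
```

Because the list is sorted by time, the expired diffs form a prefix and are applied in write order, so the base image always ends up at a consistent point in time.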

Restoring a volume

▌To restore the latest state of a volume:

take a snapshot of a base image, and

apply all diff files to it.

[Figure: a writable snapshot Base' is taken from Base, and all diffs are applied to it. A remote host is also shown.]

Make restoration faster 1/2

▌Fast restoration

by preparing read-only snapshots for each day

[Figure: base image and diff files, with dm-thin snapshots for each day. A remote host is also shown.]

Make restoration faster 2/2

▌Apply some diffs to the appropriate snapshot.

▌At most 24 hours of diffs need to be applied.

Faster!

[Figure: base image, diff files, and snapshots for each day.]
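The faster restoration path can be sketched as: pick the latest daily snapshot at or before the target time, then apply only the diffs taken after it. This is a minimal model with hypothetical names; times are plain numbers (hours), and a snapshot taken at time t is assumed to already include every diff up to and including t.

```python
import bisect

def restore(snapshots, diffs, target):
    """Rebuild the volume image as of time `target`.

    snapshots: list of (time, {addr: block}) sorted by time (read-only)
    diffs:     list of (time, {addr: block}) sorted by time
    Returns (restored image, number of diffs applied).
    """
    times = [t for t, _ in snapshots]
    i = bisect.bisect_right(times, target) - 1
    if i < 0:
        raise ValueError("no snapshot at or before the target time")
    snap_time, image = snapshots[i]
    restored = dict(image)         # writable copy; the snapshot stays intact
    applied = 0
    for t, diff in diffs:
        if snap_time < t <= target:
            restored.update(diff)
            applied += 1
    return restored, applied


# Daily snapshots at hours 0, 24, 48 and a handful of diffs.
snapshots = [(0, {1: "v0"}), (24, {1: "v24"}), (48, {1: "v48"})]
diffs = [(12, {1: "v12"}), (24, {1: "v24"}), (30, {1: "v30"}),
         (47, {1: "v47"}), (48, {1: "v48"})]
image, applied = restore(snapshots, diffs, target=31)
```

With one snapshot per day, the diffs to apply never span more than 24 hours, which is the bound stated on the slide.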


Worldline: restoring a whole environment

▌"Worldline" means a parallel world.

▌We back up configurations in addition to user data.

Configurations:

definitions for each customer (ID, FQDN, Apps, …),

application version definition,

host definition, etc.

▌It is important to use applications whose versions are

consistent with user data backed up before.


Worldline: restoring a whole environment

▌A daily script takes a snapshot of a whole environment.

▌A weekly script restores the latest backup, so we can use it for investigating failures or developing our services.

[Figure: user data (a snapshot and diffs) and ConfigDB are backed up; the backups (diffs and ConfigDB') are restored onto spare hosts to form a Worldline.]

Q&A

email: kota-uchida@cybozu.co.jp

twitter: @uchan_nos
