18 November 2008 • 13:00-13:30 • Valencia, Spain
Josep Vidal Canet Universitat de València
HA architectures using open source virtualization and networking technologies
2
Motivation
• There are many factors that can cause data unavailability, corruption or even loss:
• hardware failures
• natural disasters such as lightning, earthquakes, flooding or fire
• wars or terrorist attacks, like those of September 11th, in which a great deal of corporate data was destroyed
• Every time a computer system that keeps a business running stops, for whatever reason, money is lost
• Example: you want to buy a book from an online book store. However, at this moment the system is no longer working, so you decide to buy it from a competitor's store.
• What I am presenting here is a highly available distributed computer architecture that keeps running even if some of its components fail, for example due to a natural disaster.
3
Problem analysis
• Usually, most architectures are designed using a layered approach
3 tiers: Web + Application Logic + Data
• Each tier should be designed to meet SLA (Service Level Agreement) goals
Availability: 99.xxx %
Performance: 95 % of accesses in < 1 second
• Nowadays SLAs are forcing organizations to deploy HA architectures in order to avoid downtime.
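To make the 99.xxx % figures concrete, here is a small sketch (ours, not from the slides) converting an availability target into the downtime budget it allows per year:

```python
# Convert an SLA availability target into the downtime it allows.
# Each extra 9 shrinks the yearly downtime budget by an order of magnitude.

HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(availability_pct: float) -> float:
    """Maximum downtime per year allowed by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} h/year")
# 99.0%  -> 87.60 h/year
# 99.9%  ->  8.76 h/year
# 99.99% ->  0.88 h/year
```

This is why a "99.9 %" clause already rules out manual recovery for most failures: less than nine hours of outage per year leaves no room for operators to notice, diagnose and restart systems by hand.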
4
Example of IS Architecture (UV)

[Diagram: a level-4 switch / pound workload balancer in front of three groups. WEB clusters: web mail (1), www.uv.es (1), Virtual Classroom (2), Library (5), disc (1), post (2), monitor. Application clusters: J2EE applications (5), Virtual Classroom (4), disc (1), mailboxes (2), Accounting Research (2). Data sources: DW, SECRETARIA VIRTUAL, REPLICA (2), Virtual Classroom (1), WEB (1), VIRTUAL DISC (1), Library (1) and Library (7). The numbers in parentheses are node counts.]
5
Web + Application Tiers
• It is easy to design an architecture that meets SLA goals
Main reason: it is easy to clone & balance work between several systems
The modification rate is low -> data only changes when an application is installed or updated
The balancer is the SPoF (Single Point of Failure)
An active/passive approach can be used to avoid unavailability
• Let's see an example
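Because the tiers are easy to clone, the balancer's job reduces to spreading requests over identical back ends. A minimal pound configuration sketch doing that for two cloned Apache servers (addresses and ports are hypothetical, not taken from the slides):

```
# /etc/pound/pound.cfg -- sketch; back-end addresses are hypothetical
ListenHTTP
    Address 0.0.0.0
    Port    80

    Service
        BackEnd
            Address 192.168.1.11   # apache 1 (clone)
            Port    80
        End
        BackEnd
            Address 192.168.1.12   # apache 2 (clone)
            Port    80
        End
    End
End
```

pound removes a failed back end from rotation on its own; what it cannot do is survive its own failure, which is why the next slides pair it with a passive twin.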
6
General Architecture

[Diagram: clients enter through a level-4 switch / pound workload balancer into the web server cluster (Linux/Apache: webges01, webges11, webges..), which forwards via plugin-cfg (MaxConnect) to the WAS grid of application servers (Explica, Complica, Multiplica, Replica, Implica on AIX pSeries, with session persistence backed by LDAPserver1 and LDAPserver2). The application tier reaches the data servers: the CICS SERVER (MaxTasks, THREADLIMIT, AUTOMAT) and CTG/JDBC connection pools against DB2 on OS/390 z/890, plus open databases (db2jd).]

Web balancer: automatic failover. Automatic session recovery for a down JVM.
7
WEB Servers Architecture (Web Tier)

[Diagram: clients reach an active/passive pair of web balancers (active balancer and passive balancer linked by heartbeat) in front of the web server cluster (Linux/Opteron: apache 1, apache 2, ..., apache x, apache 11), which connects via plugin-cfg (MaxConnect) to the CICS SERVER.]

Active/Passive Web balancer
Heartbeat provides automatic failover using a public IP + soft ARP
12 seconds of unavailability in the case of a primary balancer failure
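The active/passive pair maps directly onto Heartbeat's two classic configuration files. A sketch (node names and the service IP are hypothetical; the deadtime matches the roughly 12-second window above, and `pound` is assumed to be a resource script or init script available on both nodes):

```
# /etc/ha.d/ha.cf -- identical on both balancer nodes (names hypothetical)
keepalive 2        # send a heartbeat packet every 2 seconds
deadtime 12        # declare the peer dead after 12 s of silence
bcast eth0         # interface carrying the heartbeat
auto_failback on
node lb1
node lb2

# /etc/ha.d/haresources -- the public IP follows whichever node is active
lb1 IPaddr::147.156.1.10/23/eth0 pound
```

On failover, Heartbeat brings the public IP up on the survivor and sends gratuitous ARP so clients and switches learn the new MAC, which is exactly the "public IP + soft ARP" mechanism named above.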
8
Application Tier (Logical and Physical Design)

Runtime: WebSphere Application Server (WAS), hosting JSPs, servlets and EJBs on pSeries (POWER5/POWER6) machines, with session persistence. The WAS grid spans the servers Explica, Complica, Multiplica, Replica and Implica, and applications are grouped into clusters by criticality:

cprod1 = critical applications (automatricula, gdh)
JVM distribution = implica (cprod1_1), complica (cprod1_s2, cprod1_s4), multiplica (cprod1_s3, cprod1_s9)

cprod2 = normal (actas, personal, ...)
JVM distribution = implica (cprod2_1), complica (cprod2_s2, cprod2_s4), multiplica (cprod2_s3), replica (cprod2_s5)

cprod3 = low
JVM distribution = replica (cprod3_1, cprod3_s2, cprod3_s3)

cquarentena = very low, isolated applications
JVM distribution = replica (cquarentena_s1)
9
Data Tier for Web Applications

[Diagram: web applications reach DB2 on OS/390 z/890 through the CICS SERVER (MaxTasks, THREADLIMIT, AUTOMAT) and through CTG/JDBC connection pools; open databases (db2jd) sit alongside.]

• It is more difficult to design an IS without a SPoF.
• Main reasons:
Data is persistent
The modification rate is high
They need a lot of resources -> multicore servers + high I/O bandwidth
IS are complex and heterogeneous; they consist of databases, text files, batch jobs, etc.
10
Data Tier for Web Applications

Available Solutions for Information Systems:
• Use clusterizable databases (e.g. IBM SYSPLEX). Shared storage = SPoF.
• Deploy SSI (Single System Image) architectures. Not mature enough yet; we'll tell you our experiences with DB2 + OpenSSI.
11
DB running with OpenSSI

• A comprehensive clustering solution offering a full, highly available SSI environment for Linux
• Goals for OpenSSI clusters include availability, scalability and manageability, built from standard servers
• Open source
• Can run databases
• No SPoF using DRBD + Heartbeat + CFS
• Problems with process migration
12
SSI at UV

db2inst2@ssi:~$ db2start
07-28-2008 17:14:04     0   0   SQL1063N  DB2START processing was successful.
SQL1063N  DB2START processing was successful.
db2inst2@ssi:~$ db2 connect to replica

   Database Connection Information
   Database server        = DB2/LINUX 8.1.0
   SQL authorization ID   = DB2INST2
   Local database alias   = REPLICA

db2inst2@ssi:~$ db2 "select count(*) from sysibm.systables"

1
-----------
        436

  1 record(s) selected.

db2inst2@ssi:~$ cluster -v
1: UP
2: UP
13
Data Tier for DB-based Applications

Using open source virtualization and networking technologies, it is possible to deploy geographically distributed architectures with automatic failover that keep downtime low in the face of contingencies.

The UV has deployed variations (active/passive, active/active) of such architectures in order to guarantee good response times and maximize availability for its DB-based information systems.

[Diagram: Active/Active architecture. XenServer A and XenServer B, 10 km apart, each running VMs (VM1-VM4) whose disks are distributed RAID block devices (/dev/drbd1 ... /dev/drbd4): network RAID-1 mirrors over the IP network between the two FC disc arrays, with heartbeat between the servers. JDBC connection pools point at the DB2 instances inside the VMs.]
14
Data Tier Architecture using Open Source

Ideal solution: SSI. Not possible yet.

Using virtualization & networking technologies we can design distributed, fault-tolerant systems where physical resources are virtualized & replicated far away (over IP).

Physical resources: IP & storage (SAN) networks; physical servers.
Components (logical resources): virtualization software (Xen), Distributed Replicated Block Device (DRBD), automatic failover (Heartbeat).
15
HA Architecture components (Data Tier): Active/Passive Architecture

[Diagram: physical layer. Primary centre (primary server + primary discs array) and backup centre (standby server + secondary discs array), 10 km apart, connected by an IP network and by FC links into a SAN exposing LUNs (Logical Unit Numbers).]
16
Active/Passive HA architecture

Using DRBD we build a reliable mirror between disks of the primary and secondary disk arrays
Over this mirror, we define a Xen VM in which the DB system will run
By default this VM runs on the CPUs of the primary site, modifying the data stored on LUNs in the primary disk array
DRBD uses a standard IP network to keep the primary and secondary disks synchronized
In the event of a contingency, Heartbeat will detect the unavailability and migrate the Xen VM from the primary site to the computational resources available at the secondary site
To facilitate automatic DB recovery after a system crash, additional configuration is needed:
DB2 conf: enable AUTORESTART, LOGRETAIN, ...
Oracle conf: enable ARCHIVELOG mode.
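The DB-side settings above translate into commands like the following (shown against the REPLICA database from the earlier demo; a sketch of the kind of configuration needed rather than the exact UV setup):

```
# DB2: let the instance replay its logs and reopen the DB automatically
# after a crash, and keep archive logs for roll-forward recovery
db2 update db cfg for REPLICA using AUTORESTART ON
db2 update db cfg for REPLICA using LOGRETAIN RECOVERY

# Oracle equivalent: enable ARCHIVELOG mode (database mounted, not open)
# SQL> ALTER DATABASE ARCHIVELOG;
```

With AUTORESTART on, the first connection after the VM comes back triggers crash recovery without operator intervention, which is what makes the Heartbeat-driven restart fully automatic.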
17
Active/Passive HA architecture: XEN + DRBD + Heartbeat

[Diagram: primary centre ("Brico-mania": Bresca Xen dom0 + primary disc array) and backup centre ("Deco-garden": Colmena Xen dom0 + secondary disc array), 10 km apart. DRBD (Distributed RAID Block Device) builds a network RAID-1, /dev/drbd1 on both sides, over IP between the two FC disc arrays. In the virtual layer, the Bancuv3 VM (virtual secretary) runs active on the primary site with a standby definition on the secondary; Heartbeat provides automatic failover.]
18
Active/Passive Architecture: final situation after a failure

[Diagram: the primary centre ("Brico-mania", Bresca Xen dom0) is down; in the virtual layer the Bancuv3 VM now runs active on Colmena ("Deco-garden") in the backup centre, over the secondary copy of /dev/drbd1 in the physical layer.]

Heartbeat detects the failure and proceeds to restart the VM on the secondary HW resources.
During the system restart, DB2 may proceed to do a crash recovery.
19
System components

db2inst1@bancuv3:~$ db2 connect to josep

   Database Connection Information
   Database server        = DB2/LINUX 8.1.0
   SQL authorization ID   = DB2INST1
   Local database alias   = JOSEP

bresca:~# xm list
Name        ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0     0      1209      2  r-----  66538.2
bancuv3     13      1024      1  -b----   3286.8

bresca:~# more /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@bresca, 2007-07-03 11:57:21
 0: cs:Unconfigured
 1: cs:Connected st:Primary/Secondary ld:Consistent
    ns:31544386 nr:0 dw:36278 dr:31835131 al:95 bm:3848 lo:0 pe:0 ua:0 ap:0

bresca:~# dmesg
qla2xxx 0000:03:01.1: Configure NVRAM parameters...
  Vendor: STK  Model: FLEXLINE 380  Rev: 0619
  Type: Direct-Access  ANSI SCSI revision: 05
SCSI device sdh: drive cache: write through w/ FUA
 sdh: sdh1 sdh2

[Diagram: primary disc array -> Bresca Xen dom0 -> network RAID-1 /dev/drbd1 -> Bancuv3 active VM running the DB.]
20
DRBD: disk synchronization after a server failure

colmena:/etc# more /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@bresca, 2007-07-03 11:57:21
 0: cs:Unconfigured
 1: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:1044903 dw:1044903 dr:0 al:0 bm:3023 lo:37 pe:1222 ua:37 ap:0
    [>...................] sync'ed: 2.2% (46235/47256)M
    finish: 0:17:00 speed: 46,160 (52,228) K/sec

DRBD uses host-based replication (sync & async) to keep the local & remote discs up to date.
Be careful with failures of the primary system while the secondary node is synchronizing.
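The mirror shown in /proc/drbd is declared in /etc/drbd.conf. A DRBD 0.7-style sketch for the resource behind /dev/drbd1 (IP addresses and backing partitions are hypothetical, though sdh matches the dmesg output on the previous slide):

```
# /etc/drbd.conf -- network RAID-1 between the two sites (sketch)
resource drbd1 {
  protocol C;               # synchronous: a write completes only once
                            # it is on disk at both sites
  on bresca {
    device    /dev/drbd1;
    disk      /dev/sdh1;    # LUN from the primary FC array
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on colmena {
    device    /dev/drbd1;
    disk      /dev/sdh1;    # LUN from the secondary FC array
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

Protocol C is what makes the warning above matter less: with synchronous replication the secondary is only Inconsistent while it is resynchronizing after an outage, which is precisely the window in which a second failure would be dangerous.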
21
XEN

• Virtual machine configuration: /etc/xen/bancuv3.cfg

# Disk device(s).
root = '/dev/sda1 ro'
disk = [ 'phy:/dev/drbd1,sda1,w', 'phy:/dev/sdh2,sda2,w' ]

• The VM disk is a network mirror between two remote FC disks

[Diagram: same Xen + DRBD + Heartbeat layout as above, with /dev/drbd1 mirrored as network RAID-1 between Bresca and Colmena, 10 km apart.]
22
XEN + DRBD + Heartbeat
• At this point, the VM uses only virtual resources, so it does not depend on the underlying HW
• As the VM disk is a network mirror, the VM can run on both systems
• Finally, we add Heartbeat for failure detection & recovery
• In the event of a failure, Heartbeat migrates the VM to the available resources (secondary site)
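The failure-detection rule Heartbeat applies can be sketched in a few lines (our illustration, not Heartbeat source code): a peer is presumed dead once more than `deadtime` seconds pass without a heartbeat packet from it.

```python
# Sketch of Heartbeat's dead-peer rule: silence longer than the
# deadtime window triggers resource takeover on the survivor.

def peer_is_dead(last_beat_at: float, now: float, deadtime: float = 12.0) -> bool:
    """True if the peer has been silent longer than the deadtime window."""
    return (now - last_beat_at) > deadtime

# With the 12 s deadtime used in our setup:
assert not peer_is_dead(last_beat_at=100.0, now=110.0)  # 10 s silence: alive
assert peer_is_dead(last_beat_at=100.0, now=113.0)      # 13 s silence: failover
```

The deadtime is a trade-off: shorter values shrink the unavailability window but risk spurious failovers when the heartbeat link is merely congested.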
[Diagram: same Xen + DRBD + Heartbeat layout as above (Bresca and Colmena dom0s, /dev/drbd1 network RAID-1 over IP between the FC disc arrays, 10 km apart).]
23
Failure detection & recovery

[Diagram: after the failure, the Bancuv3 VM runs active on Colmena in the backup centre over /dev/drbd1; Heartbeat performed the automatic failover.]

more /var/log/ha-log
heartbeat: 2008/08/27_10:43:59 info: Received shutdown notice from 'bresca'.
heartbeat: 2008/08/27_10:43:59 info: Acquiring resource group: bresca 147.156.1.56/23/eth0:5 bancuv3
heartbeat: 2008/08/27_10:43:59 info: Running /etc/ha.d/resource.d/IPaddr 147.156.1.56/23/eth0:5 start
heartbeat: 2008/08/27_10:44:00 info: /sbin/ifconfig eth0:5:0 147.156.1.56 netmask 255.255.254.0 broadcast 147.156.1.255
heartbeat: 2008/08/27_10:44:00 info: Sending Gratuitous Arp for 147.156.1.56 on eth0:5:0 [eth0]
heartbeat: 2008/08/27_10:44:00 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-147.156.1.56 eth0 147.156.1.56 auto 147.156.1.56 ffffffffffff
heartbeat: 2008/08/27_10:44:00 info: Running /etc/ha.d/resource.d/bancuv3 start
heartbeat: 2008/08/27_10:44:04 info: all HA resource acquisition completed (standby).
heartbeat: 2008/08/27_10:44:04 info: Standby resource acquisition done [all].
24
Active/Passive HA architecture

Additional system tuning is needed to improve recovery times:
Use a journaling filesystem (xfs, ext3, etc.)
To facilitate automatic DB recovery after a system crash, additional configuration is needed (in the case of DB2: AUTORESTART, LOGRETAIN, DB2_USE_PARALLEL_RECOVERY, ...)
Database recovery can be a time-consuming task (and is not deterministic)

The drawback of this architecture is that the secondary site's computational resources sit idle, waiting for a failure at the primary site.
A better approach consists of balancing the execution of DB instances and other VMs between both sites.
In the event of a contingency at one site, the VMs are migrated from the affected site to the available one.
VM migration consists of stopping the VM on the affected site and starting it on the available resources.
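Conceptually, the stop/start migration above boils down to two commands on the surviving dom0, which the Heartbeat resource script wraps (a sketch; the actual script is site-specific, and the resource and config names are taken from the earlier slides):

```
# On the surviving dom0 (colmena), after Heartbeat declares the peer dead:
drbdadm primary drbd1            # promote the local half of the mirror
xm create /etc/xen/bancuv3.cfg   # start the VM on the surviving site
```

Because the VM's disk is the DRBD device itself, no data needs to be copied at failover time; promotion plus boot is the whole migration.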
25
Active/Active HA architecture (I)

[Diagram: Active/Active architecture. Primary centre ("Brico-mania": Bresca Xen dom0 + primary disc array) and backup centre ("Deco-garden": Colmena Xen dom0 + secondary disc array), 10 km apart, linked by IP and FC, with Heartbeat for automatic failover. Four active VMs are spread across both sites: Academic VM, Human Resources VM, Data Warehouse VM and Secretary VM, each running over its own DRBD network RAID-1 device (/dev/drbd1 ... /dev/drbd4).]
26
Active/Active HA architecture (II)

[Diagram: same active/active layout as the previous slide: four active VMs spread across Bresca and Colmena, each over its own DRBD network RAID-1 device.]

bresca:~# xm list
Name        ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0     0      1209      2  r-----  66538.2
bancuv3     13      1024      1  -b----   3286.8
webges06     6       256      1  -b----     83.6

colmena:~# xm list
Name        ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0     0      2575      4  r-----    492.2
rac2         1      1024      1  -b----    458.3
webges05     2       256      1  -b----     69.7
27
Active/Active HA architecture (III)

[Diagram: XenServer A and XenServer B, 10 km apart, hosting VM1-VM4 over distributed RAID block devices (/dev/drbd1 ... /dev/drbd4) mirrored over the IP network, with heartbeat between them.]

• In the event of a failure, Heartbeat migrates the affected VMs to the available resources at that point in time
• The load of the surviving site will be increased (x2)
28
Active/Active HA architecture (IV)
• After a failure at one of the two sites, the load of the surviving site increases by a factor of two
• We will be up & running, but with worse response times
• Once the primary site's HW & SW resources have been recovered, the load is redistributed automatically
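The doubling implies a sizing rule we can sketch (utilization figures are hypothetical): in an active/active pair, each site should normally run at no more than about 50 % utilization, so the survivor can absorb the whole load after a site failure.

```python
# Sizing check for an active/active pair: after one site fails, the
# survivor must carry both sites' workloads.

def survivor_load(site_a_util: float, site_b_util: float) -> float:
    """Utilization of the surviving site after the other site fails."""
    return site_a_util + site_b_util

assert survivor_load(0.45, 0.40) <= 1.0  # fits: degraded response, still up
assert survivor_load(0.70, 0.60) > 1.0   # overload: SLA at risk after failover
```

This is the price of not leaving a site idle: the active/passive design wastes the standby's capacity, while the active/active design trades that waste for degraded response times during a contingency.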
[Diagram: after the failure, only XenServer B remains, running all four VMs (VM1-VM4) over its copies of the distributed RAID block devices (/dev/drbd1 ... /dev/drbd4).]
29
What we have learned

How to deploy HA architectures for DB-based information systems that automatically detect and recover from errors of the runtime (HW, SW) needed to run corporate applications
How to select the HA architecture for databases (active/passive, active/active, single system image) that best fits the business's SLA
How to automate the major steps involved in detecting and recovering from errors of a given component of the DB runtime
How to configure & use the open-source tools (Xen, Heartbeat, DRBD, OpenSSI) needed to implement high availability architectures