18 November 2008 • 13:00-13:30 • Valencia, Spain
Josep Vidal Canet Universitat de València
HA architectures using open source virtualization and networking technologies
2
Motivation
• There are many factors that can cause data unavailability, corruption or even loss:
• hardware failures
• natural disasters such as lightning, earthquakes, flooding or fire
• wars or terrorist attacks, like those of September 11th, in which a great deal of corporate data was destroyed
• Every time a computer system that keeps a business running stops, for whatever reason, money is lost
• Example: you want to buy a book from an online book store. However, at this moment the system is no longer working, so you decide to buy it from a competitor's store.
• What I am presenting here is a highly available distributed computer architecture that keeps running even if some of its components fail, for example due to a natural disaster.
3
Problem analysis
• Usually, most architectures are designed using a layered approach
3 tiers: Web + Application Logic + Data
• Each tier should be designed to meet SLA (Service Level Agreement) goals
Availability: 99.xxx %
Performance: 95 % of accesses in < 1 second
• Nowadays SLAs are forcing organizations to deploy HA architectures in order to avoid downtime.
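To make the 99.xxx % figures concrete, here is a small sketch (ours, not from the slides) converting an availability target into the downtime budget it allows per year:

```python
# Convert an SLA availability target into the downtime it allows.
# Each extra 9 shrinks the yearly downtime budget by an order of magnitude.

HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours_per_year(availability_pct: float) -> float:
    """Maximum downtime per year allowed by an availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} h/year")
# 99.0%  -> 87.60 h/year
# 99.9%  ->  8.76 h/year
# 99.99% ->  0.88 h/year
```

This is why a "99.9 %" clause already rules out manual recovery for most failures: less than nine hours of outage per year leaves no room for operators to notice, diagnose and restart systems by hand.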
4
Example of IS Architecture (UV)

[Diagram: a level-4 switch / pound workload balancer in front of three groups. WEB clusters: web mail (1), www.uv.es (1), Virtual Classroom (2), Library (5), disc (1), post (2), monitor. Application clusters: J2EE applications (5), Virtual Classroom (4), disc (1), mailboxes (2), Accounting Research (2). Data sources: DW, SECRETARIA VIRTUAL, REPLICA (2), Virtual Classroom (1), WEB (1), VIRTUAL DISC (1), Library (1) and Library (7). The numbers in parentheses are node counts.]
5
Web + Application Tiers
• It is easy to design an architecture that meets SLA goals
Main reason: it is easy to clone & balance work between several systems
The modification rate is low -> data only changes when an application is installed or updated
The balancer is the SPoF (Single Point of Failure)
An active/passive approach can be used to avoid unavailability
• Let's see an example
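Because the tiers are easy to clone, the balancer's job reduces to spreading requests over identical back ends. A minimal pound configuration sketch doing that for two cloned Apache servers (addresses and ports are hypothetical, not taken from the slides):

```
# /etc/pound/pound.cfg -- sketch; back-end addresses are hypothetical
ListenHTTP
    Address 0.0.0.0
    Port    80

    Service
        BackEnd
            Address 192.168.1.11   # apache 1 (clone)
            Port    80
        End
        BackEnd
            Address 192.168.1.12   # apache 2 (clone)
            Port    80
        End
    End
End
```

pound removes a failed back end from rotation on its own; what it cannot do is survive its own failure, which is why the next slides pair it with a passive twin.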
6
General Architecture

[Diagram: clients enter through a level-4 switch / pound workload balancer into the web server cluster (Linux/Apache: webges01, webges11, webges..), which forwards via plugin-cfg (MaxConnect) to the WAS grid of application servers (Explica, Complica, Multiplica, Replica, Implica on AIX pSeries, with session persistence backed by LDAPserver1 and LDAPserver2). The application tier reaches the data servers: the CICS SERVER (MaxTasks, THREADLIMIT, AUTOMAT) and CTG/JDBC connection pools against DB2 on OS/390 z/890, plus open databases (db2jd).]

Web balancer: automatic failover. Automatic session recovery for a down JVM.
7
WEB Servers Architecture (Web Tier)

[Diagram: clients reach an active/passive pair of web balancers (active balancer and passive balancer linked by heartbeat) in front of the web server cluster (Linux/Opteron: apache 1, apache 2, ..., apache x, apache 11), which connects via plugin-cfg (MaxConnect) to the CICS SERVER.]

Active/Passive Web balancer
Heartbeat provides automatic failover using a public IP + soft ARP
12 seconds of unavailability in the case of a primary balancer failure
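The active/passive pair maps directly onto Heartbeat's two classic configuration files. A sketch (node names and the service IP are hypothetical; the deadtime matches the roughly 12-second window above, and `pound` is assumed to be a resource script or init script available on both nodes):

```
# /etc/ha.d/ha.cf -- identical on both balancer nodes (names hypothetical)
keepalive 2        # send a heartbeat packet every 2 seconds
deadtime 12        # declare the peer dead after 12 s of silence
bcast eth0         # interface carrying the heartbeat
auto_failback on
node lb1
node lb2

# /etc/ha.d/haresources -- the public IP follows whichever node is active
lb1 IPaddr::147.156.1.10/23/eth0 pound
```

On failover, Heartbeat brings the public IP up on the survivor and sends gratuitous ARP so clients and switches learn the new MAC, which is exactly the "public IP + soft ARP" mechanism named above.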
8
Application Tier (Logical and Physical Design)

Runtime: WebSphere Application Server (WAS), hosting JSPs, servlets and EJBs on pSeries (POWER5/POWER6) machines, with session persistence. The WAS grid spans the servers Explica, Complica, Multiplica, Replica and Implica, and applications are grouped into clusters by criticality:

cprod1 = critical applications (automatricula, gdh)
JVM distribution = implica (cprod1_1), complica (cprod1_s2, cprod1_s4), multiplica (cprod1_s3, cprod1_s9)

cprod2 = normal (actas, personal, ...)
JVM distribution = implica (cprod2_1), complica (cprod2_s2, cprod2_s4), multiplica (cprod2_s3), replica (cprod2_s5)

cprod3 = low
JVM distribution = replica (cprod3_1, cprod3_s2, cprod3_s3)

cquarentena = very low, isolated applications
JVM distribution = replica (cquarentena_s1)
9
Data Tier for Web Applications

[Diagram: web applications reach DB2 on OS/390 z/890 through the CICS SERVER (MaxTasks, THREADLIMIT, AUTOMAT) and through CTG/JDBC connection pools; open databases (db2jd) sit alongside.]

• It is more difficult to design an IS without a SPoF.
• Main reasons:
Data is persistent
The modification rate is high
They need a lot of resources -> multicore servers + high I/O bandwidth
IS are complex and heterogeneous; they consist of databases, text files, batch jobs, etc.
10
Data Tier for Web Applications

Available Solutions for Information Systems:
• Use clusterizable databases (e.g. IBM SYSPLEX). Shared storage = SPoF.
• Deploy SSI (Single System Image) architectures. Not mature enough yet; we'll tell you our experiences with DB2 + OpenSSI.
11
DB running with OpenSSI

• A comprehensive clustering solution offering a full, highly available SSI environment for Linux
• Goals for OpenSSI clusters include availability, scalability and manageability, built from standard servers
• Open source
• Can run databases
• No SPoF using DRBD + Heartbeat + CFS
• Problems with process migration
12
SSI at UV

db2inst2@ssi:~$ db2start
07-28-2008 17:14:04     0   0   SQL1063N  DB2START processing was successful.
SQL1063N  DB2START processing was successful.
db2inst2@ssi:~$ db2 connect to replica

   Database Connection Information
   Database server        = DB2/LINUX 8.1.0
   SQL authorization ID   = DB2INST2
   Local database alias   = REPLICA

db2inst2@ssi:~$ db2 "select count(*) from sysibm.systables"

1
-----------
        436

  1 record(s) selected.

db2inst2@ssi:~$ cluster -v
1: UP
2: UP
13
Data Tier for DB-based Applications

Using open source virtualization and networking technologies, it is possible to deploy geographically distributed architectures with automatic failover that keep downtime low in the face of contingencies.

The UV has deployed variations (active/passive, active/active) of such architectures in order to guarantee good response times and maximize availability for its DB-based information systems.

[Diagram: Active/Active architecture. XenServer A and XenServer B, 10 km apart, each running VMs (VM1-VM4) whose disks are distributed RAID block devices (/dev/drbd1 ... /dev/drbd4): network RAID-1 mirrors over the IP network between the two FC disc arrays, with heartbeat between the servers. JDBC connection pools point at the DB2 instances inside the VMs.]
14
Data Tier Architecture using Open Source

Ideal solution: SSI. Not possible yet.

Using virtualization & networking technologies we can design distributed, fault-tolerant systems where physical resources are virtualized & replicated far away (over IP).

Physical resources: IP & storage (SAN) networks; physical servers.
Components (logical resources): virtualization software (Xen), Distributed Replicated Block Device (DRBD), automatic failover (Heartbeat).
15
HA Architecture components (Data Tier): Active/Passive Architecture

[Diagram: physical layer. Primary centre (primary server + primary discs array) and backup centre (standby server + secondary discs array), 10 km apart, connected by an IP network and by FC links into a SAN exposing LUNs (Logical Unit Numbers).]
16
Active/Passive HA architecture

Using DRBD we build a reliable mirror between disks of the primary and secondary disk arrays
Over this mirror, we define a Xen VM in which the DB system will run
By default this VM runs on the CPUs of the primary site, modifying the data stored on LUNs in the primary disk array
DRBD uses a standard IP network to keep the primary and secondary disks synchronized
In the event of a contingency, Heartbeat will detect the unavailability and migrate the Xen VM from the primary site to the computational resources available at the secondary site
To facilitate automatic DB recovery after a system crash, additional configuration is needed:
DB2 conf: enable AUTORESTART, LOGRETAIN, ...
Oracle conf: enable ARCHIVELOG mode.
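The DB-side settings above translate into commands like the following (shown against the REPLICA database from the earlier demo; a sketch of the kind of configuration needed rather than the exact UV setup):

```
# DB2: let the instance replay its logs and reopen the DB automatically
# after a crash, and keep archive logs for roll-forward recovery
db2 update db cfg for REPLICA using AUTORESTART ON
db2 update db cfg for REPLICA using LOGRETAIN RECOVERY

# Oracle equivalent: enable ARCHIVELOG mode (database mounted, not open)
# SQL> ALTER DATABASE ARCHIVELOG;
```

With AUTORESTART on, the first connection after the VM comes back triggers crash recovery without operator intervention, which is what makes the Heartbeat-driven restart fully automatic.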
17
Active/Passive HA architecture: XEN + DRBD + Heartbeat

[Diagram: primary centre ("Brico-mania": Bresca Xen dom0 + primary disc array) and backup centre ("Deco-garden": Colmena Xen dom0 + secondary disc array), 10 km apart. DRBD (Distributed RAID Block Device) builds a network RAID-1, /dev/drbd1 on both sides, over IP between the two FC disc arrays. In the virtual layer, the Bancuv3 VM (virtual secretary) runs active on the primary site with a standby definition on the secondary; Heartbeat provides automatic failover.]
18
Active/Passive Architecture: final situation after a failure

[Diagram: the primary centre ("Brico-mania", Bresca Xen dom0) is down; in the virtual layer the Bancuv3 VM now runs active on Colmena ("Deco-garden") in the backup centre, over the secondary copy of /dev/drbd1 in the physical layer.]

Heartbeat detects the failure and proceeds to restart the VM on the secondary HW resources.
During the system restart, DB2 may proceed to do a crash recovery.
19
System components

db2inst1@bancuv3:~$ db2 connect to josep

   Database Connection Information
   Database server        = DB2/LINUX 8.1.0
   SQL authorization ID   = DB2INST1
   Local database alias   = JOSEP

bresca:~# xm list
Name        ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0     0      1209      2  r-----  66538.2
bancuv3     13      1024      1  -b----   3286.8

bresca:~# more /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@bresca, 2007-07-03 11:57:21
 0: cs:Unconfigured
 1: cs:Connected st:Primary/Secondary ld:Consistent
    ns:31544386 nr:0 dw:36278 dr:31835131 al:95 bm:3848 lo:0 pe:0 ua:0 ap:0

bresca:~# dmesg
qla2xxx 0000:03:01.1: Configure NVRAM parameters...
  Vendor: STK  Model: FLEXLINE 380  Rev: 0619
  Type: Direct-Access  ANSI SCSI revision: 05
SCSI device sdh: drive cache: write through w/ FUA
 sdh: sdh1 sdh2

[Diagram: primary disc array -> Bresca Xen dom0 -> network RAID-1 /dev/drbd1 -> Bancuv3 active VM running the DB.]
20
DRBD: disk synchronization after a server failure

colmena:/etc# more /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root@bresca, 2007-07-03 11:57:21
 0: cs:Unconfigured
 1: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:1044903 dw:1044903 dr:0 al:0 bm:3023 lo:37 pe:1222 ua:37 ap:0
    [>...................] sync'ed: 2.2% (46235/47256)M
    finish: 0:17:00 speed: 46,160 (52,228) K/sec

DRBD uses host-based replication (sync & async) to keep the local & remote discs up to date.
Be careful with failures of the primary system while the secondary node is synchronizing.
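The mirror shown in /proc/drbd is declared in /etc/drbd.conf. A DRBD 0.7-style sketch for the resource behind /dev/drbd1 (IP addresses and backing partitions are hypothetical, though sdh matches the dmesg output on the previous slide):

```
# /etc/drbd.conf -- network RAID-1 between the two sites (sketch)
resource drbd1 {
  protocol C;               # synchronous: a write completes only once
                            # it is on disk at both sites
  on bresca {
    device    /dev/drbd1;
    disk      /dev/sdh1;    # LUN from the primary FC array
    address   10.0.0.1:7788;
    meta-disk internal;
  }
  on colmena {
    device    /dev/drbd1;
    disk      /dev/sdh1;    # LUN from the secondary FC array
    address   10.0.0.2:7788;
    meta-disk internal;
  }
}
```

Protocol C is what makes the warning above matter less: with synchronous replication the secondary is only Inconsistent while it is resynchronizing after an outage, which is precisely the window in which a second failure would be dangerous.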
21
XEN

• Virtual machine configuration: /etc/xen/bancuv3.cfg

# Disk device(s).
root = '/dev/sda1 ro'
disk = [ 'phy:/dev/drbd1,sda1,w', 'phy:/dev/sdh2,sda2,w' ]

• The VM disk is a network mirror between two remote FC disks

[Diagram: same Xen + DRBD + Heartbeat layout as above, with /dev/drbd1 mirrored as network RAID-1 between Bresca and Colmena, 10 km apart.]
22
XEN + DRBD + Heartbeat
• At this point, the VM uses only virtual resources, so it does not depend on the underlying HW
• As the VM disk is a network mirror, the VM can run on both systems
• Finally, we add Heartbeat for failure detection & recovery
• In the event of a failure, Heartbeat migrates the VM to the available resources (secondary site)
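The failure-detection rule Heartbeat applies can be sketched in a few lines (our illustration, not Heartbeat source code): a peer is presumed dead once more than `deadtime` seconds pass without a heartbeat packet from it.

```python
# Sketch of Heartbeat's dead-peer rule: silence longer than the
# deadtime window triggers resource takeover on the survivor.

def peer_is_dead(last_beat_at: float, now: float, deadtime: float = 12.0) -> bool:
    """True if the peer has been silent longer than the deadtime window."""
    return (now - last_beat_at) > deadtime

# With the 12 s deadtime used in our setup:
assert not peer_is_dead(last_beat_at=100.0, now=110.0)  # 10 s silence: alive
assert peer_is_dead(last_beat_at=100.0, now=113.0)      # 13 s silence: failover
```

The deadtime is a trade-off: shorter values shrink the unavailability window but risk spurious failovers when the heartbeat link is merely congested.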
[Diagram: same Xen + DRBD + Heartbeat layout as above (Bresca and Colmena dom0s, /dev/drbd1 network RAID-1 over IP between the FC disc arrays, 10 km apart).]
23
Failure detection & recovery

[Diagram: after the failure, the Bancuv3 VM runs active on Colmena in the backup centre over /dev/drbd1; Heartbeat performed the automatic failover.]

more /var/log/ha-log
heartbeat: 2008/08/27_10:43:59 info: Received shutdown notice from 'bresca'.
heartbeat: 2008/08/27_10:43:59 info: Acquiring resource group: bresca 147.156.1.56/23/eth0:5 bancuv3
heartbeat: 2008/08/27_10:43:59 info: Running /etc/ha.d/resource.d/IPaddr 147.156.1.56/23/eth0:5 start
heartbeat: 2008/08/27_10:44:00 info: /sbin/ifconfig eth0:5:0 147.156.1.56 netmask 255.255.254.0 broadcast 147.156.1.255
heartbeat: 2008/08/27_10:44:00 info: Sending Gratuitous Arp for 147.156.1.56 on eth0:5:0 [eth0]
heartbeat: 2008/08/27_10:44:00 /usr/lib/heartbeat/send_arp -i 1010 -r 5 -p /var/lib/heartbeat/rsctmp/send_arp/send_arp-147.156.1.56 eth0 147.156.1.56 auto 147.156.1.56 ffffffffffff
heartbeat: 2008/08/27_10:44:00 info: Running /etc/ha.d/resource.d/bancuv3 start
heartbeat: 2008/08/27_10:44:04 info: all HA resource acquisition completed (standby).
heartbeat: 2008/08/27_10:44:04 info: Standby resource acquisition done [all].
24
Active/Passive HA architecture

Additional system tuning is needed to improve recovery times:
Use a journaling filesystem (xfs, ext3, etc.)
To facilitate automatic DB recovery after a system crash, additional configuration is needed (in the case of DB2: AUTORESTART, LOGRETAIN, DB2_USE_PARALLEL_RECOVERY, ...)
Database recovery can be a time-consuming task (and is not deterministic)

The drawback of this architecture is that the secondary site's computational resources sit idle, waiting for a failure at the primary site.
A better approach consists of balancing the execution of DB instances and other VMs between both sites.
In the event of a contingency at one site, the VMs are migrated from the affected site to the available one.
VM migration consists of stopping the VM on the affected site and starting it on the available resources.
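Conceptually, the stop/start migration above boils down to two commands on the surviving dom0, which the Heartbeat resource script wraps (a sketch; the actual script is site-specific, and the resource and config names are taken from the earlier slides):

```
# On the surviving dom0 (colmena), after Heartbeat declares the peer dead:
drbdadm primary drbd1            # promote the local half of the mirror
xm create /etc/xen/bancuv3.cfg   # start the VM on the surviving site
```

Because the VM's disk is the DRBD device itself, no data needs to be copied at failover time; promotion plus boot is the whole migration.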
25
Active/Active HA architecture (I)

[Diagram: Active/Active architecture. Primary centre ("Brico-mania": Bresca Xen dom0 + primary disc array) and backup centre ("Deco-garden": Colmena Xen dom0 + secondary disc array), 10 km apart, linked by IP and FC, with Heartbeat for automatic failover. Four active VMs are spread across both sites: Academic VM, Human Resources VM, Data Warehouse VM and Secretary VM, each running over its own DRBD network RAID-1 device (/dev/drbd1 ... /dev/drbd4).]
26
Active/Active HA architecture (II)

[Diagram: same active/active layout as the previous slide: four active VMs spread across Bresca and Colmena, each over its own DRBD network RAID-1 device.]

bresca:~# xm list
Name        ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0     0      1209      2  r-----  66538.2
bancuv3     13      1024      1  -b----   3286.8
webges06     6       256      1  -b----     83.6

colmena:~# xm list
Name        ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0     0      2575      4  r-----    492.2
rac2         1      1024      1  -b----    458.3
webges05     2       256      1  -b----     69.7
27
Active/Active HA architecture (III)

[Diagram: XenServer A and XenServer B, 10 km apart, hosting VM1-VM4 over distributed RAID block devices (/dev/drbd1 ... /dev/drbd4) mirrored over the IP network, with heartbeat between them.]

• In the event of a failure, Heartbeat migrates the affected VMs to the available resources at that point in time
• The load of the surviving site will be increased (x2)
28
Active/Active HA architecture (IV)
• After a failure at one of the two sites, the load of the surviving site increases by a factor of two
• We will be up & running, but with worse response times
• Once the primary site's HW & SW resources have been recovered, the load is redistributed automatically
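The doubling implies a sizing rule we can sketch (utilization figures are hypothetical): in an active/active pair, each site should normally run at no more than about 50 % utilization, so the survivor can absorb the whole load after a site failure.

```python
# Sizing check for an active/active pair: after one site fails, the
# survivor must carry both sites' workloads.

def survivor_load(site_a_util: float, site_b_util: float) -> float:
    """Utilization of the surviving site after the other site fails."""
    return site_a_util + site_b_util

assert survivor_load(0.45, 0.40) <= 1.0  # fits: degraded response, still up
assert survivor_load(0.70, 0.60) > 1.0   # overload: SLA at risk after failover
```

This is the price of not leaving a site idle: the active/passive design wastes the standby's capacity, while the active/active design trades that waste for degraded response times during a contingency.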
[Diagram: after the failure, only XenServer B remains, running all four VMs (VM1-VM4) over its copies of the distributed RAID block devices (/dev/drbd1 ... /dev/drbd4).]
29
What we have learned

How to deploy HA architectures for DB-based information systems that automatically detect and recover from errors of the runtime (HW, SW) needed to run corporate applications
How to select the HA architecture for databases (active/passive, active/active, single system image) that best fits the business's SLA
How to automate the major steps involved in detecting and recovering from errors of a given component of the DB runtime
How to configure & use the open-source tools (Xen, Heartbeat, DRBD, OpenSSI) needed to implement high availability architectures