CERN DB Services: Status, Activities, Announcements
Replication Technology Evolution for ATLAS Data Workshop, 3rd of June 2014
Marcin Blaszczyk - IT-DB
3
Recap
• Last workshop: 16th Nov 2010 – at that time:
  • We were using 10.2.0.4
  • We were installing new hardware to replace RAC3 & RAC4
    • RAC8 in “Safehost” for standbys
    • RAC9 for integration DBs
  • 11.2 evaluation process
  • 10.2.0.5 upgrade under planning
• Infrastructure for Physics DB Services
  • Quad-core machines with 16GB of RAM
  • FC infrastructure for storage (~2500 disks)
4
Things have changed…
• Service evolution
  • RAC8 in Safehost for standbys installed
    • Performed in Q3 2010
    • To assure geographical separation for DR
  • New standby installations – one for each production DB
• 10.2.0.5 upgrade
  • Performed in Q1 2011
5
Oracle 11gR2
• SW upgrade + HW migration
  • Target version 11.2.0.3
  • Performed in Q1 2012
• HW migration
  • New HW installations (RAC10 & RAC11)
  • 8-core (16-thread) CPUs, 48GB of memory
  • Move from ASM to NAS
    • NetApp NAS storage
• Replication technology
  • Usage of Streams replication gradually reduced
  • Usage of Active Data Guard has grown
6
Offloading with ADG
• Offloading backups to ADG
  • Significantly reduces load on the primary
  • Removes the sequential I/O of a full backup
• Offloading queries to ADG
  • Transactional workload runs on the primary
  • Read-only workload can be moved to ADG
  • Examples of workload on our ADGs:
    • Ad-hoc queries, analytics and long-running reports, parallel queries, unpredictable workload and test queries
• ORA-1555 (snapshot too old)
  • Sporadic occurrences
  • Oracle bug – to be confirmed whether present in 11.2.0.4
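As background for the query-offloading bullets above, the standard way a physical standby becomes an Active Data Guard standby is to open it read-only while redo apply keeps running. A minimal sketch (run on the standby as SYSDBA; instance management details omitted):

```sql
-- Sketch: opening a physical standby as Active Data Guard.
-- Stop managed recovery, open read-only, then resume real-time apply
-- so the standby keeps applying redo while serving queries.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE OPEN READ ONLY;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;
```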
7
New Architecture with ADG
[Diagram: two deployment patterns]
1. Low load ADG: the primary database ships redo (maximum performance mode) to a single Active Data Guard standby used both for users’ access and for disaster recovery
2. Busy & critical ADG: the primary database ships redo (maximum performance mode) to two standbys – an Active Data Guard for users’ access and a second standby for disaster recovery
• Disaster recovery
• Offloading read-only workload
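The “maximum performance” shipping in both patterns corresponds to asynchronous redo transport configured on the primary; a minimal sketch of the relevant parameter (the service and DB_UNIQUE_NAME values are made up for illustration):

```sql
-- Sketch: asynchronous redo transport to a standby (maximum performance
-- mode). 'adg_stby' is a hypothetical TNS service / DB_UNIQUE_NAME.
ALTER SYSTEM SET log_archive_dest_2 =
  'SERVICE=adg_stby ASYNC NOAFFIRM
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=adg_stby' SCOPE=BOTH;
```

ASYNC NOAFFIRM means the primary never waits for the standby to acknowledge redo, which is what keeps the transport from affecting primary performance.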
8
IT-DB Service on 11gR2
• IT-DB service much more stable
  • Workload has been stabilized
  • High loads and node reboots eliminated
• More powerful HW
• Offloading to ADG helps a lot
• 11g clusterware more stable
• Storage model benefited from using NAS
  • A single (or multiple) disk failure can no longer affect the DB service
• Faster and less vulnerable Streams replication
9
Preparation for Run 2
• Oracle SW
  • No single version fits the entire Run 2
  • New software versions: 11.2.0.4 vs 12.1.0.1
• New HW
  • 32-thread CPUs, 128/256GB memory
• New NetApp storage model
  • More SSD cache
  • Consolidated storage
10
Hardware upgrades in Q1 2014
• New servers and storage
  • Servers: more RAM, more CPU
    • 128GB of RAM (vs 48GB on current production machines)
  • Storage: more SSD cache
    • Newer NetApp model
    • Consolidated storage
• Refresh cycle of OS and OS-related tooling
  • Puppet & RHEL 6
• Refresh cycle of our HW
  • New HW for production
  • Current production HW will be moved to standby
11
Software upgrades in Q1 2014
• Available Oracle releases
  • 11.2.0.4
  • 12.1.0.1
• Evolution – how to balance:
  • Stable services
  • Latest releases for bug fixes
  • Newest releases for new features
  • Fit with the LHC schedule
12
DBAs & workload validation
• DBAs can:
  • Test upgrades of integration and production databases
  • Share experience across user communities
  • CAPTURE and REPLAY database workload with RAT testing
    • Capture workload from production and replay it on the upgraded DB
    • Useful to catch bugs and regressions
    • Unfortunately it cannot cover the edge cases
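The capture/replay cycle above uses Oracle Real Application Testing; a rough sketch of the PL/SQL calls involved (the capture name and directory objects are illustrative, and exact parameters vary between releases):

```sql
-- Sketch: RAT workload capture on production.
BEGIN
  DBMS_WORKLOAD_CAPTURE.START_CAPTURE(
    name     => 'run2_validation',  -- illustrative capture name
    dir      => 'CAPTURE_DIR',      -- hypothetical directory object
    duration => 3600);              -- capture for one hour
END;
/
-- ...or stop it explicitly later:
EXEC DBMS_WORKLOAD_CAPTURE.FINISH_CAPTURE;

-- Sketch: replay on the upgraded test database,
-- after the capture files have been preprocessed there.
BEGIN
  DBMS_WORKLOAD_REPLAY.INITIALIZE_REPLAY(
    replay_name => 'run2_validation',
    replay_dir  => 'REPLAY_DIR');   -- hypothetical directory object
  DBMS_WORKLOAD_REPLAY.PREPARE_REPLAY;
  DBMS_WORKLOAD_REPLAY.START_REPLAY;
END;
/
```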
13
Validation by the users
• Validation by the application owners is very valuable to reduce risk
  • Functional tests
  • Tests with ‘real world’ data sizes
  • Tests with concurrent workload
• The criticality depends
  • On the complexity of the application
  • On how well they can test their SQL
14
Recent Changes: Q1-Q2 2014
• DB services for Experiments/WLCG
  • Target version 11.2.0.4
  • Exceptions – target 12c:
    • ATLARC
    • LHCBR
    • A few more IT-DB services
• Interventions took 2-5 hours of DB downtime
  • Depending on system complexity: standby infrastructure, number of nodes etc.
15
Upgrade technique - overview
[Diagram: a primary RAC database with RW access shipping redo to a Data Guard RAC database, upgraded in six steps]
• Start: Clusterware 11g+ and RDBMS 11.2.0.3 on both primary and standby
• Clusterware upgraded to 12c on both sides while the RDBMS stays at 11.2.0.3; redo transport keeps running throughout
• RDBMS upgraded to 11.2.0.4 – the only phase requiring DATABASE downtime
• Upgrade complete!
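The RDBMS step, where the downtime occurs, is conventionally a catalog upgrade run from the new Oracle home; a minimal sketch in SQL*Plus (pre-upgrade checks and RAC specifics omitted, and not necessarily the exact procedure used here):

```sql
-- Sketch: in-place catalog upgrade to 11.2.0.4, run from the new
-- Oracle home during the downtime window ('?' expands to ORACLE_HOME).
STARTUP UPGRADE;                -- open the database in upgrade mode
@?/rdbms/admin/catupgrd.sql     -- run the catalog upgrade script
-- catupgrd.sql shuts the instance down when it finishes; then:
STARTUP;
@?/rdbms/admin/utlrp.sql        -- recompile invalid objects
```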
16
Phased approach to 12c
• Some DBs already on 12.1
  • ATLARC, LHCBR
  • Smooth upgrade
  • No major issues discovered so far
• Following Oracle SW evolution, depending on:
  • Feedback on the next 12c releases (12.2)
  • Testing status
  • Possibility to schedule upgrades
• Next possible slot for upgrades to the 12c 1st patchset
  • Technical stop Q4 2014/Q1 2015?
  • Candidates: offline DBs (ATLR, CMSR, LCGR…)
17
Monitoring & Security
• Monitoring
  • RacMon
  • EM12c
  • Strmmon
• Support level during LS1
  • Best effort
• Security
  • AuditMon
  • Firewall rules for external access
    • For ADCR in 2013
    • For ATLR in 2014
IT-DB Operations Report
ATLAS databases
• Production DBs: 12 nodes*, ~69 TB of data
  – ATONR: 2 nodes, ~8 TB
  – ADCR: 4 nodes, ~19.5 TB
  – ATLR: 3 nodes, ~20.5 TB
  – ATLARC: 2 nodes, ~17 TB
  – *ATLAS DASHBOARD (1 node of the WLCG database), ~4 TB
• Standby DBs: 14 nodes, ~75 TB of data
  – ATONR_ADG: 2 nodes; ATONR_DG: 2 nodes
  – ADCR_ADG: 4 nodes; ADCR_DG: 3 nodes
  – ATLR_DG: 3 nodes
• Integration DBs: 4 nodes, ~18 TB of data
  – INTR: 2 nodes, ~7.5 TB
  – INT8R: 2 nodes, ~9 TB
  – **ATLASINT: 2 nodes, ~2 TB (will be consolidated with INT8R)
• Nearly 165 TB of space, 30 database servers
• 12* databases (11 RAC clusters + 1 dedicated RAC node*)
19
Replication for ATLAS - current status
20
Replication for ATLAS - plans
• Replication changes overview
  • PVSS
    • Read-only replica: Active Data Guard
  • COOL
    • Online -> Offline: GoldenGate
    • Offline -> Tier1s: GoldenGate
  • MUON
    • Streams will be stopped once the new ATLAS solution for custom data movement is in place
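For the COOL paths above, a GoldenGate capture process is driven by a parameter file; a hypothetical sketch of an Extract configuration (the process, user, trail and schema names are all illustrative):

```
-- Sketch of a GoldenGate Extract parameter file (classic capture);
-- every name below is hypothetical.
EXTRACT ext_cool
USERID ggadmin, PASSWORD ********
EXTTRAIL ./dirdat/ec
TABLE ATLAS_COOL.*;
```

A matching Replicat on the target side then applies the trail files, which is what replaces the Streams capture/apply pair.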
21
Conclusions
• Focus on stability for DB services
• Software evolution
  • Critical services have just moved to 11.2.0.4
  • Long-term perspective: keep testing towards 12c
• HW evolution
• Technology evolution for replication
  • ADG & GG will fully replace Oracle Streams
22
Acknowledgements
• Work presented here on behalf of:
  • CERN Database Group
Replication Technology Evolution for ATLAS Data Workshop, 3rd of June 2014
Thank you!
[email protected]
24