TRANSCRIPT
1
MARS: Experience with managing a multi-petabyte archive
Baudouin Raoult
Head of Data and Service Section
ECMWF
2
Supporting States and Co-operation
Belgium, Ireland, Portugal, Denmark, Italy, Switzerland, Germany, Luxembourg, Finland, Spain, The Netherlands, Sweden, France, Norway, Turkey, Greece, Austria, United Kingdom
Co-operation agreements or working arrangements with: Czech Republic, Croatia, Estonia, Hungary, Iceland, Latvia, Lithuania, Montenegro, Morocco, Romania, Serbia, Slovakia, Slovenia, and with ACMAD, ESA, EUMETSAT, WMO, JRC, CTBTO, CLRTAP
3
ECMWF Objectives
Operational forecasting up to 15 days ahead (including waves)
R & D activities in forecast modelling
Data archiving and related services
Operational forecasts for the coming month and season
Advanced NWP training
Provision of supercomputer resources
Assistance to WMO programmes
Management of Regional Meteorological Data Communications Network (RMDCN)
4
Current computer configuration
October 2008
5
ECMWF Forecasting system
6
Atmosphere global forecasts
Forecast to ten days from 00 and 12 UTC at 25 km resolution and 91 levels
50 ensemble forecasts to fifteen days from 00 and 12 UTC at 50 km resolution
Ocean wave forecasts
Global forecast to ten days from 00 and 12 UTC at 50 km resolution
European waters forecast to five days from 00 and 12 UTC at 25 km resolution
Monthly forecasts: Atmosphere-ocean coupled model
Global forecasts to one month:
atmosphere: 1.125° resolution, 62 levels
ocean: horizontally-varying resolution (…° to 1°), 9 levels
Seasonal forecasts: Atmosphere-ocean coupled model
Global forecasts to six months:
atmosphere: 1.8° resolution, 40 levels
ocean: horizontally-varying resolution (…° to 1°), 9 levels
ECMWF Forecast Products
7
What is a field? An object uniquely identified by:
Forecasting system
Date
Analysis time
Level
Parameter
Time step
…
Up to 11 attributes
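As an illustration of such a key, the sketch below models a field identifier as a C++ value type holding a subset of the attributes listed above, usable as a map key in a metadata index. The attribute set, names and types are illustrative assumptions, not the actual MARS schema.

    // Sketch only: a hypothetical field key with a subset of the attributes
    // listed on the slide (the real MARS key has up to 11 of them).
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>

    struct FieldKey {
        std::string system;   // forecasting system, e.g. "oper"
        int         date;     // e.g. 20010101
        int         time;     // analysis time, e.g. 12 (UTC)
        int         step;     // forecast step in hours
        int         level;    // e.g. 500 (hPa)
        std::string param;    // e.g. "temperature"

        bool operator==(const FieldKey& o) const {
            return system == o.system && date == o.date && time == o.time &&
                   step == o.step && level == o.level && param == o.param;
        }
    };

    // Hash combining the attributes, so a FieldKey can index an unordered_map
    // from field metadata to the field's location in the archive.
    struct FieldKeyHash {
        std::size_t operator()(const FieldKey& k) const {
            std::size_t h = std::hash<std::string>{}(k.system);
            auto mix = [&h](std::size_t v) { h ^= v + 0x9e3779b9 + (h << 6) + (h >> 2); };
            mix(std::hash<int>{}(k.date));
            mix(std::hash<int>{}(k.time));
            mix(std::hash<int>{}(k.step));
            mix(std::hash<int>{}(k.level));
            mix(std::hash<std::string>{}(k.param));
            return h;
        }
    };

    // Example: metadata index mapping a field key to an offset in a tape file.
    using FieldIndex = std::unordered_map<FieldKey, std::uint64_t, FieldKeyHash>;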
8
MARS: A managed archive
Meteorological Archival and Retrieval System
22 years of existence
Retrievals expressed in meteorological terms
Post-processing facilities:
Interpolation between various data representations
Interpolation on coarser grids
Sub-area extractions
Manages meteorological fields
Data in GRIB and BUFR format according to WMO standards
9
MARS: A managed archive (cont.)
Not a file system
Users are not aware of the location of the data
An archive, not a database
Metadata online
Data offline
10
A meteorological language
Retrieve,
  date = 20010101/to/20010131,
  parameter = temperature/geopotential,
  type = forecast,
  step = 12/to/240/by/12,
  levels = 1000/850/500/200,
  grid = 2/2,
  area = -10/20/10/0
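A request like this names ranges and lists rather than files: each combination of date, parameter, step and level identifies one field, so this single request expands to 31 × 2 × 20 × 4 = 4,960 fields. The sketch below illustrates that expansion; the structures and names are invented for illustration and are not the MARS client itself.

    // Sketch: expanding one MARS request into the individual fields it names.
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Request {
        std::vector<int>         dates;   // 20010101 .. 20010131
        std::vector<std::string> params;  // temperature, geopotential
        std::vector<int>         steps;   // 12 .. 240 by 12
        std::vector<int>         levels;  // 1000, 850, 500, 200
    };

    int main() {
        Request r;
        for (int d = 1; d <= 31; ++d)        r.dates.push_back(20010100 + d);
        r.params = {"temperature", "geopotential"};
        for (int s = 12; s <= 240; s += 12)  r.steps.push_back(s);
        r.levels = {1000, 850, 500, 200};

        long fields = 0;
        for (int date : r.dates)
            for (const std::string& param : r.params)
                for (int step : r.steps)
                    for (int level : r.levels) {
                        ++fields;  // each (date, param, step, level) combination is one field
                        if (fields == 1)
                            std::printf("first field: %d %s step=%d level=%d\n",
                                        date, param.c_str(), step, level);
                    }

        std::printf("the request expands to %ld fields\n", fields);  // 4960
        return 0;
    }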
11
MARS: Contents
All operational model outputs
Analysis, Forecasts, EPS, Seasonal, Wave
Climatologies, Hindcasts
ECMWF research experiments
Member States' research experiments
Observations:
Conventional, Satellite
Analysis input (for restartability)
Analysis feedback
Images
12
MARS: Contents (cont.)
Member States' own model data: HIRLAM, COSMO, …
International collaborations: PROVOST, ECSN, ENSEMBLE, DEMETER, TIGGE, …
Reanalysis: ERA15, ERA40, ERA Interim
Other centres (Washington, Tokyo, Toulouse, Offenbach, Exeter …)
For comparison
13
Archive size vs. Supercomputer power
[Chart: archive size (TB) and HPC performance (GFLOPs) on a logarithmic scale, across successive ECMWF supercomputers: Cray-1A (11/1978), X-MP/2 (11/1983), X-MP/4 (01/1986), X-MP/8 (01/1990), C90/12 (01/1992), C90/16 (01/1993), VPP700/48 (06/1996), VPP700-112 (10/1997), VPP5000 (04/1999), IBM-P4 (12/2002), IBM-P5 (07/2004), IBM-P5+ (12/2006), IBM-P6 (01/2009)]
14
Archive size vs. Supercomputer power
[Chart: the same archive size (TB) and HPC (GFLOPs) series as above, plotted on linear axes]
15
MARS in numbers (March 2009)
~1000 active users, at ECMWF and in the Member States
Analysis from 1980, Forecasts from 1985
ERA40: Analysis and observations since 1957
6 PByte of data in 3.4 × 10^10 fields
8.2 million files
Growing daily by 10 TBytes (30 million fields)
About 300,000 requests per day (100 million fields)
16
MARS: Requirements
Scalable
Data volumes (many orders of magnitude: TB, PB, EB, …)
Number of fields (hundreds of billions)
Number of requests
Robust
Power cuts, disks full, network glitches, damaged tapes
Data loss is unacceptable
Performant
Hardware is expensive: make the most of the available resources (CPU, disks, tape drives, network, …)
Human resources are even more expensive: users should not wait too long…
17
MARS: Requirements (cont.)
Sustainable
Data must be readable 100 years from now (and more…)
Archive must survive technology changes (hardware and software)
Capable of evolution
Support for new data types
Support for new communities of users
Serves operations and research
Different expectations
18
MARS: Requirements (cont.)
Data integrity (the most important one)
Did we archive already-corrupted data?
Did we archive the wrong data?
Did the network corrupt the data?
Did the disks corrupt the data?
Did the tape drives corrupt the data?
Is the tape damaged?
Was there a software bug?
Consequences:
Impossible to investigate an event that happened several years ago
Loss of confidence in the data: one corrupted piece of data = lack of trust in the whole archive
19
Data Integrity
Self-describing data
Checking the data to be archived:
By the client before sending to the server
By the server on reception
Data is retrieved again and checked against the original (see the sketch after this list)
All disks RAID or mirrored
“Enterprise quality” drives and tapes
Backups made from the primary tape copy
Backups on a different tape technology
Software does not touch the data
A lot of testing…
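A minimal sketch of the verify-after-write idea listed above, assuming a placeholder checksum and a hypothetical archive interface; the real system checks the actual GRIB/BUFR contents at each stage, not this code.

    // Sketch: a checksum is taken before the data leaves the client, and the
    // archived copy is read back and compared with the original.
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    using Blob = std::vector<std::uint8_t>;

    // Simple FNV-1a checksum as a stand-in for whatever integrity check is used.
    std::uint64_t checksum(const Blob& data) {
        std::uint64_t h = 1469598103934665603ull;
        for (std::uint8_t b : data) { h ^= b; h *= 1099511628211ull; }
        return h;
    }

    // Hypothetical archive interface: store returns an identifier, fetch reads back.
    struct Archive {
        std::vector<Blob> store_;
        std::size_t store(const Blob& b) { store_.push_back(b); return store_.size() - 1; }
        Blob fetch(std::size_t id) const { return store_.at(id); }
    };

    void archiveChecked(Archive& archive, const Blob& field) {
        const std::uint64_t before = checksum(field);   // client side, before sending

        std::size_t id = archive.store(field);          // server stores the data

        Blob readBack = archive.fetch(id);              // retrieve it again...
        if (checksum(readBack) != before)               // ...and compare to the original
            throw std::runtime_error("archived copy does not match original");
    }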
20
Scalability, manageability, performances
Keep the number of files to a minimum: large files
Use collocation to reduce tape access: tape families, large files (sketched after this list)
Manage queues and priorities
Build a system in which data can be moved around, where files can be split, joined and migrated
Minimise dependencies on commercial software:
New releases may force you to perform unwanted migrations
Two commercial products may become incompatible
They may not be there in 20 years
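A possible sketch of collocation by tape family, assuming families are derived from a field's stream and month so that fields usually requested together end up on the same tapes; the naming scheme is invented for illustration.

    // Sketch: route fields to "tape families" so a retrieval touches few tapes.
    #include <map>
    #include <string>
    #include <vector>

    struct FieldMeta {
        std::string stream;  // e.g. "oper", "wave", "enfo"
        int date;            // e.g. 20010115
    };

    // All fields of one stream for one month go to the same family.
    std::string tapeFamily(const FieldMeta& f) {
        return f.stream + "/" + std::to_string(f.date / 100);  // e.g. "oper/200101"
    }

    // Group incoming fields by family; each group is later written as one
    // large tape file instead of many small ones.
    std::map<std::string, std::vector<FieldMeta>>
    groupByFamily(const std::vector<FieldMeta>& fields) {
        std::map<std::string, std::vector<FieldMeta>> groups;
        for (const auto& f : fields) groups[tapeFamily(f)].push_back(f);
        return groups;
    }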
21
Scalability, manageability, performances
OO design, based on three abstract classes:
Tape manager (create, remove, exists, location, …)
Tape file (size, last access, …)
Data stream (open, read, write, close, partial reads)
Physical layer is abstracted, indirection is the key
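The sketch below shows what such an abstraction layer could look like in C++; only the methods named on the slide are shown, and the signatures are assumptions rather than the actual MARS interfaces.

    // Sketch of the three abstract classes named on the slide.
    #include <cstddef>
    #include <cstdint>
    #include <ctime>
    #include <string>

    // Tape manager: create, remove, exists, location, ...
    class TapeManager {
    public:
        virtual ~TapeManager() = default;
        virtual void        create(const std::string& name)   = 0;
        virtual void        remove(const std::string& name)   = 0;
        virtual bool        exists(const std::string& name)   = 0;
        virtual std::string location(const std::string& name) = 0;
    };

    // Tape file: size, last access, ...
    class TapeFile {
    public:
        virtual ~TapeFile() = default;
        virtual std::uint64_t size()       const = 0;
        virtual std::time_t   lastAccess() const = 0;
    };

    // Data stream: open, read, write, close, partial reads
    class DataStream {
    public:
        virtual ~DataStream() = default;
        virtual void        open(const std::string& name)                 = 0;
        virtual std::size_t read(void* buffer, std::size_t length)        = 0;
        virtual std::size_t read(void* buffer, std::size_t length,
                                 std::uint64_t offset)                    = 0;  // partial read
        virtual std::size_t write(const void* buffer, std::size_t length) = 0;
        virtual void        close()                                       = 0;
    };

    // Because all tape access goes through such interfaces, the physical layer
    // (TSM, HPSS, plain disks, ...) can be swapped without touching the rest of MARS.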
[Diagram: meteorological server and data server, each with its own metadata and caches; requests, references and data flow between them and the tape system]
22
MARS manages its own caches
“Pre-archive” space
Data is first stored there, as produced by the model
• Efficient archival
Allows incremental archiving
Data is then sorted and aggregated into large tape files
• Efficient retrievals
Retrieval caches
Field-level caching: only small parts of files are cached (sketched below)
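A minimal sketch of field-level caching with least-recently-used eviction; the key type, size accounting and eviction policy are assumptions for illustration, not the actual MARS cache.

    // Sketch: cache individual fields (not whole tape files) after a read.
    #include <cstddef>
    #include <list>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    class FieldCache {
    public:
        explicit FieldCache(std::size_t maxBytes) : maxBytes_(maxBytes) {}

        // Returns the cached field, or nullptr on a miss.
        const std::vector<char>* find(const std::string& key) {
            auto it = index_.find(key);
            if (it == index_.end()) return nullptr;
            lru_.splice(lru_.begin(), lru_, it->second);  // mark as most recently used
            return &it->second->second;
        }

        // Stores one field (assumes each key is inserted once); evicts the
        // least recently used fields when over the byte budget.
        void insert(const std::string& key, std::vector<char> field) {
            bytes_ += field.size();
            lru_.emplace_front(key, std::move(field));
            index_[key] = lru_.begin();
            while (bytes_ > maxBytes_ && lru_.size() > 1) {
                bytes_ -= lru_.back().second.size();
                index_.erase(lru_.back().first);
                lru_.pop_back();
            }
        }

    private:
        using Entry = std::pair<std::string, std::vector<char>>;
        std::size_t maxBytes_;
        std::size_t bytes_ = 0;
        std::list<Entry> lru_;
        std::unordered_map<std::string, std::list<Entry>::iterator> index_;
    };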
23
MARS manages its own queues
Three queues:
User request queues
Tape read queues
Tape write queues
One user request can create several tape read requests
Read requests are sorted according to volume and position (see the sketch after this list)
All possible requests for a volume are processed
Write requests are sorted by families
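A sketch of that read-queue ordering, assuming each pending read carries a volume identifier and a position on that volume; reads for the same volume are then served in a single mount, in increasing position order.

    // Sketch: order pending tape reads to minimise mounts and seeks.
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct TapeRead {
        std::string   volume;    // tape volume identifier
        std::uint64_t position;  // block/offset on that volume
    };

    void sortReads(std::vector<TapeRead>& queue) {
        std::sort(queue.begin(), queue.end(),
                  [](const TapeRead& a, const TapeRead& b) {
                      if (a.volume != b.volume) return a.volume < b.volume;
                      return a.position < b.position;  // forward order within a volume
                  });
    }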
24
MARS manages its own queues – cont.
A fixed number of tape drives is allocated for reading or writing
Queues and disk spaces are monitored
Results are fed into a decision-making algorithm (sketched below)
Drive allocation is adjusted accordingly
Better control:
Minimise tape mounts
Optimise tape drive usage
Priorities (serve VIPs first)
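One possible decision rule, assuming the monitored quantities are the two queue lengths and the pre-archive disk usage; the policy below is invented for illustration and is not the actual MARS algorithm.

    // Sketch: split a fixed pool of tape drives between reading and writing,
    // revisiting the split as monitored conditions change (assumes >= 2 drives).
    #include <algorithm>

    struct Snapshot {
        int    pendingReads;     // length of the tape read queue
        int    pendingWrites;    // length of the tape write queue
        double archiveDiskUsed;  // fraction of the pre-archive space in use
    };

    // Decide how many of `totalDrives` to dedicate to writing.
    int writeDrives(const Snapshot& s, int totalDrives) {
        // If the pre-archive space is filling up, favour writing to tape.
        if (s.archiveDiskUsed > 0.9) return std::max(1, totalDrives - 1);
        // Otherwise split roughly in proportion to the two queues, always
        // keeping at least one drive for each activity.
        int total = std::max(1, s.pendingReads + s.pendingWrites);
        int w = totalDrives * s.pendingWrites / total;
        return std::min(std::max(w, 1), totalDrives - 1);
    }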
25
WEBMARS: Activity
System activity
Queues
Progress of requests
Instructive
Better usage
26
A brief history
1987 - MARS I: CFS - IBM/MVS mainframe - PL/I
Data in CFS: Common File System from Los Alamos
Ran out of steam after 5 million files
1997 - MARS II: TSM - AIX - C++
TSM (ADSM): a backup system from Tivoli (IBM)
Problem with support: ECMWF is one of the largest TSM sites, atypical use of the software
2001 - MARS III: HPSS - AIX - C++
HPSS is designed for and used by large scientific sites
HPSS is very scalable, excellent support
ECMWF is a relatively small site in the HPSS community
27
Back-archive
Copying data between systems
Costly:
In human resources
In computer resources
In time
Financially
Must be done without service interruption
Do not underestimate it!
Two back-archives:
1997: 25 TB (18 months)
2001: 300 TB (9 months)
28
Statistics gathering
Statistics are gathered continuously
Short term, to check the health of the system
Medium term, to check the effect of a change in the system (new disks, reconfiguration of the queues)
Long term, for evolution planning
What?
System activity (CPU, disk usage, tape mounts, …)
Application performance (cache hits, …)
User “experience” (number of fields per second)
The system needs to be tuned regularly
Hardware changes (disks, tape drives, …)
Access patterns change (new users, new projects, …)
29
Cache hits in the various MARS servers
30
Changing tape technologies
31
QoS: MARS performance from a user point of view
32
Total MARS Data Archived Daily
New HPC / Parallel run
33
With a 60% growth per annum, half of the archive is less than 18 months old
Deletion of 1 petabyte
34
Conclusion
Optimise resource usage: data collocation, tape families
Minimise the number of files
Implement QoS through queues
Use enterprise-quality tapes and drives
Allow for migration and back-archive
Gather statistics, analyse trends, tune and re-tune
Test everything many times
Corruption will happen, it is a fact: check data integrity at all stages, before it is too late; do not trust built-in mechanisms
35
Thank you