TRANSCRIPT
1
MARS: Experience with managing a multi-petabyte archive
Baudouin Raoult
Head of Data and Service Section
ECMWF
2
Supporting States and Co-operation
Belgium, Ireland, Portugal, Denmark, Italy, Switzerland, Germany, Luxembourg, Finland, Spain, The Netherlands, Sweden, France, Norway, Turkey, Greece, Austria, United Kingdom
Co-operation agreements or working arrangements with: Czech Republic, Croatia, Estonia, Hungary, Iceland, Latvia, Lithuania, Montenegro, Morocco, Romania, Serbia, Slovakia, Slovenia, and with ACMAD, ESA, EUMETSAT, WMO, JRC, CTBTO, CLRTAP
3
ECMWF Objectives
Operational forecasting up to 15 days ahead (including waves)
R & D activities in forecast modelling
Data archiving and related services
Operational forecasts for the coming month and season
Advanced NWP training
Provision of supercomputer resources
Assistance to WMO programmes
Management of Regional Meteorological Data Communications Network (RMDCN)
4
Current computer configuration
October 2008
5
ECMWF Forecasting system
6
Atmosphere global forecasts
Forecast to ten days from 00 and 12 UTC at 25 km resolution and 91 levels
50 ensemble forecasts to fifteen days from 00 and 12 UTC at 50 km resolution
Ocean wave forecasts
Global forecast to ten days from 00 and 12 UTC at 50 km resolution
European waters forecast to five days from 00 and 12 UTC at 25 km resolution
Monthly forecasts: Atmosphere-ocean coupled model
Global forecasts to one month:
atmosphere: 1.125° resolution, 62 levels
ocean: horizontally-varying resolution (…° to 1°), 9 levels
Seasonal forecasts: Atmosphere-ocean coupled model
Global forecasts to six months:
atmosphere: 1.8° resolution, 40 levels
ocean: horizontally-varying resolution (…° to 1°), 9 levels
ECMWF Forecast Products
7
What is a field? An object uniquely identified by:
Forecasting system
Date
Analysis time
Level
Parameter
Time step
…
Up to 11 attributes
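As an illustration of such a key, the sketch below models a field identifier as a C++ value type holding a subset of the attributes listed above, usable as a map key in a metadata index. The attribute set, names and types are illustrative assumptions, not the actual MARS schema.

    // Sketch only: a hypothetical field key with a subset of the attributes
    // listed on the slide (the real MARS key has up to 11 of them).
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <string>
    #include <unordered_map>

    struct FieldKey {
        std::string system;   // forecasting system, e.g. "oper"
        int         date;     // e.g. 20010101
        int         time;     // analysis time, e.g. 12 (UTC)
        int         step;     // forecast step in hours
        int         level;    // e.g. 500 (hPa)
        std::string param;    // e.g. "temperature"

        bool operator==(const FieldKey& o) const {
            return system == o.system && date == o.date && time == o.time &&
                   step == o.step && level == o.level && param == o.param;
        }
    };

    // Hash combining the attributes, so a FieldKey can index an unordered_map
    // from field metadata to the field's location in the archive.
    struct FieldKeyHash {
        std::size_t operator()(const FieldKey& k) const {
            std::size_t h = std::hash<std::string>{}(k.system);
            auto mix = [&h](std::size_t v) { h ^= v + 0x9e3779b9 + (h << 6) + (h >> 2); };
            mix(std::hash<int>{}(k.date));
            mix(std::hash<int>{}(k.time));
            mix(std::hash<int>{}(k.step));
            mix(std::hash<int>{}(k.level));
            mix(std::hash<std::string>{}(k.param));
            return h;
        }
    };

    // Example: metadata index mapping a field key to an offset in a tape file.
    using FieldIndex = std::unordered_map<FieldKey, std::uint64_t, FieldKeyHash>;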
8
MARS: A managed archive
Meteorological Archival and Retrieval System
22 years of existence
Retrievals expressed in meteorological terms
Post-processing facilities:
Interpolation between various data representations
Interpolation on coarser grids
Sub-area extractions
Manages meteorological fields
Data in GRIB and BUFR format according to WMO standards
9
MARS: A managed archive (cont.)
Not a file system
Users are not aware of the location of the data
An archive, not a database
Metadata online
Data offline
10
A meteorological language
Retrieve,
  date = 20010101/to/20010131,
  parameter = temperature/geopotential,
  type = forecast,
  step = 12/to/240/by/12,
  levels = 1000/850/500/200,
  grid = 2/2,
  area = -10/20/10/0
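A request like this names ranges and lists rather than files: each combination of date, parameter, step and level identifies one field, so this single request expands to 31 × 2 × 20 × 4 = 4,960 fields. The sketch below illustrates that expansion; the structures and names are invented for illustration and are not the MARS client itself.

    // Sketch: expanding one MARS request into the individual fields it names.
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Request {
        std::vector<int>         dates;   // 20010101 .. 20010131
        std::vector<std::string> params;  // temperature, geopotential
        std::vector<int>         steps;   // 12 .. 240 by 12
        std::vector<int>         levels;  // 1000, 850, 500, 200
    };

    int main() {
        Request r;
        for (int d = 1; d <= 31; ++d)        r.dates.push_back(20010100 + d);
        r.params = {"temperature", "geopotential"};
        for (int s = 12; s <= 240; s += 12)  r.steps.push_back(s);
        r.levels = {1000, 850, 500, 200};

        long fields = 0;
        for (int date : r.dates)
            for (const std::string& param : r.params)
                for (int step : r.steps)
                    for (int level : r.levels) {
                        ++fields;  // each (date, param, step, level) combination is one field
                        if (fields == 1)
                            std::printf("first field: %d %s step=%d level=%d\n",
                                        date, param.c_str(), step, level);
                    }

        std::printf("the request expands to %ld fields\n", fields);  // 4960
        return 0;
    }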
11
MARS: Contents
All operational model outputs
Analysis, Forecasts, EPS, Seasonal, Wave
Climatologies, Hindcasts
ECMWF research experiments
Member States' research experiments
Observations:
Conventional, Satellite
Analysis input (for restartability)
Analysis feedback
Images
12
MARS: Contents (cont.)
Member States' own model data: HIRLAM, COSMO, …
International collaborations: PROVOST, ECSN, ENSEMBLE, DEMETER, TIGGE, …
Reanalysis: ERA15, ERA40, ERA Interim
Other centres (Washington, Tokyo, Toulouse, Offenbach, Exeter …)
For comparison
13
Archive size vs. Supercomputer power
[Chart: archive size (TB) and HPC performance (GFLOPs) on a logarithmic scale, across successive ECMWF supercomputers: Cray-1A (11/1978), X-MP/2 (11/1983), X-MP/4 (01/1986), X-MP/8 (01/1990), C90/12 (01/1992), C90/16 (01/1993), VPP700/48 (06/1996), VPP700-112 (10/1997), VPP5000 (04/1999), IBM-P4 (12/2002), IBM-P5 (07/2004), IBM-P5+ (12/2006), IBM-P6 (01/2009)]
14
Archive size vs. Supercomputer power
[Chart: the same archive size (TB) and HPC (GFLOPs) series as above, plotted on linear axes]
15
MARS in numbers (March 2009)
~1000 active users, at ECMWF and in the Member States
Analysis from 1980, Forecasts from 1985
ERA40: Analysis and observations since 1957
6 PByte of data in 3.4 × 10^10 fields
8.2 million files
Growing daily by 10 TBytes (30 million fields)
About 300,000 requests per day (100 million fields)
16
MARS: Requirements
Scalable
Data volumes (many orders of magnitude: TB, PB, EB, …)
Number of fields (hundreds of billions)
Number of requests
Robust
Power cuts, disks full, network glitches, damaged tapes
Data loss is unacceptable
Performant
Hardware is expensive: make the most of the available resources (CPU, disks, tape drives, network, …)
Human resources are even more expensive: users should not wait too long…
17
MARS: Requirements (cont.)
Sustainable
Data must be readable 100 years from now (and more…)
Archive must survive technology changes (hardware and software)
Capable of evolution
Support for new data types
Support for new communities of users
Serves operations and research
Different expectations
18
MARS: Requirements (cont.)
Data integrity (the most important one)
Did we archive already-corrupted data?
Did we archive the wrong data?
Did the network corrupt the data?
Did the disks corrupt the data?
Did the tape drives corrupt the data?
Is the tape damaged?
Was there a software bug?
Consequences:
Impossible to investigate an event that happened several years ago
Loss of confidence in the data: one corrupted piece of data = lack of trust in the whole archive
19
Data Integrity
Self-describing data
Checking the data to be archived:
By the client before sending to the server
By the server on reception
Data is retrieved again and checked against the original (see the sketch after this list)
All disks RAID or mirrored
“Enterprise quality” drives and tapes
Backups made from the primary tape copy
Backups on a different tape technology
Software does not touch the data
A lot of testing…
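A minimal sketch of the verify-after-write idea listed above, assuming a placeholder checksum and a hypothetical archive interface; the real system checks the actual GRIB/BUFR contents at each stage, not this code.

    // Sketch: a checksum is taken before the data leaves the client, and the
    // archived copy is read back and compared with the original.
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    using Blob = std::vector<std::uint8_t>;

    // Simple FNV-1a checksum as a stand-in for whatever integrity check is used.
    std::uint64_t checksum(const Blob& data) {
        std::uint64_t h = 1469598103934665603ull;
        for (std::uint8_t b : data) { h ^= b; h *= 1099511628211ull; }
        return h;
    }

    // Hypothetical archive interface: store returns an identifier, fetch reads back.
    struct Archive {
        std::vector<Blob> store_;
        std::size_t store(const Blob& b) { store_.push_back(b); return store_.size() - 1; }
        Blob fetch(std::size_t id) const { return store_.at(id); }
    };

    void archiveChecked(Archive& archive, const Blob& field) {
        const std::uint64_t before = checksum(field);   // client side, before sending

        std::size_t id = archive.store(field);          // server stores the data

        Blob readBack = archive.fetch(id);              // retrieve it again...
        if (checksum(readBack) != before)               // ...and compare to the original
            throw std::runtime_error("archived copy does not match original");
    }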
20
Scalability, manageability, performances
Keep the number of files to a minimum: large files
Use collocation to reduce tape access: tape families, large files (sketched after this list)
Manage queues and priorities
Build a system in which data can be moved around, where files can be split, joined and migrated
Minimise dependencies on commercial software:
New releases may force you to perform unwanted migrations
Two commercial products may become incompatible
They may not be there in 20 years
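A possible sketch of collocation by tape family, assuming families are derived from a field's stream and month so that fields usually requested together end up on the same tapes; the naming scheme is invented for illustration.

    // Sketch: route fields to "tape families" so a retrieval touches few tapes.
    #include <map>
    #include <string>
    #include <vector>

    struct FieldMeta {
        std::string stream;  // e.g. "oper", "wave", "enfo"
        int date;            // e.g. 20010115
    };

    // All fields of one stream for one month go to the same family.
    std::string tapeFamily(const FieldMeta& f) {
        return f.stream + "/" + std::to_string(f.date / 100);  // e.g. "oper/200101"
    }

    // Group incoming fields by family; each group is later written as one
    // large tape file instead of many small ones.
    std::map<std::string, std::vector<FieldMeta>>
    groupByFamily(const std::vector<FieldMeta>& fields) {
        std::map<std::string, std::vector<FieldMeta>> groups;
        for (const auto& f : fields) groups[tapeFamily(f)].push_back(f);
        return groups;
    }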
21
Scalability, manageability, performances
OO design, based on three abstract classes:
Tape manager (create, remove, exists, location, …)
Tape file (size, last access, …)
Data stream (open, read, write, close, partial reads)
Physical layer is abstracted, indirection is the key
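The sketch below shows what such an abstraction layer could look like in C++; only the methods named on the slide are shown, and the signatures are assumptions rather than the actual MARS interfaces.

    // Sketch of the three abstract classes named on the slide.
    #include <cstddef>
    #include <cstdint>
    #include <ctime>
    #include <string>

    // Tape manager: create, remove, exists, location, ...
    class TapeManager {
    public:
        virtual ~TapeManager() = default;
        virtual void        create(const std::string& name)   = 0;
        virtual void        remove(const std::string& name)   = 0;
        virtual bool        exists(const std::string& name)   = 0;
        virtual std::string location(const std::string& name) = 0;
    };

    // Tape file: size, last access, ...
    class TapeFile {
    public:
        virtual ~TapeFile() = default;
        virtual std::uint64_t size()       const = 0;
        virtual std::time_t   lastAccess() const = 0;
    };

    // Data stream: open, read, write, close, partial reads
    class DataStream {
    public:
        virtual ~DataStream() = default;
        virtual void        open(const std::string& name)                 = 0;
        virtual std::size_t read(void* buffer, std::size_t length)        = 0;
        virtual std::size_t read(void* buffer, std::size_t length,
                                 std::uint64_t offset)                    = 0;  // partial read
        virtual std::size_t write(const void* buffer, std::size_t length) = 0;
        virtual void        close()                                       = 0;
    };

    // Because all tape access goes through such interfaces, the physical layer
    // (TSM, HPSS, plain disks, ...) can be swapped without touching the rest of MARS.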
[Diagram: meteorological server and data server, each with its own metadata and caches; requests, references and data flow between them and the tape system]
22
MARS manages its own caches
“Pre-archive” space
Data is first stored there, as produced by the model
• Efficient archival
Allows incremental archiving
Data is then sorted and aggregated into large tape files
• Efficient retrievals
Retrieval caches
Field-level caching: only small parts of files are cached (sketched below)
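A minimal sketch of field-level caching with least-recently-used eviction; the key type, size accounting and eviction policy are assumptions for illustration, not the actual MARS cache.

    // Sketch: cache individual fields (not whole tape files) after a read.
    #include <cstddef>
    #include <list>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    class FieldCache {
    public:
        explicit FieldCache(std::size_t maxBytes) : maxBytes_(maxBytes) {}

        // Returns the cached field, or nullptr on a miss.
        const std::vector<char>* find(const std::string& key) {
            auto it = index_.find(key);
            if (it == index_.end()) return nullptr;
            lru_.splice(lru_.begin(), lru_, it->second);  // mark as most recently used
            return &it->second->second;
        }

        // Stores one field (assumes each key is inserted once); evicts the
        // least recently used fields when over the byte budget.
        void insert(const std::string& key, std::vector<char> field) {
            bytes_ += field.size();
            lru_.emplace_front(key, std::move(field));
            index_[key] = lru_.begin();
            while (bytes_ > maxBytes_ && lru_.size() > 1) {
                bytes_ -= lru_.back().second.size();
                index_.erase(lru_.back().first);
                lru_.pop_back();
            }
        }

    private:
        using Entry = std::pair<std::string, std::vector<char>>;
        std::size_t maxBytes_;
        std::size_t bytes_ = 0;
        std::list<Entry> lru_;
        std::unordered_map<std::string, std::list<Entry>::iterator> index_;
    };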
23
MARS manages its own queues
Three queues:
User request queues
Tape read queues
Tape write queues
One user request can create several tape read requests
Read requests are sorted according to volume and position (see the sketch after this list)
All possible requests for a volume are processed
Write requests are sorted by families
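A sketch of that read-queue ordering, assuming each pending read carries a volume identifier and a position on that volume; reads for the same volume are then served in a single mount, in increasing position order.

    // Sketch: order pending tape reads to minimise mounts and seeks.
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct TapeRead {
        std::string   volume;    // tape volume identifier
        std::uint64_t position;  // block/offset on that volume
    };

    void sortReads(std::vector<TapeRead>& queue) {
        std::sort(queue.begin(), queue.end(),
                  [](const TapeRead& a, const TapeRead& b) {
                      if (a.volume != b.volume) return a.volume < b.volume;
                      return a.position < b.position;  // forward order within a volume
                  });
    }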
24
MARS manages its own queues – cont.
A fixed number of tape drives is allocated for reading or writing
Queues and disk spaces are monitored
Results are fed into a decision-making algorithm (sketched below)
Drive allocation is adjusted accordingly
Better control:
Minimise tape mounts
Optimise tape drive usage
Priorities (serve VIPs first)
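One possible decision rule, assuming the monitored quantities are the two queue lengths and the pre-archive disk usage; the policy below is invented for illustration and is not the actual MARS algorithm.

    // Sketch: split a fixed pool of tape drives between reading and writing,
    // revisiting the split as monitored conditions change (assumes >= 2 drives).
    #include <algorithm>

    struct Snapshot {
        int    pendingReads;     // length of the tape read queue
        int    pendingWrites;    // length of the tape write queue
        double archiveDiskUsed;  // fraction of the pre-archive space in use
    };

    // Decide how many of `totalDrives` to dedicate to writing.
    int writeDrives(const Snapshot& s, int totalDrives) {
        // If the pre-archive space is filling up, favour writing to tape.
        if (s.archiveDiskUsed > 0.9) return std::max(1, totalDrives - 1);
        // Otherwise split roughly in proportion to the two queues, always
        // keeping at least one drive for each activity.
        int total = std::max(1, s.pendingReads + s.pendingWrites);
        int w = totalDrives * s.pendingWrites / total;
        return std::min(std::max(w, 1), totalDrives - 1);
    }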
25
WEBMARS: Activity
System activity
Queues
Progress of requests
Instructive
Better usage
26
A brief history
1987 - MARS I: CFS - IBM/MVS mainframe - PL/I
Data in CFS: Common File System from Los Alamos
Ran out of steam after 5 million files
1997 - MARS II: TSM - AIX - C++
TSM (ADSM): a backup system from Tivoli (IBM)
Problem with support: ECMWF is one of the largest TSM sites, atypical use of the software
2001 - MARS III: HPSS - AIX - C++
HPSS is designed for and used by large scientific sites
HPSS is very scalable, excellent support
ECMWF is a relatively small site in the HPSS community
27
Back-archive
Copying data between systems
Costly:
In human resources
In computer resources
In time
Financially
Must be done without service interruption
Do not underestimate it!
Two back-archives:
1997: 25 TB (18 months)
2001: 300 TB (9 months)
28
Statistics gathering
Statistics are gathered continuously
Short term, to check the health of the system
Medium term, to check the effect of a change in the system (new disks, reconfiguration of the queues)
Long term, for evolution planning
What?
System activity (CPU, disk usage, tape mounts, …)
Application performance (cache hits, …)
User “experience” (number of fields per second)
The system needs to be tuned regularly
Hardware changes (disks, tape drives, …)
Access patterns change (new users, new projects, …)
29
Cache hits in the various MARS servers
30
Changing tape technologies
31
QoS: MARS performance from a user point of view
32
Total MARS Data Archived Daily
New HPC / Parallel run
33
With a 60% growth per annum, half of the archive is less than 18 months old
Deletion of 1 petabyte
34
Conclusion
Optimise resource usage: data collocation, tape families
Minimise the number of files
Implement QoS through queues
Use enterprise-quality tapes and drives
Allow for migration and back-archive
Gather statistics, analyse trends, tune and re-tune
Test everything many times
Corruption will happen, it is a fact: check data integrity at all stages, before it is too late; do not trust built-in mechanisms
35
Thank you