developing & managing a large linux farm – the brookhaven experience chep2004 – interlaken...
![Page 1: Developing & Managing A Large Linux Farm – The Brookhaven Experience CHEP2004 – Interlaken September 27, 2004 Tomasz Wlodek - BNL](https://reader030.vdocuments.site/reader030/viewer/2022032806/56649eff5503460f94c14460/html5/thumbnails/1.jpg)
Developing & Managing A Large Linux Farm – The Brookhaven Experience
CHEP2004 – Interlaken
September 27, 2004
Tomasz Wlodek - BNL
Background
Brookhaven National Lab (BNL) is a multi-disciplinary research laboratory funded by the US government.
BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments.
The RHIC Computing Facility (RCF) was formed in the mid-90s to address the computing needs of the RHIC experiments.
Background (cont.)
BNL has also been chosen as the site of the Tier-1 ATLAS Computing Facility (ACF) for the ATLAS experiment at CERN.
RCF/ACF supports HENP and HEP scientific computing efforts and various general services (backup, e-mail, web, off-site data transfer, Grid, etc.).
Background (cont.)
The Linux Farm is the main source of CPU (and increasingly storage) resources in the RCF/ACF
RCF/ACF is transforming itself from a local resource into a national and global resource
Growing design and operational complexity
Increasing staffing levels to handle additional responsibilities
RCF/ACF Structure
Staff Growth at the RCF/ACF
[Chart: staff levels (FTE) by year, 1997 to 2005 (est.); vertical axis from 0 to 35]
The Pre-Grid Era
Rack-mounted commodity hardware
Self-contained, localized resources
Resources available only to local users
Little interaction with external resources at remote locations
Considerable freedom to set own usage policies
The (Near-Term) Future
Resources available globally
Distributed computing architecture
Extensive interaction with remote resources requires closer software inter-operability and higher network bandwidth
Constraints on freedom to set own policies
How do we get there?
Change in management philosophy
Evolution in hardware requirements
Evolution in software packages
Different security protocol(s)
Change in access policy
Change in Management Philosophy
Automated monitoring & management of servers in large clusters a must
Remote power management, predictive hardware failure analysis and preventive maintenance are important
High-availability based on large number of identical servers, not on 24-hour support
Increasingly larger clusters only manageable if servers are identical → avoid specialized servers
Evolution in Hardware Requirements
Early acquisitions emphasized CPU power over local storage capacity
Increasing affordability of local disk storage has changed this philosophy
Hardware chosen by optimal combination of CPU power, storage capacity, server density and price
Buy from high-quality vendors to avoid labor-intensive maintenance issues
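The hardware selection criterion above (CPU power, storage capacity, server density and price) can be illustrated with simple per-unit cost metrics. The following is a minimal sketch, not the procedure used at the RCF/ACF: the `ServerQuote` type, the weights in `rank_quotes`, and every figure in the usage note are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerQuote:
    """One hypothetical vendor quote (all values used with it are invented)."""
    name: str
    price_usd: float     # purchase price per server
    specint2000: float   # CPU rating per server
    disk_gb: float       # local disk capacity per server
    rack_units: int      # height in rack units, a proxy for server density


def cost_per_specint(q: ServerQuote) -> float:
    """Dollars paid per SpecInt2000 unit."""
    return q.price_usd / q.specint2000


def cost_per_gb(q: ServerQuote) -> float:
    """Dollars paid per GB of local disk."""
    return q.price_usd / q.disk_gb


def specint_per_rack_unit(q: ServerQuote) -> float:
    """CPU rating delivered per rack unit of space."""
    return q.specint2000 / q.rack_units


def rank_quotes(quotes, cpu_weight=0.5, disk_weight=0.3, density_weight=0.2):
    """Sort quotes by a weighted score, best (lowest) first.

    Each metric is normalized against the best value among the quotes,
    so a quote that is best on every axis scores exactly 1.0.
    The default weights are arbitrary assumptions.
    """
    best_cpu = min(cost_per_specint(q) for q in quotes)
    best_disk = min(cost_per_gb(q) for q in quotes)
    best_density = max(specint_per_rack_unit(q) for q in quotes)

    def score(q):
        return (cpu_weight * cost_per_specint(q) / best_cpu
                + disk_weight * cost_per_gb(q) / best_disk
                + density_weight * best_density / specint_per_rack_unit(q))

    return sorted(quotes, key=score)
```

For example, `rank_quotes([ServerQuote("A", 2000, 1000, 200, 1), ServerQuote("B", 3000, 1200, 500, 2)])` orders two invented quotes by the combined score; changing the weights shifts the balance between CPU cost, disk cost and density.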
The Growth of the Linux Farm
[Chart: farm capacity in KSpecInt2000 by year, 1999 to 2004; vertical axis from 0 to 1400]
Drop in Server Price as a Function of Performance
[Chart: Cost/SpecInt2000 (in U.S. dollars) by year, 1999 to 2004; vertical axis from 0 to 14]
Drop in Cost of Local Storage
[Chart: Cost/GB (in U.S. dollars) by year, 1999 to 2004; vertical axis from 0 to 70]
Total Distributed Storage Capacity
[Chart: total storage capacity (TB) by year, 1999 to 2004; vertical axis from 0 to 250]
Growth of Storage Capacity per Server
[Chart: GB/server by year, 1999 to 2004; vertical axis from 0 to 450]
Server Reliability
[Chart: failures per machine-month by year, 2000 to 2004; vertical axis from 0 to 0.012]
Failure rate: about 1/week at current size
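The link between a per-machine failure rate and the "about 1/week" figure is simple arithmetic: expected failures per week equal the rate per machine-month times the number of machines, divided by the number of weeks in a month. A minimal sketch follows; the function names are illustrative, and the rate and farm size in the usage note are hypothetical values chosen only to land within the chart's axis range, not numbers read from the slide.

```python
WEEKS_PER_MONTH = 52.0 / 12.0  # average number of weeks in a month


def failures_per_week(rate_per_machine_month: float, n_machines: float) -> float:
    """Expected hardware failures per week across the whole farm."""
    return rate_per_machine_month * n_machines / WEEKS_PER_MONTH


def machines_for_weekly_failure(rate_per_machine_month: float) -> float:
    """Farm size at which the expected failure count reaches one per week."""
    return WEEKS_PER_MONTH / rate_per_machine_month


if __name__ == "__main__":
    # Hypothetical inputs: 0.004 failures per machine-month, 1000 machines.
    print(failures_per_week(0.004, 1000))
    print(machines_for_weekly_failure(0.004))
```

Under these assumed inputs a farm of roughly a thousand servers sees on the order of one failure per week, which is why a low per-machine rate still implies routine maintenance work at scale.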
The Factors Enforcing Evolution in Software Packages
Cost
Farm size / scalability
Security
External influences / wide acceptance
Cost
Red Hat Linux → Scientific Linux
LSF → Condor
Farm Size / Scalability
Home-built batch system for data reconstruction → Condor-based batch system
Home-built monitoring system → Ganglia
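To give a feel for what a Condor-based batch system for data reconstruction consumes, the sketch below generates a Condor submit description with one queued job per run. It is an illustration only and not the RCF/ACF setup: the function name, the `reco.sh` executable, its `--run` option and the log file layout are all hypothetical.

```python
def render_submit_description(executable, run_numbers, log_dir="logs"):
    """Build the text of a Condor submit description, one job per run number.

    The executable and its '--run' argument are hypothetical placeholders.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"log = {log_dir}/reco.log",
        "",
    ]
    for run in run_numbers:
        # Each block overrides the per-job settings, then queues one job.
        lines += [
            f"arguments = --run {run}",
            f"output = {log_dir}/reco_{run}.out",
            f"error = {log_dir}/reco_{run}.err",
            "queue",
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_submit_description("reco.sh", [101, 102]))
```

The generated text would be saved to a file and handed to `condor_submit`; keeping one `queue` statement per run makes each reconstruction job individually traceable through its own output and error files.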
Security
Started with NIS/telnet in the 90s
Cyber-security threats prompted the installation of firewalls, gatekeepers and migration to ssh → stricter security standards than in the past
Ongoing change to Kerberos 5. Ongoing phase-out of NIS passwords.
Testing GSI → limited support for GSI
Security Changes (cont.)
Authorization & authentication controlled by local site (NIS and Kerberos)
Migration to GSI requires a central CA and regional VOs for authentication → local sites perform final authentication before granting access
Accept certificates from multiple CAs?
Difficult transition from complete to partial control over security issues
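The final local check mentioned above can be pictured as a lookup: the site keeps a list of CAs it trusts and a map from certificate subject names to local accounts, in the style of a Globus grid-mapfile. The sketch below shows only that policy step under stated assumptions; it performs no cryptographic verification of the certificate chain, and the function names, DNs and account name are invented.

```python
import re

# One grid-mapfile style entry: a quoted certificate DN, then a local account.
_ENTRY = re.compile(r'^"([^"]+)"\s+(\S+)$')


def parse_grid_mapfile(text):
    """Return {certificate DN: local account} from grid-mapfile style text."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _ENTRY.match(line)
        if match is None:
            continue  # skip malformed entries rather than guess
        mapping[match.group(1)] = match.group(2)
    return mapping


def authorize(subject_dn, issuer_dn, mapping, trusted_cas):
    """Local decision: return the mapped account, or None to deny access.

    Assumes the certificate chain was already verified elsewhere.
    """
    if issuer_dn not in trusted_cas:
        return None  # certificate issued by a CA this site does not accept
    return mapping.get(subject_dn)
```

Accepting certificates from multiple CAs then amounts to adding entries to `trusted_cas`, which makes the trade-off on the slide visible: the site still decides who maps to a local account, but identity proofing is delegated to parties outside its control.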
External Influences / Wide Acceptance
Ganglia – used by RHIC experiments to monitor the RCF and external farms in order to manage their job submission.
HRM / dCache – used by other labs
Condor – widely used by ATLAS community
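Using Ganglia data to steer job submission can be sketched as follows. Ganglia's `gmond` daemon publishes cluster state as an XML document; the sample below is a simplified, invented stand-in for that layout (host names and values are made up), and the function names are illustrative rather than anything used by the experiments.

```python
import xml.etree.ElementTree as ET

# Invented sample imitating the general shape of Ganglia's XML report.
SAMPLE_XML = """<GANGLIA_XML SOURCE="gmond">
  <CLUSTER NAME="example-farm">
    <HOST NAME="node01" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
    </HOST>
    <HOST NAME="node02" IP="10.0.0.2">
      <METRIC NAME="load_one" VAL="1.5" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def host_loads(xml_text, metric="load_one"):
    """Return {host name: metric value} for every host reporting the metric."""
    root = ET.fromstring(xml_text)
    loads = {}
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                loads[host.get("NAME")] = float(m.get("VAL"))
    return loads


def least_loaded(xml_text, metric="load_one"):
    """Pick the host with the lowest value of the metric, or None if empty."""
    loads = host_loads(xml_text, metric)
    return min(loads, key=loads.get) if loads else None


if __name__ == "__main__":
    print(host_loads(SAMPLE_XML))
    print(least_loaded(SAMPLE_XML))
```

A submission tool could apply the same parsing to reports from several farms and direct jobs to whichever one currently shows the most idle capacity.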
Software Evolution - summary
| Package | Old | New | Date |
|---|---|---|---|
| OS | RedHat Linux | Scientific Linux | 2004 |
| Batch | Home-Built/LSF | Condor/LSF | 2004/2000 |
| Monitoring | Home-Built | Ganglia | 2003 |
| Security | NIS | K5/GSI | 2003/2004 |
| Distributed Storage | --- | HRM/dCache | 2004/? |
Ganglia at the RCF/ACF
Condor at the RCF/ACF
Summary
RCF/ACF going through a transition from a local facility to a regional (global) facility → many changes
Linux Farm built with commodity hardware is increasingly affordable and reliable
Distributed storage is also increasingly affordable → management software issues.
Summary (cont.)
Inter-operability with remote sites (software and services) plays an increasingly important role in our software choices
Transition with security and access issues
Migration will take longer and be more difficult than generally expected → change in hardware and software needs to be complemented by a change in management philosophy