Sanger HPC Infrastructure Report (2007)

DESCRIPTION

Overview of the Sanger Institute HPC infrastructure and management tools, given at the Spring 2007 HEPIX meeting.

TRANSCRIPT

1. Sanger Institute Site Report Nov 2007 Guy Coates [email_address]

2. About the Institute

  • Funded by Wellcome Trust.
    • 2nd largest research charity in the world.
    • ~700 employees.
  • Large scale genomic research.
    • Sequenced 1/3 of the human genome (largest single contributor).
    • We have active cancer, malaria, pathogen and genomic variation studies.
  • All data is made publicly available.
    • Websites, FTP, direct database access, programmatic APIs.

3. Why are we here?

  • HEPIX Themes:
    • We have particle accelerators which throw out massive amounts of data.
    • We need lots of storage.
    • We need lots of compute.
    • Managing it is hard.
  • Different science, same problems.

(Image: sequencing machines)

4. Managing Growth

  • We have exponential growth in storage and compute.
    • Storage doubles every 12 months (a rough projection is sketched after this list).
      • We will have at least 2 PB of disk next year.
  • New sequencing technologies are a huge challenge.
    • ~50x increase in data production in the space of 6 months.
  • New sequencing tech is still growing.
    • Known unknowns:
      • Higher data output from our current machines.
      • More machines.
    • Unknown unknowns:
      • New big science projects are just a good idea away...
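
A rough projection of that doubling, as a minimal Python sketch (the ~1 PB starting figure and five-year horizon are illustrative, not from the talk):

    # Illustrative capacity projection, assuming storage keeps doubling
    # every 12 months from roughly 1 PB in production today.
    start_pb = 1.0
    for year in range(5):
        print("year %d: ~%.0f PB" % (year, start_pb * 2 ** year))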

5. Data centre

  • 4x 250 m² data centres.
    • 2-4 kW/m² cooling.
    • 3.4 MW power draw (the arithmetic is checked in the sketch after this list).
  • Overhead aircon, power and networking.
    • Allows counter-current cooling.
    • More efficient.
  • Technology Refresh.
    • 1 data centre is an empty shell.
      • Rotate into the empty room every 4 years.
      • Refurb one of the in-use rooms with the current state of the art.
    • Fallow Field principle.
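
The quoted figures hang together as simple arithmetic; a minimal sketch using only the numbers on this slide:

    # Compare total cooling capacity (4 rooms x 250 m2 at 2-4 kW/m2)
    # against the quoted 3.4 MW power draw.
    rooms, area_m2 = 4, 250.0
    kw_per_m2_low, kw_per_m2_high = 2.0, 4.0
    total_m2 = rooms * area_m2
    print("cooling capacity: %.1f-%.1f MW" % (total_m2 * kw_per_m2_low / 1000,
                                              total_m2 * kw_per_m2_high / 1000))
    print("quoted power draw: 3.4 MW")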

6. Storage

  • SAN Fabric.
    • Dual Brocade fabric (27 switches per fabric).
    • ~1PB in production today.
  • HP EVA 5000/8000 arrays.
    • Holds the bulk of our data (~1 PB).
    • Dual controller, Fibre Channel disks, ~50 TB per array.
    • Virtual RAID 5 (effectively RAID 6).
      • No need to worry about RAID-set sizes being nice multiples of the physical disk size, etc.
      • Allows rapid allocation of storage to projects as required.
    • Storage is either directly attached or used with cluster file systems.
  • BlueArc Titan.
    • NFS serving for home directories and storage which needs concurrent Windows / Linux access.
    • EVA storage at the back end.
  • Backup.
    • Veritas NetBackup to a StorageTek SL8500 library.
    • 12 drives (LTO-2 and LTO-3), 1500 slots (rough native capacity sketched below).
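
For scale, a back-of-the-envelope estimate of the library's native capacity, assuming the usual LTO native capacities (200 GB for LTO-2, 400 GB for LTO-3); the all-of-one-type cases are just illustrative bounds:

    # Rough native capacity of a 1500-slot library with LTO-2/LTO-3 media.
    slots = 1500
    lto2_gb, lto3_gb = 200, 400
    print("all LTO-2: ~%d TB native" % (slots * lto2_gb // 1000))
    print("all LTO-3: ~%d TB native" % (slots * lto3_gb // 1000))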

7. Compute

  • 3800 cores in >1500 blades and rack mount servers.
    • Blades preferred due to ease of management, space and power efficiency.
    • Mostly x86_64 servers, some older x86 systems.
      • Single, dual and quad core.
    • A token number of ia64 machines for large-memory work.
      • (SGI Altix 350, 16 CPUs, 192 GB memory).
  • We use Debian Linux as primary OS.
    • Badly burned by proprietary OS and file-systems.
      • We still have legacy Alpha / Tru64 / AdvFS data and apps which require migration to Linux.
    • 99% of systems run Debian Sarge / Etch.
      • Run 64-bit on x86_64 CPUs.
      • SLES 9 on the Oracle servers to stay inside the support matrix.
  • Complex User-base.
    • ~300 users, diverse workload.
    • Typically IO-bound, integer-intensive and single-threaded.
      • Scales well on clusters (apart from the IO bit).

8. Infrastructure / Management

  • Deployment.
    • Debian FAI automated installer. Integrated with blade management systems for fire-and-forget deployment.
      • ~2 minutes for a complete OS install.
  • Updates.
    • cfengine and dsh (a dsh-style fan-out is sketched after this list).
  • Monitoring.
    • ganglia, nagios.
  • RequestTracker.
    • Used by many software development and science teams within the Institute as well as the System team.
    • External engineers and collaborators have access.
    • 30k tickets per year.
  • Heartbeat 2.
    • 2-8 node clusters for high availability.
    • Mostly MySQL + Apache, using the SAN for storage failover.
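
As a flavour of the dsh-style fan-out used for updates, here is a minimal Python sketch that runs one command over ssh on a list of hosts in parallel; the hostnames and command are invented for illustration, and this is not the actual tooling:

    # Run the same command on several hosts in parallel over ssh,
    # roughly what dsh does for us during updates.
    import subprocess

    hosts = ["blade001", "blade002", "blade003"]   # hypothetical node names
    command = "uptime"

    procs = [(h, subprocess.Popen(["ssh", h, command],
                                  stdout=subprocess.PIPE,
                                  stderr=subprocess.STDOUT))
             for h in hosts]
    for host, proc in procs:
        out, _ = proc.communicate()
        print("%s: %s" % (host, out.decode().strip()))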

9. Tera-scale Oracle

  • Sequencing Trace archive.
    • Holds results from all DNA sequencing experiments, everywhere.
    • Mirrored with NCBI trace archive.
    • Currently ~60 TB / 8 billion traces.
    • Doubles in size every 12 months.
  • Originally, data was on the file system and metadata was in Oracle.
    • Billions of small files (20-80 kB).
      • The file-system worst case.
    • Hard to back up, hard to manage space.
    • All on Tru64 / AdvFS (a dead architecture).
  • We decided to move everything into Oracle (a sketch of the files-as-BLOBs idea follows this list).
    • How hard can it be?
      • Tera-scale databases are common (according to Oracle).
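
The files-as-BLOBs idea is simple to sketch; the example below uses Python's stdlib sqlite3 purely as a stand-in for Oracle, and the table and column names are made up for illustration:

    # Store each small trace file as a BLOB next to its name, instead of
    # keeping billions of 20-80 kB files on a file system.
    import sqlite3

    db = sqlite3.connect("traces.db")
    db.execute("CREATE TABLE IF NOT EXISTS trace (name TEXT PRIMARY KEY, data BLOB)")

    def store_trace(path):
        with open(path, "rb") as f:
            db.execute("INSERT OR REPLACE INTO trace VALUES (?, ?)",
                       (path, sqlite3.Binary(f.read())))
        db.commit()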

10. Tera-scale Oracle

  • Primary.
    • 4 node Oracle 10g RAC cluster (4 core x86_64, 16GB RAM).
    • 60TB of EVA / fibre-channel storage, Oracle ASM clustered file-system.
  • Backup Strategy.
    • Replicate the database to a secondary database with Oracle Data Guard.
    • 2 node RAC cluster with 60TB of MSA1000 storage (cheap-n-cheerful fibre-channel).
      • 15 minute delay in replication to protect against finger trouble.
    • Secondary database is the primary backup (disk-to-disk, fast).
      • We can run off the secondary if we need to.
    • Dumps to tape taken from the secondary.
  • Big Oracle is hard.
    • Oracle is not well tested (especially by Oracle!) on this scale.
    • How will we cope with future growth of the database?

11. Compute farm

  • Exclusively Blade.
    • 588 IBM HS20/LS20 (42 chassis), 128 HP BL460c (8 chassis).
    • 2224 cores (mix of 32- and 64-bit), 2 GB memory per core.
    • RAID 1 system disks.
    • Debian Sarge + custom kernel.
    • LSF used for job scheduling.
      • Typically 10k-100k jobs in the system (a job-array submission sketch follows this list).
  • Networking.
    • Extreme Networks.
    • 1-2x GigE edge.
    • 2-4x GigE trunks.
    • 2x 10GigE core.
    • Systems distributed across data centres.
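
A hedged sketch of how such job counts are typically fed to LSF as a job array; the job name, worker script and array size are invented, and the exact bsub options used on the farm are not from the talk:

    # Submit a 1000-element LSF job array; %J/%I in the output path expand
    # to the job ID and array index.
    import subprocess

    njobs = 1000                               # hypothetical batch size
    subprocess.check_call([
        "bsub",
        "-J", "align[1-%d]" % njobs,           # job array, one element per input
        "-o", "logs/align.%J.%I.out",          # per-element output file
        "./run_alignment.sh",                  # hypothetical worker script
    ])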

12. Farm lustre storage

  • HP SFS / Lustre v2.1.1.
    • Based on CFS Lustre 1.4.
    • In-house client port to Debian.
  • Lustre for work / scratch areas (a striping example follows this list).
    • 10 OSS / 20 OST.
      • 20 SFS20 arrays.
      • Dual-tailed SCSI (highly available).
      • 12x 250 GB SATA disks.
      • RAID 6 + 1 hot spare.
      • 35 TB usable storage.
    • Reliability sacrificed for performance.
  • We have NFS as well.
    • Lustre random access / meta-data performance is rubbish.
    • NFS for home directories.
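
Striping is what buys the streaming bandwidth on the work areas; a minimal illustration using the standard Lustre client tool (the path and stripe count are invented, not a recommendation from the talk):

    # Stripe new files in a scratch directory across 4 OSTs with
    # "lfs setstripe", so large sequential IO is spread over the servers.
    import subprocess

    scratch_dir = "/lustre/scratch/myproject"  # hypothetical path
    subprocess.check_call(["lfs", "setstripe", "-c", "4", scratch_dir])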

13. Lustre performance

  • Sustained 11-12 Gbit/s peak.
    • This is real work, not a benchmark.

14. Supporting New Sequencing

  • We have 20 Illumina (née Solexa) sequencing machines.
    • They will produce 20-30 TB per day.
    • Machines will run 24x7.
    • We need to keep raw data for ~2 weeks for analysis and QC.
  • 320 TB Lustre staging area (a rough capacity check is sketched at the end of this list).
    • 8 EVA8000 arrays, 28 OSSs, 160 OSTs.
      • (The 8-LUN-per-OSS limit required more OSSs than planned.)
    • 3x 100 TB file-systems for production + a 50 TB file-system for development.
      • Smaller file-systems hedge against EVA failure.
  • Compute.
    • 256 HP BL460c blades. 600 cores, mixture of dual / quad core.
    • Extreme networks Black Diamond 8810 switch (360 non-blocking GigE ports).
  • Scratch storage.
    • 25 TB SFS20 Lustre scratch area for ad hoc analysis.
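
A rough capacity check of the staging area against the stated data rates, using only the slide's own numbers:

    # 20-30 TB/day of raw data kept for ~2 weeks, versus the 320 TB
    # Lustre staging area.
    rate_low_tb, rate_high_tb = 20, 30
    retention_days = 14
    print("raw data held: %d-%d TB" % (rate_low_tb * retention_days,
                                       rate_high_tb * retention_days))
    print("staging area: 320 TB")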

15. Data pull

  • LSF reconfiguration allows processing and alignment capacity to be interchanged.
  • Lustre clients have 2x GigE; 4x GigE trunks run from each chassis to the core switch.
  • (Diagram: sequencers 1-20, "sucker" blade chassis, processing and alignment blade chassis, a 320 TB EVA Lustre datastore, a 25 TB SFS20 Lustre scratch area, and a final NFS repository.)

16. Acknowledgements

    • Tim Cutts
    • Simon Kelley
    • Pete Clapham
    • Mark Flint
    • James Beal

HP Life sciences / SFS

    • Jon Nicholson
    • Russell Vincent
    • Dave Holland
    • Martin Burton

Sanger Institute

    • Eamonn O'Toole
    • Gavin Brebner
    • Phil Butcher