Managing Genomics Data at the Sanger Institute



DESCRIPTION

In this presentation from the DDN User Meeting at SC13, Tim Cutts from the Sanger Institute describes how the institute wrangles genomics data with DDN storage. Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/

TRANSCRIPT

Page 1: Managing Genomics Data at the Sanger Institute

Production and Research: Managing Genomics Data at the Sanger Institute

Dr Tim Cutts, Head of Scientific Computing
[email protected]

Page 2: Managing Genomics Data at the Sanger Institute

Background to the Sanger Institute

Page 3: Managing Genomics Data at the Sanger Institute

Potted history

•  Funded by the Wellcome Trust
•  Sequencing projects increase in scale by 10x every two years
•  ~17,000 cores of total compute
•  22PB usable storage (~40PB raw)

1993: Centre opens
1998: Nematode genome completed
2000: Draft human genome
2003: 2 billionth base pair; Human Genome Project completed
2004: MRSA genome
2005: Current datacentre opens
2008: Next generation sequencing; 1000 Genomes Project begins
2009: Joins International Cancer Genome Consortium
2010: UK10K project begins
2013: UK10K project ends

Page 4: Managing Genomics Data at the Sanger Institute

Research Programmes

Human Genetics
Mouse and Zebrafish Genetics
Pathogen Genetics
Cellular Genetics
Bioinformatics

Page 5: Managing Genomics Data at the Sanger Institute

Core Facilities

Model Organisms
Cellular Generation and Phenotyping
DNA Pipelines
IT

Page 6: Managing Genomics Data at the Sanger Institute

Idealised data flow

Page 7: Managing Genomics Data at the Sanger Institute

Example: Variation association

Page 8: Managing Genomics Data at the Sanger Institute

Typical data flow

[Diagram: raw data from the sequencer lands on staging storage, passes through QC and alignment, is archived in iRODS (archival storage), staged to Lustre for research analysis, and the results feed the website]

Page 9: Managing Genomics Data at the Sanger Institute

Choosing  your  tech:  Pick  two…  

Choosing your tech: Pick two…

[Diagram: triangle of Price, Performance, and Capacity]

Page 10: Managing Genomics Data at the Sanger Institute

Staging  storage  

Staging storage

[Diagram: each of 27 Next Gen Sequencers uploads sequence data over CIFS to its own CIFS/NFS staging server (50TB); the production sequencing cluster (1000 cores) reads the staged data over NFS for QC and alignment and writes aligned BAM files to iRODS (4PB)]

Simple scale-out architecture:
–  Server with ~50TB direct attached block storage
–  One per sequencer
–  Running Samba for upload from sequencer

Maximum data from all sequencers is currently 1.7 TB/day. The 1000-core cluster reads data from the staging servers over NFS for:
–  Quality checks
–  Alignment to reference genome
–  Storing aligned BAM and/or CRAM files in iRODS (a sketch of this hand-off follows)
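As a rough illustration of that last step, here is a minimal Python sketch of pushing one aligned BAM into iRODS and tagging it with metadata via the standard icommands (iput, imeta). The paths, collection name, and metadata keys are illustrative assumptions, not Sanger's actual conventions.

```python
#!/usr/bin/env python
# Hypothetical sketch: archive an aligned BAM into iRODS and attach
# queryable run/lane metadata. Paths and attribute names are illustrative.
import subprocess

def archive_bam(bam_path, run_id, lane, coll="/seq/archive"):
    obj = f"{coll}/{run_id}_{lane}.bam"
    # iput copies the local file into the iRODS object store (-K verifies checksums)
    subprocess.run(["iput", "-K", bam_path, obj], check=True)
    # imeta attaches arbitrary attribute/value pairs to the data object,
    # which is what makes the archive searchable later
    for attr, value in (("run_id", run_id), ("lane", str(lane)), ("type", "bam")):
        subprocess.run(["imeta", "add", "-d", obj, attr, value], check=True)

if __name__ == "__main__":
    archive_bam("/staging/run1234/lane1.bam", "run1234", 1)
```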

Page 11: Managing Genomics Data at the Sanger Institute

iRODS

Object store with arbitrary metadata. Rules automate mirroring and other tasks as required (a rough sketch of what the mirroring amounts to follows the diagram). Vendor-agnostic:

–  Mostly DDN SFA 10K
–  Some other vendors' storage also
–  Oracle RAC cluster holds metadata
–  Two active-active iRES resource servers in different rooms
–  8Gb FC to storage; 10Gb IP
–  Series of 43TB LVM volumes from 2x SFA 10K in each room

[Diagram: an iRODS server and the iCAT metadata catalogue (Oracle RAC) front two machine rooms; each room holds an iRES server attached to SFA10K arrays carved into 43TB volumes, plus some other vendors' storage]
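iRODS rules fire server-side, so the mirroring happens automatically; as a rough, assumption-labelled illustration of what such a rule does, the sketch below performs the equivalent replication by hand with the irepl icommand. The resource name is hypothetical.

```python
#!/usr/bin/env python
# Rough illustration of what an iRODS mirroring rule automates: make a
# second replica of a data object on the resource in the other machine
# room. The resource name "room2-ires" is hypothetical.
import subprocess

def mirror(obj_path, dest_resc="room2-ires"):
    # irepl creates an additional replica of the object on dest_resc
    subprocess.run(["irepl", "-R", dest_resc, obj_path], check=True)
    # ils -l lists all replicas, confirming both rooms now hold a copy
    subprocess.run(["ils", "-l", obj_path], check=True)

mirror("/seq/archive/run1234_1.bam")
```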

Page 12: Managing Genomics Data at the Sanger Institute

Downstream analysis

[Diagram: aligned sequences are staged from iRODS (4PB) onto Lustre scratch space (13 filesystems) for research analysis on the analysis clusters (~14,000 cores); completed work moves to NFS storage]
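The staging step in this diagram amounts to copying objects out of iRODS onto Lustre scratch before analysis runs; a minimal sketch, with illustrative paths, might look like this:

```python
#!/usr/bin/env python
# Hypothetical sketch: stage an archived alignment out of iRODS onto
# Lustre scratch ahead of analysis. Paths are illustrative.
import subprocess

def stage_to_lustre(obj_path, scratch="/lustre/scratch01"):
    local = f"{scratch}/{obj_path.rsplit('/', 1)[-1]}"
    # iget copies the object from iRODS to the local filesystem (-K verifies checksums)
    subprocess.run(["iget", "-K", obj_path, local], check=True)
    return local

bam = stage_to_lustre("/seq/archive/run1234_1.bam")
```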

Page 13: Managing Genomics Data at the Sanger Institute

Lustre setup

–  11 filesystems, 500TB/1PB each
–  Large projects have their own
–  Exascaler hardware… but our own Lustre install
–  Aim to deliver 5MB/sec per core of compute (see the back-of-envelope check below)
–  IB-connected OSS-OST
–  10G Ethernet to clients
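A back-of-envelope check (my arithmetic, not from the slides) of what the 5MB/sec-per-core target implies for aggregate bandwidth:

```python
# Implied aggregate bandwidth of the 5 MB/s-per-core delivery target,
# using the ~14,000 analysis cores and 11 filesystems from these slides.
cores = 14_000
per_core_mb_s = 5
filesystems = 11

aggregate_gb_s = cores * per_core_mb_s / 1000   # 70 GB/s across the estate
per_fs_gb_s = aggregate_gb_s / filesystems      # ~6.4 GB/s per filesystem
print(f"aggregate: {aggregate_gb_s:.0f} GB/s; per filesystem: {per_fs_gb_s:.1f} GB/s")
```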

[Diagram: MDS and MGS servers with MDTs on EF3015 storage, and pairs of OSS servers with OSTs on SFA10K/12K storage, all IB-connected 1/2U servers; clients attach over the 10G/40G network]

Page 14: Managing Genomics Data at the Sanger Institute

Future challenges and directions

iRODS:
•  Object storage instead of filesystems (WOS?)
•  File systems take a long time to fsck
•  Integration with WOS

Clinical use and personalised medicine:
•  Security implications
•  How can we do this in a small laboratory in Africa with terrible power and minimal IT skills?

Lustre:
•  Upgrade to 2.5 (HSM features)
•  Exascaler needs to be more current

Sequencing technology:
•  Nanopore sequencing
•  Use outside the datacentre

Vendor support:
•  Integrated support platforms for production systems

Page 15: Managing Genomics Data at the Sanger Institute

Thank you

The team:

–  Phil Butcher, IT Director
–  Tim Cutts, Acting Head of Scientific Computing
–  Guy Coates, Informatics Systems Group Team Leader
–  Peter Clapham
–  James Beal
–  Helen Brimmer
–  Jon Nicholson, Network Team Leader
–  Shanthi Sivadasan, DBA Team Leader
–  Numerous bioinformaticians