Managing Genomics Data at the Sanger Institute
DESCRIPTION
In this presentation from the DDN User Meeting at SC13, Tim Cutts from The Sanger Institute describes how the institute wrangles genomics data with DDN storage. Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/

TRANSCRIPT
Production and Research: Managing Genomics Data at the Sanger Institute
Dr Tim Cutts, Head of Scientific Computing, [email protected]
Background to the Sanger Institute
Potted history
• Funded by the Wellcome Trust
• Sequencing projects increase in scale by 10x every two years
• ~17,000 cores of total compute
• 22PB usable storage (~40PB raw)
1993: Centre opens
1998: Nematode genome completed
2000: Draft human genome
2003: 2 billionth base pair; Human Genome Project completed
2004: MRSA genome
2005: Current datacentre opens
2008: Next generation sequencing; 1000 Genomes Project begins
2009: Joins International Cancer Genome Consortium
2010: UK10K project begins
2013: UK10K project ends
Research Programmes
• Human Genetics
• Mouse and Zebrafish Genetics
• Pathogen Genetics
• Cellular Genetics
• Bioinformatics
Core Facilities
• Model Organisms
• Cellular Generation and Phenotyping
• DNA Pipelines
• IT

Idealised data flow
Example: Variation association
Typical data flow
[Diagram: raw data from the sequencer lands on staging storage; QC and alignment write results to iRODS (archival storage); data is staged to Lustre for research analysis; results feed the website.]
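To make that flow concrete, here is a minimal Python sketch of the same stages as plain stubs; every function name and path is hypothetical, standing in for a real pipeline stage rather than the Sanger pipeline itself:

from pathlib import Path

def stage_from_sequencer(run_dir: Path) -> Path:
    """Copy a finished run from the sequencer onto staging storage."""
    return Path("/staging") / run_dir.name

def qc_and_align(staged: Path) -> Path:
    """Quality-check reads and align them to the reference genome."""
    return staged.with_suffix(".bam")

def archive_to_irods(bam: Path) -> str:
    """Store the aligned BAM in iRODS (archival storage)."""
    return f"/archive/{bam.name}"

def stage_to_lustre(archived: str) -> Path:
    """Stage archived data onto Lustre scratch for research analysis."""
    return Path("/lustre/scratch") / Path(archived).name

def run_pipeline(run_dir: Path) -> Path:
    """Sequencer -> staging -> QC/alignment -> iRODS -> Lustre."""
    staged = stage_from_sequencer(run_dir)
    bam = qc_and_align(staged)
    archived = archive_to_irods(bam)
    return stage_to_lustre(archived)

print(run_pipeline(Path("/seq/run_0001")))  # /lustre/scratch/run_0001.bam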
Choosing your tech: Pick two…
[Diagram: a triangle of Price, Performance, and Capacity; you can pick only two.]
Staging storage
[Diagram: each next-gen sequencer sends sequence data over CIFS to a 50TB CIFS/NFS staging server, one per each of the 27 sequencers; the production sequencing cluster (1,000 cores) reads it over NFS for QC and alignment and writes aligned BAM files to iRODS (4PB).]
• Simple scale-out architecture
  – Server with ~50TB direct-attached block storage
  – One per sequencer
  – Running Samba for upload from sequencer
• Maximum data from all sequencers is currently 1.7 TB/day (worked through in the sketch below)
• 1000-core cluster reads data from staging servers over NFS
  – Quality checks
  – Alignment to reference genome
  – Store aligned BAM and/or CRAM files in iRODS
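A quick back-of-the-envelope check on those staging numbers in Python; the 1.7 TB/day and 27 sequencers are from the slide, the derived rates are ours:

SEQUENCERS = 27
TOTAL_TB_PER_DAY = 1.7
SECONDS_PER_DAY = 86_400

# Average load implied by the slide's figures
per_sequencer_gb_day = TOTAL_TB_PER_DAY * 1000 / SEQUENCERS   # ~63 GB/day each
aggregate_mb_s = TOTAL_TB_PER_DAY * 1e6 / SECONDS_PER_DAY     # ~20 MB/s average

print(f"{per_sequencer_gb_day:.0f} GB/day per sequencer, "
      f"{aggregate_mb_s:.0f} MB/s aggregate average")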
iRODS
• Object store with arbitrary metadata
• Rules to automate mirroring and other tasks as required (sketched after the diagram below)
• Vendor-agnostic: mostly DDN SFA 10K, some other vendors' storage also
• Oracle RAC cluster holds metadata
• Two active-active iRES resource servers in different rooms
  – 8Gb FC to storage; 10Gb IP
  – Series of 43TB LVM volumes from 2x SFA 10K in each room
[Diagram: in each of two rooms, an iRES server fronts an SFA10K carved into 43TB volumes plus other vendors' storage; an iRODS server and the iCAT metadata catalogue (Oracle RAC) sit in front of both rooms.]
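The mirroring idea can be illustrated from the client side with the standard iRODS icommands (iput, imeta, irepl). This is a minimal Python sketch assuming the icommands are installed and an iRODS session is configured; the paths and the "roomB" resource name are hypothetical, and Sanger's actual mirroring runs server-side via iRODS rules:

import subprocess

def archive_bam(local_path: str, run_id: str, mirror_resc: str = "roomB") -> None:
    """Upload a BAM, attach metadata, and replicate it to a second resource."""
    obj = local_path.rsplit("/", 1)[-1]
    subprocess.run(["iput", local_path], check=True)               # upload into iRODS
    subprocess.run(["imeta", "add", "-d", obj, "run_id", run_id],
                   check=True)                                     # arbitrary metadata
    subprocess.run(["irepl", "-R", mirror_resc, obj], check=True)  # mirror to 2nd room

archive_bam("/staging/run_0001.bam", "run_0001")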
Downstream analysis
[Diagram: aligned sequences flow from iRODS (4PB) onto Lustre scratch space (13 filesystems) for the analysis clusters (~14,000 cores); completed work goes to NFS storage for research analysis.]
Lustre setup
• 11 filesystems, 500TB/1PB each
• Large projects have their own ExaScaler hardware … but our own Lustre install
• Aim to deliver 5MB/sec per core of compute (sized in the sketch after the diagram below)
• IB-connected OSS-OST; 10G ethernet to clients
[Diagram: Lustre building block. MGS and MDS serve MDTs from an EF3015; OSS pairs serve OSTs from SFA10K/12K arrays; the 1/2U servers connect over IB, with clients on the 10G/40G network.]
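To put the 5MB/sec-per-core aim in context, a rough sizing in Python; the core and filesystem counts are from the slides (the diagram says 13 scratch filesystems, the bullet text says 11), and the derived aggregates are ours:

CORES = 14_000             # analysis cluster size from the slides
MB_PER_SEC_PER_CORE = 5    # stated delivery target
FILESYSTEMS = 13           # per the diagram; the bullet text says 11

aggregate_gb_s = CORES * MB_PER_SEC_PER_CORE / 1000   # 70 GB/s in total
per_fs_gb_s = aggregate_gb_s / FILESYSTEMS            # ~5.4 GB/s per filesystem

print(f"{aggregate_gb_s:.0f} GB/s aggregate, {per_fs_gb_s:.1f} GB/s per filesystem")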
Future challenges and directions

iRODS
• Object storage instead of filesystems (WOS?)
• File systems take a long time to fsck
• Integration with WOS

Clinical use and personalised medicine
• Security implications
• How can we do this in a small laboratory in Africa with terrible power and minimal IT skills?

Lustre
• Upgrade to 2.5 (HSM features)
• ExaScaler needs to be more current

Sequencing technology
• Nanopore sequencing
• Use outside the datacentre

Vendor support
• Integrated support platforms for production systems
Thank you
The team
– Phil Butcher, IT Director
– Tim Cutts, Acting Head of Scientific Computing
– Guy Coates, Informatics Systems Group Team Leader
– Peter Clapham
– James Beal
– Helen Brimmer
– Jon Nicholson, Network Team Leader
– Shanthi Sivadasan, DBA Team Leader
– Numerous bioinformaticians