Managing Genomics Data at the Sanger Institute
DESCRIPTION
In this presentation from the DDN User Meeting at SC13, Tim Cutts from The Sanger Institute describes how the institute wrangles genomics data with DDN storage. Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/

TRANSCRIPT
Production and Research: Managing Genomics Data at the Sanger Institute
Dr Tim Cutts, Head of Scientific Computing, [email protected]
Background to the Sanger Institute
Potted history
• Funded by the Wellcome Trust
• Sequencing projects increase in scale by 10x every two years
• ~17,000 cores of total compute
• 22PB usable storage (~40PB raw)
1993: Centre opens
1998: Nematode genome completed
2000: Draft human genome
2003: 2 billionth base pair; Human Genome Project completed
2004: MRSA genome
2005: Current datacentre opens
2008: Next generation sequencing; 1000 Genomes Project begins
2009: Joins International Cancer Genome Consortium
2010: UK10K project begins
2013: UK10K project ends
Research Programmes
• Human Genetics
• Mouse and Zebrafish Genetics
• Pathogen Genetics
• Cellular Genetics
• Bioinformatics
Core Facilities
• Model Organisms
• Cellular Generation and Phenotyping
• DNA Pipelines
• IT

Idealised data flow
Example: Variation association
Typical data flow
[Diagram: raw data from the sequencer lands on staging storage; QC and alignment write results to iRODS (archival storage); data is staged to Lustre for research analysis; results feed the website.]
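To make that flow concrete, here is a minimal Python sketch of the same stages as plain stubs; every function name and path is hypothetical, standing in for a real pipeline stage rather than the Sanger pipeline itself:

from pathlib import Path

def stage_from_sequencer(run_dir: Path) -> Path:
    """Copy a finished run from the sequencer onto staging storage."""
    return Path("/staging") / run_dir.name

def qc_and_align(staged: Path) -> Path:
    """Quality-check reads and align them to the reference genome."""
    return staged.with_suffix(".bam")

def archive_to_irods(bam: Path) -> str:
    """Store the aligned BAM in iRODS (archival storage)."""
    return f"/archive/{bam.name}"

def stage_to_lustre(archived: str) -> Path:
    """Stage archived data onto Lustre scratch for research analysis."""
    return Path("/lustre/scratch") / Path(archived).name

def run_pipeline(run_dir: Path) -> Path:
    """Sequencer -> staging -> QC/alignment -> iRODS -> Lustre."""
    staged = stage_from_sequencer(run_dir)
    bam = qc_and_align(staged)
    archived = archive_to_irods(bam)
    return stage_to_lustre(archived)

print(run_pipeline(Path("/seq/run_0001")))  # /lustre/scratch/run_0001.bam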
Choosing your tech: Pick two…
[Diagram: a triangle of Price, Performance, and Capacity; you can pick only two.]
Staging storage
[Diagram: each next-gen sequencer sends sequence data over CIFS to a 50TB CIFS/NFS staging server, one per each of the 27 sequencers; the production sequencing cluster (1,000 cores) reads it over NFS for QC and alignment and writes aligned BAM files to iRODS (4PB).]
• Simple scale-out architecture
  – Server with ~50TB direct-attached block storage
  – One per sequencer
  – Running Samba for upload from sequencer
• Maximum data from all sequencers is currently 1.7 TB/day (worked through in the sketch below)
• 1000-core cluster reads data from staging servers over NFS
  – Quality checks
  – Alignment to reference genome
  – Store aligned BAM and/or CRAM files in iRODS
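A quick back-of-the-envelope check on those staging numbers in Python; the 1.7 TB/day and 27 sequencers are from the slide, the derived rates are ours:

SEQUENCERS = 27
TOTAL_TB_PER_DAY = 1.7
SECONDS_PER_DAY = 86_400

# Average load implied by the slide's figures
per_sequencer_gb_day = TOTAL_TB_PER_DAY * 1000 / SEQUENCERS   # ~63 GB/day each
aggregate_mb_s = TOTAL_TB_PER_DAY * 1e6 / SECONDS_PER_DAY     # ~20 MB/s average

print(f"{per_sequencer_gb_day:.0f} GB/day per sequencer, "
      f"{aggregate_mb_s:.0f} MB/s aggregate average")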
iRODS
• Object store with arbitrary metadata
• Rules to automate mirroring and other tasks as required (sketched after the diagram below)
• Vendor-agnostic: mostly DDN SFA 10K, some other vendors' storage also
• Oracle RAC cluster holds metadata
• Two active-active iRES resource servers in different rooms
  – 8Gb FC to storage; 10Gb IP
  – Series of 43TB LVM volumes from 2x SFA 10K in each room
[Diagram: in each of two rooms, an iRES server fronts an SFA10K carved into 43TB volumes plus other vendors' storage; an iRODS server and the iCAT metadata catalogue (Oracle RAC) sit in front of both rooms.]
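The mirroring idea can be illustrated from the client side with the standard iRODS icommands (iput, imeta, irepl). This is a minimal Python sketch assuming the icommands are installed and an iRODS session is configured; the paths and the "roomB" resource name are hypothetical, and Sanger's actual mirroring runs server-side via iRODS rules:

import subprocess

def archive_bam(local_path: str, run_id: str, mirror_resc: str = "roomB") -> None:
    """Upload a BAM, attach metadata, and replicate it to a second resource."""
    obj = local_path.rsplit("/", 1)[-1]
    subprocess.run(["iput", local_path], check=True)               # upload into iRODS
    subprocess.run(["imeta", "add", "-d", obj, "run_id", run_id],
                   check=True)                                     # arbitrary metadata
    subprocess.run(["irepl", "-R", mirror_resc, obj], check=True)  # mirror to 2nd room

archive_bam("/staging/run_0001.bam", "run_0001")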
Downstream analysis
[Diagram: aligned sequences flow from iRODS (4PB) onto Lustre scratch space (13 filesystems) for the analysis clusters (~14,000 cores); completed work goes to NFS storage for research analysis.]
Lustre setup
• 11 filesystems, 500TB/1PB each
• Large projects have their own ExaScaler hardware … but our own Lustre install
• Aim to deliver 5MB/sec per core of compute (sized in the sketch after the diagram below)
• IB-connected OSS-OST; 10G ethernet to clients
[Diagram: Lustre building block. MGS and MDS serve MDTs from an EF3015; OSS pairs serve OSTs from SFA10K/12K arrays; the 1/2U servers connect over IB, with clients on the 10G/40G network.]
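To put the 5MB/sec-per-core aim in context, a rough sizing in Python; the core and filesystem counts are from the slides (the diagram says 13 scratch filesystems, the bullet text says 11), and the derived aggregates are ours:

CORES = 14_000             # analysis cluster size from the slides
MB_PER_SEC_PER_CORE = 5    # stated delivery target
FILESYSTEMS = 13           # per the diagram; the bullet text says 11

aggregate_gb_s = CORES * MB_PER_SEC_PER_CORE / 1000   # 70 GB/s in total
per_fs_gb_s = aggregate_gb_s / FILESYSTEMS            # ~5.4 GB/s per filesystem

print(f"{aggregate_gb_s:.0f} GB/s aggregate, {per_fs_gb_s:.1f} GB/s per filesystem")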
Future challenges and directions

iRODS
• Object storage instead of filesystems (WOS?)
• File systems take a long time to fsck
• Integration with WOS

Clinical use and personalised medicine
• Security implications
• How can we do this in a small laboratory in Africa with terrible power and minimal IT skills?

Lustre
• Upgrade to 2.5 (HSM features)
• ExaScaler needs to be more current

Sequencing technology
• Nanopore sequencing
• Use outside the datacentre

Vendor support
• Integrated support platforms for production systems
Thank you
The team
– Phil Butcher, IT Director
– Tim Cutts, Acting Head of Scientific Computing
– Guy Coates, Informatics Systems Group Team Leader
– Peter Clapham
– James Beal
– Helen Brimmer
– Jon Nicholson, Network Team Leader
– Shanthi Sivadasan, DBA Team Leader
– Numerous bioinformaticians