Next-Generation Sequencing: Data Management

DESCRIPTION

Next-generation sequencing is producing vast amounts of data. Providing storage and compute is only half the battle. Researchers and IT staff need to be able to "manage" data in order to stay productive. Talk given at Bio-IT World Europe 2010.

TRANSCRIPT

  • 1. Next-Gen Sequencing: Data Management. Guy Coates, Wellcome Trust Sanger Institute. [email_address]

2. About the Institute

  • Funded by the Wellcome Trust.
  • 2nd largest research charity in the world.

3. ~700 employees. Large-scale genomic research.

  • Sequenced 1/3 of the human genome (largest single contributor).

4. We have active cancer, malaria, pathogen and genomic variation studies. All data is made publicly available.

  • Websites, FTP, direct database access, programmatic APIs.

5. Previously... at BioIT Europe:

6. The Scary Graph
   [Slide graph: sequencing output over time, annotated with instrument upgrades and peak yearly capillary sequencing.]

7. The Scary Graph

8. Managing Growth

  • We have exponential growth in storage and compute.
  • Storage/compute doubles every 12 months.
  • 2009: ~7 PB raw.

Moore's law will not save us.

  • Transistor/disk density: doubling time T_d = 18 months.

9. Sequencing cost: T_d = 12 months.

10. Classic Sanger Stealth project

  • Summer 2007
  • First early-access sequencer.

Not long after:

  • 15 sequencers have been ordered. They are arriving in 8 weeks. Can we have some storage and computers?

A fun summer was had by all! 11. Classic Sanger Stealth project

  • Early 2010
  • HiSeq announced.

Not long after:

  • 30 sequencer upgrades have been ordered. They are arriving in 8 weeks. Can we have some storage and computers?

A fun summer was had by all! 12. What we learned...

  • Masterly inactivity
  • We had 6 months where we bought no storage.

13. Nobody stops to tidy up until they have no more disk space. Data-triage:

  • We are much more aggressive about throwing away data we no longer need.
  • No raw image files, SRF or FASTQ.

14. BAM only.
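As an illustration only (this is not the Sanger pipeline, and the path and age threshold are hypothetical), a triage sweep of this kind can be scripted:

   #!/bin/bash
   # Illustrative data-triage sweep: remove intermediate SRF/FASTQ files
   # older than 30 days from a scratch area, on the assumption that the
   # corresponding BAMs have already been archived.
   SCRATCH=/lustre/scratch102/seq        # hypothetical path
   find "$SCRATCH" -type f \
        \( -name '*.srf' -o -name '*.fastq' -o -name '*.fastq.gz' \) \
        -mtime +30 -print -delete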

Storage-Tax:

  • PIs requesting sequencing have a storage surcharge applied to them.

15. Historically, sequencing and IT were budgeted separately. 16. Makes PIs aware of the IT costs, even if it does not cover 100%. 17. Flexible Infrastructure

  • Modular design.
  • Blocks of network, compute and storage.

18. Assume from day 1 we will be adding more. 19. Expand simply by adding more blocks. Make storage visible from everywhere.

  • Key enabler: lots of 10GbE.

This allows us to move compute jobs between farms.

  • Logically rather than physically separated.

20. Currently using LSF to manage workflow.
   [Slide diagram: LSF, fast scratch disk, archival/warehouse disk, network.]
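As a flavour of this (a minimal sketch only: the queue, resource requests, paths and the run_alignment.sh wrapper are illustrative, not the Sanger configuration), a pipeline step submitted through LSF might look like:

   # Submit one alignment step to the farm, writing to fast scratch and
   # logging to a shared area. All names are illustrative.
   bsub -q normal -n 4 -R "rusage[mem=4000]" \
        -o /lustre/scratch102/logs/align_%J.log \
        ./run_alignment.sh /lustre/scratch102/run5307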

21. Our Modules:

  • KISS: Keep It Simple, Stupid.
  • Tendency to go for the clever solution.

22. Simple might not be as robust, but it is much easier and faster to fix when it breaks. More reliable in practice. Compute:

  • Racks of blades.

Bulk Storage:

  • Nexsan SATABeast. RAID 6 SATA disks.
  • Directly attached via FC or served via Linux/NFS.

23. 50-100TB chunks. Fast Storage:

  • DDN 9900/10000 + Lustre (250 TB chunks)
  • (KISS violation).

Reasonably successful.

  • Takes longer than we would like to physically install it.

24. Data management

  • 100TB filesystem, 136M files.
  • Where is the stuff we can delete so we can continue production...?

   # df -h
   Filesystem          Size  Used  Avail  Use%  Mounted on
   lus02-mds1:/lus02   108T  107T     1T   99%  /lustre/scratch102

   # df -i
   Filesystem          Inodes      IUsed       IFree       IUse%  Mounted on
   lus02-mds1:/lus02   300296107   136508072   163788035     45%  /lustre/scratch102

25. Sequencing data flow.
   [Slide diagram: sequencer, analysis/alignment, internal repository, EGA / SRA (EBI), compute farm, high-performance storage; labels for 'automated processing and data management' and 'manual data movement'.]

26. Unmanaged data

  • Investigators take sequence data off the pipeline and do stuff with it.
  • Data inflation.
  • 10x the space of the raw data.

Data is left in the wrong place.

  • Typically left where it was created.
  • Moving data is hard and slow.

Important data left in scratch areas, or high IO analysis being run against slow storage. Finding data is impossible.

  • Where is the important data?
  • Everyone creates a copy for themselves, just to be sure.

Are we backing up the important stuff? 27. Are we keeping control of our private datasets? 28. Managing unstructured data

  • Automation is key:
  • Computers, not people, moving data around.

29. Works well for the pipelines where it is currently used. Hard to get buy-in from our non-production users.

  • Added complication that gets in the way of doing ad-hoc analysis.

Our Breakthrough Moment:

  • One of our informatics teams mentioned that they had written a simple data-tracking application for their own use.
  • We kept losing our files, or running out of disk space halfway through an analysis.

Big benefits:

  • Big increase in productivity.

30. 50% reduction in disk utilisation.

  • 50% of 2 PB is a lot of $.

Easy to do capacity planning. 31. Bottlenecks:

  • Data management is now impacting productivity.
  • Groups who control their data get much more done.

32. As data sizes increase, even small data groups get hit. Money talks:

  • Group A only needs the storage budget of Group B to do the same analysis.
  • Powerful message.

We do not want lots of distinct data tracking systems.

  • Avoid wheel reinvention.

33. Groups need to exchange data.

34. Small groups do not have the manpower to hack something together. We need something with a simple interface so it can easily support ad-hoc requests.

35. Sequencing data flow.
   [Slide diagram: as before (sequencer, analysis/alignment, internal repository, EGA / SRA (EBI), compute farm, high-performance storage), with labels for 'automated processing and data management', 'manual' and 'managed data movement'.]

36. What are we using?

  • iRODS: Integrated Rule-Oriented Data System.
  • Produced by the DICE Group (Data Intensive Cyber Environments) at the University of North Carolina, Chapel Hill.

Successor to SRB.

  • SRB is used by the high-energy physics (HEP) community.
  • ~20 PB/year of LHC data.

The HEP community has lots of lessons learned that we can benefit from.

37. iRODS
   [Slide diagram: ICAT catalogue database; rule engine implementing policies; user interfaces (WebDAV, icommands, FUSE); iRODS servers holding data on disk, in a database, and in S3.]

38. iRODS Features

  • Store data and metadata.
  • Metadata can be queried (see the sketch below).
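A minimal sketch of what this looks like with the standard icommands; the local filename is illustrative, and the attribute names simply mirror the example metadata shown later in the deck:

   $ iput -K 5307_1.bam /seq/5307/5307_1.bam       # upload, with a server-side checksum
   $ imeta add -d /seq/5307/5307_1.bam type bam    # attach attribute/value metadata (AVUs)
   $ imeta add -d /seq/5307/5307_1.bam id_run 5307
   $ imeta qu -d id_run = 5307                     # query it back, as on the later slides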

Scalable:

  • Copes with PBs of data and 100 million+ files.

39. Replicates data. 40. Fast parallel data transfers across local and wide area network links. Extensible

  • The system can be linked out to external services.
  • E.g. external databases holding metadata, external authentication systems.

Federated

  • Physically and logically separated iRODS installs can be federated across institutions.

41. First implementation
   [Slide diagram: sequencer, analysis/alignment, internal repository, EGA / SRA (EBI), compute farm, high-performance storage; labels for 'automated processing and data management' and 'manual'.]

42. First Implementation

  • Simple archive system.
  • It is our first production system: KISS.

43. Hold BAM files and a small amount of metadata. Rules: 44. Replicate:

  • All files replicated across storage held in two different data centres.

Set access controls:

  • Enforce access controls for some confidential datasets.
  • Automatically triggered from study metadata (see the sketch below).
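Before the example sessions on the next slides, a rough sketch of how replication and access control look when driven by hand with icommands; in production these are fired automatically by the rule engine, and the group name below is illustrative (the res-r2 resource is taken from the listing that follows):

   $ irepl -R res-r2 /seq/5307/5307_1.bam                  # second replica in the other data centre
   $ ichmod read study_5307_readers /seq/5307/5307_1.bam   # restrict a confidential dataset to one group
   $ ils -A /seq/5307/5307_1.bam                           # check the resulting ACLs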

45. Example access:

   $ icd /seq/5307
   $ ils
   /seq/5307:
     5307_1.bam
     5307_2.bam
     5307_3.bam
   $ ils -l 5307_1.bam
     srpipe  0  res-g2  1987106409  2010-09-24.13:35  &  5307_1.bam
     srpipe  1  res-r2  1987106409  2010-09-24.13:36  &  5307_1.bam

46. Metadata

   $ imeta ls -d /seq/5307/5307_1.bam
   AVUs defined for dataObj /seq/5307/5307_1.bam:
   attribute: type
   value: bam
   units:
   ----
   attribute: sample
   value: BG81
   units:
   ----
   attribute: id_run
   value: 5307
   units:
   ----
   attribute: lane
   value: 1
   units:
   ----
   attribute: study
   value: TRANSCRIPTION FACTORS IN HAEMATOPOIESIS - MOUSE
   units:
   ----
   attribute: library
   value: BG81 449223
   units:

47. Query

   $ imeta qu -d study = "TRANSCRIPTION FACTORS IN HAEMATOPOIESIS - MOUSE"
   collection: /seq/5307
   dataObj: 5307_1.bam
   ----
   collection: /seq/5307
   dataObj: 5307_2.bam
   ----
   collection: /seq/5307
   dataObj: 5307_3.bam

48. So what...?

49. Next steps
   [Slide diagram: Sanger iRODS replicated across Datacentre 1 and Datacentre 2; automated release/purge to EGA/ERA; federation with a collaborator's iRODS.]

50. Wishlist: HPC Integration
   [Slide diagram: today, data is staged in/out between an archive/metadata system and a fast storage / POSIX filesystem serving the compute farm; the wish is a fast storage / POSIX filesystem combined with the metadata system, so it can do rule/metadata-based operations and standard POSIX operations too.]

51. Managing Workflow

52. Modular Compute

  • We have a very heterogeneous network.
  • Modules of storage and compute.

53. Storage and servers spread across several locations.
   [Slide diagram: storage and CPU modules connected by fast, medium and slow links.]

54. How do we manage data and workflow?

  • Some compute and data are closer together than other parts.
  • Jobs should use compute that is near their data.

How do we steer workload to where we want it?

  • We may want to mark modules offline for maintenance, and steer workload away from them.

55. LSF Data Aware Scheduler

  • How it works:
  • LSF has a map describing the storage pool/compute topology.
  • Simple weighting matrix.

56. LSF knows how much free space is available on each pool. Users can optionally register datasets as being on a particular storage pool.

  • Users submit a job request.
  • May include a dataset request, an amount of storage, or a storage-distance request.
  • LSF finds free machines and storage.
  • Storage location is passed as an environment variable into the job at runtime (see the sketch below).
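The scheduler itself is a site-specific LSF extension developed with Platform Computing, so the sketch below only illustrates the shape of it; the DATA_DIR variable, the analyse.sh script and the submission options are made up:

   # analyse.sh (illustrative): the job reads the storage location chosen
   # by the scheduler from an environment variable and works there.
   cd "$DATA_DIR"
   samtools flagstat 5307_1.bam > flagstat.txt

   # Submission is an ordinary bsub; the dataset or storage-distance
   # request is carried in the site-specific resource string (not shown).
   bsub -q long -o analyse_%J.log ./analyse.sh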

57. Future Work

  • Let the system know about load on storage.
  • Perhaps storage that is further away is better, if the nearest storage is really busy.

Let the system move data.

  • The system currently moves jobs. Users are responsible for placing and registering datasets.

58. Hot datasets change over time. 59. Replicate/move the datasets to faster storage, or to a greater number of storage pools. Making LSF do data migration/replication will be hard.

  • If only there was some data-grid software that did that already...

60. Acknowledgements

  • Sanger Institute

61. Phil Butcher 62. ISG

  • James Beal

63. Gen-Tao Chiang 64. Pete Clapham 65. Simon Kelley Platform Computing

  • Chris Smith

66. Chris Duddington 67. Da Xu