Sharing Data: Sanger Experiences

DESCRIPTION

Sharing large amounts of data is easier said than done. This talk gives an overview of our experiences doing big-data science over wide-area networks.

TRANSCRIPT

  • 1. Data Sharing: Sanger Experiences
    • Guy Coates, Wellcome Trust Sanger Institute, [email_address]

2. Background

  • Moving large amounts of data:
  • 3. Cloud Experiments
    • Moving data to the cloud
  • Production Pipelines
    • Moving data to EBI
  • Do we need to move this data at all?

4. Cloud Experiments

  • Can we move some Solexa image files to AWS and run our processing pipeline?
  • 5. Answer: No.
    • Moving the data took much longer than processing it.
    • 6. First attempt: 14 Mbit/s out of a 2 Gbit/s link.
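
A rough sense of scale for why 14 Mbit/s was a show-stopper (a back-of-the-envelope sketch; the ~1 TB figure for a run's image data is an assumption for illustration, not a number from the talk):

```python
# Back-of-the-envelope transfer-time estimate (illustrative figures only).
# A single run's image data is assumed here to be ~1 TB; the real Solexa
# volumes are not given in the slides.

def transfer_hours(data_bytes: float, rate_bits_per_s: float) -> float:
    """Hours needed to move `data_bytes` at a sustained `rate_bits_per_s`."""
    return data_bytes * 8 / rate_bits_per_s / 3600

one_tb = 1e12  # bytes

print(f"at 14 Mbit/s: {transfer_hours(one_tb, 14e6):6.1f} h")  # ~159 h, nearly a week
print(f"at  2 Gbit/s: {transfer_hours(one_tb, 2e9):6.1f} h")   # ~1.1 h
```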

7. Do some reading:

  • http://fasterdata.es.net
  • 8. Department of Energy Office of Science.
    • Covers all of the technical bits and pieces required to make wide-area transfers go fast.

9. Getting better:

  • Use the right tools:
    • Use WAN tools (GridFTP/FDT/Aspera), not rsync/ssh.
    • 10. Tune your TCP stack.
  • Data transfer rates:
    • Cambridge -> EC2 East Coast: 12 Mbyte/s (96 Mbit/s)
    • 11. Cambridge -> EC2 Dublin: 25 Mbyte/s (200 Mbit/s)
  • What speed should we get?
    • Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
  • How do we get the broken bits in the middle fixed?
    • Finding the person responsible for a broken router on the internet is hard.
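
A minimal sketch of what "tune your TCP stack" is about: on a long, fat path a single TCP stream is roughly capped at window / RTT, so the socket buffers have to cover the bandwidth-delay product. The link speed and round-trip time below are assumed figures, not measurements from the talk:

```python
import socket

# Throughput on one TCP stream is roughly capped at window / RTT, so the
# window (socket buffers) must cover the bandwidth-delay product (BDP)
# of the path.  Both figures below are assumptions for illustration.

link_bits_per_s = 1e9   # assume a 1 Gbit/s usable path
rtt_s = 0.08            # assume ~80 ms Cambridge -> US East Coast

bdp_bytes = link_bits_per_s * rtt_s / 8
print(f"BDP: {bdp_bytes / 1e6:.0f} MB must be in flight to fill the pipe")

# Ceiling with a default-ish 64 KB window:
print(f"64 KB window ceiling: {64 * 1024 / rtt_s / 1e6:.1f} MB/s")  # ~0.8 MB/s

# The per-socket knob (system-wide sysctl limits still apply and usually
# also need raising):
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, int(bdp_bytes))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, int(bdp_bytes))
```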

12. What about the Physicists?

  • The LHC moves 20 PB a year across the internet to their processing sites.
    • Not really.
    • 13. Dedicated 10 GigE networking between CERN and the 10 Tier 1 centres.
  • Even with dedicated paths, it is still hard.
    • Multiple telcos are involved, even for a point-to-point link.
    • 14. Constant monitoring / bandwidth tests to ensure it stays working.
    • 15. See the HEPIX talks for the gory details.

16. We need bigger networks:

  • A fast network is fundamental to moving data.
  • 17. Is it the only thing we need to do?

18. Sanger Production Pipeline

  • Provides a nice example of moving large amounts of data in real life.

19. Sequencing data flow

  • Sequencer -> Analysis/alignment -> Internal repository (Sanger) -> EGA / SRA (EBI)

20. Data movement between Sanger/EBI

  • This should be easy...
    • We are on the same campus.
    • 21. 10 Gbit/s (1.2 Gbyte/s) link between EBI and Sanger.
    • 22. We share a data-centre.
    • 23. Physically near, so we do not need to worry about WAN issues.

24. It is not just networks:

  • Speed will only be as fast as the slowest link.
  • 25. Speed was not a design point for our holding area.
    • $ per TB was the overriding design goal, not speed.
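
To make the slowest-link point concrete, a small sketch with invented component rates (not measured Sanger/EBI figures):

```python
# End-to-end speed is capped by the slowest component in the chain.
# All rates are invented illustrative figures in Mbyte/s, not measured
# Sanger/EBI numbers.
path = {
    "source disk (cheap holding area)": 60,
    "source firewall": 400,
    "10 Gbit/s campus link": 1200,
    "destination firewall": 400,
    "destination disk": 60,
}

bottleneck = min(path, key=path.get)
print(f"effective rate ~{path[bottleneck]} MB/s, limited by the {bottleneck}")
```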

  • (Diagram: the end-to-end path between Sanger and EBI runs Disk -> Network -> Firewall -> Server at each end, joined across the Internet.)

26. Organisational issues:

  • Data movement was not considered until after Sanger/EBI started building the systems.
    • Hard to do fast data transfers if your disk subsystem is not up to the job.
  • Expectation management:
    • How fast should I be able to move data?
  • Good communication.
    • Multi-institute teams.
    • 27. Need to take end-to-end ownership across institutions.
  • Application Led:
    • Nobody cares about raw data rates; they care how fast their application can move data.
    • 28. Need application developers and sys-admins to work together.
  • This needs to be in-place before you start projects!

29. Do we need to move the data?

30. Centralised data

  • (Diagram: several sequencing centres all ship their data to a central sequencing centre + DCC.)

31. Example Problem:

  • We want to run our pipeline across 100 TB of data currently in EGA/SRA.
  • 32. We will need to de-stage the data to Sanger, and then run the compute.
    • Extra 0.5 PB of storage, 1000 cores of compute.
    • 33. 3-month lead time.
    • 34. ~$1.5M capex.
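
For scale, a rough sketch of the de-staging step alone at various sustained rates (the rates are assumptions; the 200 Mbit/s entry echoes the WAN rate quoted earlier in the talk):

```python
# Time to de-stage 100 TB at various sustained rates.  The rates are
# assumptions; the point is that real end-to-end rates fall far short
# of nominal link speeds.

data_bytes = 100e12  # 100 TB

for label, rate_bits_per_s in [
    ("10 Gbit/s (nominal link)", 10e9),
    ("1 Gbit/s (optimistic sustained)", 1e9),
    ("200 Mbit/s (WAN rate quoted earlier)", 200e6),
]:
    days = data_bytes * 8 / rate_bits_per_s / 86400
    print(f"{label:38s}: {days:5.1f} days")
```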

35. Federation: a better way

  • Collaborations are short term: 18 months to 3 years.
  • (Diagram: sequencing centres keep their data locally and expose it through federated access.)

36. Federation software

  • Data size per genome:
    • Intensities / raw data: 2 TB
    • Alignments: 200 GB
    • Sequence + quality data: 500 GB
    • Variation data: 1 GB
    • Individual features: 3 MB
  • iRODS (data grid software) for unstructured data (flat files); BioMart for structured data (databases).

37. Cloud / Computable archives

  • Can we move the compute to the data?
    • Upload workload onto VMs.
    • 38. Put VMs on compute that is attached to the data.
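
A toy sketch of the placement decision this implies: launch the VM at whichever site already holds the dataset, rather than copying the data out first. The dataset IDs, site names and image name are all hypothetical:

```python
# Toy placement decision for "move the compute to the data".
# Dataset IDs, site names and the image name are all hypothetical.

DATASET_SITE = {
    "EGA-study-123": "archive-side-cloud",
    "sanger-run-456": "sanger-farm",
}

def place_workload(dataset: str, vm_image: str) -> str:
    """Return the placement decision for a VM-packaged pipeline."""
    site = DATASET_SITE[dataset]
    # A real system would now call that site's VM launch API; here we
    # just report where the compute should run.
    return f"launch {vm_image} at {site}, mounting {dataset} locally"

print(place_workload("EGA-study-123", "alignment-pipeline.img"))
```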

  • (Diagram: compute attached to each data store; the VM is moved to the compute next to the data rather than the data to the VM.)

39. Summary

  • We need fast network links.
  • 40. We need cross-site teams who can troubleshoot all potential trouble spots.
  • 41. Teams need application & systems people.

42. Acknowledgements:

  • The HEPIX Community.
    • http://www.hepix.org
  • Team ISG:
    • James Beal
    • 43. Gen-Tao Chiang
    • 44. Pete Clapham
    • 45. Simon Kelley