Sharing Data: Sanger Experiences

DESCRIPTION

Sharing large amounts of data is easier said than done. This talk gives an overview of our experiences doing big-data science over wide-area networks.

TRANSCRIPT

  • 1. Data Sharing: Sanger Experiences
    • Guy Coates, Wellcome Trust Sanger Institute, [email_address]

2. Background

  • Moving large amounts of data:
  • 3. Cloud Experiments
    • Moving data to the cloud
  • Production Pipelines
    • Moving data to EBI
  • Do we need to move this data at all?

4. Cloud Experiments

  • Can we move some Solexa image files to AWS and run our processing pipeline?
  • 5. Answer: No.
    • Moving the data took much longer than processing it.
    • 6. First attempt: 14 Mbit/s out of a 2 Gbit/s link.
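
A rough sense of scale for why 14 Mbit/s was a show-stopper (a back-of-the-envelope sketch; the ~1 TB figure for a run's image data is an assumption for illustration, not a number from the talk):

```python
# Back-of-the-envelope transfer-time estimate (illustrative figures only).
# A single run's image data is assumed here to be ~1 TB; the real Solexa
# volumes are not given in the slides.

def transfer_hours(data_bytes: float, rate_bits_per_s: float) -> float:
    """Hours needed to move `data_bytes` at a sustained `rate_bits_per_s`."""
    return data_bytes * 8 / rate_bits_per_s / 3600

one_tb = 1e12  # bytes

print(f"at 14 Mbit/s: {transfer_hours(one_tb, 14e6):6.1f} h")  # ~159 h, nearly a week
print(f"at  2 Gbit/s: {transfer_hours(one_tb, 2e9):6.1f} h")   # ~1.1 h
```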

7. Do some reading:

  • http://fasterdata.es.net
  • 8. Department of Energy Office of Science.
    • Covers all of the technical bits and pieces required to make wide-area transfers go fast.

9. Getting better:

  • Use the right tools:
    • Use WAN tools (GridFTP/FDT/Aspera), not rsync/ssh.
    • 10. Tune your TCP stack.
  • Data transfer rates:
    • Cambridge -> EC2 East Coast: 12 Mbyte/s (96 Mbit/s)
    • 11. Cambridge -> EC2 Dublin: 25 Mbyte/s (200 Mbit/s)
  • What speed should we get?
    • Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
  • How do we get the broken bits in the middle fixed?
    • Finding the person responsible for a broken router on the internet is hard.
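
A minimal sketch of what "tune your TCP stack" is about: on a long, fat path a single TCP stream is roughly capped at window / RTT, so the socket buffers have to cover the bandwidth-delay product. The link speed and round-trip time below are assumed figures, not measurements from the talk:

```python
import socket

# Throughput on one TCP stream is roughly capped at window / RTT, so the
# window (socket buffers) must cover the bandwidth-delay product (BDP)
# of the path.  Both figures below are assumptions for illustration.

link_bits_per_s = 1e9   # assume a 1 Gbit/s usable path
rtt_s = 0.08            # assume ~80 ms Cambridge -> US East Coast

bdp_bytes = link_bits_per_s * rtt_s / 8
print(f"BDP: {bdp_bytes / 1e6:.0f} MB must be in flight to fill the pipe")

# Ceiling with a default-ish 64 KB window:
print(f"64 KB window ceiling: {64 * 1024 / rtt_s / 1e6:.1f} MB/s")  # ~0.8 MB/s

# The per-socket knob (system-wide sysctl limits still apply and usually
# also need raising):
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, int(bdp_bytes))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, int(bdp_bytes))
```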

12. What about the Physicists?

  • The LHC moves 20 PB a year across the internet to their processing sites.
    • Not really.
    • 13. Dedicated 10 GigE networking between CERN and the 10 Tier 1 centres.
  • Even with dedicated paths, it is still hard.
    • Multiple telcos are involved, even for a point-to-point link.
    • 14. Constant monitoring / bandwidth tests to ensure it stays working.
    • 15. See the HEPIX talks for the gory details.

16. We need bigger networks:

  • A fast network is fundamental to moving data.
  • 17. Is it the only thing we need to do?

18. Sanger Production Pipeline

  • Provides a nice example of moving large amounts of data in real life.

19. Sequencing data flow

  • Sequencer -> Analysis/alignment -> Internal repository (Sanger) -> EGA / SRA (EBI)

20. Data movement between Sanger/EBI

  • This should be easy...
    • We are on the same campus.
    • 21. 10 Gbit/s (1.2 Gbyte/s) link between EBI and Sanger.
    • 22. We share a data-centre.
    • 23. Physically near, so we do not need to worry about WAN issues.

24. It is not just networks:

  • Speed will only be as fast as the slowest link.
  • 25. Speed was not a design point for our holding area.
    • $ per TB was the overriding design goal, not speed.
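
To make the slowest-link point concrete, a small sketch with invented component rates (not measured Sanger/EBI figures):

```python
# End-to-end speed is capped by the slowest component in the chain.
# All rates are invented illustrative figures in Mbyte/s, not measured
# Sanger/EBI numbers.
path = {
    "source disk (cheap holding area)": 60,
    "source firewall": 400,
    "10 Gbit/s campus link": 1200,
    "destination firewall": 400,
    "destination disk": 60,
}

bottleneck = min(path, key=path.get)
print(f"effective rate ~{path[bottleneck]} MB/s, limited by the {bottleneck}")
```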

  • (Diagram: the end-to-end path between Sanger and EBI runs Disk -> Network -> Firewall -> Server at each end, joined across the Internet.)

26. Organisational issues:

  • Data movement was not considered until after Sanger/EBI started building the systems.
    • Hard to do fast data transfers if your disk subsystem is not up to the job.
  • Expectation management:
    • How fast should I be able to move data?
  • Good communication.
    • Multi-institute teams.
    • 27. Need to take end-to-end ownership across institutions.
  • Application Led:
    • Nobody cares about raw data rates; they care how fast their application can move data.
    • 28. Need application developers and sys-admins to work together.
  • This needs to be in-place before you start projects!

29. Do we need to move the data?

30. Centralised data

  • (Diagram: several sequencing centres all ship their data to a central sequencing centre + DCC.)

31. Example Problem:

  • We want to run our pipeline across 100 TB of data currently in EGA/SRA.
  • 32. We will need to de-stage the data to Sanger, and then run the compute.
    • Extra 0.5 PB of storage, 1000 cores of compute.
    • 33. 3-month lead time.
    • 34. ~$1.5M capex.
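
For scale, a rough sketch of the de-staging step alone at various sustained rates (the rates are assumptions; the 200 Mbit/s entry echoes the WAN rate quoted earlier in the talk):

```python
# Time to de-stage 100 TB at various sustained rates.  The rates are
# assumptions; the point is that real end-to-end rates fall far short
# of nominal link speeds.

data_bytes = 100e12  # 100 TB

for label, rate_bits_per_s in [
    ("10 Gbit/s (nominal link)", 10e9),
    ("1 Gbit/s (optimistic sustained)", 1e9),
    ("200 Mbit/s (WAN rate quoted earlier)", 200e6),
]:
    days = data_bytes * 8 / rate_bits_per_s / 86400
    print(f"{label:38s}: {days:5.1f} days")
```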

35. Federation: a better way

  • Collaborations are short term: 18 months to 3 years.
  • (Diagram: sequencing centres keep their data locally and expose it through federated access.)

36. Federation software

  • Data size per genome:
    • Intensities / raw data: 2 TB
    • Alignments: 200 GB
    • Sequence + quality data: 500 GB
    • Variation data: 1 GB
    • Individual features: 3 MB
  • iRODS (data grid software) for unstructured data (flat files); BioMart for structured data (databases).

37. Cloud / Computable archives

  • Can we move the compute to the data?
    • Upload workload onto VMs.
    • 38. Put VMs on compute that is attached to the data.
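
A toy sketch of the placement decision this implies: launch the VM at whichever site already holds the dataset, rather than copying the data out first. The dataset IDs, site names and image name are all hypothetical:

```python
# Toy placement decision for "move the compute to the data".
# Dataset IDs, site names and the image name are all hypothetical.

DATASET_SITE = {
    "EGA-study-123": "archive-side-cloud",
    "sanger-run-456": "sanger-farm",
}

def place_workload(dataset: str, vm_image: str) -> str:
    """Return the placement decision for a VM-packaged pipeline."""
    site = DATASET_SITE[dataset]
    # A real system would now call that site's VM launch API; here we
    # just report where the compute should run.
    return f"launch {vm_image} at {site}, mounting {dataset} locally"

print(place_workload("EGA-study-123", "alignment-pipeline.img"))
```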

  • (Diagram: compute attached to each data store; the VM is moved to the compute next to the data rather than the data to the VM.)

39. Summary

  • We need fast network links.
  • 40. We need cross-site teams who can troubleshoot all potential trouble spots.
  • 41. Teams need application & systems people.

42. Acknowledgements:

  • The HEPIX Community.
    • http://www.hepix.org
  • Team ISG:
    • James Beal
    • 43. Gen-Tao Chiang
    • 44. Pete Clapham
    • 45. Simon Kelley