3000 rice genome white paper


Upload: tewei-robert-luo

Post on 17-Jul-2016


WHITEPAPER, Version: Draft, 23 January 2015 © 2015 L3 Bioinformatics Limited, All rights reserved

Introduction

With the rapid and ever-increasing growth of genomic data, managing such data while still deriving useful information from it efficiently has become a real challenge. BGI-Online is a time- and cost-efficient solution to this challenge.

The 3K Rice Genomes Project* has sequenced 3,000 rice genomes at an average sequencing depth of 14x, producing a 13.4-terabyte dataset. BGI-Online can analyze all 3,000 rice genomes in less than a day, at a cost of no more than 2 US dollars per sample.

The clean reads, alignments and identified variants are now available on BGI-Online (search "3K Rice Genome Projects" under Public Files).

Analysis Pipeline

* The 3,000 rice genomes project. GigaScience, Vol. 3, No. 1 (2014), 7. doi:10.1186/2047-217x-3-7

Cost: $1.41 / Sample

Computation

AWS On-Demand Price (US$ / Hr): 1.68
AWS Spot Instance Price, avg. (US$ / Hr): 0.257
On-Demand allocation (%): 20
Spot Instance allocation (%): 80
Min. Pgms per Machine (Pgm): 4
Duration (Hr / Sample): 6
Max Computation Cost (US$): 0.81

Storage

Storage Volume (GB / Sample): 10
Hot Storage Price (US$ / Month): 0.03
Hot Storage Duration (Month): 1
Archive Price (US$ / Month): 0.01
Archive Duration (Month): 3
Storage Cost (US$): 0.60
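The per-sample figure can be reproduced from the listed inputs. The sketch below is one reading of the cost table, not the vendor's published formula: it assumes the hourly price is a blend of on-demand and spot prices weighted by the allocation percentages, that each machine's cost is shared by the four programs it runs, and that the storage prices are per GB per month. All names are illustrative. The last function estimates how many machines the "less than a day" claim would imply under the same assumptions.

```python
import math

# Inputs copied from the cost table; variable names are illustrative.
ON_DEMAND_PRICE = 1.68       # US$ / hr
SPOT_PRICE_AVG = 0.257       # US$ / hr
ON_DEMAND_SHARE = 0.20       # 20 % of machines on-demand
SPOT_SHARE = 0.80            # 80 % of machines spot
PROGRAMS_PER_MACHINE = 4     # "Min. Pgms per Machine"
HOURS_PER_SAMPLE = 6         # "Duration Hr / Sample"
STORAGE_GB_PER_SAMPLE = 10
HOT_PRICE = 0.03             # assumed to be US$ per GB per month
HOT_MONTHS = 1
ARCHIVE_PRICE = 0.01         # assumed to be US$ per GB per month
ARCHIVE_MONTHS = 3


def computation_cost() -> float:
    """Blended hourly price times duration, shared across programs on a machine."""
    blended = ON_DEMAND_SHARE * ON_DEMAND_PRICE + SPOT_SHARE * SPOT_PRICE_AVG
    return blended * HOURS_PER_SAMPLE / PROGRAMS_PER_MACHINE


def storage_cost() -> float:
    """Hot storage followed by archive storage for one sample."""
    per_gb = HOT_PRICE * HOT_MONTHS + ARCHIVE_PRICE * ARCHIVE_MONTHS
    return STORAGE_GB_PER_SAMPLE * per_gb


def total_cost() -> float:
    return computation_cost() + storage_cost()


def machines_needed(samples: int, deadline_hours: float) -> int:
    """Machines required to finish all samples within the deadline."""
    machine_hours = samples * HOURS_PER_SAMPLE / PROGRAMS_PER_MACHINE
    return math.ceil(machine_hours / deadline_hours)


if __name__ == "__main__":
    print(f"computation: {computation_cost():.2f} US$ / sample")
    print(f"storage:     {storage_cost():.2f} US$ / sample")
    print(f"total:       {total_cost():.2f} US$ / sample")
    print(f"machines for 3,000 samples in 24 h: {machines_needed(3000, 24)}")
```

Under these assumptions the components round to 0.81 and 0.60 US$, matching the table, and processing 3,000 samples within 24 hours would take on the order of 188 machines running concurrently.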

BGI-Online Selected Features

Platform

✓ Powerful metadata system
✓ Dynamic scaling to support thousands of tasks, or none
✓ Supports all major sequencing technologies
✓ Data Level Parallelism (DLP)
✓ Automation & LIMS integration via API
✓ Multi-platform graphical desktop client for uploads
✓ Command line uploader
✓ Multi-tier, flexible, long-term storage
✓ Multiple users and roles per project
✓ Multiple and transferable payment roles

Security

✓ HIPAA & EU’s Data Protection Directive compliant
✓ Fine-grained Project/User privileges
✓ Secure (SSL) data transfer through whole platform
✓ AES-256 data encryption at rest

Bioinformatics Analysis

✓ WGS, WES & RNA-Seq analysis
✓ Quality Control tools
✓ Build and customize pipelines with visual editor
✓ Wrap in-house pipelines with SDK
✓ Pre-collected public datasets in one place

1. The user logs on to BGI-Online.
2. BGI-Online creates a temporary access token.
3. Using the token, data are uploaded to the Engine and de-identified. Keys to restore the data are stored in the Metadata database. De-identified data are stored synchronously in the Encrypted tier-1 cache and an S3 bucket.
4. Once the user starts a computation, BGI-Online calculates the optimal execution plan. Final results are uploaded to the Encrypted tier-1 cache.
5. Infrequently accessed data are removed from the Encrypted tier-1 cache,
6. or are further archived in the Encrypted Glacier Vault and removed from S3.

Secured Logical Dataflow
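The storage side of steps 3, 5 and 6 can be illustrated with a small model. This is a hypothetical sketch, not BGI-Online's implementation: the whitepaper does not say what counts as "infrequently accessed", so the two idle-time thresholds, the tier names and the `StoredObject` class are all assumptions made for illustration, and steps 5 and 6 are modelled here as successive stages.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds; the whitepaper gives no concrete values.
CACHE_IDLE_DAYS = 30     # idle time after which data leave the tier-1 cache
ARCHIVE_IDLE_DAYS = 90   # idle time after which data move from S3 to Glacier


def initial_locations() -> set:
    # Step 3: de-identified data are written to the cache and S3 together.
    return {"tier1_cache", "s3"}


@dataclass
class StoredObject:
    name: str
    idle_days: int = 0
    locations: set = field(default_factory=initial_locations)


def apply_lifecycle(obj: StoredObject) -> StoredObject:
    """Move an object between tiers based on how long it has been idle."""
    if obj.idle_days >= CACHE_IDLE_DAYS:
        # Step 5: infrequently accessed data leave the tier-1 cache.
        obj.locations.discard("tier1_cache")
    if obj.idle_days >= ARCHIVE_IDLE_DAYS and "s3" in obj.locations:
        # Step 6: data are archived to Glacier and removed from S3.
        obj.locations.add("glacier")
        obj.locations.discard("s3")
    return obj


if __name__ == "__main__":
    for days in (0, 45, 120):
        obj = apply_lifecycle(StoredObject("sample.bam", idle_days=days))
        print(days, sorted(obj.locations))
```

With these assumed thresholds, a freshly uploaded object sits in both the cache and S3, an object idle for 45 days remains only in S3, and one idle for 120 days remains only in Glacier.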
