CUGI Pilot Sequencing/Assembly Projects
Christopher Saski
Sequencing the Cacao Genome: 3 Megabases at a Time
• Pilot project to sequence and assemble a 3 Mbp segment of the cacao genome
• IBM in silico assembly project – Testing the assembly pipeline
Sequencing the Cacao Genome: 3 Megabases at a Time
• Combination of:
  – “Old School Genomics”
    • BAC libraries, physical mapping, and clone-by-clone sequencing
  – Roche 454 Titanium and FLX de novo sequencing
• Key:
  – No eukaryotic genome has yet been accurately assembled with NGS alone
  – Reduce assembly complexity
3 Megabase segments
Rounsley et al., 2009
Advantages
• Reduce assembly complexity
• Limit number of sequencing libraries
• Prioritize critical genomic regions
• Outsource BAC pools for sequencing in parallel at any center that has a 454 Titanium/GS-FLX sequencer
• Flexibility
  – Start slow with minimal investment
  – Could redesign strategy to reduce sequence runs
Strategy Components
• Integrated Physical/Genetic framework
• Pool development and sequencing:
  – BAC-end
  – Titanium 454 (paired/non-paired)
  – Draft sequence
• Assembly and integration:
  – Newbler
  – Celera (CABOG)
Cacao Integrated Physical/Genetic Framework
• Represents ~29X coverage (3 BAC libraries)
• Assembled into small number of large contigs
• Suggests reasonable levels of heterozygosity
• Manageable amounts of repetitive sequence
• 220 anchored genetic markers spanning 10 linkage groups
  – Resemble recombinational derived order
Pool Development
• Select contiguous BAC clones from MTP
• Pools will contain 25-30 clones
  – 20-30 kb overlap
• Complete Cacao MTP will require 120-150 pools
• Repetitive-type regions:
  – BAC-end sequence and physical map data serve as a predictive tool
    • Modify pools accordingly
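As a rough check on the pool design, the span of one pool can be estimated from the clone count, the insert size, and the overlap between neighboring clones. The sketch below is illustrative only: the 125 kb average insert size is an assumed value that does not come from the slides, and `pool_span_kb` is a hypothetical helper name.

```python
def pool_span_kb(n_clones, insert_kb, overlap_kb):
    """Approximate span of a tiled BAC pool: each clone after the first
    adds its insert length minus the overlap with its neighbor."""
    if n_clones < 1:
        raise ValueError("a pool needs at least one clone")
    return n_clones * insert_kb - (n_clones - 1) * overlap_kb


ASSUMED_INSERT_KB = 125  # assumption, not stated in the slides

# The two extremes of the quoted ranges (25-30 clones, 20-30 kb overlap).
for n_clones, overlap_kb in [(25, 30), (30, 20)]:
    span = pool_span_kb(n_clones, ASSUMED_INSERT_KB, overlap_kb)
    print(f"{n_clones} clones, {overlap_kb} kb overlap: ~{span / 1000:.2f} Mbp")
```

Under that assumed insert size, the quoted clone counts and overlaps bracket the 3 Mbp target; a different real insert size would shift the range.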
Pool Development
• Estimate contig size using Consensus Band (CB) algorithm
• Example: Cacao chloroplast (cp) genome is 160,604 bp
  – Hybridization revealed a cp-containing contig, estimated to be ~160 kb based on the CB algorithm
• Purified pool DNA can be produced at CUGI
  – Treat with ATP-dependent DNase
Sequencing
• 3 Levels of Sequence:
  – Paired BAC-end Sequence – 20 kb increments
  – End sequencing of pool members
  – 454 sequencing of BAC pools
    • Paired 3.5X-5.1X coverage (Roche 454/FLX)
    • Non-paired 17X-26X coverage (Titanium)
454 Runs—Whole Genome
• 454 Titanium non-paired – 26X coverage/pool
  – 4 pools per slide (up to 150 pools total)
    • Up to 38 slide runs
• 454 FLX paired-end (3 kb) – 5X coverage/pool
  – 16 pools per slide (up to 150 pools total)
    • Up to 10 slide runs total
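The slide-run counts follow from dividing the pool count by the pools per slide and rounding up. A minimal sketch of that arithmetic, using only the figures quoted on the slide (`slide_runs` is a hypothetical helper name):

```python
import math


def slide_runs(total_pools, pools_per_slide):
    """Number of slide runs needed, rounding up for a partly filled slide."""
    return math.ceil(total_pools / pools_per_slide)


TOTAL_POOLS = 150  # upper bound quoted on the slide

titanium_runs = slide_runs(TOTAL_POOLS, 4)   # non-paired, 26X per pool
flx_runs = slide_runs(TOTAL_POOLS, 16)       # 3 kb paired-end, 5X per pool
print(titanium_runs, flx_runs)  # 38 10
```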
Assembly/Curation of 3Mbp Segment
• Preprocessing
  – Filter reads to remove:
    • Paired-end reads that did not contain both ends
    • BAC vector
    • E. coli (host DNA)
• Newbler Assembler (Roche)
• Celera Assembler (CABOG)
  – Improvements in homopolymer calls and heterogeneous read length issues
  – Recently shown to give N50 contig sizes double those of Newbler
    • Human (50% repetitive) and microbes
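Since the assemblers are compared by N50 contig size, here is a minimal sketch of how that statistic is usually computed. The contig lengths are made-up values for illustration, not results from either assembler.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


example = [80, 70, 50, 40, 30, 20, 10]  # made-up contig lengths (kb)
print(n50(example))  # 70
```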
Assembly Curation of 3Mbp Segment
• Assembly at various depths (5X, 10X, 15X)
  – Determine optimal sequencing coverage
• Utilize available data to scaffold contigs:
  – BAC end sequences every 20 kb
  – Genetic marker sequences
  – RNA-seq clusters
  – Arabidopsis – Cacao synteny
  – Draft Sequence (2X)
• Augment approach by covering regions missed by clones – assist in selecting MTP
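One way to prepare the 5X, 10X, and 15X assemblies is to subsample the full read set down to each target depth before assembling. The sketch below assumes that approach (the slides do not say how the depths are produced); `subsample_to_depth` and the toy read set are hypothetical.

```python
import random


def subsample_to_depth(read_lengths, region_bp, target_depth, seed=0):
    """Randomly pick reads until their summed length reaches
    target_depth * region_bp; returns indices of the chosen reads."""
    needed = target_depth * region_bp
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)
    chosen, total = [], 0
    for idx in order:
        if total >= needed:
            break
        chosen.append(idx)
        total += read_lengths[idx]
    if total < needed:
        raise ValueError("not enough reads for the requested depth")
    return chosen


# Toy data: 10,000 reads of 400 bp over a 100 kb region is 40X in total.
reads = [400] * 10_000
for depth in (5, 10, 15):
    picked = subsample_to_depth(reads, 100_000, depth)
    print(depth, len(picked))
```

Each subsample would then be assembled separately and the resulting contig statistics compared to pick the lowest depth that still assembles well.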
Assembly Curation of 3Mbp Segment
• Deliverable will be a pseudomolecule sequence for the 3 Mbp region
  – Gaps will be strings of N
    • Assess and employ lab-based gap filling strategies
• Make every attempt to close gaps
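A pseudomolecule with N-filled gaps can be sketched as ordered contigs joined by runs of N. This is an illustration only: `build_pseudomolecule` is a hypothetical helper, and the placeholder length used for gaps of unknown size is an assumption, not something the slides specify.

```python
def build_pseudomolecule(contigs, gap_sizes, default_gap=100):
    """Join ordered contigs, writing each gap as a run of N.

    gap_sizes[i] is the estimated gap after contigs[i]; None means the
    size is unknown, in which case default_gap N's are written
    (the default value is an assumption).
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need one gap size per adjacent contig pair")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * (default_gap if gap is None else gap))
        parts.append(contig)
    return "".join(parts)


print(build_pseudomolecule(["ACGT", "GGCC", "TTAA"], [3, None], default_gap=5))
# ACGTNNNGGCCNNNNNTTAA
```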
Assembly Validation and Correction
• In silico virtual digest of scaffold sequence, compared to physical map restriction fragments
  – Draft sequence integration (DSI) via FPC
    • Integrate and visualize physical map, 3 Mbp segments, and draft sequence
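The virtual digest step can be sketched as cutting the scaffold at every recognition site and comparing the predicted fragment sizes with the observed ones. The enzyme is an assumption: the slides do not name it, so HindIII (A^AGCTT) is used purely as an example, and both function names and the 5% size tolerance are hypothetical. This is not how FPC itself performs the comparison.

```python
import re


def virtual_digest(sequence, site="AAGCTT", cut_offset=1):
    """Return fragment lengths from cutting `sequence` at every `site`.

    cut_offset is the cut position within the site (HindIII cuts
    A^AGCTT, so the offset is 1). Matches are non-overlapping, which is
    sufficient for sites that cannot overlap themselves.
    """
    seq = sequence.upper()
    cuts = [m.start() + cut_offset for m in re.finditer(site, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def matching_fragments(predicted, observed, tolerance=0.05):
    """Count predicted fragments that have an observed fragment within
    a relative size tolerance (a deliberately simple comparison)."""
    return sum(
        any(abs(p - o) <= tolerance * o for o in observed) for p in predicted
    )


fragments = virtual_digest("GGGAAGCTTCCCCAAGCTTTT")
print(fragments)                                  # [4, 10, 7]
print(matching_fragments(fragments, [4, 10, 8]))  # 2
```

A scaffold whose predicted fragments mostly lack an observed counterpart would be a candidate for misassembly and closer inspection.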
Sequence/Assembly Pipeline
IBM in silico Sequences
• IBM will provide a set of sequences that mimic the pilot cacao sequences
  – Input error
    • Indels, homopolymer calls, nucleotide substitutions
• Simulated data to test pipeline:
  – Physical map
  – Simulated BAC end sequences
  – Simulated pseudo-reads from pooled BACs
  – EST clusters
  – Indicate reference species for syntenic comparisons
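To illustrate what "input error" might look like in simulated reads, the sketch below injects random substitutions and single-base indels into a sequence. It is not IBM's simulator: the function name and the error rates are assumptions, and homopolymer-length errors are not modeled separately here.

```python
import random


def add_errors(seq, sub_rate=0.005, indel_rate=0.005, seed=0):
    """Return a copy of seq with random substitutions, insertions,
    and deletions; rates are per-base probabilities (assumed values)."""
    rng = random.Random(seed)
    out = []
    for base in seq:
        r = rng.random()
        if r < sub_rate:
            # Substitution: replace with a different nucleotide.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        elif r < sub_rate + indel_rate / 2:
            continue  # deletion: drop this base
        elif r < sub_rate + indel_rate:
            out.append(base)
            out.append(rng.choice("ACGT"))  # insertion after this base
        else:
            out.append(base)
    return "".join(out)


reference = "ACGT" * 25
noisy = add_errors(reference, sub_rate=0.02, indel_rate=0.02, seed=1)
print(len(reference), len(noisy))
```

Because the generator is seeded, the same inputs always give the same noisy read, which keeps pipeline tests reproducible.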
Pilot Project Budget
• BAC-end sequencing (30K BACs), 20 kb increments
  – $206,605.00
• Assembly/curation/validation of cacao 3 Mbp
  – $16,720.00
• Assembly of IBM in-silico derived sequences
  – $15,400.00
ESTIMATED Budget – Whole Genome Assembly
• Assembly, curation, validation of 130-150, 3 Mbp segments
  – $147,620.00
• Automated structural/functional annotation
  – $8,800.00
Acknowledgements
• USDA-ARS
• Mars Inc.
• Dr. Alex Feltus
• Stephen Ficklin
• Dr. Keith Murphy
• Dr. Margaret Staton