Download - Assembly and finishing
![Page 1: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/1.jpg)
Genome Assembly and Finishing
Alla Lapidus, Ph.D.Associate Professor
Fox Chase Cancer Center
![Page 2: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/2.jpg)
A typical Microbial (and not only) project
FINISHING
Annotation
Public release
Sequencing
Draftassembly
Goals:
Completely restore genome
Produce high quality consensus
![Page 3: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/3.jpg)
Sequencing Technology at a Glance
![Page 4: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/4.jpg)
Evolution of Microbial Drafts
Sanger only – 4x of 3kb plasmids + 4x of 8kb plasmids + 1x of fosmids – ~ $50k for 5MB genome draft
Hybrid Sanger/pyrosequence/Illumina – 4x 8kb Sanger + 15 x coverage 454 shotgun + 20x Illumina
(quality improvement)– ~ $35k for 5MB genome draft
454 + Solexa - 20x coverage 454 standard + 4x coverage 454 paired end (PE) + 50x
coverage Illumina shotgun (quality improvement; gaps) - ~ $10k per 5MB genome
Solexa only - low cost; too fragmented; good assembler is needed!
Solexa +PacBio - low cost; better sachffolding
![Page 5: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/5.jpg)
Process Overview
![Page 6: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/6.jpg)
Library Preparation - Sanger
DNA fragmentation
Random fragment DNA
![Page 7: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/7.jpg)
Library Preparation - new
![Page 8: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/8.jpg)
Assembly (assembler)
• Sanger reads only (phrap, PGA, Arachne) --
3kb-- --3kb--
--8kb-- --8kb--
---------40kb--------
• Hybrid Sanger/pyrosequence/Solexa (no special assemblers; use
Newbler, PGA, Arachne) 454 contig454 contig
--8kb-- --8kb-- --8kb--
--8kb-- --8kb-- --8kb--
454 shreds454 shreds
• 454/Solexa (Newbler, PCAP, Velvet, ALLPATH etc) –
Shotgun readsPE reads
![Page 9: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/9.jpg)
Assembly: set of contigs
Draft assembly - what we get
10 16 21
10 21
Clone walk(Sanger lib)
Ordered sets of contigs (scaffolds)
New technologies: no clones to walk off even if you can scaffold contigs
(bPCR – new approach of gap closing)
16
PCR - sequence
pri1 pri2
PCR product
PE
![Page 10: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/10.jpg)
Primer walking
PCR – sequence
(un captured gaps)
Template: gDNA
PCR product
Clone walk
(captured gaps)
Clone A
![Page 11: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/11.jpg)
Why do we have gaps
• Sequencing coverage may not span all regions of the genome, thus producing gaps in the assembly – colony picking
• Assembly results of the shotgun reads may produce misassembled regions due to repetitive sequences (new and old tech)
• A biased base content (this can result in failure to be cloned, poor stability in the chosen host-vector system, or inability of the polymerase to reliably copy the sequence):
~ AT-rich DNA clones poorly in bacteria (cloning bias; promoters like structures {Sanger} )=> uncaptured gaps ~GC rich DNA is difficult to PCR and to sequence and often requires the use of special chemistry => captured gaps
~ high AT and GC content caused by problematic PCR (new tech)
What are gaps ?- Genome areas not covered by
random shotgun
![Page 12: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/12.jpg)
Actual genome
Assembling repeats
![Page 13: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/13.jpg)
High GC sequencing problems:
The presence of small hairpins (inverted repeat sequences) in theDNA that re anneal ether during sequencing or electrophoresisresulting in failed sequencing reactions or unreadable electrophoresisresults. (This can be aided by adding modifiers to the reaction,sequencing smaller clones and running gels at higher temperatures inthe presence of stronger denaturants).
![Page 14: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/14.jpg)
Why more than one platform?
• 454 - high quality reliable skeletons of genomes (454 std + 454 PE): correctly assembled contigs; problems with repeats (unassembled or assembled in contigs outside of main scaffolds); homopolymer related frame shifts
• Illumina data is used to help improve the overall consensus quality, correct frameshifts and to close secondary structure related gaps; not ready for de-novo assembly of complex genomes (too many gaps!)
• Sanger – finishing reads; fosmids – larger repeats and templates for primer walk – less cost effective but very useful in many cases
![Page 15: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/15.jpg)
454 (pyrosequence) and low GC
genomesThermotoga lettingae TMO
Sanger based draft assembly: - 55 total contigs; 41 contigs >2kb- 38GC% - biased Sanger libraries
Draft assembly +454- 2 total contigs; 1 contigs >2kb- 454 – no cloning
<166bp> - average length of gaps
![Page 16: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/16.jpg)
454 and High GC projectsXylanimonas cellulosilytica DSM 15894 (3.8 MB; 72.1% GC)
PGA assembly - 9x of 8kb PGA assembly - 9x of 8kb +454
Assembly Total contigs Major contigs Scaffolds Misassenblies* N50
PGA-8kb 210 166 4 165 41,048
PGA-8kb+454 33 23 2 14 288,369
![Page 17: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/17.jpg)
454/Sanger contig
Fosmid ends* and 454 PE
1.Pyrosequence and Sanger to obtain main ordered and oriented part of the assembly – Newbler assembler
3. Solexa reads to detect and correct errors in consensus –in house created tool (the Polisher) and close gaps (Velvet)
2. GapResolution (in house tool) to close some (up to 40%) gaps using unassembled 454 data – PGA or Newbler assemblers
Solexa
* Fosmids ends not used for microbes
Unassembled 454 reads
NextGen high Quality Drafts at JGI (multiple sequencing platforms)
Solexa contig
![Page 18: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/18.jpg)
Solving gaps: gapResopution tool
ContigGap (due to repeat)
Read pairs that are found in contigs outside of this
scaffold
Step 1 For each gap, identify read pairs from contigs found on different scaffolds
Step 2 Assemble reads in contigs adjacent to the gap and reads obtained from contigs outside the scaffold. Sometimes use assembler other than Newbler for sub-assemblies (PGA)
Contig Gap
Consensus from sub-assembly
![Page 19: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/19.jpg)
Solving gaps: gapResopution tool (II)
Step 3 If gap is not closed, tool designs designs primers for sequencing reactions
Contig Gap
Design sequencing reactions to close gap
Step 4 Iterate as necessary (in sub-assemblies)
http://www.jgi.doe.gov/[email protected]
![Page 20: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/20.jpg)
• Velvet assembly• Blast Velvet contigs against Newbler ends• Use proper Velvet contigs to close gaps
Solexa for gaps
454 Contig GapVelvet contig
Illumina reads
Velvet contigs close gaps caused by hairpins and secondary structures
![Page 21: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/21.jpg)
Low quality areas – areas of potential frameshifts
Assemblies contain low quality regions (red tags)
![Page 22: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/22.jpg)
Frameshift 1 (AAAAA, should be AAAA)
Frameshift 2 (CCCC, should be CCC) homopolymers (n>=3)
Modified from N. Ivanova (JGI)
Homopoymer related frameshifts
![Page 23: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/23.jpg)
Polisher: software for consensus software for consensus quality improvement quality improvement
Step 1: Align Illumina data to 454-only or Sanger/454 hybrid assembly
Contig
Illumina reads
Step 2: Analyze and correct consensus errorsC T
T
G
AAAAA
CorrectionsIllumina coverage >= 10X and at least 70% llumina bases disagrees with the reference base
Unsupporteda. Illumina coverage < 10Xb. Illumina coverage >= 10X and <70% of Illumina bases agree with the reference baseStep 3: Design sequencing reactions for low
quality and unsupported Illumina areas
Unsupported Illumina regionSanger/454 low quality
![Page 24: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/24.jpg)
Errors corrected by Solexa
CCTCTTTGATGGAAATGATA**TCTTCGAGCATCGCCTC**GGGTTTTCCATACAGAGAACCTTTGATGATGAACCGGTTGAAGATCTGCGGGTCAAA CCTCTTTGATGGAAATAATA**TATTCGAGCATC TTAGTGGAAATGATA**TCTTCGAGCATCGCCTC CGAGCNTCGCCTC**GGGCTTTCCCT CGAGCATCGCCTC**GGGTTCTCCATACACAGA GCATCGCCTC**GGGTTTTCAATACAGAGAACCT CAGCGCCTC**GGGTTTTCCATACAGAGAACCTT ATCGCCTC**GGGTTTTCCAGACAGAGAACCTTT GGTTC**GGGTTTTCCATACAGAGAACCTTTGAT GTTTTCCATACAGAGAACATTTGATGATGAAC GTTGTCCATACAGAGAACTTTTGATGATGAAC TATANCATACAGAGAACCTTTGATGATGAACC ATTTCCAGACAGAGAACCNTTGATGATGAACC CAAACAGAGAACCTTTGAGGATGAACCGGTTG ACAGGGAACCTTAGATGATGAACCGGTTGAAG ACAGAGAACCTTAGATGATGAACCGGTTGAAG ACCGTTGATGATGAACCGGTTGAAGATCTGCG GATGGTGAACGGGTTGAAGATCTGCGGGTCAA GGTTTGAAGATCTGCGGGTCAAACCAGTCCTC GGTGGAAGATCTGCGGGTAAAACCAGTCCTCT GGT.GNAGAGCTGCGGGTCAAACCAGTCCTCTG TGAAGATCTGCGGTTCAAACCAGTCCTCTCCC GATCGGCGTGTCAAACCAGTCCTCTGCCTCGT TCTGCGGGTCAAACCAGTACTCTGCCTCGTTC
Frame shift detected (454 contig)
454 contig
Finished consensus
Sanger reads
![Page 25: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/25.jpg)
So, what is Finishing?
The process of taking a rough draft assembly composed of
shotgun sequencing reads, identifying and resolving miss
assemblies, sequence gaps and regions of low quality to
produce a highly accurate finished DNA sequence.
Final error rate should be less than 1 per 50 Kb.
No gaps, no misassembled areas, no characters other than ACGT
Final quality:
![Page 26: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/26.jpg)
Sequencing Centers for Archaea & BacteriaMay 2009: 3549 projects
JGI23%
JCVI18%
BROAD9%
WashU6%
BCM5%
WORLD37%
Genome projectsGenome projectsArchaea + Bacteria onlyArchaea + Bacteria only
http://www.genomesonline.org/
298CompleteGenomes
137CompleteGenomes
![Page 27: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/27.jpg)
Metagenomic assembly and Finishing
• Typically size of metagenomic sequencing project is very large
• Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members
• Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies
• Chimerical contigs produced by co-assembly of sequencing reads originating from different species.
• Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly.
• No assemblers developed for metagenomic data sets
The whole-genome shotgun sequencing approach was used for a number of
microbial community projects, however useful quality control and assembly
of these data require reassessing methods developed to handle relatively
uniform sequences derived from isolate microbes.
![Page 28: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/28.jpg)
QC: Annotation of poor quality sequence
To avoid this:
-make sure you use high quality sequence
-choose proper assembler
A Bioinformatician's Guide to Metagenomics . Microbiol Mol Biol Rev. 2008 December; 72(4): 557–578.
![Page 29: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/29.jpg)
Assembly mistakes
A Bioinformatician's Guide to Metagenomics. Microbiol Mol Biol Rev. 2008 December; 72(4): 557–578.
![Page 30: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/30.jpg)
Recommendations for metagenomic assembly
- Use Trimmer (Lucy etc) to treat reads PRIOR to assembly
- None of the existing assemblers designed for metagenomic data but assemblers like PGA work better with paired reads information and produce better assemblies.
- We currently test Newbler assembler for second generation sequencing: 454 only and 454/Solexa co-assembly
![Page 31: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/31.jpg)
Metagenomic finishing: approach
Binning:Binning: Which DNA fragment
derived from which phylotype?
(BLAST; GC%; read depth)
Non-CAP readsNon-CAP reads
CAP readsCAP reads
++
Complete genome of Complete genome of Candidatus Accumulibacter
phosphatis
Lucy/PGALucy/PGA
Candidatus Accumulibacter phosphatis (CAP)
~ 45%
![Page 32: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/32.jpg)
Few more details: read quality
![Page 33: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/33.jpg)
![Page 34: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/34.jpg)
Merged assemblies ( k=31 and k=51) with minimus(Cloneview used for visualization)
Green k=31
Purple k=51Illumina only data
![Page 35: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/35.jpg)
Stats for 31, 51 and merged 31-51 assemblies
Hash L 31 51 31_51expCov NO NO NOTotal Ctgs 3,796,782 377,044 275,273Largest 15,553 23,012 40,135N50bp 116 196 325Min Ctg L 80 80 80Total Len Ctgs 360,994,462 62,631,932 138,833,812
![Page 36: Assembly and finishing](https://reader035.vdocuments.site/reader035/viewer/2022062300/554e842bb4c905f66a8b56c7/html5/thumbnails/36.jpg)
Thank you!