Build Reference Genomes Using Next-Generation Sequencing Technologies
Jianbin Wang
HMGP7620, STBB7620, CPBS7620 and MICB7620
Advanced Genome Analysis
1/22/15
1st Generation Large-Scale Sequencing (Sanger Capillary)
Yeast, 1996
E. coli, 1997
C. elegans, 1998
Fruit fly, 2000
Arabidopsis, 2000
Human, 2001
Mouse, 2002
Produced many important genomes for modern biology
Biodiversity is Everywhere
Diversity of fungi from Northern Saskatchewan
Diversity of butterfly wing sizes, shapes, and colors
Available (Eukaryotic) Genomes Increased Steadily at NCBI
ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/
[Figure: bar chart, Number of Eukaryotic Genomes Released @ NCBI GenBank (1/9/15). x-axis: Year(s); y-axis: Released Genomes (0 to 600). Bar values in chart order: 6, 3, 15, 22, 34, 47, 46, 65, 51, 60, 66, 205, 235, 396, 574, 6.]
Total: 1,831 by 1/9/15
Examples of Reported New Genomes in 2014
Almost all were done using Next-Generation Sequencing (NGS)
The Nature of NGS Data
• Higher parallel operation/yield
• Much lower cost per base
• Usually shorter (unfortunately)
Illumina has the lowest cost/Mb ($0.05-$0.15) and is the most popular platform
Building a Genome is Like Solving a Puzzle of a Map
States: 50 pieces, 3 Gb / 50 = 60 Mb each
Counties: 3,144 pieces, ~1 Mb each
Zip Codes: 43,000 pieces, ~70 Kb each
Sanger reads: 800 bp = 3.75 million reads per 1x coverage (x 10 for 10x)
Illumina reads: 200 bp = 15 million reads per 1x coverage (x 50 for 50x)
Illumina reads: 50 bp = 60 million reads per 1x coverage (x 100 for 100x)
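The read counts in the puzzle analogy follow from dividing the genome size by the read length and multiplying by the desired coverage. The sketch below reproduces that arithmetic for a ~3 Gb genome; the function name `reads_for_coverage` is made up for illustration, and the calculation ignores real-world losses such as low-quality or duplicate reads.

```python
def reads_for_coverage(genome_size_bp, read_length_bp, coverage=1):
    """Reads needed so that total sequenced bases = coverage x genome size."""
    return genome_size_bp * coverage / read_length_bp


GENOME_SIZE = 3_000_000_000  # ~3 Gb, the genome size used on the slide

# (platform, read length in bp, target coverage) as listed on the slide
examples = [("Sanger", 800, 10), ("Illumina", 200, 50), ("Illumina", 50, 100)]

for platform, read_len, depth in examples:
    at_1x = reads_for_coverage(GENOME_SIZE, read_len)
    at_depth = reads_for_coverage(GENOME_SIZE, read_len, depth)
    print(f"{platform} {read_len} bp: {at_1x / 1e6:.2f} M reads at 1x, "
          f"{at_depth / 1e6:.1f} M reads at {depth}x")
```

Shorter reads need both more reads per 1x and, in practice, a higher target coverage, which is why the 50 bp case grows so quickly.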
De-novo Genome Assembly Concepts
[Figure: genomic DNA → whole genome shotgun sequencing → genomic reads → de novo assembly → Contig1, Contig2, Contig3, Contig4 separated by gaps → paired-end information → scaffold]
Metrics for Genome Assembly
• Number of contigs/scaffolds
• Total size of contigs/scaffolds
• Longest contig/scaffold
• N50/N90 contig/scaffold length
Example: N50 = 18,063 bp (N50 number = 4,175); N90 = 3,548 bp (N90 number = 16,950)
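N50 is the length of the shortest contig (or scaffold) in the smallest set of longest contigs that together contain at least 50% of the assembled bases; the "N50 number" is how many contigs are in that set. N90 is the same with a 90% threshold. A minimal sketch follows, using toy contig lengths that are not from the lecture; the function name `nx` is hypothetical.

```python
def nx(lengths, percent=50):
    """Return (Nx length, Nx number) for a list of contig/scaffold lengths.

    Contigs are sorted from longest to shortest and accumulated until they
    cover at least `percent` of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length, count


# Toy assembly (assumed values for illustration only), total = 300 bp
contigs = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(contigs, 50))
print("N90:", nx(contigs, 90))
```

Because N50 depends on the total assembled size, it is only comparable between assemblies of similar total length.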
Methods: Overlap-Layout-Consensus
• Pair-wise sequence alignments (computationally expensive)
• Construction and manipulation of an overlap graph to produce the read layout
• Multiple sequence alignment to generate the consensus
Examples: Phrap, Celera, Arachne, CAP, PCAP, Newbler, SGA …
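To make the overlap step concrete, here is a toy sketch that finds exact suffix-prefix overlaps between reads and merges the best-overlapping pair repeatedly. It is a greedy simplification, not how the assemblers listed above work: real overlap-layout-consensus tools tolerate sequencing errors, handle reverse complements, build an explicit overlap graph, and compute a consensus. The reads and function names are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        # All-against-all comparison: this is the expensive pair-wise step.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: the contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy reads (assumed) sampled from the sequence AGATGATTCG
reads = ["AGATGA", "TGATTC", "ATTCG"]
print(greedy_assemble(reads))
```

The nested loop compares every read with every other read, which illustrates why the pair-wise alignment step becomes costly for the very large read counts produced by NGS platforms.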
Methods: Eulerian Path/de Bruijn Graph
• K-mer hash table
• de Bruijn graph/Eulerian path search
Examples: Euler, Velvet, ALLPATHS, ABySS, SOAPdenovo, ...
Example sequence: AGATGATTCG
3-mers: AGA, GAT, ATG, TGA, GAT, ATT, TTC, TCG
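The 3-mer decomposition of AGATGATTCG can be reproduced in a few lines. The sketch below lists the k-mers and builds a simple de Bruijn graph in which each k-mer is an edge from its (k-1)-base prefix to its (k-1)-base suffix. It stops short of the Eulerian path search itself, and the function names are hypothetical rather than taken from any assembler.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of `seq`, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(seqs, k):
    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to."""
    graph = defaultdict(list)
    for seq in seqs:
        for kmer in kmers(seq, k):
            graph[kmer[:-1]].append(kmer[1:])  # one edge per k-mer occurrence
    return dict(graph)


sequence = "AGATGATTCG"
print(kmers(sequence, 3))
for node, successors in de_bruijn([sequence], 3).items():
    print(node, "->", successors)
```

The repeated 3-mer GAT shows up as a duplicated edge from GA to AT, and the node AT has two different successors (TG and TT). Branches like this are where repeats make the path through the graph ambiguous.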
Coverage and K-mer Coverage
• Coverage (C)
• K-mer coverage (Ck)
• K-mer coverage depends on K-mer size and read length
Ck = C * (L - k + 1) / L
where k is the hash (K-mer) length and L is the read length
• Choice of K-mer size: a tug-of-war between specificity and coverage
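The formula can be evaluated directly to see the tug-of-war: a larger k is more specific but leaves fewer k-mers per read, so K-mer coverage drops. The numbers below (50x coverage, 100 bp reads) are assumed for illustration and are not from the lecture.

```python
def kmer_coverage(coverage, read_length, k):
    """Ck = C * (L - k + 1) / L for read length L and K-mer size k."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return coverage * (read_length - k + 1) / read_length


# Assumed example: C = 50x base coverage with L = 100 bp reads
for k in (21, 31, 51, 91):
    print(f"k = {k}: Ck = {kmer_coverage(50, 100, k):.1f}x")
```

With these assumed values, moving from k = 31 to k = 91 cuts the K-mer coverage from 35x to 5x even though the base coverage is unchanged.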
Challenges for De-novo Genome Assembly
• Repetitive sequence
• DNA polymorphisms/sequencing errors
• Non-uniform coverage
• Computational complexity of processing large volumes of data
Reducing the Complexity of the Data
• Sub-assembly (grouped assembly)
– Illumina Tn5 transposase-based barcodes
– Fosmid, BAC pooling and others
• Repeat-masking
• Reference-based
Additional Scaffolding (with indirect source)
• Related genome as reference
• cDNAs/transcriptomes
• Conserved proteins
[Figure: Contig1, Contig2, Contig3, Contig4 joined into a scaffold using scaffolding information from a reference genome, cDNA, or conserved protein]
This step needs extra caution because it relies on assumptions that might not be true!
To the Next Level: Chromosomal Size
Scaffolding Approaches
• Fosmid: 35-40 Kb
• BAC: 150-350 Kb
• Optical mapping: chromosomal level
• Hi-C assembly: chromosome-scale
• Longer reads (Sanger, Illumina, PacBio, Nanopore, …)
[Figure: Scaffold1, Scaffold2, Scaffold3, Scaffold4 joined into a super-scaffold/chromosome using higher-level scaffolding information: fosmid/BAC, optical mapping, longer reads]
Genome Assessment - Coverage
• Read coverage/reads used
• Physical coverage
• Functional coverage
– Core Eukaryotic Genes Mapping Approach (CEGMA)
– Transcriptomes (mRNAs, small RNAs, and others)
– Other sequences of interest
Genome Assessment - Continuity
• N50 and N90 on contigs and scaffolds
• Consistency with available genetic maps
• Paired-end discrepancy
• mRNA/cDNA intactness
• …
Summary of De-novo Assembly Process for WGS
• Experiment design
– Genome size and complexity
– Goal and budget
• Sample collection
• Sample preparation
• Sequencing
– Choice of platform(s)
• Pre-processing
• Assembly
– Strategies and software choices
• Post-assembly analysis
See http://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/De_novo_assembly for details such as experiment design, data processing, and a comparison of available software
Programmed DNA Elimination is an Exception to Genome Constancy in Multicellular Organisms
Wang & Davis 2014, Current Opinion in Genetics & Development
First cell lineage defined in 1910 by Theodor Boveri
Ascaris Early Embryo Development, Cell Lineage, and Chromatin Diminution
[Figure: cell lineage from the zygote through the 2-cell, 4-cell, 8-cell, 16-cell, and 32-cell stages, distinguishing somatic cells from germline cells. Labels: P0, P1, P2, P3, P4; S1, S2, S3, S4, S1a, S1b, S2a; (AB), (EMS), (MS), (E), (C), (D).]
Wang et al. 2012 Developmental Cell
A. suum Diminution Mitosis
Eliminated DNA (red) stays at the metaphase plate while retained chromosomes are pulled toward the daughter cells in early anaphase.
Eliminated DNA (red) is in fragments between segregating chromosomes in anaphase. DNA fragments from a previous diminution are still visible.
Samples and Reads for Ascaris Genome Assembly
1 male carcass = whole male - testis - spermatids - intestine
2 female carcass = whole female - ovary (oviduct) - uterus (embryos) - intestine
3 Jex et al. used mixed DNA sources for genome assembly
Assemblies for A. suum Genomes
Protein-coding genes
Functional coverage
Wang et al. 2012 Developmental Cell
Read Coverage for the Germline Genome Defines the Eliminated Sequences & Breakpoints
Wang et al. 2012 Developmental Cell
17 additional sites confirmed by PCR
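The idea of using read coverage to locate eliminated sequences can be illustrated with a toy comparison of germline and somatic coverage along a scaffold: windows that are well covered in the germline but nearly absent in somatic reads are candidate eliminated DNA, and the edges of such runs approximate breakpoints. This is only a hypothetical sketch, not the procedure of Wang et al. 2012; the window size, thresholds, function name, and coverage values are all assumptions.

```python
def eliminated_regions(germline_cov, somatic_cov, window=1000,
                       max_ratio=0.1, min_germline=5.0):
    """Return (start, end) intervals where somatic coverage collapses.

    A window is flagged when germline coverage is at least `min_germline`
    and somatic coverage is at most `max_ratio` times the germline value.
    Consecutive flagged windows are merged into one interval.
    """
    n = min(len(germline_cov), len(somatic_cov))
    regions, start = [], None
    for i in range(n):
        g, s = germline_cov[i], somatic_cov[i]
        lost = g >= min_germline and s <= max_ratio * g
        if lost and start is None:
            start = i  # candidate breakpoint: coverage drops here
        elif not lost and start is not None:
            regions.append((start * window, i * window))
            start = None
    if start is not None:
        regions.append((start * window, n * window))
    return regions


# Assumed per-window mean coverage (1 kb windows) along one scaffold
germline = [40, 42, 38, 41, 39, 40]
somatic = [45, 44, 1, 0, 2, 43]
print(eliminated_regions(germline, somatic))
```

Window-level boundaries are coarse, which is consistent with confirming individual breakpoint sites by an independent method such as PCR.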
Parascaris Genome Sequencing and Assemblies

Sequencing
Genomic DNA source   Insert size (bp)   Sequencing type   Sequencing lanes   Reads (million)   Genome coverage (x)
Male #1 testis       450                2 x 100           5 x HiSeq 2000     1,468             48
Male #1 intestine    450                2 x 100           1 x HiSeq 2000     264               70
Male #1 intestine    450                2 x 250           1 x MiSeq          31                18

Assemblies
Genome assembly features             Germline         Somatic
Estimated genome size (Mb)           ~2,500           ~285
Total base assembled (bp)            234,063,191      229,444,662
Number of scaffolds (>= 200 bp)      25,519           19,520
N50 of scaffolds (bp); N50 number    36,592; 1,644    103,210; 675
N90 of scaffolds (bp); N90 number    6,674; 7,370     21,816; 2,406
Maximum length of scaffold (bp)      397,251          495,322
N50 of contigs (bp); N50 number      16,545; 3,468    26,670; 2,122
N90 of contigs (bp); N90 number      2,414; 17,435    2,112; 14,596
• 88% of the germline genome is eliminated in somatic cells
• Primarily satellite repeats are eliminated
– 5-mer = 1.3 Gb
– 10-mer = 0.9 Gb
• ~700 genes eliminated
Genes lost and many breakpoints are conserved, suggesting an ancient mechanism for diminution
Parascaris Germline Genome Assembly Was Enabled by Repeat Masker
Work is under way to improve the genomes using Bionano, PacBio, and fosmid libraries
Genome Assembly Using NGS Data
• Is feasible and is the method of choice for sequencing a new genome
• Is still a challenge for complex genomes
• The algorithm matters, but the source of DNA and the type/quality of the libraries matter more
• A reference genome or other higher-order genetic map is of great value
• The quality of genome assemblies is improving constantly
• Put it into the biological context
References and Additional Reading
• Schatz, M. C., A. L. Delcher, et al. (2010). "Assembly of large genomes using second-generation sequencing." Genome Research 20(9): 1165-1173.
• Earl, D., K. Bradnam, et al. (2011). "Assemblathon 1: a competitive assessment of de novo short read assembly methods." Genome Research 21(12): 2224-2241.
• Salzberg, S. L., A. M. Phillippy, et al. (2012). "GAGE: A critical evaluation of genome assemblies and assembly algorithms." Genome Research.
• Treangen, T. J. and S. L. Salzberg (2012). "Repetitive DNA and next-generation sequencing: computational challenges and solutions." Nature Reviews Genetics 13(1): 36-46.
• Nagarajan, N. and M. Pop (2013). "Sequence assembly demystified." Nature Reviews Genetics 14(3): 157-167.