DNA Barcode sequence identification incorporating taxonomic hierarchy and
within taxon variability
Damon P. Little
Cullman Program for Molecular Systematics Studies
The New York Botanical Garden, Bronx, New York
test data sets (Little and Stevenson 2007)
gymnosperm nuclear ribosomal internal transcribed spacer 2 (nrITS 2)
1,037 sequences
413 species
71 genera
gymnosperm plastid-encoded maturase K (matK)
522 sequences
334 species
75 genera
…alignment
| locus | sequences | median unaligned length (IQR) | aligned length |
| --- | --- | --- | --- |
| nrITS 2 | all | 137 (108–250) bp | 8,733 bp |
| nrITS 2 | one per species | 196 (115–260) bp | 6,778 bp |
| matK | all | 1,561 (1,412–1,661) bp | 3,975 bp |
| matK | one per species | 1,601 (1,530–1,661) bp | 3,906 bp |
pairwise divergence
| locus | sequences | median | interquartile range | zero comparisons |
| --- | --- | --- | --- | --- |
| nrITS 2 | all | 30.99% | 26.53–34.48% | 0.09% |
| nrITS 2 | one per species | 29.39% | 25.75–33.30% | 0.21% |
| matK | all | 20.39% | 5.95–23.30% | 0.54% |
| matK | one per species | 21.38% | 8.13–23.89% | 0.42% |
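The summary statistics above can be outlined in code. This is a minimal sketch, assuming that divergence means the uncorrected proportion of differing sites between two aligned sequences and that "zero comparisons" is the share of sequence pairs with no differences; the slides state neither definition. The function names and toy sequences are hypothetical.

```python
from itertools import combinations
from statistics import median, quantiles


def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites, skipping columns with a gap or unknown base.

    Assumption: uncorrected (p) distance; the slides do not name the measure.
    """
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-?N" or y in "-?N":
            continue
        compared += 1
        differing += x != y
    if compared == 0:
        raise ValueError("sequences share no comparable sites")
    return differing / compared


def divergence_summary(aligned):
    """Median, quartiles, and zero-distance share over all sequence pairs."""
    dists = [p_distance(a, b) for a, b in combinations(aligned, 2)]
    q1, _, q3 = quantiles(dists, n=4)
    return {
        "median": median(dists),
        "iqr": (q1, q3),
        "zero_fraction": sum(d == 0 for d in dists) / len(dists),
    }


# Hypothetical toy alignment, not data from the study.
toy = ["ACGTACGT", "ACGTACGA", "ACGTACGT", "AC-TTCGA"]
summary = divergence_summary(toy)
print(summary)
```

Pairs of identical sequences are the ones no method can separate, which is why the zero-comparison share is worth reporting alongside the median.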
measuring precision and accuracy
precision
| method | nrITS2 | matK |
| --- | --- | --- |
| parsimony ratchet | 58% (13%) | 71% (41%) |
| SPR search | 60% (11%) | 70% (41%) |
| neighbor joining | 65% (8%) | 44% (23%) |
| BLAST | 94% (81%) | 99% (67%) |
| BLAT | 94% (82%) | 99% (69%) |
| megaBLAST | 94% (80%) | 99% (61%) |
| BLAST/parsimony ratchet | 86% (74%) | 77% (55%) |
| BLAST/SPR | 87% (73%) | 76% (53%) |
| BLAST/neighbor joining | 93% (71%) | 95% (56%) |
| DNA–BAR | 98% (89%) | 100% (79%) |
| DOME ID | 80% (80%) | 60% (60%) |
| ATIM | 100% (83%) | 100% (67%) |
accuracy to species
| method | nrITS2 | matK |
| --- | --- | --- |
| parsimony ratchet | 67% (46%) | 77% (60%) |
| SPR search | 69% (47%) | 78% (58%) |
| neighbor joining | 68% (42%) | 75% (52%) |
| BLAST | 67% (63%) | 84% (68%) |
| BLAT | 66% (62%) | 82% (67%) |
| megaBLAST | 72% (68%) | 84% (64%) |
| BLAST/parsimony ratchet | 78% (67%) | 80% (60%) |
| BLAST/SPR | 79% (67%) | 78% (61%) |
| BLAST/neighbor joining | 80% (64%) | 86% (56%) |
| DNA–BAR | 65% (62%) | 73% (62%) |
| DOME ID | 67% (66%) | 50% (50%) |
| ATIM | 83% (71%) | 87% (53%) |
lessons learned
“global” alignments do not work
“fuzzy” matches are not precise
autoapomorphies (unique characters) work... but are not always present
precision
| method | nrITS2 | matK |
| --- | --- | --- |
| parsimony ratchet | 58% (13%) | 71% (41%) |
| SPR search | 60% (11%) | 70% (41%) |
| neighbor joining | 65% (8%) | 44% (23%) |
| BLAST | 94% (81%) | 99% (67%) |
| BLAT | 94% (82%) | 99% (69%) |
| megaBLAST | 94% (80%) | 99% (61%) |
| BLAST/parsimony ratchet | 86% (74%) | 77% (55%) |
| BLAST/SPR | 87% (73%) | 76% (53%) |
| BLAST/neighbor joining | 93% (71%) | 95% (56%) |
| DNA–BAR | 98% (89%) | 100% (79%) |
| DOME ID | 80% (80%) | 60% (60%) |
| DOME ID* | 100% (100%) | 100% (100%) |
| ATIM | 100% (83%) | 100% (67%) |
accuracy to species
| method | nrITS2 | matK |
| --- | --- | --- |
| parsimony ratchet | 67% (46%) | 77% (60%) |
| SPR search | 69% (47%) | 78% (58%) |
| neighbor joining | 68% (42%) | 75% (52%) |
| BLAST | 67% (63%) | 84% (68%) |
| BLAT | 66% (62%) | 82% (67%) |
| megaBLAST | 72% (68%) | 84% (64%) |
| BLAST/parsimony ratchet | 78% (67%) | 80% (60%) |
| BLAST/SPR | 79% (67%) | 78% (61%) |
| BLAST/neighbor joining | 80% (64%) | 86% (56%) |
| DNA–BAR | 65% (62%) | 73% (62%) |
| DOME ID | 67% (66%) | 50% (50%) |
| DOME ID* | 76% (75%) | 90% (90%) |
| ATIM | 83% (71%) | 87% (53%) |
some sequences are simply unidentifiable
...remaining (insoluble) problems
identical sequences for multiple terminals
shared alleles between terminals
use allele frequency as a predictor?
desirable methodologies and properties of
Sequence IDentification Engines (SIDEs)
Sequence IDentification Engines (SIDEs)
avoid global alignment by comparing short segments: pseudo–alignment
use exact matches
use autoapomorphies where possible
...but allow the use of other characters too
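One way to picture "exact matches" and "autoapomorphies" without a global alignment is to look for short substrings that occur in every reference sequence of one taxon and in no sequence of any other taxon. The sketch below is only an illustration under that assumption: it uses fixed-length k-mers, it is not the algorithm of DNA–BAR, DOME ID, or any other method named in the tables, and all names and sequences are hypothetical.

```python
def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def diagnostic_kmers(references, k):
    """Map each taxon to the k-mers shared by all of its sequences
    and absent from every sequence of every other taxon.

    references: dict of taxon name -> non-empty list of sequences.
    """
    shared = {t: set.intersection(*(kmers(s, k) for s in seqs))
              for t, seqs in references.items()}
    anywhere = {t: set().union(*(kmers(s, k) for s in seqs))
                for t, seqs in references.items()}
    result = {}
    for taxon in references:
        others = set().union(*(anywhere[o] for o in references if o != taxon))
        result[taxon] = shared[taxon] - others
    return result


# Hypothetical toy references, not data from the study.
refs = {"sp_A": ["ACGTAC", "ACGTAA"], "sp_B": ["ACGGAC"]}
diag = diagnostic_kmers(refs, k=3)
print(diag)
```

A taxon whose set comes back empty has no unique character at that k, which is the "not always present" case and the reason other characters must be allowed as a fallback.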
context/text DNA recoding
characters are defined by flanking context
=> pretext and postext
permit “alignment–free” comparisons
size and separation between pretext and postext must be arbitrarily delimited
states (text) limited by the proximity of context
terminals can be individual sequences or composites representing taxa
context/text DNA recoding
characters are defined by flanking context
=> pretext and postext
permit “alignment–free” comparisons
size of, and separation between, pretext and postext are arbitrary
the number of possible states (text) is limited by the length of the text
terminals can be individual sequences or composites representing taxa
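The recoding described above can be sketched as a sliding window that splits each stretch of sequence into a pretext, a text, and a postext. This is a minimal sketch under assumptions: fixed segment lengths, a window advanced one base at a time, and a nested dictionary as the database. None of these details, nor the function and parameter names, come from the slides.

```python
def recode(seq, pre_len=4, text_len=2, post_len=4):
    """Yield ((pretext, postext), text) for every window in seq.

    The (pretext, postext) pair is the character; the text is its state.
    """
    window = pre_len + text_len + post_len
    for i in range(len(seq) - window + 1):
        text_start = i + pre_len
        text_end = text_start + text_len
        context = (seq[i:text_start], seq[text_end:i + window])
        yield context, seq[text_start:text_end]


def build_database(references, **sizes):
    """Return {context: {text: set of terminals}}.

    references: dict of terminal name -> list of sequences, so a terminal
    can be a single sequence or a composite of several sequences of a taxon.
    """
    db = {}
    for terminal, seqs in references.items():
        for seq in seqs:
            for context, text in recode(seq, **sizes):
                db.setdefault(context, {}).setdefault(text, set()).add(terminal)
    return db


# Hypothetical toy references, not data from the study.
chars = list(recode("AACCGGT", pre_len=2, text_len=1, post_len=2))
db = build_database({"sp_A": ["AACCGGT"], "sp_B": ["AACTGGT"]},
                    pre_len=2, text_len=1, post_len=2)
print(db[("AC", "GG")])
```

In the toy data the context `("AC", "GG")` carries the text `C` in one terminal and `T` in the other, so the two sequences are compared at that character without ever being aligned.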
querying text/context database
find pretext/text/postext in the query sequence and match to references
score terminals based on the number of matches
final score can be raw or based on a weighting function
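The query step can be sketched as follows: recode the query with the same window used for the references, look up each pretext/text/postext combination, and count the matches per terminal. This is a minimal sketch of the raw (unweighted) score only, with assumed fixed segment lengths and hypothetical names; it is not the BRONX implementation.

```python
from collections import Counter


def characters(seq, pre_len, text_len, post_len):
    """Yield ((pretext, postext), text) for every window in seq."""
    window = pre_len + text_len + post_len
    for i in range(len(seq) - window + 1):
        j = i + pre_len
        k = j + text_len
        yield (seq[i:j], seq[k:i + window]), seq[j:k]


def score_query(query, references, pre_len=3, text_len=1, post_len=3):
    """Raw score: number of distinct query characters whose context and
    text both match a reference terminal exactly.

    references: dict of terminal name -> sequence.
    """
    sizes = (pre_len, text_len, post_len)
    index = {}
    for terminal, seq in references.items():
        for key in characters(seq, *sizes):
            index.setdefault(key, set()).add(terminal)
    scores = Counter()
    for key in set(characters(query, *sizes)):
        for terminal in index.get(key, ()):
            scores[terminal] += 1
    return scores


# Hypothetical toy data, not sequences from the study.
references = {"sp_A": "ACGTACGTTGCA", "sp_B": "ACGTACCTTGCA"}
scores = score_query("GTACGTTG", references)
print(scores.most_common())
```

Terminals that never match are simply absent from the counter and read back as zero, so a tie or an empty result can be reported as "no identification" rather than forced to a best guess.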
possible weighting functions
equal weights (raw score)
number of distinct texts
=> up-weights more variable characters
1/(number of distinct texts)
=> down-weights more variable characters
(number of texts)/(number of scores)
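The first three weighting functions can be sketched directly. This is a minimal sketch, assuming that "number of distinct texts" means the number of different texts observed in the reference database for a given pretext/postext context. The fourth function is left out because the slide does not define its terms. The database layout and all names are hypothetical.

```python
from collections import defaultdict


def weight(states, scheme="equal"):
    """Weight of one matched character.

    states: {text: set of terminals} for a single context, so len(states)
    is the number of distinct texts seen in that context.
    """
    n_texts = len(states)
    if scheme == "equal":
        return 1.0
    if scheme == "distinct":
        return float(n_texts)      # up-weights more variable characters
    if scheme == "inverse":
        return 1.0 / n_texts       # down-weights more variable characters
    raise ValueError(f"unknown scheme: {scheme}")


def weighted_scores(matches, db, scheme="equal"):
    """Sum the weights of matched characters per terminal.

    matches: iterable of (context, text) pairs found in the query.
    db: {context: {text: set of terminals}}.
    """
    scores = defaultdict(float)
    for context, text in matches:
        states = db.get(context, {})
        for terminal in states.get(text, ()):
            scores[terminal] += weight(states, scheme)
    return dict(scores)


# Hypothetical toy database: one variable context and one invariant context.
db = {
    ("TAC", "TTG"): {"G": {"sp_A"}, "C": {"sp_B"}},
    ("GTA", "GTT"): {"C": {"sp_A", "sp_B"}},
}
matches = [(("TAC", "TTG"), "G"), (("GTA", "GTT"), "C")]
for scheme in ("equal", "distinct", "inverse"):
    print(scheme, weighted_scores(matches, db, scheme))
```

In the toy data the variable context is the only one that separates the two terminals, so up-weighting widens the gap between them and down-weighting narrows it.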
precision
| method | nrITS2 | matK |
| --- | --- | --- |
| parsimony ratchet | 58% (13%) | 71% (41%) |
| SPR search | 60% (11%) | 70% (41%) |
| neighbor joining | 65% (8%) | 44% (23%) |
| BLAST | 94% (81%) | 99% (67%) |
| BLAT | 94% (82%) | 99% (69%) |
| megaBLAST | 94% (80%) | 99% (61%) |
| BLAST/parsimony ratchet | 86% (74%) | 77% (55%) |
| BLAST/SPR | 87% (73%) | 76% (53%) |
| BLAST/neighbor joining | 93% (71%) | 95% (56%) |
| DNA–BAR | 98% (89%) | 100% (79%) |
| DOME ID | 80% (80%) | 60% (60%) |
| ATIM | 100% (83%) | 100% (67%) |
| BRONX 0 | 91% (90%) | 88% (84%) |
| BRONX 1 | 96% (86%) | 98% (79%) |
accuracy to species
| method | nrITS2 | matK |
| --- | --- | --- |
| parsimony ratchet | 67% (46%) | 77% (60%) |
| SPR search | 69% (47%) | 78% (58%) |
| neighbor joining | 68% (42%) | 75% (52%) |
| BLAST | 67% (63%) | 84% (68%) |
| BLAT | 66% (62%) | 82% (67%) |
| megaBLAST | 72% (68%) | 84% (64%) |
| BLAST/parsimony ratchet | 78% (67%) | 80% (60%) |
| BLAST/SPR | 79% (67%) | 78% (61%) |
| BLAST/neighbor joining | 80% (64%) | 86% (56%) |
| DNA–BAR | 65% (62%) | 73% (62%) |
| DOME ID | 67% (66%) | 50% (50%) |
| ATIM | 83% (71%) | 87% (53%) |
| BRONX 0 | 59% (58%) | 76% (71%) |
| BRONX 1 | 72% (67%) | 92% (75%) |
BRONX conclusions
BRONX is more precise than existing algorithms
BRONX is sometimes more accurate than existing algorithms
BRONX is an incremental improvement
future directions
improve the scoring function in BRONX
dynamically size context/text
benchmark additional datasets for all methods
incorporate context/text recoding into a scalable version of the ATIM algorithm
acknowledgments
Kenneth Cameron
Santiago Madriñán
Christian Schulz
Dennis Stevenson