Copyright © 2004 Synamatix sdn bhd (538481-U)
SynaMer
A New Application for Rapid Identification of Overlapping n-mers From Sequence Reads
June 2006
www.MGRC.com.my
Copyright © 2006 Synamatix Sdn. Bhd. (538481-U)
Synamatix team - Introductions
Colin Hercus, CTO
Poh Yang Ming, Bioinformatics Research Team Member
Arif Anwar, VP
Summary of Agenda
Overview of genome assembly
Key bottlenecks
Introducing SynaMer: a solution for rapidly finding longer overlapping n-mers
The method
The results
Discussion
2 major approaches
Ab initio genome assembly
  Overlap-layout-consensus
  Needs high sequence coverage
  No requirement for closely related genome
Comparative genome assembly
  Alignment-layout-consensus
  Requires a closely related genome
  High speed sequence read to genome mapping is required
  Less dependent on overlap finding
Identified bottlenecks for Ab initio
Typical genome assembly process flow:
  Sequence reads/Fragments
  Vector trimming
  Overlapping
  Contig/Supercontig/Scaffold generation
  Finishing
  Final Genome
User* identified major bottleneck in n-mer finding:
  Performance
  Preference for longer n-mers
  IT hardware requirements
* User: a major US genome research institute
Task to accomplish
Original user data set and requirement was:
  To find all overlapping exact 100-mers in 50 million 1 kb sequencing reads, i.e. 50 billion bp
  Report n-mers that have a frequency >2 and <m
Using conventional software and approaches, the user took 500 hrs and 1.5 TB of disk space to find all 100-mer overlaps
Hence the standard approach limits usage to 32-mers
Longer n-mers help bridge repetitive and low-complexity regions
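To give a rough sense of the scale of this task, the sketch below counts the n-mer start positions that have to be examined. It assumes every read is exactly 1,000 bp long, which the slide only states approximately; the function name is illustrative and not part of SynaMer.

```python
def nmer_positions(num_reads: int, read_len: int, n: int) -> int:
    """Count n-mer start positions across all reads.

    A read of length L contains L - n + 1 windows of length n
    (none if the read is shorter than n).
    """
    per_read = max(read_len - n + 1, 0)
    return num_reads * per_read


if __name__ == "__main__":
    reads, read_len, n = 50_000_000, 1_000, 100  # figures from the slide
    print("total bases:     ", reads * read_len)
    print("100-mer windows: ", nmer_positions(reads, read_len, n))
```

Under that assumption there are 901 windows per read, so about 45 billion 100-mers must be extracted and compared for the 50 billion bp data set.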
Long v Short n-mers: advantages and disadvantages
100-mer
  +ve: Fewer false positives
  +ve: Improvement in final assembly
  -ve: Errors in reads may lead to false negatives
  -ve: Slow to process with conventional software
Explanation of advantages
[Figure: panels A and B; label: Low-complexity region]
A shorter overlap results in more false positives
A longer overlap results in fewer false positives
Final assembly improved
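The effect of overlap length can be illustrated with a toy example. The two reads below are invented for illustration: they come from different hypothetical loci but both contain the same short low-complexity run (ATATATATAT). The helper `shared_kmers` is also illustrative and is not part of SynaMer.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a: str, b: str, k: int) -> set:
    """k-mers present in both sequences (candidate overlaps)."""
    return kmers(a, k) & kmers(b, k)


# Toy reads: identical low-complexity core, different flanking sequence.
READ_1 = "GGCTTACG" + "ATATATATAT" + "CCGTA"
READ_2 = "TTGACCAG" + "ATATATATAT" + "GGATC"

if __name__ == "__main__":
    for k in (4, 12):
        hits = shared_kmers(READ_1, READ_2, k)
        print(f"k={k}: {len(hits)} shared k-mers")
```

With k = 4 the two unrelated reads share several k-mers drawn from the repeat, which an overlapper would report as spurious candidate overlaps. With k = 12 every window has to include flanking sequence, and no k-mer is shared.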
SynaMer: A solution for rapid identification of longer n-mers
SynaMer finds overlapping sequences given a defined “n” with a range of frequency of occurrence in the sequence set
It is similar to a class of tools in genome assembly called “overlappers”
2 well-known overlappers are:
  UMD Overlapper: Roberts M et al. (2004) Bioinformatics 20(18):3363-3369
  KI Overlapper: Tammi MT et al. (2003) NAR 31(15):4663-4672
How SynaMer works
Given a mer length of “n”:
  Extract an n-mer at each position within a read
  Compare the n-mer and its reverse complement, to report palindromes
  Index n-mers and their locations within reads
  For each n-mer within a user-defined frequency range, report the n-mer and its locations
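The steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not SynaMer's implementation, which is not described in the slides. Three details are assumptions: each n-mer is stored under the lexicographically smaller of itself and its reverse complement, offsets are 0-based, and the frequency range is inclusive.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_nmers(reads, n):
    """Map each n-mer to a list of (read_number, offset, strand).

    Assumption: an n-mer and its reverse complement share one entry,
    keyed by whichever of the two sorts first.
    """
    index = defaultdict(list)
    for read_no, read in enumerate(reads, start=1):  # ordinal position in file
        for offset in range(len(read) - n + 1):
            nmer = read[offset:offset + n]
            rc = reverse_complement(nmer)
            if nmer <= rc:
                index[nmer].append((read_no, offset, "+"))
            else:
                index[rc].append((read_no, offset, "-"))
    return index


def report(index, min_freq=2, max_freq=50):
    """Yield (n-mer, frequency, is_palindrome, locations) within the range."""
    for nmer, locations in sorted(index.items()):
        if min_freq <= len(locations) <= max_freq:
            yield nmer, len(locations), nmer == reverse_complement(nmer), locations


if __name__ == "__main__":
    for record in report(index_nmers(["ACGTAC", "GTACGT"], n=4)):
        print(record)
```

A dictionary held in memory would not scale to tens of millions of reads; the slides mention a memory limit and a temporary file location as parameters, which suggests the real tool spills to disk.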
SynaMer data
Input:
  Text file of the reads
Parameters:
  n (default 96, maximum of 128)
  Frequency range (default of 2 to 50)
  Memory usage (defaults to available memory)
  Temporary file location
Output format:
  Text or binary
  n-mer Frequency Palindrome direction read ID:location
Test cases
Use case 1: 30 million 1 kb reads; finding exact 96-mers took approximately 5 hrs to process, with less than 200 GB temporary disk space on a dual-CPU Itanium
  Compared to 500 hrs and over 1.5 TB of disk space
Use case 2: Brucella suis 1330, 36080 900 bp reads (http://www.tigr.org/tdb/benchmark/)
  Tests were conducted over a range of n-mer lengths, with frequency from a minimum of 2 up to 120
  n-mer range of: 12, 24, 36, 48, 60, 72, 84, 96, 108, 120
  Average execution time measured with 6 replicates
Brucella suis - results
Majority of the patterns are at a frequency of 2-50
More patterns at higher n-mer
Longer n-mers would be more specific and give fewer false positives
[Chart: Brucella suis 1330. X axis: Frequency, m (0 to 90). Y axis: Number of Overlapping Sequences (0 to 3,500,000). Series: 12, 24, 36, 96]
Distribution of overlapping sequence with frequency
[Chart: Mammalian genome. X axis: Frequency, m (0 to 90). Y axis: Number of overlapping sequences (0.00E+00 to 3.50E+12). Series: 12, 24, 36, 96]
Higher level of repeats in more complex genomes leads to increased benefits from using longer n-mers
Using SynaMer there is no time increase with longer n-mers
[Chart: Time vs n-mer (m 2 to 50). X axis: n-mer (0 to 140). Y axis: Time, s (0 to 25)]
Sample Output
At 96-mer:
TTTCATAAAGCCGCTTTGCACCATAAAGCGCGTCGCCGGTGCTGCCTGTGGTGCCGTAGAAAGTCCAGCCTTCCTCCGCCATCAGGAAATCAACCACTGAAACGGAAA 5 33984:395 25036:255 17186:435 -5741:85 5184:181
TTTCATAAACCTGACCCTGATTCGCCGCACCATCGCCGAAATAGGTCAGCGAAACGGATTTATTCTCACGATAGTGATTGGCGAAGGCCAACCCCGTACCGAGCGAAA 8 30929:163 28279:329 25051:228 -22556:257 -14554:249 -12303:286 15820:325 6770:434
TTTCATAAAACCTAAATAATATAGAATATATTTTTTAATTTACTCCCACAAAAATTGATATTTATAAAATAAAAAATCCCAATCTGTAAATCCCAATAATTTTACAAA 4 32618:184 -9587:456 9891:617 8902:369
TTTCAGTTTCTCAAGCAAACCCTTTATGACATTGCATCTTTGCTGGTGTTTTTCGCCAATGTTGCATTTTGTTTCTCAATTGTAGCGCAAGCAAATGCGGCTTGAAAA 5 26073:487 -21045:262 22952:244 12603:19 6640:383
The numbers before the “:” are the ordinal position of the reads in the file
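A record in this text format can be read back with a short parser. The sketch below is based only on the sample records on this slide: a sequence, a frequency, then whitespace-separated read:location pairs. Treating a leading "-" on the read number as the reverse direction is an assumption, since the slides do not spell it out. The names `Hit` and `parse_record` are illustrative.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    read: int       # ordinal position of the read in the input file
    location: int   # location of the n-mer within that read
    reverse: bool   # assumption: a leading "-" marks the reverse direction


def parse_record(line: str) -> Tuple[str, int, List[Hit]]:
    """Split one text-format record into (sequence, frequency, hits)."""
    fields = line.split()
    sequence, frequency = fields[0], int(fields[1])
    hits = []
    for token in fields[2:]:
        read_part, location_part = token.split(":")
        hits.append(
            Hit(abs(int(read_part)), int(location_part), read_part.startswith("-"))
        )
    return sequence, frequency, hits


if __name__ == "__main__":
    # Shortened, made-up record in the same layout as the sample output.
    print(parse_record("ACGT 2 12:5 -7:30"))
```

In each of the four sample records the stated frequency equals the number of read:location pairs that follow it, which gives a simple consistency check when parsing.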
Sample contig
To show validity of the result:
  GBUAS15TR and GBUCA37TF
  Detected overlap at 96-mer, at position 188 on GBUAS15TR and 811 on GBUCA37TF
  They can be joined into a 1.5 kbp contig, with consensus
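The contig length follows from the two positions: the shared n-mer starts 811 - 188 = 623 bases further into GBUCA37TF than into GBUAS15TR, so GBUCA37TF contributes about 623 bases ahead of GBUAS15TR. The lengths of these two reads are not given; assuming roughly 900 bp each, as in use case 2, the join is about 623 + 900 = 1,523 bp, in line with the 1.5 kbp stated. The sketch below joins two reads at a shared n-mer using toy sequences. It assumes both reads are in the same orientation and agree exactly across the overlap, so it skips the consensus step; the function name is illustrative.

```python
def join_at_shared_nmer(read_a: str, pos_a: int, read_b: str, pos_b: int, n: int) -> str:
    """Join two same-strand reads that share an n-mer at the given offsets."""
    seed = read_a[pos_a:pos_a + n]
    if len(seed) != n or seed != read_b[pos_b:pos_b + n]:
        raise ValueError("reads do not share an n-mer at the given positions")
    if pos_a > pos_b:
        # Make read_b the read that starts further to the left.
        read_a, pos_a, read_b, pos_b = read_b, pos_b, read_a, pos_a
    shift = pos_b - pos_a                 # read_a starts `shift` bases into read_b
    tail_b = read_b[shift:]
    head_a = read_a[:len(tail_b)]
    if head_a != tail_b[:len(head_a)]:
        raise ValueError("reads disagree within the overlap; a consensus step is needed")
    if len(read_a) < len(tail_b):
        return read_b                     # read_a lies entirely inside read_b
    return read_b[:shift] + read_a


if __name__ == "__main__":
    genome = "GATTACAGGCTTAACCGGTATCG"     # toy sequence, not real data
    left, right = genome[:15], genome[9:]
    print(join_at_shared_nmer(right, 1, left, 10, n=4))
    print("estimated contig length:", (811 - 188) + 900)
```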
Conclusions
Processing 30 million 1 kb reads took 5 hours on a dual-CPU Itanium machine, with temporary file size less than 200 GB
Finding overlapping sequences in 33000 900 bp reads from a bacterial WGSS took less than 20 s
100-fold faster than the conventional method
Allows use of longer n-mers
Potentially increases quality of assembly
SynaMer will be released as a product later this summer
Questions and Follow up
Please send questions to: [email protected]
Webcast will be available online in 24hrs at www.mgrc.com.my
Paper accompanying this webcast will be sent to all attendees
If you are interested in testing SynaMer when it is released please email: [email protected]