Transposable Elements (TE) in genomic sequence
Mina Rho
Contents
• Definition
• De novo identification of repeat families in large genomes (RepeatScout)
  – Alkes L. Price, Neil C. Jones and Pavel A. Pevzner
• Combined Evidence Annotation of Transposable Elements in Genome Sequences
  – Hadi Quesneville, Casey M. Bergman, Olivier Andrieu, Delphine Autard, Danielle Nouaud, Michael Ashburner, Dominique Anxolabehere
Mobile element/Transposable element
• Transposon
  – a segment of DNA that can move to different positions in the genome of a single cell
  – cut out of its location and inserted into a new location
  – the mobile intermediate consists of DNA
• Retrotransposon
  – copied and pasted into a new location
  – the copy is made of RNA and transcribed back into DNA by reverse transcriptase
  – many carry long terminal repeats (LTRs) at their ends
=> TEs are expected to provide information on evolution, mutation, and changes in the amount of DNA in the genome.
RepeatScout
Definition
• Repeat family: a collection of similar sequences which appear many times in a genome.
  – the Alu repeat family has over 1 million approximate occurrences in the human genome
  – repeats make up ~50% of the human genome
• l-mer: substring whose length is l
Background
• Current status of methods for identifying repeat families
  – Given an existing library of repeat families
    • RepeatMasker
  – De novo identification
    • REPuter (Kurtz et al., 2000)
    • RepeatFinder (Volfovsky et al., 2001)
    • RECON (Bao and Eddy, 2002)
    • RepeatGluer (Pevzner et al., 2004)
    • PILER (Edgar and Myers, 2005)
    • RepeatScout
Overview of RepeatScout
• Method
  – Builds a table of high-frequency l-mers as seeds
  – Extends each seed to a longer consensus sequence
• Main advantage
  – an efficient method of similarity search which enables a rigorous definition of repeat boundaries
How to create l-mer table
[Figure: the sequence is scanned position by position (i, i+1, i+2, ..., j, k); each l-mer (l-mer1 to l-mer6) is entered into a hash table that records its frequency and the position of its last occurrence.]
Output of l-mer table
l-mer            frequency  position of last occurrence
AAAAAAAAAAAGATA  8          2920943
AAAAAAAGGAAAGAA  5          2468525
AGGCTTGAACAATGG  3          1425014
AAAAAAAAGAAAGAA  62         3009663
GTTGGTTTCAAAGAA  7          2855871
AAAAAAAATTTTTTT  22         2992836
ATTCAAGTTAAATGG  4          1473342
ATTCAATGTAACCAC  3          1463008
ATGCATGCAATGCAT  9          1788944
ATGCATTTAAAAGAA  3          1464381
AAAAAACTCACTCCA  5          1489159
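A minimal sketch of how such a table could be built, assuming a plain Python dict as the hash table. The function names and the `min_freq` cutoff are hypothetical, and the sketch counts overlapping occurrences on one strand only, which may differ from what RepeatScout actually does.

```python
def build_lmer_table(sequence, l=15):
    """Map each l-mer to [frequency, position of last occurrence].

    Simplifying assumptions (not from the slides): overlapping
    occurrences are all counted and only the forward strand is scanned.
    """
    table = {}
    for i in range(len(sequence) - l + 1):
        lmer = sequence[i:i + l]
        entry = table.get(lmer)
        if entry is None:
            table[lmer] = [1, i]      # first time this l-mer is seen
        else:
            entry[0] += 1             # one more occurrence
            entry[1] = i              # remember the latest position
    return table


def high_frequency_lmers(table, min_freq=3):
    """Return (l-mer, frequency, last position) rows, most frequent first.

    min_freq is a hypothetical cutoff chosen for illustration.
    """
    rows = [(lmer, freq, pos) for lmer, (freq, pos) in table.items()
            if freq >= min_freq]
    return sorted(rows, key=lambda row: -row[1])


if __name__ == "__main__":
    toy = build_lmer_table("ACGTACGTAC", l=4)
    for row in high_frequency_lmers(toy, min_freq=2):
        print(*row)
```

Each row has the same three columns as the table above; the high-frequency rows are the candidate seeds for extension.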
How to build all positions of repeats
[Figure: hash table of l-mers (l-mer1 to l-mer6), each linked to the positions in the sequence where it occurs (i, i+1, i+2, j, k).]
[Figure: occurrences S1 to S5 of a high-frequency l-mer (l-mer1) aligned with the query sequence Q (Q1 Q2 Q3 Q4); Q is extended one nucleotide at a time, maximizing the objective function.]
Objective Function
• |Q|: the length of Q
• C: minimum threshold on the number of repeat elements
• a(Q, Sk): a pairwise fit-preferred alignment score
• p: incomplete-fit penalty
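The formula itself did not survive in this transcript. As a hedged reconstruction from the symbols defined here, and assuming the form given in the RepeatScout paper, the objective being maximized can be written as follows (the name A is an assumption):

```latex
A(Q;\, S_1, \dots, S_n) \;=\; \sum_{k=1}^{n} \max\bigl\{\, a(Q, S_k),\; 0 \,\bigr\} \;-\; C\,|Q|
```

Under this reading, an occurrence Sk that aligns poorly contributes 0 rather than a negative score, and the C|Q| term means an extension of Q pays off only while enough occurrences keep aligning well. The penalty p enters through a(Q, Sk), not as a separate term.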
Output of optimized Q
>R=0
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACTTGAGGTCAGGAGTTCGAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCGGAGGTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAA
>R=1
AAAAGGCAGCAGAAACCTCTGCAGACTTAAATGTCCCTGTCTGACAGCTTTGAAGAGAGTAGTGGTTCTCCCAGCACGCAGCTGGAGATCTGAGAACGGACAGACTGCCTCCTCAAGTGGATCCCTGACCCCCGAGTAGCCTAACTGGGAGGCACCCCCCAGTAGGGGCAGACTGACACCTCACACGGCCAGGTACTCCTCTGAGAAAAAACTTCCAGAGGAACAATCAGGCAGCAACATTTGCTGCTCACCAATATCCACTGTTCTGCAGCCTCTGCTGCTGATACCCAGGCAAACAGGGTCTGGAGTGGACCTCCAGCAAACTCCAACAGACCTGCAGCTGAGGGTCCTGTCTGTTAGAAGGAAAACTAACAAACAGAAAGGACATCCACACCAAAAACCCATCTGTACGTCACCATCATCAAAGACCAAAAGTAGATAAAACCACAAAGATGGGGAAAAAACAGAGCAGAAAAACTGGAAACTCTAAAAAGCAGAGCGCCTCTCCTCCTCCAAAGGAACGCAGCTCCTCACCAGCAACGGAACAAAGCTGGACGGAGAATGACTTTGATGAGTTGAGAGAAGAAGGCTTCAGATGATCAAACTACTCCAAGCTAAAGGAGGAAATTCAAACCCATGGCAAAGAAGTTAAAAACCTTGAAAAAAAATTAGACGAATGGATAACTAGAATAACCAATGCAGAGAAGTCCTTAAAGGAGCTGATGGAGCTGAAAACCAAGGCTCGAGAACTACGTGAAGAATGCACAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAAAGGGTATCAGTGATGGAAGATCAAATGAATGAAATGAAGTGAGAAGAGAAGTTTAGAGAAAAAAGAATAAAAAGAAATGA
>R=2
TTTTTTTTTTTTTTTAGATGCGGGGTGTCACTGTGTTGCTCAGGCTGGTCTCAAACTCCTGGGCTCAAGTGATCCTCCCACCTCAGCCTCTTTAATAGATGCGATTA
>R=3
TTTTTATACATGCTGTAGACAATCAATTCACACCTGTACTTTTTTTTAAGGTTGTGTTATTGCACTTTTATACCTCTTGACTGGTAGCTGATTTCCTTGAATACCTGTAAGGTAATCACCGGCTCACCAATGAATGTGGTTTTAACAATGGCTCACAGTGGCTTGGAAAGCCCTCATGGGAAGTATTTCTGAGGAAAAGTGGAGAGTGTGCAGGAATAGTTTTGAAAAACAGAGACAACCGATGTCCTCCTTCCCTCCCTTGCCTCTCCTCATGTGCCAGGTTTTCTGTTTTCTCCACTATTACAGAATCACCATGTTGTATCCTGTGATGAAAAGTTTTTATCTCTTTAATCATCCCATTTCGTCCTCCAGACCTTTTTTTTTCTGGAAGGGTTGTAAGCAGAAGGGACGAAACATCTTCAGAAAAACACATTATGATATAAACTTAGTGAAAAGATTCATCATATTTAAGAAATGGACAGGATGAAATCCTGAATTCATAAAAATTTTAAAAATCAGTTTACATAACATCCATCCCTTTTGTCTCTATCCCTTATCCA
Parameter setting and post processing
• Parameter setting
  – Recommended smallest l = 15
  – For the arbitrary length L,
  – Q is extended up to 10,000 bp on each side
  – Remove repeat families with |Q| < 50 bp
• Postprocessing
  – Tandem Repeats Finder, Nseg
    • Remove repeat families with >50% of their length annotated as low-complexity and tandem repeats
  – RepeatMasker
    • Mask the repeat families based on the library
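The two removal rules above can be sketched as a single filter. This assumes the number of consensus bases flagged by the low-complexity and tandem-repeat tools has already been computed; the function and parameter names are hypothetical.

```python
def keep_repeat_family(consensus_length, flagged_bases,
                       min_length=50, max_flagged_fraction=0.5):
    """Decide whether a repeat family survives filtering.

    consensus_length: |Q|, the length of the consensus sequence in bp.
    flagged_bases: number of consensus bases annotated as low-complexity
        or tandem repeat (assumed to be computed elsewhere).
    """
    if consensus_length < min_length:
        return False                      # consensus too short
    flagged_fraction = flagged_bases / consensus_length
    return flagged_fraction <= max_flagged_fraction


if __name__ == "__main__":
    print(keep_repeat_family(300, 40))    # long, mostly complex sequence
    print(keep_repeat_family(45, 0))      # shorter than 50 bp
    print(keep_repeat_family(200, 150))   # 75% flagged
```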
Benchmark
• C. briggsae genome (108 Mb)
• 7 h on a single 0.5 GHz DEC Alpha processor
Combined evidence model of TE
Overview
Query Sequences: Drosophila melanogaster (Fruit fly) Release 3, 4
Combined evidence model: pipeline of RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, and TE-HMM
- Methods for the annotation of known TE families
- Methods for the annotation of anonymous TE families
Benchmark: FlyBase Release 3.1 annotation
Sensitivity and specificity, characteristics of boundaries
Tools
• BLASTER
  – Compares query sequences against a subject databank.
  – Launches one of the BLAST programs (BLASTN, TBLASTN, BLASTX, TBLASTX).
  – Cuts long sequences before launching BLAST and reassembles the results.
• MATCHER
  – Maps match results onto query sequences by filtering overlapping hits.
  – Keeps the match results with E-value < 10^-10 and length > 20.
  – Chains the remaining matches by dynamic programming.
• GROUPER
  – Gathers similar sequences into groups.
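The MATCHER steps (threshold filter, then chaining by dynamic programming) can be illustrated with a simplified sketch. The `Hit` record, the function names, and the scoring (sum of hit scores over collinear, non-overlapping hits, with no gap penalties) are assumptions for illustration, not MATCHER's actual implementation.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical match record; coordinates are half-open [start, end)."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    score: float


def filter_hits(hits, max_evalue=1e-10, min_length=20):
    """Keep hits with E-value < 10^-10 and length > 20 on the query."""
    return [h for h in hits
            if h.evalue < max_evalue and (h.q_end - h.q_start) > min_length]


def best_chain(hits):
    """Highest-scoring chain of collinear, non-overlapping hits (O(n^2) DP)."""
    ordered = sorted(hits, key=lambda h: (h.q_end, h.s_end))
    best, prev = [], []          # best[i]: best chain score ending at hit i
    for i, h in enumerate(ordered):
        top, top_prev = 0.0, None
        for j in range(i):
            g = ordered[j]
            # g may precede h only if it ends before h starts on both axes
            if g.q_end <= h.q_start and g.s_end <= h.s_start and best[j] > top:
                top, top_prev = best[j], j
        best.append(top + h.score)
        prev.append(top_prev)
    if not ordered:
        return []
    i = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    while i is not None:         # walk back through the predecessors
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]
```

Sorting by query end guarantees that every admissible predecessor of a hit has already been processed, which is what lets the chain be built in a single pass.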
Measures
For each nucleotide,
• TP: correctly annotated as belonging to a TE
• FP: falsely predicted as belonging to a TE
• TN: correctly annotated as not belonging to a TE
• FN: falsely predicted as not belonging to a TE
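A small sketch of how these per-nucleotide counts could be computed from predicted and reference TE intervals. It assumes the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the function names and the half-open interval convention are hypothetical.

```python
def nucleotide_mask(length, intervals):
    """Boolean mask of positions covered by half-open (start, end) intervals."""
    mask = [False] * length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, length)):
            mask[pos] = True
    return mask


def confusion_counts(length, predicted, reference):
    """Return (TP, FP, TN, FN) counted nucleotide by nucleotide."""
    pred = nucleotide_mask(length, predicted)
    ref = nucleotide_mask(length, reference)
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if r and not p)
    tn = length - tp - fp - fn
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of reference TE nucleotides that were predicted."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-TE nucleotides that were left unannotated."""
    return tn / (tn + fp) if (tn + fp) else 0.0


if __name__ == "__main__":
    tp, fp, tn, fn = confusion_counts(10, [(2, 6)], [(4, 8)])
    print(tp, fp, tn, fn, sensitivity(tp, fn), specificity(tn, fp))
```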
Method for the Annotation of known TE families
- BLASTER using BLASTN and MATCHER (BLRn)
- RepeatMasker (RM)
- RepeatMasker with MATCHER (RMm)
- RepeatMasker-BLASTER (RMBLR): combines the hits from both BLRn and RM and passes them to MATCHER
Method for the Annotation of anonymous TE families
- all-by-all comparison with BLASTER using BLASTN, MATCHER, and GROUPER
- RECON
- BLASTER using TBLASTX and MATCHER
- HMM
What they (we) learned
• Overall, BLRn outperforms RM with respect to the precise determination of TE boundaries.
• RM is more sensitive for the detection of small and divergent TEs.
• The differences between BLRn and RM make them complementary for TE annotation.
• A combined-evidence framework can improve the quality and confidence of TE annotation.
Pipeline structure
• TE detection software : BLASTER, RepeatMasker, TE-HMM, and RECON
• Tandem repeat detection software: RepeatMasker, Tandem Repeats Finder (TRF), Mreps
• Database: MySQL
• Open Portable Batch System
• The whole genomic sequence was segmented into chunks of 200 kb overlapping by 10 kb.
• The results from the different tools were stored in the database.
• An XML file is generated from the stored results and loaded into the Apollo genome annotation tool.
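The segmentation step can be sketched as a coordinate generator. This is an illustration under assumed conventions (0-based, half-open coordinates; a shorter final chunk), and the function name is hypothetical, not part of the pipeline.

```python
def chunk_coordinates(genome_length, chunk_size=200_000, overlap=10_000):
    """Return (start, end) pairs covering the sequence with overlapping chunks.

    Consecutive chunks share `overlap` bases; the last chunk may be shorter.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if genome_length <= 0:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, genome_length)
        chunks.append((start, end))
        if end == genome_length:
            break                 # reached the end of the sequence
        start += step
    return chunks


if __name__ == "__main__":
    for start, end in chunk_coordinates(450_000):
        print(start, end)
```

The overlap means a TE that straddles a chunk boundary can still fall entirely inside one chunk, provided it is no longer than the overlap; results from adjacent chunks then need to be merged back onto genome coordinates.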
The Annotation Pipeline