Master’s course Bioinformatics Data Analysis and Tools Lecture 8: Methods for repeats detection Centre for Integrative Bioinformatics FEW/FALW

Post on 20-Dec-2015

Page 1: Master’s course Bioinformatics Data Analysis and Tools Lecture 8: Methods for repeats detection Centre for Integrative Bioinformatics FEW/FALW heringa@cs.vu.nl

Master’s course

Bioinformatics Data Analysis and Tools

Lecture 8: Methods for repeats detection

Centre for Integrative Bioinformatics FEW/FALW

heringa@cs.vu.nl

Repeats in genomes

• More than 50% of the genome consists of repeats

• Transposable elements or interspersed repeats (comprise ~45% of human genome): LINEs and SINEs

• Retroposed pseudogenes

• Simple ‘stutters’: microsatellites (CAG repeats) and minisatellites

• Segmental duplications, intra- or interchromosomal (can be at multiple sites)

• Blocks of tandem repeats

Non-coding repeats in genomes

Element                                        Size (bp)       Copy number   % of genome
Short Interspersed Nuclear Elements (SINEs)    100-300         1,500,000     13
Long Interspersed Nuclear Elements (LINEs)     6000-8000       850,000       21
Long Terminal Repeats (LTRs)                   15000-110000    450,000       8
DNA Transposon Fossils                         80-3000         300,000       3

SATELLITE DNA

• Arrays of tandemly repeated simple-sequence DNA

• Found within and adjacent to centromeres, in heterochromatin

• Multiple different families include the human centromeric alphoid sequence

• Different repeat families are found within a single array that may exceed megabases in size

Further Reading

Cooper DN (1999). Human Gene Evolution. Bios Scientific Publishers, pp. 265–285.

Ridley M (1996). Evolution. Blackwell Science, pp. 265–276.

SIMPLE DNA SEQUENCE (SIMPLE REPEAT, SIMPLE SEQUENCE REPEAT)

DNA sequence composed of repeated identical or highly similar short motifs. Simple sequences may be polymorphic in length. The term encompasses both minisatellites and microsatellites and may also include nontandem arrangements of motifs. Sequences containing overrepresentations of short motifs that are not tandemly arranged are known as cryptically simple sequences. Simple repeats are widely dispersed in the human genome and account for at least 0.5% of genomic DNA.

Further Reading

Cooper DN (1999). Human Gene Evolution. Bios Scientific Publishers, pp. 265–285.

Ridley M (1996). Evolution. Blackwell Science, pp. 265–276.

Rowold DJ, Herrera RJ (2000). Alu elements and the human genome. Genetica 108: 57–72.

Tautz D, Trick M, Dover GA (1986). Cryptic simplicity in DNA is a major source of genetic variation. Nature 322: 652–656.

Microsatellites often used as genetic markers in disease gene mapping

To be useful, markers must (a) have a known and unique chromosomal location; (b) show at least a moderate degree of polymorphism; (c) have a low mutation rate; and (d) be easily typed (most usually by PCR-based methods). The types of markers most frequently used in genetic analyses are microsatellites (usually di-, tri-, or tetra-nucleotide repeats), which are often highly polymorphic and well suited to genomewide linkage analysis, and single-nucleotide polymorphisms (SNPs), which, though less polymorphic, are more frequent, are thought to represent most functional variation, and are best suited to association analysis.

Related Websites

Genethon microsatellite map: ftp://ftp.genethon.fr/pub/Gmap/Nature-1995/

Marshfield microsatellite map: http://www.marshfieldclinic.org/research/genetics/Map_Markers/maps/IndexMapFrames.html

DECODE microsatellite marker map: http://www.nature.com/cgitaf/DynaPage.taf?file=/ng/journal/v31/n3/full/ng917.html
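To make the notion of di-, tri- and tetra-nucleotide repeats concrete, here is a minimal sketch of a microsatellite scanner based on a regular expression. The function name `find_microsatellites` and the default of at least 4 copies are assumptions chosen for illustration; this is not how any of the marker maps above were built.

```python
import re

def find_microsatellites(dna, min_unit=2, max_unit=4, min_copies=4):
    """Return (start, unit, copies) for each perfect tandem repeat found.

    A unit of min_unit..max_unit bases must be followed by at least
    min_copies - 1 further identical copies. Note that a run of a single
    base (e.g. AAAAAAAA) is also reported, with unit "AA".
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(dna):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits

# Hypothetical test sequence: five AC copies flanked by unique bases.
print(find_microsatellites("TTG" + "AC" * 5 + "GGT"))
```

Only perfect repeats are found here; polymorphic or interrupted microsatellites would need an approximate matcher.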

Genomic repeats are associated with disease

• Huntington disease (poly-Gln in huntingtin protein)

• Fragile-X syndrome (CGG repeats), Friedreich’s ataxia (GAA repeats)

• Disease often arises when the repeat copy number exceeds a critical threshold, e.g. Huntington disease > 42 CAG (Gln) copies
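The threshold idea can be sketched in a few lines: count the longest uninterrupted run of a repeat unit and compare it with a cut-off. The helper names are hypothetical, and the default threshold of 42 is simply the figure quoted on this slide, not a clinical reference value.

```python
import re

def longest_tandem_run(dna, unit="CAG"):
    """Number of copies in the longest uninterrupted tandem run of `unit`."""
    runs = re.findall(r"(?:%s)+" % re.escape(unit), dna)
    return max((len(run) // len(unit) for run in runs), default=0)

def exceeds_threshold(dna, unit="CAG", threshold=42):
    """True if the longest run has more copies than `threshold`."""
    return longest_tandem_run(dna, unit) > threshold

# Hypothetical alleles, for illustration only.
expanded = "ATG" + "CAG" * 45 + "CCG"
normal = "ATG" + "CAG" * 20 + "CCG"
print(longest_tandem_run(expanded), exceeds_threshold(expanded))
print(longest_tandem_run(normal), exceeds_threshold(normal))
```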

Issues in DNA repeat finding

• 4-letter alphabet

• GC versus AT biases (P. falciparum 80% AT)

• Repeats found are normally not as distant as in protein sequences
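Why composition bias matters can be seen from a simple independence model: the chance probability of a motif is the product of its base frequencies, so in an AT-rich genome AT-rich words recur by chance far more often. The sketch below assumes, purely for illustration, that A and T share the AT fraction equally and likewise for G and C; the function names are hypothetical.

```python
import math

def base_frequencies(at_fraction):
    """Base frequencies assuming A = T and G = C (an illustrative assumption)."""
    gc_fraction = 1.0 - at_fraction
    return {"A": at_fraction / 2, "T": at_fraction / 2,
            "G": gc_fraction / 2, "C": gc_fraction / 2}

def motif_probability(motif, freqs):
    """Probability of `motif` at a given position under independent bases."""
    return math.prod(freqs[base] for base in motif)

uniform = base_frequencies(0.5)   # unbiased genome
at_rich = base_frequencies(0.8)   # 80% AT, as quoted for P. falciparum
motif = "AAAAAAAA"
ratio = motif_probability(motif, at_rich) / motif_probability(motif, uniform)
print(ratio)  # how much more often the motif is expected by chance
```

Under these assumptions an A-octamer is expected roughly 40 times more often in the AT-rich genome, which is why significance estimates for DNA repeats need a composition-aware background.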

REPuter

• Whole genome comparison?

• Sequence duplications?

• Palindromic structures? (MAX I STAY AWAY AT SIX AM)

• Degenerate (near identical) repeats?

• ORFs in Transposons?

• EST Clustering?
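The sentence on this slide is a palindrome in the everyday sense (it reads the same backwards once spaces are ignored). In DNA, a palindrome usually means a sequence equal to its own reverse complement. A small sketch of both checks, with hypothetical helper names:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def is_text_palindrome(text):
    """True if the letters of `text` read the same in both directions."""
    letters = [c for c in text.upper() if c.isalpha()]
    return letters == letters[::-1]

def reverse_complement(dna):
    """Reverse complement of an upper- or lower-case ACGT string."""
    return dna.upper().translate(COMPLEMENT)[::-1]

def is_dna_palindrome(dna):
    """True if `dna` equals its own reverse complement."""
    return dna.upper() == reverse_complement(dna)

print(is_text_palindrome("MAX I STAY AWAY AT SIX AM"))
print(is_dna_palindrome("GAATTC"))   # EcoRI recognition site
print(is_dna_palindrome("GATTACA"))
```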

REPuter

D. melanogaster exact repeats.

REPfind

• detects degenerate repeats with insertions, deletions, mismatches

• detects direct repeats as well as reverse complemented repeats

• output size: control via parameters for min. length and max. error

• these parameters can also be calculated heuristically by the program, according to the number of repeats required by the user

• output is sorted by significance scores (E-values, as in BLAST)

REPselect

• selects interesting repeats from the output of REPfind as specified by user-defined criteria

• delivers a list of repeats of chosen length, degeneracy or significance

• simple interfacing to other analysis programs via executable object code that is linked dynamically

REPvis

• state of the art interactive visualization tool to display large amounts of repeat data

• designed to be used by the biologist, thus putting the data in the hands of those who can best interpret it

• written in ANSI C using GTK

• overview of the repeats over whole genome / chromosome

• repeats displayed as lines connecting their starting positions

• further display modes: circular graph and dot plot

• color coded repeat lengths and significance scores

• scroll bar controls amount of data displayed

• zoom function to display details of a particular repetitive region

• single repeat can be selected to view the alignment

• submit corresponding nucleotide sequence to FASTA / BLAST

• subsequences can be exported to a file

• display additional user provided annotation

• batch mode to produce publishable graphic of repeat structure

REPuter output

Suffix trees

• REPuter is based on the concept of suffix trees for fast searching

• Suffix trees can only find identical (exact) matches, but exact substrings can be generalised into superstrings with gaps

Suffix trees

T12 = (empty)
T11 = i
T10 = pi
T9 = ppi
T8 = ippi
T7 = sippi
T6 = ssippi
T5 = issippi
T4 = sissippi
T3 = ssissippi
T2 = ississippi
T1 = mississippi

Figure 1 - The list of suffixes of the string 'mississippi'; these suffixes are added to a suffix tree in Fig. 2

+
-$
-i
- - $
- - ppi$
- - ssi
- - - ppi$
- - - ssippi$
-mississippi$
-p
- - i$
- - pi$
-s
- - i
- - - ppi$
- - - ssippi$
- - si
- - - ppi$
- - - ssippi$

Figure 2 - Explicit suffix tree for 'mississippi' using substrings as labels.

$ = leaf
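The link between suffixes and exact repeats can be sketched with a naive suffix trie (one character per edge, quadratic space), which is an assumption made for brevity: REPuter itself relies on far more compact suffix-tree constructions. Any node with two or more children spells a substring that occurs at least twice, so the deepest such node gives the longest exact repeat. All function names below are hypothetical.

```python
def suffixes(text):
    """All suffixes T1..Tn+1 of `text`, from the full string to the empty one."""
    return [text[i:] for i in range(len(text) + 1)]

def build_suffix_trie(text, terminator="$"):
    """Nested-dict trie holding every suffix of text + terminator."""
    root = {}
    padded = text + terminator
    for start in range(len(padded)):
        node = root
        for ch in padded[start:]:
            node = node.setdefault(ch, {})
    return root

def longest_repeated_substring(text):
    """Label of the deepest branching node, i.e. the longest exact repeat."""
    best = ""
    stack = [(build_suffix_trie(text), "")]
    while stack:
        node, prefix = stack.pop()
        # A branching node means `prefix` is followed by two different
        # characters, so it occurs at least twice in the text.
        if len(node) >= 2 and len(prefix) > len(best):
            best = prefix
        for ch, child in node.items():
            stack.append((child, prefix + ch))
    return best

print(len(suffixes("mississippi")))
print(longest_repeated_substring("mississippi"))
```

For 'mississippi' the longest exact repeat is 'issi', whose two occurrences overlap; a repeat finder would normally also report positions and filter on minimum length.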

Traditional approach

• Run local alignment algorithm (Smith-Waterman (SW), BLAST, FASTA)

• First mask simple (single nucleotide) repeats

• Methods: XBLAST, CENSOR, RepeatMasker

Protein internal repeats

• Can be complete domain repeats down to single amino acid repeats

• Repeats can be so diverged that there is no sequence signal anymore

• Repeats contain many mutations and insertions/deletions

• Multiple repeat types in single proteins

• Super repeats – repeats that comprise fixed internal repeat patterns (2A 3B C)

• ‘Just’ doing SW not good enough

Protein internal repeats: detection approaches

• Check sequence periodicities with (fast) Fourier transforms (FFT)

  – Repeats too distant

  – Multiple repeat types weaken signal for any given type

  – Repeats too interspersed

  – Is good for not-too-distant tandem repeats

• ‘Just’ doing SW good enough?
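The Fourier approach can be sketched as follows: turn the sequence into a 0/1 indicator signal for some residue class, compute its power spectrum, and read the period off the strongest frequency. The sketch uses a naive discrete Fourier transform rather than an FFT to stay dependency-free, and the residue class, test sequence and function names are all illustrative assumptions.

```python
import cmath

def power_spectrum(signal):
    """(k, power) for frequencies k = 1..n/2 of the mean-centred signal."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centred))
        spectrum.append((k, abs(coeff) ** 2))
    return spectrum

def dominant_period(sequence, residues):
    """Period (in residues) of the strongest periodicity of `residues`."""
    signal = [1.0 if ch in residues else 0.0 for ch in sequence]
    k, _ = max(power_spectrum(signal), key=lambda item: item[1])
    return len(signal) / k

# Hypothetical perfectly tandem sequence with a 7-residue unit.
print(dominant_period("LLLSSSS" * 10, "L"))
```

With perfect, contiguous repeats the peak is sharp; insertions between units or several interleaved repeat types smear the spectrum, which is the weakness listed above.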

Local dynamic programming (Smith & Waterman, 1981)

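[Figure: search matrix for local alignment. Labels recoverable from the slide: sequence letters LCFVMLAGSTVIVGTREDASTILCGS; Amino Acid Exchange Matrix; Gap penalties (open, extension); Search matrix; Negative numbers. Resulting local alignment:]

AGSTVIVG
A-STILCG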

Local dynamic programming (Smith & Waterman, 1981)

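\[
S_{i,j} = \max
\begin{cases}
S_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)\,P_x \} \\
S_{i,j} + S_{i-1,j-1} \\
S_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)\,P_x \} \\
0
\end{cases}
\]

On the right-hand side, S_{i,j} is the exchange value initially stored in cell (i,j); the maxima run over row i-1 and column j-1, as marked in the slide diagram. P_i and P_x are presumably the gap-open and gap-extension penalties.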
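A minimal sketch of the recurrence on this slide, computing only the best local score. It assumes a simple match/mismatch score in place of an amino acid exchange matrix, and it reads P_i as a gap-open and P_x as a gap-extension penalty; the function name and default values are hypothetical.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap_open=2, gap_ext=1):
    """Best local alignment score of strings a and b.

    H[i][j] is the best score of a local alignment ending with a[i-1]
    aligned to b[j-1]; a gap of length L costs gap_open + L * gap_ext.
    """
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            prev = H[i - 1][j - 1]               # no gap before this pair
            for x in range(1, i - 1):            # rows x+1..i-1 are skipped
                prev = max(prev, H[x][j - 1] - gap_open - (i - x - 1) * gap_ext)
            for y in range(1, j - 1):            # columns y+1..j-1 are skipped
                prev = max(prev, H[i - 1][y] - gap_open - (j - y - 1) * gap_ext)
            H[i][j] = max(0, s + prev)           # the 0 keeps alignments local
            best = max(best, H[i][j])
    return best

print(local_alignment_score("ACGT", "ACGT"))
print(local_alignment_score("AAAACCCC", "AAAAGGCCCC"))
```

Scanning a whole row and column per cell makes this cubic in sequence length; practical implementations use Gotoh's formulation to bring affine gaps down to quadratic time.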

2 Basic methods: REPRO and RADAR

• REPRO (Heringa & Argos, 1993)

• RADAR (Heger & Holm, 2000)

RADAR (Heger & Holm, Proteins. 2000 Nov 1;41(2):224-37)

Many large proteins have evolved by internal duplication and many internal sequence repeats correspond to functional and structural units. We have developed an automatic algorithm, RADAR, for segmenting a query sequence into repeats. The segmentation procedure has three steps: (i) repeat length is determined by the spacing between suboptimal self-alignment traces; (ii) repeat borders are optimized to yield a maximal integer number of repeats, and (iii) distant repeats are validated by iterative profile alignment. The method identifies short composition biased as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in the query sequence. No manual intervention and no prior assumptions on the number and length of repeats are required. Comparison to the Pfam-A database indicates good coverage, accurate alignments, and reasonable repeat borders. Screening the Swissprot database revealed 3,000 repeats not annotated in existing domain databases. A number of these repeats had been described in the literature but most were novel. This illustrates how in times when curated databases grapple with ever increasing backlogs, automatic (re)analysis of sequences provides an efficient way to capture this important information.

Early method: REPRO

• Detects repeats in single sequences

• Based on Waterman-Eggert extension of Smith-Waterman algorithm

• Iterative graph-clustering procedure to check for consistent repeat type members

• Iterative profile scanning of repeat clusters found
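The first step, collecting non-overlapping top alignments of a sequence against itself, can be sketched in simplified form. This is not REPRO's code and not the exact Waterman-Eggert procedure: it uses match/mismatch scores, a linear gap penalty, and recomputes the whole matrix after masking the cells of each alignment found. All names and parameter values are assumptions.

```python
def top_self_alignments(seq, n_top=3, match=2, mismatch=-1, gap=2):
    """Up to n_top local self-alignments as (score, first_cell, last_cell).

    Cells are 1-based (i, j) pairs with j > i, so a sequence position is
    never aligned with itself. Cells used by earlier alignments are masked.
    """
    n = len(seq)
    used = set()
    results = []
    for _ in range(n_top):
        H = [[0] * (n + 1) for _ in range(n + 1)]
        best, best_cell = 0, None
        for i in range(1, n + 1):
            for j in range(i + 1, n + 1):        # upper triangle only
                if (i, j) in used:
                    continue                     # masked cell stays at 0
                s = match if seq[i - 1] == seq[j - 1] else mismatch
                H[i][j] = max(0, H[i - 1][j - 1] + s,
                              H[i - 1][j] - gap, H[i][j - 1] - gap)
                if H[i][j] > best:
                    best, best_cell = H[i][j], (i, j)
        if best_cell is None:
            break                                # nothing left to report
        i, j = best_cell
        path = []
        while i > 0 and j > 0 and H[i][j] > 0:   # trace back to the start
            path.append((i, j))
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:
                i, j = i - 1, j - 1
            elif H[i][j] == H[i - 1][j] - gap:
                i -= 1
            else:
                j -= 1
        used.update(path)
        results.append((best, path[-1], path[0]))
    return results

# Hypothetical sequence with one repeated unit (ACDE) around a unique spacer.
print(top_self_alignments("ACDEFGHACDE"))
```

The start cells of such alignments, here (1, 8), pair up repeat start sites; those pairs are the kind of evidence a graph-clustering step can then group into consistent repeat types.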

Repro: finding the top-alignments

Repro: iterative graph-based clustering of repeat start sites

[Figure: non-overlapping top-alignments and the initial graphs built from their repeat start sites. The node labels (sequence positions) are not recoverable from the transcript.]