master’s course bioinformatics data analysis and tools lecture 8: methods for repeats detection...

Master’s course

Bioinformatics Data Analysis and Tools

Lecture 8: Methods for repeats detection

Centre for Integrative Bioinformatics, FEW/FALW

[email protected]

Repeats in genomes

• More than 50% of the genome consists of repeats

• Transposable elements or interspersed repeats (comprise ~45% of human genome): LINEs and SINEs

• Retroposed pseudogenes

• Simple ‘stutters’: microsatellites (CAG repeats) and minisatellites

• Segmental duplications, intra- or interchromosomal (can be at multiple sites)

• Blocks of tandem repeats

Non-coding repeats in genomes

Element                                       Size (bp)      Copy No    % Genome
Short Interspersed Nuclear Elements (SINEs)   100-300        1.5M       13
Long Interspersed Nuclear Elements (LINEs)    6000-8000      850,000    21
Long Terminal Repeats (LTRs)                  15000-110000   450,000    8
DNA Transposon Fossils                        80-3000        300,000    3

SATELLITE DNA

• Arrays of tandemly repeated simple sequence DNA

• Found within and adjacent to centromeres in heterochromatin

• Multiple different families include the human centromeric alphoid sequence

• Different repeat families are found within a single array that may exceed megabases in size

Further Reading

Cooper DN (1999). Human Gene Evolution. Bios Scientific Publishers, pp. 265–285.

Ridley M (1996). Evolution. Blackwell Science, pp. 265–276.

SIMPLE DNA SEQUENCE (SIMPLE REPEAT, SIMPLE SEQUENCE REPEAT)

DNA sequence composed of repeated identical or highly similar short motifs. Simple sequences may be polymorphic in length. The term encompasses both minisatellites and microsatellites and may also include nontandem arrangements of motifs. Sequences containing overrepresentations of short motifs that are not tandemly arranged are known as cryptically simple sequences. Simple repeats are widely dispersed in the human genome and account for at least 0.5% of genomic DNA.

Further Reading

Cooper DN (1999). Human Gene Evolution. Bios Scientific Publishers, pp. 265–285.

Ridley M (1996). Evolution. Blackwell Science, pp. 265–276.

Rowold DJ, Herrera RJ (2000). Alu elements and the human genome. Genetica 108: 57–72.

Tautz D, Trick M, Dover GA (1986). Cryptic simplicity in DNA is a major source of genetic variation. Nature 322: 652–656.

Microsatellites often used as genetic markers in disease gene mapping

To be useful, markers must (a) have a known and unique chromosomal location; (b) show at least a moderate degree of polymorphism; (c) have a low mutation rate; and (d) be easily typed (most usually by PCR-based methods). The types of markers most frequently used in genetic analyses are microsatellites (usually di-, tri-, or tetra-nucleotide repeats), which are often highly polymorphic and well suited to genomewide linkage analysis, and single-nucleotide polymorphisms (SNPs), which, though less polymorphic, are more frequent, are thought to represent most functional variation, and are best suited to association analysis.

Related Websites

Genethon microsatellite map ftp://ftp.genethon.fr/pub/Gmap/Nature-1995/

Marshfield microsatellite map http://www.marshfieldclinic.org/research/genetics/Map_Markers/maps/IndexMapFrames.html

DECODE microsatellite marker map http://www.nature.com/cgitaf/DynaPage.taf?file=/ng/journal/v31/n3/full/ng917.html

Genomic repeats are associated with disease

• Huntington disease (poly-GLN in huntingtin protein)

• Fragile-X syndrome (CGG repeats); Friedreich’s ataxia (GAA repeats)

• Disease often arises when the repeat copy number exceeds a critical threshold, e.g. Huntington disease at > 42 CAG (Gln) copies
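The threshold idea above can be sketched in a few lines of Python. This is only an illustration: the function names `longest_tandem_run` and `exceeds_threshold` are hypothetical, and the default threshold of 42 is simply the figure quoted on the slide, not a diagnostic cut-off.

```python
import re

def longest_tandem_run(seq, motif="CAG"):
    """Return the largest number of back-to-back copies of `motif` in `seq`."""
    motif = motif.upper()
    # (?:CAG)+ matches one or more uninterrupted copies of the motif.
    pattern = re.compile("(?:%s)+" % re.escape(motif))
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)

def exceeds_threshold(seq, motif="CAG", threshold=42):
    """True if the longest tandem run has more copies than `threshold`.

    The default of 42 is taken from the slide; it is an assumption here.
    """
    return longest_tandem_run(seq, motif) > threshold

print(longest_tandem_run("TTCAGCAGCAGTTCAG"))
```

Only perfect, uninterrupted copies are counted; a real expansion assay would also have to deal with interrupted repeats and sequencing errors.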

Issues in DNA repeat finding

• 4-letter alphabet

• CG versus AT biases (P. falciparum 80% AT)

• Repeats found are normally not as distant as in protein sequences

REPuter

• Whole genome comparison?

• Sequence duplications?

• Palindromic structures? (e.g. MAX I STAY AWAY AT SIX AM)

• Degenerate (near identical) repeats?

• ORFs in Transposons?

• EST Clustering?
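The palindrome on this slide is a textual one. In DNA, a palindromic structure usually means a sequence that equals its own reverse complement. The sketch below contrasts the two notions; the function names are hypothetical and the code is not part of REPuter.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def is_dna_palindrome(seq):
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)

def is_text_palindrome(text):
    """True if the letters read the same in both directions (spaces ignored)."""
    letters = [c for c in text.upper() if c.isalpha()]
    return letters == letters[::-1]

print(is_text_palindrome("MAX I STAY AWAY AT SIX AM"))
print(is_dna_palindrome("GAATTC"))
```

The same `reverse_complement` helper is what a search for reverse complemented repeats needs: a reverse complemented repeat of a segment is a direct occurrence of that segment's reverse complement.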

REPuter

D. melanogaster exact repeats.

REPfind

• detects degenerate repeats with insertions, deletions, mismatches

• detects direct repeats as well as reverse complemented

• output size: control via parameters for min. length and max. error

• these parameters can also be calculated heuristically by the program, according to the number of repeats required by the user

• output is sorted by significance scores (E-values, as in BLAST)

REPselect

• selects interesting repeats from the output of REPfind as specified by user-defined criteria

• delivers a list of repeats of chosen length, degeneracy or significance

• simple interfacing to other analysis programs via executable object code that is linked dynamically

REPvis

• state of the art interactive visualization tool to display large amount of repeats data

• designed to be used by the biologist, thus putting the data in the hands of those who can best interpret it

• written in ANSI C using GTK

• overview of the repeats over whole genome / chromosome

• repeats displayed as lines connecting their starting positions

• further display modes: circular graph and dot plot

• color coded repeat lengths and significance scores

• scroll bar controls amount of data displayed

• zoom function to display details of a particular repetitive region

• single repeat can be selected to view the alignment

• submit corresponding nucleotide sequence to FASTA / BLAST

• subsequences can be exported to a file

• display additional user provided annotation

• batch mode to produce publishable graphic of repeat structure

REPuter output

Suffix trees

• REPuter is based on the concept of suffix trees for fast searching

• Suffix trees support only exact (identical) matches, but exact matches can be generalised by joining substrings into longer superstrings with gaps

Suffix trees

T12 = (empty)
T11 = i
T10 = pi
T9 = ppi
T8 = ippi
T7 = sippi
T6 = ssippi
T5 = issippi
T4 = sissippi
T3 = ssissippi
T2 = ississippi
T1 = mississippi

Figure 1 - The list of suffixes in the string 'mississippi'; these suffixes are added to a suffix tree in Fig. 2

+
-$
-i
- - $
- - ppi$
- - ssi
- - - ppi$
- - - ssippi$
-mississippi$
-p
- - i$
- - pi$
-s
- - i
- - - ppi$
- - - ssippi$
- - si
- - - ppi$
- - - ssippi$

Figure 2 - Explicit suffix tree for 'mississippi' using substrings as labels. $ = leaf
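To make the link between suffixes and exact repeats concrete, here is a minimal Python sketch. It does not build a suffix tree: it sorts the suffixes (a suffix-array-like shortcut, swapped in for brevity) and compares neighbours, which finds the same longest exact repeat but far less efficiently than a real suffix tree. The function names are hypothetical.

```python
def suffixes(text):
    """All suffixes T1..T(n+1) of `text`, the last one being the empty string."""
    return [text[i:] for i in range(len(text) + 1)]

def common_prefix_length(s, t):
    """Number of leading characters shared by s and t."""
    k = 0
    while k < min(len(s), len(t)) and s[k] == t[k]:
        k += 1
    return k

def longest_repeated_substring(text):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    ordered = sorted(text[i:] for i in range(len(text)))
    best = ""
    # Suffixes that share a long prefix end up next to each other once sorted.
    for s, t in zip(ordered, ordered[1:]):
        k = common_prefix_length(s, t)
        if k > len(best):
            best = s[:k]
    return best

print(suffixes("mississippi"))
print(longest_repeated_substring("mississippi"))
```

In the tree of Figure 2 the same repeat corresponds to the deepest internal node: the path i, ssi spells 'issi', which has two leaves below it.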

Traditional approach

• Run local alignment algorithm (Smith-Waterman (SW), BLAST, FASTA)

• First mask simple (single nucleotide) repeats

• Methods: XBLAST, CENSOR, RepeatMasker

Protein internal repeats

• Can be complete domain repeats down to single amino acid repeats

• Repeats can be so diverged that there is no sequence signal anymore

• Repeats contain many mutations and insertions/deletions

• Multiple repeat types in single proteins

• Super repeats – repeats that comprise fixed internal repeat patterns (2A 3B C)

• ‘Just’ doing SW not good enough

Protein internal repeats: detection approaches

• Check sequence periodicities with (fast) Fourier transforms (FFT)

  – Repeats too distant

  – Multiple repeat types weaken signal for any given type

  – Repeats too interspersed

  – Is good for not-too-distant tandem repeats

• ‘Just’ doing SW good enough?
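The periodicity check mentioned above can be illustrated without an FFT library. The sketch below is a direct, quadratic-time stand-in: for each offset k it counts positions where the sequence matches itself shifted by k, which is the kind of self-similarity signal a Fourier analysis looks for. The function names are hypothetical, and exact residue identity is assumed as the only measure of similarity.

```python
def offset_match_fractions(seq, max_offset=None):
    """Fraction of identical residues between seq and seq shifted by k."""
    n = len(seq)
    if max_offset is None:
        max_offset = n // 2
    fractions = {}
    for k in range(1, max_offset + 1):
        matches = sum(1 for i in range(n - k) if seq[i] == seq[i + k])
        fractions[k] = matches / (n - k)
    return fractions

def best_period(seq, max_offset=None):
    """Offset with the highest match fraction; ties go to the smaller offset."""
    fractions = offset_match_fractions(seq, max_offset)
    return max(fractions, key=lambda k: (fractions[k], -k))

print(best_period("ABCDE" * 4))
```

On a perfect tandem repeat the signal is clean, but as the bullets above note, it weakens quickly once repeats diverge, are interspersed, or come in several types. The sequence must have at least two residues for `best_period` to return a value.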

Local dynamic programming (Smith & Waterman, 1981)

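Figure: two sequences (LCFVMLAGSTVIVGTR and EDASTILCGS) are compared using an amino acid exchange matrix (with negative numbers) and gap penalties (open, extension) to fill a search matrix; the resulting local alignment is:

AGSTVIVG
A-STILCG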

Local dynamic programming (Smith & Waterman, 1981)

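\[
S_{i,j} = \max
\begin{cases}
S_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1) P_x \} \\
S_{i,j} + S_{i-1,j-1} \\
S_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1) P_x \} \\
0
\end{cases}
\]

On the right-hand side, S_{i,j} is the exchange value initially stored in cell (i, j); P_i and P_x are the gap penalties (presumably open and extension, respectively).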
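The recurrence can be turned into a short Python sketch. Several things here are assumptions made for illustration: a simple match/mismatch score replaces the amino acid exchange matrix, the parameter names and default values are hypothetical, and only the best score and its end cell are returned (no traceback).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap_open=2, gap_ext=1):
    """Best local alignment score of a and b, plus the cell where it ends."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay zero so that S[i-1][j-1] always exists.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best_score, best_cell = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            exchange = match if a[i - 1] == b[j - 1] else mismatch
            # Case without a gap: continue from the diagonal neighbour.
            prev = S[i - 1][j - 1]
            # Gap in b: jump from an earlier row x in column j-1.
            for x in range(1, i - 1):
                prev = max(prev, S[x][j - 1] - gap_open - (i - x - 1) * gap_ext)
            # Gap in a: jump from an earlier column y in row i-1.
            for y in range(1, j - 1):
                prev = max(prev, S[i - 1][y] - gap_open - (j - y - 1) * gap_ext)
            # The zero floor is what makes the alignment local.
            S[i][j] = max(0, exchange + prev)
            if S[i][j] > best_score:
                best_score, best_cell = S[i][j], (i, j)
    return best_score, best_cell

print(smith_waterman("ACGT", "ACGT"))
```

Scanning a whole row and column for every cell makes this cubic in sequence length; it mirrors the recurrence as written rather than the faster formulations used in practice.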

2 Basic methods: REPRO and RADAR

• REPRO (Heringa & Argos, 1993)

• RADAR (Heger & Holm, 2000)

RADAR (Heger & Holm, Proteins. 2000 Nov 1;41(2):224-37)

Many large proteins have evolved by internal duplication and many internal sequence repeats correspond to functional and structural units. We have developed an automatic algorithm, RADAR, for segmenting a query sequence into repeats. The segmentation procedure has three steps: (i) repeat length is determined by the spacing between suboptimal self-alignment traces; (ii) repeat borders are optimized to yield a maximal integer number of repeats, and (iii) distant repeats are validated by iterative profile alignment. The method identifies short composition biased as well as gapped approximate repeats and complex repeat architectures involving many different types of repeats in the query sequence. No manual intervention and no prior assumptions on the number and length of repeats are required. Comparison to the Pfam-A database indicates good coverage, accurate alignments, and reasonable repeat borders. Screening the Swissprot database revealed 3,000 repeats not annotated in existing domain databases. A number of these repeats had been described in the literature but most were novel. This illustrates how in times when curated databases grapple with ever increasing backlogs, automatic (re)analysis of sequences provides an efficient way to capture this important information.

Early method: REPRO

• Detects repeats in single sequences

• Based on Waterman-Eggert extension of Smith-Waterman algorithm

• Iterative graph-clustering procedure to check for consistent repeat type members

• Iterative profile scanning of repeat clusters found


Repro: finding the top-alignments

Repro: iterative graph-based clustering of repeat start sites

Figure: non-overlapping top-alignments; initial graphs.
