
Post on 18-Dec-2015

Computational Molecular Biology, MPI for Molecular Genetics

DNA sequence analysis

Gene prediction

Gene prediction methods

Gene indices

Mapping cDNA on genomic DNA

Applications

DNA sequence analysis: Gene prediction

[Figure: schematic gene structure. Labels: promoter, 5' UTR, exon 1, exon 2, ..., exon n-1, exon n, 3' UTR, protein coding sequence]

Gene prediction: Strategies for detecting ORFs / exons

Distribution of Stop-codons

Codon usage

Hexamer frequencies

Prediction of the coding frame

Splice site recognition (eukaryotes only)
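
The first strategy, the distribution of stop codons, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `open_stretches` and the `min_codons` threshold are hypothetical, only the three forward frames are scanned, and real gene finders combine this signal with the other criteria listed above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_stretches(seq, min_codons=3):
    """Return (frame, start, end) for stop-free stretches.

    Coordinates are 0-based with an exclusive end; only the three
    forward frames are scanned (an assumption of this sketch).
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = frame
        # The range stops before any incomplete trailing codon.
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    found.append((frame, start, pos))
                start = pos + 3
        # Close the stretch that runs to the last complete codon.
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - start) // 3 >= min_codons:
            found.append((frame, start, end))
    return found

print(open_stretches("ATGAAACCCGGGTAAATGTTT"))
```

Long stop-free stretches are only candidates; whether one is actually coding has to be decided with additional evidence such as codon usage.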

Gene prediction: Codon usage (single exon)

[Figure: codon usage plots for Frame 1, Frame 2, and Frame 3; regions marked as coding or non-coding]

Gene prediction: Codon usage (single exon)

[Figure: codon usage plots for Frame 1, Frame 2, and Frame 3; regions marked as coding or non-coding; the correct start and the coding sequence are indicated]
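
One common way to turn codon usage into a per-frame signal is a log-odds score that compares codon frequencies in known coding sequence with a background model. The slides do not state which scoring scheme was used for the plots, so the following is a hedged sketch: `frame_scores`, the two frequency tables, and the `floor` pseudo-frequency are assumptions made for illustration.

```python
import math
from collections import Counter

def codon_counts(seq, frame=0):
    """Count the complete codons of seq read in the given frame (0, 1 or 2)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))

def frame_scores(seq, coding_freq, background_freq, floor=1e-6):
    """Average log-odds (coding vs. background) per codon for each frame.

    coding_freq and background_freq map codons to relative frequencies;
    codons missing from a table get the small pseudo-frequency `floor`.
    """
    scores = []
    for frame in range(3):
        counts = codon_counts(seq, frame)
        total = sum(counts.values())
        log_odds = sum(
            n * math.log(max(coding_freq.get(codon, 0.0), floor)
                         / max(background_freq.get(codon, 0.0), floor))
            for codon, n in counts.items()
        )
        scores.append(log_odds / total if total else 0.0)
    return scores
```

The frame with the highest score is the most plausible coding frame for the window; sliding the window along the sequence gives one curve per frame.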

Gene prediction: Codon usage (multiple exons)

[Figure: codon usage plots for Frame 1, Frame 2, and Frame 3; regions marked as coding or non-coding; splice sites indicated]

Exons: 208..295, 1029..1349, 1500..1688, 2686..2934, 3326..3444, 3573..3680, 4135..4309, 4708..4846, 4993..5096, 7301..7389, 7860..8013, 8124..8405, 8553..8713, 9089..9225, 13841..14244
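
The exon list can be handled as plain coordinate pairs. The sketch below parses the `start..end` notation, derives exon and intron lengths, and splices the exons out of a genomic sequence. It assumes 1-based inclusive coordinates (as in database feature tables); the helper names `parse_exons` and `splice` are hypothetical.

```python
def parse_exons(text):
    """Parse '208..295, 1029..1349' into 1-based inclusive (start, end) pairs."""
    exons = []
    for part in text.split(","):
        start, end = part.strip().split("..")
        exons.append((int(start), int(end)))
    return exons

def splice(genomic, exons):
    """Concatenate the exon substrings of a genomic sequence."""
    return "".join(genomic[start - 1:end] for start, end in exons)

# Exon coordinates as read from the slide.
EXONS = parse_exons(
    "208..295, 1029..1349, 1500..1688, 2686..2934, 3326..3444, "
    "3573..3680, 4135..4309, 4708..4846, 4993..5096, 7301..7389, "
    "7860..8013, 8124..8405, 8553..8713, 9089..9225, 13841..14244"
)
exon_lengths = [end - start + 1 for start, end in EXONS]
intron_lengths = [nxt[0] - cur[1] - 1 for cur, nxt in zip(EXONS, EXONS[1:])]

print(len(EXONS), sum(exon_lengths), max(intron_lengths))
```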

Gene prediction: Additional criteria

Detection of start codons

Detection of potential promoter elements

Detection of repetitive sequences (mostly untranslated)

Homology to known genes of related organisms

Gene prediction: Software

GENSCAN (C. Burge & S. Karlin)

Grail (neural network; Uberbacher et al.)

MZEF (M. Zhang, 1997)

FGeneH, Hexon (V. Solovyev et al., 1994)

Genie, etc.

All programs use dynamic programming to find the optimal solution
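
The slide does not describe the recurrences these programs actually use. As a generic illustration of the idea only, the sketch below applies dynamic programming to pick the highest-scoring set of non-overlapping exon candidates (weighted interval scheduling). The candidate scores and the function name `best_exon_chain` are assumptions; real gene finders additionally enforce frame compatibility and splice-site signals.

```python
from bisect import bisect_right

def best_exon_chain(candidates):
    """Pick non-overlapping candidates with maximal total score.

    candidates: list of (start, end, score) with inclusive integer
    coordinates. Returns (total_score, chain).
    """
    cands = sorted(candidates, key=lambda c: c[1])
    ends = [c[1] for c in cands]
    # best[i] holds the optimum over the first i candidates (sorted by end).
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(cands):
        # Number of earlier candidates that end before this one starts.
        k = bisect_right(ends, start - 1, 0, i)
        take = (best[k][0] + score, best[k][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

print(best_exon_chain([(1, 100, 5.0), (50, 150, 8.0),
                       (120, 200, 4.0), (160, 220, 3.0)]))
```

Each candidate is considered once, and the optimum for a prefix is reused instead of recomputed, which is the property that makes dynamic programming suitable for assembling exons into a gene model.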

DNA sequences in public databases

Human

~ 4 million ESTs + 130 000 RNAs

Mouse

~ 2.7 million ESTs + 30 000 RNAs

Expressed sequence tags (EST)

[Figure: mRNA with poly(A) tail (AAAAAA...) and oligo(dT) primer (TTTTTT...); cDNA]

cDNA synthesis is usually primed with oligo(dT) or with random primers

Reverse transcriptase stops 'randomly'

Several cDNAs may be generated for the same mRNA

Expressed sequence tags (EST)

[Figure: clone in a vector. Labels: clone = mRNA fragment (average: 1500 bp); vector (known sequence); deciphered sequence (EST), < 700 bp; 3' primer]

Expressed sequence tags (EST)

Isolation of mRNAs from tissue(s)

Generation of cDNAs reflecting parts of the RNAs

Cloning of cDNAs into a vector (often random orientation)

End sequencing of the clones

Generation of ESTs: basecalling problems

close to 3' end of EST

close to 5' end of EST

missing bases

Coverage of an mRNA by ESTs

[Figure: putative mRNA (5' UTR, exon 1, exon 2, 3' UTR, poly(A) tail AAAAAA...) covered by expressed sequence tags (ESTs)]

Characteristics of ESTs

Highly redundant

Low sequence quality

(Cheap)

Reflect expressed genes

May be tissue/stage specific

Gene indices

UniGene (NCBI)

TIGR Gene Indices

STACK (SANBI)

GeneNest (DKFZ,MPI)

Clustering of EST and mRNA sequences of an organism to reduce redundancy in sequence data.

Goal: Each cluster represents one gene or mRNA

Gene indices: GeneNest workflow

Input: EMBL database and UniGene database

Quality clipping (applied to both inputs)

BLAST/QUASAR search, clustering

Assembly, consensus sequences

Visualization

Gene indices: Quality clipping

In order to cluster based on gene-specific sequence data, the following steps have to be performed:

Removal of vector sequence

Masking of repetitive sequences (e.g. Alu)

Removal of terminal sequences of low quality
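
Two of the clipping steps can be sketched compactly. The code below trims low-quality ends using per-base quality values and masks given intervals (for example repeats or vector) with `N`. The threshold `min_q=20`, the function names, and the assumption that repeat or vector intervals have already been located by a separate search are all illustrative; the slides do not specify how GeneNest implements these steps.

```python
def clip_low_quality_ends(seq, quals, min_q=20):
    """Trim bases from both ends while their quality is below min_q."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]

def mask_intervals(seq, intervals, mask_char="N"):
    """Replace 0-based, end-exclusive intervals with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        start, end = max(start, 0), min(end, len(chars))
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)

print(clip_low_quality_ends("ACGTACGT", [5, 10, 30, 30, 30, 30, 8, 3]))
print(mask_intervals("ACGTACGT", [(2, 5)]))
```

Masking rather than deleting repeats keeps the coordinates of the remaining sequence intact, so later alignments still refer to the original positions.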

Gene indices: Clustering

Sequences are usually clustered if the matching part between two sequences fulfills several (empirical) criteria:

Minimal % identity (e.g. > 95%)

Minimal length of match (e.g. > 40 bp)

No internal matches (TIGR Gene Indices)

Same tissue of origin (STACK only)
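
The first two criteria can be turned into a simple clustering procedure. The sketch below joins sequences transitively (single linkage) whenever a pairwise match passes both thresholds, using a union-find structure. Single linkage, the input format of `matches`, and the function name `cluster` are assumptions for illustration; the individual gene indices differ in their exact rules.

```python
def cluster(seq_ids, matches, min_identity=95.0, min_length=40):
    """Single-linkage clustering of sequences from pairwise matches.

    matches: iterable of (id_a, id_b, percent_identity, match_length),
    e.g. parsed from a pairwise similarity search.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, length in matches:
        if identity > min_identity and length > min_length:
            parent[find(a)] = find(b)

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values())

print(cluster(["e1", "e2", "e3", "e4"],
              [("e1", "e2", 98.0, 120), ("e2", "e3", 96.5, 60),
               ("e3", "e4", 99.0, 30)]))
```

Here `e4` stays alone because its only match is shorter than the minimal length, while `e1` and `e3` end up together through their shared partner `e2`.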

Gene indices: Assembly

Sequences in a cluster are assembled to group those sequences which are globally similar, resulting in:

Contigs, reflecting parts of different transcripts

One consensus sequence per contig

A relative order of the sequences (alignment)

Gene indices: Consensus sequences

Reduced error rate

Consensus is often longer than any single contributing sequence

Efficient database search

Detection of exon/intron boundaries and alternative splice variants
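
The first two properties follow directly from how a consensus is built. As a minimal sketch, assume the assembled sequences are given as equally indexed alignment rows, with `.` marking positions a sequence does not cover and `-` marking alignment gaps; a majority vote per column then yields the consensus. Real assemblers such as those named in this lecture weight the vote by base quality, which this sketch does not.

```python
from collections import Counter

def consensus(rows, uncovered="."):
    """Majority-vote consensus of alignment rows.

    '.' marks positions outside a sequence, '-' marks alignment gaps.
    Columns whose majority is a gap are dropped; columns without
    coverage become 'N'.
    """
    out = []
    for i in range(max(len(r) for r in rows)):
        column = [r[i] for r in rows if i < len(r) and r[i] != uncovered]
        if not column:
            out.append("N")
            continue
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            out.append(base)
    return "".join(out)

print(consensus(["ACGTAC....",
                 "..GTTCGA..",
                 "....ACGATT"]))
```

In this toy alignment the middle sequence carries a sequencing error at the fifth position, which is outvoted by the other two, and the consensus spans ten bases although each sequence covers only six.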

Gene indices: Alignment

consensus

Gene indices: Alignment software

Phrap (Phil Green)

CAP3 (X. Huang)

TIGR assembler

GAP4 (R. Staden)

GeneNest visualization (http://genenest.molgen.mpg.de)

TIGR Gene Indices (http://www.tigr.org/)

Alignment scheme

UniGene (http://www.ncbi.nih.nlm.gov/UniGene)

Mapping of consensus sequences on genomic DNA

genomic sequence

exons

consensus sequence (mRNA)

missing intron
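
Because the consensus (mRNA) lacks the introns, the gaps between consecutive aligned blocks on the genomic sequence mark putative introns. The sketch below derives them from block coordinates and checks for the canonical GT...AG dinucleotides at the intron ends. It assumes the alignment blocks are already available as 1-based inclusive genomic coordinates on the forward strand; both function names are hypothetical.

```python
def introns_from_blocks(blocks):
    """Derive introns from aligned exon blocks.

    blocks: sorted list of 1-based inclusive genomic (start, end) pairs,
    one per aligned stretch of the consensus. Returns the gaps between
    consecutive blocks as 1-based inclusive (start, end) pairs.
    """
    return [(a_end + 1, b_start - 1)
            for (_a_start, a_end), (b_start, _b_end) in zip(blocks, blocks[1:])
            if b_start - a_end > 1]

def has_canonical_sites(genomic, intron):
    """True if the intron starts with GT and ends with AG (forward strand)."""
    start, end = intron
    sub = genomic[start - 1:end].upper()
    return len(sub) >= 4 and sub.startswith("GT") and sub.endswith("AG")

genomic = "AAACCC" + "GTAAGTTTTCAG" + "GGGTTT"
introns = introns_from_blocks([(1, 6), (19, 24)])
print(introns, [has_canonical_sites(genomic, i) for i in introns])
```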

Mapping cDNA on genomic DNA

Gene indices: Applications

Detection of exon/intron boundaries

Detection of alternative splicing

Detection of Single Nucleotide Polymorphisms

Genome annotation

Analysis of gene expression

Genome-genome comparison

Alternative Splicing

[Figure: one hnRNA (5' UTR, exon 1, exon 2, exon 3) spliced into two variants: mRNA 1 (5' UTR, exon 1, exon 3) and mRNA 2 (5' UTR, exon 1, exon 2)]

Alignment of EST consensus sequences and genomic target

genomic sequence

Detection of the appropriate genomic target sequence

Local similarity of EST consensus and genomic DNA: > 96% identity

genomic sequence

Cutting out genomic target sequence

genomic sequence

Alternative Splicing (mapping on genomic DNA)

genomic sequence

exons

consensus sequence (mRNA)

splice variant

SpliceNest (http://SpliceNest.molgen.mpg.de)

putative exons

genomic sequence

aligned GeneNest consensus

alternative exon

Alternative Splicing (additional exon)

skipped exon

Splice variants of adenylosuccinate lyase

gene prediction errors?

unspliced?

Alternative Splicing

Splice variants of APECED gene

[Figure labels: number of sequences; genomic sequence; alternative variants]

Alternative splicing

Alternative Splicing (alternative donor site)

Alternative Splicing

Alternative Splicing (alternative exons)

SpliceNest (hypothetical gene Hs16936)

Single Nucleotide Polymorphisms (SNP)

SNPs are single base differences within one species

Several million SNPs have been detected in human

SNPs may be related to diseases

Single Nucleotide Polymorphisms (SNP)

SNP or basecalling error?
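
A simple heuristic for this question is to require that the second most frequent base in an alignment column is supported by more than one sequence, since an isolated deviation is more likely a basecalling error. The sketch below applies this rule to alignment rows; the threshold `min_minor_count=2`, the row format (`.` for uncovered positions), and the function name are assumptions, and the slides do not say which rule was actually applied.

```python
from collections import Counter

def candidate_snps(rows, min_minor_count=2, uncovered="."):
    """Report columns where a second base is seen at least min_minor_count times.

    rows: alignment rows of the ESTs in one contig. Gaps ('-'), 'N' and
    uncovered positions are ignored. Returns (column, major, minor) tuples
    with 0-based column indices.
    """
    hits = []
    for i in range(max(len(r) for r in rows)):
        column = Counter(
            r[i] for r in rows
            if i < len(r) and r[i] not in (uncovered, "-", "N")
        )
        ranked = column.most_common(2)
        if len(ranked) == 2 and ranked[1][1] >= min_minor_count:
            hits.append((i, ranked[0][0], ranked[1][0]))
    return hits

print(candidate_snps(["ACGTA", "ACGTA", "ACTTA", "ACTTA", "ACGTC", "ACGTA"]))
```

The third column is reported because `T` occurs twice next to four `G`, whereas the single `C` in the last column is treated as a probable error.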

Genome Annotation / Ensembl (http://www.ensembl.org)

Analysis of gene expression: tissue specificity

Counting the frequency of ESTs derived from a specific tissue within one sequence cluster

Searching for clusters/contigs which are tissue specific (e.g. tumor)

Searching for alternative splice variants which are potentially tissue specific
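
The counting step is straightforward once every EST in a cluster is annotated with its source tissue. The following sketch computes the tissue fractions of one cluster and flags a cluster as tissue specific when one tissue dominates. The thresholds `min_fraction` and `min_ests` and the function names are illustrative assumptions; a careful analysis would also correct for the very different sizes of the underlying cDNA libraries.

```python
from collections import Counter

def tissue_fractions(cluster_ests):
    """Fraction of ESTs per tissue; cluster_ests is a list of (est_id, tissue)."""
    counts = Counter(tissue for _est_id, tissue in cluster_ests)
    total = sum(counts.values())
    return {tissue: n / total for tissue, n in counts.items()}

def dominant_tissue(cluster_ests, min_fraction=0.8, min_ests=5):
    """Return the tissue contributing at least min_fraction of the ESTs, else None."""
    if len(cluster_ests) < min_ests:
        return None
    fractions = tissue_fractions(cluster_ests)
    tissue, fraction = max(fractions.items(), key=lambda item: item[1])
    return tissue if fraction >= min_fraction else None

ests = [("e1", "kidney"), ("e2", "kidney"), ("e3", "kidney"),
        ("e4", "kidney"), ("e5", "liver tumor")]
print(tissue_fractions(ests), dominant_tissue(ests))
```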

Analysis of gene expression: PDZ-domain containing protein PDZK1 (Hs.15456)

liver tumor

kidney

Analysis of gene expression: small muscular protein, SMPX (Hs.88492)

heart

muscle

Analysis of gene expression: hypothetical protein (Hs.32343)

thyroid tumor

heart

ovary

Analysis of gene expression: non-redundant gene set

Selection of 'optimal' clones

Generation of gene-specific PCR-products

Analysis of gene expression: 'optimal clones'

clone availability

type of clone library

length of the clone

relative position to the consensus sequence

homology to other genes

existence of repetitive elements

Analysis of gene expression: gene-specific PCR products

[Figure: putative gene consensus sequence with exon A, exon B, and exon C; one region is marked as repetitive sequence and one as similar to another gene; the remaining regions are potential gene-specific fragments]

Analysis of gene expression: optimal gene-specific PCR product

minimal similarity to other genes

minimal content of repetitive sequences

not spanning over several exons

roughly constant length of PCR products across different genes
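
These criteria can be combined into a simple window search on the consensus sequence. The sketch below lists all windows of a fixed length that lie within a single exon and avoid excluded regions, such as repeats or stretches similar to other genes. It assumes exon boundaries and excluded regions are already known as 0-based, end-exclusive intervals on the consensus; the function name is hypothetical and primer design itself is not covered.

```python
def gene_specific_windows(exons, excluded, length):
    """List windows of the given length suitable as gene-specific fragments.

    exons, excluded: 0-based, end-exclusive (start, end) intervals on the
    consensus. A window must lie inside one exon and must not overlap
    any excluded interval.
    """
    windows = []
    for exon_start, exon_end in exons:
        for start in range(exon_start, exon_end - length + 1):
            end = start + length
            if all(end <= x_start or start >= x_end
                   for x_start, x_end in excluded):
                windows.append((start, end))
    return windows

windows = gene_specific_windows(
    exons=[(0, 100), (100, 250), (250, 300)],
    excluded=[(120, 160)],
    length=80,
)
print(len(windows), windows[0], windows[-1])
```

Using the same window length for every gene addresses the last criterion, the roughly constant product length.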