Bioinformatics and Computer Science, Ina Koch, TFH Berlin, Master's course Bioinformatics, http://www.tfh-berlin.de/bi/, Cottbus, 8th of October 2004


Bioinformatics and Computer Science

Ina Koch

TFH Berlin, Master's course Bioinformatics

http://www.tfh-berlin.de/bi/

Cottbus, 8th of October 2004


Outline

Introduction

SNP analysis in the human genome

Dynamic programming as basis for sequence comparison

Summary and outlook


Bioinformatics - Computational Biology

Data collection and storage - database techniques, integrative databases

Data visualisation - computer graphics, molecule graphics

MicroArray analysis – pattern recognition, statistics

Data analysis

sequence - string algorithms, dynamic programming

structure - graph theory, AI, knowledge acquisition

networks - graph theory, Petri nets, computer algebra

Drug Design, Molecular Modelling - parallel algorithms


Outline

Introduction

SNP analysis in the human genome

Dynamic programming as basis for sequence comparison

Summary and outlook


SNP analysis in the human genome

The average human being exhibits ~100 new mutations.

The mutation of one nucleotide (point mutation) in the genome is called a Single Nucleotide Polymorphism (SNP) if it occurs with a frequency of more than 1% in a population.

non-synonymous: changes the encoded amino acid, e.g. TTT (Phe) -> TTA (Leu)

synonymous: codes for the same amino acid, e.g. TTT (Phe) -> TTC (Phe)
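The distinction between synonymous and non-synonymous changes can be sketched in a few lines of Python. This is only an illustration: the codon table below contains just the three codons from the slide (taken from the standard genetic code), and the function name `classify_snp` is a hypothetical helper, not part of any tool mentioned in the talk.

```python
# Minimal codon table: only the three codons used on the slide
# (values follow the standard genetic code).
CODON_TO_AA = {
    "TTT": "Phe",
    "TTC": "Phe",
    "TTA": "Leu",
}


def classify_snp(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base codon change as synonymous or non-synonymous."""
    if len(ref_codon) != 3 or len(alt_codon) != 3:
        raise ValueError("codons must have length 3")
    differences = sum(1 for r, a in zip(ref_codon, alt_codon) if r != a)
    if differences != 1:
        raise ValueError("expected codons differing at exactly one position")
    same_amino_acid = CODON_TO_AA[ref_codon] == CODON_TO_AA[alt_codon]
    return "synonymous" if same_amino_acid else "non-synonymous"


print(classify_snp("TTT", "TTA"))  # Phe -> Leu
print(classify_snp("TTT", "TTC"))  # Phe -> Phe
```

A real analysis would use the full 64-codon table and would also need the reading frame of the affected gene.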


SNPs - some numbers

Two individuals: several million nucleotide differences, ~100,000 amino acid differences

Within a population: 1/300 bp differences

About half of the SNPs in coding regions are non-synonymous.

Between two copies of the same chromosome: 1/1000 bp differences (nucleotide diversity)

Most frequent type: transition C <-> T (G <-> A), 2/3 of all SNPs
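As a small illustration of the last point, a substitution can be classified as a transition (purine to purine or pyrimidine to pyrimidine, such as C/T and G/A) or a transversion (purine to pyrimidine or the reverse). The helper name `substitution_type` is an assumption made for this sketch.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-nucleotide change."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # A transition stays within the same chemical class of base.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


print(substitution_type("C", "T"))  # transition
print(substitution_type("A", "C"))  # transversion
```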


Why are SNPs interesting?

Medical questions

CD - CV hypothesis (Common Disease - Common Variant)

Example: ApoE*E4 allele in Alzheimer's disease

• How many SNPs are associated with diseases?
• How can we identify these SNPs?

• How many non-synonymous SNPs damage the structure or function of the protein?
• How can we identify these SNPs?


A disease-causing SNP

The human hemochromatosis protein (1A6Z)

Frequency: ~6% in northern Europe, ~14% in Ireland


Search for SNPs in databases

Databases: SWISS-PROT, OMIM, HGBASE, dbSNP, HSSP

Filter

1. Keywords: '3D STRUCTURE' and 'DISEASE MUTATION'
2. Keywords: '3D STRUCTURE' and 'POLYMORPHISM', but not 'DISEASE MUTATION'; allelic variants with >1% frequency in 'normal' humans
3. BLASTX search against HSSP; search for close homologues (>95% similarity) in other species for all proteins and mutations selected so far

Results

1 551 disease-causing mutations
459 allelic variants
440 neutral mutations between species


Prediction of function-damaging effect

Active sites, binding sites

Analysis of the multiple alignment

Disulfide bridges

Hydrophobicity in the protein core

Solvent accessibility

Interactions with hetero atoms


Prediction rules

The amino acid variant is function-damaging if

1. it is located at a site annotated in SWISS-PROT as ACTIVE_SITE, BINDING_SITE, SITE, MOD_RES, DISULFID, or METAL, or
2. it is not compatible with the amino acid substitutions at the same position of homologous proteins, or
3. it is located inside the protein core and causes a change in the electrostatic potential, or
4. it is located at the protein surface and changes the surface accessibility of the protein, or
5. it concerns a proline residue in a helix, or
6. its minimal distance to hetero atoms (except water) is < 6 Å.
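The six rules form a simple rule-based system: a variant is called function-damaging as soon as any one rule fires. The sketch below shows that structure only. The `VariantContext` class and all of its field names are hypothetical, and it assumes that the structural and evolutionary properties (core or surface location, electrostatic change, compatibility with homologues, and so on) have already been computed by other tools.

```python
from dataclasses import dataclass
from typing import List, Optional

# SWISS-PROT feature keys named in rule 1.
ANNOTATED_FEATURES = {"ACTIVE_SITE", "BINDING_SITE", "SITE",
                      "MOD_RES", "DISULFID", "METAL"}
HETERO_ATOM_CUTOFF = 6.0  # Angstrom, from rule 6


@dataclass
class VariantContext:
    """Pre-computed properties of one amino acid variant (hypothetical fields)."""
    swissprot_features: frozenset = frozenset()      # feature keys at the position
    compatible_with_homologues: bool = True          # rule 2
    in_core: bool = False                            # rule 3
    changes_electrostatics: bool = False             # rule 3
    at_surface: bool = False                         # rule 4
    changes_accessibility: bool = False              # rule 4
    proline_in_helix: bool = False                   # rule 5
    min_hetero_distance: Optional[float] = None      # rule 6, water excluded


def fired_rules(v: VariantContext) -> List[int]:
    """Return the numbers of all rules that apply to the variant."""
    rules = []
    if v.swissprot_features & ANNOTATED_FEATURES:
        rules.append(1)
    if not v.compatible_with_homologues:
        rules.append(2)
    if v.in_core and v.changes_electrostatics:
        rules.append(3)
    if v.at_surface and v.changes_accessibility:
        rules.append(4)
    if v.proline_in_helix:
        rules.append(5)
    if v.min_hetero_distance is not None and v.min_hetero_distance < HETERO_ATOM_CUTOFF:
        rules.append(6)
    return rules


def is_function_damaging(v: VariantContext) -> bool:
    """The rules are combined with 'or': one fired rule is enough."""
    return bool(fired_rules(v))


example = VariantContext(in_core=True, changes_electrostatics=True,
                         min_hetero_distance=4.5)
print(fired_rules(example), is_function_damaging(example))
```

Returning the list of fired rules, rather than only a yes/no answer, makes it easier to see which criterion was responsible for a prediction.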


Results

Control predictions on proteins with known function-damaging mutations

Mutations                                             Total   Predicted (absolute)   Predicted (percent)
Disease-causing mutations                                60                     54                    90
Artificially generated function-damaging mutations       54                     43                    80


Results

Polymorphisms                          Total   Predicted as function-damaging (absolute)   (percent)
All polymorphisms                        459                                         156          34
Experimentally proven polymorphisms      245                                          79          32
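The percentages in the two result tables follow directly from the absolute counts. A short check, with the row labels paraphrased from the slides and the helper `percent` introduced only for this sketch:

```python
# (label, total, predicted as function-damaging), copied from the result tables
ROWS = [
    ("Disease-causing mutations", 60, 54),
    ("Artificially generated function-damaging mutations", 54, 43),
    ("All polymorphisms", 459, 156),
    ("Experimentally proven polymorphisms", 245, 79),
]


def percent(total: int, predicted: int) -> int:
    """Share of predicted variants, rounded to a whole percent."""
    return round(100 * predicted / total)


for label, total, predicted in ROWS:
    print(f"{label}: {predicted}/{total} = {percent(total, predicted)}%")
```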


False-negative predictions

isoleucine


False-negative predictions

serine


False-negative predictions

K A L G I S P F H E   Homo sapiens
K S L G I S P F H E   Ovis aries
K G L G L S P F H E   Gallus gallus
K T F G I S P F H E   Sminthopsis macroura
K A L G V S P F H E   Petaurus breviceps
K K L G L T P F H E   Rana catesbeiana
T N Q G S T P F H E   Sparus aurata
K K Q N L E S F F P   Escherichia coli
E S K J L D T F F P   Salmonella dublin
K A K N V E S F Y P   Caenorhabditis elegans

Part of the multiple alignment of the human transthyretin
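Rule 2 of the prediction rules compares a variant with the residues seen at the same alignment column in homologous proteins. A deliberately simplified reading of that rule, treating a residue as compatible only if it occurs somewhere in the column, can be sketched on the alignment fragment above. The function names are hypothetical, the rows are copied from the slide in order (human sequence first), and the actual method may well use a softer, profile-based criterion.

```python
# Alignment rows copied from the slide, top to bottom (first row: Homo sapiens).
ALIGNMENT = [
    "KALGISPFHE",
    "KSLGISPFHE",
    "KGLGLSPFHE",
    "KTFGISPFHE",
    "KALGVSPFHE",
    "KKLGLTPFHE",
    "TNQGSTPFHE",
    "KKQNLESFFP",
    "ESKJLDTFFP",
    "KAKNVESFYP",
]


def observed_residues(alignment, column):
    """Set of residues found in one alignment column (0-based index)."""
    return {row[column] for row in alignment}


def is_compatible(alignment, column, residue):
    """Simplified rule 2: the residue has been observed at this position."""
    return residue in observed_residues(alignment, column)


print(sorted(observed_residues(ALIGNMENT, 4)))  # variable column
print(sorted(observed_residues(ALIGNMENT, 7)))  # fully conserved column
print(is_compatible(ALIGNMENT, 4, "V"), is_compatible(ALIGNMENT, 4, "P"))
```

A variable column tolerates many substitutions under this criterion, which is one way a damaging variant can escape rule 2 and end up as a false negative.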


Outline

Introduction

SNP analysis in the human genome

Dynamic programming as basis for sequence comparison

Summary and outlook


Sequence Alignment

Search for evolutionary or functional similarity

Input: two nucleotide or amino acid sequences

Desired output: biologically meaningful similarity

Scoring of an alignment: sum of the scores of all aligned pairs, minus the gap penalties

Score for amino acid pairs: substitution matrices (PAM, BLOSUM)

Difficulty: choosing the gap penalties

Search for the optimal global alignment
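The scoring scheme can be made concrete with a short sketch that scores a given gapped alignment. It assumes a linear gap penalty of 8 per gap position and uses only the few BLOSUM50 pair scores needed for the worked example later in the talk (HEAGAWGHEE against PAWHEAE). The names `PAIR_SCORES` and `alignment_score` are introduced here for illustration.

```python
# BLOSUM50 scores for the residue pairs that occur in the example alignment.
PAIR_SCORES = {
    ("A", "P"): -1,
    ("A", "A"): 5,
    ("W", "W"): 15,
    ("H", "H"): 10,
    ("E", "E"): 6,
}
GAP_PENALTY = 8  # assumed linear penalty per gap position


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(top: str, bottom: str, d: int = GAP_PENALTY) -> int:
    """Sum of pair scores minus d for every position aligned to a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a == "-" or b == "-":
            total -= d
        else:
            total += pair_score(a, b)
    return total


print(alignment_score("HEAGAWGHE-E", "--P-AW-HEAE"))  # 1
```

The five gap positions cost 40 in total, which the matched pairs (A/P, A/A, W/W, H/H and two E/E) just about compensate.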


Sequence Alignment

Human alpha globin and human beta globin: true

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
           G+ +VK+HGKKV  A+++++AH+D++      LS+LH  KL
HBB_HUMAN  GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

Human alpha globin and leghaemoglobin from yellow lupin: true

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
           ++ ++++H+ KV   + +A  ++          +L+ L+++H+ K
LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

Human alpha globin and glutathione S-transferase: false

HBA_HUMAN  GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
           GS+ + G +   +D L  ++ H+ D+  A +AL D    ++AH+
F11G11.2   GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE


Dynamic Programming

Application to optimisation problems

Development of a dynamic programming algorithm

(1) characterise the structure of an optimal solution
(2) recursively define the value of an optimal solution
(3) compute the value of an optimal solution in a bottom-up fashion
(4) construct an optimal solution from computed information


Dynamic Programming - Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -1   -1   -2   -1   -4   -2   -2   -1   -1
A  -16   -2   -1    5    0    5   -3    0   -2   -1   -1
W  -24   -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H  -32   10    0   -2   -2   -2   -3   -2   10    0    0
E  -40    0    6   -1   -3   -1   -3   -3    0    6    6
A  -48   -2   -1    5    0    5   -3    0   -2   -1   -1
E  -56    0    6   -1   -3   -1   -3   -3    0    6    6

Initialisation with BLOSUM50


Dynamic Programming - Example

(1) Optimal solution: the alignment with the highest score

(2) Recursive solution: there are three ways an alignment can be extended up to (i, j):

(a) x_i aligned to y_j
(b) x_i aligned to a gap
(c) y_j aligned to a gap

$$
M(i, j) = \max
\begin{cases}
M(i-1, j-1) + s(x_i, y_j) \\
M(i-1, j) - d \\
M(i, j-1) - d
\end{cases}
$$

d: gap penalty, s(x_i, y_j): score of the pair (x_i, y_j)
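The recurrence translates almost directly into a bottom-up matrix fill. The following sketch assumes a linear gap penalty d = 8 (the value implied by the first row and column of the example matrix) and the BLOSUM50 scores listed in the initialisation table. The matrix is called `M` and indexed as (row, column) so that it matches the slide captions; `fill_matrix` and `BLOSUM50_SUBSET` are names chosen for this illustration.

```python
# BLOSUM50 scores needed for the example: BLOSUM50_SUBSET[row residue][column residue]
BLOSUM50_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def fill_matrix(row_seq: str, col_seq: str, d: int = 8):
    """Return M, where M[r][c] is the best score for row_seq[:r] vs col_seq[:c]."""
    n, m = len(row_seq), len(col_seq)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary: a prefix aligned entirely to gaps costs d per residue.
    for r in range(1, n + 1):
        M[r][0] = -d * r
    for c in range(1, m + 1):
        M[0][c] = -d * c
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            s = BLOSUM50_SUBSET[row_seq[r - 1]][col_seq[c - 1]]
            M[r][c] = max(
                M[r - 1][c - 1] + s,  # both residues aligned
                M[r - 1][c] - d,      # row residue aligned to a gap
                M[r][c - 1] - d,      # column residue aligned to a gap
            )
    return M


M = fill_matrix("PAWHEAE", "HEAGAWGHEE")
print(M[1][1], M[1][2], M[1][3], M[1][4])  # -2 -9 -17 -25
print(M[7][10])                            # 1, the optimal global score
```

The first printed line reproduces the four cells computed step by step on the following slides; the last cell of the matrix holds the score of the optimal global alignment.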


Dynamic Programming - Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -1   -1   -2   -1   -4   -2   -2   -1   -1
A  -16   -2   -1    5    0    5   -3    0   -2   -1   -1
W  -24   -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H  -32   10    0   -2   -2   -2   -3   -2   10    0    0
E  -40    0    6   -1   -3   -1   -3   -3    0    6    6
A  -48   -2   -1    5    0    5   -3    0   -2   -1   -1
E  -56    0    6   -1   -3   -1   -3   -3    0    6    6

Computation of M(1, 1)


Dynamic Programming - Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9   -1   -2   -1   -4   -2   -2   -1   -1
A  -16   -2   -1    5    0    5   -3    0   -2   -1   -1
W  -24   -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H  -32   10    0   -2   -2   -2   -3   -2   10    0    0
E  -40    0    6   -1   -3   -1   -3   -3    0    6    6
A  -48   -2   -1    5    0    5   -3    0   -2   -1   -1
E  -56    0    6   -1   -3   -1   -3   -3    0    6    6

Computation of M(1, 2)


Dynamic Programming - Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17   -2   -1   -4   -2   -2   -1   -1
A  -16   -2   -1    5    0    5   -3    0   -2   -1   -1
W  -24   -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H  -32   10    0   -2   -2   -2   -3   -2   10    0    0
E  -40    0    6   -1   -3   -1   -3   -3    0    6    6
A  -48   -2   -1    5    0    5   -3    0   -2   -1   -1
E  -56    0    6   -1   -3   -1   -3   -3    0    6    6

Computation of M(1, 3)


Dynamic Programming - Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25   -1   -4   -2   -2   -1   -1
A  -16   -2   -1    5    0    5   -3    0   -2   -1   -1
W  -24   -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H  -32   10    0   -2   -2   -2   -3   -2   10    0    0
E  -40    0    6   -1   -3   -1   -3   -3    0    6    6
A  -48   -2   -1    5    0    5   -3    0   -2   -1   -1
E  -56    0    6   -1   -3   -1   -3   -3    0    6    6

Computation of M(1, 4)


Dynamic Programming - Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

The completely calculated matrix


Dynamic Programming - Example

          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1

Computation of the optimal alignment path

HEAGAWGHE-E
--P-AW-HEAE

The optimal alignment
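The alignment path is recovered by walking back from the bottom-right cell and, at each cell, moving to a predecessor that explains its value. The sketch below is one possible implementation under stated assumptions: a linear gap penalty d = 8, the BLOSUM50 scores from the example, and a tie-breaking order (diagonal first, then up, then left) chosen here because it reproduces the alignment shown on the slide. Other tie-breaking orders can give different alignments with the same score. The function name `global_align` is hypothetical.

```python
# BLOSUM50 scores needed for the example: S[row residue][column residue]
S = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def global_align(col_seq: str, row_seq: str, d: int = 8):
    """Global alignment with linear gap penalty; returns (top, bottom, score)."""
    n, m = len(row_seq), len(col_seq)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for r in range(1, n + 1):
        M[r][0] = -d * r
    for c in range(1, m + 1):
        M[0][c] = -d * c
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            M[r][c] = max(
                M[r - 1][c - 1] + S[row_seq[r - 1]][col_seq[c - 1]],
                M[r - 1][c] - d,
                M[r][c - 1] - d,
            )

    # Traceback from the bottom-right corner to the origin.
    top, bottom = [], []
    r, c = n, m
    while r > 0 or c > 0:
        if (r > 0 and c > 0
                and M[r][c] == M[r - 1][c - 1] + S[row_seq[r - 1]][col_seq[c - 1]]):
            top.append(col_seq[c - 1])      # diagonal: residues aligned
            bottom.append(row_seq[r - 1])
            r, c = r - 1, c - 1
        elif r > 0 and M[r][c] == M[r - 1][c] - d:
            top.append("-")                 # up: gap in the column sequence
            bottom.append(row_seq[r - 1])
            r -= 1
        else:
            top.append(col_seq[c - 1])      # left: gap in the row sequence
            bottom.append("-")
            c -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), M[n][m]


top, bottom, score = global_align("HEAGAWGHEE", "PAWHEAE")
print(top)
print(bottom)
print(score)
```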


Dynamic Programming - Example

Needleman/Wunsch algorithm for global alignment, O(n^3)
Needleman & Wunsch (1970) J. Mol. Biol. 48:443-453.

Gotoh algorithm for global alignment, O(n^2)
Gotoh (1982) J. Mol. Biol. 162:705-708.

Smith/Waterman algorithm for local alignment, O(n^2)
Smith & Waterman (1981) J. Mol. Biol. 147:195-197.
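For comparison, local alignment differs from the global recurrence in two places: every cell may also take the value 0 (start a new alignment), and the result is read from the highest-scoring cell rather than from the bottom-right corner. The sketch below shows this with a linear gap penalty d = 8 and the same BLOSUM50 scores as in the global example; both choices, and the name `local_score`, are assumptions made for illustration. The function reports only the best score and its position, not the traceback.

```python
# BLOSUM50 scores needed for the example: S[row residue][column residue]
S = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def local_score(col_seq: str, row_seq: str, d: int = 8):
    """Return (best local score, (row, column) of the cell where it occurs)."""
    n, m = len(row_seq), len(col_seq)
    # First row and column stay 0: a local alignment may start anywhere.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            M[r][c] = max(
                0,                                                    # start afresh
                M[r - 1][c - 1] + S[row_seq[r - 1]][col_seq[c - 1]],  # aligned pair
                M[r - 1][c] - d,                                      # gap
                M[r][c - 1] - d,                                      # gap
            )
            if M[r][c] > best:
                best, best_cell = M[r][c], (r, c)
    return best, best_cell


print(local_score("HEAGAWGHEE", "PAWHEAE"))  # (28, (5, 9))
```

For these two sequences the best local alignment is AWGHE against AW-HE (5 + 15 - 8 + 10 + 6 = 28), which ends at row 5, column 9.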


Summary

SNP analysis as typical example for bioinformatics
Sunyaev, Ramensky, Lathe III, Kondrashov, Bork (2001) Human Molecular Genetics 10:591-597.

database parsing
multiple sequence alignment
rule-based system
molecular modelling

Application of dynamic programming to sequence alignment

Gotoh algorithm for pair-wise global sequence alignment


Outlook

Application of graph theory to protein structure analysis

PTGL Protein Topology Graph Library http://sanaga.tfh-berlin.de/~ptgl/ptgl.html

May, Barthel, Koch (2004) Bioinformatics, in press.
Koch (2001) Theoretical Computer Science 250:1-30.

Investigations of Alternative Splicing

Boué, Vingron, Koch (2002) Bioinformatics, suppl. 2, 18:S65-S75.
Kriventseva, Koch, Apweiler, Vingron, Bork, Gelfand, Sunyaev (2003) Trends in Genetics 19:124-128.


Outlook

Modelling, analysis, and simulation of biological molecular networks using Petri net theory, in co-operation with BTU Cottbus (Prof. M. Heiner)

Voss, Heiner, Koch (2003) In Silico Biology 3:0031.
Heiner, Koch, Will (2004) BioSystems, Special Issue 75(1-3):15-28.
Heiner & Koch (2004) Proc. 25th ICAPTN, LNCS 3099:216-237.
Koch, Junker, Heiner (2004) Bioinformatics, in press.

Ongoing projects:

1. Human glycolysis with coloured Petri nets (Thomas Runge)
2. Metabolism in the human liver cell (Daniel Schrödter)
3. G1/S phase in the mammalian cell cycle (Dr. Thomas Kaunath)
4. Duchenne muscular dystrophy (Stepfanie Grunwald)


Thank you!