cap5510 – bioinformatics fall 2013


Post on 15-Jan-2016


Page 1: CAP5510 – Bioinformatics Fall 2013

1

CAP5510 – Bioinformatics, Fall 2013

Tamer Kahveci

CISE Department

University of Florida

Page 2: CAP5510 – Bioinformatics Fall 2013

2

Vital Information

• Instructor: Tamer Kahveci
• Office: E566
• Time: Mon/Wed/Thu 3:00 - 3:50 PM
• Office hours: Mon/Wed 2:00 - 2:50 PM
• TA: Gokhan Kaya
– Office hrs:
– Location
• Course page:
– http://www.cise.ufl.edu/~tamer/teaching/fall2013

Page 3: CAP5510 – Bioinformatics Fall 2013

3

Goals

• Understand the major components of bioinformatics data and how computer technology is used to understand this data better.

• Learn the main research problems in bioinformatics and gain background information on them.

Page 4: CAP5510 – Bioinformatics Fall 2013

4

This Course will

• Give you a feeling for the main issues in molecular biological computing: sequence, structure and function.

• Give you exposure to classic biological problems, as represented computationally.

• Encourage you to explore research problems and make a contribution.

Page 5: CAP5510 – Bioinformatics Fall 2013

5

This Course will not

• Teach you biology.

• Teach you programming.

• Teach you how to be an expert user of off-the-shelf molecular biology computer packages.

• Force you to make a novel contribution to bioinformatics.

Page 6: CAP5510 – Bioinformatics Fall 2013

6

Course Outline

• Introduction to terminology
• Biological sequences
• Sequence comparison
– Lossless alignment (DP)
– Lossy alignments (BLAST, etc)
• Protein structures and their prediction
• Sequence assembly
• Substitution matrices, statistics
• Multiple sequence alignment
• Phylogeny
• Biological networks

Page 7: CAP5510 – Bioinformatics Fall 2013

7

Grading

1. Project (50%)
– Contribution (2.5% bonus)

2. Other (50%)
– Non-EDGE: Homeworks + quizzes
– EDGE: Homeworks + 3 surveys

• Attendance (2.5% bonus)

How can I get an A?

Page 8: CAP5510 – Bioinformatics Fall 2013

8

Expectations

• Require
– Data structures and algorithms.
– Coding (C, Java)

• Encourage
– actively participate in discussions in the classroom
– read bioinformatics literature in general
– attend colloquiums on campus

• Academic honesty

Page 9: CAP5510 – Bioinformatics Fall 2013

9

Text Book

• Not required, but recommended.
• Class notes + papers.

Page 10: CAP5510 – Bioinformatics Fall 2013

10

Where to Look ?

• Journals
– Bioinformatics
– Genome Research
– PLOS Computational Biology
– Journal of Computational Biology
– IEEE/ACM Transactions on Computational Biology and Bioinformatics

• Conferences
– RECOMB
– ISMB
– ECCB
– PSB
– BCB

Page 11: CAP5510 – Bioinformatics Fall 2013

11

What is Bioinformatics?

• Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. There are three important sub-disciplines within bioinformatics:
– the development of new algorithms and statistics with which to assess relationships among members of large data sets
– the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures
– the development and implementation of tools that enable efficient access and management of different types of information.

From NCBI (National Center for Biotechnology Information)
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/milestones.html

Page 12: CAP5510 – Bioinformatics Fall 2013

12

Does biology have anything to do with computer science?

Page 13: CAP5510 – Bioinformatics Fall 2013

13

Challenges 1/5

• Data diversity
– DNA (ATCCAGAGCAG)
– Protein sequences (MHPKVDALLSR)
– Protein structures
– Microarrays
– Pathways
– Bio-images
– Time series

Page 14: CAP5510 – Bioinformatics Fall 2013

14

Challenges 2/5

• Database size
– GenBank: As of August 2013, there are over 154B + 500B bases.
– More than 500K protein sequences and more than 190M amino acids as of July 2012.
– More than 83K protein structures in PDB as of August 2012.

"Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading."

-- G. A. Petsko, Nature 401: 115-116 (1999)

Page 15: CAP5510 – Bioinformatics Fall 2013

15

Challenges 3/5

• Moore's Law Matched by Growth of Data
• CPU vs Disk
– As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial

[Figure: number of protein domain structures in PDB (1980-1995) plotted alongside CPU instruction time in ns (1979-1995)]

Page 16: CAP5510 – Bioinformatics Fall 2013

16

Challenges 4/5

• Deciphering the code
– Within same data type: hard
– Across data types: harder

caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgact

Page 17: CAP5510 – Bioinformatics Fall 2013

17

Challenges 5/5

• Inaccuracy

• Redundancy

Page 18: CAP5510 – Bioinformatics Fall 2013

18

What is the Real Solution?

We need better computational methods

• Compact summarization
• Fast and accurate analysis of data
• Efficient indexing

Page 19: CAP5510 – Bioinformatics Fall 2013

19

A Gentle Introduction to Molecular Biology

Page 20: CAP5510 – Bioinformatics Fall 2013

20

Goals

• Understand major components of biological data
– DNA, protein sequences, expression arrays, protein structures

• Get familiar with basic terminology

• Learn commonly used data formats

Page 21: CAP5510 – Bioinformatics Fall 2013

21

Genetic Material: DNA

• Deoxyribonucleic Acid, 1950s
– Basis of inheritance
– Eye color, hair color, …

• 4 nucleotides – A, C, G, T

Page 22: CAP5510 – Bioinformatics Fall 2013

22

Chemical Structure of Nucleotides

Purines

Pyrimidines

Page 23: CAP5510 – Bioinformatics Fall 2013

23

Making of Long Chains

5’ -> 3’

Page 25: CAP5510 – Bioinformatics Fall 2013

25

Base Pairs

Page 26: CAP5510 – Bioinformatics Fall 2013

26

Question

• 5’ - GTTACA – 3’

• 5’ – XXXXXX – 3’ ?

• 5’ – TGTAAC – 3’

• Reverse complements.
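
The answer to the question on this slide can be checked with a few lines of code. The following is a minimal sketch, assuming the input uses only the four upper- or lower-case DNA letters; the function name `reverse_complement` is chosen here for illustration and is not from the course material.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA strand written 5' -> 3'.

    The complementary strand runs in the opposite direction, so we read
    the input backwards and swap each base for its partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("GTTACA"))  # TGTAAC
```

Applying the function twice returns the original strand, which is a quick sanity check for any implementation.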

Page 27: CAP5510 – Bioinformatics Fall 2013

27

Repetitive DNA

• Tandem repeats: highly repetitive
– Satellites (100 k – 1 Gbp) / (a few hundred bp)
– Mini satellites (1 k – 20 kbp) / (9 – 80 bp)
– Micro satellites (< 150 bp) / (1 – 6 bp)
– DNA fingerprinting

• Interspersed repeats: moderately repetitive
– LINE
– SINE

• Proteins contain repetitive patterns too
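
As a rough illustration of what a tandem repeat looks like in sequence data, the sketch below scans a DNA string for micro satellite candidates: a unit of 1 to 6 bases repeated back to back. The function name `find_microsatellites` and the `min_copies` threshold are assumptions made for this example; real repeat finders also tolerate mismatches, which this sketch does not.

```python
import re


def find_microsatellites(seq, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats of a 1-6 bp unit.

    The regular expression captures the shortest unit that works (lazy
    quantifier) and then requires at least min_copies - 1 further copies
    of that same unit immediately after it.
    """
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Hypothetical input: the dinucleotide CA repeated five times.
print(find_microsatellites("GGCACACACACATT"))  # [(2, 'CA', 5)]
```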

Page 28: CAP5510 – Bioinformatics Fall 2013

28

Genetic Material: an Analogy

• Nucleotide => letter
• Gene => sentence
• Contig => chapter
• Chromosome => book
– Traits: Gender, hair/eye color, …
– Disorders: Down syndrome, Turner syndrome, …
– Chromosome number varies across species
– We have 46 (23 + 23) chromosomes
• Complete genome => volumes of encyclopedia
• The Hershey & Chase experiment showed that DNA is the genetic material. (ch14)

Page 29: CAP5510 – Bioinformatics Fall 2013

29

Functions of Genes 1/2

• Signal transduction: sensing a physical signal and turning it into a chemical signal

• Enzymatic catalysis: accelerating chemical transformations otherwise too slow.

• Transport: getting things into and out of separated compartments
– Animation (ch 5.2)

Page 30: CAP5510 – Bioinformatics Fall 2013

30

Functions of Genes 2/2

• Movement: contracting in order to pull things together or push things apart.

• Transcription control: deciding when other genes should be turned ON/OFF
– Animation (ch7)

• Structural support: creating the shape and pliability of a cell or set of cells

Page 31: CAP5510 – Bioinformatics Fall 2013

31

Central Dogma

Page 32: CAP5510 – Bioinformatics Fall 2013

32

Introns and Exons 1/2

Page 33: CAP5510 – Bioinformatics Fall 2013

33

Introns and Exons 2/2

• Humans have about 25,000 genes, covering about 40,000,000 DNA bases, which is less than 3% of the total DNA in the genome.

• The remaining 2,960,000,000 bases carry control information (e.g. when, where, how long, etc.).

Page 34: CAP5510 – Bioinformatics Fall 2013

34

Protein (Phenotype)

DNA (Genotype)

Gene expression

Page 35: CAP5510 – Bioinformatics Fall 2013

35

Gene Expression

• Building proteins from DNA
– Promoter sequence: marks the start of a gene (13 nucleotides).

• Positive regulation: proteins that bind to DNA near promoter sequences increase transcription.

• Negative regulation

Page 36: CAP5510 – Bioinformatics Fall 2013

36

Microarray

Animation on creating microarrays

Page 37: CAP5510 – Bioinformatics Fall 2013

37

Amino Acids

• 20 different amino acids
– ACDEFGHIKLMNPQRSTVWY but not BJOUXZ

• ~300 amino acids in an average protein, hundreds of thousands of known protein sequences

• How many nucleotides are needed to encode one amino acid?
– 4² < 20 < 4³
– E.g., Q (glutamine) = CAG
– degeneracy
– Triplet code (codon)
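
The counting argument behind the triplet code can be made concrete: two nucleotides give only 16 combinations, too few for 20 amino acids, while three give 64, more than enough, so several codons map to the same amino acid (degeneracy). The sketch below builds the standard genetic code from its usual compact encoding with bases ordered T, C, A, G; the variable names are chosen for this example, and `*` marks a stop codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 64 DNA triplets to its amino acid (or '*' for stop).
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

print(4 ** 2, 4 ** 3)      # 16 64
print(CODON_TABLE["CAG"])  # Q

# Degeneracy: every codon that encodes glutamine.
glutamine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "Q")
print(glutamine_codons)    # ['CAA', 'CAG']
```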

Page 38: CAP5510 – Bioinformatics Fall 2013

38

Triplet Code

Page 39: CAP5510 – Bioinformatics Fall 2013

39

Molecular Structure of Amino Acid

Side Chain

• Non-polar, Hydrophobic (G, A, V, L, I, M, F, W, P)
• Polar, Hydrophilic (S, T, C, Y, N, Q)
• Electrically charged (D, E, K, R, H)
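
The three side-chain groups on this slide translate directly into a lookup table. The sketch below, with names chosen for illustration, counts how many residues of a protein sequence fall into each group; the example input is the short protein sequence from the "Challenges 1/5" slide.

```python
from collections import Counter

# Side-chain groups exactly as listed on the slide.
GROUPS = {
    "non-polar": "GAVLIMFWP",
    "polar": "STCYNQ",
    "charged": "DEKRH",
}

# Invert the grouping: one-letter amino acid code -> group label.
SIDE_CHAIN_CLASS = {
    residue: label for label, residues in GROUPS.items() for residue in residues
}


def side_chain_composition(protein):
    """Count residues per side-chain group for a one-letter protein sequence."""
    return Counter(SIDE_CHAIN_CLASS[residue] for residue in protein.upper())


composition = side_chain_composition("MHPKVDALLSR")
for label in GROUPS:
    print(label, composition[label])
```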


Page 40: CAP5510 – Bioinformatics Fall 2013

40

Peptide Bonds

Page 41: CAP5510 – Bioinformatics Fall 2013

41

Direction of Protein Sequence

Animation on protein synthesis (ch15)

Page 42: CAP5510 – Bioinformatics Fall 2013

42

Data Format

• GenBank

• EMBL (European Mol. Biol. Lab.)

• SwissProt

• FASTA

• NBRF (Nat. Biomedical Res. Foundation)

• Others
– IG, GCG, Codata, ASN, GDE, Plain ASCII

Page 43: CAP5510 – Bioinformatics Fall 2013

Primary Structure of Proteins

43

>2IC8:A|PDBID|CHAIN|SEQUENCE

ERAGPVTWVMMIACVVVFIAMQILGDQEVMLWLAWPFDPTLKFEFWRYFTHALMHFSLMHILFNLLWWWYLGGAVEKRLGSGKLIVITLISALLSGYVQQKFSGPWFGGLSGVVYALMGYVWLRGERDPQSGIYLQRGLIIFALIWIVAGWFDLFGMSMANGAHIAGLAVGLAMAFVDSLNA
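
The record above is in FASTA format: a header line starting with ">" followed by one or more lines of sequence. A minimal reader might look like the sketch below. The function name `parse_fasta` and the two-record example input are made up for illustration; only the "|"-separated header at the end comes from the slide.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.

    A record starts at a line beginning with '>' and its sequence is the
    concatenation of all following lines up to the next header.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical input: headers are invented, sequences are the short DNA and
# protein examples from the "Challenges 1/5" slide.
example = """>seq1 example DNA
ATCCAGAGCAG
>seq2 example protein
MHPKV
DALLSR
"""

records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence))

# The header shown on the slide separates its fields with '|'.
fields = "2IC8:A|PDBID|CHAIN|SEQUENCE".split("|")
print(fields[0])  # 2IC8:A
```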

Page 44: CAP5510 – Bioinformatics Fall 2013

44

Secondary Structure: Alpha Helix

• 1.5 Å translation
• 100 degree rotation
• Phi = -60
• Psi = -60
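
The two per-residue numbers on this slide determine the overall geometry of the helix. The sketch below works out the residues per turn and the rise per turn from them; the constant and function names are chosen for this example, and the helix is assumed to be ideal (perfectly regular).

```python
RISE_PER_RESIDUE = 1.5        # angstroms along the helix axis, from the slide
ROTATION_PER_RESIDUE = 100.0  # degrees around the helix axis, from the slide

# One full turn is 360 degrees.
residues_per_turn = 360.0 / ROTATION_PER_RESIDUE

# Axial distance covered by one full turn, in angstroms.
pitch = residues_per_turn * RISE_PER_RESIDUE


def helix_length(n_residues):
    """Approximate axial length in angstroms of an ideal helix."""
    return n_residues * RISE_PER_RESIDUE


print(round(residues_per_turn, 2), round(pitch, 2))  # 3.6 5.4
print(helix_length(20))  # 30.0
```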

Page 45: CAP5510 – Bioinformatics Fall 2013

45

anti-parallel parallel

Secondary Structure: Beta sheet

Phi = -135

Psi = 135

Page 46: CAP5510 – Bioinformatics Fall 2013

46

Tertiary Structure

phi1

psi1

phi2

2N angles

Page 47: CAP5510 – Bioinformatics Fall 2013

47

Tertiary Structure

• 3-d structure of a polypeptide sequence
– interactions between non-local atoms

[Figure: tertiary structure of myoglobin]

Page 48: CAP5510 – Bioinformatics Fall 2013

48

Ramachandran Plot

Sample pdb entry ( http://www.rcsb.org/pdb/ )

Page 49: CAP5510 – Bioinformatics Fall 2013

49

Quaternary Structure

• Arrangement of protein subunits

[Figures: quaternary structure of Cro; human hemoglobin tetramer]

Page 50: CAP5510 – Bioinformatics Fall 2013

50

Structure Summary

• 3-d structure determined by protein sequence

• Prediction remains a challenge

• Diseases caused by misfolded proteins
– Mad cow disease

• Classification of protein structure

Page 51: CAP5510 – Bioinformatics Fall 2013

Biological networks

• Signal transduction network

• Transcription control network

• Post-transcriptional regulation network

• PPI (protein-protein interaction) network

• Metabolic network

Page 52: CAP5510 – Bioinformatics Fall 2013

Signal transduction

Extracellular molecule

activate

Membrane receptor

Intracellular molecule

alter

Page 53: CAP5510 – Bioinformatics Fall 2013

Transcription control network

Transcription Factor (TF) – some protein

Promoter region of a gene

bind

• Up/down regulates
• TFs are potential drug targets

Page 54: CAP5510 – Bioinformatics Fall 2013

Post transcriptional regulation

RNA-binding protein

RNA

bind

Slow down or accelerate protein translation from RNA

Page 55: CAP5510 – Bioinformatics Fall 2013

PPI (protein-protein interaction)

Creates a protein complex

Page 56: CAP5510 – Bioinformatics Fall 2013

Metabolic interactions

Compound A1

consume

Enzyme(s)

Compound B1

produce

Compound Am

Compound Bn
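
One possible way to hold the diagram above in memory is to store each reaction with the set of compounds it consumes, the enzyme(s) that catalyze it, and the set of compounds it produces. The sketch below is only an illustration: the class name `Reaction`, the helper functions, and the two-reaction example network are assumptions, not something defined in the course material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    name: str
    substrates: frozenset  # compounds consumed (A1 ... Am)
    enzymes: frozenset     # enzyme(s) catalyzing the reaction
    products: frozenset    # compounds produced (B1 ... Bn)


def reactions_consuming(reactions, compound):
    """Names of the reactions that use the compound as a substrate."""
    return [r.name for r in reactions if compound in r.substrates]


def reactions_producing(reactions, compound):
    """Names of the reactions that yield the compound as a product."""
    return [r.name for r in reactions if compound in r.products]


# Hypothetical two-step network: R1 turns A1 and A2 into B1, R2 consumes B1.
network = [
    Reaction("R1", frozenset({"A1", "A2"}), frozenset({"E1"}), frozenset({"B1"})),
    Reaction("R2", frozenset({"B1"}), frozenset({"E2"}), frozenset({"B2", "B3"})),
]

print(reactions_consuming(network, "B1"))  # ['R2']
print(reactions_producing(network, "B1"))  # ['R1']
```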

Page 57: CAP5510 – Bioinformatics Fall 2013

57

Quiz Next Lecture

Exam

Page 58: CAP5510 – Bioinformatics Fall 2013

58

STOP

Next:
• Basic sequence comparison
• Dynamic programming methods
– Global/local alignment
– Gaps