
Biotech 4490 Bioinformatics I Fall 2006 J.C. Salerno


Biological Information


The basics:

• DNA (deoxyribonucleic acid) stores information, and codes for more DNA and for

• RNA (ribonucleic acid), which is the intermediate between long-term storage in the nucleus and

• Proteins, which do most of the work in living cells


Alphabets and translation

• DNA and RNA use four-letter alphabets (ACGT or ACGU); base pairing (A-T and G-C) in the DNA double helix is the key to replication, and base pairing in the DNA-RNA duplex is the key to transcription

• Proteins have a basic 20-letter alphabet corresponding to the amino acids. Since strands of DNA, RNA, and polypeptide are linear, unbranched polymers, they can be treated as character strings.


Alphabets and translation

• Transcription of DNA to RNA is a simple 1:1 read: a strand of DNA produces its RNA complement, one base per base

• Translation of RNA to a protein amino acid sequence is more complex
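The 1:1 read can be sketched in a few lines of Python. This is only an illustration: the names `DNA_TO_RNA` and `transcribe` are hypothetical (the slides define no code), and the sketch ignores strand orientation (the two strands of a duplex are antiparallel), pairing bases position by position.

```python
# Hypothetical helper names; a minimal sketch of base-by-base transcription.
# Pairing rules for a DNA template read into RNA: A->U, C->G, G->C, T->A.
DNA_TO_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}

def transcribe(template: str) -> str:
    """Return the RNA complement of a DNA template strand, position by position."""
    return "".join(DNA_TO_RNA[base] for base in template.upper())

rna = transcribe("TACGGT")
print(rna)  # AUGCCA
```

Because every base maps to exactly one base, the output always has the same length as the input, which is what makes transcription "simple" compared with translation.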


Alphabets and translation

• One base alone could only code for 4 different amino acids (AA)

• Two bases together could code for 4x4=16 different AA: close, but no cigar

• Three bases could code for 64 different AA; we only need 21 for the 20 AA used in proteins and a stop signal
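The counting argument above can be checked directly. The function name `min_codon_length` is a hypothetical helper introduced for illustration; it simply finds the smallest word length whose number of possible words covers the symbols needed.

```python
def min_codon_length(symbols_needed: int, alphabet_size: int = 4) -> int:
    """Smallest word length k such that alphabet_size**k >= symbols_needed."""
    length = 1
    while alphabet_size ** length < symbols_needed:
        length += 1
    return length

# Number of distinct words of length 1, 2, and 3 over a 4-letter alphabet.
for k in (1, 2, 3):
    print(k, 4 ** k)  # 1 4, then 2 16, then 3 64

# 20 amino acids plus one stop signal.
print(min_codon_length(21))  # 3
```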


Alphabets and translation

• In translation, groups of three bases (codons) are translated into amino acids

• Since there are 64 (4x4x4) codons, most AAs have multiple codons (serine has 6!). We say that the genetic code is degenerate. This isn’t a comment on its character.
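A minimal sketch of translation and of the degeneracy count, assuming the standard genetic code written in the conventional TCAG ordering. For convenience it works on DNA letters (the coding strand, T instead of U), marks stop codons with `*` rather than halting at them, and uses hypothetical names (`CODON_TABLE`, `translate`, `degeneracy`).

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate a coding-strand DNA string codon by codon ('*' marks a stop)."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# How many codons map to each amino acid (or to the stop signal).
degeneracy = Counter(CODON_TABLE.values())

print(translate("ATGTCTAGCTGGTAA"))  # MSSW*
print(degeneracy["S"], degeneracy["M"], degeneracy["*"])  # 6 1 3
```

Note that the two serines in the example come from different codons (TCT and AGC), which is the degeneracy in action.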


Alphabets and translation

• One consequence of the degeneracy of the genetic code is that you can translate nucleic acid sequences to AA sequences, but you can’t reverse translate to a unique nucleic acid sequence.
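To see how badly reverse translation fails to be unique, one can count the nucleic acid sequences that encode a given peptide: it is the product of the codon counts of its residues. The sketch below assumes the standard genetic code; `count_back_translations` is a hypothetical name.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
CODONS_PER_AA = Counter(CODON_TABLE.values())

def count_back_translations(protein: str) -> int:
    """Number of distinct coding sequences that translate to this peptide."""
    return prod(CODONS_PER_AA[aa] for aa in protein)

print(count_back_translations("MW"))    # 1  (Met and Trp each have one codon)
print(count_back_translations("MSSW"))  # 36 (1 * 6 * 6 * 1)
```

Only peptides built entirely from methionine and tryptophan reverse translate uniquely; everything else has many candidate coding sequences.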


Information content

• How much information can you put into a character string?

• The computer age has provided the current generation of students with valuable intuition in this area

• If I can put 10,000 songs on one iPod, how many songs can I put on two iPods?


Information content

• In general, we expect the amount of information to increase linearly with the amount of space available to store it: songs with iPods, phone numbers with pages in the phone book, digital photos with memory cards.


Information content

• More precisely, we express information content in terms of bits (or bytes) of information. The information content of a string of binary characters is just the number of characters.

• 1010101 = 7 bits

• 0001000 = 7 bits

• 10 = 2 bits (no shave or haircut)

• This assumes 1 and 0 are equally likely


Information content

• It should be obvious that the information content of a number is independent of how we express it: 999 should have the same significance written in binary as it does in base 10.
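A quick numerical check of this claim, as a sketch: count one bit per binary digit, and log2(10) bits per decimal digit (assuming all digits are equally likely). The variable names are illustrative only.

```python
import math

value = 999

# Binary: 999 is 1111100111, ten digits at 1 bit each.
binary_digits = len(format(value, "b"))

# Decimal: three digits, each worth log2(10) bits if all digits are equally likely.
decimal_digits = len(str(value))
decimal_bits = decimal_digits * math.log2(10)

print(binary_digits)           # 10
print(round(decimal_bits, 2))  # 9.97
```

The two figures agree to within the rounding forced by using a whole number of binary digits.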


Information content

• In general, if the characters in the alphabet are equally probable, we can express the information content of a character string as $N \log_2 M$, where $N$ is the number of characters in the sequence and $M$ is the number of letters in the alphabet. For binary strings there are only two characters, so $N \log_2 M = N$.


Information content

• For nucleic acids, $M = 4$ (ACGT), so $N \log_2 M = 2N$

• For proteins, $M = 20$ (ACDEFGHIKLMNPQRSTVWY), so $N \log_2 M \approx 4.3N$
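The formula N log2 M is easy to evaluate for the three alphabets discussed so far. This is a minimal sketch that assumes equally probable characters; `information_bits` is a hypothetical helper name.

```python
import math

def information_bits(n_chars: int, alphabet_size: int) -> float:
    """Information content N * log2(M), assuming equally probable characters."""
    return n_chars * math.log2(alphabet_size)

print(information_bits(100, 2))             # 100.0 (binary)
print(information_bits(100, 4))             # 200.0 (nucleic acid)
print(round(information_bits(100, 20), 1))  # 432.2 (protein)
```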


Information content

• A protein sequence has more than twice the information content of a nucleic acid sequence of the same length.

• But since it takes 3 bases (6 bits) to code for a single AA (about 4.3 bits), a protein sequence has only about 0.7 times the information content of the DNA sequence that originally coded for it.


Information content

• Suppose we translate a 15-base sequence into a five-AA sequence. The information content of the nucleic acid sequence is just $2N = 30$ bits.

• The information content of the protein sequence is $5 \log_2 20$ (this is an upper bound, assuming all AAs are equally probable), or about 21.6 bits

• About 8.4 bits (almost 8½) are lost to degeneracy.
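The arithmetic of this example, and the 0.7 ratio quoted earlier, can be reproduced as follows. It is a sketch under the same assumption of equally probable characters; the variable names are illustrative.

```python
import math

n_bases = 15
n_residues = n_bases // 3  # three bases per amino acid

dna_bits = n_bases * math.log2(4)          # 2 bits per base
protein_bits = n_residues * math.log2(20)  # about 4.32 bits per residue
lost_bits = dna_bits - protein_bits

print(dna_bits)                            # 30.0
print(round(protein_bits, 2))              # 21.61
print(round(lost_bits, 2))                 # 8.39
print(round(protein_bits / dna_bits, 2))   # 0.72
```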


Information and Entropy

Entropy is a measure of the number of ways a system can exist.

Example: the oversimplified 2 state molecule

______ B

_______ A

Molecule has two states, A and B

In a large ensemble (sample) of molecules the populations of the states are Na and Nb


Information and Entropy

The oversimplified 2 state molecule

______ B

_______ A

If a photon with energy $h\nu$ can induce transitions between the states, the energy difference between them is just $\Delta = h\nu$, and at temperature $T$ the population ratio $N_b/N_a$ is $e^{-\Delta/kT}$, where $k$ is the Boltzmann constant.


Information and Entropy

The oversimplified 2 state molecule: multiplicity

______ B

_______ A

Now suppose that A consists of n substates and B of m substates. The ratio of the populations of any substate of B to any substate of A is $e^{-\Delta/kT}$, so the ratio of the populations of all the B states to all the A states is just $(m/n)\,e^{-\Delta/kT}$.
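A small numerical sketch of the two-state model with multiplicity. With n substates in A and m in B, each B substate carries the same Boltzmann factor relative to each A substate, so the total ratio N_b/N_a picks up the factor m/n. The function name and parameters (`population_ratio`, `n_a`, `m_b`) are hypothetical, and the gap is chosen equal to kT at 300 K purely for illustration.

```python
import math

K_BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K

def population_ratio(delta: float, temperature: float,
                     n_a: int = 1, m_b: int = 1) -> float:
    """N_b / N_a for an energy gap delta (J), with n_a substates in A and m_b in B."""
    return (m_b / n_a) * math.exp(-delta / (K_BOLTZMANN * temperature))

delta = K_BOLTZMANN * 300.0  # illustrative gap equal to kT at 300 K

print(round(population_ratio(delta, 300.0), 3))                # 0.368
print(round(population_ratio(delta, 300.0, n_a=1, m_b=3), 3))  # 1.104
```

With three substates in B, the upper level ends up more populated than the lower one even though each individual B substate is less populated: that shift is the entropy contribution.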


Information and Entropy

The oversimplified 2 state molecule: free energy and entropy

We can rearrange the population ratio $(m/n)\,e^{-\Delta/kT}$ using simple algebra to obtain the equivalent expression $e^{-(\Delta + kT\ln(n/m))/kT}$. In the exponent, the term $\Delta + kT\ln(n/m)$ has units of energy and is a free energy. Free energies in general determine equilibria. The logarithmic term is an entropy term: $k\ln(m/n) = -k\ln(n/m)$ is the difference in entropy between B and A ($\Delta S = S_b - S_a$).
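The "simple algebra" can be written out step by step. This is a sketch using the symbols already defined (n substates in A, m substates in B, energy gap $\Delta$); the label $\Delta G$ for the free energy difference is introduced here for convenience and does not appear in the slides.

```latex
\begin{align*}
\frac{N_b}{N_a}
  &= \frac{m}{n}\, e^{-\Delta/kT}
   = e^{\ln(m/n)}\, e^{-\Delta/kT} \\
  &= e^{-\left(\Delta - kT\ln(m/n)\right)/kT}
   = e^{-\left(\Delta + kT\ln(n/m)\right)/kT} \\
  &= e^{-\Delta G/kT},
  \qquad \Delta G = \Delta - T\,\Delta S,
  \qquad \Delta S = S_b - S_a = k\ln\frac{m}{n}.
\end{align*}
```

The more substates B has relative to A, the larger $\Delta S$ is and the lower the free energy of B, which favors B at equilibrium.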


Information and Entropy

Question: What has entropy got to do with information?

Answer: Everything, because entropy is just a measure of the number of possible states.

The entropy of a state is just the natural logarithm of the number of ways that state can exist (multiplied by the Boltzmann constant k to give it physical units). (That’s why it’s related to the degree of order: there are more ways of making a mess than of keeping things neat.)


Information and Entropy

Half a century ago, Claude Shannon’s seminal work on information theory showed that the information content of a message could be expressed as a function we call the Shannon entropy. The basic idea is that the information content is the difference between the logarithm of the number of ways the message might read before we see it and the logarithm of the number of ways it might read after we read it. (Shannon was interested in errors as well as perfect reads.) Other people had similar ideas (e.g., Norbert Wiener, who coined the term cybernetics), but Shannon got the details right.


Information and Entropy

The information content (in bits) of a string of N characters with M ‘letters’ in the alphabet is $N \log_2 M$ if the characters are equally probable.

More generally, the information content of a character can be written in terms of its probability as $-\log_2 P_i$, which looks worse than it is. Suppose that in an organism the GC content is 60%. The $P_i$ are 0.3 for C and G and 0.2 for A and T. Each C or G contributes $-\log_2(0.3) \approx 1.74$ bits, and each A or T contributes $-\log_2(0.2) \approx 2.32$ bits. The average information per position is $-\sum_i P_i \log_2 P_i \approx 1.97$ bits.
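The 60% GC example can be evaluated directly. This is a minimal sketch; `shannon_entropy` and `gc_rich` are hypothetical names, and the probabilities are the ones given in the text. The average comes out at about 1.97 bits per position, slightly below the 2 bits of a uniform four-letter alphabet.

```python
import math

def shannon_entropy(probabilities) -> float:
    """Average information per symbol in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Base frequencies for a genome with 60% GC content.
gc_rich = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}

print(round(-math.log2(0.3), 2))                    # 1.74 bits per C or G
print(round(-math.log2(0.2), 2))                    # 2.32 bits per A or T
print(round(shannon_entropy(gc_rich.values()), 2))  # 1.97 bits per position
print(shannon_entropy([0.25] * 4))                  # 2.0 for equally likely bases
```

Rarer letters carry more information each, but the skewed distribution as a whole carries less per position than the uniform one.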