Bioinformatics Practical for Biochemists

Andrei Lupas, Birte Höcker, Steffen Schmidt

WS 2012/2013

01. DNA & Genomics


Description

• Lectures about general topics in Bioinformatics & History

• Tutorials will provide you with a toolbox of bioinformatics programs to analyze data

• Hands-On sessions will give you the opportunity to use these tools


Course Outline

• Mon – DNA & Genomics

• Tue – Introduction to Proteins

• Wed – Annotation of Sequence Features

• Thu – Protein Classification

• Fri – Evolution & Design

Course Material:

eb.mpg.de/research/departments/protein-evolution/teaching


Course Outline

• 13:00-14:00 Presentation

• 14:15-17:30 Tutorial (2 x 30min) & hands-on practical

• You will need to keep an electronic lab notebook

• Fri afternoon: Test Exercises


Software Requirements

• Browser (e.g. Firefox)

• “Advanced” Word Processor

• PyMOL (www.pymol.org – free for teaching)


DNA & Genomics

1953 Model of DNA (F. Crick)


wikipedia.org

What is the “genetic material”?

• 1865 Gregor Mendel

• basic rules of heredity

• 1869 Friedrich Miescher

• discovery of ‘nuclein’ (DNA), Hoppe-Seyler repeated all experiments

• 1881 Edward Zacharias

• chromosomes are composed of nuclein

• 1889 Richard Altmann

• renaming nuclein to nucleic acid


• 1928 Frederick Griffith

• “transforming principle” - Streptococcus pneumoniae experiment

• 1944 Avery, MacLeod & McCarty

• Griffith’s “transforming principle” is DNA

history.nih.gov / wikipedia.org

DNA is the “transforming material”


bacteriophagetherapy.info / www.lifesciencesfoundation.org

DNA is the genetic material

• 1950 Erwin Chargaff

• amounts of A and T, and of C and G, are equal; base composition is the same in different tissues

• 1952 Hershey & Chase

• DNA is the genetic material, shown with the 32P/35S phage/E. coli experiment
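
Chargaff's equalities follow directly from base pairing, which was only proposed later with the 1953 model. The sketch below is a minimal illustration, assuming standard Watson-Crick pairing; the example strand is hypothetical and not taken from the slides. It counts bases over both strands of a duplex and checks that A matches T and C matches G.

```python
from collections import Counter

# Watson-Crick pairing (assumption: standard A-T / C-G pairs only)
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def duplex_composition(strand: str) -> Counter:
    """Count bases over both strands of the double helix built from one strand."""
    return Counter(strand) + Counter(reverse_complement(strand))


# Hypothetical example strand, chosen only for illustration
strand = "ATGGCGTACGCTTAA"
counts = duplex_composition(strand)
print(counts)
print("A == T:", counts["A"] == counts["T"])
print("C == G:", counts["C"] == counts["G"])
```

A single strand on its own need not obey the equalities; they hold for the duplex because every A on one strand faces a T on the other.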


http://osulibrary.oregonstate.edu/specialcollections/coll/pauling/dna/notes/1952a.22-ms-01.html

Solving the DNA structure

• 1952/53 Linus Pauling

• beat Cavendish Lab in discovery of α-helix

• Cavendish Lab (Cambridge): Watson & Crick allowed to work full-time on DNA

• Pauling shared manuscript with Cavendish Lab before publication (via his son Peter Pauling)


Solving the DNA structure

• 1952 Franklin & Wilkins

• X-ray of B-DNA - Wilkins showed results to Watson & Crick

• periodicity, phosphates are outside

• 1953 Crick & Watson

• model of B-DNA

original papers: NATURE | VOL 421 | 23 JANUARY 2003 | www.nature.com/nature, p. 397, © 2003 Nature Publishing Group


Nature, 1953

Solving the DNA structure


DNA structure


Getting the “code”

• 1953 George E. Palade

• “RNA organelles” (ribosomes)

• 1957 Crick et al.

• suggest non-overlapping triplets

• only 20 of the 64 triplets code for an amino acid

• “comma-free code”
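
The figure of 20 can be checked by counting. In a comma-free code, a sense triplet XYZ repeated as XYZXYZ must not yield another sense triplet when read out of frame, so at most one member of each set of cyclic rotations (XYZ, YZX, ZXY) can be used, and homopolymers such as AAA are excluded altogether. The following is a minimal sketch of that counting argument; the function and variable names are illustrative only.

```python
from itertools import product

BASES = "ACGU"


def rotations(triplet: str) -> frozenset:
    """All cyclic rotations of a triplet, e.g. ACG -> {ACG, CGA, GAC}."""
    return frozenset(triplet[i:] + triplet[:i] for i in range(3))


triplets = ["".join(p) for p in product(BASES, repeat=3)]

# Group the triplets into classes of cyclic rotations
classes = {rotations(t) for t in triplets}

# Homopolymers (AAA, CCC, ...) form classes of size 1 and cannot be sense triplets
usable = [c for c in classes if len(c) == 3]

print(len(triplets), "triplets in total")
print(len(classes), "rotation classes")
print(len(usable), "classes that can contribute one sense triplet each")
```

This gives only the upper bound of 20 sense triplets; it does not construct an actual comma-free code.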


Getting the “code”

• 1961 Nirenberg & Matthaei

• polyU mRNA produces polyF protein

• complete genetic code

• 1961 Sydney Brenner

• no overlapping codes

• concept of mRNA

• triplet Code (Crick, Brenner, Barnett, Watts-Tobin)

No. 4809 December 30, 1961 NATURE 1227

GENERAL NATURE OF THE GENETIC CODE FOR PROTEINS

By F. H. C. Crick, Leslie Barnett, S. Brenner and Dr. R. J. Watts-Tobin - Medical Research Council Unit for Molecular Biology, Cavendish Laboratory, Cambridge

There is now a mass of indirect evidence which suggests that the amino-acid sequence along the polypeptide chain of a protein is determined by the sequence of the bases along some particular part of the nucleic acid of the genetic material. Since there are twenty common amino-acids found throughout Nature, but only four common bases, it has often been surmised that the sequence of the four bases is in some way a code for the sequence of the amino-acids. In this article we report genetic experiments which, together with the work of others, suggest that the genetic code is of the following general type:

(a) A group of three bases (or, less likely, a multiple of three bases) codes one amino-acid.

(b) The code is not of the overlapping type (see Fig. 1).

(c) The sequence of the bases is read from a fixed starting point. This determines how the long sequences of bases are to be correctly read off as triplets. There are no special ‘commas’ to show how to select the right triplets. If the starting point is displaced by one base, then the reading into triplets is displaced, and thus becomes incorrect.

(d) The code is probably ‘degenerate’; that is, in general, one particular amino-acid can be coded by one of several triplets of bases.

The Reading of the Code

The evidence that the genetic code is not overlapping (see Fig. 1) does not come from our work, but from that of Wittmann¹ and of Tsugita and Fraenkel-Conrat on the mutants of tobacco mosaic virus produced by nitrous acid. In an overlapping triplet code, an alteration to one base will in general change three adjacent amino-acids in the polypeptide chain. Their work on the alterations produced in the protein of the virus show that usually only one amino-acid at a time is changed as a result of treating the ribonucleic acid (RNA) of the virus with nitrous acid. In the rarer cases where two amino-acids are altered (owing presumably to two separate deaminations by the nitrous acid on one piece of RNA), the altered amino-acids are not in adjacent positions in the polypeptide chain.

Brenner had previously shown that, if the code were universal (that is, the same throughout Nature), then all overlapping triplet codes were impossible. Moreover, all the abnormal human hæmoglobins studied in detail⁴ show only single amino-acid changes. The newer experimental results essentially rule out all simple codes of the overlapping type.

If the code is not overlapping, then there must be some arrangement to show how to select the correct triplets (or quadruplets, or whatever it may be) along the continuous sequence of bases. One obvious suggestion is that, say, every fourth base is a ‘comma’. Another idea is that certain triplets make ‘sense’, whereas others make ‘nonsense’, as in the comma-free codes of Crick, Griffith and Orgel. Alternatively, the correct choice may be made by starting at a fixed point and working along the sequence of bases three (or four, or whatever) at a time. It is this possibility which we now favour.

Experimental Results

Our genetic experiments have been carried out on the B cistron of the rII region of the bacteriophage T4, which attacks strains of Escherichia coli. This is the system so brilliantly exploited by Benzer. The rII region consists of two adjacent genes, or ‘cistrons’, called cistron A and cistron B. The wild-type phage will grow on both E. coli B (here called B) and on E. coli K12 (λ) (here called K), but a phage which has lost the function of either gene will not grow on K. Such a phage produces an r plaque on B. Many point mutations of the genes are known which behave in this way. Deletions of part of the region are also found. Other mutations, known as ‘leaky’, show partial function; that is, they will grow on K but their plaque-type on B is not truly wild. We report here our work on the mutant P 13 (now renamed FC 0) in the B1 segment of the B cistron. This mutant was originally produced by the action of proflavin.

We have previously argued that acridines such as proflavin act as mutagens because they add or delete a base or bases. The most striking evidence in favour of this is that mutants produced by acridines are seldom ‘leaky’; they are almost always completely lacking in the function of the gene. Since our note was published, experimental data from two sources have been added to our previous evidence: (1) we have examined a set of 126 rII mutants made with acridine yellow; of these only 6 are leaky (typically about half the mutants made with base analogues are leaky); (2) Streisinger¹⁰ has found that whereas mutants of the lysozyme of phage T4 produced by base-analogues are usually leaky, all lysozyme mutants produced by proflavin are negative, that is, the function is completely lacking.

If an acridine mutant is produced by, say, adding a base, it should revert to ‘wild-type’ by deleting a base. Our work on revertants of FC 0 shows that it usually ...

[Figure: a nucleic acid read from a starting point, once as an overlapping code and once as a non-overlapping code]

Fig. 1. To show the difference between an overlapping code and a non-overlapping code. The short vertical lines represent the bases of the nucleic acid. The case illustrated is for a triplet code
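
Points (b) and (c) of the article and the argument from single amino-acid changes can be made concrete with a short sketch. It is only an illustration under simple assumptions: the sequence is hypothetical, and "reading" just means splitting the string into triplets. It shows that shifting the starting point by one base changes every triplet, and that a single base change alters one triplet in a non-overlapping reading but three adjacent triplets in a fully overlapping one.

```python
def non_overlapping(seq: str, start: int = 0) -> list[str]:
    """Read triplets from a fixed starting point, three bases at a time."""
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]


def overlapping(seq: str) -> list[str]:
    """Read every possible triplet, advancing one base at a time."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def changed_triplets(reader, seq: str, pos: int, new_base: str) -> int:
    """Count how many triplets differ after substituting one base at pos."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    return sum(a != b for a, b in zip(reader(seq), reader(mutated)))


# Hypothetical sequence, chosen only for illustration
seq = "CATCATCATCAT"

print(non_overlapping(seq))            # correct frame
print(non_overlapping(seq, start=1))   # starting point displaced by one base
print(changed_triplets(non_overlapping, seq, 5, "G"))
print(changed_triplets(overlapping, seq, 5, "G"))
```

The displaced reading corresponds to the frame shift that an added or deleted base produces downstream of the mutation.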


Getting the “code” – incl. start & stop codons (E. coli)

• Alternative start codons

• AUG (83%)

• GUG (14%)

• UUG (3%)

• Alternative stop codons

• UAA (63%, ‘ochre’)

• UGA (29%, ‘opal’), or Sec (selenocysteine)

• UAG (8%, ‘amber’)
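
The start and stop codons listed above are what a simple open reading frame (ORF) scan looks for. The sketch below is a deliberately naive illustration, not a gene finder: it assumes an mRNA-like string on one strand only, takes the first start codon it meets in each frame, ignores the codon frequencies, and treats UGA always as a stop (never as selenocysteine). The function name, the `min_codons` parameter and the example sequence are hypothetical.

```python
START_CODONS = {"AUG", "GUG", "UUG"}
STOP_CODONS = {"UAA", "UGA", "UAG"}


def find_orfs(mrna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) positions of ORFs in the three forward frames.

    end is exclusive and includes the stop codon; min_codons counts the
    codons from the start codon up to, but not including, the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return orfs


# Hypothetical sequence with one AUG-initiated and one GUG-initiated ORF
mrna = "GGAUGGCCAAAUAAGUGCCCUGAC"
for start, end in find_orfs(mrna):
    print(start, end, mrna[start:end])
```

Because GUG and UUG also occur inside genes as ordinary valine and leucine codons, real gene finders use additional signals such as ribosome binding sites to choose the start.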


wikipedia.org / yale.edu

Gene Structure

• 1977 Sharp & Roberts

• pre-mRNA is processed

• 1982 Cech

• ribozymes (ribonucleic acid enzymes)

• 1980 Joan A. Steitz

• role of snRNPs in splicing


Gene Structure – Eukaryotes / Prokaryotes

lac Operon

1: Regulatory gene

Promoter region

3: β-galactosidase

4: β-gal permease

8: β-gal transacetylase


Miller, O. L. et al. Visualization of bacterial genes in action. Science 169, 392–395 (1970)

Gene structure – Polysomes in Prokaryotes

• EM picture of polysomes on a chromosome

Figure labels: transcription initiation; DNA; mRNA with ribosomes

Griswold, A. (2008) Nature Education 1(1)

Understanding Bioinformatics, Zvelebil & Baum, 2007

Gene Structure – Prokaryotic Operons

lac Operon

1: Regulatory gene

Promoter region

3: β-galactosidase

4: β-gal permease

8: β-gal transacetylase


u-tokyo.ac.jp

Gene Structure – Prokaryotes


zazzle.com

Gene Structure – Eukaryotes

Gene Structure – Comparison

Eukaryote vs. Prokaryote

Genes

• Eukaryote: often have introns; intraspecific gene order and number generally relatively stable; many non-coding (RNA) genes; there is NOT generally a relationship between organism complexity and gene number

• Prokaryote: no introns; gene order and number may vary between strains of a species

Gene regulation

• Eukaryote: promoters, often with distal long-range enhancers/silencers, MARs, transcriptional domains; generally mono-cistronic

• Prokaryote: promoters; enhancers/silencers rare; genes often regulated as polycistronic operons

Repetitive sequences

• Eukaryote: generally highly repetitive, with genome-wide families from transposable element propagation

• Prokaryote: generally few repeated sequences; relatively few transposons

Organelle (subgenomes)

• Eukaryote: mitochondrial (all); chloroplasts (in plants)

• Prokaryote: absent


Genomic era

• 1977 Frederick Sanger

• dideoxy sequencing

• 1986 Human Genome Initiative

• Genomes

• 1995 H. influenzae 1.8 Mb 1.7k genes

• 1996 S. cerevisiae 12.5 Mb 5.7k genes

• 1997 E. coli 4.6 Mb 4.3k genes

• 1998 C. elegans 100 Mb 21.7k genes

• 2000 D. melanogaster 121 Mb 17k genes
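
The genome sizes and gene counts listed above can be turned into a gene density, which makes the difference between compact bacterial genomes and larger eukaryotic ones easy to see. The sketch below only does this arithmetic on the slide's own numbers; the helper name `genes_per_mb` is illustrative.

```python
# (organism, genome size in Mb, approximate gene count) as listed on the slide
GENOMES = [
    ("H. influenzae", 1.8, 1_700),
    ("E. coli", 4.6, 4_300),
    ("S. cerevisiae", 12.5, 5_700),
    ("C. elegans", 100.0, 21_700),
    ("D. melanogaster", 121.0, 17_000),
]


def genes_per_mb(size_mb: float, genes: int) -> float:
    """Gene density in genes per megabase."""
    return genes / size_mb


for name, size_mb, genes in GENOMES:
    print(f"{name:16s} {size_mb:6.1f} Mb  {genes_per_mb(size_mb, genes):6.0f} genes/Mb")
```

With these figures the two bacteria come out at roughly 900 genes per Mb, while the eukaryotes lie between about 140 and 460 genes per Mb.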


Kavenoff, Nature Education: Supercoiled chromosome of E. coli.

Prokaryotic Genome

• E. coli

• 4.6 Mbp

• 1 by 2 µm cell size
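
A quick calculation shows why the chromosome has to be compacted. The sketch below assumes the 4.6 Mb listed for E. coli in the genome list and a rise of 0.34 nm per base pair for B-DNA; the rise value is a textbook figure that does not appear on the slides, and the function name is illustrative.

```python
RISE_PER_BP_NM = 0.34  # assumption: axial rise per base pair of B-DNA, in nm


def contour_length_um(n_bp: float) -> float:
    """Length of a fully stretched B-DNA molecule in micrometres."""
    return n_bp * RISE_PER_BP_NM / 1000.0  # nm -> µm


genome_bp = 4.6e6        # E. coli genome size from the genome list
cell_length_um = 2.0     # long axis of the cell from the slide

length_um = contour_length_um(genome_bp)
print(f"stretched chromosome: {length_um:.0f} µm")
print(f"ratio to cell length: {length_um / cell_length_um:.0f}x")
```

Under these assumptions the stretched chromosome is about 1.5 mm long, several hundred times the length of the cell, which is why supercoiling is needed to pack it.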


Science (2001), Nature (2001)

The human genome

• 2001 Draft H. sapiens 2.9 Gb 20-30k genes


The human genome


Gene content


Genome Structure – Comparison

Eukaryote vs. Prokaryote

Size

• Eukaryote: large (10 Mb – 100,000 Mb); there is not generally a relationship between organism complexity and genome size (many plants have larger genomes than human!)

• Prokaryote: generally small (<10 Mb; most <5 Mb); complexity (as measured by number of genes and metabolism) generally proportional to genome size

Content

• Eukaryote: most DNA is non-coding

• Prokaryote: DNA is “coding gene dense”

Telomeres / Centromeres

• Eukaryote: present (linear DNA)

• Prokaryote: circular DNA, doesn't need telomeres; don’t have mitosis, hence no centromeres

Number of chromosomes

• Eukaryote: more than one, (often) including those discriminating sexual identity

• Prokaryote: often one, sometimes more (plasmids are not true chromosomes)

Chromatin

• Eukaryote: histone bound (which serves as a genome regulation point)

• Prokaryote: no histones; uses supercoiling to pack genome


Gene content


Gregory (2005), Nature

Human Genome Content

• LINEs: 20.4%

• SINEs: 13.1%

• LTR retrotransposons: 8.3%

• DNA transposons: 2.9%

• Simple sequence repeats: 3%

• Segmental duplications: 5%

• Miscellaneous heterochromatin: 8%

• Miscellaneous unique sequences: 11.6%

• Introns: 25.9%

• Protein-coding genes: 1.5%
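
Adding up the slices of the pie chart gives a feel for how little of the genome codes for protein. The sketch below is simple arithmetic on the chart's percentages; the pairing of labels and values is an assumption reconstructed from the flattened chart (in particular which of 20.4% and 13.1% belongs to LINEs and which to SINEs), although the transposable-element total does not depend on that choice.

```python
# Percentages from the chart; the label/value pairing is an assumption
GENOME_CONTENT = {
    "LINEs": 20.4,
    "SINEs": 13.1,
    "LTR retrotransposons": 8.3,
    "DNA transposons": 2.9,
    "Simple sequence repeats": 3.0,
    "Segmental duplications": 5.0,
    "Miscellaneous heterochromatin": 8.0,
    "Miscellaneous unique sequences": 11.6,
    "Introns": 25.9,
    "Protein-coding genes": 1.5,
}

TRANSPOSABLE_ELEMENTS = ("LINEs", "SINEs", "LTR retrotransposons", "DNA transposons")

te_percent = round(sum(GENOME_CONTENT[k] for k in TRANSPOSABLE_ELEMENTS), 1)
total_percent = round(sum(GENOME_CONTENT.values()), 1)

print(f"transposable elements: {te_percent}%")
print(f"protein-coding:        {GENOME_CONTENT['Protein-coding genes']}%")
print(f"sum of all slices:     {total_percent}% (rounding in the chart)")
```

On these numbers, sequences derived from transposable elements account for close to 45% of the genome, about thirty times the protein-coding share.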


Gene Structure – Eukaryotic Gene

[Figure: UCSC Genome Browser view (hg19) of the SMG5 gene, chr1:156,225,000–156,250,000, 10 kb scale bar. Tracks: UCSC Genes (RefSeq, UniProt, CCDS, Rfam, tRNAs & Comparative Genomics); Placental Mammal Basewise Conservation by PhyloP; Simple Nucleotide Polymorphisms (dbSNP 135) Found in >= 1% of Samples; Repeating Elements by RepeatMasker]


wikipedia.org

Transposable Element - Mobile Elements / Jumping genes

• Barbara McClintock (1902 - 1992)

• studies in the 1940s & 1950s of spotted kernels in maize

• discovery of “controlling elements”

• initially thought to be unique to maize but later also found in other eukaryotes, bacteria, viruses, phages & plasmids

• Nobel prize in 1983


wikipedia.org

Transposable Element - Mobile Elements / Jumping genes

• DNA Transposons

• transposase cuts out transposon & inserts it at the target site

• “cut-and-paste” mechanism

• prokaryotes & eukaryotes

• Retrotransposons

• transposon DNA transcribed to RNA

• insertion to genome by reverse transcription

• LTR, LINEs, SINEs

• eukaryotes only
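
The difference between the two mechanisms can be mimicked on plain strings. The sketch below is a toy model only: the genome is a hypothetical string, the element is marked as "[TE]", and the transposase and reverse transcription steps are reduced to string slicing. It shows that cut-and-paste moves the element while keeping the copy number constant, whereas the RNA-mediated route leaves the original in place and adds a copy.

```python
def cut_and_paste(genome: str, element: str, target: int) -> str:
    """DNA transposon: excise the element, then insert it at target.

    target is an index into the genome after excision.
    """
    site = genome.find(element)
    if site < 0:
        raise ValueError("element not found in genome")
    excised = genome[:site] + genome[site + len(element):]
    return excised[:target] + element + excised[target:]


def copy_and_paste(genome: str, element: str, target: int) -> str:
    """Retrotransposon: the original stays, a new copy is inserted at target."""
    if element not in genome:
        raise ValueError("element not found in genome")
    return genome[:target] + element + genome[target:]


# Hypothetical genome with one marked element
genome = "aaa[TE]cccggg"

moved = cut_and_paste(genome, "[TE]", target=6)
copied = copy_and_paste(genome, "[TE]", target=10)

print(moved, moved.count("[TE]"))
print(copied, copied.count("[TE]"))
```

Repeated rounds of the second mechanism are one way to picture how retrotransposon families such as LINEs and SINEs reach high copy numbers.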


What can Bioinformatics do for you?

• sequence analysis

• comparison, annotation, phylogeny

• genomics

• assembly, gene finding / annotation, phylogeny

• data mining / analysis

• text mining, expression profiling (microarray, RNAseq), image analysis

• structural bioinformatics

• secondary structure prediction, protein design, docking
