ORF Calling

Upload: david-woods

Post on 18-Jan-2016


TRANSCRIPT

Page 1: ORF Calling. Why? Need to know protein sequence Protein sequence is usually what does the work Functional studies Crystallography Proteomics Similarity

ORF Calling

Why?

Need to know protein sequence

Protein sequence is usually what does the work

Functional studies

Crystallography

Proteomics

Similarity studies

Proteins are better for remote similarities than DNA sequences

Protein sequences change slower than DNA sequences

ORF Calling

Intrinsic gene calling

• Only use information in your DNA sequences. Does not use other information.

Extrinsic gene calling

• Compare your DNA sequences to known sequences. Needs other sequences that are known!

Extrinsic gene calling

Start with DNA sequence

Translate in all 6 reading frames

Why are there 6 reading frames?

 3   AG TAA AAC TTT AAT TGT TGG TTA A
 2   A GTA AAA CTT TAA TTG TTG GTT AA
 1   AGT AAA ACT TTA ATT GTT GGT TAA

     AGT AAA ACT TTA ATT GTT GGT TAA
     ||| ||| ||| ||| ||| ||| ||| |||
     TCA TTT TGA AAT TAA CAA CCA ATT

-1   TCA TTT TGA AAT TAA CAA CCA ATT
-2   TC ATT TTG AAA TTA ACA ACC AAT T
-3   T CAT TTT GAA ATT AAC AAC CAA TT
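The six frames can be sketched in a few lines of Python: three offsets on the given strand and three on its reverse complement. This is a minimal illustration, assuming the standard genetic code and uppercase A/C/G/T input only. The function names (`reverse_complement`, `translate`, `six_frames`) and the frame labels 1, 2, 3, -1, -2, -3 are choices made for this example, and the numbering of the reverse frames is one common convention among several.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Map frame label (1, 2, 3, -1, -2, -3) to its translation."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    # The example sequence from the slide.
    for label, protein in sorted(six_frames("AGTAAAACTTTAATTGTTGGTTAA").items()):
        print(f"{label:>2}  {protein}")
```

Frame 1 of the slide's sequence translates to `SKTLIVG*`; the other frames contain stop codons (`*`) partway through, which is why only some frames can hold a long ORF.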

Extrinsic gene calling

Start with DNA sequence

Translate in all 6 reading frames

Compare your sequence to known protein sequences

Find the ends of each, and call those genes!

For example

[Figure labels: DNA sequence; similar protein sequences (e.g. from BLAST); protein-encoding gene]

Uses of extrinsic calling

This is how (most) metagenome ORF calling is done

Eukaryotic ORF calling – especially using EST sequences

Problems with extrinsic calling

Very slow (depending on search algorithm)

Dependent on your database

Only finds known genes

Alternatives to extrinsic gene calling

Intrinsic gene calling

Ab initio gene calling

What are the start codons? ATG

What are the stop codons? TAA, TAG, TGA

How frequently do stop codons appear?

Approximately once every 20 amino acids at random! (3 of the 64 codons are stop codons, so about 1 codon in 21 if all codons are equally likely.)

A stretch of 100 amino acids is likely to have a stop codon!
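The arithmetic behind these two statements is short enough to check directly. The sketch below assumes all 64 codons are equally likely, which real genomes only approximate (base composition shifts the numbers); the variable names are illustrative.

```python
# Stop codon frequency under a uniform-codon assumption.
STOP_CODONS = 3      # TAA, TAG, TGA
TOTAL_CODONS = 64    # 4 ** 3 possible triplets

p_stop = STOP_CODONS / TOTAL_CODONS

# Average spacing between stop codons in random sequence, in codons.
expected_spacing = 1 / p_stop

# Chance that 100 consecutive random codons contain no stop at all.
p_open_100 = (1 - p_stop) ** 100

print(f"P(stop) per codon:          {p_stop:.4f}")
print(f"Expected spacing (codons):  {expected_spacing:.1f}")
print(f"P(100 codons with no stop): {p_open_100:.4f}")
```

Under this assumption a random stretch of 100 codons stays open less than 1% of the time, which is why long ORFs are unlikely to be chance events.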

How to call ORFs (the easy way)

[Figure, repeated for each step below: DNA with its six reading frames, labeled 3, 2, 1, -1, -2, -3]

Find all the stop codons

Find all the ORFs > x amino acids (x is often 100 amino acids)

Trim to those ORFs that have a start

Remove “shadow” ORFs (short ORFs that overlap others)

Trim the start sites to first ATG

These are the ORFs
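The steps above can be sketched roughly as follows. This is a simplified illustration rather than any particular tool's method, and it makes several assumptions: ATG is the only start codon, the minimum length is measured after trimming to the first ATG, ORFs that run off the end of the sequence without a stop are ignored, and shadow ORF removal is left out for brevity. The function names and the returned tuple layout are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
START = "ATG"


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) positions in seq for ORFs in one reading frame.

    start is the first base of the first ATG after the previous stop;
    end is one past the last base of the closing stop codon.
    """
    codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    segment_start = 0  # codon index just after the previous stop
    for idx, codon in enumerate(codons):
        if codon not in STOPS:
            continue
        segment = codons[segment_start:idx]
        if START in segment:
            first_atg = segment_start + segment.index(START)
            # Length counts the codons from ATG up to, not including, the stop.
            if idx - first_atg >= min_codons:
                yield offset + 3 * first_atg, offset + 3 * (idx + 1)
        segment_start = idx + 1


def call_orfs(dna, min_codons=100):
    """Return (strand, frame, start, end, sequence) for ORFs in all six frames.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    dna = dna.upper()
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            for start, end in orfs_in_frame(seq, offset, min_codons):
                found.append((strand, offset + 1, start, end, seq[start:end]))
    return found


if __name__ == "__main__":
    # Toy sequence with a tiny threshold so that something is reported.
    for orf in call_orfs("CCATGAAACCCGGGTTTTAGCC", min_codons=3):
        print(orf)
```

With the default threshold of 100 codons the toy sequence yields nothing, which matches the idea that short open stretches are not reported.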

Intrinsic ORF calling using Markov Models

Markov Models

Based on language processing

Common for gene and protein finding, alignments, and so on

What is the most common word?

English: the

Spanish: el (la)

Portuguese: que

Scrabble

In Scrabble, how do they score the letters?

The most abundant letters (easiest to place on the board) are given the lowest score

1 point: E, A, I, O, N, R, T, L, S, U

2 points: D, G

3 points: B, C, M, P

4 points: F, H, V, W, Y

5 points: K

8 points: J, X

10 points: Q, Z


Frequency of letters

Making up sentences

If I want to make up a sentence, I could choose some letters at random, based on how often they occur in the language (i.e. their Scrabble score)

rla bsht es stsfa ohhofsd

Let's get clever!

What follows a period (“.”)? Usually a space “ ”

What follows a t? Usually an “i” (-tion, -tize, ...)

Frequency of two letters

When the first letter is “t” (from 3,269 words):

ti 51%

te 20%

ta 15%

th 8%

Level 1 analysis

Choose a letter based on the probability that it follows the letter before:

s h a n d t u c ht i n e y m e l e o l l d
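A first-order model like this one can be sketched by counting which character follows which in some training text, then sampling each new character in proportion to those counts. The training text and the function names below are made up for illustration; any reasonably long English text would give more word-like output.

```python
import random
from collections import Counter, defaultdict


def train_first_order(text):
    """Count, for every character, how often each character follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, rng):
    """Generate up to `length` characters, each drawn given the previous one."""
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last character was never followed by anything
        letters = list(followers)
        weights = [followers[letter] for letter in letters]
        out.append(rng.choices(letters, weights=weights)[0])
    return "".join(out)


if __name__ == "__main__":
    training_text = (
        "the protein sequence is usually what does the work "
        "so we translate the dna and then we call the genes"
    )
    model = train_first_order(training_text)
    print(generate(model, "t", 40, random.Random(1)))
```

The output is still nonsense, but letter pairs now look plausible; raising the order (conditioning on more preceding letters) pushes the output closer to real words.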

Levels of analysis

Zero order model: 1 letter (a, e, o …)

First order model: 2 letters (th, ti, sh …)

Second order model: 3 letters (the, and, …)

Third order model: 4 letters (that, …)

Markov models

With about 10th order Markov models of English you get complete words and sentences!


Scoring words with Markov Models

If I choose random letters how can I tell if they are real words?

Sum the scores of 10th order Markov models across the words … if it is high it is likely to be a real word!

In reality, maybe use 1st, 2nd, 3rd, 4th, 5th, 6th … order models and compare to some known words

Markov Models and ORF calling

Codons have three letters (ATG, CAC, GGG, ...)

Use a 2nd order Markov model for ORF calling

The probability of a letter is predicted from the two letters before it
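As a rough sketch of the idea, a second-order model can be trained by counting which base follows each pair of bases in known coding sequence, and a candidate ORF can then be scored by summing log probabilities along its length. The pseudocount (to avoid zero probabilities) and the function names are assumptions for this example; real gene callers use more elaborate models.

```python
import math
from collections import Counter, defaultdict


def train_second_order(sequences, pseudocount=1.0):
    """Return prob(context, base): P(base | two preceding bases)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1

    def prob(context, base):
        seen = counts.get(context, Counter())
        # The pseudocount keeps unseen transitions from scoring zero.
        return (seen[base] + pseudocount) / (sum(seen.values()) + 4 * pseudocount)

    return prob


def log_score(seq, prob):
    """Sum of log P(base | previous two bases) over the sequence."""
    return sum(math.log(prob(seq[i - 2:i], seq[i])) for i in range(2, len(seq)))


if __name__ == "__main__":
    # Toy training set; in practice this would be trusted genes.
    model = train_second_order(["ATGATGATGATG"])
    for candidate in ("ATGATG", "CCCCCC"):
        print(candidate, round(log_score(candidate, model), 3))
```

A higher (less negative) score means the candidate looks more like the training sequences. Comparing the score against a background model trained on non-coding sequence is a common way to turn this into a decision.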


Scrabble

Scrabble (México)

Do English and Spanish use the same letters?

Scrabble (US)

Based on the front page of the NY Times!

1 point: E, A, I, O, N, R, T, L, S, U

2 points: D, G

3 points: B, C, M, P

4 points: F, H, V, W, Y

5 points: K

8 points: J, X

10 points: Q, Z

Scrabble (Spanish)

1 point: A, E, O, I, S, N, L, R, U, T

2 points: D, G

3 points: C, B, M, P

4 points: H, F, V, Y

5 points: CH, Q

8 points: J, LL, Ñ, RR, X

10 points: Z

What about Scrabble scores for DNA?

Will vary with the composition of the organism!

Remember, some organisms have high G+C compared to A+T
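The DNA equivalent of letter frequencies is base composition, which is easy to measure for any genome. A minimal sketch follows; the function names are illustrative, and characters other than A, C, G and T (such as N) are simply ignored.

```python
from collections import Counter


def base_frequencies(dna):
    """Return the fraction of A, C, G and T among the unambiguous bases."""
    counts = Counter(base for base in dna.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(dna):
    """Return the G+C fraction of the sequence."""
    freqs = base_frequencies(dna)
    return freqs["G"] + freqs["C"]


if __name__ == "__main__":
    print(base_frequencies("AGTAAAACTTTAATTGTTGGTTAA"))
    print(f"G+C: {gc_content('AGTAAAACTTTAATTGTTGGTTAA'):.2f}")
```

Because these frequencies differ between organisms, a model trained on one genome may score genes from another genome poorly.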


Problems!

Need to train the Markov model – not all organisms are the same

Can use phylogenetically close organisms

Can use “long ORFs” – likely to be correct because a long stretch without a stop codon is unlikely to occur at random!

Interpolated Markov Model (the IMM in GLIMMER)

Markov Models order 1-8 (word size 2-9)

Discard (or ↓ weight) for rare words

Promote (or ↑ weight) for common words

Probability is the sum of all probabilities from orders 1-8 (word sizes 2-9)
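The general idea of weighting several model orders can be sketched as below. This is only an illustration of down-weighting rarely seen contexts and is not GLIMMER's actual weighting scheme, which is more involved and is not reproduced here. The `threshold` parameter, the weight formula, the inclusion of order 0, and all function names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_counts(sequences, max_order):
    """For each order k, count which base follows each k-letter context."""
    counts = {k: defaultdict(Counter) for k in range(max_order + 1)}
    for seq in sequences:
        for i in range(len(seq)):
            for k in range(min(i, max_order) + 1):
                counts[k][seq[i - k:i]][seq[i]] += 1
    return counts


def interpolated_prob(counts, context, base, threshold=10):
    """Weighted average of the per-order estimates of P(base | context).

    Contexts seen fewer than `threshold` times get proportionally less
    weight; contexts never seen contribute nothing.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for k, table in counts.items():
        if k > len(context):
            continue
        seen = table.get(context[len(context) - k:])
        n = sum(seen.values()) if seen else 0
        if n == 0:
            continue
        weight = min(1.0, n / threshold)
        weighted_sum += weight * seen[base] / n
        total_weight += weight
    # Fall back to a uniform guess if no order has any data.
    return weighted_sum / total_weight if total_weight else 0.25


if __name__ == "__main__":
    model = train_counts(["ATGATGATGATG"], max_order=2)
    for b in "ACGT":
        print(b, round(interpolated_prob(model, "AT", b), 3))
```

Because every per-order estimate sums to one over the four bases, the weighted average does too, so the result can be used wherever a single fixed-order probability would be.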

RNA genes

As with proteins, two main methods:

Ab initio

• Intrinsic

Homology based

• Extrinsic

Ribosomes

Ribosomes are made of proteins and RNA

30S subunit from Thermus aquaticus

Blue: protein

Orange: rRNA

E. coli 16S rRNA secondary structure

[Figure legend: variable region; conserved region]

Variable regions in the 16S rRNA

Vn – 9 regions

(n) – variable loop(s)

forward/rev primers

V1 (6)

V2 (8-11)

V3 (18)

V4 (P23-1, 24)

V5 (28, 29)

V6 (37)

V7 (43)

V8 (45, 46)

V9 (49)

Van de Peer Y, Chapelle S, De Wachter R. (1996) A quantitative map of nucleotide substitution rates in bacterial rRNA. Nucl. Acids Res. 24:3381-3391

Ribosomes

Ribosomes are made of proteins and RNA

Prokaryotic ribosome:

Large subunit: 50S

• 5S and 23S rRNA genes

Small subunit: 30S

• 16S rRNA gene

Finding 16S genes

Easiest way is iterative: BLAST, ALIGN, TRIM

Problem: secondary structure makes identification of the ends difficult

Finding tRNA genes

Not as easy as rRNA

Much shorter

Varied sequence

Only conservation is 2° (secondary) structure

tRNAscan-SE

Sean Eddy

Use it!

How does this relate to tRNA?

tRNA-Phe by Yikrazuul - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons https://commons.wikimedia.org/wiki/File:TRNA-Phe_yeast_en.svg

tRNA structure

Start of acceptor stem (7-9 bp)

D-loop (4-6 bp) stem plus loop

Anticodon arm (6 bp) stem plus loop with anticodon

T-loop (4-5 bp) stem plus loop

End of acceptor stem (7-9 bp)

CCA to attach amino acid (may not be in sequence ... added during processing)