biol335: how to annotate a genome

Genome annotation Paul Gardner March 3, 2015 Paul Gardner Genome annotation

Upload: paul-gardner

Post on 03-Jul-2015

DESCRIPTION

Course material for: http://www.canterbury.ac.nz/courseinfo/GetCourseDetails.aspx?course=BIOL335

TRANSCRIPT

Page 1: BIOL335: How to annotate a genome

Genome annotation

Paul Gardner

March 3, 2015

Paul Gardner Genome annotation

Page 2: BIOL335: How to annotate a genome

Medical genomics

- Vicky Cameron & Anna Pilbrow at Otago are identifying genetic variation and genes associated with an increased risk of heart disease.
- Mike Stratton at the Sanger Institute is hunting for genetic variation that is associated with an increased risk of cancer.
- Rob Knight at UC Boulder is sequencing the microbes that live on us, finding associations between our health and microbial communities. See Rob's TED Talk.


Page 3: BIOL335: How to annotate a genome

Agricultural genomics

- Graeme Attwood at AgResearch is trying to stop cows & sheep from emitting greenhouse gases by studying their gut microbes. He has sequenced two methanogenic archaeal genomes of Methanobrevibacter sp.
- Honour McCann at Massey University is trying to determine how Pseudomonas syringae pv. actinidiae (PSA) is killing kiwifruit.
- Rebecca Ganley at SCION is investigating how Phytophthora Taxon Agathis (PTA) is causing kauri die-back disease and killing kauri trees.


Page 4: BIOL335: How to annotate a genome

Academic interest genomics

- Tom Gilbert at the University of Copenhagen is sequencing bird and giant squid genomes.
- Elizabeth Murchison is sequencing Tasmanian devils (and their transmissible cancers). See Liz's TED Talk.
- Neil Gemmell at Otago University is sequencing the tuatara genome.


Page 5: BIOL335: How to annotate a genome

Annotate me!

TTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAA

CACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAG

TGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCA

CCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGG

GACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTT

TGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTC

ACAACGTTACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCTGAGTCCACCCGCCGTATTGCGG

CAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACT

ACTCTGCTGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGTCAGGTGCCCG

ATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCC

AGTTCCAGATCCCTTGCCTGATTAAAAATACCGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGG

GCATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGGCGGCGCGCGTCTTTGCAGCGATGTCAC

GCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGAATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAA

TGCAGGAAGAGTTCTACCTGGAACTGAAAGAAGGCTTACTGGAGCCGCTGGCAGTGACGGAACGGCTGGCCATTATCTCGGTGGTAGGTGATGGTATGC

GCACCTTGCGTGGGATCTCGGCGAAATTCTTTGCCGCACTGGCCCGCGCCAATATCAACATTGTCGCCATTGCTCAGGGATCTTCTGAACGCTCAATCT

CTGTCGTGGTAAATAACGATGATGCGACCACTGGCGTGCGCGTTACTCATCAGATGCTGTTCAATACCGATCAGGTTATCGAAGTGTTTGTGATTGGCG

TCGGTGGCGTTGGCGGTGCGCTGCTGGAGCAACTGAAGCGTCAGCAAAGCTGGCTGAAGAATAAACATATCGACTTACGTGTCTGCGGTGTTGCCAACT

CGAAGGCTCTGCTCACCAATGTACATGGCCTTAATCTGGAAAACTGGCAGGAAGAACTGGCGCAAGCCAAAGAGCCGTTTAATCTCGGGCGCTTAATTC

GCCTCGTGAAAGAATATCATCTGCTGAACCCGGTCATTGTTGACTGCACTTCCAGCCAGGCAGTGGCGGATCAATATGCCGACTTCCTGCGCGAAGGTT

TCCACGTTGTCACGCCGAACAAAAAGGCCAACACCTCGTCGATGGATTACTACCATCAGTTGCGTTATGCGGCGGAAAAATCGCGGCGTAAATTCCTCT

ATGACACCAACGTTGGGGCTGGATTACCGGTTATTGAGAACCTGCAAAATCTGCTCAATGCAGGTGATGAATTGATGAAGTTCTCCGGCATTCTTTCTG

GTTCGCTTTCTTATATCTTCGGCAAGTTAGACGAAGGCATGAGTTTCTCCGAGGCGACCACGCTGGCGCGGGAAATGGGTTATACCGAACCGGACCCGC

GAGATGATCTTTCTGGTATGGATGTGGCGCGTAAACTATTGATTCTCGCTCGTGAAACGGGACGTGAACTGGAGCTGGCGGATATTGAAATTGAACCTG

TGCTGCCCGCAGAGTTTAACGCCGAGGGTGATGTTGCCGCTTTTATGGCGAATCTGTCACAACTCGACGATCTCTTTGCCGCGCGCGTGGCGAAGGCCC

GTGATGAAGGAAAAGTTTTGCGCTATGTTGGCAATATTGATGAAGATGGCGTCTGCCGCGTGAAGATTGCCGAAGTGGATGGTAATGATCCGCTGTTCA

AAGTGAAAAATGGCGAAAACGCCCTGGCCTTCTATAGCCACTATTATCAGCCGCTGCCGTTGGTACTGCGCGGATATGGTGCGGGCAATGACGTTACAG

CTGCCGGTGTCTTTGCTGATCTGCTACGTACCCTCTCATGGAAGTTAGGAGTCTGACATGGTTAAAGTTTATGCCCCCATGGTTAAAGTTTATGCCCCG

GCTTCCAGTGCCAATATGAGCGTCGGGTTTGATGTGCTCGGGGCGGCGGTGACACCTGTTGATGGTGCATTGCTCGGAGATGTAGTCACGGTTGAGGCG

GCAGAGACATTCAGTCTCAACAACCTCGGACGCTTTGCCGATAAGCTGCCGTCAGAACCACGGGAAAATATCGTTTATCA


Page 6: BIOL335: How to annotate a genome

Discussion

- How should these researchers annotate their genomes (after they have sequenced and assembled them)?
  - What are the fast and cheap methods?
  - What are the most accurate methods?


Page 7: BIOL335: How to annotate a genome

The data tsunami

- Thanks to new sequencing technologies (recall Ant's teeny-tiny little sequencer), biologists no longer spend years acquiring data.
- The bottleneck for research is now the analysis phase.
- Biologists with good mathematics skills and mathematicians with an interest in biology are in high demand.

[Figure: the research cycle: Gather data -> Analyze-Classify -> Hypotheses-Predictions -> Experiment, illustrated with RNA sequence data.]


Page 8: BIOL335: How to annotate a genome

We can use sequence analysis...

- Genes leave a statistical signal in the genome...
- Example: identify promoters, ribosome binding sites, open reading frames (ORFs), terminators
- In eukaryotes, CpG islands, splicing signals and poly-A tails may be incorporated
- How reliable are these approaches? What are the main weaknesses & strengths?

Figure from: http://zerocool.is-a-geek.net/?p=630
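
The ORF signal listed above can be sketched in a few lines. This is only an illustration, not how Prodigal or GLIMMER work (they use much richer statistical models). It assumes the standard genetic code, ATG as the only start codon, and a hypothetical `min_codons` cut-off; the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def find_orfs(seq, min_codons=30):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, start, end, protein) tuples. Coordinates are 0-based,
    half-open, and for the '-' strand they refer to the reverse-complemented
    sequence (an assumption made to keep the sketch short).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first in-frame start codon
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, translate(s[start:i])))
                    start = None                   # look for the next ORF
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "AAA" * 5 + "TAA" + "GG"
    for orf in find_orfs(toy, min_codons=3):
        print(orf)
```

Pasting the "Annotate me!" sequence into `find_orfs` is one way to start on the exercise; lowering `min_codons` shows how quickly the candidate list grows.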


Page 9: BIOL335: How to annotate a genome

Sequence analysis: strengths and weaknesses

- ORF prediction: Prodigal, GLIMMER
  - Strengths:
    - very fast
    - cheap
  - Weaknesses:
    - false positives (see AntiFam)
    - misses short peptides (e.g. toxin-antitoxin systems)
    - no ncRNAs, pseudogenes, recoding elements, ...
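
A back-of-the-envelope sketch of why false positives and missed short peptides go together. It assumes a random sequence with uniform base composition, so 3 of the 64 codons are stops; real genomes deviate from this (GC-rich genomes have fewer chance stop codons), and `p_open` is a name invented for this example.

```python
def p_open(n_codons, stop_fraction=3 / 64):
    """Probability that n consecutive random codons contain no stop codon."""
    return (1 - stop_fraction) ** n_codons


if __name__ == "__main__":
    for n in (20, 50, 100, 300):
        print(f"{n:4d} codons: P(no stop by chance) = {p_open(n):.4f}")
```

Under these assumptions a 20-codon stretch stays open by chance roughly 38% of the time, while a 100-codon stretch does so less than 1% of the time. A length cut-off high enough to suppress chance ORFs therefore also discards genuinely short genes.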


Page 10: BIOL335: How to annotate a genome

Annotate me!

(The sequence from Page 5 is shown again.)


Page 11: BIOL335: How to annotate a genome

We can use homology...

- Evolution tends to preserve functional genomic regions...
- Example 1: Use an existing set of genes from related species and map these onto your genome (e.g. RATT)
- Example 2: Align two or more related genomes and look for conserved regions; patterns of variation can be indicative of function (e.g. QRNA, RNAz & RNAcode)
- How reliable are these approaches? What are the main weaknesses & strengths?


Page 12: BIOL335: How to annotate a genome

The QRNA approach...

Rivas et al. (2001) Computational identification of noncoding RNAs in E. coli by comparative genomics. Current Biology.


Page 13: BIOL335: How to annotate a genome

DNA encodes Protein

# STOCKHOLM 1.0
#33 unique RNA sequences, 1 peptide sequence
#=GR PR1     G..A..D..V..T..H..P..P..A..G..D..
#=GR PR3     GlyAlaAspValThrHisProProAlaGlyAsp
platypus     GGAGCAGACGTCACTCACCCCCCAGCCGGAGAT
opossum      GGAGCAGATGTTACTCACCCTCCTGCTGGAGAT
sloth        GGAGCAGACGTCACACACCCTCCCGCGGGGGAT
armadillo    GGAGCAGACGTCACGCACCCTCCGGCAGGGGAT
tenrec       GGGGCCGACGTCACGCACCCCCCTGCGGGCGAT
elephant     GGAGCGGATGTCACACACCCGCCTGCGGGGGAT
shrew        GGCGCAGATGTCACGCATCCTCCAGCAGGGGAC
hedgehog     GGAGCAGATGTCACACACCCCCCAGCAGGAGAT
megabat      GGAGCAGATGTCACACACCCTCCTGCAGGAGAT
microbat     GGAGCAGATGTCACCCACCCCCCTGCAGGGGAC
dog          GGAGCGGATGTCACACACCCCCCAGCCGGGGAC
cat          GGAGCCGATGTCACGCACCCCCCAGCAGGGGAT
horse        GGAGCGGATGTCACACACCCTCCGGCAGGGGAT
pika         GGAGCAGATGTCACTCACCCTCCAGCTGGGGAT
rabbit       GGTGCAGATGTCACACACCCCCCAGCTGGAGAT
squirrel     GGAGCAGATGTCACTCACCCTCCAGCGGGAGAT
guinea_pig   GGAGCAGATGTCACACACCCACCAGCGGGAGAT
mouse        GGAGCAGATGTCACTCATCCGCCTGCTGGGGAC
rat          GGAGCAGATGTCACTCATCCACCTGCTGGGGAT
kangaroo_rat GGAGCAGATGTTACACACCCTCCAGCAGGGGAT
tree_shrew   GGCGCAGACGTCACGCACCCCCCGGCCGGGGAT
human        GGAGCGGATGTCACACACCCCCCAGCAGGGGAT
tarsier      GGTGCTGATGTCACACACCCCCCTGCAGGGGAT
marmoset     GGAGCAGATGTCACACACCCACCAGCAGGGGAT
zebrafinch   GGAGCAGATGTCACTCACCCTCCCGCCGGGGAT
green_anole  GGGGCAGACGTCACTCACCCGCCAGCCGGGGAC
xenopus      GGAGCAGATGTTACACACCCACCTGCTGGTGAT
pufferfish   GGTGCGGATGTTACTCATCCTCCTGCTGGTGAT
fugu         GGGGCTGATGTTACTCACCCTCCAGCTGGTGAT
stickleback  GGTGCAGACGTCACACATCCTCCAGCGGGTGAT
medaka       GGTGCCGATGTCACTCATCCTCCTGCCGGGGAC
zebrafish    GGGGCAGATGTTACACACCCGCCGGCTGGTGAT
lamprey      GGTGCCGATGTGACACACCCTCCAGCGGGAGAC
//

[Figure: the standard genetic code drawn as a codon wheel (1st, 2nd and 3rd codon positions), with each amino acid's name, abbreviations, molecular weight, side-chain class (basic, acidic, polar, nonpolar/hydrophobic) and modifications (sumo, methyl, phospho, ubiquitin, N-methyl, O-glycosyl, N-glycosyl).]

Image source: http://upload.wikimedia.org/wikipedia/en/d/d6/GeneticCode21-version-2.svg
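
The pattern in the alignment on this slide can be checked directly. The sketch below takes five of the 33 sequences (copied from the slide), translates them with the standard genetic code, and counts variable alignment columns by codon position. The function names are made up for this example, and the reading frame is assumed to start at the first column, as the `PR1`/`PR3` annotation lines indicate.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Five rows copied from the Stockholm alignment on the slide.
ALIGNMENT = {
    "platypus":  "GGAGCAGACGTCACTCACCCCCCAGCCGGAGAT",
    "human":     "GGAGCGGATGTCACACACCCCCCAGCAGGGGAT",
    "mouse":     "GGAGCAGATGTCACTCATCCGCCTGCTGGGGAC",
    "zebrafish": "GGGGCAGATGTTACACACCCGCCGGCTGGTGAT",
    "lamprey":   "GGTGCCGATGTGACACACCCTCCAGCGGGAGAC",
}


def translate(seq):
    """Translate DNA codon by codon using the standard genetic code."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def variable_sites_by_codon_position(seqs):
    """Count alignment columns with more than one base, per codon position."""
    counts = {1: 0, 2: 0, 3: 0}
    for col, column in enumerate(zip(*seqs)):
        if len(set(column)) > 1:
            counts[col % 3 + 1] += 1
    return counts


peptides = {translate(s) for s in ALIGNMENT.values()}
counts = variable_sites_by_codon_position(list(ALIGNMENT.values()))

if __name__ == "__main__":
    print("peptides:", peptides)
    print("variable sites by codon position:", counts)
```

In this subset all five sequences encode the same peptide, GADVTHPPAGD, and every variable column falls on a third codon position. This three-base periodicity of synonymous change is the kind of signal that the comparative methods named earlier (e.g. RNAcode) look for in protein-coding regions.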

Page 14: BIOL335: How to annotate a genome

DNA encodes RNA

[Figure: tRNA structure in three panels: (A) primary structure, (B) secondary structure and (C) tertiary structure, with the acceptor stem, D loop, anticodon loop, variable loop and TΨC loop labelled.]

Primary structure:
5' GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCCUGUGT.CGAUCCACAGAAUUCGCACCA 3'
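
For RNA genes the conserved signal is base pairing rather than codons. As a small illustration, the sketch below checks that the two ends of the tRNA sequence on this slide can pair to form the acceptor stem. It assumes the usual cloverleaf layout (the first 7 nucleotides pair with the 7 nucleotides just before the 3' ACCA tail) and allows G-U wobble pairs; the symbols D, Y, T and "." in the sequence appear to mark modified bases and are not used here. The names are invented for this example.

```python
# tRNA sequence as printed on the slide (5' to 3').
TRNA = ("GCGGAUUUAGCUCAGDDGGGAGAGCGCCAGACUGAAYA.CUGGAGGUCC"
        "UGUGT.CGAUCCACAGAAUUCGCACCA")

# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def stem_pairs(five_prime, three_prime):
    """Pair the 5' strand with the 3' strand read backwards (antiparallel)."""
    return [(a, b, (a, b) in PAIRS)
            for a, b in zip(five_prime, reversed(three_prime))]


five = TRNA[:7]        # 5' half of the acceptor stem
three = TRNA[-11:-4]   # 7 nt immediately before the 3' ACCA tail
pairs = stem_pairs(five, three)

if __name__ == "__main__":
    for a, b, ok in pairs:
        print(f"{a}-{b} {'pairs' if ok else 'MISMATCH'}")
```

All seven positions pair, one of them as a G-U wobble. Structure-aware methods such as QRNA and RNAz score alignments for pairings like these that are maintained across species.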


Page 15: BIOL335: How to annotate a genome

Homology-based annotation: strengths and weaknesses

- Example 1: map known genes onto genomes
  - Strengths: fast, cheap, ...
  - Weaknesses:
    - Inaccurate for divergent species (e.g. Graeme's Methanobrevibacter or GEBA genomes)
    - Requires manual correction of border-line results
    - Errors are propagated throughout the databases
- Example 2: aligning genomes
  - Strengths:
    - "cheap" if genomes already exist
    - fast for small genomes
    - evolutionary support for all discoveries
  - Weaknesses:
    - Requires lots of powerful computers for large genomes
    - Inaccurate for divergent species (e.g. Neil's tuatara or Graeme's Methanobrevibacter)
    - Requires manual correction of border-line results


Page 16: BIOL335: How to annotate a genome

Homology annotation: nucleotides are difficult to align

[Figure: Conservation of Xfam families in bacterial genomes. x-axis: phylogenetic distance (0.0 to 0.6); y-axis: conserved families (%), 0 to 100; series: Pfam (N=6671) and Rfam (N=331); a second panel shows the frequency of RNA-seq species.]

Lindgreen et al. (2014) Robust identification of noncoding RNA from transcriptomes requires phylogenetically-informed sampling. PLOS Computational Biology.


Page 17: BIOL335: How to annotate a genome

We can use RNA detection methods...

- Remember the central dogma of molecular biology
- Example: sequence RNAs from multiple tissues, developmental stages and environmental conditions
- How reliable is this approach? What are the main weaknesses & strengths?

Wang, Gerstein & Snyder (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics.


Page 18: BIOL335: How to annotate a genome

RNA-seq: strengths and weaknesses

- RNA-seq
  - Strengths:
    - Experimental support for transcribed regions
    - Identifies untranslated regions (UTRs), ncRNAs, antisense RNAs, ...
    - Identifies alternatively spliced and edited RNAs
  - Weaknesses:
    - Expensive & lots of work
    - RNA degradation and genomic contamination
    - Transcription does not prove translation
    - Will miss genes transcribed only in specific developmental stages, tissues & environmental conditions, e.g. lsy-6 microRNA
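
A toy sketch of how mapped reads become "experimental support for transcribed regions". It assumes reads have already been aligned to the genome and reduced to 0-based, half-open (start, end) intervals; the `min_depth` threshold and the function names are hypothetical, and real pipelines work from alignment files and handle strand and splicing.

```python
def coverage(reads, genome_length):
    """Per-base read depth from (start, end) intervals (0-based, half-open)."""
    depth = [0] * genome_length
    for start, end in reads:
        for i in range(max(start, 0), min(end, genome_length)):
            depth[i] += 1
    return depth


def transcribed_regions(depth, min_depth=2):
    """Merge consecutive positions with depth >= min_depth into regions."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = i                       # region opens
        elif d < min_depth and start is not None:
            regions.append((start, i))      # region closes
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


if __name__ == "__main__":
    reads = [(0, 5), (2, 8), (3, 9), (15, 20)]
    depth = coverage(reads, 20)
    print(transcribed_regions(depth, min_depth=2))
    print(transcribed_regions(depth, min_depth=1))
```

The threshold makes the weaknesses above concrete: a gene expressed only in an unsampled tissue has zero depth and is never reported, while a low threshold lets stray reads from genomic contamination through.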


Page 19: BIOL335: How to annotate a genome

We can use protein detection methods...

- Central dogma of molecular biology
- Example: Protein mass spectrometry
- How reliable is this approach? What are the main weaknesses & strengths?

Figure from: http://en.wikipedia.org/wiki/Protein_mass_spectrometry


Page 20: BIOL335: How to annotate a genome

Protein mass spectrometry: strengths and weaknesses

- Protein mass spectrometry
  - Strengths:
    - Experimental support for translated regions
    - Identifies alternative isoforms and post-translational modifications (Ezkurdia et al. 2012)
  - Weaknesses:
    - Expensive & lots of work
    - Misses genes expressed only in specific developmental stages, tissues & environmental conditions
    - Current technology generally detects only the most abundant proteins

Ezkurdia et al. (2012) Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function. Mol Biol Evol.
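
One way peptide evidence can be tied back to gene predictions, sketched under stated assumptions: the slides do not describe the workflow, so this uses the common simplification that trypsin cleaves after K or R unless the next residue is P, and matches observed peptide strings exactly against an in-silico digest of predicted proteins. All names, sequences and the `min_len` cut-off are hypothetical.

```python
def tryptic_digest(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])    # C-terminal remainder
    return peptides


def supported_proteins(predicted, observed_peptides, min_len=5):
    """Map each predicted protein to the observed peptides it explains."""
    observed = set(observed_peptides)
    support = {}
    for name, seq in predicted.items():
        hits = [p for p in tryptic_digest(seq)
                if len(p) >= min_len and p in observed]
        if hits:
            support[name] = hits
    return support


if __name__ == "__main__":
    predicted = {"orf1": "MKRAPKPGR", "orf2": "MSTLLNK"}   # made-up proteins
    print(tryptic_digest(predicted["orf1"]))
    print(supported_proteins(predicted, ["APKPGR"]))
```

A predicted ORF with no matching peptide is not thereby refuted: as the weaknesses above note, low-abundance or condition-specific proteins are often simply not detected.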


Page 21: BIOL335: How to annotate a genome

How cool is this?!

Schwanhausser et al. (2011) Global quantification of mammalian gene expression control. Nature


Page 22: BIOL335: How to annotate a genome

This is also kinda neat...

Lu et al. (2007) Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nature Biotechnology.


Page 23: BIOL335: How to annotate a genome

Relevant reading

- Reviews:
  - Stein L (2001) Genome annotation: from sequence to biology. Nature Reviews Genetics.
  - Reed JL et al. (2006) Towards multidimensional genome annotation. Nature Reviews Genetics.
- ORF finding:
  - Delcher AL et al. (2007) Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics.
  - Hyatt D et al. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics.
- RNA-seq (Ant's lectures):
  - Wang, Gerstein & Snyder (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics.
- Proteomics (Sarah's lectures):
  - Ezkurdia et al. (2012) Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function. Mol Biol Evol.


Page 24: BIOL335: How to annotate a genome

Homework: How to make a sequence alignment?

- Play: http://phylo.cs.mcgill.ca
- or even better, play Ribo: http://ribo.cs.mcgill.ca/
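
For comparison with aligning by hand in the games above, here is a minimal sketch of how a computer aligns two sequences. The slides do not name an algorithm; this assumes Needleman-Wunsch global alignment with illustrative scores (match +1, mismatch -1, gap -1).

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming table of best prefix scores.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("GATACA", "GATTACA")
    print(s)
    print(x)
    print(y)
```

Changing the three scores changes which alignment wins, which is the same trade-off between mismatches and gaps that the games ask players to judge.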


Page 25: BIOL335: How to annotate a genome

The End
