dark matters in the genomes


Post on 08-Jan-2016

Page 1: Dark matters in the genomes

Dark matters in the genomes

Shin-Han Shiu

Plant Biology / Genetics / EEBB / QBMI

Page 2: Dark matters in the genomes

About myself

Page 3: Dark matters in the genomes

About myself

Page 4: Dark matters in the genomes

About myself

Page 5: Dark matters in the genomes

About myself

Page 6: Dark matters in the genomes

Cell, nucleus, and chromosomes

Page 7: Dark matters in the genomes

DNA

A G G C G T A G A G A G A T C C T T G A T

T C C G C A A C T C T C A A G G A A C A A

Page 8: Dark matters in the genomes

DNA and Genome

Genome is all the DNA in a cell, made up of A, T, G, C...

How many A's, T's, G's, and C's are there in the human genome?

3,200,000,000 letters

A sizable book, say, The Lord of the Rings: The Fellowship of the Ring

764,470 characters in 410 pages

~2,000 characters per page

The book of our life

1,600,000 pages

4,186 copies of The Fellowship of the Ring
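The book comparison is simple arithmetic, and a short sketch makes the rounding visible. All numbers below come from the slide; the variable names are only illustrative.

```python
# Numbers copied from the slide; names are illustrative only.
GENOME_LETTERS = 3_200_000_000      # letters in the human genome
BOOK_CHARACTERS = 764_470           # characters in the book
BOOK_PAGES = 410                    # pages in the book
ROUNDED_CHARS_PER_PAGE = 2_000      # the slide's rounded page size

# Exact characters per page, before the slide rounds to ~2,000.
chars_per_page = BOOK_CHARACTERS / BOOK_PAGES

# Pages needed for the genome at the rounded page size.
genome_pages = GENOME_LETTERS / ROUNDED_CHARS_PER_PAGE

# Number of copies of the book with the same number of characters.
book_copies = GENOME_LETTERS / BOOK_CHARACTERS

print(f"{chars_per_page:.0f} characters per page")
print(f"{genome_pages:,.0f} pages")
print(f"{book_copies:,.0f} copies of the book")
```

The page count uses the rounded 2,000 characters per page, while the number of copies uses the exact character count, which is why the two figures are not simple multiples of each other.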

Page 9: Dark matters in the genomes

Genome sequencing chronology

Year  Organism                   Significance                Genome size (bp)  Number of genes
1977  Bacteriophage φX174        First genome ever!          5,386             11
1981  Human mitochondria         First organelle             16,500            37
1995  Haemophilus influenzae Rd  First free-living organism  1,830,137         ~3,500
1996  Saccharomyces cerevisiae   First eukaryote             12,086,000        ~6,000

Page 10: Dark matters in the genomes

Genome sequencing chronology

Year  Organism                 Significance                  Genome size (bp)  Number of genes
1998  Caenorhabditis elegans   First multicellular organism  97,000,000        ~19,000
1999  Human chromosome 22      First human chromosome        49,000,000        673
2000  Drosophila melanogaster  First insect                  150,000,000       ~14,000
2000  Arabidopsis thaliana     First plant genome            150,000,000       ~25,000

Page 11: Dark matters in the genomes

Now, 1,366 genomes are sequenced or being sequenced

Page 12: Dark matters in the genomes

Between human and other animals

How much do our and chimp genomes differ?
0.1%   1%   10%   50%   90%

How many genes do you think we share with worm?
1%   10%   50%   75%   99%

Page 13: Dark matters in the genomes

Genome and better food

Page 14: Dark matters in the genomes

Basic understanding of science & environment

Page 15: Dark matters in the genomes

Our research interest

[Slide background: several kilobases of raw genomic DNA sequence (A, T, G, C)]

Page 16: Dark matters in the genomes

Evolution of genome sizes

C-value: 1 pg ≈ 0.98 Gb (equivalently, 1 Gb ≈ 1.02 pg)

Thale cress (Arabidopsis thaliana): 0.16 pg

Fruit fly (Drosophila melanogaster): 0.18 pg

Pufferfish (Takifugu rubripes): 0.4 pg

Human (Homo sapiens): 3.5 pg

Onion (Allium cepa): 16.75 pg

Tiger salamander (Ambystoma tigrinum): 32 pg

Marbled lungfish (Protopterus aethiopicus): 132 pg

http://www.rbgkew.org.uk/
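To get a feel for the range of genome sizes, the C-values can be converted from picograms to gigabases. This is a minimal sketch: the C-values are the ones listed on the slide, while the conversion factor of about 0.978 Gb per pg is an assumption based on the commonly used mass-to-base-pair conversion, and the names are illustrative.

```python
# Assumption (not taken from the slide): 1 pg of DNA is about 0.978 Gb.
GB_PER_PG = 0.978

# C-values in picograms, copied from the slide.
C_VALUES_PG = {
    "Thale cress (Arabidopsis thaliana)": 0.16,
    "Fruit fly (Drosophila melanogaster)": 0.18,
    "Pufferfish (Takifugu rubripes)": 0.4,
    "Human (Homo sapiens)": 3.5,
    "Onion (Allium cepa)": 16.75,
    "Tiger salamander (Ambystoma tigrinum)": 32,
    "Marbled lungfish (Protopterus aethiopicus)": 132,
}


def c_value_to_gb(picograms):
    """Convert a C-value in picograms to an approximate size in gigabases."""
    return picograms * GB_PER_PG


genome_sizes_gb = {name: c_value_to_gb(pg) for name, pg in C_VALUES_PG.items()}

# Ratio between the largest and smallest genome in the list.
fold_range = max(C_VALUES_PG.values()) / min(C_VALUES_PG.values())

for name, gb in genome_sizes_gb.items():
    print(f"{name}: {gb:.2f} Gb")
print(f"Largest / smallest: {fold_range:.0f}-fold")
```

The fold range does not depend on the conversion factor, so the contrast between lungfish and thale cress holds whichever constant is used.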

Page 17: Dark matters in the genomes

Genic region and genome size

Dan Graur

Page 18: Dark matters in the genomes

What's in the genome

Genome

Annotated genes

Exon

UTR

Intron

Cis-regulatory elements

Selfish elements

Novel genes

Dead genes (pseudogenes)

Page 19: Dark matters in the genomes

"Non-genic": repetitive elements

E.g. Human genome

                             A     B     C
Exons take up?               1%    24%   25%
Introns account for?         24%   1%    25%
Repetitive elements occupy?  35%   60%   45%
Unknown?                     40%   15%   5%

Venter et al. (2001) Science 291:1304

Page 20: Dark matters in the genomes

What is in the unknown regions?

Investigate with tiling array

cDNA array

Tiling array

Gap size: 10 bp
Probe size: 25 bp

Number of features:
Arabidopsis, 135 Mb, 1 chip, ~6 × 10^6 features
Human, 3 Gb, 7 chips, ~4.2 × 10^7 features

Page 21: Dark matters in the genomes

"Non-genic": unannotated genes

Kapranov et al., 2002. Science

Tiling array analysis of human Chr 21, 22

Page 22: Dark matters in the genomes

Tiling array analysis of human transcriptome

Kapranov et al., 2002. Science

Human Chr 21, 22

What do you think these expressed regions represent?

Page 23: Dark matters in the genomes

Difficulties for coding gene prediction

Training data
You need to know something...
"Biased" toward the properties of the majority.

Real genes that are shorter tend to be much harder to predict.

Table 3 Accuracy of GISMO, Glimmer and CRITICA in predicting short genes (<300 bp)

Gene finder  Cor   Sn    Snfk (%)  Sp
GISMO        0.64  63.0  86.4      69.0
Glimmer      0.54  72.0  83.7      44.0
CRITICA      0.60  46.0  67.4      84.0

Snfk denotes the sensitivity in detecting function-known genes.

Krause et al., 2006. Nucleic Acids Res. 35:540

Page 24: Dark matters in the genomes

Novel coding sequence identification

Arabidopsis thaliana as an example
135 Mb, ~50% occupied by annotated genes.
Focus on coding sequences 90-300 bp long.

What would you do next to eliminate ORFs that are likely false predictions?

133,090 sORFs
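The slide does not say how the candidate sORFs were collected, so the following is only a hedged sketch of one simple way to enumerate them: scan all six reading frames for ATG-to-stop stretches whose length falls in the 90-300 bp window. The function names, the ATG start rule, and the choice to count the stop codon in the length are all assumptions, not the authors' procedure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sorfs(seq, min_len=90, max_len=300):
    """List small ORFs as (strand, start, end, orf_sequence).

    An ORF here runs from the first in-frame ATG to the next in-frame stop
    codon; its length includes the stop codon. Coordinates are 0-based,
    end-exclusive, and refer to the scanned strand (for "-" they are
    positions on the reverse complement).
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if min_len <= length <= max_len:
                        hits.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # close it and keep scanning
    return hits


# Toy input: a 96 bp ORF with two flanking bases on each side.
toy_orf = "ATG" + "GCT" * 30 + "TAA"
toy_hits = find_sorfs("CC" + toy_orf + "GG")
print(toy_hits[0][:3])
```

A scan like this deliberately over-predicts, which is the point of the slide's question: most of the stretches it returns occur by chance and need to be filtered by further criteria.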

Page 25: Dark matters in the genomes

Criterion 1: Codon usage bias

Some codons are used more frequently than others

http://www.cbs.dtu.dk/services/GenomeAtlas/

Page 26: Dark matters in the genomes

Criterion 1: Codon usage bias

For example: codons for proline

Codon  NCDS  CDS
CCT    0.25  0.12
CCC    0.25  0.49
CCA    0.25  0.06
CCG    0.25  0.33

Suppose you have the following 2 sequences, both coding for poly-proline. Which one is more likely to be a real coding sequence?

Seq1  CCT CCA CCT
Seq2  CCC CCG CCC

$$p(\mathrm{Seq1} \mid CDS) = 0.12 \times 0.06 \times 0.12 = 8.6 \times 10^{-4}$$

$$p(\mathrm{Seq2} \mid CDS) = 0.49 \times 0.33 \times 0.49 = 7.9 \times 10^{-2}$$
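The two products can be checked, and compared against the non-coding model, with a few lines of code. The codon frequencies are the ones in the proline table on the slide; the function names and the idea of reporting a CDS/NCDS likelihood ratio are illustrative additions.

```python
from math import prod

# Codon frequencies from the slide's proline example.
CODON_FREQ = {
    "CDS":  {"CCT": 0.12, "CCC": 0.49, "CCA": 0.06, "CCG": 0.33},
    "NCDS": {"CCT": 0.25, "CCC": 0.25, "CCA": 0.25, "CCG": 0.25},
}


def split_codons(seq):
    """Split a sequence into codons, ignoring spaces and any trailing bases."""
    seq = seq.replace(" ", "").upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def likelihood(seq, model):
    """Probability of the codon string under the 'CDS' or 'NCDS' model."""
    freqs = CODON_FREQ[model]
    return prod(freqs[codon] for codon in split_codons(seq))


def likelihood_ratio(seq):
    """Values above 1 favour the coding model, below 1 the non-coding model."""
    return likelihood(seq, "CDS") / likelihood(seq, "NCDS")


seq1 = "CCT CCA CCT"
seq2 = "CCC CCG CCC"
for name, seq in (("Seq1", seq1), ("Seq2", seq2)):
    print(name, f"{likelihood(seq, 'CDS'):.2e}", f"ratio={likelihood_ratio(seq):.2f}")
```

Under the uniform non-coding model each sequence has probability 0.25^3, so Seq2 is favoured by the coding model while Seq1 is not.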

Page 27: Dark matters in the genomes

Novel CDS identification

Posterior probability calculation

Bayes' theorem

$$P(CDS \mid S) = \frac{P(S \mid CDS)\,P(CDS)}{P(S \mid CDS)\,P(CDS) + P(S \mid NCDS)\,P(NCDS)}$$

$$P(S \mid CDS)\,P(CDS) = \sum_{m=1}^{6} P(S \mid CDS_m)\,P(CDS_m)$$

$$P(CDS) = P(NCDS) = \frac{1}{2}$$

$$P(CDS_1) = P(CDS_2) = \dots = P(CDS_6) = \frac{1}{2} \cdot \frac{1}{6}$$
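The posterior calculation can be sketched directly from these formulas. This is a minimal illustration, assuming equal priors for CDS and NCDS and equal priors across the six CDS models as on the slide; the function name and the likelihood values in the example are hypothetical.

```python
def posterior_cds(lik_cds_models, lik_ncds, prior_cds=0.5):
    """Return P(CDS | S) from per-model likelihoods.

    lik_cds_models: P(S | CDS_m) for each CDS model m (six on the slide).
    lik_ncds:       P(S | NCDS).
    prior_cds:      P(CDS); P(NCDS) is taken as 1 - prior_cds, and the CDS
                    prior is shared equally among the CDS models.
    """
    prior_per_model = prior_cds / len(lik_cds_models)
    joint_cds = sum(lik * prior_per_model for lik in lik_cds_models)
    joint_ncds = lik_ncds * (1.0 - prior_cds)
    return joint_cds / (joint_cds + joint_ncds)


# Hypothetical likelihoods: one CDS model fits well, the other five poorly.
example_posterior = posterior_cds([4e-3, 1e-4, 1e-4, 1e-4, 1e-4, 1e-4], 5e-4)
print(f"P(CDS | S) = {example_posterior:.2f}")

# If every model explains the sequence equally well, the prior is returned.
uninformative_posterior = posterior_cds([1e-3] * 6, 1e-3)
print(f"P(CDS | S) = {uninformative_posterior:.2f}")
```

For real sequences the likelihoods are products of many small numbers, so an implementation would normally work with log-probabilities to avoid underflow.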

Page 28: Dark matters in the genomes

Novel CDS identification

Determine base composition probabilities

[Diagram: coding sequences -> CDS parameters (c1, c2, c3, c4, c5, c6); non-coding sequences -> NCDS parameters (n)]

$$P_{CDS}(T \mid AAA) = \frac{P_{CDS}(AAAT)}{P_{CDS}(AAA)}, \qquad P_{CDS}(AAA) = \frac{N^{CDS}_{AAA}}{N^{CDS}_{NNN}}$$

Feature tables

       A     T     G     C
AAA    0.01  0.31  0.03  0.02
AAT    0.03  0.17  0.01  0.15
AAG    0.02  0.00  0.02  0.02
AAC    0.15  0.02  0.02  0.05
ATA    0.04  0.03  0.05  0.00
ATT    0.06  0.01  0.07  0.02
ATC    0.01  0.01  0.05  0.29
...    ...   ...   ...   ...
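A feature table of this kind can be estimated by counting. The sketch below assumes the table holds the probability of the next base given the preceding three bases, estimated from a set of training sequences; the function name and the toy training sequence are illustrative, and the slide's separate parameter sets c1 to c6 are not modelled here.

```python
from collections import Counter


def conditional_base_probs(sequences, k=3):
    """Estimate P(next base | preceding k bases) from training sequences.

    Returns {context: {base: probability}} for every context that was seen
    with a following base. Windows containing letters other than A, C, G, T
    are skipped.
    """
    context_counts = Counter()
    kmer_counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k):
            window = seq[i:i + k + 1]
            if set(window) - set("ACGT"):
                continue
            context_counts[window[:k]] += 1   # N(AAA), counted with a successor
            kmer_counts[window] += 1          # N(AAAT)
    return {
        context: {base: kmer_counts[context + base] / total for base in "ATGC"}
        for context, total in context_counts.items()
    }


# Toy training data: AAA is followed by T twice and by C once.
toy_table = conditional_base_probs(["AAATAAATAAAC"])
print(toy_table["AAA"])
```

Running the same function once on known coding sequences and once on known non-coding sequences would give one table per class, which is what the likelihoods in the posterior calculation need.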

Page 29: Dark matters in the genomes

Posterior probability of coding sequence

Compare known non-coding and coding sequences

Hanada et al., 2007. Genome Res.

Page 30: Dark matters in the genomes

Posterior probability of coding sequence

Scanning Arabidopsis genome

Hanada et al., 2007. Genome Res.

Page 31: Dark matters in the genomes

After applying the first criterion

7,442 coding sORFs

Page 32: Dark matters in the genomes

How good is the CDS finding measure

For the training data

For 18 Arabidopsis small protein genes
All 18 are predicted as CDS.

For 84 yeast small protein genes
All 84 are predicted as CDS.

Page 33: Dark matters in the genomes

So what does this mean?

If a sequence is a true coding sequence
Our approach can predict it with high accuracy.
So, the sensitivity is very good.

Is this good enough?

What about specificity?
Namely, how good is the criterion in excluding false positives?
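The distinction between the two measures is easy to state in code. The sensitivity figures below use the counts from the previous slide (18 of 18 and 84 of 84); the negative-set counts used for specificity are hypothetical, since the slides give no such numbers, and the function names are illustrative.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real coding sequences that are predicted as CDS."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Fraction of non-coding sequences that are correctly rejected."""
    return true_negatives / (true_negatives + false_positives)


# Counts from the slides: every known small protein gene was predicted as CDS.
arabidopsis_sensitivity = sensitivity(true_positives=18, false_negatives=0)
yeast_sensitivity = sensitivity(true_positives=84, false_negatives=0)

# Hypothetical negative set (not from the slides): 1,000 known non-coding
# sequences, of which 50 are wrongly called CDS.
example_specificity = specificity(true_negatives=950, false_positives=50)

print(arabidopsis_sensitivity, yeast_sensitivity, example_specificity)
```

Perfect sensitivity says nothing about the second number: a predictor that calls everything a CDS would also score 18 of 18 and 84 of 84.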

Page 34: Dark matters in the genomes

Criterion 2: Expression

What would be the expression level you would expect for true CDS compared to false CDS?

Tiling array

Gap size: 10 bp
Probe size: 25 bp

[Plot: frequency (y-axis) versus expression level (x-axis)]

Page 35: Dark matters in the genomes

Comparison of expression levels

Exon, intron, tRNA, rRNA, our predictions

A: Exon
B: Intron
C: Predicted novel CDS
D: tRNA
E: rRNA

Page 36: Dark matters in the genomes

Applying the second criterion

Prediction significantly enriched in expressed sequences

2,996 transcribed sORFs

Page 37: Dark matters in the genomes

Criterion 3: Purifying selection

Compare known coding and non-coding sequences

$$\omega = \frac{K_a}{K_s}$$

$K_a$: non-synonymous substitution rate
$K_s$: synonymous substitution rate

$\omega < 1$: negative (purifying) selection
$\omega = 1$: selectively neutral
$\omega > 1$: positive selection
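The three cases can be written as a small classifier. This is a sketch only: the rates in the example are hypothetical, the `tolerance` parameter is an illustrative addition, and real analyses test whether the ratio differs significantly from 1 rather than comparing point estimates.

```python
def omega(ka, ks):
    """Ratio of non-synonymous (Ka) to synonymous (Ks) substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to compute Ka/Ks")
    return ka / ks


def selection_regime(ka, ks, tolerance=0.0):
    """Classify a Ka/Ks ratio into the three regimes named on the slide.

    tolerance is a hypothetical convenience: ratios within this distance
    of 1 are reported as selectively neutral.
    """
    w = omega(ka, ks)
    if abs(w - 1.0) <= tolerance:
        return "selectively neutral"
    if w < 1.0:
        return "negative (purifying) selection"
    return "positive selection"


# Hypothetical rates for illustration.
print(selection_regime(ka=0.02, ks=0.2))   # ratio 0.1
print(selection_regime(ka=0.3, ks=0.1))    # ratio about 3
print(selection_regime(ka=0.1, ks=0.1))    # ratio 1
```

For the sORF screen, the case of interest is a ratio below 1, since a sequence evolving under purifying selection is likely to encode something functional.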

Page 38: Dark matters in the genomes

Criterion 3: Purifying selection

Compare known coding and non-coding sequences

Page 39: Dark matters in the genomes

Our research interests

[Figure labels: 30,000; 25,000; 10,000; 6,000; 45,000; 17,000]

Page 40: Dark matters in the genomes

Duplication Mechanism and Loss Rate

Gene duplications

Mechanisms

Preferential retention

Consequences

Page 41: Dark matters in the genomes

Duplication mechanisms

Whole genome duplication

Tandem duplication

Segmental duplication

Duplicative transposition

Page 42: Dark matters in the genomes

Differences in Duplicability

Category                       Arabidopsis  Human
Defense response
Proteolysis
Transport
Ion channel activity
Metabolism
Development
Protein kinase activity
Transcription factor activity

Duplicability
The propensity for the retention of a duplicate gene
Computational analysis of genome-wide trend

Page 43: Dark matters in the genomes

Functional Consequences of Duplication

Functional divergence and conservation
Is it because of changes in cis-regulatory elements or coding sequences?

How are duplicates retained: subfunctionalization or neofunctionalization?

Page 44: Dark matters in the genomes

Acknowledgement

Lab members
Kousuke Hanada
Melissa Lehti-Shiu
Cheng Zou

TIGR
Chris Town
Hank Wu

University of Chicago
Wen-Hsiung Li
Justin O. Borevitz
Xu Zhang

Funding