dark matters in the genomes
Dark matters in the genomes
Shin-Han Shiu
Plant Biology / Genetics / EEBB / QBMI
About myself
Cell, nucleus, and chromosomes
DNA
A G G C G T A G A G A G A T C C T T G A T
T C C G C A A C T C T C A A G G A A C A A
DNA and Genome
The genome is all the DNA in a cell, made up of A, T, G, C...
How many A's, T's, G's, and C's are there in the human genome?
3,200,000,000 letters
A sizable book, say, The Lord of the Rings: The Fellowship of the Ring
764,470 characters in 410 pages
~2,000 characters per page
The book of our life
1,600,000 pages
4,186 copies of The Fellowship of the Ring
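The page and book counts above follow from two divisions. A minimal sketch of that arithmetic, using only the figures quoted on the slide (the function and constant names are illustrative, and the per-page value is the slide's rounded estimate):

```python
# Figures taken from the slide; names are illustrative only.
GENOME_LETTERS = 3_200_000_000   # letters (bases) in the human genome
BOOK_CHARACTERS = 764_470        # characters in the book, per the slide
CHARS_PER_PAGE = 2_000           # rounded characters per page, per the slide


def pages_needed(letters: int, chars_per_page: int) -> float:
    """Pages required to print `letters` characters."""
    return letters / chars_per_page


def book_copies(letters: int, book_characters: int) -> float:
    """Number of copies of the book that hold `letters` characters."""
    return letters / book_characters


print(f"{pages_needed(GENOME_LETTERS, CHARS_PER_PAGE):,.0f} pages")
print(f"{book_copies(GENOME_LETTERS, BOOK_CHARACTERS):,.0f} copies of the book")
```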
Genome sequencing chronology
Year  Organism                   Significance                  Genome size (bp)  Number of genes
1977  Bacteriophage φX174        First genome ever!            5,386             11
1981  Human mitochondria         First organelle               16,500            37
1995  Haemophilus influenzae Rd  First free-living organism    1,830,137         ~3,500
1996  Saccharomyces cerevisiae   First eukaryote               12,086,000        ~6,000
1998  Caenorhabditis elegans     First multicellular organism  97,000,000        ~19,000
1999  Human chromosome 22        First human chromosome        49,000,000        673
2000  Drosophila melanogaster    First insect                  150,000,000       ~14,000
2000  Arabidopsis thaliana       First plant genome            150,000,000       ~25,000
Now, 1,366 genomes are sequenced or being sequenced.
Between human and other animals
How much do our genome and the chimp genome differ? 0.1%, 1%, 10%, 50%, 90%
How many genes do you think we share with the worm? 1%, 10%, 50%, 75%, 99%
Genome and better food
Basic understanding of science & environment
Our research interest
[Slide background: a wall of raw genomic DNA sequence, several kilobases of A, T, G, and C with no annotation]
Evolution of genome sizes
C-value: 1 pg ≈ 0.98 Gb (equivalently, 1 Gb ≈ 1.02 pg)
Thale cress (Arabidopsis thaliana): 0.16 pg
Fruit fly (Drosophila melanogaster): 0.18 pg
Pufferfish (Takifugu rubripes): 0.4 pg
Human (Homo sapiens): 3.5 pg
Onion (Allium cepa): 16.75 pg
Tiger salamander (Ambystoma tigrinum): 32 pg
Marbled lungfish (Protopterus aethiopicus): 132 pg
http://www.rbgkew.org.uk/
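To compare these C-values with genome sizes given in base pairs, the picogram figures can be converted. A minimal sketch, assuming the commonly used factor of about 0.978 Gb per picogram of DNA (the constant and function names are illustrative, not from the slides):

```python
# Assumed conversion factor: 1 pg of DNA is roughly 0.978 Gb (978 Mb).
PG_TO_GB = 0.978

# C-values in picograms, as listed on the slide.
C_VALUES_PG = {
    "Thale cress (Arabidopsis thaliana)": 0.16,
    "Fruit fly (Drosophila melanogaster)": 0.18,
    "Pufferfish (Takifugu rubripes)": 0.4,
    "Human (Homo sapiens)": 3.5,
    "Onion (Allium cepa)": 16.75,
    "Tiger salamander (Ambystoma tigrinum)": 32,
    "Marbled lungfish (Protopterus aethiopicus)": 132,
}


def c_value_to_gb(picograms: float) -> float:
    """Convert a C-value in picograms to an approximate size in gigabases."""
    return picograms * PG_TO_GB


human_pg = C_VALUES_PG["Human (Homo sapiens)"]
for species, pg in C_VALUES_PG.items():
    # Also report each genome relative to the human C-value.
    print(f"{species}: {pg} pg ~ {c_value_to_gb(pg):.2f} Gb "
          f"({pg / human_pg:.2f}x human)")
```

The spread is the point of the slide: the largest listed genome is several hundred times the smallest, and the ordering does not track organismal complexity.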
Genic region and genome size
Dan Graur
What's in the genome
Genome
Annotated genes
Exon
UTR
Intron
Cis-regulatory elements
Selfish elements
Novel genes
Dead genes (pseudogenes)
"Non-genic": repetitive elements
E.g. Human genome
                             A    B    C
Exons take up?               1%   24%  25%
Introns account for?         24%  1%   25%
Repetitive elements occupy?  35%  60%  45%
Unknown?                     40%  15%  5%
Venter et al. (2001) Science 291:1304
What is in the unknown regions?
Investigate with tiling array
cDNA array
Tiling array
Probe size: 25 bp; gap size: 10 bp
Number of features:
Arabidopsis, 135 Mb, 1 chip, ~6×10^6 features
Human, 3 Gb, 7 chips, ~4.2×10^7 features
"Non-genic": unannotated genes
Kapranov et al., 2002. Science
Tiling array analysis of human Chr 21, 22
Tiling array analysis of human transcriptome
Kapranov et al., 2002. Science
Human Chr 21, 22
What do you think these expressed regions represent?
Difficulties for coding gene prediction
Training data: you need to know something...
"Biased" toward the properties of the majority.
Real genes that are shorter tend to be much harder to predict.
Table 3. Accuracy of GISMO, Glimmer and CRITICA in predicting short genes (<300 bp)
Gene finder  Cor   Sn (%)  Snfk (%)  Sp (%)
GISMO        0.64  63.0    86.4      69.0
Glimmer      0.54  72.0    83.7      44.0
CRITICA      0.60  46.0    67.4      84.0
Snfk denotes the sensitivity in detecting function-known genes.
Krause et al., 2006. Nucleic Acids Res. 35:540
Novel coding sequence identification
Arabidopsis thaliana as an example
135 Mb, ~50% occupied by annotated genes.
Focus on coding sequences 90-300 bp long.
What would you do next to eliminate ORFs that are likely false predictions?
133,090 sORFs
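The starting set of small ORFs (sORFs) can be thought of as the output of a simple scan for short open reading frames. The slides do not say how the ORFs were defined, so the sketch below assumes the simplest definition: an ATG followed in frame by the first stop codon, on either strand, with a total length of 90 to 300 bp including the stop. All names are illustrative.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_len: int, max_len: int):
    """Yield (start, end) for ATG...stop ORFs in the three frames of one strand.

    Length is counted from the A of ATG through the last base of the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if min_len <= i + 3 - start <= max_len:
                    yield start, i + 3
                start = None                   # look for the next ORF


def find_sorfs(seq: str, min_len: int = 90, max_len: int = 300):
    """Return (strand, start, end) tuples for short ORFs on both strands.

    Coordinates on the '-' strand are relative to the reverse complement.
    """
    seq = seq.upper()
    hits = [("+", s, e) for s, e in orfs_in_strand(seq, min_len, max_len)]
    rc = reverse_complement(seq)
    hits += [("-", s, e) for s, e in orfs_in_strand(rc, min_len, max_len)]
    return hits


# Toy sequence (not real data): ATG + 30 alanine codons + stop = 96 bp.
toy = "ATG" + "GCT" * 30 + "TAA"
print(find_sorfs(toy))
```

A scan like this is deliberately permissive, which is why the following slides apply further criteria to remove likely false predictions.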
Criterion 1: Codon usage bias
Some codons are used more frequently than others
http://www.cbs.dtu.dk/services/GenomeAtlas/
Criterion 1: Codon usage bias
For example: codons for proline
Codon  NCDS  CDS
CCT    0.25  0.12
CCC    0.25  0.49
CCA    0.25  0.06
CCG    0.25  0.33
Suppose you have the following 2 sequences, both coding for poly-proline. Which one is more likely to be a real coding sequence?
Seq1: CCT CCA CCT
Seq2: CCC CCG CCC
$$P(\mathrm{Seq1} \mid \mathrm{CDS}) = 0.12 \times 0.06 \times 0.12 = 8.6 \times 10^{-4}$$
$$P(\mathrm{Seq2} \mid \mathrm{CDS}) = 0.49 \times 0.33 \times 0.49 = 7.9 \times 10^{-2}$$
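The comparison of the two sequences can be reproduced by multiplying codon frequencies. A minimal sketch using the proline frequencies from the slide's table; treating codons as independent and comparing against the uniform non-coding (NCDS) frequencies are simplifying assumptions, and the names are illustrative:

```python
from math import prod

# Proline codon frequencies copied from the slide's table.
CDS_FREQ = {"CCT": 0.12, "CCC": 0.49, "CCA": 0.06, "CCG": 0.33}
NCDS_FREQ = {"CCT": 0.25, "CCC": 0.25, "CCA": 0.25, "CCG": 0.25}


def likelihood(codons, freq):
    """Probability of a codon series under a frequency table,
    assuming each codon is drawn independently."""
    return prod(freq[c] for c in codons)


seq1 = ["CCT", "CCA", "CCT"]
seq2 = ["CCC", "CCG", "CCC"]

for name, codons in (("Seq1", seq1), ("Seq2", seq2)):
    p_cds = likelihood(codons, CDS_FREQ)
    p_ncds = likelihood(codons, NCDS_FREQ)
    print(f"{name}: P(seq|CDS)={p_cds:.2e}  P(seq|NCDS)={p_ncds:.2e}  "
          f"ratio={p_cds / p_ncds:.2f}")
```

Under these numbers Seq2 is more probable under the coding model than under the non-coding model, while Seq1 is less probable, so Seq2 is the better candidate for a real coding sequence.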
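Novel CDS identification
Posterior probability calculation
Bayes' theorem
$$P(\mathrm{CDS} \mid S) = \frac{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS})}{P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) + P(S \mid \mathrm{NCDS})\,P(\mathrm{NCDS})}$$
$$P(S \mid \mathrm{CDS})\,P(\mathrm{CDS}) = \sum_{m=1}^{6} P(S \mid \mathrm{CDS}_m)\,P(\mathrm{CDS}_m)$$
$$P(\mathrm{CDS}) = P(\mathrm{NCDS}) = \frac{1}{2}$$
$$P(\mathrm{CDS}_1) = P(\mathrm{CDS}_2) = \dots = P(\mathrm{CDS}_6) = \frac{1}{2} \times \frac{1}{6}$$
Training data: non-coding sequences and coding sequences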
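The posterior calculation on this slide can be sketched in a few lines. This assumes, as the slide's formula appears to, that coding and non-coding are equally likely a priori and that the coding prior is split evenly across six coding models (CDS_1 to CDS_6). The likelihood values in the example are made up for illustration, and the function name is hypothetical.

```python
def posterior_cds(frame_likelihoods, ncds_likelihood, prior_cds=0.5):
    """Posterior probability that sequence S is coding.

    frame_likelihoods: P(S | CDS_m) for each coding model m.
    ncds_likelihood:   P(S | NCDS).
    prior_cds:         P(CDS); the remainder is P(NCDS).
    """
    prior_frame = prior_cds / len(frame_likelihoods)   # P(CDS_m)
    cds_term = sum(lik * prior_frame for lik in frame_likelihoods)
    ncds_term = ncds_likelihood * (1.0 - prior_cds)
    total = cds_term + ncds_term
    if total == 0:
        raise ValueError("all likelihoods are zero; posterior is undefined")
    return cds_term / total


# Hypothetical likelihoods: one coding model fits far better than the rest.
example = posterior_cds([4e-3, 1e-5, 1e-5, 1e-5, 1e-5, 1e-5], 2e-4)
print(f"P(CDS | S) = {example:.3f}")
```

When every model assigns the same likelihood, the posterior stays at the prior of 0.5, which is a quick sanity check on the formula.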
Novel CDS identification
Determine base composition probabilities
Coding sequences -> CDS parameters: feature tables c1, c2, c3, c4, c5, c6
Non-coding sequences -> NCDS parameters: feature table n
$$P_{\mathrm{CDS}}(T \mid AAA) = \frac{P_{\mathrm{CDS}}(AAAT)}{P_{\mathrm{CDS}}(AAA)}$$
$$P_{\mathrm{CDS}}(AAA) = \frac{N_{AAA}}{N_{NNN}}$$
A T G C
AAA 0.01 0.31 0.03 0.02
AAT 0.03 0.17 0.01 0.15
AAG 0.02 0.00 0.02 0.02
AAC 0.15 0.02 0.02 0.05
ATA 0.04 0.03 0.05 0.00
ATT 0.06 0.01 0.07 0.02
ATC 0.01 0.01 0.05 0.29
... ... ... ... ...
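A feature table of this kind can be estimated by counting. The sketch below estimates the probability of each base given the three preceding bases from a set of training sequences. It is a simplified stand-in: it pools all positions into one table rather than building separate tables per frame, uses no pseudocounts, and the training sequences are toy strings, not real data.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def next_base_table(sequences, context_len=3):
    """Estimate P(base | preceding context) from (context_len + 1)-mer counts.

    Returns a dict keyed by context + base, e.g. table["AAAT"] = P(T | AAA).
    """
    counts = kmer_counts(sequences, context_len + 1)
    context_totals = Counter()
    for kmer, n in counts.items():
        context_totals[kmer[:-1]] += n       # occurrences of each context
    return {kmer: n / context_totals[kmer[:-1]] for kmer, n in counts.items()}


# Toy training set (hypothetical): AAA is followed by T twice and by C once.
table = next_base_table(["AAATAAAT", "AAAC"])
print(f"P(T | AAA) = {table['AAAT']:.3f}")
print(f"P(C | AAA) = {table['AAAC']:.3f}")
```

Running the same function on known coding and known non-coding sequences gives the two parameter sets that the posterior calculation compares.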
Posterior probability of coding sequence
Compare known non-coding and coding sequences
Hanada et al., 2007. Genome Res.
Posterior probability of coding sequence
Scanning the Arabidopsis genome
Hanada et al., 2007. Genome Res.
After applying the first criterion
7,442 coding sORFs
How good is the CDS finding measure?
For the training data
For 18 Arabidopsis small protein genes: all 18 are predicted as CDS.
For 84 yeast small protein genes: all 84 are predicted as CDS.
So what does this mean?
If a sequence is a true coding sequence, our approach can predict it with high accuracy.
So, the sensitivity is very good.
Is this good enough?
What about specificity?
Namely, how good is the criterion at excluding false positives?
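The distinction between sensitivity and specificity can be made concrete with counts. The sketch below uses the textbook definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); note that some gene-finding papers instead report TP / (TP + FP) under the name specificity. The positive counts come from the slide; the negative-set counts are hypothetical, since the slides give none.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real coding sequences that are predicted as CDS."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-coding sequences that are correctly rejected."""
    return true_neg / (true_neg + false_pos)


# From the slide: all 18 Arabidopsis and all 84 yeast small protein genes
# were predicted as CDS, so there are no false negatives.
print(f"Arabidopsis sensitivity: {sensitivity(18, 0):.2f}")
print(f"Yeast sensitivity:       {sensitivity(84, 0):.2f}")

# Hypothetical negative set: 1,000 known non-coding sequences,
# 50 of which are wrongly called coding.
print(f"Specificity (made-up counts): {specificity(950, 50):.2f}")
```

Perfect sensitivity on known genes says nothing about how many non-coding sequences slip through, which is why additional criteria are needed.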
Criterion 2: Expression
What expression level would you expect for true CDS compared to false CDS?
Tiling array: probe size 25 bp, gap size 10 bp
[Figure: frequency distribution of expression level; x-axis: expression level, y-axis: frequency]
Comparison of expression levels
A: Exon
B: Intron
C: Predicted novel CDS
D: tRNA
E: rRNA
Exon, intron, tRNA, rRNA, our predictions
Applying the second criterion
Predictions significantly enriched in expressed sequences
2,996 transcribed sORFs
Criterion 3: Purifying selection
Compare known coding and non-coding sequences
$$\omega = \frac{K_a}{K_s}$$
$K_a$: non-synonymous substitution rate
$K_s$: synonymous substitution rate
$\omega > 1$: positive selection
$\omega = 1$: selectively neutral
$\omega < 1$: negative (purifying) selection
Criterion 3: Purifying selection
Compare known coding and non-coding sequences
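The third criterion rests on the ratio of the non-synonymous substitution rate (Ka) to the synonymous substitution rate (Ks). A minimal sketch of how a pair of rates maps onto the three selection regimes; the `tolerance` parameter and the example rates are hypothetical, and real analyses rely on a statistical test of whether the ratio differs from 1 rather than a fixed cutoff.

```python
def omega(ka: float, ks: float) -> float:
    """Ratio of non-synonymous (Ka) to synonymous (Ks) substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to compute Ka/Ks")
    return ka / ks


def classify_selection(ka: float, ks: float, tolerance: float = 0.0) -> str:
    """Label the selection regime implied by Ka/Ks.

    tolerance: half-width of the band around 1 treated as neutral
    (an illustrative knob, not part of the original method).
    """
    w = omega(ka, ks)
    if abs(w - 1.0) <= tolerance:
        return "selectively neutral"
    if w < 1.0:
        return "negative (purifying) selection"
    return "positive selection"


# Hypothetical rate pairs for illustration.
for ka, ks in ((0.02, 0.20), (0.20, 0.20), (0.30, 0.20)):
    print(f"Ka={ka}, Ks={ks}: omega={omega(ka, ks):.2f} -> "
          f"{classify_selection(ka, ks)}")
```

A predicted sORF whose ratio is well below 1 behaves like a known coding sequence: changes that alter the protein are being removed, which is evidence that the protein matters.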
Our research interests
[Figure; numeric labels recovered from the slide: 30,000, 25,000, 10,000, 6,000, 45,000, 17,000]
Duplication Mechanism and Loss Rate
Gene duplications: mechanisms, consequences, preferential retention
Duplication mechanisms
Whole genome duplication
Tandem duplication
Segmental duplication
Duplicative transposition
Differences in Duplicability
Category  Arabidopsis  Human
Defense response
Proteolysis
Transport
Ion channel activity
Metabolism
Development
Protein kinase activity
Transcription factor activity
Duplicability: the propensity for the retention of a duplicate gene
Computational analysis of genome-wide trend
Functional Consequences of Duplication
Functional divergence and conservation
Is it because of changes in cis-regulatory elements or coding sequences?
How are duplicates retained: subfunctionalization or neofunctionalization?
Acknowledgement
Lab members
Kousuke Hanada
Melissa Lehti-Shiu
Cheng Zou
TIGR: Chris Town, Hank Wu
University of Chicago: Wen-Hsiung Li, Justin O. Borevitz, Xu Zhang
Funding