Inside the Genome

Post on 20-Dec-2015


TRANSCRIPT

Page 1: 1 Inside the Genome. 2 2001: The Human Genome Venter et al., Science 291:1304-1351 (2001) International Human Genome Sequencing Consortium, Nature, 409:

Inside the Genome

Page 2:

2001: The Human Genome

Venter et al., Science 291:1304-1351 (2001)

International Human Genome Sequencing Consortium, Nature, 409: 860-921 (2001)

The club resident JD Watson Back2back with DJ. Venter and

Page 3:

Prologue

RNA world – the dark matter of genomics

How many coding genes in the human genome?
– The Bet of 2000: mean 61,710; range 30,000 – 150,000
– By the end of the genome project the estimated number of human protein-coding genes declined to only ~25,000
– What is the source of that discrepancy? EST-based estimation vs. whole-genome annotation

Page 4:

RNA revolution

The majority of the transcriptional output comes from non-coding RNA
– an average of 10% of the human genome (compared with ~1.5% exonic sequences) gives rise to transcripts [Cheng et al. 2005]
– Or even more... 62% of the mouse genome is transcribed [FANTOM3: Science 2005]

Page 5:

Various RNAs – A partial list…

Messenger RNA (mRNA)
Ribosomal RNA (rRNA)
Transfer RNA (tRNA)
Small nuclear RNA (snRNA)
Small nucleolar RNA (snoRNA)
Short interfering RNA (siRNA)
Micro RNA (miRNA)

Page 6:

RNAs are not merely the intermediary cousins of proteins

The Central Dogma of molecular biology, Revisited

[Diagram labels: Genome, Transcription, RNA (Transcriptome), Translation, Protein (Proteome); Regulation by proteins; Regulation by RNA (miRNA)]

Page 7:

Research in Biology is complex…

Deciphering Biological Systems
– The advantage (what makes this quest feasible) and the hindrance (what makes this quest inherently difficult) – both explained by evolution.

Page 8:

The difficulties in our research fundamentally owe their complexity to the designer – natural selection.

What is it - a “Robot” or a “UFO”?
– The reason lies in the profound difference between systems “designed” by natural selection and those designed by intelligent engineers [Langton 1989 Artificial Life].

The Hindrance – Topological Entanglement of functional interconnections

Page 9:

Bottom line: we investigate an outrageously complex weave of interconnections
– The “textbook networks” represent only the tip of the iceberg.

miRNAs and “Regulomics”
– microRNAs - Expected to represent ~1% of predicted genes [Lim et al., 2003]
– Lewis et al. (2003) estimate an average of five targets per miRNA
– Many targets are transcription factors - miRNAs regulate the regulators

Page 10:

The advantage – universal homology, thus enabling comparative biology.

Bottom line: research in biology advances through a reductionist approach - using simple model organisms to infer the functionality of homologous systems.

Page 11:

Human genome statistics

2.91 billion base pairs
24,000 protein-coding genes (>30,000 non-coding genes ???)
1.5% exons (127 nucleotides)
24% introns (~3,000 nucleotides)
75% intergenic (no genes)
Repetitive elements rule (~45% dispersed repeats)
Average size of a gene is 27,894 bases
A gene contains an average of 8.8 exons (* Titin contains 234 exons)
Average of 4 different proteins per gene (alternative splicing)
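As a rough consistency check on these figures, the exonic fraction can be recomputed from the gene count, the exons per gene, and the exon length. This is only a back-of-the-envelope sketch: it assumes the 127 nucleotides quoted is an average exon length, and the function and variable names are illustrative.

```python
# Figures copied from the slide. Treating 127 nt as the *average* exon
# length is an assumption made for this back-of-the-envelope check.
GENOME_BP = 2.91e9
PROTEIN_CODING_GENES = 24_000
EXONS_PER_GENE = 8.8
EXON_LENGTH_NT = 127


def exonic_fraction(genome_bp, genes, exons_per_gene, exon_length_nt):
    """Fraction of the genome covered by exons, ignoring any overlaps."""
    return genes * exons_per_gene * exon_length_nt / genome_bp


fraction = exonic_fraction(
    GENOME_BP, PROTEIN_CODING_GENES, EXONS_PER_GENE, EXON_LENGTH_NT
)
print(f"estimated exonic fraction: {fraction:.2%}")
```

With these rounded inputs the estimate lands a little under 1%, the same order of magnitude as the 1.5% quoted on the slide. The slide's numbers are averages from a genome-wide annotation and need not multiply out exactly.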

Page 12:

Detecting genes in the human genome

Gene finding methods:

Ab initio
– use general knowledge of gene structure: rules and statistics
– The challenge: small exons in a sea of introns

Homology-based
– The problem: will not detect novel genes

Page 13:

Genscan (ab initio)

Based on a probabilistic model of gene structure

Takes into account:
- promoters
- gene composition – exons/introns
- GC content
- splice signals

Goes over all 6 reading frames

Burge and Karlin, 1997, Prediction of complete gene structure in human genomic DNA, J. Mol. Biol. 268
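To make "all 6 reading frames" concrete, the sketch below enumerates the three codon offsets on the forward strand and the three on the reverse complement. It is not Genscan's implementation (Genscan scores a probabilistic gene model); the function names are hypothetical and the code only illustrates what the six frames are.

```python
# Illustrative only: enumerates the six reading frames of a DNA string.
# This is NOT how Genscan works internally; names here are hypothetical.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def six_frames(seq):
    """Map frame names (+1, +2, +3, -1, -2, -3) to codon-aligned sequence."""
    frames = {}
    for name, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            sub = strand[offset:]
            usable = len(sub) - len(sub) % 3  # drop a trailing partial codon
            frames[f"{name}{offset + 1}"] = sub[:usable]
    return frames


for frame, codons in six_frames("ATGAAATAG").items():
    print(frame, codons)
```

A gene finder has to consider every one of these frames because a coding exon can start at any offset on either strand.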


Page 14:

Splicing

Page 15:

Eukaryotic splice sites

Poly-pyrimidine tract

Page 16:

CpG Islands: another signal

CpG islands are regions of the genome with a higher frequency of CG dinucleotides (not base-pairs!) than the rest of the genome

CpG islands often occur near the beginning of genes, maybe related to the binding of the TF Sp1
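A common way to quantify "a higher frequency of CG dinucleotides" is the observed/expected CpG ratio together with the GC fraction. The sketch below computes both for a sequence. The thresholds in `looks_like_cpg_island` (length of at least 200 bp, GC above 50%, ratio above 0.6) follow the widely used Gardiner-Garden and Frommer criteria; they are not stated on the slide and are an assumption here, as are the function names.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        return 0.0, 0.0
    c, g = s.count("C"), s.count("G")
    observed = s.count("CG")   # CG dinucleotides along this strand
    expected = c * g / n       # expectation if C and G occurred independently
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply assumed thresholds (not from the slides) to a single window."""
    gc, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc > min_gc and ratio > min_ratio


print(cpg_stats("CGCGCGCG"))
print(looks_like_cpg_island("AT" * 100))
```

In practice the test would be applied to sliding windows along a chromosome rather than to one whole sequence.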

Page 17:

Gene Ontology

GO describes proteins in terms of:

biological process (e.g. induction of apoptosis by external signals)

cellular component (e.g. membrane fraction)

molecular function (e.g. protein kinase)

[Figure labels: cell, nucleus, nuclear chromosome]

Page 18:

Comparative proteome analysis

Functional categories based on GO

Page 19:

Comparative proteome analysis

Humans have more proteins involved in cytoskeleton, immune defense, and transcription

Page 20:

Evolutionary conservation of human proteins

???

Page 21:

Horizontal (lateral) gene transfer

Lateral Gene Transfer (LGT) is any process in which an organism transfers genetic material to another organism that is not its offspring

Page 22:

Mechanisms:

Transformation

Transduction (phages/viruses)

Conjugation

Page 23:

Bacteria to vertebrate LGT detection

E-value of bacterial homolog X9 better than eukaryal homolog

Human query:

Hit              E-value
Frog             4e-180
Mouse            1e-164
E. coli          7e-124
Streptococcus    9e-71
Worm             0.1
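The detection rule can be sketched as a comparison between the best bacterial hit and the best non-vertebrate eukaryotic hit. In the sketch below the slide's "X9" is read as "nine orders of magnitude better", which is an assumption, as are the taxon labels attached to each hit and all function names. Treating a query with no non-vertebrate eukaryotic hit as a candidate is also an assumption.

```python
import math

# E-values copied from the slide's example; the taxon labels are added here.
HITS = [
    ("Frog", "vertebrate", 4e-180),
    ("Mouse", "vertebrate", 1e-164),
    ("E. coli", "bacteria", 7e-124),
    ("Streptococcus", "bacteria", 9e-71),
    ("Worm", "non-vertebrate eukaryote", 0.1),
]


def best_evalue(hits, group):
    """Smallest (best) E-value among hits in the given group, or None."""
    values = [e for _, g, e in hits if g == group]
    return min(values) if values else None


def orders_better(e_better, e_worse, floor=1e-300):
    """How many orders of magnitude e_better is below e_worse."""
    return math.log10(max(e_worse, floor)) - math.log10(max(e_better, floor))


def is_lgt_candidate(hits, min_orders=9.0):
    """Flag a query whose bacterial hit beats the non-vertebrate hit.

    min_orders=9 is an assumed reading of the slide's 'X9'.
    """
    bacterial = best_evalue(hits, "bacteria")
    eukaryal = best_evalue(hits, "non-vertebrate eukaryote")
    if bacterial is None:
        return False
    if eukaryal is None:
        return True  # assumption: no non-vertebrate hit at all also qualifies
    return orders_better(bacterial, eukaryal) >= min_orders


print(is_lgt_candidate(HITS))
```

For the example hits the E. coli match is far more than nine orders of magnitude better than the worm match, so the query would be flagged. The vertebrate hits (frog, mouse) are not part of the comparison.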

Page 24:

Bacteria to vertebrate LGT

[Figure labels: Bacteria, Non-vertebrates, Vertebrates]

Page 25:

Page 26:

Bacteria to vertebrate LGT??

Hundreds of sequenced bacterial genomes vs. a handful of eukaryotes

Gene finding in bacteria is much easier than in eukaryotes

On the practical side: rigid mechanical barriers to LGT in eukaryotes (nucleus, germ line)

Page 27:

Repetitive Elements in the Human Genome

Page 28:

Repeats statistics

The human genome is ~45% dispersed repeats
– 20% LINEs (AT rich)
– 13% SINEs (11% Alu) (GC rich)
– 8% LTR (retrovirus-like)
– 2% DNA transposons

Another 3% is tandem simple sequence repeats (e.g. triplet)

And another 3-5% is segmentally duplicated at high similarity (over 1 kb, over 90% identity)

Identifying and screening these out is essential to avoid fake matches

Page 29:

LINEs and SINEs

Highly successful elements in eukaryotes

LINE - Long Interspersed Nuclear Element (>5,000 bp)

SINE - Short Interspersed Nuclear Element (<500 bp)

SINEs are freeriders on the backs of LINEs – encode no proteins

Page 30:

The C-value paradox

Genome size does not correlate with organism complexity

Organism   Genome size (bp)   Number of genes
Yeast      12 million         6,275
Human      3 billion          20-25,000
Rice       4.3 billion        ~30,000
Amoeba     670 billion        ?

Page 31:

Repetitive elements

The C-value mystery was partially resolved when it was found that large portions of genomes contain repetitive elements

Page 32:

Are Alus functional??

SINEs are transcribed under stress

SINE RNAs may bind a protein kinase and promote translation under stress

Need to be in regions which are highly transcribed

Role in alternative splicing

Page 33:

Segmental duplications

1077 segmental duplications detected

Several genes in the duplicated regions are associated with diseases (may be related to homologous recombination)

Most are recent duplications (conservation of the entire segment, versus conservation of coding sequences only)

Page 34:

Genome-wide studies

Page 35:

Sequenced genomes

Page 36:

481 segments > 200 bp absolutely conserved (100% identity) between human, rat and mouse
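Finding such segments amounts to scanning a multiple alignment for long runs of columns where every species carries the same base. The sketch below shows the idea on pre-aligned strings. The function name, the gap and `N` handling, and the inclusive length bound are assumptions for illustration, and a real genome-wide scan would work on whole-genome alignments rather than short strings.

```python
def conserved_runs(aligned, min_len=200):
    """Return half-open (start, end) column ranges that are 100% identical.

    `aligned` is a list of equal-length, already aligned sequences.
    Columns containing a gap ('-') or 'N' never count as conserved.
    Whether the length bound is inclusive is an assumption here.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")

    runs, start = [], None
    for i in range(length):
        column = {s[i].upper() for s in aligned}
        identical = len(column) == 1 and not column & {"-", "N"}
        if identical and start is None:
            start = i
        elif not identical:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and length - start >= min_len:
        runs.append((start, length))
    return runs


# Toy alignment with a tiny threshold so the behaviour is visible.
human = "ACGTACGTTT"
mouse = "ACGTACGATT"
rat = "ACGTACGTTT"
print(conserved_runs([human, mouse, rat], min_len=4))
```

The toy example uses `min_len=4` only so that a run can be seen in ten columns; the slide's criterion corresponds to a threshold of 200.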

Page 37:

Comparison with a neutral substitution rate

Compare against the substitution rate in any 1 Mb region

Probability of 10^-22 of obtaining 1 ultraconserved element (UE) by chance
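The intuition for why such a probability is so small can be sketched with a simple independence model: if each aligned site stays identical with probability p, a window of L sites stays identical with probability p^L, and the expected number of such windows in a genome of G sites is about G · p^L. This is not the calculation behind the slide's 10^-22, and the values of p below are hypothetical, chosen only to show how quickly the number collapses.

```python
def expected_chance_elements(p_site_identical, element_len, genome_len):
    """Expected count of windows identical at every site purely by chance.

    Assumes sites evolve independently; p_site_identical is hypothetical.
    """
    windows = max(genome_len - element_len + 1, 0)
    return windows * p_site_identical ** element_len


# Hypothetical per-site identity probabilities, for illustration only.
for p in (0.9, 0.8, 0.7):
    print(p, expected_chance_elements(p, 200, 2_900_000_000))
```

Because the window probability is raised to the 200th power, even a small drop in per-site identity lowers the expected count by many orders of magnitude.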

Page 38:

481 UEs
– 111 UEs overlap a known mRNA: exonic UEs
– 256 - no overlap (non-exonic)
  – 100 intronic
  – 156 intergenic
– 114 - inconclusive

Page 39:

Who are the genes?

Type 1: exonic

Type 2: genes which are near non-exonic UEs (???)

Page 40:

Intergenic UEs

Genes which flank intergenic UEs are enriched for early developmental genes

Are UEs distal enhancers of these genes?

Page 41:

Gene enhancer

A short region of DNA which binds an activator; it can lie quite far from the gene it regulates, because chromatin folding brings the two together

An activator recruits transcription factors to the gene

Page 42:

Experimental studies of UEs

Tested 167 UEs (both mouse-human UEs and fish-human UEs) for enhancer activity: cloned before a reporter gene to test their activity

45% functioned as enhancers

Page 43:

A bioinformatic success

Ultraconservation can predict highly important function!

Page 44:

Ahituv PLoS Biol. 2007 Sep;5(9):e234

Chose 4 UEs which are near specific genes:

genes which show a specific phenotype when knocked-out

Performed complete deletion of these UEs

… the mice were viable and did not show any different phenotype

BUT …

Page 45:

Conclusions…

Ultraconservation can be indicative of important function

… And sometimes not:
- gene redundancy
- long-range phenotypes
- laboratories cannot mimic life