lecture 1
TRANSCRIPT
Hilary Term 04: “The Human Genome”
20.1 The Human Genome – evolutionary issues (Hein)
27.1 Non-Genic Selection in the Human Genome (Lunter)
3.2 Mammalian Genes I: Conservation and slow evolution (Ponting)
10.2 Mammalian Genes II: Functional innovation and rapid change (Ponting/Goodstadt)
17.2 RNAs in Human Genome (Sam Griffiths-Jones)
24.2 Population Genetics of the Human Genome (Gil McVean)
2.3 Association Mapping and the Human Genome (Lon Cardon)
9.3 The Human Genome and Human Evolution (Chris Tyler-Smith)
The Human Genome – key issues
The Human Genome Project
Few basic facts of the human genome
Grammar of Genes
Basic events happening to a genome per mitosis/generation
Genealogical Structures: Phylogenies, Pedigrees and the ARG
Long term Dynamics of the Human Genome: The comparative aspect
(Genotype Phenotype) & (Population Genetics/History) => Gene Mapping
History
Our interests.
History of the Human Genome Project
Strachan and Read, HMG3 p213
1956 Physical map. 24 types and total set of 46 chromosomes
1977 Sanger publishes dideoxy sequencing method
1980 Botstein proposes human genetic map using RFLPs
1987 US DOE publishes report discussing HGP
1988 HUGO is established
1990 Official start of HGP with $3 billion and a 15-year horizon.
1991 Genome Database (GDB) is established
1992 Genethon publishes map based on microsatellites.
1995 Lander et al. detailed map based on sequence tagged sites.
1998 Comprehensive map based on gene markers.
1999 Sanger Centre publishes chromosome 22
2001 Draft Genome published: Celera & Public
2003 Completion (almost) of Human Genome
Public effort- strategy: Celera - strategy: From Myers 99
Sequencing Strategies
Celera’s view of International Consortium: Unfair competition: IC delivering the same goods but with state funding.
International Consortium’s view of Celera: Unfair competition: Celera delivering the same goods but can use IC data, while IC cannot use Celera data.
Other Genome Projects
1976/79 First viral genome – MS2/φX174
1980 Mitochondrion
1982 First shotgun sequenced genome – Bacteriophage lambda
1995 First prokaryotic genome – H. influenzae
1996 First unicellular eukaryotic genome – Yeast
1998 The first multicellular eukaryotic genome – C.elegans
2000 Drosophila melanogaster
2000 Arabidopsis thaliana
2001 Human Genome
2002 Mouse Genome
The Genome OnLine Database knows of 958 genome sequencing projects, of which 169 are completed
Favourite and Model Organisms
Multicellular Animals
Mammals
Human 3.5 Gb
Mouse 3.2 Gb
Cow 3.0 Gb
Dog 2.8 Gb
Rat 3.1 Gb
Chimp 3.5 Gb
Pig 3.0 Gb
Fish
Puffer Fish 0.4 Gb
Zebra Fish 1.9 Gb
Insects
Drosophila 165 Mb
Honey Bee 270 Mb
Yellow Fever Mosquito 780 Mb
Malaria Mosquito 278 Mb
Strachan and Read (2004) Chapter 8
Birds
Chicken 1.2 Gb
Frog
Xenopus laevis 1.7 Gb
Nematodes
Caenorhabditis elegans 100 Mb
Caenorhabditis briggsae 80 Mb
Sea Urchin
Strongylocentrotus purpuratus 800 Mb
Multicellular Plants
Arabidopsis thaliana 125 Mb
Rice 430 Mb
The Human Genome I http://www.sanger.ac.uk/HGP/ & R.Harding & HMG (2004) p 245
[Figure: the genome at successive scales: the whole genome (3.2*10^9 bp), a 6*10^4 bp region (chromosome 11), a 3*10^3 bp globin gene (5’ flanking, Exon 1, Exon 2, Exon 3, 3’ flanking) and 30 bp of DNA, ATTGCCATGTCGATAATTGGACTATTTGGA, with the 10 amino acids (aa) of protein it encodes. Magnification factors shown: *5,000, *20, *10^3. Also labelled: Myoglobin, globin.]
[Figure: chromosomes 1-22, X and Y with sizes in Mb, in the order printed: 279, 251, 221, 197, 198, 176, 163, 148, 140, 143, 148, 142, 118, 107, 100, 104, 88, 86, 72, 66, 45, 48, 163, 51; mitochondria 0.016.]
The Human Genome II http://www.sanger.ac.uk/HGP/
Strachan and Read (2004) Chapter 9
                              Nuclear Genome          Mitochondria
Highly conserved - coding     1.5%                    93%
Highly conserved - other      3.5%                    5%
Transposon based repeats      45%                     -
Heterochromatin               6.6%                    -
Other non-conserved           44%                     2%
                              Mendelian inheritance   Maternal inheritance
                              1 (typically)           Possibly thousands
                              Recombination           No recombination
Gene Density:                 1/130 kb                2 kb
Pseudogenes: 20000
Processed Pseudogenes
The Human Genome III http://www.sanger.ac.uk/HGP/
Strachan and Read (2004) Chapter 9 + Lander et al.(2001)
Gene families
Clustered
-globins (7), growth hormone (5), Class I HLA heavy chain (20),….
Dispersed
Pyruvate dehydrogenase (2), Aldolase (5), PAX (>12),..
Clustered and Dispersed
HOX (38 – 4), Histones (61 – 2), Olfactory receptors (>900 – 25),…
Transposons
Genes and Gene Structures I
•Presently estimated Gene Number: 24,000 (reference: )
•Average Gene Size: 27 kb
•The largest gene: Dystrophin 2.4 Mb - 0.6% coding – 16 hours to transcribe.
•The shortest gene: tRNA(Tyr) 100% coding
•Largest exon: ApoB exon 26 is 7.6 kb Smallest: <10bp
•Average exon number: 9
•Largest exon number: Titin 363 Smallest: 1
•Largest intron: WWOX intron 8 is 800 kb Smallest: 10s of bp
•Largest polypeptide: Titin 38,138 aa Smallest: tens of aa – small hormones.
•Intronless Genes: mitochondrial genes, many RNA genes, Interferons, Histones,..
Jobling, Hurles & Tyler-Smith (2004) HEG p 29 + HMG chapt. 9
Genes and Gene Structures II
Genes within Genes:
Intron 26 of neurofibromatosis type I (NF1) contains 3 internal genes (2 exons each) transcribed in the opposite direction.
Overlapping Genes:
Class III region of HLA
Strachan and Read (2004) Chapter 9 p 258
Simple Eukaryotic
Alternative Splicing
Cartegni,L. et al.(2002) “Listening to Silence and understanding nonsense: Exonic mutations that affect splicing” Nature Reviews Genetics 3.4.285-
HMG p291-294
1. A challenge to automated annotation.
2. How widespread is it?
3. Is it always functional?
4. How does it evolve?
RNAs in the Genome
Strachan and Read (2004) p.247 F9.4
~200 snoRNA small nucleolar, over 100 types - RNA modification and processing
~100 snRNA small nuclear - involved in splicing
~200 miRNA very small, ~22 nt, regulation
~175 28S, 5.8S, 5S large cytosolic subunit
~175 18S small cytosolic subunit
~250 5S large cytosolic subunit
>500 tRNA transfer RNA
>1500 Antisense RNA > 1500 types
Genome Annotation
Ensembl
http://www.ensembl.org
Santa Cruz Genome Browser
http://genome.ucsc.edu/
Genomes
Proteins
ESTs
Gene Finding and Protein (HMM) Descriptors
Burge & Karlin, JMB 96
A. Assign gene characteristics to each nucleotide. Extract a legal prediction by dynamic programming.
B. Use HMM to describe biological knowledge of gene structure.
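Points A and B can be illustrated with a toy two-state HMM (coding / non-coding) decoded by the Viterbi algorithm, which is one form of the dynamic programming step in point A. This is a minimal sketch and not the Burge & Karlin model: the two states, every transition and emission probability, and the function name `viterbi` are assumptions chosen purely for illustration.

```python
import math

# Hypothetical two-state model: "N" (non-coding) and "C" (coding).
STATES = ("N", "C")
START = {"N": 0.9, "C": 0.1}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
# Made-up emissions: coding assumed GC-rich, non-coding assumed AT-rich.
EMIT = {
    "N": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "C": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}


def viterbi(seq):
    """Return the most probable state path for seq (Viterbi in log space)."""
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("ATATATTAGCGCGGCCGCGCATTATATA"))
```

The returned string labels each nucleotide with a state, so a "legal prediction" here is simply any path the transition table allows. A realistic gene model would add states for exons in each frame, introns and splice signals.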
Mutations and Mutation Rates
1 mitosis or generation
Average Number of Mitoses
Male generation (age:mitoses): 15:35 .. 20:150
Female generation: ~24
Crow,JF (2000) “The Origins, Patterns and Implications of Human Spontaneous Mutation” Nature Review Genetics 1.1.40-47 + Strachan and Read (2004) chapter 11 +Jobling, Hurles and Tyler-Smith (2004) chapter 2
• Single nucleotide substitutions: ~10^-7
• Microsatellites (~100,000): ~10^-2
• Small insertion deletions: ~10^-8
Recombination
1 meiosis
Lander et al.(2001) “Initial sequencing and analysis of the human genome” Nature 409.860-912. + Kong, A. et al.(2002) “A high resolution recombination map of the human genome” Nature Genetics
Recombination: Gene Conversion:
•Total Haploid length males: 25.9 M - females: 44.6 M.
•Gene conversions 1-2 orders of magnitude higher. Length 300-2000 bp.
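Because one Morgan corresponds to one expected crossover per meiosis, the map lengths above translate directly into crossover counts and an average rate per megabase. The sketch below is only that arithmetic. The genome size of 3.2*10^9 bp is taken from the earlier slide; the simple mean of the two sexes and the function name `cm_per_mb` are assumptions for illustration.

```python
# Haploid map lengths in Morgans, as quoted on the slide.
MALE_M = 25.9
FEMALE_M = 44.6
GENOME_BP = 3.2e9  # haploid genome size quoted earlier in the lecture


def cm_per_mb(map_length_morgans, genome_bp=GENOME_BP):
    """Average recombination rate in centimorgans per megabase."""
    return (map_length_morgans * 100) / (genome_bp / 1e6)


# Assumption: the sex-averaged length is the simple mean of the two maps.
sex_averaged = (MALE_M + FEMALE_M) / 2

for label, length in [("male", MALE_M), ("female", FEMALE_M),
                      ("sex-averaged", sex_averaged)]:
    print(f"{label}: {length:.2f} expected crossovers per meiosis, "
          f"{cm_per_mb(length):.2f} cM/Mb")
```

These are genome-wide averages only; they say nothing about how the rate varies along a chromosome.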
Selection: Positive & Negative
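[Figure: one-sequence scenario versus population scenario. In the population scenario a site carrying A is followed through aligned sequences (AAACC). In the one-sequence scenario a pair of codons changes along a lineage (ACGTCA, ACGCCA, ACGCCG, AGGCCG, ACTCTG, GCTCTG, GCACTG), labelled with the encoded amino acids (Thr, Ser, Pro, Arg, Ala).]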
Certain events have functional consequences and will be selected out. The strength and localization of this selection is of great interest.
The selection criteria could in principle be anything, but selection against amino acid changes is by far the most important.
The Genetic Code
Substitutions Number Percent
Total in all codons 549 100
Synonymous 134 25
Nonsynonymous 415 75
Missense 392 71
Nonsense 23 4
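The counts in this table can be reproduced by enumerating every single-nucleotide change in each of the 61 sense codons of the standard genetic code (61 * 9 = 549) and classifying the result. The sketch below does that. The function name `classify_substitutions` and the compact encoding of the code table are illustrative choices, and every substitution is counted once with equal weight, which is an assumption about how the table was built.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitutions():
    """Count single-nucleotide substitutions in sense codons by effect."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # only changes starting from sense codons are counted
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODE[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = classify_substitutions()
total = sum(counts.values())
for kind, n in counts.items():
    print(f"{kind}: {n} ({100 * n / total:.0f}%)")
print("total:", total)
```

Nonsynonymous changes are the sum of the missense and nonsense counts.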
Examples of rates remade from Li, 1997
Organism            Gene             Syno/year       Non-Syno/year
RNA Virus
  Influenza A       Hemagglutinin    13.1 * 10^-3    3.6 * 10^-3
  Hepatitis C       E                6.9 * 10^-3     0.3 * 10^-3
  HIV 1             gag              2.8 * 10^-3     1.7 * 10^-3
DNA virus
  Hepatitis B       P                4.6 * 10^-5     1.5 * 10^-5
  Herpes Simplex    Genome           3.5 * 10^-8
Nuclear Genes
  Mammals           c-mos            5.2 * 10^-9     0.9 * 10^-9
  Mammals           α-globin         3.9 * 10^-9     0.6 * 10^-9
  Mammals           histone 3        6.2 * 10^-9     0.0
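One way to read these rates is as the ratio of nonsynonymous to synonymous substitution rate for each gene; a ratio well below 1 is conventionally interpreted as selection against amino acid changes. The sketch below computes that ratio for the rows that list both rates. The values are copied from the table, with the first number of each pair taken as the synonymous rate, following the column header; the function name `ratios` is illustrative, and Herpes Simplex is omitted because only one rate is given.

```python
# (organism, gene, synonymous rate per year, nonsynonymous rate per year),
# copied from the table; rows with a single rate are left out.
RATES = [
    ("Influenza A", "Hemagglutinin", 13.1e-3, 3.6e-3),
    ("Hepatitis C", "E", 6.9e-3, 0.3e-3),
    ("HIV 1", "gag", 2.8e-3, 1.7e-3),
    ("Hepatitis B", "P", 4.6e-5, 1.5e-5),
    ("Mammals", "c-mos", 5.2e-9, 0.9e-9),
    ("Mammals", "alpha-globin", 3.9e-9, 0.6e-9),
    ("Mammals", "histone 3", 6.2e-9, 0.0),
]


def ratios(rows=RATES):
    """Map (organism, gene) to nonsynonymous / synonymous rate."""
    return {(org, gene): nonsyn / syn for org, gene, syn, nonsyn in rows}


for (org, gene), r in sorted(ratios().items(), key=lambda kv: kv[1]):
    print(f"{org:12s} {gene:14s} nonsyn/syn = {r:.2f}")
```

Sorting by the ratio orders the genes by how strongly amino acid changes appear to be constrained, with histone 3 at one extreme.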
Genealogical Structures
Homology:
The existence of a common ancestor (for instance for 2 sequences)
Phylogeny Pedigree:
Ancestral Recombination Graph – the ARG
ccagtcg
ccggtcgcagtct
Only finding common ancestors. Only one ancestor.
i. Finding common ancestors.
ii. A sequence encounters Recombinations
iii. A “point” ARG is a phylogeny
Populations
Now
Parents
Grand parents
Genealogical approach to Population Variation Analysis
Africa Non-Africa
Inter.SNP Consortium (2001): A map of human genome sequence variation containing 1.42 million SNPs. Nature 409.928-33
Pedigrees
Icelandic: http://www.decode.com + Helgason, A. et al. (2003 June) “A population-wide coalescent analysis of Icelandic matrilineal and patrilineal genealogies: Evidence for a faster evolutionary rate of mtDNA lineages than Y-chromosomes” American Journal of Human Genetics.
Chinese: http://demography.anu.edu.au/People/Staff/zhongwei.html
Burke’s British Peerage: http://www.burkes-peerage.net/sites/wars/sitepages/home.asp
Mormons: http://genealogy-mormons.com/
Quebec French: Heyer and Tremblay, 1998 PNAS
[Figure (Helgason): total pedigree, with matrilines and patrilines linking an ancestral cohort born 1848-1892 to a descendant (contemporary) cohort born after 1972. Cohort sizes shown: N = 31,817, N = 31,659, N = 66,910, N = 64,150. Percentages shown: 77.9% / 22.1%, 8.3% / 91.7%, 86.2% / 13.8%, 73.9% / 26.1%.]
Genealogical Questions
Pedigrees
Time back to first individual common ancestor to everyone
ARG questions:
The height of ARGs - correlation between local phylogenies
Gene Phylogeny Questions
Total Branch Length - Height
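For the gene phylogeny questions, the expected height and total branch length have simple closed forms under the standard neutral coalescent. That model is an assumption here; the slide does not name one. With time measured in units of 2N generations, k lineages wait an expected 2/(k(k-1)) before the next coalescence, which gives height 2(1 - 1/n) and total length 2 * (1 + 1/2 + ... + 1/(n-1)) for a sample of n sequences. The sketch below sums the waiting times directly; the function name is illustrative.

```python
from fractions import Fraction


def expected_height_and_length(n):
    """Expected tree height and total branch length for n sampled sequences.

    Assumes the standard neutral coalescent, time in units of 2N generations.
    """
    height = Fraction(0)
    length = Fraction(0)
    for k in range(2, n + 1):
        wait = Fraction(2, k * (k - 1))  # expected time with k lineages
        height += wait                   # every epoch adds to the height
        length += k * wait               # k branches are extended in parallel
    return height, length


for n in (2, 10, 100):
    h, total = expected_height_and_length(n)
    print(f"n = {n}: height = {float(h):.3f}, total length = {float(total):.3f}")
```

The height approaches 2 as n grows while the total length keeps increasing slowly, so adding sequences mostly adds short recent branches.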
Long Term Evolutionary History: Myr/Gyr
Origin of Life
Last Universal Common Ancestor – LUCA
First Eukaryotes
First Chordates
First Vertebrates
First Mammals
First Primates
First Hominoids
Chimp-Human Split
Hedges, SB (2002) “The Origin and Evolution of Model Organisms” Nature Review Genetics 3.11.838-848.
Brown (2003) “Horizontal Genetic Transfers “ Nature Genetics
[Figure: observable sequences at the tips of a tree; the MRCA (Most Recent Common Ancestor) and the evolutionary path are unobservable. Parameters: time, rates, selection.]
3 Problems:
i. Test all possible relationships.
ii. Examine unknown internal states.
iii. Explore unknown paths between states at nodes.
ATTGCGTATATAT….CAG ATTGCGTATATAT….CAG ATTGCGTATATAT….CAG
Time Direction
The Comparative Aspect.
[Figure: observable sequence versus unobservable structure, illustrated with paired bases (U, C, G, A) of an RNA structure.]
P(Structure | Sequence) = P(Sequence | Structure) P(Structure) / P(Sequence)
Goldman, Thorne & Jones, 96
RNA Structure
Gene Structure
One Principle of Comparative Genomics
Protein Structure
Molecular Evolution and Gene Finding: Two HMMs
Simple Prokaryotic Simple Eukaryotic
AGTGGTACCATTTAATGCG..... Pcoding{ATG-->GTG} orAGTGGTACTATTTAGTGCG..... Pnon-coding{ATG-->GTG}
The Rise of Comparative Genomics
Lander et al(2001) Figure 25A
RNA (Secondary) Structure
Sequences
ACTGT
ACTCCT
Protein Structure
[Figure: rearranged gene order, genes numbered 1-8, in Cabbage and Turnip.]
Gene Order/Orientation.
Gene Structure
Interaction Networks
Any Graph.
General Theme.
Formal Model of Structure
Stochastic Model of Structure Evolution.
Renin
HIV proteinase
The Domain of Comparative Genomics
Linkage Mapping
r
M
D
From McVean
A set of characters.
Binary decision (0,1).
Quantitative Character.
Dominant/Recessive.
Penetrance
Spurious Occurrence
Heterogeneity
[Figure: genotype and phenotype; time scale 2Ne generations.]
Association/Fine scale mapping
Single marker association
Bayesian analysis
1000 cases and 1000 controls typed at 8 microsatellite markers
BRCA2 example
Rafnar et al.(2004) – Morris et al(2001) + Causative SNPs.
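A single-marker association test can be sketched as a Pearson chi-square test on a 2x2 table of allele counts in cases and controls. The counts below are hypothetical and are not data from the BRCA2 example. The marker is also treated as biallelic for simplicity, whereas microsatellites usually have more alleles. With 1000 cases and 1000 controls each group contributes 2000 alleles.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical allele counts: rows are cases and controls,
# columns are the tested allele and all other alleles.
cases_allele, cases_other = 700, 1300
controls_allele, controls_other = 600, 1400

stat = chi_square_2x2(cases_allele, cases_other, controls_allele, controls_other)
CRITICAL_5_PERCENT = 3.84  # chi-square critical value, 1 df, alpha = 0.05

print(f"chi-square = {stat:.2f}")
print("associated at the 5% level" if stat > CRITICAL_5_PERCENT
      else "no evidence of association at the 5% level")
```

Testing 8 markers this way would call for a multiple-testing correction, which is one motivation for analysing the markers jointly.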
Short Term Evolutionary History: Kyr/Myr
Oldest Polymorphisms
Neutral Human Autosomal Polymorphisms
First Out-of-Africa
Anatomically Modern Man
Peopling of the Globe – genetic and fossil evidence.
The globe & migrations:
Cavalli-Sforza,2001 + HEG (2004)
Supposedly well behaved populations
Iceland
Finland
Sardinia
HapMap “The International HapMap Project” Nature 426, 789 - 796 (18 Dec 2003) http://www.hapmap.org/
Started October 27-29, 2002
HapMap
Ontologies
http://www.geneontology.org
Gene Ontology Consortium (2001) “Creating the Gene Ontology Resource: Design and Implementation.” Genome Research 11.1425-33
Gene Ontology Consortium (2004) “The Gene Ontology (GO) database and informatics resource” Nucleic Acid Research 32.D258-61.
A Structured Vocabulary – Consistent across species.
Purpose:
Facilitate communication among researchers
Facilitate communication among computer systems
2001: Three Ontologies:
Molecular Function
Biological Process
Cellular Component
Source: NAR (2004) 32.D258-
Structural Genomics: Systematic Structure Determination
http://www.strgen.org/ http://www.nysgrc.org/ http://www.oppf.ox.ac.uk/ http://pdb.ccdc.cam.ac.uk/pdb/strucgen.html
John Westbrook, Zukang Feng, Li Chen, Huanwang Yang and Helen M. Berman “The Protein Data Bank and structural genomics” Nucleic Acids Research, 2003, Vol. 31, No. 1 489-491
PDB Holdings List: 10-Feb-2004
                               Molecule Type
Exp. Tech.                     Proteins, Peptides,   Protein/Nucleic    Nucleic   Carbohydrates   Total
                               and Viruses           Acid Complexes     Acids
X-ray Diffraction and other    19014                 898                719       14              20645
NMR                            2934                  96                 569       4               3603
Total                          21948                 994                1288      18              24248
Examples:
•Center for Eukaryotic Structural Genomics
•Structural Genomics of Pathogenic Protozoa Consortium
•Berkeley Structural Genomics Center : Mycoplasma genitalium and Mycoplasma pneumoniae
Structural Genomics: Mycoplasma pneumoniae proteins
http://www.strgen.org/status/mpoverview.html
Proteomics
http://www.hupo.org
Hanash, S. (2003) “Disease Proteomics” Nature 422.226-
Aebersold, R. and M. Mann (2003) “Mass spectrometry-based proteomics” Nature 422.198-
Gavin et al. (2002) “Functional Organisation of the Yeast Proteome by systematic analysis of protein complexes” Nature 415.141-
2D PAGE gels (polyacrylamide gel electrophoresis)
MALDI
Protein Micro-arrays
Source: Hanash (2003)
Source Gavin et al.(2002)
The Genome
Genomes: Variation and long term evolution.
Genealogical Structures: Phylogenies, Pedigrees and the ARG
Long term Dynamics of the Human Genome: The comparative aspect
(Genotype Phenotype) & (Population Genetics/History) => Gene Mapping
Summary
Our Genomically Motivated Projects
1. Comparative gene annotation (Meyer, Skou Pedersen)
2. Superimposed selective constraints (Forsberg, Meyer, Skou Pedersen) *
3. Haplotype Blocks (Song) *
4. Genome transformations (Miklos)
5. Ancestral Blocks*
6. Statistical Sequence Comparison (Drummond, Lunter, Miklos)
7. Substitutions and insertion-deletions at the Genome Level (Lunter) Next week
a: (3,4)
b: (3,4)
c: (15,16)
d: (16,17)
e: (35,36)
f: (35,36)
g: (36,37)
Minimal ARGs and Haplotype Blocks (Song)
Combining Levels of Selection.
Forsberg, Meyer, Pedersen
Protein-Protein
Hein & Støvlbæk, 1995 Codon Nucleotide Independence Heuristic
Jensen & Pedersen, 2001
Contagious Dependence
Assume multiplicativity: f_{A,B} = f_A * f_B
Protein-RNA
Doublets Singlet
Contagious Dependence
A randomly picked ancestor: (ancestral material comes in batteries!)
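[Figure: ancestral material of a randomly picked ancestor along chromosome 1 at three scales: the whole chromosome (0-260 Mb, segments 0-52,000), a 7.5 Mb window (*35, segments 6,890-8,360) and a 30 kb window (*250).]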
Parameters used: 4Ne = 20,000. Chromosome 1: 263 Mb, 263 cM
Chromosome 1: Segments 52,000, Ancestors 6,800
All chromosomes: Ancestors 86,000. Physical Population: 1.3-5.0 million
Applications to Human Genome (Wiuf and Hein,97)
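The segment count quoted for chromosome 1 is close to the scaled recombination rate obtained by multiplying 4Ne by the map length in Morgans. The sketch below is only that back-of-envelope product. That the quoted parameters combine in this way is an assumption, since the slide lists the values but not the formula, and the function name is illustrative.

```python
FOUR_NE = 20_000      # 4Ne, as quoted
CHR1_MAP_CM = 263     # chromosome 1 genetic length in centimorgans, as quoted
CHR1_BP = 263e6       # chromosome 1 physical length, as quoted


def scaled_recombination_rate(four_ne, map_cm):
    """Product of 4Ne and the map length in Morgans."""
    return four_ne * map_cm / 100


rho = scaled_recombination_rate(FOUR_NE, CHR1_MAP_CM)
print(f"scaled rate for chromosome 1: {rho:,.0f}")
print(f"scaled rate per base pair: {rho / CHR1_BP:.1e}")
```

The product comes out at 52,600, of the same order as the 52,000 segments quoted for chromosome 1.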
References: Books & www-pages.
Books:
Strachan and Read (2004) “Human Molecular Genetics” (3rd Ed.) Bioscience
Jobling, Hurles and Tyler-Smith (2004) “Human Evolutionary Genetics” Bioscience
Sulston, J. (2002) “The Common Thread” Corgi Books
Ridley, Matt (2001) “Genome”
“Encyclopedia of the Human Genome” (2003) Nature Publishing Group
Cavalli-Sforza,L. (2001) “Genes, People and Language” Penguin
Key articles:
Lander et al.(2001) “Initial Sequencing and Analysis of the Human Genome” Nature
Venter et al.(2001)”The Sequence of the Human Genome” Science 291.1304-1351
References: www-pages.
Major sequencing centers:
Baylor College of Medicine Genome Sequencing Center hgsc.bcm.tcm.edu/
Celera www.celera.com
DoE Joint Genome Institute www.jgi.doe.gov
Genoscope www.genoscope.cns.fr
TIGR www.tigr.org
Washington University Genome Sequencing Center www.genome.wustl.edu
Wellcome Trust Sanger Institute www.sanger.ac.uk
Whitehead Institute/MIT Center for Genome Research www.-genome.wi.mit.edu
Ensembl genome annotator - www.ensembl.org
European Bioinformatics Institute - www.ebi.ac.uk
NCBI - www.ncbi.nlm.nih.gov
Nature Genome Gateway http://www.nature.com/genomics/human/
Integrated Genomics http://wit.integratedgenomics.com/GOLD/
Ebi genome databases http://www2.ebi.ac.uk/genomes/
Primate Sequencing Projects http://sayer.lab.nig.jp/~silver/index.html
European Bioinformatics Institute Proteomics http://www.ebi.ac.uk/proteome/
National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
HapMap Project Homepage http://www.hapmap.org/
Online Mendelian Inheritance in Man http://www.ncbi.nlm.nih.gov/omim/