ENCODE variation analysis
Analysis goals
• Quantify genetic variation in ENCODE regions
• Detect selective constraint in ENCODE features
• Develop rules for interpretation of functional variation
• Motivate experiments to test functional variation
Data
• ENCODE SNPs (HapMap resequencing)
• 5 kb HapMap SNPs
• DIPs
• Gene expression variation
Metrics of variation
• Derived allele frequency spectrum (Manolis)
• Diversity/Het (Ewan)
• SNP density (Ewan, others)
• DIP density (Jim, Taane)
• LD/Recombination (Daryl/Oxford)
• Regions of contiguous DNA without variation (Manolis)
• Accelerated (positively selected?) regions (Manolis)
• Standard tests of neutrality: McDonald-Kreitman, Tajima's D, etc. (Mike, others)
• Other non-parametric tests of selection (Andy)
• Tagging (Paul)
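As an illustration of one of the standard neutrality tests listed above, here is a minimal sketch of Tajima's D computed from per-site derived-allele counts. The function name and input layout are assumptions made for this example and are not taken from the slides. It uses the standard Tajima (1989) constants.

```python
import math

def tajimas_d(derived_counts, n):
    """Tajima's D from per-site derived-allele counts in n sampled chromosomes.

    derived_counts and n are hypothetical inputs chosen for this sketch.
    """
    # Keep only segregating sites (allele neither absent nor fixed).
    seg = [c for c in derived_counts if 0 < c < n]
    s = len(seg)
    if n < 4 or s == 0:
        raise ValueError("need n >= 4 chromosomes and at least one segregating site")

    # Mean pairwise differences (pi), summed over sites.
    pi = sum(2.0 * c * (n - c) / (n * (n - 1)) for c in seg)

    # Watterson's estimator uses the harmonic number a1.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    theta_w = s / a1

    # Variance constants from Tajima (1989).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    return (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

An excess of rare variants (for example, all singletons) gives a negative D, while an excess of intermediate-frequency variants gives a positive D.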
Analysis plans
Analysis with respect to genomic features
• Calculate variability in a large number of genomic features with all metrics
• Correlate variability metrics with the “intensity” of a feature (e.g. levels of conservation with levels of variability)
• Variation, alternative splicing and expression
• Distance effects from genomic features
• Association of gene expression with SNPs (some is in UCSC and some will be provided by Manolis at the workshop)
Analysis independent of genomic features (in principle)
• Tag SNPs and comparison of resequencing data to the 5 kb map. Here it will be a good idea to see how the 5 kb map captures variation within genomic elements. If we really aim to capture variation mainly in functional genomic elements (e.g. known regulatory regions, or nonsynonymous SNPs), how can we modify the tag algorithms?
• General description of levels of variation with respect to the functional content of the 44 ENCODE regions
                  av2pq/SNP   av2pq/pos   #SNPs
Promoters       : 0.15        0.00045     856
Region Rnd2     : 0.16        0.00041     737
Completely Rnd  : 0.16        0.00045     1584
Exons           : 0.14        0.00039     635
RRnd Exons      : 0.15        0.00040     636
Overall         : 0.16        0.00042     16609
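The column names in the table suggest, though the slides do not state it, that av2pq/SNP is the mean heterozygosity 2pq over the SNPs in a feature and that av2pq/pos is the summed 2pq divided by the number of positions in that feature. A minimal sketch under that assumption follows. The function name and inputs are hypothetical.

```python
def diversity_summary(allele_freqs, n_positions):
    """Summarise diversity for one feature class.

    allele_freqs: frequency p of one allele at each SNP (hypothetical input).
    n_positions:  total number of bases covered by the feature class.
    Assumes av2pq/SNP = mean(2pq) and av2pq/pos = sum(2pq) / n_positions.
    """
    if not allele_freqs or n_positions <= 0:
        raise ValueError("need at least one SNP and a positive feature length")
    # Expected heterozygosity at each biallelic SNP.
    het = [2.0 * p * (1.0 - p) for p in allele_freqs]
    total = sum(het)
    return {
        "av2pq_per_snp": total / len(het),
        "av2pq_per_pos": total / n_positions,
        "n_snps": len(het),
    }
```

Comparing a feature class with a matched random set (as in the Promoters and Exons rows against their random controls) then amounts to calling the function once per set.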
Diversity in features (Ewan Birney)
Derived allele frequency spectrum
[Figure: Histogram of Derived_Allele_Frequency_CEU for the CNS intersection (cns_inter01). x-axis: derived allele frequency in CEU (0.00 to 0.98); y-axis: percent (0 to 20). P = 0.003]
Derived allele frequency spectrum
[Figure: Histogram of Derived_Allele_Frequency_CEU for the transfrags union (transfrags01). x-axis: derived allele frequency in CEU (0.00 to 0.98); y-axis: percent (0 to 18). P = 0.204]
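A derived allele frequency spectrum like the two histograms above can be sketched as follows. The bin width of 0.14 is read off the axis ticks. The slides do not say which test produced the P values, so the permutation test on mean DAF is an assumption standing in for it. All function names are hypothetical.

```python
import math
import random

def derived_allele_frequency(allele_counts, ancestral):
    """DAF at one SNP; the ancestral allele would come from an outgroup such as chimp."""
    total = sum(allele_counts.values())
    return (total - allele_counts.get(ancestral, 0)) / total

def daf_histogram(dafs, bin_width=0.14):
    """Percent of SNPs per DAF bin; the last bin absorbs frequencies up to 1.0."""
    n_bins = math.ceil(1.0 / bin_width)
    counts = [0] * n_bins
    for f in dafs:
        counts[min(int(f / bin_width), n_bins - 1)] += 1
    return [100.0 * c / len(dafs) for c in counts]

def permutation_p_value(feature_dafs, control_dafs, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean DAF (assumed test)."""
    rng = random.Random(seed)
    k = len(feature_dafs)

    def mean(xs):
        return sum(xs) / len(xs)

    observed = abs(mean(feature_dafs) - mean(control_dafs))
    pooled = list(feature_dafs) + list(control_dafs)
    hits = 0
    for _ in range(n_perm):
        # Reassign SNPs to the two groups at random.
        rng.shuffle(pooled)
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A shift of the feature histogram toward low derived allele frequencies relative to the control is the pattern expected under purifying selection.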
Heterozygosity (Taane Clark)
Indels
[Figure: indel plots with labels Human, Chimp, Macaque]
Identification of accelerated CNGs
[Figure: DAF of control CNGs (orange) vs. accelerated CNGs (green). x-axis: frequency class (0.00 to 0.90); y-axis: frequency (0 to 25). P = 0.0003]
Regions accelerated in humans
Selective constraints differ for genes expressed in different tissues (Nuria Lopez)
Genes expressed in more tissues are under stronger selective constraint (lower dN)
Tagging
• ENCODE is a near-complete inventory of common (MAF≥5%) sites
• How well do tag SNPs picked from thinned versions of ENCODE (to mimic ascertainment of Phase I and II) capture:
– all common variants
– functional sites
Paul de Bakker
Coverage of common variants by tags picked from simulated Phase I and II HapMap

Population sample          % r2>0.2   % r2>0.5   % r2>0.8   Mean r2
5kb HapMap (Phase I)
  CEU                      97.2%      86.4%      71.6%      0.83
  JPT/CHB                  96.3%      84.9%      70.2%      0.82
  YRI                      90.3%      64.4%      41.9%      0.64
1kb HapMap (Phase II)
  CEU                      99.6%      97.7%      93.9%      0.96
  JPT/CHB                  99.5%      97.7%      93.9%      0.96
  YRI                      99.1%      92.5%      80.9%      0.90
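The coverage columns in this table can be sketched as follows: for each common SNP, take the maximum pairwise r2 to any tag, then report the percentage of SNPs above each threshold and the mean. This is a sketch under stated assumptions: phased haplotypes are coded 0/1, only pairwise (single-marker) tagging is considered, and the function names are hypothetical. The slides do not describe the actual tagging procedure at this level of detail.

```python
def r_squared(hap_a, hap_b):
    """Pairwise r2 between two biallelic SNPs from phased 0/1 haplotypes."""
    n = len(hap_a)
    pa = sum(hap_a) / n
    pb = sum(hap_b) / n
    # Frequency of haplotypes carrying allele 1 at both SNPs.
    pab = sum(1 for x, y in zip(hap_a, hap_b) if x == 1 and y == 1) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = pab - pa * pb
    return d * d / denom

def coverage_summary(common_snps, tags, thresholds=(0.2, 0.5, 0.8)):
    """Percent of common SNPs whose best tag exceeds each r2 threshold, plus mean max r2."""
    best = [max(r_squared(snp, tag) for tag in tags) for snp in common_snps]
    summary = {
        f"pct_r2_gt_{t}": 100.0 * sum(b > t for b in best) / len(best)
        for t in thresholds
    }
    summary["mean_max_r2"] = sum(best) / len(best)
    return summary
```

Running the summary once per population sample and per thinned map (5 kb and 1 kb) would give one table row per call.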