8/7/2019 Correlations Genomic Data
1/41
David Brody
Maya Krishnan
Rajiv McCoy
Gourab Mukherjee
Identifying significant correlations between sets of genomic data
-
Key issues in finding relationships
between tracks of genomic data
From a statistical standpoint, how do you find these
relationships?
How do you know if the relationships are statistically
significant?
How can you generalize the methods you use so they will
work for multiple types of data?
-
Histone methylation H3K4me3 is
associated with active transcription
DNA methylation may physically impede
transcriptional machinery or recruit proteins
that modify histones (packaging proteins) in a
way that alters transcription.
One class of histone modification that can
influence transcription is methylation. One
example of a transcription-activating histone
modification is the trimethylation of the 4th
lysine residue of histone H3, abbreviated
as H3K4me3. Annotated active promoters
show peaks of this methylation signature, but
enhancers do not.
Heintzman et al. Nature Genetics 2007;39:311-318
Roh T et al. PNAS 2006;103:15782-15787
-
CpG islands lack DNA methylation,
signaling H3K4me3 at these promoters
CpG islands are regions of the genome with a high density of sites
where a cytosine base is followed by a guanine base. Such regions are
chemically unstable, and therefore evolutionary pressure is required
to maintain them.
Most human promoters contain CpG islands that lack DNA methylation
and are enriched for histone modifications characteristic of active
transcription, such as H3K4me3.
Recent studies have shown that methylation-free CpGs act as a signal
to recruit the binding protein Cfp1, which is involved in H3K4 methylation
(Bird et al. 2010).
-
Preprocessing of tracks for statistical tests
Given a function track and a measurement track
[Figure: a function track and a measurement track; labels: + + - +]
-
Preprocessing of tracks for statistical tests
Given a function track and a measurement track
[Figure: a function track and a measurement track; labels: + + - +. Binning: 10,000K / 200]
-
Preprocessing of tracks for statistical tests
Given a function track and a measurement track
[Figure: a function track and a measurement track; labels: + + - +. Binning: 10,000K / 200 and 100,000K / 200; values shown: 3 3 2 1]
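The binning step can be sketched roughly as follows. This is a minimal illustration, not the presenters' code: the function names, the half-open interval convention, and the way the upstream window is placed on each strand are all assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in base pairs (assumed convention)


def upstream_windows(starts: List[Tuple[int, str]], padding: int) -> List[Interval]:
    """Build an upstream window of `padding` bp for each (position, strand) pair.

    Assumption: on the "+" strand upstream means lower coordinates,
    on the "-" strand it means higher coordinates.
    """
    windows = []
    for pos, strand in starts:
        if strand == "+":
            windows.append((max(pos - padding, 0), pos))
        else:
            windows.append((pos, pos + padding))
    return windows


def bin_counts(intervals: List[Interval], genome_length: int, bin_size: int) -> List[int]:
    """Count, for every fixed-size bin, how many intervals touch that bin."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start, end in intervals:
        if end <= start:
            continue  # skip empty or malformed intervals
        first = max(start, 0) // bin_size
        last = min(end - 1, genome_length - 1) // bin_size
        for b in range(first, last + 1):
            counts[b] += 1
    return counts


if __name__ == "__main__":
    peaks = [(0, 5), (8, 12), (25, 30), (45, 50)]
    print(bin_counts(peaks, genome_length=50, bin_size=10))
    print(upstream_windows([(100, "+"), (100, "-")], padding=20))
```

A bin is then treated as a "hit" for a track when its count is at least one, which gives the per-bin yes/no values used by the tests on the following slides.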
-
Preprocessing of tracks for statistical tests
Code verification
Output: 0 1 3 1 0
-
Preprocessing of tracks for statistical tests
Given a function track and two measurement tracks
[Figure: a function track, measurement track 1, and measurement track 2; labels: + + - +]
-
Preprocessing of tracks for statistical tests
Given a function track and two measurement tracks
[Figure: a function track, measurement track 1, and measurement track 2; labels: + + - +. Binning: 100,000K / 200; values shown: 2 1 0 3]
-
Enrichment over assumption of
independence
For all bins, count:
how many have H3K4me3 markers
how many have promoters
how many have both
Do more bins have both H3K4me3 and the promoters than we
would expect if the two factors were independent?
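The counting described above can be sketched as below. It is a small illustration under assumed inputs: each track is given as one 0/1 value per bin, and the function name `cooccurrence` is hypothetical.

```python
from typing import Sequence, Tuple


def cooccurrence(a: Sequence[int], b: Sequence[int]) -> Tuple[float, float]:
    """Return (P(co-occurrence if independent), P(co-occurrence observed)).

    `a` and `b` hold one 0/1 flag per bin, e.g. "bin has an H3K4me3 marker"
    and "bin has a promoter".
    """
    if len(a) != len(b) or not a:
        raise ValueError("tracks must be non-empty and cover the same bins")
    n = len(a)
    p_a = sum(1 for x in a if x) / n
    p_b = sum(1 for y in b if y) / n
    p_both = sum(1 for x, y in zip(a, b) if x and y) / n
    return p_a * p_b, p_both


if __name__ == "__main__":
    marks = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    promoters = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
    expected, observed = cooccurrence(marks, promoters)
    print(f"expected if independent: {expected:.3f}, observed: {observed:.3f}")
```

An observed value well above the expected one is the kind of enrichment the question on this slide asks about; whether the gap is significant needs a separate test.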
-
Enrichment over assumption of
independence
Bin size  Padding size  P(co if indep)  P(co in reality)  p-value
10000     200           0.013           0.040             2.28 x 10^(-3389)
10000     500           0.014           0.041             2.91 x 10^(-3462)
10000     1000          0.014           0.042             9.26 x 10^(-3539)
50000     200           0.117           0.200             5.19 x 10^(-865)
50000     500           0.118           0.200             7.76 x 10^(-867)
50000     1000          0.118           0.201             3.61 x 10^(-866)
100000    200           0.241           0.343             3.95 x 10^(-371)
100000    500           0.242           0.343             2.58 x 10^(-371)
100000    1000          0.242           0.344             1.23 x 10^(-370)
-
Enrichment over assumption of
independence
[Figure: difference between assumption and reality; P(Indep) and P(Reality) plotted for trials 1 to 9]
-
Impact that varying input parameters has
on final results
Bin size  Padding size  Upstream hits  Histone marks  Overlaps
10000     200           40264          30508          11976
10000     500           40264          31308          12270
10000     1000          40264          32596          12695
50000     200           21808          19641          12073
50000     500           21808          19702          12107
50000     1000          21808          19823          12162
100000    200           15627          14096          10353
100000    500           15627          14115          10365
100000    1000          15627          14157          10387
SUMMARY
- Bin size affects the number of upstream hits and histone marks, as would be expected, but does not seem to have a large impact on the number of overlaps
- Padding size does not have much of an effect on the number of overlaps detected
-
Impact that varying input parameters has
on final results
[Figure: number of overlaps detected (y-axis, 0 to 14,000) versus bin size (x-axis, 0 to 120,000)]
-
Impact that varying input parameters has
on final results
[Figure: bin size versus overlap percentage; x-axis: bin size (0 to 120,000); y-axis: overlap percentage (0 to 0.4)]
-
Foldwise enrichment to further explore
low p-values and impact of bin size
Abandons binning approach
Examines percentage of genome covered by H3K4me3 markers
and/or upstream region
Conceptually similar to previous enrichment, but more direct
Base-pair-wise intersection
Compare percentage expected and actual percentage
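A base-pair-wise version of the comparison might look like the sketch below. The helper names (`merge`, `intersect`, `fold_enrichment`) and the half-open interval convention are assumptions for illustration, not taken from the slides.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), in base pairs (assumed convention)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge overlapping ones so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two merged, sorted interval lists with a two-pointer sweep."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def covered(intervals: List[Interval]) -> int:
    """Total number of bases covered by already-merged intervals."""
    return sum(end - start for start, end in intervals)


def fold_enrichment(a: List[Interval], b: List[Interval], genome_length: int):
    """Return (expected fraction if independent, observed fraction, fold)."""
    ma, mb = merge(a), merge(b)
    expected = (covered(ma) / genome_length) * (covered(mb) / genome_length)
    observed = covered(intersect(ma, mb)) / genome_length
    fold = observed / expected if expected > 0 else float("nan")
    return expected, observed, fold


if __name__ == "__main__":
    print(fold_enrichment([(0, 10), (5, 20)], [(15, 30), (40, 50)], genome_length=100))
```

Because no bins are involved, the result does not depend on a bin size, only on the upstream window length used to build the function track.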
-
Results of performing foldwise
enrichment on two tracks
Upstream length  Upstream bases  Histone bases  Both bases
200              7555840         120425900      71978675
500              18142155        120425900      72108300
1000             34863595        120425900      72268775
Upstream length  % Indep   % Reality
200              0.00001   0.02316
500              0.00023   0.02320
1000             0.00043   0.02325
-
Expanding test to analyze two tracks of
measurement data
This test can be expanded to test for a correlation between
multiple tracks of functional data
Are they pairwise independent? And does P(A ∩ B ∩ C) = P(A) P(B) P(C)?
In this example:
CpG islands and H3K4me3 markers are the two measurement
tracks
Upstream regions form the functional track
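The pairwise and three-way comparisons can be sketched as follows. This is an assumed formulation for illustration: tracks are 0/1 values per bin, and the function name `independence_table` and the track labels M1, M2, F are used only as placeholders for the two measurement tracks and the functional track.

```python
from itertools import combinations
from math import prod
from typing import Dict, Sequence, Tuple


def independence_table(tracks: Dict[str, Sequence[int]]) -> Dict[Tuple[str, ...], Tuple[float, float]]:
    """For every combination of two or more tracks, return
    (product of marginal probabilities, observed joint probability)."""
    names = sorted(tracks)
    n = len(tracks[names[0]])
    if any(len(tracks[name]) != n for name in names):
        raise ValueError("all tracks must cover the same bins")
    marginal = {name: sum(1 for x in tracks[name] if x) / n for name in names}
    table = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            expected = prod(marginal[name] for name in combo)
            joint = sum(1 for i in range(n) if all(tracks[name][i] for name in combo)) / n
            table[combo] = (expected, joint)
    return table


if __name__ == "__main__":
    toy = {
        "M1": [1, 1, 1, 0, 0, 0, 0, 0],
        "M2": [1, 1, 0, 1, 0, 0, 0, 0],
        "F": [1, 1, 0, 0, 1, 0, 0, 0],
    }
    for combo, (expected, joint) in independence_table(toy).items():
        print(" & ".join(combo), f"indep={expected:.5f}", f"reality={joint:.5f}")
```

Note that pairwise independence does not by itself imply the three-way product rule, which is why both are checked.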
-
Results of test between H3K4me3, CpG
islands, and upstream regions
Parameters      M1 & M2 (Indep)  M1 & M2 (Reality)
(10000, 200)    0.00850          0.04185
(10000, 500)    0.00850          0.04185
(10000, 1000)   0.00850          0.04185
(50000, 200)    0.09285          0.19181
(50000, 500)    0.09285          0.19181
(50000, 1000)   0.09285          0.19181
Parameters      M1 & F (Indep)   M1 & F (Reality)
(10000, 200)    0.01117          0.03437
(10000, 500)    0.01142          0.03513
(10000, 1000)   0.01181          0.03633
(50000, 200)    0.11239          0.19975
(50000, 500)    0.11264          0.20031
(50000, 1000)   0.11322          0.20122
Parameters      M2 & F (Indep)   M2 & F (Reality)
(10000, 200)    0.00626          0.03329
(10000, 500)    0.00640          0.03376
(10000, 1000)   0.00662          0.03421
(50000, 200)    0.08012          0.18621
(50000, 500)    0.08030          0.18648
(50000, 1000)   0.08071          0.18697
Parameters      M1 & M2 & F (Indep)  M1 & M2 & F (Reality)
(10000, 200)    0.00077              0.02408
(10000, 500)    0.00078              0.02437
(10000, 1000)   0.00081              0.02470
(50000, 200)    0.02891              0.15121
(50000, 500)    0.02898              0.15140
(50000, 1000)   0.02913              0.15177
-
Results of test between H3K4me3, CpG
islands, and upstream regions
[Figure: two panels, "M1 & F: Assumption vs. Reality" and "M1 & M2: Assumption vs. Reality"; x-axis: trial (1, 2); y-axis: 0 to 0.25; series: Assumption, Reality]
-
Results of test between H3K4me3, CpG
islands, and upstream regions
[Figure: two panels, "M2 & F: Assumption vs. Reality" (y-axis: 0 to 0.2) and "M1 & M2 & F: Assumption vs. Reality" (y-axis: 0 to 0.16); x-axis: trial (1, 2); series: Assumption, Reality]
-
Hypergeometric test for two tracks
Function track: upstream window of known genes
Measurement track: peaks of H3K4me3
k = # bins with both an H3K4me3 peak and an upstream region of a known gene
n = # bins with at least one H3K4me3 peak
K = # bins with an upstream region of a known gene
N = total # bins
bin size  upstream window  corrected p-value
10 Kb     200 bp           6.292483e-275
10 Kb     500 bp           7.351025e-283
10 Kb     1000 bp          2.413367e-299
50 Kb     200 bp           4.425021e-170
50 Kb     500 bp           8.912015e-172
50 Kb     1000 bp          2.539715e-171
100 Kb    200 bp           4.538280e-119
100 Kb    500 bp           3.354191e-120
100 Kb    1000 bp          1.425142e-119
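With k, n, K, and N as defined on this slide, an upper-tail hypergeometric p-value is P(X >= k) = sum over i from k to min(K, n) of C(K, i) C(N - K, n - i) / C(N, n). The sketch below computes its base-10 logarithm, since p-values of the size reported here are close to or below what a double can hold. It is an assumed implementation using only the standard library; the function names are hypothetical, and it does not apply any multiple-testing correction (the slides do not say which correction was used).

```python
import math


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log10_hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """log10 of P(X >= k) when n bins are drawn from N, of which K are hits."""
    lo = max(k, 0, n - (N - K))  # smallest overlap that is possible and >= k
    hi = min(K, n)               # largest possible overlap
    if lo > hi:
        return float("-inf")     # the requested overlap cannot occur
    denom = log_comb(N, n)
    terms = [log_comb(K, i) + log_comb(N - K, n - i) - denom for i in range(lo, hi + 1)]
    peak = max(terms)
    # log-sum-exp keeps the sum stable even when every term underflows as a float
    log_p = peak + math.log(sum(math.exp(t - peak) for t in terms))
    return log_p / math.log(10)


if __name__ == "__main__":
    # Toy numbers, not taken from the slides.
    print(log10_hypergeom_tail(k=2, n=3, K=4, N=10))
    print(log10_hypergeom_tail(k=500, n=1000, K=1000, N=10000))
```

Working in log space is a design choice for this example: it lets very small p-values be reported as exponents rather than rounding to zero.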
-
Hypergeometric test for three tracks
Function track: upstream window of known genes (proxy for promoter)
Measurement track 1: peaks of H3K4me3
Measurement track 2: CpG islands
Based on restricting counts to cases where the function track is a hit.
k = # bins with H3K4me3 peak, CpG island, and upstream region of known gene
n = # bins with H3K4me3 and upstream region of known gene
K = # bins with CpG island and upstream region of a known gene
N = total # bins with upstream regions of known gene
bin size  upstream window  corrected p-value
10 Kb     200 bp           6.69E-200
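The restriction to function-track hits can be sketched as below: the four counts are taken only over bins where the function track is a hit, and the tail probability is then computed exactly with integer arithmetic. The function names and the 0/1-per-bin input layout are assumptions for the example, and exact arithmetic is only practical for small inputs like this toy case.

```python
from fractions import Fraction
from math import comb
from typing import Sequence, Tuple


def restricted_counts(function: Sequence[int], m1: Sequence[int], m2: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (k, n, K, N), counting only bins where the function track is a hit."""
    k = n = K = N = 0
    for f, a, b in zip(function, m1, m2):
        if not f:
            continue          # ignore bins without an upstream region
        N += 1
        n += 1 if a else 0    # measurement track 1 hit (e.g. H3K4me3 peak)
        K += 1 if b else 0    # measurement track 2 hit (e.g. CpG island)
        k += 1 if (a and b) else 0
    return k, n, K, N


def hypergeom_tail(k: int, n: int, K: int, N: int) -> Fraction:
    """Exact P(X >= k) for a hypergeometric draw, as a fraction."""
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return Fraction(favourable, comb(N, n))


if __name__ == "__main__":
    upstream = [1, 1, 1, 1, 1, 0, 0, 0]
    h3k4me3 = [1, 1, 1, 0, 0, 1, 1, 0]
    cpg = [1, 1, 0, 1, 0, 1, 0, 0]
    counts = restricted_counts(upstream, h3k4me3, cpg)
    print(counts, hypergeom_tail(*counts))
```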
-
Region Overlap & Non-linearity
[Figure: a function track (labels: + + - +) and a measurement track]
-
Region Overlap & Non-linearity
[Figure: a function track (labels: + + - +) and a measurement track, with two regions of overlap marked]
-
Segmentation
-
Segmentation
[Figure: track divided into segments; annotation: min segmentation length]
-
Block-wise Sub-sampling
-
Block-wise Sub-sampling
Select segment
-
Block-wise Sub-sampling
Select segment
Select sub-segment
-
Block-wise Sub-sampling
Select segment
Select sub-segment
Repeat it and rescale to get null distribution
Calculate test statistics
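The idea of drawing blocks, repeating, and rescaling can be illustrated with the simplified sketch below. It is not the segmentation-based procedure shown on these slides: it treats the whole track as a single segment, uses an assumed statistic (observed overlap minus the product of the marginals), and assumes the spread of that statistic shrinks with the square root of the length. All names and parameters are hypothetical.

```python
import random
import statistics
from typing import Sequence, Tuple


def overlap_stat(a: Sequence[int], b: Sequence[int]) -> float:
    """Observed co-occurrence minus the value expected under independence."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(1 for x, y in zip(a, b) if x and y) / n
    return p_ab - p_a * p_b


def block_subsample(a: Sequence[int], b: Sequence[int], block_len: int,
                    n_blocks: int, seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed statistic, rescaled standard error, z-score)."""
    rng = random.Random(seed)
    n = len(a)
    observed = overlap_stat(a, b)
    block_stats = []
    for _ in range(n_blocks):
        # Pick a contiguous block so local dependence along the genome is preserved.
        start = rng.randrange(0, n - block_len + 1)
        block_stats.append(overlap_stat(a[start:start + block_len], b[start:start + block_len]))
    # Rescale the block-level spread to the full length (assumed sqrt scaling).
    se_full = statistics.stdev(block_stats) * (block_len / n) ** 0.5
    z = observed / se_full if se_full > 0 else float("nan")
    return observed, se_full, z


if __name__ == "__main__":
    rng = random.Random(1)
    track = [1 if rng.random() < 0.3 else 0 for _ in range(5000)]
    print(block_subsample(track, track, block_len=200, n_blocks=200))
```

Sampling contiguous blocks instead of shuffling individual positions is the point of the approach: it avoids treating neighboring positions as independent, which is one way extremely small p-values can arise.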
-
Googol^{-1} P-Values !!
-
Expected P-Values
-
Expected P-Values (n=2600)
-
Conclusion: finding relationships
between tracks of genomic data
The tests we implemented successfully identify significantly
correlated tracks
Given a statistical test and a p-value, we are able to determine the
expected correlation
Using different preprocessing techniques, the statistical tests
can be extended to multiple tracks to identify new biological
associations