![Page 1: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/1.jpg)
Large Scale Data Integration
Curtis Huttenhower
Sequence and Expression
01-24-08
![Page 2: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/2.jpg)
Functional Relationships
• Two genes that work together to achieve similar cellular goals are functionally related
– Proteins that co-complex: ribosomal, polymerase, ORC, etc.
– A TF and its target
– Two enzymes catalyzing different steps in the same metabolic pathway
– A membrane-bound receptor and a downstream phosphorylation target
– etc.
• Genes that do really different stuff are considered to be functionally unrelated
• Anything else is neither
– Genes with unknown function
– Genes in similar but non-identical pathways
![Page 3: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/3.jpg)
Functional Relationships
• How can we tell? These databases classify every gene pair into one of three groups:
• Functionally related
• Unrelated
• Neither
![Page 4: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/4.jpg)
Data
• Well, as long as we’re talking about gene pairs, how do pairs of genes act in data?
– Colocalization: Yes/No
– Two-hybrid: Yes/No
– Affinity: Yes/No
– Shared miRNA sites: Yes/No
– Same chromosome band: Yes/No
– Conserved TF sites: Yes/No
– MANY MANY microarrays: Correlation (High/Low)
![Page 5: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/5.jpg)
Data
• These fall into two general categories:
– Yes/no (binary or 0/1)
– Continuous (numerical scores)

| Gene pair | Two-hybrid (Yes/No) | Microarrays (Correlation) | Microarrays after binning |
|---|---|---|---|
| G1 G2 | 0 | 0.9 | 4 |
| G1 G3 | 1 | 0.75 | 3 |
| G1 G4 | - | 0.1 | 1 |
| G2 G3 | - | -0.1 | 0 |
| G2 G4 | 0 | 0.2 | 2 |
| G3 G4 | 1 | -0.5 | 0 |

Each dataset turns into a set of gene pairs labeled with small integers (or nothing).
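The binning step can be sketched in a few lines of Python. The bin edges below are hypothetical: they were picked only so that the example correlations on this slide land in the bins shown, and the talk does not say which boundaries are actually used.

```python
from bisect import bisect_right

# Hypothetical bin edges (an assumption, not from the slides), chosen so the
# slide's example correlations map to the slide's example bins.
BIN_EDGES = [0.0, 0.15, 0.5, 0.8]


def bin_value(score, edges=BIN_EDGES):
    """Map a continuous score to a small integer bin in 0..len(edges)."""
    return bisect_right(edges, score)


# Correlations for each gene pair, copied from the slide.
correlations = {
    ("G1", "G2"): 0.9,
    ("G1", "G3"): 0.75,
    ("G1", "G4"): 0.1,
    ("G2", "G3"): -0.1,
    ("G2", "G4"): 0.2,
    ("G3", "G4"): -0.5,
}

binned = {pair: bin_value(r) for pair, r in correlations.items()}
for pair, label in binned.items():
    print(*pair, label)
```

Binary datasets need no binning; a missing measurement is simply left out of the dictionary.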
![Page 6: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/6.jpg)
Integration
• Now, for each gene pair…
– We have an “answer” indicating whether it’s a related pair: G1 G2 1
– And we have a bunch of datasets contributing their opinions (i.e. experimental results):
DS1: G1 G2 0
DS2: G1 G2 1
DS3: G1 G2 -
DSN: G1 G2 3
![Page 7: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/7.jpg)
Integration
• Let’s look at each dataset individually:
– Our answers know about a bunch of unrelated genes: G1 G3 0, G3 G4 0, G9 G14 0, G10 G11 0, …
– And a bunch of related genes: G1 G2 1, G1 G3 1, G4 G7 1, G10 G12 1, …
– Within each dataset, some subset of these pairs has values. For DS1:
Unrelated pairs: G1 G3 0, G3 G4 -, G9 G14 1, G10 G11 0, …
Related pairs: G1 G2 1, G1 G3 0, G4 G7 -, G10 G12 1, …
![Page 8: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/8.jpg)
Integration
• Let’s look at each dataset individually:
– Our answers know about a bunch of unrelated genes: G1 G3 0, G3 G4 0, G9 G14 0, G10 G11 0, …
– And a bunch of related genes: G1 G2 1, G1 G3 1, G4 G7 1, G10 G12 1, …
– Within each dataset, some subset of these pairs has values. For DS2:
Unrelated pairs: G1 G3 -, G3 G4 0, G9 G14 0, G10 G11 -, …
Related pairs: G1 G2 -, G1 G3 1, G4 G7 1, G10 G12 1, …
![Page 9: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/9.jpg)
Integration
• Within each dataset, let’s count up the number of times each value occurs for each type of gene pair (related or unrelated):
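Unrelated answers: G1 G3 0, G3 G4 0, G9 G14 0, G10 G11 0, …
Related answers: G1 G2 1, G1 G3 1, G4 G7 1, G10 G12 1, …
DS1 values for the unrelated pairs: G1 G3 0, G3 G4 -, G9 G14 1, G10 G11 0, …
DS1 values for the related pairs: G1 G2 1, G1 G3 0, G4 G7 -, G10 G12 1, …

| DS1 value | Functionally related? No | Functionally related? Yes |
|---|---|---|
| 0 | 96 | 9 |
| 1 | 13 | 36 |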
![Page 10: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/10.jpg)
Integration
• Within each dataset, let’s count up the number of times each value occurs for each type of gene pair (related or unrelated), then divide by the total for that type to get probabilities:

| DS1 value | Functionally related? No | Functionally related? Yes |
|---|---|---|
| 0 | 0.9 | 0.2 |
| 1 | 0.1 | 0.8 |
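As a rough sketch of how the DS1 counts become probabilities, the snippet below normalizes the counts within each pair type. It assumes the first column of counts (96 and 13) belongs to the unrelated pairs and the second (9 and 36) to the related pairs, which is the reading consistent with the rounded probabilities on the slide. The function and variable names are illustrative only.

```python
def normalize_counts(counts):
    """Turn {pair_type: {value: count}} into {pair_type: {value: probability}}."""
    probs = {}
    for pair_type, by_value in counts.items():
        total = sum(by_value.values())
        probs[pair_type] = {value: n / total for value, n in by_value.items()}
    return probs


# DS1 counts from the slide (the column assignment is an assumption, see above).
ds1_counts = {
    "unrelated": {0: 96, 1: 13},
    "related": {0: 9, 1: 36},
}

ds1_probs = normalize_counts(ds1_counts)
for pair_type, dist in ds1_probs.items():
    print(pair_type, {value: round(p, 1) for value, p in dist.items()})
```

Each inner dictionary is an estimate of P(D = d | FR) for one type of gene pair, so its values sum to 1.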
![Page 11: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/11.jpg)
Integration
• Each dataset is now represented by two probability distributions:
– One for related gene pairs, one for unrelated
[Figure: Prob. vs. dataset value for related and unrelated pairs. DS1 (values 0, 1): related genes are more likely to bind. DS5 (values 0 to 5): related genes are more likely to be highly correlated.]
This is particularly noticeable for continuous datasets, where these represent correlations.
![Page 12: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/12.jpg)
Integration
• In the best case, datasets look like these:
[Figure: Prob. vs. dataset value (0, 1) for related and unrelated pairs]
• In the worst case, they look like these:
[Figure: Prob. vs. dataset value (0, 1) for related and unrelated pairs]
![Page 13: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/13.jpg)
Integration
• The difference between a dataset’s related and unrelated probability distributions indicates how informative it is.
• Some microarrays might look like these:
[Figure: Prob. vs. binned correlation (0 to 5)] Even if genes are highly correlated, it doesn’t mean anything, because unrelated genes are also correlated.
[Figure: Prob. vs. binned correlation (0 to 5)] Everything’s really correlated! We can actually correct microarrays like this during preprocessing.
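The slides do not name a specific measure of informativeness. As one possible illustration (an assumption, not the talk's method), the total variation distance between the related and unrelated distributions is 0 when they are identical and 1 when they never overlap. The DS1 numbers are the rounded probabilities from the earlier slide; the second dataset is made up.

```python
def total_variation(p, q):
    """Half the sum of absolute differences between two discrete distributions."""
    values = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in values)


# DS1: related pairs mostly score 1, unrelated pairs mostly score 0.
ds1_related = {0: 0.2, 1: 0.8}
ds1_unrelated = {0: 0.9, 1: 0.1}

# A hypothetical dataset whose two distributions are identical.
flat_related = {0: 0.5, 1: 0.5}
flat_unrelated = {0: 0.5, 1: 0.5}

informative = total_variation(ds1_related, ds1_unrelated)
uninformative = total_variation(flat_related, flat_unrelated)
print(round(informative, 2), round(uninformative, 2))
```

A dataset in which everything is correlated scores near 0 here, because related and unrelated pairs are shifted together.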
![Page 14: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/14.jpg)
Prediction
Ok, so what?
• Given what we know about some genes, we’ve learned something about datasets:
– For each dataset Di, we know P(Di = d | FR)
• What we want to know is, given some data, what can we predict about unknown genes?
![Page 15: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/15.jpg)
Prediction
• We know:
– P(FR): the probability of a gene pair being functionally related
– P(D = d): the probability of each dataset containing some value
– P(D = d | FR): the probability of each value given a relationship (or not)
• And we want to know:
– P(FR | D = d): the probability that new genes are related given some data
![Page 16: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/16.jpg)
Prediction
• Enter Thomas Bayes:
• Who established Bayes’ theorem:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
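As a worked example for a single dataset, take the rounded DS1 probabilities from the earlier slide, read as P(D1 = 1 | FR) = 0.8 and P(D1 = 1 | not FR) = 0.1, and assume a prior of P(FR) = 0.05. The prior is hypothetical; the slides do not give one.

$$P(FR \mid D_1 = 1) = \frac{P(D_1 = 1 \mid FR)\,P(FR)}{P(D_1 = 1)} = \frac{0.8 \times 0.05}{0.8 \times 0.05 + 0.1 \times 0.95} = \frac{0.04}{0.135} \approx 0.30$$

Under these assumed numbers, a single positive result raises the probability of a functional relationship from 5% to about 30%.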
![Page 17: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/17.jpg)
Prediction
<insert math here>
For each new gene pair, we can find its probability of being functionally related
• Each dataset is weighted according to how informative we’ve calculated it to be
• Datasets with no data for a particular gene pair are ignored
• And importantly for us, this all happens very quickly, regardless of the number of genes or datasets
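A minimal sketch of this step, assuming the datasets are treated as conditionally independent given the relationship (a naive Bayes combination; the slides leave the math as a placeholder, so this is one plausible reading rather than the talk's exact formula). The prior and the DS2 table are hypothetical; the DS1 table uses the rounded probabilities from the earlier slide.

```python
def posterior_related(prior, tables, observations):
    """Combine per-dataset evidence for one gene pair with Bayes' theorem.

    prior:        P(FR), the probability that a random pair is related
    tables:       {dataset: {"related": {value: p}, "unrelated": {value: p}}}
    observations: {dataset: observed value, or None if the pair has no data}
    """
    weight_related = prior
    weight_unrelated = 1.0 - prior
    for dataset, value in observations.items():
        if value is None:
            continue  # datasets with no data for this pair are ignored
        weight_related *= tables[dataset]["related"][value]
        weight_unrelated *= tables[dataset]["unrelated"][value]
    return weight_related / (weight_related + weight_unrelated)


tables = {
    "DS1": {"related": {0: 0.2, 1: 0.8}, "unrelated": {0: 0.9, 1: 0.1}},
    # Hypothetical uninformative dataset: identical distributions.
    "DS2": {"related": {0: 0.5, 1: 0.5}, "unrelated": {0: 0.5, 1: 0.5}},
}

# DS3 has no measurement for this pair, so it never enters the product.
observations = {"DS1": 1, "DS2": 1, "DS3": None}

p = posterior_related(prior=0.05, tables=tables, observations=observations)
print(round(p, 3))
```

The uninformative DS2 multiplies both weights by the same factor and so leaves the result unchanged, which is the sense in which each dataset is weighted by how informative it is. The cost is one pass over the datasets per gene pair.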
![Page 18: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/18.jpg)
Prediction
• The result is that we produce a probability of functional relationship for each gene pair:
G1 G2 0.9
G1 G3 0.75
G1 G4 0.1
G2 G3 0.3
G2 G4 0.2
…
• Which in turn translates into a fully connected interaction network
– Only the most confident edges are typically shown
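Showing only the most confident edges can be sketched as a simple filter. The cutoff of 0.5 is an arbitrary illustrative choice, not a value from the talk.

```python
# Predicted probabilities from the slide, stored as (gene, gene, probability).
edges = [
    ("G1", "G2", 0.9),
    ("G1", "G3", 0.75),
    ("G1", "G4", 0.1),
    ("G2", "G3", 0.3),
    ("G2", "G4", 0.2),
]


def confident_edges(edges, cutoff=0.5):
    """Keep edges at or above the cutoff, strongest first."""
    kept = [edge for edge in edges if edge[2] >= cutoff]
    return sorted(kept, key=lambda edge: edge[2], reverse=True)


shown = confident_edges(edges)
print(shown)
```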
![Page 19: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/19.jpg)
Context Specificity
• We can do even better!
• This process lets us figure out how much to “trust” each dataset.
• But datasets can give better (or worse) results in particular biological areas:
[Figure: Prob. vs. binned correlation (0 to 5), two panels: microarrays for all gene pairs, and microarrays for ribosomal gene pairs]
![Page 23: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/23.jpg)
Context Specificity
• So we don’t just learn one probability distribution per dataset.
• We learn one probability distribution per dataset per biological process of interest!
• This means that for each gene pair, we can predict a different probability of relationship per process of interest.
• Which in turn produces different interaction networks for each process.
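[Figure: Prob. vs. dataset value (0, 1), one pair of distributions per process: Carbon metabolism, Translation, Autophagy]

| Gene pair | Carbon metabolism | Translation | Autophagy |
|---|---|---|---|
| G1 G2 | 0.9 | 0.15 | 0.1 |
| G1 G3 | 0.75 | 0.1 | 0.2 |
| G1 G4 | 0.1 | 0.9 | 0.75 |
| G2 G3 | 0.3 | 0.2 | 0.15 |
| G2 G4 | 0.2 | 0.25 | 0.9 |
| … | … | … | … |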
![Page 24: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/24.jpg)
Predicting Gene Function
Ok, so what?
• We can now dig through these networks for interesting things:
– YFG, its interaction partners, and their processes
– Each edge represents specific datasets/publications
– Dense clusters (new functional modules)
– Areas that change a lot from process to process
– What known disease genes are doing
– Find relationships between TFs and their targets
– Predicting function for uncharacterized genes
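One item in this list, finding areas that change a lot from process to process, can be sketched with the per-process probabilities from the previous slide. Assigning the three lists of values to carbon metabolism, translation, and autophagy in slide order is an assumption, and using the max-minus-min spread as the measure of change is an illustrative choice rather than anything the talk specifies.

```python
# Per-process probabilities from the previous slide. The mapping of each list
# to a process name (in slide order) is an assumption.
networks = {
    "carbon metabolism": {
        ("G1", "G2"): 0.9, ("G1", "G3"): 0.75, ("G1", "G4"): 0.1,
        ("G2", "G3"): 0.3, ("G2", "G4"): 0.2,
    },
    "translation": {
        ("G1", "G2"): 0.15, ("G1", "G3"): 0.1, ("G1", "G4"): 0.9,
        ("G2", "G3"): 0.2, ("G2", "G4"): 0.25,
    },
    "autophagy": {
        ("G1", "G2"): 0.1, ("G1", "G3"): 0.2, ("G1", "G4"): 0.75,
        ("G2", "G3"): 0.15, ("G2", "G4"): 0.9,
    },
}


def spread_across_processes(networks):
    """For each pair present in every network, return max minus min probability."""
    shared = set.intersection(*(set(net) for net in networks.values()))
    spread = {}
    for pair in shared:
        values = [net[pair] for net in networks.values()]
        spread[pair] = max(values) - min(values)
    return spread


spread = spread_across_processes(networks)
for pair, change in sorted(spread.items(), key=lambda item: -item[1]):
    print(*pair, round(change, 2))
```

Pairs with a large spread are strongly related in some processes and not in others; pairs with a small spread look the same everywhere.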
![Page 25: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/25.jpg)
Predicting Gene Function
• Suppose we have a whole interaction network for autophagy:
• How do we predict new genes involved in the process?
– Look at stuff “around” the known autophagy genes!
![Page 30: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/30.jpg)
Predicting Gene Function
• The bioPIXIE algorithm:
– Given a network and some query genes, find the other genes most strongly connected to the whole query
G1: 0.5 + 0.5 + 0.1 = 1.1
G2: 0.9 + 0.9 + 0.5 = 2.3
…
– Then display the genes with the best scores and the strongest edges connecting them
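The scoring shown on this slide, summing each candidate's edge weights to the query genes, can be sketched as follows. This mirrors the slide's arithmetic only; the published bioPIXIE algorithm may involve more than this. The query gene names Q1 to Q3 are placeholders, since the slide does not name them.

```python
# Edge weights from each candidate gene to the three query genes, taken from
# the slide. Q1..Q3 are hypothetical names for the unnamed query genes.
network = {
    "G1": {"Q1": 0.5, "Q2": 0.5, "Q3": 0.1},
    "G2": {"Q1": 0.9, "Q2": 0.9, "Q3": 0.5},
}
query = ["Q1", "Q2", "Q3"]


def sum_score(edges, query):
    """Total edge weight between one candidate gene and the query genes."""
    return sum(edges.get(q, 0.0) for q in query)


scores = {gene: sum_score(edges, query) for gene, edges in network.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for gene in ranked:
    print(gene, round(scores[gene], 2))
```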
![Page 34: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/34.jpg)
Predicting Gene Function
• The ratio algorithm:
– Given a network and some query genes, find the other genes most specifically connected to the whole query
$$\text{G1:}\quad \frac{(0.5 + 0.5 + 0.1)/3}{(0.5 + 0.5 + 0.1 + 0.9 + 0.1 + 0.5)/6} \approx 0.8$$
$$\text{G2:}\quad \frac{(0.9 + 0.9 + 0.5)/3}{(0.9 + 0.9 + 0.5 + 0.5 + 0.1 + 0.1)/6} \approx 1.5$$
…
– Then display the genes with the best scores and the strongest edges connecting them
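Read from the slide's arithmetic, the ratio score appears to be a gene's mean edge weight to the query genes divided by its mean edge weight to all of its neighbors. The sketch below implements that reading, which is an interpretation of the numbers shown rather than a confirmed definition. Q1 to Q3 and X1 to X3 are placeholder names for the unnamed query and non-query neighbors.

```python
def ratio_score(edges, query):
    """Mean weight to query genes divided by mean weight to all neighbors."""
    to_query = [w for gene, w in edges.items() if gene in query]
    mean_query = sum(to_query) / len(to_query)
    mean_all = sum(edges.values()) / len(edges)
    return mean_query / mean_all


query = {"Q1", "Q2", "Q3"}

# Edge weights from the slide; neighbor names are hypothetical.
network = {
    "G1": {"Q1": 0.5, "Q2": 0.5, "Q3": 0.1, "X1": 0.9, "X2": 0.1, "X3": 0.5},
    "G2": {"Q1": 0.9, "Q2": 0.9, "Q3": 0.5, "X1": 0.5, "X2": 0.1, "X3": 0.1},
}

scores = {gene: ratio_score(edges, query) for gene, edges in network.items()}
for gene, score in scores.items():
    print(gene, round(score, 2))
```

A score above 1 means the gene is more strongly tied to the query than to its neighborhood in general; a score below 1 means the opposite.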
![Page 36: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/36.jpg)
Predicting Gene Function
• These can differ a lot, particularly for “hubby” genes!
bioPIXIE
G1: 0.9 + 0.5 = 1.4
G2: 0.5 + 0.5 = 1.0
Ratio
$$\text{G1:}\quad \frac{(0.9 + 0.5)/2}{(0.9 + 0.5 + 0.9 + 0.1)/4} \approx 1.2$$
$$\text{G2:}\quad \frac{(0.5 + 0.5)/2}{(0.5 + 0.5 + 0.1 + 0.1)/4} \approx 1.7$$
This difference is exacerbated when the query isn’t itself strongly connected, since it makes it easy for hubby genes to dominate bioPIXIE’s results.
![Page 37: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/37.jpg)
Predicting Gene Function
• How is this relevant to us?
• Suppose we ask about just a few genes
– If they’re not internally consistent in the data, bioPIXIE’s results are mostly hubs
– This usually means that each predicted gene is only related to one or two of the query genes
![Page 38: Large Scale Data Integration](https://reader037.vdocuments.site/reader037/viewer/2022102908/56812f20550346895d94b818/html5/thumbnails/38.jpg)
Predicting Gene Function
• This is a problem in the human genome, where our prior knowledge is relatively limited
• The ratio algorithm generates predictions that are targeted towards the commonalities of the query: