Learning regulatory networks from postgenomic data and prior knowledge

Dirk Husmeier

1) Biomathematics & Statistics Scotland

2) Centre for Systems Biology at Edinburgh

Raf signalling network

From Sachs et al Science 2005

Systems Biology

unknown

high-throughput experiment

postgenomic data

machine learning

statistical methods

Bayesian networks

• Marriage between graph theory and probability theory.

• Directed acyclic graph (DAG) representing conditional independence relations.

• It is possible to score a network in light of the data: P(D|M), D: data, M: network structure.

• We can infer how well a particular network explains the observed data.

$$P(A,B,C,D,E,F) = P(A)\,P(B|A)\,P(C|A)\,P(D|B,C)\,P(E|D)\,P(F|C,D)$$
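The factorisation of the joint distribution along the DAG can be checked numerically. The following is a minimal sketch, assuming binary variables and a made-up conditional probability rule (`p_true` is hypothetical and not taken from the slides); it only illustrates that the product of the local conditionals defines a properly normalised joint distribution.

```python
from itertools import product

# Parent sets of the six-node example DAG (A -> B, A -> C, B,C -> D, D -> E, C,D -> F).
parents = {"A": (), "B": ("A",), "C": ("A",),
           "D": ("B", "C"), "E": ("D",), "F": ("C", "D")}

def p_true(node, parent_values):
    """Hypothetical P(node = 1 | parents): grows with the fraction of active parents."""
    return 0.2 + 0.6 * sum(parent_values) / max(len(parent_values), 1)

def joint(assignment):
    """Joint probability as the product of the local conditional probabilities."""
    prob = 1.0
    for node, pa in parents.items():
        p1 = p_true(node, [assignment[p] for p in pa])
        prob *= p1 if assignment[node] else 1.0 - p1
    return prob

# Summing the joint over all 2^6 configurations should give 1 (up to rounding).
total = sum(joint(dict(zip(parents, values)))
            for values in product([0, 1], repeat=len(parents)))
```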

Parameters

Learning Bayesian networks

P(M|D) = P(D|M) P(M) / Z

M: Network structure. D: Data

MCMC in structure space: Madigan & York (1995), Giudici & Castelo (2003)

Alternative paradigm: order MCMC

Machine Learning, 2004

Successful application of Bayesian networks to the Raf regulatory network

Flow cytometry data

• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins

• 5400 cells have been measured under 9 different cellular conditions (cues)

• Optimization with hill climbing

• Perfect reconstruction

Microarray data

• Spellman et al (1998): cell cycle, 73 samples

• Tu et al (2005): metabolic cycle, 36 samples

[Figure: both data sets plotted against time]

[Figure: AUC scores and TP scores for FP = 5]

Part 1

Integration of prior knowledge

Use TF binding motifs in promoter sequences

Biological prior knowledge matrix

Biological Prior Knowledge

Define the energy of a Graph G

Indicates some knowledge about the relationship between genes i and j

Prior distribution over networks

Deviation between the network G and the prior knowledge B: the “energy”

Graph: $G_{ij} \in \{0,1\}$

Prior knowledge: $B_{ij} \in [0,1]$

Hyperparameter
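The energy formula itself did not survive in the slide text. The sketch below assumes the energy is the summed absolute deviation between the adjacency matrix G and the prior knowledge matrix B, and that the prior has the Gibbs form P(G|β) ∝ exp(−β E(G)) with hyperparameter β. Both forms are assumptions consistent with the labels on the slide, and the matrices are made-up toy values.

```python
import numpy as np

def energy(G, B):
    """Assumed energy: E(G) = sum_ij |B_ij - G_ij| (deviation of G from the prior B)."""
    return float(np.abs(np.asarray(B) - np.asarray(G)).sum())

def unnormalised_prior(G, B, beta):
    """Assumed Gibbs form exp(-beta * E(G)); the partition function Z(beta) is omitted."""
    return float(np.exp(-beta * energy(G, B)))

# Toy prior knowledge for three genes: B[i][j] close to 1 favours the edge i -> j.
B = np.array([[0.0, 0.9, 0.5],
              [0.1, 0.0, 0.8],
              [0.5, 0.2, 0.0]])
G_agree = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])     # edges supported by B
G_conflict = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]])  # edges contradicted by B

e_agree, e_conflict = energy(G_agree, B), energy(G_conflict, B)
```

With β = 0 the prior is flat and the prior knowledge is ignored; larger β penalises networks that deviate from B more strongly.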

New contribution

• Generalization to more sources of prior knowledge

• Inferring the hyperparameters

• Bayesian approach

Multiple sources of prior knowledge

Sample networks and hyperparameters from the posterior distribution

Bayesian networks with two sources of prior knowledge

BNs + MCMC

Recovered networks and trade-off parameters

Source 1 Source 2

Sample networks and hyperparameters from the posterior distribution with MCMC

Metropolis-Hastings scheme

Proposal probabilities

Prior distribution

Rewriting the energy

Energy of a network

Approximation of the partition function

Partition function of an ideal gas
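One way to read the ideal-gas analogy: if the acyclicity constraint is ignored, the nodes do not interact, the energy decomposes into one term per node, and the partition function factorises into a product of per-node sums over parent sets. The sketch below assumes exactly that reading, together with the absolute-deviation energy and a fan-in limit `max_fanin`. The function names and the toy matrix are hypothetical.

```python
from itertools import combinations
import math

def local_energy(B, node, parent_set):
    """Energy contribution of one node: sum over i != node of |B[i][node] - G[i][node]|."""
    return sum((1.0 - B[i][node]) if i in parent_set else B[i][node]
               for i in range(len(B)) if i != node)

def log_partition_ideal_gas(B, beta, max_fanin):
    """log Z(beta) under the assumed factorisation Z = prod_n sum_pa exp(-beta * E(n, pa))."""
    n = len(B)
    log_z = 0.0
    for node in range(n):
        others = [i for i in range(n) if i != node]
        z_node = sum(math.exp(-beta * local_energy(B, node, set(pa)))
                     for k in range(max_fanin + 1)
                     for pa in combinations(others, k))
        log_z += math.log(z_node)
    return log_z

# Toy prior knowledge matrix for three genes (made-up values).
B = [[0.0, 0.9, 0.5],
     [0.1, 0.0, 0.8],
     [0.5, 0.2, 0.0]]
log_z = log_partition_ideal_gas(B, beta=1.3, max_fanin=2)
```

Because cyclic graphs are included in the sum, this quantity is an upper bound on the exact partition function over DAGs; the benefit is that it costs a sum per node rather than a sum over all network structures.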

Evaluation on the Raf regulatory network

Evaluation: Raf signalling pathway

• Cellular signalling network of 11 phosphorylated proteins and phospholipids in human immune system cells

• Deregulation → carcinogenesis

• Extensively studied in the literature → gold standard network

Data and prior knowledge

Flow cytometry data

• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins

• 5400 cells have been measured under 9 different cellular conditions (cues)

• Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments

Microarray example

• Spellman et al (1998): cell cycle, 73 samples

• Tu et al (2005): metabolic cycle, 36 samples

[Figure: both data sets plotted against time]

Prior knowledge from KEGG

Prior distribution

Raf network

Data and prior knowledge

+ KEGG

+ Random

Evaluation

• Can the method automatically evaluate how useful the different sources of prior knowledge are?

• Do we get an improvement in the regulatory network reconstruction?

• Is this improvement optimal?

Sampled values of the hyperparameters

Bayesian networks with two sources of prior knowledge

BNs + MCMC

Random KEGG

Evaluation

• We use the area under the receiver operating characteristic curve (AUC).

[Figure: example ROC curves with AUC = 1, 0.5 < AUC < 1, and AUC = 0.5]
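As an illustration of the AUC criterion, the sketch below computes the AUC from a list of edge scores (for example marginal posterior probabilities of edges) and gold-standard labels, using the rank-based (Mann-Whitney) formulation. The function name and the toy inputs are hypothetical; the slides do not state how the AUC was computed.

```python
def auc(scores, labels):
    """AUC as the probability that a true edge is scored above a false edge (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: four candidate edges, the first two are in the gold standard.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # all true edges ranked first
chance = auc([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0])   # ranking no better than random
```

A value of 1 corresponds to a perfect ranking of the edges and 0.5 to the random expectation.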

Performance evaluation: ROC curves

5 FP counts

Alternative performance evaluation: True positive (TP) scores
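The TP score can be sketched as follows, assuming it is defined as the number of true edges recovered when the score threshold is lowered until 5 false positives have been admitted. The function name and the handling of ties (ignored here) are assumptions.

```python
def tp_at_fp(scores, labels, max_fp=5):
    """Count true positives in the score ranking before false positive number max_fp + 1."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    for _, is_true in ranked:
        if is_true:
            tp += 1
        elif fp == max_fp:
            break  # admitting this edge would exceed the allowed number of false positives
        else:
            fp += 1
    return tp

# Toy ranking of 11 candidate edges, already ordered by decreasing score.
scores = list(range(11, 0, -1))
labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1]
tp_score = tp_at_fp(scores, labels, max_fp=5)
```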

Flow cytometry data and KEGG

Evaluation

Learning the trade-off hyperparameter

• Repeat MCMC simulations for a large set of fixed hyperparameters β

• Obtain AUC scores for each value of β

• Compare with the proposed scheme in which β is automatically inferred.

Mean and standard deviation of the sampled trade-off parameter

Flow cytometry data and KEGG

Part 2

Combining data from different experimental conditions

What if we have multiple data sets obtained under different experimental conditions?

Example: Cytokine network

• Infection

• Treatment with IFN

• Infection and treatment with IFN

Collaboration with Peter Ghazal, Paul Dickinson, Kevin Robertson, Thorsten Forster & Steve Watterson.

[Figure: two ways of combining the data sets: monolithic vs. individual]

Propose a compromise between the two

[Figure: networks M1 and M2]

Compromise between the two previous ways of combining the data

BGe or BDe

Ideal gas approximation

Empirical evaluation

Real application: macrophages infected with CMV and pre-treated with IFN-γ

No gold-standard

Simulated data from the Raf signalling network

[Figure: the Raf network and the v-Raf network]

Simulated data

Weights between nodes are different for different data sets.

5 data sets, 100 data points each:

• 1 random data set (pure noise)

• 1 data set from the modified network

• 3 data sets from the Raf network, but with different regulation strengths

[Figure: networks M1 and M2; data sets labelled “Corrupt, noisy data”, “Modified network” and “Raf network”]

Posterior distribution of β

Network reconstruction accuracy

Convergence problems

Coupling method

Std MCMC

Data sets: 1 random (blue), 3 Raf, 1 v-Raf (cyan)

Trace plots of sampled hyperparameters; Gaussian data set

log likelihood

Conclusions – Part 2

• The MCMC simulations have convergence problems.

• If the simulations “converge”:

– The random data set is identified and switched off.

– Data from a slightly modified network are also identified.

– The reconstructed network outperforms the two competing approaches.

• Future work: the convergence problems need to be addressed.

Part 3

Markov chain Monte Carlo

Learning Bayesian networks

P(M|D) = P(D|M) P(M) / Z

M: Network structure. D: Data

MCMC in structure space: Madigan & York (1995), Giudici & Castelo (2003)

Main idea

Propose new parents from the distribution:

• Identify those new parents that are involved in the formation of directed cycles.

• Orphan them, and sample new parents for them subject to the acyclicity constraint.

1) Select a node

2) Sample new parents

3) Find directed cycles

4) Orphan “loopy” parents

5) Sample new parents for these parents
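The five steps above can be sketched as a proposal function on a DAG stored as a mapping from each node to its parent set. This is only an illustration of the move itself: the proposal distribution from the slide is not available, so parents are drawn uniformly at random here (an assumption), and the Hastings factor needed for a valid Metropolis-Hastings sampler is not computed. All function names are hypothetical.

```python
import random

def descendants(parents, node):
    """All nodes reachable from `node` by following directed edges parent -> child."""
    children = {n: set() for n in parents}
    for child, pa in parents.items():
        for p in pa:
            children[p].add(child)
    seen, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def is_acyclic(parents):
    """A graph is acyclic if no node is its own descendant."""
    return all(n not in descendants(parents, n) for n in parents)

def sample_parents(candidates, max_fanin, rng):
    """Placeholder proposal: a uniformly random subset of at most max_fanin candidates."""
    candidates = sorted(candidates)
    k = rng.randint(0, min(max_fanin, len(candidates)))
    return set(rng.sample(candidates, k))

def propose(parents, max_fanin=3, rng=random):
    """Return a new DAG obtained by one application of the five-step move."""
    new = {n: set(pa) for n, pa in parents.items()}
    node = rng.choice(sorted(new))                                 # 1) select a node
    new[node] = sample_parents(set(new) - {node}, max_fanin, rng)  # 2) sample new parents
    loopy = new[node] & descendants(new, node)                     # 3) parents closing a cycle
    for p in loopy:                                                # 4) orphan "loopy" parents
        new[p] = set()
    for p in sorted(loopy):                                        # 5) resample their parents,
        allowed = set(new) - {p} - descendants(new, p)             #    excluding descendants
        new[p] = sample_parents(allowed, max_fanin, rng)
    return new
```

Orphaning removes every edge into a loopy parent, so every directed cycle created in step 2 is broken, and step 5 only admits non-descendants as new parents, which keeps the graph acyclic.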

Mathematical Challenge:

• Show that condition of detailed balance is satisfied.

• Derive the Hastings factor …

• … which is a function of various partition functions

Acceptance probability

Summary

• Learning Bayesian networks from postgenomic data

• Integration of biological prior knowledge

• Learning regulatory networks from heterogeneous data obtained under different experimental conditions

• Improving MCMC

Acknowledgements

Funding from the

Scottish Government

Rural and Environment Research and Analysis Directorate (RERAD)

Collaboration with

Adriano Werhli

Marco Grzegorczyk

Thank you!

Any questions?
