Computational Discovery of Communicable Scientific Models

Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/~langley
[email protected]

Thanks to N. Asgharbeygi, K. Arrigo, S. Bay, S. Dzeroski, J. Sanchez, Oren Shiran, and L. Todorovski for their contributions to this research, which is funded by a grant from the National Science Foundation.
Data Mining vs. Scientific Discovery

There exist two computational paradigms for discovering explicit knowledge from data:
- Data mining generates knowledge cast as decision trees, logical rules, or other notations invented by AI researchers;
- Computational scientific discovery instead uses equations, structural models, reaction pathways, or other formalisms invented by scientists and engineers.
Both approaches draw on heuristic search to find regularities in data, but they differ considerably in their emphases.
Lesson 1

Traditional notations from machine learning are not communicated easily to domain scientists.

Ecosystem model
NPPc = Σ_month max(E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt – 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt – Tempc – 10))) · (1 + e^(0.3 · (Tempc – Topt – 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M if Tempc > 0
PET = 0 if Tempc < 0
A = 0.00000068 · AHI^3 – 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min[(SR-FAS – 1.08) / SR(UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI – 1000)

Gene regulation model
[Figure: signed graph linking Light, DFR, NBLA, NBLR, RR, PBS, psbA1, psbA2, cpcB, Photo, and Health, with each link labeled + or -]
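As a minimal sketch, the temperature and moisture terms of the ecosystem model can be evaluated directly. The function names and the sample inputs (Topt = 20, Tempc = 20, EET = PET = 50) are illustrative assumptions rather than values from the talk, and the code assumes PET > 0.

```python
import math

def t1(topt):
    # T1 = 0.8 + 0.02 * Topt - 0.0005 * Topt^2
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2

def t2(topt, tempc):
    # T2 = 1.18 / [(1 + e^(0.2*(Topt - Tempc - 10))) * (1 + e^(0.3*(Tempc - Topt - 10)))]
    return 1.18 / ((1 + math.exp(0.2 * (topt - tempc - 10)))
                   * (1 + math.exp(0.3 * (tempc - topt - 10))))

def w(eet, pet):
    # W = 0.5 + 0.5 * EET / PET (assumes PET > 0)
    return 0.5 + 0.5 * eet / pet

def efficiency(topt, tempc, eet, pet):
    # E = 0.56 * T1 * T2 * W
    return 0.56 * t1(topt) * t2(topt, tempc) * w(eet, pet)

# Hypothetical inputs, chosen only to exercise the formulas.
sample_e = efficiency(topt=20.0, tempc=20.0, eet=50.0, pet=50.0)
```

Each intermediate quantity keeps the name it has in the equations, which is the point of the lesson: a domain scientist can check the code against the model line by line.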
Lesson 2

Scientists often have initial models that should influence the discovery process.

[Figure: an initial model and observations feed a Discovery step that outputs a revised model. Both models are signed graphs over Light, DFR, NBLA, NBLR, RR, PBS, psbA1, psbA2, cpcB, Photo, and Health; × marks appear on links in the revised model.]
Lesson 3

Scientific data are often rare and difficult to obtain rather than being plentiful.

Ecosystem model
  Number of variables         8
  Number of equations        11
  Number of parameters       20
  Number of samples         303

Gene regulation model
  Number of variables         9
  Number of initial links    11
  Number of possible links   70
  Number of samples          20
Lesson 4

Scientists want models that move beyond description to provide explanations of their data.

[Figure: the ecosystem model drawn as a dependency graph over NPPc, IPAR, PET, T1, T2, W, e_max, E, EET, Tempc, Topt, NDVI, SOLAR, AHI, A, PETTWM, SR, FPAR, and VEG; the gene regulation model drawn as a signed graph over Light, DFR, NBLA, NBLR, RR, PBS, psbA1, psbA2, cpcB, Photo, and Health.]
Lesson 5

Scientists want computational assistance rather than automated discovery systems.

[Figure: an initial model and observations feed a Discovery step that outputs a revised model. Both models are signed graphs over Light, DFR, NBLA, NBLR, RR, PBS, psbA1, psbA2, cpcB, Photo, and Health; × marks appear on links in the revised model.]
The Nature of Systems Science

Disciplines like Earth science and computational biology differ from traditional fields in that they:
- focus on synthesis rather than analysis in their operation;
- rely on computer modeling as one of their central methods;
- develop system-level models with many variables and relations;
- require that models make contact with known mechanisms.
However, existing methods for computational scientific discovery were not designed with systems science in mind.

Time Series from the Ross Sea Ecosystem
Inductive Process Modeling

Our approach is to design and implement computational methods for inductive process modeling, which:
- represent scientific models as sets of quantitative processes;
- use these models to predict and explain observational data;
- search a space of process models to find good candidates;
- utilize background knowledge to constrain this search.
This framework has great potential both for modeling scientific reasoning and aiding practicing scientists.
Existing Formalisms Are Inadequate

systems of equations
  d[ice_mass,t] = -(18 · heat) / 6.02
  d[water_mass,t] = (18 · heat) / 6.02

regression trees
  [Figure: a tree with tests B>6, C>0, and C>4 and leaf values 14.3, 18.7, 11.5, and 16.9]

Horn clause programs
  gcd(X,X,X).
  gcd(X,Y,D) :- X<Y, Z is Y-X, gcd(X,Z,D).
  gcd(X,Y,D) :- Y<X, gcd(Y,X,D).

hidden Markov models
  [Figure: a state diagram whose states carry x and y parameter values and whose transitions are labeled 0.3, 0.7, 1.0, and 1.0]
A Process Model for an Aquatic Ecosystem

model AquaticEcosystem

variables: phyto, zoo, nitro, residue
observables: phyto, nitro

process phyto_loss
  equations: d[phyto,t,1] = -0.307 · phyto
             d[residue,t,1] = 0.307 · phyto

process zoo_loss
  equations: d[zoo,t,1] = -0.251 · zoo
             d[residue,t,1] = 0.251 · zoo

process zoo_phyto_grazing
  equations: d[zoo,t,1] = 0.615 · 0.495 · zoo
             d[residue,t,1] = 0.385 · 0.495 · zoo
             d[phyto,t,1] = -0.495 · zoo

process nitro_uptake
  conditions: nitro > 0
  equations: d[phyto,t,1] = 0.411 · phyto
             d[nitro,t,1] = -0.098 · 0.411 · phyto

process nitro_remineralization
  equations: d[nitro,t,1] = 0.005 · residue
             d[residue,t,1] = -0.005 · residue
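To make the semantics of such a model concrete, here is a minimal simulation sketch. It rests on several assumptions that the slide does not state: loss and uptake terms carry negative signs, contributions from different processes to the same variable are summed, integration uses a plain Euler step, and the initial values are hypothetical.

```python
# Each process: (name, condition on the state, {variable: term added to its derivative}).
PROCESSES = [
    ("phyto_loss", lambda s: True, {
        "phyto": lambda s: -0.307 * s["phyto"],
        "residue": lambda s: 0.307 * s["phyto"],
    }),
    ("zoo_loss", lambda s: True, {
        "zoo": lambda s: -0.251 * s["zoo"],
        "residue": lambda s: 0.251 * s["zoo"],
    }),
    ("zoo_phyto_grazing", lambda s: True, {
        "zoo": lambda s: 0.615 * 0.495 * s["zoo"],
        "residue": lambda s: 0.385 * 0.495 * s["zoo"],
        "phyto": lambda s: -0.495 * s["zoo"],
    }),
    ("nitro_uptake", lambda s: s["nitro"] > 0, {
        "phyto": lambda s: 0.411 * s["phyto"],
        "nitro": lambda s: -0.098 * 0.411 * s["phyto"],
    }),
    ("nitro_remineralization", lambda s: True, {
        "nitro": lambda s: 0.005 * s["residue"],
        "residue": lambda s: -0.005 * s["residue"],
    }),
]

def derivatives(state):
    """Sum the terms of every process whose condition holds in this state."""
    rates = dict.fromkeys(state, 0.0)
    for _name, condition, equations in PROCESSES:
        if condition(state):
            for variable, term in equations.items():
                rates[variable] += term(state)
    return rates

def simulate(state, steps, dt=0.1):
    """Advance the state with fixed-step Euler integration and record each step."""
    trajectory = [dict(state)]
    for _ in range(steps):
        rates = derivatives(state)
        state = {v: state[v] + dt * rates[v] for v in state}
        trajectory.append(state)
    return trajectory

INITIAL = {"phyto": 1.0, "zoo": 0.5, "nitro": 2.0, "residue": 0.0}  # hypothetical
trajectory = simulate(INITIAL, steps=50)
```

Because each process is a separate entry, removing or adding one changes the model without touching the others, which is the modularity that induction over process models relies on.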
Advantages of Quantitative Process Models

Process models offer scientists a promising framework because:
- they embed quantitative relations within qualitative structure;
- they refer to notations and mechanisms familiar to experts;
- they provide dynamical predictions of changes over time;
- they offer causal and explanatory accounts of phenomena;
- they retain the modularity needed for induction/abduction.
Quantitative process models provide an important alternative to formalisms used currently in computational discovery.
Challenges of Inductive Process Modeling

Process model induction differs from typical learning tasks in that:
- process models characterize behavior of dynamical systems;
- variables are continuous but can have discontinuous behavior;
- observations are not independently and identically distributed;
- models may contain unobservable processes and variables;
- multiple processes can interact to produce complex behavior.
Compensating factors include a focus on deterministic systems and the availability of background knowledge.
Encoding Background Knowledge

To constrain candidate models, we can utilize available background knowledge about the domain.
Previous work has encoded background knowledge in terms of:
- Horn clause programs (e.g., Towell & Shavlik, 1990)
- context-free grammars (e.g., Dzeroski & Todorovski, 1997)
- prior probability distributions (e.g., Friedman et al., 2000)
However, none of these notations are familiar to domain scientists, which suggests the need for another approach.
Generic Processes as Background Knowledge

We cast background knowledge as generic processes that specify:
- the variables involved in a process and their types;
- the parameters appearing in a process and their ranges;
- the forms of conditions on the process; and
- the forms of associated equations and their parameters.
Generic processes are building blocks from which one can compose a specific process model.
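One way to picture a generic process is as a small record plus a type-respecting binding step. The sketch below is only an illustration: the class name, field layout, the parameter name `alpha`, the string form of the equations, and the variable typing are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class GenericProcess:
    name: str
    variables: tuple        # (role, type) pairs, e.g. ("S", "species")
    parameters: dict        # parameter name -> (lower bound, upper bound)
    conditions: tuple = ()  # condition templates, kept as strings here
    equations: tuple = ()   # equation templates, kept as strings here

def instantiate(generic, typed_variables):
    """Yield each binding of roles to distinct variables whose types match."""
    roles = generic.variables
    for combo in permutations(typed_variables, len(roles)):
        if all(typed_variables[v] == t for v, (_, t) in zip(combo, roles)):
            yield {role: v for v, (role, _) in zip(combo, roles)}

# Hypothetical generic process and variable typing for an aquatic ecosystem.
EXPONENTIAL_LOSS = GenericProcess(
    name="exponential_loss",
    variables=(("S", "species"), ("D", "detritus")),
    parameters={"alpha": (0.0, 1.0)},
    equations=("d[S,t,1] = -1 * alpha * S", "d[D,t,1] = alpha * S"),
)
VARIABLES = {"phyto": "species", "zoo": "species",
             "nitro": "nutrient", "residue": "detritus"}

bindings = list(instantiate(EXPONENTIAL_LOSS, VARIABLES))
```

With this typing, the loss process can bind to either species, so `bindings` holds one instantiation for phyto and one for zoo, each paired with residue.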
Generic Processes for Aquatic Ecosystems

generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = -1 · α · S
             d[D,t,1] = α · S

generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π · D
             d[D,t,1] = -1 · π · D

generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ · ρ · S1
             d[D,t,1] = (1 - γ) · ρ · S1
             d[S2,t,1] = -1 · ρ · S1

generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν

generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ · S
             d[N,t,1] = -1 · β · μ · S
Inducing Process Models

generic processes

process exponential_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] · P

process logistic_growth
  variables: P {population}
  equations: d[P,t] = [0, 1, ∞] · P · (1 - P / [0, 1, ∞])

process constant_inflow
  variables: I {inorganic_nutrient}
  equations: d[I,t] = [0, 1, ∞]

process consumption
  variables: P1 {population}, P2 {population}, nutrient_P2
  equations: d[P1,t] = [0, 1, ∞] · P1 · nutrient_P2,
             d[P2,t] = -[0, 1, ∞] · P1 · nutrient_P2

process no_saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P

process saturation
  variables: P {number}, nutrient_P {number}
  equations: nutrient_P = P / (P + [0, 1, ∞])

[Figure: training data and generic processes feed an Induction step that outputs the process model below]

process model

model AquaticEcosystem

variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
observables: nitro, phyto, zoo

process phyto_exponential_growth
  equations: d[phyto,t] = 0.1 · phyto

process zoo_logistic_growth
  equations: d[zoo,t] = 0.1 · zoo · (1 - zoo / 1.5)

process phyto_nitro_consumption
  equations: d[nitro,t] = -1 · phyto · nutrient_nitro,
             d[phyto,t] = 1 · phyto · nutrient_nitro

process phyto_nitro_no_saturation
  equations: nutrient_nitro = nitro

process zoo_phyto_consumption
  equations: d[phyto,t] = -1 · zoo · nutrient_phyto,
             d[zoo,t] = 1 · zoo · nutrient_phyto

process zoo_phyto_saturation
  equations: nutrient_phyto = phyto / (phyto + 0.5)
A Method for Process Model Construction

The IPM algorithm constructs explanatory models from generic components in four stages:
1. Find all ways to instantiate known generic processes with specific variables, subject to type constraints;
2. Combine instantiated processes into candidate generic models subject to additional constraints (e.g., number of processes);
3. For each generic model, carry out search through parameter space to find good coefficients;
4. Return the parameterized model with the best overall score.
Our typical evaluation metric is squared error, but we have also explored other measures of explanatory adequacy.
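The four stages can be sketched on a toy one-variable system. This is not the IPM implementation: stage 1 is written out by hand as two hypothetical instantiated processes (`decay` and `inflow`), stage 3 uses a coarse grid search as a stand-in for the real parameter search, and the "observations" are generated by the sketch itself.

```python
from itertools import combinations, product

# Stage 1 (precomputed here): instantiated processes for a single variable x.
# Each maps (parameter, x) to its contribution to dx/dt.
INSTANTIATED = {
    "decay": lambda p, x: -p * x,
    "inflow": lambda p, x: p,
}

def simulate(structure, params, x0, steps, dt=0.1):
    """Euler-integrate the summed contributions of the chosen processes."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        rate = sum(INSTANTIATED[name](p, x) for name, p in zip(structure, params))
        xs.append(x + dt * rate)
    return xs

def squared_error(predicted, observed):
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def candidate_structures(max_processes):
    """Stage 2: every combination of processes up to a size limit."""
    names = sorted(INSTANTIATED)
    for k in range(1, max_processes + 1):
        yield from combinations(names, k)

def fit(structure, observed, grid):
    """Stage 3: grid search over parameters (a stand-in for real optimization)."""
    best_error, best_params = float("inf"), None
    for params in product(grid, repeat=len(structure)):
        predicted = simulate(structure, params, observed[0], len(observed) - 1)
        error = squared_error(predicted, observed)
        if error < best_error:
            best_error, best_params = error, params
    return best_error, best_params

def induce(observed, max_processes=2):
    """Stage 4: return (error, parameters, structure) for the best-scoring model."""
    grid = [i / 10 for i in range(1, 11)]
    scored = [(*fit(s, observed, grid), s) for s in candidate_structures(max_processes)]
    return min(scored, key=lambda item: item[0])

# Synthetic observations from a pure decay process with rate 0.3.
observed = simulate(("decay",), (0.3,), x0=1.0, steps=20)
error, params, structure = induce(observed)
```

On these synthetic observations the search recovers the decay-only structure with rate 0.3, since any nonzero inflow on the grid adds error.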
Estimating Parameters in Process Models

To estimate the parameters for each generic model structure, the IPM algorithm:
1. Selects random initial values that fall within ranges specified in the generic processes;
2. Improves these parameters using the Levenberg-Marquardt method until it reaches a local optimum;
3. Generates new candidate values through random jumps along dimensions of the parameter vector and continues the search;
4. Restarts the search from a new random initial point if no improvement occurs after N jumps.
This multi-level method gives reasonable fits to time-series data from a number of domains, but it is computationally intensive.
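The control structure of this multi-level search can be sketched in plain Python. To stay dependency-free, the local optimizer below is a simple coordinate pattern search, explicitly a stand-in and not Levenberg-Marquardt. The function names, the fixed number of restarts, and the quadratic test loss are assumptions made for illustration.

```python
import random

def local_improve(loss, params, bounds, step=0.1, tol=1e-6):
    """Stand-in local optimizer: coordinate pattern search with step halving."""
    params = list(params)
    best = loss(params)
    while step > tol:
        improved = False
        for i, (lo, hi) in enumerate(bounds):
            for delta in (step, -step):
                trial = list(params)
                trial[i] = min(hi, max(lo, trial[i] + delta))
                value = loss(trial)
                if value < best:
                    params, best, improved = trial, value, True
        if not improved:
            step /= 2
    return params, best

def estimate(loss, bounds, restarts=5, max_jumps=10, seed=0):
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(restarts):
        # 1. Random initial values within the declared ranges.
        start = [rng.uniform(lo, hi) for lo, hi in bounds]
        # 2. Improve to a local optimum.
        params, value = local_improve(loss, start, bounds)
        # 3. Random jumps along one dimension at a time, then re-optimize.
        failures = 0
        while failures < max_jumps:
            trial = list(params)
            i = rng.randrange(len(bounds))
            trial[i] = rng.uniform(*bounds[i])
            trial, trial_value = local_improve(loss, trial, bounds)
            if trial_value < value:
                params, value, failures = trial, trial_value, 0
            else:
                failures += 1
        # 4. After max_jumps failed jumps, fall through to a fresh restart.
        if value < best_loss:
            best_params, best_loss = params, value
    return best_params, best_loss

# Hypothetical loss with its minimum at (0.3, 0.7), both parameters in [0, 1].
def toy_loss(p):
    return (p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2

fitted, fitted_loss = estimate(toy_loss, bounds=[(0.0, 1.0), (0.0, 1.0)])
```

In a real setting the loss would be the squared error between simulated and observed trajectories, and a gradient-based least-squares routine would replace `local_improve`.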
Observations from the Ross Sea
Results on Training Data from Ross Sea
Results on Test Data from Ross Sea
Results on a Protist Ecosystem
Results on Ringkøbing Fjord
Results on Biochemical Kinetics
[Figure: observed trajectories compared with predicted trajectories]
Interfacing with Scientists

Because few scientists want to be replaced, we are developing an interactive environment, PROMETHEUS, that lets users:
- specify a quantitative process model of the target system;
- display and edit the model’s structure and details graphically;
- simulate the model’s behavior over time and situations;
- compare the model’s predicted behavior to observations;
- invoke a revision module in response to detected anomalies.
The environment offers computational assistance in forming and evaluating models but lets the user retain control.
Viewing a Process Model Graphically
Indicating Processes to Consider Adding
Specifying Data and Search Parameters
Inspecting Revised Process Models
Intellectual Influences

Our approach to computational discovery incorporates ideas from many traditions:
- computational scientific discovery (e.g., Langley et al., 1983);
- theory revision in machine learning (e.g., Towell, 1991);
- qualitative physics and simulation (e.g., Forbus, 1984);
- languages for scientific simulation (e.g., STELLA, MATLAB);
- interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from machine learning, AI, programming languages, and human-computer interaction.
Contributions of the Research

In summary, our work on computational scientific discovery has, in responding to various challenges, produced:
- a new formalism for representing scientific process models;
- a computational method for simulating these models’ behavior;
- an encoding for background knowledge as generic processes;
- an algorithm for inducing process models from time-series data;
- an interactive environment for model construction/utilization.
We have demonstrated this approach to model creation on domains from Earth science, microbiology, and engineering.
Some Recent Extensions

In recent work, we have extended our approach to incorporate:
- heuristic beam search through the space of process models;
- hierarchical generic processes that further constrain search;
- an ensemble-like method that mitigates overfitting effects;
- metrics for explanatory adequacy based on trajectory shapes.
Inductive process modeling has great potential to speed progress in systems science and engineering.

End of Presentation