UQ: from end to end
Tony O’Hagan
Outline
Session 1: Quantifying input uncertainty
Information
Modelling
Elicitation
Coffee break: Propagation!
Session 2: Model discrepancy and calibration
All models are wrong
Impact of model discrepancy
Modelling model discrepancy
UQ: from end to end
Session 1: Quantifying input uncertainty
Context
You have a model
To simulate or predict some real-world process
I’ll call it a simulator
For a given use of the simulator you are unsure of the true or correct values of inputs
This uncertainty is a major component of UQ
Propagating it through the simulator is a fundamental step in UQ
We need to express that uncertainty in the form of probability distributions
But how?
I feel that this is a neglected area in UQ
Distributions assumed, often with no discussion of where they came from
Focus of this session
Probability distributions for inputs
Representing the analyst’s knowledge/uncertainty
What they mean
Interpretation of probability
Where they come from
Analysis of data and/or judgement
Elicitation
Principles
Single input
Multiple inputs
Multiple experts
The analyst
The distributions should represent the best knowledge of the model user about the inputs
I will refer to the model user as the analyst
They are the analyst’s responsibility
The analyst is the one who is interested in the simulator output
For a particular application
And some or all of the inputs refer specifically to that application
The analyst must own the input distributions
They should represent best knowledge
Obviously!
Anything else is unscientific
Less input uncertainty means (generally) less output uncertainty
What probability?
Before we go further, we need to understand how a probability distribution represents someone’s knowledge
The question goes right to the heart of what probability means
Example: We are interested in X = the proportion of people infected with HIV who will develop AIDS within 10 years
when treated with a new drug
X will be an input to a clinical trial simulator
To assist the pharmaceutical company in designing the drug’s development plan
Analyst Mary expresses a probability distribution for X
Mary’s distribution
The stated distribution is shown on the right
It specifies how probable any particular values of X are
E.g. It says there is a probability of almost 0.7 that X is below 0.4
And the expected value of X is 0.35
It even gives a nontrivial probability to X being less than 0.2
Which would represent a major reduction in HIV progression
How can X have probabilities?
Almost everyone learning probability is taught the frequency interpretation
The probability of something is the long run relative frequency with which it occurs in a very long sequence of repetitions
How can we have repetitions of X?
It’s a one-off: it will only ever have one value
It’s that unique value we’re interested in
Simulator inputs are almost always like this – they’re one-off!
Mary’s distribution can’t be a probability distribution in that sense
So what do her probabilities actually mean? And does she know?
Mary’s probabilities
Mary’s probability 0.7 that X < 0.4 is a judgement
She thinks it’s more likely to be below 0.4 than above
So in principle she would bet even money on it
In fact she would bet £2 to win £1 (because 0.7 > 2/3)
Her expectation of 0.35 is a kind of best estimate
Not a long run average over many repetitions
Her probabilities are an expression of her beliefs
They are personal judgements
You or I would have different probabilities
We want her judgements because she’s the expert!
We need a new definition of probability
Subjective probability
The probability of a proposition E is a measure of a person’s degree of belief in the truth of E
If they are certain that E is true then P(E) = 1
If they are certain it is false then P(E) = 0
Otherwise P(E) lies between these two extremes
Exercise 1 – How many Muslims in Britain?
Refer to the two questions on your sheet
The first asks for a probability
Make your own personal judgement
If you don’t already have a good feel for the probability scale, you may find it useful to think about betting odds
The second asks for another probability
Subjective includes frequency
The frequency and subjective definitions of probability are compatible
If the results of a very long sequence of repetitions are available, they agree
Frequency probability equates to the long run frequency
All observers who accept the sequence as comprising repetitions will give that frequency as their (personal/subjective) probability
for the next (or any future) result in the sequence
Subjective probability extends frequency probability
But also seamlessly covers propositions that are not repeatable
It’s also more controversial
It doesn’t include prejudice etc!
The word “subjective” has derogatory overtones
Subjectivity should not admit prejudice, bias, superstition, wishful thinking, sloppy thinking, manipulation ...
Subjective probabilities are judgements but they should be careful, honest, informed judgements
As “objective” as possible without ducking the issue
Using best practice
Formal elicitation methods
Bayesian analysis
Probability judgements go along with all the other judgements that a scientist necessarily makes
And should be argued for in the same careful, honest and informed way
What about data?
I’ve presented the analyst’s probability distributions as a matter of pure subjective judgement – what about data?
Many possible scenarios:
X is a parameter for which there is a published value
Analyst has one or more direct experimental evaluations for X
Analyst has data relating more or less directly to X
Analyst has some hard data but also personal expertise about X
Analyst relies on personal expertise about X
Analyst seeks input from an expert on X
…
The case of a published value
The published value may come with a completely characterised probability distribution for X representing uncertainty in the value
The analyst simply accepts this distribution as her own judgement
Or it may not
The analyst needs to consider the uncertainty in X around the published value P
X = P + E, where E is the error
Analyst formulates her own probability distribution for E
The published value P may simply come with a standard deviation
The analyst accepts this as one judgement about E
Using data – principles
The appropriate framework for using data is Bayesian statistics
Because it delivers a probability distribution for X
Classical frequentist statistics can’t do that
Even a confidence interval is not a probability statement about X
The data are related to X through a likelihood function
Derived from a statistical model
This is combined with whatever additional knowledge the analyst may have
In the form of a prior distribution
Combination is performed by Bayes’ theorem
The result is the analyst’s posterior distribution for X
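To make the mechanics concrete, here is a minimal Python sketch of that update, using an invented conjugate Beta-binomial example (the numbers are illustrative, not from the slides): the analyst's prior on a proportion X is combined with binomial data via Bayes' theorem.

```python
# A minimal sketch of a Bayesian update: X is a proportion, the prior encodes
# the analyst's judgement, and the data are k 'successes' out of n observations.
from scipy import stats

a0, b0 = 4, 8                       # analyst's prior judgement about X: Beta(4, 8)
prior = stats.beta(a0, b0)
k, n = 11, 40                       # hypothetical data: 11 of 40 patients progressed

posterior = stats.beta(a0 + k, b0 + (n - k))   # Bayes' theorem, conjugate form
print(prior.mean(), posterior.mean())          # prior and posterior best estimates
print(posterior.interval(0.95))                # 95% credible interval for X
```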
Using data – practicalities
If data are highly informative about X, prior information may not matter
Use a conventional non-informative prior distribution
Otherwise the analyst formulates her own prior distribution for X
Bayesian analysis can be complex
Analyst is likely to need the services of a Bayesian statistician
The likelihood/model is also a matter of judgement!
Although I will not delve into this today
Summary
We have identified several situations where distributions need to be formulated by personal judgement
No good data – analyst formulates distribution for X
Published data does not have complete characterisation of uncertainty – analyst formulates distribution for E
Data supplemented by additional expertise – analyst formulates prior distribution for X
Analyst may seek judgements of one or more experts
Rather than relying on her own
Particularly when the stakes are high
We have identified just one situation where personal judgements are not needed
Published data with completely characterised uncertainty
Elicitation
The process of representing the knowledge
of one or more persons (experts)
concerning an uncertain quantity
as a probability distribution for that quantity
Typically conducted as a dialogue between
the experts – who have substantive knowledge about the quantity (or quantities) of interest – and
a facilitator – who has expertise in the process of elicitation
Ideally face to face
But may also be done by video-conference, teleconference or online
Some history
The idea of formally representing uncertainty using subjective probability judgements began to be taken seriously in the 1960s
For instance, for judgement of extreme risks
Psychologists became interested
How do people make probability judgements?
What mental processes are used, and what does this tell us about the brain’s processing generally?
They found many ways that we make bad judgements
The heuristics and biases movement
And continued to look mostly at how we get it wrong
Since this told them a lot about our mental processes
Meanwhile ...
Statisticians increasingly made use of subjective probabilities
Growth of Bayesian statistics
Some formal elicitation but mostly unstructured judgements
Little awareness of the work in psychology
Reinforced recently by UQ with uncertain simulator inputs
Our interests are more complex
Not really interested in single probabilities
Whole probability distributions
Multivariate distributions
We want to know how to get it right
Psychology provides almost no help with these challenges
Heuristics and biases
Our brains evolved to make quick decisions
Heuristics are short-cut reasoning techniques
Allow us to make good judgements quickly in familiar situations
Judgement of probability is not something that we evolved to do well
The old heuristics now produce biases
Anchoring and adjustment
Availability
Overconfidence
And many others
Anchoring and adjustment
Exercise 1 was designed to exhibit this heuristic
The probabilities should on average be different in the two groups
When asked to make two related judgements, the second is affected by the first
The second is judged relative to the first
By adjustment away from the first judgement
The first is called the anchor
Adjustment is typically inadequate
Second response too close to the first (anchor)
Anchoring can be strong even when obviously not really relevant to the second question
Just putting any numbers into the discussion creates anchors
Availability
An event is judged more probable if we can quickly bring instances of it to mind
Things that are more memorable are deemed more probable
High profile train accidents in the UK lead people to imagine rail travel is more risky than it really is
My judgement of the risk of dying from a particular disease will be increased if I know (of) people who have the disease or have died from it
Important for analyst to review all the evidence
Overconfidence
It is generally said that experts are overconfident
When asked to give 95% intervals, say, far fewer than 95% contain the true value
Several possible explanations
Wish to demonstrate expertise
Anchoring to a central estimate
Difficulty of judging extreme events
Not thinking ‘outside the box’
Expertise often consists of specialist heuristics
Situations we elicit judgements on are not typical
Probably over-stated as a general phenomenon
Experts can be under-confident if afraid of consequences
A matter of personality and feeling of security
Evidence of over-confidence is not from real experts making judgements on serious questions
The keys to good elicitation
First, pay attention to the literature on psychology of elicitation
How you ask a question influences the answer
Second, ask about the right things
Things that experts are likely to assess most accurately
Third, prepare thoroughly
Provide help and training for experts
These are built into the SHELF system
Sheffield Elicitation Framework
The SHELF system
SHELF is a package of documents and simple software to aid elicitation
General advice on conducting the elicitation
Templates for recording the elicitation
Suitable for several different basic methods
Annotated versions of the templates with detailed guidance
Some R functions for fitting distributions and providing feedback
SHELF is freely available and comments and suggestions for additions are welcomed
Developed by Tony O’Hagan and Jeremy Oakley
R functions by Jeremy
http://tonyohagan.co.uk/shelf
A SHELF template
Word document
Facilitator follows a carefully constructed sequence of questions
Final step invites experts to give their own feedback
The tertile method
One of several supported in SHELF
Annotated template
For facilitator’s guidance
Advice on each field of the template
Ordinary text says what is required in each field
Text in brackets gives advice on how to work with experts
Text in italics says why we are doing it this way
Based on findings in psychology
Let’s see how it works
SHELF templates provide a carefully structured sequence of steps
Informed by psychology and practical experience
I’ll work through these, using the following illustrative example
An Expert is asked for her judgements about the distance D between the airports of Paris Charles de Gaulle and Chicago O’Hare
in miles
She has experience of flying distances but has not flown this route before
She knows that from LHR to JFK is about 3500 miles
Expert is asked for lower and upper credible bounds
Expert would be very surprised if X was found to be below the lower credible bound or above the upper credible bound
It’s not impossible to be outside the credible range, just highly unlikely
Practical interpretation might be a probability of 1% that X is below L and 1% that it’s above U
Example: Expert sets lower bound L = 3500
CDG to ORD surely more than LHR to JFK
Upper bound U = 5000
Additional flying distance for CDG to ORD surely less than 1500
Credible range L to U
The median M
The value of x for which the expert judges X to be equally likely to be above or below x
Probability 0.5 (or 50%) below and 0.5 above
Like a toss of a coin
Or chopping the range into two equally probable parts
If the expert were asked to choose either to bet on X < x or on X > x, he/she should have no preference
It’s a specific kind of ‘estimate’ of X
Need to think, not just go for mid-point of the credible range
Example: Expert chooses median M = 4000
The quartiles Q1 and Q3
The lower quartile Q1 is the p = 25% quantile
The expert judges X < Q1 to have probability 0.25
Like tossing two successive Heads with a coin
Equivalently, Q1 divides the range below the median into two equi-probable parts:
‘Less than Q1’ & ‘between Q1 and M’
Q1 should generally be closer to M than to L
Similarly, upper quartile Q3 is p = 75%
Q1, M and Q3 divide the range into four equi-probable parts
Example: Expert chooses Q1 = 3850, Q3 = 4300
Then fit a distribution
Any convenient distribution
As long as it fits the elicited summaries adequately
SHELF has software for fitting a range of standard distributions
At this point, the choice should not matter
The idea is that we have elicited enough
Any reasonable distribution choice will be similar to any other
Elicitation can never be exact
The elicited summaries are only approximate anyway
If the choice does matter, i.e. different fitted distributions give different answers to the problem for which we are doing the elicitation
We can try to remove the sensitivity by eliciting more summaries
Or involving more experts
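As an illustration of the fitting step (a sketch in Python, not the SHELF R code), one can choose a parametric family and pick its parameters so that its quantiles match the elicited summaries as closely as possible; here a normal distribution is fitted to the airport-distance judgements from the example.

```python
# Illustrative sketch only: fit a normal distribution to elicited summaries by
# least squares on the quantile scale (L, Q1, M, Q3, U from the example, in miles).
import numpy as np
from scipy import stats, optimize

probs    = np.array([0.01, 0.25, 0.50, 0.75, 0.99])
elicited = np.array([3500, 3850, 4000, 4300, 5000])

def loss(params):
    mu, log_sigma = params
    q = stats.norm.ppf(probs, loc=mu, scale=np.exp(log_sigma))
    return np.sum((q - elicited) ** 2)

res = optimize.minimize(loss, x0=[4000.0, np.log(300.0)], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(f"fitted normal: mean {mu_hat:.0f}, sd {sigma_hat:.0f}")
print(stats.norm.ppf(probs, loc=mu_hat, scale=sigma_hat).round(0))  # feedback to expert
```

The fitted quantiles can then be reported back to the expert, who may revise her judgements if they look wrong.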
Exercise 2
So let’s do it!
We’re going to elicit your beliefs about one of the following (you can choose!)
Number of gold medals to be won by China in 2016 Olympics
Length of the Yangtze River
Population of Beijing in 2011
Proportion of the total world land area covered by China
Do we need a facilitator?
Yes, if the simulator output is sufficiently important
A skilled facilitator is essential to get the most accurate and reliable representation of the expert’s knowledge
At least for the most influential inputs
Otherwise, no
The analyst can simply quantify her own judgements
But it’s still very useful to follow the SHELF process
In effect, the analyst interrogates herself
Playing the role of facilitator as well as that of expert
Multiple inputs
Hitherto we’ve basically considered just one input X
In practice, simulators almost always have multiple inputs
Then we need to think about dependence
Two or more uncertain quantities are independent if:
When you learn something about one of them it doesn’t change your beliefs about the others
It’s a personal judgement, like everything else in elicitation!
They may be independent for one expert but not for another
Independence is nice
Independent inputs can just be elicited separately
Exercise 3
Which of the following sets of quantities would you consider independent?
1. The average weight B of black US males aged 40 and the average weight W of white US males aged 40
2. My height H and my age A
3. The time T taken by the Japanese bullet train to travel from Tokyo to Kyoto and the distance D travelled
4. The atomic numbers of Calcium (Ca), Silver (Ag) and Arsenic (As)
Eliciting dependence
If quantities are not independent we must elicit the nature and magnitude of dependence between them
Remembering that probabilities are the best summaries to elicit
Joint probabilities
Probability that X takes some x-values and Y takes some y-values
Conditional probabilities
Probability that Y takes some y-values if X takes some x-values
Much harder to think about than probabilities for a single quantity
Perhaps the simplest is the quadrant probability
Probability both X and Y are above their individual medians
Bivariate quadrant probability
First elicit medians
Now elicit quadrant probability
It can’t be negative
Or more than 0.5
Value indicates direction and strength of dependence
0.25 if X and Y are independent
Greater if positively correlated
0.5 if, whenever one is above its median, the other must be too
Less than 0.25 if negatively correlated
Zero if they can’t both be above their medians
(Figure: quadrant diagram – the two medians split each quantity’s range into halves of probability 0.5, and ? marks the probability that both quantities exceed their medians.)
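One convenient way to use an elicited quadrant probability is to translate it into a correlation. A minimal sketch, assuming a bivariate-normal (Gaussian copula) dependence: the quadrant probability p then satisfies p = 1/4 + arcsin(ρ)/(2π), so the implied correlation is ρ = sin(2π(p − 1/4)).

```python
# Convert an elicited quadrant probability to a correlation, under an assumed
# bivariate-normal (Gaussian copula) dependence structure.
import numpy as np

def implied_correlation(quadrant_prob):
    """Correlation implied by an elicited quadrant probability (Gaussian copula)."""
    if not 0.0 <= quadrant_prob <= 0.5:
        raise ValueError("quadrant probability must lie in [0, 0.5]")
    return np.sin(2 * np.pi * (quadrant_prob - 0.25))

print(implied_correlation(0.25))   # 0.0 -> independence
print(implied_correlation(0.35))   # positive dependence
print(implied_correlation(0.50))   # 1.0 -> one above its median forces the other
```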
Higher dimensions
This is already hard
Just for two uncertain quantities
In order to elicit dependence in any depth we will need to elicit several more joint or conditional probabilities
More than two variables – more complex still!
Even with just three quantities...
Three pairwise bivariate distributions
With constraints
The three-way joint distribution is not implied by those, either
We can’t even visualise or draw it!
There is no clear understanding among elicitation practitioners on how to elicit dependence
Avoiding the problem
It would be so much easier if the quantities we chose to elicit were independent
i.e. no dependence or correlation between them
Then eliciting a distribution for each quantity would be enough
We wouldn’t need to elicit multivariate summaries
The trick is to ask about the right quantities
Redefine inputs so they become independent
This is called elaboration
Or structuring
Example – two treatment effects
A clinical trial will compare a new treatment with an existing treatment
Existing treatment effect A is relatively well known
Expert has low uncertainty
But added uncertainty due to the effects of the sample population
New treatment effect B is more uncertain
Evidence mainly from small-scale Phase III trial
A and B will not be independent
Mainly because of the trial population effect
If A is at the high end of the expert’s distribution, she would expect B also to be relatively high
Can we break this dependence with elaboration?
Relative effect
In the two treatments example, note that in clinical trials attention often focuses on the relative effect R = B/A
When effect is bad, like deaths, this is called relative risk
Expert may judge R to be independent of A
Particularly if random trial effect is assumed multiplicative
If additive we might instead consider A independent of D = B – A
But this is unusual
So elicit separate distributions for R and A
The joint distribution of (A, B) is now implicit
Can be derived if needed
But often the motivating task can be rephrased in terms of (A, R)
Trial effect
Instead of simple structuring with the relative risk R, we can explicitly recognise the cause of the correlation
Let T be the trial effect due to difference between the trial patients and the wider population
Let E and N be efficacies of existing and new treatments in the wider population
Then A = E x T and B = N x T
Expert may be comfortable with independence of T, E and N
With E well known, T fairly well known and N more uncertain
We now have to elicit distributions for three quantities instead of two
But can possibly assume them independent
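A quick Monte Carlo sketch, with invented distributions purely to illustrate the structuring idea: eliciting E, T and N independently and forming A = E × T and B = N × T automatically induces the positive correlation between A and B that would otherwise have to be elicited directly.

```python
# Structuring sketch: independent elicited inputs E, T, N induce dependence
# between the derived quantities A = E*T and B = N*T via the shared trial effect T.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
E = rng.normal(0.50, 0.02, n)                      # existing efficacy: well known
T = rng.lognormal(mean=0.0, sigma=0.10, size=n)    # trial effect: fairly well known
N = rng.normal(0.55, 0.08, n)                      # new efficacy: more uncertain

A, B = E * T, N * T
print(np.corrcoef(A, B)[0, 1])   # positive correlation induced by the shared T
```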
General principles
Independence or dependence are in the head of the expert
Two quantities are dependent if learning about one of them would change his/her beliefs about the other
Explore possible structures with the expert(s)
Find out how they think about these quantities
Expertise often involves learning how to break a problem down into independent components
SHELF does not yet handle multivariate elicitation
But it does include an explicit structuring step
Which we can now see is potentially very important!
Templates for some special cases expected in the next release
Multiple experts
The case of multiple experts is important
When elicitation is used to provide expert input to a decision problem with substantial consequences, we generally want to use the skill of as many experts as possible
But they will all have different opinions
Different distributions
How do we aggregate them?
In order to get a single elicited distribution
Aggregating expert judgements
Two approaches
1. Aggregate the distributions
Elicit a distribution from each expert separately
Combine them using a suitable formula
For instance, simply average them
Called ‘mathematical aggregation’ or ‘pooling’
2. Aggregate the experts
Get the experts together and elicit a single distribution
Called ‘behavioural aggregation’
Neither is without problems
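For mathematical aggregation, the simplest pooling rule is a linear opinion pool: average the experts' elicited densities, optionally with weights. A minimal sketch with three invented expert distributions and equal weights:

```python
# Linear opinion pool: the pooled density is the (weighted) average of the
# experts' densities; the three Beta distributions below are invented examples.
import numpy as np
from scipy import stats

x = np.linspace(0, 1, 501)
expert_densities = [
    stats.beta(8, 14).pdf(x),    # expert 1's distribution for X
    stats.beta(5, 7).pdf(x),     # expert 2
    stats.beta(12, 30).pdf(x),   # expert 3
]
pooled = np.mean(expert_densities, axis=0)   # a mixture, still a proper density
print(pooled.sum() * (x[1] - x[0]))          # integrates to ~1
```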
Multiple experts in SHELF
SHELF uses behavioural aggregation
However, distributions are first elicited from experts separately
After sharing of key information
Allows facilitator to see the range of belief before aggregation
Then experts discuss their differences
With a view to assigning an aggregate distribution
To represent what an impartial, intelligent observer might reasonably believe after seeing the experts’ judgements and hearing their discussions
Facilitator can judge whether degree of compromise is appropriate to the intervening discussion
Challenges in behavioural aggregation
More psychological hazards
Group dynamic – dominant/reticent experts
Tendency to end up more confident
Block votes
Requires careful management
What to do if they can’t agree?
End up with two or more composite distributions
Need to apply mathematical pooling to these
But this is rare in practice
Conclusions – Session 1
The analyst needs to supply probability distributions for uncertain inputs
Probabilities are personal judgements
But as objective and scientific as possible
Distributions should represent her best knowledge
A range of scenarios for specifying distributions
From pure judgement
When no good data are available
To simply using a published value
With quantification of uncertainty around that value
Almost always, some part of the task will require distributions based on personal judgement
E.g. prior distributions for Bayesian analysis of data
Elicitation is the process of formulating knowledge about an uncertain quantity as a probability distribution
Many pitfalls, practical and psychological
SHELF is a set of protocols designed to avoid pitfalls
An example of best practice in elicitation
Templates to guide the facilitator or analyst through a structured sequence of steps
Particular challenges arise when eliciting judgements about multiple inputs or from multiple experts
And so to the coffee break
Once we have specified input distributions, the next task is propagation of uncertainty through the simulator
A well studied problem in UQ
Polynomial chaos favoured by engineers, mathematicians
Gaussian process emulators preferred by statisticians
In my opinion, a more powerful and comprehensive UQ approach
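For completeness, a minimal sketch of propagation by plain Monte Carlo, with an invented stand-in simulator: sample the inputs from their elicited distributions, run them through f, and summarise the output distribution. For an expensive simulator the direct runs of f would be replaced by an emulator (or a polynomial chaos surrogate).

```python
# Monte Carlo propagation of input uncertainty through a (cheap) simulator f.
import numpy as np

rng = np.random.default_rng(0)

def f(x1, x2):
    """Stand-in simulator; in practice this is the external model."""
    return np.exp(-x1) * np.sin(x2)

n = 10_000
x1 = rng.lognormal(0.0, 0.3, n)   # draws from the elicited distribution of input 1
x2 = rng.normal(1.5, 0.2, n)      # draws from the elicited distribution of input 2
y = f(x1, x2)
print(y.mean(), y.std(), np.percentile(y, [2.5, 97.5]))  # output uncertainty summary
```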
UQ: from end to end
Session 2: Model discrepancy and calibration
Case study – carbon flux
Vegetation can be a major factor in mitigating the increase of CO2 in the atmosphere
And hence reducing the greenhouse effect
Through photosynthesis, plants take atmospheric CO2
Carbon builds new plant material and O2 is released
But some CO2 is released again
Respiration, death and decay
The net reduction of CO2 is called Net Biosphere Production (NBP)
I will refer to it as the carbon flux
Complex processes modelled in SDGVM
Sheffield Dynamic Global Vegetation Model
SDGVM C flux outputs for 2000
Map of SDGVM estimates shows positive flux (C sink) in North, but negative (C source) in Midlands
Total estimated flux is 9.06 Mt C
Highly dependent on weather, so will vary greatly between years
Quantifying input uncertainties
Plant functional type parameters (growth characteristics)
Expert elicitation
Soil composition (nutrients and decomposition)
Simple analysis from extensive (published) data
Land cover (which PFTs are where)
More complex Bayesian analysis of ‘confusion matrix’ data
Elicitation
Beliefs of expert (developer of SDGVMd) regarding plausible values of PFT parameters
Four PFTs – Deciduous broadleaf (DBL), evergreen needleleaf (ENL), crops, grass
Many parameters for each PFT
Key ones identified by preliminary sensitivity analysis
Important to allow for uncertainty about mix of species in a site and role of parameter in the model
In the case of leaf life span for ENL, this was more complex
ENL leaf life span
Correlations
PFT parameter value at one site may differ from its value in another
Because of variation in species mix
Common uncertainty about average over all species induces correlation
Elicit beliefs about average over whole UK
ENL joint distributions are mixtures of 25 components, with correlation both between and within years
Soil composition
Percentages of sand, clay and silt, plus bulk density
Soil map available at high resolution
Multiple values in each SDGVM site
Used to form average (central estimate)
And to assess uncertainty (variance)
Augmented to allow for uncertainty in original data (expert judgement)
Assumed independent between sites
Land cover map
LCM2000 is another high resolution map
Obtained from satellite images
Vegetation in each pixel assigned to one of 26 classes
Aggregated to give proportions of each PFT at each site
But data are uncertain
Field data are available at a sample of pixels
Countryside Survey 2000
Table of CS2000 class versus LCM2000 class is called the confusion matrix
CS2000 versus LCM2000 matrix
Not symmetric
Rather small numbers
Bare is not a PFT and produces zero NBP

                  CS2000
LCM2000     DBL   ENL   Grass   Crop   Bare
DBL          66     3      19      4      5
ENL           8    20       1      0      0
Grass        31     5     356     22     15
Crop          7     1      41    289      9
Bare          2     0       3      8     81
Modelling land cover
The matrix tells us about the probability distribution of LCM2000 class given the true (CS2000) class
Subject to sampling errors
But we need the probability distribution of true PFT given observed PFT
Posterior probabilities as opposed to likelihoods
We need a prior distribution for land cover
We used observations in a neighbourhood
Implicitly assuming an underlying smooth random field
And the confusion matrix says nothing about spatial correlation of LCM2000 errors
We again relied on expert judgement
Using a notional equivalent number of independent pixels per site
Overall proportions
Red lines show LCM2000 proportions
Clear overall biases
Analysis gives estimates for all PFTs in each SDGVM site
With variances and correlations
Case study – results
Following on the carbon flux case study, input uncertainties were propagated through the SDGVM simulator
Extensive use of Gaussian process emulators
Mean NBP corrections
NBP standard deviations
Aggregate across 4 PFTs
Mean NBP Standard deviation
England & Wales aggregate
PFT           Plug-in estimate (Mt C)   Mean (Mt C)   Variance (Mt C²)
Grass                  5.28                 4.37            0.2453
Crop                   0.85                 0.43            0.0327
Deciduous              2.13                 1.80            0.0221
Evergreen              0.80                 0.86            0.0048
Covariances                                                -0.0081
Total                  9.06                 7.46            0.2968
Sources of uncertainty
The total variance of 0.2968 is made up as follows
Variance due to PFT and soil inputs = 0.2642
Variance due to land cover uncertainty = 0.0105
Variance due to interpolation/emulation = 0.0222
Land cover uncertainty much larger for individual PFT contributions
Dominates for ENL
But overall tends to cancel out
Changes estimates
Larger mean corrections and smaller overall uncertainty
But we haven’t addressed what is probably the biggest source of uncertainty in this carbon flux problem …
Notation
A simulator takes a number of inputs and produces a number of outputs
We can represent any output y as a function
y = f (x)
of a vector x of inputs
Example: A simple machine (SM)
A machine produces an amount of work y which depends on the amount of effort x put into it
Ideally, y = f(x, β) = βx
Parameter β is the rate at which effort can be converted to work
True value of β is β* = 0.65
Data zi = yi + εi
Graph shows observed data
Points lie below y = 0.65x
For large enough x
Because of losses due to friction etc.
Large relative to observation errors
The SM as a simulator
A simulator produces output from inputs
When we consider calibration we divide its inputs into:
Calibration parameters – unknown but fixed
Control variables – known features of application context
Calibration concerns learning about the calibration parameters
Using observations of the real process
Extrapolation concerns predicting the real process
At control variable values beyond where we have observations
We can view the SM as a (very simple) simulator
x is a control variable, β is a calibration parameter
Tuning and physical parameters
Calibration parameters may be physical or just for tuning
We adjust tuning parameters so the model fits reality better
We are not really interested in their ‘true’ values
We calibrate tuning parameters for prediction
Physical parameters are different
We are often really interested in true physical values
The SM’s efficiency parameter β is physical
It’s the theoretically achievable efficiency in the absence of friction
We like to think that calibration can help us learn about them
Exercise 4
Look at the four datasets, and in each case estimate the best fitting slope β
Draw a line by eye through the origin
Using a straight-edge
Read off the slope as the y value on the line when x = 1
Write that value beside the graph
The actual best-fitting calibrated values are:
Dataset 1 – 0.58
Dataset 2 – 0.58
Dataset 3 – 0.66
Dataset 4 – 0.57
Calibrating the SM
It’s basically a simple linear regression through the origin: zi = βxi + εi
Calibration
Posterior distribution misses the true value completely
More data makes things worse
More and more tightly concentrated on the wrong value
We could use a quadratic regression but the problem would remain
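A minimal numerical sketch of this story (the friction term below is an assumed illustrative form, not taken from the slides): fitting the wrong model z = βx to data generated by a reality that bends below the ideal line gives a biased estimate of β, and the estimate only becomes more confidently wrong as the sample grows.

```python
# Calibrating the wrong model: least-squares slope through the origin applied to
# data from a "reality" with an assumed friction loss.
import numpy as np

rng = np.random.default_rng(0)
beta_true = 0.65

def reality(x):
    return beta_true * x - 0.04 * x**2    # assumed friction loss, growing with effort

for n in (10, 100, 1000):
    x = rng.uniform(0, 4, n)
    z = reality(x) + rng.normal(0, 0.02, n)      # noisy observations of reality
    beta_hat = np.sum(x * z) / np.sum(x * x)     # least-squares slope through the origin
    se = 0.02 / np.sqrt(np.sum(x * x))           # its standard error
    print(f"n={n:5d}  beta_hat={beta_hat:.3f}  se={se:.4f}")
# beta_hat settles near 0.53, well below 0.65, while the standard error shrinks
```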
The problem is completely general
Calibrating (inverting, tuning, matching) a wrong model gives parameter estimates that are wrong
Not equal to their true physical values – biased
With more data we become more sure of these wrong values
The SM is a trivial model, but the same conclusions apply to all models
All models are wrong
In more complex models it is just harder to see what is going wrong
Even with the SM, it takes a lot of data to see any curvature in reality
What can we do about this?
Model discrepancy
The SM example suggests that we need to allow that the model does not correctly represent reality
For any values of the calibration parameters
The simulator outputs deviate systematically from reality
Model discrepancy (or model bias or model error)
There is a difference between the model with best/true parameter values and reality
r(x) = f(x, θ) + δ(x)
where δ(x) represents this discrepancy
and will typically itself have uncertain parameters
We observe
zi = r(xi) + εi = f(xi, θ) + δ(xi) + εi
SM revisited
Kennedy and O’Hagan (2001) introduced this model discrepancy
Modelled it as a zero-mean Gaussian process
They claimed it acknowledges additional uncertainty
And guards against over-fitting of θ
So add this model discrepancy term to the linear model of the simple machine
r(x) = βx + δ(x)
With δ(x) modelled as a zero-mean GP
Posterior distribution of β now behaves quite differently
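A minimal sketch (not the Kennedy–O'Hagan implementation) of why the posterior broadens: give δ(x) a zero-mean GP prior with an assumed squared-exponential covariance, marginalise it out so the observations have covariance K + σ²I, and with a flat prior on β the posterior is normal with the generalised-least-squares mean and variance computed below.

```python
# Marginalising a zero-mean GP discrepancy inflates the observation covariance,
# which widens the posterior for beta relative to the no-discrepancy fit.
import numpy as np

rng = np.random.default_rng(0)
beta_true, sig_eps = 0.65, 0.02
x = rng.uniform(0, 4, 30)
z = beta_true * x - 0.04 * x**2 + rng.normal(0, sig_eps, len(x))  # same assumed reality

def sqexp(x1, x2, var=0.05**2, length=1.0):
    return var * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / length**2)

Sigma = sqexp(x, x) + sig_eps**2 * np.eye(len(x))   # cov of z after marginalising delta
Sinv_x = np.linalg.solve(Sigma, x)
post_var = 1.0 / (x @ Sinv_x)
post_mean = (Sinv_x @ z) * post_var
print(f"beta posterior: mean {post_mean:.3f}, sd {np.sqrt(post_var):.3f}")

# For comparison, the no-discrepancy fit is far more (over-)confident:
print(f"no-discrepancy sd: {sig_eps / np.sqrt(x @ x):.4f}")
```

The kernel and its hyperparameters here are assumptions made purely for illustration; in a full analysis they would be given priors and learned along with β.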
SM – calibration, with discrepancy
Posterior distribution much broader and doesn’t get worse with more data
But still misses the true value
Interpolation
Main benefit of simple GP model discrepancy is prediction
E.g. at x = 1.5
Prediction within the range of the data is possible
And gets better with more data
But when it comes to extrapolation …
… at x = 6
More data doesn’t help because it’s all in the range [0, 4]
Prediction OK here but gets worse for larger x
Extrapolation
One reason for wishing to learn about physical parameters
Should be better for extrapolation than just tuning
Without model discrepancy
The parameter estimates will be biased
Extrapolation will also be biased
Because best fitting parameter values are different in different parts of the control variable space
With more data we become more sure of these wrong values
With GP model discrepancy
Extrapolating far from the data does not work
No information about model discrepancy
Prediction just uses the (calibrated) simulator
We haven’t solved the problem
With simple GP model discrepancy the posterior distribution for θ is typically very wide
Increases the chance that we cover the true value
But is not very helpful
And increasing data does not improve the precision
Similarly, extrapolation with model discrepancy gives wide prediction intervals
And may still not be wide enough
What’s going wrong here?
Nonidentifiability
Formulation with model discrepancy is not identifiable
For any θ, there is a δ(x) to match reality perfectly
Reality is r(x) = f(x, θ) + δ(x)
Given θ, model discrepancy is δ(x) = r(x) – f(x, θ)
Suppose we had an unlimited number of observations
We would learn reality’s true function r(x) exactly
Within the range of the data
Interpolation works
But we would still not learn θ
It could in principle be anything
And we would still not be able to extrapolate reliably
The joint posterior
Calibration leads to a joint posterior distribution for θ and δ(x)
But nonidentifiability means there are many equally good fits (θ, δ(x)) to the data
Induces strong correlation between θ and δ(x)
This may be compounded by the fact that simulators often have large numbers of parameters
(Near-)redundancy means that different θ values produce (almost) identical predictions
Sometimes called equifinality
Within this set, the prior distributions for θ and δ(x) count
The importance of prior information
The nonparametric GP term allows the model to fit and predict reality accurately given enough data
Within the range of the data
But it doesn’t mean physical parameters are correctly estimated
The separation between original model and discrepancy is unidentified
Estimates depend on prior information
Unless the real model discrepancy is just the kind expected a priori the physical parameter estimates will still be biased
To learn about θ in the presence of model discrepancy we need better prior information
And this is also crucial for extrapolation
Better prior information
For calibration
Prior information about θ and/or δ(x)
We wish to calibrate because prior information about θ is not strong enough
So prior knowledge of model discrepancy is crucial
In the range of the data
For extrapolation
All this plus good prior knowledge of δ(x) outside the range of the calibration data
That’s seriously challenging!
In the SM, a model for δ(x) that says it is zero at x = 0, then increasingly negative, should do better
Inference about the physical parameter
We conditioned the GP
δ(0) = 0
δ′(0) = 0
δ′(0.5) < 0
δ′(1.5) < 0
Prediction
(Panels: predictions at x = 1.5 and at x = 6)
Where is the uncertainty?
Return to the general case
How might the simulator output y = f (x) differ from the true real-world value z that the simulator is supposed to predict?
Error in inputs x
Initial values
Forcing inputs
Model parameters
Error in model structure or solution
Wrong, inaccurate or incomplete science
Bugs, solution errors
Quantifying uncertainty
The ideal is to provide a probability distribution p(z) for the true real-world value
The centre of the distribution is a best estimate
Its spread shows how much uncertainty about z is induced by uncertainties on the previous slide
How do we get this?
Input uncertainty: characterise p(x), propagate through to p(y)
Model discrepancy: characterise p(z - y)
The hard part
We know pretty well how to do uncertainty propagation
Uncertainties associated with the simulator output
The hard part is the link to reality
The difference between the real system z and the simulator output y = f(x) using best input values
Because all models are wrong (Box, 1979)
It was through thinking about this link …
Particularly in the context of calibration
i.e. learning about uncertain parameters in the model
And also extrapolation
… that I was led to think more deeply about parameters
And to realise just how important model discrepancy is
Modelling model discrepancy
Three rules:
1. Must account for model discrepancy
Ignoring it leads to biased calibration, over-optimistic predictions
2. Discrepancy term must be modelled nonparametrically
Allows learning about reality and interpolative prediction
3. Model must incorporate realistic knowledge about discrepancy
To get unbiased learning about physical parameters and extrapolation
But following these rules is hard
Ongoing research
Managing uncertainty
To understand the implications of different uncertainty sources
Probabilistic, variance-based sensitivity analysis (see the sketch after this list)
Helps with targeting and prioritising research
To reduce uncertainty, get more information!
Informal – more/better science
Tighten p(x) through improved understanding
Tighten p(z - y) through improved modelling or programming
Formal – using real-world data
Calibration – learn about model parameters
Data assimilation – learn about the state variables
Learn about model discrepancy z - y
Validation (another talk!)
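As referenced above, a brute-force sketch of a variance-based (main-effect) sensitivity index, using an invented stand-in simulator: the contribution of input x1 is Var(E[Y | x1]) / Var(Y). In practice an emulator and more efficient Sobol'-type estimators would be used.

```python
# Brute-force main-effect index: fix x1, average over the other input, repeat,
# then compare the variance of those conditional means with the total variance.
import numpy as np

rng = np.random.default_rng(0)

def f(x1, x2):
    return x1**2 + 0.5 * x2   # stand-in simulator

def main_effect_index(n_outer=200, n_inner=500):
    cond_means = []
    for _ in range(n_outer):
        x1 = rng.normal(0, 1)                   # fix x1 at a sampled value
        x2 = rng.normal(0, 1, n_inner)          # vary the other input
        cond_means.append(f(x1, x2).mean())     # estimate E[Y | x1]
    x1 = rng.normal(0, 1, n_outer * n_inner)
    x2 = rng.normal(0, 1, n_outer * n_inner)
    return np.var(cond_means) / f(x1, x2).var()

print(main_effect_index())   # about 0.89 here: Var(x1^2) = 2 out of total 2.25
```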
Conclusions – Session 2
Without model discrepancy
Inference about physical parameters will be wrong
And will get worse with more data
The same is true of prediction
Both interpolation and extrapolation
With crude GP model discrepancy
Interpolation inference is OK
And gets better with more data
But we still get physical parameters and extrapolation wrong
The better our prior knowledge about model discrepancy, the more chance we have of getting physical parameters right
Also extrapolation
But then we need even better prior knowledge
Any final questions?
It remains just to say thank you for sitting through this morning’s sessions!