TRANSCRIPT
HepData for BSM reinterpretation of LHC data
Andy Buckley, University of Glasgow
HepData Advisory Meeting, 24 Nov 2017
Intro

- Quantity and quality of HepData content from the LHC has been steadily improving, including ad hoc "auxiliary data".
- The BSM pheno community is very happy about this: the vision is an automated limit re-setting toolchain with comprehensive data coverage.
- This includes both "SM measurements" and dedicated BSM search data.
- Primary data are faithfully recorded, modulo format details. The issue is with secondary data:
  - (MC) background estimates
  - correlation data
- All data need to flow "automatically" from the experiments, through HepData, and into analysis tools ⇒ standardise formats and conventions for data and auxiliary data.
- This frames a potential work-plan to include in a funding application.
Correlations in fits/limit setting

Many types of correlation:

- between bins/SRs, introduced by experimental/theory systematics;
- between bins/analyses, introduced by sharing events (or normalisation);
- between systematic (nuisance) parameters, induced by profile fitting.

Possible approaches to providing this information:

- a full likelihood expression, e.g. the HistFactory demo;
- approximate: express as independent error sources, correlated across bins (extensible);
- approximate: simplified likelihoods, which drop the connection to error sources, cover background systematics only, and are expressed as a (symmetric) bin covariance. Actively used by CMS: https://cds.cern.ch/record/2242860

It is not 100% clear that correlations are necessary, but without them there will always be questions of whether an analysis was too optimistic or too conservative.
Correlation formats: error sources vs. bin covariance

[Figure: CMS 0-lepton (0ℓ) bin-to-bin covariance matrix across NJets bins ([2], [3,4], [5,6], [7,8], [9+]), with covariance values spanning 10⁻¹ to 10⁶ on a log scale. CMS Supplementary, 35.9 fb⁻¹ (13 TeV), arXiv:1704.07781]

[Figure: error breakdown in a HepData record. NB. this is normal in Standard Model analyses]
The simplified likelihood (SL) was originally formalised as a symmetric covariance. Simple to use:

$L(\mu, \vec{\theta}) = \prod_i \mathrm{Pois}(n_i; \mu, \vec{\theta}) \cdot \mathrm{Gaus}(\vec{\theta}, C)$

The dimensionality of the covariance is fixed: a uniform approach that scales well. But it is limited to symmetric errors and no correlations between analyses.
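For concreteness, a minimal sketch (not the official CMS implementation) of evaluating such a simplified likelihood, assuming the nuisance parameters $\vec{\theta}$ shift the background yields additively; all variable names (n, s, b, C) are illustrative:

```python
# Minimal sketch of evaluating a simplified likelihood,
# L(mu, theta) = prod_i Pois(n_i; mu*s_i + b_i + theta_i) * Gaus(theta; 0, C),
# assuming the nuisances shift the background additively (an assumption here).
import numpy as np
from scipy.stats import multivariate_normal, poisson

def sl_log_likelihood(mu, theta, n, s, b, C):
    """Log of the simplified likelihood for one point (mu, theta)."""
    lam = mu * s + b + theta                      # expected yield per bin
    log_pois = poisson.logpmf(n, lam).sum()       # product of Poissons -> sum of logs
    log_gaus = multivariate_normal.logpdf(theta, mean=np.zeros_like(theta), cov=C)
    return log_pois + log_gaus

# Toy usage with two bins:
n = np.array([12, 7])                             # observed counts
s = np.array([3.0, 1.5])                          # nominal signal
b = np.array([10.0, 6.0])                         # nominal background
C = np.array([[4.0, 1.0],
              [1.0, 2.25]])                       # background covariance
print(sl_log_likelihood(1.0, np.zeros(2), n, s, b, C))
```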
HepData doesn't understand dataset semantics: it would need "link" metadata to reliably connect correlation datasets to their primary datasets.
The error-source representation is more flexible: one can construct the covariance matrix as $C_{ij} = \sum_e \sigma_i^{(e)} \sigma_j^{(e)}$, or handle asymmetric errors by toy-sampling. Extensible! Already supported. HepData's preference. But...
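A hedged sketch of the error-source construction $C_{ij} = \sum_e \sigma_i^{(e)} \sigma_j^{(e)}$; the source names and the dict layout are illustrative assumptions, not an existing HepData convention:

```python
# Sketch of building a bin covariance from independent error sources,
# treating each named systematic as fully correlated across bins.
import numpy as np

# per-source symmetric errors per bin (names are illustrative assumptions)
sources = {
    "sys_jes":  np.array([2.0, 1.5, 0.8]),   # e.g. jet energy scale
    "sys_lumi": np.array([0.4, 0.3, 0.2]),   # e.g. luminosity
    "stat":     np.array([1.0, 0.9, 0.7]),   # uncorrelated statistical errors
}

C = np.zeros((3, 3))
for name, sig in sources.items():
    if name == "stat":
        C += np.diag(sig**2)        # stat: diagonal only, no bin-to-bin correlation
    else:
        C += np.outer(sig, sig)     # each systematic: fully correlated across bins

print(C)
```

Note how the uncorrelated statistical source has to be recognised by name and treated differently from the correlated systematics; this previews the naming issue on the next slide.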
Logistical issues & extensions

- Standard names are needed, especially to distinguish uncorrelated stat errors.
- Groupings are also needed, e.g. to separate theory/MC errors from experimental/detector resolutions ⇒ enabling future reinterpretations with theory improvements. Easier with explicit covariance matrices?
- Error sources are naturally usable in an asymmetric way. But there is current activity on using skew moments to implement an asymmetric parametrisation: how to store this in HepData?! (A hedged toy-sampling sketch follows the figure below.)
[Figure: matrix of pairwise correlation plots, with 1D marginal distributions on the diagonal, between yield bins labelled 1b–8b]
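As referenced above, a hedged sketch of the simpler "asymmetric by toy-sampling" route (not the skew-moment parametrisation): throw one Gaussian pull per source, apply the up or down error according to the sign, and estimate an effective covariance from the toys. The (down, up) storage layout and source names are assumptions:

```python
# Toy-sampling of asymmetric, fully-correlated error sources into an
# effective bin covariance. Layout and names are illustrative only.
import numpy as np

rng = np.random.default_rng(42)

# per-source (down, up) errors per bin; down errors stored as negative
sources = {
    "sys_a": (np.array([-1.2, -0.8]), np.array([1.5, 1.0])),
    "sys_b": (np.array([-0.5, -0.9]), np.array([0.4, 1.1])),
}

def shift(down, up, z):
    """One fully-correlated toy shift for a single source."""
    return np.where(z > 0, z * up, -z * down)   # z<0 picks the (negative) down error

toys = np.zeros((10000, 2))
for down, up in sources.values():
    z = rng.standard_normal(10000)
    toys += shift(down, up, z[:, None])

C_eff = np.cov(toys, rowvar=False)              # effective 2x2 covariance
print(C_eff)
```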
Possibility for HepData to have semantic understanding of correlations?
MC and background data

- Correlations are the most technically complex demand, since the data objects are semantically different from "normal" datasets.
- They are not the only requirement for scalable recasting, though: background estimates are also crucial.
- Typical BSM reinterpretations only have the capacity to generate (maybe LO) signal events.
- Backgrounds are computed by the experiments using vast MC datasets with very complex, CPU-intensive, high-sophistication modelling: this is not reproducible, so it needs to be published.
- This has started, but, again, how do we make HepData (and its API) semantically aware of what is data and what is the corresponding MC? And of the background-process breakdown? And pre-/post-fit? ... (A purely hypothetical tagging scheme is sketched below.)
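A purely hypothetical illustration of what such semantic awareness could look like; none of these keys or values are an existing HepData convention:

```python
# Hypothetical "semantic role" metadata letting a tool distinguish observed
# data from background MC tables and resolve the link between them.
record = {
    "tables": [
        {"name": "SR yields (observed)",
         "role": "data"},
        {"name": "SR yields (background)",
         "role": "background-mc",
         "fit_stage": "post-fit",                       # vs. "pre-fit"
         "processes": ["ttbar", "W+jets", "multijet"],  # process breakdown
         "companion_of": "SR yields (observed)"},       # link to the primary table
    ]
}

# A recasting tool could then resolve the pairing programmatically:
bkg = next(t for t in record["tables"] if t.get("role") == "background-mc")
print("Background table", repr(bkg["name"]), "pairs with", repr(bkg["companion_of"]))
```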