TRANSCRIPT
HepData for BSM reinterpretation of LHC data
Andy Buckley, University of Glasgow
HepData Advisory Meeting, 24 Nov 2017
Intro

- Quantity and quality of HepData content from the LHC has been steadily improving, including ad hoc "auxiliary data".
- The BSM pheno community is very happy about this: the vision is an automated limit re-setting toolchain with comprehensive data coverage.
- This includes both "SM measurements" and dedicated BSM search data.
- Primary data are faithfully recorded, modulo format details. The issue is with secondary data:
  - (MC) background estimates
  - correlation data
- All data need to flow "automatically" from the experiments, through HepData, and into analysis tools ⇒ standardise formats and conventions for data and auxiliary data.
- This frames a potential work-plan to include in a funding application.
Correlations in fits/limit setting

Many types of correlation:

- between bins/SRs, introduced by experimental/theory systematics;
- between bins/analyses, introduced by sharing events (or normalisation);
- between systematic (nuisance) parameters, induced by profile fitting.

Possible approaches to providing this information:

- a full likelihood expression, e.g. the HistFactory demo;
- approximate: express as independent error sources, correlated across bins (extensible);
- approximate: simplified likelihoods, which drop the connection to error sources, cover background systematics only, and are expressed as a (symmetric) bin covariance. Actively used by CMS: https://cds.cern.ch/record/2242860

It is not 100% clear that correlations are necessary, but without them there will always be questions of whether an analysis was too optimistic or too conservative.
Correlation formats: error sources vs. bin covariance

[Figure: CMS 0-lepton (0ℓ) bin-to-bin covariance matrix across NJets bins ([2], [3,4], [5,6], [7,8], [9+]), with covariance values spanning 10⁻¹ to 10⁶ on a log scale. CMS Supplementary, 35.9 fb⁻¹ (13 TeV), arXiv:1704.07781]

[Figure: error breakdown in a HepData record. NB. this is normal in Standard Model analyses]
The simplified likelihood (SL) was originally formalised as a symmetric covariance. Simple to use:

$L(\mu, \vec{\theta}) = \prod_i \mathrm{Pois}(n_i; \mu, \vec{\theta}) \cdot \mathrm{Gaus}(\vec{\theta}, C)$

The dimensionality of the covariance is fixed: a uniform approach that scales well. But it is limited to symmetric errors and no correlations between analyses.
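For concreteness, a minimal sketch (not the official CMS implementation) of evaluating such a simplified likelihood, assuming the nuisance parameters $\vec{\theta}$ shift the background yields additively; all variable names (n, s, b, C) are illustrative:

```python
# Minimal sketch of evaluating a simplified likelihood,
# L(mu, theta) = prod_i Pois(n_i; mu*s_i + b_i + theta_i) * Gaus(theta; 0, C),
# assuming the nuisances shift the background additively (an assumption here).
import numpy as np
from scipy.stats import multivariate_normal, poisson

def sl_log_likelihood(mu, theta, n, s, b, C):
    """Log of the simplified likelihood for one point (mu, theta)."""
    lam = mu * s + b + theta                      # expected yield per bin
    log_pois = poisson.logpmf(n, lam).sum()       # product of Poissons -> sum of logs
    log_gaus = multivariate_normal.logpdf(theta, mean=np.zeros_like(theta), cov=C)
    return log_pois + log_gaus

# Toy usage with two bins:
n = np.array([12, 7])                             # observed counts
s = np.array([3.0, 1.5])                          # nominal signal
b = np.array([10.0, 6.0])                         # nominal background
C = np.array([[4.0, 1.0],
              [1.0, 2.25]])                       # background covariance
print(sl_log_likelihood(1.0, np.zeros(2), n, s, b, C))
```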
HepData doesn't understand dataset semantics: it would need "link" metadata to reliably connect correlation datasets to their primary datasets.
The error-source representation is more flexible: one can construct the covariance matrix as $C_{ij} = \sum_e \sigma_i^{(e)} \sigma_j^{(e)}$, or handle asymmetric errors by toy-sampling. Extensible! Already supported. HepData's preference. But...
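A hedged sketch of the error-source construction $C_{ij} = \sum_e \sigma_i^{(e)} \sigma_j^{(e)}$; the source names and the dict layout are illustrative assumptions, not an existing HepData convention:

```python
# Sketch of building a bin covariance from independent error sources,
# treating each named systematic as fully correlated across bins.
import numpy as np

# per-source symmetric errors per bin (names are illustrative assumptions)
sources = {
    "sys_jes":  np.array([2.0, 1.5, 0.8]),   # e.g. jet energy scale
    "sys_lumi": np.array([0.4, 0.3, 0.2]),   # e.g. luminosity
    "stat":     np.array([1.0, 0.9, 0.7]),   # uncorrelated statistical errors
}

C = np.zeros((3, 3))
for name, sig in sources.items():
    if name == "stat":
        C += np.diag(sig**2)        # stat: diagonal only, no bin-to-bin correlation
    else:
        C += np.outer(sig, sig)     # each systematic: fully correlated across bins

print(C)
```

Note how the uncorrelated statistical source has to be recognised by name and treated differently from the correlated systematics; this previews the naming issue on the next slide.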
Logistical issues & extensions

- Standard names are needed, especially to distinguish uncorrelated stat errors.
- Groupings are also needed, e.g. to separate theory/MC errors from experimental/detector resolutions ⇒ enabling future reinterpretations with theory improvements. Easier with explicit covariance matrices?
- Error sources are naturally usable in an asymmetric way. But there is current activity on using skew moments to implement an asymmetric parametrisation: how to store this in HepData?! (A hedged toy-sampling sketch follows the figure below.)
[Figure: matrix of pairwise correlation plots, with 1D marginal distributions on the diagonal, between yield bins labelled 1b–8b]
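As referenced above, a hedged sketch of the simpler "asymmetric by toy-sampling" route (not the skew-moment parametrisation): throw one Gaussian pull per source, apply the up or down error according to the sign, and estimate an effective covariance from the toys. The (down, up) storage layout and source names are assumptions:

```python
# Toy-sampling of asymmetric, fully-correlated error sources into an
# effective bin covariance. Layout and names are illustrative only.
import numpy as np

rng = np.random.default_rng(42)

# per-source (down, up) errors per bin; down errors stored as negative
sources = {
    "sys_a": (np.array([-1.2, -0.8]), np.array([1.5, 1.0])),
    "sys_b": (np.array([-0.5, -0.9]), np.array([0.4, 1.1])),
}

def shift(down, up, z):
    """One fully-correlated toy shift for a single source."""
    return np.where(z > 0, z * up, -z * down)   # z<0 picks the (negative) down error

toys = np.zeros((10000, 2))
for down, up in sources.values():
    z = rng.standard_normal(10000)
    toys += shift(down, up, z[:, None])

C_eff = np.cov(toys, rowvar=False)              # effective 2x2 covariance
print(C_eff)
```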
Possibility for HepData to have semantic understanding of correlations?
MC and background data

- Correlations are the most technically complex demand, since the data objects are semantically different from "normal" datasets.
- They are not the only requirement for scalable recasting, though: background estimates are also crucial.
- Typical BSM reinterpretations only have the capacity to generate (maybe LO) signal events.
- Backgrounds are computed by the experiments using vast MC datasets with very complex, CPU-intensive, high-sophistication modelling: this is not reproducible, so it needs to be published.
- This has started, but, again, how do we make HepData (and its API) semantically aware of what is data and what is the corresponding MC? And of the background-process breakdown? And pre-/post-fit? ... (A purely hypothetical tagging scheme is sketched below.)
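A purely hypothetical illustration of what such semantic awareness could look like; none of these keys or values are an existing HepData convention:

```python
# Hypothetical "semantic role" metadata letting a tool distinguish observed
# data from background MC tables and resolve the link between them.
record = {
    "tables": [
        {"name": "SR yields (observed)",
         "role": "data"},
        {"name": "SR yields (background)",
         "role": "background-mc",
         "fit_stage": "post-fit",                       # vs. "pre-fit"
         "processes": ["ttbar", "W+jets", "multijet"],  # process breakdown
         "companion_of": "SR yields (observed)"},       # link to the primary table
    ]
}

# A recasting tool could then resolve the pairing programmatically:
bkg = next(t for t in record["tables"] if t.get("role") == "background-mc")
print("Background table", repr(bkg["name"]), "pairs with", repr(bkg["companion_of"]))
```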