Big Data Competition: Maximizing Your Potential, Exampled with the 2014 Higgs Boson Machine Learning Challenge

Dr. Cheng CHEN, Development Consulting International LLC (goDCI.com)

Upload: cheng-chen

Post on 09-Jul-2015


Category:

Data & Analytics



DESCRIPTION

The Higgs Boson Machine Learning Challenge is by far one of the biggest big data competitions focused on data analysis in the world. To succeed in it, Cheng applied his knowledge of computer science, mathematics, statistics, and physics, along with the problem-solving habits he developed during his training in civil engineering. In this presentation, Cheng uses his experience in the competition to illustrate some important elements of big data analytics and why they matter. The content spans several disciplines, such as physics, statistics, and mathematics, but no background knowledge of these areas is required to understand the essence of the presentation. In brief, the presentation covers: an effective framework for general data mining projects; an introduction to the competition and its physics background; techniques for data exploration and some traps to avoid; ways of enhancing features; model building and selection; and optimization of model performance.

TRANSCRIPT

Page 1: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

BIG DATA COMPETITION: MAXIMIZING YOUR POTENTIAL

EXAMPLED WITH THE 2014 HIGGS BOSON MACHINE LEARNING CHALLENGE

Dr. Cheng CHEN
email: [email protected]
twitter: @cheng_chen_us
Development Consulting International LLC
goDCI.com

This presentation is copyright protected ©

Page 2: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

Ohio State University, Tongji University

Ph.D. Civil Engineering

M.S. Applied Statistics

Minor Computer Science

Advanced trainings:

City and Regional Planning

Industrial and Systems Engineering

Mathematics

Passion: (this) machine learning

PRESENTER

2

Page 3: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Goal: improve the procedure that produces the selection region for the Higgs boson
• 4-month duration
• 1,785 teams
• Many machine learning experts, statisticians, and physicists
• The top 5 are from 5 different countries: Netherlands, Hungary, France, Russia, and U.S.A./China

HIGGS BOSON MACHINE LEARNING CHALLENGE

http://www.kaggle.com/c/higgs-boson/leaderboard

Page 4: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

[Figure: project framework diagram. Labels shown: Background, Data, Model; Understand, Explore, Enhance, Train, Select, Optimize, Validate; read, discuss, visualize, reduce, generate, find, apply, cross validate, fine-tune, innovate.]

Page 6: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

READ  AND  DISCUSS

6

Page 7: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• a.k.a. the God Particle (explains some mass)
• A fundamental particle theorized in 1964 in the Standard Model of particle physics
• “Considered” discovered in 2011 to 2013 at the LHC by CERN
• A number of prestigious awards in 2013, including a Nobel Prize

HIGGS BOSON

Image: http://upload.wikimedia.org/wikipedia/commons/0/00/Standard_Model_of_Elementary_Particles.svg

A "definitive" answer might require "another few years" after the collider's 2015 restart. (Deputy chair of physics at Brookhaven National Laboratory)

http://en.wikipedia.org/wiki/Higgs_boson

Page 8: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Established in 1954
• Birthplace of the World Wide Web (1989)

CERN: THE EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH

Map: maps.google.com

Page 9: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• 27 km (17 mi) in circumference
• As deep as 175 meters (574 ft) beneath ground
• Built from 1998 to 2008
• Over 10,000 scientists and engineers
• Over 100 countries
• Seven particle detectors

LARGE HADRON COLLIDER (LHC)

https://www.llnl.gov/news/llnl-set-host-international-lattice-physics-conference

http://en.wikipedia.org/wiki/Large_Hadron_Collider

Page 10: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• 46 meters long
• 25 meters in diameter
• Weighs about 7,000 tonnes
• Contains some 3,000 km of cable
• Involves roughly 3,000 physicists from over 175 institutions in 38 countries

ATLAS

http://en.wikipedia.org/wiki/Large_Hadron_Collider

http://higgsml.lal.in2p3.fr/documentation/

Page 13: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• The Higgs boson cannot be measured directly (it decays immediately into lighter particles)
• Other particles can decay into the same set of lighter particles
• The production and decay of the Higgs boson depend on its mass, which the theory did not predict (we now know it is close to 125 GeV)

CHALLENGES IN DETECTION OF HIGGS BOSON

https://www2.physics.ox.ac.uk/sites/default/files/2012-03-27/sinead_farrington_pdf_17376.pdf

Seeing a circular shadow does not mean the real object is a sphere

Page 14: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Raw data collected from the LHC
• Hundreds of millions of proton-proton collisions (events) per second
• 400 events of interest are selected per second
  – Signal event (i.e. Higgs boson)
  – Background event (i.e. other particles)
• Look for events in an ad hoc selection region (in certain channels) that exceed the background noise

CURRENT DETECTION MECHANISM

Needs improvement in the significance and robustness of the selection criteria

Page 15: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Simulated data
• Fixed mass (125 GeV)
• Simplified decay channel
  – Next slide
• Simplified background events (three representative types only)
  – Decay of the Z boson (91.2 GeV) into tau-tau
  – Decay of a pair of top quarks into a lepton and a hadronic tau
  – “Decay” of the W boson into a lepton and a hadronic tau due to imperfections in the particle identification procedure
• Simplified objective function (significance score)

SIMPLIFICATIONS FOR COMPETITION

Page 16: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Tau-tau decay channel only
• One tau decays into a lepton and two neutrinos
• The other tau decays into a hadronic tau and a neutrino
• (Note: neutrinos cannot be detected)

SIMPLIFIED DECAY CHANNEL

Hadronic tau: a bunch of hadrons

Page 18: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Tau-tau decay channel only
• One tau decays into a lepton and two neutrinos
• The other tau decays into a hadronic tau and a neutrino
• (Note: neutrinos cannot be detected)

SIMPLIFIED DECAY CHANNEL

Figure labels: Jets, MET

Vectorized momenta are given

Hadronic tau: a bunch of hadrons

Page 19: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

[Figure: project framework diagram, repeated as a section divider. Labels shown: Background, Data, Model; Understand, Explore, Enhance, Train, Select, Optimize, Validate; read, discuss, visualize, reduce, generate, find, apply, cross validate, fine-tune, innovate.]

Page 20: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• 250,000 training
• 550,000 testing
• 30 variables
  – 17 primitive
    • Momenta
    • Direction
  – 13 derived

DATA DIMENSION

4 rows in training data (one column per EventId)

Variable                     100000            100001         100002         100003
DER_mass_MMC                 138.47            160.937        NA             143.905
DER_mass_transverse_met_lep  51.655            68.768         162.172        81.417
DER_mass_vis                 97.827            103.235        125.953        80.943
DER_pt_h                     27.98             48.146         35.635         0.414
DER_deltaeta_jet_jet         0.91              NA             NA             NA
DER_mass_jet_jet             124.711           NA             NA             NA
DER_prodeta_jet_jet          2.666             NA             NA             NA
DER_deltar_tau_lep           3.064             3.473          3.148          3.31
DER_pt_tot                   41.928            2.078          9.336          0.414
DER_sum_pt                   197.76            125.157        197.814        75.968
DER_pt_ratio_lep_tau         1.582             0.879          3.776          2.354
DER_met_phi_centrality       1.396             1.414          1.414          -1.285
DER_lep_eta_centrality       0.2               NA             NA             NA
PRI_tau_pt                   32.638            42.014         32.154         22.647
PRI_tau_eta                  1.017             2.039          -0.705         -1.655
PRI_tau_phi                  0.381             -3.011         -2.093         0.01
PRI_lep_pt                   51.626            36.918         121.409        53.321
PRI_lep_eta                  2.273             0.501          -0.953         -0.522
PRI_lep_phi                  -2.414            0.103          1.052          -3.1
PRI_met                      16.824            44.704         54.283         31.082
PRI_met_phi                  -0.277            -1.916         -2.186         0.06
PRI_met_sumet                258.733           164.546        260.414        86.062
PRI_jet_num                  2                 1              1              0
PRI_jet_leading_pt           67.435            46.226         44.251         NA
PRI_jet_leading_eta          2.15              0.725          2.053          NA
PRI_jet_leading_phi          0.444             1.158          -2.028         NA
PRI_jet_subleading_pt        46.062            NA             NA             NA
PRI_jet_subleading_eta       1.24              NA             NA             NA
PRI_jet_subleading_phi       -2.475            NA             NA             NA
PRI_jet_all_pt               113.497           46.226         44.251         0
Weight                       0.00265331133733  2.23358448717  2.34738894364  5.44637821192
Label                        s                 b              b              b

Data loaded correctly

Notice NA values

Page 21: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

MISSING VALUES

Columns with missing values (of 33 columns in the training data):

 #   col_name                 NA_count   NA_pct
 2   DER_mass_MMC               38,114      15%
 6   DER_deltaeta_jet_jet      177,457      71%
 7   DER_mass_jet_jet          177,457      71%
 8   DER_prodeta_jet_jet       177,457      71%
14   DER_lep_eta_centrality    177,457      71%
25   PRI_jet_leading_pt         99,913      40%
26   PRI_jet_leading_eta        99,913      40%
27   PRI_jet_leading_phi        99,913      40%
28   PRI_jet_subleading_pt     177,457      71%
29   PRI_jet_subleading_eta    177,457      71%
30   PRI_jet_subleading_phi    177,457      71%

The remaining columns, including EventId, Weight, and Label, show no NA count.

Page 22: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

Notice  the  consistency  in  missing  values
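A minimal sketch of how this consistency could be checked, using the four sample rows from the DATA DIMENSION slide with `None` standing in for NA. The helper names `na_count` and `jet_rule_holds` are hypothetical. The rule being tested (leading-jet fields are missing exactly when PRI_jet_num is 0, subleading-jet fields exactly when it is below 2) is an assumption suggested by the identical NA counts; it holds on these four rows, which is not proof for the full data set.

```python
# Four sample events (a subset of columns) from the DATA DIMENSION slide; None = NA.
ROWS = [
    {"EventId": 100000, "DER_mass_MMC": 138.47, "PRI_jet_num": 2,
     "PRI_jet_leading_pt": 67.435, "PRI_jet_subleading_pt": 46.062},
    {"EventId": 100001, "DER_mass_MMC": 160.937, "PRI_jet_num": 1,
     "PRI_jet_leading_pt": 46.226, "PRI_jet_subleading_pt": None},
    {"EventId": 100002, "DER_mass_MMC": None, "PRI_jet_num": 1,
     "PRI_jet_leading_pt": 44.251, "PRI_jet_subleading_pt": None},
    {"EventId": 100003, "DER_mass_MMC": 143.905, "PRI_jet_num": 0,
     "PRI_jet_leading_pt": None, "PRI_jet_subleading_pt": None},
]


def na_count(rows):
    """Return {column name: number of missing values}."""
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col, value in row.items():
            if value is None:
                counts[col] += 1
    return counts


def jet_rule_holds(rows):
    """Check that jet fields are missing exactly when too few jets exist (assumed rule)."""
    for row in rows:
        n_jets = row["PRI_jet_num"]
        if (row["PRI_jet_leading_pt"] is None) != (n_jets == 0):
            return False
        if (row["PRI_jet_subleading_pt"] is None) != (n_jets < 2):
            return False
    return True


counts = na_count(ROWS)
rule_ok = jet_rule_holds(ROWS)
```

If the rule also holds on the full training data, the jet-related NAs are structural (there is no jet to measure) rather than random gaps, which matters for the choice of handling strategy.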

Page 23: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Assign a value
  – Generate a random value
  – Fit a value (mean, median, nearest neighbor, etc.)
  – Fix a value (domain knowledge)
• Remove the record
• Leave as is

HOW TO HANDLE MISSING VALUES
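The options listed on this slide can be sketched in a few lines. This is an illustration only, not the presenter's code: the input is DER_mass_MMC for the four sample events shown earlier, the function names are hypothetical, and the constant -999.0 is an arbitrary sentinel chosen for the example.

```python
import statistics

# DER_mass_MMC for the four sample events; None = NA.
values = [138.47, 160.937, None, 143.905]


def fit_median(xs):
    """Assign a value: replace each missing entry with the median of the observed ones."""
    med = statistics.median(x for x in xs if x is not None)
    return [med if x is None else x for x in xs]


def fix_value(xs, fill):
    """Assign a value: use a constant chosen from domain knowledge."""
    return [fill if x is None else x for x in xs]


def remove_records(xs):
    """Remove the record: keep only the entries that are present."""
    return [x for x in xs if x is not None]


filled = fit_median(values)          # missing entry becomes 143.905
flagged = fix_value(values, -999.0)  # missing entry becomes the sentinel
kept = remove_records(values)        # three entries remain
```

"Leave as is" needs no code, but it only works when the model that consumes the data accepts missing inputs.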

Page 25: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

HISTOGRAM

[Figure: three histograms (y-axis: Count). Panels: PRI_jet_leading_pt, Log transformation, Inverse transformation.]

Density is more meaningful in the range of x

No fuzzy jump at the edge
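A small sketch of the idea behind these panels: bin a right-skewed positive variable on its original scale, then after a log and an inverse transformation, and compare how the counts spread over the bins. The sample values and the `histogram` helper are hypothetical; they are not taken from the competition data.

```python
import math


def histogram(xs, n_bins):
    """Equal-width histogram; returns (bin_edges, counts)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        # The maximum value is placed in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical right-skewed positive values (for example a transverse momentum).
pt = [30.5, 32.0, 35.1, 41.7, 46.2, 52.9, 67.4, 88.0, 150.3, 420.0]

raw_counts = histogram(pt, 5)[1]
log_counts = histogram([math.log(x) for x in pt], 5)[1]
inv_counts = histogram([1.0 / x for x in pt], 5)[1]
```

On the original scale most values fall into the first bin and the long tail leaves bins empty; after the log transformation the same values are spread more evenly, so the shape of the distribution is easier to read.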

Page 26: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

HISTOGRAM (CONT’D)

[Figure: three histograms (y-axis: Count). Panels: DER_pt_h, Log transformation, Inverse transformation.]

Bi-modality is revealed

Page 27: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

INTERACTIVE VISUALIZATION: R SHINY

DEMO: http://chencheng.shinyapps.io/demo_higgs

Page 29: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

INTERACTIVE VISUALIZATION: R SHINY

Use a reasonable number of bins to display the underlying distribution

Page 30: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

INTERACTIVE VISUALIZATION: R SHINY

Use a reasonable transformation to display the underlying distribution

Page 31: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

HISTOGRAM (CONT’D)

[Figure: histogram of PRI_tau_eta (y-axis: Count).]

Transformations are sometimes not necessary

Page 32: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

32

Do  that  for  all  30  variables

Page 33: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PAIRWISE CORRELATIONS

[Figure: PRI_lep_phi & PRI_met_phi; labels: Count, BKG, SGN.]

Page 34: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PAIRWISE CORRELATIONS

[Figure: PRI_lep_phi & PRI_met_phi; labels: Count, BKG, SGN.]

Set the transparency parameter appropriately to reveal important patterns

Page 35: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PAIRWISE CORRELATIONS

[Figure: PRI_lep_phi & PRI_met_phi; labels: Count, BKG, SGN.]

A correlation coefficient of 0 does not mean the two variables are unrelated
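A short illustration of why a correlation coefficient of zero can hide a strong relationship. The data are synthetic (y is exactly x squared on a symmetric range), not the competition variables, and `pearson` is a hand-written helper rather than a library call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


xs = [i / 10.0 for i in range(-30, 31)]   # symmetric around 0
ys = [x * x for x in xs]                  # y is fully determined by x

r = pearson(xs, ys)                         # close to 0 despite full dependence
r_abs = pearson([abs(x) for x in xs], ys)   # strongly positive once x is folded
```

The coefficient only measures linear association, so a plot (or a transformed variable such as |x| here) can reveal structure that the single number misses.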

Page 37: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

FEATURE ENHANCEMENT: ROTATION

[Figure: rotated PRI_lep_phi & PRI_met_phi; labels: BKG, SGN.]

Validate visual “evidence” from various perspectives
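A minimal sketch of rotating a pair of variables. The transcript does not state the angle used for this figure; 45 degrees is an assumption borrowed from the "45 degree rotation" callout on the FEATURE ENHANCEMENT summary slide. The `rotate` helper is hypothetical, and the input is event 100000 from the sample rows.

```python
import math


def rotate(x, y, degrees):
    """Rotate the point (x, y) counter-clockwise by the given angle."""
    t = math.radians(degrees)
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))


# PRI_lep_phi and PRI_met_phi of sample event 100000.
lep_phi, met_phi = -2.414, -0.277

# Assumed angle: 45 degrees. The new axes are then proportional to the
# difference and the sum of the two original variables.
u, v = rotate(lep_phi, met_phi, 45)
```

With this angle, `u` equals (lep_phi - met_phi) / sqrt(2) and `v` equals (lep_phi + met_phi) / sqrt(2), so a diagonal band in the original plot lines up with one axis. The sketch does not wrap angle differences back into the range from -pi to pi, which a real feature for periodic variables would probably need.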

Page 39: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PAIRWISE VARIABLES: LOW RES.

[Figure: DER_pt_h & DER_deltar_tau_lep; labels: Count, BKG, SGN.]

Page 40: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PAIRWISE VARIABLES: HIGH RES.

[Figure: DER_pt_h & DER_deltar_tau_lep; labels: Count, BKG, SGN.]

Try high resolution

Page 41: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PAIRWISE VARIABLES: HIGH RES.

[Figure: DER_pt_h & DER_deltar_tau_lep; labels: Count, BKG, SGN.]

Curve fitting

Page 42: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

FEATURE ENHANCEMENT: CURVE FITTING

[Figure: DER_pt_h & DER_deltar_tau_lep; labels: Count, BKG, SGN.]

Enhance a variable based on its correlation with another variable

Page 43: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI

[Figure: DER_pt_h & PRI_lep_phi; labels: Count, BKG, SGN.]

Domain knowledge

Page 44: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI

[Figure: DER_pt_h & PRI_lep_phi; labels: Count, BKG, SGN.]

Feature enhancement by applying domain knowledge

Page 45: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

FEATURE ENHANCEMENT: ROTATION

[Figure: PRI_jet_leading_eta & PRI_jet_subleading_eta; labels: Count, BKG, SGN.]

Page 46: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Select variable(s): one variable for a histogram, two variables for a scatter plot

DATA DRILL DOWN

DEMO: http://chencheng.shinyapps.io/demo_higgs

Page 47: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Dynamically select a subset of data: PRI_jet_num = 2

DATA DRILL DOWN

Page 48: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Patterns in the subset data: PRI_jet_leading_eta & PRI_jet_subleading_eta

DATA DRILL DOWN

Page 49: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Dynamically select a subset of data: PRI_jet_num = 3

DATA DRILL DOWN

Page 50: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Patterns in the subset data: PRI_jet_leading_eta & PRI_jet_subleading_eta

DATA DRILL DOWN

Page 51: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Patterns in the subset data: PRI_jet_leading_eta & PRI_jet_subleading_eta

DATA DRILL DOWN

[Figure: two panels, PRI_jet_num = 2 and PRI_jet_num = 3.]

Interactive data visualization techniques are helpful

Page 52: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

52

Do that for all 30 * 29 = 870 (~= 900) ordered pairs, i.e. 435 distinct pairs

Page 53: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PARTICLE LOCATION: (0, S)

Convert numerical data back into an actual object with meaning

[Animation]

Page 54: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PARTICLE LOCATION: (0, B)

[Animation]

Page 55: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Distance ratio between MET-Lep and Tau-Lep: d(MET, Lep) / d(Tau, Lep)

INSPIRATION FROM ANIMATION

Inspiration from meaningful visualization can be helpful

[Figure: histogram of dist_ratio_met_lep_tau (y-axis: Count); labels: BKG, SGN.]

Page 56: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Distance ratio between MET-Lep and Tau-Lep: d(MET, Lep) / d(Tau, Lep)

INSPIRATION FROM ANIMATION

Adjust visualization for better efficiency

[Figure: two histograms of dist_ratio_met_lep_tau (y-axis: Count); labels: BKG, SGN.]

Page 57: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Variable reduction
  – Simple rotation
  – Transformation
  – Domain knowledge
  – …
• Feature generation
  – Domain knowledge
  – Inspiration from various visualizations
  – Statistical approaches
  – …

FEATURE ENHANCEMENT

Callouts: Principal component analysis; distance_ratio; Rotation by phi; Curve fitting; 45 degree rotation

Page 58: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

[Figure: project framework diagram, repeated as a section divider. Labels shown: Background, Data, Model; Understand, Explore, Enhance, Train, Select, Optimize, Validate; read, discuss, visualize, reduce, generate, find, apply, cross validate, fine-tune, innovate.]

Page 59: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Gradient boosting tree
• Neural network
• Bayesian network
• Support vector machine
• Generalized additive model

MODELS

Page 61: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Decision tree
  – Build many shallow trees
• Boosting
  – Build trees based on the residuals
• Bagging
  – Each tree uses a subset of the data
• Ensembling
  – Combine the trees

GRADIENT BOOSTING TREE

Page 63: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Regression tree

DECISION TREE

[Figure: scatter plot of y (axis from −1.0 to 1.0) against x (axis from 0.0 to 10.0).]

Page 64: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Regression tree

DECISION TREE

[Figure: the same scatter plot of y against x, shown with the tree below.]

Regression Tree with Node Depth = 1

root: mean 0.19, n=100
    x < 6.614: mean −0.08, n=64
    x >= 6.614: mean 0.66, n=36
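A depth-1 regression tree reduces to one search: find the threshold on x that minimises the total squared error when each side is predicted by its own mean. The sketch below does that search on a few made-up points; the data and the `best_split` helper are hypothetical and are not the 100 points plotted on the slide.

```python
def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimising the squared error.

    Assumes at least two distinct x values.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        sse = (sum((y - mean_l) ** 2 for y in left)
               + sum((y - mean_r) ** 2 for y in right))
        if best is None or sse < best[0]:
            # Place the threshold halfway between the two neighbouring x values.
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, mean_l, mean_r)
    return best[1:]


# Made-up data with a clear jump between x = 3 and x = 7.
xs = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
ys = [-0.1, 0.0, -0.2, 0.6, 0.7, 0.8]

threshold, left_mean, right_mean = best_split(xs, ys)
```

Deeper trees repeat the same search inside each resulting group, which is how the depth 2, 3, and 4 trees on the following slides refine the fit.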

Page 65: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Regression tree

DECISION TREE

Regression Tree with Node Depth = 2

root: mean 0.19, n=100
    x < 6.614: mean −0.08, n=64
        x >= 3.049: mean −0.53, n=40
        x < 3.049: mean 0.67, n=24
    x >= 6.614: mean 0.66, n=36
        x >= 8.953: mean 0.086, n=7
        x < 8.953: mean 0.8, n=29

[Figure: scatter plot of y (axis from −1.0 to 1.0) against x (axis from 0.0 to 10.0), Depth = 2.]

Page 66: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Regression tree

DECISION TREE

Regression Tree with Node Depth = 3

root: mean 0.19, n=100
    x < 6.614: mean −0.08, n=64
        x >= 3.049: mean −0.53, n=40
            x < 5.862: mean −0.67, n=32
            x >= 5.862: mean 0.045, n=8
        x < 3.049: mean 0.67, n=24
    x >= 6.614: mean 0.66, n=36
        x >= 8.953: mean 0.086, n=7
        x < 8.953: mean 0.8, n=29
            x < 7.207: mean 0.57, n=7
            x >= 7.207: mean 0.87, n=22

[Figure: scatter plot of y against x, Depth = 3.]

Page 67: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

• Regression tree

DECISION TREE

Regression Tree with Node Depth = 4

root: mean 0.19, n=100
    x < 6.614: mean −0.08, n=64
        x >= 3.049: mean −0.53, n=40
            x < 5.862: mean −0.67, n=32
                x >= 3.594: mean −0.8, n=25
                x < 3.594: mean −0.23, n=7
            x >= 5.862: mean 0.045, n=8
        x < 3.049: mean 0.67, n=24
    x >= 6.614: mean 0.66, n=36
        x >= 8.953: mean 0.086, n=7
        x < 8.953: mean 0.8, n=29
            x < 7.207: mean 0.57, n=7
            x >= 7.207: mean 0.87, n=22

[Figure: scatter plot of y against x, Depth = 4.]

Page 68: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);                  # base model
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - latest_model(X);
    tree(ii) = train_tree(X, v_resid);
    latest_model += LEARNING_RATE * tree(ii)
end

DECISION TREE

Page 69: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - latest_model(X);                # get the residuals
    tree_add = train_tree(X, v_resid);            # fit a tree for residuals
    latest_model += LEARNING_RATE * tree_add      # additive model
end

GRADIENT BOOSTING TREE (V. 1)

Page 70: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

X0 = X; Y0 = Y;                                   # store input
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)   # get sampled index
    X = X0[Index_train]; Y = Y0[Index_train];     # sampled records as input
    v_resid = Y - latest_model(X);
    tree_add = train_tree(X, v_resid);
    latest_model += LEARNING_RATE * tree_add
end

(STOCHASTIC) GRADIENT BOOSTING TREE

Page 71: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

X0 = X; Y0 = Y; W0 = wts;
latest_model = train_tree(X, Y, wts);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train]; wts = W0[Index_train];
    v_resid = Y - latest_model(X);                # residuals stay unweighted
    tree_add = train_tree(X, v_resid, wts);       # weights enter the tree fit
    latest_model += LEARNING_RATE * tree_add
end

(STOCHASTIC) GRADIENT BOOSTING TREE WITH WEIGHT

Page 72: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

X0 = X; Y0 = Y; W0 = wts;
latest_model = train_base_model(X, Y, wts);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train]; wts = W0[Index_train];
    v_pseudo_resid = get_pseudo_residual(X, Y, wts, latest_model, LOSS_FUNCTION_TYPE);
    model_add_base = train_base_model(X, v_pseudo_resid, wts);
    alpha = line_search(cost_function, latest_model, model_add_base, X, Y, wts);
    latest_model += LEARNING_RATE * (alpha * model_add_base)
end

(GENERAL) GRADIENT BOOSTING

[Stochastic Gradient Boosting] Jerome H. Friedman, 1999
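The pseudocode on the preceding slides can be turned into a small runnable sketch. This is an illustration under stated assumptions, not the presenter's implementation: the base learner is a depth-1 regression tree (a stump) on a single input, the loss is squared error (so the pseudo-residual is the plain residual and the line-search step is taken as 1), there are no event weights, and the data are a synthetic sine curve. The names `fit_stump`, `boost`, and `mse` are hypothetical.

```python
import math
import random


def fit_stump(xs, targets):
    """Fit a depth-1 regression tree; return a function x -> prediction."""
    pairs = sorted(zip(xs, targets))
    mean_all = sum(targets) / len(targets)
    best_sse = sum((t - mean_all) ** 2 for t in targets)
    best = (None, mean_all, mean_all)  # (threshold, left mean, right mean)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no split between equal x values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - ml) ** 2 for t in left)
               + sum((t - mr) ** 2 for t in right))
        if sse < best_sse:
            best_sse = sse
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    thr, ml, mr = best
    if thr is None:
        return lambda x: mean_all
    return lambda x: ml if x < thr else mr


def boost(xs, ys, num_iter=200, learning_rate=0.1, frac_train=0.7, seed=0):
    """Stochastic gradient boosting of stumps for squared-error loss."""
    rng = random.Random(seed)
    base = fit_stump(xs, ys)  # base model
    trees = []

    def predict(x):
        return base(x) + learning_rate * sum(tree(x) for tree in trees)

    n_sample = max(2, int(frac_train * len(xs)))
    for _ in range(num_iter):
        idx = rng.sample(range(len(xs)), n_sample)     # get sampled index
        sub_x = [xs[i] for i in idx]                   # sampled records as input
        resid = [ys[i] - predict(xs[i]) for i in idx]  # get the residuals
        trees.append(fit_stump(sub_x, resid))          # fit a tree for residuals
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.5 * i for i in range(20)]
ys = [math.sin(x) for x in xs]

mse_stump = mse(fit_stump(xs, ys), xs, ys)   # base model alone
mse_boost = mse(boost(xs, ys), xs, ys)       # after boosting
```

Each added stump is scaled by the learning rate before it joins the model, so many small corrections accumulate instead of one large one. Comparing `mse_stump` with `mse_boost` shows how far the additive model improves on the base model on the training points; it says nothing about generalisation.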

Page 73: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

[Figure: project framework diagram, repeated as a section divider. Labels shown: Background, Data, Model; Understand, Explore, Enhance, Train, Select, Optimize, Validate; read, discuss, visualize, reduce, generate, find, apply, cross validate, fine-tune, innovate.]

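Page 74: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

library(gbm)   # provides gbm.fit

gbm_model = gbm.fit(
    x = train[, x_vars, with = FALSE],   # predictor columns (data.table syntax)
    y = train$Label,                     # response
    distribution = char_distr,           # name of the loss distribution
    w = w,                               # per-event weights
    n.trees = n_trees,                   # number of boosting iterations
    interaction.depth = num_inter,       # depth of each tree
    n.minobsinnode = min_obs_node,       # minimum observations per terminal node
    shrinkage = shrinkage_rate,          # learning rate
    bag.fraction = frac_bag)             # fraction of records sampled per iteration

APPLYING GBM IN R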

Page 75: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

VARIABLE IMPORTANCE

[Figure: variable importance chart; axis: Relative Importance.]

Page 76: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

APPLY MODEL ON TEST DATA

EventId    Score   RankOrder   Class
1          0.98    501         s
2          0.42    259,579     b
3          0.46    264,125     b
...        ...     ...         ...
449,998    0.86    31,154      s
449,999    0.12    489,251     b
550,000    0.79    110,154     b

Page 77: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

[Figure: project framework diagram, repeated as a section divider. Labels shown: Background, Data, Model; Understand, Explore, Enhance, Train, Select, Optimize, Validate; read, discuss, visualize, reduce, generate, find, apply, cross validate, fine-tune, innovate.]

Page 78: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

GRADIENT BOOSTING PARAMETERS

• Number of iterations
• Minimum observations per node
• Fraction of bagging (0.5 ~ 0.8)
• Learning rate (< 0.1)
• Depth of tree (4 ~ 8)
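One way to explore these parameters is to lay out a grid of candidate settings and evaluate each row on held-out data. The sketch below uses the `gbm.fit` argument names from the earlier slide; the specific candidate values, including those for `n.trees` and `n.minobsinnode`, are assumptions chosen only to sit inside the ranges quoted above.

```r
# Candidate values inside the ranges quoted on the slide; the specific
# values (and the n.trees / n.minobsinnode choices) are assumptions.
param_grid <- expand.grid(
  n.trees           = c(500, 1000),        # number of iterations
  n.minobsinnode    = c(50, 100),          # minimum observations per node
  bag.fraction      = c(0.5, 0.65, 0.8),   # slide: 0.5 ~ 0.8
  shrinkage         = c(0.01, 0.05),       # slide: < 0.1
  interaction.depth = c(4, 6, 8)           # slide: 4 ~ 8
)

# One row per combination to train and score on held-out data.
nrow(param_grid)
head(param_grid)
```

Each row can then be passed to the training call in turn, keeping the combination that scores best on the validation split.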

Page 79: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

[Roadmap diagram (repeated): Background, Data, Model; Understand, Explore, Enhance, Train, Select, Optimize, Validate]

Page 80: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

CROSS VALIDATION

• Split training data
  – 70% for training
  – 30% for cross validation
• Train model (70%)
• Measure performance (30%)
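The 70/30 split above can be sketched in base R as follows. The `train_all` table is a made-up stand-in for the real training data, and the seed is arbitrary. Note that this is a single hold-out split; k-fold cross validation would repeat the same idea over several partitions.

```r
set.seed(42)  # fixed seed so the split is reproducible

# Hypothetical stand-in for the training table (not challenge data).
train_all <- data.frame(
  EventId = 1:1000,
  feature = rnorm(1000),
  Label   = sample(c("s", "b"), 1000, replace = TRUE)
)

# Draw 70% of the row indices at random for training.
n_rows    <- nrow(train_all)
train_idx <- sample(n_rows, size = round(0.7 * n_rows))

train_part <- train_all[train_idx, ]    # 70%: fit the model here
valid_part <- train_all[-train_idx, ]   # 30%: measure performance here

c(train = nrow(train_part), validation = nrow(valid_part))
```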

Page 81: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PERFORMANCE BASED ON AMS

Trade-off between:
• Ratio of signal to background events
• Number of records in the selection region

EventId    Score   RankOrder   Class   Truth
1          0.98    501         S       S
2          0.42    259,579     B
3          0.46    264,125     B
...        ...     ...         ...
549,998    0.86    31,154      S       B
549,999    0.12    489,251     B
550,000    0.79    110,154     B

Selection region: the records classified as S.
s = sum(S), b = sum(B), where S and B are the true signal and true background events inside the selection region.
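As a rough sketch of how s and b feed into the metric: the function below follows the AMS formula published for the challenge, AMS = sqrt(2((s + b + b_reg) ln(1 + s / (b + b_reg)) - s)) with b_reg = 10, where s and b are sums of event weights. The `pred`, `truth`, and `weight` vectors are invented for illustration.

```r
# Approximate median significance as defined in the challenge documentation;
# b_reg = 10 is the regularisation term used in the challenge.
ams <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}

# Hypothetical validation records: predicted class, truth, and event weight.
pred   <- c("s", "b", "b", "s", "b", "s")
truth  <- c("s", "b", "s", "b", "b", "s")
weight <- c(2.0, 5.0, 1.5, 4.0, 6.0, 2.5)

# Selection region = records predicted as signal.
selected <- pred == "s"
s <- sum(weight[selected & truth == "s"])   # weighted true signal selected
b <- sum(weight[selected & truth == "b"])   # weighted background selected

ams(s, b)
```

Only records inside the selection region contribute: the true signal event that was predicted "b" adds nothing to s.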

Page 82: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

PERFORMANCE BASED ON AMS

[Figure: AMS plotted against percentile, and AMS plotted against percentage of signal]
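A curve of AMS against percentile can be traced by ranking the validation records by score and growing the selection region one percentile at a time. The sketch below is one possible way to do it, on synthetic data; the score distributions, unit weights, and sample size are all assumptions.

```r
ams <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}

set.seed(1)
# Synthetic validation set (an assumption for illustration): signal events
# tend to get higher scores than background events.
n      <- 2000
truth  <- sample(c("s", "b"), n, replace = TRUE, prob = c(0.3, 0.7))
score  <- ifelse(truth == "s", rbeta(n, 4, 2), rbeta(n, 2, 4))
weight <- rep(1, n)

# Sort by descending score, then accumulate s and b as the selection grows.
ord   <- order(score, decreasing = TRUE)
cum_s <- cumsum(weight[ord] * (truth[ord] == "s"))
cum_b <- cumsum(weight[ord] * (truth[ord] == "b"))

# AMS when the top 1%, 2%, ..., 100% of records form the selection region.
percentile <- 1:100
n_selected <- round(percentile * n / 100)
ams_curve  <- ams(cum_s[n_selected], cum_b[n_selected])

best <- which.max(ams_curve)
c(best_percentile = percentile[best], best_ams = ams_curve[best])
```

Running the same scan on the training and validation parts gives the two curves that the comparison slides place side by side.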

Page 83: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

COMPARE TWO MODEL RESULTS

[Figure: AMS curves for training and cross validation, plotted against percentile and against percentage of signal]

Page 84: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

COMPARE TWO MODEL RESULTS

[Figure: AMS curves for training and cross validation, plotted against percentile and against percentage of signal]

Page 85: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

AMS BY NUM. ITERATION

[Animation: AMS versus percentile as the number of iterations changes]

Page 86: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

[Roadmap diagram (repeated): Background, Data, Model; Understand, Explore, Enhance, Train, Select, Optimize, Validate]

Page 87: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

HEAT MAP OF AMS ON B-S PLANE

[Figure: heat map of AMS with axes s and b; annotation: >> 4]
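A heat map of this kind can be approximated by evaluating the AMS formula on a grid of (s, b) values. The sketch below uses the challenge's AMS definition with b_reg = 10; the axis ranges and step sizes are assumptions, since the slide does not give its limits.

```r
ams <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}

# Assumed axis ranges; the slide does not give the actual limits.
s_values <- seq(0, 1000, by = 50)
b_values <- seq(0, 10000, by = 500)

# ams_grid[i, j] is the AMS at s_values[i] and b_values[j].
ams_grid <- outer(s_values, b_values, ams)

# Draw the heat map with b on the horizontal axis and s on the vertical axis.
image(b_values, s_values, t(ams_grid),
      xlab = "b (background in selection region)",
      ylab = "s (signal in selection region)",
      main = "AMS on the b-s plane")
contour(b_values, s_values, t(ams_grid), add = TRUE)
```

AMS rises with s and falls with b, so the high-value region sits toward large s and small b.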

Page 88: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

OPTIMIZATION BASED ON OBJECTIVE FUNCTION

[Figure: AMS versus percentile, with points A, B, and C marked]

Page 89: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

HEAT MAP OF AMS ON B-S PLANE

[Figure: heat map of AMS with axes s and b, with points A, B, and C marked]

Page 90: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

HEAT MAP OF AMS ON B-S PLANE

[Figure: heat map of AMS with axes s and b, with points A, B, and C marked]

Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function.

Page 91: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

AMS CURVE ON B-S PLANE

[Figure: AMS curve on the b-s plane, with points A, B, and C marked]

Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function.

• Partial derivative of AMS with respect to s
• Partial derivative of AMS with respect to b
• Ratio of the derivatives ==> relative weight
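For the challenge's AMS formula, with x = s / (b + b_reg), the partial derivatives work out to ln(1 + x) / AMS with respect to s and (ln(1 + x) - x) / AMS with respect to b. The sketch below evaluates them at one point and checks them against finite differences. The operating point (s0, b0) is hypothetical, and the direction of the ratio (background against signal) is an assumption, since the slide does not say how the ratio was applied.

```r
ams <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}

# Analytic partial derivatives of the AMS formula above.
d_ams_ds <- function(s, b, b_reg = 10) {
  log(1 + s / (b + b_reg)) / ams(s, b, b_reg)
}
d_ams_db <- function(s, b, b_reg = 10) {
  x <- s / (b + b_reg)
  (log(1 + x) - x) / ams(s, b, b_reg)   # negative: more background lowers AMS
}

# Hypothetical operating point on the b-s plane.
s0 <- 300
b0 <- 5000

# Central finite differences as an independent check of the formulas.
h <- 1e-3
num_ds <- (ams(s0 + h, b0) - ams(s0 - h, b0)) / (2 * h)
num_db <- (ams(s0, b0 + h) - ams(s0, b0 - h)) / (2 * h)

# Assumed reading of "relative weight": background against signal.
relative_weight <- abs(d_ams_db(s0, b0)) / d_ams_ds(s0, b0)

c(d_ds = d_ams_ds(s0, b0), d_db = d_ams_db(s0, b0),
  numeric_ds = num_ds, numeric_db = num_db,
  relative_weight = relative_weight)
```

Because the derivatives depend on where the model currently sits on the b-s plane, the relative weight changes as the selection region moves.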

Page 92: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

IMPROVEMENT DUE TO WEIGHTING

[Figure: AMS and AMS* versus Num_Iterations]

Page 93: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

IMPROVEMENT DUE TO WEIGHTING (CONT'D)

[Figure: AMS and AMS* versus Num_Iterations]

Page 94: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

AUGMENTED GRADIENT BOOSTING

• Apply GBM
• Weight Adjustment

Page 95: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

AUGMENTED GRADIENT BOOSTING

• Apply GBM
• Weight Adjustment
• Remove very high and very low score records from train and test
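The elimination step might look like the sketch below. The `records` table and both cutoffs are assumptions, since the slide does not say what counts as "very high" or "very low"; the same filter would be applied to the train and test tables alike.

```r
# Hypothetical scored records; in practice these come from the fitted model.
records <- data.frame(
  EventId = 1:8,
  Score   = c(0.99, 0.03, 0.55, 0.47, 0.91, 0.08, 0.62, 0.35)
)

# Assumed cutoffs: the slide does not say what counts as "very high/low".
low_cut  <- 0.10
high_cut <- 0.90

# Keep only the middle band of scores for the next round of modelling.
keep      <- records$Score > low_cut & records$Score < high_cut
remaining <- records[keep, ]
removed   <- records[!keep, ]

remaining$EventId
```

One reading of the slide is that the removed records are already classified with confidence, so later rounds can concentrate on the uncertain middle band.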

Page 96: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

IMPROVEMENT DUE TO ELIMINATION

[Figure: AMS and AMS* versus Num_Iterations]

Page 97: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

IMPROVEMENT DUE TO ELIMINATION (CONT'D)

[Figure: AMS and AMS* versus Num_Iterations]

Page 98: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

AUGMENTED GRADIENT BOOSTING

• Apply ML Model
• Weight Adjustment
• Remove very high and very low score records from train and test

Page 99: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

[Roadmap diagram (repeated): Background, Data, Model; Understand, Explore, Enhance, Train, Select, Optimize, Validate]

Page 100: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

OTHER TOPICS

• Version control (Git, SourceTree)
  – Effectively implement many different ideas
• File organization
  – Efficiently pull out the file needed
• Effective code (R, Python)
  – It matters a great deal when dealing with big data

Page 101: Big Data Competition: maximizing your potential exampled with the 2014 Higgs Boson Machine Learning Challenge

Thank you for your participation!

Any questions?

goDCI.com