
Research Article
High Performance Numerical Computing for High Energy Physics: A New Challenge for Big Data Science

Florin Pop

Computer Science Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Splaiul Independentei 313, Bucharest 060042, Romania

Correspondence should be addressed to Florin Pop; florin.pop@cs.pub.ro

Received 20 September 2013; Revised 23 December 2013; Accepted 26 December 2013

Academic Editor: Carlo Cattani

Copyright © 2014 Florin Pop. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The publication of this article was funded by SCOAP3.

Modern physics is based on both theoretical analysis and experimental validation. Complex scenarios like subatomic dimensions, high energy, and lower absolute temperature are frontiers for many theoretical models. Simulation with stable numerical methods represents an excellent instrument for high accuracy analysis, experimental validation, and visualization. High performance computing support offers the possibility to run simulations at large scale, in parallel, but the volume of data generated by these experiments creates a new challenge for Big Data Science. This paper presents existing computational methods for high energy physics (HEP) analyzed from two perspectives: numerical methods and high performance computing. The computational methods presented are Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. All of these methods produce data-intensive applications, which introduce new challenges and requirements for ICT systems architecture, programming paradigms, and storage capabilities.

1. Introduction

High Energy Physics (HEP) experiments are probably the main consumers of High Performance Computing (HPC) in the area of e-Science, considering numerical methods in real experiments and assisted analysis using complex simulation. Starting with the discovery of quarks in the last century up to the Higgs boson in 2012 [1], all HEP experiments were modeled using numerical algorithms: numerical integration, interpolation, random number generation, eigenvalue computation, and so forth. Data collection from HEP experiments generates a huge volume, with high velocity, variety, and variability, and passes the common upper bounds to be considered Big Data. The numerical experiments using HPC for HEP represent a new challenge for Big Data Science.

Theoretical research in HEP is related to matter (fundamental particles and the Standard Model) and basic knowledge of Universe formation. Beyond this, practical research in HEP has led to the development of new analysis tools (synchrotron radiation, medical imaging or hybrid models [2], wavelets-computational aspects [3]), new processes (cancer therapy [4], food preservation, or nuclear waste treatment), or even the birth of a new industry (the Internet) [5].

This paper analyzes two aspects: the computational methods used in HEP (Monte Carlo methods and simulations, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation, and Random Matrix Theory) and the challenges and requirements for ICT systems to deal with processing of Big Data generated by HEP experiments and simulations.

The motivation for using numerical methods in HEP simulations is based on special problems which can be formulated using integral or differential-integral equations (or systems of such equations), like the quantum chromodynamics evolution of parton distributions inside a proton, which can be described by the Gribov-Lipatov-Altarelli-Parisi (GLAP) equations [6], estimation of the cross section for a typical HEP interaction (a numerical integration problem), and data representation using histograms (a numerical interpolation problem). Numerical methods used for solving differential

Hindawi Publishing Corporation, Advances in High Energy Physics, Volume 2014, Article ID 507690, 13 pages, http://dx.doi.org/10.1155/2014/507690


Figure 1: General approach of event generation, detection, and reconstruction. [Diagram: physics model → simulation process → event generator → vectors of events; in parallel, nature (physical experiments) → real detectors → events → reconstruction → vector of events; simulated detectors → events; comparison and correction closes the loop.]

equations or integrals are based on classical quadratures and Monte Carlo (MC) techniques. These allow generating events in terms of particle flavors and four-momenta, which is particularly useful for experimental applications. For example, MC techniques for solving the GLAP equations are based on simulated Markov chains (random walks), which have the advantage of filtering and smoothing the state vector for estimating parameters.

In practice, several MC event generators and simulation tools are used. For example, the HERWIG project (http://projects.hepforge.org/herwig) considers angular-ordered parton shower and cluster hadronization (the tool is implemented using Fortran); the PYTHIA project (http://www.thep.lu.se/~torbjorn/Pythia.html) is oriented on dipole-type parton shower and string hadronization (the tool is implemented in Fortran and C++); and SHERPA (http://projects.hepforge.org/sherpa) considers dipole-type parton shower and cluster hadronization (the tool is implemented in C++). An important tool for MC simulations is GATE (GEANT4 Application for Tomographic Emission), a generic simulation platform based on GEANT4. GATE provides new features for nuclear imaging applications and includes specific modules that have been developed to meet specific requirements encountered in SPECT (Single Photon Emission Tomography) and PET (Positron Emission Tomography).

The main contributions of this paper are as follows:
(i) introduction and analysis of the most important modeling methods used in High Energy Physics;
(ii) identification and description of the computational numerical methods for High Energy Physics;
(iii) presentation of the main challenges for Big Data processing.

The paper is structured as follows. Section 2 introduces the computational methods used in HEP and describes the performance evaluation of parallel numerical algorithms. Section 3 discusses the new challenge for Big Data Science generated by HEP and HPC. Section 4 presents the conclusions and general open issues.

2. Computational Methods Used in High Energy Physics

Computational methods are used in HEP in parallel with physical experiments to generate particle interactions that are modeled using vectors of events. This section presents the general approach of event generation; simulation methods based on Monte Carlo algorithms; Markovian Monte Carlo chains; methods that describe unfolding processes in particle physics; Random Matrix Theory as support for particle spectra; and kernel estimation, which produces continuous estimates of the parent distribution from the empirical probability density function. The section ends with a performance analysis of parallel numerical algorithms used in HEP.

2.1. General Approach of Event Generation. The most important aspect in simulation for HEP experiments is event generation. This process can be split into multiple steps according to physical models; for example, the structure of LHC (Large Hadron Collider) events: (1) hard process, (2) parton shower, (3) hadronization, (4) underlying event. According to the official LHC website (http://home.web.cern.ch/about/computing), "approximately 600 million times per second, particles collide within the LHC. Experiments at CERN generate colossal amounts of data. The Data Centre stores it and sends it around the world for analysis." The analysis must produce valuable data, and the simulation results must be correlated with physical experiments.

Figure 1 presents the general approach of event generation, detection, and reconstruction. The physical model is used to create a simulation process that produces different types of events, clustered in a vector of events (e.g., the four types of events in LHC experiments).

In parallel, the real experiments are performed. The detectors identify the most relevant events and, based on reconstruction techniques, the vector of events is created. The detectors can be real or simulated (software tools), and the reconstruction phase combines real events with events detected in simulation. At the end, the final result is compared with the simulation model (especially with the generated vectors of events). The model can be corrected for further experiments. The goal is to obtain high accuracy and precision of measured and processed data.

Software tools for event generation are based on random number generators. There are three types of random numbers: truly random numbers (from physical generators), pseudorandom numbers (from mathematical generators), and quasirandom numbers (special correlated sequences of numbers used only for integration). For example, numerical integration using quasirandom numbers usually gives faster convergence than the standard integration methods based on


(1) procedure Random_Generator_Poisson(μ)
(2)   number ← −1
(3)   accumulator ← 1.0
(4)   q ← exp(−μ)
(5)   while accumulator > q do
(6)     rnd_number ← RND()
(7)     accumulator ← accumulator ∗ rnd_number
(8)     number ← number + 1
(9)   end while
(10)  return number
(11) end procedure

Algorithm 1: Random number generation for Poisson distribution using many random generated numbers with normal distribution (RND).
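Algorithm 1 can be sketched in a few lines of Python; this is a minimal illustration, with RND() stood in by Python's uniform generator random.random() (an assumption on our part, since the text only names RND() abstractly):

```python
import math
import random

def random_generator_poisson(mu):
    """Algorithm 1: multiply uniform variates until the running
    product drops below exp(-mu); the count of factors consumed,
    minus one, is Poisson(mu)-distributed."""
    number = -1
    accumulator = 1.0
    q = math.exp(-mu)
    while accumulator > q:
        accumulator *= random.random()  # stand-in for RND()
        number += 1
    return number

random.seed(42)
samples = [random_generator_poisson(5.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 1))  # should be close to mu = 5.0
```

Note that each sample consumes on average μ + 1 uniform numbers, which is why the text contrasts this with Algorithm 2 below.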

quadratures. In event generation, pseudorandom numbers are used most often.

The most popular HEP applications use the Poisson distribution combined with a basic normal distribution. The Poisson distribution can be formulated as

P[X = k] = \frac{\mu^k}{k!} e^{-\mu}, \quad k = 0, 1, \ldots, (1)

with E(k) = V(k) = μ (V is the variance and E is the expectation value). Having a uniform random number generator called RND() (Random based on Normal Distribution), we can use the following two algorithms for event generation techniques.

The result of running Algorithms 1 and 2 to generate around 10^6 random numbers is presented in Figure 2. In general, the second algorithm has better results for the Poisson distribution. General recommendation for HEP experiments indicates the use of popular random number generators like TRNG (True Random Number Generators), RANMAR (Fast Uniform Random Number Generator used in CERN experiments), RANLUX (algorithm developed by Lüscher, used by Unix random number generators), and Mersenne Twister (the "industry standard"). Random number generators provided with compilers, operating systems, and programming language libraries can have serious problems, because they are based on the system clock and suffer from lack of uniformity of distribution for large amounts of generated numbers and correlation of successive values.

The art of event generation is to use appropriate combinations of various random number generation methods in order to construct an efficient event generation algorithm being a solution to a given problem in HEP.

2.2. Monte Carlo Simulation and Markovian Monte Carlo Chains in HEP. In general, a Monte Carlo (MC) method is any simulation technique that uses random numbers to solve a well-defined problem P. If F is a solution of the problem P (e.g., F \in R^n or F has a Boolean value), we define \hat{F}, an estimation of F, as \hat{F} = f(r_1, r_2, \ldots, r_n), where (r_i)_{1 \le i \le n} is a random variable that can take more than one value and for which any value that will be taken cannot be predicted in

Figure 2: Histograms of the generated numbers: (a) Random generator Poisson (Alg. 1); (b) Random generator Poisson RND (Alg. 2). Axes: generated numbers (x) vs. frequency (y).

advance. If ρ(r) is the probability density function, ρ(r) dr = P[r < r' < r + dr], the cumulative distribution function is

C(r) = \int_{-\infty}^{r} \rho(x)\,dx \Rightarrow \rho(r) = \frac{dC(r)}{dr}. (2)

C(r) is a monotonically nondecreasing function with all values in [0, 1]. The expectation value is

E(f) = \int f(r)\,dC(r) = \int f(r)\,\rho(r)\,dr, (3)

and the variance is

V(f) = E[f - E(f)]^2 = E(f^2) - E^2(f). (4)

2.2.1. Monte Carlo Event Generation and Simulation. To define a MC estimator, the "Law of Large Numbers" (LLN) is used. LLN can be described as follows: let one choose n


(1) procedure Random_Generator_Poisson_RND(μ, r)
(2)   number ← 0
(3)   q ← exp(−μ)
(4)   accumulator ← q
(5)   p ← q
(6)   while r > accumulator do
(7)     number ← number + 1
(8)     p ← p ∗ μ / number
(9)     accumulator ← accumulator + p
(10)  end while
(11)  return number
(12) end procedure

Algorithm 2: Random number generation for Poisson distribution using one random generated number with normal distribution.
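A minimal Python sketch of Algorithm 2 (the inversion method: a single uniform variate r is compared against the running cumulative distribution); as in the previous sketch, the uniform source is our assumption:

```python
import math
import random

def random_generator_poisson_rnd(mu, r):
    """Algorithm 2: invert the Poisson cumulative distribution
    using a single uniform variate r in [0, 1)."""
    number = 0
    q = math.exp(-mu)
    accumulator = q   # running CDF, starts at P[X = 0]
    p = q             # current term P[X = number]
    while r > accumulator:
        number += 1
        p *= mu / number      # P[X = k] from P[X = k-1]
        accumulator += p
    return number

random.seed(7)
samples = [random_generator_poisson_rnd(5.0, random.random())
           for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 1))  # close to mu = 5.0
```

Only one uniform number is consumed per sample, which is the practical advantage this algorithm has over Algorithm 1.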

numbers r_i randomly, with the probability density function uniform on a specific interval (a, b), each r_i being used to evaluate f(r_i). For large n (consistent estimator),

\frac{1}{n} \sum_{i=1}^{n} f(r_i) \to E(f) = \frac{1}{b-a} \int_{a}^{b} f(r)\,dr. (5)

The properties of a MC estimator are: being normally distributed (with Gaussian density); the standard deviation is \sigma = \sqrt{V(f)/n}; MC is unbiased for all n (the expectation value is the real value of the integral); the estimator is consistent if V(f) < \infty (the estimator converges to the true value of the integral for every large n); a sampling phase can be applied to compute the estimator if we do not know anything about the function f (it is just suitable for integration). The sampling phase can be expressed in a stratified way as

\int_{a}^{b} f(r)\,dr = \int_{a}^{r_1} f(r)\,dr + \int_{r_1}^{r_2} f(r)\,dr + \cdots + \int_{r_n}^{b} f(r)\,dr. (6)
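The basic estimator of eq. (5) can be sketched in Python; the function, interval, and sample size below are our own illustrative choices:

```python
import random

def mc_integrate(f, a, b, n):
    """Plain MC estimator of eq. (5): the sample mean of f over
    uniform points in (a, b), scaled by the interval length,
    converges to the integral of f over (a, b)."""
    total = sum(f(random.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

random.seed(0)
estimate = mc_integrate(lambda r: r * r, 0.0, 1.0, 200_000)
print(round(estimate, 2))  # true value is 1/3
```

Per the σ = \sqrt{V(f)/n} property above, the error here shrinks like n^{-1/2} regardless of the integrand's dimensionality, which is the point made later in Table 1.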

MC estimations and MC event generators are necessary tools in most HEP experiments, being used at all their steps: experiment preparation, simulation running, and data analysis.

An example of MC estimation is the Lorentz invariant phase space (LIPS) that describes the cross section for a typical HEP process with n particles in the final state. Consider

\sigma_n \sim \int |M|^2 \, dR_n, (7)

where M is the matrix describing the interaction between particles and dR_n is the element of LIPS. We have the following estimation:

R_n(P; p_1, p_2, \ldots, p_n) = \int \delta^{(4)}\!\left(P - \sum_{k=1}^{n} p_k\right) \prod_{k=1}^{n} \left(\delta(p_k^2 - m_k^2)\,\Theta(p_k^0)\,d^4 p_k\right), (8)

Figure 3: Example of particle interaction e^+(p_1) + e^-(p_2) → μ^+(q_1) μ^-(q_2) (z axis, polar angle θ, azimuthal angle Φ).

where P is the total four-momentum of the n-particle system, p_k and m_k are the four-momenta and mass of the final state particles, \delta^{(4)}(P - \sum_{k=1}^{n} p_k) is the total energy-momentum conservation, and \delta(p_k^2 - m_k^2) is the on-mass-shell condition for the final state system. Based on the integration formula

\int \delta(p_k^2 - m_k^2)\,\Theta(p_k^0)\,d^4 p_k = \frac{d^3 p_k}{2 p_k^0}, (9)

we obtain the iterative form for the cross section

R_n(P; p_1, p_2, \ldots, p_n) = \int R_{n-1}(P - p_n; p_1, p_2, \ldots, p_{n-1}) \frac{d^3 p_n}{2 p_n^0}, (10)

which can be numerically integrated by using the recurrence relation. As a result, we can construct a general MC algorithm for particle collision processes.

Example 1. Let us consider the interaction e^+ e^- → μ^+ μ^-, where the Higgs boson contribution is numerically negligible. Figure 3 describes this interaction (Φ is the azimuthal angle, θ the polar angle, and p_1, p_2, q_1, q_2 are the four-momenta for the particles).

The cross section is

d\sigma = \frac{\alpha^2}{4s}\left[W_1(s)(1 + \cos^2\theta) + W_2(s)\cos\theta\right] d\Omega, (11)


where dΩ = d cos θ dΦ, α = e²/4π (the fine structure constant), s = (p_1^0 + p_2^0)^2 is the center-of-mass energy squared, and W_1(s) and W_2(s) are constant functions. For pure processes we have W_1(s) = 1 and W_2(s) = 0, and the total cross section becomes

\sigma = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta \, \frac{d^2\sigma}{d\Phi\,d\cos\theta}. (12)

We introduce the following notation:

\rho(\cos\theta, \Phi) = \frac{d^2\sigma}{d\Phi\,d\cos\theta}, (13)

and let us consider \tilde{\rho}(\cos\theta, \Phi) an approximation of \rho(\cos\theta, \Phi). Then \tilde{\sigma} = \iint d\Phi\,d\cos\theta\,\tilde{\rho}. Now we can compute

\sigma = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\,\rho(\cos\theta, \Phi)
  = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\,w(\cos\theta, \Phi)\,\tilde{\rho}(\cos\theta, \Phi)
  \approx \langle w \rangle_{\tilde{\rho}} \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\,\tilde{\rho}(\cos\theta, \Phi) = \langle w \rangle_{\tilde{\rho}}\,\tilde{\sigma}, (14)

where w(\cos\theta, \Phi) = \rho(\cos\theta, \Phi)/\tilde{\rho}(\cos\theta, \Phi) and \langle w \rangle_{\tilde{\rho}} is the estimation of w based on \tilde{\rho}. Here the MC estimator is

\langle w \rangle_{MC} = \frac{1}{n} \sum_{i=1}^{n} w_i, (15)

and the standard deviation is

s_{MC} = \left(\frac{1}{n(n-1)} \sum_{i=1}^{n} (w_i - \langle w \rangle_{MC})^2\right)^{1/2}. (16)

The final numerical result based on the MC estimator is

\sigma_{MC} = \langle w \rangle_{MC} \pm s_{MC}. (17)

As we can see, the principle of a Monte Carlo estimator in physics is to simulate the cross section in interaction and radiation transport, knowing the probability distributions (or an approximation) governing each interaction of the elementary particles.

Based on this result, the Monte Carlo algorithm used to generate events is as follows. It takes as input \tilde{\rho}(\cos\theta, \Phi) and, in a main loop, considers the following steps: (1) generate a (\cos\theta, \Phi) pair from \tilde{\rho}; (2) compute the four-momenta p_1, p_2, q_1, q_2; (3) compute w = \rho/\tilde{\rho}. The loop can be stopped in the case of unweighted events, and we will stay in the loop for weighted events. As output, the algorithm returns four-momenta for the particles (for unweighted events), or four-momenta and an array of weights (for weighted events). The main issue is how to initialize the input of the algorithm. Based on the d\sigma formula (for W_1(s) = 1 and W_2(s) = 0), we can consider as input \tilde{\rho}(\cos\theta, \Phi) = (\alpha^2/4s)(1 + \cos^2\theta). Then \tilde{\sigma} = 4\pi\alpha^2/3s.
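The weighted branch of this scheme can be sketched in Python for the pure case W_1(s) = 1, W_2(s) = 0. For simplicity of sampling we take a flat envelope \tilde{\rho} = (\alpha^2/4s) \cdot 2 (our own choice, not the text's 1 + cos²θ input), so events are drawn uniformly in (cos θ, Φ) and carry weights w = ρ/\tilde{ρ} = (1 + cos²θ)/2; ALPHA and S are assumed illustrative values:

```python
import math
import random

ALPHA = 1.0 / 137.035999  # fine structure constant (assumed value)
S = 1.0                   # center-of-mass energy squared, arbitrary units

def generate_events(n):
    """Weighted event generation for e+e- -> mu+mu-: sample
    (cos(theta), Phi) from a flat envelope rho_tilde and attach
    the weight w = rho / rho_tilde to each event."""
    events = []
    for _ in range(n):
        cos_theta = random.uniform(-1.0, 1.0)
        phi = random.uniform(0.0, 2.0 * math.pi)
        w = (1.0 + cos_theta ** 2) / 2.0  # rho / rho_tilde
        events.append((cos_theta, phi, w))
    return events

random.seed(1)
events = generate_events(200_000)
weights = [w for _, _, w in events]
n = len(weights)
w_mc = sum(weights) / n                                   # eq. (15)
s_mc = math.sqrt(sum((w - w_mc) ** 2 for w in weights)
                 / (n * (n - 1)))                         # eq. (16)
sigma_tilde = 2.0 * math.pi * ALPHA ** 2 / S              # integral of the flat envelope
sigma_mc = w_mc * sigma_tilde                             # eq. (14): sigma ~ <w> * sigma_tilde
sigma_exact = 4.0 * math.pi * ALPHA ** 2 / (3.0 * S)
print(round(sigma_mc / sigma_exact, 2))  # close to 1.0
```

The estimate reproduces the analytic total cross section 4πα²/3s to within a few standard deviations s_mc.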

In HEP, theoretical predictions used for modeling particle collision processes (as shown in the presented example) should be provided in terms of Monte Carlo event generators, which directly simulate these processes and can provide unweighted (weight = 1) events. A good Monte Carlo algorithm should be used not only for numerical integration [7] (i.e., providing weighted events) but also for efficient generation of unweighted events, which is a very important issue for HEP.

2.2.2. Markovian Monte Carlo Chains. A classical Monte Carlo method estimates a function F with \hat{F} by using a random variable. The main problem with this approach is that we cannot predict any value in advance for a random variable. In HEP simulation experiments, the systems are described in states [8]. Let us consider a system with a finite set of possible states S_1, S_2, \ldots, and S_t the state at the moment t. The conditional probability is defined as

P(S_t = S_j \mid S_{t_1} = S_{i_1}, S_{t_2} = S_{i_2}, \ldots, S_{t_n} = S_{i_n}), (18)

where the mappings (t_1, i_1), \ldots, (t_n, i_n) can be interpreted as the description of system evolution in time, by specifying a specific state for each moment of time.

The system is a Markov chain if the distribution of S_t depends only on the immediate predecessor S_{t-1} and is independent of all previous states, as follows:

P(S_t = S_j \mid S_{t-1} = S_{i_{t-1}}, \ldots, S_{t_2} = S_{i_2}, S_{t_1} = S_{i_1}) = P(S_t = S_j \mid S_{t-1} = S_{i_{t-1}}). (19)

To generate the time steps (t_1, t_2, \ldots, t_n) we use the probability of a single forward Markovian step given by p(t \mid t_n), with the property \int_{t_n}^{\infty} p(t \mid t_n)\,dt = 1, and we define p(t) = p(t \mid 0). The 1-dimensional Monte Carlo Markovian Algorithm used to generate the time steps is presented in Algorithm 3.

The main result of Algorithm 3 is that P(t_{max}) follows a Poisson distribution:

P_N = \int_0^{t_{max}} p(t_1 \mid t_0)\,dt_1 \times \int_{t_1}^{t_{max}} p(t_2 \mid t_1)\,dt_2 \times \cdots \times \int_{t_{N-1}}^{t_{max}} p(t_N \mid t_{N-1})\,dt_N \times \int_{t_{max}}^{\infty} p(t_{N+1} \mid t_N)\,dt_{N+1} = \frac{1}{N!} (t_{max})^N e^{-t_{max}}. (20)

We can consider the 1-dimensional Monte Carlo Markovian Algorithm as a method used to iteratively generate the system's states (codified as a Markov chain) in simulation experiments. According to the Ergodic Theorem for Markov chains, the chain defined has a unique stationary probability distribution [9, 10].

Figures 4 and 5 present the running of Algorithm 3. According to different values of the parameter s used to generate the next step, the results are very different for 1000 iterations. Figure 4, for s = 1, shows a profile of the type of noise. For s = 10, 100, 1000 the profile looks like some of the information is filtered and lost. The best results are obtained for s = 0.01


(1) Generate t_1 according with p(t_1) = p(t_1 \mid t_0 = 0)
(2) if t_1 < t_{max} then ⊳ Generate the initial state
(3)   P_{N \ge 1} = \int_0^{t_{max}} p(t_1 \mid t_0)\,dt_1 ⊳ Compute the initial probability
(4)   Retain t_1
(5) end if
(6) if t_1 > t_{max} then ⊳ Discard all generated and computed data
(7)   N = 0, P_0 = \int_{t_{max}}^{\infty} p(t_1 \mid t_0)\,dt_1 = e^{-t_{max}}
(8)   Delete t_1
(9)   EXIT ⊳ The algorithm ends here
(10) end if
(11) i = 2
(12) while (1) do ⊳ Infinite loop until a successful EXIT
(13)   Generate t_i according with p(t_i \mid t_{i-1})
(14)   if t_i < t_{max} then ⊳ Generate a new state and new probability
(15)     P_{N \ge i} = \int_{t_i}^{t_{max}} p(t_i \mid t_{i-1})\,dt_i
(16)     Retain t_i
(17)   end if
(18)   if t_i > t_{max} then ⊳ Discard all generated and computed data
(19)     N = i - 1, P_i = \int_{t_{max}}^{\infty} p(t_i \mid t_{i-1})\,dt_i
(20)     Retain (t_1, t_2, \ldots, t_{i-1}), Delete t_i
(21)     EXIT ⊳ The algorithm ends here
(22)   end if
(23)   i = i + 1
(24) end while

Algorithm 3: 1-Dimensional Monte Carlo Markovian Algorithm.
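The sampling core of Algorithm 3 can be sketched in Python. The text leaves the step density unspecified, so we assume p(t | t_n) = exp(−(t − t_n)) (a unit-rate exponential step), which is the choice under which the Poisson result of eq. (20) holds exactly; the empirical count of retained steps then has mean t_max:

```python
import random

def markov_steps(t_max):
    """One run of the forward-stepping core of Algorithm 3 with
    p(t | t_n) = exp(-(t - t_n)): accumulate exponential steps
    until the time exceeds t_max; return the retained t_1..t_N
    (the overshooting step is discarded, as in the algorithm)."""
    times = []
    t = 0.0
    while True:
        t += random.expovariate(1.0)  # sample t_i - t_{i-1}
        if t > t_max:
            return times
        times.append(t)

random.seed(3)
t_max = 4.0
counts = [len(markov_steps(t_max)) for _ in range(100_000)]
mean_n = sum(counts) / len(counts)
print(round(mean_n, 1))  # E[N] = t_max for a Poisson(t_max) count
```

This matches eq. (20): the number N of retained times is Poisson-distributed with parameter t_max.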

and s = 0.1, and the generated values can be easily accepted for MC simulation in HEP experiments.

Figure 5 shows the acceptance rate of values generated with the parameter s used in the algorithm, and the parameter values are correlated with Figure 4. Results in Figure 5 show that the acceptance rate decreases rapidly with increasing value of the parameter s. The conclusion is that values must be kept small to obtain meaningful data. A correlation with the normal distribution is evident, showing that a small value for the mean square deviation provides useful results.

2.2.3. Performance of Numerical Algorithms Used in MC Simulations. Numerical methods used to compute the MC estimator use numerical quadratures to approximate the value of the integral of a function f on a specific domain by a linear combination of function values and weights (w_i)_{1 \le i \le m}, as follows:

\int_a^b f(r)\,dr = \sum_{i=1}^{m} w_i f(r_i). (21)

We can consider a consistent MC estimator as a classical numerical quadrature with all w_i = 1. The efficiency of integration methods for 1 dimension and for d dimensions is presented in Table 1. We can conclude that quadrature methods are difficult to apply in many dimensions for variate integration domains (regions), and the integral is not easy to be estimated.

Table 1: Efficiency of integration methods for 1 dimension and for d dimensions.

Method                      | 1 dimension | d dimensions
Monte Carlo                 | n^{-1/2}    | n^{-1/2}
Trapezoidal rule            | n^{-2}      | n^{-2/d}
Simpson's rule              | n^{-4}      | n^{-4/d}
m-points Gauss rule (m < n) | n^{-2m}     | n^{-2m/d}

As a practical example, in a typical high-energy particle collision there can be many final-state particles (even hundreds). If we have n final state particles, we face a d = 3n − 4 dimensional phase space. As a numerical example, for n = 4 we have d = 8 dimensions, which is a very difficult approach for classical numerical quadratures.

The full decomposition integration volume, for one double number (10 Bytes) per volume unit, is n^d × 10 Bytes. For the example considered, with d = 8 and n = 10 divisions for the interval [0, 1], we have for one numerical integration

n^d \times 10\,\text{Bytes} = \frac{10^8 \times 10}{1024^3}\,\text{GBytes} \approx 0.93\,\text{GBytes}. (22)

Considering 10^6 events per second and one integration per event, the data produced in one hour will be ≈3197.4 PBytes.
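The storage arithmetic in eq. (22) and the hourly figure can be reproduced in a few lines (using the text's conventions: 10-byte volume units and 1024-based unit prefixes):

```python
# Reproduce the estimates around eq. (22): n = 10 divisions per axis,
# d = 8 dimensions, 10 bytes per volume unit.
n, d, bytes_per_unit = 10, 8, 10
per_integration = n ** d * bytes_per_unit        # bytes for one integration
gbytes = per_integration / 1024 ** 3
events_per_hour = 10 ** 6 * 3600                 # 10^6 events/s, one integration each
pbytes_per_hour = per_integration * events_per_hour / 1024 ** 5
print(round(gbytes, 2), round(pbytes_per_hour, 1))  # -> 0.93 3197.4
```

This confirms the ≈0.93 GBytes per integration and ≈3197.4 PBytes per hour quoted in the text.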


Figure 4: Example of 1-dimensional Monte Carlo Markovian algorithm: generated value vs. state (iteration), over 1000 iterations, for (a) s = 0.01, (b) s = 0.1, (c) s = 1, (d) s = 10, (e) s = 100, (f) s = 1000.

The previous assumption holds only for multidimensional arrays. Due to the factorization assumption $p(r_1, r_2, \ldots, r_n) = p(r_1) p(r_2) \cdots p(r_n)$, we obtain for one integration

$$n \times d \times 10\,\text{Bytes} = 800\,\text{Bytes}, \qquad (23)$$

which means ${\approx}2.62$ TBytes of data produced for one hour of simulations.
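These storage estimates can be checked with a few lines of arithmetic. The sketch below (an illustration, using the sizes from the text: 10 Bytes per value, $d = 8$, $n = 10$ divisions, $10^6$ events per second) reproduces the figures in equations (22) and (23):

```python
# Sketch: reproduce the storage estimates of equations (22) and (23).
# Sizes taken from the text: 10 bytes per value, d = 8 dimensions,
# n = 10 divisions of [0, 1], 10^6 events/s, one integration per event.

BYTES_PER_VALUE = 10
n, d = 10, 8
GB, TB, PB = 1024**3, 1024**4, 1024**5

# Eq. (22): full grid decomposition needs n^d volume units.
full_grid_bytes = n**d * BYTES_PER_VALUE
print(full_grid_bytes / GB)                      # ~0.93 GBytes per integration

# One hour at 10^6 integrations per second.
events_per_hour = 10**6 * 3600
print(full_grid_bytes * events_per_hour / PB)    # ~3197.4 PBytes per hour

# Eq. (23): factorized densities p(r1)...p(rn) need only n*d values.
factorized_bytes = n * d * BYTES_PER_VALUE       # 800 bytes
print(factorized_bytes * events_per_hour / TB)   # ~2.62 TBytes per hour
```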

2.3. Unfolding Processes in Particle Physics and Kernel Estimation in HEP. In particle physics analysis we have two types of distributions: the true distribution (considered in theoretical models) and the measured distribution (considered in experimental models, affected by the finite resolution and limited acceptance of existing detectors). A HEP interaction process starts with a true, known distribution and generates a measured distribution corresponding to an experiment of a well-confirmed theory. An inverse process starts with a measured distribution and tries to identify the true distribution. These unfolding processes are used to identify new theories based on experiments [11].

Figure 5: Analysis of the acceptance rate for the 1-dimensional Monte Carlo Markovian algorithm for different $s$ values.

2.3.1. Unfolding Processes in Particle Physics. The theory of unfolding processes in particle physics is as follows [12]. For a physics variable $t$ we have a true distribution $f(t)$, mapped in $x$, an $n$-vector of unknowns, and a measured distribution $g(s)$ (for a measured variable $s$), mapped in an $m$-vector of measured data. A response matrix $A \in R^{m \times n}$ encodes a kernel function $K(s, t)$ describing the physical measurement process [12-15]. The direct and inverse processes are described by the Fredholm integral equation [16] of the first kind for a specific domain $\Omega$:

$$\int_{\Omega} K(s, t)\, f(t)\, dt = g(s). \qquad (24)$$

In particle physics the kernel function $K(s, t)$ is usually known from a Monte Carlo sample obtained from simulation. A numerical solution is obtained using the linear system $Ax = y$. Vectors $x$ and $y$ are assumed to be 1-dimensional in theory, but they can be multidimensional in practice (considering multiple independent linear equations). In practice, the statistical properties of the measurements are also well known, and often they follow Poisson statistics [17]. To solve the linear systems we have different numerical methods.

The first method is based on the linear transformation $x = A^{+}y$. If $m = n$, then $A^{+} = A^{-1}$, and we can use direct Gaussian methods, iterative methods (Gauss-Seidel, Jacobi, or SOR), or orthogonal methods (based on the Householder transformation, Givens rotations, or the Gram-Schmidt algorithm). If $m > n$ (the most frequent scenario), we construct the matrix $A^{+} = (A^T A)^{-1} A^T$ (the Moore-Penrose pseudoinverse). In these cases the orthogonal methods offer very good and stable numerical solutions.
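As an illustration of the first method (a sketch, not code from the paper; the $4 \times 3$ response matrix and the noise-free measurement are made-up data), the $m > n$ case can be solved in NumPy both with the explicit Moore-Penrose pseudoinverse and with an orthogonal least-squares solver:

```python
import numpy as np

# Hypothetical 4x3 response matrix A and measured vector y (m > n case).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true                        # noise-free, so unfolding is exact

# Moore-Penrose pseudoinverse: x = (A^T A)^(-1) A^T y
x_pinv = np.linalg.inv(A.T @ A) @ A.T @ y

# Orthogonal method: least squares (QR/SVD based), numerically stabler.
x_lsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(x_pinv, x_true), np.allclose(x_lsq, x_true))
```

With noisy measurements the two differ in stability: forming $A^T A$ squares the condition number of the problem, which is why the orthogonal methods are preferred.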


The second method considers the singular value decomposition

$$A = U \Sigma V^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T, \qquad (25)$$

where $U \in R^{m \times n}$ and $V \in R^{n \times n}$ are matrices with orthonormal columns and the diagonal matrix is $\Sigma = \operatorname{diag}\{\sigma_1, \ldots, \sigma_n\} = U^T A V$. The solution is

$$x = A^{+} y = V \Sigma^{-1} (U^T y) = \sum_{i=1}^{n} \frac{1}{\sigma_i} (u_i^T y)\, v_i = \sum_{i=1}^{n} \frac{1}{\sigma_i} c_i v_i, \qquad (26)$$

where $c_i = u_i^T y$, $i = 1, \ldots, n$, are called Fourier coefficients.
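The SVD solution of equations (25)-(26) can be sketched the same way (again with a made-up response matrix and a noise-free measurement):

```python
import numpy as np

# Hypothetical response matrix (m = 5 measured bins, n = 3 true bins).
rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))
x_true = np.array([2.0, 1.0, -1.0])
y = A @ x_true

# Thin SVD: A = U diag(sigma) V^T with U (m x n), V (n x n), cf. eq. (25).
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Fourier coefficients c_i = u_i^T y; solution x = sum_i (c_i/sigma_i) v_i,
# cf. eq. (26).
c = U.T @ y
x = Vt.T @ (c / sigma)

print(np.allclose(x, x_true))
```

In realistic unfolding the small $\sigma_i$ amplify statistical noise in the Fourier coefficients $c_i$, so in practice the sum is truncated or regularized [15].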

2.3.2. Random Matrix Theory. The analysis of particle spectra (e.g., the neutrino spectrum) faces Random Matrix Theory (RMT), especially if we consider anarchic neutrino masses. RMT means the study of the statistical properties of the eigenvalues of very large matrices [18]. For an interaction matrix $A$ (with size $N$), where $A_{ij}$ is an independent distributed random variable and $A^H$ is the complex conjugate and transpose matrix, we define $M = A + A^H$, which describes a Gaussian Unitary Ensemble (GUE). The GUE properties are described by the probability distribution $P(M)\,dM$: (1) it is invariant under unitary transformations, $P(M)\,dM = P(M')\,dM'$, where $M' = U^H M U$ and $U$ is a unitary matrix ($U^H U = I$); (2) the elements of the matrix $M$ are statistically independent, $P(M) = \prod_{i \le j} P_{ij}(M_{ij})$; and (3) the matrix $M$ can be diagonalized as $M = U D U^H$, where $D = \operatorname{diag}\{\lambda_1, \ldots, \lambda_N\}$, $\lambda_i$ is an eigenvalue of $M$, and $\lambda_i \le \lambda_j$ if $i < j$. Properties (2) and (3) give

$$\text{Property (2):}\quad P(M)\,dM \sim dM\, \exp\left(-\frac{N}{2} \operatorname{Tr}(M^H M)\right),$$
$$\text{Property (3):}\quad P(M)\,dM \sim dU \prod_i d\lambda_i \prod_{i<j} (\lambda_i - \lambda_j)^2 \exp\left(-\frac{N}{2} \sum_i \lambda_i^2\right). \qquad (27)$$

The numerical methods used for eigenvalue computation are the QR method and the power methods (direct and inverse). The QR method is a numerically stable algorithm, while the power method is an iterative one. RMT can be used for many-body systems, quantum chaos, disordered systems, quantum chromodynamics, and so forth.
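A minimal sketch of the GUE construction above (the size $N$ and the Gaussian normalization are illustrative choices, not fixed by the text):

```python
import numpy as np

# Sketch: sample a GUE-type matrix M = A + A^H and compute its eigenvalues.
N = 500
rng = np.random.default_rng(42)
A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
M = A + A.conj().T                  # Hermitian by construction

# Stable Hermitian eigensolver; eigenvalues are returned already sorted,
# so lambda_i <= lambda_j for i < j, matching property (3).
lam = np.linalg.eigvalsh(M)
assert np.all(np.diff(lam) >= 0)
print(lam.min(), lam.max())
```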

2.3.3. Kernel Estimation in HEP. Kernel estimation is a very powerful and relevant method for HEP when it is necessary to combine data from heterogeneous sources, like MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data $\{x_i\}_{1 \le i \le n}$ with a constant bandwidth $h$ (the difference between two consecutive data values), called the smoothing parameter, we have the estimate

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right), \qquad (28)$$

where $K$ is an estimator. For example, a Gauss estimator with mean $\mu$ and standard deviation $\sigma$ is

$$K(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (29)$$

and has the following properties: it is positive definite and infinitely differentiable (due to the exp function), and it can be defined on an infinite support ($n \to \infty$). The kernel is a nonparametric method, which means that $h$ is independent of the dataset, and for a large amount of normally distributed data we can find a value of $h$ that minimizes the integrated squared error of $\hat{f}(x)$. This value of the bandwidth is computed as

$$h^{*} = \left(\frac{4}{3n}\right)^{1/5} \sigma. \qquad (30)$$

The main problem in kernel estimation is that the set of data $\{x_i\}_{1 \le i \le n}$ is not normally distributed, and in real experiments the optimal bandwidth is not known. An improvement of the presented method considers the adaptive kernel estimation proposed by Abramson [20], where $h_i = h/\sqrt{\hat{f}(x_i)}$ and $\sigma$ are considered global qualities of the dataset. The new form is

$$\hat{f}_a(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i} K\left(\frac{x - x_i}{h_i}\right), \qquad (31)$$

and the local bandwidth value that minimizes the integrated squared error of $\hat{f}_a(x)$ is

$$h_i^{*} = \left(\frac{4}{3n}\right)^{1/5} \sqrt{\frac{\sigma}{\hat{f}(x_i)}}, \qquad (32)$$

where $\hat{f}$ is the normal estimator.

Kernel estimation is used for event selection and confidence level evaluation, for example in Markovian Monte Carlo chains or in the selection of the neural network output used in experiments for the reconstructed Higgs mass. In general, the main usage of kernel estimation in HEP is searching for new particles by finding relevant data in a large dataset.

A method based on kernel estimation is the graphical representation of datasets using the advanced shifted histogram (ASH) algorithm. This is a numerical interpolation for large datasets, with the main aim of creating a set of $n_{\text{bin}}$ histograms $H = \{H_i\}$ with the same bin width $h$. Algorithm 4 presents the steps of histogram generation, starting with a specific interval $[a, b]$, a number of points $n$ in this interval, a number of bins, and a number of values used for kernel estimation, $m$. Figure 6 shows the results of kernel estimation for the function $f = -(1/2)x^2$ on $[0, 1]$ and its graphical representation for different numbers of bins. The values on the vertical axis are aggregated in step 17 of Algorithm 4 and increase with the number of bins.

2.3.4. Performance of Numerical Algorithms Used in Particle Physics. All operations used in the presented methods for particle physics (unfolding processes, Random Matrix Theory,


(1)  procedure ASH(a, b, n, x, n_bin, m, f_m)
(2)    δ = (b − a)/n_bin; h = mδ
(3)    for k = 1, …, n_bin do
(4)      v_k = 0; y_k = 0
(5)    end for
(6)    for i = 1, …, n do
(7)      k = ⌊(x_i − a)/δ⌋ + 1
(8)      if k ∈ [1, n_bin] then
(9)        v_k = v_k + 1
(10)     end if
(11)   end for
(12)   for k = 1, …, n_bin do
(13)     if v_k = 0 then
(14)       k = k + 1
(15)     end if
(16)     for i = max{1, k − m + 1}, …, min{n_bin, k + m − 1} do
(17)       y_i = y_i + v_k f_m(i − k)
(18)     end for
(19)   end for
(20)   for k = 1, …, n_bin do
(21)     y_k = y_k/(nh)
(22)     t_k = a + (k − 0.5)δ
(23)   end for
(24)   return {t_k}_{1≤k≤n_bin}, {y_k}_{1≤k≤n_bin}
(25)  end procedure

Algorithm 4: Advanced shifted histogram (1D algorithm).
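A transcription of Algorithm 4 into Python (a sketch; the weight function $f_m$ is not specified in the text, so a triangular weight is assumed here for illustration):

```python
import numpy as np

def ash(a, b, x, nbin, m, fm):
    """Advanced shifted histogram (1D), following Algorithm 4.

    a, b : interval; x : data points; nbin : number of bins;
    m    : number of values used for kernel estimation;
    fm   : kernel weight function evaluated at integer bin offsets.
    """
    n = len(x)
    delta = (b - a) / nbin
    h = m * delta
    v = np.zeros(nbin)
    y = np.zeros(nbin)
    for xi in x:                           # steps 6-11: bin counts
        k = int((xi - a) / delta)          # 0-based bin index
        if 0 <= k < nbin:
            v[k] += 1
    for k in range(nbin):                  # steps 12-19: smooth over 2m-1 bins
        if v[k] == 0:
            continue
        for i in range(max(0, k - m + 1), min(nbin, k + m)):
            y[i] += v[k] * fm(i - k)
    y /= n * h                             # step 21: normalize to a density
    t = a + (np.arange(nbin) + 0.5) * delta    # step 22: bin centers
    return t, y

# Usage with a triangular weight (an illustrative choice for f_m).
tri = lambda j, m=5: max(0.0, 1.0 - abs(j) / m)
rng = np.random.default_rng(0)
t, y = ash(0.0, 1.0, rng.uniform(size=2000), nbin=100, m=5, fm=tri)
print(np.trapz(y, t))    # approximately 1: y is a density estimate
```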

and kernel estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21] the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, with extended precision, is described. This permits higher internal precision and mixed input/output types. The extended precision allows the implementation of some algorithms that are simpler, more accurate, and sometimes faster than is possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check of an $Ax = y$ system, with $A \in R^{n \times n}$ a dense matrix. The residual check is defined as

$$r_\infty = \frac{\|Ax - y\|_\infty}{n \epsilon \left(\|A\|_\infty \|x\|_\infty + \|y\|_\infty\right)} < 16, \qquad (33)$$

where $\epsilon$ is the relative machine precision for the IEEE representation standard, $\|y\|_\infty$ is the infinity norm of a vector, $\|y\|_\infty = \max_{1 \le i \le n} |y_i|$, and $\|A\|_\infty$ is the infinity norm of a matrix, $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |A_{ij}|$.

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for simple and double precision. For simple precision, $\epsilon_s = 2^{-24}$, for all $n \ge 1.05 \times 10^6$ the residual check is always lower than the imposed upper bound; similarly for double precision, with $\epsilon_d = 2^{-53}$, for all $n \ge 5.63 \times 10^{14}$. If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for the data volume in this model.
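The residual check (33) is easy to evaluate for a concrete system (a sketch on a random dense system; the size and seed are arbitrary illustrative choices):

```python
import numpy as np

# Sketch: residual check of equation (33) for a random dense system Ax = y.
rng = np.random.default_rng(3)
n = 200
A = rng.normal(size=(n, n))
y = rng.normal(size=n)
x = np.linalg.solve(A, y)

eps = 2.0**-53                          # relative machine precision (double)
vec_inf = lambda v: np.abs(v).max()     # infinity norm of a vector
A_inf = np.abs(A).sum(axis=1).max()     # infinity norm of a matrix

r = vec_inf(A @ x - y) / (n * eps * (A_inf * vec_inf(x) + vec_inf(y)))
print(r < 16)    # solution accepted by the residual check
```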

In a single-processor system the complexity of an algorithm depends only on the problem size $n$. We can assume $T(n) = \Theta(f(n))$, where $f(n)$ is a fundamental function ($f(n) \in \{1, n^\alpha, a^n, \log n, \sqrt{n}, \ldots\}$). In parallel systems (multiprocessor systems with $p$ processors) we have the serial processing time $T^{*}(n) = T_1(n)$ and the parallel processing time $T_p(n)$. The performance of parallel algorithms can be analyzed using the speed-up, efficiency, and isoefficiency metrics.

(i) The speed-up $S(p)$ represents how much faster a parallel algorithm is than the corresponding sequential algorithm. The speed-up is defined as $S(p) = T_1(n)/T_p(n)$. There are special bounds for the speed-up [23]: $S(p) \le \bar{p}p/(p + \bar{p} - 1)$, where $\bar{p} = T_1/T_\infty$ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually $S(p) \le p$, but under special circumstances the speed-up can be $S(p) > p$ [24]. Another upper bound is established by Amdahl's law: $S(p) = (s + (1-s)/p)^{-1} \le 1/s$, where $s$ is the fraction of a program that is sequential. The upper bound is obtained by considering zero time for the parallel fraction.

(ii) The efficiency is the average utilization of the $p$ processors: $E(p) = S(p)/p$.

(iii) The isoefficiency is the growth rate of the workload $W_p(n) = p T_p(n)$ in terms of the number of processors needed to keep the efficiency fixed. If we consider $W_1(n) - E W_p(n) = 0$ for any fixed efficiency $E$, we obtain $p = p(n)$. This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of $n$ numbers using $p$ processors we have $n \approx E(n + p \log p)$, so $n = \Theta(p \log p)$.
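As a quick numerical illustration of these metrics (with an illustrative sequential fraction $s$, not a value from the text), Amdahl's bound and the corresponding efficiency can be evaluated directly:

```python
# Sketch: Amdahl's law bound S(p) = 1/(s + (1-s)/p) and efficiency E(p),
# for a program whose sequential fraction is s (illustrative value).

def amdahl_speedup(s, p):
    # Bounded above by 1/s as p grows.
    return 1.0 / (s + (1.0 - s) / p)

def efficiency(s, p):
    return amdahl_speedup(s, p) / p

s = 0.05                        # 5% of the program is sequential
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(s, p), 1), round(efficiency(s, p), 3))
# The speed-up saturates near 1/s = 20 no matter how many processors we add.
```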

Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins: 10, 100, and 1000.

Figure 7: Residual check analysis for solving the $Ax = y$ system in HPL 2.0, using simple and double precision representations.

Numerical algorithms use for implementation a hypercube architecture. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers $T_{\text{com}} = t_s + L t_w$, where $t_s$ is the latency (the time needed by a message to cross through the network), $t_w$ is the time needed to send a word ($1/t_w$ is called bandwidth), and $L$ is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define $t_c$ as the processing time per word for a processor. We have the following results.

Table 2: Isoefficiency for a hypercube architecture, $n = \Theta(p \log p)$ and $n = \Theta(p (\log p)^2)$. We mark with (*) the limitations imposed by Formula (33).

    Scenario | Architecture size ($p$) | $n = \Theta(p \log p)$   | $n = \Theta(p (\log p)^2)$
    1        | $10^1$                  | $1.0 \times 10^1$        | $1.00 \times 10^1$
    2        | $10^2$                  | $2.0 \times 10^2$        | $8.00 \times 10^2$
    3        | $10^3$                  | $3.0 \times 10^3$        | $2.70 \times 10^4$
    4        | $10^4$                  | $4.0 \times 10^4$        | $6.40 \times 10^5$ (*)
    5        | $10^5$                  | $5.0 \times 10^5$ (*)    | $1.25 \times 10^7$
    6        | $10^6$                  | $6.0 \times 10^6$        | $2.16 \times 10^8$
    7        | $10^7$                  | $7.0 \times 10^7$        | $3.43 \times 10^9$
    8        | $10^8$                  | $8.0 \times 10^8$        | $5.12 \times 10^{10}$
    9        | $10^9$                  | $9.0 \times 10^9$        | $7.29 \times 10^{11}$

(i) External product $x y^T$. The isoefficiency is written as

$$t_c n \approx E\left(t_c n + (t_s + t_w)\, p \log p\right) \Longrightarrow n = \Theta(p \log p). \qquad (34)$$

The parallel processing time is $T_p = t_c n/p + (t_s + t_w) \log p$. The optimality is computed using

$$\frac{dT_p}{dp} = 0 \Longrightarrow -t_c \frac{n}{p^2} + \frac{t_s + t_w}{p} = 0 \Longrightarrow p \approx \frac{t_c n}{t_s + t_w}. \qquad (35)$$

(ii) Scalar product (internal product) $x^T y = \sum_{i=1}^{n} x_i y_i$. The isoefficiency is written as

$$t_c n^2 \approx E\left(t_c n^2 + \frac{t_s}{2} p \log p + \frac{t_w}{2} n \sqrt{p} \log p\right) \Longrightarrow n = \Theta\left(p (\log p)^2\right). \qquad (36)$$

(iii) Matrix-vector product $y = Ax$, $y_i = \sum_{j=1}^{n} A_{ij} x_j$. The isoefficiency is written as

$$t_c n^2 \approx E\left(t_c n^2 + t_s p \log p + t_w n \sqrt{p} \log p\right) \Longrightarrow n = \Theta\left(p (\log p)^2\right). \qquad (37)$$
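Equation (35) gives the processor count that minimizes $T_p$ for the external product. A small sketch (the machine constants $t_c$, $t_s$, $t_w$ below are made-up illustrative values, not measurements):

```python
import math

# Illustrative machine constants (assumptions, not from the text).
t_c = 1e-9    # processing time per word (s)
t_s = 1e-6    # message latency (s)
t_w = 1e-8    # per-word transfer time (s)

def T_p(n, p):
    # Parallel time for the external product: T_p = t_c*n/p + (t_s+t_w)*log p
    return t_c * n / p + (t_s + t_w) * math.log2(p)

def optimal_p(n):
    # Eq. (35): p ~ t_c * n / (t_s + t_w), from dT_p/dp = 0
    return t_c * n / (t_s + t_w)

n = 10**8
p_opt = round(optimal_p(n))
print(p_opt)                              # optimal processor count
print(T_p(n, p_opt) < T_p(n, p_opt // 10) and T_p(n, p_opt) < T_p(n, p_opt * 10))
```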

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound $n \ge 1.05 \times 10^6$ are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound for the Big Data volume in this case.

The factors that determine the efficiency of parallel algorithms are task balancing (the workload distribution between all used processors in a system → to be maximized), concurrency (the number/percentage of processors working simultaneously → to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing → to be minimized).


Figure 8: Processing flows for HEP experiments. Experimental data (approximately TBs of data: a Big Data problem) and simulated data (MC data generated from theoretical models: a Big Data problem) each pass through modeling and reconstruction (finding the properties of observed particles) and analysis (reducing the results to simple statistical distributions); the two resulting distributions are then compared.

3. New Challenges for Big Data Science

There are a lot of applications that generate Big Data, like social networking profiles, social influence, SaaS and Cloud applications, public web information, MapReduce, scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the approaches and solutions to better answer the users' needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TBs of valuable data) and simulation data (data generated using MC methods based on theoretical models). The processing phase is represented by modeling and reconstruction, in order to find the properties of the observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the results obtained will validate how realistic a simulation experiment is and validate it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes in resources' availability, and so forth. We pay special attention to Cloud systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues arise: maximization of system throughput, sites' autonomy, scalability, fault tolerance, and quality of services.

When discussing Big Data we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need for many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrive too fast for optimal storage and indexing

[28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, "Hadoop YARN: a next-generation framework for Hadoop data processing," 2013, http://hortonworks.com/hadoop/yarn), Corona ("Under the Hood: Scheduling MapReduce jobs more efficiently with Corona," 2012, Facebook), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future scientific data infrastructure are presented in [31]. That paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science. It also proposes the SDI generic architecture model, which provides a basis for building interoperable data- or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure with cost advantages and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE), speculative execution, deadline constraint scheduling, and resource-aware scheduling.

Data transfer scheduling in Grids, Clouds, P2P systems, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time, for finding a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and query, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection


of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data are too heterogeneous to fit into a rigid schema [38].

Another challenge is represented by the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, as well as dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. An example of optimization is the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42], or optimizing short jobs [43]. The cost effectiveness, scalability, and streamlined architectures of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to the public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), complex task dependencies for data workflows, and new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method the proper numerical technique has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper puts together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face the upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research presented in this paper is supported by the following projects: "SideSTEP—Scheduling Methods for Dynamic Distributed Systems: a self-* approach" (PN-II-CT-RO-FR-2012-1-0084); "ERRIC—Empowering Romanian Research on Intelligent Information Technologies," FP7-REGPOT-2010-1, ID: 264207; CyberWater Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.
[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.
[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.
[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678–683, 2006.
[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488–493, 2002.
[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585–591, 2009.
[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43–48, 2009.
[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 05, 2006.
[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219–228, 1998.
[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311–322, 1998.
[11] O. Behnke, K. Kroninger, G. Schott, and T. H. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.
[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).
[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.
[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.
[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469–481, 1996.
[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333–345, 2013.
[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533–540, Springer, Berlin, Germany, 2009.
[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233–297, 2005.
[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198–207, 2001.
[20] I. S. Abramson, "On bandwidth variation in kernel estimates—a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217–1223, 1982.
[21] X. Li, J. W. Demmel, D. H. Bailey, et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152–205, 2002.
[22] J. J. Dongarra and J. Langou, "The problem with the Linpack benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5–13, 2009.
[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640–649, April 1995.
[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.
[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.
[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710–2721, 2013.
[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014–2015, 2012.
[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1–5, Springer, Berlin, Germany, 2013.
[29] B. Hindman, A. Konwinski, M. Zaharia, et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.
[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.
[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614–617, IEEE Computer Society, Washington, DC, USA, 2012.
[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.
[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097–1111, 2013.
[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153–172, 2012.
[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109–110, ACM, New York, NY, USA, 2013.
[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (PGCIC '11), pp. 59–66, Barcelona, Spain, October 2011.
[37] J. Chen, Y. Chen, X. Du, et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157–164, 2013.
[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7–11, ACM, New York, NY, USA, 2012.

[39] R van den Bossche K Vanmechelen and J BroeckhoveldquoOnline coste fficient scheduling of deadline-constrained work-loads on hybrid cloudsrdquo Future Generation Computer Systemsvol 29 no 4 pp 973ndash985 2013

[40] N Bessis S Sotiriadis F Pop and V Cristea ldquoUsing anovel messageexchanging optimization (meo) model to reduceenergy consumption in distributed systemsrdquo Simulation Mod-elling Practice and Theory vol 39 pp 104ndash112 2013

[41] G V Iordache M S Boboila F Pop C Stratan and VCristea ldquoA decentralized strategy for genetic scheduling inheterogeneous environmentsrdquo Multiagent and Grid Systemsvol 3 no 4 pp 355ndash367 2007

[42] F Zhang J Cao K Li S U Khan and K Hwang ldquoMulti-objective scheduling of many tasks in cloud platformsrdquo FutureGeneration Computer Systems 2013

[43] K Elmeleegy ldquoPiranha optimizing short jobs in hadooprdquoProceedings of the VLDB Endowment vol 6 no 11 pp 985ndash9962013


2 Advances in High Energy Physics

Figure 1: General approach of event generation, detection, and reconstruction. (Diagram: physics model → simulation process → event generator → vectors of events; nature (physical experiments) → real detectors → events; simulated detectors → events → reconstruction → vector of events → comparison and correction.)

equations or integrals are based on classical quadratures and Monte Carlo (MC) techniques. These allow generating events in terms of particle flavors and four-momenta, which is particularly useful for experimental applications. For example, MC techniques for solving the GLAP equations are based on simulated Markov chains (random walks), which have the advantage of filtering and smoothing the state vector for estimating parameters.

In practice, several MC event generators and simulation tools are used. For example, the HERWIG project (http://projects.hepforge.org/herwig) considers angular-ordered parton shower and cluster hadronization (the tool is implemented using Fortran); the PYTHIA project (http://www.thep.lu.se/~torbjorn/Pythia.html) is oriented on dipole-type parton shower and string hadronization (the tool is implemented in Fortran and C++); and SHERPA (http://projects.hepforge.org/sherpa) considers dipole-type parton shower and cluster hadronization (the tool is implemented in C++). An important tool for MC simulations is GATE (GEANT4 Application for Tomographic Emission), a generic simulation platform based on GEANT4. GATE provides new features for nuclear imaging applications and includes specific modules that have been developed to meet specific requirements encountered in SPECT (Single Photon Emission Tomography) and PET (Positron Emission Tomography).

The main contributions of this paper are as follows:

(i) introduction and analysis of the most important modeling methods used in High Energy Physics;
(ii) identification and description of the computational numerical methods for High Energy Physics;
(iii) presentation of the main challenges for Big Data processing.

The paper is structured as follows. Section 2 introduces the computational methods used in HEP and describes the performance evaluation of parallel numerical algorithms. Section 3 discusses the new challenge for Big Data Science generated by HEP and HPC. Section 4 presents the conclusions and general open issues.

2. Computational Methods Used in High Energy Physics

Computational methods are used in HEP in parallel with physical experiments to generate particle interactions that are modeled using vectors of events. This section presents the general approach of event generation, simulation methods based on Monte Carlo algorithms, Markovian Monte Carlo chains, methods that describe unfolding processes in particle physics, Random Matrix Theory as support for particle spectrum analysis, and kernel estimation, which produces continuous estimates of the parent distribution from the empirical probability density function. The section ends with a performance analysis of parallel numerical algorithms used in HEP.

2.1. General Approach of Event Generation. The most important aspect in simulation for HEP experiments is event generation. This process can be split into multiple steps according to the physical models, for example, the structure of LHC (Large Hadron Collider) events: (1) hard process; (2) parton shower; (3) hadronization; (4) underlying event. According to the official LHC website (http://home.web.cern.ch/about/computing), "approximately 600 million times per second, particles collide within the LHC. Experiments at CERN generate colossal amounts of data. The Data Centre stores it, and sends it around the world for analysis." The analysis must produce valuable data, and the simulation results must be correlated with the physical experiments.

Figure 1 presents the general approach of event generation, detection, and reconstruction. The physical model is used to create a simulation process that produces different types of events, clustered in a vector of events (e.g., the four types of events in LHC experiments).

In parallel, the real experiments are performed. The detectors identify the most relevant events and, based on reconstruction techniques, the vector of events is created. The detectors can be real or simulated (software tools), and the reconstruction phase combines real events with events detected in simulation. At the end, the final result is compared with the simulation model (especially with the generated vectors of events). The model can be corrected for further experiments. The goal is to obtain high accuracy and precision of the measured and processed data.

Software tools for event generation are based on random number generators. There are three types of random numbers: truly random numbers (from physical generators), pseudorandom numbers (from mathematical generators), and quasirandom numbers (special correlated sequences of numbers used only for integration). For example, numerical integration using quasirandom numbers usually gives faster convergence than the standard integration methods based on


(1) procedure RANDOM_GENERATOR_POISSON(μ)
(2)   number ← −1
(3)   accumulator ← 1.0
(4)   q ← exp(−μ)
(5)   while accumulator > q do
(6)     rnd_number ← RND()
(7)     accumulator ← accumulator ∗ rnd_number
(8)     number ← number + 1
(9)   end while
(10)  return number
(11) end procedure

Algorithm 1: Random number generation for Poisson distribution using many random generated numbers with normal distribution (RND).
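Algorithm 1 counts how many uniform draws can be multiplied together before the running product falls below exp(−μ); that count is Poisson-distributed. A minimal Python sketch (the function name is mine, and Python's uniform generator stands in for RND()):

```python
import math
import random

def poisson_multiplicative(mu, rng=random.random):
    """Algorithm 1: count how many uniform draws can be multiplied
    together before the running product drops below exp(-mu)."""
    number = -1
    accumulator = 1.0
    q = math.exp(-mu)
    while accumulator > q:
        accumulator *= rng()   # one uniform draw per loop iteration
        number += 1
    return number

# sanity check: the mean of Poisson(mu) samples should be close to mu
random.seed(42)
samples = [poisson_multiplicative(3.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)
```

Note that the cost of one variate grows with μ, since roughly μ + 1 uniform draws are consumed per call.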

quadratures. In event generation, pseudorandom numbers are used most often.

The most popular HEP applications use the Poisson distribution combined with a basic normal distribution. The Poisson distribution can be formulated as

P[X = k] = \frac{\mu^k}{k!} e^{-\mu}, \quad k = 0, 1, \ldots,   (1)

with E(k) = V(k) = μ (V is the variance and E is the expectation value). Having a uniform random number generator called RND() (Random based on Normal Distribution), we can use the following two algorithms for event generation techniques.

The result of running Algorithms 1 and 2 to generate around 10^6 random numbers is presented in Figure 2. In general, the second algorithm gives better results for the Poisson distribution. The general recommendation for HEP experiments indicates the use of popular random number generators like TRNG (True Random Number Generators), RANMAR (Fast Uniform Random Number Generator, used in CERN experiments), RANLUX (an algorithm developed by Lüscher, used by Unix random number generators), and the Mersenne Twister (the "industry standard"). Random number generators provided with compilers, operating systems, and programming language libraries can have serious problems because they are based on the system clock and suffer from lack of uniformity of the distribution for large amounts of generated numbers and from correlation of successive values.

The art of event generation is to use appropriate combinations of various random number generation methods in order to construct an efficient event generation algorithm that is the solution to a given problem in HEP.

2.2. Monte Carlo Simulation and Markovian Monte Carlo Chains in HEP. In general, a Monte Carlo (MC) method is any simulation technique that uses random numbers to solve a well-defined problem P. If F is a solution of the problem P (e.g., F ∈ R^n or F has a Boolean value), we define F̂, an estimation of F, as F̂ = f(r_1, r_2, \ldots, r_n), where (r_i)_{1 \le i \le n} is a random variable that can take more than one value and for which any value that will be taken cannot be predicted in

Figure 2: Frequency of the generated numbers for Random Generator Poisson (Algorithm 1) (a) and Random Generator Poisson RND (Algorithm 2) (b).

advance. If ρ(r) is the probability density function, ρ(r) dr = P[r < r′ < r + dr], the cumulative distribution function is

C(r) = \int_{-\infty}^{r} \rho(x)\, dx \Longrightarrow \rho(r) = \frac{dC(r)}{dr}.   (2)

C(r) is a monotonically nondecreasing function with all values in [0, 1]. The expectation value is

E(f) = \int f(r)\, dC(r) = \int f(r)\, \rho(r)\, dr,   (3)

and the variance is

V(f) = E[f - E(f)]^2 = E(f^2) - E^2(f).   (4)

2.2.1. Monte Carlo Event Generation and Simulation. To define a MC estimator, the "Law of Large Numbers" (LLN) is used. LLN can be described as follows: let one choose n


(1) procedure RANDOM_GENERATOR_POISSON_RND(μ, r)
(2)   number ← 0
(3)   q ← exp(−μ)
(4)   accumulator ← q
(5)   p ← q
(6)   while r > accumulator do
(7)     number ← number + 1
(8)     p ← p ∗ μ/number
(9)     accumulator ← accumulator + p
(10)  end while
(11)  return number
(12) end procedure

Algorithm 2: Random number generation for Poisson distribution using one random generated number with normal distribution.
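Algorithm 2 is the inversion method: a single uniform draw r is compared against the accumulated Poisson CDF, with each term obtained from the previous one via p_k = p_{k−1}·μ/k. A Python sketch under the same naming assumptions as before:

```python
import math
import random

def poisson_inversion(mu, r):
    """Algorithm 2: invert the Poisson CDF using a single uniform draw r.
    Successive terms p_k = exp(-mu) * mu^k / k! are accumulated until
    the running CDF exceeds r."""
    number = 0
    q = math.exp(-mu)
    accumulator = q
    p = q
    while r > accumulator:
        number += 1
        p *= mu / number            # p_k from p_{k-1}
        accumulator += p
    return number

random.seed(7)
samples = [poisson_inversion(3.0, random.random()) for _ in range(100_000)]
mean = sum(samples) / len(samples)
```

Only one uniform number is consumed per variate, which is why this variant is usually preferred.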

numbers r_i randomly, with the probability density function uniform on a specific interval (a, b), each r_i being used to evaluate f(r_i). For large n (consistent estimator),

\frac{1}{n} \sum_{i=1}^{n} f(r_i) \longrightarrow E(f) = \frac{1}{b-a} \int_{a}^{b} f(r)\, dr.   (5)

The properties of a MC estimator are as follows: it is normally distributed (with Gaussian density); the standard deviation is σ = \sqrt{V(f)/n}; MC is unbiased for all n (the expectation value is the real value of the integral); the estimator is consistent if V(f) < ∞ (the estimator converges to the true value of the integral for very large n); a sampling phase can be applied to compute the estimator if we do not know anything about the function f, which makes it just suitable for integration. The sampling phase can be expressed in a stratified way as

\int_{a}^{b} f(r)\, dr = \int_{a}^{r_1} f(r)\, dr + \int_{r_1}^{r_2} f(r)\, dr + \cdots + \int_{r_n}^{b} f(r)\, dr.   (6)

MC estimations and MC event generators are necessary tools in most HEP experiments, being used at all their steps: experiment preparation, simulation running, and data analysis.
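The estimator in (5), together with its standard deviation σ = \sqrt{V(f)/n}, can be sketched in a few lines of Python (the integrand and interval are illustrative choices of mine):

```python
import math
import random

def mc_integrate(f, a, b, n, rng=random.random):
    """Plain MC estimate of the integral of f over (a, b), together with
    the standard deviation sqrt(V/n) of the estimator (cf. (5))."""
    values = [f(a + (b - a) * rng()) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return (b - a) * mean, (b - a) * math.sqrt(var / n)

random.seed(1)
estimate, sigma = mc_integrate(math.sin, 0.0, math.pi, 200_000)
# the exact integral of sin over [0, pi] is 2
```

The error estimate shrinks as n^{−1/2}, independently of dimension, which is the property exploited in Table 1 below.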

An example of MC estimation is the Lorentz invariant phase space (LIPS) that describes the cross section for a typical HEP process with n particles in the final state:

\sigma_n \sim \int |M|^2\, dR_n,   (7)

where M is the matrix describing the interaction between particles and dR_n is the element of LIPS. We have the following estimation:

R_n(P; p_1, p_2, \ldots, p_n) = \int \delta^{(4)}\!\left(P - \sum_{k=1}^{n} p_k\right) \prod_{k=1}^{n} \left( \delta(p_k^2 - m_k^2)\, \Theta(p_k^0)\, d^4 p_k \right),   (8)

Figure 3: Example of particle interaction e⁺(p₁) + e⁻(p₂) → μ⁺(q₁) μ⁻(q₂). (Diagram: scattering geometry with beam axis z, polar angle θ, and azimuthal angle Φ.)

where P is the total four-momentum of the n-particle system, p_k and m_k are the four-momenta and masses of the final-state particles, \delta^{(4)}(P - \sum_{k=1}^{n} p_k) expresses total energy-momentum conservation, and \delta(p_k^2 - m_k^2) is the on-mass-shell condition for the final-state system. Based on the integration formula

\int \delta(p_k^2 - m_k^2)\, \Theta(p_k^0)\, d^4 p_k = \frac{d^3 p_k}{2 p_k^0},   (9)

we obtain the iterative form for the cross section:

R_n(P; p_1, p_2, \ldots, p_n) = \int R_{n-1}(P - p_n; p_1, p_2, \ldots, p_{n-1})\, \frac{d^3 p_n}{2 p_n^0},   (10)

which can be numerically integrated by using the recurrence relation. As a result, we can construct a general MC algorithm for particle collision processes.

Example 1. Let us consider the interaction e⁺e⁻ → μ⁺μ⁻, where the Higgs boson contribution is numerically negligible. Figure 3 describes this interaction (Φ is the azimuthal angle, θ the polar angle, and p₁, p₂, q₁, q₂ are the four-momenta of the particles).

The cross section is

d\sigma = \frac{\alpha^2}{4s} \left[ W_1(s)(1 + \cos^2\theta) + W_2(s)\cos\theta \right] d\Omega,   (11)


where dΩ = d cos θ dΦ, α = e²/4π (the fine structure constant), s = (p_1^0 + p_2^0)^2 is the center-of-mass energy squared, and W₁(s) and W₂(s) are constant functions. For pure processes we have W₁(s) = 1 and W₂(s) = 0, and the total cross section becomes

\sigma = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\, \frac{d^2\sigma}{d\Phi\, d\cos\theta}.   (12)

We introduce the following notation:

\rho(\cos\theta, \Phi) = \frac{d^2\sigma}{d\Phi\, d\cos\theta},   (13)

and let us consider ρ̃(cos θ, Φ) an approximation of ρ(cos θ, Φ); then σ̃ = ∬ dΦ d cos θ ρ̃. Now we can compute

\sigma = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\, \rho(\cos\theta, \Phi)
       = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\, w(\cos\theta, \Phi)\, \tilde{\rho}(\cos\theta, \Phi)
       \approx \langle w \rangle_{\tilde{\rho}} \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\, \tilde{\rho}(\cos\theta, \Phi) = \langle w \rangle_{\tilde{\rho}}\, \tilde{\sigma},   (14)

where w(cos θ, Φ) = ρ(cos θ, Φ)/ρ̃(cos θ, Φ) and ⟨w⟩_ρ̃ is the estimation of w based on ρ̃. Here the MC estimator is

\langle w \rangle_{\mathrm{MC}} = \frac{1}{n} \sum_{i=1}^{n} w_i,   (15)

and the standard deviation is

s_{\mathrm{MC}} = \left( \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( w_i - \langle w \rangle_{\mathrm{MC}} \right)^2 \right)^{1/2}.   (16)

The final numerical result based on the MC estimator is

\sigma_{\mathrm{MC}} = \langle w \rangle_{\mathrm{MC}} \pm s_{\mathrm{MC}}.   (17)

As we can see, the principle of a Monte Carlo estimator in physics is to simulate the cross section in interaction and radiation transport, knowing the probability distributions (or an approximation) governing each interaction of the elementary particles.

Based on this result, the Monte Carlo algorithm used to generate events is as follows. It takes as input ρ̃(cos θ, Φ) and, in a main loop, considers the following steps: (1) generate a pair (cos θ, Φ) from ρ̃; (2) compute the four-momenta p₁, p₂, q₁, q₂; (3) compute w = ρ/ρ̃. The loop can be stopped in the case of unweighted events, while we stay in the loop for weighted events. As output, the algorithm returns the four-momenta of the particles for unweighted events, and the four-momenta together with an array of weights for weighted events. The main issue is how to initialize the input of the algorithm. Based on the dσ formula (for W₁(s) = 1 and W₂(s) = 0), we can consider as input ρ̃(cos θ, Φ) = (α²/4s)(1 + cos²θ); then σ̃ = 4πα²/3s.
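For this choice of ρ̃, the Φ dependence is flat and 1 + cos²θ ≤ 2, so unweighted events can be produced by acceptance-rejection. A hedged Python sketch (function name and sample size are mine):

```python
import math
import random

def generate_unweighted_event(rng=random.random):
    """Sample (cos_theta, phi) from rho~ proportional to 1 + cos^2(theta)
    by acceptance-rejection: stay in the loop until a candidate is
    accepted, then return it as an unweighted event."""
    while True:
        cos_theta = 2.0 * rng() - 1.0             # uniform on [-1, 1]
        phi = 2.0 * math.pi * rng()               # uniform on [0, 2*pi)
        if 2.0 * rng() <= 1.0 + cos_theta ** 2:   # max of 1 + cos^2 is 2
            return cos_theta, phi

random.seed(3)
events = [generate_unweighted_event() for _ in range(50_000)]
# under rho~, <cos^2 theta> = (2/3 + 2/5) / (8/3) = 0.4
mean_cos2 = sum(c * c for c, _ in events) / len(events)
```

The four-momenta q₁, q₂ would then be computed from (cos θ, Φ) and the beam energy, which this sketch omits.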

In HEP, theoretical predictions used for modeling particle collision processes (as shown in the presented example) should be provided in terms of Monte Carlo event generators, which directly simulate these processes and can provide unweighted (weight = 1) events. A good Monte Carlo algorithm should be used not only for numerical integration [7] (i.e., providing weighted events) but also for efficient generation of unweighted events, which is a very important issue for HEP.

2.2.2. Markovian Monte-Carlo Chains. A classical Monte Carlo method estimates a function F with F̂ by using a random variable. The main problem with this approach is that we cannot predict any value in advance for a random variable. In HEP simulation experiments the systems are described by states [8]. Let us consider a system with a finite set of possible states S₁, S₂, ..., and let S_t be the state at the moment t. The conditional probability is defined as

P(S_t = S_j \mid S_{t_1} = S_{i_1}, S_{t_2} = S_{i_2}, \ldots, S_{t_n} = S_{i_n}),   (18)

where the mappings (t₁, i₁), ..., (t_n, i_n) can be interpreted as the description of the system's evolution in time, by specifying a specific state for each moment of time.

The system is a Markov chain if the distribution of S_t depends only on the immediate predecessor S_{t−1} and is independent of all previous states, as follows:

P(S_t = S_j \mid S_{t-1} = S_{i_{t-1}}, \ldots, S_{t_2} = S_{i_2}, S_{t_1} = S_{i_1}) = P(S_t = S_j \mid S_{t-1} = S_{i_{t-1}}).   (19)

To generate the time steps (t₁, t₂, ..., t_n) we use the probability of a single forward Markovian step, given by p(t | t_n) with the property \int_{t_n}^{\infty} p(t \mid t_n)\, dt = 1, and we define p(t) = p(t | 0). The 1-dimensional Monte Carlo Markovian algorithm used to generate the time steps is presented in Algorithm 3.

The main result of Algorithm 3 is that P(t_max) follows a Poisson distribution:

P_N = \int_0^{t_{\max}} p(t_1 \mid t_0)\, dt_1 \times \int_{t_1}^{t_{\max}} p(t_2 \mid t_1)\, dt_2 \times \cdots \times \int_{t_{N-1}}^{t_{\max}} p(t_N \mid t_{N-1})\, dt_N \times \int_{t_{\max}}^{\infty} p(t_{N+1} \mid t_N)\, dt_{N+1} = \frac{1}{N!} (t_{\max})^N e^{-t_{\max}}.   (20)

We can consider the 1-dimensional Monte Carlo Markovian algorithm as a method used to iteratively generate the system's states (codified as a Markov chain) in simulation experiments. According to the ergodic theorem for Markov chains, the chain defined here has a unique stationary probability distribution [9, 10].

Figures 4 and 5 present the results of running Algorithm 3. According to the different values of the parameter s used to generate the next step, the results are very different over 1000 iterations. Figure 4, for s = 1, shows a noise-like profile. For s = 10, 100, 1000 the profile looks as if some of the information is filtered out and lost. The best results are obtained for s = 0.01


(1) Generate t₁ according to p(t₁) = p(t₁ | t₀ = 0)
(2) if t₁ < t_max then  ⊳ Generate the initial state
(3)   P_{N≥1} = ∫_0^{t_max} p(t₁ | t₀) dt₁  ⊳ Compute the initial probability
(4)   Retain t₁
(5) end if
(6) if t₁ > t_max then  ⊳ Discard all generated and computed data
(7)   N = 0, P₀ = ∫_{t_max}^{∞} p(t₁ | t₀) dt₁ = e^{−t_max}
(8)   Delete t₁
(9)   EXIT  ⊳ The algorithm ends here
(10) end if
(11) i = 2
(12) while (1) do  ⊳ Infinite loop until a successful EXIT
(13)   Generate t_i according to p(t_i | t_{i−1})
(14)   if t_i < t_max then  ⊳ Generate a new state and new probability
(15)     P_{N≥i} = ∫_{t_{i−1}}^{t_max} p(t_i | t_{i−1}) dt_i
(16)     Retain t_i
(17)   end if
(18)   if t_i > t_max then  ⊳ Discard all generated and computed data
(19)     N = i − 1, P_i = ∫_{t_max}^{∞} p(t_i | t_{i−1}) dt_i
(20)     Retain (t₁, t₂, ..., t_{i−1}), Delete t_i
(21)     EXIT  ⊳ The algorithm ends here
(22)   end if
(23)   i = i + 1
(24) end while

Algorithm 3: 1-dimensional Monte Carlo Markovian algorithm.
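A compact Python sketch of Algorithm 3, assuming an exponential step density p(t | t_n) = exp(−(t − t_n)) (so that p(t) = e^{−t}); under this assumption the number of retained steps should follow the Poisson distribution of (20):

```python
import math
import random

def markov_time_steps(t_max, rng=random.random):
    """Algorithm 3 with p(t | t_n) = exp(-(t - t_n)): advance by
    exponential forward steps, retain steps below t_max, and discard
    the first step that overshoots (the EXIT branch)."""
    steps = []
    t = 0.0
    while True:
        t += -math.log(rng())     # draw the next forward step
        if t > t_max:
            return steps          # N = len(steps)
        steps.append(t)

random.seed(11)
t_max = 2.0
counts = [len(markov_time_steps(t_max)) for _ in range(100_000)]
mean_n = sum(counts) / len(counts)
# per (20), N is Poisson(t_max), so E(N) = t_max
```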

and s = 0.1, and the generated values can be easily accepted for MC simulation in HEP experiments.

Figure 5 shows the acceptance rate of the values generated with the parameter s used in the algorithm; the parameter values are correlated with Figure 4. The results in Figure 5 show that the acceptance rate decreases rapidly with increasing values of the parameter s. The conclusion is that the values must be kept small to obtain meaningful data. A correlation with the normal distribution is evident, showing that a small value of the mean square deviation provides useful results.

2.2.3. Performance of Numerical Algorithms Used in MC Simulations. Numerical methods used to compute the MC estimator use numerical quadratures to approximate the value of the integral of a function f on a specific domain by a linear combination of function values and weights (w_i)_{1 \le i \le m}, as follows:

\int_{a}^{b} f(r)\, dr = \sum_{i=1}^{m} w_i f(r_i).   (21)

We can consider a consistent MC estimator as a classical numerical quadrature with all weights w_i equal. The efficiency of integration methods for 1 dimension and for d dimensions is presented in Table 1. We can conclude that quadrature methods are difficult to apply in many dimensions for variate integration domains (regions), and the integral is not easy to estimate.

Table 1: Efficiency of integration methods for 1 dimension and for d dimensions.

Method                         | 1 dimension | d dimensions
Monte Carlo                    | n^{-1/2}    | n^{-1/2}
Trapezoidal rule               | n^{-2}      | n^{-2/d}
Simpson's rule                 | n^{-4}      | n^{-4/d}
m-points Gauss rule (m < n)    | n^{-2m}     | n^{-2m/d}

As a practical example, in a typical high-energy particle collision there can be many final-state particles (even hundreds). If we have n final-state particles, we face a d = 3n − 4 dimensional phase space. As a numerical example, for n = 4 we have d = 8 dimensions, which is a very difficult case for classical numerical quadratures.

The full decomposition of the integration volume, for one double number (10 bytes) per volume unit, takes n^d × 10 bytes. For the example considered, with d = 8 and n = 10 divisions of the interval [0, 1], we have for one numerical integration

n^d \times 10\ \mathrm{Bytes} = \frac{10^8 \times 10}{1024^3}\ \mathrm{GBytes} \approx 0.93\ \mathrm{GBytes}.   (22)

Considering 10^6 events per second and one integration per event, the data produced in one hour will be ≈ 3197.4 PBytes.


Figure 4: Example of the 1-dimensional Monte Carlo Markovian algorithm: generated value versus state (iteration), over 1000 iterations, for s = 0.01, 0.1, 1, 10, 100, and 1000 (panels (a)-(f)).

The previous assumption holds only for multidimensional arrays. But due to the factorization assumption, p(r_1, r_2, \ldots, r_n) = p(r_1)\, p(r_2) \cdots p(r_n), we obtain for one integration

n \times d \times 10\ \mathrm{Bytes} = 800\ \mathrm{Bytes},   (23)

which means ≈ 2.62 TBytes of data produced for one hour of simulations.
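The storage estimates in (22) and (23) can be checked with a few lines of Python (binary units, 1024-based, assumed):

```python
def full_grid_bytes(n, d, bytes_per_unit=10):
    """Storage for a full d-dimensional grid of n divisions: n^d units."""
    return n ** d * bytes_per_unit

def factorized_bytes(n, d, bytes_per_unit=10):
    """Storage under the factorization assumption: only n * d units."""
    return n * d * bytes_per_unit

GB, TB, PB = 1024 ** 3, 1024 ** 4, 1024 ** 5
events_per_hour = 10 ** 6 * 3600     # 10^6 events/s, one integration each

full = full_grid_bytes(10, 8)        # 10^8 * 10 bytes per integration
fact = factorized_bytes(10, 8)       # 800 bytes per integration

full_gb = full / GB                  # ~0.93 GB, as in (22)
full_hour_pb = full * events_per_hour / PB    # ~3197.4 PB per hour
fact_hour_tb = fact * events_per_hour / TB    # ~2.62 TB per hour
```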

2.3. Unfolding Processes in Particle Physics and Kernel Estimation in HEP. In particle physics analysis we have two types of distributions: the true distribution (considered in theoretical models) and the measured distribution (considered in experimental models, which are affected by the finite resolution and limited acceptance of existing detectors). A HEP interaction process starts with a true, known distribution and generates a measured distribution corresponding to an experiment of

Figure 5: Analysis of the acceptance rate for the 1-dimensional Monte Carlo Markovian algorithm for different s values (acceptance rate (%) versus log₁₀(s)).

a well-confirmed theory. An inverse process starts with a measured distribution and tries to identify the true distribution. These unfolding processes are used to identify new theories based on experiments [11].

2.3.1. Unfolding Processes in Particle Physics. The theory of unfolding processes in particle physics is as follows [12]. For a physics variable t we have a true distribution f(t), mapped in x, an n-vector of unknowns, and a measured distribution g(s) (for a measured variable s), mapped in an m-vector of measured data. A response matrix A ∈ R^{m×n} encodes a kernel function K(s, t) describing the physical measurement process [12–15]. The direct and inverse processes are described by the Fredholm integral equation [16] of the first kind, for a specific domain Ω:

\int_{\Omega} K(s, t)\, f(t)\, dt = g(s).   (24)

In particle physics the kernel function K(s, t) is usually known from a Monte Carlo sample obtained from simulation. A numerical solution is obtained using the linear equation Ax = y. Vectors x and y are assumed to be 1-dimensional in theory, but they can be multidimensional in practice (considering multiple independent linear equations). In practice, the statistical properties of the measurements are also well known, and often they follow Poisson statistics [17]. To solve the linear systems we have different numerical methods.

The first method is based on the linear transformation x = A^{#} y. If m = n, then A^{#} = A^{-1} and we can use direct Gaussian methods, iterative methods (Gauss-Seidel, Jacobi, or SOR), or orthogonal methods (based on Householder transformations, Givens methods, or the Gram-Schmidt algorithm). If m > n (the most frequent scenario), we construct the matrix A^{#} = (A^T A)^{-1} A^T (called the Penrose-Moore pseudoinverse). In these cases the orthogonal methods offer very good and stable numerical solutions.
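For the m > n case, the pseudoinverse solution x = (A^T A)^{−1} A^T y can be illustrated with a small 3×2 system in pure Python (the response matrix and data below are made up for the sketch):

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve_2x2(M, b):
    """Cramer's rule for the 2x2 normal equations."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - b[1] * M[0][1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

def pseudoinverse_solve(A, y):
    """x = (A^T A)^{-1} A^T y for an overdetermined system with n = 2."""
    At = transpose(A)
    return solve_2x2(matmul(At, A), matvec(At, y))

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # response matrix, m = 3 > n = 2
y = [1.0, 2.0, 3.0]                        # consistent data from x = (1, 2)
x = pseudoinverse_solve(A, y)
```

In production codes the normal equations are avoided in favor of QR or SVD factorizations, as the text notes, because (A^T A) squares the condition number.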


The second method considers the singular value decomposition

$$A = U \Sigma V^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T, \quad (25)$$

where $U \in R^{m \times n}$ and $V \in R^{n \times n}$ are matrices with orthonormal columns and the diagonal matrix is $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n) = U^T A V$. The solution is

$$x = A^{+} y = V \Sigma^{-1} (U^T y) = \sum_{i=1}^{n} \frac{1}{\sigma_i} (u_i^T y)\, v_i = \sum_{i=1}^{n} \frac{1}{\sigma_i} c_i v_i, \quad (26)$$

where $c_i = u_i^T y$, $i = 1, \ldots, n$, are called the Fourier coefficients.
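Equations (25)-(26) can be illustrated with a short NumPy sketch. The $6 \times 4$ response matrix and the measured vector below are toy stand-ins for a real detector response, not values from the paper:

```python
import numpy as np

# Toy overdetermined system (m > n); A and y stand in for a real
# response matrix and measured vector.
rng = np.random.default_rng(0)
m, n = 6, 4
A = rng.normal(size=(m, n))
x_true = np.array([1.0, 2.0, 3.0, 4.0])
y = A @ x_true

# First method: Moore-Penrose pseudoinverse, x = (A^T A)^{-1} A^T y
x_pinv = np.linalg.solve(A.T @ A, A.T @ y)

# Second method: SVD, x = sum_i (u_i^T y / sigma_i) v_i, as in (26)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
c = U.T @ y                     # Fourier coefficients c_i = u_i^T y
x_svd = Vt.T @ (c / s)

assert np.allclose(x_pinv, x_true) and np.allclose(x_svd, x_true)
```

In a real unfolding the small $\sigma_i$ amplify statistical noise in the coefficients $c_i$, which is why regularized variants of (26) are used in practice [15].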

2.3.2. Random Matrix Theory. The analysis of particle spectra (e.g., the neutrino spectrum) faces Random Matrix Theory (RMT), especially if we consider anarchic neutrino masses. RMT means the study of the statistical properties of the eigenvalues of very large matrices [18]. For an interaction matrix $A$ (with size $N$), where $A_{ij}$ is an independently distributed random variable and $A^H$ is the conjugate transpose matrix, we define $M = A + A^H$, which describes a Gaussian Unitary Ensemble (GUE). The GUE properties are described by the probability distribution $P(M)\,dM$: (1) it is invariant under unitary transformations, $P(M)\,dM = P(M')\,dM'$, where $M' = U^H M U$ and $U$ is a unitary matrix ($U^H U = I$); (2) the elements of the matrix $M$ are statistically independent, $P(M) = \prod_{i \le j} P_{ij}(M_{ij})$; and (3) the matrix $M$ can be diagonalized as $M = U D U^H$, where $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$, $\lambda_i$ is an eigenvalue of $M$, and $\lambda_i \le \lambda_j$ if $i < j$. Properties (2) and (3) give

$$\text{Property (2): } P(M)\,dM \sim dM\, \exp\left(-\frac{N}{2}\, \mathrm{Tr}\,(M^H M)\right),$$

$$\text{Property (3): } P(M)\,dM \sim dU \prod_i d\lambda_i \prod_{i<j} (\lambda_i - \lambda_j)^2 \times \exp\left(-\frac{N}{2} \sum_i \lambda_i^2\right). \quad (27)$$

The numerical methods used for eigenvalue computation are the QR method and the Power methods (direct and indirect). The QR method is a numerically stable algorithm, while the Power method is an iterative one. RMT can be used for many-body systems, quantum chaos, disordered systems, quantum chromodynamics, and so forth.
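As a minimal numerical sketch (not from the paper), one can draw an $N \times N$ matrix $A$ with i.i.d. complex Gaussian entries, form $M = A + A^H$, and compute its spectrum; $N = 500$ is an illustrative size:

```python
import numpy as np

# Build a GUE-type Hermitian matrix M = A + A^H from i.i.d. complex
# Gaussian entries; N = 500 is an illustrative choice.
rng = np.random.default_rng(1)
N = 500
A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
M = A + A.conj().T                      # Hermitian by construction

# eigvalsh reduces M to tridiagonal form and applies a QR-type
# iteration, the numerically stable route for Hermitian matrices.
eigenvalues = np.linalg.eigvalsh(M)     # returned in ascending order

assert np.allclose(M, M.conj().T)       # M is Hermitian
assert np.all(np.diff(eigenvalues) >= 0)
```

The ordered eigenvalues returned here are exactly the $\lambda_1 \le \cdots \le \lambda_N$ of Property (3), so their spacing statistics can be compared directly with the GUE predictions.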

2.3.3. Kernel Estimation in HEP. Kernel estimation is a very powerful and relevant method for HEP when it is necessary to combine data from heterogeneous sources, like MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data $\{x_i\}_{1 \le i \le n}$ with a constant bandwidth $h$ (the difference between two consecutive data values), called the smoothing parameter, we have the estimate

$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right), \quad (28)$$

where $K$ is an estimator. For example, a Gaussian estimator with mean $\mu$ and standard deviation $\sigma$ is

$$K(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \quad (29)$$

and has the following properties: it is positive definite and infinitely differentiable (due to the exp function), and it can be defined for infinite support ($n \to \infty$). The kernel is a nonparametric method, which means that $h$ is independent of the dataset, and for a large amount of normally distributed data we can find a value for $h$ that minimizes the integrated squared error of $\hat f(x)$. This value for the bandwidth is computed as

$$h^* = \left(\frac{4}{3n}\right)^{1/5} \sigma. \quad (30)$$
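A minimal sketch of (28)-(30), assuming a hypothetical normally distributed sample with $\sigma$ estimated from the data:

```python
import numpy as np

# Fixed-bandwidth kernel density estimate with the rule-of-thumb h*.
rng = np.random.default_rng(2)
data = rng.normal(size=1000)                    # hypothetical sample
n = data.size
sigma = data.std(ddof=1)
h = (4.0 / (3.0 * n)) ** 0.2 * sigma            # h* = (4/(3n))^{1/5} * sigma

def gauss_kernel(u):
    # standard Gaussian kernel: (29) with mu = 0, sigma = 1
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def f_hat(x):
    # fixed-bandwidth estimate (28)
    return gauss_kernel((x - data) / h).sum() / (n * h)

# near the mode of a standard normal the estimate should be close to
# the true density value 1/sqrt(2*pi) ~ 0.399
assert abs(f_hat(0.0) - 0.3989) < 0.1
```

The same `f_hat` skeleton extends to the adaptive estimate discussed next by replacing the global `h` with per-point bandwidths $h_i$.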

The main problem in kernel estimation is that the dataset $\{x_i\}_{1 \le i \le n}$ is not normally distributed, and in real experiments the optimal bandwidth is not known. An improvement of the presented method considers the adaptive kernel estimation proposed by Abramson [20], where $h_i = h / \sqrt{f(x_i)}$ and $\sigma$ are considered global qualities of the dataset. The new form is

$$\hat f_a(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i} K\left(\frac{x - x_i}{h_i}\right), \quad (31)$$

and the local bandwidth value that minimizes the integrated squared error of $\hat f_a(x)$ is

$$h_i^* = \left(\frac{4}{3n}\right)^{1/5} \sqrt{\frac{\sigma}{f(x_i)}}, \quad (32)$$

where $f$ is the normal estimator.

Kernel estimation is used from event selection to confidence level evaluation, for example, in Markovian Monte Carlo chains or in the selection of neural network outputs used in experiments for the reconstructed Higgs mass. In general, the main usage of kernel estimation in HEP is searching for new particles by finding relevant data in a large dataset.

A method based on kernel estimation is the graphical representation of datasets using the advanced shifted histogram (ASH) algorithm. This is a numerical interpolation for large datasets, with the main aim of creating a set of $n_{\mathrm{bin}}$ histograms $H = \{H_i\}$ with the same bin width $h$. Algorithm 4 presents the steps of histogram generation, starting with a specific interval $[a, b]$, a number of points $n$ in this interval, a number of bins, and a number of values $m$ used for kernel estimation. Figure 6 shows the results of kernel estimation for the function $f = -(1/2)x^2$ on $[0, 1]$ and its graphical representation with different numbers of bins. The values on the vertical axis are aggregated in step 17 of Algorithm 4 and increase with the number of bins.

2.3.4. Performance of Numerical Algorithms Used in Particle Physics. All operations used in the presented methods for particle physics (unfolding processes, Random Matrix Theory,


(1) procedure ASH(a, b, n, x, nbin, m, f_m)
(2)   δ = (b − a)/nbin; h = mδ
(3)   for k = 1, nbin do
(4)     v_k = 0; y_k = 0
(5)   end for
(6)   for i = 1, n do
(7)     k = ⌊(x_i − a)/δ⌋ + 1
(8)     if k ∈ [1, nbin] then
(9)       v_k = v_k + 1
(10)    end if
(11)  end for
(12)  for k = 1, nbin do
(13)    if v_k = 0 then
(14)      k = k + 1
(15)    end if
(16)    for i = max{1, k − m + 1}, min{nbin, k + m − 1} do
(17)      y_i = y_i + v_k f_m(i − k)
(18)    end for
(19)  end for
(20)  for k = 1, nbin do
(21)    y_k = y_k/(nh)
(22)    t_k = a + (k − 0.5)δ
(23)  end for
(24)  return {t_k}_{1≤k≤nbin}, {y_k}_{1≤k≤nbin}
(25) end procedure

Algorithm 4: Advanced shifted histogram (1D algorithm).
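Algorithm 4 can be sketched in Python with 0-based indexing; the triangular weight used for `f_m` and the uniform sample below are illustrative choices, not fixed by the text:

```python
import numpy as np

# Sketch of the 1D advanced shifted histogram (Algorithm 4).
def ash_1d(a, b, x, nbin, m):
    delta = (b - a) / nbin              # bin width
    h = m * delta                       # smoothing width
    counts = np.zeros(nbin)
    y = np.zeros(nbin)
    # steps (6)-(11): raw histogram counts
    idx = np.floor((x - a) / delta).astype(int)
    for k in idx[(idx >= 0) & (idx < nbin)]:
        counts[k] += 1
    # steps (12)-(19): spread each bin count over 2m - 1 neighbours
    def f_m(j):                         # triangular kernel weight (a choice)
        return max(0.0, 1.0 - abs(j) / m)
    for k in range(nbin):
        for i in range(max(0, k - m + 1), min(nbin, k + m)):
            y[i] += counts[k] * f_m(i - k)
    # steps (20)-(23): normalise and compute bin centres
    y /= x.size * h
    t = a + (np.arange(nbin) + 0.5) * delta
    return t, y

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=2000)
t, y = ash_1d(0.0, 1.0, x, nbin=50, m=5)
assert abs(float(y.sum()) / 50 - 1.0) < 0.1   # density integrates to ~1
```

Away from the interval edges the estimate integrates to one; the small deficit in the assertion comes from the truncated smoothing window at the boundary bins.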

and kernel estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21] the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, extended in precision, is described. This permits higher internal precision and mixed input/output types. The extended precision allows the implementation of some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check of an $Ax = y$ system, with $A \in R^{n \times n}$ a dense matrix. The residual check is defined as

$$r_\infty = \frac{\|Ax - y\|_\infty}{n \epsilon \left(\|A\|_\infty \|x\|_\infty + \|y\|_\infty\right)} < 16, \quad (33)$$

where $\epsilon$ is the relative machine precision for the IEEE representation standard, $\|y\|_\infty$ is the infinity norm of a vector, $\|y\|_\infty = \max_{1 \le i \le n} |y_i|$, and $\|A\|_\infty$ is the infinity norm of a matrix, $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |A_{ij}|$.
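The residual check (33) can be evaluated directly with NumPy; the $200 \times 200$ random system below is illustrative, and $\epsilon$ is taken from the float64 representation:

```python
import numpy as np

# Solve a dense random system and evaluate the residual check (33).
rng = np.random.default_rng(4)
n = 200
A = rng.normal(size=(n, n))
y = rng.normal(size=n)
x = np.linalg.solve(A, y)

eps = np.finfo(np.float64).eps                  # 2^-52 for float64
norm_A = np.abs(A).sum(axis=1).max()            # ||A||_inf, max row sum
r_inf = np.abs(A @ x - y).max() / (
    n * eps * (norm_A * np.abs(x).max() + np.abs(y).max())
)
assert r_inf < 16.0                             # acceptance threshold in (33)
```

A correctly solved system yields $r_\infty$ well below 16; values above the threshold flag a wrong or unstable solution.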

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for simple and double precision. For simple precision, $\epsilon_s = 2^{-24}$, for all $n \ge 1.05 \times 10^6$ the residual check is always lower than the imposed upper bound; similarly for double precision, with $\epsilon_d = 2^{-53}$, for all $n \ge 5.63 \times 10^{14}$. If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for the data volume in this model.

In a single-processor system, the complexity of an algorithm depends only on the problem size $n$. We can assume $T(n) = \Theta(f(n))$, where $f(n)$ is a fundamental function ($f(n) \in \{1, n^\alpha, a^n, \log n, \sqrt{n}, \ldots\}$). In parallel systems (multiprocessor systems with $p$ processors) we have the serial processing time $T^*(n) = T_1(n)$ and the parallel processing time $T_p(n)$. The performance of parallel algorithms can be analyzed using the speed-up, efficiency, and isoefficiency metrics.

(i) The speed-up $S(p)$ represents how much faster a parallel algorithm is than a corresponding sequential algorithm. The speed-up is defined as $S(p) = T_1(n)/T_p(n)$. There are special bounds for the speed-up [23]: $S(p) \le p\bar{p}/(p + \bar{p} - 1)$, where $\bar{p} = T_1/T_\infty$ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually $S(p) \le p$, but under special circumstances the speed-up can be $S(p) > p$ [24]. Another upper bound is established by Amdahl's law: $S(p) = 1/(s + (1-s)/p) \le 1/s$, where $s$ is the fraction of a program that is sequential. The upper bound corresponds to zero time for the parallel fraction.

(ii) The efficiency is the average utilization of the $p$ processors: $E(p) = S(p)/p$.

(iii) The isoefficiency is the growth rate of the workload $W_p(n) = p T_p(n)$ in terms of the number of processors needed to keep the efficiency fixed. If we consider $W_1(n) - E W_p(n) = 0$ for any fixed efficiency $E$, we obtain $p = p(n)$. This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of $n$ numbers using $p$ processors, we have $n \approx E(n + p \log p)$, so $n = \Theta(p \log p)$.
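The speed-up and efficiency metrics above can be illustrated numerically, assuming a hypothetical sequential fraction $s = 0.05$:

```python
# Amdahl's law S(p) = 1/(s + (1-s)/p) and efficiency E(p) = S(p)/p
# for a hypothetical program with a 5% sequential fraction.
def amdahl_speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

def efficiency(s, p):
    return amdahl_speedup(s, p) / p

s = 0.05
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(s, p), 2), round(efficiency(s, p), 3))
# the speed-up saturates toward the 1/s = 20 ceiling while the
# efficiency collapses as p grows
```

This is exactly the trade-off that the isoefficiency analysis below quantifies: to hold $E(p)$ fixed, the problem size must grow with $p$.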

Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins: (a) 10, (b) 100, and (c) 1000, over the interval $[a, b] = [0, 1]$.

Figure 7: Residual check analysis for solving the $Ax = y$ system in HPL 2.0, using simple and double precision representations; the plot shows simple precision, double precision, and the HPL 2.0 bound on logarithmic scales ($\log_{10}(n)$ on the horizontal axis).

Numerical algorithms use a hypercube architecture for implementation. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers $T_{\mathrm{com}} = t_s + L t_w$, where $t_s$ is the latency (the time needed by a message to cross through the network) and $t_w$

Table 2: Isoefficiency for a hypercube architecture, $n = \Theta(p \log p)$ and $n = \Theta(p (\log p)^2)$. We mark with (*) the limitations imposed by Formula (33).

Scenario | Architecture size ($p$) | $n = \Theta(p \log p)$ | $n = \Theta(p (\log p)^2)$
1 | $10^1$ | $1.0 \times 10^1$ | $1.00 \times 10^1$
2 | $10^2$ | $2.0 \times 10^2$ | $8.00 \times 10^2$
3 | $10^3$ | $3.0 \times 10^3$ | $2.70 \times 10^4$
4 | $10^4$ | $4.0 \times 10^4$ | $6.40 \times 10^5$ (*)
5 | $10^5$ | $5.0 \times 10^5$ (*) | $1.25 \times 10^7$
6 | $10^6$ | $6.0 \times 10^6$ | $2.16 \times 10^8$
7 | $10^7$ | $7.0 \times 10^7$ | $3.43 \times 10^9$
8 | $10^8$ | $8.0 \times 10^8$ | $5.12 \times 10^{10}$
9 | $10^9$ | $9.0 \times 10^9$ | $7.29 \times 10^{11}$

is the time needed to send a word ($1/t_w$ is called the bandwidth), and $L$ is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define $t_c$ as the processing time per word for a processor. We have the following results.

(i) External product $x y^T$. The isoefficiency is written as

$$t_c n \approx E \left(t_c n + (t_s + t_w)\, p \log p\right) \Longrightarrow n = \Theta(p \log p). \quad (34)$$

The parallel processing time is $T_p = t_c n / p + (t_s + t_w) \log p$. The optimum is computed from

$$\frac{dT_p}{dp} = 0 \Longrightarrow -t_c \frac{n}{p^2} + \frac{t_s + t_w}{p} = 0 \Longrightarrow p \approx \frac{t_c n}{t_s + t_w}. \quad (35)$$
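The optimality condition (35) can be checked numerically; the timing constants below are illustrative, not measured values:

```python
import math

# T_p = t_c*n/p + (t_s + t_w)*log p (natural log, matching the derivative
# in (35)); p_opt = t_c*n/(t_s + t_w) should minimise it.
def parallel_time(n, p, t_c, t_s, t_w):
    return t_c * n / p + (t_s + t_w) * math.log(p)

t_c, t_s, t_w, n = 1e-9, 1e-6, 1e-8, 10**7   # illustrative constants
p_opt = t_c * n / (t_s + t_w)

# T_p is minimal near p_opt: halving or doubling p gives a larger time
assert parallel_time(n, p_opt, t_c, t_s, t_w) < parallel_time(n, p_opt / 2, t_c, t_s, t_w)
assert parallel_time(n, p_opt, t_c, t_s, t_w) < parallel_time(n, 2 * p_opt, t_c, t_s, t_w)
```

Past $p_{\mathrm{opt}}$ the $\log p$ communication term dominates the shrinking $t_c n / p$ computation term, so adding processors slows the operation down.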

(ii) Scalar product (internal product) $x^T y = \sum_{i=1}^{n} x_i y_i$. The isoefficiency is written as

$$t_c n^2 \approx E\left(t_c n^2 + \frac{t_s}{2}\, p \log p + \frac{t_w}{2}\, n \sqrt{p} \log p\right) \Longrightarrow n = \Theta\left(p (\log p)^2\right). \quad (36)$$

(iii) Matrix-vector product $y = Ax$, $y_i = \sum_{j=1}^{n} A_{ij} x_j$. The isoefficiency is written as

$$t_c n^2 \approx E\left(t_c n^2 + t_s\, p \log p + t_w\, n \sqrt{p} \log p\right) \Longrightarrow n = \Theta\left(p (\log p)^2\right). \quad (37)$$

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound $n \ge 1.05 \times 10^6$ are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound for the Big Data volume in this case.

The factors that determine the efficiency of parallel algorithms are task balancing (the workload distribution between all used processors in a system, to be maximized), concurrency (the number/percentage of processors working simultaneously, to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing, to be minimized).


Figure 8: Processing flows for HEP experiments. Experimental data (approximately TB of data, a Big Data problem) and simulation (MC data generated from theoretical models, also a Big Data problem) each pass through modeling and reconstruction (finding the properties of observed particles) and analysis (reducing the results to simple statistical distributions); the two outcomes are then compared.

3. New Challenges for Big Data Science

There are many applications that generate Big Data: social networking profiles, social influence, SaaS and Cloud applications, public web information, MapReduce, scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications continuously produce increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the approaches and solutions to better answer user needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TB of valuable data) and simulation data (data generated using MC based on theoretical models). The processing phase is represented by modeling and reconstruction, in order to find the properties of observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results will validate how realistic a simulation experiment is and will validate it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes in resource availability, and so forth. We pay special attention to Cloud systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are maximization of system throughput, sites' autonomy, scalability, fault-tolerance, and quality of services.

When discussing Big Data, we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need for many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrive too fast for optimal storage and indexing

[28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, Hadoop YARN: a next-generation framework for Hadoop data processing, 2013, http://hortonworks.com/hadoop/yarn), Corona (Corona: under the hood: scheduling MapReduce jobs more efficiently with Corona, 2012, Facebook), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future Scientific Data Infrastructure are presented in [31]. That paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science, and proposes the SDI generic architecture model, which provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure with cost advantages and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.

Data transfer scheduling in Grids, Clouds, P2P systems, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time, for finding a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and querying, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection


of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data are too heterogeneous to fit into a rigid schema [38].

Another challenge is represented by the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, as well as dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization are the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and the optimization of short jobs [43]. The cost effectiveness, scalability, and streamlined architectures of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to a public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), complex task dependencies for data workflows, and new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method, the proper numerical methods have been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper puts together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research presented in this paper is supported by the following projects: "SideSTEP: Scheduling Methods for Dynamic Distributed Systems: a self-* approach" (PN-II-CT-RO-FR-2012-1-0084); "ERRIC: Empowering Romanian Research on Intelligent Information Technologies", FP7-REGPOT-2010-1, ID 264207; and CyberWater, Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, Project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.

[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.

[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.

[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678-683, 2006.

[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488-493, 2002.

[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585-591, 2009.

[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43-48, 2009.

[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 05, 2006.

[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219-228, 1998.

[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311-322, 1998.

[11] O. Behnke, K. Kroninger, G. Schott, and T. H. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.

[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).

[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.

[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.

[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469-481, 1996.

[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333-345, 2013.

[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533-540, Springer, Berlin, Germany, 2009.

[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233-297, 2005.

[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198-207, 2001.

[20] I. S. Abramson, "On bandwidth variation in kernel estimates: a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217-1223, 1982.

[21] X. Li, J. W. Demmel, D. H. Bailey, et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152-205, 2002.

[22] J. J. Dongarra and J. Langou, "The problem with the Linpack benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5-13, 2009.

[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640-649, April 1995.

[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.

[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.

[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710-2721, 2013.

[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014-2015, 2012.

[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1-5, Springer, Berlin, Germany, 2013.

[29] B. Hindman, A. Konwinski, M. Zaharia, et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.

[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.

[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614-617, IEEE Computer Society, Washington, DC, USA, 2012.

[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.

[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097-1111, 2013.

[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153-172, 2012.

[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109-110, ACM, New York, NY, USA, 2013.

[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (PGCIC '11), pp. 59-66, Barcelona, Spain, October 2011.

[37] J. Chen, Y. Chen, X. Du, et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157-164, 2013.

[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7-11, ACM, New York, NY, USA, 2012.

[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973-985, 2013.

[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104-112, 2013.

[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355-367, 2007.

[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.

[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985-996, 2013.


Advances in High Energy Physics 3

(1) procedure Random Generator Poisson(μ)
(2)   number ← −1
(3)   accumulator ← 1.0
(4)   q ← exp(−μ)
(5)   while accumulator > q do
(6)     rnd_number ← RND()
(7)     accumulator ← accumulator ∗ rnd_number
(8)     number ← number + 1
(9)   end while
(10) return number
(11) end procedure

Algorithm 1: Random number generation for Poisson distribution using many random generated numbers with normal distribution (RND).
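As an illustration, Algorithm 1 can be sketched in Python; this is a minimal, hypothetical translation (the function name is mine), with the standard library's uniform generator standing in for RND():

```python
import math
import random

def poisson_multiplication(mu, rnd=random.random):
    # Algorithm 1: multiply fresh uniform numbers until the running
    # product drops below q = exp(-mu); the loop count follows Poisson(mu).
    number = -1
    accumulator = 1.0
    q = math.exp(-mu)
    while accumulator > q:
        accumulator *= rnd()   # one new uniform number per iteration
        number += 1
    return number
```

The sample mean of many such draws approaches μ, matching E(k) = μ.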

quadratures. In event generation, pseudorandom numbers are used most often.

The most popular HEP application uses the Poisson distribution combined with a basic normal distribution. The Poisson distribution can be formulated as

P[X = k] = (μ^k / k!) exp(−μ),  k = 0, 1, . . . , (1)

with E(k) = V(k) = μ (V is the variance and E is the expectation value). Having a uniform random number generator called RND() (Random based on Normal Distribution), we can use the following two algorithms for event generation techniques.

The result of running Algorithms 1 and 2 to generate around 10^6 random numbers is presented in Figure 2. In general, the second algorithm gives better results for the Poisson distribution. The general recommendation for HEP experiments indicates the use of popular random number generators like TRNG (True Random Number Generators), RANMAR (Fast Uniform Random Number Generator used in CERN experiments), RANLUX (an algorithm developed by Luscher, used by Unix random number generators), and Mersenne Twister (the "industry standard"). Random number generators provided with compilers, operating systems, and programming language libraries can have serious problems, because they are based on the system clock and suffer from lack of uniformity of distribution for large amounts of generated numbers and from correlation of successive values.

The art of event generation is to use appropriate combinations of various random number generation methods in order to construct an efficient event generation algorithm as a solution to a given problem in HEP.

2.2. Monte Carlo Simulation and Markovian Monte Carlo Chains in HEP. In general, a Monte Carlo (MC) method is any simulation technique that uses random numbers to solve a well-defined problem P. If F is a solution of the problem P (e.g., F ∈ R^n or F has a Boolean value), we define F̂, an estimation of F, as F̂ = f(r₁, r₂, . . . , rₙ), where (rᵢ)_{1≤i≤n} is a random variable that can take more than one value and for which any value that will be taken cannot be predicted in

Figure 2: General approach of event generation, detection, and reconstruction. [Panels: (a) Random generator Poisson (Alg. 1); (b) Random generator Poisson RND (Alg. 2); frequency versus generated numbers.]

advance. If ρ(r) is the probability density function, ρ(r)dr = P[r < r′ < r + dr], the cumulative distribution function is

C(r) = ∫_{−∞}^{r} ρ(x) dx ⟹ ρ(r) = dC(r)/dr. (2)

C(r) is a monotonically nondecreasing function with all values in [0, 1]. The expectation value is

E(f) = ∫ f(r) dC(r) = ∫ f(r) ρ(r) dr, (3)

and the variance is

V(f) = E[f − E(f)]² = E(f²) − E²(f). (4)

2.2.1. Monte Carlo Event Generation and Simulation. To define a MC estimator, the "Law of Large Numbers" (LLN) is used. LLN can be described as follows: let one choose n


(1) procedure Random Generator Poisson RND(μ, r)
(2)   number ← 0
(3)   q ← exp(−μ)
(4)   accumulator ← q
(5)   p ← q
(6)   while r > accumulator do
(7)     number ← number + 1
(8)     p ← p ∗ μ/number
(9)     accumulator ← accumulator + p
(10)  end while
(11)  return number
(12) end procedure

Algorithm 2: Random number generation for Poisson distribution using one random generated number with normal distribution.
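A hedged Python sketch of Algorithm 2 follows; the division by number in the recurrence p_k = p_{k−1} · μ/k is assumed, as in the standard inversion method (the extracted listing only shows the product):

```python
import math
import random

def poisson_inversion(mu, r):
    # Algorithm 2: invert the Poisson CDF with a single uniform number r,
    # accumulating P[X = k] terms via the recurrence p_k = p_{k-1} * mu / k.
    number = 0
    q = math.exp(-mu)
    accumulator = q   # running CDF value, starts at P[X = 0]
    p = q             # current probability mass P[X = number]
    while r > accumulator:
        number += 1
        p *= mu / number
        accumulator += p
    return number
```

Only one uniform number r is consumed per generated Poisson value, in contrast to Algorithm 1.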

numbers rᵢ randomly, with the probability density function uniform on a specific interval (a, b), each rᵢ being used to evaluate f(rᵢ). For large n (consistent estimator),

(1/n) Σ_{i=1}^{n} f(rᵢ) → E(f) = (1/(b − a)) ∫_{a}^{b} f(r) dr. (5)

The properties of a MC estimator are as follows: it is normally distributed (with Gaussian density); the standard deviation is σ = √(V(f)/n); MC is unbiased for all n (the expectation value is the real value of the integral); the estimator is consistent if V(f) < ∞ (the estimator converges to the true value of the integral for very large n); a sampling phase can be applied to compute the estimator if we do not know anything about the function f (it is just suitable for integration). The sampling phase can be expressed in a stratified way as

∫_{a}^{b} f(r) dr = ∫_{a}^{r₁} f(r) dr + ∫_{r₁}^{r₂} f(r) dr + ⋅⋅⋅ + ∫_{rₙ}^{b} f(r) dr. (6)
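The estimator of eq. (5), together with its standard deviation σ = √(V(f)/n), can be sketched as follows (a minimal illustration, not code from the paper):

```python
import math
import random

def mc_integrate(f, a, b, n, rng=random.random):
    # Plain MC estimator for the integral of f over (a, b), following eq. (5):
    # (b - a) times the sample mean of f at n uniform points.
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        fx = f(a + (b - a) * rng())
        total += fx
        total_sq += fx * fx
    mean = total / n
    variance = total_sq / n - mean * mean                 # sample estimate of V(f)
    estimate = (b - a) * mean
    sigma = (b - a) * math.sqrt(max(variance, 0.0) / n)   # std. dev. of the estimator
    return estimate, sigma
```

The n^{−1/2} convergence visible in sigma is the same rate quoted for Monte Carlo in Table 1, independent of dimension.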

MC estimations and MC event generators are necessary tools in most HEP experiments, being used at all their steps: experiment preparation, simulation running, and data analysis.

An example of MC estimation is the Lorentz invariant phase space (LIPS), which describes the cross section for a typical HEP process with n particles in the final state:

σₙ ∼ ∫ |M|² dRₙ, (7)

where M is the matrix describing the interaction between particles and dRₙ is the element of LIPS. We have the following estimation:

Rₙ(P; p₁, p₂, . . . , pₙ) = ∫ δ⁽⁴⁾(P − Σ_{k=1}^{n} p_k) ∏_{k=1}^{n} (δ(p_k² − m_k²) Θ(p_k⁰) d⁴p_k), (8)

Figure 3: Example of particle interaction e⁺(p₁) + e⁻(p₂) → μ⁺(q₁) μ⁻(q₂) (θ is the polar angle and Φ the azimuthal angle with respect to the z axis).

where P is the total four-momentum of the n-particle system, p_k and m_k are the four-momenta and masses of the final state particles, δ⁽⁴⁾(P − Σ_{k=1}^{n} p_k) is the total energy-momentum conservation, and δ(p_k² − m_k²) is the on-mass-shell condition for the final state system. Based on the integration formula

∫ δ(p_k² − m_k²) Θ(p_k⁰) d⁴p_k = d³p_k / (2p_k⁰), (9)

we obtain the iterative form for the cross section:

Rₙ(P; p₁, p₂, . . . , pₙ) = ∫ R_{n−1}(P − pₙ; p₁, p₂, . . . , p_{n−1}) d³pₙ / (2pₙ⁰), (10)

which can be numerically integrated by using the recurrence relation. As a result, we can construct a general MC algorithm for particle collision processes.

Example 1. Let us consider the interaction e⁺e⁻ → μ⁺μ⁻, where the Higgs boson contribution is numerically negligible. Figure 3 describes this interaction (Φ is the azimuthal angle, θ the polar angle, and p₁, p₂, q₁, q₂ are the four-momenta of the particles).

The cross section is

dσ = (α²/4s) [W₁(s)(1 + cos²θ) + W₂(s) cos θ] dΩ, (11)


where dΩ = d cos θ dΦ, α = e²/4π (the fine structure constant), s = (p₁⁰ + p₂⁰)² is the center-of-mass energy squared, and W₁(s) and W₂(s) are constant functions. For pure processes we have W₁(s) = 1 and W₂(s) = 0, and the total cross section becomes

σ = ∫₀^{2π} dΦ ∫_{−1}^{1} d cos θ (d²σ / (dΦ d cos θ)). (12)

We introduce the following notation:

ρ(cos θ, Φ) = d²σ / (dΦ d cos θ), (13)

and let us consider ρ̂(cos θ, Φ) an approximation of ρ(cos θ, Φ). Then σ̂ = ∬ dΦ d cos θ ρ̂. Now we can compute

σ = ∫₀^{2π} dΦ ∫_{−1}^{1} d cos θ ρ(cos θ, Φ)
  = ∫₀^{2π} dΦ ∫_{−1}^{1} d cos θ w(cos θ, Φ) ρ̂(cos θ, Φ)
  ≈ ⟨w⟩_ρ̂ ∫₀^{2π} dΦ ∫_{−1}^{1} d cos θ ρ̂(cos θ, Φ) = ⟨w⟩_ρ̂ σ̂, (14)

where w(cos θ, Φ) = ρ(cos θ, Φ)/ρ̂(cos θ, Φ) and ⟨w⟩_ρ̂ is the estimation of w based on ρ̂. Here the MC estimator is

⟨w⟩_MC = (1/n) Σ_{i=1}^{n} wᵢ, (15)

and the standard deviation is

s_MC = ((1/(n(n − 1))) Σ_{i=1}^{n} (wᵢ − ⟨w⟩_MC)²)^{1/2}. (16)

The final numerical result based on the MC estimator is

σ_MC = ⟨w⟩_MC ± s_MC. (17)

As we can see, the principle of a Monte Carlo estimator in physics is to simulate the cross section in interaction and radiation transport, knowing the probability distributions (or an approximation) governing each interaction of the elementary particles.

Based on this result, the Monte Carlo algorithm used to generate events is as follows. It takes as input ρ̂(cos θ, Φ) and, in a main loop, considers the following steps: (1) generate a pair (cos θ, Φ) from ρ̂; (2) compute the four-momenta p₁, p₂, q₁, q₂; (3) compute w = ρ/ρ̂. The loop can be stopped in the case of unweighted events, and we stay in the loop for weighted events. As output, the algorithm returns four-momenta for the particles for unweighted events, and four-momenta plus an array of weights for weighted events. The main issue is how to initialize the input of the algorithm. Based on the dσ formula (for W₁(s) = 1 and W₂(s) = 0) we can consider as input ρ̂(cos θ, Φ) = (α²/4s)(1 + cos²θ). Then σ̂ = 4πα²/3s.

In HEP, theoretical predictions used for modeling particle collision processes (as shown in the presented example) should be provided in terms of Monte Carlo event generators, which directly simulate these processes and can provide unweighted (weight = 1) events. A good Monte Carlo algorithm should be used not only for numerical integration [7] (i.e., to provide weighted events) but also for efficient generation of unweighted events, which is a very important issue for HEP.

2.2.2. Markovian Monte Carlo Chains. A classical Monte Carlo method estimates a function F with F̂ by using a random variable. The main problem with this approach is that we cannot predict any value in advance for a random variable. In HEP simulation experiments the systems are described by states [8]. Let us consider a system with a finite set of possible states S₁, S₂, . . . , and let S_t be the state at the moment t. The conditional probability is defined as

P(S_t = S_j | S_{t₁} = S_{i₁}, S_{t₂} = S_{i₂}, . . . , S_{tₙ} = S_{iₙ}), (18)

where the mappings (t₁, i₁), . . . , (tₙ, iₙ) can be interpreted as the description of the system's evolution in time, by specifying a specific state for each moment of time.

The system is a Markov chain if the distribution of S_t depends only on the immediate predecessor S_{t−1} and is independent of all previous states, as follows:

P(S_t = S_j | S_{t−1} = S_{i_{t−1}}, . . . , S_{t₂} = S_{i₂}, S_{t₁} = S_{i₁}) = P(S_t = S_j | S_{t−1} = S_{i_{t−1}}). (19)

To generate the time steps (t₁, t₂, . . . , tₙ) we use the probability of a single forward Markovian step, given by p(t | tₙ), with the property ∫_{tₙ}^{∞} p(t | tₙ) dt = 1, and we define p(t) = p(t | 0). The 1-dimensional Monte Carlo Markovian algorithm used to generate the time steps is presented in Algorithm 3.

The main result of Algorithm 3 is that P(t_max) follows a Poisson distribution:

P_N = ∫₀^{t_max} p(t₁ | t₀) dt₁ × ∫_{t₁}^{t_max} p(t₂ | t₁) dt₂ × ⋅⋅⋅ × ∫_{t_{N−1}}^{t_max} p(t_N | t_{N−1}) dt_N × ∫_{t_max}^{∞} p(t_{N+1} | t_N) dt_{N+1} = (1/N!) (t_max)^N e^{−t_max}. (20)

We can consider the 1-dimensional Monte Carlo Markovian algorithm as a method used to iteratively generate the system's states (codified as a Markov chain) in simulation experiments. According to the Ergodic Theorem for Markov chains, the chain defined here has a unique stationary probability distribution [9, 10].

Figures 4 and 5 present the running of Algorithm 3. According to the different values of the parameter s used to generate the next step, the results are very different for 1000 iterations. Figure 4, for s = 1, shows a noise-like profile. For s = 10, 100, 1000 the profile looks as if some of the information is filtered and lost. The best results are obtained for s = 0.01


(1) Generate t₁ according with p(t₁) = p(t₁ | t₀ = 0)
(2) if t₁ < t_max then  ⊳ Generate the initial state
(3)   P_{N≥1} = ∫₀^{t_max} p(t₁ | t₀) dt₁  ⊳ Compute the initial probability
(4)   Retain t₁
(5) end if
(6) if t₁ > t_max then  ⊳ Discard all generated and computed data
(7)   N = 0, P₀ = ∫_{t_max}^{∞} p(t₁ | t₀) dt₁ = e^{−t_max}
(8)   Delete t₁
(9)   EXIT  ⊳ The algorithm ends here
(10) end if
(11) i = 2
(12) while (1) do  ⊳ Infinite loop until a successful EXIT
(13)   Generate tᵢ according with p(tᵢ | tᵢ₋₁)
(14)   if tᵢ < t_max then  ⊳ Generate a new state and new probability
(15)     P_{N≥i} = ∫_{tᵢ}^{t_max} p(tᵢ | tᵢ₋₁) dtᵢ
(16)     Retain tᵢ
(17)   end if
(18)   if tᵢ > t_max then  ⊳ Discard all generated and computed data
(19)     N = i − 1, Pᵢ = ∫_{t_max}^{∞} p(tᵢ | tᵢ₋₁) dtᵢ
(20)     Retain (t₁, t₂, . . . , tᵢ₋₁), Delete tᵢ
(21)     EXIT  ⊳ The algorithm ends here
(22)   end if
(23)   i = i + 1
(24) end while

Algorithm 3: 1-dimensional Monte Carlo Markovian algorithm.

and s = 0.1, and the generated values can be easily accepted for MC simulation in HEP experiments. Figure 5 shows the acceptance rate of the values generated with parameter s used in the algorithm, and the parameter values are correlated with Figure 4. The results in Figure 5 show that the acceptance rate decreases rapidly with increasing values of the parameter s. The conclusion is that the values must be kept small to obtain meaningful data. A correlation with the normal distribution is evident, showing that a small value of the mean square deviation provides useful results.

2.2.3. Performance of Numerical Algorithms Used in MC Simulations. The numerical methods used to compute the MC estimator use numerical quadratures to approximate the value of the integral of a function f on a specific domain by a linear combination of function values and weights (wᵢ)_{1≤i≤m}, as follows:

∫_{a}^{b} f(r) dr = Σ_{i=1}^{m} wᵢ f(rᵢ). (21)

We can consider a consistent MC estimator as a classical numerical quadrature with all wᵢ = 1. The efficiency of integration methods for 1 dimension and for d dimensions is presented in Table 1. We can conclude that quadrature methods are difficult to apply in many dimensions for various integration domains (regions), and the integral is not easy to estimate.

Table 1: Efficiency of integration methods for 1 dimension and for d dimensions.

Method                      | 1 dimension | d dimensions
Monte Carlo                 | n^{−1/2}    | n^{−1/2}
Trapezoidal rule            | n^{−2}      | n^{−2/d}
Simpson's rule              | n^{−4}      | n^{−4/d}
m-points Gauss rule (m < n) | n^{−2m}     | n^{−2m/d}

As a practical example, in a typical high-energy particle collision there can be many final-state particles (even hundreds). If we have n final-state particles, we face a d = 3n − 4 dimensional phase space. As a numerical example, for n = 4 we have d = 8 dimensions, which is a very difficult case for classical numerical quadratures.

Full decomposition of the integration volume, with one double number (10 Bytes) per volume unit, takes n^d × 10 Bytes. For the example considered, with d = 8 and n = 10 divisions of the interval [0, 1], we have for one numerical integration

n^d × 10 Bytes = 10^8 × 10 / 1024³ GBytes ≈ 0.93 GBytes. (22)

Considering 10^6 events per second and one integration per event, the data produced in one hour will be ≈3197.4 PBytes.


Figure 4: Example of the 1-dimensional Monte Carlo Markovian algorithm (panels (a)–(f): state value versus iteration, 1000 iterations, for s = 0.01, 0.1, 1, 10, 100, 1000).

The previous assumption holds only for multidimensional arrays. Due to the factorization assumption, p(r₁, r₂, . . . , rₙ) = p(r₁)p(r₂) ⋅⋅⋅ p(rₙ), we obtain for one integration

n × d × 10 Bytes = 800 Bytes, (23)

which means ≈2.62 TBytes of data produced for one hour of simulations.
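The arithmetic above can be checked with a few lines (sizes as assumed in the text: 10-byte doubles, 10^6 events per second, one integration per event):

```python
GIB = 1024 ** 3
TIB = 1024 ** 4
PIB = 1024 ** 5

n, d = 10, 8                  # divisions per axis and phase-space dimensions
bytes_per_value = 10          # one double number stored on 10 bytes

# Full decomposition: n^d volume units for a single numerical integration.
full_integration = n ** d * bytes_per_value
integrations_per_hour = 10 ** 6 * 3600

full_per_hour = full_integration * integrations_per_hour

# Factorized densities p(r1)...p(rn): only n values per dimension are needed.
factorized_integration = n * d * bytes_per_value
factorized_per_hour = factorized_integration * integrations_per_hour
```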

2.3. Unfolding Processes in Particle Physics and Kernel Estimation in HEP. In particle physics analysis we have two types of distributions: true distributions (considered in theoretical models) and measured distributions (considered in experimental models, which are affected by the finite resolution and limited acceptance of existing detectors). A HEP interaction process starts with a true, known distribution and generates a measured distribution corresponding to an experiment of

Figure 5: Analysis of the acceptance rate (%) for the 1-dimensional Monte Carlo Markovian algorithm for different s values, plotted against log₁₀(s).

a well-confirmed theory. An inverse process starts with a measured distribution and tries to identify the true distribution. These unfolding processes are used to identify new theories based on experiments [11].

2.3.1. Unfolding Processes in Particle Physics. The theory of unfolding processes in particle physics is as follows [12]. For a physics variable t we have a true distribution f(t), mapped in x, an n-vector of unknowns, and a measured distribution g(s) (for a measured variable s), mapped in an m-vector of measured data. A response matrix A ∈ R^{m×n} encodes a kernel function K(s, t) describing the physical measurement process [12–15]. The direct and inverse processes are described by the Fredholm integral equation [16] of the first kind, for a specific domain Ω:

∫_Ω K(s, t) f(t) dt = g(s). (24)

In particle physics the kernel function K(s, t) is usually known from a Monte Carlo sample obtained from simulation. A numerical solution is obtained using the linear equation Ax = y. The vectors x and y are assumed to be 1-dimensional in theory, but they can be multidimensional in practice (considering multiple independent linear equations). In practice, the statistical properties of the measurements are also well known, and often they follow Poisson statistics [17]. To solve the linear systems we have different numerical methods.

The first method is based on the linear transformation x = Ãy. If m = n, then Ã = A⁻¹ and we can use direct Gaussian methods, iterative methods (Gauss-Seidel, Jacobi, or SOR), or orthogonal methods (based on the Householder transformation, Givens methods, or the Gram-Schmidt algorithm). If m > n (the most frequent scenario), we construct the matrix Ã = (AᵀA)⁻¹Aᵀ (called the Penrose-Moore pseudoinverse). In these cases the orthogonal methods offer very good and stable numerical solutions.
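A minimal pure-Python sketch of the m > n case for n = 2, solving the normal equations (AᵀA)x = Aᵀy directly (a toy illustration only; as noted above, orthogonal methods are preferred for numerical stability):

```python
def matmul(A, B):
    # Dense matrix product of nested-list matrices.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def pseudoinverse_solve_2(A, y):
    # Least-squares solution x = (A^T A)^(-1) A^T y (Penrose-Moore pseudoinverse)
    # for an overdetermined m x 2 system, via the 2 x 2 normal equations.
    At = [list(col) for col in zip(*A)]
    M = matmul(At, A)                                       # A^T A (2 x 2)
    b = [sum(r * v for r, v in zip(row, y)) for row in At]  # A^T y
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    x0 = (M[1][1] * b[0] - M[0][1] * b[1]) / det            # Cramer's rule
    x1 = (M[0][0] * b[1] - M[1][0] * b[0]) / det
    return [x0, x1]
```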


The second method considers the singular value decomposition

A = UΣVᵀ = Σ_{i=1}^{n} σᵢ uᵢ vᵢᵀ, (25)

where U ∈ R^{m×n} and V ∈ R^{n×n} are matrices with orthonormal columns and the diagonal matrix is Σ = diag{σ₁, . . . , σₙ} = UᵀAV. The solution is

x = Ãy = VΣ⁻¹(Uᵀy) = Σ_{i=1}^{n} (1/σᵢ)(uᵢᵀy)vᵢ = Σ_{i=1}^{n} (1/σᵢ)cᵢvᵢ, (26)

where cᵢ = uᵢᵀy, i = 1, . . . , n, are called the Fourier coefficients.

2.3.2. Random Matrix Theory. The analysis of particle spectra (e.g., the neutrino spectrum) faces Random Matrix Theory (RMT), especially if we consider anarchic neutrino masses. RMT means the study of the statistical properties of the eigenvalues of very large matrices [18]. For an interaction matrix A (with size N), where A_ij is an independently distributed random variable and A^H is the complex conjugate and transpose matrix, we define M = A + A^H, which describes a Gaussian Unitary Ensemble (GUE). The GUE properties are described by the probability distribution P(M)dM: (1) it is invariant under unitary transformations, P(M)dM = P(M′)dM′, where M′ = U^H M U and U is a unitary matrix (U^H U = I); (2) the elements of the matrix M are statistically independent, P(M) = ∏_{i≤j} P_ij(M_ij); and (3) the matrix M can be diagonalized as M = UDU^H, where D = diag{λ₁, . . . , λ_N}, λᵢ is an eigenvalue of M, and λᵢ ≤ λⱼ if i < j. Consider

Property (2): P(M) dM ∼ dM exp{−(N/2) Tr(M^H M)},
Property (3): P(M) dM ∼ dU ∏ᵢ dλᵢ ∏_{i<j} (λᵢ − λⱼ)² exp{−(N/2) Σᵢ λᵢ²}. (27)

The numerical methods used for eigenvalue computation are the QR method and power methods (direct and indirect). The QR method is a numerically stable algorithm, and the power method is an iterative one. RMT can be used for many-body systems, quantum chaos, disordered systems, quantum chromodynamics, and so forth.
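As a toy illustration of the power method (an iterative scheme; it assumes a well-separated dominant eigenvalue and a starting vector not orthogonal to its eigenvector):

```python
def power_method(M, iters=200):
    # Power iteration for the dominant eigenvalue of a square matrix:
    # repeatedly apply M and renormalize with the infinity norm.
    n = len(M)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)   # infinity-norm of the iterate
        v = [x / lam for x in w]
    return lam, v
```

For full spectra, as needed in GUE studies, the numerically stable QR method is the usual choice.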

2.3.3. Kernel Estimation in HEP. Kernel estimation is a very powerful solution and a relevant method for HEP when it is necessary to combine data from heterogeneous sources, like MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data (xᵢ)_{1≤i≤n}, with a constant bandwidth h (the difference between two consecutive data values) called the smoothing parameter, we have the estimation

f̂(x) = (1/nh) Σ_{i=1}^{n} K((x − xᵢ)/h), (28)

where K is an estimator. For example, a Gauss estimator with mean μ and standard deviation σ is

K(x) = (1/(σ√(2π))) exp{−(x − μ)²/(2σ²)}, (29)

which has the following properties: it is positive definite and infinitely differentiable (due to the exp function), and it can be defined for an infinite support (n → ∞). The kernel is a nonparametric method, which means that h is independent of the dataset, and for a large amount of normally distributed data we can find a value of h that minimizes the integrated squared error of f̂(x). This value of the bandwidth is computed as

h* = (4/(3n))^{1/5} σ. (30)
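A minimal sketch of the fixed-bandwidth estimate of eq. (28) with the Gaussian kernel and the bandwidth of eq. (30) (the sample standard deviation stands in for σ; the function name is mine):

```python
import math

def gaussian_kde(data, x):
    # Fixed-bandwidth kernel estimate f_hat(x) of eq. (28) with a Gaussian
    # kernel and the rule-of-thumb bandwidth h* = (4/(3n))^(1/5) * sigma, eq. (30).
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in data) / n)
    h = (4.0 / (3.0 * n)) ** 0.2 * sigma
    norm = 1.0 / math.sqrt(2.0 * math.pi)
    return sum(norm * math.exp(-0.5 * ((x - d) / h) ** 2) for d in data) / (n * h)
```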

The main problem in kernel estimation is that the set of data (xᵢ)_{1≤i≤n} is not normally distributed, and in real experiments the optimal bandwidth is not known. An improvement of the presented method considers the adaptive kernel estimation proposed by Abramson [20], where hᵢ = h/√(f(xᵢ)) and σ are considered global qualities of the dataset. The new form is

f̂ₐ(x) = (1/n) Σ_{i=1}^{n} (1/hᵢ) K((x − xᵢ)/hᵢ), (31)

and the local bandwidth value that minimizes the integrated squared error of f̂ₐ(x) is

hᵢ* = (4/(3n))^{1/5} √(σ/f̂(xᵢ)), (32)

where f̂ is the normal estimator. Kernel estimation is used for event selection, for example for confidence level evaluation in Markovian Monte Carlo chains, or in the selection of the neural network output used in experiments for the reconstructed Higgs mass. In general, the main usage of kernel estimation in HEP is searching for new particles by finding relevant data in a large dataset.

A method based on kernel estimation is the graphical representation of datasets using the advanced shifted histogram algorithm (ASH). This is a numerical interpolation for large datasets, with the main aim of creating a set of n_bin histograms H = {Hᵢ} with the same bin-width h. Algorithm 4 presents the steps of histogram generation, starting with a specific interval [a, b], a number of points n in this interval, a number of bins, and a number of values m used for kernel estimation. Figure 6 shows the results of kernel estimation of the function f = −(1/2)x² on [0, 1] and its graphical representation with different numbers of bins. The values on the vertical axis are aggregated in step 17 of Algorithm 4 and increase with the number of bins.

2.3.4. Performance of Numerical Algorithms Used in Particle Physics. All operations used in the presented methods for particle physics (unfolding processes, Random Matrix Theory,


(1) procedure ASH(a, b, n, x, n_bin, m, f_m)
(2)   δ = (b − a)/n_bin; h = mδ
(3)   for k = 1, n_bin do
(4)     v_k = 0; y_k = 0
(5)   end for
(6)   for i = 1, n do
(7)     k = ⌊(x_i − a)/δ⌋ + 1
(8)     if k ∈ [1, n_bin] then
(9)       v_k = v_k + 1
(10)    end if
(11)  end for
(12)  for k = 1, n_bin do
(13)    if v_k = 0 then
(14)      k = k + 1
(15)    end if
(16)    for i = max{1, k − m + 1}, min{n_bin, k + m − 1} do
(17)      y_i = y_i + v_k f_m(i − k)
(18)    end for
(19)  end for
(20)  for k = 1, n_bin do
(21)    y_k = y_k/(nh);
(22)    t_k = a + (k − 0.5)δ
(23)  end for
(24)  return (t_k)_{1≤k≤n_bin}, (y_k)_{1≤k≤n_bin}
(25) end procedure

Algorithm 4: Advanced shifted histogram (1D algorithm).
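A hypothetical 0-based Python translation of Algorithm 4 (the kernel weight f_m and the skip-on-empty-bin step are interpreted freely):

```python
def ash(a, b, x, nbin, m, weight):
    # Sketch of Algorithm 4 with 0-based indexing: histogram the data into
    # nbin bins of width delta, then smooth each count over 2m - 1 neighbouring
    # bins with the kernel weight(i - k); h = m * delta is the smoothing width.
    n = len(x)
    delta = (b - a) / nbin
    h = m * delta
    v = [0] * nbin
    for xi in x:
        k = int((xi - a) / delta)
        if 0 <= k < nbin:
            v[k] += 1
    y = [0.0] * nbin
    for k in range(nbin):
        if v[k] == 0:
            continue
        for i in range(max(0, k - m + 1), min(nbin, k + m)):
            y[i] += v[k] * weight(i - k)
    y = [yk / (n * h) for yk in y]
    t = [a + (k + 0.5) * delta for k in range(nbin)]  # bin centres
    return t, y
```

With the standard triangular ASH weight w(d) = 1 − |d|/m, the estimate integrates to a density, approximately 1 for uniform data away from the boundaries.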

and kernel estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21], the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, by extension of precision, is described. This permits higher internal precision and mixed input/output types. The extended precision allows the implementation of some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check of an Ax = y system, with A ∈ R^{n×n} a dense matrix. The residual check is defined as

r_∞ = ‖Ax − y‖_∞ / (nε (‖A‖_∞ ‖x‖_∞ + ‖y‖_∞)) < 16, (33)

where ε is the relative machine precision for the IEEE representation standard, ‖y‖_∞ is the infinity norm of a vector, ‖y‖_∞ = max_{1≤i≤n} |yᵢ|, and ‖A‖_∞ is the infinity norm of a matrix, ‖A‖_∞ = max_{1≤i≤n} Σ_{j=1}^{n} |A_ij|.

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for simple and double precision. For simple precision, ε_s = 2⁻²⁴, and for all n ≥ 1.05 × 10⁶ the residual check is always lower than the imposed upper bound; similarly for double precision, with ε_d = 2⁻⁵³, for all n ≥ 5.63 × 10¹⁴. If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for the data volume in this model.

In a single-processor system the complexity of algorithms depends only on the problem size $n$. We can assume $T(n) = \Theta(f(n))$, where $f(n)$ is a fundamental function ($f(n) \in \{1, n^\alpha, a^n, \log n, \sqrt{n}, \ldots\}$). In parallel systems (multiprocessor systems with $p$ processors) we have the serial processing time $T^{*}(n) = T_1(n)$ and the parallel processing time $T_p(n)$. The performance of parallel algorithms can be analyzed using speed-up, efficiency, and isoefficiency metrics.

(i) The speed-up $S(p)$ represents how much faster a parallel algorithm is than a corresponding sequential algorithm. The speed-up is defined as $S(p) = T_1(n)/T_p(n)$. There are special bounds for speed-up [23]: $S(p) \le \bar{p}p/(p + \bar{p} - 1)$, where $\bar{p} = T_1/T_\infty$ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually $S(p) \le p$, but under special circumstances the speed-up can be $S(p) > p$ [24]. Another upper bound is established by Amdahl's law: $S(p) = 1/(s + (1-s)/p) \le 1/s$, where $s$ is the fraction of a program that is sequential. The upper bound is reached for a zero-time parallel fraction.

(ii) The efficiency is the average utilization of the $p$ processors: $E(p) = S(p)/p$.

(iii) The isoefficiency is the growth rate of the workload $W_p(n) = pT_p(n)$ in terms of the number of processors needed to keep the efficiency fixed. If we consider $W_1(n) - EW_p(n) = 0$ for any fixed efficiency $E$, we obtain $p = p(n)$. This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of $n$ numbers using $p$ processors we have $n \approx E(n + p\log p)$, so $n = \Theta(p\log p)$.
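To make these metrics concrete, the following sketch (with an illustrative sequential fraction, not a value from the paper) computes the Amdahl speed-up and efficiency, and solves the isoefficiency relation $n \approx E(n + p\log p)$ of the parallel-sum example in closed form:

```python
import math

def amdahl_speedup(s, p):
    """Amdahl's law: speed-up with sequential fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

def efficiency(s, p):
    """Average utilization E(p) = S(p)/p."""
    return amdahl_speedup(s, p) / p

def parallel_sum_workload(p, E):
    """Solve n = E*(n + p*log2(p)) for n: the isoefficiency of the parallel sum."""
    return E * p * math.log2(p) / (1.0 - E)

s = 0.05                                   # assume 5% of the program is sequential
for p in (4, 16, 64):
    print(p, round(amdahl_speedup(s, p), 2), round(efficiency(s, p), 2))

print(amdahl_speedup(s, 10**6) <= 1 / s)   # speed-up is bounded by 1/s = 20
print(parallel_sum_workload(64, 0.8))      # workload grows like Theta(p log p)
```

The last line shows how fast the problem size must grow with $p$ to hold the efficiency at 0.8.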

Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins: 10, 100, and 1000 (panels (a), (b), and (c); values plotted over the interval [a, b] = [0, 1]).

Figure 7: Residual check analysis for solving the $Ax = y$ system in HPL 2.0, using simple and double precision representation ($\log_{10}$ of the scaled residual versus $\log_{10}(n)$; curves for simple precision and double precision are compared with the HPL 2.0 bound).

Numerical algorithms use a hypercube architecture for their implementation. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers $T_{\text{com}} = t_s + Lt_w$, where $t_s$ is the latency (the time needed by a message to cross through the network), $t_w$

Table 2: Isoefficiency for a hypercube architecture, $n = \Theta(p\log p)$ and $n = \Theta(p(\log p)^2)$. The limitations imposed by Formula (33) are marked with (*).

Scenario | Architecture size (p) | n = Θ(p log p) | n = Θ(p (log p)^2)
1        | 10^1                  | 1.0 × 10^1     | 1.00 × 10^1
2        | 10^2                  | 2.0 × 10^2     | 8.00 × 10^2
3        | 10^3                  | 3.0 × 10^3     | 2.70 × 10^4
4        | 10^4                  | 4.0 × 10^4     | 6.40 × 10^5 (*)
5        | 10^5                  | 5.0 × 10^5 (*) | 1.25 × 10^7
6        | 10^6                  | 6.0 × 10^6     | 2.16 × 10^8
7        | 10^7                  | 7.0 × 10^7     | 3.43 × 10^9
8        | 10^8                  | 8.0 × 10^8     | 5.12 × 10^10
9        | 10^9                  | 9.0 × 10^9     | 7.29 × 10^11

is the time needed to send a word ($1/t_w$ is called bandwidth), and $L$ is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define $t_c$ as the processing time per word for a processor. We have the following results.

(i) External product $xy^T$. The isoefficiency is written as

$$t_c n \approx E\,(t_c n + (t_s + t_w)\,p\log p) \Longrightarrow n = \Theta(p\log p). \qquad (34)$$

The parallel processing time is $T_p = t_c n/p + (t_s + t_w)\log p$. The optimal number of processors is computed using

$$\frac{dT_p}{dp} = 0 \Longrightarrow -t_c\frac{n}{p^2} + \frac{t_s + t_w}{p} = 0 \Longrightarrow p \approx \frac{t_c n}{t_s + t_w}. \qquad (35)$$
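The optimum in Formula (35) is easy to check numerically. The sketch below uses made-up machine constants $t_c$, $t_s$, $t_w$ (not values from the paper) and scans $p$ to find the fastest configuration, comparing it with the closed-form estimate (which ignores the base of the logarithm, so the two agree only up to a constant factor):

```python
import math

def T_p(n, p, t_c, t_s, t_w):
    """Parallel time of the external product on a hypercube (model above)."""
    return t_c * n / p + (t_s + t_w) * math.log2(p)

# illustrative machine constants (assumptions, not measured values)
t_c, t_s, t_w = 1e-9, 1e-6, 1e-8
n = 10**7

p_formula = t_c * n / (t_s + t_w)     # Formula (35)
p_best = min(range(1, 100000), key=lambda p: T_p(n, p, t_c, t_s, t_w))
print(p_formula, p_best)              # same order of magnitude
```

The brute-force minimum lands at roughly $\ln 2 \approx 0.69$ times the closed-form value, reflecting the logarithm base left implicit in the model.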

(ii) Scalar product (internal product) $x^Ty = \sum_{i=1}^{n} x_i y_i$. The isoefficiency is written as

$$t_c n^2 \approx E\left(t_c n^2 + \frac{t_s}{2}\,p\log p + \frac{t_w}{2}\,n\sqrt{p}\log p\right) \Longrightarrow n = \Theta(p(\log p)^2). \qquad (36)$$

(iii) Matrix-vector product $y = Ax$, $y_i = \sum_{j=1}^{n} A_{ij}x_j$. The isoefficiency is written as

$$t_c n^2 \approx E\,(t_c n^2 + t_s\,p\log p + t_w\,n\sqrt{p}\log p) \Longrightarrow n = \Theta(p(\log p)^2). \qquad (37)$$

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound $n \ge 1.05 \times 10^6$ are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound for the Big Data volume in this case.

The factors that determine the efficiency of parallel algorithms are task balancing (workload distribution between all used processors in a system → to be maximized), concurrency (the number/percentage of processors working simultaneously → to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing → to be minimized).


Figure 8: Processing flows for HEP experiments. Experimental data (approximately TB of data → Big Data problem) and simulated data (MC data generated based on theoretical models → Big Data problem) each pass through modeling and reconstruction (find properties of observed particles) and analysis (reduce results to a simple statistical distribution), and the two resulting distributions are compared.

3. New Challenges for Big Data Science

There are a lot of applications that generate Big Data, like social networking profiles, social influence, SaaS & Cloud apps, public web information, MapReduce, scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the approaches and solutions to better answer the users' needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TB of valuable data) and simulation data (data generated using MC based on theoretical models). The processing phase is represented by modeling and reconstruction, in order to find the properties of the observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results will validate how realistic a simulation experiment is and will validate it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes of resources' availability, and so forth. We will pay special attention to Cloud Systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are maximization of system throughput, sites' autonomy, scalability, fault-tolerance, and quality of services.

When discussing Big Data we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need of many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrives too fast for optimal storage and indexing

[28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, Hadoop YARN: a next-generation framework for Hadoop data processing, 2013, http://hortonworks.com/hadoop/yarn), Corona (Corona: under the hood: scheduling MapReduce jobs more efficiently with Corona, 2012, Facebook), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future Scientific Data Infrastructure are presented in [31]. That paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science, and it proposes the SDI generic architecture model that provides a basis for building interoperable data- or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure with cost advantages and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.

Data transfer scheduling in Grids, Clouds, P2P, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application's architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time, for finding a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and querying, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection


of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data is too heterogeneous to fit into a rigid schema [38].

Another challenge is the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, and also the dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. An example of optimization is multiobjective and multiconstrained scheduling of many tasks in Hadoop [42], or optimizing short jobs [43]. The cost effectiveness, scalability, and streamlined architectures of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to a public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), complex task dependencies for data workflows, and new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method, the proper numerical method has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper puts together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research presented in this paper is supported by the following projects: "SideSTEP - Scheduling Methods for Dynamic Distributed Systems: a self-* approach" (PN-II-CT-RO-FR-2012-1-0084); "ERRIC - Empowering Romanian Research on Intelligent Information Technologies", FP7-REGPOT-2010-1, ID 264207; CyberWater Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, Project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.

[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.

[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.

[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678-683, 2006.

[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488-493, 2002.

[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585-591, 2009.

[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43-48, 2009.

[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 05, 2006.

[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219-228, 1998.

[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311-322, 1998.

[11] O. Behnke, K. Kroninger, G. Schott, and T. H. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.

[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).

[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.

[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.

[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469-481, 1996.

[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333-345, 2013.

[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533-540, Springer, Berlin, Germany, 2009.

[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233-297, 2005.

[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198-207, 2001.

[20] I. S. Abramson, "On bandwidth variation in kernel estimates: a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217-1223, 1982.

[21] X. Li, J. W. Demmel, D. H. Bailey, et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152-205, 2002.

[22] J. J. Dongarra and J. Langou, "The problem with the Linpack benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5-13, 2009.

[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640-649, April 1995.

[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.

[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.

[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710-2721, 2013.

[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014-2015, 2012.

[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1-5, Springer, Berlin, Germany, 2013.

[29] B. Hindman, A. Konwinski, M. Zaharia, et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.

[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.

[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614-617, IEEE Computer Society, Washington, DC, USA, 2012.

[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.

[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097-1111, 2013.

[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153-172, 2012.

[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109-110, ACM, New York, NY, USA, 2013.

[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (PGCIC '11), pp. 59-66, Barcelona, Spain, October 2011.

[37] J. Chen, Y. Chen, X. Du, et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157-164, 2013.

[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7-11, ACM, New York, NY, USA, 2012.

[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973-985, 2013.

[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104-112, 2013.

[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355-367, 2007.

[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.

[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985-996, 2013.


(1) procedure Random Generator Poisson RND(μ, r)
(2)   number ← 0
(3)   q ← exp(−μ)
(4)   accumulator ← q
(5)   p ← q
(6)   while r > accumulator do
(7)     number ← number + 1
(8)     p ← p · μ / number
(9)     accumulator ← accumulator + p
(10)  end while
(11)  return number
(12) end procedure

Algorithm 2: Random number generation for the Poisson distribution, using one random number generated with uniform distribution.
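A direct transcription of Algorithm 2 in Python (a sketch; the single input r is a uniform draw on (0, 1), used to invert the Poisson cumulative distribution) looks as follows:

```python
import math
import random

def poisson_rnd(mu, r):
    """Invert the Poisson CDF: walk the cumulative probabilities until r is covered."""
    number = 0
    p = math.exp(-mu)         # P(N = 0)
    accumulator = p
    while r > accumulator:
        number += 1
        p *= mu / number      # P(N = k) = P(N = k-1) * mu / k
        accumulator += p
    return number

random.seed(42)
mu = 3.0
sample = [poisson_rnd(mu, random.random()) for _ in range(100000)]
print(sum(sample) / len(sample))   # close to mu
```

The recurrence on line `p *= mu / number` is exactly step (8) of Algorithm 2, so only one uniform random number is consumed per Poisson variate.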

numbers $r_i$ randomly, with a probability density function uniform on a specific interval $(a, b)$, each $r_i$ being used to evaluate $f(r_i)$. For large $n$ (consistent estimator),

$$\frac{1}{n}\sum_{i=1}^{n} f(r_i) \longrightarrow E(f) = \frac{1}{b-a}\int_a^b f(r)\,dr. \qquad (5)$$

The properties of an MC estimator are the following: it is normally distributed (with Gaussian density); the standard deviation is $\sigma = \sqrt{V(f)/n}$; MC is unbiased for all $n$ (the expectation value is the real value of the integral); the estimator is consistent if $V(f) < \infty$ (the estimator converges to the true value of the integral for very large $n$); a sampling phase can be applied to compute the estimator if we do not know anything about the function $f$, which makes it well suited for integration. The sampling phase can be expressed in a stratified way as

$$\int_a^b f(r)\,dr = \int_a^{r_1} f(r)\,dr + \int_{r_1}^{r_2} f(r)\,dr + \cdots + \int_{r_n}^b f(r)\,dr. \qquad (6)$$

MC estimations and MC event generators are necessary tools in most HEP experiments, being used at all their steps: experiment preparation, simulation running, and data analysis.
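As a minimal illustration of Formula (5) (a sketch, not from the paper), the following code estimates $\int_0^1 x^2\,dx = 1/3$ together with the $\sigma = \sqrt{V(f)/n}$ error estimate:

```python
import math
import random

def mc_integrate(f, a, b, n, seed=0):
    """Plain Monte Carlo estimate of the integral of f over (a, b), with its standard deviation."""
    rng = random.Random(seed)
    values = [f(a + (b - a) * rng.random()) for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    estimate = (b - a) * mean       # (b - a) * E(f), per Formula (5)
    sigma = (b - a) * math.sqrt(var / n)
    return estimate, sigma

est, sig = mc_integrate(lambda x: x * x, 0.0, 1.0, 100000)
print(est, sig)   # estimate close to 1/3, with sigma shrinking like 1/sqrt(n)
```

The $1/\sqrt{n}$ decay of `sig` is the consistency property stated above: quadrupling $n$ halves the statistical error.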

An example of MC estimation is the Lorentz invariant phase space (LIPS), which describes the cross section for a typical HEP process with $n$ particles in the final state:

$$\sigma_n \sim \int |M|^2\,dR_n, \qquad (7)$$

where $M$ is the matrix describing the interaction between particles and $dR_n$ is the element of LIPS. We have the following estimation:

$$R_n(P; p_1, p_2, \ldots, p_n) = \int \delta^{(4)}\Big(P - \sum_{k=1}^{n} p_k\Big) \prod_{k=1}^{n} \delta(p_k^2 - m_k^2)\,\Theta(p_k^0)\,d^4p_k, \qquad (8)$$

Figure 3: Example of particle interaction: $e^+(p_1) + e^-(p_2) \rightarrow \mu^+(q_1)\mu^-(q_2)$ (the $z$ axis, the polar angle $\theta$, and the azimuthal angle $\Phi$ are shown).

where $P$ is the total four-momentum of the $n$-particle system, $p_k$ and $m_k$ are the four-momenta and masses of the final state particles, $\delta^{(4)}(P - \sum_{k=1}^{n} p_k)$ expresses total energy-momentum conservation, and $\delta(p_k^2 - m_k^2)$ is the on-mass-shell condition for the final state system. Based on the integration formula

$$\int \delta(p_k^2 - m_k^2)\,\Theta(p_k^0)\,d^4p_k = \frac{d^3p_k}{2p_k^0}, \qquad (9)$$

we obtain the iterative form for the cross section:

$$R_n(P; p_1, p_2, \ldots, p_n) = \int R_{n-1}(P - p_n; p_1, p_2, \ldots, p_{n-1})\,\frac{d^3p_n}{2p_n^0}, \qquad (10)$$

which can be numerically integrated by using the recurrence relation. As a result, we can construct a general MC algorithm for particle collision processes.

Example 1. Let us consider the interaction $e^+e^- \rightarrow \mu^+\mu^-$, where the Higgs boson contribution is numerically negligible. Figure 3 describes this interaction ($\Phi$ is the azimuthal angle, $\theta$ the polar angle, and $p_1, p_2, q_1, q_2$ are the four-momenta of the particles).

The cross section is

$$d\sigma = \frac{\alpha^2}{4s}\left[W_1(s)\,(1 + \cos^2\theta) + W_2(s)\cos\theta\right]d\Omega, \qquad (11)$$


where $d\Omega = d\cos\theta\,d\Phi$, $\alpha = e^2/4\pi$ (the fine structure constant), $s = (p_1^0 + p_2^0)^2$ is the center-of-mass energy squared, and $W_1(s)$ and $W_2(s)$ are constant functions. For pure processes we have $W_1(s) = 1$ and $W_2(s) = 0$, and the total cross section becomes

$$\sigma = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\,\frac{d^2\sigma}{d\Phi\,d\cos\theta}. \qquad (12)$$

We introduce the following notation:

$$\rho(\cos\theta, \Phi) = \frac{d^2\sigma}{d\Phi\,d\cos\theta}, \qquad (13)$$

and let us consider $\tilde{\rho}(\cos\theta, \Phi)$ an approximation of $\rho(\cos\theta, \Phi)$. Then $\tilde{\sigma} = \iint d\Phi\,d\cos\theta\,\tilde{\rho}$. Now we can compute

$$\begin{aligned}
\sigma &= \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\,\rho(\cos\theta, \Phi) \\
&= \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\,w(\cos\theta, \Phi)\,\tilde{\rho}(\cos\theta, \Phi) \\
&\approx \langle w \rangle_{\tilde{\rho}} \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta\,\tilde{\rho}(\cos\theta, \Phi) = \langle w \rangle_{\tilde{\rho}}\,\tilde{\sigma},
\end{aligned} \qquad (14)$$

where $w(\cos\theta, \Phi) = \rho(\cos\theta, \Phi)/\tilde{\rho}(\cos\theta, \Phi)$ and $\langle w \rangle_{\tilde{\rho}}$ is the estimation of $w$ based on $\tilde{\rho}$. Here the MC estimator is

$$\langle w \rangle_{\text{MC}} = \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad (15)$$

and the standard deviation is

$$s_{\text{MC}} = \left(\frac{1}{n(n-1)}\sum_{i=1}^{n} \left(w_i - \langle w \rangle_{\text{MC}}\right)^2\right)^{1/2}. \qquad (16)$$

The final numerical result based on the MC estimator is

$$\sigma_{\text{MC}} = \tilde{\sigma}\left(\langle w \rangle_{\text{MC}} \pm s_{\text{MC}}\right). \qquad (17)$$

As we can see, the principle of a Monte Carlo estimator in physics is to simulate the cross section in interaction and radiation transport, knowing the probability distributions (or an approximation) governing each interaction of the elementary particles.

Based on this result, the Monte Carlo algorithm used to generate events is as follows. It takes as input $\tilde{\rho}(\cos\theta, \Phi)$ and in a main loop considers the following steps: (1) generate a pair $(\cos\theta, \Phi)$ from $\tilde{\rho}$; (2) compute the four-momenta $p_1, p_2, q_1, q_2$; (3) compute $w = \rho/\tilde{\rho}$. The loop can be stopped in the case of unweighted events, and we will stay in the loop for weighted events. As output, the algorithm returns the four-momenta of the particles for weighted events, and the four-momenta plus an array of weights for unweighted events. The main issue is how to initialize the input of the algorithm. Based on the $d\sigma$ formula (for $W_1(s) = 1$ and $W_2(s) = 0$), we can consider as input $\tilde{\rho}(\cos\theta, \Phi) = (\alpha^2/4s)(1 + \cos^2\theta)$. Then $\tilde{\sigma} = 4\pi\alpha^2/3s$.

In HEP, theoretical predictions used for modeling particle collision processes (as shown in the presented example) should be provided in terms of Monte Carlo event generators, which directly simulate these processes and can provide unweighted (weight = 1) events. A good Monte Carlo algorithm should be used not only for numerical integration [7] (i.e., providing weighted events) but also for efficient generation of unweighted events, which is a very important issue for HEP.
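A toy version of such a generator (a sketch under the assumptions of Example 1, with $\tilde{\rho} \propto 1 + \cos^2\theta$; the accept/reject step used here to unweight the events is our own illustrative choice) can be written as:

```python
import math
import random

def generate_unweighted_events(n_events, seed=1):
    """Toy e+e- -> mu+mu- generator: sample (cos_theta, phi) from a density
    proportional to (1 + cos^2 theta) by accept/reject, so every event has weight 1."""
    rng = random.Random(seed)
    events = []
    while len(events) < n_events:
        cos_theta = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        # envelope: the maximum of (1 + cos^2 theta) on [-1, 1] is 2
        if rng.uniform(0.0, 2.0) <= 1.0 + cos_theta ** 2:
            events.append((cos_theta, phi))
    return events

events = generate_unweighted_events(50000)
mean_cos = sum(c for c, _ in events) / len(events)
print(abs(mean_cos) < 0.02)   # the (1 + cos^2) distribution is symmetric in cos(theta)
```

Keeping every proposal and storing its weight $w = (1 + \cos^2\theta)/2$ instead would give the weighted-event variant described above.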

2.2.2. Markovian Monte Carlo Chains. A classical Monte Carlo method estimates a function F with F̂ by using a random variable. The main problem with this approach is that we cannot predict any value in advance for a random variable. In HEP simulation experiments the systems are described by states [8]. Let us consider a system with a finite set of possible states S_1, S_2, ..., and let S_t be the state at the moment t. The conditional probability is defined as

P(S_t = S_j | S_{t_1} = S_{i_1}, S_{t_2} = S_{i_2}, ..., S_{t_n} = S_{i_n}),   (18)

where the mappings (t_1, i_1), ..., (t_n, i_n) can be interpreted as the description of the system's evolution in time, by specifying a specific state for each moment of time.

The system is a Markov chain if the distribution of S_t depends only on the immediate predecessor S_{t−1} and is independent of all previous states, as follows:

P(S_t = S_j | S_{t−1} = S_{i_{t−1}}, ..., S_{t_2} = S_{i_2}, S_{t_1} = S_{i_1}) = P(S_t = S_j | S_{t−1} = S_{i_{t−1}}).   (19)

To generate the time steps (t_1, t_2, ..., t_n) we use the probability of a single forward Markovian step, given by p(t | t_n) with the property ∫_{t_n}^∞ p(t | t_n) dt = 1, and we define p(t) = p(t | 0). The 1-dimensional Monte Carlo Markovian Algorithm used to generate the time steps is presented in Algorithm 3.

The main result of Algorithm 3 is that P_N(t_max) follows a Poisson distribution:

P_N = ∫_0^{t_max} p(t_1 | t_0) dt_1 × ∫_{t_1}^{t_max} p(t_2 | t_1) dt_2 × ⋯ × ∫_{t_{N−1}}^{t_max} p(t_N | t_{N−1}) dt_N × ∫_{t_max}^∞ p(t_{N+1} | t_N) dt_{N+1} = (1/N!) (t_max)^N e^{−t_max}.   (20)

We can consider the 1-dimensional Monte Carlo Markovian Algorithm as a method used to iteratively generate the system's states (codified as a Markov chain) in simulation experiments. According to the Ergodic Theorem for Markov chains, the chain so defined has a unique stationary probability distribution [9, 10].

Figures 4 and 5 present runs of Algorithm 3. For different values of the parameter s used to generate the next step, the results over 1000 iterations are very different. Figure 4, for s = 1, shows a noise-like profile. For s = 10, 100, 1000 the profile looks as if some of the information has been filtered out and lost. The best results are obtained for s = 0.01

6 Advances in High Energy Physics

(1) Generate t_1 according to p(t_1) = p(t_1 | t_0 = 0)
(2) if t_1 < t_max then  ⊳ Generate the initial state
(3)   P_{N≥1} = ∫_0^{t_max} p(t_1 | t_0) dt_1  ⊳ Compute the initial probability
(4)   Retain t_1
(5) end if
(6) if t_1 > t_max then  ⊳ Discard all generated and computed data
(7)   N = 0, P_0 = ∫_{t_max}^∞ p(t_1 | t_0) dt_1 = e^{−t_max}
(8)   Delete t_1
(9)   EXIT  ⊳ The algorithm ends here
(10) end if
(11) i = 2
(12) while (1) do  ⊳ Infinite loop until a successful EXIT
(13)   Generate t_i according to p(t_i | t_{i−1})
(14)   if t_i < t_max then  ⊳ Generate a new state and new probability
(15)     P_{N≥i} = ∫_{t_i}^{t_max} p(t_i | t_{i−1}) dt_i
(16)     Retain t_i
(17)   end if
(18)   if t_i > t_max then  ⊳ Discard all generated and computed data
(19)     N = i − 1, P_i = ∫_{t_max}^∞ p(t_i | t_{i−1}) dt_i
(20)     Retain (t_1, t_2, ..., t_{i−1}), Delete t_i
(21)     EXIT  ⊳ The algorithm ends here
(22)   end if
(23)   i = i + 1
(24) end while

Algorithm 3: 1-Dimensional Monte Carlo Markovian Algorithm.
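A compact sketch of Algorithm 3 in Python, under the assumption of a unit-rate exponential step density p(t | t_n) = e^{−(t − t_n)} (the text does not fix the density itself):

```python
import math
import random

def markov_steps(t_max):
    """One run of the 1-D Markovian algorithm with an assumed unit-rate
    exponential step density p(t | t_n) = exp(-(t - t_n)).
    Returns the retained steps t_1 < ... < t_N < t_max."""
    steps, t = [], 0.0
    while True:
        t += -math.log(1.0 - random.random())  # draw the next forward step
        if t > t_max:                          # overshoot: discard it and EXIT
            return steps
        steps.append(t)                        # retain t_i
```

The count N = len(markov_steps(t_max)) is then Poisson-distributed with mean t_max, matching Formula (20).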

and s = 0.1, and the generated values can be easily accepted for MC simulation in HEP experiments.

Figure 5 shows the acceptance rate of the values generated with the parameter s used in the algorithm; the parameter values are correlated with Figure 4. The results in Figure 5 show that the acceptance rate decreases rapidly with increasing values of s. The conclusion is that s must be kept small to obtain meaningful data. A correlation with the normal distribution is evident, showing that a small value of the mean square deviation provides useful results.

2.2.3. Performance of Numerical Algorithms Used in MC Simulations. Numerical methods used to compute the MC estimator use numerical quadratures, which approximate the value of the integral of a function f on a specific domain by a linear combination of function values and weights {w_i}_{1≤i≤m}, as follows:

∫_a^b f(r) dr = Σ_{i=1}^m w_i f(r_i).   (21)

We can consider a consistent MC estimator as a classical numerical quadrature with all w_i = 1. The efficiency of integration methods for 1 dimension and for d dimensions is presented in Table 1. We can conclude that quadrature methods are difficult to apply in many dimensions, for various integration domains (regions), and that the integral is not easy to estimate.

Table 1: Efficiency of integration methods for 1 dimension and for d dimensions.

Method | 1 dimension | d dimensions
Monte Carlo | n^{−1/2} | n^{−1/2}
Trapezoidal rule | n^{−2} | n^{−2/d}
Simpson's rule | n^{−4} | n^{−4/d}
m-point Gauss rule (m < n) | n^{−2m} | n^{−2m/d}

As a practical example, in a typical high-energy particle collision there can be many final-state particles (even hundreds). If we have n final-state particles, we face a d = 3n − 4 dimensional phase space. As a numerical example, for n = 4 we have d = 8 dimensions, which is a very difficult setting for classical numerical quadratures.

Full decomposition of the integration volume, with one double number (10 Bytes) per volume unit, requires n^d × 10 Bytes. For the example considered, with d = 8 and n = 10 divisions of the interval [0, 1], one numerical integration needs

n^d × 10 Bytes = (10^8 × 10)/1024³ GBytes ≈ 0.93 GBytes.   (22)

Considering 10^6 events per second and one integration per event, the data produced in one hour will be ≈ 3197.4 PBytes.
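The two storage estimates above can be checked with a few lines of arithmetic:

```python
# Storage estimate from the text: one double (10 bytes) per volume unit,
# d = 8 dimensions, n = 10 divisions of [0, 1] per dimension.
n, d, bytes_per_value = 10, 8, 10
per_integration = n ** d * bytes_per_value      # bytes for one integration
print(per_integration / 1024 ** 3)              # ~0.93 GBytes

# 10^6 events per second, one integration per event, for one hour
events_per_hour = 10 ** 6 * 3600
total = per_integration * events_per_hour
print(total / 1024 ** 5)                        # ~3197.4 PBytes
```
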


Figure 4: Example of the 1-dimensional Monte Carlo Markovian algorithm: state versus iteration (1000 iterations) for s = 0.01, 0.1, 1, 10, 100, and 1000.

The previous estimate holds for full multidimensional arrays. But due to the factorization assumption p(r_1, r_2, ..., r_n) = p(r_1) p(r_2) ⋯ p(r_n), we obtain for one integration

n × d × 10 Bytes = 800 Bytes,   (23)

which means ≈ 2.62 TBytes of data produced for one hour of simulations.

2.3. Unfolding Processes in Particle Physics and Kernel Estimation in HEP. In particle physics analysis we have two types of distributions: the true distribution (considered in theoretical models) and the measured distribution (considered in experimental models, affected by the finite resolution and limited acceptance of existing detectors). A HEP interaction process starts with a known true distribution and generates a measured distribution, corresponding to an experiment of

Figure 5: Analysis of the acceptance rate (%) for the 1-dimensional Monte Carlo Markovian algorithm for different s values (log10(s) from −3 to 4).

a well-confirmed theory. An inverse process starts with a measured distribution and tries to identify the true distribution. These unfolding processes are used to identify new theories based on experiments [11].

2.3.1. Unfolding Processes in Particle Physics. The theory of unfolding processes in particle physics is as follows [12]. For a physics variable t we have a true distribution f(t), mapped to x, an n-vector of unknowns, and a measured distribution g(s) (for a measured variable s), mapped to an m-vector of measured data. A response matrix A ∈ R^{m×n} encodes a kernel function K(s, t) describing the physical measurement process [12–15]. The direct and inverse processes are described by the Fredholm integral equation [16] of the first kind, for a specific domain Ω:

∫_Ω K(s, t) f(t) dt = g(s).   (24)

In particle physics the kernel function K(s, t) is usually known from a Monte Carlo sample obtained by simulation. A numerical solution is obtained using the linear equation Ax = y. The vectors x and y are assumed to be 1-dimensional in theory, but they can be multidimensional in practice (considering multiple independent linear equations). In practice the statistical properties of the measurements are also well known, and often they follow Poisson statistics [17]. To solve the linear systems we have different numerical methods.

The first method is based on the linear transformation x = A^# y. If m = n, then A^# = A^{−1} and we can use direct Gaussian methods, iterative methods (Gauss-Seidel, Jacobi, or SOR), or orthogonal methods (based on Householder transformations, Givens methods, or the Gram-Schmidt algorithm). If m > n (the most frequent scenario), we construct the matrix A^# = (A^T A)^{−1} A^T (the Moore-Penrose pseudoinverse). In these cases the orthogonal methods offer very good and stable numerical solutions.
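A minimal NumPy sketch of the pseudoinverse route for an overdetermined system; the matrix and right-hand side below are synthetic:

```python
import numpy as np

# Overdetermined toy system (m > n): solve A x = y through the
# Moore-Penrose pseudoinverse A# = (A^T A)^{-1} A^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))          # m = 6 measurements, n = 3 unknowns
x_true = np.array([1.0, -2.0, 0.5])
y = A @ x_true

A_pinv = np.linalg.inv(A.T @ A) @ A.T    # normal-equation form of the pseudoinverse
x = A_pinv @ y
print(np.allclose(x, x_true))            # True
```

In production code a least-squares solver such as numpy.linalg.lstsq is preferred over forming A^T A explicitly, which is exactly the stability argument for the orthogonal methods above.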


The second method considers the singular value decomposition

A = UΣV^T = Σ_{i=1}^n σ_i u_i v_i^T,   (25)

where U ∈ R^{m×n} and V ∈ R^{n×n} are matrices with orthonormal columns and the diagonal matrix is Σ = diag(σ_1, ..., σ_n) = U^T A V. The solution is

x = A^# y = V Σ^{−1} (U^T y) = Σ_{i=1}^n (1/σ_i)(u_i^T y) v_i = Σ_{i=1}^n (1/σ_i) c_i v_i,   (26)

where c_i = u_i^T y, i = 1, ..., n, are called the Fourier coefficients.
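The SVD expansion of Formulas (25) and (26) can be written out directly in NumPy, with the Fourier coefficients computed explicitly (the system below is a synthetic consistent example):

```python
import numpy as np

# Solve A x = y via x = sum_i (c_i / sigma_i) v_i, c_i = u_i^T y.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
y = A @ np.array([0.3, 1.0, -0.7])                 # consistent right-hand side

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U (6x3), s (3,), Vt (3x3)
c = U.T @ y                                        # Fourier coefficients c_i
x = Vt.T @ (c / s)                                 # x = V Sigma^{-1} U^T y
print(np.allclose(A @ x, y))                       # True
```
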

2.3.2. Random Matrix Theory. Analysis of particle spectra (e.g., the neutrino spectrum) involves Random Matrix Theory (RMT), especially if we consider anarchic neutrino masses. RMT means the study of the statistical properties of the eigenvalues of very large matrices [18]. For an interaction matrix A (of size N), where A_ij is an independent distributed random variable and A^H is the complex conjugate transpose matrix, we define M = A + A^H, which describes a Gaussian Unitary Ensemble (GUE). The GUE properties are described by the probability distribution P(M) dM: (1) it is invariant under unitary transformations, P(M) dM = P(M′) dM′, where M′ = U^H M U and U is a unitary matrix (U^H U = I); (2) the elements of the matrix M are statistically independent, P(M) = Π_{i≤j} P_ij(M_ij); and (3) the matrix M can be diagonalized as M = U D U^H, where D = diag(λ_1, ..., λ_N), λ_i is an eigenvalue of M, and λ_i ≤ λ_j if i < j. Properties (2) and (3) can be written as

Property (2): P(M) dM ∼ dM exp(−(N/2) Tr(M^H M)),

Property (3): P(M) dM ∼ dU Π_i dλ_i Π_{i<j} (λ_i − λ_j)² × exp(−(N/2) Σ_i λ_i²).   (27)

The numerical methods used for eigenvalue computation are the QR method and the Power methods (direct and inverse). The QR method is a numerically stable algorithm, while the Power method is an iterative one. RMT can be used for many-body systems, quantum chaos, disordered systems, quantum chromodynamics, and so forth.
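A GUE sample as defined above (M = A + A^H for a complex Gaussian A) can be drawn and diagonalized with NumPy; eigvalsh calls a numerically stable LAPACK routine for Hermitian matrices:

```python
import numpy as np

def gue_eigenvalues(N, seed=0):
    """Sample the eigenvalue spectrum of one GUE matrix M = A + A^H,
    where A has independent complex Gaussian entries."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
    M = A + A.conj().T                  # Hermitian by construction
    return np.linalg.eigvalsh(M)        # real eigenvalues, ascending order

lam = gue_eigenvalues(200)
```
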

2.3.3. Kernel Estimation in HEP. Kernel estimation is a very powerful and relevant method for HEP when it is necessary to combine data from heterogeneous sources, like MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data {x_i}_{1≤i≤n}, with a constant bandwidth h (the difference between two consecutive data values), called the smoothing parameter, we have the estimate

f̂(x) = (1/(nh)) Σ_{i=1}^n K((x − x_i)/h),   (28)

where K is an estimator. For example, a Gauss estimator with mean μ and standard deviation σ is

K(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²))   (29)

and has the following properties: it is positive definite and infinitely differentiable (due to the exp function), and it can be defined for an infinite support (n → ∞). The kernel is a nonparametric method, which means that h is independent of the dataset; for a large amount of normally distributed data we can find a value of h that minimizes the integrated squared error of f̂(x). This bandwidth is computed as

h* = (4/(3n))^{1/5} σ.   (30)

The main problem in kernel estimation is that the dataset {x_i}_{1≤i≤n} is not normally distributed, and in real experiments the optimal bandwidth is not known. An improvement of the presented method considers the adaptive kernel estimation proposed by Abramson [20], where h_i = h/√(f(x_i)) and σ are considered global qualities of the dataset. The new form is

f̂_a(x) = (1/n) Σ_{i=1}^n (1/h_i) K((x − x_i)/h_i),   (31)

and the local bandwidth value that minimizes the integrated squared error of f̂_a(x) is

h_i* = (4/(3n))^{1/5} √(σ/f(x_i)),   (32)

where f is the normal estimator. Kernel estimation is used for event selection and for confidence level evaluation, for example in Markovian Monte Carlo chains or in the selection of the neural network output used in experiments for the reconstructed Higgs mass. In general, the main usage of kernel estimation in HEP is searching for new particles by finding relevant data in a large dataset.

A method based on kernel estimation is the graphical representation of datasets using the advanced shifted histogram (ASH) algorithm. This is a numerical interpolation for large datasets, with the main aim of creating a set of n_bin histograms H = {H_i} with the same bin width h. Algorithm 4 presents the steps of histogram generation, starting with a specific interval [a, b], a number of points n in this interval, a number of bins, and a number of values m used for kernel estimation. Figure 6 shows the results of kernel estimation for the function f = −(1/2)x² on [0, 1], with graphical representations for different numbers of bins. The values on the vertical axis are aggregated in step 17 of Algorithm 4 and increase with the number of bins.

2.3.4. Performance of Numerical Algorithms Used in Particle Physics. All operations used in the presented methods for particle physics (unfolding processes, Random Matrix Theory,


(1) procedure ASH(a, b, n, x, n_bin, m, f_m)
(2)   δ = (b − a)/n_bin; h = mδ
(3)   for k = 1, n_bin do
(4)     v_k = 0; y_k = 0
(5)   end for
(6)   for i = 1, n do
(7)     k = ⌊(x_i − a)/δ⌋ + 1
(8)     if k ∈ [1, n_bin] then
(9)       v_k = v_k + 1
(10)    end if
(11)  end for
(12)  for k = 1, n_bin do
(13)    if v_k = 0 then
(14)      k = k + 1
(15)    end if
(16)    for i = max{1, k − m + 1}, min{n_bin, k + m − 1} do
(17)      y_i = y_i + v_k f_m(i − k)
(18)    end for
(19)  end for
(20)  for k = 1, n_bin do
(21)    y_k = y_k/(nh)
(22)    t_k = a + (k − 0.5)δ
(23)  end for
(24)  return {t_k}_{1≤k≤n_bin}, {y_k}_{1≤k≤n_bin}
(25) end procedure

Algorithm 4: Advanced shifted histogram (1D algorithm).

and kernel estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21] the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, with extended precision, is described. This permits higher internal precision and mixed input/output types. The extended precision allows the implementation of some algorithms that are simpler, more accurate, and sometimes faster than is possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check of an Ax = y system, with A ∈ R^{n×n} a dense matrix. The residual check is defined as

r_∞ = ‖Ax − y‖_∞ / (nε(‖A‖_∞ ‖x‖_∞ + ‖y‖_∞)) < 16,   (33)

where ε is the relative machine precision for the IEEE representation standard, ‖y‖_∞ is the infinity norm of a vector, ‖y‖_∞ = max_{1≤i≤n} |y_i|, and ‖A‖_∞ is the infinity norm of a matrix, ‖A‖_∞ = max_{1≤i≤n} Σ_{j=1}^n |A_ij|.

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for simple and double precision. For simple precision, ε_s = 2^{−24}, for all n ≥ 1.05 × 10^6 the residual check is always lower than the imposed upper bound; similarly for double precision, with ε_d = 2^{−53}, for all n ≥ 5.63 × 10^14. If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for the data volume in this model.

In a single-processor system the complexity of algorithms depends only on the problem size n. We can assume T(n) = Θ(f(n)), where f(n) is a fundamental function (f(n) ∈ {1, n^α, a^n, log n, √n, ...}). In parallel systems (multiprocessor systems with p processors) we have the serial processing time T*(n) = T_1(n) and the parallel processing time T_p(n). The performance of parallel algorithms can be analyzed using the speed-up, efficiency, and isoefficiency metrics.

(i) The speed-up S(p) represents how much faster a parallel algorithm is than the corresponding sequential algorithm. The speed-up is defined as S(p) = T_1(n)/T_p(n). There are special bounds for the speed-up [23]: S(p) ≤ p̄p/(p + p̄ − 1), where p̄ = T_1/T_∞ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually S(p) ≤ p, but under special circumstances the speed-up can be S(p) > p [24]. Another upper bound is established by Amdahl's law: S(p) = (s + (1 − s)/p)^{−1} ≤ 1/s, where s is the fraction of the program that is sequential. The upper bound corresponds to zero time spent in the parallel fraction.

(ii) The efficiency is the average utilization of the p processors: E(p) = S(p)/p.

(iii) The isoefficiency is the growth rate of the workload W_p(n) = pT_p(n) in terms of the number of processors needed to keep the efficiency fixed. If we consider W_1(n) − EW_p(n) = 0 for any fixed efficiency E, we obtain p = p(n). This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of n numbers using p processors we have n ≈ E(n + p log p), so n = Θ(p log p).
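The three metrics reduce to one-liners; the numbers below are illustrative, not measurements:

```python
def speedup(t1, tp):
    # S(p) = T1 / Tp
    return t1 / tp

def efficiency(t1, tp, p):
    # E(p) = S(p) / p
    return speedup(t1, tp) / p

def amdahl_bound(s, p):
    # Amdahl's law: S(p) = 1 / (s + (1 - s)/p), bounded above by 1/s
    return 1.0 / (s + (1.0 - s) / p)

# a program that is 5% sequential cannot exceed a speed-up of 1/s = 20,
# no matter how many processors are used
print(amdahl_bound(0.05, 1024))   # ~19.6
```
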

Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins: 10, 100, and 1000 (kernel estimation of f = −(1/2)x² on [a, b] = [0, 1]).

Figure 7: Residual check analysis for solving the Ax = y system in HPL 2.0, using simple and double precision representation.

Numerical algorithms use a hypercube architecture for implementation. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers T_com = t_s + L t_w, where t_s is the latency (the time needed by a message to cross the network), t_w

Table 2: Isoefficiency for a hypercube architecture, n = Θ(p log p) and n = Θ(p(log p)²). We mark with (*) the limitations imposed by Formula (33).

Scenario | Architecture size (p) | n = Θ(p log p) | n = Θ(p(log p)²)
1 | 10^1 | 1.0 × 10^1 | 1.00 × 10^1
2 | 10^2 | 2.0 × 10^2 | 8.00 × 10^2
3 | 10^3 | 3.0 × 10^3 | 2.70 × 10^4
4 | 10^4 | 4.0 × 10^4 | 6.40 × 10^5 (*)
5 | 10^5 | 5.0 × 10^5 (*) | 1.25 × 10^7
6 | 10^6 | 6.0 × 10^6 | 2.16 × 10^8
7 | 10^7 | 7.0 × 10^7 | 3.43 × 10^9
8 | 10^8 | 8.0 × 10^8 | 5.12 × 10^10
9 | 10^9 | 9.0 × 10^9 | 7.29 × 10^11

is the time needed to send a word (1/t_w is called the bandwidth), and L is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define t_c as the processing time per word for a processor. We have the following results.

(i) External product xy^T. The isoefficiency is written as

t_c n ≈ E(t_c n + (t_s + t_w) p log p) ⟹ n = Θ(p log p).   (34)

The parallel processing time is T_p = t_c n/p + (t_s + t_w) log p. The optimum is computed from

dT_p/dp = 0 ⟹ −t_c n/p² + (t_s + t_w)/p = 0 ⟹ p ≈ t_c n/(t_s + t_w).   (35)

(ii) Scalar product (internal product) x^T y = Σ_{i=1}^n x_i y_i. The isoefficiency is written as

t_c n² ≈ E(t_c n² + (t_s/2) p log p + (t_w/2) n √p log p) ⟹ n = Θ(p(log p)²).   (36)

(iii) Matrix-vector product y = Ax, y_i = Σ_{j=1}^n A_ij x_j. The isoefficiency is written as

t_c n² ≈ E(t_c n² + t_s p log p + t_w n √p log p) ⟹ n = Θ(p(log p)²).   (37)

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound n ≥ 1.05 × 10^6 are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound for the Big Data volume in this case.

The factors that determine the efficiency of parallel algorithms are task balancing (the workload distribution between all used processors in a system → to be maximized), concurrency (the number/percentage of processors working simultaneously → to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing → to be minimized).


Figure 8: Processing flows for HEP experiments. Experimental data (approx. TB of data → a Big Data problem) and simulation (MC data generated from theoretical models → a Big Data problem) each pass through modeling and reconstruction (finding the properties of observed particles) and analysis (reducing the results to simple statistical distributions), and the two outcomes are compared.

3. New Challenges for Big Data Science

There are many applications that generate Big Data: social networking profiles, social influence, SaaS and Cloud applications, public web information, MapReduce scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications continuously produce increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the approaches and solutions to better answer the users' needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TB of valuable data) and simulation data (data generated using MC based on theoretical models). The processing phase is represented by modeling and reconstruction, in order to find the properties of the observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results will validate how realistic a simulation experiment is and will validate it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes in resource availability, and so forth. We will pay special attention to Cloud systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are the maximization of system throughput, sites' autonomy, scalability, fault-tolerance, and quality of services.

When discussing Big Data we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need for many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrives too fast for optimal storage and indexing

[28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, Hadoop YARN: a next-generation framework for Hadoop data processing, 2013, http://hortonworks.com/hadoop/yarn/), Corona (Corona: scheduling MapReduce jobs more efficiently, 2012, Facebook), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future Scientific Data Infrastructure are presented in [31]. That paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science, and proposes the SDI generic architecture model that provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and the limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure with cost advantages and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.

Data transfer scheduling in Grids, Clouds, P2P systems, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application's architecture, data must be transported to the place where the tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time, for finding a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and querying, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection


of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data are too heterogeneous to fit into a rigid schema [38].

Another challenge is represented by the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, and also the dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization are the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and the optimization of short jobs [43]. The cost effectiveness, scalability, and streamlined architectures of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what types of data/tasks should move to a public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), complex task dependencies for data workflows, and new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method, the proper numerical method has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper puts together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalues computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research presented in this paper is supported by the following projects: "SideSTEP—Scheduling Methods for Dynamic Distributed Systems: a self-* approach" (PN-II-CT-RO-FR-2012-1-0084); "ERRIC—Empowering Romanian Research on Intelligent Information Technologies," FP7-REGPOT-2010-1, ID: 264207; and CyberWater, Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, Project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.

[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.

[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.

[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678–683, 2006.

[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488–493, 2002.

[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585–591, 2009.

[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43–48, 2009.

[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 05, 2006.

[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219–228, 1998.

[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311–322, 1998.

[11] O. Behnke, K. Kroninger, G. Schott, and T. H. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.

[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).

[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.

[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.

[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469–481, 1996.

[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333–345, 2013.

[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533–540, Springer, Berlin, Germany, 2009.

[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233–297, 2005.

[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198–207, 2001.

[20] I. S. Abramson, "On bandwidth variation in kernel estimates—a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217–1223, 1982.

[21] X. Li, J. W. Demmel, D. H. Bailey, et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152–205, 2002.

[22] J. J. Dongarra and J. Langou, "The problem with the Linpack benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5–13, 2009.

[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640–649, April 1995.

[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.

[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.

[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710–2721, 2013.

[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014–2015, 2012.

[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1–5, Springer, Berlin, Germany, 2013.

[29] B. Hindman, A. Konwinski, M. Zaharia, et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.

[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.

[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614–617, IEEE Computer Society, Washington, DC, USA, 2012.

[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.

[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097–1111, 2013.

[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153–172, 2012.

[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109–110, ACM, New York, NY, USA, 2013.

[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC '11), pp. 59–66, Barcelona, Spain, October 2011.

[37] J. Chen, Y. Chen, X. Du, et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157–164, 2013.

[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7–11, ACM, New York, NY, USA, 2012.

[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973–985, 2013.

[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104–112, 2013.

[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355–367, 2007.

[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.

[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985–996, 2013.



where $d\Omega = d\cos\theta \, d\Phi$, $\alpha = e^2/4\pi$ (the fine structure constant), $s = (p_1^0 + p_2^0)^2$ is the center-of-mass energy squared, and $W_1(s)$ and $W_2(s)$ are constant functions. For pure processes we have $W_1(s) = 1$ and $W_2(s) = 0$, and the total cross section becomes

    \sigma = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta \, \frac{d^2\sigma}{d\Phi \, d\cos\theta}.  (12)

We introduce the following notation:

    \rho(\cos\theta, \Phi) = \frac{d^2\sigma}{d\Phi \, d\cos\theta},  (13)

and let us consider $\tilde\rho(\cos\theta, \Phi)$ an approximation of $\rho(\cos\theta, \Phi)$. Then $\tilde\sigma = \iint d\Phi \, d\cos\theta \, \tilde\rho$. Now we can compute

    \sigma = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta \, \rho(\cos\theta, \Phi)
           = \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta \, w(\cos\theta, \Phi) \, \tilde\rho(\cos\theta, \Phi)
           \approx \langle w \rangle_{\tilde\rho} \int_0^{2\pi} d\Phi \int_{-1}^{1} d\cos\theta \, \tilde\rho(\cos\theta, \Phi)
           = \langle w \rangle_{\tilde\rho} \, \tilde\sigma,  (14)

where $w(\cos\theta, \Phi) = \rho(\cos\theta, \Phi)/\tilde\rho(\cos\theta, \Phi)$ and $\langle w \rangle_{\tilde\rho}$ is the estimation of $w$ based on $\tilde\rho$. Here the MC estimator is

    \langle w \rangle_{\mathrm{MC}} = \frac{1}{n} \sum_{i=1}^{n} w_i  (15)

and the standard deviation is

    s_{\mathrm{MC}} = \left( \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( w_i - \langle w \rangle_{\mathrm{MC}} \right)^2 \right)^{1/2}.  (16)

The final numerical result based on the MC estimator is

    \sigma_{\mathrm{MC}} = \tilde\sigma \, \langle w \rangle_{\mathrm{MC}} \pm \tilde\sigma \, s_{\mathrm{MC}}.  (17)

As we can see, the principle of a Monte Carlo estimator in physics is to simulate the cross section in interaction and radiation transport, knowing the probability distributions (or an approximation) governing each interaction of the elementary particles.

Based on this result, the Monte Carlo algorithm used to generate events is as follows. It takes as input $\tilde\rho(\cos\theta, \Phi)$ and, in a main loop, considers the following steps: (1) generate a pair $(\cos\theta, \Phi)$ from $\tilde\rho$; (2) compute the four-momenta $p_1, p_2, q_1, q_2$; (3) compute $w = \rho/\tilde\rho$. The loop can be stopped in the case of unweighted events, and we stay in the loop for weighted events. As output, the algorithm returns the four-momenta of the particles for unweighted events, and the four-momenta together with an array of weights for weighted events. The main issue is how to initialize the input of the algorithm. Based on the $d\sigma$ formula (for $W_1(s) = 1$ and $W_2(s) = 0$), we can consider as input $\tilde\rho(\cos\theta, \Phi) = (\alpha^2/4s)(1 + \cos^2\theta)$. Then $\tilde\sigma = 4\pi\alpha^2/3s$.

In HEP, the theoretical predictions used for modeling particle collision processes (as shown in the presented example) should be provided in terms of Monte Carlo event generators, which directly simulate these processes and can provide unweighted (weight = 1) events. A good Monte Carlo algorithm should be used not only for numerical integration [7] (i.e., providing weighted events) but also for the efficient generation of unweighted events, which is a very important issue for HEP.
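The weighted/unweighted distinction above can be sketched in a few lines. This is a minimal illustration, not the paper's generator: it sets $\alpha = s = 1$, draws $(\cos\theta, \Phi)$ uniformly and uses $\tilde\rho$ itself as the weight (a simplification of the scheme in the text, where one samples from $\tilde\rho$ and weights by $\rho/\tilde\rho$); unweighting is done by acceptance-rejection against the maximum of $\tilde\rho$.

```python
import math
import random

def generate_event(rng, alpha=1.0, s=1.0):
    """Draw one (cos_theta, phi) pair uniformly and return it with its weight.

    The density rho~(cos_theta, phi) = (alpha^2/4s)(1 + cos_theta^2) is the
    pure-process approximation used in the text (W1 = 1, W2 = 0)."""
    cos_theta = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    weight = (alpha**2 / (4.0 * s)) * (1.0 + cos_theta**2)
    return cos_theta, phi, weight

def weighted_sample(n, rng):
    """Weighted events: keep every draw together with its weight."""
    return [generate_event(rng) for _ in range(n)]

def unweighted_sample(n, rng, alpha=1.0, s=1.0):
    """Unweighted events (weight = 1) by acceptance-rejection:
    accept a draw with probability weight / weight_max."""
    w_max = (alpha**2 / (4.0 * s)) * 2.0   # maximum of rho~ at cos_theta = +/-1
    events = []
    while len(events) < n:
        cos_theta, phi, w = generate_event(rng, alpha, s)
        if rng.random() * w_max < w:
            events.append((cos_theta, phi))
    return events

def mc_cross_section(n, rng):
    """MC estimate of sigma~ = integral of rho~ over the full solid angle.
    The sampling volume is 2 (for cos_theta) * 2*pi (for phi) = 4*pi."""
    total = sum(w for _, _, w in weighted_sample(n, rng))
    return 4.0 * math.pi * total / n

rng = random.Random(42)
sigma_est = mc_cross_section(200_000, rng)
sigma_exact = 4.0 * math.pi / 3.0   # 4*pi*alpha^2 / 3s with alpha = s = 1
print(sigma_est, sigma_exact)
```

With 200,000 draws the estimate reproduces $\tilde\sigma = 4\pi\alpha^2/3s$ to well below a percent, which is the $n^{-1/2}$ convergence discussed later in Section 2.2.3.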

2.2.2. Markovian Monte-Carlo Chains. A classical Monte Carlo method estimates a function $F$ with an estimator $\hat F$ by using a random variable. The main problem with this approach is that we cannot predict any value in advance for a random variable. In HEP simulation experiments the systems are described in states [8]. Let us consider a system with a finite set of possible states $S_1, S_2, \ldots$, and $S_t$ the state at the moment $t$. The conditional probability is defined as

    P(S_t = S_j \mid S_{t_1} = S_{i_1}, S_{t_2} = S_{i_2}, \ldots, S_{t_n} = S_{i_n}),  (18)

where the mappings $(t_1, i_1), \ldots, (t_n, i_n)$ can be interpreted as the description of the system's evolution in time, by specifying a specific state for each moment of time.

The system is a Markov chain if the distribution of $S_t$ depends only on its immediate predecessor $S_{t-1}$ and is independent of all previous states, as follows:

    P(S_t = S_j \mid S_{t-1} = S_{i_{t-1}}, \ldots, S_{t_2} = S_{i_2}, S_{t_1} = S_{i_1})
        = P(S_t = S_j \mid S_{t-1} = S_{i_{t-1}}).  (19)

To generate the time steps $(t_1, t_2, \ldots, t_n)$ we use the probability of a single forward Markovian step, given by $p(t \mid t_n)$, with the property $\int_{t_n}^{\infty} p(t \mid t_n) \, dt = 1$, and we define $p(t) = p(t \mid 0)$. The 1-dimensional Monte Carlo Markovian algorithm used to generate the time steps is presented in Algorithm 3.

The main result of Algorithm 3 is that $P(t_{\max})$ follows a Poisson distribution:

    P_N = \int_0^{t_{\max}} p(t_1 \mid t_0) \, dt_1 \times \int_{t_1}^{t_{\max}} p(t_2 \mid t_1) \, dt_2
          \times \cdots \times \int_{t_{N-1}}^{t_{\max}} p(t_N \mid t_{N-1}) \, dt_N
          \times \int_{t_{\max}}^{\infty} p(t_{N+1} \mid t_N) \, dt_{N+1}
        = \frac{1}{N!} (t_{\max})^N e^{-t_{\max}}.  (20)

We can consider the 1-dimensional Monte Carlo Markovian algorithm as a method used to iteratively generate the system's states (codified as a Markov chain) in simulation experiments. According to the Ergodic Theorem for Markov chains, the chain defined here has a unique stationary probability distribution [9, 10].

Figures 4 and 5 present runs of Algorithm 3. According to the different values of the parameter $s$ used to generate the next step, the results are very different over 1000 iterations. Figure 4, for $s = 1$, shows a noise-like profile. For $s = 10, 100, 1000$ the profile looks as if some of the information is filtered and lost. The best results are obtained for $s = 0.01$


(1) Generate t_1 according to p(t_1) = p(t_1 | t_0 = 0)
(2) if t_1 < t_max then  ▷ Generate the initial state
(3)     P_{N>=1} = \int_0^{t_max} p(t_1 | t_0) dt_1  ▷ Compute the initial probability
(4)     Retain t_1
(5) end if
(6) if t_1 > t_max then  ▷ Discard all generated and computed data
(7)     N = 0; P_0 = \int_{t_max}^{\infty} p(t_1 | t_0) dt_1 = e^{-t_max}
(8)     Delete t_1
(9)     EXIT  ▷ The algorithm ends here
(10) end if
(11) i = 2
(12) while (1) do  ▷ Infinite loop until a successful EXIT
(13)     Generate t_i according to p(t_i | t_{i-1})
(14)     if t_i < t_max then  ▷ Generate a new state and new probability
(15)         P_{N>=i} = \int_{t_i}^{t_max} p(t_i | t_{i-1}) dt_i
(16)         Retain t_i
(17)     end if
(18)     if t_i > t_max then  ▷ Discard all generated and computed data
(19)         N = i - 1; P_i = \int_{t_max}^{\infty} p(t_i | t_{i-1}) dt_i
(20)         Retain (t_1, t_2, ..., t_{i-1}); Delete t_i
(21)         EXIT  ▷ The algorithm ends here
(22)     end if
(23)     i = i + 1
(24) end while

Algorithm 3: 1-dimensional Monte Carlo Markovian algorithm.
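Algorithm 3 can be exercised with a concrete step density. The sketch below assumes the exponential density $p(t \mid t_n) = e^{-(t - t_n)}$, which satisfies the normalization $\int_{t_n}^{\infty} p(t \mid t_n) \, dt = 1$ used above; the Poisson result (20) can then be checked empirically.

```python
import math
import random

def markov_time_steps(t_max, rng):
    """Sketch of Algorithm 3 for the exponential step density
    p(t | t_n) = exp(-(t - t_n)). Returns the retained steps
    t_1 < t_2 < ... < t_N <= t_max; the first step beyond t_max
    is discarded, which ends the loop (the EXIT branch)."""
    steps = []
    t = 0.0
    while True:
        # Draw t_i from p(t | t_{i-1}) by inverse transform sampling;
        # 1 - random() lies in (0, 1], so log() is always defined.
        t = t - math.log(1.0 - rng.random())
        if t > t_max:
            return steps
        steps.append(t)

# Per (20), the number of retained steps N is Poisson(t_max) distributed:
rng = random.Random(7)
t_max = 3.0
counts = [len(markov_time_steps(t_max, rng)) for _ in range(50_000)]
mean_n = sum(counts) / len(counts)
p0 = counts.count(0) / len(counts)   # empirical P(N = 0)
print(mean_n, p0, math.exp(-t_max))  # mean near t_max, P(N=0) near e^{-t_max}
```

Over 50,000 runs the empirical mean of $N$ approaches $t_{\max}$ and the fraction of runs with $N = 0$ approaches $e^{-t_{\max}}$, both as predicted by (20).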

and $s = 0.1$, where the generated values can be easily accepted for MC simulation in HEP experiments.

Figure 5 shows the acceptance rate of the values generated with the parameter $s$ used in the algorithm; the parameter values are correlated with Figure 4. The results in Figure 5 show that the acceptance rate decreases rapidly with increasing values of the parameter $s$. The conclusion is that the values must be kept small to obtain meaningful data. A correlation with the normal distribution is evident, showing that a small value for the mean square deviation provides useful results.

2.2.3. Performance of Numerical Algorithms Used in MC Simulations. The numerical methods used to compute the MC estimator use numerical quadratures to approximate the value of the integral of a function $f$ on a specific domain by a linear combination of function values and weights $\{w_i\}_{1 \le i \le m}$, as follows:

    \int_a^b f(r) \, dr = \sum_{i=1}^{m} w_i f(r_i).  (21)

We can consider a consistent MC estimator as a classical numerical quadrature with all $w_i = 1$. The efficiency of integration methods for 1 dimension and for $d$ dimensions is presented in Table 1. We can conclude that quadrature methods are difficult to apply in many dimensions for variate integration domains (regions), and the integral is not easy to estimate.

Table 1: Efficiency of integration methods for 1 dimension and for d dimensions.

Method                       | 1 dimension | d dimensions
Monte Carlo                  | n^(-1/2)    | n^(-1/2)
Trapezoidal rule             | n^(-2)      | n^(-2/d)
Simpson's rule               | n^(-4)      | n^(-4/d)
m-points Gauss rule (m < n)  | n^(-2m)     | n^(-2m/d)

As a practical example, in a typical high-energy particle collision there can be many final-state particles (even hundreds). If we have $n$ final-state particles, we face a $d = 3n - 4$ dimensional phase space. As a numerical example, for $n = 4$ we have $d = 8$ dimensions, which is a very difficult setting for classical numerical quadratures.

The full decomposition of the integration volume, for one double number (10 bytes) per volume unit, requires $n^d \times 10$ bytes. For the example considered, with $d = 8$ and $n = 10$ divisions of the interval $[0, 1]$, we have for one numerical integration

    n^d \times 10 \, \text{Bytes} = 10^8 \times \frac{10}{1024^3} \, \text{GBytes} \approx 0.93 \, \text{GBytes}.  (22)

Considering $10^6$ events per second and one integration per event, the data produced in one hour will be ${\approx}3197.4$ PBytes.
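These storage figures follow from a few lines of arithmetic (binary unit prefixes, i.e. powers of 1024, as in the text):

```python
# Back-of-the-envelope check of the storage estimates:
# a full grid with n divisions per axis in d dimensions, 10 bytes per cell.
n, d, bytes_per_cell = 10, 8, 10

per_integration = n**d * bytes_per_cell          # bytes for one integration
gbytes = per_integration / 1024**3               # ~0.93 GBytes, as in (22)

events_per_hour = 10**6 * 3600                   # 10^6 events/s, one integration each
pbytes_per_hour = per_integration * events_per_hour / 1024**5
print(gbytes, pbytes_per_hour)                   # ~0.93 GB and ~3197.4 PB
```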

[Figure 4 omitted: panels (a)-(f) plot the generated state versus iteration (1000 iterations) for s = 0.01, 0.1, 1, 10, 100, and 1000.]

Figure 4: Example of the 1-dimensional Monte Carlo Markovian algorithm.

The previous assumption holds only for multidimensional arrays. Due to the factorization assumption $p(r_1, r_2, \ldots, r_n) = p(r_1) p(r_2) \cdots p(r_n)$, we obtain for one integration

    n \times d \times 10 \, \text{Bytes} = 800 \, \text{Bytes},  (23)

which means ${\approx}2.62$ TBytes of data produced for one hour of simulations.

2.3. Unfolding Processes in Particle Physics and Kernel Estimation in HEP. In particle physics analysis we have two types of distributions: the true distribution (considered in theoretical models) and the measured distribution (considered in experimental models, which are affected by the finite resolution and limited acceptance of existing detectors). A HEP interaction process starts with a known true distribution and generates a measured distribution, corresponding to an experiment of a well-confirmed theory. An inverse process starts with a measured distribution and tries to identify the true distribution. These unfolding processes are used to identify new theories based on experiments [11].

[Figure 5 omitted: acceptance rate (%) versus log10(s).]

Figure 5: Analysis of the acceptance rate for the 1-dimensional Monte Carlo Markovian algorithm for different s values.

2.3.1. Unfolding Processes in Particle Physics. The theory of unfolding processes in particle physics is as follows [12]. For a physics variable $t$ we have a true distribution $f(t)$, mapped in $x$, an $n$-vector of unknowns, and a measured distribution $g(s)$ (for a measured variable $s$), mapped in an $m$-vector of measured data. A response matrix $A \in R^{m \times n}$ encodes a kernel function $K(s, t)$ describing the physical measurement process [12–15]. The direct and inverse processes are described by the Fredholm integral equation [16] of the first kind, for a specific domain $\Omega$:

    \int_{\Omega} K(s, t) f(t) \, dt = g(s).  (24)

In particle physics the kernel function $K(s, t)$ is usually known from a Monte Carlo sample obtained from simulation. A numerical solution is obtained using the linear system $Ax = y$. The vectors $x$ and $y$ are assumed to be 1-dimensional in theory, but they can be multidimensional in practice (considering multiple independent linear equations). In practice, also, the statistical properties of the measurements are well known, and often they follow Poisson statistics [17]. To solve the linear systems we have different numerical methods.

The first method is based on the linear transformation $x = A^{\#} y$. If $m = n$, then $A^{\#} = A^{-1}$, and we can use direct Gaussian methods, iterative methods (Gauss-Seidel, Jacobi, or SOR), or orthogonal methods (based on the Householder transformation, Givens methods, or the Gram-Schmidt algorithm). If $m > n$ (the most frequent scenario), we construct the matrix $A^{\#} = (A^T A)^{-1} A^T$ (called the Penrose-Moore pseudoinverse). In these cases the orthogonal methods offer very good and stable numerical solutions.
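A toy sketch of the $m > n$ case: the response matrix and true spectrum below are made up for illustration (not from any experiment), and the measurement is taken as noiseless so that the pseudoinverse recovers the true histogram exactly.

```python
import numpy as np

# Toy unfolding: a response matrix A maps a "true" histogram x to a
# "measured" histogram y = A x; with m > n we recover x through the
# Penrose-Moore pseudoinverse A# = (A^T A)^{-1} A^T.
A = np.array([[0.8, 0.1, 0.0],
              [0.2, 0.7, 0.2],
              [0.0, 0.2, 0.7],
              [0.0, 0.0, 0.1]])          # m = 4 measured bins, n = 3 true bins
x_true = np.array([100.0, 50.0, 25.0])   # illustrative true spectrum
y = A @ x_true                            # noiseless measured spectrum

A_pinv = np.linalg.inv(A.T @ A) @ A.T     # pseudoinverse for the m > n case
x_rec = A_pinv @ y
print(x_rec)                              # recovers x_true up to round-off
```

In production code one would use an orthogonal factorization (QR or SVD) instead of forming $A^T A$ explicitly, for the stability reasons the text mentions.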


The second method considers the singular value decomposition:

    A = U \Sigma V^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T,  (25)

where $U \in R^{m \times n}$ and $V \in R^{n \times n}$ are matrices with orthonormal columns and the diagonal matrix is $\Sigma = \mathrm{diag}\{\sigma_1, \ldots, \sigma_n\} = U^T A V$. The solution is

    x = A^{\#} y = V \Sigma^{-1} (U^T y) = \sum_{i=1}^{n} \frac{1}{\sigma_i} (u_i^T y) v_i = \sum_{i=1}^{n} \frac{c_i}{\sigma_i} v_i,  (26)

where $c_i = u_i^T y$, $i = 1, \ldots, n$, are called the Fourier coefficients.
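The SVD route (25)-(26) can be sketched on an overdetermined system with an illustrative random matrix (the data here are arbitrary, for demonstration only):

```python
import numpy as np

# SVD-based solution of an overdetermined system A x = y:
# x = V Sigma^{-1} U^T y, written via the Fourier coefficients c_i = u_i^T y.
rng = np.random.default_rng(0)
A = rng.random((5, 3))                   # m = 5, n = 3, full column rank a.s.
x_true = np.array([3.0, -1.0, 2.0])
y = A @ x_true

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: U is m x n
c = U.T @ y                              # Fourier coefficients c_i = u_i^T y
x_rec = Vt.T @ (c / sigma)               # sum_i (c_i / sigma_i) v_i
print(x_rec)
```

Small singular values amplify noise through the $c_i/\sigma_i$ terms, which is why regularized SVD unfolding [15] truncates or damps them in practice.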

232 RandomMatrix Theory Analysis of particle spectrum(eg neutrino spectrum) faces with Random Matrix Theory(RMT) especially if we consider anarchic neutrino massesThe RMT means the study of the statistical properties ofeigenvalues of very large matrices [18] For an interactionmatrix 119860 (with size 119873) where 119860 119894119895 is an independent dis-tributed random variable and 119860119867 is the complex conjugateand transposematrix we define119872 = 119860+119860

119867 which describesa Gaussian Unitary Ensemble (GUE) The GUE propertiesare described by the probability distribution 119875(119872)119889119872 (1)it is invariant under unitary transformation 119875(119872)119889119898 =

119875(1198721015840)1198891198721015840 where 1198721015840 = 119880119867119872119880 119880 is a Hermitian matrix

(119880119867119880 = 119868) (2) the elements of 119872 matrix are statisticallyindependent119875(119872) = prod

119894le119895119875119894119895(119872119894119895) and (3) thematrix119872 can

be diagonalized as119872 = 119880119863119880119867 where119880 = diag1205821 120582119873120582119894 is the eigenvalue of119872 and 120582119894 le 120582119895 if 119894 lt 119895

Propability (2) 119875 (119872) 119889119872 sim 119889119872 exp minus1198732119879119903 (119872

119867119872)

Propability (3) 119875 (119872) 119889119872 sim 119889119880prod119894

119889120582119894prod119894lt119895

(120582119894 minus 120582119895)2

times expminus1198732sum119894

(1205822

119894)

(27)The numerical methods used for eigenvalues computa-

tion are the QR method and Power methods (direct andindirect) The QR method is a numerical stable algorithmand Power method is an iterative one The RMT can be usedfor many body systems quantum chaos disordered systemsquantum chromodynamics and so forth
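A minimal GUE sketch follows; the $1/(2\sqrt{N})$ normalization is a common convention assumed here (the text fixes no particular scaling). Since $M = A + A^H$ is Hermitian, its spectrum is real, and `eigvalsh` (a QR-family LAPACK routine) returns it in the ascending order required by property (3).

```python
import numpy as np

# Minimal GUE sketch: M = A + A^H for a complex Gaussian matrix A is
# Hermitian, so its eigenvalues are real and can be sorted ascending.
rng = np.random.default_rng(1)
N = 200
A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
M = (A + A.conj().T) / (2.0 * np.sqrt(N))   # Hermitian, scaled (assumed convention)

eigvals = np.linalg.eigvalsh(M)             # numerically stable, ascending order
print(eigvals[0], eigvals[-1])              # spectrum edges
```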

2.3.3. Kernel Estimation in HEP. Kernel estimation is a very powerful solution and a relevant method for HEP when it is necessary to combine data from heterogeneous sources, like MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data $\{x_i\}_{1 \le i \le n}$, with a constant bandwidth $h$ (the difference between two consecutive data values), called the smoothing parameter, we have the estimate

    \hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right),  (28)

where $K$ is an estimator. For example, a Gauss estimator with mean $\mu$ and standard deviation $\sigma$ is

    K(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\},  (29)

which has the following properties: it is positive definite and infinitely differentiable (due to the exp function), and it can be defined for an infinite support ($n \to \infty$). The kernel is a nonparametric method, which means that $h$ is independent of the dataset, and for a large amount of normally distributed data we can find a value for $h$ that minimizes the integrated squared error of $\hat f(x)$. This value for the bandwidth is computed as

    h^{*} = \left( \frac{4}{3n} \right)^{1/5} \sigma.  (30)

The main problem in kernel estimation is that the set of data $\{x_i\}_{1 \le i \le n}$ is not normally distributed, and in real experiments the optimal bandwidth is not known. An improvement of the presented method considers the adaptive kernel estimation proposed by Abramson [20], where $h_i = h/\sqrt{\hat f(x_i)}$, and $h$ and $\sigma$ are considered global qualities of the dataset. The new form is

    \hat f_a(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i} K\left( \frac{x - x_i}{h_i} \right),  (31)

and the local bandwidth value that minimizes the integrated squared error of $\hat f_a(x)$ is

    h_i^{*} = \left( \frac{4}{3n} \right)^{1/5} \sqrt{\frac{\sigma}{\hat f(x_i)}},  (32)

where $\hat f$ is the normal estimator.

Kernel estimation is used for event selection and for confidence level evaluation, for example, in Markovian Monte Carlo chains or in the selection of the neural network output used in experiments for the reconstructed Higgs mass. In general, the main usage of kernel estimation in HEP is searching for new particles by finding relevant data in a large dataset.

A method based on kernel estimation is the graphical representation of datasets using the advanced shifted histogram algorithm (ASH). This is a numerical interpolation for large datasets, with the main aim of creating a set of $n_{\text{bin}}$ histograms $H = \{H_i\}$ with the same bin-width $h$. Algorithm 4 presents the steps of histogram generation, starting with a specific interval $[a, b]$, a number of points $n$ in this interval, a number of bins, and a number of values used for kernel estimation, $m$. Figure 6 shows the results of kernel estimation of the function $f(x) = -(1/2)x^2$ on $[0, 1]$ and the graphical representation with a different number of bins. The values on the vertical axis are aggregated in step 17 of Algorithm 4 and increase with the number of bins.
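The fixed-bandwidth estimator (28)-(30) and Abramson's adaptive variant (31)-(32) can be sketched as follows. The standard-normal test data and the use of the fixed-$h$ estimate as the pilot $\hat f(x_i)$ are assumptions made for illustration:

```python
import math
import random

def gauss_kernel(u):
    """Gauss estimator, eq. (29), with mu = 0 and sigma = 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    """Fixed-bandwidth estimate, eq. (28)."""
    return sum(gauss_kernel((x - xi) / h) for xi in data) / (len(data) * h)

def adaptive_kde(x, data, h):
    """Abramson's adaptive estimate, eq. (31), with local bandwidth
    h_i = h / sqrt(f(x_i)), the pilot f taken from the fixed-h estimate."""
    total = 0.0
    for xi in data:
        hi = h / math.sqrt(kde(xi, data, h))
        total += gauss_kernel((x - xi) / hi) / hi
    return total / len(data)

rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(400)]
sigma = (sum(v * v for v in data) / len(data)) ** 0.5  # sample std (mean ~ 0)
h_star = (4.0 / (3.0 * len(data))) ** 0.2 * sigma      # optimal bandwidth, eq. (30)
print(kde(0.0, data, h_star), adaptive_kde(0.0, data, h_star))
```

Both estimates at $x = 0$ land near the true standard-normal density $1/\sqrt{2\pi} \approx 0.399$; the adaptive version widens the bandwidth in the sparse tails, where the fixed-$h$ estimate is noisiest.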

2.3.4. Performance of Numerical Algorithms Used in Particle Physics. All operations used in the presented methods for particle physics (unfolding processes, Random Matrix Theory,


(1) procedure ASH(a, b, n, x, nbin, m, fm)
(2)     δ = (b - a)/nbin; h = mδ
(3)     for k = 1, ..., nbin do
(4)         v_k = 0; y_k = 0
(5)     end for
(6)     for i = 1, ..., n do
(7)         k = ⌊(x_i - a)/δ⌋ + 1
(8)         if k ∈ [1, nbin] then
(9)             v_k = v_k + 1
(10)        end if
(11)    end for
(12)    for k = 1, ..., nbin do
(13)        if v_k = 0 then
(14)            k = k + 1
(15)        end if
(16)        for i = max{1, k - m + 1}, ..., min{nbin, k + m - 1} do
(17)            y_i = y_i + v_k fm(i - k)
(18)        end for
(19)    end for
(20)    for k = 1, ..., nbin do
(21)        y_k = y_k/(nh)
(22)        t_k = a + (k - 0.5)δ
(23)    end for
(24)    return {t_k}_{1≤k≤nbin}, {y_k}_{1≤k≤nbin}
(25) end procedure

Algorithm 4: Advanced shifted histogram (1D algorithm).
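A direct Python transcription of Algorithm 4 follows (0-based indexing; the triangular weight function `fm` is an assumed choice for illustration, since the listing leaves `fm` as an input):

```python
def ash(a, b, x, nbin, m, fm):
    """Sketch of Algorithm 4: smooth the raw bin counts v_k by spreading
    each count over 2m - 1 neighbouring bins with weights fm(i - k)."""
    delta = (b - a) / nbin
    h = m * delta
    n = len(x)
    v = [0] * nbin                        # raw bin counts
    y = [0.0] * nbin                      # smoothed heights
    for xi in x:
        k = int((xi - a) / delta)         # 0-based bin index
        if 0 <= k < nbin:
            v[k] += 1
    for k in range(nbin):
        if v[k] == 0:
            continue                      # skip empty bins
        for i in range(max(0, k - m + 1), min(nbin, k + m)):
            y[i] += v[k] * fm(i - k)      # step 17: aggregate weighted counts
    t = [a + (k + 0.5) * delta for k in range(nbin)]
    y = [yk / (n * h) for yk in y]
    return t, y

# Triangular weights, a common ASH choice (an assumption here):
m = 4
fm = lambda d: 1.0 - abs(d) / m
t, y = ash(0.0, 1.0, [0.1, 0.2, 0.25, 0.5, 0.51, 0.9], 10, m, fm)
print(t[0], y[4])
```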

and kernel estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21] the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, with extended precision, is described. This permits higher internal precision and mixed input/output types. The precision allows the implementation of some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check of an $Ax = y$ system, with $A \in R^{n \times n}$ a dense matrix. The residual check is defined as

    r_\infty = \frac{\|Ax - y\|_\infty}{n \epsilon \left( \|A\|_\infty \|x\|_\infty + \|y\|_\infty \right)} < 16,  (33)

where $\epsilon$ is the relative machine precision for the IEEE representation standard, $\|y\|_\infty$ is the infinity norm of a vector, $\|y\|_\infty = \max_{1 \le i \le n} |y_i|$, and $\|A\|_\infty$ is the infinity norm of a matrix, $\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |A_{ij}|$.

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for simple and double precision. For simple precision (ε_s = 2^−24), for all n ≥ 1.05 × 10^6 the residual check is always lower than the imposed upper bound; similarly for double precision (ε_d = 2^−53), for all n ≥ 5.63 × 10^14. If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for the data volume in this model.
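The residual check of Formula (33) is straightforward to evaluate in practice. The sketch below, using NumPy, is an illustrative implementation; note that NumPy's `finfo(float64).eps` is the machine epsilon 2^−52, one power of two larger than the relative machine precision ε_d = 2^−53 used in the text, so either convention can be passed in explicitly.

```python
import numpy as np

def residual_check(A, x, y, eps):
    """Scaled residual of Formula (33) for the solved system A x = y."""
    n = A.shape[0]
    num = np.linalg.norm(A @ x - y, np.inf)
    den = n * eps * (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
                     + np.linalg.norm(y, np.inf))
    return num / den

# solve a random dense system and verify the residual is below the bound
rng = np.random.default_rng(42)
n = 200
A = rng.standard_normal((n, n))
y = rng.standard_normal(n)
x = np.linalg.solve(A, y)
eps_d = 2.0 ** -53                    # relative machine precision (double)
r = residual_check(A, x, y, eps_d)    # expected to be well below 16
```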

In a single-processor system the complexity of algorithms depends only on the problem size n. We can assume T(n) = Θ(f(n)), where f(n) is a fundamental function (f(n) ∈ {1, n^α, a^n, log n, √n, ...}). In parallel systems (multiprocessor systems with p processors) we have the serial processing time T*(n) = T_1(n) and the parallel processing time T_p(n). The performance of parallel algorithms can be analyzed using the speed-up, efficiency, and isoefficiency metrics.

(i) The speed-up S(p) represents how much faster a parallel algorithm is than a corresponding sequential algorithm. The speed-up is defined as S(p) = T_1(n)/T_p(n). There are special bounds for speed-up [23]: S(p) ≤ p·p̄/(p + p̄ − 1), where p̄ = T_1/T_∞ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually S(p) ≤ p, but under special circumstances the speed-up can be S(p) > p [24]. Another upper bound is established by Amdahl's law: S(p) = (s + (1 − s)/p)^(−1) ≤ 1/s, where s is the fraction of a program that is sequential. The upper bound is obtained for a zero execution time of the parallel fraction.

(ii) The efficiency is the average utilization of the p processors: E(p) = S(p)/p.

(iii) The isoefficiency is the growth rate of the workload W_p(n) = p·T_p(n) in terms of the number of processors needed to keep the efficiency fixed. If we consider W_1(n) − E·W_p(n) = 0 for any fixed efficiency E, we obtain p = p(n). This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of n numbers using p processors we have n ≈ E(n + p log p), so n = Θ(p log p).
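The three metrics above translate directly into code. The following is a minimal sketch (the function names are ours, not from the paper):

```python
def speedup(t1, tp):
    """S(p) = T1(n) / Tp(n)."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p: average utilisation of the p processors."""
    return speedup(t1, tp) / p

def amdahl_speedup(s, p):
    """Amdahl's law: S(p) = 1 / (s + (1 - s)/p), bounded above by 1/s,
    where s is the sequential fraction of the program."""
    return 1.0 / (s + (1.0 - s) / p)
```

For example, with a 10% sequential fraction the speed-up saturates near 10 no matter how many processors are added.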

Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins: 10, 100, and 1000.

Figure 7: Residual check analysis for solving the Ax = y system in HPL 2.0 using simple and double precision representation.

Numerical algorithms use a hypercube architecture for implementation. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers T_com = t_s + L·t_w, where t_s is the latency (the time needed by a message to cross through the network), t_w

Table 2: Isoefficiency for a hypercube architecture, n = Θ(p log p) and n = Θ(p(log p)^2). We marked with (*) the limitations imposed by Formula (33).

Scenario | Architecture size (p) | n = Θ(p log p) | n = Θ(p(log p)^2)
1        | 10^1                  | 1.0 × 10^1     | 1.00 × 10^1
2        | 10^2                  | 2.0 × 10^2     | 8.00 × 10^2
3        | 10^3                  | 3.0 × 10^3     | 2.70 × 10^4
4        | 10^4                  | 4.0 × 10^4     | 6.40 × 10^5 (*)
5        | 10^5                  | 5.0 × 10^5 (*) | 1.25 × 10^7
6        | 10^6                  | 6.0 × 10^6     | 2.16 × 10^8
7        | 10^7                  | 7.0 × 10^7     | 3.43 × 10^9
8        | 10^8                  | 8.0 × 10^8     | 5.12 × 10^10
9        | 10^9                  | 9.0 × 10^9     | 7.29 × 10^11

is the time needed to send a word (1/t_w is called bandwidth), and L is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define t_c as the processing time per word for a processor. We have the following results.

(i) External product x·y^T. The isoefficiency is written as

    t_c·n ≈ E(t_c·n + (t_s + t_w)·p log p) ⟹ n = Θ(p log p).    (34)

The parallel processing time is T_p = t_c·n/p + (t_s + t_w) log p. The optimality is computed using

    dT_p/dp = 0 ⟹ −t_c·n/p^2 + (t_s + t_w)/p = 0 ⟹ p ≈ t_c·n/(t_s + t_w).    (35)
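The optimality condition (35) can be checked numerically: the processor count p ≈ t_c·n/(t_s + t_w) should minimise T_p. A small sketch with invented parameter values:

```python
import math

def parallel_time(tc, ts, tw, n, p):
    """Tp = tc*n/p + (ts + tw)*log(p) for the external product x y^T."""
    return tc * n / p + (ts + tw) * math.log(p)

def optimal_processors(tc, ts, tw, n):
    """Setting dTp/dp = 0 gives p ~ tc*n / (ts + tw)."""
    return tc * n / (ts + tw)

# illustrative values: tc = 1, ts + tw = 4, n = 100  =>  p* = 25
p_star = optimal_processors(1.0, 1.0, 3.0, 100)
t_opt = parallel_time(1.0, 1.0, 3.0, 100, p_star)
```

Evaluating T_p at p*/2 and 2·p* confirms that p* is a minimum of the model.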

(ii) Scalar product (internal product) x^T·y = Σ_{i=1}^{n} x_i·y_i. The isoefficiency is written as

    t_c·n^2 ≈ E(t_c·n^2 + (t_s/2)·p log p + (t_w/2)·n·√p log p) ⟹ n = Θ(p(log p)^2).    (36)

(iii) Matrix-vector product y = A·x, y_i = Σ_{j=1}^{n} A_ij·x_j. The isoefficiency is written as

    t_c·n^2 ≈ E(t_c·n^2 + t_s·p log p + t_w·n·√p log p) ⟹ n = Θ(p(log p)^2).    (37)

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound n ≥ 1.05 × 10^6 are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits for the amount of data, which means that we also have an upper bound for the Big Data volume in this case.

The factors that determine the efficiency of parallel algorithms are task balancing (work-load distribution between all used processors in a system → to be maximized), concurrency (the number/percentage of processors working simultaneously → to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing → to be minimized).


Figure 8: Processing flows for HEP experiments.

3. New Challenges for Big Data Science

There are a lot of applications that generate Big Data, like social networking profiles, social influence, SaaS & Cloud Apps, public web information, MapReduce scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the approaches and solutions to better answer the users' needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TB of valuable data) and simulation data (data generated using MC methods based on theoretical models). The processing phase is represented by modeling and reconstruction, in order to find the properties of observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results will establish how realistic a simulation experiment is and validate it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes of resources' availability, and so forth. We pay special attention to Cloud systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are the maximization of system throughput, sites' autonomy, scalability, fault-tolerance, and quality of services.

When discussing Big Data we have in mind the 5Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need for many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrives too fast for optimal storage and indexing

[28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, Hadoop YARN: a next-generation framework for Hadoop data processing, 2013, http://hortonworks.com/hadoop/yarn), Corona (Corona: under the hood: scheduling MapReduce jobs more efficiently with Corona, 2012, Facebook), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future Scientific Data Infrastructure are presented in [31]. The paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science. The paper proposes the SDI generic architecture model that provides a basis for building interoperable data- or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communication records, emails, and so forth. Hadoop relies on an internal redundant data structure with cost advantages and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.

Data transfer scheduling in Grids, Clouds, P2P, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application's architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time, for finding a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and querying, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection


of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data is too heterogeneous to fit into a rigid schema [38].

Another challenge is the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, and also the dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. An example of optimization is the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42], or optimizing short jobs [43]. The cost effectiveness, scalability, and streamlined architectures of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to a public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), complex task dependencies for data workflows, and new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method the proper numerical method has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper puts together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalues computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research presented in this paper is supported by the following projects: "SideSTEP: Scheduling Methods for Dynamic Distributed Systems, a self-* approach" (PN-II-CT-RO-FR-2012-1-0084); "ERRIC: Empowering Romanian Research on Intelligent Information Technologies", FP7-REGPOT-2010-1, ID 264207; CyberWater, Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, Project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.

[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.

[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.

[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678-683, 2006.

[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488-493, 2002.

[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585-591, 2009.

[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43-48, 2009.

[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 05, 2006.

[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219-228, 1998.

[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311-322, 1998.

[11] O. Behnke, K. Kroninger, G. Schott, and T. H. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.

[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).

[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.

[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.

[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469-481, 1996.

[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333-345, 2013.

[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533-540, Springer, Berlin, Germany, 2009.

[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233-297, 2005.

[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198-207, 2001.

[20] I. S. Abramson, "On bandwidth variation in kernel estimates: a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217-1223, 1982.

[21] X. Li, J. W. Demmel, D. H. Bailey et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152-205, 2002.

[22] J. J. Dongarra and J. Langou, "The problem with the Linpack benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5-13, 2009.

[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640-649, April 1995.

[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.

[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.

[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710-2721, 2013.

[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014-2015, 2012.

[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1-5, Springer, Berlin, Germany, 2013.

[29] B. Hindman, A. Konwinski, M. Zaharia et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.

[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.

[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614-617, IEEE Computer Society, Washington, DC, USA, 2012.

[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.

[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097-1111, 2013.

[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153-172, 2012.

[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109-110, ACM, New York, NY, USA, 2013.

[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (PGCIC '11), pp. 59-66, Barcelona, Spain, October 2011.

[37] J. Chen, Y. Chen, X. Du et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157-164, 2013.

[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7-11, ACM, New York, NY, USA, 2012.

[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973-985, 2013.

[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104-112, 2013.

[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355-367, 2007.

[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.

[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985-996, 2013.



(1) Generate t_1 according to p(t_1) = p(t_1 | t_0 = 0)
(2) if t_1 < t_max then ⊳ Generate the initial state
(3)   P_{N≥1} = ∫_0^{t_max} p(t_1 | t_0) dt_1 ⊳ Compute the initial probability
(4)   Retain t_1
(5) end if
(6) if t_1 > t_max then ⊳ Discard all generated and computed data
(7)   N = 0; P_0 = ∫_{t_max}^{∞} p(t_1 | t_0) dt_1 = e^{−t_max}
(8)   Delete t_1
(9)   EXIT ⊳ The algorithm ends here
(10) end if
(11) i = 2
(12) while (1) do ⊳ Infinite loop until a successful EXIT
(13)   Generate t_i according to p(t_i | t_{i−1})
(14)   if t_i < t_max then ⊳ Generate a new state and new probability
(15)     P_{N≥i} = ∫_{t_i}^{t_max} p(t_i | t_{i−1}) dt_i
(16)     Retain t_i
(17)   end if
(18)   if t_i > t_max then ⊳ Discard all generated and computed data
(19)     N = i − 1; P_i = ∫_{t_max}^{∞} p(t_i | t_{i−1}) dt_i
(20)     Retain (t_1, t_2, ..., t_{i−1}); Delete t_i
(21)     EXIT ⊳ The algorithm ends here
(22)   end if
(23)   i = i + 1
(24) end while

Algorithm 3: 1-dimensional Monte Carlo Markovian algorithm.
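A compact Python sketch of Algorithm 3 follows. We assume an exponential conditional density p(t_i | t_{i−1}) = e^{−(t_i − t_{i−1})}, which is consistent with P_0 = e^{−t_max} in step (7); under that assumption the retained chain is a unit-rate Poisson process on [0, t_max] and the number of retained states N is Poisson(t_max).

```python
import random

def mc_markov_chain(t_max, rng):
    """Grow a chain t1 < t2 < ... by sampling each step from
    p(t_i | t_{i-1}) = exp(-(t_i - t_{i-1})) (unit-rate exponential
    increments) and stop at the first t_i > t_max; that last value
    is discarded, the earlier states are retained."""
    states, t = [], 0.0
    while True:
        t += rng.expovariate(1.0)   # draw t_i given t_{i-1}
        if t > t_max:               # discard t_i and stop
            return states
        states.append(t)            # retain t_i

# the mean chain length over many runs approximates t_max
rng = random.Random(0)
mean_n = sum(len(mc_markov_chain(3.0, rng)) for _ in range(20000)) / 20000
```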

and s = 0.1, and the generated values can easily be accepted for MC simulation in HEP experiments.

Figure 5 shows the acceptance rate of values generated with the parameter s used in the algorithm, and the parameter values are correlated with Figure 4. The results in Figure 5 show that the acceptance rate decreases rapidly with increasing values of the parameter s. The conclusion is that the values of s must be kept small to obtain meaningful data. A correlation with the normal distribution is evident, showing that a small value of the mean square deviation provides useful results.

2.2.3. Performance of Numerical Algorithms Used in MC Simulations. Numerical methods used to compute the MC estimator use numerical quadratures to approximate the value of the integral of a function f on a specific domain by a linear combination of function values and weights {w_i}_{1≤i≤m}, as follows:

    ∫_a^b f(r) dr = Σ_{i=1}^{m} w_i·f(r_i).    (21)

We can consider a consistent MC estimator a classical numerical quadrature with all w_i = 1. The efficiency of integration methods for 1 dimension and for d dimensions is presented in Table 1. We can conclude that quadrature methods are difficult to apply in many dimensions for variate integration domains (regions), and the integral is not easy to estimate.

Table 1: Efficiency of integration methods for 1 dimension and for d dimensions.

Method                      | 1 dimension | d dimensions
Monte Carlo                 | n^(−1/2)    | n^(−1/2)
Trapezoidal rule            | n^(−2)      | n^(−2/d)
Simpson's rule              | n^(−4)      | n^(−4/d)
m-points Gauss rule (m < n) | n^(−2m)     | n^(−2m/d)
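The MC column of Table 1 corresponds to the estimator (21) with all w_i = 1 and uniformly sampled points; its n^(−1/2) error is independent of dimension. A minimal sketch (test function and seed are our own choices):

```python
import random

def mc_integrate(f, a, b, n, seed=1):
    """Plain Monte Carlo estimator: (b - a)/n * sum of f(r_i),
    with r_i drawn uniformly from [a, b]."""
    rng = random.Random(seed)
    return (b - a) / n * sum(f(rng.uniform(a, b)) for _ in range(n))

# integral of f(r) = r^2 on [0, 1] is exactly 1/3;
# with n = 100000 samples the statistical error is about 1e-3
est = mc_integrate(lambda r: r * r, 0.0, 1.0, 100_000)
```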

As a practical example, in a typical high-energy particle collision there can be many final-state particles (even hundreds). If we have n final-state particles, we face a d = 3n − 4 dimensional phase space. As a numerical example, for n = 4 we have d = 8 dimensions, which is a very difficult setting for classical numerical quadratures.

The full decomposition integration volume, for one double number (10 Bytes) per volume unit, is n^d × 10 Bytes. For the example considered, with d = 8 and n = 10 divisions for the interval [0, 1], we have for one numerical integration

    n^d × 10 Bytes = 10^8 × 10 / 1024^3 GBytes ≈ 0.93 GBytes.    (22)

Considering 10^6 events per second and one integration per event, the data produced in one hour will be ≈3197.4 PBytes.


Figure 4: Example of the 1-dimensional Monte Carlo Markovian algorithm for s = 0.01, 0.1, 1, 10, 100, and 1000.

The previous assumption holds only for multidimensional arrays. Due to the factorization assumption p(r_1, r_2, ..., r_n) = p(r_1)·p(r_2)···p(r_n), we obtain for one integration

    n × d × 10 Bytes = 800 Bytes,    (23)

which means ≈2.62 TBytes of data produced for one hour of simulations.
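The storage figures in (22) and (23) are simple arithmetic and can be reproduced directly (function names are ours):

```python
def full_grid_bytes(n, d, bytes_per_unit=10):
    """Full decomposition: n^d volume units, 10 bytes each (Formula (22))."""
    return n ** d * bytes_per_unit

def factorized_bytes(n, d, bytes_per_unit=10):
    """Under p(r1,...,rn) = p(r1)...p(rn): only n*d units (Formula (23))."""
    return n * d * bytes_per_unit

gb_per_integration = full_grid_bytes(10, 8) / 1024 ** 3   # ~0.93 GBytes
# 10^6 events/s, one integration per event, for one hour, in TBytes:
tb_per_hour = factorized_bytes(10, 8) * 10 ** 6 * 3600 / 1024 ** 4  # ~2.62
```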

2.3. Unfolding Processes in Particle Physics and Kernel Estimation in HEP. In particle physics analysis we have two types of distributions: the true distribution (considered in theoretical models) and the measured distribution (considered in experimental models, which are affected by the finite resolution and limited acceptance of existing detectors). A HEP interaction process starts with a true, known distribution and generates a measured distribution, corresponding to an experiment of

Figure 5: Analysis of the acceptance rate for the 1-dimensional Monte Carlo Markovian algorithm for different s values.

a well-confirmed theory. An inverse process starts with a measured distribution and tries to identify the true distribution. These unfolding processes are used to identify new theories based on experiments [11].

2.3.1. Unfolding Processes in Particle Physics. The theory of unfolding processes in particle physics is as follows [12]. For a physics variable t we have a true distribution f(t), mapped in x, an n-vector of unknowns, and a measured distribution g(s) (for a measured variable s), mapped in an m-vector of measured data. A response matrix A ∈ R^{m×n} encodes a kernel function K(s, t) describing the physical measurement process [12-15]. The direct and inverse processes are described by the Fredholm integral equation [16] of the first kind, for a specific domain Ω:

∫_Ω K(s, t) f(t) dt = g(s). (24)

In particle physics the kernel function K(s, t) is usually known from a Monte Carlo sample obtained from simulation. A numerical solution is obtained using the linear equation Ax = y. The vectors x and y are assumed to be 1-dimensional in theory, but they can be multidimensional in practice (considering multiple independent linear equations). In practice, the statistical properties of the measurements are also well known, and often they follow Poisson statistics [17]. To solve the linear system we have several numerical methods.

The first method is based on the linear transformation x = A^+ y. If m = n, then A^+ = A^{-1} and we can use direct Gaussian methods, iterative methods (Gauss-Seidel, Jacobi, or SOR), or orthogonal methods (based on the Householder transformation, Givens rotations, or the Gram-Schmidt algorithm). If m > n (the most frequent scenario), we construct the matrix A^+ = (A^T A)^{-1} A^T (called the Moore-Penrose pseudoinverse). In these cases the orthogonal methods offer very good and stable numerical solutions.
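As a minimal illustration of the m > n case, the normal-equations pseudoinverse can be written out explicitly for two unknowns; the 3×2 matrix and data vector below are made-up toy values, not detector data:

```python
# Minimal sketch of the m > n unfolding step: x = (A^T A)^(-1) A^T y.
# Pure Python, n = 2 unknowns, so the 2x2 inverse is explicit; A and y
# are illustrative assumptions, not from any experiment.

def pseudoinverse_solve(A, y):
    """Least-squares solution of A x = y for an m x 2 matrix A."""
    # Normal-equations matrix N = A^T A (2x2) and right-hand side A^T y.
    n00 = sum(r[0] * r[0] for r in A)
    n01 = sum(r[0] * r[1] for r in A)
    n11 = sum(r[1] * r[1] for r in A)
    b0 = sum(r[0] * v for r, v in zip(A, y))
    b1 = sum(r[1] * v for r, v in zip(A, y))
    det = n00 * n11 - n01 * n01          # assumed nonzero (full column rank)
    x0 = (n11 * b0 - n01 * b1) / det
    x1 = (n00 * b1 - n01 * b0) / det
    return x0, x1

# Consistent system: y = A (2, -1), so least squares recovers it exactly.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, -1.0, 1.0]
print(pseudoinverse_solve(A, y))   # -> (2.0, -1.0)
```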

8 Advances in High Energy Physics

The second method considers the singular value decomposition

A = U Σ V^T = ∑_{i=1}^{n} σ_i u_i v_i^T, (25)

where U ∈ R^{m×n} and V ∈ R^{n×n} are matrices with orthonormal columns and the diagonal matrix is Σ = diag{σ_1, ..., σ_n} = U^T A V. The solution is

x = A^+ y = V Σ^{-1} (U^T y) = ∑_{i=1}^{n} (1/σ_i)(u_i^T y) v_i = ∑_{i=1}^{n} (1/σ_i) c_i v_i, (26)

where c_i = u_i^T y, i = 1, ..., n, are called Fourier coefficients.
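A short sketch of (25)-(26) using NumPy's SVD; the matrix and data vector are the same kind of made-up toy example, and np.linalg.pinv serves only as a cross-check:

```python
# Sketch of the SVD route (25)-(26): x = sum_i (c_i / sigma_i) v_i, with
# Fourier coefficients c_i = u_i^T y. The 3x2 system is illustrative.
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([2.0, -1.0, 1.0])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: U (3x2), s (2,)
c = U.T @ y                                       # Fourier coefficients c_i
x = Vt.T @ (c / s)                                # x = V Sigma^{-1} U^T y

# Agrees with the Moore-Penrose pseudoinverse solution.
assert np.allclose(x, np.linalg.pinv(A) @ y)
print(x)
```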

2.3.2. Random Matrix Theory. The analysis of particle spectra (e.g., the neutrino spectrum) relies on Random Matrix Theory (RMT), especially if we consider anarchic neutrino masses. RMT means the study of the statistical properties of the eigenvalues of very large matrices [18]. For an interaction matrix A (of size N), where A_ij is an independently distributed random variable and A^H is the conjugate transpose matrix, we define M = A + A^H, which describes a Gaussian Unitary Ensemble (GUE). The GUE properties are described by the probability distribution P(M)dM: (1) it is invariant under unitary transformations, P(M)dM = P(M')dM', where M' = U^H M U and U is a unitary matrix (U^H U = I); (2) the elements of the matrix M are statistically independent, P(M) = ∏_{i≤j} P_ij(M_ij); and (3) the matrix M can be diagonalized as M = U D U^H, where D = diag{λ_1, ..., λ_N}, λ_i is an eigenvalue of M, and λ_i ≤ λ_j if i < j. From properties (2) and (3) one obtains

P(M) dM ∼ dM exp{−(N/2) Tr(M^H M)},
P(M) dM ∼ dU ∏_i dλ_i ∏_{i<j} (λ_i − λ_j)^2 exp{−(N/2) ∑_i λ_i^2}. (27)

The numerical methods used for eigenvalue computation are the QR method and the Power methods (direct and inverse); the QR method is a numerically stable algorithm, while the Power method is an iterative one. RMT can be used for many-body systems, quantum chaos, disordered systems, quantum chromodynamics, and so forth.
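A sketch of sampling one GUE matrix and computing its spectrum; the size N and the random seed are arbitrary choices, and NumPy's eigvalsh (a LAPACK Hermitian eigensolver) stands in for the stable direct methods discussed above:

```python
# Sketch of a GUE sample M = A + A^H and its eigenvalue computation.
# N and the seed are illustrative assumptions, not values from the text.
import numpy as np

rng = np.random.default_rng(0)
N = 200
A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
M = A + A.conj().T                      # Hermitian by construction

eigenvalues = np.linalg.eigvalsh(M)     # real spectrum, ascending order
assert np.allclose(M, M.conj().T)
print(eigenvalues.min(), eigenvalues.max())
```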

2.3.3. Kernel Estimation in HEP. Kernel estimation is a very powerful and relevant method for HEP when it is necessary to combine data from heterogeneous sources, like MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data {x_i}_{1≤i≤n} with a constant bandwidth h (the difference between two consecutive data values), called the smoothing parameter, we have the estimate

f̂(x) = (1/(nh)) ∑_{i=1}^{n} K((x − x_i)/h), (28)

where K is an estimator. For example, a Gaussian estimator with mean μ and standard deviation σ is

K(x) = (1/(σ√(2π))) exp(−(x − μ)^2/(2σ^2)) (29)

and has the following properties: it is positive definite and infinitely differentiable (due to the exp function), and it can be defined for an infinite support (n → ∞). The kernel is a nonparametric method, which means that h is independent of the dataset, and for a large amount of normally distributed data we can find a value of h that minimizes the integrated squared error of f̂(x). This optimal bandwidth is computed as

h* = (4/(3n))^{1/5} σ. (30)
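Equations (28)-(30) translate directly into a few lines of Python; the five sample points below are purely illustrative:

```python
# Minimal kernel estimate (28) with the Gaussian kernel (29) and the
# rule-of-thumb bandwidth (30); the sample is a made-up illustration.
import math

def gauss(u):
    # Standard Gaussian kernel (mean 0, sigma 1)
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(xs, x, h):
    # f_hat(x) = (1/(n h)) sum_i K((x - x_i)/h)
    n = len(xs)
    return sum(gauss((x - xi) / h) for xi in xs) / (n * h)

xs = [0.1, 0.4, 0.5, 0.6, 0.9]
n = len(xs)
mean = sum(xs) / n
sigma = math.sqrt(sum((v - mean) ** 2 for v in xs) / n)
h_star = (4 / (3 * n)) ** 0.2 * sigma      # h* = (4/(3n))^(1/5) sigma
print(h_star, kde(xs, 0.5, h_star))
```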

The main problem in kernel estimation is that the set of data {x_i}_{1≤i≤n} is not normally distributed and, in real experiments, the optimal bandwidth is not known. An improvement of the presented method is the adaptive kernel estimation proposed by Abramson [20], where h_i = h/√(f(x_i)) and σ are considered global quantities for the dataset. The new form is

f̂_a(x) = (1/n) ∑_{i=1}^{n} (1/h_i) K((x − x_i)/h_i), (31)

and the local bandwidth value that minimizes the integrated squared error of f̂_a(x) is

h_i* = (4/(3n))^{1/5} √(σ/f(x_i)), (32)

where f is the normal estimator.

Kernel estimation is used for event selection and confidence level evaluation, for example, in Markovian Monte Carlo chains or in the selection of the neural network output used in experiments for the reconstructed Higgs mass. In general, the main usage of kernel estimation in HEP is searching for new particles by finding relevant data in a large dataset.

A method based on kernel estimation is the graphical representation of datasets using the advanced shifted histogram (ASH) algorithm. This is a numerical interpolation for large datasets, with the main aim of creating a set of nbin histograms H = {H_i} with the same bin width h. Algorithm 4 presents the steps of histogram generation, starting with a specific interval [a, b], a number of points n in this interval, a number of bins, and a number of values m used for kernel estimation. Figure 6 shows the results of kernel estimation for the function f(x) = −(1/2)x^2 on [0, 1] and its graphical representation for different numbers of bins. The values on the vertical axis are aggregated in step (17) of Algorithm 4 and increase with the number of bins.

2.3.4. Performance of Numerical Algorithms Used in Particle Physics. All operations used in the presented methods for particle physics (unfolding processes, Random Matrix Theory,


(1) procedure ASH(a, b, n, x, nbin, m, f_m)
(2)   δ = (b − a)/nbin; h = m δ
(3)   for k = 1, nbin do
(4)     v_k = 0; y_k = 0
(5)   end for
(6)   for i = 1, n do
(7)     k = ⌊(x_i − a)/δ⌋ + 1
(8)     if k ∈ [1, nbin] then
(9)       v_k = v_k + 1
(10)    end if
(11)  end for
(12)  for k = 1, nbin do
(13)    if v_k = 0 then
(14)      k = k + 1
(15)    end if
(16)    for i = max{1, k − m + 1}, min{nbin, k + m − 1} do
(17)      y_i = y_i + v_k f_m(i − k)
(18)    end for
(19)  end for
(20)  for k = 1, nbin do
(21)    y_k = y_k/(nh)
(22)    t_k = a + (k − 0.5)δ
(23)  end for
(24)  return {t_k}_{1≤k≤nbin}, {y_k}_{1≤k≤nbin}
(25) end procedure

Algorithm 4: Advanced shifted histogram (1D algorithm).
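A runnable transcription of Algorithm 4 might look as follows; the discrete kernel weight f_m is not fixed by the text, so a triangular weight is assumed here, and the zero-count test in steps (13)-(15) is read as a skip:

```python
# Transcription of Algorithm 4 (1D advanced shifted histogram).
# Assumptions: f_m is a triangular weight (the text leaves it open),
# and the zero-count branch is interpreted as skipping empty bins.

def ash(a, b, x, nbin, m):
    delta = (b - a) / nbin
    h = m * delta
    n = len(x)
    v = [0] * (nbin + 1)                  # bin counts, 1-indexed as in the text
    y = [0.0] * (nbin + 1)

    for xi in x:                          # steps (6)-(11): raw histogram
        k = int((xi - a) / delta) + 1
        if 1 <= k <= nbin:
            v[k] += 1

    def fm(i):                            # assumed triangular kernel weight
        return max(0.0, 1.0 - abs(i) / m)

    for k in range(1, nbin + 1):          # steps (12)-(19): shifted smoothing
        if v[k] == 0:
            continue
        for i in range(max(1, k - m + 1), min(nbin, k + m - 1) + 1):
            y[i] += v[k] * fm(i - k)

    t = [a + (k - 0.5) * delta for k in range(1, nbin + 1)]   # bin centers
    y = [yk / (n * h) for yk in y[1:]]    # steps (20)-(23): normalize
    return t, y

t, y = ash(0.0, 1.0, [0.12, 0.15, 0.5, 0.52, 0.8], nbin=10, m=3)
```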

and kernel estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21] the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, with extended precision, is described. This permits higher internal precision and mixed input/output types, and allows the implementation of some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check of an Ax = y system, with A ∈ R^{n×n} a dense matrix. The residual check is defined as

r_∞ = ‖Ax − y‖_∞ / (n ε (‖A‖_∞ ‖x‖_∞ + ‖y‖_∞)) < 16, (33)

where ε is the relative machine precision for the IEEE representation standard, ‖y‖_∞ = max_{1≤i≤n} |y_i| is the infinity norm of a vector, and ‖A‖_∞ = max_{1≤i≤n} ∑_{j=1}^{n} |A_ij| is the infinity norm of a matrix.

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for single and double precision. For single precision, ε_s = 2^{-24}, for all n ≥ 1.05 × 10^6 the residual check is always lower than the imposed upper bound; similarly, for double precision, ε_d = 2^{-53}, for all n ≥ 5.63 × 10^14. If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for the data volume in this model.
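The residual check (33) is straightforward to evaluate; the 2×2 system below is a toy example and ε defaults to the IEEE double-precision machine epsilon:

```python
# Sketch of the residual check (33) for a solved system A x = y.
# The 2x2 system is illustrative; eps = 2^-53 is the double-precision
# relative machine precision.

def inf_norm_vec(v):
    return max(abs(vi) for vi in v)

def inf_norm_mat(A):
    return max(sum(abs(aij) for aij in row) for row in A)

def residual_check(A, x, y, eps=2.0 ** -53):
    n = len(A)
    r = [sum(aij * xj for aij, xj in zip(row, x)) - yi
         for row, yi in zip(A, y)]                      # r = A x - y
    denom = n * eps * (inf_norm_mat(A) * inf_norm_vec(x) + inf_norm_vec(y))
    return inf_norm_vec(r) / denom                      # should be < 16

A = [[2.0, 1.0], [1.0, 3.0]]
y = [3.0, 4.0]
x = [1.0, 1.0]                     # exact solution of A x = y
print(residual_check(A, x, y))     # 0.0: an exact solution passes the check
```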

In a single-processor system the complexity of an algorithm depends only on the problem size n. We can assume T(n) = Θ(f(n)), where f(n) is a fundamental function (f(n) ∈ {1, n^α, a^n, log n, √n, ...}). In parallel systems (multiprocessor systems with p processors) we have the serial processing time T*(n) = T_1(n) and the parallel processing time T_p(n). The performance of parallel algorithms can be analyzed using the speed-up, efficiency, and isoefficiency metrics.

(i) The speed-up S(p) represents how much faster a parallel algorithm is than the corresponding sequential algorithm. It is defined as S(p) = T_1(n)/T_p(n). There are special bounds for the speed-up [23]: S(p) ≤ p p̄/(p + p̄ − 1), where p̄ = T_1/T_∞ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually S(p) ≤ p, but under special circumstances the speed-up can be S(p) > p [24]. Another upper bound is established by Amdahl's law: S(p) = 1/(s + (1 − s)/p) ≤ 1/s, where s is the fraction of a program that is sequential; the upper bound corresponds to zero time for the parallel fraction.

(ii) The efficiency is the average utilization of the p processors: E(p) = S(p)/p.

(iii) The isoefficiency is the growth rate of the workload W_p(n) = p T_p(n), in terms of the number of processors, needed to keep the efficiency fixed. If we consider W_1(n) − E W_p(n) = 0 for any fixed efficiency E, we obtain p = p(n). This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of n numbers using p processors we have n ≈ E(n + p log p), so n = Θ(p log p).
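The speed-up and efficiency metrics can be illustrated with Amdahl's law; the 5% sequential fraction used below is an assumed figure, not a value from the text:

```python
# Illustration of the metrics above via Amdahl's law; the sequential
# fraction s = 0.05 is an assumed example value.

def amdahl_speedup(s, p):
    # S(p) = 1 / (s + (1 - s)/p), bounded above by 1/s
    return 1.0 / (s + (1.0 - s) / p)

def efficiency(s, p):
    # E(p) = S(p) / p, the average utilization of the p processors
    return amdahl_speedup(s, p) / p

s = 0.05
for p in (10, 100, 1000):
    S = amdahl_speedup(s, p)
    assert S <= 1.0 / s              # Amdahl upper bound: 1/s = 20
    print(p, round(S, 2), round(efficiency(s, p), 3))
```

Note how the speed-up saturates near 1/s = 20 while the efficiency collapses, which is exactly why isoefficiency ties the problem size to the processor count.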

Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins (10, 100, and 1000) over the interval [a, b] = [0, 1].

Figure 7: Residual check analysis for solving the Ax = y system in HPL 2.0, using single and double precision representation (log10 scales; the HPL 2.0 bound is shown for comparison).

Numerical algorithms use a hypercube architecture for implementation. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers T_com = t_s + L t_w, where t_s is the latency (the time needed by a message to cross through the network), t_w

Table 2: Isoefficiency for a hypercube architecture, n = Θ(p log p) and n = Θ(p(log p)^2). We marked with (*) the limitations imposed by Formula (33).

Scenario   Architecture size (p)   n = Θ(p log p)    n = Θ(p(log p)^2)
1          10^1                    1.0 × 10^1        1.00 × 10^1
2          10^2                    2.0 × 10^2        8.00 × 10^2
3          10^3                    3.0 × 10^3        2.70 × 10^4
4          10^4                    4.0 × 10^4        6.40 × 10^5 (*)
5          10^5                    5.0 × 10^5 (*)    1.25 × 10^7
6          10^6                    6.0 × 10^6        2.16 × 10^8
7          10^7                    7.0 × 10^7        3.43 × 10^9
8          10^8                    8.0 × 10^8        5.12 × 10^10
9          10^9                    9.0 × 10^9        7.29 × 10^11

is the time needed to send a word (1/t_w is called the bandwidth), and L is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define t_c as the processing time per word for a processor. We have the following results.

(i) External product x y^T. The isoefficiency is written as

t_c n ≈ E(t_c n + (t_s + t_w) p log p) ⇒ n = Θ(p log p). (34)

The parallel processing time is T_p = t_c n/p + (t_s + t_w) log p. The optimal number of processors is computed from

dT_p/dp = 0 ⇒ −t_c n/p^2 + (t_s + t_w)/p = 0 ⇒ p ≈ t_c n/(t_s + t_w). (35)
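The optimum (35) can be verified numerically; the machine constants t_c, t_s, t_w and the problem size below are made-up values, and the natural logarithm is used so that the derivative matches the formula:

```python
# Numerical check of Eq. (35): p* = t_c n / (t_s + t_w) minimizes
# T_p = t_c n / p + (t_s + t_w) log p. All constants are assumed
# illustrative values, not measured machine parameters.
import math

def parallel_time(t_c, t_s, t_w, n, p):
    # natural log, consistent with the derivative d/dp [(t_s+t_w) log p]
    return t_c * n / p + (t_s + t_w) * math.log(p)

t_c, t_s, t_w, n = 1e-9, 1e-6, 1e-8, 10**7
p_opt = t_c * n / (t_s + t_w)                 # Eq. (35)

# T_p at p* is no worse than at nearby processor counts.
for p in (p_opt / 2, 2 * p_opt):
    assert parallel_time(t_c, t_s, t_w, n, p_opt) <= parallel_time(t_c, t_s, t_w, n, p)
print(round(p_opt, 1))
```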

(ii) Scalar product (internal product) x^T y = ∑_{i=1}^{n} x_i y_i. The isoefficiency is written as

t_c n^2 ≈ E(t_c n^2 + (t_s/2) p log p + (t_w/2) n √p log p) ⇒ n = Θ(p(log p)^2). (36)

(iii) Matrix-vector product y = Ax, y_i = ∑_{j=1}^{n} A_ij x_j. The isoefficiency is written as

t_c n^2 ≈ E(t_c n^2 + t_s p log p + t_w n √p log p) ⇒ n = Θ(p(log p)^2). (37)

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound n ≥ 1.05 × 10^6 are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound for the Big Data volume in this case.

The factors that determine the efficiency of parallel algorithms are task balancing (the workload distribution between all used processors in a system → to be maximized), concurrency (the number/percentage of processors working simultaneously → to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing → to be minimized).


Figure 8: Processing flows for HEP experiments. Experimental data (a Big Data problem, with approximately TB of data) and simulated data (MC data generated from theoretical models) follow the same flow: modeling and reconstruction (finding the properties of observed particles), then analysis (reducing the results to simple statistical distributions), and finally the comparison of the two outcomes.

3. New Challenges for Big Data Science

There are a lot of applications that generate Big Data, like social networking profiles, social influence, SaaS & Cloud applications, public web information, MapReduce scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the approaches and solutions to better answer the users' needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TB of valuable data) and simulation data (data generated using MC methods based on theoretical models). The processing phase is represented by modeling and reconstruction, in order to find the properties of the observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results validates how realistic a simulation experiment is and validates it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes in resource availability, and so forth. We pay special attention to Cloud systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are the maximization of system throughput, sites' autonomy, scalability, fault tolerance, and quality of services.

When discussing Big Data we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need for many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrive too fast for optimal storage and indexing

[28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, Hadoop YARN: a next-generation framework for Hadoop data processing, 2013, http://hortonworks.com/hadoop/yarn/), Corona (Under the hood: scheduling MapReduce jobs more efficiently with Corona, 2012, Facebook), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future Scientific Data Infrastructure are presented in [31]. That paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science, and proposes the SDI generic architecture model, which provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and the limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure, with cost advantages, and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve the existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.

Data transfer scheduling in Grids, Clouds, P2P systems, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time, for finding a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and query, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection


of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data are too heterogeneous to fit into a rigid schema [38].

Another challenge is represented by the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, as well as dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization are the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and the optimization of short jobs [43]. The cost effectiveness, scalability, and streamlined architectures of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to a public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), on complex task dependencies for data workflows, and on new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method the proper numerical method has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper put together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research presented in this paper is supported by the following projects: "SideSTEP - Scheduling Methods for Dynamic Distributed Systems: a self-* approach" (PN-II-CT-RO-FR-2012-1-0084); "ERRIC - Empowering Romanian Research on Intelligent Information Technologies", FP7-REGPOT-2010-1, ID 264207; and the CyberWater Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, Project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.

[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.

[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.

[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678-683, 2006.

[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488-493, 2002.

[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585-591, 2009.

[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43-48, 2009.

[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 05, 2006.

[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219-228, 1998.

[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311-322, 1998.

[11] O. Behnke, K. Kroninger, G. Schott, and T. H. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.

[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).

[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.

[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.

[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469-481, 1996.

[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333-345, 2013.

[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533-540, Springer, Berlin, Germany, 2009.

[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233-297, 2005.

[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198-207, 2001.

[20] I. S. Abramson, "On bandwidth variation in kernel estimates: a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217-1223, 1982.

[21] X. Li, J. W. Demmel, D. H. Bailey et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152-205, 2002.

[22] J. J. Dongarra and J. Langou, "The problem with the Linpack benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5-13, 2009.

[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640-649, April 1995.

[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.

[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.

[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710-2721, 2013.

[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014-2015, 2012.

[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1-5, Springer, Berlin, Germany, 2013.

[29] B. Hindman, A. Konwinski, M. Zaharia et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.

[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.

[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614-617, IEEE Computer Society, Washington, DC, USA, 2012.

[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.

[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097-1111, 2013.

[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153-172, 2012.

[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109-110, ACM, New York, NY, USA, 2013.

[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (PGCIC '11), pp. 59-66, Barcelona, Spain, October 2011.

[37] J. Chen, Y. Chen, X. Du et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157-164, 2013.

[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7-11, ACM, New York, NY, USA, 2012.

[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973-985, 2013.

[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104-112, 2013.

[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355-367, 2007.

[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.

[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985-996, 2013.


Advances in High Energy Physics 7

Figure 4: Example of the 1-dimensional Monte Carlo Markovian algorithm: chain state versus iteration (1000 iterations) for s = 0.01, 0.1, 1, 10, 100, and 1000 (panels (a)–(f)).

The previous assumption holds only for multidimensional arrays, but due to the factorization assumption p(r_1, r_2, ..., r_n) = p(r_1) p(r_2) ⋯ p(r_n), we obtain for one integration

n × d × 10 Bytes = 800 Bytes,  (23)

which means ≈2.62 TBytes of data produced for one hour of simulations.
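The acceptance-rate behavior summarized in Figures 4 and 5 can be reproduced with a minimal 1-dimensional Metropolis sketch. The standard normal target, the Gaussian proposal, and the function name below are illustrative assumptions, not the paper's exact setup; only the role of the step scale s matches the figures.

```python
import numpy as np

def metropolis_1d(s, n_steps=10_000, seed=0):
    """Minimal 1-D Metropolis sampler for a standard normal target.

    `s` is the proposal step scale (the role played by s in Figures 4-5);
    returns the chain and the acceptance rate in percent.
    """
    rng = np.random.default_rng(seed)
    x = 0.0
    chain = np.empty(n_steps)
    accepted = 0
    for i in range(n_steps):
        proposal = x + s * rng.normal()
        # Metropolis acceptance for target p(x) ~ exp(-x^2 / 2)
        if np.log(rng.random()) < 0.5 * (x * x - proposal * proposal):
            x = proposal
            accepted += 1
        chain[i] = x
    return chain, 100.0 * accepted / n_steps

# Tiny steps are almost always accepted; huge steps are almost always
# rejected, matching the fall of the acceptance rate with log10(s).
rate_small = metropolis_1d(0.01)[1]
rate_large = metropolis_1d(100.0)[1]
```

Very small s gives high acceptance but slow exploration of the state space; very large s gives low acceptance, which is why intermediate step scales are preferred in practice.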

2.3. Unfolding Processes in Particle Physics and Kernel Estimation in HEP. In particle physics analysis we have two types of distributions: the true distribution (considered in theoretical models) and the measured distribution (considered in experimental models, which are affected by the finite resolution and limited acceptance of existing detectors). A HEP interaction process starts with a true, known distribution and generates a measured distribution corresponding to an experiment of

Figure 5: Analysis of the acceptance rate (%), as a function of log10(s), for the 1-dimensional Monte Carlo Markovian algorithm for different s values.

a well-confirmed theory. An inverse process starts with a measured distribution and tries to identify the true distribution. These unfolding processes are used to identify new theories based on experiments [11].

2.3.1. Unfolding Processes in Particle Physics. The theory of unfolding processes in particle physics is as follows [12]. For a physics variable t we have a true distribution f(t), mapped in x, an n-vector of unknowns, and a measured distribution g(s) (for a measured variable s), mapped in an m-vector of measured data. A response matrix A ∈ R^{m×n} encodes a kernel function K(s, t) describing the physical measurement process [12–15]. The direct and inverse processes are described by the Fredholm integral equation [16] of the first kind, for a specific domain Ω:

∫_Ω K(s, t) f(t) dt = g(s).  (24)

In particle physics the kernel function K(s, t) is usually known from a Monte Carlo sample obtained from simulation. A numerical solution is obtained using the linear equation Ax = y. Vectors x and y are assumed to be 1-dimensional in theory, but they can be multidimensional in practice (considering multiple independent linear equations). In practice the statistical properties of the measurements are also well known, and often they follow Poisson statistics [17]. To solve the linear systems we have different numerical methods.

The first method is based on the linear transformation x = A† y. If m = n, then A† = A^{-1} and we can use direct Gaussian methods, iterative methods (Gauss-Seidel, Jacobi, or SOR), or orthogonal methods (based on the Householder transformation, Givens rotations, or the Gram-Schmidt algorithm). If m > n (the most frequent scenario), we construct the matrix A† = (A^T A)^{-1} A^T (the Moore-Penrose pseudoinverse). In these cases the orthogonal methods offer very good and stable numerical solutions.


The second method considers the singular value decomposition

A = U Σ V^T = ∑_{i=1}^{n} σ_i u_i v_i^T,  (25)

where U ∈ R^{m×n} and V ∈ R^{n×n} are matrices with orthonormal columns and the diagonal matrix is Σ = diag{σ_1, ..., σ_n} = U^T A V. The solution is

x = A† y = V Σ^{-1} (U^T y) = ∑_{i=1}^{n} (1/σ_i)(u_i^T y) v_i = ∑_{i=1}^{n} (1/σ_i) c_i v_i,  (26)

where c_i = u_i^T y, i = 1, ..., n, are called the Fourier coefficients.
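As a concrete sketch of the SVD solution (25)-(26), the toy sizes and the random stand-in for the response matrix below are illustrative assumptions; the computation itself follows the equations directly.

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 8, 5                      # m > n: more measurements than unknowns
A = rng.normal(size=(m, n))      # stand-in for the response matrix
x_true = rng.normal(size=n)
y = A @ x_true                   # noiseless measured vector

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)  # A = U Sigma V^T
c = U.T @ y                      # Fourier coefficients c_i = u_i^T y
x = Vt.T @ (c / sigma)           # x = sum_i (c_i / sigma_i) v_i

# The same solution via the Moore-Penrose pseudoinverse (A^T A)^{-1} A^T
x_pinv = np.linalg.pinv(A) @ y
```

In real unfolding problems the small σ_i amplify measurement noise through the 1/σ_i factors, which is why regularized variants [13–15] truncate or damp the low-σ_i terms.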

2.3.2. Random Matrix Theory. The analysis of particle spectra (e.g., the neutrino spectrum) relies on Random Matrix Theory (RMT), especially if we consider anarchic neutrino masses. RMT means the study of the statistical properties of eigenvalues of very large matrices [18]. For an interaction matrix A (with size N), where A_ij is an independently distributed random variable and A^H is the complex conjugate transpose matrix, we define M = A + A^H, which describes a Gaussian Unitary Ensemble (GUE). The GUE properties are described by the probability distribution P(M)dM: (1) it is invariant under unitary transformations, P(M)dM = P(M′)dM′, where M′ = U^H M U and U is a unitary matrix (U^H U = I); (2) the elements of the matrix M are statistically independent, P(M) = ∏_{i≤j} P_ij(M_ij); and (3) the matrix M can be diagonalized as M = U D U^H, where D = diag{λ_1, ..., λ_N}, λ_i is an eigenvalue of M, and λ_i ≤ λ_j if i < j. Properties (2) and (3) can be written as

P(M) dM ∼ dM exp(−(N/2) Tr(M^H M)),
P(M) dM ∼ dU ∏_i dλ_i ∏_{i<j} (λ_i − λ_j)² exp(−(N/2) ∑_i λ_i²).  (27)

The numerical methods used for eigenvalue computation are the QR method and the power methods (direct and indirect). The QR method is a numerically stable algorithm, and the power method is an iterative one. RMT can be used for many-body systems, quantum chaos, disordered systems, quantum chromodynamics, and so forth.
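A GUE sample and the stated properties can be checked numerically; the matrix size and the normal real/imaginary entries below are illustrative assumptions.

```python
import numpy as np

N = 200
rng = np.random.default_rng(7)
A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
M = A + A.conj().T                  # M = A + A^H is Hermitian by construction

# Property (3): M = U D U^H with unitary U and real, ordered eigenvalues.
eigenvalues, U = np.linalg.eigh(M)  # eigenvalues returned in ascending order
D = np.diag(eigenvalues)
reconstructed = U @ D @ U.conj().T
```

`eigh` is the stable dense-Hermitian route (internally a QR-type algorithm); a power iteration on M would instead recover only the dominant eigenvalue iteratively.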

2.3.3. Kernel Estimation in HEP. Kernel estimation is a very powerful and relevant method for HEP when it is necessary to combine data from heterogeneous sources, like MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data {x_i}_{1≤i≤n} with a constant bandwidth h (the difference between two consecutive data values), called the smoothing parameter, we have the estimate

f̂(x) = (1/(nh)) ∑_{i=1}^{n} K((x − x_i)/h),  (28)

where K is an estimator. For example, a Gauss estimator with mean μ and standard deviation σ is

K(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)),  (29)

which is positive definite and infinitely differentiable (due to the exp function) and can be defined for an infinite support (n → ∞). The kernel is a nonparametric method, which means that h is independent of the dataset, and for a large amount of normally distributed data we can find a value of h that minimizes the integrated squared error of f̂(x). This bandwidth value is computed as

h* = (4/(3n))^{1/5} σ.  (30)
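Equations (28)-(30) can be sketched directly; the normally distributed sample and its size are illustrative assumptions.

```python
import numpy as np

def kernel_estimate(x, data, h):
    """Fixed-bandwidth estimate f(x) = (1/(n h)) sum_i K((x - x_i)/h), eq. (28)."""
    u = (x - data) / h
    K = np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)   # Gauss kernel, eq. (29)
    return K.sum() / (len(data) * h)

rng = np.random.default_rng(1)
data = rng.normal(size=5000)                             # normally distributed sample
h_star = (4.0 / (3.0 * len(data))) ** 0.2 * data.std()   # optimal bandwidth, eq. (30)
f0 = kernel_estimate(0.0, data, h_star)  # true density at 0 is 1/sqrt(2*pi) ~ 0.399
```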

The main problem in kernel estimation is that the set of data {x_i}_{1≤i≤n} is not normally distributed, and in real experiments the optimal bandwidth is not known. An improvement of the presented method considers the adaptive kernel estimation proposed by Abramson [20], where h_i = h/√(f(x_i)) and σ are considered global qualities of the dataset. The new form is

f̂_a(x) = (1/n) ∑_{i=1}^{n} (1/h_i) K((x − x_i)/h_i),  (31)

and the local bandwidth value that minimizes the integrated squared error of f̂_a(x) is

h*_i = (4/(3n))^{1/5} √(σ/f(x_i)),  (32)

where f is the normal estimator.

Kernel estimation is used for event selection, for confidence level evaluation (for example, in Markovian Monte Carlo chains), or in the selection of the neural network output used in experiments for the reconstructed Higgs mass. In general, the main usage of kernel estimation in HEP is searching for a new particle by finding relevant data in a large dataset.
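A sketch of the adaptive estimate (31), with a fixed-bandwidth pilot supplying f(x_i): the pilot choice, sample, and sizes are illustrative assumptions.

```python
import numpy as np

def gauss_kde(x, data, h):
    """Fixed-bandwidth estimate, eq. (28), evaluated at an array of points x."""
    u = (np.atleast_1d(x)[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(2)
data = rng.normal(size=2000)
h = (4.0 / (3.0 * len(data))) ** 0.2 * data.std()   # global bandwidth, eq. (30)

f_pilot = gauss_kde(data, data, h)                  # pilot density at each x_i
h_i = h / np.sqrt(f_pilot)                          # local bandwidths: wide in tails

def adaptive_kde(x, data, h_i):
    """Adaptive estimate f_a(x) = (1/n) sum_i (1/h_i) K((x - x_i)/h_i), eq. (31)."""
    u = (np.atleast_1d(x)[:, None] - data[None, :]) / h_i[None, :]
    K = np.exp(-0.5 * u * u) / np.sqrt(2.0 * np.pi)
    return (K / h_i[None, :]).sum(axis=1) / len(data)

f_center = adaptive_kde(0.0, data, h_i)[0]
```

The local bandwidths shrink where the pilot density is high and widen in sparse tails, which is the point of the Abramson construction.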

A method based on kernel estimation is the graphical representation of datasets using the advanced shifted histogram (ASH) algorithm. This is a numerical interpolation for large datasets, with the main aim of creating a set of n_bin histograms H = {H_i} with the same bin width h. Algorithm 4 presents the steps of histogram generation, starting with a specific interval [a, b], a number of points n in this interval, a number of bins, and a number of values m used for kernel estimation. Figure 6 shows the results of kernel estimation for the function f = −(1/2)x² on [0, 1] and its graphical representation with different numbers of bins. The values on the vertical axis are aggregated in step 17 of Algorithm 4 and increase with the number of bins.
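The steps of Algorithm 4 can be sketched in Python (0-based indexing). The triangular weight function f_m below is one plausible choice, an assumption not fixed by the text.

```python
import numpy as np

def ash_1d(a, b, x, nbin, m, fm):
    """Sketch of Algorithm 4: advanced shifted histogram on [a, b]."""
    delta = (b - a) / nbin
    h = m * delta
    v = np.zeros(nbin)                          # raw bin counts (steps 3-11)
    for xi in x:
        k = int(np.floor((xi - a) / delta))
        if 0 <= k < nbin:
            v[k] += 1
    y = np.zeros(nbin)                          # smoothed heights (steps 12-19)
    for k in range(nbin):
        if v[k] == 0:
            continue
        for i in range(max(0, k - m + 1), min(nbin, k + m)):
            y[i] += v[k] * fm(i - k)            # step 17: spread count over 2m-1 bins
    y /= len(x) * h                             # step 21: normalize to a density
    t = a + (np.arange(nbin) + 0.5) * delta     # step 22: bin centers
    return t, y

m = 5
triangular = lambda j: 1.0 - abs(j) / m         # assumed smoothing weights f_m
rng = np.random.default_rng(3)
t, y = ash_1d(-4.0, 4.0, rng.normal(size=20_000), nbin=80, m=m, fm=triangular)
```

With triangular weights the per-sample weights sum to m, so after the step-21 normalization the returned heights integrate to approximately one, i.e., a proper density estimate.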

2.3.4. Performance of Numerical Algorithms Used in Particle Physics. All operations used in the presented methods for particle physics (unfolding processes, Random Matrix Theory,


(1) procedure ASH(a, b, n, x, n_bin, m, f_m)
(2)   δ = (b − a)/n_bin; h = mδ
(3)   for k = 1, ..., n_bin do
(4)     v_k = 0; y_k = 0
(5)   end for
(6)   for i = 1, ..., n do
(7)     k = ⌊(x_i − a)/δ⌋ + 1
(8)     if k ∈ [1, n_bin] then
(9)       v_k = v_k + 1
(10)    end if
(11)  end for
(12)  for k = 1, ..., n_bin do
(13)    if v_k = 0 then
(14)      k = k + 1
(15)    end if
(16)    for i = max{1, k − m + 1}, ..., min{n_bin, k + m − 1} do
(17)      y_i = y_i + v_k f_m(i − k)
(18)    end for
(19)  end for
(20)  for k = 1, ..., n_bin do
(21)    y_k = y_k/(nh)
(22)    t_k = a + (k − 0.5)δ
(23)  end for
(24)  return {t_k}_{1≤k≤n_bin}, {y_k}_{1≤k≤n_bin}
(25) end procedure

Algorithm 4: Advanced shifted histogram (1D algorithm).

and kernel estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21] the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, with extended precision, is described. This permits higher internal precision and mixed input/output types. The extended precision allows the implementation of some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check of an Ax = y system, with A ∈ R^{n×n} a dense matrix. The residual check is defined as

r_∞ = ‖Ax − y‖_∞ / (nε(‖A‖_∞ ‖x‖_∞ + ‖y‖_∞)) < 16,  (33)

where ε is the relative machine precision for the IEEE representation standard, ‖y‖_∞ is the infinity norm of a vector, ‖y‖_∞ = max_{1≤i≤n} |y_i|, and ‖A‖_∞ is the infinity norm of a matrix, ‖A‖_∞ = max_{1≤i≤n} ∑_{j=1}^{n} |A_ij|.

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for simple and double precision. For simple precision, ε_s = 2^{−24}, and for all n ≥ 1.05 × 10^6 the residual check is always lower than the imposed upper bound; similarly for double precision, with ε_d = 2^{−53}, for all n ≥ 5.63 × 10^{14}. If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for data volume in this model.

In a single-processor system the complexity of algorithms depends only on the problem size n. We can assume T(n) = Θ(f(n)), where f(n) is a fundamental function (f(n) ∈ {1, n^α, a^n, log n, √n, ...}). In parallel systems (multiprocessor systems with p processors) we have the serial processing time T*(n) = T_1(n) and the parallel processing time T_p(n). The performance of parallel algorithms can be analyzed using the speed-up, efficiency, and isoefficiency metrics.

(i) The speed-up S(p) represents how much faster a parallel algorithm is than a corresponding sequential algorithm. The speed-up is defined as S(p) = T_1(n)/T_p(n). There are special bounds for the speed-up [23]: S(p) ≤ p p̄/(p + p̄ − 1), where p̄ = T_1/T_∞ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually S(p) ≤ p, but under special circumstances the speed-up can be S(p) > p [24]. Another upper bound is established by Amdahl's law: S(p) = (s + (1 − s)/p)^{−1} ≤ 1/s, where s is the fraction of the program that is sequential. The upper bound corresponds to zero time for the parallel fraction.

(ii) The efficiency is the average utilization of the p processors: E(p) = S(p)/p.

(iii) The isoefficiency is the growth rate of the workload W_p(n) = p T_p(n) in terms of the number of processors needed to keep the efficiency fixed. If we consider W_1(n) − E W_p(n) = 0 for any fixed efficiency E, we obtain p = p(n). This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of n numbers using p processors we have n ≈ E(n + p log p), so n = Θ(p log p).
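The first two metrics and Amdahl's bound can be stated as one-liners; the 10% sequential fraction below is an illustrative assumption.

```python
# Speed-up, efficiency, and Amdahl's law as defined in the text.

def speedup(t1, tp):
    """S(p) = T_1(n) / T_p(n)."""
    return t1 / tp

def efficiency(s, p):
    """E(p) = S(p) / p, the average utilization of the p processors."""
    return s / p

def amdahl(s_frac, p):
    """S(p) = (s + (1 - s)/p)^(-1), bounded above by 1/s."""
    return 1.0 / (s_frac + (1.0 - s_frac) / p)

# A program that is 10% sequential can never exceed a 10x speed-up,
# no matter how many processors are added.
bound = [amdahl(0.1, p) for p in (1, 10, 100, 1000)]
```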

Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins: 10 (a), 100 (b), and 1000 (c), on [a, b] = [0, 1].

Figure 7: Residual check analysis for solving the Ax = y system in HPL 2.0, using simple and double precision representation.

Numerical algorithms use a hypercube architecture for implementation. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers T_com = t_s + L t_w, where t_s is the latency (the time needed by a message to cross through the network), t_w

Table 2: Isoefficiency for a hypercube architecture, n = Θ(p log p) and n = Θ(p(log p)²). We mark with (*) the limitations imposed by Formula (33).

Scenario  Architecture size (p)  n = Θ(p log p)   n = Θ(p(log p)²)
1         10^1                   1.00 × 10^1      1.00 × 10^1
2         10^2                   2.00 × 10^2      8.00 × 10^2
3         10^3                   3.00 × 10^3      2.70 × 10^4
4         10^4                   4.00 × 10^4      6.40 × 10^5 (*)
5         10^5                   5.00 × 10^5 (*)  1.25 × 10^7
6         10^6                   6.00 × 10^6      2.16 × 10^8
7         10^7                   7.00 × 10^7      3.43 × 10^9
8         10^8                   8.00 × 10^8      5.12 × 10^10
9         10^9                   9.00 × 10^9      7.29 × 10^11

is the time needed to send a word (1/t_w is called the bandwidth), and L is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define t_c as the processing time per word for a processor. We have the following results.

(i) External product x y^T. The isoefficiency is written as

t_c n ≈ E(t_c n + (t_s + t_w) p log p) ⟹ n = Θ(p log p).  (34)

The parallel processing time is T_p = t_c n/p + (t_s + t_w) log p. The optimal number of processors is computed using

dT_p/dp = 0 ⟹ −t_c n/p² + (t_s + t_w)/p = 0 ⟹ p ≈ t_c n/(t_s + t_w).  (35)

(ii) Scalar product (internal product) x^T y = ∑_{i=1}^{n} x_i y_i. The isoefficiency is written as

t_c n² ≈ E(t_c n² + (t_s/2) p log p + (t_w/2) n √p log p) ⟹ n = Θ(p(log p)²).  (36)

(iii) Matrix-vector product y = Ax, y_i = ∑_{j=1}^{n} A_ij x_j. The isoefficiency is written as

t_c n² ≈ E(t_c n² + t_s p log p + t_w n √p log p) ⟹ n = Θ(p(log p)²).  (37)
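The optimality condition (35) for the external product can be checked numerically. The timing constants below are hypothetical, chosen only so that the minimum falls inside the scanned range.

```python
import numpy as np

tc, ts, tw = 1e-9, 2e-6, 5e-8      # hypothetical compute, latency, per-word times (s)
n = 10**8                          # problem size

def parallel_time(p):
    """T_p = t_c n / p + (t_s + t_w) log p (natural log, matching eq. (35))."""
    return tc * n / p + (ts + tw) * np.log(p)

p_star = tc * n / (ts + tw)        # analytic optimum from eq. (35)
p_grid = np.logspace(3, 6, 2000)   # scan p over three decades
p_best = p_grid[np.argmin(parallel_time(p_grid))]
```

The brute-force minimum over the grid lands on the analytic p ≈ t_c n/(t_s + t_w), confirming the derivative condition.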

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound n ≥ 1.05 × 10^6 are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound for the Big Data volume in this case.

The factors that determine the efficiency of parallel algorithms are task balancing (the workload distribution between all used processors in a system, to be maximized), concurrency (the number/percentage of processors working simultaneously, to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing, to be minimized).


Figure 8: Processing flows for HEP experiments. Experimental data (approximately TB of data, a Big Data problem) and simulation (MC data generated from theoretical models, a Big Data problem) each pass through modeling and reconstruction (finding the properties of the observed particles) and analysis (reducing the result to a simple statistical distribution); the two resulting distributions are then compared.

3. New Challenges for Big Data Science

There are a lot of applications that generate Big Data, like social networking profiles, social influence, SaaS and Cloud applications, public web information, MapReduce, scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. The impact on the approach to data processing, transfer, and storage is the need to reevaluate the ways and solutions to better answer the users' needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TB of valuable data) and simulation data (data generated using MC based on theoretical models). The processing phase is represented by modeling and reconstruction, in order to find the properties of the observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results will show how realistic a simulation experiment is and validate it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for the scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes in resources' availability, and so forth. We pay special attention to Cloud systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are the maximization of system throughput, sites' autonomy, scalability, fault-tolerance, and quality of services.

When discussing Big Data we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need for many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrive too fast for optimal storage and indexing

[28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, "Hadoop YARN: a next-generation framework for Hadoop data processing", 2013, http://hortonworks.com/hadoop/yarn), Corona ("Under the Hood: Scheduling MapReduce jobs more efficiently with Corona", 2012, Facebook), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future scientific data infrastructure are presented in [31]. The paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science. The paper proposes the SDI generic architecture model that provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure, with cost advantages, and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.

Data transfer scheduling in Grids, Clouds, P2P, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time when searching for a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and querying, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection


of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data are too heterogeneous to fit into a rigid schema [38].

Another challenge is represented by the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, and also the dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization are the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and the optimization of short jobs [43]. The cost effectiveness, scalability, and streamlined architectures of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to a public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), complex task dependencies for data workflows, and new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method, the proper numerical method has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper put together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research presented in this paper is supported by the following projects: "SideSTEP: Scheduling Methods for Dynamic Distributed Systems, a self-* approach" (PN-II-CT-RO-FR-2012-1-0084); "ERRIC: Empowering Romanian Research on Intelligent Information Technologies", FP7-REGPOT-2010-1, ID: 264207; and the CyberWater Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.

[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.

[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.

[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678–683, 2006.

[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488–493, 2002.

[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585–591, 2009.

[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43–48, 2009.

[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 05, 2006.

[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219–228, 1998.

[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311–322, 1998.

[11] O. Behnke, K. Kroninger, G. Schott, and T. H. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.

[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).

[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.

[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.

[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469–481, 1996.

[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333–345, 2013.

[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533–540, Springer, Berlin, Germany, 2009.

[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233–297, 2005.

[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198–207, 2001.

[20] I. S. Abramson, "On bandwidth variation in kernel estimates: a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217–1223, 1982.

[21] X. Li, J. W. Demmel, D. H. Bailey, et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152–205, 2002.

[22] J. J. Dongarra and J. Langou, "The problem with the LINPACK benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5–13, 2009.

[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640–649, April 1995.

[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.

[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.

[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710–2721, 2013.

[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014–2015, 2012.

[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1–5, Springer, Berlin, Germany, 2013.

[29] B. Hindman, A. Konwinski, M. Zaharia, et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.

[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.

[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614–617, IEEE Computer Society, Washington, DC, USA, 2012.

[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.

[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097–1111, 2013.

[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153–172, 2012.

[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109–110, ACM, New York, NY, USA, 2013.

[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (PGCIC '11), pp. 59–66, Barcelona, Spain, October 2011.

[37] J. Chen, Y. Chen, X. Du, et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157–164, 2013.

[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7–11, ACM, New York, NY, USA, 2012.

[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973–985, 2013.

[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104–112, 2013.

[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355–367, 2007.

[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.

[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985–996, 2013.


The second method considers the singular value decomposition

\[
A = U\Sigma V^T = \sum_{i=1}^{n} \sigma_i u_i v_i^T, \quad (25)
\]

where $U \in R^{m\times n}$ and $V \in R^{n\times n}$ are matrices with orthonormal columns and the diagonal matrix $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n) = U^T A V$. The solution is

\[
x = A^{+}y = V\Sigma^{-1}(U^T y) = \sum_{i=1}^{n} \frac{1}{\sigma_i}(u_i^T y)\,v_i = \sum_{i=1}^{n} \frac{1}{\sigma_i} c_i v_i, \quad (26)
\]

where $c_i = u_i^T y$, $i = 1, \ldots, n$, are called Fourier coefficients.
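As an illustration, the expansion in (26) can be evaluated directly with a linear algebra library. The sketch below is not from the paper (the matrix, data, and seed are assumptions); it uses NumPy's SVD to recover $x$ from $y$ through the Fourier coefficients $c_i = u_i^T y$:

```python
import numpy as np

# Illustrative sketch: solving y = A x via the SVD expansion of eq. (26),
# x = sum_i (u_i^T y / sigma_i) v_i. Matrix and data are assumed toy values.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
x_true = rng.standard_normal(5)
y = A @ x_true

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) V^T
c = U.T @ y                                       # Fourier coefficients c_i
x = Vt.T @ (c / s)                                # x = sum_i (c_i / sigma_i) v_i

assert np.allclose(x, x_true)
```

In ill-posed unfolding problems the small $\sigma_i$ amplify noise in the $c_i$, which is why regularized variants truncate or damp the terms of this sum.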

2.3.2. Random Matrix Theory. The analysis of particle spectra (e.g., the neutrino spectrum) makes use of Random Matrix Theory (RMT), especially if we consider anarchic neutrino masses. RMT is the study of the statistical properties of the eigenvalues of very large matrices [18]. For an interaction matrix $A$ (of size $N$), where each $A_{ij}$ is an independently distributed random variable and $A^H$ is the complex conjugate and transpose matrix, we define $M = A + A^H$, which describes a Gaussian Unitary Ensemble (GUE). The GUE properties are described by the probability distribution $P(M)\,dM$: (1) it is invariant under unitary transformations, $P(M)\,dM = P(M')\,dM'$, where $M' = U^H M U$ and $U$ is a unitary matrix ($U^H U = I$); (2) the elements of the matrix $M$ are statistically independent, $P(M) = \prod_{i\le j} P_{ij}(M_{ij})$; and (3) the matrix $M$ can be diagonalized as $M = UDU^H$, where $D = \mathrm{diag}\{\lambda_1, \ldots, \lambda_N\}$, $\lambda_i$ is an eigenvalue of $M$ and $\lambda_i \le \lambda_j$ if $i < j$. Properties (2) and (3) give

\[
P(M)\,dM \sim dM\,\exp\left(-\frac{N}{2}\,\mathrm{Tr}\left(M^H M\right)\right),
\]
\[
P(M)\,dM \sim dU \prod_i d\lambda_i \prod_{i<j}\left(\lambda_i - \lambda_j\right)^2 \exp\left(-\frac{N}{2}\sum_i \lambda_i^2\right). \quad (27)
\]

The numerical methods used for eigenvalue computation are the QR method and the power methods (direct and inverse). The QR method is a numerically stable algorithm, and the power method is an iterative one. RMT can be used for many-body systems, quantum chaos, disordered systems, quantum chromodynamics, and so forth.

2.3.3. Kernel Estimation in HEP. Kernel estimation is a very powerful and relevant method for HEP when it is necessary to combine data from heterogeneous sources, like MC datasets obtained by simulation and Standard Model expectations obtained from real experiments [19]. For a set of data $\{x_i\}_{1\le i\le n}$, with a constant bandwidth $h$ (the difference between two consecutive data values), called the smoothing parameter, we have the estimation

\[
\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right), \quad (28)
\]

where $K$ is an estimator. For example, a Gauss estimator with mean $\mu$ and standard deviation $\sigma$ is

\[
K(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \quad (29)
\]

and has the following properties: it is positive definite and infinitely differentiable (due to the exp function), and it can be defined for an infinite support ($n \to \infty$). The kernel is a nonparametric method, which means that $h$ is independent of the dataset, and for a large amount of normally distributed data we can find a value of $h$ that minimizes the integrated squared error of $\hat{f}(x)$. This value of the bandwidth is computed as

\[
h^{*} = \left(\frac{4}{3n}\right)^{1/5} \sigma. \quad (30)
\]

The main problem in kernel estimation is that the set of data $\{x_i\}_{1\le i\le n}$ is not normally distributed, and in real experiments the optimal bandwidth is not known. An improvement of the presented method considers the adaptive kernel estimation proposed by Abramson [20], where $h_i = h/\sqrt{f(x_i)}$ and $\sigma$ are considered global quantities for the dataset. The new form is

\[
\hat{f}_a(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h_i} K\left(\frac{x - x_i}{h_i}\right), \quad (31)
\]

and the local bandwidth value that minimizes the integrated squared error of $\hat{f}_a(x)$ is

\[
h_i^{*} = \left(\frac{4}{3n}\right)^{1/5} \sqrt{\frac{\sigma}{f(x_i)}}, \quad (32)
\]

where $f$ is the normal estimator.

Kernel estimation is used for event selection and confidence level evaluation, for example, in Markovian Monte Carlo chains or in the selection of the neural network output used in experiments for the reconstructed Higgs mass. In general, the main usage of kernel estimation in HEP is searching for new particles by finding relevant data in a large dataset.

A method based on kernel estimation is the graphical representation of datasets using the advanced shifted histogram (ASH) algorithm. This is a numerical interpolation for large datasets, with the main aim of creating a set of nbin histograms $H = \{H_i\}$ with the same bin width $h$. Algorithm 4 presents the steps of histogram generation, starting with a specific interval $[a, b]$, a number of points $n$ in this interval, a number of bins, and a number of values $m$ used for kernel estimation. Figure 6 shows the results of kernel estimation for the function $f(x) = -(1/2)x^2$ on $[0, 1]$ and the graphical representation for different numbers of bins. The values on the vertical axis are aggregated in step 17 of Algorithm 4 and increase with the number of bins.
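A plain-Python transcription of Algorithm 4 might look as follows (a sketch, not the paper's code; the triangular kernel weight `tri` passed as `fm` is an assumed choice):

```python
# Illustrative sketch: the advanced shifted histogram (Algorithm 4, 1D).
# `fm` is the kernel weight evaluated at integer offsets (an assumption).
def ash(a, b, x, nbin, m, fm):
    n = len(x)
    delta = (b - a) / nbin            # fine-grid bin width
    h = m * delta                     # effective smoothing bandwidth
    v = [0] * (nbin + 1)              # bin counts, 1-indexed like the paper
    y = [0.0] * (nbin + 1)
    for xi in x:                      # first pass: histogram the data
        k = int((xi - a) / delta) + 1
        if 1 <= k <= nbin:
            v[k] += 1
    for k in range(1, nbin + 1):      # second pass: shift/average the counts
        if v[k] == 0:
            continue                  # skip empty bins (steps 13-15)
        for i in range(max(1, k - m + 1), min(nbin, k + m - 1) + 1):
            y[i] += v[k] * fm(i - k)  # step 17: aggregate weighted counts
    t = [a + (k - 0.5) * delta for k in range(1, nbin + 1)]  # bin centers
    y = [yk / (n * h) for yk in y[1:]]                        # normalize
    return t, y

tri = lambda i: 1.0 - abs(i) / 2.0    # triangular weights for m = 2 (assumed)
t, y = ash(0.0, 1.0, [0.55] * 100, 10, 2, tri)
```

With all 100 sample points in one bin, the estimate spreads the mass over the neighboring bins and still integrates to one over $[a, b]$.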

(1) procedure ASH(a, b, n, x, nbin, m, f_m)
(2)   δ = (b − a)/nbin; h = mδ
(3)   for k = 1, nbin do
(4)     v_k = 0; y_k = 0
(5)   end for
(6)   for i = 1, n do
(7)     k = ⌊(x_i − a)/δ⌋ + 1
(8)     if k ∈ [1, nbin] then
(9)       v_k = v_k + 1
(10)    end if
(11)  end for
(12)  for k = 1, nbin do
(13)    if v_k = 0 then
(14)      continue with the next k
(15)    end if
(16)    for i = max{1, k − m + 1}, min{nbin, k + m − 1} do
(17)      y_i = y_i + v_k f_m(i − k)
(18)    end for
(19)  end for
(20)  for k = 1, nbin do
(21)    y_k = y_k/(nh)
(22)    t_k = a + (k − 0.5)δ
(23)  end for
(24)  return {t_k}_{1≤k≤nbin}, {y_k}_{1≤k≤nbin}
(25) end procedure

Algorithm 4: Advanced shifted histogram (1D algorithm).

2.3.4. Performance of Numerical Algorithms Used in Particle Physics. All operations used in the presented methods for particle physics (unfolding processes, Random Matrix Theory, and kernel estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21], the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, obtained by extending the precision, is described. This permits higher internal precision and mixed input/output types, and it allows the implementation of some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check of an $Ax = y$ system, with $A \in R^{n\times n}$ a dense matrix. The residual check is defined as

\[
r_\infty = \frac{\|Ax - y\|_\infty}{n\varepsilon\left(\|A\|_\infty \|x\|_\infty + \|y\|_\infty\right)} < 16, \quad (33)
\]

where $\varepsilon$ is the relative machine precision for the IEEE representation standard, $\|y\|_\infty$ is the infinity norm of a vector, $\|y\|_\infty = \max_{1\le i\le n}|y_i|$, and $\|A\|_\infty$ is the infinity norm of a matrix, $\|A\|_\infty = \max_{1\le i\le n}\sum_{j=1}^{n}|A_{ij}|$.

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for simple and double precision. For simple precision, $\varepsilon_s = 2^{-24}$, the residual check is always lower than the imposed upper bound for all $n \ge 1.05 \times 10^6$; similarly for double precision, with $\varepsilon_d = 2^{-53}$, for all $n \ge 5.63 \times 10^{14}$. If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for the data volume in this model.
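The residual check (33) is straightforward to evaluate; the sketch below (illustrative, with an assumed toy system) accepts a candidate solution when $r_\infty < 16$:

```python
# Illustrative sketch: evaluating the residual check of eq. (33) in plain
# Python for a small, assumed dense system.
EPS_SINGLE = 2.0 ** -24   # relative machine precision, IEEE single
EPS_DOUBLE = 2.0 ** -53   # relative machine precision, IEEE double

def inf_norm_vec(v):
    return max(abs(vi) for vi in v)

def inf_norm_mat(A):
    return max(sum(abs(aij) for aij in row) for row in A)

def residual_check(A, x, y, eps):
    """r_inf from eq. (33); the solution passes when r_inf < 16."""
    n = len(A)
    r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(n)]
    denom = n * eps * (inf_norm_mat(A) * inf_norm_vec(x) + inf_norm_vec(y))
    return inf_norm_vec(r) / denom

A = [[2.0, 1.0], [1.0, 3.0]]
x = [1.0, 2.0]                       # exact solution of A x = y below
y = [4.0, 7.0]
r = residual_check(A, x, y, EPS_DOUBLE)
assert r < 16.0
```

The same check underlies the HPL benchmark's pass/fail criterion for large linear solves.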

In a single-processor system, the complexity of algorithms depends only on the problem size $n$. We can assume $T(n) = \Theta(f(n))$, where $f(n)$ is a fundamental function ($f(n) \in \{1, n^\alpha, a^n, \log n, \sqrt{n}, \ldots\}$). In parallel systems (multiprocessor systems with $p$ processors) we have the serial processing time $T^{*}(n) = T_1(n)$ and the parallel processing time $T_p(n)$. The performance of parallel algorithms can be analyzed using the speed-up, efficiency, and isoefficiency metrics.

(i) The speed-up $S(p)$ represents how much faster a parallel algorithm is than the corresponding sequential algorithm. The speed-up is defined as $S(p) = T_1(n)/T_p(n)$. There are special bounds for the speed-up [23]: $S(p) \le \bar{p}p/(p + \bar{p} - 1)$, where $\bar{p} = T_1/T_\infty$ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually $S(p) \le p$, but under special circumstances the speed-up can be $S(p) > p$ [24]. Another upper bound is established by Amdahl's law: $S(p) = (s + (1 - s)/p)^{-1} \le 1/s$, where $s$ is the fraction of a program that is sequential. The upper bound corresponds to zero time for the parallel fraction.

(ii) The efficiency is the average utilization of the $p$ processors: $E(p) = S(p)/p$.

(iii) The isoefficiency is the growth rate of the workload $W_p(n) = pT_p(n)$ in terms of the number of processors needed to keep the efficiency fixed. If we consider $W_1(n) - EW_p(n) = 0$ for any fixed efficiency $E$, we obtain $p = p(n)$. This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of $n$ numbers using $p$ processors, we have $n \approx E(n + p\log p)$, so $n = \Theta(p\log p)$.
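The speed-up and efficiency metrics above can be sketched as follows (illustrative; the sequential fraction $s$ is an assumption):

```python
# Illustrative sketch: Amdahl's law S(p) = 1/(s + (1 - s)/p) and the
# efficiency E(p) = S(p)/p, for an assumed sequential fraction s.
def amdahl_speedup(s, p):
    return 1.0 / (s + (1.0 - s) / p)

def efficiency(s, p):
    return amdahl_speedup(s, p) / p

s = 0.05                                           # 5% sequential (assumed)
speedups = [amdahl_speedup(s, p) for p in (1, 10, 100, 1000)]
assert all(sp <= 1.0 / s for sp in speedups)       # bounded above by 1/s
assert speedups == sorted(speedups)                # monotone in p
```

Even with only 5% sequential work, the speed-up saturates near 20 regardless of how many processors are added, which is why isoefficiency (growing $n$ with $p$) matters for HPC workloads.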

Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins: (a) nbin = 10, (b) nbin = 100, and (c) nbin = 1000, on the interval $[a, b] = [0, 1]$.

Figure 7: Residual check analysis for solving the $Ax = y$ system in HPL 2.0 using simple and double precision representation (logarithmic scales; curves for simple precision, double precision, and the HPL 2.0 bound).

Table 2: Isoefficiency for a hypercube architecture, $n = \Theta(p\log p)$ and $n = \Theta(p(\log p)^2)$. We marked with (*) the limitations imposed by formula (33).

Scenario   Architecture size (p)   n = Θ(p log p)   n = Θ(p(log p)^2)
1          10^1                    1.0 × 10^1       1.00 × 10^1
2          10^2                    2.0 × 10^2       8.00 × 10^2
3          10^3                    3.0 × 10^3       2.70 × 10^4
4          10^4                    4.0 × 10^4       6.40 × 10^5 (*)
5          10^5                    5.0 × 10^5 (*)   1.25 × 10^7
6          10^6                    6.0 × 10^6       2.16 × 10^8
7          10^7                    7.0 × 10^7       3.43 × 10^9
8          10^8                    8.0 × 10^8       5.12 × 10^10
9          10^9                    9.0 × 10^9       7.29 × 10^11

Numerical algorithms use a hypercube architecture for implementation. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers $T_{\mathrm{com}} = t_s + Lt_w$, where $t_s$ is the latency (the time needed by a message to cross through the network), $t_w$ is the time needed to send a word ($1/t_w$ is called bandwidth), and $L$ is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define $t_c$ as the processing time per word for a processor. We have the following results.

(i) External product $xy^T$. The isoefficiency is written as

\[
t_c n \approx E\left(t_c n + (t_s + t_w)p\log p\right) \Longrightarrow n = \Theta(p\log p). \quad (34)
\]

The parallel processing time is $T_p = t_c n/p + (t_s + t_w)\log p$. The optimality is computed using

\[
\frac{dT_p}{dp} = 0 \Longrightarrow -t_c\frac{n}{p^2} + \frac{t_s + t_w}{p} = 0 \Longrightarrow p \approx \frac{t_c n}{t_s + t_w}. \quad (35)
\]

(ii) Scalar product (internal product) $x^T y = \sum_{i=1}^{n} x_i y_i$. The isoefficiency is written as

\[
t_c n^2 \approx E\left(t_c n^2 + \frac{t_s}{2}p\log p + \frac{t_w}{2}n\sqrt{p}\log p\right) \Longrightarrow n = \Theta\left(p(\log p)^2\right). \quad (36)
\]

(iii) Matrix-vector product $y = Ax$, $y_i = \sum_{j=1}^{n} A_{ij}x_j$. The isoefficiency is written as

\[
t_c n^2 \approx E\left(t_c n^2 + t_s p\log p + t_w n\sqrt{p}\log p\right) \Longrightarrow n = \Theta\left(p(\log p)^2\right). \quad (37)
\]

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound $n \ge 1.05 \times 10^6$ are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound for the Big Data volume in this case.
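Formula (35) for the external product can be sketched as follows (illustrative; the machine constants $t_c$, $t_s$, $t_w$ are assumptions, and the natural logarithm is used to match the derivation):

```python
import math

# Illustrative sketch: the optimal processor count of eq. (35),
# p ≈ t_c n / (t_s + t_w), minimizing T_p = t_c n / p + (t_s + t_w) log p.
def parallel_time(n, p, tc, ts, tw):
    return tc * n / p + (ts + tw) * math.log(p)   # natural log, as in (35)

def optimal_p(n, tc, ts, tw):
    return tc * n / (ts + tw)

n, tc, ts, tw = 10**6, 1e-9, 1e-6, 1e-7           # assumed machine constants
p_opt = optimal_p(n, tc, ts, tw)
t_best = parallel_time(n, p_opt, tc, ts, tw)
# T_p is minimal at p_opt compared with halving or doubling the processors
assert t_best <= parallel_time(n, p_opt / 2, tc, ts, tw)
assert t_best <= parallel_time(n, p_opt * 2, tc, ts, tw)
```

Such a back-of-the-envelope model makes the trade-off explicit: past $p_{\mathrm{opt}}$, the per-processor work shrinks more slowly than the communication term grows.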

The factors that determine the efficiency of parallel algorithms are task balancing (the workload distribution between all used processors in a system, to be maximized), concurrency (the number or percentage of processors working simultaneously, to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing, to be minimized).


Figure 8: Processing flows for HEP experiments. Experimental data (approximately TB of data, a Big Data problem) and simulation (MC data generated from theoretical models, a Big Data problem) each pass through modeling and reconstruction (finding the properties of observed particles) and analysis (reducing the results to simple statistical distributions); the two resulting distributions are then compared.

3. New Challenges for Big Data Science

There are a lot of applications that generate Big Data, like social networking profiles, social influence, SaaS and Cloud applications, public web information, MapReduce scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the approaches and solutions to better answer the users' needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TB of valuable data) and simulation data (data generated using MC methods based on theoretical models). The processing phase is represented by modeling and reconstruction, in order to find the properties of the observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results will show how realistic a simulation experiment is and validate it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for the scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes in resource availability, and so forth. We will pay special attention to Cloud systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are maximization of system throughput, sites' autonomy, scalability, fault-tolerance, and quality of services.

When discussing Big Data, we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need for many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrives too fast for optimal storage and indexing [28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, "Hadoop YARN: a next-generation framework for Hadoop data processing," 2013, http://hortonworks.com/hadoop/yarn/), Corona ("Under the Hood: Scheduling MapReduce jobs more efficiently with Corona," 2012, Facebook), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future Scientific Data Infrastructure are presented in [31]. That paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science, and proposes the SDI generic architecture model that provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and the limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internally redundant data structure with cost advantages and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.

Data transfer scheduling in Grids, Clouds, P2P systems, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application architecture, data must be transported to the place where the tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time, for finding a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and query, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data is too heterogeneous to fit into a rigid schema [38].

Another challenge is represented by the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration the deadlines and budgets, and also the dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization are the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and the optimization of short jobs [43]. The cost effectiveness, scalability, and streamlined architectures of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to a public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), complex task dependencies for data workflows, and new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method, the proper numerical method has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper puts together several concepts: HEP, HPC, and numerical methods and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research presented in this paper is supported bythe following projects ldquoSideSTEPmdashScheduling Methods forDynamic Distributed Systems a self-lowast approachrdquo (PN-II-CT-RO-FR-2012-1-0084) ldquoERRICmdashEmpowering Roma-nian Research on Intelligent Information Technologiesrdquo FP7-REGPOT-2010-1 ID 264207 CyberWater Grant of theRomanianNational Authority for Scientific Research CNDI-UEFISCDI Project no 472012 The author would like tothank the reviewers for their time and expertise constructivecomments and valuable insights

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.

[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.

[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.

[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678–683, 2006.

[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488–493, 2002.

[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585–591, 2009.

[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43–48, 2009.

[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 05, 2006.

[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219–228, 1998.

[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311–322, 1998.

[11] O. Behnke, K. Kroninger, G. Schott, and T. H. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.

[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).

[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.

[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.

[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469–481, 1996.

[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333–345, 2013.

[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533–540, Springer, Berlin, Germany, 2009.

[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233–297, 2005.

[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198–207, 2001.

[20] I. S. Abramson, "On bandwidth variation in kernel estimates: a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217–1223, 1982.

[21] X. Li, J. W. Demmel, D. H. Bailey, et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152–205, 2002.

[22] J. J. Dongarra and J. Langou, "The problem with the Linpack benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5–13, 2009.

[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640–649, April 1995.

[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.

[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.

[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710–2721, 2013.

[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014–2015, 2012.

[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1–5, Springer, Berlin, Germany, 2013.

[29] B. Hindman, A. Konwinski, M. Zaharia, et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.

[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.

[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614–617, IEEE Computer Society, Washington, DC, USA, 2012.

[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.

[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097–1111, 2013.

[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153–172, 2012.

[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109–110, ACM, New York, NY, USA, 2013.

[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (PGCIC '11), pp. 59–66, Barcelona, Spain, October 2011.

[37] J. Chen, Y. Chen, X. Du, et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157–164, 2013.

[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7–11, ACM, New York, NY, USA, 2012.

[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973–985, 2013.

[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104–112, 2013.

[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355–367, 2007.

[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.

[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985–996, 2013.


Advances in High Energy Physics 9

procedure ASH(a, b, n, x, nbin, m, fm)
    δ = (b − a)/nbin; h = m·δ
    for k = 1, nbin do
        v_k = 0; y_k = 0
    end for
    for i = 1, n do
        k = ⌊(x_i − a)/δ⌋ + 1
        if k ∈ [1, nbin] then
            v_k = v_k + 1
        end if
    end for
    for k = 1, nbin do
        if v_k = 0 then
            k = k + 1
        end if
        for i = max{1, k − m + 1}, min{nbin, k + m − 1} do
            y_i = y_i + v_k · fm(i − k)
        end for
    end for
    for k = 1, nbin do
        y_k = y_k / (n·h)
        t_k = a + (k − 0.5)·δ
    end for
    return {t_k}_{1≤k≤nbin}, {y_k}_{1≤k≤nbin}
end procedure

Algorithm 4: Advanced shifted histogram (1D algorithm).
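The listing maps naturally to a short routine. The following is a sketch (not the paper's code), assuming a NumPy environment; n is taken from the data, and the kernel weight function fm, which the listing takes as an input, is fixed here to an illustrative triangle kernel:

```python
import numpy as np

def ash(a, b, x, nbin, m):
    """Advanced shifted histogram on [a, b] with nbin bins and m shifts.

    Returns bin centers t and density estimates y, following the
    structure of Algorithm 4 (triangle kernel chosen for illustration).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    delta = (b - a) / nbin              # bin width
    h = m * delta                       # effective smoothing bandwidth
    # Bin counts v_k for the samples that fall inside [a, b]
    k = np.floor((x - a) / delta).astype(int)   # 0-based bin index
    inside = (k >= 0) & (k < nbin)
    v = np.bincount(k[inside], minlength=nbin)
    # Spread each bin count over its neighbours, weighted by fm(i - k)
    y = np.zeros(nbin)
    fm = lambda d: max(0.0, 1.0 - abs(d) / m)   # illustrative triangle kernel
    for kk in range(nbin):
        if v[kk] == 0:
            continue
        lo, hi = max(0, kk - m + 1), min(nbin, kk + m)
        for i in range(lo, hi):
            y[i] += v[kk] * fm(i - kk)
    y /= n * h                                   # normalise to a density
    t = a + (np.arange(nbin) + 0.5) * delta      # bin centers
    return t, y
```

For data well inside [a, b], the returned y integrates to approximately 1 over the interval, as a density estimate should.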

and Kernel Estimation) can be reduced to scalar products, matrix-vector products, and matrix-matrix products. In [21] the design of a new standard for the BLAS (Basic Linear Algebra Subroutines) in the C language, by extension of precision, is described. This permits higher internal precision and mixed input/output types. The extended precision allows implementation of some algorithms that are simpler, more accurate, and sometimes faster than possible without these features. Regarding the precision of numerical computing, Dongarra and Langou established in [22] an upper bound for the residual check for an Ax = y system with A ∈ R^(n×n) a dense matrix. The residual check is defined as

r_∞ = ‖Ax − y‖_∞ / (n·ε·(‖A‖_∞·‖x‖_∞ + ‖y‖_∞)) < 16,   (33)

where ε is the relative machine precision for the IEEE representation standard, ‖y‖_∞ is the infinite norm of a vector, ‖y‖_∞ = max_{1≤i≤n} |y_i|, and ‖A‖_∞ is the infinite norm of a matrix, ‖A‖_∞ = max_{1≤i≤n} Σ_{j=1}^{n} |A_ij|.

Figure 7 presents the graphical representation of Dongarra's result (using logarithmic scales) for simple and double precision. For simple precision, ε_s = 2^−24, for all n ≥ 1.05 × 10^6 the residual check is always lower than the imposed upper bound; similarly for double precision, with ε_d = 2^−53, for all n ≥ 5.63 × 10^14. If the matrix size is greater than these values, it will not be possible to detect whether the solution is correct or not. These results establish upper bounds for the data volume in this model.
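The residual check in (33) is straightforward to compute. A minimal sketch, assuming NumPy (the function name and test matrix are illustrative, not from the paper):

```python
import numpy as np

def residual_check(A, x, y):
    """Scaled residual from Dongarra and Langou's bound (33):
    r_inf = ||Ax - y||_inf / (n * eps * (||A||_inf * ||x||_inf + ||y||_inf)).
    The computed solution x is accepted when r_inf < 16.
    """
    n = A.shape[0]
    eps = np.finfo(A.dtype).eps          # relative machine precision
    r = np.linalg.norm(A @ x - y, np.inf)
    scale = n * eps * (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
                       + np.linalg.norm(y, np.inf))
    return r / scale

# Illustrative use: a correctly solved random system scores well below 16
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200))
y = rng.standard_normal(200)
x = np.linalg.solve(A, y)
r = residual_check(A, x, y)
```

As n grows, the n·ε factor in the denominator shrinks r_∞, which is exactly why the check loses its discriminating power beyond the matrix sizes quoted above.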

In a single-processor system, the complexity of an algorithm depends only on the problem size n. We can assume T(n) = Θ(f(n)), where f(n) is a fundamental function (f(n) ∈ {1, n^α, a^n, log n, √n, ...}). In parallel systems (multiprocessor systems with p processors) we have the serial processing time T*(n) = T_1(n) and the parallel processing time T_p(n). The performance of parallel algorithms can be analyzed using the speed-up, efficiency, and isoefficiency metrics.

(i) The speed-up S(p) represents how much faster a parallel algorithm is than a corresponding sequential algorithm. The speed-up is defined as S(p) = T_1(n)/T_p(n). There are special bounds for the speed-up [23]: S(p) ≤ p̄·p/(p + p̄ − 1), where p̄ = T_1/T_∞ is the average parallelism (the average number of busy processors given an unbounded number of processors). Usually S(p) ≤ p, but under special circumstances the speed-up can be S(p) > p [24]. Another upper bound is established by Amdahl's law: S(p) = (s + (1 − s)/p)^(−1) ≤ 1/s, where s is the fraction of a program that is sequential. The upper bound corresponds to zero time for the parallel fraction.

(ii) The efficiency is the average utilization of the p processors: E(p) = S(p)/p.

(iii) The isoefficiency is the growth rate of the workload W_p(n) = p·T_p(n) in terms of the number of processors needed to keep the efficiency fixed. If we consider W_1(n) − E·W_p(n) = 0 for any fixed efficiency E, we obtain p = p(n). This means that we can establish a relation between the needed number of processors and the problem size. For example, for the parallel sum of n numbers using p processors we have n ≈ E(n + p log p), so n = Θ(p log p).
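These metrics are simple to compute from measured timings. A minimal sketch (function names and the sample timings are illustrative assumptions, not from the paper):

```python
def speedup(t1, tp):
    """S(p) = T_1(n) / T_p(n)."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p, the average utilization of p processors."""
    return speedup(t1, tp) / p

def amdahl_bound(s, p):
    """Amdahl's law: S(p) = (s + (1 - s)/p)^(-1), capped at 1/s."""
    return 1.0 / (s + (1.0 - s) / p)

# Example: a run taking 100 s serially and 16 s on 8 processors
print(speedup(100.0, 16.0))        # 6.25
print(efficiency(100.0, 16.0, 8))  # 0.78125
print(amdahl_bound(0.05, 8))       # ~5.93; tends to 1/s = 20 as p grows
```

Note how even a 5% sequential fraction caps the achievable speed-up at 20 regardless of how many processors are added.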

[Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins (nbin = 10, 100, and 1000); each panel plots the estimate over the interval [a, b] = [0, 1].]

[Figure 7: Residual check analysis for solving the Ax = y system in HPL 2.0, using simple and double precision representation; log10 of the residual check versus log10(n), with the HPL 2.0 bound shown.]

Numerical algorithms use a hypercube architecture for implementation. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers T_com = t_s + L·t_w, where t_s is the latency (the time needed by a message to cross through the network), t_w is the time needed to send a word (1/t_w is called the bandwidth), and L is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define t_c as the processing time per word for a processor. We have the following results.

Table 2: Isoefficiency for a hypercube architecture, n = Θ(p log p) and n = Θ(p(log p)^2). The values marked with (*) meet the limitations imposed by Formula (33).

Scenario   Architecture size (p)   n = Θ(p log p)   n = Θ(p(log p)^2)
1          10^1                    1.0 × 10^1       1.00 × 10^1
2          10^2                    2.0 × 10^2       8.00 × 10^2
3          10^3                    3.0 × 10^3       2.70 × 10^4
4          10^4                    4.0 × 10^4       6.40 × 10^5 (*)
5          10^5                    5.0 × 10^5 (*)   1.25 × 10^7
6          10^6                    6.0 × 10^6       2.16 × 10^8
7          10^7                    7.0 × 10^7       3.43 × 10^9
8          10^8                    8.0 × 10^8       5.12 × 10^10
9          10^9                    9.0 × 10^9       7.29 × 10^11

(i) External product x·y^T. The isoefficiency is written as

t_c·n ≈ E(t_c·n + (t_s + t_w)·p·log p) ⇒ n = Θ(p log p).   (34)

The parallel processing time is T_p = t_c·n/p + (t_s + t_w)·log p. The optimal number of processors is computed using

dT_p/dp = 0 ⇒ −t_c·n/p² + (t_s + t_w)/p = 0 ⇒ p ≈ t_c·n/(t_s + t_w).   (35)

(ii) Scalar product (internal product) x^T·y = Σ_{i=1}^{n} x_i·y_i. The isoefficiency is written as

t_c·n² ≈ E(t_c·n² + (t_s/2)·p·log p + (t_w/2)·n·√p·log p) ⇒ n = Θ(p(log p)²).   (36)

(iii) Matrix-vector product y = Ax, y_i = Σ_{j=1}^{n} A_ij·x_j. The isoefficiency is written as

t_c·n² ≈ E(t_c·n² + t_s·p·log p + t_w·n·√p·log p) ⇒ n = Θ(p(log p)²).   (37)
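The stationary-point condition in (35) can be checked numerically. A small sketch under stated assumptions: the hardware constants t_c, t_s, t_w below are illustrative, and the natural logarithm is used so that (35) is the exact optimum of the cost model:

```python
import math

def parallel_time(n, p, tc, ts, tw):
    """T_p = tc*n/p + (ts + tw)*log(p) for the external product
    (natural log, matching the derivative taken in (35))."""
    return tc * n / p + (ts + tw) * math.log(p)

def optimal_p(n, tc, ts, tw):
    """Stationary point of T_p, from (35): p ~ tc*n / (ts + tw)."""
    return tc * n / (ts + tw)

# Illustrative constants (assumptions, not measurements):
# tc = 1 ns per word, ts = 10 us latency, tw = 0.1 us per word
tc, ts, tw = 1e-9, 1e-5, 1e-7
p_star = optimal_p(10**8, tc, ts, tw)   # about 9901 processors
```

Evaluating T_p on either side of p_star confirms that both too few processors (compute-bound) and too many (communication-bound) increase the parallel time.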

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound n ≥ 1.05 × 10^6 are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound for the Big Data volume in this case.

The factors that determine the efficiency of parallel algorithms are task balancing (the work-load distribution between all used processors in a system, to be maximized), concurrency (the number/percentage of processors working simultaneously, to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing, to be minimized).


[Figure 8: Processing flows for HEP experiments. Two flows are compared: experimental data (approximately TB of data, a Big Data problem) and simulation (MC data generated based on theoretical models, a Big Data problem); each passes through modeling and reconstruction (find the properties of observed particles) and analysis (reduce the result to a simple statistical distribution).]

3 New Challenges for Big Data Science

There are a lot of applications that generate Big Data, like social networking profiles, social influence, SaaS and Cloud Apps, public web information, MapReduce scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications produce continuously increasing volumes of both unstructured and structured data. The impact on the approach to data processing, transfer, and storage is the need to reevaluate the ways and solutions to better answer the users' needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with TB of valuable data) and simulation data (data generated using MC based on theoretical models). The processing phase is represented by modeling and reconstruction in order to find the properties of the observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results will validate how realistic a simulation experiment is and validate it for use in other new models.

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes in resources' availability, and so forth. We will pay special attention to Cloud Systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues are maximization of system throughput, sites' autonomy, scalability, fault-tolerance, and quality of services.

When discussing Big Data we have in mind the 5Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need of many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrives too fast for optimal storage and indexing [28]. There are several other processing platforms for Big Data: Mesos [29], YARN (Hortonworks, Hadoop YARN: a next-generation framework for Hadoop data processing, 2013, http://hortonworks.com/hadoop/yarn), Corona ("Under the Hood: Scheduling MapReduce jobs more efficiently with Corona," 2012, Facebook), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future Scientific Data Infrastructure are presented in [31]. The paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects specifics of data management in modern e-Science. The paper proposes the SDI generic architecture model that provides a basis for building interoperable data- or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure with cost advantages and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.

Data transfer scheduling in Grids, Clouds, P2P, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time for finding a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and query, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data is too heterogeneous to fit into a rigid schema [38].

Another challenge is the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, and also the dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization are the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and the optimization of short jobs [43]. The cost effectiveness, scalability, and streamlined architectures of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to the public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?


[35] Y Xu A Suarez and M Zhao ldquoIbis interposed big-data ioschedulerrdquo in Proceedings of the 22nd International Sympo-sium on High-performance Parallel and Distributed Computing(HPDC rsquo13) pp 109ndash110 ACM New York NY USA 2013

[36] S Sotiriadis N Bessis and N Antonopoulos ldquoTowards inter-cloud schedulers a survey of meta-scheduling approachesrdquo inProceedings of the 6th International Conference on P2P ParallelGrid Cloud and Internet Computing (PGCIC rsquo11) pp 59ndash66Barcelona Spain October 2011

[37] J Chen Y Chen X Du et al ldquoBig data challenge a datamanagement perspectiverdquo Frontiers in Computer Science vol 7no 2 pp 157ndash164 2013

[38] V Gopalkrishnan D Steier H Lewis and J Guszcza ldquoBigdata big business bridging the gaprdquo in Proceedings of the 1stInternationalWorkshop on Big Data Streams andHeterogeneousSource Mining Algorithms Systems Programming Models andApplications (BigMine rsquo12) pp 7ndash11 ACM New York NY USA2012

[39] R van den Bossche K Vanmechelen and J BroeckhoveldquoOnline coste fficient scheduling of deadline-constrained work-loads on hybrid cloudsrdquo Future Generation Computer Systemsvol 29 no 4 pp 973ndash985 2013

[40] N Bessis S Sotiriadis F Pop and V Cristea ldquoUsing anovel messageexchanging optimization (meo) model to reduceenergy consumption in distributed systemsrdquo Simulation Mod-elling Practice and Theory vol 39 pp 104ndash112 2013

[41] G V Iordache M S Boboila F Pop C Stratan and VCristea ldquoA decentralized strategy for genetic scheduling inheterogeneous environmentsrdquo Multiagent and Grid Systemsvol 3 no 4 pp 355ndash367 2007

[42] F Zhang J Cao K Li S U Khan and K Hwang ldquoMulti-objective scheduling of many tasks in cloud platformsrdquo FutureGeneration Computer Systems 2013

[43] K Elmeleegy ldquoPiranha optimizing short jobs in hadooprdquoProceedings of the VLDB Endowment vol 6 no 11 pp 985ndash9962013


Figure 6: Example of the advanced shifted histogram algorithm running for different numbers of bins over the interval [a, b] = [0, 1]: (a) n_bin = 10, (b) n_bin = 100, and (c) n_bin = 1000.

Figure 7: Residual check analysis for solving the A x = y system in HPL 2.0 using single and double precision representations (curves: single precision, double precision, and the HPL 2.0 bound; horizontal axis: log10(n)).
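The effect behind Figure 7 can be reproduced with a small experiment. The sketch below is illustrative, not the HPL benchmark itself: it solves a well-conditioned system A x = y by plain Gaussian elimination twice, once in double precision and once with every arithmetic result rounded to IEEE single precision, and compares the residuals. The matrix sizes and tolerances are assumptions chosen for the demonstration.

```python
import random, struct

def to_f32(x):
    # Round a Python double to the nearest IEEE single-precision value
    return struct.unpack("f", struct.pack("f", x))[0]

def solve(A, y, rnd=lambda v: v):
    # Gaussian elimination without pivoting (safe here: A is diagonally dominant);
    # rnd is applied to every intermediate result to emulate a lower precision.
    n = len(A)
    A = [[rnd(v) for v in row] for row in A]
    y = [rnd(v) for v in y]
    for k in range(n):                       # forward elimination
        for i in range(k + 1, n):
            m = rnd(A[i][k] / A[k][k])
            for j in range(k, n):
                A[i][j] = rnd(A[i][j] - rnd(m * A[k][j]))
            y[i] = rnd(y[i] - rnd(m * y[k]))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):           # back substitution
        s = y[i]
        for j in range(i + 1, n):
            s = rnd(s - rnd(A[i][j] * x[j]))
        x[i] = rnd(s / A[i][i])
    return x

def residual(A, y, x):
    # Infinity norm of A x - y, evaluated in double precision
    return max(abs(sum(a * b for a, b in zip(row, x)) - yi)
               for row, yi in zip(A, y))

random.seed(0)
n = 40
A = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
for i in range(n):
    A[i][i] += n                             # make A diagonally dominant

y = [random.uniform(-1, 1) for _ in range(n)]
r_double = residual(A, y, solve(A, y))
r_single = residual(A, y, solve(A, y, rnd=to_f32))
assert r_double < 1e-12 < r_single           # double precision: far smaller residual
```

As in the figure, the single-precision residual sits many orders of magnitude above the double-precision one, which is why residual checks are part of benchmark validation.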

The numerical algorithms are implemented on a hypercube architecture. We analyze the performance of different numerical operations using the isoefficiency metric. For the hypercube architecture, a simple model for intertask communication considers T_com = t_s + L t_w, where t_s is the latency (the time needed by a message to cross the network), t_w

Table 2: Isoefficiency for a hypercube architecture, n = Θ(p log p) and n = Θ(p (log p)^2). The limitations imposed by Formula (33) are marked with (*).

Scenario   Architecture size (p)   n = Θ(p log p)    n = Θ(p (log p)^2)
1          10^1                    1.0 × 10^1        1.00 × 10^1
2          10^2                    2.0 × 10^2        8.00 × 10^2
3          10^3                    3.0 × 10^3        2.70 × 10^4
4          10^4                    4.0 × 10^4        6.40 × 10^5 (*)
5          10^5                    5.0 × 10^5 (*)    1.25 × 10^7
6          10^6                    6.0 × 10^6        2.16 × 10^8
7          10^7                    7.0 × 10^7        3.43 × 10^9
8          10^8                    8.0 × 10^8        5.12 × 10^10
9          10^9                    9.0 × 10^9        7.29 × 10^11

is the time needed to send a word (1/t_w is called bandwidth), and L is the message length (expressed in number of words). The word size depends on the processing architecture (usually it is two bytes). We define t_c as the processing time per word for a processor. We have the following results.

(i) External product x y^T. The isoefficiency is written as

    t_c n ≈ E (t_c n + (t_s + t_w) p log p) ⇒ n = Θ(p log p). (34)

The parallel processing time is T_p = t_c n / p + (t_s + t_w) log p. The optimality is computed using

    dT_p / dp = 0 ⇒ −t_c n / p² + (t_s + t_w) / p = 0 ⇒ p ≈ t_c n / (t_s + t_w). (35)
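Equation (35) can be checked numerically. The sketch below evaluates the parallel time model T_p = t_c n / p + (t_s + t_w) log p and the closed-form optimum p* = t_c n / (t_s + t_w); the constants t_c, t_s, t_w are illustrative values, not figures from the paper, and p is treated as a continuous variable as in the derivation.

```python
import math

def parallel_time(n, p, t_c=1e-9, t_s=1e-6, t_w=1e-8):
    # T_p = t_c*n/p + (t_s + t_w)*log(p): compute share plus communication term
    return t_c * n / p + (t_s + t_w) * math.log(p)

def optimal_p(n, t_c=1e-9, t_s=1e-6, t_w=1e-8):
    # Closed form from dT_p/dp = 0 (Equation (35))
    return t_c * n / (t_s + t_w)

n = 10 ** 7
p_star = optimal_p(n)
# The closed-form p* should beat nearby processor counts
assert parallel_time(n, p_star) <= parallel_time(n, 0.5 * p_star)
assert parallel_time(n, p_star) <= parallel_time(n, 2.0 * p_star)
```

Halving or doubling the processor count away from p* strictly increases T_p, confirming that the stationary point is a minimum of the model.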

(ii) Scalar product (internal product) x^T y = Σ_{i=1}^{n} x_i y_i. The isoefficiency is written as

    t_c n² ≈ E (t_c n² + (t_s / 2) p log p + (t_w / 2) n √p log p) ⇒ n = Θ(p (log p)²). (36)

(iii) Matrix-vector product y = A x, with y_i = Σ_{j=1}^{n} A_{ij} x_j. The isoefficiency is written as

    t_c n² ≈ E (t_c n² + t_s p log p + t_w n √p log p) ⇒ n = Θ(p (log p)²). (37)

Table 2 presents the amount of data that can be processed for a specific architecture size. The cases that meet the upper bound n ≥ 1.05 × 10^6 are marked with (*). To keep the efficiency high for a specific parallel architecture, HPC algorithms for particle physics introduce upper limits on the amount of data, which means that we also have an upper bound for the Big Data volume in this case.
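The second column of Table 2 can be regenerated from the isoefficiency relation. The sketch below assumes base-10 logarithms, which is an assumption on my part (the paper does not state the log base explicitly), under which n = Θ(p log p) at p = 10^k gives exactly k × 10^k.

```python
import math

def n_outer_product(p):
    # Isoefficiency of the external product: n = p * log(p), base-10 assumed
    return p * math.log10(p)

# Rebuild the (scenario, p, n) triples of Table 2's second column
rows = [(k, 10 ** k, n_outer_product(10 ** k)) for k in range(1, 10)]
assert round(rows[0][2]) == 10            # scenario 1: 1.0 × 10^1
assert round(rows[4][2]) == 5 * 10 ** 5   # scenario 5: 5.0 × 10^5, marked (*)
assert round(rows[8][2]) == 9 * 10 ** 9   # scenario 9: 9.0 × 10^9
```

The same loop, with the exponent on log p changed, can be used to probe the third column against the stated bound.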

The factors that determine the efficiency of parallel algorithms are task balancing (the workload distribution between all used processors in a system, to be maximized), concurrency (the number or percentage of processors working simultaneously, to be maximized), and overhead (extra work introduced by parallel processing that does not appear in serial processing, to be minimized).
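These three factors combine in the standard efficiency definition E = T_serial / (p T_parallel). The toy model below is mine, not the paper's: it folds load imbalance and overhead into the parallel time to show how each one pulls efficiency below 1.

```python
def efficiency(t_serial, p, imbalance=1.0, overhead=0.0):
    # imbalance >= 1 scales the slowest processor's share (perfect balance: 1.0);
    # overhead is the extra parallel-only work in seconds
    t_parallel = (t_serial / p) * imbalance + overhead
    return t_serial / (p * t_parallel)

assert efficiency(100.0, 10) == 1.0                  # ideal: balanced, no overhead
assert efficiency(100.0, 10, imbalance=1.25) == 0.8  # imbalance alone costs 20%
assert efficiency(100.0, 10, overhead=2.5) == 0.8    # overhead alone costs 20%
```

Maximizing balance and concurrency drives the imbalance factor toward 1, while minimizing overhead drives the additive term toward 0, which is exactly the statement above.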


Figure 8: Processing flows for HEP experiments. Two parallel flows are compared: experimental data (approximately terabytes of data, a Big Data problem) and simulation (MC data generated from theoretical models, a Big Data problem). Each flow passes through modeling and reconstruction (finding the properties of observed particles) and analysis (reducing the result to a simple statistical distribution), and the two resulting distributions are compared.

3. New Challenges for Big Data Science

Many applications generate Big Data: social networking profiles, social influence, SaaS and Cloud applications, public web information, MapReduce, scientific experiments and simulations (especially HEP simulations), data warehouses, monitoring technologies, and e-government services. Data grow rapidly, since applications continuously produce increasing volumes of both unstructured and structured data. The impact on data processing, transfer, and storage is the need to reevaluate the existing approaches and solutions to better answer user needs [25]. In this context, scheduling models and algorithms for data processing have an important role, becoming a new challenge for Big Data Science.

HEP applications consider both experimental data (applications with terabytes of valuable data) and simulation data (generated using MC methods based on theoretical models). The processing phase is represented by modeling and reconstruction, in order to find the properties of the observed particles (see Figure 8). Then the data are analyzed and reduced to a simple statistical distribution. The comparison of the obtained results validates how realistic a simulation experiment is and qualifies it for use in other new models.
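As a toy illustration of this flow (the names, distributions, and comparison metric are mine, not the paper's), both the "experimental" and the simulated samples below are reduced to a normalized histogram, the simple statistical distribution of Figure 8, and then compared bin by bin:

```python
import random

random.seed(1)

def to_histogram(samples, nbins=20, lo=-5.0, hi=5.0):
    # Reduce a sample to a normalized histogram over [lo, hi)
    counts = [0] * nbins
    for x in samples:
        if lo <= x < hi:
            counts[int((x - lo) / (hi - lo) * nbins)] += 1
    total = sum(counts)
    return [c / total for c in counts]

experiment = [random.gauss(0, 1) for _ in range(100000)]  # stand-in for detector data
simulation = [random.gauss(0, 1) for _ in range(100000)]  # stand-in for MC events

# Compare the two reduced distributions (L1 distance between histograms)
distance = sum(abs(a - b) for a, b in
               zip(to_histogram(experiment), to_histogram(simulation)))
assert distance < 0.05  # same underlying model, so the distributions agree closely
```

When the simulated model is wrong, the distance grows, which is the signal used to reject or refine the theoretical model.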

Since we face a large variety of solutions for specific applications and platforms, a thorough and systematic analysis of existing solutions for the scheduling models, methods, and algorithms used in Big Data processing and storage environments is needed. The challenges for scheduling impose specific requirements in distributed systems: the claims of the resource consumers, the restrictions imposed by resource owners, the need to continuously adapt to changes in resource availability, and so forth. We pay special attention to Cloud systems and HPC clusters (datacenters) as reliable solutions for Big Data [26]. Based on these requirements, a number of challenging issues arise: maximization of system throughput, sites' autonomy, scalability, fault tolerance, and quality of services.

When discussing Big Data we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. There is a clear need for many organizations, companies, and researchers to deal with Big Data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. For these examples, a popular data processing engine for Big Data is Hadoop MapReduce [27]. The main problem is that data arrive too fast for optimal storage and indexing

[28]. Several other processing platforms for Big Data exist: Mesos [29], YARN (Hortonworks, "Hadoop YARN: a next-generation framework for Hadoop data processing," 2013, http://hortonworks.com/hadoop/yarn/), Corona ("Under the Hood: Scheduling MapReduce jobs more efficiently with Corona," Facebook, 2012), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges described for Big Data Science on the modern and future Scientific Data Infrastructure are presented in [31]. That paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science, and proposes the generic SDI architecture model, which provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and the limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communications records, emails, and so forth. Hadoop relies on an internal redundant data structure, with cost advantages, and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.
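To make one of these policies concrete, fair scheduling in its simplest form gives each free slot to the pool currently running the fewest tasks. The sketch below is a hedged illustration of that idea, not Hadoop's actual Fair Scheduler code; pool names are invented.

```python
import heapq

def fair_assign(pools, free_slots):
    # pools: {pool_name: running_task_count}; returns the pool chosen per slot
    heap = [(count, name) for name, count in pools.items()]
    heapq.heapify(heap)
    assignment = []
    for _ in range(free_slots):
        count, name = heapq.heappop(heap)   # pool with fewest running tasks
        assignment.append(name)
        heapq.heappush(heap, (count + 1, name))  # it now runs one more task
    return assignment

# The starved pool gets slots first, then allocation evens out
assert fair_assign({"analytics": 4, "hep-sim": 1, "etl": 2}, 3) == \
       ["hep-sim", "etl", "hep-sim"]
```

Capacity, delay, and deadline schedulers replace this single ordering criterion with queue capacities, locality waits, or deadline feasibility checks, respectively.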

Data transfer scheduling in Grids, Clouds, P2P systems, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time, in order to find a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and querying, and finally Big Data analysis and mining. On the opposite side, business analytics, occupying the intersection


of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data are too heterogeneous to fit into a rigid schema [38].
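Returning to the data-transfer point above: a mapping that weighs execution time against input-transfer time can pick a different site than execution time alone would. The sketch below is illustrative only; the site names, speeds, and the additive cost model are assumptions, not from the paper.

```python
def best_site(task_flops, data_gb, sites):
    # Total completion time = compute time + time to move the input data
    def total_time(site):
        exec_t = task_flops / site["gflops"]      # seconds of compute
        transfer_t = data_gb * 8 / site["gbps"]   # seconds to move the input
        return exec_t + transfer_t
    return min(sites, key=total_time)["name"]

sites = [
    {"name": "fast-cpu-slow-link", "gflops": 500.0, "gbps": 0.1},
    {"name": "slow-cpu-fast-link", "gflops": 100.0, "gbps": 10.0},
]
# A data-heavy task favors the fast link despite the slower processors
assert best_site(task_flops=1000.0, data_gb=100.0, sites=sites) == "slow-cpu-fast-link"
```

Optimizing only exec_t would send the task to the faster CPUs and then stall for hours on the 0.1 Gbps link, which is exactly the trap the simultaneous optimization in [34] avoids.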

Another challenge is represented by the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, as well as dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.
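One possible shape for such a policy, offered here as a hedged sketch rather than any system's actual implementation, orders requests by soft deadline first and, among equal deadlines, by larger budget; the request fields and values are invented.

```python
def order_requests(requests, now=0.0):
    # Earliest deadline first; ties broken in favor of the larger budget
    return sorted(requests, key=lambda r: (r["deadline"] - now, -r["budget"]))

reqs = [
    {"id": "sim-A", "deadline": 50.0, "budget": 10.0},
    {"id": "sim-B", "deadline": 20.0, "budget": 5.0},
    {"id": "sim-C", "deadline": 20.0, "budget": 8.0},
]
assert [r["id"] for r in order_requests(reqs)] == ["sim-C", "sim-B", "sim-A"]
```

Because the deadlines are soft, a real scheduler would additionally re-sort as `now` advances and demote requests whose deadlines have already passed, which is part of the open issue noted above.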

The optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization are the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and the optimization of short jobs [43]. The cost effectiveness, scalability, and streamlined architecture of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to the public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. The current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), complex task dependencies for data workflows, and new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method, the proper numerical method has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper put together several concepts: HEP, HPC, and numerical methods and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving linear equation systems, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research presented in this paper is supported by the following projects: "SideSTEP - Scheduling Methods for Dynamic Distributed Systems: a self-* approach" (PN-II-CT-RO-FR-2012-1-0084); "ERRIC - Empowering Romanian Research on Intelligent Information Technologies," FP7-REGPOT-2010-1, ID 264207; and CyberWater, Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, Project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.
[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.
[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.
[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678-683, 2006.
[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488-493, 2002.
[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585-591, 2009.
[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43-48, 2009.
[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 5, 2006.
[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219-228, 1998.
[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311-322, 1998.
[11] O. Behnke, K. Kroninger, G. Schott, and T. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.
[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).
[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.
[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.
[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469-481, 1996.
[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333-345, 2013.
[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533-540, Springer, Berlin, Germany, 2009.
[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233-297, 2005.
[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198-207, 2001.
[20] I. S. Abramson, "On bandwidth variation in kernel estimates: a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217-1223, 1982.
[21] X. Li, J. W. Demmel, D. H. Bailey, et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152-205, 2002.
[22] J. J. Dongarra and J. Langou, "The problem with the Linpack benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5-13, 2009.
[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640-649, April 1995.
[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.
[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.
[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710-2721, 2013.
[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014-2015, 2012.
[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1-5, Springer, Berlin, Germany, 2013.
[29] B. Hindman, A. Konwinski, M. Zaharia, et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.
[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.
[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614-617, IEEE Computer Society, Washington, DC, USA, 2012.
[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.
[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097-1111, 2013.
[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153-172, 2012.
[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109-110, ACM, New York, NY, USA, 2013.
[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC '11), pp. 59-66, Barcelona, Spain, October 2011.
[37] J. Chen, Y. Chen, X. Du, et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157-164, 2013.
[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7-11, ACM, New York, NY, USA, 2012.
[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973-985, 2013.
[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104-112, 2013.
[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355-367, 2007.
[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.
[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985-996, 2013.


Advances in High Energy Physics 11

Experimental data

problem)

Modeling and reconstruction

(find properties of observed particles)

Analysis(reduce result to simple statistical

distribution)

Simulation(generate MC data based on

theoretical models-BigData problem)

Modeling and reconstruction

(find properties of observed particles)

Analysis(reduce result to simple statistical

distribution)

Compare

(app TB of data rarr big data

Figure 8 Processing flows for HEP experiments

3 New Challenges for Big Data Science

There are a lot of applications that generate Big Data likesocial networking profiles social influence SaaS amp CloudApps public web information MapReduce scientific exper-iments and simulations (especially HEP simulations) datawarehouse monitoring technologies and e-government ser-vices Data grow rapidly since applications produce contin-uously increasing volumes of both unstructured and struc-tured data The impact on the approach to data processingtransfer and storage is the need to reevaluate the way andsolutions to better answer the usersrsquo needs [25] In this con-text scheduling models and algorithms for data processinghave an important role becoming a new challenge for BigData Science

HEP applications consider both experimental data (thatare application with TB of valuable data) and simulationdata (with data generated using MC based on theoreticalmodels) The processing phase is represented by modelingand reconstruction in order to find properties of observedparticles (see Figure 8)Then the data are analyzed a reducedto a simple statistical distribution The comparison of resultsobtainedwill validate how realistic is a simulation experimentand validate it for use in other new models

Since we face a large variety of solutions for specificapplications and platforms a thorough and systematic anal-ysis of existing solutions for scheduling models methodsand algorithms used in Big Data processing and storageenvironments is needed The challenges for schedulingimpose specific requirements in distributed systems theclaims of the resource consumers the restrictions imposed byresource owners the need to continuously adapt to changesof resourcesrsquo availability and so forth We will pay specialattention to Cloud Systems and HPC clusters (datacenters)as reliable solutions for Big Data [26] Based on theserequirements a number of challenging issues are maximiza-tion of system throughput sitesrsquo autonomy scalability fault-tolerance and quality of services

When discussing Big Data, we have in mind the 5 Vs: Volume, Velocity, Variety, Variability, and Value. Many organizations, companies, and researchers clearly need to deal with Big Data volumes efficiently; examples include web analytics applications, scientific applications, and social networks. For such workloads, a popular data processing engine is Hadoop MapReduce [27]. The main problem is that data arrive too fast for optimal storage and indexing [28].

Several other processing platforms exist for Big Data: Mesos [29], YARN (Hortonworks, "Hadoop YARN: a next-generation framework for Hadoop data processing," 2013, http://hortonworks.com/hadoop/yarn), Corona ("Under the Hood: Scheduling MapReduce jobs more efficiently with Corona," Facebook, 2012), and so forth. A review of various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, is presented in [30]. The challenges for Big Data Science on modern and future scientific data infrastructures are presented in [31]. That paper introduces the Scientific Data Lifecycle Management (SDLM) model, which includes all the major stages and reflects the specifics of data management in modern e-Science, and proposes a generic SDI architecture model that provides a basis for building interoperable data-centric or project-centric SDI using modern technologies and best practices. This analysis highlights at the same time the performance and the limitations of existing solutions in the context of Big Data. Hadoop can handle many types of data from disparate systems: structured, unstructured, logs, pictures, audio files, communication records, emails, and so forth. Hadoop relies on an internal redundant data structure with cost advantages and is deployed on industry-standard servers rather than on expensive specialized data storage systems [32]. The main challenges for scheduling in Hadoop are to improve the existing algorithms for Big Data processing: capacity scheduling, fair scheduling, delay scheduling, longest approximate time to end (LATE) speculative execution, deadline constraint scheduling, and resource-aware scheduling.
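Fair scheduling, one of the Hadoop policies listed above, can be illustrated with a minimal sketch: when a slot frees up, it goes to the pool that is furthest below its weighted fair share. The pool names and weights are hypothetical, and this is a simplification of what the actual Hadoop Fair Scheduler does:

```python
def fair_share_pick(pools):
    """Pick the pool whose running-task count is furthest below its
    weighted fair share.  pools: name -> {"running": int, "weight": float}."""
    total_weight = sum(p["weight"] for p in pools.values())

    def normalized_load(item):
        name, p = item
        # Running tasks scaled by the pool's share of the cluster:
        # the smallest value is the most underserved pool.
        return p["running"] / (p["weight"] / total_weight)

    return min(pools.items(), key=normalized_load)[0]

pools = {
    "analytics": {"running": 8, "weight": 1.0},
    "hep-sim":   {"running": 2, "weight": 1.0},
}
print(fair_share_pick(pools))  # "hep-sim": furthest below its fair share
```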

Data transfer scheduling in Grids, Clouds, P2P systems, and so forth represents a new challenge that is subject to Big Data. In many cases, depending on the application architecture, data must be transported to the place where tasks will be executed [33]. Consequently, scheduling schemes should consider not only the task execution time but also the data transfer time when searching for a more convenient mapping of tasks [34]. Only a handful of current research efforts consider the simultaneous optimization of computation and data transfer scheduling. The big-data I/O scheduler [35] offers a solution for applications that compete for I/O resources in a shared MapReduce-type Big Data system [36]. The paper [37] reviews Big Data challenges from a data management perspective and addresses Big Data diversity, Big Data reduction, Big Data integration and cleaning, Big Data indexing and querying, and finally Big Data analysis and mining.

12 Advances in High Energy Physics

On the opposite side, business analytics, occupying the intersection of the worlds of management science, computer science, and statistical science, is a potent force for innovation in both the private and public sectors. The conclusion is that the data are too heterogeneous to fit into a rigid schema [38].
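The joint optimization of computation and data transfer scheduling discussed earlier admits a compact sketch: rank candidate nodes by estimated transfer time plus execution time rather than by execution time alone. The node names, data sizes, and bandwidths below are hypothetical:

```python
def completion_time(exec_time_s, data_size_gb, bandwidth_gb_s):
    """Estimated completion = time to move the input data + time to run."""
    return data_size_gb / bandwidth_gb_s + exec_time_s

def map_task(exec_times, data_size_gb, bandwidths):
    """Choose the node minimizing transfer + execution time, not
    execution time alone.  Both dicts map node name -> value."""
    return min(
        exec_times,
        key=lambda n: completion_time(exec_times[n], data_size_gb, bandwidths[n]),
    )

# A fast node behind a slow link loses to a slower node close to the data.
exec_times = {"fast-remote": 10.0, "slow-local": 25.0}   # seconds
bandwidths = {"fast-remote": 1.0,  "slow-local": 40.0}   # GB/s
print(map_task(exec_times, 100.0, bandwidths))  # "slow-local" wins for 100 GB
```

With a small input (say 1 GB) the same rule picks the faster remote node, which is exactly the trade-off a data-aware scheduler has to capture.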

Another challenge is represented by the scheduling policies used to determine the relative ordering of requests. Large distributed systems with different administrative domains will most likely have different resource utilization policies. For example, a policy can take into consideration deadlines and budgets, as well as dynamic behavior [39]. HEP experiments are usually performed in private Clouds, considering dynamic scheduling with soft deadlines, which is an open issue.
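A deadline- and budget-aware ordering policy of the kind mentioned above might look as follows. This is one illustrative policy (earliest deadline first under a budget cap) with hypothetical job names and costs, not the scheme of [39]:

```python
def order_requests(requests, budget):
    """Order requests by urgency (earliest deadline first), skipping any
    request whose cost would exceed the remaining budget."""
    runnable = []
    remaining = budget
    for r in sorted(requests, key=lambda r: r["deadline"]):
        if r["cost"] <= remaining:
            runnable.append(r["name"])
            remaining -= r["cost"]
    return runnable

requests = [
    {"name": "reco-job", "deadline": 5, "cost": 40},
    {"name": "mc-batch", "deadline": 2, "cost": 70},
    {"name": "analysis", "deadline": 9, "cost": 20},
]
# mc-batch runs first (tightest deadline); reco-job no longer fits the
# remaining budget of 30, so analysis is admitted instead.
print(order_requests(requests, 100))  # ['mc-batch', 'analysis']
```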

Optimization techniques for the scheduling process represent an important aspect, because scheduling is a main building block for making datacenters more available to user communities, being energy-aware [40] and supporting multicriteria optimization [41]. Examples of optimization are the multiobjective and multiconstrained scheduling of many tasks in Hadoop [42] and the optimization of short jobs [43]. The cost effectiveness, scalability, and streamlined architecture of Hadoop represent solutions for Big Data processing. Considering the use of Hadoop in public/private Clouds, a challenge is to answer the following questions: what type of data/tasks should move to the public cloud in order to achieve a cost-aware cloud scheduler? And is the public Cloud a solution for HEP simulation experiments?
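The cost-aware public/private placement question raised above can be made concrete with a toy decision rule; the prices and thresholds are hypothetical, and this is a sketch of the trade-off, not a recommendation from the cited work:

```python
def place_task(hours, private_free_slots, public_price_per_hour, budget):
    """One possible cost-aware placement rule: run privately when a slot
    is free; otherwise burst to the public cloud only if the rental cost
    fits the remaining budget; otherwise queue the task."""
    if private_free_slots > 0:
        return "private"
    if hours * public_price_per_hour <= budget:
        return "public"
    return "queue"

# Private cluster full; a 4-hour task at $0.50/h costs $2.00, within budget.
print(place_task(4, 0, public_price_per_hour=0.5, budget=10.0))  # "public"
```

For HEP simulations, the deciding factors would additionally include data transfer cost and confidentiality constraints, which is precisely why the question remains open.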

The activities for Big Data processing vary widely in a number of issues, for example, support for heterogeneous resources, objective function(s), scalability, coscheduling, and assumptions about system characteristics. Current research directions are focused on accelerating data processing, especially for Big Data analytics (frequently used in HEP experiments), on complex task dependencies for data workflows, and on new scheduling algorithms for real-time scenarios.

4. Conclusions

This paper presented general aspects of the methods used in HEP: Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. For each method, the proper numerical method has been identified and analyzed. All of the identified methods produce data-intensive applications, which introduce new challenges and requirements for Big Data systems architecture, especially for processing paradigms and storage capabilities. This paper put together several concepts: HEP, HPC, numerical methods, and simulations. HEP experiments are modeled using numerical methods and simulations: numerical integration, eigenvalue computation, solving linear systems of equations, multiplying vectors and matrices, and interpolation. HPC environments offer powerful tools for data processing and analysis. Big Data was introduced as a concept for a real problem: we live in a data-intensive world, produce huge amounts of information, and face the upper bounds introduced by theoretical models.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The research presented in this paper is supported by the following projects: "SideSTEP—Scheduling Methods for Dynamic Distributed Systems: a self-* approach" (PN-II-CT-RO-FR-2012-1-0084); "ERRIC—Empowering Romanian Research on Intelligent Information Technologies" (FP7-REGPOT-2010-1, ID 264207); and CyberWater, Grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, Project no. 47/2012. The author would like to thank the reviewers for their time and expertise, constructive comments, and valuable insights.

References

[1] H. Newman, "Search for Higgs boson diphoton decay with CMS at LHC," in Proceedings of the ACM/IEEE Conference on Supercomputing (SC '06), ACM, New York, NY, USA, 2006.

[2] C. Cattani and A. Ciancio, "Separable transition density in the hybrid model for tumor-immune system competition," Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 610124, 6 pages, 2012.

[3] C. Toma, "Wavelets-computational aspects of Sterian realistic approach to uncertainty principle in high energy physics: a transient approach," Advances in High Energy Physics, vol. 2013, Article ID 735452, 6 pages, 2013.

[4] C. Cattani, A. Ciancio, and B. Lods, "On a mathematical model of immune competition," Applied Mathematics Letters, vol. 19, no. 7, pp. 678–683, 2006.

[5] D. Perret-Gallix, "Simulation and event generation in high-energy physics," Computer Physics Communications, vol. 147, no. 1-2, pp. 488–493, 2002.

[6] R. Baishya and J. K. Sarma, "Semi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli-Parisi evolution equation up to next-to-next-to-leading order at small x," European Physical Journal C, vol. 60, no. 4, pp. 585–591, 2009.

[7] I. I. Bucur, I. Fagarasan, C. Popescu, G. Culea, and A. E. Susu, "Delay optimum and area optimal mapping of k-LUT based FPGA circuits," Journal of Control Engineering and Applied Informatics, vol. 11, no. 1, pp. 43–48, 2009.

[8] R. R. de Austri, R. Trotta, and L. Roszkowski, "A Markov chain Monte Carlo analysis of the CMSSM," Journal of High Energy Physics, vol. 2006, no. 05, 2006.

[9] C. Serbanescu, "Noncommutative Markov processes as stochastic equations' solutions," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 3, pp. 219–228, 1998.

[10] C. Serbanescu, "Stochastic differential equations and unitary processes," Bulletin Mathematique de la Societe des Sciences Mathematiques de Roumanie, vol. 41, no. 4, pp. 311–322, 1998.

[11] O. Behnke, K. Kroninger, G. Schott, and T. H. Schorner-Sadenius, Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods, Wiley-VCH, Berlin, Germany, 2013.

[12] V. Blobel, "An unfolding method for high energy physics experiments," in Proceedings of the Conference on Advanced Statistical Techniques in Particle Physics, Durham, UK, March 2002, DESY 02-078 (June 2002).

[13] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, Philadelphia, Pa, USA, 1998.

[14] P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, vol. 7, SIAM, Philadelphia, Pa, USA, 2010.

[15] A. Hocker and V. Kartvelishvili, "SVD approach to data unfolding," Nuclear Instruments and Methods in Physics Research A, vol. 372, no. 3, pp. 469–481, 1996.

[16] I. Aziz and Siraj-ul-Islam, "New algorithms for the numerical solution of nonlinear Fredholm and Volterra integral equations using Haar wavelets," Journal of Computational and Applied Mathematics, vol. 239, pp. 333–345, 2013.

[17] A. Sawatzky, C. Brune, J. Muller, and M. Burger, "Total variation processing of images with Poisson statistics," in Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 533–540, Springer, Berlin, Germany, 2009.

[18] A. Edelman and N. R. Rao, "Random matrix theory," Acta Numerica, vol. 14, no. 1, pp. 233–297, 2005.

[19] K. Cranmer, "Kernel estimation in high-energy physics," Computer Physics Communications, vol. 136, no. 3, pp. 198–207, 2001.

[20] I. S. Abramson, "On bandwidth variation in kernel estimates—a square root law," The Annals of Statistics, vol. 10, no. 4, pp. 1217–1223, 1982.

[21] X. Li, J. W. Demmel, D. H. Bailey, et al., "Design, implementation and testing of extended and mixed precision BLAS," ACM Transactions on Mathematical Software, vol. 28, no. 2, pp. 152–205, 2002.

[22] J. J. Dongarra and J. Langou, "The problem with the LINPACK benchmark 1.0 matrix generator," International Journal of High Performance Computing Applications, vol. 23, no. 1, pp. 5–13, 2009.

[23] L. Lundberg and H. Lennerstad, "An optimal lower bound on the maximum speedup in multiprocessors with clusters," in Proceedings of the IEEE 1st International Conference on Algorithms and Architectures for Parallel Processing (ICAPP '95), vol. 2, pp. 640–649, April 1995.

[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.

[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.

[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710–2721, 2013.

[27] J. Dittrich and J. A. Quiane-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014–2015, 2012.

[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1–5, Springer, Berlin, Germany, 2013.

[29] B. Hindman, A. Konwinski, M. Zaharia, et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.

[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.

[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614–617, IEEE Computer Society, Washington, DC, USA, 2012.

[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.

[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097–1111, 2013.

[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153–172, 2012.

[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109–110, ACM, New York, NY, USA, 2013.

[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC '11), pp. 59–66, Barcelona, Spain, October 2011.

[37] J. Chen, Y. Chen, X. Du, et al., "Big data challenge: a data management perspective," Frontiers of Computer Science, vol. 7, no. 2, pp. 157–164, 2013.

[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7–11, ACM, New York, NY, USA, 2012.

[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973–985, 2013.

[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104–112, 2013.

[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355–367, 2007.

[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.

[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985–996, 2013.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

High Energy PhysicsAdvances in

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

FluidsJournal of

Atomic and Molecular Physics

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in Condensed Matter Physics

OpticsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

AstronomyAdvances in

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Superconductivity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Statistical MechanicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

GravityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

AstrophysicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Physics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Solid State PhysicsJournal of

 Computational  Methods in Physics

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Soft MatterJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

AerodynamicsJournal of

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PhotonicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Biophysics

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ThermodynamicsJournal of

12 Advances in High Energy Physics

of the worlds of management science computer science andstatistical science is a potent force for innovation in both theprivate and public sectors The conclusion is that the data istoo heterogeneous to fit into a rigid schema [38]

Another challenge is the scheduling policies used todetermine the relative ordering of requests Large distributedsystems with different administrative domains will mostlikely have different resource utilization policies For examplea policy can take into consideration the deadlines andbudgets and also the dynamic behavior [39] HEP experi-ments are usually performed in private Clouds consideringdynamic scheduling with soft deadlines which is an openissue

The optimization techniques for the scheduling processrepresent an important aspect because the scheduling is amain building block formaking datacentersmore available touser communities being energy-aware [40] and supportingmulticriteria optimization [41] An example of optimizationis multiobjective and multiconstrained scheduling of manytasks in Hadoop [42] or optimizing short jobs [43] Thecost effectiveness scalability and streamlined architecturesof Hadoop represent solutions for Big Data processingConsidering the use of Hadoop in publicprivate Clouds achallenge is to answer the following questions what type ofdatatasks should move to public cloud in order to achieve acost-aware cloud scheduler And is public Cloud a solutionfor HEP simulation experiments

The activities for Big Data processing vary widely in anumber of issues for example support for heterogeneousresources objective function(s) scalability coschedulingand assumptions about system characteristics The currentresearch directions are focused on accelerating data process-ing especially for Big Data analytics (frequently used in HEPexperiments) complex task dependencies for dataworkflowsand new scheduling algorithms for real-time scenarios

4 Conclusions

This paper presented general aspects about methods used inHEP Monte Carlo methods and simulations of HEP pro-cesses Markovian Monte Carlo unfolding methods in parti-cle physics kernel estimation inHEP RandomMatrixTheoryused in analysis of particles spectrum For each method theproper numerical method had been identified and analyzedAll of identified methods produce data-intensive applica-tions which introduce new challenges and requirementsfor Big Data systems architecture especially for processingparadigms and storage capabilities This paper puts togetherseveral concepts HEP HPC numerical methods and sim-ulations HEP experiments are modeled using numericalmethods and simulations numerical integration eigenvaluescomputation solving linear equation systems multiplyingvectors and matrices interpolation HPC environments offerpowerful tools for data processing and analysis Big Data wasintroduced as a concept for a real problem we live in a data-intensive world produce huge amount of information weface with upper bound introduced by theoretical models

Conflict of Interests

The author declares that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The research presented in this paper is supported bythe following projects ldquoSideSTEPmdashScheduling Methods forDynamic Distributed Systems a self-lowast approachrdquo (PN-II-CT-RO-FR-2012-1-0084) ldquoERRICmdashEmpowering Roma-nian Research on Intelligent Information Technologiesrdquo FP7-REGPOT-2010-1 ID 264207 CyberWater Grant of theRomanianNational Authority for Scientific Research CNDI-UEFISCDI Project no 472012 The author would like tothank the reviewers for their time and expertise constructivecomments and valuable insights

References

[1] H Newman ldquoSearch for higgs boson diphoton decay withcms at lhcrdquo in Proceedings of the ACMIEEE Conference onSupercomputing (SC rsquo06) ACM New York NY USA 2006

[2] C Cattani and A Ciancio ldquoSeparable transition density inthe hybrid model for tumor-immune system competitionrdquoComputational and Mathematical Methods in Medicine vol2012 Article ID 610124 6 pages 2012

[3] C Toma ldquoWavelets-computational aspects of sterian realisticapproach to uncertainty principle in high energy physics atransient approachrdquo Advances in High Energy Physics vol 2013Article ID 735452 6 pages 2013

[4] C Cattani A Ciancio and B Lods ldquoOn a mathematical modelof immune competitionrdquo Applied Mathematics Letters vol 19no 7 pp 678ndash683 2006

[5] D Perret-Gallix ldquoSimulation and event generation in high-energy physicsrdquoComputer Physics Communications vol 147 no1-2 pp 488ndash493 2002

[6] R Baishya and J K Sarma ldquoSemi numerical solution of non-singlet Dokshitzer-Gribov-Lipatov-Altarelli- Parisi evolutionequation up to next-to-next-to-leading order at small xrdquo Euro-pean Physical Journal C vol 60 no 4 pp 585ndash591 2009

[7] I I Bucur I Fagarasan C Popescu G Culea and A E SusuldquoDelay optimum and area optimal mapping of k-lut based fpgacircuitsrdquo Journal of Control Engineering andApplied Informaticsvol 11 no 1 pp 43ndash48 2009

[8] R R de Austri R Trotta and L Roszkowski ldquoA markov chainmonte carlo analysis of the cmssmrdquo Journal of High EnergyPhysics vol 2006 no 05 2006

[9] C Serbanescu ldquoNoncommutativeMarkov processes as stochas-tic equationsrsquo solutionsrdquo BulletinMathematique de la Societe desSciencesMathematiques de Roumanie vol 41 no 3 pp 219ndash2281998

[10] C Serbanescu ldquoStochastic differential equations and unitaryprocessesrdquo Bulletin Mathematique de la Societe des SciencesMathematiques de Roumanie vol 41 no 4 pp 311ndash322 1998

[11] O Behnke K Kroninger G Schott and T H Schorner-Sadenius Data Analysis in High Energy Physics A PracticalGuide to Statistical Methods Wiley-VCH Berlin Germany2013

[12] V Blobel ldquoAn unfolding method for high energy physicsexperimentsrdquo in Proceedings of the Conference on Advanced

Advances in High Energy Physics 13

Statistical Techniques in Particle Physics Durham UK March2002 DESY 02-078 (June 2002)

[13] P C Hansen Rank-Deficient and Discrete Ill-Posed ProblemsSIAM Philadelphia Pa USA 1998

[14] P C Hansen Discrete Inverse Problems Insight and Algorithmsvol 7 SIAM Philadelphia Pa USA 2010

[15] A Hocker and V Kartvelishvili ldquoSVD approach to data unfold-ingrdquo Nuclear Instruments and Methods in Physics Research Avol 372 no 3 pp 469ndash481 1996

[16] I Aziz and Siraj-ul-Islam ldquoNew algorithms for the numericalsolution of nonlinear Fredholm and Volterra integral equationsusing Haar waveletsrdquo Journal of Computational and AppliedMathematics vol 239 pp 333ndash345 2013

[17] A Sawatzky C Brune JMuller andM Burger ldquoTotal variationprocessing of images with poisson statisticsrdquo in Proceedingsof the 13th International Conference on Computer Analysis ofImages and Patterns (CAIP rsquo09) pp 533ndash540 Springer BerlinGermany 2009

[18] A Edelman and N R Rao ldquoRandom matrix theoryrdquo ActaNumerica vol 14 no 1 pp 233ndash297 2005

[19] K Cranmer ldquoKernel estimation in high-energy physicsrdquo Com-puter Physics Communications vol 136 no 3 pp 198ndash207 2001

[20] I S Abramson ldquoOn bandwidth variation in kernel estimatesmdashasquare root lawrdquoThe Annals of Statistics vol 10 no 4 pp 1217ndash1223 1982

[21] X Li JWDemmel DH Bailey et al ldquoDesign implementationand testing of extended and mixed precision BLASrdquo ACMTransactions on Mathematical Software vol 28 no 2 pp 152ndash205 2002

[22] J J Dongarra and J Langou ldquoThe problem with the linpackbenchmark 10 matrix generatorrdquo International Journal of HighPerformance Computing Applications vol 23 no 1 pp 5ndash132009

[23] L Lundberg and H Lennerstad ldquoAn Optimal lower boundon the maximum speedup in multiprocessors with clustersrdquoin Proceedings of the IEEE 1st International Conference onAlgorithms and Architectures for Parallel Processing (ICAPP rsquo95)vol 2 pp 640ndash649 April 1995

[24] N J Gunther ldquoA note on parallel algorithmicspeedup boundsrdquoTechnical Report on Distributed Parallel and Cluster Comput-ing In presshttparxivorgabs11044078

[25] B C Tak B Urgaonkar and A Sivasubramaniam ldquoTo move ornot tomove the economics of cloud computingrdquo in Proceedingsof the 3rdUSENIXConference onHot Topics inCloudComputing(HotCloud rsquo11) p 5 USENIX Association Berkeley CA USA2011

[26] L Zhang C Wu Z Li C Guo M Chen and F Lau ldquoMovingbig data to the cloud an online cost-minimizing approachrdquoIEEE Journal on Selected Areas in Communications vol 31 no12 pp 2710ndash2721 2013

[27] J Dittrich and J A Quiane-Ruiz ldquoEfficient big data processingin hadoop mapreducerdquo Proceedings of the VLDB Endowmentvol 5 no 12 pp 2014ndash2015 2012

[28] D Suciu ldquoBig data begets big database theoryrdquo in Proceedings ofthe 29th British National conference on Big Data (BNCOD rsquo13)pp 1ndash5 Springer Berlin Germany 2013

[29] B Hindman A Konwinski M Zaharia et al ldquoA platform forfine-grained resource sharing in the data centerrdquo in Proceedingsof the 8th USENIX Conference on Networked Systems Designand Implementation (NSDI rsquo11) p 22 USENIX AssociationBerkeley CA USA 2011

[30] C Dobre and F Xhafa ldquoParallel programming paradigms andframeworks in big data erardquo International Journal of ParallelProgramming 2013

[31] Y Demchenko C de Laat AWibisono P Grosso and Z ZhaoldquoAddressing big data challenges for scientific data infrastruc-turerdquo in Proceedings of the IEEE 4th International Conferenceon CloudComputing Technology and Science (CLOUDCOM rsquo12)pp 614ndash617 IEEE Computer Society Washington DC USA2012

[32] T White Hadoop The Definitive Guide OrsquoReilly Media 2012[33] J Celaya and U Arronategui ldquoA task routing approach to large-

scale schedulingrdquo Future Generation Computer Systems vol 29no 5 pp 1097ndash1111 2013

[34] N Bessis S Sotiriadis F Xhafa F Pop and V Cristea ldquoMeta-scheduling issues in interoperable hpcs grids and cloudsrdquoInternational Journal of Web and Grid Services vol 8 no 2 pp153ndash172 2012

[35] Y Xu A Suarez and M Zhao ldquoIbis interposed big-data ioschedulerrdquo in Proceedings of the 22nd International Sympo-sium on High-performance Parallel and Distributed Computing(HPDC rsquo13) pp 109ndash110 ACM New York NY USA 2013

[36] S Sotiriadis N Bessis and N Antonopoulos ldquoTowards inter-cloud schedulers a survey of meta-scheduling approachesrdquo inProceedings of the 6th International Conference on P2P ParallelGrid Cloud and Internet Computing (PGCIC rsquo11) pp 59ndash66Barcelona Spain October 2011

[37] J Chen Y Chen X Du et al ldquoBig data challenge a datamanagement perspectiverdquo Frontiers in Computer Science vol 7no 2 pp 157ndash164 2013

[38] V Gopalkrishnan D Steier H Lewis and J Guszcza ldquoBigdata big business bridging the gaprdquo in Proceedings of the 1stInternationalWorkshop on Big Data Streams andHeterogeneousSource Mining Algorithms Systems Programming Models andApplications (BigMine rsquo12) pp 7ndash11 ACM New York NY USA2012

[39] R van den Bossche K Vanmechelen and J BroeckhoveldquoOnline coste fficient scheduling of deadline-constrained work-loads on hybrid cloudsrdquo Future Generation Computer Systemsvol 29 no 4 pp 973ndash985 2013

[40] N Bessis S Sotiriadis F Pop and V Cristea ldquoUsing anovel messageexchanging optimization (meo) model to reduceenergy consumption in distributed systemsrdquo Simulation Mod-elling Practice and Theory vol 39 pp 104ndash112 2013

[41] G V Iordache M S Boboila F Pop C Stratan and VCristea ldquoA decentralized strategy for genetic scheduling inheterogeneous environmentsrdquo Multiagent and Grid Systemsvol 3 no 4 pp 355ndash367 2007

[42] F Zhang J Cao K Li S U Khan and K Hwang ldquoMulti-objective scheduling of many tasks in cloud platformsrdquo FutureGeneration Computer Systems 2013

[43] K Elmeleegy ldquoPiranha optimizing short jobs in hadooprdquoProceedings of the VLDB Endowment vol 6 no 11 pp 985ndash9962013

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

High Energy PhysicsAdvances in

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

FluidsJournal of

Atomic and Molecular Physics

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in Condensed Matter Physics

OpticsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

AstronomyAdvances in

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Superconductivity

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Statistical MechanicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

GravityJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

AstrophysicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Physics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Solid State PhysicsJournal of

 Computational  Methods in Physics

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Soft MatterJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

AerodynamicsJournal of

Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

PhotonicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Biophysics

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ThermodynamicsJournal of

Advances in High Energy Physics 13

Statistical Techniques in Particle Physics Durham UK March2002 DESY 02-078 (June 2002)

[13] P C Hansen Rank-Deficient and Discrete Ill-Posed ProblemsSIAM Philadelphia Pa USA 1998

[14] P C Hansen Discrete Inverse Problems Insight and Algorithmsvol 7 SIAM Philadelphia Pa USA 2010

[15] A Hocker and V Kartvelishvili ldquoSVD approach to data unfold-ingrdquo Nuclear Instruments and Methods in Physics Research Avol 372 no 3 pp 469ndash481 1996

[16] I Aziz and Siraj-ul-Islam ldquoNew algorithms for the numericalsolution of nonlinear Fredholm and Volterra integral equationsusing Haar waveletsrdquo Journal of Computational and AppliedMathematics vol 239 pp 333ndash345 2013

[17] A Sawatzky C Brune JMuller andM Burger ldquoTotal variationprocessing of images with poisson statisticsrdquo in Proceedingsof the 13th International Conference on Computer Analysis ofImages and Patterns (CAIP rsquo09) pp 533ndash540 Springer BerlinGermany 2009

[18] A Edelman and N R Rao ldquoRandom matrix theoryrdquo ActaNumerica vol 14 no 1 pp 233ndash297 2005

[19] K Cranmer ldquoKernel estimation in high-energy physicsrdquo Com-puter Physics Communications vol 136 no 3 pp 198ndash207 2001

[20] I S Abramson ldquoOn bandwidth variation in kernel estimatesmdashasquare root lawrdquoThe Annals of Statistics vol 10 no 4 pp 1217ndash1223 1982

[21] X Li JWDemmel DH Bailey et al ldquoDesign implementationand testing of extended and mixed precision BLASrdquo ACMTransactions on Mathematical Software vol 28 no 2 pp 152ndash205 2002

[22] J J Dongarra and J Langou ldquoThe problem with the linpackbenchmark 10 matrix generatorrdquo International Journal of HighPerformance Computing Applications vol 23 no 1 pp 5ndash132009

[23] L Lundberg and H Lennerstad ldquoAn Optimal lower boundon the maximum speedup in multiprocessors with clustersrdquoin Proceedings of the IEEE 1st International Conference onAlgorithms and Architectures for Parallel Processing (ICAPP rsquo95)vol 2 pp 640ndash649 April 1995

[24] N. J. Gunther, "A note on parallel algorithmic speedup bounds," Technical Report on Distributed, Parallel, and Cluster Computing, in press, http://arxiv.org/abs/1104.4078.

[25] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam, "To move or not to move: the economics of cloud computing," in Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '11), p. 5, USENIX Association, Berkeley, CA, USA, 2011.

[26] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. Lau, "Moving big data to the cloud: an online cost-minimizing approach," IEEE Journal on Selected Areas in Communications, vol. 31, no. 12, pp. 2710–2721, 2013.

[27] J. Dittrich and J. A. Quiané-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 2014–2015, 2012.

[28] D. Suciu, "Big data begets big database theory," in Proceedings of the 29th British National Conference on Big Data (BNCOD '13), pp. 1–5, Springer, Berlin, Germany, 2013.

[29] B. Hindman, A. Konwinski, M. Zaharia, et al., "A platform for fine-grained resource sharing in the data center," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11), p. 22, USENIX Association, Berkeley, CA, USA, 2011.

[30] C. Dobre and F. Xhafa, "Parallel programming paradigms and frameworks in big data era," International Journal of Parallel Programming, 2013.

[31] Y. Demchenko, C. de Laat, A. Wibisono, P. Grosso, and Z. Zhao, "Addressing big data challenges for scientific data infrastructure," in Proceedings of the IEEE 4th International Conference on Cloud Computing Technology and Science (CLOUDCOM '12), pp. 614–617, IEEE Computer Society, Washington, DC, USA, 2012.

[32] T. White, Hadoop: The Definitive Guide, O'Reilly Media, 2012.

[33] J. Celaya and U. Arronategui, "A task routing approach to large-scale scheduling," Future Generation Computer Systems, vol. 29, no. 5, pp. 1097–1111, 2013.

[34] N. Bessis, S. Sotiriadis, F. Xhafa, F. Pop, and V. Cristea, "Meta-scheduling issues in interoperable HPCs, grids and clouds," International Journal of Web and Grid Services, vol. 8, no. 2, pp. 153–172, 2012.

[35] Y. Xu, A. Suarez, and M. Zhao, "IBIS: interposed big-data I/O scheduler," in Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '13), pp. 109–110, ACM, New York, NY, USA, 2013.

[36] S. Sotiriadis, N. Bessis, and N. Antonopoulos, "Towards inter-cloud schedulers: a survey of meta-scheduling approaches," in Proceedings of the 6th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC '11), pp. 59–66, Barcelona, Spain, October 2011.

[37] J. Chen, Y. Chen, X. Du, et al., "Big data challenge: a data management perspective," Frontiers in Computer Science, vol. 7, no. 2, pp. 157–164, 2013.

[38] V. Gopalkrishnan, D. Steier, H. Lewis, and J. Guszcza, "Big data, big business: bridging the gap," in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7–11, ACM, New York, NY, USA, 2012.

[39] R. van den Bossche, K. Vanmechelen, and J. Broeckhove, "Online cost-efficient scheduling of deadline-constrained workloads on hybrid clouds," Future Generation Computer Systems, vol. 29, no. 4, pp. 973–985, 2013.

[40] N. Bessis, S. Sotiriadis, F. Pop, and V. Cristea, "Using a novel message-exchanging optimization (MEO) model to reduce energy consumption in distributed systems," Simulation Modelling Practice and Theory, vol. 39, pp. 104–112, 2013.

[41] G. V. Iordache, M. S. Boboila, F. Pop, C. Stratan, and V. Cristea, "A decentralized strategy for genetic scheduling in heterogeneous environments," Multiagent and Grid Systems, vol. 3, no. 4, pp. 355–367, 2007.

[42] F. Zhang, J. Cao, K. Li, S. U. Khan, and K. Hwang, "Multi-objective scheduling of many tasks in cloud platforms," Future Generation Computer Systems, 2013.

[43] K. Elmeleegy, "Piranha: optimizing short jobs in Hadoop," Proceedings of the VLDB Endowment, vol. 6, no. 11, pp. 985–996, 2013.
