
Modeling and Simulation Fundamentals: Theoretical Underpinnings and Practical Domains, Edited by John A. Sokolowski and Catherine M. Banks. Copyright © 2010 John Wiley & Sons, Inc.

5

MONTE CARLO SIMULATION

John A. Sokolowski

When one hears the name Monte Carlo, one often thinks of the gambling locale in the country of Monaco. It is the home of the famous Le Grand Casino as well as many other gambling resorts and Formula One racing. This chapter is not about gambling or racing. It is about a concept that underlies gambling, namely probability, hence the method’s association with the well-known gambling region. The scientific study of probability concerns itself with the occurrence of random events and the characterization of those random happenings. Gambling casinos rely on probability to ensure that, over the long run, they are profitable. For this to happen, the odds of the casino winning have to be in its favor. The theory of probability provides a mathematical way to set the rules of each game so that this is the case. As a simulation technique, Monte Carlo simulation relies very heavily on probability.

Monte Carlo simulation, also known as the Monte Carlo method, originated in the 1940s at Los Alamos National Laboratory. Physicists Stanislaw Ulam, Enrico Fermi, John von Neumann, and Nicholas Metropolis had to perform repeated simulations of their atomic physics models to understand how these models would behave given the large number of uncertain input variable values. As random samples of the input variables were chosen for each simulation run, a statistical description of the model output emerged that provided evidence as to how the real-world system would behave. It is this concept of repeated random samples of model input variables over many simulation runs that defines Monte Carlo simulation. Essentially, we are creating an artificial world (model) that is meant to closely resemble the real world in all relevant aspects.

Monte Carlo simulation is often superior to a deterministic simulation of a system when that system has input variables that are random. Deterministic simulations are referred to as what-if simulations. In these simulations, a single value is chosen for each input random variable (a particular what-if scenario) based on a best guess by the modeler. The simulation is then run and the output is observed. This output is a single value or a single set of values based on the chosen input. But because the input variables are random variables, they can take on any number of values defined by their probability distributions. So to have a sense of how the system would respond over the complete range of input values, more than one set of inputs must be evaluated. Monte Carlo simulation randomly samples values from each input variable distribution and uses that sample to calculate the model’s output. This process is repeated many times until the modeler obtains a sense of how the output varies given the random input values. When the simulation contains input random variables, Monte Carlo simulation will therefore yield a result that is likely to be more representative of the true behavior of the system. The next section formally defines Monte Carlo simulation and provides examples of its use.

THE MONTE CARLO METHOD

When setting up a Monte Carlo simulation or employing the Monte Carlo method, one follows a four-step process. These four steps are:

Step 1 Define a distribution of possible inputs for each input random variable.

Step 2 Generate inputs randomly from those distributions.

Step 3 Perform a deterministic computation using that set of inputs.

Step 4 Aggregate the results of the individual computations into the final result.

While these steps may seem overly simplistic, they are necessary to capture the essence of how Monte Carlo simulations are set up and run.
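The four steps can be sketched as a short generic loop. The sketch below uses Python's standard library rather than any tool named in this chapter, and every name in it (run_monte_carlo, samplers, the toy model x * z) is a hypothetical choice made only for illustration.

```python
import random
import statistics


def run_monte_carlo(samplers, model, n_runs, seed=0):
    """Run the four-step loop; every name here is illustrative."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_runs):
        # Step 2: draw one value per input from its Step 1 distribution.
        inputs = {name: draw(rng) for name, draw in samplers.items()}
        # Step 3: deterministic computation on this set of inputs.
        outputs.append(model(**inputs))
    # Step 4: aggregate the individual results.
    return {
        "mean": statistics.fmean(outputs),
        "stdev": statistics.stdev(outputs),
        "outputs": outputs,
    }


# Step 1: one distribution per input random variable (toy choices).
samplers = {
    "x": lambda rng: rng.uniform(0.0, 1.0),
    "z": lambda rng: rng.gauss(10.0, 2.0),
}
result = run_monte_carlo(samplers, lambda x, z: x * z, n_runs=20_000)
print(f"mean = {result['mean']:.2f}, stdev = {result['stdev']:.2f}")
```

Because the two toy inputs are independent with means 0.5 and 10, the aggregated mean should settle near 5 as the number of runs grows.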

This four-step method requires having the necessary components in place to achieve the final result. These components may include:

(1) probability density functions (pdfs) for each random variable

(2) a random number generator

(3) a sampling rule: a prescription for sampling from the pdfs

(4) scoring: a method for combining the results of each run into the final result

(5) error estimation: an estimate of the statistical error of the simulation output as a function of the number of simulation runs and other parameters.

Step 1 requires the modeler to match a statistical distribution to each input random variable. If this distribution is known or sufficient data exist to derive it, then this step is straightforward. However, if the behavior of an input variable is not well understood, then the modeler might have to estimate this distribution based on empirical observation or subject matter expertise.* The modeler may also use a uniform distribution if he or she is lacking any specific knowledge of the variable’s characteristics. When additional information is gathered to define the variable, then the uniform distribution can be replaced.

Step 2 requires randomly sampling each input variable’s distribution many times to develop a vector of inputs for each variable. Suppose we have two input random variables \(X\) and \(Z\). After sampling \(n\) times, we have \(X = (x_1, x_2, \ldots, x_n)\) and \(Z = (z_1, z_2, \ldots, z_n)\). Elements from these vectors are then sequentially chosen as inputs to the function defining the model. The question of how large \(n\) should be is an important one because the number of samples determines the precision of the output statistics. As the number of samples increases, the standard deviation of an output statistic such as the sample mean decreases. In other words, there is less variance in the estimate with larger sample sizes. However, the improvement is not linear in the number of samples. The standard deviation of the estimate shrinks in proportion to \(1/\sqrt{n}\), so quadrupling the number of samples only halves it, and there is a point beyond which more sampling provides little improvement. Determining the number of trials needed for a desired accuracy is addressed below.
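The \(1/\sqrt{n}\) behavior can be checked empirically. The following Python sketch is illustrative only: the function name, the choice of a uniform(0, 1) input, and the trial counts are assumptions. It repeats a small simulation many times and measures how much the sample mean varies for two sample sizes.

```python
import random
import statistics


def std_error_of_mean(n, trials=300, seed=1):
    """Empirical standard deviation of the mean of n uniform(0, 1) draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.random() for _ in range(n)])
        for _ in range(trials)
    ]
    return statistics.stdev(means)


se_100 = std_error_of_mean(100)
se_400 = std_error_of_mean(400)
# Quadrupling n should roughly halve the spread of the sample mean.
print(f"n=100: {se_100:.4f}  n=400: {se_400:.4f}  ratio: {se_100 / se_400:.2f}")
```

The printed ratio should be close to 2, which is the square root of the fourfold increase in sample size.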

Step 3 is straightforward. It involves sequentially choosing elements from the randomly generated input vectors and computing the value of the output variable or variables until all n outputs are generated for each output variable.

Step 4 involves aggregating all these outputs. Suppose we have one output variable \(Y\). Then we would have as a result of step 4 an output vector \(Y = (y_1, y_2, \ldots, y_n)\). We can then perform a variety of statistical tests on \(Y\) to analyze this output. These tests will be described later in the chapter.

The following is a simple example of how this method works.

* When modeling systems, especially those in the social sciences, subject matter experts may be the only source of data available to characterize the behavior of a variable. This is true when no scientific data or data collection is available. Subject matter expertise may also be called upon as a method to validate the output of the simulation. See Chapter 10 for a further discussion on validation techniques.


Example 1: Determining the Value of π

Recall that the value of π is the ratio of a circle’s circumference to its diameter. To calculate this value, we can set up a Monte Carlo simulation that employs a geometric representation of the circle.

1. To start, draw a unit circle arc, that is, an arc of radius one circumscribed by a square as shown in Figure 5.1.

2. Then, randomly choose an x and y coordinate inside the square, and place a dot at that location.

3. Repeat step 2 a given number of times. See Figure 5.2.

4. Count the total number of dots inside the square and the number of dots inside the quarter circle. With a large number of dots generated, the ratio of these counts will approximate the ratio of the area of the quarter circle to the area of the square. From mathematics, this result can be represented as

\[
\frac{\#\text{ of dots inside circle}}{\#\text{ of dots inside square}} \approx \frac{\tfrac{1}{4}\pi r^2}{r^2} = \frac{1}{4}\pi ,
\]

so π is approximately four times the observed ratio.

Step 1 of our example represents step 1 of the above method, that is, determining the domain of possible inputs. Steps 2 and 3 correspond to method step 2, and step 4 encompasses steps 3 and 4 of the method.

Our example relied on several components mentioned above. A random number generator was necessary to select the coordinates for each dot. The coordinates were selected from a uniform distribution that provided the probability density function. A sampling rule existed that used the random numbers to select values from the uniform distribution. The scoring method was given by the formula in step 4 above. Finally, error estimation can be performed by comparing the computed value of π to an authoritative source for its value.

Figure 5.1 Unit circle arc for calculation of π (both axes run from 0 to 1).

This simulation can be set up using a spreadsheet and two built-in functions: rand(), which generates uniform random numbers between 0 and 1, and countif(range, criteria), which counts the number of cells that meet the specified criteria. The author generated 500 uniform random numbers between zero and one for the x coordinate of each point and the same for the y coordinate. These numbers were paired up and plotted. Precisely 345 of the 500 points fell inside the circle, giving a simulated value for π of 4 × 345/500 = 2.76. This method produced an error of 12.1 percent. Using a larger set of generated dots can help reduce the error to an acceptable range, at the cost of extra computation.
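The same experiment can be sketched outside a spreadsheet. The Python version below is a minimal illustration, not the author's spreadsheet: the function name estimate_pi and the seeds are assumptions, and a different random stream will give different counts than the 500-point run described above.

```python
import math
import random


def estimate_pi(n_points, seed=None):
    """Estimate pi from random dots in the unit square (Example 1)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()      # one dot in the unit square
        if x * x + y * y <= 1.0:               # dot falls inside the quarter circle
            inside += 1
    return 4.0 * inside / n_points             # scoring rule from step 4


for n in (500, 50_000):
    estimate = estimate_pi(n, seed=42)
    error = abs(estimate - math.pi) / math.pi
    print(f"n={n:>6}: pi ~ {estimate:.4f}, relative error {error:.2%}")
```

Running it with increasing n shows the trade-off mentioned above: the error tends to fall, but only slowly relative to the extra computation.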

Figure 5.2 Random dots placed inside the square (both axes run from 0 to 1).

From example 1, you can see the necessary components that are central to Monte Carlo simulations. These components are one or more input random variables, one or more output variables, and a function that computes the outputs from the inputs. This configuration is shown in Figure 5.3.

In this figure, notice that there are three input random variables \(x_1\), \(x_2\), and \(x_3\), all with different distributions. There are two output variables, \(y_1\) and \(y_2\), that have resulting distributions created by the repeated sampling of the input and feeding those samples into the function \(f(x)\). The next example builds on this model to illustrate how a what-if scenario outcome can differ from one produced via a Monte Carlo approach.

Example 2: Computing Product Earnings

Let us suppose we want to predict a product’s earnings in future years given sales data accumulated over the last 5 years. A product’s earnings are a function of unit price, unit sales, variable costs, and fixed costs. Specifically, earnings = (unit price) × (unit sales) − (variable costs + fixed costs). We will assume that the variables used to calculate earnings are all independent of one another. The last 5 years of data for these variables are shown in Table 5.1.

From these data, one can develop a probability distribution to represent each of the input variables. An appropriate distribution representation would be a triangular distribution, which is typically used when only a small amount of data is available to characterize the input variables. These distributions may be constructed from Table 5.2.

Table 5.1 Product earnings by year

                  Year 1    Year 2    Year 3    Year 4    Year 5
Unit price            50        52        55        57        65
Unit sales          2000      2200      2700      2500      2800
Variable costs    50,000    55,000    56,000    57,000    58,000
Fixed costs       10,000    12,000    15,000    16,000    17,000
Earnings          40,000    47,400    77,500    69,500   107,000

Figure 5.3 Basic Monte Carlo model (inputs x1, x2, x3; model f(x); outputs y1, y2).

Constructing a triangular distribution requires three values: a minimum, a maximum, and a most likely value or mode. We can represent these values by \(a\), \(b\), and \(c\), respectively. Then, the probability density function for this distribution is defined as follows:

\[
f(x \mid a, b, c) =
\begin{cases}
\dfrac{2(x - a)}{(b - a)(c - a)} & \text{for } a \le x \le c \\[2ex]
\dfrac{2(b - x)}{(b - a)(b - c)} & \text{for } c \le x \le b \\[2ex]
0 & \text{otherwise.}
\end{cases}
\tag{5.1}
\]

Figure 5.4 illustrates the triangular distribution for unit price as computed from Equation (5.1). The other variables have similar triangular representations. The max values were chosen as best guess estimates of the highest values these parameters will reach. This is often done using subject matter experts intuitively familiar with how these variables are likely to behave. The min values were the minimum numbers found in Table 5.1 for each variable. The most likely values were calculated by averaging the 5 years of data for each factor.
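Equation (5.1) translates directly into a small function. The Python sketch below is illustrative (the name triangular_pdf is an assumption) and assumes a < c < b so that neither denominator is zero. It evaluates the unit price density using the values 50, 70, and 55 from Table 5.2.

```python
def triangular_pdf(x, a, b, c):
    """Equation (5.1): a = minimum, b = maximum, c = most likely value.

    Assumes a < c < b, so neither denominator is zero.
    """
    if a <= x <= c:
        return 2.0 * (x - a) / ((b - a) * (c - a))
    if c < x <= b:
        return 2.0 * (b - x) / ((b - a) * (b - c))
    return 0.0


# Unit price: min 50, max 70, most likely 55.
peak = triangular_pdf(55, 50, 70, 55)
print(peak)                                # 0.1, the peak height 2 / (b - a)

# Midpoint-rule check that the density integrates to 1 over [50, 70].
step = 0.01
area = sum(triangular_pdf(50 + (k + 0.5) * step, 50, 70, 55) * step
           for k in range(2000))
print(round(area, 6))
```

The peak value 2/(b − a) = 0.1 agrees with the top of the vertical axis in Figure 5.4.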

Table 5.2 Triangular distribution data

                     Min    Most Likely        Max
Unit price            50             55         70
Unit sales          2000           2440       3000
Variable costs    50,000         55,200     65,000
Fixed costs       10,000         14,000     20,000
Earnings          40,000         65,000    125,000

Figure 5.4 Unit price triangular distribution (horizontal axis: unit price, from 50 to 70; vertical axis from 0 to 0.1).


Using these triangular distributions, 10,000 samples were generated and used to compute predicted future earnings. The following MATLAB® code was used to generate the samples.

MATLAB Program

    % Triangular samples for each input; bounds and modes come from Table 5.2.
    % h is shared by all four inputs; draw a separate h per input if the
    % inputs must be strictly independent.
    h = sqrt(rand(1,10000));
    unit_price     = (70 - 50) * h .* rand(1,10000) + 55 - (55 - 50) * h;
    variable_costs = (65000 - 50000) * h .* rand(1,10000) + 55200 - (55200 - 50000) * h;
    fixed_costs    = (20000 - 10000) * h .* rand(1,10000) + 14000 - (14000 - 10000) * h;
    unit_sales     = (3000 - 2000) * h .* rand(1,10000) + 2440 - (2440 - 2000) * h;
    earnings = zeros(1,10000);   % preallocate the output vector
    for i = 1:10000
        earnings(i) = unit_price(i) * unit_sales(i) - (variable_costs(i) + fixed_costs(i));
    end

The summary statistics for the output variable earnings are shown in Table 5.3 .

A plot of the computed earnings is shown in Figure 5.5. This plot is a histogram of the 10,000 computed values and approximates the shape of the probability density function of the output distribution.

From the Monte Carlo simulation results, one can see a difference between the most likely earnings value (65,000) from Table 5.2 and the mean earnings value (73,206) of Table 5.3. In other words, the simple what-if analysis of a deterministic computation of earnings differs from the Monte Carlo computation that takes into account many combinations of input variable values that could occur in predicting future earnings. Instead of a single point analysis, the modeler has the results of 10,000 points on which to base his or her estimate of future earnings. These simulation runs take into account 10,000 different combinations of input variables, which provide a much broader picture of the possible values that earnings could take on given the possible variability in the real-world data. One can also compare the minimum and maximum expected earnings from both the single-point estimate and the Monte Carlo estimate to get an understanding of the possible extreme values that may result.
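For readers without MATLAB, the experiment can be approximated with Python's standard library. This is a sketch under stated assumptions, not the author's program: it draws each input independently with random.triangular (whose arguments are low, high, mode, in that order), and the names TRIANGLES and simulate_earnings and the seed are hypothetical. Its statistics will be close to, but not identical to, those in Table 5.3.

```python
import random
import statistics

# (min, most likely, max) for each input, taken from Table 5.2.
TRIANGLES = {
    "unit_price": (50, 55, 70),
    "unit_sales": (2000, 2440, 3000),
    "variable_costs": (50_000, 55_200, 65_000),
    "fixed_costs": (10_000, 14_000, 20_000),
}


def simulate_earnings(n_runs=10_000, seed=2010):
    """Return n_runs simulated earnings values for the Example 2 model."""
    rng = random.Random(seed)
    earnings = []
    for _ in range(n_runs):
        # random.triangular takes (low, high, mode), hence the reordering.
        draw = {name: rng.triangular(low, high, mode)
                for name, (low, mode, high) in TRIANGLES.items()}
        earnings.append(draw["unit_price"] * draw["unit_sales"]
                        - (draw["variable_costs"] + draw["fixed_costs"]))
    return earnings


earnings = simulate_earnings()
print("mean  ", round(statistics.fmean(earnings)))
print("median", round(statistics.median(earnings)))
print("stdev ", round(statistics.stdev(earnings)))
print("min   ", round(min(earnings)), " max", round(max(earnings)))
```

Comparing the printed mean with the single-point value of 65,000 reproduces the gap discussed above between the what-if estimate and the Monte Carlo estimate.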

Table 5.3 Earnings summary statistics

Parameter                   Value
Mean                       73,206
Median                     71,215
Standard deviation         16,523
Variance              273,009,529
Min                        21,155
Max                       137,930

How good an estimate of the true population mean is the Monte Carlo computed mean earnings value? One way to assess this is to compute a confidence interval for the population mean based on sample data. This computation is based on the important statistical concept of the central limit theorem. This theorem is expressed as follows.

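Theorem 1 (Central Limit Theorem) Suppose \(Y_1, \ldots, Y_n\) are independent and identically distributed (IID) samples and \(E[Y_i^2] < \infty\). Then

\[
\frac{\hat{\theta}_n - \theta}{\sigma / \sqrt{n}} \Rightarrow N(0, 1) \quad \text{as } n \to \infty,
\tag{5.2}
\]

where \(\hat{\theta}_n := \sum_{i=1}^{n} Y_i / n\), \(\theta := E[Y_i]\), and \(\sigma^2 := \mathrm{Var}(Y_i)\).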

Note that nothing is assumed about the distribution of the \(Y_i\)’s other than that their variance is finite. So from Equation (5.2), if \(n\) is sufficiently large, then we can compute a confidence interval for \(\theta\) based on a standard normal distribution. The confidence interval is computed as follows.

Let \(z_{1-\alpha/2}\) be the \((1 - \alpha/2)\) percentile point of the \(N(0, 1)\) distribution such that

\[
P\left(-z_{1-\alpha/2} \le Z \le z_{1-\alpha/2}\right) = 1 - \alpha,
\]

where \(Z \sim N(0, 1)\). Now take the simulated IID samples \(Y_i\) and construct a \(100(1 - \alpha)\%\) confidence interval for \(\theta = E[Y]\). Essentially, we are constructing a lower and upper bound \(L(Y)\) and \(U(Y)\) such that

\[
P\left(L(Y) \le \theta \le U(Y)\right) = 1 - \alpha.
\]

Figure 5.5 Computed earnings distribution plot (horizontal axis from 2 to 16, in units of 10^4; vertical axis from 0 to 400).

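The central limit theorem tells us that \(\sqrt{n}\,(\hat{\theta}_n - \theta)/\sigma\) is approximately a standard normal distribution for large \(n\), so we have

\[
\begin{aligned}
& P\left(-z_{1-\alpha/2} \le \frac{\sqrt{n}\,(\hat{\theta}_n - \theta)}{\sigma} \le z_{1-\alpha/2}\right) \approx 1 - \alpha \\
\Rightarrow\; & P\left(-z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \le \hat{\theta}_n - \theta \le z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha \\
\Rightarrow\; & P\left(\hat{\theta}_n - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \le \theta \le \hat{\theta}_n + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha.
\end{aligned}
\]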

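So the approximate \(100(1 - \alpha)\%\) confidence interval for \(\theta\) is given by Equation (5.3):

\[
\left[L(Y),\, U(Y)\right] = \left[\hat{\theta}_n - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\;\; \hat{\theta}_n + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right].
\tag{5.3}
\]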

One other issue must be addressed before computing our confidence interval, that is, \(\sigma^2\) is usually not known. However, it can be estimated by the following formula:

\[
\hat{\sigma}_n^2 = \frac{\sum_{i=1}^{n} \left(Y_i - \hat{\theta}_n\right)^2}{n - 1}.
\]

So replacing \(\sigma\) with \(\hat{\sigma}_n\), we arrive at

\[
\left[L(Y),\, U(Y)\right] = \left[\hat{\theta}_n - z_{1-\alpha/2}\frac{\hat{\sigma}_n}{\sqrt{n}},\;\; \hat{\theta}_n + z_{1-\alpha/2}\frac{\hat{\sigma}_n}{\sqrt{n}}\right]
\tag{5.4}
\]

as the final equation for computing the confidence interval.

Equation (5.4) can now be applied to the results of the Monte Carlo simulation in example 2. For \(\alpha = 0.05\), the \(z_{1-\alpha/2}\) value for the standard normal distribution is 1.96. Table 5.3 provides the standard deviation for the 10,000 sample points so the confidence interval for the mean earnings is [72,999, 73,446]. One should interpret this interval as follows: we are 95 percent confident that the interval contains the actual population mean. The narrower this interval is, the more precisely the population mean has been estimated.
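Equation (5.4) is short enough to express as a helper function. The Python sketch below is illustrative: the function name and the test data (10,000 draws from the unit price triangular distribution) are assumptions chosen only to exercise the formula, not results from the chapter.

```python
import math
import random
import statistics


def mean_confidence_interval(samples, alpha=0.05):
    """Equation (5.4): normal-approximation interval for the population mean."""
    n = len(samples)
    theta_hat = statistics.fmean(samples)
    sigma_hat = statistics.stdev(samples)               # uses the n - 1 divisor
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = z * sigma_hat / math.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width


rng = random.Random(7)
samples = [rng.triangular(50, 70, 55) for _ in range(10_000)]   # (low, high, mode)
low, high = mean_confidence_interval(samples)
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}], width {high - low:.3f}")
```

The interval is centered on the sample mean, and its width depends only on the sample standard deviation, the sample size, and the chosen α.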

Even though we have high confidence that the population mean falls in the above interval, that does not necessarily indicate that is what earnings will be. Figure 5.5 shows how widely earnings could vary given a specific set of sales and price conditions. It is because of the Monte Carlo method that we are able to represent and to visualize the possible outcomes.


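The width of the confidence interval is a function of the number of sample points chosen. If one wants to achieve a certain level of accuracy, then one must be able to determine the number of samples necessary to achieve that accuracy. The error between the actual mean and the computed mean can be represented by an absolute error \(E_a = |\hat{\theta}_n - \theta|\). Thus, we want to choose a value for \(n\) such that \(P(E_a \le \varepsilon) = 1 - \alpha\), where \(\varepsilon\) is the largest absolute error we are willing to accept. Recall from above,

\[
P\left(\hat{\theta}_n - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \le \theta \le \hat{\theta}_n + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha.
\]

This implies that

\[
P\left(\left|\hat{\theta}_n - \theta\right| \le z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha.
\]

So in terms of \(E_a\), we have

\[
P\left(E_a \le z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha.
\]

If we want \(P(E_a \le \varepsilon) \approx 1 - \alpha\), then we set \(\varepsilon = z_{1-\alpha/2}\,\sigma/\sqrt{n}\) and solve for \(n\), so we must choose \(n\) such that

\[
n = \frac{\sigma^2 z_{1-\alpha/2}^2}{\varepsilon^2}.
\tag{5.5}
\]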

Just as with the computation of the confidence interval, \(\sigma^2\) is usually not known. One way to solve this problem is to estimate it by doing a pilot simulation. Here, the modeler conducts a small number of runs and uses the results of those runs to estimate \(\sigma^2\). The estimate is then used to compute an \(\hat{n}\). This number of runs is then performed, the output variable’s statistics are gathered, and a confidence interval is computed. If the modeler follows this two-stage procedure, it is likely that \(\hat{n}\) runs will produce the desired level of accuracy. For this method to work, the initial number of runs to estimate \(\sigma^2\) must be sufficiently large (≥ 50). The following pseudocode describes this procedure:

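Two-Stage Procedure for Estimating the Number of Simulation Runs

    /* Do pilot simulation first */
    for i = 1 to p
        generate X_i
    end for
    set \(\hat{\theta} = \sum h(X_i) / p\)
    set \(\hat{\sigma}^2 = \sum \left(h(X_i) - \hat{\theta}\right)^2 / (p - 1)\)
    set \(n = \hat{\sigma}^2 z_{1-\alpha/2}^2 / \varepsilon^2\), rounded up to a whole number

    /* Now do main simulation */
    for i = 1 to n
        generate X_i
    end for
    set \(\hat{\theta}_n = \sum h(X_i) / n\)
    set \(\hat{\sigma}_n^2 = \sum \left(h(X_i) - \hat{\theta}_n\right)^2 / (n - 1)\)
    set \(100(1 - \alpha)\%\) CI \(= \left[\hat{\theta}_n - z_{1-\alpha/2}\,\hat{\sigma}_n/\sqrt{n},\;\; \hat{\theta}_n + z_{1-\alpha/2}\,\hat{\sigma}_n/\sqrt{n}\right]\)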

To illustrate the procedure, we will repeat the Monte Carlo simulation of earnings from example 2. Suppose we want to control the absolute error so that

\[
P(E_a \le 1000) = 1 - \alpha.
\]

Note that this is equivalent to saying that we want the confidence interval to have a width of less than or equal to 2 × 1000 = $2000. For the pilot simulation, we choose \(p = 100\) and \(\alpha = 0.05\). Using the two-stage procedure above produces \(\hat{\theta} = 70{,}586\) and \(\hat{\sigma}^2 = 1.8897 \times 10^8\). Applying Equation (5.5) gives us \(\hat{n} \approx 726\). Using this number for our second stage, we obtain the following: \(\hat{\theta}_n = 73{,}424\), \(\hat{\sigma}_n^2 = 2.6731 \times 10^8\), and a confidence interval of [72,235, 74,613], which is about $2400 wide. Note that the first Monte Carlo simulation using 10,000 samples produces a confidence interval width of $447.

From the example above, one can see that this two-stage procedure provides a method for determining a close approximation for the number of runs needed to achieve a certain absolute error value.
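The two-stage procedure can be sketched in Python as follows. This is an illustration under assumptions rather than the program behind the numbers above: the function names, the seed, and the use of independent random.triangular draws for the earnings model are all hypothetical choices, so the run count and interval it prints will differ from the figures quoted in the text. The first print statement checks Equation (5.5) against the pilot variance quoted in the text.

```python
import math
import random
import statistics

Z = statistics.NormalDist().inv_cdf  # standard normal quantile function


def required_runs(sigma2, epsilon, alpha=0.05):
    """Equation (5.5), rounded up to a whole number of runs."""
    z = Z(1 - alpha / 2)
    return math.ceil(sigma2 * z * z / epsilon ** 2)


def earnings(rng):
    """One run of the Example 2 model with Table 5.2 inputs."""
    price = rng.triangular(50, 70, 55)             # (low, high, mode)
    sales = rng.triangular(2000, 3000, 2440)
    variable = rng.triangular(50_000, 65_000, 55_200)
    fixed = rng.triangular(10_000, 20_000, 14_000)
    return price * sales - (variable + fixed)


def two_stage(h, p, epsilon, alpha=0.05, seed=0):
    """Pilot runs estimate the variance; main runs give the interval."""
    rng = random.Random(seed)
    pilot = [h(rng) for _ in range(p)]             # stage 1: pilot simulation
    n = required_runs(statistics.variance(pilot), epsilon, alpha)
    main = [h(rng) for _ in range(n)]              # stage 2: main simulation
    theta = statistics.fmean(main)
    half = Z(1 - alpha / 2) * statistics.stdev(main) / math.sqrt(n)
    return n, (theta - half, theta + half)


print(required_runs(1.8897e8, 1000))   # pilot variance quoted in the text -> 726
n, (low, high) = two_stage(earnings, p=100, epsilon=1000)
print(n, round(low), round(high), round(high - low))
```

Because the pilot variance is itself an estimate, the resulting interval width is only approximately 2ε, which matches the observation above that the procedure yields a close approximation rather than a guarantee.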

SENSITIVITY ANALYSIS

An important analytic concept based on Monte Carlo simulation is that of sensitivity analysis. For our purposes, we will define sensitivity analysis as the study of how uncertainty in a model’s output can be assigned to the various sources of input uncertainty. As one can see from the discussion above, input and output uncertainties are at the heart of the Monte Carlo simulation. Knowing which input random variables have the most influence on the output random variables is important when trying to analyze a model’s behavior. This section will introduce concepts for performing sensitivity analysis based on Monte Carlo simulation and how sensitivity analysis can be used to adjust the Monte Carlo simulation.

Sensitivity analysis is important for several reasons. It can help uncover model errors and identify important bounds on input variables. This analysis can also help identify research priorities and simplify models. Thus, sensitivity analysis plays a significant role as a tool to assess model validity.

The most common method for conducting sensitivity analysis is based on derivatives. For example, given \(\partial Y_j / \partial X_i\), where \(Y_j\) is an output random variable and \(X_i\) is an input random variable, this partial derivative can be interpreted as the change in \(Y_j\) with respect to \(X_i\), which is consistent with our definition of sensitivity analysis. Derivative-based approaches are very efficient from a computational standpoint; however, they have one serious flaw: they are only valid at the point at which they are computed. This is acceptable for linear systems but would be of little value for systems exhibiting nonlinear behavior. There are, however, other methods that can be applied to all systems.

One simple method involves a visual assessment of an input variable’s effect on an output variable. This method employs a scatter plot where each input variable in the Monte Carlo simulation is individually plotted against the output variable and the resulting pattern is analyzed. The more structured the output pattern, the more sensitive the output variable is to that input variable.

Referring back to example 2, we had four random input variables that contributed to computing the earnings random output variable. If we plot each of the 10,000 randomly generated inputs against the corresponding output using a scatter plot, we get the results shown in Figure 5.6 .

Figure 5.6 Scatter plots of earnings versus each input variable (panels: earnings versus variable costs, earnings versus fixed costs, and earnings versus unit sales × unit price).


The scatter plots show that the variable earnings is more sensitive to the product of unit_price and unit_sales than to the fixed and variable costs, because that plot shows the most structured pattern. From the results of this analysis, one could set the fixed_costs and variable_costs inputs to their mean values and rerun the Monte Carlo simulation using only unit_sales and unit_price as random input variables with little loss in accuracy. The results of this Monte Carlo simulation are shown in Table 5.4.
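The simplification can be sketched by running the model twice, once with all four inputs random and once with the cost inputs held constant. The Python sketch below is illustrative only. It assumes that "mean values" refers to the 5-year averages listed as most likely values in Table 5.2 (55,200 and 14,000), and the function name and seed are hypothetical.

```python
import random
import statistics


def earnings_stats(n_runs, fix_costs, seed=5):
    """Mean and standard deviation of earnings for the full or simplified model."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_runs):
        price = rng.triangular(50, 70, 55)          # (low, high, mode)
        sales = rng.triangular(2000, 3000, 2440)
        if fix_costs:
            # Assumption: costs fixed at the Table 5.2 most likely values.
            variable, fixed = 55_200, 14_000
        else:
            variable = rng.triangular(50_000, 65_000, 55_200)
            fixed = rng.triangular(10_000, 20_000, 14_000)
        values.append(price * sales - (variable + fixed))
    return statistics.fmean(values), statistics.stdev(values)


full_mean, full_sd = earnings_stats(10_000, fix_costs=False)
simple_mean, simple_sd = earnings_stats(10_000, fix_costs=True)
print(f"full model:       mean {full_mean:,.0f}, stdev {full_sd:,.0f}")
print(f"simplified model: mean {simple_mean:,.0f}, stdev {simple_sd:,.0f}")
print(f"relative change in mean: {abs(simple_mean - full_mean) / full_mean:.1%}")
```

The spread of the output barely changes when the costs are frozen, which is the numerical counterpart of the unstructured cost panels in the scatter plots.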

The resultant 95 percent confidence interval for the mean is [75,233, 75,869]. The mean in Table 5.4 is within $2400 of the mean of the full Monte Carlo model from Table 5.3 and is still a better predictor than just using the single-point what-if analysis, which produced an earnings prediction of $65,000. Thus, sensitivity analysis allowed us to reduce the complexity of our model with only about a 3 percent change in results.

While scatter plots provide a good visual means for identifying the relative sensitivity among input variables, other computational methods are available that improve on the derivative-based approach mentioned above. One such approach is the sigma-normalized derivative. This is defined as follows:

\[
S^{\sigma}_{X_i} = \frac{\sigma_{X_i}}{\sigma_Y} \frac{\partial Y}{\partial X_i}.
\tag{5.6}
\]

The derivative is normalized by the input and output standard deviations. The larger the result of this computation, the more sensitive the output is to this input. The sensitivity measure of Equation (5.6) is widely recognized and is recommended for sensitivity analysis by a guideline of the Intergovernmental Panel on Climate Change [1]. When the results of Equation (5.6) are squared and summed across all \(r\) input variables, the following equation holds for a linear model with independent inputs:

\[
\sum_{i=1}^{r} \left(S^{\sigma}_{X_i}\right)^2 = 1.
\]

This will be illustrated by again revisiting example 2, our earnings computation. Table 5.5 provides the sigma-normalized derivatives for the earnings model.
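A minimal Python sketch of this calculation follows. It is illustrative only: as in Table 5.5, it treats the product unit_sales × unit_price as a single input, it draws the inputs independently with random.triangular, and the variable names and seed are assumptions. Because earnings = revenue − variable costs − fixed costs, the partial derivatives are +1, −1, and −1.

```python
import random
import statistics

rng = random.Random(11)
revenue, variable, fixed, earnings = [], [], [], []
for _ in range(10_000):
    r = rng.triangular(50, 70, 55) * rng.triangular(2000, 3000, 2440)
    v = rng.triangular(50_000, 65_000, 55_200)     # (low, high, mode)
    f = rng.triangular(10_000, 20_000, 14_000)
    revenue.append(r)
    variable.append(v)
    fixed.append(f)
    earnings.append(r - (v + f))

sigma_y = statistics.stdev(earnings)
# (samples, dY/dX) for each input of the linear earnings model.
inputs = {
    "variable_costs": (variable, -1.0),
    "fixed_costs": (fixed, -1.0),
    "unit_sales x unit_price": (revenue, 1.0),
}
# Equation (5.6), squared: (sigma_X / sigma_Y * dY/dX) ** 2
s_squared = {name: (deriv * statistics.stdev(xs) / sigma_y) ** 2
             for name, (xs, deriv) in inputs.items()}
for name, value in s_squared.items():
    print(f"{name:<26}{value:.2f}")
print(f"{'sum':<26}{sum(s_squared.values()):.2f}")
```

With finite samples the squared values sum to approximately, not exactly, one, because the sampled inputs are never perfectly uncorrelated.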

Table 5.4 Earnings summary statistics for simplified model

Parameter                   Value
Mean                       75,551
Median                     73,102
Standard deviation         16,219
Variance              263,040,000
Min                        33,223
Max                       135,790

As one can see, the product of unit_sales and unit_price is the most sensitive parameter and bears out the results of the scatter plot analysis.

There are other sensitivity analysis techniques based on the Monte Carlo simulation. The reader is referred to Saltelli et al. for a discussion of the most prominent techniques as well as a comparison of the two techniques presented above [2].

CONCLUSION

This chapter explored the Monte Carlo simulation method for characterizing a model’s behavior in the face of one or more input random variables. This method provides a more representative way to understand the behavior of such models compared with a fixed set of parameters under what-if analysis. Additionally, the concept of a confidence interval was introduced as a measure of the accuracy of the Monte Carlo simulation in relation to the actual population mean of the system under study. A technique for estimating the sample size required to achieve a specified accuracy was also described. The chapter concluded with a discussion of sensitivity analysis based on Monte Carlo techniques and introduced two methods for assessing the contribution of each input random variable to the model’s output.

REFERENCES

[1] IPCC. Good Practice Guidance and Uncertainty Management in National Greenhouse Gas Inventories. 2000. Available at http://www.ipcc-nggip.iges.or.jp/public/gp/gpgaum.htm. Accessed May 2, 2009.

[2] Saltelli A, Ratto M, Andres T, Campolongo F, Cariboni J, Gatelli D, Saisana M, Tarantola S. Global Sensitivity Analysis: The Primer. West Sussex: John Wiley & Sons; 2008.

Table 5.5 Sigma-normalized derivatives

Variable                      \(\left(S^{\sigma}_{X_i}\right)^2\)
variable_costs                0.03
fixed_costs                   0.01
unit_sales × unit_price       0.96
