games, random numbers and introduction to simple statistics

蔡文能, C/C++ Programming

1

Games, Random Numbers, and

Introduction to simple statistics

PRNG

Pseudo Random Number Generator

蔡文能

[email protected]


2

Agenda

• What is a random number?
• How are random numbers generated?

– rand( ) in the C language: Linear Congruential Generator

• Why are they called "pseudo-random"? (The "P" is silent.)
• How can we get "truly random" numbers?
• Applications of random numbers
• Other topics related to random numbers
• Introduction to simple statistics


3

BATNUM game
• http://www.atariarchives.org/basicgames/showpage.php?page=14

• An ancient game for two players

• One pile of matchsticks (or stones)

• Players take turns removing between 1 and maxTake objects

• (take at least 1 and at most maxTake)

• The rules may specify that whoever takes the last object either wins or loses!

• Winning strategy?

Games need random numbers! Why?


4

Bulls and Cows Game

http://5ko.free.fr/en/bk.html
http://en.wikipedia.org/wiki/Bulls_and_cows
http://zh.wikipedia.org/zh-hant/%E7%8C%9C%E6%95%B0%E5%AD%97
http://boardgames.about.com/od/paperpencil/a/bulls_and_cows.htm
http://pyva.net/eng/play/bk.html
http://www.bullscows.com/index.php
http://www.funmin.com/online-games/bulls-and-cows/index.php



5

NIM Game

• http://en.wikipedia.org/wiki/Nim
• Nim is a two-player mathematical game of strategy in which players take turns removing objects from distinct heaps. On each turn, a player must remove at least one object, and may remove any number of objects provided they all come from the same heap.

• The rules may specify that whoever takes the last object either wins or loses!

• Winning strategy?
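One well-known answer to the winning-strategy question, for the variant where taking the last object wins, is the "nim-sum" rule: XOR all heap sizes together, and if the result is nonzero, move so that it becomes zero. The sketch below illustrates that rule only; the function names nim_sum and nim_winning_move are made up for this example and do not come from the slides, and the misère variant (last object loses) needs extra handling that is not shown.

```c
#include <stddef.h>

/* XOR of all heap sizes (the "nim-sum"). */
unsigned nim_sum(const unsigned *heaps, size_t n) {
    unsigned s = 0;
    for (size_t i = 0; i < n; ++i)
        s ^= heaps[i];
    return s;
}

/* Normal play (taking the last object wins).
 * If a winning move exists, store which heap to reduce and its new size,
 * then return 1. Return 0 when the nim-sum is already zero, i.e. every
 * move loses against best play. */
int nim_winning_move(const unsigned *heaps, size_t n,
                     size_t *heap_index, unsigned *new_size) {
    unsigned s = nim_sum(heaps, n);
    if (s == 0)
        return 0;
    for (size_t i = 0; i < n; ++i) {
        unsigned target = heaps[i] ^ s;   /* size that makes the nim-sum zero */
        if (target < heaps[i]) {          /* legal only if it removes objects */
            *heap_index = i;
            *new_size = target;
            return 1;
        }
    }
    return 0;   /* not reached when s != 0 */
}
```

For heaps of 3, 4, and 5 the nim-sum is 2, and the function suggests reducing the first heap from 3 to 1, which leaves heaps 1, 4, 5 with nim-sum 0.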



6

What is a random number?

• Sequence of independent random numbers with a specified distribution such as uniform distribution (equally probable)

• Actually, the sequence generated is not random, but it appears to be. Sequences generated in a deterministic way are usually called Pseudo-Random sequences.

Reference: http://www.gnu.org/software/gsl/manual/gsl-ref_19.html

Other distributions: normal, exponential, gamma, Poisson, …


7

rand( ) and srand( ) in Turbo C++

<stdlib.h>

#define RAND_MAX 0x7fffu   /* fifteen 1 bits in binary */

/* static: functions in other files cannot see this seed */
static unsigned long seed = 0;

int rand( ) {
    seed = seed * 1103515245 + 12345;
    return seed % (RAND_MAX+1);   /* pseudo-random number */
}

void srand(int newseed) {
    seed = newseed;
}

For static global variables, see section 4.6 of the K&R textbook.

Note: the C language's rand( ) does not produce a normal distribution!


8

rand( ) and srand( ) in gcc on Unix

<stdlib.h>

#define RAND_MAX 0x7fffffffu   /* thirty-one 1 bits in binary */

static unsigned long seed = 0;

int rand( ) {
    seed = seed * 1103515245 + 12345;
    return seed % (RAND_MAX+1);   /* pseudo-random number */
}

void srand(int newseed) {
    seed = newseed;
}

For static global variables, see section 4.6 of the K&R textbook.

Note: with the gcc that ships with Dev-Cpp, random numbers have only 16 bits!



9

Random Number Generating Algorithms

• Linear Congruential Generators
– Simple way to generate pseudo-random numbers

– Easily cracked

– Produce finite sequences of numbers

– Each number is tied to the others

– Some sequences of numbers will never be generated

• Cryptographic random number generators
• Entropy sensors (i.e., extracted randomness)


10

Linear Congruential Generator (LCG) for Uniform Random Digits

• Preferred method: begin with a seed, x_0, and successively generate the next pseudo-random number by x_{i+1} = (a x_i + c) mod m, for i = 0, 1, 2, …, where
– m is the largest prime less than the largest integer the computer can store
– a is relatively prime to m
– c is arbitrary

• Let [A] be the largest integer not exceeding A (that is, keep only the integer part); then N mod m = N – [N/m]*m

• Accept an LCG with m, a, and c only if it passes the tests that are also passed by known uniform digits

mod is written % in C/C++/Java
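The recurrence x_{i+1} = (a x_i + c) mod m can be sketched in a few lines of C. The type Lcg and the functions mod_by_floor and lcg_next are hypothetical names for this illustration. The parameter values suggested in the usage note (a = 16807, c = 0, m = 2^31 − 1, the Park–Miller "minimal standard" choice) are an assumption brought in for the example, not values given on the slide.

```c
#include <stdint.h>

/* Hypothetical LCG state: multiplier a, increment c, modulus m, current x. */
typedef struct {
    uint64_t a, c, m, x;
} Lcg;

/* N mod m written as N - [N/m]*m, where [.] keeps only the integer part.
 * For unsigned integers this equals the C expression n % m. */
uint64_t mod_by_floor(uint64_t n, uint64_t m) {
    return n - (n / m) * m;
}

/* Advance the generator one step and return the new value.
 * Assumes a * x + c fits in 64 bits (true when a and m are below 2^31). */
uint64_t lcg_next(Lcg *g) {
    g->x = mod_by_floor(g->a * g->x + g->c, g->m);
    return g->x;
}
```

With Lcg g = {16807, 0, 2147483647, 1}, the first two calls to lcg_next(&g) return 16807 and 282475249. Because c = 0 in this parameter set, the seed must not be 0, or every later value is 0 as well.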


11

The use of random numbers

1. Simulation
2. Recreation (game programming)
3. Sampling
4. Numerical analysis
5. Decision making: randomness is an essential part of optimal strategies (in game theory)
6. Game programs, . . .


12

Uniform Distribution

• Every value in the interval a ≤ x ≤ b is equally likely.

$f(x) = \frac{1}{b-a}, \quad a \le x \le b$

$E(X) = \frac{a+b}{2}$

$V(X) = \frac{(b-a)^2}{12}$


13

Normal Distribution

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$

$E(X) = \mu$

$V(X) = \sigma^2$


14

Standard Normal Distribution

• N(0, 1)

• The mean is 0
• The standard deviation is 1

$Z = \frac{X-\mu}{\sigma}$

$E(Z) = 0$

$V(Z) = 1$

$SD(Z) = 1$


15

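The Normal Distribution
• In statistics, the most commonly used continuous distribution is the normal distribution.

In the real world, the normal distribution is often used to describe the behavior of variables such as exam scores, body weight, IQ, and store revenue.

• If X is a normal random variable, we write X ~ N(μ, σ²), where the parameter μ is the mean and σ² is the variance.

The mean, median, and mode of a normal random variable are all equal.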



16

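Central Limit Theorem (CLT)

• As the number of observations n increases, the mean of n independent and identically distributed (i.i.d.) random variables converges to a normal distribution.

$X_i \overset{\text{i.i.d.}}{\sim} (\mu, \sigma^2)$

$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

$\bar{X} \to N\!\left(\mu, \frac{\sigma^2}{n}\right)$

When the sample size n ≥ 30, the distribution of the sample mean is approximately normal.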


17

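How can we generate normally distributed random numbers in C?

#include <stdlib.h>

double randNormal( ) {   // standard normal generator
    int i;
    double ans = 0.0;
    /* sum of 12 uniform values in [0, 1): mean 6, variance 12 * (1/12) = 1 */
    for (i = 1; i <= 12; ++i)
        ans = ans + rand( )/(1.0+RAND_MAX);
    return ans - 6.0;   // approximately N(0, 1)
}

How do we generate N(x, std²)?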
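One way to answer the question is to scale a standard normal value by the desired standard deviation and then shift it by the desired mean. The sketch below repeats the slide's randNormal so that it compiles on its own; randNormalScaled is a hypothetical helper name chosen for this example.

```c
#include <stdlib.h>

/* Approximate N(0, 1): the sum of 12 uniform [0, 1) values minus 6. */
double randNormal(void) {
    double ans = 0.0;
    for (int i = 1; i <= 12; ++i)
        ans += rand() / (1.0 + RAND_MAX);
    return ans - 6.0;
}

/* Hypothetical helper: if Z ~ N(0, 1), then mean + std * Z has
 * mean `mean` and standard deviation `std`. */
double randNormalScaled(double mean, double std) {
    return mean + std * randNormal();
}
```

For example, randNormalScaled(50.0, 10.0) yields values centered on 50 with a standard deviation of about 10. Because the sum of 12 uniforms is bounded, this approximation can never produce a value more than 6 standard deviations from the mean.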


18

Summary

• Pseudo-Random Number Generators (PRNGs) depend solely on a seed, which determines the entire sequence of numbers returned.

• How do we get closer to "truly random"? Change the random seed.
• How random is the seed?

– Process ID, user ID: bad idea!
– Current time: srand( time(0) ); // good

If you use the time, maybe I can guess which seed you used (the microsecond part might be difficult to guess, but its range is limited).
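The claim that the seed determines the entire sequence is easy to check. The sketch below uses only srand and rand from <stdlib.h>; the helper names first_values and same_sequence are made up for this illustration.

```c
#include <stdlib.h>

/* Fill out[0..n-1] with the first n values rand() returns after srand(seed). */
void first_values(unsigned seed, int *out, int n) {
    srand(seed);
    for (int i = 0; i < n; ++i)
        out[i] = rand();
}

/* Return 1 if the two seeds give the same first 16 values, otherwise 0. */
int same_sequence(unsigned seed_a, unsigned seed_b) {
    int a[16], b[16];
    first_values(seed_a, a, 16);
    first_values(seed_b, b, 16);
    for (int i = 0; i < 16; ++i)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

same_sequence(42u, 42u) returns 1 on any conforming C library, which is why a program that never calls srand, or always passes the same seed, plays the same "random" game every run. Calling srand((unsigned)time(0)) once at startup (with <time.h> included) changes the seed from run to run, subject to the guessability caveat above.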


19

Introduction to simple Statistics

蔡文能
[email protected]


20

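College Entrance Examination Center … Ministry of Education …
• The College Entrance Examination Center reported that the Chinese-language scores on the 2007 (ROC year 96) Advanced Subjects Test were fairly close to a "normal distribution": students of average ability were the most numerous, with fewer at the high and low ends.

• The Ministry of Education revised the criteria for identifying gifted students. Starting from academic year 96 (2007), the threshold for every category of gifted identification was raised to "two standard deviations above the mean, or the 97th percentile or higher".

Under this standard, about how many people out of 100 are "gifted"?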


21

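2010 Greater Kaohsiung Mayoral Election Poll
• For the Greater Kaohsiung mayoral election to be held next year, the latest poll published by the biweekly 《財訊》 shows that 50% of Kaohsiung residents back 陳菊, while 朱立倫 has only 32% support. If 陳菊 instead faces legislator 黃昭順, the Kuomintang figure most visibly positioning for next year's mayoral race, the gap widens to 19% versus 60%.

• This 《財訊》 poll was commissioned from 山水民意研究公司 and used random sampling of residential telephone numbers in Taipei and Kaohsiung. The Kaohsiung survey was conducted on November 2–3 with 1273 valid samples; at the 95% confidence level, the margin of error is about ±2.75 percentage points.

Sampling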


22

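2005 Nantou County Magistrate Election Survey
Which problem in Nantou County do you think most urgently needs improvement?

Which candidate do you support in the year-end county magistrate election?

中時電子報 conducted random-sample telephone interviews over three days, November 8, 9, and 10, and successfully interviewed 1103 Nantou County residents; at the 95% confidence level, the margin of error is within ±2.95%.

Note: 李朝卿 won the election.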


23

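2009 Nantou County Magistrate Election Poll
Telephone survey report by the 聯合報系 public opinion survey center

Support for the Kuomintang's 李朝卿 has risen 18 percentage points from half a month ago; with 48% support he now holds a wide lead over the 30% of the Democratic Progressive Party's 李文忠.

The survey was conducted on the evenings of November 10 and 11, 2009. It successfully interviewed 932 adult voters registered in Nantou County; another 262 people declined to answer. At the 95% confidence level, the sampling error is within ±3.2 percentage points. The survey used Nantou County residential telephone numbers as the population, randomizing the last two digits, and was funded by 聯合報社.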


24

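2008 Presidential Election: 蘋果日報 Poll

This is the latest poll that Taiwan's 《蘋果日報》 commissioned from the public opinion research center of the College of Social Sciences at 中山大學. It interviewed by telephone citizens across Taiwan aged 20 or older who are eligible to vote. The survey ran from 6 p.m. to 10 p.m. on January 13 to 16, 2008, starting the day after the January 12 legislative election, and collected 1054 successful samples; at the 95% confidence level, the margin of error is about ±3%.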


25

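October 2006 Taipei Mayoral Candidate Poll

On September 27 and 28, 2006, 中時電子報 sampled household telephone numbers in the Republic of China and successfully interviewed 1112 respondents living in Taipei City. At the 95% confidence level, the margin of error is within ±2.9%.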


26

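2005 Taipei County Magistrate Election Poll
According to a TVBS poll conducted on November 21 and 22, the Kuomintang's Taipei County candidate 周錫瑋 has 48% support, while the Democratic Progressive Party's candidate 羅文嘉 has 27%.

Compared with the poll from a month earlier, and despite the 永洲 case that had since broken and the recent uproar over smears on the "瑋哥 blog", 周錫瑋's support rose rather than fell, gaining 2 percentage points, while 羅文嘉's fell by 4 percentage points.

The TVBS poll center conducted this poll on November 21 and 22 and successfully interviewed 1033 Taipei County residents aged 20 or older; at the 95% confidence level, the sampling error is about ±3.0 percentage points.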


27

1936 Presidential Election and Poll


28

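Background: the 1936 U.S. presidential election
• President Franklin Roosevelt was running for re-election; Kansas Governor Landon was the Republican presidential candidate
• The U.S. economy was gradually recovering from the Great Depression

– Nine million people were unemployed; real income fell by one third between 1929 and 1933.
– Governor Landon campaigned on "small government". His slogan was "The spender must go".
– President Roosevelt campaigned on deficit financing to boost domestic demand. His slogan was "Balance the budget of the American people first".

• Claim 1: most observers thought President Roosevelt would win easily
• Claim 2: the Literary Digest magazine predicted that Landon would win the election 57% to 43%.

– This figure was based on a poll of 2.4 million people.
– Since 1916, the magazine's prediction method had always given the correct result.

• Election result: Roosevelt won 62% to 38%. Why?
• A new competitor, Gallup, and its polls:

– From the 2.4 million people sampled by the Literary Digest, Gallup drew a sample of 3,000 and predicted that Landon would win 56% to 44%.

– From his own sample of 50,000 people, Gallup predicted that Roosevelt would win 56% to 44%.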


29

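Where did the Literary Digest go wrong?
Sampling method: 10 million questionnaires were mailed and 2.4 million were returned, but the recipients were chosen from telephone directories and club membership lists.
– At the time there were only 11 million residential telephones, while 9 million people were unemployed.

Possible sources of the problem:
Selection bias: the Literary Digest sample included too many wealthy people, and that year the voting preferences of rich and poor differed sharply.

Non-response bias: low return rate.
– In Chicago, for example, questionnaires were sent to one third of the registered voters and about 20% were returned; more than half of the respondents said they would vote for Landon, yet in the election Roosevelt received two thirds of the vote.

How large a sample is enough?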


30

Sample size vs. error of estimation

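• When we use $\hat{p}$ to construct a 95% confidence interval for $p$, the bound on error of estimation is $B = 1.96 \sqrt{\frac{p(1-p)}{n}}$

• $n = p(1-p) \left[ \frac{1.96}{B} \right]^2$

• The estimated standard deviation of $\hat{p}$ is $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$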


31

How large a sample is enough?

• 1 − α = confidence level
• B = the bound on error of estimation
• Using a conservative value of p = 0.5 in the formula for required sample size gives $n = p(1-p) \left( \frac{1.96}{B} \right)^2 = 0.5(1-0.5) \left( \frac{1.96}{0.03} \right)^2 = 1067.11$

• Thus, n would need to be 1068 in order to estimate p to within .03 with 95% confidence.

At the 95% confidence level, the sampling error is within ±3 percentage points.


32

Consider this program

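• The plaza at Taipei Main Station plans to install a weight-statistics machine: as soon as anyone steps on it, it displays that person's weight

• and immediately displays the following statistics:

n: how many people have been weighed here so far
Average: the average weight
STD: the standard deviation of these n people's weights

Note: because there may be a great many people, the machine cannot keep every measured weight in memory, and it has no hard disk or other storage device!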


33

Descriptive Statistics

• Distribution
– Frequency distribution

– Histogram

• Central tendency
– Mean

– Median

– Mode

• Dispersion
– Range

– Standard deviation

– Variance

• N
• Not P (inferential stats)

Central tendency: where the data cluster
Distribution: how the data are spread over the possible values

Dispersion: how scattered the data are


34

Statistics
• Parameters (common statistical parameters)

– Mean ─ the average of the data

– Median ─ the value of the middle observation

– Mode ─ the value with the greatest frequency

– Standard Deviation ─ a measure of the average deviation

– Variance ─ the square of the standard deviation

– Range ─ for example, from Min(B2:B60) to Max(B2:B60)


35

Mean and Variance

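Population Mean / Sample Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Sample Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

If the data are the whole population rather than a sample, divide by n instead of n − 1; the result is the population variance.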


36

Standard Deviation

• Variance describes the spread (variation) of the data around the mean.

• Sample variance describes the variation of the estimates.

• Standard deviation s is the square root of s²

The standard deviation is sqrt(variance), that is, the square root of the variance.


37

Compute Variance without mean

From Wikipedia.org

Variance = (sum of squares – (square of the sum) / n) / n


38

The Central Limit Theorem

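• The probability distribution of sample means is approximately a normal distribution

• If an infinite number of samples with n ≥ 30 observations are drawn from the same population, where X ~ ??(μ, σ) (any distribution with mean μ and standard deviation σ), then

$\bar{X} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$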


39

Central Limit Theorem （中央極限定理）

• For a population with a mean μ and a variance σ², the sampling distribution of the means of all possible samples of size n generated from the population will be approximately normally distributed, with the mean of the sampling distribution equal to μ and the variance equal to σ²/n, assuming that the sample size is sufficiently large.


40

The Normal Distribution

• Described by
– μ (mean)
– σ (standard deviation)
– Variance = the square of the standard deviation

• Written as N(μ, σ) or N(μ, σ²)
• Area under the curve is equal to 1
• Standard Normal Distribution

– μ = 0 and σ = 1


41

Why is the Normal Distribution important?

• It can be a good mathematical model for some distributions of real data
– ACT scores
– Repeated careful measurements of the same quantity

• It is a good approximation for different types of chance outcomes (like tossing a coin)

• It is a very useful distribution for modeling roughly symmetric distributions
– Many statistical inference procedures are based on the normal distribution
• Sampling distributions are roughly normal

(TBC…)


42

Normal Distributions and the Standard Deviation

Normal Distribution

Black line: mean
Red lines: 1 std. dev. from the mean (68.26% interval)
Green lines: 2 std. dev. from the mean (95.44% interval)

What about 3 Std. Dev. from the mean?

95% Confidence interval ±1.96 Std. Dev.


43

68-95-99.7 Rule for Normal Curves

[Figure: three normal curves with shaded areas of 68.26% within µ ± σ, 95.44% within µ ± 2σ, and 99.74% within µ ± 3σ]

68.26% of the observations fall within σ of the mean

95.44% of the observations fall within 2σ of the mean

99.74% of the observations fall within 3σ of the mean


44

Notations

• It is important to distinguish between empirical and theoretical distributions

• Different notation for each distribution

              Mean    Std dev
Theoretical   μ       σ
Empirical     x̄       s


45

Density function of Normal Distribution

• The exact density curve for a particular normal distribution is described by giving its mean (μ) and its standard deviation (σ)

• density at x = $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$


46

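Confidence Intervals (CI) for µ, from a single sample mean

Since $\bar{X} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$, therefore

$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

, and furthermore

$P\!\left(-1.96 < \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} < 1.96\right) = .95$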


47

Confidence Interval? (1/2)

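• When we use software to simulate a real environment, we usually run the simulation many times with random numbers. Suppose the first run yields X1, the second X2, and so on; after n repetitions we have n values X1, X2, …, Xn. These n values are not all the same, so which one is correct? Intuitively, the average of the n results should be more trustworthy.

• But how much can we trust this average (the sample mean)?

• This is the question addressed by the confidence interval and the significance level.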


48

Confidence Interval? (2/2)
• In practice, it is impossible to obtain a value that perfectly matches the true result from a finite number of simulation results.

• What we can do instead is find a probability bound. If we can obtain a lower limit c1 and an upper limit c2 from the simulation results, then with a high probability 1 – α the mean μ lies between c1 and c2.

Probability { c1 <= μ <= c2 } = 1 – α

We call the range (c1, c2) the confidence interval; α is the significance level; 100(1 – α)%, expressed as a percentage, is the confidence level; and 1 – α is the confidence coefficient.


49

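Why is simple random sampling a reasonable sampling method?

• Consider sampling 16 hospitals to estimate the average number of discharged patients across 393 hospitals:

– There are about 10^33 different possible samples.
– By the central limit theorem, the distribution of the sample averages looks like a bell-shaped curve centered at the average over all hospitals, and (by the law of large numbers) most 16-hospital averages are not far from that center.

For a more trustworthy sampling procedure, the sample should be chosen at random.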


50

Hypothesis Testing

• The null hypothesis for the test is that all population means (level means) are the same. (H0)

• The alternative hypothesis is that one or more population means differ from the others. (H1)


51

More on PRNGs
• Search for "PRNG" at http://gogle.com
• ANSI X9.17 PRNG (PRNG = Pseudo Random Number Generator)
• The middle square method devised by von Neumann
• Von Neumann architecture?
• PRNG in RC4 (RC4 is used for encryption and decryption in 802.11 wireless networks)

– http://www.rsa.com
– http://www.wisdom.weizmann.ac.il/~itsik/RC4/rc4.html

• WEP: RC4 stream cipher


52

ANSI X9.17 PRNG

• Use 3DES and a key K
• Ti = Ek(current timestamp)
• output[i] = Ek(Ti ⊕ seed[i])
• seed[i+1] = Ek(Ti ⊕ output[i])
• Weaknesses

– Only 64 bits are used for Ti

– seed[i+1] can be easily predicted if the state is compromised


53

In 1946 John von Neumann suggested producing random numbers with the arithmetic operations of a computer. His "middle square" method squares the previous random number and extracts the middle digits.

Example: to generate 10-digit numbers, suppose the previous number was 5772156649. Its square is

33317792380594909201

so the next number is 7923805949
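The 10-digit example can be reproduced in C, with one wrinkle: the 20-digit square does not fit in a 64-bit integer. The sketch below therefore splits the number into two 5-digit halves before squaring. The function name middle_square10 is made up for this illustration.

```c
#include <stdint.h>

/* Next value of the 10-digit middle-square sequence: the middle 10 digits
 * of the 20-digit square of x (x must be below 10^10).
 * x*x can overflow 64 bits, so write x = a*10^5 + b and use
 * x^2 = a^2*10^10 + 2ab*10^5 + b^2. */
uint64_t middle_square10(uint64_t x) {
    const uint64_t HALF = 100000ULL;        /* 10^5 */
    const uint64_t MOD = 10000000000ULL;    /* 10^10 */
    uint64_t a = x / HALF;
    uint64_t b = x % HALF;
    /* floor(x^2 / 10^5) = a^2*10^5 + 2ab + floor(b^2 / 10^5) */
    uint64_t shifted = a * a * HALF + 2 * a * b + (b * b) / HALF;
    return shifted % MOD;                   /* keep the middle 10 digits */
}
```

middle_square10(5772156649) returns 7923805949, matching the example, and middle_square10(0) returns 0, so a zero in the sequence repeats forever.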

Middle square

"Middle square" has proved to be a comparatively poor source of random numbers. If zero appears as a number in the sequence, it will perpetuate itself forever.


54

Von Neumann architecture (http://wikipedia.org/)

• The term von Neumann architecture refers to a computer design model that uses a single storage structure to hold both programs and data. The term von Neumann machine can be used to describe such a computer, but that term has other meanings as well. The separation of storage from the processing unit is implicit in the von Neumann architecture.

• The term "stored-program computer" is generally used to mean a computer of this design.

Von Neumann bottleneck?


55

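RC4 PRNG (1/2)

Seeding RC4

unsigned char S[256];   /* state: a permutation of 0..255 */
unsigned char I, J;     /* 8-bit indices, so arithmetic wraps mod 256 */

for (int n = 0; n < 256; n++)
    S[n] = n;

J = 0;
for (int n = 0; n < 256; n++) {
    J += S[n] + K[n % klen];   /* K is the key, klen its length in bytes */
    SWAP(S[n], S[J]);
}
I = J = 0;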


56

RC4 PRNG (2/2)

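Byte version

unsigned char rc4byte(void) {
    I++;                   /* I, J: unsigned char, so they wrap mod 256 */
    J += S[I];
    SWAP(S[I], S[J]);
    return S[(unsigned char)(S[I] + S[J])];   /* index taken mod 256 */
}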


57

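WEP: RC4 encryption and decryption (http://rsa.com)

[Diagram: the encryption key K feeds a pseudo-random number generator, which outputs a random bit stream b; the plaintext bit stream p is XORed with b to give the ciphertext bit stream c]

Encryption: c = p XOR b

Decryption works in the same way: p = c XOR b

WEP: Wired Equivalent Privacy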
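The two RC4 fragments and the XOR step can be put together into one self-contained sketch. The names Rc4, rc4_init, rc4_byte, and rc4_xor are hypothetical. The code illustrates only the plain RC4 stream cipher; it omits the per-packet initialization vector and integrity check that WEP adds. RC4 and WEP are both considered broken and are shown here for teaching purposes only.

```c
#include <stddef.h>

typedef struct {
    unsigned char S[256];   /* state: a permutation of 0..255 */
    unsigned char i, j;     /* 8-bit indices, wrap mod 256 */
} Rc4;

static void rc4_swap(unsigned char *a, unsigned char *b) {
    unsigned char t = *a;
    *a = *b;
    *b = t;
}

/* Key scheduling ("seeding"): mix the key into the permutation S. */
void rc4_init(Rc4 *r, const unsigned char *key, size_t klen) {
    unsigned char j = 0;
    for (int n = 0; n < 256; n++)
        r->S[n] = (unsigned char)n;
    for (int n = 0; n < 256; n++) {
        j = (unsigned char)(j + r->S[n] + key[n % klen]);
        rc4_swap(&r->S[n], &r->S[j]);
    }
    r->i = r->j = 0;
}

/* Produce the next pseudo-random keystream byte. */
unsigned char rc4_byte(Rc4 *r) {
    r->i = (unsigned char)(r->i + 1);
    r->j = (unsigned char)(r->j + r->S[r->i]);
    rc4_swap(&r->S[r->i], &r->S[r->j]);
    return r->S[(unsigned char)(r->S[r->i] + r->S[r->j])];
}

/* XOR each input byte with the keystream. The same call encrypts
 * (c = p XOR b) and decrypts (p = c XOR b). */
void rc4_xor(const unsigned char *key, size_t klen,
             const unsigned char *in, unsigned char *out, size_t len) {
    Rc4 r;
    rc4_init(&r, key, klen);
    for (size_t n = 0; n < len; n++)
        out[n] = in[n] ^ rc4_byte(&r);
}
```

With the 3-byte key "Key" and the 9-byte input "Plaintext", rc4_xor produces the bytes BB F3 16 E8 D9 40 AF 0A D3, a widely published RC4 test vector. Calling rc4_xor again on that output with the same key recovers "Plaintext".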


58

Thank you!
http://www.csie.nctu.edu.tw/~tsaiwn/introcs/
