econ1203 notes


Upload: whyisscribdsopricey

Post on 17-Feb-2018


Page 1: Econ1203 Notes

7/23/2019 Econ1203 Notes

http://slidepdf.com/reader/full/econ1203-notes 1/35

Week 1

Types of Observations

• Time series: data consist of measurements of the same concept at different points in time. E.g., Sydney-area births per day, for each day in a year.
• Cross-sectional: data consist of measurements of one or more concepts at a single point in time. E.g., age, gender, and marital status of a sample of UNSW staff in a particular year.

Descriptive Statistics

• What are the key features of a data set?
• Many tools and techniques are available; some are graphical and some numerical, depending on the data.

Frequency Distributions

• Summaries of categorical data using counts

Bar Charts and Pie Charts

• Graphical representations of frequency distributions
• Bar charts: can display multiple patterns in frequencies, enabling quick visual comparisons
• Pie charts: show relative frequencies more explicitly

Histogram

• Quantitative data
• Categories are replaced by bins

Cumulative Frequency or Relative Frequency Distributions

Stem-and-leaf Displays

Describing Histograms

• Symmetry (or lack thereof)
• Skewness
o Long tail to the right: positively skewed
o Long tail to the left: negatively skewed
o May be associated with outliers; is associated with an asymmetric histogram
• Number of modal classes/bins
o The modal class is the class with the highest frequency
o Histograms may be unimodal or multimodal


Bivariate Relationships

• How can we characterize relationships between variables?
o Contingency table ("cross-tabulation" or "cross-tab")
▪ Captures the relationship between two qualitative variables
o Scatterplots
▪ Capture the relationship between two quantitative variables
▪ If one of these variables is "time", then we get a time series plot

Time Series Plot

• Plots a bivariate relationship between some variable and time, e.g., business cycles measured as GDP growth over time


Week 2

Numerical Summaries of Key Features of Data

• Key features of a single variable
o Location, spread, relative location, skewness
• Key features of two variables
o Measures of (linear) association

Measures of Location

• A parameter describes a key feature of a population
• A statistic describes a key feature of a sample
• The arithmetic mean is a natural measure of "location" or "central tendency"
• The median is the middle value of the ordered observations
o When n is odd, the median is a particular observed value; when n is even, the median is the average of the two middle values
o The median depends on ranks, not absolute values
• The mode is the most frequently occurring value(s)
o Modal class (the most common class)
• The mean, median and mode each provide a different notion of a "representative" or "typical" central value
o For quantitative data, the mode is often not useful
o Mean vs. median
▪ For symmetric distributions, mean = median
▪ For positively (negatively) skewed data, mean > (<) median
▪ The median may be preferred when the data contain outliers
• Outliers
• The range is a simple measure of variability
o Range = maximum − minimum
• The variance (the most common measure of variability) measures the average squared distance from the mean
o Division by n − 1 for the sample variance relates to properties of estimators
• The standard deviation is the spread measured in the original units of the data (NOT squared)

Standardizing Data

• Calculating z-scores: a variable free of units of measurement (one z-score per observation)
o Calculate (observation − mean) / standard deviation

Coefficient of Variation

• Coefficient of variation: cv = s / x̄
o Provides a measure of relative variability
o Comparable across variables
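
The measures above can be checked on a small made-up sample. The sketch below uses Python's standard `statistics` module; the data values are hypothetical and chosen so that one large observation pulls the mean above the median.

```python
import statistics as st

# Hypothetical sample; 25 acts as an outlier in the right tail
data = [2, 4, 4, 5, 7, 9, 25]

mean = st.mean(data)                  # arithmetic mean
median = st.median(data)              # middle value of the ordered data
mode = st.mode(data)                  # most frequently occurring value
data_range = max(data) - min(data)    # maximum - minimum
s2 = st.variance(data)                # sample variance (divides by n - 1)
s = st.stdev(data)                    # standard deviation, in original units

z_scores = [(x - mean) / s for x in data]   # one z-score per observation
cv = s / mean                               # coefficient of variation

print(mean, median, mode, data_range)
print(round(s, 3), round(cv, 3))
```

In this sample the mean (8) exceeds the median (5), as expected for positively skewed data, and the z-scores have mean 0 and standard deviation 1 by construction.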


Measures of Relative Location

• The median relies on a ranking of observations to measure location; we can generalise this notion to percentiles
o The Pth percentile is the value for which P percent of observations are less than that value
o The median is the 50th percentile; the 25th and 75th percentiles are the lower and upper quartiles
o Interquartile range = upper quartile − lower quartile (another measure of spread)

Measures of Association

• Covariance is a numerical measure of linear association
o Positive (negative) covariance: positive (negative) linear association
o Zero covariance: no linear association
• The correlation coefficient is a standardized, unit-free measure of association
o +1 (−1): perfect positive (negative) linear relationship

Least Squares

• Find the best fit to describe a bivariate relationship
• Least squares minimizes the residual sum of squares
o Basis of regression analysis
• The solution (with one independent variable): slope b1 = sxy / sx² (sample covariance divided by the sample variance of x) and intercept b0 = ȳ − b1 x̄


R-Squared: Percentage of Variation Explained by the Model

• The fit of the "model" to the actual data is described by the R-squared statistic (also termed the coefficient of determination).
• The maximum value of R-squared is 1 (perfect fit) and the minimum is 0 (no fit)
• In a simple bivariate regression, R-squared = [correlation of x and y]²
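
To see how covariance, correlation, the least-squares line and R-squared fit together, here is a minimal sketch in plain Python. The (x, y) values are hypothetical; the slope and intercept use the standard one-regressor least-squares solution.

```python
import statistics as st

# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
x_bar, y_bar = st.mean(x), st.mean(y)

# Sample covariance and the unit-free correlation coefficient
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = s_xy / (st.stdev(x) * st.stdev(y))

# Least-squares slope and intercept
b1 = s_xy / st.variance(x)
b0 = y_bar - b1 * x_bar

# R-squared as the share of variation in y explained by the fitted line
fitted = [b0 + b1 * xi for xi in x]
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_total = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_resid / ss_total

print(round(r, 4), round(r_squared, 4), round(r ** 2, 4))
```

The last two printed numbers agree, which is the "R-squared = correlation squared" property for a simple bivariate regression.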


Week 3

Probability

Probability is the mathematical means of studying uncertainty. It provides the logical foundation of statistical inference.

Independence

• Covariance and correlation are measures of linear association or linear dependence
o Dependence (and interdependence) captures a more general concept of association between two variables.

A Tease on Sampling

• Without replacement
• With replacement


Probability Trees

• Events in "draws" may be represented by probability trees

Cumulative Distribution Function (cdf)

• Defined as F(x) = P(X ≤ x) for all x


Week 4

Random Variables

• Random variable: a function that assigns a real number to each possible outcome in the sample space.
o E.g., PHI = 1 if one has private health insurance, PHI = 0 if they don't
▪ This is an example of a discrete random variable; it is a "binary" or "indicator" variable
o E.g., time measured continuously is an example of a continuous random variable
• Notation
o Upper-case X denotes the random variable
o Lower-case x denotes a value it might take on (a possible "realized value")

Mathematical Expectation

• The basis for formulating summary measures for probability distributions
• Expected value: multiply each possible value of the random variable by the probability of its occurrence and add up those products.
• For any random variable, say X, its expected value is denoted by E(X)

Expectations for discrete random variables

The previous definitions of population mean and variance assumed equally likely outcomes. For a discrete random variable, E(X) = μ = Σ x P(x) and Var(X) = σ² = Σ (x − μ)² P(x), summing over all x. These definitions are weighted averages of outcomes, with the outcome probabilities as the weights.

Rules of Expectations

• E(c) = c
• E(X + c) = E(X) + c
• E(cX) = cE(X)
• Var(c) = 0
• Var(X + c) = Var(X)
• Var(cX) = c²Var(X)
• Remember z-scores? Use the expectation rules to determine the mean and variance of Z. Recall that the original random variable has an expected value of μ and a variance of σ².
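
The weighted-average definitions and the rules above can be verified numerically. This is a minimal sketch with a hypothetical three-point distribution; the helper name `expect` is introduced here purely for illustration.

```python
# Hypothetical discrete distribution: value -> probability
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

def expect(g, pmf):
    """E[g(X)]: weight each g(x) by its probability and add up."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expect(lambda x: x, pmf)                  # E(X)
var = expect(lambda x: (x - mu) ** 2, pmf)     # Var(X)
sigma = var ** 0.5

# Rule check: Var(cX) = c^2 Var(X)
c = 3.0
var_cx = expect(lambda x: (c * x - c * mu) ** 2, pmf)

# Z = (X - mu) / sigma should have mean 0 and variance 1
z_mean = expect(lambda x: (x - mu) / sigma, pmf)
z_var = expect(lambda x: ((x - mu) / sigma - z_mean) ** 2, pmf)

print(round(mu, 4), round(var, 4), round(var_cx, 4))
print(round(z_mean, 4), round(z_var, 4))
```

The z-score result follows from the rules: subtracting μ shifts the mean to zero without changing the variance, and dividing by σ scales the variance by 1/σ².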


Bivariate Distributions

• The set of all values P(x,y), i.e., the probabilities associated with all possible pairs (x,y), is the joint probability distribution of (X,Y)
• The marginal distribution of Y can be found by summing the joint probabilities across all values of X
o P(y) = Σ(all x) P(x,y)
• Population formulas for covariance and correlation (correlation has the advantage of being unit-free)
o Cov(X,Y) = σxy = Σ(all x) Σ(all y) (x − μx)(y − μy) P(x,y)
o Cor(X,Y) = ρ = σxy / (σx σy)

More Rules of Expectations

Let X and Y be random variables. Then
• E(X ± Y) = E(X) ± E(Y)
• Var(X ± Y) = Var(X) + Var(Y) ± 2Cov(X,Y)
• If X and Y are independent (hence zero covariance!), then
o Var(X ± Y) = Var(X) + Var(Y)
• Cov(c1X, c2Y) = c1c2 Cov(X,Y)
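
As a worked illustration of the joint-distribution formulas, the sketch below builds marginals, covariance and correlation from a hypothetical 2×2 joint distribution and then checks the Var(X + Y) rule directly.

```python
# Hypothetical joint distribution P(x, y)
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal distributions: sum the joint probabilities over the other variable
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

mu_x = sum(x * p for x, p in px.items())
mu_y = sum(y * p for y, p in py.items())
var_x = sum((x - mu_x) ** 2 * p for x, p in px.items())
var_y = sum((y - mu_y) ** 2 * p for y, p in py.items())

# Population covariance and correlation
cov_xy = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())
rho = cov_xy / (var_x * var_y) ** 0.5

# Var(X + Y) computed directly from the joint distribution
var_sum = sum((x + y - mu_x - mu_y) ** 2 * p for (x, y), p in joint.items())

print(round(cov_xy, 4), round(rho, 4))
print(round(var_sum, 4), round(var_x + var_y + 2 * cov_xy, 4))
```

The two numbers on the last line match, confirming Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y) for this example.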

Portfolio Allocation

• Variability is called volatility in finance, and is a measure of risk
• A portfolio is a collection of stocks held by an investor
o An investor reduces risk through diversification, achieved by holding a portfolio


Binomial Distribution

Requirements for something to qualify as a binomial experiment:
• A sequence of a fixed number, n, of trials
• Each trial has 2 outcomes, arbitrarily denoted "success" and "failure"
• There is a fixed probability of success (p) and failure (1 − p) over all trials
• Trials are independent
• Under these assumptions, this is a sequence of Bernoulli trials
• The outcome of each trial is recorded in a random variable

Binomial Random Variables

We can represent our sequence of Bernoulli random variables as
• X1, X2, …, Xn, where Xi = 1 is called "success" and Xi = 0 is "failure"
• Under the assumptions made, this is a sequence of independent and identically distributed (i.i.d.) random variables.
• The number of successes, X = X1 + X2 + … + Xn, is called a binomial random variable
• Characterized by 2 parameters, n and p
• I.e., once we know the values of these parameters in a given case, we know everything about the random variable and its probability distribution

In this example
• P(X = 5) is a binomial probability
• P(X ≤ 1) is a cumulative binomial probability
• P(X > 1) is sometimes called a survivor probability
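
The three kinds of probability just listed can be computed from the standard binomial formula P(X = k) = C(n,k) p^k (1 − p)^(n−k). The sketch below is illustrative only: the values n = 10 and p = 0.3 are assumptions, not the figures from the original example.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable with parameters n and p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3    # hypothetical parameters

p_eq_5 = binom_pmf(5, n, p)                          # binomial probability
p_le_1 = sum(binom_pmf(k, n, p) for k in range(2))   # cumulative: P(X <= 1)
p_gt_1 = 1 - p_le_1                                  # survivor: P(X > 1)

print(round(p_eq_5, 4), round(p_le_1, 4), round(p_gt_1, 4))
```

Because n and p pin down every probability, changing either one changes the whole distribution, which is what "characterized by two parameters" means.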


Other common discrete distributions

• There are many other common discrete distributions that can be used to represent or model real phenomena
• E.g., the discrete uniform distribution (pretty boring!)
• The Poisson distribution is a little more interesting
o Binomial: number of successes in a sequence of trials
o Poisson: number of successes in a period of time or region of space
o The Poisson distribution relates to a count variable (0, 1, 2, …), where successes are relatively rare
▪ E.g., the number of times an individual visited a GP in the last year


Week 5

Continuous Random Variables

For discrete random variables, we assigned positive probabilities to different outcomes.
• Consider store deliveries that arrive somewhere between 7 and 8 AM
• Suppose the random variable of interest is "the number of minutes after 7 AM that deliveries are made"
o Possible outcomes would then be 0, 1, 2, …, 60 (7:00 AM, 7:01 AM, 7:02 AM, …)
o Assume all outcomes are equally likely; the probability of each is 1/61

But if we can measure time to any degree of accuracy, then we could construct a random variable that can take on any value in the interval from 0 to 60.
• This would be a continuous random variable

Probability Density Function (pdf)

• A continuous version of the familiar probability histogram used for discrete random variables.
• Consider a continuous random variable X with range a ≤ x ≤ b. Its probability density function ("pdf"), which we refer to as "f(x)", must satisfy
o f(x) ≥ 0 for all x between a and b
o The total area under the curve between a and b is unity (i.e., equal to 1)
• Probabilities are now represented by areas under the pdf

Uniform (discrete) Random Variable

The discrete uniform pdf for our store delivery example (limiting measurement to whole minutes) would have the form P(x) = 1/61 for x = 0, 1, …, 60.

In the continuous case, the "equally likely" nature of this random variable is now represented by any interval of width m having equal probability

The Normal Distribution

For any normally distributed random variable
• P(X = x) = 0**
• P(a < X < b) = area under the pdf curve between possible values a and b**
• A normal distribution is completely characterized by its mean, μ, and variance, σ²
• ** true for all CONTINUOUS random variables

Graphically, the normal probability density function is symmetric, unimodal and bell-shaped.
• Mean = median = mode

Its basic features include
• The range of "support" is unlimited: −∞ < x < ∞
• Despite the unlimited range, there is little probability area in the "tails" of a normal distribution
o 4.6% outside μ ± 2σ; 0.3% outside μ ± 3σ (confirm from tables): this is where the 68-95-99.7 "rule" comes from

If X is normally distributed with mean μ and variance σ², then write X ~ N(μ, σ²)
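
The tail areas quoted above can be confirmed without tables. This is a minimal sketch that writes the standard normal cdf in terms of the error function from Python's `math` module; the helper name `norm_cdf` is introduced here for illustration.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cdf P(Z <= z), written with the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Probability within k standard deviations of the mean, and outside it
within = {k: norm_cdf(k) - norm_cdf(-k) for k in (1, 2, 3)}
outside = {k: 1 - w for k, w in within.items()}

for k in (1, 2, 3):
    print(k, round(100 * within[k], 1), round(100 * outside[k], 1))
```

The "within" percentages round to 68.3, 95.4 and 99.7, which is the origin of the 68-95-99.7 rule.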


The Standard Normal

• Z = (X − μ)/σ
• As we've shown, this standardization yields a random variable with zero mean and a standard deviation of one
• How are Z and X related?
o Z = a + bX (where a = −(μ/σ) and b = 1/σ)
• So Z is a linear transformation of X

Uses of the Standard Normal

An important theorem says: linear combinations of normally distributed random variables are also normally distributed

When X ~ N(μ, σ²), Z (a linear function of X) is called a standard normal random variable
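
One practical use of standardization is that any normal probability can be read off the standard normal cdf. The sketch below assumes a hypothetical X ~ N(100, 15²) and computes P(85 < X < 130); in the course this lookup would be done with tables.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cdf P(Z <= z), written with the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 100.0, 15.0     # hypothetical mean and standard deviation
a, b = 85.0, 130.0          # hypothetical interval endpoints

z_a = (a - mu) / sigma      # standardized lower endpoint
z_b = (b - mu) / sigma      # standardized upper endpoint
prob = norm_cdf(z_b) - norm_cdf(z_a)   # P(a < X < b) = P(z_a < Z < z_b)

print(z_a, z_b, round(prob, 4))
```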

Calculating Normal Probabilities

• Normal random variables are continuous
• Probabilities need to be calculated as integrals!
• Continuity correction

Data Collection

• Secondary data
• Primary data
• Observational data measures actual behaviour or outcomes
• Experimental data imposes a treatment and measures resultant behaviour or outcomes
• Simple random sampling: a sampling process by which all samples of the same size (n) are equally likely to be chosen from the population of interest
o Avoids problems of selection bias where the design


Week 6

Estimation

• Parameters describe key features of populations
• In practical situations, parameters are unknown
• Instead, a sample is drawn from the population to provide basic data
• These data are used to calculate various sample statistics
• These sample statistics are used as estimators for population parameters

Estimators

• A statistic is any function of the data in the sample
• An estimator is a statistic whose purpose is to estimate a parameter or some function thereof
• A point estimator is simply a formula (rule) for combining sample information to produce a single number to estimate the parameter
• Estimators are random variables because they are functions of random variables

Properties of Estimators

Desirable properties for estimators include:
• Unbiasedness: if we constructed it for each of many hypothetical samples of the same size, will the estimator deliver the correct value (i.e., the value of the parameter) on average?
• Consistency: as the sample size gets larger, does the probability that the estimator deviates from the parameter by more than a "small" amount become smaller?
• Relative efficiency: if there are two competing estimators of a parameter, does the sampling distribution of one have less expected dispersion than that of the other?

• Note that "n" appears in the denominator of the variance of the sampling distribution (and hence in the standard error we just used).
• This means that the larger we set the sample, the "tighter" the distribution of the sample proportion will be.
• So, as the sample size grows, the interval we can construct, for any given statement of confidence (e.g., "95% sure"), shrinks.


"That is, we are 95% sure that OUR PARTICULAR p̂ is within 2(0.008) of p."

The Margin of Error

Choosing the Sample Size

• To get a narrower confidence interval without giving up confidence, we must choose a larger sample.
• Suppose a company wants to offer a new service and wants to estimate, to within 3%, the proportion of customers who are likely to purchase this new service, with 95% confidence.
• Q: How large a sample do they need?
• What we know: the desired ME and confidence level.
• What we don't know: n, p, or p̂.
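
One common way to answer the question is to solve the margin-of-error formula ME = z* √(p(1 − p)/n) for n. The sketch below does this under two assumptions that are not stated in the notes: the critical value 1.96 for 95% confidence, and the conservative guess p = 0.5 (which makes p(1 − p) as large as possible, since p is unknown before sampling).

```python
from math import ceil, sqrt

z_star = 1.96     # assumed critical value for 95% confidence
me = 0.03         # desired margin of error (within 3%)
p_guess = 0.5     # assumed worst case: p(1 - p) is largest at 0.5

# Solve ME = z* sqrt(p(1 - p) / n) for n, then round up to a whole respondent
n_required = ceil(z_star ** 2 * p_guess * (1 - p_guess) / me ** 2)

# Check: the margin of error actually achieved with that sample size
me_achieved = z_star * sqrt(p_guess * (1 - p_guess) / n_required)

print(n_required, round(me_achieved, 4))
```

If a prior estimate of p is available, substituting it for 0.5 gives a smaller required sample.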

Hypothesis Testing

• The underlying question in hypothesis testing:
"Do the data support a contention/belief/hypothesis about this parameter of interest?"
• We know from our work in estimating population proportions that there can be no definitive answer to this type of question
o Estimators are random variables!
• The process of hypothesis testing is potentially subject to incorrect conclusions.


Week 7

Sampling from a Normal Population

Sampling from a Non-normal Population

Central Limit Theorem (CLT)

The sampling distribution of the mean of a random sample drawn from any population with mean μ and variance σ² will be approximately normally distributed for a sufficiently large sample size.

Based on a limiting argument, it can be shown that (X̄ − μ)/(σ/√n) is approximately N(0, 1) when n is large.
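
A quick simulation makes the theorem concrete. This is a sketch under assumed settings: the population is exponential with mean 1 and variance 1 (strongly skewed, so clearly non-normal), and the sample size, number of repetitions and seed are arbitrary choices.

```python
import random
import statistics as st

random.seed(0)                 # fixed seed so the run is repeatable
n, reps = 50, 2000             # hypothetical sample size and number of samples

# Exponential(1) population: mean 1, variance 1, strongly right-skewed
sample_means = [
    st.mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

mu, sigma = 1.0, 1.0
se = sigma / n ** 0.5          # standard deviation of the sampling distribution

# Standardize each sample mean and see how often it lands within +/- 1.96
z = [(m - mu) / se for m in sample_means]
share_within_196 = sum(abs(zi) <= 1.96 for zi in z) / reps

print(round(st.mean(sample_means), 3), round(st.stdev(sample_means), 3))
print(round(share_within_196, 3))
```

If the normal approximation is working, the share of standardized means inside ±1.96 should be close to 0.95 even though the population itself is far from normal.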

How are data used to test a null hypothesis?

• Proceed by comparing a test statistic with the value specified in H0 and decide whether the difference is
o Small enough to be attributable to random sampling errors: do not reject H0, or
o So large that H0 is "more likely" not to be correct: reject H0
• Formally define a rejection (or critical) region by your choice of alternative possibilities
o If values of the test statistic are "extreme enough" (i.e., if they fall in the rejection region) then they lead us to reject H0 in favour of H1
• Other possible values of the test statistic, that are not so extreme, lie in the non-critical region


One- and two-tailed tests

• A one-tailed test defines the rejection region as one extreme end of the sampling distribution
o The null and alternative hypotheses here will look something like
o H0: μ = 0.7   H1: μ > 0.7
• A two-tailed test defines the rejection region as both extreme ends of the sampling distribution
o Here the null and alternatives might look like this
o H0: μ = 0.7   H1: μ ≠ 0.7
• Also recall the alpha level associated with a confidence level (e.g., 95%) is defined as the maximum p-value that would enable that level of confidence in our judgment (e.g., 0.05). This is the probability of our statistic falling into the rejection region when the null hypothesis is true (for a two-tailed test, split the probability across the two ends of the sampling distribution!).
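
The difference between the two kinds of test shows up clearly in a worked case. The sketch below tests H0: μ = 0.7 with a z statistic; the sample mean, the known σ and the sample size are all hypothetical values chosen for illustration.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cdf P(Z <= z), written with the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical inputs: H0: mu = 0.7, known sigma, sample of 64
mu0, sigma, n, x_bar = 0.7, 0.2, 64, 0.745
alpha = 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))        # standardized test statistic

p_one_tailed = 1 - norm_cdf(z)               # H1: mu > 0.7
p_two_tailed = 2 * (1 - norm_cdf(abs(z)))    # H1: mu != 0.7

reject_one_tailed = p_one_tailed < alpha
reject_two_tailed = p_two_tailed < alpha

print(round(z, 2), round(p_one_tailed, 4), round(p_two_tailed, 4))
print(reject_one_tailed, reject_two_tailed)
```

With these numbers the one-tailed test rejects H0 at the 5% level while the two-tailed test does not, because the two-tailed test splits α across both ends of the sampling distribution.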

Sigma

When our sample size is large
• CLT: the sampling distribution of the sample mean is approximately normal irrespective of the population distribution
• When σ is unknown and is replaced by s, our standardized test statistic remains approximately (asymptotically) normally distributed
o Why? Because in large samples s will be close to σ with high probability (it is a consistent estimator of σ)
• So, for large n nothing has changed
o This is true for the sampling distribution of both the sample mean and the sample proportion (where, recall, we had to use the sample proportion instead of the population proportion in our variance formula for the sampling distribution)

But what about small n?
• Let us confine ourselves first to sampling from a population in which the target random variable is distributed normally
• Let our sample be i.i.d. from N(μ, σ²)
o Recall that linear combinations of normal random variables are also normal, which is why we know the sample mean and its standardized version (using σ) will also be normally distributed
o But what happens when we standardize using s?

Gosset's Student t Distribution


Properties of the t Distribution

• Symmetric and unimodal
o Looks very similar to the normal, but has "fatter" tails
o Is characterized by degrees of freedom (df)
o For larger and larger df, the distribution becomes more and more like a normal distribution
• Sharpe Table A-63 only provides critical values
• For a one-tailed test, P(t > tα,df) = α
• For a two-tailed test, P(t > tα/2,df) = α/2
• Check out the "infinity line" in the t-table. Its critical values are identical to the critical values of the normal distribution!

Inference using the t distribution


Week 8

Interval Estimation

• Point estimators produce a single estimate of the parameter of interest
• In many real-world situations, some notion of the margin of error would be useful
• Interval estimators produce an interval (i.e., a range of values) and a degree of confidence associated with that interval
o Hence the name confidence interval
▪ "How often would you expect the true population parameter to be in this (sample-specific) interval?"

Interval estimation for means

• The endpoints of the interval are themselves random variables
o We have constructed a random interval
• μ is a constant
• For a particular sample (and sample mean value), μ is either in the confidence interval, or it is not
• "If 100 size-n samples were drawn, we would expect 95 of them to include μ"

Confidence Intervals

CIs for means and proportions typically have a similar structure
• Centred at sample statistics
• Endpoints are ± some multiple of the standard error (if we don't know sigma) or standard deviation (if we do know sigma) of the sampling distribution
• The "multiple" is determined by the confidence level chosen by the investigator
• Remember, if you don't know sigma and have a small sample, use the t-distribution tables to get your bounds, not the z!
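
The "95 out of 100 intervals" interpretation can be illustrated by simulation. This sketch assumes a hypothetical normal population with known σ (so the z multiple 1.96 applies); the parameter values, sample size and seed are arbitrary.

```python
import random
import statistics as st
from math import sqrt

random.seed(1)                   # fixed seed so the run is repeatable
mu, sigma, n = 10.0, 2.0, 25     # hypothetical population and sample size
z_star = 1.96                    # multiple for 95% confidence, sigma known

# Half-width = multiple x standard deviation of the sampling distribution
half_width = z_star * sigma / sqrt(n)

def one_interval():
    """Draw one sample and return its (lower, upper) confidence interval."""
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = st.mean(sample)
    return x_bar - half_width, x_bar + half_width

reps = 1000
intervals = [one_interval() for _ in range(reps)]
coverage = sum(lo <= mu <= hi for lo, hi in intervals) / reps

print(round(half_width, 3), round(coverage, 3))
```

Each interval either contains μ or it does not; the 95% figure describes the long-run share of such random intervals that do.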


Selecting Sample Size

p-values

How do I choose the significance level α?
• No rules
• Conventional choices are α = 0.1, 0.05, or 0.01

Why do I have to choose a particular α?
• You don't (though doing so can helpfully bind your hands)
• You can calculate the "empirical significance level", or p-value

The p-value associated with a given test statistic is the probability of obtaining a value of the test statistic as or more extreme than that observed, given that the null hypothesis is true
• "More extreme" depends on the form of the alternative hypothesis

Hypothesis Testing and Errors

Type I errors occur when we reject a true null hypothesis
• Only possible to make this error when the null is true
• Denote P(Type I error) = α (significance level)
• P(Reject H0 | H0 true) = α

Type II errors occur when we don't reject a false null hypothesis
• Only possible to make this error when the null is false
• Denote P(Type II error) = β
• P(Do not reject H0 | H0 not correct) = β
• P(Type II error) depends on what the actual (alternative) parameter value is!


Power of a test

Power (in statistics): the probability of correctly rejecting a false null hypothesis
• P(Do not reject H0 | H0 not correct) = β
• P(Reject H0 | H0 not correct) = Power = 1 − β

The power of a test increases when we increase the significance level (α) or increase n.
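
To see both effects numerically, here is a minimal sketch of the power of an upper-tailed z test of H0: μ = μ0 when σ is known. All numerical inputs (μ0, the true mean, σ, the sample sizes) are hypothetical, and the function name is introduced only for this illustration; 1.645 and 1.282 are the usual upper-tail critical values for α = 0.05 and α = 0.10.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cdf P(Z <= z), written with the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power_upper_tailed(mu0, mu_true, sigma, n, z_alpha):
    """P(reject H0: mu = mu0) when the true mean is mu_true."""
    se = sigma / sqrt(n)
    # H0 is rejected when x_bar > mu0 + z_alpha * se;
    # under the true mean, x_bar ~ N(mu_true, se^2)
    return 1 - norm_cdf(z_alpha - (mu_true - mu0) / se)

base = power_upper_tailed(0.7, 0.75, 0.2, 25, 1.645)            # alpha = 0.05
larger_n = power_upper_tailed(0.7, 0.75, 0.2, 100, 1.645)       # n raised to 100
larger_alpha = power_upper_tailed(0.7, 0.75, 0.2, 25, 1.282)    # alpha = 0.10

beta = 1 - base    # P(Type II error) for the baseline case

print(round(base, 3), round(larger_n, 3), round(larger_alpha, 3))
```

Note that power depends on the assumed true mean, which is why P(Type II error) can only be computed for a specific alternative value.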


Week 9

Chi-squared Tests

• Data often occur in nominal (categorical) form
• Such data feature several possible outcomes or categories for the measured phenomenon
o Categories are mutually exclusive and exhaustive
o If we think of each observation as being a trial, it's like a multinomial extension to binomial experiments
o One could imagine an expected or hypothesised distribution of outcomes across the categories, to which we can compare the distribution seen in our sample
• To compare observed and expected distributions
o We could simply calculate differences in expected and observed category frequencies
o Our inference problem is to determine whether these differences are statistically large enough to reject the claim that the expected (probability) distribution is, in fact, what the sample data were drawn from

The chi-squared goodness-of-fit test is used to test the null hypothesis that the observed and expected distributions are the same.

Chi-squared tests

• H0 specifies probabilities pi that an observation falls into each of i = 1, 2, …, c categories or cells
o H0 implies expected frequencies for a sample of size n (ei = pi n), assuming
▪ Random sampling (independent trials)
▪ Probabilities pi are constant over trials
• The test can be unreliable if any ei = pi n are too small (e.g., 3 or 4)
o Solution: merge categories together, where sensible
• The distribution theory underlying the test is not exact
o It is large-sample theory (a reason for the above limitation about small expected cell frequencies)
• The test statistic is given by χ² = Σ (fi − ei)² / ei, summed over the c categories, where fi is the observed frequency in category i

The χ² Distribution

An asymmetric distribution characterized (like the t distribution!) by degrees of freedom. Its support lies on the interval (0, ∞); that is, it is exclusively non-negative. It is the distribution of the sum of the squares of independent standard normal variables.
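
A goodness-of-fit calculation can be sketched in a few lines. The counts and hypothesised probabilities below are hypothetical. Two standard facts are assumed rather than taken from the notes: c − 1 degrees of freedom when H0 fully specifies the probabilities, and the table critical value 7.815 for 3 degrees of freedom at the 5% level.

```python
# Hypothetical observed counts in c = 4 categories (n = 100)
observed = [18, 22, 30, 30]
probs_h0 = [0.25, 0.25, 0.25, 0.25]    # probabilities specified by H0

n = sum(observed)
expected = [p * n for p in probs_h0]   # e_i = p_i * n

# Large-sample caveat: flag any small expected frequencies
smallest_expected = min(expected)

# Test statistic: sum over categories of (observed - expected)^2 / expected
chi2_stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

df = len(observed) - 1                 # assumed: c - 1 degrees of freedom
critical_5pct = 7.815                  # table value for df = 3, alpha = 0.05
reject_h0 = chi2_stat > critical_5pct

print(round(chi2_stat, 2), df, reject_h0)
```

Here the statistic (4.32) falls below the critical value, so the observed counts are not statistically far enough from the hypothesised distribution to reject H0.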


Contingency Tables

• Previously, we used such tables as descriptive tools
• We were also interested in whether the two events were independent
• Now we want to formally test whether these random variables are independent or not
o Q: Is there a statistically significant relationship between these two categorical random variables?
• The testing strategy here is similar to that used for the goodness-of-fit test
o We compare observed cell frequencies in our sample with those expected under the null hypothesis of independence
• How do you calculate the expected frequencies?
o Previously, these followed readily from the hypothesized probability distribution
o Now H0 simply asserts "independence" (or "homogeneity", if the categories used for one or both dimensions are not comprehensive) of the event described by one probability distribution with respect to the other
o Recall what is required for independent events
▪ P(A ∩ B) = P(A)P(B)
• To craft our null hypothesis, we thus set up an imaginary contingency table which assumes independence between the two aspects under analysis
o We use marginal (row and column) totals from the data to generate expected frequencies for each cell
o The expected frequency of observations in the cell in row i and column j under independence is eij = (row i total × column j total) / n
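
The expected-frequency rule follows from P(A ∩ B) = P(A)P(B): estimate each marginal probability by its total divided by n, multiply, and scale by n. The sketch below applies it to a hypothetical 2×2 table; the (rows − 1)(columns − 1) degrees of freedom is the standard result and is an assumption here, not something stated in the notes.

```python
# Hypothetical 2x2 table of observed counts
observed = [[30, 20],
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total x column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Same chi-squared statistic as for goodness of fit, summed over all cells
chi2_stat = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)   # assumed standard df

print(expected, round(chi2_stat, 3), df)
```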


Week 10

Simple Regression

Suppose we have (Yi, Xi) pairs for i = 1, …, n
• We fit a line to the data by minimizing the residual sum of squares. This is the action accomplished by "ordinary least squares" (OLS) regression.
o This line is defined by estimates of the intercept and slope in the linear relationship Yi = β0 + β1Xi + εi
• The slope coefficient has the same sign as the sample covariance (and correlation) between Y and X

Numerical versus statistical properties

• OLS can be viewed as "curve fitting"
o The curve we fit describes the relationship amongst variables
• But we also want to make inferences about the parameters of the population regression function
o How can we use b1 to make inferences about β1?
o What are the properties of b1 as an estimator of β1?
• We can also use regression models to make predictions or forecasts, e.g.,
o If a company increases its advertising expenditure, what is the predicted impact on sales?
o What is the confidence interval for that prediction?

Some Basics

Terminology
• Yi is the dependent variable
• Xi is the independent or explanatory variable
• εi is the disturbance or error term
• β0 and β1 are the parameters to be estimated

OLS produces
• Estimated parameter values b0, b1
• Predictions Ŷi = b0 + b1Xi

• Residuals ei = Yi − Ŷi

• The population regression relationship is Yi = β0 + β1Xi + εi
o This equation links the (Xi, Yi) pairs via the unknown parameters and the unobserved errors
• OLS produces Yi = b0 + b1Xi + ei
o This equation links the (Xi, Yi) pairs via estimated parameters and calculated residuals
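
The three OLS outputs (estimates, predictions, residuals) can be produced by hand for a small data set. This is a sketch with hypothetical advertising (X) and sales (Y) figures; it uses the standard closed-form least-squares solution for one regressor.

```python
# Hypothetical data: X_i (advertising) and Y_i (sales)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross-products and squares about the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx                 # estimated slope
b0 = y_bar - b1 * x_bar          # estimated intercept

y_hat = [b0 + b1 * xi for xi in x]                   # predictions
residuals = [yi - yh for yi, yh in zip(y, y_hat)]    # e_i = Y_i - Y_hat_i

print(round(b0, 4), round(b1, 4))
print([round(e, 3) for e in residuals])
```

The residuals are calculated quantities that sum to zero by construction; the disturbances εi they stand in for are never observed.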

The disturbance term εi plays a crucial role in regression
• Distinguishes regression models from deterministic functions
• Represents factors other than Xi that affect Yi
o Regression treats these other factors as unobserved
o β1 is the marginal effect of Xi on Yi, holding these other factors constant
• Reliable estimates of β1 will require assumptions restricting the relationship between Xi and εi
• Our desire to make ceteris paribus interpretations of β1 is a reason to include more explanatory variables in the regression
o Leads to an extension to multiple regression

Assumptions of Classical Linear Regression Model

Classical Linear Regression Model (CLRM)

Decomposition of Variance

Standard Error of the Estimate

The population variance, σ², measures the spread of the data around the population regression line
• The standard error of the estimate (SEE) is an estimator of σ
• It measures the fit of the regression model
• Low SEE: good fit

Coefficient of Determination

OLS Inference

• What can we say about the properties of the OLS point estimators b0 and b1, if our assumptions hold?
o They are unbiased: E(bj) = βj
o They are normally distributed, as they are linear functions of the Yi, which are assumed to be drawn from a normal distribution


o Even without normality of Yi, we can invoke the CLT and assume that bj will be asymptotically normal
• We need to know Var(b0) and Var(b1) in order to conduct inference
• Just as we did when estimating means, we can define the true and estimated variances
• The panel below gives
o True variances on the left, and
o Estimated standard errors on the right, where the unknown σ is replaced by the estimated s.


Week 11)rediction in egression

One of the main uses for regression models is in prediction or forecasting

• Basic idea: use the fitted regression line (including b0 and b1) to predict a value of Y from a given value of X

• It is natural to think of forecasting in a time series context, e.g., what will happen to sales (Y) if advertising (X) increases next year?

• Predictions can also be made using cross-sectional regression results, e.g., if the income (X) of a poor household increased, what would be the predicted impact of this on that household's food expenditures (Y)?

• It is often inadvisable to predict too far away from the (Xi, Yi) pairs to which OLS fits a line (i.e., too far "out of sample"), or for a different population

Linear Trend Model

A very simple model applied to time series data, where the X variable is "time"

• Yt = β0 + β1 t + εt (t = 1, ..., T)

• Dates are converted to time, e.g., 1__C0I becomes t = 1
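A minimal sketch of fitting a linear trend, assuming a short made-up series (not the petrol price data from the lecture). The dates are replaced by t = 1, 2, ..., and `numpy.polyfit` with degree 1 returns the slope first and the intercept second.

```python
import numpy as np

# Hypothetical series observed over six consecutive periods.
y = np.array([10.0, 12.0, 13.0, 15.0, 18.0, 19.0])
t = np.arange(1, len(y) + 1)  # dates converted to t = 1, ..., T

# Fit Y_t = b0 + b1 * t by least squares.
b1, b0 = np.polyfit(t, y, 1)

# Forecast one period beyond the sample (t = 7).
forecast = b0 + b1 * 7

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, forecast for t = 7: {forecast:.3f}")
```

Forecasting only one period ahead keeps the prediction close to the sample, in line with the warning about predicting too far out of sample.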

Prediction

Assume Yt = β0 + β1 Xt + εt

• Given estimates of β0 and β1, and a particular Xf, we can generate a point prediction (forecast) for Yf: b0 + b1 Xf

o Since Yf = E(Yf | Xf) + εf, our prediction is also an estimate of the (conditional) mean of Yf, E(Yf | Xf)

o Our estimate of that conditional mean has a sampling distribution!

o To construct a confidence interval for E(Yf | Xf), we need to know the standard error (s.e.) of the estimate of E(Yf | Xf)

o Our prediction is a linear combination of OLS estimates, so we can use a simple trick to get Excel to calculate the standard error of our prediction of E(Yf | Xf) for us!

Running OLS on the transformed model, in which Yt is regressed on (Xt − Xf), will produce the predicted Y (the intercept of the transformed model) and its associated standard error.

• That will enable us to construct a confidence interval for our prediction of the conditional mean, E(Yf | Xf)

• For the petrol prices example, the transformed model would require regressing prices (Yt) on (t − 34) [recall Xt = t]
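The transformed-regression trick can be checked numerically. The sketch below uses made-up data and a hypothetical helper `simple_ols`; it regresses Y on (X − Xf) and confirms that the intercept equals the point prediction b0 + b1·Xf, while the intercept's standard error equals the standard error of the estimated conditional mean.

```python
import numpy as np

def simple_ols(x, y):
    """Simple OLS of y on x; returns (b0, b1, se_b0)."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))
    se_b0 = s * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)
    return b0, b1, se_b0

# Hypothetical data and a hypothetical value X_f at which to predict.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
x_f = 6.0

# Original model: point prediction from the fitted line.
b0, b1, _ = simple_ols(x, y)
point_prediction = b0 + b1 * x_f

# Transformed model: regress y on (x - x_f).
# The intercept is the prediction; its s.e. is the s.e. of the predicted mean.
b0_star, b1_star, se_mean_prediction = simple_ols(x - x_f, y)

print(f"prediction = {b0_star:.3f}, s.e. = {se_mean_prediction:.3f}")
```

The slope is unchanged by the shift; only the intercept and its standard error move, which is why any regression package (Excel included) reports the needed standard error directly.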


Confidence Intervals for Predictions

There appears to be a difference between the expressions for the variance and standard error of the prediction given in the lecture and those in Sharpe (p. A1_). However, with a correction for a typo in Sharpe (!), they are indeed the same. The trick is to know the sample variance for b1, var(b1)

Prediction Y or Expected Y?

We can also construct a prediction interval for Yf

o Here we want to predict actual Y rather than mean Y

• The point predictions are the same, but the confidence interval (CI) for Yf will be wider than the CI for E(Yf | Xf)

• Why?

• See Sharpe et al.'s box on p. A91-A99 for a comparison of the difference between predicting the conditional mean E(Yf | Xf), and predicting actual Yf for a particular population member.
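One way to see why the interval for the actual Yf is wider is to compare the two standard errors side by side. The expressions below are the standard simple-regression formulas, written here as a sketch rather than copied from the lecture or from Sharpe; s is the standard error of the estimate and n is the sample size.

```latex
\begin{align*}
\operatorname{se}\!\left[\widehat{E}(Y_f \mid X_f)\right]
  &= s\,\sqrt{\frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}} \\[6pt]
\operatorname{se}\!\left[Y_f - \widehat{Y}_f\right]
  &= s\,\sqrt{1 + \frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}}
\end{align*}
```

The extra 1 under the second square root comes from the error term εf of the particular population member: even if the regression line were known exactly, the actual Yf would still scatter around it with variance σ². Both expressions also grow as Xf moves away from the sample mean of X, which is the formal counterpart of the warning about predicting far out of sample.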

A note on Errors, Residuals and Related Matters


Motivation for Multiple Regression

• A key threat to the appropriate interpretation of simple linear regression results is the problem of omitted variables

o Our estimates may not accurately reflect the effect of height on income holding all other omitted factors constant

o This is often called a problem of confounding and is due to omitted variable bias

o This bias occurs when the following facts BOTH hold:

  ▪ A variable omitted from the regression is correlated with the included explanatory variable

  ▪ That omitted variable independently influences the dependent variable

• A primary motivation for multiple regression is to avoid this type of confounding
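A small simulation illustrates omitted variable bias. Everything here is an assumption made for illustration: the data are simulated, the true coefficients (2 on x, 3 on z) are chosen arbitrarily, and z plays the role of the omitted factor that is correlated with x and also affects y.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

z = rng.normal(size=n)                            # the variable we will omit
x = 0.8 * z + rng.normal(size=n)                  # x is correlated with z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)  # z also influences y

def ols(columns, y):
    """OLS with an intercept; returns [b0, b1, ...]."""
    X = np.column_stack([np.ones(len(y))] + columns)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols([x], y)     # simple regression: z omitted
b_long = ols([x, z], y)   # multiple regression: z included

print(f"slope on x, z omitted:  {b_short[1]:.3f}")
print(f"slope on x, z included: {b_long[1]:.3f}")
```

With z omitted, the slope on x absorbs part of the effect of z and lands well above the true value of 2; including z brings it back close to 2. If either condition failed (z uncorrelated with x, or z having no effect on y), the two slopes would be approximately equal.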

Interpretation in Multiple Regression

Estimation and Inference in Multiple Regression

• Conceptually, little has changed relative to simple linear regression

o Our prior assumptions, A1-A4, are essentially the same for multiple regression

o We need only add one additional assumption

  ▪ A5: No perfect (multi)collinearity, which precludes exact linear relationships between explanatory variables

E.g. In a regression of weight on height and sex, we cannot include both a dummy for male and a dummy for female, because the two dummy variables would sum to unity (i.e., 1), which is already captured in the intercept.

Intuition: You can only use a unique piece of information once in a regression model. Knowing someone is a man means you also know he is not a woman!
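The male/female example can be verified by checking the rank of the regressor matrix. The sketch below uses six made-up observations; with an intercept, height, and both dummies, the four columns carry only three independent pieces of information, so OLS has no unique solution.

```python
import numpy as np

# Hypothetical sample of six people.
height = np.array([170.0, 182.0, 165.0, 158.0, 175.0, 180.0])
male = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])
female = 1.0 - male               # male + female = 1 for everyone
const = np.ones_like(height)      # the intercept column

# Both dummies plus an intercept: perfect collinearity.
X_trap = np.column_stack([const, height, male, female])
# Dropping one dummy removes the exact linear relationship.
X_ok = np.column_stack([const, height, male])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(f"4 columns, rank {rank_trap}; 3 columns, rank {rank_ok}")
```

A rank below the number of columns is exactly what the no-perfect-collinearity assumption rules out.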


"Linear" Regression


Week 12

The F-Test

• The t-tests on individual coefficients tell you about one variable at a time

o E.g., "is gender related to income?"

• But we can also test the overall significance of all variables, also referred to as the "significance of the model"

o This is essentially a test of the significance of the R-sq

 

• The null hypothesis that this test assumes is that all slope coefficients are equal to zero

• The alternative hypothesis is that at least one of them is non-zero

• As with the Chi-sq test, reporting the F-test result is not informative unless you interpret where that result is coming from

o You can do this by looking at which variables are significant, and which are insignificant, using individual t-tests
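The link between the overall F-test and the R-sq can be made explicit. The sketch below assumes the standard formula F = (R-sq / k) / ((1 − R-sq) / (n − k − 1)), where k is the number of slope coefficients and n the number of observations; the function name and the numbers are hypothetical.

```python
def overall_f_statistic(r_squared, n, k):
    """F statistic for H0: all k slope coefficients equal zero."""
    explained = r_squared / k                      # explained share per slope
    unexplained = (1.0 - r_squared) / (n - k - 1)  # unexplained share per d.o.f.
    return explained / unexplained

# Hypothetical example: R-sq of 0.6 from a simple regression (k = 1) with n = 5.
f_stat = overall_f_statistic(0.6, n=5, k=1)

print(f"F = {f_stat:.3f}")
```

The resulting statistic is compared with an F distribution with k and n − k − 1 degrees of freedom. In a simple regression (k = 1) it equals the square of the t statistic on the slope, so the two tests agree.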

Dummy Variable Arrays

• Recall the example last time with an array of dummies to indicate educational attainment

• Suppose we are modelling retail sales and want to capture the month of the observation

• We could

o Create ONE indicator for month, taking the value 1, 2, 3, ..., 12

o Create twelve SEPARATE indicators, one for each month, each taking the value 0 or 1 for all observations, and leave one indicator out of the model

• Which is better?
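A short sketch of the two options, using made-up month codes. The second option is usually the safer choice: a single 1-12 indicator forces sales to change by the same amount for every step through the calendar, whereas separate dummies let each month have its own effect relative to the omitted month. Treating January as the omitted baseline is an arbitrary choice made here for illustration.

```python
import numpy as np

# Hypothetical month codes (1 = January, ..., 12 = December) for eight observations.
month = np.array([1, 2, 3, 12, 7, 7, 1, 11])

# Option 1: ONE numeric indicator. Used as a regressor, it imposes the same
# change in sales for each one-month step from 1 to 12.
single_indicator = month.astype(float)

# Option 2: SEPARATE 0/1 indicators for February..December.
# January is left out and becomes the baseline captured by the intercept.
included_months = np.arange(2, 13)
month_dummies = (month[:, None] == included_months).astype(int)

print(month_dummies.shape)  # one row per observation, one column per included month
```

Leaving one indicator out is what keeps the dummy array from summing to the intercept column.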


Modelling Non-linear Relationships

• We mentioned last time how nonlinear relationships between X and Y can be modelled via variable transformation prior to estimation

• How do we interpret the results of the following (fairly common) models?

Multicollinearity

Predictor variables exhibit collinearity (or multicollinearity) when one of the predictors can be predicted "well" (not perfectly!) from the others.

Consequences of collinearity

• Estimated coefficients can be surprising, taking on an unanticipated sign or being unexpectedly large or small.

• The stronger the association of one variable with the others in the model, the more the variance of its estimated coefficient increases ("variance inflation"). This can lead to a smaller t statistic and correspondingly larger p-value.

• "High" collinearity leads to the coefficient being poorly estimated and having a large standard error (and correspondingly low t-statistic). The coefficient may seem to be the wrong size, or even the wrong sign.

• If a multiple regression model has a high R-sq and large F, but the individual t statistics are not significant, you should suspect collinearity.

• Collinearity is measured in terms of the association between a predictor and all of the other predictors in the model, not in terms of just the correlation between any two predictors.
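One common way to quantify this association is the variance inflation factor, VIF = 1 / (1 − R-sq_j), where R-sq_j comes from regressing predictor j on all the other predictors. The notes mention variance inflation but give no formula, so the sketch below is an assumption about how it would be computed; the data are simulated and the function name is hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (X holds the predictors, without an intercept)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Auxiliary regression: predictor j on an intercept and all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, target, rcond=None)[0]
        resid = target - others @ coef
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Simulated predictors: x2 is almost a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

vif = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vif, 1))
```

The first two predictors get very large VIFs because each is well predicted by the other, while the unrelated third predictor has a VIF close to 1.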


What can you do about multicollinearity?

You can try to simplify the model by removing some of the predictors. Which should you keep?

• Variables that are inherently the most important to the problem

• Variables that are the most reliably measured

Note that a moderate association between two or more independent variables is not a big deal. In fact, if all independent variables were orthogonal, there would be no point in estimating a model including more than one of them.