9/15/2015330 lecture 11 stats 330: lecture 1. 9/15/2015330 lecture 12 today’s agenda: introductory...
TRANSCRIPT
04/21/23 330 Lecture 1 1
STATS 330: Lecture 1
04/21/23 330 Lecture 1 2
Today’s agenda:• Introductory Comments:
– Housekeeping– Computer details– Plan of the course
• Statistical Modelling: an overview– Our Analysis strategy
– Goals for the course
• Role of Graphics• Data cleaning
04/21/23 330 Lecture 1 3
HousekeepingContact details….
Plus much else on course web page
www.stat.auckland.ac.nz/~lee/330/
Or via Cecil
04/21/23 330 Lecture 1 4
04/21/23 330 Lecture 1 5
Assignments• There will be five assignments
• (20% of total grade)
• The due dates are listed in the course summary
• Assignment 1 is due on August 2.
• No paper will be issued in class – download assignments and data from the Web or Cecil
04/21/23 330 Lecture 1 6
Tutorials• These will cover computing details
• Held in basement tutorial lab, 303S Extension (Rm 303S-B75)– Tutorial 1: 11-12 Wed– Tutorial 2: 10-11 Fri– Tutorial 3: 2-3 Fri
• Start second week of semester
04/21/23 330 Lecture 1 7
Computer details• All analyses will be done using R• I require homework to be typed, use Word, cut
and paste R text and graphics into Word• Use home/laptop computer or basement lab
(303S)• See web page for info on downloading R330
package• URL is www.stat.auckland.ac.nz/~lee/330
04/21/23 330 Lecture 1 8
Course Plan• There will be 33 lectures, divided into
chapters as follows• Chapter 1: Introduction (1 lecture, this one!)• Chapter 2: Graphics (3 lectures)• Chapter 3: Multiple Regression Model (12 lectures)• Chapter 4: Factors (4 lectures)• Chapter 5: Models for categorical and count
responses (12 lectures)• Revision (1 lecture)
04/21/23 330 Lecture 1 9
Course book• This covers most of the material we will
discuss in the lectures
• Has 5 chapters corresponding to the division on the previous slide
• On-line version available on the course web site
• Paper copies available from the Statistics Department, Commerce A
04/21/23 330 Lecture 1 10
Overview of Statistical Modelling
• Statistical models summarize relationships between variables
• Regression models focus on one variable (the response) and how its distribution can be modelled by one or more explanatory variables
04/21/23 330 Lecture 1 11
Example: Response is BPD
BPD: Bi-Parietal Diameter
For an unborn baby, depends on several factors,
including Gestational Age.
Ultrasound image
04/21/23 330 Lecture 1 12
04/21/23 330 Lecture 1 13
Points to take into account:
• Not all babies of the same age have the same BPD.
• In general, the older the baby, the bigger the BPD.
04/21/23 330 Lecture 1 14
Model relating BPD and GA
Gestational Age
BPD
Mean BPD for 30 weeks = + 30
Mean BPD for 40 weeks = a + 40
Mean BPD for GA weeks = a + GA
Line is BPD =a + b GA
Each bell curve has sd
10 mm
30 40
04/21/23 330 Lecture 1 15
Regression model features
• Model specifies the distribution of the response
• Also says how the distribution of the response is affected by the covariates (explanatory variables)
• In this case the covariate GA determines the mean response:
mean BPD = a + b GAie a straight-line relationship.
04/21/23 330 Lecture 1 16
In general….• Response has a distribution depending on
(conditional on) the explanatory variables
• Mean of the distribution given by some function of the explanatory variables
• Need to – describe the function– describe the variability
• Balance accuracy with simplicity.
04/21/23 330 Lecture 1 17
Our Analysis strategy:
• Explore the data using graphics and summary statistics
• Construct a useful model (trial and error)
• Use the model to gain knowledge about the system under study (How big should a 40 week baby’s BPD be? )
• Communicate findings (be able to write a report!!)
04/21/23 330 Lecture 1 18
Goals for 330• Get more practice in exploring data• Expand knowledge of regression• Get better at fitting models• Improve your diagnostic skills (how to
recognize when a model doesn’t describe the data properly)
• Improve your interpretation and communication skills, in a more flexible way.
04/21/23 330 Lecture 1 19
Role of Graphics• Data Cleaning: Are there gross errors, outliers,
special codings (eg 999 for missing), missing data, absolute rubbish in the data?
• Exploratory analysis: what sort of model might be appropriate?
• Diagnostics: having tentatively selected a model, is it any good? (Residual plots, etc)
04/21/23 330 Lecture 1 20
Exploratory analysis for BPD data
25 30 35 40
70
75
80
85
90
95
10
0
GA
BP
D Outlier
Linear relationship
(straight line)appropriate?
04/21/23 330 Lecture 1 21
Diagnostic Graphics: residuals versus fitted values
70 75 80 85 90-20
-15
-10
-50
5
Fitted Values
Re
sid
ua
lsSmoothing enhances
interpretation.Conclusion?
Outlier stands out
04/21/23 330 Lecture 1 22
Data Cleaning• All real-life data sets are likely to contain errors• These will usually be revealed by suitable plots,
such as the one on the previous slide• Before starting any analysis, data should always
be carefully checked• To encourage this practice, I will occasionally
introduce errors into the assignment data supplied on the web.
04/21/23 330 Lecture 1 23
Data Cleaning (2)It is your responsibility to check all data supplied on the web, against the assignment sheets . Failure to so will cost marks.
YOU HAVE
BEEN WARNED!!!