introduction, sections 1.1 and 1.3, the r statistical...
TRANSCRIPT
Introduction, Sections 1.1 and 1.3,The R Statistical Package
Shiwen Shen
Department of StatisticsUniversity of South Carolina
Elementary Statistics for the Biological and Life Sciences
(STAT 205)
1 / 44
Grading, homework, exams, office hours
Course webpage:http://people.stat.sc.edu/sshen/17sstat205.htmlGrade is: 45% for homework & quizzes, 15% each forexams I, II, and final; 10% attendance.Homework: there will be weekly homework graded over thecourse except the exam weeks.Exams are fixed time and location (Feb 15, Mar 22, May 1).Do not ask to turn in homework late; late or missedhomework will receive zero score. Lowest homeworkscore will be dropped from your grade.Both the instructor and teaching assistant (TA) will holdoffice hours, see syllabus for details. You only askhomework related questions in TA’s office hours.
2 / 44
Topics we’ll cover
Graphical displays and summary statisticsProbability, random variables (normal and binomial)Confidence intervals for µ and pTwo-sample testing and CI2× 2 tables: relative risk & odds ratiosAnalysis of varianceLinear regressionLogistic regression, survival analysis, ROC curvesUse of the statistical package R to analyze real data
3 / 44
R computing & graphics package
R is a powerful, free statistical computing and graphicspackage.Popular with many researchers due to contributedpackages: R functions to do specialized, advanced, &often complex statistical analysis.R can also do many important, routine calculations,analysis, and provide common graphical displays used inthis course.Installed in several of the computing labs across campus,e.g. Sloan 108 & 109, Gambrell 003.You can download it and install it from CRAN:http://cran.r-project.org/
4 / 44
5 / 44
R’s power
Companies as diverse as Google, Pfizer, Merck, Bank ofAmerica, the InterContinental Hotels Group and Shell useit.R is really important to the point that its hard to overvalueit, said Daryl Pregibon, a research scientist at Google,which uses the software widely. It allows statisticians to dovery intricate and complicated analyses without knowingthe blood and guts of computing systems.It is free.You can download it and install it from CRAN:http://cran.r-project.org/
6 / 44
Get Started
Installationgoogle R→ The R project for Statistical ComputingR64 bits (large datasets) vs. R32 bits
Getting help with RAt the command prompt, type, for example ?read.table orhelp(read.table)At the command prompt, type, for example,help.search(”read”) or apropos(”read”).Within R, use the menu bar: Help: R help.Quititng R. q()How to save script / work space.
7 / 44
R Resources
John Verzani’s SimpleR notesCRAN (Document/Manuals)
Note: To run some of the example in John Verzani’s notes, run:install.packages(”UsingR”)library(UsingR)
8 / 44
More on R
R allows you to do all analysis covered in this course, andbeyond.There are some tutorials, both installed in R and on theweb. Under Help choose Manuals (in PDF) andchoose An introduction to R. This is a longerintroduction to R.For homework, I will give you a skeleton set of commandsto get the basic job done.R’s error messages can be cryptic.However it is free and is being used by hundreds ofthousands of people.
9 / 44
The Comprehensive R Archive Network
Here is where you download R.
10 / 44
Installing R
From http://cran.r-project.org/, under Download and Install Rclick on your platform (Linux, MacOS X, or Windows).
for Windows click on Download R for Windows.
Click Save File and when it’s done downloading run the executable by clicking
on it – alternatively you can choose to Run Program directly after downloadingfrom the web.
The installation program will ask you a series of questions; choose the defaults.(e.g. English language, the suggested installation folder, the checked selectedcomponents to install, not to customize startup options, shortcut in the StartMenu, and additional tasks).
When it’s done, click on the new R desktop icon. Click on the console. This iswhere you will type commands to R.
11 / 44
The R interface
Initially, there is only the console window open. If you makeplots, other windows will open too.
12 / 44
The R interface
Usually we open a script window (File - New script), andprogram everything in the script window.
13 / 44
Some code to try
Note that the # sign is a “comment” – R ignores anything after#.
# calculate 3 times 43 * 4# generate a sequence of integers from 1 to 8seq(from=1, to=8, by=1)# generate some random normal datadata=rnorm(100)# look at a histogram and a boxplothist(data)boxplot(data)# compute the sample mean, median, variance, standard deviationmean(data)median(data)var(data)sd(data)
# if you have a question about a command, preface it with ??hist
14 / 44
Numerical Reasoning: Example1
The following are profits (in millions $) per month from Jan. toDec.20, 20.4, 20.8, 20.9, 21, 21.2, 21.2, 21.2,21.4, 21.6, 21.8, 22.Suppose you want to show your boss how much the profits aregrowing over the year.
How might you graph the data?
15 / 44
16 / 44
Numerical Reasoning: Example1
Suppose the data are exactly the same as before, but the datarepresent expenses.Now you want to convice your boss that the expenses are notgrowing too much over the year.
How might you graph the data?
17 / 44
18 / 44
Example 1.1.4: MAO and schizophrenia
Monoamine oxidase (MAO) enzyme is thought to regulatehuman’s behavior.Blood from n = 42 schizophrenia patients collected, stratified bydiagnosis (I, II, III).Are there differences in MAO enzyme and disease diagnosis?
19 / 44
Example 1.1.4: MAO and schizophrenia
What happens to MAO as severity of diagnosis increases? Is therelationship perfect?These are side-by-side dotplots, described in Sec. 2.2. Formalapproach in Chapter 11.
20 / 44
Example 1.1.2: liver tumors in mice
Is there a risk difference between germ environment(germ-free vs. E. coli) and whether liver tumors develop?Is the association perfect?Statistics can help answer whether there’s a difference andfurther quantify the effect of germ exposure (Chapter 10).
21 / 44
1.1 Statistics and the life sciences
Statistics is the science ofcollecting,summarizing,analyzing,interpreting
data.Goal: to understand the underlying biological phenomenathat generate the data.Statistics separates signal (difference) from noise(variation).
22 / 44
Types of Evidence: Two main types of data collection
Observational studyAnecdotal evidenceProspective observational studiesRetrospective case-control studies (case-control)
ExperimentRandomized clinical trials
23 / 44
Anecdotal evidence
Example 1.2.1: On 15 July 1911, 65-year-old Mrs. Jane Deckerwas struck by lightning while in her house. She had been deafsince birth, but after being struck, she recovered her hearing.
Is this compelling evidence that lightning is a cure for deafness?
An anecdote is a short story or an example of an interestingevent.
24 / 44
Terminology: Treatment and endpoint
Smoking → lung cancer(exposure) (outcome)(treatment) (endpoint)
Anecdotal evidence example: My uncle smoke and hegot lung cancer.Prospective observational study example: 100 smokersand 100 nonsmoker were enrolled, then were followed upto determine whether they developed lung cancer.Retrospective case-control study example: 100 lungcancer patients and 100 normal control were enrolled; thensmoking history for each participants was examined.Randomized clinical trial example: Participants wererandomized to receive low-nicotine cigarette or regularcigarette; then were followed-up to determine whether theydeveloped lung cancer.
25 / 44
Prospective observational studies:Example 1.2.2: Sexual orientation
Study enrolled 30 homosexual men, 30 heterosexual men,and 30 heterosexual womenThen observe endpoints: midsagittal area of the anteriorcommissure (AC)
26 / 44
Retrospective observational studiesCase-control study
Study enrolled 42 patients with schizophreniaThen observe exposure: MAO enzyme activities in patients
27 / 44
1.2 Type of evidence
Sex and race distribution of 158 cases of abdominal aorticaneurysms (AAA) at metropolitan hospitals in a southern city
Sex & Race # AAAWhite Males 93
AA Males 30White Females 22
AA Females 13
28 / 44
Authors Conclusion:
The incidence of AAA is almost 3 times more frequent inWhites than African-Americans. Whites have greater risk ofdeveloping AAA than African-Americans.
Do you agree?
29 / 44
What if the population of the city has 3 times as many whites asAfrican-Americans? Then this would not be a fair comparisonof the 2 groups. The number of AAA for each group needs tobe divided by population of each group. Comparing rate of AAAper 100,000 between each of the 2 groups would be faircomparison.
This is an example of lack of denominators.
30 / 44
1.2 Types of Evidence
31 / 44
1.2 Types of Evidence: The need of a control group
A review of medical records for 3,000 diabetic patientsApproximately two-thirds of patients at some time were 11% ormore overweight
Conclusion: This provides evidence of an association betweenobesity and diabetes.
Do you agree? What evidence is needed to prove theassociation?
32 / 44
1.2 Types of Evidence: The need of a control group
What if two-thirds of non-diabetic patients at some time were11% or more overweight as well? Then there would be noassociation between obesity and diabetes in this study. Acomparison group of non-diabetic patients is needed.
This fallacy is known as a lack of a control group.
33 / 44
Ice cream consumption and drowning death
A researcher looked into a public database from a southern city,and found an association between ice-cream consumption andnumber of drowning deaths for a given period. He concludedthat ice cream could cause drowning.
Do you agree? Why do you think the researcher found suchassociation?
34 / 44
Confounding
Associated with XIndependently associated with YNot in the causal pathway between X and Yex: ice cream consumption and drowning death
35 / 44
Prospective observational studiesExample 1.2.2: Sexual orientation
Study enrolled 30 homosexual men, 30 heterosexual men,and 30 heterosexual womenThen observe endpoints: midsagittal area of the anteriorcommissure (AC)
36 / 44
Observational studies: Confounding
37 / 44
Types of Evidence: Two main types of data collection
Observational studyAnecdotal evidenceProspective observational studiesRetrospective case-control studies (case-control)
ExperimentRandomized clinical trials
38 / 44
Hierarchy for scientific evidencein medical studies
Case seriesRetrospective case-control studiesProspective observational studiesRandomized clinical trials
39 / 44
Double-blinded experiment
40 / 44
Placebo
41 / 44
Experiment
A research method where the researcher manipulates onevariable at a time in order to prove causality.
42 / 44
Randomized experiment
120 larva were released in the center of the dishDarker-shaded: nodulated rootsLight-shaded: non-nodulated roots
which arrangement is better?
43 / 44
Random sampling
The population is all the subjects/animals/specimens/etc.of interest.Since we can’t measure the entire population (usually) wetake a small sample of size n and use the data collected toinfer about the population.
44 / 44