introduction, sections 1.1 and 1.3, the r statistical...

44
Introduction, Sections 1.1 and 1.3, The R Statistical Package Shiwen Shen Department of Statistics University of South Carolina Elementary Statistics for the Biological and Life Sciences (STAT 205) 1 / 44

Upload: others

Post on 10-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Introduction, Sections 1.1 and 1.3,The R Statistical Package

Shiwen Shen

Department of StatisticsUniversity of South Carolina

Elementary Statistics for the Biological and Life Sciences

(STAT 205)

1 / 44

Page 2: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Grading, homework, exams, office hours

Course webpage:http://people.stat.sc.edu/sshen/17sstat205.htmlGrade is: 45% for homework & quizzes, 15% each forexams I, II, and final; 10% attendance.Homework: there will be weekly homework graded over thecourse except the exam weeks.Exams are fixed time and location (Feb 15, Mar 22, May 1).Do not ask to turn in homework late; late or missedhomework will receive zero score. Lowest homeworkscore will be dropped from your grade.Both the instructor and teaching assistant (TA) will holdoffice hours, see syllabus for details. You only askhomework related questions in TA’s office hours.

2 / 44

Page 3: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Topics we’ll cover

Graphical displays and summary statisticsProbability, random variables (normal and binomial)Confidence intervals for µ and pTwo-sample testing and CI2× 2 tables: relative risk & odds ratiosAnalysis of varianceLinear regressionLogistic regression, survival analysis, ROC curvesUse of the statistical package R to analyze real data

3 / 44

Page 4: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

R computing & graphics package

R is a powerful, free statistical computing and graphicspackage.Popular with many researchers due to contributedpackages: R functions to do specialized, advanced, &often complex statistical analysis.R can also do many important, routine calculations,analysis, and provide common graphical displays used inthis course.Installed in several of the computing labs across campus,e.g. Sloan 108 & 109, Gambrell 003.You can download it and install it from CRAN:http://cran.r-project.org/

4 / 44

Page 5: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

5 / 44

Page 6: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

R’s power

Companies as diverse as Google, Pfizer, Merck, Bank ofAmerica, the InterContinental Hotels Group and Shell useit.R is really important to the point that its hard to overvalueit, said Daryl Pregibon, a research scientist at Google,which uses the software widely. It allows statisticians to dovery intricate and complicated analyses without knowingthe blood and guts of computing systems.It is free.You can download it and install it from CRAN:http://cran.r-project.org/

6 / 44

Page 7: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Get Started

Installationgoogle R→ The R project for Statistical ComputingR64 bits (large datasets) vs. R32 bits

Getting help with RAt the command prompt, type, for example ?read.table orhelp(read.table)At the command prompt, type, for example,help.search(”read”) or apropos(”read”).Within R, use the menu bar: Help: R help.Quititng R. q()How to save script / work space.

7 / 44

Page 8: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

R Resources

John Verzani’s SimpleR notesCRAN (Document/Manuals)

Note: To run some of the example in John Verzani’s notes, run:install.packages(”UsingR”)library(UsingR)

8 / 44

Page 9: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

More on R

R allows you to do all analysis covered in this course, andbeyond.There are some tutorials, both installed in R and on theweb. Under Help choose Manuals (in PDF) andchoose An introduction to R. This is a longerintroduction to R.For homework, I will give you a skeleton set of commandsto get the basic job done.R’s error messages can be cryptic.However it is free and is being used by hundreds ofthousands of people.

9 / 44

Page 10: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

The Comprehensive R Archive Network

Here is where you download R.

10 / 44

Page 11: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Installing R

From http://cran.r-project.org/, under Download and Install Rclick on your platform (Linux, MacOS X, or Windows).

for Windows click on Download R for Windows.

Click Save File and when it’s done downloading run the executable by clicking

on it – alternatively you can choose to Run Program directly after downloadingfrom the web.

The installation program will ask you a series of questions; choose the defaults.(e.g. English language, the suggested installation folder, the checked selectedcomponents to install, not to customize startup options, shortcut in the StartMenu, and additional tasks).

When it’s done, click on the new R desktop icon. Click on the console. This iswhere you will type commands to R.

11 / 44

Page 12: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

The R interface

Initially, there is only the console window open. If you makeplots, other windows will open too.

12 / 44

Page 13: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

The R interface

Usually we open a script window (File - New script), andprogram everything in the script window.

13 / 44

Page 14: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Some code to try

Note that the # sign is a “comment” – R ignores anything after#.

# calculate 3 times 43 * 4# generate a sequence of integers from 1 to 8seq(from=1, to=8, by=1)# generate some random normal datadata=rnorm(100)# look at a histogram and a boxplothist(data)boxplot(data)# compute the sample mean, median, variance, standard deviationmean(data)median(data)var(data)sd(data)

# if you have a question about a command, preface it with ??hist

14 / 44

Page 15: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Numerical Reasoning: Example1

The following are profits (in millions $) per month from Jan. toDec.20, 20.4, 20.8, 20.9, 21, 21.2, 21.2, 21.2,21.4, 21.6, 21.8, 22.Suppose you want to show your boss how much the profits aregrowing over the year.

How might you graph the data?

15 / 44

Page 16: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

16 / 44

Page 17: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Numerical Reasoning: Example1

Suppose the data are exactly the same as before, but the datarepresent expenses.Now you want to convice your boss that the expenses are notgrowing too much over the year.

How might you graph the data?

17 / 44

Page 18: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

18 / 44

Page 19: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Example 1.1.4: MAO and schizophrenia

Monoamine oxidase (MAO) enzyme is thought to regulatehuman’s behavior.Blood from n = 42 schizophrenia patients collected, stratified bydiagnosis (I, II, III).Are there differences in MAO enzyme and disease diagnosis?

19 / 44

Page 20: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Example 1.1.4: MAO and schizophrenia

What happens to MAO as severity of diagnosis increases? Is therelationship perfect?These are side-by-side dotplots, described in Sec. 2.2. Formalapproach in Chapter 11.

20 / 44

Page 21: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Example 1.1.2: liver tumors in mice

Is there a risk difference between germ environment(germ-free vs. E. coli) and whether liver tumors develop?Is the association perfect?Statistics can help answer whether there’s a difference andfurther quantify the effect of germ exposure (Chapter 10).

21 / 44

Page 22: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

1.1 Statistics and the life sciences

Statistics is the science ofcollecting,summarizing,analyzing,interpreting

data.Goal: to understand the underlying biological phenomenathat generate the data.Statistics separates signal (difference) from noise(variation).

22 / 44

Page 23: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Types of Evidence: Two main types of data collection

Observational studyAnecdotal evidenceProspective observational studiesRetrospective case-control studies (case-control)

ExperimentRandomized clinical trials

23 / 44

Page 24: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Anecdotal evidence

Example 1.2.1: On 15 July 1911, 65-year-old Mrs. Jane Deckerwas struck by lightning while in her house. She had been deafsince birth, but after being struck, she recovered her hearing.

Is this compelling evidence that lightning is a cure for deafness?

An anecdote is a short story or an example of an interestingevent.

24 / 44

Page 25: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Terminology: Treatment and endpoint

Smoking → lung cancer(exposure) (outcome)(treatment) (endpoint)

Anecdotal evidence example: My uncle smoke and hegot lung cancer.Prospective observational study example: 100 smokersand 100 nonsmoker were enrolled, then were followed upto determine whether they developed lung cancer.Retrospective case-control study example: 100 lungcancer patients and 100 normal control were enrolled; thensmoking history for each participants was examined.Randomized clinical trial example: Participants wererandomized to receive low-nicotine cigarette or regularcigarette; then were followed-up to determine whether theydeveloped lung cancer.

25 / 44

Page 26: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Prospective observational studies:Example 1.2.2: Sexual orientation

Study enrolled 30 homosexual men, 30 heterosexual men,and 30 heterosexual womenThen observe endpoints: midsagittal area of the anteriorcommissure (AC)

26 / 44

Page 27: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Retrospective observational studiesCase-control study

Study enrolled 42 patients with schizophreniaThen observe exposure: MAO enzyme activities in patients

27 / 44

Page 28: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

1.2 Type of evidence

Sex and race distribution of 158 cases of abdominal aorticaneurysms (AAA) at metropolitan hospitals in a southern city

Sex & Race # AAAWhite Males 93

AA Males 30White Females 22

AA Females 13

28 / 44

Page 29: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Authors Conclusion:

The incidence of AAA is almost 3 times more frequent inWhites than African-Americans. Whites have greater risk ofdeveloping AAA than African-Americans.

Do you agree?

29 / 44

Page 30: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

What if the population of the city has 3 times as many whites asAfrican-Americans? Then this would not be a fair comparisonof the 2 groups. The number of AAA for each group needs tobe divided by population of each group. Comparing rate of AAAper 100,000 between each of the 2 groups would be faircomparison.

This is an example of lack of denominators.

30 / 44

Page 31: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

1.2 Types of Evidence

31 / 44

Page 32: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

1.2 Types of Evidence: The need of a control group

A review of medical records for 3,000 diabetic patientsApproximately two-thirds of patients at some time were 11% ormore overweight

Conclusion: This provides evidence of an association betweenobesity and diabetes.

Do you agree? What evidence is needed to prove theassociation?

32 / 44

Page 33: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

1.2 Types of Evidence: The need of a control group

What if two-thirds of non-diabetic patients at some time were11% or more overweight as well? Then there would be noassociation between obesity and diabetes in this study. Acomparison group of non-diabetic patients is needed.

This fallacy is known as a lack of a control group.

33 / 44

Page 34: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Ice cream consumption and drowning death

A researcher looked into a public database from a southern city,and found an association between ice-cream consumption andnumber of drowning deaths for a given period. He concludedthat ice cream could cause drowning.

Do you agree? Why do you think the researcher found suchassociation?

34 / 44

Page 35: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Confounding

Associated with XIndependently associated with YNot in the causal pathway between X and Yex: ice cream consumption and drowning death

35 / 44

Page 36: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Prospective observational studiesExample 1.2.2: Sexual orientation

Study enrolled 30 homosexual men, 30 heterosexual men,and 30 heterosexual womenThen observe endpoints: midsagittal area of the anteriorcommissure (AC)

36 / 44

Page 37: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Observational studies: Confounding

37 / 44

Page 38: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Types of Evidence: Two main types of data collection

Observational studyAnecdotal evidenceProspective observational studiesRetrospective case-control studies (case-control)

ExperimentRandomized clinical trials

38 / 44

Page 39: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Hierarchy for scientific evidencein medical studies

Case seriesRetrospective case-control studiesProspective observational studiesRandomized clinical trials

39 / 44

Page 40: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Double-blinded experiment

40 / 44

Page 41: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Placebo

41 / 44

Page 42: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Experiment

A research method where the researcher manipulates onevariable at a time in order to prove causality.

42 / 44

Page 43: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Randomized experiment

120 larva were released in the center of the dishDarker-shaded: nodulated rootsLight-shaded: non-nodulated roots

which arrangement is better?

43 / 44

Page 44: Introduction, Sections 1.1 and 1.3, The R Statistical Packagepeople.stat.sc.edu/sshen/courses/17sstat205/course materials/Notes/Notes 1.pdfR is a powerful, free statistical computing

Random sampling

The population is all the subjects/animals/specimens/etc.of interest.Since we can’t measure the entire population (usually) wetake a small sample of size n and use the data collected toinfer about the population.

44 / 44