stt481 capstone in statistics · r is popular “r and python are the two most popular programming...

27
STT481 Capstone in Statistics Lecture 1: Introduction to R Chih-Li Sung 8/31/2018 (modified from http://datascienceandr.org/slide/RBasic-Introduction.html#1) 1 / 27

Upload: others

Post on 24-Sep-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

STT481 Capstone in StatisticsLecture 1: Introduction to R

Chih-Li Sung

8/31/2018 (modified fromhttp://datascienceandr.org/slide/RBasic-Introduction.html#1)

1 / 27

Page 2: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Agenda

I Basic RI Application of RI Install R and RstudioI How to learn R - swirl

2 / 27

Page 3: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

What is R

I R is a language and environment for statistical computing andgraphics.

I R provides a wide variety of statistical (linear and nonlinearmodelling, classical statistical tests, time-series analysis,classification, clustering, . . . ) and graphical techniques, and ishighly extensible.

I R is free!I R was initially written by Ross Ihaka and Robert Gentleman at

the Department of Statistics of the University of Auckland inAuckland, New Zealand.

I The current group (“R Core Team”) consists of professionalstatisticians all over the world.

source: link

3 / 27

Page 4: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

R is popular“R and Python are the two most popular programminglanguages used by data analysts and data scientists” —source: link

I Impressive Growth

Figure 1:

source:https://stackoverflow.blog/2017/10/10/impressive-growth-r/

4 / 27

Page 5: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

R has lots of packages

Figure 2:

source: link

5 / 27

Page 6: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Easy to integrate R with other languages

Figure 3:6 / 27

Page 7: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Compared to other languages

I R has very advanced visualization toolsI R has advanced statistical analysis tools

7 / 27

Page 8: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (data distribution):

I A lot of statistical theories require normality assumptionI But does a data set really follow a normal distribution?

8 / 27

Page 9: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (data distribution):I One command line:

plot(density(x))

−4 −2 0 2 4 6 8

0.00

0.05

0.10

0.15

density.default(x = x)

N = 100 Bandwidth = 0.8205

Den

sity

−4 −2 0 2 4 6 8

0.00

0.05

0.10

0.15

density.default(x = x)

N = 100 Bandwidth = 0.8205

Den

sity

9 / 27

Page 10: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (data distribution):

I Can we test whether it’s from a normal distribution? Onecommand line:

shapiro.test(x)

#### Shapiro-Wilk normality test#### data: x## W = 0.91954, p-value = 1.331e-05

10 / 27

Page 11: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (data distribution):

I Compare two data distribution?

plot(density(x1), xlim = range(c(x1, x2)),main="")lines(density(x2), col = 2)legend("topright", c("x1", "x2"), lty = 1, col = 1:2)

−5 0 5

0.0

0.3

N = 50 Bandwidth = 0.4059

Den

sity x1

x2

11 / 27

Page 12: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (data distribution):

I Hypothesis testing?

ks.test(x1, x2)

#### Two-sample Kolmogorov-Smirnov test#### data: x1 and x2## D = 0.28, p-value = 0.03919## alternative hypothesis: two-sided

12 / 27

Page 13: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (A/B testing):

I I have two advertisementsI Advertisement 1: 10 purchases out of 10000 clicksI Advertisement 2: 3 purchases out of 5000 clicksI Are the conversion rates (purchases/clicks) of the two

advertisements significantly different?

13 / 27

Page 14: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (A/B testing):I Confidence interval? An R package + one command line:

library(binom)binom.confint(c(10, 3), c(10000, 5000), methods = "exact")

## method x n mean lower upper## 1 exact 10 10000 1e-03 0.0004796397 0.001838264## 2 exact 3 5000 6e-04 0.0001237515 0.001752444

prop.test(c(10, 3), c(10000, 5000))

#### 2-sample test for equality of proportions with continuity## correction#### data: c(10, 3) out of c(10000, 5000)## X-squared = 0.24059, df = 1, p-value = 0.6238## alternative hypothesis: two.sided## 95 percent confidence interval:## -0.0006689452 0.0014689452## sample estimates:## prop 1 prop 2## 1e-03 6e-04

14 / 27

Page 15: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (A/B testing):

I I want to see if there are other methods

Figure 4:

15 / 27

Page 16: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (correlation):suppressPackageStartupMessages(library(PerformanceAnalytics))chart.Correlation(iris[-5], bg=iris$Species, pch=21)

x

Den

sity

Sepal.Length

2.0

2.5

3.0

3.5

4.0

4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0

0.5

1.0

1.5

2.0

2.5

2.0 2.5 3.0 3.5 4.0

−0.12

x

Den

sity

Sepal.Width

0.87***

−0.43***

x

Den

sity

Petal.Length

1 2 3 4 5 6 7

0.5 1.0 1.5 2.0 2.5

4.5

5.5

6.5

7.5

0.82***

−0.37***

12

34

56

7

0.96***

x

Den

sity

Petal.Width

16 / 27

Page 17: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Example (visualization):The Economist

0

5000

10000

15000

1 2 3carat

pric

e

clarityI1SI2

SI1VS2

VS1VVS2

VVS1IF

Diamonds Are Forever

17 / 27

Page 18: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Other applications (Stock):library(quantmod)getSymbols("^TWII")

## [1] "TWII"

head(TWII)

## TWII.Open TWII.High TWII.Low TWII.Close TWII.Volume## 2007-01-02 7871.41 7937.26 7843.60 7920.80 5710600## 2007-01-03 7954.96 7999.42 7917.30 7917.30 5951400## 2007-01-04 7929.89 7955.90 7901.24 7934.51 5717400## 2007-01-05 7940.20 7942.23 7821.71 7835.57 5181400## 2007-01-08 7778.57 7797.57 7736.11 7736.71 4292400## 2007-01-09 7778.38 7827.93 7778.38 7790.01 4516000## TWII.Adjusted## 2007-01-02 7920.770## 2007-01-03 7917.270## 2007-01-04 7934.479## 2007-01-05 7835.541## 2007-01-08 7736.681## 2007-01-09 7789.981

18 / 27

Page 19: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Other applications (Stock):chartSeries(TWII, subset = "last 4 months", TA = c(addVo(), addBBands()))

10600

10800

11000

11200

TWII [2018−05−02/2018−08−30]

Last 11093.75Bollinger Bands (20,2) [Upper/Lower]: 11168.543/10613.493

Volume (millions):1,858,300

0

1

2

3

4

May 022018

May 142018

May 282018

Jun 112018

Jun 252018

Jul 092018

Jul 232018

Aug 062018

Aug 202018

Aug 302018

19 / 27

Page 20: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Other applications (Baseball):

library(Lahman)head(Teams[,c("yearID", "name", "Rank", "W", "L", "R", "RA")])

## yearID name Rank W L R RA## 1 1871 Boston Red Stockings 3 20 10 401 303## 2 1871 Chicago White Stockings 2 19 9 302 241## 3 1871 Cleveland Forest Citys 8 10 19 249 341## 4 1871 Fort Wayne Kekiongas 7 7 12 137 243## 5 1871 New York Mutuals 5 16 17 302 313## 6 1871 Philadelphia Athletics 1 21 7 376 266

20 / 27

Page 21: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Other applications (Average runs in MLB):

50

100

150

1900 1940 1980 2020

yearID

RU

N

21 / 27

Page 23: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Web app:

I ShinyI Gallery:

I K-means Example

23 / 27

Page 25: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

Install R and Rstudio

I live demo

25 / 27

Page 26: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

How to learn R - swirl

I live demo

26 / 27

Page 27: STT481 Capstone in Statistics · R is popular “R and Python are the two most popular programming languages used by data analysts and data scientists” — source:link I ImpressiveGrowth

HomeworkHomework 1, Q2: Finish the swirl course “R Programming”. FinishSection 1-15. Please take a screenshot as below. For each section,in your screenshot, you must have your last command line(my_div), the last message (“You’ve reached the end of thislesson! Returning to the main menu. . . ”), and your workingdirectory (“~/OneDrive - . . . .”).

Figure 6: 27 / 27