data management and statistical analysis - descriptive statistics

Leilani A. Nora

Assistant Scientist

Descriptive Statistics

Introduction to R: Data Manipulation and Statistical Analysis

DATA FRAME : data.serial

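• Consider a serialized data set with 3 sites (Site), 3 treatments (Trt), 4 replicates (Rep) and a response variable Y. Site C has no observations for treatment 3, and two values of Y are missing (NA).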

Site Trt Rep Y     Site Trt Rep Y     Site Trt Rep Y
A    1   1   3     B    1   1   3     C    1   1   8
A    1   2   6     B    1   2   6     C    1   2   NA
A    1   3   8     B    1   3   5     C    1   3   8
A    1   4   5     B    1   4   NA    C    1   4   6
A    2   1   4     B    2   1   7     C    2   1   5
A    2   2   4     B    2   2   0     C    2   2   4
A    2   3   6     B    2   3   8     C    2   3   4
A    2   4   9     B    2   4   2     C    2   4   7
A    3   1   7     B    3   1   5
A    3   2   4     B    3   2   7
A    3   3   2     B    3   3   4
A    3   4   4     B    3   4   4
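The table above can be entered in R roughly as follows. This is only a sketch: the slides do not show how data.serial was created (it may well have been read from a file), so building it by hand with data.frame() and rep() is an assumption made for illustration.

```r
# Assumption: data.serial is typed in by hand here; the slides do not show
# how it was originally created or imported.
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,     # Site A
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,    # Site B
           8, NA, 8, 6, 5, 4, 4, 7)                # Site C (no Trt 3)
)

str(data.serial)          # structure: 32 observations of 4 variables
table(data.serial$Site)   # number of rows per site
```

Site is stored as a factor and Trt and Rep as numbers, which is consistent with the summary() output shown later in the slides.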

SUMMARY STATISTICS

• R contains all the basic tools for calculating summary statistics.

• cov() and cor() calculate covariances and correlations

• mean(), median(), sum(), var(), min(), max() and range() are all self-explanatory

• mad() calculates the median absolute deviation

• quantile() computes various quantiles of data

• summary() will be discussed on the next slide
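A short sketch of these functions applied to Y. The values are typed in directly so the example runs on its own; the point to notice is that most of these functions return NA when the data contain missing values unless na.rm=TRUE is given.

```r
# Y values of data.serial, typed in directly so the example is self-contained.
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

mean(Y)                   # NA, because Y contains missing values
mean(Y, na.rm = TRUE)     # missing values are dropped before averaging
median(Y, na.rm = TRUE)
range(Y, na.rm = TRUE)    # minimum and maximum
quantile(Y, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
mad(Y, na.rm = TRUE)      # median absolute deviation (scaled by 1.4826 by default)
```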

SUMMARY STATISTICS : summary()

• Used to obtain descriptive statistics of a data frame or of a specific variable.

• The output gives the minimum, quartiles, median, mean, maximum and the count of NA's.

• Ex1. To obtain summary statistics for the variable Y

> summary(data.serial$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   4.000   5.000   5.167   7.000   9.000   2.000

• Ex2. To obtain summary statistics for all the columns of a data frame

> summary(data.serial)
 Site        Trt             Rep             Y
 A:12   Min.   :1.000   Min.   :1.00   Min.   :0.000
 B:12   1st Qu.:1.000   1st Qu.:1.75   1st Qu.:4.000
 C: 8   Median :2.000   Median :2.50   Median :5.000
        Mean   :1.875   Mean   :2.50   Mean   :5.167
        3rd Qu.:2.250   3rd Qu.:3.25   3rd Qu.:7.000
        Max.   :3.000   Max.   :4.00   Max.   :9.000
                                       NA's   :2.000

SUMMARY STATISTICS : length()

• Used to obtain the number of data points of a variable, including missing values (NA).

> length(data.serial$Y)

[1] 32
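Because length() also counts the NA's, the number of observed values has to be obtained separately. A minimal sketch using is.na(), with the Y values typed in directly:

```r
# Y values of data.serial, typed in directly so the example is self-contained.
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

length(Y)         # 32: all entries, missing or not
sum(is.na(Y))     # 2:  number of missing values
sum(!is.na(Y))    # 30: number of observed values
```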

SUMMARY STATISTICS : var() and sd()

• var() is used to obtain the variance of Y

> Y.VAR <- var(data.serial$Y, na.rm=TRUE)
> Y.VAR
[1] 4.488506

• sd() is used to obtain the standard deviation of Y

> Y.STD <- sd(data.serial$Y, na.rm=TRUE)
> Y.STD
[1] 2.118609
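As a check on what var() and sd() compute, the sample variance (divisor n - 1) can be worked out by hand and compared. This is an illustrative sketch; the names var.by.hand and sd.by.hand are made up for the example.

```r
# Y values of data.serial, typed in directly so the example is self-contained.
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

y <- Y[!is.na(Y)]                                # keep the observed values only
n <- length(y)
var.by.hand <- sum((y - mean(y))^2) / (n - 1)    # sample variance
sd.by.hand  <- sqrt(var.by.hand)                 # standard deviation

all.equal(var.by.hand, var(Y, na.rm = TRUE))     # TRUE
all.equal(sd.by.hand,  sd(Y, na.rm = TRUE))      # TRUE
```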

SUMMARY STATISTICS : tapply()

• tapply() applies a function to a variable separately within each (non-empty) group defined by one or more factors.

> tapply(X, INDEX, FUN)

X – an object, typically a vector

INDEX – list of factors, each of the same length as X

FUN – function to be applied

• Ex1. To obtain a separate summary of Y for each Site

> tapply(data.serial$Y, data.serial$Site, summary)
$A
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   4.000   4.500   5.167   6.250   9.000

$B
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   3.500   5.000   4.636   6.500   8.000   1.000

$C
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    4.0     4.5     6.0     6.0     7.5     8.0     1.0

• Ex2. To obtain a separate standard deviation of Y for each Site

> tapply(data.serial$Y, data.serial$Site, sd, na.rm=TRUE)
       A        B        C
2.081666 2.377929 1.732051

• Ex3. To obtain a separate mean of Y for each Site x Trt combination

> tapply(data.serial$Y, list(data.serial$Site, data.serial$Trt), mean, na.rm=TRUE)
         1    2    3
A 5.500000 5.75 4.25
B 4.666667 4.25 5.00
C 7.333333 5.00   NA

The C x 3 cell is NA because Site C has no observations for Trt 3.
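tapply() accepts any function, including one written on the spot. The sketch below counts the observed (non-missing) values of Y in each Site x Trt cell, which shows where the missing values and the empty cell are. The data frame is rebuilt by hand (an assumption, since the slides do not show how it was created) and the name n.obs is made up for the example.

```r
# Assumption: data.serial is typed in by hand so the example is self-contained.
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# Number of non-missing Y values per Site x Trt cell.
# Cells with no rows at all (Site C, Trt 3) come back as NA.
n.obs <- tapply(data.serial$Y,
                list(data.serial$Site, data.serial$Trt),
                function(v) sum(!is.na(v)))
n.obs
```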

SUMMARY STATISTICS : doBy Package

• The doBy package is used to calculate groupwise summary statistics in a simple way, much in the spirit of PROC SUMMARY in the SAS system.

summaryBy()

• Used for calculating quantities such as the mean and variance of a variable for each combination of one or more factors.

SUMMARY STATISTICS : summaryBy()

• Usage

> summaryBy(formula, data, FUN=mean, keep.names=FALSE, order=TRUE, na.rm=TRUE, ...)

# formula – a formula object, say Y~Site

# data – a data frame

# FUN – a list of functions to be applied

# keep.names – logical; if TRUE and there is only ONE function in FUN, the variables in the output keep the same names as the variables in the input

# order – logical; if TRUE, the resulting data frame is ordered according to the variables on the right-hand side of the formula

# ... – additional arguments passed on to the functions in FUN, e.g. na.rm=TRUE

• Ex1. To obtain Site x Trt summary of means for Y

> library(doBy)

> summaryBy(Y~Site+Trt, data=data.serial,

na.rm=TRUE)

Site Trt Y.mean

1 A 1 5.500000

2 A 2 5.750000

3 A 3 4.250000

4 B 1 4.666667

5 B 2 4.250000

6 B 3 5.000000

7 C 1 7.333333

8 C 2 5.000000

• Ex2. To obtain Site x Trt summary of minimum, mean,

maximum, variance and standard deviation of Y using

predefined functions.

> summaryBy(Y~Site+Trt, data=data.serial,

FUN=c(min, mean, max, var, sd), na.rm=TRUE)

Site Trt Y.min Y.mean Y.max Y.var Y.sd

1 A 1 3 5.500000 8 4.333333 2.081666

2 A 2 4 5.750000 9 5.583333 2.362908

3 A 3 2 4.250000 7 4.250000 2.061553

4 B 1 3 4.666667 6 2.333333 1.527525

5 B 2 0 4.250000 8 14.916667 3.862210

6 B 3 4 5.000000 7 2.000000 1.414214

7 C 1 6 7.333333 8 1.333333 1.154701

8 C 2 4 5.000000 7 2.000000 1.414214
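If doBy is not installed, a similar table of cell means can be obtained with aggregate() from base R. This is an alternative sketch, not part of the doBy workflow above; the data frame is rebuilt by hand (an assumption) and the name cell.means is made up for the example.

```r
# Assumption: data.serial is typed in by hand so the example is self-contained.
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# The formula method of aggregate() drops rows with a missing Y by default
# (na.action = na.omit), so no na.rm argument is needed here.
cell.means <- aggregate(Y ~ Site + Trt, data = data.serial, FUN = mean)

# Sort by Site, then Trt, to match the row order of the summaryBy() output.
cell.means <- cell.means[order(cell.means$Site, cell.means$Trt), ]
cell.means
```

Unlike summaryBy(), the result column keeps the name Y instead of Y.mean.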

HISTOGRAM

> hist(data.serial$Y, main='Histogram of Y', col='yellow2', border='tomato1',
       freq=FALSE, xlab="Y Class", ylab="Probability", xlim=c(0, 20))

# freq – logical; if FALSE, probability densities are plotted so that the histogram has a total area of one.
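The statement about the total area can be checked without drawing anything, since hist() returns the numbers it would plot. A minimal sketch with the Y values typed in directly; the name h is made up for the example.

```r
# Y values of data.serial, typed in directly so the example is self-contained.
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

h <- hist(Y, plot = FALSE)    # compute the histogram without drawing it

h$breaks                      # class limits chosen by hist()
h$counts                      # bar heights when freq=TRUE
h$density                     # bar heights when freq=FALSE

# Area of each bar = height x width; the areas add up to one.
sum(h$density * diff(h$breaks))
```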

DENSITY PLOT: seq()

• seq(from, to, length) generates a regular sequence; here, 100 equally spaced values from 0 to 20.

> x <- seq(from=0, to=20, length=100)
> x
 [1]  0.0000000  0.2020202  0.4040404  0.6060606  0.8080808  1.0101010
 [7]  1.2121212  1.4141414  1.6161616  1.8181818  2.0202020  2.2222222
...
[97] 19.3939394 19.5959596 19.7979798 20.0000000

dnorm(x, mean, sd)

• dnorm() is used to obtain the normal probability density at x, given the values of mean and sd.

> y <- dnorm(x, mean(data.serial$Y, na.rm=TRUE), sd(data.serial$Y, na.rm=TRUE))
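To see what dnorm() returns, its values can be compared with the normal density formula written out by hand. This is an illustrative sketch; the names m, s and y.by.hand are made up for the example, and the Y values are typed in directly.

```r
# Y values of data.serial, typed in directly so the example is self-contained.
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

m <- mean(Y, na.rm = TRUE)    # mean of the fitted normal curve
s <- sd(Y, na.rm = TRUE)      # standard deviation of the fitted normal curve

x <- seq(from = 0, to = 20, length.out = 100)
y <- dnorm(x, mean = m, sd = s)

# Normal density written out: exp(-(x - m)^2 / (2 s^2)) / (s sqrt(2 pi))
y.by.hand <- exp(-(x - m)^2 / (2 * s^2)) / (s * sqrt(2 * pi))

all.equal(y, y.by.hand)       # TRUE
x[which.max(y)]               # grid point where the curve peaks, close to m
```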

DENSITY PLOT : lines()

> lines(x, y)

HISTOGRAM WITH DENSITY PLOT: mtext()

• mtext(text, side=3, ...) writes text in a margin of the plot; with side=3 (the default) the text is displayed on top of the plot.

> mtext("Fitting to a normal distribution")

# text – a character expression specifying the text to be written

# side – the side of the plot on which to display the text:

1 – bottom 2 – left

3 – top 4 – right

CASE1. HISTOGRAM WITH DENSITY PLOT

> hist(RF$RLD0, main='Histogram of RLD0', col='plum4', border='black', br=5,
       xlab="RLD0 Class", ylab="Probability", freq=FALSE, xlim=c(0, 20))
> x <- seq(from=0, to=20, length=100)
> y <- dnorm(x, mean(data.serial$Y, na.rm=TRUE), sd(data.serial$Y, na.rm=TRUE))
> lines(x, y)
> mtext("Fitting to a normal distribution")

HISTOGRAM WITH DENSITY PLOT: lines(), dnorm(), and mtext()

[Figure: "Histogram of Y with Density plot", with the margin text "Fitting to a normal distribution". x-axis: Y class (0 to 10); y-axis: Probability.]

BOXPLOT

• Ex1. To obtain a boxplot of Y with other graphics parameters

> boxplot(data.serial$Y, boxwex=0.35, main="Boxplot of Y", xlab="Y", horizontal=TRUE)

# boxwex – controls the width of the box

# horizontal – logical; if TRUE, the boxplot is drawn horizontally

[Figure: "Boxplot of Y", drawn horizontally; axis from 0 to 8.]

BOXPLOT : boxplot()

• Both calls below draw one boxplot of Y for each Site.

> boxplot(split(data.serial$Y, data.serial$Site))

> boxplot(Y~Site, data=data.serial)
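The numbers behind each box can be inspected by calling boxplot() with plot=FALSE, which returns the statistics instead of drawing them. A minimal sketch; the data frame is rebuilt by hand (an assumption) and the name bp is made up for the example.

```r
# Assumption: data.serial is typed in by hand so the example is self-contained.
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

bp <- boxplot(Y ~ Site, data = data.serial, plot = FALSE)

bp$names    # group labels, one per box
bp$n        # number of non-missing observations in each group
bp$stats    # one column per group; rows are lower whisker, lower hinge,
            # median, upper hinge and upper whisker
bp$out      # values drawn as individual outlier points, if any
```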

THANK YOU! ☺

Please do Exercise C
