

2011 Summer ERIE/REU Program

Descriptive Statistics

Igor Jankovic

Department of Civil, Structural, and Environmental Engineering
University at Buffalo, State University of New York

Content

• Statistics terminology
1. Population vs. Sample
2. Descriptive statistics vs. Inferential statistics
3. Data Types

• Presentation of qualitative data
1. Graphical method
2. Numerical method

• Presentation of quantitative data
1. Graphical method
2. Numerical method

• Outliers in a data set

Population vs. Sample

• Population: an entire data set that is the target of our interest

• Sample: a subset of data selected from a population

Example:

Electrical engineers recognize that high neutral current in computer power systems is a potential problem. To determine the extent of the problem, a survey of the computer power system load currents at 146 US sites was taken (IEEE Transactions on Industry Applications, July/August 1990). The survey revealed that less than 10% of the sites had high neutral to full-load current ratios.

• Identify the population of interest (power load status at all US sites with computer power systems)

• Identify the sample (power load status at 146 US sites with computer power systems)

• Use the sample information to make an inference about the population (less than 10% of the sites had high neutral to full-load current ratios)

Descriptive statistics vs. Inferential statistics

• Two major applications of statistics:
- Summarizing, describing, and exploring data
- Using sample data to infer the nature of the population data set

In other words,

• Descriptive statistics
- The branch of statistics devoted to the organization, summarization, and description of data sets

• Inferential statistics
- The branch of statistics concerned with using sample data to make inferences about populations

Data Types

Quantitative Data: data that represent the quantity or amount of something

Qualitative (categorical) Data: data that have no quantitative interpretation

Example:
• Length (in centimeters), weight (in grams), DDT concentration (in ppm): quantitative data
• Location and species: qualitative data

Qualitative Data

Graphical method for describing qualitative data

For qualitative data, we define the categories in such a way that each observation can fall in one and only one category.

Example: Student distribution in terms of year at college in EAS 308

[Figure: Pareto diagram. Number of students in EAS 308 by year at college (Senior, Junior, Sophomore).]

[Figure: Horizontal bar graph. Year at college (Sophomore, Senior, Junior) against number of students in EAS 308.]

[Figure: Pie chart of year at college (Sophomore, Senior, Junior).]

Numerical method for describing qualitative data

For qualitative data, we define the categories in such a way that each observation can fall in one and only one category.

Category frequency for a given category is the number of observations that fall in that category

Category relative frequency for a given category is the proportion of the total number of observations that fall in that category

Summary frequency table

Year at college   Frequency   Percent   Cumulative Frequency   Cumulative Percent
Sophomore         11          12.4      11                     12.4
Junior            35          39.3      46                     51.7
Senior            43          48.3      89                     100.0
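The columns of the summary frequency table can be reproduced from the three category frequencies alone. The sketch below is one way to do it in Python; the variable names are illustrative and not part of the lecture.

```python
from itertools import accumulate

# Category frequencies taken from the summary frequency table.
counts = {"Sophomore": 11, "Junior": 35, "Senior": 43}

total = sum(counts.values())                     # total number of observations
cumulative = list(accumulate(counts.values()))   # running totals of the frequencies

rows = []
for (category, freq), cum in zip(counts.items(), cumulative):
    percent = 100 * freq / total       # category relative frequency, in percent
    cum_percent = 100 * cum / total    # cumulative relative frequency, in percent
    rows.append((category, freq, round(percent, 1), cum, round(cum_percent, 1)))

for row in rows:
    print(row)
```

Each printed tuple corresponds to one row of the table: category, frequency, percent, cumulative frequency, cumulative percent.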

Quantitative Data

Graphical method for describing quantitative data (1)

Dot plots

Steps:
1. Draw a horizontal scale that spans the range of the data
2. Place a dot over the appropriate value on the scale to represent each observation
3. If a data value repeats, the dots are placed on top of each other
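The three steps can be sketched as a text-only dot plot. This is a minimal illustration that assumes integer-valued observations; the function name `dot_plot` and the sample values are hypothetical.

```python
from collections import Counter

def dot_plot(values):
    """Return a text dot plot of integer-valued data as a list of lines."""
    counts = Counter(values)
    lo, hi = min(values), max(values)   # step 1: the scale spans the range of the data
    lines = []
    for level in range(max(counts.values()), 0, -1):
        # steps 2-3: one dot per observation, repeated values stacked vertically
        line = "".join(" . " if counts[v] >= level else "   "
                       for v in range(lo, hi + 1))
        lines.append(line.rstrip())
    # horizontal scale printed underneath the dots
    lines.append("".join(f"{v:^3}" for v in range(lo, hi + 1)).rstrip())
    return lines

# Hypothetical data: the value 3 occurs three times, so its dots stack three high.
print("\n".join(dot_plot([1, 2, 2, 3, 3, 3, 5])))
```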

Graphical method for describing quantitative data (2)

Histograms (most popular and traditional method for describing quantitative data)

Steps:
1. Calculate the range of the data
2. Divide the range into 5-20 classes of equal width
3. For each class, count the number (class frequency) of observations that fall in the class
4. Calculate each class relative frequency = (class frequency) / (total number of measurements)
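The histogram steps amount to a small counting routine. The sketch below is one possible implementation; the function name `class_frequencies` is hypothetical, and the convention that a value on a class boundary goes to the upper class (with the maximum placed in the last class) is an assumption, since the slides do not state one. The data are the ten sample measurements used in the central-tendency example of this lecture.

```python
def class_frequencies(data, num_classes):
    """Return (class_limits, class_frequencies, relative_frequencies)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_classes   # steps 1-2: range divided into equal-width classes
    freqs = [0] * num_classes
    for y in data:
        # step 3: locate the class; the largest value is placed in the last class
        k = min(int((y - lo) / width), num_classes - 1)
        freqs[k] += 1
    n = len(data)
    rel = [f / n for f in freqs]      # step 4: class relative frequencies
    limits = [(lo + k * width, lo + (k + 1) * width) for k in range(num_classes)]
    return limits, freqs, rel

y = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]
limits, freqs, rel = class_frequencies(y, num_classes=5)
for (left, right), f, r in zip(limits, freqs, rel):
    print(f"[{left:4.1f}, {right:4.1f})  frequency={f}  relative={r:.1f}")
```

The routine assumes the data are not all identical; otherwise the class width would be zero.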

Graphical method for describing quantitative data (3)

Stem-and-Leaf Display

Steps:
1. Divide each observation in the data set into two parts, the stem and the leaf. For example, the stem and leaf of the CPU time 2.41 are 2 and 41, respectively.

   Stem | Leaf
   2    | 41

2. List the stems in order in a column, starting with the smallest stem and ending with the largest.

3. Proceed through the data set, placing the leaf for each observation in the appropriate stem row.
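The stem-and-leaf steps can be sketched as follows, assuming non-negative observations recorded to two decimal places (as in the CPU time 2.41). The function name `stem_and_leaf` and every value other than 2.41 are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each integer-part stem to its list of two-digit leaves."""
    rows = defaultdict(list)
    for v in values:
        hundredths = round(v * 100)            # work in integers to avoid float truncation
        stem, leaf = divmod(hundredths, 100)   # step 1: 2.41 -> stem 2, leaf 41
        rows[stem].append(f"{leaf:02d}")       # step 3: leaf goes in its stem row
    # step 2: stems listed from smallest to largest, including empty stems
    return {s: rows.get(s, []) for s in range(min(rows), max(rows) + 1)}

display = stem_and_leaf([2.41, 1.17, 0.02, 2.16, 1.23, 4.75])
for stem, leaves in display.items():
    print(stem, "|", " ".join(leaves))
```

Leaves are kept in the order the observations are encountered, which matches step 3; sorting each row afterwards gives an ordered display.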

Numerical method for describing quantitative data

Measures of central tendency - help to locate the center of the relative frequency distribution -

- Arithmetic mean (mean)

Suppose we have a set of n measurements, $y_1, y_2, y_3, \ldots, y_n$.

The arithmetic mean = $\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$

Generally, we use $\bar{y}$ to represent the sample mean and $\mu$ to represent the population mean.

- Median

The median is the middle number when the measurements are arranged in ascending (or descending) order:

Median = $y_{((n+1)/2)}$, if n is odd
Median = $\left[ y_{(n/2)} + y_{(n/2+1)} \right] / 2$, if n is even

where $y_{(k)}$ denotes the k-th measurement in the ordered list.

Generally, we use $m$ to represent the sample median and $\tau$ to represent the population median.

Numerical method for describing quantitative data

Measures of central tendency - help to locate the center of the relative frequency distribution -

-Mode

The mode of a set of n measurements, $y_1, y_2, y_3, \ldots, y_n$, is the value of y that occurs with the greatest frequency.

Numerical method for describing quantitative data

Measures of central tendency

Example:

We have 10 sample measurements: 4, 5, 8, 1, 11, 6, 2, 8, 3, 7

Compute the mean, median, and mode.

Solution:

Mean = 5.5

Median = (6+5)/ 2 = 5.5

Mode = 8
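The solution can be checked in a few lines of Python. The sketch below applies the odd/even median rule directly and uses the standard-library `statistics` module only for the mode.

```python
import statistics

y = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]

# Arithmetic mean: sum of the measurements divided by n.
mean = sum(y) / len(y)

# Median: middle value of the ordered data (average of the two middle values if n is even).
ordered = sorted(y)
n = len(ordered)
if n % 2 == 1:
    median = ordered[(n + 1) // 2 - 1]                     # position (n+1)/2, 1-based
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2   # positions n/2 and n/2+1

# Mode: the value that occurs with the greatest frequency.
mode = statistics.mode(y)

print(mean, median, mode)
```

With n = 10 the ordered data are 1, 2, 3, 4, 5, 6, 7, 8, 8, 11, so the two middle values are 5 and 6.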

Measures of central tendency: Geometric Mean (from Wikipedia)

Measures of central tendency: Harmonic Mean (from Wikipedia)
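The formulas on these two slides did not survive in the transcript. The sketch below therefore assumes the standard definitions for positive measurements: the geometric mean is the n-th root of the product of the values, and the harmonic mean is n divided by the sum of the reciprocals. The data are hypothetical.

```python
import math
import statistics

y = [1, 2, 4]   # hypothetical positive measurements
n = len(y)

# Geometric mean: n-th root of the product of the measurements.
geometric = math.prod(y) ** (1 / n)

# Harmonic mean: n divided by the sum of the reciprocals.
harmonic = n / sum(1 / v for v in y)

# The standard library provides both (geometric_mean requires Python 3.8+).
print(geometric, statistics.geometric_mean(y))
print(harmonic, statistics.harmonic_mean(y))
```

For positive data the arithmetic mean is never smaller than the geometric mean, which in turn is never smaller than the harmonic mean.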

Numerical method for describing quantitative data

Measures of variation - help to describe the spread of the distribution -

- Range

Range = largest measurement – smallest measurement

- Variance (of n measurements $y_1, y_2, y_3, \ldots, y_n$)

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1} = \frac{\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 / n}{n-1}$

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \mu)^2}{n}$
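A short numerical check shows that the definitional and shortcut forms of the sample variance agree. The sketch reuses the ten sample measurements from the central-tendency example; treating the same ten values as a complete population (so that the population mean equals their average) is an assumption made only to illustrate the population formula.

```python
import math
import statistics

y = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]
n = len(y)
y_bar = sum(y) / n

data_range = max(y) - min(y)   # largest minus smallest measurement

# Sample variance, definitional form: squared deviations from the mean over n - 1.
s2_definition = sum((v - y_bar) ** 2 for v in y) / (n - 1)

# Sample variance, shortcut form: only the sum and the sum of squares are needed.
s2_shortcut = (sum(v * v for v in y) - sum(y) ** 2 / n) / (n - 1)

# Population variance: divide by n instead of n - 1.
sigma2 = sum((v - y_bar) ** 2 for v in y) / n

print(data_range, s2_definition, s2_shortcut, sigma2)
```

Here the sum of squares is 389 and the squared sum divided by n is 302.5, so both sample forms reduce to 86.5 / 9.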

Numerical method for describing quantitative data

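Measures of variation - help to describe the spread of the distribution -

- Standard Deviation

Standard deviation of a sample: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2 / n}{n-1}}$

Standard deviation of a population: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \mu)^2}{n}}$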
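The standard deviation is the square root of the variance, so it is expressed in the same units as the measurements. A minimal sketch, again using the ten sample measurements from the central-tendency example and cross-checking against the standard-library `statistics` module:

```python
import math
import statistics

y = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]
n = len(y)
y_bar = sum(y) / n
sum_sq_dev = sum((v - y_bar) ** 2 for v in y)   # sum of squared deviations

s = math.sqrt(sum_sq_dev / (n - 1))   # sample standard deviation (divisor n - 1)
sigma = math.sqrt(sum_sq_dev / n)     # population form (divisor n), values treated as a population

print(s, statistics.stdev(y))
print(sigma, statistics.pstdev(y))
```

Because the sample form divides by n - 1, it is always slightly larger than the population form computed from the same values.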

Skewness: measure of shape

Approximate formula (accurate for large “n”)

Exact formula

where s is the sample standard deviation.
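The skewness formulas themselves are missing from the transcript. The sketch below assumes two commonly used forms: a large-n approximation that averages the cubed standardized deviations, and the small-sample adjusted form with the factor n / ((n - 1)(n - 2)). Both use the sample standard deviation s, as the slide indicates; the function names and data are hypothetical, and the slide's own formulas may differ.

```python
import math

def _standardized(y):
    """Return the deviations from the mean divided by the sample standard deviation."""
    n = len(y)
    y_bar = sum(y) / n
    s = math.sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
    return [(v - y_bar) / s for v in y]

def skewness_approx(y):
    """Assumed large-n form: mean of the cubed standardized deviations."""
    z = _standardized(y)
    return sum(v ** 3 for v in z) / len(y)

def skewness_adjusted(y):
    """Assumed small-sample form (needs n >= 3): n / ((n-1)(n-2)) times the sum of cubes."""
    n = len(y)
    z = _standardized(y)
    return n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)

symmetric = [1, 2, 3, 4, 5]       # symmetric about 3: skewness near zero
right_skewed = [1, 1, 1, 2, 10]   # long right tail: positive skewness
print(skewness_approx(symmetric), skewness_adjusted(symmetric))
print(skewness_approx(right_skewed), skewness_adjusted(right_skewed))
```

The two forms differ by the factor n² / ((n - 1)(n - 2)), which approaches 1 as n grows.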

Kurtosis: measure of “peakedness”

Approximate formula (accurate for large “n”)

where s is the sample standard deviation.

Exact formula
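The kurtosis formulas are also missing from the transcript. As an illustration only, the sketch below assumes the excess-kurtosis convention (3 is subtracted so that a normal distribution scores about zero) and the small-sample adjusted form used by many spreadsheet programs. Whether the slides subtract 3 is not known; the function names and data are hypothetical.

```python
import math

def _fourth_power_sum(y):
    """Sum of the fourth powers of the standardized deviations, using the sample s."""
    n = len(y)
    y_bar = sum(y) / n
    s = math.sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))
    return sum(((v - y_bar) / s) ** 4 for v in y)

def kurtosis_approx(y):
    """Assumed large-n form: mean fourth standardized power minus 3."""
    return _fourth_power_sum(y) / len(y) - 3

def kurtosis_adjusted(y):
    """Assumed small-sample form (needs n >= 4)."""
    n = len(y)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * _fourth_power_sum(y) - correction

flat = [1, 2, 3, 4, 5]   # evenly spread values: flatter than a normal curve, so negative
print(kurtosis_approx(flat), kurtosis_adjusted(flat))
```

Negative values indicate a flatter distribution than the normal curve; positive values indicate a sharper peak with heavier tails.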

Numerical method for describing quantitative data

Measures of relative standing - describe the relative position of an observation within the data set -

Two measures used to describe the relative standing of an observation are percentiles and z-scores.

Percentiles

The 100pth percentile of a data set is a value of y located so that 100p% of the area under the relative frequency distribution for the data lies to the left of the 100pth percentile and 100(1-p)% of the area lies to its right [note: 0 ≤ p ≤ 1].

- Lower quartile, QL, corresponding to the 25th percentile
- Midquartile, m, corresponding to the 50th percentile
- Upper quartile, QU, corresponding to the 75th percentile
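The three quartiles can be computed with the standard-library function `statistics.quantiles` (Python 3.8+). The sketch uses the ten sample measurements from the central-tendency example. Several interpolation conventions exist for sample percentiles, so other software may return slightly different quartiles; the `"exclusive"` method used here is one choice, not necessarily the lecture's.

```python
import statistics

y = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]

# n=4 splits the data into quarters and returns the three cut points:
# 25th percentile (QL), 50th percentile (m), 75th percentile (QU).
q_l, m, q_u = statistics.quantiles(y, n=4, method="exclusive")

print("QL =", q_l)
print("m  =", m)
print("QU =", q_u)
```

The middle cut point coincides with the median of the data, as the midquartile definition requires.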

Numerical method for describing quantitative data

Measures of relative standing - describe the relative position of an observation within the data set -

Two measures used to describe the relative standing of an observation are percentiles and z-scores.

Z-scores

The z-score for a value y of a data set is the distance that y lies above or below the mean, measured in units of the standard deviation.

Sample z-score: $z = \frac{y - \bar{y}}{s}$

Population z-score: $z = \frac{y - \mu}{\sigma}$
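As a worked illustration of the sample z-score, the sketch below standardizes the ten sample measurements from the central-tendency example, using the sample mean and the sample standard deviation.

```python
import statistics

y = [4, 5, 8, 1, 11, 6, 2, 8, 3, 7]
y_bar = statistics.mean(y)   # sample mean
s = statistics.stdev(y)      # sample standard deviation (divisor n - 1)

# Sample z-score: signed distance from the mean in units of s.
z = [(v - y_bar) / s for v in y]

for value, score in zip(y, z):
    print(f"y = {value:2d}   z = {score:+.3f}")
```

Values above the mean get positive z-scores and values below it get negative ones; the z-scores always sum to zero.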

Detecting Outliers

Definition of an outlier:
An observation y that is unusually large or small relative to the other values in a data set is called an outlier.

Reasons for outliers in a data set:
1. The measurement is observed, recorded, or entered into the computer incorrectly
2. The measurement comes from a different population
3. The measurement is correct, but represents a rare (chance) event

Rule of thumb for detecting outliers:
Observations with z-scores greater than 3 in absolute value are considered outliers.
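The rule of thumb translates directly into a filter on the sample z-scores. In the sketch below, the function name `z_score_outliers` and the data set (nineteen readings near 10 plus one reading of 50) are hypothetical.

```python
import statistics

def z_score_outliers(data, threshold=3.0):
    """Return the observations whose sample z-score exceeds the threshold in absolute value."""
    y_bar = statistics.mean(data)
    s = statistics.stdev(data)
    return [y for y in data if abs((y - y_bar) / s) > threshold]

# Hypothetical measurements: nineteen readings close to 10 and one reading of 50.
data = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
        10, 11, 9, 10, 12, 8, 10, 11, 9, 50]

print(z_score_outliers(data))
```

In a sample of size n no z-score can exceed (n - 1) / √n in absolute value, so with 10 or fewer observations this rule can never flag anything.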

Detecting Outliers: Box Plot Method

Interquartile range, IQR

IQR = QU - QL

Steps to construct a Box Plot
1. Calculate the median m, the lower and upper quartiles QL and QU, and the IQR for the y values in a data set.

2. Construct a box on the y-axis with QL and QU located at the lower corners. The base width will be equal to IQR. Draw a vertical line inside the box to locate the median, m.

3. Construct two sets of limits on the box plot. Inner fences are located a distance of 1.5(IQR) below QL and above QU; outer fences are located a distance of 3(IQR) below QL and above QU.

4. Observations that fall between the inner and outer fences are called suspect outliers. Observations that fall outside the outer fences are called highly suspect outliers.

5. To further highlight extreme values, use whiskers.
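The fence calculations in steps 1, 3, and 4 can be sketched as below. The function name `box_plot_outliers` and the data are hypothetical, and the quartiles come from `statistics.quantiles` with its `"exclusive"` method, which is one of several quartile conventions.

```python
import statistics

def box_plot_outliers(data):
    """Classify observations using the inner and outer fences of a box plot."""
    q_l, m, q_u = statistics.quantiles(data, n=4, method="exclusive")   # step 1
    iqr = q_u - q_l
    inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)   # step 3: inner fences
    outer = (q_l - 3.0 * iqr, q_u + 3.0 * iqr)   # step 3: outer fences
    # step 4: between the fences -> suspect; beyond the outer fences -> highly suspect
    suspect = [y for y in data
               if outer[0] <= y < inner[0] or inner[1] < y <= outer[1]]
    highly_suspect = [y for y in data if y < outer[0] or y > outer[1]]
    return {"median": m, "iqr": iqr, "inner": inner, "outer": outer,
            "suspect": suspect, "highly_suspect": highly_suspect}

# Hypothetical measurements with one moderate and one extreme value.
data = [8, 8, 9, 9, 9, 9, 10, 10, 10, 10,
        10, 10, 10, 11, 11, 11, 11, 12, 15, 50]
result = box_plot_outliers(data)
print(result)
```

For these data QL = 9 and QU = 11, so the inner fences sit at 6 and 14 and the outer fences at 3 and 17.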

Empirical Rule

If a data set has an approximately mound-shaped distribution, then the following rules of thumb may be used to describe the data set:

• Approximately 68% of the measurements will lie within the interval $\bar{y} \pm s$ for samples

• Approximately 95% of the measurements will lie within the interval $\bar{y} \pm 2s$ for samples
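The rule can be checked empirically on simulated mound-shaped data. The sketch below draws hypothetical measurements from a normal distribution (mean 50 and standard deviation 10 are arbitrary choices) and counts the fraction that falls within one and two sample standard deviations of the sample mean.

```python
import random
import statistics

random.seed(0)   # fixed seed so the simulation is repeatable
data = [random.gauss(50, 10) for _ in range(10_000)]   # simulated mound-shaped data

y_bar = statistics.mean(data)
s = statistics.stdev(data)

def fraction_within(k):
    """Fraction of the measurements lying within k sample standard deviations of the mean."""
    return sum(abs(y - y_bar) <= k * s for y in data) / len(data)

within_1s = fraction_within(1)
within_2s = fraction_within(2)
print(f"within 1s: {within_1s:.3f}   within 2s: {within_2s:.3f}")
```

For strongly skewed or multi-peaked data these fractions can differ noticeably from the rule's percentages, which is why the mound-shape condition matters.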

Summary

In this lecture, we have learned:

• Some important statistics terminology
1. Population vs. Sample
2. Descriptive statistics vs. Inferential statistics
3. Data Types

• How to deal with qualitative data
1. Graphical method (Bar graph, Pie chart, Pareto diagram)
2. Numerical method

• How to deal with quantitative data
1. Graphical method (Dot plot, Histogram, Stem-and-leaf display)
2. Numerical method

• How to detect outliers in a data set

• Empirical Rule