statistical preliminaries

Statistical Preliminaries. R. Akerkar, TMRF, Kolhapur, India. Data Mining - R. Akerkar

Upload: rajendra-akerkar

Post on 05-Dec-2014


Category:

Education




Page 1: Statistical Preliminaries

Statistical Preliminaries

R. Akerkar, TMRF, Kolhapur, India

Data Mining - R. Akerkar

Page 2: Statistical Preliminaries

Data mining: tools, methodologies, and theories for revealing patterns in data, a critical step in knowledge discovery.

Driving forces:

- Explosive growth of data in a great variety of fields
- Cheaper storage devices with higher capacity
- Faster communication
- Better database management systems
- Rapidly increasing computing power

Make data work for us.

Page 3: Statistical Preliminaries

Categorization

- Supervised learning vs. unsupervised learning: Is Y available in the training data?
- Regression vs. classification: Is Y quantitative or qualitative?

Page 4: Statistical Preliminaries

Supervised learning

Learning from examples, where a training set is given that acts as a set of examples for the classes.

- The system finds a description for each class.
- Once the description, and hence the classification rule, has been formulated, it is used to predict the class of previously unseen objects.

Page 5: Statistical Preliminaries

Classification Rule

The domestic flights in the country were operated by Air Canada. Recently, many new airlines began operations, and some of Air Canada's customers started flying with these private airlines. As a result, Air Canada is losing customers.

Question: Why do some customers remain loyal while others leave?

To predict: which customers Air Canada is most likely to lose to its competitors.

Build a model based on the historical data of loyal customers versus customers who have left.

Page 6: Statistical Preliminaries

Statistics

A theory-rich approach for data analysis.

Measures of central tendency or averages

- A single expression representing the whole group is selected.
- This single expression in statistics is known as the average.
- Averages generally lie in the central part of the distribution, and are therefore also called measures of central tendency.

Page 7: Statistical Preliminaries

Types of measures of central tendency or averages

- Arithmetic mean (or simply mean)
- Median
- Mode
- Geometric mean
- Harmonic mean

Page 8: Statistical Preliminaries

Arithmetic mean: the ratio of the sum of all observations to the total number of observations.

Median: the middlemost value of the variable in a set of observations when they are arranged in ascending or descending order of magnitude. It thus divides the data into two equal parts.

Mode: the value in the series that occurs most frequently. In a frequency distribution, the mode is the value of the variable with the maximum frequency.

Page 9: Statistical Preliminaries

Examples: Suppose we want to find the average height of a student in a class.

- Mean: We can measure the heights of all the students, add them, and divide the sum by the number of students in the class. This gives the mean height.
- Median: We can ask the students to line up in order of height; the height of the middlemost student is the median. If the number of students is odd, there is a single middle student. If it is even, the median is the average of the heights of the two middle students.
- Mode: We can measure the heights and make a frequency distribution table, with the height in one column and the frequency in the other. Given the limited precision of our measuring instruments, many students will be recorded with the same height. The modal height is the one shared by the largest number of students, that is, the height with the maximum frequency.
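The three height examples above can be sketched in Python with the standard `statistics` module. The heights below are hypothetical values chosen only for illustration; they do not come from the slides.

```python
from statistics import mean, median, mode

# Hypothetical heights in centimetres (illustrative values, not from the slides).
heights_cm = [150, 152, 152, 155, 158, 160, 163]

mean_height = mean(heights_cm)      # sum of heights divided by the number of students
median_height = median(heights_cm)  # middle value once the heights are sorted
modal_height = mode(heights_cm)     # height with the maximum frequency

# With an even number of students, the median is the average of the two middle heights.
even_heights_cm = [150, 152, 156, 160]
even_median = median(even_heights_cm)

print(mean_height, median_height, modal_height, even_median)
```

In this small sample the three averages differ, which is why it matters which measure of central tendency is reported.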

Page 10: Statistical Preliminaries

Variance

Variance is defined as the mean of the squares of the deviations (differences) from the mean.

Procedure:

1. Calculate the mean of the observations.
2. Calculate the difference of each observation from the mean.
3. Square the differences.
4. Add all the squares.
5. Divide the sum by the total number of observations.

Page 11: Statistical Preliminaries

Standard Deviation

It is the square root of the variance.
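The five-step variance procedure and the standard deviation definition above can be sketched as follows. The function names and the sample data are assumptions made for illustration. Because the procedure divides by the total number of observations, this is the population variance, not the sample variance that divides by n - 1.

```python
import math

def population_variance(observations):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(observations)
    mean = sum(observations) / n                    # step 1: mean
    deviations = [x - mean for x in observations]   # step 2: differences from the mean
    squares = [d * d for d in deviations]           # step 3: square the differences
    total = sum(squares)                            # step 4: add the squares
    return total / n                                # step 5: divide by the number of observations

def population_std_dev(observations):
    """Square root of the population variance."""
    return math.sqrt(population_variance(observations))

# Hypothetical data, chosen so that the arithmetic is easy to check by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))  # 4.0
print(population_std_dev(data))   # 2.0
```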

Page 12: Statistical Preliminaries

Exercise 1

Find the median of the data in the above figure.

Find the standard deviation of the data in the above figure.

Page 13: Statistical Preliminaries

Solutions

There are 15 data points in the histogram. Seven are smaller than 3 and seven are greater than 3, so the median is 3.

List the full set of observations in a spreadsheet, repeating values as many times as they occur: 0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7. Apply the function STDEVP to the observations. The result is 2.28.
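The same check can be done outside a spreadsheet. The sketch below uses the observations listed in the solution and Python's `statistics.pstdev`, which computes the population standard deviation and so plays the role of STDEVP here.

```python
from statistics import median, pstdev

# Observations as listed in the solution, with repeated values written out.
observations = [0, 0, 0, 0, 1, 2, 2, 3, 4, 4, 4, 5, 5, 6, 7]

exercise_median = median(observations)
exercise_std_dev = round(pstdev(observations), 2)

print(exercise_median)   # 3
print(exercise_std_dev)  # 2.28
```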

Page 14: Statistical Preliminaries

Exercise 2


Page 15: Statistical Preliminaries

Solutions


Page 16: Statistical Preliminaries

Exercise 3


Page 17: Statistical Preliminaries

Solution


Page 18: Statistical Preliminaries

Normal Distribution

- Normal distributions are a family of distributions.
- Normal distributions are symmetric, with scores more concentrated in the middle than in the tails.
- They are defined by two parameters: the mean (μ) and the standard deviation (σ).

Page 19: Statistical Preliminaries

For example, a very large number of factors determine a person's height (thousands of genes, nutrition, diseases, etc.). Thus, height can be expected to be approximately normally distributed in the population.

The normal distribution function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}, \quad -\infty < x < \infty$$

where

- μ is the mean
- σ is the standard deviation
- e is the base of the natural logarithm, sometimes called Euler's e (2.71...)
- π is the constant pi (3.14...)
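As a minimal sketch, the density formula above translates directly into Python. The function name `normal_pdf` is an assumption for illustration. The result is compared against `statistics.NormalDist` from the standard library as a sanity check.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density: 1 / (sigma * sqrt(2*pi)) * exp(-0.5 * ((x - mu) / sigma) ** 2)."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coefficient * math.exp(exponent)

# Density at the mean of the standard normal distribution (mu = 0, sigma = 1).
peak = normal_pdf(0.0, 0.0, 1.0)

# The curve is symmetric about the mean: equal distances on either side give equal density.
left = normal_pdf(25.0, 30.0, 5.0)
right = normal_pdf(35.0, 30.0, 5.0)

# Cross-check against the standard library implementation.
library_value = NormalDist(mu=30.0, sigma=5.0).pdf(35.0)

print(peak, left, right, library_value)
```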

Page 20: Statistical Preliminaries

Null hypothesis

- The statistical hypothesis that is set up for testing is known as the null hypothesis. It states that there is no difference between the sample statistic and the population parameter.
- The purpose of hypothesis testing is to test the viability of the null hypothesis in the light of experimental data.
- Consider a researcher interested in whether the time to respond to a tone is affected by the consumption of alcohol. The null hypothesis is that µ1 - µ2 = 0, where µ1 is the mean time to respond after consuming alcohol and µ2 is the mean time to respond otherwise.
- Thus, the null hypothesis concerns the parameter µ1 - µ2, and the null hypothesis is that the parameter equals zero.

Page 21: Statistical Preliminaries

Null Hypothesis vs. Experimental data

- The null hypothesis is often the reverse of what the experimenter actually believes; it is put forward to allow the data to contradict it.
- In the experiment on the effect of alcohol, the experimenter probably expects alcohol to have a harmful effect.
- If the experimental data show a sufficiently large effect of alcohol, then the null hypothesis that alcohol has no effect can be rejected.

Page 22: Statistical Preliminaries

Hypothesis testing

Hypothesis testing is a method of inferential statistics.

- An experimenter starts with a hypothesis about a population parameter, called the null hypothesis.
- Data are then collected and the viability of the null hypothesis is determined in light of the data.
- If the data are very different from what would be expected under the assumption that the null hypothesis is true, then the null hypothesis is rejected.
- If the data are not greatly at variance with what would be expected under the assumption that the null hypothesis is true, then the null hypothesis is not rejected.

Page 23: Statistical Preliminaries

The test of a hypothesis reveals whether the difference between a sample statistic and the corresponding hypothetical population parameter is significant or not. The test of hypothesis is therefore also known as the test of significance.

Page 24: Statistical Preliminaries

A Classical Model for Hypothesis Testing

$$P = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{v_1/n_1 + v_2/n_2}}$$

where

- $P$ is the significance score;
- $\bar{X}_1$ and $\bar{X}_2$ are the sample means for the independent samples;
- $v_1$ and $v_2$ are the variance scores for the respective means; and
- $n_1$ and $n_2$ are the corresponding sample sizes.
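The significance score can be sketched in Python. The formula on this slide did not survive extraction cleanly, so the code assumes that P is the difference of the two sample means divided by the square root of v1/n1 + v2/n2. The function name and all input values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def significance_score(mean1, mean2, var1, var2, n1, n2):
    """Assumed form: (mean1 - mean2) / sqrt(var1 / n1 + var2 / n2)."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return (mean1 - mean2) / standard_error

# Hypothetical summary statistics for two independent samples.
p_score = significance_score(mean1=52.0, mean2=50.0,
                             var1=16.0, var2=9.0,
                             n1=64, n2=36)

# 16/64 + 9/36 = 0.5, so the score is 2 / sqrt(0.5), roughly 2.83.
print(round(p_score, 2))
```

A score far from zero means the observed difference between the sample means is large relative to its sampling variability, which counts as evidence against the null hypothesis of no difference.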

Page 25: Statistical Preliminaries

Exercise


Page 26: Statistical Preliminaries

Solution


Page 27: Statistical Preliminaries

Exercise


Page 28: Statistical Preliminaries

Solution


Page 29: Statistical Preliminaries

Exercise

If scores are normally distributed with a mean of 30 and a standard deviation of 5, what percent of the scores is:

(a) greater than 30?
(b) greater than 37?
(c) between 28 and 34?

Page 30: Statistical Preliminaries

Answers

a. 50%
b. 8.08%
c. 44.35%
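These answers can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The variable names are illustrative. The computed value for part (c) can differ from the slide's figure in the last digit, presumably because the slide's answer comes from a rounded z-table.

```python
from statistics import NormalDist

# Scores with mean 30 and standard deviation 5, as stated in the exercise.
scores = NormalDist(mu=30, sigma=5)

greater_than_30 = 1 - scores.cdf(30)                  # (a)
greater_than_37 = 1 - scores.cdf(37)                  # (b) z = (37 - 30) / 5 = 1.4
between_28_and_34 = scores.cdf(34) - scores.cdf(28)   # (c) z from -0.4 to 0.8

print(f"(a) {greater_than_30:.2%}")
print(f"(b) {greater_than_37:.2%}")
print(f"(c) {between_28_and_34:.2%}")
```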