STATISTICS BASICS (BBA-CENTERED)
Yash Sadrani, RK University, Rajkot


DESCRIPTION

Basics of statistics, useful to B.Com. and B.B.A. students.

TRANSCRIPT

Page 1: Statistics basics

yash sadrani

STATISTICS BASICS BBA CENTERED

Page 2: Statistics basics

CONTENTS

1. Hypothesis
2. Null hypothesis
3. Regression
4. Correlation
5. Exponential distribution
6. Alternative hypothesis
7. Central tendency
8. Bayes' theorem
9. Chebyshev's theorem
10. Simple random sampling
11. Descriptive statistics
12. Statistical inference
13. Characteristics of a good estimator
14. Properties of the test for independence


Page 3: Statistics basics

15. Utility of regression studies
16. Advantages of sample surveys
17. The hypergeometric distribution
18. Leptokurtic distribution
19. The interquartile range


Page 4: Statistics basics

Hypothesis

When a possible correlation or similar relation between phenomena is investigated (for example, whether a proposed remedy is effective in treating a disease, at least to some extent and for some patients), the hypothesis that a relation exists cannot be examined the same way one might examine a proposed new law of nature: in such an investigation, a few cases in which the tested remedy shows no effect do not falsify the hypothesis. Instead, statistical tests are used to determine how likely it is that the overall effect would be observed if no real relation as hypothesized exists. If that likelihood is sufficiently small (e.g., less than 1%), the existence of a relation may be assumed. Otherwise, any observed effect may as well be due to pure chance.

In statistical hypothesis testing two hypotheses are compared, which are called the null hypothesis and the alternative hypothesis. The null hypothesis is the hypothesis that states that there is no relation between the phenomena whose relation is under investigation, or at least not of the form given by the alternative hypothesis. The alternative hypothesis, as the name suggests, is the alternative to the null hypothesis: it states that there is some kind of relation. The alternative hypothesis may take several forms, depending on the nature of the hypothesized relation; in particular, it can be two-sided (for example: there is some effect, in a yet unknown direction) or one-sided (the direction of the hypothesized relation, positive or negative, is fixed in advance).

Conventional significance levels for testing the hypotheses are .10, .05, and .01. The criteria under which the null hypothesis is rejected and the alternative hypothesis is accepted must be determined in advance, before the observations are collected or inspected. If these criteria are determined later, when the data to be tested are already known, the test is invalid.

It is important to mention that the above procedure depends on the number of participants (units, or sample size) included in the study. For instance, the sample size may be too small to reject a null hypothesis; it is therefore recommended to specify the sample size from the beginning. It is also advisable to define small, medium and large effect sizes for each of the important statistical tests used to test the hypotheses.

A statistical hypothesis test is a method of statistical inference using data from a scientific study. In statistics, a result is called statistically significant if it has been predicted as unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level. The phrase "test of significance" was coined by statistician Ronald Fisher.[1] These tests are used in determining what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance; this can help to decide whether results contain enough information to cast doubt on


Page 5: Statistics basics

conventional wisdom, given that conventional wisdom has been used to establish the null hypothesis. The critical region of a hypothesis test is the set of all outcomes which cause the null hypothesis to be rejected in favor of the alternative hypothesis. Statistical hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data analysis, which may not have pre-specified hypotheses.
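As a minimal illustration of this procedure, the sketch below runs a one-sample t-test of H0: μ = 500 hours against a two-sided alternative, with the significance level fixed in advance. The lifetime data are hypothetical and the scipy library is assumed to be available; this is an illustrative sketch, not part of the original material.

```python
# A minimal sketch of a one-sample significance test (illustrative, made-up numbers).
# H0: the population mean lifetime is 500 hours; H1: it differs from 500 hours (two-sided).
from scipy import stats

sample = [508, 495, 512, 488, 503, 517, 491, 505, 499, 510]  # hypothetical lifetimes (hours)
alpha = 0.05                                                  # significance level chosen in advance

t_stat, p_value = stats.ttest_1samp(sample, popmean=500)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

if p_value < alpha:
    print("Reject H0: the data are unlikely under the null hypothesis.")
else:
    print("Fail to reject H0: the observed effect may well be due to chance.")
```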

Example 1 – Philosopher's beans

The following example was produced by a philosopher describing scientific methods generations before hypothesis testing was formalized and popularized.

Few beans of this handful are white.

Most beans in this bag are white.

Therefore: Probably, these beans were taken from another bag.

This is a hypothetical inference.

The beans in the bag are the population. The handful are the sample. The null hypothesis is that the sample originated from the population. The criterion for rejecting the null-hypothesis is the "obvious" difference in appearance (an informal difference in the mean). The interesting result is that consideration of a real population and a real sample produced an imaginary bag. The philosopher was considering logic rather than probability. To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard.

A simple generalization of the example considers a mixed bag of beans and a handful that contains either very few or very many white beans. The generalization considers both extremes. It requires more calculations and more comparisons to arrive at a formal answer, but the core philosophy is unchanged: if the composition of the handful is greatly different from that of the bag, then the sample probably originated from another bag. The original example is termed a one-sided or one-tailed test, while the generalization is termed a two-sided or two-tailed test.


Page 6: Statistics basics

Null hypothesis

In statistical inference of observed data of a scientific experiment, the null hypothesis refers to a general or default position: that there is no relationship between two measured phenomena,[1] or that a potential medical treatment has no effect.[2] Rejecting or disproving the null hypothesis – and thus concluding that there are grounds for believing that there is a relationship between two phenomena, or that a potential treatment has a measurable effect – is a central task in the modern practice of science, and gives a precise sense in which a claim is capable of being proven false.

The null hypothesis is often denoted H0 (read "H-naught") and is generally assumed true until evidence indicates otherwise (e.g., H0: μ = 500 hours). The concept of a null hypothesis is used differently in two approaches to statistical inference, though, problematically, the same term is used. In the significance testing approach of Ronald Fisher, a null hypothesis is potentially rejected or disproved on the basis of data that is significantly unlikely under its assumption, but it is never accepted or proved. In the hypothesis testing approach of Jerzy Neyman and Egon Pearson, a null hypothesis is contrasted with an alternative hypothesis, and the two are decided between on the basis of data, with certain error rates. Proponents of the two approaches criticized each other, though today a hybrid approach is widely practiced and presented in textbooks. This hybrid is in turn criticized as incorrect and incoherent – see statistical hypothesis testing. Statistical significance plays a pivotal role in statistical hypothesis testing, where it is used to determine whether a null hypothesis can be rejected or retained.

Regression

In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.
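A minimal sketch of estimating such a regression function is given below: a simple linear regression fitted by ordinary least squares with NumPy. The x and y values are hypothetical, chosen only to illustrate the fit.

```python
# A minimal sketch of simple linear regression (ordinary least squares) with NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # dependent variable

slope, intercept = np.polyfit(x, y, deg=1)    # fit y ≈ slope * x + intercept
y_hat = slope * x + intercept                 # regression function evaluated at x

print(f"estimated regression line: y = {slope:.2f} x + {intercept:.2f}")
print("residuals:", np.round(y - y_hat, 2))   # variation of y around the regression function
```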

1. (Psychology) psychol the adoption by an adult or adolescent of behaviour more appropriate to a child, esp as a defence mechanism to avoid anxiety

2. (Statistics) statistics


Page 7: Statistics basics

a. the analysis or measure of the association between one variable (the dependent variable) and one or more other variables (the independent variables), usually formulated in an equation in which the independent variables have parametric coefficients, which may enable future values of the dependent variable to be predicted

b. (as modifier): regression curve.

3. (Astronomy) astronomy the slow movement around the ecliptic of the two points at which the moon's orbit intersects the ecliptic. One complete revolution occurs about every 19 years

4. (Geological Science) geology the retreat of the sea from the land

5. (Statistics) the act of regressing

6. (Logic) the act of regressing

Correlation

In statistics, dependence is any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence.

Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation).
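A minimal sketch of computing a correlation coefficient for this kind of demand/price relationship is shown below. The numbers are hypothetical and NumPy is assumed to be available.

```python
# A minimal sketch of computing the (Pearson) correlation coefficient with NumPy.
# Hypothetical price/demand data, used only to illustrate a negative correlation.
import numpy as np

price  = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
demand = np.array([95,  80,  70,  55,  40])

r = np.corrcoef(price, demand)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
print(f"correlation between price and demand: r = {r:.2f}")  # close to -1: strong negative dependence
```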

1. A causal, complementary, parallel, or reciprocal relationship, especially a structural, functional, or qualitative correspondence between two comparable entities: a correlation between drug abuse and crime.

2. Statistics The simultaneous change in value of two numerically valued random variables: the positive correlation between cigarette smoking and the incidence of lung cancer; the negative correlation between age and normal vision.

3. An act of correlating or the condition of being correlated.


Page 8: Statistics basics

Exponential Distribution

The Exponential distribution is used to describe survival times.

Suppose that some device has the same hazard rate λ at each moment. The survival time is therefore 1/λ on average.

Let the Random Variable X denote the time of failure. X then follows the Exponential distribution with parameter λ. The Probability Density Function of X is

f_X(x) = λ exp(−λx) if x ≥ 0, and f_X(x) = 0 otherwise.

The Expected Value of X is 1/λ and the Variance is 1/λ².

example:

A man enters a bank at 4pm. There is one person in front of him in the queue. Suppose that the length of time an individual spends with a teller is an exponential random variable with mean 7 minutes.

Let X be the length of time the man in front spends with the teller. λ = 1/7, therefore X ~ Exp(1/7). The probability that the man who entered the bank at 4pm has to wait more than 10 minutes to be served is P(X > 10) = 1 − F_X(10), where F_X(10) is the cumulative distribution function of the exponential distribution evaluated at t = 10. The cumulative distribution function of the exponential distribution is

F_X(t) = ∫_0^t λ exp(−λx) dx = [−exp(−λx)]_0^t = 1 − exp(−λt)

The probability that the man has to wait more than 10 minutes is therefore

1 − (1 − exp(−λ·10)) = exp(−10/7) ≈ 0.240
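The same calculation can be checked numerically. The sketch below uses scipy.stats (assumed available), which parameterises the exponential distribution by its scale 1/λ rather than by λ itself.

```python
# A quick numerical check of the bank-queue example:
# service time is exponential with mean 7 minutes, i.e. rate lambda = 1/7 per minute.
from scipy import stats

mean_service = 7.0                                  # minutes, so lambda = 1/7
waiting = stats.expon(scale=mean_service)           # scipy uses scale = 1/lambda

p_more_than_10 = waiting.sf(10)                     # P(X > 10) = 1 - F_X(10) = exp(-10/7)
print(f"P(wait > 10 min) = {p_more_than_10:.3f}")   # approximately 0.240, matching the calculation above
print(f"mean = {waiting.mean():.1f}, variance = {waiting.var():.1f}")  # 1/lambda and 1/lambda^2
```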

Type I and type II errors

In statistics, a null hypothesis is a statement that the thing being studied produces no effect or makes no difference. An example of a null hypothesis is the statement "This diet has no effect on people's weight." Usually an experimenter frames a null hypothesis with the intent of rejecting it: that is, intending to run an experiment which produces data that shows that the thing under study does make a difference.


Page 9: Statistics basics

A type I error (or error of the first kind) is the incorrect rejection of a true null hypothesis. It is a false positive. Usually a type I error leads one to conclude that a supposed effect or relationship exists when in fact it doesn't. Examples of type I errors include a test that shows a patient to have a disease when in fact the patient does not have the disease, a fire alarm going off indicating a fire when in fact there is no fire, or an experiment indicating that a medical treatment should cure a disease when in fact it does not.

A type II error (or error of the second kind) is the failure to reject a false null hypothesis. It is a false negative. Examples of type II errors would be a blood test failing to detect the disease it was designed to detect in a patient who really has the disease; a fire breaking out while the fire alarm does not ring; or a clinical trial of a medical treatment failing to show that the treatment works when it really does.

Alternative hypothesis

In statistical hypothesis testing, the alternative hypothesis (or maintained hypothesis or research hypothesis) and the null hypothesis are the two rival hypotheses which are compared by a statistical hypothesis test. An example might be where water quality in a stream has been observed over many years and a test is made of the null hypothesis that there is no change in quality between the first and second halves of the data against the alternative hypothesis that the quality is poorer in the second half of the record.

Central tendency

In statistics, a central tendency (or, more commonly, a measure of central tendency) is a central value or a typical value for a probability distribution.[1] It is occasionally called an average or just the center of the distribution. The most common measures of central tendency are the arithmetic mean, the median and the mode. A central tendency can be calculated for either a finite set of values or for a theoretical distribution, such as the normal distribution. Occasionally authors use central tendency (or centrality) to mean "the tendency of quantitative data to cluster around some central value".[2][3] This meaning might be expected from the usual dictionary definitions of the words tendency and centrality. Those authors may judge whether data has a strong or a weak central tendency based on the statistical dispersion, as measured by the standard deviation or something similar.
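A minimal sketch computing the three common measures on a small, hypothetical data set with Python's standard statistics module:

```python
# Mean, median and mode of a small hypothetical data set.
import statistics

data = [2, 3, 3, 5, 7, 8, 9]

print("mean  :", statistics.mean(data))     # arithmetic mean
print("median:", statistics.median(data))   # middle value of the sorted data
print("mode  :", statistics.mode(data))     # most frequent value
```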


Page 10: Statistics basics

Bayes' theorem

In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) is a result that is of importance in the mathematical manipulation of conditional probabilities. It is a result that derives from the more basic axioms of probability.
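In its simplest form, for two events A and B with P(B) > 0, the theorem can be written as:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad \text{where } P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A).
```

As a purely illustrative example: if a disease affects 1% of a population, a test detects it with probability 0.99, and the test also gives false positives with probability 0.05, then P(disease | positive) = (0.99 × 0.01) / (0.99 × 0.01 + 0.05 × 0.99) ≈ 0.167.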

When applied, the probabilities involved in Bayes' theorem may have any of a number of probability interpretations. In one of these interpretations, the theorem is used directly as part of a particular approach to statistical inference. In particular, with the Bayesian interpretation of probability, the theorem expresses how a subjective degree of belief should rationally change to account for evidence: this is Bayesian inference, which is fundamental to Bayesian statistics. However, Bayes' theorem has applications in a wide range of calculations involving probabilities, not just in Bayesian inference.

An Introduction to Bayes' Theorem

Bayes' Theorem is a theorem of probability theory originally stated by the Reverend Thomas Bayes. It can be seen as a way of understanding how the probability that a theory is true is affected by a new piece of evidence. It has been used in a wide variety of contexts, ranging from marine biology to the development of "Bayesian" spam blockers for email systems. In the philosophy of science, it has been used to try to clarify the relationship between theory and evidence. Many insights in the philosophy of science involving confirmation, falsification, the relation between science and pseudoscience, and other topics can be made more precise, and sometimes extended or corrected, by using Bayes' Theorem. This section introduces the theorem and its use in the philosophy of science.


Page 11: Statistics basics

Chebyshev's Theorem: For any set of data (either population or sample) and for any constant k greater than 1, the proportion of the data that must lie within k standard deviations on either side of the mean is at least

1 − 1/k²

In ordinary words, Chebyshev’s Theorem says the following about sample or population data:

1) Start at the mean.
2) Back off k standard deviations below the mean and then advance k standard deviations above the mean.
3) The fractional part of the data in the interval described will be at least 1 − 1/k² (we assume k > 1).
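A minimal empirical check of the theorem on an arbitrary, hypothetical sample (any data set whatsoever will satisfy the bound):

```python
# Chebyshev's Theorem checked on a made-up data set: with k = 2, at least
# 1 - 1/2^2 = 75% of the values must lie within 2 standard deviations of the mean.
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9, 12, 15, 18, 25]
mean = statistics.mean(data)
sd = statistics.pstdev(data)          # standard deviation of this data set

k = 2
inside = [x for x in data if abs(x - mean) <= k * sd]
proportion = len(inside) / len(data)

print(f"theoretical lower bound: 1 - 1/k^2 = {1 - 1 / k**2:.2f}")
print(f"observed proportion within {k} sd of the mean: {proportion:.2f}")  # always >= the bound
```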

Simple random sampling

In a simple random sample (SRS) of a given size, all such subsets of the frame are given an equal probability. Furthermore, any given pair of elements has the same chance of selection as any other such pair (and similarly for triples, and so on). This minimises bias and simplifies analysis of results. In particular, the variance between individual results within the sample is a good indicator of variance in the overall population, which makes it relatively easy to estimate the accuracy of results.

However, SRS can be vulnerable to sampling error because the randomness of the selection may result in a sample that doesn't reflect the makeup of the population. For instance, a simple random sample of ten people from a given country will on average produce five men and five women, but any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and stratified techniques attempt to overcome this problem by using information about the population to choose a more "representative" sample.


Page 12: Statistics basics

One of the best ways to achieve unbiased results in a study is through random sampling. Random sampling includes choosing subjects from a population through unpredictable means. In its simplest form, subjects all have an equal chance of being selected out of the population being researched.

Random sampling: a method of selecting a sample (a random sample) from a statistical population in such a way that every possible sample that could be selected has a predetermined probability of being selected.
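A minimal sketch of drawing a simple random sample from a hypothetical sampling frame with Python's standard library:

```python
# Simple random sampling: every subset of the given size is equally likely to be drawn.
import random

population = list(range(1, 101))          # hypothetical sampling frame: units 1..100
sample = random.sample(population, k=10)  # drawn without replacement

print("simple random sample:", sorted(sample))
```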

What is the difference between coefficient of determination, and coefficient of correlation?

The coefficient of correlation is the "R" value given in the summary table of the regression output. R square is also called the coefficient of determination. Multiply R by R to get the R square value; in other words, the coefficient of determination is the square of the coefficient of correlation.

R square, or the coefficient of determination, shows the percentage of variation in y that is explained by all the x variables together. The higher, the better. It is always between 0 and 1 and can never be negative, since it is a squared value.

It is easy to explain the R square in terms of regression. It is not so easy to explain the R in terms of regression.

For example, if the coefficient of correlation is the R value .850 (or 85%), then the coefficient of determination is the R square value .723 (or 72.3%). R square is simply the square of R, i.e. R times R.


Page 13: Statistics basics

Coefficient of correlation: the degree of relationship between two variables, say x and y. It can range between −1 and 1. A value of 1 indicates that the two variables move in unison: they rise and fall together and have perfect correlation. A value of −1 means that the two variables are perfect opposites: one goes up and the other goes down, in a perfectly negative way. Any two variables in this universe can be argued to have a correlation value. If they are not correlated, the correlation value can still be computed, and it would be 0. The correlation value always lies between −1 and 1 (passing through 0, which means no correlation at all, i.e. perfectly unrelated). Correlation can be rightfully explained for simple linear regression, because there is only one x and one y variable. For multiple linear regression R is computed, but it is then difficult to explain because multiple variables are involved. That is why R square is a better term: you can explain R square for both simple linear regressions and multiple linear regressions.
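A minimal sketch relating R and R square for a simple linear regression, on hypothetical data with NumPy assumed available:

```python
# Coefficient of correlation (R) and coefficient of determination (R^2)
# for a simple linear regression with one x and one y variable.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.3, 9.7, 12.2])

r = np.corrcoef(x, y)[0, 1]               # coefficient of correlation, between -1 and 1
r_squared = r ** 2                        # coefficient of determination, between 0 and 1

print(f"R   = {r:.3f}")
print(f"R^2 = {r_squared:.3f}  (proportion of variation in y explained by x)")
```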

Descriptive statistics

Descriptive statistics is the discipline of quantitatively describing the main features of a collection of information,[1] or the quantitative description itself. Descriptive statistics are distinguished from inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory.[2] Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities.

Some measures that are commonly used to describe a data set are measures of central tendency and measures of variability or dispersion. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness.[3]


Page 14: Statistics basics

Statistical inference

In statistics, statistical inference is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation.[1] More substantially, the terms statistical inference, statistical induction and inferential statistics are used to describe systems of procedures that can be used to draw conclusions from datasets arising from systems affected by random variation,[2] such as observational errors, random sampling, or random experimentation.[1] Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. Inferential statistics are used to test hypotheses and make estimations using sample data.

The outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.

Characteristics of a good estimator

There are four main properties associated with a "good" estimator. These are:

1) Unbiasedness: the expected value of the estimator (or the mean of the estimator) is simply the figure being estimated. In statistical terms, E(estimate of Y) = Y.

2) Consistency: the estimator converges in probability to the estimated figure. In other words, as the sample size grows, the estimator gets closer and closer to the quantity being estimated.

3) Efficiency: The estimator has a low variance, usually relative to other estimators, which is called relative efficiency. Otherwise, the variance of the estimator is minimized.

4) Robustness: the mean-squared error of the estimator is minimized relative to other estimators. The estimator should also be unbiased and consistent.
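As a minimal illustration of the first property (unbiasedness), the simulation below draws many samples from a hypothetical normal population with a known mean and shows that the sample mean, averaged over the samples, lands close to the quantity it estimates. NumPy is assumed to be available.

```python
# Unbiasedness of the sample mean, checked by simulation on a hypothetical population.
import numpy as np

rng = np.random.default_rng(0)
true_mean = 10.0

# 5000 samples of size 30 from a normal population with mean 10; record each sample mean.
estimates = [rng.normal(loc=true_mean, scale=3.0, size=30).mean() for _ in range(5000)]

print(f"average of the estimates: {np.mean(estimates):.3f}  (close to the true mean {true_mean})")
```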

Stats: Test for Independence

In the test for independence, the claim is that the row and column variables are independent of each other. This is the null hypothesis.

The multiplication rule said that if two events were independent, then the probability of both occurring was the product of the probabilities of each occurring. This is key to working the test for independence. If you end up


Page 15: Statistics basics

rejecting the null hypothesis, then the assumption must have been wrong and the row and column variable are dependent. Remember, all hypothesis testing is done under the assumption the null hypothesis is true.

The test statistic used is the same as the chi-square goodness-of-fit test. The principle behind the test for independence is the same as the principle behind the goodness-of-fit test. The test for independence is always a right tail test.

In fact, you can think of the test for independence as a goodness-of-fit test where the data is arranged into table form. This table is called a contingency table.

The test statistic has a chi-square distribution when the following assumptions are met:

The data are obtained from a random sample

The expected frequency of each category must be at least 5.

The following are properties of the test for independence:

The data are the observed frequencies.

The data are arranged into a contingency table.

The degrees of freedom are the degrees of freedom for the row variable times the degrees of freedom for the column variable. It is not one less than the sample size; it is the product of the two degrees of freedom.

It is always a right tail test.

It has a chi-square distribution.

The expected value is computed by taking the row total times the column total and dividing by the grand total.

The value of the test statistic doesn't change if the order of the rows or columns is switched.

The value of the test statistic doesn't change if the rows and columns are interchanged (transpose of the matrix).
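A minimal sketch of running such a test on a hypothetical contingency table with scipy.stats (assumed available); the expected frequencies it returns follow the row total times column total divided by grand total rule described above.

```python
# Chi-square test for independence on a made-up 2x3 contingency table of observed frequencies.
from scipy.stats import chi2_contingency

observed = [[20, 30, 25],    # rows: e.g. two groups
            [30, 20, 25]]    # columns: e.g. three response categories

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")   # df = (rows-1) * (columns-1)
print("expected frequencies:")                                     # row total * column total / grand total
print(expected)
```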


Page 16: Statistics basics

Utility of regression studies

Regression models can be used to help understand and explain relationships among variables; they can also be used to predict actual outcomes. In this course you will learn how multiple linear regression models are derived, use software to implement them, learn what assumptions underlie the models, learn how to test whether your data meet those assumptions and what can be done when those assumptions are not met, and develop strategies for building and understanding useful models.

Advantages of Sample Surveys

Cost Reduction

In most cases, conducting a sample survey costs less than a census survey. If fewer people are surveyed, fewer surveys need to be produced, printed, shipped, administered, and analyzed. Further, fewer data reports are often required, so the amount of time and expense required to analyze and distribute the results reports is reduced.

Generalizability of Results

If conducted properly, the results of a sample survey can still be generalized to the entire population, meaning that the sample results can be considered representative of the views of the entire target population. Sampling strategies should be firmly aligned with the overarching survey goals to ensure the utilization of a proper sample frame and sample size.

Timeliness

Sample surveys can typically be printed, distributed, administered, and analyzed more quickly than census surveys. As a result, a shorter turnaround time for results is often achieved.

Identification of Strengths & Opportunities

As with census surveys, results from a properly conducted sample survey can also be used to identify strengths and opportunities and develop plans for meaningful change.

Cost: By comparison with a complete enumeration of the same population, a sample may be based on data for only a small number of the units comprising that population. A sample survey may thus be very much less expensive to conduct than a comparable complete enumeration.


Page 17: Statistics basics

Time: Being small in scale, a sample survey is not only less expensive than a census; the desired information is also obtained in much less time.

Scope: The smaller scale is likely to permit the collection of a wider range of survey data and allow a wider choice of methods of observation, measurement or questioning than is usually feasible with a complete enumeration.

Respondents' Convenience: A sample survey considerably reduces the overall burden on respondents, in that only a few, not all, of the individuals in the population are put to the trouble of having to answer questions or provide information.

Labor: Sampling saves labor. A small staff is required both for fieldwork and for tabulation and processing data.

Flexibility: In certain types of investigation, highly skilled and trained personnel or even specialized equipment are needed to collect data. A complete enumeration in such cases is impracticable, and hence sample surveys, being more flexible and having greater scope, will be more appropriate for this type of inquiry.

Data Processing: The data-processing requirement for a sample survey is likely to be much less than for a complete count. Whereas a complete count may well require a computer to process the data, a sample survey can often be processed manually with fewer people and less logistic support.

Accuracy: A sample survey can employ personnel of higher quality, with more intensive training and more careful supervision of fieldwork than is possible in a complete enumeration. As a result, observations, measurements, and questioning for a sample survey can often be carried out more carefully, and thus yield results subject to smaller non-sampling error than is generally practicable in a more extensive complete enumeration, usually at a much lower cost.

Feasibility: There are situations where complete enumeration is not feasible and thus a sample survey is necessary. There are also instances where it is not practicable to enumerate all the units due to their perishable or fragile nature. The alternative in this situation is to take only a few of the units. For example, consider the problem of checking the quality of mango juice produced by a local company. One way to test the quality is to drink the entire lot, which is impracticable. Testing of electric bulbs, screws, glass, and medicine are all examples of this type, where sampling is necessary.


Page 18: Statistics basics

The hypergeometric distribution applies to sampling without replacement from a finite population whose elements can be classified into two mutually exclusive categories like Pass/Fail, Male/Female or Employed/Unemployed. As random selections are made from the population, each subsequent draw decreases the population, causing the probability of success to change with each draw.

The following conditions characterise the hypergeometric distribution:

The result of each draw can be classified into one of two categories.

The probability of a success changes on each draw.

A random variable X follows the hypergeometric distribution if its probability mass function (pmf) is given by:[1]

P(X = k) = \binom{K}{k} \binom{N-K}{n-k} / \binom{N}{n}

Where:

N is the population size

K is the number of success states in the population

n is the number of draws

k is the number of successes

\binom{a}{b} is a binomial coefficient

The pmf is positive when max(0, n + K − N) ≤ k ≤ min(K, n).
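A minimal sketch of evaluating this pmf with scipy.stats (assumed available); the population size, success count, and number of draws below are hypothetical.

```python
# Hypergeometric pmf: n = 10 draws without replacement from a population of N = 50
# containing K = 20 "success" units; probability of exactly k = 4 successes.
from scipy.stats import hypergeom

N, K, n, k = 50, 20, 10, 4
# scipy's parameters are M (population size), n (number of success states), N (number of draws)
p = hypergeom.pmf(k, N, K, n)
print(f"P(X = {k}) = {p:.4f}")
```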


Page 19: Statistics basics

Leptokurtic distribution

In probability theory and statistics, kurtosis (from the Greek word κυρτός, kyrtos or kurtos, meaning curved, arching) is any measure of the "peakedness" of the probability distribution of a real-valued random variable.[1] In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. There are various interpretations of kurtosis, and of how particular measures should be interpreted; these are primarily peakedness (width of peak), tail weight, and lack of shoulders (distribution primarily peak and tails, not in between).

One common measure of kurtosis, originating with Karl Pearson, is based on a scaled version of the fourth moment of the data or population, but it has been argued that this really measures heavy tails, and not peakedness.[2] For this measure, higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations. It is common practice to use an adjusted version of Pearson's kurtosis, the excess kurtosis, to provide a comparison of the shape of a given distribution to that of the normal distribution. Distributions with negative or positive excess kurtosis are called platykurtic distributions or leptokurtic distributions, respectively.
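A minimal sketch comparing the excess kurtosis of a simulated normal sample with a heavier-tailed (leptokurtic) Laplace sample, using NumPy and scipy.stats (both assumed available):

```python
# Excess kurtosis (Fisher definition: normal distribution is approximately 0) for two simulated samples.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
normal_sample  = rng.normal(size=100_000)
laplace_sample = rng.laplace(size=100_000)     # heavier tails than the normal

print(f"excess kurtosis, normal : {kurtosis(normal_sample):.2f}")   # close to 0
print(f"excess kurtosis, Laplace: {kurtosis(laplace_sample):.2f}")  # positive, i.e. leptokurtic
```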

The interquartile range

The interquartile range (IQR), also called the midspread or middle fifty, is a measure of statistical dispersion, being equal to the difference between the upper and lower quartiles,[1][2] IQR = Q3 − Q1. In other words, the IQR is the 1st quartile subtracted from the 3rd quartile; these quartiles can be clearly seen on a box plot of the data. It is a trimmed estimator, defined as the 25% trimmed mid-range, and is the most significant basic robust measure of scale.
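A minimal sketch of computing the IQR from the quartiles with NumPy, on a small hypothetical data set:

```python
# Interquartile range: the difference between the 75th and 25th percentiles.
import numpy as np

data = np.array([1, 3, 4, 6, 7, 8, 10, 14, 21])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```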
