Correlation Final
CORRELATION ANALYSIS
MBA “A”NewishJashan
Jotdeep SinghYogesh
Introduction
Correlation is a LINEAR association between two random variables
Correlation analysis shows us how to determine both the nature and the strength of the relationship between two variables
When variables are dependent on time, correlation is applied
Correlation lies between -1 and +1
A zero correlation indicates that there is no linear relationship between the variables
A correlation of –1 indicates a perfect negative correlation
A correlation of +1 indicates a perfect positive correlation
Types of Correlation
There are three types of correlation: Type 1, Type 2 and Type 3
Type 1
Positive, Negative, No, Perfect
Positive: If two related variables are such that when one increases (decreases), the other also increases (decreases)
Negative: If two variables are such that when one increases (decreases), the other decreases (increases)
No correlation: If both the variables are independent
Type 2
Linear, Non-linear
Linear: When plotted on a graph it tends to be a perfect line
Non-linear: When plotted on a graph it is not a straight line
Type 3
Simple, Multiple, Partial
Simple: Two variables, one independent and one dependent
Multiple: One dependent variable and more than one independent variable
Partial: One dependent variable and more than one independent variable, but only one independent variable is considered and the other independent variables are held constant
Methods of Studying Correlation
Scatter Diagram Method
Karl Pearson's Coefficient of Correlation Method
Spearman’s Rank Correlation Method
[Figure: two scatter plots of Symptom Index against dose (in mg). Drug A: very good fit. Drug B: moderate fit.]
Correlation: Linear Relationships
Strong relationship = good linear fit
Points clustered closely around a line show a strong correlation. The line is a good predictor (good fit) of the data. The more spread out the points, the weaker the correlation and the poorer the fit. The line is a REGRESSION line (Y = bX + a)
Coefficient of Correlation
A measure of the strength of the linear relationship between two variables, defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations
Represented by “r”
r lies between -1 and +1
Magnitude and Direction
-1 ≤ r ≤ +1
The + and – signs are used for positive linear correlations and negative linear correlations, respectively
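\[ r_{xy} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} \]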
Shared variability of X and Y variables on the top
Individual variability of X and Y variables on the bottom
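As a rough sketch, the coefficient can be computed from raw sums, with the shared variability in the numerator and the individual variabilities in the denominator. The function name `pearson_r` is only illustrative, and the data are the driving-experience figures from the regression example later in this deck.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from raw sums of X, Y, XY, X^2 and Y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    # Top: shared variability of X and Y.
    numerator = n * sum_xy - sum_x * sum_y
    # Bottom: individual variability of X and of Y.
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


experience = [5, 2, 12, 9, 15, 6, 25, 16]
accidents = [64, 87, 50, 71, 44, 56, 42, 60]
r = pearson_r(experience, accidents)
print(round(r, 3))  # -0.768
```

The negative sign gives the direction of the relationship and the size of the value gives its strength.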
Interpreting Correlation Coefficient r
strong correlation: r > .70 or r < –.70
moderate correlation: r is between .30 and .70 or r is between –.30 and –.70
weak correlation: r is between 0 and .30 or r is between 0 and –.30
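The bands above can be sketched as a small helper. The function name is hypothetical, and treating exactly .30 and .70 as "moderate" is an assumption, since the slide does not say where the boundary values belong.

```python
def describe_correlation(r):
    """Return (strength, direction) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)
    if size > 0.70:
        strength = "strong"
    elif size >= 0.30:  # assumption: boundary values count as moderate
        strength = "moderate"
    else:
        strength = "weak"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction


print(describe_correlation(-0.768))  # ('strong', 'negative')
print(describe_correlation(0.45))    # ('moderate', 'positive')
```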
Coefficient of Determination
Coefficient of determination lies between 0 and 1
Represented by r²
The coefficient of determination is a measure of how
well the regression line represents the data
If the regression line passes exactly through every
point on the scatter plot, it would be able to explain all
of the variation
The further the line is away from the points, the less it
is able to explain
r² is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable
It is a measure that allows us to determine how certain one
can be in making predictions from a certain model/graph
The coefficient of determination is the ratio of the
explained variation to the total variation
The coefficient of determination is such that 0 ≤ r² ≤ 1, and denotes the strength of the linear association between x and y
The coefficient of determination represents the percentage of the variation in the data that is accounted for by the line of best fit
For example, if r = 0.922, then r 2 = 0.850
Which means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation)
The other 15% of the total variation in y remains unexplained
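A minimal sketch of the "explained variation over total variation" definition follows. It fits a least-squares line and compares the variation of the fitted values with the variation of y; the function name is illustrative, and the data are the driving-experience figures from the regression example later in this deck.

```python
def coefficient_of_determination(xs, ys):
    """r^2 as explained variation divided by total variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope and intercept of the line y = a + b*x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    total = sum((y - mean_y) ** 2 for y in ys)
    explained = sum((a + b * x - mean_y) ** 2 for x in xs)
    return explained / total


experience = [5, 2, 12, 9, 15, 6, 25, 16]
accidents = [64, 87, 50, 71, 44, 56, 42, 60]
r2 = coefficient_of_determination(experience, accidents)
print(round(r2, 4))          # 0.5897
print(round(0.922 ** 2, 3))  # 0.85, the r = 0.922 example
```

For a straight-line fit this ratio equals the square of the correlation coefficient, which is why squaring r = 0.922 gives the 85% figure directly.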
Spearman's rank correlation coefficient
The method of rank correlation is used as an alternative when the data are not available in numerical form. The values of the two variables are converted to their ranks, and the correlation obtained from those ranks is known as rank correlation.
Computation of Rank Correlation
Spearman’s rank correlation coefficient ρ
can be calculated when
Actual ranks given
Ranks are not given but grades are given but not
repeated
Ranks are not given and grades are given and
repeated
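The computation can be sketched as follows, assuming the usual formula ρ = 1 − 6∑d² / (n(n² − 1)), where d is the difference between the two ranks of each item. The helper names are hypothetical. Repeated values receive the average of the ranks they span; the simple formula is exact only when no ranks are repeated, and the tie correction used in textbooks is not shown here. The data are the driving-experience figures from the regression example later in this deck.

```python
def average_ranks(values):
    """Rank from 1 (smallest); repeated values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)); exact only without repeated ranks."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


experience = [5, 2, 12, 9, 15, 6, 25, 16]
accidents = [64, 87, 50, 71, 44, 56, 42, 60]
rho = spearman_rho(experience, accidents)
print(round(rho, 4))  # -0.7619
```

When actual ranks are already given, they can be passed in directly, since ranking a list of distinct ranks returns the same ranks.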
BUSINESS STATISTICS
PRESENTATION ON
REGRESSION ANALYSIS
OBJECTIVES OF THE PRESENTATION-
What is regression analysis
Types and methods of regression analysis
Practical aspect of regression analysis with an example
INTRODUCTION-
Regression analysis is the statistical tool which is employed for the purpose of forecasting or making estimates
Here we make use of various mathematical formulas and assumptions to describe a real world situation.
In every situation, estimation becomes easy once it is known that the variable to be estimated is related to and dependent on some other variable.
For making estimates we first have to model the relationship between the variables involved.
Models can broadly be classified into –
Linear regression-
Linear regression analysis is a powerful technique used for predicting the unknown value of a variable from the known value of another variable. More precisely, if X and Y are two related variables, then linear regression analysis helps us to predict the value of Y for a given value of X or vice versa. For example, the age of a human being and maturity are related variables, so linear regression analysis can predict the level of maturity given the age of a human being.
Multiple regression-
Multiple regression analysis is a powerful technique used for predicting the unknown value of a variable from the known values of two or more variables, also called the predictors.
Multiple regression analysis helps us to predict the value of Y for given values of X1, X2, …, Xk.
For example, the yield of rice per acre depends upon quality of seed, fertility of soil, fertilizer used, temperature and rainfall. If one is interested in studying the joint effect of all these variables on rice yield, one can use this technique.
Dependent and Independent Variables-
By linear regression, we mean models with just one independent and one dependent variable. The variable whose value is to be predicted is known as the dependent variable and the one whose known value is used for prediction is known as the independent variable.
By multiple regression, we mean models with just one dependent and two or more independent variables. The variable whose value is to be predicted is known as the dependent variable and the ones whose known values are used for prediction are known as independent variables.
Methods of solving regression models-
1) GRAPHICAL METHOD-
In this graphical method the average relationship between the dependent variable and independent variable is expressed by a line called “line of best fit”.
Example:
Experience (in years)   Income (in '000)
15 150
10 120
5 60
3 40
8 70
9 90
[Figure: scatter plot of income (in '000) against experience (in years), with the line of best fit.]
2) ALGEBRAIC METHOD-
In this method we make use of the regression equation and regression coefficients.
Regression equation(Linear).
The general equation is given by-
y = a + bx
a is the intercept
b is the slope of the line
With the use of the above general equation we find the normal equations
Taking the summation of the general equation over all N observations, we find the first normal equation, i.e.
∑Y = N.a + b∑X
And again, to find the second normal equation, we multiply the general equation by x and then take the summation, i.e.
∑XY = a∑X + b∑X²
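The two normal equations can be solved for a and b directly, as in the sketch below. The function name `fit_line` is illustrative; the data are the experience and income figures from the graphical example above.

```python
def fit_line(xs, ys):
    """Solve the two normal equations for intercept a and slope b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Eliminating a from the two normal equations gives b.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # The first normal equation then gives a.
    a = (sum_y - b * sum_x) / n
    return a, b


experience = [15, 10, 5, 3, 8, 9]
income = [150, 120, 60, 40, 70, 90]
a, b = fit_line(experience, income)
print(round(a, 2), round(b, 2))  # 9.77 9.43
```

On these figures the fitted line is roughly income = 9.77 + 9.43 × experience, with income in '000.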
Regression equation(Multiple).
A statistical technique used to explain or predict the behaviour of a dependent variable
General equation => y = a + b1x1 + b2x2 + ......... + bnxn
Normal equations for multiple regression are:
∑Y = N.a + b1∑X1 + b2∑X2
∑X1Y = a∑X1 + b1∑X1² + b2∑X1.X2
∑X2Y = a∑X2 + b1∑X1.X2 + b2∑X2²
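For two independent variables, the three normal equations form a 3×3 linear system in a, b1 and b2. The sketch below builds that system and solves it by elimination. All names are illustrative, and the data are hypothetical values generated exactly from y = 1 + 2·x1 + 3·x2 so that the recovered coefficients can be checked.

```python
def solve_linear_system(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def dot(u, v):
    """Sum of products, e.g. dot(x1, y) is the sum of X1*Y."""
    return sum(p * q for p, q in zip(u, v))


def fit_two_predictors(x1, x2, y):
    """Build and solve the three normal equations for a, b1, b2."""
    n = len(y)
    matrix = [
        [n, sum(x1), sum(x2)],
        [sum(x1), dot(x1, x1), dot(x1, x2)],
        [sum(x2), dot(x1, x2), dot(x2, x2)],
    ]
    rhs = [sum(y), dot(x1, y), dot(x2, y)]
    return solve_linear_system(matrix, rhs)


# Hypothetical data, constructed so that y = 1 + 2*x1 + 3*x2 exactly.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * p + 3 * q for p, q in zip(x1, x2)]
a, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

Each row of `matrix` corresponds to one normal equation, in the order listed above.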
Lines of Regression
There are two lines of regression- that of Y on X and X on Y.
The line of regression of Y on X is given by Y = a + bX where a and b are unknown constants known as intercept and slope of the equation. This is used to predict the unknown value of variable Y when value of variable X is known.
On the other hand, the line of regression of X on Y is given by X = c + dY which is used to predict the unknown value of variable X using the known value of variable Y.
Often, only one of these lines makes sense. Exactly which of these is appropriate for the analysis in hand depends on the labeling of the dependent and independent variables in the problem to be analyzed.
Regression coefficients-
The two regression coefficients are byx and bxy. The formulas for the two regression coefficients are given by –
\[ b_{yx} = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2} \]
\[ b_{xy} = \frac{N\sum XY - \sum X \sum Y}{N\sum Y^2 - \left(\sum Y\right)^2} \]
The coefficient of X in the line of regression of Y on X is called the regression coefficient of Y on X and is denoted by byx.
It represents the change in the value of the dependent variable (Y) corresponding to a unit change in the value of the independent variable (X).
Similarly, the coefficient of Y in the line of regression of X on Y is called the regression coefficient of X on Y and is denoted by bxy.
How Good Is the Regression?
Once a regression equation has been constructed, we can check how good it is by examining the coefficient of determination (R²). R² always lies between 0 and 1.
The closer R2 is to 1, the better is the model and its prediction.
PRACTICAL ASPECT OF REGRESSION ANALYSIS-
Here we will show a linear regression analysis between two
variables X and Y.
Variable X is taken as “ driving experience” and variable Y is
taken as “number of road accidents(in a year)”.
Number of road accidents is taken as the dependent variable Y, which is related to the independent variable X, i.e. driving experience.
X (driving experience):      5   2  12   9  15   6  25  16
Y (no. of road accidents):  64  87  50  71  44  56  42  60
From the data we will show-
The estimated regression line for the data.
Number of road accidents taking place when the driving experience is 10 years and 30 years.
Coefficient of determination (R²), which tells us what percentage of the variation in the dependent variable is explained by the independent variable.
The following is the tabular representation of data related to driving experience and number of road accidents.
X     Y     X.Y     X²     Y²
5     64    320     25     4096
2     87    174     4      7569
12    50    600     144    2500
9     71    639     81     5041
15    44    660     225    1936
6     56    336     36     3136
25    42    1050    625    1764
16    60    960     256    3600
∑X=90 ∑Y=474 ∑X.Y=4739 ∑X²=1396 ∑Y²=29642
Since the estimated regression line is given by Y = a + b.X , now using the normal equations we calculate the value of a and b .
∑Y = N.a + b∑X
474 = 8.a + b.90
8a + 90b = 474 (Eq. 1)
∑XY = a∑X + b∑X²
4739 = a.90 + b.1396
90a + 1396b = 4739 (Eq. 2)
Now solving both the equations we get the values of a and b as-
Value of a = 76.66 Value of b = -1.5476
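One way to check these values is to solve the two equations by Cramer's rule, as sketched below. The function name `solve_2x2` and its parameter names are illustrative only.

```python
def solve_2x2(p1, q1, c1, p2, q2, c2):
    """Solve p1*a + q1*b = c1 and p2*a + q2*b = c2 by Cramer's rule."""
    det = p1 * q2 - p2 * q1
    if det == 0:
        raise ValueError("the equations have no unique solution")
    a = (c1 * q2 - c2 * q1) / det
    b = (p1 * c2 - p2 * c1) / det
    return a, b


# 8a + 90b = 474 and 90a + 1396b = 4739
a, b = solve_2x2(8, 90, 474, 90, 1396, 4739)
print(round(a, 2), round(b, 4))  # 76.66 -1.5476
```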
The estimated regression line is
Y = 76.66 – 1.5476 X
[Figure: trend line for Y = 76.66 – 1.5476 X, number of accidents plotted against experience.]
Road accidents depend on driving experience: a new driver is inexperienced and faces a higher risk of accident. There is therefore a negative relationship between the two variables, and the trend line is downward sloping in this case.
The value of a is 76.66, which means that a driver with 0 years of experience is estimated to have 76.66 road accidents in a year.
The value of b means that for every extra year of driving experience, the estimated number of road accidents decreases by 1.5476.
No. of accidents with 10 yr experience: Y = 76.66 – 1.5476 (10) = 61.184
No. of accidents with 30 yr experience: Y = 76.66 – 1.5476 (30) = 30.232
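The two estimates can be reproduced with a short sketch that plugs the experience values into the estimated line. The function name is illustrative and the default coefficients are the rounded values used on this slide.

```python
def predict_accidents(years, intercept=76.66, slope=-1.5476):
    """Estimated number of road accidents from Y = a + b*X."""
    return intercept + slope * years


for years in (10, 30):
    print(years, round(predict_accidents(years), 3))
# 10 61.184
# 30 30.232
```

Note that 30 years lies outside the observed range of 2 to 25 years, so that estimate is an extrapolation of the fitted line.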
Now we find the coefficient of determination for the data using regression coefficients.
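\[ b_{yx} = \frac{N\sum XY - \sum X \sum Y}{N\sum X^2 - \left(\sum X\right)^2} = \frac{8(4739) - 90 \cdot 474}{8(1396) - (90)^2} = -1.547 \]
\[ b_{xy} = \frac{N\sum XY - \sum X \sum Y}{N\sum Y^2 - \left(\sum Y\right)^2} = \frac{8(4739) - 90 \cdot 474}{8(29642) - (474)^2} = -0.381 \]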
Now R² = byx . bxy
= (-1.547) (-0.381)
= 0.5894
From the above coefficient of determination we can say that almost 59% of the variance of the dependent variable is explained by the independent variable.
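The same calculation can be sketched from the column totals of the table. The function name is illustrative. Because the coefficients are not rounded before multiplying, the product comes out as 0.5897 rather than 0.5894, which still rounds to about 59%.

```python
def regression_coefficients(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Return (b_yx, b_xy) from the totals of X, Y, XY, X^2 and Y^2."""
    shared = n * sum_xy - sum_x * sum_y
    b_yx = shared / (n * sum_x2 - sum_x ** 2)
    b_xy = shared / (n * sum_y2 - sum_y ** 2)
    return b_yx, b_xy


b_yx, b_xy = regression_coefficients(8, 90, 474, 4739, 1396, 29642)
r_squared = b_yx * b_xy
print(round(b_yx, 3), round(b_xy, 3), round(r_squared, 4))
# -1.548 -0.381 0.5897
```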
Conceptual Framework of SENSEX and Nifty
Stock Market Indices
Stock market performance is quantified by calculating an index using the benchmark scrips. SENSEX (Sensitive Index) is associated with the Bombay Stock Exchange and S&P CNX NIFTY is associated with the National Stock Exchange.
Bombay Stock Exchange
There are 23 stock exchanges in India. Bombay Stock Exchange is the largest, with over 6,000 stocks listed. The BSE accounts for over two thirds of the total trading volume in the country.
Established in 1875, the exchange is also the oldest in Asia. Among the twenty-two Stock Exchanges recognized by the Government of India under the Securities Contracts (Regulation) Act, 1956, it was the first one to be recognized and it is the only one that had the privilege of getting permanent recognition.
Scrips at BSE
o ACC
o AIRTEL
o BHEL
o DLF
o GRASIM
o GUJARAT AMBUJA
o HDFC
o HDFC BANK
o HINDALCO
o HUL
o ICICI BANK
o INFOSYS
o SUN PHARMA IND. LTD
o ITC
o L&T
o MARUTI
o MAHINDRA & MAHINDRA
o NTPC
o ONGC
o RANBAXY
o RELIANCE COMMUNICATION
o RELIANCE INFRASTRUCTURE
o RIL
o STERLITE INDUSTRIES LTD
o SBI
o TCS
o TATA MOTORS
o TATA STEEL
o TATA POWER COMPANY LTD
o WIPRO
National Stock Exchange
The National Stock Exchange (NSE), located in Bombay, is India's first debt market.
It was set up in 1993 to encourage stock exchange reform through system modernization and competition.
The instruments traded are treasury bills, government securities and bonds issued by public sector companies
How are the SENSEX 30 stocks selected?
Listing History
Trading Frequency
Rank based on the Market Cap (should be among top 100)
Market Capitalization weight
Industry / sector they belong to
Historical Record
Methodology of SENSEX
SENSEX has been calculated since 1986. Initially it was based on the Total Market Capitalization methodology; the methodology was changed in 2003 to Free Float Market Capitalization.
Hence, these days, the SENSEX is based on the free-float market cap of the 30 SENSEX stocks traded on the BSE relative to the base value, which is 100 (1978-79), and it is calculated every 15 seconds
SENSEX is calculated using the "Free-float Market Capitalization" methodology, wherein the level of the index at any point of time reflects the free-float market value of 30 component stocks relative to a base period.
The market capitalization of a company is determined by multiplying the price of its stock by the number of shares issued by the company.
This market capitalization is further multiplied by the free-float factor to determine the free-float market capitalization.
How SENSEX is calculated?
The formula for calculating the SENSEX = (Sum of free-float market cap of 30 benchmark stocks) * Index Factor
where Index Factor = 100 / Market Cap Value in 1978-79. 100 is the Index value during 1978-79.
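The calculation can be sketched as below. The function names are illustrative, and the prices, share counts, free-float factors and base market cap are hypothetical figures chosen only to make the arithmetic easy to follow; they are not actual SENSEX data.

```python
def free_float_market_cap(price, shares_issued, free_float_factor):
    """Market cap (price x shares issued) scaled by the free-float factor."""
    return price * shares_issued * free_float_factor


def index_value(free_float_caps, base_market_cap, base_index_value=100):
    """Index level = sum of free-float market caps x index factor."""
    index_factor = base_index_value / base_market_cap
    return sum(free_float_caps) * index_factor


# Hypothetical stocks: (price, shares issued, free-float factor).
stocks = [
    (100.0, 1_000_000, 0.5),
    (50.0, 2_000_000, 0.8),
    (200.0, 500_000, 0.6),
]
caps = [free_float_market_cap(*stock) for stock in stocks]
# Hypothetical base-period market cap of 1,000,000 with base value 100.
level = index_value(caps, base_market_cap=1_000_000)
print(round(level, 2))  # 19000.0
```

Passing `base_index_value=1000` and the corresponding base-period market cap would follow the same pattern for an index with a base value of 1000.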
How NIFTY is calculated?
The National Stock Exchange (NSE) is associated with NIFTY and it is also calculated by the same methodology but with two key differences.
1. Base year is 1995 and base value is 1000.
2. NIFTY is calculated based on 50 stocks.
Formulae for valuation
SENSEX = (Free float market Capital / Market Capital in 1978-79) × Base index points of 1978-79