analytics project - combined cycle power plant


Upload: jyothi-lakshmi

Post on 08-Jan-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Analytics Project  - Combined Cycle Power Plant
Page 2: Analytics Project  - Combined Cycle Power Plant
Page 3: Analytics Project  - Combined Cycle Power Plant

Introduction

What is a Combined Cycle Power Plant (CCPP)?

A combined cycle power plant is an electrical power plant in which a Gas Turbine (GT) and a Steam Turbine (ST) are used in combination to produce more electrical energy from the same fuel than would be possible from a single cycle power plant. The gas turbine compresses air and mixes it with fuel that is heated to a very high temperature. The hot air-fuel mixture moves through the turbine blades, making them spin. The fast-spinning gas turbine drives a generator to generate electricity. The exhaust (waste) heat that escapes through the exhaust stack of the gas turbine is used by a Heat Recovery Steam Generator (HRSG) to produce steam that spins a steam turbine. This steam turbine drives a generator to produce additional electricity. A CCPP is assumed to produce 50% more energy than a single cycle power plant.

Page 4: Analytics Project  - Combined Cycle Power Plant

Business Objective

1. Create a data model to predict the net hourly electrical energy output (EP) of the plant using the following hourly average ambient variables: Temperature (temp), Ambient Pressure (pressure), Relative Humidity (humidity) and Exhaust Vacuum (vacuum).

2. Identify the relationship between the predictor variables and the response variable.

3. Identify the impact of vacuum on power compared to other variables (Vacuum is generated and collected from the steam turbine).

Page 5: Analytics Project  - Combined Cycle Power Plant

Value Analysis

Predicting the electricity generated hourly from the ambient variables makes it possible to evaluate whether the generated power will be sufficient to meet growing consumer demand. If the predicted power is found to be insufficient, proactive steps can be taken to address the shortfall.

Data Mining Goal

Analyze the data and build a multiple linear regression model to predict the power generated hourly from the ambient variables.

Tools and Techniques

Excel is used only for initial summary statistics of variables. For all other analysis R is used.

Page 6: Analytics Project  - Combined Cycle Power Plant
Page 7: Analytics Project  - Combined Cycle Power Plant

The actual dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011), when the power plant was set to work with full load. However, for this case study, only the data of 2011 has been considered.

Dataset Description

Data Source: UCI Machine Learning Repository https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Data Collection

Citation:

Pınar Tüfekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, International Journal of Electrical Power & Energy Systems, Volume 60, September 2014, Pages 126-140, ISSN 0142-0615.

Heysem Kaya, Pınar Tüfekci, Sadık Fikret Gürgen: Local and Global Learning Methods for Predicting Power of a Combined Gas & Steam Turbine, Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering ICETCEE 2012, pp. 13-18 (Mar. 2012, Dubai).

Page 8: Analytics Project  - Combined Cycle Power Plant

Data Quality

No null values or missing measurements were found.

Temperature in °C

Exhaust Vacuum in cm Hg

Ambient Pressure in millibar

Relative Humidity in %

Electrical Energy Output in MW

Dataset Format

The dataset contains 4 independent variables and 1 dependent variable.

Dependent Variable (Response Variable): Electrical Energy Output (energy)

Independent Variables (Predictor Variables):
Temperature (temp)
Exhaust Vacuum (vacuum)
Ambient Pressure (pressure)
Relative Humidity (humidity)

Table Structure

Page 9: Analytics Project  - Combined Cycle Power Plant
Page 10: Analytics Project  - Combined Cycle Power Plant

Dataset Description

The dataset contains 9568 records, each with measurements for the four ambient variables and the electricity generated hourly. The 9568 records have been split into a Training dataset and a Test dataset in a 70-30 proportion.

No. of records in Training dataset = 6697
No. of records in Test dataset = 2871

Data Cleaning

There are a few outliers in the dataset. However, the outliers were found not to affect the mean noticeably, so they have been retained.
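The 70-30 split described above can be sketched in base R as follows. This is only an illustration: the data frame `ccpp` is a synthetic stand-in with the project's column names, and the seed and the use of `sample()` are assumptions, since the slides do not say how the split was made.

```r
# Synthetic stand-in for the CCPP data (hypothetical values, real column names).
set.seed(42)                          # assumed seed, only for reproducibility
n <- 9568
ccpp <- data.frame(
  temp     = runif(n, 2, 37),
  vacuum   = runif(n, 25, 81),
  pressure = runif(n, 993, 1033),
  humidity = runif(n, 25, 100)
)
ccpp$energy <- 450 - 2 * ccpp$temp + rnorm(n, sd = 4)   # made-up response

# Draw 70% of the row indices at random for training; the rest form the test set.
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
ccpp_train <- ccpp[train_idx, ]
ccpp_test  <- ccpp[-train_idx, ]

c(train = nrow(ccpp_train), test = nrow(ccpp_test))
```

With 9568 rows, `floor(0.7 * n)` gives 6697 training rows and leaves 2871 test rows, which agrees with the record counts quoted above.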

Page 11: Analytics Project  - Combined Cycle Power Plant

Data Description

For convenience, the variables are named as temp (for Temperature), vacuum (for Vacuum), pressure (for Ambient Pressure), humidity (Relative Humidity) and energy (for Electrical Energy).

Statistic                 temp       vacuum     pressure   humidity   energy
Mean                      19.71619   54.36515   1013.172   73.20201   454.19
Standard Error            0.090695   0.155985   0.072736   0.179092   0.207709
Median                    20.355     52.33      1012.83    74.67      451.41
Mode                      13.78      70.32      1016.05    100.09     434.01
Standard Deviation        7.421469   12.76408   5.951889   14.65489   16.99667
Sample Variance           55.0782    162.9216   35.42498   214.7659   288.8868
Kurtosis                  -1.01871   -1.44603   0.079202   -0.43715   -1.02307
Skewness                  -0.14234   0.185706   0.275577   -0.41662   0.311961
Range                     35.3       54.89      40.4       74.6       75.5
Minimum                   1.81       25.36      992.89     25.56      420.26
Maximum                   37.11      80.25      1033.29    100.16     495.76
Sum                       132019.6   364029     6784197    490160.6   3041256
Count                     6696       6696       6696       6696       6696
Confidence Level (95.0%)  0.177791   0.305779   0.142585   0.351076   0.407177

Page 12: Analytics Project  - Combined Cycle Power Plant

Variable Exploration – temperature

About 95% of temperature observations fall between 4.87 and 34.55 (mean ± 2 standard deviations, assuming an approximately normal distribution). The observations are concentrated around the mean. The small SE indicates that the sample mean is relatively close to the actual population mean. The negative skew indicates that the extreme values lie towards the left and most of the values are concentrated to the right of the mean. The histogram shows an approximately normal distribution.

temp

Mean                      19.71619
Standard Error            0.090695
Median                    20.355
Mode                      13.78
Standard Deviation        7.421469
Sample Variance           55.0782
Kurtosis                  -1.01871
Skewness                  -0.14234
Range                     35.3
Minimum                   1.81
Maximum                   37.11
Sum                       132019.6
Count                     6696
Confidence Level (95.0%)  0.177791
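As a quick check on how the figures for temperature relate to each other, the short R sketch below recomputes the standard error, the 95% confidence half-width and the mean ± 2 SD band from the reported mean, standard deviation and count. The variable names are illustrative, and the ± 2 SD reading of the "95% of observations" range is an assumption that happens to reproduce the quoted bounds.

```r
# Values copied from the temperature summary statistics.
n         <- 6696
mean_temp <- 19.71619
sd_temp   <- 7.421469

se      <- sd_temp / sqrt(n)                 # standard error of the mean
ci_half <- qt(0.975, df = n - 1) * se        # half-width of the 95% CI for the mean
band    <- mean_temp + c(-2, 2) * sd_temp    # mean +/- 2 SD (roughly 95% if normal)

round(c(se = se, ci_half = ci_half), 6)
round(band, 2)
```

The same three lines of arithmetic apply to vacuum, pressure, humidity and energy once their mean and standard deviation are substituted.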

Page 13: Analytics Project  - Combined Cycle Power Plant

Variable Exploration – vacuum

About 95% of vacuum observations fall between 28.84 and 79.90 (mean ± 2 standard deviations). The observations are concentrated around the mean. The small SE indicates that the sample mean is relatively close to the actual population mean. The positive skew indicates that the extreme values lie towards the right and most of the values are concentrated to the left of the mean. The histogram shows an approximately normal distribution.

vacuum

Mean                      54.36515
Standard Error            0.155985
Median                    52.33
Mode                      70.32
Standard Deviation        12.76408
Sample Variance           162.9216
Kurtosis                  -1.44603
Skewness                  0.185706
Range                     54.89
Minimum                   25.36
Maximum                   80.25
Sum                       364029
Count                     6696
Confidence Level (95.0%)  0.305779

Page 14: Analytics Project  - Combined Cycle Power Plant

Variable Exploration – pressure

About 95% of pressure observations fall between 1001.28 and 1025.08 (mean ± 2 standard deviations). The observations are concentrated around the mean. The small SE indicates that the sample mean is relatively close to the actual population mean. The positive skew indicates that the extreme values lie towards the right and most of the values are concentrated to the left of the mean. The histogram shows an approximately normal distribution.

pressure

Mean                      1013.172
Standard Error            0.072736
Median                    1012.83
Mode                      1016.05
Standard Deviation        5.951889
Sample Variance           35.42498
Kurtosis                  0.079202
Skewness                  0.275577
Range                     40.4
Minimum                   992.89
Maximum                   1033.29
Sum                       6784197
Count                     6696
Confidence Level (95.0%)  0.142585

Page 15: Analytics Project  - Combined Cycle Power Plant

Variable Exploration – humidity

About 95% of humidity observations fall between 43.9 and 102.51 (mean ± 2 standard deviations; the upper bound lies above the observed maximum of 100.16 because the distribution is skewed). The observations are concentrated around the mean. The small SE indicates that the sample mean is relatively close to the actual population mean. The negative skew indicates that the extreme values lie towards the left and most of the values are concentrated to the right of the mean.

humidity

Mean                      73.20201
Standard Error            0.179092
Median                    74.67
Mode                      100.09
Standard Deviation        14.65489
Sample Variance           214.7659
Kurtosis                  -0.43715
Skewness                  -0.41662
Range                     74.6
Minimum                   25.56
Maximum                   100.16
Sum                       490160.6
Count                     6696
Confidence Level (95.0%)  0.351076

Page 16: Analytics Project  - Combined Cycle Power Plant

Variable Exploration – energy

About 95% of energy observations fall between 420.20 and 488.18 (mean ± 2 standard deviations). The observations are not concentrated around the mean. The small SE indicates that the sample mean is relatively close to the actual population mean. The positive skew indicates that the extreme values lie towards the right and most of the values are concentrated to the left of the mean. The histogram shows an approximately normal distribution.

energy

Mean                      454.19
Standard Error            0.207709
Median                    451.41
Mode                      434.01
Standard Deviation        16.99667
Sample Variance           288.8868
Kurtosis                  -1.02307
Skewness                  0.311961
Range                     75.5
Minimum                   420.26
Maximum                   495.76
Sum                       3041256
Count                     6696
Confidence Level (95.0%)  0.407177

Page 17: Analytics Project  - Combined Cycle Power Plant

Relationship between Independent Variables

          temp        vacuum      pressure     humidity
temp      1.0000000   0.8456535   -0.50126993  -0.54176696
vacuum    0.8456535   1.0000000   -0.41358420  -0.30801094
pressure  -0.5012699  -0.4135842  1.00000000   0.09381292
humidity  -0.5417670  -0.3080109  0.09381292   1.00000000

Variable Pairs         Correlation Coefficient  Conclusion
temp and vacuum        0.8456535                Strong positive correlation
temp and pressure      -0.50126993              Strong negative correlation
temp and humidity      -0.54176696              Strong negative correlation
vacuum and pressure    -0.41358420              Weak negative correlation
vacuum and humidity    -0.30801094              Weak negative correlation
humidity and pressure  0.09381292               Zero correlation

Thresholds used (on the absolute value of the correlation coefficient):

> 0.5: Strong
> 0.3 and < 0.5: Weak
< 0.3: Zero
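The threshold rule can be applied mechanically to the absolute value of each coefficient. The sketch below does this in R for the six pairs listed above; the vector names and the use of `cut()` are illustrative choices, not necessarily how the conclusions were originally derived.

```r
# Pairwise correlation coefficients copied from the table above.
r <- c("temp-vacuum"       =  0.8456535,
       "temp-pressure"     = -0.50126993,
       "temp-humidity"     = -0.54176696,
       "vacuum-pressure"   = -0.41358420,
       "vacuum-humidity"   = -0.30801094,
       "humidity-pressure" =  0.09381292)

# Classify |r| into the three bands: (0, 0.3], (0.3, 0.5], (0.5, 1].
strength  <- cut(abs(r), breaks = c(0, 0.3, 0.5, 1),
                 labels = c("zero", "weak", "strong"), include.lowest = TRUE)
direction <- ifelse(r > 0, "positive", "negative")

data.frame(pair = names(r), r = unname(r), strength, direction)
```

Note that temp and pressure sits only just above the 0.5 cut-off, so its "strong" label is sensitive to where the threshold is drawn.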

Page 18: Analytics Project  - Combined Cycle Power Plant

Relationship between IVs and DV (energy)

Independent variable  Correlation with energy (dependent variable)  Conclusion
temp                  -0.9472024                                     Strong negative correlation
vacuum                -0.8716435                                     Strong negative correlation
pressure              0.5132269                                      Strong positive correlation
humidity              0.3878279                                      Weak positive correlation

Inference: The correlation coefficients indicate that temperature and vacuum are strongly correlated with energy production. Pressure (0.51) falls only just above the 0.5 threshold, so its correlation is better described as moderate, and humidity shows a weak correlation. None of these correlations implies causation.

Page 19: Analytics Project  - Combined Cycle Power Plant

Graphical representation of correlation between IVs and DV (energy)

Page 20: Analytics Project  - Combined Cycle Power Plant
Page 21: Analytics Project  - Combined Cycle Power Plant

Modelling Technique Selection

The independent variables and the dependent variable are continuous and quantitative. The dependent variable, energy, is found to have a linear relationship with the independent variables. Therefore, a Linear Regression Model has been chosen to predict the energy output.

There are 4 explanatory variables (predictors) and 1 dependent variable. Hence the Multiple Linear Regression modelling technique is used to predict the dependent variable. The multiple linear regression equation is given by:

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4

where b0 is the Y intercept and b1, b2, b3 and b4 are the slopes.

Page 22: Analytics Project  - Combined Cycle Power Plant

Multiple Linear Regression Modelling Design

1. Find the coefficient of determination (R2) to check how good the regression is. (R2 should be close to 1.)
2. Interpret the regression coefficients
3. Find the slopes for the linear equation
4. Test the NULL hypothesis that the slope is zero using the P-value
5. Test the appropriateness of the regression model by using the F-test in ANOVA
6. Interpret the regression statistics
7. Assess the model
8. Select the best model by eliminating variables
9. Plot the residuals
10. Re-assess the model using K-fold Cross Validation

Page 23: Analytics Project  - Combined Cycle Power Plant

Modelling Assumptions

1. The independent variables have a linear relationship with the dependent variable.

2. A linear relation between the predictor variables and the response variable does not imply causation.

Page 24: Analytics Project  - Combined Cycle Power Plant

Building the Model

Coefficients of Determination

Pairs (IV & DV)      Correlation Coefficient  Coefficient of Determination (R2)  Conclusion
temp and energy      -0.9472024               0.90                               90% of the variance in energy can be explained by the (negative) linear relationship between temperature and energy.
vacuum and energy    -0.8716435               0.76                               76% of the variance in energy can be explained by the (negative) linear relationship between vacuum and energy.
pressure and energy  0.5132269                0.26                               26% of the variance in energy can be explained by the linear relationship between pressure and energy.
humidity and energy  0.3878279                0.15                               15% of the variance in energy can be explained by the linear relationship between humidity and energy.

For a single predictor, R2 is the square of the correlation coefficient, so the sign of the correlation is lost and only the share of explained variance remains.

Page 25: Analytics Project  - Combined Cycle Power Plant

Fitting the model to the dataset

1. > tvph.lm = lm(energy~temp+vacuum+pressure+humidity, data=ccpp_train)
2. > tvph.lm

Output:

Call:
lm(formula = energy ~ temp + vacuum + pressure + humidity, data = ccpp_train)

Coefficients:
(Intercept)         temp       vacuum     pressure     humidity
  453.70280     -1.96541     -0.23719      0.06269     -0.15545

Energy = 453.70 - 1.97*temp - 0.24*vacuum + 0.06*pressure - 0.16*humidity

This means that, with the other predictors held constant:
The mean energy when all predictors are zero is 453.70.
For every unit increase in temperature, energy decreases by 1.97 units.
For every unit increase in vacuum, energy decreases by 0.24 units.
For every unit increase in pressure, energy increases by 0.06 units.
For every unit increase in humidity, energy decreases by 0.16 units.
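To make the fitted equation concrete, the sketch below evaluates it by hand in R for one set of ambient conditions. The coefficients are the unrounded estimates reported for the model; the input values (temp = 5, vacuum = 30, pressure = 1000, humidity = 40) are the ones passed to predict() in the R commands of this project, and the vector names are illustrative.

```r
# Coefficient estimates as reported for the four-variable model.
b <- c(intercept = 453.702803,
       temp      = -1.965407,
       vacuum    = -0.237190,
       pressure  =  0.062686,
       humidity  = -0.155447)

# One set of ambient conditions; the leading 1 multiplies the intercept.
x <- c(1, temp = 5, vacuum = 30, pressure = 1000, humidity = 40)

energy_hat <- sum(b * x)    # b0 + b1*temp + b2*vacuum + b3*pressure + b4*humidity
round(energy_hat, 2)
```

The result is about 493 MW, which is what `predict(tvph.lm, ...)` would return as the fitted value for the same inputs, up to rounding of the coefficients.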

Page 26: Analytics Project  - Combined Cycle Power Plant

Finding the F-statistic and P-value

> summary(tvph.lm)

Output:

Call:
lm(formula = energy ~ temp + vacuum + pressure + humidity, data = ccpp_train)

Residuals:
    Min      1Q  Median      3Q     Max
-43.289  -3.166  -0.101   3.229  16.733

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)
(Intercept) 453.702803  11.630952   39.008  < 2e-16 ***
temp         -1.965407   0.018527 -106.083  < 2e-16 ***
vacuum       -0.237190   0.008777  -27.023  < 2e-16 ***
pressure      0.062686   0.011283    5.556 2.87e-08 ***
humidity     -0.155447   0.005005  -31.059  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.58 on 6691 degrees of freedom
Multiple R-squared:  0.9274, Adjusted R-squared:  0.9274
F-statistic: 2.137e+04 on 4 and 6691 DF,  p-value: < 2.2e-16

The low P-values indicate that all four IVs are statistically significant: changes in the IVs are associated with changes in energy.

Page 27: Analytics Project  - Combined Cycle Power Plant

NULL Hypothesis Test using P-value

Null Hypothesis: No significant linear relationship exists between an independent variable and the dependent variable. Y cannot be explained by that X, and its slope is zero. H0: bi = 0, tested separately for each slope b1 to b4.

Alternate Hypothesis: A significant linear relationship exists between the independent variable and the dependent variable. Y can be explained by that X, and its slope is not equal to zero. Ha: bi ≠ 0.

The P-values are very small for all IVs (< 2e-16 for temp, vacuum and humidity; 2.87e-08 for pressure). Therefore the NULL hypothesis (H0: bi = 0) is rejected for each IV. That is, a significant linear relationship exists between the independent variables and the dependent variable.

Interpretation of F-statistic

F-statistic: 2.137e+04 (p-value < 2.2e-16)

The F-statistic tests the hypothesis that all slopes are zero at the same time. A value this much larger than 1 indicates that the variance explained by the model is very unlikely to have arisen by chance, which confirms that the regression as a whole is statistically significant.

Page 28: Analytics Project  - Combined Cycle Power Plant

ANOVA - analysis of variance table

Analysis of Variance Table

Response: energy
            Df  Sum Sq Mean Sq  F value    Pr(>F)
temp         1 1735257 1735257 82706.73 < 2.2e-16 ***
vacuum       1   33878   33878  1614.69 < 2.2e-16 ***
pressure     1    4340    4340   206.88 < 2.2e-16 ***
humidity     1   20239   20239   964.64 < 2.2e-16 ***
Residuals 6691  140383      21
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Interpretation of F-statistic and P-value

The high F-values and low P-values indicate that all variables have statistical significance. The sums of squares in this table are sequential, so each row measures what a variable adds to the variables listed above it.
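The ANOVA table and the regression summary are two views of the same decomposition of variance. As a rough cross-check, the R sketch below rebuilds R-squared, the overall F-statistic and the residual standard error from the sums of squares in the table; the variable names are illustrative, and small differences from the summary output come from the rounding of the tabulated sums.

```r
# Sequential sums of squares and residual sum of squares from the ANOVA table.
ss_terms <- c(temp = 1735257, vacuum = 33878, pressure = 4340, humidity = 20239)
ss_resid <- 140383
df_model <- length(ss_terms)   # 4 predictors
df_resid <- 6691

ss_model <- sum(ss_terms)
r2  <- ss_model / (ss_model + ss_resid)                 # share of variance explained
f   <- (ss_model / df_model) / (ss_resid / df_resid)    # overall F-statistic
rse <- sqrt(ss_resid / df_resid)                        # residual standard error

round(c(r2 = r2, f = f, rse = rse), 4)
```

The three values come out close to the 0.9274, 2.137e+04 and 4.58 reported by summary(), which is a useful sanity check that the two outputs describe the same fit.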

Page 29: Analytics Project  - Combined Cycle Power Plant

Regression Statistics

Multiple R          0.96
R Square (R2)       0.93
Adjusted R Square   0.93
Standard Error      4.58
Observations        6696

Interpretation

R Square = 0.93 (0.9274 in the summary output) indicates that about 93% of the variance in energy is explained by the variance in the X variables (temp, vacuum, pressure and humidity).

Multiple R: the correlation between the observed energy and the energy predicted from the linear combination of the IVs. It is the square root of R2.

R2: the coefficient of determination, that is, the percentage of variance in energy that can be explained by the predictors t, v, p and h.

Page 30: Analytics Project  - Combined Cycle Power Plant

Selecting the best model by eliminating variables

Model 1 – Eliminating Humidity

Call:
lm(formula = energy ~ temp + vacuum + pressure, data = ccpp_train)

Residuals:
    Min      1Q  Median      3Q     Max
-43.924  -3.424  -0.030   3.320  18.872

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 345.830253  11.872607   29.13   <2e-16 ***
temp         -1.622336   0.015909 -101.97   <2e-16 ***
vacuum       -0.332834   0.008791  -37.86   <2e-16 ***
pressure      0.156381   0.011629   13.45   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.899 on 6692 degrees of freedom
Multiple R-squared:  0.917, Adjusted R-squared:  0.9169
F-statistic: 2.463e+04 on 3 and 6692 DF,  p-value: < 2.2e-16

Page 31: Analytics Project  - Combined Cycle Power Plant

Selecting the best model by eliminating variables

Model 2 – Eliminating Pressure & Humidity

Call:
lm(formula = energy ~ temp + vacuum, data = ccpp_train)

Residuals:
    Min      1Q  Median      3Q     Max
-44.192  -3.329  -0.075   3.318  19.266

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 505.442472   0.286375 1764.97   <2e-16 ***
temp         -1.689043   0.015318 -110.27   <2e-16 ***
vacuum       -0.330193   0.008906  -37.07   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.965 on 6693 degrees of freedom
Multiple R-squared:  0.9147, Adjusted R-squared:  0.9147
F-statistic: 3.589e+04 on 2 and 6693 DF,  p-value: < 2.2e-16

Page 32: Analytics Project  - Combined Cycle Power Plant

Selecting the best model by eliminating variables

Model     Variables Retained          R2      Adjusted R2
Model 1   temp, vacuum, pressure      0.917   0.9169
Model 2   temp, vacuum                0.9147  0.9147
Model 3   temp                        0.8972  0.8972
Model 4   temp, vacuum, humidity      0.9271  0.927
Model 5   temp, humidity              0.9194  0.9194
Model 6   temp, pressure, humidity    0.9195  0.9195
Model 7   vacuum, pressure, humidity  0.8053  0.8053
Model 8   vacuum, pressure            0.7879  0.7878
Model 9   pressure                    0.2634  0.2633
Model 10  vacuum                      0.7598  0.7597
Model 11  temp, pressure              0.8992  0.8991
Model 12  pressure, humidity          0.3798  0.3796
Model 13  vacuum, humidity            0.7755  0.7754
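A comparison table like the one above can be produced without fitting each model by hand. The following is a minimal sketch, assuming a data frame with the project's column names; the data here are synthetic and the coefficients used to generate them are made up, so the printed numbers will not match the table.

```r
# Synthetic data with the project's column names (hypothetical values).
set.seed(1)
n <- 500
d <- data.frame(temp     = runif(n, 2, 37),
                vacuum   = runif(n, 25, 81),
                pressure = runif(n, 993, 1033),
                humidity = runif(n, 25, 100))
d$energy <- 450 - 2 * d$temp - 0.2 * d$vacuum + 0.06 * d$pressure -
            0.15 * d$humidity + rnorm(n, sd = 4.5)

predictors <- c("temp", "vacuum", "pressure", "humidity")

# All non-empty subsets of the predictors: 4 + 6 + 4 + 1 = 15 candidate models.
subsets <- unlist(lapply(seq_along(predictors),
                         function(k) combn(predictors, k, simplify = FALSE)),
                  recursive = FALSE)

# Fit each candidate and collect R2 and adjusted R2.
results <- do.call(rbind, lapply(subsets, function(vars) {
  s <- summary(lm(reformulate(vars, response = "energy"), data = d))
  data.frame(variables = paste(vars, collapse = ", "),
             r2 = s$r.squared, adj_r2 = s$adj.r.squared)
}))

results[order(-results$adj_r2), ]   # best adjusted R2 first
```

Plain R2 can never decrease when a variable is added, so the full model always tops that column; adjusted R2 is the fairer yardstick when models have different numbers of predictors.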

Page 33: Analytics Project  - Combined Cycle Power Plant

Residual Plots

[Figure: plot of residuals. Vertical axis: Residuals, from -15 to 15. Horizontal axis: 0 to 140.]

Page 34: Analytics Project  - Combined Cycle Power Plant

Residual Plots

Page 35: Analytics Project  - Combined Cycle Power Plant

Residual Plots Interpretation – Plot #1 Residual Vs Fitted

Residual Vs Fitted: Shows the residuals (that is, the vertical distance from each point to the regression line) versus the fitted values (that is, the y-value on the line corresponding to each x-value). The red line stays close to the grey line at zero, which indicates that the residuals show little systematic pattern across the fitted values.

Page 36: Analytics Project  - Combined Cycle Power Plant

Residual Plots Interpretation – Plot #2 Normal Q-Q

Normal Q-Q: This plot checks whether the errors (residuals) are normally distributed. The points lie close to the grey dashed line, which indicates that the residuals are approximately normally distributed.

Page 37: Analytics Project  - Combined Cycle Power Plant

Residual Plots Interpretation – Plot #3 Scale-Location

Scale-Location: The Y-axis shows the square root of the absolute standardized residuals (residuals scaled to have mean zero and a variance of 1). This plot tests homoscedasticity, that is, whether the variance of the residuals is constant rather than a function of x. The plot shows homoscedasticity, as the red line is flat.

Page 38: Analytics Project  - Combined Cycle Power Plant

Residual Plots Interpretation – Plot #4 Residual Vs Leverage

Standardized Residual Vs Leverage: Leverage is a measure of how much each data point influences the regression. Cook’s distance measures how much the regression would change if a point were deleted. The plot shows the residuals centered around zero, as they should be for a normal distribution. Cook’s distance is also small, so no single point dominates the fit.

Page 39: Analytics Project  - Combined Cycle Power Plant

Histogram of Residuals

The plot shows that the residuals are approximately normally distributed.

Page 40: Analytics Project  - Combined Cycle Power Plant

Model Assessment

Objective #1 was to create a data model to predict the net hourly electrical energy output (EP) of the plant using the hourly average ambient variables.

The model has been created. The linear regression equation is:

Energy = 453.70 - 1.97*temp - 0.24*vacuum + 0.06*pressure - 0.16*humidity

To assess how well this equation predicts energy values, the Test dataset is used: the equation is applied to all observed values of temperature, vacuum, pressure and humidity in the Test data to predict the energy values. A graph is plotted of Observed Energy values Vs Predicted Energy values, and the correlation between them is observed. (See next slide for the graph.) The plot shows a strong positive correlation between the model’s predictions and the observed results. This indicates that the model is a good fit.
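The visual check of observed against predicted values can be backed by two numbers: the correlation between them and the root mean squared error (RMSE) on the held-out data. The sketch below shows the idea on synthetic data with two of the project's variable names; the data, the seed and the choice of RMSE as an error measure are assumptions made for illustration only.

```r
# Synthetic data (hypothetical values) split into training and test parts.
set.seed(7)
n <- 1000
d <- data.frame(temp = runif(n, 2, 37), vacuum = runif(n, 25, 81))
d$energy <- 500 - 1.7 * d$temp - 0.3 * d$vacuum + rnorm(n, sd = 5)

idx  <- sample(seq_len(n), 700)                  # 70% of rows for training
fit  <- lm(energy ~ temp + vacuum, data = d[idx, ])

pred <- predict(fit, newdata = d[-idx, ])        # predictions for unseen rows
obs  <- d$energy[-idx]

rmse <- sqrt(mean((obs - pred)^2))               # typical size of a prediction error
r    <- cor(obs, pred)                           # strength of the linear agreement

round(c(rmse = rmse, correlation = r), 3)
```

A test-set RMSE close to the residual standard error of the training fit suggests the model generalises; a much larger value would point to overfitting.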

Page 41: Analytics Project  - Combined Cycle Power Plant

Predicted Energy Vs Observed Energy

The plot shows a strong positive correlation between observed energy and predicted energy.

Page 42: Analytics Project  - Combined Cycle Power Plant

Model Assessment Using 3 fold Cross Validation

The plot shows a strong positive correlation between observed energy and predicted energy.

Page 43: Analytics Project  - Combined Cycle Power Plant

Interpretation of 3 fold Cross Validation Plot

To assess the fit of the model, a cross validation plot is drawn. A graph is plotted of Predictions from the fitted model Vs Cross Validated Predictions, and the correlation between them is observed. (See next slide for the graph.) The plot shows a strong positive correlation between the model’s predictions and the CV results. This indicates that the model is a good fit.

Cross validation output (only the first 6 rows are shown). The columns temp to energy are observations from the test data, Predicted holds the predictions from the fitted model and cvpred holds the cross validated predictions.

   temp   vacuum  pressure  humidity  energy  Predicted    cvpred
1  10.49  42.49   1009.81   78.92     478.75  474.3805005  474.3255591
2  27.49  63.76   1010.09   62.8      436.73  438.1485714  438.1080381
3  24.14  59.87   1018.47   57.76     444.04  447.0707341  447.1530759
4  18.35  62.1    1019.97   78.36     455.29  454.8720241  454.9513531
5  18.93  59.39   1013.92   68.78     450.14  455.5372706  455.5037321
6  22.58  59.14   1017.2    80.91     439.78  446.4887455  446.4638705

Page 44: Analytics Project  - Combined Cycle Power Plant

Model Prediction Vs Cross validation predictions

The plot shows a strong positive correlation between the model’s predictions and CV results. This indicates that the model is a good fit.

Page 45: Analytics Project  - Combined Cycle Power Plant

Making predictions from the model

1. Obtain a 95% confidence interval for the mean power generated in the combined cycle power plant where temperature is 5°C, vacuum is 30 cm Hg, pressure is 1000 millibar and humidity is 40% (the values passed to predict() in the R commands used in this project, which reproduce the fitted value of 493).

  fit lwr upr
1 493 492 494

Confidence interval is 492 – 494

2. Obtain a 95% prediction interval for the power generated in the combined cycle power plant under the same conditions.

  fit lwr upr
1 493 484 502

Prediction interval is 484 – 502

A prediction interval for a predicted value of the dependent variable gives a range of values within which an additional observation of the dependent variable can be expected to fall (with a given level of certainty). A confidence interval for a predicted value of the dependent variable gives a range of values within which the "true" (population) mean of the dependent variable, for given levels of the independent variables, can be expected to fall (with a given level of certainty). The prediction interval is wider because it also accounts for the scatter of individual observations around that mean.
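The difference between the two interval types is easiest to see side by side. The sketch below fits a small model on synthetic data and requests both intervals for the same new observation through the `interval` argument of `predict()`. The data and the two-variable model are assumptions for illustration, so the numbers differ from those on the slide.

```r
# Synthetic data (hypothetical values) and a two-variable model.
set.seed(11)
n <- 400
d <- data.frame(temp = runif(n, 2, 37), vacuum = runif(n, 25, 81))
d$energy <- 500 - 1.7 * d$temp - 0.3 * d$vacuum + rnorm(n, sd = 5)
fit <- lm(energy ~ temp + vacuum, data = d)

new_obs <- data.frame(temp = 5, vacuum = 30)

# Interval for the MEAN energy at these conditions.
conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
# Interval for a single NEW observation at these conditions.
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

out <- rbind(conf_int, pred_int)
rownames(out) <- c("confidence", "prediction")
round(out, 2)
```

Both rows share the same fitted value; only the bounds differ, and the prediction interval is always the wider of the two.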

Page 46: Analytics Project  - Combined Cycle Power Plant

Summary

1. Temperature and Vacuum have a strong negative correlation with Energy.

2. Pressure and Humidity are positively correlated with Energy, with pressure having a stronger correlation than humidity.

Page 47: Analytics Project  - Combined Cycle Power Plant
Page 48: Analytics Project  - Combined Cycle Power Plant

R commands used in this project

Name of the file containing the data: ccpp_train.csv

Load the data (assuming the file is in the working directory)
> ccpp_train <- read.csv("ccpp_train.csv")

1. Remove all rows with missing values
> ccpp_train <- na.omit(ccpp_train)

2. Create objects to store the IVs and the DV
> t = ccpp_train$temp
> v = ccpp_train$vacuum
> p = ccpp_train$pressure
> h = ccpp_train$humidity
> e = ccpp_train$energy

3. Creating a histogram for each variable
> hist(t, col="gray", labels=TRUE)
The above command is repeated for each variable (v, p, h and e).

4. Find the correlation coefficients of all variables
> cor(ccpp_train)

Page 49: Analytics Project  - Combined Cycle Power Plant

R commands used in this project cntd.

5. Plotting correlation graphs between IVs and DV
> plot(t, e, main="Correlation between Temperature in °C and Power in MW", xlab="Temp", ylab="Power")
> abline(lm(e~t), col="red")
Repeat the above two commands for all IVs.

6. Fitting the linear regression model to the data
> tvph.lm = lm(energy~temp+vacuum+pressure+humidity, data=ccpp_train)

7. Display the results
> tvph.lm
> summary(tvph.lm)

8. Plotting residuals against fitted values (all 4 graphs in one window)
> par(mfrow=c(2,2))
> plot(tvph.lm)

9. Histogram of residuals
> hist(resid(tvph.lm))

Page 50: Analytics Project  - Combined Cycle Power Plant

R commands used in this project cntd.

10. Fitting the model to the test dataset
> model1 <- lm(energy~temp+vacuum+pressure+humidity, data=ccpp_test)

11. Finding the Beta coefficients of the new model
> coef(model1)

12. Predicting energy for new values of IVs
# The leading 1 multiplies the intercept; sum() adds up the terms of the equation
> prediction <- sum(c(1, 33.92, 77.95, 1010.15, 58.89) * coef(model1))

13. Predicting values for the test dataset
> p <- predict(tvph.lm, newdata=ccpp_test)
> write.table(p, file = "C:/Users/lakshj1/Desktop/IIITB/Project/Cycle Power Plant/Final/test/ predval.csv", sep = ",")

14. Model Assessment Using Cross validation
> install.packages("DAAG")
> library(DAAG)
> new.daag <- CVlm(df=ccpp_test, m=3, form.lm=formula(energy~temp+vacuum+pressure+humidity))
> write.table(new.daag, file = "C:/Users/lakshj1/Desktop/IIITB/case study/working/daag.csv", sep = ",")

Page 51: Analytics Project  - Combined Cycle Power Plant

R commands used in this project cntd.

15. Reading the daag file
> daag <- read.csv("C:/Users/lakshj1/Desktop/IIITB/Project/Cycle Power Plant/working/daag.csv")

16. Plotting model predicted values and CV predictions
> plot(daag$Predicted, daag$cvpred)

17. Obtaining the 95% Confidence Interval
> predict(tvph.lm, data.frame(temp=5, vacuum=30, pressure=1000, humidity=40), interval="confidence")

18. Obtaining the 95% Prediction Interval
> predict(tvph.lm, data.frame(temp=5, vacuum=30, pressure=1000, humidity=40), interval="prediction")

Page 52: Analytics Project  - Combined Cycle Power Plant

R commands used in this project cntd.

19. Plotting observed energy Vs predicted energy
# A new file (ccpp_pred_obs) is created with the observed values of energy in the test dataset and the predicted values of energy
> attach(ccpp_pred_obs)
> obsenergy = ccpp_pred_obs$energy
> predenergy = ccpp_pred_obs$predicted_energy
> plot(predicted_energy, energy, main="Predicted Energy Vs Observed Energy for Test Data", xlab="Predicted Energy", ylab="Observed Energy")

Page 53: Analytics Project  - Combined Cycle Power Plant

K-fold cross validation

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used, but in general k remains an unfixed parameter. When k = n (the number of observations), k-fold cross-validation is exactly leave-one-out cross-validation.
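The procedure described above can be written out in a few lines of base R. This is a minimal sketch under stated assumptions: the data are synthetic, k = 3 mirrors the 3 fold validation used in this project, and the loop is a hand-rolled illustration rather than the internals of the DAAG package's CVlm function.

```r
# Synthetic data (hypothetical values) with two of the project's variable names.
set.seed(3)
n <- 300
d <- data.frame(temp = runif(n, 2, 37), vacuum = runif(n, 25, 81))
d$energy <- 500 - 1.7 * d$temp - 0.3 * d$vacuum + rnorm(n, sd = 5)

k    <- 3
fold <- sample(rep(seq_len(k), length.out = n))   # random, equal-sized fold labels

cvpred <- numeric(n)
for (i in seq_len(k)) {
  held_out <- fold == i
  # Train on the other k - 1 folds, then predict the held-out fold.
  fit <- lm(energy ~ temp + vacuum, data = d[!held_out, ])
  cvpred[held_out] <- predict(fit, newdata = d[held_out, ])
}

# Each row was predicted exactly once by a model that never saw it.
cv_rmse <- sqrt(mean((d$energy - cvpred)^2))
round(cv_rmse, 3)
```

Plotting `cvpred` against the predictions of a model fitted on all rows gives the kind of comparison shown in the cross validation slides.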

Page 54: Analytics Project  - Combined Cycle Power Plant