Case Study: The data in the sheet is for classroom teaching purposes only. It covers 25 hatchback cars in the Indian market. The variables are Name of the car, Length, Width, Height, Kerb Weight, Displacement, Max Power BHP, Max Power RPM, Max Torque NM, Max Torque RPM, Seating Capacity, Number of Gears, Mileage in kilometres per litre, Fuel Type and Transmission Type of each car. All the cars in the data are petrol cars with manual transmission.
Using this data, conduct a multiple regression analysis studying the impact of the variables on Mileage. Two variables, Fuel Type and Transmission Type, will not be taken into account, since every car in the data is petrol with manual transmission. All the remaining variables will be included in the study.
This regression will be run in R Statistical Software. First, set the working directory to the folder that holds the data file:
setwd("C:/Users/Rajesh Prabhakar/Desktop/R") # setting the working directory
To read the data from the "IndCarData.csv" file, the R command is
IndCar=read.csv("IndCarData.csv") # inputting data into R
The data is stored in an object named IndCar so that it can be passed easily to the statistical functions.
To check whether the data has been read properly, use the two functions "head( )" and "tail( )".
head(IndCar) # Displays first six rows of Data Set
tail(IndCar) # Displays last six rows of Data Set
For checking Descriptive statistics of the Data, "summary( )" function will be used.
summary(IndCar[,-c(1,14,15)]) # Descriptive Statistics of data set
The index -c(1,14,15) removes columns 1 (Car names), 14 (Fuel type) and 15 (Transmission type), as these are text-based variables.
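As a minimal sketch of how dropping columns works, the example below uses a small made-up data frame (the object `cars_demo` and its values are hypothetical and are not the case-study data). It shows the negative-index approach used above and, as an alternative, keeping only the numeric columns so that column positions do not have to be counted by hand.

```r
# Hypothetical three-row data frame mixing text and numeric columns
cars_demo <- data.frame(
  Name    = c("Car A", "Car B", "Car C"),
  Length  = c(3600, 3800, 3995),
  Mileage = c(22.5, 20.1, 18.6),
  Fuel    = c("Petrol", "Petrol", "Petrol")
)

# Negative indices drop columns by position (here columns 1 and 4)
by_position <- cars_demo[, -c(1, 4)]

# Alternative: keep only the columns that are numeric
numeric_only <- cars_demo[, sapply(cars_demo, is.numeric)]

summary(numeric_only) # Descriptive statistics of the numeric columns
```

Selecting by type is a design choice that may be safer when the column order of the CSV file is not known in advance.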
The following is output in R Console
Running Multiple Linear Regression Model 1
In R, the function for running a multiple regression is "lm( )".
The fitted model is stored in an object named IndCarReg1 so that it can be passed easily to other functions.
IndCarReg1=lm(IndCar$Mileage ~ IndCar$Length + IndCar$Width + IndCar$Height + IndCar$KerbWeight + IndCar$Displacement + IndCar$MaxPowerBHP + IndCar$MaxPowerRPM + IndCar$MaxTorqueNm + IndCar$MaxTorqueRPM + IndCar$SeatingCapacity + IndCar$GearsNo, data=IndCar)
# Multiple Linear Regression Model 1
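An alternative way of writing the formula, sketched here under assumptions, is to use bare column names together with the data argument. The example uses the mtcars data set that ships with R only as a stand-in for IndCar; the object names `fit_demo` and `new_car` and the new-car values are hypothetical. One practical benefit of bare names is that predict( ) can then score new rows, which typically does not work as expected when the terms are written in the `IndCar$Length` style.

```r
# mtcars ships with base R; it is used here only as a stand-in data set
fit_demo <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Because the formula uses bare column names, predict() can score new rows
new_car <- data.frame(wt = 2.5, hp = 100, disp = 150) # hypothetical car
predicted_mpg <- predict(fit_demo, newdata = new_car)

coef(fit_demo)  # intercept and slopes
predicted_mpg   # predicted mpg for the hypothetical car
```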
To know the result in R Console, type IndCarReg1.
The Multiple Regression Equation is
To see a more detailed result in the R Console, type summary(IndCarReg1) # Regression Output Detailed
Analysis of Multiple Regression Output
Multiple R Squared is 91.79%, which means that 91.79% of the variation in the dependent variable (Mileage) is explained by the independent variables. R Square is the square of the correlation between the observed response values and the predicted response values.
Adjusted R Square is 85.93%. It is R Square adjusted for the number of predictor variables: it measures the proportion of the variation in the dependent variable accounted for by the explanatory variables, with a penalty for each additional predictor.
Adjusted R Square is generally considered to be a more accurate goodness-of-fit measure than R Square, especially when comparing models with different numbers of predictors.
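The two definitions can be checked by hand. The sketch below again uses the built-in mtcars data as a stand-in (not the case-study data; `fit_demo` is a hypothetical name) and compares the hand-computed values with those reported by summary( ). The adjusted R Square formula used is the standard one, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors.

```r
fit_demo <- lm(mpg ~ wt + hp + disp, data = mtcars) # stand-in model

y <- mtcars$mpg
n <- length(y)                # number of observations
p <- length(coef(fit_demo)) - 1 # number of predictors (intercept excluded)

# R Square as the squared correlation between observed and fitted values
r2 <- cor(y, fitted(fit_demo))^2

# Adjusted R Square penalises R Square for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
summary(fit_demo)$r.squared     # should agree with r2
summary(fit_demo)$adj.r.squared # should agree with adj_r2
```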
In the Coefficients table, look at the P-Value of each variable. Variables with a P-Value less than 0.05 are considered significant, and variables with a P-Value greater than 0.05 are considered insignificant. The significance codes (such as "*") next to the Pr(>|t|) column highlight the significance levels.
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the table above, Width, Height, Kerb Weight, Displacement and Max Power RPM have P-Values less than 0.05, so they are considered significant; all the remaining variables are considered insignificant.
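Instead of reading the P-Values off the printed table, they can be pulled out of the model summary directly. This is a minimal sketch on the built-in mtcars data, used only as a stand-in; `fit_demo`, `coef_table` and `significant` are hypothetical names, and 0.05 is the cut-off stated above.

```r
fit_demo <- lm(mpg ~ wt + hp + disp, data = mtcars) # stand-in model

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit_demo))

# Extract the P-Values and keep the terms below the 0.05 cut-off
p_values <- coef_table[, "Pr(>|t|)"]
significant <- names(p_values)[p_values < 0.05]

round(p_values, 4)
significant # terms considered significant (may include the intercept)
```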
Running Multiple Regression Model 2
We rerun the regression after removing the insignificant variables Length, Max Power BHP, Max Torque NM, Max Torque RPM, Seating Capacity and Number of Gears.
IndCarReg2=update(IndCarReg1,.~.-IndCar$Length-IndCar$MaxPowerBHP-IndCar$MaxTorqueNm-IndCar$MaxTorqueRPM-IndCar$SeatingCapacity-IndCar$GearsNo)
# rerunning multiple regression model after removing insignificant variables from regression 1
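To remove variables, we have used the "update( )" function as above and subtracted the insignificant variables from the formula. In the formula, ".~." means "keep the same response and the same predictors as in the previous model".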
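The same pattern can be sketched on the built-in mtcars data (a stand-in only; `full_demo` and `reduced_demo` are hypothetical names). The example drops one predictor with update( ) and then, as an optional extra step that is not part of the original analysis, compares the two nested models with anova( ).

```r
# Stand-in full model on a data set that ships with R
full_demo <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Drop the predictor disp; ". ~ ." keeps everything else unchanged
reduced_demo <- update(full_demo, . ~ . - disp)

formula(reduced_demo) # formula of the reduced model

# Optional: F-test of the reduced model against the full model
anova(reduced_demo, full_demo)
```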
The Multiple Regression Equation is
Mileage = -12.23644 + 0.018067 X Width + 0.010938 X Height - 0.014393 X Kerb Weight - 0.008260 X Displacement + 0.001363 X Max Power RPM
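As a worked example of using this equation, the sketch below plugs in the values of a made-up car. The coefficients are the ones reported above; the car's dimensions and engine figures are hypothetical, chosen only for illustration, and their units are assumed to match those in the data sheet.

```r
# Coefficients of Regression Model 2, as reported in the equation above
coefs <- c(Intercept    = -12.23644,
           Width        =  0.018067,
           Height       =  0.010938,
           KerbWeight   = -0.014393,
           Displacement = -0.008260,
           MaxPowerRPM  =  0.001363)

# Hypothetical car (values are assumptions, not taken from the data)
car <- c(Intercept = 1, Width = 1600, Height = 1500,
         KerbWeight = 850, Displacement = 1200, MaxPowerRPM = 6000)

# Predicted mileage is the sum of coefficient times value
predicted_mileage <- sum(coefs * car[names(coefs)])
predicted_mileage # about 19.11 kilometres per litre
```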
Analysis of Multiple Regression Output
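Multiple R Square is 88.92%, which means that 88.92% of the variation in the dependent variable is explained by the independent variables. This is slightly less than in Regression Model 1 (91.79%), which is expected after removing predictors.
Adjusted R Square is 86%, adjusted for the number of X variables. This is marginally higher than in Regression Model 1 (85.93%).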
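The Adjusted R Square of Model 2 can be reproduced from the figures already given, assuming all 25 cars were used in the fit and using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The variable names below are illustrative.

```r
r2 <- 0.8892 # Multiple R Square of Regression Model 2
n  <- 25     # number of cars in the data (assumed all used in the fit)
p  <- 5      # predictors kept in Model 2

# Standard adjustment of R Square for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4) # 0.86, in line with the reported 86%
```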
In the table above, Width, Height, Kerb Weight, Displacement and Max Power RPM have P-Values less than 0.05, so they are considered significant.
Regression Diagnostics using Plots
par(mfrow=c(2,2)) # Visualizing four graphs at once
plot(IndCarReg2) # Regression Diagnostics Plotting residuals
Analyzing Plots
The first plot is a standard residual plot showing residuals against fitted values. Points that tend towards being outliers are labeled. If any pattern is apparent in the points on this plot, then the linear model may not be the appropriate one.
The second plot, Normal Q-Q, is a normal quantile plot of the residuals. Here the residuals appear to be normally distributed.
The third plot, Scale-Location, shows the square root of the absolute standardized residuals (sort of a square root of relative error) as a function of the fitted values. There is no obvious trend in this plot.
The fourth plot shows residuals against leverage. Labeled points on this plot represent cases we may want to investigate as possibly having undue influence on the regression relationship.
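The quantities behind these plots can also be inspected numerically. The sketch below uses the built-in mtcars data as a stand-in for the case-study model; `fit_demo` is a hypothetical name, and the cut-offs 2p/n for leverage and 4/n for Cook's distance are common rules of thumb assumed here, not thresholds taken from this case study.

```r
fit_demo <- lm(mpg ~ wt + hp, data = mtcars) # stand-in model

std_res <- rstandard(fit_demo)      # standardized residuals
lev     <- hatvalues(fit_demo)      # leverage of each observation
cook    <- cooks.distance(fit_demo) # influence of each observation

n <- nrow(mtcars)
p <- length(coef(fit_demo)) # number of coefficients, intercept included

# Rule-of-thumb flags (assumed cut-offs, adjust as needed)
high_leverage <- names(lev)[lev > 2 * p / n]
influential   <- names(cook)[cook > 4 / n]

high_leverage
influential

# Formal check of residual normality to complement the Q-Q plot
shapiro.test(residuals(fit_demo))
```

Flagged cases are candidates for a closer look rather than automatic removal.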
This Case Study is only for Classroom Discussion purpose.




