DataAdvanceR Labs: October 2014

Saturday, October 4, 2014

Business Analytics Case Study - Multiple Regression using R Software- India Hatchback Car Data

The datasheet can be downloaded from this link: Download Data

Case Study: The data in the sheet is only for classroom teaching purposes. The data is of 25 hatchback cars in Indian Market. The Data variables include Name of the car, Length, Width, Height, Kerb Weight, Displacement, Max Power BHP, Max Power RPM, Max Torque NM, Max Torque RPM, Seating Capacity, Number of Gears, Mileage in kilometer per Litre, Fuel Type & Transmission type of each car. The data belongs to only petrol cars and manual transmission.

Using this Data, Conduct a Multiple Regression analysis in Microsoft Excel studying the impact of the variables on Mileage. We will not take into account two variables Fuel Type & Transmission type. Rest of all the Variables will be included in Study.

This regression will be run in R Statistical Software.

I have stored the IndCarData.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “IndCarData.csv” file, R Command would be

IndCar=read.csv("IndCarData.csv") # inputing data into R

I have named input variable name as IndCar so that it will be easy for reading data into the statistical function formulas.

To Check whether Data input is done properly, Two functions "head( )" & "tail( )".

head(IndCar) # Displays first six rows of Data Set

tail(IndCar) # Displays last six rows of Data Set

For checking Descriptive statistics of the Data, "summary( )" function will be used.

summary(IndCar[,-c(1,14,15)]) # Descriptive Statistics of data set

This function will remove columns 1 ( Car names), 14 (Fuel type), 14 ( Transmission type) as these are Text based data variables.

The following is output in R Console

Running Multiple Linear Regression Model 1

In R Software, for running the Multiple Regression the function is "lm ( )".

I have named input variable name as IndCarReg1 so that it will be easy for reading data into the statistical function formulas.

IndCarReg1=lm(IndCar$Mileage ~ IndCar$Length + IndCar$Width + IndCar$Height + IndCar$KerbWeight + IndCar$Displacement + IndCar$MaxPowerBHP + IndCar$MaxPowerRPM + IndCar$MaxTorqueNm + IndCar$MaxTorqueRPM + IndCar$SeatingCapacity + IndCar$GearsNo, data=IndCar)

# Multiple Linear Regression Model 1

To know the result in R Console, type IndCarReg1.

The Multiple Regression Equation is

Mileage = -41.5613 + 0.0023 X Length + 0.02142 X Width + 0.01625 X Height - 0.01869 X Kerb Weight - 0.01199 X Displacement - 0.08312 X Max Power BHP + 0.00192 X Max Power RPM + 0.08052 X Max Torque NM + 0.00066 X Max Turque RPM + 1.50058 X Number of Gears + 0 X Seating Capacity

To know more detailed result in R Console, type summary(IndCarReg1) # Regression Output Detailed

Analysis of Multiple Regression Out put

Multiple R Squared means that 91.79% of the variation of Dependent Variable is explained by the independent variables. R Square is the square of the correlation between the response values and predicted response values.

Adjusted R Square is 85.93% is adjusted for number of predictor variables. It measures the proportion of the variation in the dependent variable accounted for by the explanatory variables.

Adjusted R square is generally considered to be a more accurate goodness-of-fit measure than R square.

In the Coefficients table look at P-Value and only those variables that has P-Value less than 0.05 will be considered significant and if P-Value is more than 0.05, those variables will be considered insignificant. The significance codes "*"next to the Pr(>|t|) highlights the significance levels.

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In table above, Width, Height, Kerb Weight , Displacement & Max Power RPM have values less than 0.005 so they are considered significant rest of all considered insignificant.

Running Multiple Regression Model 2

We have rerun the regression by removing the insignificant variables Length, Max Power BHP, Max Torque NM, Max torque RPM, Seating Capacity & Number of Gears.

IndCarReg2=update(IndCarReg1,.~.-IndCar$Length-IndCar$MaxPowerBHP-IndCar$MaxTorqueNm-IndCar$MaxTorqueRPM-IndCar$SeatingCapacity-IndCar$GearsNo)

# rerunning multiple regression model after removing insignificant variables from regression 1

For removing variables, we have used "update( )" function as above and minus the insignificant variables.

The Multiple Regression Equation is

Mileage = -12.23644 + 0.018067 X Width + 0.010938 X Height - 0.014393 X Kerb Weight - 0.008260 X Displacement + 0.001363 X Max Power RPM

Analysis of Multiple Regression Output

Multiple R Square means that 88.92% of the variation of Dependent Variable is explained by the independent variables. This is slightly less than previous Reg Model 1.

Adjusted R Square is 86% adjusted for number of X Variables.
In table above, Width, Height, Kerb Weight , Displacement & Max Power RPM have values less than 0.005 so they are considered significant.

Regression Diagnostics using Plots

par(mfrow=c(2,2)) # Visualizing four graphs at once

plot(IndCarReg2) # Regression Diagnostics Plotting residuals

Analyzing Plots

The first plot is a standard residual plot showing residuals against fitted values. Points that tend towards being outliers are labeled. If any pattern is apparent in the points on this plot, then the linear model may not be the appropriate one.

The second plot Normal Q-Q is a normal quantile plot of the residuals. Normally distributed.

The scale-location plot shows the square root of the standardized residuals (sort of a square root of relative error) as a function of the fitted values. No obvious trend

The plot residuals vs. leverage. Labeled points on this plot represent cases we may want to investigate as possibly having undue influence on the regression relationship .

This Case Study is only for Classroom Discussion purpose.

Friday, October 3, 2014

Business Analytics Case Study - Multiple Regression using Microsoft Excel- India Hatchback Car Data

The datasheet can be downloaded from this link: Download Data

For running Regression in Microsoft Excel, One needs to have Data Analysis ToolPak.

The following are the steps for downloading Data Analysis ToolPak.

1. Click on the green File tab. The File tab in Excel 2010 replaces the Office Button (or File Menu) in previous versions of Excel.

2. Click on Options

3. Under Add-ins, select Analysis ToolPak and click on the Go button.

4. Check Analysis ToolPak and click on OK.

5. On the Data tab, you can now click on Data Analysis.

The following dialog box below appears.
For example, select Regression and click OK to create a Regression in Excel.

Multiple Regression Case Solution

The following are steps for running Multiple Regression in Excel.

Continuing from Step above, a Dialog Box Appears for Regression

Input Y Range is the dependent variable in our case Mileage which is in Column M of the Data Sheet, Select $M$1:$M$26.

Input X Range is the independent variables in our case Length,Width,Height, Kerb Weight, Displacement, Max Power BHP, Max Power RPM, Max Torque NM, Max Torque RPM, Seating Capacity, Number of Gears. These variable are in Columns B to L, Select $B$1:$L$26.

Click on Labels as labels are in first row and confidence Level 95%. Click New Worksheet Ply and title it as reg Model 1. A new sheet with this title will appear in Excel Work Book and will store the Regression out put.

Regression Out Put Reg Model 1

The Multiple Regression Equation is

Analysis of the Multiple Regression Output

R Square means that 91.79% of the variation of Dependent Variable is explained by the independent variables.

Multiple R of 95.80% is the correlation between Actual Mileage Values and Predicted Mileage Values.

Adjusted R Square is adjusted for number of X Variables.

Next Look at Significance F Value in Anova Table which is associated with P-Value.

Since it is less than 0.05 we reject the H0 at the significance level 0.05.

In table above, Width, Height, Kerb Weight , Displacement & Max Power RPM have values less than 0.005 so they are considered significant rest of all considered insignificant.

Regression Output Reg Model 2

We have rerun the regression by removing the insignificant variables Length, Max Power BHP, Max Torque NM, Max torque RPM, Seating Capacity & Number of Gears.

The Multiple Regression Equation is

Mileage = -12.23644 + 0.01806 X Width + 0.01093 X Height - 0.01439 X Kerb Weight - 0.00825 X Displacement + 0.00136 X Max Power RPM

Analysis of Multiple Regression Output

R Square means that 88.91% of the variation of Dependent Variable is explained by the independent variables. This is slightly less than previous Reg Model 1.

Multiple R of 94.29% is the correlation between Actual Mileage Values and Predicted Mileage Values.

Adjusted R Square is adjusted for number of X Variables.

Next Look at Significance F Value in Anova Table which is associated with P-Value.

Since it is less than 0.05 we reject the H0 at the significance level 0.05.

In table above, Width, Height, Kerb Weight , Displacement & Max Power RPM have values less than 0.005 so they are considered significant.

R Statistical Software Basics – Descriptive Statistics - Kurtosis

The practice sheet titled StatisticMarks Data.csv downloaded from Link. – Download Sheet

About the Data Sheet - The data in this sheet is related to marks scored by 100 Students in a Statistical Test.

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data. I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

Kurtosis

A statistical measure used to describe the distribution of observed data around the mean. It is sometimes referred to as the "volatility of volatility."

Used generally in the statistical field, kurtosis describes trends in charts. A high kurtosis portrays a chart with fat tails and a low, even distribution, whereas a low kurtosis portrays a chart with skinny tails and a distribution concentrated toward the mean.

In R Statistical Software, Skewness is represented by function “kurtosis”. For kurtosis Load e1071 package or PerformanceAnalytics Package

kurtosis( )

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data titled StatisticsMarks.

kurtosis(StatMarks$StatisticsMarks)

StatMarks is the name of the variable in which we stored the data followed by $ sign and column header of the Data i.e. StatisticsMarks.

Remember the title of the column should be exactly same including the large caps & small caps or else it will give error.

In R the file names, column headers and row headers should exactly match the same or else the function will give errors

The result of this function in R Console is

kurtosis(StatMarks$StatisticsMarks)

[1] 1.556244

R Statistical Software Basics – Descriptive Statistics - Skewness

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data. I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

Skewness

Describe asymmetry from the normal distribution in a set of statistical data. Skewness can come in the form of "negative skewness" or "positive skewness", depending on whether data points are skewed to the left (negative skew) or to the right (positive skew) of the data average.

In R Statistical Software, Skewness is represented by function “skewness”.

For skewness Load e1071 package or PerformanceAnalytics Package

skewness( )

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data titled StatisticsMarks.

skewness(StatMarks$StatisticsMarks)

StatMarks is the name of the variable in which we stored the data followed by $ sign and column header of the Data i.e. StatisticsMarks.

Remember the title of the column should be exactly same including the large caps & small caps or else it will give error.

In R the file names, column headers and row headers should exactly match the same or else the function will give errors

The result of this function in R Console is

skewness(StatMarks$StatisticsMarks)

[1] -0.8221308

Result is negatively skewed.

R Statistical Software Basics – Descriptive Statistics - Standard Deviation

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data. I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

Standard Deviation

A measure of the dispersion of a set of data from its mean. The more spread apart the data, the higher the deviation. Standard deviation is calculated as the square root of variance.

In finance, standard deviation is applied to the annual rate of return of an investment to measure the investment's volatility.

In R Statistical Software, Standard Deviation is represented by function “sd”.

sd( )

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data titled StatisticsMarks.

sd(StatMarks$StatisticsMarks)

StatMarks is the name of the variable in which we stored the data followed by $ sign and column header of the Data i.e. StatisticsMarks.

Remember the title of the column should be exactly same including the large caps & small caps or else it will give error.

In R the file names, column headers and row headers should exactly match the same or else the function will give errors

The result of this function in R Console is

sd(StatMarks$StatisticsMarks)

[1] 14.24

R Statistical Software Basics – Descriptive Statistics - Variance

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data. I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

Variance

Variance (σ²) is a measure of the dispersion of a set of data points around their mean value.

In other words, variance is a mathematical expectation of the average squared deviations from the mean.

It is computed by finding the probability-weighted average of squared deviations from the expected value. Variance measures the variability from an average (volatility).

In R Statistical Software, Variance is represented by function “var”.

var( )

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data titled StatisticsMarks.

var(StatMarks$StatisticsMarks)

StatMarks is the name of the variable in which we stored the data followed by $ sign and column header of the Data i.e. StatisticsMarks.

Remember the title of the column should be exactly same including the large caps & small caps or else it will give error.

In R the file names, column headers and row headers should exactly match the same or else the function will give errors

The result of this function in R Console is

var(StatMarks)

StatisticsMarks

StatisticsMarks 202.7777