DataAdvanceR Labs: September 2014

Monday, September 29, 2014

R Statistical Software Basics – Descriptive Statistics - Percentiles

The practice sheet titled StatisticMarks Data.csv can be downloaded from the link: Download Sheet

About the Data Sheet - The data in this sheet is related to marks scored by 100 Students in a Statistical Test.

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data. I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

Percentiles

Assume that the elements in a data set are rank ordered from the smallest to the largest. The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.

An element having a percentile rank of P_i would have a greater value than i percent of all the elements in the set. Thus, the observation at the 50th percentile would be denoted P₅₀, and it would be greater than 50 percent of the observations in the set.

An observation at the 50th percentile would correspond to the median value in the set.

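In R, quartiles and percentiles are computed with the function called “quantile”. Its first argument is a numeric vector of data (written x here) and its second is a vector of probabilities between 0 and 1.

quantile(x, c(.10,.20,.30,.40,.50,.60,.70,.80,.90,.95))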

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data titled StatisticsMarks.

quantile(StatMarks$StatisticsMarks, c(.10,.20,.30,.40,.50,.60,.70,.80,.90,.95))

StatMarks is the name of the variable in which we stored the data, followed by the $ sign and the column header of the data, i.e. StatisticsMarks.

Remember that R is case-sensitive: the column name must match the header exactly, including uppercase and lowercase letters, or R will not find the column.

In R, file names, column headers and row headers must match exactly, or the function will not return the expected result.

The result of this function in R Console is

> quantile(StatMarks$StatisticsMarks,c(.10,.20,.30,.40,.50,.60,.70,.80,.90,.95))

10% 20% 30% 40% 50% 60% 70% 80% 90% 95%

56.8 66.0 71.0 73.0 75.0 78.0 80.0 81.2 91.1 97.0

10% of students scored up to 56.8 marks, 20% scored up to 66 marks, 30% scored up to 71 marks, 50% scored up to 75 marks, and so on.
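As a small illustration of how quantile places a percentile between two observations, here is a sketch on a made-up vector of five marks. The vector demo_marks is hypothetical and not part of the practice sheet; the call relies on R's default interpolation method.

```r
# Hypothetical marks, not taken from the practice sheet
demo_marks <- c(10, 20, 30, 40, 50)

# Ask for the 10th, 50th and 90th percentiles
demo_percentiles <- quantile(demo_marks, c(.10, .50, .90))
print(demo_percentiles)
# The values are 14, 30 and 46: the 10th percentile lies 40% of the way
# from 10 to 20, and the 90th lies 60% of the way from 40 to 50.
```

Because the result is a named vector, a single percentile can be picked out by name, for example demo_percentiles["50%"].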

R Statistical Software Basics – Descriptive Statistics - Quartiles

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data. I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

Quartiles

Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q₁, Q₂, and Q₃, respectively.

Note the relationship between quartiles and percentiles. Q₁ corresponds to P₂₅, Q₂ corresponds to P₅₀, Q₃ corresponds to P₇₅. Q₂ is the median value in the set.

In R, quartiles and percentiles are computed with the function called “quantile”. When no probabilities are given, it returns the minimum, the three quartiles and the maximum.

quantile( )

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data titled StatisticsMarks.

quantile(StatMarks$StatisticsMarks)

StatMarks is the name of the variable in which we stored the data, followed by the $ sign and the column header of the data, i.e. StatisticsMarks.

Remember that R is case-sensitive: the column name must match the header exactly, including uppercase and lowercase letters, or R will not find the column.

In R, file names, column headers and row headers must match exactly, or the function will not return the expected result.

The result of this function in R Console is

> quantile(StatMarks$StatisticsMarks)

0% 25% 50% 75% 100%

25.00 69.75 75.00 80.00 100.00

The results show that 25% of students scored less than 69.75 marks, the next 25% scored between 69.75 and 75.00, the next 25% scored between 75.00 and 80.00, and the top 25% scored between 80.00 and 100.00 marks.
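The link between quartiles, percentiles and the median can be checked with a short sketch. The seven marks below are made up for illustration (demo_marks is a hypothetical name); the point is only that calling quantile without probabilities gives the same answer as asking for 0%, 25%, 50%, 75% and 100% explicitly.

```r
# Hypothetical marks, not taken from the practice sheet
demo_marks <- c(25, 50, 70, 75, 80, 90, 100)

# Default call: minimum, Q1, Q2, Q3, maximum
default_q <- quantile(demo_marks)

# The same five cut points requested explicitly as percentiles
explicit_q <- quantile(demo_marks, c(0, .25, .50, .75, 1))

print(default_q)
# The values are 25, 60, 75, 85 and 100

# Q2 (the 50% entry) is the median
q2_is_median <- unname(default_q["50%"]) == median(demo_marks)
print(q2_is_median)
```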

R Statistical Software Basics – Descriptive Statistics - Median

The practice sheet titled StatisticMarks Data.csv can be downloaded from the link: Download Sheet

About the Data Sheet - The data in this sheet is related to marks scored by 100 Students in a Statistical Test.

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

Median

The median is another way to measure the center of a numerical data set.

In a numerical data set, the median is the point at which there are an equal number of data points whose values lie above and below the median value.

Thus, the median is truly the middle of the data set.

median ( )

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data titled StatisticsMarks.

median(StatMarks$StatisticsMarks)

StatMarks is the name of the variable in which we stored the data, followed by the $ sign and the column header of the data, i.e. StatisticsMarks.

Remember that R is case-sensitive: the column name must match the header exactly, including uppercase and lowercase letters, or R will not find the column.

In R, file names, column headers and row headers must match exactly, or the function will not return the expected result.

The result of this function in R Console is

> median(StatMarks$StatisticsMarks)

[1] 75
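A minimal sketch of how the median behaves for an odd and an even number of observations, using made-up marks (the vectors odd_marks and even_marks are hypothetical). With an even count there is no single middle value, so median averages the two middle ones.

```r
# Hypothetical marks, not taken from the practice sheet
odd_marks <- c(75, 60, 90)        # sorted: 60, 75, 90
even_marks <- c(75, 60, 90, 80)   # sorted: 60, 75, 80, 90

median_odd <- median(odd_marks)    # 75, the middle value
median_even <- median(even_marks)  # 77.5, the average of 75 and 80

print(median_odd)
print(median_even)
```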

Tuesday, September 23, 2014

R Statistical Software Basics – Descriptive Statistics – Arithmetic Mean or Average

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

Arithmetic mean

The mean, also referred to by statisticians as the average, is the most common statistic used to measure the center of a numerical data set.

The mean is the sum of all the values in the data set divided by the number of values in the data set.

mean( )

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data titled StatisticsMarks.

mean(StatMarks$StatisticsMarks)

StatMarks is the name of the variable in which we stored the data, followed by the $ sign and the column header of the data, i.e. StatisticsMarks.

Remember that R is case-sensitive: the column name must match the header exactly, including uppercase and lowercase letters, or R will not find the column.

In R, file names, column headers and row headers must match exactly, or the function will not return the expected result.

The result of this function in R Console is

> mean(StatMarks$StatisticsMarks)

[1] 73.99
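The definition of the mean (sum of the values divided by the number of values) can be checked by hand against the mean function. The marks below are made up for illustration, and demo_marks is a hypothetical name.

```r
# Hypothetical marks, not taken from the practice sheet
demo_marks <- c(60, 70, 80, 90)

# Mean by definition: sum of the values divided by how many there are
manual_mean <- sum(demo_marks) / length(demo_marks)

# Mean using the built-in function
builtin_mean <- mean(demo_marks)

print(manual_mean)   # 75
print(builtin_mean)  # 75
```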

R Statistical Software Basics - Descriptive Statistics - Minimum, Maximum & Range Functions

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

I have named the input variable StatMarks so that the data is easy to refer to in the statistical functions.

Press Ctrl+R, click Run, or press Ctrl+Enter to run the command.

Minimum Function

min( )

This function returns the minimum value in the data. The function name must be in lowercase.

min(StatMarks)

Result is 25

The lowest marks scored is 25

Maximum Function

max( )

This function returns the maximum value in the data. The function name must be in lowercase.

max(StatMarks)

Result is 100

The Highest marks scored in test is 100.

Result in R Console

> min(StatMarks)

[1] 25

> max(StatMarks)

[1] 100

Range Function

range(   )

This function gives a vector of the minimum and maximum values.

range(StatMarks)

Result is 25 & 100

These are the lowest and highest numbers in Data Set.

> range(StatMarks)

[1]  25 100

Range or Spread

To get the range or spread, i.e. Range = Maximum – Minimum:

max(StatMarks)-min(StatMarks)

Result is 75

The difference between the maximum and the minimum is 100 – 25 = 75.

Result in R Console

> max(StatMarks)-min(StatMarks)

[1] 75
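The spread can also be computed in one step from the output of range. The sketch below uses a made-up data frame (demo is a hypothetical name) and passes the column rather than the whole data frame to the functions. Passing the whole data frame works here only because its single column is numeric, so naming the column is the more explicit choice.

```r
# Hypothetical data frame with the same column header as the practice sheet
demo <- data.frame(StatisticsMarks = c(100, 60, 25))

# Work on the column itself
marks <- demo$StatisticsMarks

lowest_and_highest <- range(marks)   # vector of minimum and maximum
spread <- diff(lowest_and_highest)   # maximum minus minimum

print(lowest_and_highest)  # 25 100
print(spread)              # 75
```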

R Statistical Software Basics - Descriptive Statistics - Input Data & Check Number of Data Observations - Head & Tail Functions

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

As a best practice in R, set the working directory so that we can directly import and export data files.

I have stored the StatisticMarks.csv file in Working Directory on my Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

I have named the input variable StatMarks so that the data is easy to refer to in the statistical functions.

Press Ctrl+R, click Run, or press Ctrl+Enter to run the command.

Once you run it, you will see the line below, in blue, in the R Console.

> StatMarks=read.csv("StatisticMarks.csv")

To check all the Data Observations

StatMarks

Type StatMarks in the R Console and press Ctrl+R, click Run, or press Ctrl+Enter.

This will list all 100 observations in the data set.

Head Function

head( )

By default, this function returns the first six observations in the data.

head(StatMarks)

Press Ctrl+R, click Run, or press Ctrl+Enter to run the command.

> head(StatMarks)

StatisticsMarks

1 100

2 100

3 99

4 99

5 97

6 97

The first six observations in Data set are 100,100,99,99,97,97.

Tail Function

tail( )

By default, this function returns the last six observations in the data.

tail(StatMarks)

Press Ctrl+R, click Run, or press Ctrl+Enter to run the command.

> tail(StatMarks)

StatisticsMarks

95 49

96 45

97 44

98 35

99 30

100 25

The Last six observations in Data Set are 49,45,44,35,30,25.
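To check the number of observations directly, and to ask head or tail for a different number of rows, something like the following sketch can be used. The data frame demo is a made-up stand-in for StatMarks with only eight rows.

```r
# Hypothetical data frame with the same column header as the practice sheet
demo <- data.frame(StatisticsMarks = c(100, 99, 97, 80, 75, 60, 45, 25))

# Number of observations (rows) in the data
observation_count <- nrow(demo)
print(observation_count)  # 8

# The n argument changes how many rows head and tail return
first_three <- head(demo, n = 3)
last_two <- tail(demo, n = 2)

print(first_three)
print(last_two)
```

On the practice data, nrow(StatMarks) should return 100 if all rows from A2:A101 were read.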

Tuesday, September 16, 2014

Set Working Directory & Get Working Directory - R Software - Basics Class

The practice sheet titled StatisticMarks Data.csv can be downloaded from the link: Download Sheet

About the Data Sheet - The data in this sheet is related to marks scored by 100 Students in a Statistical Test.

Based on the data, we will use R Software Statistical functions to analyze the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

As a best practice in R, set the working directory so that we can directly import and export data files.

If you want to read files from a specific location or write files to a specific location, you will need to set the working directory in R.

The example below shows how to set the working directory in R to the folder “Data” within the folder “Documents and Settings” on the C drive.

# Set the Working Directory Command

setwd(" ")
setwd("C:/Documents and Settings/Data")

Always remember to use the forward slash / or the double backslash \\ in paths in R.

The Windows format with a single backslash will not work, because R treats the backslash as an escape character inside strings.
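A short sketch of the path styles that R accepts, using the folder from the example above. The variable names are hypothetical. The function file.path builds a path from its parts with forward slashes, which avoids typing separators by hand.

```r
# Forward slashes work as they are
path_forward <- "C:/Documents and Settings/Data"

# Each backslash has to be doubled inside an R string
path_escaped <- "C:\\Documents and Settings\\Data"

# file.path joins the parts with forward slashes
path_built <- file.path("C:", "Documents and Settings", "Data")

print(identical(path_built, path_forward))          # TRUE
print(nchar(path_escaped) == nchar(path_forward))   # TRUE: "\\" is stored as one character
cat(path_escaped, "\n")                             # shows single backslashes
```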

To check which folder is currently the working directory:

getwd()
# get current working directory

In our case, I have set the working directory to a folder titled “R” on the Desktop.

setwd("C:/Users/Rajesh Prabhakar/Desktop/R")

I have stored my data file titled “StatisticMarks.csv” in the “R” folder I created on Desktop and which I have set as my working Directory.

For inputting or reading Data from “StatisticMarks.csv” file, R Command would be

StatMarks=read.csv("StatisticMarks.csv")

I have named the input variable StatMarks so that the data is easy to refer to in the statistical functions.
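The whole sequence (set the working directory, read a CSV file by name, check the result) can be tried without the practice sheet. The sketch below is only an illustration: it uses R's temporary folder instead of the Desktop folder, and the file DemoMarks.csv and the variable names are made up.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Use R's temporary folder as the working directory for this demo
setwd(tempdir())

# Write a tiny hypothetical CSV file with one header and three marks
write.csv(data.frame(StatisticsMarks = c(100, 75, 25)),
          "DemoMarks.csv", row.names = FALSE)

# Because the working directory is set, the file name alone is enough
DemoMarks <- read.csv("DemoMarks.csv")
print(DemoMarks)

# Go back to the original working directory
setwd(old_wd)
```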

Sunday, September 14, 2014

KURTOSIS - Descriptive Statistics using Microsoft Excel Statistical Functions

The practice sheet can be downloaded from Link. Statistics Marks Data - Download Sheet

About the Data Sheet - The data in this sheet is related to marks scored by 100 Students in a Statistical Test.

Based on the data, we will use Microsoft Excel Statistical functions to analyse the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

KURTOSIS

Kurtosis is a statistical measure used to describe the distribution of observed data around the mean. It is sometimes referred to as the "volatility of volatility."

In statistics, kurtosis describes the shape of a distribution, in particular its peak and tails, rather than its center or spread.

Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.

= KURT(number1,number2,...)

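Kurtosis is defined as:

$$\mathrm{KURT} = \left\{ \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 \right\} - \frac{3(n-1)^2}{(n-2)(n-3)}$$

where x̄ is the sample mean, s is the sample standard deviation and n is the sample size.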

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

=KURT(A2:A101)

Result is 1.6994

The result is positive, which indicates a relatively peaked distribution.
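For readers following the R posts, the calculation that Excel documents for KURT can be sketched in R. Base R has no kurtosis function, so excel_kurt below is a hypothetical helper written for illustration, and the ten values are made up rather than taken from the practice sheet.

```r
# Hypothetical helper: sample excess kurtosis using the formula documented for Excel's KURT
excel_kurt <- function(x) {
  n <- length(x)
  s <- sd(x)                                # sample standard deviation
  fourth <- sum(((x - mean(x)) / s)^4)      # sum of standardized deviations to the 4th power
  (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3)) * fourth -
    (3 * (n - 1)^2) / ((n - 2) * (n - 3))
}

# Made-up values for illustration
demo_values <- c(3, 4, 5, 2, 3, 4, 5, 6, 4, 7)
demo_kurt <- excel_kurt(demo_values)
print(demo_kurt)  # about -0.1518, a slightly flat distribution
```

The formula needs at least four observations and a non-zero standard deviation.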


SKEWNESS - Descriptive Statistics using Microsoft Excel Statistical Functions

The practice sheet can be downloaded from Link. Statistics Marks Data - Download Sheet

About the Data Sheet - The data in this sheet is related to marks scored by 100 Students in a Statistical Test.

Based on the data, we will use Microsoft Excel Statistical functions to analyse the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

SKEWNESS

Skewness describes asymmetry from the normal distribution in a set of statistical data. It characterizes the degree of asymmetry of a distribution around its mean.

Skewness can come in the form of "negative skewness" or "positive skewness", depending on whether data points are skewed to the left (negative skew) or to the right (positive skew) of the data average.

Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values.

= SKEW(number1,number2,...)

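The equation for skewness is defined as:

$$\mathrm{SKEW} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$

where x̄ is the sample mean, s is the sample standard deviation and n is the sample size.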

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

=SKEW(A2:A101)

Result is -0.8347

The result is negative, which indicates a distribution whose longer tail extends toward the lower marks.
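The calculation that Excel documents for SKEW can be sketched in R in the same spirit. Base R has no skewness function, so excel_skew is a hypothetical helper written for illustration, and the ten values are made up rather than taken from the practice sheet.

```r
# Hypothetical helper: sample skewness using the formula documented for Excel's SKEW
excel_skew <- function(x) {
  n <- length(x)
  s <- sd(x)                              # sample standard deviation
  cubed <- sum(((x - mean(x)) / s)^3)     # sum of cubed standardized deviations
  n / ((n - 1) * (n - 2)) * cubed
}

# Made-up values for illustration
demo_values <- c(3, 4, 5, 2, 3, 4, 5, 6, 4, 7)
demo_skew <- excel_skew(demo_values)
print(demo_skew)  # about 0.3595, a tail extending toward higher values
```

Reversing the sign of every value flips the sign of the result, which is a quick way to see what negative skewness means.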


STANDARD DEVIATION - Descriptive Statistics using Microsoft Excel Statistical Functions - Measure of Dispersion

The practice sheet can be downloaded from Link. Statistics Marks Data - Download Sheet

About the Data Sheet - The data in this sheet is related to marks scored by 100 Students in a Statistical Test.

Based on the data, we will use Microsoft Excel Statistical functions to analyse the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

STANDARD DEVIATION

Standard deviation is a measure of the dispersion of a set of data from its mean. The more spread apart the data, the higher the deviation. Standard deviation is calculated as the square root of variance.

The STDEV function estimates standard deviation based on a sample. The standard deviation is a measure of how widely values are dispersed from the average value (the mean).

= STDEV(number1,number2,...)

STDEV uses the following formula:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}$$

where x̄ is the sample mean AVERAGE(number1,number2,…) and n is the sample size.

=SQRT(Variance)

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

=STDEV(A2:A101)

Result is 14.24

Or, equivalently, =SQRT(VAR(A2:A101)) or =SQRT(202.7777)

STDEV.P is for a population (it divides by n).

STDEV.S is for a sample (it divides by n – 1), like STDEV.
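The same distinction can be sketched in R, where sd computes the sample standard deviation. The population version has no built-in function in base R, so it is written out by hand below. The eight values are made up for illustration, and the variable names are hypothetical.

```r
# Made-up values for illustration; their mean is 5
demo_values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample standard deviation (divides by n - 1), comparable to STDEV.S
sample_sd <- sd(demo_values)

# The same value obtained as the square root of the sample variance
sample_sd_from_var <- sqrt(var(demo_values))

# Population standard deviation (divides by n), comparable to STDEV.P
population_sd <- sqrt(mean((demo_values - mean(demo_values))^2))

print(sample_sd)           # about 2.138
print(sample_sd_from_var)  # same as above
print(population_sd)       # 2
```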


VARIANCE - Descriptive Statistics using Microsoft Excel Statistical Functions - Measure of Dispersion

The practice sheet can be downloaded from Link. Statistics Marks Data - Download Sheet

About the Data Sheet - The data in this sheet is related to marks scored by 100 Students in a Statistical Test.

Based on the data, we will use Microsoft Excel Statistical functions to analyse the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

Variance

Variance (σ²) is a measure of the dispersion of a set of data points around their mean value.

In other words, variance is the expected value of the squared deviations from the mean.

Variance measures the variability from an average (volatility).

=VAR(number1,[number2],...)

VAR uses the following formula:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$$

where x̄ is the sample mean AVERAGE(number1,number2,…) and n is the sample size.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

=VAR(A2:A101)

Result is 202.7777

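VAR.P is for a population (it divides by n).

VAR.S is for a sample (it divides by n – 1), like VAR.

The higher the variance, the greater the volatility, i.e. the more the values differ from one another.

The lower the variance, the lower the volatility, i.e. the less the values differ from one another.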
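The variance calculation can be followed step by step in R and compared with the built-in var function, which uses the sample formula. The four marks are made up for illustration, and the variable names are hypothetical.

```r
# Made-up marks for illustration
demo_marks <- c(60, 70, 80, 90)

n <- length(demo_marks)
deviations <- demo_marks - mean(demo_marks)   # -15, -5, 5, 15
squared <- deviations^2                       # 225, 25, 25, 225

# Sample variance: sum of squared deviations divided by n - 1
sample_variance <- sum(squared) / (n - 1)

# Population variance: sum of squared deviations divided by n
population_variance <- sum(squared) / n

print(sample_variance)      # about 166.67
print(var(demo_marks))      # same as the sample variance
print(population_variance)  # 125
```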

PERCENTILES - Descriptive Statistics using Microsoft Excel Statistical Functions

The practice sheet can be downloaded from Link. Statistics Marks Data - Download Sheet

About the Data Sheet - The data in this sheet is related to marks scored by 100 Students in a Statistical Test.

Based on the data, we will use Microsoft Excel Statistical functions to analyse the descriptive statistics.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

Percentiles

Assume that the elements in a data set are rank ordered from the smallest to the largest. The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.

An observation at the 50th percentile would correspond to the median value in the set.

=PERCENTILE(array,k)

array: The array or range of data that defines relative standing.

k: The percentile value in the range 0 to 1, inclusive.

In the Data Sheet, we have Data from A2:A101, A1 being the header of the Data.

=PERCENTILE(A2:A101,0.3) gives the 30th percentile

=PERCENTILE(A2:A101,0.5) gives the 50th percentile, or median

The 30th percentile is 71, which means that 30% (30 out of 100) of the scores are lower than or equal to 71.
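The interpolation behind this kind of percentile can be sketched in R. As far as I know, Excel's PERCENTILE uses the same rule as the default method of R's quantile function (type 7), so the hand-written helper below is compared against it. The name percentile_inc and the five marks are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: percentile by linear interpolation between ranked values
percentile_inc <- function(x, k) {
  sorted <- sort(x)
  n <- length(sorted)
  h <- (n - 1) * k + 1          # fractional position in the sorted data
  lower <- floor(h)
  upper <- ceiling(h)
  # Move the fractional part of the way from the lower to the upper value
  sorted[lower] + (h - lower) * (sorted[upper] - sorted[lower])
}

# Made-up marks for illustration
demo_marks <- c(50, 10, 40, 20, 30)

p30 <- percentile_inc(demo_marks, 0.3)
p50 <- percentile_inc(demo_marks, 0.5)

print(p30)  # 22
print(p50)  # 30, the median
print(unname(quantile(demo_marks, 0.3, type = 7)))  # same as p30
```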