Question 3 Using R for calculations
In a competitive company, the income levels of its employees are not standardised and are awarded on a case-by-case basis after negotiations between individual employees and the company directors. An industry body wants an analysis of employee income with respect to their relative rank in the company. A random sample of 184 individuals in the company were recruited and their relative Rank and Income values were anonymously recorded.
The data is available below. The variables are defined below.
Income - The yearly income of the employee.
Rank - The relative rank of the employee’s position at the company (1 being highest rank, 9 being lowest)
a. Fit a simple linear regression to the data with Income as the response variable and worker Rank as the predictor.
b. Using your analysis in a. or otherwise, explain why simple linear regression is inadequate to explain the structure in this dataset.
c. Fit a polynomial regression model to the data and select the best order polynomial to explain the data using the significance testing techniques discussed in lectures.
d. Predict the Income for a person planning to apply for a position at Rank 5 at one of these competitive companies in the near future
Rank | Income |
7 | 106790 |
6 | 70916 |
9 | 70495 |
3 | 191968 |
6 | 59373 |
6 | 106390 |
8 | 31339 |
3 | 235000 |
3 | 209008 |
5 | 115081 |
1 | 510684 |
1 | 557015 |
8 | 115096 |
2 | 311281 |
8 | 83348 |
6 | 118896 |
1 | 523692 |
3 | 230699 |
5 | 127867 |
6 | 103211 |
5 | 97534 |
8 | 68099 |
4 | 72454 |
6 | 129781 |
2 | 360613 |
6 | 73465 |
9 | 93146 |
6 | 104356 |
7 | 42327 |
5 | 145520 |
9 | 55853 |
7 | 77324 |
3 | 216965 |
1 | 532028 |
5 | 120256 |
6 | 37870 |
7 | 89948 |
1 | 511271 |
4 | 193372 |
2 | 281334 |
6 | 83604 |
8 | 53887 |
9 | 64738 |
9 | 72541 |
4 | 164709 |
9 | 56205 |
4 | 181247 |
5 | 92034 |
4 | 177882 |
1 | 483163 |
6 | 97319 |
1 | 484151 |
1 | 492368 |
6 | 120574 |
8 | 52470 |
7 | 46166 |
5 | 155870 |
6 | 76479 |
3 | 218382 |
8 | 91030 |
3 | 200678 |
2 | 364445 |
6 | 78075 |
7 | 77990 |
1 | 530666 |
6 | 136092 |
4 | 132705 |
7 | 120456 |
6 | 115115 |
2 | 296011 |
6 | 64033 |
1 | 512753 |
3 | 167713 |
7 | 60436 |
7 | 61206 |
3 | 266501 |
4 | 227492 |
1 | 514100 |
2 | 384562 |
2 | 271253 |
1 | 505753 |
4 | 148516 |
2 | 338896 |
9 | 70202 |
2 | 288968 |
7 | 116571 |
9 | 92788 |
4 | 166387 |
8 | 84762 |
6 | 92757 |
3 | 243974 |
8 | 44752 |
2 | 311745 |
4 | 165152 |
3 | 216874 |
4 | 224083 |
6 | 125820 |
4 | 196454 |
9 | 21565 |
2 | 340717 |
9 | 48784 |
5 | 105917 |
9 | 25375 |
8 | 103300 |
6 | 107669 |
7 | 93197 |
4 | 154516 |
8 | 59497 |
8 | 68733 |
1 | 540871 |
1 | 590015 |
3 | 134095 |
8 | 87005 |
7 | 45888 |
6 | 73332 |
4 | 217111 |
9 | 86037 |
1 | 463367 |
4 | 202798 |
4 | 213355 |
4 | 216602 |
9 | 35764 |
8 | 65762 |
2 | 352920 |
2 | 279612 |
2 | 349812 |
5 | 166996 |
6 | 107851 |
5 | 128139 |
6 | 166045 |
8 | 47305 |
9 | 60798 |
8 | 37471 |
8 | 10184 |
1 | 528574 |
4 | 164696 |
9 | 25789 |
5 | 140320 |
1 | 499333 |
2 | 336158 |
6 | 89999 |
8 | 104567 |
6 | 143554 |
5 | 163795 |
1 | 513261 |
4 | 165280 |
4 | 161781 |
7 | 81081 |
9 | 41830 |
9 | 22884 |
2 | 338717 |
6 | 89851 |
6 | 77929 |
9 | 29934 |
3 | 205850 |
5 | 84776 |
5 | 125247 |
6 | 80336 |
1 | 591938 |
9 | 74762 |
9 | 53977 |
5 | 107757 |
7 | 60626 |
5 | 111661 |
4 | 149466 |
2 | 346352 |
1 | 534712 |
6 | 147205 |
2 | 288935 |
7 | 96857 |
4 | 164486 |
6 | 65347 |
9 | 36389 |
9 | 102282 |
8 | 53647 |
3 | 263337 |
6 | 56293 |
6 | 78559 |
1 | 550526 |
9 | 79542 |
8 | 35019 |
8 | 133983 |
5 | 161509 |
5 | 127704 |
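The R code in the answer assumes that a data frame called Employee_Analysis already exists. As a minimal sketch, assuming the table above has been pasted into R as pipe-delimited text lines, it could be built roughly as follows. Only the first five data rows are included here, and the names raw_lines and parse_table are hypothetical, not part of the original answer.

```r
# First five rows of the table only; the full data set has 184 rows.
raw_lines <- c(
  "Rank | Income |",
  "7 | 106790 |",
  "6 | 70916 |",
  "9 | 70495 |",
  "3 | 191968 |",
  "6 | 59373 |"
)

# Hypothetical helper: split each line on "|" and convert the fields to numbers.
parse_table <- function(lines) {
  parts <- strsplit(lines[-1], "|", fixed = TRUE)  # drop the header line
  data.frame(
    Rank   = as.integer(trimws(sapply(parts, `[`, 1))),
    Income = as.numeric(trimws(sapply(parts, `[`, 2)))
  )
}

Employee_Analysis <- parse_table(raw_lines)
dim(Employee_Analysis)    # rows and columns actually read in
names(Employee_Analysis)  # column names
```

With all 184 rows pasted into raw_lines, dim(Employee_Analysis) should report 184 rows and 2 columns, matching the answer.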
Use the R code below. It assumes the data has already been loaded into a data frame named Employee_Analysis.
dim(Employee_Analysis)
[1] 184 2
The data frame has 184 rows and 2 columns.
names(Employee_Analysis)
[1] "Rank" "Income"
reg.mod <- lm(Income ~ Rank, data = Employee_Analysis)
summary(reg.mod)
coefficients(reg.mod)
(Intercept) Rank
438835.04 -50353.65
The fitted regression equation is
Income = 438835.04 - 50353.65*Rank
b. Using your analysis in a. or otherwise, explain why simple linear regression is inadequate to explain the structure in this dataset.
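R-squared = 0.7629, so the linear model explains 76.29% of the variation in Income.
F(1, 182) = 585.6
p < 0.0001
Since p < 0.05, the linear model is statistically significant. Significance alone does not show that a straight line is the right shape for the relationship, so the residuals need to be checked.
Use this R code to get the residual plot:
plot(fitted(reg.mod), residuals(reg.mod))
The residual plot shows a systematic curved pattern instead of a random scatter around zero, so the linearity assumption is violated. The fitted line also gives implausible values at the extremes: it predicts 388481 at Rank 1, which is below every Rank 1 income in the sample (all above 460000), and a negative income (about -14348) at Rank 9. A polynomial model is therefore a better fit.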
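The weakness of the straight line can also be seen without the plot by evaluating the fitted equation at every rank. The sketch below uses only the two coefficients reported in part a. The variable names are illustrative.

```r
# Coefficients copied from the simple linear regression output in part a.
b0 <- 438835.04
b1 <- -50353.65

# Fitted Income from the straight line at each possible rank.
ranks <- 1:9
linear_pred <- b0 + b1 * ranks

data.frame(Rank = ranks, Predicted = round(linear_pred))
```

The Rank 9 value is negative, which is impossible for an income, and the Rank 1 value falls well short of the Rank 1 incomes listed in the table.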
c. Fit a polynomial regression model to the data and select the best order polynomial to explain the data using the significance testing techniques discussed in lectures.
Using an Excel trend line:
Income = 10031*Rank^2 - 150945*Rank + 624270
R² = 0.9312
The same second-order polynomial can be fitted in R:
poly.mod <- lm(Income ~ poly(Rank, 2, raw = TRUE), data = Employee_Analysis)
summary(poly.mod)
Call:
lm(formula = Income ~ poly(Rank, 2, raw = TRUE), data = Employee_Analysis)
Residuals:
Min 1Q Median 3Q Max
-127622 -22179 -132 27474 108581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 624270.1 10936.5 57.08 <2e-16 ***
poly(Rank, 2, raw = TRUE)1 -150944.8 4910.0 -30.74 <2e-16 ***
poly(Rank, 2, raw = TRUE)2 10031.2 476.6 21.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
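Residual standard error: 39260 on 181 degrees of freedom
Multiple R-squared: 0.9312, Adjusted R-squared: 0.9305
F-statistic: 1225 on 2 and 181 DF, p-value: < 2.2e-16
The quadratic term is highly significant (t = 21.05, p < 2e-16) and R-squared rises from 0.7629 to 0.9312, so the second-order polynomial is preferred over the straight line. Higher-order terms can be tested in the same way, adding one term at a time until the newly added term is no longer significant. No higher-order fit is reported in this answer.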
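The significance test for moving from the linear to the quadratic model can be reproduced from the reported summary figures alone. The sketch below computes the partial F statistic from the two R-squared values (0.7629 and 0.9312) and the sample size of 184. Because these inputs are rounded, the result should agree only approximately with the squared t value of the quadratic term.

```r
r2_linear    <- 0.7629   # R-squared of the simple linear model
r2_quadratic <- 0.9312   # R-squared of the quadratic model
n            <- 184      # sample size
df_resid     <- n - 3    # residual df of the quadratic model (181)

# Partial F test for adding one term (Rank^2) to the linear model.
f_stat <- ((r2_quadratic - r2_linear) / 1) / ((1 - r2_quadratic) / df_resid)
p_val  <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

f_stat    # partial F statistic
21.05^2   # squared t value of the quadratic term, for comparison
p_val     # p-value of the partial F test
```

For a single added term the partial F statistic equals the square of that term's t statistic, which is why the two numbers are expected to match up to rounding.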
d. Predict the Income for a person planning to apply for a position at Rank 5 at one of these competitive companies in the near future
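Set Rank = 5.
Using the polynomial equation:
Income = 10031*Rank^2 - 150945*Rank + 624270
Income = 10031*5^2 - 150945*5 + 624270
Income = 120320
The predicted income using the polynomial equation is 120320. (With the less rounded coefficients from summary(poly.mod), the value is about 120326.)
For comparison, using the linear regression equation:
Income = 438835.04 - 50353.65*Rank
Income = 438835.04 - 50353.65*5
Income = 187066.8
The predicted income using the linear equation is 187067. Because the linear model was shown to be inadequate in part b, the polynomial prediction of about 120320 is the one to report.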
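The hand calculations above can be checked with two small R functions built from the reported equations. The function names quad_income and lin_income are illustrative and do not appear in the original answer.

```r
# Quadratic equation with the rounded trend-line coefficients.
quad_income <- function(rank) 10031 * rank^2 - 150945 * rank + 624270

# Simple linear regression equation from part a.
lin_income <- function(rank) 438835.04 - 50353.65 * rank

quad_income(5)  # polynomial prediction at Rank 5
lin_income(5)   # linear prediction at Rank 5
```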
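The entire R code is:
# Assumes Employee_Analysis is a data frame with columns Rank and Income
dim(Employee_Analysis)
names(Employee_Analysis)
# a. Simple linear regression
reg.mod <- lm(Income ~ Rank, data = Employee_Analysis)
summary(reg.mod)
coefficients(reg.mod)
# b. Residual plot to check linearity
plot(fitted(reg.mod), residuals(reg.mod))
# c. Second-order polynomial regression
poly.mod <- lm(Income ~ poly(Rank, 2, raw = TRUE), data = Employee_Analysis)
summary(poly.mod)
# d. Predicted Income at Rank 5 with a 95% prediction interval
predict(poly.mod, newdata = data.frame(Rank = 5), interval = "prediction")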
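Part c asks for the best polynomial order, and the code above stops at order 2. One possible way to continue, sketched here under stated assumptions, is to fit nested models of increasing order and compare them with anova(). To keep the example self-contained it uses a small subset of the table (the first three listed incomes for each rank), so its output will differ from an analysis of all 184 rows. The names sub, fit1, fit2, fit3 and cmp are illustrative.

```r
# Illustrative subset: first three incomes listed for each rank (27 of 184 rows).
sub <- data.frame(
  Rank = rep(1:9, each = 3),
  Income = c(510684, 557015, 523692,   # Rank 1
             311281, 360613, 281334,   # Rank 2
             191968, 235000, 209008,   # Rank 3
              72454, 193372, 164709,   # Rank 4
             115081, 127867,  97534,   # Rank 5
              70916,  59373, 106390,   # Rank 6
             106790,  42327,  77324,   # Rank 7
              31339, 115096,  83348,   # Rank 8
              70495,  93146,  55853)   # Rank 9
)

# Nested polynomial models of order 1, 2 and 3.
fit1 <- lm(Income ~ Rank, data = sub)
fit2 <- lm(Income ~ Rank + I(Rank^2), data = sub)
fit3 <- lm(Income ~ Rank + I(Rank^2) + I(Rank^3), data = sub)

# Sequential F tests: each row tests the term added relative to the row above.
cmp <- anova(fit1, fit2, fit3)
print(cmp)
```

Applied to the full Employee_Analysis data frame, the same comparison would show whether a cubic term adds anything beyond the quadratic; the order to select is the highest one whose added term is still significant.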