Please use RStudio.
Dataset: IBM HR Analytics Employee Attrition & Performance dataset (you can download the dataset from kaggle)
| Name | Description |
|---|---|
| ATTRITION | Employee leaving the company (0=NO, 1=YES) |
| BUSINESS TRAVEL | (1=NO TRAVEL, 2=TRAVEL FREQUENTLY, 3=TRAVEL RARELY) |
| DEPARTMENT | (1=HR, 2=R&D, 3=SALES) |
| EDUCATION FIELD | (1=HR, 2=LIFE SCIENCES, 3=MARKETING, 4=MEDICAL SCIENCES, 5=OTHERS, 6=TECHNICAL) |
| GENDER | (1=FEMALE, 0=MALE) |
| JOB ROLE | (1=HC REP, 2=HR, 3=LAB TECHNICIAN, 4=MANAGER, 5=MANAGING DIRECTOR, 6=RESEARCH DIRECTOR, 7=RESEARCH SCIENTIST, 8=SALES EXECUTIVE, 9=SALES REPRESENTATIVE) |
| MARITAL STATUS | (1=DIVORCED, 2=MARRIED, 3=SINGLE) |
| OVER 18 | (1=YES, 2=NO) |
| OVERTIME | (0=NO, 1=YES) |
The variable Attrition is what we plan to use as our dependent variable. It contains “Yes” if the employee left IBM and “No” if the employee stayed. We need to convert it into a binary dummy variable: 0 if the employee stayed (Attrition = “No”) and 1 if the employee left (Attrition = “Yes”). The same needs to be done for the variables Gender and OverTime. For Gender, we assign 0 to “Male” and 1 to “Female”. For OverTime, we assign 0 to “No” and 1 to “Yes”.
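A minimal sketch of this dummy coding in base R. The toy data frame stands in for the Kaggle CSV, and the `*Flag` column names are hypothetical choices made for illustration; only the Yes/No and Male/Female source values come from the description above.

```r
# Toy rows standing in for the Kaggle CSV (assumption: the real columns
# hold the strings "Yes"/"No" and "Male"/"Female").
hr <- data.frame(
  Attrition = c("Yes", "No", "No", "Yes"),
  Gender    = c("Male", "Female", "Female", "Male"),
  OverTime  = c("Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# as.integer(TRUE) is 1 and as.integer(FALSE) is 0, so a comparison
# yields the dummy variable directly. The *Flag names are hypothetical.
hr$AttritionFlag <- as.integer(hr$Attrition == "Yes")  # "Yes" -> 1, "No" -> 0
hr$GenderFlag    <- as.integer(hr$Gender == "Female")  # "Female" -> 1, "Male" -> 0
hr$OverTimeFlag  <- as.integer(hr$OverTime == "Yes")   # "Yes" -> 1, "No" -> 0

print(hr)
```

Keeping the original text columns next to the new flags makes it easy to spot-check the coding before the text columns are dropped.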
Create Pivot tables instead of correlation matrixes for categorical variables and do the data analysis.
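One possible way to build such a pivot table in base R is a two-way count table with totals. The sketch below uses a handful of made-up rows, not the actual data set; `xtabs()` and `addmargins()` are standard functions from the stats package.

```r
# Made-up rows for illustration only; the real data set has 1,470 records.
hr <- data.frame(
  Department = c("Sales", "Sales", "Research & Development",
                 "Research & Development", "Research & Development",
                 "Human Resources"),
  Attrition  = c("Yes", "No", "No", "No", "Yes", "No")
)

# Rows = Department, columns = Attrition, cells = number of employees.
pivot <- xtabs(~ Department + Attrition, data = hr)

# Add a "Sum" row and column, like the grand totals of a spreadsheet pivot table.
pivot_totals <- addmargins(pivot)
print(pivot_totals)
```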
For data analysis:
Describe the data using the techniques above (e.g. “We can see in this scatter plot that there is a positive correlation between the number of hours in which the patient exercised per week and his/her weight loss.”). About one page without the images.
Based on these observations, draw some insights. (e.g. “We believe the patient is burning calories when exercising, thus contributing to the loss of weight”). About one page.
State actionable experiments based upon your insights. (e.g. “We will use multiple regression that includes hours exercised as an explanatory variable to model weight loss. We expect…”)
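As an illustration of what such an experiment could look like for a binary outcome like attrition, the sketch below fits a logistic regression with `glm()`. This is an assumption about a suitable model, not something prescribed by the assignment, and the data are simulated: the column names `income_k`, `overtime`, and `left` and all coefficients used in the simulation are hypothetical.

```r
set.seed(42)  # reproducible simulated data; NOT the IBM records
n <- 1000
sim <- data.frame(
  income_k = runif(n, min = 2, max = 20),      # hypothetical monthly income in thousands
  overtime = rbinom(n, size = 1, prob = 0.3)   # hypothetical overtime dummy (1 = yes)
)

# Simulate attrition so that lower income and overtime raise the chance of leaving.
p_leave  <- plogis(0.5 - 0.3 * sim$income_k + 1.2 * sim$overtime)
sim$left <- rbinom(n, size = 1, prob = p_leave)

# Logistic regression: log-odds of leaving as a linear function of the predictors.
fit <- glm(left ~ income_k + overtime, data = sim, family = binomial)

# exp(coefficient) is the odds ratio for a one-unit increase in that predictor.
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

On the real data, the dummy-coded Attrition column would take the place of `left`, and the predictors would be chosen from the insights drawn in the analysis.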
Solution:
### Quick Intro:
I am new to R, so I am using this project for practice.
### Summary:
Based on the analysis of the "IBM HR Analytics Employee Attrition & Performance" data set, there are some valuable insights about employees and employee attrition that the Human Resources department can use in future work to improve employee retention, job satisfaction, and the work environment.
Most of the 237 employees who left the company had worked there for 0-5 years and had relatively low monthly incomes. Some employees chose to leave even though they were satisfied with the work environment, their colleagues, and the job itself. It is important for the company to uncover what made those employees leave and to address those factors in order to keep valuable employees.
Another interesting finding about the company's employees is that a higher education level doesn't promise higher income. This holds in all three departments: Human Resources, Research & Development, and Sales. On the other hand, age and work experience have stronger correlations with income in the company. This result partly explains why employees left: most of those who left had spent only a few years at the company, which partly accounts for their lower salaries.
### Background Information:
* **Data description:**
The data set "IBM HR Analytics Employee Attrition & Performance" is available on Kaggle.com. It includes 35 variables covering employees' demographics, their evaluations of the company, and whether they left. The data set has 1,470 rows of records. It is a fictional data set created by IBM data scientists.
- **Variables:**
Age, Attrition, BusinessTravel, DailyRate, Department, DistanceFromHome, Education, EducationField, EmployeeCount, EmployeeNumber, EnvironmentSatisfaction, Gender, HourlyRate, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, MonthlyRate, NumCompaniesWorked, Over18, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager
* **The four major objectives of the data visualization project are:**
1. Provide summary statistics about employees;
2. Understand how the company's employees feel about their work;
3. Explore relationships between employee attributes and monthly income;
4. Investigate possible factors that affect attrition.
### Analysis:
#### Step 1: Get Data Ready
* **Load packages for analysis:**
Four R packages, "tidyverse", "knitr", "gridExtra", and "ggpubr", are used in the analysis for plotting graphs and generating reports. The correlation plot also calls `reshape2::melt`, so "reshape2" must be installed as well.
```{r,message=FALSE,echo=FALSE,warning=FALSE}
library(tidyverse)
library(knitr)
library(gridExtra)
library(ggpubr)
```
* **Import Data**
```{r,echo=FALSE}
HR <- read.csv("../input/WA_Fn-UseC_-HR-Employee-Attrition.csv")
names(HR)[1] <- "Age" # Rename the first column to "Age" for consistency.
```
* **Dummy Coding Reverse:**
In the original data set, the variables "Education", "EnvironmentSatisfaction", "JobInvolvement", "JobSatisfaction", "PerformanceRating", "RelationshipSatisfaction", and "WorkLifeBalance" are coded as integers on a 1-5 or 1-4 scale. To make them easier to read and to visualize, these codes are mapped back to their category labels based on the explanations given by the data provider.
```{r,echo=FALSE}
# Employees' education
HR$Education[HR$Education=="1"] <- "Below College"
HR$Education[HR$Education=="2"] <- "College"
HR$Education[HR$Education=="3"] <- "Bachelor"
HR$Education[HR$Education=="4"] <- "Master"
HR$Education[HR$Education=="5"] <- "Doctor"
```
```{r,echo=FALSE}
# Employees' environment satisfaction
HR$EnvironmentSatisfaction[HR$EnvironmentSatisfaction=="1"] <- "Low"
HR$EnvironmentSatisfaction[HR$EnvironmentSatisfaction=="2"] <- "Medium"
HR$EnvironmentSatisfaction[HR$EnvironmentSatisfaction=="3"] <- "High"
HR$EnvironmentSatisfaction[HR$EnvironmentSatisfaction=="4"] <- "Very High"
```
```{r,echo=FALSE}
# Employees' job involvement
HR$JobInvolvement[HR$JobInvolvement=="1"] <- "Low"
HR$JobInvolvement[HR$JobInvolvement=="2"] <- "Medium"
HR$JobInvolvement[HR$JobInvolvement=="3"] <- "High"
HR$JobInvolvement[HR$JobInvolvement=="4"] <- "Very High"
```
```{r,echo=FALSE}
# Employees' job satisfaction
HR$JobSatisfaction[HR$JobSatisfaction=="1"] <- "Low"
HR$JobSatisfaction[HR$JobSatisfaction=="2"] <- "Medium"
HR$JobSatisfaction[HR$JobSatisfaction=="3"] <- "High"
HR$JobSatisfaction[HR$JobSatisfaction=="4"] <- "Very High"
```
```{r,echo=FALSE}
# Employees' performance rating
HR$PerformanceRating[HR$PerformanceRating=="1"] <- "Low"
HR$PerformanceRating[HR$PerformanceRating=="2"] <- "Good"
HR$PerformanceRating[HR$PerformanceRating=="3"] <- "Excellent"
HR$PerformanceRating[HR$PerformanceRating=="4"] <- "Outstanding"
```
```{r,echo=FALSE}
# Employees' relationship satisfaction
HR$RelationshipSatisfaction[HR$RelationshipSatisfaction=="1"] <- "Low"
HR$RelationshipSatisfaction[HR$RelationshipSatisfaction=="2"] <- "Medium"
HR$RelationshipSatisfaction[HR$RelationshipSatisfaction=="3"] <- "High"
HR$RelationshipSatisfaction[HR$RelationshipSatisfaction=="4"] <- "Very High"
```
```{r,echo=FALSE}
# Employees' work-life balance
HR$WorkLifeBalance[HR$WorkLifeBalance=="1"] <- "Bad"
HR$WorkLifeBalance[HR$WorkLifeBalance=="2"] <- "Good"
HR$WorkLifeBalance[HR$WorkLifeBalance=="3"] <- "Better"
HR$WorkLifeBalance[HR$WorkLifeBalance=="4"] <- "Best"
```
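The element-by-element replacement in the chunks above works because assigning a string coerces the whole column to character. A more compact alternative, sketched here as an option rather than as the method used in this report, is `factor()` with `levels` and `labels`, which maps the codes and fixes the category order in one step. The helper name `recode_scale` is hypothetical.

```r
# Labels as used in the chunks above for the 1-4 satisfaction scales.
satisfaction_labels <- c("Low", "Medium", "High", "Very High")

# Hypothetical helper: turn integer codes 1..k into an ordered factor.
recode_scale <- function(x, labels) {
  # levels = 1..k maps code 1 to the first label, code 2 to the second, and so on.
  factor(x, levels = seq_along(labels), labels = labels, ordered = TRUE)
}

codes   <- c(3, 1, 4, 2, 3)   # toy input standing in for a column such as JobSatisfaction
recoded <- recode_scale(codes, satisfaction_labels)

print(recoded)
print(table(recoded))         # counts appear in scale order, not alphabetical order
```

Because the result is already an ordered factor, later calls that reset the levels before plotting would no longer be needed for columns recoded this way.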
#### Step 2: Data Analysis
##### *A. Summary Statistics about Employees:*
To begin the analysis, it is important to get some summary statistics about the 1,470 employees.
```{r,echo=FALSE}
HR$Attrition <- factor(HR$Attrition,levels=c("Yes","No"))
attrition <- data.frame(table(HR$Attrition))
names(attrition)[1] <- "Status"
names(attrition)[2] <- "Counts"
kable(attrition)
```
**1) How many employees left:**
Of the 1,470 employee records, 237 people left the company, which accounts for 16.1% of all employees.
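The percentage can be reproduced from the counts alone. The sketch below rebuilds an attrition vector from the two numbers reported above (237 leavers out of 1,470 records) instead of reading the CSV, and uses `prop.table()` to turn counts into shares.

```r
# Rebuild the attrition column from the reported counts: 237 "Yes", the rest "No".
attrition <- factor(rep(c("Yes", "No"), times = c(237, 1470 - 237)),
                    levels = c("Yes", "No"))

counts <- table(attrition)                     # absolute counts per status
shares <- round(100 * prop.table(counts), 1)   # percentage of all employees

print(counts)
print(shares)
```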
```{r,echo=FALSE}
department <-data.frame(table(HR$Department))
kable(department,col.names = c("Department","Count"))
```
**2) Which departments these employees come from:**
961 employees are from the Research & Development department, which accounts for the majority of the data set. The other two departments included in the data set are Sales and Human Resources.
```{r,echo=FALSE,fig.align="center",message=FALSE}
inc_1 <- ggplot(HR, aes(x = MonthlyIncome, fill = Attrition)) +
geom_histogram(position = "dodge") + labs(x="Monthly Income", y="Number of employees")
inc_2 <- ggplot(HR, aes(x = HourlyRate, fill = Attrition)) +
geom_histogram(position = "dodge") + labs(x="Hourly Rate", y="Number of employees")
inc_3 <- ggplot(HR, aes(x = DailyRate, fill = Attrition)) +
geom_histogram(position = "dodge") + labs(x="Daily Rate", y="Number of employees")
grid.arrange(inc_1,inc_2,inc_3, ncol = 2, nrow = 2, top = "Income Distribution in company", bottom = "IBM HR Analytics")
```
**3) How employees' incomes are distributed:**
Since the data set doesn't explain how "rate" and "income" are defined, the analysis assumes that "hourly rate" and "daily rate" are equal to "hourly income" and "daily income". It is clear that most employees who left had a relatively low monthly income. Also, monthly income is positively skewed, which means that most employees earn less than $10,000 per month.
The distributions of hourly rate and daily rate don't provide much insight into employees. For employees who chose to leave, the hourly and daily rates are not visibly different from those of employees who stayed.
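The comparison of rates between the two groups is read off the histograms. One way to check it more formally, offered here as a suggestion and not as part of the original analysis, is to compare group medians and run a Wilcoxon rank-sum test. The data below are simulated, with both groups drawn from the same distribution, so they only show the mechanics.

```r
set.seed(1)  # simulated values, not the IBM records
sim <- data.frame(
  Attrition  = rep(c("Yes", "No"), times = c(100, 400)),
  HourlyRate = runif(500, min = 30, max = 100)   # same distribution for both groups
)

# Median hourly rate per attrition group.
medians <- tapply(sim$HourlyRate, sim$Attrition, median)

# Rank-based two-sample test; it does not assume normally distributed rates.
test <- wilcox.test(HourlyRate ~ Attrition, data = sim)

print(medians)
print(test$p.value)
```

A large p-value on the real data would be consistent with the reading above that the rates do not separate leavers from stayers; a small one would call for a closer look.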
```{r,echo=FALSE,fig.align="center",message=FALSE}
ggplot(HR) +
geom_histogram(mapping=(aes(TotalWorkingYears)),fill="skyblue",col="white",binwidth = 1) +
labs(x="Total Working Years", y="Number of employees",caption="IBM HR Analytics", title="Total Working Years") + theme(legend.position="none")
ggplot(HR, aes(x= Department, y=TotalWorkingYears, group = Department, fill = Department)) +
geom_violin() + theme(legend.position="none") +
coord_flip() +
labs(x="Department",y="Total Working Years",caption="IBM HR Analytics", title="Total Working Years by Attrition") +
facet_wrap(~ Attrition)
ggplot(HR) +
geom_histogram(mapping=(aes(YearsAtCompany)),fill="skyblue",col="white",binwidth = 1) +
labs(x="Working Years at the company", y="Number of employees",caption="IBM HR Analytics", title="Working Years at Company") + theme(legend.position="none")
ggplot(HR, aes(x= Department, y=YearsAtCompany, group = Department, fill = Department)) +
geom_violin() + theme(legend.position="none") +
coord_flip() +
labs(x="Department",y="Working Years at the company",caption="IBM HR Analytics", title="Working Years at Company by Attrition") +
facet_wrap(~ Attrition)
```
**4) Working Years in the company:**
In the company, the majority of employees have 0-12 years of total work experience. Looking at the experience gained at the company itself, 0-10 years is the most common length of service in all three departments. Most employees who left did so after working for the company for 0-5 years, or after working 0-10 years in total. Few employees have worked for the company for over 20 years, whether they left or stayed.
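The year ranges above are read from the plots. To count them directly, tenure could be cut into bands and cross-tabulated with attrition, as in this sketch. The rows are toy values, and the band boundaries and the `TenureBand` column name are assumptions chosen to match the ranges mentioned in the text.

```r
# Toy values for illustration; not taken from the data set.
hr <- data.frame(
  YearsAtCompany = c(0, 1, 3, 5, 6, 10, 12, 20, 25, 40),
  Attrition      = c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No")
)

# Bands: [0,5], (5,10], (10,20], (20,Inf). include.lowest keeps 0 in the first band.
hr$TenureBand <- cut(hr$YearsAtCompany,
                     breaks = c(0, 5, 10, 20, Inf),
                     labels = c("0-5", "6-10", "11-20", "21+"),
                     include.lowest = TRUE)

# Pivot table: tenure band by attrition status.
band_table <- table(hr$TenureBand, hr$Attrition)
print(band_table)
```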
##### *B. How employees think about their work:*
After getting some basic information about the employees, it is necessary to explore how they feel about their work in terms of the environment, the people they work with, and the job itself. Departments are also taken into account in the analysis.
```{r,echo=FALSE,fig.align="center"}
HR$EnvironmentSatisfaction <- factor(HR$EnvironmentSatisfaction,
levels = c("Low", "Medium","High","Very High"))
ggplot(data=HR)+
geom_bar(mapping=aes(EnvironmentSatisfaction,fill=Department), width=0.3)+
coord_cartesian(ylim=c(0, 500)) +
labs(title="Environment Satisfaction", subtitle="From 1470 employees", x="Environment Satisfaction Level",
y="Number of Employees", caption="IBM HR Analytics") + facet_wrap(~ Attrition)
```
**1) Environment Satisfaction:**
Based on the bar chart, the majority of employees have a "high" or "very high" level of satisfaction with their work environment. The number of employees with a low satisfaction level is about the same as the number with a medium level. This pattern holds in all three departments. However, the company may want to explore what makes over 500 employees feel not satisfied with the work environment, even though they are not the majority. Also, among employees who left, a higher proportion have a low satisfaction level.
```{r,echo=FALSE,fig.align="center"}
HR$RelationshipSatisfaction <- factor(HR$RelationshipSatisfaction,
levels = c("Low", "Medium","High","Very High"))
ggplot(data=HR)+
geom_bar(position="dodge",mapping=aes(Department,fill=RelationshipSatisfaction)) +
coord_cartesian(ylim=c(0, 300)) +
labs(title="Relationship Satisfaction", subtitle="From 1470 employees", y="Number of Employees",
x="Department", caption="IBM HR Analytics") + facet_wrap(~ Attrition) +
theme(axis.text.x = element_text(angle = 90))
```
**2) Relationship Satisfaction:**
Generally, employee relationships in the company are positive. The majority of employees are highly satisfied with their colleagues and bosses most of the time. There is room for the company to improve, since over 500 employees rate their relationship satisfaction as "low" or "medium". Within that group, employees from the Sales department are more likely to report a low satisfaction level than a medium one.
```{r,echo=FALSE,fig.align="center"}
HR$WorkLifeBalance <- factor(HR$WorkLifeBalance,
levels = c("Bad","Good","Better","Best"))
# Divided by department
worklife_sales <-data.frame(table(filter(HR,Department=="Sales")$WorkLifeBalance))
names(worklife_sales)[1] <- "Status"
names(worklife_sales)[2] <- "Counts"
worklife_RD <-data.frame(table(filter(HR,Department=="Research & Development")$WorkLifeBalance))
names(worklife_RD)[1] <- "Status"
names(worklife_RD)[2] <- "Counts"
worklife_HR <-data.frame(table(filter(HR,Department=="Human Resources")$WorkLifeBalance))
names(worklife_HR)[1] <- "Status"
names(worklife_HR)[2] <- "Counts"
```
```{r,echo=FALSE,fig.align="center"}
a <- ggpie(worklife_HR,"Counts",fill="Status",color="white", label="Counts",lab.pos = "out",lab.font = "white") +
ggtitle("Human Resources") + theme(legend.position = "right")
b <- ggpie(worklife_RD,"Counts",fill="Status",color="white", label="Counts",lab.pos = "out",lab.font = "white") +
ggtitle("Research & Development") + theme(legend.position = "right")
c <- ggpie(worklife_sales,"Counts",fill="Status",color="white", label="Counts",lab.pos = "out",lab.font = "white") +
ggtitle("Sales") + theme(legend.position = "right")
grid.arrange(a,b,c,ncol=2,nrow=2,top="Work Life Balance", bottom = "IBM HR Analytics",newpage = FALSE)
```
**3) Work-Life Balance:**
Most employees in all three departments report a "better" work-life balance. Compared with the Sales and Research & Development departments, Human Resources has a slightly higher proportion of employees who rate their work-life balance as "best".
```{r,echo=FALSE,fig.align="center"}
ggplot(HR, aes(x=Department, y=PerformanceRating, group = Department, fill = Department)) +
geom_violin() + theme(legend.position="none") +
coord_flip() +
labs(x="Department",y="Performance Rating",
title="Employees' Performance Rating by Department", caption="IBM HR Analytics") +
facet_wrap(~ Attrition)
```
**4) Performance Rating:**
Nobody in the company has a performance rating lower than "Excellent". It is not clear whether the rating is a self-rating or a rating given by other people. Without further explanation of where the rating comes from, other variables such as job satisfaction would be better indicators.
##### *C. Relationships between employee attributes and monthly income:*
Income is always an important factor for people when making decisions about work. Income can also indicate to employees how much the company values them. To better understand the company's employees, it is necessary to consider the relationships between income level and other attributes.
```{r,echo=FALSE,fig.align="center"}
HR_Numeric <- select(HR,Age,DailyRate,DistanceFromHome,HourlyRate,MonthlyIncome,MonthlyRate,TotalWorkingYears,
YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager)
cormat <- reshape2::melt(round(cor(HR_Numeric),2))
ggplot(cormat, aes(x=Var1, y=Var2, fill=value, label=value)) +
geom_tile() + theme_bw() + geom_text(aes(label=value), color="White") +
labs(title="HR numerical attributes - Correlation plot",caption="IBM HR Analytics", x="", y="") +
theme(legend.position="none", axis.text.x = element_text(angle = 90))
```
**1) Correlation:**
According to the correlation matrix for the quantitative attributes, several factors have a strong positive correlation with monthly income, such as age, total working years, and years at company.
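Reading individual cells off a heat map can be slow. A small sketch of how the income column of the correlation matrix could be pulled out and ranked is given below. The data are simulated so that income depends on working years; the column names mirror the data set, but the values and the resulting correlations are made up.

```r
set.seed(7)  # simulated columns; the names mirror the data set, the values do not
n <- 300
total_years <- round(runif(n, 0, 40))
sim <- data.frame(
  TotalWorkingYears = total_years,
  Age               = 18 + total_years + round(runif(n, 0, 5)),
  DistanceFromHome  = round(runif(n, 1, 29)),
  MonthlyIncome     = 2000 + 400 * total_years + rnorm(n, sd = 2000)
)

# Pearson correlation matrix of all numeric columns.
cor_matrix <- cor(sim)

# Keep only the correlations with MonthlyIncome, strongest first,
# and drop the trivial self-correlation of 1.
income_cor <- sort(cor_matrix[, "MonthlyIncome"], decreasing = TRUE)
income_cor <- income_cor[names(income_cor) != "MonthlyIncome"]

print(round(income_cor, 2))
```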
```{r,echo=FALSE,fig.align="center"}
HR$JobSatisfaction <- factor(HR$JobSatisfaction,
levels = c("Low", "Medium","High","Very High"))
ggplot(HR,aes(Age,MonthlyIncome)) + geom_point(aes(color=Department)) +
geom_smooth(col="black", se=FALSE,method="loess")+ facet_grid(.~Department) +
labs(title="Age and Monthly Income", x="Age", y="Monthly Income",caption="IBM HR Analytics")
```
**2) Age and Monthly Income:**
Overall, monthly income increases with age in all three departments. There are "outliers" aged 50-60 who earn low salaries, especially in the Research & Development and Sales departments.
```{r,echo=FALSE,fig.align="center"}
HR$EducationField <- factor(HR$EducationField,
                            levels = c("Human Resources","Life Sciences", "Marketing", "Medical",
                                       "Technical Degree", "Other"))
HR$Education <- factor(HR$Education, levels = c("Below College","College","Bachelor","Master","Doctor"))
ggplot(HR,aes(Education,MonthlyIncome)) + geom_point(aes(color=EducationField)) +
  labs(title="Education Level and Monthly Income",
       x="Education Level", y="Monthly Income",caption="IBM HR Analytics") +
  coord_flip()
ggplot(HR,aes(Education,MonthlyIncome)) + geom_point(aes(color=EducationField)) +
facet_grid(.~Department) + theme(axis.text.x = element_text(angle = 90))+
labs(title="Education Level and Monthly Income by Departments",x="Education Level", y="Monthly Income",caption="IBM HR Analytics")
```
**3) Education and Monthly Income:**
It is interesting that in this company, education level and monthly income are not as correlated as one might expect. A higher education level doesn't guarantee higher income. In the Human Resources department, employees with a doctoral degree have relatively lower incomes than Master's or Bachelor's degree holders. In the Research & Development department, income does not increase with education level. The same pattern holds in the Sales department.
Looking at employees' education fields, Life Sciences and Medical are the two major fields for employees in the R&D department, while employees in the Sales department are more likely to have a Life Sciences or Marketing background.
```{r,echo=FALSE,fig.align="center"}
Average_Income<- data.frame(summarise(filter(HR,Gender=="Female"),
"Female Average Salary" = mean(MonthlyIncome, na.rm = TRUE)),
summarise(filter(HR,Gender=="Male"),
"Male Average Salary" = mean(MonthlyIncome, na.rm = TRUE)))
kable(Average_Income)
ggplot(HR,aes(Age,MonthlyIncome)) + geom_point(aes(color=Gender)) +
geom_smooth(se=FALSE,method="loess", mapping=aes(linetype = Gender), col="black") +
facet_grid(.~Department) +
labs(title="Age and Monthly Income by Departments",x="Age", y="Monthly Income",caption="IBM HR Analytics")
```
**4) Gender and Monthly Income:**
The average monthly income for female employees is $6,686.57, slightly higher than the average for male employees, which is $6,380.51. The trend lines suggest that the relationship between age and income holds for both genders. However, the salary difference between male and female employees of the same age varies by department. For example, the trend lines for the Human Resources and Research & Development departments show that female employees aged 50-60 have higher monthly incomes than male employees of the same age, while in the Sales department the opposite is true.
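The department-level differences are taken from the trend lines. A table of group means is one way to put numbers on them; the sketch below uses base R's `aggregate()` on a few made-up rows, so the printed averages are illustrative and are not the figures quoted above.

```r
# Made-up rows for illustration; the incomes are not from the data set.
hr <- data.frame(
  Gender        = c("Female", "Female", "Male", "Male", "Female", "Male"),
  Department    = c("Sales", "Sales", "Sales", "Sales",
                    "Human Resources", "Human Resources"),
  MonthlyIncome = c(5000, 7000, 4000, 6000, 3000, 9000)
)

# Average income per gender across the whole company.
by_gender <- aggregate(MonthlyIncome ~ Gender, data = hr, FUN = mean)

# Average income per department and gender, one row per combination.
by_both <- aggregate(MonthlyIncome ~ Department + Gender, data = hr, FUN = mean)

print(by_gender)
print(by_both)
```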
##### *D. Possible factors affecting attrition:*
As the income distribution chart suggests, most employees who left the company had relatively low monthly incomes, which may partly explain why they left. Beyond that, it is helpful to look at qualitative attributes. Among the 35 attributes, Job Satisfaction is a good indicator.
```{r,echo=FALSE,fig.align="center"}
ggplot(data=HR)+
geom_bar(position="dodge",mapping=aes(JobSatisfaction,fill=Attrition)) + labs(title="Job Satisfaction and Attrition", x="Job Satisfaction", y="Number of employees", caption="IBM HR Analytics")
```
It is interesting to notice that the two largest groups of employees who left are those with "low" and "high" job satisfaction. For employees who are highly satisfied with their job, it is especially important for HR to find out what caused them to leave.
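The bar chart compares raw counts, and the satisfaction groups differ in size, so it may also help to look at the attrition rate within each group. The sketch below uses `prop.table()` with `margin = 1` on made-up counts chosen so that two groups have the same number of leavers but different rates; it does not reproduce the real figures.

```r
levels_js <- c("Low", "Medium", "High", "Very High")

# Made-up data: 4 "Low", 1 "Medium", 8 "High", 1 "Very High" employees.
hr <- data.frame(
  JobSatisfaction = factor(rep(levels_js, times = c(4, 1, 8, 1)), levels = levels_js),
  Attrition = factor(c(rep(c("Yes", "No"), times = c(2, 2)),   # Low: 2 of 4 left
                       "No",                                   # Medium: 0 of 1 left
                       rep(c("Yes", "No"), times = c(2, 6)),   # High: 2 of 8 left
                       "No"),                                  # Very High: 0 of 1 left
                     levels = c("Yes", "No"))
)

counts <- table(hr$JobSatisfaction, hr$Attrition)   # pivot table of raw counts
rates  <- prop.table(counts, margin = 1)            # share within each row; rows sum to 1

print(counts)
print(round(rates, 2))
```

In this toy example "Low" and "High" have the same number of leavers, but the rate of leaving is twice as high in the "Low" group, which is the kind of difference a count-based chart can hide.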
#### Limitation:
* The data set is fictional and comes without adequate background information. Therefore, some interpretations of the data are based on the column names rather than on a full understanding of what the names actually mean.
* The data set includes a relatively small number of records, which may not be representative enough to describe the company.
#### Data Source:
* [IBM HR Analytics Employee Attrition & Performance](https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset)
#### Package Reference:
* [tidyverse](https://www.tidyverse.org/packages/)
* [knitr](https://cran.r-project.org/web/packages/knitr/knitr.pdf)
* [gridExtra](https://cran.r-project.org/web/packages/gridExtra/gridExtra.pdf)
* [ggpubr](https://cran.r-project.org/web/packages/ggpubr/ggpubr.pdf)
#### Useful Reference List:
* [Data Visualization with ggplot2 Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf)
* [R Markdown Cheat Sheet](https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf)
* [R for Data Science](http://r4ds.had.co.nz/)