Predicting Earning Potential using the Adult Dataset
Haojun Zhu
December 5, 2016
Introduction
The adult dataset is from the 1994 Census database. It is also known as the "Census Income" dataset; details can be found at the UCI Machine Learning Repository. I am interested in how well I can predict whether an individual's annual income exceeds $50,000 using the variables in this data set. The question is examined with two different approaches: traditional statistical modeling and machine learning techniques. Logistic regression is used as the statistical modeling tool, since the outcome is binary. Four machine learning techniques, namely neural networks, classification and regression trees, random forests, and support vector machines, are used to answer the same question. It turns out that machine learning techniques do not always outperform statistical modeling.
Exploratory Data Analysis and Data Cleaning
The dataset is read into R directly from the UCI Machine Learning Repository.
library(ggplot2)
library(plyr)
library(ROCR)
adult <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                    sep = ',', fill = FALSE, strip.white = TRUE,
                    stringsAsFactors = TRUE) # read strings as factors (the pre-R 4.0 default)
colnames(adult) <- c('age', 'workclass', 'fnlwgt', 'education',
                     'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
                     'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')
Some of the variables are not self-explanatory. The continuous variable fnlwgt represents the final weight, i.e., the number of units in the target population that the responding unit represents. The variable education_num stands for the total number of years of education, a continuous representation of the discrete variable education. The variable relationship represents the responding unit's role in the family. capital_gain and capital_loss are income from investment sources other than wage/salary.
For simplicity of this analysis, the weighting factor is discarded. The total number of years of education is already captured by the highest education level completed, and role in the family can be assessed from gender and marital status. Thus, the following three variables are deleted: education, relationship, and fnlwgt.
adult$education <- NULL
adult$fnlwgt <- NULL
adult$relationship <- NULL
The first variable, age, is continuous. As an initial step, two histograms are plotted.
# histogram of age by income group
ggplot(adult) + aes(x=as.numeric(age), group=income, fill=income) +
geom_histogram(binwidth=1, color='black')
# histogram of age by gender group
ggplot(adult) + aes(x=as.numeric(age), group=sex, fill=sex) +
geom_histogram(binwidth=1, color='black')
Notice that the majority of observations make less than $50,000 a year, and that those who do make over $50,000 annually are mostly in mid-career. Interestingly, females are underrepresented, which could possibly be caused by census bias.
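Since the two income classes are clearly unbalanced, it is worth quantifying the split before any modeling; a quick sketch (the class counts also appear in the summary table later in this report):
# proportion of each income class in the full data
round(prop.table(table(adult$income)), 3)
# roughly 76% <=50K vs. 24% >50K (24720 and 7841 observations)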
The variable workclass stands for the industry in which the responding unit is employed.
summary(adult$workclass)
## ? Federal-gov Local-gov Never-worked
## 1836 960 2093 7
## Private Self-emp-inc Self-emp-not-inc State-gov
## 22696 1116 2541 1298
## Without-pay
## 14
Notice that there are two small groups, Never-worked and Without-pay. I will combine them with the unknowns into a group called Other/Unknown. Those who work in the government are broken down into federal, state, and local levels; to facilitate the analysis, I group them into a single category called Government. Similarly, the self-employed fall into two groups, incorporated and not incorporated, which are combined into Self-Employed.
levels(adult$workclass)[1] <- 'Unknown'
# combine into Government job
adult$workclass <- gsub('^Federal-gov', 'Government', adult$workclass)
adult$workclass <- gsub('^Local-gov', 'Government', adult$workclass)
adult$workclass <- gsub('^State-gov', 'Government', adult$workclass)
# combine into Self-Employed job
adult$workclass <- gsub('^Self-emp-inc', 'Self-Employed', adult$workclass)
adult$workclass <- gsub('^Self-emp-not-inc', 'Self-Employed', adult$workclass)
# combine into Other/Unknown
adult$workclass <- gsub('^Never-worked', 'Other', adult$workclass)
adult$workclass <- gsub('^Without-pay', 'Other', adult$workclass)
adult$workclass <- gsub('^Other', 'Other/Unknown', adult$workclass)
adult$workclass <- gsub('^Unknown', 'Other/Unknown', adult$workclass)
adult$workclass <- as.factor(adult$workclass)
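As a side note, the chain of gsub() calls above can be replaced by a single lookup-table recode. The sketch below is an equivalent alternative shown for reference only; it is not executed here, since the recode has already been applied.
# equivalent lookup-table recode of the original workclass levels (a sketch)
map <- c('Federal-gov' = 'Government', 'Local-gov' = 'Government',
         'State-gov' = 'Government', 'Self-emp-inc' = 'Self-Employed',
         'Self-emp-not-inc' = 'Self-Employed', 'Never-worked' = 'Other/Unknown',
         'Without-pay' = 'Other/Unknown', 'Unknown' = 'Other/Unknown',
         'Private' = 'Private')
adult$workclass <- factor(map[as.character(adult$workclass)])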
Thus, there are now four categories for the variable workclass.
summary(adult$workclass)
## Government Other/Unknown Private Self-Employed
## 4351 1857 22696 3657
To explore the relationship between industry and income, the counts for the two income categories are calculated across all four categories of workclass, along with the in-group proportions. The results are presented as a bar plot.
# barplot of job type by income group
# get the counts by industry and income group
count <- table(adult[adult$workclass == 'Government',]$income)["<=50K"]
count <- c(count, table(adult[adult$workclass == 'Government',]$income)[">50K"])
count <- c(count, table(adult[adult$workclass == 'Other/Unknown',]$income)["<=50K"])
count <- c(count, table(adult[adult$workclass == 'Other/Unknown',]$income)[">50K"])
count <- c(count, table(adult[adult$workclass == 'Private',]$income)["<=50K"])
count <- c(count, table(adult[adult$workclass == 'Private',]$income)[">50K"])
count <- c(count, table(adult[adult$workclass == 'Self-Employed',]$income)["<=50K"])
count <- c(count, table(adult[adult$workclass == 'Self-Employed',]$income)[">50K"])
count <- as.numeric(count)
# create a dataframe
industry <- rep(levels(adult$workclass), each = 2)
income <- rep(c('<=50K', '>50K'), 4)
df <- data.frame(industry, income, count)
df
## industry income count
## 1 Government <=50K 3010
## 2 Government >50K 1341
## 3 Other/Unknown <=50K 1666
## 4 Other/Unknown >50K 191
## 5 Private <=50K 17733
## 6 Private >50K 4963
## 7 Self-Employed <=50K 2311
## 8 Self-Employed >50K 1346
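For reference, the same data frame can be built much more compactly from a contingency table; a sketch of an equivalent construction (df_alt is a hypothetical name, used to avoid clobbering df):
# equivalent, more concise construction of the counts data frame (a sketch)
tab <- table(adult$workclass, adult$income)
df_alt <- data.frame(industry = rep(rownames(tab), each = 2),
                     income = rep(colnames(tab), times = nrow(tab)),
                     count = as.numeric(t(tab)))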
# calculate the percentages
df <- ddply(df, .(industry), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df <- ddply(df, .(industry), transform, pos = (cumsum(count) - 0.5 * count))
df$label <- paste0(sprintf("%.0f", df$percent), "%")
# bar plot of counts by industry with in group proportions
ggplot(df, aes(x = industry, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income by Industry')
Those who are self-employed have the highest tendency to make more than $50,000 a year.
Since education_num is a continuous representation of education, a stacked bar plot is used to visualize the relationship between education_num and income; in-group proportions are calculated as well.
# create a dataframe
df1 <- data.frame(table(adult$income, adult$education_num))
names(df1) <- c('income', 'education_num', 'count')
df1
## income education_num count
## 1 <=50K 1 51
## 2 >50K 1 0
## 3 <=50K 2 162
## 4 >50K 2 6
## 5 <=50K 3 317
## 6 >50K 3 16
## 7 <=50K 4 606
## 8 >50K 4 40
## 9 <=50K 5 487
## 10 >50K 5 27
## 11 <=50K 6 871
## 12 >50K 6 62
## 13 <=50K 7 1115
## 14 >50K 7 60
## 15 <=50K 8 400
## 16 >50K 8 33
## 17 <=50K 9 8826
## 18 >50K 9 1675
## 19 <=50K 10 5904
## 20 >50K 10 1387
## 21 <=50K 11 1021
## 22 >50K 11 361
## 23 <=50K 12 802
## 24 >50K 12 265
## 25 <=50K 13 3134
## 26 >50K 13 2221
## 27 <=50K 14 764
## 28 >50K 14 959
## 29 <=50K 15 153
## 30 >50K 15 423
## 31 <=50K 16 107
## 32 >50K 16 306
# calculate the percentages
df1 <- ddply(df1, .(education_num), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df1 <- ddply(df1, .(education_num), transform, pos = (cumsum(count) - 0.5 * count))
df1$label <- paste0(sprintf("%.0f", df1$percent), "%")
# remove some in group percentage to avoid overlapped text
df1$label[which(df1$percent < 5)] <- NA
# bar plot of counts by years of education with in group proportions
ggplot(df1, aes(x = education_num, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income Level with Years of Education')
It is not hard to notice that the in-group proportion making more than $50,000 a year increases with years of education. Among those without any college education (at most 8 years of education), less than 10% have an annual income greater than $50,000, while among those with doctorate degrees, nearly 3 out of 4 make more than $50,000 a year!
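The same trend can be checked numerically with a one-line table of in-group proportions; a sketch:
# proportion of each income class within each number of years of education
round(prop.table(table(adult$education_num, adult$income), margin = 1), 2)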
summary(adult$occupation)
## ? Adm-clerical Armed-Forces Craft-repair
## 1843 3770 9 4099
## Exec-managerial Farming-fishing Handlers-cleaners Machine-op-inspct
## 4066 994 1370 2002
## Other-service Priv-house-serv Prof-specialty Protective-serv
## 3295 149 4140 649
## Sales Tech-support Transport-moving
## 3650 928 1597
For simplicity of the model, occupation is also blocked into several groups: Blue-Collar, Professional, Sales, Service, and White-Collar. Because of the small number of people in the armed forces, they are combined with the unknowns to form an Other/Unknown group.
levels(adult$occupation)[1] <- 'Unknown'
adult$occupation <- gsub('Adm-clerical', 'White-Collar', adult$occupation)
adult$occupation <- gsub('Craft-repair', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Exec-managerial', 'White-Collar', adult$occupation)
adult$occupation <- gsub('Farming-fishing', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Handlers-cleaners', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Machine-op-inspct', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Other-service', 'Service', adult$occupation)
adult$occupation <- gsub('Priv-house-serv', 'Service', adult$occupation)
adult$occupation <- gsub('Prof-specialty', 'Professional', adult$occupation)
adult$occupation <- gsub('Protective-serv', 'Service', adult$occupation)
adult$occupation <- gsub('Tech-support', 'Service', adult$occupation)
adult$occupation <- gsub('Transport-moving', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Unknown', 'Other/Unknown', adult$occupation)
adult$occupation <- gsub('Armed-Forces', 'Other/Unknown', adult$occupation)
adult$occupation <- as.factor(adult$occupation)
summary(adult$occupation)
## Blue-Collar Other/Unknown Professional Sales Service
## 10062 1852 4140 3650 5021
## White-Collar
## 7836
# create a dataframe
df2 <- data.frame(table(adult$income, adult$occupation))
names(df2) <- c('income', 'occupation', 'count')
df2
## income occupation count
## 1 <=50K Blue-Collar 8362
## 2 >50K Blue-Collar 1700
## 3 <=50K Other/Unknown 1660
## 4 >50K Other/Unknown 192
## 5 <=50K Professional 2281
## 6 >50K Professional 1859
## 7 <=50K Sales 2667
## 8 >50K Sales 983
## 9 <=50K Service 4389
## 10 >50K Service 632
## 11 <=50K White-Collar 5361
## 12 >50K White-Collar 2475
# calculate the percentages
df2 <- ddply(df2, .(occupation), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df2 <- ddply(df2, .(occupation), transform, pos = (cumsum(count) - 0.5 * count))
df2$label <- paste0(sprintf("%.0f", df2$percent), "%")
# bar plot of counts by occupation with in group proportions
ggplot(df2, aes(x = occupation, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income Level with Different Occupations')
Notice that income varies greatly across occupations: nearly half of those in Professional occupations make more than $50,000 a year, while that share is only 13% for Service occupations.
marital_status is a categorical variable with 7 categories indicating the respondent's marital status. In fact, it can be blocked into fewer categories as well.
summary(adult$marital_status)
## Divorced Married-AF-spouse Married-civ-spouse
## 4443 23 14976
## Married-spouse-absent Never-married Separated
## 418 10683 1025
## Widowed
## 993
adult$marital_status <- gsub('Married-AF-spouse', 'Married', adult$marital_status)
adult$marital_status <- gsub('Married-civ-spouse', 'Married', adult$marital_status)
adult$marital_status <- gsub('Married-spouse-absent', 'Married', adult$marital_status)
adult$marital_status <- gsub('Never-married', 'Single', adult$marital_status)
adult$marital_status <- as.factor(adult$marital_status)
summary(adult$marital_status)
## Divorced Married Separated Single Widowed
## 4443 15417 1025 10683 993
df3 <- data.frame(table(adult$income, adult$marital_status))
names(df3) <- c('income', 'marital_status', 'count')
df3
## income marital_status count
## 1 <=50K Divorced 3980
## 2 >50K Divorced 463
## 3 <=50K Married 8681
## 4 >50K Married 6736
## 5 <=50K Separated 959
## 6 >50K Separated 66
## 7 <=50K Single 10192
## 8 >50K Single 491
## 9 <=50K Widowed 908
## 10 >50K Widowed 85
# calculate the percentages
df3 <- ddply(df3, .(marital_status), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df3 <- ddply(df3, .(marital_status), transform, pos = (cumsum(count) - 0.5 * count))
df3$label <- paste0(sprintf("%.0f", df3$percent), "%")
# bar plot of counts by marital status with in group proportions
ggplot(df3, aes(x = marital_status, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income Level with Marital Status')
For those who are married, nearly half make more than $50,000 a year.
capital_gain and capital_loss are two continuous variables describing income and loss from financial investments. Histograms show that the distributions of both variables are highly skewed.
# histogram of capital_gain
ggplot(adult) + aes(x=as.numeric(capital_gain), group=income, fill=income) +
geom_histogram(bins=10, color='black') + ggtitle('Histogram of Capital Gain')
# histogram of capital_loss
ggplot(adult) + aes(x=as.numeric(capital_loss), group=income, fill=income) +
geom_histogram(bins=10, color='black') + ggtitle('Histogram of Capital Loss')
# percentage of observations with no capital gain or loss
sum(adult$capital_gain == 0)/length(adult$capital_gain)
## [1] 0.9167102
sum(adult$capital_loss == 0)/length(adult$capital_loss)
## [1] 0.9533491
In fact, most observations have zero capital_gain and/or capital_loss. Similarly, native_country is highly skewed, as most observations are from the United States. Therefore, these three variables are excluded from the analysis as well.
adult$capital_gain <- NULL
adult$capital_loss <- NULL
adult$native_country <- NULL
race is a categorical variable. The bar plot shows that Whites and Asian-Pacific Islanders have high earning potential: over 25% of the observations of these two races make above $50,000 annually.
df4 <- data.frame(table(adult$income, adult$race))
names(df4) <- c('income', 'race', 'count')
df4
## income race count
## 1 <=50K Amer-Indian-Eskimo 275
## 2 >50K Amer-Indian-Eskimo 36
## 3 <=50K Asian-Pac-Islander 763
## 4 >50K Asian-Pac-Islander 276
## 5 <=50K Black 2737
## 6 >50K Black 387
## 7 <=50K Other 246
## 8 >50K Other 25
## 9 <=50K White 20699
## 10 >50K White 7117
# calculate the percentages
df4 <- ddply(df4, .(race), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df4 <- ddply(df4, .(race), transform, pos = (cumsum(count) - 0.5 * count))
df4$label <- paste0(sprintf("%.0f", df4$percent), "%")
# do not display percentage for low counts categories
df4$label[df4$race == 'Other'] <- NA
df4$label[df4$race == 'Amer-Indian-Eskimo'] <- NA
# bar plot of counts by race with in-group proportions
ggplot(df4, aes(x = race, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income Level by Race')
Now, there are 9 variables left for the analysis. Below is a table of summary statistics.
summary(adult)
## age workclass education_num marital_status
## Min. :17.00 Government : 4351 Min. : 1.00 Divorced : 4443
## 1st Qu.:28.00 Other/Unknown: 1857 1st Qu.: 9.00 Married :15417
## Median :37.00 Private :22696 Median :10.00 Separated: 1025
## Mean :38.58 Self-Employed: 3657 Mean :10.08 Single :10683
## 3rd Qu.:48.00 3rd Qu.:12.00 Widowed : 993
## Max. :90.00 Max. :16.00
## occupation race sex
## Blue-Collar :10062 Amer-Indian-Eskimo: 311 Female:10771
## Other/Unknown: 1852 Asian-Pac-Islander: 1039 Male :21790
## Professional : 4140 Black : 3124
## Sales : 3650 Other : 271
## Service : 5021 White :27816
## White-Collar : 7836
## hours_per_week income
## Min. : 1.00 <=50K:24720
## 1st Qu.:40.00 >50K : 7841
## Median :40.00
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
Model Fitting – Logistic Regression
80% of the original data is used as the training set, while the remaining 20% is used as the test set.
sz <- round(.8 * dim(adult)[1]) # training set size
training_set <- adult[1:sz,]
testing_set <- adult[-(1:sz),]
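Note that this split simply takes the first 80% of rows, which is valid only if the rows are in effectively random order. A randomized split is generally safer; a minimal sketch (not used for the results below, and the seed value is arbitrary):
# randomized 80/20 split (a sketch; not the split used in this analysis)
set.seed(2016) # arbitrary seed for reproducibility
idx <- sample(seq_len(nrow(adult)), size = round(0.8 * nrow(adult)))
training_set_r <- adult[idx, ]
testing_set_r <- adult[-idx, ]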
A logistic regression is fitted using income as the response variable and all other 8 variables as predictors. Its parameter estimates and confidence intervals are reported below.
m1 <- glm(income ~ ., data = training_set, family = binomial('logit'))
summary(m1)
##
## Call:
## glm(formula = income ~ ., family = binomial("logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6900 -0.5892 -0.2606 -0.0752 3.2844
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.304867 0.278827 -33.371 < 2e-16 ***
## age 0.028424 0.001632 17.413 < 2e-16 ***
## workclassOther/Unknown -1.226062 0.812990 -1.508 0.1315
## workclassPrivate 0.065641 0.053401 1.229 0.2190
## workclassSelf-Employed -0.143134 0.068836 -2.079 0.0376 *
## education_num 0.316604 0.009367 33.799 < 2e-16 ***
## marital_statusMarried 2.009367 0.067075 29.957 < 2e-16 ***
## marital_statusSeparated -0.118183 0.162774 -0.726 0.4678
## marital_statusSingle -0.456110 0.082510 -5.528 3.24e-08 ***
## marital_statusWidowed -0.049819 0.153751 -0.324 0.7459
## occupationOther/Unknown 0.890005 0.811907 1.096 0.2730
## occupationProfessional 0.758835 0.067266 11.281 < 2e-16 ***
## occupationSales 0.493151 0.063295 7.791 6.63e-15 ***
## occupationService 0.156311 0.066728 2.343 0.0192 *
## occupationWhite-Collar 0.775103 0.052391 14.795 < 2e-16 ***
## raceAsian-Pac-Islander 0.127056 0.248409 0.511 0.6090
## raceBlack 0.307972 0.237403 1.297 0.1945
## raceOther -0.466651 0.370702 -1.259 0.2081
## raceWhite 0.489306 0.227223 2.153 0.0313 *
## sexMale 0.387568 0.051550 7.518 5.55e-14 ***
## hours_per_week 0.030626 0.001629 18.802 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28685 on 26048 degrees of freedom
## Residual deviance: 19163 on 26028 degrees of freedom
## AIC: 19205
##
## Number of Fisher Scoring iterations: 6
confint(m1)
## 2.5 % 97.5 %
## (Intercept) -9.86450760 -8.769848897
## age 0.02522800 0.031627206
## workclassOther/Unknown -2.81095155 0.353340625
## workclassPrivate -0.03879528 0.170549950
## workclassSelf-Employed -0.27807457 -0.008224174
## education_num 0.29832991 0.335050878
## marital_statusMarried 1.87910604 2.142085777
## marital_statusSeparated -0.44670006 0.192468641
## marital_statusSingle -0.61768425 -0.294168625
## marital_statusWidowed -0.35840810 0.245101683
## occupationOther/Unknown -0.69042951 2.469666670
## occupationProfessional 0.62703557 0.890730836
## occupationSales 0.36894319 0.617075711
## occupationService 0.02504151 0.286644006
## occupationWhite-Collar 0.67249541 0.877875622
## raceAsian-Pac-Islander -0.34768819 0.628341149
## raceBlack -0.14373152 0.789261324
## raceOther -1.20916579 0.249803089
## raceWhite 0.05890259 0.952096611
## sexMale 0.28666433 0.488756993
## hours_per_week 0.02744120 0.033826994
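Since the coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are easier to interpret; a small sketch:
# exponentiate the coefficients to obtain odds ratios (a sketch)
round(exp(coef(m1)), 2)
# e.g. exp(2.009) is roughly 7.5: holding the other predictors fixed, married
# respondents have about 7.5 times the odds of earning more than $50,000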
This logistic regression model appears to fit the data well. To explore the possibility of a more parsimonious model, forward and backward stepwise selection using AIC are both performed.
m_full <- m1 # full model is the model just fitted
m_null <- glm(income ~ 1, data = training_set, family = binomial('logit'))
# backward selection
step(m_full, trace = F, scope = list(lower=formula(m_null), upper=formula(m_full)),
direction = 'backward')
##
## Call: glm(formula = income ~ age + workclass + education_num + marital_status +
## occupation + race + sex + hours_per_week, family = binomial("logit"),
## data = training_set)
##
## Coefficients:
## (Intercept) age workclassOther/Unknown
## -9.30487 0.02842 -1.22606
## workclassPrivate workclassSelf-Employed education_num
## 0.06564 -0.14313 0.31660
## marital_statusMarried marital_statusSeparated marital_statusSingle
## 2.00937 -0.11818 -0.45611
## marital_statusWidowed occupationOther/Unknown occupationProfessional
## -0.04982 0.89000 0.75883
## occupationSales occupationService occupationWhite-Collar
## 0.49315 0.15631 0.77510
## raceAsian-Pac-Islander raceBlack raceOther
## 0.12706 0.30797 -0.46665
## raceWhite sexMale hours_per_week
## 0.48931 0.38757 0.03063
##
## Degrees of Freedom: 26048 Total (i.e. Null); 26028 Residual
## Null Deviance: 28690
## Residual Deviance: 19160 AIC: 19210
# forward selection
step(m_null, trace = F, scope = list(lower=formula(m_null), upper=formula(m_full)),
direction = 'forward')
##
## Call: glm(formula = income ~ marital_status + education_num + hours_per_week +
## age + occupation + sex + race + workclass, family = binomial("logit"),
## data = training_set)
##
## Coefficients:
## (Intercept) marital_statusMarried marital_statusSeparated
## -9.30487 2.00937 -0.11818
## marital_statusSingle marital_statusWidowed education_num
## -0.45611 -0.04982 0.31660
## hours_per_week age occupationOther/Unknown
## 0.03063 0.02842 0.89000
## occupationProfessional occupationSales occupationService
## 0.75883 0.49315 0.15631
## occupationWhite-Collar sexMale raceAsian-Pac-Islander
## 0.77510 0.38757 0.12706
## raceBlack raceOther raceWhite
## 0.30797 -0.46665 0.48931
## workclassOther/Unknown workclassPrivate workclassSelf-Employed
## -1.22606 0.06564 -0.14313
##
## Degrees of Freedom: 26048 Total (i.e. Null); 26028 Residual
## Null Deviance: 28690
## Residual Deviance: 19160 AIC: 19210
Both the forward and backward stepwise selection algorithms return the same model as the initial fit. Thus, the original model is retained, and model diagnostics follow. As expected for a good model, most of the deviance residuals fall within ±3:
# create a data frame to store information regarding deviance residuals
index <- 1:dim(training_set)[1]
dev_resid <- residuals(m1)
income <- training_set$income
dff <- data.frame(index, dev_resid, income)
ggplot(dff, aes(x = index, y = dev_resid, color = income)) +
  geom_point() +
  geom_hline(yintercept = 3, linetype = 'dashed', color = 'blue') +
  geom_hline(yintercept = -3, linetype = 'dashed', color = 'blue') +
  ggtitle('Plot of Deviance Residuals')
Logistic regression models the probability that an individual makes more than $50,000 annually. In other words, a fitted probability closer to 1 indicates a higher chance of making over $50,000, while a value closer to 0 indicates a higher chance of making less. Thus, a threshold of 0.5 is used to decide whether an individual is predicted to make more than $50,000 annually. A confusion matrix is presented to evaluate how well the model predicts income.
prob <- predict(m1, testing_set, type = 'response')
pred <- rep('<=50K', length(prob))
pred[prob>=.5] <- '>50K'
# confusion matrix
tb <- table(pred, testing_set$income)
tb
##
## pred <=50K >50K
## <=50K 4533 745
## >50K 379 855
The prediction result has an accuracy of 82.74%, and a misclassification rate of 17.26%.
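Both rates follow directly from the confusion matrix; a small sketch:
# overall accuracy and misclassification rate from the confusion matrix
accuracy <- sum(diag(tb)) / sum(tb) # (4533 + 855) / 6512, about 0.8274
error <- 1 - accuracy # about 0.1726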
Model Fitting – Machine Learning Techniques
Several machine learning techniques are employed to predict whether an observation makes more than $50,000 a year. As with logistic regression, the original data set is split into a training set and a testing set; models are trained on the training set and validated on the testing set.
First, a neural network (NN) with one hidden layer is trained. Although there are only 9 input variables, many of them are categorical and expand into several dummy variables, so the number of hidden nodes is set to 40. Least squares is used to optimize the objective function, and the maximum number of iterations is set to 500.
library(nnet)
nn1 <- nnet(income ~ ., data = training_set, size = 40, maxit = 500)
## # weights: 881
## initial value 42694.598693
## iter 10 value 13952.527059
## iter 20 value 11964.837047
## iter 30 value 11113.810122
## iter 40 value 10217.851198
## iter 50 value 9845.432703
## iter 60 value 9633.081097
## iter 70 value 9480.659095
## iter 80 value 9349.965952
## iter 90 value 9233.513642
## iter 100 value 9151.360640
## iter 110 value 9110.961857
## iter 120 value 9082.755974
## iter 130 value 9059.320777
## iter 140 value 9043.830155
## iter 150 value 9040.959046
## iter 160 value 9038.815732
## iter 170 value 9032.738085
## iter 180 value 9018.050029
## iter 190 value 9001.261563
## iter 200 value 8992.690013
## iter 210 value 8981.607443
## iter 220 value 8964.099595
## iter 230 value 8954.285577
## iter 240 value 8944.588444
## iter 250 value 8938.395168
## iter 260 value 8935.190911
## iter 270 value 8935.041831
## iter 280 value 8934.401285
## iter 290 value 8933.457211
## iter 300 value 8930.927573
## iter 310 value 8925.476516
## iter 320 value 8922.785005
## iter 330 value 8918.849712
## iter 340 value 8914.825420
## iter 350 value 8911.518719
## iter 360 value 8906.944817
## iter 370 value 8899.378519
## iter 380 value 8892.137864
## iter 390 value 8882.955368
## iter 400 value 8878.300505
## iter 410 value 8873.472777
## iter 420 value 8869.363353
## iter 430 value 8865.648170
## iter 440 value 8863.799029
## iter 450 value 8861.247037
## iter 460 value 8857.641253
## iter 470 value 8855.679827
## iter 480 value 8852.820136
## iter 490 value 8850.291575
## iter 500 value 8848.493134
## final value 8848.493134
## stopped after 500 iterations
nn1.pred <- predict(nn1, newdata = testing_set, type = 'raw')
pred1 <- rep('<=50K', length(nn1.pred))
pred1[nn1.pred>=.5] <- '>50K'
# confusion matrix
tb1 <- table(pred1, testing_set$income)
tb1
##
## pred1 <=50K >50K
## <=50K 4528 705
## >50K 384 895
The prediction result of NN has an accuracy of 83.28%, and a misclassification rate of 16.72%.
Next, a classification and regression tree (CART) is grown on the training set. A CART model is obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. Although the method is generically named CART, the tree grown in this case is a classification tree.
library(rpart)
tree2 <- rpart(income ~ ., data = training_set, method = 'class', cp = 1e-3)
tree2.pred.prob <- predict(tree2, newdata = testing_set, type = 'prob')
tree2.pred <- predict(tree2, newdata = testing_set, type = 'class')
# confusion matrix
tb2 <- table(tree2.pred, testing_set$income)
tb2
##
## tree2.pred <=50K >50K
## <=50K 4533 732
## >50K 379 868
The prediction result of CART has an accuracy of 82.94%, and a misclassification rate of 17.06%.
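The cp = 1e-3 setting above simply grows a fairly deep tree. rpart also stores the cross-validated error at each complexity value, which can be used to prune the tree back; a sketch:
# inspect cross-validated error across complexity values (a sketch)
printcp(tree2)
# prune back to the cp value with the smallest cross-validated error
best_cp <- tree2$cptable[which.min(tree2$cptable[, 'xerror']), 'CP']
tree2_pruned <- prune(tree2, cp = best_cp)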
Random forest (RF) is another powerful machine learning tool. It improves predictive accuracy by generating a large number of bootstrapped trees, with the final prediction obtained by aggregating the results across all of the trees.
library(randomForest)
rf3 <- randomForest(income ~ ., data = training_set, ntree = 1000)
rf3.pred.prob <- predict(rf3, newdata = testing_set, type = 'prob')
rf3.pred <- predict(rf3, newdata = testing_set, type = 'class')
# confusion matrix
tb3 <- table(rf3.pred, testing_set$income)
tb3
##
## rf3.pred <=50K >50K
## <=50K 4555 708
## >50K 357 892
The prediction result of RF has an accuracy of 83.65%, and a misclassification rate of 16.35%.
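A useful by-product of a random forest is its variable importance measure; a quick sketch using the helpers from the randomForest package:
# which predictors the forest relies on most (a sketch)
importance(rf3) # mean decrease in Gini impurity per variable
varImpPlot(rf3)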
Last, I use a support vector machine (SVM) to predict the income level. SVM is a discriminative classifier that constructs a hyperplane in a high-dimensional space to separate the two classes.
library(kernlab)
svm4 <- ksvm(income ~ ., data = training_set)
svm4.pred.prob <- predict(svm4, newdata = testing_set, type = 'decision')
svm4.pred <- predict(svm4, newdata = testing_set, type = 'response')
# confusion matrix
tb4 <- table(svm4.pred, testing_set$income)
tb4
##
## svm4.pred <=50K >50K
## <=50K 4582 773
## >50K 330 827
The prediction result of SVM has an accuracy of 83.06%, and a misclassification rate of 16.94%.
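ksvm was called here with its default settings. For reference, the kernel and cost parameter can also be set explicitly; the sketch below spells out what I believe are kernlab's defaults (a Gaussian RBF kernel with automatically estimated sigma and C = 1):
# explicit kernel and cost settings (a sketch; these should match the defaults)
svm_alt <- ksvm(income ~ ., data = training_set,
                kernel = 'rbfdot', kpar = 'automatic', C = 1)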
Conclusion
The ROC curve is a plot of the true positive rate against the false positive rate over all threshold values, while the confusion matrix measures overall prediction accuracy at a single threshold. Since the majority of observations in the data set have income less than $50,000 a year, sensitivity and specificity contribute to the overall accuracy with different weights. The five classifiers are therefore compared using ROC curves.
# create a prediction object
pr <- prediction(prob, testing_set$income)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
# create a data frame for TP and FP rates
dd <- data.frame(FP = prf@x.values[[1]], TP = prf@y.values[[1]])
# NN
pr1 <- prediction(nn1.pred, testing_set$income)
prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
dd1 <- data.frame(FP = prf1@x.values[[1]], TP = prf1@y.values[[1]])
# CART
pr2 <- prediction(tree2.pred.prob[,2], testing_set$income)
prf2 <- performance(pr2, measure = "tpr", x.measure = "fpr")
dd2 <- data.frame(FP = prf2@x.values[[1]], TP = prf2@y.values[[1]])
# RF
pr3 <- prediction(rf3.pred.prob[,2], testing_set$income)
prf3 <- performance(pr3, measure = "tpr", x.measure = "fpr")
dd3 <- data.frame(FP = prf3@x.values[[1]], TP = prf3@y.values[[1]])
# SVM
pr4 <- prediction(svm4.pred.prob, testing_set$income)
prf4 <- performance(pr4, measure = "tpr", x.measure = "fpr")
dd4 <- data.frame(FP = prf4@x.values[[1]], TP = prf4@y.values[[1]])
# plot ROC curves for all five classifiers
g <- ggplot() +
geom_line(data = dd, aes(x = FP, y = TP, color = 'Logistic Regression')) +
geom_line(data = dd1, aes(x = FP, y = TP, color = 'Neural Networks')) +
geom_line(data = dd2, aes(x = FP, y = TP, color = 'CART')) +
geom_line(data = dd3, aes(x = FP, y = TP, color = 'Random Forest')) +
geom_line(data = dd4, aes(x = FP, y = TP, color = 'Support Vector Machine')) +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1)) +
ggtitle('ROC Curve') +
labs(x = 'False Positive Rate', y = 'True Positive Rate')
g + scale_colour_manual(name = 'Classifier', values = c('Logistic Regression'='#E69F00',
'Neural Networks'='#56B4E9', 'CART'='#009E73',
'Random Forest'='#D55E00', 'Support Vector Machine'='#0072B2'))
# AUC
auc <- rbind(performance(pr, measure = 'auc')@y.values[[1]],
performance(pr1, measure = 'auc')@y.values[[1]],
performance(pr2, measure = 'auc')@y.values[[1]],
performance(pr3, measure = 'auc')@y.values[[1]],
performance(pr4, measure = 'auc')@y.values[[1]])
rownames(auc) <- (c('Logistic Regression', 'Neural Networks', 'CART',
'Random Forest', 'Support Vector Machine'))
colnames(auc) <- 'Area Under ROC Curve'
round(auc, 4)
## Area Under ROC Curve
## Logistic Regression 0.8825
## Neural Networks 0.8871
## CART 0.8501
## Random Forest 0.8718
## Support Vector Machine 0.8608
Logistic regression has the highest area under the curve (AUC) value, while CART has the lowest. Machine learning is undeniably powerful, but it should not be considered a substitute for traditional statistical modeling.