Predicting Earning Potential using the Adult Dataset
Haojun Zhu
December 5, 2016
Introduction
The adult dataset is from the 1994 Census database. It is also known as the "Census Income" dataset; details can be found at the UCI Machine Learning Repository. I am interested in how well I can predict whether an individual's annual income exceeds $50,000 using the variables in this data set. The question is examined with two different approaches: traditional statistical modeling and machine learning techniques. Logistic regression is used as the statistical modeling tool, since the outcome is binary. Four machine learning techniques, namely neural networks, classification and regression trees, random forests, and support vector machines, are used to answer the same question. It turns out that machine learning techniques do not always outperform statistical modeling.
Exploratory Data Analysis and Data Cleaning
The dataset is read into R directly from the UCI Machine Learning Repository.
library(ggplot2)
library(plyr)
library(ROCR)
adult <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
                    sep = ',', fill = FALSE, strip.white = TRUE,
                    stringsAsFactors = TRUE) # read strings as factors (the pre-R 4.0 default)
colnames(adult) <- c('age', 'workclass', 'fnlwgt', 'education',
                     'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
                     'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income')
Some of the variables are not self-explanatory. The continuous variable fnlwgt represents the final weight, i.e., the number of units in the target population that the responding unit represents. The variable education_num stands for the total number of years of education, a continuous representation of the discrete variable education. The variable relationship represents the responding unit's role in the family. capital_gain and capital_loss are income from investment sources other than wage/salary.
For simplicity of this analysis, the weighting factor is discarded. The total number of years of education is already captured by the highest education level completed, and role in the family can be assessed from gender and marital status. Thus, the following three variables are deleted: education, relationship, and fnlwgt.
adult$education <- NULL
adult$fnlwgt <- NULL
adult$relationship <- NULL
The first variable, age, is continuous. As an initial step, two histograms are plotted.
# histogram of age by income group
ggplot(adult) + aes(x=as.numeric(age), group=income, fill=income) +
geom_histogram(binwidth=1, color='black')
# histogram of age by gender group
ggplot(adult) + aes(x=as.numeric(age), group=sex, fill=sex) +
geom_histogram(binwidth=1, color='black')
Notice that the majority of observations make less than $50,000 a year, and that those who do make over $50,000 annually are mostly in mid-career. Interestingly, females are underrepresented, which could possibly be caused by census bias.
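Since the two income classes are clearly unbalanced, it is worth quantifying the split before any modeling; a quick sketch (the class counts also appear in the summary table later in this report):
# proportion of each income class in the full data
round(prop.table(table(adult$income)), 3)
# roughly 76% <=50K vs. 24% >50K (24720 and 7841 observations)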
The variable workclass stands for the industry in which the responding unit is employed.
summary(adult$workclass)
## ? Federal-gov Local-gov Never-worked
## 1836 960 2093 7
## Private Self-emp-inc Self-emp-not-inc State-gov
## 22696 1116 2541 1298
## Without-pay
## 14
Notice that there are two small groups, Never-worked and Without-pay. I will combine them with the unknowns into a group called Other/Unknown. Those who work in the government are broken down into federal, state, and local levels; to facilitate the analysis, I group them into a single category called Government. Similarly, the self-employed fall into two groups, incorporated and not incorporated, which are combined into Self-Employed.
levels(adult$workclass)[1] <- 'Unknown'
# combine into Government job
adult$workclass <- gsub('^Federal-gov', 'Government', adult$workclass)
adult$workclass <- gsub('^Local-gov', 'Government', adult$workclass)
adult$workclass <- gsub('^State-gov', 'Government', adult$workclass)
# combine into Self-Employed job
adult$workclass <- gsub('^Self-emp-inc', 'Self-Employed', adult$workclass)
adult$workclass <- gsub('^Self-emp-not-inc', 'Self-Employed', adult$workclass)
# combine into Other/Unknown
adult$workclass <- gsub('^Never-worked', 'Other', adult$workclass)
adult$workclass <- gsub('^Without-pay', 'Other', adult$workclass)
adult$workclass <- gsub('^Other', 'Other/Unknown', adult$workclass)
adult$workclass <- gsub('^Unknown', 'Other/Unknown', adult$workclass)
adult$workclass <- as.factor(adult$workclass)
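As a side note, the chain of gsub() calls above can be replaced by a single lookup-table recode. The sketch below is an equivalent alternative shown for reference only; it is not executed here, since the recode has already been applied.
# equivalent lookup-table recode of the original workclass levels (a sketch)
map <- c('Federal-gov' = 'Government', 'Local-gov' = 'Government',
         'State-gov' = 'Government', 'Self-emp-inc' = 'Self-Employed',
         'Self-emp-not-inc' = 'Self-Employed', 'Never-worked' = 'Other/Unknown',
         'Without-pay' = 'Other/Unknown', 'Unknown' = 'Other/Unknown',
         'Private' = 'Private')
adult$workclass <- factor(map[as.character(adult$workclass)])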
Thus, there are now four categories for the variable workclass.
summary(adult$workclass)
## Government Other/Unknown Private Self-Employed
## 4351 1857 22696 3657
To explore the relationship between industry and income, the counts for the two income categories are calculated across all four categories of workclass, along with the in-group proportions. The results are presented as a bar plot.
# barplot of job type by income group
# get the counts by industry and income group
count <- table(adult[adult$workclass == 'Government',]$income)["<=50K"]
count <- c(count, table(adult[adult$workclass == 'Government',]$income)[">50K"])
count <- c(count, table(adult[adult$workclass == 'Other/Unknown',]$income)["<=50K"])
count <- c(count, table(adult[adult$workclass == 'Other/Unknown',]$income)[">50K"])
count <- c(count, table(adult[adult$workclass == 'Private',]$income)["<=50K"])
count <- c(count, table(adult[adult$workclass == 'Private',]$income)[">50K"])
count <- c(count, table(adult[adult$workclass == 'Self-Employed',]$income)["<=50K"])
count <- c(count, table(adult[adult$workclass == 'Self-Employed',]$income)[">50K"])
count <- as.numeric(count)
# create a dataframe
industry <- rep(levels(adult$workclass), each = 2)
income <- rep(c('<=50K', '>50K'), 4)
df <- data.frame(industry, income, count)
df
## industry income count
## 1 Government <=50K 3010
## 2 Government >50K 1341
## 3 Other/Unknown <=50K 1666
## 4 Other/Unknown >50K 191
## 5 Private <=50K 17733
## 6 Private >50K 4963
## 7 Self-Employed <=50K 2311
## 8 Self-Employed >50K 1346
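For reference, the same data frame can be built much more compactly from a contingency table; a sketch of an equivalent construction (df_alt is a hypothetical name, used to avoid clobbering df):
# equivalent, more concise construction of the counts data frame (a sketch)
tab <- table(adult$workclass, adult$income)
df_alt <- data.frame(industry = rep(rownames(tab), each = 2),
                     income = rep(colnames(tab), times = nrow(tab)),
                     count = as.numeric(t(tab)))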
# calculate the percentages
df <- ddply(df, .(industry), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df <- ddply(df, .(industry), transform, pos = (cumsum(count) - 0.5 * count))
df$label <- paste0(sprintf("%.0f", df$percent), "%")
# bar plot of counts by industry with in group proportions
ggplot(df, aes(x = industry, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income by Industry')
Those who are self-employed have the highest tendency to make more than $50,000 a year.
Since education_num is a continuous representation of education, a stacked bar plot is used to visualize the relationship between education_num and income; in-group proportions are calculated as well.
# create a dataframe
df1 <- data.frame(table(adult$income, adult$education_num))
names(df1) <- c('income', 'education_num', 'count')
df1
## income education_num count
## 1 <=50K 1 51
## 2 >50K 1 0
## 3 <=50K 2 162
## 4 >50K 2 6
## 5 <=50K 3 317
## 6 >50K 3 16
## 7 <=50K 4 606
## 8 >50K 4 40
## 9 <=50K 5 487
## 10 >50K 5 27
## 11 <=50K 6 871
## 12 >50K 6 62
## 13 <=50K 7 1115
## 14 >50K 7 60
## 15 <=50K 8 400
## 16 >50K 8 33
## 17 <=50K 9 8826
## 18 >50K 9 1675
## 19 <=50K 10 5904
## 20 >50K 10 1387
## 21 <=50K 11 1021
## 22 >50K 11 361
## 23 <=50K 12 802
## 24 >50K 12 265
## 25 <=50K 13 3134
## 26 >50K 13 2221
## 27 <=50K 14 764
## 28 >50K 14 959
## 29 <=50K 15 153
## 30 >50K 15 423
## 31 <=50K 16 107
## 32 >50K 16 306
# calculate the percentages
df1 <- ddply(df1, .(education_num), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df1 <- ddply(df1, .(education_num), transform, pos = (cumsum(count) - 0.5 * count))
df1$label <- paste0(sprintf("%.0f", df1$percent), "%")
# remove some in group percentage to avoid overlapped text
df1$label[which(df1$percent < 5)] <- NA
# bar plot of counts by years of education with in group proportions
ggplot(df1, aes(x = education_num, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income Level with Years of Education')
It is not hard to notice that the in-group proportion making more than $50,000 a year increases with years of education. Among those without any college education (at most 8 years of education), less than 10% have an annual income greater than $50,000, while among those with doctorate degrees, nearly 3 out of 4 make more than $50,000 a year!
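The same trend can be checked numerically with a one-line table of in-group proportions; a sketch:
# proportion of each income class within each number of years of education
round(prop.table(table(adult$education_num, adult$income), margin = 1), 2)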
summary(adult$occupation)
## ? Adm-clerical Armed-Forces Craft-repair
## 1843 3770 9 4099
## Exec-managerial Farming-fishing Handlers-cleaners Machine-op-inspct
## 4066 994 1370 2002
## Other-service Priv-house-serv Prof-specialty Protective-serv
## 3295 149 4140 649
## Sales Tech-support Transport-moving
## 3650 928 1597
For simplicity of the model, occupation is also blocked into several groups: Blue-Collar, Professional, Sales, Service, and White-Collar. Because of the small number of people in the armed forces, they are combined with the unknowns to form an Other/Unknown group.
levels(adult$occupation)[1] <- 'Unknown'
adult$occupation <- gsub('Adm-clerical', 'White-Collar', adult$occupation)
adult$occupation <- gsub('Craft-repair', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Exec-managerial', 'White-Collar', adult$occupation)
adult$occupation <- gsub('Farming-fishing', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Handlers-cleaners', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Machine-op-inspct', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Other-service', 'Service', adult$occupation)
adult$occupation <- gsub('Priv-house-serv', 'Service', adult$occupation)
adult$occupation <- gsub('Prof-specialty', 'Professional', adult$occupation)
adult$occupation <- gsub('Protective-serv', 'Service', adult$occupation)
adult$occupation <- gsub('Tech-support', 'Service', adult$occupation)
adult$occupation <- gsub('Transport-moving', 'Blue-Collar', adult$occupation)
adult$occupation <- gsub('Unknown', 'Other/Unknown', adult$occupation)
adult$occupation <- gsub('Armed-Forces', 'Other/Unknown', adult$occupation)
adult$occupation <- as.factor(adult$occupation)
summary(adult$occupation)
## Blue-Collar Other/Unknown Professional Sales Service
## 10062 1852 4140 3650 5021
## White-Collar
## 7836
# create a dataframe
df2 <- data.frame(table(adult$income, adult$occupation))
names(df2) <- c('income', 'occupation', 'count')
df2
## income occupation count
## 1 <=50K Blue-Collar 8362
## 2 >50K Blue-Collar 1700
## 3 <=50K Other/Unknown 1660
## 4 >50K Other/Unknown 192
## 5 <=50K Professional 2281
## 6 >50K Professional 1859
## 7 <=50K Sales 2667
## 8 >50K Sales 983
## 9 <=50K Service 4389
## 10 >50K Service 632
## 11 <=50K White-Collar 5361
## 12 >50K White-Collar 2475
# calculate the percentages
df2 <- ddply(df2, .(occupation), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df2 <- ddply(df2, .(occupation), transform, pos = (cumsum(count) - 0.5 * count))
df2$label <- paste0(sprintf("%.0f", df2$percent), "%")
# bar plot of counts by occupation with in group proportions
ggplot(df2, aes(x = occupation, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income Level with Different Occupations')
Notice that income varies greatly across occupations: nearly half of those in Professional occupations make more than $50,000 a year, while that share is only 13% for Service occupations.
marital_status is a categorical variable with 7 categories indicating the respondent's marital status. In fact, it can be blocked into fewer categories as well.
summary(adult$marital_status)
## Divorced Married-AF-spouse Married-civ-spouse
## 4443 23 14976
## Married-spouse-absent Never-married Separated
## 418 10683 1025
## Widowed
## 993
adult$marital_status <- gsub('Married-AF-spouse', 'Married', adult$marital_status)
adult$marital_status <- gsub('Married-civ-spouse', 'Married', adult$marital_status)
adult$marital_status <- gsub('Married-spouse-absent', 'Married', adult$marital_status)
adult$marital_status <- gsub('Never-married', 'Single', adult$marital_status)
adult$marital_status <- as.factor(adult$marital_status)
summary(adult$marital_status)
## Divorced Married Separated Single Widowed
## 4443 15417 1025 10683 993
df3 <- data.frame(table(adult$income, adult$marital_status))
names(df3) <- c('income', 'marital_status', 'count')
df3
## income marital_status count
## 1 <=50K Divorced 3980
## 2 >50K Divorced 463
## 3 <=50K Married 8681
## 4 >50K Married 6736
## 5 <=50K Separated 959
## 6 >50K Separated 66
## 7 <=50K Single 10192
## 8 >50K Single 491
## 9 <=50K Widowed 908
## 10 >50K Widowed 85
# calculate the percentages
df3 <- ddply(df3, .(marital_status), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df3 <- ddply(df3, .(marital_status), transform, pos = (cumsum(count) - 0.5 * count))
df3$label <- paste0(sprintf("%.0f", df3$percent), "%")
# bar plot of counts by marital status with in group proportions
ggplot(df3, aes(x = marital_status, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income Level with Marital Status')
For those who are married, nearly half make more than $50,000 a year.
capital_gain and capital_loss are two continuous variables describing income and loss from financial investments. Histograms show that the distributions of both variables are highly skewed.
# histogram of capital_gain
ggplot(adult) + aes(x=as.numeric(capital_gain), group=income, fill=income) +
geom_histogram(bins=10, color='black') + ggtitle('Histogram of Capital Gain')
# histogram of capital_loss
ggplot(adult) + aes(x=as.numeric(capital_loss), group=income, fill=income) +
geom_histogram(bins=10, color='black') + ggtitle('Histogram of Capital Loss')
# percentage of observations with no capital gain or loss
sum(adult$capital_gain == 0)/length(adult$capital_gain)
## [1] 0.9167102
sum(adult$capital_loss == 0)/length(adult$capital_loss)
## [1] 0.9533491
In fact, most observations have zero capital_gain and/or capital_loss. Similarly, native_country is highly skewed, as most observations are from the United States. Therefore, these three variables are excluded from the analysis as well.
adult$capital_gain <- NULL
adult$capital_loss <- NULL
adult$native_country <- NULL
race is a categorical variable. The bar plot shows that Whites and Asian-Pacific Islanders have high earning potential: over 25% of the observations of these two races make above $50,000 annually.
df4 <- data.frame(table(adult$income, adult$race))
names(df4) <- c('income', 'race', 'count')
df4
## income race count
## 1 <=50K Amer-Indian-Eskimo 275
## 2 >50K Amer-Indian-Eskimo 36
## 3 <=50K Asian-Pac-Islander 763
## 4 >50K Asian-Pac-Islander 276
## 5 <=50K Black 2737
## 6 >50K Black 387
## 7 <=50K Other 246
## 8 >50K Other 25
## 9 <=50K White 20699
## 10 >50K White 7117
# calculate the percentages
df4 <- ddply(df4, .(race), transform, percent = count/sum(count) * 100)
# format the labels and calculate their positions
df4 <- ddply(df4, .(race), transform, pos = (cumsum(count) - 0.5 * count))
df4$label <- paste0(sprintf("%.0f", df4$percent), "%")
# do not display percentage for low counts categories
df4$label[df4$race == 'Other'] <- NA
df4$label[df4$race == 'Amer-Indian-Eskimo'] <- NA
# bar plot of counts by race with in-group proportions
ggplot(df4, aes(x = race, y = count, fill = income)) +
geom_bar(stat = "identity") +
geom_text(aes(y = pos, label = label), size = 2) +
ggtitle('Income Level by Race')
Now, there are 9 variables left for the analysis. Below is a table of summary statistics.
summary(adult)
## age workclass education_num marital_status
## Min. :17.00 Government : 4351 Min. : 1.00 Divorced : 4443
## 1st Qu.:28.00 Other/Unknown: 1857 1st Qu.: 9.00 Married :15417
## Median :37.00 Private :22696 Median :10.00 Separated: 1025
## Mean :38.58 Self-Employed: 3657 Mean :10.08 Single :10683
## 3rd Qu.:48.00 3rd Qu.:12.00 Widowed : 993
## Max. :90.00 Max. :16.00
## occupation race sex
## Blue-Collar :10062 Amer-Indian-Eskimo: 311 Female:10771
## Other/Unknown: 1852 Asian-Pac-Islander: 1039 Male :21790
## Professional : 4140 Black : 3124
## Sales : 3650 Other : 271
## Service : 5021 White :27816
## White-Collar : 7836
## hours_per_week income
## Min. : 1.00 <=50K:24720
## 1st Qu.:40.00 >50K : 7841
## Median :40.00
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
Model Fitting – Logistic Regression
80% of the original data is used as the training set, while the remaining 20% is used as the test set.
sz <- round(.8 * dim(adult)[1]) # training set size
training_set <- adult[1:sz,]
testing_set <- adult[-(1:sz),]
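Note that this split simply takes the first 80% of rows, which is valid only if the rows are in effectively random order. A randomized split is generally safer; a minimal sketch (not used for the results below, and the seed value is arbitrary):
# randomized 80/20 split (a sketch; not the split used in this analysis)
set.seed(2016) # arbitrary seed for reproducibility
idx <- sample(seq_len(nrow(adult)), size = round(0.8 * nrow(adult)))
training_set_r <- adult[idx, ]
testing_set_r <- adult[-idx, ]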
A logistic regression is fitted using income as the response variable and all other 8 variables as predictors. Its parameter estimates and confidence intervals are reported below.
m1 <- glm(income ~ ., data = training_set, family = binomial('logit'))
summary(m1)
##
## Call:
## glm(formula = income ~ ., family = binomial("logit"), data = training_set)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6900 -0.5892 -0.2606 -0.0752 3.2844
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.304867 0.278827 -33.371 < 2e-16 ***
## age 0.028424 0.001632 17.413 < 2e-16 ***
## workclassOther/Unknown -1.226062 0.812990 -1.508 0.1315
## workclassPrivate 0.065641 0.053401 1.229 0.2190
## workclassSelf-Employed -0.143134 0.068836 -2.079 0.0376 *
## education_num 0.316604 0.009367 33.799 < 2e-16 ***
## marital_statusMarried 2.009367 0.067075 29.957 < 2e-16 ***
## marital_statusSeparated -0.118183 0.162774 -0.726 0.4678
## marital_statusSingle -0.456110 0.082510 -5.528 3.24e-08 ***
## marital_statusWidowed -0.049819 0.153751 -0.324 0.7459
## occupationOther/Unknown 0.890005 0.811907 1.096 0.2730
## occupationProfessional 0.758835 0.067266 11.281 < 2e-16 ***
## occupationSales 0.493151 0.063295 7.791 6.63e-15 ***
## occupationService 0.156311 0.066728 2.343 0.0192 *
## occupationWhite-Collar 0.775103 0.052391 14.795 < 2e-16 ***
## raceAsian-Pac-Islander 0.127056 0.248409 0.511 0.6090
## raceBlack 0.307972 0.237403 1.297 0.1945
## raceOther -0.466651 0.370702 -1.259 0.2081
## raceWhite 0.489306 0.227223 2.153 0.0313 *
## sexMale 0.387568 0.051550 7.518 5.55e-14 ***
## hours_per_week 0.030626 0.001629 18.802 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28685 on 26048 degrees of freedom
## Residual deviance: 19163 on 26028 degrees of freedom
## AIC: 19205
##
## Number of Fisher Scoring iterations: 6
confint(m1)
## 2.5 % 97.5 %
## (Intercept) -9.86450760 -8.769848897
## age 0.02522800 0.031627206
## workclassOther/Unknown -2.81095155 0.353340625
## workclassPrivate -0.03879528 0.170549950
## workclassSelf-Employed -0.27807457 -0.008224174
## education_num 0.29832991 0.335050878
## marital_statusMarried 1.87910604 2.142085777
## marital_statusSeparated -0.44670006 0.192468641
## marital_statusSingle -0.61768425 -0.294168625
## marital_statusWidowed -0.35840810 0.245101683
## occupationOther/Unknown -0.69042951 2.469666670
## occupationProfessional 0.62703557 0.890730836
## occupationSales 0.36894319 0.617075711
## occupationService 0.02504151 0.286644006
## occupationWhite-Collar 0.67249541 0.877875622
## raceAsian-Pac-Islander -0.34768819 0.628341149
## raceBlack -0.14373152 0.789261324
## raceOther -1.20916579 0.249803089
## raceWhite 0.05890259 0.952096611
## sexMale 0.28666433 0.488756993
## hours_per_week 0.02744120 0.033826994
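Since the coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are easier to interpret; a small sketch:
# exponentiate the coefficients to obtain odds ratios (a sketch)
round(exp(coef(m1)), 2)
# e.g. exp(2.009) is roughly 7.5: holding the other predictors fixed, married
# respondents have about 7.5 times the odds of earning more than $50,000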
This logistic regression model appears to fit the data well. To explore the possibility of a more parsimonious model, forward and backward stepwise selection using AIC are both performed.
m_full <- m1 # full model is the model just fitted
m_null <- glm(income ~ 1, data = training_set, family = binomial('logit'))
# backward selection
step(m_full, trace = F, scope = list(lower=formula(m_null), upper=formula(m_full)),
direction = 'backward')
##
## Call: glm(formula = income ~ age + workclass + education_num + marital_status +
## occupation + race + sex + hours_per_week, family = binomial("logit"),
## data = training_set)
##
## Coefficients:
## (Intercept) age workclassOther/Unknown
## -9.30487 0.02842 -1.22606
## workclassPrivate workclassSelf-Employed education_num
## 0.06564 -0.14313 0.31660
## marital_statusMarried marital_statusSeparated marital_statusSingle
## 2.00937 -0.11818 -0.45611
## marital_statusWidowed occupationOther/Unknown occupationProfessional
## -0.04982 0.89000 0.75883
## occupationSales occupationService occupationWhite-Collar
## 0.49315 0.15631 0.77510
## raceAsian-Pac-Islander raceBlack raceOther
## 0.12706 0.30797 -0.46665
## raceWhite sexMale hours_per_week
## 0.48931 0.38757 0.03063
##
## Degrees of Freedom: 26048 Total (i.e. Null); 26028 Residual
## Null Deviance: 28690
## Residual Deviance: 19160 AIC: 19210
# forward selection
step(m_null, trace = F, scope = list(lower=formula(m_null), upper=formula(m_full)),
direction = 'forward')
##
## Call: glm(formula = income ~ marital_status + education_num + hours_per_week +
## age + occupation + sex + race + workclass, family = binomial("logit"),
## data = training_set)
##
## Coefficients:
## (Intercept) marital_statusMarried marital_statusSeparated
## -9.30487 2.00937 -0.11818
## marital_statusSingle marital_statusWidowed education_num
## -0.45611 -0.04982 0.31660
## hours_per_week age occupationOther/Unknown
## 0.03063 0.02842 0.89000
## occupationProfessional occupationSales occupationService
## 0.75883 0.49315 0.15631
## occupationWhite-Collar sexMale raceAsian-Pac-Islander
## 0.77510 0.38757 0.12706
## raceBlack raceOther raceWhite
## 0.30797 -0.46665 0.48931
## workclassOther/Unknown workclassPrivate workclassSelf-Employed
## -1.22606 0.06564 -0.14313
##
## Degrees of Freedom: 26048 Total (i.e. Null); 26028 Residual
## Null Deviance: 28690
## Residual Deviance: 19160 AIC: 19210
Both the forward and backward stepwise selection algorithms return the same model as the initial fit. Thus, the original model is retained, and model diagnostics follow. As expected for a good model, most of the deviance residuals fall within ±3:
# create a data frame to store information regarding deviance residuals
index <- 1:dim(training_set)[1]
dev_resid <- residuals(m1)
income <- training_set$income
dff <- data.frame(index, dev_resid, income)
ggplot(dff, aes(x = index, y = dev_resid, color = income)) +
  geom_point() +
  geom_hline(yintercept = 3, linetype = 'dashed', color = 'blue') +
  geom_hline(yintercept = -3, linetype = 'dashed', color = 'blue') +
  ggtitle('Plot of Deviance Residuals')
Logistic regression models the probability that an individual makes more than $50,000 annually. In other words, a fitted probability closer to 1 indicates a higher chance of making over $50,000, while a value closer to 0 indicates a higher chance of making less. Thus, a threshold of 0.5 is used to decide whether an individual is predicted to make more than $50,000 annually. A confusion matrix is presented to evaluate how well the model predicts income.
prob <- predict(m1, testing_set, type = 'response')
pred <- rep('<=50K', length(prob))
pred[prob>=.5] <- '>50K'
# confusion matrix
tb <- table(pred, testing_set$income)
tb
##
## pred <=50K >50K
## <=50K 4533 745
## >50K 379 855
The prediction result has an accuracy of 82.74%, and a misclassification rate of 17.26%.
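Both rates follow directly from the confusion matrix; a small sketch:
# overall accuracy and misclassification rate from the confusion matrix
accuracy <- sum(diag(tb)) / sum(tb) # (4533 + 855) / 6512, about 0.8274
error <- 1 - accuracy # about 0.1726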
Model Fitting – Machine Learning Techniques
Several machine learning techniques are employed to predict whether an observation makes more than $50,000 a year. As with logistic regression, the original data set is split into a training set and a testing set; models are trained on the training set and validated on the testing set.
First, a neural network (NN) with one hidden layer is trained. Although there are only 9 input variables, many of them are categorical and expand into several dummy variables, so the number of hidden nodes is set to 40. Least squares is used to optimize the objective function, and the maximum number of iterations is set to 500.
library(nnet)
nn1 <- nnet(income ~ ., data = training_set, size = 40, maxit = 500)
## # weights: 881
## initial value 42694.598693
## iter 10 value 13952.527059
## iter 20 value 11964.837047
## iter 30 value 11113.810122
## iter 40 value 10217.851198
## iter 50 value 9845.432703
## iter 60 value 9633.081097
## iter 70 value 9480.659095
## iter 80 value 9349.965952
## iter 90 value 9233.513642
## iter 100 value 9151.360640
## iter 110 value 9110.961857
## iter 120 value 9082.755974
## iter 130 value 9059.320777
## iter 140 value 9043.830155
## iter 150 value 9040.959046
## iter 160 value 9038.815732
## iter 170 value 9032.738085
## iter 180 value 9018.050029
## iter 190 value 9001.261563
## iter 200 value 8992.690013
## iter 210 value 8981.607443
## iter 220 value 8964.099595
## iter 230 value 8954.285577
## iter 240 value 8944.588444
## iter 250 value 8938.395168
## iter 260 value 8935.190911
## iter 270 value 8935.041831
## iter 280 value 8934.401285
## iter 290 value 8933.457211
## iter 300 value 8930.927573
## iter 310 value 8925.476516
## iter 320 value 8922.785005
## iter 330 value 8918.849712
## iter 340 value 8914.825420
## iter 350 value 8911.518719
## iter 360 value 8906.944817
## iter 370 value 8899.378519
## iter 380 value 8892.137864
## iter 390 value 8882.955368
## iter 400 value 8878.300505
## iter 410 value 8873.472777
## iter 420 value 8869.363353
## iter 430 value 8865.648170
## iter 440 value 8863.799029
## iter 450 value 8861.247037
## iter 460 value 8857.641253
## iter 470 value 8855.679827
## iter 480 value 8852.820136
## iter 490 value 8850.291575
## iter 500 value 8848.493134
## final value 8848.493134
## stopped after 500 iterations
nn1.pred <- predict(nn1, newdata = testing_set, type = 'raw')
pred1 <- rep('<=50K', length(nn1.pred))
pred1[nn1.pred>=.5] <- '>50K'
# confusion matrix
tb1 <- table(pred1, testing_set$income)
tb1
##
## pred1 <=50K >50K
## <=50K 4528 705
## >50K 384 895
The prediction result of NN has an accuracy of 83.28%, and a misclassification rate of 16.72%.
Next, a classification and regression tree (CART) is grown on the training set. A CART model is obtained by recursively partitioning the data space and fitting a simple prediction model within each partition. Although the method is generically named CART, the tree grown in this case is a classification tree.
library(rpart)
tree2 <- rpart(income ~ ., data = training_set, method = 'class', cp = 1e-3)
tree2.pred.prob <- predict(tree2, newdata = testing_set, type = 'prob')
tree2.pred <- predict(tree2, newdata = testing_set, type = 'class')
# confusion matrix
tb2 <- table(tree2.pred, testing_set$income)
tb2
##
## tree2.pred <=50K >50K
## <=50K 4533 732
## >50K 379 868
The prediction result of CART has an accuracy of 82.94%, and a misclassification rate of 17.06%.
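The cp = 1e-3 setting above simply grows a fairly deep tree. rpart also stores the cross-validated error at each complexity value, which can be used to prune the tree back; a sketch:
# inspect cross-validated error across complexity values (a sketch)
printcp(tree2)
# prune back to the cp value with the smallest cross-validated error
best_cp <- tree2$cptable[which.min(tree2$cptable[, 'xerror']), 'CP']
tree2_pruned <- prune(tree2, cp = best_cp)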
Random forest (RF) is another powerful machine learning tool. It improves predictive accuracy by generating a large number of bootstrapped trees, with the final prediction obtained by aggregating the results across all of the trees.
library(randomForest)
rf3 <- randomForest(income ~ ., data = training_set, ntree = 1000)
rf3.pred.prob <- predict(rf3, newdata = testing_set, type = 'prob')
rf3.pred <- predict(rf3, newdata = testing_set, type = 'class')
# confusion matrix
tb3 <- table(rf3.pred, testing_set$income)
tb3
##
## rf3.pred <=50K >50K
## <=50K 4555 708
## >50K 357 892
The prediction result of RF has an accuracy of 83.65%, and a misclassification rate of 16.35%.
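A useful by-product of a random forest is its variable importance measure; a quick sketch using the helpers from the randomForest package:
# which predictors the forest relies on most (a sketch)
importance(rf3) # mean decrease in Gini impurity per variable
varImpPlot(rf3)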
Last, I use a support vector machine (SVM) to predict the income level. SVM is a discriminative classifier that constructs a hyperplane in a high-dimensional space to separate the two classes.
library(kernlab)
svm4 <- ksvm(income ~ ., data = training_set)
svm4.pred.prob <- predict(svm4, newdata = testing_set, type = 'decision')
svm4.pred <- predict(svm4, newdata = testing_set, type = 'response')
# confusion matrix
tb4 <- table(svm4.pred, testing_set$income)
tb4
##
## svm4.pred <=50K >50K
## <=50K 4582 773
## >50K 330 827
The prediction result of SVM has an accuracy of 83.06%, and a misclassification rate of 16.94%.
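ksvm was called here with its default settings. For reference, the kernel and cost parameter can also be set explicitly; the sketch below spells out what I believe are kernlab's defaults (a Gaussian RBF kernel with automatically estimated sigma and C = 1):
# explicit kernel and cost settings (a sketch; these should match the defaults)
svm_alt <- ksvm(income ~ ., data = training_set,
                kernel = 'rbfdot', kpar = 'automatic', C = 1)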
Conclusion
The ROC curve is a plot of the true positive rate against the false positive rate over all threshold values, while the confusion matrix measures overall prediction accuracy at a single threshold. Since the majority of observations in the data set have income less than $50,000 a year, sensitivity and specificity contribute to the overall accuracy with different weights. The five classifiers are therefore compared using ROC curves.
# create a prediction object
pr <- prediction(prob, testing_set$income)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
# create a data frame for TP and FP rates
dd <- data.frame(FP = prf@x.values[[1]], TP = prf@y.values[[1]])
# NN
pr1 <- prediction(nn1.pred, testing_set$income)
prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
dd1 <- data.frame(FP = prf1@x.values[[1]], TP = prf1@y.values[[1]])
# CART
pr2 <- prediction(tree2.pred.prob[,2], testing_set$income)
prf2 <- performance(pr2, measure = "tpr", x.measure = "fpr")
dd2 <- data.frame(FP = prf2@x.values[[1]], TP = prf2@y.values[[1]])
# RF
pr3 <- prediction(rf3.pred.prob[,2], testing_set$income)
prf3 <- performance(pr3, measure = "tpr", x.measure = "fpr")
dd3 <- data.frame(FP = prf3@x.values[[1]], TP = prf3@y.values[[1]])
# SVM
pr4 <- prediction(svm4.pred.prob, testing_set$income)
prf4 <- performance(pr4, measure = "tpr", x.measure = "fpr")
dd4 <- data.frame(FP = prf4@x.values[[1]], TP = prf4@y.values[[1]])
# plot ROC curves for all five classifiers
g <- ggplot() +
geom_line(data = dd, aes(x = FP, y = TP, color = 'Logistic Regression')) +
geom_line(data = dd1, aes(x = FP, y = TP, color = 'Neural Networks')) +
geom_line(data = dd2, aes(x = FP, y = TP, color = 'CART')) +
geom_line(data = dd3, aes(x = FP, y = TP, color = 'Random Forest')) +
geom_line(data = dd4, aes(x = FP, y = TP, color = 'Support Vector Machine')) +
geom_segment(aes(x = 0, xend = 1, y = 0, yend = 1)) +
ggtitle('ROC Curve') +
labs(x = 'False Positive Rate', y = 'True Positive Rate')
g + scale_colour_manual(name = 'Classifier', values = c('Logistic Regression'='#E69F00',
'Neural Networks'='#56B4E9', 'CART'='#009E73',
'Random Forest'='#D55E00', 'Support Vector Machine'='#0072B2'))
# AUC
auc <- rbind(performance(pr, measure = 'auc')@y.values[[1]],
performance(pr1, measure = 'auc')@y.values[[1]],
performance(pr2, measure = 'auc')@y.values[[1]],
performance(pr3, measure = 'auc')@y.values[[1]],
performance(pr4, measure = 'auc')@y.values[[1]])
rownames(auc) <- (c('Logistic Regression', 'Neural Networks', 'CART',
'Random Forest', 'Support Vector Machine'))
colnames(auc) <- 'Area Under ROC Curve'
round(auc, 4)
## Area Under ROC Curve
## Logistic Regression 0.8825
## Neural Networks 0.8871
## CART 0.8501
## Random Forest 0.8718
## Support Vector Machine 0.8608
Logistic regression has the highest area under the curve (AUC) value, while CART has the lowest. Machine learning is undeniably powerful, but it should not be considered a substitute for traditional statistical modeling.