Salmons Stores operates a national chain of women's apparel stores. Five thousand copies of an expensive four-color sales catalog have been printed, and each catalog includes a coupon that provides a $50 discount on purchases of $200 or more. Salmons would like to send the catalogs only to customers who have the highest probability of using the coupon. The DATAfile Salmons contains data from an earlier promotional campaign. For each of 1,000 Salmons customers, three variables are tracked: last year's total spending at Salmons, whether they have a Salmons store credit card, and whether they used the promotional coupon they were sent.
Create a standard partition of the data with all the tracked variables and 50% of observations in the training set, 30% in the validation set, and 20% in the test set. Use logistic regression to classify observations as a promotion-responder or not by using Spending and Card as input variables and Coupon as the output variable. Perform Variable Selection with the best subsets procedure with the number of best subsets equal to two.
(a) Evaluate the logistic regression models based on their classification error. Recommend a final model and express the model as a mathematical equation relating the output variable to the input variables.
If required, round your answers to three decimal places. For subtractive or negative numbers use a minus sign even if there is a + sign before the blank. (Example: -300)
Value = ___ + ___ * Spending + ___ * Card
(c) What is the area under the ROC curve on the test set?
If required, round your answer to three decimal places.
To achieve a sensitivity of at least 0.80, how much Class 0 error rate must be tolerated?
If required, round your answer to three decimal places.
This is only part of the data. I need the worked solution, not just the final answer; screenshots of the steps in Excel would be ideal. Thanks a lot.
The data set is too small to split into training, validation, and test sets, so I have split it into training and test sets only.
R code is also provided for your reference.
The data contain 29 observations in total:
> dim(data)
[1] 29
> library(caret)  # provides createDataPartition(), trainControl(), train()
> set.seed(100)
> sample_split = createDataPartition(data$Coupon, p=0.7,
+ list=FALSE)
> Train = data[sample_split, ]
> dim(Train )
[1] 21
> Test = data[-sample_split,]
> dim(Test )
[1] 8
About 70% of the data (21 observations) go to the training set and the remaining 30% (8 observations) to the test set.
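The question itself asks for a 50/30/20 partition, which becomes practical once all 1,000 rows are available. A minimal base-R sketch of that split, assuming simulated stand-in data (the `customers` data frame and every value in it are made up for illustration, not taken from the Salmons file):

```r
# Sketch: 50/30/20 train/validation/test partition in base R.
# `customers` is simulated stand-in data, NOT the Salmons file.
set.seed(100)
n <- 1000
customers <- data.frame(
  Spending = round(runif(n, 0, 10000)),  # hypothetical spending values
  Card     = rbinom(n, 1, 0.5),          # hypothetical store-card indicator
  Coupon   = rbinom(n, 1, 0.3)           # hypothetical coupon-use indicator
)

# Shuffle the row indices once, then cut them into three consecutive blocks.
idx     <- sample(seq_len(n))
n_train <- floor(0.5 * n)
n_valid <- floor(0.3 * n)

Train <- customers[idx[1:n_train], ]
Valid <- customers[idx[(n_train + 1):(n_train + n_valid)], ]
Test  <- customers[idx[(n_train + n_valid + 1):n], ]

# Partition sizes
c(train = nrow(Train), valid = nrow(Valid), test = nrow(Test))
```

This is a simple random split. Unlike createDataPartition(), it does not try to balance the Coupon classes across the three sets.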
The model fitted on the training data is:
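# 3-fold cross-validation, repeated twice
trainCV = trainControl(method='repeatedcv',
number=3,
repeats=2,
verboseIter=TRUE)  # full argument name; verbose= only works through partial matching
# Drop the response column by name; Train[,-'Coupon'] is not valid R
modLog = train(x = Train[, setdiff(names(Train), 'Coupon')],
y = Train$Coupon,  # must be a factor for caret to fit a logistic (binomial) glm
method='glm',
trControl = trainCV)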
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.098e-02  2.439e-01   0.086    0.932
Spending    4.717e-05  5.661e-05   0.833    0.416
Card        1.825e-01  2.238e-01   0.815    0.426
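The fitted model is:
Value of Coupon = 0.02098 + 0.00004717 * Spending + 0.1825 * Card
Both p-values are greater than 0.05, so neither variable is individually significant, although the two may still affect the response jointly. Note that the table reports t values, which is what R prints for a Gaussian (linear) fit, so this equation estimates the coupon value directly rather than the log odds that a logistic regression would model.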
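The coefficient table above shows t values, which R reports for a Gaussian fit; a binomial (logistic) fit reports z values and models the log odds of coupon use. A minimal sketch of such a fit with base R's glm(), assuming simulated data (the variable values and the "true" coefficients below are invented for illustration only):

```r
# Sketch: logistic regression of Coupon on Spending and Card with base R.
# All data here are simulated; the coefficients in true_logit are assumptions.
set.seed(100)
n        <- 1000
Spending <- round(runif(n, 0, 10000))
Card     <- rbinom(n, 1, 0.5)
true_logit <- -2 + 0.0003 * Spending + 0.8 * Card   # assumed, illustrative only
Coupon   <- rbinom(n, 1, plogis(true_logit))
sim      <- data.frame(Spending, Card, Coupon)

# family = binomial makes this a logistic regression
fit <- glm(Coupon ~ Spending + Card, data = sim, family = binomial)
b   <- coef(fit)

# The fitted equation is on the log-odds scale:
#   log odds = b0 + b1 * Spending + b2 * Card
new_customer <- data.frame(Spending = 4000, Card = 1)  # hypothetical customer
log_odds <- predict(fit, newdata = new_customer, type = "link")
prob     <- predict(fit, newdata = new_customer, type = "response")

# A binomial fit reports z values rather than t values
colnames(summary(fit)$coefficients)
```

The predicted probability is the inverse logit of the log odds, so plogis(log_odds) and prob agree.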
Results on the training set:
Accuracy : 0.7619
95% CI : (0.5283, 0.9178)
Sensitivity : 1.0000
Specificity : 0.0000
Results on the test set:
Accuracy : 0.625
95% CI : (0.2449, 0.9148)
Sensitivity : 1.000
Specificity : 0.000
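The model performs better on the training set than on the test set: training accuracy (0.7619) is higher than test accuracy (0.625). In both cases sensitivity is 1 and specificity is 0, which means every observation is assigned to the same class.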
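The reported accuracy, sensitivity, and specificity can be checked by hand from confusion-matrix counts. The counts below are not printed in the output; they are inferred from the set sizes (21 and 8) together with sensitivity 1 and specificity 0, so treat them as an assumption. The helper name `metrics` is also just illustrative:

```r
# Sketch: recompute the reported metrics from inferred confusion-matrix counts.
# tp/fn/fp/tn are counts relative to whichever class is treated as "positive".
metrics <- function(tp, fn, fp, tn) {
  c(accuracy    = (tp + tn) / (tp + fn + fp + tn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# Inferred: with sensitivity 1 and specificity 0 there are no FN and no TN,
# so accuracy = tp / n, giving tp = 16 of 21 (train) and tp = 5 of 8 (test).
train_m <- metrics(tp = 16, fn = 0, fp = 5, tn = 0)
test_m  <- metrics(tp = 5,  fn = 0, fp = 3, tn = 0)

round(train_m, 4)  # accuracy 0.7619, sensitivity 1, specificity 0
round(test_m, 4)   # accuracy 0.625,  sensitivity 1, specificity 0
```

In other words, the accuracy figures equal the share of the majority class in each set, which is what a classifier that never predicts the other class would achieve.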
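ROC curve:
The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) across classification cutoffs; the area under the curve (AUC) summarizes how well the model separates the two classes.
For both the training and test sets the specificity is 0 at the cutoff used, so these results give only a single point on the curve. Tracing the full curve requires the predicted values evaluated at a range of cutoffs.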
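For the AUC and the "sensitivity of at least 0.80" part, the usual approach is to sweep the cutoff over the predicted scores and read off the Class 0 error rate (false positive rate) at each one. A minimal base-R sketch, assuming a small hypothetical set of labels and scores (not the Salmons data); `roc_points` and `auc_rank` are illustrative helper names:

```r
# Sketch: ROC points, AUC, and the Class 0 error needed for sensitivity >= 0.80.
# `label` and `score` are hypothetical values, NOT results from the Salmons data.
roc_points <- function(score, label) {
  cutoffs <- sort(unique(score), decreasing = TRUE)
  # Predict class 1 when score >= cutoff
  tpr <- sapply(cutoffs, function(k) mean(score[label == 1] >= k))
  fpr <- sapply(cutoffs, function(k) mean(score[label == 0] >= k))
  data.frame(cutoff = cutoffs, sensitivity = tpr, class0_error = fpr)
}

# AUC via the rank-sum (Mann-Whitney) identity; rank() uses midranks for ties
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

label <- c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)

roc <- roc_points(score, label)
auc <- auc_rank(score, label)   # 0.72 for these hypothetical scores

# Smallest Class 0 error rate among cutoffs with sensitivity >= 0.80
ok   <- roc[roc$sensitivity >= 0.80, ]
best <- ok[which.min(ok$class0_error), ]
best                            # cutoff 0.5, sensitivity 0.8, class0_error 0.4
```

With the real test-set predictions in place of the hypothetical `score` and `label`, the same two quantities answer parts (c) and the sensitivity question.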
Best of luck!