Question

Would a logistic regression analysis reveal the keys to improving customer retention?

Examine the enclosed results of logistic regression of Retained on Esent and on Eclickrate, and then discuss the results and make recommendations on how to improve customer retention.

Expert Solution

Abstract

Logistic regression is used to obtain odds ratios in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, with the exception that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is that confounding effects are avoided by analyzing the association of all variables together. In this article, we explain the logistic regression procedure using examples to make it as simple as possible. After the technique is defined, the basic interpretation of the results is highlighted and then some special issues are discussed.

Introduction

One of the previous topics in Lessons in biostatistics presented the calculation, usage and interpretation of the odds ratio statistic and clearly demonstrated the simplicity of the odds ratio in clinical practice (1). The example used then was from a fictional study in which the effects of two drug treatments for Staphylococcus aureus (SA) endocarditis were compared. The original data are reproduced in Table 1.

Table 1.

Results from fictional endocarditis treatment study by McHugh (1).

	Standard treatment	New treatment	Totals
Died	152	17	169
Survived	248	103	351

Totals	400	120	520

Following (1), the odds ratio (OR) of death for patients using the standard treatment can be calculated as (152 × 103) / (248 × 17) = 3.71, meaning that the odds of death for patients on the standard treatment are 3.71 times the odds for patients under the new treatment. For more detailed information about basic OR interpretation, please see McHugh (1). However, a more complex problem can arise when, instead of the association between one explanatory variable and one response variable (e.g., type of treatment and death), we are interested in the joint relationship between two or more explanatory variables and the response variable. Let us suppose we are now interested in the relationship between age and death in the same group of SA endocarditis patients. Table 2 presents the fictional new data. Remember that these are not real data and that the relationships described here are not meant to reflect any real associations.
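The odds ratio calculation above can be sketched in a few lines of Python. The function name `odds_ratio` and the a/b/c/d cell labels are chosen here for illustration; the counts are the cells of Table 1.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table laid out as

                  exposed   unexposed
        event        a          b
        no event     c          d
    """
    return (a * d) / (b * c)


# Table 1: rows are died / survived, columns are standard / new treatment.
or_treatment = odds_ratio(a=152, b=17, c=248, d=103)
print(f"OR of death, standard vs. new treatment: {or_treatment:.2f}")
```

The same function applies to any 2 × 2 table as long as the cells are passed in the row and column order shown in the docstring.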

Table 2.

Results from fictional endocarditis treatment study by McHugh looking at age (1).

	Younger (30–45 yrs)	Older (46–60 yrs)	Totals
Died	120	49	169
Survived	217	134	351

Totals	337	183	520

Again, we can calculate an OR as (120 × 134) / (217 × 49) = 1.51, meaning that the odds of death for a younger individual (between 30 and 45 years old) are about 1.5 times the odds of death for an older individual (between 46 and 60 years old). Now we have two variables related to the event of interest (death) in individuals with SA endocarditis. But in the presence of more than one explanatory variable, separately testing each independent variable against the response variable introduces bias into the research (2). Performing multiple tests on the same data inflates the alpha, thus increasing Type I error rates, while missing possible confounding effects. So, how do we know whether the treatment effect on the endocarditis outcome is being masked by the effect of age? Let us take a look at the treatment effect stratified by age (Table 3).

Table 3.

Effect of treatment on endocarditis stratified by age.

Older (46–60 yrs)		Standard treatment	New treatment	Totals	OR

	Died	43	6	49	2.44
	Survived	100	34	134

	Totals	143	40	183

Younger (30–45 yrs)		Standard treatment	New treatment	Totals	OR

	Died	109	11	120	4.62
	Survived	148	69	217

	Totals	257	80	337

As Table 3 illustrates, the impact of treatment is higher in younger individuals, because the OR in the younger patients subgroup is higher than in the older patients subgroup. Therefore, it would be incorrect to simply look at the treatment results without considering the impact of age. The simplest way to solve this problem is to calculate some form of “weighted” OR (i.e., the Mantel-Haenszel OR (3)), using Equation 1 below, where n_i is the sample size of age class i, and a, b, c and d are the table cells, as presented by McHugh (1).

$$OR_{MH} = \frac{\sum_i (a_i d_i / n_i)}{\sum_i (b_i c_i / n_i)} \qquad (1)$$

With the counts in Table 3, this gives (43 × 34 / 183 + 109 × 69 / 337) / (6 × 100 / 183 + 11 × 148 / 337) = 3.74. It means that the weighted odds of death associated with the standard treatment are 3.74 times the odds of death of individuals taking the new treatment. However, as the number of explanatory variables increases, the complexity of these calculations can become nearly impossible to handle. Additionally, the Mantel-Haenszel OR, like the simple OR, admits only categorical explanatory variables. For instance, to use a continuous variable like age we need to set a breaking point to categorize it (in our case, arbitrarily set at 45 years old) and cannot use the actual age. Determining breaking points is not always easy! But there is a better approach: using logistic regression instead.
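As a rough check of the weighted OR, the Mantel-Haenszel estimate can be computed directly from the two age strata of Table 3. This is a minimal sketch: the function name and the (a, b, c, d) tuple layout are assumptions made here, with a = died on standard, b = died on new, c = survived on standard, d = survived on new.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of 2x2 tables.

    Each stratum is a tuple (a, b, c, d); its size n is a + b + c + d.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Table 3 counts per age stratum: (died std, died new, survived std, survived new)
strata = {
    "older (46-60 yrs)": (43, 6, 100, 34),
    "younger (30-45 yrs)": (109, 11, 148, 69),
}

# Stratum-specific odds ratios, as shown in the OR column of Table 3.
stratum_or = {name: (a * d) / (b * c) for name, (a, b, c, d) in strata.items()}
or_mh = mantel_haenszel_or(list(strata.values()))

for name, value in stratum_or.items():
    print(f"{name}: OR = {value:.2f}")
print(f"Mantel-Haenszel OR = {or_mh:.2f}")
```

The pooled value falls between the two stratum-specific ORs because each stratum is weighted by its size.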

Definition

Logistic regression works very similarly to linear regression, but with a binomial response variable. Its greatest advantage over the Mantel-Haenszel OR is that you can use continuous explanatory variables and that it is easier to handle more than two explanatory variables simultaneously. Although apparently trivial, this last characteristic is essential when we are interested in the impact of various explanatory variables on the response variable. If we look at multiple explanatory variables independently, we ignore the covariance among variables and are subject to confounding effects, as was demonstrated in the example above, where the effect of treatment on death probability was partially hidden by the effect of age.

A logistic regression models the chance (odds) of an outcome based on individual characteristics. Because the odds are a ratio, what is actually modeled is the logarithm of the odds, given by:

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$$

where π indicates the probability of an event (e.g., death in the previous example), β₀ is the coefficient associated with the reference group, and β_i are the regression coefficients associated with the x_i explanatory variables. At this point, an important concept must be highlighted. The reference group, represented by β₀, consists of those individuals presenting the reference level of each and every variable x_1...m. To illustrate, in our previous example these are the older individuals who received the new treatment. Later, we will discuss how to set the reference level.
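For the two binary variables used in this example (age group and treatment), the log-odds model can be written out and inverted to recover the probability itself. The indicator names x_age and x_trt are labels chosen here for illustration, not notation from the original article:

```latex
\begin{align*}
\log\left(\frac{\pi}{1-\pi}\right) &= \beta_0 + \beta_1 x_{\text{age}} + \beta_2 x_{\text{trt}} \\
\frac{\pi}{1-\pi} &= \exp\left(\beta_0 + \beta_1 x_{\text{age}} + \beta_2 x_{\text{trt}}\right) \\
\pi &= \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_{\text{age}} + \beta_2 x_{\text{trt}})\right)}
\end{align*}
```

Setting both indicators to 0 leaves only β₀, which is why exp(β₀) is the odds of the event in the reference group.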

Logistic regression step-by-step

Let us apply a logistic regression to the example described before to see how it works and how to interpret the results. Let us build a logistic regression model that includes all explanatory variables (age and treatment). This kind of model, with all variables included, is called a “full model” or a “saturated model” and is the best starting option if you have a good sample size and a small number of variables to include. Issues about sample size, variable inclusion and selection, and others will be discussed in the next section; for now, we will keep it as simple as possible.

The result of our model can be seen below, in Table 4.

Table 4.

Results from multivariate logistic regression model containing all explanatory variables (full model).

Term	β estimate	Standard error	P value
Intercept (β₀)	−2.121	0.303	<0.001
Age: Younger (β₁)	0.454	0.207	0.028
Treatment: Standard (β₂)	1.333	0.283	<0.001
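To see where estimates like those in Table 4 come from, the model can be fitted to the grouped counts of Table 3 by maximum likelihood. The sketch below uses plain Newton-Raphson iterations in standard-library Python; this is one common fitting method chosen for illustration, not necessarily what the original software did, and all function and variable names are assumptions.

```python
import math

# Table 3 as grouped data: (younger, standard, deaths, total).
# The reference group (0, 0) is older patients on the new treatment.
GROUPS = [(0, 0, 6, 40), (1, 0, 11, 80), (0, 1, 43, 143), (1, 1, 109, 257)]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def score_and_information(beta, groups):
    """Gradient of the log-likelihood and the Fisher information matrix."""
    score = [0.0] * 3
    info = [[0.0] * 3 for _ in range(3)]
    for younger, standard, deaths, total in groups:
        x = [1.0, younger, standard]
        eta = sum(b * xi for b, xi in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-eta))
        for i in range(3):
            score[i] += x[i] * (deaths - total * p)
            for j in range(3):
                info[i][j] += total * p * (1.0 - p) * x[i] * x[j]
    return score, info


def fit_logistic(groups, n_iter=25):
    """Return (coefficients, standard errors) after Newton-Raphson updates."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        score, info = score_and_information(beta, groups)
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    _, info = score_and_information(beta, groups)
    # Standard errors are square roots of the diagonal of the inverse information.
    se = []
    for i in range(3):
        unit = [1.0 if j == i else 0.0 for j in range(3)]
        se.append(math.sqrt(solve(info, unit)[i]))
    return beta, se


beta, se = fit_logistic(GROUPS)
for name, b, s in zip(["Intercept", "Age: Younger", "Treatment: Standard"], beta, se):
    print(f"{name:<20} estimate = {b:.3f}  SE = {s:.3f}")
```

In practice a statistics package would be used instead of hand-written iterations; the point here is only that the coefficients are the values that make the fitted death counts agree with the observed margins.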

Now all we have to do is interpret this output, beginning with the intercept term, which corresponds to our β₀. Taking the exponential of β₀ gives the mean odds of death of individuals in the reference category. So, exp(β₀) = exp(−2.121) = 0.12 is the odds of death among those individuals who are older and received the new treatment. The interpretation differs slightly for the remaining coefficients. Individuals who also received the new treatment but are younger have mean odds of death exp(β₁) = exp(0.454) = 1.58 times the odds of reference individuals. Similarly, older individuals who received the standard treatment have mean odds of death exp(β₂) = exp(1.333) = 3.79 times the odds of reference individuals. But what if individuals are younger and received the standard treatment? Then we have to calculate exp(β₁+β₂) = exp(1.787) = 5.97 times the mean odds of reference individuals.
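The interpretation steps above amount to exponentiating sums of coefficients. A minimal sketch, using the rounded estimates from Table 4 (the dictionary keys and function names are chosen here for illustration):

```python
import math

# Rounded coefficient estimates from Table 4.
BETA = {"intercept": -2.121, "younger": 0.454, "standard": 1.333}


def odds_of_death(younger, standard):
    """Model odds of death; arguments are 0/1 indicators."""
    log_odds = (BETA["intercept"]
                + BETA["younger"] * younger
                + BETA["standard"] * standard)
    return math.exp(log_odds)


def probability_of_death(younger, standard):
    """Convert the model odds into a probability."""
    odds = odds_of_death(younger, standard)
    return odds / (1.0 + odds)


reference_odds = odds_of_death(younger=0, standard=0)
for younger in (0, 1):
    for standard in (0, 1):
        odds = odds_of_death(younger, standard)
        label = f"{'younger' if younger else 'older'}, {'standard' if standard else 'new'}"
        print(f"{label:<18} odds = {odds:.2f}  "
              f"ratio to reference = {odds / reference_odds:.2f}  "
              f"probability = {probability_of_death(younger, standard):.2f}")
```

Because these inputs are rounded to three decimals, the printed ratios can differ in the last digit from values computed with unrounded estimates.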

These are the basics of logistic regression interpretation. However, some issues appear during the analysis and solutions are not always readily available.

Conclusion

Logistic regression is a powerful tool, especially in epidemiologic studies, allowing multiple explanatory variables to be analyzed simultaneously while reducing the effect of confounding factors. However, researchers must pay attention to model building and avoid simply feeding software with raw data and moving straight to the results. Some difficult decisions on model building will depend entirely on the researcher's expertise in the field.

Colby Messinger answered 2 years ago
