Would a logistic regression analysis reveal the keys to improving customer retention?
Examine the enclosed results of logistic regression of Retained on Esent and on Eclickrate, and then discuss the results and make recommendations on how to improve customer retention.
Abstract
Logistic regression is used to obtain odds ratios in the presence of more than one explanatory variable. The procedure is quite similar to multiple linear regression, except that the response variable is binomial. The result is the impact of each variable on the odds ratio of the observed event of interest. The main advantage is that analyzing the association of all variables together avoids confounding effects. In this article, we explain the logistic regression procedure using examples to make it as simple as possible. After defining the technique, we highlight the basic interpretation of the results and then discuss some special issues.
Introduction
One of the previous topics in Lessons in biostatistics presented the calculation, usage and interpretation of the odds ratio statistic and demonstrated the simplicity of the odds ratio in clinical practice (1). The example used then was from a fictional study comparing the effects of two drug treatments for Staphylococcus aureus (SA) endocarditis. The original data are reproduced in Table 1.
Table 1.
Results from fictional endocarditis treatment study by McHugh (1).
Outcome | Standard treatment | New treatment | Totals |
---|---|---|---|
Died | 152 | 17 | 169 |
Survived | 248 | 103 | 351 |
Totals | 400 | 120 | 520 |
Following (1), the odds ratio (OR) of death for patients on standard treatment can be calculated as (152 × 103) / (248 × 17) = 3.71, meaning that the odds of death for patients on standard treatment are 3.71 times the odds for patients on the new treatment. For more detailed information about basic OR interpretation, please see McHugh (1). However, a more complex problem can arise when, instead of the association between one explanatory variable and one response variable (e.g., type of treatment and death), we are interested in the joint relationship between two or more explanatory variables and the response variable. Let us suppose we are now interested in the relationship between age and death in the same group of SA endocarditis patients. Table 2 presents the fictional new data. Remember that these are not real data and that the relationships described here are not meant to reflect any real associations.
Table 2.
Results from fictional endocarditis treatment study by McHugh looking at age (1).
Outcome | Younger (30–45 yrs) | Older (46–60 yrs) | Totals |
---|---|---|---|
Died | 120 | 49 | 169 |
Survived | 217 | 134 | 351 |
Totals | 337 | 183 | 520 |
Again, we can calculate an OR as (120 × 134) / (217 × 49) = 1.51, meaning that the odds of death for a younger individual (between 30 and 45 years old) are about 1.5 times the odds of death for an older individual (between 46 and 60 years old). We now have two variables related to the event of interest (death) in individuals with SA endocarditis. But in the presence of more than one explanatory variable, separately testing each independent variable against the response variable introduces bias into the research (2). Performing multiple tests on the same data inflates the alpha, thus increasing the Type I error rate, while missing possible confounding effects. So, how do we know whether the treatment effect on the endocarditis outcome is being masked by the effect of age? Let us take a look at the treatment effect stratified by age (Table 3).
Table 3.
Effect of treatment on endocarditis stratified by age.
Age group | Outcome | Standard treatment | New treatment | Totals | OR |
---|---|---|---|---|---|
Older (46–60 yrs) | Died | 43 | 6 | 49 | 2.44 |
Older (46–60 yrs) | Survived | 100 | 34 | 134 | |
Older (46–60 yrs) | Totals | 143 | 40 | 183 | |
Younger (30–45 yrs) | Died | 109 | 11 | 120 | 4.62 |
Younger (30–45 yrs) | Survived | 148 | 69 | 217 | |
Younger (30–45 yrs) | Totals | 257 | 80 | 337 | |
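The odds ratios quoted so far (for treatment, for age, and for treatment within each age stratum) all come from the same cross-product calculation. The short Python sketch below recomputes them from the cell counts in Tables 1 to 3. The function name `odds_ratio` and its argument order are illustrative choices, not something defined by McHugh (1).

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio (a * d) / (b * c) for a 2 x 2 table.

    a, b: number of events (deaths) in the first and second column
    c, d: number of non-events (survivors) in the first and second column
    """
    return (a * d) / (b * c)


# Table 1: standard vs. new treatment
print(round(odds_ratio(152, 17, 248, 103), 2))  # 3.71
# Table 2: younger vs. older patients
print(round(odds_ratio(120, 49, 217, 134), 2))  # 1.51
# Table 3: standard vs. new treatment within each age stratum
print(round(odds_ratio(43, 6, 100, 34), 2))     # 2.44 (older)
print(round(odds_ratio(109, 11, 148, 69), 2))   # 4.62 (younger)
```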
As Table 3 illustrates, the impact of treatment is higher in younger individuals, because the OR in the younger subgroup is higher than in the older subgroup. Therefore, it would be incorrect to simply look at the treatment results without considering the impact of age. The simplest way to solve this problem is to calculate some form of “weighted” OR (i.e., Mantel-Haenszel OR (3)), using Equation 1 below, where n_i is the sample size of age class i, and a_i, b_i, c_i and d_i are the cells of the 2 × 2 table for that age class, as presented by McHugh (1). Here a_i and d_i are the main-diagonal cells (died on standard treatment, survived on new treatment) and b_i and c_i are the off-diagonal cells.

$$\mathrm{OR}_{MH} = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i} = 3.74 \qquad (1)$$
It means that the weighted odds of death associated with standard treatment are 3.74 times the odds of death for individuals taking the new treatment. However, as the number of explanatory variables increases, these calculations can become nearly impossible to handle. Additionally, the Mantel-Haenszel OR, like the simple OR, admits only categorical explanatory variables. For instance, to use a continuous variable like age we need to set a cut-off point to categorize it (in our case, arbitrarily set at 45 years old) and cannot use the actual age. Determining cut-off points is not always easy! But there is a better approach: using logistic regression instead.
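The weighted OR of 3.74 can be checked from the two age strata in Table 3. The sketch below uses the standard Mantel-Haenszel estimator (sum of a·d/n over strata divided by sum of b·c/n over strata). The names `STRATA` and `mantel_haenszel_or` are hypothetical and used only for this illustration.

```python
# Age strata from Table 3, each given as (a, b, c, d):
# a = died on standard treatment, b = died on new treatment,
# c = survived on standard treatment, d = survived on new treatment
STRATA = {
    "older (46-60 yrs)": (43, 6, 100, 34),
    "younger (30-45 yrs)": (109, 11, 148, 69),
}


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of 2 x 2 strata."""
    strata = list(strata)
    # Each stratum contributes a*d/n to the numerator and b*c/n to the
    # denominator, where n is the stratum sample size.
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


print(round(mantel_haenszel_or(STRATA.values()), 2))  # 3.74
```

Adding a further stratifying variable would mean one tuple per combination of levels, which is one way to see why this approach scales poorly as variables are added.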
Definition
Logistic regression works very similarly to linear regression, but with a binomial response variable. Its greatest advantages over the Mantel-Haenszel OR are that you can use continuous explanatory variables and that it is easier to handle more than two explanatory variables simultaneously. Although apparently trivial, this last characteristic is essential when we are interested in the impact of various explanatory variables on the response variable. If we look at multiple explanatory variables independently, we ignore the covariance among variables and are subject to confounding effects, as was demonstrated in the example above, where the effect of treatment on the probability of death was partially hidden by the effect of age.
A logistic regression models the odds of an outcome based on individual characteristics. Because odds are a ratio, what is actually modeled is the logarithm of the odds, given by:

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_m x_m$$
where π indicates the probability of an event (e.g., death in the previous example), β0 is the coefficient associated with the reference group, and βi are the regression coefficients associated with the explanatory variables xi. At this point, an important concept must be highlighted. The reference group, represented by β0, consists of those individuals presenting the reference level of each and every variable x1...m. To illustrate, in our previous example, these are the older individuals who received the new treatment. Later, we will discuss how to set the reference level.
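To make the relationship between probability, odds and log-odds concrete, here is a minimal Python sketch of the logit transformation and its inverse. The function names and the coefficient values in the usage lines are hypothetical and chosen purely for illustration; they are not estimates from the endocarditis example.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Probability corresponding to a log-odds value eta."""
    return 1.0 / (1.0 + math.exp(-eta))


def linear_predictor(beta0, betas, xs):
    """Right-hand side of the model: beta0 + sum of beta_i * x_i."""
    return beta0 + sum(b * x for b, x in zip(betas, xs))


# Hypothetical coefficients and covariate values (illustration only)
eta = linear_predictor(-1.0, [0.5, 2.0], [1, 0])  # -1.0 + 0.5*1 + 2.0*0
p = inverse_logit(eta)
print(eta, p)
print(logit(p))  # recovers eta, up to floating-point error
```

With all x set to 0, the linear predictor reduces to β0, which is why the intercept describes the reference group.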
Logistic regression step-by-step
Let us apply a logistic regression to the example described before to see how it works and how to interpret the results. Let us build a logistic regression model that includes all explanatory variables (age and treatment). This kind of model, with all variables included, is called a “full model” or a “saturated model” and is the best starting option if you have a good sample size and a small number of variables to include. Issues about sample size, variable inclusion and selection, and others will be discussed in the next section; for now, we will keep it as simple as possible.
The result of our model can be seen below, at Table 4.
Table 4.
Results from multivariate logistic regression model containing all explanatory variables (full model).
Term | β estimate | Standard error | P value |
---|---|---|---|
Intercept (β0) | −2.121 | 0.303 | <0.001 |
Age: Younger (β1) | 0.454 | 0.207 | 0.028 |
Treatment: Standard (β2) | 1.333 | 0.283 | <0.001 |
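For readers who want to see where such estimates come from, the sketch below fits the same two-variable model by maximum likelihood using Newton-Raphson iterations in plain Python. It assumes the model was fitted to the grouped counts of Table 3, with older age and new treatment as reference levels. That is an assumption about how Table 4 was produced, and the statistical software actually used may apply a different algorithm. All names (`GROUPS`, `invert`, `score_and_information`, `fit_logistic`) are hypothetical.

```python
import math

# Grouped counts from Table 3: (younger, standard_treatment, deaths, patients)
GROUPS = [
    (0, 0, 6, 40),     # older, new treatment (reference group)
    (0, 1, 43, 143),   # older, standard treatment
    (1, 0, 11, 80),    # younger, new treatment
    (1, 1, 109, 257),  # younger, standard treatment
]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def score_and_information(beta, groups):
    """Gradient and information matrix of the binomial log-likelihood."""
    k = len(beta)
    score = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for younger, standard, deaths, n in groups:
        x = (1.0, float(younger), float(standard))
        eta = sum(b * xi for b, xi in zip(beta, x))
        p = 1.0 / (1.0 + math.exp(-eta))
        for i in range(k):
            score[i] += x[i] * (deaths - n * p)
            for j in range(k):
                info[i][j] += x[i] * x[j] * n * p * (1.0 - p)
    return score, info


def fit_logistic(groups, iterations=25):
    """Return (coefficients, standard errors) after Newton-Raphson updates."""
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        score, info = score_and_information(beta, groups)
        cov = invert(info)
        beta = [b + sum(cov[i][j] * score[j] for j in range(3))
                for i, b in enumerate(beta)]
    # Standard errors: square roots of the diagonal of the inverse information
    cov = invert(score_and_information(beta, groups)[1])
    return beta, [math.sqrt(cov[i][i]) for i in range(3)]


coefficients, standard_errors = fit_logistic(GROUPS)
for name, b, se in zip(("Intercept", "Age: Younger", "Treatment: Standard"),
                       coefficients, standard_errors):
    print(f"{name}: estimate {b:.3f}, standard error {se:.3f}")
```

Under these assumptions the printed estimates and standard errors should land close to the values in Table 4. In practice one would use an established statistics package rather than hand-written iterations.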
Now all we have to do is interpret this output, beginning with the intercept term, which corresponds to our β0. Taking the exponential of β0 gives the odds of death for individuals in the reference category. So, exp(β0) = exp(−2.121) = 0.12 is the odds of death among individuals who are older and received the new treatment. The interpretation changes slightly for the remaining coefficients, which are expressed relative to the reference group. Individuals who also received the new treatment but are younger have odds of death exp(β1) = exp(0.454) = 1.57 times the odds of reference individuals. Similarly, older individuals who received standard treatment have odds of death exp(β2) = exp(1.333) = 3.79 times the odds of reference individuals. But what if individuals are younger and received standard treatment? Then we have to calculate exp(β1 + β2) = exp(1.787) = 5.97 times the odds of reference individuals.
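The same interpretation can be tabulated for all four combinations of age and treatment. The sketch below uses the rounded estimates from Table 4, so results may differ in the last digit from values computed with unrounded coefficients. The constant and function names are illustrative only.

```python
import math

# Rounded estimates taken from Table 4
B0, B_YOUNGER, B_STANDARD = -2.121, 0.454, 1.333


def odds_of_death(younger, standard):
    """Model-based odds of death; arguments are 0/1 indicator variables."""
    return math.exp(B0 + B_YOUNGER * younger + B_STANDARD * standard)


reference = odds_of_death(0, 0)  # older, new treatment

for label, younger, standard in [
    ("older, new treatment", 0, 0),
    ("younger, new treatment", 1, 0),
    ("older, standard treatment", 0, 1),
    ("younger, standard treatment", 1, 1),
]:
    odds = odds_of_death(younger, standard)
    ratio = odds / reference           # odds ratio versus the reference group
    probability = odds / (1.0 + odds)  # convert odds back to a probability
    print(f"{label}: odds {odds:.2f}, OR {ratio:.2f}, probability {probability:.3f}")
```

The last column converts each odds value to a probability with odds / (1 + odds). For the reference group, odds of 0.12 correspond to a probability of death of roughly 0.11.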
These are the basics of logistic regression interpretation. However, some issues appear during the analysis, and solutions are not always readily available.
Conclusion
Logistic regression is a powerful tool, especially in epidemiologic studies, because it allows multiple explanatory variables to be analyzed simultaneously while reducing the effect of confounding factors. However, researchers must pay attention to model building and avoid simply feeding software with raw data and going straight to the results. Some difficult decisions in model building will depend entirely on the researcher's expertise in the field.