As it turns out, logistic regression is much harder than it looks. Actually, the hard part is trying to compare the results of logistic regression across models. The basic gist of the problem is that the coefficients produced by a run-of-the-mill logistic regression are affected by the degree of unobserved heterogeneity in the model, thus making it difficult to discern real differences in the true effect of a given variable or set of variables from differences induced by changes in the degree of unobserved heterogeneity.

To see how this works, let’s imagine that the value of a given binary outcome $y$ is driven by the following data generating process:

$$y^* = \sum_{j=1}^{k} \beta_j x_j + \sigma \varepsilon,$$

where $y^*$ refers to an unobserved latent variable ranging from $-\infty$ to $\infty$ which depicts the underlying propensity for a given event to occur, $\beta_j$ represents the effect associated with the $j$th independent variable $x_j$, and $\sigma$ represents an adjustment factor which allows the variance of the error term $\varepsilon$ to be adjusted up or down.

Since $y^*$ is unobservable, the latent variable model can’t be estimated directly. Instead, we take the latent variable model as a point of departure and treat $y$—which we can observe—as a binary indicator of whether or not the value of $y^*$ is above a given threshold $\tau$. By convention, we typically assume that $\tau = 0$. If we further assume that $\varepsilon$ has a logistic distribution such that $E(\varepsilon) = 0$ and $\operatorname{Var}(\varepsilon) = \pi^2/3$, we find with a little bit of work that

$$P(y = 1 \mid \mathbf{x}) = \frac{\exp\!\left(\sum_j \beta_j x_j / \sigma\right)}{1 + \exp\!\left(\sum_j \beta_j x_j / \sigma\right)}.$$
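As a quick sanity check on this derivation, a short simulation (my own sketch, not part of the original post; the parameter values are arbitrary) confirms that thresholding the latent variable at zero with logistic errors reproduces the logistic CDF of the scaled index:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values chosen purely for illustration.
beta, sigma = 1.5, 2.0

results = []
for x in (-1.0, 0.0, 2.0):
    # Standard logistic errors: E(eps) = 0, Var(eps) = pi^2 / 3.
    eps = rng.logistic(size=500_000)
    # Empirical P(y = 1): fraction of draws where the latent index crosses 0.
    empirical = np.mean(beta * x + sigma * eps > 0)
    # Theoretical P(y = 1): logistic CDF evaluated at beta * x / sigma.
    theoretical = 1.0 / (1.0 + np.exp(-beta * x / sigma))
    results.append((x, empirical, theoretical))

for x, emp, theo in results:
    print(f"x = {x:+.1f}: empirical = {emp:.3f}, theoretical = {theo:.3f}")
```

With half a million draws per point, the empirical frequencies match the closed-form probabilities to about three decimal places.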

This should look familiar—it is the standard logistic regression model. If we had assumed $\varepsilon$ took on a normal distribution such that $E(\varepsilon) = 0$ and $\operatorname{Var}(\varepsilon) = 1$, we would have ended up with a probit model. Consequently, anything I say here about logistic regression applies to probit models as well.

The relationship between the set of “true” effects $\beta_j$ and the set of estimated effects $b_j$ is as follows:

$$b_j = \frac{\beta_j}{\sigma}.$$

Simply put, when we estimate an effect using logistic regression, we are actually estimating the ratio between the true effect $\beta_j$ and the degree of unobserved heterogeneity $\sigma$. We can think about this as a form of implicit standardization. The problem is that to the extent that the magnitude of $\sigma$ varies across models, so does the metric according to which coefficients are standardized. What this means is that the magnitude of $b_j$ can vary across models even when the magnitude of the true effect $\beta_j$ remains constant.
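To make the implicit standardization concrete, here is a small simulation of my own (not from the original post; numpy only, with a hand-rolled Newton-Raphson fit so no stats package is assumed). The true effect is held fixed at $\beta = 1$ while $\sigma$ is doubled, and the recovered logit coefficient drops by half:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logit(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; returns coefficients."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (y - p)                    # score vector
        H = (X * (p * (1 - p))[:, None]).T @ X  # observed information
        b = b + np.linalg.solve(H, grad)
    return b

n = 200_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta = 1.0  # true effect, held constant across both scenarios

estimates = {}
for sigma in (1.0, 2.0):
    # Latent-variable DGP: y* = beta * x + sigma * eps, eps standard logistic.
    y_star = beta * x + sigma * rng.logistic(size=n)
    y = (y_star > 0).astype(float)
    estimates[sigma] = fit_logit(X, y)[1]  # slope on x

print(estimates)  # slope approximates beta / sigma: ~1.0 and ~0.5
```

Nothing about the effect of $x$ changed between the two runs—only the variance of the unobserved component did—yet the estimated coefficient is cut in half.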

The implication here is that we can’t get away with the usual trick of comparing a series of nested models to determine the way in which the inclusion of controls affects the parameter estimates associated with a given variable of interest. Moreover, we can’t compare group-specific models unless we are willing to assume groupwise homoscedasticity. The latter principle also extends to the interpretation of interaction effects within a single model. In other words, unobserved heterogeneity can pose big problems in the context of logistic regression.

Perhaps somewhat surprisingly, discussion of this issue goes back at least as far as Winship and Mare (1984), who proposed a solution based on the use of a standardized dependent variable. Alternative solutions have since been proposed by Allison (1999), Williams (2009), and, most recently, Karlson et al., who have a paper forthcoming in Sociological Methodology. In addition to providing a nice overview of this line of work, Mood (2010) discusses a number of other solutions, including the use of linear probability models. While the linear probability model is not without its problems, it is easy to estimate and interpret. Moreover, the problems it does have are often easily remedied without turning to logistic regression.

• I was taught that the appropriate use of the nested-model approach was to pick the most parsimonious set of explanatory variables. That is, the approach doesn’t tell us anything useful about omitted variables, nor should we be comparing parameter estimates across models. I’m pretty sure this advice holds across linear, logistic, non-parametric, etc. applications… so, what’s with all the hand wringing?

In the case of linear models with continuous dependent variables, there is something to be gained by comparing nested models. Let’s imagine that we first regress income on a binary measure of race (e.g., white versus non-white), and then regress income on race while also controlling for education. We find that the magnitude of the race effect in the second model is lower than in the first. The conventional interpretation here is that at least some of the race effect is being explained away by differences in the level of education between white and non-white respondents. This type of example is how many of us first learned about the idea of statistical control in the context of multiple regression. The problem is that this idea is, again, typically introduced in the context of linear models with continuous dependent variables, where coefficients are not affected by the degree of unobserved heterogeneity in the model.
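The contrast can be simulated directly (again a sketch of my own, with a hand-rolled Newton-Raphson logit; the variable names are purely illustrative). The two predictors are independent, so in a linear model adding the second leaves the first slope untouched—but in a logit the coefficient on x1 shifts anyway, because the omitted x2 acts as unobserved heterogeneity:

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_logit(X, y, n_iter=25):
    """Logistic regression via Newton-Raphson (numpy only)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (y - p))
    return b

n = 200_000
x1 = rng.normal(size=n)  # the variable of interest
x2 = rng.normal(size=n)  # an independent control (uncorrelated with x1)

# Binary outcome: latent index with equal true effects of 1.0 on both predictors.
y = (x1 + x2 + rng.logistic(size=n) > 0).astype(float)

ones = np.ones(n)
b_short = fit_logit(np.column_stack([ones, x1]), y)[1]     # x2 omitted
b_full = fit_logit(np.column_stack([ones, x1, x2]), y)[1]  # x2 included

# Linear-model analogue: OLS slopes on a continuous outcome are unaffected.
y_lin = x1 + x2 + rng.normal(size=n)
ols_short = np.linalg.lstsq(np.column_stack([ones, x1]), y_lin, rcond=None)[0][1]
ols_full = np.linalg.lstsq(np.column_stack([ones, x1, x2]), y_lin, rcond=None)[0][1]

print(b_short, b_full)      # logit: short-model slope is attenuated (~0.88 vs ~1.0)
print(ols_short, ols_full)  # OLS: both slopes ~1.0
```

Note that in the logit the coefficient on x1 changes across the nested models even though x2 is completely uncorrelated with x1—exactly the behavior that makes the usual "explained away" reading unsafe.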

• All the cool kids that are into big data, machine learning, and predictive modeling seem to be all about the K-fold cross validation technique (http://en.m.wikipedia.org/wiki/Cross-validation_(statistics)#section_2). Off the top of my head it seems possible to estimate sigma with a similar approach. If you had an estimate of sigma in each model, would you be more willing to compare the betas?

I’m not sure about the K-fold bit, but I think I’m with you on the idea of trying to estimate sigma (which, to reiterate, is a scaling factor, not a measure of residual variance). This is the basic intuition behind the work of Williams (2009), who argues for the use of heterogeneous choice models in which sigma is modeled directly.

• Right. Sigma is just a scalar. Hm, I’ll have to ponder on this some more….


• Dan Wang

Fascinating post! Per a conversation with Matt Salganik, quantifying model fit is also a huge issue in logistic regression. Here’s a great way of showing model fit that is visual, intuitive, and uses as much information from the data and model as possible: http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2011.00525.x/abstract

• Anonymous

Why not just calculate and report marginal effects?
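This is, for what it’s worth, one of the remedies Mood (2010) discusses: average marginal effects are far less sensitive to unobserved heterogeneity than raw coefficients. A quick numpy sketch of my own (hand-rolled logit again, illustrative names) shows the AME of x1 barely moving between short and full models even though the coefficients differ:

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_logit(X, y, n_iter=25):
    """Logistic regression via Newton-Raphson (numpy only)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        b += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (y - p))
    return b

def ame(X, b, j):
    """Average marginal effect of column j for a fitted logit."""
    p = 1.0 / (1.0 + np.exp(-X @ b))
    return np.mean(b[j] * p * (1 - p))

n = 200_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)  # independent predictors
y = (x1 + x2 + rng.logistic(size=n) > 0).astype(float)

ones = np.ones(n)
X_short = np.column_stack([ones, x1])
X_full = np.column_stack([ones, x1, x2])
b_short, b_full = fit_logit(X_short, y), fit_logit(X_full, y)

print(b_short[1], b_full[1])                             # coefficients differ noticeably
print(ame(X_short, b_short, 1), ame(X_full, b_full, 1))  # AMEs nearly identical
```

The AME rescales the coefficient by the average density $p(1-p)$, and the two rescalings largely cancel—which is why marginal effects travel across models far better than the coefficients themselves.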