As it turns out, logistic regression is much harder than it looks. Actually, the hard part is trying to compare the results of logistic regression across models. The basic gist of the problem is that the coefficients produced by a run-of-the-mill logistic regression are affected by the degree of unobserved heterogeneity in the model, thus making it difficult to discern real differences in the true effect of a given variable or set of variables from differences induced by changes in the degree of unobserved heterogeneity.

To see how this works, let’s imagine that the value of a given binary outcome y_i is driven by the following data generating process:

    \[ y^*_i = \alpha_0 + \alpha_1{x_{i1}} + \ldots + \alpha_J{x_{iJ}} + \sigma\epsilon_i, \]

where y^*_i refers to an unobserved latent variable ranging from -\infty to \infty which depicts the underlying propensity for a given event y_i to occur, \alpha_j represents the effect associated with the jth independent variable x_{ij}, and \sigma represents an adjustment factor which allows the variance of the error term \epsilon_i to be adjusted up or down.

Since y^*_i is unobservable, the latent variable model can’t be estimated directly. Instead, we take the latent variable model as a point of departure and treat y_i, which we can observe, as a binary indicator of whether or not the value of y^*_i is above a given threshold \tau, so that y_i = 1 if y^*_i > \tau and y_i = 0 otherwise. By convention, we typically assume that \tau = 0. If we further assume that \epsilon has a logistic distribution such that E(\epsilon|\mathbf{x}) = 0 and Var(\epsilon|\mathbf{x}) = \pi^2/3, we find with a little bit of work that

    \[ \text{ln}\left(\frac{\text{Pr}(y_i = 1)}{1-[\text{Pr}(y_i = 1)]}\right) = \beta_0 + \beta_1{x_{i1}} + \ldots + \beta_J{x_{iJ}}. \]

This should look familiar: it is the standard logistic regression model. If we had assumed \epsilon took on a normal distribution such that E(\epsilon|\mathbf{x}) = 0 and Var(\epsilon|\mathbf{x}) = 1, we would have ended up with a probit model. Consequently, anything I say here about logistic regression applies to probit models as well.

The relationship between the set of “true” effects \alpha_j and the set of estimated effects \beta_j is as follows:

    \[ \beta_j = \alpha_j/\sigma \hspace{.5in} j = 1, \ldots, J. \]
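
One way to see where this ratio comes from is sketched below, under the assumptions already stated (\tau = 0, \sigma > 0, and a standard logistic \epsilon_i). The symbol \Lambda for the logistic cumulative distribution function is introduced here for convenience:

    \[ \text{Pr}(y_i = 1) = \text{Pr}(y^*_i > 0) = \text{Pr}\left(\epsilon_i > -\frac{\alpha_0 + \alpha_1{x_{i1}} + \ldots + \alpha_J{x_{iJ}}}{\sigma}\right) = \Lambda\left(\frac{\alpha_0 + \alpha_1{x_{i1}} + \ldots + \alpha_J{x_{iJ}}}{\sigma}\right), \]

where the last step uses the symmetry of the logistic distribution. Since \text{ln}[\Lambda(z)/(1 - \Lambda(z))] = z, taking the log odds of both sides gives

    \[ \text{ln}\left(\frac{\text{Pr}(y_i = 1)}{1-[\text{Pr}(y_i = 1)]}\right) = \frac{\alpha_0}{\sigma} + \frac{\alpha_1}{\sigma}{x_{i1}} + \ldots + \frac{\alpha_J}{\sigma}{x_{iJ}}, \]

and matching terms with the logistic regression model yields \beta_j = \alpha_j/\sigma. The same reasoning gives \beta_0 = \alpha_0/\sigma for the intercept.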

Simply put, when we estimate an effect using logistic regression, we are actually estimating the ratio between the true effect and the degree of unobserved heterogeneity. We can think about this as a form of implicit standardization. The problem is that to the extent that the magnitude of \sigma varies across models, so does the metric according to which coefficients are standardized. What this means is that the magnitude of \beta_j can vary across models even when the magnitude of the true effect \alpha_j remains constant.
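
A small simulation can make the implicit standardization concrete. The sketch below is only an illustration: the parameter values, the sample size, and the helper names `simulate` and `fit_logit` are assumptions made for this example. It generates data from the latent variable model with the same \alpha_1 but two different values of \sigma, then fits a one-predictor logistic regression with a hand-rolled Newton-Raphson routine so that nothing beyond the Python standard library is needed.

```python
import math
import random


def simulate(alpha0, alpha1, sigma, n, seed):
    """Draw (x, y) from y* = alpha0 + alpha1*x + sigma*eps, with y = 1 if y* > 0."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        # Standard logistic draw via the inverse CDF; u is clamped away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        eps = math.log(u / (1.0 - u))
        x.append(xi)
        y.append(1 if alpha0 + alpha1 * xi + sigma * eps > 0 else 0)
    return x, y


def fit_logit(x, y, iters=15):
    """Fit Pr(y=1) = logistic(b0 + b1*x) by Newton-Raphson; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            # Accumulate the score vector and the 2x2 information matrix.
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Illustrative (assumed) true parameters: the effect alpha1 never changes.
ALPHA0, ALPHA1 = 0.5, 1.0
estimates = {}
for sigma in (1.0, 2.0):
    x, y = simulate(ALPHA0, ALPHA1, sigma, n=20000, seed=1)
    estimates[sigma] = fit_logit(x, y)
    b0, b1 = estimates[sigma]
    print(f"sigma={sigma}: b1={b1:.3f}  alpha1/sigma={ALPHA1 / sigma:.3f}")
```

In runs like this, the estimated slope should track \alpha_1/\sigma rather than \alpha_1 itself, so doubling \sigma roughly halves the coefficient even though the true effect is unchanged.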

The implication here is that we can’t get away with the usual trick of comparing a series of nested models to determine the way in which the inclusion of controls affects the parameter estimates associated with a given variable of interest. Moreover, we can’t compare group-specific models unless we are willing to assume groupwise homoscedasticity. The latter principle also extends to the interpretation of interaction effects within a single model. In other words, unobserved heterogeneity can pose big problems in the context of logistic regression.
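
The nested-model problem can be illustrated with another simulation. This is a sketch under assumed values: two independent standard normal predictors with latent effects of 1 and 2, and hypothetical helper names `solve` and `fit_logit`. Because x2 is generated independently of x1, leaving it out would not bias the coefficient on x1 in a linear model. In the logistic model, however, the omitted x2 becomes part of the unobserved heterogeneity, so the coefficient on x1 shrinks.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logit(rows, y, iters=12):
    """Newton-Raphson logit fit; each row must already include a leading 1.0."""
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(rows, y):
            z = sum(b * v for b, v in zip(beta, row))
            p = 1.0 / (1.0 + math.exp(-z))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    hess[i][j] += w * row[i] * row[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta


rng = random.Random(2)
full_rows, reduced_rows, y = [], [], []
for _ in range(10000):
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # independent by construction
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    eps = math.log(u / (1.0 - u))  # standard logistic error
    # Assumed latent model: y* = 1.0*x1 + 2.0*x2 + eps, with y = 1 if y* > 0.
    y.append(1 if 1.0 * x1 + 2.0 * x2 + eps > 0 else 0)
    full_rows.append([1.0, x1, x2])
    reduced_rows.append([1.0, x1])

full = fit_logit(full_rows, y)
reduced = fit_logit(reduced_rows, y)
print(f"x1 coefficient with x2 in the model:  {full[1]:.3f}")
print(f"x1 coefficient with x2 left out:      {reduced[1]:.3f}")
```

The gap between the two printed coefficients reflects rescaling rather than confounding, which is why a change in a coefficient across nested logistic models is hard to read as a change in the underlying effect.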

Perhaps somewhat surprisingly, discussion of this issue goes back at least as far as Winship and Mare (1984), who proposed a solution based on the use of a standardized dependent variable. Alternative solutions have since been proposed by Allison (1999), Williams (2009), and, most recently, Karlson et al., who have a paper forthcoming in Sociological Methodology. In addition to providing a nice overview of this line of work, Mood (2010) discusses a number of other solutions including the use of linear probability models. While the linear probability model is not without its problems, it is easy to estimate and interpret. Moreover, the problems that it does have are often easily remedied without turning to logistic regression.
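
To give a rough sense of why the linear probability model is attractive here, the sketch below fits it by ordinary least squares. It uses assumed values (two independent predictors with latent effects of 1 and 2) and hypothetical helper names `solve` and `ols`. It is not taken from any of the papers cited above. It relies only on the algebra of least squares, under which leaving out a regressor that is uncorrelated with the included one leaves the included slope essentially unchanged.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


rng = random.Random(3)
long_rows, short_rows, y = [], [], []
for _ in range(10000):
    x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # independent by construction
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    eps = math.log(u / (1.0 - u))  # standard logistic error
    # Assumed latent model: y* = 1.0*x1 + 2.0*x2 + eps, with y = 1 if y* > 0.
    y.append(1 if 1.0 * x1 + 2.0 * x2 + eps > 0 else 0)
    long_rows.append([1.0, x1, x2])
    short_rows.append([1.0, x1])

lpm_long = ols(long_rows, y)    # linear probability model with x2 included
lpm_short = ols(short_rows, y)  # linear probability model with x2 left out
print(f"LPM slope on x1 with x2 in the model: {lpm_long[1]:.4f}")
print(f"LPM slope on x1 with x2 left out:     {lpm_short[1]:.4f}")
```

The two slopes should agree up to sampling noise. If x1 and x2 were correlated, the usual omitted variable bias would apply, and the familiar caveats about the linear probability model (such as predicted probabilities outside the unit interval) remain.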

This got posted at R-bloggers last night, after the men’s 100-meter Olympic event was over. Marcus Gesmann predicted Usain Bolt’s 9.63-second result within 0.05 seconds. Even better, he did it using a simple log-linear model that didn’t control for any other factors.

Check the original article at R-bloggers, which talks more about the progression of faster running times and includes the R code used.