Thursday, April 25, 2013

Logistic Regression

In this post I want to look at logistic regression, the statistical technique I've used so far to produce predictions for the Titanic competition.

Logistic regression is commonly used to predict a binary outcome from a set of continuous or categorical predictors. Other forms of logistic regression allow for ordinal and multinomial outcomes.
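To make the binary case concrete, here is a minimal sketch of how a fitted model turns predictor values into a probability: the log-odds (logit) of the outcome is a linear combination of the predictors, and the inverse of the logit maps that back to a value between 0 and 1. The function names and the coefficient values are made up for illustration; they are not estimates from the Titanic data.

```python
import math


def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def predicted_probability(intercept, coefficients, values):
    """Probability of the event for one case.

    The linear predictor is intercept + sum(b_i * x_i); the inverse
    logit (logistic function) converts it to a probability.
    """
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients for two predictors (illustration only).
p = predicted_probability(0.5, [-1.2, 0.03], [1, 30])
print(round(p, 3), round(logit(p), 3))
```

Applying `logit` to the predicted probability returns the linear predictor, which is why the model is described as linear on the logit scale.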

Advantages of Logistic Regression

  • relatively free of restrictions
  • capacity to analyse a mix of continuous, discrete and dichotomous variables

Practical Issues

  • Need a sufficient number of cases relative to the number of variables. SAS has a logistic regression method which is not sensitive to data sparseness.
    • This is not a problem with the Titanic data.
  • Need sufficient expected frequencies in each cell. The usual rule of thumb is that logistic regression is not reliable if more than 20% of cells have expected frequencies less than five, or if there are expected frequencies less than one. The solution in these cases is to collapse categories for variables with more than two levels, and / or to accept lower power.
    • This is not a problem, at least at the moment, with the Titanic data.
  • Logistic Regression assumes a linear relationship between continuous predictors and the logit transform of the dependent variable. I'll cover this assumption in a separate post.
  • Absence of multicollinearity. Multicollinearity is the situation where two or more independent variables are highly correlated. If two variables are perfectly correlated, the second adds no information to the model and is redundant. In the Titanic data set, imagine a new variable called "family", defined as the total of sibsp and parch. Because it is an exact linear combination of those two variables, family would add no information to a model that already contains them.
  • Independence of errors. Logistic regression, as is the case with most other forms of regression, assumes that the responses of different cases are independent of each other. This assumption is arguably violated with the Titanic dataset, as passengers were related by family, nationality, and class.
    • According to Tabachnick and Fidell (2013), the impact of non-independence in logistic regression is to produce overdispersion. This is when the variability in cell frequencies is greater than expected by the underlying model.
    • The authors indicate that this results in an inflated Type I error rate for tests for predictors. The suggested remedy is to undertake multilevel modelling.
    • I'm not sure at this stage how this relates to the Titanic data, how it might impact the prediction results, and what, if anything, I should do.
  • Absence of outliers in the solution. Outliers are cases not well predicted by the model. A case that actually is in one category of the outcome may show a high probability of being in the other category.
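
The expected-frequency rule of thumb in the list above is easy to check by hand or in code. The sketch below assumes the "cells" come from a two-way table of a categorical predictor against the outcome, and that expected counts are computed under independence (row total times column total, divided by the grand total). The function names and both tables are made up for illustration and are not Titanic counts.

```python
def expected_frequencies(table):
    """Expected cell counts under independence for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def rule_of_thumb_ok(table, small=5.0, tiny=1.0, max_share_small=0.20):
    """True if no more than 20% of cells have expected counts below 5
    and no cell has an expected count below 1."""
    cells = [e for row in expected_frequencies(table) for e in row]
    share_small = sum(e < small for e in cells) / len(cells)
    return share_small <= max_share_small and min(cells) >= tiny


# Rows: levels of a hypothetical categorical predictor.
# Columns: outcome = 0, outcome = 1. Counts are invented.
dense = [[30, 20], [25, 35], [40, 50]]
sparse = [[2, 1], [3, 2], [40, 50]]

print(rule_of_thumb_ok(dense))   # all expected counts are well above 5
print(rule_of_thumb_ok(sparse))  # four of six expected counts fall below 5
```

When the check fails, the remedy described above applies: collapse the sparse levels into a broader category and run the check again.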

Other Issues

  • Logistic regression requires categorical variables to be converted into dummy/indicator variables. This is not a problem with SPSS (and other programs), as SPSS automatically creates the new variables for predictors declared as categorical.
  • Norusis outlines four diagnostic checking areas for logistic regression:
    • Is the relationship between the logit and the continuous variables linear?
    • How well does the model discriminate between cases that experience the event and cases that do not (model discrimination)?
    • How well do predicted probabilities match observed probabilities over the entire range of values (model calibration)?
    • Are there unusual cases?
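
As a rough sketch of the discrimination and calibration checks listed above: discrimination can be summarised by the proportion of event/non-event pairs in which the event case received the higher predicted probability (often called the c statistic), and calibration by comparing mean predicted probability with the observed event rate within groups of cases sorted by prediction. These are common choices rather than necessarily the exact procedures Norusis describes, and the function names and the eight predictions below are invented for illustration.

```python
def concordance(probabilities, outcomes):
    """Share of (event, non-event) pairs where the event case has the
    higher predicted probability; ties count as half."""
    events = [p for p, y in zip(probabilities, outcomes) if y == 1]
    non_events = [p for p, y in zip(probabilities, outcomes) if y == 0]
    score = 0.0
    for p_event in events:
        for p_non in non_events:
            if p_event > p_non:
                score += 1.0
            elif p_event == p_non:
                score += 0.5
    return score / (len(events) * len(non_events))


def calibration_table(probabilities, outcomes, n_groups=2):
    """Sort cases by predicted probability, split them into groups, and
    return (mean predicted probability, observed event rate) per group."""
    pairs = sorted(zip(probabilities, outcomes))
    rows = []
    for g in range(n_groups):
        start = g * len(pairs) // n_groups
        end = (g + 1) * len(pairs) // n_groups
        group = pairs[start:end]
        mean_predicted = sum(p for p, _ in group) / len(group)
        observed_rate = sum(y for _, y in group) / len(group)
        rows.append((mean_predicted, observed_rate))
    return rows


# Invented predictions and outcomes (1 = event, 0 = no event).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
actual = [0, 0, 1, 0, 1, 0, 1, 1]

print(concordance(probs, actual))
for predicted, observed in calibration_table(probs, actual):
    print(round(predicted, 2), round(observed, 2))
```

A concordance of 0.5 means the model discriminates no better than chance, while 1.0 means every event case was ranked above every non-event case. For calibration, the closer the two columns agree in each group, the better.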
