Logistic Regression in Dissertation & Thesis Research
What are the odds that a 43-year-old, single woman who wears glasses and favors the color gray is a librarian? If your dissertation or thesis research question resembles this, then the analysis you may want to use is a logistic regression.
Logistic regression is a statistic that allows group membership to be predicted from predictor variables, regardless of whether the predictor variables are continuous, discrete, or a combination of both. In the example above, the group to which we are trying to predict membership is "librarians". The predictor variables are age, marital status, glasses, and favorite color.
Why would research want to predict such group membership? In the health sciences, research frequently examines whether or not a subject will get a disease based on a number of predictors. For example, your question may ask if age, weight, gender, tobacco use, and marital status predict whether a subject gets cancer.
When to Use Logistic Regression
Logistic regression is the statistic to use when your dependent variable is anticipated to be nonlinear with one or more of your independent variables. For example, the probability of one of the subjects getting cancer may not be affected too much by a 5-cigarettes-smoked difference among subjects who are light smokers (say 0-5 per day), but may change a lot with an equal difference among subjects who are heavy smokers (say 25-30 a day). In this example, the relationship between the dependent variable (cancer) and the independent/predictor variable (tobacco use) is not linear.
In this example, we must ask whether the predictor variables can predict the constant (cancer). The most direct way to do this is to compare a model with the constant plus the predictor variables to a model with just the constant. If the analysis, the logistic regression, indicates a reliable difference between the two models, then there is a significant relationship between the predictors and the outcome (cancer).
Using the above example, we would compare the model which consists of the prediction variables (age, weight, gender, tobacco use, and marital status) and the constant (cancer) to a model which consists of only the constant (cancer). If the model with the predictors is significantly different than the model with just the constant alone, then our model with the predictors can be said to predict the outcome (cancer) better than no predictors at all.
You may be thinking that, of course, having predictors is better than not having any predictors at all! But what if your predictor variables were things like favorite color, type of car owned, presence of braces, and pet ownership? Would these predictor variables predict the constant (cancer) reliably? Probably not!
Another way to see if the predictor variables predict the outcome (cancer) is to compare a model with only some of the predictor variables plus the constant with a model with all of the predictor variables plus the constant, called the "full model".
Continuing our example, we might compare the model of the predictor variables (age and weight) plus the constant (cancer) to a model with all of the predictor variables (age, weight, gender, tobacco use, and marital status) plus the constant (cancer). The objective here is to find the best model "fit". That is, you want your model to do the best job of predicting the constant (cancer) with the fewest predictor variables.
Types of Logistic Regression
There are several types of logistic regression that can be used for dissertation and thesis analyses. They include direct, sequential, and stepwise logistic regressions. Which one you use for your analysis depends on your research.
In analysis using direct logistic regression, all of the predictor variables are entered into the equation at the same time. If your research has not indicated anything about the order of your predictor variables or the importance of them in relation to the constant (which, in this case, is cancer), then your statistic of choice would be a direct logistic regression for the analysis.
If your research does indicate a certain order for or importance of your predictor variables, then a sequential logistic regression is the statistic you would use. Unfortunately, there is no easy way to accomplish this with most statistical software packages. Many times, you must complete your analysis performing multiple "runs". See your statistical software's manual for how to do this.
As with the stepwise multiple regression statistic, the stepwise logistic regression is not recommended for dissertation analyses, as it tends to capitalize on chance, and your results may not generalize to other similar samples. The stepwise logistic regression is best viewed as a data screening tool, and the decision of whether to include a predictor variable should be less harsh than with other statistics (e.g. .15 or .20 or less).