Key points
- Stepwise logistic regression is a technique for building a logistic model that adds or removes predictors one at a time according to a selection criterion such as AIC.
- Stepwise logistic regression can reduce model complexity and may improve performance by removing irrelevant or redundant variables. It also has significant drawbacks: the result is sensitive to the starting model and search direction, the estimates and p-values are biased by the selection, and interactions or nonlinear effects are ignored unless you put them in the scope yourself.
- Stepwise logistic regression can be performed in R using the stepAIC function from the MASS package, which allows choosing the direction of the stepwise procedure, either "both," "backward," or "forward."
- Stepwise logistic regression should be interpreted and evaluated using various criteria, such as AIC, deviance, coefficients, p-values, odds ratios, confidence intervals, accuracy, precision, recall, F1-score, ROC curve, AUC, cross-validation, bootstrap, or hold-out test set.
- Stepwise logistic regression should be used cautiously and supplemented with other variable selection methods, such as domain knowledge, exploratory data analysis, correlation analysis, or regularization techniques.
Hello, this is Zubair Goraya, a data analyst and a writer for Data Analysis, a website that provides tutorials related to RStudio. This article will discuss Stepwise Logistic regression in R, a powerful technique for modeling binary outcomes.
Stepwise Logistic Regression in R: A Complete Guide
Logistic Regression is a popular method for predicting binary outcomes, such as whether or not a client would purchase a product.
However, when you have many potential predictors, how do you choose the best ones for your model?
One way to do this is by using stepwise logistic regression, a procedure that iteratively adds and removes variables according to a criterion that balances model fit against model size, such as the AIC.
In this article, you will learn:
- What is stepwise logistic regression, and why use it
- How to perform stepwise logistic regression in R using the stepAIC function
- How to compare different stepwise methods, such as forward, backward, and both-direction selection
- How to interpret and evaluate the results of stepwise logistic regression
- What are the advantages and disadvantages of stepwise logistic regression
- How to avoid some common pitfalls and challenges of stepwise logistic regression
What is Stepwise Logistic Regression, and Why Use It?
Stepwise logistic regression is a variable selection technique that aims to find the optimal subset of predictors for a logistic regression model. It does this by starting with an initial model, either with no predictors (forward selection) or with all predictors (backward elimination), and then adding or removing variables one at a time based on a criterion such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC).
Stepwise selection is convenient, but it has well-known drawbacks:
- It is sensitive to the order of variable entry or removal, which can lead to different final models depending on the starting point and direction of the procedure.
- Due to the multiple testing and data snooping involved, it can produce coefficient estimates that are biased away from zero, standard errors that are too small, p-values that are too low, and confidence intervals that are too narrow.
- It can ignore meaningful interactions or nonlinear effects among the variables and potential confounding or moderating factors.
- It can be computationally intensive and time-consuming, especially when dealing with large data sets or many potential predictors.
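To make the selection criterion concrete, here is a minimal sketch that computes AIC and BIC by hand and shows how they penalize an extra predictor. The data are simulated and the names dat, x1, x2, and ic are made up for this illustration; they are not part of the diabetes example used later.

```r
# Simulated data (hypothetical; not the diabetes data used later)
set.seed(1)
n   <- 300
x1  <- rnorm(n)                      # informative predictor
x2  <- rnorm(n)                      # pure noise predictor
y   <- rbinom(n, 1, plogis(-0.3 + 1.2 * x1))
dat <- data.frame(y, x1, x2)

small <- glm(y ~ x1,      data = dat, family = binomial)
large <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Criterion = -2 * log-likelihood + penalty * number of parameters
# penalty = 2 gives AIC, penalty = log(n) gives BIC
ic <- function(fit, penalty) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + penalty * attr(ll, "df")
}

c(aic_small = ic(small, 2),      aic_large = ic(large, 2))
c(bic_small = ic(small, log(n)), bic_large = ic(large, log(n)))
```

The model with the lower value is preferred. The hand-computed values should agree with R's built-in AIC() and BIC() functions.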
How to Perform Stepwise Logistic Regression in R using the stepAIC Function
One of the easiest ways to perform stepwise logistic regression in R is using the stepAIC function from the MASS package. This function performs model selection by AIC and allows you to specify the direction of the stepwise procedure, either "both," "backward," or "forward."
To use the stepAIC function, you need two things:
- A base model: the fitted model from which the search starts.
- A scope: the range of models the search may visit, given as a formula or as a list with lower and upper formulas (the smallest and the largest model allowed).
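The base model and scope described above can be sketched as follows on simulated data. The names dat, base_fit, and search_scope are hypothetical; the scope is written as a list of lower and upper formulas, which is the form the stepAIC documentation describes.

```r
library(MASS)  # provides stepAIC

# Simulated data: x1 and x2 carry signal, x3 is noise
set.seed(42)
n     <- 500
dat   <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$x1 - 1.0 * dat$x2))

# Base model: where the search starts (intercept only)
base_fit <- glm(y ~ 1, data = dat, family = binomial)

# Scope: 'lower' is the smallest model allowed, 'upper' the largest
search_scope <- list(lower = ~ 1, upper = ~ x1 + x2 + x3)

picked <- stepAIC(base_fit, scope = search_scope,
                  direction = "both", trace = FALSE)

formula(picked)  # terms that survived the search
picked$anova     # the sequence of additions and removals
```

Printing formula(picked) after every search is a cheap way to confirm that the procedure actually moved away from the base model.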
Using Stepwise Logistic Regression to Predict if a Patient Has Diabetes!
Suppose you want to use stepwise logistic regression to predict whether a patient has diabetes based on several clinical variables. For this purpose, you can use the PimaIndiansDiabetes2 data set from the mlbench package. The raw data set contains 768 observations of 9 variables; 392 complete observations remain after the rows with missing values are removed. The variables are:
- diabetes: Factor indicating whether the patient has diabetes (pos) or not (neg)
- pregnant: Number of times pregnant
- glucose: Plasma glucose concentration
- pressure: Diastolic blood pressure
- triceps: Triceps skin fold thickness
- insulin: 2-Hour serum insulin
- mass: Body mass index
- pedigree: Diabetes pedigree function
- age: Age in years
Data Loading and Preprocessing
You can load the data set and remove any missing values as follows:
# Load the data and remove NAs
library(mlbench)
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)
# Inspect the data
str(PimaIndiansDiabetes2)
Split the data set
# Split the data into training and test set
#install.packages("caret")
library(caret)
library(dplyr)
set.seed(123)
training.samples <- PimaIndiansDiabetes2$diabetes %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- PimaIndiansDiabetes2[training.samples, ]
test.data <- PimaIndiansDiabetes2[-training.samples, ]
dim(train.data)
dim(test.data)
Base and Scope models
Now, you can define the base and scope models for the stepwise procedure. For the base model, you can use either an intercept-only model or a model with one or more essential or relevant predictors for the outcome. For the scope model, you can use either a complete model with all predictors or a model with a subset of predictors that you want to consider for the procedure.
For example, you can use the following models:
# Define the base model (intercept-only)
base.model <- glm(diabetes ~ 1, data = train.data, family = binomial)
# Define the scope model (full model)
scope.model <- glm(diabetes ~ ., data = train.data, family = binomial)
Perform stepwise logistic regression
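# Perform stepwise logistic regression
library(MASS)
# stepAIC expects the scope as a formula or a list of formulas,
# so pass the formulas of the two fitted models rather than the models themselves
step.model <- stepAIC(base.model, direction = "both",
scope = list(lower = formula(base.model), upper = formula(scope.model)), trace = FALSE)
# Summarize the final selected model
summary(step.model)
# Check which predictors were selected
formula(step.model)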
How to Compare Different Stepwise Methods, such as Forward, Backward, and Both-Direction Selection
As mentioned earlier, there are different ways to perform stepwise logistic regression, depending on the direction of the procedure.
The three main methods are:
- Forward selection: This method starts with an intercept-only model and, at each step, adds the variable that lowers the AIC the most. The procedure stops when no remaining variable lowers the AIC.
- Backward elimination: This method starts with a complete model with all variables and, at each step, removes the variable whose removal lowers the AIC the most. The procedure stops when no removal lowers the AIC.
- Both-direction selection: This method combines forward and backward selection by considering every single addition and removal at each step and making the change that lowers the AIC the most. The procedure stops when no single change lowers the AIC.
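To see what a single forward step looks like, you can use the add1 function from base R, which reports the AIC of the current model and of every one-term addition. This is only a sketch on simulated data; dat, null_fit, and step1 are names invented for the illustration.

```r
# Simulated data: x1 and x2 carry signal, x3 is noise
set.seed(42)
n     <- 500
dat   <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$x1 - 1.0 * dat$x2))

null_fit <- glm(y ~ 1, data = dat, family = binomial)

# One forward step: the row "<none>" is the current model,
# the other rows show the AIC after adding that single term
step1 <- add1(null_fit, scope = ~ x1 + x2 + x3)
step1

# The candidate a forward search would add first
best <- rownames(step1)[which.min(step1$AIC)]
best
```

A forward search repeats this step, refitting with the chosen term, until the "<none>" row has the lowest AIC. The drop1 function plays the same role for backward steps.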
# Perform forward selection
forward.model <- stepAIC(base.model, direction = "forward",
scope = list(lower = formula(base.model), upper = formula(scope.model)), trace = FALSE)
# Perform backward elimination
backward.model <- stepAIC(scope.model,
direction = "backward", trace = FALSE)
# Compare the forward model and both-direction model
# (R is case-sensitive: the function is anova, not ANOVA)
anova(forward.model, step.model, test = "Chisq")
Compare AIC values of all three models
# Compare AIC values of all three models
AIC(base.model, forward.model, backward.model, step.model)
How to Interpret and Evaluate the Results of Stepwise Logistic Regression
Once you have performed stepwise logistic regression and selected a final model, you need to interpret and evaluate the model's results regarding its fit, performance, explanation, and validation.
One way to do this is by using the following steps:
- Fit: You can use the summary function to view the details of the model fit, such as the coefficients, standard errors, p-values, AIC, and deviance. You can also use the anova function to compare nested models with a likelihood ratio test on their deviances.
- Performance: You can use various metrics to measure the performance of the model on the training data or a test data set, such as accuracy, precision, recall, F1-score, ROC curve, AUC, etc. You can calculate these metrics using functions from packages such as caret or pROC.
- Explanation: You can use various methods to explain and interpret the model's results regarding its predictors and the outcome variable, such as odds ratios, confidence intervals, marginal effects, etc. You can calculate these methods using functions from packages such as broom or margins.
- Validation: You can use various methods to validate and generalize the model's results to new or unseen data, such as cross-validation, bootstrap, or a hold-out test set. You can use functions from packages such as caret or boot to perform these methods.
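As an illustration of the explanation step, odds ratios and Wald confidence intervals can be obtained with base R alone by exponentiating the coefficients and their intervals. This sketch uses simulated data with made-up names (dat, fit, or_table) rather than the diabetes model.

```r
# Simulated data with two informative predictors
set.seed(42)
n     <- 500
dat   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$x1 - 1.0 * dat$x2))

fit <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Wald intervals are computed on the log-odds scale, then exponentiated
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
round(or_table, 2)
```

An odds ratio above 1 means the odds of the outcome rise as the predictor increases. Keep in mind that intervals computed on a model chosen by stepwise selection tend to be too narrow, because they ignore the selection step.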
For example, you can use the following code to evaluate the results of the both-direction model that you selected earlier:
# Performance: Calculate the accuracy, precision, recall, and F1-score of the both-direction model on the test data
library(caret)
# Predict the probability of diabetes ("pos") for each patient in the test set
pred <- predict(step.model, newdata = test.data, type = "response")
# Convert the predicted classes to a factor with the same levels as test.data$diabetes
pred.class <- factor(ifelse(pred > 0.5, "pos", "neg"), levels = levels(test.data$diabetes))
# Create the confusion matrix
cm <- confusionMatrix(pred.class, test.data$diabetes)
cm
The confusion matrix cross-tabulates the predicted and the actual diabetes classes, giving the number of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). In the output obtained for this article, the model made no positive predictions (pos) at all, so TP and FP are both zero and every patient with diabetes was misclassified.
Because confusionMatrix treats the first factor level (neg) as the positive class by default, a model that predicts only neg has a sensitivity of 1 and a specificity of 0, and its negative predictive value (Neg Pred Value) is undefined. The model's overall accuracy is 0.6667, but this is simply the share of neg cases in the test set, so it does not provide a complete picture of the model's performance.
The Kappa statistic is zero, indicating no agreement between the predicted and actual classes beyond what might be expected by chance. This reinforces that the model is not capturing any pattern in the data.
Additionally, Mcnemar's Test P-Value is very small (9.443e-07), which indicates that the two kinds of error are far from balanced: every error is a patient with diabetes predicted as neg.
The balanced accuracy of 0.5000 is the average of the sensitivity and the specificity, and it tells the same story: the model does no better than always predicting the majority class.
In conclusion, a model that never predicts a positive case is of no practical use. This pattern is exactly what an intercept-only model produces, so before turning to other algorithms, check formula(step.model) to confirm that the stepwise search actually added predictors; a scope that is not supplied as a formula or a list of formulas can leave the search with nothing to add. If predictors were selected and the problem persists, further refinement of the model or a different predictive algorithm is needed to make it suitable for diabetes prediction.
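The statistics discussed above can be reproduced by hand with base R. The sketch below assumes a test set of 52 neg and 26 pos cases, counts chosen here only because they are consistent with the reported accuracy of 0.6667, and a model that predicts neg for everyone. All object names are hypothetical.

```r
# Hypothetical test set: 52 "neg" and 26 "pos" cases (an assumption, see above)
actual    <- factor(rep(c("neg", "pos"), times = c(52, 26)),
                    levels = c("neg", "pos"))
# A model that never predicts "pos", as an intercept-only model would
predicted <- factor(rep("neg", 78), levels = c("neg", "pos"))

tab <- table(Prediction = predicted, Reference = actual)
tab

accuracy <- sum(diag(tab)) / sum(tab)

# Following caret's default, the first level ("neg") is the positive class
sensitivity <- tab["neg", "neg"] / sum(tab[, "neg"])
specificity <- tab["pos", "pos"] / sum(tab[, "pos"])
balanced    <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement versus agreement expected by chance
expected  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced,
        kappa = kappa_hat), 4)

# McNemar's test compares the two kinds of error (26 versus 0 here)
mcnemar.test(tab)$p.value
```

Working through the arithmetic shows why accuracy alone is misleading here: it equals the proportion of the majority class, while Kappa and balanced accuracy reveal that the predictions carry no information.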
What are the Advantages and Disadvantages of Stepwise Logistic Regression?
Stepwise logistic regression has some advantages and disadvantages you should know before using it.
Some of the advantages are:
- It can reduce the complexity and improve the model's performance by eliminating irrelevant or redundant variables.
- It can help to avoid overfitting, multicollinearity, and high variance, as well as to increase interpretability and generalizability.
- It can be easy and fast to implement and automate using functions such as stepAIC.
Some of the disadvantages are:
- It can be sensitive to the order of variable entry or removal, which can lead to different final models depending on the starting point and direction of the procedure.
- Due to the multiple testing and data snooping involved in the process, it can produce coefficient estimates that are biased away from zero, standard errors that are too small, p-values that are too low, and confidence intervals that are too narrow.
- It can ignore meaningful interactions or nonlinear effects among the variables and potential confounding or moderating factors.
- It can be computationally intensive and time-consuming, especially when dealing with large data sets or many potential predictors.
How to Avoid Some Common Pitfalls and Challenges of Stepwise Logistic Regression
Stepwise logistic regression can be a valuable tool for variable selection, but it also comes with pitfalls and challenges that you should avoid or overcome.
Common pitfalls and challenges:
- Choosing an appropriate criterion for variable selection
- Checking the assumptions and diagnostics of logistic regression
- Validating and generalizing the results
FAQs
What is the difference between forward and backward stepwise regression?
Forward stepwise regression starts with an intercept-only model and adds variables one at a time based on their significance and contribution to the model fit. Backward stepwise regression starts with a full model and removes variables one at a time based on their significance and contribution to the model fit.
What is the advantage of both-direction stepwise regression?
Both-direction stepwise regression combines forward and backward steps, adding and removing variables based on their significance and contribution to the model fit. This more flexible method can explore more possible models than forward or backward stepwise regression alone.
How to choose the best criterion for variable selection in stepwise logistic regression?
The best criterion for variable selection depends on the data and the problem. Common criteria are AIC and BIC; Mallows' Cp is also used, mainly for linear regression, where it behaves much like AIC. AIC charges a penalty of 2 per estimated parameter, while BIC charges log(n), which exceeds 2 once the sample has 8 or more observations. As a result, AIC tends to keep more variables and produce more complex models, whereas BIC tends to keep fewer variables and produce more parsimonious models. You should compare different criteria and choose the one that best suits your data and problem.
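In stepAIC, the penalty per parameter is set with the k argument: the default k = 2 corresponds to AIC, and k = log(n) gives a BIC-style search. The following sketch compares the two on simulated data with two informative predictors and four noise predictors; the data and object names are made up for the illustration.

```r
library(MASS)  # provides stepAIC

# Simulated data: X1 and X2 carry signal, X3 to X6 are noise
set.seed(123)
n     <- 500
dat   <- as.data.frame(matrix(rnorm(n * 6), nrow = n))
names(dat) <- paste0("X", 1:6)
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$X1 - 1.0 * dat$X2))

full_fit <- glm(y ~ ., data = dat, family = binomial)

# Backward elimination with the AIC penalty (k = 2, the default)
aic_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE)

# Backward elimination with the BIC penalty (k = log(n))
bic_fit <- stepAIC(full_fit, direction = "backward", trace = FALSE,
                   k = log(nrow(dat)))

formula(aic_fit)
formula(bic_fit)
c(aic_terms = length(coef(aic_fit)) - 1,
  bic_terms = length(coef(bic_fit)) - 1)
```

Because every candidate here costs one parameter, the heavier BIC penalty can only continue removing terms after the AIC search has stopped, so the BIC model is never larger in this setup.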
How to check the assumptions and diagnostics of logistic regression before and after performing stepwise logistic regression?
Logistic regression has some assumptions and diagnostics that you should check before and after performing stepwise logistic regression. For example, you should check for linearity in the logit, independence of errors, absence of multicollinearity, outliers, influential points, etc. You can use functions from packages such as car or ggplot2 to perform these checks.
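Two of these checks can be sketched with base R alone: a variance inflation factor (VIF) for multicollinearity and Cook's distance for influential points. The data below are simulated with a deliberately collinear pair of predictors, and the VIF is computed from the predictors only, so the values may differ somewhat from what a package such as car reports for a fitted glm.

```r
# Simulated data: x2 is deliberately collinear with x1
set.seed(7)
n   <- 400
x1  <- rnorm(n)
x2  <- x1 + rnorm(n, sd = 0.3)
x3  <- rnorm(n)
y   <- rbinom(n, 1, plogis(0.8 * x1 + 0.5 * x3))
dat <- data.frame(y, x1, x2, x3)

fit <- glm(y ~ x1 + x2 + x3, data = dat, family = binomial)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors
predictors <- c("x1", "x2", "x3")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = dat))$r.squared
  1 / (1 - r2)
})
round(vif, 2)

# Influence: the observations with the largest Cook's distance
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)
```

Large VIF values for x1 and x2 flag the collinear pair, which is the kind of situation in which stepwise selection may keep either variable depending on the direction of the search.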
How do we validate and generalize the results of stepwise logistic regression to new or unseen data?
You can use various methods to validate and generalize the results of stepwise logistic regression to new or unseen data. For example, you can use cross-validation, bootstrap, or a hold-out test set. Cross-validation splits the data into k folds and, in turn, uses k-1 folds for training and the remaining fold for testing. Bootstrap resamples the data with replacement, refits the model on each resample, and evaluates it on the original data or on the observations left out of that resample. The hold-out approach splits the data into a training set and a test set once. Whichever method you use, repeat the stepwise selection inside each training portion; otherwise the evaluation data have already influenced which variables were chosen, and the performance estimate is too optimistic. You should report your results with appropriate measures of uncertainty, such as standard errors, confidence intervals, or prediction intervals.
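Here is a minimal sketch of k-fold cross-validation in which the stepwise search is rerun on every training fold. It uses simulated data and hypothetical names (dat, fold, acc), and accuracy at a 0.5 cutoff as the only metric, which is an arbitrary choice for the illustration.

```r
library(MASS)  # provides stepAIC

# Simulated data: x1 and x2 carry signal, x3 is noise
set.seed(42)
n     <- 500
dat   <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$x1 - 1.0 * dat$x2))

n_folds <- 5
fold    <- sample(rep(seq_len(n_folds), length.out = n))  # random fold labels
acc     <- numeric(n_folds)

for (i in seq_len(n_folds)) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]

  # Selection is redone on each training fold, so the held-out fold
  # plays no part in choosing the variables
  full <- glm(y ~ ., data = train, family = binomial)
  sel  <- stepAIC(full, direction = "backward", trace = FALSE)

  prob   <- predict(sel, newdata = test, type = "response")
  acc[i] <- mean((prob > 0.5) == (test$y == 1))
}

round(c(mean = mean(acc), sd = sd(acc)), 3)
```

The spread of the fold accuracies gives a rough sense of the uncertainty in the estimate. You can also record formula(sel) in each fold to see how stable the selected variables are.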
Conclusion
In this article, you learned:
- What is stepwise logistic regression, and why use it
- How to perform stepwise logistic regression in R using the stepAIC function
- How to compare different stepwise methods, such as forward, backward, and both-direction selection
- How to interpret and evaluate the results of stepwise logistic regression
- What are the advantages and disadvantages of stepwise logistic regression
- How to avoid some common pitfalls and challenges of stepwise logistic regression