Statistical Modeling

Dr. Courtney Brown

Assignment #6

For this assignment, you will explore the idea of multicollinearity using the same data set as in the last assignment. As with all assignments in this course, remember that this is an exercise in scientific writing, so explain your results clearly enough that anyone can understand your findings. You are to create a regression model that suffers from multicollinearity. Then you are to fix the problem by using the statistical output to determine which variables are causing it and deleting them from the model. Thus, you will need two tables: one with the full model that has too much multicollinearity, and one with the reduced model in which the multicollinearity problem has been repaired. Be sure to explain how you used the statistical output to determine which variables were the problem ones. Be sure not to delete a variable that you think is really important for explaining your dependent variable. Use ridge regression to help you diagnose which parameter estimates are most susceptible to the effects of multicollinearity.

The VIF numbers are the reciprocals of the TOL numbers. Some people think that VIF numbers greater than 2.5 (TOL numbers less than .4) are a problem. Others say that VIFs greater than 10 (TOLs lower than .1) are a problem. Thus, there is no hard and fast rule. The TOL numbers range from 0 to 1. Each one is 1 minus the R-squared obtained when that independent variable is regressed on all of the other independent variables in the model. Thus, a TOL number near 1 means that the variable shares little variance with the other predictors and multicollinearity is not a problem for it, whereas a TOL number near zero indicates a potential problem. In such cases, the standard errors will be large and the t-statistics will be small. But the OLS estimates themselves are still unbiased and BLUE. Some of the estimates may still be OK, but others may be very flaky. Ridge regression can help you see which estimates are unstable.
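To make the TOL definition concrete, here is a minimal sketch that computes TOL and VIF by hand from the auxiliary regressions. It uses simulated data, not panel80.txt, and the names sim, x1, x2, x3, tol.by.hand, and vif.by.hand are made up for illustration.

```r
# Simulated data (hypothetical, not from panel80.txt): x2 is built to track x1.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)  # nearly collinear with x1
x3 <- rnorm(n)                 # unrelated to the other two
sim <- data.frame(x1, x2, x3)

# TOL for a predictor is 1 minus the R-squared from regressing that
# predictor on all of the other predictors.
tol.by.hand <- sapply(names(sim), function(v) {
  aux <- lm(reformulate(setdiff(names(sim), v), response = v), data = sim)
  1 - summary(aux)$r.squared
})

# VIF is the reciprocal of TOL.
vif.by.hand <- 1 / tol.by.hand
round(rbind(TOL = tol.by.hand, VIF = vif.by.hand), 3)
```

In this setup x1 and x2 should show low TOL values and x3 a TOL near 1. For a model whose terms each use one degree of freedom, these hand-computed VIFs should agree with what vif() from the car package reports.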

For the dependent variable (use only one), use either feelings for President Jimmy Carter or feelings for challenger Ronald Reagan using the variables CARFEEL3 or REAFEEL3. These variables are for feeling thermometers asked of the survey respondents in September of 1980. Try to find an interesting set of independent variables that explains one of these two dependent variables. Then add some variables (or create interaction terms) that add excessive multicollinearity to the model. What are the signs that multicollinearity may be a problem in your model? Diagnose the problem, and then fix it.

Here is some R code to get you started. The rest is up to you. Further below is a SAS program that may give you some additional ideas. But use R for this assignment.

# First we get our data.
library(car)
mydata <- read.table("panel80.txt") # If names(mydata) below shows V1, V2, ..., reread with header=TRUE.
# attach(mydata) # In case you want to work with the variables directly
names(mydata) # This shows us all the variable names.
# options(scipen=20) # suppress "scientific" notation
options(scipen=0) # Brings things back to normal (0 is the default)
reagan.model <- lm(REAFEEL3 ~ INC + AGE + PARTYID + REPPART3 + INC:AGE + INC*PARTYID*REPPART3, data=mydata)
summary(reagan.model)
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(reagan.model) # These are diagnostic plots.
vif(reagan.model) # Variance inflation factors for the model fitted above.
tol <- 1/vif(reagan.model) # TOL is the reciprocal of VIF.
tol
mysubsetdata<-subset(mydata, select=c(REAFEEL3, REPPART3, INC, AGE, PARTYID)) #This keeps only the variables that we are using.
cor(mysubsetdata, use = "pairwise.complete.obs") # A correlation matrix for the variables in the regression

dev.new() # Opens a new graphics window on any platform (windows() works only on Windows).

library(MASS)
x <- lm.ridge(REAFEEL3 ~ INC + AGE + PARTYID + REPPART3 + INC:AGE + INC*PARTYID*REPPART3, data=mydata, lambda=seq(0,100,by=1))
plot(x)
title("Ridge Regression")
abline(h=0)
abline(v=50,lty=3)
x # This prints out the values of the ridge estimates as lambda increases.
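
One possible way to read the ridge output is to compare each coefficient at lambda = 0 (which is OLS) with its value at a larger lambda. Estimates that move a lot are the unstable ones. The sketch below uses simulated data rather than panel80.txt, and the names sim, ridge.fit, b, and change are made up for illustration.

```r
library(MASS)

# Simulated data (hypothetical): x1 and x2 are nearly collinear, x3 is not.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

ridge.fit <- lm.ridge(y ~ x1 + x2 + x3, data = sim, lambda = seq(0, 100, by = 1))

# coef() returns one row per lambda; the first column is the intercept.
b <- coef(ridge.fit)

# How far each slope moves between lambda = 0 and the largest lambda.
change <- b[nrow(b), -1] - b[1, -1]
round(change, 3)

# MASS also reports some commonly used lambda choices (HKB, L-W, GCV).
MASS::select(ridge.fit)
```

The same comparison can be applied to the ridge object x created above from the survey data.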