Statistical Modeling
Dr. Courtney Brown
Assignment #5
For this assignment, you will work with dummy variables to explore conditional relationships using R. As with all assignments in this course, remember that this is an assignment in scientific writing, so explain your results clearly enough that anyone can understand your findings. Be sure to read the assigned article by Gerald Wright, "Linear Models for Evaluating Conditional Relationships." You are to use the program below to create and use both intercept and slope dummy variables. Be sure to eliminate the normal intercept (regression through the origin) so that you avoid a "not full rank" problem.
The program below creates intercept dummy variables based on race and slope dummy variables based on race and partyid. This combination for the slope dummy variables is not particularly useful since very few African Americans are Republican (or were in 1980). Thus, you are to create new dummy variables that make sense. If you still want to use race intercept dummy variables, then you should change the slope dummy variables so that they are based on some variable other than partyid. But you could also create intercept dummy variables based on gender, or something else entirely that has nothing to do with race, and then the partyid slope dummy variables would be a good idea. You decide what to do. The bottom line is that you need to have intercept dummy variables, and then you need to create slope dummy variables that are based on the combination of those intercept dummy variables and some other variable.
Finally, you are to conduct a test to see if the parameter estimates for the intercept dummy variables are equal, and another test to see if the parameter estimates for the slope dummy variables are equal. All of this is done with R. Be sure to INTERPRET YOUR RESULTS with a few pages of text.
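To see what the "not full rank" problem mentioned above looks like in practice, here is a minimal sketch on simulated data. The names group, d1, d2, and y are hypothetical and are not variables from the survey data set.

```r
# Simulated data; 'group', 'd1', 'd2', and 'y' are hypothetical names.
set.seed(1)
group <- rep(c(1, 2), each = 50)
d1 <- (group == 1) + 0   # intercept dummy for group 1
d2 <- (group == 2) + 0   # intercept dummy for group 2
y  <- 40 * d1 + 60 * d2 + rnorm(100, sd = 5)

# With the ordinary intercept kept, d1 + d2 equals the constant column,
# so one coefficient cannot be estimated and R reports it as NA.
with.intercept <- lm(y ~ d1 + d2)
coef(with.intercept)

# Dropping the intercept ("- 1") lets each group have its own intercept.
no.intercept <- lm(y ~ d1 + d2 - 1)
coef(no.intercept)
```

In the second model, each dummy coefficient is simply the estimated intercept for its own group, which is what makes the equality test between them meaningful.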
For the dependent variable (use only one), use either feelings for President Jimmy Carter or feelings for challenger Ronald Reagan using the variables CARFEEL3 or REAFEEL3. These variables are for feeling thermometers asked of the survey respondents in September of 1980. Try to find an interesting set of independent variables that explains one of these two dependent variables.
You will be working with multiple regression using the same data set that you used in a previous assignment. If needed, first download the survey data set for the Reagan vs. Carter election in 1980. Put the extracted data set in your R_Working_Directory.
Here is some help in interpreting the variable values.
Feeling thermometers: 0 to 100, with 50 being neutral.
Liberal/conservative scales: 1=extreme liberal, 7=extreme conservative.
Inter1-Inter3: respondent's interest in the campaign/low to high.
P1 through P4: This refers to the panel wave, January, July, Sept. & Nov.
Expectation to vote: 5 will vote, 1 no.
Education: years of education
Income: not in thousands of dollars, but a scale, low to high.
Frequency of church attendance: low to high
R : This refers to the respondent.
Generally, all of the variables go from low to high. Thus, if you see a variable and you do not know the coding scheme, assume that a small number means less and a larger number means more. The other codes are in the variable labels.
Most of the variables for this data set originated as a panel study supplied by the Inter-university Consortium for Political and Social Research (ICPSR). Emory University is a member of the ICPSR. I have added some contextual variables to the survey data set by extracting these contextual data from a separate ICPSR data set.
Here are the variables in the data set:
V3543= NEIGHBOR #1-VOTE F PRES REF=3543 ID=763
V3547= NEIGHBOR #2-VOTE F PRES REF=3547 ID=763
V3551= NEIGHBOR #3-VOTE F PRES REF=3551 ID=763
INTER1= INTEREST IN POLITICS FOR R,P1
INTER2= INTEREST IN POLITICS FOR R,P2
INTER3= INTEREST IN POLITICS FOR R,P3
INTER4= INTEREST IN POLITICS FOR R,P4
INFO1= INFORMATION LEVEL FROM NEWS FOR R,P1
INFO2= INFORMATION LEVEL FROM NEWS FOR R,P2
DEMCAND1= FEELING THERM. FOR ALL DEM. CANDS,P1
DEMCAND2= FEELING THERM. FOR ALL DEM CANDS, P2
DEMCAND3= FEELING THERM. FOR ALL DEM CANDS, P3
REPCAND1= FEELING THERM FOR ALL REP CANDS, P1
REPCAND2= FEELING THERM FOR ALL REP CANDS, P2
REPCAND3= FEELING THERM FOR ALL REP CANDS, P3
DEMPART1= FEELING THERM FOR DEM PARTY, P1
DEMPART2= FEELING THERM FOR DEM PARTY, P2
DEMPART3= FEELING THERM FOR DEM PARTY, P3
REPPART1= FEELING THERM FOR REP PARTY, P1
REPPART2= FEELING THERM FOR REP PARTY, P2
REPPART3= FEELING THERM FOR REP PARTY, P3
PARTIES1= FEELING THERM FOR BOTH PARTIES,P1
PARTIES2= FEELING THERM FOR BOTH PARTIES,P2
PARTIES3= FEELING THERM FOR BOTH PARTIES,P3
INDFEEL1= FEELING THERM FOR INDEPENDENTS,P1
INDFEEL2= FEELING THERM FOR INDEPENDENTS,P2
INDFEEL3= FEELING THERM FOR INDEPENDENTS,P3
CARFEEL1= FEELING THERM FOR CARTER, P1
CARFEEL2= FEELING THERM FOR CARTER, P2
CARFEEL3= FEELING THERM FOR CARTER, P3
REAFEEL1= FEELING THERM FOR REAGAN, P1
REAFEEL2= FEELING THERM FOR REAGAN, P2
REAFEEL3= FEELING THERM FOR REAGAN, P3
KENFEEL1= FEELING THERM FOR KENNEDY, P1
KENFEEL2= FEELING THERM FOR KENNEDY, P2
KENFEEL3= FEELING THERM FOR KENNEDY, P3
NEWV121= LIB/CON SCALE FOR R, P1
NEWV2125= LIB/CON SCALE FOR R, P2
NEWV3213= LIB/CON SCALE FOR R, P3
NEWV122= LIB/CON SCALE FOR CARTER, P1
NEWV2126= LIB/CON SCALE FOR CARTER, P2
NEWV3214= LIB/CON SCALE FOR CARTER, P3
NEWV123= LIB/CON SCALE FOR REAGAN, P1
NEWV2127= LIB/CON SCALE FOR REAGAN, P2
NEWV3215= LIB/CON SCALE FOR REAGAN, P3
NEWV130= LIB/CON SCALE FOR REPS, P1
NEWV2134= LIB/CON SCALE FOR REPS, P2
NEWV3224= LIB/CON SCALE FOR REPS, P3
NEWV131= LIB/CON SCALE FOR DEMS, P1
NEWV2135= LIB/CON SCALE FOR DEMS, P2
NEWV3225= LIB/CON SCALE FOR DEMS, P3
PARTYID1= PARTY ID, P1
PARTYID2= PARTY ID, P2
PARTYID3= PARTY ID, P3
PARTYID4= PARTY ID, P4
PARTYID= 1 STRONG DEMOCRAT, 7 STRONG REPUBLICAN
NEWV251= EXPECTATION TO VOTE FOR R,P1
NEWV2272= EXPECTATION TO VOTE FOR R,P2
NEWV3081= EXPECTATION TO VOTE FOR R,P3
COMMUN1= R COMMUNICATED ABOUT CAMPAIGN,P1
COMMUN2= R COMMUNICATED ABOUT CAMPAIGN,P2
COMMUN4= R COMMUNICATED ABOUT CAMPAIGN,P4
COMMALL= R COMMUNICATED ABOUT CAMPAIGN,ALL
EDUC= EDUCATION OF R
INC= INCOME OF R
REL= RELIGION OF R
RELFREQ= FREQ OF CHURCH ATTENDANCE FOR R
STATUS= STATUS (INC+ED) OF R
STATE= STATE OF RESIDENCE FOR R
PRESTOCO= TOTAL PRESIDENTIAL VOTE, COUNTY
PRESTOST= TOTAL PRES. VOTE, STATE 9
CONGTOCO= TOTAL CONGRESSIONAL VOTE,COUNTY
PDEMCONT= PROP. PRES DEM VOTE, COUNTY
PREPCONT= PROP. PRES REP VOTE, COUNTY
PDEMSTAT= PROP. PRES DEM VOTE, STATE
PREPSTAT= PROP. PRES REP VOTE, STATE
CDEMCONT= PROP. CONG. DEM VOTE, COUNTY
CREPCONT= PROP. CONG. REP VOTE, COUNTY
RFRIENDS= ALL 3 NEIGHBORS INTEND VOTE REAGAN
DFRIENDS= ALL 3 NEIGHBORS INTEND VOTE CARTER
MFRIENDS= 3 NEIGHBORS SPLIT IN VOTE INTENTION
AGE
SEX= 1 IS MALE AND 2 IS FEMALE
RACE= 1 IS WHITE, 2 IS BLACK, 3 IS OTHER
REGION= THE SOLID SOUTH IS 4
DIDVOTE= R VOTED 1 IS YES AND 2 IS NO
VOTE= 1 REAGAN 2 CARTER 3 CLARK 4 ANDERSO
PARTREG= PARTY REGISTRATION 1 DEM 2 IND 3 REP
VOTEVALI= VOTER VALIDATION 1 VALIDATED 2 NO
Here is some R code to get you started. The rest is up to you. I am giving you two ways to create the dummy variables. The first may be more intuitive. Further below is a SAS program that may give you some additional ideas. But use R for this assignment. To learn how to test for coefficient equality, see this link and follow the R code below:
http://www.nd.edu/~rwilliam/stats2/l42.pdf
An alternate approach using a Z-statistic can be found here:
http://www.udel.edu/soc/faculty/parker/SOCI836_S08_files/Paternosteretal_CRIM98.pdf
This article is also useful:
http://www.stat.ufl.edu/~aa/sta6127/ch13.pdf
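The Z-statistic approach mentioned above compares a coefficient estimated in two separate regressions on independent samples. The following is a minimal sketch on simulated data, assuming the commonly cited form Z = (b1 - b2) / sqrt(SE1^2 + SE2^2); all object names are hypothetical.

```r
# Simulated data for two independent samples; all names are hypothetical.
set.seed(2)
n <- 200
sample.a   <- data.frame(x = rnorm(n))
sample.a$y <- 1 + 0.5 * sample.a$x + rnorm(n)
sample.b   <- data.frame(x = rnorm(n))
sample.b$y <- 1 + 0.9 * sample.b$x + rnorm(n)

# One regression per sample.
fit.a <- lm(y ~ x, data = sample.a)
fit.b <- lm(y ~ x, data = sample.b)

# Slope estimates and their standard errors.
b.a  <- unname(coef(fit.a)["x"])
b.b  <- unname(coef(fit.b)["x"])
se.a <- sqrt(vcov(fit.a)["x", "x"])
se.b <- sqrt(vcov(fit.b)["x", "x"])

# Z statistic for the difference, with a two-sided p-value.
z.stat  <- (b.a - b.b) / sqrt(se.a^2 + se.b^2)
p.value <- 2 * pnorm(abs(z.stat), lower.tail = FALSE)
c(z = z.stat, p = p.value)
```

There is no covariance term here because the two estimates come from separate samples. When both parameters come from a single regression, as in this assignment, the covariance from the coefficient-covariance matrix has to be included.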
Method #1 for creating the dummy variables:
# First we get our data.
mydata <- read.table("panel80.txt")
# attach(mydata) # In case you want to work with the variables directly
names(mydata) # This shows us all the variable names.
# options(scipen=20) # suppress "scientific" notation
options(scipen=0) # Brings things back to normal (0 is the default)
# Now let's create our intercept dummy variables.
mysubsetdata <- subset(mydata, RACE == 1 | RACE == 2) # This gets rid of the "other" category in race.
# The method below creates a logical (TRUE/FALSE) vector, and then converts it into a number (1 or 0) by adding a zero to it.
white <- (mysubsetdata$RACE==1)+0
black <- (mysubsetdata$RACE==2)+0
races <- cbind(white,black)
races # Prints out the race data
# Now we create the slope dummy variables and set up our linear regression model.
wpartyid=white*mysubsetdata$PARTYID # This is one way of creating the whites only slope dummy variable for partyid.
bpartyid=black*mysubsetdata$PARTYID # This is one way of creating the blacks only slope dummy variable for partyid.
carter.model <-lm(mysubsetdata$CARFEEL3 ~ white + black + wpartyid + bpartyid + mysubsetdata$SEX + mysubsetdata$AGE - 1)
summary(carter.model)
# Note: When the intercept is suppressed, summary() reports an R-squared based on the uncentered total sum of squares, sum(y^2), which is not the usual R-squared.
# The code below calculates the usual (mean-centered) R-squared, using only the observations that were actually used in the regression.
rss.residuals <- sum((residuals(carter.model))^2)
rss.residuals
y.used <- model.response(model.frame(carter.model)) # The dependent variable for the cases that lm kept after dropping missing values
mean.dep.var <- mean(y.used)
mean.dep.var
tss.dep.var <- sum((y.used-mean.dep.var)^2)
tss.dep.var
r.square <- 1 - (rss.residuals/tss.dep.var)
r.square
# One way to test for the equality of two regression parameters is with an F-test, using the Wald procedure.
# Another way is to make a confidence interval.
# Either way, you will need the coefficient-covariance matrix. Here it is:
V <- vcov(carter.model)
V
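On the R-squared note in the script above: the following is a minimal sketch, on simulated data with hypothetical names, of the difference between the R-squared that summary() reports for a no-intercept model and the usual mean-centered version.

```r
# Simulated data; all names here are hypothetical.
set.seed(8)
x  <- rnorm(100)
d1 <- rep(c(1, 0), 50)   # intercept dummy for group 1
d2 <- 1 - d1             # intercept dummy for group 2
y  <- 50 * d1 + 55 * d2 + 2 * x + rnorm(100)

fit <- lm(y ~ d1 + d2 + x - 1)
rss <- sum(residuals(fit)^2)

r2.reported   <- summary(fit)$r.squared           # what summary() prints
r2.uncentered <- 1 - rss / sum(y^2)               # total taken around zero
r2.centered   <- 1 - rss / sum((y - mean(y))^2)   # total taken around the mean

c(reported = r2.reported, uncentered = r2.uncentered, centered = r2.centered)
```

The reported value matches the uncentered calculation, and it is larger than the centered one whenever the dependent variable has a nonzero mean, as a feeling thermometer does.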
METHOD #1: The Wald Procedure
At this point, you will need a paper and pencil to finish the equality test since the lm procedure in R does not do this for you automatically. From the coefficient-covariance matrix, V, take the variance of each of the two parameter estimates (the diagonal entries for those parameters) and the covariance between them (the off-diagonal entry), plus the original parameter estimates from the regression, and then calculate your F-statistic using this formula:
F = [(parameter 1 - parameter 2) / sqrt(variance of parameter 1 + variance of parameter 2 - 2*(covariance of the two parameters))]^2
or, using R
F <- ((P1 - P2)/(sqrt(varP1 + varP2 - 2*covP1P2)))^2
where P1 and P2 come from your regression, and varP1, varP2 and covP1P2 come from your coefficient-covariance matrix, V. You can also get varP1 and varP2 by squaring the standard errors for P1 and P2 that you get from your regression.
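Instead of copying numbers by hand, the pieces of the formula can be pulled out of the fitted model by name. The following is a minimal sketch on simulated data; the variables g1, g2, g1x, g2x, x, age, and y are hypothetical stand-ins for your own intercept dummies, slope dummies, and covariates.

```r
# Simulated data; every variable name here is hypothetical.
set.seed(3)
n   <- 300
g1  <- rbinom(n, 1, 0.5)                 # intercept dummy, group 1
g2  <- 1 - g1                            # intercept dummy, group 2
x   <- sample(1:7, n, replace = TRUE)    # a 1-7 scale, like PARTYID
age <- sample(18:80, n, replace = TRUE)
g1x <- g1 * x                            # slope dummy, group 1
g2x <- g2 * x                            # slope dummy, group 2
y   <- 60 * g1 + 70 * g2 - 3 * g1x - 5 * g2x + 0.1 * age + rnorm(n, sd = 10)

fit <- lm(y ~ g1 + g2 + g1x + g2x + age - 1)
b <- coef(fit)
V <- vcov(fit)

# Pick out the estimates, variances, and covariance by coefficient name.
P1      <- unname(b["g1x"])
P2      <- unname(b["g2x"])
varP1   <- V["g1x", "g1x"]
varP2   <- V["g2x", "g2x"]
covP1P2 <- V["g1x", "g2x"]

F.stat <- ((P1 - P2) / sqrt(varP1 + varP2 - 2 * covP1P2))^2
F.stat

# Cross-check: forcing one common slope and comparing the two models
# with anova() gives the same F for this single restriction.
restricted <- lm(y ~ g1 + g2 + x + age - 1)
F.check <- anova(restricted, fit)$F[2]
F.check
```

Indexing V by name rather than by position guards against picking the wrong row or column when the model has many terms.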
Note that the big difference between the F-statistic and the t-statistic is that with the F you are squaring everything, including the difference between the two parameter values.
Now that you have the F statistic, you need the degrees of freedom, and there are two types; call them df1 and df2. Here, df1 = 1, or the number of tests involved (there is just one), and df2 = (N - k), where N is the number of observations used in your regression, and k is the number of slope and intercept parameters that you are estimating in your model. Now, using R, do the following, substituting the right numbers for F, df1, and df2. The result is the p-value for the test:
F.stats <- pf(F, df1, df2, lower.tail=FALSE) # Upper-tail probability. Note that df() gives the density of the F distribution, not the p-value.
F.stats
Alternatively, you can use your F-distribution table that is in the back of your book.
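As a small sketch of the lookup step, the following uses made-up values for the F statistic and df2 (they are assumptions for illustration, not results from the survey data). It gets the p-value from pf() and also the 5% critical value that you would otherwise read from the table. Note that pf(), not df(), returns a probability; df() is the density function.

```r
# Hypothetical values, for illustration only.
F.value <- 4.2
df1 <- 1      # one restriction is being tested
df2 <- 250    # stands in for N - k

# p-value: probability of an F at least this large if the parameters are equal.
p.value <- pf(F.value, df1, df2, lower.tail = FALSE)
p.value

# Critical value at the 5% level, the number found in an F table.
critical <- qf(0.95, df1, df2)
critical

# Reject equality when the F statistic exceeds the critical value.
reject <- F.value > critical
reject
```

With df1 = 1, the critical F is just the square of the two-sided critical t with df2 degrees of freedom, which is the link between the F and t versions of the test.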
METHOD #2: The Confidence Interval
Perhaps the easiest way to proceed is simply to calculate a 95% confidence interval (using 1.96 times the standard error) for the difference between the two parameters. Thus, you want a confidence interval for (P1-P2). This is easy to do with the combined standard error that you used above, sqrt(varP1 + varP2 - 2*covP1P2), where you get these things from the coefficient-covariance matrix, V, as before. Look to see if zero is inside the confidence interval, and you are done! If zero is inside the interval, you cannot reject the hypothesis that the two parameters are equal at the 5% level. (For a reference, see Eric A. Hanushek and John E. Jackson, Statistical Methods for Social Scientists, New York: Academic Press, 1977, p. 124.)
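The confidence interval can be wrapped in a small helper. This is a minimal sketch: the function equalityCI and the simulated variables men, women, x, and y are hypothetical names introduced for illustration.

```r
# Hypothetical helper: 95% confidence interval for (P1 - P2),
# where name1 and name2 are coefficient names in a fitted lm model.
equalityCI <- function(model, name1, name2, z = 1.96) {
  b  <- coef(model)
  V  <- vcov(model)
  d  <- unname(b[name1] - b[name2])
  se <- sqrt(V[name1, name1] + V[name2, name2] - 2 * V[name1, name2])
  c(difference = d, lower = d - z * se, upper = d + z * se)
}

# Simulated data; all names are hypothetical.
set.seed(5)
n     <- 250
men   <- rbinom(n, 1, 0.5)   # intercept dummy
women <- 1 - men             # intercept dummy
x     <- sample(1:7, n, replace = TRUE)
y     <- 55 * men + 62 * women - 2 * x + rnorm(n, sd = 12)

fit <- lm(y ~ men + women + x - 1)
ci  <- equalityCI(fit, "men", "women")
ci

# TRUE means zero is inside the interval, so equality is not rejected.
zero.inside <- unname(ci["lower"] <= 0 & ci["upper"] >= 0)
zero.inside
```

The same helper works for a pair of slope dummy variables; only the two coefficient names change.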
A NOTE ABOUT STANDARDIZED PARAMETER ESTIMATES WHEN USING DUMMY VARIABLES: If you want to compute the standardized parameter estimates for your model, do not standardize the intercept dummy variables. Leave those as values of 0 and 1. But when you standardize your variables, you will need to create a new data set that contains all the variables in your model, including the variables that you created yourself, such as the slope dummy variables. Try using the cbind command to combine your primary data set with the other variables (but not the intercept dummy variables). Then standardize the new data set.
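One way to follow the standardization note is sketched below on simulated data. The data are made up, the variable feel is a hypothetical stand-in for a feeling thermometer, and standardizing the dependent variable along with the others is an assumption of this sketch.

```r
# Simulated data; values are made up and 'feel' is a hypothetical name.
set.seed(6)
n       <- 200
white   <- rbinom(n, 1, 0.7)    # intercept dummy, left as 0/1
black   <- 1 - white            # intercept dummy, left as 0/1
partyid <- sample(1:7, n, replace = TRUE)
age     <- sample(18:90, n, replace = TRUE)
wpartyid <- white * partyid     # slope dummy
bpartyid <- black * partyid     # slope dummy
feel <- 50 * white + 70 * black - 2 * wpartyid - 4 * bpartyid +
        0.1 * age + rnorm(n, sd = 10)

# Combine everything EXCEPT the intercept dummies, then standardize.
to.standardize <- cbind(feel, wpartyid, bpartyid, age)
zdata <- as.data.frame(scale(to.standardize))

# Add the untouched 0/1 intercept dummies back in.
zdata$white <- white
zdata$black <- black

std.model <- lm(feel ~ white + black + wpartyid + bpartyid + age - 1,
                data = zdata)
summary(std.model)
```

The scale() function subtracts each column's mean and divides by its standard deviation, so every standardized column ends up with mean 0 and standard deviation 1.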
Method #2 for creating the dummy variables:
# First we get our data.
mydata <- read.table("panel80.df")
# attach(mydata) # In case you want to work with the variables directly
names(mydata) # This shows us all the variable names.
# options(scipen=20) # suppress "scientific" notation
options(scipen=0) # Brings things back to normal (0 is the default)
# Now let's create our intercept dummy variables.
mysubsetdata <- subset(mydata, RACE == 1 | RACE == 2) # This gets rid of the "other" category in race.
xf <- factor(mysubsetdata$RACE, levels=1:2) # This factors the RACE variable.
xfnew <- as.data.frame(model.matrix(~xf-1)) # This creates a new data frame out of the RACE factors.
white <- xfnew$xf1
black <- xfnew$xf2
racedata <- cbind(white, black) # We include the new variable with our data set
racedata # Prints out the race data
# Now we create the slope dummy variables and set up our linear regression model.
wpartyid=white*mysubsetdata$PARTYID # This is one way of creating the whites only slope dummy variable for partyid.
bpartyid=black*mysubsetdata$PARTYID # This is one way of creating the blacks only slope dummy variable for partyid.
carter.model <-lm(mysubsetdata$CARFEEL3 ~ white + black + wpartyid + bpartyid + mysubsetdata$SEX + mysubsetdata$AGE - 1)
summary(carter.model)
# Note: When the intercept is suppressed, summary() reports an R-squared based on the uncentered total sum of squares, sum(y^2), which is not the usual R-squared.
# The code below calculates the usual (mean-centered) R-squared, using only the observations that were actually used in the regression.
rss.residuals <- sum((residuals(carter.model))^2)
rss.residuals
y.used <- model.response(model.frame(carter.model)) # The dependent variable for the cases that lm kept after dropping missing values
mean.dep.var <- mean(y.used)
mean.dep.var
tss.dep.var <- sum((y.used-mean.dep.var)^2)
tss.dep.var
r.square <- 1 - (rss.residuals/tss.dep.var)
r.square
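Both methods above build the dummy variables by hand. As a point of comparison, here is a minimal sketch, on simulated data with made-up values, suggesting that R's formula interface can produce the same intercept and slope dummy estimates from a factor. The object names race.f, by.hand, and by.formula are hypothetical.

```r
# Simulated data; values are made up for illustration.
set.seed(7)
n       <- 200
RACE    <- sample(1:2, n, replace = TRUE, prob = c(0.8, 0.2))
PARTYID <- sample(1:7, n, replace = TRUE)
y <- ifelse(RACE == 1, 60 - 4 * PARTYID, 75 - 2 * PARTYID) + rnorm(n, sd = 10)

# Dummy variables built by hand, as in Method #1.
white    <- (RACE == 1) + 0
black    <- (RACE == 2) + 0
wpartyid <- white * PARTYID
bpartyid <- black * PARTYID
by.hand  <- lm(y ~ white + black + wpartyid + bpartyid - 1)

# The same model written with a factor and an interaction term.
race.f     <- factor(RACE, levels = 1:2, labels = c("white", "black"))
by.formula <- lm(y ~ race.f + race.f:PARTYID - 1)

coef(by.hand)
coef(by.formula)
all.equal(unname(coef(by.hand)), unname(coef(by.formula)))
```

Building the dummies by hand is still the better choice for this assignment, since it makes it obvious which coefficient belongs to which group when you set up the equality tests.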
* Below is the SAS code that does the same thing as above;
libname windata 'e:\';
GOPTIONS lfactor=10 hsize=6 in vsize=6 in horigin=1 in vorigin=1 in;
options nocenter ls=120;
**********************************************************;
* CLASS, NOTE THAT IF YOU BEGIN A LINE WITH AN ASTERISK *
* THEN YOU CAN PUT NOTES IN YOUR PROGRAM FILES. THIS IS
* LIKE A COMMENT CARD IN SPSS. HOWEVER, REMEMBER
* TO EVENTUALLY PUT A FINAL SEMICOLON AT THE END OF YOUR COMMENTS.;
***********************************************************;
* NOTE THAT I INDENT SOME STATEMENTS. THIS
* IS JUST FOR NEATNESS.;
***********************************************************;
* COPYRIGHT (c) Courtney Brown 2005, All Rights Reserved;
* Permission granted to use this file and computer code for any nonprofit and
* educational purposes, including classroom instruction.
* No further permission required.
* Please cite source as "From www.courtneybrown.com";
***********************************************************;
DATA panel80;SET windata.panel80;
gender=sex;
if ((race eq 1) or (race eq 2)); * This gets rid of the "other" category
in race;
if (race eq 1) then white=1;else white=0; * This creates the intercept dummy
variable for whites;
if (race eq 2) then black=1;else black=0; * This creates the intercept dummy
variable for African Americans;
wpartyid=white*partyid; * This is one way of creating the whites only slope
dummy variable for partyid;
bpartyid=black*partyid; * This is one way of creating the blacks only slope
dummy variable for partyid;
proc reg;
model carfeel3 = white black wpartyid bpartyid gender age / stb tol noint;
test white=black;
test wpartyid=bpartyid;
title 'Carter Feelings';
run;
quit;