1. Your Problem
Your report on your data file is due on December 9, the last day of class, and is worth 600
points, about the number of points on the final examination. Your assignment is to report a model
that fits each of the dependent variables and also to report the usual statistical results for your
models. That is, for each data set, you are to report a model in the usual form, the fraction of variance
explained by your model, the overall observed significance level (p-value) of your
model, the estimated coefficients of your model, their standard errors or t-values, the estimated
standard deviation of the error component, the lack of fit F-test, and a brief report of residual plots,
if necessary.
Your data file contains information on eleven variables observed for 500 cases. The first
column contains an index that represents the time order T of the observations. It should only be an
identifying variable and should not be associated with the dependent variables. The second variable
is the situation number S that is to be used for finding the lack of fit F-test of the adequacy of the
model. I will discuss lack of fit tests later. There are six independent variables contained in columns
three through eight. Columns 3, 4, and 5 are continuous independent variables that I will call X1, X2,
and X3, respectively. Columns 6, 7, and 8 are indicator variables (that is, they either have the value
0 or 1). I will call these I1, I2, and I3, respectively. There are three dependent variables contained
in columns 9, 10, and 11. I will call these Y1, Y2, and Y3 respectively. Two of the dependent
variables may have no association whatsoever with the independent variables, and the other may be
associated with one or more of the independent variables. The data file that I analyzed for this
material is on the Web site for the course. The SPO files that I produced in class are also on the Web
site.
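If you want to check the column layout outside SPSS, the following is a minimal Python sketch. It assumes the data file has been exported as comma-separated text without a header row; the export format, the function name read_cases, and the two sample rows are illustrative assumptions, not part of the course materials.

```python
import csv
import io

# Column labels follow the handout: T (time order), S (situation number),
# X1-X3 (continuous), I1-I3 (0/1 indicators), Y1-Y3 (dependent variables).
COLUMNS = ["T", "S", "X1", "X2", "X3", "I1", "I2", "I3", "Y1", "Y2", "Y3"]


def read_cases(text):
    """Parse headerless comma-separated rows into a dict of column lists.

    Assumption: the file is comma-separated; adjust the reader if your
    export uses tabs or fixed-width columns.
    """
    data = {name: [] for name in COLUMNS}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}")
        for name, value in zip(COLUMNS, row):
            data[name].append(float(value))
    return data


# Two made-up rows, used only to show the layout (not course data).
sample = (
    "1,1,2.5,3.1,0.7,0,1,0,10.2,5.5,7.9\n"
    "2,1,1.9,2.8,1.2,1,0,0,9.8,6.1,8.4\n"
)
cases = read_cases(sample)
print(cases["Y3"])
```

Keeping the eleven labels in one list makes it harder to confuse the six independent variables (columns 3 through 8) with the three dependent variables (columns 9 through 11).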
2. Preliminary Steps
Make sure that you have an electronic copy of your data file. A printout of your data file takes only about eight pages, so I recommend printing it. After you print out the file, scan through it. Keep your copies of the data file in a safe place.
Working a project is easier if you focus on one variable at a time. Take notes on the steps that you wish to perform in your analysis before you start your computer work. As the output is produced, take quick notes on your findings. Remember that you can save your output (the .SPO file). Examine the plots in your saved output more carefully after you have finished your computation.
The first step in a regression analysis is to plot the dependent variable against the independent
variable. The graphs menu in SPSS has a scatterplot submenu. Use the scatterplot routines to plot
the dependent variable against each of the independent variables. The unusual patterns that a
scatterplot may indicate are a nonlinear regression function (a curved pattern), a heteroscedastic error
distribution (horn shaped plot), and outliers. Remember to write down your observations about the
scatterplots. The Y3 variable in this semester's example had obvious linear associations, and there
were no obvious associations between Y1 and any of the independent variables or between Y2 and
any of the independent variables.
3. The Analysis of the Apparently Weak Variable
First, I focused on Y1, the dependent variable that did not appear to have any strong
associations. I used the linear regression routines (Statistics menu, regression submenu, linear choice).
Then I put Y1 in the dependent variable box. I used all of the independent variables (X1, X2, X3,
I1, I2, and I3) in the analysis. I used the default enter option. I clicked on the "statistics" button in
the linear regression specification and asked for "casewise diagnostics" to be shown. From the
"plots" button, I chose "histogram" and "normal probability plot". Finally, from the "save" button,
I chose "unstandardized residuals" and "unstandardized predicted" values.
The regression using all six independent variables explained a small fraction of the variation
of Y1, a result that is consistent with Y1 being a variable with no associations. The F-statistic for
the model with all six variables included was small as well.
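The two summary numbers discussed above, the fraction of variance explained and the overall F-statistic, can be reproduced outside SPSS. The sketch below is one way to do so with NumPy, under stated assumptions: the function name overall_fit is hypothetical, and the data are simulated stand-ins (a response generated with no association to six predictors), not the course data file.

```python
import numpy as np


def overall_fit(X, y):
    """Return (R squared, overall F statistic) for least squares with an intercept."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])      # add the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sse = float(residuals @ residuals)             # error sum of squares
    sst = float(((y - y.mean()) ** 2).sum())       # total sum of squares
    r_squared = 1.0 - sse / sst
    # F tests that all k partial regression coefficients are zero.
    f_stat = ((sst - sse) / k) / (sse / (n - k - 1))
    return r_squared, f_stat


# Simulated stand-in: three continuous and three 0/1 predictors, 500 cases,
# and a response generated independently of all of them.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.normal(size=(n, 3)), rng.integers(0, 2, size=(n, 3))])
y_unrelated = rng.normal(size=n)

r2, f = overall_fit(X, y_unrelated)
print(f"R squared = {r2:.4f}, F = {f:.3f} on 6 and {n - 7} degrees of freedom")
```

With an unrelated response, R squared should be close to zero and F should be near 1, which is the pattern described for Y1.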
A statistician who used 0.01 as the significance level would not reject the null hypothesis that
none of the independent variables was associated with Y1. One method of analysis, called the
"protected F-test," is to stop any further examination of t-tests when the overall test is not significant.
My personal preference is to use a small level of significance, such as 0.01 or lower, and to use a
protected F-test, which guards against inflation of the level of significance from multiple testing.
There is always a tradeoff between the probability of a Type I error and the probability of a Type II error. Reducing the level of significance increases the probability of a Type II error; increasing the level of significance reduces the probability of a Type II error. In practice, a statistician must balance the two probabilities of error. You have the discretion of setting the level of significance in your analysis. Statistical practitioners use a wide range of significance levels. In this example, setting the lower level of significance and assessing Y1 as not having associations matched the way the data were generated. In practice, one would learn the correct decision about an association from the ongoing evolution of research results about the dependent variable.
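The tradeoff can be made concrete with a small calculation. The sketch below deliberately swaps in a simpler setting than the regression F-test: a one-sided z-test in which the true mean lies a fixed number of standard errors above the null value. The function name type_ii_probability and the shift of 2.0 standard errors are assumptions chosen for illustration.

```python
from statistics import NormalDist


def type_ii_probability(alpha, shift):
    """Type II error probability for a one-sided z-test.

    alpha: level of significance (probability of a Type I error).
    shift: distance of the true mean above the null value, in standard errors.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1.0 - alpha)   # reject when z exceeds this value
    # Under the alternative, z is normal with mean `shift` and unit variance,
    # so the test fails to reject with probability P(Z <= critical - shift).
    return std_normal.cdf(critical - shift)


for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_probability(alpha, shift=2.0)
    print(f"alpha = {alpha:.2f}  Type II probability = {beta:.3f}")
```

Running the loop shows the Type II probability rising as alpha falls, which is the cost of choosing a small level of significance such as 0.01.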
I finished my analysis of Y1 by using the explore option in the summarize submenu of the
statistics menu. I selected Y1 as the variable to be analyzed and asked for histograms and normality
tests. I also performed similar analyses on Y2.
4. Analysis of the Apparently Related Variable
The dependent variable Y3 appeared to have associations with the independent variables. I used the
linear regression choice in the regression submenu of the statistics menu. I used Y3 as the dependent
variable and all six independent variables. The fraction of variance explained (R squared value) was
high, showing very clearly that Y3 had one or more associations with the independent variables. The
value of the F-statistic for the test of the null hypothesis that all six partial regression coefficients
were simultaneously zero was high, with an observed significance level reported as 0.000 (that is, less than 0.0005). The t statistics of
the partial regression coefficients showed which variables appeared to have the strongest associations.
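For readers who want to see where the coefficient table comes from, the following NumPy sketch computes the estimated coefficients, their standard errors, the t statistics, and the estimated standard deviation of the error component. The function name coefficient_table is hypothetical, and the data are simulated with known coefficients (1.5 on X1 and -1.0 on I1) purely for illustration; they are not the course data.

```python
import numpy as np


def coefficient_table(X, y):
    """Return (coefficients, standard errors, t statistics, error SD estimate).

    Position 0 is the intercept; positions 1..k follow the columns of X.
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    xtx_inv = np.linalg.inv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    residuals = y - design @ beta
    mse = float(residuals @ residuals) / (n - k - 1)   # estimated error variance
    std_err = np.sqrt(mse * np.diag(xtx_inv))
    return beta, std_err, beta / std_err, float(np.sqrt(mse))


# Simulated stand-in: columns play the roles of X1, X2, X3, I1, I2, I3.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([rng.normal(size=(n, 3)), rng.integers(0, 2, size=(n, 3))])
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)

beta, se, t, sigma_hat = coefficient_table(X, y)
names = ["const", "X1", "X2", "X3", "I1", "I2", "I3"]
for name, b, s, tv in zip(names, beta, se, t):
    print(f"{name:>5}  coef = {b:7.3f}  se = {s:6.3f}  t = {tv:7.2f}")
print(f"estimated error SD = {sigma_hat:.3f}")
```

In the simulated example, only the variables that were given nonzero coefficients should show large t statistics, which is how the t column identifies the strongest associations.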
I then plotted the unstandardized residuals against the unstandardized predicted values. That
plot was patternless. I obtained the approximate lack of fit test by running a one-way analysis of
variance of the unstandardized residuals against the situation variable S, the variable in the second
column of the database. Here, the approximate lack of fit test was 0.99. This indicated that the model
appeared to be adequate.
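The approximate lack of fit test described above is a one-way analysis of variance of the residuals with the situation number S as the grouping factor. A minimal NumPy sketch of that F-statistic follows; the function name one_way_anova_f is hypothetical, and the nine residuals are made-up numbers chosen only so that the arithmetic is easy to check by hand.

```python
import numpy as np


def one_way_anova_f(values, groups):
    """Return (F, df_between, df_within) for a one-way analysis of variance."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for label in labels:
        member = values[groups == label]
        ss_between += member.size * (member.mean() - grand_mean) ** 2
        ss_within += float(((member - member.mean()) ** 2).sum())
    df_between = labels.size - 1
    df_within = values.size - labels.size
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return float(f_stat), int(df_between), int(df_within)


# Made-up residuals for three situations with three cases each.
residuals = [1, 2, 3, 2, 3, 4, 3, 4, 5]
situation = [1, 1, 1, 2, 2, 2, 3, 3, 3]

f_stat, df_b, df_w = one_way_anova_f(residuals, situation)
# Between sum of squares is 6 on 2 df; within sum of squares is 6 on 6 df,
# so F = (6 / 2) / (6 / 6) = 3.0.
print(f_stat, df_b, df_w)
```

In the actual analysis, the values would be the saved unstandardized residuals and the groups would be the S column. A large F (small observed significance level) suggests that the situation means of the residuals differ, that is, that the fitted model is missing some structure.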
5. Example Report
Your report should be brief. You should have three independent sections. One should discuss
Y1, a second Y2, and the third Y3. You must state clearly whether a dependent variable does or does
not appear to be associated with any of the independent variables in your data set. If a variable does
appear to be associated with the independent variables, you must state clearly which independent
variables are associated with the dependent variable. Be sure to include a statement of the model that
you have fitted to the dependent variable. Include the fraction of variance explained and the results
of the lack of fit test.
6. Reminders
The report on the last project is worth as many points as the final examination. Consequently,
it is in your interest to be careful in writing up the results and checking your statements before you
turn in the report. As you come to closure on the analysis of Y1, Y2, and Y3, I recommend that you
write up your results immediately, and begin to edit your report. Make sure that you specify the
independent variables (X1, X2, X3, I1, I2, and I3) that are associated with each dependent variable.