November 29, 1999


Project Two

Multivariate Linear Regression

Exploratory Data Analysis of Synthetic Data



1. Your Problem

Your report on your data file is due on December 9, the last day of class, and is worth 600 points, about the number of points on the final examination. Your assignment is to report a model that fits each of the dependent variables and to report the usual statistical results for your models. That is, for each dependent variable, you are to report a model in the usual form, the fraction of variance explained by your model, the overall observed significance level (p-value) of your model, the estimated coefficients of your model, their standard errors or t-values, the estimated standard deviation of the error component, the lack of fit F-test, and a brief report of residual plots, if necessary.

Your data file contains information on eleven variables observed for 500 cases. The first column contains an index that represents the time order T of the observations. It should only be an identifying variable and should not be associated with the dependent variables. The second variable is the situation number S that is to be used for finding the lack of fit F-test of the adequacy of the model. I will discuss lack of fit tests later. There are six independent variables contained in columns three through eight. Columns 3, 4, and 5 are continuous independent variables that I will call X1, X2, and X3, respectively. Columns 6, 7, and 8 are indicator variables (that is, they have either the value 0 or the value 1). I will call these I1, I2, and I3, respectively. There are three dependent variables contained in columns 9, 10, and 11. I will call these Y1, Y2, and Y3, respectively. Two of the dependent variables may have no association whatsoever with the independent variables, and the other may be associated with one or more of the independent variables. The data file that I analyzed for this material is on the Web site for the course. The SPO files that I produced in class are also on the Web site.
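The column layout described above can be written down once and reused. The sketch below is in Python rather than SPSS, and it assumes a whitespace-delimited text file with no header row. The function name parse_rows and the two sample rows are made up for illustration; they are not taken from the course data file.

```python
# Column layout from the handout: T, S, three continuous predictors,
# three indicators, three dependent variables (eleven columns in all).
COLUMNS = ["T", "S", "X1", "X2", "X3", "I1", "I2", "I3", "Y1", "Y2", "Y3"]
INDICATORS = ("I1", "I2", "I3")


def parse_rows(lines):
    """Turn whitespace-delimited lines into a list of dicts keyed by column name.

    Assumes no header row (an assumption about the file format).
    Blank lines are skipped.
    """
    rows = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {number}: expected {len(COLUMNS)} fields, got {len(fields)}"
            )
        row = dict(zip(COLUMNS, map(float, fields)))
        # Indicator variables must be coded 0 or 1.
        for name in INDICATORS:
            if row[name] not in (0.0, 1.0):
                raise ValueError(f"line {number}: {name} must be 0 or 1")
        rows.append(row)
    return rows


# Two made-up rows, only to show the shape of the result.
sample = [
    "1 17 2.5 0.3 11.0 0 1 0 4.2 7.7 30.1",
    "2 17 1.9 0.8 12.4 1 0 0 3.9 8.1 28.6",
]
rows = parse_rows(sample)
```

Checking the field count and the 0/1 coding at load time catches a misaligned column before it reaches the regression.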



2. Preliminary steps

Make sure that you have an electronic copy of your data file. Since a printout of your data file will take about eight pages, a relatively small printout, I would print it out. After you print out the file, scan through it. Keep your copies of the data file in a safe place. Working on a project is easier if you focus on one variable at a time. Take notes on the steps that you wish to perform in your analysis before you start your computer work. As the output is produced, take quick notes on your findings. Remember that you can save your output (the .SPO file). Examine the output, especially any plots, more carefully after you have finished your computation.

The first step in a regression analysis is to plot the dependent variable against each independent variable. The graphs menu in SPSS has a scatterplot submenu. Use the scatterplot routines to plot the dependent variable against each of the independent variables. The unusual patterns that a scatterplot may indicate are a nonlinear regression function (a curved pattern), a heteroscedastic error distribution (a horn-shaped plot), and outliers. Remember to write down your observations about the scatterplots. The Y3 variable in this semester's example had obvious linear associations, and there were no obvious associations between Y1 and any of the independent variables or between Y2 and any of the independent variables.

3. The Analysis of the Apparently Weak Variable

First, I focused on Y1, the dependent variable that did not appear to have any strong associations. I used the linear regression routines (Statistics menu, regression submenu, linear choice). Then I put Y1 in the dependent variable box. I used all of the independent variables (X1, X2, X3, I1, I2, and I3) in the analysis. I used the default enter option. I clicked on the "statistics" button in the linear regression specification and asked for "casewise diagnostics" to be shown. From the "plots" button, I chose "histogram" and "normal probability plot". Finally, from the "save" button, I chose "unstandardized residuals" and "unstandardized predicted" values.

The regression using all six independent variables explained a small fraction of the variation of Y1, a result that is consistent with Y1 being a variable with no associations. The F-statistic for the model with all six variables included was small as well.
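The quantities that the SPSS linear regression output reports for such a fit can be reproduced by hand. The following is a minimal sketch in Python with NumPy, not the SPSS procedure itself. The function name fit_ols and the simulated data are assumptions made for illustration; the simulated dependent variable is built to have no association with the predictors, as a stand-in for a variable like Y1.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    X is an (n, k) array of independent variables, y an (n,) array.
    Returns a dict of the usual summary quantities.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])         # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    resid = y - fitted

    sse = float(resid @ resid)                         # error sum of squares
    sst = float(((y - y.mean()) ** 2).sum())           # total sum of squares
    df_model, df_error = k, n - k - 1
    mse = sse / df_error
    cov = mse * np.linalg.inv(design.T @ design)       # covariance of estimates
    se = np.sqrt(np.diag(cov))
    return {
        "coef": beta,                                  # intercept first
        "se": se,
        "t": beta / se,
        "r_squared": 1.0 - sse / sst,                  # fraction of variance explained
        "f_stat": ((sst - sse) / df_model) / mse,      # overall F statistic
        "df": (df_model, df_error),
        "sigma_hat": float(np.sqrt(mse)),              # estimated SD of the error
        "fitted": fitted,                              # unstandardized predicted values
        "resid": resid,                                # unstandardized residuals
    }


# Simulated stand-in data (not the course data file).
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([rng.normal(size=(n, 3)), rng.integers(0, 2, size=(n, 3))])
y1 = rng.normal(loc=10.0, scale=2.0, size=n)           # unrelated to X by construction
result = fit_ols(X, y1)
print(f"R squared = {result['r_squared']:.3f}, F = {result['f_stat']:.2f} on {result['df']} df")
```

With six predictors and 500 cases, a dependent variable that is pure noise should still show a small positive R squared, which is why the overall F-test is needed to judge it.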

A statistician who used 0.01 as the significance level would not reject the null hypothesis that none of the independent variables was associated with Y1. One method of analysis, called the "protected F-test," is to stop any further examination of t-tests when the overall test is not significant. My personal preference is to use a small level of significance, such as 0.01 or lower, and to use an F-test that protects against inflation of the level of significance from multiple testing.
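The protected F-test rule amounts to a two-stage decision. The sketch below is one way to write it down in Python. The function name protected_tests and the p-values in the usage lines are hypothetical, and using the same level of significance at both stages is an assumption, not something the rule requires.

```python
def protected_tests(overall_p, coef_p_values, alpha=0.01):
    """Protected F-test: examine t-tests only if the overall F-test rejects.

    overall_p      observed significance level of the overall F-test
    coef_p_values  dict mapping variable name -> p-value of its t-test
    alpha          level of significance (same at both stages, by assumption)
    Returns the names judged to be associated with the dependent variable.
    """
    if overall_p >= alpha:
        return []                       # stop: no further examination of t-tests
    return [name for name, p in coef_p_values.items() if p < alpha]


# Hypothetical p-values, for illustration only.
t_test_p = {"X1": 0.004, "X2": 0.50, "I1": 0.03}
print(protected_tests(0.20, t_test_p))      # overall test not significant
print(protected_tests(0.0001, t_test_p))    # overall test significant
```

In the first call the small p-value for X1 is ignored because the overall test did not reject; that is the protection against multiple testing.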

There is always a tradeoff between the probability of a Type I error and the probability of a Type II error. Reducing the level of significance increases the probability of a Type II error; increasing the level of significance reduces the probability of a Type II error. In practice, a statistician must balance the two probabilities of error. You have the discretion of setting the level of significance in your analysis. Statistical practitioners use a wide range of levels of significance. In this example, setting the lower level of significance and assessing Y1 as not having associations matched the way the data was generated. In practice, one would learn the correct decision about an association from the ongoing evolution of research results about the dependent variable.
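The tradeoff can be seen numerically in a simpler setting than regression. The sketch below uses a one-sided z-test, chosen only because its error probabilities have a closed form; it is an analogy, not the F-test or t-test used in the project. The function name type_ii_error and the shift of 2.5 standard errors are illustrative assumptions.

```python
from statistics import NormalDist


def type_ii_error(alpha, shift):
    """P(Type II error) for a one-sided z-test at level alpha.

    shift is the true mean of the test statistic under the alternative,
    measured in standard-error units (an illustrative assumption).
    """
    z = NormalDist()
    critical = z.inv_cdf(1.0 - alpha)      # reject when the statistic exceeds this
    return z.cdf(critical - shift)         # chance of failing to reject


for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, shift=2.5)
    print(f"alpha = {alpha:.2f}  P(Type II error) = {beta:.3f}")
```

As the level of significance drops from 0.10 to 0.01, the printed probability of a Type II error rises, which is the tradeoff described above.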

I finished my analysis of Y1 by using the explore option in the summarize submenu of the statistics menu. I selected Y1 as the variable to be analyzed and asked for histograms and normality tests. I also performed similar analyses on Y2.

4. Analysis of the Apparently Related Variable

The dependent variable Y3 appeared to have associations with the independent variables. I used the linear regression choice in the regression submenu of the statistics menu. I used Y3 as the dependent variable and all six independent variables. The fraction of variance explained (R squared value) was high, showing very clearly that Y3 had one or more associations with the independent variables. The value of the F-statistic for the test of the null hypothesis that all six partial regression coefficients were simultaneously zero was high, with an observed significance level reported as 0.000. The t statistics of the partial regression coefficients showed which variables appeared to have the strongest associations.

I then plotted the unstandardized residuals against the unstandardized predicted values. That plot was patternless. I obtained the approximate lack of fit test by running a one-way analysis of variance of the unstandardized residuals against the situation variable S, the variable in the second column of the data file. Here, the approximate lack of fit test was 0.99. This indicated that the model appeared to be adequate.
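The approximate lack of fit test described above is a one-way analysis of variance of the residuals, grouped by situation. The following is a minimal sketch in Python with NumPy. The function name anova_f_by_group and the simulated data (50 situations of 10 cases, one continuous predictor) are assumptions made for illustration, not the layout of the course data file. The observed significance level would come from the F distribution with the returned degrees of freedom; that step is left out here.

```python
import numpy as np


def anova_f_by_group(values, groups):
    """One-way ANOVA F statistic of `values` across the levels of `groups`.

    Returns (F, df_between, df_within).
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    levels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for level in levels:
        v = values[groups == level]
        ss_between += v.size * (v.mean() - grand_mean) ** 2
        ss_within += ((v - v.mean()) ** 2).sum()
    df_between = levels.size - 1
    df_within = values.size - levels.size
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return float(f_stat), int(df_between), int(df_within)


# Simulated stand-in data: every case in a situation shares the same x.
rng = np.random.default_rng(1)
s = np.repeat(np.arange(50), 10)                  # 50 situations, 10 cases each
x = np.linspace(0.0, 5.0, 50)[s]                  # one x setting per situation
y_ok = 3.0 + 2.0 * x + rng.normal(size=s.size)    # a straight line is the right model
y_bad = 3.0 + x ** 2 + rng.normal(size=s.size)    # a straight line is the wrong model

for label, y in (("adequate", y_ok), ("inadequate", y_bad)):
    slope, intercept = np.polyfit(x, y, 1)        # fit a straight line
    resid = y - (intercept + slope * x)           # unstandardized residuals
    f_stat, df1, df2 = anova_f_by_group(resid, s)
    print(f"{label} model: F = {f_stat:.2f} on ({df1}, {df2}) df")
```

When the fitted model is adequate, the situation means of the residuals vary no more than chance allows and F stays near 1; when curvature is missed, the situation means of the residuals follow the missed pattern and F becomes large.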

5. Example Report

Your report should be brief. You should have three independent sections. One should discuss Y1, a second Y2, and the third Y3. You must state clearly whether a dependent variable does or does not appear to be associated with any of the independent variables in your data set. If a variable does appear to be associated with the independent variables, you must state clearly which independent variables are associated with the dependent variable. Be sure to include a statement of the model that you have fitted to the dependent variable. Include the fraction of variance explained and the results of the lack of fit test.

6. Reminders

The report on the last project is worth as many points as the final examination. Consequently, it is in your interest to be careful in writing up the results and checking your statements before you turn in the report. As you come to closure on the analysis of Y1, Y2, and Y3, I recommend that you write up your results immediately and begin to edit your report. Make sure that you specify the independent variables (X1, X2, X3, I1, I2, and I3) that are associated with each dependent variable.

End of Project Two Handout