AMS 578 Homework.

PLEASE READ, THIS IS IMPORTANT:
1. All homework must be done independently. You may discuss it with your peers, but collaboration is not allowed.
2. Do not hand in raw computer output without explanation. The best format is a written summary of what you did, with your computer output attached and annotated with brief comments.

Assignment 1. (Due Feb. 9 in class)

The purpose of this assignment is to demonstrate that you have a basic command of the use of some statistical software. Use of R is recommended but not required.

The data for this question comes from the 1980 census. Information for each state on the following variables is presented:

   over65: percentage of population over the age of 65
   medage: median age of population
   percap: per-capita income in dollars
   college: percentage with a college education
   hs: percentage with a high school education

Conduct a preliminary numerical and graphical summary of the data. Submit no more than 5 pages total of plots and output. Describe what you find and indicate any features that you think may be interesting. Answer the following specific questions:

 1. Which pair of variables has the greatest positive correlation?
 2. What is the 0.15 quantile of the per-capita incomes?
 3. What is the average per-capita income for states with an above-average percentage of the population with a college education?

You'll need to look at the help page for the quantile() function to figure out how to get the 15th percentile.
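
As a rough illustration of the kind of commands involved, here is a minimal sketch on simulated data. The data frame `census` and all of its values are made up for this example (only the column names follow the variable list above), so the numbers it produces say nothing about the real census data.

```r
# Hypothetical stand-in for the census data: 50 simulated "states".
set.seed(1)
census <- data.frame(
  over65  = rnorm(50, 11, 2),
  medage  = rnorm(50, 30, 2),
  percap  = rnorm(50, 7000, 900),
  college = rnorm(50, 16, 3),
  hs      = rnorm(50, 67, 7)
)

# Numerical summaries and the correlation matrix.
summary(census)
cors <- cor(census)

# Largest off-diagonal correlation: blank out the diagonal first.
diag(cors) <- NA
best <- which(cors == max(cors, na.rm = TRUE), arr.ind = TRUE)[1, ]
best_pair <- c(rownames(cors)[best["row"]], colnames(cors)[best["col"]])

# 0.15 quantile (15th percentile) of per-capita income.
q15 <- quantile(census$percap, probs = 0.15)

# Mean income among states above the average college percentage.
high_college <- census$college > mean(census$college)
mean_income_high <- mean(census$percap[high_college])
```

Graphical summaries such as hist(census$percap) or pairs(census) can be produced in the same way.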

Assignment 2 (TBD)

1. (For master's students only) Given in lecture.

2. (For Ph.D. students only) Given in lecture.

3. Researchers at General Motors collected data on 59 U.S. Metropolitan Areas in a study of whether air pollution
contributes to mortality. Among the variables that they analyzed were

Mortality:   Age adjusted mortality in annual deaths per hundred thousand
Education:   Median education in years
NOx:   Nitrous Oxide (ppm)
NonWhite:   Percentage of non-whites
income:  Median income in thousands of dollars
JanTemp:   Mean January temperature (degrees Fahrenheit)
JulTemp:   Mean July temperature (degrees Fahrenheit)

 1) Fit a linear regression model with Mortality as the response and NOx as the sole predictor. Report the
   value of the regression coefficient for NOx and its standard error. Test the hypothesis that the slope parameter
   for NOx is zero.
 2) Fit a linear regression model with Mortality as the response and log(NOx) as the sole predictor.
   Report the value of the regression coefficient for log(NOx) and its standard error. Test the hypothesis
   that the slope parameter for log(NOx) is zero.
 3) Compare the models used in the first two questions. Which model fits better? Can you use an F-test to
   compare these two models?
 4) Fit a linear regression model with Mortality as the response and log(NOx), Education, NOx, NonWhite, income, JanTemp, JulTemp as predictors.
    a. Test the hypothesis that the slope parameter for log(NOx) is zero. Why is this test different from that in
      question 2?
    b. Which urban area had a mortality rate that was least well predicted by this model?
    c. Compute the mean of the residuals.
    d. Compute the correlations between the residuals and income and between the residuals and the fitted values.
    e. Predict the difference in mortality between two urban areas where all predictors are identical except
      that one is 5 degrees warmer in January than the other.
 5) Test the hypothesis that January and July temperatures in the previous model may be replaced by the
   difference between these two temperatures.
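
The mechanics of fitting a model with a log-transformed predictor, and of testing a linear restriction by comparing nested models, can be sketched as follows. The data frame `pollution` is simulated for this example and uses only three of the predictors; the coefficients in the simulation are arbitrary assumptions, not estimates from the real study.

```r
# Simulated stand-in for the pollution data (not the real values).
set.seed(2)
n <- 59
pollution <- data.frame(
  NOx     = rexp(n, rate = 1 / 20) + 1,
  JanTemp = rnorm(n, 34, 10),
  JulTemp = rnorm(n, 75, 5)
)
pollution$Mortality <- 900 + 15 * log(pollution$NOx) -
  1.5 * pollution$JanTemp + rnorm(n, sd = 40)

# A log-transformed predictor can be written directly in the formula.
fit_log <- lm(Mortality ~ log(NOx), data = pollution)
coef(summary(fit_log))   # estimate, std. error, t value, p-value

# Full model versus the restricted model in which the two temperatures
# enter only through their difference; I() protects the arithmetic.
fit_full <- lm(Mortality ~ log(NOx) + JanTemp + JulTemp, data = pollution)
fit_diff <- lm(Mortality ~ log(NOx) + I(JulTemp - JanTemp), data = pollution)

# F-test of the restriction (fit_diff is nested in fit_full).
restriction_test <- anova(fit_diff, fit_full)
restriction_test
```

The anova() comparison is only meaningful when one model is a special case of the other, as it is here.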
 

Assignment 3 (TBD)

1. Study Multiple Testing

Randomly generate a response variable:
> y <- rnorm(200)
Randomly generate 50 predictor variables:
> x <- matrix(0, 200, 50)
> for (i in 1:50) {x[, i] <- rnorm(200)}
Fit a least-squares regression:
> lm(y ~ x)

How many variables appear to be significant at the 0.05 and 0.01 significance levels? What does the F-test show?
Explain the discrepancy.
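
One way to pull the relevant numbers out of the fit, rather than reading them off the printed summary, is sketched below. The seed and the object names (`fit`, `pvals`, and so on) are choices made for this example only.

```r
set.seed(3)
y <- rnorm(200)
x <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)
fit <- summary(lm(y ~ x))

# p-values of the 50 slope t-tests (the first row is the intercept).
pvals <- fit$coefficients[-1, "Pr(>|t|)"]
n_sig_05 <- sum(pvals < 0.05)
n_sig_01 <- sum(pvals < 0.01)

# Overall F-test: fstatistic holds the value and its two degrees of freedom.
f <- fit$fstatistic
f_pvalue <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```

Rerunning with different seeds gives a feel for how the counts vary from sample to sample.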

2. A real estate agent collected some data on house prices for houses in a neighborhood of Chicago. The variables are price in thousands of dollars, number of bedrooms, floor space in square feet, number of rooms, lot frontage in feet, annual property tax, number of bathrooms, number of parking spaces in the garage if any, and condition (1="needs work", 0="good condition").

Fit a regression model with the price as the response and all the other variables as predictors. Answer the following questions:
      a. If all other predictors are held constant and the number of bedrooms is increased by one, what is the predicted effect on the price? Does this seem plausible? (Hint: think about the bedroom size.)
      b. Form a 95% confidence interval for the regression parameter associated with the number of bedrooms.
      c. Fit another regression model with price as the response and the number of bedrooms as the sole predictor. Explain in words why the effect of the number of bedrooms differs in this model from the previous one.
      d. If you have a house for sale in this neighborhood in good condition with 3 bedrooms, 1500 square feet, 8 rooms, 40 feet of lot frontage, $1000 annual taxes, 1.5 bathrooms and a 1 car garage, how much would you expect it to sell for? Give some assessment of the uncertainty in your prediction.
      e. What is the predicted price of a house that does not exist, i.e., one where all the predictors take the value zero? Does this mean that the model used above is invalid for any prediction?
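
The R tools for coefficient intervals and for predictions with uncertainty can be sketched on made-up data as follows. The data frame `houses`, its column names, and the simulated values are all hypothetical, and only three predictors are included to keep the example short.

```r
# Simulated stand-in for the house data (hypothetical names and values).
set.seed(5)
n <- 30
houses <- data.frame(
  bedrooms = sample(2:5, n, replace = TRUE),
  space    = round(runif(n, 800, 2500)),
  tax      = round(runif(n, 500, 1500))
)
houses$price <- 10 + 0.03 * houses$space + 0.01 * houses$tax + rnorm(n, sd = 4)

fit <- lm(price ~ bedrooms + space + tax, data = houses)

# 95% confidence interval for the bedrooms coefficient.
ci_bed <- confint(fit, "bedrooms", level = 0.95)

# Predicted price for a new house, with a 95% prediction interval
# (wider than a confidence interval for the mean response).
new_house <- data.frame(bedrooms = 3, space = 1500, tax = 1000)
pred <- predict(fit, newdata = new_house, interval = "prediction", level = 0.95)
conf <- predict(fit, newdata = new_house, interval = "confidence", level = 0.95)
```

Which of the two intervals is appropriate depends on whether the question concerns a single future house or the average price of all such houses.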

In an attempt to correct the difficulty observed in the previous question, an investigator fits a model without an intercept term. This can be done in R by including -1 in the model formula, e.g.

          lm(y ~ x -1)
 

 Fit a model with price as the response and the rest as predictors, but without an intercept.
      f. Compute the mean of the residuals and the correlation between the residuals and the fitted values for this model. How do the values obtained differ from those for models that do contain an intercept term?
      g. Compare the residual standard errors and R-squared values for this model and the previous one. Do you think that this model has a substantially better fit than the previous one?
      h. Recompute the predicted sale price for the (real) house described in part d and give a 95% CI for your prediction.
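
The quantities asked for in parts f and g can be computed as in the following sketch, which uses a single simulated predictor rather than the house data. The variable names and the true intercept of 50 are assumptions made purely for illustration.

```r
set.seed(4)
x <- runif(100, 1, 10)
y <- 50 + 3 * x + rnorm(100, sd = 2)

fit_int   <- lm(y ~ x)       # with intercept
fit_noint <- lm(y ~ x - 1)   # intercept removed

# With an intercept the residuals sum to zero by construction;
# without one there is no such guarantee.
mean_res_int   <- mean(residuals(fit_int))
mean_res_noint <- mean(residuals(fit_noint))

# Correlation between residuals and fitted values.
cor_int   <- cor(residuals(fit_int), fitted(fit_int))
cor_noint <- cor(residuals(fit_noint), fitted(fit_noint))

# R computes R-squared for a no-intercept fit from the uncentered total
# sum of squares, so the two values are not directly comparable.
r2_int   <- summary(fit_int)$r.squared
r2_noint <- summary(fit_noint)$r.squared
```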
 

Assignment 4 (TBA)
 

 1. The National Institute of Standards and Technology collected data on the coefficient of thermal expansion for copper as it varies with temperature in degrees Kelvin. Build a regression model for the purpose of predicting the coefficient of thermal expansion at high temperatures. Demonstrate your model in action by predicting the coefficient of thermal expansion at 1000 degrees Kelvin, along with an appropriate 95% confidence interval.

 2. Each case in the data represents a pair of zones in Chicago. The variable x gives the estimated travel time between the two destinations, computed from estimated walking times plus information from bus timetables. The variable y gives the average of the travel times reported by n travellers.
      a. Plot y against x. What do you see?
      b. Fit a linear regression to predict y from x with an appropriate choice of weights.
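
A minimal sketch of weighted least squares in R is given below. It assumes that each y is an average over n independent travellers with a common variance, so that the variance of y is proportional to 1/n; whether that assumption is reasonable for the real data is part of the exercise. All values are simulated.

```r
# Simulated stand-in: each case is an average over n travellers, so its
# variance is assumed to be sigma^2 / n.
set.seed(6)
m <- 40
x <- runif(m, 10, 60)
n <- sample(1:20, m, replace = TRUE)
y <- 5 + 1.1 * x + rnorm(m, sd = 6 / sqrt(n))

# Weights proportional to 1 / variance, i.e. proportional to n.
fit_w <- lm(y ~ x, weights = n)
fit_u <- lm(y ~ x)   # unweighted fit for comparison
```

Comparing coef(fit_w) with coef(fit_u) shows how much the weighting changes the fitted line.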
 

Assignment 5 (TBA)

Continuing from the second problem of Assignment 4, check your model for lack of fit.
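
One standard lack-of-fit check, sketched below on simulated data, compares the straight-line model with a saturated model that fits a separate mean at each distinct x. This version assumes replicated x values, which may not hold for the travel-time data, so treat it as an illustration of the idea rather than a recipe for this problem.

```r
# Simulated data with replicated x values and a truly curved mean.
set.seed(7)
x <- rep(1:8, each = 4)
y <- 2 + 0.5 * x + 0.4 * x^2 + rnorm(length(x), sd = 1)

fit_line <- lm(y ~ x)           # straight-line model under test
fit_sat  <- lm(y ~ factor(x))   # saturated model: one mean per x value

# The straight line is nested in the saturated model, so the F-test
# compares lack-of-fit variation against pure error.
lof <- anova(fit_line, fit_sat)
lof
```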
 

Assignment 6 (TBA)
An experiment was conducted to construct a model for total oxygen demand in dairy wastes as a function of five laboratory measurements. Data were collected on the same
sample over time for a period of 220 days. Although this might result in correlated errors, assume that they are not. The variables are

    y  = log(oxygen demand, mg oxygen per minute)
    x1 = biological oxygen demand, mg/liter
    x2 = total Kjeldahl nitrogen, mg/liter
    x3 = total solids, mg/liter
    x4 = total volatile solids, mg/liter
    x5 = chemical oxygen demand, mg/liter

   1. Fit a model with y as the response and the other variables as predictors. Plot the residuals against the fitted values, check for outliers using the formal test and compute and plot the Cook's distances. Comment on what should be inferred from the plots.
   2. Remove the predictor with the largest p-value from the model above and refit. Again remove the predictor with the largest p-value and refit, continuing this elimination process until all the predictor p-values are less than 5%. For this model, plot the residuals against the fitted values, check for outliers using the formal test and compute and plot the Cook's distances. Comment on what should be inferred from the plots. Were the same points influential?
   3. Remove the case from the data that has the largest Cook's distance for the model used in 1. Repeat the variable selection as in 2. Produce the same plots and tests as before. Comment.
   4. Which should be done first: variable selection or influential point investigation?
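
The diagnostic steps above can be sketched in R as follows. The data are simulated with two hypothetical predictors and one planted outlier, and the formal outlier test is taken here to be the Bonferroni-corrected test on externally studentized residuals; if the lecture used a different test, substitute it.

```r
set.seed(8)
n <- 20
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n, sd = 0.5)
dat$y[5] <- dat$y[5] + 6   # plant one gross outlier in case 5

fit <- lm(y ~ x1 + x2, data = dat)

# Formal outlier test: compare each |studentized residual| with a
# Bonferroni-corrected t critical value (n tests, two-sided, level 0.05).
rs   <- rstudent(fit)
crit <- qt(1 - 0.05 / (2 * n), df = df.residual(fit) - 1)
outliers <- which(abs(rs) > crit)

# Cook's distances and the most influential case.
cd <- cooks.distance(fit)
most_influential <- which.max(cd)

# One step of backward elimination: drop the predictor with the largest p-value.
pv   <- coef(summary(fit))[-1, "Pr(>|t|)"]
drop <- names(which.max(pv))
fit2 <- update(fit, as.formula(paste(". ~ . -", drop)))
```

The corresponding plots can be drawn with plot(fitted(fit), residuals(fit)) and plot(cd, type = "h").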

Assignment 7 (Due on Monday, April 5th)

The data give the volume (cubic feet), height (feet) and diameter (inches) (at 54 inches above ground) for a sample of 31 black cherry trees in the Allegheny National Forest,
Pennsylvania. The data were collected in order to find an estimate for the volume of a tree (and therefore the timber yield), given its height and diameter.  You may need this
function for Box-Cox transformation.

Build a regression model for predicting volume given height and diameter. You should consider transformations of these variables in building your model. Your report
should include a description of your final model, along with a verification that your model has satisfactory diagnostics. You should explain why you chose this model in
comparison to alternatives. You should outline the steps you followed in finding this model but do not give details on every intermediate model you considered. Your report
should not exceed 6 pages.
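
As a starting point, the Box-Cox profile likelihood can be computed with boxcox() from the MASS package, a recommended package normally installed with R. The sketch below assumes that R's built-in `trees` data set (columns Girth, Height, Volume) is the data described above, with Girth holding the diameter in inches; check this against the file you were given.

```r
library(MASS)   # provides boxcox()

# Assumption: the built-in `trees` data is the data set described above.
fit <- lm(Volume ~ Height + Girth, data = trees)

# Profile log-likelihood over a grid of lambda values for the response.
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# One alternative worth comparing: if volume behaves roughly like
# height times diameter squared, a log-log model is a natural candidate.
fit_log <- lm(log(Volume) ~ log(Height) + log(Girth), data = trees)
```

Calling boxcox(fit) with the default plotit = TRUE draws the likelihood curve with a 95% interval for lambda.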

Assignment 8 (TBA)

Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) are recorded for 252 men. Body fat is estimated through an
underwater weighing technique, but this is inconvenient to use widely.  Use 2/3 of your observations to develop a regression model that allows the estimation of body fat for men using only a scale and a measuring tape. Validate your model on the remaining 1/3 of your observations.

    Case Number
    Percent body fat using Brozek's equation, 457/Density - 414.2
    Percent body fat using Siri's equation, 495/Density - 450
    Density (gm/cm^3)
    Age (yrs)
    Weight (lbs)
    Height (inches)
    Adiposity index = Weight/Height^2 (kg/m^2)
    Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek's formula (lbs)
    Neck circumference (cm)
    Chest circumference (cm)
    Abdomen circumference (cm) "at the umbilicus and level with the iliac crest"
    Hip circumference (cm)
    Thigh circumference (cm)
    Knee circumference (cm)
    Ankle circumference (cm)
    Extended biceps circumference (cm)
    Forearm circumference (cm)
    Wrist circumference (cm) "distal to the styloid processes"

Your model should predict %body fat according to Siri. You may not use Brozek's %body fat, Density or Fat Free Weight as predictors.

Your report should contain a description of the model you chose along with the variable selection process you used to find it. Note that outliers and influential points should
be examined.

Your report should not exceed five pages. You should aim to find a simple model that predicts the response well. The R library is now available on the mathlab machines. The precompiled library for Windows is available here.
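
The 2/3 versus 1/3 split and the validation step can be sketched as follows. The data frame `fat`, its column names, and the simulated relationship are hypothetical stand-ins for the real data, and only two predictors are used.

```r
# Simulated stand-in for the body fat data (252 cases, two predictors only).
set.seed(9)
n <- 252
fat <- data.frame(abdomen = rnorm(n, 92, 10), weight = rnorm(n, 180, 25))
fat$siri <- -40 + 0.65 * fat$abdomen - 0.02 * fat$weight + rnorm(n, sd = 4)

# Random 2/3 training, 1/3 validation split.
train_idx <- sample(n, size = round(2 * n / 3))
train <- fat[train_idx, ]
valid <- fat[-train_idx, ]

fit <- lm(siri ~ abdomen + weight, data = train)

# Out-of-sample root mean squared prediction error, with the
# in-sample value for comparison.
pred <- predict(fit, newdata = valid)
rmse_valid <- sqrt(mean((valid$siri - pred)^2))
rmse_train <- sqrt(mean(residuals(fit)^2))
```

All model selection should use `train` only, so that `valid` gives an honest assessment of predictive performance.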

Assignment 9 (TBA)

The data contain four measurements made of male Egyptian skulls from five different time periods ranging from 4000 B.C. to 150 A.D. We wish to analyze the data to
determine if there are any differences in the skull sizes between the time periods and if they show any changes with time. The researchers theorize that a change in skull size
over time is evidence of the interbreeding of the Egyptians with immigrant populations over the years.

The data set contains five variables:
1. MB: Maximal Breadth of Skull
2. BH: Basibregmatic Height of Skull
3. BL: Basialveolar Length of Skull
4. NH: Nasal Height of Skull
5. Year: Approximate Year of Skull Formation (negative = B.C., positive = A.D.)

1. Conduct a principal component analysis on the first four variables. Interpret the results.
2. Plot the first two principal components against Year. Describe what you observe.
3. Do a formal test to answer the question: Is there a change in skull size over time?
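
A possible workflow in R is sketched below on simulated data. The data frame `skulls`, the evenly spaced period years, and the simulated measurements are assumptions made for this example. MANOVA on Year is shown as one candidate for the formal test, not necessarily the one intended.

```r
# Simulated stand-in for the skull data: 30 skulls in each of five
# hypothetical, evenly spaced periods.
set.seed(10)
year <- rep(seq(-4000, 150, length.out = 5), each = 30)
skulls <- data.frame(
  MB = 131 + 0.001 * year + rnorm(150, sd = 5),
  BH = 133 + rnorm(150, sd = 5),
  BL = 96 + rnorm(150, sd = 5),
  NH = 50 + rnorm(150, sd = 3),
  Year = year
)

# PCA on the four measurements. This uses the covariance matrix, which
# assumes the measurements share a common unit; set scale. = TRUE for a
# correlation-based analysis instead.
pca <- prcomp(skulls[, c("MB", "BH", "BL", "NH")])
summary(pca)     # proportion of variance per component
pca$rotation     # loadings

scores <- pca$x[, 1:2]   # first two principal component scores

# One candidate formal test: multivariate regression of the four
# measurements on Year, tested with MANOVA.
mfit <- manova(cbind(MB, BH, BL, NH) ~ Year, data = skulls)
mtest <- summary(mfit)
mtest
```

The scores can be plotted against time with plot(skulls$Year, scores[, 1]) and likewise for the second component.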