PLEASE READ, THIS IS IMPORTANT:
1. All homework has to be done independently.
You can discuss with your peers but collaboration is not allowed.
2. Don't hand in a pile of computer output without explanation.
The best format is a written summary of what you did, with your
computer output attached and annotated with brief comments.
Assignment 1. (Due Feb. 9 in class)
The purpose of this assignment is to demonstrate that you have a basic command of the use of some statistical software. Use of R is recommended but not required.
The data for this question comes from the 1980 census. Information for each state on the following variables is presented:
over65: percentage of population over the age of 65
medage: median age of population
percap: per-capita income in dollars
college: percentage with a college education
hs: percentage with a high school education
Conduct a preliminary numerical and graphical summary of the data. Submit no more than 5 pages total of plots and output. Describe what you find and indicate any features that you think may be interesting. Answer the following specific questions:
1. Which pair of variables has the greatest positive correlation?
2. What is the 0.15 quantile of the per-capita incomes?
3. What is the average per-capita income for states with an above-average percentage of the population with a college education?
You'll need to look at the help page for the quantile() function to figure out how to get the 15th percentile.
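The mechanics behind these three questions can be sketched on made-up numbers. The data frame below is simulated, not the census file, and only reuses the variable names listed above; treat it as an illustration of the function calls rather than an answer.

```r
# Simulated stand-in for the census data (values are invented for illustration).
set.seed(1)
census <- data.frame(
  over65  = rnorm(50, 11, 2),
  medage  = rnorm(50, 30, 2),
  percap  = rnorm(50, 7000, 900),
  college = rnorm(50, 16, 3),
  hs      = rnorm(50, 67, 7)
)

# Numerical overview of every column
summary(census)

# Correlation matrix; hide the diagonal and lower triangle before searching
cc <- cor(census)
cc[lower.tri(cc, diag = TRUE)] <- NA
top <- which(cc == max(cc, na.rm = TRUE), arr.ind = TRUE)
best_pair <- c(rownames(cc)[top[1, "row"]], colnames(cc)[top[1, "col"]])

# The probs argument of quantile() selects the quantile level
q15 <- quantile(census$percap, probs = 0.15)

# Mean of one variable over the rows picked out by a condition on another
mean_percap_high_college <- with(census, mean(percap[college > mean(college)]))
```

For the graphical part, pairs(census) draws a scatterplot matrix and hist() or boxplot() can be applied to individual columns.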
Assignment 2 (TBD)
1. (for master's students only) Given in the lecture
2. (for Ph.D. students only) Given in the lecture
3. Researchers at General Motors collected data
on 59 U.S. Metropolitan Areas in a study of whether air pollution
contributes to mortality. Among the variables that they analyzed were
Mortality: Age-adjusted mortality in annual deaths per hundred thousand
Education: Median education in years
NOx : Nitrous Oxide - ppm
NonWhite: Percentage of non whites
income: Median income in thousands of dollars
JanTemp: Mean January temperature (degrees Fahrenheit)
JulTemp: Mean July temperature (degrees Fahrenheit)
1) Fit a linear regression model with Mortality as the response and NOx as the sole predictor. Report the value of the regression coefficient for NOx and its standard error. Test the hypothesis that the slope parameter for NOx is zero.
2) Fit a linear regression model with Mortality as the response and log(NOx) as the sole predictor. Report the value of the regression coefficient for log(NOx) and its standard error. Test the hypothesis that the slope parameter for log(NOx) is zero.
3) Compare the models used in the first two questions. Which model fits better? Can you use an F-test to compare these two models?
4) Fit a linear regression model with Mortality as the response and log(NOx), Education, NOx, NonWhite, income, JanTemp and JulTemp as predictors.
a. Test the hypothesis that the slope parameter for log(NOx) is zero. Why is this test different from that in question 2?
b. Which urban area had a mortality rate that was least well predicted by this model?
c. Compute the mean of the residuals.
d. Compute the correlations between the residuals and income and between the residuals and the fitted values.
e. Predict the difference in mortality between two urban areas where all predictors are identical except that one is 5 degrees warmer in January than the other.
5) Test the hypothesis that January and July temperatures in the previous model may be replaced by the difference between these two temperatures.
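As a rough sketch of the tools these questions call for, the code below fits a model with lm(), pulls out a coefficient with its standard error, and compares two nested models with anova(). The data are simulated and use only a few of the variable names above, so the numbers mean nothing; the point is the sequence of calls.

```r
# Simulated stand-in for the pollution data (invented values, 59 rows).
set.seed(2)
n <- 59
pollution <- data.frame(
  NOx     = rexp(n, 1 / 20) + 1,
  JanTemp = rnorm(n, 34, 10),
  JulTemp = rnorm(n, 75, 5)
)
pollution$Mortality <- 900 + 15 * log(pollution$NOx) -
  1.5 * pollution$JanTemp + 1.5 * pollution$JulTemp + rnorm(n, sd = 30)

# Estimate, standard error, t statistic and p-value for one coefficient
full <- lm(Mortality ~ log(NOx) + JanTemp + JulTemp, data = pollution)
coef(summary(full))["log(NOx)", ]

# Residual checks of the kind asked for in 4c and 4d
mean_res <- mean(residuals(full))
cor_res_fit <- cor(residuals(full), fitted(full))

# Predicted change in the response for a 5 degree change in JanTemp
diff5 <- 5 * coef(full)["JanTemp"]

# A linear restriction as a nested-model F-test:
# the reduced model uses only the temperature difference
reduced <- lm(Mortality ~ log(NOx) + I(JulTemp - JanTemp), data = pollution)
cmp <- anova(reduced, full)
cmp
```

The anova() comparison is only meaningful when the smaller model is a special case of the larger one, which is worth keeping in mind for question 3.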
Assignment 3 (TBD)
1. Study Multiple Testing
Randomly generate a response variable:
> y <- rnorm(200)
Randomly generate 50 variables as predictors:
> x <- matrix(0, 200,50)
> for (i in 1:50){x[,i] <- rnorm(200)}
Fit a least squares regression:
> lm(y~x)
How many variables appear to be significant at the 0.05 and 0.01 significance levels? What does the F-test show? Explain the discrepancy.
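One possible way to pull the individual p-values and the overall F-test out of the fitted model is sketched below. The seed is an arbitrary choice, and the object names (pvals, n05, n01, overall_p) are introduced here only for illustration.

```r
set.seed(3)                       # arbitrary seed so the run is repeatable
y <- rnorm(200)
x <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)

fit <- lm(y ~ x)
s <- summary(fit)

# p-values of the 50 slope t-tests (drop the intercept row)
pvals <- coef(s)[-1, "Pr(>|t|)"]
n05 <- sum(pvals < 0.05)
n01 <- sum(pvals < 0.01)

# Overall F-test of all slopes being zero
f <- s$fstatistic
overall_p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```

Rerunning without set.seed() a few times gives a feel for how the counts vary from sample to sample.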
2. A real estate agent collected some data on house prices for houses in a neighborhood of Chicago. The variables are price in thousands of dollars, number of bedrooms, floor space in square feet, number of rooms, lot frontage in feet, annual property tax, number of bathrooms, number of parking spaces in the garage if any, and condition (1="needs work", 0="good condition").
Fit a regression model with the price as the response and all the other variables as predictors. Answer the following questions:
a. If all other predictors are held constant and the number of bedrooms is increased by one, what is the predicted effect on the price? Does this seem plausible? (Hint: Think about the bedroom size)
b. Form a 95% confidence interval for the regression parameter associated with the number of bedrooms.
c. Fit another regression model with Price as the response and the number of bedrooms as the sole predictor. Explain in words why the effect of the number of bedrooms differs in this model from the previous one.
d. If you have a house for sale in this neighborhood in good condition with 3 bedrooms, 1500 square feet, 8 rooms, 40 feet of lot frontage, $1000 annual taxes, 1.5 bathrooms and a 1 car garage, how much would you expect it to sell for? Give some assessment of the uncertainty in your prediction.
e. What is the predicted price of a house that does not exist, i.e. one where all the predictors take the value zero? Does this mean that the model used above is invalid for any prediction?
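Questions b and d rely on confint() and predict(). The sketch below shows both on simulated data; the column names (bedrooms, space, condition, price) and all values are hypothetical and cover only a subset of the predictors described above.

```r
# Simulated stand-in for the house price data (invented values).
set.seed(4)
n <- 26
houses <- data.frame(
  bedrooms  = sample(2:5, n, replace = TRUE),
  space     = round(runif(n, 800, 2000)),
  condition = rbinom(n, 1, 0.3)
)
houses$price <- 10 + 2 * houses$bedrooms + 0.03 * houses$space -
  5 * houses$condition + rnorm(n, sd = 4)

fit <- lm(price ~ bedrooms + space + condition, data = houses)

# 95% confidence interval for a single regression parameter
ci_bed <- confint(fit, "bedrooms", level = 0.95)

# Prediction at a new set of predictor values
new_house <- data.frame(bedrooms = 3, space = 1500, condition = 0)
pred <- predict(fit, newdata = new_house, interval = "prediction", level = 0.95)
conf <- predict(fit, newdata = new_house, interval = "confidence", level = 0.95)
```

The "prediction" interval covers the price of a single new house, while the "confidence" interval covers the mean price of all houses with those characteristics, so the former is always wider.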
In an attempt to correct the difficulty observed in previous questions, an investigator fits a model without an intercept term. This can be accomplished in R by using a -1 in the model expression e.g.
lm(y ~ x -1)
Fit a model with price as the response and the rest of the variables as predictors, but without an intercept.
f. Compute the mean of the residuals and the correlation between the residuals and the fitted values for this model. How do the values obtained differ from models that do contain an intercept term?
g. Compare the residual standard errors and R-squared values for this model and the previous one. Do you think that this model has a substantially better fit than the previous one?
h. Recompute the predicted sale price for the (real) house described in question d and give a 95% CI for your prediction.
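The quantities in f and g can be computed side by side for a model with and without an intercept. The sketch below uses a single simulated predictor whose true intercept is far from zero; the data and object names are invented for illustration.

```r
set.seed(5)
n <- 40
x <- runif(n, 1, 10)
y <- 20 + 3 * x + rnorm(n, sd = 2)   # true intercept is 20, not 0

with_int <- lm(y ~ x)
no_int   <- lm(y ~ x - 1)

# Residual mean and residual/fitted correlation for each model
res_mean <- c(with_int = mean(residuals(with_int)),
              no_int   = mean(residuals(no_int)))
res_cor  <- c(with_int = cor(residuals(with_int), fitted(with_int)),
              no_int   = cor(residuals(no_int), fitted(no_int)))

# Residual standard error and R-squared as reported by summary()
sigma_hat <- c(with_int = summary(with_int)$sigma,
               no_int   = summary(no_int)$sigma)
r2 <- c(with_int = summary(with_int)$r.squared,
        no_int   = summary(no_int)$r.squared)
```

When the intercept is dropped, summary.lm() computes R-squared relative to zero rather than relative to the mean of the response, so the two R-squared values are not measured on the same scale.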
Assignment 4 (TBA)
1. The National Institute for Standards and Technology collected data on the coefficient of thermal expansion for copper as it varies with temperature in kelvin. Build a regression model for the purposes of predicting the coefficient of thermal expansion at high temperatures. Demonstrate your model in action by predicting the coefficient of thermal expansion at 1000 kelvin along with an appropriate 95% confidence interval.
2. Each case in the data represents a pair of zones in Chicago. The variable x gives the estimated travel time between the two zones, computed from estimated walking times plus information from bus timetables. The variable y gives the average of the travel times reported by n travellers.
a. Plot y against x. What do you see?
b. Fit a linear regression to predict y using x with an appropriate choice of weights.
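The mechanics of a weighted fit in R go through the weights argument of lm(). The sketch below assumes, purely for illustration, that each individual reported time has the same variance, so that an average of n reports has variance proportional to 1/n and the weight is n; whether that assumption suits the real data is part of the question. All values are simulated.

```r
set.seed(6)
m <- 30                                  # number of zone pairs (made up)
x <- runif(m, 10, 60)                    # estimated travel time
n <- sample(1:20, m, replace = TRUE)     # number of travellers averaged
y <- 5 + 0.9 * x + rnorm(m, sd = 6 / sqrt(n))

# Weighted least squares: weights inversely proportional to Var(y)
wfit <- lm(y ~ x, weights = n)

# Ordinary least squares for comparison
ofit <- lm(y ~ x)

rbind(weighted = coef(wfit), unweighted = coef(ofit))
```

Plotting y against x with symbol size tied to n, for example plot(x, y, cex = sqrt(n) / 2), can help show which points carry the most weight.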
Assignment 5 (TBA)
Following on from the second problem of Assignment 4, check your model for lack
of fit.
Assignment 6 (TBA)
An experiment was conducted to construct a model for total oxygen demand in dairy wastes as a function of five laboratory measurements. Data were collected on the same sample over time for a period of 220 days. Although this might result in correlated errors, assume that the errors are uncorrelated. The variables are
y = log(oxygen demand, mg oxygen per minute)
x1 = biological oxygen demand, mg/liter
x2 = total Kjeldahl nitrogen, mg/liter
x3 = total solids, mg/liter
x4 = total volatile solids, mg/liter
x5 = chemical oxygen demand, mg/liter
1. Fit a model with y as the response and the other variables as predictors. Plot the residuals against the fitted values, check for outliers using the formal test and compute and plot the Cook's distances. Comment on what should be inferred from the plots.
2. Remove the predictor with the largest p-value from the model above and refit. Again remove the predictor with the largest p-value and refit, continuing this elimination process until all the predictor p-values are less than 5%. For this model, plot the residuals against the fitted values, check for outliers using the formal test and compute and plot the Cook's distances. Comment on what should be inferred from the plots. Were the same points influential?
3. Remove the case from the data that has the largest Cook's distance for the model used in 1. Repeat the variable selection as in 2. Produce the same plots and tests as before. Comment.
4. Which should be done first: variable selection or influential point investigation?
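The diagnostics named above can be sketched as follows. The "formal test" is taken here to be the Bonferroni-adjusted test on externally studentized residuals, which is an assumption about what the lecture covered. The data are simulated with one planted outlier in row 5.

```r
set.seed(7)
n <- 20
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n, sd = 0.5)
dat$y[5] <- dat$y[5] + 6                 # planted outlier

fit <- lm(y ~ x1 + x2, data = dat)

# Outlier test: externally studentized residuals against a
# Bonferroni-corrected t critical value (two-sided, overall level 5%)
tstar <- rstudent(fit)
p <- length(coef(fit))
crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)
flagged <- which(abs(tstar) > crit)

# Cook's distances and the most influential case
cd <- cooks.distance(fit)
worst <- which.max(cd)

# Refit without that case
refit <- lm(y ~ x1 + x2, data = dat[-worst, ])
```

plot(fitted(fit), residuals(fit)) gives the residual plot, and plot(cd, type = "h") is one simple way to display the Cook's distances.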
Assignment 7 (Due on Monday, April 5th)
The data give the volume (cubic feet), height (feet) and diameter (inches) (at 54 inches above ground) for a sample of 31 black cherry trees in the Allegheny National Forest, Pennsylvania. The data were collected in order to find an estimate for the volume of a tree (and therefore the timber yield), given its height and diameter. You may need this function for Box-Cox transformation.
Build a regression model for predicting volume given height and diameter. You should consider transformations of these variables in building your model. Your report should include a description of your final model, along with a verification that your model has satisfactory diagnostics. You should explain why you chose this model in comparison to alternatives. You should outline the steps you followed in finding this model but do not give details on every intermediate model you considered. Your report should not exceed 6 pages.
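If the course's own Box-Cox function is not at hand, boxcox() from the MASS package is one alternative. The sketch below uses R's built-in trees data set (31 black cherry trees, with the diameter stored in the column Girth), which appears to match the description above; check it against the file supplied for the course before relying on it.

```r
library(MASS)     # provides boxcox(); MASS ships with standard R installations
data(trees)       # built-in data: Girth (inches), Height (feet), Volume (cubic feet)

fit <- lm(Volume ~ Height + Girth, data = trees)

# Profile log-likelihood of the Box-Cox parameter over a grid
bc <- boxcox(fit, lambda = seq(-1, 1, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
in_ci <- bc$y > max(bc$y) - qchisq(0.95, 1) / 2
lambda_ci <- range(bc$x[in_ci])
```

Box-Cox only addresses the response; transformations of height and diameter still have to be considered separately.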
Assignment 8 (TBA)
Percentage of body fat, age, weight, height, and ten body circumference measurements (e.g., abdomen) are recorded for 252 men. Body fat is estimated through an underwater weighing technique, but this is inconvenient to use widely. Use 2/3 of your observations to develop a regression model that allows the estimation of body fat for men using only a scale and a measuring tape. Validate your model on the remaining 1/3 of your observations.
Case Number
Percent body fat using Brozek's equation, 457/Density - 414.2
Percent body fat using Siri's equation, 495/Density - 450
Density (gm/cm^3)
Age (yrs)
Weight (lbs)
Height (inches)
Adiposity index = Weight/Height^2 (kg/m^2)
Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek's formula (lbs)
Neck circumference (cm)
Chest circumference (cm)
Abdomen circumference (cm) "at the umbilicus and level with the iliac crest"
Hip circumference (cm)
Thigh circumference (cm)
Knee circumference (cm)
Ankle circumference (cm)
Extended biceps circumference (cm)
Forearm circumference (cm)
Wrist circumference (cm) "distal to the styloid processes"
Your model should predict %body fat according to Siri. You may not use Brozek's %body fat, Density or Fat Free Weight as predictors.
Your report should contain a description of the model you chose along with the variable selection process you used to find it. Note that outliers and influential points should be examined.
Your report should not exceed five pages. You should aim to find a simple model that predicts the response well. The R library is now available on the mathlab machines. The precompiled library for Windows is available here.
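The 2/3 and 1/3 split and the validation step might look like the sketch below. The data frame is simulated, and the column names (weight, abdomen, siri) are hypothetical stand-ins for the corresponding variables in the list above.

```r
# Simulated stand-in for the body fat data (invented values, 252 rows).
set.seed(8)
n <- 252
fat <- data.frame(weight = rnorm(n, 180, 25), abdomen = rnorm(n, 92, 10))
fat$siri <- -40 + 0.9 * fat$abdomen - 0.1 * fat$weight + rnorm(n, sd = 4)

# Random split: 2/3 for model building, 1/3 held out for validation
train_id <- sample(n, size = round(2 * n / 3))
train <- fat[train_id, ]
valid <- fat[-train_id, ]

fit <- lm(siri ~ abdomen + weight, data = train)

# Root mean squared error on the data used for fitting and on the held-out data
rmse_train <- sqrt(mean(residuals(fit)^2))
pred <- predict(fit, newdata = valid)
rmse_valid <- sqrt(mean((valid$siri - pred)^2))
```

All variable selection and outlier handling should use the training rows only, so that the validation error remains an honest estimate.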
Assignment 9 (TBA)
The data contain four measurements made of male Egyptian skulls from five different time periods ranging from 4000 B.C. to 150 A.D. We wish to analyze the data to determine if there are any differences in the skull sizes between the time periods and if they show any changes with time. The researchers theorize that a change in skull size over time is evidence of the interbreeding of the Egyptians with immigrant populations over the years.
The data set contains five variables:
1. MB: Maximal Breadth of Skull
2. BH: Basibregmatic Height of Skull
3. BL: Basialveolar Length of Skull
4. NH: Nasal Height of Skull
5. Year: Approximate Year of Skull Formation (negative = B.C., positive = A.D.)
1. Conduct a principal component analysis on the first four variables. Interpret what you obtain.
2. Plot the first two principal components against Year. Describe what you observe.
3. Do a formal test to answer the question: Is there a change in skull size over time?
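A minimal sketch of the calls involved, on simulated measurements that reuse the variable names above, is given below. The period years, group sizes and values are invented, and the two tests at the end (a regression of the first component on Year and a one-way MANOVA across periods) are only two of several reasonable choices for the formal test.

```r
# Simulated stand-in for the skull data (invented values).
set.seed(9)
year <- rep(c(-4000, -3300, -1850, -200, 150), each = 30)
skulls <- data.frame(
  MB   = 131 + 0.001 * (year + 4000) + rnorm(150, sd = 5),
  BH   = 133 + rnorm(150, sd = 5),
  BL   = 99 - 0.001 * (year + 4000) + rnorm(150, sd = 5),
  NH   = 50 + rnorm(150, sd = 3),
  Year = year
)

# Principal components of the four measurements (standardized)
pca <- prcomp(skulls[, c("MB", "BH", "BL", "NH")], scale. = TRUE)
summary(pca)       # proportion of variance explained by each component
pca$rotation       # loadings, used to interpret the components

# Scores on the first two components, to be plotted against Year
scores <- pca$x[, 1:2]

# Trend in the first component over time
trend <- lm(scores[, "PC1"] ~ Year, data = skulls)

# Multivariate comparison of the four measurements across periods
mv <- manova(cbind(MB, BH, BL, NH) ~ factor(Year), data = skulls)
summary(mv, test = "Wilks")
```

plot(skulls$Year, scores[, "PC1"]) and the same call with "PC2" give the plots asked for in question 2.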