Thesis Abstracts in Statistics

Reliability Applications of the EM Algorithm
by Jose Ramon G. Albert,    Advisor: Laurence Baxter         August, 1993

It is desired to estimate the parameters of a lifelength model. Inference is
based on field data which are incomplete in some fashion. For instance, the
lifelength distribution may be affected by an unobserved environmental  factor;
the reported life of a component may include a period of unknown  duration
during which the component is not in use; or the component may be  part of a
larger system, and failure mode analysis reveals only one module  containing the
failed component, not the identity of the component. It is  shown how the EM
algorithm can be used to calculate the maximum likelihood  estimates of the
parameters of interest in these cases.


The Application of Jackknife Statistics to the Problem of Obtaining Interval
Estimates of the Recombination Fraction in Phase-Unknown Nuclear Families
by Barbara Denise Berger,         Advisor: Nancy Mendell          August, 1993

Maximum likelihood estimates (MLE's) are traditionally used to estimate the
recombination fraction, θ, in genetic linkage studies. The jackknife estimate
of this parameter is examined as an alternative to the MLE. Properties of the
jackknife estimate and its respective confidence interval are computed for
two cases: (i) a single sibship of size S in which one offspring is  eliminated
at a time and (ii) a group of N nuclear families in which one  family is
systematically removed with each iteration. In both cases,  jackknifing produces
optimal results for situations in which loose linkage is  involved. Hence, the
jackknife estimate can be a useful statistical tool for  genetic exclusion
studies.
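
As a generic illustration of the leave-one-out jackknife described above (not
the linkage-specific computation of the thesis), the following Python sketch
computes a jackknife estimate, its standard error, and a normal-approximation
interval. The function name jackknife, the use of the sample proportion as the
estimator, and the toy 0/1 data are assumptions made only for illustration.

```python
import math
from statistics import mean

def jackknife(data, estimator):
    """Leave-one-out jackknife: bias-corrected estimate and standard error."""
    n = len(data)
    full = estimator(data)
    # Recompute the estimator with each observation removed in turn.
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = mean(loo)
    estimate = n * full - (n - 1) * loo_mean
    se = math.sqrt((n - 1) / n * sum((v - loo_mean) ** 2 for v in loo))
    return estimate, se

# Hypothetical data: 1 = recombinant offspring, 0 = non-recombinant.
offspring = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
est, se = jackknife(offspring, mean)
# Approximate 95% interval based on the normal distribution.
lower, upper = est - 1.96 * se, est + 1.96 * se
```

For case (ii) of the abstract, the same function could be applied with one
list element per family instead of one per offspring.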


Renewal Theory: Limit Theorems for Renewal Processes in a Random Environment
and Nonparametric Confidence Intervals for the Renewal Function

by Linxiong Li,       Advisor: Laurence Baxter               August, 1993

This dissertation focuses on two aspects of renewal theory: generalizing
standard limit theorems to the case of a random environment and constructing
nonparametric confidence intervals for the renewal function. Firstly,  consider
a repairable system and assume that the sequence of failures of the  system
comprises a renewal process. A number of standard theorems, such as  the key
renewal theorem, Blackwell's theorem, etc., describe limiting  properties of the
renewal function. We present natural, easily-verifiable  conditions under which
these limit theorems hold when the system is operating  in a random environment
modeled by an arbitrary stochastic process. Secondly,  assuming that the
functional form of the underlying distribution function  generating the renewal
process is unknown, we propose a point estimator and a  nonparametric confidence
interval for the renewal function. In addition, for  the case of censored data,
we derive a consistent nonparametric confidence  interval for the renewal
function.


Small-Sample Properties of the Maximum Likelihood Estimates of the Parameters
for a Two-Component Normal Mixture
by Li-Jung Tseng,      Advisor: Stephen Finch           August, 1993

This dissertation uses cumulants to parametrize the location contaminated
normal mixture and studies the asymptotic distribution of the MLE's. We solve
for the regular parameters in terms of the first four cumulants. A complete
procedure of reparametrization is proposed to calculate the MLE's of the
cumulants. The value of using estimated cumulants in the location  contaminated
mixture is that their distribution appears to converge to  normality faster than
the MLE's of the regular parameters. Thode's algorithm  is confirmed to find the
MLE's with extremely high probability and to be more  efficient computationally
than an algorithm using cumulants.
   The optimization procedure is extended to find the MLE's for the scale
contaminated normal mixture. It is found that finding a numerical estimate of
the MLE depends on the sample kurtosis and the mixing proportion. The fourth
cumulant is positive definite, which implies the theoretical kurtosis of the
scale contaminated mixture is always greater than zero. A negative sample
kurtosis sometimes results in a zero LRT, with the estimates of the MLE's on
the boundary of the parameter space. This pattern would be expected based on
Lindsay's [38] results.
   The sample kurtosis is found to have considerable influence on the
distributional convergence of the MLE's of the parameters for the scale
contaminated normal mixture. The distribution of μ̂ appears normal for all
values of p. Further, σ̂1 and σ̂2 have individual patterns of convergence to
normality depending on the value of the mixing proportion. When the mixing
proportion is equal to 0.50, both σ̂1 and σ̂2 appear to have normal
distributions, with the apparent convergence of σ̂1 slower than that of σ̂2. A
value of p close to 1 will result in slow rates of convergence for both σ̂1 and
σ̂2.
  While investigating the MLE's of the cumulants, we find that the estimated
cumulants of the scale contaminated normal mixture converge to normality more
slowly than those of the regular parameters, which is opposite to the results
for the location contaminated normal mixture.


Weighted Product-Limit Statistic for Survival Analysis Under Random  Truncation
by Xiao-rong Yan,      Advisor: Qiqing Yu               August, 1994

Random left-truncation is modeled by the conditional distribution of the  random
variable X of interest, given that it is larger than the truncation  random
variable Y. A nonparametric test statistic for comparing two survival functions
under the left truncation model is proposed and named the Weighted Product-Limit
(WPL) statistic. An explicit expression for the WPL statistic is given and its
asymptotic distribution is derived.
  Since the WPL statistic is based directly on the estimated survival functions
rather than on ranks, it should be more sensitive than the commonly used log
rank test to the magnitude of the difference in survival time under a
stochastic ordering alternative. The log rank test is not always sensitive to
stochastic ordering alternatives, especially when the hazards cross.
   We give numerical results to explore the sample size required under
distributions with different skewness and tail weights, factors that might be
expected to affect the empirical sizes of the tests under study. We compare the
empirical size of the WPL test with that of the log rank test under the null
hypothesis of no difference in survival. The results show that the sample size
required for the WPL test is much smaller than that required for the log rank
test. Therefore, under small or moderate sample sizes, the WPL test has a much
smaller type I error than the log rank test. The power of the WPL test is also
compared with that of the log rank test under the proportional hazards model,
where the log rank test is the most powerful rank invariant test, and under
other stochastic ordering alternatives. When the hazards are ordered, the WPL
test is less powerful than the log rank test. However, it outperforms the log
rank test if the hazards cross, especially when the difference in survival
crosses a large span of time.


Application of Bivariate Normal Mixture to Twin Data
by Seungwoo Lee,          Advisor:  Nancy Mendell            December, 1995

The problem of testing a single-component bivariate normal distribution against
a mixture of bivariate normal distributions is applied to the study of MZ and
DZ twins. The problem arises when observations are taken from two related
individuals and the joint distribution in the pair is a bivariate normal
distribution. The alternative distribution would occur if there is some factor
present in some proportion of the population, or if there is a major gene
accounting for the mixture.
   The existence of a mixture can be tested by a likelihood ratio test.
Therefore, a simulation study is conducted to investigate the approximation of
the asymptotic distribution of the quantity -2ln(L1/L2) using quasi-Newton
methods. Three different mixture models are considered in the simulation: a
two-component bivariate normal mixture distribution when the samples are from
MZ twin pairs, a four-component bivariate normal mixture distribution when the
samples are from DZ twin pairs, and the MZ and DZ combined samples. For each
mixture model, a gamma distribution is fitted to obtain the asymptotic
distribution of the quantity -2ln(L1/L2), and fitted percentage tables are
provided.


Simulation Study of SKUMIX Algorithm: A Research on Skewness-Mixture Problem
by Yu-Ming Ning,        Advisor: Stephen Finch               December, 1995

The SKUMIX algorithm proposed by Maclean et al. (1976), which is a likelihood
ratio test for distinguishing skewness from commingled distributions by using a
power transform to remove skewness appropriately for the alternative hypotheses
tested, has been investigated and found to be incorrect. A new SKUMIX program
is then given for improving the optimization procedure to obtain the global
maximum likelihood estimates.
   The new program is also based on the principle of the simultaneous
estimation of skewness parameters derived from power transformations with
mixture parameters. This program can detect the difference between the inherent
distributional skewness and the apparent skewness that is a manifestation of
the mixture of two distributions.
   The percentage points of the likelihood ratio test of the null hypothesis
for 2,500 samples of various sizes with the Box-Cox power parameter λ = 0. 5.
9. 1. 1 1. and 15 are explored through both methods of simulation and modeling.


An Application of Robust Estimation to Linkage Analysis
by Richard MacLeod Single,         Advisor: Stephen Finch        August,  1996

A method for detecting linkage between a genetic marker and a quantitative
trait in sibship data is presented that models the dependence structure  within
families. An efficient algorithm is given for the application of  generalized
least squares (GLS) to the regression procedure developed by  Haseman and Elston
(1972). Also, the implications of fitting an incomplete  model in the context of
a simulation study are reported.
   The gain in efficiency from using GLS as opposed to ordinary least squares
in the Haseman-Elston (H-E) procedure is studied.  Simulation studies, using  a
fully informative marker locus, indicate that the average gain in  efficiency is
approximately 11% for three sibling families, 25% for four  sibling families,
36% for five sibling families, and 44% for six sibling  families. The realized
gain in efficiency is a function of the underlying  genetic parameters in the
study and is roughly equal to the percentages  reported.
   The null distribution of the test statistic based on the GLS estimator of
the H-E regression coefficient is found to be significantly skewed for  studies
with a small number of families. Nevertheless, the observed  significance levels
for the test, using critical values from the standard  normal distribution, are
not far from nominal levels.


Exact Distributions of Extreme Value Statistics for Urn Models with
Epidemiologic Applications
by Kwisung Hwang,       Advisor: Roger Grimson (Community Medicine)
December, 1996

This dissertation analyzes tests for temporal clustering of disease in a
population. The classical occupancy models (or urn models) are used to formulate
tests for temporal clustering of disease occurrence in a period of C discrete
consecutive time units. Extreme value statistics such as the maximum (M1)
or the second largest (M2) cell frequencies, and the sum (S2) of the maximum
and the second largest cell frequencies in C cells are used to detect
evidence of disease clustering. The exact moments of M1 needed in a global
analysis which involves several space-time units (periods) have been unknown.
The exact probability mass  function and moments of M1 are derived. Furthermore,
the exact probability  mass functions and moments of M2 and S2 are presented.
    The numerical results of tests using M1, M2, and S2 under the null
hypothesis of no clustering are given. Tests based on M1, M2, and S2 are
compared using well-known series of data that have been previously determined
to possess temporal clustering. Analysis reveals that the new test based on  S2
is more sensitive to clustering than are the tests based on M1 or M2.
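
The exact distribution of M1 can be illustrated with a standard
generating-function identity for the equiprobable urn model: P(M1 <= m) equals
N!/C^N times the coefficient of x^N in (sum over k = 0..m of x^k/k!)^C. The
Python sketch below applies that identity with exact rational arithmetic. It
is not the derivation given in the dissertation. The function names, the
assumption that all C cells are equally likely, and the example sizes are
illustrative assumptions.

```python
from fractions import Fraction
from math import factorial

def max_cell_cdf(n_cases, n_cells, m):
    """P(M1 <= m) when n_cases fall independently and uniformly into n_cells."""
    # Coefficients of the truncated exponential series sum_{k<=m} x^k / k!.
    base = [Fraction(1, factorial(k)) for k in range(min(m, n_cases) + 1)]
    poly = [Fraction(1)]  # start from the constant polynomial 1
    for _ in range(n_cells):
        size = min(len(poly) + len(base) - 1, n_cases + 1)
        new = [Fraction(0)] * size
        for i, a in enumerate(poly):
            for j, b in enumerate(base):
                if i + j <= n_cases:  # higher powers are never needed
                    new[i + j] += a * b
        poly = new
    coeff = poly[n_cases] if n_cases < len(poly) else Fraction(0)
    return coeff * factorial(n_cases) / Fraction(n_cells) ** n_cases

def max_cell_pmf(n_cases, n_cells):
    """Exact probability mass function of M1 as a dict {m: P(M1 = m)}."""
    cdf = [max_cell_cdf(n_cases, n_cells, m) for m in range(n_cases + 1)]
    return {m: cdf[m] - (cdf[m - 1] if m else 0) for m in range(n_cases + 1)}

# Hypothetical example: 10 cases over 12 time units, observed maximum of 4.
pmf = max_cell_pmf(10, 12)
mean_m1 = sum(m * p for m, p in pmf.items())         # first moment of M1
p_value = 1 - max_cell_cdf(10, 12, 3)                # P(M1 >= 4) under the null
```

Small p_value values would indicate more concentration in a single time unit
than expected under the null hypothesis of no clustering.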


A Test for a Mixture of Two Gamma Distributions
by Daekee Min, Advisor: Nancy Mendell              December, 1996

A method was developed for obtaining the MLE of the parameters of a two
component gamma mixture distribution. This computational algorithm uses the
EM algorithm plus a modified Newton algorithm. Investigation of the
algorithm indicates that it obtains the global maximum likelihood. A large
scale simulation study determined the properties of the likelihood ratio test
of the null hypothesis that a variable has a gamma distribution vs. the
alternative of a two component gamma distribution (where the components  differ
in the shape parameter but not the scale parameter).
   The simulation results indicate that the null distribution is essentially
invariant to the value of the shape parameter. Modeling of the null
distribution indicates that it is well approximated by a gamma distribution
with shape parameter equal to the quantity 0.927 + 1.18/sqrt(n) and scale
parameter equal to 2.16.
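
The fitted approximation above can be evaluated directly. The Python sketch
below computes the shape and scale quoted in the abstract for a given sample
size n and estimates an upper percentage point of the approximating gamma
distribution by Monte Carlo, using only the standard library. The function
names, the choice n = 100, the number of draws, and the seed are assumptions.
The simulated percentage point is an approximation, not a tabulated value from
the thesis.

```python
import math
import random

def lrt_null_gamma(n):
    """Shape and scale of the gamma approximation quoted in the abstract."""
    shape = 0.927 + 1.18 / math.sqrt(n)
    scale = 2.16
    return shape, scale

def simulated_critical_value(n, alpha=0.05, draws=100_000, seed=1):
    """Monte Carlo estimate of the upper-alpha point of the fitted gamma."""
    shape, scale = lrt_null_gamma(n)
    rng = random.Random(seed)
    # random.gammavariate takes the shape first and the scale second.
    sample = sorted(rng.gammavariate(shape, scale) for _ in range(draws))
    return sample[int((1 - alpha) * draws)]

shape, scale = lrt_null_gamma(100)        # shape = 0.927 + 1.18/10 = 1.045
approx_mean = shape * scale               # mean of the approximating gamma
critical = simulated_critical_value(100)  # approximate 5% critical value
```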


An Evaluation of the Power and Precision of Missing Data Procedures
by Nora Louise Galambos,       Advisor: Nancy Mendell        August, 1997

Considered here is the problem of estimating the parameters of two groups of
continuous data when some percentage of the categorical grouping variable is
missing. Hence, the data consist of n1 values from group 1, n2 values from
group 2, and nm values for which group membership is unknown, with both groups
normally distributed with unknown means (μ1 and μ2) and a common unknown
variance σ².
   The EM algorithm was used to calculate total sample maximum likelihood
estimates (TSE) of the two group means and the group membership proportions.
Estimates of the unknown parameters using only categorized observations, or
complete cases, were also calculated.  A simulation study was conducted to
determine the efficiency of the TSE relative to complete case estimates (CCE).
   The relative difference between the mean squared errors of the TSE and CCE
estimates of the difference between the means, (μ2 - μ1), was calculated to
determine the relative efficiency of the two estimation methods.  When
missingness and sample size were in this range, the population effect size had
a substantial effect on the relative efficiency of the estimators of (μ2 - μ1).
Further simulations were conducted to determine the null distribution of the
likelihood ratio test (LRT) of the null hypothesis, μ1 = μ2, and to find
the appropriate α-level critical values.
   Finally, the power of the test based on the fitted α = 0.05 level critical
values to reject the null hypothesis, μ1 = μ2, against the alternative
hypothesis, μ1 ≠ μ2, was studied via simulation and compared to the power of
the t test.  Overall, the statistic based on the TSE provided a more powerful
test, but when the population effect size and sample size were large with a
small percentage of data missing, the powers of the two tests were essentially
the same.
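
The model described above lends itself to a compact EM iteration. The Python
sketch below is one possible implementation under stated assumptions, not the
code used in the thesis: group labels are assumed to be missing completely at
random, the starting values are the complete-case estimates, and the function
name, iteration count, and simulated data are all illustrative.

```python
import math
import random

def em_partial_labels(x1, x2, xm, n_iter=200):
    """EM for two normal groups with common variance and some labels missing."""
    n_lab = len(x1) + len(x2)
    n = n_lab + len(xm)
    # Start from the complete-case (labeled observations only) estimates.
    mu1, mu2 = sum(x1) / len(x1), sum(x2) / len(x2)
    p = len(x1) / n_lab
    ss = sum((x - mu1) ** 2 for x in x1) + sum((x - mu2) ** 2 for x in x2)
    var = ss / n_lab
    for _ in range(n_iter):
        # E-step: probability that each unlabeled value belongs to group 1.
        w = []
        for x in xm:
            a = p * math.exp(-(x - mu1) ** 2 / (2 * var))
            b = (1 - p) * math.exp(-(x - mu2) ** 2 / (2 * var))
            w.append(a / (a + b))
        # M-step: weighted updates that use labeled and unlabeled values.
        s1 = len(x1) + sum(w)
        s2 = n - s1
        mu1 = (sum(x1) + sum(wi * x for wi, x in zip(w, xm))) / s1
        mu2 = (sum(x2) + sum((1 - wi) * x for wi, x in zip(w, xm))) / s2
        p = s1 / n
        ss = sum((x - mu1) ** 2 for x in x1) + sum((x - mu2) ** 2 for x in x2)
        ss += sum(wi * (x - mu1) ** 2 + (1 - wi) * (x - mu2) ** 2
                  for wi, x in zip(w, xm))
        var = ss / n
    return p, mu1, mu2, var

# Simulated example: means 0 and 2, common variance 1, one third unlabeled.
rng = random.Random(0)
g1 = [rng.gauss(0.0, 1.0) for _ in range(300)]
g2 = [rng.gauss(2.0, 1.0) for _ in range(300)]
unlabeled = g1[200:] + g2[200:]
p_hat, mu1_hat, mu2_hat, var_hat = em_partial_labels(g1[:200], g2[:200], unlabeled)
```

The complete-case estimates in this sketch are simply the starting values, so
comparing them with the returned values shows how much the unlabeled
observations shift the estimates.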


Empirical Distribution Function-Based Goodness-of-Fit Tests for the Two-
Component Homoscedastic Normal Mixture
by Jordan L. Neus,     Advisor: Nancy Mendell                August, 1997

The null distributions of Lilliefors' modifications of the Kolmogorov-Smirnov
test, the Anderson-Darling test, and the Cramer-von Mises test were
investigated as goodness-of-fit (GOF) tests for the two-component homoscedastic
normal mixture.
The mixture parameters were estimated using a variable metric minimization
algorithm to search for the MLE's.  The simulation results, which are based  on
90,000 random samples per sample size, suggest that the null percentiles  of
each test are not invariant to the values of the mixture parameters p and  d.
Moreover, the simulated critical values of the Anderson-Darling test and  the
Cramer-von Mises test appeared not to be monotonic functions of the  sample
size, and even increased for some parameter combinations.  Consequently, the
Anderson-Darling test and the Cramer-von Mises test were  not studied in detail.
In contrast, the critical values of the modified  Kolmogorov-Smirnov test
appeared to be well-behaved, strictly decreasing  functions of the sample size.
Selected critical values of the modified
Kolmogorov-Smirnov test were modeled as nonlinear functions of p, d, and n in
order to reduce the error associated with the observed parameter dependence.
Formulae for estimating selected critical values using parameter estimates are
proposed and assessed. Power functions for the modified Kolmogorov-Smirnov test
are also presented for selected alternative distributions.


Exact Null Distributions of Runs Statistics for Occupancy Models with
Applications to Disease Cluster Analysis

by James Mancuso,        Advisor: Roger Grimson (Medical School)       May, 1998

Exact tests for the temporal clustering of disease based on the Maxwell-
Boltzmann Occupancy Model are considered with emphasis on the sparse data
situation.  The tests are conditional on the total number of cases diagnosed  in
a specified population over a fixed period of time.  Standard tests such  as the
Ederer-Myers-Mantel (EMM) Test, the Scan Test, the Empty Cells Test  (ECT), the
Binary Runs Test, and the Maximum Cell Occupancy Number (MAX) Test  are
reviewed.
   The exact null distribution of the Longest Run of Empty Cells (LREC) Test,
introduced by Grimson, Aldrich, and Drane (1992), is investigated further.
Approximations to the P-value are developed which reduce computation while
maintaining excellent accuracy for P-values below twenty percent.  Moments are
investigated, and a general expression is derived for the computation of  the
factorial and ordinary moments in terms of the reverse cumulative  distribution.
In addition, a fundamental connection is made between the  exact distribution of
the LREC Statistic and the Bose-Einstein Occupancy  Model with maximum capacity.
   Furthermore, combinatorial methods are applied to develop new tests which
extend the runs-based methodology.  A sparse cell is defined to be one which
contains at most one object, and one of the tests is based on the Longest Run
of Sparse Cells (LRSC).  Another test is based on the Number of Runs of Empty
Cells (NREC).
   The new tests are evaluated and compared to the LREC, ECT, and MAX tests  in
terms of power against the Pulse alternative discussed by Wallenstein and  Neff
(1987).  Results suggest that the LREC and NREC tests yield a  significant
increase in power over the standard tests for sparse data  situations, where
there are only half as many cases as unit time intervals.
   Suggestions are made to aid the epidemiologist in choosing a test for
clustering, and results are applied to data concerning the incidence of
pediatric Rhabdomyosarcoma (RMS) in Gaston County, North Carolina from
1970-1990.
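
For very small problems the exact null distribution of the LREC statistic can
be checked by brute-force enumeration of the Maxwell-Boltzmann model, in which
every assignment of cases to time units is equally likely. The Python sketch
below does this. It is only a verification device, not the combinatorial
derivation of the dissertation, and it assumes the time units form a linear
(non-circular) sequence. The function names and the example of 3 cases in 6
time units are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def longest_empty_run(occupied, n_cells):
    """Length of the longest run of consecutive cells with no cases."""
    longest = current = 0
    for cell in range(n_cells):
        current = 0 if cell in occupied else current + 1
        longest = max(longest, current)
    return longest

def lrec_null_distribution(n_cases, n_cells):
    """Exact null distribution of the LREC statistic by full enumeration."""
    counts = Counter(
        longest_empty_run(set(assignment), n_cells)
        for assignment in product(range(n_cells), repeat=n_cases)
    )
    total = n_cells ** n_cases
    return {run: Fraction(c, total) for run, c in sorted(counts.items())}

# Sparse example: half as many cases as time units.
dist = lrec_null_distribution(n_cases=3, n_cells=6)
# Upper-tail P-value for an observed longest empty run of 4.
p_value = sum(p for run, p in dist.items() if run >= 4)
```

Enumeration grows as C^N, so this approach is practical only for checking
exact formulas or approximations on small cases.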

A Comparative Study of Tests for Homogeneity of Variances under Ordered
Alternatives

by Orlando Arencibia Moreles,     Advisor: Stephen Finch          May, 1998

The robustness of validity and power of tests of equal variance against ordered
alternatives were studied using Monte Carlo techniques.  Following Fujino (1979)
and Mudholkar et al. (1993), I studied equal sample sizes drawn from six
populations from the normal, uniform, and Laplace distributions.  Sample sizes
range from 3 to 25 for a range of alternatives.
   Three types of modifications to standard procedures were evaluated:
Bartlett's correction (or Wang's modification), an empirical correction for  the
estimated kurtosis of the sample distribution, and use of a statistic  other
than the mean in determining deviations.  Three procedures that have  robustness
of validity are more powerful than the other procedures.  The  Boswell-Brunk
test with Wang's modification of Bartlett's correction,  Layard's correction for
kurtosis, and using the median rather than the mean  when the estimated kurtosis
is sufficiently large was the robust procedure  that had greatest average power.
Takeuchi's test with Layard's correction  and using the median rather than the
mean when the absolute value of the  estimated kurtosis is sufficiently large is
a procedure that has slightly  higher power for some but not all alternatives.
Vincent's test with the same  modifications is also a good procedure.