Calculating Beta and Sample Size


1. The Null Distribution of the One-Sample Test.

The null distribution of a test statistic is the probability density function that describes the test statistic when the null hypothesis is true. For example, in the problem of evaluating a student taking a 100 question true-false test, the null hypothesis is that the student is a random guesser. Under this null hypothesis, when the questions are given in a randomized order, a random guesser has a .5 chance of answering each question correctly, the standard deviation of the random variable indicating whether a specific question is answered correctly is .5, and the sequence of answers is independent. By the Central Limit Theorem for Averages, I can conclude that the probability distribution of the fraction of questions correctly answered when the null hypothesis is true has a shape that is nearly normal with expected value .50 and standard deviation .05. That is, about 99.99% of the random guessers taking this examination will get between .30 and .70 of the questions correct (that is, will be within four standard deviations of the expected value).

I could just as well use the fraction of questions incorrectly answered or the number of questions correctly answered as the statistic that is the basis of my test. These choices are obviously related to the fraction of questions correctly answered in the sense that the knowledge of the fraction of questions correctly answered by a student would allow you to infer the values of the other statistics. Each of these statistics, however, has its own null distribution whose properties must be logically consistent with the fact that the fraction of questions correctly answered is a random variable whose probability distribution is nearly normal with expected value .50 and standard deviation .05.
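The arithmetic behind these null distributions can be checked with a short sketch. It assumes Python's standard library `statistics.NormalDist` as the normal approximation; the variable names are illustrative and not part of the course materials.

```python
from math import sqrt
from statistics import NormalDist

n = 100      # number of true-false questions
p0 = 0.5     # chance a random guesser answers one question correctly

sd_single = sqrt(p0 * (1 - p0))      # standard deviation of one answer: .5
sd_fraction = sd_single / sqrt(n)    # standard deviation of the fraction correct: .05

# Normal approximation to the null distribution of the fraction correct.
null_dist = NormalDist(mu=p0, sigma=sd_fraction)

# Probability of landing within four standard deviations (.30 to .70).
within_four_sd = null_dist.cdf(0.70) - null_dist.cdf(0.30)

# The number correct has its own, logically consistent, null distribution.
mean_count = n * p0                  # 50 questions
sd_count = n * sd_fraction           # 5 questions

print(sd_fraction, round(within_four_sd, 4), mean_count, sd_count)
```

The same scaling by n converts any statement about the fraction correct into a statement about the number correct.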

2. Side of the Alternative.

The second major step in testing a null hypothesis is to examine the alternative hypothesis to determine whether the appropriate test is right-sided, left-sided, or two-sided. A right-sided alternative hypothesis specifies that the expected value of the test statistic is greater than the value specified in the null hypothesis. That is, the alternative hypothesis is that the expected value of the test statistic specified under the alternative is to the right of the expected value specified under the null. A left-sided alternative hypothesis specifies that the expected value of the test statistic is less than the value specified in the null hypothesis; that is, the expected value of the test statistic when the alternative hypothesis is correct is to the left of the expected value of the test statistic under the null hypothesis. A one-sided alternative hypothesis is either a right-sided alternative or a left-sided alternative. A two-sided alternative hypothesis about the expected value of a test statistic specifies that its expected value is different from the value specified in the null hypothesis. That is, under the alternative hypothesis, the expected value of the test statistic could be either to the right or to the left of the expected value of the test statistic specified under the null hypothesis.

The logic of the problem determines the side of the test. In the example of this chapter using the fraction of questions correctly answered, the alternative hypothesis is right-sided--namely, that the expected fraction of questions correctly answered by the subject is greater than .50, the fraction expected to be answered correctly by a random guesser. If the statistic used in the test were the fraction of questions incorrectly answered, the logic of the problem would be that the null hypothesis was that the expected fraction of questions incorrectly answered was .50 (the fraction of errors expected from a random guesser) and the alternative hypothesis that the expected fraction of errors was less than .50. The alternative hypothesis is left-sided with this logic.

If the logic of my alternative hypothesis were that I sought to identify either students who performed better than random guessers or students who performed worse than random guessers, then I would have a two-sided alternative. In this event, the alternative hypothesis would be that the expected fraction of questions correctly answered was not equal to .50.

In the problems for this course, I will ask you to determine the sidedness of a test. The answer to the question is the sidedness of the alternative.

3. Finding the Critical Region.

The sample space of the experimental outcome was divided into two incompatible events. One, the acceptance region, consisted of outcomes for which I would accept the null hypothesis. The other, the rejection region or critical region, consisted of outcomes for which I would reject the null hypothesis. A Type I error would occur when the null hypothesis was true and I observed an experimental outcome falling in the rejection region. The significance level of the experiment was the probability of a Type I error.

In problems requiring the calculation of the probability of a Type II error or a sample size needed to give specified probability of a Type II error, I have to find the critical region of a test of a null hypothesis. Because of this, a natural first problem to consider is to find the critical region for a specified null hypothesis, specified alternative hypothesis, and specified level of significance.

For example, in the continuing problem of this chapter, the null hypothesis that the fraction of questions correctly answered has a nearly normal distribution with expected value .50 and standard deviation .05 corresponds to the null hypothesis that a random guesser is taking a 100 question true-false test with the order of the questions randomized. The natural alternative hypothesis to this null hypothesis is that the fraction of questions correctly answered has a nearly normal probability distribution with expected value greater than .50 and standard deviation slightly less than .05. That is, the student taking the test knows enough material that the tester can expect the student to answer a greater fraction of questions correctly than a random guesser.

With this logic and choice of test statistic, the test is right-sided. My procedure will be to find a number, called the critical value, such that when the fraction of questions correctly answered is to the left of the critical value I accept the null hypothesis and when the fraction of questions correctly answered is equal to or to the right of the critical value I reject the null hypothesis. That is, the critical region will be the set of values at or to the right of the critical value. Of course, I have to choose the critical value so that the probability of observing an experimental result in the critical region is equal to the significance level when the null hypothesis is true. That is, I have to define the critical value so that the probability of a Type I error is equal to the specified significance level.

Suppose that the value of the significance level is alpha=.01, and that I want to find the critical value for the problem of this chapter. From Table 3.2, I know that .01 of the area of a standard normal probability distribution is to the right of 2.32. That is, 1% of the area of a normal probability density function is to the right of the vertical line drawn 2.32 standard deviations above the expected value of the normally distributed random variable.

When the null hypothesis is true, the expected value of the fraction of questions correctly answered on a 100 question true-false test with questions presented in random order is .50, and the standard deviation is .05. The value .616 is equal to the expected value .50 plus the product of 2.32 and .05, the standard deviation of the fraction correctly answered by a random guesser. My rejection region then is the set of fractions of questions correctly answered greater than or equal to .616, and the acceptance region is the set of fractions less than .616.

If the observed result falls in the rejection region, then I reject the null hypothesis. Otherwise, I accept the null hypothesis. So if a student took a 100 question true-false test with the questions presented in randomized order and got .59 of the questions correct (that is, answered 59 correctly and 41 incorrectly), I would accept the null hypothesis at the .01 level of significance.
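A minimal sketch of the right-sided critical value calculation follows. The table value 2.32 is taken from the text; the quantile from `statistics.NormalDist().inv_cdf` is shown only for comparison, and the helper name `decide` is hypothetical.

```python
from statistics import NormalDist

alpha = 0.01                 # significance level
mean0, sd0 = 0.50, 0.05      # null distribution of the fraction correct

z_table = 2.32                               # value read from the table in the text
z_exact = NormalDist().inv_cdf(1 - alpha)    # about 2.326, for comparison

critical_table = mean0 + z_table * sd0       # .616, the value used in the text
critical_exact = mean0 + z_exact * sd0       # about .6163

def decide(fraction_correct, critical_value):
    """Right-sided rule: reject at or above the critical value."""
    return "reject" if fraction_correct >= critical_value else "accept"

print(round(critical_table, 3), round(critical_exact, 4))
print(decide(0.59, critical_table))
```

The small gap between the two critical values comes only from rounding the table entry, and it does not change the decision for a student with .59 correct.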

If the alternative hypothesis were that the expected value of the fraction of questions correct was not equal to .50, then the test would be two-sided. For a significance level alpha=0.01, the probability that the fraction of questions correctly answered was in the rejection region when the null hypothesis was true would have to be .01. The rejection region would obviously have a right side and a left side, and together the probability of falling in either region when the null hypothesis held would be the level of significance. There are many such rejection regions. The procedure most commonly used in a statistics class is to choose the rejection regions so that the probability of being in the right region is equal to the probability of being in the left region.

Again, using the table of the standard normal cdf, I know that .005 of the area is to the left of -2.58, .005 is to the right of 2.58, and .99 of the area is between -2.58 and 2.58 in a standard normal probability distribution. Since .129 is 2.58 standard deviations for this problem, the rejection region is a fraction of questions correctly answered greater than .629 or less than .371.
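The two-sided region can be sketched the same way, assuming the equal-tail convention described above. The table value 2.58 comes from the text; `in_rejection_region` is an illustrative helper name.

```python
from statistics import NormalDist

alpha = 0.01
mean0, sd0 = 0.50, 0.05

# Equal tails: alpha/2 in each tail of the null distribution.
z_exact = NormalDist().inv_cdf(1 - alpha / 2)   # about 2.576
z_table = 2.58                                  # table value used in the text

half_width = z_table * sd0                      # .129
lower = mean0 - half_width                      # .371
upper = mean0 + half_width                      # .629

def in_rejection_region(fraction_correct):
    """Two-sided rule: reject below the lower or above the upper bound."""
    return fraction_correct < lower or fraction_correct > upper

print(round(lower, 3), round(upper, 3), in_rejection_region(0.59))
```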

4. Finding the three p-values.

Researchers often use p-values rather than critical values in the actual analysis of data because statistical computing programs find a p-value for almost every statistic calculated. When the p-value of a test statistic is greater than the level of significance specified, a researcher should accept the null hypothesis. When the p-value is less than or equal to the level of significance, the researcher should reject the null hypothesis. The p-value can be either a right-sided p-value, a left-sided p-value, or a two-sided p-value, depending upon the alternative specified.

I will find the three p-values for a student who answered .59 of the questions correctly on a 100 question true-false test with questions in randomized order, under the null hypothesis that the student was a random guesser. The numerical value of the right-sided p-value is approximately equal to the answer to a common question using the normal probability distribution: what is the probability that a normal random variable with expected value .50 and standard deviation .05 exceeds .59? To answer this, I first convert .59 to standard units by subtracting the expected value of .50 and then dividing by the standard deviation of .05: standard units(0.59)=(0.59-0.50)/.05=1.80. From the table of the normal cdf, Phi(1.80)=.9641, and so P({Z>1.80})=.0359. The right-sided p-value is therefore approximately .04 (rounding off).

The Central Limit Theorem for Averages only guarantees that the limiting distribution is normal. That is, the cumulative distribution function of the standardized (by expected value .50 and standard deviation .5/n^0.5, which is .05 when n=100) fraction of questions correctly answered by a random guesser taking a true-false examination with n questions presented in random order converges to the cumulative distribution function of a standard normal random variable. The application here is that the limiting value is a good approximation to the probability asked for.

Finally, the interpretation of the p-value is that it is a statistic that measures the observed significance level of a set of observed data. Its calculation uses a probability distribution, but it is not a probability.

The left-sided p-value is approximately .9641. This is the probability that a standard normal random variable will take a value less than 1.80. This p-value is numerically equal to the probability that a random guesser will get a fraction correct of .59 or less.

The two-sided p-value is approximately .0719. This is numerically equal to the probability that a standard normal random variable would be greater than 1.80 or less than -1.80. This value is numerically equal to the probability that a random guesser would get a fraction correct .59 or greater or .41 or less. In the practical world, a statistician would report the two-sided p-value as approximately .07 rather than the value of .0719 given here to help guide you through the calculations.

To determine which p-value I should use, I look at the alternative hypothesis. The choice of p-value should match the sidedness of the test. If the test is right-sided, then I should use the right-sided p-value. For the example problem here with a right-sided test, level of significance set at .01, and .59 of the 100 questions correctly answered by the subject, I should find the right-sided p-value. This is the probability that a random guesser will do as well or better than the subject did. Since the right-sided p-value of approximately .036 was greater than the significance level (alpha =.01), I would accept the null hypothesis that this student is a random guesser.
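The three p-values and the decision for the right-sided test can be reproduced with a few lines. This is a sketch under the normal approximation used in this section (no continuity correction); the variable names are illustrative.

```python
from statistics import NormalDist

observed = 0.59
mean0, sd0 = 0.50, 0.05
alpha = 0.01

z = (observed - mean0) / sd0        # standard units, 1.80
std_normal = NormalDist()           # standard normal distribution

p_right = 1 - std_normal.cdf(z)     # about .0359
p_left = std_normal.cdf(z)          # about .9641
p_two = 2 * min(p_left, p_right)    # about .0719

# The alternative here is right-sided, so the right-sided p-value is used.
decision = "reject" if p_right <= alpha else "accept"

print(round(p_right, 4), round(p_left, 4), round(p_two, 4), decision)
```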

5. Probability of a Type II Error and Power Calculations.

The continuing problem of this chapter is to test the null hypothesis that a student taking a 100 question true-false test with questions presented in a randomized order is a random guesser. This null hypothesis is equivalent to the null hypothesis that the fraction of questions correctly answered is a random variable whose probability distribution is nearly normal with expected value .50, the fraction correct expected from a random guesser, and standard deviation .05. I will test this null hypothesis at the alpha=.01 level of significance against the alternative hypothesis that the student's fraction correct is a random variable whose probability distribution is nearly normal with expected value p greater than .50 and standard deviation {p(1-p)/n}^0.5, a value less than 0.05 for a sample size of 100. Earlier, I found that the critical value for a right-sided test of this null hypothesis at the .01 level of significance was .616.

I will now show you how to find the approximate value of the probability of a Type II error, accepting a false null hypothesis. For example, suppose that a student taking the 100 question true-false test was better than a random guesser--specifically, the probability distribution of the fraction correct is nearly normal with expected value .60 and standard deviation .049. A Type II error will occur when the student's fraction correct is less than (to the left of) the critical value of .616. The trick to calculating the probability of a Type II error is to draw both the null and alternative distributions of the statistic used before starting the calculations. Then, decide whether the test is right-sided, left-sided, or two-sided. Once these issues are correctly resolved, the calculations are easy.

When I test a student whose expected fraction correct is .60 (that is, the expected fraction of correct answers on the student's examination is 0.60 and the standard deviation of the fraction of correct answers on the examination is 0.049), I can calculate the approximate probability of a Type II error using the Central Limit Theorem and the table of the cdf of the standard normal. First, I convert the critical value of .616 to standard units for the alternative distribution. That is, standard units(.616) = (.616-.60)/.049 = .33. The expected value here is the expected value under the alternative distribution, not the expected value under the null distribution. The standard deviation used is the standard deviation of the alternative distribution, not the standard deviation of the null.

The table of the cdf of the standard normal does not have an entry for 0.33. Since the value of Phi(0.30)=.6179 and the value of Phi(0.35)=.6368, I know that .6179<Phi(0.33)<.6368. I can interpolate the value of Phi(0.33) from the two tabulated values, or I can use a more complete table. From the more complete table, Phi(0.33)=.6293; and hence the probability that the score will be to the left of .616 is approximately equal to .6293. Since a score to the left of .616 will lead to an incorrect acceptance of the null hypothesis (a Type II Error), the probability of a Type II Error is approximately .6293. That is, I have about a 5 in 8 chance of calling a student this good a "random guesser."

The power of a test of a null hypothesis for a specific alternative distribution is the probability that the null hypothesis will be rejected when the sample is drawn from the alternative specified. That is, for a specified alternative, Power = 1- Probability of Type II Error. In this problem, rejection occurs for a fraction correct greater than .616. The probability that a score will be to the right of .616 (which is a value 0.33 standard deviations of the alternative above the expected value of the alternative) is then approximately .3707 (which equals 1-.6293). This is the power of the test to reject the null hypothesis when the student is as good as was specified.
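The Type II error probability and the power for this alternative can be checked numerically. The sketch below assumes the critical value .616 and the alternative (expected value .60, standard deviation .049) given in the text. Because it does not round the standard units to .33, its answer differs from the table-based .6293 in the third decimal place.

```python
from statistics import NormalDist

critical_value = 0.616                        # right-sided critical value, alpha = .01
alternative = NormalDist(mu=0.60, sigma=0.049)

# Type II error: the fraction correct falls to the LEFT of the critical
# value even though the alternative distribution is the true one.
beta = alternative.cdf(critical_value)        # about .628
power = 1 - beta                              # about .372

# Rounding the standard units to .33 first, as in the text, gives .6293.
beta_rounded = NormalDist().cdf(0.33)

print(round(beta, 3), round(power, 3), round(beta_rounded, 4))
```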

Researchers want tests to have a low probability of a Type II error; that is, they want their tests to have high power. The use of a 100 question true-false test to distinguish students who could get .60 of the questions correct from random guessers is not a wise idea because the probability of a Type II error is so high. In practice, a statistician would advise that a greater number of true-false questions be asked to have a low probability of a Type II error. The next section shows how to decide on exactly the number of questions to be asked.

6. Calculating Sample Size to Give Specified Power.

Up to this point, I have worked problems in which the size of the sample was specified in the problem. The examination that my student is taking has 100 questions. Of course, I can decide to give a longer or a shorter examination. The issue in this section is how to calculate how large a sample is needed to give specified power. Before I discuss the general question, I will work a special case of this problem that happens to be relatively simple.

Example 1. Suppose the probability that a student correctly answers a true-false question on this examination is .616. Then the probability distribution of the fraction of questions this student will answer correctly has a shape that is nearly normal with expected value .616 and standard deviation .049, slightly less than .05. When I test whether this student is a random guesser against the alternative that this student is better than a random guesser at the alpha=.01 level of significance, what is the power of this test?

Solution: Of course, the null distribution is the same because the null hypothesis has not been changed. The probability distribution of the fraction correctly answered is nearly normal with expected value .50 and standard deviation .05. Since the test is still right-sided with level of significance .01, the critical value is still .616. Under the new alternative specified, the probability distribution of the fraction of questions correctly answered has a shape that is nearly normal with expected value .616 and standard deviation .049.

Geometrically, the power of the test is the area to the right of .616 under the alternative distribution. For this problem, the expected value of the alternative distribution matches exactly the critical value, and because a normal distribution is symmetric about its expected value, the area to the right of the critical value is .5. The power is .5 for this alternative, and the probability of a Type II error is .5.

End of Example 1.

This geometric insight then makes the solution of problems like Example 2 easy:

Example 2. A student, Jones, will take an examination and has such a partial mastery of the subject matter that the probability Jones answers a true-false question correctly is .6. The professor giving Jones the examination will test the null hypothesis that Jones is a random guesser against the alternative hypothesis that Jones is better than a random guesser at the .01 level of significance. How many questions n should be in the examination so that the power of the test is .5 for Jones?

To solve this problem, I use the geometric insight above. When Jones takes an examination consisting of n randomly ordered questions, the probability distribution of the fraction of questions Jones answers correctly is nearly normal with expected value .6 and standard deviation approximately .49/n^0.5. For the power to be .5 for Jones, the number of questions n must be such that the critical value of the test is equal to the expected fraction of questions Jones answers correctly. That is, the professor should choose the value of n to satisfy the equation .5 + 2.32(.5/n^0.5) = .6, where .5/n^0.5 is the standard deviation under the null hypothesis. That is, choose n to satisfy .6 - .5 = 1.16/n^0.5, which reduces to 1.16/n^0.5 = 0.1, or n^0.5 = 11.6. That is, n=134.56. The test must have 135 questions. Using more questions will increase the power of the test to a value greater than .5 for Jones.

End of Example 2.
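The calculation in Example 2 can be sketched as follows, using the table value 2.32 from the text; the variable names are illustrative.

```python
from math import ceil, sqrt

z_alpha = 2.32          # table value for a right-sided test at alpha = .01
sigma0 = 0.5            # standard deviation of one answer under the null
e0, e1 = 0.5, 0.6       # expected fraction correct: random guesser, Jones

# Power is .5 when the critical value equals Jones's expected fraction:
#   e0 + z_alpha * sigma0 / sqrt(n) = e1
root_n = z_alpha * sigma0 / (e1 - e0)    # 11.6
n_exact = root_n ** 2                    # 134.56
n_questions = ceil(n_exact)              # 135, rounded up to a whole question

# Check: with 135 questions the critical value sits just below .6.
critical_at_n = e0 + z_alpha * sigma0 / sqrt(n_questions)

print(round(root_n, 2), round(n_exact, 2), n_questions, round(critical_at_n, 4))
```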

The solution of the general problem requires more attention to detail. An example of such a problem is:

Example 3. How many true-false questions must be in a test with randomized order so that the probability of a Type II error of the test of the null hypothesis that a student is a random guesser at the .01 level of significance is .20 for a student who can answer .6 of the questions correctly?

Solution: To solve this problem, I first describe both the null distribution and the alternative distribution. The null distribution of the fraction of questions correctly answered, of course, is nearly normal with expected value .5 and standard deviation .5/n^0.5. The test is still right-sided with the same level of significance, so that the critical value is .5+2.32(.5/n^0.5). The alternative distribution has a nearly normal shape with expected value .6 and standard deviation .49/n^0.5.

Next, in the alternative distribution, I draw the vertical line corresponding to the critical value, .5+2.32(.5/n^0.5). Since the test is right-sided, the probability of a Type II error of the test is the area to the left of the critical value in the alternative distribution. When you work problems like this, always remember that the areas must be on opposite sides: For a left-sided test the probability of a Type I error is the area to the left of the critical value in the null distribution, and the probability of a Type II error is the area to the right of the critical value in the alternative distribution appropriate to the problem. The task, then, is to choose the value of n so that the area to the left of the critical value is .2 (and equivalently that the area to the right of the critical value is .8). From the table of the cdf of the standard normal, I know that P(Z<-0.842)=Phi(-0.842)=0.20, so that, rounding to -0.84, the probability that Z is greater than -0.84 is approximately 0.80.

The last step in the analysis then is to choose n so that the critical value, .5+2.32(.5/n^0.5), has a standard units value of -0.84 in the alternative distribution. That is, choose n so that standard units(.5+2.32(.5/n^0.5)) = -0.84. Since the expected value of the alternative is 0.6 and the standard deviation is 0.49/n^0.5, the equation becomes: choose n so that [(.5+2.32(.5/n^0.5))-0.6]/(.49/n^0.5) = -0.84. That is, choose n so that .5-.6+2.32(.5/n^0.5) = -0.84(.49/n^0.5). This reduces to .5-.6 = [(-0.84)(0.49)-(2.32)(0.5)]/n^0.5, or

n^0.5 = [(-0.84)(0.49)-(2.32)(0.50)]/[.5-.6] = 15.7.

The number of questions is then n=247, the integer equal to 15.7^2 after it is rounded up. Using more than 247 questions will produce a test with probability of Type II error less than .20 for this student.

End of Example 3.
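The steps of Example 3 can be sketched numerically. The sketch uses the table values 2.32 and -0.84 from the text, so the check at the end lands very near .20 rather than exactly on it; the variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

z_alpha = 2.32               # right-sided test, alpha = .01 (table value)
z_beta = -0.84               # Phi(z_beta) is about .20 (table value)
e0, sigma0 = 0.5, 0.5        # null: expected value and sd of one answer
e1, sigma1 = 0.6, 0.49       # alternative: expected value and sd of one answer

# Solve  [e0 + z_alpha*sigma0/sqrt(n) - e1] / (sigma1/sqrt(n)) = z_beta
# for sqrt(n).
root_n = (z_beta * sigma1 - z_alpha * sigma0) / (e0 - e1)   # about 15.7
n_questions = ceil(root_n ** 2)                             # 247

# Check the Type II error probability actually achieved at that n.
critical = e0 + z_alpha * sigma0 / sqrt(n_questions)
beta = NormalDist(mu=e1, sigma=sigma1 / sqrt(n_questions)).cdf(critical)

print(round(root_n, 3), n_questions, round(beta, 3))
```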

There are two larger lessons in this solution. The first is that a 100 question examination had a probability of Type II error beta=.63, a 135 question examination had beta=.50, and a 247 question examination had beta=.20. This illustrates the principle that an experiment with more experimental units has a lower probability of Type II error than a smaller experiment run at the same level of significance.

The second requires a closer look at the solution: n^0.5 = [(-0.84)(0.49)-(2.32)(0.50)]/[.5-.6]. Each of the numbers in this solution has a substantive explanation. The value -0.84 was the argument that makes the standard normal cdf equal to 0.20, the value of beta. Any problem like this would have a corresponding number that I will call zbeta. The value 0.49 was the standard deviation of an observation drawn when the alternative distribution was correct, and I will represent it as sigma1. Similarly, the value 2.32 was the argument that makes the standard normal cdf equal to .99 so that a right-sided test has level of significance .01. I will call the value corresponding to 2.32 in this problem zalpha. The value 0.50 multiplying 2.32 was the standard deviation of a single observation under the null, and so I will use the symbol sigma0 to represent it. The value .5 in the denominator was the expected value of a single observation under the null hypothesis, and so I will use the symbol E0 to represent it. The value 0.6 was the expected value under the alternative, and I will use E1 to represent it.

There is one last point here. The solution n^0.5 = [(-0.84)(0.49)-(2.32)(0.50)]/[.5-.6] is n^0.5 = [(0.84)(0.49)+(2.32)(0.50)]/[.6-.5] after I reverse the minus signs correctly. Whenever alpha < .5 and beta < .5 in this kind of problem, the terms will be positive like this. In mathematics, the use of a pair of vertical bars means make a numerical value positive. For example, |E0-E1|=|.5-.6|=|-.1|=.1, which is the same value as |E1-E0|. A second example would be |zbeta|=|-0.84|=0.84. The solution to this problem is an example of the general formula for the square root of the sample size n needed so that the probability of a Type I error is alpha and the probability of a Type II error is beta: n^0.5 > [|zbeta|sigma1 + |zalpha|sigma0]/[|E1-E0|]. The requirement for the formula to be valid is that both alpha < .5 and beta < .5. When this requirement does not hold, then you have to work the problem from first principles or use a more general result. Fortunately, very few experimenters are interested in intentionally running an experiment with beta>.5, so that this simpler formula is quite sufficient.
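The general formula can be wrapped in a small function. This is a sketch only: the function name `root_n_required` and its argument names are hypothetical, chosen to mirror the symbols zalpha, zbeta, E0, E1, sigma0, and sigma1 defined above.

```python
from math import ceil
from statistics import NormalDist

def root_n_required(z_alpha, z_beta, e0, e1, sigma0, sigma1):
    """Square root of the sample size from the formula in the text.

    Valid only when alpha < .5 and beta < .5, as the text requires.
    """
    if e0 == e1:
        raise ValueError("E0 and E1 must differ")
    return (abs(z_beta) * sigma1 + abs(z_alpha) * sigma0) / abs(e1 - e0)

# Example 3 with the table values used in the text.
n_table = ceil(root_n_required(2.32, -0.84, 0.5, 0.6, 0.5, 0.49) ** 2)

# The same problem with unrounded normal quantiles, for comparison.
z_alpha = NormalDist().inv_cdf(1 - 0.01)   # about 2.326
z_beta = NormalDist().inv_cdf(0.20)        # about -0.842
n_exact = ceil(root_n_required(z_alpha, z_beta, 0.5, 0.6, 0.5, 0.49) ** 2)

print(n_table, n_exact)
```

Rounding the table entries moves the answer by a question or two, which is a reminder to round the sample size up rather than down.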

7. Some Finer Points Ignored in This Discussion.

I have made two simplifications in this chapter. The first, and the more important in its effects, was the assumption that the questions in the examination were of the same difficulty. In fact, as you know all too well, some questions are easier than others, so that the assumption that the probability of a correct answer was the same for each question was not realistic. In writing an examination, the tester deliberately includes questions of varying levels of difficulty. The assumption of equal difficulty of questions is more reasonable in the tasting experiments that I discussed in Chapter 1.

The second simplification was that in the calculation of the p-value, I did not use the continuity correction. I had to find the probability that a random guesser would correctly answer a fraction of .59 or more of the 100 questions asked. That is, I had to find the probability that a random guesser answered 59 or more questions correctly. The probability distribution of the number correctly answered by a random guesser was nearly normal with expected value 50 and standard deviation 5. Here, I am approximating the probability distribution of a random variable that takes integer values with a random variable that takes fractional values. The continuity correction is to adjust the event 59 or more questions correct to the event 58.5 or more in the normal probability distribution. The effect of this is to round off the value of the continuous normal random variable to the integer values that are actually observed. The value 58.5 is converted to standard units, here, (58.5-50)/5=1.7. The resulting p-value is then .0446, compared to 0.036, the value obtained not using the correction. Many mathematicians have considered whether the continuity correction produces a better result and have found that it yields noticeable but slight improvements in accuracy.
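The effect of the continuity correction can be compared against the exact binomial probability. The sketch below assumes the random guesser's count of correct answers is binomial with 100 trials and success probability .5, which follows from the independence assumption made earlier. The exact sum uses `math.comb`.

```python
from math import comb
from statistics import NormalDist

n, observed_count = 100, 59
count_dist = NormalDist(mu=50, sigma=5)     # normal approximation to the count

p_uncorrected = 1 - count_dist.cdf(observed_count)        # about .036
p_corrected = 1 - count_dist.cdf(observed_count - 0.5)    # about .0446

# Exact probability that a random guesser gets 59 or more correct.
p_exact = sum(comb(n, k) for k in range(observed_count, n + 1)) / 2 ** n

print(round(p_uncorrected, 4), round(p_corrected, 4), round(p_exact, 4))
```

For this example the corrected value sits closer to the exact probability than the uncorrected one does.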

 

8. Some Design Considerations.

The specification of the four key parameters (E0, E1, sigma0, sigma1) in the application of the key formula for finding the sample size must be thoughtfully made in a real study. The key formula is: n^0.5 > [|zbeta|sigma1 + |zalpha|sigma0]/[|E1-E0|]. Notice that the sample size becomes larger when sigma1 is greater than sigma0. Suppose the distribution assumed under the null hypothesis is a normally distributed random variable with mean 500 and standard deviation 100, and that the average of a random sample of size n will be used to test the null hypothesis. Suppose the rejection rule is right-sided and that alpha is 0.01, so that one rejects when the sample average is greater than 500+2.326(100)/n^0.5. Suppose that the distribution assumed under the alternative is a normal distribution with the same mean 500 but a different standard deviation sigma1. Then for every sample size n, the probability of a Type II error is Phi(2.326(100)/sigma1), a value that does not shrink as n grows; a value of sigma1 near 116, for instance, gives Phi(2), which is about 0.977. This design problem occurs because the test procedure is sensitive to changes in means, but the alternative given specifies a change in variances with no change in means. This specification of the problem is the most extreme wrenching of the problem discussed in this chapter, but the sample size required to meet a given specification of alpha and beta increases whenever sigma1 is greater than sigma0. If this is a possibility in the study that you are designing, your plans must be thought through more carefully.
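A short sketch illustrates why adding observations does not help when the alternative leaves the mean unchanged. The function name `type_ii_error` is hypothetical, and the shifted mean of 520 in the second loop is an assumed value chosen only for contrast. With mean 500 and standard deviation 100 under both hypotheses, the value comes out as Phi(2.326), about .99, for every n.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(n, mean0, sd0, mean1, sd1, z_alpha=2.326):
    """P(sample average falls below the right-sided critical value | alternative)."""
    critical = mean0 + z_alpha * sd0 / sqrt(n)
    return NormalDist(mu=mean1, sigma=sd1 / sqrt(n)).cdf(critical)

sample_sizes = (25, 100, 400)

# Same mean under both hypotheses: beta does not depend on n.
same_mean = [type_ii_error(n, 500, 100, 500, 100) for n in sample_sizes]

# Assumed shifted mean (520): beta falls as n grows.
shifted_mean = [type_ii_error(n, 500, 100, 520, 100) for n in sample_sizes]

print([round(b, 3) for b in same_mean])
print([round(b, 3) for b in shifted_mean])
```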

 

End of Discussion