Fundamentals of Statistical Hypothesis Testing

    Scientific hypotheses that are to be tested statistically are phrased in the form of a null hypothesis (Ho). The null hypothesis is that there is no difference between the experimental and control groups: for example, that there are no differences due to sex or age, that a new drug is not poisonous or does not cure cancer, and so on. The null hypothesis is typically that there is no difference between the arithmetic means (µ) of the two groups compared. We write:

Ho: µ1 = µ2

The null hypothesis is that the mean (µ1) of group 1 equals the mean (µ2) of group 2.

    Statistical evaluation of the result of a biological investigation seeks to reject the null hypothesis. We estimate the probability (p) that the observed difference between the two groups could have been obtained by chance alone. If this probability is less than some predetermined value, we reject the null hypothesis. This value, called the significance level, is usually 5% or sometimes 1% in biological studies. The result is then said to be statistically significant. We conclude that the experimental treatment, or the group difference, is biologically meaningful, and we proceed to investigate why this is so.

    Consider a simple, non-biological example. You are tossing Loonies with a Mississauga riverboat gambler. She is winning, and you wonder whether the game is rigged. The null hypothesis is that the probability of heads or tails is equal on each toss (Ho: pH = pT); you use a 5% significance level. The probability of either result is 1/2: thus, the probability of a run of 1, 2, 3, or 4 heads in a row is 1/2, 1/4, 1/8, or 1/16, respectively. The probability of a run of four is 6.25%, which is still greater than the 5% significance level: a run of four heads is unlikely, but would not be quite enough evidence to support a conclusion of dishonesty. However, if five successive tosses turned up heads (p = 1/32 = 3.125%), the probability of the series is less than the predetermined 5% significance level. Thus, a run of five heads would cause us to reject the null hypothesis at the 5% significance level, and conclude that the coin is loaded. Note that this is an a priori expectation, made before the experiment: what is the chance that the next n tosses will all be heads? Once the outcome of a particular toss or series of tosses is known, the probability of a head on the next toss is 1/2. Past events do not affect future events.
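    The arithmetic of the coin-toss example is easy to verify in a few lines of code. The sketch below (plain Python, using the 5% significance level from the text) computes the probability of a run of n heads under the null hypothesis of a fair coin and reports the decision for each run length.

        # Probability of n heads in a row with a fair coin is (1/2)**n.
        alpha = 0.05  # the predetermined 5% significance level

        for n in range(1, 7):
            p = 0.5 ** n
            verdict = "reject Ho" if p < alpha else "accept Ho"
            print(f"run of {n} heads: p = {p:.5f} ({p * 100:.3f}%) -> {verdict}")

        # Runs of 1 to 4 heads give p >= 6.25%, so Ho stands;
        # a run of 5 heads gives p = 3.125% < 5%, so Ho is rejected.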

    Thus, we make an evidence-based decision to accept or reject the null hypothesis at some predetermined significance level. It is important to realize that this conclusion may or may not be correct. Our acceptance or rejection of the hypothesis, combined with the reality of its truth or falsity, creates four possibilities, shown below.

    Decision / Reality     Ho True           Ho False
    Accept Ho              OK                Type II error
    Reject Ho              Type I error      OK

    The 'OK's indicate two ways of being "right." We may correctly conclude that there is no significant difference (we accept a true null hypothesis: the coin is honest, and we decide that it is), or we may correctly conclude that there is a significant difference (we reject a false null hypothesis: the coin is loaded, and we have found this out). On the other hand, there are also two ways of being "wrong." We may incorrectly conclude that there is a difference (we reject a true null: the coin is really honest, and the gambler was simply "lucky"), or we may incorrectly conclude that there is no difference (we accept a false null: for example, the coin may be loaded to give only 10% more heads, and we didn't get enough evidence in a short game to prove this).
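    These four outcomes can be made concrete by simulation. The sketch below is a minimal illustration, not a real test: it assumes a simple decision rule (reject Ho only on a run of five heads, as in the example above) and a hypothetical loaded coin that gives 55% heads, i.e. 10% more heads than a fair coin. All parameters are invented for illustration.

        import random

        random.seed(1)      # fixed seed for a reproducible illustration
        trials = 100_000    # number of simulated five-toss games
        n_tosses = 5        # rule: reject Ho only when all five tosses are heads

        def reject(p_heads):
            """Play one five-toss game; return True if we reject Ho."""
            return all(random.random() < p_heads for _ in range(n_tosses))

        # Type I error: the coin is honest (p = 0.5) but we call it loaded.
        type_I = sum(reject(0.5) for _ in range(trials)) / trials

        # Type II error: the coin is loaded (p = 0.55) but we fail to detect it.
        type_II = sum(not reject(0.55) for _ in range(trials)) / trials

        print(f"Type I error rate  ~ {type_I:.3f}  (theory: 0.5**5 = 0.031)")
        print(f"Type II error rate ~ {type_II:.3f}  (theory: 1 - 0.55**5 = 0.950)")

    Note how lopsided the two error rates are under this rule: a mildly loaded coin escapes detection about 95% of the time in a five-toss game, which is exactly the "short game" problem described above.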

    The first of these mistakes (rejecting a true null) is referred to as "Type I error", and the second (accepting a false null) as "Type II error". In experimental biology, we are ordinarily most concerned with reducing the probability of Type I error: we do not wish to conclude that there is a difference unless we are very sure of the evidence. For example, we don't want to do an honours project to explain why Puffins on Gull Island are bigger than those on Green Island if preliminary data suggest that any difference is due to chance, or to explain how a herbal medicine is effective against gout if it isn't. In biology, the predetermined significance level typically sets the upper limit of Type I error at 5% or 1%: thus the proportion of errors of this type does not exceed one in twenty, or one in a hundred.

    On the other hand, certain types of biomedical experiments may be equally or even more concerned with Type II error. A physician does not want to prescribe a cancer drug unless she is certain about its effectiveness: she wants to minimize Type I error. At the same time, the pharmaceutical company that manufactures the drug will not market it if there is any evidence that it causes birth defects (teratogenicity): it wants to minimize Type II error. The precautionary approach is concerned with Type II error. A significance level of 1% or lower might be set in the clinical trials for effectiveness against cancer; a significance level as high as 50% might be set in the teratogenicity tests.
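    The trade-off between the two error types can be sketched with a toy teratogenicity test. The numbers below are entirely hypothetical: a background defect rate of 5%, a true (as yet undetected) drug-induced rate of 10%, and 100 exposed litters, with an exact binomial tail used as the test. The point is only to show that raising the significance level (alpha) shrinks the Type II error.

        from math import comb

        def binom_sf(k, n, p):
            """P(K >= k) for a Binomial(n, p) count (upper tail)."""
            return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

        n  = 100    # hypothetical number of exposed litters
        p0 = 0.05   # Ho: defects occur at the 5% background rate
        p1 = 0.10   # hypothetical true teratogenic rate of 10%

        for alpha in (0.01, 0.05, 0.50):
            # smallest defect count c whose tail probability under Ho is <= alpha
            c = next(k for k in range(n + 1) if binom_sf(k, n, p0) <= alpha)
            beta = 1 - binom_sf(c, n, p1)   # Type II error if the true rate is p1
            print(f"alpha = {alpha:.2f}: reject Ho if defects >= {c}; "
                  f"Type II error = {beta:.2f}")

    With these invented rates, the strict 1% level leaves a Type II error of about 70%, whereas the permissive 50% level cuts it to under 10%: the precautionary choice.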

    The probability associated with Type I error is frequently misinterpreted. It is common to hear someone say of a p = 0.10 result that, "Well, it's not statistically significant, but it's biologically significant." However, the correct understanding is that the result offers no support for the hypothesis. More serious consequences arise when a quack doctor markets a so-called "cancer cure" on the evidence of a p = 0.20 result, arguing to distraught parents that there is a 1 in 5 chance that the drug may be effective.

    A "significant" result means only that it is not expected by chance alone: it does not mean that the difference is large or "important". An observed result may be (statistically) significant without being of large (biological) magnitude. The ability of a test to detect a difference of a particular magnitude is called the power of the test, and is typically dependent on the sample size. With a small sample size, only relatively large differences can be shown to be significant, whereas with extremely large samples, even very small differences can be shown to be significant. For example, comparison of 20 weasel skulls from the island of Newfoundland with twenty from the mainland is sufficient to show that island animals are on average 10% larger and that the result is significant, whereas samples of nearly a thousand inshore and offshore codfish show that a genetic difference of less 0.2% is significant. A practical consequence of the difference is that it is possible to assign a weasel skull to the correct population of origin with considerable accuracy, whereas success in re-assignment of a codfish to its source population is little better than chance.


Text material ©2024 by Steven M. Carr