Principal Components Analysis

Interpreting a Principal Components Analysis - Theory & Practice

The procedure described in the lab handout generates four matrices:

            [1] Original raw data (6 variables x n taxa)
            [2] Normalized data (5 variables x n taxa)
            [3] Coefficients of eigenvectors (PC1 - PC5) (5 x 5)
            [4] Principal component scores (I - V) (5 x n)

The relationship between [1] and [2] is straightforward: each of the last five variables is divided by the first, which removes the effect of size differences so that we can concentrate on shape differences. In the discussion that follows, references to measurements, variables, or data are to these 'normalized' variables. Each of these normalized variables is an axis that can be graphed, e.g., normalized cranial breadth can be plotted against normalized cranial depth in a bivariate plot. Such a plot would tell us something about head shape.
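As a minimal sketch of the normalization step, the Python fragment below divides the last five of six raw measurements by the first. The raw values are made up for illustration and are not taken from the lab data; the function name `normalize` is likewise hypothetical.

```python
# Hypothetical raw measurements for one skull (made-up values, not lab data).
# The first entry is the size variable; the remaining five are divided by it.
raw = [150.0, 60.0, 66.0, 75.0, 30.0, 21.0]

def normalize(measurements):
    """Divide every measurement after the first by the first measurement."""
    size = measurements[0]
    if size <= 0:
        raise ValueError("the size measurement must be positive")
    return [m / size for m in measurements[1:]]

normalized = normalize(raw)
print([round(v, 3) for v in normalized])
```

Six raw variables go in and five dimensionless ratios come out, which is why matrix [2] has one fewer variable than matrix [1].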

The Principal Components Analysis converts the normalized data in [2] to so-called 'principal component scores' in [4]. As discussed in the lab, the variables are in essence rotated through multiple dimensions so as to see combinations of variables that describe the major patterns of variation among taxa. Matrix [3] is identical to the 'eigenanalysis' table produced by MINITAB when the PCA is run. Eigenvector coefficients indicate the angles of rotation: calculation of these angles involves matrix algebra, and an understanding of the mathematics involved is beyond the scope of this course (see Manly 1994 for more details). Once calculated, however, the relationship among the data, the coefficients, and the scores is very straightforward, and is important for understanding and interpreting the results of the PCA.

On each principal component axis, each individual has a single 'score' in [4] to which all five measurements in [2] contribute. The contribution or 'weight' for each measurement is the eigenvector coefficient for that measurement in [3]. That is, the coefficient for each measurement determines how 'important' that measurement is for the particular component. The 'score' of each individual is in essence a new 'measurement' that combines all of the original physical measurements. Each of the principal component axes represents an independent pattern of variation. Like the original data, the scores are axes that can be graphed.

For each individual, the score on any axis is calculated as

    Score = measurement {1} x coefficient {1} +
            measurement {2} x coefficient {2} +
            measurement {3} x coefficient {3} +
            measurement {4} x coefficient {4} +
            measurement {5} x coefficient {5}

where measurement {1} and coefficient {1} are the values associated with the first variable, and so on.
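The same weighted sum can be written compactly in matrix notation. The symbols below are introduced here for illustration and do not appear in the handout: x_{ij} is normalized measurement i for taxon j, a_{ik} is the coefficient of measurement i on component k, and s_{kj} is the score of taxon j on component k. The sketch assumes that matrix [3] is laid out with one eigenvector per column.

```latex
s_{kj} = \sum_{i=1}^{5} a_{ik}\, x_{ij}
\qquad\Longleftrightarrow\qquad
S = A^{\mathsf{T}} X
```

Under that assumption X is matrix [2] (5 x n), A is matrix [3] (5 x 5), and S is matrix [4] (5 x n).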

[See the mathematical notes on PCA.]

Where do Principal Component scores come from? - an example

To make this clearer and to explain the interpretation of the eigenvector coefficients, consider the following example, based on an analysis of terrestrial carnivores. The normalized data (from [2]) for one 'cat' species and one 'dog' species are as follows:

    Family    ncb      ncd      ntr      nrw      nrd
    Felid     0.444    0.475    0.350    0.276    0.186
    Canid     0.305    0.340    0.539    0.154    0.116

When a MINITAB PCA of the covariance matrix of the carnivore skull data is performed, the eigenvector coefficients (from [3]) on the first axis are

              PC1
    ncb     -0.541
    ncd     -0.371
    ntr      0.670
    nrw     -0.288
    nrd     -0.192

and the component scores for these two species on the first axis (from [4]) are

               I
    Felid   -0.298
    Canid    0.003

For the Felid, this score was calculated by MINITAB by the formula above (keeping only 3 significant digits) as

-0.298 = (0.444) x (-0.541) +
         (0.475) x (-0.371) +
         (0.350) x ( 0.670) +
         (0.276) x (-0.288) +
         (0.186) x (-0.192)

Similarly, the score for the Canid was calculated as

0.003 = (0.305) x (-0.541) +
        (0.340) x (-0.371) +
        (0.539) x ( 0.670) +
        (0.154) x (-0.288) +
        (0.116) x (-0.192)

[IMPORTANT: For the reasons explained in the mathematical note, you will not be able to repeat the above calculation of scores directly from your normalized data and the eigenvector coefficients. This does not affect any of the discussion below.]
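The arithmetic of the two worked examples can be checked with a short Python sketch. It uses only the three-digit values printed in this handout (not your own data, to which the caution above applies); the names `score`, `PC1`, and `TAXA` are chosen here for illustration.

```python
# Normalized measurements and PC1 coefficients as printed in this handout,
# in the order ncb, ncd, ntr, nrw, nrd.
PC1 = [-0.541, -0.371, 0.670, -0.288, -0.192]
TAXA = {
    "Felid": [0.444, 0.475, 0.350, 0.276, 0.186],
    "Canid": [0.305, 0.340, 0.539, 0.154, 0.116],
}

def score(measurements, coefficients):
    """Sum of measurement x coefficient over all variables."""
    if len(measurements) != len(coefficients):
        raise ValueError("need one coefficient per measurement")
    return sum(m * c for m, c in zip(measurements, coefficients))

scores = {name: score(values, PC1) for name, values in TAXA.items()}
for name, value in scores.items():
    print(f"{name}: {value:+.3f}")
```

Because the inputs are already rounded to three digits, the Felid sum comes out within about 0.001 of the printed -0.298 rather than matching it exactly; the Canid sum rounds to 0.003.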

Interpreting principal components and eigenvector coefficients

Now that we have some understanding of where the principal component scores come from, the important questions for this lab are (1) How can the eigenvector coefficients be interpreted as descriptions of biological shapes? and (2) What is the biological meaning of the score for any individual? That is, how do the individual weights contribute to the overall 'score' of the individual? Let us take the first principal component as an example.

Consider first the cat. The most prominent feature is its inflated cranium, indicated by the large cranial breadth and depth. Both of these variables have large, negative coefficients or weights: thus large (positive) measurements are multiplied by large (negative) weights, and the result is a large negative number. Thinking of a number line, these measurements and their respective weights 'push the score to the left', that is, they shift the score to the negative end of the axis. The only weight that is positive (the only one that can push the score to the right) is that for tooth row; the cat, however, has a relatively short rostrum (cats have relatively flat faces), so this positive 'push' is relatively small. The last two variables each contribute a small negative 'push'.

Contrast this with the dog. The most prominent feature is the long rostrum, seen as a long tooth row. This contributes a large positive value to the component score: it is 'pushed to the right' and increases the score on the axis. As before, the weights on cranial breadth and depth are negative; however, the cranium is fairly compact, and the net negative contribution is much smaller than in the cat. As in the cat, the last two variables each contribute a small negative 'push', though less so because the dog's muzzle is not so high as the cat's.
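The size of each 'push' can be made explicit by listing the individual products rather than only their sum. The sketch below again uses the values printed in this handout; the function name `contributions` is hypothetical.

```python
# Per-variable contributions (measurement x weight) to the PC1 score,
# using the values printed in this handout.
VARIABLES = ["ncb", "ncd", "ntr", "nrw", "nrd"]
PC1 = [-0.541, -0.371, 0.670, -0.288, -0.192]
FELID = [0.444, 0.475, 0.350, 0.276, 0.186]
CANID = [0.305, 0.340, 0.539, 0.154, 0.116]

def contributions(measurements, coefficients):
    """Return the individual terms whose sum is the component score."""
    return [m * c for m, c in zip(measurements, coefficients)]

felid_terms = contributions(FELID, PC1)
canid_terms = contributions(CANID, PC1)

print(f"{'var':<4}{'weight':>8}{'felid':>9}{'canid':>9}")
for name, w, f, c in zip(VARIABLES, PC1, felid_terms, canid_terms):
    print(f"{name:<4}{w:>8.3f}{f:>9.3f}{c:>9.3f}")
```

The tooth row term is about +0.23 for the felid and +0.36 for the canid, while the two cranial terms together come to about -0.42 for the felid and -0.29 for the canid, which is the contrast described in the two paragraphs above.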

Thus, the three largest weights on the first principal component may be interpreted as contrasting the cross-sectional area of the cranium (large negative loadings) with the length of the snout (large positive loading). Animals with bulgy craniums and short faces will fall toward the negative end of the axis; animals with compact craniums and long faces will fall toward the positive end. That is, animals with "dog-like" proportions will have more positive (in this case, less negative) scores; those with "cat-like" proportions will have very negative scores. The first principal component, which accounts for more than half of the observed variance among skulls, contrasts dog-like carnivores and cat-like carnivores. In Figure 1 (below), Canids (C) are at the right extreme of PCI and Felids (F) at the left extreme. Other species are intermediate, some more cat-like and some more dog-like [see the note on the use of the correlation matrix].

The other principal components can be interpreted in the same manner: their nature will depend on the mix of species and families used in any particular analysis. Each principal component is uncorrelated with the others, and so represents a separate pattern of variation. [To prove this to yourself, you could use MINITAB to calculate the correlation between the first and second principal components of your data.] However, each successive component explains a smaller and smaller proportion of the total shape variation [as indicated by the 'cumulative variance' in the PCA table]. Usually the first three components taken together explain 90-95% of the variance, and are the only ones we need worry about interpreting.
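The lack of correlation between components can also be illustrated outside MINITAB. The sketch below is a two-variable toy case, not the five-variable lab analysis: the data are made up, and for a 2 x 2 covariance matrix the rotation angle can be written down directly instead of using general matrix algebra.

```python
import math

# Made-up normalized measurements for six hypothetical taxa (two variables only).
x = [0.44, 0.31, 0.40, 0.35, 0.47, 0.29]
y = [0.35, 0.54, 0.41, 0.47, 0.33, 0.52]

def mean(values):
    return sum(values) / len(values)

def covariance(a, b):
    """Sample covariance; covariance(a, a) is the sample variance of a."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

sxx, syy, sxy = covariance(x, x), covariance(y, y), covariance(x, y)

# Angle of the axis of greatest variance for a 2 x 2 covariance matrix.
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))    # coefficients of the first axis
pc2 = (-math.sin(theta), math.cos(theta))   # second axis, at right angles

# Scores are weighted sums of the measurements, as in the formula above.
scores1 = [pc1[0] * p + pc1[1] * q for p, q in zip(x, y)]
scores2 = [pc2[0] * p + pc2[1] * q for p, q in zip(x, y)]

var1, var2 = covariance(scores1, scores1), covariance(scores2, scores2)
corr = covariance(scores1, scores2) / math.sqrt(var1 * var2)
share1 = var1 / (var1 + var2)

print(f"correlation between the two sets of scores: {corr:.2e}")
print(f"proportion of variance on the first axis: {share1:.3f}")
```

The correlation is zero apart from floating-point rounding, and the first axis carries the larger share of the variance, in line with the description above.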

The following figure is a comparison of PCI versus PCII. For Axis I, note particularly the distribution of the three canid species (C) at the extreme right, and of the three felid species (F) at the extreme left. [Other codes: M = Mustelidae, P = Procyonidae, U = Ursidae, V = Viverridae]

         -
    0.780+                  P      U
         -
II      -                    P
         -                                         C      C
         -   F
    0.720+                                                    C
         -            M
         -                     U
         -            M
         -            F
    0.660+     F                        U
         -           M
         -                                        V
         -                         M
         -                                            V
    0.600+
       --+---------+---------+---------+---------+---------+---- I
        -0.300    -0.240    -0.180    -0.120    -0.060     0.000

Figure 1: Variation in skull shape among 17 species in six families of fissiped carnivores. First and second PCA axes from five normalized measurements.

For Further Reference

Manly, B. F. J. (1994). Multivariate Statistical Methods: a Primer. 2nd ed. Chapman & Hall.
[Chapter 5 discusses Principal Components Analysis; don't get bogged down in the math].

Wiley, E. O. (1981). Phylogenetics. Wiley Interscience.
[See pp. 339-365 on quantitative data analysis, including a discussion of PCA]