In general, parsimony is the principle that the simplest explanation that can explain the data is to be preferred. In the analysis of phylogeny, parsimony means that a hypothesis of relationships that requires the smallest number of character changes is most likely to be correct. In molecular systematics, these character changes are DNA variants. [These are called Single Nucleotide Polymorphisms, or SNPs, pronounced "Snips"].
Suppose that we wish to determine the evolutionary relationships among four taxa (sing., taxon): A, B, C, & D. [A B C & D may be taxa or individuals at any level: these are also called Operational Taxonomic Units (OTUs)]. With respect to taxon A, there are three hypotheses: A is most closely related to B, or to C, or to D. These three hypotheses of cladistic relationships can be diagrammed respectively as follows (the diagrams are called cladograms):
    Tree 1      Tree 2      Tree 3

    A  C        A  B        A  B
    |__|        |__|        |__|
    |  |        |  |        |  |
    B  D        C  D        D  C
Portions of these cladograms shown as '|' are branches, and '__' is the internode. The most closely related pairs of taxa are on either side of the internode. For example, the first diagram shows A & B as more closely related to each other than either is to C or D.
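The three hypotheses can also be listed mechanically. The following is a minimal Python sketch, in which the function name `quartet_trees` is an illustrative assumption and not part of any phylogenetics package; it pairs the first taxon with each of the others in turn:

```python
def quartet_trees(taxa):
    """Return the three ways to split four taxa into two pairs of closest relatives."""
    first, *rest = taxa
    trees = []
    for partner in rest:
        # The two taxa not paired with `first` form the other side of the internode.
        others = tuple(t for t in rest if t != partner)
        trees.append(((first, partner), others))
    return trees


for number, (left, right) in enumerate(quartet_trees("ABCD"), start=1):
    print(f"Tree {number}: ({left[0]},{left[1]}) | ({right[0]},{right[1]})")
```

Each printed line names the two pairs that sit on either side of the internode in the corresponding cladogram.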
A typical data set to evaluate these hypotheses would be homologous DNA sequences from each taxon. In the figure below, only the coding DNA strand for 30 bp is shown: nucleotides in the last three lines are identical to those in the first, except where indicated:
A  aat tcg ctt cta gga atc tgc cta atc ctg
B  ... ..a ..g ..a .t. ... ... t.. ... ..a
C  ... ..a ..c ..c ... ..t ... ... ... t.a
D  ... ..a ..a ..g ..g ..t ... t.t ..t t..
     1   2   3   4       5     6         7
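To work with such a figure in a program, the dots first have to be replaced by the corresponding nucleotides from the first line. A minimal sketch in Python, assuming a hypothetical helper named `expand`:

```python
def expand(reference, dotted):
    """Replace each '.' in a dotted row with the reference nucleotide in the same column."""
    return "".join(r if d == "." else d for r, d in zip(reference, dotted))


# The four rows of the 30 bp example; spaces only separate codons.
rows = {
    "A": "aat tcg ctt cta gga atc tgc cta atc ctg",
    "B": "... ..a ..g ..a .t. ... ... t.. ... ..a",
    "C": "... ..a ..c ..c ... ..t ... ... ... t.a",
    "D": "... ..a ..a ..g ..g ..t ... t.t ..t t..",
}

reference = rows["A"].replace(" ", "")
full = {taxon: expand(reference, row.replace(" ", "")) for taxon, row in rows.items()}

for taxon, sequence in full.items():
    print(taxon, sequence)
```

The dictionary `full` then holds one 30-letter string per taxon, which is the form most comparison code expects.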
Two kinds of nucleotide positions can be distinguished, informative and uninformative. Informative positions are those that give information about evolutionary relationships among taxa; uninformative positions do not. In our four-taxon problem, an informative position indicates which of the three hypotheses is the simplest (most parsimonious) explanation of the SNP pattern at that position, that is, which one requires the fewest nucleotide changes.
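A common working rule is that a position can be informative under parsimony only if at least two different nucleotides each occur in at least two taxa. A small sketch of that test follows; the function name `is_informative` is an assumption made for illustration, and each column is written in the taxon order A, B, C, D:

```python
from collections import Counter


def is_informative(column):
    """True if at least two different nucleotides each occur in two or more taxa."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


# A few column patterns: invariant, one taxon unique, all different,
# one shared pair only, and the three possible two-against-two splits.
patterns = ["aaaa", "gaaa", "tgca", "aacg", "aagg", "ctct", "gaag"]
for column in patterns:
    label = "informative" if is_informative(column) else "uninformative"
    print(column, label)
```

Only the last three patterns pass the test, because only they divide the four taxa into two pairs.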
What is the information content of each of the seven numbered positions?
1. Position 1 is invariant: it gives no information about relationships among these taxa. The majority of sites are like this.
2. Position 2 is variable, but still gives no information: it indicates only that A is different from the other three taxa, or, put another way, that B, C, & D are similar, but not that they are closely related. The unique change in A at this position is therefore an autapomorphy [a unique derived character in only one of the taxa under study]. Position 2 can be explained by a single mutational change in any of the three trees. In the diagrams below, '+' indicates a change from 'a' [in B C & D] to 'g' on the branch leading to taxon A. The same would be true if any one of the other three taxa were uniquely variable at this site. Most variable sites are of this type: five are shown in the figure above (find them).
nuc      g  a        g  a        g  a
Taxon    A  C        A  B        A  B
         +__|        +__|        +__|
         |  |        |  |        |  |
Taxon    B  D        C  D        D  C
nuc      a  a        a  a        a  a
3. Position 3 is extremely variable, but again there is no information: every taxon is different from every other. Position 3 requires three mutations in any of the three trees. [Under certain models of molecular evolution, this site would be informative].
4. Position 4 is less variable, but once again there is no information: although it might appear at first glance that this site favors the first tree over the latter two, in fact the site can be explained on each tree with two changes:
    a   c          a   c          a   a          a   a
    A   C          A   C          A   B          A   B
    |___+    or    |_+_+          |___|          |___|
    |   +          |   |          +   +          +   +
    B   D          B   D          C   D          D   C
    a   g          a   g          c   g          g   c
There are many variants on this type of site, all of which are uninformative. The similarity of A & B can thus be explained as a shared ancestral character [a symplesiomorphy], which does not necessarily indicate that they are closely related. [Under certain models of molecular evolution, this type of site would be informative].
5. Position 5 is informative: it indicates that A & B are each other's closest relatives, that is, that the first tree is the most parsimonious explanation. To explain this distribution requires a single change on the internode in Tree 1 (a <-> g), and two changes each, on branches, in Trees 2 & 3. The similarity of A & B would thus be interpreted as a shared derived character [a synapomorphy]. Two such sites are shown (find the other).
    a   g          a   a          a   a
    A   C          A   B          A   B
    |_+_|          +___+          +___+
    |   |          |   |          |   |
    B   D          C   D          D   C
    a   g          g   g          g   g
6. Position 6 is also informative: it indicates that A & C are each other's closest relatives, so that the second tree is the most parsimonious explanation. The similarity of A & C would thus be interpreted as a shared derived character [a synapomorphy]. To explain this distribution requires a single change in Tree 2, and two changes in Trees 1 & 3:
    c   c          c   t          c   t
    A   C          A   B          A   B
    +___+          |_+_|          +___|
    |   |          |   |          |   +
    B   D          C   D          D   C
    t   t          c   t          t   c
7. Finally, Position 7 is informative: it indicates that A & D are each other's closest relatives, so that the third tree is the most parsimonious explanation. The similarity of A & D would thus be interpreted as a shared derived character [a synapomorphy]. To explain this distribution requires a single change in Tree 3, and two changes in Trees 1 & 2:
    g   a          g   a          g   a
    A   C          A   B          A   B
    +___|          +___|          |_+_|
    |   +          |   +          |   |
    B   D          C   D          D   C
    a   g          a   g          g   a
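The change counts given for positions 1 to 7 can be checked with a short program. The sketch below uses Fitch's set-intersection rule for counting changes, a standard technique that the text itself does not name. The helper names and the `SITES` table are illustrative assumptions: the patterns for positions 2 and 4 to 7 follow the nucleotides drawn in the diagrams, while those for positions 1 and 3 are simply one invariant and one all-different pattern.

```python
def combine(set_a, set_b):
    """Fitch rule: keep shared states at no cost, otherwise pool the states and count one change."""
    common = set_a & set_b
    return (common, 0) if common else (set_a | set_b, 1)


def min_changes(column, tree):
    """Minimum number of changes for one column (taxon order A, B, C, D) on a four-taxon tree."""
    state = dict(zip("ABCD", column))
    (w, x), (y, z) = tree
    left, cost_left = combine({state[w]}, {state[x]})
    right, cost_right = combine({state[y]}, {state[z]})
    _, cost_internode = combine(left, right)
    return cost_left + cost_right + cost_internode


TREES = {
    "Tree 1": (("A", "B"), ("C", "D")),
    "Tree 2": (("A", "C"), ("B", "D")),
    "Tree 3": (("A", "D"), ("B", "C")),
}

# Assumed nucleotide patterns for the seven numbered positions (A, B, C, D).
SITES = {1: "aaaa", 2: "gaaa", 3: "tgca", 4: "aacg", 5: "aagg", 6: "ctct", 7: "gaag"}

scores = {
    position: [min_changes(column, tree) for tree in TREES.values()]
    for position, column in SITES.items()
}
for position, per_tree in scores.items():
    print(f"Position {position}: changes on Trees 1, 2, 3 = {per_tree}")
```

Positions 1 to 4 give the same count on all three trees (0, 1, 3, and 2 changes respectively), whereas positions 5, 6, and 7 each need one change on the favored tree and two on the others.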
An evolutionary parsimony analysis counts the number of informative positions favoring each of the (in this case, three) possible trees: the tree favored by the largest number of positions is the most parsimonious.
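The counting step can be sketched in a few lines. The example below is a minimal illustration, not the procedure of any particular package: it tallies the two-against-two positions in the 30 bp example. The sequences are the four rows of the example figure written out in full, and `favored_partner` is a hypothetical helper name.

```python
from collections import Counter

# The four 30 bp example sequences with the dots expanded.
SEQS = {
    "A": "aattcgcttctaggaatctgcctaatcctg",
    "B": "aattcactgctagtaatctgcttaatccta",
    "C": "aattcactcctcggaatttgcctaatctta",
    "D": "aattcactactggggatttgctttattttg",
}


def favored_partner(a, b, c, d):
    """Return the taxon grouped with A at a two-against-two position, or None otherwise."""
    if a == b and c == d and a != c:
        return "B"
    if a == c and b == d and a != b:
        return "C"
    if a == d and b == c and a != b:
        return "D"
    return None


tally = Counter()
for column in zip(SEQS["A"], SEQS["B"], SEQS["C"], SEQS["D"]):
    partner = favored_partner(*column)
    if partner is not None:
        tally[partner] += 1

for partner in "BCD":
    print(f"A + {partner}: {tally[partner]} informative position(s)")
```

On these data the tally is two positions for A + B, one for A + C, and one for A + D, so Tree 1 is the most parsimonious of the three.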
For example, the data set below can be used to investigate the relationship of Pagophilus groenlandicus (harp seals) to other phocid (earless) seals of the North Atlantic (see Carr & Perry 1997). A total of four sites favor (Pagophilus + Phoca), six sites favor (Pagophilus + Cystophora) as closest relatives, and only three favor (Pagophilus + Erignathus). [Identify all these sites]. Pagophilus thus appears to be most closely related to Cystophora.
Similar principles can be used to evaluate data sets with larger numbers of taxa. However, the number of possible trees increases faster than exponentially: with seven taxa, there are 945 possible trees, with ten over two million, and with 22 over 3 x 10^23. It is obviously impractical to inspect all possible trees: a number of computerized search algorithms have been developed, notably the PAUP (Phylogenetic Analysis Using Parsimony) package of Dave Swofford (see Swofford et al. 1995. In "Molecular Systematics" (D. Hillis et al., eds.) Sinauer), the PHYLIP package of Joe Felsenstein (Univ. of Washington), and the MEGA package of Masatoshi Nei and Sudhir Kumar (Pennsylvania State Univ.).
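The tree counts quoted above agree with the standard formula for the number of unrooted, fully bifurcating trees on n taxa, which is the product of the odd numbers 1 x 3 x 5 x ... x (2n - 5). A short sketch, with `unrooted_tree_count` as an illustrative function name:

```python
def unrooted_tree_count(n_taxa):
    """Number of unrooted, fully bifurcating trees: 1 x 3 x 5 x ... x (2n - 5)."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are needed")
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (4, 7, 10, 22):
    print(f"{n} taxa: {unrooted_tree_count(n):.3g} possible trees")
```

For four taxa the function returns 3, matching the three cladograms of the worked example.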
             M   T   N   I   R   K   T   H   P   L   M   K   I   I   N   N   S   F   I   D   L   P   A   P   S    25
Pagophilus  atg acc aac atc cga aaa acc cac cca cta ata aag att atc aac aac tca ttc atc gac cta ccc gca cca tca   75
Phoca       ... ... ... ... ... ... ... ..t ... ... ... ..a ... ... ... ... ... ... ... ... ... ... a.. ... ...
Cystophora  ... ... ... ... ... ... ... ... ... ... ... ..a ... ... ... ... ... ... ..t ... ... ... a.. ... ...
Erignathus  ... ... ... ... ... ... ..t ... ... ... ..c ..a ..c ... ... .g. ... ... ... ... ... ... a.. ..g ...

             N   I   S   A   W   W   N   F   G   S   L   L   V   I   C   L   I   L   Q   I   L   T   G   L   F    50
Pagophilus  aat atc tca gca tga tga aac ttt gga tcc ctg ctc gta atc tgc tta atc cta cag atc cta aca ggc cta ttc  150
Phoca       ... ... ... ... ... ... ... ... ... ..t ..t ... .g. ... ... c.. ... ..g ..t ... t.. ... ... t.g ...
Cystophora  ..c ... ... ... ... ... ... ... ... ... ..c ... .g. ... ... ... ... t.. ... ... ... ... ... ... ...
Erignathus  ..c ... ... ... ... ... ... ..c ... ... ..c ... .gg ... ... c.t ..t t.. ..a ... ... ... ... ... ...

             L   A   M   H   Y   T   S   D   T   I   T   A   F   S   S   V   T   H   I   C   R   D   V   N   Y    75
Pagophilus  ctg gcc ata cat tat acc tca gac aca atc aca gcc ttc tca tca gtg acc cat atc tgt cga gac gta aac tac  225
Phoca       ..a ... ... ..c ..c ... ... ... ... .c. ... ... ... ... ... ..a ... ..c ... ..c ... ... ... ... ...
Cystophora  ..a ... ... ... ... ..t ... ... ... .ct ... ... ... ..g ... ..a ..a ..c ... ... ... ... ... ... ...
Erignathus  ..a ... ... ... ..c ... ... ..t ... .c. ... ..t ... ... ... ..a ... ... ... ... ... ... ... ..t ..t

             G   W   I   I   R   Y   L   H   A   N   G   A   S   M   F   F   I   C   L   Y   M   H   V   G   R   100
Pagophilus  ggc tga atc atc cga tac cta cac gca aat gga gcc tcc ata ttt ttc atc tgc tta tac ata cac gta gga cga  300
Phoca       ... ... ... ... ..t ..t ..t ... ... ... ... ..t ... ... ... ... ... ... c.. ... ..g ..t ... ... ...
Cystophora  ... ... ..t ... ... ..t ... ... ... ... ... ... ... ... ... ... ... ... c.g ... ... ... ..g ... ...
Erignathus  ... ... ..t ... ... ..t a.. ... ..t ..c ... ..t ..t ... ..c ... ... ... c.. ... ... ..t ... ... ...

             G   L   Y   Y   G   S   Y   T   F   T   E   T   W   N   I   G   I   I   L   L   F   T   V   M   A   125
Pagophilus  gga ctc tac tac ggt tcc tac aca ttc aca gaa aca tga aat atc ggc att atc ctc cta ttc acc gtc ata gct  375
Phoca       ... ..g ..t ... ..c ... ... ... ... ... ..g ... ... ..c ... ... ... ... ... t.. ... ... ... ... ..c
Cystophora  ... ..g ... ... ..c ... ... ... ..t ... ..g ... ... ... ... ... ... ... ... t.. ..t ..t ... ... ...
Erignathus  ... ..a ... ... ..c ..t ..t ... ..t .t. ... ... ... ..c ... ... ... ... ..a ... ... ... ... ... ..c

             T   A   F   M   G   Y   V   L   133
Pagophilus  acg gca ttc atg ggt tac gtc cta cc  401
Phoca       ..a ... ... ... ..c ... ... ... ..
Cystophora  ..a ... ... ... ..c ... ... ... ..
Erignathus  ..a ... ... ..a ..c ... ... ... ..
401 base pair portion of the mitochondrial cytochrome b gene from four species of phocid seals (Phoca = harbour seal, Pagophilus = harp seal, Cystophora = hooded seal, Erignathus = bearded seal). Nucleotides are the same as in Pagophilus except where indicated; the inferred amino acid sequence for Pagophilus is indicated by the IUPAC single-letter code. Data from Carr and Perry (1997); cf. Perry et al. (1995) "Journal of Mammalogy," 76:22-31