Cluster Analysis with unequal branch lengths


Pair group methods such as UPGMA assume that branch lengths from any common ancestor to its descendant taxa are equal. With molecular data, this amounts to assuming that rates of nucleotide substitution are equal among lineages. When this assumption is violated, conventional cluster methods can give the wrong tree.

Consider five taxa (A, B, C, D, E) with the following distance matrix. Note that distances involving taxon C are unusually large:
 


      A     B     C     D     E
A     0     -     -     -     -
B    20     0     -     -     -
C    80    80     0     -     -
D    60    60   100     0     -
E    80    80   120    80     0

As before, A & B are most similar (20 units): join them into one cluster (AB) at 20, and recalculate the distance from (AB) to each remaining taxon as the average of its distances to A and to B. This gives:
 


       (AB)     C     D     E
(AB)     0      -     -     -
C       80      0     -     -
D       60    100     0     -
E       80    120    80     0

(AB) & D are most similar (60 units): join them into one cluster (ABD) at 60, and recalculate the distance to each remaining taxon as the average of its distances to (AB) and to D. For C, this is (80 + 100) / 2 = 90. This gives:
 


        (ABD)     C     E
(ABD)     0       -     -
C        90       0     -
E        80     120     0

E & (ABD) are most similar (80 units): join them into one cluster (ABDE) at 80, and recalculate the distance to C as the average of its distances to (ABD) and to E: (90 + 120) / 2 = 105. This gives:
 


         (ABDE)     C
(ABDE)     0        -
C        105        0

C joins the remaining taxa at 105. This completes the analysis.
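
The steps above can be reproduced with a short Python sketch. It is illustrative only: the function name `pair_group_cluster` and the `size_weighted` flag are hypothetical names introduced here, not part of the original example. By default the sketch takes the simple average of the two merged clusters' distances, which is the rule that reproduces the values shown above (90, then 105). Setting `size_weighted=True` instead averages over all member taxa of the merged cluster.

```python
from itertools import combinations

# Distance matrix from the worked example (lower triangle, row by row).
names = ["A", "B", "C", "D", "E"]
lower = [[], [20], [80, 80], [60, 60, 100], [80, 80, 120, 80]]
dist = {frozenset({names[i], names[j]}): lower[i][j]
        for i in range(len(names)) for j in range(i)}


def pair_group_cluster(names, dist, size_weighted=False):
    """Agglomerative pair-group clustering (hypothetical helper).

    dist maps frozenset({x, y}) to the distance between taxa x and y.
    Returns a list of (cluster_1, cluster_2, joining_distance) in join order.
    """
    # Each cluster is a tuple of taxon names; start with single taxa.
    clusters = [(n,) for n in names]
    d = {frozenset({(a,), (b,)}): dist[frozenset({a, b})]
         for a, b in combinations(names, 2)}
    joins = []
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        joins.append((x, y, d[frozenset({x, y})]))
        merged = x + y
        clusters = [c for c in clusters if c not in (x, y)]
        # Recalculate the distance from the new cluster to every other one.
        for c in clusters:
            dx, dy = d[frozenset({x, c})], d[frozenset({y, c})]
            if size_weighted:
                new = (len(x) * dx + len(y) * dy) / (len(x) + len(y))
            else:
                new = (dx + dy) / 2
            d[frozenset({merged, c})] = new
        clusters.append(merged)
    return joins


for left, right, height in pair_group_cluster(names, dist):
    print("".join(left), "+", "".join(right), "join at", height)
```

With the default setting the joining distances are 20, 60, 80 and 105, as in the tables above. With `size_weighted=True` the order of joining is the same and only the last value changes (95 instead of 105), so C is still the last taxon to join under either averaging rule.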

The analysis suggests that C is the least similar to the others. If similarity in the phenogram (below, left) is taken as an estimate of relationship, it implies that C is the most distantly related of the five taxa. In fact, the evolutionary tree from which the data were derived (below, right) shows that C is most closely related to (AB) [they have the most recent common ancestor], but has evolved at twice the rate of the other taxa. Because the method assumes equal rates, it treats the extra change along the C lineage as evidence of distant relationship, and so gives the wrong answer.
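
One way to see the rate violation directly in the data is the three-point (ultrametric) condition: if all lineages evolve at the same rate, then for any three taxa the two largest of their pairwise distances are equal. This check is an addition for illustration, not part of the original example, and the function name `violating_triples` is hypothetical. A minimal sketch in Python:

```python
from itertools import combinations

# Distance matrix from the worked example (lower triangle, row by row).
names = ["A", "B", "C", "D", "E"]
lower = [[], [20], [80, 80], [60, 60, 100], [80, 80, 120, 80]]
dist = {frozenset({names[i], names[j]}): lower[i][j]
        for i in range(len(names)) for j in range(i)}


def violating_triples(names, dist):
    """Return the triples whose two largest pairwise distances differ."""
    bad = []
    for trio in combinations(names, 3):
        d = sorted(dist[frozenset(pair)] for pair in combinations(trio, 2))
        if d[1] != d[2]:  # equal rates would make these two equal
            bad.append(trio)
    return bad


for trio in violating_triples(names, dist):
    print("".join(trio))
```

For this matrix, every triple that fails the check includes taxon C, and every triple without C passes. That is consistent with C being the lineage that has evolved at a different rate.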

 

Example & text material © 2025 by Steven M. Carr