[Figure: DNA data matrix, coded (03-08)]

The Site Frequency Spectrum (SFS) as a description of allelic variation

The sample data set shows 10-bp sequences from five individuals, including three sites that show SNP variability. The binary-coded data matrix [left] shows these three variable sites among the five individuals, coded as 0 or 1 to indicate whether each SNP site in each individual has the same base as Individual 1 (hence ancestral) or not (hence derived). These data define three alleles (haplotypes), found in Individuals (1), (2 & 3), and (4 & 5). Recall the assumption that Individual 1 shows the ancestral state (0) at all three positions, and that each of the other individuals shows the derived state (1) at one or more positions. The derived forms of the three SNP sites occur in 2/5, 2/5, and 4/5 individuals, respectively, as shown by shading.

With n individuals, the derived allele could occur in 1, 2, 3, ... (n-1) individuals, which establishes (n-1) classes, in this case 5 - 1 = 4 classes as shown. [They are called "allele frequencies" because they correspond to SNP variants in 20%, 40%, 60%, or 80% of the individuals]. The SFS graph summarizes the allele frequency data 2/5, 2/5, and 4/5 by showing that 2/3 = 67% of the SNPs occur in two individuals, and that 1/3 = 33% occur in four individuals. Note that the data show no cases where the derived allele is present in only 1 or in 3 individuals.
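The tally described above can be sketched in a few lines of Python. The matrix below is an assumption: the figure itself is not reproduced here, so the rows are a hypothetical reconstruction chosen only to match the stated haplotype groups (1), (2 & 3), (4 & 5) and the derived counts 2/5, 2/5, and 4/5. The function name `site_frequency_spectrum` is likewise illustrative.

```python
from collections import Counter

# Hypothetical 5 x 3 binary matrix (rows = individuals, columns = SNP sites).
# Assumption: which of the first two sites is derived in Individuals 2 & 3
# versus 4 & 5 is a guess; only the column totals (2, 2, 4) come from the text.
matrix = [
    [0, 0, 0],  # Individual 1 (assumed ancestral at all sites)
    [1, 0, 1],  # Individual 2
    [1, 0, 1],  # Individual 3
    [0, 1, 1],  # Individual 4
    [0, 1, 1],  # Individual 5
]


def site_frequency_spectrum(rows):
    """Return the unfolded SFS as proportions of SNP sites in classes 1..n-1."""
    n = len(rows)
    # Number of individuals carrying the derived state (1) at each site.
    derived_counts = [sum(column) for column in zip(*rows)]
    # Keep only variable sites: derived in at least 1 and at most n-1 individuals.
    variable = [k for k in derived_counts if 0 < k < n]
    tally = Counter(variable)
    return [tally[k] / len(variable) for k in range(1, n)]


derived_counts = [sum(column) for column in zip(*matrix)]
sfs = site_frequency_spectrum(matrix)
print(derived_counts)               # derived count per site
print([round(x, 2) for x in sfs])   # proportions for classes 1, 2, 3, 4
```

Classes 1 and 3 come out empty because no site in this data set has the derived state in exactly one or exactly three individuals.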

Advanced: The SFS can also be written for n individuals as a vector f = (f1, f2, ..., f(n-1)), in this case f = (0.0, 0.67, 0.0, 0.33). If the ancestral and derived SNPs at each position are not known, we can write a folded frequency spectrum vector, by combining ("folding") sites with complementary SNP distributions: 10000 with 01111, and 11000 with 00111 ("1+4" and "2+3" patterns). In this case, the folded vector is f* = (f*1, f*2) = (0.33, 0.67). Observe that if there were n = 6 individuals, and there were a fourth site coded from top to bottom as 000111 or 111000 ("3+3" pattern), it would be possible to write a folded vector f* = (f*1, f*2, f*3) = (0.25, 0.50, 0.25).
 
For n = 5 as in the example, the folded vector has no f*3 term, because the series is limited to classes f*j with j ≤ n/2 = 5/2 = 2.5. In plain terms, any odd sample size 2k+1 has the same number of folded allele frequency classes as the next lower even sample size 2k.
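As a minimal sketch of the folding step, the Python below adds each class j to its complement n - j and stops at j = n/2 rounded down. The function name `fold_spectrum` is an illustrative choice, and the input vector is the unfolded spectrum from the n = 5 example.

```python
def fold_spectrum(sfs):
    """Fold an unfolded SFS (classes 1..n-1) into classes 1..floor(n/2)."""
    n = len(sfs) + 1  # the unfolded vector has n - 1 entries
    folded = []
    for j in range(1, n // 2 + 1):
        partner = n - j  # complementary class, e.g. "1+4" or "2+3" when n = 5
        if partner == j:
            # Middle class for even n (e.g. "3+3" when n = 6): count it once.
            folded.append(sfs[j - 1])
        else:
            folded.append(sfs[j - 1] + sfs[partner - 1])
    return folded


unfolded = [0.0, 2 / 3, 0.0, 1 / 3]   # n = 5 example
folded = fold_spectrum(unfolded)
print([round(x, 2) for x in folded])

# Number of folded classes for several sample sizes n.
class_counts = {n: len(fold_spectrum([0.0] * (n - 1))) for n in (4, 5, 6, 7)}
print(class_counts)
```

The class counts show the odd/even rule directly: n = 4 and n = 5 both give two folded classes, and n = 6 and n = 7 both give three.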

Figure © 2013 by Sinauer; Text material © 2017 by Steven M. Carr