Calculating Fst for haploid data in R


— 1,366 reads

I was looking for a tutorial to estimate FST for my microbial (aka haploid) dataset but sadly there was no specific instruction, although other researchers are definitely looking for it (see here). Sadly, one respondent said that “there is no way to estimate Fst for microbial data”. However, ultimately I found a “way” to estimate Fst, and here is a quick R tutorial.

The fixation index, otherwise known as FST, is population differentiation due to genetic structure and is one of the fundamental concepts in population genetics. Essentially, FST is the proportion of the total genetic variance contained in a subpopulation (the S-subscript) compared to the total genetic variance (the T-subscript). Its value can range from 0 to 1. Higher FST indicates a considerable degree of differentiation among populations. The figure below illustrates what it means for two populations to have very different genetic structures.

Genetic differentiation: an example of the behavior of fixation index... |  Download Scientific Diagram
(Left) Two sub-population have the same genotype composition, hence Fst is 0. On the other hand, (right) if two sub-population have very different genotype compositions, the Fst will be 1.

Hierfstat is actually quite an old library package in R which provide many population genetics F-statistics estimation for haploid or diploid genetic data. I used the following script to estimate FST for my dataset:


# Lets load the fasta file by extract only the SNPs
seq.snp = fasta2DNAbin('~/path/to/seq.fasta', snpOnly=T)
obj = DNAbin2genind(seq.snp)

# Load the meta file that contains population structure 
meta <- read.table('~/path/to/meta.csv', sep=',', header = T)

# Add the structure I want to test in the GenInd object
obj$pop = as.factor(meta[match(rownames(obj$tab), meta$Strain),]$Site)

# Convert genind object into hierfstat object
obj.hf = genind2hierfstat(obj)

# Various estimation of Fst
genet.dist(obj.hf, method='Nei87', diploid=F)
genet.dist(obj.hf, method='Dch', diploid=F)

What is happening here? Initially, I load the fasta file using fasta2DNAbin function from adegenet package as DNA binary object by extracting only SNP information. Make sure that your fasta sequences are aligned properly.

Then I convert it into genind object. I also load a meta-file that contains several properties of the sequence isolates including the sampling site. I intend to estimate Fst by sampling site. I match and add site info in the genind object. Finally, I convert the genind object into hierfstat object. The meta-file may look like the following:

Isolate1Site XA
Isolate2Site YB

Once you have the hierfstat object, you can estimate different FST by genet.dist function. Just make sure to specify diploid=F argument to indicate that it is a haploid dataset.

You can do some more analysis and visualization in this dataset. For example, you can check the basic stats (HO, HS, FIS, FST, etc.). A bunch of other analysis and visualization is possible: check more here.

The different methods of calculating FST are described by Takezaki and Nei (1996) paper.


Leave a Reply

Join as a subscriber

Only the posts on data visualization, bioinformatics how to tutorials, web-development, and general comments on research and science will be sent.

Join 11 other subscribers