I was looking for a tutorial to estimate FST for my microbial (aka haploid) dataset but sadly there was no specific instruction, although other researchers are definitely looking for it (see here). Sadly, one respondent said that “there is no way to estimate Fst for microbial data”. However, ultimately I found a “way” to estimate Fst, and here is a quick R tutorial.
The fixation index, otherwise known as FST, is population differentiation due to genetic structure and is one of the fundamental concepts in population genetics. Essentially, FST is the proportion of the total genetic variance contained in a subpopulation (the S-subscript) compared to the total genetic variance (the T-subscript). Its value can range from 0 to 1. Higher FST indicates a considerable degree of differentiation among populations. The figure below illustrates what it means for two populations to have very different genetic structures.

Hierfstat is actually quite an old library package in R which provide many population genetics F-statistics estimation for haploid or diploid genetic data. I used the following script to estimate FST for my dataset:
library("hierfstat")
library("adegenet")
# Lets load the fasta file by extract only the SNPs
seq.snp = fasta2DNAbin('~/path/to/seq.fasta', snpOnly=T)
obj = DNAbin2genind(seq.snp)
# Load the meta file that contains population structure
meta <- read.table('~/path/to/meta.csv', sep=',', header = T)
# Add the structure I want to test in the GenInd object
obj$pop = as.factor(meta[match(rownames(obj$tab), meta$Strain),]$Site)
# Convert genind object into hierfstat object
obj.hf = genind2hierfstat(obj)
# Various estimation of Fst
genet.dist(obj.hf, method='Nei87', diploid=F)
genet.dist(obj.hf, method='Dch', diploid=F)
What is happening here? Initially, I load the fasta file using fasta2DNAbin function from adegenet package as DNA binary object by extracting only SNP information. Make sure that your fasta sequences are aligned properly.
Then I convert it into genind object. I also load a meta-file that contains several properties of the sequence isolates including the sampling site. I intend to estimate Fst by sampling site. I match and add site info in the genind object. Finally, I convert the genind object into hierfstat object. The meta-file may look like the following:
Strain | Site | Species |
Isolate1 | Site X | A |
Isolate2 | Site Y | B |
… | … | … |
Once you have the hierfstat object, you can estimate different FST by genet.dist function. Just make sure to specify diploid=F argument to indicate that it is a haploid dataset.
You can do some more analysis and visualization in this dataset. For example, you can check the basic stats (HO, HS, FIS, FST, etc.). A bunch of other analysis and visualization is possible: check more here.
The different methods of calculating FST are described by Takezaki and Nei (1996) paper.
FST from SNP-call data
If you have VCF file from variant calling, then its simpler to use vcftools
to estimate FST.
As input, use VCF file with either SNPs or INDELs.
Also, you need to define populations. For example, let’s define two populations:
isolate1
isolate2
isolate3
isolate5
isolate6
isolate7
Now, you can call vcftools
:
vcftools \
--gzvcf input.vcf.gz \
--weir-fst-pop population1.txt \
--weir-fst-pop population2.txt \
--out output_table.txt
Output will be a tab-separated file showing polymorphism location and respective FST value:
CHROM POS WEIR_AND_COCKERHAM_FST
contig_39 13783 0.00694444
contig_39 108394 0.827147
contig_39 119940 0.0681671
contig_39 125443 0.00347802
Now, plot it!
Leave a Reply