I was looking for a tutorial to estimate FST for my microbial (aka haploid) dataset but sadly there was no specific instruction, although other researchers are definitely looking for it (see here). Sadly, one respondent said that “there is no way to estimate Fst for microbial data”. However, ultimately I found a “way” to estimate Fst, and here is a quick R tutorial.
The fixation index, otherwise known as FST, is population differentiation due to genetic structure and is one of the fundamental concepts in population genetics. Essentially, FST is the proportion of the total genetic variance contained in a subpopulation (the S-subscript) compared to the total genetic variance (the T-subscript). Its value can range from 0 to 1. Higher FST indicates a considerable degree of differentiation among populations. The figure below illustrates what it means for two populations to have very different genetic structures.
Hierfstat is actually quite an old library package in R which provide many population genetics F-statistics estimation for haploid or diploid genetic data. I used the following script to estimate FST for my dataset:
library("hierfstat")
library("adegenet")
# Lets load the fasta file by extract only the SNPs
seq.snp = fasta2DNAbin('~/path/to/seq.fasta', snpOnly=T)
obj = DNAbin2genind(seq.snp)
# Load the meta file that contains population structure
meta <- read.table('~/path/to/meta.csv', sep=',', header = T)
# Add the structure I want to test in the GenInd object
obj$pop = as.factor(meta[match(rownames(obj$tab), meta$Strain),]$Site)
# Convert genind object into hierfstat object
obj.hf = genind2hierfstat(obj)
# Various estimation of Fst
genet.dist(obj.hf, method='Nei87', diploid=F)
genet.dist(obj.hf, method='Dch', diploid=F)
What is happening here? Initially, I load the fasta file using fasta2DNAbin function from adegenet package as DNA binary object by extracting only SNP information. Make sure that your fasta sequences are aligned properly.
Then I convert it into genind object. I also load a meta-file that contains several properties of the sequence isolates including the sampling site. I intend to estimate Fst by sampling site. I match and add site info in the genind object. Finally, I convert the genind object into hierfstat object. The meta-file may look like the following:
Strain | Site | Species |
Isolate1 | Site X | A |
Isolate2 | Site Y | B |
… | … | … |
Once you have the hierfstat object, you can estimate different FST by genet.dist function. Just make sure to specify diploid=F argument to indicate that it is a haploid dataset.
You can do some more analysis and visualization in this dataset. For example, you can check the basic stats (HO, HS, FIS, FST, etc.). A bunch of other analysis and visualization is possible: check more here.
The different methods of calculating FST are described by Takezaki and Nei (1996) paper.
Leave a Reply