Calculating Fst for haploid data in R

I was looking for a tutorial on estimating F_ST for my microbial (i.e., haploid) dataset, but I could not find any specific instructions, although other researchers are clearly looking for the same thing (see here). One respondent even said that “there is no way to estimate Fst for microbial data”. In the end I did find a way to estimate Fst, and here is a quick R tutorial.

The fixation index, otherwise known as F_ST, is a measure of population differentiation due to genetic structure and is one of the fundamental concepts in population genetics. Essentially, F_ST is the proportion of the total genetic variance (the T subscript) that is explained by differences among subpopulations (the S subscript) rather than by variation within them. Its value ranges from 0 to 1, and a higher F_ST indicates a greater degree of differentiation among populations. The figure below illustrates the two extremes.

(Left) The two subpopulations have the same genotype composition, hence Fst is 0. (Right) If the two subpopulations have completely different genotype compositions, Fst is 1.
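To make the two extremes in the figure concrete, here is a minimal base-R sketch that computes F_ST = (H_T - H_S) / H_T at a single locus, where H_S is the average gene diversity within subpopulations and H_T is the gene diversity of the pooled population. The function name fst_one_locus and the toy alleles are my own illustration. This is the textbook definition with equal weight per subpopulation, not the sample-size-corrected estimators that hierfstat implements.

```r
# Toy illustration only: F_ST = (H_T - H_S) / H_T at one haploid locus,
# giving equal weight to each subpopulation (no sample-size correction).
fst_one_locus <- function(alleles, pop) {
  # Allele frequencies within each subpopulation (rows = pops, cols = alleles)
  freq <- prop.table(table(pop, alleles), margin = 1)
  # H_S: mean gene diversity within subpopulations
  hs <- mean(1 - rowSums(freq^2))
  # H_T: gene diversity of the pooled population, from mean allele frequencies
  ht <- 1 - sum(colMeans(freq)^2)
  (ht - hs) / ht
}

# Hypothetical data: four isolates from each of two sites
pop       <- rep(c("X", "Y"), each = 4)
same      <- c("A", "A", "T", "T", "A", "A", "T", "T")  # identical composition
different <- c("A", "A", "A", "A", "T", "T", "T", "T")  # no shared alleles

fst_same <- fst_one_locus(same, pop)
fst_diff <- fst_one_locus(different, pop)
print(c(same = fst_same, different = fst_diff))
```

With identical allele compositions the value is 0, and with no shared alleles it is 1. A locus with only one allele overall has H_T = 0, so the ratio is undefined (NaN) and such loci would need to be dropped first.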

hierfstat is a long-established R package that provides estimators for many population genetics F-statistics, for both haploid and diploid genetic data. I used the following script to estimate F_ST for my dataset:

library("hierfstat")
library("adegenet")

# Load the aligned FASTA file, keeping only the polymorphic sites (SNPs)
seq.snp <- fasta2DNAbin('~/path/to/seq.fasta', snpOnly = TRUE)
obj <- DNAbin2genind(seq.snp)

# Load the metadata file that describes the population structure
meta <- read.table('~/path/to/meta.csv', sep = ',', header = TRUE)

# Assign each isolate to its sampling site (the grouping to test) in the genind object
obj$pop <- as.factor(meta[match(rownames(obj$tab), meta$Strain), ]$Site)

# Convert the genind object into a hierfstat data frame
obj.hf <- genind2hierfstat(obj)

# Pairwise genetic distances between sites (haploid data, so diploid = FALSE)
genet.dist(obj.hf, method = 'Nei87', diploid = FALSE)  # Nei's (1987) Fst
genet.dist(obj.hf, method = 'Dch', diploid = FALSE)    # Cavalli-Sforza and Edwards chord distance

What is happening here? First, I load the FASTA file with the fasta2DNAbin function from the adegenet package, which reads it as a DNAbin (binary DNA) object and keeps only the SNP positions. Make sure that your FASTA sequences are properly aligned before this step.
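A quick way to catch an unaligned file is to confirm that every record has the same length. The following base-R sketch does only that; it writes a tiny made-up FASTA to a temporary file so it can run on its own, and you would point fasta_path at your own file instead. Equal lengths are necessary but not sufficient for a good alignment, so treat this as a sanity check rather than a validation.

```r
# Hypothetical two-record FASTA written to a temp file so the sketch is self-contained
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">Isolate1", "ACGTAC", "GT", ">Isolate2", "ACGAACGT"), fasta_path)

lines <- readLines(fasta_path)
lines <- lines[nchar(lines) > 0]          # ignore blank lines
is_header <- startsWith(lines, ">")

# Assign each sequence line to the most recent header, then sum lengths per record
record <- cumsum(is_header)
seq_lengths <- as.vector(tapply(nchar(lines[!is_header]), record[!is_header], sum))
names(seq_lengths) <- sub("^>", "", lines[is_header])

# In an alignment, all records share one length
aligned <- length(unique(seq_lengths)) == 1
print(seq_lengths)
print(aligned)
```

The sketch assumes that every header is followed by at least one sequence line; an empty record would shift the names.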

Then I convert the DNAbin object into a genind object. I also load a metadata file that contains several properties of the sequenced isolates, including the sampling site. Since I want to estimate Fst by sampling site, I match each sequence name to its row in the metadata and store the site as the population in the genind object. Finally, I convert the genind object into a hierfstat object. The metadata file may look like the following:

Strain	Site	Species
Isolate1	Site X	A
Isolate2	Site Y	B
…	…	…
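The matching step is worth checking, because match() silently returns NA for any sequence name that is missing from the metadata, and that isolate then ends up without a population. Here is a small base-R sketch of the same lookup with made-up values: ids stands in for rownames(obj$tab), and the meta data frame stands in for the table read from meta.csv.

```r
# Hypothetical stand-ins for rownames(obj$tab) and the metadata table
ids <- c("Isolate2", "Isolate1", "Isolate3")
meta <- data.frame(
  Strain = c("Isolate1", "Isolate2"),
  Site   = c("Site X", "Site Y"),
  stringsAsFactors = FALSE
)

# For each sequence ID, find its row in the metadata (NA if it is absent)
idx  <- match(ids, meta$Strain)
site <- as.factor(meta$Site[idx])

# Isolates without metadata would get an NA population; list them before going on
missing_ids <- ids[is.na(idx)]
print(data.frame(id = ids, site = site))
print(missing_ids)
```

Note that the sites come back in the order of the sequences, not the order of the metadata rows, which is exactly what the population assignment needs.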

Once you have the hierfstat object, you can compute different F_ST estimates and related genetic distances with the genet.dist function by changing its method argument. Just make sure to set the diploid argument to FALSE to indicate that the dataset is haploid.

You can run further analyses and visualizations on this dataset. For example, you can check the basic statistics (HO, HS, FIS, FST, etc.), which hierfstat reports through its basic.stats function. Many other analyses and visualizations are possible: check more here.

The different methods of calculating F_ST are described in Takezaki and Nei (1996).
