Calculating Fst for haploid data in R

I was looking for a tutorial to estimate F_ST for my microbial (aka haploid) dataset but sadly there was no specific instruction, although other researchers are definitely looking for it (see here). Sadly, one respondent said that “there is no way to estimate Fst for microbial data”. However, ultimately I found a “way” to estimate Fst, and here is a quick R tutorial.

The fixation index, otherwise known as F_ST, is population differentiation due to genetic structure and is one of the fundamental concepts in population genetics. Essentially, F_ST is the proportion of the total genetic variance contained in a subpopulation (the S-subscript) compared to the total genetic variance (the T-subscript). Its value can range from 0 to 1. Higher F_ST indicates a considerable degree of differentiation among populations. The figure below illustrates what it means for two populations to have very different genetic structures.

(Left) Two sub-population have the same genotype composition, hence Fst is 0. On the other hand, (right) if two sub-population have very different genotype compositions, the Fst will be 1.

Hierfstat is actually quite an old library package in R which provide many population genetics F-statistics estimation for haploid or diploid genetic data. I used the following script to estimate F_ST for my dataset:

library("hierfstat")
library("adegenet")

# Lets load the fasta file by extract only the SNPs
seq.snp = fasta2DNAbin('~/path/to/seq.fasta', snpOnly=T)
obj = DNAbin2genind(seq.snp)

# Load the meta file that contains population structure 
meta <- read.table('~/path/to/meta.csv', sep=',', header = T)

# Add the structure I want to test in the GenInd object
obj$pop = as.factor(meta[match(rownames(obj$tab), meta$Strain),]$Site)

# Convert genind object into hierfstat object
obj.hf = genind2hierfstat(obj)

# Various estimation of Fst
genet.dist(obj.hf, method='Nei87', diploid=F)
genet.dist(obj.hf, method='Dch', diploid=F)

What is happening here? Initially, I load the fasta file using fasta2DNAbin function from adegenet package as DNA binary object by extracting only SNP information. Make sure that your fasta sequences are aligned properly.

Then I convert it into genind object. I also load a meta-file that contains several properties of the sequence isolates including the sampling site. I intend to estimate Fst by sampling site. I match and add site info in the genind object. Finally, I convert the genind object into hierfstat object. The meta-file may look like the following:

Strain	Site	Species
Isolate1	Site X	A
Isolate2	Site Y	B
…	…	…

Once you have the hierfstat object, you can estimate different F_ST by genet.dist function. Just make sure to specify diploid=F argument to indicate that it is a haploid dataset.

You can do some more analysis and visualization in this dataset. For example, you can check the basic stats (HO, HS, FIS, FST, etc.). A bunch of other analysis and visualization is possible: check more here.

The different methods of calculating F_ST are described by Takezaki and Nei (1996) paper.

F_ST from SNP-call data

If you have VCF file from variant calling, then its simpler to use vcftools to estimate FST.

As input, use VCF file with either SNPs or INDELs.

Also, you need to define populations. For example, let’s define two populations:

isolate1
isolate2
isolate3

isolate5
isolate6
isolate7

Now, you can call vcftools:

vcftools \
   --gzvcf input.vcf.gz \
   --weir-fst-pop population1.txt \
   --weir-fst-pop population2.txt \
   --out output_table.txt

Output will be a tab-separated file showing polymorphism location and respective F_ST value:

CHROM	POS	WEIR_AND_COCKERHAM_FST
contig_39	13783	0.00694444
contig_39	108394	0.827147
contig_39	119940	0.0681671
contig_39	125443	0.00347802

Now, plot it!

Calculating Fst for haploid data in R

F_ST from SNP-call data

Discover more from Arafat Rahman

Comments

Leave a ReplyCancel reply

Learn Python for Bioinformatics

Calculating Fst for haploid data in R

FST from SNP-call data

Share this:

Discover more from Arafat Rahman

Comments

Leave a ReplyCancel reply

Learn Python for Bioinformatics

F_ST from SNP-call data