For one of my Ph.D. projects, I had to generate a phylogeny from multi-locus sequence data. I often have to repeat similar analyses and need to go back to a previous workflow to check what I actually did. I’m sharing the protocol here mainly to help my future self, and maybe it will be useful to you too!
Multi-locus sequence analysis (MLSA) uses multiple genes or loci, usually conserved housekeeping genes, to construct phylogenies and perform other sequence-based analyses. Since different genes may have different mutation rates, MLSA generally gives a better approximation of the underlying evolution and a more realistic resolution of phylogenetic relationships among taxa than a single gene does.
MLSA is also a better alternative to ribosomal 16S/ITS-based analysis, especially for many bacterial genera (including Bradyrhizobium, which I worked with), because the 16S/ITS sequences are often very similar within these genera and cannot be used to differentiate species.
How to estimate the fixation index, FST, to test for population differentiation in R
I was looking for a tutorial on estimating FST for my microbial (aka haploid) dataset, but sadly I could not find any specific instructions, although other researchers are definitely looking for them too (see here). One respondent even said that “there is no way to estimate Fst for microbial data”. However, I ultimately found a “way” to estimate FST, and here is a quick R tutorial.
The fixation index, otherwise known as FST, is a measure of population differentiation due to genetic structure and is one of the fundamental concepts in population genetics. Essentially, FST is the proportion of the total genetic variance contained in a subpopulation (the S subscript) relative to the total genetic variance (the T subscript). Its value can range from 0 to 1. A value near 0 means the populations are genetically similar, while a higher FST indicates a greater degree of differentiation among populations. The figure below illustrates what it means for two populations to have very different genetic structures.
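To make the definition concrete, here is a small worked example using one common formulation of FST, based on expected gene diversity within subpopulations (H_S) and in the pooled population (H_T). This is only a sketch: the allele frequencies are made up for illustration, the two populations are assumed to be the same size, and this is not necessarily the estimator used in the R tutorial.

```latex
% One common formulation (an assumption here, not necessarily the tutorial's estimator)
% H_S: mean expected gene diversity within subpopulations
% H_T: expected gene diversity of the pooled population
F_{ST} = \frac{H_T - H_S}{H_T}

% Illustrative (made-up) data: one biallelic locus, two equally sized populations,
% with allele frequencies p_1 = 0.9 and p_2 = 0.1
H_S = \tfrac{1}{2}\left[\, 2(0.9)(0.1) + 2(0.1)(0.9) \,\right] = 0.18

\bar{p} = \tfrac{1}{2}(0.9 + 0.1) = 0.5, \qquad H_T = 2(0.5)(0.5) = 0.5

F_{ST} = \frac{0.5 - 0.18}{0.5} = 0.64
```

If both populations instead had the same allele frequency, H_S would equal H_T and FST would be 0, that is, no differentiation.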