It’s common to run principal component analysis (PCA) on large SNP datasets in VCF format. Sometimes, however, you want a simpler analysis of a very small dataset. Say you have a gene alignment (or a supermatrix alignment of several genes) and want to visualize the SNP structure in the sequence alignment file using PCA. In this post, I’ll show you how to do just that.
library("adegenet")
library("ggplot2")
snp <- fasta2genlight('~/path/to/fasta', snpOnly = TRUE)
meta <- read.table('~/path/to/meta', sep = ',', header = TRUE)
`fasta2genlight` is a handy function from the adegenet package that extracts SNPs from an alignment. It also detects the ploidy of your dataset, but do not forget to check the ploidy with `snp$ploidy`.
We load the sequence alignment file with this function, and we also load a meta file containing the sequence features we want to visualize in the PCA. The meta file must include a column of isolate IDs that match the sequence IDs in the FASTA file. It may look like the following:
| Strain | SamplingSite | Species |
| --- | --- | --- |
| Isolate1 | Site X | A |
| Isolate2 | Site Y | B |
| … | … | … |
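
Because the meta rows have to line up with the sequences, it can help to check the IDs before plotting. Below is a minimal sketch in base R, assuming the ID column is called `Strain` as in the example table. `align_meta` is a hypothetical helper written for this post, not an adegenet function, and the toy data only stands in for your real inputs.

```r
# Hypothetical helper (not part of adegenet): reorder the metadata rows so
# they follow the order of the sequence IDs, and stop if any ID has no row.
align_meta <- function(ids, meta, id_col = "Strain") {
  idx <- match(ids, meta[[id_col]])
  missing_ids <- ids[is.na(idx)]
  if (length(missing_ids) > 0) {
    stop("No metadata row for: ", paste(missing_ids, collapse = ", "))
  }
  meta[idx, , drop = FALSE]
}

# Toy data standing in for the real sequence IDs and meta file
ids <- c("Isolate2", "Isolate1")
meta_toy <- data.frame(
  Strain = c("Isolate1", "Isolate2"),
  SamplingSite = c("Site X", "Site Y"),
  Species = c("A", "B")
)

# Rows now follow the order of `ids`
aligned <- align_meta(ids, meta_toy)
print(aligned)
```

With the real objects, something like `meta <- align_meta(indNames(snp), meta)` should do the same job, assuming the individual names stored in the genlight object are the FASTA sequence IDs.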