Quick PCA analysis from sequence alignment data in R

Sometimes you need to visualize SNP data from a small FASTA file. PCA is a great way to look for patterns in how the samples are distributed.

It’s common to run principal component analysis (PCA) on large SNP datasets in VCF format. However, sometimes you want a simpler analysis for a very small dataset. Say you have a gene alignment (or a supermatrix alignment of several genes) and want to use PCA to visualize the SNP structure in the sequence alignment file. In this post, I’ll show you how to do just that.

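library("adegenet")  # fasta2genlight() and the genlight class
library("ggplot2")   # plotting

# Extract SNPs from the alignment into a genlight object
snp <- fasta2genlight('~/path/to/fasta', snpOnly = TRUE)
# Comma-separated meta file with one row per sequence
meta <- read.table('~/path/to/meta', sep = ',', header = TRUE)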

fasta2genlight is a nice function from the adegenet library that extracts SNPs from an alignment. It also detects the ploidy of your dataset, but do not forget to check the ploidy with `snp$ploidy`.
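To get a feel for what a genlight object holds before loading your own data, here is a minimal sketch. The `toy` object and its three isolates are made up for illustration; they are built by hand rather than read from a FASTA file. Because no ploidy is passed to the constructor, adegenet infers it from the allele counts, which is why it is worth checking.

```r
library("adegenet")

# Hypothetical toy data: three isolates scored at four binary SNPs
# (0 = reference allele, 1 = alternative allele)
toy <- new("genlight",
           list(Isolate1 = c(0, 1, 1, 0),
                Isolate2 = c(1, 1, 0, 0),
                Isolate3 = c(0, 0, 1, 1)))

ploidy(toy)    # ploidy recorded for each individual
nInd(toy)      # number of individuals (sequences)
nLoc(toy)      # number of SNPs
indNames(toy)  # sequence IDs, taken from the list names
```

The same accessor functions can be called on the `snp` object returned by fasta2genlight.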

We load our sequence alignment file with this function, and we also load a meta file that contains the sequence features we are interested in visualizing in the PCA. The meta file must have a column of isolate IDs that match the sequence IDs in the FASTA file. The meta file may look like the following:

Strain      SamplingSite   Species
Isolate1    Site X         A
Isolate2    Site Y         B
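Since the isolate IDs in the meta file have to match the FASTA IDs, it can help to check that before plotting. The sketch below uses made-up IDs and a made-up meta table; with real data, `fasta_ids` would typically come from `indNames(snp)` and `meta` from the read.table call above. The column name `Strain` is assumed from the example table.

```r
# Hypothetical example data standing in for indNames(snp) and the meta file
fasta_ids <- c("Isolate1", "Isolate2", "Isolate3")
meta <- data.frame(
  Strain       = c("Isolate2", "Isolate1", "Isolate4"),
  SamplingSite = c("Site Y", "Site X", "Site Z"),
  Species      = c("B", "A", "C"),
  stringsAsFactors = FALSE
)

# IDs present in the alignment but missing from the meta file, and vice versa
missing_in_meta  <- setdiff(fasta_ids, meta$Strain)
missing_in_fasta <- setdiff(meta$Strain, fasta_ids)

# Reorder the meta rows to follow the alignment order;
# IDs without a match end up as rows of NA
meta_ordered <- meta[match(fasta_ids, meta$Strain), ]
```

If either `missing_in_meta` or `missing_in_fasta` is non-empty, the IDs are worth fixing before going further, because the PCA points and the meta rows would otherwise be paired up incorrectly.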

How to make Co-phylogeny plot: easy tanglegram in R

A tanglegram is a co-phylogeny plot, a very powerful visualization tool for examining co-evolution. Here is a tutorial on how to make one in R.

A tanglegram is a representation of co-phylogeny in which the matching tips of two phylogenetic trees are linked. This method is very useful for visualizing traits shared by both trees. For example, it can be used to visualize host-pathogen (or host-symbiont) evolution and to see whether there is any phylogenetic concordance between the two trees.

I needed to visualize the co-phylogeny of phylogenetic trees reconstructed from chromosomal and symbiotic genes. Surprisingly, I didn’t find any straightforward solution in R for drawing tanglegrams. In particular, I wanted to leverage the beautiful ggtree library. After trying out several methods, I found that the following approach works well for me so far.

In this post, I’m going to use two toy trees in the following Newick format. Note that they contain the same isolates but have different tree topologies (since, supposedly, different gene sets were used to reconstruct them).
