Phylogeny from multi-locus sequences aka MLSA

in ,

— 211 reads

Multi-locus sequence analysis (MLSA) involves using multiple genes or loci, usually conserved housekeeping genes, to construct phylogeny and other sequence-based analyses. Since different genes may have different mutation rates, MLSA generally gives a better approximation of underlying evolution and a more realistic resolution of phylogenetic relations among taxa than only one gene.

This is also a better alternative to ribosomal 16s/ITS-based analysis, especially for many bacterial species (including Bradyrhizobium, which I worked with), because the 16s/ITS are often very similar in these genera and cannot be used to differentiate species.

Multilocus sequence alignment may look like this. Source

TL;DR

Now-a-days, I use AutoMLSA2, which essentially automates all the intermediate steps one need to do to create a MLSA phylogeny. AutoMLSA2 requires protein or amino acid sequences of your target genes/locus. Its best to download the target gene sequences from NCBI or other curated databases and then run the following command:

$ automlsa2 --query AAofInterest.faa --dir ../path/to/target/genome -t 8 --allow_missing 4 --missing_check project_name

You need to define directory path containing target genome files using --dir

--query contains fasta file of amino acid sequences for gene/locus of interest.

-t defines number of threads.

--allow_missing is how many missing genes/locus can be tolerated for a genome, if a genome do not have more more missing genes, it will be dropped from further analysis.

Preparation of data

Here’s the scenario. You have a bunch of newly generated genome sequence files. They are all mapped and annotated. Generally, the annotation program gives output in Genbank (gbk) format. You want to search and extract some specific genes from all genomes.

So basically, this protocol involves doing BLAST on a local database.

I generally change the formats to fasta using this program I wrote: https://github.com/acarafat/gbk2fasta

This program has three options: fna, ffn, and faa. fna extracts all GenBank sequence records. ffn extracts only CDS records. fna extracts CDS and translates it into amino acid sequences. This Python script will be used in command line/shell/terminal. Use the following prompt to convert .gbk file to .fasta:

$ python gbk2fasta.py input.gbk fna

Homology search

Let’s join all these genomes together. In the UNIX command shell, move to the directory which contains all genome fasta files and concatenate them in a single file:

$ cat *.fasta > database/genomes.fasta

Make a BLAST database for all genomes:

~/database$ makeblastdb -in genomes.fasta -dbtype nucl

Now let’s do the query for all of your target genes. I would download the reference gene sequence from NCBI to be used as a query and keep them in separate fasta files. This nucleotide BLAST will generate a detailed output report, including the matched sequence.

$ blastn -db database/genomes.fasta -query geneOfInterest.fasta -out geneOfInterest.fasta.out -outfmt '7 std qcovs sseq'

In some cases, doing a simple nucleotide BLAST might not be enough, and you need to do other variants of BLAST, such as tblastn, which use amino acid query sequence to search for it to an NCBI nucleotide database which has been translated to all six reading frames.

$ tblastn -db database/genomes.fasta -query proteinOfInterest.fasta -outfmt '7 std qcovs sseq' -out geneOfInterest.fasta.out

Now it’s time to mine the blast output file and search for hits for each query that meets a particular threshold. For example, we can search for the hit with at least 50% similarity with the query. In the BLAST output format 7, the similarity statistics are reported on the 3rd column; therefore we can use awk '$3 > 50'.

This is an important filtering step since we don’t want to keep a sequence-hit below to a certain length or percent match. You should adjust this step based on your project. I’m showing a command for one locus. Replace the locus name for your gene of interest.

$ grep '^geneOfInterest' geneOfInterest.fasta.out | awk '$3 > 50'  |  cut -f 2,14 | awk '{print">"$0}' | awk '{print$1"\n"$2}' > geneOfInterest.extracted.fasta

Here, second column contains the subject contig id from your database, and column 14 contains the amino acid sequence. So using cut -f 2,14 we can extract those and write into a fasta file.

Ideally, we want to get one sequence per gene. In many cases, some genes have higher copy numbers; therefore, multiple homologs can be present. In that case, we can do filtering by size and percent match as well as higher query coverage and try to get the hit that gives maximum value.

This work fine if you are only interested in the amino acid sequences reported in the blast output file. However, if you need to mine the corresponding DNA sequences, you can use the `samtools faidx` command.

For this, lets get the sequence contigs, and coordinates first:

grep '^geneOfInterest' geneOfInterest.fasta.out | awk '$3 > 50' | cut -f 2,9,10 > coordinates.txt

Now we need to actually run samtools faidx command and extract nucleotide sequences. Sometime your blast output may have multiple hit from same genome. I have written a Python script (GitHub link) which you can use to generate to extract only largest hit. The script will generate a bash code, which you need to run again.

Once we have the extracted fasta sequences of hits that match our criterion, we can take a look and count the number of hits extracted.

$ less geneOfInterest.extracted.fasta
$ grep '^>' geneOfInterest.extracted.fasta | wc -l

Repeat the process for all other genes of interest.

Align and concatenate BLAST hits

Now that all target genes’ homologous match has been extracted in a separate fasta file. We want to concatenate them together. However, the sequences need to be aligned.

Let’s align them using MAFFT:

$ mafft input.fasta > output.fasta.aln

For concatenating the genes, we can use the SuperMatrix package in R. However, this can be slow.

Another alternative is to use this Pearl script, which is super fast and works really well: https://github.com/nylander/catfasta2phyml

Use this following command to concatenate different aligned loci:

$ perl catfasta2phyml.pl -f --concatenate --verbose orfs/*.fna > concatenated.fna 2> log

Last part of the log file will contain information about DNA alignment partition.

Prepare the partition file

Since we are doing partitioned analysis for multi-locus alignments, different genes/partitions can have different evolutionary histories and mutation rates. We need to account for that, and we can assign different substitution models for different genes/loci. Therefore, while we are joining the loci from each isolate, we need to track at what position different gene starts or ends in the concatenated sequence file. This is also called a partition file. One way is to use this RAxML format file:

DNA, geneA.fasta.aln = 1-209
DNA, geneB.fasta.aln = 210-815
DNA, geneC.fasta.aln = 816-1230

Phylogeny with IQTree

Now, we can use IQTree to infer the phylogeny. We also need to add the RAxML position file and concatenated sequence file. We’ll be doing 1000 bootstrap resampling:

$ iqtree2 -s concatenated.fna -bb 1000 -alrt 1000 -nt 8 -q partition.raxml

Doing it in the high-performance cluster will look like this:

$ sbatch -N 1 -n 8 -p intel --wrap "module load iqtree/2.1.1;  iqtree2 -s concatenated.fna -bb 1000 -alrt 1000 -nt 8 -q partition.raxml

To visualize phylogeny, ggtree is my favorite package because it offers customization in many ways!

If you have any questions, please leave a comment! If you detect an error, please let me know.

Phylogeny with FastTree

IQTree is great, but it’s slow since it uses maximum-likelihood approach. Therefore, if you want to see the tree structure in general while IQTree is running, you can use FastTree. FastTree uses heuristics to approximate a maximum-likelihood tree. For getting a quick idea, it is pretty good.

$ FastTree concatenated.fna > fasttree.output.txt

Comments

One response to “Phylogeny from multi-locus sequences aka MLSA”

  1. […] A decade ago, 2012-2013, I used MEGA5 to infer phylogeny using simple Neighbour-Joining methods, and used the figure generated by MEGA5 to present and publish my results. Later, when I started learning other phylogeny reconstruction methods like Maximum Likelihood (ML) and Bayesian, I started to explore plethora of other tree drawing software, including FigTree which is a handy click-and-annotate phylogeny software. Now a days, ML tree seems to be the minimum requirement, and using MLSA (multi-locus sequence analysis) is even better. […]

Leave a Reply

Join as a subscriber

Only the posts on data visualization, bioinformatics how to tutorials, web-development, and general comments on research and science will be sent.

Join 11 other subscribers