Phylogeny from multi-locus sequences aka MLSA

in ,

— 414 reads

Multi-locus sequence analysis (MLSA) involves using multiple genes or loci, usually conserved housekeeping genes, to construct phylogeny and other sequence-based analyses. Since different genes may have different mutation rates, MLSA generally gives a better approximation of underlying evolution and a more realistic resolution of phylogenetic relations among taxa than only one gene. For a fundamental background understanding, read: Deciding how to construct a phylogenetic tree.

This is also a better alternative to ribosomal 16s/ITS-based analysis, especially for many bacterial species (including Bradyrhizobium, which I worked with), because the 16s/ITS are often very similar in these genera and cannot be used to differentiate species.

TL;DR

Now-a-days, I use AutoMLSA2, which essentially automates all the intermediate steps one need to do to create a MLSA phylogeny. AutoMLSA2 requires protein or amino acid sequences of your target genes/locus. Its best to download the target gene sequences from NCBI or other curated databases and then run the following command:

$ automlsa2 --query AAofInterest.faa --dir ../path/to/target/genome -t 8 --allow_missing 4 --missing_check project_name

You need to define directory path containing target genome files using --dir

--query contains fasta file of amino acid sequences for gene/locus of interest.

-t defines number of threads.

--allow_missing is how many missing genes/locus can be tolerated for a genome, if a genome do not have more more missing genes, it will be dropped from further analysis.

Toy data

For this tutorial to follow along, you can download complete genomes of 12 alpha-proteobacteria from this link.

For making MLSA, we will look for individual genes in these genomes. You can download nucleotide and amino acid sequences of these genes to be used as query from NCBI: dnaK, recA, 16s rRNA, etc.

This protocol involves doing BLAST on a local database.

Homology search

Let’s join all these genomes together. In the UNIX command shell, move to the directory which contains all genome fasta files and concatenate them in a single file:

$ cat *.fasta > database/genomes.fasta

Make a BLAST database for all genomes:

~/database$ makeblastdb -in genomes.fasta -dbtype nucl

Now let’s do the query for all of your target genes. I would download the reference gene sequence from NCBI to be used as a query and keep them in separate fasta files. This nucleotide BLAST will generate a detailed output report, including the matched sequence.

$ blastn -db database/genomes.fasta -query geneOfInterest.fasta -out geneOfInterest.fasta.out -outfmt '7 std qcovs sseq'

In some cases, doing a simple nucleotide BLAST might not be enough, and you need to do other variants of BLAST, such as tblastn, which use amino acid query sequence to search for it to an NCBI nucleotide database which has been translated to all six reading frames.

$ tblastn -db database/genomes.fasta -query proteinOfInterest.fasta -outfmt '7 std qcovs sseq' -out geneOfInterest.fasta.out

Now it’s time to mine the blast output file and search for hits for each query that meets a particular threshold. For example, we can search for the hit with at least 50% similarity with the query. In the BLAST output format 7, the similarity statistics are reported on the 3rd column; therefore we can use awk '$3 > 50'.

This is an important filtering step since we don’t want to keep a sequence-hit below to a certain length or percent match. You should adjust this step based on your project. I’m showing a command for one locus. Replace the locus name for your gene of interest.

$ grep '^geneOfInterest' geneOfInterest.fasta.out | awk '$3 > 50'  |  cut -f 2,14 | awk '{print">"$0}' | awk '{print$1"\n"$2}' > geneOfInterest.extracted.fasta

Here, second column contains the subject contig id from your database, and column 14 contains the amino acid sequence. So using cut -f 2,14 we can extract those and write into a fasta file.

Ideally, we want to get one sequence per gene. In many cases, some genes have higher copy numbers; therefore, multiple homologs can be present. In that case, we can do filtering by size and percent match as well as higher query coverage and try to get the hit that gives maximum value.

This work fine if you are only interested in the amino acid sequences reported in the blast output file. However, if you need to mine the corresponding DNA sequences, you can use the `samtools faidx` command.

For this, lets get the sequence contigs, and coordinates first:

grep '^geneOfInterest' geneOfInterest.fasta.out | awk '$3 > 50' | cut -f 2,9,10 > coordinates.txt

Now we need to actually run samtools faidx command and extract nucleotide sequences. Sometime your blast output may have multiple hit from same genome. I have written a Python script (GitHub link) which you can use to generate to extract only largest hit. This script, make_faidx.py, will go through the co-ordinate file, for each genome it will keep the LARGEST hit, and if the co-ordinates are from complementary strand, it will add flag for reverse-complement.

python make_faidx.py coordinates.txt database/genomes.fasta

The script will generate a bash code, which you need to run again:

# Make it executable
chmod +x faidx_command.sh

# Run it to extract sequences
./faidx_command.sh > geneOfInterest.extracted.fasta

Once we have the extracted fasta sequences of hits that match our criterion, we can take a look and count the number of hits extracted.

$ less geneOfInterest.extracted.fasta
$ grep '^>' geneOfInterest.extracted.fasta | wc -l

Repeat the process for all other genes of interest.

Align and concatenate BLAST hits

Now that all target genes’ homologous match has been extracted in a separate fasta file. We want to concatenate them together. However, the sequences need to be aligned.

Let’s align them using MAFFT:

$ mafft input.fasta > output.fasta.aln

For concatenating the genes, we can use the SuperMatrix package in R. However, this can be slow.

Another alternative is to use this Pearl script, which is super fast and works really well: https://github.com/nylander/catfasta2phyml

Use this following command to concatenate different aligned loci:

$ perl catfasta2phyml.pl -f --concatenate --verbose orfs/*.fna > concatenated.fna 2> log

Last part of the log file will contain information about DNA alignment partition.

Prepare the partition file

Since we are doing partitioned analysis for multi-locus alignments, different genes/partitions can have different evolutionary histories and mutation rates. We need to account for that, and we can assign different substitution models for different genes/loci. Therefore, while we are joining the loci from each isolate, we need to track at what position different gene starts or ends in the concatenated sequence file. This is also called a partition file. One way is to use this RAxML format file:

DNA, geneA.fasta.aln = 1-209
DNA, geneB.fasta.aln = 210-815
DNA, geneC.fasta.aln = 816-1230

Phylogeny with IQTree

Now, we can use IQTree to infer the phylogeny. We also need to add the RAxML position file and concatenated sequence file. We’ll be doing 1000 bootstrap resampling:

$ iqtree2 -s concatenated.fna -bb 1000 -alrt 1000 -nt 8 -q partition.raxml

Doing it in the high-performance cluster will look like this:

$ sbatch -N 1 -n 8 -p intel --wrap "module load iqtree/2.1.1;  iqtree2 -s concatenated.fna -bb 1000 -alrt 1000 -nt 8 -q partition.raxml

To visualize phylogeny, ggtree is my favorite package because it offers customization in many ways!

If you have any questions, please leave a comment! If you detect an error, please let me know.

Phylogeny with FastTree

IQTree is great, but it’s slow since it uses maximum-likelihood approach. Therefore, if you want to see the tree structure in general while IQTree is running, you can use FastTree. FastTree uses heuristics to approximate a maximum-likelihood tree. For getting a quick idea, it is pretty good.

$ FastTree concatenated.fna > fasttree.output.txt

Now that you have a good phylogeny, you can generate a publication quality figure by using ggtree in R. Let’s follow this tutorial: Creating a Publication Quality Phylogeny Using ggtree


Converting GenBank to Fasta

Here’s the scenario. You have a bunch of newly generated genome sequence files. They are all mapped and annotated. If they are assembled, then you can directly use that. If you have GenBank, then you need to search and extract some specific genes from all genomes.

I generally change the formats to fasta using this program: https://github.com/acarafat/gbk2fasta

This program has three options: fna, ffn, and faa. fna extracts all GenBank sequence records. ffn extracts only CDS records. fna extracts CDS and translates it into amino acid sequences. This Python script will be used in command line/shell/terminal. Use the following prompt to convert .gbk file to .fasta:

$ python gbk2fasta.py input.gbk fna


Discover more from Arafat Rahman

Subscribe to get the latest posts sent to your email.


Comments

One response to “Phylogeny from multi-locus sequences aka MLSA”

  1. […] A decade ago, 2012-2013, I used MEGA5 to infer phylogeny using simple Neighbour-Joining methods, and used the figure generated by MEGA5 to present and publish my results. Later, when I started learning other phylogeny reconstruction methods like Maximum Likelihood (ML) and Bayesian, I started to explore plethora of other tree drawing software, including FigTree which is a handy click-and-annotate phylogeny software. Now a days, ML tree seems to be the minimum requirement, and using MLSA (multi-locus sequence analysis) is even better. […]

Leave a Reply to Creating a Publication Quality Phylogeny Using ggtree – Arafat RahmanCancel reply

Learn Python for Bioinformatics

I have created a set of worksheets that give a quick overview of Python for Bioinformatics (as well as intro to UNIX). You just have to give it 3-6 hours, and you will know the essentials!

Join 87 other subscribers