Multi-locus sequence analysis (MLSA) involves using multiple genes or loci, usually conserved housekeeping genes, to construct phylogeny and other sequence-based analyses. Since different genes may have different mutation rates, MLSA generally gives a better approximation of underlying evolution and a more realistic resolution of phylogenetic relations among taxa than only one gene. For a fundamental background understanding, read: Deciding how to construct a phylogenetic tree.
This is also a better alternative to ribosomal 16s/ITS-based analysis, especially for many bacterial species (including Bradyrhizobium, which I worked with), because the 16s/ITS are often very similar in these genera and cannot be used to differentiate species.
TL;DR
Now-a-days, I use AutoMLSA2, which essentially automates all the intermediate steps one need to do to create a MLSA phylogeny. AutoMLSA2 requires protein or amino acid sequences of your target genes/locus. Its best to download the target gene sequences from NCBI or other curated databases and then run the following command:
$ automlsa2 --query AAofInterest.faa --dir ../path/to/target/genome -t 8 --allow_missing 4 --missing_check project_nameYou need to define directory path containing target genome files using --dir
--query contains fasta file of amino acid sequences for gene/locus of interest.
-t defines number of threads.
--allow_missing is how many missing genes/locus can be tolerated for a genome, if a genome do not have more more missing genes, it will be dropped from further analysis.
Toy data
For this tutorial to follow along, you can download complete genomes of 12 alpha-proteobacteria from this link.
For making MLSA, we will look for individual genes in these genomes. You can download nucleotide and amino acid sequences of these genes to be used as query from NCBI: dnaK, recA, 16s rRNA, etc.
This protocol involves doing BLAST on a local database.
Homology search
Let’s join all these genomes together. In the UNIX command shell, move to the directory which contains all genome fasta files and concatenate them in a single file:
$ cat *.fasta > database/genomes.fastaMake a BLAST database for all genomes:
~/database$ makeblastdb -in genomes.fasta -dbtype nuclNow let’s do the query for all of your target genes. I would download the reference gene sequence from NCBI to be used as a query and keep them in separate fasta files. This nucleotide BLAST will generate a detailed output report, including the matched sequence.
$ blastn -db database/genomes.fasta -query geneOfInterest.fasta -out geneOfInterest.fasta.out -outfmt '7 std qcovs sseq'In some cases, doing a simple nucleotide BLAST might not be enough, and you need to do other variants of BLAST, such as tblastn, which use amino acid query sequence to search for it to an NCBI nucleotide database which has been translated to all six reading frames.
$ tblastn -db database/genomes.fasta -query proteinOfInterest.fasta -outfmt '7 std qcovs sseq' -out geneOfInterest.fasta.outNow it’s time to mine the blast output file and search for hits for each query that meets a particular threshold. For example, we can search for the hit with at least 50% similarity with the query. In the BLAST output format 7, the similarity statistics are reported on the 3rd column; therefore we can use awk '$3 > 50'.
This is an important filtering step since we don’t want to keep a sequence-hit below to a certain length or percent match. You should adjust this step based on your project. I’m showing a command for one locus. Replace the locus name for your gene of interest.
$ grep '^geneOfInterest' geneOfInterest.fasta.out | awk '$3 > 50'  |  cut -f 2,14 | awk '{print">"$0}' | awk '{print$1"\n"$2}' > geneOfInterest.extracted.fastaHere, second column contains the subject contig id from your database, and column 14 contains the amino acid sequence. So using cut -f 2,14 we can extract those and write into a fasta file.
Ideally, we want to get one sequence per gene. In many cases, some genes have higher copy numbers; therefore, multiple homologs can be present. In that case, we can do filtering by size and percent match as well as higher query coverage and try to get the hit that gives maximum value.
This work fine if you are only interested in the amino acid sequences reported in the blast output file. However, if you need to mine the corresponding DNA sequences, you can use the `samtools faidx` command.
For this, lets get the sequence contigs, and coordinates first:
grep '^geneOfInterest' geneOfInterest.fasta.out | awk '$3 > 50' | cut -f 2,9,10 > coordinates.txtNow we need to actually run samtools faidx command and extract nucleotide sequences. Sometime your blast output may have multiple hit from same genome. I have written a Python script (GitHub link) which you can use to generate to extract only largest hit. This script, make_faidx.py, will go through the co-ordinate file, for each genome it will keep the LARGEST hit, and if the co-ordinates are from complementary strand, it will add flag for reverse-complement. 
python make_faidx.py coordinates.txt database/genomes.fastaThe script will generate a bash code, which you need to run again:
# Make it executable
chmod +x faidx_command.sh
# Run it to extract sequences
./faidx_command.sh > geneOfInterest.extracted.fastaOnce we have the extracted fasta sequences of hits that match our criterion, we can take a look and count the number of hits extracted.
$ less geneOfInterest.extracted.fasta
$ grep '^>' geneOfInterest.extracted.fasta | wc -lRepeat the process for all other genes of interest.
Align and concatenate BLAST hits
Now that all target genes’ homologous match has been extracted in a separate fasta file. We want to concatenate them together. However, the sequences need to be aligned.
Let’s align them using MAFFT:
$ mafft input.fasta > output.fasta.alnFor concatenating the genes, we can use the SuperMatrix package in R. However, this can be slow.
Another alternative is to use this Pearl script, which is super fast and works really well: https://github.com/nylander/catfasta2phyml
Use this following command to concatenate different aligned loci:
$ perl catfasta2phyml.pl -f --concatenate --verbose orfs/*.fna > concatenated.fna 2> logLast part of the log file will contain information about DNA alignment partition.
Prepare the partition file
Since we are doing partitioned analysis for multi-locus alignments, different genes/partitions can have different evolutionary histories and mutation rates. We need to account for that, and we can assign different substitution models for different genes/loci. Therefore, while we are joining the loci from each isolate, we need to track at what position different gene starts or ends in the concatenated sequence file. This is also called a partition file. One way is to use this RAxML format file:
DNA, geneA.fasta.aln = 1-209
DNA, geneB.fasta.aln = 210-815
DNA, geneC.fasta.aln = 816-1230Phylogeny with IQTree
Now, we can use IQTree to infer the phylogeny. We also need to add the RAxML position file and concatenated sequence file. We’ll be doing 1000 bootstrap resampling:
$ iqtree2 -s concatenated.fna -bb 1000 -alrt 1000 -nt 8 -q partition.raxmlDoing it in the high-performance cluster will look like this:
$ sbatch -N 1 -n 8 -p intel --wrap "module load iqtree/2.1.1;  iqtree2 -s concatenated.fna -bb 1000 -alrt 1000 -nt 8 -q partition.raxmlTo visualize phylogeny, ggtree is my favorite package because it offers customization in many ways!
If you have any questions, please leave a comment! If you detect an error, please let me know.
Phylogeny with FastTree
IQTree is great, but it’s slow since it uses maximum-likelihood approach. Therefore, if you want to see the tree structure in general while IQTree is running, you can use FastTree. FastTree uses heuristics to approximate a maximum-likelihood tree. For getting a quick idea, it is pretty good.
$ FastTree concatenated.fna > fasttree.output.txtNow that you have a good phylogeny, you can generate a publication quality figure by using ggtree in R. Let’s follow this tutorial: Creating a Publication Quality Phylogeny Using ggtree
Converting GenBank to Fasta
Here’s the scenario. You have a bunch of newly generated genome sequence files. They are all mapped and annotated. If they are assembled, then you can directly use that. If you have GenBank, then you need to search and extract some specific genes from all genomes.
I generally change the formats to fasta using this program: https://github.com/acarafat/gbk2fasta
This program has three options: fna, ffn, and faa. fna extracts all GenBank sequence records. ffn extracts only CDS records. fna extracts CDS and translates it into amino acid sequences. This Python script will be used in command line/shell/terminal. Use the following prompt to convert .gbk file to .fasta:
$ python gbk2fasta.py input.gbk fna

Leave a Reply