
Hybrid Assembly of Bacterial Genome

This is Part 4 of the tutorial series: NGS Workflow for Genome Assembly to Annotation for Hybrid Bacterial Data

We’ll be using hybrid sequencing data (Illumina and Nanopore). This tutorial has five parts.

Disclaimer: This post is a work in progress. This is the genome assembly and annotation workflow that I use for microbial genomics. Previously, I used this template to teach different classes at OSU as well as at other training facilities.


In hybrid assembly, we use both short- and long-read data. This is kind of the best of both worlds! Short reads have very low error rates and usually have good depth, but they fall short when the genome contains repeated regions longer than a short-read assembly can resolve. Long reads, on the other hand, are usually more error-prone, but because they can span long regions, they can be used to “scaffold” the contigs from a short-read assembly into a complete or near-complete bacterial genome assembly.

Although long-read technologies can nowadays provide higher depth, and it’s possible to get a really good assembly from long-read data alone, it’s better to know about all of these approaches (and test them on your data).
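A quick way to see which situation your data are in is to estimate long-read depth as total sequenced bases divided by the expected genome size. The sketch below is only an illustration: the function name estimate_depth is made up here, the genome size is a number you supply yourself, and it assumes a standard FASTQ with four lines per record.

```shell
#!/usr/bin/env bash
# Rough depth estimate: total bases in a FASTQ divided by the expected genome size.
# estimate_depth is a hypothetical helper name, not part of any tool.
estimate_depth() {
    local fastq="$1"        # FASTQ file, gzipped (*.gz) or plain
    local genome_size="$2"  # expected genome size in bp (your own assumption)
    {
        case "$fastq" in
            *.gz) gzip -dc "$fastq" ;;
            *)    cat "$fastq" ;;
        esac
    } | awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0) }   # sequence is line 2 of each 4-line record
        END         { printf "%.1f\n", bases / g }'
}
```

For example, estimate_depth long_reads.fastq.gz 5000000 would print the approximate depth for a genome assumed to be about 5 Mb.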

Unicycler

Unicycler is a short-read-first hybrid assembly tool. It should only be used for hybrid assembly when long-read-first is not an option, i.e. when long-read depth is low. It can also work as an Illumina-only assembler, where it runs as a SPAdes optimiser. According to the author:

(I think) Unicycler is good for short-read-only bacterial genomes, as it produces cleaner assembly graphs than SPAdes alone.

Running Unicycler is as simple as the following:

unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads.fastq.gz -o output_dir

But that’s just one approach. There is another way to achieve the same thing: you can create a long-read-based assembly first. Because long reads have a higher error rate, you can then map/align short-read data onto your long-read assembly to “polish” it.

Unicycler assembly graph of Shigella sp. AUSMDU00026329
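Besides the graph, Unicycler writes the assembled sequences to assembly.fasta inside the output directory. A small sketch for a first look at that file follows. It assumes Unicycler's usual header convention, in which circular contigs carry circular=true; the function name summarise_assembly is hypothetical.

```shell
#!/usr/bin/env bash
# Count contigs in a FASTA file and how many are flagged as circular.
# Assumes headers such as ">1 length=... depth=... circular=true".
summarise_assembly() {
    local fasta="$1"
    awk '
        /^>/ { contigs++; if ($0 ~ /circular=true/) circular++ }
        END  { printf "contigs=%d circular=%d\n", contigs, circular }' "$fasta"
}
```

Running summarise_assembly output_dir/assembly.fasta gives a one-line summary; a complete bacterial genome would typically show a handful of contigs, all circular.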

Polypolish

Recently, long-read technologies have become really efficient: it’s possible to get high depth (~100X) and a very low error rate. High-depth and high-accuracy long reads make long-read-first hybrid assembly (long-read assembly followed by short-read polishing) a viable approach that’s often preferable to Unicycler.

There are many tools that do this. Here, let’s introduce Polypolish, which lets each short read match to all the places it fits in the genome, not just the best one. This is important for fixing errors in repeated DNA regions, because it ensures those repeats also get covered by short reads. As a result, Polypolish can correct mistakes that other tools might miss.

How to use Polypolish?

Let’s say you have a long-read assembly file named assembly.fasta. You need to index it first and then map the short reads onto it. For that, we can use the Burrows-Wheeler Aligner, or bwa (a preferred aligner for short reads).

bwa index assembly.fasta

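After indexing, we map the short reads onto assembly.fasta. Each file of the read pair is aligned separately, and the -a option makes bwa report all alignments of a read rather than only the best one. This creates alignment files in *.sam format.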


bwa mem -t 16 -a assembly.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a assembly.fasta reads_2.fastq.gz > alignments_2.sam

Once the mapping is done, we can filter the alignments (Polypolish uses the paired-end insert size to discard implausible ones), and then polish the assembly with the filtered short-read alignments.

polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam
polypolish polish assembly.fasta filtered_1.sam filtered_2.sam > polished_assembly.fasta
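The steps above can be gathered into one reusable function. This is a minimal sketch rather than an official wrapper: the function names require_tools and polish_with_short_reads are made up for illustration, the default of 4 threads is arbitrary, and the intermediate file names simply follow the commands shown above.

```shell
#!/usr/bin/env bash
# Return non-zero (and report on stderr) if any named tool is not on PATH.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Index, map, filter and polish in one go. The body runs in a subshell
# (note the parentheses), so "set -euo pipefail" does not leak to the caller.
polish_with_short_reads() (
    set -euo pipefail
    draft="$1"; r1="$2"; r2="$3"; out="$4"; threads="${5:-4}"
    require_tools bwa polypolish
    bwa index "$draft"
    bwa mem -t "$threads" -a "$draft" "$r1" > alignments_1.sam
    bwa mem -t "$threads" -a "$draft" "$r2" > alignments_2.sam
    polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam \
        --out1 filtered_1.sam --out2 filtered_2.sam
    polypolish polish "$draft" filtered_1.sam filtered_2.sam > "$out"
)
```

A call would look like polish_with_short_reads assembly.fasta reads_1.fastq.gz reads_2.fastq.gz polished_assembly.fasta 16. Because of set -e, the function stops at the first failing step instead of polishing with incomplete alignments.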

Finally, let’s clean up the workspace by removing intermediate files!

rm *.amb *.ann *.bwt *.pac *.sa *.sam
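The wildcard version also deletes any other *.sam or index files that happen to be in the directory. A more conservative sketch, assuming the file names used in the commands above (the function name cleanup_intermediates is hypothetical), removes only the files this workflow created:

```shell
#!/usr/bin/env bash
# Remove the bwa index files for one draft assembly plus the SAM files
# produced by the mapping and filtering steps, and nothing else.
cleanup_intermediates() {
    local draft="$1" ext
    for ext in amb ann bwt pac sa; do
        rm -f "${draft}.${ext}"          # bwa index writes <draft>.<ext>
    done
    rm -f alignments_1.sam alignments_2.sam filtered_1.sam filtered_2.sam
}
```

Here you would run cleanup_intermediates assembly.fasta from the working directory.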

Autocycler

Sometimes you end up with multiple assemblies of the same genome from different programs. These assemblies might differ a little from each other, and you want the best/most conservative one. You can take all of these assemblies and combine them to create a (hopefully) better assembly.

Autocycler is a recently developed tool that does that. It is the successor of another great tool, Trycycler.

I’m not going to cover this tool in this tutorial. You should follow the tutorial here from the author to use Autocycler:

CheckM2

Once you are happy with your hybrid assembly, you can visualize the assembly graph with Bandage (check this tutorial). You can also test for completeness and contamination using CheckM2.

CheckM2 uses machine learning models that are applied regardless of the organism's taxonomic lineage to estimate how complete a genome (or genome bin) is and how much contamination it has.

Here’s a tutorial on how to use CheckM2.
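As a rough sketch of what a run and a first look at the result could look like: the checkm2 predict command in the comment reflects my understanding of the CheckM2 command line, and the report is assumed to be a tab-separated quality_report.tsv with Name, Completeness and Contamination columns. Check both against the CheckM2 documentation; the function name summarise_checkm2 is made up here.

```shell
#!/usr/bin/env bash
# Assumed invocation (verify against the CheckM2 docs before use):
#   checkm2 predict --threads 8 --input polished_assembly.fasta --output-directory checkm2_out

# Print name, completeness and contamination from a CheckM2-style TSV report.
# Columns are located by header name, so their order does not matter.
summarise_checkm2() {
    local report="$1"
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        { printf "%s\tcompleteness=%s\tcontamination=%s\n",
                 $(col["Name"]), $(col["Completeness"]), $(col["Contamination"]) }' "$report"
}
```

With the assumed output layout, this would be called as summarise_checkm2 checkm2_out/quality_report.tsv.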

Quast

Another widely used genome assembly evaluation tool is QUAST. It’s especially good if a good reference genome is available for the organism you are working with. You can provide the raw reads, the assembly, the reference genome, and the reference's gene annotation.

How to use it:

quast.py test_data/contigs_1.fasta \
        -r test_data/reference.fasta.gz \
        -g test_data/genes.txt \
        -1 test_data/reads1.fastq.gz \
        -2 test_data/reads2.fastq.gz \
        -o quast_test_output
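QUAST writes its summary to the output directory, including a tab-separated report.tsv with one metric per row. A small sketch for pulling single metrics out of that file is shown below; the function name quast_metric is hypothetical, and the metric labels are assumed to match the first column of the report exactly.

```shell
#!/usr/bin/env bash
# Print the value of one metric (first column) from a QUAST-style report.tsv.
quast_metric() {
    local report="$1" metric="$2"
    awk -F '\t' -v m="$metric" '$1 == m { print $2 }' "$report"
}
```

For instance, quast_metric quast_test_output/report.tsv N50 or quast_metric quast_test_output/report.tsv "# contigs".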

Now let’s check how to annotate genomes.

