This is Part 4 of the tutorial series: NGS Workflow for Genome Assembly to Annotation for Hybrid Bacterial Data
We’ll be using hybrid sequencing data (Illumina and Nanopore). This tutorial has five parts.
- Part 1: Downloading and preparing data
- Part 2: Assembly with short-reads.
- Part 3: Assembly with long-reads.
- Part 4: Hybrid assembly (long- and short-reads).
- Part 5: Bacterial genome annotation.
Disclaimer: This post is a work in progress. This is the genome assembly and annotation workflow that I use for microbial genomics. Previously, I used this template to teach different classes at OSU as well as at other training facilities.
In hybrid assembly, we use both short- and long-read data. This is kind of the best of both worlds! Short reads have very low error rates and usually have good depth. But they fall short when the genome contains repeated regions that are longer than a short-read assembly can resolve. Long-read data, on the other hand, is usually more error-prone, but since long reads can span those long regions, they can be used to “scaffold” the contigs from a short-read assembly into a complete or near-complete bacterial genome assembly.
Although long-read technologies can nowadays provide higher depth, and it’s possible to get a really good assembly from long-read data alone, it’s better to know about all of these approaches (and test them on your data).
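Which route makes sense depends a lot on how much long-read depth you have, so it helps to check that first. Here is a minimal sketch, assuming 4-line FASTQ records and a genome size (in bp) that you estimate yourself. The function name estimate_depth is made up for this example; it simply divides the total number of sequenced bases by the genome size.

```shell
# Rough depth estimate: total bases in a FASTQ file / expected genome size.
# Assumptions: 4-line FASTQ records; genome_size is your own estimate in bp.
estimate_depth() {
    local reads="$1" genome_size="$2"
    local reader="cat"
    case "$reads" in
        *.gz) reader="gzip -cd" ;;   # decompress gzipped reads on the fly
    esac
    # Sequence is the 2nd line of every 4-line record.
    $reader "$reads" | awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0) }
        END { printf "%.1f\n", bases / g }
    '
}
```

For example, estimate_depth long_reads.fastq.gz 5000000 prints the approximate depth for a 5 Mbp genome.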
Unicycler
Unicycler is a short-read-first hybrid assembly tool. It should only be used for hybrid assembly when long-read-first is not an option, i.e. when long-read depth is low. It can also work as an Illumina-only assembler, where it runs as a SPAdes optimiser. According to the author:
(I think) Unicycler is good for short-read-only bacterial genomes, as it produces cleaner assembly graphs than SPAdes alone.
Running Unicycler is as simple as the following:
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads.fastq.gz -o output_dir
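Once Unicycler finishes, the assembly is written to assembly.fasta inside the output directory. A quick way to see what you got is to count contigs and bases. The sketch below is only an illustration: the function name summarize_assembly is made up, and it assumes the FASTA headers carry Unicycler’s circular=true tag for contigs that were circularised.

```shell
# Summarise an assembly FASTA: contig count, total length, and the number of
# contigs whose header contains circular=true (assumed Unicycler-style headers).
summarize_assembly() {
    local fasta="$1"
    awk '
        /^>/ {
            contigs++
            if ($0 ~ /circular=true/) circular++
            next
        }
        { bases += length($0) }   # sequence lines
        END {
            printf "contigs=%d total_bp=%d circular=%d\n", contigs, bases, circular
        }
    ' "$fasta"
}
```

Usage: summarize_assembly output_dir/assembly.fasta. For a bacterial genome, a handful of contigs that are mostly circular is usually a good sign.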
But that’s just one approach. There is another way to achieve the same thing: you can create a long-read-based assembly first. Since long reads have a higher error rate, you can then map/align short-read data onto your long-read assembly to “polish” it.

Polypolish
Long-read technologies have recently become really efficient: it’s possible to get high depth (~100X) and very low error rates. High-depth and high-accuracy long reads make long-read-first hybrid assembly (long-read assembly followed by short-read polishing) a viable approach that’s often preferable to Unicycler.
There are many tools that do this. Here, let’s introduce Polypolish, which lets each short read match to all the places it fits in the genome, not just the best one. This is important for fixing errors in repeated DNA regions, because it ensures those repeats also get covered by short reads. As a result, Polypolish can correct mistakes that other tools might miss.
How to use Polypolish?
Let’s say you have your long-read assembly in a file named assembly.fasta. You need to index it first and then map the short reads onto it. For that, we can use the Burrows-Wheeler Aligner, or bwa (which is the preferred aligner for short reads).
bwa index assembly.fasta
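After indexing, we are going to map the short reads onto assembly.fasta. Each read file is aligned separately, and the -a option tells bwa to output every alignment it finds for a read rather than only the best one, which is what Polypolish relies on. This will create alignment files in *.sam format.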
bwa mem -t 16 -a assembly.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a assembly.fasta reads_2.fastq.gz > alignments_2.sam
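If you want to confirm that the alignments really contain more than one placement per read, you can tally the SAM records by their FLAG field. This is just a sketch under the assumption that the extra alignments are flagged as secondary (FLAG bit 256) in the SAM output; count_alignment_types is a made-up helper name, and the bit tests use plain integer arithmetic so they work in any awk.

```shell
# Tally SAM records by type using the FLAG column (column 2).
# Bits used: 4 = unmapped, 256 = secondary, 2048 = supplementary.
count_alignment_types() {
    local sam="$1"
    awk -F'\t' '
        /^@/ { next }   # skip SAM header lines
        {
            flag = $2
            if (int(flag / 4) % 2 == 1)          unmapped++
            else if (int(flag / 256) % 2 == 1)   secondary++
            else if (int(flag / 2048) % 2 == 1)  supplementary++
            else                                 primary++
        }
        END {
            printf "primary=%d secondary=%d supplementary=%d unmapped=%d\n",
                   primary, secondary, supplementary, unmapped
        }
    ' "$sam"
}
```

Usage: count_alignment_types alignments_1.sam. A genome with repeats should show a non-zero secondary count.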
Once the mapping is done, we can filter out alignments whose insert size does not fit the read pairs, and then polish the assembly based on the remaining short-read alignments.
polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam
polypolish polish assembly.fasta filtered_1.sam filtered_2.sam > polished_assembly.fasta
Finally, let’s clean up the workspace by removing intermediate files!
rm *.amb *.ann *.bwt *.pac *.sa *.sam
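The wildcard cleanup removes every matching file in the current directory, including SAM files from other projects. If that worries you, here is a more conservative sketch that deletes only the bwa index files belonging to one assembly plus the SAM files you name explicitly. The function name cleanup_polish_intermediates is made up, and it assumes the index files are named after the assembly with the extensions amb, ann, bwt, pac and sa.

```shell
# Remove only the intermediates of one polishing run.
# $1      : the assembly FASTA that was indexed (kept, not deleted)
# $2 ...  : SAM files to delete
cleanup_polish_intermediates() {
    local assembly="$1"
    shift
    local ext
    for ext in amb ann bwt pac sa; do
        rm -f -- "${assembly}.${ext}"   # index files sit next to the FASTA
    done
    rm -f -- "$@"                        # explicitly listed SAM files
}
```

Usage: cleanup_polish_intermediates assembly.fasta alignments_1.sam alignments_2.sam filtered_1.sam filtered_2.sam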
Autocycler
Again, sometimes you can end up with multiple assemblies of the same genome from different programs. These assemblies might be a little different from each other, and you want to get the best/most conservative assembly. So you can take all of these assemblies and combine them to create a (hopefully) better assembly.
Autocycler is a recently developed tool that does that. It is actually the successor of another great tool, Trycycler.
I’m not going to cover this tool in this tutorial. You should follow the author’s tutorial here to use Autocycler.
CheckM2
Once you are happy with your hybrid assembly, you can visualize it as a Bandage plot (check this tutorial). You can also test for completeness and contamination using CheckM2.
CheckM2 uses machine learning models that are applied regardless of taxonomic lineage to estimate how complete a genome bin is and how much contamination it has.
Here’s a tutorial on how to use CheckM2.
Quast
Another widely used genome assembly evaluation tool is Quast. It’s especially good if there is a good reference genome available for the organism you are working with. You can provide your assembly, the raw reads, a reference genome, and gene annotations for that reference.
How to use it:
quast.py test_data/contigs_1.fasta \
-r test_data/reference.fasta.gz \
-g test_data/genes.txt \
-1 test_data/reads1.fastq.gz \
-2 test_data/reads2.fastq.gz \
-o quast_test_output
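Quast writes its reports into the output directory, including a tab-separated summary. To pull a single number out of it on the command line, something like the following sketch should work. It assumes the summary file is report.tsv with the metric name in the first column and the value in the second; quast_metric is a made-up helper name, so check the file layout of your Quast version before relying on it.

```shell
# Print the value of one metric from a tab-separated report
# (assumed layout: metric name in column 1, value in column 2).
quast_metric() {
    local report="$1" metric="$2"
    awk -F'\t' -v m="$metric" '$1 == m { print $2 }' "$report"
}
```

Usage: quast_metric quast_test_output/report.tsv "N50"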
Now let’s check how to annotate genomes.