colorful fish lures and a wobbler

Microbial De Novo Genome Assembly from Long-Read Data

This is Part 3 of tutorial series: NGS Workflow for Genome Assembly to Annotation for Hybrid Bacterial Data 

We’ll be use hybrid sequencing data (Illumina and Nanopore). This tutorial has five parts. 

Disclaimer: This post is a work in progress. This is genome assembly and annotation workflow that I use for microbial genomics. Previously, I used this template to teach different class in OSU as well as in other training facilities.


For this tutorial, you need to have a bacterial Nanopore fastq file. If you have your own, then great! Otherwise, you can download publicly available bacterial genomic-reads from here.

We will be using Flye as assembler here.

Installation

conda install bioconda::flye
conda install bioconda::nanoplot
conda install bioconda::dnaapler

Consolidating FastQ files

Usually, when you get raw sequence data from your sequencer, they are barcoded. For example, a Nanopore SQK-RBK114.24 kit will allow you to ru n 24 distinct samples. Good news is, the reads are already demultiplexed, meaning the Nanopore MinKnow software already categorized reads in separate folders where each folder represents a barcode.

Sometime there are multiple fastq files in each barcode folder. Therefore, you need to onsolidate multiple basecalled FASTQ files for each unique barcode into a single file. This is necessary because Nanopore sequencing often generates multiple small FASTQ files for each sample (barcode), and combining them simplifies downstream analysis. 

#!/bin/bash
for folder in `ls -d barcode*`; do
        newfile="${folder}.fastq.gz"
        cat $folder/* > $newfile
        echo "$folder $newfile"
done

Checking taxonomy

We can quickly generate k-mer fingerprints of the fastq files and compare with reference genomes.

sendsketch.sh -da in=strain1.fastq.gz out=./sketches/strain1.sketch

You can automate sendsketch.sh for multiple samples, and summarize the top result using unix script I shared in the previous tutorial.

Quality of the Nanopore Reads

Now, we need to assess the quality of the Nanopore sequencing reads. Quality control is essential to ensure that the data is suitable for downstream analysis such as genome assembly and annotation.  But we can’t use fastqc! Instead, we will use NanoPlot.

The script uses NanoPlot to generate quality control plots for the FASTQ files. NanoPlot provides detailed statistics and visualizations of read length, read quality, and other important metrics.

NanoPlot -t 2 -o dataset/strain1 -p strain1.hex --loglength -f png --plots hex --title strain1 --fastq strain1.fastq.gz

We can also summarize the quality metrics generated by NanoPlot into a concise tabular format. This step provides a quick overview of the sequencing quality for each sample.

Let’s say you have generated NanoPlot output for all barcoded samples, and the outputs are in the dataset directory. The following script extracts key metrics from the NanoPlot output files, such as the number of reads, total bases, mean read length, median read length, mean quality, N50, and longest read.

#!/bin/bash
echo -e "dataset        number of reads total bases     mean length     median length   mean quality read N50   longest read"
for infile in `ls -1 ../raw/*gz`; do
     	   dataset=`echo -e "$infile" | sed 's/.fastq.gz//g;s/^.*\///g'`
     	   n50=`grep "N50" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
     	   meanlength=`grep "Mean read length:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
     	   meanquality=`grep "Mean read quality:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
     	   medianlength=`grep "Median read length:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
     	   numreads=`grep "Number of reads:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
     	   totalbases=`grep "Total bases:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
     	   longestread=`grep -A 1 "Top 5 longest reads" ./$dataset/${dataset}.hexNanoStats.txt | tail -n 1 | sed 's/^1:\s//g'`
     	   echo -e "$dataset       $numreads       $totalbases     $meanlength     $medianlength   $meanquality    $n50    $longestread"
Done

Assembly and Rotation

We will use Flye for assembly. There are other alternatives though (like Canu), but Flye works pretty well for de novo bacterial genome assembly. But you should test multiple tools for your project, and decide what gives best results for you!

We will assemble genome from the sequencing reads first, and then rotate circular contigs to start at a specific gene (e.g., dnaA/repA). Genome assembly is the process of reconstructing the genome from sequencing reads, and rotating circular contigs ensures that the assembly starts at a biologically meaningful position.

The script first runs the Flye assembler to perform genome assembly with multiple polishing iterations.

flye --iterations 5 -t $threads -o flye/flyehq_strain1 --scaffold --read-error 0.03 --nano-hq strain1.fastq.gz

Notice, we are using --nano-hq for defining the input. Here hq means that these are high-quality reads (basecalled with SUP model).

Now let’s use dnaapler tool to rotate circular contigs. This ensures that the assembled genome starts at the correct gene, so that the genome is consistent across all strains we are dealing with.

#find non-circular contigs and linear chromosomes and put in ignore list
#rotate the rest

echo "Make ignore list" 
touch flyehq_${dataset}/rotate_ignore_list

awk '$4 == "N"' flyehq_strain1/assembly_info.txt | cut -f 1 > flyehq_strain1/rotate_ignore_list

echo "Ignore list created" 

dnaapler all --ignore flyehq_strain1/rotate_ignore_list --input flyehq_strain1/assembly.fasta --output flyehq_strain1/rotated --prefix strain1 --threads 8 -f

Notice, we are only rotating the contigs which are circular.

Assembly Statistics

Flye generates a flye.log file, which contains the assembly statistics at the end. But you can use grep:

assemblysize=`grep 'Total length:' flye/strain1/flye.log | sed 's/\tTotal length:\t//g'`
coverage=`grep 'Mean coverage:' flye/strain1/flye.log | sed 's/\tMean coverage:\t//g'`
fragments=`grep 'Fragments:' flye/strain1/flye.log | sed 's/\tFragments:\t//g'`
N50=`grep 'Fragments N50:' flye/strain1/flye.log | sed 's/\tFragments N50:\t//g'`

echo "$strain   $assemblysize   $coverage       $fragments      $N50"

Visualize the Contigs

Flye generates assembly_graph.fastg that can be visualized as graph. Load the assembly into Bandage to visualize the graph structure of the assembled contigs. It can show you the completeness of the assembly:

bandage image assembly_graph.fastg output.png
Move the slider horizontally to compare short-read vs long-read assembly graph of Shigella sp. AUSMDU00026329.

Now let’s see how to do hybrid assembly!


Comments

Leave a Reply

Learn Python for Bioinformatics

I have created a set of worksheets that give a quick overview of Python for Bioinformatics (as well as intro to UNIX). You just have to give it 3-6 hours, and you will know the essentials!

Join 78 other subscribers

Discover more from Arafat Rahman

Subscribe monthly newsletter, and download free worksheet on Python for Bioinformatics.

Continue reading