This is Part 3 of tutorial series: NGS Workflow for Genome Assembly to Annotation for Hybrid Bacterial Data
We’ll be use hybrid sequencing data (Illumina and Nanopore). This tutorial has five parts.
- Part 1: Downloading and preparing data
- Part 2: Assembly with short-reads.
- Part 3: Assembly with long-reads.
- Part 4: Hybrid assembly (long- and short-reads).
- Part 5: Bacterial genome annotation.
Disclaimer: This post is a work in progress. This is genome assembly and annotation workflow that I use for microbial genomics. Previously, I used this template to teach different class in OSU as well as in other training facilities.
For this tutorial, you need to have a bacterial Nanopore fastq file. If you have your own, then great! Otherwise, you can download publicly available bacterial genomic-reads from here.
We will be using Flye
as assembler here.
Installation
conda install bioconda::flye
conda install bioconda::nanoplot
conda install bioconda::dnaapler
Consolidating FastQ files
Usually, when you get raw sequence data from your sequencer, they are barcoded. For example, a Nanopore SQK-RBK114.24 kit will allow you to ru n 24 distinct samples. Good news is, the reads are already demultiplexed, meaning the Nanopore MinKnow software already categorized reads in separate folders where each folder represents a barcode.
Sometime there are multiple fastq files in each barcode folder. Therefore, you need to onsolidate multiple basecalled FASTQ files for each unique barcode into a single file. This is necessary because Nanopore sequencing often generates multiple small FASTQ files for each sample (barcode), and combining them simplifies downstream analysis.
#!/bin/bash
for folder in `ls -d barcode*`; do
newfile="${folder}.fastq.gz"
cat $folder/* > $newfile
echo "$folder $newfile"
done
Checking taxonomy
We can quickly generate k-mer fingerprints of the fastq files and compare with reference genomes.
sendsketch.sh -da in=strain1.fastq.gz out=./sketches/strain1.sketch
You can automate sendsketch.sh
for multiple samples, and summarize the top result using unix script I shared in the previous tutorial.
Quality of the Nanopore Reads
Now, we need to assess the quality of the Nanopore sequencing reads. Quality control is essential to ensure that the data is suitable for downstream analysis such as genome assembly and annotation. But we can’t use fastqc
! Instead, we will use NanoPlot
.
The script uses NanoPlot to generate quality control plots for the FASTQ files. NanoPlot provides detailed statistics and visualizations of read length, read quality, and other important metrics.
NanoPlot -t 2 -o dataset/strain1 -p strain1.hex --loglength -f png --plots hex --title strain1 --fastq strain1.fastq.gz
We can also summarize the quality metrics generated by NanoPlot into a concise tabular format. This step provides a quick overview of the sequencing quality for each sample.
Let’s say you have generated NanoPlot output for all barcoded samples, and the outputs are in the dataset
directory. The following script extracts key metrics from the NanoPlot output files, such as the number of reads, total bases, mean read length, median read length, mean quality, N50, and longest read.
#!/bin/bash
echo -e "dataset number of reads total bases mean length median length mean quality read N50 longest read"
for infile in `ls -1 ../raw/*gz`; do
dataset=`echo -e "$infile" | sed 's/.fastq.gz//g;s/^.*\///g'`
n50=`grep "N50" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
meanlength=`grep "Mean read length:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
meanquality=`grep "Mean read quality:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
medianlength=`grep "Median read length:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
numreads=`grep "Number of reads:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
totalbases=`grep "Total bases:" ./$dataset/${dataset}.hexNanoStats.txt | sed 's/^.*\s//g'`
longestread=`grep -A 1 "Top 5 longest reads" ./$dataset/${dataset}.hexNanoStats.txt | tail -n 1 | sed 's/^1:\s//g'`
echo -e "$dataset $numreads $totalbases $meanlength $medianlength $meanquality $n50 $longestread"
Done
Assembly and Rotation
We will use Flye
for assembly. There are other alternatives though (like Canu), but Flye
works pretty well for de novo bacterial genome assembly. But you should test multiple tools for your project, and decide what gives best results for you!
We will assemble genome from the sequencing reads first, and then rotate circular contigs to start at a specific gene (e.g., dnaA/repA). Genome assembly is the process of reconstructing the genome from sequencing reads, and rotating circular contigs ensures that the assembly starts at a biologically meaningful position.
The script first runs the Flye assembler to perform genome assembly with multiple polishing iterations.
flye --iterations 5 -t $threads -o flye/flyehq_strain1 --scaffold --read-error 0.03 --nano-hq strain1.fastq.gz
Notice, we are using --nano-hq
for defining the input. Here hq means that these are high-quality reads (basecalled with SUP model).
Now let’s use dnaapler tool to rotate circular contigs. This ensures that the assembled genome starts at the correct gene, so that the genome is consistent across all strains we are dealing with.
#find non-circular contigs and linear chromosomes and put in ignore list
#rotate the rest
echo "Make ignore list"
touch flyehq_${dataset}/rotate_ignore_list
awk '$4 == "N"' flyehq_strain1/assembly_info.txt | cut -f 1 > flyehq_strain1/rotate_ignore_list
echo "Ignore list created"
dnaapler all --ignore flyehq_strain1/rotate_ignore_list --input flyehq_strain1/assembly.fasta --output flyehq_strain1/rotated --prefix strain1 --threads 8 -f
Notice, we are only rotating the contigs which are circular.
Assembly Statistics
Flye
generates a flye.log
file, which contains the assembly statistics at the end. But you can use grep
:
assemblysize=`grep 'Total length:' flye/strain1/flye.log | sed 's/\tTotal length:\t//g'`
coverage=`grep 'Mean coverage:' flye/strain1/flye.log | sed 's/\tMean coverage:\t//g'`
fragments=`grep 'Fragments:' flye/strain1/flye.log | sed 's/\tFragments:\t//g'`
N50=`grep 'Fragments N50:' flye/strain1/flye.log | sed 's/\tFragments N50:\t//g'`
echo "$strain $assemblysize $coverage $fragments $N50"
Visualize the Contigs
Flye
generates assembly_graph.fastg
that can be visualized as graph. Load the assembly into Bandage to visualize the graph structure of the assembled contigs. It can show you the completeness of the assembly:
bandage image assembly_graph.fastg output.png


Now let’s see how to do hybrid assembly!
Leave a Reply