red and black corn

Variant Calling for Microbial Genomes Using Short- and Long-Reads

Variant calling is a fundamental analysis in microbial genomics. It has a range of applications: diagnosing clinical pathogenic strains during an infection, understanding an epidemic, genotyping, and, more generally, studying evolution.

There are so many SNP analysis pipelines. Then why am I writing this tutorial—and what new approach am I adding?

Well, the pipeline in this tutorial addresses two problems. The traditional way to call SNPs is to first establish a good reference genome. Then, aligning the reads from other genomes against this reference is the starting point. Finally, the process compares each reference position with the base-calls from the aligned reads. If most reads have a different base at that position compared to the reference, and the base-call quality is high, then it’s called an SNP.

Although this works well for human or plant genomes, it’s not a good option for bacteria. Bacteria can have a large pan-genome, and there can be huge gene presence-absence variation from one strain to another. Therefore, using a single reference almost always cannot capture variations in the accessory genes or in genomic regions where a lot of structural variation is occurring.

Enter Graphtyper.

Compared to a single reference-based pipeline, Graphtyper uses a pan-genome graph-based algorithm. Therefore, it is more sensitive in finding the majority of SNP variations, which are not limited to a single reference.

Another problem concerns long-read sequences. Nowadays, it is common to use hybrid sequencing of both short and long reads. Since the base calling accuracy of long reads is quite good now, it’s appealing to use both long and short reads for SNP calling. Variant calling from hybrid data is an active research area, and the Weisberg lab (where I am currently a postdoc) has developed a pipeline to do just that. Recently, in a preprint, we compared several SNP calling tools using short reads and long reads from the same set of bacteria, where the long reads were split into 150 bp and 400 bpfragments. We tested several SNP calling tools, and it turns out Graphtyper works really well with this hybrid data approach. You can check the preprint here: A comparison of short- and long-read whole genome sequencing for microbial pathogen epidemiology

In this tutorial I will show you how to use Graphtyper using this hybrid approach, step by step.

Here’s an overview of the tools used in this tutorial and the major steps:

Dataset Description

I am going to use the following publicly available Allorhizobium vitis raw read data from the NCBI SRA database. This dataset includes both long and short reads. To programmatically download the dataset, you can use Kingfisher by following this tutorial.: Programmatically Downloading Raw Data from NCBI

StrainSRATechnology
Agrobacterium vitis AB4SRR10586500Illumina MiSeq
Agrobacterium vitis CG118SRR10572927
Illumina MiSeq
Agrobacterium vitis CG1SRR11533874Illumina MiSeq
Agrobacterium vitis CG49SRR10572947Illumina MiSeq
Agrobacterium vitis Ni1SRR10586498Illumina MiSeq
Agrobacterium vitis P86/93SRR11174105Illumina HiSeq
Agrobacterium vitis Rr4SRR10586495Illumina MiSeq
Agrobacterium vitis CG678 SRR12020691Nanopore MinION
Agrobacterium vitis CG412SRR12020689Nanopore MinION
Allorhizobium vitis K309SRR7539305Nanopore MinION
Allorhizbium vitis CG957SRR28041148
Nanopore MinION
Agrobacterium vitis F2/5SRR12020695Nanopore MinION

Project Directory Structure

Let’s organize the project directory in following structure. We will keep raw reads in the reads directory, and further organize them in illumina and nanopore subdirectories. In this pipeline, we’ll further process the Nanopore long-reads by splitting in 400bp segments.

Another directory is align, where we will keep the bam files generated by aligning reads from each strains onto reference.

SNP_call
       |- reads
       |      |-illumina
       |      |-nanopore
       |               |-split_reads
       |- align

Splitting Nanopore Reads

Graphtyper originally developed to handle Illumina short-read sequences, not long reads. What our study (Schiffer et al., preprint, 2025) found is, if you split the long-reads in 400bp long fragments, variants called by Graphtyper is very similar to variants called from Illumina data only. For this, we can use a tool called seqkit, which has a program named sliding. Here’s the script:

#!/bin/bash
for infile in `ls -1 *.fastq*`; do
        strain=`echo -e $infile | sed 's/.fastq//g' | sed 's/.gz//g'`
        cat $infile | seqkit sliding -W 400 -s 400 -o ${strain}.400.fastq
done

sliding essentially runs a sliding window through the sequence where both window size and step length is 400. After it is done, make sure to move the split fastq files in the split_reads subdirectory.

Selecting Reference

One of the critical step of this kind of analysis is to select a good reference. Usually Agrobacterium vitis contains one circular chromosome, one chromid and can have plasmids. I’ll be using the fasta file of Chromosome 1 from strain CG71, because it is a complete assembly. Download it as a fasta file, name it REF.fna, and keep it in the SNP_call directory.

Indexing Reference

Indexing is like creating an address for different locations in the genome file so that it can be easily navigated during different analysis. We will create index for three different analysis. Let’s run the following commands:

#!/bin/bash
java -jar picard.jar CreateSequenceDictionary R= REF.fna O= REF.dict

samtools faidx REF.fna

bwa index REF.fna

Aligning reads onto reference

Although we are going to use Graphtyper for calling SNPs, before that we need to align the raw reads from each sample onto the reference, and need to preprocess the alignment file. Since we have both Illumina and Nanopore, we will use bwa for aligning Illumina reads and minimap for aligning 400bp split Nanopore reads.

#!/bin/bash
file1=$1
file2=$2
dataset=$3

id=$dataset
#align bwa mem
bwa mem -M -t 8 ./REF.fna $file1 $file2 | \


#process alignments to sort sam files and mark duplicate reads. 
java -jar picard.jar AddOrReplaceReadGroups I=/dev/stdin O=/dev/stdout RGID= $id RGLB= $id RGPL= Illumina RGPU= $id RGSM=$dataset TMP_DIR=tmp/$dataset | \
java -jar picard.jar SortSam I=/dev/stdin O=${dataset}.rg.bam SORT_ORDER=coordinate CREATE_INDEX=true TMP_DIR=tmp/$dataset

java -jar picard.jar MarkDuplicates I=${dataset}.rg.bam O=${dataset}.marked_duplicates.bam ASSUME_SORT_ORDER=coordinate M=${dataset}_marked_duplicates_metrics.txt TMP_DIR=tmp/$dataset
rm ${dataset}.rg.bam
rm ${dataset}.rg.bai

java -jar picard.jar SortSam I=${dataset}.marked_duplicates.bam O=${dataset}.marked_duplicates.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true TMP_DIR=tmp/$dataset
rm ${dataset}.marked_duplicates.bam


rm -rf tmp/$dataset

Run the script by following:

./run_bwa-gatk.sh illumina_R1.fastq.gz illumina_R2.fastq.gz strain_id

Basically this script takes in the reads, align on REF.fna, then use picard to sort the alignment files, mark duplicates, and group reads. We need to do the same for Nanopore reads too:

#!/bin/bash

file1=$1
dataset=$2
id=$dataset

#align with bwa mem
minimap2 -ax map-ont REF.fna $file1 | \

#add read groups
java -jar picard.jar AddOrReplaceReadGroups I=/dev/stdin O=/dev/stdout RGID= $id RGLB= $id RGPL= Nanopore RGPU= $id RGSM=$dataset TMP_DIR=tmp/$dataset | \
java -jar picard.jar SortSam I=/dev/stdin O=${dataset}.rg.bam SORT_ORDER=coordinate CREATE_INDEX=true TMP_DIR=tmp/$dataset

#mark duplicate reads and sort
java -jar picard.jar MarkDuplicates I=${dataset}.rg.bam O=${dataset}.marked_duplicates.bam ASSUME_SORT_ORDER=coordinate M=${dataset}_marked_duplicates_metrics.txt TMP_DIR=tmp/$dataset
rm ${dataset}.rg.bam
rm ${dataset}.rg.bai

java -jar /nfs7/BPP/Weisberg_Lab/software/software/picard.jar SortSam I=${dataset}.marked_duplicates.bam O=${dataset}.marked_duplicates.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true TMP_DIR=tmp/$dataset

rm ${dataset}.marked_duplicates.bam
rm -rf tmp/$dataset

Here we use minimap for alignment. Run it by the following command:

./un_minimap-gatk_bac400.sh nanopore_reads.400bp.fastq.gz strain_id

This two scripts will generate bunch of bam files. Move them into the align subdirectory.

Run Graphtyper

We’ll create a list of all these bam files generated in previous step:

ls -1 *bam > bamlist

Now, run Graphtyper:

for contig in `grep '>' REF.fna | sed 's/>//g' | sed 's/ .*//g'`; do
        graphtyper genotype --threads 16 --verbose --sams=bamlist --region=$contig REF.fna
done

Combine VCFs

Graphtyper creates a subdirectory named results, where for each contig in the REF.fna, it creates a subdirectory and a set of variant calling files (VCFs). We want to concatenate these vcf files together, zip them, and create index. For that, we can use the following script:

graphtyper vcf_concatenate results/*/*.vcf.gz | bgzip -c > species_calls_merged.vcf.gz

tabix -p vcf species_calls_merged.vcf.gz

Filtering

We have the combined species_calls_merged.vcf.gz file now. Since graphtyper was created for diploid organism, it may mistakenly call heterozygous call, which are just errors. So we need to do some filtering:

#!/bin/bash
#filter heterozygous calls because bacteria are haploid, these are just errors
java -jar GenomeAnalysisTK.jar -T VariantFiltration  -G_filter "isHet == 1" -G_filterName "isHetFilter" --setFilteredGtToNocall -o species_calls_merged.vcf.nonpassnonhetfilt.vcf --variant species_calls_merged.vcf.gz -R ./REF.fna

vcffilter -f "ABHet < 0.0 | ABHet > 0.33" -f "ABHom < 0.0 | ABHom > 0.97" -f "MaxAASR > 0.4" -f "MQ > 30" species_calls_merged.vcf.nonpassnonhetfilt.vcf > species_calls_merged.vcf.nonpassnonhetfilt.vcf.vcffilter.vcf

Final outputs

So we have the final output! But we still may need fasta file contianing the SNP only sites. Here’s the way to do it:

#!/bin/bash

vcf-to-tab species_calls_merged.vcf.nonpassnonhetfilt.vcf.vcffilter.vcf > species.tab
sed -i 's_/\S\+_/_g'  species.tab
perl vcftab_to_snpaln_nodel.pl --output_ref -i  species.tab >  species.tab.fasta

Rscript fasta_to_disttable.r species.tab.fasta

You can find the fasta_to_disttable.r and vcftab_to_snpaln_nodel.pl here.


Discover more from Arafat Rahman

Subscribe to get the latest posts sent to your email.


Comments

Leave a Reply

Learn Python for Bioinformatics

I have created a set of worksheets that give a quick overview of Python for Bioinformatics (as well as intro to UNIX). You just have to give it 3-6 hours, and you will know the essentials!

Join 86 other subscribers