Variant calling is a fundamental analysis in microbial genomics. It has a range of applications: diagnosing clinical pathogenic strains during an infection, understanding an epidemic, genotyping, and, more generally, studying evolution.
There are so many SNP analysis pipelines. Then why am I writing this tutorial—and what new approach am I adding?
Well, the pipeline in this tutorial addresses two problems. The traditional way to call SNPs is to first establish a good reference genome. Then, aligning the reads from other genomes against this reference is the starting point. Finally, the process compares each reference position with the base-calls from the aligned reads. If most reads have a different base at that position compared to the reference, and the base-call quality is high, then it’s called an SNP.
Although this works well for human or plant genomes, it’s not a good option for bacteria. Bacteria can have a large pan-genome, and there can be huge gene presence-absence variation from one strain to another. Therefore, using a single reference almost always cannot capture variations in the accessory genes or in genomic regions where a lot of structural variation is occurring.
Enter Graphtyper.
Compared to a single reference-based pipeline, Graphtyper uses a pan-genome graph-based algorithm. Therefore, it is more sensitive in finding the majority of SNP variations, which are not limited to a single reference.
Another problem concerns long-read sequences. Nowadays, it is common to use hybrid sequencing of both short and long reads. Since the base calling accuracy of long reads is quite good now, it’s appealing to use both long and short reads for SNP calling. Variant calling from hybrid data is an active research area, and the Weisberg lab (where I am currently a postdoc) has developed a pipeline to do just that. Recently, in a preprint, we compared several SNP calling tools using short reads and long reads from the same set of bacteria, where the long reads were split into 150 bp and 400 bpfragments. We tested several SNP calling tools, and it turns out Graphtyper works really well with this hybrid data approach. You can check the preprint here: A comparison of short- and long-read whole genome sequencing for microbial pathogen epidemiology
In this tutorial I will show you how to use Graphtyper using this hybrid approach, step by step.
Here’s an overview of the tools used in this tutorial and the major steps:

Dataset Description
I am going to use the following publicly available Allorhizobium vitis raw read data from the NCBI SRA database. This dataset includes both long and short reads. To programmatically download the dataset, you can use Kingfisher by following this tutorial.: Programmatically Downloading Raw Data from NCBI
Strain | SRA | Technology |
Agrobacterium vitis AB4 | SRR10586500 | Illumina MiSeq |
Agrobacterium vitis CG118 | SRR10572927 | Illumina MiSeq |
Agrobacterium vitis CG1 | SRR11533874 | Illumina MiSeq |
Agrobacterium vitis CG49 | SRR10572947 | Illumina MiSeq |
Agrobacterium vitis Ni1 | SRR10586498 | Illumina MiSeq |
Agrobacterium vitis P86/93 | SRR11174105 | Illumina HiSeq |
Agrobacterium vitis Rr4 | SRR10586495 | Illumina MiSeq |
Agrobacterium vitis CG678 | SRR12020691 | Nanopore MinION |
Agrobacterium vitis CG412 | SRR12020689 | Nanopore MinION |
Allorhizobium vitis K309 | SRR7539305 | Nanopore MinION |
Allorhizbium vitis CG957 | SRR28041148 | Nanopore MinION |
Agrobacterium vitis F2/5 | SRR12020695 | Nanopore MinION |
Project Directory Structure
Let’s organize the project directory in following structure. We will keep raw reads in the reads
directory, and further organize them in illumina
and nanopore
subdirectories. In this pipeline, we’ll further process the Nanopore long-reads by splitting in 400bp segments.
Another directory is align
, where we will keep the bam
files generated by aligning reads from each strains onto reference.
SNP_call
|- reads
| |-illumina
| |-nanopore
| |-split_reads
|- align
Splitting Nanopore Reads
Graphtyper originally developed to handle Illumina short-read sequences, not long reads. What our study (Schiffer et al., preprint, 2025) found is, if you split the long-reads in 400bp long fragments, variants called by Graphtyper is very similar to variants called from Illumina data only. For this, we can use a tool called seqkit
, which has a program named sliding
. Here’s the script:
#!/bin/bash
for infile in `ls -1 *.fastq*`; do
strain=`echo -e $infile | sed 's/.fastq//g' | sed 's/.gz//g'`
cat $infile | seqkit sliding -W 400 -s 400 -o ${strain}.400.fastq
done
sliding
essentially runs a sliding window through the sequence where both window size and step length is 400. After it is done, make sure to move the split fastq files in the split_reads
subdirectory.
Selecting Reference
One of the critical step of this kind of analysis is to select a good reference. Usually Agrobacterium vitis contains one circular chromosome, one chromid and can have plasmids. I’ll be using the fasta file of Chromosome 1 from strain CG71, because it is a complete assembly. Download it as a fasta file, name it REF.fna
, and keep it in the SNP_call
directory.
Indexing Reference
Indexing is like creating an address for different locations in the genome file so that it can be easily navigated during different analysis. We will create index for three different analysis. Let’s run the following commands:
#!/bin/bash
java -jar picard.jar CreateSequenceDictionary R= REF.fna O= REF.dict
samtools faidx REF.fna
bwa index REF.fna
Aligning reads onto reference
Although we are going to use Graphtyper
for calling SNPs, before that we need to align the raw reads from each sample onto the reference, and need to preprocess the alignment file. Since we have both Illumina and Nanopore, we will use bwa
for aligning Illumina reads and minimap
for aligning 400bp split Nanopore reads.
#!/bin/bash
file1=$1
file2=$2
dataset=$3
id=$dataset
#align bwa mem
bwa mem -M -t 8 ./REF.fna $file1 $file2 | \
#process alignments to sort sam files and mark duplicate reads.
java -jar picard.jar AddOrReplaceReadGroups I=/dev/stdin O=/dev/stdout RGID= $id RGLB= $id RGPL= Illumina RGPU= $id RGSM=$dataset TMP_DIR=tmp/$dataset | \
java -jar picard.jar SortSam I=/dev/stdin O=${dataset}.rg.bam SORT_ORDER=coordinate CREATE_INDEX=true TMP_DIR=tmp/$dataset
java -jar picard.jar MarkDuplicates I=${dataset}.rg.bam O=${dataset}.marked_duplicates.bam ASSUME_SORT_ORDER=coordinate M=${dataset}_marked_duplicates_metrics.txt TMP_DIR=tmp/$dataset
rm ${dataset}.rg.bam
rm ${dataset}.rg.bai
java -jar picard.jar SortSam I=${dataset}.marked_duplicates.bam O=${dataset}.marked_duplicates.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true TMP_DIR=tmp/$dataset
rm ${dataset}.marked_duplicates.bam
rm -rf tmp/$dataset
Run the script by following:
./run_bwa-gatk.sh illumina_R1.fastq.gz illumina_R2.fastq.gz strain_id
Basically this script takes in the reads, align on REF.fna, then use picard
to sort the alignment files, mark duplicates, and group reads. We need to do the same for Nanopore reads too:
#!/bin/bash
file1=$1
dataset=$2
id=$dataset
#align with bwa mem
minimap2 -ax map-ont REF.fna $file1 | \
#add read groups
java -jar picard.jar AddOrReplaceReadGroups I=/dev/stdin O=/dev/stdout RGID= $id RGLB= $id RGPL= Nanopore RGPU= $id RGSM=$dataset TMP_DIR=tmp/$dataset | \
java -jar picard.jar SortSam I=/dev/stdin O=${dataset}.rg.bam SORT_ORDER=coordinate CREATE_INDEX=true TMP_DIR=tmp/$dataset
#mark duplicate reads and sort
java -jar picard.jar MarkDuplicates I=${dataset}.rg.bam O=${dataset}.marked_duplicates.bam ASSUME_SORT_ORDER=coordinate M=${dataset}_marked_duplicates_metrics.txt TMP_DIR=tmp/$dataset
rm ${dataset}.rg.bam
rm ${dataset}.rg.bai
java -jar /nfs7/BPP/Weisberg_Lab/software/software/picard.jar SortSam I=${dataset}.marked_duplicates.bam O=${dataset}.marked_duplicates.sorted.bam SORT_ORDER=coordinate CREATE_INDEX=true TMP_DIR=tmp/$dataset
rm ${dataset}.marked_duplicates.bam
rm -rf tmp/$dataset
Here we use minimap
for alignment. Run it by the following command:
./un_minimap-gatk_bac400.sh nanopore_reads.400bp.fastq.gz strain_id
This two scripts will generate bunch of bam
files. Move them into the align
subdirectory.
Run Graphtyper
We’ll create a list of all these bam files generated in previous step:
ls -1 *bam > bamlist
Now, run Graphtyper:
for contig in `grep '>' REF.fna | sed 's/>//g' | sed 's/ .*//g'`; do
graphtyper genotype --threads 16 --verbose --sams=bamlist --region=$contig REF.fna
done
Combine VCFs
Graphtyper creates a subdirectory named results
, where for each contig in the REF.fna, it creates a subdirectory and a set of variant calling files (VCFs). We want to concatenate these vcf files together, zip them, and create index. For that, we can use the following script:
graphtyper vcf_concatenate results/*/*.vcf.gz | bgzip -c > species_calls_merged.vcf.gz
tabix -p vcf species_calls_merged.vcf.gz
Filtering
We have the combined species_calls_merged.vcf.gz
file now. Since graphtyper was created for diploid organism, it may mistakenly call heterozygous call, which are just errors. So we need to do some filtering:
#!/bin/bash
#filter heterozygous calls because bacteria are haploid, these are just errors
java -jar GenomeAnalysisTK.jar -T VariantFiltration -G_filter "isHet == 1" -G_filterName "isHetFilter" --setFilteredGtToNocall -o species_calls_merged.vcf.nonpassnonhetfilt.vcf --variant species_calls_merged.vcf.gz -R ./REF.fna
vcffilter -f "ABHet < 0.0 | ABHet > 0.33" -f "ABHom < 0.0 | ABHom > 0.97" -f "MaxAASR > 0.4" -f "MQ > 30" species_calls_merged.vcf.nonpassnonhetfilt.vcf > species_calls_merged.vcf.nonpassnonhetfilt.vcf.vcffilter.vcf
Final outputs
So we have the final output! But we still may need fasta file contianing the SNP only sites. Here’s the way to do it:
#!/bin/bash
vcf-to-tab species_calls_merged.vcf.nonpassnonhetfilt.vcf.vcffilter.vcf > species.tab
sed -i 's_/\S\+_/_g' species.tab
perl vcftab_to_snpaln_nodel.pl --output_ref -i species.tab > species.tab.fasta
Rscript fasta_to_disttable.r species.tab.fasta
You can find the fasta_to_disttable.r
and vcftab_to_snpaln_nodel.pl
here.
Leave a Reply