This is part of a tutorial series: NGS Workflow for Genome Assembly to Annotation for Hybrid Bacterial Data
We’ll use hybrid sequencing data (Illumina and Nanopore). This tutorial has five parts.
- Part 1: Downloading and preparing data
- Part 2: Assembly with short-reads.
- Part 3: Assembly with long-reads.
- Part 4: Hybrid assembly (long- and short-reads).
- Part 5: Bacterial genome annotation.
Disclaimer: This post is a work in progress. This is the genome assembly and annotation workflow that I use for microbial genomics. Previously, I used this template to teach several classes at OSU as well as at other training facilities.
Usually, we get the data in fastq format from the sequencer in our lab (or from a sequencing facility). If you have your own data, you can follow along with this tutorial using it. However, I’m assuming you don’t have your own data, so we are going to download some from a public database (NCBI).
Even if you have your own data, you will often need to download genomes or raw data from NCBI. For a large project, downloading them one by one is tedious and time-consuming, so there are many tools that help you download these datasets programmatically. One such tool is kingfisher.
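Why kingfisher, and why not SRA-Toolkit by NCBI? Well, kingfisher is faster than SRA-Toolkit. It can also fetch the same run from several sources and move on to the next one if a download fails.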
For this tutorial, we’ll use three different genome datasets. Each has long-read (Nanopore) and short-read (Illumina) data available on NCBI:
Strain | BioProject | Sample | Illumina | Nanopore |
Agrobacterium fabrum ARqua1 | PRJNA976066 | SAMN14693017 | NovaSeq SRR24759617 | PromethION SRR24759618 |
Shigella sp. AUSMDU00026329 | PRJNA857526 | SAMN13348498 | NextSeq SRR10506616 | PromethION SRR32486128 |
Shigella sp. AUSMDU00036386 | PRJNA857526 | SAMN41119865 | NextSeq SRR28839724 | PromethION SRR32486129 |
We will download the fastq files from the SRA (Sequence Read Archive) database. Let’s learn how to use Kingfisher to download these reads.
Software Installation
Don’t forget to check the software manual for Kingfisher.
conda install bioconda::kingfisher
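Before moving on, it can help to confirm that the install worked. Here is a minimal sketch, assuming only that the conda package puts a kingfisher executable on your PATH. The require_tool helper is a hypothetical name used for illustration, not part of Kingfisher:

```shell
#!/bin/bash
# require_tool: hypothetical helper, not part of Kingfisher.
# Returns 0 if the named command is on PATH, 1 otherwise.
require_tool() {
    if command -v "$1" > /dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (is the right conda environment active?)" >&2
        return 1
    fi
}

# Report the status without aborting the shell if the tool is missing.
require_tool kingfisher || true
```

If it reports the tool as missing, activating the conda environment you installed into is usually the first thing to try.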
Using Kingfisher to download reads from SRA (or any other NCBI database)
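Usually, you download the raw fastq files for a single run with the following command, where $biosample holds the run accession (for example SRR24759617) and the methods listed after -m are tried in order until one succeeds: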
kingfisher get -r $biosample -m ena-ascp ena-ftp aws-http prefetch
But we want to download the sequences efficiently, programmatically, and automatically. For that, let’s first make a list of the target data we want to download and save it as sra_list.tab:
SRR24759617 Agro_illumina
SRR24759618 Agro_nanopore
SRR10506616 Shig1_illumina
SRR32486128 Shig1_nanopore
SRR28839724 Shig2_illumina
SRR32486129 Shig2_nanopore
This is a tab-separated file: the first column is the SRA run accession (the scripts store it in a variable called biosample), and the second column is the strain name plus the sequencing technology.
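Copying the list from a web page often turns tabs into spaces, so one option is to generate the file with printf and then check it. This is only a sketch: check_list is a hypothetical helper, and the block overwrites any existing sra_list.tab in the current directory.

```shell
#!/bin/bash
# Write the list with a real tab character (\t) between the two columns.
# printf reuses the format for every pair of arguments.
printf '%s\t%s\n' \
    SRR24759617 Agro_illumina \
    SRR24759618 Agro_nanopore \
    SRR10506616 Shig1_illumina \
    SRR32486128 Shig1_nanopore \
    SRR28839724 Shig2_illumina \
    SRR32486129 Shig2_nanopore > sra_list.tab

# check_list: hypothetical helper. Prints every line that does not have
# exactly two tab-separated fields and returns non-zero if it finds one.
check_list() {
    awk -F '\t' 'NF != 2 { print "bad line " NR ": " $0; bad = 1 } END { exit bad }' "$1"
}

check_list sra_list.tab && echo "sra_list.tab looks fine"
```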
Let’s make a script, individual_kf.sh, for downloading these genomes.
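#!/bin/bash
# Usage: ./individual_kf.sh <run_accession> <strain_name>
biosample=$1
strain=$2
# Download the run, trying each method in order until one succeeds
kingfisher get -r "$biosample" -m ena-ascp ena-ftp aws-http prefetch
# Some methods return plain .fastq files; compress them if so
for fq in "${biosample}"*.fastq; do
    [ -f "$fq" ] && gzip "$fq"
done
# Paired-end runs (Illumina): rename _1/_2 to _R1/_R2
if [ -f "${biosample}_1.fastq.gz" ]; then
    mv "${biosample}_1.fastq.gz" "${strain}_R1.fastq.gz"
    mv "${biosample}_2.fastq.gz" "${strain}_R2.fastq.gz"
fi
# Single-end runs (Nanopore): one file per run
if [ -f "${biosample}.fastq.gz" ]; then
    mv "${biosample}.fastq.gz" "${strain}.fastq.gz"
fi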
Make sure to make it executable:
chmod +x individual_kf.sh
This is a shell script that takes two inputs, biosample and strain. It first downloads the fastq files for the run, gzips them if they are not already compressed, and renames them from the SRA accession to the strain name.
But we want to call the program automatically for all strains. For that, we can make another bash script that runs individual_kf.sh for every strain. The name of this script is run_kf_handle.sh:
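#!/bin/bash
# Read two fields per line; read splits on tabs as well as spaces
while read -r biosample strain; do
    # Skip blank lines
    [ -z "$biosample" ] && continue
    echo "bash individual_kf.sh $biosample $strain"
    # Redirect stdin so the download cannot consume the rest of the list
    bash individual_kf.sh "$biosample" "$strain" < /dev/null
done < sra_list.tab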
Make sure to make it executable:
chmod +x run_kf_handle.sh
What it does is read the sra_list.tab file line by line, take the two fields named biosample and strain, print the command to the terminal, and then run it.
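For a long list it is handy to preview what the loop would do before any download starts. The sketch below is a dry run under stated assumptions: preview_downloads is a hypothetical helper, and it assumes finished files sit in the current directory as <strain>.fastq.gz (single-end) or <strain>_R1.fastq.gz (paired-end).

```shell
#!/bin/bash
# preview_downloads: hypothetical dry-run helper. It prints the command the
# loop would run for each entry and never calls kingfisher itself.
preview_downloads() {
    while read -r accession strain; do
        # Skip blank lines.
        [ -z "$accession" ] && continue
        # Assumed naming of finished downloads (single-end or paired-end).
        if [ -f "${strain}.fastq.gz" ] || [ -f "${strain}_R1.fastq.gz" ]; then
            echo "skip: $strain (already downloaded)"
        else
            echo "run: bash individual_kf.sh $accession $strain"
        fi
    done < "$1"
}

# Only preview when the list is present in the current directory.
if [ -f sra_list.tab ]; then preview_downloads sra_list.tab; fi
```

The same file-exists test could be placed in front of the real download call so that a re-run after an interrupted session does not fetch everything again.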
To run this script, just do:
./run_kf_handle.sh
This will download the datasets that will be used in this tutorial.
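Large downloads sometimes end up truncated, so a quick sanity check is worthwhile before moving on. This is a minimal sketch, not part of Kingfisher: count_reads is a hypothetical helper that tests the gzip stream and then counts reads, assuming standard four-line fastq records.

```shell
#!/bin/bash
# count_reads: hypothetical helper. Verifies the gzip stream, then prints
# the number of reads, assuming 4 lines per FASTQ record.
count_reads() {
    gzip -t "$1" || return 1
    lines=$(gzip -dc "$1" | wc -l)
    echo $((lines / 4))
}

# Report every compressed FASTQ file in the current directory, if any.
for fq in *.fastq.gz; do
    [ -f "$fq" ] || continue
    printf '%s\t%s\n' "$fq" "$(count_reads "$fq")"
done
```

For paired-end Illumina data, the R1 and R2 files of a strain should report the same number of reads.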
I like to organize the downloaded files in a dedicated directory. So let’s organize our current project directory:
mkdir raw
mv *gz raw
cd raw
mkdir illumina nanopore
mv *illumina_R*.fastq.gz illumina
mv *nanopore.fastq.gz nanopore
ls
Now you can go to: Part 2 of NGS Workflow for Genome Assembly to Annotation for Hybrid Bacterial Data