Programmatically Downloading Raw Data from NCBI

This is part of the tutorial series: NGS Workflow for Genome Assembly to Annotation for Hybrid Bacterial Data

We’ll be using hybrid sequencing data (Illumina and Nanopore). This tutorial has five parts.

Part 1: Downloading and preparing data.
Part 2: Assembly with short-reads.
Part 3: Assembly with long-reads.
Part 4: Hybrid assembly (long- and short-reads).
Part 5: Bacterial genome annotation.

Disclaimer: This post is a work in progress. This is the genome assembly and annotation workflow that I use for microbial genomics. I have previously used this template to teach several classes at OSU as well as at other training facilities.

Usually, we get the data in fastq format from the sequencer in our lab (or from a sequencing facility). If you have your own data, you can follow along with this tutorial using it. However, I’m assuming you don’t have your own data, so we are going to download it from a public database (NCBI).

Even if you have your own data, you will often need to download genomes or raw data from NCBI. For a large project, downloading them one by one is tedious and time-consuming. Several tools can help you download these datasets programmatically. One such tool is Kingfisher.

Why Kingfisher and not NCBI’s SRA-Toolkit? In my experience Kingfisher is faster than SRA-Toolkit, and it can try several download sources in turn.

For this tutorial, we’ll use three different genome datasets. Each has long-read (Nanopore) and short-read (Illumina) data available on NCBI:

Strain	BioProject	Sample	Illumina	Nanopore
Agrobacterium fabrum ARqua1	PRJNA976066	SAMN14693017	NovaSeq SRR24759617	PromethION SRR24759618
Shigella sp. AUSMDU00026329	PRJNA857526	SAMN13348498	NextSeq SRR10506616	PromethION SRR32486128
Shigella sp. AUSMDU00036386	PRJNA857526	SAMN41119865	NextSeq SRR28839724	PromethION SRR32486129

We will download the fastq files from the SRA (Sequence Read Archive) database. Let’s learn how to use Kingfisher to download these datasets.

Software Installation

Don’t forget to check the software manual for Kingfisher.

conda install bioconda::kingfisher
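
After installing, I like to confirm that the shell can actually find the program before starting any long download. A minimal sketch, assuming the conda environment you installed into is currently activated (the variable name `status` is just for illustration):

```shell
# command -v prints the path of an executable if it is on PATH
if command -v kingfisher >/dev/null 2>&1; then
    status="found"
else
    status="missing"
fi
echo "kingfisher: $status"
```

If this reports "missing", the usual cause is that the environment holding Kingfisher is not active in the current terminal.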

Using `Kingfisher` to download reads from SRA (or any other NCBI database)

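For a single run, you download the raw fastq with the following command, where $biosample holds the run accession. The methods listed after -m are tried in order until one succeeds: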

kingfisher get -r $biosample -m ena-ascp ena-ftp aws-http prefetch

But we want to download sequences efficiently, programmatically, and automatically. To do that, let’s first make a list of the target data we want to download and save it as sra_list.tab:

SRR24759617	Agro_illumina
SRR24759618	Agro_nanopore
SRR10506616	Shig1_illumina
SRR32486128	Shig1_nanopore
SRR28839724	Shig2_illumina
SRR32486129	Shig2_nanopore

This is a tab-separated file. The first column is the SRA run accession (stored in a variable called biosample in the scripts), and the second column is the strain name together with the sequencing technology.
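
Copying the list from a web page can silently turn the tabs into spaces, which breaks the parsing later. A minimal sketch, assuming you run it in the project directory, that writes the file with explicit tab characters and then counts malformed lines (`bad_lines` is just an illustrative variable name):

```shell
# printf reuses the format for every pair of arguments, one line per pair
printf '%s\t%s\n' \
    SRR24759617 Agro_illumina \
    SRR24759618 Agro_nanopore \
    SRR10506616 Shig1_illumina \
    SRR32486128 Shig1_nanopore \
    SRR28839724 Shig2_illumina \
    SRR32486129 Shig2_nanopore > sra_list.tab

# Count lines that do NOT have exactly two tab-separated fields
bad_lines=$(awk -F'\t' 'NF != 2 {n++} END {print n+0}' sra_list.tab)
echo "malformed lines: $bad_lines"
```

A count other than zero means at least one line needs its separator fixed.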

Let’s make a script individual_kf.sh for downloading these genomes.

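#!/bin/bash

biosample=$1   # SRA run accession, e.g. SRR24759617
strain=$2      # output name, e.g. Agro_illumina

kingfisher get -r "$biosample" -m ena-ascp ena-ftp aws-http prefetch || exit 1

# Some download methods return uncompressed fastq, so gzip whatever is not zipped yet
for fq in "${biosample}"*.fastq; do
        if [ -f "$fq" ]; then
                gzip "$fq"
        fi
done

# Rename from run accession to strain name
if [ -f "${biosample}_1.fastq.gz" ]; then
        # paired-end reads (Illumina)
        mv "${biosample}_1.fastq.gz" "${strain}_R1.fastq.gz"
        mv "${biosample}_2.fastq.gz" "${strain}_R2.fastq.gz"
elif [ -f "${biosample}.fastq.gz" ]; then
        # single-end reads (Nanopore)
        mv "${biosample}.fastq.gz" "${strain}.fastq.gz"
fi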

Make sure to make it executable:

chmod +x individual_kf.sh

This shell script takes two inputs, biosample and strain. It first downloads the fastq files for the run, gzips them if they are not already compressed, and renames them from the SRA accession to the strain name.
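
The rename step depends on whether a run is paired-end (Illumina) or single-end (Nanopore). Here is a small dry run of one way to handle both, using empty placeholder files in a temporary directory instead of real downloads. The file naming (`_1`/`_2` for paired, no suffix for single) is my assumption about what the downloader produces, and `rename_run` is a hypothetical helper, not part of Kingfisher:

```shell
# Throwaway directory with empty placeholder files
workdir=$(mktemp -d)
touch "$workdir/SRR24759617_1.fastq.gz" "$workdir/SRR24759617_2.fastq.gz"  # paired-end
touch "$workdir/SRR24759618.fastq.gz"                                      # single-end

# rename_run DIR RUN STRAIN: hypothetical helper for illustration only
rename_run() {
    dir=$1; run=$2; strain=$3
    if [ -f "$dir/${run}_1.fastq.gz" ] && [ -f "$dir/${run}_2.fastq.gz" ]; then
        mv "$dir/${run}_1.fastq.gz" "$dir/${strain}_R1.fastq.gz"
        mv "$dir/${run}_2.fastq.gz" "$dir/${strain}_R2.fastq.gz"
    elif [ -f "$dir/${run}.fastq.gz" ]; then
        mv "$dir/${run}.fastq.gz" "$dir/${strain}.fastq.gz"
    else
        echo "no fastq.gz found for $run" >&2
        return 1
    fi
}

rename_run "$workdir" SRR24759617 Agro_illumina
rename_run "$workdir" SRR24759618 Agro_nanopore
ls "$workdir"
```

Checking for both mates before renaming avoids ending up with a half-renamed pair when a download was incomplete.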

But we want to call the program automatically for all strains. For that, we can make another bash script that calls individual_kf.sh once for every entry in the list. The name of this script is run_kf_handle.sh:

#!/bin/bash

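# Each line of sra_list.tab holds two tab-separated fields
while IFS=$'\t' read -r biosample strain; do
        # skip blank lines
        [ -z "$biosample" ] && continue
        echo "bash individual_kf.sh $biosample $strain"
        # stdin comes from /dev/null so the download cannot consume the list
        bash individual_kf.sh "$biosample" "$strain" < /dev/null
done < sra_list.tab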

Make sure to make it executable:

chmod +x run_kf_handle.sh

What it does is read the sra_list.tab file line by line, extract the two fields (biosample and strain), print the command to the terminal, and then run it.
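
Before starting a long download, it can help to check how the list is parsed without downloading anything. A sketch that only prints the commands, using a two-line example list in a temporary file (the variable names `list`, `run` and `n` are mine, for illustration):

```shell
# Small example list in a temporary file, with real tab separators
list=$(mktemp)
printf '%s\t%s\n' SRR24759617 Agro_illumina SRR24759618 Agro_nanopore > "$list"

tab=$(printf '\t')   # a literal tab character, portable across shells
n=0
# IFS set to a tab splits each line into its two fields; read -r keeps backslashes literal
while IFS=$tab read -r run strain; do
    [ -z "$run" ] && continue            # skip blank lines
    echo "bash individual_kf.sh $run $strain"
    n=$((n + 1))
done < "$list"
echo "$n runs planned"
```

If a printed command shows the accession and the strain name glued together or cut off, the separator in the list is not a tab.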

To run this script, just do:

./run_kf_handle.sh

This will download the datasets that will be used in this tutorial.

I like to organize the downloaded files in dedicated directories. So let’s organize our current project directory:

mkdir raw
mv *gz raw

cd raw

mkdir illumina nanopore
mv *illumina*.fastq.gz illumina
mv *nanopore*.fastq.gz nanopore

ls
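
After moving the files, I find it worth checking that every paired-end sample has both mates and that the archives are intact. A minimal sketch on small placeholder reads in a temporary directory, where the R2 file of one sample is left out on purpose; for real data you would point `dir` at the illumina directory. The `_R1`/`_R2` naming is assumed to match the renaming used in this tutorial:

```shell
# Demo directory with tiny placeholder reads (Shig1 R2 is deliberately absent)
dir=$(mktemp -d)
for name in Agro_illumina_R1 Agro_illumina_R2 Shig1_illumina_R1; do
    printf '@read1\nACGT\n+\nIIII\n' | gzip > "$dir/$name.fastq.gz"
done

missing=0
for r1 in "$dir"/*_R1.fastq.gz; do
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    # gzip -t tests archive integrity without extracting
    gzip -t "$r1" || echo "corrupt: $r1"
    if [ ! -f "$r2" ]; then
        echo "missing mate for $(basename "$r1")"
        missing=$((missing + 1))
    fi
done
echo "samples without R2: $missing"
```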

Now you can go to: Part 2 of NGS Workflow for Genome Assembly to Annotation for Hybrid Bacterial Data
