Leveraging the Power of Parallelization in UNIX Scripts

I needed to convert a bunch of bam files back into fastq files. I was using a regular for loop to load the files and convert them, one by one, serially. After running the first batch of conversions, I realized that the process was too slow, and not the best thing to do when you have multiple cores sitting idle. There are actually several ways to parallelize the computation without going deep into the multi-processing jungle.

Although samtools supports multi-threading, many UNIX-based programs do not have that ability. And many tasks are really just doing the same thing on a large number of input files (i.e. multiple I/O), and there is a very simple way in Unix to submit them all together without waiting for each one to finish.

Here, I’m summarizing different ways to run a simple Bash script in parallel. But before that, let’s check the original sequential case:

#!/bin/bash

for file in `ls *bam`
do
        filename=`echo $file | sed 's/\.bam$//'`
        echo $filename
        samtools bam2fq $file > ${filename}.fastq 
done

Do not forget to make this script executable before running it in the terminal, using chmod +x bam2fq.sh.
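
For example (assuming the script is saved as bam2fq.sh in the directory containing the bam files):

chmod +x bam2fq.sh
./bam2fq.sh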

This basically goes through all .bam files in the present directory, removes the .bam extension for the output filename, prints it to the terminal, and starts the conversion to fastq.

This works fine, but it is a serial-processing job. Most computers now have multiple CPU cores and can handle multiple threads, so we can do better.

Bash has a built-in & (ampersand) operator. It runs a command asynchronously, which means the shell will not wait for the current command to complete before moving on to the next one. So, let’s use this ampersand & in the script.
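
As a tiny illustration of how & behaves (this is not part of the conversion script): both sleep commands below start immediately and run at the same time, and wait blocks until the two background jobs have finished.

sleep 5 &
sleep 5 &
wait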

#!/bin/bash

for file in `ls *bam`
do
        filename=`echo $file | sed 's/\.bam$//'`
        echo $filename
        samtools bam2fq $file > ${filename}.fastq & 
done

If you check this script carefully, you’ll notice that we have added an & after the samtools bam2fq command. Now the bash shell will submit all the bam files in the directory to samtools at once. The scheduler in the OS will handle how these submissions are spread across the cores and threads of your CPU, and you do not have to bother about that!

However, this essentially throws all the bam files at your CPU at once, which might be a lot, especially if your computer is doing other tasks. So it might be a good idea to limit how many jobs run at a time. We can do that as follows:

#!/bin/bash


N=8

for file in `ls *bam`
do
        ((i=i%N)); ((i++==0)) && wait   # at the start of every Nth iteration, wait for the previous batch to finish
        filename=`echo $file | sed 's/\.bam$//'`
        echo $filename
        samtools bam2fq $file > ${filename}.fastq &
done

This will submit N jobs concurrently in a batch, wait for them to finish, and submit another N jobs when the first N are done.

In this script, && means that wait only runs when the command before it succeeds. Here ((i++==0)) succeeds whenever the counter i has wrapped back around to 0, i.e. at the start of every batch of N iterations, and wait then blocks until all background jobs from the previous batch have finished before the next batch of N jobs is submitted.
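
If the arithmetic one-liner looks cryptic, here is a more explicit sketch of the same batching logic (a hypothetical rewrite for illustration, not the script I actually ran):

#!/bin/bash

N=8
count=0

for file in `ls *bam`
do
        if [ "$count" -ge "$N" ]; then
                wait          # block until the current batch of N background jobs has finished
                count=0
        fi
        filename=`echo $file | sed 's/\.bam$//'`
        samtools bam2fq $file > ${filename}.fastq &
        count=$((count+1))
done
wait                          # wait for the final, possibly partial, batch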

Although this is fast, it is still a little “ugly”: even if some of the jobs in a batch of N finish early, the script waits for all of them to complete before submitting the next batch of N jobs.

There is a nice GNU program called Parallel that excels in this case. But for that, we need to modify the main script a little bit, so that it only handles a single bam file as input.
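
If GNU Parallel is not already on your machine, it is available through most package managers; on Debian or Ubuntu, for example, it can usually be installed with:

sudo apt-get install parallel

With that in place, here is the modified single-file script: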

#!/bin/bash

input_file=$1

filename=`echo $input_file | sed 's/\.bam$//'`

samtools bam2fq $input_file > ${filename}.fastq

Let’s name the script above bam2fq.sh. Notice that we removed the for loop here, which means it only works on one input file, provided as an argument through $1. To run this script on a single bam file, just run the following in the terminal:

bash bam2fq.sh file1.bam

Now that we have the main script ready and working, we can submit N jobs that are truly parallel and independent from each other with the following command:

parallel --jobs N --verbose ./bam2fq.sh ::: *bam

This now runs N jobs in parallel (--jobs), printing detailed output of what is happening (--verbose).
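
As a concrete example (the job count here is just an illustration), you could run 8 jobs at a time and swap --verbose for GNU Parallel's --eta option if you prefer a running time estimate over the full log:

parallel --jobs 8 --eta ./bam2fq.sh ::: *bam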

So how much time did I save?

Implementation                    Time
Serial, one-by-one                2-3 hours
Concurrent, 8 jobs per batch      ~30 minutes
GNU Parallel                      ~19 minutes

Happy coding!
