I needed to convert a bunch of bam files back into fastq files. I was using a regular for loop to load the files and convert them one by one, serially. After running the first batch of conversions, I realized that the process was too slow, and that running serially is not the best approach when you have multiple cores sitting idle. There are actually several ways to parallelize such a task without going deep into the multiprocessing jungle.
Although samtools supports multi-threading, many UNIX-based programs do not. And many tasks simply do the same thing to a large number of input files (i.e. multiple I/O), and there is a very simple way in Unix to submit them all together without waiting for each one to finish.
Here, I’m summarizing different ways to run a simple Bash script in parallel. But before that, let’s check the original sequential case:
#!/bin/bash
for file in *.bam
do
    filename="${file%.bam}"   # strip the .bam extension
    echo "$filename"
    samtools bam2fq "$file" > "${filename}.fastq"
done
Do not forget to make this script executable before running it in the terminal, using chmod +x bam2fq.sh.
This basically goes through all .bam files in the present directory, removes the .bam extension to build the output filename, prints it to the terminal, and starts the conversion to fastq.
This works fine, but it is a serial-processing job. Most computers now have multiple CPU cores and can handle multiple threads, so we can do better.
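To see how many cores your machine actually has to offer, you can ask the system directly (the command below is for Linux; macOS uses a different one, noted in the comment):

```shell
# Number of available CPU cores on Linux (GNU coreutils).
# On macOS, use `sysctl -n hw.ncpu` instead.
nproc
```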
Bash has a builtin & (aka ampersand) operator. It runs a command asynchronously, meaning the shell will not wait for the current command to complete before moving on to the next one. So, let's use this ampersand & in the script.
#!/bin/bash
for file in *.bam
do
    filename="${file%.bam}"   # strip the .bam extension
    echo "$filename"
    samtools bam2fq "$file" > "${filename}.fastq" &
done
If you check this script carefully, you'll notice that we have added a & after the samtools bam2fq command. Now the shell will submit all the bam files in the directory to samtools at once, as background jobs. The scheduler in the OS will handle how these jobs are spread across the cores and threads of your CPU, and you do not have to bother about that!
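A toy example, using sleep as a stand-in for samtools, makes the effect of & easy to see: the three jobs below all run at the same time, so the whole script takes about one second instead of three.

```shell
#!/bin/bash
# Each sleep is sent to the background with &, so all three run concurrently.
for i in 1 2 3
do
    sleep 1 &
done
wait   # block until every background job has finished
echo "all jobs done"
```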
However, this essentially bombards your CPU with all the bam files at once, and that might be a lot, especially if your computer is doing other tasks. So it might be a good idea to limit how many jobs run at a time. We can do that as follows:
#!/bin/bash
N=8
for file in *.bam
do
    ((i=i%N)); ((i++==0)) && wait   # at the start of every batch of N, wait
    filename="${file%.bam}"   # strip the .bam extension
    echo "$filename"
    samtools bam2fq "$file" > "${filename}.fastq" &
done
wait   # catch the final, possibly partial, batch
This submits N jobs concurrently as a batch, waits for them to finish, and then submits the next N jobs once the first N are done.
In this script, the counter i cycles through 0, 1, ..., N-1, and ((i++==0)) is true exactly when i wraps back to 0, i.e. at the start of each batch. The && runs the command after it only when that test succeeds, and wait does the actual waiting: it blocks until the previously submitted N jobs have finished before a new batch of N jobs is submitted.
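You can watch the counter trick in action with a harmless dry run; here N=2 and echo stands in for samtools, so wait fires before every second submission:

```shell
#!/bin/bash
# i cycles 0,1,0,1,...; ((i++==0)) is true exactly when i wraps to 0,
# so `wait` runs once per batch of N background jobs.
N=2
for file in a.bam b.bam c.bam d.bam e.bam
do
    ((i=i%N)); ((i++==0)) && wait
    echo "converting $file" &
done
wait   # catch the final, possibly partial, batch
```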
Although this is fast, it is still a little "ugly": even if some of the N submitted jobs finish early, the script waits for all of them to complete before submitting another batch of N jobs.
There is a nice GNU program called Parallel that excels in this case. But for that, we need to modify the main script a little bit, so that it handles only a single bam file as input.
#!/bin/bash
input_file=$1
filename="${input_file%.bam}"   # strip the .bam extension
samtools bam2fq "$input_file" > "${filename}.fastq"
Let's name the script above bam2fq.sh. Notice that we removed the for loop, which means it only works on one input file, provided as an argument through $1. To run this script on a single bam file, just run the following in the terminal:
bash bam2fq.sh file1.bam
Now that we have the main script ready and working, we can submit N jobs that are truly parallel and independent from each other with the following command:
parallel --jobs N --verbose ./bam2fq.sh ::: *.bam
This now runs N jobs in parallel (--jobs), printing a detailed account of what is happening (--verbose).
So how much time I’ve saved?
| Implementation | Time |
|---|---|
| Serial, one-by-one | 2-3 hours |
| Concurrent, 8 jobs per batch | ~30 minutes |
| GNU Parallel | ~19 minutes |
Happy coding!