
A note on learning computational biology



Nowadays, for biologists, it’s almost a requirement to know how to code, especially when you have next-generation sequencing (NGS) data.

I often see Ph.D. position announcements explicitly stating that, apart from a biology major, some data analysis skills are also required.

I have seen many Ph.D. students pick up these skills only after generating NGS data in the wet lab. They then had to learn to analyze the data using command-line tools on a cluster, and sometimes R or Python as needed.

Complete Beginner

If you are a complete beginner, don’t aim to ‘understand’ everything discussed in a course, lecture, or book. It’s okay to be partially ignorant as long as you keep moving forward. Try to get through 60-70% of the content of the following sources within one to two months. The objective at this stage is to get a good understanding of core bioinformatics concepts and terminology.

1. Bioinformatics Methods I and II, offered by the University of Toronto on the massive open online course (MOOC) platform Coursera.org, has pretty good materials (videos and tutorials).

2. On Shikkhok.com, a MOOC platform in the Bengali language, there is a very short course on bioinformatics, বায়োইনফরমেটিক্স পরিচিতি (“Introduction to Bioinformatics”), offered by Bio-Bio-1 Foundation.

3. Reading books is the best way. I’ve found ‘Essential Bioinformatics’ by Jin Xiong to be an easy-to-understand book. ‘Introduction to Computational Genomics: A Case Studies Approach’ by Cristianini and Hahn is another exceptional book. I also recommend ‘Bioinformatics Data Skills’ by Vince Buffalo.

Intermediate

1. Start reading computational biology/bioinformatics research papers. It’s a good idea to pick a paper, redo all the analyses it describes using its data, and try to generate the same or similar results. This process is called reproduction, and it is very helpful for understanding how a real bioinformatics project is done. Here’s a list of bioinformatics journals.

The journal Nature has a series of educational articles where experts describe different concepts in Bioinformatics, Statistics and Data Visualizations. Dr. Xianjun Dong from Harvard University has compiled an index of those papers in a PDF document. I encourage everyone to use this resource as a syllabus.

2. Learn coding. The target is to write small scripts that automate boring mouse-clicking tasks and save time. Running a single BLAST search on NCBI is easy, but when you need to BLAST 20+ sequences, doing it by hand is madness. So learn Python: it is currently one of the most widely used languages for bioinformatics programming (and for data science, too), and it is very easy to learn and use. Python has a very useful library called Biopython, which is essential in day-to-day computational biology programming. I teach a Python for Bioinformatics online course on cBLAST (an online course forum run by the University of Dhaka). Here are some other good resources to learn Python:

  • Codecademy.com has a great learning environment.
  • This site has several slide decks in its very bottom section, ‘Introduction to Programming for Bioinformatics in Python’. I actually learnt Python from these slides. Just type in the commands, try to get the same answers, and do the exercises. They are very easy to understand.
  • Rosalind.info is a site where you can learn and improve your bioinformatics programming skills. You can learn Python in its ‘Python Village’ section. After that, I suggest solving problems in the ‘Bioinformatics Stronghold’. The structure of Rosalind.info is very interesting: the problems start out easy and get harder as you progress. This is a good place to learn about different bioinformatics algorithms.

Also, have a look at the Rosalind Country Ranking: Bangladesh is currently in 3rd position worldwide!
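To give a feel for the kind of small script I mean, here is a minimal sketch in plain Python. It reads several sequences in FASTA format and reports the GC content of each one, in the spirit of the early Rosalind problems. The function names (`read_fasta`, `gc_content`, `summarize`) and the example sequences are made up for illustration. With real data you would read from a file, and Biopython's `SeqIO.parse` can do the parsing for you.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Return the fraction of G and C bases in a sequence (0.0 if it is empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def summarize(lines):
    """Map each FASTA header to the GC content of its sequence."""
    return {header: gc_content(seq) for header, seq in read_fasta(lines)}


# Toy input (hypothetical sequences); a real run would iterate over an open file.
example = """>seq1
ATGC
GGCC
>seq2
ATATAT
""".splitlines()

for name, gc in summarize(example).items():
    print(f"{name}\t{gc:.2%}")
```

The loop at the bottom is the part that saves time: whether the input holds 2 sequences or 2,000, the script stays the same. The same pattern applies if the per-sequence step is a BLAST search instead of a GC calculation.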

3. Learn R. If programming in Python seems hard for you, then start with R, a statistical programming language whose basic syntax shares some similarities with Python. Knowing how to use R for data analysis, statistics, and visualization will be very helpful down the line in your research.

4. Learn Unix, or at least get a basic understanding of it. Most state-of-the-art tools are run from the Unix command line. Learn how to use a Unix environment on a Mac or Linux setup.

5. Start using GitHub. Before publishing, you should always share your statistical analysis code (e.g., R/Python scripts, Jupyter notebooks, pipelines like Kraken), ideally on GitHub, even if the analyses are very generic.

I hopped into a Discord server for Microbial Ecology & Bioinformatics discussion. One user, who is a journal editor, mentioned that they ask authors to share the original code, and that they often find the interpretation of the analyses in the manuscript to be far from what the code actually outputs.

To quote: “I desk-reject those papers without any code as ‘not reproducible.’”

For the first paper from my Ph.D., I put the code and raw data on GitHub. Reviewer 3 actually reran the whole analysis themselves and shared it with us via the journal platform! They also did some additional analyses and suggested another test that supported our findings, which we incorporated as well.

So, document your analyses, and make sure that they are reproducible.
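One way to make this concrete is to save a small record of each run next to your results. The sketch below is only an illustration using the Python standard library. The function names, the manifest fields, and the default file name `analysis_manifest.json` are my own assumptions, not a standard. It records the Python version, the platform, the command used, the random seed, and a SHA-256 checksum of each input file, so a reviewer can check that they are running the same analysis on the same data.

```python
import hashlib
import json
import platform
import sys
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_paths, seed):
    """Collect what someone else would need to rerun the analysis."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "command": sys.argv,          # how the script was invoked
        "random_seed": seed,          # the seed you passed to random.seed() etc.
        "inputs": {str(p): sha256_of(p) for p in input_paths},
    }


def write_manifest(manifest, out_path="analysis_manifest.json"):
    """Write the manifest as human-readable JSON."""
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

In an analysis script you would call something like `write_manifest(build_manifest(["counts.tsv"], seed=42))` at the start of the run (the file name here is hypothetical) and commit the resulting JSON to the same GitHub repository as the code.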

Advanced

At this stage, you should be able to find relevant tutorials based on your research interests.

RNA-Seq

I have used different packages like DESeq2, edgeR, and limma for RNA-Seq data analysis. For starters, it is really important to understand the between-library normalization problem that these packages try to solve. StatQuest has some great videos explaining these things. Here’s an index of their YouTube channel:

Video Explaining RNA-Seq Normalization Methods
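To illustrate the normalization problem, here is a simplified sketch of the median-of-ratios idea behind DESeq2's size factors, written in plain Python. It is not DESeq2's actual implementation, and the function name, gene names, and counts are made up. The idea is that if one library was sequenced twice as deeply as another, its raw counts are roughly doubled for every gene. So each library gets a scaling factor, taken as the median ratio of its counts to a per-gene geometric mean across libraries.

```python
import math
from statistics import median


def size_factors(counts):
    """Estimate one scaling factor per library with the median-of-ratios idea.

    counts: dict mapping gene name -> list of raw counts, one per library.
    Genes with a zero count in any library are skipped, because their
    geometric mean across libraries would be zero.
    """
    n_libs = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_libs)]
    for row in counts.values():
        if any(c == 0 for c in row):
            continue
        # log of the geometric mean of this gene across libraries
        log_geo_mean = sum(math.log(c) for c in row) / n_libs
        for i, c in enumerate(row):
            log_ratios[i].append(math.log(c) - log_geo_mean)
    # raises statistics.StatisticsError if no gene is usable
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: library 2 is sequenced exactly twice as deeply as library 1.
toy_counts = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [5, 10],
    "geneD": [0, 7],  # skipped because of the zero
}

factors = size_factors(toy_counts)
normalized = {
    gene: [c / f for c, f in zip(row, factors)]
    for gene, row in toy_counts.items()
}
print(factors)
print(normalized["geneA"])
```

In this toy example the second library's factor is twice the first's, so after dividing by the factors, geneA, geneB, and geneC have equal normalized counts in both libraries. Using the median rather than the total count makes the estimate less sensitive to a handful of very highly expressed genes.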

Pipeline for doing RNA-Seq analysis

There are also some pipelines I have followed for doing the analysis. Here are some links to them:

[Updated: April 8, 2019]




