Nowadays, for biologists, it’s almost a requirement to know how to code, especially when you have NGS data.
I often see Ph.D. position announcements explicitly stating that, apart from a biology major, some data analysis skills are also required.
I have seen many Ph.D. students pick up these skills only after generating NGS data in the wet lab. They then had to learn to analyze the data using command-line tools on a cluster, sometimes with R or Python where necessary.
Complete Beginner
If you are a complete beginner, don’t aim to ‘understand’ everything discussed in a course, lecture, or book. It’s okay to be partially ignorant as long as you keep moving forward. Try to get through 60-70% of the content of the resources below.
1. Bioinformatics Methods I and II, offered by the University of Toronto on the massive open online course (MOOC) platform Coursera.org, have pretty good materials (videos and tutorials).
2. On Shikkhok.com, a MOOC platform in the Bengali language, there is a very short course on bioinformatics, বায়োইনফরমেটিক্স পরিচিতি (Introduction to Bioinformatics), offered by the Bio-Bio-1 Foundation.
3. Reading books is the best way. I’ve found ‘Essential Bioinformatics’ by Jin Xiong an easy book to understand. ‘Introduction to Computational Genomics: A Case Studies Approach’ by Cristianini and Hahn is another exceptional book. Another great book I recommend is ‘Bioinformatics Data Skills’ by Buffalo.
Intermediate
1. Start reading computational biology/bioinformatics-related research papers.
The journal Nature has a series of educational articles in which experts describe different concepts in bioinformatics, statistics, and data visualization. Dr. Xianjun Dong from Harvard University has compiled an index of those papers in a PDF document. I encourage everyone to use this resource as a syllabus.
2. Learn Python.
- Codecademy.com has a great learning environment.
- This site contains several slides at its very bottom, in the section ‘Introduction to Programming for Bioinformatics in Python’. I actually learnt Python from these slides. Just type the commands, try to get the same answers, and do the exercises. It’s very easy to understand.
- Rosalind.info is a site where you can learn and improve your skills in bioinformatics programming. You can learn Python in its ‘Python Village’ section. After that, I suggest solving problems in ‘Bioinformatics Stronghold’. The structure of Rosalind.info is very interesting: the problems are easy at first, but they get harder as you solve more of them. This is a good place to learn about different bioinformatics algorithms.
Also, have a look at the Rosalind country ranking: Bangladesh is currently in 3rd position worldwide!
3. Learn R. If programming in Python seems hard for you, then start with R, a statistical programming language whose syntax is quite similar to Python’s. Knowing how to use R for data analysis, statistics, and visualization will be very helpful further down the line in your research.
4. Learn Unix, or at least get a basic understanding of it. Most state-of-the-art tools are run from the Unix command line. Learn how to use the Unix environment on a Mac or Linux setup.
5. Start using GitHub. Before publishing, you should always share your statistical analysis code (e.g. R/Python scripts, Jupyter Notebooks, pipelines like Kraken), ideally on GitHub, even if the analyses are very generic.
I hopped into a Discord server for Microbial Ecology & Bioinformatics discussion. One user, who is a journal editor, mentioned that they ask authors to share the original code, and that they often find the interpretation of the analyses in the manuscript is far from the actual code output.
To quote: “I desk-reject those papers without any code as ‘not reproducible.’”
For the first paper from my Ph.D., I put the code and raw data on GitHub. Reviewer 3 actually reran the whole analysis themselves and shared it with us via the journal platform! They also did some additional analyses and suggested another test that supported our findings, which we incorporated as well.
So, document your analyses, and make sure that they are reproducible.
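One small habit that helps with reproducibility is saving the software versions and random seed next to your results. The following is only a minimal sketch in Python using the standard library; the function name `environment_snapshot`, the seed value, and the package names passed in are illustrative assumptions, not part of any particular workflow.

```python
import json
import platform
import random
from importlib import metadata

SEED = 42  # arbitrary illustrative seed; record whichever one you actually use


def environment_snapshot(packages):
    """Return the Python version, OS, seed, and package versions as a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "seed": SEED,
        "packages": versions,
    }


if __name__ == "__main__":
    random.seed(SEED)  # fix the seed before any random step in the analysis
    # The package names below are placeholders; list the ones your analysis imports.
    print(json.dumps(environment_snapshot(["numpy", "pandas"]), indent=2))
```

Committing this output to the same GitHub repository as the scripts gives a reviewer a record of the environment the analysis was run in.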
Advanced
At this stage, you should be able to find relevant tutorials based on your research interest.
RNA-Seq
I have used different packages such as DESeq2, edgeR, and limma for RNA-Seq analysis.
Video Explaining RNA-Seq Normalization Methods
- A Gentle Introduction to RNA-seq
- A Gentle Introduction to ChIP-seq
- edgeR, part1: Library Normalization
- DESeq2, part1: Library Normalization
- edgeR and DESeq2, part2: Independent Filtering (removing genes with low read counts)
- RNA-seq – The Problem with Technical Replicates
- RPKM, FPKM, and TPM
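To make the last topic in this list concrete, here is a minimal sketch of how RPKM and TPM are commonly computed from raw read counts and gene lengths. The function names and the toy numbers are my own assumptions for illustration; real analyses would normally rely on established packages. FPKM uses the same formula as RPKM, but counts fragments instead of reads.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: scale by depth first, then by gene length."""
    total_reads = sum(counts)
    if total_reads == 0:
        raise ValueError("total read count is zero")
    per_million = total_reads / 1_000_000
    return [c / per_million / (length / 1_000)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: scale by gene length first, then by depth."""
    rates = [c / (length / 1_000) for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    if total_rate == 0:
        raise ValueError("all length-normalized counts are zero")
    return [r / total_rate * 1_000_000 for r in rates]


if __name__ == "__main__":
    # Toy data: three hypothetical genes in one sample.
    counts = [100, 200, 300]
    lengths_bp = [1_000, 2_000, 1_000]
    print("RPKM:", rpkm(counts, lengths_bp))
    print("TPM: ", tpm(counts, lengths_bp))
```

Because TPM divides by the sum of the length-normalized counts, the TPM values of every sample add up to one million (up to floating-point error), which is not guaranteed for RPKM.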
Pipeline for doing RNA-Seq analysis
There are also some pipelines I have followed for doing the analysis. Here are some links to them:
- From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline
- DESeq2 analysis template in R
- RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR
[Updated: April 8, 2019]