How to estimate the fixation index, FST, to test for population differentiation in R
I was looking for a tutorial to estimate FST for my microbial (aka haploid) dataset, but I could not find any specific instructions, although other researchers are definitely looking for them (see here). One respondent even said that “there is no way to estimate Fst for microbial data”. Ultimately, however, I found a “way” to estimate FST, and here is a quick R tutorial.
The fixation index, otherwise known as FST, is a measure of population differentiation due to genetic structure and is one of the fundamental concepts in population genetics. Essentially, FST compares the genetic variance within subpopulations (the S subscript) with the total genetic variance (the T subscript): it is the proportion of the total variance that is explained by differences among subpopulations. Its value can range from 0 to 1. A high FST indicates a considerable degree of differentiation among populations. The figure below illustrates what it means for two populations to have very different genetic structures.
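To make the definition concrete, here is a minimal base-R sketch of a Nei-style per-locus estimate, FST = (HT - HS) / HT, on haploid 0/1 SNP data. It is only an illustration under stated assumptions, not necessarily the approach used in the full tutorial: the toy matrix `geno`, the population labels `pop`, and the helper names `gene_diversity` and `fst_per_locus` are all made up here.

```r
# Toy haploid SNP matrix: rows are isolates, columns are biallelic loci (0/1).
# All values are invented for illustration.
geno <- rbind(
  c(0, 0, 1),
  c(0, 0, 1),
  c(0, 1, 1),
  c(1, 1, 0),
  c(1, 1, 0),
  c(1, 0, 0)
)
pop <- c("A", "A", "A", "B", "B", "B")

# Gene diversity at a biallelic locus with allele frequency p: 2p(1 - p).
gene_diversity <- function(p) 2 * p * (1 - p)

fst_per_locus <- function(geno, pop) {
  # Frequency of allele "1" in each population (rows: populations, columns: loci).
  p_sub <- apply(geno, 2, function(x) tapply(x, pop, mean))
  # H_S: within-population diversity, averaged over populations.
  h_s <- colMeans(gene_diversity(p_sub))
  # H_T: diversity computed from the mean allele frequency across populations.
  h_t <- gene_diversity(colMeans(p_sub))
  # Monomorphic loci have H_T = 0 and therefore return NaN.
  (h_t - h_s) / h_t
}

fst <- fst_per_locus(geno, pop)
print(round(fst, 3))
```

Loci 1 and 3 are fixed for different alleles in the two populations, so their FST is 1; locus 2 differs only slightly in frequency (1/3 versus 2/3), so its FST is 1/9. This unweighted version assumes equal population sizes; real datasets usually call for a sample-size-corrected estimator.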
Sometimes you need to visualize SNP data from a small FASTA file. PCA is a great way to look for any distribution pattern.
It’s common to do principal component analysis (PCA) on large SNP datasets in VCF format. However, sometimes you want a simpler analysis for a very small dataset. Let’s say you have a gene alignment (or a supermatrix alignment of several genes) and want to visualize the SNP structure of that alignment using PCA. In this post, I’ll show you how to do just that.
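library(adegenet)  # provides fasta2genlight()
# Read the alignment into a genlight object, then read the metadata table
snp <- fasta2genlight('~/path/to/fasta', snpOnly = TRUE)
meta <- read.table('~/path/to/meta', sep = ',', header = TRUE)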
fasta2genlight is a nice function from the adegenet library that extracts SNPs from an alignment. It also detects the ploidy of your dataset, but do not forget to check the ploidy with `snp$ploidy`.
We load our sequence alignment file with this function, and we also load a meta file that contains the sequence features we are interested in visualizing in the PCA. The meta file must also have a column of isolate IDs that match the FASTA IDs. The meta file may look like the following:
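As a purely hypothetical illustration, the sketch below builds a tiny metadata table and runs a PCA with base R's `prcomp` on a toy 0/1 SNP matrix. The column names `id` and `host`, the isolate names, and the matrix `snp_toy` are assumptions made for this example; with real data, something like `as.matrix(snp)` would take the place of `snp_toy`.

```r
# Hypothetical metadata: one row per isolate; 'id' must match the FASTA headers.
meta_toy <- data.frame(
  id   = c("iso3", "iso1", "iso4", "iso2"),
  host = c("clover", "soybean", "clover", "soybean"),
  stringsAsFactors = FALSE
)

# Toy SNP matrix standing in for the real data (rows: isolates, columns: SNPs).
snp_toy <- rbind(
  iso1 = c(0, 0, 1, 1),
  iso2 = c(0, 0, 1, 0),
  iso3 = c(1, 1, 0, 0),
  iso4 = c(1, 1, 0, 1)
)

# Put the metadata in the same order as the SNP matrix rows.
meta_toy <- meta_toy[match(rownames(snp_toy), meta_toy$id), ]
stopifnot(!anyNA(meta_toy$id))  # every isolate needs a metadata row

# PCA on the centered SNP matrix.
pca <- prcomp(snp_toy)

# Combine the first two principal components with the metadata.
scores <- data.frame(
  id   = meta_toy$id,
  host = meta_toy$host,
  PC1  = unname(pca$x[, "PC1"]),
  PC2  = unname(pca$x[, "PC2"])
)

# Color the points by the metadata feature of interest.
plot(scores$PC1, scores$PC2,
     col = as.integer(factor(scores$host)), pch = 19,
     xlab = "PC1", ylab = "PC2")
text(scores$PC1, scores$PC2, labels = scores$id, pos = 3)
```

Matching the metadata to the matrix rows before plotting matters: if the two are in different orders, the colors end up attached to the wrong isolates without any error being raised.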
Tanglegrams are co-phylogeny plots, a very powerful visualization tool for examining co-evolution. Here is a tutorial on how to make them in R.
A tanglegram is a representation of co-phylogeny in which two phylogenetic trees are linked. This method is super useful for visualizing common traits shared by both trees. For example, it can be used to visualize host-pathogen (or host-symbiont) evolution and to see whether there is any phylogenetic concordance between the two trees.
I needed to visualize the co-phylogeny of phylogenetic trees reconstructed from chromosomal and symbiotic genes. Surprisingly, I didn’t find any straightforward solution in R for drawing a tanglegram. In particular, I wanted to leverage the beautiful ggtree library. After trying out several methods, I found that the following approach has worked well for me so far.
In this post, I’m going to use two toy trees in the following Newick format. Note that they have the same isolates but different tree topologies (since supposedly different gene sets were used to reconstruct them).
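For readers who want something to run straight away, here is a minimal sketch of the idea. It is not the ggtree approach this post goes on to describe: it swaps in `cophyloplot` from the ape package, and the two Newick strings and the helper `tip_labels` are hypothetical stand-ins invented for this example.

```r
# Hypothetical Newick strings: the same four isolates, different topologies.
nwk_chromosome <- "((A,B),(C,D));"
nwk_symbiosis  <- "((A,C),(B,D));"

# Pull tip labels out of a simple Newick string
# (assumes no branch lengths, node labels, or quoted names).
tip_labels <- function(nwk) {
  trimws(strsplit(gsub("[();]", "", nwk), ",")[[1]])
}

tips <- tip_labels(nwk_chromosome)
# Both trees must contain the same isolates for a one-to-one tanglegram.
stopifnot(setequal(tips, tip_labels(nwk_symbiosis)))

# Association matrix: each isolate in tree 1 is linked to itself in tree 2.
assoc <- cbind(tips, tips)

# Draw the tanglegram only if the ape package is installed.
if (requireNamespace("ape", quietly = TRUE)) {
  tree1 <- ape::read.tree(text = nwk_chromosome)
  tree2 <- ape::read.tree(text = nwk_symbiosis)
  ape::cophyloplot(tree1, tree2, assoc = assoc, space = 30)
}
```

Crossing lines between the two trees mark isolates whose placement differs between the gene sets; parallel lines indicate concordance.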
I reviewed a paper by Sharp and Hahn (2011) on YouTube for my students in the Evolution course (I am a TA for Spring 2020). The paper is a thorough review of the fascinating origin of HIV and of how AIDS became a pandemic.
Here’s the YouTube video for your interest.
Monkey origin of HIV epidemic
This is a fascinating paper that discusses the origin of HIV. In 1981, several young homosexual men died of a mysterious disease. They had rare opportunistic infections caused by apparently harmless microbes. Later, the virus responsible was named HIV, or Human Immunodeficiency Virus, and the disease was recognized as Acquired Immune Deficiency Syndrome, or AIDS. In 1986, morphologically similar but antigenically different viruses were isolated from other AIDS patients in Africa. A later study found that these viruses are genetically very similar to a simian immunodeficiency virus (SIV) that causes immunodeficiency disease in captive macaques.
Scientists then searched for SIVs in other primates from Africa. They found that African primates harbor SIVs that are non-pathogenic to their hosts. Phylogenetically, HIV is very similar to these SIVs, yet it is pathogenic to humans. Scientists later figured out that some SIVs crossed the species boundary from non-human primates to humans several times, giving rise to HIV. HIV is not a single virus: it is a collection of similar viral strains. There are two big classes, HIV-1 and HIV-2. HIV-1 is similar to the SIV found in chimpanzees, and HIV-2 is very close to the SIV found in sooty mangabeys.
If SIVs are (mostly) non-pathogenic to non-human primates, what about the macaques: why do they develop AIDS from SIV infection? SIV is not a natural infection of macaques, because they are Asian primates. SIVs occur naturally only in African monkeys and apes.
How a scholarly conflict on FMD virus classification between Bangladeshi and Chinese research groups was resolved.
Scientists form hypotheses, design experiments and publish their results. But not all researchers reach the same conclusion. Hence comes scientific conflict. How do we resolve it? Publish and exchange opinions!
Recently, a scientific debate and exchange of opinions came to my attention. Back in Bangladesh, during and after my MS study in Microbiology at the University of Dhaka, I was affiliated with Dr. M Anwar Hossain’s FMDV research project (Microbial Genetics and Bioinformatics Lab). In that project, Dr. Hossain led the research on foot-and-mouth disease virus detection and vaccine development. Briefly, foot-and-mouth disease is a very dangerous viral disease of cattle and causes losses of several million dollars to the agro-veterinary economy of Bangladesh.
FMDV is an RNA virus, and it evolves very fast because of its high mutation rate. Classifying a newly emerged FMD virus can be a complex task. I co-authored a paper (Siddique et al. 2018) in which the research group detected two novel sublineages of FMDV, namely Ind2001BD1 and BD2. We used a distance-based clustering method (multi-dimensional clustering: novel in this field but widely used for classification) as well as a more traditional phylogenetic method to establish and propose this. At this point, I should mention that there is a world reference laboratory for FMDV characterization (WRLFMD), but the FMDV strains isolated by Siddique et al. 2018 did not fall into the classification it maintains.
A few tips for new graduate students on shifting from an undergrad mindset to becoming an independent researcher.
If you are a new graduate student, the first year can be tough because you have a lot to juggle. You have several heavy courses, need to maintain a minimum GPA, have to work as a teaching assistant (TA), probably rotate through labs and select your Ph.D. supervisor, learn new skills in the laboratory, and settle into a new city (a new country for international students like me), explore the new campus, make new friends and cook your own food. That’s a lot for the first year.
While it is easy to become overwhelmed by these requirements, a new graduate student must not forget the big idea of Ph.D. research. Many new graduate students are fresh Bachelor’s or Master’s degree holders who approach the Ph.D. program like undergraduate school, which is a mistake. Often there is a gap in guidance and a lack of help in forming a clear conception of what a Ph.D. program is.
[This is a writeup from 2016. Caution! If you are planning to apply to the US for higher study and will be writing your SOP soon, I suggest you do not read this write-up, because it may bias you. And do not use this format for your SOP. Your SOP is supposed to be unique. Better, write your own SOP, then show it to someone experienced for valuable feedback.]
When I was a class 9 student in secondary school, I participated in the national science fair for the first time, with an electronics project, “Determination of gravitational acceleration constant g using digital method”, as a member of the science club Anushandhitshu Chokro. It was awarded 1st prize at the district level and 8th at the national level. From that moment my interest in science started to flourish, which is one of the motivating factors behind why I now want to enroll in a Ph.D. program and get rigorous training in the scientific method.
Sometimes we, as practitioners of a field, professionals or artists, do not feel at ease showing our own work, our working process, or what we are learning, or sharing our work with a more general audience. This book encourages us to do so and provides some guidelines for it, because times have changed in this connected world. The key suggestions are to show your work regularly, keep a good (domain) name, teach people what you are learning, and build a cult of followers while also following and being part of a community of practitioners. A very small book.
I teach an online course, ‘Python/Biopython for Bioinformatics’, at cBLAST. It is a three-month course in which I use biological examples to show how we use Python to handle and analyze biological data. The video lectures are in Bangla, and the videos include both slides and screencasts of coding.
The Center for Bioinformatics Learning Advancement and Systematic Training, or cBLAST, is part of the University of Dhaka, Bangladesh. Participants receive a certificate from the University of Dhaka after successfully completing this 3-month course.
Python is an easy-to-learn, high-level programming language that is used in many computational analyses in Bioinformatics. This course starts by developing initial skills in interactive programming and script writing in Python. Then we’ll cover Biopython, Matplotlib and NumPy. Finally, some algorithmic aspects of programming will be discussed.
Many people ask me about learning Bioinformatics. So, I’m going to put some good learning resources in this note.
If you are a complete beginner, don’t aim to ‘understand’ everything discussed in a course, lecture or book. It’s okay to be partially ignorant while still moving forward. Try to go through 60-70% of the content of the following sources within one to two months. The objective at this stage is to get a good understanding of core Bioinformatics concepts and terminology.
1. Bioinformatics Methods I and II, offered by the University of Toronto on the massive open online course (MOOC) platform Coursera.org, has pretty good materials (video+tutorial).
2. On Shikkhok.com, a MOOC platform in the Bengali language, there is a very short course on Bioinformatics, বায়োইনফরমেটিক্স পরিচিতি, offered by Bio-Bio-1 Foundation.
3. Reading books is the best way. I’ve found ‘Essential Bioinformatics’ by Jin Xiong to be an easy-to-understand book.