Do you need to eyeball a FASTQ alignment file, with a feature that highlights the quality score of each base?
Introducing fastqviz, a Streamlit app that does just that. I made it quite some time ago to visualize some amplicon data for my own project.
You can upload your FASTQ file. The fastqviz viewer shows the alignment, with color highlighting the quality of each base (pink is high quality, dark is low). Scroll sideways to explore the full length of the sequences.
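As a rough illustration of what sits behind the coloring, here is a minimal R sketch of turning a FASTQ quality string into per-base Phred scores. This is not the app's own code (fastqviz is a Streamlit app); it assumes the common Phred+33 encoding, and the function names and the low/medium/high thresholds are made up for the example.

```r
# Convert a FASTQ quality string to Phred scores.
# Assumption: Phred+33 encoding (ASCII code minus 33).
phred_scores <- function(qual, offset = 33L) {
  utf8ToInt(qual) - offset
}

# Hypothetical binning into display classes; the thresholds are illustrative only.
quality_class <- function(scores) {
  cut(scores, breaks = c(-Inf, 19, 29, Inf), labels = c("low", "medium", "high"))
}

q <- phred_scores("II5+#")   # one quality string from a FASTQ record
cls <- quality_class(q)      # one class per base, ready to map to colors
```

Each class could then be mapped to a color, for example pink for "high" and a dark shade for "low".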
I’m seeing a trend that bioscience students from Bangladesh (and South Asia in general) are increasingly interested in publishing papers on computational drug design. Many bright students, undergraduates or fresh graduates, are actually publishing drug-design papers in good journals.
I have also done similar projects. But now, I think it is a bad trend.
While it is completely sound to do computational simulations for drug discovery, most of the published articles I’m seeing seem to be motivated by the wish to get a “publication”, in the hope that these “publications” will open up opportunities for higher studies abroad.
This is understandable: since our universities in Bangladesh do not provide enough good opportunities to gain research experience under the supervision of a good mentor, many bright students are inclined to jump into such “do it yourself drug design and publish it” endeavors.
For one of my Ph.D. projects, I had to generate a phylogeny from multi-locus sequence data. I often have to repeat similar analyses and need to go back to the previous workflow to check what I actually did. I’m sharing the protocol here mainly to help my future self, and maybe it will be useful to you too!
Multi-locus sequence analysis (MLSA) uses multiple genes or loci, usually conserved housekeeping genes, to construct phylogenies and perform other sequence-based analyses. Since different genes may have different mutation rates, MLSA generally gives a better approximation of the underlying evolution and a more realistic resolution of phylogenetic relationships among taxa than a single gene does.
It is also a better alternative to ribosomal 16S/ITS-based analysis, especially for many bacterial genera (including Bradyrhizobium, which I worked with), because the 16S/ITS sequences are often very similar within these genera and cannot be used to differentiate species.
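The central data step in MLSA is joining the per-locus alignments into one concatenated alignment (a supermatrix). A minimal base-R sketch of that idea follows. The isolate ids, the locus names (recA, glnII), and the sequences are made up for illustration, and this is not necessarily how the full protocol does it.

```r
# Hypothetical aligned loci: one aligned sequence per isolate, named by isolate id.
recA  <- c(iso1 = "ATGGC", iso2 = "ATGAC", iso3 = "ATGGC")
glnII <- c(iso2 = "TTAGC", iso1 = "TTAGG", iso3 = "TTCGG")
loci <- list(recA = recA, glnII = glnII)

# Keep only isolates that are present in every locus.
shared <- Reduce(intersect, lapply(loci, names))

# Concatenate the loci in a fixed order for each shared isolate.
supermatrix <- vapply(
  shared,
  function(id) paste0(vapply(loci, function(x) x[[id]], character(1)), collapse = ""),
  character(1)
)
```

Matching by isolate id rather than by position matters here, because the loci are rarely stored in the same isolate order.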
After the human genome was sequenced in the 2000s, much of our DNA, more than half of it, was considered unnecessary. It was called “junk DNA”: “broken genes” trapped in the genome’s prison and the silenced DNA fossils of viruses. It was thought that these degraded DNA elements had no purpose and were unrelated to changes.
However, research over the past decade has shown that some of this “junk DNA” is not entirely useless. Protein-coding genes make up a tiny fraction of the genome, only about 2 percent. It is now known that some so-called “non-coding” DNA helps regulate the expression of different genes. However, whether these regulatory elements are essential or potentially harmful to the body is still debated among scientists.
In 2012, research from the ENCODE project reported that non-coding DNA, previously considered junk, actually performs important functions, which led to controversy and discussion. Since then, there has been further research on non-coding DNA. In this article, we will look at some of the new research that does not support the concept of “junk DNA.”
Junk DNA can be essential in mammalian development
Researchers from the University of California, Berkeley, and Washington University have studied transposons. Transposons are selfish DNA elements that integrate into various parts of the genome. One specific family of transposons is derived from the DNA of different ancient viruses. When these viruses infect a host, their DNA elements become integrated into the genome of certain individuals. Over the long co-evolution of the host genome and the viral genome, these “foreign” elements evolved into transposons. This research suggests that virus-derived transposons are essential for the survival of certain hosts. When the researchers removed these transposons from the genomes of mice, more than half of the mice died before birth.
Figure: Only 1.5-2% of the human genome consists of protein-coding genes. So what is the rest?
Last year I collaborated with researchers from several institutions in Bangladesh, led by BCSIR, on a study of the severe dengue outbreak of 2021. The study was recently published in the BMC Virology Journal.
Dengue remains a dangerous endemic disease that causes many deaths and much suffering in Bangladesh.
My contribution was to analyze the phylodynamics of the Dengue virus (DENV). The DENV3 serotype dominated the 2021 outbreak. We confirmed a previously reported genotype shift (DENV3-II to DENV3-I).
The genotype shift presumably happened in Bangladesh right after the 2006–2009 DENV3 outbreak (the tMRCA of the genotype I DENV3 sequences from Bangladesh was 2013).
Another interesting observation: the 2021 epidemic isolates, which share strong similarities, are well separated from the 2017 epidemic isolates in Bangladesh, with two DENV3 isolates sampled in 2019 falling between them. Interestingly, these two isolates, 19XN13542 and 19XN14065, were sampled in China from travelers returning from Bangladesh.
Figure: Phylodynamic analysis showed a clade shift in DENV3 in Bangladesh. Source: Original paper.
So what is going on? A transboundary movement of the virus? Only further research can tell. But this highlights the importance of genomic surveillance for these infectious diseases.
According to IEDCR, this year’s cases also show a high prevalence of DENV3 (>60%), followed by the other serotypes.
How to estimate the fixation index, FST, to test for population differentiation in R
I was looking for a tutorial on estimating FST for my microbial (i.e. haploid) dataset, but I could not find any specific instructions, even though other researchers are clearly looking for them too (see here). One respondent even said that “there is no way to estimate Fst for microbial data”. In the end, however, I found a “way” to estimate FST, and here is a quick R tutorial.
The fixation index, otherwise known as FST, is a measure of population differentiation due to genetic structure and is one of the fundamental concepts in population genetics. Essentially, FST is the proportion of the total genetic variance contained in a subpopulation (the S subscript) relative to the total genetic variance (the T subscript). Its value ranges from 0 to 1. A higher FST indicates a considerable degree of differentiation among populations. The figure below illustrates what it means for two populations to have very different genetic structures.
(Left) The two sub-populations have the same genotype composition, hence FST is 0. (Right) If the two sub-populations have completely different genotype compositions, FST will be 1.
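To make the two extremes concrete, here is a minimal base-R sketch for a single haploid locus. It uses a Nei-style formulation, FST = (HT - HS) / HT, where HT is the gene diversity of the pooled sample and HS is the size-weighted mean gene diversity within sub-populations. This is an assumption chosen for illustration, not necessarily the estimator used in the full tutorial, and the function names and toy data are made up.

```r
# Gene diversity (expected heterozygosity) at one locus: 1 - sum(p_i^2).
gene_diversity <- function(alleles) {
  p <- table(alleles) / length(alleles)
  1 - sum(p^2)
}

# Nei-style FST for one haploid locus (illustrative; no sample-size correction).
fst_haploid <- function(alleles, pop) {
  h_t <- gene_diversity(alleles)                      # total diversity
  if (h_t == 0) return(NA_real_)                      # monomorphic locus: undefined
  w <- as.numeric(table(pop)) / length(pop)           # sub-population weights
  h_within <- as.numeric(tapply(alleles, pop, gene_diversity))
  h_s <- sum(w * h_within)                            # mean within-population diversity
  (h_t - h_s) / h_t
}

pop <- rep(c("pop1", "pop2"), each = 4)

# Same allele composition in both sub-populations.
fst_same <- fst_haploid(c("A", "A", "T", "T", "A", "A", "T", "T"), pop)

# Each sub-population fixed for a different allele.
fst_diff <- fst_haploid(c("A", "A", "A", "A", "T", "T", "T", "T"), pop)
```

Here `fst_same` evaluates to 0 and `fst_diff` to 1, matching the left and right panels described above. For real data, estimates are usually averaged over many loci and corrected for sample size.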
Sometimes you need to visualize SNP data from a small FASTA file. PCA is a great way to look for any distribution pattern.
It’s common to do principal component analysis (PCA) on large SNP datasets in VCF format. However, sometimes you want a simpler analysis for a very small dataset. Let’s say you have a gene alignment (or a supermatrix alignment of several genes) and want to visualize the SNP structure in the sequence alignment file using PCA. In this post, I’ll show you how to do just that.
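library("adegenet")  # provides fasta2genlight()
library("ggplot2")   # for plotting
# Extract SNPs from the FASTA alignment
snp <- fasta2genlight('~/path/to/fasta', snpOnly = TRUE)
# Metadata table (comma-separated, with a header row)
meta <- read.table('~/path/to/meta', sep = ',', header = TRUE)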
fasta2genlight is a nice function from the adegenet library that extracts SNPs from an alignment. It also detects the ploidy of your dataset, but do not forget to check the ploidy with `snp$ploidy`.
We load our sequence alignment file with this function, and we also load a meta file that contains the sequence features we are interested in visualizing in the PCA. The meta file must have a column of isolate ids that match the FASTA ids. The meta file may look like the following:
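The original example table is not reproduced here, so the layout below is only a hypothetical stand-in: an `id` column matching the FASTA headers plus one feature column (`host`). The sketch also shows the remaining steps on a toy 0/1 SNP matrix, using base R's `prcomp` in place of adegenet so that it runs on its own. All ids, hosts, and SNP values are made up.

```r
# Hypothetical metadata: one row per isolate, ids matching the FASTA headers.
meta <- data.frame(
  id   = c("iso1", "iso2", "iso3", "iso4", "iso5", "iso6"),
  host = c("soybean", "soybean", "soybean", "peanut", "peanut", "peanut")
)

# Toy haploid SNP matrix (rows = isolates, columns = SNP sites, alleles coded 0/1).
snp_mat <- matrix(
  c(0, 0, 1, 0,
    0, 0, 1, 1,
    0, 1, 1, 0,
    1, 1, 0, 1,
    1, 1, 0, 0,
    1, 0, 0, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(meta$id, paste0("snp", 1:4))
)

# PCA on the centered SNP matrix.
pca <- prcomp(snp_mat, center = TRUE, scale. = FALSE)

# Join the first two principal components with the metadata by isolate id.
scores <- data.frame(id = rownames(pca$x), pca$x[, 1:2])
pca_df <- merge(scores, meta, by = "id")

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

With ggplot2 loaded, `ggplot(pca_df, aes(PC1, PC2, colour = host)) + geom_point()` would draw the scatter plot. For a real genlight object, adegenet's `glPca` is the more usual route than `prcomp`.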
Tanglegrams depict co-phylogeny and are a very powerful visualization tool for examining co-evolution. Here is a tutorial on how to make them in R.
A tanglegram is a representation of co-phylogeny in which two phylogenetic trees are linked. This method is super useful for visualizing traits shared by both trees. For example, it can be used to visualize host-pathogen (or host-symbiont) evolution and to see whether there is any phylogenetic concordance between the two trees.
I needed to visualize the co-phylogeny of phylogenetic trees reconstructed from chromosomal and symbiotic genes. Surprisingly, I didn’t find any straightforward solution in R for drawing a tanglegram. In particular, I wanted to leverage the beautiful ggtree library. After trying out several methods, I found that the following approach works well for me so far.
In this post, I’m going to use two toy trees with the following Newick format. Note that they contain the same isolates but have different topologies (since supposedly different gene sets were used to reconstruct them).
I have reviewed a paper by Sharp and Hahn (2011) on YouTube for my students in the Evolution course (I am a TA for Spring 2020). This paper is a complete review of the fascinating origin of HIV and of how AIDS became a pandemic.
Here’s the YouTube video for your interest.
Monkey origin of HIV epidemic
This is a fascinating paper that discusses the origin of HIV. In 1981, several young homosexual men died of a mysterious disease. They had rare opportunistic infections caused by normally harmless microbes. Later, the causative virus was named HIV, or Human Immunodeficiency Virus, and the disease was recognized as Acquired Immune Deficiency Syndrome, or AIDS. In 1986, morphologically similar but antigenically different viruses were isolated from other AIDS patients in Africa. A later study found that these viruses are genetically very similar to a simian immunodeficiency virus (SIV) that causes immunodeficiency disease in captive macaques.
Scientists then searched for SIVs in other primates from Africa. They found that African primates harbor SIVs that are non-pathogenic to their hosts. Phylogenetically, HIV is very similar to these SIVs, yet it is pathogenic to humans. Scientists later figured out that some SIVs crossed the species boundary from primates to humans several times, giving rise to HIV. HIV is not a single virus; it is a collection of similar viral strains. There are two big classes: HIV-1 and HIV-2. HIV-1 is similar to the SIV of chimpanzees, and HIV-2 is very close to the SIV of the sooty mangabey.
If SIVs are (mostly) non-pathogenic to non-human primates, what about the macaques: why do they develop AIDS from SIV infection? Macaques are not natural hosts of SIV, as they are Asian primates. SIVs occur naturally only in African monkeys and other African primates.
How a scholarly conflict on FMD virus classification between Bangladeshi and Chinese research groups was resolved.
Scientists form hypotheses, design experiments, and publish their results. But not all researchers reach the same conclusion. Hence scientific conflicts arise. How to resolve them? Publish and exchange opinions!
Recently, one such scientific debate and exchange of opinions came to my attention. Back in Bangladesh, during and after my MS in Microbiology at the University of Dhaka, I was affiliated with Dr. M Anwar Hossain’s FMDV research project (Microbial Genetics and Bioinformatics Lab). In that project, Dr. Hossain led the research on foot-and-mouth disease virus detection and vaccine development. Briefly, foot-and-mouth disease is a very dangerous viral disease of cattle that causes losses of several million dollars in the agro-veterinary economy of Bangladesh.
Figure 3 from Siddique et al. 2018 shows that the proposed sub-lineages Ind2001BD1 and BD2 do not fall into established genotypes. Source: https://onlinelibrary.wiley.com/doi/epdf/10.1111/tbed.12834
FMDV is an RNA virus, and it evolves very fast due to its high mutation rate. Classifying newly emerged FMD viruses can be a complex task. I co-authored a paper (Siddique et al. 2018) in which the research group detected two novel sub-lineages of FMDV, namely Ind2001BD1 and BD2. We used a distance-based clustering method (multi-dimensional clustering: novel in this field but widely used for classification) as well as a more traditional phylogenetic method to establish and propose this. At this point, I should mention that there is a world reference laboratory for FMDV characterization (WRLFMD), but the FMDV strains isolated by Siddique et al. 2018 did not fall into the classification it maintains.