Chapter 12: Dealing with genomic feature in Biopython

This is the Chapter 12 of Py4Bio.

BioPython has a vast array of modules, specializing in different aspect of Bioinformatics data. It can handle alignments, do BLAST and Entrez search, parse phylogenetic (newick), protein-structure (PDB), general feature format (GFF) and many other formats. It is out of scope to cover these formats in this chapter.

However, I want to focus on parsing and GenBank files. My personal observation is, GenBank files (Gbk/Gbff) are one of the most complex and information-loaded file format, particularly when you work with genome annotations. In this chapter, we will first get an overview of GenBank file format, and then we will explore Biopython commands that allow us to parse and play with different parts of GenBank entry.

(GFF, or general feature format is also another important and widely used format in genomics. I will add the BioPython tutorial for handling and parsing GFF file later.)

Anatomy of GenBank format

GenBank is one of the most comprehensive repositories of publicly available nucleotide sequences and associated annotations.

The GenBank file format is structured to represent annotated DNA or RNA sequences with rich metadata.

GenBank is organized into several clearly defined sections:

Header: Information about the sequence (meta)
FEATURES: Actual functional annotations
ORIGIN: The nucleotide/amino acid sequence.

Let’s take a look at an annotated excerpt from a GenBank file for a plasmid named pBR322:

A. Header

The Header section contains mainly metadata regarding the database entry in question. For example, here the LOCUS line is defining the identity of the entry:

LOCUS       SCU49845     57 aa         linear BCT 20-JUN-2018

LOCUS id: SCU49845
Length: 57
Type: aa (amino acid sequence)
Topology: Linear here
Taxonomic division: BCT
Date of submission: 20-JUN-2018

DEFINITION  int gene activator RinB [Staphylococcus aureus].

A concise description of the sequence, often includes gene names, functions, and completeness.

ACCESSION   SCU49845
VERSION     SCU49845.1

ACCESSION: Unique identifier for the sequence int the Database.

VERSION: There can be multiple version, specially if the records need to be updated or corrected after initial submission. Accession + version number.

DBLINK      BioProject: PRJEB2655
            BioSample: SAMEA1708709
DBSOURCE    embl accession FMRH01000027.1

Each GenBank entries are generally part of a unique project, and belongs to a particular sample from which it’s coming from. For example, BioProject indicates the project and BioSample is the id for the sample itself. Each BioSample can have different data, for example SRA (raw reads), GEO (RNA-Seq expression), Genome (Genome Sequencnig) etc. DBLINK essentially cross-refer this GenBank entry with other NCBI associated database entries.

SOURCE      Staphylococcus aureus
  ORGANISM  Staphylococcus aureus
            Bacteria; Bacillati; Bacillota; Bacilli; Bacillales;
            Staphylococcaceae; Staphylococcus.

Provides the scientific name and taxonomic lineage.

REFERENCE   1
  CONSRTM   Pathogen Informatics
  TITLE     Direct Submission
  JOURNAL   Submitted (28-SEP-2016) WTSI, Pathogen Informatics, Wellcome Trust
            Sanger Institute, CB10 1SA, United Kingdom

This contains references related to the sequence, such as publications and authors.

B. FEATURES

This is the most complex part of the GenBank where it contains the annotations associated with the DNA/amino-acid sequence. We will dive into different part of FEATURES more elaborately later, but for now, here is a summary:

FEATUERS annotates biological features using:

Feature key (e.g., gene, CDS, misc_feature)
Location (e.g., 1..57)
Qualifiers (e.g., /gene="TCP1-beta")

The common features keys are usually:

Feature	Description
source	Describes the entire sequence
gene	Gene region
CDS	Coding sequence (translated to protein)
mRNA	Messenger RNA regions
tRNA, rRNA	RNA genes
exon, intron	For eukaryotic genes
misc_feature	Non-standard or miscellaneous annotations

C. ORIGIN

ORIGIN      
        1 mikqilrllf llamyelgky vtehvyimtt anddveapsd faklsdqsdl mraevse

The actual nucleotide sequence, with position numbers and 60 bases per line. The sequence is always shown in lowercase and wraps across multiple lines.

End of file

//

The double-slash marks the end of the GenBank entry.

Qualifiers of FEATURES

Let’s use this file as example: SCU49845

Let’s look into the FEATURES of this GenBank file (SCU49845):

FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     mRNA            <1..>206
                     /product="TCP1-beta"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
     gene            <687..>3158
                     /gene="AXL2"
     mRNA            <687..>3158
                     /gene="AXL2"
                     /product="Axl2p"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
     gene            complement(<3300..>4037)
                     /gene="REV7"
     mRNA            complement(<3300..>4037)
                     /gene="REV7"
                     /product="Rev7p"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ
                     FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD
                     KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR
                     RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK
                     LISGDDKILNGVYSQYEEGESIFGSLF"

Let’s check the FEATUERS from this file. It contains:

source: Entire length of the sequence.
mRNA: 3 mRNA were found in this region.
CDS: Coding sequences, 3 reported.
gene: Reported genes, 3 were found. Notice one of the gene is in the complement strand.

Each features has several entry under it, also known as qualifiers. For example, if we check the first CDS, it has following:

     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

1..206 indicates the location (start to end position of the CDS annotation).

codon_start, product, protein_id, db_xref, translation these are all qualifiers of this particular CDS.

How to parse a GenBank file in Python

Let’s use this file as example: SCU49845

from Bio import SeqIO

for seq_record in SeqIO.parse('yeast_genes.gbk', 'genbank'):
    print(seq_record)

There is only one record in the yeast_genes.gbk file, i.e the record carrying info on SCU49845 locus. So using a for loop is not necessary. We can just read the file using the following:

>>> seq_record = SeqIO.read('yeast_genes.gbk', 'genbank')
>>> print(seq_record)

SeqRecord(seq=Seq('GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAA...ATC'), id='U49845.1', name='SCU49845', description='Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds', dbxrefs=[])

However, this is also not super useful, because it only prints the whole thing. We are interested in the features. Luckily, there is a specific method for finding the features!

seq_record = SeqIO.read('yeast_genes.gbk', 'genbank')

for f in seq_record.features:
    print(f)

Now it is showing us the features!

Usually, the features in Biopython Genabank has following methods:

<Genbank Feature>.id
<Genbank Feature>.location
<Genbank Feature>.qualifiers
<Genbank Feature>.strand
<Genbank Feature>.type

Let’s iterate over the features again and see the methods:

for f in seq_record.features:
    print(f.id)
    print(f.location)
    print(f.type)
    print(f.strand)

Feature type will be the main annotated category, for example: CDS, gene, mRNA, etc.

The strand refers to DNA strand. 1 means forward (+ strand, current one, or template strand) and -1 means the other DNA strand (- strand, complement strand of the current one).

Let’s subset the features by CDS.

for f in seq_record.features:
    if f.type == "CDS":
        print(f.strand, f.location)

Look, we have three CDS and their location here!

Now, let’s say we are interested in the qualifiers of each feature. Qualifiers are stored as a dictionary after parsing in Biopython. Here’s an example.

for f in seq_record.features:
    if f.type == "CDS":
        print(f.qualifiers)

That means, you can now iterate over the keys of the dictionary in the feature-qualifiers. For example, if the gene is reported in the CDS feature qualifiers, then we can look for it:

for f in seq_record.features:
    if f.type == "CDS":
        print(f.qualifiers["gene"])

But we get a key-error because our first CDS does not have gene as qualifier key. There are two ways we can solve this problem. We can use exception handling here:

for f in seq_record.features:
    if f.type == "CDS":
        try:
            print(f.qualifiers["gene"])
        except:
            continue

The way exception handling works, it try to run a code block. If it succeeds, then it does what it supposed to do. If there are some kind of errors (like Key Error here), it will do the alternative block. In this case the alternative block is continue, which tell Pyhon to not to raise error, and move to the next part of the loop.

Another way of achieving the same goal is to check if gene is present in the keys in the first place:

for f in seq_record.features:
    if f.type == "CDS":
        if 'gene' in f.qualifiers:
            print(f.qualifiers["gene"])

So the output here will be:

['AXL2']
['REV7']

Notice, the qualifiers dictionary values are in list. So you might want to use indexing to report the first entry:

for f in seq_record.features:
    if f.type == "CDS":
        if 'gene' in f.qualifiers:
            print(f.qualifiers["gene"][0])

Go to the index of Py4Bio.