This is the Chapter 12 of Py4Bio.
BioPython has a vast array of modules, specializing in different aspect of Bioinformatics data. It can handle alignments, do BLAST and Entrez search, parse phylogenetic (newick), protein-structure (PDB), general feature format (GFF) and many other formats. It is out of scope to cover these formats in this chapter.
However, I want to focus on parsing and GenBank files. My personal observation is, GenBank files (Gbk/Gbff) are one of the most complex and information-loaded file format, particularly when you work with genome annotations. In this chapter, we will first get an overview of GenBank file format, and then we will explore Biopython commands that allow us to parse and play with different parts of GenBank entry.
(GFF, or general feature format is also another important and widely used format in genomics. I will add the BioPython tutorial for handling and parsing GFF file later.)
Anatomy of GenBank format
GenBank is one of the most comprehensive repositories of publicly available nucleotide sequences and associated annotations.
The GenBank file format is structured to represent annotated DNA or RNA sequences with rich metadata.
GenBank is organized into several clearly defined sections:
- Header: Information about the sequence (meta)
- FEATURES: Actual functional annotations
- ORIGIN: The nucleotide/amino acid sequence.
Let’s take a look at an annotated excerpt from a GenBank file for a plasmid named pBR322:

A. Header
The Header section contains mainly metadata regarding the database entry in question. For example, here the LOCUS line is defining the identity of the entry:
LOCUS SCU49845 57 aa linear BCT 20-JUN-2018
- LOCUS id: SCU49845
- Length: 57
- Type: aa (amino acid sequence)
- Topology: Linear here
- Taxonomic division: BCT
- Date of submission: 20-JUN-2018
DEFINITION int gene activator RinB [Staphylococcus aureus].
A concise description of the sequence, often includes gene names, functions, and completeness.
ACCESSION SCU49845
VERSION SCU49845.1
ACCESSION: Unique identifier for the sequence int the Database.
VERSION: There can be multiple version, specially if the records need to be updated or corrected after initial submission. Accession + version number.
DBLINK BioProject: PRJEB2655
BioSample: SAMEA1708709
DBSOURCE embl accession FMRH01000027.1
Each GenBank entries are generally part of a unique project, and belongs to a particular sample from which it’s coming from. For example, BioProject indicates the project and BioSample is the id for the sample itself. Each BioSample can have different data, for example SRA (raw reads), GEO (RNA-Seq expression), Genome (Genome Sequencnig) etc. DBLINK essentially cross-refer this GenBank entry with other NCBI associated database entries.
SOURCE Staphylococcus aureus
ORGANISM Staphylococcus aureus
Bacteria; Bacillati; Bacillota; Bacilli; Bacillales;
Staphylococcaceae; Staphylococcus.
Provides the scientific name and taxonomic lineage.
REFERENCE 1
CONSRTM Pathogen Informatics
TITLE Direct Submission
JOURNAL Submitted (28-SEP-2016) WTSI, Pathogen Informatics, Wellcome Trust
Sanger Institute, CB10 1SA, United Kingdom
This contains references related to the sequence, such as publications and authors.
B. FEATURES
This is the most complex part of the GenBank where it contains the annotations associated with the DNA/amino-acid sequence. We will dive into different part of FEATURES more elaborately later, but for now, here is a summary:
FEATUERS annotates biological features using:
- Feature key (e.g.,
gene
,CDS
,misc_feature
) - Location (e.g.,
1..57
) - Qualifiers (e.g.,
/gene="TCP1-beta"
)
The common features keys are usually:
Feature | Description |
---|---|
source | Describes the entire sequence |
gene | Gene region |
CDS | Coding sequence (translated to protein) |
mRNA | Messenger RNA regions |
tRNA, rRNA | RNA genes |
exon, intron | For eukaryotic genes |
misc_feature | Non-standard or miscellaneous annotations |
C. ORIGIN
ORIGIN
1 mikqilrllf llamyelgky vtehvyimtt anddveapsd faklsdqsdl mraevse
The actual nucleotide sequence, with position numbers and 60 bases per line. The sequence is always shown in lowercase and wraps across multiple lines.
End of file
//
The double-slash marks the end of the GenBank entry.
Qualifiers of FEATURES
Let’s use this file as example: SCU49845
Let’s look into the FEATURES of this GenBank file (SCU49845):
FEATURES Location/Qualifiers
source 1..5028
/organism="Saccharomyces cerevisiae"
/mol_type="genomic DNA"
/db_xref="taxon:4932"
/chromosome="IX"
mRNA <1..>206
/product="TCP1-beta"
CDS <1..206
/codon_start=3
/product="TCP1-beta"
/protein_id="AAA98665.1"
/db_xref="GI:1293614"
/translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
AEVLLRVDNIIRARPRTANRQHM"
gene <687..>3158
/gene="AXL2"
mRNA <687..>3158
/gene="AXL2"
/product="Axl2p"
CDS 687..3158
/gene="AXL2"
/note="plasma membrane glycoprotein"
/codon_start=1
/product="Axl2p"
/protein_id="AAA98666.1"
/db_xref="GI:1293615"
/translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
VDFSNKSNVNVGQVKDIHGRIPEML"
gene complement(<3300..>4037)
/gene="REV7"
mRNA complement(<3300..>4037)
/gene="REV7"
/product="Rev7p"
CDS complement(3300..4037)
/gene="REV7"
/codon_start=1
/product="Rev7p"
/protein_id="AAA98667.1"
/db_xref="GI:1293616"
/translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ
FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD
KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR
RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK
LISGDDKILNGVYSQYEEGESIFGSLF"
Let’s check the FEATUERS from this file. It contains:
- source: Entire length of the sequence.
- mRNA: 3 mRNA were found in this region.
- CDS: Coding sequences, 3 reported.
- gene: Reported genes, 3 were found. Notice one of the gene is in the complement strand.
Each features has several entry under it, also known as qualifiers. For example, if we check the first CDS, it has following:
CDS <1..206
/codon_start=3
/product="TCP1-beta"
/protein_id="AAA98665.1"
/db_xref="GI:1293614"
/translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
AEVLLRVDNIIRARPRTANRQHM"
1..206
indicates the location (start to end position of the CDS annotation).
codon_start
, product
, protein_id
, db_xref
, translation
these are all qualifiers of this particular CDS.
How to parse a GenBank file in Python
Let’s use this file as example: SCU49845
from Bio import SeqIO
for seq_record in SeqIO.parse('yeast_genes.gbk', 'genbank'):
print(seq_record)
There is only one record in the yeast_genes.gbk file, i.e the record carrying info on SCU49845 locus. So using a for
loop is not necessary. We can just read the file using the following:
>>> seq_record = SeqIO.read('yeast_genes.gbk', 'genbank')
>>> print(seq_record)
SeqRecord(seq=Seq('GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAA...ATC'), id='U49845.1', name='SCU49845', description='Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds', dbxrefs=[])
However, this is also not super useful, because it only prints the whole thing. We are interested in the features. Luckily, there is a specific method for finding the features!
seq_record = SeqIO.read('yeast_genes.gbk', 'genbank')
for f in seq_record.features:
print(f)
Now it is showing us the features!
Usually, the features in Biopython Genabank has following methods:
<Genbank Feature>.id
<Genbank Feature>.location
<Genbank Feature>.qualifiers
<Genbank Feature>.strand
<Genbank Feature>.type
Let’s iterate over the features again and see the methods:
for f in seq_record.features:
print(f.id)
print(f.location)
print(f.type)
print(f.strand)
Feature type will be the main annotated category, for example: CDS, gene, mRNA, etc.
The strand refers to DNA strand. 1 means forward (+ strand, current one, or template strand) and -1 means the other DNA strand (- strand, complement strand of the current one).
Let’s subset the features by CDS.
for f in seq_record.features:
if f.type == "CDS":
print(f.strand, f.location)
Look, we have three CDS and their location here!
Now, let’s say we are interested in the qualifiers of each feature. Qualifiers are stored as a dictionary after parsing in Biopython. Here’s an example.
for f in seq_record.features:
if f.type == "CDS":
print(f.qualifiers)
That means, you can now iterate over the keys of the dictionary in the feature-qualifiers. For example, if the gene
is reported in the CDS feature qualifiers, then we can look for it:
for f in seq_record.features:
if f.type == "CDS":
print(f.qualifiers["gene"])
But we get a key-error because our first CDS does not have gene
as qualifier key. There are two ways we can solve this problem. We can use exception handling here:
for f in seq_record.features:
if f.type == "CDS":
try:
print(f.qualifiers["gene"])
except:
continue
The way exception handling works, it try to run a code block. If it succeeds, then it does what it supposed to do. If there are some kind of errors (like Key Error here), it will do the alternative block. In this case the alternative block is continue
, which tell Pyhon to not to raise error, and move to the next part of the loop.
Another way of achieving the same goal is to check if gene
is present in the keys in the first place:
for f in seq_record.features:
if f.type == "CDS":
if 'gene' in f.qualifiers:
print(f.qualifiers["gene"])
So the output here will be:
['AXL2']
['REV7']
Notice, the qualifiers dictionary values are in list. So you might want to use indexing to report the first entry:
for f in seq_record.features:
if f.type == "CDS":
if 'gene' in f.qualifiers:
print(f.qualifiers["gene"][0])
Leave a Reply