Biopython is a library for the Python programming language that assists with processing biological data.
It consists of several modules, some covering common operations and some more specialized.
Before learning more about Biopython, let's get some idea of Object Oriented Programming (OOP).
Biopython is object-oriented, so some knowledge of OOP helps in understanding how Biopython works.
Object Oriented Programming
OOP is a way of organizing data, and the methods that work on that data, into a coherent package. This helps structure and organize code.
Classes and objects
A class:
- is a user-defined type
- is a mold for creating objects
- specifies how an object can contain and process data
- represents an abstraction or a template for how an object of that class will behave
An object is an instance of a class.
All objects have a type, which shows which class they were made from.

Attributes and methods

Class and Instances
Class and object example
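Class: Seq
Seq has:
- attribute length
- method translate
An object of the class Seq is created like this:
myseq = Seq("ATGGCCG")
Get sequence length:
myseq.length
Get sequence translation:
myseq.translate()
Note: this is a simplified illustration. Biopython's own Seq class has no length attribute; there the length is obtained with len(myseq).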
OOP Summary
An object has to be instantiated, i.e. created, to exist.
Every object has a certain type, i.e. is of a certain class.
The class decides which attributes and methods an object has.
Attributes and methods are accessed using . after the object variable name.
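The summary above can be sketched with a toy class. ToySeq is a hypothetical name used only for illustration; it is not part of Biopython, and its transcribe method is a deliberately minimal stand-in for real sequence handling.

```python
class ToySeq:
    """A hypothetical, stripped-down sequence class (not Biopython's Seq)."""

    def __init__(self, letters):
        # Attributes: data stored on each individual object.
        self.letters = letters
        self.length = len(letters)

    def transcribe(self):
        # Method: behaviour defined once in the class, shared by all objects.
        return ToySeq(self.letters.replace("T", "U"))


# Instantiate two independent objects from the same class.
first = ToySeq("ATGGCCG")
second = ToySeq("TTT")

print(type(first).__name__)         # the class the object was made from
print(first.length, second.length)  # attribute access with the dot
print(first.transcribe().letters)   # method call with the dot
```

Each object carries its own attribute values, while the method is written once in the class and reused by every instance.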
Explaining OOP with BioPython
Biopython can be installed with pip: pip install biopython
With time, you will come to know many other packages. It is a good idea to familiarize yourself with the documentation of each package you use. For example, the Biopython documentation is available at https://biopython.org.
Test your BioPython installation:
>>> import Bio
>>> print(Bio.__version__)
Biopython functionality and tools
Biopython provides tools to parse bioinformatics files into Python data structures.
Supported formats include:
- BLAST, ClustalW, FASTA
- PubMed and Medline
- ExPASy files
- SwissProt, PDB
Files in the supported formats can be iterated over record by record or indexed and accessed via a dictionary interface.
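As a rough sketch of the two access styles just mentioned, the snippet below parses a small FASTA text held in memory. The record names and sequences are made up for illustration, and StringIO stands in for a real file.

```python
from io import StringIO
from Bio import SeqIO

# Two made-up FASTA records, held in memory so no file is needed.
fasta_text = ">seq1 first toy record\nATGGCC\n>seq2 second toy record\nGGTTAA\n"

# Style 1: iterate record by record.
ids = [record.id for record in SeqIO.parse(StringIO(fasta_text), "fasta")]

# Style 2: load everything into a dictionary keyed by record id.
records_by_id = SeqIO.to_dict(SeqIO.parse(StringIO(fasta_text), "fasta"))

print(ids)
print(records_by_id["seq2"].seq)
```

SeqIO.to_dict keeps every record in memory, so it suits small files; iteration is the safer default for large ones.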
Seq object
A Seq object represents one sequence. It has the following methods:
- translate()
- transcribe()
- complement()
- reverse_complement()
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT')
>>> my_seq.translate()
Seq('STL')
>>> my_seq.transcribe()
Seq('AGUACACUGGU')
>>> my_seq.complement()
Seq('TCATGTGACCA')
What is happening here?
from Bio.Seq import Seq
Bio is a package, and Bio.Seq is a module inside it.
Packages contain modules.
Modules are essentially individual Python files.
We are importing the Seq class from the Bio.Seq module.
We are making an instance of the Seq class and storing it in the my_seq variable.
translate, transcribe, and complement are methods defined in the Seq class.
We are calling those methods on the my_seq instance we just created.
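The relationship between module, class, and instance can be checked from Python itself. This is a small sketch, assuming Biopython is installed; it also calls reverse_complement(), the one listed method that was not demonstrated.

```python
from Bio.Seq import Seq

my_seq = Seq("AGTACACTGGT")

# A class records the name of the module it was defined in.
module_name = Seq.__module__
print(module_name)

# An instance knows which class it was made from.
print(type(my_seq) is Seq)

# Methods are defined on the class and called through the instance.
rev_comp = my_seq.reverse_complement()
print(rev_comp)
```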
A few more examples with Seq
All transcribe() does is switch T to U. The Seq object also includes a back-transcription method.
from Bio.Seq import Seq
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
messenger_rna = coding_dna.transcribe()
cDNA = messenger_rna.back_transcribe()
print(coding_dna)
print(messenger_rna)
print(cDNA)
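To make the "T to U" claim concrete, the following sketch compares transcribe() with a plain string replacement and checks that back-transcription recovers the original DNA. It also translates the same coding sequence; the variable names are illustrative only.

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
messenger_rna = coding_dna.transcribe()

# transcribe() should match a plain T -> U replacement on the text.
same_as_replace = str(messenger_rna) == str(coding_dna).replace("T", "U")

# back_transcribe() should recover the original DNA letters.
round_trip_ok = str(messenger_rna.back_transcribe()) == str(coding_dna)

# Translation works directly on the coding DNA; '*' marks stop codons.
protein = coding_dna.translate()

print(same_as_replace, round_trip_ok)
print(protein)
```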
Seq as a string
Most string methods work on Seq objects.
If a plain string is needed, use str(seq).
>>> from Bio.Seq import Seq
>>> seq = Seq('CCGGGTTAACGTA')
>>> seq[:5]
Seq('CCGGG')
>>> len(seq)
13
>>> seq.lower()
Seq('ccgggttaacgta')
>>> print(seq)
CCGGGTTAACGTA
>>> list(seq)
['C', 'C', 'G', 'G', 'G', 'T', 'T', 'A', 'A', 'C', 'G', 'T', 'A']
>>> mystring = str(seq)
>>> print(mystring)
CCGGGTTAACGTA
>>> type(seq)
<class 'Bio.Seq.Seq'>
>>> type(mystring)
<class 'str'>
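A few more string-style operations, as a sketch of what "most string methods work" can look like in practice. The GC fraction here is computed by hand from count() purely for illustration.

```python
from Bio.Seq import Seq

seq = Seq("CCGGGTTAACGTA")

g_count = seq.count("G")      # how many G letters
position = seq.find("TTAA")   # index of the first match, or -1 if absent
has_motif = "GGG" in seq      # membership test, as with str

# GC fraction from simple counts.
gc_fraction = (seq.count("G") + seq.count("C")) / len(seq)

# Convert explicitly when a real str is required, e.g. for concatenation.
as_text = "sequence: " + str(seq)

print(g_count, position, has_motif, round(gc_fraction, 3))
print(as_text)
```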
Introduction to SeqRecord
Seq contains the sequence. But sequences often come with a lot more.
SeqRecord = Seq + metadata
Main attributes: id and seq
But it may have additional attributes:
- name – Sequence name, e.g. gene name (string)
- description – Additional text, imagine fasta description (string)
- dbxrefs – List of database cross references (list of strings)
- features – Any (sub)features defined (list of SeqFeature objects)
- annotations – Further information about the whole sequence (dictionary). Most entries are strings or lists of strings.
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
seq = Seq('CCGGGTTAACGTA')
seq_record_1 = SeqRecord(seq, id='01')
print(seq_record_1)
# Or, set the attributes after creating the record
seq_record_1 = SeqRecord(seq)
seq_record_1.id = "01"
seq_record_1.description = "toxic membrane protein"
print(seq_record_1)
# Or, pass all the metadata to the constructor
seq_record_1 = SeqRecord(Seq('CCGGGTTAACGTA'), id='YP_025292.1', name='HokC', description='toxic membrane protein', dbxrefs=[])
print(seq_record_1)
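Once a SeqRecord exists, its metadata is read back through the same attribute names. The sketch below also shows the annotations dictionary and the placeholder id that a record gets when none is supplied; the annotation key and value are illustrative only.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(Seq("CCGGGTTAACGTA"), id="YP_025292.1", name="HokC",
                   description="toxic membrane protein")

# annotations is an ordinary dictionary; this entry is a made-up example.
record.annotations["note"] = "toy example"

# A record created without metadata falls back to placeholder values.
bare = SeqRecord(Seq("ACGT"))

print(record.id, record.name, len(record))
print(record.annotations)
print(bare.id)
```

Because attributes are plain Python attributes, a misspelled name on assignment creates a new, unused attribute rather than raising an error, so it is worth printing the record to check the result.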
SeqIO: Another important module of BioPython
We rarely type sequences directly into a program. Instead, we read them from files in different formats (FASTA, GenBank, etc.).
SeqIO provides tools to retrieve sequences as SeqRecord objects and to write SeqRecord objects to a file.
Reading: parse(file_handle, format)
Writing: write(SeqRecord(s), file_handle, format)
Read/write common biological file formats:
- FASTA, GenBank, EMBL
- FASTQ (sequencing reads)
- PDB (protein structures; SeqIO reads the sequences only)
- PHYLIP, Nexus (phylogenetic data)
GFF and BED (genomic features) are not SeqIO formats and need other tools.
from Bio import SeqIO
SeqIO.parse()
Reads in sequence data as SeqRecord objects
It expects two arguments. The first is an object (called a handle) to read the data from. It can be:
- a file opened for reading
- the output from a command line program
- data downloaded from the internet
A filename given as a string is also accepted.
The second argument is a lower case string specifying the sequence format.
The object returned by SeqIO.parse() is an iterator which returns SeqRecord objects.
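Because the result is an iterator, records are produced one at a time and can only be consumed once. A minimal sketch, using made-up FASTA text in a StringIO in place of a file:

```python
from io import StringIO
from Bio import SeqIO

fasta_text = ">a\nACGT\n>b\nGGCC\n>c\nTTAA\n"
records = SeqIO.parse(StringIO(fasta_text), "fasta")

first = next(records)      # pull a single record
remaining = list(records)  # consume whatever is left
leftover = list(records)   # nothing remains after that

print(first.id)
print([r.id for r in remaining])
print(leftover)
```

If the records are needed more than once, wrap the call in list() up front or parse the source again.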
from Bio import SeqIO
file_handle = "P53.fasta"  # a filename works too; notice we are not using open()
for seq_rec in SeqIO.parse(file_handle, "fasta"):
    print(seq_rec.id)
    print(seq_rec.seq)
    print(len(seq_rec.seq))
Writing files
from Bio import SeqIO
seq_list = [ ... ] # this should be a list of SeqRecords
SeqIO.write(seq_list, "seq_list.fasta", "fasta")
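For a version that runs end to end, the sketch below builds two hypothetical SeqRecord objects, writes them as FASTA to an in-memory buffer instead of a file, and reads them back. The ids and sequences are invented for the example.

```python
from io import StringIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records; an empty description keeps the FASTA header short.
seq_list = [
    SeqRecord(Seq("ATGGCC"), id="toy1", description=""),
    SeqRecord(Seq("GGTTAA"), id="toy2", description=""),
]

out = StringIO()
n_written = SeqIO.write(seq_list, out, "fasta")  # returns the record count

# Parse the text we just produced to confirm the round trip.
read_back = [str(r.seq) for r in SeqIO.parse(StringIO(out.getvalue()), "fasta")]

print(n_written)
print(out.getvalue())
```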
Example Use Case: Fetching a sequence from NCBI
from Bio import Entrez, SeqIO
Entrez.email = "your@email.com"  # replace with your own address
handle = Entrez.efetch(db="nucleotide",
                       id="NM_001301717", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")  # read() expects exactly one record
handle.close()
print(record.description)
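The example above uses SeqIO.read rather than SeqIO.parse. As an offline sketch of the difference, read returns a single SeqRecord and raises ValueError when the input does not hold exactly one record. The FASTA text here is made up.

```python
from io import StringIO
from Bio import SeqIO

one_record = ">only\nACGT\n"
two_records = ">first\nACGT\n>second\nGGCC\n"

# Exactly one record: read() returns it directly, no loop needed.
record = SeqIO.read(StringIO(one_record), "fasta")

# More than one record: read() refuses and raises ValueError.
try:
    SeqIO.read(StringIO(two_records), "fasta")
    read_failed = False
except ValueError:
    read_failed = True

print(record.id, read_failed)
```

Use parse when a source may contain several records and read when a single record is expected, as with one accession fetched from NCBI.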