Py4Bio Chapter 6: Dictionaries

This is the Chapter 6 of Py4Bio.

A dictionary is a table of items.

Each item has a “key” and a “value”

Also known as hash table.

KeysValues
EnglishGood morning
SpanishBuenas días
SwedishGod morgon
GermanGuten morgen
VendaNdi matscheloni
AfrikaansGoeie môre

I want to know “Good morning” in Swedish. I get the “Good Morning” dictionary, search for “Swedish” key, and get the value!

>>> good_morning_dict = {
...     "English": "Good morning",
...     "Swedish": "God morgon",
...     "German": "Guten morgen",
...     "Venda": "Ndi matscheloni",
... }
>>> print(good_morning_dict["Swedish"])
God morgon

How to create a dictionary:

>>> D1 = {}
>>> len(D1)
0

>>> D2 = {"name": "Andrew", "age": 33}
>>> len(D2)
2
>>> D2["name"]
'Andrew'
>>> D2["age"]
33

>>> D2["AGE"]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'AGE'

Adding new elements:

>>> my_sister = {}

>>> my_sister["name"] = "Christy"
>>> print("len =", len(my_sister), "and value is", my_sister)
len = 1 and value is {'name': 'Christy’}

>>> my_sister["children"] = ["Maggie", "Porter"]
>>> print("len =", len(my_sister), "and value is", my_sister))
len = 2 and value is {'name': 'Christy', 'children': ['Maggie', 'Porter']}
>>> 

Get the keys and values:

>>> city = {"name": "Cape Town", "country": "South Africa",
...     "population": 2984000, "lat.": -33.93, "long.": 18.46}

>>> print(city.keys())
['country', 'long.', 'lat.', 'name', 'population']

>>> print(city.values())
['South Africa', 18.460000000000001, -33.93, 'Cape Town', 2984000]


>>> for k in city:
...     print k, "=", city[k]
... 
country = South Africa
long. = 18.46
lat. = -33.93
name = Cape Town
population = 2984000

Let’s do a biological example: Ambiguity codes

Sometimes DNA bases are ambiguous.

Eg, the sequencer might be able to tell that a base is not a G or T but could be either A or C.

The standard (IUPAC) one-letter code for DNA includes letters for ambiguity.

M is A or C
R is A or G
W is A or T
S is C or G

so on and so forth.

Suppose you don’t know all the possible bases exist in a sequence.

>>> seq = "TKKAMRCRAATARKWC"
>>> counts = {}
>>> for base in seq:
...     if base not in counts:
...             n = 0
...     else:
...             n = counts[base]
...     counts[base] = n + 1
... 

>>> print(counts)
{'A': 4, 'C': 2, 'K': 3, 'M': 1, 'R': 3, 'T': 2, 'W': 1}

In this example, if the base isn’t a key in the counts dictionary then use zero.  Otherwise use the value from the dict.

The idiom “use a default value if the key doesn’t exist” is very common.  Python has a special method to make it easy. For example:

>>> seq = "TKKAMRCRAATARKWC"
>>> counts = {}
>>> for base in seq:
...   counts[base] = counts.get(base, 0) + 1


>>> print(counts)
{'A': 4, 'C': 2, 'K': 3, 'M': 1, 'R': 3, 'T': 2, 'W': 1}
>>> counts.get("A", 9)
4

>>> counts["B"]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'B'

>>> counts.get("B", 9)
9

Another example: Reverse complement

>>> complement_table = {"A": "T", "T": "A", "C": "G", "G": "C"}
>>> seq = "CCTGTATT"
>>> new_seq = []
>>> for letter in seq:
...     complement_letter = complement_table[letter]
...     new_seq.append(complement_letter)


>>> print(new_seq)
['G', 'G', 'A', 'C', 'A', 'T', 'A', 'A']
>>> new_seq.reverse()
>>> print(new_seq)
['A', 'A', 'T', 'A', 'C', 'A', 'G', 'G']
>>> print("".join(new_seq))
AATACAGG

Example: Listing codons

>>> seq = "TCTCCAAGACGCATCCCAGTG"
>>> seq[0:3]
'TCT'
>>> seq[3:6]
'CCA'
>>> seq[6:9]
'AGA'
>>> range(0, len(seq), 3)
[0, 3, 6, 9, 12, 15, 18]
>>> for i in range(0, len(seq), 3):
...     print("Codon", i/3, "is", seq[i:i+3])
... 
Codon 0 is TCT
Codon 1 is CCA
Codon 2 is AGA
Codon 3 is CGC
Codon 4 is ATC
Codon 5 is CCA
Codon 6 is GTG

This program will hit a problem in the last codon, if the sequence is not divisible by 3. For example:

>>> seq = "TCTCCAA"
>>> for i in range(0, len(seq), 3):
...     print("Base", i/3, "is", seq[i:i+3])
... 
Base 0 is TCT
Base 1 is CCA
Base 2 is A

A is not a codon!

What to do?  It depends on what you want.

But you’ll probably want to know if the sequence length isn’t divisible by three.

The ‘%’ (remainder) operato

>>> 0 % 3
0
>>> 1 % 3
1
>>> 2 % 3
2
>>> 3 % 3
0

>>> seq = "TCTCCAA"
>>> len(seq)
7
>>> len(seq) % 3
1

So what is the solution?

Skip the last few letters. Here I’ll adjust the length

>>> seq = "TCTCCAA"
>>> for i in range(0, len(seq) - len(seq)%3, 3):
...     print("Base", i/3, "is", seq[i:i+3])
... 
Base 0 is TCT
Base 1 is CCA

Or refuse to do it if not divisible by 3!

Counting codon frequency

>>> seq = "TCTCCAAGACGCATCCCAGTG"
>>> codon_counts = {}
>>> for i in range(0, len(seq) - len(seq)%3, 3):
...     codon = seq[i:i+3]
...     codon_counts[codon] = codon_counts.get(codon, 0) + 1
... 
>>> codon_counts
{'ATC': 1, 'GTG': 1, 'TCT': 1, 'AGA': 1, 'CCA': 2, 'CGC': 1}

Notice that the codon_counts dictionary elements aren’t sorted.

Sorting the output

People like sorted output.  It’s easier to find “GTG” if the codon table is in order.

Use keys to get the dictionary keys then use sort to sort the keys (put them in order).

>>> codon_counts = {'ATC': 1, 'GTG': 1, 'TCT': 1, 'AGA': 1, 'CCA': 2, 'CGC': 1}
>>> codons = codon_counts.keys()

>>> print(codons)
['ATC', 'GTG', 'TCT', 'AGA', 'CCA', 'CGC']

>>> codons.sort()
>>> print(codons)
['AGA', 'ATC', 'CCA', 'CGC', 'GTG', 'TCT']

>>> for codon in codons:
...     print(codon, "=", codon_counts[codon] )
... 
AGA = 1
ATC = 1
CCA = 2
CGC = 1
GTG = 1
TCT = 1

Exercise 1: Ask the user for a sequence.  The sequence may include ambiguous codes (letters besides A, T, C or G).  Use a dictionary to find the number of times each letter is found.

Example:

Enter DNA: TACATCGATGCWACTN
A = 4
C = 4
G = 2
N = 1
T = 4
W = 1

Exercise 2: Make a program to find the reverse complement of an ambiguous DNA sequence.

ambiguous_dna_complement = {
"A": "T",
"C": "G",
"G": "C",
"T": "A",
"M": "K",
"R": "Y",
"W": "W",
"S": "S",
"Y": "R",
"K": "M",
"V": "B",
"H": "D",
"D": "H",
"B": "V",
"N": "N",
}

Exercise 3: Translate DNA into protein.

Write a program to ask for a DNA sequence.

Translate the DNA into protein.  (See next page for the codon table to use.)  When the codon doesn’t code for anything (eg, stop codon), use “*”.  Ignore the extra bases if the sequence length is not a multiple of 3.  Decide how you want to handle ambiguous codes.

Compare your results with someone else or with a web server.

table = {
     'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L', 'TCT': 'S',
     'TCC': 'S', 'TCA': 'S', 'TCG': 'S', 'TAT': 'Y', 'TAC': 'Y',
     'TGT': 'C', 'TGC': 'C', 'TGG': 'W', 'CTT': 'L', 'CTC': 'L',
     'CTA': 'L', 'CTG': 'L', 'CCT': 'P', 'CCC': 'P', 'CCA': 'P',
     'CCG': 'P', 'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
     'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'ATT': 'I',
     'ATC': 'I', 'ATA': 'I', 'ATG': 'M', 'ACT': 'T', 'ACC': 'T',
     'ACA': 'T', 'ACG': 'T', 'AAT': 'N', 'AAC': 'N', 'AAA': 'K',
     'AAG': 'K', 'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
     'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V', 'GCT': 'A',
     'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'GAT': 'D', 'GAC': 'D',
     'GAA': 'E', 'GAG': 'E', 'GGT': 'G', 'GGC': 'G', 'GGA': 'G',
     'GGG': 'G', }

# Extra data in case you want it.
stop_codons = [ 'TAA', 'TAG', 'TGA']
start_codons = [ 'TTG', 'CTG', 'ATG']