This is the Chapter 6 of Py4Bio.
A dictionary is a table of items.
Each item has a “key” and a “value”
Also known as hash table.
Keys | Values |
English | Good morning |
Spanish | Buenas días |
Swedish | God morgon |
German | Guten morgen |
Venda | Ndi matscheloni |
Afrikaans | Goeie môre |
I want to know “Good morning” in Swedish. I get the “Good Morning” dictionary, search for “Swedish” key, and get the value!
>>> good_morning_dict = {
... "English": "Good morning",
... "Swedish": "God morgon",
... "German": "Guten morgen",
... "Venda": "Ndi matscheloni",
... }
>>> print(good_morning_dict["Swedish"])
God morgon
How to create a dictionary:
>>> D1 = {}
>>> len(D1)
0
>>> D2 = {"name": "Andrew", "age": 33}
>>> len(D2)
2
>>> D2["name"]
'Andrew'
>>> D2["age"]
33
>>> D2["AGE"]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
KeyError: 'AGE'
Adding new elements:
>>> my_sister = {}
>>> my_sister["name"] = "Christy"
>>> print("len =", len(my_sister), "and value is", my_sister)
len = 1 and value is {'name': 'Christy’}
>>> my_sister["children"] = ["Maggie", "Porter"]
>>> print("len =", len(my_sister), "and value is", my_sister))
len = 2 and value is {'name': 'Christy', 'children': ['Maggie', 'Porter']}
>>>
Get the keys and values:
>>> city = {"name": "Cape Town", "country": "South Africa",
... "population": 2984000, "lat.": -33.93, "long.": 18.46}
>>> print(city.keys())
['country', 'long.', 'lat.', 'name', 'population']
>>> print(city.values())
['South Africa', 18.460000000000001, -33.93, 'Cape Town', 2984000]
>>> for k in city:
... print k, "=", city[k]
...
country = South Africa
long. = 18.46
lat. = -33.93
name = Cape Town
population = 2984000
Let’s do a biological example: Ambiguity codes
Sometimes DNA bases are ambiguous.
Eg, the sequencer might be able to tell that a base is not a G or T but could be either A or C.
The standard (IUPAC) one-letter code for DNA includes letters for ambiguity.
M is A or C
R is A or G
W is A or T
S is C or G
so on and so forth.
Suppose you don’t know all the possible bases exist in a sequence.
>>> seq = "TKKAMRCRAATARKWC"
>>> counts = {}
>>> for base in seq:
... if base not in counts:
... n = 0
... else:
... n = counts[base]
... counts[base] = n + 1
...
>>> print(counts)
{'A': 4, 'C': 2, 'K': 3, 'M': 1, 'R': 3, 'T': 2, 'W': 1}
In this example, if the base isn’t a key in the counts dictionary then use zero. Otherwise use the value from the dict.
The idiom “use a default value if the key doesn’t exist” is very common. Python has a special method to make it easy. For example:
>>> seq = "TKKAMRCRAATARKWC"
>>> counts = {}
>>> for base in seq:
... counts[base] = counts.get(base, 0) + 1
>>> print(counts)
{'A': 4, 'C': 2, 'K': 3, 'M': 1, 'R': 3, 'T': 2, 'W': 1}
>>> counts.get("A", 9)
4
>>> counts["B"]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
KeyError: 'B'
>>> counts.get("B", 9)
9
Another example: Reverse complement
>>> complement_table = {"A": "T", "T": "A", "C": "G", "G": "C"}
>>> seq = "CCTGTATT"
>>> new_seq = []
>>> for letter in seq:
... complement_letter = complement_table[letter]
... new_seq.append(complement_letter)
>>> print(new_seq)
['G', 'G', 'A', 'C', 'A', 'T', 'A', 'A']
>>> new_seq.reverse()
>>> print(new_seq)
['A', 'A', 'T', 'A', 'C', 'A', 'G', 'G']
>>> print("".join(new_seq))
AATACAGG
Example: Listing codons
>>> seq = "TCTCCAAGACGCATCCCAGTG"
>>> seq[0:3]
'TCT'
>>> seq[3:6]
'CCA'
>>> seq[6:9]
'AGA'
>>> range(0, len(seq), 3)
[0, 3, 6, 9, 12, 15, 18]
>>> for i in range(0, len(seq), 3):
... print("Codon", i/3, "is", seq[i:i+3])
...
Codon 0 is TCT
Codon 1 is CCA
Codon 2 is AGA
Codon 3 is CGC
Codon 4 is ATC
Codon 5 is CCA
Codon 6 is GTG
This program will hit a problem in the last codon, if the sequence is not divisible by 3. For example:
>>> seq = "TCTCCAA"
>>> for i in range(0, len(seq), 3):
... print("Base", i/3, "is", seq[i:i+3])
...
Base 0 is TCT
Base 1 is CCA
Base 2 is A
A is not a codon!
What to do? It depends on what you want.
But you’ll probably want to know if the sequence length isn’t divisible by three.
The ‘%’ (remainder) operato
>>> 0 % 3
0
>>> 1 % 3
1
>>> 2 % 3
2
>>> 3 % 3
0
>>> seq = "TCTCCAA"
>>> len(seq)
7
>>> len(seq) % 3
1
So what is the solution?
Skip the last few letters. Here I’ll adjust the length
>>> seq = "TCTCCAA"
>>> for i in range(0, len(seq) - len(seq)%3, 3):
... print("Base", i/3, "is", seq[i:i+3])
...
Base 0 is TCT
Base 1 is CCA
Or refuse to do it if not divisible by 3!
Counting codon frequency
>>> seq = "TCTCCAAGACGCATCCCAGTG"
>>> codon_counts = {}
>>> for i in range(0, len(seq) - len(seq)%3, 3):
... codon = seq[i:i+3]
... codon_counts[codon] = codon_counts.get(codon, 0) + 1
...
>>> codon_counts
{'ATC': 1, 'GTG': 1, 'TCT': 1, 'AGA': 1, 'CCA': 2, 'CGC': 1}
Notice that the codon_counts dictionary elements aren’t sorted.
Sorting the output
People like sorted output. It’s easier to find “GTG” if the codon table is in order.
Use keys to get the dictionary keys then use sort
to sort the keys (put them in order).
>>> codon_counts = {'ATC': 1, 'GTG': 1, 'TCT': 1, 'AGA': 1, 'CCA': 2, 'CGC': 1}
>>> codons = codon_counts.keys()
>>> print(codons)
['ATC', 'GTG', 'TCT', 'AGA', 'CCA', 'CGC']
>>> codons.sort()
>>> print(codons)
['AGA', 'ATC', 'CCA', 'CGC', 'GTG', 'TCT']
>>> for codon in codons:
... print(codon, "=", codon_counts[codon] )
...
AGA = 1
ATC = 1
CCA = 2
CGC = 1
GTG = 1
TCT = 1
Exercise 1: Ask the user for a sequence. The sequence may include ambiguous codes (letters besides A, T, C or G). Use a dictionary to find the number of times each letter is found.
Example:
Enter DNA: TACATCGATGCWACTN
A = 4
C = 4
G = 2
N = 1
T = 4
W = 1
Exercise 2: Make a program to find the reverse complement of an ambiguous DNA sequence.
ambiguous_dna_complement = {
"A": "T",
"C": "G",
"G": "C",
"T": "A",
"M": "K",
"R": "Y",
"W": "W",
"S": "S",
"Y": "R",
"K": "M",
"V": "B",
"H": "D",
"D": "H",
"B": "V",
"N": "N",
}
Exercise 3: Translate DNA into protein.
Write a program to ask for a DNA sequence.
Translate the DNA into protein. (See next page for the codon table to use.) When the codon doesn’t code for anything (eg, stop codon), use “*”. Ignore the extra bases if the sequence length is not a multiple of 3. Decide how you want to handle ambiguous codes.
Compare your results with someone else or with a web server.
table = {
'TTT': 'F', 'TTC': 'F', 'TTA': 'L', 'TTG': 'L', 'TCT': 'S',
'TCC': 'S', 'TCA': 'S', 'TCG': 'S', 'TAT': 'Y', 'TAC': 'Y',
'TGT': 'C', 'TGC': 'C', 'TGG': 'W', 'CTT': 'L', 'CTC': 'L',
'CTA': 'L', 'CTG': 'L', 'CCT': 'P', 'CCC': 'P', 'CCA': 'P',
'CCG': 'P', 'CAT': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
'CGT': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R', 'ATT': 'I',
'ATC': 'I', 'ATA': 'I', 'ATG': 'M', 'ACT': 'T', 'ACC': 'T',
'ACA': 'T', 'ACG': 'T', 'AAT': 'N', 'AAC': 'N', 'AAA': 'K',
'AAG': 'K', 'AGT': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
'GTT': 'V', 'GTC': 'V', 'GTA': 'V', 'GTG': 'V', 'GCT': 'A',
'GCC': 'A', 'GCA': 'A', 'GCG': 'A', 'GAT': 'D', 'GAC': 'D',
'GAA': 'E', 'GAG': 'E', 'GGT': 'G', 'GGC': 'G', 'GGA': 'G',
'GGG': 'G', }
# Extra data in case you want it.
stop_codons = [ 'TAA', 'TAG', 'TGA']
start_codons = [ 'TTG', 'CTG', 'ATG']