Py4Bio Chapter 1: Strings

This is the Chapter 1 of Py4Bio.

Computers store text as strings.

>>> s = "GATTACA"

Here “s” is a variable. You can imagine it as a basket, or a holder, which carry the string “GAATACA”.

Python uses 0 based indexing. Means counting starts from 0.

Why are strings important?

DNA Sequences are strings (..catgaaggaa ccacagccca gagcaccaagggctatccat..)

Database records contain strings.

LOCUS AC005138
DEFINITION  Homo sapiens chromosome 17, clone hRPK.261_A_13, complete sequence
AUTHORS   Birren,B., Fasman,K., Linton,L., Nusbaum,C. and Lander,E.

HTML is one (big) string.

Splicing strings

Let’s learn how to get character from string.

>>> s = "GATTACA"
>>> s[0]
'G'
>>> s[1]
'A'
>>> s[-1]
'A'
>>> s[-2]
'C'
>>> s[7]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range

Getting substrings


>>> s = "GATTACA"
>>> s[1:3]
'AT'
>>> s[:3]
'GAT'
>>> s[4:]
'ACA'
>>> s[3:5]
'TA'
>>> s[:]
'GATTACA'
>>> s[::2]
'GTAA'
>>> s[-2:2:-1]
'CAT'

Creating strings

Strings start and end with a single or double quote characters (they must be the same)

‘This is a string’
“This is another string”
“Strings can be in double quotes”
‘Or in single quotes.’
‘There’s no difference.’
‘Okay, there\’s a small one.’
“““Can be in 
Multiline”””

Special Characters and Escape Sequences

Backslashes (\) are used to  introduce special characters.

>>> s = 'Okay, there\'s a small one.'
>>> print(s)
Okay, there's a small one.

The \ “escapes” the following single quote.

Some special characters

Escape SequenceMeaning
\\Backslash (keep a \)
\’Single quote (keeps the ‘)
\”Double quote (keeps the “)
\nNewline
\tTab

Let’s do more work with strings. Now introducing methods and built in keywords.

>>> len("GATTACA") # Length
7
>>> "GAT" + "TACA" # Concatanation
'GATTACA'
>>> "A" * 10 # Repeat
'AAAAAAAAAA'
>>> "G" in "GATTACA" # Substring test
True
>>> "GAT" in "GATTACA" 
True
>>> "AGT" in "GATTACA" 
False
>>> "GATTACA".find("ATT") # Substring location
1
>>> "GATTACA".count("T") # Substring count
2

Converting from/to strings

>>> "38" + 5
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects
>>> int("38") + 5
43
>>> "38" + str(5)
'385'
>>> int("38"), str(5)
(38, '5')
>>> int("2.71828")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: invalid literal for int(): 2.71828
>>> float("2.71828")
2.71828

Strings cannot be modified. They are immutable.

Instead, create a new one.

>>> s = "GATTACA"
>>> s[3] = "C"
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s = s[:3] + "C" + s[4:]
>>> s
'GATCACA'

Some more methods

>>> "GATTACA".lower()
'gattaca'
>>> "gattaca".upper()
'GATTACA'
>>> "GATTACA".replace("G", "U")
'UATTACA'
>>> "GATTACA".replace("C", "U")
'GATTAUA'
>>> "GATTACA".replace("AT", "**")
'G**TACA'
>>> "GATTACA".startswith("G")
True
>>> "GATTACA".startswith("g")
False

Ask the user for a string 

The Python function “input” asks the user (that’s you!) for a string

>>> seq = input("Enter a DNA sequence: ")
Enter a DNA sequence: ATGTATTGCATATCGT
>>> seq.count("A")
4
>>> print("There are", seq.count("T"), "thymines")
There are 7 thymines
>>> "ATA" in seq
True
>>> substr = input("Enter a subsequence to find: ")
Enter a subsequence to find: GCA
>>> substr in seq
True

Exercise 1: Ask the user for a sequence, then print its length.

Enter a sequence: ATTAC
It is 5 bases long

Exercise 2: Modify the program so it also prints the number of A, T, C, and G characters in the sequence

Enter a sequence: ATTAC
It is 5 bases long
adenine: 2
thymine: 2
cytosine: 1
guanine: 0

Exercise 3: Modify the program to allow both lower-case and upper-case characters in the sequence

Enter a sequence: ATTgtc
It is 6 bases long
adenine: 1
thymine: 3
cytosine: 1
guanine: 1

Exercise 4: Modify the program to print the number of unknown characters in the sequence

Enter a sequence: ATTU*gtc
It is 8 bases long
adenine: 1
thymine: 3
cytosine: 1
guanine: 1
unknown: 2