Py4Bio Chapter 4: if and files

This is the Chapter 4 of Py4Bio.

The if statement: Do a code block only when something is True

if test:
    print("The expression is true”)

Example:

if "GAATTC" in "ATCTGGAATTCATCG":
    print("EcoRI site is present”)

What if you want the false case? There are several possibilities; here’s two:

Python has a not in operator

if "GAATTC" not in "AAAAAAAAA":
    print("EcoRI will not cut the sequence”)

The not operator switches true and false.

if not "GAATTC" in "AAAAAAAAA":
    print("EcoRI will not cut the sequence”)

More explicit examples:

>>> x = True
>>> x
True
>>> not x
False
>>> not not x
True
>>> if "GAATTC" not in "AAAAAAAAA":
...     print("EcoRI will not cut the sequence”)
... 
EcoRI will not cut the sequence
>>> if not "GAATTC" in "ATCTGGAATTCATCG":
...   print("EcoRI will not cut the sequence”)
... 
>>> if not "GAATTC" in "AAAAAAAAA":
...     print("EcoRI will not cut the sequence”)
... 
EcoRI will not cut the sequence

What if you want to do one thing when the test is true and another thing when the test is false? Use else!

Examples:

>>> if "GAATTC" in "ATCTGGAATTCATCG":
...     print("EcoRI site is present”)
... else:
...     print("EcoRI will not cut the sequence”)
... 
EcoRI site is present
>>> if "GAATTC" in "AAAACTCGT":
...     print("EcoRI site is present”)
... else:
...     print("EcoRI will not cut the sequence”)
... 
EcoRI will not cut the sequence

But where is the cutting site?

The ‘find’ method of strings returns the index of a substring in the string, or -1 if the substring doesn’t exist

>>> seq = "ATCTGGAATTCATCG"
>>> seq.find("GAATTC")
5
# This means, there is a GAATTC  at position 5 (0-index system)

>>> seq.find("GGCGC")
-1
# But there is no GGCGC in the sequence

But where is the site?

>>> seq = "ATCTGGAATTCATCG"
>>> pos = seq.find("GAATTC")
>>> if pos == -1:
...     print("EcoRI does not cut the sequence”)
... else:
...     print("EcoRI site starting at index", pos)
... 
EcoRI site starting at index 5

A more complex example: Nested code blocks

Using if inside a for

restriction_sites = [
  "GAATTC",    # EcoRI
  "GGATCC",    # BamHI
  "AAGCTT",    # HindIII
]

seq = input("Enter a DNA sequence: ")

for site in restriction_sites:
    if site in seq:
        print(site, "is a cleavage site”)
    else:
        print(site, "is not present”)

Read lines from a file

input() asks the user for input

Most of the time you’ll get data from a file.  (Or would you rather type in the sequence every time?)

To read from a file you need to tell Python to open that file.

The open function

>>> infile = open("10_sequences.seq")
>>> print(infile)
<open file '10_sequences.seq', mode 'r' at 0x817ca60>

open returns a new object of type file

A file can’t be displayed like a number or a string.  It is useful because it has methods for working with the data in the file.

>>> infile.readline()
'CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'

readline returns one line from the file.

The line includes the end of line character (represented here by “\n”)

(Note: the last line of some files may not have a “\n”)

readline finishes with “”

>>> infile.readline()
'CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
>>> infile.readline()
'ATTTTTAACTTTTCTCTGTCGTCGCACAATCGACTTTCTCTGTTTTCTTGGGTTTACCGGAA\n'
>>> infile.readline()
'TTGTTTCTGCTGCGATGAGGTATTGCTCGTCAGCCTGAGGCTGAAAATAAAATCCGTGGT\n'
>>> infile.readline()
'CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT\n'
>>> infile.readline()
'TCTTCTCCAAGACGCATCCACGTGAACCGTTGTAACTATGTTCTGTGC\n'
>>> infile.readline()
'CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA\n'
>>> infile.readline()
'GTGCTCTCTTCTCGGAGAGAGAAGGTGGGCTGCTTGTCTGCCGATGTACTTTATTAAATCCAATAA\n'
>>> infile.readline()
'CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA\n'
>>> infile.readline()
'TCTGAAAAGTGCAAAGAACGATGATGATGATGATAGAGGAACCTGAGCAGCCATGTCTGAACCTATAGC\n'
>>> infile.readline()
'GTATTGGTCGTCGTGCGACTAAATTAGGTAAAAAAGTAGTTCTAAGAGATTTTGATGATTCAATGCAAAGTTCTATTAATCGTTCAATTG\n'
>>> infile.readline()
''

When there are no more lines,

readline() returns the empty string

You can use readlines() to read the whole thing!

>>> infile.readlines()
'CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA\n
ATTTTTAACTTTTCTCTGTCGTCGCACAATCGACTTTCTCTGTTTTCTTGGGTTTACCGGAA\n
TTGTTTCTGCTGCGATGAGGTATTGCTCGTCAGCCTGAGGCTGAAAATAAAATCCGTGGT\n
CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT\n
TCTTCTCCAAGACGCATCCACGTGAACCGTTGTAACTATGTTCTGTGC\n
CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA\n
GTGCTCTCTTCTCGGAGAGAGAAGGTGGGCTGCTTGTCTGCCGATGTACTTTATTAAATCCAATAA\n
CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA\n
TCTGAAAAGTGCAAAGAACGATGATGATGATGATAGAGGAACCTGAGCAGCCATGTCTGAACCTATAGC\n
GTATTGGTCGTCGTGCGACTAAATTAGGTAAAAAAGTAGTTCTAAGAGATTTTGATGATTCAATGCAAAGTTCTATTAATCGTTCAATTG\n'

Let’s do something more complex: list the sequences starting with a cytosine.

We will be using rstrip to get rid of the “\n”

>>> filename = "10_sequences.seq"
>>> for line in open(filename):
...     line = line.rstrip()
...     if line.startswith("C"):
...             print(line)
... 

CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA
CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT
CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA
CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA

strip(): Removes characters from both the beginning and the end of the string.

lstrip(): Removes characters only from the beginning (left side) of the string.

rstrip(): Removes characters only from the end (right side) of the string.

But using open in this way is not recommended, because you SHOULD also close the file.

>>> filename = "/usr/coursehome/dalke/10_sequences.seq”
>>> filehandle = open(filename)
>>> for line in filehandle:
...     line = line.rstrip()
...     if line.startswith("C"):
...             print(line)
... 
CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA
CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT
CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA
CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA
>>> filehandle.close()

So it’s better to handle a file using with, which closes the file after use:

>>> filename = "/usr/coursehome/dalke/10_sequences.seq”
>>> with open(filename) as fh:
...     lines = fh.readlines()
>>> for line in lines.split(“\n”):
...     line = line.rstrip()
...     if line.startswith("C"):
...             print(line)
... 
CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA
CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT
CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA
CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA

# No filehandle.close() is needed

Exercise 1: Get a sequence from the user.  If there is an A in the sequence, print the number of times it appears in the sequence.  Do the same for T, C and G.  If a base does not exist, don’t print anything.

Enter a sequence: ACCAGGCA      
A count: 3
C count: 3
G count: 2

Exercise 2: Modify the program so that if A do not exist, prints “A not found”.  Do the same for T, C and G.

Exercise 3: Read the file 10_sequences.seq. Print out the line number (starting with 1) then the line.  Remember to use strip() to remove the extra newline.

Exercise 4: List the sequences in 10_sequences.seq which have the pattern CTATA. Once that works, print the index of the first time that pattern is found.