I looked through the other questions on this topic, but couldn't find something that really addresses what I'm trying to figure out.

The problem is this: I'm trying to create a program that looks for palindromes in two complementary strands of DNA, returning the position and length of each palindrome identified.

For instance, if given the sequence TTGATATCTT, the program should find the complement (AACTATAGAA), and then identify the second index as being the start of a 6-character palindrome.

I'm brand new to programming, so it might look totally goofy, but the code I came up with looks like this:

'''This first part imports the sequence (usually consisting of multiple lines of text)
from a file. I have a feeling there's an easier way of doing this, but I just don't
know what that would be.'''

length = 4
li = []
for line in open("C:/Python33/Stuff/Rosalind/rosalind_revp.txt"):
    if line[0] != ">":
        li.append(line)
seq = (''.join(li))

'''The complement() function takes the starting sequence and creates its complement'''

def complement(seq):
    li = []
    t = int(len(seq))
    for i in range(0, t):
        n = (seq[i])
        if n == "A":
            li.append(n.replace("A", "T"))        
        if n == "T":
            li.append(n.replace("T", "A"))
        if n == "C":
            li.append(n.replace("C", "G"))
        if n == "G":
            li.append(n.replace("G", "C"))
    answer = (''.join(li))
    return(answer)

'''the ip() function goes letter by letter, testing to see if it matches with the letter
x spaces in front of it on the complementary strand(x being specified by length). If the
letter doesn't match, it continues to the next one. After checking all possibilities for
one length, the function runs again with the argument length+1.'''

def ip(length, seq):
    n = 0
    comp = complement(seq)
    while length + n <= (len(seq)):
        for i in range(0, length-1):
            if seq[n + i] != comp[n + length - 1 - i]:
                n += 1
                break
            if (n + i) > (n + length - 1 - i):
                print(n + 1, length)
                n += 1
    if length <= 12:
        ip(length + 1, seq)

ip(length, seq)

The thing runs absolutely perfectly when starting with short sequences (TCAATGCATGCGGGTCTATATGCAT, for example), but with longer sequences, I invariably get this error message:

Traceback (most recent call last):
  File "C:/Python33/Stuff/Ongoing/palindrome.py", line 48, in <module>
    ip(length, seq)
  File "C:/Python33/Stuff/Ongoing/palindrome.py", line 39, in ip
    if seq[n + i] != comp[n + length - 1 - i]:
IndexError: string index out of range

The message is given after the program finishes checking the possible 4-character palindromes, before starting the function for length + 1.

I understand what the message is saying, but I don't understand why I'm getting it. Why would this work for some strings and not others? I've been checking for the past hour to see if it makes a difference whether the sequence has an odd number of characters or an even number of characters, is a multiple of 4, is just shy of a multiple of 4, etc. I'm stumped. What am I missing?

Any help would be appreciated.

P.S. The problem comes from the Rosalind Website (Rosalind.info), which uses 1-based numbering. Hence the print(n+1, length) at the end.

  • Could you give a sample string and length that it does break on? Commented Mar 17, 2013 at 14:50
  • Sure! "ATATCTGTCGTTGCTCTAAGCGTGTCTAGGAAAGGTCGGGAATCTCCCTTAACTCGGCTT" 61 characters. Commented Mar 17, 2013 at 14:59
  • I'm pretty sure the problem lies in the bounding of n and i for seq[n + i]. Do some print statements to debug the values there and see if you wind up with seq[len(seq)] at some point, which would be out of bounds. Commented Mar 17, 2013 at 16:20

1 Answer

The IndexError can be avoided by changing the last line of:

if line[0] != ">":
    li.append(line)

to

if line[0] != ">":
    li.append(line.rstrip())

near the beginning of your code. This prevents any trailing whitespace, especially newlines, read from the file from becoming part of the seq string. Having them in it is a problem because the complement() function ignores and thus removes them, so the answer string it returns isn't necessarily the same length as the input argument. This causes comp and seq to be different lengths in the ip() function, so an index that is valid for seq can run past the end of comp.
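
To see the mismatch in isolation, here is a minimal sketch. The two-line input is made up for illustration, and complement() is a condensed stand-in for the one in the question (it keeps only A, T, C and G, just as the chain of if statements does):

```python
def complement(seq):
    """Condensed stand-in for the question's complement(): other characters are dropped."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in seq if base in pairs)

# Hypothetical file contents, as they arrive when lines are appended without rstrip()
lines = ["TTGATA\n", "TCTT\n"]
seq = "".join(lines)
comp = complement(seq)

print(repr(seq), len(seq))    # the newlines are still part of seq
print(repr(comp), len(comp))  # complement() silently dropped them
```

Here seq has 12 characters but comp has only 10, so the loop bound based on len(seq) lets the index into comp go out of range.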

You didn't ask, but here's how I'd shorten your code and make it more "Pythonic":

COMPLEMENT = str.maketrans("ATCG", "TAGC")
LENGTH = 4

# Skip header lines starting with ">" and strip trailing newlines from the rest.
with open("palindrome.txt") as infile:
    seq = ''.join(line.rstrip() for line in infile if line[0] != ">")

def complement(seq):
    return seq.translate(COMPLEMENT)

def ip(length, seq):
    n = 0
    comp = complement(seq)
    while length + n <= len(seq):
        for i in range(0, length-1):
            if seq[n + i] != comp[n + length - 1 - i]:
                n += 1
                break
            if n + i > n + length - 1 - i:
                print(n + 1, length)
                n += 1
    if length <= 12:
        ip(length + 1, seq)

print(repr(seq))
print(repr(complement(seq)))
ip(LENGTH, seq)

BTW, those two print() function calls added near the end are what gave me the clue about what was wrong.
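
As an aside, the search itself could also be written with slicing, which sidesteps the index arithmetic altogether. This is only a sketch of one alternative, not a drop-in replacement for ip(): the names reverse_complement and reverse_palindromes are made up here, and the default bounds of 4 and 12 are an assumption based on the starting length and the `length <= 12` check in your code.

```python
COMPLEMENT = str.maketrans("ATCG", "TAGC")

def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

def reverse_palindromes(seq, min_len=4, max_len=12):
    """Return (1-based position, length) pairs; min_len is assumed to be even."""
    hits = []
    for start in range(len(seq)):
        # Only even lengths are tried: the middle base of an odd-length
        # window can never equal its own complement.
        for length in range(min_len, max_len + 1, 2):
            window = seq[start:start + length]
            if len(window) < length:
                break  # the window runs past the end of seq
            if window == reverse_complement(window):
                hits.append((start + 1, length))
    return hits

print(reverse_palindromes("TTGATATCTT"))  # [(3, 6), (4, 4)]
```

Because a slice that runs past the end of a string just comes back shorter, there is no way for this version to raise an IndexError, whatever ends up in seq.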

2 Comments

That fixed it! Just brilliant--I never would have known to do that. Thank you so much!
Nic: You're welcome. Actually I had a bit o' luck (probably because it's St. Patrick's ♣ Day ;-)) and found the problem almost immediately.
