I have a list of coordinates:

coordinates = [[1,5], [10,15], [25, 35]]

I have a string as follows:

line = 'ATCACGTGTGTGTACACGTACGTGTGNGTNGTTGAGTGKWSGTGAAAAAKCT'

I want to replace each interval in coordinates, given as a [start, end] pair, with the character 'N'.

The only way I can think of is the following:

for element in coordinates:
    length = element[1] - element[0]
    line = line.replace(line[element[0]:element[1]], 'N'*length)

The desired output would be:

line = 'ANNNNGTGTGNNNNNACGTACGTGTNNNNNNNNNNGTGKWSGTGAAAAAKCT'

where the intervals [1,5), [10,15) and [25,35) are replaced with N in line.

This requires me to loop through the coordinate list and update my string line on every iteration. Is there another way to replace a list of intervals in a string?

Note: There is a problem with the original solution in this question. In line.replace(line[element[0]:element[1]], 'N'*length), replace substitutes every occurrence of the substring line[element[0]:element[1]] in the sequence, not only the one at the given indices. For people working with DNA, this is definitely not what you want! I am, however, keeping the solution as it is so as not to disturb the flow of the comments and discussion that follow.
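To make the pitfall in the note concrete, here is a minimal sketch on a toy sequence (the names seq, by_replace and by_slice are made up for illustration and are not part of the question). It compares str.replace, which matches by content, with slicing, which works by position:

```python
# Toy sequence (hypothetical) in which the motif 'ACGT' occurs twice.
seq = 'ACGTTTACGT'
start, end = 0, 4  # only the first occurrence should be masked

# str.replace matches by content, so both occurrences are replaced.
by_replace = seq.replace(seq[start:end], 'N' * (end - start))

# Slicing by position touches only the requested interval.
by_slice = seq[:start] + 'N' * (end - start) + seq[end:]

print(by_replace)  # NNNNTTNNNN
print(by_slice)    # NNNNTTACGT
```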

  • Please add example (desired) output to the question. Commented Jul 30, 2020 at 9:14
  • But I think this should do what you want: for start, end in coordinates: line = line[:start] + "N" * (end - start) + line[end:] -- if I've understood correctly. Commented Jul 30, 2020 at 9:17
  • I am not sure your current solution even does what you expect. replace replaces all occurrences of the substring, so it might replace more than the indices you give it. Commented Jul 30, 2020 at 9:19
  • @Tomerikoo Oh, really? That's important. In my example it looks like it works correctly with the indices I give it. How do you think it could cause a problem? Is there another method I could use instead? Commented Jul 30, 2020 at 9:22
  • @Homap It might cause a problem if, for example, the substring between indices 1 and 5 (TCAC) appears somewhere else in the string, because it will be replaced there as well. That might not be what you want. Commented Jul 30, 2020 at 9:31

2 Answers

Instead of string concatenation (which is wasteful because every step creates and discards a new string instance), use a list:

coordinates = [[1,5], [10,15], [25, 35]] # sorted

line = 'ATCACGTGTGTGTACACGTACGTGTGNGTNGTTGAGTGKWSGTGAAAAAKCT'

result = list(line)
# opted for exclusive end pos
for r in [range(start,end) for start,end in coordinates]:
    for p in r:
        result[p]='N'

res = ''.join(result)
print(res)

To get:

ANNNNGTGTGNNNNNACGTACGTGTNNNNNNNNNNGTGKWSGTGAAAAAKCT

Optimized to use slice assignment, again with an exclusive end:

for start,end in coordinates:
    result[start:end] = ["N"]*(end-start)

res = ''.join(result)
print(line)
print(res)

This gives the desired output (original line first, masked line second):

ATCACGTGTGTGTACACGTACGTGTGNGTNGTTGAGTGKWSGTGAAAAAKCT 
ANNNNGTGTGNNNNNACGTACGTGTNNNNNNNNNNGTGKWSGTGAAAAAKCT
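For very large inputs, such as the multi-gigabyte file mentioned in the comment on this answer, a bytearray is a possible alternative to a list of one-character strings. The following is only a sketch and was not benchmarked: it assumes the sequence is pure ASCII, and the names buf and masked are introduced here for illustration.

```python
coordinates = [[1, 5], [10, 15], [25, 35]]
line = 'ATCACGTGTGTGTACACGTACGTGTGNGTNGTTGAGTGKWSGTGAAAAAKCT'

# Assumption: the sequence contains only ASCII characters.
buf = bytearray(line, 'ascii')

# A bytearray is mutable, so each interval can be overwritten in place.
# The replacement has the same length as the slice, so positions do not shift.
for start, end in coordinates:
    buf[start:end] = b'N' * (end - start)

masked = buf.decode('ascii')
print(masked)
```

A bytearray stores one byte per base rather than one Python object reference per character, so it may need less memory than list(line); whether it is faster in practice would have to be measured on the real data.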

1 Comment

This solution took about 89 seconds on a 2.4 GB file.
2

Good question, this should work.

coordinates = [[1,5], [10,15], [25, 35]]
line = 'ATCACGTGTGTGTACACGTACGTGTGNGTNGTTGAGTGKWSGTGAAAAAKCT'
for L,R in coordinates:
    line = line[:L] + "N"*(R-L) + line[R:]
print(line)

You may need to adjust this depending on how the coordinates are defined, e.g. whether the end position is inclusive or the positions are 1-indexed.

We need more people working with DNA, so great work.
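As an illustration of the adjustment mentioned above, here is a minimal sketch for the case where coordinates are 1-based with inclusive ends. The question itself uses 0-based, end-exclusive intervals, so this convention is an assumption, and mask_one_based_inclusive is a hypothetical helper name.

```python
def mask_one_based_inclusive(seq, intervals):
    """Mask intervals given as 1-based [start, end] pairs with inclusive ends.

    Hypothetical helper; the convention is an assumption, not the question's.
    """
    chars = list(seq)
    for start, end in intervals:
        # 1-based inclusive [start, end] is 0-based half-open [start - 1, end).
        chars[start - 1:end] = ['N'] * (end - start + 1)
    return ''.join(chars)


# Positions 2..4 (1-based, inclusive) are the bases C, G, T.
print(mask_one_based_inclusive('ACGTACGT', [[2, 4]]))  # ANNNACGT
```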

2 Comments

Good. The code in the question would probably imply that the indices should be as you have now shown (and I thought the same - see my comment under the question) but some example output from the OP would certainly help clarify this.
Ah, now we have example output in the question, and it is as suspected.
