I have a string s = "ATAATGCGTGGAATTATGACCGGAATC". I would like to extract all substrings that start with ATG and end with GGA, so the results would be ATGCGTGGA and ATGACCGGA.

This is what I have tried so far, but it is not working. Thanks in advance for your help.


import re

s = "ATAATGCGTGGAATTATGACCGGAATC"
x = re.findall('^ATG.+GGA$', s)
print(x)
  • That works exactly as written: your string neither starts with ATG nor ends with GGA. Commented Sep 5, 2022 at 12:45
  • Just remove ^ and $ from your regexp: re.findall('ATG.+GGA', s) Commented Sep 5, 2022 at 12:46
  • Note that "ATGCGTGGAATTATGACCGGA" is a third solution. Commented Sep 5, 2022 at 12:54
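
To illustrate the last comment, here is a minimal sketch that lists every substring starting with ATG and ending with GGA, including the long one that spans both. The names starts, ends and all_matches are hypothetical, and the approach (zero-width lookaheads to collect positions, then pairing them) is just one possible way to do it, not something taken from the answers below.

```python
import re

s = "ATAATGCGTGGAATTATGACCGGAATC"

# Zero-width lookaheads record every position where ATG or GGA begins.
starts = [m.start() for m in re.finditer(r"(?=ATG)", s)]
ends = [m.start() + 3 for m in re.finditer(r"(?=GGA)", s)]

# Pair each start with each later end; a length of at least 7 mirrors
# the ".+" in the pattern (one or more characters between ATG and GGA).
all_matches = [s[i:j] for i in starts for j in ends if j - i >= 7]

print(all_matches)
# ['ATGCGTGGA', 'ATGCGTGGAATTATGACCGGA', 'ATGACCGGA']
```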

2 Answers

With ^ and $ you are anchoring the pattern to the start and end of the string; don't do that if you want to find substrings. Also, regex quantifiers are "greedy" by default: they match the longest possible sequence.

You need to use +? for a non-greedy (aka lazy) match, which matches the shortest possible sequence:

x = re.findall('ATG.+?GGA', s)
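
As a quick side-by-side check, this sketch runs both patterns on the string from the question. The variable names greedy and lazy are only for illustration.

```python
import re

s = "ATAATGCGTGGAATTATGACCGGAATC"

# Greedy: .+ consumes as much as it can, so the match runs to the last GGA.
greedy = re.findall(r"ATG.+GGA", s)

# Lazy: .+? consumes as little as it can, so each match stops at the first GGA.
lazy = re.findall(r"ATG.+?GGA", s)

print(greedy)  # ['ATGCGTGGAATTATGACCGGA']
print(lazy)    # ['ATGCGTGGA', 'ATGACCGGA']
```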

The symbols ^ and $ anchor the match to the beginning and end of the whole string, not to the beginning and end of a substring.

Just remove ^ and $ from your regexp: re.findall('ATG.+GGA', s).

In addition, you might want to add ? after the +, to stop at the first GGA found rather than the last: re.findall('ATG.+?GGA', s)

Refer to "Module re: regular expression syntax" in the official Python documentation for more information about ^, $ and ?.

1 Comment

re.findall('ATG.+GGA', s) returns a single substring, ATGCGTGGAATTATGACCGGA, because the .+ match is greedy. See my answer :)
