0

I have a text in which I want to match the words from a given set and wrap each match in a tag. The code is:

import re

mytext = "xxxxx repA1 yyyy REPA1 zzz."
geneset = {'leuB', 'repA1'} # The actual length is ~1 million entries

result = mytext
for gene in geneset:
    regexp = re.compile(gene, flags=re.IGNORECASE)
    result = re.sub(regexp, r'<GENE>\g<0></GENE>', mytext)

print(result)

The expected output is:

xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.

Why does the code above fail to produce this output?

1
  • Your code seems to work for me, apart from the change from a set to a list. Commented May 27, 2014 at 6:16

2 Answers

2

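In your code, re.sub is applied to the original text, mytext, which never changes between iterations, so each pass discards the substitutions made by the previous one. Only the last gene in the loop has any effect, which is why the code can appear to work depending on the iteration order. If you pass the result variable instead, as in result = re.sub(regexp, r'<GENE>\g<0></GENE>', result), the output will be correct.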
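
A minimal runnable sketch of that fix follows. The function name tag_genes is only illustrative, and the re.escape call is my addition (an assumption that gene names should be matched literally, not as regex patterns); neither is in the original code.

```python
import re

def tag_genes(text, genes):
    """Wrap each case-insensitive occurrence of every gene name in <GENE> tags."""
    result = text
    for gene in genes:
        # re.escape is an addition: it treats the gene name as literal text
        # even if it contains regex metacharacters such as '.' or '+'.
        regexp = re.compile(re.escape(gene), flags=re.IGNORECASE)
        # Substitute into `result`, not `text`, so earlier tags are kept.
        result = regexp.sub(r'<GENE>\g<0></GENE>', result)
    return result

mytext = "xxxxx repA1 yyyy REPA1 zzz."
geneset = {'leuB', 'repA1'}
print(tag_genes(mytext, geneset))
# xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
```

Because every pass now builds on the previous one, the output no longer depends on the order in which the set is iterated.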

1

You should change mytext in re.sub to result. That way you update the variable result each time you loop over geneset, instead of starting with the original (and not-updated) string mytext on every iteration.

for gene in geneset:
    regexp = re.compile(r"(?i)({})".format(gene))
    result = re.sub(regexp, r'<GENE>\g<1></GENE>', result)
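
As a possible alternative to the loop, all names could be combined into one alternation pattern so the text is scanned only once. This is a different technique from the one above, not a correction of it. The name tag_genes_single_pass is hypothetical, re.escape is added on the assumption that names are literal strings, and I have not benchmarked it on a set of ~1 million entries, where compiling such a large pattern may itself be slow.

```python
import re

def tag_genes_single_pass(text, genes):
    """Tag all gene names in one pass using a single alternation regex."""
    if not genes:
        # An empty alternation would match the empty string everywhere.
        return text
    # Longest names first, so a longer name wins over a shorter one
    # that is a prefix of it (regex alternation tries options in order).
    alternatives = sorted(genes, key=len, reverse=True)
    pattern = re.compile(
        "|".join(re.escape(g) for g in alternatives),
        flags=re.IGNORECASE,
    )
    return pattern.sub(r'<GENE>\g<0></GENE>', text)

mytext = "xxxxx repA1 yyyy REPA1 zzz."
geneset = {'leuB', 'repA1'}
print(tag_genes_single_pass(mytext, geneset))
# xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
```

One design consequence: since the text is substituted only once, a gene name can never match inside a tag inserted for another gene, which the repeated-substitution loop does not guarantee.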
