I have a text in which I want to match words from a given set, and then tag each match. The code is this:
import re

mytext = "xxxxx repA1 yyyy REPA1 zzz."
geneset = {'leuB', 'repA1'}  # The actual set has ~1 million entries
result = mytext
for gene in geneset:
    regexp = re.compile(gene, flags=re.IGNORECASE)
    result = re.sub(regexp, r'<GENE>\g<0></GENE>', mytext)
print result
The expected output is:
xxxxx <GENE>repA1</GENE> yyyy <GENE>REPA1</GENE> zzz.
Why does the code above fail to produce this result?
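
One likely cause, for what it is worth: every pass through the loop calls re.sub on mytext rather than on result, so result only ever holds the substitution for whichever gene happens to be iterated last. Sets have no guaranteed order, so if 'leuB' comes last the text is printed without any tags. Below is a minimal sketch of one way around this, assuming the gene names should be matched literally (hence re.escape) and that a single combined pattern is acceptable. The helper name tag_genes is hypothetical and used only for illustration.

```python
import re

def tag_genes(text, genes):
    # Hypothetical helper: wrap every case-insensitive occurrence of a gene name in <GENE> tags.
    # Longest names first, so a longer name is preferred over a shorter one that is its prefix.
    ordered = sorted(genes, key=len, reverse=True)
    # re.escape treats each name as literal text (assumption: names are not meant as regexes).
    pattern = re.compile("|".join(re.escape(g) for g in ordered), flags=re.IGNORECASE)
    # One pass over the text, so no substitution overwrites an earlier one.
    return pattern.sub(r"<GENE>\g<0></GENE>", text)

mytext = "xxxxx repA1 yyyy REPA1 zzz."
geneset = {'leuB', 'repA1'}
result = tag_genes(mytext, geneset)
print(result)
```

With roughly a million entries, a single alternation this large may be slow to compile and to match, so splitting the text into tokens and checking each lowercased token against a lowercased set could be worth trying instead. That is a guess about the data, not something measured.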