This solution uses the str.find method to find all copies of an ngram within the txt string, saving the indices of each copy to the indices set so we can easily handle overlapping matches.
We then copy txt, char by char to the result list, inserting angle brackets where required. This strategy is more efficient than inserting the angle brackets using multiple .replace call because each .replace call needs to rebuild the whole string.
I've extended your data slightly to illustrate that my code handles multiple copies of an ngram.
txt = "how does this work now chisolm"
ngrams = ["ow ", "his", "s w"]
print(txt)
print(ngrams)
# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
slen = len(s)
lo = 0
while True:
i = txt.find(s, lo)
if i == -1:
break
lo = i + slen
print(s, i)
indices.update(range(i, lo-1))
print(indices)
# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
if switch:
if i in indices:
result.append('<')
switch = False
result.append(u)
else:
result.append(u)
if i not in indices:
result.append('>')
switch = True
print(''.join(result))
output
how does this work now chisolm
['ow ', 'his', 's w']
ow 1
ow 20
his 10
his 24
s w 12
{1, 2, 10, 11, 12, 13, 20, 21, 24, 25}
h<ow >does t<his w>ork n<ow >c<his>olm
If you want adjacent groups to be merged, we can easily do that using the str.replace method. But to make that work properly we need to pre-process the original data, converting all runs of whitespace to single spaces. A simple way to do that is to split the data and re-join it.
txt = "how does this\nwork now chisolm hisow"
ngrams = ["ow", "his", "work"]
#Convert all whitespace to single spaces
txt = ' '.join(txt.split())
print(txt)
print(ngrams)
# Search for all copies of each ngram in txt
# saving the indices where the ngrams occur
indices = set()
for s in ngrams:
slen = len(s)
lo = 0
while True:
i = txt.find(s, lo)
if i == -1:
break
lo = i + slen
print(s, i)
indices.update(range(i, lo-1))
print(indices)
# Copy the txt to result, inserting angle brackets
# to show matches
switch = True
result = []
for i, u in enumerate(txt):
if switch:
if i in indices:
result.append('<')
switch = False
result.append(u)
else:
result.append(u)
if i not in indices:
result.append('>')
switch = True
# Convert the list to a single string
output = ''.join(result)
# Merge adjacent groups
output = output.replace('> <', ' ').replace('><', '')
print(output)
output
how does this work now chisolm hisow
['ow', 'his', 'work']
ow 1
ow 20
ow 34
his 10
his 24
his 31
work 14
{32, 1, 34, 10, 11, 14, 15, 16, 20, 24, 25, 31}
h<ow> does t<his work> n<ow> c<his>olm <hisow>
ngrams = ['his', 's wo', 'wor']. Do you expect to gethow does t<his wor>k?ngrams = ['ab'],txt = 'abominable abs'. Do you expect<ab>ominable abs,<ab>ominable <ab>sorabominable <ab>s?ngramelement that is not present intxt?