I have a string that represents an html document. I'm trying to replace text in that document, excluding the markup and attribute values ofcourse with some replacement html. I thought it would be simple, but it is incredibly tedious when you want to replace the text with markup. For example, to replace somekeyword with <a href = "link">somekeyword</a>.
from lxml.html import fragments_fromstring, fromstring, tostring
from re import compile
def markup_aware_sub(pattern, repl, text):
exp = compile(pattern)
root = fromstring(text)
els = [el for el in root.getiterator() if el.text]
els = [el for el in els if el.text.strip()]
for el in els:
text = exp.sub(repl, el.text)
if text == el.text:
continue
parent = el.getparent()
new_el = fromstring(text)
new_el.tag = el.tag
for k, v in el.attrib.items():
new_el.attrib[k] = v
parent.replace(el, new_el)
return tostring(root)
markup_aware_sub('keyword', '<a>blah</a>', '<div><p>Text with keyword here</p></div>')
It works but only if the keyword is exactly two "nestings" down. There has to be a better way to do it than the above, but after googling for many hours I can't find anything.
<body><div><p class="keyword">My keyword</p></div><div>keyword</div></body>, should all "keyword" text be replaced with<a>blah</a>, but not the attribute?<p>contents without questionning. About replacing a keyword, you're not really using regex, which is good for performance (but then the question title is a bit wrong).