Regex replace text nodes in an html document

Question

I have a string that represents an html document. I'm trying to replace text in that document, excluding the markup and attribute values ofcourse with some replacement html. I thought it would be simple, but it is incredibly tedious when you want to replace the text with markup. For example, to replace somekeyword with <a href = "link">somekeyword</a>.

from lxml.html import fragments_fromstring, fromstring, tostring
from re import compile
def markup_aware_sub(pattern, repl, text):
    exp = compile(pattern)
    root = fromstring(text)

    els = [el for el in root.getiterator() if el.text]
    els = [el for el in els if el.text.strip()]
    for el in els:
        text = exp.sub(repl, el.text)
        if text == el.text:
            continue
        parent = el.getparent()
        new_el = fromstring(text)
        new_el.tag = el.tag
        for k, v in el.attrib.items():
            new_el.attrib[k] = v
        parent.replace(el, new_el)
    return tostring(root)

markup_aware_sub('keyword', '<a>blah</a>', '<div><p>Text with keyword here</p></div>')

It works but only if the keyword is exactly two "nestings" down. There has to be a better way to do it than the above, but after googling for many hours I can't find anything.

why don't you use a html parser for this ? Python has an inbuilt html parser. — unni
– unni, Commented Oct 27, 2011 at 12:31
Could you give a "before" and "after" example? Say you have <body><div><p class="keyword">My keyword</p></div><div>keyword</div></body>, should all "keyword" text be replaced with <a>blah</a>, but not the attribute? — jro
– jro, Commented Oct 27, 2011 at 12:32
@unni +1: an html parser would permit to search through all <p> contents without questionning. About replacing a keyword, you're not really using regex, which is good for performance (but then the question title is a bit wrong). — Joël
– Joël, Commented Oct 27, 2011 at 13:01

unni · Accepted Answer · 2011-10-27 13:44:54Z

3

This might be the solution you are lookin for:

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self,link, keyword):
    HTMLParser.__init__(self)
    self.__html = []
    self.link = link
    self.keyword = keyword

    def handle_data(self, data):
    text = data.strip()
    self.__html.append(text.replace(self.keyword,'<a href="'+self.link+'>'+self.keyword+'</a>'))

    def handle_starttag(self, tag, attrs):
    self.__html.append("<"+tag+">")

    def handle_endtag(self, tag):
    self.__html.append("</"+tag+">")

    def new_html(self):
    return ''.join(self.__html).strip()


parser = MyParser("blah","keyword")
parser.feed("<div><p>Text with keyword here</p></div>")
parser.close()
print parser.new_html()

This will give you the following output

<div><p>Text with <a href="blah>keyword</a> here</p></div>

The problem with your lxml approach only seems to occur when the keywords has only a single nesting. It seems to work fine with multiple nestings. So I added an if condition to catch this exception.

from lxml.html import fragments_fromstring, fromstring, tostring
from re import compile
def markup_aware_sub(pattern, repl, text):
    exp = compile(pattern)
    root = fromstring(text)
    els = [el for el in root.getiterator() if el.text]
    els = [el for el in els if el.text.strip()]

    if len(els) == 1:
      el = els[0]
      text = exp.sub(repl, el.text)
      parent = el.getparent()
      new_el = fromstring(text)
      new_el.tag = el.tag
      for k, v in el.attrib.items():
          new_el.attrib[k] = v
      return tostring(new_el)

    for el in els:
      text = exp.sub(repl, el.text)
      if text == el.text:
        continue
      parent = el.getparent()
      new_el = fromstring(text)
      new_el.tag = el.tag
      for k, v in el.attrib.items():
          new_el.attrib[k] = v
      parent.replace(el, new_el)
    return tostring(root)

print markup_aware_sub('keyword', '<a>blah</a>', '<p>Text with keyword here</p>')

Not very elegant, but seems to work. Please check it out.

edited Oct 27, 2011 at 13:44

answered Oct 27, 2011 at 12:56

unni

3,0114 gold badges22 silver badges22 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Gaslight Deceive Subvert Over a year ago

Thanks! But I do prefer an approach based on lxml instead of using a different api.

unni Over a year ago

@BjörnLindqvist no problemo. I will look for something based on lxml too.

Collectives™ on Stack Overflow

Regex replace text nodes in an html document

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related