I have a string that represents an HTML document. I'm trying to replace text in that document with some replacement HTML, leaving the markup and attribute values untouched of course. I thought it would be simple, but it is incredibly tedious when you want to replace the text with markup. For example, to replace somekeyword with <a href="link">somekeyword</a>.

from lxml.html import fragments_fromstring, fromstring, tostring
from re import compile
def markup_aware_sub(pattern, repl, text):
    exp = compile(pattern)
    root = fromstring(text)

    els = [el for el in root.getiterator() if el.text]
    els = [el for el in els if el.text.strip()]
    for el in els:
        text = exp.sub(repl, el.text)
        if text == el.text:
            continue
        parent = el.getparent()
        new_el = fromstring(text)
        new_el.tag = el.tag
        for k, v in el.attrib.items():
            new_el.attrib[k] = v
        parent.replace(el, new_el)
    return tostring(root)

markup_aware_sub('keyword', '<a>blah</a>', '<div><p>Text with keyword here</p></div>')

It works but only if the keyword is exactly two "nestings" down. There has to be a better way to do it than the above, but after googling for many hours I can't find anything.

  • Why don't you use an HTML parser for this? Python has a built-in HTML parser. Commented Oct 27, 2011 at 12:31
  • Could you give a "before" and "after" example? Say you have <body><div><p class="keyword">My keyword</p></div><div>keyword</div></body>, should all "keyword" text be replaced with <a>blah</a>, but not the attribute? Commented Oct 27, 2011 at 12:32
  • @unni +1: an HTML parser would let you search through all <p> contents without any trouble. As for replacing a keyword, you're not really using regex, which is good for performance (but then the question title is a bit wrong). Commented Oct 27, 2011 at 13:01

1 Answer

This might be the solution you are looking for:

from HTMLParser import HTMLParser  # Python 2; on Python 3 use: from html.parser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self, link, keyword):
        HTMLParser.__init__(self)
        self.__html = []
        self.link = link
        self.keyword = keyword

    def handle_data(self, data):
        # Only text nodes reach this handler, so markup is never touched.
        text = data.strip()
        self.__html.append(text.replace(self.keyword, '<a href="' + self.link + '">' + self.keyword + '</a>'))

    def handle_starttag(self, tag, attrs):
        # Note: attrs is ignored, so attributes of the original tags are dropped.
        self.__html.append("<" + tag + ">")

    def handle_endtag(self, tag):
        self.__html.append("</" + tag + ">")

    def new_html(self):
        return ''.join(self.__html).strip()


parser = MyParser("blah", "keyword")
parser.feed("<div><p>Text with keyword here</p></div>")
parser.close()
print(parser.new_html())

This will give you the following output:

<div><p>Text with <a href="blah">keyword</a> here</p></div>
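The parser above ignores the attrs argument and strips whitespace from each text node, so attributes and some spacing are lost. A possible Python 3 variant that avoids both is sketched below. It is only a sketch under a few assumptions: the names KeywordLinker and link_keyword are made up for illustration, start tags are echoed verbatim with get_starttag_text(), entity references are passed through unchanged, comments and doctype declarations are not re-emitted, and text inside <script> or <style> is treated like any other text.

```python
from html import escape
from html.parser import HTMLParser


class KeywordLinker(HTMLParser):
    """Rebuild the document, wrapping each keyword occurrence in a link."""

    def __init__(self, link, keyword):
        # convert_charrefs=False keeps entities such as &amp; out of
        # handle_data so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.keyword = keyword
        self.anchor = '<a href="%s">%s</a>' % (escape(link, quote=True), keyword)

    def handle_starttag(self, tag, attrs):
        # Echo the original start tag text, attributes included.
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Only text nodes arrive here, so attribute values are never replaced.
        self.parts.append(data.replace(self.keyword, self.anchor))

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)

    def new_html(self):
        return "".join(self.parts)


def link_keyword(html, link, keyword):
    """Hypothetical convenience wrapper around KeywordLinker."""
    parser = KeywordLinker(link, keyword)
    parser.feed(html)
    parser.close()
    return parser.new_html()


print(link_keyword('<div class="keyword"><p>Text with keyword here</p></div>', "blah", "keyword"))
# <div class="keyword"><p>Text with <a href="blah">keyword</a> here</p></div>
```

Note that the class attribute containing "keyword" is left alone, because attribute values never pass through handle_data.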

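The problem with your lxml approach seems to occur only when the keyword is a single nesting deep, that is, when the element holding the text is the root. In that case el.getparent() returns None, so parent.replace() fails. It seems to work fine with deeper nestings. So I added an if condition to catch this case.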

from lxml.html import fromstring, tostring
from re import compile

def markup_aware_sub(pattern, repl, text):
    exp = compile(pattern)
    root = fromstring(text)
    # Keep only elements that carry non-blank text of their own.
    els = [el for el in root.getiterator() if el.text]
    els = [el for el in els if el.text.strip()]

    # Special case: a single text-bearing element is rebuilt and returned
    # directly, since it may be the root and have no parent to replace it in.
    # Note: only that element is returned, so any wrapper elements are dropped.
    if len(els) == 1:
        el = els[0]
        text = exp.sub(repl, el.text)
        new_el = fromstring(text)
        new_el.tag = el.tag
        for k, v in el.attrib.items():
            new_el.attrib[k] = v
        return tostring(new_el)

    for el in els:
        text = exp.sub(repl, el.text)
        if text == el.text:
            continue
        parent = el.getparent()
        new_el = fromstring(text)
        new_el.tag = el.tag
        for k, v in el.attrib.items():
            new_el.attrib[k] = v
        parent.replace(el, new_el)
    return tostring(root)

print(markup_aware_sub('keyword', '<a>blah</a>', '<p>Text with keyword here</p>'))

Not very elegant, but seems to work. Please check it out.

2 Comments

Thanks! But I do prefer an approach based on lxml instead of using a different api.
@BjörnLindqvist no problemo. I will look for something based on lxml too.
