I want to escape the unescaped data inside an XML string, e.g.

string = '<tag parameter = "something">I want to escape these >, < and &</tag>'

to

'<tag parameter = "something">I want to escape these &gt;, &lt; and &amp;</tag>'
  • I can't use XML parsing libraries like xml.dom.minidom or xml.etree, because the data is unescaped and parsing it raises an error
  • With regex, I figured out a way to match the data substring and get its start and end positions

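    import re
    from xml.sax import saxutils

    def escape_data(label):
        # Match from the end of the opening tag to the start of the closing tag
        exp = re.search(">.+?</", label)
        # Get the positions of the data between the tags
        start = exp.start() + 1
        end = exp.end() - 2
        return label[:start] + saxutils.escape(label[start:end]) + label[end:]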
    
  • But with re.search, I can't match the exact XML format

  • If I use re.findall, I can't get the positions of the substrings it finds
  • I could always look up the position of each found substring by index, but that won't be efficient; I want a simple but efficient solution
  • BeautifulSoup solutions are welcome, but I wish there were a more beautiful way to do it with Python's standard libraries
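
On the point about re.findall not giving positions: re.finditer yields match objects, and each one exposes the span of any group. The following is a minimal sketch, not a general XML solution. The helper name `data_spans` and the pattern are assumptions made for illustration; the pattern assumes flat (non-nested) elements whose attribute values contain no `>`.

```python
import re

def data_spans(label):
    """Return (start, end) index pairs for the text inside each flat element."""
    # Group 1 is the tag name, group 2 is the text between the tags.
    # \1 requires the closing tag to use the same name as the opening tag.
    pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)
    return [m.span(2) for m in pattern.finditer(label)]

label = '<tag parameter = "something">I want to escape these >, < and &</tag>'
for start, end in data_spans(label):
    # Each span indexes straight into the original string.
    print(start, end, label[start:end])
```

Because the spans come directly from the match objects, there is no need for a second search with index to locate each substring.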
  • And how was this faulty XML produced in the first place? It'd be easier if you could fix that instead. Commented Mar 5, 2014 at 13:19
  • In the string you have , instead of . in ">.+?</"; maybe that's why you can't match anything with re.search. Commented Mar 5, 2014 at 14:04
  • @Aleksandar The above program snippet works absolutely fine. Commented Mar 5, 2014 at 20:37
  • Sure @MartijnPieters, but I'm in a time crunch and this is the hack I could come up with :) Commented Mar 5, 2014 at 20:54

1 Answer

Perhaps you should be considering re.sub:

>>> import re
>>> from xml.sax.saxutils import escape
>>> oldString = '<tag parameter = "something">I want to escape these >, < and &</tag>'
>>> newString = re.sub(r"(<tag.*?>)(.*?)</tag>", lambda m: m.group(1) + escape(m.group(2)) + "</tag>", oldString)
>>> print(newString)
<tag parameter = "something">I want to escape these &gt;, &lt; and &amp;</tag>

Be warned that this regular expression will definitely break if you have nested tags. See "Why is it such a bad idea to parse XML with regex?"
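
If the tag name is not always `tag`, a backreference can tie the closing tag to whatever name the opening tag used. This is only a sketch under the same limitation: it assumes flat (non-nested) elements and attribute values without `>`. The names `ELEMENT` and `escape_element_text` are hypothetical, chosen for this example.

```python
import re
from xml.sax.saxutils import escape

# Group 1: whole opening tag, group 2: tag name,
# group 3: text between the tags, group 4: closing tag with the same name.
ELEMENT = re.compile(r"(<(\w+)[^>]*>)(.*?)(</\2>)", re.DOTALL)

def escape_element_text(xml_string):
    """Escape &, < and > in the text of each flat (non-nested) element."""
    return ELEMENT.sub(
        lambda m: m.group(1) + escape(m.group(3)) + m.group(4),
        xml_string,
    )

print(escape_element_text('<a>x & y</a><b>1 < 2</b>'))
```

Note that text which is already escaped would be escaped a second time (`&amp;` would become `&amp;amp;`), so this is only suitable when the data is known to be completely unescaped.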

2 Comments

It's a really efficient form of my code, but my primary concern is to also match the enclosing tags <tag.*> and </tag> using regex.
@prth I edited the code to match <tag> tags. I'm not sure if this is exactly what you wanted.
