Escape unescaped data of XML string in python3

Question

I want to escape the unescaped data inside an XML string, e.g.

string = '<tag parameter = "something">I want to escape these >, < and &</tag>'

to

'<tag parameter = "something">I want to escape these &gt;, &lt; and &amp;</tag>'

I can't use XML parsing libraries such as xml.dom.minidom or xml.etree, because the unescaped data makes the string malformed and the parser raises an error.

With a regex, I figured out a way to match the data substring and get its start and end positions:

import re
from xml.sax import saxutils

def escape_data(label):
    exp = re.search(r">.+?</", label)
    # Get position of the data between tags
    start = exp.start() + 1
    end = exp.end() - 2
    return label[:start] + saxutils.escape(label[start:end]) + label[end:]

But with re.search, I can't match the exact XML format.
If I use re.findall, I can't get the positions of the substrings it finds.
I could always locate each found substring with index, but that wouldn't be efficient. I want a simple but efficient solution.
BeautifulSoup solutions are welcome, but I wish there were a more beautiful way to do it with Python's standard library.
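
On the re.findall point: re.finditer yields match objects instead of plain strings, so start(), end() and span() are available for every match. Below is a minimal sketch, assuming the data always sits between a ">" and the next "</". The helper name data_spans is made up for illustration.

```python
import re

def data_spans(label):
    """Return (start, end) index pairs for the text between '>' and '</'."""
    # Hypothetical helper: the capture group isolates the data, so
    # m.span(1) gives its position without manual offset arithmetic.
    return [m.span(1) for m in re.finditer(r">(.+?)</", label)]

label = '<tag parameter = "something">I want to escape these >, < and &</tag>'
for start, end in data_spans(label):
    print(start, end, repr(label[start:end]))
```

Because the positions come straight from the match objects, no second pass with index is needed.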

And how was this faulty XML produced in the first place? It'd be easier if you could fix that instead.
– Martijn Pieters, Commented Mar 5, 2014 at 13:19
In the string you have , instead of . in ">.+?</"; maybe that's why you can't match anything in re.search.
– Aleksandar, Commented Mar 5, 2014 at 14:04
Sure @MartijnPieters, but I'm in a time crunch and this is the hack I could come up with :)
– Parth, Commented Mar 5, 2014 at 20:54

Community · Accepted Answer · 2017-05-23 11:57:49Z

3

Perhaps you should be considering re.sub:

>>> import html
>>> import re
>>> oldString = '<tag parameter = "something">I want to escape these >, < and &</tag>'
>>> newString = re.sub(r"(<tag.*?>)(.*?)</tag>", lambda m: m.group(1) + html.escape(m.group(2), quote=False) + "</tag>", oldString)
>>> print(newString)
<tag parameter = "something">I want to escape these &gt;, &lt; and &amp;</tag>

My warning is that the regular expression will definitely break if you have nested tags. See Why is it such a bad idea to parse XML with regex?
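
To make the nested-tag warning concrete, here is a small sketch that runs the same pattern on a flat and on a nested input. It uses xml.sax.saxutils.escape (the function from the question) for the escaping step. The name escape_tag_data and the sample strings are only illustrative.

```python
import re
from xml.sax.saxutils import escape

# Same pattern as in the answer above; escape() replaces &, < and > only.
PATTERN = re.compile(r"(<tag.*?>)(.*?)</tag>")

def escape_tag_data(text):
    # Illustrative helper: rebuild each match with its data escaped.
    return PATTERN.sub(lambda m: m.group(1) + escape(m.group(2)) + "</tag>", text)

flat = "<tag>a & b</tag>"
nested = "<tag><tag>a & b</tag></tag>"

print(escape_tag_data(flat))    # the data is escaped as intended
print(escape_tag_data(nested))  # the inner <tag> is treated as data and escaped too
```

In the nested case the lazy match stops at the first </tag>, so the inner opening tag ends up inside the "data" group and gets escaped along with it.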

edited May 23, 2017 at 11:57

CommunityBot

answered Mar 5, 2014 at 13:50

icedtrees

2 Comments

Parth Over a year ago

It's a really efficient form of my code, but my primary concern is to also match the enclosing tags <tag.*> and </tag> using regex.

icedtrees Over a year ago

@prth I edited the code to match <tag> tags. I'm not sure if this is exactly what you wanted.
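
If the tag name is not fixed, one possible way to tie the closing tag to its opening tag is a backreference. This is only a sketch under stated assumptions: elements are not nested, attribute values contain no "<" or ">", and the data is not already partly escaped (otherwise "&amp;" would become "&amp;amp;"). The names ELEMENT and escape_element_data are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Group 1: whole opening tag (group 2 is the tag name), group 3: the data,
# group 4: closing tag, where \2 forces the same name as the opening tag.
ELEMENT = re.compile(r"(<([A-Za-z_][\w.-]*)\b[^<>]*>)(.*?)(</\2\s*>)", re.DOTALL)

def escape_element_data(text):
    # Keep both tags untouched and escape only the data between them.
    return ELEMENT.sub(lambda m: m.group(1) + escape(m.group(3)) + m.group(4), text)

sample = '<tag parameter = "something">I want to escape these >, < and &</tag>'
print(escape_element_data(sample))
```

The same caveat as in the answer applies: this is a regex hack for flat input, not a replacement for fixing the producer of the faulty XML.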
