Nested string replace with regex in Python

Question

I've got a bunch of HTML pages, in which I'd like to convert CSS-formatted text snippets into standard HTML tags. e.g some text will become some text

I'm stuck at nested span fragments:

<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>

I'd like to convert the fragment using Python's regex library. What would be the optimal strategy to regex search-&-replace the above input?

It's just a personal preference. I know it could be done with recusive plain string search... But somehow I find regex solutions to be more elegant... — masroore
– masroore, Commented Dec 10, 2013 at 5:18
The optimal strategy would really be to use something other than regular expressions, which are terribly underpowered for this. Beautiful Soup is the most popular go-to solution for parsing HTML in Python. — Tim Pierce
– Tim Pierce, Commented Dec 10, 2013 at 5:19
It probably won't be so elegant. To do tag balancing, you need something stronger than regex. If you still want to use regular expressions, you'll need to use a loop. — Michael
– Michael, Commented Dec 10, 2013 at 5:20

James Mills · Accepted Answer · 2013-12-10 05:46:55Z

1

My solution using lxml and cssselect and a bit of Python:

#!/usr/bin/env python

import cssselect  # noqa
from lxml.html import fromstring


html = """
<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>
"""

class_to_style = {
    "underline": "u",
    "italic": "i",
    "bold": "b",
}

output = []
doc = fromstring(html)
spans = doc.cssselect("span")
for span in spans:
    if span.attrib.get("class"):
        output.append("<{0}>{1}</{0}>".format(class_to_style[span.attrib["class"]], span.text or ""))
print "".join(output)

Output:

<i></i><b>XXXXXXXX</b><i>some text</i><b>nested text</b><u>deep nested text</u>

NB: This is a naive solution and does not produce the correct output as you'd have to keep a queue of open tags and close them at the end.

edited Dec 10, 2013 at 5:46

answered Dec 10, 2013 at 5:35

James Mills

19.1k4 gold badges53 silver badges63 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

masroore Over a year ago

Awesome! I was unaware of cssselect until now! Thanks @James Mills !

masroore Over a year ago

Oops! It doesn't work as expected.. the output should be: XXXXXXXXsome textnested textdeep nested text

James Mills Over a year ago

Yes my solution is naive at best. You'll have to keep a queue of open tags and close them at the end. I'm sure you can do this? :) Updated my answer to reflect this. (Have to leave you a little work!)

masroore Over a year ago

You're right I'm exploring csselect & spyda. Thanks for the heads up!

Collectives™ on Stack Overflow

Nested string replace with regex in Python

1 Answer 1

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related