I've got a bunch of HTML pages in which I'd like to convert CSS-formatted text snippets into standard HTML tags. For example, <span class="bold">some text</span> should become <b>some text</b>.

I'm stuck at nested span fragments:

<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>

I'd like to convert these fragments using Python's regex library (re). What would be the best strategy for a regex search-and-replace on the input above?

  • Why must it be done by regular expression? Commented Dec 10, 2013 at 5:16
  • It's just a personal preference. I know it could be done with a recursive plain string search, but somehow I find regex solutions more elegant. Commented Dec 10, 2013 at 5:18
  • The optimal strategy would really be to use something other than regular expressions, which are terribly underpowered for this. Beautiful Soup is the most popular go-to solution for parsing HTML in Python. Commented Dec 10, 2013 at 5:19
  • It probably won't be so elegant. To do tag balancing, you need something stronger than regex. If you still want to use regular expressions, you'll need to use a loop. Commented Dec 10, 2013 at 5:20
  • The ultimate html-regex rant is here. Commented Dec 10, 2013 at 5:30
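
For what it's worth, the "use a loop" idea from the comments could look roughly like the sketch below. It repeatedly rewrites the innermost spans (those whose body contains no further <span) until nothing changes. The names CLASS_TO_TAG, INNERMOST and replace_spans are made up for illustration, and the pattern assumes the markup is exactly <span class="..."> with double quotes and no other attributes, so treat it as a sketch rather than a general HTML solution.

```python
import re

# Hypothetical mapping from CSS class to replacement tag.
CLASS_TO_TAG = {"underline": "u", "italic": "i", "bold": "b"}

# An "innermost" span: its body contains no further "<span" before the
# first closing "</span>". Assumes class is the only attribute.
INNERMOST = re.compile(
    r'<span class="(bold|italic|underline)">((?:(?!<span\b).)*?)</span>',
    re.DOTALL,
)


def _swap(match):
    tag = CLASS_TO_TAG[match.group(1)]
    return "<{0}>{1}</{0}>".format(tag, match.group(2))


def replace_spans(html):
    # Each pass converts the current innermost spans; converting them
    # exposes their parents as the new innermost spans for the next pass.
    while True:
        html, count = INNERMOST.subn(_swap, html)
        if count == 0:
            return html


print(replace_spans('<span class="italic"><span class="bold">XXXXXXXX</span></span>'))
# -> <i><b>XXXXXXXX</b></i>
```

One limitation of this approach: a mapped span that contains a span with some other class is never converted, because its body never becomes free of "<span".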

1 Answer

My solution using lxml and cssselect and a bit of Python:

#!/usr/bin/env python

import cssselect  # noqa
from lxml.html import fromstring


html = """
<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>
"""

class_to_style = {
    "underline": "u",
    "italic": "i",
    "bold": "b",
}

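output = []
doc = fromstring(html)
# Visit every <span> in document order, at any nesting depth.
spans = doc.cssselect("span")
for span in spans:
    # Skip spans that carry no class attribute.
    if span.attrib.get("class"):
        # span.text is only the text before the first child element,
        # so each span is emitted on its own and the nesting is lost.
        output.append("<{0}>{1}</{0}>".format(class_to_style[span.attrib["class"]], span.text or ""))
print("".join(output))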

Output:

<i></i><b>XXXXXXXX</b><i>some text</i><b>nested text</b><u>deep nested text</u>

NB: This is a naive solution and does not produce the correct output: it emits each span on its own instead of nesting the tags. To get the nesting right, you'd have to keep a stack of open tags and close them in reverse order.
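
A minimal sketch of the "track the open tags" idea from the NB above, using the standard library's html.parser instead of lxml (a deliberate swap so the example has no third-party dependencies). The names CLASS_TO_TAG, SpanRewriter and rewrite_spans are hypothetical, and spans whose class is not in the mapping are assumed to be left unchanged.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class to replacement tag.
CLASS_TO_TAG = {"underline": "u", "italic": "i", "bold": "b"}


class SpanRewriter(HTMLParser):
    """Re-emit the input, turning mapped <span class=...> into <b>/<i>/<u>."""

    def __init__(self):
        # convert_charrefs=False so entities are passed through verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []
        # One entry per open <span>: the replacement tag, or None if the
        # span is kept as-is. Popped in reverse order on </span>.
        self.open_spans = []

    def handle_starttag(self, tag, attrs):
        new_tag = None
        if tag == "span":
            new_tag = CLASS_TO_TAG.get(dict(attrs).get("class"))
            self.open_spans.append(new_tag)
        if new_tag:
            self.out.append("<%s>" % new_tag)
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        new_tag = None
        if tag == "span" and self.open_spans:
            new_tag = self.open_spans.pop()
        self.out.append("</%s>" % (new_tag or tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)


def rewrite_spans(html):
    parser = SpanRewriter()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


print(rewrite_spans(
    '<span class="italic">some text<span class="bold">nested text'
    '<span class="underline">deep nested text</span></span></span>'
))
# -> <i>some text<b>nested text<u>deep nested text</u></b></i>
```

Because every </span> pops the most recently opened span, the closing tags always come out in the reverse of the opening order, which is what the naive version above is missing.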


4 Comments

Awesome! I was unaware of cssselect until now! Thanks, @James Mills!
Oops! It doesn't work as expected. The output should be: <i><b>XXXXXXXX</b></i><i>some text<b>nested text<u>deep nested text</u></b></i>
Yes, my solution is naive at best. You'll have to keep a stack of open tags and close them in reverse order. I'm sure you can do this? :) Updated my answer to reflect this. (Have to leave you a little work!)
You're right. I'm exploring cssselect and spyda. Thanks for the heads up!
