1

I am parsing RSS content using Universal Feed Parser. In the description tag, I sometimes get values like the ones below:

<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<m:Table>Sampe Text</m:Table>

In order to remove HTML elements/tags, I am using the following regex.

pattern = re.compile(r'</?\w+\s*[^>]*?/?>', re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
desc = pattern.sub(u" ", desc)

This removes the HTML tags but not the XML comments. How do I remove both the elements and the XML comments?
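A minimal reproduction of the behavior: the comment survives because `\w+` in the pattern cannot match the `!` that opens `<!--` (sample input taken from the question above).

```python
import re

desc = '<!--This is the XML comment --><p>This is a Test Paragraph</p>'

# The question's pattern matches tags whose name starts with a word character,
# so it never matches comments, which begin with "<!--".
pattern = re.compile(r'</?\w+\s*[^>]*?/?>',
                     re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
print(pattern.sub(' ', desc))  # the comment is still there
```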

3 Comments
  • Wouldn't this be enough? r'<.*?>' Commented Oct 12, 2011 at 11:47
  • The proper way to do this would be to use an XML parser, like @duffymo said. Try BeautifulSoup. Commented Oct 12, 2011 at 12:00
  • A parser is overkill in this case. You don't need to know the tree structure, tag namespace, name, and attributes only to throw them away, do you? Oh, and @rplnt, you forgot about the CDATA (<![CDATA[some text <this is not a tag!> some more text]]>). Commented Oct 12, 2011 at 12:03

4 Answers

5

Using lxml:

import lxml.html as LH

content='''
<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<Table>Sampe Text</Table>
'''

doc=LH.fromstring(content)
print(doc.text_content())

yields

This is a Test Paragraph
Sample Bold
Sampe Text
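A sketch of wrapping this into a helper for each feed entry's description (the guard for empty strings is an addition here, since `lxml.html.fromstring` raises on whitespace-only input; the sample value is from the question):

```python
import lxml.html as LH

def strip_markup(desc):
    """Return the plain text of an HTML/XML fragment, dropping tags and comments."""
    if not desc or not desc.strip():
        return ''  # fromstring() raises ParserError on empty input
    return LH.fromstring(desc).text_content()

desc = ('<!--This is the XML comment -->'
        '<p>This is a Test Paragraph</p>'
        '<m:Table>Sampe Text</m:Table>')
print(strip_markup(desc))
```

`text_content()` evaluates the XPath `string()` of the tree, which skips comment nodes, so the comment disappears without any extra handling.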


4

Using regular expressions this way is a bad idea.

I'd navigate the DOM tree after using a real parser and remove what I wanted that way.
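A sketch of that parser-based approach with BeautifulSoup (one choice among real parsers; extracting `Comment` nodes explicitly before taking the text keeps the result independent of any `get_text()` defaults):

```python
from bs4 import BeautifulSoup, Comment

desc = ('<!--This is the XML comment -->'
        '<p>This is a Test Paragraph</p>'
        '<b>Sample Bold</b>')

soup = BeautifulSoup(desc, 'html.parser')

# Remove comment nodes from the tree, then take only the remaining text nodes.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

print(soup.get_text(separator=' ', strip=True))
```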

9 Comments

As per the accepted answer here: stackoverflow.com/questions/1732348/…. Use Beautiful Soup instead.
You guys from Ban Regex Movement are really freaking me out. Regex cannot be used to PARSE XML because tags can be nested (<b><i></i></b>) but they can be used to STRIP tags 'cause a tag is simply anything between angle brackets. Read Wikipedia, dammit. (Sorry.)
There is no movement to ban regexp, it's just to point out that the correct tools should be used for each task, and before stripping out a tag you have to find it, and how would you do that? with a regexp? Bad idea.
So why is it bad then, exactly?
Because the DOM tree has more context, it gives you element type information, and it has a good API (XPath) for finding things.
1

There's a simple way to do this with pure Python:

def remove_html_markup(s):
    tag = False    # currently inside a <...> span?
    quote = False  # currently inside a quoted attribute value?
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
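Exercised on the question's sample (the function is repeated so the snippet runs on its own), the comment is stripped as well, because to this little state machine a comment is just another `<...>` span:

```python
def remove_html_markup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out

sample = '<!--This is the XML comment --><p>This is a Test Paragraph</p>'
print(remove_html_markup(sample))  # -> This is a Test Paragraph
```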

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS: If you're interested in the class (about smart debugging with Python), here's the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome!


0

Why so complex? re.sub(r'<!\[CDATA\[(.*?)\]\]>|<.*?>', lambda m: m.group(1) or '', desc, flags=re.DOTALL)

If you want XML tags intact, you should probably check out a list of HTML tags at http://www.whatwg.org/specs/web-apps/current-work/multipage/ and use the '(<!\[CDATA\[.*?\]\]>)|<!--.*?-->|</?(?:tag names separated by pipes)(?:\s.*?)?>' regex.
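For instance (sample input assembled from the question and the comments above), the CDATA contents survive while tags and comments are dropped:

```python
import re

desc = ('<!--This is the XML comment -->'
        '<p>This is a Test Paragraph</p>'
        '<![CDATA[some text <this is not a tag!> some more text]]>')

# The CDATA alternative is tried first at each position, so its contents are
# kept via group(1); every other <...> span (tags and comments) is dropped.
clean = re.sub(r'<!\[CDATA\[(.*?)\]\]>|<.*?>',
               lambda m: m.group(1) or '', desc, flags=re.DOTALL)
print(clean)
```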
