1

I am parsing RSS content using Universal Feed Parser. In the description tag, I sometimes get values like the ones below:

<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<m:Table>Sampe Text</m:Table>

In order to remove HTML elements/tags, I am using the following regex.

pattern = re.compile(r'</?\w+\s*[^>]*?/?>', re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
desc = pattern.sub(u" ", desc)

This removes the HTML tags but not the XML comments. How do I remove both the elements and the XML comments?
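A minimal reproduction of the behavior: the comment survives because `\w+` in the pattern cannot match the `!` that opens `<!--` (sample input taken from the question above).

```python
import re

desc = '<!--This is the XML comment --><p>This is a Test Paragraph</p>'

# The question's pattern matches tags whose name starts with a word character,
# so it never matches comments, which begin with "<!--".
pattern = re.compile(r'</?\w+\s*[^>]*?/?>',
                     re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
print(pattern.sub(' ', desc))  # the comment is still there
```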

3 Comments
  • Wouldn't this be enough? r'<.*?>' Commented Oct 12, 2011 at 11:47
  • The proper way to do this would be to use an XML parser, like @duffymo said. Try BeautifulSoup. Commented Oct 12, 2011 at 12:00
  • A parser is overkill in this case. You don't need to know the tree structure, tag namespace, name, and attributes only to throw them away, do you? Oh, and @rplnt, you forgot about the CDATA (<![CDATA[some text <this is not a tag!> some more text]]>). Commented Oct 12, 2011 at 12:03

4 Answers

5

Using lxml:

import lxml.html as LH

content='''
<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<Table>Sampe Text</Table>
'''

doc=LH.fromstring(content)
print(doc.text_content())

yields

This is a Test Paragraph
Sample Bold
Sampe Text
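A sketch of wrapping this into a helper for each feed entry's description (the guard for empty strings is an addition here, since `lxml.html.fromstring` raises on whitespace-only input; the sample value is from the question):

```python
import lxml.html as LH

def strip_markup(desc):
    """Return the plain text of an HTML/XML fragment, dropping tags and comments."""
    if not desc or not desc.strip():
        return ''  # fromstring() raises ParserError on empty input
    return LH.fromstring(desc).text_content()

desc = ('<!--This is the XML comment -->'
        '<p>This is a Test Paragraph</p>'
        '<m:Table>Sampe Text</m:Table>')
print(strip_markup(desc))
```

`text_content()` evaluates the XPath `string()` of the tree, which skips comment nodes, so the comment disappears without any extra handling.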


4

Using regular expressions this way is a bad idea.

I'd navigate the DOM tree after using a real parser and remove what I wanted that way.
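A sketch of that parser-based approach with BeautifulSoup (one choice among real parsers; extracting `Comment` nodes explicitly before taking the text keeps the result independent of any `get_text()` defaults):

```python
from bs4 import BeautifulSoup, Comment

desc = ('<!--This is the XML comment -->'
        '<p>This is a Test Paragraph</p>'
        '<b>Sample Bold</b>')

soup = BeautifulSoup(desc, 'html.parser')

# Remove comment nodes from the tree, then take only the remaining text nodes.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

print(soup.get_text(separator=' ', strip=True))
```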

9 Comments

As per the accepted answer here: stackoverflow.com/questions/1732348/…. Use Beautiful Soup instead.
You guys from Ban Regex Movement are really freaking me out. Regex cannot be used to PARSE XML because tags can be nested (<b><i></i></b>) but they can be used to STRIP tags 'cause a tag is simply anything between angle brackets. Read Wikipedia, dammit. (Sorry.)
There is no movement to ban regexp, it's just to point out that the correct tools should be used for each task, and before stripping out a tag you have to find it, and how would you do that? with a regexp? Bad idea.
So why is it bad then, exactly?
Because the DOM tree has more context, it gives you element type information, and it has a good API (XPath) for finding things.
1

There's a simple way to do this with pure Python:

def remove_html_markup(s):
    tag = False    # currently inside a <...> span?
    quote = False  # currently inside a quoted attribute value?
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
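Exercised on the question's sample (the function is repeated so the snippet runs on its own), the comment is stripped as well, because to this little state machine a comment is just another `<...>` span:

```python
def remove_html_markup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out

sample = '<!--This is the XML comment --><p>This is a Test Paragraph</p>'
print(remove_html_markup(sample))  # -> This is a Test Paragraph
```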

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS: If you're interested in the class (about smart debugging with Python), here's the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome!


0

Why so complex? re.sub(r'<!\[CDATA\[(.*?)\]\]>|<.*?>', lambda m: m.group(1) or '', desc, flags=re.DOTALL)

If you want XML tags intact, you should probably check out a list of HTML tags at http://www.whatwg.org/specs/web-apps/current-work/multipage/ and use the '(<!\[CDATA\[.*?\]\]>)|<!--.*?-->|</?(?:tag names separated by pipes)(?:\s.*?)?>' regex.
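For instance (sample input assembled from the question and the comments above), the CDATA contents survive while tags and comments are dropped:

```python
import re

desc = ('<!--This is the XML comment -->'
        '<p>This is a Test Paragraph</p>'
        '<![CDATA[some text <this is not a tag!> some more text]]>')

# The CDATA alternative is tried first at each position, so its contents are
# kept via group(1); every other <...> span (tags and comments) is dropped.
clean = re.sub(r'<!\[CDATA\[(.*?)\]\]>|<.*?>',
               lambda m: m.group(1) or '', desc, flags=re.DOTALL)
print(clean)
```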
