1

I have a Python script that will look at an HTML file that has the following format:

<DOC>
<HTML>
...
</HTML>
</DOC>
<DOC>
<HTML>
...
</HTML>
</DOC>

How do I remove all HTML tags (replace the tags with '') except the opening and closing DOC tags, using regex in Python? Also, if I want to retain the alt text of an <img> tag, what should the regular expression look like?

  • You should use a DOM parser, not a regular expression. See docs.python.org/library/xml.dom.html Commented Sep 27, 2009 at 21:54
  • And can you be more specific about what you want to remove? Commented Sep 27, 2009 at 21:55
  • I want to remove all tags except for the <DOC> and </DOC> tags. Commented Sep 27, 2009 at 22:04
  • Is 'html' a hypothetical element name and not really 'html'? Commented Sep 28, 2009 at 4:04

3 Answers

4

For what you are trying to accomplish I would use BeautifulSoup rather than regex.

http://www.crummy.com/software/BeautifulSoup/
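A rough sketch of the parser-based idea follows. To keep it free of third-party installs it uses the standard library's html.parser instead of BeautifulSoup, so treat it as an illustration of the approach rather than BeautifulSoup's API. The names TagStripper and strip_tags_keep_doc are made up for this example, and it assumes the only tags to keep are DOC and that alt text should replace each img tag.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps text and <DOC>/</DOC>, drops other tags."""

    def __init__(self):
        # convert_charrefs=True means entities such as &amp; arrive decoded.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase.
        if tag == 'doc':
            self.parts.append('<DOC>')
        elif tag == 'img':
            # Assumption: an <img> is replaced by its alt text, if it has any.
            alt = dict(attrs).get('alt')
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag == 'doc':
            self.parts.append('</DOC>')

    def handle_data(self, data):
        # Text between tags is kept as-is.
        self.parts.append(data)


def strip_tags_keep_doc(text):
    parser = TagStripper()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.parts)


sample = ('<DOC>\n<HTML>\n<p>Hello <b>World</b> '
          '<img src="a.png" alt="a cat"></p>\n</HTML>\n</DOC>')
print(strip_tags_keep_doc(sample))
```

Because the parser decodes entities, the output is plain text and is no longer HTML-escaped; re-escape it if it will be embedded in markup again.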


2

Check out lxml, a really nice Python library for dealing with XML and HTML. You can use drop_tag to accomplish what you are looking for.

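from lxml import html

# Parse a fragment; the <doc> element becomes the root and is kept.
h = html.fragment_fromstring('<doc>Hello <b>World!</b></doc>')
# drop_tag() removes an element's tags but keeps its text and children.
for el in h.xpath('.//*'):
    el.drop_tag()
# On Python 3, pass encoding='unicode' (a string) to get str output.
print(html.tostring(h, encoding='unicode'))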

<doc>Hello World!</doc>


1

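Search and replace with this regex. Search for: <.*?> Replace with: '' (the empty string). Note that this pattern matches every tag, so as written it also removes <DOC> and </DOC>.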
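To keep the DOC tags and the alt text, the pattern needs a bit more. Here is a minimal sketch, assuming tags never contain a literal > inside attribute values and that the alt text belongs to img tags. TAG_RE, IMG_ALT_RE and strip_tags are names invented for this example. A negative lookahead skips <DOC> and </DOC>, and a first pass swaps each img tag for its alt text before the remaining tags are removed.

```python
import re

# Any tag except <DOC> and </DOC>; the negative lookahead rejects those two.
TAG_RE = re.compile(r'<(?!/?DOC\s*>)[^>]*>', re.IGNORECASE)

# An <img> tag with a quoted alt attribute; group 2 captures the alt text.
IMG_ALT_RE = re.compile(
    r'<img\b[^>]*?\balt\s*=\s*(["\'])(.*?)\1[^>]*>',
    re.IGNORECASE | re.DOTALL,
)


def strip_tags(text):
    # First replace each <img ... alt="..."> with its alt text,
    # then remove every remaining tag other than <DOC> and </DOC>.
    text = IMG_ALT_RE.sub(lambda m: m.group(2), text)
    return TAG_RE.sub('', text)


sample = ('<DOC>\n<HTML>\n<p>Hello <b>World</b> '
          '<img src="a.png" alt="a cat"></p>\n</HTML>\n</DOC>')
print(strip_tags(sample))
```

This is fine for well-behaved input, but regexes break easily on comments, scripts, or malformed markup, which is why the other answers suggest a parser.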
