HTML tag replacement using regex and Python

Question

I have a Python script that reads an HTML file in the following format:

<DOC>
<HTML>
...
</HTML>
</DOC>
<DOC>
<HTML>
...
</HTML>
</DOC>

How do I remove all HTML tags (replace them with '') except the opening and closing DOC tags, using regex in Python? Also, if I want to retain the alt text of an <img> tag, what should the regular expression look like?

You should use a DOM parser, not a regular expression. See docs.python.org/library/xml.dom.html
– meder omuraliev, Commented Sep 27, 2009 at 21:54
I want to remove all tags except for the <DOC> and </DOC> tags.
– GobiasKoffi, Commented Sep 27, 2009 at 22:04
Is 'html' a hypothetical element name and not really 'html'?
– meder omuraliev, Commented Sep 28, 2009 at 4:04

John La Rooy · Accepted Answer · 2009-09-28 03:11:57Z

4

For what you are trying to accomplish I would use BeautifulSoup rather than regex.

http://www.crummy.com/software/BeautifulSoup/
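
As a rough sketch of how this could look with BeautifulSoup (the bs4 package): the function name strip_tags_keep_doc and the sample markup are made up for illustration, and using the built-in "html.parser" backend is an assumption. That backend lowercases tag names, so <DOC> comes back as <doc>.

```python
from bs4 import BeautifulSoup


def strip_tags_keep_doc(markup):
    """Remove every tag except <doc>, keeping text and <img> alt text.

    Hypothetical helper for illustration; not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # find_all(True) returns every tag in the tree, collected up front,
    # so modifying the tree inside the loop is safe.
    for tag in soup.find_all(True):
        if tag.name == "doc":
            continue  # keep the DOC wrapper
        if tag.name == "img":
            # swap the image for its alt text (empty string if absent)
            tag.replace_with(tag.get("alt", ""))
        else:
            # drop the tag itself but keep its children and text
            tag.unwrap()
    return str(soup)


sample = '<DOC><HTML><p>Hello <b>World</b> <img src="a.png" alt="logo"></p></HTML></DOC>'
print(strip_tags_keep_doc(sample))
```

If the original casing of <DOC> matters, the result would need a final replace step, or a different parser backend.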

moorej · Accepted Answer · 2009-09-28 02:54:23Z

2

Check out lxml, a really nice Python library for dealing with XML. You can use drop_tag to accomplish what you are looking for.

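from lxml import html

h = html.fragment_fromstring('<doc>Hello <b>World!</b></doc>')
# find('*') returns only the first child element; drop_tag() removes the tag but keeps its text
h.find('*').drop_tag()
# encoding='unicode' makes tostring() return a str under Python 3
print(html.tostring(h, encoding='unicode'))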

<doc>Hello World!</doc>

ennuikiller · Accepted Answer · 2009-09-27 21:53:42Z

1

Search and replace with this regex: search for <.*?> and replace with '' (the empty string).
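
Note that a plain <.*?> pattern would strip the DOC tags as well. One way to spare them, sketched here with Python's re module, is a negative lookahead, with a separate first pass that swaps each <img> for its alt text. The names IMG_ALT, NON_DOC_TAG and strip_tags_regex are made up for this example, and the patterns assume reasonably well-formed markup, with bare <DOC> and </DOC> tags (no attributes) as shown in the question.

```python
import re

# An <img ...> tag with a quoted alt attribute; group 2 captures the alt text.
IMG_ALT = re.compile(r'<img\b[^>]*?\balt\s*=\s*(["\'])(.*?)\1[^>]*>', re.IGNORECASE)

# Any tag that is not exactly <DOC> or </DOC> (case-insensitive).
NON_DOC_TAG = re.compile(r'<(?!/?DOC\s*>)[^>]*>', re.IGNORECASE)


def strip_tags_regex(text):
    """Hypothetical helper: strip all tags except DOC, keeping img alt text."""
    # First pass: replace each image tag with its alt text.
    text = IMG_ALT.sub(lambda m: m.group(2), text)
    # Second pass: delete every remaining tag except the DOC wrapper.
    return NON_DOC_TAG.sub('', text)


sample = '<DOC>\n<HTML><p>Hello <b>World</b> <img src="a.png" alt="logo"></p></HTML>\n</DOC>'
print(strip_tags_regex(sample))
# <DOC>
# Hello World logo
# </DOC>
```

This will still misbehave on input such as a > inside an attribute value, comments, or script blocks, which is the usual argument for a parser over regex.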
