Using Python regex, how do I remove all tags in HTML? The tags sometimes have styling, such as below:

<sup style="vertical-align:top;line-height:120%;font-size:7pt">(1)</sup>

I would like to remove everything between and including the sup tags in a larger string of HTML.

3 Comments

  • What would be your end result? Commented Jul 2, 2014 at 14:37
  • Obligatory reading for OPs trying to manipulate HTML with regex: stackoverflow.com/a/1732454/3001761 Commented Jul 2, 2014 at 14:38
  • I fixed my issue by converting the HTML to a string and using the following: re.sub(r'<sup+.*?sup>+','',string of html) Commented Jul 2, 2014 at 14:39
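
The re.sub pattern in the last comment works for the sample but is loose: `<sup+` also matches the start of any other tag name beginning with "sup", and `.` does not cross newlines by default. A slightly tighter variant might look like the sketch below. The sample string and the name SUP_RE are made up for illustration, and a regex like this still breaks on nested sup tags or a `>` inside an attribute value, which is why the linked answer recommends a parser.

```python
import re

# Hypothetical sample input; only the <sup> element comes from the question.
html = (
    '<p>Revenue'
    '<sup style="vertical-align:top;line-height:120%;font-size:7pt">(1)</sup>'
    ' rose.</p>'
)

# <sup\b     the tag name followed by a word boundary, so <support> is skipped
# [^>]*      any attributes, such as style="..."
# .*?        the shortest possible run of content (DOTALL lets it span lines)
# </sup\s*>  the closing tag
SUP_RE = re.compile(r'<sup\b[^>]*>.*?</sup\s*>', re.DOTALL | re.IGNORECASE)

cleaned = SUP_RE.sub('', html)
print(cleaned)
```

This removes both the tags and the text between them, which is what the question asks for.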

1 Answer

I would use an HTML Parser instead (why). For example, BeautifulSoup and unwrap() can handle your beautiful sup:

Tag.unwrap() is the opposite of wrap(). It replaces a tag with whatever’s inside that tag. It’s good for stripping out markup.

from bs4 import BeautifulSoup

data = """
<div>
    <sup style="vertical-align:top;line-height:120%;font-size:7pt">(1)</sup>
</div>
"""

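# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")

# unwrap() removes each <sup> tag but keeps the text inside it
for sup in soup.find_all('sup'):
    sup.unwrap()

print(soup.prettify())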

Prints:

<div>
(1)
</div>

2 Comments

Thanks, this is much more effective. I appreciate it.
Is there a way of removing the tags along with the content inside them? The current solution only removes the tag.
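
One possible way to do that, as a sketch rather than a definitive answer: BeautifulSoup also has Tag.decompose(), which removes a tag from the tree together with everything inside it. The word "Total" in the sample markup is made up for illustration so that some text is left after the removal.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; "Total" is a hypothetical addition.
data = """
<div>
    Total<sup style="vertical-align:top;line-height:120%;font-size:7pt">(1)</sup>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# decompose() drops each <sup> tag and its contents from the tree
for sup in soup.find_all("sup"):
    sup.decompose()

result = str(soup)
print(result)
```

If the removed element is needed afterwards, Tag.extract() takes it out of the tree in the same way but returns it instead of destroying it.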
